Data Mixture Inference: What do BPE Tokenizers Reveal about their Training Data?
Thread (17 posts)
Problem: too many constraints for practicality
Idea: use a subset of constraints to find a candidate solution, then check whether it satisfies the global set of constraints and add any violated ones to the subset
Discussion (§ 3.3) of how to use a priority queue to index constraints in a way that tracks their structure; it's not obvious on a first read what that structure is, and the discussion reads as very bookkeeping-oriented
Thinking Summary
The user has shared a paper about "Data Mixture Inference: What do BPE Tokenizers Reveal about their Training Data?" and appears to be looking at a specific section (§ 3.3) about efficient constraint violation detection. The key points from this section:

1. They're dealing with a linear programming problem that has too many constraints to be practical
2. Their solution is to use a subset of constraints, find a candidate solution, then check if it satisfies the global set of constraints and add any violated ones
3. They use a priority queue to efficiently detect violated constraints
4. The structure involves tracking how pair counts change as merges are applied

Let me provide a clear explanation of what's happening in section 3.3.
Looking at §3.3, the authors are addressing a computational challenge in their BPE tokenizer attack. Here's what's happening:
The Problem
The linear program LP1_T has potentially billions of constraints (O(V³) where V is vocabulary size). Checking all constraints naively would be computationally infeasible.
The Solution Strategy
They use lazy constraint generation - start with a subset of constraints and iteratively add violated ones:
- Solve a smaller LP with only some constraints
- Check if the solution violates any constraints from the full LP
- Add violated constraints and repeat
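
A minimal sketch of this solve-check-add loop (my own illustration, not the paper's code; `solve_lp` and `find_violated` are hypothetical helpers standing in for the LP solver and for the violation check that §3.3 makes efficient):

```python
def lazy_constraint_generation(initial_constraints, solve_lp, find_violated):
    """Sketch of lazy constraint generation: solve a small LP, check the
    candidate against the full constraint set, add violations, repeat."""
    active = list(initial_constraints)
    while True:
        candidate = solve_lp(active)          # solve the reduced LP
        violated = find_violated(candidate)   # scan the full LP for violations
        if not violated:
            return candidate                  # feasible for the global LP too
        active.extend(violated)               # grow the working set and retry
```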
The Priority Queue Structure
The key insight is tracking how pair frequencies change efficiently:
- Define $s_p^{(t)} = \sum_i \alpha_i \, c_{i,p}^{(t)}$, the weighted sum of pair $p$'s counts across data categories at time $t$
- Need to check, for each time step $t$ and pair $p$, whether $s_p^{(t)} - v_p \le s_{m^{(t)}}^{(t)} + v_{m^{(t)}}$, i.e. whether the applied merge still "wins" once the slack variables are accounted for

The priority queue maintains pairs ordered by $s_p^{(t)} - v_p$:
- Initial setup: Build a heap over all pairs with priority $s_p^{(t)} - v_p$
- Check violations: Pop pairs in descending order of priority until finding one that satisfies the constraint; every pair popped before it is a violated constraint
- Update efficiently: When moving from time $t$ to $t+1$, only update the pairs affected by the merge $m^{(t)}$
The "structure" being exploited is that most pair counts don't change between timesteps - only O(V) pairs are affected by each merge (those overlapping with the merged pair). This allows the priority queue to be updated in O((k_new + k_old) log V) time, where k_new and k_old are (roughly) the number of newly created pairs and the number of existing pairs whose counts change, rather than rebuilding it from scratch.
This approach reduces the constraint checking from O(V²) per timestep to O(k log V) where k is the number of violated constraints - a massive speedup when most constraints are satisfied.
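
To make the pop-until-satisfied step concrete, here is a small sketch (my own illustration, not the paper's implementation; it assumes, as above, per-pair slacks and that `s` and `v` are dicts mapping pairs to their weighted counts and slack values):

```python
import heapq

def violated_pairs(s, v, merge, pairs):
    """Return the pairs whose constraint at this timestep is violated,
    i.e. whose priority s_p - v_p exceeds that of the applied merge."""
    threshold = s[merge] + v[merge]
    # heapq is a min-heap, so negate priorities to pop in descending order.
    heap = [(-(s[p] - v[p]), p) for p in pairs if p != merge]
    heapq.heapify(heap)

    violated = []
    while heap:
        neg_priority, p = heapq.heappop(heap)
        if -neg_priority <= threshold:
            break                      # all remaining pairs satisfy the constraint
        violated.append(p)             # this pair's constraint is violated
    return violated
```

In the actual algorithm the heap persists across timesteps and is only updated for the pairs a merge affects, rather than being rebuilt at every $t$ as in this sketch.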
Can you elaborate on how the pair counts that don't change during a merge correspond to constraints in the priority queue?
Thinking Summary
The user is asking about the relationship between pair counts that don't change during a merge and how this relates to constraints in the priority queue. Let me think through this carefully.

When a merge happens at time t, say merge m^(t) = (A, B) → AB:
- Most pairs don't contain A or B, so their counts remain the same
- Only pairs that contain A or B will have their counts change

In the priority queue:
- Each pair p has priority s_p^(t) - v_p, where s_p^(t) = Σ α_i * c_{i,p}^(t) (weighted sum of counts)
- If c_{i,p}^(t) = c_{i,p}^(t+1) for all i, then s_p^(t) = s_p^(t+1)
- So the priority doesn't change!

This means:
1. Pairs unaffected by the merge keep the same priority in the queue
2. Only priorities for pairs that overlap with the merge need to be updated
3. This is why they only need O(V) updates instead of O(V²)

Each pair p corresponds to a constraint in the LP that says the chosen merge m^(t) must have been more frequent than p at time t.
Great question! Let me clarify the connection between unchanged pair counts and the priority queue structure.
How Merges Affect Pair Counts
When merge $m^{(t)} = (A, B) \to AB$ is applied at time $t$:
Pairs that change:
- $(A, B)$ itself disappears (count → 0)
- Pairs like $(X, A)$ where $X$ precedes $A$
- Pairs like $(B, Y)$ where $Y$ follows $B$
- New pairs like $(X, AB)$ and $(AB, Y)$ are created
Pairs that DON'T change: Any pair not containing $A$ or $B$, like:
- $(C, D)$ where neither $C$ nor $D$ is $A$ or $B$
- any other pair anywhere else in the text that avoids $A$ and $B$
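
A toy example (my own, not from the paper) makes this concrete: apply the merge $(A, B) \to AB$ to a short token sequence and compare pair counts before and after.

```python
from collections import Counter

def pair_counts(tokens):
    return Counter(zip(tokens, tokens[1:]))

before = ["X", "A", "B", "Y", "C", "D", "A", "B"]
after  = ["X", "AB", "Y", "C", "D", "AB"]   # same text after merging (A, B) -> AB

cb, ca = pair_counts(before), pair_counts(after)
changed = sorted(p for p in cb.keys() | ca.keys() if cb[p] != ca[p])
print(changed)
# [('A','B'), ('AB','Y'), ('B','Y'), ('D','A'), ('D','AB'), ('X','A'), ('X','AB')]
# ('Y','C') and ('C','D') keep their counts: they never touch A or B.
```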
Connection to Priority Queue
Each pair $p$ in the queue has priority $s_p^{(t)} - v_p$, where $s_p^{(t)} = \sum_i \alpha_i \, c_{i,p}^{(t)}$.
Key insight: If $c_{i,p}^{(t+1)} = c_{i,p}^{(t)}$ for all categories $i$, then:
- $s_p^{(t+1)} = s_p^{(t)}$ (the weighted sum doesn't change)
- The priority $s_p^{(t+1)} - v_p = s_p^{(t)} - v_p$ remains identical
- No need to update this entry in the priority queue!
The Efficiency Gain
For vocabulary size $V$:
- Total possible pairs: $O(V^2)$
- Pairs affected by one merge: $O(V)$ (those overlapping with the merge)
- Pairs unaffected: the remaining $O(V^2)$, i.e. nearly all of them
So instead of rebuilding the entire queue with $O(V^2)$ operations, they only:
- Update the $O(V)$ affected existing entries (decrease-priority operations)
- Insert new pairs created by the merge
- Leave the vast majority untouched
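
Here is a sketch of that incremental update (an illustration under the per-pair-slack assumption used above; Python's heapq has no decrease-key, so this version uses lazy invalidation, which differs from the decrease-priority operation described in the text but has the same effect):

```python
import heapq

class PairQueue:
    """Max-heap over pairs keyed by s_p - v_p, with lazy stale-entry removal."""

    def __init__(self, priorities):             # priorities: {pair: s_p - v_p}
        self.current = dict(priorities)
        self.heap = [(-prio, p) for p, prio in priorities.items()]
        heapq.heapify(self.heap)

    def update(self, pair, priority):
        # Called only for the O(V) pairs touched by the merge (changed or new);
        # every other heap entry is left exactly as it was.
        self.current[pair] = priority
        heapq.heappush(self.heap, (-priority, pair))

    def peek_max(self):
        # Skip entries whose priority has since been updated (stale copies).
        while self.heap and self.current.get(self.heap[0][1]) != -self.heap[0][0]:
            heapq.heappop(self.heap)
        if not self.heap:
            return None
        neg_prio, pair = self.heap[0]
        return pair, -neg_prio
```

Lazy invalidation keeps each update to $O(\log V)$ amortized: stale copies of an updated pair simply sit in the heap until they surface and are discarded.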
Each pair $p$ represents a constraint in the LP: "the chosen merge $m^{(t)}$ had higher (weighted) frequency than pair $p$ at time $t$". The priority queue efficiently finds which of these constraints are violated without checking all of them.