Data Mixture Inference: What do BPE Tokenizers Reveal about their Training Data?
Thread (17 posts)
In §3.1, the subscript i is over the subclasses of data whose mix we are trying to estimate, correct?
Thinking Summary
The human is asking about the notation in section 3.1 of the paper about data mixture inference. Let me read through that section carefully to understand what the subscript i refers to. Looking at section 3.1 "Data mixture inference via linear programming": The paper sets up the problem as follows: - They have a set of n data categories of interest - They have data distributions {𝒟i}i=1^n for each category - The BPE tokenizer was trained on a mixture ∑i=1^n αi* 𝒟i with weights αi* ∈ ℝn satisfying ∑i=1^n αi* = 1 - They have sample corpora {Di}i=1^n sampled from each of the 𝒟i respectively - The goal is to estimate α̂ of α* So yes, the subscript i indexes over the n different data categories/classes whose mixture proportions they are trying to estimate. For example, if they're trying to estimate the mix of natural languages, then i would index over different languages (English, French, German, etc.). If they're trying to estimate the mix of domains, then i would index over different domains (web, books, code, academic, etc.). The ci,p^(t) notation later refers to "how many times the token pair p occurred in the partially merged text" for category i at time step t.
Yes, exactly! In section 3.1, the subscript $i$ indexes over the data categories whose mixture proportions they are trying to estimate.
Specifically:
- $n$ is the number of categories
- $\mathcal{D}_i$ represents the data distribution for category $i$
- $D_i$ represents the sample corpus from category $i$
- $\alpha_i^*$ represents the true (unknown) proportion of category $i$ in the tokenizer's training data
- $c_{i,p}^{(t)}$ represents the count of token pair $p$ in category $i$'s corpus after applying the first $t-1$ merge rules
For example:
- If estimating language mixtures: $i$ might index over {English, French, German, Spanish, ...}
- If estimating domain mixtures: $i$ might index over {Web, Books, Code, Academic, Wikipedia}
- If estimating programming language mixtures: $i$ might index over {Python, Java, JavaScript, C++, ...}
The goal is to solve for the mixture weights $\hat{\alpha}$ that best explain the observed BPE merge sequence, where $\sum_{i=1}^n \hat{\alpha}_i = 1$.
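To make the setup concrete, here is a minimal sketch of what "training data drawn from a mixture $\sum_i \alpha_i^* \mathcal{D}_i$" means. The category names, weights, and corpora are hypothetical placeholders, not from the paper:

```python
import random

# Hypothetical categories and true mixture weights alpha_i^* (must sum to 1).
weights = {"english": 0.6, "french": 0.3, "code": 0.1}

# Stand-in sample corpora D_i, one per category.
corpora = {
    "english": ["the cat sat on the mat", "hello world"],
    "french":  ["le chat est sur le tapis", "bonjour le monde"],
    "code":    ["def f(x): return x + 1"],
}

def sample_mixture(n_docs):
    """Draw documents from the mixture sum_i alpha_i * D_i,
    i.e. the kind of corpus the BPE tokenizer was trained on."""
    names = list(weights)
    docs = []
    for _ in range(n_docs):
        i = random.choices(names, weights=[weights[c] for c in names], k=1)[0]
        docs.append(random.choice(corpora[i]))
    return docs

training_sample = sample_mixture(1000)
```

The attack works in the opposite direction: given only the tokenizer's merge list and samples from each $\mathcal{D}_i$, recover the weights.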
In §3.1, the subscript i is over the subclasses of data whose mix we are trying to estimate, correct?
Thinking Summary
Yes, the user is asking about section 3.1 of the paper "What do BPE Tokenizers Reveal about their Training Data?" and specifically about the subscript i in the notation. Looking at section 3.1, I can see that they define: - n data categories of interest - data distributions {𝒟i}i=1^n for each category - The tokenizer was trained on a mixture ∑i=1^n αi* 𝒟i - Each αi* represents the proportion of category i in the training data So yes, the subscript i indexes over the different data categories (like different languages, programming languages, or domains) whose proportions in the mixture they are trying to estimate. The paper mentions examples like: - Natural languages (English, French, Spanish, etc.) - Programming languages (Python, Java, etc.) - Data sources/domains (Wikipedia, web, books, code, academic) So i ranges from 1 to n, where n is the number of categories being considered.
Yes, exactly! In §3.1, the subscript $i$ indexes over the data categories whose mixture proportions the authors are trying to estimate.
Specifically:
- $\mathcal{D}_i$ represents the data distribution for category $i$
- $\alpha_i^*$ is the true (unknown) proportion of category $i$ in the tokenizer's training data
- The constraint $\sum_{i=1}^n \alpha_i^* = 1$ ensures these proportions form a valid probability distribution
The categories could be:
- Natural languages: English, French, Spanish, Russian, etc.
- Programming languages: Python, Java, JavaScript, etc.
- Data domains: Wikipedia, web crawl, books, academic papers, code
So if you're trying to infer the language mixture of a tokenizer trained on 10 languages, then $i$ would index those 10 languages. The goal is to estimate each $\alpha_i^*$: what fraction of the training data came from each language.
Can you unpack each part of the linear constraint system in §3.1?
Thinking Summary
The user is asking me to explain the linear constraint system from §3.1 of the paper. Let me break down each component of the constraint: $$\sum_{i=1}^n\alpha_i c_{i,m^{(t)}}^{(t)} \geq \sum_{i=1}^n\alpha_i c_{i,p}^{(t)} \quad \text{for all } p \neq m^{(t)}$$ And the relaxed version: $$v^{(t)} + v_p + \sum_{i=1}^n\alpha_i c_{i,m^{(t)}}^{(t)} \geq \sum_{i=1}^n\alpha_i c_{i,p}^{(t)} \quad \text{for all } p \neq m^{(t)}$$ I need to explain what each symbol means and why this constraint makes sense.
Let me break down each component of the linear constraint system from §3.1:
The Core Constraint

$$\sum_{i=1}^n \alpha_i c_{i,m^{(t)}}^{(t)} \geq \sum_{i=1}^n \alpha_i c_{i,p}^{(t)} \quad \text{for all } p \neq m^{(t)}$$
Components:
- $\alpha_i$: The proportion of category $i$ in the tokenizer's training data (what we're solving for)
- $c_{i,p}^{(t)}$: The count of token pair $p$ in category $i$'s data after applying the first $t-1$ merges
- $m^{(t)}$: The actual merge that was chosen at step $t$ (known from the tokenizer)
- $p$: Any other possible token pair that could have been merged
What this constraint means:
At training time step $t$, the BPE algorithm chose to merge pair $m^{(t)}$ because it had the highest frequency across the entire training dataset. This constraint encodes that fact: the weighted sum of counts for the chosen merge $m^{(t)}$ must be at least as large as the weighted sum for any other pair $p$.
The Relaxed Constraint (for feasibility)

$$v^{(t)} + v_p + \sum_{i=1}^n \alpha_i c_{i,m^{(t)}}^{(t)} \geq \sum_{i=1}^n \alpha_i c_{i,p}^{(t)} \quad \text{for all } p \neq m^{(t)}$$
Additional components:
- $v^{(t)}$: Slack variable for time step $t$ - allows the chosen merge $m^{(t)}$ to have been "over-represented" in the original training data compared to our samples
- $v_p$: Slack variable for pair $p$ - allows pair $p$ to be "over-represented" in our current mixture estimate compared to the original training data
Why relaxation is needed:
Since we only have samples from each category distribution $\mathcal{D}_i$ (not the exact data the tokenizer was trained on), the strict constraints might be infeasible due to sampling noise. The slack variables allow violations while the objective function minimizes them:

$$\min_{\alpha, v} \; \sum_{t} v^{(t)} + \sum_{p} v_p$$

This finds the mixture proportions $\hat{\alpha}$ that require the least amount of "explanation" via sampling noise.
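As a toy illustration of the relaxed constraint system (not the paper's implementation, which uses a commercial solver at much larger scale), here is a sketch that builds the constraints for a single time step with made-up counts and solves for $\alpha$ and the slacks using `scipy.optimize.linprog`:

```python
import numpy as np
from scipy.optimize import linprog

# Toy pair counts c_{i,p}^{(t)} at one time step, for 2 categories and 3 pairs.
# Pair index 0 is the merge m^{(t)} actually chosen by the tokenizer.
counts = np.array([
    [9.0, 7.0, 2.0],   # category 1
    [4.0, 6.0, 5.0],   # category 2
])
n_cat, n_pairs = counts.shape
chosen = 0

# Decision variables: [alpha_1, alpha_2, v_t, v_p for each non-chosen pair].
other = [p for p in range(n_pairs) if p != chosen]
n_var = n_cat + 1 + len(other)

# Objective: minimize total slack (the alphas carry zero cost).
c_obj = np.zeros(n_var)
c_obj[n_cat:] = 1.0

# One inequality per non-chosen pair p:
#   sum_i alpha_i (c_{i,p} - c_{i,m}) - v_t - v_p <= 0
A_ub, b_ub = [], []
for j, p in enumerate(other):
    row = np.zeros(n_var)
    row[:n_cat] = counts[:, p] - counts[:, chosen]
    row[n_cat] = -1.0            # coefficient of v_t
    row[n_cat + 1 + j] = -1.0    # coefficient of v_p
    A_ub.append(row)
    b_ub.append(0.0)

# The mixture weights must sum to 1; all variables are non-negative.
A_eq = [np.concatenate([np.ones(n_cat), np.zeros(n_var - n_cat)])]
b_eq = [1.0]

res = linprog(c_obj, A_ub=np.array(A_ub), b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
              bounds=[(0, None)] * n_var, method="highs")
print("estimated mixture:", res.x[:n_cat], "total slack:", res.fun)
```

With only a single time step the mixture is usually under-determined; the actual LP stacks these constraints over all observed merges $t = 1, \ldots, T$, which is what pins down $\hat{\alpha}$.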
How do linear programming solvers handle the relaxation variables?
Thinking Summary
The user is asking about how linear programming solvers handle the relaxation variables (the v variables) in the paper's formulation. This is about the computational/algorithmic aspects of solving the LP. Let me think about what aspects to cover: 1. How LP solvers treat these variables computationally 2. The role of the objective function in driving them to be small 3. The specific solver they use (Gurobi) 4. The lazy constraint generation approach they use The key points: - LP solvers treat v variables just like any other decision variables - The objective function minimizes their sum, so the solver naturally tries to keep them small - They use Gurobi, a commercial LP solver - They use lazy constraint generation to handle the massive number of constraints
Linear programming solvers handle the relaxation variables $v^{(t)}$ and $v_p$ just like any other decision variables, but their treatment is shaped by several key aspects:
1. Objective Function Drives Minimization
The LP objective $\min \sum_t v^{(t)} + \sum_p v_p$ ensures the solver actively tries to keep these variables small. Since all slack variables are:
- Non-negative (constrained to $\geq 0$)
- Present in the objective with positive coefficients
- Not appearing elsewhere with negative coefficients
The solver will naturally set them to the smallest values that maintain feasibility.
2. Computational Treatment
Modern LP solvers like Gurobi (used in this paper) handle these through:
Interior point methods: Start with large slack values and iteratively reduce them while maintaining feasibility
Simplex algorithm: Keeps most variables at zero (non-basic) and only increases them when necessary to satisfy constraints
3. Lazy Variable/Constraint Generation
The paper uses a sophisticated approach to handle the massive scale:
Initially: Start with v^(t) = v_p = 0 for all t,p
Repeat:
1. Check which constraints are violated
2. Add only violated constraints and their v_p variables
3. Re-solve the reduced LP
Until: No violations remain
This means:
- Most variables never enter the problem (stay implicitly at 0)
- Only "problematic" pairs that would violate constraints get explicit slack variables
- Dramatically reduces problem size from $O(V^2)$ potential slack variables to a much smaller active set (see the Python sketch below)
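A minimal sketch of that loop follows. The `solve_lp` and `find_violations` helpers are placeholders standing in for the LP solve and the constraint scan, not functions from the paper's code:

```python
def lazy_solve(initial_constraints, solve_lp, find_violations, max_rounds=100):
    """Schematic lazy-constraint loop: solve a small LP, look for constraints
    of the full problem that the candidate solution violates, add them, repeat."""
    active = list(initial_constraints)
    for _ in range(max_rounds):
        solution = solve_lp(active)            # LP over the active subset only
        violated = find_violations(solution)   # scan the full constraint set
        if not violated:
            return solution                    # feasible for the full LP
        active.extend(violated)                # grow the active set and re-solve
    raise RuntimeError("did not converge within max_rounds")
```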
4. Interpretation of Non-Zero Slacks
When the solver returns non-zero values:
- Large $v^{(t)}$: Merge $m^{(t)}$ was likely over-represented in the tokenizer's training data compared to the samples
- Large $v_p$: Pair $p$ appears more frequently in the sampled data than it did during tokenizer training
The solver balances these to find the mixture weights $\alpha$ that minimize total "unexplained" frequency differences.
5. Numerical Precision
LP solvers typically use floating-point arithmetic with small feasibility tolerances (commonly on the order of $10^{-6}$). Very small slack values effectively indicate satisfied constraints, while larger values point to genuine distributional differences between training and sample data.
What exactly is meant by "apply the tokenizer" in §3.2? Doesn't applying the tokenizer transform a string into tokens? Where do the pair counts come from? Do we even have access to the tokenizer in the problem setup?
Thinking Summary
The user is asking about section 3.2 of the paper, which discusses "Efficient storage of pair counts." They're confused about what "apply the tokenizer" means in this context, and whether we have access to the tokenizer. Let me read the relevant section carefully: From section 3.2: "First, as a preprocessing step, we apply the target tokenizer to each language corpus Di, recording the pair counts ci,p(t) after each merge is applied for later use." The confusion seems to be about: 1. What does "apply the tokenizer" mean here? 2. How do we get pair counts from tokenization? 3. Do we have access to the tokenizer? Looking at the problem setup in section 3, they say: "Then suppose we receive a BPE tokenizer, which was trained on a large sample of text from the mixture..." "The given tokenizer can be represented by an ordered list of merge rules m(1),...,m(M)." So yes, we DO have access to the tokenizer - specifically its merge list. The key insight is that in section 3.2, they're not just applying the tokenizer in the normal sense (string → tokens). Instead, they're simulating the BPE training process on their sample data Di, applying merges one by one and tracking pair counts at each step. The notation ci,p(t) represents: - How many times the token pair p occurred in corpus Di - After applying the first t-1 merges So "apply the tokenizer" here means: take the merge rules from the tokenizer and apply them sequentially to the sample data, tracking pair counts at each step. This simulates what would have happened during the original BPE training process.
Great question! This is indeed a bit confusing in the paper. Let me clarify what's happening in §3.2.
Yes, you DO have access to the tokenizer
From the problem setup in §3:
"Then suppose we receive a BPE tokenizer, which was trained on a large sample of text from the mixture..."
Specifically, you have access to the tokenizer's ordered merge list $m^{(1)}, \ldots, m^{(M)}$.
What "apply the tokenizer" means here
In §3.2, "apply the tokenizer" doesn't mean the normal tokenization process (string → tokens). Instead, it means:
- Take your sample corpus $D_i$ and start with it split into bytes
- Apply the merge rules sequentially to this data
- After each merge, count how many times each possible token pair appears
The notation $c_{i,p}^{(t)}$ represents:
- The count of token pair $p$ in corpus $D_i$
- After applying the first $t-1$ merges
Example
Say you have the text "the cat" and the first merge is `(t, h) → th`:
- Before any merges: pairs are `(t,h)`, `(h,e)`, `(e,_)`, `(_,c)`, `(c,a)`, `(a,t)`
- After applying merge 1: text becomes `th e _ c a t`
- New pairs are: `(th,e)`, `(e,_)`, `(_,c)`, `(c,a)`, `(a,t)`
The key insight is that you're simulating the BPE training process on your sample data using the tokenizer's merge rules, tracking pair counts at each step. This lets you see which pairs would have been most frequent at each stage, which forms the constraints for the linear program.
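Here is a minimal sketch of that preprocessing step for a single corpus. It is simplified and unoptimized, and the merge list is made up; the paper's implementation does this at scale and per category:

```python
from collections import Counter

def pair_counts(tokens):
    """Count adjacent token pairs in a token sequence."""
    return Counter(zip(tokens, tokens[1:]))

def apply_merge(tokens, pair):
    """Replace every occurrence of `pair` with the merged token."""
    merged, out, i = pair[0] + pair[1], [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(merged)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

def counts_over_time(text, merge_rules):
    """Yield the pair counts c_p^{(t)} seen just before merge t is applied."""
    tokens = list(text)                  # start from characters/bytes
    for merge in merge_rules:
        yield pair_counts(tokens)        # counts after the first t-1 merges
        tokens = apply_merge(tokens, merge)

# Tiny demo with a made-up merge list:
for t, c in enumerate(counts_over_time("the cat", [("t", "h"), ("th", "e")]), 1):
    print(t, c.most_common(3))
```

Running this per category gives the $c_{i,p}^{(t)}$ values that feed the linear program.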
But the paper describes analysis of Claude models, which have a proprietary tokenizer
Thinking Summary
The user is pointing out an apparent contradiction - the paper analyzes Claude's tokenizer but says Claude has a proprietary tokenizer. Let me check the paper more carefully. Looking at section 5.6: "Very little is known about models from the Claude family, but a remark in the Anthropic SDK suggests that Claude 1 [8] and 2 [9] share the same tokenizer, which is open-source, while Claude 3 [10] uses a different (closed) tokenizer." So actually, Claude 1 and 2 use an open-source tokenizer, while only Claude 3 uses a closed tokenizer. The paper analyzes the open-source tokenizer from Claude 1/2, not Claude 3's closed tokenizer. This is an important distinction that resolves the apparent contradiction.
You're right to catch this! The paper clarifies this important distinction in §5.6:
"Very little is known about models from the Claude family, but a remark in the Anthropic SDK suggests that Claude 1 [8] and 2 [9] share the same tokenizer, which is open-source, while Claude 3 [10] uses a different (closed) tokenizer."
So the paper analyzes:
- Claude 1/2's tokenizer: Open-source and available for analysis
- Claude 3's tokenizer: Closed/proprietary and NOT analyzed
The results shown for "Claude" in the paper are specifically for the Claude 1/2 tokenizer, not Claude 3. The authors explicitly note that Claude 3's tokenizer is closed and therefore not included in their analysis.
This is actually quite interesting because the paper infers that the lack of multilingual training data in Claude 1/2 (only 3.7% non-English) "likely explains why a new tokenizer was trained for Claude 3, which boasts 'increased capabilities... in non-English languages'."
So the paper only analyzes tokenizers where they have access to the merge rules - even for "closed" models like GPT-4, the tokenizers themselves are often open (via tiktoken in OpenAI's case) specifically so users can estimate API costs.
Problem: too many constraints for practicality
Idea: use a subset of constraints to find a candidate solution, then check whether it satisfies the global set of constraints and add any violated ones to the subset
Discussion of how to use a priority queue to index constraints in a way that tracks their structure, but it is not obvious on a first read what that structure is; the discussion seems very accounting-oriented
(§3.3)
Thinking Summary
The user has shared a paper about "Data Mixture Inference: What do BPE Tokenizers Reveal about their Training Data?" and appears to be looking at a specific section (§ 3.3) about efficient constraint violation detection. The key points from this section: 1. They're dealing with a linear programming problem that has too many constraints to be practical 2. Their solution is to use a subset of constraints, find a candidate solution, then check if it satisfies the global set of constraints and add any violated ones 3. They use a priority queue to efficiently detect violated constraints 4. The structure involves tracking how pair counts change as merges are applied Let me provide a clear explanation of what's happening in section 3.3.
Looking at §3.3, the authors are addressing a computational challenge in their BPE tokenizer attack. Here's what's happening:
The Problem
The linear program LP1_T has potentially billions of constraints (O(V³) where V is vocabulary size). Checking all constraints naively would be computationally infeasible.
The Solution Strategy
They use lazy constraint generation - start with a subset of constraints and iteratively add violated ones:
- Solve a smaller LP with only some constraints
- Check if the solution violates any constraints from the full LP
- Add violated constraints and repeat
The Priority Queue Structure
The key insight is tracking how pair frequencies change efficiently:
- Define $s_p^{(t)} = \sum_{i=1}^n \alpha_i c_{i,p}^{(t)}$ (the weighted sum of pair counts)
- Need to check whether $v^{(t)} + v_p + s_{m^{(t)}}^{(t)} \geq s_p^{(t)}$ holds for every pair $p \neq m^{(t)}$
The priority queue maintains pairs ordered by $s_p^{(t)} - v_p$:
- Initial setup: Build a heap over all pairs, with priority $s_p^{(t)} - v_p$
- Check violations: Pop pairs in descending priority order until finding one that satisfies the constraint
- Update efficiently: When moving from time $t$ to $t+1$, only update the pairs affected by merge $m^{(t)}$
The "structure" being exploited is that most pair counts don't change between timesteps - only O(V) pairs are affected by each merge (those overlapping with the merged pair). This allows updating the priority queue in O((k_new + k_old) log V) time rather than rebuilding from scratch.
This approach reduces the constraint checking from O(V²) per timestep to O(k log V) where k is the number of violated constraints - a massive speedup when most constraints are satisfied.
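A rough sketch of the violation check for one time step, using Python's `heapq`. This is a simplification: the heap here is rebuilt per call, whereas the method described above maintains it across time steps; `s` and `v` are assumed to hold the weighted counts $s_p^{(t)}$ and current slack values:

```python
import heapq

def violated_pairs(s, v, chosen, v_t):
    """Return pairs p whose constraint  v_t + v_p + s[chosen] >= s[p]  is violated.

    s: dict pair -> weighted count s_p^{(t)} under the candidate alpha
    v: dict pair -> current slack v_p (0 if the pair has no explicit slack yet)
    """
    # heapq is a min-heap, so push negated priorities to pop the largest first.
    heap = [(-(s[p] - v.get(p, 0.0)), p) for p in s if p != chosen]
    heapq.heapify(heap)

    threshold = s[chosen] + v_t
    violations = []
    while heap:
        neg_priority, p = heapq.heappop(heap)
        if -neg_priority <= threshold:
            break          # all remaining pairs have lower priority, so they satisfy it
        violations.append(p)
    return violations
```

Because pairs come off the heap in descending order of $s_p^{(t)} - v_p$, the scan stops at the first satisfied constraint instead of checking all O(V²) pairs.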
Can you elaborate on how the pair counts that don't change during a merge correspond to constraints in the priority queue?
Thinking Summary
The user is asking about the relationship between pair counts that don't change during a merge and how this relates to constraints in the priority queue. Let me think through this carefully. When a merge happens at time t, say merge m^(t) = (A, B) → AB: - Most pairs don't contain A or B, so their counts remain the same - Only pairs that contain A or B will have their counts change In the priority queue: - Each pair p has priority s_p^(t) - v_p - s_p^(t) = Σ α_i * c_{i,p}^(t) (weighted sum of counts) - If c_{i,p}^(t) = c_{i,p}^(t+1) for all i, then s_p^(t) = s_p^(t+1) - So the priority doesn't change! This means: 1. Pairs unaffected by the merge keep the same priority in the queue 2. Only need to update priorities for pairs that overlap with the merge 3. This is why they only need O(V) updates instead of O(V²) Each pair p corresponds to a constraint in the LP that says the chosen merge m^(t) must have been more frequent than p at time t.
Great question! Let me clarify the connection between unchanged pair counts and the priority queue structure.
How Merges Affect Pair Counts
When merge $m^{(t)} = (A, B) \to AB$ happens at time $t$:
Pairs that change:
- $(A, B)$ itself disappears (count → 0)
- Pairs like $(X, A)$ where $X$ precedes a merged $A$
- Pairs like $(B, Y)$ where $Y$ follows a merged $B$
- New pairs $(X, AB)$ and $(AB, Y)$ are created
Pairs that DON'T change: any pair not containing $A$ or $B$, like:
- $(C, D)$ where neither $C$ nor $D$ is $A$ or $B$, anywhere else in the text
Connection to Priority Queue
Each pair $p$ in the queue has priority $s_p^{(t)} - v_p$, where $s_p^{(t)} = \sum_{i=1}^n \alpha_i c_{i,p}^{(t)}$.
Key insight: If $c_{i,p}^{(t+1)} = c_{i,p}^{(t)}$ for all categories $i$, then:
- $s_p^{(t+1)} = s_p^{(t)}$ (the weighted sum doesn't change)
- The priority remains identical
- No need to update this entry in the priority queue!
The Efficiency Gain
For vocabulary size $V$:
- Total possible pairs: $O(V^2)$
- Pairs affected by one merge: $O(V)$ (those overlapping with the merged pair)
- Pairs unaffected: the remaining $O(V^2)$, i.e. the vast majority
So instead of rebuilding the entire queue with $O(V^2)$ operations, they only:
- Update the $O(V)$ affected entries (decrease-priority operations)
- Insert new pairs created by the merge
- Leave the vast majority untouched
Each pair $p$ represents a constraint in the LP: "the chosen merge $m^{(t)}$ had higher frequency than pair $p$ at time $t$". The priority queue efficiently finds which of these constraints are violated without checking all of them.
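As a sketch of that bookkeeping (illustrative only, not the paper's code), the following computes exactly which pair counts change when a merge $(A, B) \to AB$ is applied to one token sequence; only these entries would need priority updates in the heap:

```python
from collections import Counter

def delta_pair_counts(tokens, pair):
    """Return (new_tokens, delta), where delta maps each affected pair to its
    change in count when `pair` = (A, B) is merged into AB."""
    A, B = pair
    merged = A + B
    out, delta, i = [], Counter(), 0
    while i < len(tokens):
        if i + 1 < len(tokens) and tokens[i] == A and tokens[i + 1] == B:
            delta[(A, B)] -= 1                      # the merged pair disappears
            if out:                                  # left neighbour X: (X, A) -> (X, AB)
                delta[(out[-1], A)] -= 1
                delta[(out[-1], merged)] += 1
            if i + 2 < len(tokens):                  # right neighbour Y: (B, Y) -> (AB, Y)
                delta[(B, tokens[i + 2])] -= 1
                delta[(merged, tokens[i + 2])] += 1
            out.append(merged)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out, delta

tokens, delta = delta_pair_counts(list("the cat"), ("t", "h"))
print(tokens)   # ['th', 'e', ' ', 'c', 'a', 't']
print(delta)    # only pairs touching the merge change; everything else keeps its count
```

Every pair absent from `delta` keeps its count, its weighted sum $s_p$, and therefore its priority, which is why the heap only needs $O(V)$ updates per merge.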