Data Mixture Inference: What do BPE Tokenizers Reveal about their Training Data?
Thread (17 posts)
In §3.1, the subscript i is over the subclasses of data whose mix we are trying to estimate, correct?
Thinking Summary
The human is asking about the notation in section 3.1 of the paper about data mixture inference. Let me read through that section carefully to understand what the subscript i refers to. Looking at section 3.1 "Data mixture inference via linear programming": The paper sets up the problem as follows: - They have a set of n data categories of interest - They have data distributions {𝒟i}i=1^n for each category - The BPE tokenizer was trained on a mixture ∑i=1^n αi* 𝒟i with weights αi* ∈ ℝn satisfying ∑i=1^n αi* = 1 - They have sample corpora {Di}i=1^n sampled from each of the 𝒟i respectively - The goal is to estimate α̂ of α* So yes, the subscript i indexes over the n different data categories/classes whose mixture proportions they are trying to estimate. For example, if they're trying to estimate the mix of natural languages, then i would index over different languages (English, French, German, etc.). If they're trying to estimate the mix of domains, then i would index over different domains (web, books, code, academic, etc.). The ci,p^(t) notation later refers to "how many times the token pair p occurred in the partially merged text" for category i at time step t.
Yes, exactly! In section 3.1, the subscript $i$ indexes over the data categories whose mixture proportions they are trying to estimate.
Specifically:
- $n$ is the number of categories
- $\mathcal{D}_i$ represents the data distribution for category $i$
- $D_i$ represents the sample corpus from category $i$
- $\alpha_i^*$ represents the true (unknown) proportion of category $i$ in the tokenizer's training data
- $c_{i,p}^{(t)}$ represents the count of token pair $p$ in category $i$'s corpus after applying the first $t-1$ merge rules
For example:
- If estimating language mixtures: $i$ might index over {English, French, German, Spanish, ...}
- If estimating domain mixtures: $i$ might index over {Web, Books, Code, Academic, Wikipedia}
- If estimating programming language mixtures: $i$ might index over {Python, Java, JavaScript, C++, ...}
The goal is to solve for the mixture weights $\hat{\alpha}$ that best explain the observed BPE merge sequence, where $\sum_{i=1}^n \hat{\alpha}_i = 1$.
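To make the setup concrete, here is a minimal sketch of what "training data drawn from a mixture $\sum_i \alpha_i^* \mathcal{D}_i$" means. The category names, weights, and corpora are hypothetical placeholders, not from the paper:

```python
import random

# Hypothetical categories and true mixture weights alpha_i^* (must sum to 1).
weights = {"english": 0.6, "french": 0.3, "code": 0.1}

# Stand-in sample corpora D_i, one per category.
corpora = {
    "english": ["the cat sat on the mat", "hello world"],
    "french":  ["le chat est sur le tapis", "bonjour le monde"],
    "code":    ["def f(x): return x + 1"],
}

def sample_mixture(n_docs):
    """Draw documents from the mixture sum_i alpha_i * D_i,
    i.e. the kind of corpus the BPE tokenizer was trained on."""
    names = list(weights)
    docs = []
    for _ in range(n_docs):
        i = random.choices(names, weights=[weights[c] for c in names], k=1)[0]
        docs.append(random.choice(corpora[i]))
    return docs

training_sample = sample_mixture(1000)
```

The attack works in the opposite direction: given only the tokenizer's merge list and samples from each $\mathcal{D}_i$, recover the weights.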
In §3.1, the subscript i is over the subclasses of data whose mix we are trying to estimate, correct?
Thinking Summary
Yes, the user is asking about section 3.1 of the paper "What do BPE Tokenizers Reveal about their Training Data?" and specifically about the subscript i in the notation. Looking at section 3.1, I can see that they define: - n data categories of interest - data distributions {𝒟i}i=1^n for each category - The tokenizer was trained on a mixture ∑i=1^n αi* 𝒟i - Each αi* represents the proportion of category i in the training data So yes, the subscript i indexes over the different data categories (like different languages, programming languages, or domains) whose proportions in the mixture they are trying to estimate. The paper mentions examples like: - Natural languages (English, French, Spanish, etc.) - Programming languages (Python, Java, etc.) - Data sources/domains (Wikipedia, web, books, code, academic) So i ranges from 1 to n, where n is the number of categories being considered.
Yes, exactly! In §3.1, the subscript $i$ indexes over the data categories whose mixture proportions the authors are trying to estimate.
Specifically:
- $\mathcal{D}_i$ represents the data distribution for category $i$
- $\alpha_i^*$ is the true (unknown) proportion of category $i$ in the tokenizer's training data
- The constraint $\sum_{i=1}^n \alpha_i^* = 1$ ensures these proportions form a valid probability distribution
The categories could be:
- Natural languages: English, French, Spanish, Russian, etc.
- Programming languages: Python, Java, JavaScript, etc.
- Data domains: Wikipedia, web crawl, books, academic papers, code
So if you're trying to infer the language mixture of a tokenizer trained on 10 languages, then $i$ would index those 10 languages. The goal is to estimate each $\alpha_i^*$: what fraction of the training data came from each language.
Can you unpack each part of the linear constraint system in §3.1?
Thinking Summary
The user is asking me to explain the linear constraint system from §3.1 of the paper. Let me break down each component of the constraint: $$\sum_{i=1}^n\alpha_i c_{i,m^{(t)}}^{(t)} \geq \sum_{i=1}^n\alpha_i c_{i,p}^{(t)} \quad \text{for all } p \neq m^{(t)}$$ And the relaxed version: $$v^{(t)} + v_p + \sum_{i=1}^n\alpha_i c_{i,m^{(t)}}^{(t)} \geq \sum_{i=1}^n\alpha_i c_{i,p}^{(t)} \quad \text{for all } p \neq m^{(t)}$$ I need to explain what each symbol means and why this constraint makes sense.
Let me break down each component of the linear constraint system from §3.1:
The Core Constraint

$$\sum_{i=1}^n \alpha_i c_{i,m^{(t)}}^{(t)} \geq \sum_{i=1}^n \alpha_i c_{i,p}^{(t)} \quad \text{for all } p \neq m^{(t)}$$
Components:
- $\alpha_i$: The proportion of category $i$ in the tokenizer's training data (what we're solving for)
- $c_{i,p}^{(t)}$: The count of token pair $p$ in category $i$'s data after applying the first $t-1$ merges
- $m^{(t)}$: The actual merge that was chosen at step $t$ (known from the tokenizer)
- $p$: Any other possible token pair that could have been merged
What this constraint means:
At training time step $t$, the BPE algorithm chose to merge pair $m^{(t)}$ because it had the highest frequency across the entire training dataset. This constraint encodes that fact: the weighted sum of counts for the chosen merge $m^{(t)}$ must be at least as large as the weighted sum for any other pair $p$.
The Relaxed Constraint (for feasibility)

$$v^{(t)} + v_p + \sum_{i=1}^n \alpha_i c_{i,m^{(t)}}^{(t)} \geq \sum_{i=1}^n \alpha_i c_{i,p}^{(t)} \quad \text{for all } p \neq m^{(t)}$$
Additional components:
- $v^{(t)}$: Slack variable for time step $t$ - allows the chosen merge $m^{(t)}$ to have been "over-represented" in the original training data compared to our samples
- $v_p$: Slack variable for pair $p$ - allows pair $p$ to be "over-represented" in our current mixture estimate compared to the original training data
Why relaxation is needed:
Since we only have samples from each category distribution $\mathcal{D}_i$ (not the exact data the tokenizer was trained on), the strict constraints might be infeasible due to sampling noise. The slack variables allow violations while the objective function minimizes them:

$$\min_{\alpha, v} \; \sum_{t} v^{(t)} + \sum_{p} v_p$$

This finds the mixture proportions $\hat{\alpha}$ that require the least amount of "explanation" via sampling noise.
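As a toy illustration of the relaxed constraint system (not the paper's implementation, which uses a commercial solver at much larger scale), here is a sketch that builds the constraints for a single time step with made-up counts and solves for $\alpha$ and the slacks using `scipy.optimize.linprog`:

```python
import numpy as np
from scipy.optimize import linprog

# Toy pair counts c_{i,p}^{(t)} at one time step, for 2 categories and 3 pairs.
# Pair index 0 is the merge m^{(t)} actually chosen by the tokenizer.
counts = np.array([
    [9.0, 7.0, 2.0],   # category 1
    [4.0, 6.0, 5.0],   # category 2
])
n_cat, n_pairs = counts.shape
chosen = 0

# Decision variables: [alpha_1, alpha_2, v_t, v_p for each non-chosen pair].
other = [p for p in range(n_pairs) if p != chosen]
n_var = n_cat + 1 + len(other)

# Objective: minimize total slack (the alphas carry zero cost).
c_obj = np.zeros(n_var)
c_obj[n_cat:] = 1.0

# One inequality per non-chosen pair p:
#   sum_i alpha_i (c_{i,p} - c_{i,m}) - v_t - v_p <= 0
A_ub, b_ub = [], []
for j, p in enumerate(other):
    row = np.zeros(n_var)
    row[:n_cat] = counts[:, p] - counts[:, chosen]
    row[n_cat] = -1.0            # coefficient of v_t
    row[n_cat + 1 + j] = -1.0    # coefficient of v_p
    A_ub.append(row)
    b_ub.append(0.0)

# The mixture weights must sum to 1; all variables are non-negative.
A_eq = [np.concatenate([np.ones(n_cat), np.zeros(n_var - n_cat)])]
b_eq = [1.0]

res = linprog(c_obj, A_ub=np.array(A_ub), b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
              bounds=[(0, None)] * n_var, method="highs")
print("estimated mixture:", res.x[:n_cat], "total slack:", res.fun)
```

With only a single time step the mixture is usually under-determined; the actual LP stacks these constraints over all observed merges $t = 1, \ldots, T$, which is what pins down $\hat{\alpha}$.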
How do linear programming solvers handle the relaxation variables?
Thinking Summary
The user is asking about how linear programming solvers handle the relaxation variables (the v variables) in the paper's formulation. This is about the computational/algorithmic aspects of solving the LP. Let me think about what aspects to cover: 1. How LP solvers treat these variables computationally 2. The role of the objective function in driving them to be small 3. The specific solver they use (Gurobi) 4. The lazy constraint generation approach they use The key points: - LP solvers treat v variables just like any other decision variables - The objective function minimizes their sum, so the solver naturally tries to keep them small - They use Gurobi, a commercial LP solver - They use lazy constraint generation to handle the massive number of constraints
Linear programming solvers handle the relaxation variables $v^{(t)}$ and $v_p$ just like any other decision variables, but their treatment is shaped by several key aspects:
1. Objective Function Drives Minimization
The LP objective $\min \sum_t v^{(t)} + \sum_p v_p$ ensures the solver actively tries to keep these variables small. Since all slack variables are:
- Non-negative (constrained to $\geq 0$)
- Present in the objective with positive coefficients
- Not appearing elsewhere with negative coefficients
The solver will naturally set them to the smallest values that maintain feasibility.
2. Computational Treatment
Modern LP solvers like Gurobi (used in this paper) handle these through:
Interior point methods: Start with large slack values and iteratively reduce them while maintaining feasibility
Simplex algorithm: Keeps most variables at zero (non-basic) and only increases them when necessary to satisfy constraints
3. Lazy Variable/Constraint Generation
The paper uses a sophisticated approach to handle the massive scale:
Initially: Start with v^(t) = v_p = 0 for all t,p
Repeat:
1. Check which constraints are violated
2. Add only violated constraints and their v_p variables
3. Re-solve the reduced LP
Until: No violations remain
This means:
- Most variables never enter the problem (stay implicitly at 0)
- Only "problematic" pairs that would violate constraints get explicit slack variables
- Dramatically reduces problem size from $O(V^2)$ potential slack variables to a much smaller active set (see the Python sketch below)
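A minimal sketch of that loop follows. The `solve_lp` and `find_violations` helpers are placeholders standing in for the LP solve and the constraint scan, not functions from the paper's code:

```python
def lazy_solve(initial_constraints, solve_lp, find_violations, max_rounds=100):
    """Schematic lazy-constraint loop: solve a small LP, look for constraints
    of the full problem that the candidate solution violates, add them, repeat."""
    active = list(initial_constraints)
    for _ in range(max_rounds):
        solution = solve_lp(active)            # LP over the active subset only
        violated = find_violations(solution)   # scan the full constraint set
        if not violated:
            return solution                    # feasible for the full LP
        active.extend(violated)                # grow the active set and re-solve
    raise RuntimeError("did not converge within max_rounds")
```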
4. Interpretation of Non-Zero Slacks
When the solver returns non-zero values:
- Large $v^{(t)}$: Merge $m^{(t)}$ was likely over-represented in the tokenizer's training data compared to the samples
- Large $v_p$: Pair $p$ appears more frequently in the sampled data than it did during tokenizer training
The solver balances these to find the mixture weights $\alpha$ that minimize total "unexplained" frequency differences.
5. Numerical Precision
LP solvers typically use floating-point arithmetic with small feasibility tolerances (commonly on the order of $10^{-6}$). Very small slack values effectively indicate satisfied constraints, while larger values point to genuine distributional differences between training and sample data.
What exactly is meant by "apply the tokenizer" in §3.2? Doesn't applying the tokenizer transform a string into tokens? Where do the pair counts come from? Do we even have access to the tokenizer in the problem setup?
Thinking Summary
The user is asking about section 3.2 of the paper, which discusses "Efficient storage of pair counts." They're confused about what "apply the tokenizer" means in this context, and whether we have access to the tokenizer. Let me read the relevant section carefully: From section 3.2: "First, as a preprocessing step, we apply the target tokenizer to each language corpus Di, recording the pair counts ci,p(t) after each merge is applied for later use." The confusion seems to be about: 1. What does "apply the tokenizer" mean here? 2. How do we get pair counts from tokenization? 3. Do we have access to the tokenizer? Looking at the problem setup in section 3, they say: "Then suppose we receive a BPE tokenizer, which was trained on a large sample of text from the mixture..." "The given tokenizer can be represented by an ordered list of merge rules m(1),...,m(M)." So yes, we DO have access to the tokenizer - specifically its merge list. The key insight is that in section 3.2, they're not just applying the tokenizer in the normal sense (string → tokens). Instead, they're simulating the BPE training process on their sample data Di, applying merges one by one and tracking pair counts at each step. The notation ci,p(t) represents: - How many times the token pair p occurred in corpus Di - After applying the first t-1 merges So "apply the tokenizer" here means: take the merge rules from the tokenizer and apply them sequentially to the sample data, tracking pair counts at each step. This simulates what would have happened during the original BPE training process.
Great question! This is indeed a bit confusing in the paper. Let me clarify what's happening in §3.2.
Yes, you DO have access to the tokenizer
From the problem setup in §3:
"Then suppose we receive a BPE tokenizer, which was trained on a large sample of text from the mixture..."
Specifically, you have access to the tokenizer's ordered merge list $m^{(1)}, \ldots, m^{(M)}$.
What "apply the tokenizer" means here
In §3.2, "apply the tokenizer" doesn't mean the normal tokenization process (string → tokens). Instead, it means:
- Take your sample corpus $D_i$ and start with it split into bytes
- Apply the merge rules sequentially to this data
- After each merge, count how many times each possible token pair appears
The notation $c_{i,p}^{(t)}$ represents:
- The count of token pair $p$ in corpus $D_i$
- After applying the first $t-1$ merges
Example
Say you have the text "the cat" and the first merge is `(t, h) → th`:
- Before any merges: pairs are `(t,h)`, `(h,e)`, `(e,_)`, `(_,c)`, `(c,a)`, `(a,t)`
- After applying merge 1: text becomes `th e _ c a t`
- New pairs are: `(th,e)`, `(e,_)`, `(_,c)`, `(c,a)`, `(a,t)`
The key insight is that you're simulating the BPE training process on your sample data using the tokenizer's merge rules, tracking pair counts at each step. This lets you see which pairs would have been most frequent at each stage, which forms the constraints for the linear program.
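Here is a minimal sketch of that preprocessing step for a single corpus. It is simplified and unoptimized, and the merge list is made up; the paper's implementation does this at scale and per category:

```python
from collections import Counter

def pair_counts(tokens):
    """Count adjacent token pairs in a token sequence."""
    return Counter(zip(tokens, tokens[1:]))

def apply_merge(tokens, pair):
    """Replace every occurrence of `pair` with the merged token."""
    merged, out, i = pair[0] + pair[1], [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(merged)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

def counts_over_time(text, merge_rules):
    """Yield the pair counts c_p^{(t)} seen just before merge t is applied."""
    tokens = list(text)                  # start from characters/bytes
    for merge in merge_rules:
        yield pair_counts(tokens)        # counts after the first t-1 merges
        tokens = apply_merge(tokens, merge)

# Tiny demo with a made-up merge list:
for t, c in enumerate(counts_over_time("the cat", [("t", "h"), ("th", "e")]), 1):
    print(t, c.most_common(3))
```

Running this per category gives the $c_{i,p}^{(t)}$ values that feed the linear program.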
But the paper describes analysis of Claude models, which have a proprietary tokenizer
Thinking Summary
The user is pointing out an apparent contradiction - the paper analyzes Claude's tokenizer but says Claude has a proprietary tokenizer. Let me check the paper more carefully. Looking at section 5.6: "Very little is known about models from the Claude family, but a remark in the Anthropic SDK suggests that Claude 1 [8] and 2 [9] share the same tokenizer, which is open-source, while Claude 3 [10] uses a different (closed) tokenizer." So actually, Claude 1 and 2 use an open-source tokenizer, while only Claude 3 uses a closed tokenizer. The paper analyzes the open-source tokenizer from Claude 1/2, not Claude 3's closed tokenizer. This is an important distinction that resolves the apparent contradiction.
You're right to catch this! The paper clarifies this important distinction in §5.6:
"Very little is known about models from the Claude family, but a remark in the Anthropic SDK suggests that Claude 1 [8] and 2 [9] share the same tokenizer, which is open-source, while Claude 3 [10] uses a different (closed) tokenizer."
So the paper analyzes:
- Claude 1/2's tokenizer: Open-source and available for analysis
- Claude 3's tokenizer: Closed/proprietary and NOT analyzed
The results shown for "Claude" in the paper are specifically for the Claude 1/2 tokenizer, not Claude 3. The authors explicitly note that Claude 3's tokenizer is closed and therefore not included in their analysis.
This is actually quite interesting because the paper infers that the lack of multilingual training data in Claude 1/2 (only 3.7% non-English) "likely explains why a new tokenizer was trained for Claude 3, which boasts 'increased capabilities... in non-English languages'."
So the paper only analyzes tokenizers where they have access to the merge rules - even for "closed" models like GPT-4, the tokenizers themselves are often open (via tiktoken in OpenAI's case) specifically so users can estimate API costs.
Problem: too many constraints for practicality
Idea: use a subset of constraints to find a candidate solution, then check whether it satisfies the global set of constraints and add any violated ones to the subset
Discussion of how to use a priority queue to index constraints in a way that tracks their structure, but it is not obvious on a first read what that structure is; the discussion seems very accounting-oriented
(§3.3)
Thinking Summary
The user has shared a paper about "Data Mixture Inference: What do BPE Tokenizers Reveal about their Training Data?" and appears to be looking at a specific section (§ 3.3) about efficient constraint violation detection. The key points from this section: 1. They're dealing with a linear programming problem that has too many constraints to be practical 2. Their solution is to use a subset of constraints, find a candidate solution, then check if it satisfies the global set of constraints and add any violated ones 3. They use a priority queue to efficiently detect violated constraints 4. The structure involves tracking how pair counts change as merges are applied Let me provide a clear explanation of what's happening in section 3.3.
Looking at §3.3, the authors are addressing a computational challenge in their BPE tokenizer attack. Here's what's happening:
The Problem
The linear program LP1_T has potentially billions of constraints (O(V³) where V is vocabulary size). Checking all constraints naively would be computationally infeasible.
The Solution Strategy
They use lazy constraint generation - start with a subset of constraints and iteratively add violated ones:
- Solve a smaller LP with only some constraints
- Check if the solution violates any constraints from the full LP
- Add violated constraints and repeat
The Priority Queue Structure
The key insight is tracking how pair frequencies change efficiently:
- Define $s_p^{(t)} = \sum_{i=1}^n \alpha_i c_{i,p}^{(t)}$ (the weighted sum of pair counts)
- Need to check whether $v^{(t)} + v_p + s_{m^{(t)}}^{(t)} \geq s_p^{(t)}$ holds for every pair $p \neq m^{(t)}$
The priority queue maintains pairs ordered by $s_p^{(t)} - v_p$:
- Initial setup: Build a heap over all pairs, with priority $s_p^{(t)} - v_p$
- Check violations: Pop pairs in descending priority order until finding one that satisfies the constraint
- Update efficiently: When moving from time $t$ to $t+1$, only update the pairs affected by merge $m^{(t)}$
The "structure" being exploited is that most pair counts don't change between timesteps - only O(V) pairs are affected by each merge (those overlapping with the merged pair). This allows updating the priority queue in O((k_new + k_old) log V) time rather than rebuilding from scratch.
This approach reduces the constraint checking from O(V²) per timestep to O(k log V) where k is the number of violated constraints - a massive speedup when most constraints are satisfied.
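A rough sketch of the violation check for one time step, using Python's `heapq`. This is a simplification: the heap here is rebuilt per call, whereas the method described above maintains it across time steps; `s` and `v` are assumed to hold the weighted counts $s_p^{(t)}$ and current slack values:

```python
import heapq

def violated_pairs(s, v, chosen, v_t):
    """Return pairs p whose constraint  v_t + v_p + s[chosen] >= s[p]  is violated.

    s: dict pair -> weighted count s_p^{(t)} under the candidate alpha
    v: dict pair -> current slack v_p (0 if the pair has no explicit slack yet)
    """
    # heapq is a min-heap, so push negated priorities to pop the largest first.
    heap = [(-(s[p] - v.get(p, 0.0)), p) for p in s if p != chosen]
    heapq.heapify(heap)

    threshold = s[chosen] + v_t
    violations = []
    while heap:
        neg_priority, p = heapq.heappop(heap)
        if -neg_priority <= threshold:
            break          # all remaining pairs have lower priority, so they satisfy it
        violations.append(p)
    return violations
```

Because pairs come off the heap in descending order of $s_p^{(t)} - v_p$, the scan stops at the first satisfied constraint instead of checking all O(V²) pairs.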
Can you elaborate on how the pair counts that don't change during a merge correspond to constraints in the priority queue?
Thinking Summary
The user is asking about the relationship between pair counts that don't change during a merge and how this relates to constraints in the priority queue. Let me think through this carefully. When a merge happens at time t, say merge m^(t) = (A, B) → AB: - Most pairs don't contain A or B, so their counts remain the same - Only pairs that contain A or B will have their counts change In the priority queue: - Each pair p has priority s_p^(t) - v_p - s_p^(t) = Σ α_i * c_{i,p}^(t) (weighted sum of counts) - If c_{i,p}^(t) = c_{i,p}^(t+1) for all i, then s_p^(t) = s_p^(t+1) - So the priority doesn't change! This means: 1. Pairs unaffected by the merge keep the same priority in the queue 2. Only need to update priorities for pairs that overlap with the merge 3. This is why they only need O(V) updates instead of O(V²) Each pair p corresponds to a constraint in the LP that says the chosen merge m^(t) must have been more frequent than p at time t.
Great question! Let me clarify the connection between unchanged pair counts and the priority queue structure.
How Merges Affect Pair Counts
When merge $m^{(t)} = (A, B) \to AB$ happens at time $t$:
Pairs that change:
- $(A, B)$ itself disappears (count → 0)
- Pairs like $(X, A)$ where $X$ precedes a merged $A$
- Pairs like $(B, Y)$ where $Y$ follows a merged $B$
- New pairs $(X, AB)$ and $(AB, Y)$ are created
Pairs that DON'T change: any pair not containing $A$ or $B$, like:
- $(C, D)$ where neither $C$ nor $D$ is $A$ or $B$, anywhere else in the text
Connection to Priority Queue
Each pair $p$ in the queue has priority $s_p^{(t)} - v_p$, where $s_p^{(t)} = \sum_{i=1}^n \alpha_i c_{i,p}^{(t)}$.
Key insight: If $c_{i,p}^{(t+1)} = c_{i,p}^{(t)}$ for all categories $i$, then:
- $s_p^{(t+1)} = s_p^{(t)}$ (the weighted sum doesn't change)
- The priority remains identical
- No need to update this entry in the priority queue!
The Efficiency Gain
For vocabulary size $V$:
- Total possible pairs: $O(V^2)$
- Pairs affected by one merge: $O(V)$ (those overlapping with the merged pair)
- Pairs unaffected: the remaining $O(V^2)$, i.e. the vast majority
So instead of rebuilding the entire queue with $O(V^2)$ operations, they only:
- Update the $O(V)$ affected entries (decrease-priority operations)
- Insert new pairs created by the merge
- Leave the vast majority untouched
Each pair $p$ represents a constraint in the LP: "the chosen merge $m^{(t)}$ had higher frequency than pair $p$ at time $t$". The priority queue efficiently finds which of these constraints are violated without checking all of them.
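As a sketch of that bookkeeping (illustrative only, not the paper's code), the following computes exactly which pair counts change when a merge $(A, B) \to AB$ is applied to one token sequence; only these entries would need priority updates in the heap:

```python
from collections import Counter

def delta_pair_counts(tokens, pair):
    """Return (new_tokens, delta), where delta maps each affected pair to its
    change in count when `pair` = (A, B) is merged into AB."""
    A, B = pair
    merged = A + B
    out, delta, i = [], Counter(), 0
    while i < len(tokens):
        if i + 1 < len(tokens) and tokens[i] == A and tokens[i + 1] == B:
            delta[(A, B)] -= 1                      # the merged pair disappears
            if out:                                  # left neighbour X: (X, A) -> (X, AB)
                delta[(out[-1], A)] -= 1
                delta[(out[-1], merged)] += 1
            if i + 2 < len(tokens):                  # right neighbour Y: (B, Y) -> (AB, Y)
                delta[(B, tokens[i + 2])] -= 1
                delta[(merged, tokens[i + 2])] += 1
            out.append(merged)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out, delta

tokens, delta = delta_pair_counts(list("the cat"), ("t", "h"))
print(tokens)   # ['th', 'e', ' ', 'c', 'a', 't']
print(delta)    # only pairs touching the merge change; everything else keeps its count
```

Every pair absent from `delta` keeps its count, its weighted sum $s_p$, and therefore its priority, which is why the heap only needs $O(V)$ updates per merge.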