Data Mixture Inference: What do BPE Tokenizers Reveal about their Training Data?
Thread (17 posts)
In §3.1, the subscript i is over the subclasses of data whose mix we are trying to estimate, correct?
Thinking Summary
Yes, the user is asking about section 3.1 of the paper "What do BPE Tokenizers Reveal about their Training Data?" and specifically about the subscript i in the notation. Looking at section 3.1, I can see that they define: - n data categories of interest - data distributions {𝒟i}i=1^n for each category - The tokenizer was trained on a mixture ∑i=1^n αi* 𝒟i - Each αi* represents the proportion of category i in the training data So yes, the subscript i indexes over the different data categories (like different languages, programming languages, or domains) whose proportions in the mixture they are trying to estimate. The paper mentions examples like: - Natural languages (English, French, Spanish, etc.) - Programming languages (Python, Java, etc.) - Data sources/domains (Wikipedia, web, books, code, academic) So i ranges from 1 to n, where n is the number of categories being considered.
Yes, exactly! In §3.1, the subscript $i$ indexes over the data categories whose mixture proportions the authors are trying to estimate.
Specifically:
- $\mathcal{D}_i$ represents the data distribution for category $i$
- $\alpha_i^*$ is the true (unknown) proportion of category $i$ in the tokenizer's training data
- The constraint $\sum_{i=1}^n \alpha_i^* = 1$ (with each $\alpha_i^* \geq 0$) ensures these proportions form a valid probability distribution
The categories could be:
- Natural languages: English, French, Spanish, Russian, etc.
- Programming languages: Python, Java, JavaScript, etc.
- Data domains: Wikipedia, web crawl, books, academic papers, code
So if you're trying to infer the language mixture of a tokenizer trained on 10 languages, then $i$ would index those 10 languages. The goal is to estimate each $\alpha_i^*$ - what fraction of the training data came from each language.
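To make the setup concrete, here is a minimal sketch (not from the paper) of what training on the mixture $\sum_i \alpha_i^* \mathcal{D}_i$ means operationally; the category names, proportions, and `corpora` layout are all hypothetical:

```python
import numpy as np

# Hypothetical categories and ground-truth proportions alpha_i* (illustrative only).
categories = ["english", "french", "python_code"]
alpha_star = np.array([0.6, 0.3, 0.1])          # non-negative, sums to 1
assert np.isclose(alpha_star.sum(), 1.0)

def sample_training_text(rng, corpora):
    """Draw one training text from the mixture: pick category i with probability
    alpha_i*, then draw a text from that category's corpus."""
    i = rng.choice(len(categories), p=alpha_star)
    docs = corpora[categories[i]]
    return docs[rng.integers(len(docs))]

# Usage with toy corpora: a BPE tokenizer trained on many such samples sees
# category i with frequency roughly alpha_i*.
rng = np.random.default_rng(0)
toy_corpora = {"english": ["the cat"], "french": ["le chat"], "python_code": ["def f():"]}
print(sample_training_text(rng, toy_corpora))
```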
Can you unpack each part of the linear constraint system in §3.1?
Thinking Summary
The user is asking me to explain the linear constraint system from §3.1 of the paper. Let me break down each component of the constraint: $$\sum_{i=1}^n\alpha_i c_{i,m^{(t)}}^{(t)} \geq \sum_{i=1}^n\alpha_i c_{i,p}^{(t)} \quad \text{for all } p \neq m^{(t)}$$ And the relaxed version: $$v^{(t)} + v_p + \sum_{i=1}^n\alpha_i c_{i,m^{(t)}}^{(t)} \geq \sum_{i=1}^n\alpha_i c_{i,p}^{(t)} \quad \text{for all } p \neq m^{(t)}$$ I need to explain what each symbol means and why this constraint makes sense.
Let me break down each component of the linear constraint system from §3.1:
The Core Constraint

$$\sum_{i=1}^n \alpha_i\, c_{i,m^{(t)}}^{(t)} \;\geq\; \sum_{i=1}^n \alpha_i\, c_{i,p}^{(t)} \qquad \text{for all } p \neq m^{(t)}$$

Components:
- $\alpha_i$: The proportion of category $i$ in the tokenizer's training data (what we're solving for)
- $c_{i,p}^{(t)}$: The count of token pair $p$ in category $i$'s data after applying the first $t-1$ merges
- $m^{(t)}$: The actual merge that was chosen at step $t$ (known from the tokenizer's merge list)
- $p$: Any other possible token pair that could have been merged
What this constraint means:
At training time step $t$, the BPE algorithm chose to merge pair $m^{(t)}$ because it had the highest frequency across the entire training dataset. This constraint encodes that fact: the weighted sum of counts for the chosen merge must be at least as large as the weighted sum for any other candidate pair $p$.
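For intuition about where the $c_{i,p}^{(t)}$ values come from, here is a rough, unoptimized sketch (my own illustration, not the authors' code) of replaying a prefix of the merge list over sample text from one category and counting adjacent pairs:

```python
from collections import Counter

def pair_counts(words, merges_so_far):
    """Count adjacent token pairs after applying a prefix of the merge list.
    `words` is a list of words, each a sequence of initial symbols (e.g. characters);
    `merges_so_far` is the ordered list of merges m^(1)..m^(t-1) applied before step t."""
    counts = Counter()
    for word in words:
        symbols = list(word)
        for a, b in merges_so_far:                    # replay merges in order
            j = 0
            while j < len(symbols) - 1:
                if symbols[j] == a and symbols[j + 1] == b:
                    symbols[j:j + 2] = [a + b]        # merge the pair in place
                else:
                    j += 1
        counts.update(zip(symbols, symbols[1:]))      # adjacent-pair counts c_{i,p}^{(t)}
    return counts
```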
The Relaxed Constraint (for feasibility)

$$v^{(t)} + v_p + \sum_{i=1}^n \alpha_i\, c_{i,m^{(t)}}^{(t)} \;\geq\; \sum_{i=1}^n \alpha_i\, c_{i,p}^{(t)} \qquad \text{for all } p \neq m^{(t)}$$

Additional components:
- $v^{(t)} \geq 0$: Slack variable for time step $t$ - allows the chosen merge $m^{(t)}$ to have been "over-represented" in the original training data compared to our samples
- $v_p \geq 0$: Slack variable for pair $p$ - allows pair $p$ to be "over-represented" in our current mixture estimate compared to the original training data
Why relaxation is needed:
Since we only have samples from each category distribution (not the exact data the tokenizer was trained on), the strict constraints might be infeasible due to sampling noise. The slack variables allow violations while the objective function minimizes them:

$$\min\; \sum_t v^{(t)} + \sum_p v_p$$

This finds the mixture proportions $\alpha$ that require the least amount of "explanation" via sampling noise.
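For concreteness, here is a minimal `gurobipy` sketch of this relaxed LP (an illustration under my own assumptions, not the authors' implementation; it builds every constraint up front rather than lazily). It assumes precomputed pair counts `counts[t][i]` mapping each pair $p$ to $c_{i,p}^{(t)}$, and a merge list with `merges[t]` $= m^{(t)}$:

```python
import gurobipy as gp
from gurobipy import GRB

def solve_mixture_lp(counts, merges):
    """Sketch of the relaxed LP: counts[t][i] maps pair p -> c_{i,p}^{(t)},
    merges[t] = m^{(t)}.  No lazy constraint generation."""
    n = len(counts[0])                                   # number of categories
    model = gp.Model("data_mixture_inference")
    alpha = model.addVars(n, lb=0.0, name="alpha")       # mixture proportions
    model.addConstr(gp.quicksum(alpha[i] for i in range(n)) == 1)

    v_t = model.addVars(len(merges), lb=0.0, name="v_t") # per-step slack v^(t)
    v_p = {}                                             # per-pair slack v_p
    for t, m in enumerate(merges):
        pairs = set().union(*(counts[t][i].keys() for i in range(n)))
        chosen = gp.quicksum(alpha[i] * counts[t][i].get(m, 0) for i in range(n))
        for p in pairs:
            if p == m:
                continue
            if p not in v_p:
                v_p[p] = model.addVar(lb=0.0)
            other = gp.quicksum(alpha[i] * counts[t][i].get(p, 0) for i in range(n))
            model.addConstr(v_t[t] + v_p[p] + chosen >= other)

    model.setObjective(v_t.sum() + gp.quicksum(v_p.values()), GRB.MINIMIZE)
    model.optimize()
    return [alpha[i].X for i in range(n)]
```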
How do linear programming solvers handle the relaxation variables?
Thinking Summary
The user is asking about how linear programming solvers handle the relaxation variables (the v variables) in the paper's formulation. This is about the computational/algorithmic aspects of solving the LP. Let me think about what aspects to cover: 1. How LP solvers treat these variables computationally 2. The role of the objective function in driving them to be small 3. The specific solver they use (Gurobi) 4. The lazy constraint generation approach they use The key points: - LP solvers treat v variables just like any other decision variables - The objective function minimizes their sum, so the solver naturally tries to keep them small - They use Gurobi, a commercial LP solver - They use lazy constraint generation to handle the massive number of constraints
Linear programming solvers handle the relaxation variables $v^{(t)}$ and $v_p$ just like any other decision variables, but their treatment is shaped by several key aspects:
1. Objective Function Drives Minimization
The LP objective $\min \sum_t v^{(t)} + \sum_p v_p$ ensures the solver actively tries to keep these variables small. Since all $v$ variables are:
- Non-negative (constrained to $v \geq 0$)
- Present in the objective with positive coefficients
- Not appearing elsewhere with negative coefficients
The solver will naturally set them to the smallest values that maintain feasibility.
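A toy example (not from the paper) makes this concrete: with a positive objective coefficient and a lower bound of zero, a slack variable settles at exactly the smallest value that keeps the constraint satisfiable:

```python
from scipy.optimize import linprog

# Toy LP: minimize v  subject to  v + 2*alpha >= 3,  0 <= alpha <= 1,  v >= 0.
# Variables are x = [alpha, v]; linprog expects A_ub @ x <= b_ub, so the ">="
# constraint is rewritten as  -2*alpha - v <= -3.
res = linprog(c=[0, 1], A_ub=[[-2, -1]], b_ub=[-3],
              bounds=[(0, 1), (0, None)], method="highs")
print(res.x)   # ~[1.0, 1.0]: the slack v takes the smallest feasible value, 1
```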
2. Computational Treatment
Modern LP solvers like Gurobi (used in this paper) handle these through:
- Interior point methods: Start with large slack values and iteratively reduce them while maintaining feasibility
- Simplex algorithm: Keeps most variables at zero (non-basic) and only increases them when necessary to satisfy constraints
3. Lazy Variable/Constraint Generation
The paper uses a sophisticated approach to handle the massive scale:
Initially: Start with v^(t) = v_p = 0 for all t,p
Repeat:
1. Check which constraints are violated
2. Add only violated constraints and their v_p variables
3. Re-solve the reduced LP
Until: No violations remain
This means:
- Most variables never enter the problem (stay implicitly at 0)
- Only "problematic" pairs that would violate constraints get explicit slack variables
- Dramatically reduces the problem size from one slack variable and constraint per candidate pair at every merge step down to a much smaller active set (a schematic of this loop is sketched below)
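Here is a schematic of that loop in Python (hypothetical helper signatures, not the authors' code); the point is only the structure: solve a reduced LP, scan for violated constraints, add them together with their $v_p$ slacks, and repeat:

```python
def solve_with_lazy_constraints(model, alpha, find_violations, add_constraint, max_rounds=50):
    """Schematic cutting-plane loop.
    model           -- LP containing only alpha, the sum-to-one constraint, and the v^(t) slacks
    alpha           -- list of the alpha_i variables
    find_violations -- callable: current alpha values -> list of violated (t, p) indices
    add_constraint  -- callable: adds the (t, p) constraint plus its v_p slack to the model
    (All helper signatures are assumptions for illustration.)"""
    alpha_values = None
    for _ in range(max_rounds):
        model.optimize()
        alpha_values = [a.X for a in alpha]
        violated = find_violations(alpha_values)
        if not violated:
            break                          # the full constraint system is now satisfied
        for t, p in violated:
            add_constraint(t, p)           # only "problematic" pairs get explicit v_p slacks
    return alpha_values
```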
4. Interpretation of Non-Zero Slacks
When the solver returns non-zero slack values:
- Large $v^{(t)}$: Merge $m^{(t)}$ was likely over-represented in the tokenizer's training data compared to the samples
- Large $v_p$: Pair $p$ appears more frequently in the sampled data than it did during tokenizer training
The solver balances these to find the $\alpha_i$ values that minimize total "unexplained" frequency differences.
5. Numerical Precision
LP solvers typically use floating-point arithmetic with feasibility tolerances (e.g., $10^{-6}$). Very small slack values effectively indicate satisfied constraints, while larger values point to genuine distributional differences between training and sample data.
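As a small post-processing sketch (my own assumption about how one might separate solver noise from signal, not a step described in the paper), the slacks can be filtered against such a tolerance after solving:

```python
FEAS_TOL = 1e-6   # assumed tolerance, on the order of typical LP solver defaults

def significant_slacks(slack_vars, tol=FEAS_TOL):
    """Keep only slack variables whose optimal value clearly exceeds solver noise;
    slack_vars maps a label (e.g. a time step t or a pair p) to a solved Gurobi Var."""
    return {key: var.X for key, var in slack_vars.items() if var.X > tol}

# e.g. calling significant_slacks on the per-pair slack dict from an LP like the
# sketch above flags pairs whose sample frequency could not be reconciled with
# the tokenizer's observed merge order.
```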