Data Mixture Inference: What do BPE Tokenizers Reveal about their Training Data?
Thread (17 posts)
In §3.1, the subscript i is over the subclasses of data whose mix we are trying to estimate, correct?
Thinking Summary
The human is asking about the notation in section 3.1 of the paper about data mixture inference. Let me read through that section carefully to understand what the subscript i refers to.

Looking at section 3.1 "Data mixture inference via linear programming", the paper sets up the problem as follows:
- They have a set of n data categories of interest
- They have data distributions {𝒟i}i=1^n for each category
- The BPE tokenizer was trained on a mixture ∑i=1^n αi* 𝒟i with weights αi* ∈ ℝn satisfying ∑i=1^n αi* = 1
- They have sample corpora {Di}i=1^n sampled from each of the 𝒟i respectively
- The goal is to estimate α̂ of α*

So yes, the subscript i indexes over the n different data categories/classes whose mixture proportions they are trying to estimate. For example, if they're trying to estimate the mix of natural languages, then i would index over different languages (English, French, German, etc.). If they're trying to estimate the mix of domains, then i would index over different domains (web, books, code, academic, etc.).

The ci,p^(t) notation later refers to "how many times the token pair p occurred in the partially merged text" for category i at time step t.
Yes, exactly! In section 3.1, the subscript i indexes over the data categories whose mixture proportions they are trying to estimate.
Specifically:
- n is the number of categories
- 𝒟i represents the data distribution for category i
- Di represents the sample corpus from category i
- αi* represents the true (unknown) proportion of category i in the tokenizer's training data
- ci,p^(t) represents the count of token pair p in category i's corpus after applying the first t merge rules
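The quantity ci,p^(t) can be computed by replaying the tokenizer's first t merge rules on category i's corpus and then counting adjacent token pairs. A minimal sketch (the function names here are illustrative, not from the paper):

```python
from collections import Counter

def apply_merges(tokens, merges):
    """Apply a sequence of BPE merge rules (left, right) to a token list, in order."""
    for left, right in merges:
        merged, j = [], 0
        while j < len(tokens):
            if j + 1 < len(tokens) and tokens[j] == left and tokens[j + 1] == right:
                merged.append(left + right)  # fuse the matched pair into one token
                j += 2
            else:
                merged.append(tokens[j])
                j += 1
        tokens = merged
    return tokens

def pair_counts(text, merges):
    """Count adjacent token pairs in text after applying the given merges.

    For category i's corpus and the first t merge rules, this yields the
    counts c_{i,p}^{(t)} for every pair p that occurs.
    """
    tokens = apply_merges(list(text), merges)
    return Counter(zip(tokens, tokens[1:]))
```

For example, on the string "aab" after the single merge ("a", "a"), the partially merged text is ["aa", "b"], so the only remaining pair is ("aa", "b").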
For example:
- If estimating language mixtures: i might index over {English, French, German, Spanish, ...}
- If estimating domain mixtures: i might index over {Web, Books, Code, Academic, Wikipedia}
- If estimating programming language mixtures: i might index over {Python, Java, JavaScript, C++, ...}
The goal is to solve for the mixture weights α̂ that best explain the observed BPE merge sequence, where ∑i=1^n α̂i = 1 and α̂i ≥ 0.
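The key constraint behind the linear program is that, at each step t, the pair the tokenizer actually merged must have had a higher mixture-weighted count than any alternative pair; slack variables absorb sampling noise, and their sum is minimized. A simplified sketch of building that LP in standard form (variable layout and function name are my own, not the paper's):

```python
def build_lp(counts, observed_merges):
    """Build min c.x  s.t.  A_ub x <= b_ub,  A_eq x = b_eq,  x >= 0.

    counts[t][i]       : dict mapping pair -> count for category i at step t
                         (i.e. c_{i,p}^{(t)} from the paper's notation)
    observed_merges[t] : the pair the tokenizer actually merged at step t
    Variables x = (alpha_1..alpha_n, v_1..v_T), one slack v_t per step.
    """
    n = len(counts[0])
    T = len(observed_merges)
    c = [0.0] * n + [1.0] * T  # objective: minimize total slack
    A_ub, b_ub = [], []
    for t, m in enumerate(observed_merges):
        pairs = set().union(*(counts[t][i].keys() for i in range(n)))
        for p in pairs:
            if p == m:
                continue
            # require: sum_i alpha_i * c_{i,p} - sum_i alpha_i * c_{i,m} <= v_t
            row = [counts[t][i].get(p, 0) - counts[t][i].get(m, 0) for i in range(n)]
            row += [0.0] * T
            row[n + t] = -1.0  # move slack v_t to the left-hand side
            A_ub.append(row)
            b_ub.append(0.0)
    A_eq = [[1.0] * n + [0.0] * T]  # mixture weights sum to 1
    b_eq = [1.0]
    return c, A_ub, b_ub, A_eq, b_eq
```

The resulting arrays can be handed to any LP solver (e.g. `scipy.optimize.linprog`); the recovered alpha components are the estimate α̂.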