Scaling and evaluating sparse autoencoders
Thread (60 posts)
I don't understand the setup in the second paragraph of §4.5. When they say "at a particular token index," what is the actual setup there? What is the index set? When they say "each autoencoder latent," what precisely is the referent of "latent"? The description is densely packed and probably comprehensible for practitioners, but it's hard for me to be sure I'm following it precisely. cf. linked posts
Thinking Summary
The user is asking for clarification about Section 4.5 on "Sparsity of ablation effects". They find the description dense and hard to follow precisely. Let me reread that section carefully.

The key paragraph is: "At a particular token index, we obtain the latents at the residual stream, and proceed to ablate each autoencoder latent one by one, and compare the resulting logits before and after ablation. This process leads to V logit differences per ablation and affected token, where V is the size of the token vocabulary."

Breaking this down:
1. "At a particular token index" - This seems to mean they pick a specific position in the sequence (e.g., the 5th token in a sequence)
2. "we obtain the latents at the residual stream" - They run the model forward and extract the residual stream activations at that position, then encode them with the SAE to get the latent activations z
3. "ablate each autoencoder latent one by one" - For each of the k active latents (since they use TopK), they set that latent to 0 and decode back to residual stream
4. "compare the resulting logits before and after ablation" - They continue the forward pass with the modified residual stream and see how the output logits change

The confusion might be:
- What exactly is the "index set"? It seems like they're doing this for multiple token positions
- When they say "each autoencoder latent", do they mean each of the k active ones, or all n latents?
- The dense packing of technical details

Let me provide a clearer explanation.
Looking at §4.5, I can help clarify the setup. The description is indeed quite dense. Let me break it down:
The Setup:
1. "At a particular token index" - They pick a specific position in a sequence (e.g., position 5 in a 64-token context). The index set appears to be multiple sampled positions across their evaluation data.
2. "We obtain the latents at the residual stream" - At that position, they:
- Extract the residual stream activation vector
- Pass it through the SAE encoder to get latent activations
- For TopK, only k of these will be non-zero
3. "Ablate each autoencoder latent one by one" - Given the TopK architecture, "each autoencoder latent" likely means each of the k active latents (those with non-zero activations). For each active latent i:
- Set z_i = 0 (while keeping all other latents unchanged)
- Decode back to get a modified reconstruction of the residual stream
- Replace the original residual stream with this modified version
- Complete the forward pass to get new logits
4. Measuring effects - They track how this ablation affects logits at:
- The current position
- The next T - 1 positions (they use T = 16 in total, counting the ablated position)
The resulting metric measures whether ablating individual SAE features has sparse, localized effects on model outputs (good) or dense, widespread effects (bad).
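To make the loop concrete, here's a minimal sketch of the procedure as I understand it. The scaffolding is hypothetical: `sae` (with `encode`/`decode`), `run_to_layer`, and `run_from_layer` stand in for whatever model/SAE interface you actually have, and using the unablated reconstruction as the baseline (so the difference isolates the ablation rather than reconstruction error) is my assumption, not something the paper spells out.

```python
import torch

@torch.no_grad()
def ablation_logit_diffs(tokens, pos, layer=8, n_future=16):
    """Ablate each active SAE latent at one token position and record logit diffs.

    Hypothetical helpers: run_to_layer returns the residual stream [seq, d_model]
    at the hook point; run_from_layer finishes the forward pass from that point
    and returns logits [seq, V]; sae.encode / sae.decode map residual <-> latents.
    """
    resid = run_to_layer(tokens, layer)               # [seq, d_model]

    # Encode the residual stream at the chosen position; with TopK only k
    # of the n latent activations are non-zero.
    z = sae.encode(resid[pos])                        # [n_latents]
    active = torch.nonzero(z).squeeze(-1)             # indices of the k active latents

    # Baseline: splice in the *unablated* reconstruction, so the differences
    # below reflect the ablation itself rather than SAE reconstruction error
    # (an assumption about the exact baseline).
    resid_base = resid.clone()
    resid_base[pos] = sae.decode(z)
    base_logits = run_from_layer(resid_base, layer)[pos : pos + n_future]  # [T, V]

    diffs = {}
    for i in active.tolist():
        z_abl = z.clone()
        z_abl[i] = 0.0                                # ablate one latent, keep the rest
        resid_abl = resid.clone()
        resid_abl[pos] = sae.decode(z_abl)
        abl_logits = run_from_layer(resid_abl, layer)[pos : pos + n_future]
        diffs[i] = abl_logits - base_logits           # [T, V] logit differences
    return diffs
```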
The terminology confusion you noted in your linked posts is relevant here - "latent" is being used to mean both the latent unit and its activation value, which adds to the density of the description.
Thinking Summary
The user is asking about V and T in the context of section 4.5. Let me re-read that section carefully.

From the text:
"This process leads to V logit differences per ablation and affected token, where V is the size of the token vocabulary."
"Finally, we concatenate these vectors together across some set of T future tokens (at the ablated index or later) to obtain a vector of V⋅T total numbers."
"We perform this for various autoencoders trained on the post-MLP residual stream at layer 8 in GPT-2 small, with T=16."

So:
- V = vocabulary size (number of possible tokens)
- T = number of future positions they track (16 in their experiments)

The measurement process:
1. For each ablation of an active latent
2. At each of T positions (current + next T-1)
3. They get V logit differences (one per vocabulary item)
4. This gives V×T total numbers per ablation
V and T define the dimensions of their measurement:
- V = vocabulary size (e.g., 50,257 for GPT-2)
- T = number of token positions they track, from the ablated index onward (16 in their experiments)
For each ablation experiment:
- When you ablate one active latent at position t
- You measure logit changes at positions t, t+1, …, t+T-1
- At each position, you get V logit differences (one per vocabulary token)
- Total: V⋅T numbers per ablation
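Concretely, with GPT-2's vocabulary of V = 50,257 and T = 16, that's 50,257 × 16 = 804,112 logit differences per ablated latent.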
They then measure the sparsity of this V⋅T-dimensional vector using (L1/L2)², which gives an "effective number of vocab tokens affected."
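A minimal sketch of that measure, assuming the (L1/L2)² reading above is right; the input is the [T, V] block of logit differences for a single ablated latent (e.g., one entry of the `diffs` dict from the earlier sketch):

```python
import torch

def effective_tokens_affected(diff: torch.Tensor) -> float:
    """(L1/L2)^2 sparsity of one ablation's logit differences.

    diff: [T, V] logit differences for a single ablated latent. The value
    equals N when exactly N entries share the same magnitude (and 1 for a
    single non-zero entry), hence "effective number of vocab tokens affected".
    """
    v = diff.flatten()       # concatenate across the T positions -> V*T vector
    l1 = v.abs().sum()       # L1 norm
    l2 = v.norm()            # L2 norm
    return float((l1 / l2) ** 2)
```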
The key insight: If ablating a single SAE feature only affects a few specific tokens at a few specific positions (sparse effect), that's interpretable. If it changes many logits across many positions (dense effect), that suggests the feature isn't capturing a clean, modular computation.
For example:
- Sparse/good: Ablating a "plural noun" feature mainly affects "-s" endings at the next position
- Dense/bad: Ablating a feature changes logits broadly across vocabulary and positions