Scaling and evaluating sparse autoencoders
Thread (60 posts)
When they say "we consider downstream KL divergence", downstream of what?
Thinking Summary
The user is asking about the meaning of "downstream KL divergence" in the context of the sparse autoencoders paper. Let me find the relevant section in the paper.

Looking at section 4.1, "Downstream loss", the paper says: "To measure whether we model features relevant to language modeling, we follow prior work (Bills et al., 2023; Cunningham et al., 2023; Bricken et al., 2023; Braun et al., 2024) and consider downstream Kullback-Leibler (KL) divergence and cross-entropy loss. In both cases, we test an autoencoder by replacing the residual stream by the reconstructed value during the forward pass, and seeing how it affects downstream predictions."

So "downstream" refers to the predictions that come after the layer where the autoencoder is applied. They're measuring how much the language model's output predictions change when they replace the original residual stream activations with the autoencoder's reconstruction.

The specific setup:
- They train autoencoders on residual stream activations at a specific layer (e.g., layer 8 in GPT-2 small)
- During evaluation, they replace the residual stream at that layer with the autoencoder reconstruction
- They measure how this affects the model's final output predictions (the "downstream" predictions)
- The KL divergence measures how different the probability distributions are between the original model and the model with reconstructed activations
In this paper, "downstream" refers to the language model's predictions that come after the layer where they apply the sparse autoencoder.
Specifically, they:
- Train autoencoders on residual stream activations at a particular layer (e.g., layer 8 of 12 in GPT-2 small)
- During evaluation, replace the residual stream at that layer with the autoencoder's reconstruction
- Let the model continue processing through the remaining layers
- Measure how much the final output predictions change (see the code sketch after this list)
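Here is a minimal sketch of that procedure, not the paper's actual code: it loads GPT-2 small via HuggingFace `transformers`, splices an SAE reconstruction into the residual stream with a forward hook, and reruns the forward pass. The layer index, the hook location (output of block 8), and the untrained `ToySAE` stand-in are illustrative assumptions.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()


class ToySAE(torch.nn.Module):
    """Untrained stand-in with the usual encoder/decoder shape of a sparse autoencoder."""
    def __init__(self, d_model: int, d_latent: int):
        super().__init__()
        self.enc = torch.nn.Linear(d_model, d_latent)
        self.dec = torch.nn.Linear(d_latent, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Encode to a (nominally sparse) latent, then reconstruct the residual stream.
        return self.dec(torch.relu(self.enc(x)))


sae = ToySAE(d_model=768, d_latent=768 * 32)  # widths chosen for illustration only


def patch_with_reconstruction(module, inputs, output):
    # GPT2Block returns a tuple; output[0] is the residual stream leaving the block.
    hidden = output[0]
    return (sae(hidden),) + output[1:]


tokens = tokenizer("The quick brown fox jumps over the lazy dog", return_tensors="pt")

with torch.no_grad():
    clean_logits = model(**tokens).logits  # original predictions

    # Splice the reconstruction into the residual stream at layer 8, then rerun.
    handle = model.transformer.h[8].register_forward_hook(patch_with_reconstruction)
    patched_logits = model(**tokens).logits
    handle.remove()
```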
The "downstream KL divergence" measures the difference between:
- The probability distribution over next tokens from the original model
- The probability distribution when using reconstructed activations (a sketch of the computation follows below)
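A hedged sketch of how that comparison could be computed from the two sets of logits; the KL direction and the averaging over positions shown here are assumptions, not taken from the paper:

```python
import torch
import torch.nn.functional as F


def downstream_kl(clean_logits: torch.Tensor, patched_logits: torch.Tensor) -> torch.Tensor:
    """KL(P_clean || P_patched), averaged over batch and sequence positions."""
    log_p = F.log_softmax(clean_logits, dim=-1)    # original next-token distribution
    log_q = F.log_softmax(patched_logits, dim=-1)  # distribution with reconstructed activations
    kl_per_position = (log_p.exp() * (log_p - log_q)).sum(dim=-1)
    return kl_per_position.mean()


# Toy usage with random logits, just to show the shapes: [batch, seq_len, vocab_size]
clean = torch.randn(1, 8, 50257)
patched = clean + 0.1 * torch.randn_like(clean)
print(downstream_kl(clean, patched))  # small value => predictions barely changed
```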
This tests whether the autoencoder preserves the information that matters for the model's actual task (predicting tokens), not just whether it can reconstruct activations accurately. A good autoencoder should have low downstream KL divergence, meaning the model's behavior is largely preserved even when using the sparse reconstructed features.
They note this is important because an autoencoder with low reconstruction error might still fail to capture the features most relevant for the model's behavior - the downstream metrics help verify that the learned features actually matter for the model's predictions.
Oh, the "stream" in "downstream" is referring to the latent stream?
Thinking Summary
The user is asking about the terminology "downstream" and whether it relates to the "residual stream" mentioned in the paper. Let me clarify this.

In the context of transformers and this paper:
- "Residual stream" refers to the main information pathway through the transformer layers - it's the vector that gets updated at each layer and carries information forward
- "Downstream" is a more general term meaning "later in the processing pipeline" or "after a certain point"

So "downstream" doesn't specifically refer to the residual stream itself, but rather to what happens after/later than the point where they intervene. The residual stream is the thing they're modifying, and "downstream" refers to the subsequent processing and outputs that depend on that stream.
Essentially, yes - though the stream in question is the residual stream that flows through the transformer layers (not a separate "latent stream"), and "downstream" refers to everything after the point where that stream is modified.
In transformer architecture:
- The residual stream is the main information pathway that gets updated at each layer
- Information flows from early layers → later layers → final output
So when they say "downstream KL divergence," they mean:
- They intervene at a specific layer's residual stream (e.g., layer 8)
- They replace it with the autoencoder reconstruction
- This modified residual stream then flows downstream through layers 9, 10, 11, 12...
- They measure how much the final outputs change (see the loss-based sketch below)
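The paper's quoted setup also reports downstream cross-entropy loss alongside KL: run the same patched forward pass and compare the usual next-token loss with and without the reconstruction spliced in. A minimal sketch, where the `next_token_loss` helper and the standard shift-by-one labeling are my own assumptions:

```python
import torch
import torch.nn.functional as F


def next_token_loss(logits: torch.Tensor, input_ids: torch.Tensor) -> torch.Tensor:
    """Standard language-modeling cross-entropy: predict token t+1 from position t."""
    shifted = logits[:, :-1, :].reshape(-1, logits.size(-1))
    targets = input_ids[:, 1:].reshape(-1)
    return F.cross_entropy(shifted, targets)


# With clean_logits, patched_logits and tokens["input_ids"] from the earlier sketch:
#   delta = next_token_loss(patched_logits, ids) - next_token_loss(clean_logits, ids)
# A small increase means the reconstruction preserves what matters for prediction.

# Toy usage with random tensors, just to show the shapes:
vocab = 50257
logits = torch.randn(1, 8, vocab)
ids = torch.randint(0, vocab, (1, 8))
print(next_token_loss(logits, ids))
```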
The terminology comes from thinking of computation as a flow:
- Upstream = earlier layers (before the intervention point)
- Downstream = later layers (after the intervention point)
It's like a river where you add dye at one point and measure how it affects everything downstream. Here they're modifying the residual stream at one layer and measuring how it affects all the subsequent computations that depend on it.