
Thinking Summary

The user is asking about the meaning of "downstream KL divergence" in the context of the sparse autoencoders paper. Let me find the relevant section in the paper. Looking at section 4.1 "Downstream loss", the paper says: "To measure whether we model features relevant to language modeling, we follow prior work (Bills et al., 2023; Cunningham et al., 2023; Bricken et al., 2023; Braun et al., 2024) and consider downstream Kullback-Leibler (KL) divergence and cross-entropy loss. In both cases, we test an autoencoder by replacing the residual stream by the reconstructed value during the forward pass, and seeing how it affects downstream predictions."

So "downstream" refers to the predictions that come after the layer where the autoencoder is applied. They're measuring how much the language model's output predictions change when they replace the original residual stream activations with the autoencoder's reconstruction.

The specific setup:

  • They train autoencoders on residual stream activations at a specific layer (e.g., layer 8 in GPT-2 small)
  • During evaluation, they replace the residual stream at that layer with the autoencoder reconstruction
  • They measure how this affects the model's final output predictions (the "downstream" predictions)
  • The KL divergence measures how different the probability distributions are between the original model and the model with reconstructed activations

In this paper, "downstream" refers to the language model's predictions that come after the layer where they apply the sparse autoencoder.

Specifically, they:

  1. Train autoencoders on residual stream activations at a particular layer (e.g., layer 8 of 12 in GPT-2 small)
  2. During evaluation, replace the residual stream at that layer with the autoencoder's reconstruction (see the sketch after this list)
  3. Let the model continue processing through the remaining layers
  4. Measure how much the final output predictions change
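
A minimal sketch of the splice (step 2) in PyTorch, assuming a hypothetical `model.layers` list of transformer blocks and an `sae` module whose forward pass returns the reconstruction; the paper's actual evaluation harness is not specified here:

```python
import torch

@torch.no_grad()
def splice_sae(model, sae, layer_idx, tokens):
    """Forward pass with the residual stream at `layer_idx` replaced
    by its SAE reconstruction; returns the resulting logits."""
    def hook(module, inputs, output):
        # Returning a value from a forward hook replaces the module's
        # output (if the block returns a tuple, reconstruct output[0]).
        return sae(output)

    handle = model.layers[layer_idx].register_forward_hook(hook)
    try:
        return model(tokens)
    finally:
        handle.remove()  # always restore the original model
```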

The "downstream KL divergence" measures the difference between:

  • The probability distribution over next tokens from the original model
  • The probability distribution when using reconstructed activations

This tests whether the autoencoder preserves the information that matters for the model's actual task (predicting tokens), not just whether it can reconstruct activations accurately. A good autoencoder should have low downstream KL divergence, meaning the model's behavior is largely preserved even when using the sparse reconstructed features.

They note this is important because an autoencoder with low reconstruction error might still fail to capture the features most relevant for the model's behavior - the downstream metrics help verify that the learned features actually matter for the model's predictions.
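
As a toy illustration (numbers invented, not from the paper), a next-token distribution that is nearly preserved after the splice yields a small KL:

```python
import torch

# Toy next-token distributions over a 4-token vocabulary.
p_orig  = torch.tensor([0.70, 0.20, 0.05, 0.05])  # original model
p_recon = torch.tensor([0.65, 0.24, 0.06, 0.05])  # with SAE reconstruction

# KL(p_orig || p_recon) = sum over tokens of p * log(p / q)
kl = torch.sum(p_orig * (p_orig / p_recon).log())
print(kl.item())  # ~0.006 nats: behavior largely preserved
```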

Thinking Summary

The user wants me to write out the formulas for the downstream KL divergence metric. Let me look at what the paper says about this. From the paper, they mention:

  1. They replace the residual stream with the reconstructed value during the forward pass
  2. They measure KL divergence and cross-entropy loss
  3. They also mention "delta cross-entropy" - the difference to the perfect-autoencoder cross-entropy

Let me write out the mathematical formulas for these concepts.

Sure! Here are the key formulas for the downstream metrics:

Downstream KL Divergence

Let:

  • $x$ = original residual stream activations at layer $\ell$
  • $\hat{x}$ = autoencoder reconstruction of $x$
  • $p(y|x)$ = original model's output distribution over tokens
  • $p(y|\hat{x})$ = model's output distribution when using reconstructed activations

The downstream KL divergence is:

$$D_{KL}\left(p(y|x) \,\|\, p(y|\hat{x})\right) = \sum_{y \in V} p(y|x) \log \frac{p(y|x)}{p(y|\hat{x})}$$

where $V$ is the vocabulary.
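
A sketch of computing this metric from the logits of the two forward passes (a plausible implementation, not the paper's code):

```python
import torch.nn.functional as F

def downstream_kl(logits_orig, logits_recon):
    """Mean over tokens of KL(p(y|x) || p(y|x_hat)), given logit
    tensors of shape (batch, seq, vocab)."""
    log_p = F.log_softmax(logits_orig, dim=-1)   # log p(y|x)
    log_q = F.log_softmax(logits_recon, dim=-1)  # log p(y|x_hat)
    # F.kl_div(input, target) computes KL(target || input); with
    # log_target=True both arguments are log-probabilities.
    kl = F.kl_div(log_q, log_p, log_target=True, reduction="none")
    return kl.sum(dim=-1).mean()
```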

Downstream Cross-Entropy Loss

The delta cross-entropy they mention is:

$$\Delta CE = CE(y_{true}, p(y|\hat{x})) - CE(y_{true}, p(y|x))$$

where $CE(y_{true}, p) = -\log p(y_{true})$.

This measures how much worse the model's predictions become when using reconstructed activations.
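
A matching sketch for the delta cross-entropy (again hypothetical, with `targets` holding the true next-token ids):

```python
import torch.nn.functional as F

def delta_cross_entropy(logits_orig, logits_recon, targets):
    """CE with reconstructed activations minus CE with the originals.
    logits_*: (batch, seq, vocab); targets: (batch, seq)."""
    def ce(logits):
        return F.cross_entropy(logits.flatten(0, 1), targets.flatten())
    return ce(logits_recon) - ce(logits_orig)
```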

The Forward Pass

During evaluation, for a model with $L$ layers:

$$\begin{align}
h_0 &= \text{embed}(\text{tokens}) \\
h_i &= \text{layer}_i(h_{i-1}) \quad \text{for } i \le \ell \\
\hat{h}_\ell &= \text{SAE}_{\text{decode}}(\text{SAE}_{\text{encode}}(h_\ell)) \\
h_{\ell+1} &= \text{layer}_{\ell+1}(\hat{h}_\ell) \\
h_i &= \text{layer}_i(h_{i-1}) \quad \text{for } \ell + 1 < i \le L \\
p(y|\hat{x}) &= \text{softmax}(W_{out} \cdot h_L)
\end{align}$$

The key insight is that these metrics test whether the sparse features capture what's actually important for the model's task, not just whether they can reconstruct activations accurately.
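
Putting the hypothetical helpers above together into one evaluation pass (note the usual one-position shift between logits and next-token targets):

```python
import torch

with torch.no_grad():
    logits_orig  = model(tokens)                                   # clean run
    logits_recon = splice_sae(model, sae, layer_idx=8, tokens=tokens)

kl = downstream_kl(logits_orig, logits_recon)
delta_ce = delta_cross_entropy(logits_orig[:, :-1],
                               logits_recon[:, :-1],
                               tokens[:, 1:])
```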