
Thinking Summary

The user is asking about Figure 1's caption, specifically what "direction" means in this context. They have a typo - "excrete dear" should probably be "concrete terms" or similar.

Looking at Figure 1's caption in the paper: "Figure 1: A toy model illustrating why JumpReLU (or similar activation functions, such as TopK) are an improvement over ReLU for training sparse yet faithful SAEs. Consider a direction in which the encoder pre-activation is high when the corresponding feature is active and low, but not always negative, when the feature is inactive (far-left)."

In this context, "direction" refers to a feature direction in the SAE - essentially one of the learned dictionary elements that the SAE uses to decompose activations. Each "direction" corresponds to one feature in the SAE's dictionary, represented by:

  • A row in the encoder weight matrix W_enc
  • A column in the decoder weight matrix W_dec

The figure is showing how the encoder pre-activation (the value before applying the activation function) behaves for one particular feature/direction.

In the context of Figure 1's caption, "direction" refers to a feature direction in the Sparse Autoencoder (SAE) - essentially one of the learned dictionary elements.

More precisely, each direction corresponds to:

  • A row in the encoder weight matrix $\mathbf{W}_{\text{enc}}$
  • A column in the decoder weight matrix $\mathbf{W}_{\text{dec}}$ (denoted as $\mathbf{d}_i$ in the paper)

So when the caption says "Consider a direction in which the encoder pre-activation is high...", it's referring to one specific feature in the SAE's dictionary. The figure shows how the encoder pre-activation $\pi_i(\mathbf{x}) = (\mathbf{W}_{\text{enc}}\mathbf{x} + \mathbf{b}_{\text{enc}})_i$ behaves for that particular feature $i$.
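
As a rough illustration of that computation, here is a minimal numpy sketch; the sizes $n$ and $M$ and all values are placeholders, only the shapes matter:

```python
import numpy as np

n, M = 512, 4096                      # LM activation dim, number of SAE features (placeholder sizes)
W_enc = np.random.randn(M, n) * 0.01  # each row is one feature's encoder direction
b_enc = np.zeros(M)

x = np.random.randn(n)                # one LM activation vector

# Pre-activations: pi(x) = W_enc @ x + b_enc, one value per feature
pi = W_enc @ x + b_enc                # shape (M,)

# The i-th pre-activation is just the dot product of encoder row i with x, plus its bias
i = 7
assert np.isclose(pi[i], W_enc[i] @ x + b_enc[i])
```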

The key insight illustrated is that for this feature:

  • When the feature is truly "active" (should be detected), the pre-activation is high
  • When the feature is "inactive" (shouldn't be detected), the pre-activation is low but sometimes still positive

This creates a problem that ReLU handles poorly: small positive pre-activations from inactive features pass straight through, producing spurious activations that hurt sparsity. JumpReLU handles it by introducing an adjustable threshold, zeroing out pre-activations below the threshold while leaving genuinely high ones untouched.
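
A small numerical sketch of that contrast (the pre-activation values and threshold below are made up for illustration; the JumpReLU form $z \cdot \mathbb{1}[z > \theta]$ follows the paper):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def jumprelu(z, theta):
    # Pass values through unchanged, but only above the threshold theta; zero otherwise
    return np.where(z > theta, z, 0.0)

# Hypothetical pre-activations for one feature direction:
# ~4.0 when the feature is truly active, ~0.3 of positive "leakage" when it is not
pre_acts = np.array([4.0, 0.3, -0.5, 3.5, 0.2])
theta = 1.0  # illustrative threshold

print(relu(pre_acts))             # [4.  0.3 0.  3.5 0.2] -> leakage survives, hurting sparsity
print(jumprelu(pre_acts, theta))  # [4.  0.  0.  3.5 0. ] -> leakage suppressed, active values intact
```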

Thinking Summary

The user is asking about the basis for the encoder weight matrix rows. Let me think carefully about this.

The encoder takes in activations $\mathbf{x} \in \mathbb{R}^n$ from the language model and produces pre-activations. So the encoder weight matrix $\mathbf{W}_{\text{enc}}$ has shape $M \times n$ where:

  • $M$ is the number of features in the SAE dictionary
  • $n$ is the dimension of the language model activations

Each row of $\mathbf{W}_{\text{enc}}$ is therefore a vector in $\mathbb{R}^n$, expressed in the same basis as the input language model activations $\mathbf{x}$. This basis is the standard basis of the language model's activation space at whatever layer/site is being analyzed (e.g., residual stream, MLP output, or attention output).

Each row of the encoder matrix $\mathbf{W}_{\text{enc}}$ is expressed in the same basis as the language model activations $\mathbf{x} \in \mathbb{R}^n$.

Specifically:

  • If the SAE is trained on residual stream activations, the basis is the standard basis of the residual stream space
  • If trained on MLP outputs, it's the standard basis of the MLP output space
  • If trained on attention outputs, it's the standard basis of the attention output space

So each encoder row is a vector in $\mathbb{R}^n$ that defines a linear projection from the LM's activation space. When we compute $\mathbf{W}_{\text{enc}}\mathbf{x}$, each element of the result is the dot product between a row of $\mathbf{W}_{\text{enc}}$ and the input activation vector $\mathbf{x}$, both expressed in the same coordinate system.

This is in contrast to the decoder columns $\mathbf{d}_i$, which also live in the same $\mathbb{R}^n$ space but represent the actual feature directions that the SAE uses to reconstruct activations.
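
To make the shapes concrete, here is a minimal sketch of that layout (assuming a standard SAE parameterization; the sizes and the plain ReLU shown are placeholders, the point is only that encoder rows and decoder columns both live in $\mathbb{R}^n$):

```python
import numpy as np

n, M = 512, 4096                  # LM activation dim, dictionary size (placeholders)
W_enc = np.random.randn(M, n)     # rows: projections used to compute feature activations
W_dec = np.random.randn(n, M)     # columns: feature directions d_i used for reconstruction
b_enc, b_dec = np.zeros(M), np.zeros(n)

x = np.random.randn(n)            # LM activation, in the model's own basis

encoder_row_i = W_enc[7]          # a vector in R^n, same space as x
d_i = W_dec[:, 7]                 # also a vector in R^n, same space as x

# Encoder rows are read off against x via dot products...
f = np.maximum(W_enc @ x + b_enc, 0.0)  # feature activations (plain ReLU shown for simplicity)
# ...while decoder columns are the directions that get summed to rebuild x
x_hat = W_dec @ f + b_dec               # reconstruction, back in R^n
```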
