
Thinking Summary

The user is asking about Figure 1's caption, specifically what "direction" means in this context. They have a typo - "excrete dear" should probably be "concrete terms" or similar.

Looking at Figure 1's caption in the paper: "Figure 1: A toy model illustrating why JumpReLU (or similar activation functions, such as TopK) are an improvement over ReLU for training sparse yet faithful SAEs. Consider a direction in which the encoder pre-activation is high when the corresponding feature is active and low, but not always negative, when the feature is inactive (far-left)."

In this context, "direction" refers to a feature direction in the SAE - essentially one of the learned dictionary elements that the SAE uses to decompose activations. Each "direction" corresponds to one feature in the SAE's dictionary, represented by:

  • A row in the encoder weight matrix W_enc
  • A column in the decoder weight matrix W_dec

The figure is showing how the encoder pre-activation (the value before applying the activation function) behaves for one particular feature/direction.

In the context of Figure 1's caption, "direction" refers to a feature direction in the Sparse Autoencoder (SAE) - essentially one of the learned dictionary elements.

More precisely, each direction corresponds to:

  • A row in the encoder weight matrix $\mathbf{W}_{\text{enc}}$
  • A column in the decoder weight matrix $\mathbf{W}_{\text{dec}}$ (denoted as $\mathbf{d}_i$ in the paper)

So when the caption says "Consider a direction in which the encoder pre-activation is high...", it's referring to one specific feature in the SAE's dictionary. The figure shows how the encoder pre-activation $\pi_i(\mathbf{x}) = (\mathbf{W}_{\text{enc}}\mathbf{x} + \mathbf{b}_{\text{enc}})_i$ behaves for that particular feature $i$.
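
As a rough illustration of that computation, here is a minimal numpy sketch; the sizes $n$ and $M$ and all values are placeholders, only the shapes matter:

```python
import numpy as np

n, M = 512, 4096                      # LM activation dim, number of SAE features (placeholder sizes)
W_enc = np.random.randn(M, n) * 0.01  # each row is one feature's encoder direction
b_enc = np.zeros(M)

x = np.random.randn(n)                # one LM activation vector

# Pre-activations: pi(x) = W_enc @ x + b_enc, one value per feature
pi = W_enc @ x + b_enc                # shape (M,)

# The i-th pre-activation is just the dot product of encoder row i with x, plus its bias
i = 7
assert np.isclose(pi[i], W_enc[i] @ x + b_enc[i])
```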

The key insight illustrated is that for this feature:

  • When the feature is truly "active" (should be detected), the pre-activation is high
  • When the feature is "inactive" (shouldn't be detected), the pre-activation is low but sometimes still positive

This creates a problem that ReLU handles poorly: small positive pre-activations from inactive features pass straight through, producing spurious activations that hurt sparsity. JumpReLU handles it by introducing an adjustable threshold, zeroing out pre-activations below the threshold while leaving genuinely high ones untouched.
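
A small numerical sketch of that contrast (the pre-activation values and threshold below are made up for illustration; the JumpReLU form $z \cdot \mathbb{1}[z > \theta]$ follows the paper):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def jumprelu(z, theta):
    # Pass values through unchanged, but only above the threshold theta; zero otherwise
    return np.where(z > theta, z, 0.0)

# Hypothetical pre-activations for one feature direction:
# ~4.0 when the feature is truly active, ~0.3 of positive "leakage" when it is not
pre_acts = np.array([4.0, 0.3, -0.5, 3.5, 0.2])
theta = 1.0  # illustrative threshold

print(relu(pre_acts))             # [4.  0.3 0.  3.5 0.2] -> leakage survives, hurting sparsity
print(jumprelu(pre_acts, theta))  # [4.  0.  0.  3.5 0. ] -> leakage suppressed, active values intact
```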

Thinking Summary

The user is asking about the basis for the encoder weight matrix rows. Let me think carefully about this.

The encoder takes in activations $\mathbf{x} \in \mathbb{R}^n$ from the language model and produces pre-activations. So the encoder weight matrix $\mathbf{W}_{\text{enc}}$ has shape $M \times n$ where:

  • $M$ is the number of features in the SAE dictionary
  • $n$ is the dimension of the language model activations

Each row of $\mathbf{W}_{\text{enc}}$ is therefore a vector in $\mathbb{R}^n$, expressed in the same basis as the input language model activations $\mathbf{x}$. This basis is the standard basis of the language model's activation space at whatever layer/site is being analyzed (e.g., residual stream, MLP output, or attention output).

Each row of the encoder matrix $\mathbf{W}_{\text{enc}}$ is expressed in the same basis as the language model activations $\mathbf{x} \in \mathbb{R}^n$.

Specifically:

  • If the SAE is trained on residual stream activations, the basis is the standard basis of the residual stream space
  • If trained on MLP outputs, it's the standard basis of the MLP output space
  • If trained on attention outputs, it's the standard basis of the attention output space

So each encoder row is a vector in $\mathbb{R}^n$ that defines a linear projection from the LM's activation space. When we compute $\mathbf{W}_{\text{enc}}\mathbf{x}$, each element of the result is the dot product between a row of $\mathbf{W}_{\text{enc}}$ and the input activation vector $\mathbf{x}$, both expressed in the same coordinate system.

This is in contrast to the decoder columns $\mathbf{d}_i$, which also live in the same $\mathbb{R}^n$ space but represent the actual feature directions that the SAE uses to reconstruct activations.
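
To make the shapes concrete, here is a minimal sketch of that layout (assuming a standard SAE parameterization; the sizes and the plain ReLU shown are placeholders, the point is only that encoder rows and decoder columns both live in $\mathbb{R}^n$):

```python
import numpy as np

n, M = 512, 4096                  # LM activation dim, dictionary size (placeholders)
W_enc = np.random.randn(M, n)     # rows: projections used to compute feature activations
W_dec = np.random.randn(n, M)     # columns: feature directions d_i used for reconstruction
b_enc, b_dec = np.zeros(M), np.zeros(n)

x = np.random.randn(n)            # LM activation, in the model's own basis

encoder_row_i = W_enc[7]          # a vector in R^n, same space as x
d_i = W_dec[:, 7]                 # also a vector in R^n, same space as x

# Encoder rows are read off against x via dot products...
f = np.maximum(W_enc @ x + b_enc, 0.0)  # feature activations (plain ReLU shown for simplicity)
# ...while decoder columns are the directions that get summed to rebuild x
x_hat = W_dec @ f + b_dec               # reconstruction, back in R^n
```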
