Numinex

Thread (53 posts)

The L1 penalty also fails to be invariant under reparameterizations of a SAE; by scaling down encoder parameters and scaling up decoder parameters accordingly, it is possible to arbitrarily shrink feature magnitudes, and thus the L1 penalty, without changing either the number of active features or the SAE’s output reconstructions.

Can you give a worked toy example of this?

Thinking Summary

The user is asking for a worked toy example of how the L1 penalty is not invariant under reparameterizations of a SAE. The key idea is that if we have an SAE with an encoder and a decoder, we can scale the encoder weights down by some factor and scale the decoder weights up by the same factor, and this will:

  1. Keep the reconstruction the same
  2. Keep the number of active features the same
  3. But reduce the L1 penalty (since feature magnitudes will be smaller)

Let me create a simple example to illustrate this.

Sure! Here's a simple toy example to illustrate how the L1 penalty is not invariant under reparameterizations:

Original SAE Setup

Consider a simple SAE with:

  • Input dimension: 2
  • Dictionary size: 2
  • Input vector: $\mathbf{x} = [1, 0]$

Original parameters:

  • Encoder weights: $\mathbf{W}_{\text{enc}} = \begin{bmatrix} 2 & 0 \\ 0 & 2 \end{bmatrix}$
  • Encoder bias: $\mathbf{b}_{\text{enc}} = [-1, -1]$
  • Decoder weights: $\mathbf{W}_{\text{dec}} = \begin{bmatrix} 0.5 & 0 \\ 0 & 0.5 \end{bmatrix}$
  • Decoder bias: $\mathbf{b}_{\text{dec}} = [0, 0]$

Forward pass:

  1. Pre-activations: $\boldsymbol{\pi} = \mathbf{W}_{\text{enc}}\mathbf{x} + \mathbf{b}_{\text{enc}} = [2 \cdot 1 + 0 \cdot 0 - 1,\ 0 \cdot 1 + 2 \cdot 0 - 1] = [1, -1]$
  2. Feature activations: $\mathbf{f} = \text{ReLU}([1, -1]) = [1, 0]$
  3. Reconstruction: $\hat{\mathbf{x}} = \mathbf{W}_{\text{dec}}\mathbf{f} + \mathbf{b}_{\text{dec}} = [0.5 \cdot 1 + 0 \cdot 0,\ 0 \cdot 1 + 0.5 \cdot 0] = [0.5, 0]$
  4. L1 penalty: $\|\mathbf{f}\|_1 = 1$
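
If it helps, here's a quick NumPy check of these numbers (a minimal sketch assuming the standard ReLU SAE forward pass $\mathbf{f} = \text{ReLU}(\mathbf{W}_{\text{enc}}\mathbf{x} + \mathbf{b}_{\text{enc}})$, $\hat{\mathbf{x}} = \mathbf{W}_{\text{dec}}\mathbf{f} + \mathbf{b}_{\text{dec}}$; variable names are just illustrative):

```python
import numpy as np

# Toy SAE parameters from above
W_enc = np.array([[2.0, 0.0], [0.0, 2.0]])
b_enc = np.array([-1.0, -1.0])
W_dec = np.array([[0.5, 0.0], [0.0, 0.5]])
b_dec = np.array([0.0, 0.0])
x = np.array([1.0, 0.0])

f = np.maximum(W_enc @ x + b_enc, 0.0)  # pre-activations [1, -1] -> ReLU -> [1, 0]
x_hat = W_dec @ f + b_dec               # reconstruction [0.5, 0]
print(f, x_hat, np.abs(f).sum())        # L1 penalty = 1.0
```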

Reparameterized SAE

Now let's scale the encoder parameters (weights and bias) down by a factor of 10 and the decoder weights up by a factor of 10:

Reparameterized parameters:

  • $\mathbf{W}'_{\text{enc}} = \frac{1}{10}\mathbf{W}_{\text{enc}} = \begin{bmatrix} 0.2 & 0 \\ 0 & 0.2 \end{bmatrix}$
  • $\mathbf{b}'_{\text{enc}} = \frac{1}{10}\mathbf{b}_{\text{enc}} = [-0.1, -0.1]$
  • $\mathbf{W}'_{\text{dec}} = 10 \cdot \mathbf{W}_{\text{dec}} = \begin{bmatrix} 5 & 0 \\ 0 & 5 \end{bmatrix}$
  • $\mathbf{b}'_{\text{dec}} = \mathbf{b}_{\text{dec}} = [0, 0]$

Forward pass:

  1. Pre-activations: $\boldsymbol{\pi}' = [0.2 \cdot 1 + 0 \cdot 0 - 0.1,\ 0 \cdot 1 + 0.2 \cdot 0 - 0.1] = [0.1, -0.1]$
  2. Feature activations: $\mathbf{f}' = \text{ReLU}([0.1, -0.1]) = [0.1, 0]$
  3. Reconstruction: $\hat{\mathbf{x}}' = [5 \cdot 0.1 + 0 \cdot 0,\ 0 \cdot 0.1 + 5 \cdot 0] = [0.5, 0]$
  4. L1 penalty: $\|\mathbf{f}'\|_1 = 0.1$

Key Observations

  1. Same reconstruction: $\hat{\mathbf{x}} = \hat{\mathbf{x}}' = [0.5, 0]$
  2. Same sparsity pattern: Both have only the first feature active
  3. Different L1 penalties: Original = 1, Reparameterized = 0.1

By scaling down the encoder parameters by any factor $\alpha$ and scaling up the decoder weights by $1/\alpha$, we can make the L1 penalty arbitrarily small while keeping everything else the same. This is why the L1 penalty needs to be paired with decoder weight normalization constraints (typically $\|\mathbf{d}_i\|_2 = 1$) to be meaningful.

The RI-L1 penalty $\sum_i f_i \|\mathbf{d}_i\|_2$ fixes this issue because when we scale the decoder weights up, the penalty increases proportionally, canceling out the decrease from the smaller feature magnitudes.
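
As a sanity check, here's the same toy example in NumPy, verifying that the reparameterization preserves the reconstruction, shrinks the L1 penalty, and leaves the RI-L1 penalty unchanged (a minimal sketch; the helper function and names are just illustrative):

```python
import numpy as np

def sae(W_enc, b_enc, W_dec, b_dec, x):
    """Standard ReLU SAE forward pass: returns (features, reconstruction)."""
    f = np.maximum(W_enc @ x + b_enc, 0.0)
    return f, W_dec @ f + b_dec

W_enc = np.array([[2.0, 0.0], [0.0, 2.0]])
b_enc = np.array([-1.0, -1.0])
W_dec = np.array([[0.5, 0.0], [0.0, 0.5]])
b_dec = np.array([0.0, 0.0])
x = np.array([1.0, 0.0])

alpha = 0.1  # scale encoder by alpha, decoder by 1/alpha
f1, x_hat1 = sae(W_enc, b_enc, W_dec, b_dec, x)
f2, x_hat2 = sae(alpha * W_enc, alpha * b_enc, W_dec / alpha, b_dec, x)

print(np.allclose(x_hat1, x_hat2))         # True: identical reconstructions
print(np.abs(f1).sum(), np.abs(f2).sum())  # 1.0 vs 0.1: L1 penalty shrank 10x

# RI-L1 penalty sum_i f_i * ||d_i||_2, where d_i is the i-th decoder column
print((f1 * np.linalg.norm(W_dec, axis=0)).sum())          # 0.5
print((f2 * np.linalg.norm(W_dec / alpha, axis=0)).sum())  # 0.5: invariant
```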


Reading this, I'm realizing the point I'm unclear on is the reparameterization part: I can certainly imagine scaling the encoder and decoder matrices by a constant as in your excerpt, but I'm not clear on exactly what the set (group) of allowable reparameterizations, aka symmetries, is.

Thinking Summary

The user is asking about what exactly constitutes the group of allowable reparameterizations/symmetries for SAEs. This is a good question: I gave one example of scaling, but what's the full set? Let me think about this carefully.

For a SAE, we have:

  • Encoder: f(x) = σ(W_enc * x + b_enc)
  • Decoder: x_hat = W_dec * f + b_dec

The key insight is that we want transformations that preserve the input-output mapping x → x_hat. The main reparameterization symmetry is:

  1. For each feature i, we can scale the i-th row of W_enc and b_enc[i] by some factor α_i
  2. And correspondingly scale the i-th column of W_dec by 1/α_i

This works because:

  • The pre-activation for feature i, (W_enc[i,:] * x + b_enc[i]), gets scaled by α_i
  • After the ReLU, if positive, the feature activation f_i gets scaled by α_i
  • In the decoder: (W_dec[:,i] / α_i) * (α_i * f_i) is unchanged

So the full group is diagonal scaling with positive scalars, one for each feature. There is also permutation symmetry: we can permute the features (reorder the rows of W_enc and b_enc and the columns of W_dec). Those are the exact symmetries that preserve the function; for other activation functions there might be other symmetries, but for ReLU these are the main ones.

Great question! Let me clarify the full group of reparameterization symmetries for SAEs.

The Reparameterization Group for ReLU SAEs

For a standard ReLU SAE, the allowable reparameterizations that preserve the input-output function form a group with two types of transformations:

1. Feature-wise Scaling (the main one)

For any positive diagonal matrix $\mathbf{D} = \text{diag}(\alpha_1, \alpha_2, \ldots, \alpha_M)$ where $\alpha_i > 0$:

  • Transform encoder: $\mathbf{W}'_{\text{enc}} = \mathbf{D}\mathbf{W}_{\text{enc}}$, $\mathbf{b}'_{\text{enc}} = \mathbf{D}\mathbf{b}_{\text{enc}}$
  • Transform decoder: $\mathbf{W}'_{\text{dec}} = \mathbf{W}_{\text{dec}}\mathbf{D}^{-1}$

This works because, for each feature $i$:

  • Pre-activation gets scaled: $\pi'_i = \alpha_i \pi_i$
  • Post-ReLU activation gets scaled: $f'_i = \alpha_i f_i$ (since $\alpha_i > 0$, the ReLU commutes with the scaling; both sides are zero when $\pi_i \le 0$)
  • Decoder compensates: the contribution to the reconstruction is $(\mathbf{W}_{\text{dec}}[:,i] / \alpha_i) \cdot (\alpha_i f_i) = \mathbf{W}_{\text{dec}}[:,i] \cdot f_i$
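
Here's a small NumPy sketch of this feature-wise scaling on a randomly initialized ReLU SAE (dimensions and variable names are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, M = 4, 8  # input dimension, dictionary size

W_enc, b_enc = rng.normal(size=(M, n)), rng.normal(size=M)
W_dec, b_dec = rng.normal(size=(n, M)), rng.normal(size=n)
x = rng.normal(size=n)

def sae(W_enc, b_enc, W_dec, b_dec, x):
    f = np.maximum(W_enc @ x + b_enc, 0.0)
    return f, W_dec @ f + b_dec

# D = diag(alpha_1, ..., alpha_M) with alpha_i > 0
alphas = rng.uniform(0.1, 10.0, size=M)

f, x_hat = sae(W_enc, b_enc, W_dec, b_dec, x)
f2, x_hat2 = sae(alphas[:, None] * W_enc,  # D W_enc (scales row i by alpha_i)
                 alphas * b_enc,           # D b_enc
                 W_dec / alphas[None, :],  # W_dec D^{-1} (scales column i by 1/alpha_i)
                 b_dec, x)

print(np.allclose(x_hat, x_hat2))         # True: same input-output map
print(np.allclose(f2, alphas * f))        # True: feature i rescaled by alpha_i
print(np.abs(f).sum(), np.abs(f2).sum())  # L1 penalty is not preserved
```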

2. Feature Permutations

For any permutation matrix $\mathbf{P}$:

  • Transform encoder: $\mathbf{W}'_{\text{enc}} = \mathbf{P}\mathbf{W}_{\text{enc}}$, $\mathbf{b}'_{\text{enc}} = \mathbf{P}\mathbf{b}_{\text{enc}}$
  • Transform decoder: $\mathbf{W}'_{\text{dec}} = \mathbf{W}_{\text{dec}}\mathbf{P}^T$

This just reorders which feature is which, without changing the overall computation.

The Complete Group Structure

The full reparameterization group is the semidirect product of these two: $G = (\mathbb{R}_{>0})^M \rtimes S_M$

where:

  • $(\mathbb{R}_{>0})^M$ is the group of positive scalings, one per feature
  • $S_M$ is the symmetric group of permutations on $M$ features
  • $\rtimes$ denotes the semidirect product (you can scale then permute, or permute then scale with appropriately permuted scalings)
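
Continuing the random-SAE sketch above (reusing `sae`, `W_enc`, `alphas`, `f`, `x_hat`, etc. from that snippet), a generic group element — a per-feature scaling followed by a permutation — can be checked the same way; again this is just an illustrative sketch:

```python
perm = rng.permutation(M)         # a random element of S_M
P = np.eye(M)[perm]               # permutation matrix with P @ v == v[perm]

W_enc3 = P @ (alphas[:, None] * W_enc)    # P D W_enc
b_enc3 = P @ (alphas * b_enc)             # P D b_enc
W_dec3 = (W_dec / alphas[None, :]) @ P.T  # W_dec D^{-1} P^T

f3, x_hat3 = sae(W_enc3, b_enc3, W_dec3, b_dec, x)
print(np.allclose(x_hat3, x_hat))           # True: reconstruction unchanged
print(np.allclose(f3, (alphas * f)[perm]))  # True: features rescaled and reordered
```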

Why This Matters

The L1 penalty $\sum_i |f_i|$ is not invariant under the scaling transformations: we can make it arbitrarily small by choosing small $\alpha_i$ values. This is why we need either:

  1. Decoder normalization constraint: Fix $\|\mathbf{d}_i\|_2 = 1$ to break the scaling symmetry
  2. Reparameterization-invariant penalty: Use $\sum_i f_i \|\mathbf{d}_i\|_2$, which is invariant under these transformations

The L0 penalty $\|\mathbf{f}\|_0$ used in JumpReLU SAEs is naturally invariant under both scaling and permutation, which is one of its advantages!
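
For completeness, here's a quick comparison of the three penalties under the scaling reparameterization, using `f`, `f2`, `W_dec`, and `alphas` from the random-SAE sketch above (the `penalties` helper is hypothetical; L0 is computed as the number of nonzero activations):

```python
def penalties(f, W_dec):
    """L1, RI-L1, and L0 penalties for a feature vector and its decoder."""
    l1 = np.abs(f).sum()
    ri_l1 = (f * np.linalg.norm(W_dec, axis=0)).sum()
    l0 = np.count_nonzero(f)
    return l1, ri_l1, l0

print(penalties(f, W_dec))                     # original SAE
print(penalties(f2, W_dec / alphas[None, :]))  # L1 changes; RI-L1 and L0 do not
```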
