Numinex

Thread (53 posts)

This is a loss function of the standard form Eq. 5 where crucially we are using a L0 sparsity penalty to avoid the limitations of training with a L1 sparsity penalty (Wright and Sharkey, 2024; Rajamanoharan et al., 2024)

What are the limitations of training with an L1 sparsity penalty in the linked paper?

Thinking Summary

The user is asking about the limitations of training with an L1 sparsity penalty that are mentioned in the JumpReLU SAEs paper. The paper references Wright and Sharkey (2024) and Rajamanoharan et al. (2024) for this point. Looking at the paper text, I can find several mentions of L1 penalty limitations:

1. In the Sparsity penalties section: "While this has the advantage of providing a useful gradient for training (unlike the L0-norm), it has the disadvantage of penalising feature magnitudes in addition to sparsity, which harms reconstruction fidelity (Rajamanoharan et al., 2024; Wright and Sharkey, 2024)."

2. Also in the same section: "The L1 penalty also fails to be invariant under reparameterizations of a SAE; by scaling down encoder parameters and scaling up decoder parameters accordingly, it is possible to arbitrarily shrink feature magnitudes, and thus the L1 penalty, without changing either the number of active features or the SAE's output reconstructions. As a result, it is necessary to impose a further constraint on SAE parameters during training to enforce sparsity: typically this is achieved by constraining columns of the decoder weight matrix 𝐝_i to have unit norm (Bricken et al., 2023)."

3. In Section 3: "This is a loss function of the standard form Eq. 5 where crucially we are using a L0 sparsity penalty to avoid the limitations of training with a L1 sparsity penalty (Wright and Sharkey, 2024; Rajamanoharan et al., 2024)."

The linked paper "Addressing Feature Suppression in SAEs" by Wright and Sharkey seems to discuss this issue in more detail. So the main limitations are:

1. L1 penalizes feature magnitudes in addition to sparsity, which harms reconstruction fidelity.
2. The L1 penalty is not invariant under reparameterizations of the SAE: you can scale encoder weights down and decoder weights up to arbitrarily reduce the L1 penalty without changing the actual sparsity or reconstructions.
3. This requires additional constraints during training (like constraining decoder columns to unit norm).

I notice the SAE training approach reveals critical weaknesses in using L1 penalties. The technique allows parameter manipulation that artificially reduces the penalty without genuinely improving feature sparsity. Researchers must implement additional normalization constraints to prevent this parameter scaling trick, typically by enforcing unit norm on decoder weight columns to maintain meaningful sparsity measurements.

The paper identifies two key limitations of L1 sparsity penalties for training SAEs:

1. Feature Magnitude Suppression

The L1 penalty is the sum of the absolute values of the feature activations, so minimizing it pushes active features toward smaller magnitudes even when larger values would reconstruct the input more accurately. This systematic shrinkage (feature suppression) harms reconstruction fidelity.
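As a minimal illustration (a one-feature toy problem of my own, not taken from the paper): if a single feature's ideal coefficient is $a > 0$ and the penalty weight is $\lambda$, the penalized reconstruction objectives are minimized at

$$f^*_{L_1} = \arg\min_{f \ge 0}\, (a - f)^2 + \lambda f = \max\!\left(a - \tfrac{\lambda}{2},\, 0\right),$$

$$f^*_{L_0} = \arg\min_{f \ge 0}\, (a - f)^2 + \lambda\,\mathbb{1}[f \ne 0] = \begin{cases} a & \text{if } a^2 > \lambda, \\ 0 & \text{otherwise.}\end{cases}$$

So an L1 penalty shrinks every surviving feature by $\lambda/2$, while an L0 penalty leaves the magnitudes of surviving features untouched.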

2. Lack of Reparameterization Invariance

The L1 penalty is not invariant under reparameterizations of the SAE. Specifically:

  • You can scale down encoder parameters and scale up decoder parameters proportionally
  • This arbitrarily shrinks feature magnitudes (and thus the L1 penalty)
  • But it doesn't change the number of active features or the SAE's output reconstructions
  • This creates a "cheat" where the model can reduce the loss without actually becoming more sparse

Due to this issue, L1-trained SAEs require additional constraints during training, typically constraining decoder weight matrix columns (𝐝ᵢ) to have unit norm.
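To make the rescaling concrete, here is a minimal numpy sketch. It assumes a standard ReLU SAE of the form f = ReLU(W_enc x + b_enc), x̂ = W_dec f + b_dec; the variable names and dimensions are illustrative, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 16, 64                                # input dim, dictionary size

# Toy ReLU SAE: f = ReLU(W_enc @ x + b_enc), x_hat = W_dec @ f + b_dec
W_enc, b_enc = rng.normal(size=(m, n)), rng.normal(size=m)
W_dec, b_dec = rng.normal(size=(n, m)), rng.normal(size=n)
x = rng.normal(size=n)

def forward(W_enc, b_enc, W_dec, b_dec, x):
    f = np.maximum(W_enc @ x + b_enc, 0.0)   # ReLU encoder
    return f, W_dec @ f + b_dec              # features, reconstruction

f, x_hat = forward(W_enc, b_enc, W_dec, b_dec, x)

# Rescale: shrink the encoder by c, grow the decoder columns by c (c > 0).
# Because ReLU is positively homogeneous, the features simply shrink by 1/c.
c = 100.0
f2, x_hat2 = forward(W_enc / c, b_enc / c, W_dec * c, b_dec, x)

print(np.allclose(x_hat, x_hat2))            # True: reconstruction unchanged
print(np.sum(f != 0) == np.sum(f2 != 0))     # True: L0 sparsity unchanged
print(np.abs(f2).sum() / np.abs(f).sum())    # ~0.01: L1 penalty shrunk by 1/c
```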

The paper introduces the RI-L1 (Reparameterization-Invariant L1) penalty as an alternative:

$$S_{\text{RI-L1}}(\mathbf{f}) = \sum_{i=1}^{M} f_i \,\|\mathbf{d}_i\|_2$$

This weights each feature coefficient by the norm of its corresponding dictionary direction, making the penalty invariant under the rescaling described above.
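Continuing the numpy sketch above, the RI-L1 penalty can be computed and checked for invariance under the same rescaling (again an illustrative sketch, not the paper's code):

```python
def ri_l1(f, W_dec):
    # S_RI-L1(f) = sum_i f_i * ||d_i||_2, where d_i is the i-th decoder column
    # (f_i >= 0 here because it comes out of a ReLU encoder)
    return np.sum(f * np.linalg.norm(W_dec, axis=0))

# f_i scales by 1/c while ||d_i|| scales by c, so each term is unchanged.
print(np.isclose(ri_l1(f, W_dec), ri_l1(f2, W_dec * c)))  # True
```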

By using L0 instead, JumpReLU SAEs avoid both issues: L0 only counts the number of non-zero features, not their magnitudes, and is inherently invariant to reparameterization.

Thinking Summary

Oh, I see! The user was asking about what the linked paper "Addressing Feature Suppression in SAEs" by Wright and Sharkey (2024) says about the limitations of L1 sparsity penalties, not what the JumpReLU paper says about it. I should look at the content snapshot from the embedded link to see what information is available about that paper. Looking at the embedded content, I only see:

- Title: "Addressing Feature Suppression in SAEs"
- URL: https://www.alignmentforum.org/posts/3JuSjTZyMzaSeTxKk/addressing-feature-suppression-in-saes
- Description: "Produced as part of the ML Alignment Theory Scholars Program - Winter 2023-24 Cohort as part of Lee Sharkey's stream. …"
- Content Snapshot: "AI ALIGNMENT FORUMAF"

Unfortunately, the content snapshot doesn't actually contain any substantive content from the Wright and Sharkey paper; it just shows "AI ALIGNMENT FORUMAF", which appears to be a header or navigation element from the website. So I cannot actually tell what Wright and Sharkey (2024) specifically say about the limitations of L1 sparsity penalties based on the provided embed. I should clarify this to the user.

You're right - I apologize for the confusion! You were asking about what Wright and Sharkey (2024) "Addressing Feature Suppression in SAEs" says about L1 limitations.

Unfortunately, the embedded content snapshot for that paper only shows "AI ALIGNMENT FORUMAF" and doesn't include the actual content of the paper. Based on just the title "Addressing Feature Suppression in SAEs", I can infer that Wright and Sharkey likely discuss how L1 penalties cause feature suppression (reducing the magnitudes of features that should be active), but I cannot see their specific arguments or findings from the provided embed.

To properly answer your question about what limitations of L1 sparsity penalties Wright and Sharkey identify, I would need access to the actual content of their paper rather than just the title and URL.
