Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders
Thread (53 posts)
Am I correctly understanding the caption / point of Figure 1 to be that because JumpReLU has a vector of offsets which is learned during training, it can precisely identify the "intended" activations without altering them?
Thinking Summary
The user is asking about Figure 1 in the paper. Let me look at what Figure 1 shows and its caption. From the paper:

"Figure 1: A toy model illustrating why JumpReLU (or similar activation functions, such as TopK) are an improvement over ReLU for training sparse yet faithful SAEs. Consider a direction in which the encoder pre-activation is high when the corresponding feature is active and low, but not always negative, when the feature is inactive (far-left). Applying a ReLU activation function fails to remove all false positives (centre-left), harming sparsity. It is possible to get rid of false positives while maintaining the ReLU, e.g. by decreasing the encoder bias (centre-right), but this leads to feature magnitudes being systematically underestimated, harming fidelity. The JumpReLU activation function (far-right) provides an independent threshold below which pre-activations are screened out, minimising false positives, while leaving pre-activations above the threshold unaffected, improving fidelity."

So the key points are:
1. The encoder pre-activation can be high when the feature is truly active, but also sometimes positive (though lower) when the feature is inactive
2. ReLU passes through all positive values, so it includes false positives (when the feature is inactive but pre-activation is positive)
3. You could lower the bias to make more pre-activations negative, but then you systematically underestimate the magnitudes of true activations
4. JumpReLU has a learned threshold θ that can be set to separate true activations from false positives, while preserving the magnitudes of true activations above the threshold

The user is asking if they correctly understand that JumpReLU can "precisely identify the 'intended' activations without altering them" - this is essentially correct. The threshold allows it to filter out false positives while preserving the magnitudes of true activations.
Yes, you've got it exactly right! Figure 1 illustrates the key advantage of JumpReLU over standard ReLU for sparse autoencoders.
The core insight is that with a learned threshold vector $\boldsymbol{\theta}$, JumpReLU can:
- Filter out false positives: Set the threshold above the "noise floor" of pre-activations that occur when the feature is actually inactive (but happen to be slightly positive)
- Preserve true activation magnitudes: Unlike solutions that shift the bias to make more pre-activations negative, JumpReLU doesn't systematically shrink the magnitudes of true activations - it just passes them through unchanged if they're above the threshold
The figure shows this nicely:
- ReLU (center-left): Lets through all positive values, including false positives
- ReLU with shifted bias (center-right): Eliminates false positives but systematically underestimates all magnitudes
- JumpReLU (far-right): Clean separation - zeros out everything below $\theta_i$, preserves everything above it
This is why the paper emphasizes that JumpReLU (like TopK and Gated SAEs) can "separate out deciding which features are active from estimating active features' magnitudes." The threshold handles the binary active/inactive decision, while the magnitude above threshold captures the strength of activation.
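To make this concrete, here's a minimal numpy sketch of ReLU versus JumpReLU on some made-up pre-activations (the numbers and the threshold value are purely illustrative, not from the paper):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def jumprelu(z, theta):
    # JumpReLU_theta(z) = z * H(z - theta): pre-activations above the
    # threshold pass through unchanged; everything else is zeroed out.
    return z * (z > theta)

# Toy pre-activations for a single latent: small positive "noise" when the
# feature is absent, larger values when it is genuinely present.
inactive = np.array([-0.3, 0.1, 0.2])
active = np.array([1.4, 2.0, 2.7])
theta = 0.5  # illustrative threshold value

print(relu(inactive))             # [0.  0.1 0.2] -> false positives survive
print(jumprelu(inactive, theta))  # [0. 0. 0.]    -> screened out
print(jumprelu(active, theta))    # [1.4 2.  2.7] -> magnitudes preserved
```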
OK and the point of this part is that there's actually a way to learn that parameter theta?
Thinking Summary
The user is asking about whether this part (referring to the quote about expected loss) is showing that there's actually a way to learn the parameter theta. They seem to understand that the individual loss is piecewise constant (no gradient) but are asking about how the expected loss insight enables learning theta.
Yes, exactly! This is the paper's key technical contribution - showing how to actually learn $\boldsymbol{\theta}$ despite the fact that the loss function has zero gradient with respect to it.
The clever trick is:
- Individual samples give no gradient: For any single input activation $\mathbf{x}$, the loss is piecewise constant in $\theta_i$ (flat everywhere except at discontinuities)
- But the expected loss is smooth: When you average over the distribution of all activations, you get a smooth function of $\theta_i$ with non-zero gradients (see the numerical sketch below)
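Here is a small numerical illustration of that distinction, using a toy Gaussian distribution of pre-activations for a single latent (my own toy setup, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy pre-activations for one latent across many inputs.
pre_acts = rng.normal(loc=1.0, scale=1.0, size=100_000)

def l0_single(z, theta):
    # Per-sample sparsity term: 1 if the latent fires, 0 otherwise.
    # Piecewise constant in theta, so its gradient is zero almost everywhere.
    return float(z > theta)

def l0_expected(theta):
    # Empirical estimate of the expected L0: a smooth, decreasing function
    # of theta with a well-defined, non-zero derivative.
    return np.mean(pre_acts > theta)

print(l0_single(pre_acts[0], 0.50), l0_single(pre_acts[0], 0.51))  # same value: no per-sample gradient
print(l0_expected(0.50), l0_expected(0.51))                        # shifts smoothly as theta moves
```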
The authors then use straight-through estimators (STEs) to approximate this expected gradient during training. Specifically:
- They define pseudo-derivatives (Equations 11-12) that approximate the true gradient of the expected loss
- These use kernel functions to "smooth out" the discontinuity in a small window of width $\varepsilon$ around the threshold
- During backpropagation, whenever they would normally get zero gradient, they substitute these pseudo-derivatives instead
The mathematical justification (Section 4) shows that this STE approach is actually computing a kernel density estimate of the true expected loss gradient - precisely the quantity that is smooth and has non-zero derivatives with respect to $\boldsymbol{\theta}$.
So yes - despite the discontinuous activation function, they can train effectively using standard gradient descent, because the STEs give them a principled way to estimate gradients that would otherwise be zero.
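As a rough sketch of what those pseudo-derivatives look like in code (my own numpy rendering using the rectangle kernel as $K$; variable names are mine, not the paper's):

```python
import numpy as np

def rectangle_kernel(u):
    # K(u) = 1 on (-1/2, 1/2) and 0 elsewhere; other bump-like kernels work too.
    return ((u > -0.5) & (u < 0.5)).astype(float)

def jumprelu(z, theta):
    return z * (z > theta)

def d_jumprelu_d_theta(z, theta, eps):
    # Pseudo-derivative of JumpReLU w.r.t. the threshold (cf. the paper's
    # Equation 11): non-zero only in a window of width eps around theta,
    # where moving the threshold actually changes what gets screened out.
    return -(theta / eps) * rectangle_kernel((z - theta) / eps)

def d_heaviside_d_theta(z, theta, eps):
    # Pseudo-derivative of H(z - theta) w.r.t. theta (cf. Equation 12),
    # used for the L0 sparsity term in the loss.
    return -(1.0 / eps) * rectangle_kernel((z - theta) / eps)

z = np.array([0.10, 0.48, 0.53, 2.00])
theta, eps = 0.5, 0.1
print(jumprelu(z, theta))                 # forward pass
print(d_jumprelu_d_theta(z, theta, eps))  # non-zero only near the threshold
print(d_heaviside_d_theta(z, theta, eps))
```

During the backward pass these values are substituted wherever the exact derivative with respect to $\boldsymbol{\theta}$ would be zero, while gradients with respect to the pre-activations flow through as usual.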
What is theta called in the literature? Does it have any interesting interpretations other than the obvious one?
Thinking Summary
The user is asking what $\theta$ (the threshold parameter in JumpReLU) is called in the literature and if it has any interesting interpretations beyond the obvious one of being a threshold. Let me think about this:

1. In the JumpReLU paper, they simply call it the "threshold" or "threshold parameter"
2. In the context of sparse coding/dictionary learning, similar parameters might be called:
   - Activation threshold
   - Firing threshold
   - Detection threshold
   - Gating threshold (in Gated SAEs)
3. Interesting interpretations:
   - Statistical interpretation: $\theta_i$ could be seen as defining a hypothesis test threshold - we're testing whether feature $i$ is "truly present" vs just noise
   - Information theoretic: It's setting a signal-to-noise threshold
   - Neuroscience analogy: Similar to neuronal firing thresholds
   - Bayesian interpretation: Could be viewed as related to a decision boundary in a classification problem
   - It implicitly defines a "baseline activation level" below which we consider the feature to be absent

Let me check if the paper mentions any of these interpretations...
In this paper and the related SAE literature, $\boldsymbol{\theta}$ is typically just called the threshold or threshold parameter. In the Gated SAE paper, the equivalent parameter (when using weight sharing) emerges from the "gating" mechanism, hence "gating threshold."
Some interesting interpretations beyond the obvious:
- Statistical significance threshold: $\theta_i$ implicitly defines a hypothesis test - "is this feature truly active or just noise?" The threshold separates signal from background fluctuations in the pre-activations.
- Implicit prior on sparsity: The threshold encodes an assumption about how strong a feature's pre-activation needs to be before we believe it's "really there." Higher thresholds encode stronger priors that features should be rare.
- Neuroscience analogy: Similar to neuronal firing thresholds - a neuron needs sufficient input before it "decides" to fire. The JumpReLU makes this binary decision explicit.
- Information-theoretic view: The threshold defines a channel capacity constraint - features must exceed a certain "signal strength" to transmit information through the autoencoder.
- Learned noise floor: As Figure 1 illustrates, $\theta_i$ can be interpreted as learning the typical "false positive" activation level for each feature when it's not truly present.
The paper hints at another interpretation in Equation 13 - the gradient involves terms that measure:
- How much reconstruction error increases when we raise the threshold (screening out marginal activations)
- How much sparsity improves when we raise the threshold
So $\theta_i$ represents the learned equilibrium point where the marginal cost to reconstruction equals $\lambda$ times the marginal benefit to sparsity.
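Spelling that out (this is just the zero-gradient condition restated in symbols, writing the training objective as $\mathcal{L} = \mathcal{L}_{\text{reconstruct}} + \lambda\,\mathcal{L}_{\text{sparsity}}$, not an equation copied from the paper): at a stationary point of the expected loss in $\theta_i$,

$$\frac{\partial\,\mathbb{E}_{\mathbf{x}}\!\left[\mathcal{L}_{\text{reconstruct}}\right]}{\partial\theta_i} \;=\; -\,\lambda\,\frac{\partial\,\mathbb{E}_{\mathbf{x}}\!\left[\mathcal{L}_{\text{sparsity}}\right]}{\partial\theta_i},$$

i.e. the marginal increase in expected reconstruction error from nudging $\theta_i$ upward exactly offsets $\lambda$ times the marginal decrease in the expected L0 penalty.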