Scaling and evaluating sparse autoencoders
Thread (60 posts)
In the section on progressive codes, they mention that using Multi-TopK mitigates overfitting. I don't see an explicit discussion of how they measure how progressive the code is. What am I missing?
Thinking Summary
The user is asking about the progressive codes section of the paper and how the authors measure how progressive a code is. Let me look at Section 5.3, "Progressive recovery," carefully.

Looking at the text: "In a progressive code, a partial transmission still allows reconstructing the signal with reasonable fidelity (Skodras et al., 2001). For autoencoders, learning a progressive code means that ordering latents by activation magnitude gives a way to progressively recover the original vector. To study this property, we replace the autoencoder activation function (after training) by a TopK(k′) activation function where k′ is different than during training. We then evaluate each value of k′ by placing it in the L0-MSE plane (Figure 10)."

So they measure progressiveness by:
1. Training with a specific k value
2. At test time, changing the activation function to use different k′ values
3. Plotting how the MSE changes as k′ varies (shown in Figure 10)

A good progressive code would show smooth improvement in MSE as k′ increases. They find that:
- Regular TopK "only gives a progressive code up to the value of k used during training" - meaning MSE improves up to k, then gets worse
- Multi-TopK (training with multiple k values like L(k) + L(4k)/8) gives "a progressive code over all k′"

So the measurement is essentially: does the reconstruction quality improve smoothly as you allow more latents to activate at test time? They visualize this in the L0-MSE plane shown in Figure 10.
In Section 5.3, they measure how progressive a code is by testing how well the autoencoder reconstructs the input when the number of active latents is changed at test time.
Specifically, they do the following (sketched in code after this list):
- Train with a fixed k value (e.g., k=32)
- At test time, replace the activation function with TopK(k′) where k′ is different from training
- Plot the results in the L0-MSE plane (Figure 10) to see how reconstruction quality changes as k′ varies
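Here is a minimal sketch of that evaluation sweep, assuming a trained TopK SAE split into hypothetical `encoder`/`decoder` callables; the function and variable names are illustrative, not the paper's code.

```python
import torch

def topk_activation(pre_acts: torch.Tensor, k: int) -> torch.Tensor:
    """Keep the k largest pre-activations per example and zero out the rest."""
    vals, idx = pre_acts.topk(k, dim=-1)
    latents = torch.zeros_like(pre_acts)
    latents.scatter_(-1, idx, vals)
    return latents

@torch.no_grad()
def sweep_test_time_k(encoder, decoder, x: torch.Tensor, k_values):
    """Swap in TopK(k') after training and record (k', L0, MSE) points
    to place on the L0-MSE plane (as in the paper's Figure 10)."""
    results = []
    pre_acts = encoder(x)  # [batch, n_latents], computed once
    for k_prime in k_values:
        latents = topk_activation(pre_acts, k_prime)
        recon = decoder(latents)
        mse = (recon - x).pow(2).mean().item()
        l0 = (latents != 0).float().sum(dim=-1).mean().item()  # equals k' for TopK
        results.append((k_prime, l0, mse))
    return results
```

Running this sweep on a TopK-trained model and a Multi-TopK-trained model, then plotting the resulting (L0, MSE) points, gives the kind of comparison shown in Figure 10.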
A truly progressive code would show monotonically decreasing MSE as k′ increases - meaning you can smoothly trade off between sparsity and reconstruction quality just by changing how many top activations you keep.
Their key findings:
- Regular TopK overfits: It only gives good progressive behavior up to the training value of k. Beyond that, MSE actually gets worse as you add more latents
- Multi-TopK fixes this: By training on a sum of TopK losses at multiple k values (e.g., L(k) + L(4k)/8), they get smooth progressive behavior across all k′ values (see the loss sketch after this list)
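For concreteness, here is a sketch of that Multi-TopK objective, reusing `topk_activation` from the sweep above and omitting any other training terms; the L(k) + L(4k)/8 weighting is the paper's example, the rest is illustrative.

```python
def multi_topk_loss(encoder, decoder, x: torch.Tensor, k: int) -> torch.Tensor:
    """Sum of TopK reconstruction losses at several sparsity levels,
    e.g. L(k) + L(4k)/8, so the learned code stays usable at more than one k."""
    pre_acts = encoder(x)
    loss = x.new_zeros(())
    for k_i, weight in ((k, 1.0), (4 * k, 1.0 / 8)):
        recon = decoder(topk_activation(pre_acts, k_i))
        loss = loss + weight * (recon - x).pow(2).mean()
    return loss
```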
So the "progressiveness" is measured by whether the L0-MSE curve shows smooth improvement rather than degradation when you vary the number of active latents at test time. Figure 10 visualizes this directly - the Multi-TopK curves are smooth across the entire range while regular TopK curves bend upward (worse MSE) after the training k value.