
Thinking Summary

The user is asking me to expand on the bullet points in §A.1 of "Scaling and evaluating sparse autoencoders". Let me find that section... Looking at §A.1 Initialization, there are four bullet points:

1. We initialize the bias b_pre to be the geometric median of a sample set of data points, following Bricken et al. [2023].
2. We initialize the encoder directions parallel to the respective decoder directions, so that the corresponding latent read/write directions are the same. This is done only at initialization; we do not tie the parameters as in Cunningham et al. [2023]. This strategy is also presented in concurrent work [Conerly et al., 2024]. Directions are chosen uniformly randomly.
3. We scale decoder latent directions to be unit norm at initialization (and also after each training step), following Bricken et al. [2023].
4. For baseline models we use torch default initialization for encoder magnitudes. For TopK models, we initialized the magnitude of the encoder such that the magnitude of reconstructed vectors matches that of the inputs. However, in our ablations we find this has no effect or a weak negative effect (Figure 16). Note that the scaling factor has nontrivial interaction with n, and scales between Θ(1/√k) and Θ(1/k). This scheme has the advantage that it is optimal at init in the infinite-width limit. We did not try simpler schemes like scaling by Θ(1/√k).

I'll need to reference the "Towards Monosemanticity" paper (Bricken et al. 2023) since it's mentioned and has been provided in the context.

Looking at the initialization details in §A.1 of "Scaling and evaluating sparse autoencoders," I'll expand each bullet point with context from the referenced papers:

Pre-bias Initialization (b_pre)

The pre-bias b_pre is initialized to the geometric median of a sample set of data points, following Bricken et al. [2023]. The geometric median is the point that minimizes the sum of Euclidean distances to all data points in the sample - a robust measure of central tendency that is much less sensitive to outliers than the arithmetic mean. In the context of sparse autoencoders, this bias term centers the data before encoding: by subtracting the geometric median from the input activations, the autoencoder learns deviations from a typical activation pattern rather than having to encode the baseline activation level. This is particularly important when working with neural network activations, which often have non-zero baseline values across many dimensions.

The "Towards Monosemanticity" paper notes that they "tie the biases applied in the input and output, so the result is equivalent to subtracting a fixed bias from all activations." This centering helps ensure that the sparse features learned by the autoencoder represent meaningful deviations from typical behavior rather than just capturing baseline activation levels.

Parallel Encoder-Decoder Initialization

The encoder directions are initialized parallel to their corresponding decoder directions, meaning W_enc[i] is initialized proportional to W_dec[:,i]^T. This creates an initial configuration where each latent dimension has aligned "read" (encoder) and "write" (decoder) directions. Unlike some approaches that maintain this tie throughout training (parameter tying), this method only uses parallel initialization and then allows the encoder and decoder to diverge during training.

This initialization strategy has several advantages. First, it ensures that at initialization, activating a latent dimension through the encoder will produce a reconstruction in the same direction through the decoder, creating a sensible starting point. Second, it helps avoid initial reconstruction errors that might arise from randomly misaligned encoder-decoder pairs. The directions themselves are chosen uniformly at random on the unit sphere, ensuring good coverage of the activation space while maintaining the encoder-decoder alignment. This approach balances the benefits of structured initialization with the flexibility of allowing the encoder and decoder to specialize differently during training.
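A minimal sketch of what this initialization looks like in PyTorch (illustrative shapes and variable names, not the authors' implementation):

```python
import torch

d_model, n_latents = 768, 32768  # illustrative sizes

# Decoder columns: uniformly random directions, rescaled to unit norm.
W_dec = torch.randn(d_model, n_latents)
W_dec /= W_dec.norm(dim=0, keepdim=True)

# Encoder rows start parallel to the corresponding decoder columns,
# but remain a separate parameter afterwards (no weight tying during training).
W_enc = W_dec.t().clone()
```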

Decoder Normalization

The decoder weight vectors (columns of W_dec) are normalized to unit norm both at initialization and after every training step. This normalization serves multiple critical purposes in the sparse autoencoder architecture. From the "Towards Monosemanticity" paper, we learn that this normalization is particularly necessary for L1-regularized autoencoders because "otherwise the L1 loss can be gamed by making the latents arbitrarily small" while scaling up the decoder weights correspondingly.

Even for TopK autoencoders where L1 regularization isn't used, the authors found that decoder normalization still improves reconstruction MSE. This consistent scale for decoder directions makes the magnitude of latent activations directly interpretable - a latent with activation 2.0 contributes twice as much to the reconstruction as one with activation 1.0, regardless of which feature we're discussing. The normalization also helps with optimization dynamics, as noted in the "Towards Monosemanticity" paper's mention of gradient projection to "account for interaction between Adam and decoder normalization."
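As a rough sketch of the two operations that typically accompany this constraint (illustrative only, assuming `W_dec` has shape `(d_model, n_latents)` with one decoder column per latent):

```python
import torch

@torch.no_grad()
def remove_parallel_grad(W_dec: torch.Tensor) -> None:
    """Drop the gradient component parallel to each decoder column, so the
    optimizer update does not fight the renormalization that follows."""
    cols = W_dec / W_dec.norm(dim=0, keepdim=True)
    parallel = (W_dec.grad * cols).sum(dim=0, keepdim=True) * cols
    W_dec.grad -= parallel

@torch.no_grad()
def renormalize_decoder(W_dec: torch.Tensor) -> None:
    """Rescale every decoder column (latent direction) back to unit norm."""
    W_dec /= W_dec.norm(dim=0, keepdim=True)

# Sketch of a training step:
#   loss.backward()
#   remove_parallel_grad(W_dec)
#   optimizer.step()
#   renormalize_decoder(W_dec)
```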

Encoder Magnitude Initialization

For baseline (ReLU) models, the encoder weights use PyTorch's default initialization (for nn.Linear this is Kaiming-uniform, i.e. uniform values bounded by 1/√fan_in). For TopK models, however, the authors tried a more deliberate initialization in which encoder magnitudes are scaled so that reconstructed vectors match the magnitude of the input vectors. The required scaling factor depends on the number of active latents k and has a nontrivial interaction with the total number of latents n, falling somewhere between Θ(1/√k) and Θ(1/k).

The theoretical motivation is that in the infinite-width limit, this initialization scheme would be optimal - the expected reconstruction would perfectly match the input magnitude. However, empirically the authors found this sophisticated initialization provided no benefit or even slightly hurt performance compared to simpler approaches. This highlights an important lesson in deep learning: theoretically motivated initialization schemes don't always translate to practical improvements, possibly because the optimization dynamics quickly move away from the initial point or because finite-width effects dominate. The authors note they didn't explore simpler schemes like uniformly scaling by Θ(1/√k), suggesting this remains an open area for investigation.
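One hedged way to realize "match the magnitude of reconstructed vectors to the inputs" at initialization - not necessarily the authors' exact scheme, and with `sae`, `W_enc`, `b_pre`, `encode`, and `decode` as hypothetical names - is to solve for the scale empirically on a sample batch: because TopK commutes with positive rescaling of its pre-activations and the decoder is linear in the latents, multiplying W_enc by a constant c scales the centered reconstruction by c (when the encoder bias is zero).

```python
import torch

@torch.no_grad()
def match_reconstruction_scale(sae, x: torch.Tensor) -> None:
    """Rescale W_enc so that, at init (zero encoder bias), the norm of the
    centered reconstruction matches the norm of the centered input on average."""
    x_centered = x - sae.b_pre
    recon_centered = sae.decode(sae.encode(x)) - sae.b_pre
    c = x_centered.norm(dim=1).mean() / recon_centered.norm(dim=1).mean().clamp_min(1e-8)
    sae.W_enc.data *= c
```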
