
Finally, there is a recurring “repetition” feature that is ∼20% dense. Its top activations are mostly highly repetitive sequences, such as series of dates, chapter indices, numbers, punctuation, repeated exact phrases, or other repetitive content such as Chess PGN notation. However, like the first-token-position latents, random activations of this latent typically appear unrelated and uninterpretable.

What is the significance of the last sentence?

Thinking Summary

The user is asking about the significance of the last sentence in the quote about the "repetition" feature from the sparse autoencoder paper. The last sentence states: "However, like the first-token-position latents, random activations of this latent typically appear unrelated and uninterpretable."

This sentence is significant because it highlights a key limitation or failure mode of the sparse autoencoder. While the feature appears to capture repetition when looking at its top activations (the strongest activations), random activations (presumably lower-strength ones) don't seem to be related to repetition at all. This suggests that:

1. The feature isn't purely monosemantic (capturing just one concept).
2. It may be suffering from polysemanticity, activating for multiple unrelated concepts.
3. The feature might be capturing repetition only at high activation values, but at lower values it's activating for other, unrelated things.
4. The autoencoder may be struggling to cleanly separate features, especially dense ones (this one is ~20% dense, which is quite high).

This is a known challenge in interpretability: features that seem interpretable when looking at their strongest activations can be quite messy across their full activation distribution.

The last sentence reveals a critical limitation of this "repetition" feature and highlights a broader challenge in sparse autoencoder interpretability.

While the feature's top activations clearly correspond to repetitive sequences (dates, indices, chess notation, etc.), its random activations (presumably lower-strength activations sampled from across its distribution) appear unrelated and uninterpretable. This indicates the feature is not cleanly monosemantic. The sketch below makes the two sampling strategies concrete.
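As an illustration, here is a minimal sketch of the two sampling strategies, using entirely hypothetical activation values and token strings (nothing here comes from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: one latent's activation on each of 100k tokens (zero on
# ~80% of them, mimicking a ~20%-dense latent) plus placeholder token strings.
acts = rng.exponential(scale=1.0, size=100_000) * (rng.random(100_000) < 0.2)
tokens = np.array([f"tok_{i}" for i in range(100_000)])

nonzero = np.flatnonzero(acts)

# Top activations: the examples a feature dashboard typically shows first.
top_idx = nonzero[np.argsort(acts[nonzero])[-20:]]

# Random activations: drawn uniformly from everywhere the latent fires.
rand_idx = rng.choice(nonzero, size=20, replace=False)

print("top-activation tokens:   ", tokens[top_idx][:5])
print("random-activation tokens:", tokens[rand_idx][:5])
```

For a cleanly monosemantic latent, both samples should read as variations on a single theme; the observation in the quote is that for this dense latent only the top sample does.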

This is significant for several reasons:

  1. Illusory interpretability: Looking only at top activations can create a false sense that we understand what a feature represents. The feature seems to capture "repetition" when we cherry-pick its strongest signals, but this breaks down when we examine its broader behavior.

  2. Polysemanticity: The feature appears to be activating for multiple, unrelated concepts. At high activation values it captures repetition, but at lower values it's doing something else entirely that we can't interpret.

  3. Density correlation: This feature is notably dense (~20% of tokens), and the paper notes this pattern also occurs with other dense features like the first-token-position feature. This suggests that as features become more dense, they become less interpretable and more polysemantic.

  4. Optimization challenges: The authors conjecture this might be symptomatic of optimization failure, potentially related to the TopK constraint forcing exactly k latents to fire on every token, leading this latent to fire on unrelated inputs simply to fill one of those slots (see the sketch after this list).
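To make items 3 and 4 concrete, here is a minimal sketch of a TopK activation function and the density measurement behind the "~20% dense" figure. The tensor shapes and k value are hypothetical, and some implementations additionally apply a ReLU to the surviving values:

```python
import torch

def topk_activation(pre_acts: torch.Tensor, k: int) -> torch.Tensor:
    """Zero all but the k largest pre-activations in each row, so exactly
    k latents fire for every token."""
    values, indices = pre_acts.topk(k, dim=-1)
    acts = torch.zeros_like(pre_acts)
    acts.scatter_(-1, indices, values)
    return acts

# Hypothetical batch: 10,000 tokens, 4,096 latents, k = 32.
pre = torch.randn(10_000, 4096)
acts = topk_activation(pre, k=32)

# Density of a latent = fraction of tokens on which it is nonzero.
density = (acts != 0).float().mean(dim=0)
print("max latent density:", density.max().item())
```

Measured this way, a 20%-dense latent fires on a fifth of all tokens; and since k slots must be filled on every token, a latent can sustain that density even when its lower-strength firings share no interpretable theme.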

This finding underscores why the authors consider solutions with many dense features (especially when k approaches d_model) to be problematic - they tend to produce features that appear interpretable at first glance but break down under closer scrutiny.
