
Thinking Summary

The user is asking what exactly a "latent" is in the context of sparse autoencoders and the "dead latent" problem. This is a good terminology question. In the context of sparse autoencoders (SAEs), a "latent" refers to one of the hidden units/dimensions in the bottleneck layer of the autoencoder. More specifically:

  • An autoencoder has an encoder that maps from input space (e.g., model activations) to a hidden representation
  • This hidden representation has multiple dimensions/units; each one is a "latent"
  • In sparse autoencoders, these latents are meant to represent interpretable features
  • The decoder then maps from these latents back to reconstruct the original input

A "dead latent" is a latent that stops activating entirely during training, meaning it always outputs zero and thus doesn't contribute to the reconstruction. This is computationally wasteful and means we're not using the full capacity of the autoencoder. The paper mentions techniques to prevent dead latents, including initializing the encoder to the transpose of the decoder and using an auxiliary loss.

In the context of sparse autoencoders (SAEs), a "latent" refers to a single hidden unit or dimension in the autoencoder's bottleneck layer.

More precisely:

  • The encoder maps from the input space (e.g., GPT-4 residual stream activations of dimension $d_{model}$) to a hidden representation of dimension $n$
  • Each of the $n$ dimensions in this hidden representation is a "latent"
  • Each latent is associated with one row of the encoder weight matrix $W_{enc}$ and one column of the decoder weight matrix $W_{dec}$

So when they train a "16 million latent autoencoder," they mean $n = 16$ million hidden units.
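
To make the shapes concrete, here is a minimal PyTorch sketch of such an autoencoder. It assumes a TopK activation (keeping only the k largest pre-activations per token) and includes the transpose initialization mentioned later in the thread; all names are illustrative, not taken from any released code.

```python
import torch
import torch.nn as nn

class TopKSAE(nn.Module):
    """Sketch of a sparse autoencoder with n latent units over a d_model-dim input."""
    def __init__(self, d_model: int, n: int, k: int):
        super().__init__()
        self.k = k
        self.b_pre = nn.Parameter(torch.zeros(d_model))  # pre-encoder bias
        self.b_enc = nn.Parameter(torch.zeros(n))
        # Column i of W_dec is latent unit i's decoder direction.
        self.W_dec = nn.Parameter(torch.randn(d_model, n) / d_model**0.5)
        # Transpose initialization: start the encoder as the decoder's
        # transpose, one of the dead-latent mitigations discussed here.
        self.W_enc = nn.Parameter(self.W_dec.detach().T.clone())

    def encode(self, x):                                     # x: (batch, d_model)
        pre = (x - self.b_pre) @ self.W_enc.T + self.b_enc   # (batch, n) pre-activations
        vals, idx = pre.topk(self.k, dim=-1)                 # keep the k largest per token
        z = torch.zeros_like(pre)
        z.scatter_(-1, idx, torch.relu(vals))                # every other latent is exactly 0
        return z

    def decode(self, z):                                     # z: (batch, n)
        return z @ self.W_dec.T + self.b_pre                 # (batch, d_model) reconstruction

    def forward(self, x):
        z = self.encode(x)
        return self.decode(z), z
```

Instantiating this with $n$ = 16 million would be the "16 million latent autoencoder" in that sense.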

A dead latent is one that stops activating (always outputs zero) during training. This is problematic because:

  1. It wastes computational resources: you're storing and computing with parameters that don't contribute to the reconstruction
  2. It reduces the effective capacity of the autoencoder
  3. As the paper notes, the problem gets worse at scale; they report finding up to 90% dead latents in some configurations

The paper uses the definition that a latent is "dead" if it hasn't activated in 10 million tokens. Their mitigation strategies (transpose initialization and auxiliary loss) reduce this to only 7% dead latents even in their largest model.
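
The bookkeeping behind that definition is straightforward to sketch: keep one counter per latent of how many tokens have passed since it last fired, and flag the latent once the counter crosses 10 million. The helper below is a hypothetical illustration (the names and batching scheme are mine, not the paper's):

```python
import torch

DEAD_AFTER = 10_000_000  # tokens without a nonzero activation => "dead"

def update_dead_stats(z: torch.Tensor, tokens_since_fired: torch.Tensor):
    """z: (batch, n) latent activations for one batch of tokens.
    tokens_since_fired: (n,) running counter maintained across training."""
    fired = (z != 0).any(dim=0)                    # did latent i activate on any token?
    tokens_since_fired = torch.where(
        fired,
        torch.zeros_like(tokens_since_fired),      # reset the counter on activation
        tokens_since_fired + z.shape[0],           # otherwise accumulate the batch's tokens
    )
    dead = tokens_since_fired > DEAD_AFTER         # boolean (n,) mask of dead latents
    return tokens_since_fired, dead
```

Roughly speaking, a mask like this is also what the auxiliary loss needs: it routes some reconstruction error through the flagged latents so they receive gradient again and have a chance to revive.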

Thinking Summary

The user is asking for precise terminology clarification. They understand the dead latent problem but want to know exactly what "latent" refers to. Let me be very precise about the terminology. In the context of sparse autoencoders:

  • "Latent" typically refers to the activation value (the scalar output) of a hidden unit
  • But colloquially, people often use "latent" to refer to the entire hidden unit itself
  • The hidden unit consists of parameters (weights and biases) that produce the latent activation

So, to be maximally precise:

  • Latent activation/value: the scalar value $z_i$ produced by hidden unit $i$
  • Latent unit/dimension: the $i$-th hidden unit, including its associated parameters
  • When people say "dead latent" they mean a latent unit whose activation is always zero

I should clarify this ambiguity in terminology.

You're right to ask for precision. The terminology is indeed ambiguous in common usage.

Strictly speaking:

  • "Latent" should refer to the activation value ziz_i (a scalar) produced by the ii-th hidden unit
  • The latent unit or latent dimension refers to the hidden unit itself, including its associated parameters (row ii of WencW_{enc}, bias benc,ib_{enc,i}, and column ii of WdecW_{dec})

In practice: people often use "latent" as shorthand for "latent unit," conflating the activation with the unit that produces it.

So when the paper says:

  • "16 million latent autoencoder" → 16 million latent units
  • "dead latents" → latent units whose activations are always zero
  • "latent activations zz" → the actual activation values

The imprecision comes from the fact that in most contexts, it's clear from usage whether we mean the unit or its activation. But you're correct that this conflation can be confusing, especially when discussing both the parameters and their outputs in the same context.
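
In code, the distinction is simply parameters versus activations. A toy illustration with made-up sizes (the paper's largest run has $n = 16$ million):

```python
import torch

d_model, n = 16, 64                 # toy sizes for illustration

W_enc = torch.randn(n, d_model)     # row i: encoder weights of latent *unit* i
b_enc = torch.zeros(n)              # b_enc[i]: that unit's encoder bias
W_dec = torch.randn(d_model, n)     # column i: that unit's decoder direction

x = torch.randn(d_model)            # one token's residual-stream activations
z = torch.relu(W_enc @ x + b_enc)   # z[i]: the latent *activation* z_i (a scalar)
x_hat = W_dec @ z                   # reconstruction from the active units' columns

# A "dead latent" in these terms: z[i] == 0 for every input x, while the
# unit's parameters (row i, b_enc[i], column i) still exist and consume memory.
```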
