Numinex

Thread (24 posts)

Thinking Summary

The user is asking about the definitions of L(C) and L(N) from the paper on scaling sparse autoencoders. Let me look through the paper to find these definitions. Section 3.1 describes two approaches for choosing autoencoder size:

3.1.1 Training to compute-MSE frontier (L(C)): this follows Lindsey et al. (2024) and trains autoencoders to the optimal MSE given the available compute, disregarding convergence. This method was originally introduced for pre-training language models. The paper finds that MSE follows a power law L(C) of compute.

3.1.2 Training to convergence (L(N)): this trains autoencoders to convergence (within some ε). It gives a bound on the best possible reconstruction achievable by their training method if they disregard compute efficiency.

So:

  • L(C) = loss as a function of compute: training to the optimal MSE given a fixed compute budget
  • L(N) = loss as a function of the number of latents: training to convergence

The paper mentions that in practice, they would ideally train to some intermediate token budget between L(N) and L(C).

Looking at Section 3.1 of the paper, here are the definitions:

L(C) - Training to compute-MSE frontier:

  • This trains autoencoders to achieve the optimal MSE given a fixed compute budget, disregarding convergence
  • Following the approach from language model scaling laws (Kaplan et al., 2020; Hoffmann et al., 2022)
  • The loss L(C) follows a power law as a function of compute C
  • Focuses on compute efficiency rather than fully training the model

L(N) - Training to convergence:

  • This trains autoencoders to convergence (within some ε), regardless of compute cost
  • Gives a bound on the best possible reconstruction achievable by their training method
  • The loss L(N) is measured as a function of the number of latents N after full convergence
  • Provides the theoretical best performance but is computationally expensive
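
To make the two curves concrete, here is a minimal sketch (mine, not the paper's code) of fitting the two power laws; the numbers are made-up placeholders, not results from the paper.

```python
# Minimal sketch (not the paper's code): fit power laws of the form
# L(C) ≈ a * C^(-b) and L(N) ≈ a * N^(-b) via linear regression in log-log space.
import numpy as np

def fit_power_law(x, loss):
    """Return (a, b) for loss ≈ a * x**(-b)."""
    slope, intercept = np.polyfit(np.log(x), np.log(loss), deg=1)
    return float(np.exp(intercept)), float(-slope)

# Made-up measurements, purely for illustration:
compute = np.array([1e15, 1e16, 1e17, 1e18])           # compute budgets
frontier_mse = np.array([0.31, 0.22, 0.16, 0.11])       # best MSE at each budget -> L(C)
n_latents = np.array([2**14, 2**16, 2**18, 2**20])       # autoencoder sizes
converged_mse = np.array([0.28, 0.19, 0.13, 0.09])       # MSE after convergence -> L(N)

a_c, b_c = fit_power_law(compute, frontier_mse)
a_n, b_n = fit_power_law(n_latents, converged_mse)
print(f"L(C) ≈ {a_c:.3g} * C^(-{b_c:.3f})")
print(f"L(N) ≈ {a_n:.3g} * N^(-{b_n:.3f})")
```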

The key distinction is:

  • L(C): Optimize for compute efficiency - "What's the best loss I can get with X amount of compute?"
  • L(N): Optimize for final performance - "What's the best loss I can eventually achieve with N latents if I train long enough?"

The paper notes that in practice, they would ideally train to some intermediate token budget between these two extremes. They also find that the optimal learning rate for L(N) is about 4x smaller than for L(C).
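
As a rough illustration of the two training regimes, here is my own sketch (not the paper's procedure): `step_fn` stands in for one optimizer step on the autoencoder, and the eps/window convergence test is an assumed criterion, not the paper's exact one.

```python
from typing import Callable

def train_to_compute_budget(step_fn: Callable[[], float], steps_budget: int) -> float:
    """L(C)-style stopping: spend exactly the compute budget, regardless of convergence."""
    loss = float("inf")
    for _ in range(steps_budget):
        loss = step_fn()
    return loss

def train_to_convergence(step_fn: Callable[[], float], eps: float = 1e-4,
                         window: int = 1000, max_steps: int = 10**8) -> float:
    """L(N)-style stopping: run until the windowed loss improves by less than eps."""
    prev_avg, curr_sum = float("inf"), 0.0
    for step in range(1, max_steps + 1):
        curr_sum += step_fn()
        if step % window == 0:
            curr_avg = curr_sum / window
            if prev_avg - curr_avg < eps:   # converged within eps
                return curr_avg
            prev_avg, curr_sum = curr_avg, 0.0
    return prev_avg
```

Training to an intermediate token budget, as the paper suggests doing in practice, would sit between these two stopping rules.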

Thinking Summary

That's a really good observation. Let me think through this.

The user is pointing out that in Section A.4 (batch size), the paper mentions that "batch size affects L(C) substantially, we find that the L(N) loss does not depend strongly on batch size when optimization hyperparameters are set appropriately." This is indeed a strong indicator that L(N) is measuring something more fundamental about the autoencoder's capabilities, while L(C) is more of an artifact of the optimization process.

If L(C) varies with batch size, that suggests it's capturing optimization dynamics: how efficiently you can use your compute budget depends on your batch size, learning rate schedule, etc. But if L(N) is invariant to batch size (when hyperparameters are set appropriately), that suggests it's measuring the actual representational capacity of the autoencoder architecture.

This supports the paper's argument in Section 3.1.1 vs 3.1.2, where they say that L(C) is "arguably unprincipled for autoencoder training" because:

  1. The latents are the important artifact (not reconstruction predictions, as in language models)
  2. Comparing MSE across different n is not fair, because larger n has a looser information bottleneck

The batch size invariance of L(N) suggests it's measuring something intrinsic to the model architecture and task, rather than optimization artifacts.

Excellent observation! You're absolutely right. The fact that batch size affects L(C) substantially but L(N) remains invariant (when hyperparameters are properly set) is strong evidence that L(N) captures something more fundamental about the autoencoder's representational capacity.

This batch size sensitivity difference suggests:

L(C) captures optimization artifacts:

  • Sensitive to batch size, learning rate schedules, optimization dynamics
  • Measures "how efficiently can I use my compute budget" rather than "how well can this architecture represent the data"
  • More about the optimization process than the model's intrinsic capabilities

L(N) captures intrinsic model properties:

  • Invariant to batch size (when hyperparameters are appropriate)
  • Measures the actual representational limit of the architecture
  • Gets at the fundamental question: "What's the best this model architecture can do on this task?"

This strongly supports the paper's argument that L(N) is the more principled choice for autoencoder evaluation. The authors note that comparing MSE across different n under L(C) isn't fair because "the latents have a looser information bottleneck with larger n, so lower MSE is more easily achieved."

The batch size invariance of L(N) suggests it's measuring something closer to the true scaling laws of representation learning, while L(C) conflates model capacity with optimization efficiency. This makes L(N) much more scientifically meaningful for understanding how autoencoder performance scales with model size.
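
A toy, self-contained way to see this distinction (my own construction on a least-squares problem, not the paper's autoencoder setup; the size and direction of the batch-size effect here are illustrative only): at a fixed sample budget the loss depends on batch size, while the converged loss barely does.

```python
# Toy illustration (not from the paper): SGD on least squares. The loss reached
# at a fixed sample budget (an L(C)-style measurement) depends on batch size,
# while the loss at convergence (an L(N)-style measurement) is nearly invariant.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8192, 32))
y = X @ rng.normal(size=32) + 0.1 * rng.normal(size=8192)

def sgd_loss(batch_size, steps, lr=0.01):
    w = np.zeros(32)
    for _ in range(steps):
        idx = rng.integers(0, len(X), size=batch_size)
        w -= lr * X[idx].T @ (X[idx] @ w - y[idx]) / batch_size
    return float(np.mean((X @ w - y) ** 2))

budget = 51_200  # fixed "compute" proxy: total training samples seen
for bs in (8, 64, 512):
    at_budget = sgd_loss(bs, steps=budget // bs)    # same sample budget for every batch size
    converged = sgd_loss(bs, steps=20_000)           # long enough to converge at this lr
    print(f"batch={bs:4d}  loss@budget={at_budget:.4f}  loss@converged={converged:.4f}")
```

In this toy, a single learning rate happens to work for all batch sizes; in the paper's setting the invariance of L(N) holds once optimization hyperparameters are set appropriately for each batch size.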

It's a nice empirical validation of their theoretical argument about which metric is more principled.
