Scaling and evaluating sparse autoencoders
Thread (60 posts)
Unfortunately, precision is extremely expensive to evaluate when the simulations are using GPT-4 as in Bills et al. (2023)
Why is this the case?
Thinking Summary
The user is asking why precision is extremely expensive to evaluate when the simulations use GPT-4 as in Bills et al. (2023). The paper mentions this in section 4.3 "Finding simple explanations for features". The context is that they're trying to evaluate explanations for what features (latents) in their sparse autoencoders represent.

The Bills et al. (2023) approach (from the linked paper "Language models can explain neurons in language models") uses GPT-4 to generate natural language explanations of what neurons/features are doing, and then uses GPT-4 again to simulate whether those explanations would predict when the neuron/feature activates.

To evaluate precision, you need to check: of all the times the explanation predicts the feature should activate, how often does it actually activate? This requires (1) running GPT-4 on many text samples to see if the explanation predicts activation, and (2) checking against ground truth activations.

The expense comes from needing to make many GPT-4 API calls for the simulation step: you need to test the explanation on a large corpus of text to get reliable precision estimates, and each test requires a GPT-4 call to simulate whether the explanation would predict activation on that text. In contrast, the N2G (Neuron to Graph) approach they use instead is much cheaper because it just uses n-gram patterns with wildcards rather than natural language explanations that need to be simulated with a language model.
The expense comes from the evaluation methodology used by Bills et al. (2023). To measure precision of a natural language explanation for a neuron/feature, they need to:
- Use GPT-4 to simulate predictions: For each text sample in a large evaluation corpus, call GPT-4 to determine whether the explanation predicts the neuron should activate on that text.
- Compare against ground truth: Check whether the neuron actually activated on each sample.
- Calculate precision: Of all the samples where the explanation predicted activation, what fraction actually had the neuron activate? (A minimal sketch of this loop follows below.)
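Here is a minimal sketch of that loop, assuming the GPT-4 simulation is wrapped in a `simulate(explanation, text)` callable; the function name and signature are illustrative, not taken from the paper's code:

```python
from typing import Callable

def precision_via_simulation(
    explanation: str,
    corpus: list[str],
    ground_truth_active: list[bool],
    simulate: Callable[[str, str], bool],
) -> float:
    """Estimate explanation precision for one latent.

    simulate(explanation, text) stands in for a GPT-4 call answering
    "does this explanation predict the latent fires on this text?";
    one such call per corpus sample is what makes this expensive.
    """
    predicted = 0
    correct = 0
    for text, actually_active in zip(corpus, ground_truth_active):
        if simulate(explanation, text):  # one GPT-4 call per sample
            predicted += 1
            if actually_active:
                correct += 1
    return correct / predicted if predicted else 0.0
```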
The computational cost is high because:
- Volume: You need thousands of text samples to get statistically reliable precision estimates
- API calls: Each sample requires a GPT-4 API call to simulate the explanation
- Token costs: Each simulation involves sending the explanation + context to GPT-4
For example, evaluating 10,000 text samples at ~500 tokens per prompt comes to 5 million tokens for a single neuron's precision score, and that cost repeats for every latent you want to explain.
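A quick back-of-envelope version of that scaling (the sample count, prompt size, and number of latents are illustrative assumptions, not figures from the paper):

```python
# Illustrative token count for GPT-4-simulated precision evaluation (all numbers assumed).
samples_per_latent = 10_000   # evaluation texts per latent
tokens_per_prompt = 500       # explanation + context sent to GPT-4 per sample
n_latents = 1_000             # latents you want precision scores for

tokens_per_latent = samples_per_latent * tokens_per_prompt   # 5,000,000
total_tokens = tokens_per_latent * n_latents                  # 5,000,000,000

print(f"{tokens_per_latent:,} tokens per latent; {total_tokens:,} tokens overall")
```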
This is why the authors switched to N2G (Neuron to Graph) - it uses simple n-gram pattern matching with wildcards instead of natural language explanations, making precision evaluation orders of magnitude cheaper since it doesn't require any LLM calls.
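For contrast, here is a simplified stand-in for N2G-style evaluation (real N2G builds its patterns from activating examples; the wildcard token and matching rule here are assumptions for illustration). The key point is that checking a sample is plain token comparison, so no LLM call is needed:

```python
# Simplified N2G-style check: an "explanation" is a set of n-gram patterns
# in which "*" matches any token. Evaluating a sample needs no LLM call.

def pattern_predicts_activation(patterns: list[list[str]], tokens: list[str]) -> bool:
    for pattern in patterns:
        n = len(pattern)
        for i in range(len(tokens) - n + 1):
            window = tokens[i:i + n]
            if all(p == "*" or p == t for p, t in zip(pattern, window)):
                return True
    return False

# Example: a pattern predicting activation around "the ... cat"
patterns = [["the", "*", "cat"]]
print(pattern_predicts_activation(patterns, ["I", "saw", "the", "black", "cat"]))  # True
print(pattern_predicts_activation(patterns, ["dogs", "bark", "loudly"]))           # False
```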