Cartridges paper
Thread (13 posts)
Can you describe the overall approach of this paper in terms of the framing of "fast weights" (kv cache) and "slow weights" (model parameters)?
The Cartridges paper presents an approach that can be framed in terms of the fast weights/slow weights dichotomy:
Slow weights: These are the frozen, pre-trained parameters of the base language model, which remain untouched during cartridge training.
Fast weights: These traditionally refer to the transient KV cache maintained during inference, which grows linearly with context length. The paper's key innovation is to replace this with a different type of fast weight - the trained cartridge.
The cartridge ($Z$) serves as a compressed, learned representation of the entire corpus that occupies the same memory footprint as a KV cache for $p$ tokens, where $p \ll n$, the number of tokens in the corpus. Critically, this cartridge is:
- Trained offline using gradient descent (slow learning mechanism) rather than being populated on-the-fly through forward passes
- Amortized across all queries to the same corpus, making the cost of training justifiable
- Parameterized directly as a KV cache (key-value vectors), making it compatible with existing inference infrastructure (see the sketch after this list)
- Optimized using synthetic data and context distillation to capture both factual content and structural dependencies
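To make the "parameterized directly as a KV cache" point concrete, here is a minimal PyTorch-style sketch of what a cartridge's trainable state could look like. The class name `Cartridge`, the per-layer layout, and the dimensions are illustrative assumptions for this sketch, not the paper's actual code.

```python
import torch
import torch.nn as nn

class Cartridge(nn.Module):
    """Trainable KV prefix that behaves like a KV cache for p "virtual" tokens.

    Illustrative only: the class name, per-layer layout, and dimensions are
    assumptions for this sketch, not the paper's implementation.
    """

    def __init__(self, n_layers: int, p: int, n_kv_heads: int, head_dim: int):
        super().__init__()
        # One learned (key, value) pair of shape (p, n_kv_heads, head_dim) per layer.
        self.keys = nn.ParameterList(
            [nn.Parameter(0.02 * torch.randn(p, n_kv_heads, head_dim)) for _ in range(n_layers)]
        )
        self.values = nn.ParameterList(
            [nn.Parameter(0.02 * torch.randn(p, n_kv_heads, head_dim)) for _ in range(n_layers)]
        )

    def as_kv_cache(self):
        """Return per-layer (K, V) tensors to prepend to the model's KV cache,
        where the cached corpus tokens would otherwise sit."""
        return list(zip(self.keys, self.values))
```

During training only these tensors receive gradients; the base model's slow weights stay frozen, and at inference time the pairs returned by `as_kv_cache()` stand in for the corpus's KV entries, so the serving stack can treat them like any other cache.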
This approach effectively shifts computation from inference-time (where expanding KV caches limit throughput) to training-time (where compute can be amortized), while maintaining the model's ability to perform diverse reasoning tasks over the corpus.
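As a rough illustration of that shift, the offline stage might look like the context-distillation loop below. This is a sketch under stated assumptions: `synthetic_batches`, the `kv_prefix` keyword on the frozen model's forward call, and the hyperparameters are hypothetical placeholders rather than the paper's actual interface.

```python
import torch
import torch.nn.functional as F

def train_cartridge(cartridge, frozen_model, synthetic_batches, steps=1_000, lr=1e-3):
    """Offline context-distillation loop (sketch, not the paper's exact recipe).

    Assumes `synthetic_batches` yields (query_ids, teacher_logits) pairs, where
    teacher_logits came from the frozen model answering the same synthetic query
    with the full corpus in its context window.
    """
    frozen_model.requires_grad_(False)                     # slow weights stay fixed
    opt = torch.optim.AdamW(cartridge.parameters(), lr=lr)

    for _, (query_ids, teacher_logits) in zip(range(steps), synthetic_batches):
        # Hypothetical interface: run the frozen model with the cartridge's
        # key/value tensors prepended in place of the corpus KV cache.
        student_logits = frozen_model(query_ids, kv_prefix=cartridge.as_kv_cache())

        # Match the teacher's next-token distribution (context distillation).
        loss = F.kl_div(
            F.log_softmax(student_logits, dim=-1),
            F.softmax(teacher_logits, dim=-1),
            reduction="batchmean",
        )
        opt.zero_grad()
        loss.backward()
        opt.step()
```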
The paper demonstrates that this learned fast weight representation can match ICL performance while providing substantial memory savings (up to 38.6×) and throughput improvements (up to 26.4×), effectively extending the practical context length beyond the model's native window.
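The exact 38.6× and 26.4× figures depend on the models and corpora evaluated in the paper, but the way the savings scale is easy to see with back-of-the-envelope arithmetic. The dimensions below are assumed Llama-3-8B-style values and the token counts are hypothetical, so the printed ratio is illustrative only.

```python
def kv_cache_bytes(n_tokens, n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_elem=2):
    """Rough KV-cache footprint: 2 (K and V) * layers * kv_heads * head_dim * tokens."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_tokens

corpus_tokens = 128_000      # hypothetical corpus held as a raw KV cache
cartridge_tokens = 4_096     # hypothetical cartridge size p, much smaller than the corpus

full = kv_cache_bytes(corpus_tokens)         # ~16.8 GB
cart = kv_cache_bytes(cartridge_tokens)      # ~0.54 GB
print(f"memory ratio ~ {full / cart:.0f}x")  # savings scale as n / p
```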