Cartridges paper
Thread (13 posts)
Can you describe the overall approach of this paper in terms of the framing of "fast weights" (kv cache) and "slow weights" (model parameters)?
Kimi K2 via @hdevalence.bsky.social
The Cartridges paper starts from the conventional split between "slow weights" (the pre-trained, frozen transformer parameters) and "fast weights" (the KV cache, which grows with input length), and introduces an intermediate layer: the Cartridge, a parameterized KV cache trained offline on each corpus. A Cartridge performs long-context tasks at a memory cost independent of input length, effectively turning the KV cache itself into fine-tuned fast weights.
Original split:
- Slow weights = billions of frozen transformer parameters (θ).
- Fast weights = KV cache that scales linearly with input (O(n) tokens).
Cartridges reframing:
- Slow weights = θ (unchanged).
- Medium-fast weights = a trained KV cache (the Cartridge) of p token slots, where p is a hyperparameter; it is initialized from the first p tokens of the corpus and optimized via gradient descent.
- No KV growth: inference uses only the p Cartridge KV slots (a constant-size prefix) plus the user query, regardless of corpus length.
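The memory contrast above is easy to make concrete. A rough back-of-envelope sketch (the layer/head dimensions and the p = 2,048 slot count are hypothetical, not taken from the paper):

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """Approximate KV-cache memory: keys + values, across all layers (fp16 elems)."""
    return 2 * num_tokens * num_layers * num_kv_heads * head_dim * bytes_per_elem

# Hypothetical Llama-like dimensions (illustrative only).
layers, kv_heads, head_dim = 32, 8, 128

icl = kv_cache_bytes(128_000, layers, kv_heads, head_dim)     # full corpus in context
cartridge = kv_cache_bytes(2_048, layers, kv_heads, head_dim)  # fixed p = 2048 slots

print(f"ICL cache:       {icl / 1e9:.1f} GB")        # 16.8 GB
print(f"Cartridge cache: {cartridge / 1e9:.2f} GB")  # 0.27 GB
print(f"Reduction:       {icl / cartridge:.1f}x")    # 62.5x
```

The reduction factor is just n / p: the ICL cache scales with corpus length n, while the Cartridge cost is fixed by the chosen p.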
Mechanics:
- Self-Study generates synthetic conversations about the corpus and distills the behavior of the model with the full corpus in context into the Cartridge parameters.
- Composition allows multiple Cartridges to be concatenated at inference, enabling multi-document queries without joint training—i.e., treating distinct corpora as stackable fast-weight modules.
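Since a Cartridge is just per-layer key/value tensors, composition amounts to concatenating cartridges along the sequence axis. A minimal numpy sketch with made-up shapes (random arrays stand in for trained KV slots):

```python
import numpy as np

def make_cartridge(p, num_layers=4, num_heads=2, head_dim=8, seed=0):
    """A 'cartridge' here is a list of per-layer (keys, values) arrays of p slots."""
    rng = np.random.default_rng(seed)
    return [
        (rng.standard_normal((num_heads, p, head_dim)),
         rng.standard_normal((num_heads, p, head_dim)))
        for _ in range(num_layers)
    ]

def compose(*cartridges):
    """Stack cartridges by concatenating KV slots along the sequence axis."""
    return [
        (np.concatenate([c[layer][0] for c in cartridges], axis=1),
         np.concatenate([c[layer][1] for c in cartridges], axis=1))
        for layer in range(len(cartridges[0]))
    ]

doc_a = make_cartridge(p=16, seed=1)
doc_b = make_cartridge(p=32, seed=2)
combined = compose(doc_a, doc_b)
print(combined[0][0].shape)  # layer-0 keys: (2, 48, 8) — 16 + 32 slots
```

The composed prefix costs the sum of the individual cartridge sizes, still independent of the underlying corpora's lengths.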
Outcome:
- Reduces memory 38.6× vs ICL while matching quality, extending apparent context length (e.g., 128k→484k tokens on MTOB), and enabling 26.4× higher throughput.
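The Self-Study step described above rests on a distillation objective: match the student (model with only the Cartridge prefix) to the teacher (model with the full corpus in context). A toy numpy version of a KL-based distillation loss, with random logits standing in for the two model passes (the shapes and objective details are illustrative, not the paper's exact recipe):

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distill_loss(teacher_logits, student_logits):
    """KL(teacher || student) over the vocabulary, averaged over positions."""
    p = softmax(teacher_logits)
    log_p = np.log(p)
    log_q = np.log(softmax(student_logits))
    return float(np.mean(np.sum(p * (log_p - log_q), axis=-1)))

rng = np.random.default_rng(0)
teacher = rng.standard_normal((5, 100))  # logits from model with corpus in context
student = rng.standard_normal((5, 100))  # logits from model with Cartridge prefix

print(distill_loss(teacher, teacher))      # 0.0 when distributions match
print(distill_loss(teacher, student) > 0)  # True: positive otherwise
```

Training then backpropagates this loss into the Cartridge's key/value parameters only, leaving the slow weights θ frozen.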