§F.2 Impact of token position
We find that tokens at later positions are harder to reconstruct (Figure 29). We hypothesize that this is because the residual stream at later positions has more features.
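One way to check a position effect like this is to compute reconstruction error separately for each token position. The sketch below is illustrative only. It assumes activations and their reconstructions are available as arrays of shape (batch, seq_len, d_model), and it assumes a normalized mean squared error as the metric, which the text above does not specify. The function name `per_position_normalized_mse` and the `eps` parameter are hypothetical.

```python
import numpy as np


def per_position_normalized_mse(acts, recon, eps=1e-12):
    """Return one normalized reconstruction error per token position.

    acts, recon: arrays of shape (batch, seq_len, d_model).
    The metric (an assumption, not taken from the source) is the squared
    error divided by the squared distance to the per-position mean
    activation, so 0 means perfect reconstruction and 1 means no better
    than always predicting the mean.
    """
    acts = np.asarray(acts, dtype=np.float64)
    recon = np.asarray(recon, dtype=np.float64)
    if acts.ndim != 3 or acts.shape != recon.shape:
        raise ValueError("expected two arrays of identical shape (batch, seq_len, d_model)")

    # Squared error summed over d_model, averaged over the batch.
    err = ((acts - recon) ** 2).sum(axis=-1).mean(axis=0)

    # Baseline: squared distance to the mean activation at each position.
    centered = acts - acts.mean(axis=0, keepdims=True)
    base = (centered ** 2).sum(axis=-1).mean(axis=0)

    # eps guards against division by zero when a position has no variance.
    return err / np.maximum(base, eps)


if __name__ == "__main__":
    # Synthetic stand-in data: reconstruction noise grows with position.
    rng = np.random.default_rng(0)
    acts = rng.normal(size=(64, 16, 32))
    noise_scale = np.linspace(0.1, 0.5, 16)[None, :, None]
    recon = acts + noise_scale * rng.normal(size=acts.shape)
    print(per_position_normalized_mse(acts, recon))
```

Plotting the returned vector against position gives a curve of the kind Figure 29 presumably shows. Normalizing per position matters here because activation norms can themselves vary with position, which would otherwise be confounded with reconstruction difficulty.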
Seems related to earlier notes
None of this actually works with the attention mechanism except indirectly, so it seems hard to see how we could hope for it to explain the interesting long-context behavior.