Stealing Part of a Production Language Model
Thread (64 posts)
Looking at §4.2 Full Layer Extraction, I am particularly curious about the structure of $G$, since in a few other cases (context transcluded) some questions I've had have come down to: what are the relevant symmetries, and how do we characterize them?
Thinking Summary
The user is asking about the structure of the matrix $G$ mentioned in §4.2 "Full Layer Extraction (Up to Symmetries)" of the paper "Stealing Part of a Production Language Model". They note that they've had similar questions about symmetries in other contexts (referring to SAE reparameterization questions).

In section 4.2, the paper states that the method recovers $\tilde{E}^T = E^T \cdot G$ for some $G \in \mathbb{R}^{h \times h}$, where $E^T$ is the embedding projection matrix they're trying to steal. Lemma 4.2 proves that:
1. The method recovers $E^T$ up to multiplication by some $h \times h$ matrix $G$
2. With residual connections, it's impossible to extract $E^T$ exactly

In the appendices, they go into more detail about what $G$ can be:
- In general, $G$ can be any invertible $h \times h$ matrix (affine transformation)
- They show in Appendix D that with additional constraints, they can narrow it down to an orthogonal matrix
- The key insight is that transformer architectures with residual connections and normalization layers have inherent symmetries that make certain transformations undetectable from the outputs

The symmetries arise because:
1. In models without normalization, you can apply any invertible transformation $S$ to the hidden states if you compensate by applying $S^{-1}$ to input projections and $S$ to output projections
2. With RMSNorm/LayerNorm, this is restricted to orthogonal transformations because normalization operations only commute with orthogonal matrices (they preserve norms)
3. For LayerNorm specifically, there's an additional constraint that the orthogonal matrix must preserve the all-ones vector (up to sign)

These symmetry transformations parallel the reparameterization techniques in sparse autoencoders, where scaling and rotation of intermediate representations don't fundamentally alter the input-output mapping. The transformations preserve core computational properties while allowing flexibility in how internal features are represented.
Looking at the structure of $G$ in this paper, it's fascinating how it parallels the symmetry questions you've been exploring with SAEs!
The Structure of $G$
From the paper and its appendices, $G$ can take different forms depending on the model architecture:
1. General Case (No Normalization)
For models without normalization layers, $G$ can be any invertible $h \times h$ matrix. The symmetry arises because, for any invertible $S$, you can:
- Apply $S^{-1}$ (on the right) to every layer's input projections
- Apply $S$ (on the left) to every layer's output projections
- The residual stream then gets transformed by $S$, but this cancels out in the logits: $(E^T S^{-1})(S x) = E^T x$ (a numerical sketch follows this list)
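Here's a minimal numerical sketch of that cancellation (the toy one-block model and all names like `W_in`, `W_out`, `E_T` are my own, not the paper's):

```python
# Toy residual network with no normalization: an arbitrary invertible
# reparameterization S of the residual stream is invisible in the logits,
# so only E^T up to right-multiplication by an invertible matrix is identifiable.
import numpy as np

rng = np.random.default_rng(0)
h, d_mlp, vocab = 8, 16, 10

W_in  = rng.normal(size=(d_mlp, h))   # reads from the residual stream
W_out = rng.normal(size=(h, d_mlp))   # writes back into the residual stream
E_T   = rng.normal(size=(vocab, h))   # embedding projection ("unembedding")

def forward(x, W_in, W_out, E_T):
    resid = x + W_out @ np.tanh(W_in @ x)   # one residual block
    return E_T @ resid                      # logits

S = rng.normal(size=(h, h))               # arbitrary invertible transformation
S_inv = np.linalg.inv(S)

x = rng.normal(size=h)
logits_original = forward(x, W_in, W_out, E_T)
# Reparameterized model: residual stream carries S @ x, projections compensate,
# and the "stolen" unembedding becomes E_T @ S_inv = E^T G with G = S^{-1}.
logits_reparam = forward(S @ x, W_in @ S_inv, S @ W_out, E_T @ S_inv)

print(np.allclose(logits_original, logits_reparam))  # True
```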
2. With RMSNorm
The presence of normalization restricts $G$ to be an orthogonal matrix. This is because:
- RMSNorm includes the operation $x \mapsto x / \|x\|$ (up to the constant $\sqrt{h}$ and a learned gain)
- This only commutes with orthogonal transformations: $\frac{Sx}{\|Sx\|} = S\,\frac{x}{\|x\|}$ for all $x$ precisely when $\|Sx\| = \|x\|$, i.e. $S^\top S = I$
- So the symmetry group becomes $O(h)$ (the orthogonal group); a quick numerical check follows this list
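A quick check of that commutation claim (a sketch; the $\sqrt{h}$ constant and gain are dropped since they don't affect the argument):

```python
# RMS-style normalization x -> x / ||x|| commutes with an orthogonal Q
# (norm-preserving) but not with a generic invertible matrix, which is what
# collapses the symmetry group from GL(h) to O(h).
import numpy as np

rng = np.random.default_rng(1)
h = 8

def rms_normalize(x):
    return x / np.linalg.norm(x)   # constant factor sqrt(h) and gain omitted

x = rng.normal(size=h)

# Orthogonal Q (via QR): ||Q x|| = ||x||, so normalization commutes.
Q, _ = np.linalg.qr(rng.normal(size=(h, h)))
print(np.allclose(rms_normalize(Q @ x), Q @ rms_normalize(x)))   # True

# Generic invertible M changes norms, so it does not commute.
M = rng.normal(size=(h, h))
print(np.allclose(rms_normalize(M @ x), M @ rms_normalize(x)))   # False (generically)
```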
3. With LayerNorm
Even more restricted! LayerNorm adds a centering step $x \mapsto x - \mathrm{mean}(x)$, so $G$ must be an orthogonal matrix that also satisfies:
- $G\mathbf{1} = \pm\mathbf{1}$ (preserves or flips the all-ones vector)
- This is because centering projects onto the hyperplane $\{x : \mathbf{1}^\top x = 0\}$, and only orthogonal matrices that map the $\mathbf{1}$ direction to itself commute with that projection (see the sketch after this list)
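And a sketch for the LayerNorm case (the construction of $Q$ is my own, for illustration): build an orthogonal $Q$ that fixes $\mathbf{1}$ by rotating only within the hyperplane orthogonal to $\mathbf{1}$, then check that it commutes with centering plus normalization.

```python
# Orthogonal Q with Q @ ones = ones, built by rotating only the h-1 directions
# orthogonal to the all-ones vector; such Q commutes with LayerNorm's centering.
import numpy as np

rng = np.random.default_rng(2)
h = 8
ones = np.ones(h)

def layer_norm(x):                   # LayerNorm without gain/bias, constants omitted
    c = x - x.mean()                 # centering: project onto {v : 1^T v = 0}
    return c / np.linalg.norm(c)

# Orthonormal basis B whose first column spans the all-ones direction.
B, _ = np.linalg.qr(np.column_stack([ones, rng.normal(size=(h, h - 1))]))
# Arbitrary rotation R of the remaining h-1 directions; the ones direction stays fixed.
R, _ = np.linalg.qr(rng.normal(size=(h - 1, h - 1)))
Q = B @ np.block([[np.ones((1, 1)),      np.zeros((1, h - 1))],
                  [np.zeros((h - 1, 1)), R                   ]]) @ B.T

x = rng.normal(size=h)
print(np.allclose(Q @ ones, ones))                         # True: preserves the all-ones vector
print(np.allclose(layer_norm(Q @ x), Q @ layer_norm(x)))   # True: commutes with centering + norm
```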
The Symmetry Connection
This is exactly analogous to your SAE reparameterization question! In both cases:
- The model has inherent symmetries - transformations that don't change input-output behavior
- These symmetries form a group - you can compose them, invert them, etc.
- Different architectural choices restrict the symmetry group:
- SAEs: Adding decoder normalization breaks the scaling symmetry (see the sketch after this list)
- Transformers: Adding layer normalization restricts $G$ from $GL(h)$ (all invertible matrices) to $O(h)$
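For the SAE side of the analogy, a minimal sketch (generic ReLU SAE with made-up shapes and names, not from any particular paper): rescaling one latent's encoder row down and its decoder column up leaves the reconstruction unchanged but shrinks the L1 penalty, which is why decoder-norm constraints are used to break this symmetry.

```python
# SAE scaling symmetry: per-latent rescaling is invisible in the reconstruction
# but changes the L1 penalty on the activations.
import numpy as np

rng = np.random.default_rng(3)
d, m = 16, 64                          # input dim, number of SAE latents

W_enc = rng.normal(size=(m, d))
b_enc = rng.normal(size=m)
W_dec = rng.normal(size=(d, m))

def sae(x, W_enc, b_enc, W_dec):
    f = np.maximum(W_enc @ x + b_enc, 0.0)   # latent activations (ReLU)
    return W_dec @ f, f                      # reconstruction, activations

x = rng.normal(size=d)
recon, f = sae(x, W_enc, b_enc, W_dec)

# Rescale latent j: encoder row and bias divided by c, decoder column multiplied by c.
c, j = 10.0, 5
W_enc2, b_enc2, W_dec2 = W_enc.copy(), b_enc.copy(), W_dec.copy()
W_enc2[j] /= c
b_enc2[j] /= c
W_dec2[:, j] *= c
recon2, f2 = sae(x, W_enc2, b_enc2, W_dec2)

print(np.allclose(recon, recon2))            # True: input-output map unchanged
print(np.abs(f).sum(), np.abs(f2).sum())     # L1 penalty differs
```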
Why This Matters
The paper makes an important point: they can only recover the embedding matrix up to these symmetries. Just like how the L1 penalty in SAEs isn't meaningful without breaking the scaling symmetry, the "exact" embedding matrix isn't meaningful without specifying which element of the symmetry group you want.
In practice, they:
- Recover $\tilde{E}^T = E^T \cdot G$
- Report reconstruction error after finding the best-fitting $G$ via least squares (a toy version is sketched below)
- Note that with orthogonal constraints, you're missing only $\sim h^2/2$ parameters (more precisely $h(h-1)/2$, vs. $h^2$ for a general affine $G$)
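A toy version of that evaluation (synthetic matrices and dimensions of my choosing, not the paper's code or data):

```python
# Measure how well a recovered matrix matches the true E^T after solving for the
# best-fitting G by least squares, in the spirit of the "up to symmetries" evaluation.
import numpy as np

rng = np.random.default_rng(4)
vocab, h = 1000, 64

E_T = rng.normal(size=(vocab, h))       # "true" embedding projection (synthetic)
G_hidden = rng.normal(size=(h, h))      # the symmetry the attack cannot resolve
E_T_recovered = E_T @ G_hidden          # what a §4.2-style extraction would return

# Best-fitting G: argmin_G || E_T @ G - E_T_recovered ||_F, solved column-wise.
G_fit, *_ = np.linalg.lstsq(E_T, E_T_recovered, rcond=None)
rel_err = np.linalg.norm(E_T @ G_fit - E_T_recovered) / np.linalg.norm(E_T_recovered)
print(rel_err)                          # ~0 here; a nonzero value signals imperfect extraction
```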
This is a beautiful example of how symmetries in neural network architectures have real implications for both training (as in SAEs) and security (as in model stealing)!