Stealing Part of a Production Language Model
Thread (64 posts)
Looking at §4.2 Full Layer Extraction, I am particularly curious about the structure of $G$, since in a few other cases (context transcluded) some questions I've had have come down to: what are the relevant symmetries, and how do we characterize them?
Thinking Summary
The user is asking about the structure of the matrix $G$ mentioned in §4.2 "Full Layer Extraction (Up to Symmetries)" of the paper "Stealing Part of a Production Language Model". They note that they've had similar questions about symmetries in other contexts (referring to SAE reparameterization questions).

In section 4.2, the paper states that the method recovers $\tilde{E}^T = E^T \cdot G$ for some $G \in \mathbb{R}^{h \times h}$, where $E^T$ is the embedding projection matrix they're trying to steal. Lemma 4.2 proves that:
1. The method recovers $E^T$ up to multiplication by some $h \times h$ matrix $G$
2. With residual connections, it's impossible to extract $E^T$ exactly

In the appendices, they go into more detail about what $G$ can be:
- In general, $G$ can be any invertible $h \times h$ matrix (affine transformation)
- They show in Appendix D that with additional constraints, they can narrow it down to an orthogonal matrix
- The key insight is that transformer architectures with residual connections and normalization layers have inherent symmetries that make certain transformations undetectable from the outputs

The symmetries arise because:
1. In models without normalization, you can apply any invertible transformation $S$ to the hidden states if you compensate by applying $S^{-1}$ to input projections and $S$ to output projections
2. With RMSNorm/LayerNorm, this is restricted to orthogonal transformations because normalization operations only commute with orthogonal matrices (they preserve norms)
3. For LayerNorm specifically, there's an additional constraint that the orthogonal matrix must preserve the all-ones vector (up to sign)

These symmetry transformations parallel the reparameterization techniques in sparse autoencoders, where scaling and rotation of intermediate representations don't fundamentally alter the input-output mapping. The transformations preserve core computational properties while allowing flexibility in how internal features are represented.
Looking at the structure of $G$ in this paper, it's fascinating how it parallels the symmetry questions you've been exploring with SAEs!
The Structure of $G$
From the paper and its appendices, $G$ can take different forms depending on the model architecture:
1. General Case (No Normalization)
For models without normalization layers, $G$ can be any invertible $h \times h$ matrix. The symmetry arises because, for any invertible $S$, you can:
- Apply $S^{-1}$ (on the right) to every layer's input projections
- Apply $S$ (on the left) to every layer's output projections
- The residual stream then gets transformed by $S$, but this cancels out in the logits: $(E^T S^{-1})(S x) = E^T x$ (a numerical sketch follows this list)
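Here's a minimal numerical sketch of that cancellation (the toy one-block model and all names like `W_in`, `W_out`, `E_T` are my own, not the paper's):

```python
# Toy residual network with no normalization: an arbitrary invertible
# reparameterization S of the residual stream is invisible in the logits,
# so only E^T up to right-multiplication by an invertible matrix is identifiable.
import numpy as np

rng = np.random.default_rng(0)
h, d_mlp, vocab = 8, 16, 10

W_in  = rng.normal(size=(d_mlp, h))   # reads from the residual stream
W_out = rng.normal(size=(h, d_mlp))   # writes back into the residual stream
E_T   = rng.normal(size=(vocab, h))   # embedding projection ("unembedding")

def forward(x, W_in, W_out, E_T):
    resid = x + W_out @ np.tanh(W_in @ x)   # one residual block
    return E_T @ resid                      # logits

S = rng.normal(size=(h, h))               # arbitrary invertible transformation
S_inv = np.linalg.inv(S)

x = rng.normal(size=h)
logits_original = forward(x, W_in, W_out, E_T)
# Reparameterized model: residual stream carries S @ x, projections compensate,
# and the "stolen" unembedding becomes E_T @ S_inv = E^T G with G = S^{-1}.
logits_reparam = forward(S @ x, W_in @ S_inv, S @ W_out, E_T @ S_inv)

print(np.allclose(logits_original, logits_reparam))  # True
```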
2. With RMSNorm
The presence of normalization restricts $G$ to be an orthogonal matrix. This is because:
- RMSNorm includes the operation $x \mapsto x / \|x\|$ (up to the constant $\sqrt{h}$ and a learned gain)
- This only commutes with orthogonal transformations: $\frac{Sx}{\|Sx\|} = S\,\frac{x}{\|x\|}$ for all $x$ precisely when $\|Sx\| = \|x\|$, i.e. $S^\top S = I$
- So the symmetry group becomes $O(h)$ (the orthogonal group); a quick numerical check follows this list
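A quick check of that commutation claim (a sketch; the $\sqrt{h}$ constant and gain are dropped since they don't affect the argument):

```python
# RMS-style normalization x -> x / ||x|| commutes with an orthogonal Q
# (norm-preserving) but not with a generic invertible matrix, which is what
# collapses the symmetry group from GL(h) to O(h).
import numpy as np

rng = np.random.default_rng(1)
h = 8

def rms_normalize(x):
    return x / np.linalg.norm(x)   # constant factor sqrt(h) and gain omitted

x = rng.normal(size=h)

# Orthogonal Q (via QR): ||Q x|| = ||x||, so normalization commutes.
Q, _ = np.linalg.qr(rng.normal(size=(h, h)))
print(np.allclose(rms_normalize(Q @ x), Q @ rms_normalize(x)))   # True

# Generic invertible M changes norms, so it does not commute.
M = rng.normal(size=(h, h))
print(np.allclose(rms_normalize(M @ x), M @ rms_normalize(x)))   # False (generically)
```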
3. With LayerNorm
Even more restricted! LayerNorm adds a centering step $x \mapsto x - \mathrm{mean}(x)$, so $G$ must be an orthogonal matrix that also satisfies:
- $G\mathbf{1} = \pm\mathbf{1}$ (preserves or flips the all-ones vector)
- This is because centering projects onto the hyperplane $\{x : \mathbf{1}^\top x = 0\}$, and only orthogonal matrices that map the $\mathbf{1}$ direction to itself commute with that projection (see the sketch after this list)
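And a sketch for the LayerNorm case (the construction of $Q$ is my own, for illustration): build an orthogonal $Q$ that fixes $\mathbf{1}$ by rotating only within the hyperplane orthogonal to $\mathbf{1}$, then check that it commutes with centering plus normalization.

```python
# Orthogonal Q with Q @ ones = ones, built by rotating only the h-1 directions
# orthogonal to the all-ones vector; such Q commutes with LayerNorm's centering.
import numpy as np

rng = np.random.default_rng(2)
h = 8
ones = np.ones(h)

def layer_norm(x):                   # LayerNorm without gain/bias, constants omitted
    c = x - x.mean()                 # centering: project onto {v : 1^T v = 0}
    return c / np.linalg.norm(c)

# Orthonormal basis B whose first column spans the all-ones direction.
B, _ = np.linalg.qr(np.column_stack([ones, rng.normal(size=(h, h - 1))]))
# Arbitrary rotation R of the remaining h-1 directions; the ones direction stays fixed.
R, _ = np.linalg.qr(rng.normal(size=(h - 1, h - 1)))
Q = B @ np.block([[np.ones((1, 1)),      np.zeros((1, h - 1))],
                  [np.zeros((h - 1, 1)), R                   ]]) @ B.T

x = rng.normal(size=h)
print(np.allclose(Q @ ones, ones))                         # True: preserves the all-ones vector
print(np.allclose(layer_norm(Q @ x), Q @ layer_norm(x)))   # True: commutes with centering + norm
```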
The Symmetry Connection
This is exactly analogous to your SAE reparameterization question! In both cases:
- The model has inherent symmetries - transformations that don't change input-output behavior
- These symmetries form a group - you can compose them, invert them, etc.
- Different architectural choices restrict the symmetry group:
- SAEs: Adding decoder normalization breaks the scaling symmetry (see the sketch after this list)
- Transformers: Adding layer normalization restricts $G$ from $GL(h)$ (all invertible matrices) to $O(h)$
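For the SAE side of the analogy, a minimal sketch (generic ReLU SAE with made-up shapes and names, not from any particular paper): rescaling one latent's encoder row down and its decoder column up leaves the reconstruction unchanged but shrinks the L1 penalty, which is why decoder-norm constraints are used to break this symmetry.

```python
# SAE scaling symmetry: per-latent rescaling is invisible in the reconstruction
# but changes the L1 penalty on the activations.
import numpy as np

rng = np.random.default_rng(3)
d, m = 16, 64                          # input dim, number of SAE latents

W_enc = rng.normal(size=(m, d))
b_enc = rng.normal(size=m)
W_dec = rng.normal(size=(d, m))

def sae(x, W_enc, b_enc, W_dec):
    f = np.maximum(W_enc @ x + b_enc, 0.0)   # latent activations (ReLU)
    return W_dec @ f, f                      # reconstruction, activations

x = rng.normal(size=d)
recon, f = sae(x, W_enc, b_enc, W_dec)

# Rescale latent j: encoder row and bias divided by c, decoder column multiplied by c.
c, j = 10.0, 5
W_enc2, b_enc2, W_dec2 = W_enc.copy(), b_enc.copy(), W_dec.copy()
W_enc2[j] /= c
b_enc2[j] /= c
W_dec2[:, j] *= c
recon2, f2 = sae(x, W_enc2, b_enc2, W_dec2)

print(np.allclose(recon, recon2))            # True: input-output map unchanged
print(np.abs(f).sum(), np.abs(f2).sum())     # L1 penalty differs
```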
Why This Matters
The paper makes an important point: they can only recover the embedding matrix up to these symmetries. Just like how the L1 penalty in SAEs isn't meaningful without breaking the scaling symmetry, the "exact" embedding matrix isn't meaningful without specifying which element of the symmetry group you want.
In practice, they:
- Recover $\tilde{E}^T = E^T \cdot G$
- Report reconstruction error after finding the best-fitting $G$ via least squares (a toy version is sketched below)
- Note that with orthogonal constraints, you're missing only $\sim h^2/2$ parameters (more precisely $h(h-1)/2$, vs. $h^2$ for a general affine $G$)
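A toy version of that evaluation (synthetic matrices and dimensions of my choosing, not the paper's code or data):

```python
# Measure how well a recovered matrix matches the true E^T after solving for the
# best-fitting G by least squares, in the spirit of the "up to symmetries" evaluation.
import numpy as np

rng = np.random.default_rng(4)
vocab, h = 1000, 64

E_T = rng.normal(size=(vocab, h))       # "true" embedding projection (synthetic)
G_hidden = rng.normal(size=(h, h))      # the symmetry the attack cannot resolve
E_T_recovered = E_T @ G_hidden          # what a §4.2-style extraction would return

# Best-fitting G: argmin_G || E_T @ G - E_T_recovered ||_F, solved column-wise.
G_fit, *_ = np.linalg.lstsq(E_T, E_T_recovered, rcond=None)
rel_err = np.linalg.norm(E_T @ G_fit - E_T_recovered) / np.linalg.norm(E_T_recovered)
print(rel_err)                          # ~0 here; a nonzero value signals imperfect extraction
```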
This is a beautiful example of how symmetries in neural network architectures have real implications for both training (as in SAEs) and security (as in model stealing)!