Stealing Part of a Production Language Model
Thread (64 posts)
Thinking about the "attack success rate" evaluation in §4.2, as well as the included context focusing on explicit characterization of the symmetries, can you explain the RMS computation and how it does or doesn't respect symmetry?
Thinking Summary
The user is asking about the RMS (Root Mean Square) computation in Section 4.2 of the Carlini et al. paper and how it does or doesn't respect symmetry. Let me think through this carefully.

Looking at the paper, they're evaluating their attack success by computing the RMS error between their extracted matrix $\tilde{\mathbf{E}}^T$ and the actual weight matrix $\mathbf{E}^T$, after allowing for an $h \times h$ affine transformation. The key part is that they solve a least squares system to find the best aligning transformation $\mathbf{G}$:

- They solve $\tilde{\mathbf{E}}^T \cdot \mathbf{G} \approx \mathbf{E}^T$ for $\mathbf{G}$
- Then they report the RMS of $\mathbf{E}^T$ and $\tilde{\mathbf{E}}^T \cdot \mathbf{G}$

This is explicitly respecting the symmetry! They're not computing the naive RMS between the raw extracted matrix and the true matrix, which would be meaningless since the extracted matrix is only defined up to an $h \times h$ transformation. The paper proves that you can only recover $\mathbf{E}^T$ up to multiplication by some matrix $\mathbf{G}$, and their evaluation methodology explicitly accounts for this by finding the best possible $\mathbf{G}$ before computing the error.

This is a good example of symmetry-aware evaluation: they're measuring extraction success within the equivalence class of matrices that differ by right-multiplication by an invertible $h \times h$ matrix.

Actually, in the appendices they go even further and show that with additional assumptions (like the model having residual connections and normalization layers), you can narrow it down to recovery up to an orthogonal transformation rather than a general affine transformation. But their practical evaluation uses the more general affine case.
Looking at the RMS computation in Section 4.2 of the Carlini et al. paper, it's actually a great example of symmetry-aware evaluation. Let me break down what they're doing and why it respects the inherent symmetries:
The RMS Computation
The paper states:
"we evaluate the attack success rate by comparing the root mean square (RMS) between our extracted matrix and the actual weight matrix, after allowing for a affine transformation."
Specifically, they:

- Solve the least squares system $\tilde{\mathbf{E}}^T \cdot \mathbf{G} \approx \mathbf{E}^T$ for $\mathbf{G}$
- Report the RMS of $\mathbf{E}^T$ and $\tilde{\mathbf{E}}^T \cdot \mathbf{G}$ (see the sketch below)
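In code, the whole evaluation is a single least-squares solve followed by a residual computation. A minimal numpy sketch (the function name `aligned_rms` and the shape conventions are my own assumptions; the paper specifies only the least-squares system and the RMS):

```python
import numpy as np

def aligned_rms(E_tilde_T: np.ndarray, E_T: np.ndarray) -> float:
    """RMS between extracted and true matrices after the best h x h
    alignment. Both inputs are assumed to have shape (vocab_size, h)."""
    # Find G minimizing ||E_tilde_T @ G - E_T||_F via least squares.
    G, *_ = np.linalg.lstsq(E_tilde_T, E_T, rcond=None)
    residual = E_tilde_T @ G - E_T
    return float(np.sqrt(np.mean(residual ** 2)))
```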
Why This Respects Symmetry
This evaluation explicitly accounts for the fundamental symmetry in their extraction problem. They proved (Lemma 4.2) that given only API access, you can only recover
$$\tilde{\mathbf{E}}^T = \mathbf{E}^T \cdot \mathbf{G}$$
for some unknown $h \times h$ matrix $\mathbf{G}$. This isn't a limitation of their attack; it's a fundamental information-theoretic constraint.
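To make the constraint concrete (a standard argument, not a quote from the paper; here $g(x)$ denotes the final hidden state, which the API never exposes directly): for any invertible $\mathbf{G}$, the reparameterized model with $\tilde{\mathbf{E}}^T = \mathbf{E}^T \cdot \mathbf{G}$ and hidden states $\tilde{g}(x) = \mathbf{G}^{-1} g(x)$ produces identical logits,
$$\tilde{\mathbf{E}}^T \, \tilde{g}(x) = (\mathbf{E}^T \mathbf{G})(\mathbf{G}^{-1} g(x)) = \mathbf{E}^T g(x),$$
so no number of queries can distinguish the two.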
The Symmetry Group
For transformer models with residual connections, the symmetry group consists of all transformations of the form
$$\mathbf{E}^T \mapsto \mathbf{E}^T \cdot \mathbf{G},$$
where $\mathbf{G}$ is any invertible $h \times h$ matrix (or orthogonal, if you include normalization constraints).
Why Direct RMS Would Be Wrong
If they computed RMS directly between $\tilde{\mathbf{E}}^T$ and $\mathbf{E}^T$ without alignment (see the demo after this list):

- The error could be arbitrarily large even for a perfect extraction (just expressed in a different basis)
- It would penalize the attack for not recovering information that's theoretically impossible to recover
- The metric wouldn't distinguish between genuine extraction errors and harmless reparameterizations
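A quick numerical illustration of these points, reusing the `aligned_rms` sketch above (a hypothetical demo, not an experiment from the paper): a "perfect" extraction expressed in a different basis has a large naive RMS but a near-zero aligned RMS.

```python
rng = np.random.default_rng(0)
l, h = 1000, 64
E_T = rng.normal(size=(l, h))              # stand-in "true" matrix
G_hidden = rng.normal(size=(h, h))         # some unknown reparameterization
E_tilde_T = E_T @ np.linalg.inv(G_hidden)  # perfect extraction, wrong basis

naive = np.sqrt(np.mean((E_tilde_T - E_T) ** 2))  # large: penalizes the basis choice
aligned = aligned_rms(E_tilde_T, E_T)             # near machine precision
```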
The Deeper Structure
What's elegant here is that they're implicitly working with equivalence classes:
- Two matrices are equivalent if they differ by right-multiplication by an invertible $h \times h$ matrix
- Their metric measures distance between equivalence classes, not between specific representatives
- This is the "right" way to measure extraction success given the symmetries
Comparison to Baseline
They note that a randomly initialized baseline yields a far larger aligned RMS than their attack achieves. This 100-500× improvement shows they're extracting real information, not just exploiting the alignment procedure.
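Continuing the hypothetical demo from above, the baseline is easy to replicate in spirit: aligning a purely random matrix leaves a large residual, because the $h \times h$ alignment has far too few degrees of freedom to fit an unrelated matrix.

```python
random_guess = rng.normal(size=(l, h))
baseline = aligned_rms(random_guess, E_T)
# Stays order-1: G has only h*h free parameters, while E_T has l*h
# entries, so alignment alone cannot manufacture a good fit.
```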
This approach beautifully illustrates your earlier point about defining fidelity modulo symmetries rather than seeking meaningless "exact" extraction!
Solving the least squares problem is equivalent to finding a minimizing symmetry?
Thinking Summary
The user is asking whether solving the least squares problem $\tilde{\mathbf{E}}^T \cdot \mathbf{G} \approx \mathbf{E}^T$ for $\mathbf{G}$ is equivalent to finding a minimizing symmetry transformation. This is a great observation - let me think through this carefully.

In the context of the paper:

- They extract $\tilde{\mathbf{E}}^T = \mathbf{E}^T \cdot \mathbf{G}_{\text{true}}$ for some unknown $\mathbf{G}_{\text{true}}$
- They want to measure how close their extraction is to the true $\mathbf{E}^T$
- But they can only recover things up to this symmetry

So when they solve $\tilde{\mathbf{E}}^T \cdot \mathbf{G} \approx \mathbf{E}^T$, they're finding the $\mathbf{G}$ that best aligns their extracted matrix with the true one. This is indeed finding the "best" symmetry transformation that minimizes the distance between their extraction and the truth.

More formally:

- The symmetry group acts on the space of matrices by right multiplication
- They're finding $\mathbf{G}^* = \arg\min_{\mathbf{G}} \|\tilde{\mathbf{E}}^T \cdot \mathbf{G} - \mathbf{E}^T\|_F$
- This is finding the group element that moves their extracted matrix closest to the true matrix
- The residual error after this optimal alignment is their measure of extraction quality

So yes, the user is correct - solving the least squares problem is exactly finding the minimizing element in the symmetry group!
Yes, exactly! This is a beautiful way to think about it. The least squares problem is precisely finding the optimal symmetry transformation.
The Mathematical Structure
When they solve
$$\mathbf{G}^\star = \arg\min_{\mathbf{G}} \left\| \tilde{\mathbf{E}}^T \cdot \mathbf{G} - \mathbf{E}^T \right\|_F,$$
they're finding the group element that minimizes the distance between:

- Their extracted matrix transformed by $\mathbf{G}^\star$, i.e. $\tilde{\mathbf{E}}^T \cdot \mathbf{G}^\star$
- The true matrix $\mathbf{E}^T$
Why This Is Natural
Since they proved you can only recover $\mathbf{E}^T$ up to right-multiplication by an invertible $h \times h$ matrix, the space of all possible "correct" extractions is the orbit
$$\mathcal{O}(\tilde{\mathbf{E}}^T) = \left\{ \tilde{\mathbf{E}}^T \cdot \mathbf{G} : \mathbf{G} \in \mathrm{GL}_h(\mathbb{R}) \right\}.$$
Their evaluation metric asks: "What's the closest point in our extracted matrix's orbit to the true matrix?"
The Geometry
This has an elegant geometric interpretation:

- The symmetry group $\mathrm{GL}_h(\mathbb{R})$ acts on the space of matrices by right multiplication
- Each orbit represents an equivalence class of functionally identical matrices
- The least squares solution finds the point in $\tilde{\mathbf{E}}^T$'s orbit closest to $\mathbf{E}^T$
- The residual RMS measures the distance between orbits, not between arbitrary representatives
Connection to Optimization on Manifolds
This is actually a classic problem in optimization on manifolds:
- The quotient space (matrices modulo the symmetry) forms a manifold
- They're computing the distance between points on this quotient manifold
- The least squares problem is the standard way to compute this distance
So yes, solving the least squares problem is exactly finding the symmetry transformation that best aligns their extraction with ground truth. The final RMS then measures only the "real" error that can't be explained by symmetries.
This reinforces why naive RMS without alignment would be meaningless - it would conflate actual extraction error with arbitrary choice of basis!
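A closing aside: if you restrict the symmetry group to orthogonal matrices (the tighter characterization the paper's appendices establish under residual-connection and normalization assumptions), the optimal alignment even has a closed form via the classic orthogonal Procrustes problem. A sketch of that variant (my own illustration; the paper's reported numbers use the general affine alignment):

```python
import numpy as np

def procrustes_rms(E_tilde_T: np.ndarray, E_T: np.ndarray) -> float:
    """RMS after the best *orthogonal* alignment Q, computed in closed
    form via SVD (orthogonal Procrustes)."""
    U, _, Vt = np.linalg.svd(E_tilde_T.T @ E_T)
    Q = U @ Vt  # argmin over orthogonal Q of ||E_tilde_T @ Q - E_T||_F
    return float(np.sqrt(np.mean((E_tilde_T @ Q - E_T) ** 2)))
```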