Stealing Part of a Production Language Model
Thread (64 posts)
Thinking about the "attack success rate" evaluation in §4.2, as well as the included context focusing on explicit characterization of the symmetries, can you explain the RMS computation and how it does or doesn't respect symmetry?
Thinking Summary
The user is asking about the RMS (Root Mean Square) computation in Section 4.2 of the Carlini et al. paper and how it does or doesn't respect symmetry. Let me think through this carefully.

Looking at the paper, they're evaluating their attack success by computing the RMS error between their extracted matrix $\tilde{\mathbf{E}}^T$ and the actual weight matrix $\mathbf{E}^T$, after allowing for an $h \times h$ affine transformation. The key part is that they solve a least squares system to find the best aligning transformation $\mathbf{G}$:

- They solve $\tilde{\mathbf{E}}^T \cdot \mathbf{G} \approx \mathbf{E}^T$ for $\mathbf{G}$
- Then they report the RMS of $\mathbf{E}^T$ and $\tilde{\mathbf{E}}^T \cdot \mathbf{G}$

This is explicitly respecting the symmetry! They're not computing the naive RMS between the raw extracted matrix and the true matrix, which would be meaningless since the extracted matrix is only defined up to an $h \times h$ transformation. The paper proves that you can only recover $\mathbf{E}^T$ up to multiplication by some matrix $\mathbf{G}$, and their evaluation methodology explicitly accounts for this by finding the best possible $\mathbf{G}$ before computing the error.

This is a good example of symmetry-aware evaluation: they're measuring extraction success within the equivalence class of matrices that differ by right-multiplication by an invertible $h \times h$ matrix.

Actually, in the appendices they go even further and show that with additional assumptions (like the model having residual connections and normalization layers), you can narrow it down to recovery up to an orthogonal transformation rather than a general affine transformation. But their practical evaluation uses the more general affine case.
Looking at the RMS computation in Section 4.2 of the Carlini et al. paper, it's actually a great example of symmetry-aware evaluation. Let me break down what they're doing and why it respects the inherent symmetries:
The RMS Computation
The paper states:
"we evaluate the attack success rate by comparing the root mean square (RMS) between our extracted matrix and the actual weight matrix, after allowing for a affine transformation."
Specifically, they:

- Solve the least squares system $\tilde{\mathbf{E}}^T \cdot \mathbf{G} \approx \mathbf{E}^T$ for $\mathbf{G}$
- Report the RMS of $\mathbf{E}^T$ and $\tilde{\mathbf{E}}^T \cdot \mathbf{G}$ (see the sketch below)
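In code, the whole evaluation is a single least-squares solve followed by a residual computation. A minimal numpy sketch (the function name `aligned_rms` and the shape conventions are my own assumptions; the paper specifies only the least-squares system and the RMS):

```python
import numpy as np

def aligned_rms(E_tilde_T: np.ndarray, E_T: np.ndarray) -> float:
    """RMS between extracted and true matrices after the best h x h
    alignment. Both inputs are assumed to have shape (vocab_size, h)."""
    # Find G minimizing ||E_tilde_T @ G - E_T||_F via least squares.
    G, *_ = np.linalg.lstsq(E_tilde_T, E_T, rcond=None)
    residual = E_tilde_T @ G - E_T
    return float(np.sqrt(np.mean(residual ** 2)))
```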
Why This Respects Symmetry
This evaluation explicitly accounts for the fundamental symmetry in their extraction problem. They proved (Lemma 4.2) that given only API access, you can only recover
$$\tilde{\mathbf{E}}^T = \mathbf{E}^T \cdot \mathbf{G}$$
for some unknown $h \times h$ matrix $\mathbf{G}$. This isn't a limitation of their attack; it's a fundamental information-theoretic constraint.
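To make the constraint concrete (a standard argument, not a quote from the paper; here $g(x)$ denotes the final hidden state, which the API never exposes directly): for any invertible $\mathbf{G}$, the reparameterized model with $\tilde{\mathbf{E}}^T = \mathbf{E}^T \cdot \mathbf{G}$ and hidden states $\tilde{g}(x) = \mathbf{G}^{-1} g(x)$ produces identical logits,
$$\tilde{\mathbf{E}}^T \, \tilde{g}(x) = (\mathbf{E}^T \mathbf{G})(\mathbf{G}^{-1} g(x)) = \mathbf{E}^T g(x),$$
so no number of queries can distinguish the two.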
The Symmetry Group
For transformer models with residual connections, the symmetry group consists of all transformations of the form
$$\mathbf{E}^T \mapsto \mathbf{E}^T \cdot \mathbf{G},$$
where $\mathbf{G}$ is any invertible $h \times h$ matrix (or orthogonal, if you include normalization constraints).
Why Direct RMS Would Be Wrong
If they computed RMS directly between $\tilde{\mathbf{E}}^T$ and $\mathbf{E}^T$ without alignment (see the demo after this list):

- The error could be arbitrarily large even for a perfect extraction (just expressed in a different basis)
- It would penalize the attack for not recovering information that's theoretically impossible to recover
- The metric wouldn't distinguish between genuine extraction errors and harmless reparameterizations
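A quick numerical illustration of these points, reusing the `aligned_rms` sketch above (a hypothetical demo, not an experiment from the paper): a "perfect" extraction expressed in a different basis has a large naive RMS but a near-zero aligned RMS.

```python
rng = np.random.default_rng(0)
l, h = 1000, 64
E_T = rng.normal(size=(l, h))              # stand-in "true" matrix
G_hidden = rng.normal(size=(h, h))         # some unknown reparameterization
E_tilde_T = E_T @ np.linalg.inv(G_hidden)  # perfect extraction, wrong basis

naive = np.sqrt(np.mean((E_tilde_T - E_T) ** 2))  # large: penalizes the basis choice
aligned = aligned_rms(E_tilde_T, E_T)             # near machine precision
```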
The Deeper Structure
What's elegant here is that they're implicitly working with equivalence classes:
- Two matrices are equivalent if they differ by right-multiplication by an invertible $h \times h$ matrix
- Their metric measures distance between equivalence classes, not between specific representatives
- This is the "right" way to measure extraction success given the symmetries
Comparison to Baseline
They note that a randomly initialized baseline yields a far larger aligned RMS than their attack achieves. This 100-500× improvement shows they're extracting real information, not just exploiting the alignment procedure.
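Continuing the hypothetical demo from above, the baseline is easy to replicate in spirit: aligning a purely random matrix leaves a large residual, because the $h \times h$ alignment has far too few degrees of freedom to fit an unrelated matrix.

```python
random_guess = rng.normal(size=(l, h))
baseline = aligned_rms(random_guess, E_T)
# Stays order-1: G has only h*h free parameters, while E_T has l*h
# entries, so alignment alone cannot manufacture a good fit.
```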
This approach beautifully illustrates your earlier point about defining fidelity modulo symmetries rather than seeking meaningless "exact" extraction!
Solving the least squares problem is equivalent to finding a minimizing symmetry?
Thinking Summary
The user is asking whether solving the least squares problem $\tilde{\mathbf{E}}^T \cdot \mathbf{G} \approx \mathbf{E}^T$ for $\mathbf{G}$ is equivalent to finding a minimizing symmetry transformation. This is a great observation - let me think through this carefully.

In the context of the paper:

- They extract $\tilde{\mathbf{E}}^T = \mathbf{E}^T \cdot \mathbf{G}_{\text{true}}$ for some unknown $\mathbf{G}_{\text{true}}$
- They want to measure how close their extraction is to the true $\mathbf{E}^T$
- But they can only recover things up to this symmetry

So when they solve $\tilde{\mathbf{E}}^T \cdot \mathbf{G} \approx \mathbf{E}^T$, they're finding the $\mathbf{G}$ that best aligns their extracted matrix with the true one. This is indeed finding the "best" symmetry transformation that minimizes the distance between their extraction and the truth.

More formally:

- The symmetry group acts on the space of matrices by right multiplication
- They're finding $\mathbf{G}^* = \arg\min_{\mathbf{G}} \|\tilde{\mathbf{E}}^T \cdot \mathbf{G} - \mathbf{E}^T\|_F$
- This is finding the group element that moves their extracted matrix closest to the true matrix
- The residual error after this optimal alignment is their measure of extraction quality

So yes, the user is correct - solving the least squares problem is exactly finding the minimizing element in the symmetry group!
Yes, exactly! This is a beautiful way to think about it. The least squares problem is precisely finding the optimal symmetry transformation.
The Mathematical Structure
When they solve
$$\mathbf{G}^\star = \arg\min_{\mathbf{G}} \left\| \tilde{\mathbf{E}}^T \cdot \mathbf{G} - \mathbf{E}^T \right\|_F,$$
they're finding the group element that minimizes the distance between:

- Their extracted matrix transformed by $\mathbf{G}^\star$, i.e. $\tilde{\mathbf{E}}^T \cdot \mathbf{G}^\star$
- The true matrix $\mathbf{E}^T$
Why This Is Natural
Since they proved you can only recover $\mathbf{E}^T$ up to right-multiplication by an invertible $h \times h$ matrix, the space of all possible "correct" extractions is the orbit
$$\mathcal{O}(\tilde{\mathbf{E}}^T) = \left\{ \tilde{\mathbf{E}}^T \cdot \mathbf{G} : \mathbf{G} \in \mathrm{GL}_h(\mathbb{R}) \right\}.$$
Their evaluation metric asks: "What's the closest point in our extracted matrix's orbit to the true matrix?"
The Geometry
This has an elegant geometric interpretation:

- The symmetry group $\mathrm{GL}_h(\mathbb{R})$ acts on the space of matrices by right multiplication
- Each orbit represents an equivalence class of functionally identical matrices
- The least squares solution finds the point in $\tilde{\mathbf{E}}^T$'s orbit closest to $\mathbf{E}^T$
- The residual RMS measures the distance between orbits, not between arbitrary representatives
Connection to Optimization on Manifolds
This is actually a classic problem in optimization on manifolds:
- The quotient space (matrices modulo the symmetry) forms a manifold
- They're computing the distance between points on this quotient manifold
- The least squares problem is the standard way to compute this distance
So yes, solving the least squares problem is exactly finding the symmetry transformation that best aligns their extraction with ground truth. The final RMS then measures only the "real" error that can't be explained by symmetries.
This reinforces why naive RMS without alignment would be meaningless - it would conflate actual extraction error with arbitrary choice of basis!
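A closing aside: if you restrict the symmetry group to orthogonal matrices (the tighter characterization the paper's appendices establish under residual-connection and normalization assumptions), the optimal alignment even has a closed form via the classic orthogonal Procrustes problem. A sketch of that variant (my own illustration; the paper's reported numbers use the general affine alignment):

```python
import numpy as np

def procrustes_rms(E_tilde_T: np.ndarray, E_T: np.ndarray) -> float:
    """RMS after the best *orthogonal* alignment Q, computed in closed
    form via SVD (orthogonal Procrustes)."""
    U, _, Vt = np.linalg.svd(E_tilde_T.T @ E_T)
    Q = U @ Vt  # argmin over orthogonal Q of ||E_tilde_T @ Q - E_T||_F
    return float(np.sqrt(np.mean((E_tilde_T @ Q - E_T) ** 2)))
```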