Numinex

Thread (64 posts)

Looking at §4.2 Full Layer Extraction, I am particularly curious about the structure of $\mathbf{G}$, since in a few other cases (context transcluded) the questions I've had have come down to: what are the relevant symmetries, and how do we characterize them?

Thinking Summary

The user is asking about the structure of the matrix $\mathbf{G}$ in the Carlini et al. paper's Section 4.2 "Full Layer Extraction (Up to Symmetries)". They're particularly interested in understanding what symmetries are at play here and how they're characterized.

Looking at the paper, in Section 4.2 they show that they can recover $\tilde{\mathbf{W}} = \mathbf{W} \cdot \mathbf{G}$ for some $\mathbf{G} \in \mathbb{R}^{h \times h}$. The key insight is that this $\mathbf{G}$ represents the inherent ambiguity in the extraction - you can't determine the exact final layer weights, only the weights up to this transformation.

The paper proves (in Lemma 4.2 and its proof in Appendix B) that:

1. In general, $\mathbf{G}$ can be any invertible $h \times h$ matrix
2. With additional assumptions about normalization layers, $\mathbf{G}$ is restricted to be an orthogonal matrix

This is quite interesting in the context of the user's previous questions about reparameterization symmetries. The matrix $\mathbf{G}$ here represents exactly the symmetry group of valid reparameterizations that preserve the model's output. Let me explain this carefully, connecting it to their interest in symmetries.

Looking at the structure of $\mathbf{G}$ in the Carlini et al. paper, this is indeed a fascinating example of how symmetries fundamentally limit what can be extracted!

The Structure of $\mathbf{G}$

The paper shows two key results about $\mathbf{G}$:

1. General Case: $\mathbf{G}$ can be any invertible $h \times h$ matrix

Without additional constraints, if you have:

  • True embedding projection: $\mathbf{W}$
  • Hidden states: $g_\theta(\mathbf{x})$
  • Observable logits: $\mathbf{W} \cdot g_\theta(\mathbf{x})$

Then for any invertible $\mathbf{G}$, the reparameterization:

  • $\mathbf{W}' = \mathbf{W} \cdot \mathbf{G}$
  • $g'_\theta(\mathbf{x}) = \mathbf{G}^{-1} \cdot g_\theta(\mathbf{x})$

produces identical outputs: $\mathbf{W}' \cdot g'_\theta(\mathbf{x}) = \mathbf{W} \cdot \mathbf{G} \cdot \mathbf{G}^{-1} \cdot g_\theta(\mathbf{x}) = \mathbf{W} \cdot g_\theta(\mathbf{x})$
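As a quick sanity check, here is a minimal numerical sketch of that identity (toy dimensions and random matrices, not anything from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
h, l = 64, 256                    # hypothetical hidden size and vocab size

W = rng.normal(size=(l, h))       # "true" embedding projection W
g = rng.normal(size=h)            # a hidden state g_theta(x)
G = rng.normal(size=(h, h))       # a generic (almost surely invertible) reparameterization

W_prime = W @ G                   # W' = W G
g_prime = np.linalg.solve(G, g)   # g' = G^{-1} g

# The observable logits are unchanged by the reparameterization.
assert np.allclose(W_prime @ g_prime, W @ g)
```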

2. With Normalization Layers: $\mathbf{G}$ must be orthogonal

The paper's clever insight (proven in Appendix B) is that LayerNorm/RMSNorm adds constraints:

  • These layers project activations onto a sphere: $\|g_\theta(\mathbf{x})\| = \text{constant}$
  • For the reparameterization to preserve this sphere property, we need $\|\mathbf{G}^{-1} \cdot g_\theta(\mathbf{x})\| = \|g_\theta(\mathbf{x})\|$
  • This forces $\mathbf{G}$ to be orthogonal (norm-preserving); see the numerical sketch below
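A small sketch of the distinction, assuming nothing beyond standard linear algebra: an orthogonal matrix leaves every norm unchanged, while a generic invertible matrix does not, so only an orthogonal $\mathbf{G}$ can satisfy the constraint for all hidden states.

```python
import numpy as np

rng = np.random.default_rng(1)
h = 32

# Random orthogonal matrix via QR decomposition of a Gaussian matrix.
U, _ = np.linalg.qr(rng.normal(size=(h, h)))
# Generic invertible (almost surely non-orthogonal) matrix.
G = rng.normal(size=(h, h))

x = rng.normal(size=h)

print(np.linalg.norm(U.T @ x), np.linalg.norm(x))                 # equal: U^{-1} = U^T preserves norms
print(np.linalg.norm(np.linalg.solve(G, x)), np.linalg.norm(x))   # generally different for G^{-1}
```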

The Symmetry Group Structure

For transformers with normalization, the symmetry group is: $G = O(h) \cong SO(h) \rtimes \mathbb{Z}_2$

Where:

  • $O(h)$ is the orthogonal group (all $h \times h$ orthogonal matrices)
  • $SO(h)$ is the special orthogonal group (rotations, $\det = 1$)
  • $\mathbb{Z}_2$ represents reflections ($\det = -1$)

This has dimension $\frac{h(h-1)}{2}$, meaning there are that many continuous degrees of freedom in the reparameterization.
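Where that count comes from (a standard dimension count, included here for completeness): the constraint $\mathbf{U}^T\mathbf{U} = \mathbf{I}$ imposes one equation per entry of a symmetric $h \times h$ matrix.

$$
\dim O(h) = \underbrace{h^2}_{\text{entries of } \mathbf{U}} - \underbrace{\tfrac{h(h+1)}{2}}_{\text{independent constraints in } \mathbf{U}^T\mathbf{U} = \mathbf{I}} = \frac{h(h-1)}{2}
$$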

Why This Matters

This connects beautifully to your broader question about symmetries:

  1. Information-theoretic limit: The attack can extract $h \times l$ parameters minus $\frac{h(h-1)}{2}$ degrees of freedom

    • For GPT-3.5 with $h \approx 4096$, that's ~8.4M unrecoverable parameters (arithmetic checked after this list)!
  2. Practical implications:

    • The "true" weights are meaningless - only the equivalence class matters
    • Any defense trying to hide exact weights while exposing logits is fundamentally flawed
    • The symmetry structure tells us exactly what information is inherently private
  3. Connection to other symmetries: Unlike the SAE scaling symmetries (which are feature-wise), these are global transformations mixing all hidden dimensions - much richer structure!
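The arithmetic behind that ~8.4M figure, using the $h \approx 4096$ value quoted above:

```python
h = 4096
unrecoverable = h * (h - 1) // 2   # dimension of O(h)
print(f"{unrecoverable:,}")        # 8,386,560, i.e. ~8.4M
```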

The fact that they recover commercial models with RMS error ~$10^{-4}$ after optimal alignment shows they're getting essentially everything that's theoretically possible. The matrix $\mathbf{G}$ isn't noise or approximation error - it's the fundamental ambiguity inherent in the problem.
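For intuition about "after optimal alignment": one standard way to do this (not necessarily the authors' exact evaluation code) is to solve an orthogonal Procrustes problem, finding the orthogonal matrix that best maps the extracted weights onto the reference weights before measuring RMS error. A self-contained sketch with synthetic data:

```python
import numpy as np

rng = np.random.default_rng(2)
l, h = 512, 64

W = rng.normal(size=(l, h))                               # reference weights
U_true, _ = np.linalg.qr(rng.normal(size=(h, h)))         # unknown symmetry
W_tilde = W @ U_true + 1e-4 * rng.normal(size=(l, h))     # "extracted" weights: W·U plus small noise

# Orthogonal Procrustes: find orthogonal R minimizing ||W_tilde R - W||_F.
A, _, Bt = np.linalg.svd(W_tilde.T @ W)
R = A @ Bt

rms = np.sqrt(np.mean((W_tilde @ R - W) ** 2))
print(rms)   # on the order of the injected 1e-4 noise
```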

Thinking Summary

The user is asking about the normalization layers used in Gemma 2. Let me search through the provided document content to find information about the normalization layers in Gemma 2.

Looking at the document, I can see it's the Gemma 2 technical report. In the "Model Architecture" section (Section 2), Table 1 shows the overview of main model parameters and design choices. It mentions "Pre-norm: yes" and "Post-norm: yes" for all three model sizes (2B, 9B, and 27B).

More specifically, under the architecture details, I found: "Post-norm and pre-norm with RMSNorm. To stabilize training, we use RMSNorm~\citep{rmsnorm} to normalize the input and output of each transformer sub-layer, the attention layer, and the feedforward layer."

So Gemma 2 uses RMSNorm (Root Mean Square Normalization) for both pre-normalization and post-normalization of transformer sub-layers.

Looking at the Gemma 2 technical report, I can find the specific information about normalization layers in the Model Architecture section.

Gemma 2 uses RMSNorm (Root Mean Square Normalization) for its normalization layers. Specifically:

Post-norm and pre-norm with RMSNorm: The paper states that "To stabilize training, we use RMSNorm to normalize the input and output of each transformer sub-layer, the attention layer, and the feedforward layer."

This means Gemma 2 applies RMSNorm:

  • Before each transformer sub-layer (pre-norm)
  • After each transformer sub-layer (post-norm)
  • This includes both attention layers and feedforward layers

From Table 1 in the paper, we can see that all three model sizes (2B, 9B, and 27B) use this same normalization approach with both pre-norm and post-norm set to "yes".

This is particularly relevant to the model stealing paper discussion, as the Carlini et al. paper specifically notes that RMSNorm (unlike LayerNorm) doesn't include a centering step, which affects the symmetry constraints. RMSNorm projects activations onto a sphere without reducing the dimensionality, whereas LayerNorm's centering step projects onto a $(h-1)$-dimensional subspace. This means the orthogonal transformation constraints discussed in the stealing paper would apply to Gemma 2 without the additional restrictions needed for LayerNorm models.
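A minimal sketch of that geometric difference (bare-bones norms with $\gamma = 1$, $\beta = 0$, a simplification of what Gemma 2 actually learns): LayerNorm outputs always sum to zero, so they live in a $(h-1)$-dimensional subspace, while RMSNorm outputs keep all $h$ dimensions.

```python
import numpy as np

def layernorm(x, eps=1e-6):
    # Center, then normalize by the standard deviation (gamma=1, beta=0 for clarity).
    mu = x.mean()
    return (x - mu) / np.sqrt(((x - mu) ** 2).mean() + eps)

def rmsnorm(x, eps=1e-6):
    # No centering: divide by the root-mean-square only (gamma=1 for clarity).
    return x / np.sqrt((x ** 2).mean() + eps)

x = np.random.default_rng(3).normal(size=8)
print(layernorm(x).sum())   # ~0: output constrained to the sum-zero subspace
print(rmsnorm(x).sum())     # generally nonzero: no dimensionality reduction
```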

Thinking Summary

Let me think about the symmetries in Gemma 2 models based on what we know from the Carlini paper and the Gemma 2 architecture. From the Carlini paper, we know that:

1. Without normalization layers, the symmetry group would be all invertible h×h matrices (GL(h))
2. With RMSNorm, the symmetry group is restricted to orthogonal matrices O(h)
3. With LayerNorm, there's a further restriction that the orthogonal matrix must preserve the all-ones vector

Since Gemma 2 uses RMSNorm (not LayerNorm), and has residual connections, the relevant symmetries should be the orthogonal group O(h). Let me also consider other architectural features that might affect symmetries:

  • Tied embeddings (yes for all models)
  • GQA (grouped query attention)
  • Local sliding window + global attention alternating
  • Logit soft-capping
  • Post-norm and pre-norm with RMSNorm

The key insight from the Carlini paper is that RMSNorm preserves norms, so any reparameterization must also preserve norms, leading to orthogonal transformations.

Based on the Carlini et al. paper's analysis and Gemma 2's architecture, here are the relevant symmetries:

Symmetries in Gemma 2 Models

1. Primary Symmetry: Orthogonal Transformations

For any orthogonal matrix $\mathbf{U} \in O(h)$ (where $\mathbf{U}^T\mathbf{U} = \mathbf{I}$), the following reparameterization preserves model outputs:

  • Embedding projection: $\mathbf{W} \rightarrow \mathbf{W} \cdot \mathbf{U}^T$
  • Hidden states: $g_\theta(\mathbf{x}) \rightarrow \mathbf{U} \cdot g_\theta(\mathbf{x})$

This gives the symmetry group: $G = O(h) = \{\mathbf{U} \in \mathbb{R}^{h \times h} : \mathbf{U}^T \mathbf{U} = \mathbf{I}\}$

For Gemma 2's hidden dimensions:

  • 2B model: $O(2304)$ - about 2.65M degrees of freedom
  • 9B model: $O(3584)$ - about 6.42M degrees of freedom
  • 27B model: $O(4608)$ - about 10.6M degrees of freedom (all three counts checked in the snippet below)
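These are just $\frac{h(h-1)}{2}$ evaluated at the hidden sizes quoted above; a quick check:

```python
for name, h in [("Gemma 2 2B", 2304), ("Gemma 2 9B", 3584), ("Gemma 2 27B", 4608)]:
    print(f"{name}: {h * (h - 1) // 2:,}")   # 2,653,056 / 6,420,736 / 10,614,528
```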

2. Why These Symmetries Exist

The symmetries arise because:

  1. RMSNorm invariance: RMSNorm computes $\gamma \cdot \frac{\mathbf{x}}{\text{RMS}(\mathbf{x})}$ with $\text{RMS}(\mathbf{x}) \propto \|\mathbf{x}\|$, and orthogonal matrices preserve norms: $\|\mathbf{U}\mathbf{x}\| = \|\mathbf{x}\|$

  2. Residual connections: Each layer adds to the residual stream, so transformations must be consistent across all layers

  3. Attention mechanism: With appropriate weight transformations, attention patterns remain unchanged under orthogonal transformations

3. Implementation of Symmetry

To maintain model equivalence under transformation $\mathbf{U}$, you must transform (a toy end-to-end check follows this list):

  • All layer input projections: $\mathbf{W}_{\text{in}} \rightarrow \mathbf{W}_{\text{in}} \cdot \mathbf{U}^T$
  • All layer output projections: $\mathbf{W}_{\text{out}} \rightarrow \mathbf{U} \cdot \mathbf{W}_{\text{out}}$
  • For attention layers with GQA:
    • Query projections: $\mathbf{W}_Q \rightarrow \mathbf{W}_Q \cdot \mathbf{U}^T$
    • Key projections: $\mathbf{W}_K \rightarrow \mathbf{W}_K \cdot \mathbf{U}^T$
    • Value projections: $\mathbf{W}_V \rightarrow \mathbf{W}_V \cdot \mathbf{U}^T$
    • Output projections: $\mathbf{W}_O \rightarrow \mathbf{U} \cdot \mathbf{W}_O$
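Here is a minimal numerical sketch of this recipe. It uses a single pre-norm residual MLP block with RMSNorm and a final norm before the unembedding, not the full Gemma 2 architecture (no attention, and the RMSNorm scale $\gamma$ is fixed to 1; a learned $\gamma$ would be folded into the adjacent matrix before rotating):

```python
import numpy as np

rng = np.random.default_rng(4)
h, d_ff, vocab = 16, 64, 100         # toy sizes

def rmsnorm(x, eps=1e-6):
    # gamma fixed to 1; this keeps rmsnorm(U @ x) == U @ rmsnorm(x) transparent.
    return x / np.sqrt((x ** 2).mean() + eps)

W_in  = rng.normal(size=(d_ff, h))   # reads from the residual stream
W_out = rng.normal(size=(h, d_ff))   # writes back to the residual stream
W     = rng.normal(size=(vocab, h))  # embedding projection / unembedding

def forward(x0, W_in, W_out, W):
    x1 = x0 + W_out @ np.maximum(W_in @ rmsnorm(x0), 0.0)   # pre-norm residual MLP block
    return W @ rmsnorm(x1)                                  # final norm, then logits

U, _ = np.linalg.qr(rng.normal(size=(h, h)))                # random orthogonal U

x0 = rng.normal(size=h)
logits     = forward(x0, W_in, W_out, W)
logits_rot = forward(U @ x0, W_in @ U.T, U @ W_out, W @ U.T)  # transformed per the list above

assert np.allclose(logits, logits_rot)
```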

4. Additional Architectural Considerations

Some Gemma 2 features don't introduce new symmetries but constrain how they're applied:

  • Tied embeddings: The input embedding and output projection share weights, so transformations must be consistent
  • Logit soft-capping: This is applied after the final projection, so doesn't affect the core symmetry
  • Local/Global attention alternation: Both attention types must be transformed consistently

5. What This Means for Model Stealing

An attacker who extracts the embedding projection matrix $\mathbf{W}$ can only recover it up to an unknown orthogonal transformation $\mathbf{U}$. They get: $\tilde{\mathbf{W}} = \mathbf{W} \cdot \mathbf{U}$

This ambiguity represents exactly $\frac{h(h-1)}{2}$ unknown parameters (the dimension of $O(h)$), which cannot be resolved without additional information about the model's internals.

The fact that Gemma 2 uses RMSNorm rather than LayerNorm actually makes it slightly easier to characterize these symmetries - there's no additional constraint about preserving the all-ones vector that would further restrict the orthogonal group.

Thinking Summary

The user is asking about the difference between RMSNorm and LayerNorm. This is a fundamental question about normalization techniques in neural networks. Let me explain this clearly with mathematical formulations and practical implications.

Here's a detailed comparison of RMSNorm and LayerNorm:

LayerNorm (Layer Normalization)

Formula: $\text{LayerNorm}(\mathbf{x}) = \gamma \cdot \frac{\mathbf{x} - \mu}{\sigma} + \beta$

Where:

  • $\mu = \frac{1}{h}\sum_{i=1}^h x_i$ (mean)
  • $\sigma = \sqrt{\frac{1}{h}\sum_{i=1}^h (x_i - \mu)^2 + \varepsilon}$ (standard deviation)
  • $\gamma, \beta$ are learned scale and shift parameters
  • $\varepsilon$ is a small constant for numerical stability

Steps:

  1. Center the activations: $\mathbf{x} - \mu$
  2. Normalize by standard deviation: $\frac{\mathbf{x} - \mu}{\sigma}$
  3. Scale and shift with learned parameters: $\gamma \cdot (\cdot) + \beta$

RMSNorm (Root Mean Square Normalization)

Formula: $\text{RMSNorm}(\mathbf{x}) = \gamma \cdot \frac{\mathbf{x}}{\text{RMS}(\mathbf{x})} + \beta$

Where:

  • $\text{RMS}(\mathbf{x}) = \sqrt{\frac{1}{h}\sum_{i=1}^h x_i^2 + \varepsilon}$
  • $\gamma, \beta$ are learned parameters (often $\beta$ is omitted)

Steps:

  1. Normalize by RMS: $\frac{\mathbf{x}}{\text{RMS}(\mathbf{x})}$
  2. Scale (and optionally shift): $\gamma \cdot (\cdot) + \beta$

Key Differences

1. Centering

  • LayerNorm: Subtracts the mean (centers the data)
  • RMSNorm: No centering - works directly with the original values

2. Geometric Interpretation

  • LayerNorm: Projects onto a centered $(h-1)$-dimensional sphere
    • The centering constraint means $\sum_i x_i = 0$
    • Removes one degree of freedom
  • RMSNorm: Projects onto a full $h$-dimensional sphere
    • All vectors with the same magnitude are equivalent
    • No dimensionality reduction

3. Computational Efficiency

  • RMSNorm: Faster because it skips the mean calculation and subtraction
  • LayerNorm: Requires two passes over the data (mean, then variance)

4. Impact on Symmetries

As discussed in the Carlini paper:

For LayerNorm models:

  • Orthogonal transformations must preserve the centering property
  • The constraint is: $\mathbf{U} \cdot \mathbf{1} \in \{-\mathbf{1}, \mathbf{1}\}$
  • This restricts the orthogonal group to those that map the all-ones vector to $\pm$ itself

For RMSNorm models:

  • Any orthogonal transformation preserves the norm
  • Full $O(h)$ symmetry group applies without additional constraints (illustrated in the sketch below)
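A small illustration of that contrast, using bare-bones norms with $\gamma = 1$, $\beta = 0$ (an assumption for clarity). A permutation matrix is an orthogonal matrix that maps the all-ones vector to itself, so it commutes with LayerNorm; a generic random orthogonal matrix does not, while RMSNorm commutes with any orthogonal matrix.

```python
import numpy as np

def layernorm(x, eps=1e-6):
    mu = x.mean()
    return (x - mu) / np.sqrt(((x - mu) ** 2).mean() + eps)

def rmsnorm(x, eps=1e-6):
    return x / np.sqrt((x ** 2).mean() + eps)

rng = np.random.default_rng(5)
h = 8
x = rng.normal(size=h)

P = np.eye(h)[rng.permutation(h)]              # orthogonal, and P @ ones == ones
U, _ = np.linalg.qr(rng.normal(size=(h, h)))   # generic orthogonal matrix

print(np.allclose(layernorm(P @ x), P @ layernorm(x)))   # True: P respects the centering constraint
print(np.allclose(layernorm(U @ x), U @ layernorm(x)))   # False (almost surely): U moves the all-ones vector
print(np.allclose(rmsnorm(U @ x), U @ rmsnorm(x)))       # True: RMSNorm only needs norms preserved
```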

5. Practical Performance

  • RMSNorm: Often performs comparably to LayerNorm despite being simpler
  • Originally introduced as a computationally efficient alternative
  • Widely adopted in recent LLMs (LLaMA, Gemma, etc.)

Example Calculation

Given $\mathbf{x} = [2, -1, 3]$:

LayerNorm:

  • $\mu = \frac{2 + (-1) + 3}{3} = \frac{4}{3}$
  • Centered: $[\frac{2}{3}, -\frac{7}{3}, \frac{5}{3}]$
  • $\sigma = \sqrt{\frac{(\frac{2}{3})^2 + (-\frac{7}{3})^2 + (\frac{5}{3})^2}{3}} \approx 1.70$
  • Output: $\gamma \cdot [\frac{2/3}{1.70}, \frac{-7/3}{1.70}, \frac{5/3}{1.70}] + \beta$

RMSNorm:

  • $\text{RMS} = \sqrt{\frac{2^2 + (-1)^2 + 3^2}{3}} = \sqrt{\frac{14}{3}} \approx 2.16$
  • Output: $\gamma \cdot [\frac{2}{2.16}, \frac{-1}{2.16}, \frac{3}{2.16}] + \beta$

The RMSNorm output maintains the relative signs and rough proportions of the input, while LayerNorm's centering step changes the distribution more dramatically.
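The worked example is easy to reproduce numerically (with $\gamma = 1$, $\beta = 0$ and $\varepsilon$ dropped, so the numbers match the hand calculation above):

```python
import numpy as np

x = np.array([2.0, -1.0, 3.0])

# LayerNorm statistics
mu = x.mean()                            # 4/3
sigma = np.sqrt(((x - mu) ** 2).mean())  # ~1.70
print(mu, sigma, (x - mu) / sigma)

# RMSNorm statistic
rms = np.sqrt((x ** 2).mean())           # sqrt(14/3) ~ 2.16
print(rms, x / rms)
```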
