Numinex

Thread (64 posts)

Thinking Summary

The user is asking for definitions of several concepts from the paper about stealing parts of production language models. Let me define these clearly and in a logical order that builds understanding. I should start with the basic architecture concepts and build up to the specific terms. The paper is about transformer language models, so I'll frame everything in that context. Let me order these for conceptual clarity: 1. Hidden dimension / width (these are related) 2. Embedding projection layer 3. Logit vector and why it's higher dimensional I'll make sure to explain clearly and concisely.

Here are brief definitions of these concepts, ordered for clarity:

Hidden dimension

The internal representation size used throughout a transformer model's layers. For example, if a model processes information using 4096-dimensional vectors internally, its hidden dimension is 4096. This is the "width" of the neural network's intermediate computations.

Width

Another term for hidden dimension - it refers to how "wide" the model is in terms of the size of its internal representations. A model with hidden dimension 4096 is "wider" than one with hidden dimension 2048.

Embedding projection layer

The final linear transformation in a language model that converts from the hidden dimension to vocabulary size. It's a matrix of size vocabulary_size × hidden_dimension that projects the model's internal representation (e.g., 4096 dimensions) to a score for each possible token in the vocabulary (e.g., 50,000+ tokens).

Logit vector & why it's higher dimensional

The output of the embedding projection layer before applying softmax. It contains one score (logit) for each token in the vocabulary. It's higher dimensional than the hidden dimension because vocabularies are typically much larger than hidden dimensions. For example:

  • Hidden dimension: 4096
  • Vocabulary size: 50,000
  • Therefore logit vector: 50,000 dimensions

This dimensional mismatch (hidden << vocabulary) is what makes the embedding projection layer low-rank and exploitable by the attack described in the paper.
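A quick way to see this low-rank structure numerically (a toy sketch with made-up sizes, not the paper's code):

```python
import numpy as np

h, vocab, n = 64, 1000, 200           # toy hidden dim, vocab size, number of prompts
rng = np.random.default_rng(0)

W = rng.normal(size=(vocab, h))       # stand-in for the embedding projection layer
H = rng.normal(size=(h, n))           # stand-in for final hidden states of n prompts
logits = W @ H                        # each column is a vocab-sized logit vector

# Although each logit vector has 1000 entries, together they span at most h dimensions.
print(np.linalg.matrix_rank(logits))  # prints 64
```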

replyquoteparent
Thinking Summary

The user is asking what makes the "hidden dimension" hidden. This is a good question about terminology. The hidden dimension is called "hidden" for a few reasons: 1. It's not directly observable from the model's inputs or outputs 2. It's an internal representation that users don't directly interact with 3. In the context of this paper, it's literally hidden because companies like OpenAI don't disclose it 4. It's between the input and output layers, so it's "hidden" in the middle of the network I should explain both the general ML meaning and the specific context of this paper where it's literally a secret.

The "hidden" dimension is called hidden for two reasons:

Traditional ML meaning

It's "hidden" because it represents the model's internal representations that aren't directly visible from inputs or outputs:

  • You input text tokens (discrete symbols)
  • You get output probabilities over tokens
  • But in between, the model works with continuous vector representations of size h (the hidden dimension)
  • These intermediate representations are "hidden" from the user - you never directly see or manipulate these 4096-dimensional vectors

In this paper's context

It's literally hidden as a trade secret:

  • Companies like OpenAI and Google don't publicly disclose the hidden dimensions of their production models
  • The GPT-4 technical report explicitly states it withholds "details about the architecture (including model size)"
  • Before this attack, nobody knew that ada had hidden dimension 1024 or that babbage had 2048

So the hidden dimension is both:

  1. Architecturally hidden (internal to the model, not exposed in normal use)
  2. Commercially hidden (kept secret by AI companies)

The paper's attack makes the commercially hidden dimension no longer secret - they can recover this supposedly confidential architectural detail just by querying the API.

replyquoteparent
Thinking Summary

The user is asking specifically about the traditional ML meaning of "hidden" in hidden dimension, not the secrecy aspect. They're right - I over-explained the commercial secrecy part when they just wanted the technical ML context. Let me give a cleaner explanation of just the traditional ML sense.

Got it! In the traditional ML sense:

The "hidden" dimension is called hidden because these are the intermediate representations between input and output - they're not directly observable in the model's interface.

In a transformer:

  • Input layer: You see discrete token IDs (like "cat" = token 1234)
  • Hidden layers: The model converts these to continuous vectors of size h and processes them through multiple transformer blocks
  • Output layer: Projects back to vocabulary-sized predictions

The hidden dimension h is the size of these internal vector representations that flow through the model's layers. They're "hidden" because:

  • Users never directly specify or see these vectors
  • They exist only during the forward pass computation
  • They're learned representations, not interpretable features

It's analogous to hidden layers in any neural network - the intermediate computations that transform input to output but aren't part of the external interface.

replyquoteparent
Thinking Summary

The user is referring to the definitions from "High Accuracy and High Fidelity Extraction of Neural Networks" (arXiv:1909.01838) by Jagielski et al. They're asking about the relationship between accuracy and fidelity in model extraction attacks. Looking at the paper's definitions: 1. **Accuracy**: the stolen model $\hat{f}$ should match the performance of the target model $f$ on some particular data domain. For example, if the target is an image classifier, we might want the stolen model to match the target's overall accuracy on ImageNet. 2. **Fidelity**: the stolen model $\hat{f}$ should be functionally equivalent to the target model $f$ on all inputs. That is, for any valid input $x$, we want $\hat{f}(x) \approx f(x)$. The user is asking whether fidelity implies accuracy. Let me think about this: - If a stolen model has high fidelity (i.e., $\hat{f}(x) \approx f(x)$ for all inputs), then it would produce the same outputs as the target model on any test dataset. - This means it would achieve the same accuracy as the target model on any data domain. - So yes, perfect fidelity would imply matching accuracy. However, the converse is not true: - A model could achieve the same accuracy as the target model without matching its outputs on all inputs. - For example, two models could both achieve 95% accuracy on ImageNet but make mistakes on different images. So the user's intuition seems correct: fidelity is a stronger property that implies accuracy, but not vice versa.

You're absolutely right! Looking at the definitions from the Jagielski et al. paper:

  • Accuracy: The stolen model should match the target's performance on some particular data domain
  • Fidelity: The stolen model should produce (approximately) the same output as the target for all inputs

Fidelity is indeed the stronger property. If you achieve perfect fidelity—meaning your extracted model satisfies $\hat{f}(x) \approx f(x)$ for all inputs $x$—then you automatically achieve the same accuracy as the target model on any dataset, since you're making the same predictions.

However, the reverse isn't true. Two models can have identical accuracy (say 95% on ImageNet) while making mistakes on completely different examples. This is why the paper notes these objectives can be "in conflict for imperfect victim models"—a high-accuracy extraction might try to correct the victim's mistakes, while a high-fidelity extraction should replicate them.

The paper's Figure 1 illustrates this nicely: the green "high fidelity" line matches the blue oracle exactly (including its errors), while the orange "perfect accuracy" line correctly classifies all points but diverges from the oracle's behavior.
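A small synthetic illustration of the gap (hypothetical models, not from the paper): two classifiers with identical accuracy can still have low fidelity to one another.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=10_000)                        # ground-truth labels

# Victim and stolen model are each ~95% accurate, but err independently.
victim = np.where(rng.random(y.size) < 0.95, y, 1 - y)
stolen = np.where(rng.random(y.size) < 0.95, y, 1 - y)

print("victim accuracy:", (victim == y).mean())            # ~0.95
print("stolen accuracy:", (stolen == y).mean())            # ~0.95
print("fidelity (agreement):", (stolen == victim).mean())  # ~0.90, not ~1.0
```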

This distinction is important for different attack motivations:

  • Theft adversaries might prioritize accuracy (they want a model that performs well)
  • Reconnaissance adversaries need fidelity (for mounting black-box attacks, membership inference, etc., they need the extracted model to behave identically to the target)
replyquoteparent

In §3.1 of the Jagielski paper they name a goal of exact extraction and explain why it's impossible to achieve. This reminds me of some other recent questions I had that came down to reparameterization. I'm wondering why we would want to define fidelity of extraction in the terms they did rather than classifying the parameterization symmetries of the model and then seeking closeness up to symmetry. Can you think this through?

replyquoteparent
Thinking Summary

The user is making a really interesting connection between the model extraction work and the reparameterization symmetries they were thinking about in the context of SAEs. They're asking why we would define extraction fidelity in terms of matching outputs on all inputs rather than seeking closeness up to the inherent symmetries of the model. This is actually a deep question that touches on several important issues: 1. The Jagielski et al. paper does acknowledge that exact extraction is impossible due to symmetries - they mention scaling neurons, permuting neurons, and dead neurons as examples. 2. The Carlini et al. paper on "Stealing Part of a Production Language Model" actually does extract weights up to symmetries! They recover the embedding projection layer up to an h×h affine transformation, which is exactly the kind of symmetry-aware extraction the user is suggesting. 3. There are practical and theoretical reasons for both approaches: - Functional equivalence (matching outputs) is what matters for many attack purposes - But understanding the symmetry structure could lead to more efficient attacks or better understanding of what information is actually recoverable Let me think through the pros and cons of each approach...

Your observation is really insightful! You're absolutely right that there's a deep connection here with reparameterization symmetries, and the Jagielski paper's approach seems to sidestep this important structure.

Looking at both papers together reveals two different philosophies:

The Jagielski Approach (Functional Equivalence)

They define extraction success purely in terms of input-output behavior, essentially treating the model as a black box. This makes sense for their attack taxonomy because:

  • It's application-focused: for attacks like adversarial examples or membership inference, what matters is behavioral equivalence
  • It's model-agnostic: works across different architectures without needing to understand their symmetries
  • It sidesteps the complexity of characterizing all possible symmetries

The Symmetry-Aware Approach (Carlini et al.)

The "Stealing Part of a Production Language Model" paper actually does exactly what you're suggesting! They:

  • Explicitly acknowledge they can only recover $\mathbf{W}$ up to an $h \times h$ transformation
  • Prove this is optimal (you can't do better given only input-output access)
  • Measure extraction success as RMS error after finding the best aligning transformation

Why Both Approaches Matter

For ReLU networks, the symmetry group includes:

  • Positive scaling: $(c\mathbf{W}_{\text{in}}, \frac{1}{c}\mathbf{W}_{\text{out}})$ for $c > 0$
  • Permutations of neurons within layers
  • Adding dead neurons
  • Sign flips (for layers before ReLU)

For transformers, additional symmetries arise from:

  • Orthogonal transformations after layer norm (as Carlini et al. prove)
  • Attention head permutations
  • Various architectural invariances

The Deeper Question

You're touching on something fundamental: what information about a model is actually meaningful?

If two models differ only by a reparameterization, they're functionally identical, so arguably:

  1. We should define extraction success modulo symmetries
  2. Attacks should explicitly target equivalence classes rather than specific parameters
  3. Defenses based on hiding "exact" parameters are meaningless if the equivalence class is recoverable

This suggests a more sophisticated extraction taxonomy:

  • Level 0: Functional equivalence (Jagielski)
  • Level 1: Parameter recovery up to known symmetries (Carlini)
  • Level 2: Exact parameter recovery (impossible for fundamental reasons)

The fact that Carlini et al. successfully extract commercial models up to symmetries suggests this is the right framework for thinking about extraction attacks. It's both more honest about what's achievable and more precise about what information is actually being leaked.

replyquoteparent
Thinking Summary

The user is asking about the paper "Stealing Part of a Production Language Model" and specifically about Section 3 where they define the model structure. They're confused about equation (1) which defines the model as: $f_\theta(x) = \text{softmax}(E^T \cdot g_\theta(x))$ where $g_\theta : \mathcal{X}^N \to \mathbb{R}^h$ computes hidden states and $E^T$ is the embedding projection matrix. The user is right to be confused - this is indeed a simplified representation. The paper is specifically focusing on the final layer of the transformer, not describing the entire architecture. The complexity of transformers (attention layers, MLPs, residual connections, layer norms, etc.) is all encapsulated within the function $g_\theta$. Let me explain this clearly.

You're absolutely right to be confused! The definition in Section 3 is indeed highly simplified. The key insight is that all the transformer complexity is hidden inside the function $g_\theta$.

The paper defines: $f_\theta(x) = \text{softmax}(E^T \cdot g_\theta(x))$

Here's where all the complexity goes:

Inside $g_\theta$: This function encapsulates the entire transformer architecture:

  • All attention layers
  • Feed-forward networks
  • Layer normalizations
  • Residual connections
  • Positional encodings
  • Everything except the final projection

So when you have a typical transformer that processes tokens through many layers to produce a final hidden state in $\mathbb{R}^h$, that entire process is what $g_\theta$ represents.

Why this simplification works: The authors only care about stealing the final projection layer $E^T$ (which maps from hidden dimension $h$ to vocabulary size $l$). For their attack, they treat everything before this final projection as a black box that produces some hidden vector.

This is actually mentioned later in the paper - in Lemma 2 they note that with residual connections, you can't extract $E^T$ exactly (only up to a linear transformation), precisely because the internal structure of $g_\theta$ creates ambiguities.

The brilliance of their attack is that they don't need to understand the internal complexity - they just need to observe that the final output lies in a low-dimensional subspace determined by the hidden dimension $h$.

replyquoteparent
Thinking Summary

The user is asking about the hidden dimension of the Gemma 2 models. Let me look at the paper document to find information about the hidden dimensions. Looking at the Gemma 2 paper, I can see in Table 1 that shows "Overview of the main model parameters and design choices". In this table, there's a row for "d_model" which represents the hidden dimension. From Table 1: - Gemma 2 2B: d_model = 2304 - Gemma 2 9B: d_model = 3584 - Gemma 2 27B: d_model = 4608 So the hidden dimensions are: - 2B model: 2304 - 9B model: 3584 - 27B model: 4608

Based on the Gemma 2 paper (Table 1), the hidden dimensions (d_model) for the Gemma 2 models are:

  • Gemma 2 2B: 2304
  • Gemma 2 9B: 3584
  • Gemma 2 27B: 4608

These hidden dimensions follow a pattern where each larger model has a significantly wider hidden layer. The paper also notes that all three models use:

  • Grouped-Query Attention (GQA) with num_groups = 2
  • Head size of 256 for the 2B and 9B models, and 128 for the 27B model
  • A vocabulary size of 256,128 tokens (inherited from Gemini)

These architectural details would make Gemma 2 models potentially vulnerable to the embedding projection layer extraction attack described in the first paper, though Google has likely implemented defenses after the responsible disclosure.
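If you want to sanity-check these numbers yourself, a sketch along these lines should work (the Hugging Face model IDs and config field names are assumptions on my part, and the checkpoints are gated behind a license acceptance):

```python
from transformers import AutoConfig

for name in ["google/gemma-2-2b", "google/gemma-2-9b", "google/gemma-2-27b"]:
    cfg = AutoConfig.from_pretrained(name)
    # hidden_size is d_model; num_key_value_heads reflects the GQA grouping
    print(name, cfg.hidden_size, cfg.head_dim, cfg.num_key_value_heads)
```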

replyquoteparent
Thinking Summary

The user is hinting that because Gemma 2 is open weights, we can actually verify the attack from the first paper ourselves. They want to know: 1. The value of $l$ (vocabulary size) for Gemma 2 2. Whether it uses the same softmax structure as described in the attack paper From the Gemma 2 paper, I can see: - Vocabulary size is 256,128 for all three models (this is $l$) - The paper mentions they use a standard transformer architecture with some modifications Looking at the architecture description, they mention: - "Logit soft-capping" where they cap logits in each attention layer and the final layer - The soft-capping function is: logits ← soft_cap * tanh(logits / soft_cap) - They set soft_cap to 30.0 for the final layer This is interesting because the attack paper assumes a standard softmax without this capping. However, the fundamental structure is still: 1. Hidden states from transformer → 2. Linear projection to vocabulary size → 3. Some form of normalization (in this case, soft-capped softmax) The attack should still work because: - The embedding projection matrix $E^T$ still exists (mapping from hidden dimension $h$ to vocabulary size $l$) - The rank of this matrix is still limited by $h$ - The soft-capping is just a element-wise transformation that doesn't change the fundamental low-rank structure So to answer the questions: - $l = 256,128$ for all Gemma 2 models - It uses a modified softmax with soft-capping, but the fundamental structure (hidden states → linear projection → normalization) is the same

Ah, I see what you're getting at! Since Gemma 2 is open weights, we can actually verify the attack ourselves.

From the Gemma 2 paper:

  • $l$ = 256,128 for all three Gemma 2 models (they all use the same vocabulary size)

Regarding the softmax structure, Gemma 2 uses a modified version with "logit soft-capping":

logits ← soft_cap * tanh(logits / soft_cap)

Where soft_cap = 30.0 for the final layer. This happens before the softmax.

So the full structure is:

  1. $g_\theta(x)$ produces hidden states of dimension $h$
  2. $E^T \cdot g_\theta(x)$ projects to vocabulary dimension $l$ = 256,128
  3. Soft-capping: $30.0 \cdot \tanh(\text{logits} / 30.0)$
  4. Standard softmax applied to the capped logits

The attack from the first paper should still work because:

  • The embedding projection matrix $E^T$ still exists with shape $(l \times h)$
  • The rank is still limited by $h$ (much smaller than $l$)
  • The soft-capping is just an element-wise transformation that doesn't change the low-rank structure

Since we have access to the weights, we could directly verify:

  • The hidden dimensions match what's claimed (2304, 3584, 4608)
  • The rank of the final projection layer equals $h$
  • The attack's SVD approach would correctly identify these dimensions

This is a great example of how open-weight models allow for verification of security research findings!
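As a rough sketch of that verification on the smallest checkpoint (model ID assumed; the download is several GB and gated):

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("google/gemma-2-2b", torch_dtype=torch.bfloat16)

# With tied embeddings, the input embedding matrix is also the final projection E^T.
E = model.get_input_embeddings().weight
print(E.shape)   # expect roughly (l, h) = (256128, 2304) per Table 1 of the report

# Any logits produced through this matrix have rank at most h = 2304,
# which is exactly what the SVD-based attack detects from the API side.
```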

replyquoteparent
Thinking Summary

The user is asking about why the vocabulary size l = 256,128 can be expressed as 2^7 × 2001. Let me check: 256,128 = 2^7 × 2001 = 128 × 2001 = 256,128 ✓ This is interesting because it's a specific factorization. The vocabulary size of 256,128 is quite large and this particular factorization might have some significance. Looking at the SentencePiece tokenizer mentioned in the Gemma 2 paper, it has 256k entries. The paper mentions it's the same tokenizer as Gemma 1 and Gemini. The fact that it factors as 2^7 × 2001 is interesting. 2001 is 3 × 23 × 29, so it's not a particularly special number. The power of 2 factor (128) might be related to computational efficiency or alignment. Actually, I should double-check my arithmetic: 256,128 = 256 × 1000 + 128 = 256,000 + 128 = 256,128 Let me verify the factorization: 128 × 2001 = 128 × 2000 + 128 × 1 = 256,000 + 128 = 256,128 ✓ So yes, 256,128 = 2^7 × 2001. This might be related to how the tokenizer was constructed or some computational optimization. The power of 2 factor could be useful for memory alignment or SIMD operations.

Great observation! The vocabulary size 256,128 = 2^7 × 2001 is indeed an interesting factorization.

This specific structure might be related to:

  1. Computational efficiency: The 2^7 (128) factor allows for efficient memory alignment and potentially SIMD operations. Many GPU operations are optimized for power-of-2 dimensions.

  2. SentencePiece tokenizer design: The paper mentions they use a SentencePiece tokenizer with "split digits, preserved whitespace, and byte-level encodings." The factorization might reflect how tokens are organized:

    • Perhaps 128 special tokens or control codes
    • And 2001 "buckets" of regular tokens
  3. Byte-fallback encoding: With byte-level encoding as a fallback, you need at least 256 tokens for raw bytes. The structure might reserve certain ranges for different token types.

The fact that 2001 = 3 × 23 × 29 (all prime factors) suggests this wasn't chosen for its factorization properties, but rather 256,128 might have been chosen as "approximately 256k" while having nice computational properties.

This vocabulary size is notably different from models like LLaMA (32,000) or GPT-4 (~100,000), reflecting Google's multilingual focus - the paper mentions this large vocabulary is "designed to work on a large number of languages."

replyquoteparent
Thinking Summary

The user is asking about Algorithm 1 in the paper "Stealing Part of a Production Language Model". Let me look at Algorithm 1. Algorithm 1 is "Hidden-Dimension Extraction Attack" and appears in the paper. Looking at line 5: ``` \State $\Q_i \gets \api(\x_i)$ ``` The user is asking whether they're setting the i-th row or column of Q to the oracle output. Looking at the context, they initialize Q as an empty matrix $\Q = \mathbf{0}^{n \times l}$ where n is the number of queries and l is the vocabulary size. Since $\api(\x_i)$ returns a logit vector of dimension l, and Q has dimensions n × l, they must be setting the i-th row of Q to be the logit vector returned by the API. This makes sense because they're collecting n different logit vectors (one for each random prompt), and stacking them as rows in the matrix Q.

Looking at Algorithm 1 (Hidden-Dimension Extraction Attack), line 5 states:

$Q_i \gets \mathcal{O}(x_i)$

They are setting the i-th row of Q to the oracle output.

This is clear from the context:

  • Q is initialized as an $n \times l$ matrix (line 2)
  • The API returns a logit vector of dimension $l$
  • They iterate from $i = 1$ to $n$ (line 3)

So each query response $\mathcal{O}(x_i)$ is a vector of length $l$ (the vocabulary size), and they're stacking these as rows to build the matrix Q. This is confirmed later in Lemma 1 where they define:

Let $\mathbf{Q}(x_1, \ldots, x_n) \in \mathbb{R}^{l \times n}$ denote the matrix with columns $\mathcal{O}(x_1), \ldots, \mathcal{O}(x_n)$

Wait, this seems inconsistent! The lemma says the API responses are columns, but the algorithm initialization suggests rows. Looking more carefully at the formalization section, they write $\mathbf{Q} = \mathbf{E}^T \cdot \mathbf{H}$ where $\mathbf{E}^T$ is $l \times h$ and $\mathbf{H}$ is $h \times n$, which gives $\mathbf{Q}$ as $l \times n$.

So actually, they're using column notation - $Q_i$ refers to the i-th column of Q, not row. Each API response is stored as a column vector in the $l \times n$ matrix Q.

replyquoteparent
Thinking Summary

The user is asking me to redo my response without KaTeX syntax errors. Let me check my previous response for any KaTeX issues. I see that I used some notation that might not render properly. Let me rewrite it more carefully.

Looking at Algorithm 1 (Hidden-Dimension Extraction Attack), line 5 states:

Q_i ← O(x_i)

They are setting the i-th column of Q to the oracle output.

This is clear from the context:

  • Q is initialized as an $n \times l$ matrix (line 2)
  • The API returns a logit vector of dimension $l$
  • They iterate from $i = 1$ to $n$ (line 3)

Initially this seems confusing because the algorithm initializes Q as $n \times l$, but looking at the formalization section provides clarity. In Lemma 1 they define:

Let $\mathbf{Q}(x_1, \ldots, x_n) \in \mathbb{R}^{l \times n}$ denote the matrix with columns $\mathcal{O}(x_1), \ldots, \mathcal{O}(x_n)$

So there's a notational inconsistency in the paper. The key insight is that they write $\mathbf{Q} = \mathbf{E}^T \cdot \mathbf{H}$ where:

  • $\mathbf{E}^T$ is $l \times h$ (the embedding projection matrix)
  • $\mathbf{H}$ is $h \times n$ (hidden states)
  • This gives $\mathbf{Q}$ as $l \times n$

Therefore, $Q_i$ refers to the i-th column of Q. Each API response (a logit vector of length $l$) is stored as a column in the $l \times n$ matrix Q.

replyquoteparent
Thinking Summary

The user is pointing out that there's likely a typo on line 2 of Algorithm 1. Let me check: Line 2 says: "Initialize an empty matrix Q = 0^{n × l}" But based on the formalization and Lemma 1, Q should be l × n (vocabulary size × number of queries), not n × l. So yes, this appears to be a typo in the algorithm.

Yes, exactly! Line 2 appears to have a typo.

Line 2 states:

Initialize an empty matrix Q = 0^{n×l}

But it should be:

Initialize an empty matrix Q = 0^{l×n}

This is a clear typo because:

  1. The formalization shows $\mathbf{Q} = \mathbf{E}^T \cdot \mathbf{H}$ where the result must be $l \times n$
  2. Lemma 1 explicitly states: "Let $\mathbf{Q}(x_1, \ldots, x_n) \in \mathbb{R}^{l \times n}$ denote the matrix with columns $\mathcal{O}(x_1), \ldots, \mathcal{O}(x_n)$"
  3. Each API call returns a logit vector of length $l$, and they're collecting $n$ of these as columns

So the matrix Q should have dimensions $l \times n$ (vocabulary size × number of queries), not $n \times l$ as written in line 2 of the algorithm.
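A tiny sketch of the corrected bookkeeping (toy sizes, with a random matrix standing in for the API):

```python
import numpy as np

l, h, n = 1000, 64, 128
rng = np.random.default_rng(0)

E_T = rng.normal(size=(l, h))       # embedding projection, l x h
H = rng.normal(size=(h, n))         # hidden states for n prompts, h x n

Q = np.zeros((l, n))                # l x n, matching Lemma 1 (not n x l)
for i in range(n):
    Q[:, i] = E_T @ H[:, i]         # each "API response" is stored as a column

assert np.allclose(Q, E_T @ H)
```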

replyquoteparent
Thinking Summary

The user is asking about the significance of the SVD (Singular Value Decomposition) on line 7 of Algorithm 1 in the paper "Stealing Part of a Production Language Model". Looking at Algorithm 1 (Hidden-Dimension Extraction Attack), line 7 says: ``` State $\lambda_1 \geq \lambda_2 \geq \dots \geq \lambda_n \gets \text{SingularValues}(\Q)$ ``` This is computing the singular values of the matrix $\Q$ that contains the logit responses from querying the model. The significance is explained in the paper's intuition and formalization sections. The key insight is that even though each output logit vector is an $l$-dimensional vector (where $l$ is the vocabulary size), they all actually lie in an $h$-dimensional subspace because the embedding projection layer up-projects from $h$ dimensions. By computing the SVD and looking at the singular values, they can identify where there's a sharp drop-off in the magnitudes - this indicates the transition from the "real" dimensions (the hidden dimension $h$) to numerical noise. The algorithm identifies the largest multiplicative gap between consecutive singular values to determine $h$. This is a clever application of linear algebra - essentially using the rank-revealing properties of SVD to discover the intrinsic dimensionality of the data.

The SVD (Singular Value Decomposition) on line 7 of Algorithm 1 is the key mechanism that reveals the hidden dimension of the language model. Here's why it's significant:

Core Insight

The fundamental observation is that while the API returns logit vectors in $\mathbb{R}^l$ (where $l$ is the vocabulary size, often 50,000-100,000), these vectors actually lie in a much lower-dimensional subspace of dimension $h$ (the hidden dimension, typically 1,024-8,192).

This happens because of the model architecture: $\text{logits} = \mathbf{W}^T \cdot g_\theta(\mathbf{x})$

where:

  • $\mathbf{W}^T$ is the $l \times h$ embedding projection matrix
  • $g_\theta(\mathbf{x})$ is the $h$-dimensional hidden state

Why SVD Reveals the Hidden Dimension

When you collect $n > h$ logit vectors in matrix $\mathbf{Q}$:

  1. Rank constraint: Since $\mathbf{Q} = \mathbf{W}^T \cdot \mathbf{H}$ where $\mathbf{H}$ is $h \times n$, the rank of $\mathbf{Q}$ is at most $h$

  2. Singular value pattern: The SVD reveals this rank structure through the singular values:

    • The first $h$ singular values will be large (representing the true signal)
    • Singular values beyond position $h$ will be near zero (representing numerical noise)
  3. Gap detection: The algorithm identifies $h$ by finding the largest multiplicative gap $\frac{\lambda_i}{\lambda_{i+1}}$ between consecutive singular values

Practical Visualization

The paper provides nice visualizations (Figures 2-3) showing:

  • When $n < h$: All singular values are significant (can't determine $h$ yet)
  • When $n > h$: Exactly $h$ significant singular values appear, with a sharp drop after

This is why line 8 computes:

count ← argmax_i {log ||λᵢ|| - log ||λᵢ₊₁||}

The logarithmic difference identifies the multiplicative gap, making the dimension extraction robust even with floating-point precision limitations.

This elegant use of SVD essentially performs dimensionality analysis to reverse-engineer architectural information from black-box API access alone.
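Here's a minimal simulation of that procedure on synthetic data (my own sketch, not the paper's code); the gap in the singular values recovers the hidden dimension exactly:

```python
import numpy as np

rng = np.random.default_rng(0)
h, l, n = 64, 1000, 200                        # hidden dim, vocab size, query count

W = rng.normal(size=(l, h))                    # stand-in embedding projection
H = rng.normal(size=(h, n))                    # hidden states of n random prompts
Q = W @ H + 1e-6 * rng.normal(size=(l, n))     # observed logits, with slight noise

s = np.linalg.svd(Q, compute_uv=False)         # singular values, descending
gaps = np.log(s[:-1]) - np.log(s[1:])          # log of multiplicative gaps
print("recovered h =", int(np.argmax(gaps)) + 1)   # prints 64
```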

replyquoteparent

Note: §4.1 Cheaper Dimension Extraction seems relevant to the more sophisticated versions of this attack where we don't get full logit vectors - intuitively we don't need them because $l \gg h$

replyquoteparent

Looking at §4.2 Full Layer Extraction I am particularly curious about the structure of $G$ since in a few other cases (context transcluded) some questions I've had have come down to what are the relevant symmetries and how do we characterize them?

replyquoteparent
Thinking Summary

The user is asking about the structure of the matrix $G$ mentioned in §4.2 "Full Layer Extraction (Up to Symmetries)" of the paper "Stealing Part of a Production Language Model". They note that they've had similar questions about symmetries in other contexts (referring to SAE reparameterization questions). Looking at the paper, in section 4.2, they state that their method recovers $\tilde{E}^T = E^T \cdot G$ for some $G \in \mathbb{R}^{h \times h}$, where $E^T$ is the embedding projection matrix they're trying to steal. They prove in Lemma 4.2 that: 1. The method recovers $E^T$ up to multiplication by some $h \times h$ matrix $G$ 2. With residual connections, it's impossible to extract $E^T$ exactly In the appendices, they go into more detail about what $G$ can be: - In general, $G$ can be any invertible $h \times h$ matrix (affine transformation) - They show in Appendix D that with additional constraints, they can narrow it down to an orthogonal matrix - The key insight is that transformer architectures with residual connections and normalization layers have inherent symmetries that make certain transformations undetectable from the outputs The symmetries arise because: 1. In models without normalization, you can apply any invertible transformation $S$ to the hidden states if you compensate by applying $S^{-1}$ to input projections and $S$ to output projections 2. With RMSNorm/LayerNorm, this is restricted to orthogonal transformations because normalization operations only commute with orthogonal matrices (they preserve norms) 3. For LayerNorm specifically, there's an additional constraint that the orthogonal matrix must preserve the all-ones vector (up to sign) These symmetry transformations parallel the reparameterization techniques in sparse autoencoders, where scaling and rotation of intermediate representations don't fundamentally alter the input-output mapping. The transformations preserve core computational properties while allowing flexibility in how internal features are represented.

Looking at the structure of $G$ in this paper, it's fascinating how it parallels the symmetry questions you've been exploring with SAEs!

The Structure of $G$

From the paper and its appendices, $G$ can take different forms depending on the model architecture:

1. General Case (No Normalization)

For models without normalization layers, $G$ can be any invertible $h \times h$ matrix. The symmetry arises because you can:

  • Apply transformation $S^{-1}$ to all layer input projections
  • Apply transformation $S$ to all layer output projections
  • The residual stream gets transformed by $S$, but this cancels out, as sketched below: $(S^{-1} \cdot \text{weights}) \cdot (S \cdot \text{hidden state}) = \text{original computation}$
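A toy numerical check of this cancellation (sketch only; a single linear readout stands in for the full network):

```python
import numpy as np

rng = np.random.default_rng(0)
h, l = 16, 100

W = rng.normal(size=(l, h))            # output projection reading the residual stream
x = rng.normal(size=h)                 # a hidden state

S = rng.normal(size=(h, h))            # a generic (almost surely invertible) matrix
W_reparam = W @ np.linalg.inv(S)       # compensate the weights with S^{-1}
x_reparam = S @ x                      # residual stream transformed by S

print(np.allclose(W @ x, W_reparam @ x_reparam))   # True: outputs unchanged
```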

2. With RMSNorm

The presence of normalization restricts $G$ to be an orthogonal matrix. This is because:

  • RMSNorm includes the operation $x \mapsto x/\|x\|$
  • This only commutes with orthogonal transformations: $\|Ux\| = \|x\|$ when $U^T U = I$
  • So the symmetry group becomes $O(h)$ (orthogonal group)

3. With LayerNorm

Even more restricted! LayerNorm adds a centering step, so $G$ must be an orthogonal matrix that also satisfies:

  • $G \cdot \mathbf{1} \in \{-\mathbf{1}, \mathbf{1}\}$ (preserves or flips the all-ones vector)
  • This is because centering projects onto the hyperplane $\mathbf{1}^T x = 0$

The Symmetry Connection

This is exactly analogous to your SAE reparameterization question! In both cases:

  1. The model has inherent symmetries - transformations that don't change input-output behavior
  2. These symmetries form a group - you can compose them, invert them, etc.
  3. Different architectural choices restrict the symmetry group:
    • SAEs: Adding decoder normalization breaks scaling symmetry
    • Transformers: Adding layer normalization restricts from $GL(h)$ to $O(h)$

Why This Matters

The paper makes an important point: they can only recover the embedding matrix up to these symmetries. Just like how the L1 penalty in SAEs isn't meaningful without breaking the scaling symmetry, the "exact" embedding matrix isn't meaningful without specifying which element of the symmetry group you want.

In practice, they:

  • Recover $\tilde{E}^T = E^T \cdot G$
  • Report reconstruction error after finding the best-fitting $G$ via least squares
  • Note that with orthogonal constraints, you're missing ~$h^2/2$ parameters (vs $h^2$ for general affine)

This is a beautiful example of how symmetries in neural network architectures have real implications for both training (as in SAEs) and security (as in model stealing)!

replyquoteparent

Looking at §4.2 Full Layer Extraction I am particularly curious about the structure of $G$ since in a few other cases (context transcluded) some questions I've had have come down to what are the relevant symmetries and how do we characterize them?

replyquoteparent
Thinking Summary

The user is asking about the structure of the matrix $G$ in the Carlini et al. paper's Section 4.2 "Full Layer Extraction (Up to Symmetries)". They're particularly interested in understanding what symmetries are at play here and how they're characterized. Looking at the paper, in Section 4.2, they show that they can recover $\tilde{\mathbf{W}} = \mathbf{W} \cdot \mathbf{G}$ for some $\mathbf{G} \in \mathbb{R}^{h \times h}$. The key insight is that this $\mathbf{G}$ represents the inherent ambiguity in the extraction - you can't determine the exact final layer weights, only up to this transformation. The paper proves (in Lemma 4.2 and its proof in Appendix B) that: 1. In general, $\mathbf{G}$ can be any invertible $h \times h$ matrix 2. With additional assumptions about normalization layers, $\mathbf{G}$ is restricted to be an orthogonal matrix This is quite interesting in the context of the user's previous questions about reparameterization symmetries. The matrix $\mathbf{G}$ here represents exactly the symmetry group of valid reparameterizations that preserve the model's output. Let me explain this carefully, connecting it to their interest in symmetries.

Looking at the structure of $\mathbf{G}$ in the Carlini et al. paper, this is indeed a fascinating example of how symmetries fundamentally limit what can be extracted!

The Structure of $\mathbf{G}$

The paper shows two key results about $\mathbf{G}$:

1. General Case: $\mathbf{G}$ can be any invertible $h \times h$ matrix

Without additional constraints, if you have:

  • True embedding projection: $\mathbf{W}$
  • Hidden states: $g_\theta(\mathbf{x})$
  • Observable logits: $\mathbf{W} \cdot g_\theta(\mathbf{x})$

Then for any invertible $\mathbf{G}$, the reparameterization:

  • $\mathbf{W}' = \mathbf{W} \cdot \mathbf{G}$
  • $g'_\theta(\mathbf{x}) = \mathbf{G}^{-1} \cdot g_\theta(\mathbf{x})$

Produces identical outputs: $\mathbf{W}' \cdot g'_\theta(\mathbf{x}) = \mathbf{W} \cdot \mathbf{G} \cdot \mathbf{G}^{-1} \cdot g_\theta(\mathbf{x}) = \mathbf{W} \cdot g_\theta(\mathbf{x})$

2. With Normalization Layers: $\mathbf{G}$ must be orthogonal

The paper's clever insight (proven in Appendix B) is that LayerNorm/RMSNorm adds constraints:

  • These layers project activations onto a sphere: $\|g_\theta(\mathbf{x})\| = \text{constant}$
  • For the reparameterization to preserve this sphere property, we need $\|\mathbf{G}^{-1} \cdot g_\theta(\mathbf{x})\| = \|g_\theta(\mathbf{x})\|$
  • This forces $\mathbf{G}$ to be orthogonal (norm-preserving), as the sketch below illustrates
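A quick numerical illustration of why the norm constraint singles out orthogonal matrices (sketch with random matrices):

```python
import numpy as np

rng = np.random.default_rng(0)
h = 16
x = rng.normal(size=h)

G_generic = rng.normal(size=(h, h))                    # generic invertible matrix
G_orth = np.linalg.qr(rng.normal(size=(h, h)))[0]      # orthogonal matrix

# A generic G^{-1} changes the norm of the hidden state...
print(np.linalg.norm(np.linalg.inv(G_generic) @ x), np.linalg.norm(x))
# ...while an orthogonal one preserves it exactly.
print(np.isclose(np.linalg.norm(G_orth.T @ x), np.linalg.norm(x)))   # True
```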

The Symmetry Group Structure

For transformers with normalization, the symmetry group is: $G = O(h) \cong SO(h) \rtimes \mathbb{Z}_2$

Where:

  • $O(h)$ is the orthogonal group (all $h \times h$ orthogonal matrices)
  • $SO(h)$ is the special orthogonal group (rotations, det = 1)
  • $\mathbb{Z}_2$ represents reflections (det = -1)

This has dimension $\frac{h(h-1)}{2}$, meaning there are that many continuous degrees of freedom in the reparameterization.

Why This Matters

This connects beautifully to your broader question about symmetries:

  1. Information-theoretic limit: The attack can extract $h \times l$ parameters minus $\frac{h(h-1)}{2}$ degrees of freedom

    • For GPT-3.5 with $h \approx 4096$, that's ~8.4M unrecoverable parameters!
  2. Practical implications:

    • The "true" weights are meaningless - only the equivalence class matters
    • Any defense trying to hide exact weights while exposing logits is fundamentally flawed
    • The symmetry structure tells us exactly what information is inherently private
  3. Connection to other symmetries: Unlike the SAE scaling symmetries (which are feature-wise), these are global transformations mixing all hidden dimensions - much richer structure!

The fact that they recover commercial models with RMS error ~$10^{-4}$ after optimal alignment shows they're getting essentially everything that's theoretically possible. The matrix $\mathbf{G}$ isn't noise or approximation error - it's the fundamental ambiguity inherent in the problem.

replyquoteparent
Thinking Summary

The user is asking about the normalization layers used in Gemma 2. Let me search through the provided document content to find information about the normalization layers in Gemma 2. Looking at the document, I can see it's the Gemma 2 technical report. Let me find the relevant information about normalization layers. In the "Model Architecture" section (Section 2), there's a table (Table 1) that shows the overview of main model parameters and design choices. It mentions "Pre-norm: yes" and "Post-norm: yes" for all three model sizes (2B, 9B, and 27B). More specifically, under the architecture details, I found: "Post-norm and pre-norm with RMSNorm. To stabilize training, we use RMSNorm~\citep{rmsnorm} to normalize the input and output of each transformer sub-layer, the attention layer, and the feedforward layer." So Gemma 2 uses RMSNorm (Root Mean Square Normalization) for both pre-normalization and post-normalization of transformer sub-layers.

Looking at the Gemma 2 technical report, I can find the specific information about normalization layers in the Model Architecture section.

Gemma 2 uses RMSNorm (Root Mean Square Normalization) for its normalization layers. Specifically:

Post-norm and pre-norm with RMSNorm: The paper states that "To stabilize training, we use RMSNorm to normalize the input and output of each transformer sub-layer, the attention layer, and the feedforward layer."

This means Gemma 2 applies RMSNorm:

  • Before each transformer sub-layer (pre-norm)
  • After each transformer sub-layer (post-norm)
  • This includes both attention layers and feedforward layers

From Table 1 in the paper, we can see that all three model sizes (2B, 9B, and 27B) use this same normalization approach with both pre-norm and post-norm set to "yes".

This is particularly relevant to the model stealing paper discussion, as the Carlini et al. paper specifically notes that RMSNorm (unlike LayerNorm) doesn't include a centering step, which affects the symmetry constraints. RMSNorm projects activations onto a sphere without reducing the dimensionality, whereas LayerNorm's centering step projects onto a $(h-1)$-dimensional subspace. This means the orthogonal transformation constraints discussed in the stealing paper would apply to Gemma 2 without the additional restrictions needed for LayerNorm models.

replyquoteparent
Thinking Summary

Let me think about the symmetries in Gemma 2 models based on what we know from the Carlini paper and the Gemma 2 architecture. From the Carlini paper, we know that: 1. Without normalization layers, the symmetry group would be all invertible h×h matrices (GL(h)) 2. With RMSNorm, the symmetry group is restricted to orthogonal matrices O(h) 3. With LayerNorm, there's a further restriction that the orthogonal matrix must preserve the all-ones vector Since Gemma 2 uses RMSNorm (not LayerNorm), and has residual connections, the relevant symmetries should be the orthogonal group O(h). Let me also consider other architectural features that might affect symmetries: - Tied embeddings (yes for all models) - GQA (grouped query attention) - Local sliding window + global attention alternating - Logit soft-capping - Post-norm and pre-norm with RMSNorm The key insight from the Carlini paper is that RMSNorm preserves norms, so any reparameterization must also preserve norms, leading to orthogonal transformations.

Based on the Carlini et al. paper's analysis and Gemma 2's architecture, here are the relevant symmetries:

Symmetries in Gemma 2 Models

1. Primary Symmetry: Orthogonal Transformations

For any orthogonal matrix $\mathbf{U} \in O(h)$ (where $\mathbf{U}^T\mathbf{U} = \mathbf{I}$), the following reparameterization preserves model outputs:

  • Embedding projection: $\mathbf{W} \rightarrow \mathbf{W} \cdot \mathbf{U}^T$
  • Hidden states: $g_\theta(\mathbf{x}) \rightarrow \mathbf{U} \cdot g_\theta(\mathbf{x})$

This gives the symmetry group: $G = O(h) = \{U \in \mathbb{R}^{h \times h} : U^T U = I\}$

For Gemma 2's hidden dimensions:

  • 2B model: $O(2304)$ - about 2.65M parameters of freedom
  • 9B model: $O(3584)$ - about 6.42M parameters of freedom
  • 27B model: $O(4608)$ - about 10.6M parameters of freedom

2. Why These Symmetries Exist

The symmetries arise because:

  1. RMSNorm invariance: RMSNorm computes $\frac{\mathbf{x}}{\text{RMS}(\mathbf{x})} \cdot \gamma$, and orthogonal matrices preserve norms: $||\mathbf{U}\mathbf{x}|| = ||\mathbf{x}||$

  2. Residual connections: Each layer adds to the residual stream, so transformations must be consistent across all layers

  3. Attention mechanism: With appropriate weight transformations, attention patterns remain unchanged under orthogonal transformations

3. Implementation of Symmetry

To maintain model equivalence under transformation U\mathbf{U}, you must transform:

  • All layer input projections: $\mathbf{W}_{\text{in}} \rightarrow \mathbf{W}_{\text{in}} \cdot \mathbf{U}^T$
  • All layer output projections: $\mathbf{W}_{\text{out}} \rightarrow \mathbf{U} \cdot \mathbf{W}_{\text{out}}$
  • For attention layers with GQA:
    • Query projections: $\mathbf{W}_Q \rightarrow \mathbf{W}_Q \cdot \mathbf{U}^T$
    • Key projections: $\mathbf{W}_K \rightarrow \mathbf{W}_K \cdot \mathbf{U}^T$
    • Value projections: $\mathbf{W}_V \rightarrow \mathbf{W}_V \cdot \mathbf{U}^T$
    • Output projections: $\mathbf{W}_O \rightarrow \mathbf{U} \cdot \mathbf{W}_O$

4. Additional Architectural Considerations

Some Gemma 2 features don't introduce new symmetries but constrain how they're applied:

  • Tied embeddings: The input embedding and output projection share weights, so transformations must be consistent
  • Logit soft-capping: This is applied after the final projection, so doesn't affect the core symmetry
  • Local/Global attention alternation: Both attention types must be transformed consistently

5. What This Means for Model Stealing

An attacker who extracts the embedding projection matrix $\mathbf{W}$ can only recover it up to an unknown orthogonal transformation $\mathbf{U}$. They get: $\tilde{\mathbf{W}} = \mathbf{W} \cdot \mathbf{U}$

This ambiguity represents exactly $\frac{h(h-1)}{2}$ unknown parameters (the dimension of $O(h)$), which cannot be resolved without additional information about the model's internals.

The fact that Gemma 2 uses RMSNorm rather than LayerNorm actually makes it slightly easier to characterize these symmetries - there's no additional constraint about preserving the all-ones vector that would further restrict the orthogonal group.

replyquoteparent
Thinking Summary

The user is asking about the difference between RMSNorm and LayerNorm. This is a fundamental question about normalization techniques in neural networks. Let me explain this clearly with mathematical formulations and practical implications.

Here's a detailed comparison of RMSNorm and LayerNorm:

LayerNorm (Layer Normalization)

Formula: $\text{LayerNorm}(\mathbf{x}) = \gamma \cdot \frac{\mathbf{x} - \mu}{\sigma} + \beta$

Where:

  • $\mu = \frac{1}{h}\sum_{i=1}^h x_i$ (mean)
  • $\sigma = \sqrt{\frac{1}{h}\sum_{i=1}^h (x_i - \mu)^2 + \varepsilon}$ (standard deviation)
  • $\gamma, \beta$ are learned scale and shift parameters
  • $\varepsilon$ is a small constant for numerical stability

Steps:

  1. Center the activations: $\mathbf{x} - \mu$
  2. Normalize by standard deviation: $\frac{\mathbf{x} - \mu}{\sigma}$
  3. Scale and shift with learned parameters: $\gamma \cdot (\cdot) + \beta$

RMSNorm (Root Mean Square Normalization)

Formula: $\text{RMSNorm}(\mathbf{x}) = \gamma \cdot \frac{\mathbf{x}}{\text{RMS}(\mathbf{x})} + \beta$

Where:

  • $\text{RMS}(\mathbf{x}) = \sqrt{\frac{1}{h}\sum_{i=1}^h x_i^2 + \varepsilon}$
  • $\gamma, \beta$ are learned parameters (often $\beta$ is omitted)

Steps:

  1. Normalize by RMS: $\frac{\mathbf{x}}{\text{RMS}(\mathbf{x})}$
  2. Scale (and optionally shift): $\gamma \cdot (\cdot) + \beta$

Key Differences

1. Centering

  • LayerNorm: Subtracts the mean (centers the data)
  • RMSNorm: No centering - works directly with the original values

2. Geometric Interpretation

  • LayerNorm: Projects onto a centered $(h-1)$-dimensional sphere
    • The centering constraint means $\sum_i x_i = 0$
    • Removes one degree of freedom
  • RMSNorm: Projects onto a full $h$-dimensional sphere
    • All vectors with the same magnitude are equivalent
    • No dimensionality reduction

3. Computational Efficiency

  • RMSNorm: Faster because it skips the mean calculation and subtraction
  • LayerNorm: Requires two passes over the data (mean, then variance)

4. Impact on Symmetries

As discussed in the Carlini paper:

For LayerNorm models:

  • Orthogonal transformations must preserve the centering property
  • The constraint is: $\mathbf{U} \cdot \mathbf{1} \in \{-\mathbf{1}, \mathbf{1}\}$
  • This restricts the orthogonal group to those that map the all-ones vector to $\pm$ itself

For RMSNorm models:

  • Any orthogonal transformation preserves the norm
  • Full $O(h)$ symmetry group applies without additional constraints

5. Practical Performance

  • RMSNorm: Often performs comparably to LayerNorm despite being simpler
  • Originally introduced as a computationally efficient alternative
  • Widely adopted in recent LLMs (LLaMA, Gemma, etc.)

Example Calculation

Given $\mathbf{x} = [2, -1, 3]$:

LayerNorm:

  • $\mu = \frac{2 + (-1) + 3}{3} = \frac{4}{3}$
  • Centered: $[\frac{2}{3}, -\frac{7}{3}, \frac{5}{3}]$
  • $\sigma = \sqrt{\frac{(\frac{2}{3})^2 + (-\frac{7}{3})^2 + (\frac{5}{3})^2}{3}} \approx 1.70$
  • Output: $\gamma \cdot [\frac{2/3}{1.70}, \frac{-7/3}{1.70}, \frac{5/3}{1.70}] + \beta$

RMSNorm:

  • $\text{RMS} = \sqrt{\frac{2^2 + (-1)^2 + 3^2}{3}} = \sqrt{\frac{14}{3}} \approx 2.16$
  • Output: $\gamma \cdot [\frac{2}{2.16}, \frac{-1}{2.16}, \frac{3}{2.16}] + \beta$

The RMSNorm output maintains the relative signs and rough proportions of the input, while LayerNorm's centering step changes the distribution more dramatically.
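The same numbers, checked in a few lines ($\gamma$ and $\beta$ dropped, $\varepsilon$ omitted):

```python
import numpy as np

x = np.array([2.0, -1.0, 3.0])

# LayerNorm core: center, then divide by the standard deviation
mu, sigma = x.mean(), x.std()
print((x - mu) / sigma)          # ≈ [ 0.392, -1.373,  0.981]

# RMSNorm core: divide by the root mean square, no centering
rms = np.sqrt(np.mean(x ** 2))
print(x / rms)                   # ≈ [ 0.926, -0.463,  1.389]
```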

replyquoteparent

Thinking about the "attack success rate" evaluation in §4.2, as well as the included context focusing on explicit characterization of the symmetries, can you explain the RMS computation and how it does or doesn't respect symmetry?

replyquote(1 quotes)parent
Thinking Summary

The user is asking about the RMS (Root Mean Square) computation in Section 4.2 of the Carlini et al. paper and how it does or doesn't respect symmetry. Let me think through this carefully. Looking at the paper, they're evaluating their attack success by computing the RMS error between their extracted matrix $\tilde{\mathbf{E}}^T$ and the actual weight matrix $\mathbf{E}^T$, after allowing for an $h \times h$ affine transformation. The key part is that they solve a least squares system to find the best aligning transformation $\mathbf{G}$: - They solve $\tilde{\mathbf{E}}^T \cdot \mathbf{G} \approx \mathbf{E}^T$ for $\mathbf{G}$ - Then they report the RMS of $\mathbf{E}^T$ and $\tilde{\mathbf{E}}^T \cdot \mathbf{G}$ This is explicitly respecting the symmetry! They're not computing the naive RMS between the raw extracted matrix and the true matrix, which would be meaningless since the extracted matrix is only defined up to an $h \times h$ transformation. The paper proves that you can only recover $\mathbf{E}^T$ up to multiplication by some matrix $\mathbf{G}$, and their evaluation methodology explicitly accounts for this by finding the best possible $\mathbf{G}$ before computing the error. This is a good example of symmetry-aware evaluation. They're measuring extraction success within the equivalence class of matrices that differ by right-multiplication by an invertible $h \times h$ matrix. Actually, in the appendices they go even further and show that with additional assumptions (like the model having residual connections and normalization layers), you can narrow it down to recovery up to an orthogonal transformation rather than a general affine transformation. But their practical evaluation uses the more general affine case.

Looking at the RMS computation in Section 4.2 of the Carlini et al. paper, it's actually a great example of symmetry-aware evaluation. Let me break down what they're doing and why it respects the inherent symmetries:

The RMS Computation

The paper states:

"we evaluate the attack success rate by comparing the root mean square (RMS) between our extracted matrix E~T=UΣ\tilde{\mathbf{E}}^T = \mathbf{U} \cdot \boldsymbol{\Sigma} and the actual weight matrix, after allowing for a h×hh \times h affine transformation."

Specifically, they:

  1. Solve the least squares system: $\tilde{\mathbf{E}}^T \cdot \mathbf{G} \approx \mathbf{E}^T$ for $\mathbf{G}$
  2. Report the RMS of $\mathbf{E}^T$ and $\tilde{\mathbf{E}}^T \cdot \mathbf{G}$

Why This Respects Symmetry

This evaluation explicitly accounts for the fundamental symmetry in their extraction problem. They proved (Lemma 4.2) that given only API access, you can only recover:

$\tilde{\mathbf{E}}^T = \mathbf{E}^T \cdot \mathbf{G}$

for some unknown $h \times h$ matrix $\mathbf{G}$. This isn't a limitation of their attack—it's a fundamental information-theoretic constraint.

The Symmetry Group

For transformer models with residual connections, the symmetry group consists of all transformations of the form:

  • $\mathbf{E}^T \rightarrow \mathbf{E}^T \cdot \mathbf{G}$
  • $g_\theta(x) \rightarrow \mathbf{G}^{-1} \cdot g_\theta(x)$

where $\mathbf{G}$ is any invertible $h \times h$ matrix (or orthogonal if you include normalization constraints).

Why Direct RMS Would Be Wrong

If they computed RMS directly between $\tilde{\mathbf{E}}^T$ and $\mathbf{E}^T$ without alignment:

  • The error could be arbitrarily large even for perfect extraction (just with different basis)
  • It would penalize the attack for not recovering information that's theoretically impossible to recover
  • The metric wouldn't distinguish between genuine extraction errors and harmless reparameterizations

The Deeper Structure

What's elegant here is that they're implicitly working with equivalence classes:

  • Two matrices are equivalent if they differ by right-multiplication by an invertible matrix
  • Their metric measures distance between equivalence classes, not between specific representatives
  • This is the "right" way to measure extraction success given the symmetries

Comparison to Baseline

They note that random initialization gives RMS ≈ $2 \times 10^{-2}$, while their attack achieves $10^{-4}$ to $10^{-5}$. This 100-500× improvement shows they're extracting real information, not just exploiting the alignment procedure.

This approach beautifully illustrates your earlier point about defining fidelity modulo symmetries rather than seeking meaningless "exact" extraction!
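A compact sketch of this evaluation procedure (random matrices standing in for the real weights):

```python
import numpy as np

rng = np.random.default_rng(0)
l, h = 1000, 64

E_T = rng.normal(size=(l, h))                            # "true" projection matrix
G_secret = np.linalg.qr(rng.normal(size=(h, h)))[0]      # unknown reparameterization
E_tilde = E_T @ G_secret                                 # what the attack can recover

# Best-aligning G via least squares, then RMS of the residual
G_star, *_ = np.linalg.lstsq(E_tilde, E_T, rcond=None)
rms = np.sqrt(np.mean((E_tilde @ G_star - E_T) ** 2))
print(rms)   # ~1e-15: the entire mismatch was the symmetry, not extraction error
```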

replyquoteparent
Thinking Summary

The user is asking whether solving the least squares problem $\tilde{\mathbf{E}}^T \cdot \mathbf{G} \approx \mathbf{E}^T$ for $\mathbf{G}$ is equivalent to finding a minimizing symmetry transformation. This is a great observation - let me think through this carefully. In the context of the paper: - They extract $\tilde{\mathbf{E}}^T = \mathbf{E}^T \cdot \mathbf{G}_{\text{true}}$ for some unknown $\mathbf{G}_{\text{true}}$ - They want to measure how close their extraction is to the true $\mathbf{E}^T$ - But they can only recover things up to this symmetry So when they solve $\tilde{\mathbf{E}}^T \cdot \mathbf{G} \approx \mathbf{E}^T$, they're finding the $\mathbf{G}$ that best aligns their extracted matrix with the true one. This is indeed finding the "best" symmetry transformation that minimizes the distance between their extraction and the truth. More formally: - The symmetry group acts on the space of matrices by right multiplication - They're finding $\mathbf{G}^* = \arg\min_{\mathbf{G}} \|\tilde{\mathbf{E}}^T \cdot \mathbf{G} - \mathbf{E}^T\|_F$ - This is finding the group element that moves their extracted matrix closest to the true matrix - The residual error after this optimal alignment is their measure of extraction quality So yes, the user is correct - solving the least squares problem is exactly finding the minimizing element in the symmetry group!

Yes, exactly! This is a beautiful way to think about it. The least squares problem is precisely finding the optimal symmetry transformation.

The Mathematical Structure

When they solve: $\mathbf{G}^* = \arg\min_{\mathbf{G}} \|\tilde{\mathbf{E}}^T \cdot \mathbf{G} - \mathbf{E}^T\|_F^2$

They're finding the group element $\mathbf{G}^* \in GL(h)$ that minimizes the distance between:

  • Their extracted matrix $\tilde{\mathbf{E}}^T$ transformed by $\mathbf{G}$
  • The true matrix $\mathbf{E}^T$

Why This Is Natural

Since they proved you can only recover $\mathbf{E}^T$ up to right-multiplication by an $h \times h$ matrix, the space of all possible "correct" extractions is the orbit: $\{\mathbf{E}^T \cdot \mathbf{G} : \mathbf{G} \in GL(h)\}$

Their evaluation metric asks: "What's the closest point in our extracted matrix's orbit to the true matrix?"

The Geometry

This has elegant geometric interpretation:

  • The symmetry group $GL(h)$ acts on the space of $l \times h$ matrices
  • Each orbit represents an equivalence class of functionally identical matrices
  • The least squares solution finds the point in $\tilde{\mathbf{E}}^T$'s orbit closest to $\mathbf{E}^T$
  • The residual RMS measures the distance between orbits, not between arbitrary representatives

Connection to Optimization on Manifolds

This is actually a classic problem in optimization on manifolds:

  • The quotient space (matrices modulo the symmetry) forms a manifold
  • They're computing the distance between points on this quotient manifold
  • The least squares problem is the standard way to compute this distance

So yes, solving the least squares problem is exactly finding the symmetry transformation that best aligns their extraction with ground truth. The final RMS then measures only the "real" error that can't be explained by symmetries.
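
As a concrete illustration of this alignment step, here is a minimal numpy sketch; the shapes and matrices are entirely made up, and the "extraction" is simulated as a perfect recovery up to an unknown change of basis:

```python
import numpy as np

# Synthetic stand-ins: l = 1000 vocabulary rows, h = 64 hidden dimensions.
rng = np.random.default_rng(0)
l, h = 1000, 64

E_true = rng.normal(size=(l, h))                 # plays the role of E^T
G_hidden = rng.normal(size=(h, h))               # unknown reparameterization
E_extracted = E_true @ np.linalg.inv(G_hidden)   # an "ideal" extraction, wrong basis

# Least squares: find the symmetry transformation that best aligns extraction to truth.
G_star, *_ = np.linalg.lstsq(E_extracted, E_true, rcond=None)

# RMS of the residual after optimal alignment = the error not explained by symmetry.
rms = np.sqrt(np.mean((E_extracted @ G_star - E_true) ** 2))
print(rms)  # essentially zero here, since the only difference was the basis change
```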

This reinforces why naive RMS without alignment would be meaningless - it would conflate actual extraction error with arbitrary choice of basis!

replyquoteparent
Thinking Summary

The user is asking about logit bias APIs and why they exist, independent of the security vulnerability described in the paper. This is a good question because the paper focuses on exploiting these APIs but doesn't go into much detail about their legitimate use cases. From the paper, I can see they mention some legitimate uses: 1. Controlled or constrained generation 2. Shifting generation to mimic fine-tuning 3. Various other applications Let me explain these in more detail and provide a comprehensive answer about why logit bias APIs are useful features that API providers offer.

Logit bias APIs are valuable features that allow users to influence a language model's token selection without retraining or fine-tuning the model. Here are the key reasons why API providers offer this functionality:

1. Constrained Generation

Logit bias allows developers to guide models to generate text that adheres to specific formats or constraints:

  • Structured output: Ensuring JSON, XML, or other structured formats are properly generated
  • Grammar compliance: Forcing adherence to specific syntactic rules
  • Domain-specific vocabularies: Restricting outputs to technical terminology or avoiding certain words

2. Content Moderation

Organizations can use logit bias to:

  • Block inappropriate content: Apply negative bias to profanity or sensitive terms
  • Enforce brand guidelines: Prevent mentions of competitors or off-brand language
  • Comply with regulations: Avoid generating text that might violate legal requirements

3. Simulating Fine-tuning

As mentioned in the paper, logit bias can approximate the effects of fine-tuning without the cost:

  • Personality adjustment: Making the model more formal/casual by biasing certain tokens
  • Style transfer: Encouraging specific writing styles or tones
  • Task-specific behavior: Adapting the model for particular use cases

4. Interactive Applications

In conversational or game-like applications:

  • Character consistency: Ensuring AI characters use appropriate vocabulary
  • Difficulty adjustment: Making text generation easier/harder to predict
  • Creative constraints: Implementing word games or writing challenges

5. Efficiency and Cost Savings

Logit bias provides a lightweight alternative to:

  • Training custom models
  • Running multiple inference passes
  • Post-processing and filtering outputs

The feature essentially gives developers fine-grained control over generation probabilities, making LLMs more practical for production applications where specific output characteristics are required. This explains why major providers like OpenAI and Google included these features in their APIs—they significantly expand the utility of their models for real-world applications.

replyquoteparent
Thinking Summary

The user is asking about Section 5.3 of the paper, which is "Extraction Attack for Top-5 Logit Bias APIs". They want to understand the difference between logits and logprobs, and why this difference matters for the attack. Looking at the relevant section, I can see that: 1. Logits are the raw outputs before the softmax function is applied 2. Logprobs are the log probabilities after the softmax function is applied 3. The issue is that when you apply a logit bias and then observe the logprob, you get an additional bias-dependent term that needs to be dealt with Let me explain this clearly.

Great question! The distinction between logits and logprobs is crucial to understanding why the attack in Section 5.3 is more complex than the initial warm-up attack.

Logits vs Logprobs

Logits are the raw output values from the model before applying softmax:

  • For token $i$: $z_i = (\mathbf{W}^T \cdot g_\theta(x))_i$

Logprobs are the log of the probabilities after applying softmax:

  • For token $i$: $y_i = \log\left(\frac{e^{z_i}}{\sum_j e^{z_j}}\right) = z_i - \log\left(\sum_j e^{z_j}\right)$ (checked numerically in the snippet below)
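
A quick numerical check of this identity, using made-up logits (`log_softmax` and `logsumexp` are standard SciPy routines):

```python
import numpy as np
from scipy.special import log_softmax, logsumexp

z = np.array([2.0, 1.0, 0.5, -1.0, 3.0])   # toy logits for a 5-token vocabulary
y = log_softmax(z)                          # logprobs, as an API would return
assert np.allclose(y, z - logsumexp(z))     # y_i = z_i - log(sum_j exp(z_j))
```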

Why This Matters for the Attack

The key challenge is that production APIs return logprobs, not logits. When you apply a logit bias $B$ to token $i$ and observe its logprob, you get:

$$y_i^B = z_i + B - \log\left(\sum_{j \neq i} e^{z_j} + e^{z_i + B}\right)$$

Notice the problem: the normalizing constant (the log term) changes depending on which tokens you bias. This means you can't simply subtract $B$ from $y_i^B$ to get $z_i$.

The Attack's Solution

The paper presents two approaches:

  1. Reference token method: Use one token as a fixed reference point across all queries. By comparing other tokens to this reference, the changing normalization constants cancel out; with the bias $B$ applied to token $i$ and the reference token $R$ left unbiased, $z_R - z_i = y_R^B - y_i^B + B$.

  2. Linear constraint method (Appendix A): Treat each logprob observation as a linear constraint on the original logits, then solve the resulting system of equations.

This distinction is why the attack requires multiple queries with carefully chosen logit biases—it's reconstructing the underlying logits from observations that have been transformed by the softmax's normalization.
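
Here is a simplified numerical sketch of the reference-token method. It assumes full logprob access and ignores the top-5 restriction of real APIs, so it illustrates only the cancellation, not the full attack:

```python
import numpy as np
from scipy.special import log_softmax

rng = np.random.default_rng(1)
z = rng.normal(size=100)            # hidden "true" logits (made up)
R, B = 0, 30.0                      # reference token index and a large logit bias

recovered = np.zeros_like(z)
for i in range(1, len(z)):
    biased = z.copy()
    biased[i] += B                  # query with bias B on token i only
    y = log_softmax(biased)         # the API returns logprobs, not logits
    # z_R - z_i = y_R^B - y_i^B + B  (reference token left unbiased)
    recovered[i] = -(y[R] - y[i] + B)   # this is z_i - z_R

print(np.max(np.abs((z - z[R])[1:] - recovered[1:])))  # ~0 up to float error
```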

replyquoteparent
Thinking Summary

The user is asking about the scenario where you only get the argmax token (the actual predicted token) without any logprobs at all. This is addressed in Section 6 and Appendix C of the paper, which discusses "Logprob-free attacks".

Excellent question! This is exactly what Section 6 and Appendix C address - the "logprob-free" attacks where you only get the sampled token, not any probability information.

The Basic Approach: Binary Search

The key insight is that with temperature 0 sampling, the API returns the token with the highest logit value. So if you can manipulate logit biases, you can use binary search:

  1. Start with prompt $p$ and identify the top token (call it token 0)
  2. For each other token $i$:
    • Binary search to find the minimum bias $x_i$ needed to make token $i$ the top token
    • This tells you that $z_i - z_0 \approx -x_i$ (see the sketch below)
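
A toy version of this search, with a simulated argmax-only oracle standing in for a temperature-0 API call (all values and parameters here are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(2)
z = rng.normal(size=50)                 # hidden logits
top = int(np.argmax(z))                 # "token 0" in the notation above

def oracle(bias):
    """Return the argmax token when `bias` is added to the logits."""
    return int(np.argmax(z + bias))

def min_bias_to_win(i, B=32.0, eps=1e-4):
    """Binary search for the smallest bias that makes token i the argmax."""
    lo, hi = 0.0, B
    while hi - lo > eps:
        mid = (lo + hi) / 2
        bias = np.zeros_like(z)
        bias[i] = mid
        if oracle(bias) == i:
            hi = mid
        else:
            lo = mid
    return hi

i = (top + 1) % len(z)
x_i = min_bias_to_win(i)
print(z[i] - z[top], -x_i)              # approximately equal: z_i - z_0 ≈ -x_i
```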

The Challenge

This is much more expensive than the logprob-based attacks:

  • Basic binary search: ~10 queries per logit (vs 0.25 with logprobs)
  • Each token requires $\log_2(B/\epsilon)$ queries for precision $\epsilon$

Improved Methods

The paper presents optimizations that query multiple tokens simultaneously:

  1. Hyperrectangle method: Apply biases to multiple tokens at once and accumulate linear constraints based on which token wins
  2. One-of-n method: Carefully choose biases to make each token equally likely to be sampled

These reduce the cost to ~3.7 queries per logit, though still much higher than logprob-based methods.

Information-Theoretic Limits

The paper proves (Lemma 6) that you need at least $$\frac{l \cdot \log_2(B/\epsilon)}{\log_2(l)}$$ queries, where $l$ is the vocabulary size. Their best attack is within a factor of 2 of this theoretical limit!
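
Plugging illustrative numbers into this bound (these particular values of $l$, $B$, and $\epsilon$ are invented for the sake of arithmetic, not taken from the paper):

```python
import math

l, B, eps = 50_000, 2.0**8, 2.0**-8     # hypothetical vocabulary, bias range, precision
total = l * math.log2(B / eps) / math.log2(l)
print(total, total / l)                 # ≈ 5.1e4 queries total, ≈ 1.03 per logit
```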

The key takeaway: Yes, you can extract the model even with just next-token predictions, but it's about 15x more expensive than with logprob access.

replyquoteparent
Thinking Summary

The user is asking about the case where they don't have access to logit biases at all - just the ability to get next token predictions. This is the most restrictive case. Let me think about what the paper says about this. Looking through the paper, I don't see them explicitly addressing the case where you have NO logit bias capability at all. Their attacks fundamentally rely on being able to manipulate the logits through biases. Without logit biases, you would only be able to: 1. Send different prompts 2. Get the top token for each prompt This would make it extremely difficult if not impossible to extract the model parameters using their approach, because: - You can't manipulate which token becomes the top token - You can't perform binary search - You can't create the linear constraints they use The paper does mention in Section 8 (Future Work) about "Removing the logit bias assumption" where they say: "All our attacks require the ability to pass a logit bias. Model providers including Google and OpenAI provided this capability when we began the writing of this paper, but this could change." They suggest that other API parameters like unconstrained temperature and top-k parameters could potentially leak logit values through queries, but they don't develop this attack. So the honest answer is: the paper doesn't provide a method for extraction without logit bias access.

Great question! This exposes a fundamental limitation of the attack: the paper does not provide a method that works without logit bias access.

Why Logit Bias is Essential

All the attacks in the paper fundamentally rely on manipulating logits through biases to:

  1. Force specific tokens to become the top token
  2. Create controlled comparisons between tokens
  3. Extract relative logit differences

Without this capability, you're left with only:

  • Sending different prompts
  • Observing which token naturally wins for each prompt

What the Authors Say

In Section 8 (Future Work), they acknowledge this limitation:

"All our attacks require the ability to pass a logit bias. Model providers including Google and OpenAI provided this capability when we began the writing of this paper, but this could change."

They speculate that other API parameters might leak information:

  • Temperature: Could potentially reveal logit gaps through sampling distributions
  • Top-k parameters: Might leak information about relative token rankings

But they explicitly state they haven't developed these attacks.

Why It's So Hard

Without logit bias, you lose the ability to:

  • Perform binary search (can't force tokens to win)
  • Create the linear constraint systems they use
  • Make controlled measurements of logit differences

You'd essentially be limited to statistical analysis of natural token distributions across many prompts, which would likely require vastly more queries and might not even recover precise weights.

Bottom line: The attack requires logit bias. Without it, you'd need fundamentally different techniques that the paper doesn't provide.

replyquoteparent
Thinking Summary

The user is asking about the softmax function from an information-theoretic perspective. This is a great question that goes deeper into why the attack is challenging and what information is preserved/lost through the softmax transformation. Key points to cover: 1. Softmax as a normalization that creates a probability distribution 2. Information preservation vs transformation 3. Invariance to additive constants 4. The role of temperature in information theory 5. Connection to maximum entropy principle 6. Why this makes the attack harder

Excellent question! Let's examine softmax through an information-theoretic lens to understand why it makes the attack challenging.

Softmax as Information Transformation

The softmax function transforms logits $z \in \mathbb{R}^n$ into a probability distribution: $$p_i = \frac{e^{z_i}}{\sum_j e^{z_j}}$$

From an information theory perspective, softmax has several key properties:

1. Invariance to Translation

Softmax is invariant to adding a constant to all logits: $$\text{softmax}(z + c\mathbf{1}) = \text{softmax}(z)$$

This means softmax destroys one degree of freedom—you lose information about the absolute scale of the logits. You can only recover relative differences between logits, not their absolute values.
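
A one-line numerical check of this invariance (arbitrary toy numbers):

```python
import numpy as np
from scipy.special import softmax

z = np.array([1.0, -2.0, 0.3, 4.0])               # arbitrary logits
assert np.allclose(softmax(z), softmax(z + 7.5))  # shifting every logit changes nothing
```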

2. Maximum Entropy Principle

Softmax arises naturally from the maximum entropy principle. Given constraints on expected values, the distribution that maximizes entropy has the form: $$p_i \propto e^{\lambda \cdot f_i}$$

This connects to why language models use softmax: it's the least biased way to convert scores into probabilities.

3. Information Compression

The log-sum-exp operation in the denominator acts as a "soft maximum": $$\log\sum_i e^{z_i} \approx \max_i z_i \quad \text{(for large differences)}$$

This compresses information about the full distribution into a single normalizing constant, making it harder to recover individual logits.

Temperature and Information

Consider softmax with temperature $\tau$: $$p_i = \frac{e^{z_i/\tau}}{\sum_j e^{z_j/\tau}}$$

  • As $\tau \to 0$: Approaches argmax (maximum information about the top token, none about others)
  • As $\tau \to \infty$: Approaches uniform distribution (no information preserved)
  • $\tau = 1$: Balances between preserving ordering information and relative magnitudes (illustrated in the snippet below)
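
A small numerical illustration of these three regimes (toy logits, nothing model-specific):

```python
import numpy as np
from scipy.special import softmax

z = np.array([2.0, 1.0, 0.0])
for tau in (0.1, 1.0, 10.0):
    print(tau, softmax(z / tau).round(3))
# 0.1  -> [1.    0.    0.   ]  nearly one-hot on the argmax
# 1.0  -> [0.665 0.245 0.09 ]
# 10.0 -> [0.367 0.332 0.301]  close to uniform
```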

Why This Matters for the Attack

The attack must overcome several information-theoretic challenges:

  1. Lost Degree of Freedom: The attack can only recover $n-1$ independent values from $n$ logprobs due to the normalization constraint $\sum_i p_i = 1$.

  2. Nonlinear Coupling: Each logprob depends on ALL logits through the normalization: $$\frac{\partial y_i}{\partial z_j} = \begin{cases} 1 - p_j & \text{if } i = j \\ -p_j & \text{if } i \neq j \end{cases}$$

  3. Information Hiding: When you bias specific tokens, you change the effective "temperature" for those tokens, altering how information is distributed across the output.

The Attack's Information Recovery

The clever insight is that by strategically choosing which tokens to bias, you can create multiple "views" of the same underlying logit vector, each with different normalization constants. With enough views, you can solve for the original logits (up to translation).

This is why the attack needs:

  • Multiple queries (to get different normalizations)
  • A reference token (to fix the translation ambiguity)
  • Careful bias selection (to ensure numerical stability)

The attack essentially exploits the fact that while softmax destroys some information, it preserves enough structure that the original logits can be reconstructed through careful measurement of how the outputs change under different biases.

replyquoteparent

not interested in the bias part yet just thinking deeply and carefully about the effects of softmax. assume the bias part doesn’t exist. not even thinking particularly about this attack.

How do I understand the relationship between logprobs and logits from the pov of information theory? please think very carefully and distill a high-signal answer

replyquoteparent
Thinking Summary

The user wants a deep information-theoretic understanding of the logit->logprob transformation via softmax, without considering the attack or bias. They want me to think carefully and provide a high-signal answer about what information is preserved and what is lost. Let me think about this carefully: 1. The transformation is: logits z → softmax → probabilities p → log → logprobs y - z_i → p_i = exp(z_i) / Σ exp(z_j) → y_i = log(p_i) = z_i - log(Σ exp(z_j)) 2. Information theoretic properties: - Dimensionality: We go from n logits to n logprobs, but there's a constraint (probabilities sum to 1) - Invariances: Translation invariance means we lose one degree of freedom - What's preserved: All pairwise differences z_i - z_j are preserved exactly - What's lost: The absolute scale/location of the logits 3. The key insight is that this is a projection onto a constraint manifold. We're projecting from R^n to a (n-1)-dimensional manifold. 4. From an information geometry perspective, softmax maps from the natural parameter space to the mean parameter space of the categorical distribution. 5. The sufficient statistics perspective: For a categorical distribution, the logprobs ARE the sufficient statistics - they contain all the information about the distribution that matters. Let me structure this more clearly...

The Logit→Logprob Transform: An Information-Theoretic View

The transformation from logits to logprobs via softmax is fundamentally about moving between dual representations of the same information.

The Core Transform

$$z_i \xrightarrow{\text{softmax}} p_i = \frac{e^{z_i}}{\sum_j e^{z_j}} \xrightarrow{\log} y_i = z_i - \log\sum_j e^{z_j}$$

What's Preserved Exactly

All pairwise differences between logits are preserved: $$y_i - y_j = (z_i - \log Z) - (z_j - \log Z) = z_i - z_j$$

This means the logprobs contain complete information about the relative structure of the logits—all orderings, all relative magnitudes.
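
A quick numerical check that pairwise differences survive the transform (random toy logits):

```python
import numpy as np
from scipy.special import log_softmax

z = np.random.default_rng(3).normal(size=8)   # arbitrary logits
y = log_softmax(z)
# every pairwise difference y_i - y_j equals z_i - z_j exactly
assert np.allclose(y[:, None] - y[None, :], z[:, None] - z[None, :])
```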

What's Lost

One degree of freedom: the absolute position. You can shift all logits by any constant $c$ and get identical logprobs. Mathematically, the transform has a null space of $\text{span}(\mathbf{1})$.

The Deep Structure

From information geometry, this is the natural duality between:

  • Natural parameters (logits): The canonical parameterization of the exponential family
  • Mean parameters (probabilities): The expectation parameters
  • Log-mean parameters (logprobs): Just the log of mean parameters

The softmax is the link function that maps between these dual spaces.

Information-Theoretic Insight

The key insight: logprobs are sufficient statistics.

For any inference task involving a categorical distribution, the logprobs contain all the information you need. The lost degree of freedom (absolute position) is statistically meaningless—it doesn't affect any likelihood, any prediction, any Bayesian update.

The Manifold Perspective

The transform projects from $\mathbb{R}^n$ onto an $(n-1)$-dimensional manifold, specifically the log-probability simplex: $$\mathcal{M} = \{y \in \mathbb{R}^n : \sum_i e^{y_i} = 1\}$$

This manifold has intrinsic dimension $n-1$, reflecting the one constraint.

Why This Matters

The logit→logprob transform is informationally lossless for all practical purposes. What's "lost" (the absolute position) was never meaningful to begin with—it's gauge freedom, like the choice of ground in an electrical circuit or the zero point of gravitational potential energy.

The transform is better understood not as losing information, but as factoring out symmetry—moving from an overparameterized space to the natural coordinate system for probability distributions.

replyquoteparent
Thinking Summary

The user is asking about the paper "Stealing Part of a Production Language Model" and specifically why the attack method described can't be propagated to earlier layers to recover more of the model. Let me think about this carefully. The paper describes an attack that recovers the embedding projection layer (the final layer) of a transformer model. The key insight is that this final layer projects from a lower-dimensional hidden space (dimension h) to a higher-dimensional vocabulary space (dimension l), where h << l. This makes the layer low-rank, which enables the attack. The attack works by: 1. Querying the model many times with different prompts 2. Collecting the logit outputs (or reconstructing them from limited information like top-k logprobs) 3. Using SVD to find that these high-dimensional logit vectors actually lie in a lower-dimensional subspace 4. Recovering the projection matrix up to some symmetries Now, why can't this be extended to earlier layers? The paper mentions this limitation explicitly. Let me look for the relevant sections... In Section 7 (Future Work), they state: "Our attack recovers a single layer of a transformer. We see no obvious methodology to extend it beyond just a single layer, due to the non-linearity of the models. But we invite further research in this area." The key obstacles are: 1. **Non-linearity**: The intermediate layers of transformers contain non-linear activations (like ReLU, GELU, or similar), layer normalization, and other non-linear operations. The attack fundamentally relies on the linearity of the final projection layer. 2. **No direct access to intermediate representations**: The attack only has access to the final output (logits). To recover layer N-1, you'd need access to the outputs of layer N-1, but these are transformed by layer N (which you've recovered) plus all the non-linear operations. 3. **Residual connections and complex interactions**: Modern transformers have residual connections, attention mechanisms, and other architectural features that create complex dependencies between layers. The "bottom-up vs top-down" challenge presents a fundamental mathematical barrier. The attack's top-down approach breaks down due to the intricate non-linear transformations between layers. Even with a perfectly recovered final layer matrix, the intermediate layer remains obscured by complex non-linear functions that fundamentally alter the representational space. This makes a systematic, layer-by-layer recovery essentially impossible with current techniques.

The main obstacle to propagating this attack through earlier layers is non-linearity. The attack fundamentally exploits the fact that the final embedding projection layer is a simple linear transformation from a low-dimensional hidden space to a high-dimensional vocabulary space.

Here are the key barriers:

1. Non-linear Operations

Earlier transformer layers contain:

  • Non-linear activation functions (GELU, ReLU, etc.)
  • Layer normalization or RMSNorm
  • Softmax in attention mechanisms
  • Residual connections that mix information across layers

The paper explicitly states: "We see no obvious methodology to extend it beyond just a single layer, due to the non-linearity of the models."

2. No Access to Intermediate Representations

The attack only observes the final logits. To recover layer N-1, you would need to observe the outputs of layer N-1, but these are:

  • Transformed by the recovered layer N
  • Further modified by all intervening non-linear operations
  • Not directly observable from the API

3. Information Loss

Even if you perfectly recover the final layer, you're observing: $$y = W_N \cdot \text{NonLinear}(W_{N-1} \cdot \text{NonLinear}(\cdots))$$

You cannot simply "invert" this composition because:

  • The non-linear functions may not be invertible
  • Multiple different weight configurations could produce the same final output
  • The dimensionality reduction at each layer loses information

4. Architectural Complexity

Modern transformers have:

  • Attention mechanisms with complex query-key-value interactions
  • Multiple parallel attention heads
  • Residual connections that bypass layers
  • Layer-wise transformations that entangle information

The paper notes this is fundamentally different from prior work on stealing ReLU networks, where specialized techniques could handle the non-linearity layer by layer. Transformer architectures are far more complex and don't admit similar cryptanalytic approaches.

The authors acknowledge this limitation and explicitly invite future research in this area, suggesting it remains an open problem in model extraction attacks.

replyquoteparent

I get that transformers are complicated. You don’t need to explain that part. So is SHA1. And Yet…

Let’s say we have access to the logits and we recovered the final projection matrix. What is the obstacle to learning the weights of the penultimate layer? Be specific. Don’t make vague claims or lists of potential issues.

replyquoteparent
Thinking Summary

The user is asking for a specific analysis of what prevents extending the attack to the penultimate layer. They're right to push back on my vague answer. Let me think through this carefully. If we have recovered the final projection matrix $W$, we can compute: - From the API: logits = $W \cdot h$ where $h$ is the hidden state after the final layer normalization - We know $W$ from the attack - So we can compute $h = W^{-1} \cdot \text{logits}$ (or more precisely, use the pseudoinverse since $W$ is not square) But wait, there's a catch. The paper shows we can only recover $W$ up to an orthogonal transformation. So we actually recover $\tilde{W} = W \cdot G$ for some orthogonal matrix $G$. So when we compute $\tilde{W}^{-1} \cdot \text{logits}$, we get $G^{-1} \cdot h$, not $h$ itself. Now, $h$ comes from the normalization layer. In most transformers, this is either LayerNorm or RMSNorm applied to the output of the penultimate layer. Let's say it's RMSNorm for simplicity: $h = \text{RMSNorm}(x) = \frac{x}{\sqrt{\text{mean}(x^2)}} \cdot \gamma + \beta$ where $x$ is the output of the penultimate layer and $\gamma, \beta$ are learned parameters. The specific obstacle is: even if we could perfectly recover $h$ (which we can't due to the orthogonal ambiguity), we cannot uniquely determine $x$ from $h$ because: 1. The normalization destroys the magnitude information - we only get the direction 2. Multiple different $x$ vectors can map to the same normalized $h$ I need to articulate the precise mathematical barrier more rigorously. The core issue is the fundamental information loss inherent in normalization transformations, which create a many-to-one mapping that prevents unique reconstruction of the original vector.

You're right to push for specificity. Here's the precise mathematical obstacle:

When you recover the final projection matrix $W$, you actually recover $\tilde{W} = W \cdot G$ for some unknown orthogonal matrix $G$ (as proven in the paper). This means when you compute the hidden states from the logits, you get:

$$\tilde{h} = \tilde{W}^{+} \cdot \text{logits} = G^{-1} \cdot h$$

where $h$ is the true hidden state after layer normalization and $\tilde{W}^{+}$ denotes the pseudoinverse (the recovered $\tilde{W}$ is a tall matrix, so it only has a left inverse).

Now, the penultimate layer produces some output $x$, which goes through normalization (let's say RMSNorm) to produce $h$:

$$h = \frac{x}{\|x\|} \cdot \sqrt{d}$$

(ignoring the learnable scale/bias for clarity)

To recover the penultimate layer weights, you need to know what $x$ values correspond to your observed $h$ values. But this is impossible because:

The normalization is a many-to-one function: For any normalized vector $h$, there are infinitely many $x$ vectors that map to it (any positive scalar multiple of $x$ gives the same $h$).

Even if you could magically remove the orthogonal ambiguity and recover the true $h$, you still cannot determine which specific $x$ produced it. Without knowing the true $x$ values, you cannot set up a system of equations to solve for the penultimate layer weights.
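
A tiny numerical check of this many-to-one behavior, using a bare-bones RMSNorm with scalar scale and shift (a simplification for illustration, not any particular model's implementation):

```python
import numpy as np

def rmsnorm(x, gamma=1.5, beta=0.2):
    # normalize first, then apply scale and shift
    return gamma * x / np.sqrt(np.mean(x ** 2)) + beta

x = np.random.default_rng(4).normal(size=16)
for c in (0.5, 1.0, 10.0):
    assert np.allclose(rmsnorm(c * x), rmsnorm(x))  # every positive rescaling is invisible
```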

This is fundamentally different from the final layer case, where the transformation is purely linear and full-rank, allowing you to recover the weights up to a well-defined ambiguity.

replyquoteparent

But I don’t care about the fact that I only get the final layer up to some orthogonal matrix because I only care about getting the model weights up to symmetry. Symmetry at every step of the model architecture. So I want an explanation of where SPECIFICALLY this breaks.

replyquoteparent
Thinking Summary

The user is asking about the specific mathematical obstacle to propagating the attack backward through the model layers. They want me to be precise and not vague. They understand that we only care about recovering weights up to symmetry. Let me think through this carefully. The paper recovers $\tilde{W} = W \cdot G$ where G is some transformation matrix. The key insight the user has is that they don't care about this ambiguity - they're fine with recovering everything up to symmetry. So what's the actual obstacle? Let me trace through the computation: 1. We observe logits = W * h where h is the output of the last hidden layer (after normalization) 2. We can recover W up to right multiplication by some matrix G 3. So we can compute h up to left multiplication by G^{-1} 4. Now h came from some normalization of the previous layer's output x 5. The normalization (whether LayerNorm or RMSNorm) maps x to h The key issue is: even if we're okay with symmetries, the normalization introduces a fundamental information loss that can't be recovered even up to symmetry. Specifically, for RMSNorm: h = x / ||x|| * sqrt(d) For LayerNorm: h = (x - mean(x)) / std(x) * gamma + beta In both cases, the norm (or scale) of x is lost. This isn't a symmetry - it's actual information destruction. But wait, the user is asking about recovering weights up to symmetry. Let's think more carefully. If the penultimate layer outputs x, and x goes through normalization to produce h, and we know h (up to some transformation), can we recover the penultimate layer weights? The penultimate layer computes something like: x = W_{n-1} * (previous activations) + bias Even if we knew x exactly, we'd need to know the previous activations to solve for W_{n-1}. But we don't know x - we only know its normalized version h. I see the core problem now. Normalization creates a fundamental information loss where multiple input vectors map to the same normalized output. This non-invertible transformation prevents precise weight recovery, even when allowing for symmetry transformations. The mapping from input to normalized output destroys critical scale and offset information needed to uniquely reconstruct the original weights. The key issue is that normalization doesn't just reparameterize - it actively collapses distinct input configurations into a single representation, making weight reconstruction fundamentally impossible. Different input vectors that normalize to the same output would require fundamentally different weight configurations to produce equivalent model behavior.

You're right to demand precision. Let me trace through exactly where the symmetry argument breaks down.

The Core Issue

When we recover the final layer, we get $\tilde{W} = W \cdot G$ for some matrix $G$. This gives us hidden states $\tilde{h} = G^{-1} \cdot h$.

The hidden state $h$ comes from normalizing the penultimate layer's output $x$:

  • For RMSNorm: $h_i = \gamma_i \cdot \frac{x_i}{\|x\|_2}$
  • For LayerNorm: $h_i = \gamma_i \cdot \frac{x_i - \bar{x}}{\sigma(x)} + \beta_i$

Here's the precise problem: The normalization is not injective, even up to symmetry.

Why This Is Different From The Final Layer

For the final layer, the ambiguity $W \mapsto W \cdot G$ corresponds to a symmetry of the model: you can transform $W \mapsto W \cdot G$ and $h \mapsto G^{-1} \cdot h$ and get identical outputs.

But for normalization, if you have some $h$ on the unit sphere, there are infinitely many $x$ vectors that map to it: $\{cx : c > 0\}$ all normalize to the same $h$.

This is not a model symmetry. Different values of $c$ would require genuinely different penultimate weight matrices to produce. If the penultimate layer computes $x = W_{\text{pen}} \cdot a$ for some activation $a$, then:

  • To get $x$, you need weights $W_{\text{pen}}$
  • To get $2x$, you need weights $2W_{\text{pen}}$

These are not related by any symmetry of the overall model - they would produce different outputs for most inputs.

The Fundamental Difference

The key distinction is:

  • Linear layers: Information preserved up to a well-defined group action (right multiplication)
  • Normalization: Information genuinely destroyed (projection onto a lower-dimensional manifold)

Even if you're happy to accept all model symmetries, normalization creates a many-to-one mapping where the different pre-images would require weight configurations that are NOT related by any symmetry. The scale information is irrecoverably lost, not just reparameterized.

This is why the attack cannot propagate backwards, even in principle, even accepting all symmetries.

replyquoteparent
Thinking Summary

The user is asking me to be more precise and clear about the mathematical setup. I need to define all notation explicitly and lay out the problem cleanly. Let me think through this carefully. The setup is: - We have access to logits from a transformer model - We can recover the final projection matrix W up to some transformation - We want to know why we can't propagate back to recover earlier layers Let me be very precise about the mathematical structure: 1. The final layer of a transformer outputs logits = W * h where: - W is the projection matrix (vocabulary_size × hidden_dim) - h is the hidden state after the final normalization layer (hidden_dim × 1) 2. The normalization layer (let's focus on RMSNorm for clarity) computes: h = γ ⊙ normalize(x) + β where: - x is the output of the penultimate layer (hidden_dim × 1) - normalize(x) = x / ||x||_2 - γ is a learned scale parameter (hidden_dim × 1) - β is a learned bias parameter (hidden_dim × 1) - ⊙ denotes element-wise multiplication 3. The penultimate layer computes: x = f(W_pen * a + b_pen) where: - W_pen is the weight matrix of the penultimate layer - a is the input to the penultimate layer - b_pen is the bias - f is some nonlinearity (or could be linear for the residual stream) Now, the key question is: given that we can recover W up to some transformation, why can't we recover W_pen up to some (possibly different) transformation? The issue is that normalization destroys information in a way that cannot be undone even up to symmetry.

Let me set up the problem precisely.

Problem Setup

Consider the last two layers of a transformer:

Layer n-1 (Penultimate): Takes input $a \in \mathbb{R}^d$ and outputs $x \in \mathbb{R}^h$: $$x = W_{n-1} \cdot a + b_{n-1}$$

Normalization: Takes $x$ and outputs $h \in \mathbb{R}^h$.
For RMSNorm: $$h_i = \gamma_i \cdot \frac{x_i}{\sqrt{\tfrac{1}{h}\sum_{j=1}^h x_j^2}} + \beta_i$$

where $\gamma \in \mathbb{R}^h$ and $\beta \in \mathbb{R}^h$ are learned parameters (element-wise scale and shift).

Layer n (Final): Takes $h$ and outputs logits $z \in \mathbb{R}^v$: $$z = W_n \cdot h$$

What We Know

From the paper's attack, we can recover $\tilde{W}_n = W_n \cdot G$ for some invertible matrix $G \in \mathbb{R}^{h \times h}$.

This means from observed logits $z$, we can compute: $$\tilde{h} = \tilde{W}_n^{+} \cdot z = G^{-1} \cdot h$$ where $\tilde{W}_n^{+}$ is the pseudoinverse ($\tilde{W}_n$ is $v \times h$, so it is only left-invertible).
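
A synthetic sketch of this step (shapes and matrices are invented; the point is only that the pseudoinverse of the recovered matrix yields $G^{-1} h$, not $h$ itself):

```python
import numpy as np

rng = np.random.default_rng(5)
v, h = 4096, 64                        # hypothetical vocabulary and hidden sizes
W_n = rng.normal(size=(v, h))          # true final projection
G = rng.normal(size=(h, h))            # unknown reparameterization
W_tilde = W_n @ G                      # what the attack actually recovers

h_true = rng.normal(size=h)            # hidden state after normalization
z = W_n @ h_true                       # observed logits

h_tilde = np.linalg.pinv(W_tilde) @ z  # recovered hidden state
assert np.allclose(h_tilde, np.linalg.solve(G, h_true))  # equals G^{-1} h, not h
```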

The Question

Can we now recover $W_{n-1}$ (up to some symmetry)?

Why It Fails

The normalization function maps many different $x$ vectors to the same $h$:

  • Any $x$ and $cx$ (for $c > 0$) produce the same normalized vector
  • The scale $\gamma$ and shift $\beta$ are applied after normalization, so $h(cx) = h(x)$ for every $c > 0$, regardless of $\gamma$ and $\beta$

But $x = W_{n-1} \cdot a + b_{n-1}$ and $cx = c(W_{n-1} \cdot a + b_{n-1})$ would require:

  • Either the same weights (if $c = 1$)
  • Or weights scaled by $c$: $W'_{n-1} = cW_{n-1}$, $b'_{n-1} = cb_{n-1}$

This scaling is not a symmetry of the model. Using $cW_{n-1}$ instead of $W_{n-1}$ would change the model's behavior on other inputs. There's no transformation we can apply elsewhere in the network to compensate.

Therefore, even knowing $h$ perfectly (up to the symmetry $G$), we cannot determine which $x$ produced it, and thus cannot set up constraints to solve for $W_{n-1}$.

replyquoteparent