Stealing Part of a Production Language Model
Thread (64 posts)
Reading §5.3. Can you explain the difference between logits and logprobs and its relevance to the attack?
Thinking Summary
The user is asking about Section 5.3 of the paper, which is "Extraction Attack for Top-5 Logit Bias APIs". They want to understand the difference between logits and logprobs, and why this difference matters for the attack. Looking at the relevant section, I can see that:
1. Logits are the raw outputs before the softmax function is applied
2. Logprobs are the log probabilities after the softmax function is applied
3. The issue is that when you apply a logit bias and then observe the logprob, you get an additional bias-dependent term that needs to be dealt with
Let me explain this clearly.
Great question! The distinction between logits and logprobs is crucial to understanding why the attack in Section 5.3 is more complex than the initial warm-up attack.
Logits vs Logprobs
Logits are the raw output values from the model before applying softmax:
- For token $i$: the logit is $z_i$, the raw unnormalized score.
Logprobs are the log of the probabilities after applying softmax:
- For token $i$: the logprob is $y_i = \log p_i = z_i - \log \sum_j \exp(z_j)$.
Why This Matters for the Attack
The key challenge is that production APIs return logprobs, not logits. When you apply a logit bias $B$ to token $i$ and observe its logprob, you get:

$$y_i^{(B)} = z_i + B - \log\!\Big(\exp(z_i + B) + \sum_{j \neq i} \exp(z_j)\Big)$$

Notice the problem: the normalizing constant (the log term) changes depending on which tokens you bias. This means you can't simply subtract $B$ from $y_i^{(B)}$ to get $z_i$.
The Attack's Solution
The paper presents two approaches:
- Reference token method: Use one token as a fixed reference point across all queries. Because the biased token and the reference share the same normalizing constant, comparing them cancels it out:

$$y_i^{(B)} - y_R^{(B)} = (z_i + B) - z_R \quad\Longrightarrow\quad z_i - z_R = y_i^{(B)} - y_R^{(B)} - B$$

- Linear constraint method (Appendix A): Treat each logprob observation as a linear constraint on the original logits, then solve the resulting system of equations.
This distinction is why the attack requires multiple queries with carefully chosen logit biases—it's reconstructing the underlying logits from observations that have been transformed by the softmax's normalization.
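To make the reference-token method concrete, here is a minimal numpy sketch (not the paper's code): a simulated API returns logprobs after an added bias, and differencing against a fixed reference token recovers relative logits. The vocabulary size, bias value, and function names are made up for illustration; a real top-5 API would only return the biased token and the reference when both land in the top 5.

```python
import numpy as np

rng = np.random.default_rng(0)
true_logits = rng.normal(size=8)           # hidden logits we want to recover (toy example)

def api_logprobs(bias):
    """Simulated API: returns the full logprob vector after adding a logit bias."""
    z = true_logits + bias
    return z - np.log(np.sum(np.exp(z)))   # log-softmax

B = 10.0        # bias large enough to push token i into the returned top tokens
ref = 0         # fixed reference token used across all queries
recovered = np.zeros_like(true_logits)

for i in range(1, len(true_logits)):
    bias = np.zeros_like(true_logits)
    bias[i] = B
    y = api_logprobs(bias)
    # The shared normalizing constant cancels in the difference:
    #   y[i] - y[ref] = (z_i + B) - z_ref
    recovered[i] = y[i] - y[ref] - B       # = z_i - z_ref

print(np.allclose(recovered, true_logits - true_logits[ref]))   # True
```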
What if I don't even have log probs at all and I can just get a bunch of next token predictions?
Thinking Summary
The user is asking about the scenario where you only get the argmax token (the actual predicted token) without any logprobs at all. This is addressed in Section 6 and Appendix C of the paper, which discusses "Logprob-free attacks".
Excellent question! This is exactly what Section 6 and Appendix C address - the "logprob-free" attacks where you only get the sampled token, not any probability information.
The Basic Approach: Binary Search
The key insight is that with temperature 0 sampling, the API returns the token with the highest logit value. So if you can manipulate logit biases, you can use binary search:
- Start with a prompt and identify the top token (call it token 0)
- For each other token $i$:
  - Binary search to find the minimum bias $x_i^*$ needed to make token $i$ the top token
  - This tells you that $z_i \approx z_0 - x_i^*$: the gap to the top logit equals the bias you had to add (see the sketch after this list)
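Here is a minimal sketch of that loop, assuming a simulated temperature-0 endpoint that reveals only the argmax token after a logit bias is applied. The vocabulary size, search range, and tolerance are arbitrary choices for illustration, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
true_logits = rng.normal(size=6)           # hidden logits

def api_top_token(bias):
    """Simulated temperature-0 API: returns only the index of the winning token."""
    return int(np.argmax(true_logits + bias))

top = api_top_token(np.zeros_like(true_logits))   # "token 0" in the text's notation
gaps = {}

for i in range(len(true_logits)):
    if i == top:
        continue
    lo, hi = 0.0, 50.0                     # assume the logit gap is below 50
    while hi - lo > 1e-4:                  # ~19 iterations per token here
        mid = (lo + hi) / 2
        bias = np.zeros_like(true_logits)
        bias[i] = mid
        if api_top_token(bias) == i:       # the bias was enough: shrink from above
            hi = mid
        else:
            lo = mid
    gaps[i] = hi                           # minimal bias x_i*, so z_i ≈ z_top - x_i*

est = {i: true_logits[top] - x for i, x in gaps.items()}
print(all(abs(est[i] - true_logits[i]) < 1e-3 for i in gaps))   # True
```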
The Challenge
This is much more expensive than the logprob-based attacks:
- Basic binary search: ~10 queries per logit (vs 0.25 with logprobs)
- Each token requires on the order of $\log_2(1/\epsilon)$ queries to reach precision $\epsilon$
Improved Methods
The paper presents optimizations that query multiple tokens simultaneously:
- Hyperrectangle method: Apply biases to multiple tokens at once and accumulate linear constraints based on which token wins
- One-of-n method: Carefully choose biases to make each token equally likely to be sampled
These reduce the cost to ~3.7 queries per logit, though still much higher than logprob-based methods.
Information-Theoretic Limits
The paper proves (Lemma 6) an information-theoretic lower bound on the number of queries required, stated in terms of the vocabulary size. Their best attack is within a factor of 2 of this theoretical limit!
The key takeaway: Yes, you can extract the model even with just next-token predictions, but it's about 15x more expensive than with logprob access.
Thinking Summary
The user is asking about the case where they don't have access to logit biases at all - just the ability to get next token predictions. This is the most restrictive case. Let me think about what the paper says about this. Looking through the paper, I don't see them explicitly addressing the case where you have NO logit bias capability at all. Their attacks fundamentally rely on being able to manipulate the logits through biases. Without logit biases, you would only be able to:
1. Send different prompts
2. Get the top token for each prompt
This would make it extremely difficult if not impossible to extract the model parameters using their approach, because:
- You can't manipulate which token becomes the top token
- You can't perform binary search
- You can't create the linear constraints they use
The paper does mention in Section 8 (Future Work) about "Removing the logit bias assumption" where they say: "All our attacks require the ability to pass a logit bias. Model providers including Google and OpenAI provided this capability when we began the writing of this paper, but this could change."
They suggest that other API parameters like unconstrained temperature and top-k parameters could potentially leak logit values through queries, but they don't develop this attack. So the honest answer is: the paper doesn't provide a method for extraction without logit bias access.
Great question! This exposes a fundamental limitation of the attack: the paper does not provide a method that works without logit bias access.
Why Logit Bias is Essential
All the attacks in the paper fundamentally rely on manipulating logits through biases to:
- Force specific tokens to become the top token
- Create controlled comparisons between tokens
- Extract relative logit differences
Without this capability, you're left with only:
- Sending different prompts
- Observing which token naturally wins for each prompt
What the Authors Say
In Section 8 (Future Work), they acknowledge this limitation:
"All our attacks require the ability to pass a logit bias. Model providers including Google and OpenAI provided this capability when we began the writing of this paper, but this could change."
They speculate that other API parameters might leak information:
- Temperature: Could potentially reveal logit gaps through sampling distributions
- Top-k parameters: Might leak information about relative token rankings
But they explicitly state they haven't developed these attacks.
Why It's So Hard
Without logit bias, you lose the ability to:
- Perform binary search (can't force tokens to win)
- Create the linear constraint systems they use
- Make controlled measurements of logit differences
You'd essentially be limited to statistical analysis of natural token distributions across many prompts, which would likely require vastly more queries and might not even recover precise weights.
Bottom line: The attack requires logit bias. Without it, you'd need fundamentally different techniques that the paper doesn't provide.
can you explain more about the role and effects of softmax from the perspective of information theory?
Thinking Summary
The user is asking about the softmax function from an information-theoretic perspective. This is a great question that goes deeper into why the attack is challenging and what information is preserved/lost through the softmax transformation. Key points to cover:
1. Softmax as a normalization that creates a probability distribution
2. Information preservation vs transformation
3. Invariance to additive constants
4. The role of temperature in information theory
5. Connection to maximum entropy principle
6. Why this makes the attack harder
Excellent question! Let's examine softmax through an information-theoretic lens to understand why it makes the attack challenging.
Softmax as Information Transformation
The softmax function transforms logits $z$ into a probability distribution:

$$p_i = \frac{\exp(z_i)}{\sum_j \exp(z_j)}$$
From an information theory perspective, softmax has several key properties:
1. Invariance to Translation
Softmax is invariant to adding a constant to all logits:

$$\mathrm{softmax}(z + c\mathbf{1}) = \mathrm{softmax}(z) \quad \text{for any constant } c$$
This means softmax destroys one degree of freedom—you lose information about the absolute scale of the logits. You can only recover relative differences between logits, not their absolute values.
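A quick numerical check of this invariance (a throwaway numpy sketch with toy logits):

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)          # for numerical stability; does not change the result
    e = np.exp(z)
    return e / e.sum()

z = np.array([2.0, -1.0, 0.5, 3.0])
for c in (-100.0, 0.0, 7.3):
    print(np.allclose(softmax(z), softmax(z + c)))   # True, True, True
```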
2. Maximum Entropy Principle
Softmax arises naturally from the maximum entropy principle. Given constraints on expected values, the distribution that maximizes entropy has the exponential form

$$p_i \propto \exp(z_i),$$

which is exactly softmax once normalized.
This connects to why language models use softmax: it's the least biased way to convert scores into probabilities.
3. Information Compression
The log-sum-exp operation in the denominator acts as a "soft maximum":

$$\log \sum_j \exp(z_j) \approx \max_j z_j \quad \text{when one logit dominates}$$
This compresses information about the full distribution into a single normalizing constant, making it harder to recover individual logits.
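For example, when one logit clearly dominates, log-sum-exp sits barely above the max (a small numpy sketch with made-up values):

```python
import numpy as np

z = np.array([10.0, 1.0, 0.0, -2.0])
lse = np.log(np.sum(np.exp(z)))
print(lse, np.max(z))   # ≈ 10.0002 vs 10.0: log-sum-exp is a smoothed maximum
```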
Temperature and Information
Consider softmax with temperature $T$:

$$p_i(T) = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}$$

- As $T \to 0$: Approaches argmax (maximum information about the top token, none about the others)
- As $T \to \infty$: Approaches the uniform distribution (no information preserved)
- At $T = 1$: Balances preserving ordering information and relative magnitudes (all three regimes are illustrated in the sketch below)
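A small sketch of these three regimes on toy logits (the temperatures chosen are arbitrary):

```python
import numpy as np

def softmax_T(z, T):
    z = z / T
    z = z - np.max(z)          # numerical stability
    e = np.exp(z)
    return e / e.sum()

z = np.array([2.0, 1.0, 0.0])
for T in (0.01, 1.0, 100.0):
    print(T, np.round(softmax_T(z, T), 3))
# T=0.01  -> [1.    0.    0.   ]   nearly argmax
# T=1     -> [0.665 0.245 0.09 ]   ordering and relative magnitudes preserved
# T=100   -> [0.337 0.333 0.33 ]   nearly uniform
```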
Why This Matters for the Attack
The attack must overcome several information-theoretic challenges:
- Lost Degree of Freedom: The attack can only recover $n - 1$ independent values from the logprobs (where $n$ is the vocabulary size), because of the normalization constraint $\sum_i p_i = 1$.
- Nonlinear Coupling: Each logprob depends on ALL logits through the normalization (a numerical check of this Jacobian follows the list):

$$\frac{\partial y_j}{\partial z_i} = \begin{cases} 1 - p_i & \text{if } i = j \\ -p_i & \text{if } i \neq j \end{cases}$$

- Information Hiding: When you bias specific tokens, you change the effective "temperature" for those tokens, altering how information is distributed across the output.
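Here is a finite-difference check of that Jacobian on toy logits (a sketch, not from the paper), showing that nudging any single logit moves every logprob:

```python
import numpy as np

def logprobs(z):
    return z - np.log(np.sum(np.exp(z)))

z = np.array([1.0, 2.0, 0.5, -1.0])
p = np.exp(logprobs(z))
n = len(z)

# Analytic Jacobian: J[i, j] = d y_j / d z_i = (1 if i == j else 0) - p_i
J_analytic = np.eye(n) - p[:, None]

# Finite-difference check: perturb one logit at a time and watch ALL logprobs move
eps = 1e-6
J_numeric = np.zeros((n, n))
for i in range(n):
    dz = np.zeros(n)
    dz[i] = eps
    J_numeric[i, :] = (logprobs(z + dz) - logprobs(z - dz)) / (2 * eps)

print(np.allclose(J_analytic, J_numeric, atol=1e-6))   # True
```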
The Attack's Information Recovery
The clever insight is that by strategically choosing which tokens to bias, you can create multiple "views" of the same underlying logit vector, each with different normalization constants. With enough views, you can solve for the original logits (up to translation).
This is why the attack needs:
- Multiple queries (to get different normalizations)
- A reference token (to fix the translation ambiguity)
- Careful bias selection (to ensure numerical stability)
The attack essentially exploits the fact that while softmax destroys some information, it preserves enough structure that the original logits can be reconstructed through careful measurement of how the outputs change under different biases.
not interested in the bias part yet just thinking deeply and carefully about the effects of softmax. assume the bias part doesn’t exist. not even thinking particularly about this attack.
How do I understand the relationship between logprobs and logits from the pov of information theory? please think very carefully and distill a high-signal answer
Thinking Summary
The user wants a deep information-theoretic understanding of the logit→logprob transformation via softmax, without considering the attack or bias. They want me to think carefully and provide a high-signal answer about what information is preserved and what is lost. Let me think about this carefully:
1. The transformation is: logits z → softmax → probabilities p → log → logprobs y, i.e. z_i → p_i = exp(z_i) / Σ exp(z_j) → y_i = log(p_i) = z_i - log(Σ exp(z_j))
2. Information theoretic properties:
- Dimensionality: We go from n logits to n logprobs, but there's a constraint (probabilities sum to 1)
- Invariances: Translation invariance means we lose one degree of freedom
- What's preserved: All pairwise differences z_i - z_j are preserved exactly
- What's lost: The absolute scale/location of the logits
3. The key insight is that this is a projection onto a constraint manifold. We're projecting from R^n to a (n-1)-dimensional manifold.
4. From an information geometry perspective, softmax maps from the natural parameter space to the mean parameter space of the categorical distribution.
5. The sufficient statistics perspective: For a categorical distribution, the logprobs ARE the sufficient statistics - they contain all the information about the distribution that matters.
Let me structure this more clearly...
The Logit→Logprob Transform: An Information-Theoretic View
The transformation from logits to logprobs via softmax is fundamentally about moving between dual representations of the same information.
The Core Transform

$$z_i \;\longmapsto\; p_i = \frac{\exp(z_i)}{\sum_j \exp(z_j)} \;\longmapsto\; y_i = \log p_i = z_i - \log \sum_j \exp(z_j)$$
What's Preserved Exactly
All pairwise differences between logits are preserved:

$$y_i - y_j = z_i - z_j \quad \text{for all } i, j$$
This means the logprobs contain complete information about the relative structure of the logits—all orderings, all relative magnitudes.
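A quick numpy check of this fact on toy values (a sketch):

```python
import numpy as np

z = np.array([3.0, -0.5, 1.2, 0.0])
y = z - np.log(np.sum(np.exp(z)))      # logprobs
# The matrix of pairwise differences is identical for logits and logprobs
print(np.allclose(z[:, None] - z[None, :], y[:, None] - y[None, :]))   # True
```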
What's Lost
One degree of freedom: the absolute position. You can shift all logits by any constant and get identical logprobs. Mathematically, the transform has a null space spanned by the all-ones vector $\mathbf{1}$.
The Deep Structure
From information geometry, this is the natural duality between:
- Natural parameters (logits): The canonical parameterization of the exponential family
- Mean parameters (probabilities): The expectation parameters
- Log-mean parameters (logprobs): Just the log of mean parameters
The softmax is the link function that maps between these dual spaces.
Information-Theoretic Insight
The key insight: logprobs are sufficient statistics.
For any inference task involving a categorical distribution, the logprobs contain all the information you need. The lost degree of freedom (absolute position) is statistically meaningless—it doesn't affect any likelihood, any prediction, any Bayesian update.
The Manifold Perspective
The transform projects from $\mathbb{R}^n$ onto an $(n-1)$-dimensional manifold, specifically the log-probability simplex:

$$\mathcal{Y} = \Big\{\, y \in \mathbb{R}^n : \sum_i e^{y_i} = 1 \,\Big\}$$

This manifold has intrinsic dimension $n - 1$, reflecting the one constraint.
Why This Matters
The logit→logprob transform is informationally lossless for all practical purposes. What's "lost" (the absolute position) was never meaningful to begin with—it's gauge freedom, like the choice of ground in an electrical circuit or the zero point of gravitational potential energy.
The transform is better understood not as losing information, but as factoring out symmetry—moving from an overparameterized space to the natural coordinate system for probability distributions.