Stealing Part of a Production Language Model
Thread (64 posts)
Reading §5.3: can you explain the difference between logits and logprobs and the relevance to the attack?
Thinking Summary
The user is asking about Section 5.3 of the paper, "Extraction Attack for Top-5 Logit Bias APIs". They want to understand the difference between logits and logprobs, and why this difference matters for the attack. Looking at the relevant section:
1. Logits are the raw outputs before the softmax function is applied
2. Logprobs are the log probabilities after the softmax function is applied
3. The issue is that when you apply a logit bias and then observe the logprob, you get an additional bias-dependent term that needs to be dealt with
Let me explain this clearly.
Great question! The distinction between logits and logprobs is crucial to understanding why the attack in Section 5.3 is more complex than the initial warm-up attack.
Logits vs Logprobs
Logits are the raw output values from the model before applying softmax:
- For token $i$: the logit $z_i$, an unnormalized real-valued score
Logprobs are the log of the probabilities after applying softmax:
- For token $i$: $y_i = \log p_i = z_i - \log \sum_j \exp(z_j)$
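To make the relationship concrete, here is a minimal numpy sketch (my own illustration, not code from the paper) converting a logit vector into logprobs:

```python
import numpy as np

def logprobs_from_logits(z: np.ndarray) -> np.ndarray:
    """Log-softmax: y_i = z_i - log(sum_j exp(z_j)), computed stably."""
    z = z - z.max()                      # shift by the max for numerical stability
    return z - np.log(np.exp(z).sum())   # subtract the log-normalizer

z = np.array([2.0, 1.0, -0.5])           # toy logits
y = logprobs_from_logits(z)
print(y, np.exp(y).sum())                 # logprobs; probabilities sum to 1
```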
Why This Matters for the Attack
The key challenge is that production APIs return logprobs, not logits. When you apply a logit bias $b_i$ to token $i$ and observe its logprob, you get:

$y_i^{(b)} = (z_i + b_i) - \log\Big( \sum_{j \neq i} \exp(z_j) + \exp(z_i + b_i) \Big)$

Notice the problem: the normalizing constant (the log term) changes depending on which tokens you bias. This means you can't simply subtract $b_i$ from $y_i^{(b)}$ to get $z_i$.
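A quick numerical check (again my own sketch, with toy logits) shows why naive subtraction fails:

```python
import numpy as np

def logprobs(z):
    z = z - z.max()
    return z - np.log(np.exp(z).sum())

z = np.array([2.0, 1.0, -0.5])   # hidden logits we want to recover
bias = 5.0

zb = z.copy()
zb[1] += bias                     # the API applies the bias before softmax
y_biased = logprobs(zb)

# Subtracting the bias does NOT recover the true logprob, because the
# log-normalizer itself moved when we added the bias:
print(y_biased[1] - bias)         # about -5.02
print(logprobs(z)[1])             # about -1.37
```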
The Attack's Solution
The paper presents two approaches:
1. Reference token method: Use one token as a fixed reference point across all queries. By comparing other tokens to this reference, you can cancel out the changing normalization constants (see the sketch after this list): both logprobs in a single response share the same normalizer, so $y_i^{(b)} - y_R^{(b)} = (z_i + b_i) - z_R$.
2. Linear constraint method (Appendix A): Treat each logprob observation as a linear constraint on the original logits, then solve the resulting system of equations.
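A minimal numpy sketch of the reference-token idea, under my own toy setup (it exposes full-vocabulary logprobs for simplicity, whereas the paper's setting only returns the top 5):

```python
import numpy as np

def api_logprobs(z, bias):
    """Toy stand-in for the API: log-softmax over (logits + bias)."""
    zb = z + bias
    zb = zb - zb.max()
    return zb - np.log(np.exp(zb).sum())

z = np.random.randn(8)            # hidden logits
R, B = 0, 10.0                    # reference token and a large bias

recovered = np.zeros_like(z)
for i in range(1, len(z)):
    bias = np.zeros_like(z)
    bias[i] = B                   # push token i up so it is observable
    y = api_logprobs(z, bias)
    # Shared normalizer cancels in the difference:
    # y[i] - y[R] = (z[i] + B) - z[R]
    recovered[i] = y[i] - y[R] - B   # = z[i] - z[R]

# Matches the hidden logits up to the common shift z[R]
assert np.allclose(recovered[1:], (z - z[R])[1:])
```

Note that the logits are only ever recoverable up to a common additive shift (softmax is shift-invariant), which is why everything is expressed relative to $z_R$.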
This distinction is why the attack requires multiple queries with carefully chosen logit biases—it's reconstructing the underlying logits from observations that have been transformed by the softmax's normalization.
What if I don't even have logprobs at all and can just get a bunch of next-token predictions?
Thinking Summary
The user is asking about the scenario where you only get the argmax token (the actual predicted token) without any logprobs at all. This is addressed in Section 6 and Appendix C of the paper, which discuss the "logprob-free attacks".
Excellent question! This is exactly what Section 6 and Appendix C address: the "logprob-free" attacks, where you only get the sampled token and no probability information.
The Basic Approach: Binary Search
The key insight is that with temperature 0 sampling, the API returns the token with the highest logit value. So if you can manipulate logit biases, you can use binary search:
- Start with a prompt and identify the top token (call it token 0, with logit $z_0$)
- For each other token $i$:
  - Binary search for the minimal bias $b_i$ needed to make token $i$ the top token
  - This tells you that $z_0 - z_i \approx b_i$ (see the sketch after this list)
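A minimal sketch of that binary search against a toy temperature-0 oracle of my own construction (not the paper's code):

```python
import numpy as np

def api_top_token(z, bias):
    """Toy temperature-0 API: returns the argmax of (logits + bias)."""
    return int(np.argmax(z + bias))

def logit_gap(z, i, lo=0.0, hi=50.0, eps=1e-3):
    """Binary-search the minimal bias that makes token i win,
    which pins down z_top - z_i to within eps."""
    while hi - lo > eps:
        mid = (lo + hi) / 2
        bias = np.zeros_like(z)
        bias[i] = mid
        if api_top_token(z, bias) == i:
            hi = mid              # bias was enough; shrink from above
        else:
            lo = mid              # not enough; shrink from below
    return (lo + hi) / 2

z = np.random.randn(8)
top = int(np.argmax(z))
i = (top + 1) % len(z)
print(logit_gap(z, i), z[top] - z[i])   # should agree to ~1e-3
```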
The Challenge
This is much more expensive than the logprob-based attacks:
- Basic binary search: ~10 queries per logit (vs. ~0.25 with logprob access)
- Each token requires about $\log_2(\Delta/\epsilon)$ queries to reach precision $\epsilon$ over a bias search range of width $\Delta$; e.g., $\Delta = 32$ and $\epsilon = 2^{-5}$ gives $\log_2(1024) = 10$ queries
Improved Methods
The paper presents optimizations that query multiple tokens simultaneously:
- Hyperrectangle method: Apply biases to multiple tokens at once and accumulate linear constraints based on which token wins
- One-of-n method: Carefully choose biases to make each token equally likely to be sampled
These reduce the cost to ~3.7 queries per logit, though still much higher than logprob-based methods.
Information-Theoretic Limits
The paper proves (Lemma 6) an information-theoretic lower bound: each response reveals at most $\log_2 l$ bits (it tells you which of the $l$ vocabulary tokens was sampled), so recovering all $l$ logits to $b$ bits of precision requires at least on the order of $l \cdot b / \log_2 l$ queries, where $l$ is the vocabulary size. Their best attack is within a factor of 2 of this theoretical limit!
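As a worked example with made-up numbers (neither $l$ nor $b$ here is taken from the paper), this is what the bound looks like:

```python
import math

l = 100_000                     # hypothetical vocabulary size
b = 30                          # hypothetical bits of precision per logit
bits_per_query = math.log2(l)   # each sampled token reveals at most ~16.6 bits
queries_per_logit = b / bits_per_query
print(f"lower bound ~ {queries_per_logit:.1f} queries per logit")  # ~1.8
```

Under these assumptions the bound is about 1.8 queries per logit; doubling it gives roughly 3.7, consistent with the claim above that the best attack sits within a factor of 2 of optimal.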
The key takeaway: Yes, you can extract the model even with just next-token predictions, but it's about 15x more expensive than with logprob access.
And what if I can't set logit biases either, just send prompts and read off the predicted token?
Thinking Summary
The user is asking about the case where they don't have access to logit biases at all, just the ability to get next-token predictions. This is the most restrictive case. Let me think about what the paper says about this. Looking through the paper, I don't see them explicitly addressing the case where you have NO logit bias capability at all. Their attacks fundamentally rely on being able to manipulate the logits through biases. Without logit biases, you would only be able to:
1. Send different prompts
2. Get the top token for each prompt
This would make it extremely difficult, if not impossible, to extract the model parameters using their approach, because:
- You can't manipulate which token becomes the top token
- You can't perform binary search
- You can't create the linear constraints they use
The paper does mention this in Section 8 (Future Work) under "Removing the logit bias assumption": "All our attacks require the ability to pass a logit bias. Model providers including Google and OpenAI provided this capability when we began the writing of this paper, but this could change." They suggest that other API parameters like unconstrained temperature and top-k parameters could potentially leak logit values through queries, but they don't develop this attack. So the honest answer is: the paper doesn't provide a method for extraction without logit bias access.
Great question! This exposes a fundamental limitation of the attack: the paper does not provide a method that works without logit bias access.
Why Logit Bias is Essential
All the attacks in the paper fundamentally rely on manipulating logits through biases to:
- Force specific tokens to become the top token
- Create controlled comparisons between tokens
- Extract relative logit differences
Without this capability, you're left with only:
- Sending different prompts
- Observing which token naturally wins for each prompt
What the Authors Say
In Section 8 (Future Work), they acknowledge this limitation:
"All our attacks require the ability to pass a logit bias. Model providers including Google and OpenAI provided this capability when we began the writing of this paper, but this could change."
They speculate that other API parameters might leak information:
- Temperature: Could potentially reveal logit gaps through sampling distributions
- Top-k parameters: Might leak information about relative token rankings
But they explicitly state they haven't developed these attacks.
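Just to make the temperature speculation concrete (my own sketch, not an attack developed in the paper): at temperature 1, $p_i / p_j = \exp(z_i - z_j)$, so log-ratios of empirical sampling frequencies estimate logit gaps, albeit noisily and at a high query cost:

```python
import numpy as np

rng = np.random.default_rng(0)
z = np.array([2.0, 1.0, 0.0, -1.0])      # hidden logits
p = np.exp(z) / np.exp(z).sum()          # temperature-1 sampling distribution

# Pretend each API call samples one next token from p:
samples = rng.choice(len(z), size=200_000, p=p)
counts = np.bincount(samples, minlength=len(z))

# Log of a frequency ratio estimates the corresponding logit difference:
gap_estimate = np.log(counts[0] / counts[1])
print(gap_estimate, z[0] - z[1])          # ~1.0, up to sampling noise
```

Note how many samples this toy needs for even one gap, which is consistent with the point below that purely statistical approaches would require vastly more queries.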
Why It's So Hard
Without logit bias, you lose the ability to:
- Perform binary search (can't force tokens to win)
- Create the linear constraint systems they use
- Make controlled measurements of logit differences
You'd essentially be limited to statistical analysis of natural token distributions across many prompts, which would likely require vastly more queries and might not even recover precise weights.
Bottom line: The attack requires logit bias. Without it, you'd need fundamentally different techniques that the paper doesn't provide.