Stealing Part of a Production Language Model
Thread (64 posts)
Reading §5.3: can you explain the difference between logits and logprobs and the relevance to the attack?
Thinking Summary
The user is asking about Section 5.3 of the paper, "Extraction Attack for Top-5 Logit Bias APIs". They want to understand the difference between logits and logprobs, and why this difference matters for the attack. Looking at the relevant section:
1. Logits are the raw outputs before the softmax function is applied
2. Logprobs are the log probabilities after the softmax function is applied
3. The issue is that when you apply a logit bias and then observe the logprob, you get an additional bias-dependent term that needs to be dealt with
Let me explain this clearly.
Great question! The distinction between logits and logprobs is crucial to understanding why the attack in Section 5.3 is more complex than the initial warm-up attack.
Logits vs Logprobs
Logits are the raw output values from the model before applying softmax:
- For token $i$: the logit $z_i$, an unnormalized real-valued score
Logprobs are the log of the probabilities after applying softmax:
- For token $i$: $y_i = \log p_i = z_i - \log \sum_j \exp(z_j)$
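To make the relationship concrete, here is a minimal numpy sketch (my own illustration, not code from the paper) converting a logit vector into logprobs:

```python
import numpy as np

def logprobs_from_logits(z: np.ndarray) -> np.ndarray:
    """Log-softmax: y_i = z_i - log(sum_j exp(z_j)), computed stably."""
    z = z - z.max()                      # shift by the max for numerical stability
    return z - np.log(np.exp(z).sum())   # subtract the log-normalizer

z = np.array([2.0, 1.0, -0.5])           # toy logits
y = logprobs_from_logits(z)
print(y, np.exp(y).sum())                 # logprobs; probabilities sum to 1
```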
Why This Matters for the Attack
The key challenge is that production APIs return logprobs, not logits. When you apply a logit bias $b_i$ to token $i$ and observe its logprob, you get:

$y_i^{(b)} = (z_i + b_i) - \log\Big( \sum_{j \neq i} \exp(z_j) + \exp(z_i + b_i) \Big)$

Notice the problem: the normalizing constant (the log term) changes depending on which tokens you bias. This means you can't simply subtract $b_i$ from $y_i^{(b)}$ to get $z_i$.
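A quick numerical check (again my own sketch, with toy logits) shows why naive subtraction fails:

```python
import numpy as np

def logprobs(z):
    z = z - z.max()
    return z - np.log(np.exp(z).sum())

z = np.array([2.0, 1.0, -0.5])   # hidden logits we want to recover
bias = 5.0

zb = z.copy()
zb[1] += bias                     # the API applies the bias before softmax
y_biased = logprobs(zb)

# Subtracting the bias does NOT recover the true logprob, because the
# log-normalizer itself moved when we added the bias:
print(y_biased[1] - bias)         # about -5.02
print(logprobs(z)[1])             # about -1.37
```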
The Attack's Solution
The paper presents two approaches:
1. Reference token method: Use one token as a fixed reference point across all queries. By comparing other tokens to this reference, you can cancel out the changing normalization constants (see the sketch after this list): both logprobs in a single response share the same normalizer, so $y_i^{(b)} - y_R^{(b)} = (z_i + b_i) - z_R$.
2. Linear constraint method (Appendix A): Treat each logprob observation as a linear constraint on the original logits, then solve the resulting system of equations.
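A minimal numpy sketch of the reference-token idea, under my own toy setup (it exposes full-vocabulary logprobs for simplicity, whereas the paper's setting only returns the top 5):

```python
import numpy as np

def api_logprobs(z, bias):
    """Toy stand-in for the API: log-softmax over (logits + bias)."""
    zb = z + bias
    zb = zb - zb.max()
    return zb - np.log(np.exp(zb).sum())

z = np.random.randn(8)            # hidden logits
R, B = 0, 10.0                    # reference token and a large bias

recovered = np.zeros_like(z)
for i in range(1, len(z)):
    bias = np.zeros_like(z)
    bias[i] = B                   # push token i up so it is observable
    y = api_logprobs(z, bias)
    # Shared normalizer cancels in the difference:
    # y[i] - y[R] = (z[i] + B) - z[R]
    recovered[i] = y[i] - y[R] - B   # = z[i] - z[R]

# Matches the hidden logits up to the common shift z[R]
assert np.allclose(recovered[1:], (z - z[R])[1:])
```

Note that the logits are only ever recoverable up to a common additive shift (softmax is shift-invariant), which is why everything is expressed relative to $z_R$.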
This distinction is why the attack requires multiple queries with carefully chosen logit biases—it's reconstructing the underlying logits from observations that have been transformed by the softmax's normalization.
What if I don't even have logprobs at all and can just get a bunch of next-token predictions?
Thinking Summary
The user is asking about the scenario where you only get the argmax token (the actual predicted token) without any logprobs at all. This is addressed in Section 6 and Appendix C of the paper, which discuss the "logprob-free attacks".
Excellent question! This is exactly what Section 6 and Appendix C address: the "logprob-free" attacks, where you only get the sampled token and no probability information.
The Basic Approach: Binary Search
The key insight is that with temperature 0 sampling, the API returns the token with the highest logit value. So if you can manipulate logit biases, you can use binary search:
- Start with a prompt and identify the top token (call it token 0, with logit $z_0$)
- For each other token $i$:
  - Binary search for the minimal bias $b_i$ needed to make token $i$ the top token
  - This tells you that $z_0 - z_i \approx b_i$ (see the sketch after this list)
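A minimal sketch of that binary search against a toy temperature-0 oracle of my own construction (not the paper's code):

```python
import numpy as np

def api_top_token(z, bias):
    """Toy temperature-0 API: returns the argmax of (logits + bias)."""
    return int(np.argmax(z + bias))

def logit_gap(z, i, lo=0.0, hi=50.0, eps=1e-3):
    """Binary-search the minimal bias that makes token i win,
    which pins down z_top - z_i to within eps."""
    while hi - lo > eps:
        mid = (lo + hi) / 2
        bias = np.zeros_like(z)
        bias[i] = mid
        if api_top_token(z, bias) == i:
            hi = mid              # bias was enough; shrink from above
        else:
            lo = mid              # not enough; shrink from below
    return (lo + hi) / 2

z = np.random.randn(8)
top = int(np.argmax(z))
i = (top + 1) % len(z)
print(logit_gap(z, i), z[top] - z[i])   # should agree to ~1e-3
```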
The Challenge
This is much more expensive than the logprob-based attacks:
- Basic binary search: ~10 queries per logit (vs. ~0.25 with logprob access)
- Each token requires about $\log_2(\Delta/\epsilon)$ queries to reach precision $\epsilon$ over a bias search range of width $\Delta$; e.g., $\Delta = 32$ and $\epsilon = 2^{-5}$ gives $\log_2(1024) = 10$ queries
Improved Methods
The paper presents optimizations that query multiple tokens simultaneously:
- Hyperrectangle method: Apply biases to multiple tokens at once and accumulate linear constraints based on which token wins
- One-of-n method: Carefully choose biases to make each token equally likely to be sampled
These reduce the cost to ~3.7 queries per logit, though still much higher than logprob-based methods.
Information-Theoretic Limits
The paper proves (Lemma 6) an information-theoretic lower bound: each response reveals at most $\log_2 l$ bits (it tells you which of the $l$ vocabulary tokens was sampled), so recovering all $l$ logits to $b$ bits of precision requires at least on the order of $l \cdot b / \log_2 l$ queries, where $l$ is the vocabulary size. Their best attack is within a factor of 2 of this theoretical limit!
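As a worked example with made-up numbers (neither $l$ nor $b$ here is taken from the paper), this is what the bound looks like:

```python
import math

l = 100_000                     # hypothetical vocabulary size
b = 30                          # hypothetical bits of precision per logit
bits_per_query = math.log2(l)   # each sampled token reveals at most ~16.6 bits
queries_per_logit = b / bits_per_query
print(f"lower bound ~ {queries_per_logit:.1f} queries per logit")  # ~1.8
```

Under these assumptions the bound is about 1.8 queries per logit; doubling it gives roughly 3.7, consistent with the claim above that the best attack sits within a factor of 2 of optimal.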
The key takeaway: Yes, you can extract the model even with just next-token predictions, but it's about 15x more expensive than with logprob access.
And what if I can't set logit biases either, just send prompts and read off the predicted token?
Thinking Summary
The user is asking about the case where they don't have access to logit biases at all, just the ability to get next-token predictions. This is the most restrictive case. Let me think about what the paper says about this. Looking through the paper, I don't see them explicitly addressing the case where you have NO logit bias capability at all. Their attacks fundamentally rely on being able to manipulate the logits through biases. Without logit biases, you would only be able to:
1. Send different prompts
2. Get the top token for each prompt
This would make it extremely difficult, if not impossible, to extract the model parameters using their approach, because:
- You can't manipulate which token becomes the top token
- You can't perform binary search
- You can't create the linear constraints they use
The paper does mention this in Section 8 (Future Work) under "Removing the logit bias assumption": "All our attacks require the ability to pass a logit bias. Model providers including Google and OpenAI provided this capability when we began the writing of this paper, but this could change." They suggest that other API parameters like unconstrained temperature and top-k parameters could potentially leak logit values through queries, but they don't develop this attack. So the honest answer is: the paper doesn't provide a method for extraction without logit bias access.
Great question! This exposes a fundamental limitation of the attack: the paper does not provide a method that works without logit bias access.
Why Logit Bias is Essential
All the attacks in the paper fundamentally rely on manipulating logits through biases to:
- Force specific tokens to become the top token
- Create controlled comparisons between tokens
- Extract relative logit differences
Without this capability, you're left with only:
- Sending different prompts
- Observing which token naturally wins for each prompt
What the Authors Say
In Section 8 (Future Work), they acknowledge this limitation:
"All our attacks require the ability to pass a logit bias. Model providers including Google and OpenAI provided this capability when we began the writing of this paper, but this could change."
They speculate that other API parameters might leak information:
- Temperature: Could potentially reveal logit gaps through sampling distributions
- Top-k parameters: Might leak information about relative token rankings
But they explicitly state they haven't developed these attacks.
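Just to make the temperature speculation concrete (my own sketch, not an attack developed in the paper): at temperature 1, $p_i / p_j = \exp(z_i - z_j)$, so log-ratios of empirical sampling frequencies estimate logit gaps, albeit noisily and at a high query cost:

```python
import numpy as np

rng = np.random.default_rng(0)
z = np.array([2.0, 1.0, 0.0, -1.0])      # hidden logits
p = np.exp(z) / np.exp(z).sum()          # temperature-1 sampling distribution

# Pretend each API call samples one next token from p:
samples = rng.choice(len(z), size=200_000, p=p)
counts = np.bincount(samples, minlength=len(z))

# Log of a frequency ratio estimates the corresponding logit difference:
gap_estimate = np.log(counts[0] / counts[1])
print(gap_estimate, z[0] - z[1])          # ~1.0, up to sampling noise
```

Note how many samples this toy needs for even one gap, which is consistent with the point below that purely statistical approaches would require vastly more queries.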
Why It's So Hard
Without logit bias, you lose the ability to:
- Perform binary search (can't force tokens to win)
- Create the linear constraint systems they use
- Make controlled measurements of logit differences
You'd essentially be limited to statistical analysis of natural token distributions across many prompts, which would likely require vastly more queries and might not even recover precise weights.
Bottom line: The attack requires logit bias. Without it, you'd need fundamentally different techniques that the paper doesn't provide.