Stealing Part of a Production Language Model
Thread (64 posts)
Reading §5.3, can you explain the difference between logits and logprobs and its relevance to the attack?
Thinking Summary
The user is asking about Section 5.3 of the paper, "Extraction Attack for Top-5 Logit Bias APIs". They want to understand the difference between logits and logprobs, and why this difference matters for the attack. Looking at the relevant section:
1. Logits are the raw outputs before the softmax function is applied.
2. Logprobs are the log probabilities after the softmax function is applied.
3. The issue is that when you apply a logit bias and then observe the logprob, you get an additional bias-dependent term that needs to be dealt with.
Let me explain this clearly.
Great question! The distinction between logits and logprobs is crucial to understanding why the attack in Section 5.3 is more complex than the initial warm-up attack.
Logits vs Logprobs
Logits are the raw output values from the model before applying softmax:
- For token $i$: the logit $z_i \in \mathbb{R}$, an unbounded, unnormalized score.
Logprobs are the log of the probabilities after applying softmax:
- For token $i$: $y_i = \log p_i = z_i - \log \sum_j \exp(z_j)$.
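As a quick illustration (a minimal NumPy sketch, not from the paper), here is that relationship computed directly:

```python
import numpy as np

z = np.array([2.0, 1.0, -0.5, 0.3])      # logits: raw, unnormalized scores
p = np.exp(z) / np.exp(z).sum()          # softmax -> probabilities
y = z - np.log(np.exp(z).sum())          # logprobs: z_i minus the log normalizer

print(np.allclose(y, np.log(p)))         # True: y_i = log p_i
```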
Why This Matters for the Attack
The key challenge is that production APIs return logprobs, not logits. When you apply a logit bias $B$ to token $i$ and observe its logprob, you get:

$$y_i^{(B)} = (z_i + B) - \log\!\Big(\exp(z_i + B) + \sum_{j \neq i} \exp(z_j)\Big)$$

Notice the problem: the normalizing constant (the log-sum-exp term) changes depending on which tokens you bias. This means you can't simply subtract $B$ from $y_i^{(B)}$ to recover $z_i$.
The Attack's Solution
The paper presents two approaches:
- Reference token method (sketched in code below): Use one token as a fixed reference point across all queries. Because the biased token and the reference token share the same normalization constant within a single query, subtracting their logprobs cancels it:

$$y_i^{(B)} - y_{\text{ref}}^{(B)} = (z_i + B) - z_{\text{ref}} \quad\Rightarrow\quad z_i - z_{\text{ref}} = y_i^{(B)} - y_{\text{ref}}^{(B)} - B$$

- Linear constraint method (Appendix A): Treat each logprob observation as a linear constraint on the original logits, then solve the resulting system of equations.
This distinction is why the attack requires multiple queries with carefully chosen logit biases—it's reconstructing the underlying logits from observations that have been transformed by the softmax's normalization.
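To make the reference-token cancellation concrete, here is a toy simulation (a NumPy sketch; `query_logprobs` is a hypothetical stand-in for the API, not the paper's actual code):

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(size=10)                  # hidden "true" logits we want to recover

def query_logprobs(bias):
    """Hypothetical API stand-in: returns logprobs after adding a logit-bias vector."""
    zb = z + bias
    return zb - np.log(np.exp(zb).sum())

B, ref = 5.0, 0                          # large bias, token 0 as the fixed reference
recovered = np.zeros_like(z)
for i in range(1, len(z)):
    bias = np.zeros_like(z)
    bias[i] = B                          # promote token i into the observable top-k
    y = query_logprobs(bias)
    # y[i] and y[ref] share the same normalizer, so it cancels in the difference:
    recovered[i] = (y[i] - y[ref]) - B   # = z_i - z_ref

print(np.allclose(recovered, z - z[ref]))   # True: logits recovered up to the reference
```

In the real attack the top-5 restriction and finite precision complicate this, but the cancellation above is the core mechanism.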
can you explain more about the role and effects of softmax from the perspective of information theory?
Thinking Summary
The user is asking about the softmax function from an information-theoretic perspective. This is a great question that goes deeper into why the attack is challenging and what information is preserved/lost through the softmax transformation. Key points to cover:
1. Softmax as a normalization that creates a probability distribution
2. Information preservation vs transformation
3. Invariance to additive constants
4. The role of temperature in information theory
5. Connection to maximum entropy principle
6. Why this makes the attack harder
Excellent question! Let's examine softmax through an information-theoretic lens to understand why it makes the attack challenging.
Softmax as Information Transformation
The softmax function transforms logits into a probability distribution:

$$p_i = \mathrm{softmax}(z)_i = \frac{\exp(z_i)}{\sum_j \exp(z_j)}$$
From an information theory perspective, softmax has several key properties:
1. Invariance to Translation
Softmax is invariant to adding a constant to all logits:

$$\mathrm{softmax}(z + c\mathbf{1}) = \mathrm{softmax}(z) \quad \text{for any } c \in \mathbb{R}$$
This means softmax destroys one degree of freedom—you lose information about the absolute scale of the logits. You can only recover relative differences between logits, not their absolute values.
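A short check of this invariance (NumPy assumed):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())              # subtracting the max is itself allowed by invariance
    return e / e.sum()

z = np.array([1.0, -2.0, 0.5])
print(np.allclose(softmax(z), softmax(z + 7.3)))   # True: a global shift is invisible
```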
2. Maximum Entropy Principle
Softmax arises naturally from the maximum entropy principle. Given constraints on expected values, the distribution that maximizes entropy has the exponential (Gibbs) form:

$$p_i \propto \exp\Big(\sum_k \lambda_k f_k(i)\Big)$$
This connects to why language models use softmax: it's the least biased way to convert scores into probabilities.
3. Information Compression
The log-sum-exp operation in the denominator acts as a "soft maximum":

$$\max_j z_j \;\le\; \log \sum_j \exp(z_j) \;\le\; \max_j z_j + \log n$$
This compresses information about the full distribution into a single normalizing constant, making it harder to recover individual logits.
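Numerically (a small illustrative sketch), the bound above shows how tightly log-sum-exp hugs the maximum:

```python
import numpy as np

z = np.array([4.0, 1.0, 0.5, -2.0])
lse = np.log(np.exp(z).sum())
print(z.max(), lse, z.max() + np.log(len(z)))   # 4.0 <= ~4.08 <= ~5.39
```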
Temperature and Information
Consider softmax with temperature $T$:

$$p_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}$$

- As $T \to 0$: approaches argmax (maximum information about the top token, none about the others)
- As $T \to \infty$: approaches the uniform distribution (no information preserved)
- $T = 1$: balances between preserving ordering information and relative magnitudes
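A quick sweep (NumPy sketch) showing the two limits:

```python
import numpy as np

def softmax_T(z, T):
    e = np.exp((z - z.max()) / T)
    return e / e.sum()

z = np.array([2.0, 1.0, 0.0])
for T in (0.05, 1.0, 20.0):
    print(T, np.round(softmax_T(z, T), 3))
# T=0.05 -> nearly one-hot on the argmax; T=1 -> graded; T=20 -> nearly uniform
```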
Why This Matters for the Attack
The attack must overcome several information-theoretic challenges:
- Lost Degree of Freedom: The attack can only recover $n-1$ independent values from the $n$ logprobs, due to the normalization constraint $\sum_i p_i = 1$.
- Nonlinear Coupling: Each logprob depends on ALL logits through the normalization (a finite-difference check follows this list):

$$\frac{\partial y_j}{\partial z_i} = \begin{cases} 1 - p_i & \text{if } i = j \\ -p_i & \text{if } i \neq j \end{cases}$$

- Information Hiding: When you bias specific tokens, you change the effective "temperature" for those tokens, altering how information is distributed across the output.
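A finite-difference check of that Jacobian (a NumPy sketch, using the notation above): every logprob shifts when any single logit is perturbed.

```python
import numpy as np

def logprobs(z):
    return z - np.log(np.exp(z).sum())

z = np.array([1.0, 0.2, -0.7, 0.5])
p = np.exp(logprobs(z))
eps = 1e-6

# numerical Jacobian: row i holds d y_j / d z_i for all j
J = np.array([(logprobs(z + eps * np.eye(len(z))[i]) - logprobs(z)) / eps
              for i in range(len(z))])

print(np.allclose(J, np.eye(len(z)) - p[:, None], atol=1e-4))   # True: delta_ij - p_i
```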
The Attack's Information Recovery
The clever insight is that by strategically choosing which tokens to bias, you can create multiple "views" of the same underlying logit vector, each with different normalization constants. With enough views, you can solve for the original logits (up to translation).
This is why the attack needs:
- Multiple queries (to get different normalizations)
- A reference token (to fix the translation ambiguity)
- Careful bias selection (to ensure numerical stability)
The attack essentially exploits the fact that while softmax destroys some information, it preserves enough structure that the original logits can be reconstructed through careful measurement of how the outputs change under different biases.
not interested in the bias part yet just thinking deeply and carefully about the effects of softmax. assume the bias part doesn’t exist. not even thinking particularly about this attack.
How do I understand the relationship between logprobs and logits from the pov of information theory? please think very carefully and distill a high-signal answer
Thinking Summary
The user wants a deep information-theoretic understanding of the logit→logprob transformation via softmax, without considering the attack or bias. They want me to think carefully and provide a high-signal answer about what information is preserved and what is lost. Let me think about this carefully:
1. The transformation is: logits z → softmax → probabilities p → log → logprobs y, i.e. z_i → p_i = exp(z_i) / Σ exp(z_j) → y_i = log(p_i) = z_i - log(Σ exp(z_j)).
2. Information-theoretic properties:
   - Dimensionality: We go from n logits to n logprobs, but there's a constraint (probabilities sum to 1).
   - Invariances: Translation invariance means we lose one degree of freedom.
   - What's preserved: All pairwise differences z_i - z_j are preserved exactly.
   - What's lost: The absolute scale/location of the logits.
3. The key insight is that this is a projection onto a constraint manifold. We're projecting from R^n to an (n-1)-dimensional manifold.
4. From an information geometry perspective, softmax maps from the natural parameter space to the mean parameter space of the categorical distribution.
5. The sufficient statistics perspective: For a categorical distribution, the logprobs ARE the sufficient statistics - they contain all the information about the distribution that matters.
Let me structure this more clearly...
The Logit→Logprob Transform: An Information-Theoretic View
The transformation from logits to logprobs via softmax is fundamentally about moving between dual representations of the same information.
The Core Transform

$$y_i = \log p_i = z_i - \log \sum_j \exp(z_j)$$
What's Preserved Exactly
All pairwise differences between logits are preserved:

$$y_i - y_j = \Big(z_i - \log\textstyle\sum_k e^{z_k}\Big) - \Big(z_j - \log\textstyle\sum_k e^{z_k}\Big) = z_i - z_j$$
This means the logprobs contain complete information about the relative structure of the logits—all orderings, all relative magnitudes.
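A direct check (NumPy sketch) that the full matrix of pairwise differences survives the transform:

```python
import numpy as np

z = np.array([3.0, -1.0, 0.4, 2.2])
y = z - np.log(np.exp(z).sum())          # logprobs

# the matrix of pairwise differences is identical before and after the transform
print(np.allclose(y[:, None] - y[None, :], z[:, None] - z[None, :]))   # True
```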
What's Lost
One degree of freedom: the absolute position. You can shift all logits by any constant and get identical logprobs. Mathematically, the transform has a null space spanned by the all-ones vector, $\{c\mathbf{1} : c \in \mathbb{R}\}$.
The Deep Structure
From information geometry, this is the natural duality between:
- Natural parameters (logits): The canonical parameterization of the exponential family
- Mean parameters (probabilities): The expectation parameters
- Log-mean parameters (logprobs): Just the log of mean parameters
The softmax is the link function that maps between these dual spaces.
Information-Theoretic Insight
The key insight: logprobs are sufficient statistics.
For any inference task involving a categorical distribution, the logprobs contain all the information you need. The lost degree of freedom (absolute position) is statistically meaningless—it doesn't affect any likelihood, any prediction, any Bayesian update.
The Manifold Perspective
The transform projects from $\mathbb{R}^n$ onto an $(n-1)$-dimensional manifold, specifically the log-probability simplex:

$$\mathcal{Y} = \Big\{ y \in \mathbb{R}^n : \sum_{i=1}^{n} e^{y_i} = 1 \Big\}$$

This manifold has intrinsic dimension $n-1$, reflecting the one constraint.
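A short verification (NumPy sketch) that logprobs land on this constraint surface, and that the lost coordinate is exactly the global shift:

```python
import numpy as np

z = np.random.default_rng(1).normal(size=6)
y = z - np.log(np.exp(z).sum())

print(np.isclose(np.exp(y).sum(), 1.0))                              # on the simplex
print(np.allclose(y, (z + 4.2) - np.log(np.exp(z + 4.2).sum())))     # the shift is invisible
```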
Why This Matters
The logit→logprob transform is informationally lossless for all practical purposes. What's "lost" (the absolute position) was never meaningful to begin with—it's gauge freedom, like the choice of ground in an electrical circuit or the zero point of gravitational potential energy.
The transform is better understood not as losing information, but as factoring out symmetry—moving from an overparameterized space to the natural coordinate system for probability distributions.