Numinex
← Back to all posts

Thread (26 posts)

One thing I’m not totally following from this paper is that they mention making essential use of the fact that the ReLU activation function has a discontinuous derivative, so they can identify the precise points where it turns on or off. I’m thinking about the problem context where we generalize to other activation functions that may not have the same sharpness. In that case there’s a challenge because it would be more difficult to discern exactly where the boundary is and hence to recover the parameters. However, even in the ReLU case there is some noise, correct, and some statistical methods are used to sample enough points to get an estimate of the parameters. Could this generalize to smoother activation functions?

replyquoteparent
Thinking Summary

**Exploring activation functions**
I need to address the point about the ReLU activation function’s discontinuous derivative, which allows for identifying specific activation points. If we generalize to other activation functions that lack this sharpness, it complicates discerning boundaries and parameters. There’s some noise in the ReLU case, and they use statistical methods for estimation. I'm curious if these approaches could apply to smoother functions and how piecewise linear functions like leaky ReLU or hard sigmoid might fit into this discussion.

**Analyzing activation functions**
When considering smooth activation functions, approximating second derivatives and inflection points can be quite challenging. For piecewise-linear functions like ReLU and leaky ReLU, critical points remain clear. However, for smooth activations like sigmoid and tanh, parameters become harder to extract due to continuous slope changes. I’d like to explore the implications of querying approximated derivatives and how the logistic function's properties could help recover weight vectors, but we face stability issues and noise. Let's highlight the distinction in detecting activation boundaries between these function types.

**Exploring methods for activation functions**
There are potential methods to explore, like treating the network as a polynomial composition or using symbolic regression to approximate it. Adding a small dither to the input can help approximate the Jacobian, though this could be costly. For ReLU, boundaries behave like measure-zero hyperplanes detectable through derivative sign changes. With smooth functions, identifying specific boundaries becomes difficult. However, approximating regions with small outputs is possible, but it complicates disentangling weight signs. I need to consider how piecewise-linear approximations might help, or if a gradient oracle is necessary.

Why ReLU is “special”

  1. Piece-wise linearity
    For a ReLU neuron
    $g(u)=\max(0,u)=\begin{cases}0 & u<0\\ u & u\ge 0\end{cases}$

    the input space is split by a single hyper-plane
    $H:=\{x\mid w^{\top}x+b=0\}$.

    • On each side of H the neuron is perfectly linear.
    • Crossing H flips the Jacobian contribution of that neuron from 0 to $w^{\top}(\cdot)$.
    • Along any straight line $\mu(\lambda)=x_1+\lambda(x_2-x_1)$ the global network output $f^*(\lambda)=f(\mu(\lambda))$ is therefore piece-wise linear: its first derivative has a jump exactly when $\mu(\lambda)$ hits a hyper-plane belonging to one ReLU.

    Detecting that jump with two finite differences is what lets Canales-Martínez et al. (and Carlini-Jagielski-Mironov 2020) pinpoint a critical point and say: “between $\lambda_1$ and $\lambda_2$ one and only one ReLU flipped – recover its weight/sign now”.
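    For concreteness, here is a minimal sketch (in Python) of that finite-difference trick for a single segment. It assumes the black-box network `f`, restricted to the segment and read through one scalar output coordinate, is piecewise linear with at most one slope change; the function name, step sizes, and tolerances are illustrative, not the paper's actual implementation:

```python
def find_kink(f, x1, x2, eps=1e-4, tol=1e-6):
    """Locate the critical point on the segment x1 -> x2, assuming the
    (scalar) output of the black-box network f is piecewise linear along
    the segment with at most one slope change.  Returns lambda* in (0,1),
    or None if no neuron flips on this segment."""
    g = lambda lam: f(x1 + lam * (x2 - x1))

    # finite-difference slopes of the two linear pieces at the endpoints
    s0 = (g(eps) - g(0.0)) / eps
    s1 = (g(1.0) - g(1.0 - eps)) / eps
    if abs(s0 - s1) < tol:
        return None                              # the derivative never jumps here

    # extend both linear pieces and intersect them:
    #   g(0) + s0*lam  =  g(1) + s1*(lam - 1)
    lam_star = (g(1.0) - g(0.0) - s1) / (s0 - s1)

    # consistency check: the predicted kink really lies on the graph of g
    # (this fails if more than one ReLU flipped between x1 and x2)
    predicted = g(0.0) + s0 * lam_star
    assert abs(g(lam_star) - predicted) < 1e-6
    return lam_star                              # critical input: x1 + lam_star*(x2 - x1)
```

    The real attacks additionally subdivide segments so that each one contains a single flip before applying this kind of extrapolate-and-check step.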

  2. Algebraic simplifications
    Once the layer is known to be linear on both sides of a critical point, solving for weights reduces to solving linear (or at worst bilinear) systems. That is why the whole attack ultimately becomes polynomial.

What if the activation is still piece-wise linear but not ‘hard’?

• Leaky-ReLU, PReLU, hard-tanh, hard-sigmoid, ReLU6, etc.
These all keep the same “kink” – the derivative is discontinuous (the slope changes from $a$ to $b$).
The attack generalises almost verbatim: in place of “slope = 0 vs slope = 1” you test “slope = a vs slope = b”, which is even easier because both sides remain informative.

• ‘ReLU with noise’ (e.g. dropout, quantisation error).
As long as the measurement noise is bounded and you can average a few repeated queries, the jump in the first derivative is still categorical; the paper already uses 200 samples per neuron in the “neuron-wiggle” stage exactly for this purpose.

Truly smooth activations (sigmoid, tanh, GELU, Swish)

Here the derivative is continuous, so the clean “one-bit jump” disappears:

  1. No unique boundary:
    A sigmoid neuron transitions gradually over an $O(\|w\|^{-1})$-wide slab instead of a zero-measure hyper-plane. Two neighbouring neurons’ transition regions easily overlap, so following a line $\mu(\lambda)$ you see a sum of many smooth bumps; separating them is a de-convolution problem, not a simple change-point detection.

  2. Ambiguous scaling:
    For ReLU you can absorb any positive scalar into the next layer, but negative scalars are forbidden by the sign of the kink. With sigmoids the function $\sigma(\alpha u)$ for $\alpha>0$ is still monotone and can be re-absorbed by re-scaling the outgoing weights, so recovering the exact scale of $w$ (and its sign) from black-box queries alone becomes ill-posed.

  3. Higher-order derivatives help only with a gradient oracle:
    If the attacker could query $\nabla f(x)$ exactly, one could in principle solve non-linear systems that involve $\sigma'(\cdot)$ in place of the ReLU mask. In the pure prediction-only model you would have to estimate those gradients with finite differences; the error grows as $O(\varepsilon^{-1})$ for first derivatives and $O(\varepsilon^{-2})$ for second, so the number of queries blows up quickly.
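    To make that blow-up concrete, here is a toy Monte-Carlo check (not from the paper) of how finite-difference gradient estimates amplify bounded oracle noise by roughly a factor of $1/\varepsilon$; the function, noise level, and constants are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_f(x, sigma=1e-4):
    """Stand-in for a prediction oracle with bounded measurement noise
    (a hypothetical smooth scalar function; sigma is the noise level)."""
    return np.tanh(x @ np.array([0.3, -0.7, 0.5])) + sigma * rng.normal()

def fd_gradient(f, x, eps):
    """Central finite-difference estimate of the gradient; the query noise
    is amplified by a factor of order 1/eps in each coordinate."""
    grad = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = eps
        grad[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return grad

x = np.array([0.1, 0.2, -0.3])
for eps in (1e-1, 1e-2, 1e-3, 1e-4):
    ests = np.array([fd_gradient(noisy_f, x, eps) for _ in range(50)])
    print(f"eps={eps:.0e}  std of gradient estimate ≈ {ests.std(axis=0).mean():.3e}")
```

    The spread of the estimates grows as the step size shrinks, which is why prediction-only attacks on smooth activations need so many more queries per recovered quantity.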

Known theoretical results mirror this: with sigmoids, even deciding consistency of a 2-layer 3-neuron network given labeled samples is NP-complete (Blum–Rivest 1993). Allowing chosen queries helps, but no polynomial-time black-box extraction algorithm is currently known.

Practical take-away

• Piece-wise-linear (kinky) activations ⇒ the Crypto ’20 / TII ’23 extraction works.

• Mild modification (leaky-ReLU, ReLU6, hard-sigmoid, etc.) ⇒ same attack with minor slope accounting.

• Smooth activations ⇒ sharp “critical-point” trick disappears; you would have to shift to optimisation-based model stealing (train a surrogate on many query/response pairs) or to side-channel / gradient-oracle attacks.

So the “sharpness” is not a small implementation detail—it is the linchpin that turns a messy non-linear identification problem into a clean linear-algebra one.

replyquoteparent

Thinking Summary

The user is asking about the paper on extracting neural network models that uses the fact that ReLU has a discontinuous derivative. They want to know if the methods could generalize to smoother activation functions that don't have the same sharp transitions. This is a good question. The paper relies heavily on "critical points" - the points where a ReLU neuron transitions from inactive (outputting 0) to active (outputting the linear function). At these points, there's a sharp change in the derivative. Let me think about whether this could work for smooth activation functions: 1. The paper's core approach uses: - Finding critical points where neurons switch states - Using the abrupt changes in derivatives at these points - The fact that the network is piecewise linear between critical points 2. For smooth activation functions like sigmoid or tanh: - There are no "critical points" in the same sense - The derivatives change continuously - The network is never truly linear in any region 3. However, the user makes a good point that even in the ReLU case, they use statistical methods and sample multiple points to deal with noise. So could we adapt the approach? Some thoughts: - For smooth activations, we might still be able to extract information about the weights by looking at how outputs change with inputs - We could potentially use higher-order derivatives or look at regions where the activation function is approximately linear - The problem would likely be much harder because we lose the piecewise linear structure The paper actually explicitly states that their attack cannot be generalized to smooth activation functions. Let me find that quote... Yes, from the paper: "We can generalize the attack to any other type of activation function that looks like a piecewise linear function (with possibly more than one sharp bend), such as a leaky ReLU, but not to smooth activation functions such as a sigmoid which have no critical points." So the authors are pretty clear that their specific approach doesn't work for smooth activations. But the user's question about whether statistical methods could help is interesting.

This is an excellent observation about the paper's methodology. You're right that the authors make essential use of ReLU's discontinuous derivative at the critical points where neurons switch from inactive to active states. The paper explicitly states they cannot generalize to smooth activation functions:

"We can generalize the attack to any other type of activation function that looks like a piecewise linear function (with possibly more than one sharp bend), such as a leaky ReLU, but not to smooth activation functions such as a sigmoid which have no critical points."

However, your insight about the statistical nature of their methods is intriguing. Even for ReLUs, they:

  1. Use the "Neuron Wiggle" technique which samples ~200 critical points per neuron
  2. Employ confidence levels (α) to handle uncertainty
  3. Deal with finite precision arithmetic that introduces noise

For smooth activation functions like sigmoid or tanh, while we lose the sharp critical points, we might consider:

Potential approaches:

  • Gradient-based methods: Since smooth functions have well-defined gradients everywhere, we could potentially use gradient information more directly
  • Linearization regions: Even smooth activations have regions where they're approximately linear (e.g., sigmoid near x=0)
  • Higher-order derivatives: The curvature information might reveal network parameters

Challenges:

  • Without critical points, identifying individual neuron behaviors becomes much harder
  • The piecewise linear structure that enables layer-by-layer "peeling" disappears
  • The signal-to-noise ratio would likely be much worse

The statistical sampling approach might help, but you'd probably need a fundamentally different algorithm architecture rather than adapting their critical-point-based method. It's an interesting research direction though - perhaps combining ideas from this paper with other model extraction techniques designed for smooth activations.

replyquoteparent
Thinking Summary

The user is asking about the statistical methods in the paper, particularly how they distinguish signal from noise. Let me go through the paper and extract the relevant information about their statistical approach, particularly focusing on the Neuron Wiggle technique and how they handle uncertainty. Key sections to look at: 1. The Neuron Wiggle Sign-Recovery method (section 4.2) 2. The signal-to-noise analysis 3. The confidence level system 4. Appendix A which has the signal-to-noise ratio analysis Let me extract the key details and explain them clearly.

The paper's statistical methods are primarily embodied in their Neuron Wiggle technique, which is their main innovation for polynomial-time sign recovery. Let me break down how they distinguish signal from noise:

The Core Problem

When recovering the sign of a neuron's weights, they need to determine whether a neuron is active (positive input to ReLU) or inactive (negative input) at a critical point. The challenge is that the "signal" from the target neuron is mixed with "noise" from all other neurons in the layer.

Signal vs Noise Definition

For a target neuron $j$ in layer $i$:

  • Signal: The contribution $c_j e_j$, where $e_j$ is the change in neuron $j$'s value and $c_j$ is its output coefficient
  • Noise: The sum $\sum_{k \in I \setminus \{j\}} c_k e_k$ from all other active neurons

The key equation they analyze is: $f(x^* + \Delta) - f(x^*) = c_j e_j + \sum_{k \in I \setminus \{j\}} c_k e_k$

Statistical Decision Making

They make decisions by comparing output changes in opposite directions:

  • $L = f(x^* - \Delta) - f(x^*)$ (left difference)
  • $R = f(x^* + \Delta) - f(x^*)$ (right difference)

If $|L| > |R|$, they decide the neuron has negative sign; otherwise positive.

Confidence Level System

The paper introduces a confidence level $\alpha$ calculated as:

  • $\alpha = s_-/s$ for negative sign decisions (where $s_-$ is the number of votes for negative)
  • $\alpha = s_+/s$ for positive sign decisions (where $s_+$ is the number of votes for positive)

They use an adaptive threshold $\alpha_0$ (typically ~0.7–0.8) and reanalyze neurons with $\alpha \leq \alpha_0$.
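A minimal sketch of this vote-and-confidence bookkeeping, with a made-up interface: in the actual attack each `(f_minus, f_center, f_plus)` triple comes from a fresh critical point and wiggle for the target neuron, and the values below are scalars (one output coordinate, or a norm of the output difference):

```python
def recover_sign(queries, alpha0=0.75):
    """Majority vote over many critical points, in the spirit of the
    neuron-wiggle stage (a sketch; `queries` yields triples
    (f(x*-Δ), f(x*), f(x*+Δ)) for ~200 critical points x*)."""
    s_minus = s_plus = 0
    for f_minus, f_center, f_plus in queries:
        L = abs(f_minus - f_center)        # |f(x*-Δ) - f(x*)|
        R = abs(f_plus - f_center)         # |f(x*+Δ) - f(x*)|
        if L > R:
            s_minus += 1                   # vote: negative sign
        else:
            s_plus += 1                    # vote: positive sign
    s = s_minus + s_plus
    sign = -1 if s_minus > s_plus else +1
    alpha = max(s_minus, s_plus) / s       # confidence of the decision
    return sign, alpha, alpha > alpha0     # low-confidence neurons get reanalyzed
```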

Output Randomization

The key insight is that different input points $x$ produce different output coefficient matrices $G^{(i+1)}$. This “output randomization” means that even if one critical point gives a bad signal-to-noise ratio, sampling multiple points (they use 200) provides the statistical power to make correct decisions.

Signal-to-Noise Analysis (Appendix A)

For unitary balanced networks, they show:

  • The wiggle in the target neuron direction produces a change of magnitude 1
  • Other neurons see changes of approximately $\pm 1/\sqrt{n}$ (due to near-orthogonality in high dimensions)
  • After the ReLUs, the expected norm when the target is active: $\sqrt{1.5} \approx 1.22$
  • Expected norm when the target is inactive: $\sqrt{0.5} \approx 0.71$

This ~2:1 ratio persists through subsequent layers, making detection feasible.
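A quick toy simulation (not from the paper) of this heuristic: model the other $n-1$ neurons as seeing $\pm 1/\sqrt{n}$ wiggles, each passing its ReLU with probability 1/2, and compare the resulting norms with and without the target's contribution of 1:

```python
import numpy as np

rng = np.random.default_rng(1)
n, trials = 256, 10_000
active_norms, inactive_norms = [], []

for _ in range(trials):
    # other n-1 (nearly orthogonal) neurons see ~ +/- 1/sqrt(n);
    # each one is active (passes its ReLU) with probability 1/2
    others = rng.choice([-1.0, 1.0], size=n - 1) / np.sqrt(n)
    mask = rng.random(n - 1) < 0.5
    noise_sq = np.sum((others * mask) ** 2)
    active_norms.append(np.sqrt(1.0 + noise_sq))   # target ReLU active: wiggle of size 1 passes
    inactive_norms.append(np.sqrt(noise_sq))       # target ReLU inactive: contribution blocked

print(np.mean(active_norms), np.sqrt(1.5))   # both ≈ 1.22
print(np.mean(inactive_norms), np.sqrt(0.5)) # both ≈ 0.71
```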

Practical Results

From their CIFAR10 experiments:

  • 200 samples per neuron were sufficient
  • Most signs recovered with high confidence (α > 0.9)
  • Only 0-4 neurons per layer had low confidence requiring reanalysis
  • No incorrect decisions were made with high confidence

Key Statistical Properties

  1. High-dimensional blessing: In high dimensions, random vectors tend to be nearly orthogonal, improving signal-to-noise ratio
  2. Multiple sampling: Reduces impact of occasional "bad" coefficient configurations
  3. Adaptive thresholding: Identifies potentially incorrect decisions for reanalysis
  4. Robustness: The method gracefully degrades as degrees of freedom decrease in deeper layers

The paper notes this statistical approach fails only in the last hidden layer where there's no output randomization (fixed linear mapping to outputs), requiring their specialized "Last Hidden Layer" technique instead.

replyquoteparent

Can you carefully analyze the writing in this paper, not with a view to its object level content, but to try to name what it is about its style that marks it as a “paper written by cryptographers” rather than “paper written by ML people”? Look between the lines for cultural signifiers.

replyquoteparent
Thinking Summary

This is an interesting question - the user wants me to analyze the writing style and cultural signifiers that mark this as a paper written by cryptographers rather than ML researchers. Let me think through various aspects: 1. **Mathematical rigor and formalization**: The paper includes many formal definitions (Definition 1-10) and precise mathematical notation. This is more common in crypto papers than ML papers, which often prioritize empirical results. 2. **Adversarial mindset**: The entire framing is about "attacking" neural networks, using terms like "differential attack," "cryptanalytic extraction," etc. This adversarial perspective is quintessentially cryptographic. 3. **Complexity analysis focus**: The emphasis on polynomial vs exponential time complexity, counting oracle queries, etc. is very crypto-oriented. ML papers usually care more about practical runtime than asymptotic complexity. 4. **Precision about assumptions**: The paper carefully lists all assumptions (infinite precision arithmetic, full-domain inputs, etc.) which is typical of crypto papers that need to be precise about security models. 5. **Historical context**: References to early 90s work by cryptographers and mathematicians, positioning this as a long-standing crypto problem. 6. **Language choices**: Terms like "black-box," "oracle," "queries," "extraction," "differential cryptanalysis" are all from the crypto lexicon. 7. **Theoretical completeness**: The focus on getting "arbitrarily high precision" and extracting "all parameters" rather than just good approximations reflects crypto's emphasis on complete breaks rather than ML's tolerance for approximation. 8. **Careful qualification of results**: The discussion of "unlucky events" and careful caveats about when the attack might fail reflects crypto's culture of precise security claims. Let me structure this analysis more systematically...

Looking at this paper's style and cultural markers, several elements immediately identify it as cryptographic rather than machine learning work:

Language and Framing

The adversarial framing permeates everything - neural networks aren't "analyzed" or "understood," they're attacked. The authors speak of "cryptanalytic extraction," "oracle queries," and "black-box implementations." This combative vocabulary treats the neural network as an adversary to be defeated rather than a system to be studied.

The term "polynomial time" appears in the title itself - a dead giveaway. ML researchers care about wall-clock time; cryptographers care about asymptotic complexity classes.

Mathematical Culture

The paper exhibits the cryptographer's compulsion for complete formal specification. Definitions 1-10 lay out precise mathematical frameworks before any actual work begins. Compare this to ML papers that often jump straight to architecture descriptions and experimental setup.

The assumptions section (3.2) reads like a security model specification: "Full-domain inputs," "Complete outputs," "Fully precise computations." ML papers rarely enumerate their assumptions so explicitly - they're more likely to mention them informally as needed.

Perfectionism vs. Pragmatism

Cryptographers want to extract parameters with "arbitrarily high precision" and recover "all the real-valued parameters." ML researchers would be thrilled with 90% accuracy. This reflects crypto's binary security model (broken/not broken) versus ML's continuous performance metrics.

The careful discussion of edge cases ("unlucky events," "neurons whose values before the ReLU almost never change sign") shows crypto's obsession with completeness. An ML paper would likely ignore such corner cases as practically irrelevant.

Historical Positioning

The introduction reaches back to "cryptographers and mathematicians in the early nineties" - positioning this as a theoretical problem with deep roots. ML papers typically cite work from the last 5 years and position themselves within current benchmarks.

Complexity Focus

The exponential-to-polynomial improvement (from 2^256 to 30 minutes) is presented as the headline result. An ML paper would lead with accuracy metrics or benchmark performance. The detailed query complexity analysis (counting exact oracle calls) is quintessentially cryptographic.

Writing Style

The prose is dense, formal, and assumes mathematical sophistication. Sentences like "The space of control is the vector space containing all possible small changes at the input to layer i" would be simplified in an ML paper to something like "We can only change the input in certain directions."

The paper uses "we" in the mathematical sense ("we thus have to qualify our assertion") rather than the empirical sense ("we trained the network for 100 epochs") common in ML work.

Cultural Blind Spots

Tellingly, the actual machine learning aspects are almost afterthoughts. The CIFAR10 model description is relegated to a brief section, training details are minimal, and there's an apologetic note that "better test accuracies are achieved using more advanced neural network architectures." A cryptographer's interest in the neural network extends only as far as necessary to break it.

The paper treats neural networks as mathematical objects to be reverse-engineered rather than as tools for solving problems - a fundamentally different relationship to the technology than ML researchers have.

replyquoteparent

In the paper there’s an intuition building figure showing a 2D input space and plotting a function along a line interpolating two input points. The function is real-valued so it can be plotted in 2D and put in a chart. That chart has sharp corners corresponding to the patches of the activation functions.

But in a real system the input and output are high-dimensional vectors, correct? How does the method in the paper find the corners of a function whose output is high-dimensional?

replyquoteparent
Thinking Summary

The user is asking about how the method in the paper handles finding critical points when the output is high-dimensional. They're referring to Figure 2 in the paper, which shows a nice 2D visualization of how the function changes along a line between two input points, with visible sharp corners where neurons toggle between active/inactive states. The key insight is that while the output is indeed a high-dimensional vector, each individual output coordinate is still a scalar function that can be analyzed along the line between two input points. The paper doesn't need to find corners in a high-dimensional output space - instead, it finds corners in the scalar functions corresponding to each output coordinate. Let me explain this clearly without using analogies.

You're right that in real neural networks both the input and output are high-dimensional vectors. The key insight is that the method doesn't need to find corners in a high-dimensional output space. Instead, it analyzes each output coordinate separately as a scalar function.

When you interpolate between two input points $x_1$ and $x_2$ in the high-dimensional input space via $\mu(\lambda) = x_1 + \lambda(x_2 - x_1)$, each output coordinate $f_j(\mu(\lambda))$ becomes a scalar function of $\lambda \in [0,1]$. So if your network has 10 outputs, you get 10 separate piecewise linear functions, each of which can be analyzed for critical points.

The critical points (where ReLUs toggle) affect all output coordinates simultaneously. When a neuron in some hidden layer switches from active to inactive (or vice versa) as you move along the interpolation path, this creates a sharp corner in the piecewise linear function for each output coordinate. The slopes before and after this point will be different for each output, but the location $\lambda^*$ where the corner occurs is the same across all outputs.

The paper's algorithm finds these corners by:

  1. Measuring slopes at different points along the interpolation
  2. Finding where slopes change (indicating a corner)
  3. Mapping back from $\lambda^*$ to the actual input point $x^* = \mu(\lambda^*)$

This approach works because ReLU networks are piecewise linear - the high-dimensional output function is linear within each region where all neurons maintain their active/inactive states, with sharp transitions only at the boundaries where neurons toggle.
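A coarse sketch of that per-coordinate corner scan, assuming black-box query access to a vector-valued `f`; the grid resolution and tolerance are arbitrary illustration values:

```python
import numpy as np

def corner_locations(f, x1, x2, num=2001, tol=1e-5):
    """Scan mu(lambda) = x1 + lambda*(x2 - x1) and report, for every
    output coordinate j, the lambda values where the finite-difference
    slope of f_j changes (a sketch; f maps an input vector to an output
    vector)."""
    lams = np.linspace(0.0, 1.0, num)
    outs = np.stack([f(x1 + lam * (x2 - x1)) for lam in lams])   # (num, d_out)
    slopes = np.diff(outs, axis=0) / np.diff(lams)[:, None]      # per-segment slopes
    jumps = np.abs(np.diff(slopes, axis=0)) > tol                # did the slope change?
    return [lams[1:-1][jumps[:, j]] for j in range(outs.shape[1])]
```

On a ReLU network the reported $\lambda$ values cluster at the same locations across output coordinates (up to grid resolution), which is exactly the shared-corner property described above.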

replyquoteparent
Thinking Summary

The user is asking a good question. Let me think about this carefully. When a ReLU toggles in a hidden layer, it doesn't necessarily affect all output coordinates. In fact, it only affects the output coordinates that have a non-zero path from that particular neuron to the output. Let me think about this more carefully: 1. If a neuron in layer $i$ toggles from active to inactive (or vice versa), this changes the output of that neuron from some value to 0 (or from 0 to some value). 2. This change propagates through the network. The effect on output coordinate $j$ depends on the weights connecting that neuron (through all subsequent layers) to output $j$. 3. If the total weight along all paths from that neuron to output $j$ is zero, then toggling that neuron won't affect output $j$ at all. 4. So actually, a critical point for a neuron doesn't necessarily create a corner in all output coordinates - only in those outputs that have non-zero connection from that neuron. The paper's statement is imprecise here. What's actually happening is: - When you move along the interpolation path and a neuron toggles, this creates a critical point - This critical point occurs at the same $\lambda^*$ value for all outputs (because it's the same neuron toggling) - But whether this creates a visible corner in each output function depends on whether that neuron influences that output So the correct statement would be: the location $\lambda^*$ where a neuron toggles is the same across all outputs, but not all outputs will necessarily show a corner at that point. Only outputs that are influenced by that neuron will show a corner.

You're absolutely right to question this - I was imprecise. A critical point (where a ReLU toggles) doesn't necessarily affect all output coordinates.

Here's what actually happens:

When a neuron in some hidden layer toggles at position $\lambda^*$ along the interpolation path, this creates a critical point at that specific $\lambda^*$ value. However, whether this creates a visible corner in each output coordinate depends on the weights connecting that neuron to each output.

For output coordinate $j$:

  • If there's a non-zero path of weights from the toggling neuron to output $j$, then $f_j(\mu(\lambda))$ will have a corner at $\lambda^*$
  • If all paths from that neuron to output $j$ have zero total weight, then $f_j(\mu(\lambda))$ remains smooth at $\lambda^*$

So the critical point occurs at the same location $\lambda^*$ for all outputs (since it's the same neuron toggling), but only manifests as a corner in outputs that are actually influenced by that neuron.

The paper's approach still works because:

  1. Most neurons in a trained network influence at least some outputs
  2. By analyzing multiple random interpolation paths, they collect critical points for different neurons
  3. Even if some outputs don't show corners for certain neurons, enough output coordinates will show corners to detect the critical points

This is why the paper can work with either single or multiple outputs - they just need at least one output coordinate that shows a corner to detect each critical point.

replyquoteparent
Thinking Summary

The user wants a proof showing why when a ReLU toggles, it affects all output coordinates simultaneously. Let me work through this carefully using the definitions from the paper. From the paper: - ReLU: $\sigma(x) = \max(x, 0)$ - Neuron j in layer i: $\eta(x) = \sigma(A_j^{(i)}x + b_j^{(i)})$ - The network is composed as: $f = f_{r+1} \circ \sigma \circ \cdots \circ \sigma \circ f_2 \circ \sigma \circ f_1$ The key insight is that when we move along an interpolation path $\mu(\lambda) = x_1 + \lambda(x_2 - x_1)$, a neuron becomes critical when its pre-activation value crosses zero. This happens at a specific value of $\lambda$, and this affects the entire computation graph downstream from that neuron. Let me write a clear proof.

Let me prove this using the definitions from the paper.

Definitions:

  • ReLU: $\sigma(x) = \max(x, 0)$
  • Neuron $j$ in layer $i$: $\eta_j^{(i)}(x) = \sigma(A_j^{(i)}x + b_j^{(i)})$
  • Network: $f = f_{r+1} \circ \sigma \circ \cdots \circ \sigma \circ f_2 \circ \sigma \circ f_1$

Proof:

Consider an interpolation path $\mu(\lambda) = x_1 + \lambda(x_2 - x_1)$ for $\lambda \in [0,1]$.

For neuron $j$ in layer $i$, its pre-activation value along this path is: $v_j^{(i)}(\lambda) = A_j^{(i)} \cdot (\text{input to layer } i \text{ at } \mu(\lambda)) + b_j^{(i)}$

This pre-activation value is a linear function of $\lambda$ (since the input to layer $i$ depends linearly on $\lambda$ when all neurons maintain their states). The neuron becomes critical when $v_j^{(i)}(\lambda^*) = 0$ for some $\lambda^*$.

Now consider the network's output. Let's denote the $k$-th output coordinate as $f_k$. We can write:

$f_k(\mu(\lambda)) = \sum_{j=1}^{d_r} w_{kj}^{(r+1)} \cdot \eta_j^{(r)}(\cdots) + b_k^{(r+1)}$

where $w_{kj}^{(r+1)}$ are the weights in the final layer.

The key observation: When neuron $j$ in layer $i$ toggles at $\lambda = \lambda^*$:

  • For $\lambda < \lambda^*$: $\eta_j^{(i)} = 0$ (if going from inactive to active)
  • For $\lambda > \lambda^*$: $\eta_j^{(i)} = A_j^{(i)} \cdot (\text{input}) + b_j^{(i)}$

This change propagates through all subsequent layers. Every path from this neuron to any output coordinate will experience a change in its contribution at exactly $\lambda = \lambda^*$.

Specifically, the derivative $\frac{d f_k}{d\lambda}$ will have a discontinuity at $\lambda^*$ for every output coordinate $k$ that has any dependency on neuron $j$ in layer $i$. Since the network is fully connected, every neuron in layer $i$ influences every output coordinate (unless some weights happen to be exactly zero).

Therefore, when neuron $j$ toggles at $\lambda^*$:

  1. The function $f_k(\mu(\lambda))$ experiences a corner (slope change) at $\lambda^*$ for all $k$
  2. The location $\lambda^*$ is the same for all output coordinates
  3. Only the magnitude of the slope change differs between outputs

This is why the paper can find critical points by analyzing any output coordinate - the $\lambda^*$ values where corners occur are identical across all outputs.

replyquoteparent

can you unpack and motivate the definition of the signature of a neuron? there are a few layers of nested redefinitions of terms before getting to the definition of the tuple. what are we doing here?

replyquoteparent
Thinking Summary

The user is asking about the definition of the "signature of a neuron" from the paper. Let me trace through the definitions to understand what's happening. Looking at Definition 4 in the paper: "Let ℓ = d_{i-1} and A_j^{(i)} be described as (a_1, a_2, ..., a_ℓ). The signature of the j-th neuron in layer i is the tuple (a_1/a_1 = 1, a_2/a_1, ..., a_ℓ/a_1)." This is defining the signature as the normalized weight vector where each component is divided by the first component a_1. The context and motivation seems to be: 1. In a neural network extraction attack, we want to recover the weights and biases 2. Due to the ReLU activation function, we can only recover weights up to a scaling factor (including possibly a sign flip) 3. The "signature" is a way to represent the weights that is invariant to positive scaling - it's essentially the direction of the weight vector 4. But it's NOT invariant to negative scaling (sign flips), which is why they need separate "sign recovery" techniques The nested definitions are building up the neural network structure: - Definition 1: Defines the overall network as alternating linear layers and ReLU activations - Definition 2: Defines each linear layer as an affine transformation f_i(x) = A^{(i)}x + b^{(i)} - Definition 3: Defines a single neuron as one row of this transformation followed by ReLU - Definition 4: Defines the signature as the normalized version of that row The key insight is that if you multiply all weights (including bias) of a neuron by a positive constant c, and then multiply all the weights in the next layer that receive this neuron's output by 1/c, the overall network function doesn't change. This is a symmetry of the network. I can only extract the relative proportions between weights, revealing the fundamental directional information of the neuron's weight vector without its absolute scale.

The signature of a neuron is a clever representation that captures what's actually recoverable in a neural network extraction attack. Let me unpack the layers:

The Setup

Starting from a neuron's basic operation:

  • A neuron in layer $i$ has weights $(a_1, a_2, \ldots, a_\ell)$ where $\ell = d_{i-1}$
  • It computes: $\sigma(a_1 x_1 + a_2 x_2 + \ldots + a_\ell x_\ell + b)$

The Fundamental Problem

ReLU networks have an inherent symmetry: if you scale all weights of a neuron by a positive constant $c$:

  • Neuron computes: $\sigma(c \cdot (a_1 x_1 + \ldots + a_\ell x_\ell + b))$
  • Then scale all outgoing connections by $1/c$ in the next layer
  • The overall network function remains identical

This means we can never recover the absolute scale of the weights through black-box queries.
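A tiny numerical check of this symmetry (a toy network with made-up weights, not from the paper): scaling one neuron's row and bias by a positive constant and its outgoing column by the inverse leaves the network's function unchanged, and the signature of that row is unchanged too:

```python
import numpy as np

rng = np.random.default_rng(2)
relu = lambda u: np.maximum(u, 0.0)

# a tiny 1-hidden-layer network with made-up weights
A1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
A2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)
f = lambda x: A2 @ relu(A1 @ x + b1) + b2

# scale neuron 0 of layer 1 by c > 0 and its outgoing weights by 1/c
c = 3.7
A1s, b1s, A2s = A1.copy(), b1.copy(), A2.copy()
A1s[0] *= c
b1s[0] *= c
A2s[:, 0] /= c
f_scaled = lambda x: A2s @ relu(A1s @ x + b1s) + b2

x = rng.normal(size=3)
print(np.allclose(f(x), f_scaled(x)))        # True: positive scaling is invisible
print(A1[0] / A1[0, 0], A1s[0] / A1s[0, 0])  # identical signatures
```

Flipping the sign instead (a negative $c$) does not commute with the ReLU, which is exactly why the sign must be recovered separately.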

Why the Signature?

The signature $(1, a_2/a_1, \ldots, a_\ell/a_1)$ represents what we can uniquely recover:

  1. Scale-invariant representation: Dividing by $a_1$ normalizes the weight vector, removing the unknown scaling factor

  2. Unique identifier: Different neurons (with different weight directions) will have different signatures, allowing us to:

    • Group critical points by which neuron they belong to
    • Track neurons across different observations
  3. But preserves sign ambiguity: The signature alone doesn't tell us if we have $(a_1, a_2, \ldots)$ or $(-a_1, -a_2, \ldots)$. This is crucial because:

    • Positive scaling is a symmetry (doesn't change function)
    • Negative scaling is NOT a symmetry (ReLU blocks negative values)

The Attack Strategy

The paper's approach is:

  1. Find critical points where neurons transition between active/inactive
  2. Use these to recover signatures (relative weight ratios)
  3. Separately determine signs to resolve the $(+)$ vs $(-)$ ambiguity
  4. Combine to get actual weights (up to a positive scale that doesn't matter)

The signature definition is essentially encoding "what information about the weights can we extract from observing input-output behavior, given the symmetries of ReLU networks?"

replyquoteparent
Thinking Summary

The user is asking me to connect Definition 6 (linear neighborhood) from the paper to concepts from convex optimization. Let me first recall what Definition 6 says: Definition 6: Let $x \in \mathcal{X}$. The linear neighbourhood of x is the set $$\{u \in \mathcal{X} \mid \mathcal{S}(\eta; x) = \mathcal{S}(\eta; u) \text{ for all neurons } \eta \text{ in the network}\}.$$ This is the set of all inputs where all neurons maintain the same activation state (active/inactive) as they have at point $x$. This concept has several connections to convex optimization: 1. **Polyhedral regions**: In a ReLU network, the linear neighborhood is actually a polyhedron - it's the intersection of half-spaces defined by the linear inequalities that determine whether each neuron is active or inactive. This is a fundamental object in convex optimization. 2. **Activation patterns**: The linear neighborhood corresponds to a specific "activation pattern" where the network behaves as an affine function. This is similar to how piecewise linear functions are studied in convex optimization. 3. **Local linearity**: Within the linear neighborhood, the neural network is just an affine transformation, which means optimization within this region is a convex problem. 4. **Tangent cone/Normal cone**: The boundaries of linear neighborhoods relate to concepts like tangent cones and normal cones at non-smooth points. 5. **Subdifferentials**: At the boundaries between linear neighborhoods, we have non-differentiable points, which connects to subdifferential calculus used in non-smooth convex optimization.

Yes! Definition 6's linear neighborhood has deep connections to several concepts from convex optimization:

1. Polyhedral Regions and Linear Programming

The linear neighborhood is actually a polyhedron - specifically, the intersection of half-spaces defined by the activation constraints. For each neuron $\eta_j$ in layer $i$, we have:

  • If active at $x$: $\langle A_j^{(i)}, F_{i-1}(u) \rangle + b_j^{(i)} \geq 0$
  • If inactive at $x$: $\langle A_j^{(i)}, F_{i-1}(u) \rangle + b_j^{(i)} \leq 0$

This creates a system of linear inequalities defining a convex polytope, exactly like the feasible region in linear programming.
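For concreteness, a small white-box sketch of this view (hypothetical helper names): the activation pattern is exactly the list of half-space memberships, and two inputs share a linear neighbourhood iff their patterns agree at every neuron:

```python
import numpy as np

relu = lambda u: np.maximum(u, 0.0)

def activation_pattern(layers, x):
    """Return the on/off state of every neuron at input x, for a list of
    (A, b) layer parameters (a sketch of the 'state' in Definition 6)."""
    pattern, h = [], x
    for A, b in layers:
        z = A @ h + b
        pattern.append(z > 0)            # which side of each neuron's hyperplane we are on
        h = relu(z)
    return pattern

def same_linear_neighbourhood(layers, x, u):
    """u lies in the linear neighbourhood of x iff every neuron keeps the
    same activation state, i.e. the two activation patterns coincide."""
    px, pu = activation_pattern(layers, x), activation_pattern(layers, u)
    return all(np.array_equal(a, b) for a, b in zip(px, pu))
```

The attacker of course cannot compute this directly (the weights are what is being extracted); the sketch just makes the polyhedral description explicit.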

2. Piecewise Affine Functions

ReLU networks are piecewise affine functions, where each linear neighborhood corresponds to one "piece." This connects to:

  • DC (Difference of Convex) programming: ReLU networks can be expressed as differences of convex functions
  • Tropical geometry: The max operations in ReLUs relate to tropical algebra used in some optimization contexts

3. Subdifferential Calculus

At boundaries between linear neighborhoods, the network is non-differentiable. The subdifferential at these points is the convex hull of the gradients from adjacent linear regions. This is crucial for:

  • Understanding critical points in the paper's attack
  • Non-smooth optimization algorithms like subgradient methods

4. Normal Cones and Optimality Conditions

The normal cone at a point xx on the boundary of a linear neighborhood contains the directions that point "outward" from the region. This relates to:

  • KKT conditions when neurons are at critical points ($\mathcal{V}(\eta; x) = 0$)
  • Complementarity conditions in mathematical programming

5. Locally Linear Optimization

Within each linear neighborhood, any optimization problem involving the neural network becomes a convex problem (actually affine). This means:

  • The attack can exploit local convexity
  • Gradient information is constant throughout the region
  • The "wiggle" directions in the attack stay within these convex regions

The paper's attack fundamentally exploits this piecewise linear structure - by finding critical points (boundaries between linear neighborhoods) and analyzing the network's behavior as it transitions between these convex polyhedral regions.

replyquoteparent
Thinking Summary

The user is asking about the neuron wiggle technique in the paper, specifically how they can wiggle a target neuron without knowing the weights in advance. Let me trace through the paper to understand this. The key insight is that the attack follows a layer-by-layer approach where they recover signatures first, then signs. By the time they use the neuron wiggle technique, they already have: 1. The signatures of all neurons (the relative ratios of weights) 2. The architecture of the network 3. The fully recovered previous layers (which they've "peeled off") So when they're working on layer i, they have: - $\hat{A}^{(i)}$ - the recovered weight matrix up to sign (they know signatures, which are ratios of weights) - $F_{i-1}$ - the fully recovered previous layers The key is that they can compute a wiggle that maximizes the change for a target neuron WITHOUT knowing the sign because: - They know the relative magnitudes of weights (the signature) - Maximizing the size of a wiggle doesn't require knowing the sign - if you negate the expression you get a wiggle of the same size in the opposite direction Let me find the specific part where they explain this... In section 4.2, they explain: "The crucial property we use here is that maximizing the size of the wiggle produced by a linear expression does not require knowledge of its sign - if we negate the expression we get a wiggle of the same size but in the opposite direction." And the process is: 1. They have $\hat{A}_j^{(i)}$ - the j-th row of the recovered weight matrix (up to sign) 2. They create a wiggle $\delta$ parallel to $\hat{A}_j^{(i)}$ 3. They project this onto the space of control $V^{(i-1)}$ 4. They find the input difference $\Delta$ that generates this wiggle by finding a pre-image under $F^{(i-1)}$ I can leverage the signature's directional information to create a maximal wiggle, exploiting the symmetry of magnitude regardless of sign direction. This allows precise neuron manipulation without complete weight knowledge.

The neuron wiggle technique can identify inputs that maximally wiggle a target neuron because by this stage of the attack, the adversary already knows the signatures (relative weight ratios) of all neurons, even though they don't know the signs.

Here's how it works:

Key Prerequisites

Before using neuron wiggle, the attack has already:

  1. Recovered the signatures of all neurons using critical points (the ratios like $(1, \frac{a_2}{a_1}, \ldots, \frac{a_\ell}{a_1})$)
  2. Fully recovered all previous layers (which have been “peeled off”)
  3. Obtained $\hat{A}^{(i)}$ - the weight matrix for layer $i$ up to unknown signs

Creating the Wiggle

The crucial insight is that maximizing the size of a wiggle doesn't require knowing the sign. If you have weights $(a_1, a_2, \ldots, a_n)$ up to sign, you can create a wiggle $\delta$ parallel to these weights:

$\delta \parallel \hat{A}_j^{(i)} = (\pm|a_1|, \pm|a_2|, \ldots, \pm|a_n|)$

This wiggle maximizes $|\langle \hat{A}_j^{(i)}, \delta \rangle|$ regardless of whether the true weights are positive or negative, because:

  • If the true weights are $(a_1, a_2, \ldots)$, the dot product gives $|a_1|^2 + |a_2|^2 + \ldots$
  • If the true weights are $(-a_1, -a_2, \ldots)$, you still get the same magnitude

The Process

  1. Project onto space of control: They project $\hat{A}_j^{(i)}$ onto $V^{(i-1)}$ (the subspace they can actually control at layer $i$)

  2. Find input difference: They compute $\Delta \in \mathbb{R}^{d_0}$ by finding a pre-image of $\delta$ under $F^{(i-1)}$ (the fully recovered previous layers)

  3. Test at critical point: At a critical point $x^*$ for the target neuron, they evaluate:

    • $L = f(x^* - \Delta) - f(x^*)$
    • $R = f(x^* + \Delta) - f(x^*)$
  4. Determine sign: If $|L| > |R|$, the neuron has negative sign; if $|R| > |L|$, it has positive sign

The key is that they can create an input perturbation that maximally affects the target neuron using only the signature information, then use the asymmetric behavior around critical points to determine which sign is correct.
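A simplified sketch of a single sign vote, restricted to a first-hidden-layer neuron so that the wiggle can be applied directly to the input (for deeper layers the attack instead pushes $\delta$ through the recovered preceding layers); the names and step size are illustrative:

```python
import numpy as np

def wiggle_sign_test(f, x_star, a_hat, eps=1e-3):
    """One sign vote for a first-hidden-layer neuron at one of its
    critical points x_star.  `a_hat` is the recovered weight row, known
    only up to sign; `f` is the black-box network (vector output)."""
    delta = eps * a_hat / np.linalg.norm(a_hat)   # wiggle parallel to +/- the true row
    L = np.linalg.norm(f(x_star - delta) - f(x_star))
    R = np.linalg.norm(f(x_star + delta) - f(x_star))
    # if the true row is +a_hat, the ReLU opens on the +delta side and the
    # output moves more there; if the true row is -a_hat, the opposite holds
    return +1 if R > L else -1
```

In the attack this single vote is repeated over ~200 critical points and aggregated with the confidence rule described earlier in the thread.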

replyquoteparent
Thinking Summary

The user is asking which section of the paper describes the signature recovery process. Let me scan through the paper to find where this is discussed. Looking at the paper structure, I can see that Section 3.3 is titled "Carlini et al.'s Differential Attack" and it contains subsections on "Finding critical points" and "Finding signatures". This appears to be where the signature recovery is explained. Let me confirm by looking at the content in that section... Yes, in Section 3.3, under the "Finding signatures" subsection, the paper explains how signatures are recovered. It starts with "The input to the DNN is also the input to the first hidden layer, and we have full control over it..." and goes on to explain the process.

The signature recovery is described in Section 3.3 ("Carlini et al.'s Differential Attack"), specifically in the subsection titled "Finding signatures".

This section explains how to recover the signatures (the relative weight ratios) of neurons by:

  1. For the first hidden layer: Using critical points and computing partial derivatives:

    • Query $\alpha_{i,-} = \frac{\partial f}{\partial e_i}(x^* - \varepsilon e_i)$ and $\alpha_{i,+} = \frac{\partial f}{\partial e_i}(x^* + \varepsilon e_i)$
    • The difference $\alpha_{i,+} - \alpha_{i,-}$ gives a multiple of weight $a_i$
    • Taking ratios $\frac{\alpha_{i,+} - \alpha_{i,-}}{\alpha_{k,+} - \alpha_{k,-}} = \frac{a_i}{a_k}$ reveals the signature
  2. For deeper layers: Since you can't control individual coordinates after layer 1, they:

    • Sample $d_i + 1$ random directions $\delta_k \sim \mathcal{N}(0, \varepsilon I_{d_0})$
    • Compute second derivatives $y_k = \frac{\partial^2 f(x^*)}{\partial \delta_1 \partial \delta_k}$
    • Solve for the signature vector $a$ such that $\langle h_k, a \rangle = y_k$

The paper notes that this signature recovery technique is from Carlini et al.'s prior work, and the new contributions in this paper focus on the sign recovery methods (SOE, Neuron Wiggle, and Last Hidden Layer techniques) described in Section 4.
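As a rough illustration of the first-hidden-layer case, here is a sketch that uses a slight variant of the recipe above: fix two base points just on either side of the neuron's hyperplane (along coordinate 0) and difference the finite-difference gradients there. The step sizes, and the assumption that no other neuron changes state in that small region and that the chosen output coordinate actually depends on the target neuron, are illustrative simplifications:

```python
import numpy as np

def layer1_signature(f, x_star, eps=1e-4, h=1e-6):
    """Recover the signature of the layer-1 neuron that is critical at
    x_star (a sketch; `f` is one scalar output coordinate).  The gradient
    difference across the neuron's hyperplane is a fixed multiple of its
    weight row; dividing by the first entry removes that unknown
    (possibly negative) constant, which is exactly the sign ambiguity."""
    d = len(x_star)
    e0 = np.zeros(d)
    e0[0] = 1.0
    below, above = x_star - eps * e0, x_star + eps * e0   # opposite sides of the boundary
    diff = np.zeros(d)
    for i in range(d):
        e = np.zeros(d)
        e[i] = 1.0
        grad_below = (f(below + h * e) - f(below)) / h    # d f / d x_i with the neuron off (say)
        grad_above = (f(above + h * e) - f(above)) / h    # d f / d x_i with the neuron on
        diff[i] = grad_above - grad_below                 # = c * a_i for one fixed constant c
    return diff / diff[0]                                 # (1, a_2/a_1, ..., a_d/a_1)
```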

replyquoteparent

ah, this is the part i’m concerned with, how do i know i’m in a sufficiently useful local neighborhood to start with? what if there are critical points off in some region of input space i didn’t sample?

replyquoteparent
Thinking Summary

The user is asking about a very good point - how do we ensure we find critical points for all neurons? What if some neurons have critical points only in regions of input space we didn't sample? Looking at the paper, there are a few relevant points: 1. They mention "unlucky events" that could cause the attack to fail, including neurons whose values "almost never change sign" - these are the "almost always-on/off neurons" mentioned on page 2-3. 2. In their practical experiments, they start from randomly chosen CIFAR10 images to find critical points. 3. They mention that after training, they didn't encounter always-on neurons but did find always-off neurons. 4. The paper notes that if a neuron is almost always on or off, its behavior is "simply absorbed by the linear mappings in the next layer" resulting in a "very close approximation of the original neural network." But the user is right to be concerned - this is a potential weakness in the attack. If critical points exist only in some exotic region of input space that wasn't sampled, those neurons' signs might not be recoverable.

This is an excellent observation and indeed a fundamental limitation of the attack! The paper acknowledges this concern in several places:

The "Unlucky Events" Problem

The authors explicitly mention this as a caveat on page 2-3:

"there could be some neurons whose values before the ReLU almost never change sign, and thus our bounded number of queries will fail to detect any critical point for them."

They call these "almost always-on/off neurons" and note that:

  • If you can't find critical points for a neuron, you can't recover its sign
  • In practice, after training, they found instances of always-off neurons but not always-on neurons
  • Such neurons' "almost perfectly linear behavior will be simply absorbed by the linear mappings in the next layer"

Sampling Strategy

For finding critical points, they:

  1. Sample pairs of inputs $x_1, x_2$ and trace the line between them
  2. In CIFAR10 experiments, start from randomly chosen dataset images
  3. Look for discontinuities in the piecewise linear function $f^*(\lambda) = f(x_1 + \lambda(x_2 - x_1))$

Why It Often Works in Practice

The attack succeeds despite this limitation because:

  1. Training creates diverse behaviors: Neural networks trained on real data tend to have neurons that activate across different regions of input space

  2. Many starting points: They use 200 different critical points per neuron for sign recovery, starting from different random images

  3. Functional equivalence is sufficient: If a neuron is always-off, setting its weights to zero yields a functionally equivalent network

However, you're absolutely right that this is a vulnerability. If critical points for certain neurons exist only in unexplored regions (perhaps adversarially crafted inputs or out-of-distribution examples), the attack would fail to recover those neurons' parameters correctly.

replyquoteparent