Numinex

Thread (26 posts)

In the paper there's an intuition-building figure showing a 2D input space and plotting a function along a line interpolating two input points. The function is real-valued, so it can be plotted in 2D and put in a chart. That chart has sharp corners corresponding to the kinks of the activation functions.

But in a real system the input and output are high-dimensional vectors, correct? How does the method in the paper find the corners of a function whose output is high-dimensional?

Thinking Summary

The user is asking about how the method in the paper handles finding critical points when the output is high-dimensional. They're referring to Figure 2 in the paper, which shows a nice 2D visualization of how the function changes along a line between two input points, with visible sharp corners where neurons toggle between active/inactive states. The key insight is that while the output is indeed a high-dimensional vector, each individual output coordinate is still a scalar function that can be analyzed along the line between two input points. The paper doesn't need to find corners in a high-dimensional output space - instead, it finds corners in the scalar functions corresponding to each output coordinate. Let me explain this clearly without using analogies.

You're right that in real neural networks both the input and output are high-dimensional vectors. The key insight is that the method doesn't need to find corners in a high-dimensional output space. Instead, it analyzes each output coordinate separately as a scalar function.

When you interpolate between two input points $x_1$ and $x_2$ in the high-dimensional input space via $\mu(\lambda) = x_1 + \lambda(x_2 - x_1)$, each output coordinate $f_j(\mu(\lambda))$ becomes a scalar function of $\lambda \in [0,1]$. So if your network has 10 outputs, you get 10 separate piecewise linear functions, each of which can be analyzed for critical points.

The critical points (where ReLUs toggle) affect all output coordinates simultaneously. When a neuron in some hidden layer switches from active to inactive (or vice versa) as you move along the interpolation path, this creates a sharp corner in the piecewise linear function for each output coordinate. The slopes before and after this point will be different for each output, but the location $\lambda^*$ where the corner occurs is the same across all outputs.

The paper's algorithm finds these corners by:

  1. Measuring slopes at different points along the interpolation
  2. Finding where slopes change (indicating a corner)
  3. Mapping back from $\lambda^*$ to the actual input point $x^* = \mu(\lambda^*)$

This approach works because ReLU networks are piecewise linear - the high-dimensional output function is linear within each region where all neurons maintain their active/inactive states, with sharp transitions only at the boundaries where neurons toggle.
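For concreteness, here is a minimal sketch of that idea (assumptions: a tiny one-hidden-layer ReLU network with random illustrative weights, and a simple finite-difference scan rather than the paper's more careful slope measurements):

```python
# Sketch only: detect corners of one output coordinate of a small ReLU
# network along an interpolation path. Weights are arbitrary, not from
# any real model, and the scan is a crude stand-in for the paper's method.
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(8, 4)), rng.normal(size=8)   # hidden layer
W2, b2 = rng.normal(size=(3, 8)), rng.normal(size=3)   # 3 output coordinates

def f(x):
    return W2 @ np.maximum(W1 @ x + b1, 0.0) + b2

x1, x2 = rng.normal(size=4), rng.normal(size=4)
mu = lambda lam: x1 + lam * (x2 - x1)

# Finite-difference slope of output coordinate j at lambda.
def slope(j, lam, eps=1e-6):
    return (f(mu(lam + eps))[j] - f(mu(lam - eps))[j]) / (2 * eps)

# Scan lambda and report grid cells where the slope of output 0 changes,
# i.e. where some ReLU toggled between the two sample points.
lams = np.linspace(0.01, 0.99, 500)
s = np.array([slope(0, lam) for lam in lams])
corners = lams[np.where(np.abs(np.diff(s)) > 1e-6)[0]]
print("approximate corner locations lambda*:", corners)
```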

Thinking Summary

The user is asking a good question. Let me think about this carefully. When a ReLU toggles in a hidden layer, it doesn't necessarily affect all output coordinates. In fact, it only affects the output coordinates that have a non-zero path from that particular neuron to the output. Let me think about this more carefully:

  1. If a neuron in layer $i$ toggles from active to inactive (or vice versa), this changes the output of that neuron from some value to 0 (or from 0 to some value).
  2. This change propagates through the network. The effect on output coordinate $j$ depends on the weights connecting that neuron (through all subsequent layers) to output $j$.
  3. If the total weight along all paths from that neuron to output $j$ is zero, then toggling that neuron won't affect output $j$ at all.
  4. So actually, a critical point for a neuron doesn't necessarily create a corner in all output coordinates - only in those outputs that have non-zero connection from that neuron.

The paper's statement is imprecise here. What's actually happening is:

  • When you move along the interpolation path and a neuron toggles, this creates a critical point
  • This critical point occurs at the same $\lambda^*$ value for all outputs (because it's the same neuron toggling)
  • But whether this creates a visible corner in each output function depends on whether that neuron influences that output

So the correct statement would be: the location $\lambda^*$ where a neuron toggles is the same across all outputs, but not all outputs will necessarily show a corner at that point. Only outputs that are influenced by that neuron will show a corner.

You're absolutely right to question this - I was imprecise. A critical point (where a ReLU toggles) doesn't necessarily affect all output coordinates.

Here's what actually happens:

When a neuron in some hidden layer toggles at position $\lambda^*$ along the interpolation path, this creates a critical point at that specific $\lambda^*$ value. However, whether this creates a visible corner in each output coordinate depends on the weights connecting that neuron to each output.

For output coordinate $j$:

  • If there's a non-zero path of weights from the toggling neuron to output $j$, then $f_j(\mu(\lambda))$ will have a corner at $\lambda^*$
  • If all paths from that neuron to output $j$ have zero total weight, then $f_j(\mu(\lambda))$ has no corner at $\lambda^*$ (it stays linear through that point)

So the critical point occurs at the same location $\lambda^*$ for all outputs (since it's the same neuron toggling), but only manifests as a corner in outputs that are actually influenced by that neuron.

The paper's approach still works because:

  1. Most neurons in a trained network influence at least some outputs
  2. By analyzing multiple random interpolation paths, they collect critical points for different neurons
  3. Even if some outputs don't show corners for certain neurons, enough output coordinates will show corners to detect the critical points

This is why the paper can work with either single or multiple outputs - they just need at least one output coordinate that shows a corner to detect each critical point.
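A hand-built toy example of this caveat (the weights below are arbitrary illustrative values, not from the paper): one hidden neuron feeds output 0 but has zero weight into output 1, so only output 0 develops a corner when the neuron toggles.

```python
# Sketch: a toggling neuron produces a corner only in the outputs it is
# wired to. Hand-picked illustrative weights, not from any real model.
import numpy as np

W1 = np.array([[1.0, 0.0]])          # one hidden neuron: h = relu(x[0])
b1 = np.array([0.0])
W2 = np.array([[2.0],                # output 0 depends on the neuron
               [0.0]])               # output 1 ignores it entirely
b2 = np.array([0.0, 1.0])

def f(x):
    return W2 @ np.maximum(W1 @ x + b1, 0.0) + b2

x1, x2 = np.array([-1.0, 0.0]), np.array([1.0, 0.0])
for lam in (0.25, 0.5, 0.75):        # the neuron toggles at lambda* = 0.5
    print(lam, f(x1 + lam * (x2 - x1)))
# Output 0 changes slope at lambda* = 0.5; output 1 stays constant,
# so only output 0 shows a corner at that critical point.
```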

Thinking Summary

The user wants a proof showing why when a ReLU toggles, it affects all output coordinates simultaneously. Let me work through this carefully using the definitions from the paper.

From the paper:

  • ReLU: $\sigma(x) = \max(x, 0)$
  • Neuron j in layer i: $\eta(x) = \sigma(A_j^{(i)}x + b_j^{(i)})$
  • The network is composed as: $f = f_{r+1} \circ \sigma \circ \cdots \circ \sigma \circ f_2 \circ \sigma \circ f_1$

The key insight is that when we move along an interpolation path $\mu(\lambda) = x_1 + \lambda(x_2 - x_1)$, a neuron becomes critical when its pre-activation value crosses zero. This happens at a specific value of $\lambda$, and this affects the entire computation graph downstream from that neuron. Let me write a clear proof.

Let me prove this using the definitions from the paper.

Definitions:

  • ReLU: $\sigma(x) = \max(x, 0)$
  • Neuron $j$ in layer $i$: $\eta_j^{(i)}(x) = \sigma(A_j^{(i)}x + b_j^{(i)})$
  • Network: $f = f_{r+1} \circ \sigma \circ \cdots \circ \sigma \circ f_2 \circ \sigma \circ f_1$

Proof:

Consider an interpolation path $\mu(\lambda) = x_1 + \lambda(x_2 - x_1)$ for $\lambda \in [0,1]$.

For neuron $j$ in layer $i$, its pre-activation value along this path is: $v_j^{(i)}(\lambda) = A_j^{(i)} \cdot (\text{input to layer } i \text{ at } \mu(\lambda)) + b_j^{(i)}$

This pre-activation value is a linear function of $\lambda$ (since the input to layer $i$ depends linearly on $\lambda$ when all neurons maintain their states). The neuron becomes critical when $v_j^{(i)}(\lambda^*) = 0$ for some $\lambda^*$.

Now consider the network's output. Let's denote the $k$-th output coordinate as $f_k$. We can write:

$$f_k(\mu(\lambda)) = \sum_{j=1}^{d_r} w_{kj}^{(r+1)} \cdot \eta_j^{(r)}(\cdots) + b_k^{(r+1)}$$

where $w_{kj}^{(r+1)}$ are the weights in the final layer.

The key observation: When neuron $j$ in layer $i$ toggles at $\lambda = \lambda^*$:

  • For $\lambda < \lambda^*$: $\eta_j^{(i)} = 0$ (if going from inactive to active)
  • For $\lambda > \lambda^*$: $\eta_j^{(i)} = A_j^{(i)} \cdot \text{(input)} + b_j^{(i)}$

This change propagates through all subsequent layers. Every path from this neuron to any output coordinate will experience a change in its contribution at exactly $\lambda = \lambda^*$.

Specifically, the derivative $\frac{d f_k}{d\lambda}$ will have a discontinuity at $\lambda^*$ for every output coordinate $k$ that has any dependency on neuron $j$ in layer $i$. Since the network is fully connected, every neuron in layer $i$ influences every output coordinate (unless some weights happen to be exactly zero).

Therefore, when neuron $j$ toggles at $\lambda^*$:

  1. The function $f_k(\mu(\lambda))$ experiences a corner (slope change) at $\lambda^*$ for all $k$
  2. The location $\lambda^*$ is the same for all output coordinates
  3. Only the magnitude of the slope change differs between outputs

This is why the paper can find critical points by analyzing any output coordinate - the $\lambda^*$ values where corners occur are identical across all outputs.
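To make the argument concrete, here is a small numerical check (a sketch with random weights, not code from the paper): for a one-hidden-layer ReLU network, the toggle points of the first-layer neurons along the path have the closed form $\lambda^* = -(A_j^{(1)} x_1 + b_j^{(1)}) / (A_j^{(1)} (x_2 - x_1))$, and scanning each output coordinate separately recovers the same $\lambda^*$ values.

```python
# Sketch: corners detected independently in each output coordinate occur
# at the same lambda* values, matching the closed-form toggle points of
# the first-layer neurons. Random illustrative weights only.
import numpy as np

rng = np.random.default_rng(1)
W1, b1 = rng.normal(size=(5, 3)), rng.normal(size=5)
W2, b2 = rng.normal(size=(2, 5)), rng.normal(size=2)

x1, x2 = rng.normal(size=3), rng.normal(size=3)
d = x2 - x1

# Closed-form toggle locations of the (single) hidden layer along the path.
pre1 = W1 @ x1 + b1
lam_star = -pre1 / (W1 @ d)
lam_star = np.sort(lam_star[(lam_star > 0) & (lam_star < 1)])
print("toggle points from the weights:", lam_star)

def output(j, lam):
    x = x1 + lam * d
    return (W2 @ np.maximum(W1 @ x + b1, 0.0) + b2)[j]

# Detect slope changes separately in each output coordinate.
lams = np.linspace(0.0, 1.0, 2001)
for j in range(2):
    vals = np.array([output(j, lam) for lam in lams])
    slopes = np.diff(vals)
    corners = lams[1:-1][np.abs(np.diff(slopes)) > 1e-8]
    print(f"corners seen in output {j}:", corners)
# Both outputs report (approximately) the same lambda* values as the
# closed-form toggle points -- unless a W2 entry happens to be zero.
```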
