Numinex

Thread (26 posts)

In the paper there's an intuition-building figure showing a 2D input space and plotting the function along a line interpolating between two input points. The function is real-valued, so it can be plotted in 2D and put in a chart. That chart has sharp corners corresponding to the linear pieces of the activation functions.

But in a real system the input and output are high-dimensional vectors, correct? How does the method in the paper find the corners of a function whose output is high-dimensional?

Thinking Summary

The user is asking about how the method in the paper handles finding critical points when the output is high-dimensional. They're referring to Figure 2 in the paper, which shows a nice 2D visualization of how the function changes along a line between two input points, with visible sharp corners where neurons toggle between active/inactive states. The key insight is that while the output is indeed a high-dimensional vector, each individual output coordinate is still a scalar function that can be analyzed along the line between two input points. The paper doesn't need to find corners in a high-dimensional output space - instead, it finds corners in the scalar functions corresponding to each output coordinate. Let me explain this clearly without using analogies.

You're right that in real neural networks both the input and output are high-dimensional vectors. The key insight is that the method doesn't need to find corners in a high-dimensional output space. Instead, it analyzes each output coordinate separately as a scalar function.

When you interpolate between two input points $x_1$ and $x_2$ in the high-dimensional input space via $\mu(\lambda) = x_1 + \lambda(x_2 - x_1)$, each output coordinate $f_j(\mu(\lambda))$ becomes a scalar function of $\lambda \in [0,1]$. So if your network has 10 outputs, you get 10 separate piecewise linear functions, each of which can be analyzed for critical points.

The critical points (where ReLUs toggle) affect all output coordinates simultaneously. When a neuron in some hidden layer switches from active to inactive (or vice versa) as you move along the interpolation path, this creates a sharp corner in the piecewise linear function for each output coordinate. The slopes before and after this point will be different for each output, but the location $\lambda^*$ where the corner occurs is the same across all outputs.

The paper's algorithm finds these corners by:

  1. Measuring slopes at different points along the interpolation
  2. Finding where slopes change (indicating a corner)
  3. Mapping back from $\lambda^*$ to the actual input point $x^* = \mu(\lambda^*)$

This approach works because ReLU networks are piecewise linear - the high-dimensional output function is linear within each region where all neurons maintain their active/inactive states, with sharp transitions only at the boundaries where neurons toggle.
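
To make this concrete, here is a rough brute-force sketch (not the paper's query-efficient procedure): densely sample each output coordinate along the interpolation path and flag the $\lambda$ values where the finite-difference slope changes. The two-layer toy network, its sizes, and the detection threshold are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy fully connected ReLU network R^8 -> R^3 (weights made up for illustration).
A1, b1 = rng.normal(size=(16, 8)), rng.normal(size=16)
A2, b2 = rng.normal(size=(3, 16)), rng.normal(size=3)

def f(x):
    """Piecewise-linear network output (3 coordinates)."""
    return A2 @ np.maximum(A1 @ x + b1, 0.0) + b2

x1, x2 = rng.normal(size=8), rng.normal(size=8)
mu = lambda lam: x1 + lam * (x2 - x1)            # interpolation path

lams = np.linspace(0.0, 1.0, 20001)
outputs = np.array([f(mu(l)) for l in lams])     # shape (20001, 3)

# Finite-difference slope of each output coordinate along the path.
slopes = np.diff(outputs, axis=0) / np.diff(lams)[:, None]

# A corner is wherever the slope of any coordinate jumps; each true corner
# may show up at one or two adjacent grid points.
slope_jumps = np.abs(np.diff(slopes, axis=0))    # shape (N-2, 3)
corner_idx = np.where(slope_jumps.max(axis=1) > 1e-3)[0] + 1
print("approximate corner locations:", lams[corner_idx])
```

The paper locates these corners with far fewer queries than dense sampling; the sketch is only meant to show the geometry: scalar piecewise linear slices of a high-dimensional map, with all slices bending at the same $\lambda$ values.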

Thinking Summary

The user wants a proof showing why, when a ReLU toggles, it affects all output coordinates simultaneously. Let me work through this carefully using the definitions from the paper:

  • ReLU: $\sigma(x) = \max(x, 0)$
  • Neuron j in layer i: $\eta(x) = \sigma(A_j^{(i)}x + b_j^{(i)})$
  • The network is composed as: $f = f_{r+1} \circ \sigma \circ \cdots \circ \sigma \circ f_2 \circ \sigma \circ f_1$

The key insight is that when we move along an interpolation path $\mu(\lambda) = x_1 + \lambda(x_2 - x_1)$, a neuron becomes critical when its pre-activation value crosses zero. This happens at a specific value of $\lambda$, and this affects the entire computation graph downstream from that neuron. Let me write a clear proof.

Let me prove this using the definitions from the paper.

Definitions:

  • ReLU: $\sigma(x) = \max(x, 0)$
  • Neuron $j$ in layer $i$: $\eta_j^{(i)}(x) = \sigma(A_j^{(i)}x + b_j^{(i)})$
  • Network: $f = f_{r+1} \circ \sigma \circ \cdots \circ \sigma \circ f_2 \circ \sigma \circ f_1$

Proof:

Consider an interpolation path $\mu(\lambda) = x_1 + \lambda(x_2 - x_1)$ for $\lambda \in [0,1]$.

For neuron $j$ in layer $i$, its pre-activation value along this path is: $v_j^{(i)}(\lambda) = A_j^{(i)} \cdot (\text{input to layer } i \text{ at } \mu(\lambda)) + b_j^{(i)}$

This pre-activation value is a linear function of $\lambda$ (since the input to layer $i$ depends linearly on $\lambda$ when all neurons maintain their states). The neuron becomes critical when $v_j^{(i)}(\lambda^*) = 0$ for some $\lambda^*$.
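
For a first-layer neuron this can be made fully concrete: the pre-activation is affine in $\lambda$, so its zero crossing has a closed form, $\lambda^* = -(A_j^{(1)} x_1 + b_j^{(1)}) / (A_j^{(1)}(x_2 - x_1))$. A minimal self-contained sketch with illustrative random weights (for deeper layers the same formula applies only within a fixed activation region of the earlier layers):

```python
import numpy as np

rng = np.random.default_rng(0)
A1, b1 = rng.normal(size=(16, 8)), rng.normal(size=16)   # illustrative layer-1 weights
x1, x2 = rng.normal(size=8), rng.normal(size=8)

j = 3                                   # an arbitrary first-layer neuron
a, b = A1[j], b1[j]

# v_j(lambda) = a.(x1 + lambda (x2 - x1)) + b = (a.x1 + b) + lambda * a.(x2 - x1)
c0 = a @ x1 + b                         # pre-activation at lambda = 0
c1 = a @ (x2 - x1)                      # slope of the pre-activation in lambda
lam_star = -c0 / c1                     # zero crossing

if 0.0 <= lam_star <= 1.0:
    print(f"neuron {j} toggles on this path at lambda* = {lam_star:.4f}")
else:
    print(f"neuron {j} does not toggle on this path (lambda* = {lam_star:.4f})")
```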

Now consider the network's output. Let's denote the $k$-th output coordinate as $f_k$. We can write:

$$f_k(\mu(\lambda)) = \sum_{j=1}^{d_r} w_{kj}^{(r+1)} \, \eta_j^{(r)}(\cdots) + b_k^{(r+1)}$$

where $w_{kj}^{(r+1)}$ are the weights in the final layer.

The key observation: when neuron $j$ in layer $i$ toggles at $\lambda = \lambda^*$ (going from inactive to active, say; the reverse case is symmetric):

  • For $\lambda < \lambda^*$: $\eta_j^{(i)} = 0$
  • For $\lambda > \lambda^*$: $\eta_j^{(i)} = A_j^{(i)} \cdot (\text{input}) + b_j^{(i)}$

This change propagates through all subsequent layers. Every path from this neuron to any output coordinate will experience a change in its contribution at exactly $\lambda = \lambda^*$.

Specifically, the derivative $\frac{d f_k}{d\lambda}$ will have a discontinuity at $\lambda^*$ for every output coordinate $k$ that has any dependency on neuron $j$ in layer $i$. Since the network is fully connected, every neuron in layer $i$ influences every output coordinate (unless some weights happen to be exactly zero).

Therefore, when neuron $j$ toggles at $\lambda^*$:

  1. The function $f_k(\mu(\lambda))$ experiences a corner (slope change) at $\lambda^*$ for all $k$
  2. The location $\lambda^*$ is the same for all output coordinates
  3. Only the magnitude of the slope change differs between outputs

This is why the paper can find critical points by analyzing any output coordinate: the $\lambda^*$ values where corners occur are identical across all outputs.
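
A quick empirical check of this claim, again with an illustrative toy network: pick a first-layer neuron that toggles on the path, compute its $\lambda^*$ analytically, and compare the numerical slope of every output coordinate just before and just after $\lambda^*$. All coordinates change slope at that same $\lambda^*$; only the sizes of the changes differ. The network, the offsets, and the neuron-selection heuristic are assumptions of the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
A1, b1 = rng.normal(size=(16, 8)), rng.normal(size=16)
A2, b2 = rng.normal(size=(3, 16)), rng.normal(size=3)
f = lambda x: A2 @ np.maximum(A1 @ x + b1, 0.0) + b2
x1, x2 = rng.normal(size=8), rng.normal(size=8)
mu = lambda lam: x1 + lam * (x2 - x1)

# Analytic toggle points of all first-layer neurons along the path.
lam_stars = -(A1 @ x1 + b1) / (A1 @ (x2 - x1))

# Pick one neuron that toggles well inside (0, 1); generically at least one does.
j = int(np.flatnonzero((lam_stars > 0.05) & (lam_stars < 0.95))[0])
lam_star = lam_stars[j]

# Numerical slope of every output coordinate just before and just after lambda*.
# (Assumes no other neuron toggles within 2*eps of lambda*, which is generically true.)
eps = 1e-4
slope_before = (f(mu(lam_star - eps)) - f(mu(lam_star - 2 * eps))) / eps
slope_after = (f(mu(lam_star + 2 * eps)) - f(mu(lam_star + eps))) / eps

print(f"neuron {j} toggles at lambda* = {lam_star:.4f}")
print("slope change of each output coordinate:", slope_after - slope_before)
```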
