
Mechanistic Interpretability Tutorial
A Mathematical Survey of Decomposition Strategies


I. Prerequisites

  1. Notation:
    • Neural network $f_\theta: \mathbb{R}^n \to \mathbb{R}^m$ with parameters $\theta \in \mathbb{R}^p$
    • Layer transformations $h_{l+1} = \sigma(W_l h_l + b_l)$
  2. Key concepts:
    • Jacobian $J_f(x) = [\partial f_i/\partial x_j]_{ij}$ (see the sketch after this list)
    • Hessian with respect to parameters, $H_f = [\partial^2 f/\partial \theta_i\, \partial \theta_j]_{ij}$
    • Tensor contractions $\langle A, B \rangle_{i_1 j_2} = \sum_{k} A_{i_1 k} B_{k j_2}$
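A minimal sketch of these objects for a toy two-layer network, assuming PyTorch; the layer sizes are arbitrary, and the Hessian is taken with respect to the input of a scalarized output purely for brevity:

```python
import torch
from torch.autograd.functional import jacobian, hessian

# Toy two-layer network f_theta: R^4 -> R^3
W1, b1 = torch.randn(8, 4), torch.randn(8)
W2, b2 = torch.randn(3, 8), torch.randn(3)

def f(x):
    h1 = torch.sigmoid(W1 @ x + b1)   # h_{l+1} = sigma(W_l h_l + b_l)
    return W2 @ h1 + b2

x = torch.randn(4)
J = jacobian(f, x)                    # (3, 4) matrix [df_i / dx_j]
print(J.shape)

# hessian() needs a scalar-valued function, so take the squared output norm.
H = hessian(lambda z: f(z).pow(2).sum(), x)   # (4, 4)
print(H.shape)
```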

II. Probe-Based Analysis

Learn linear/nonlinear projections that decode internal states

Methodology:

  1. Fix the frozen model $f_\theta$
  2. Train a probe network $g_\phi: \mathcal{H} \to \mathcal{Y}$, where $\mathcal{H}$ is the hidden-state space
  3. Optimize $\min_\phi \mathbb{E}_{x,y}[\mathcal{L}(g_\phi(h_l(x)), y)]$

Mathematical Framework:

  • Consistent Concept Score (CCS):
    $$\max_\phi \frac{\mathbb{E}[g_\phi(h(x^+))] - \mathbb{E}[g_\phi(h(x^-))]}{\sqrt{\mathrm{Var}(g_\phi(h(x)))}}$$
    where $x^+$ / $x^-$ are concept-positive / concept-negative examples

Example:
$$\min_w \|w^T H_{\text{MLP}} - \text{truthfulness scores}\|_2 + \lambda \|w\|_1$$
Sparse weights $w$ reveal the neurons critical for truth-telling.
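A minimal sketch of this sparse-probe objective using scikit-learn's Lasso on synthetic hidden states; the shapes, noise level, and regularization strength are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n_examples, d_hidden = 512, 64
H = rng.normal(size=(n_examples, d_hidden))               # rows: h_MLP(x) per example
w_true = np.zeros(d_hidden)
w_true[:5] = 1.0                                          # only 5 neurons carry the signal
scores = H @ w_true + 0.1 * rng.normal(size=n_examples)   # stand-in "truthfulness scores"

# min_w ||H w - scores||^2 + lambda ||w||_1
probe = Lasso(alpha=0.05).fit(H, scores)
critical = np.flatnonzero(probe.coef_)                    # sparse support = candidate neurons
print("neurons with nonzero probe weight:", critical)
```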


III. Path Patching

Isolate computational pathways via ablated recomputation

Formalization:

  1. Define the computational graph $G = (V, E)$ with nodes $v \in V$ (neurons)
  2. For an input $x$, compute the paths $P(x) = \{p \subseteq E\}$ contributing to the output
  3. Intervene via the patch operation:
    $$h_l^{\text{patched}}(x) = \begin{cases} h_l(x_{\text{source}}) & \text{if } l \in \mathcal{L}_{\text{patch}} \\ h_l(x_{\text{base}}) & \text{otherwise} \end{cases}$$

Implementation:
$$\Delta_{\text{effect}} = \|f(x_{\text{base}}) - f(x_{\text{base}} \oplus \text{patch}(x_{\text{source}}))\|$$
where $\oplus$ denotes compositional replacement
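A minimal PyTorch sketch of this patch-and-recompute loop using forward hooks on a toy MLP; the architecture, the choice of patched layer, and the inputs are illustrative assumptions:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))
x_base, x_source = torch.randn(8), torch.randn(8)
patch_layer = model[1]                     # patch the post-ReLU activation

# 1) Cache the source run's activation at the patched layer.
cache = {}
hook = patch_layer.register_forward_hook(lambda m, i, o: cache.update(act=o.detach()))
with torch.no_grad():
    model(x_source)
hook.remove()

with torch.no_grad():
    y_base = model(x_base)                 # unpatched baseline

# 2) Re-run the base input with that activation swapped in (returning a tensor
#    from a forward hook replaces the module's output).
hook = patch_layer.register_forward_hook(lambda m, i, o: cache["act"])
with torch.no_grad():
    y_patched = model(x_base)
hook.remove()

delta_effect = torch.norm(y_base - y_patched)   # Delta_effect from the formula above
print(delta_effect.item())
```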


IV. Causal Abstraction

Discover isomorphic subgraphs via interventional testing

Formal Requirements:

  1. Alignment: $\exists$ a surjective map $\tau: \mathcal{M}_{\text{base}} \to \mathcal{M}_{\text{abstract}}$
  2. Commutation:
    $\forall x,\ \tau(\mathcal{I}_{\text{base}}(x)) = \mathcal{I}_{\text{abstract}}(\tau(x))$
  3. Intervention Consistency:
    $\tau(\mathrm{do}(v_i := u)) = \mathrm{do}(\tau(v_i) := \tau(u))$

Validation Protocol:

  1. Define a hypothesis graph $G_h$
  2. For all edge sets $E \subseteq G_h$, compute:
    $$\rho(E) = \mathbb{E}_x\left[\frac{\partial^2 f}{\partial E\, \partial x}\right]$$
  3. Reject $G_h$ if $\exists e \in E$ with $\rho(e) < \gamma$ (the causal threshold)
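A toy sketch of interchange-intervention testing for these requirements; the tiny base and abstract models and the alignment map tau below are illustrative assumptions (real experiments would use trained networks and learned alignments):

```python
import itertools

def base_model(x, do=None):
    # Low-level model: intermediate nodes are integer sums.
    h = {"s1": x[0] + x[1], "s2": x[2] + x[3]}
    if do:
        h.update(do)                          # intervention do(v_i := u)
    return int(h["s1"] * h["s2"] > 0)

def abstract_model(x, do=None):
    # High-level model: intermediate nodes are booleans.
    h = {"a1": x[0] + x[1] > 0, "a2": x[2] + x[3] > 0}
    if do:
        h.update(do)
    return int(h["a1"] and h["a2"])

def tau(node, value):
    # Alignment map from base nodes/values to abstract nodes/values.
    return {"s1": "a1", "s2": "a2"}[node], value > 0

# Intervention consistency: tau(do(v := u)) on the base model must match
# do(tau(v) := tau(u)) on the abstract model, for every input and intervention.
inputs = list(itertools.product([0, 1], repeat=4))
consistent = all(
    base_model(x, do={node: u}) == abstract_model(x, do=dict([tau(node, u)]))
    for x in inputs for node in ("s1", "s2") for u in (0, 1, 2)
)
print("intervention-consistent:", consistent)
```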

V. Topological Analysis

Study representation manifolds via algebraic topology

Key Tools:

  1. Persistent Homology:
    • Construct a Vietoris–Rips complex on $\{h(x_i)\}$
    • Compute Betti numbers $\beta_k = \dim H_k$ across scales
  2. Local PCA:
    $U_x = \operatorname{argmax}_{U^T U = I} \operatorname{Tr}(U^T \Sigma_x U)$
    where $\Sigma_x = \mathbb{E}_{x'}[(h(x') - \mu_x)(h(x') - \mu_x)^T]$
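A NumPy sketch of the local PCA step: top eigenvectors of the covariance of hidden states in a neighbourhood of one point; the synthetic data, neighbourhood size, and number of components are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
H = rng.normal(size=(1000, 32))              # hidden states h(x') for many inputs
anchor, k = 0, 3                             # anchor point index, components kept

dists = np.linalg.norm(H - H[anchor], axis=1)
local = H[np.argsort(dists)[:50]]            # 50 nearest hidden states
Sigma = np.cov(local, rowvar=False)          # Sigma_x from the definition above

eigvals, eigvecs = np.linalg.eigh(Sigma)     # eigenvalues in ascending order
U_x = eigvecs[:, -k:]                        # argmax of Tr(U^T Sigma U) with U^T U = I
print("local tangent basis:", U_x.shape)
```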

Application:

  • Cluster similar attention heads via persistence diagrams:
    $\mathcal{D}_i = \{(b_j^{(i)}, d_j^{(i)})\}_{j=1}^{n_h}$
    Compute the Wasserstein distance $W_2(\mathcal{D}_m, \mathcal{D}_n)$
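A minimal sketch of this comparison, assuming the third-party ripser and persim packages; the random point clouds below stand in for per-head activation sets:

```python
import numpy as np
from ripser import ripser          # Vietoris-Rips persistence
from persim import wasserstein     # distances between persistence diagrams

rng = np.random.default_rng(0)
acts_head_m = rng.normal(size=(200, 16))     # activations {h(x_i)} under head m
acts_head_n = rng.normal(size=(200, 16))     # activations under head n

# dgms[k] holds the (birth, death) pairs for H_k.
dgm_m = ripser(acts_head_m, maxdim=1)["dgms"][1]
dgm_n = ripser(acts_head_n, maxdim=1)["dgms"][1]

dist = wasserstein(dgm_m, dgm_n)             # diagram distance for clustering heads
print("persistence-diagram distance:", dist)
```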

VI. Gradient-Guided Decomposition

Linearize nonlinear networks via influence mapping

Techniques:

  1. Integrated Gradients (see the sketch after this list):
    $$\text{IG}_i(x) = (x_i - x'_i)\int_0^1 \frac{\partial f(\gamma(\alpha))}{\partial \gamma_i}\, d\alpha, \qquad \gamma(\alpha) = x' + \alpha\,(x - x')$$
  2. TCAV (Concept Activation Vectors):
    Learn $v_c \in \mathbb{R}^d$ such that $S_c(x) = \frac{\partial f}{\partial v_c} = \nabla f \cdot v_c$
  3. Expected Gradients:
    $\mathbb{E}_{x' \sim D}\left[\nabla_x f(x') \cdot (x - x')\right]$
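A minimal PyTorch sketch of the integrated-gradients integral, approximated by a Riemann sum along the straight-line path $\gamma$; the toy model, zero baseline, and step count are illustrative assumptions:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(4, 8), nn.Tanh(), nn.Linear(8, 1))
x = torch.randn(4)
baseline = torch.zeros(4)                    # x' in the formula
steps = 64

total_grad = torch.zeros(4)
for alpha in torch.linspace(0.0, 1.0, steps):
    point = (baseline + alpha * (x - baseline)).requires_grad_(True)  # gamma(alpha)
    model(point).sum().backward()
    total_grad += point.grad

ig = (x - baseline) * total_grad / steps     # IG_i(x) per input dimension
print(ig)
```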

VII. Dynamical Systems Approach

Analyze learning trajectory via ODE formalism

Governing Equation:
$$\frac{d\theta}{dt} = -\eta\, \nabla_\theta \mathcal{L}$$
Unfold as:
$$\frac{dh_l}{dt} = \sum_{k=1}^{L} \frac{\partial h_l}{\partial \theta_k} \frac{d\theta_k}{dt}$$

Stability Analysis:

  1. Linearize around an equilibrium $\theta^*$:
    $$\frac{d\,\delta\theta}{dt} = -H(\theta^*)\,\delta\theta$$
  2. Compute the eigenvalues of the linearized dynamics $-H(\theta^*)$ to identify:
    • Attractor basins (all eigenvalues negative, i.e. $H$ positive definite)
    • Saddle points (mixed-sign eigenvalues)
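A minimal PyTorch sketch of this stability check on a toy loss with a saddle at the origin; the loss function and the candidate equilibrium are illustrative assumptions:

```python
import torch
from torch.autograd.functional import hessian

def loss(theta):
    # Toy non-convex loss: saddle point at theta = 0.
    return theta[0] ** 2 - theta[1] ** 2 + 0.1 * theta.pow(4).sum()

theta_star = torch.zeros(2)                  # candidate equilibrium
H = hessian(loss, theta_star)
eigvals = torch.linalg.eigvalsh(H)
print("Hessian eigenvalues:", eigvals.tolist())

# All H eigenvalues positive -> -H has only negative eigenvalues -> attractor.
# Mixed signs                -> saddle point of the gradient flow.
```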

VIII. Hypergraph Representation

Model higher-order interactions via tensors

Construction:

  1. Build an incidence tensor $T \in \mathbb{R}^{n \times n \times n}$
  2. Entries $T_{ijk} = \text{Influence}(v_i \to v_j \land v_i \to v_k)$
  3. Decompose via CP/PARAFAC:
    $$T \approx \sum_{r=1}^{R} \lambda_r\, a_r \otimes b_r \otimes c_r$$
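A minimal sketch of the CP/PARAFAC step, assuming the third-party tensorly package; the random stand-in tensor and the rank are illustrative assumptions:

```python
import numpy as np
import tensorly as tl
from tensorly.decomposition import parafac

rng = np.random.default_rng(0)
T = tl.tensor(rng.normal(size=(10, 10, 10)))   # stand-in influence tensor T_ijk

# T ~= sum_r lambda_r * (a_r outer b_r outer c_r)
weights, factors = parafac(T, rank=4)
A, B, C = factors                              # each factor matrix is (10, 4)
print(A.shape, B.shape, C.shape, weights.shape)
```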

Analytic Tools:

  • Hypergraph centrality:
    $C(v) = \sum_{e \in E\,:\, v \in e} (|e| - 1)! \cdot \prod_{u \in e} w(u)$

Appendix: Recommended Investigations

  1. Compute path-integrated gradients for transformer attention
  2. Perform persistent homology on concept activation manifolds
  3. Validate causal abstraction via edge deletion experiments

All methods require coupling empirical verification with mathematical proof techniques. Current limitations include exponential scaling of path-based methods and stability challenges in dynamical analyses.
