Mechanistic Interpretability Tutorial
A Mathematical Survey of Decomposition Strategies
I. Prerequisites
Notation:
Neural network $f_\theta: \mathbb{R}^n \to \mathbb{R}^m$ with parameters $\theta \in \mathbb{R}^p$
Layer transformations $h_{l+1} = \sigma(W_l h_l + b_l)$
Key concepts:
Jacobian $J_f(x) = [\partial f_i / \partial x_j]_{ij}$
Hessian $H_f(x) = [\partial^2 f / \partial x_i \partial x_j]_{ij}$; for the loss landscape, $H(\theta) = [\partial^2 \mathcal{L} / \partial \theta_i \partial \theta_j]_{ij}$
Tensor contractions $\langle A, B \rangle_{ij} = \sum_k A_{ik} B_{kj}$
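A minimal computational sketch of this notation, assuming PyTorch; the toy two-layer network, its sizes, and the variable names are illustrative only.

```python
# Toy network f_theta: R^n -> R^m and its Jacobian J_f(x); a sketch for
# concreteness, not a fixed API.
import torch
from torch.autograd.functional import jacobian

n, d_hidden, m = 4, 8, 3
W0, b0 = torch.randn(d_hidden, n), torch.zeros(d_hidden)
W1, b1 = torch.randn(m, d_hidden), torch.zeros(m)

def f(x):
    h1 = torch.relu(W0 @ x + b0)   # layer transformation h_{l+1} = sigma(W_l h_l + b_l)
    return W1 @ h1 + b1            # output in R^m

x = torch.randn(n)
J = jacobian(f, x)                 # J_f(x) = [df_i/dx_j], shape (m, n)
print(J.shape)                     # torch.Size([3, 4])
```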
II. Probe-Based Analysis
Learn linear or nonlinear projections that decode concepts from internal states
Methodology:
Freeze the trained model $f_\theta$
Train a probe network $g_\phi: \mathcal{H} \to \mathcal{Y}$, where $\mathcal{H}$ is the hidden-state space
Optimize $\min_\phi \mathbb{E}_{x,y}[\mathcal{L}(g_\phi(h_l(x)), y)]$
Mathematical Framework:
Consistent Concept Score (CCS):
$$\max_\phi \frac{\mathbb{E}[g_\phi(h(x^+))] - \mathbb{E}[g_\phi(h(x^-))]}{\sqrt{\operatorname{Var}(g_\phi(h(x)))}}$$
where $x^+$ and $x^-$ are positive and negative examples of the concept
Example:
$$\min_w \|w^\top H_{\text{MLP}} - \text{truthfulness scores}\|_2 + \lambda \|w\|_1$$
Sparse weights $w$ reveal critical neurons for truth-telling
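A minimal sketch of this sparse probe, assuming scikit-learn; the hidden states and truthfulness scores below are synthetic stand-ins, and the regularization strength is illustrative.

```python
# Sparse linear probe: L1-penalized regression from hidden states to scores.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n_examples, d_hidden = 512, 64
H_mlp = rng.normal(size=(n_examples, d_hidden))        # frozen hidden states h_l(x)
true_w = np.zeros(d_hidden)
true_w[:5] = 1.0                                       # only 5 "truth" neurons matter here
scores = H_mlp @ true_w + 0.1 * rng.normal(size=n_examples)  # synthetic truthfulness scores

# min_w ||H w - scores||^2 + lambda ||w||_1 (Lasso solves this L1-penalized form)
probe = Lasso(alpha=0.05).fit(H_mlp, scores)
critical_neurons = np.nonzero(probe.coef_)[0]          # sparse w reveals critical neurons
print(critical_neurons)
```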
III. Path Patching
Isolate computational pathways via ablated recomputation
Formalization:
Define a computational graph $G = (V, E)$ with nodes $v \in V$ (neurons)
For input $x$, compute the paths $P(x) = \{p \subseteq E\}$ contributing to the output
Intervene via patch operation:
$$h_l^{\text{patched}} = \begin{cases} h_l(x_{\text{source}}) & \text{if } l \in \mathcal{L}_{\text{patch}} \\ h_l(x_{\text{base}}) & \text{otherwise} \end{cases}$$
Implementation:
$$\Delta_{\text{effect}} = \|f(x_{\text{base}}) - f(x_{\text{base}} \oplus \text{patch}(x_{\text{source}}))\|$$
where $\oplus$ denotes compositional replacement of the patched activations during the forward pass
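A minimal activation-patching sketch along these lines, assuming PyTorch forward hooks on a toy nn.Sequential model; the layer set $\mathcal{L}_{\text{patch}}$ and the model itself are illustrative, not a standard path-patching library.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))
L_patch = {0}                                    # indices of layers to patch
x_base, x_source = torch.randn(1, 8), torch.randn(1, 8)

# 1) Cache h_l(x_source) for the patched layers.
cache = {}
def save_hook(idx):
    def hook(module, inputs, output):
        cache[idx] = output.detach()
    return hook

handles = [model[i].register_forward_hook(save_hook(i)) for i in L_patch]
model(x_source)
for h in handles:
    h.remove()

# 2) Re-run on x_base, swapping in the cached source activations.
def patch_hook(idx):
    def hook(module, inputs, output):
        return cache[idx]                        # returning a value overrides the output
    return hook

handles = [model[i].register_forward_hook(patch_hook(i)) for i in L_patch]
out_patched = model(x_base)
for h in handles:
    h.remove()

delta_effect = torch.norm(model(x_base) - out_patched)   # Delta_effect
print(delta_effect.item())
```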
IV. Causal Abstraction
Discover isomorphic subgraphs via interventional testing
Formal Requirements:
Alignment: $\exists$ a surjective map $\tau: \mathcal{M}_{\text{base}} \to \mathcal{M}_{\text{abstract}}$
Commutation:
$$\forall x,\ \tau(\mathcal{I}_{\text{base}}(x)) = \mathcal{I}_{\text{abstract}}(\tau(x))$$
Intervention Consistency:
$$\tau(\mathrm{do}(v_i := u)) = \mathrm{do}(\tau(v_i) := \tau(u))$$
Validation Protocol:
Define a hypothesis graph $G_h$
For each edge set $E \subseteq E(G_h)$, compute:
$$\rho(E) = \mathbb{E}_x \left[\frac{\partial^2 f}{\partial E\, \partial x}\right]$$
Reject $G_h$ if $\exists e \in E$ with $\rho(e) < \gamma$ (the causal-relevance threshold)
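A minimal intervention-consistency check on a toy circuit, illustrating the commutation requirement above; the hypothetical base/abstract models for a two-step sum and the identity alignment map $\tau$ are illustrative only.

```python
import random

def base_model(a, b, c, do_s=None):
    s = a + b if do_s is None else do_s      # do(v_i := u) on the base intermediate node
    return s + c

def abstract_model(a, b, c, do_S=None):
    S = a + b if do_S is None else do_S      # do(tau(v_i) := tau(u)) on the abstract variable
    return S + c

def tau(s):                                  # alignment map (identity in this toy case)
    return s

# Intervention consistency: intervening on the base node and mapping up should
# match intervening directly on the abstract variable, for all inputs and values.
for _ in range(100):
    a, b, c, u = (random.randint(-5, 5) for _ in range(4))
    assert base_model(a, b, c, do_s=u) == abstract_model(a, b, c, do_S=tau(u))
print("interchange interventions commute on this toy circuit")
```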
V. Topological Analysis
Study representation manifolds via algebraic topology
Key Tools:
Persistent Homology:
Construct a Vietoris–Rips complex on the activation point cloud $\{h(x_i)\}$
Compute Betti numbers $\beta_k = \dim H_k$ across filtration scales
Local PCA:
$$U_x = \operatorname*{argmax}_{U^\top U = I} \operatorname{Tr}(U^\top \Sigma_x U)$$
where $\Sigma_x = \mathbb{E}_{x'}[(h(x') - \mu_x)(h(x') - \mu_x)^\top]$ is the local covariance around $x$
Application:
Cluster similar attention heads via persistence diagrams:
$$\mathcal{D}_i = \{(b_j^{(i)}, d_j^{(i)})\}_{j=1}^{n_h}$$
Compute the Wasserstein distance $W_2(\mathcal{D}_m, \mathcal{D}_n)$ between diagrams
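A minimal sketch of this pipeline, assuming the third-party ripser and persim packages; the per-head activation point clouds below are random placeholders for real attention-head activations.

```python
import numpy as np
from ripser import ripser          # Vietoris-Rips persistent homology
from persim import wasserstein     # Wasserstein distance between diagrams

rng = np.random.default_rng(0)
head_m = rng.normal(size=(200, 16))            # activations {h(x_i)} for head m
head_n = rng.normal(size=(200, 16))            # activations for head n

dgm_m = ripser(head_m, maxdim=1)['dgms'][1]    # H_1 persistence diagram D_m
dgm_n = ripser(head_n, maxdim=1)['dgms'][1]    # H_1 persistence diagram D_n

print(wasserstein(dgm_m, dgm_n))               # distance between diagrams (cf. W_2 above)
```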
VI. Gradient-Guided Decomposition
Linearize nonlinear networks via influence mapping
Techniques:
Integrated Gradients:
$$\text{IG}_i(x) = (x_i - x'_i) \int_0^1 \frac{\partial f(\gamma(\alpha))}{\partial \gamma_i}\, d\alpha$$
where $\gamma(\alpha) = x' + \alpha(x - x')$ interpolates from the baseline $x'$ to the input $x$ (a discretized sketch follows this list)
TCAV (Concept Activation Vectors):
Learn a concept direction $v_c \in \mathbb{R}^d$ such that $S_c(x) = \frac{\partial f}{\partial v_c} = \nabla f \cdot v_c$, the directional derivative of $f$ along $v_c$
Expected Gradients:
$$\mathbb{E}_{x' \sim D}[\nabla_x f(x') \cdot (x - x')]$$
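The discretized integrated-gradients sketch referenced above, assuming PyTorch; the toy model, zero baseline, and step count are illustrative.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(8, 16), nn.Tanh(), nn.Linear(16, 1))
x = torch.randn(8)
baseline = torch.zeros(8)                                 # x'
steps = 64

# gamma(alpha) = x' + alpha (x - x'); Riemann-sum approximation of the integral
alphas = torch.linspace(0.0, 1.0, steps).unsqueeze(1)     # (steps, 1)
path = baseline + alphas * (x - baseline)                 # (steps, 8)
path.requires_grad_(True)
model(path).sum().backward()                              # df/dgamma at each alpha
avg_grad = path.grad.mean(dim=0)                          # approximates the integral
ig = (x - baseline) * avg_grad                            # IG_i(x)
print(ig)
```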
VII. Dynamical Systems Approach
Analyze learning trajectory via ODE formalism
Governing Equation:
$$\frac{d\theta}{dt} = -\eta \nabla_\theta \mathcal{L}$$
This unfolds into hidden-state dynamics:
$$\frac{dh_l}{dt} = \sum_{k=1}^{L} \frac{\partial h_l}{\partial \theta_k} \frac{d\theta_k}{dt}$$
Stability Analysis:
Linearize around an equilibrium $\theta^*$:
$$\frac{d\,\delta\theta}{dt} = -H(\theta^*)\,\delta\theta$$
Compute the eigenvalues of the Hessian $H(\theta^*)$ to classify equilibria:
Attractor basins (all eigenvalues of $H$ positive, so perturbations decay under $-H$)
Saddle points (eigenvalues of mixed sign)
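A minimal stability-analysis sketch, assuming PyTorch; the least-squares problem below stands in for a real training loss, and $\theta^*$ is its closed-form minimizer.

```python
import torch
from torch.autograd.functional import hessian

X = torch.randn(32, 3)
y = X @ torch.tensor([1.0, -2.0, 0.5])

def loss(theta):
    return ((X @ theta - y) ** 2).mean()       # L(theta)

theta_star = torch.linalg.lstsq(X, y.unsqueeze(1)).solution.squeeze()  # equilibrium theta*
H = hessian(loss, theta_star)                  # H(theta*)
eigs = torch.linalg.eigvalsh(H)

# d(delta theta)/dt = -H delta theta: all eigenvalues of H positive => perturbations
# decay (attractor basin); mixed signs would indicate a saddle point.
print(eigs)
```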
VIII. Hypergraph Representation
Model higher-order interactions via tensors
Construction:
Build an incidence tensor $T \in \mathbb{R}^{n \times n \times n}$
Entries $T_{ijk} = \text{Influence}(v_i \to v_j \wedge v_i \to v_k)$, the joint influence of $v_i$ on both $v_j$ and $v_k$
Decompose via CP/PARAFAC:
$$T \approx \sum_{r=1}^{R} \lambda_r\, a_r \otimes b_r \otimes c_r$$
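A minimal CP/PARAFAC sketch, assuming the third-party tensorly package; the influence tensor below is a synthetic low-rank stand-in for $T$.

```python
import numpy as np
import tensorly as tl
from tensorly.decomposition import parafac

rng = np.random.default_rng(0)
n, R = 20, 3
# Synthetic rank-3 "influence" tensor T_ijk = sum_r A_ir B_jr C_kr.
A, B, C = (rng.normal(size=(n, R)) for _ in range(3))
T = np.einsum('ir,jr,kr->ijk', A, B, C)

weights, factors = parafac(tl.tensor(T), rank=R)   # factors ~ [a_r, b_r, c_r] matrices
print([f.shape for f in factors])                  # [(20, 3), (20, 3), (20, 3)]
```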
Analytic Tools:
Hypergraph centrality:
$$C(v) = \sum_{e \in E:\, v \in e} (|e| - 1)! \prod_{u \in e} w(u)$$
Appendix: Recommended Investigations
Compute path-integrated gradients for transformer attention
Perform persistent homology on concept activation manifolds
Validate causal abstraction via edge deletion experiments
All methods require coupling empirical verification with mathematical proof techniques. Current limitations include exponential scaling of path-based methods and stability challenges in dynamical analyses.