Mechanistic Interpretability Tutorial
A Mathematical Survey of Decomposition Strategies
I. Prerequisites
Notation:
Neural network $f_\theta: \mathbb{R}^n \to \mathbb{R}^m$ with parameters $\theta \in \mathbb{R}^p$
Layer transformations $h_{l+1} = \sigma(W_l h_l + b_l)$
Key concepts:
Jacobian $J_f(x) = [\partial f_i / \partial x_j]_{ij}$
Hessian $H_f(x) = [\partial^2 f / \partial x_i \partial x_j]_{ij}$; for the loss landscape, $H(\theta) = [\partial^2 \mathcal{L} / \partial \theta_i \partial \theta_j]_{ij}$
Tensor contractions $\langle A, B \rangle_{ij} = \sum_k A_{ik} B_{kj}$
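A minimal computational sketch of this notation, assuming PyTorch; the toy two-layer network, its sizes, and the variable names are illustrative only.

```python
# Toy network f_theta: R^n -> R^m and its Jacobian J_f(x); a sketch for
# concreteness, not a fixed API.
import torch
from torch.autograd.functional import jacobian

n, d_hidden, m = 4, 8, 3
W0, b0 = torch.randn(d_hidden, n), torch.zeros(d_hidden)
W1, b1 = torch.randn(m, d_hidden), torch.zeros(m)

def f(x):
    h1 = torch.relu(W0 @ x + b0)   # layer transformation h_{l+1} = sigma(W_l h_l + b_l)
    return W1 @ h1 + b1            # output in R^m

x = torch.randn(n)
J = jacobian(f, x)                 # J_f(x) = [df_i/dx_j], shape (m, n)
print(J.shape)                     # torch.Size([3, 4])
```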
II. Probe-Based Analysis
Learn linear or nonlinear projections that decode concepts from internal states
Methodology:
Freeze the trained model $f_\theta$
Train a probe network $g_\phi: \mathcal{H} \to \mathcal{Y}$, where $\mathcal{H}$ is the hidden-state space
Optimize $\min_\phi \mathbb{E}_{x,y}[\mathcal{L}(g_\phi(h_l(x)), y)]$
Mathematical Framework:
Consistent Concept Score (CCS):
$$\max_\phi \frac{\mathbb{E}[g_\phi(h(x^+))] - \mathbb{E}[g_\phi(h(x^-))]}{\sqrt{\operatorname{Var}(g_\phi(h(x)))}}$$
where $x^+$ and $x^-$ are positive and negative examples of the concept
Example:
$$\min_w \|w^\top H_{\text{MLP}} - \text{truthfulness scores}\|_2 + \lambda \|w\|_1$$
Sparse weights $w$ reveal critical neurons for truth-telling
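A minimal sketch of this sparse probe, assuming scikit-learn; the hidden states and truthfulness scores below are synthetic stand-ins, and the regularization strength is illustrative.

```python
# Sparse linear probe: L1-penalized regression from hidden states to scores.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n_examples, d_hidden = 512, 64
H_mlp = rng.normal(size=(n_examples, d_hidden))        # frozen hidden states h_l(x)
true_w = np.zeros(d_hidden)
true_w[:5] = 1.0                                       # only 5 "truth" neurons matter here
scores = H_mlp @ true_w + 0.1 * rng.normal(size=n_examples)  # synthetic truthfulness scores

# min_w ||H w - scores||^2 + lambda ||w||_1 (Lasso solves this L1-penalized form)
probe = Lasso(alpha=0.05).fit(H_mlp, scores)
critical_neurons = np.nonzero(probe.coef_)[0]          # sparse w reveals critical neurons
print(critical_neurons)
```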
III. Path Patching
Isolate computational pathways via ablated recomputation
Formalization:
Define a computational graph $G = (V, E)$ with nodes $v \in V$ (neurons)
For input $x$, compute the paths $P(x) = \{p \subseteq E\}$ contributing to the output
Intervene via patch operation:
$$h_l^{\text{patched}} = \begin{cases} h_l(x_{\text{source}}) & \text{if } l \in \mathcal{L}_{\text{patch}} \\ h_l(x_{\text{base}}) & \text{otherwise} \end{cases}$$
Implementation:
$$\Delta_{\text{effect}} = \|f(x_{\text{base}}) - f(x_{\text{base}} \oplus \text{patch}(x_{\text{source}}))\|$$
where $\oplus$ denotes compositional replacement of the patched activations during the forward pass
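A minimal activation-patching sketch along these lines, assuming PyTorch forward hooks on a toy nn.Sequential model; the layer set $\mathcal{L}_{\text{patch}}$ and the model itself are illustrative, not a standard path-patching library.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))
L_patch = {0}                                    # indices of layers to patch
x_base, x_source = torch.randn(1, 8), torch.randn(1, 8)

# 1) Cache h_l(x_source) for the patched layers.
cache = {}
def save_hook(idx):
    def hook(module, inputs, output):
        cache[idx] = output.detach()
    return hook

handles = [model[i].register_forward_hook(save_hook(i)) for i in L_patch]
model(x_source)
for h in handles:
    h.remove()

# 2) Re-run on x_base, swapping in the cached source activations.
def patch_hook(idx):
    def hook(module, inputs, output):
        return cache[idx]                        # returning a value overrides the output
    return hook

handles = [model[i].register_forward_hook(patch_hook(i)) for i in L_patch]
out_patched = model(x_base)
for h in handles:
    h.remove()

delta_effect = torch.norm(model(x_base) - out_patched)   # Delta_effect
print(delta_effect.item())
```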
IV. Causal Abstraction
Discover isomorphic subgraphs via interventional testing
Formal Requirements:
Alignment: $\exists$ a surjective map $\tau: \mathcal{M}_{\text{base}} \to \mathcal{M}_{\text{abstract}}$
Commutation:
$$\forall x,\ \tau(\mathcal{I}_{\text{base}}(x)) = \mathcal{I}_{\text{abstract}}(\tau(x))$$
Intervention Consistency:
$$\tau(\mathrm{do}(v_i := u)) = \mathrm{do}(\tau(v_i) := \tau(u))$$
Validation Protocol:
Define a hypothesis graph $G_h$
For each edge set $E \subseteq E(G_h)$, compute:
$$\rho(E) = \mathbb{E}_x \left[\frac{\partial^2 f}{\partial E\, \partial x}\right]$$
Reject $G_h$ if $\exists e \in E$ with $\rho(e) < \gamma$ (the causal-relevance threshold)
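A minimal intervention-consistency check on a toy circuit, illustrating the commutation requirement above; the hypothetical base/abstract models for a two-step sum and the identity alignment map $\tau$ are illustrative only.

```python
import random

def base_model(a, b, c, do_s=None):
    s = a + b if do_s is None else do_s      # do(v_i := u) on the base intermediate node
    return s + c

def abstract_model(a, b, c, do_S=None):
    S = a + b if do_S is None else do_S      # do(tau(v_i) := tau(u)) on the abstract variable
    return S + c

def tau(s):                                  # alignment map (identity in this toy case)
    return s

# Intervention consistency: intervening on the base node and mapping up should
# match intervening directly on the abstract variable, for all inputs and values.
for _ in range(100):
    a, b, c, u = (random.randint(-5, 5) for _ in range(4))
    assert base_model(a, b, c, do_s=u) == abstract_model(a, b, c, do_S=tau(u))
print("interchange interventions commute on this toy circuit")
```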
V. Topological Analysis
Study representation manifolds via algebraic topology
Key Tools:
Persistent Homology:
Construct a Vietoris–Rips complex on the activation point cloud $\{h(x_i)\}$
Compute Betti numbers $\beta_k = \dim H_k$ across filtration scales
Local PCA:
$$U_x = \operatorname*{argmax}_{U^\top U = I} \operatorname{Tr}(U^\top \Sigma_x U)$$
where $\Sigma_x = \mathbb{E}_{x'}[(h(x') - \mu_x)(h(x') - \mu_x)^\top]$ is the local covariance around $x$
Application:
Cluster similar attention heads via persistence diagrams:
$$\mathcal{D}_i = \{(b_j^{(i)}, d_j^{(i)})\}_{j=1}^{n_h}$$
Compute the Wasserstein distance $W_2(\mathcal{D}_m, \mathcal{D}_n)$ between diagrams
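A minimal sketch of this pipeline, assuming the third-party ripser and persim packages; the per-head activation point clouds below are random placeholders for real attention-head activations.

```python
import numpy as np
from ripser import ripser          # Vietoris-Rips persistent homology
from persim import wasserstein     # Wasserstein distance between diagrams

rng = np.random.default_rng(0)
head_m = rng.normal(size=(200, 16))            # activations {h(x_i)} for head m
head_n = rng.normal(size=(200, 16))            # activations for head n

dgm_m = ripser(head_m, maxdim=1)['dgms'][1]    # H_1 persistence diagram D_m
dgm_n = ripser(head_n, maxdim=1)['dgms'][1]    # H_1 persistence diagram D_n

print(wasserstein(dgm_m, dgm_n))               # distance between diagrams (cf. W_2 above)
```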
VI. Gradient-Guided Decomposition
Linearize nonlinear networks via influence mapping
Techniques:
Integrated Gradients:
$$\text{IG}_i(x) = (x_i - x'_i) \int_0^1 \frac{\partial f(\gamma(\alpha))}{\partial \gamma_i}\, d\alpha$$
where $\gamma(\alpha) = x' + \alpha(x - x')$ interpolates from the baseline $x'$ to the input $x$ (a discretized sketch follows this list)
TCAV (Concept Activation Vectors):
Learn a concept direction $v_c \in \mathbb{R}^d$ such that $S_c(x) = \frac{\partial f}{\partial v_c} = \nabla f \cdot v_c$, the directional derivative of $f$ along $v_c$
Expected Gradients:
$$\mathbb{E}_{x' \sim D}[\nabla_x f(x') \cdot (x - x')]$$
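The discretized integrated-gradients sketch referenced above, assuming PyTorch; the toy model, zero baseline, and step count are illustrative.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(8, 16), nn.Tanh(), nn.Linear(16, 1))
x = torch.randn(8)
baseline = torch.zeros(8)                                 # x'
steps = 64

# gamma(alpha) = x' + alpha (x - x'); Riemann-sum approximation of the integral
alphas = torch.linspace(0.0, 1.0, steps).unsqueeze(1)     # (steps, 1)
path = baseline + alphas * (x - baseline)                 # (steps, 8)
path.requires_grad_(True)
model(path).sum().backward()                              # df/dgamma at each alpha
avg_grad = path.grad.mean(dim=0)                          # approximates the integral
ig = (x - baseline) * avg_grad                            # IG_i(x)
print(ig)
```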
VII. Dynamical Systems Approach
Analyze learning trajectory via ODE formalism
Governing Equation:
$$\frac{d\theta}{dt} = -\eta \nabla_\theta \mathcal{L}$$
This unfolds into hidden-state dynamics:
$$\frac{dh_l}{dt} = \sum_{k=1}^{L} \frac{\partial h_l}{\partial \theta_k} \frac{d\theta_k}{dt}$$
Stability Analysis:
Linearize around an equilibrium $\theta^*$:
$$\frac{d\,\delta\theta}{dt} = -H(\theta^*)\,\delta\theta$$
Compute the eigenvalues of the Hessian $H(\theta^*)$ to classify equilibria:
Attractor basins (all eigenvalues of $H$ positive, so perturbations decay under $-H$)
Saddle points (eigenvalues of mixed sign)
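A minimal stability-analysis sketch, assuming PyTorch; the least-squares problem below stands in for a real training loss, and $\theta^*$ is its closed-form minimizer.

```python
import torch
from torch.autograd.functional import hessian

X = torch.randn(32, 3)
y = X @ torch.tensor([1.0, -2.0, 0.5])

def loss(theta):
    return ((X @ theta - y) ** 2).mean()       # L(theta)

theta_star = torch.linalg.lstsq(X, y.unsqueeze(1)).solution.squeeze()  # equilibrium theta*
H = hessian(loss, theta_star)                  # H(theta*)
eigs = torch.linalg.eigvalsh(H)

# d(delta theta)/dt = -H delta theta: all eigenvalues of H positive => perturbations
# decay (attractor basin); mixed signs would indicate a saddle point.
print(eigs)
```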
VIII. Hypergraph Representation
Model higher-order interactions via tensors
Construction:
Build an incidence tensor $T \in \mathbb{R}^{n \times n \times n}$
Entries $T_{ijk} = \text{Influence}(v_i \to v_j \wedge v_i \to v_k)$, the joint influence of $v_i$ on both $v_j$ and $v_k$
Decompose via CP/PARAFAC:
$$T \approx \sum_{r=1}^{R} \lambda_r\, a_r \otimes b_r \otimes c_r$$
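A minimal CP/PARAFAC sketch, assuming the third-party tensorly package; the influence tensor below is a synthetic low-rank stand-in for $T$.

```python
import numpy as np
import tensorly as tl
from tensorly.decomposition import parafac

rng = np.random.default_rng(0)
n, R = 20, 3
# Synthetic rank-3 "influence" tensor T_ijk = sum_r A_ir B_jr C_kr.
A, B, C = (rng.normal(size=(n, R)) for _ in range(3))
T = np.einsum('ir,jr,kr->ijk', A, B, C)

weights, factors = parafac(tl.tensor(T), rank=R)   # factors ~ [a_r, b_r, c_r] matrices
print([f.shape for f in factors])                  # [(20, 3), (20, 3), (20, 3)]
```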
Analytic Tools:
Hypergraph centrality:
$$C(v) = \sum_{e \in E:\, v \in e} (|e| - 1)! \prod_{u \in e} w(u)$$
Appendix: Recommended Investigations
Compute path-integrated gradients for transformer attention
Perform persistent homology on concept activation manifolds
Validate causal abstraction via edge deletion experiments
All methods require coupling empirical verification with mathematical proof techniques. Current limitations include exponential scaling of path-based methods and stability challenges in dynamical analyses.