Write a detailed tutorial on different approaches to mechanistic interpretability. Assume a detailed mathematical background but not domain knowledge.
DeepSeek R1 via @hdevalence.bsky.social
# Mechanistic Interpretability: Mathematical Approaches
## 1. Fundamental Framework
Let a neural network be parametrized as $f_\theta: \mathcal{X} \rightarrow \mathcal{Y}$ with $L$ layers:
$$f_\theta(x) = \sigma_L(W_L\sigma_{L-1}(\cdots W_2\sigma_1(W_1x + b_1)\cdots) + b_L)$$
where $\sigma_l$ denotes the layer-$l$ nonlinearity and $W_l \in \mathbb{R}^{d_l \times d_{l-1}}$, $b_l \in \mathbb{R}^{d_l}$ are the weight matrices and bias vectors.
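A minimal sketch of this forward pass, assuming NumPy and ReLU nonlinearities; the layer widths are illustrative placeholders.

```python
# A layered forward pass matching the formula above (assumption: ReLU nonlinearities).
import numpy as np

def forward(x, weights, biases, sigma=lambda z: np.maximum(z, 0.0)):
    """f_theta(x) = sigma_L(W_L sigma_{L-1}(... sigma_1(W_1 x + b_1) ...) + b_L)."""
    a = x
    for W, b in zip(weights, biases):
        a = sigma(W @ a + b)
    return a

rng = np.random.default_rng(0)
dims = [8, 16, 16, 4]                                          # d_0, ..., d_L (placeholders)
weights = [rng.normal(size=(dims[l + 1], dims[l])) for l in range(len(dims) - 1)]
biases = [np.zeros(dims[l + 1]) for l in range(len(dims) - 1)]
print(forward(rng.normal(size=dims[0]), weights, biases))
```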
---
## 2. Core Approaches
### 2.1 Circuit Analysis
**Objective:** Identify a minimal subnetwork responsible for a specific behavior.
For an input set $S$ exhibiting the behavior $P$, find a minimal binary mask $M \in \{0,1\}^{|\theta|}$ such that
$$\frac{1}{|S|}\sum_{x\in S} \left\|f_{\theta \odot M}(x) \odot 1_P - f_\theta(x) \odot 1_P\right\| < \epsilon,$$
where $\theta \odot M$ zeroes out the masked-off parameters and $1_P$ selects the output coordinates relevant to $P$.
**Methods:**
- Path-integrated gradients (attribution of parameter $\theta_i$ along an input interpolation path): $\phi_i = \int_{0}^{1} \frac{\partial f_\theta(x_0 + \alpha(x_1-x_0))}{\partial \theta_i}\, d\alpha$
- Edge attribution (sensitivity of the target-vs-control logit difference): $E_{jk}^{(l)} = \frac{\partial (y_t - y_c)}{\partial W_{jk}^{(l)}}$
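A sketch of how a candidate mask can be evaluated against the criterion above, assuming a PyTorch model; the model, mask sparsity, and tolerance are placeholders, and for simplicity the whole output is compared rather than only the $1_P$-selected coordinates.

```python
# Evaluate a candidate parameter mask M: zero out masked parameters,
# compare outputs to the full model, then restore the original weights.
import torch
import torch.nn as nn

def masked_error(model, masks, inputs, eps=1e-2):
    originals = [p.detach().clone() for p in model.parameters()]
    with torch.no_grad():
        y_full = model(inputs)
        for p, m in zip(model.parameters(), masks):
            p.mul_(m)                                  # theta <- theta ⊙ M
        y_masked = model(inputs)
        for p, orig in zip(model.parameters(), originals):
            p.copy_(orig)                              # restore theta
    err = (y_masked - y_full).norm(dim=-1).mean().item()
    return err, err < eps

model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))
masks = [(torch.rand_like(p) > 0.2).float() for p in model.parameters()]  # random sparsity
print(masked_error(model, masks, torch.randn(32, 8)))
```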
---
### 2.2 Feature Visualization
**Objective:** Find prototypical inputs that maximize a neuron's activation
For neuron $(\ell,j)$ solve:
$$\underset{x}{\text{maximize}} \quad a_j^{(\ell)}(x) - \lambda_1||x||^2 - \lambda_2\text{TV}(x)$$
**Advanced Variants:**
- Fourier-domain parametrization: optimize $\nu$ and set $x = \mathcal{F}^{-1}(\nu)$ to suppress high-frequency artifacts
- Channel isolation: constrain $\sum_{i\neq j}||a_i^{(\ell)}(x)||^2 < \delta$ so the optimized input excites neuron $j$ specifically rather than the whole layer
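A sketch of the basic optimization in PyTorch; the model, image shape, and regularization weights are placeholders, and for simplicity the maximized "neuron" is an output unit rather than an intermediate activation.

```python
# Gradient ascent on the input with L2 and total-variation penalties.
import torch
import torch.nn as nn

def total_variation(x):
    """Anisotropic TV penalty for a (1, C, H, W) image tensor."""
    return (x[..., 1:, :] - x[..., :-1, :]).abs().mean() + \
           (x[..., :, 1:] - x[..., :, :-1]).abs().mean()

def visualize(model, unit, steps=200, lr=0.05, lam1=1e-3, lam2=1e-2):
    x = torch.randn(1, 3, 32, 32, requires_grad=True)
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        activation = model(x)[0, unit]                 # a_j(x), here an output unit
        loss = -activation + lam1 * x.pow(2).sum() + lam2 * total_variation(x)
        loss.backward()
        opt.step()
    return x.detach()

model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                      nn.Flatten(), nn.Linear(8 * 32 * 32, 10))
prototype = visualize(model, unit=3)
```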
---
### 2.3 Manifold Analysis
**Key Idea:** Study activation space geometry through spectral methods
For activation matrix $A \in \mathbb{R}^{n \times d}$:
1. **PCA:** $A = U\Sigma V^\top$, analyze $v_1,...,v_k$ (principal directions)
2. **TDA:** Build persistent homology from $\{a_i\}_{i=1}^n \subset \mathbb{R}^d$
3. **Topographic Maps:**
$$\underset{P}{\text{min}} \sum_{i,j} (||Pa_i - Pa_j||^2 - \log(1 + ||a_i - a_j||^2))$$
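A sketch of step 1 (PCA via SVD of the centered activation matrix) in NumPy, using random activations as a stand-in for recorded layer outputs.

```python
# PCA of a (centered) activation matrix A via SVD.
import numpy as np

A = np.random.default_rng(0).normal(size=(500, 64))     # n activations in R^d
A_centered = A - A.mean(axis=0, keepdims=True)
U, S, Vt = np.linalg.svd(A_centered, full_matrices=False)

k = 5
principal_directions = Vt[:k]                            # v_1, ..., v_k
explained = (S[:k] ** 2) / (S ** 2).sum()                # variance captured per direction
coords = A_centered @ principal_directions.T             # coordinates in the top-k subspace
print(explained)
```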
---
### 2.4 Causal Abstraction
**Formalism:**
1. Define interventions $do(A^{(\ell)} = \tilde{a})$
2. Measure effect size $\tau = \mathbb{E}[f_\theta(x)\,|\,do(A^{(\ell)} = \tilde{a})] - \mathbb{E}[f_\theta(x)]$
3. Build structural equation model:
$$y = g(a_1,...,a_L) + \epsilon, \quad a_i = h_i(pa(a_i)) + \epsilon_i$$
**Key Tests:**
- No confounding (observational and interventional conditioning agree): $P(y|do(a_i)) = P(y|a_i)$
- Counterfactual equivalence: the activation-level intervention and the corresponding parameter-level ablation produce the same output, $f_\theta(x)_{a_i\leftarrow a'_i} = f_\theta(x)_{\theta_j=0}$
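A sketch of step 2, implementing $do(A^{(\ell)} = \tilde a)$ with a PyTorch forward hook; the model, patched layer, and patch value are placeholders.

```python
# Intervene on a layer's activations with a forward hook and measure the
# mean shift in the model's output.
import torch
import torch.nn as nn

def effect_size(model, layer, a_tilde, inputs):
    """tau = E[f(x) | do(A = a_tilde)] - E[f(x)]."""
    def patch(module, inp, out):
        return a_tilde.expand_as(out)                  # overwrite the activations

    with torch.no_grad():
        baseline = model(inputs).mean(dim=0)
        handle = layer.register_forward_hook(patch)
        intervened = model(inputs).mean(dim=0)
        handle.remove()
    return intervened - baseline

model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))
tau = effect_size(model, layer=model[1], a_tilde=torch.zeros(16), inputs=torch.randn(64, 8))
print(tau)
```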
---
### 2.5 Algebraic Methods
For transformers, treat each attention pattern, once computed, as a linear mixing of token representations.
Let $A_t^{(l)} = \text{softmax}\!\left(\frac{Q_t^{(l)} K_t^{(l)\top}}{\sqrt{d_k}}\right)$ be the attention matrix of head $t$ in layer $l$. Compute:
1. **Effective diffusion:**
$$\Psi = \prod_{l=1}^L \left(\sum_{t=1}^h A_t^{(l)}\right)$$
2. **Eigen-analysis:**
Solve $A_t^{(l)}v = \lambda v$; eigenvectors with $|\lambda|$ near 1 indicate directions that persist under repeated application of the attention map
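A sketch of both computations in NumPy, with random row-stochastic matrices standing in for recorded attention patterns; note that the product here averages over heads rather than summing, so $\Psi$ stays row-stochastic.

```python
# Head-averaged layer-by-layer product and eigen-analysis of a single head.
import numpy as np

rng = np.random.default_rng(0)
n_layers, n_heads, seq = 4, 2, 6
logits = rng.normal(size=(n_layers, n_heads, seq, seq))
A = np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)   # row-stochastic attention

Psi = np.eye(seq)
for l in range(n_layers):
    Psi = A[l].mean(axis=0) @ Psi                      # apply layer l's averaged pattern

eigvals, eigvecs = np.linalg.eig(A[0, 0])              # single head, layer 1
order = np.argsort(-np.abs(eigvals))
print(np.abs(eigvals[order][:3]))                      # |lambda| near 1 => persistent directions
```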
---
## 3. Synthesis Techniques
### 3.1 Integrated Gradients Path
Combine feature visualization with attribution:
$$\Phi_j(x) = (x_j - x'_j) \int_0^1 \frac{\partial f(x' + \alpha(x-x'))}{\partial x_j}\, d\alpha$$
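A sketch of this formula as a Riemann-sum approximation in PyTorch; the model, baseline, and target index are placeholders.

```python
# Riemann-sum approximation of integrated gradients with respect to the input.
import torch
import torch.nn as nn

def integrated_gradients(model, x, baseline, target, steps=64):
    alphas = torch.linspace(0.0, 1.0, steps).view(-1, *([1] * x.dim()))
    path = (baseline + alphas * (x - baseline)).detach().requires_grad_(True)
    out = model(path)[:, target].sum()                 # f_target along the path
    grads = torch.autograd.grad(out, path)[0]
    return (x - baseline) * grads.mean(dim=0)          # (x - x') ⊙ averaged gradient

model = nn.Sequential(nn.Linear(8, 4))
x, baseline = torch.randn(8), torch.zeros(8)
phi = integrated_gradients(model, x, baseline, target=1)
# Completeness check (approximate): attributions sum to f(x) - f(x').
print(phi.sum().item(), (model(x)[1] - model(baseline)[1]).item())
```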
### 3.2 Causal Circuit Discovery
1. Compute attribution scores $\{\phi_i\}$
2. Retain parameters with $\phi_i > \tau$ and prune the rest
3. Verify counterfactuals match original behavior
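A sketch of the three steps, assuming a PyTorch model and using the simple score $|\theta_i \cdot \partial\mathcal{L}/\partial\theta_i|$ as a stand-in for $\phi_i$; the threshold quantile is arbitrary.

```python
# Score parameters, threshold to a mask, and check the pruned circuit's agreement.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 2))
x, y = torch.randn(64, 8), torch.randint(0, 2, (64,))

# Step 1: per-parameter attribution scores (stand-in for phi_i)
loss = nn.functional.cross_entropy(model(x), y)
grads = torch.autograd.grad(loss, list(model.parameters()))
scores = [(p * g).abs() for p, g in zip(model.parameters(), grads)]

# Step 2: retain only parameters scoring above the threshold tau
tau = torch.cat([s.flatten() for s in scores]).quantile(0.7)
masks = [(s > tau).float() for s in scores]

# Step 3: verify the circuit reproduces the full model's predictions
with torch.no_grad():
    full_pred = model(x).argmax(-1)
    for p, m in zip(model.parameters(), masks):
        p.mul_(m)
    circuit_pred = model(x).argmax(-1)
print("agreement:", (full_pred == circuit_pred).float().mean().item())
```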
### 3.3 Topological Regularization
When training interpretable models:
$$\mathcal{L} = \mathcal{L}_{task} + \lambda \sum_{\ell=1}^L \text{rank}(A^{(\ell)})$$
where in practice the rank is replaced by a differentiable surrogate such as the nuclear norm $||A^{(\ell)}||_*$.
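A sketch of this objective with the nuclear-norm surrogate, assuming a small PyTorch model; the penalized layer and the weight $\lambda$ are placeholders.

```python
# Task loss plus a nuclear-norm penalty on the hidden activation matrix.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 2))
x, y = torch.randn(64, 8), torch.randint(0, 2, (64,))
lam = 1e-3

hidden = model[1](model[0](x))                               # A^{(1)}: hidden activations
logits = model[2](hidden)
task_loss = nn.functional.cross_entropy(logits, y)
rank_penalty = torch.linalg.matrix_norm(hidden, ord="nuc")   # differentiable rank surrogate
loss = task_loss + lam * rank_penalty
loss.backward()
```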
---
## 4. Validation Metrics
- **Circuit fidelity:** $\frac{\text{Accuracy with circuit}}{\text{Full model accuracy}}$
- **Causal effect:** $\tau/\sigma_y$ (Cohen's d)
- **Interpretability score:** $1 - \frac{\mathbb{E}[H(p(y|x))]}{H_0}$, where $H_0$ is the entropy of the uniform predictive distribution
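A sketch of computing circuit fidelity and the interpretability score in NumPy, with random predictive distributions standing in for model outputs.

```python
# Circuit fidelity and interpretability score from predictive distributions.
import numpy as np

rng = np.random.default_rng(0)
n, c = 200, 4
p_full = rng.dirichlet(np.ones(c), size=n)               # full model's p(y|x)
p_circuit = rng.dirichlet(np.ones(c), size=n)             # circuit's p(y|x)
labels = rng.integers(0, c, size=n)

fidelity = (p_circuit.argmax(-1) == labels).mean() / (p_full.argmax(-1) == labels).mean()

entropy = -(p_full * np.log(p_full + 1e-12)).sum(-1).mean()   # E[H(p(y|x))]
interpretability = 1.0 - entropy / np.log(c)                   # H_0 = log(num classes)
print(fidelity, interpretability)
```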
[Visual: shows analytical workflow combining multiple methods]
In modern research pipelines these techniques are interleaved, with the mathematical formalism enabling rigorous validation of hypotheses about neural network mechanisms.