Mechanistic Interpretability: A Mathematical Framework
1. Activation Space Decomposition
Core Idea: Analyze layer activations $A \in \mathbb{R}^{n \times d}$ for $n$ samples and $d$ neurons through dimensionality reduction.
Methods:
Spectral Decomposition:
$\underset{(n \times d)}{A} = U \Sigma V^T$ (SVD)
Truncate to rank $r$: $\|A - U_r \Sigma_r V_r^T\|_F^2 \leq \epsilon$
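A minimal sketch of the spectral route in NumPy, choosing the smallest rank $r$ that meets the Frobenius-error budget $\epsilon$; the activation matrix here is synthetic.

```python
import numpy as np

def truncated_svd(A, eps):
    """Truncate the SVD of an activation matrix A (n samples x d neurons)
    to the smallest rank r whose squared Frobenius error is <= eps."""
    U, S, Vt = np.linalg.svd(A, full_matrices=False)
    sq = S ** 2
    # tail[r] = energy discarded when keeping only the top-r singular values
    tail = np.concatenate([np.cumsum(sq[::-1])[::-1], [0.0]])
    r = int(np.argmax(tail <= eps))
    A_r = (U[:, :r] * S[:r]) @ Vt[:r]   # rank-r reconstruction
    return A_r, r

# Example: 1000 samples of a 64-neuron layer with (approximately) rank-5 structure
A = np.random.randn(1000, 5) @ np.random.randn(5, 64) + 0.01 * np.random.randn(1000, 64)
A_r, r = truncated_svd(A, eps=10.0)
print(r, np.linalg.norm(A - A_r, "fro") ** 2)
```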
Nonlinear Manifold Learning:
Find an embedding $E = f_\theta(A)$ minimizing:
$\mathcal{L}_{\mathrm{top}} = \sum_{i,j} \big( d_A(a_i, a_j) - \|e_i - e_j\| \big)^2$
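One concrete way to minimize a stress objective of exactly this form is metric multidimensional scaling; a short sketch assuming scikit-learn and SciPy are available, with Euclidean $d_A$ and a synthetic activation matrix.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.manifold import MDS

# Toy activation matrix: 200 samples of a 64-neuron layer
A = np.random.default_rng(0).standard_normal((200, 64))

D = squareform(pdist(A))      # pairwise distances d_A(a_i, a_j)
mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
E = mds.fit_transform(D)      # embedding e_i found by minimizing the stress
print(E.shape, mds.stress_)   # stress_ is the final value of the objective
```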
Strengths: Exact preservation of linear algebraic structure (spectral route)
Limitations: Nonlinearities require approximation (e.g., Lie-group-style approximations)
2. Causal Abstraction
Formalization: For a neural network $N : \mathcal{X} \to \mathcal{Y}$, define concepts $\mathcal{C} = \{c_i\}$ with:
$v_c(x) = \mathbb{E}[\, N(x) \mid \text{concept } c \text{ present} \,]$
Through an intervention operator $\mathcal{I}_c$:
$\Delta N(x) = \|N(x) - \mathcal{I}_c(N(x))\|_{\mathcal{H}}$
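As an illustration, one common instantiation of $\mathcal{I}_c$ is an activation-level ablation: project the concept direction $v_c$ out of a hidden layer and compare outputs. A toy sketch; the two-layer network and the direction `v_c` are stand-ins, not a prescribed implementation.

```python
import numpy as np

def forward(x, W1, W2, intervene=None):
    """Toy two-layer ReLU network; `intervene` optionally edits the hidden state."""
    h = np.maximum(W1 @ x, 0.0)
    if intervene is not None:
        h = intervene(h)
    return W2 @ h

def ablate_concept(v_c):
    """Intervention I_c: remove the component of h along the concept direction v_c."""
    u = v_c / np.linalg.norm(v_c)
    return lambda h: h - (h @ u) * u

rng = np.random.default_rng(0)
W1, W2 = rng.standard_normal((32, 16)), rng.standard_normal((4, 32))
x, v_c = rng.standard_normal(16), rng.standard_normal(32)

# ΔN(x): output shift caused by ablating the concept direction
delta = np.linalg.norm(forward(x, W1, W2) - forward(x, W1, W2, ablate_concept(v_c)))
print(delta)
```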
Testing Protocol:
Establish the variance ratio $R_c = \frac{\mathbb{V}[v_c]}{\mathbb{V}[N]}$
Compute the knockoff statistic:
$\tau_c = \sup_{x} \frac{d}{d\lambda} N(x + \lambda v_c) \Big|_{\lambda=0}$
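A finite-difference estimate of the directional derivative inside $\tau_c$, with the supremum approximated over a batch of probe inputs; the scalar-output `model` is a placeholder.

```python
import numpy as np

def tau_c(model, X, v_c, h=1e-3):
    """Estimate tau_c = sup_x d/dλ model(x + λ v_c)|_{λ=0} by central
    differences, taking the max over the probe inputs X (n, d)."""
    derivs = (model(X + h * v_c) - model(X - h * v_c)) / (2 * h)
    return float(np.max(np.abs(derivs)))

# Toy scalar-output model
rng = np.random.default_rng(0)
w = rng.standard_normal(16)
model = lambda X: np.tanh(X @ w)
X, v_c = rng.standard_normal((256, 16)), rng.standard_normal(16)
print(tau_c(model, X, v_c))
```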
3. Differential Analysis
First-Order Tooling:
For input $x$ and class logit $f_c(x)$:
$\text{Saliency}(x) = \nabla_x f_c(x)$
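A short autograd sketch of the saliency map; the linear model is only a stand-in for a real classifier.

```python
import torch

def saliency(model, x, c):
    """Gradient of the class-c logit f_c(x) with respect to the input x."""
    x = x.clone().detach().requires_grad_(True)
    model(x)[c].backward()
    return x.grad

# Toy classifier: 10 classes over a 784-dimensional input
model = torch.nn.Linear(784, 10)
x = torch.randn(784)
print(saliency(model, x, c=3).shape)   # torch.Size([784])
```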
Second-Order Methods:
Construct the influence matrix:
$\mathcal{H}_{ij} = \frac{\partial^2 \mathcal{L}}{\partial w_i \, \partial w_j}$
Eigendecomposition:
$\mathcal{H} = Q \Lambda Q^T \;\Rightarrow\; \lambda_{\min}/\lambda_{\max} = \kappa^{-1}$
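For a full network the Hessian is far too large to materialize (Hessian-vector products are used instead), but the decomposition is easy to see at small scale. A sketch on a least-squares toy problem:

```python
import torch
from torch.autograd.functional import hessian

# Toy loss: least squares over an 8-dimensional weight vector
X, y = torch.randn(100, 8), torch.randn(100)
loss = lambda w: ((X @ w - y) ** 2).mean()

H = hessian(loss, torch.zeros(8))            # 8x8 influence matrix H_ij
evals, Q = torch.linalg.eigh(H)              # H = Q Λ Q^T
print((evals.min() / evals.max()).item())    # inverse condition number κ^{-1}
```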
4. Computational Circuit Mapping
Component Identification:
Define functional units $\Phi = \{\phi_i : \mathbb{R}^d \to \mathbb{R}^k\}$
Compute the path activation:
$A_\pi(x) = \Big( \prod_{l \in \pi} W^{(l)} \Big) x$
Optimize the subnetwork mask:
$\min_M \|N(x) - M \odot A_\pi(x)\| + \lambda \|M\|_0$
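Since the $\|M\|_0$ penalty is not differentiable, a common surrogate is an $L_1$ penalty on a sigmoid-parameterized mask. A hedged sketch on synthetic path activations, not a prescribed recipe:

```python
import torch

def fit_circuit_mask(N_out, A_pi, lam=1e-2, steps=500):
    """Learn a soft mask M in [0,1]^k so that M ⊙ A_pi matches the network
    output, with the L0 penalty relaxed to L1."""
    logits = torch.zeros(A_pi.shape[1], requires_grad=True)
    opt = torch.optim.Adam([logits], lr=0.05)
    for _ in range(steps):
        M = torch.sigmoid(logits)
        loss = torch.norm(N_out - M * A_pi) + lam * M.abs().sum()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return torch.sigmoid(logits).detach()

# Toy data: the "network output" uses only 3 of 20 path activations
torch.manual_seed(0)
A_pi = torch.randn(256, 20)
true_mask = torch.zeros(20)
true_mask[:3] = 1.0
print(fit_circuit_mask(true_mask * A_pi, A_pi).round())
```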
Validation Metric:
$\alpha\text{-compressibility} = \frac{\text{Circuit Score}}{\text{Full Network Score}} \geq 1 - \epsilon$
5. Algebraic Topology Methods
Homological Analysis:
Construct the activation simplicial complex $\mathcal{S}_\epsilon$ with:
$\mathcal{S}_\epsilon = \{\, \sigma \subset A \mid \operatorname{diam}(\sigma) < \epsilon \,\}$
Compute Betti numbers $b_k(\mathcal{S}_\epsilon)$ as a function of $\epsilon$
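At a fixed scale $\epsilon$, $b_0$ (the number of connected components of the $\epsilon$-neighborhood graph) can be computed directly with union-find; higher Betti numbers require a persistent-homology library. A self-contained sketch:

```python
import numpy as np

def betti_0(A, eps):
    """b_0 of the Vietoris-Rips complex on activation vectors A (n, d) at scale eps:
    the number of connected components of the eps-neighborhood graph."""
    n = len(A)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    D = np.linalg.norm(A[:, None, :] - A[None, :, :], axis=-1)
    for i in range(n):
        for j in range(i + 1, n):
            if D[i, j] < eps:
                parent[find(i)] = find(j)
    return len({find(i) for i in range(n)})

# Two well-separated activation clusters: b_0 falls from ~n to 2 to 1 as eps grows
A = np.vstack([np.random.randn(50, 8), np.random.randn(50, 8) + 10.0])
print([betti_0(A, e) for e in (1.0, 5.0, 50.0)])
```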
Theorem: For ReLU networks, $\exists\, \epsilon^*$ such that:
$b_0(\mathcal{S}_{\epsilon^*}) = \text{number of disconnected decision regions}$
6. Dynamic System Interpretation
Differential Approximation:
Approximate layer transition as:
$\frac{dh}{dt} = \sigma(W h + b) \;\Rightarrow\; h_{t+1} \approx h_t + \Delta t \, \sigma(W h_t + b)$
Stability Analysis:
Compute Lyapunov exponent:
$\lambda_{\max} = \lim_{T \to \infty} \frac{1}{T} \sum_{t=0}^{T-1} \ln \|J_t v_t\|$
where $J_t$ is the Jacobian of the step map at time $t$ and $v_t$ is a unit tangent vector renormalized after each step
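A Benettin-style sketch of the estimator, assuming a $\tanh$ nonlinearity for the Euler-discretized dynamics; `W` and `b` stand in for a layer's weights.

```python
import numpy as np

def lyapunov_max(W, b, h0, dt=0.1, T=5000):
    """Estimate the largest Lyapunov exponent of h_{t+1} = h_t + dt*tanh(W h_t + b)
    by pushing a unit tangent vector through the step Jacobian and renormalizing."""
    h, v = h0.copy(), np.ones(len(h0)) / np.sqrt(len(h0))
    total = 0.0
    for _ in range(T):
        pre = W @ h + b
        J = np.eye(len(h)) + dt * (1.0 - np.tanh(pre) ** 2)[:, None] * W  # Jacobian of the Euler step
        h = h + dt * np.tanh(pre)
        v = J @ v
        norm = np.linalg.norm(v)
        total += np.log(norm)
        v /= norm                      # renormalize the tangent vector
    return total / T                   # per-step exponent

rng = np.random.default_rng(1)
W = rng.standard_normal((16, 16)) / np.sqrt(16)
print(lyapunov_max(W, np.zeros(16), rng.standard_normal(16)))
```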
Synthesis Methodology
Consistency Validation:
For an interpretation $I$, verify:
$\exists\, K > 0 : \; \sup_{x \in \mathcal{X}} \|N(x) - I(x)\| \leq K \delta_I$
Completeness Metric:
$\text{Explanatory Coverage} = \frac{\mu(\{x \mid \delta_I(x) < \epsilon\})}{\mu(\mathcal{X})}$
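A sampling-based estimate of the coverage metric over a probe set, assuming the interpretation exposes per-input predictions `I_out` to compare against the network outputs `N_out`.

```python
import numpy as np

def explanatory_coverage(N_out, I_out, eps):
    """Fraction of probe inputs on which the interpretation matches the network
    to within eps, i.e. a Monte-Carlo estimate of μ({x : δ_I(x) < ε}) / μ(X)."""
    delta = np.linalg.norm(N_out - I_out, axis=-1)   # δ_I(x) per sample
    return float(np.mean(delta < eps))

# Toy check: an interpretation accurate on roughly 80% of inputs
rng = np.random.default_rng(0)
N_out = rng.standard_normal((1000, 4))
noise_scale = np.where(rng.random((1000, 1)) < 0.8, 0.01, 1.0)
I_out = N_out + noise_scale * rng.standard_normal((1000, 4))
print(explanatory_coverage(N_out, I_out, eps=0.1))   # ≈ 0.8
```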
Fundamental Limits:
For network depth $L$, the sample complexity of interpretation scales as:
$\Omega\left( \exp(L) \cdot \dim(\mathcal{H}) \right)$
(where $\mathcal{H}$ is the hypothesis space)
This framework reveals intrinsic tensions:
Completeness vs Compactness: Improved coverage requires exponential parameter growth
Pluralism Requirement: No single method achieves $\text{Coverage} > 1 - \epsilon$ for $\epsilon < 0.2$ in practice
Depth Complexity: Interpretation fidelity inversely proportional to $\sqrt{L}$
Current frontiers focus on operator algebra methods and non-uniform approximation theory for bounding interpretation errors.