Spectroscopy at Scale: Finding Interpretable Structure in Pythia-1.4B
Murfet et al.
Susceptibilities as an interpretability technique at 1B+ scale: show and tell from Pythia-1.4B.
Wang and Murfet
Mechanistic interpretability aims to understand how neural networks generalize beyond their training data by reverse-engineering their internal computations.
Gordon et al.
Spectroscopy infers the internal structure of physical systems by measuring their response to perturbations.
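For orientation, the analogy can be made quantitative (a standard statistical-mechanics identity, stated here as background rather than quoted from the paper): for a Gibbs distribution $p_\varepsilon(w) \propto \exp(-\beta(H(w) + \varepsilon V(w)))$ with probe term $V$ and observable $O$, linear response gives the susceptibility as a covariance,

$$\chi = \left.\frac{\partial \langle O \rangle_\varepsilon}{\partial \varepsilon}\right|_{\varepsilon=0} = -\beta\,\mathrm{Cov}(O, V),$$

so measuring responses to small perturbations is equivalent, at first order, to measuring fluctuations of the unperturbed system.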
Elliott et al.
We extend singular learning theory to reinforcement learning, showing that phase transitions in policy development are governed by the local learning coefficient, which detects transitions even when policies appear identical in terms of regret.
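As a concrete anchor for what "local learning coefficient" means operationally, here is a minimal sketch of the standard SGLD-based estimator on a one-dimensional toy loss (the setup and hyperparameters are ours, not the paper's RL experiments): sample from a tempered posterior localized at w_star and compare the expected loss to the loss at w_star.

import numpy as np

rng = np.random.default_rng(0)

def estimate_llc(grad_L, L, w_star, n=10_000, gamma=1.0, eps=1e-4, steps=50_000):
    # lambda_hat = n * beta * (E_posterior[L(w)] - L(w_star)), beta = 1/log n,
    # with samples drawn by SGLD from the localized tempered posterior
    # p(w) ∝ exp(-n * beta * L(w) - (gamma/2) * (w - w_star)^2).
    beta = 1.0 / np.log(n)
    w, losses = float(w_star), []
    for _ in range(steps):
        drift = n * beta * grad_L(w) + gamma * (w - w_star)
        w += -0.5 * eps * drift + np.sqrt(eps) * rng.normal()
        losses.append(L(w))
    e_loss = np.mean(losses[steps // 2:])  # discard burn-in
    return n * beta * (e_loss - L(w_star))

# Toy singular loss L(w) = w^4, whose learning coefficient is 1/4.
lam = estimate_llc(grad_L=lambda w: 4 * w**3, L=lambda w: w**4, w_star=0.0)
print(f"lambda_hat ≈ {lam:.2f}   (theory: 0.25)")

The localization term gamma keeps the chain near w_star, and beta = 1/log n is the usual inverse-temperature choice for this estimator.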
Urdshals et al.
We provide an extension of the MDL principle to singular models like neural networks and empirically test the predicted relationship between complexity and compressibility.
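The predicted relationship rests on Watanabe's free-energy asymptotics, in which the learning coefficient $\lambda$ replaces the parameter-count penalty of classical MDL/BIC; we state the standard form as background (the paper's precise extension may differ):

$$F_n = n L_n(w_0) + \lambda \log n + O(\log\log n), \qquad \text{compared with BIC's } \; n L_n(\hat w) + \tfrac{d}{2}\log n.$$

Since $\lambda \le d/2$ in singular models, a model can be more compressible than its raw parameter count suggests.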
Lee et al.
We study the BIF as a tool for developmental interpretability and show that influence can change dramatically over the course of training, contrary to the classical view of training data attribution.
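For readers new to the object: up to normalization conventions (this is our paraphrase, not the paper's exact definition), the Bayesian influence function (BIF) of a training sample $z$ on an observable $f$ is a covariance under a local tempered posterior,

$$\mathrm{BIF}(z, f) \approx -\,\mathrm{Cov}_{w \sim p_\beta}\big(\ell_z(w), f(w)\big),$$

where $\ell_z$ is the per-sample loss. Because the posterior is taken at a checkpoint, recomputing the covariance across checkpoints is what makes influence a function of training time.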
Adam et al.
We study the kernel induced by the BIF as a tool for interpretability and show that this recovers ground-truth structure in the training distribution.
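A hedged sketch of the kernel idea (the function names and toy data are ours): if you record every training sample's loss at each posterior draw, the covariance of those loss traces across draws defines a kernel over samples, and its block structure can recover latent groups in the data.

import numpy as np

def bif_kernel(per_sample_losses):
    # per_sample_losses: (n_draws, n_samples); row t holds each training
    # sample's loss at posterior draw w_t. Returns the (n_samples, n_samples)
    # covariance of loss traces across draws.
    centered = per_sample_losses - per_sample_losses.mean(axis=0, keepdims=True)
    return centered.T @ centered / (per_sample_losses.shape[0] - 1)

# Toy data with two latent groups: samples in the same group share a
# fluctuation mode, so their losses co-vary across posterior draws.
rng = np.random.default_rng(0)
n_draws, n_per_group = 2_000, 10
factors = rng.normal(size=(n_draws, 2))
losses = np.concatenate([
    factors[:, [0]] + 0.3 * rng.normal(size=(n_draws, n_per_group)),
    factors[:, [1]] + 0.3 * rng.normal(size=(n_draws, n_per_group)),
], axis=1)

K = bif_kernel(losses)
print(K[:10, :10].mean(), K[10:, 10:].mean(), K[:10, 10:].mean())
# within-group entries ≈ 1, cross-group entries ≈ 0: the kernel exposes the groups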
Baker et al.
We introduce "susceptibilities" along with a framework for applying these measurements to discovering structure inside models ("structural inference") and validate this in a small language model.
Chen and Murfet
We develop a geometric account of sequence modelling that links patterns in the data to measurable properties of the loss landscape.
Murfet and Troiani
We develop a correspondence between the structure of Turing machines and the structure of singularities of real analytic functions.
Urdshals and Urdshals · ICML SMUNN Workshop
We study how a one-layer attention-only transformer develops relevant structures while learning to sort lists of numbers.
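As a reference point for the model class (a minimal sketch; the hyperparameters and layer choices are ours, not the paper's): a one-layer attention-only transformer is token and position embeddings, a single causal self-attention block with a residual connection, and an unembedding, with no MLP.

import torch
import torch.nn as nn

class AttnOnlyTransformer(nn.Module):
    # One attention block, no MLP: embed -> causal self-attention -> unembed.
    def __init__(self, vocab_size=64, d_model=32, n_heads=4, max_len=16):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Embedding(max_len, d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.unembed = nn.Linear(d_model, vocab_size, bias=False)

    def forward(self, tokens):  # tokens: (batch, seq) of token ids
        t = tokens.shape[1]
        x = self.tok(tokens) + self.pos(torch.arange(t, device=tokens.device))
        causal = torch.triu(torch.ones(t, t, dtype=torch.bool,
                                       device=tokens.device), diagonal=1)
        attn_out, _ = self.attn(x, x, x, attn_mask=causal)  # True = masked out
        return self.unembed(x + attn_out)  # residual stream -> logits

model = AttnOnlyTransformer()
logits = model(torch.randint(0, 64, (2, 16)))  # shape (batch=2, seq=16, vocab=64)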
Carroll et al.
Modern deep neural networks display striking examples of rich internal computational structure.