Introduction
The susceptibility clusters used to interpret Pythia 1.4B here are the result of a several step process. Most of this work is detailed in Baker et al. (2025), Wang et al. (2025), and Gordon et al. (2026), but for convenience we have collected into a single post the key definitions and intuitions, as well as detailed citations where the reader can learn more. With two exceptions (PCA whitening as a preprocessing step and the use of auxiliary models for scaling) none of the material here is original, and familiar readers are urged to skim or jump around.
This post can be broken into two major parts. In the first section, we offer a mathematical description of susceptibilities, and in the second we describe how susceptibility data is collected and analyzed. The new material is contained in the second section.
Mathematical Background
We will define here the fundamental mathematical objects used for susceptibilities. We first set up the neural network as a statistical physical system (Section 2.1), then introduce observables and their expectation values under the local posterior (Section 2.2), define susceptibilities as derivatives of these expectation values with respect to changes in the data distribution (Section 2.3), and finally assemble per-component susceptibilities into the susceptibility vectors that are the primary object of study in this work (Section 2.4).
Setup
We begin with a language model whose weights are parametrized by points $w$ in a real vector space $W \subseteq \mathbb{R}^d$. Each point $w \in W$ defines a next-token predictor: given a context $x$, the model produces a distribution $p(y \mid x, w)$ over possible next tokens $y$. For a context-token pair $(x, y)$, the per-token loss is $\ell(w; x, y) = -\log p(y \mid x, w)$, and the population loss is $L(w) = \mathbb{E}_{(x, y) \sim q}[\ell(w; x, y)]$, where $q$ is the data distribution.
From the loss, we can define a probability distribution over weight space, the population posterior:
$$p(w) \propto \varphi(w) \exp(-n \beta L(w)),$$
where $\varphi$ is a prior, $n$ controls the effective dataset size, and $\beta$ is an inverse temperature. This distribution concentrates around low-loss regions, with $n$ and $\beta$ controlling how sharply.
This is the same mathematical structure as the Boltzmann distribution in statistical mechanics. We are modeling a neural network as a statistical physics system where the loss plays the role of the energy (see the companion post for a self-contained introduction to this connection via the Ising model).
A component of the model is a subset $C$ of the parameters, for example the weights of a single attention head or MLP layer. Fixing a component induces a decomposition of the weight space $W = W_C \times W_{\setminus C}$, letting us write $w = (w_C, w_{\setminus C})$, where $w_C$ denotes the parameters in $C$ and $w_{\setminus C}$ the rest.
Observables and Expectation Values
An observable is a function $\phi : W \to \mathbb{R}$ that maps a model's weights to a real number. The simplest example is the population loss $L$ itself.
A key idea from statistical physics is to study the values of $\phi$ not at individual weights, but over the entire distribution $p(w)$, as summarized by expectation values:
$$\langle \phi \rangle = \int_W \phi(w)\, p(w)\, dw.$$
When our observable is the change in loss relative to a reference point $w^*$, this gives us an estimator for the local learning coefficient (LLC),
$$\hat\lambda(w^*) = n \beta \big( \langle L \rangle - L(w^*) \big).$$
This is a principled measure of model complexity that controls generalization error in the Bayesian setting. The LLC can be interpreted as measuring half the effective dimensionality, though this can be misleading as $\lambda$ can be an arbitrary positive rational number, thus giving a fractional notion of dimensionality. See Lau et al. (2023) for more details.
We can also localize this to specific parts of the model. Given trained weights $w^*$, the component loss of component $C$ is
$$L_C(w_C) = L(w_C, w^*_{\setminus C}),$$
the population loss with all parameters outside $C$ frozen at their trained values; sampling from the corresponding posterior amounts to using the prior $\varphi(w_C)\, \delta(w_{\setminus C} - w^*_{\setminus C})$, where $\delta$ is the Dirac delta function. The associated expectation value is an estimator for the weight-refined LLC for component $C$ (up to a constant multiple). It measures the complexity of that component (when considered in isolation). In Wang et al. (2024), tracking these per-component LLCs over training revealed the developmental differentiation of attention heads in a small language model: heads that start undifferentiated gradually specialize, with their component LLCs diverging at critical periods, and the shapes of the component LLC curves distinguishing different types of heads.
Susceptibilities
The component LLC tells us how much structure lives in a component. It does not tell us what that structure responds to. For that, we need to ask: how does the expectation value of $\phi$ change when we modify the data?
Following Baker et al. (2025), we consider a one-parameter mixture of the data distribution: $q_\epsilon = (1 - \epsilon)\, q + \epsilon\, q'$, where $q$ is the original distribution and $q'$ is another distribution we perturb towards. The loss under this mixture is $L_\epsilon(w) = (1 - \epsilon) L(w) + \epsilon L'(w)$, where $L'$ is the population loss under $q'$, and the posterior shifts accordingly. We write $\langle \phi \rangle_\epsilon$ for the expectation of an observable over this shifted posterior. The susceptibility of $\phi$ to the perturbation is the derivative of this expected value with respect to $\epsilon$ at $\epsilon = 0$:
$$\chi_\phi(q') = \frac{\partial}{\partial \epsilon}\Big|_{\epsilon = 0} \langle \phi \rangle_\epsilon.$$
A standard calculation (the fluctuation-dissipation theorem, proved in Baker et al. (2025), Appendix A) converts this derivative into a covariance under the unperturbed posterior:
$$\chi_\phi(q') = -n \beta\, \mathrm{Cov}\big( \phi, \Delta L \big),$$
where $\Delta L(w) = L'(w) - L(w)$ is the difference in loss between the perturbed and unperturbed distributions. We can learn how the system would respond to a change in data by observing how it fluctuates without such a change. Methodologically, this means we can reuse the same posterior samples to estimate both expectation values and their derivatives.
The simplest perturbation concentrates all of $q'$ on a single data point $(x, y)$. In this case, $\Delta L(w) = \ell(w; x, y) - L(w)$, and we obtain the per-sample susceptibility:
$$\chi_\phi(x, y) = -n \beta\, \mathrm{Cov}\big( \phi,\ \ell(\,\cdot\,; x, y) - L \big).$$
When the observable is the component loss $L_C$, this measures how the structure in component $C$ responds to the data point $(x, y)$. If $\chi_{L_C}(x, y)$ is large and positive, training more on $(x, y)$ would increase the effective complexity of $C$; if large and negative, it would simplify $C$.
Generality. Note that both the observable and the perturbation can be chosen freely. The component loss is a natural default (it requires no design choices and its expectation gives the weight-refined LLC), but any function of the weights can serve as an observable. Similarly, the perturbation need not concentrate on a single sample: it can be any shift in the data distribution (or even a change in some other suitable training hyperparameter). Under a different choice of observable (per-sample loss on a held-out query) with a very similar per-sample perturbation, susceptibilities reduce to a generalization of classical influence functions; see Appendix E (Adam et al. 2025; Lee et al. 2025; Kreer et al. 2026).
Susceptibility Estimation
So far we have defined susceptibilities at the population level: $\chi_\phi$ is a covariance under $p(w) \propto \varphi(w) \exp(-n \beta L(w))$, where $L$ is the expectation of the per-token loss under the true data distribution $q$. Neither expectation is accessible in practice: $L$ requires integration over the entire data distribution $q$, and the posterior expectation requires integrating over all of weight space. To actually compute susceptibilities, we substitute both with finite averages.
Expectation over $q$. Given a finite dataset $D_n = \{(x_i, y_i)\}_{i=1}^n$ of samples drawn from $q$, we approximate $L(w)$ by the empirical loss
$$L_n(w) = \frac{1}{n} \sum_{i=1}^n \ell(w; x_i, y_i).$$
For practical approximations, we don’t use a fixed set of samples, but rather a minibatch loss.
This replaces the population posterior with the empirical posterior $p_n(w) \propto \varphi(w) \exp(-n \beta L_n(w))$, and the centered loss inside the covariance with $\ell(w; x, y) - L_n(w)$.
Expectation over the posterior. We draw a finite set of weight samples $w_1, \ldots, w_T$ from $p_n$ using Stochastic Gradient Langevin Dynamics (SGLD); see Lau et al. (2023) or the accompanying blog post for details. The population covariance is then replaced by a sample covariance, giving the susceptibility estimator
$$\hat\chi_\phi(x, y) = -\frac{n \beta}{T} \sum_{t=1}^{T} \big( \phi(w_t) - \bar\phi \big)\big( \Delta\ell(w_t) - \overline{\Delta\ell} \big),$$
where $\Delta\ell(w) = \ell(w; x, y) - L_n(w)$, and $\bar\phi$ and $\overline{\Delta\ell}$ are the sample means across the $T$ samples.
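As a concrete illustration, the estimator is just a scaled sample covariance. The sketch below uses our own variable names, with synthetic inputs standing in for real SGLD draws:

```python
import numpy as np

def susceptibility_estimate(phi, delta_ell, n, beta):
    """Sample-covariance susceptibility estimator.

    phi       : (T,) observable values at the T posterior draws
    delta_ell : (T,) centered per-sample loss at those draws
    Returns -n * beta * Cov(phi, delta_ell).
    """
    phi = np.asarray(phi, dtype=float)
    delta_ell = np.asarray(delta_ell, dtype=float)
    cov = np.mean((phi - phi.mean()) * (delta_ell - delta_ell.mean()))
    return -n * beta * cov

# Synthetic check: perfectly correlated fluctuations give a negative value.
rng = np.random.default_rng(0)
f = rng.normal(size=1000)
chi = susceptibility_estimate(f, f, n=100, beta=1.0)
```

The same posterior draws can be reused across many data points by swapping in different `delta_ell` vectors, which is what makes the covariance form practical.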
It is not obvious that either of these substitutions is valid. Understanding how $L_n$ fluctuates around $L$ as $n \to \infty$ for arbitrary learning machines is the core content of singular learning theory (Watanabe 2009). See Appendix C for more details. The analogous question for stochastic-gradient MCMC (whether such methods actually probe the posterior in the singular setting) remains an open problem; see also Appendix D.
From here on, when no confusion can arise, we use “susceptibility” to refer to both the population object $\chi_\phi$ and its estimator $\hat\chi_\phi$.
Susceptibility Vectors
By choosing a set of components $C_1, \ldots, C_k$, we associate to each data point $(x, y)$ a susceptibility vector:
$$\chi(x, y) = \big( \chi_{C_1}(x, y), \ldots, \chi_{C_k}(x, y) \big) \in \mathbb{R}^k.$$
This vector is a representation of the data point from the perspective of the model’s loss landscape. In the previously mentioned companion post we show how the values of multiple susceptibilities localized to different “rooms” let us infer where a perturbation is globally in a large grid. The intuition here is analogous. Two data points with similar susceptibility vectors affect the model’s internal structure in similar ways; data points with very different vectors affect different parts of the model, or affect the same parts in different directions.
These susceptibility vectors are the primary object of study in this sequence. For Pythia-1.4B, we use 410 components: 384 attention heads (16 per layer across 24 layers), 24 MLP layers, and the embedding and unembedding matrices.
Working with Susceptibility Data
Having defined susceptibilities above, in this section we will describe how we collect these values efficiently at scale, as well as how we work with the data to identify clusters.
The compute involved in SGLD is currently the greatest bottleneck to collecting susceptibilities for large models. Computing the covariances for 6 million susceptibilities on Pythia 1.4B required tens of thousands of forward and backward passes of the model, and roughly 5000 H200 hours.
Because of this, we have developed a new technique to use “ground-truth” susceptibility data to efficiently collect susceptibilities on a larger set of tokens.
Upsampling Data Using an Auxiliary Model
We train an auxiliary (Aux) model: a lightweight external neural network that takes in concatenated residual stream activations and predicts per-token susceptibility values. We can then run inference with this model over arbitrarily large numbers of tokens, as the cost per batch is a single forward pass of the base model to generate activations, followed by a forward pass of the Aux model to predict susceptibility values.
The process is:
- Collect ground-truth susceptibilities using SGLD, as well as the activation vectors
- Train an auxiliary model to predict a token’s susceptibility values from a concatenated vector of the model’s activations.
- Use the auxiliary model to produce susceptibilities, using only a single forward pass of the initial LLM to collect activations, rather than a sampling process
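The three steps above can be sketched end to end. The snippet below is a deliberately simplified stand-in: ridge regression via the normal equations replaces the actual one-layer encoder-decoder aux model, and random vectors replace Pythia activations and SGLD ground truth; all names are ours.

```python
import numpy as np

rng = np.random.default_rng(0)

# Step 1 (stand-in): "ground-truth" susceptibilities for a small token set.
# Here they are simulated as a noisy linear function of activations; in the
# real pipeline they come from SGLD, and the activations from Pythia 1.4B.
n_tokens, d_act, n_pcs = 2000, 64, 16
activations = rng.normal(size=(n_tokens, d_act))
true_map = rng.normal(size=(d_act, n_pcs))
susceptibilities = activations @ true_map + 0.1 * rng.normal(size=(n_tokens, n_pcs))

# Step 2: fit the auxiliary predictor. Ridge regression via the normal
# equations stands in for the one-layer encoder-decoder used in the post.
lam = 1e-3
A = activations
W = np.linalg.solve(A.T @ A + lam * np.eye(d_act), A.T @ susceptibilities)

# Step 3: cheap upsampling. Predicting susceptibilities for new tokens needs
# only their activations (one forward pass), not a fresh sampling run.
new_activations = rng.normal(size=(5, d_act))
predicted = new_activations @ W  # shape (5, n_pcs)
```

The economics are the point: step 1 pays the SGLD cost once on a small token set, after which step 3 costs a forward pass per batch.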
Introducing an auxiliary model allowed us to use the 4 million susceptibilities collected in Gordon et al. (2026) to produce susceptibilities for 46 million more tokens at a fraction of the compute cost.
Aux Model Details
We experimented with several auxiliary model architectures, the best performing of which was a simple one-layer encoder-decoder model with a 16x hidden expansion. We also found a small $\ell_1$-norm penalty useful for stability.
For training, we used activations from four evenly spaced intermediate layers of Pythia 1.4B. Rather than predict the 410 component susceptibilities individually, we used the ground-truth susceptibilities to perform principal component analysis and trained the model to perform predictions on the PC basis. Previous work (Baker et al. 2025; Wang et al. 2025) identified principal component directions as capturing more patterns in the data than individual model components.
We measure aux model accuracy by the Pearson correlation between predicted susceptibility values and a withheld ground-truth validation set. The model achieved a mean per-PC correlation of 0.714, with stronger performance on the earlier principal components, as seen in the accompanying chart.
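For concreteness, the per-PC Pearson metric can be computed column-wise; a sketch with our own names and synthetic data:

```python
import numpy as np

def per_pc_pearson(pred, truth):
    """Pearson correlation between predicted and ground-truth susceptibility
    values, computed independently for each principal component (column)."""
    pred = pred - pred.mean(axis=0)
    truth = truth - truth.mean(axis=0)
    num = (pred * truth).sum(axis=0)
    den = np.sqrt((pred ** 2).sum(axis=0) * (truth ** 2).sum(axis=0))
    return num / den  # shape (n_pcs,)

rng = np.random.default_rng(0)
truth = rng.normal(size=(1000, 8))
pred = truth + 0.5 * rng.normal(size=(1000, 8))   # a noisy "aux model"
r = per_pc_pearson(pred, truth)                   # per-PC correlations
mean_r = float(r.mean())                          # the summary statistic
```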
Clustering
We cluster the susceptibility vectors using a clustering algorithm defined in Gordon et al. (2026), section 3.1. The algorithm is designed to find small clusters of nearby points on the periphery of a large data cloud using a localized PageRank. The algorithm works as follows:
- Turn the set of datapoints into a graph, with points as vertices and edge weights between points given by a decaying function of distance.
- Select a random unvisited vertex
- Explore a neighborhood of that vertex by taking random walks on the graph that return to the vertex with a fixed probability $\alpha$ at each step. The steady state of such a random process can be efficiently approximated to a given tolerance $\varepsilon$; see Andersen et al. (2006) for details.
- Order the points by how frequently the random walks visit them and search the prefixes of this ordering for sets that have low conductance: the ratio between the total edge weight leaving the set and the total edge weight leaving or staying in the set.
- Clusters are identified as prefixes with low conductance. If a cluster is found, record it; if not, label all the points visited as “unclustered”.
- Return to step 2, repeating until every point is visited.
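A minimal sketch of this loop for a single seed, assuming a dense symmetric weight matrix and a simplified (non-lazy) version of the Andersen et al. (2006) push algorithm:

```python
import numpy as np

def approx_ppr(adj, seed, alpha=0.15, eps=1e-4):
    """Approximate personalized PageRank on a weighted graph via a simplified
    (non-lazy) push scheme in the style of Andersen et al. (2006).
    adj is a dense symmetric weight matrix with zero diagonal."""
    deg = adj.sum(axis=1)
    p = np.zeros(adj.shape[0])   # PageRank mass settled so far
    r = np.zeros(adj.shape[0])   # residual mass still to be pushed
    r[seed] = 1.0
    while True:
        active = np.where(r > eps * deg)[0]
        if active.size == 0:
            return p
        u = active[0]
        p[u] += alpha * r[u]                        # settle a fraction at u
        r += (1 - alpha) * r[u] * adj[u] / deg[u]   # spread the rest
        r[u] = 0.0

def conductance(adj, members):
    """Edge weight leaving the set divided by total edge weight incident to it."""
    mask = np.zeros(adj.shape[0], dtype=bool)
    mask[members] = True
    cut = adj[mask][:, ~mask].sum()
    vol = adj[mask].sum()
    return cut / vol if vol > 0 else 1.0

def sweep_cluster(adj, seed, alpha=0.15, eps=1e-4, max_cond=0.3):
    """Order vertices by degree-normalised PPR mass and return the
    lowest-conductance proper prefix, or None if none beats max_cond."""
    p = approx_ppr(adj, seed, alpha, eps)
    order = np.argsort(-p / adj.sum(axis=1))
    best, best_cond = None, max_cond
    for size in range(2, adj.shape[0]):   # proper prefixes only
        cond = conductance(adj, order[:size])
        if cond < best_cond:
            best, best_cond = order[:size], cond
    return best
```

On a graph made of two dense blocks joined by one weak edge, seeding inside a block recovers that block as the low-conductance prefix; the production version differs in using a sparse k-NN graph, the lazy-walk push, and many seeds in parallel.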
Intuitively, we add points consecutively to a set in order of how “near” they are to a randomly chosen seed vertex, while comparing the distances between points in the set to distances between points in the set and external points. A cluster is an outlier by this measure: many points that are close to each other and far away from everything else. These clusters are collections of tokens that the model regards as similar in some way.
This process is sensitive to the choice of hyperparameters used for the graph construction and the random walk, but we were able to reuse the exact hyperparameters from Gordon et al. (2026): our distance graph was a $k$-nearest neighbor graph, and the random walk used the same teleport probability $\alpha$ and tolerance $\varepsilon$ as in that work. To account for the increased number of tokens, we ran the core algorithmic loop in parallel, choosing multiple seed vertices simultaneously.
The biggest change from our previous methodology is how we process the data prior to clustering. Raw susceptibility vectors have non-uniform variance across principal components, with earlier PCs (by definition) having higher variance than later PCs. The clustering algorithm we use is adapted from graph-theoretic methods and begins by turning a point cloud into a distance graph, so it is strongly dependent on the Euclidean distance between susceptibility vectors. To prevent this distance from being dominated by spread along the first few principal components, we apply PCA to the susceptibility matrix and whiten in the PC basis, rescaling each principal component to unit variance. Judging by the number of clusters found and the quality of the patterns detected, this was a substantial improvement to the process.
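The whitening step is standard PCA whitening; a minimal sketch (function name is ours):

```python
import numpy as np

def pca_whiten(X, eps=1e-8):
    """Project vectors onto their principal components and rescale each
    component to unit variance (PCA whitening)."""
    Xc = X - X.mean(axis=0)
    # SVD of the centered matrix: rows of Vt are the PC directions.
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = Xc @ Vt.T                   # coordinates in the PC basis
    std = S / np.sqrt(X.shape[0] - 1)    # per-PC standard deviation
    return scores / (std + eps)

rng = np.random.default_rng(0)
# Columns with wildly different scales, as with raw susceptibility PCs.
X = rng.normal(size=(500, 4)) * np.array([10.0, 3.0, 1.0, 0.1])
Xw = pca_whiten(X)
# After whitening, every PC direction has (approximately) unit variance,
# so no single direction dominates Euclidean distances.
```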
The combined effect of increasing the number of tokens to 46 million, and performing a PC-whitening step prior to clustering increased the number of clusters found for Pythia 1.4B from 249 to 57,236.
Conclusion
Though Gordon et al. (2026) was published only a few months ago, we’ve made significant improvements to the process for producing and studying susceptibility clusters. This post provides a general overview of the pipeline, and we strongly encourage the reader to look at the primary post in this collection where we present a novel analysis of the thousands of interpretable clusters in Pythia 1.4B.
Appendix
Why Susceptibilities?
There are several other interpretability techniques, most prominently Sparse Autoencoders, that have similar utility to susceptibility clusters. That is to say, they are methods for unsupervised detection of human comprehensible patterns that a language model is using to process tokens. With the introduction of an aux model, the susceptibility data we collect is now a transformation of a model’s activations, exactly like SAEs.
Given this, it is reasonable to ask why anyone should care about susceptibilities, as compared to the many other tools seeking interesting structure in activation data. We gesture at an answer above, but this is worth saying directly: Though susceptibilities can be used for pattern identification, this is a side effect of their design, not the original intent. Susceptibilities are direct measures of how a datapoint affects the development of an observable during training. As such, they can be used for interpretability (datapoints that affect training in similar ways are often quite similar), but they can also be used, and in fact were constructed, to make principled interventions in the training process of large models.
This was first done successfully at the small-language-model scale in Wang & Murfet (2026). We encourage the reader to explore our post on patterning toy models of superposition as another demonstration of this sort of intervention.
Leaky Activations?
Susceptibilities are a weights-based interpretability method: they depend on a model’s fixed parameters, not on any particular activation. The auxiliary model changes this, since its inputs are intermediate activations. A natural worry follows: could the interesting structure we see in aux-predicted susceptibilities be primarily an artifact of structure already present in activation space, rather than a real reflection of the loss-landscape geometry?
Here is a concrete scenario we are worried about: suppose two tokens are nearby in activation space, because the model regards them as similar, but this relationship is not captured by susceptibilities. Since our auxiliary model applies a continuous map to activation space, the aux-predicted susceptibilities could capture information, even valuable patterns, that does not accurately reflect what susceptibilities should be able to see. More succinctly: are aux models using activations to efficiently produce the intended data, or do they produce an indeterminate blend of two different datasets?
This is an empirical question, and we address it with two complementary metrics.
Global similarity: linear CKA. Centered kernel alignment (Kornblith et al. 2019) measures the similarity of two representations by comparing the covariance structure of their kernel matrices. Given representations $X$ and $Y$ of the same tokens, CKA asks: do pairs of tokens that are similar under $X$ tend to be similar under $Y$? Values range from 0 (no correspondence) to 1 (identical structure up to rotation and scaling). The linear variant admits a feature-space reformulation that scales to the hundreds of thousands of tokens we need here without materialising an $n \times n$ kernel matrix.
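The feature-space formulation is compact; a sketch (names are ours):

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA via the feature-space formulation of Kornblith et al. (2019):
    ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F) on column-centered features.
    Avoids forming the n x n kernel matrices."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    num = np.linalg.norm(Y.T @ X, "fro") ** 2
    den = np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")
    return num / den

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 32))
R, _ = np.linalg.qr(rng.normal(size=(32, 32)))
same = linear_cka(X, X @ R)                       # rotation-invariant: ~1
other = linear_cka(X, rng.normal(size=(1000, 32)))  # independent: near 0
```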
For each of the roughly 212,000 tokens in the aux-model validation set, we compute the linear CKA between (i) per-layer activations and ground-truth susceptibilities, and (ii) per-layer activations and aux-predicted susceptibilities. The first quantity is a baseline: any similarity it records reflects genuine structure shared between activations and susceptibilities on this data. The second is the object of interest: if it is close to the first, the aux model is faithfully representing this shared structure; if it is substantially higher, the aux model is adding activation structure that is not present in the ground truth.

Both curves are low across all layers: ground truth peaks near 0.15, aux predictions near 0.275. The aux curve sits above the ground-truth curve, rising and falling in tandem with it, although by the standards of CKA in the neural network representation literature these numbers are small. In particular, the gap seems a priori low given that we generally expect learned maps to be biased towards preserving the local structure of their domain.
Overall, we take the low CKA between ground truth susceptibilities and activations as a positive sign for the orthogonality of their information content. A reasonable caricature seems to be that alignment between activations and susceptibilities comes for free from the input tokens (early layers) and from the target token (late layers), with the middle layers carrying the nontrivial computation.
Local similarity: Jaccard-KNN. CKA captures global covariance structure. It is less informative about the local neighbourhood geometry that our clustering algorithm is actually sensitive to. We complement CKA with a nearest-neighbour measurement: for each token, compute its nearest neighbours in activation space and in susceptibility space, and report the Jaccard index (size of intersection over size of union) between the two neighbour sets, averaged over all tokens.
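A brute-force sketch of this metric (names are ours; a real implementation would use an approximate nearest-neighbour index at this scale):

```python
import numpy as np

def knn_sets(X, k):
    """Indices of each point's k nearest neighbours (excluding itself)."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)
    return np.argsort(d2, axis=1)[:, :k]

def mean_jaccard_knn(X, Y, k=50):
    """Mean Jaccard index between k-NN sets in two representations."""
    nx, ny = knn_sets(X, k), knn_sets(Y, k)
    scores = []
    for a, b in zip(nx, ny):
        inter = len(set(a) & set(b))
        scores.append(inter / (2 * k - inter))  # union size = 2k - intersection
    return float(np.mean(scores))

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 8))
# Identical geometry gives Jaccard 1; unrelated geometry gives near 0.
identical = mean_jaccard_knn(X, 2.0 * X, k=10)
unrelated = mean_jaccard_knn(X, rng.normal(size=(300, 8)), k=10)
```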

The mean Jaccard similarity across all layers is below 0.05 for ground-truth susceptibilities and below 0.19 for aux predictions. This indicates that local neighbourhoods in activation space and susceptibility space are almost disjoint: a token’s 50 nearest neighbours in activations have very little overlap with its 50 nearest neighbours in susceptibilities.
Taking stock. CKA and Jaccard-KNN agree on the qualitative picture. Activations and susceptibilities are, on both global and local metrics, different representations of the same tokens. The aux model reproduces this separation — its predictions are measurably more activation-like than the ground truth, but the absolute difference is small, and the resulting predictions remain far from the activations they are computed from. We take this as evidence that the interpretable clusters reported in the main post are not explainable by activation structure leaking through the aux model.
Population and Empirical Losses
Throughout this post, the posterior is defined using the population loss $L$ rather than the empirical loss $L_n$. This is a deliberate choice that deserves comment, since in practice we can only ever compute $L_n$.
Singular learning theory is, at its core, a theory of how $L_n$ fluctuates around $L$ as the number of samples $n$ grows. Watanabe’s central results generalize the classical central limit theorem to the singular setting, characterizing how the empirical log-likelihood concentrates around the population log-likelihood even when the loss landscape has degenerate geometry. The key quantities that govern this concentration (the RLCT, the singular fluctuation) are properties of the population loss $L$, not of any particular finite sample.
The annealed posterior is the natural object from this perspective: it is defined by the geometry of $L$, which is the geometry that ultimately controls generalization. Susceptibilities defined as derivatives of expectations under this posterior are therefore derivatives of well-defined population quantities with clean asymptotic behavior.
An alternative approach, taken in some of our earlier work on Bayesian influence functions, is to define estimators directly in terms of the empirical loss . This is described in Appendix E.
On Sampling
Even granting that the empirical loss $L_n$ is a good stand-in for $L$, there is a further concern: we need to actually draw samples from the empirical posterior $p_n$. We do this with stochastic-gradient MCMC methods such as Stochastic Gradient Langevin Dynamics (SGLD; Welling & Teh (2011)).
SGMCMC methods come with convergence guarantees in a variety of settings. The usual analyses, however, assume strong conditions (e.g., some version of strong log-concavity). Often these conditions are violated for models like neural networks that have degenerate loss landscapes. Existing proofs do not, in general, establish that SGLD actually samples from in these settings.
This is an active research frontier. In practice, we treat the SGLD samples as approximate draws from $p_n$ and lean on empirical calibration: we tune hyperparameters (step size, inverse temperature $\beta$, prior scale) to produce susceptibility estimates that behave well according to heuristics. Developing better samplers for singular models, and understanding the role of their hyperparameters, is a key direction for future work. See the companion post on hyperparameter selection for the practical choices involved in making the estimator $\hat\chi_\phi$ behave well.
Susceptibilities and (Bayesian) Influence Functions
It is possible to arrive at a closely related object by starting from the other end with the empirical posterior rather than the population posterior, and with a perturbation of sample weights rather than of the data distribution.
The empirical posterior can be written
$$p_n(w) \propto \varphi(w) \exp\Big( -\beta \sum_{i=1}^n \ell(w; x_i, y_i) \Big),$$
which exhibits the per-sample losses as independent contributions to the log-density. We can perturb the weight of a single training sample $(x_j, y_j)$ by $\epsilon$,
$$p_{n, \epsilon}(w) \propto \varphi(w) \exp\Big( -\beta \Big( \sum_{i=1}^n \ell(w; x_i, y_i) + \epsilon\, \ell(w; x_j, y_j) \Big) \Big),$$
and differentiate an empirical posterior expectation at $\epsilon = 0$. The same fluctuation-dissipation calculation as in the main text gives
$$\frac{\partial}{\partial \epsilon}\Big|_{\epsilon = 0} \langle \phi \rangle_\epsilon = -\beta\, \mathrm{Cov}\big( \phi,\ \ell(\,\cdot\,; x_j, y_j) \big).$$
This is the Bayesian influence function (BIF), as defined in Adam et al. (2025), Lee et al. (2025), and Kreer et al. (2026), following Giordano et al. (2017) and the classical influence-function literature.
Two points are worth noting. First, the loss inside the covariance is uncentered: just $\ell(w; x_j, y_j)$, not $\ell(w; x_j, y_j) - L_n(w)$. The centered form in the main text arose from perturbing the data distribution, which upweights one sample while implicitly downweighting the others; the BIF construction perturbs only a single sample weight, with no compensating downweighting, so no centering appears.
Second, under regularity assumptions and in the large-data limit, the BIF can be Taylor expanded, which recovers the classical influence function as the leading-order term. That is, the BIF is a generalization of the classical IF. It should be seen as a specific case of the susceptibility framework we develop here (uncentered, per-sample loss observable, empirical-first construction).
How Do You Measure a Variation in the Loss Landscape?
If we imagine some variation in the data distribution parametrized by $\epsilon$, then it is easy to think about the calculus of how the height of the loss landscape at a single point $w$ varies with $\epsilon$. However, the loss landscape itself is, as a mathematical object, an infinite number of points (the heights at all points in some small region around the trained parameter $w^*$, say), and it’s not a priori clear how to differentiate such a thing. But recall that in a small neighborhood around $w^*$ the population loss is determined (provided it is analytic) by its Taylor series coefficients. This is an infinite list of numbers which varies with $\epsilon$, and we can imagine the derivative of each entry of this list easily enough. To a first approximation, susceptibilities are trying to capture some part of the information in the derivative of this “Taylor list” with respect to $\epsilon$.
However, for various reasons we must proceed by a less direct route. Instead of studying variations of the function $L$ near $w^*$, we study variations in the local posterior $p(w) \propto \varphi(w)\, e^{-n \beta L(w)}$, where $w$ ranges over a small neighborhood of $w^*$. We can think of this as a transform of $L$ that retains the same information. For various choices of observable $\phi$ (meaning a function of $w$), the asymptotic behaviour in $n$ of the expectation
$$\langle \phi \rangle = \frac{ \int \phi(w)\, \varphi(w)\, e^{-n \beta L(w)}\, dw }{ \int \varphi(w)\, e^{-n \beta L(w)}\, dw }$$
packages information about the partial derivatives of $L$ at $w^*$ (the Taylor series coefficients). By letting $q$, and thus $L$, vary with $\epsilon$, we capture (indirectly) the variation in our “Taylor list”. Such derivatives are precisely the susceptibilities.
Work with us
We're hiring Research Scientists, Engineers & more to join the team full-time.
Senior researchers can also express interest in a part-time affiliation through our new Research Fellows Program.
@article{gordon2026how,
title={How to Scale Susceptibilities},
author={Andrew Gordon and Max Adam and Jesse Hoogland and Daniel Murfet},
year={2026}
}

1. Structural Inference: Interpreting Small Language Models with Susceptibilities. Garrett Baker, George Wang, Jesse Hoogland, Daniel Murfet, 2025. arXiv preprint arXiv:2504.18274.
2. Embryology of a Language Model. George Wang, Garrett Baker, Andrew Gordon, Daniel Murfet, 2025. arXiv preprint arXiv:2508.00331.
3. Towards Spectroscopy: Susceptibility Clusters in Language Models. Andrew Gordon, Garrett Baker, George Wang, William Snell, Stan van Wingerden, Daniel Murfet, 2026. arXiv preprint arXiv:2601.12703.
4. The Local Learning Coefficient: A Singularity-Aware Complexity Measure. Edmund Lau, Zach Furman, George Wang, Daniel Murfet, Susan Wei, 2023. arXiv preprint arXiv:2308.12108.
5. Differentiation and Specialization of Attention Heads via the Refined Local Learning Coefficient. George Wang, Jesse Hoogland, Stan van Wingerden, Zach Furman, Daniel Murfet, 2024.
6. The Loss Kernel: A Geometric Probe for Deep Learning Interpretability. Maxwell Adam, Zach Furman, Jesse Hoogland, 2025.
7. Influence Dynamics and Stagewise Data Attribution. Jin Hwa Lee, Matthew Smith, Maxwell Adam, Jesse Hoogland, 2025.
8. Bayesian Influence Functions for Hessian-Free Data Attribution. Philipp Alexander Kreer, Wilson Wu, Maxwell Adam, Zach Furman, Jesse Hoogland, 2026.
9. Algebraic Geometry and Statistical Learning Theory. Sumio Watanabe, 2009, Vol 25. Cambridge University Press.
10. Local Graph Partitioning using PageRank Vectors. Reid Andersen, Fan Chung, Kevin Lang, 2006. In 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS'06), pp. 475-486. IEEE.
11. Patterning: The Dual of Interpretability. George Wang, Daniel Murfet, 2026. arXiv preprint arXiv:2601.13548.
12. Similarity of Neural Network Representations Revisited. Simon Kornblith, Mohammad Norouzi, Honglak Lee, Geoffrey Hinton, 2019. In Proceedings of the 36th International Conference on Machine Learning, Vol 97, pp. 3519-3529. PMLR.
13. Bayesian Learning via Stochastic Gradient Langevin Dynamics. Max Welling, Yee Whye Teh, 2011. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 681-688.
14. Covariances, Robustness, and Variational Bayes. Ryan Giordano, Tamara Broderick, Michael I. Jordan, 2017. arXiv preprint arXiv:1709.02536.