How to Scale Susceptibilities

The mathematics and methodologies behind scaling and analyzing susceptibilities.

Authors
Andrew Gordon*†, Max Adam*†, Jesse Hoogland*†, Daniel Murfet*†
Timaeus · * Core Research Contributor · † Post Writing Contributor · Correspondence to andrew@timaeus.co
Published
April 21, 2026

Introduction

The susceptibility clusters used to interpret Pythia 1.4B here are the result of a several-step process. Most of this work is detailed in Baker et al. (2025), Wang et al. (2025), and Gordon et al. (2026), but for convenience we have collected the key definitions and intuitions into a single post, along with detailed citations where the reader can learn more. With two exceptions (PCA whitening as a preprocessing step and the use of auxiliary models for scaling), none of the material here is original, and familiar readers are urged to skim or jump around.

This post can be broken into two major parts. In the first section, we offer a mathematical description of susceptibilities, and in the second we describe how susceptibility data is collected and analyzed. The new material is contained in the second section.

Mathematical Background

We will define here the fundamental mathematical objects used for susceptibilities. We first set up the neural network as a statistical physical system (Section 2.1), then introduce observables and their expectation values under the local posterior (Section 2.2), define susceptibilities as derivatives of these expectation values with respect to changes in the data distribution (Section 2.3), and finally assemble per-component susceptibilities into the susceptibility vectors that are the primary object of study in this work (Section 2.4).

Setup

We begin with a language model whose weights are parametrized by points in a real vector space $\mathcal{W}$. Each point $w \in \mathcal{W}$ defines a next-token predictor: given a context $x$, the model produces a distribution $p(y \mid x, w)$ over possible next tokens $y$. For a context-token pair $(x,y)$, the per-token loss is $\ell_{xy}(w) = -\log p(y \mid x, w)$, and the population loss is $L(w) = \mathbb{E}_{q(x,y)}[\ell_{xy}(w)]$, where $q$ is the data distribution.

From the loss, we can define a probability distribution over weight space, the population posterior:

$$\rho_n^\beta(w) \propto \exp(-n\beta L(w))\,\varphi(w),$$

where $\varphi(w)$ is a prior, $n$ controls the effective dataset size, and $\beta$ is an inverse temperature. This distribution concentrates around low-loss regions, with $n$ and $\beta$ controlling how sharply.

This is the same mathematical structure as the Boltzmann distribution in statistical mechanics. We are modeling a neural network as a statistical physics system where the loss plays the role of the energy (see the companion post for a self-contained introduction to this connection via the Ising model).
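To make the role of $n\beta$ concrete, here is a toy one-dimensional illustration (hypothetical, not from the post) of how a Boltzmann-form distribution $\rho(w) \propto \exp(-n\beta L(w))$ concentrates on low-loss regions as $n\beta$ grows:

```python
import numpy as np

# Toy 1-D illustration (hypothetical, not from the post): the posterior
# rho(w) ∝ exp(-n*beta*L(w)) concentrates on low-loss regions as n*beta grows.
w = np.linspace(-2.0, 2.0, 4001)
dw = w[1] - w[0]
L = (w**2 - 1.0) ** 2            # a double-well loss with minima at w = ±1

def posterior(n_beta):
    """Boltzmann weights exp(-n*beta*L), normalized on the grid."""
    logp = -n_beta * L
    p = np.exp(logp - logp.max())      # subtract max for numerical stability
    return p / (p.sum() * dw)

# Posterior mass within 0.1 of either minimum, for increasing n*beta:
for nb in [1, 10, 100]:
    rho = posterior(nb)
    mass = rho[np.abs(np.abs(w) - 1.0) < 0.1].sum() * dw
    print(f"n*beta = {nb:>3}: mass near the minima = {mass:.3f}")
```

As $n\beta$ increases, essentially all posterior mass ends up in small neighbourhoods of the two minima, mirroring the way low-temperature Boltzmann distributions freeze into low-energy states.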

A component of the model is a subset of the parameters, for example the weights of a single attention head or MLP layer. Fixing a component $C$ induces a decomposition of the weight space $\mathcal{W}$, letting us write $w = (u, c)$, where $c$ denotes the parameters in $C$ and $u$ the rest.

Observables and Expectation Values

An observable is a function $\phi: \mathcal{W} \to \mathbb{R}$ that maps a model's weights to a real number. The simplest example is the population loss $L(w)$ itself.

A key idea from statistical physics is to study the values of $\phi(w)$ not at individual weights, but over the entire distribution $\rho_n^\beta(w)$, as summarized by expectation values:

$$\langle \phi\rangle_{n,\beta} = \int \phi(w)\, \rho_n^\beta(w)\,dw.$$

When our observable is the change in loss relative to a reference point, this gives us an estimator for the local learning coefficient (LLC),

$$\hat \lambda = \big\langle n\beta\,(L(w)-L(w^*))\big\rangle_{n,\beta}.$$

This is a principled measure of model complexity that controls generalization error in the Bayesian setting. The LLC can be interpreted as measuring half the effective dimensionality, $d_\text{eff} = 2\lambda$, though this can be misleading, since $2\lambda$ can be an arbitrary rational number, giving a fractional notion of dimensionality. See Lau et al. (2023) for more details.
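As a minimal numerical sketch, the estimator is just $n\beta$ times the mean excess loss over posterior samples. All arrays and values below are hypothetical stand-ins for actual SGLD draws:

```python
import numpy as np

# Minimal sketch of the LLC estimator (all values hypothetical): lambda_hat
# is n*beta times the mean excess loss L(w_s) - L(w*) over posterior draws.
def llc_estimate(sample_losses, ref_loss, n, beta):
    sample_losses = np.asarray(sample_losses, dtype=float)
    return n * beta * float(np.mean(sample_losses - ref_loss))

# Synthetic stand-in for SGLD draws: losses fluctuating above L(w*) = 0.25
# with mean excess 2e-4, so lambda_hat should come out near 10_000 * 2e-4 = 2.
rng = np.random.default_rng(0)
excess = rng.gamma(shape=2.0, scale=1e-4, size=512)
lam_hat = llc_estimate(0.25 + excess, ref_loss=0.25, n=10_000, beta=1.0)
print(f"estimated LLC: {lam_hat:.2f}")
```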

We can also localize this to specific parts of the model. Given trained weights $w^* = (u^*, c^*)$, the component loss of component $C$ is

$$\phi_C(w) = \delta(u - u^*)\big[L(w) - L(w^*)\big].$$

Here $\delta$ is the Dirac delta function. The associated expectation value $\langle\phi_C\rangle_{n,\beta}$ is an estimator for the weight-refined LLC for component $C$ (up to a constant multiple). It measures the complexity of that component, considered in isolation. In Wang et al. (2024), tracking these per-component LLCs over training revealed the developmental differentiation of attention heads in a small language model: heads that start undifferentiated gradually specialize, with their component LLCs diverging at critical periods, and the shapes of the component LLCs distinguishing different types of heads.

Susceptibilities

The component LLC tells us how much structure lives in a component. It does not tell us what that structure responds to. For that, we need to ask: how does the expectation value of $\phi_C$ change when we modify the data?

Following Baker et al. (2025), we consider a one-parameter mixture of the data distribution: $q_h = (1-h)q + hq'$, where $q$ is the original distribution and $q'$ is another distribution we perturb towards. The loss under this mixture is $L^h(w) = \mathbb{E}_{q_h}[\ell_{xy}(w)]$, and the posterior shifts accordingly. We write $\langle \phi \rangle_{\beta, h}$ for the expectation of an observable $\phi$ over this shifted posterior. The susceptibility of $\phi$ to the perturbation $q'$ is the derivative of this expectation with respect to $h$ at $h=0$:

$$\chi^{\phi} = -\frac{\partial}{\partial h} \langle \phi \rangle_{\beta, h} \bigg|_{h=0}.$$

A standard calculation (the fluctuation-dissipation theorem, proved in Baker et al. (2025), Appendix A) converts this derivative into a covariance under the unperturbed posterior:

$$\chi^{\phi} = n\beta\operatorname{Cov}_\beta\big[\phi(w),\ \Delta L(w)\big],$$

where $\Delta L(w) = L^{q'}(w) - L^q(w)$ is the difference in loss between the perturbed and unperturbed distributions. We can learn how the system would respond to a change in data by observing how it fluctuates without such a change. Methodologically, this means we can reuse the same posterior samples to estimate both expectation values and their derivatives.

The simplest perturbation concentrates all of $q'$ on a single data point $(x,y)$. In this case, $\Delta L(w) = \ell_{xy}(w) - L(w)$, and we obtain the per-sample susceptibility:

$$\chi^{\phi}_{xy} = n\beta\operatorname{Cov}_\beta\big[\phi(w),\ \ell_{xy}(w) - L(w)\big].$$

When the observable is the component loss $\phi_C$, this measures how the structure in component $C$ responds to the data point $(x,y)$. If $\chi^C_{xy}$ is large and positive, training more on $(x,y)$ would increase the effective complexity of $C$; if large and negative, it would simplify $C$.

Generality. Note that both the observable and the perturbation can be chosen freely. The component loss $\phi_C$ is a natural default (it requires no design choices and its expectation gives the weight-refined LLC), but any function of the weights can serve as an observable. Similarly, the perturbation need not concentrate on a single sample: it can be any shift in the data distribution (or even in some other suitable training hyperparameter). Under a different choice of observable (per-sample loss on a held-out query) and a very similar per-sample perturbation, susceptibilities reduce to a generalization of classical influence functions; see Appendix E (Adam et al. 2025; Lee et al. 2025; Kreer et al. 2026).

Susceptibility Estimation

So far we have defined susceptibilities at the population level: $\chi^\phi_{xy}$ is a covariance under $\rho_n^\beta \propto \exp(-n\beta L(w))\varphi(w)$, where $L$ is the expectation of the per-token loss under the true data distribution $q$. Neither expectation is accessible in practice: $L$ requires integration over the entire data distribution $q$, and the posterior expectation requires integrating over all of weight space. To actually compute susceptibilities, we substitute both with finite averages.

Expectation over $q$. Given a finite dataset $D_n = \{(x_i, y_i)\}_{i=1}^n$ of $n$ samples drawn from $q$, we approximate $L$ by the empirical loss

$$L_n(w) = \frac{1}{n}\sum_{i=1}^n \ell_{x_i y_i}(w).$$

In practice, we do not use a fixed set of $n$ samples, but rather a minibatch loss.

This replaces the population posterior with the empirical posterior $\hat\rho_n^\beta \propto \exp(-n\beta L_n(w))\varphi(w)$, and the centered loss $\ell_{xy} - L$ inside the covariance with $\ell_{xy} - L_n$.

Expectation over the posterior. We draw a finite set of weight samples $w_1, \ldots, w_S$ from $\hat\rho_n^\beta$ using Stochastic Gradient Langevin Dynamics (SGLD); see Lau et al. (2023) or the accompanying blog post for details. The population covariance is then replaced by a sample covariance, giving the susceptibility estimator

$$\hat\chi^\phi_{xy} = \frac{n\beta}{S} \sum_{s=1}^S \big(\phi(w_s) - \bar\phi\big)\big(\ell_{xy}(w_s) - L_n(w_s) - \overline{\Delta L_n}\big),$$

where $\bar\phi$ and $\overline{\Delta L_n}$ are the sample means across the $S$ samples.

It is not obvious that either of these substitutions is valid. Understanding how $L_n$ fluctuates around $L$ as $n \to \infty$ for arbitrary learning machines is the core content of singular learning theory (Watanabe 2009). See Appendix C for more details. The analogous question for stochastic-gradient MCMC (whether such methods actually probe the posterior in the singular setting) remains an open problem; see also Appendix D.

From here on, when no confusion can arise, we use "susceptibility" to refer to both the population object $\chi^\phi_{xy}$ and its estimator $\hat\chi^\phi_{xy}$.
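Given SGLD draws, the estimator above is just a sample covariance scaled by $n\beta$. A minimal sketch, with synthetic arrays standing in for $\phi(w_s)$ and $\ell_{xy}(w_s) - L_n(w_s)$ (all names hypothetical):

```python
import numpy as np

def susceptibility_hat(phi, delta_L, n, beta):
    """n*beta times the sample covariance, across SGLD draws s = 1..S, between
    an observable phi(w_s) and the centered loss difference
    delta_L(w_s) = l_xy(w_s) - L_n(w_s). Array names are hypothetical."""
    phi = np.asarray(phi, dtype=float)
    delta_L = np.asarray(delta_L, dtype=float)
    S = phi.shape[0]
    return n * beta / S * float(
        np.sum((phi - phi.mean()) * (delta_L - delta_L.mean()))
    )

# Synthetic check: phi and delta_L are correlated by construction through z,
# with population covariance 2.0 * 0.5 * Var(z) = 1.0.
rng = np.random.default_rng(1)
z = rng.normal(size=2000)
phi_samples = 2.0 * z + rng.normal(scale=0.1, size=2000)
dL_samples = 0.5 * z + rng.normal(scale=0.1, size=2000)
chi_hat = susceptibility_hat(phi_samples, dL_samples, n=1, beta=1.0)
print(f"chi_hat ≈ {chi_hat:.2f}")
```

The same `S` posterior draws can be reused for every token $(x,y)$: only the `delta_L` array changes, which is what makes collecting millions of susceptibilities from one SGLD run feasible.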

Susceptibility Vectors

By choosing a set of components $C_1, \ldots, C_H$, we associate to each data point $(x,y)$ a susceptibility vector:

$$\eta(xy) = \big(\chi^{C_1}_{xy}, \ldots, \chi^{C_H}_{xy}\big) \in \mathbb{R}^H.$$

This vector is a representation of the data point $(x,y)$ from the perspective of the model's loss landscape. In the previously mentioned companion post we show how the values of multiple susceptibilities localized to different "rooms" let us infer where a perturbation is globally in a large grid. The intuition here is analogous. Two data points with similar susceptibility vectors affect the model's internal structure in similar ways; data points with very different vectors affect different parts of the model, or affect the same parts in different directions.

These susceptibility vectors are the primary object of study in this sequence. For Pythia-1.4B, we use 410 components: 384 attention heads (16 per layer across 24 layers), 24 MLP layers, and the embedding and unembedding matrices.

Working with Susceptibility Data

Having defined susceptibilities above, in this section we will describe how we collect these values efficiently at scale, as well as how we work with the data to identify clusters.

The compute involved in SGLD is currently the greatest bottleneck to collecting susceptibilities for large models. Computing the covariances for 6 million susceptibilities on Pythia 1.4B required tens of thousands of forward and backward passes through the model, and roughly 5,000 H200 hours.

Because of this, we have developed a new technique that uses "ground-truth" susceptibility data to efficiently collect susceptibilities on a much larger set of tokens.

Upsampling Data Using an Auxiliary Model

We train an auxiliary (Aux) model: a lightweight external neural network that takes in concatenated residual-stream activations and predicts per-token susceptibility values. We can then run inference with this model over arbitrarily large numbers of tokens, since the cost per batch is a single forward pass of the base model to generate activations, followed by a forward pass of the Aux model to predict susceptibility values.

The process is:

  1. Collect ground-truth susceptibilities using SGLD, along with the corresponding activation vectors.
  2. Train an auxiliary model to predict a token's susceptibility values from a concatenated vector of the model's activations.
  3. Use the auxiliary model to produce susceptibilities, using only a single forward pass of the initial LLM to collect activations, rather than a sampling process.

Introducing an auxiliary model allowed us to use the 4 million susceptibilities collected in Gordon et al. (2026) to produce susceptibilities for 46 million more tokens at a fraction of the compute cost.

Aux Model Details

We experimented with several auxiliary model architectures, the best performing of which was a simple one-layer encoder-decoder model with a 16x hidden-layer size. We also found a small $\ell_1$-norm penalty useful for stability.

For training, we used activations from four evenly spaced intermediate layers of Pythia 1.4B. Rather than predict the 410 component susceptibilities individually, we used the ground-truth susceptibilities to perform principal component analysis and trained the model to make predictions in the PC basis. Previous work (Baker et al. 2025; Wang et al. 2025) identified principal component directions as capturing more patterns in the data than individual model components.

We measure aux model accuracy by the Pearson correlation between predicted susceptibility values and a withheld ground-truth validation set. The model achieved a mean per-PC correlation of 0.714, with stronger performance on the earlier principal components, as seen in the chart below.
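A much-simplified sketch of this pipeline, with a linear least-squares readout standing in for the actual one-layer encoder-decoder and synthetic data in place of real activations and susceptibilities (all shapes and names are hypothetical):

```python
import numpy as np

# Simplified aux-model pipeline sketch: project ground-truth susceptibilities
# onto their top principal components, fit a predictor from activations to
# PC-basis targets, and score with per-PC Pearson correlation on a held-out
# split. A linear readout stands in for the real one-layer encoder-decoder.
rng = np.random.default_rng(0)
n_train, n_val, d_act, d_comp, k_pc = 2000, 500, 64, 410, 16

# Synthetic "activations" and "susceptibilities" with a planted linear link.
A = rng.normal(size=(n_train + n_val, d_act))
W_true = rng.normal(size=(d_act, d_comp))
chi = A @ W_true + 0.1 * rng.normal(size=(n_train + n_val, d_comp))

# 1) PCA on the ground-truth susceptibilities; targets live in the PC basis.
chi_c = chi - chi[:n_train].mean(axis=0)
_, _, Vt = np.linalg.svd(chi_c[:n_train], full_matrices=False)
targets = chi_c @ Vt[:k_pc].T

# 2) Fit the stand-in predictor on the training split.
W_fit, *_ = np.linalg.lstsq(A[:n_train], targets[:n_train], rcond=None)

# 3) Per-PC Pearson correlation on the validation split.
pred, true = A[n_train:] @ W_fit, targets[n_train:]
r = np.array([np.corrcoef(pred[:, j], true[:, j])[0, 1] for j in range(k_pc)])
print(f"mean per-PC Pearson r on held-out split: {r.mean():.3f}")
```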

Clustering

We cluster the susceptibility vectors using a clustering algorithm defined in Gordon et al. (2026), section 3.1. The algorithm is designed to find small clusters of nearby points on the periphery of a large data cloud using a localized PageRank. The algorithm works as follows:

  1. Turn the set of data points into a graph, with points as vertices and edge weights between points given by a decaying function of distance.
  2. Select a random unvisited vertex.
  3. Explore a neighborhood of that vertex by taking random walks on the graph that return to the vertex with fixed probability $\alpha$ at each step. The steady state of such a random process can be efficiently approximated to a given tolerance $\epsilon$; see Andersen et al. (2006) for details.
  4. Order the points by how frequently the random walks visit them and search the prefixes of this ordering for sets that have low conductance: the ratio between total edge weight leaving the set and total edge weight leaving and staying in the set.
  5. Clusters are identified as prefixes with low conductance. If a cluster is found, record it; if not, label all the points visited as "unclustered".
  6. Return to step 2, repeating until every point is visited.

Intuitively, we add points consecutively to a set in order of how "near" they are to a randomly chosen seed vertex, while comparing the distances between points in the set to distances between points in the set and external points. A cluster is an outlier by this measure: many points that are close to each other and far away from everything else. These clusters are collections of tokens that the model regards as similar in some way.
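The core loop (steps 3 and 4) can be sketched as follows, using a non-lazy variant of the Andersen et al. (2006) push algorithm on a toy weighted graph. The graph and parameter values here are illustrative stand-ins, not our actual pipeline:

```python
import numpy as np

def approximate_ppr(adj, seed, alpha=0.1, eps=1e-6):
    """Approximate the PageRank vector personalized at `seed`, with teleport
    probability alpha, to tolerance eps, via a non-lazy push procedure.
    `adj` is a dict {u: {v: weight}}. (Sketch; graph construction omitted.)"""
    p, r = {}, {seed: 1.0}                       # estimate and residual mass
    deg = {u: sum(nbrs.values()) for u, nbrs in adj.items()}
    queue = [seed]
    while queue:
        u = queue.pop()
        if r.get(u, 0.0) < eps * deg[u]:
            continue                             # stale entry, already pushed
        ru = r.pop(u)
        p[u] = p.get(u, 0.0) + alpha * ru        # keep the teleport fraction
        for v, w_uv in adj[u].items():           # spread the rest to neighbors
            r[v] = r.get(v, 0.0) + (1.0 - alpha) * ru * w_uv / deg[u]
            if r[v] >= eps * deg[v]:
                queue.append(v)
    return p

def conductance(adj, S):
    """Edge weight leaving S over total edge weight incident to S."""
    S = set(S)
    cut = sum(w for u in S for v, w in adj[u].items() if v not in S)
    vol = sum(w for u in S for w in adj[u].values())
    return cut / vol if vol else 1.0

# Toy graph: two triangles joined by one weak edge.
adj = {0: {1: 1, 2: 1}, 1: {0: 1, 2: 1}, 2: {0: 1, 1: 1, 3: 0.05},
       3: {2: 0.05, 4: 1, 5: 1}, 4: {3: 1, 5: 1}, 5: {3: 1, 4: 1}}
p = approximate_ppr(adj, seed=0)
order = sorted(p, key=lambda u: p[u] / sum(adj[u].values()), reverse=True)
prefixes = [order[:k] for k in range(1, len(order))]  # exclude the full graph
best = min(((S, conductance(adj, S)) for S in prefixes), key=lambda t: t[1])
print("lowest-conductance prefix:", sorted(best[0]))  # the seed's triangle
```

The sweep over prefixes finds the seed's own triangle as the low-conductance set, which is exactly the "many points close to each other, far from everything else" signature described above.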

This process is sensitive to the choice of hyperparameters used for the graph construction and the random walk, but we were able to reuse the exact hyperparameters from Gordon et al. (2026). Our distance graph was a 45-nearest-neighbor graph, and the random walk used $\alpha = 10^{-3}$, $\epsilon = 10^{-7}$. To account for the increased number of tokens, we ran the core algorithmic loop in parallel, choosing multiple seed vertices simultaneously.

The biggest change from our previous methodology is how we process the data prior to clustering. Raw susceptibility vectors have non-uniform variance across principal components, with earlier PCs (by definition) having higher variance than later PCs. The clustering algorithm we use is adapted from graph-theoretic methods, and begins by turning a point cloud into a distance graph, so it is strongly dependent on the Euclidean distance between susceptibility vectors. To prevent this distance from being dominated by spread along the first few principal components, we apply PCA to the susceptibility matrix and whiten in the PC basis, rescaling each principal component to unit variance. Judged by the number of clusters found and the quality of the patterns detected, this was a substantial improvement to the process.
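The whitening step itself is straightforward; a sketch via the SVD (synthetic data here; in the actual pipeline `X` would be the matrix of susceptibility vectors):

```python
import numpy as np

# PC-whitening sketch: rotate into the principal-component basis and rescale
# each direction to unit variance, so Euclidean distances are no longer
# dominated by the high-variance leading PCs.
def pca_whiten(X):
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    # Columns of U have unit norm and zero mean; scaling by sqrt(n - 1)
    # gives each whitened coordinate unit sample variance.
    return U * np.sqrt(X.shape[0] - 1), Vt, s

# Synthetic data whose first two directions dominate the variance.
rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 8)) * np.array([100, 10, 1, 1, 1, 1, 1, 1])
Z, _, _ = pca_whiten(X)
print(np.round(Z.var(axis=0, ddof=1), 3))   # all coordinates ≈ 1.0
```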

The combined effect of increasing the number of tokens to 46 million and performing a PC-whitening step prior to clustering increased the number of clusters found for Pythia 1.4B from 249 to 57,236.

Conclusion

Though Gordon et al. (2026) was published only a few months ago, we’ve made significant improvements to the process for producing and studying susceptibility clusters. This post provides a general overview of the pipeline, and we strongly encourage the reader to look at the primary post in this collection where we present a novel analysis of the thousands of interpretable clusters in Pythia 1.4B.

Appendix

Why Susceptibilities?

There are several other interpretability techniques, most prominently Sparse Autoencoders, that have similar utility to susceptibility clusters. That is to say, they are methods for unsupervised detection of human-comprehensible patterns that a language model is using to process tokens. With the introduction of an aux model, the susceptibility data we collect is now a transformation of a model's activations, exactly like SAEs.

Given this, it is reasonable to ask why anyone should care about susceptibilities, as compared to the many other tools seeking interesting structure in activation data. We gesture at an answer above, but this is worth saying directly: Though susceptibilities can be used for pattern identification, this is a side effect of their design, not the original intent. Susceptibilities are direct measures of how a datapoint affects the development of an observable during training. As such, they can be used for interpretability (datapoints that affect training in similar ways are often quite similar), but they can also be used, and in fact were constructed, to make principled interventions in the training process of large models.

This was first done successfully at the small-language-model scale in Wang & Murfet (2026). We encourage the reader to explore our post on patterning toy models of superposition as another demonstration of this sort of intervention.

Leaky Activations?

Susceptibilities are a weights-based interpretability method: they depend on a model’s fixed parameters, not on any particular activation. The auxiliary model changes this, since its inputs are intermediate activations. A natural worry follows: could the interesting structure we see in aux-predicted susceptibilities be primarily an artifact of structure already present in activation space, rather than a real reflection of the loss-landscape geometry?

Here is a concrete scenario we are worried about. Suppose two tokens are nearby in activation space because the model regards them as similar, but this relationship is not captured by susceptibilities. Since our auxiliary model applies a continuous map to activation space, the auxiliary-model susceptibilities could then capture information, perhaps even valuable patterns, that does not accurately reflect what susceptibilities should be able to see. More succinctly: are aux models using activations to efficiently produce the intended data, or do they produce an indeterminate blend of two different datasets?

This is an empirical question, and we address it with two complementary metrics.

Global similarity: linear CKA. Centered kernel alignment (Kornblith et al. 2019) measures the similarity of two representations by comparing the covariance structure of their kernel matrices. Given representations $X$ and $Y$ of the same tokens, CKA asks: do pairs of tokens that are similar under $X$ tend to be similar under $Y$? Values range from 0 (no correspondence) to 1 (identical structure up to rotation and scaling). The linear variant admits a feature-space reformulation that scales to the hundreds of thousands of tokens we need here without materialising an $n \times n$ kernel matrix.
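The feature-space form of linear CKA can be sketched as follows (synthetic data; the invariance check illustrates why a value of 1 means "identical up to rotation and scaling"):

```python
import numpy as np

def linear_cka(X, Y):
    """Feature-space form of linear CKA (Kornblith et al. 2019): works with
    d x d cross-covariances instead of n x n kernel matrices.
    X: (n, d1) and Y: (n, d2) are representations of the same n tokens."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    num = np.linalg.norm(Yc.T @ Xc, "fro") ** 2
    den = np.linalg.norm(Xc.T @ Xc, "fro") * np.linalg.norm(Yc.T @ Yc, "fro")
    return float(num / den)

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 32))
R, _ = np.linalg.qr(rng.normal(size=(32, 32)))     # random rotation
print(round(linear_cka(X, X @ R * 3.0), 3))        # rotation + scale: 1.0
print(round(linear_cka(X, rng.normal(size=(10_000, 32))), 3))  # independent: ≈ 0
```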

For each of the roughly 212,000 tokens in the aux-model validation set, we compute the linear CKA between (i) per-layer activations and ground-truth susceptibilities, and (ii) per-layer activations and aux-predicted susceptibilities. The first quantity is a baseline: any similarity it records reflects genuine structure shared between activations and susceptibilities on this data. The second is the object of interest: if it is close to the first, the aux model is faithfully representing this shared structure; if it is substantially higher, the aux model is adding activation structure that is not present in the ground truth.

Both curves are low across all layers: ground truth peaks near 0.15, aux predictions peak near 0.275. The aux curve sits above the ground-truth curve, rising and falling in tandem with it, although we believe that by the standards of CKA in the neural-network-representation literature these numbers are small. In particular, a gap of $\le 0.125$ seems a priori low, given that we generally expect learned maps to be biased towards preserving the local structure of their domain.

Overall, we take the low CKA between ground truth susceptibilities and activations as a positive sign for the orthogonality of their information content. A reasonable caricature seems to be that alignment between activations and susceptibilities comes for free from the input tokens (early layers) and from the target token (late layers), with the middle layers carrying the nontrivial computation.

Local similarity: Jaccard-KNN. CKA captures global covariance structure. It is less informative about the local neighbourhood geometry that our clustering algorithm is actually sensitive to. We complement CKA with a nearest-neighbour measurement: for each token, compute its $k = 50$ nearest neighbours in activation space and in susceptibility space, and report the Jaccard index (size of intersection over size of union) between the two neighbour sets, averaged over all tokens.
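A brute-force sketch of this metric (fine at toy scale; for the roughly 212,000-token sets in the post one would instead use an approximate nearest-neighbour index):

```python
import numpy as np

def mean_jaccard_knn(X, Y, k=50):
    """For each row (token), the Jaccard overlap between its k-NN sets in
    representation X and representation Y, averaged over tokens.
    Brute-force pairwise distances; sketch only."""
    def knn_sets(Z):
        d = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
        np.fill_diagonal(d, np.inf)                # exclude self-neighbours
        return np.argpartition(d, k, axis=1)[:, :k]
    A, B = knn_sets(X), knn_sets(Y)
    scores = []
    for a, b in zip(A, B):
        inter = len(set(a) & set(b))
        scores.append(inter / (2 * k - inter))     # |A ∪ B| = 2k - |A ∩ B|
    return float(np.mean(scores))

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 16))
print(round(mean_jaccard_knn(X, X, k=10), 3))      # identical spaces: 1.0
print(round(mean_jaccard_knn(X, rng.normal(size=(400, 16)), k=10), 3))  # ≈ chance
```

For two unrelated representations the expected overlap is near the chance level $k/n$, which is the baseline against which the reported values should be read.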

The mean Jaccard similarity across all layers is below 0.05 for ground-truth susceptibilities and below 0.19 for aux predictions. This indicates that local neighbourhoods in activation space and susceptibility space are almost disjoint: a token's 50 nearest neighbours in activations have very little overlap with its 50 nearest neighbours in susceptibilities.

Taking stock. CKA and Jaccard-KNN agree on the qualitative picture. Activations and susceptibilities are, on both global and local metrics, different representations of the same tokens. The aux model reproduces this separation — its predictions are measurably more activation-like than the ground truth, but the absolute difference is small, and the resulting predictions remain far from the activations they are computed from. We take this as evidence that the interpretable clusters reported in the main post are not explainable by activation structure leaking through the aux model.

Population and Empirical Losses

Throughout this post, the posterior is defined using the population loss $L(w) = \mathbb{E}_q[\ell_{xy}(w)]$ rather than the empirical loss $L_n(w) = \frac{1}{n}\sum_i \ell_i(w)$. This is a deliberate choice that deserves comment, since in practice we can only ever compute $L_n$.

Singular learning theory is, at its core, a theory of how $L_n$ fluctuates around $L$ as the number of samples $n$ grows. Watanabe's central results generalize the classical central limit theorem to the singular setting, characterizing how the empirical log-likelihood concentrates around the population log-likelihood even when the loss landscape has degenerate geometry. The key quantities that govern this concentration (the RLCT, the singular fluctuation) are properties of the population loss $L$, not of any particular finite sample.

The annealed posterior $\rho \propto \exp(-n\beta L(w))\varphi(w)$ is the natural object from this perspective: it is defined by the geometry of $L$, which is the geometry that ultimately controls generalization. Susceptibilities defined as derivatives of expectations under this posterior are therefore derivatives of well-defined population quantities with clean asymptotic behavior.

An alternative approach, taken in some of our earlier work on Bayesian influence functions, is to define estimators directly in terms of the empirical loss $L_n$. This is described in Appendix E.

On Sampling

Even granting that the empirical loss $L_n$ is a good stand-in for $L$, there is a further concern: we need to actually draw samples from the empirical posterior $\hat\rho_n^\beta$. We do this with stochastic-gradient MCMC methods such as Stochastic Gradient Langevin Dynamics (SGLD; Welling & Teh 2011).

SGMCMC methods come with convergence guarantees in a variety of settings. The usual analyses, however, assume strong conditions (e.g., some version of strong log-concavity). Often these conditions are violated for models like neural networks that have degenerate loss landscapes. Existing proofs do not, in general, establish that SGLD actually samples from $\hat\rho_n^\beta$ in these settings.

This is an active research frontier. In practice, we treat the SGLD samples as approximate draws from $\hat\rho_n^\beta$ and lean on empirical calibration: we tune hyperparameters (step size, temperature $\beta$, prior scale) to produce susceptibility estimates that behave well according to heuristics. Developing better samplers for singular models, and understanding the role of their hyperparameters, is a key direction for future work. See the companion post on hyperparameter selection for the practical choices involved in making the estimator here behave well.

Susceptibilities and (Bayesian) Influence Functions

It is possible to arrive at a closely related object by starting from the other end with the empirical posterior rather than the population posterior, and with a perturbation of sample weights rather than of the data distribution.

The empirical posterior can be written

$$\hat\rho_n^\beta(w) \propto \exp\!\Big(-\beta\sum_{j=1}^n \ell_{x_j y_j}(w)\Big)\varphi(w),$$

which exhibits the per-sample losses as independent contributions to the log-density. We can perturb the weight of a single training sample $(x_i, y_i)$ by $\epsilon$,

$$\hat\rho_{n,\epsilon}^\beta(w) \propto \exp\!\Big(-\beta\sum_{j=1}^n \ell_{x_j y_j}(w) - \beta\epsilon\,\ell_{x_i y_i}(w)\Big)\varphi(w),$$

and differentiate an empirical posterior expectation $\langle\phi\rangle_{n,\epsilon}$ at $\epsilon = 0$. The same fluctuation-dissipation calculation as in the main text gives

$$-\frac{\partial}{\partial \epsilon}\langle\phi\rangle_{n,\epsilon}\bigg|_{\epsilon=0} = \beta \operatorname{Cov}_{\hat\rho_n^\beta}\big[\phi(w),\, \ell_{x_i y_i}(w)\big].$$

This is the Bayesian influence function (BIF), as defined in Adam et al. (2025), Lee et al. (2025), and Kreer et al. (2026), following Giordano et al. (2017) and the classical influence-function literature.

Two points are worth noting. First, the loss inside the covariance is uncentered: just $\ell_{x_i y_i}$, not $\ell_{x_i y_i} - L_n$. The centered form in the main text arose from perturbing the data distribution, which upweights one sample while implicitly downweighting the others; the BIF construction perturbs only a single sample weight, with no compensating downweighting, and no centering appears.

Second, under regularity assumptions and in the large-data limit, the BIF can be Taylor expanded, which recovers the classical influence function as the leading-order term. That is, the BIF is a generalization of the classical IF. It should be seen as a specific case of the susceptibility framework we develop here (uncentered, per-sample loss observable, empirical-first construction).
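On the same posterior draws, the two objects differ only in whether the per-sample loss is centered by $L_n$ before taking the covariance. A synthetic sketch (all arrays are hypothetical stand-ins for quantities evaluated at SGLD draws):

```python
import numpy as np

# Contrast the BIF (uncentered loss) with the susceptibility (loss centered
# by L_n) on the same draws. Correlations are planted through z so that the
# two estimators have different population values (0.8 vs 0.5).
rng = np.random.default_rng(2)
S = 4000
z = rng.normal(size=S)
phi = z                                              # observable phi(w_s)
ell_i = 0.8 * z + rng.normal(scale=0.5, size=S)      # per-sample loss l_i(w_s)
L_n = 0.3 * z + rng.normal(scale=0.5, size=S)        # empirical loss L_n(w_s)
beta, n = 1.0, 1.0

def cov(a, b):
    return float(np.mean((a - a.mean()) * (b - b.mean())))

bif_hat = beta * cov(phi, ell_i)                     # uncentered: BIF
chi_hat = n * beta * cov(phi, ell_i - L_n)           # centered: susceptibility
print(f"BIF ≈ {bif_hat:.2f}, susceptibility ≈ {chi_hat:.2f}")
```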

How do you measure a variation in the loss landscape? If we imagine some variation in the data distribution parametrized by $h$, then it is easy to think about the calculus of how the height of the loss landscape at a single point varies with $h$. However, the loss landscape itself is, as a mathematical object, an infinite number of points (the heights at all points in some small region around the trained parameter $w^*$, say) and it's not a priori clear how to differentiate such a thing. But recall that in a small neighborhood around $w^*$ the population loss $L$ is determined (provided it is analytic) by its Taylor series coefficients. This is an infinite list of numbers which varies with $h$, and we can imagine the derivative of each entry of this list easily enough. To a first approximation, susceptibilities are trying to capture some part of the information in the derivative of this "Taylor list" with respect to $h$.

However, for various reasons we must proceed by a less direct route. Instead of studying variations of the function $L$ near $w^*$ we study variations in the function $\frac{1}{Z_t} \exp(-tL(w))$ where $Z_t = \int_W \exp(-tL(w))\, dw$. We can think of this as a transform of $L$ that retains the same information. For various choices of observable (meaning a function $\phi = \phi(w)$ of $w$) the asymptotic behaviour in $t$ of the expectation

$$\mathbb{E}[\phi] = \frac{1}{Z_t} \int_W \phi(w)\, \exp(-tL(w))\, dw$$

packages information about the partial derivatives of $L$ at $w^*$ (the Taylor series coefficients). By letting $L$, and thus $\mathbb{E}[\phi]$, vary with $h$ we capture (indirectly) the variation in our "Taylor list". Such derivatives $\frac{d}{dh} \mathbb{E}[\phi]$ are precisely the susceptibilities.

Work with us

We're hiring Research Scientists, Engineers & more to join the team full-time.

Senior researchers can also express interest in a part-time affiliation through our new Research Fellows Program.

Citation Information
Please cite as:
Andrew Gordon, Max Adam, Jesse Hoogland, Daniel Murfet, "How to Scale Susceptibilities", 2026.
BibTeX Citation:
@article{gordon2026how,
  title={How to Scale Susceptibilities},
  author={Andrew Gordon and Max Adam and Jesse Hoogland and Daniel Murfet},
  year={2026}
}
References
  1. Structural Inference: Interpreting Small Language Models with Susceptibilities
    Garrett Baker, George Wang, Jesse Hoogland, Daniel Murfet, 2025. arXiv preprint arXiv:2504.18274.
  2. Embryology of a Language Model
    George Wang, Garrett Baker, Andrew Gordon, Daniel Murfet, 2025. arXiv preprint arXiv:2508.00331.
  3. Towards Spectroscopy: Susceptibility Clusters in Language Models
    Andrew Gordon, Garrett Baker, George Wang, William Snell, Stan van Wingerden, Daniel Murfet, 2026. arXiv preprint arXiv:2601.12703.
  4. The Local Learning Coefficient: A Singularity-Aware Complexity Measure
    Edmund Lau, Zach Furman, George Wang, Daniel Murfet, Susan Wei, 2023. arXiv preprint arXiv:2308.12108.
  5. Differentiation and Specialization of Attention Heads via the Refined Local Learning Coefficient
    George Wang, Jesse Hoogland, Stan van Wingerden, Zach Furman, Daniel Murfet, 2024.
  6.
  7. Influence Dynamics and Stagewise Data Attribution
    Jin Hwa Lee, Matthew Smith, Maxwell Adam, Jesse Hoogland, 2025.
  8. Bayesian Influence Functions for Hessian-Free Data Attribution
    Philipp Alexander Kreer, Wilson Wu, Maxwell Adam, Zach Furman, Jesse Hoogland, 2026.
  9. Algebraic Geometry and Statistical Learning Theory
    Sumio Watanabe, 2009, Vol 25. Cambridge University Press.
  10. Local Graph Partitioning using PageRank Vectors
    Reid Andersen, Fan Chung, Kevin Lang, 2006. In 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS'06), pp. 475--486. IEEE.
  11. Patterning: The Dual of Interpretability
    George Wang, Daniel Murfet, 2026. arXiv preprint arXiv:2601.13548.
  12. Similarity of Neural Network Representations Revisited
    Simon Kornblith, Mohammad Norouzi, Honglak Lee, Geoffrey Hinton, 2019. In Proceedings of the 36th International Conference on Machine Learning, Vol 97, pp. 3519--3529. PMLR.
  13. Bayesian Learning via Stochastic Gradient Langevin Dynamics
    Max Welling, Yee Whye Teh, 2011. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 681--688.
  14. Covariances, Robustness, and Variational Bayes
    Ryan Giordano, Tamara Broderick, Michael I. Jordan, 2017. arXiv preprint arXiv:1709.02536.