How to Scale Susceptibilities

The mathematics and methodologies behind scaling and analyzing susceptibilities.

Authors
Andrew Gordon*†, Max Adam*†, Jesse Hoogland*†, Daniel Murfet*†
Timaeus · * Core Research Contributor · † Post Writing Contributor · Correspondence to andrew@timaeus.co
Published
April 21, 2026

Susceptibilities were introduced for neural network interpretability in Baker et al. (2025) and further studied in Wang et al. (2025), Gordon et al. (2026), Wang & Murfet (2026) for a family of observables indexed by components and in Kreer et al. (2025), Adam et al. (2025), and Lee et al. (2025) for a family of observables indexed by tokens in context. In this post we detail the methodologies used to scale susceptibilities to models with billions of parameters, as exhibited in our post Spectroscopy at Scale on susceptibility clusters in Pythia-1.4B.

The post has two parts. In the first part “The Ground Truth Data” we explain how we collect and post-process the ground truth susceptibility data as presented in Baker et al. (2025) and Gordon et al. (2026). We conclude this part with a look at clustering for 4.25M ground truth susceptibility vectors computed for Pythia-1.4B. In the second part “How to Upsample” we present our method for training an auxiliary (Aux) model on these 4.25M ground truth vectors, using which we produce the 46M susceptibility vectors presented in our other post.

The Ground Truth Data

In this section we explain how to take a language model, a set of components of size $\text{n\_components}$, and a dataset of token sequences of size $\text{n\_datapoints}$, and produce the susceptibility matrix $X$ of size $\text{n\_datapoints} \times \text{n\_components}$.

The explanation has several parts:

  • Section 1.1: theoretical susceptibilities, defined at the population level.
  • Section 1.2: the empirical estimator $\hat{\chi}^C_{xy}$ defined for a component $C$ and token sequence $xy$, which is what the code (see the GitHub repository) produces. The raw susceptibility matrix $X_{\text{raw}}$ has these quantities as its entries.
  • Section 1.3: we perform two post-processing steps (standardisation and PCA whitening) to convert the raw susceptibility matrix into the final susceptibility matrix $X$ that is ready to be input into our clustering pipeline. We sometimes refer to the rows of $X$ or $X_{\text{raw}}$ as ground truth susceptibilities, in contrast to the predicted susceptibilities considered in the next section.
  • Section 1.4: we recall the clustering algorithm that is used to group token sequences, as represented by the rows of the susceptibility matrix.

Population Susceptibility

In the setting of neural networks, susceptibilities measure the first-order response of the posterior distribution over parameters to a variation in the data distribution. This measurement depends on two choices: an observable (denoted here by $\phi$) whose expectation value is the chosen “window” into the information contained in the posterior distribution, and a variation of the data distribution (here associated to the variation of the probability of a single token sequence, usually denoted $xy$).

There is another important, but more technical, distinction to make: whether we talk about susceptibilities at the population level or at the empirical level. At the population level we work with the population loss $L$ and variations in the data distribution. The susceptibility defined at the population level is then associated with an estimator defined in terms of the empirical loss $L_n$. This is the approach taken in Baker et al. (2025) and Gordon et al. (2026). Alternatively, the susceptibility estimator can be motivated directly at the level of the empirical loss $L_n$ in terms of variations in weights associated to individual token sequences $x_i y_i$ appearing in a dataset $D_n$; this is the approach taken in Kreer et al. (2025), Adam et al. (2025) and Lee et al. (2025). Since the empirical-level motivation for the per-component loss observables is more cumbersome than the population-level one, we stick with the population-level approach in this post.

We begin with a language model whose weights are parametrized by points in $W \subseteq \mathbb{R}^d$. Each point $w \in W$ defines a next-token predictor: given a context $x$, the model produces a distribution $p(y \mid x, w)$ over possible next tokens $y$. For a context-token pair $(x,y)$, the per-token loss is $\ell_{xy}(w) = -\log p(y \mid x, w)$, and the population loss is $L(w) = \mathbb{E}_{q(x,y)}[\ell_{xy}(w)]$, where $q$ is the data distribution.

From the loss, we can define a probability distribution over weight space, the population posterior:

$$\rho(w) = \frac{1}{Z}\exp(-n\beta L(w))\,\varphi(w), \qquad Z = \int \exp(-n\beta L(w))\,\varphi(w)\,dw,$$

where $\varphi(w) > 0$ is a prior, $n$ controls the effective dataset size, and $\beta$ is an inverse temperature. This distribution concentrates around low-loss regions, with $n$ and $\beta$ controlling how sharply. This is the same mathematical structure as the Boltzmann distribution in statistical mechanics. We are modeling a neural network as a statistical physics system where the loss plays the role of the energy (see the post Interpreting the Ising Model for a self-contained introduction to this connection).

An observable is a function $\phi(w): W \to \mathbb{R}$ or, more generally, a generalized function in the sense of Schwartz distributions. Some simple examples are the population loss $L(w)$ or per-token loss $\ell_{xy}(w)$, centered versions thereof $L(w) - L(w^*)$, $\ell_{xy}(w) - \ell_{xy}(w^*)$, and products of these functions with Dirac distributions $\delta_S$ for submanifolds $S \subseteq W$. We refer to observables that involve Dirac distributions as distributional observables. We introduce the following notation for expectation values with respect to the population posterior

$$\langle \phi \rangle = \int \phi(w)\, \rho(w)\, dw.$$

A component of the model is a subset of the parameters, for example the weights of a single attention head or MLP layer. Fixing a component $C$ induces a decomposition of the weight space $W$, letting us write $w = (u, c)$, where $c$ denotes the parameters in $C$ and $u$ the rest.

Given weights $w^* = (u^*, c^*)$ we define the per-component loss $\phi_C$ for a component $C$ at $w^*$ to be the generalized function

$$\phi_C(w) = \delta_{\{u^*\} \times C}\,\big[L(w) - L(w^*)\big]$$

where $\delta_{\{u^*\} \times C}$ is the Dirac distribution, defined by $\int f(u,c)\,\delta_{\{u^*\} \times C}\,du\,dc = \int_C f(u^*,c)\,dc$. One way to motivate this observable is that its expectation value is related to the weight-refined LLC for the component $C$, as discussed in Wang et al. (2024). More precisely, the weight-refined LLC estimator is, up to a factor of $n\beta$, an estimator for $\langle \phi_C \rangle$. We can think of this as a component-specific measure of the amount of information in the model about the data distribution.

Following Baker et al. (2025), we consider a one-parameter mixture of the data distribution: $q_h = (1-h)q + hq'$, where $q$ is the original distribution and $q'$ is another distribution we perturb towards. The loss under this mixture is $L^h(w) = \mathbb{E}_{q_h}[\ell_{xy}(w)]$, and the posterior shifts accordingly. We write $\langle \phi \rangle_{h}$ for the expectation of an observable $\phi$ over this shifted posterior. The susceptibility of $\phi$ to the perturbation $q'$ is the derivative of this expected value with respect to $h$ at $h=0$:

$$\chi^{\phi} = \frac{1}{n\beta}\frac{\partial}{\partial h} \langle \phi \rangle_{h} \bigg|_{h=0}$$

A standard calculation (the fluctuation-dissipation theorem, proved in Baker et al. (2025), Appendix A) converts this derivative into a covariance under the unperturbed posterior:

$$\chi^{\phi} = -\operatorname{Cov}\Big[\phi(w),\ \Delta L(w)\Big],$$

where $\Delta L(w) = L^{h=1}(w) - L(w)$ is the difference in loss between the perturbed and unperturbed distributions. We can learn how the system would respond to a change in data by observing how it fluctuates without such a change. Methodologically, this means we can reuse the same posterior samples to estimate both expectation values and their derivatives.

The simplest perturbation concentrates all of $q'$ on a single data point $(x,y)$. In this case, $\Delta L(w) = \ell_{xy}(w) - L(w)$, and we obtain the per-sample susceptibility:

$$\chi^{\phi}_{xy} = -\operatorname{Cov}_\beta\Big[\phi(w),\ \ell_{xy}(w) - L(w)\Big].$$

When the observable is the component loss $\phi_C$, this measures how the structure in component $C$ responds to the data point $(x,y)$.

When $\phi$ is taken to be the per-token loss, susceptibilities reduce to a generalization of classical influence functions (Kreer et al. 2025; Adam et al. 2025; Lee et al. 2025).

Estimator and the Raw Susceptibility Matrix

So far we have defined susceptibilities at the population level: $\chi^\phi_{xy}$ is a covariance under $\rho \propto \exp(-n\beta L(w))\,\varphi(w)$, where $L$ is the expectation of the per-token loss under the true data distribution $q$. Neither expectation is accessible in practice: $L$ requires integration over the entire data distribution $q$, and the posterior expectation requires integrating over all of weight space. To actually compute susceptibilities, we substitute both with finite averages.

Expectation over $q$. Given a finite dataset $D_n = \{(x_i, y_i)\}_{i=1}^n$ of $n$ samples drawn from $q$, we approximate $L$ by the empirical loss

$$L_n(w) = \frac{1}{n}\sum_{i=1}^n \ell_{x_i y_i}(w).$$

In practice, we don’t use a fixed set of $n$ samples, but rather a minibatch loss. This replaces the population posterior with the empirical posterior $\hat\rho_n \propto \exp(-n\beta L_n(w))\,\varphi(w)$, and the centered loss $\ell_{xy} - L$ inside the covariance with $\Delta L_n = \ell_{xy} - L_n$.

Expectation over the posterior, ordinary observables. We draw a finite set of weight samples $w_1, \ldots, w_S$ from $\hat\rho_n$ using Stochastic Gradient Langevin Dynamics (SGLD); see Lau et al. (2023) or the accompanying blog post for details. The population covariance is then replaced by a sample covariance, giving the susceptibility estimator for any (ordinary) observable $\phi$

$$\hat\chi^\phi_{xy} = \frac{1}{S} \sum_{s=1}^S \big(\phi(w_s) - \bar\phi\big)\big( \Delta L_n(w_s) - \overline{\Delta L_n}\big),$$

where $\bar\phi$ and $\overline{\Delta L_n}$ are the sample means across the $S$ samples. Here by “ordinary” we mean that $\phi$ is a function and not a distribution.
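As a concrete illustration, here is a minimal sketch of this estimator, assuming the SGLD draws have already been collected. The array names `phi_samples` (values of $\phi$ at each draw) and `delta_Ln_samples` (values of $\Delta L_n = \ell_{xy} - L_n$ at each draw) are illustrative, not the identifiers used in our codebase.

```python
import numpy as np

def susceptibility_estimate(phi_samples: np.ndarray, delta_Ln_samples: np.ndarray) -> float:
    """Sample-covariance estimator for an ordinary observable.

    phi_samples:      shape (S,), phi(w_s) evaluated at each SGLD draw w_s.
    delta_Ln_samples: shape (S,), ell_{xy}(w_s) - L_n(w_s) at the same draws.
    """
    phi_centered = phi_samples - phi_samples.mean()
    delta_centered = delta_Ln_samples - delta_Ln_samples.mean()
    # 1/S normalisation, matching the formula above (not the unbiased 1/(S-1)).
    return float(np.mean(phi_centered * delta_centered))
```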

Expectation over the posterior, distributional observables. When the observable is a distribution rather than an ordinary function, as with the component observable $\phi_C$, the estimator needs to be designed differently to account for the submanifold $S$ appearing in the Dirac distribution $\delta_S$. For $\phi_C$ the delta factor restricts integration to the slice $\{u^*\} \times C$ in parameter space. The covariance then decomposes into a hybrid expectation over two posteriors: the weight-refined posterior

$$\hat\rho^{C}_n(c) \propto \exp\big(-n\beta L_n(u^*, c)\big)\, \varphi(u^*, c)$$

supported on the slice, and the full empirical posterior $\hat\rho_n$. Correspondingly the estimator uses two independent SGLD chains: a restricted chain sampling only the component parameters $c$, with $u$ clamped at $u^*$, producing draws $c_1, \ldots, c_R$; and an unrestricted chain producing $w_1, \ldots, w_S$ as before.

Writing $g(c) = L_n(u^*, c) - L_n(w^*)$, the susceptibility estimator is

$$\hat\chi^{C}_{xy} = \frac{1}{R}\sum_{t=1}^{R} g(c_t)\, \Delta L_n(u^*, c_t) - \Bigg(\frac{1}{R}\sum_{t=1}^{R} g(c_t)\Bigg)\Bigg(\frac{1}{S}\sum_{s=1}^{S} \Delta L_n(w_s)\Bigg).$$

Both factors of the first term and the mean of $g$ in the second term come from the restricted chain; the mean of $\ell_{xy} - L_n$ in the second term comes from the full chain. This is the estimator introduced in Baker et al. (2025). Because the $\phi_C$ observable’s covariance carries a $C$-dependent factor $Z_C/Z_{\text{full}}$ (the ratio of partition functions on the slice versus the full parameter space) that the estimator does not compute, $\hat\chi^{C}_{xy}$ is actually an estimator for the renormalized susceptibility $\frac{Z_{\text{full}}}{Z_C} \chi^{C}_{xy}$, not for the population susceptibility itself. This $C$-dependent (but $xy$-independent) prefactor is absorbed by the column z-scoring step described in the next section. This empirical renormalized susceptibility estimator converges to its population counterpart as $n\beta \to \infty$ with $\beta \to 0$, that is, the estimator is consistent.
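To make the bookkeeping explicit, here is a minimal sketch of the two-chain estimator, again assuming the SGLD draws are already in hand. The array names (`g_restricted`, `delta_Ln_restricted`, `delta_Ln_full`) are illustrative, not the identifiers in our code.

```python
import numpy as np

def component_susceptibility_estimate(
    g_restricted: np.ndarray,         # shape (R,): g(c_t) = L_n(u*, c_t) - L_n(w*) on the restricted chain
    delta_Ln_restricted: np.ndarray,  # shape (R,): ell_{xy}(u*, c_t) - L_n(u*, c_t) on the restricted chain
    delta_Ln_full: np.ndarray,        # shape (S,): ell_{xy}(w_s) - L_n(w_s) on the full chain
) -> float:
    """Two-chain estimator for the (renormalized) component susceptibility."""
    first_term = np.mean(g_restricted * delta_Ln_restricted)
    second_term = np.mean(g_restricted) * np.mean(delta_Ln_full)
    return float(first_term - second_term)
```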

The raw susceptibility matrix. We choose $\text{n\_components}$ components $C$ and $\text{n\_datapoints}$ token sequences $xy$. The raw susceptibility matrix has rows indexed by datapoints, columns indexed by components, and the estimated susceptibilities as its entries

$$X_{\text{raw}} := \Big( \hat\chi^{C}_{xy} \Big)_{xy,\, C} \in \operatorname{Mat}(\mathbb{R},\ \text{n\_datapoints} \times \text{n\_components}).$$

For example, in Pythia-1.4B, we use 410 components: 384 attention heads (16 per layer across 24 layers), 24 MLP layers, and the embedding and unembedding matrices. Token sequences are sampled uniformly from 13 subsets of the Pile as documented in Gordon et al. (2026).

Post-Processing

We feed the raw susceptibility matrix $X_{\text{raw}}$ through two post-processing steps: standardisation (that is, taking z-scores along columns) and PCA whitening.

Standardisation. We replace each column $x$ of $X_{\text{raw}}$ by the so-called z-score $z = (x - u)/s$, where $u$ is the mean of the column entries and $s$ the standard deviation, as implemented for example by scikit-learn’s StandardScaler. Denote the resulting matrix $X'$. In the standard terminology the columns of the data matrix $X_{\text{raw}}$ are called features and the rows samples. Note that dividing by $s$ means that the resulting z-scored matrix $X'$ does not depend on any $C$-dependent factors in the susceptibility estimator, and in particular this removes the dependence on the partition function ratios $\frac{Z_{\text{full}}}{Z_C}$.
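In code, this step is a one-liner with scikit-learn. The sketch below assumes `X_raw` is the raw susceptibility matrix as a NumPy array of shape (n_datapoints, n_components); the variable names are illustrative.

```python
from sklearn.preprocessing import StandardScaler

# Column-wise z-scoring: each component (column) gets zero mean and unit variance
# across datapoints, which removes the C-dependent prefactor discussed above.
X_prime = StandardScaler().fit_transform(X_raw)
```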

PCA whitening. After standardization the columns of $X'$ are on the same scale, but remain correlated. Because the clustering algorithm uses Euclidean distance, directions along which $X'$ has high variance dominate pairwise distances. To equalise the contribution of every direction we apply the PCA whitening transform. Let

$$X' = U \, \Sigma \, V^\top$$

be the thin singular value decomposition, with $U \in \mathbb{R}^{N \times d}$ having orthonormal columns, $V \in \mathbb{R}^{d \times d}$ orthogonal, and $\Sigma = \mathrm{diag}(\sigma_1, \ldots, \sigma_d)$ the diagonal matrix of singular values. The whitened matrix is

$$X = X' \, V \, \Lambda^{-1/2} = \sqrt{N-1}\; U,$$

where $\Lambda = \tfrac{1}{N-1}\Sigma^2$. Equivalently, whitening sends $X' = U \Sigma V^\top$ to $\sqrt{N-1}\,U$: both the right singular vectors $V$ and the singular value spectrum $\Sigma$ are absorbed into the transform, leaving only the orthonormal factor $U$ (rescaled by $\sqrt{N-1}$). By construction $X^\top X = (N-1)\,I$, so all $d$ directions contribute equally to Euclidean distance on the rows of $X$.

The connection to the PCA language is: the columns of $V$ diagonalise $X'^\top X' = V \Sigma^2 V^\top$ and are the usual principal directions; the eigenvalues $\lambda_i = \sigma_i^2/(N-1)$ of the sample covariance $\widehat\Sigma = X'^\top X' / (N-1)$ fill the diagonal of $\Lambda$; the entries of $X' V$ are the coordinates of each sample in the principal basis, and the subsequent division by $\sqrt{\lambda_i}$ rescales each principal coordinate to unit sample variance. The simplification $X = \sqrt{N-1}\,U$ is then just the SVD identity $X' V \Sigma^{-1} = U$.
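A minimal sketch of the whitening step via the thin SVD, assuming `X_prime` is the standardised matrix from the previous step (the variable names are illustrative):

```python
import numpy as np

def pca_whiten(X_prime: np.ndarray) -> np.ndarray:
    """PCA whitening: rotate into the principal basis and rescale each direction
    to unit sample variance, so that X.T @ X = (N - 1) * I."""
    N = X_prime.shape[0]
    U, sigma, Vt = np.linalg.svd(X_prime, full_matrices=False)  # thin SVD: X' = U diag(sigma) V^T
    lam = sigma**2 / (N - 1)                                    # sample-covariance eigenvalues
    X = (X_prime @ Vt.T) / np.sqrt(lam)                         # equals sqrt(N - 1) * U
    return X
```

If the standardised matrix is numerically rank-deficient, some $\lambda_i$ can be close to zero; in that case one would truncate or regularise those directions before dividing.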

Empirically, the leading right singular vector $v_1$ of $X'$ (i.e., the top principal direction) is approximately the uniform direction $\mathbf{1} \in \mathbb{R}^d$ and $\sigma_1^2$ accounts for the majority of $\|X'\|_F^2$. In Section 3.2 of Gordon et al. (2026) we addressed this “uniform mode” by subtracting each row’s mean, which deletes exactly the $\mathbf{1}$ direction. PCA whitening replaces this ad hoc step with a principled one: rather than excising a single hand-picked direction, it rescales every $v_i$ to unit variance, so the dominant uniform mode is put on the same footing as every other right singular vector.

Figure: The Effects of Preprocessing. The above figure shows a simplified example of how our preprocessing steps change the data. The original data has coordinates that are correlated, but have very different magnitudes. In previous work, we rescaled the standard axes to correct for this, but now we normalize the principal components directly, which evens the data out in all directions.

Error in Preprocessing: We found a bug in the code which ends up performing the old row normalization step after PCA whitening. Since we have already changed the data into a PC basis this corresponds to deleting a particular direction in the data. Susceptibilities for Pythia 1.4B are 410 dimensional, and we think it is unlikely that removing one dimension has a noticeable impact on the data, but it is nonetheless an unintended behavior. We are currently performing clustering with corrected preprocessing steps, and will edit this document to reflect our results when we have them.

Clustering

We find clusters from the susceptibility vectors using a clustering algorithm defined in Gordon et al. (2026), Section 3.1. The algorithm is designed to find small clusters of nearby points on the periphery of a large data cloud using a localized PageRank. The algorithm works as follows:

  1. Turn the set of datapoints into a graph, with points as vertices, and edge weights between points a decaying function of distance. We do not use raw distances between susceptibility vectors; see the post-processing section above for details.
  2. Select a random unvisited vertex.
  3. Explore a neighborhood of that vertex by taking random walks on the graph that return to the vertex with fixed probability $\alpha$ at each step. The steady state of such a random process can be efficiently approximated to a given tolerance $\epsilon$; see Andersen et al. (2006) for details.
  4. Order the points by how frequently the random walks visit them and search the prefixes of this ordering for sets that have low conductance: the ratio between the total edge weight leaving the set and the total edge weight leaving and staying in the set.
  5. Clusters are identified as prefixes with low conductance. If a cluster is found, record it; if not, label all the points visited as "unclustered".
  6. Return to step 2, repeating until every point is visited.

Intuitively, we add points consecutively to a set in order of how “near” they are to a randomly chosen seed vertex, while comparing the distances between points in the set to distances between points in the set and external points. A cluster is an outlier by this measure: Many points that are close to each other and far away from everything else. These clusters are collections of tokens that the model regards as similar in some way.

This process is sensitive to the choice of hyperparameters used for the graph construction and the random walk, but we were able to reuse the exact hyperparameters from Gordon et al. (2026). Our distance graph was a 45-nearest neighbor graph, and the random walk used $\alpha = 10^{-3}$, $\epsilon = 10^{-7}$. To account for the increased number of tokens, we ran the core algorithmic loop in parallel, choosing multiple seed vertices simultaneously.
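For concreteness, below is a minimal, unoptimised sketch of one iteration of the loop described above: build a symmetrised k-nearest-neighbour graph, run a push-style approximation of the personalised PageRank from a seed vertex, and sweep prefixes of the resulting ordering for a low-conductance set. It uses binary edge weights and a naive residual scan for readability; the actual implementation (see Gordon et al. (2026) and the repository) differs in these details.

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

def knn_graph(X, k=45):
    """Symmetrised k-nearest-neighbour graph with binary edge weights (CSR)."""
    A = kneighbors_graph(X, n_neighbors=k, mode="connectivity")
    return A.maximum(A.T).tocsr()

def approximate_ppr(A, seed, alpha=1e-3, eps=1e-7):
    """Push-style approximation of personalised PageRank (cf. Andersen et al. 2006)."""
    deg = np.asarray(A.sum(axis=1)).ravel()
    p = np.zeros(A.shape[0])
    r = np.zeros(A.shape[0])
    r[seed] = 1.0
    while True:
        # Naive scan for a vertex with large residual; a real implementation keeps a queue.
        active = np.where((r > 0) & (r >= eps * deg))[0]
        if active.size == 0:
            return p
        u = active[0]
        nbrs = A.indices[A.indptr[u]:A.indptr[u + 1]]
        p[u] += alpha * r[u]
        r[nbrs] += (1 - alpha) * r[u] / (2 * deg[u])
        r[u] = (1 - alpha) * r[u] / 2

def sweep_low_conductance(A, p):
    """Order vertices by degree-normalised PageRank; return the lowest-conductance prefix."""
    deg = np.asarray(A.sum(axis=1)).ravel()
    order = np.argsort(-p / np.maximum(deg, 1))
    order = order[p[order] > 0]
    vol_total = deg.sum()
    in_set = np.zeros(A.shape[0], dtype=bool)
    vol = cut = 0.0
    best_phi, best_k = np.inf, 0
    for k, u in enumerate(order, start=1):
        in_set[u] = True
        vol += deg[u]
        nbrs = A.indices[A.indptr[u]:A.indptr[u + 1]]
        inside = int(in_set[nbrs].sum())
        cut += deg[u] - 2 * inside  # new boundary edges minus edges absorbed into the set
        phi = cut / max(min(vol, vol_total - vol), 1e-12)
        if phi < best_phi:
            best_phi, best_k = phi, k
    return order[:best_k], best_phi
```

The full pipeline repeats this from random unvisited seed vertices (in parallel at scale) and records a cluster whenever the best prefix has sufficiently low conductance.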

For Pythia-1.4B we have available on the Cluster Explorer both the clustering of the 4.25M ground truth (GT) susceptibilities (that is, of $X$ as defined above) and the 46M upsampled susceptibilities (defined in the next section).

How to Upsample

The computation involved in SGLD is currently the greatest bottleneck to estimating susceptibilities for large models. Computing the covariances for 4.25 million susceptibilities on Pythia 1.4B required many forward and backward passes on the model, and roughly 4800 H200 hours. Little effort has been put into optimizing this process and we expect many gains are possible, but in this section we present a new approach to cheaply predicting approximate susceptibilities that we call upsampling.

In the context of signal processing, upsampling refers to approximating the sequence of samples that would have been obtained by sampling a signal at a higher rate. Neural upsampling, that is, training neural networks to predict interpolating signals, is a commonly used technique for producing “super-resolution” images. In our case the relevant “signal” is the mapping from a token sequence $xy$ to the susceptibility vector $\chi_{xy}$. This mapping is, we hope, nontrivial and interesting (this is the whole point of susceptibility-based interpretability). However, we also expect that there is learnable structure in this map, which a neural network could exploit to cheaply approximate with one forward pass what SGLD does with hundreds of forward and backward steps.

Cheap approximations to susceptibilities can be used to examine orders of magnitude more tokens for large models than we can afford to estimate directly, and can also serve as part of a fast inner loop in applications to patterning (Wang & Murfet, 2026). Indeed, patterning during training may be intractable without upsampling.

Upsampling Data Using an Auxiliary Model

Throughout we fix a primary language model whose susceptibilities $\chi_{xy}$ we wish to approximate. To do so we train an auxiliary (Aux) model: a lightweight neural network that takes in concatenated residual stream activations of the primary model and predicts per-token susceptibility values. We can then run these models over arbitrarily large numbers of tokens, as the cost per batch is a single forward pass of the base model to generate activations, followed by a forward pass of the Aux model to predict susceptibility values.

The process is:

  1. Collect ground-truth susceptibilities using SGLD, as well as the corresponding activation vectors.
  2. Train an auxiliary model to predict a token’s susceptibility values from a concatenated vector of the model’s activations.
  3. Use the auxiliary model to produce susceptibilities, using only a single forward pass of the initial LLM to collect activations, rather than a sampling process.

Introducing an auxiliary model allowed us to use the 4.25 million susceptibilities collected in Gordon et al. (2026) to produce susceptibilities for 46 million more tokens at a fraction of the compute cost.

Figure: Aux-predicted vs. ground-truth susceptibilities for Pythia-14M. Left ($\chi_{\text{pred}}$): 6.5M susceptibility vectors predicted by the auxiliary model. Right ($\chi_{\text{true}}$): 850k ground-truth SGLD-estimated susceptibility vectors, drawn from the same Pile subsets and used to train the auxiliary model. Both panels use the same UMAP reducer, fitted on the ground-truth vectors and then applied to the auxiliary predictions, so coordinates are directly comparable. The auxiliary model recovers the qualitative organisation of the ground-truth embedding (each point is a token sequence, colored as in Gordon et al. (2026)). The 8X larger Aux set fills in density but does not distort the macro-structure.

Aux Model Details

We experimented with several auxiliary model architectures, the best performing of which was a simple one-layer encoder-decoder model with a 16x hidden-layer expansion. We also found a small L1-norm penalty useful for stability.

For training, we used activations from layers 5, 11, 17, and 23 of Pythia 1.4B. Rather than predict the 410 component susceptibilities individually, we used the ground-truth susceptibilities to perform principal component analysis and trained the model to predict in the PC basis. Previous work (Baker et al. 2025; Wang et al. 2025) identified principal component directions as capturing more patterns in the data than individual model components.
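As a rough illustration of the shape of such a model, here is a PyTorch sketch of a one-layer encoder-decoder regressor from concatenated activations to PC-basis susceptibilities. The 16x expansion and the small L1 penalty follow the description above; everything else (the choice of reference dimension for the expansion, the MSE loss, names, optimiser settings) is an assumption for illustration, not our exact training code.

```python
import torch
import torch.nn as nn

class AuxModel(nn.Module):
    """One-layer encoder-decoder: concatenated activations -> susceptibilities in the PC basis."""

    def __init__(self, d_act: int, n_pcs: int, expansion: int = 16):
        super().__init__()
        self.encoder = nn.Linear(d_act, expansion * n_pcs)
        self.decoder = nn.Linear(expansion * n_pcs, n_pcs)

    def forward(self, acts: torch.Tensor) -> torch.Tensor:
        return self.decoder(torch.relu(self.encoder(acts)))

def training_step(model, acts, chi_pc, optimizer, l1_coeff=1e-5):
    """One regression step with a small L1 penalty on the weights for stability."""
    optimizer.zero_grad()
    pred = model(acts)
    loss = nn.functional.mse_loss(pred, chi_pc)
    loss = loss + l1_coeff * sum(p.abs().sum() for p in model.parameters())
    loss.backward()
    optimizer.step()
    return loss.item()
```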

We measure Aux model accuracy by the Pearson correlation between predicted susceptibility values and a withheld ground-truth validation set. The model achieved a mean per-PC correlation of 0.714 over the 410 PCs. As shown below, prediction accuracy tracks the explained variance of each principal component: the highest-variance PCs are predicted with correlations above 0.95, while the lowest-variance PCs plateau around 0.5–0.6.
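The per-PC Pearson correlation metric is straightforward to compute. A minimal sketch, assuming `chi_pred` and `chi_true` are arrays of shape (n_tokens, n_pcs) holding predicted and held-out ground-truth susceptibilities in the PC basis (names illustrative):

```python
import numpy as np

def per_pc_pearson(chi_pred: np.ndarray, chi_true: np.ndarray) -> np.ndarray:
    """Pearson correlation between predictions and ground truth, computed per PC (column)."""
    pred_c = chi_pred - chi_pred.mean(axis=0)
    true_c = chi_true - chi_true.mean(axis=0)
    num = (pred_c * true_c).sum(axis=0)
    denom = np.sqrt((pred_c**2).sum(axis=0) * (true_c**2).sum(axis=0))
    return num / denom

# The summary statistic reported above is the mean over all PCs:
# mean_corr = per_pc_pearson(chi_pred, chi_true).mean()
```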

Training Aux Models

We studied how Aux model performance scales with the number of ground-truth training tokens for Pythia-14M, varying both dataset size and hidden layer expansion factor. Each model was trained to convergence.

Performance continues to improve up to the full 44M tokens available, with saturation towards the higher token counts. The Pythia-1.4B Aux model was only trained on 4.25M ground-truth susceptibilities, which, assuming that aux models trained on susceptibilities of larger models benefit from increased dataset size, would place it in the data-limited regime of this curve. This suggests that collecting more ground-truth data would yield meaningful improvements in aux model fidelity. We also note that in this data-limited regime, Aux models converge quickly and are prone to overfitting; a light L1 penalty on the weights was useful for stability.

Evaluating Aux Models

Auxiliary model susceptibilities are intended as a compute-efficient approximation of ground truth susceptibilities. We present further analysis of how faithful this approximation is:

Representational Similarity to Activations

We had the following worry when first studying Aux model susceptibilities: Could the interesting structure we see be an artifact of structure present in activation space but not present in regular, ground truth susceptibilities?

Here is a concrete scenario we are worried about: Suppose two tokens are nearby in activation space because the model regards them as similar, but this relationship is not captured by susceptibilities. Since our auxiliary model applies a continuous map to activation space, the auxiliary model susceptibilities could capture information, valuable patterns even, that does not accurately reflect what susceptibilities are able to see. More succinctly: are Aux models using activations to efficiently produce the intended data, or do they show an indeterminate blend of two different datasets?

This is an empirical question, and we address it with two complementary metrics.

Global similarity: linear CKA. Centered kernel alignment (Kornblith et al. 2019) measures the similarity of two representations by comparing the covariance structure of their kernel matrices. Given representations $X$ and $Y$ of the same tokens, CKA asks: do pairs of tokens that are similar under $X$ tend to be similar under $Y$? Values range from 0 (no correspondence) to 1 (identical structure up to rotation and scaling). The linear variant admits a feature-space reformulation that scales to the hundreds of thousands of tokens we need here without materialising an $n \times n$ kernel matrix.
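A minimal sketch of linear CKA in the feature-space formulation (Kornblith et al. 2019), which avoids forming the $n \times n$ kernel matrices; here `X` and `Y` are hypothetical (n_tokens, d) arrays holding the two representations of the same tokens:

```python
import numpy as np

def linear_cka(X: np.ndarray, Y: np.ndarray) -> float:
    """Linear CKA via the feature-space formulation:
    ||Y^T X||_F^2 / (||X^T X||_F ||Y^T Y||_F), after centering columns."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    cross = np.linalg.norm(Y.T @ X, "fro") ** 2
    norm_x = np.linalg.norm(X.T @ X, "fro")
    norm_y = np.linalg.norm(Y.T @ Y, "fro")
    return float(cross / (norm_x * norm_y))
```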

For each of the roughly 212,000 tokens in the Aux-model validation set, we compute the linear CKA between (i) per-layer activations and ground-truth susceptibilities, and (ii) per-layer activations and aux-predicted susceptibilities. The first quantity is a baseline: any similarity it records reflects genuine structure shared between activations and susceptibilities on this data. The second is the object of interest: if it is close to the first, the Aux model is faithfully representing this shared structure; if it is substantially higher, the Aux model is adding activation structure that is not present in the ground truth.

Both curves are low across all layers: ground truth peaks near 0.15, Aux predictions peak near 0.275. The Aux curve sits above the ground-truth curve, rising and falling in tandem with it, although we believe that by the standards of CKA in the neural network representation literature these numbers are small. In particular, this gap of $\le 0.125$ seems a priori low, given that we generally expect learned maps to be biased towards preserving the local structure of their domain.

Overall, we take the low CKA between ground truth susceptibilities and activations as a positive sign for the orthogonality of their information content. A reasonable caricature seems to be that alignment between activations and susceptibilities comes for free from the input tokens (early layers) and from the target token (late layers), with the middle layers carrying the nontrivial computation.

Local similarity: Jaccard-KNN. CKA captures global covariance structure. It is less informative about the local neighbourhood geometry that our clustering algorithm is actually sensitive to. We complement CKA with a nearest-neighbour measurement: for each token, compute its $k = 50$ nearest neighbours in activation space and in susceptibility space, and report the Jaccard index (size of intersection over size of union) between the two neighbour sets, averaged over all tokens.

The mean Jaccard similarity across all layers is 0.05 (indicating that on average 4.76 of 50 nearest neighbors are preserved) for ground truth susceptibilities, and 0.19 (indicating 15.97 of 50 nearest neighbors are preserved) for aux predictions.
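A minimal sketch of this measurement, assuming `acts` and `sus` are (n_tokens, d) arrays of activations and susceptibilities for the same tokens (names illustrative):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def mean_jaccard_knn(acts: np.ndarray, sus: np.ndarray, k: int = 50) -> float:
    """Average Jaccard overlap between the k-NN sets of each token in the two spaces."""
    # Use k + 1 neighbours because each point is its own nearest neighbour; drop self below.
    nn_a = NearestNeighbors(n_neighbors=k + 1).fit(acts).kneighbors(acts, return_distance=False)[:, 1:]
    nn_s = NearestNeighbors(n_neighbors=k + 1).fit(sus).kneighbors(sus, return_distance=False)[:, 1:]
    scores = []
    for a_row, s_row in zip(nn_a, nn_s):
        inter = len(set(a_row) & set(s_row))
        scores.append(inter / (2 * k - inter))  # Jaccard = |intersection| / |union|
    return float(np.mean(scores))
```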

Large Scale Ground Truth Comparison

In addition to studying the representations, we used SGLD to produce a run of ground truth susceptibilities for Pythia-14M on the same set of 46 million tokens used for the Aux model susceptibilities. We then performed the same clustering process on this larger ground truth set, allowing for a direct evaluation of how accurate the Aux model susceptibilities are. You can see the ground truth susceptibilities in the cluster explorer here, and the Aux model susceptibilities here. We compared the two datasets in two ways.

Token diversity. For each cluster, we compute the entropy of the distribution over its most common $y$ tokens. Lower entropy indicates a cluster dominated by fewer token types.
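One plausible reading of this measure, sketched under the assumption that the entropy is taken over the renormalised counts of the top-4 token types (the function and variable names are illustrative):

```python
from collections import Counter
import numpy as np

def top_k_token_entropy(cluster_tokens, k: int = 4) -> float:
    """Entropy (in nats) of the normalised frequency distribution over the k most common tokens."""
    counts = np.array([c for _, c in Counter(cluster_tokens).most_common(k)], dtype=float)
    probs = counts / counts.sum()
    return float(-(probs * np.log(probs)).sum())
```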

Aux clusters have consistently lower entropy than ground-truth clusters (mean 0.24 vs 0.36 nats for top-4 tokens). That is, Aux clusters are more token-homogeneous — more often organized around a single surface form. This is consistent with activation structure, which encodes token identity more directly, contributing to the Aux predictions.

Cluster geometry. These differences also appear in the aggregate statistics of the clustering output.

Clustering Aux susceptibilities yields nearly twice as many clusters as ground truth (77,140 vs 40,457), with smaller token-to-centroid distances. The Aux data has more atomic local structure, producing more numerous, mono-token groupings than the ground truth data. We think this is because the Aux model has a bias towards mapping identical tokens to nearby regions of susceptibility space while ignoring surrounding context.

These investigations show modest differences between Aux model susceptibilities and ground truth susceptibilities. Compared to ground truth, Aux model susceptibilities have a small but detectable increased similarity to activations, as well as visible differences in cluster statistics. This is not strictly negative from an interpretability perspective, but it complicates the question of understanding what Aux-predicted susceptibilities are capturing in practice. We plan in future work to measure this similarity or lack thereof in larger language models, perhaps using variations on the current Aux model architecture. Additionally, the structure of any map between activations and susceptibilities is interesting in and of itself, so studying what these Aux models learn is perhaps a useful step towards linking activation and weight space geometries.

Conclusion

Though Gordon et al. (2026) was published only a few months ago, we’ve made significant improvements to the process for producing and studying susceptibility clusters. This post provides a general overview of the pipeline, and we strongly encourage the reader to look at the primary post in this collection, where we present a novel analysis of the tens of thousands of interpretable clusters in Pythia 1.4B.

The main changes to the methodology of Gordon et al. (2026) were the introduction of PCA whitening (a simple post-processing change) and upsampling using Aux models. The combined effect of increasing the number of tokens to 46 million, and performing a PC-whitening step prior to clustering increased the number of clusters found for Pythia 1.4B from 249 to 57,236 and revealed significantly more abstract clusters.

Build on our work

Our tools for susceptibilities, local learning coefficients, and SGMCMC sampling are open source in the devinterp library.
