Spectroscopy at Scale: Finding Interpretable Structure in Pythia-1.4B
Scaling susceptibility-based interpretability to Pythia-1.4B reveals 57,236 clusters and abstract cross-domain concepts that approach SAE territory.

Understanding the internal structure of neural networks is a central problem in AI alignment. If we could reliably identify the structures inside a model that drive its behaviour, we would have a much stronger basis for predicting how it will generalise, and for intervening when it generalises in ways we don’t want.
Most current approaches to this problem work with neural network activations. We take a different approach. We study the geometry of the loss landscape: the mathematical surface that training navigates to find low-loss parameters. Our hypothesis is that there is a deep relationship between the geometry of this landscape and the internal computational structure of the network. From this view we derive the intuition that if the network “thinks similarly” about two inputs, then they will leave similar signatures on the loss landscape geometry around the trained parameter. These signatures are called susceptibilities.
Because susceptibilities measure how the loss landscape responds to changes in the data distribution, they naturally suggest an inverse: if you want to change the model’s internal structure, change the data. We call this patterning, and we have used it in small transformers to delay the formation of the induction circuit and to steer between modes of generalisation in a synthetic task (Wang & Murfet 2026). In this post we focus on the interpretability side.
In recent work we showed that susceptibilities can identify known structures like the induction circuit in small networks (Baker et al. 2025) and found hundreds of interpretable structures in Pythia-14M (Gordon et al. 2026). The next step was to find more abstract structures in larger networks. In this post we report on our progress towards this goal:
- More structures: in Pythia-1.4B we find 57,236 susceptibility clusters, organized into 19 interpretable semantic regions spanning code, mathematics, biomedicine, legal text, and natural language.
- More abstract: these include meta-clusters that unify concepts across domains, such as a “Roles” meta-cluster linking military ranks, legal parties, and client-server relationships. Our most abstract clusters approach the territory of features identified by SAEs in Claude 3 Sonnet.
- Scale changes what the method finds: following tokens like she and her across scales, we see them reorganize from surface-form groupings in Pythia-14M (e.g. she in “he/she”) to grammatical-role groupings in Pythia-1.4B (subject, object, possessive).
The second claim, about abstraction, is the most important for establishing susceptibilities as a viable interpretability technique. The gold standard is the SAE features collected in Templeton et al. (2024) to illustrate a “diversity of highly abstract features” in Claude 3 Sonnet (second column of the figure below). We present in the same figure (first column) examples of token sequences from a selection of the most abstract clusters that we found in Pythia-1.4B.
While quite abstract, our clusters fall short of the most abstract SAE features. This is to be expected since Pythia-1.4B is, according to our best guess, three orders of magnitude smaller. Nonetheless we think this is encouraging evidence that susceptibilities are beginning to reach the territory of abstraction that SAEs have demonstrated, and we expect that applying the same methodology to larger networks will reveal yet more abstract patterns.
What is Spectroscopy?
We may not yet know the right way to think about internal structure in neural networks. In particular, we arguably lack mathematical definitions for concepts like feature and circuit. In this epistemic position we can learn from other scientific disciplines.
Physics gives us interpretability for atoms. The characteristics of physical materials are determined by how their constituent particles interact through the fundamental forces. As such, perturbations in the “external fields” around materials provoke responses which contain tell-tale signs of internal structure. For example, we can distinguish ferromagnetic materials like iron by the fact that when an external magnetic field is varied, the magnetic moments within the material tend to respond by lining up with it. This first-order response is called a susceptibility (we say that iron has positive magnetic susceptibility). Spectroscopy is the study of materials through these susceptibilities.
We can port this notion across the bridge from physics to statistics. The idea is to associate to an infinitesimal change in the data distribution a susceptibility vector which measures how the loss landscape near the trained parameter “responds” to that change. Some components of the network (e.g. an attention head) will respond more strongly than others to any given change. For example, consider a component C that helps predict the names of famous figures. For each neural network weight in this component there is a corresponding direction in the loss landscape, leading away from the trained network’s parameter. It is intuitive that the shape of the landscape in directions belonging to C will respond more strongly to variations in parts of the data distribution that have to do with history and news than to variations in parts to do with code (C doesn’t care about code, unless it’s written by Churchill).
In this analogy, the data is the magnetic field and our central hypothesis for interpretability is that, following the example of spectroscopy in physics for studying materials, internal structure in neural networks is encoded in the responses of the loss landscape to variations in the data distribution. We note that this analogy is, at the mathematical level, exact: we discuss further details in our companion post on Interpreting the Ising Model.
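In symbols (our notation here is schematic, not a verbatim reproduction of the definition in Baker et al. 2025): perturb the data distribution q by reweighting a token-in-context t, and read off the first-order response of an observable O_C attached to a component C under the local posterior:

```latex
% Perturbed data distribution: extra weight \epsilon on the token-in-context t
q_\epsilon = (1 - \epsilon)\, q + \epsilon\, \delta_t

% Susceptibility of component C to t: the first-order response of the
% posterior expectation of the observable O_C to the perturbation
\chi_C(t) \;=\; \left.\frac{\partial}{\partial \epsilon}\right|_{\epsilon = 0}
  \mathbb{E}_{w \sim p_\epsilon(w)}\!\left[\, O_C(w) \,\right]
```

This mirrors the magnetic susceptibility χ = ∂m/∂h: the data distribution plays the role of the external field h, and the component observable plays the role of the magnetization m.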
Methodology
The fundamental unit of our analysis is the cluster. An individual susceptibility vector χ(t) ∈ ℝ^d is associated to a token t in context, where d is the number of chosen components in the network. Each entry χ_C(t) is a real number (positive or negative) that records the response of some component C to a fluctuation in the probability of t in its context (Baker et al. 2025). These susceptibility vectors are then clustered using a PageRank-based clustering algorithm defined in Gordon et al. (2026). A susceptibility cluster is a set of tokens (in context) that provoke similar responses from the model.
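To fix ideas, here is a minimal, hypothetical sketch of what clustering susceptibility vectors looks like. The actual pipeline uses the PageRank-based algorithm of Gordon et al. (2026); the greedy cosine-similarity pass below is only a stand-in, and all names and thresholds are illustrative.

```python
import numpy as np

def cluster_susceptibilities(chi, threshold=0.9):
    """Greedy illustration: assign each susceptibility vector to the
    existing cluster whose (normalized) centroid it is most similar to,
    or start a new cluster if no similarity exceeds `threshold`.

    `chi` has shape (n_tokens, d): one susceptibility vector per
    token-in-context, with d the number of model components.
    """
    norms = chi / np.linalg.norm(chi, axis=1, keepdims=True)
    clusters = []   # list of lists of token indices
    centroids = []  # running normalized centroid per cluster
    for i, v in enumerate(norms):
        sims = [v @ c for c in centroids]
        if sims and max(sims) > threshold:
            j = int(np.argmax(sims))
            clusters[j].append(i)
            c = norms[clusters[j]].mean(axis=0)
            centroids[j] = c / np.linalg.norm(c)
        else:
            clusters.append([i])
            centroids.append(v)
    return clusters

# Two well-separated response patterns should yield two clusters.
rng = np.random.default_rng(0)
a = rng.normal(0, 0.01, size=(5, 8)) + np.eye(8)[0]  # responds via component 0
b = rng.normal(0, 0.01, size=(5, 8)) + np.eye(8)[3]  # responds via component 3
chi = np.vstack([a, b])
clusters = cluster_susceptibilities(chi)  # → two clusters of five tokens each
```

The point of the sketch is only that tokens provoking similar component responses end up grouped together; the real algorithm, hyperparameters, and scale are described in the companion post.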
For Pythia-1.4B, a GPT-style decoder-only Transformer, we partition the model into components: attention heads, MLP layers, the embedding layer, and the unembedding layer. We use 4.25 million tokens-in-context sampled uniformly from subsets of the Pile and “upsample” to 46 million susceptibility vectors. Using the same clustering algorithm and hyperparameters we find 57,236 clusters in Pythia-1.4B, compared to the 510 found in Pythia-14M in Gordon et al. (2026). The increase results from using an order of magnitude more tokens and from PCA whitening. For an introduction to susceptibilities of neural networks, and details on how we scaled these methods to Pythia-1.4B, see the companion post How to Scale Susceptibilities.
In principle a susceptibility can be associated to any observable on parameter space; in this post we focus on observables associated to individual components (e.g. attention heads, MLPs). Using per-token losses as observables instead yields a generalization of influence functions, see Kreer et al. (2025), Adam et al. (2025), and Lee et al. (2025).
The word “structure” is used in the sense introduced in (Baker et al. 2025; Wang et al. 2025) which is derived from the idea that internal structure in networks (e.g. representations and circuits) forms in response to patterns in the data distribution. In this work we do not posit a formal definition of structure. However, since a susceptibility cluster can be thought of as representing a common response of the model to some pattern present in the tokens that constitute it (Gordon et al. 2026) we will identify “internal structures” with “clusters” in this post.
The Cluster Map
To see how these clusters are organized we take the centroid of each cluster in ℝ^d and apply a non-linear dimensionality reduction technique (UMAP) to reduce this to two dimensions. The resulting figure is what we call the cluster map (see below). Each point in this map represents a cluster, that is, a set of tokens eliciting similar responses from the model as measured by the susceptibility vectors. The way that points are arranged in the map is a (lossy) representation of the organization of the clusters in d-dimensional space.
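The centroid-then-project step can be sketched as follows. In practice the projection is UMAP (e.g. via the umap-learn package); to keep this sketch dependency-free we substitute a plain PCA projection computed with an SVD, so treat the projection itself as a stand-in.

```python
import numpy as np

def cluster_map(centroids):
    """Project cluster centroids of shape (n_clusters, d) to 2-D.

    The post uses UMAP (non-linear); here we use PCA via SVD so the
    example needs only NumPy. Both produce one 2-D point per cluster.
    """
    X = centroids - centroids.mean(axis=0)      # center the centroids
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return X @ Vt[:2].T                          # coordinates on top-2 PCs

# Toy input: 100 cluster centroids in a 16-dimensional component space.
rng = np.random.default_rng(1)
cents = rng.normal(size=(100, 16))
xy = cluster_map(cents)  # (100, 2) map coordinates, one point per cluster
```

Each row of `xy` is one point in the cluster map; nearby points correspond to clusters with similar centroid susceptibility profiles, up to the loss inherent in any 2-D projection.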

The cluster map can be explored interactively online. It is quickly apparent that there are broader themes. For example, there is a promontory of clusters near the bottom of the map relating to terms in mathematics and physics. Not every cluster fits into the theme suggested by its region label in the map; we encourage the reader to explore for themselves.
This interpretable large-scale organization of the cluster centroids is reminiscent of the global organization of SAE feature neighborhoods for Claude 3 Sonnet in Templeton et al. (2024).
Examples of Clusters
In this section we exhibit some susceptibility clusters. Each cluster is displayed along with its name (e.g. C8361), the number of tokens in the cluster and some of the most common tokens. Also shown are randomly selected token sequences in the cluster, with the token highlighted in green.
Clusters in Pythia-14M
We begin with 10 random clusters from Pythia-14M. Our purpose is to establish a baseline of the kinds of structures known to exist in smaller models (Gordon et al. 2026). This will allow us to later contrast the nature of patterns found in Pythia-14M and Pythia-1.4B.
Clusters in Pythia-1.4B
From the “Physics/Math”, “Materials/Engineering” and “Biomedical” regions. The first six clusters appear in the header at the top of this post. The cluster C11414 contains tokens (e.g. real, complex and integer) which are adjectives describing numbers. Unlike many of the clusters shown for Pythia-14M, there is no clear pattern to the position of these tokens or the tokens that precede or follow them. Our description “Kinds of Numbers” may not capture the internal algorithm that is responsible for these tokens having similar susceptibility vectors, but it does capture the background knowledge a human requires to understand the pattern. In this sense, the cluster is “more abstract” than a cluster that simply captures the second newline in a double newline (Pythia-14M C13434).
What is abstractness? Our headline claim is that many of the clusters in Pythia-1.4B are more abstract than those found in Pythia-14M. While ultimately this remains an informal claim, here are some of the elements of the rubric we use to distinguish less from more abstract patterns: clusters with more diversity in the token (but which nonetheless have a coherent theme) are more abstract; clusters with a pattern that takes more effort for a human to recognize or communicate are more abstract; clusters which require a longer description, even making use of background terms, are more abstract.
From the “Pronouns/Informal”, “Narrative Verbs”, “People/Names” and “Culture/Everyday” regions. The first three clusters appear in the header at the top of this post. In the cluster “Administrative levels” (C7743) the token is relatively predictable if you know the next token (e.g. court, department, highway). However, susceptibilities measure the influence on the loss of predicting the token given its context, without knowing the next token (since this is the modelling task we have given the transformer). The shared relationship between the token and the token that follows it is more subtle when one takes this into account.
Random Clusters in Pythia-1.4B
Above, the clusters were chosen to be interesting and exhibit a particular region in the cluster map. In this section we randomly select 10 clusters from the 57,236 clusters in Pythia-1.4B. This illustrates how many of the simple patterns present in the Pythia-14M clusters are still present in the larger model (including flavours of newlines, single common tokens in particular kinds of contexts and common word stems).
Meta-Clusters: Mid-Level Organization
So far we have discussed two levels of organization. Each cluster is made up of hundreds to thousands of token sequences, often following a recognizable pattern. Regions in the cluster map involve thousands of clusters and correspond to very broad themes or patterns like “Physics/Math”. In this section we consider a level of organization in between that we call meta-clusters. These structures involve hundreds of clusters and often represent quite abstract and subtle similarities between clusters.
Note: clusters in this section are taken from the clustering of the 4.25M ground truth susceptibilities. In the larger dataset of 46M upsampled susceptibilities many of these clusters are absorbed into a smaller number of larger clusters.
Our first example is the Game-Image-Play meta-cluster. This group of clusters appears to be unified by the theme of media that get produced, transmitted and consumed. Moving clockwise from the top of the meta-cluster we encounter album, artist and video and, at the bottom right, print, transmit and play.
Our second example is the Roles meta-cluster. This is a neighborhood of clusters containing tokens like server and client, troops and officers, network, company, council and group. The unifying theme is people (and things) defined by their role within an organized structure: legal parties (plaintiff, appellant, defendant), military ranks (lieutenant, commander), political positions, technical roles, and institutional titles. What makes this meta-cluster interesting is that it crosses domain boundaries: the same abstract pattern of “entity defined by its function within a collective” appears in legal filings, military documents, corporate governance, and software architecture. The model appears to have identified the general concept of a role.
There are two small clusters (C9575 and C3660) dominated by the tokens ligand and ligands. These are the only clusters containing these tokens in the 4.25M ground truth susceptibility clustering. This is potentially interesting because in biology a ligand is to a receptor as an officer is to a command structure, something that “fills a functional slot” in a larger organized system. However, these clusters are small and their placement in this meta-cluster could be a UMAP artifact.
Pronouns Across Scale
A potential concern with any new interpretability method is that it simply produces different outputs for different models, without that difference reflecting something real about how the models differ. One way to check this is to follow specific tokens across scales and see whether they get organized in ways that match what we expect from a more capable model: organization by grammatical and semantic relationships rather than by which characters happen to surround them.
First we consider the token clusters in Pythia-14M containing she or her. Note the organization by surface level syntactic structure: for example, C68817 consists of she appearing as part of he/she and C60937 consists of she appearing at the beginning of quotations or as the first token inside a bracket.
Contrast this with Pythia-1.4B where clusters that contain multiple she and her tokens also contain an assortment of pronouns in a variety of contexts and are organized by more abstract patterns. Notice that C1828 and C342 consist mostly of subject pronouns (she appearing in “she kicked the ball”) and C822 consists mostly of possessive determiners (her appearing in “her purity of tone”).
These examples are indicative of the broader differences in clusters that we can observe between the two models. As we move up in scale, clusters stop being organized by surface-level syntax and start being organized by grammatical role and other deeper patterns. Put more simply: some understanding of English grammar is necessary to understand the patterns captured by Pythia-1.4B (and the authors admit that it is Claude, not them, who knows what a possessive determiner is).
Conclusion
Susceptibilities are a new approach to neural network interpretability, inspired by spectroscopy in physics (“interpretability for atoms”) and grounded in the mathematics of singular learning theory. Our earlier work in this vein (Baker et al. 2025; Gordon et al. 2026) left open the question of whether we could find more abstract structures in larger models. In this post we presented evidence that this threshold has been crossed: in Pythia-1.4B we found tens of thousands of clusters, organized into interpretable semantic regions. The patterns represented by these clusters are notably more abstract than in Pythia-14M and in our judgement are approaching the level of abstractness of SAE features for Claude 3 Sonnet.
As far as we know, anything you can find with susceptibilities in 1-10B models you could also find with SAEs. Nonetheless there are a few reasons you might still find them interesting:
- This is an independent approach based on loss landscape geometry and local posterior sampling. It is a reassuring sign for interpretability that two very different approaches (SAEs and susceptibilities) both work, and find comparable things. Maybe they’re both real!
- Susceptibilities are grounded in a rigorous theory of generalization in Bayesian statistics (singular learning theory). While this currently falls short of a satisfactory theory of deep learning (see the Limitations below), this is still a more complete link to a theoretical foundation than currently exists for activation-based interpretability.
- Our approach gives a natural path towards engineering internal structure in networks via patterning: we have used it to delay the formation of the induction circuit in small transformers and to steer between two modes of out-of-distribution generalization in a synthetic task (Wang & Murfet 2026). We believe that the affordances provided by patterning for alignment are significantly different from those derived from steering the forward pass with SAE features.
At Timaeus our mission is to make breakthrough progress on AI alignment. Our research efforts can be summarized by the trichotomy Read/Write/Align. Susceptibilities, both in the form used in this post and in the form of Bayesian influence functions (Kreer et al. 2025; Adam et al. 2025; Lee et al. 2025), form the basis of our approach to interpretability (“Read”). These measurements tell us how infinitesimal changes in the data distribution affect the loss landscape, and can be turned around to infer what changes in the data to make in order to engineer desired properties of the model; this is patterning, an approach to deliberately selecting how models generalize (“Write”). Using patterning, we are working to improve the state of the art in AI alignment, starting with fundamental problems in today’s approaches to post-training: reward model bias and reward hacking (“Align”).
Finally, an invitation to explore:
- Examine clusters for Pythia-14M and Pythia-1.4B in the Cluster Explorer.
- Read more about inferring structure in variations on the Ising model with susceptibilities, in our post Interpreting the Ising Model.
- Understand the definition of susceptibilities and see more details about how we upsample to produce the clusterings used in this post in How to Scale Susceptibilities.
- Visit the open-source susceptibility estimation repository on GitHub.
- Read the sampling guide to get advice on hyperparameter selection.
Limitations
In this section we enumerate some of the known limitations of susceptibilities for interpretability.
Hyperparameter sensitivity. Selecting suitable hyperparameters for SGLD sampling is one of the primary difficulties with susceptibility estimation. The basics are presented in the sampling guide. In an early stage of Gordon et al. (2026) we chose hyperparameters for which the resulting clusters were virtually indistinguishable from an “infinite-temperature” baseline that consists of injecting unstructured Gaussian noise instead of sampling from the posterior (technically, the inverse temperature β = 0). Despite containing no contribution from the loss landscape geometry, this baseline still yields interpretable clusters (that is, the Gaussian baseline is nontrivial). Refining the hyperparameter selection and careful analysis shows substantial differences with this baseline; see Appendix H of Gordon et al. (2026).
Scaling. Estimating susceptibilities is expensive. Each entry requires averaging the loss on a token over hundreds to thousands of weight draws. The cost is dominated by backpropagation to generate the draws (which can be amortized across tokens) and repeated forward passes for each draw (which can be reduced via auxiliary model upsampling). The 4.25M ground-truth susceptibilities in this work required 4800 H200-hours. For reference, a single Gemma Scope SAE (Lieberum et al. 2024) at comparable width uses 8 billion tokens, more than two orders of magnitude more data. Susceptibilities are significantly more data-efficient but more compute-intensive per token.
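A back-of-envelope on the figures quoted above (assuming, for illustration, that cost scales roughly linearly in the number of ground-truth vectors):

```python
# 4800 H200-hours spread over 4.25M ground-truth susceptibility vectors
h200_hours = 4800
n_vectors = 4.25e6
seconds_per_vector = h200_hours * 3600 / n_vectors
# roughly 4 GPU-seconds of H200 time per ground-truth susceptibility vector,
# before the amortization and upsampling tricks mentioned above
```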
Model and data. Pythia-1.4B is not a capable model by 2026 standards, and the Pile is far from the best publicly available data mixture. We chose Pythia because it is the best-studied open-weights model family and we chose the Pile because staying in-distribution eliminates a major confounder. The cost is that some of the limitations we observe (particularly in cluster abstractness) may reflect the model rather than the method. In follow-up work, we plan to study larger, more capable models with better datasets.
Circuits. While in Baker et al. (2025) we demonstrate how to find the induction circuit in a two-layer attention-only transformer using susceptibilities, we have not yet published a more general approach to finding circuits with susceptibilities comparable to, for example, feature attribution graphs in SAEs. There is no fundamental obstacle to doing this: watch this space.
Steering. SAE features can steer the forward pass by intervening on activations. The natural analogue for susceptibilities is steering the learning process through data distribution modifications, which we explore under the name “patterning” (Wang & Murfet 2026). We have not yet validated such interventions at the model scale considered here. See the companion post Patterning Toy Models of Superposition for a demonstration in a controlled setting.
Abstractness. The clusters we find are less abstract than the features Anthropic uses to audit frontier systems. In audits of Claude, white-box methods surface features like “strategic manipulation” and “security bypass,” which are more abstract and safety-relevant concepts than what we observe in Pythia-1.4B. Some of this gap is likely explained by the model’s limited capabilities (even the SAE features in Templeton et al. (2024) do not include concepts as abstract as these). Whether there is also a methodological gap remains an open question that can only be resolved by scaling to more capable models.
Frequently Asked Questions
Why not just cluster (randomly down-projected) activations? Because it doesn’t work. Or at least, we couldn’t make it work, and that’s our best guess as to why nobody clusters raw activations to discover concepts in LLMs. The approaches that do apply clustering to activations either require careful preprocessing (e.g. with SAEs (Venhoff et al. 2025), embeddings through auxiliary models (Kopf et al. 2025), or discretization into symbolic representations (Wu et al. 2025)) or require significant post-processing machinery (e.g. Kroeger & Bindschaedler 2025). Susceptibility vectors, by contrast, are naturally amenable to simple clustering without such preprocessing, which is itself evidence that they provide a more useful signal than raw activations.
Are clusters the ultimate form of SLT-based interpretability? We see no reason to think so. Given a vector representation of tokens, clustering is one of the simplest forms of data analysis and we have found that it has taken us surprisingly far. It is, however, conceivable that in larger models or with orders of magnitude more tokens, alternative approaches to analyzing susceptibility vectors will be required.
Why is internal structure in networks encoded in the loss landscape? This might seem counter-intuitive in the setting of machine learning, but it is a commonplace notion in physics. In many systems the geometry of a potential function at its minima is intimately related to the properties of the physical system whose configuration corresponds to those minima. Some of the foundational ideas in the context of program synthesis have been developed in Murfet & Troiani (2025); here we restrict ourselves to a couple of illustrative examples:
- Modularity and control flow: suppose that the network processes two sets of inputs via separate paths, past some initial layers (which serve to route the inputs to the appropriate paths). Then variations in the weights of one path do not affect the outputs on inputs using the other path; this describes a family of variations in the data distribution with differential effects on the loss landscape that can be seen by susceptibilities.
- Patterns of errors: one reasonable model of internal structure in language models is that families of circuits co-operate to up- or down-weight possible continuations of a context, by specializing to recognition of different patterns in the context and contributing to the residual stream whatever expression or suppression “their” patterns would suggest. This convocation of circuits is affected in a complex way by perturbing the weights in any given circuit: for example, in (Baker et al. 2025) we study a balance between attention heads that want to predict the most likely continuation in the data distribution (termed bigram heads) and heads that want to predict the most likely continuation based on the context (induction heads). Now consider jointly varying (i) the probability of a common bigram vs an induction pattern in the data distribution and (ii) the weights in the bigram heads vs induction heads. The different pairings of variations in (i), (ii) affect the aforementioned “balance” differently and produce different patterns of errors. This is how the induction circuit is recovered via analysis of the susceptibility matrix in Baker et al. (2025). In turn, these patterns of errors are encoded in the partial derivatives of the loss and thus in the landscape.
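The modularity example above can be made concrete with a toy sketch (illustrative code, not our actual experimental setup; the router, paths, and shapes are all hypothetical):

```python
import numpy as np

def two_path_net(x, route, W_a, W_b):
    """Toy modular network: a router sends input `x` down path A or
    path B, each of which is a separate linear map."""
    return (W_a @ x) if route(x) == "a" else (W_b @ x)

rng = np.random.default_rng(2)
W_a = rng.normal(size=(4, 4))
W_b = rng.normal(size=(4, 4))
route = lambda x: "a" if x[0] > 0 else "b"  # routing by the sign of x[0]

x_a = np.array([1.0, 0.2, -0.3, 0.5])   # routed down path A
x_b = np.array([-1.0, 0.2, -0.3, 0.5])  # routed down path B

# Perturb only path B's weights. Outputs on path-A inputs are exactly
# unchanged, so the loss landscape is flat in W_b directions with respect
# to the part of the data distribution served by path A; susceptibilities
# see this differential response.
dW = 0.1 * rng.normal(size=(4, 4))
ya0 = two_path_net(x_a, route, W_a, W_b)
ya1 = two_path_net(x_a, route, W_a, W_b + dW)  # identical to ya0
yb0 = two_path_net(x_b, route, W_a, W_b)
yb1 = two_path_net(x_b, route, W_a, W_b + dW)  # changed by dW @ x_b
```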
The idea has been raised in interpretability circles before; for instance, Olsson et al. (2022) speculate that “it seems like there must be some way in which aspects of the loss surface connect to the formation of circuits”. This is, in short, our approach to interpretability.
Work with us
We're hiring Research Scientists, Engineers & more to join the team full-time.
Senior researchers can also express interest in a part-time affiliation through our new Research Fellows Program.
Thanks to Simon Pepin Lehalleur for comments and for suggesting C6544 in Pythia-1.4B, and to Edmund Lau for comments.
@article{murfet2026spectroscopy,
  title={Spectroscopy at Scale: Finding Interpretable Structure in Pythia-1.4B},
  author={Daniel Murfet and Andrew Gordon and Max Adam and George Wang and Jesse Hoogland and Garrett Baker and William Snell and Stan van Wingerden and Adam Newgas and Billy Snikkers and Rohan Hitchcock},
  year={2026}
}

References

1. Patterning is Dual to Interpretability: Shaping Neural Network Development with Susceptibilities. George Wang, Daniel Murfet, 2026. [link]
2. Structural Inference: Interpreting Small Language Models with Susceptibilities. Garrett Baker, George Wang, Jesse Hoogland, Daniel Murfet, 2025. [link]
3. Towards Spectroscopy: Susceptibility Clusters in Language Models. Andrew Gordon, Garrett Baker, George Wang, William Snell, Stan van Wingerden, Daniel Murfet, 2026. [link]
4. Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet. Adly Templeton, Tom Conerly, Jonathan Marcus, Jack Lindsey, Trenton Bricken, Brian Chen et al., 2024. Transformer Circuits Thread. [link]
5. Bayesian Influence Functions for Hessian-Free Data Attribution. Philipp Alexander Kreer, Wilson Wu, Maxwell Adam, Zach Furman, Jesse Hoogland, 2025. [link]
6. The Loss Kernel: A Geometric Probe for Deep Learning Interpretability. Maxwell Adam, Zach Furman, Jesse Hoogland, 2025. DOI: 10.48550/arXiv.2509.26537. [link]
7. Influence Dynamics and Stagewise Data Attribution. Jin Hwa Lee, Matthew Smith, Maxwell Adam, Jesse Hoogland, 2025. DOI: 10.48550/arXiv.2510.12071. [link]
8. Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2. Tom Lieberum, Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Nicolas Sonnerat, Vikrant Varma et al., 2024. [link]
9. Base Models Know How to Reason, Thinking Models Learn When. Constantin Venhoff, Iván Arcuschin, Philip Torr, Arthur Conmy, Neel Nanda, 2025. DOI: 10.48550/arXiv.2510.07364. [link]
10. Capturing Polysemanticity with PRISM: A Multi-Concept Feature Description Framework. Laura Kopf, Nils Feldhus, Kirill Bykov, Philine Lou Bommer, Anna Hedström, Marina M.-C. Höhne et al., 2025. [link]
11. Concept-Guided Interpretability via Neural Chunking. Shuchen Wu, Stephan Alaniz, Shyamgopal Karthik, Peter Dayan, Eric Schulz, Zeynep Akata, 2025. [link]
12. Cluster Paths: Navigating Interpretability in Neural Networks. Nicholas M. Kroeger, Vincent Bindschaedler, 2025. DOI: 10.48550/arXiv.2510.06541. [link]
13. In-context Learning and Induction Heads. Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, Tom Henighan et al., 2022. Transformer Circuits Thread.