Spectroscopy at Scale: Finding Interpretable Structure in Pythia-1.4B

Authors
Daniel Murfet*†, Andrew Gordon*†, Max Adam*†⌾, George Wang*, Jesse Hoogland*†, Garrett Baker*, William Snell, Stan van Wingerden, Adam Newgas†⌾, Billy Snikkers, Rohan Hitchcock*†
Timaeus · * Core Research Contributor · † Post Writing Contributor · ⌾ Engineering Contributor · Correspondence to daniel@timaeus.co
Published
April 21, 2026

Understanding the internal structure of neural networks is a central problem in AI alignment. If we could reliably identify the structures inside a model that drive its behavior, we would have a much stronger basis for predicting how it will generalize, and for intervening when it generalizes in ways we don’t want.

Most current approaches to this problem work with neural network activations. We take a different approach. We study the geometry of the loss landscape: the mathematical surface that training navigates to find low-loss parameters. Our hypothesis is that there is a deep relationship between the geometry of this landscape and the internal computational structure of the network. From this view we derive the intuition that if the network “thinks similarly” about two inputs, then they will leave similar signatures on the loss landscape geometry around the trained parameter. These signatures are called susceptibilities.

Because susceptibilities measure how the loss landscape responds to changes in the data distribution, they naturally suggest an inverse: if you want to change the model’s internal structure, change the data. We call this patterning, and we have used it in small transformers to delay the formation of the induction circuit and to steer between modes of generalization in a synthetic task (Wang & Murfet 2026). In this post we focus on the interpretability side.

In recent work we showed that susceptibilities can identify known structures like the induction circuit in small networks (Baker et al. 2025) and found hundreds of interpretable structures in Pythia-14M (Gordon et al. 2026). The next step was to find more abstract structures in larger networks. In this post we report on our progress towards this goal:

  • More structures: in Pythia-1.4B we find 57,236 susceptibility clusters, organized into interpretable semantic regions.
  • More abstract: these clusters represent patterns that are often much more abstract than what we found in Pythia-14M in Gordon et al. (2026). We also exhibit some interesting collections of clusters (neighborhoods) that unify concepts across domains, such as a “Roles” neighborhood linking military ranks, legal parties, and client-server relationships.
  • Scale changes what the method finds: following tokens like she and her across scales, we see them reorganize from surface-form groupings in Pythia-14M (e.g. she in “he/she”) to grammatical-role groupings in Pythia-1.4B.

The second claim, about abstraction, is arguably the most important for establishing susceptibilities as a viable interpretability technique. The gold standard is the SAE features collected in Templeton et al. (2024) to illustrate a “diversity of highly abstract features” in Claude 3 Sonnet (second column of the figure below). We present in the same figure (first column) examples of token sequences from a selection of the most abstract clusters that we found in Pythia-1.4B.

While quite abstract, our clusters fall short of the most abstract SAE features. This is to be expected since Pythia-1.4B is, according to our best guess, three orders of magnitude smaller than Claude 3 Sonnet. Nonetheless, this is encouraging evidence that susceptibilities are beginning to reach the territory of abstraction that SAEs have demonstrated. We expect that applying the same methodology to larger networks will reveal yet more abstract patterns.


What is Spectroscopy?

We may not yet know the right way to think about internal structure in neural networks. In particular, we arguably lack mathematical definitions for concepts like feature and circuit. In this epistemic position we can learn from other scientific disciplines.

Physics gives us interpretability for matter. The characteristics of physical materials are determined by how their constituent particles interact through the fundamental forces. As such, perturbations in the “external fields” around materials provoke responses which contain tell-tale signs of internal structure. For example, we can distinguish ferromagnetic materials like iron by the fact that when an external magnetic field is varied, the magnetic moments within the material tend to respond by lining up with it. This first-order response is called a susceptibility (we say that iron has positive magnetic susceptibility $\chi > 0$). Spectroscopy is the study of materials through these susceptibilities.

We can port this notion across the bridge from physics to statistics. The idea is to associate to an infinitesimal change in the data distribution a susceptibility vector which measures how the loss landscape near the trained parameter “responds” to that change. Some components of the network (e.g. an attention head) will respond more strongly than others to any given change. For example, consider a component $C$ that helps predict the names of famous figures. For each neural network weight in this component there is a corresponding direction in the loss landscape, leading away from the trained network’s parameter. It is intuitive that the shape of the landscape in directions belonging to $C$ will respond more strongly to variations in parts of the data distribution that have to do with history and news than to variations in parts to do with code ($C$ doesn’t care about code).

In this analogy, the data is the source of the external magnetic field. Our central hypothesis for interpretability is that, following the example of spectroscopy in physics for studying materials, internal structure in neural networks is encoded in the responses of the loss landscape to variations in the data distribution. We note that this analogy is, at the mathematical level, exact: we discuss further details in our companion post on Interpreting the Ising Model.

Methodology

Given a token $y$ in context $x$ and a set of $H$ components in the network denoted $C_1, \ldots, C_H$, the associated susceptibility vector is denoted

$$\chi_{xy} = \big( \chi^{C_1}_{xy}, \ldots, \chi^{C_H}_{xy} \big) \in \mathbb{R}^H\,.$$

Each entry is a real number (positive or negative) that records the response $\chi^C_{xy}$ of some component $C$ to a fluctuation in the probability of $y$ in context $x$ (Baker et al. 2025). Given a set of tokens in context, these susceptibility vectors are clustered using a PageRank-based clustering algorithm defined in Gordon et al. (2026). The resulting susceptibility clusters (or just clusters) are the fundamental unit of our analysis. A given token in context $xy$ may appear in multiple clusters. The basic intuition is that token sequences within a cluster provoke similar responses from the model, because they share some underlying patterns.
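To make the clustering step concrete, here is a minimal numpy sketch. The actual pipeline uses the PageRank-based algorithm of Gordon et al. (2026); the greedy cosine-similarity grouping below is a simplified stand-in, and the function names, threshold, and toy data are illustrative only.

```python
import numpy as np

def cosine_normalize(X):
    # Scale each susceptibility vector to unit norm so that cosine
    # similarity reduces to a dot product.
    return X / np.linalg.norm(X, axis=1, keepdims=True)

def cluster_susceptibilities(X, threshold=0.8):
    # Greedy stand-in for the PageRank-based algorithm: assign each vector
    # to the first cluster whose seed it matches above `threshold` cosine
    # similarity, else start a new cluster.
    X = cosine_normalize(X)
    seeds, labels = [], []
    for v in X:
        sims = [float(v @ s) for s in seeds]
        if sims and max(sims) >= threshold:
            labels.append(int(np.argmax(sims)))
        else:
            seeds.append(v)
            labels.append(len(seeds) - 1)
    return labels

# Toy data: two well-separated response patterns in R^4, standing in for
# susceptibility vectors of two groups of token sequences.
rng = np.random.default_rng(0)
A = rng.normal(0, 0.05, (5, 4)) + np.array([1.0, 0.0, 0.0, 0.0])
B = rng.normal(0, 0.05, (5, 4)) + np.array([0.0, 1.0, 0.0, 0.0])
labels = cluster_susceptibilities(np.vstack([A, B]))
```

Token sequences whose vectors point in similar directions end up in the same cluster, which is the intuition the section describes.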

For Pythia-1.4B, a GPT-style decoder-only transformer, we partition the model into $H = 410$ components: 384 attention heads, 24 MLP layers, the embedding layer, and the unembedding layer. We use attention heads and MLP layers because we expect these units to show some degree of functional specialization. This is not the only possible decomposition: we could further subdivide attention heads into $Q$, $K$, $V$ matrices, for instance. The finer the subdivision, the higher the compute cost, and in practice the susceptibility data for Pythia-1.4B appears to contain significant redundancy across components. In the future we are likely to use coarser rather than finer subdivisions.
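As a sanity check, the component count follows directly from Pythia-1.4B's architecture (24 transformer layers with 16 attention heads each):

```python
# Pythia-1.4B: 24 transformer layers, 16 attention heads per layer.
n_layers, heads_per_layer = 24, 16
attention_heads = n_layers * heads_per_layer   # 384
mlp_layers = n_layers                          # one MLP block per layer
embedding, unembedding = 1, 1
H = attention_heads + mlp_layers + embedding + unembedding
print(H)  # 410
```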

We estimate susceptibilities for 4.25 million token sequences $xy$ sampled uniformly from subsets of the Pile (requiring 4800 H200-hours) and “upsample” to 46M susceptibility vectors following a new methodology presented in How to Scale Susceptibilities. Using the same clustering algorithm and hyperparameters as in Gordon et al. (2026) we find 57,236 clusters in Pythia-1.4B, compared to the 510 found in 780K susceptibility vectors for Pythia-14M in loc. cit. The increase results from using an order of magnitude more tokens and from PCA whitening. Our code is available on GitHub.

In principle a susceptibility can be associated to any observable on parameter space; in this post we focus on observables associated to individual components (e.g. attention heads, MLPs). Using per-token losses as observables instead yields a generalization of influence functions, see Kreer et al. (2025), Adam et al. (2025), and Lee et al. (2025).

The word “structure” is used in the sense introduced in (Baker et al. 2025; Wang et al. 2025) which is derived from the idea that internal structure in networks (e.g. representations and circuits) forms in response to patterns in the data distribution. In this work we do not posit a formal definition of structure. However, since a susceptibility cluster can be thought of as representing a common response of the model to some pattern present in the tokens that constitute it (Gordon et al. 2026) we will identify “internal structures” with “clusters” in this post.

The Cluster Map

To see how these clusters are organized we take the centroid of each cluster in $\mathbb{R}^{410}$ and apply a non-linear dimensionality reduction technique (UMAP) to reduce this to two dimensions. The resulting figure is what we call the cluster map, which can be explored interactively online:

In this post we consider three levels of organization:

  • Clusters are points: each point in the map represents a cluster of susceptibility vectors. These range in size from hundreds to thousands of token sequences. The way that points are arranged in the map is a lossy representation of the organization of the cluster centroids in 410-dimensional space.
  • Neighborhoods are small sets of clusters: in a later section we consider some small sets of clusters which share a common, relatively narrow theme and appear near each other in the cluster map. These involve hundreds of clusters.
  • Regions are large sets of clusters: in the map we mark and annotate 19 regions, corresponding to broad themes like “Physics/Math”. Regions typically involve thousands of clusters.
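The centroid-then-project pipeline behind the map can be sketched as follows. The post uses UMAP (via the umap-learn package); the sketch below substitutes a PCA projection so that it runs with numpy alone, and the toy data, labels, and function names are illustrative.

```python
import numpy as np

def centroids(X, labels):
    # Mean susceptibility vector of each cluster (rows of X live in R^H).
    ids = sorted(set(labels))
    return np.stack([X[np.array(labels) == c].mean(axis=0) for c in ids])

def project_2d(C):
    # Stand-in for UMAP: project centroids onto their top two principal
    # components via SVD. UMAP is non-linear; this linear sketch only
    # illustrates the shape of the pipeline.
    Cc = C - C.mean(axis=0)
    _, _, Vt = np.linalg.svd(Cc, full_matrices=False)
    return Cc @ Vt[:2].T

# Toy: 3 clusters of 4 vectors each in R^10.
rng = np.random.default_rng(1)
X = rng.normal(size=(12, 10)) + np.repeat(np.eye(3, 10) * 5, 4, axis=0)
labels = [0] * 4 + [1] * 4 + [2] * 4
coords = project_2d(centroids(X, labels))
print(coords.shape)  # (3, 2)
```

Each cluster becomes a single 2D point, which is what the cluster map plots (one point per cluster, not per token sequence).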

This interpretable large-scale organization of the cluster centroids is reminiscent of the global organization of SAE feature neighborhoods for Claude 3 Sonnet in Templeton et al. (2024).

Examples of Clusters

In this section we exhibit some susceptibility clusters. Each cluster is displayed along with its name (e.g. C8361), the number of tokens in the cluster, and some of the most common $y$ tokens. Also shown are randomly selected token sequences in the cluster, with the $y$ token highlighted in green.

Clusters in Pythia-14M

We begin with 10 random clusters from Pythia-14M. Our purpose is to establish a baseline of the kinds of structures known to exist in smaller models. This will allow us to later contrast the nature of patterns found in Pythia-14M and Pythia-1.4B.

Clusters in Pythia-1.4B

The “Physics/Math”, “Materials/Engineering” and “Biomedical” regions. The first six clusters shown below also appear in the header at the top of this post. The cluster C11414 contains tokens (e.g. real, complex and integer) which are adjectives describing numbers. Unlike many of the clusters shown for Pythia-14M, there is no clear pattern to the position of these tokens or the tokens that precede or follow them. Our description “Kinds of Numbers” may not capture the internal algorithm that is responsible for these tokens having similar susceptibility vectors, but it does capture the background knowledge a human requires to understand the pattern. In this sense, the cluster is “more abstract” than a cluster that simply captures the second newline in a double newline (e.g. Pythia-14M C13434).

What is abstractness? Our headline claim is that many of the clusters in Pythia-1.4B are more abstract than those found in Pythia-14M. While ultimately this remains an informal claim, here are some of the elements of the rubric we use to distinguish less from more abstract patterns: clusters with more diversity in the $y$ token (but which nonetheless have a coherent theme) are more abstract; clusters with a pattern that takes more effort for a human to recognize or communicate are more abstract; clusters which require a longer description, even making use of background terms, are more abstract.

The “Pronouns/Informal”, “Narrative Verbs”, “People/Names” and “Culture/Everyday” regions. The first three clusters appear in the header at the top of this post. In the cluster “Administrative levels” (C7743) the token $y$ is relatively predictable if you know the next token $z$ (e.g. court, department, highway). However, susceptibilities measure the influence on the loss of predicting $y$ given its context $x$, without knowing $z$ (since this is the modelling task we have given the transformer). The shared relationship between $y$ and $x$ in these examples is more subtle when one takes this into account.

Random Clusters in Pythia-1.4B

Above, the clusters were chosen to be interesting and exhibit a particular region in the cluster map. In this section we randomly select 10 clusters from the 57,236 clusters in Pythia-1.4B. This illustrates how many of the simple patterns present in the Pythia-14M clusters are still present in the larger model (including flavors of newlines, single common tokens in particular kinds of contexts and common word stems).

Neighborhoods: Mid-Level Organization

Note: clusters in this section are taken from the clustering of the 4.25M ground truth susceptibilities for Pythia-1.4B. In the larger dataset of 46M upsampled susceptibilities many of these clusters are absorbed into a smaller number of larger clusters.

Our first example is the Game-Image-Play neighborhood for Pythia-1.4B. This group of clusters appears to be unified by the theme of media that get produced, transmitted and consumed. Moving clockwise from the top of the neighborhood we encounter album, artist and video and, at the bottom right, print, transmit and play.

Our second example is the Roles neighborhood for Pythia-1.4B. This is a set of clusters containing tokens like server and client, troops and officers, network, company, council and group. The unifying theme is people (and things) defined by their role within an organized structure: legal parties (plaintiff, appellant, defendant), military ranks (lieutenant, commander), political positions, technical roles, and institutional titles. What makes this neighborhood interesting is that it crosses domain boundaries: the same abstract pattern of “entity defined by its function within a collective” appears in legal filings, military documents, corporate governance, and software architecture.

There are two small clusters (C9575 and C3660) dominated by the tokens ligand and ligands. These are the only clusters containing these tokens in the 4.25M ground truth susceptibility clustering. This is potentially interesting because in biology a ligand is to a receptor as an officer is to a command structure, something that “fills a functional slot” in a larger organized system. However, these clusters are small and their placement in this neighborhood could be a UMAP artifact.

Tokens Across Scale

A potential concern with any new interpretability method is that it simply produces different outputs for different models, without that difference reflecting something real about how the models differ. One way to check this is to follow specific tokens across scales and see whether they get organized in ways that match what we expect from a more capable model: organization by grammatical and semantic relationships rather than by which characters happen to surround them.

Pronouns

First we consider the token clusters in Pythia-14M containing she or her. Note the organization by surface level syntactic structure: for example, C68817 consists of she appearing as part of he/she and C60937 consists of she appearing at the beginning of quotations or as the first token inside a bracket.

Contrast this with Pythia-1.4B where clusters that contain multiple she and her tokens also contain an assortment of pronouns in a variety of contexts and are organized by more abstract patterns. Notice that C1828 and C342 consist mostly of subject pronouns (she appearing in “she kicked the ball”) and C822 consists mostly of possessive determiners (her appearing in “her purity of tone”).

Verbs

Our second example of variation across scale involves verbs. First we consider a region in the cluster map for Pythia-14M which consists of (base form) verbs. The highlighted cluster C17745 has make as its most common token, followed by similar verbs give and solve. However, the cluster does not contain a significant number of occurrences of the inflected forms of the verb makes, making, made.

By contrast in the Pythia-1.4B cluster map, the clusters containing make usually contain the inflected forms, as the examples below demonstrate. The verb clusters also tend to be organized by high level semantic groupings rather than clustering together with other verbs.

These examples are indicative of the broader differences in clusters that we can observe between the two models. As we move up in scale, clusters stop being organized by surface-level syntax and start being organized by grammatical role and other deeper patterns. Put more simply: some understanding of English grammar is necessary to understand the patterns captured by Pythia-1.4B (and the authors admit that it is Claude, not them, who knows what a possessive determiner is).

Conclusion

Our earlier work on susceptibilities for neural network interpretability (Baker et al. 2025; Gordon et al. 2026) left open the question of whether we could find more abstract structures in larger models. In this post we presented evidence that this threshold has been crossed: in Pythia-1.4B we found tens of thousands of clusters, organized into interpretable semantic regions. The patterns represented by these clusters are notably more abstract than in Pythia-14M and in our judgment are starting to approach the level of abstractness of SAE features for Claude 3 Sonnet.

As far as we know, anything you can find with susceptibilities in 1-10B models you could also find with SAEs. Nonetheless there are a few reasons you might still find them interesting:

  • This is an independent approach based on loss landscape geometry and local posterior sampling. It is a reassuring sign for interpretability that two very different stacks (SAEs and susceptibilities) both work, and find comparable things. Maybe they’re both real!
  • Susceptibilities are grounded in a rigorous theory of generalization in Bayesian statistics (singular learning theory). While this currently falls short of a satisfactory theory of deep learning, this is a stronger theoretical foundation than currently exists for activation-based interpretability.
  • Our approach gives a natural path towards engineering internal structure in networks via patterning: we have used it to delay the formation of the induction circuit in small transformers and to steer between two modes of out-of-distribution generalization in a synthetic task (Wang & Murfet 2026). We believe that the affordances provided by patterning for alignment are significantly different from those derived from steering the forward pass with SAE features.

At Timaeus our mission is to make breakthrough progress on AI alignment. Our research efforts can be summarized by the trichotomy Read/Write/Align. Susceptibilities form the basis of our approach to interpretability (“Read”). These measurements tell us how infinitesimal changes in the data distribution affect the loss landscape, and can be turned around to infer what changes in the data to make in order to engineer desired properties of the model; this is patterning, an approach to deliberately selecting how models generalize (“Write”). Using patterning, we are working to improve the state of the art in AI alignment, starting with fundamental problems in today’s approaches to post-training: reward model biases and reward hacking in RLHF (“Align”).

Finally, an invitation to explore:

Limitations

In this section we enumerate some of the known limitations of susceptibilities for interpretability.

Hyperparameter sensitivity. Selecting suitable hyperparameters for SGLD sampling is one of the primary difficulties with susceptibility estimation. The basics are presented in the sampling guide. In an early stage of Gordon et al. (2026) we chose hyperparameters for which the resulting clusters were virtually indistinguishable from an “infinite-temperature” baseline that consists of injecting unstructured Gaussian noise instead of sampling from the posterior (technically, $n\beta = 0$). Despite containing no contribution from the loss landscape geometry, this baseline still yields interpretable clusters (that is, the Gaussian baseline is nontrivial). Refined hyperparameter selection and careful analysis show substantial differences from this baseline; see Appendix H of Gordon et al. (2026).

Scaling. Estimating susceptibilities is expensive. Each entry $\chi_{xy}$ requires averaging the loss on token $y$ over hundreds to thousands of weight draws. The cost is dominated by backpropagation to generate the draws (which can be amortized across tokens) and repeated forward passes for each draw (which can be reduced via auxiliary model upsampling). The 4.25M ground-truth susceptibilities in this work required 4800 H200-hours. For reference, a single Gemma Scope SAE (Lieberum et al. 2024) at comparable width uses 8 billion tokens, more than two orders of magnitude more data. Susceptibilities are significantly more data-efficient but more compute-intensive per token.
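The data-efficiency comparison is simple arithmetic on the figures quoted above:

```python
# Figures from the post: 4.25M ground-truth susceptibility token sequences
# vs 8 billion training tokens for a single Gemma Scope SAE.
susceptibility_tokens = 4.25e6
sae_tokens = 8e9
ratio = sae_tokens / susceptibility_tokens
print(f"{ratio:.0f}x")  # 1882x, i.e. more than two orders of magnitude
```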

Model and data. Pythia-1.4B is not a capable model by 2026 standards, and the Pile is far from the best publicly available data mixture. We chose Pythia because it is the best-studied open-weights model family and we chose the Pile because staying in-distribution eliminates a major confounder. The cost is that some of the limitations we observe (particularly in cluster abstractness) may reflect the model rather than the method. In follow-up work, we plan to study larger, more capable models with better datasets.

Circuits. While in Baker et al. (2025) we demonstrate how to find the induction circuit in a two-layer attention-only transformer using susceptibilities, we have not yet published a more general approach to finding circuits with susceptibilities comparable to, for example, feature attribution graphs in SAEs. There is no fundamental obstacle to doing this: watch this space.

Steering. SAE features can steer the forward pass by intervening on activations. The natural analogue for susceptibilities is steering the learning process through data distribution modifications, which we explore under the name “patterning” (Wang & Murfet 2026). We have not yet validated such interventions at the model scale considered here.

Abstractness. The clusters we find are less abstract than the features Anthropic uses to audit frontier systems. In audits of Claude, white-box methods surface features like “strategic manipulation” and “security bypass,” which are more abstract and safety-relevant concepts than what we observe in Pythia-1.4B. Some of this gap is likely explained by the model’s limited capabilities (even with SAEs, Templeton et al. (2024) does not report concepts as abstract as these). Whether there is also a methodological gap remains an open question that can only be resolved by scaling to more capable models.

Frequently Asked Questions

Why not just cluster (randomly down-projected) activations? Because it doesn’t work. Or at least, we couldn’t make it work, and that’s our best guess as to why nobody clusters raw activations to discover concepts in LLMs. The approaches that do apply clustering to activations either require careful preprocessing (e.g. with SAEs (Venhoff et al. 2025), embeddings through auxiliary models (Kopf et al. 2025), or discretization into symbolic representations (Wu et al. 2025)) or require significant post-processing machinery (e.g. Kroeger & Bindschaedler 2025). Susceptibility vectors, by contrast, are naturally amenable to simple clustering without such preprocessing, which is itself evidence that they provide a more useful signal than raw activations.

Are clusters the ultimate form of SLT-based interpretability? We see no reason to think so. Given a vector representation of tokens, clustering is one of the simplest forms of data analysis and we have found that it has taken us surprisingly far. It is, however, conceivable that in larger models or with orders of magnitude more tokens, alternative approaches to data analysis will be required. More broadly, our hypothesis is that internal structure of networks is reflected in the local loss landscape, and that this geometry is in turn decodable from estimates of expectation values of observables; susceptibilities are natural observables, but we should not be read as saying that these are necessarily a sufficient set.

Why is internal structure in networks encoded in the loss landscape? This might seem counter-intuitive in the setting of machine learning, but it is a commonplace notion in physics. In many systems the geometry of a potential function at its minima is intimately related to the properties of the physical system whose configuration corresponds to those minima. Some of the foundational ideas in the context of program synthesis have been developed in Murfet & Troiani (2025); here we restrict ourselves to a couple of illustrative examples:

  • Modularity and control flow: suppose that the network processes two sets of inputs via separate paths, past some initial layers (which serve to route the inputs to the appropriate paths). Then variations in the weights of one path do not affect the outputs on inputs using the other path; this describes a family of variations in the data distribution with differential effects on the loss landscape that can be seen by susceptibilities.
  • Patterns of errors: one reasonable model of internal structure in language models is that families of circuits co-operate to up- or down-weight possible continuations of a context, by specializing to recognition of different patterns in the context and contributing to the residual stream whatever expression or suppression “their” patterns would suggest. This convocation of circuits is affected in a complex way by perturbing the weights in any given circuit: for example, in Baker et al. (2025) we study a balance between attention heads that want to predict the most likely continuation in the data distribution (termed bigram heads) and heads that want to predict the most likely continuation based on the context (induction heads). Now consider jointly varying (i) the probability of a common bigram vs an induction pattern in the data distribution and (ii) the weights in the bigram heads vs induction heads. The different pairings of variations in (i), (ii) affect the aforementioned “balance” differently and produce different patterns of errors. This is how the induction circuit is recovered via analysis of the susceptibility matrix in Baker et al. (2025). In turn, these patterns of errors are encoded in the partial derivatives of the loss and thus in the landscape.
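The modularity example can be made concrete with a toy model. The sketch below (all names and shapes illustrative, with the "routing layers" abstracted into a flag) builds two disjoint weight paths and checks by finite differences that perturbing one path's weights leaves the loss on the other path's inputs exactly unchanged, while perturbing the path actually used does move the loss:

```python
import numpy as np

# Toy "modular" network: group-A inputs flow through weight block w_a and
# group-B inputs through w_b, with no shared downstream weights.
rng = np.random.default_rng(2)
w_a = rng.normal(size=(4, 4))
w_b = rng.normal(size=(4, 4))

def forward(x, group, w_a, w_b):
    # The initial routing layers are abstracted into the `group` flag.
    return x @ (w_a if group == "A" else w_b)

def loss(xs, group, w_a, w_b):
    # Squared-error against a zero target, averaged over inputs.
    outs = np.stack([forward(x, group, w_a, w_b) for x in xs])
    return float((outs ** 2).mean())

xs_a = rng.normal(size=(8, 4))           # inputs routed through path A
delta = 1e-3 * rng.normal(size=(4, 4))   # small weight perturbation

# Perturbing path-B weights leaves the loss on group-A inputs unchanged,
# so the corresponding susceptibility-like response is exactly zero...
resp_b = loss(xs_a, "A", w_a, w_b + delta) - loss(xs_a, "A", w_a, w_b)
# ...while perturbing path-A weights does move it.
resp_a = loss(xs_a, "A", w_a + delta, w_b) - loss(xs_a, "A", w_a, w_b)
print(resp_b, resp_a)
```

This is the differential response the bullet describes: variations in one path's weights interact with only one part of the data distribution, and that asymmetry is visible in the loss landscape.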

The idea has been raised in interpretability circles before; for instance, Olsson et al. (2022) speculate that “it seems like there must be some way in which aspects of the loss surface connect to the formation of circuits”. This is, in short, our approach to interpretability.

Build on our work

Our tools for susceptibilities, local learning coefficients, and SGMCMC sampling are open source in the devinterp library.

Work with us

We're hiring Research Scientists, Engineers & more to join the team full-time.

Senior researchers can also express interest in a part-time affiliation through our new Research Fellows Program.