The Loss Kernel: A Geometric Probe for Deep Learning Interpretability

Authors

Maxwell Adam ⁼

Timaeus, University of Melbourne

Zach Furman ⁼

University of Melbourne

Jesse Hoogland

Timaeus

See Contributions

Publication Details

Published:

October 1, 2025

Access

Abstract

We introduce the loss kernel, an interpretability method for measuring similarity between data points according to a trained neural network. The kernel is the covariance matrix of per-sample losses computed under a distribution of low-loss-preserving parameter perturbations. We first validate our method on a synthetic multitask problem, showing it separates inputs by task as predicted by theory. We then apply this kernel to Inception-v1 to visualize the structure of ImageNet, and we show that the kernel's structure aligns with the WordNet semantic hierarchy. This establishes the loss kernel as a practical tool for interpretability and data attribution.

Automated Conversion Notice

Warning: This paper was automatically converted from LaTeX. While we strive for accuracy, some formatting or mathematical expressions may not render perfectly. Please refer to the original ArXiv version for the authoritative document.

1 Introduction

A central goal in AI interpretability and data attribution is interpreting and mapping the global structure of the data distribution as seen by a trained neural network (Carter et al., 2019; Pepin Lehalleur et al., 2025; Olah, 2015). One approach is to start local, by quantifying a suitable measure of similarity between pairs of individual samples—that is, by defining a kernel. “Interpreting” the global structure of the data distribution then becomes a problem of analyzing the geometric structure in this kernel (e.g., via clustering techniques), and “mapping” becomes a problem of visualizing points in this kernel space (e.g., via dimensionality reduction techniques).

This kernel-based approach has been used successfully with similarity measures derived from activations or representations. For example, it is possible to define a kernel via cosine similarity between the hidden vectors of sparse autoencoders (SAEs). Applying UMAP to this kernel provides a way to visualize the space of features in language models (Bricken et al., 2023; Templeton et al., 2024) and image models (Gorton, 2024). This kernel has also been used for analysis, such as to determine the (nearly) hierarchical relations between features (Bricken et al., 2023).

In this paper, we take a different approach derived from the geometric structure of the loss landscape. Neural networks are singular models, meaning many different parameter vectors encode identical functions and achieve the same loss. Rather than studying individual weight settings, singular learning theory (SLT; Watanabe 2009), which studies these singular models, suggests analyzing the entire set of low-loss solutions. This perspective motivates us to define the loss kernel, a measure of functional similarity based on shared sensitivity to parameter perturbations restricted to this low-loss set of solutions. Formally, the loss kernel, $K({\mathbf{z}},{\mathbf{z}}^{\prime})$ , is given by the covariance matrix of per-sample losses, $\text{Cov}[\ell({\mathbf{z}};{\bm{w}}),\ell({\mathbf{z}}^{\prime};{\bm{w}})]$ , under perturbations drawn from a suitable probe distribution. A high covariance value indicates that the inputs ${\mathbf{z}}$ and ${\mathbf{z}}^{\prime}$ share sensitivity to the same parameter perturbations, which provides evidence for two samples being functionally coupled inside a given model.

Refer to caption — Figure 1: **Geometry of the loss kernel for Inception-v1 on ImageNet.** A UMAP of pairwise distances induced by the normalized loss kernel $R({\mathbf{z}},{\mathbf{z}}^{\prime})=\mathrm{Corr}_{{\bm{w}}\sim p({\bm{w}}\mid{\mathcal{D}})}[\ell({\mathbf{z}};{\bm{w}}),\,\ell({\mathbf{z}}^{\prime};{\bm{w}})]$ for Inception-v1 on ImageNet-1k; each point is one image, colored continuously by position in the ImageNet hierarchy. Similar colors indicate inputs are semantically similar. **1–9** Insets: example neighborhoods with thumbnails showing coherent regions for *dogs* (1), *primates* (2), *birds* (3), *diaspids* (4), *crustaceans* (5), *insects* (6), *produce* (7), *musical instruments* (8), and *vehicles/cars* (9). Bottom right: Orbit views of the same 3-D embedding. B The full correlation kernel matrix (10k $\times$ 10k) next to the ground truth distance matrix derived from the ImageNet hierarchy shows similar block structures in both.

We demonstrate the loss kernel as a practical interpretability technique by combining it with established kernel-based techniques to study two settings. First, in a controlled experiment using a synthetic multitask arithmetic problem, we confirm that the kernel successfully separates inputs corresponding to functionally independent subtasks, as predicted by theory. Second, we apply the loss kernel to an Inception-v1 model to create a visual map (Figure 1) of the ImageNet dataset on which it was trained (Szegedy et al., 2014; Deng et al., 2009). We then quantitatively validate that the structure of this kernel reveals a coherent semantic organization that is consistent with the WordNet class taxonomy (Princeton University, 2010).

Contributions.

Our contributions are thus:

We introduce the loss kernel as a measure of functional coupling, motivating it from the geometric perspective of singular learning theory and defining it through a principled, local probe distribution. (Section 2)
We validate the loss kernel in a controlled setting, confirming that the loss kernel is able to successfully separate subtasks in a synthetic multitask experiment, as predicted theoretically. (Section 3)
We apply the loss kernel to Inception-v1 on ImageNet, demonstrating its utility as a large-scale interpretability and visualization tool. We show that its structure reveals a coherent semantic organization consistent with the WordNet class taxonomy. (Section 4)

2 The Loss Kernel

In this section, we define the loss kernel, a metric that quantifies whether two inputs are processed similarly by a trained neural network. First (Section 2.1), we motivate our focus on the geometry of the loss landscape, specifically the set of low-loss points $W_{\epsilon}$ that contains a given trained model ${{\bm{w}}^{\ast}}$ . Second (Section 2.2), we develop a practical probe distribution using a localized Gibbs posterior, which allows us to sample from this low-loss region. Finally (Section 2.3), using this distribution, we formally define the loss kernel as the covariance of per-sample losses under our probe distribution.

2.1 Interpretability and Degeneracy

The typical process of training a neural network yields a single parameter vector ${{\bm{w}}^{\ast}}$ , optimized via an algorithm like SGD against an objective function of the form

L_{n}({\bm{w}})=\sum_{i=0}^{n}\ell({\mathbf{z}}_{i};{\bm{w}})

where $\ell({\mathbf{z}}_{i};{\bm{w}})$ is the loss on $i$ -th data sample ${\mathbf{z}}_{i}$ for the parameter vector ${\bm{w}}$ , with a dataset of size $n$ .

The field of interpretability seeks to understand the structure of the trained model represented by ${{\bm{w}}^{\ast}}$ . It is typically implicitly presumed that one can understand the structure in the trained model using the parameters ${{\bm{w}}^{\ast}}$ , either directly by inspecting, for example, weight magnitudes (Kovaleva et al., 2021), or indirectly by examining the computation process of the model at ${{\bm{w}}^{\ast}}$ through, for example, activations (Bricken et al., 2023; Wang et al., 2022; Carter et al., 2019) and gradients (Ancona et al., 2018; Sundararajan et al., 2017).

A challenge to interpreting weights directly is that neural networks are singular: many different parameters implement the same function or achieve the same loss. This degeneracy means that properties specific to ${{\bm{w}}^{\ast}}$ may reflect arbitrary details of the learned implementation that are irrelevant to downstream behavior. For example, ReLU-scaling symmetries mean the absolute magnitude of individual weights or gradients is not always meaningful on its own, which undermines interpretability methods that rely on it.

Singular learning theory.

Watanabe’s (2009) singular learning theory (SLT) provides a mathematical framework for studying models that exhibit such degeneracies. A key idea from SLT is to study the geometry of the set of minima of the loss function as a whole rather than individual weight settings. Consider the set of parameters which are “almost equivalent” to ${{\bm{w}}^{\ast}}$ , according to the training loss $L_{n}({\bm{w}})$ :¹¹1Throughout the paper, we define objects using the training loss $L_{n}({\bm{w}})$ , including the loss kernel itself. Alternatively, we could define these objects using the population loss $L({\bm{w}})=\lim_{n\rightarrow\infty}L_{n}({\bm{w}})$ , and treat the empirical versions as estimators of the population versions. We explore this further in Section A.1.4.

W_{\epsilon}=\{{\bm{w}}\in\mathbb{R}^{d}\mid L_{n}({\bm{w}})-L_{n}({{\bm{w}}^{\ast}})<\epsilon\}.

(1)

The asymptotic volume-scaling behavior of (the population version of) $W_{\epsilon}$ is directly linked, through SLT, to the complexity, description length, and generalization error of the model at ${{\bm{w}}^{\ast}}$ (Lau et al., 2025; Urdshals et al., 2025). Our work builds on this premise to develop a principled technique for measuring whether two inputs are processed in similar ways by a given trained neural network.

2.2 Constructing a Practical Probe

While $W_{\epsilon}$ is theoretically natural, it is difficult to integrate over this set because it is so high-dimensional. Moreover, we need a way to localize this set to a specific set of model weights obtained via stochastic optimization. We make two modifications to overcome these challenges and develop a practical low-loss probe:

From hard to soft constraints.

First, we replace the sharp boundary of $W_{\epsilon}$ with a smooth Gibbs factor, $\exp(-\beta L_{n}({\bm{w}}))$ . This concentrates sampling in low-loss regions, where the inverse temperature $\beta$ plays a role analogous to $1/\epsilon$ . This makes the distribution amenable to gradient-based MCMC sampling and is formally justified by the relationship between integrals over low-loss sets and expectations under the Gibbs distribution (see Section A.3).

From global to local.

Second, we focus on the neighborhood containing the specific model ${{\bm{w}}^{\ast}}$ found by a given run of stochastic optimization. The global loss landscape may contain many regions of low loss, but we wish to interpret the particular solution our training procedure has found. We therefore re-weight the Gibbs distribution with a Gaussian kernel centered at ${{\bm{w}}^{\ast}}$ .

This yields the final probe distribution over the training set $\mathcal{D}$ :

p({\bm{w}}\mid\mathcal{D})\propto\underbrace{\exp(-\beta L_{n}({\bm{w}}))}_{\text{Low-Loss Constraint}}\cdot\underbrace{\mathcal{N}({\bm{w}}\mid{{\bm{w}}^{\ast}},\gamma^{-1}I)}_{\text{Locality Constraint}}.

From a Bayesian perspective, this is equivalent to a tempered Bayesian posterior with Gaussian prior.

2.3 The Loss Kernel

The loss kernel.

The loss kernel, $K$ , is the covariance matrix of per-sample losses under our probe distribution:

K({\mathbf{z}},{\mathbf{z}}^{\prime})=\text{Cov}_{{\bm{w}}\sim p({\bm{w}}\mid\mathcal{D})}\left[\ell({\mathbf{z}};{\bm{w}}),\ell({\mathbf{z}}^{\prime};{\bm{w}})\right].

(2)

A high value of $K({\mathbf{z}},{\mathbf{z}}^{\prime})$ indicates that inputs ${\mathbf{z}}$ and ${\mathbf{z}}^{\prime}$ are functionally coupled, sharing sensitivity to the same parameter perturbations. The kernel is symmetric positive semi-definite as it is a covariance kernel. For analysis and visualization, we often use its normalized form:

R({\mathbf{z}},{\mathbf{z}}^{\prime})=\frac{K({\mathbf{z}},{\mathbf{z}}^{\prime})}{\sqrt{K({\mathbf{z}},{\mathbf{z}})K({\mathbf{z}}^{\prime},{\mathbf{z}}^{\prime})}},

with $R({\mathbf{z}},{\mathbf{z}}^{\prime})=0$ if $K({\mathbf{z}},{\mathbf{z}})=0$ or $K({\mathbf{z}}^{\prime},{\mathbf{z}}^{\prime})=0$ , which measures the correlation between per-sample losses. $R({\mathbf{z}},{\mathbf{z}}^{\prime})$ also has the advantage of being invariant under affine changes of the loss function, unlike $K({\mathbf{z}},{\mathbf{z}}^{\prime})$ itself.

Interpretation.

The loss kernel can be seen as a generalized version of the negated (local) Bayesian Influence Function (Kreer et al., 2025), which itself generalizes the influence function from classical statistics, see Section A.2. The diagonal of this kernel, $K({\mathbf{z}},{\mathbf{z}})$ , is the per-sample loss variance. Up to a multiplicative constant, the sum of $K({\mathbf{z}},{\mathbf{z}}^{\prime})$ over the training set, $\sum_{i}K({\mathbf{z}}_{i},{\mathbf{z}}_{i})$ , is an empirical estimator for the singular fluctuation, a key quantity in SLT that governs the model’s (Gibbs) generalization error, see Section A.1.

Practical estimation.

Expectations over the probe distribution $p({\bm{w}}\mid\mathcal{D})$ are intractable to compute analytically. We therefore approximate them using Monte Carlo methods. Specifically, we generate a set of $S$ samples $\{w_{s}\}_{s=1}^{S}$ from a Stochastic Gradient Langevin Dynamics (SGLD; Welling & Teh 2011) chain (or multiple parallel chains) initialized at the trained model’s parameters, ${{\bm{w}}^{\ast}}$ . We then use these samples to compute standard unbiased plug-in estimators for the loss kernel $\hat{K}({\mathbf{z}},{\mathbf{z}}^{\prime})$ and its normalized version $\hat{R}({\mathbf{z}},{\mathbf{z}}^{\prime})$ . We provide further details and departures from SGLD in Appendix B.

3 Validation on a Synthetic Task

Before using the loss kernel to explore structure in natural data, we first verify that it behaves as expected in a controlled scenario. Theoretically, we expect that for tasks solved by independent mechanisms—where the loss factorizes into a sum of sublosses depending on disjoint sets of weights—the cross-task loss covariance is zero (Section A.4). We test this prediction on a transformer trained on a multitask modular arithmetic problem designed to encourage such independent mechanisms.

Multitask arithmetic.

For our controlled scenario, we analyze a two-layer transformer on a multitask modular arithmetic (“grokking”) problem, extending the single-task setup of Power et al. (2022). Our model is trained to perfect accuracy on two independent tasks: modular addition and modular division, both modulo 97. To encourage the development of distinct computational pathways, each operation uses a separate input vocabulary.

Reducing dimensionality.

We visualize the kernel by applying standard dimensionality reduction techniques to a set of reference points in the kernel space. We use UMAP, which obtains a low-dimensional embedding optimized to preserve nearest-neighbor relationships (McInnes et al., 2018).

UMAP operates on a distance matrix, where a point must have distance $0$ with itself and positive distance with all other points. We transform the normalized loss kernel, or correlation, $R$ into a distance $d$ by setting the distance between any two samples ${\mathbf{z}}$ and ${\mathbf{z}}^{\prime}$ to $d({\mathbf{z}},{\mathbf{z}}^{\prime})=1-R({\mathbf{z}},{\mathbf{z}}^{\prime})$ . Applying UMAP to these pairwise distances produces the embedding depicted in Figure 3, where proximity in the visualization indicates a strong functional coupling between samples as measured by the kernel.²²2In the ImageNet setting we remove connections between inputs of the same label during UMAPs nearest neighbor search to eliminate potential spurious correlations (see Appendix B)

Interpreting the kernel.

After computing the loss kernel over all pairs over 10,000 inputs drawn equally from both tasks, we find its structure reflects the task-level separation between addition and division. As seen in the UMAP visualization in Figure 3, the kernel separates into two distinct clusters corresponding precisely to the addition and division samples (and a third smaller cluster for the trivial modular division case where the dividend is zero). Examining the underlying covariance values confirms this observation: cross-task covariances are narrowly distributed around zero, while within-task covariances are substantially larger.

Though we lack a sufficient mechanistic understanding to establish whether this model’s internal implementations of modular addition and division satisfy the criteria in Section A.4, observing vanishing correlation between tasks is consistent with the behavior theoretically predicted for functionally disjoint mechanisms. This establishes the kernel’s utility in a setting with partially known ground-truth structure.

4 Application to ImageNet

Having established theoretically and empirically that the loss kernel can identify ground-truth functional separation in a controlled setting, we now deploy it as an exploratory tool on a large-scale, real-world task. We consider an Inception-v1 model (Szegedy et al., 2014) trained on ImageNet data (Deng et al., 2009), where the true functional organization is not fully known. Our goal is to investigate qualitatively whether we can use the kernel as a visualization tool and quantitatively whether structure in the kernel corresponds to meaningful semantic and hierarchical structure in the data.

Visualizing the loss kernel.

For 10,000 random validation examples, we compute the loss correlation matrix and examine top-correlated inputs. We find that nearest neighbors are interpretable, often sharing patterns of color, texture, shape, or content. Figure 4 provides qualitative examples of these relationships, showing the top and bottom correlated examples for a selection of inputs. Additional randomly chosen examples are available in Section D.7.

Hierarchical structure in ImageNet.

The ImageNet dataset (Deng et al., 2009) is not a flat collection of classes; its labels are drawn from and organized according to the WordNet hierarchy, a large lexical database of English where nouns, verbs, adjectives, and adverbs are grouped into sets of synonyms (synsets), each expressing a distinct concept (Princeton University, 2010). Each node in the ImageNet hierarchy represents a category (e.g., “animals”, “mammals”, “devices”, “plants”), and each leaf node corresponds to a specific class, which the model was trained to predict (e.g., “wire-haired fox terrier , “goldfish”, “castle”). This taxonomy provides a natural (though only partial) source of ground truth for establishing similarity between ImageNet inputs, based on the similarity between their output labels according to the WordNet hierarchy.

To visualize this ground-truth structure overlaid on the loss kernel, we color each sample in Figure 1 (A) by the position of that sample’s label in the ImageNet hierarchy. The version of ImageNet we use in these experiments is organized into 1,000 classes; by sorting these classes via their position in the hierarchy we assign similar hues to inputs of nearby categories.

Hierarchical structure in the loss kernel.

The UMAP visualization in Figure 1 reveals a clear high-level organization that mirrors the primary branches of the WordNet hierarchy. A prominent split separates “animals” from “things,” with a transitional region occupied by “produce” (Inset 7). Within these broad domains, the kernel captures finer taxonomic distinctions. For example, the “animal” kingdom subdivides into coherent superclasses. A large cluster representing “domesticated animals”, particularly “dogs” (Inset 1), transitions into other mammals like “primates” (Inset 2), and then to “birds” (Inset 3). Nearby, we observe distinct groupings for “diapsids” (Inset 4), “crustaceans” (Inset 5), and “insects” (Inset 6). This hierarchical organization persists at deeper levels of specificity, as shown by the more detailed insets for “musical instruments” (Inset 8). The block structure of the full correlation matrix, when sorted by the WordNet hierarchy (Figure 1 B), provides an additional confirmation of this nested structure, showing strong intra-class correlation that closely mirrors the ground-truth semantic distance matrix derived from WordNet.

The kernel as a developmental tool.

At initialization the kernel shows no coherent structure (see Figure 8). As training proceeds, structure begins to emerge. Early checkpoints separate broad regimes (e.g., “animal” vs. “thing”), mid-training checkpoints resolve salient subgroups (e.g., “dogs” forming a distinct cluster), and later checkpoints exhibit finer-grained specialization. The UMAP snapshots in Figure 8 illustrate this coarse-to-fine trajectory, where neighborhoods that are initially mixed become progressively more taxonomically coherent as training converges.

Bayesian influence functions and training data attribution.

The loss kernel we propose is a generalization of the negative (local) Bayesian Influence Function (BIF; Kreer et al. 2025), which has its roots in Bayesian sensitivity analysis (Giordano et al., 2017; Iba, 2025). (Kreer et al., 2025) introduced the BIF as a tool for Training Data Attribution (TDA; Koh & Liang 2020), a task focused on provenance – identifying which training points are most responsible for a specific model behavior. Our work addresses a different question: one of functional coupling. We generalize the BIF from a unidirectional, single-point attribution measure into a global, symmetric, positive semidefinite kernel that measures the functional relationship between arbitrary pairs of inputs. Furthermore, we are the first to demonstrate its power for large-scale interpretability by applying kernel analysis techniques to this functional map. For more details on the differences, see Section A.2.

Data-similarity kernels and metric learning.

The general approach of learning a data-similarity kernel is a cornerstone of statistics and machine learning, and our work is situated within this broader context (Hofmann et al., 2006; Khatib & Alkhatib, 2024). Classical methods like Principal Component Analysis (PCA) can be viewed as defining similarity through the data’s covariance matrix. This was later generalized by Kernel PCA, which uses the kernel trick to learn non-linear similarities in a high-dimensional feature space (Schölkopf et al., 1997). A related field, metric learning, is explicitly focused on learning distance or similarity functions that are optimized for specific tasks, often by training models that pull similar data points together while pushing dissimilar ones apart (Kulis, 2013). In modern deep learning, this principle is prominent in representation learning, where models learn to project data into a latent embedding space where simple distance metrics (e.g., cosine similarity) correspond to semantic similarity (Bengio et al., 2014; Mikolov et al., 2013; Chen et al., 2020).

Representation-based interpretability.

Representation-based kernels are not limited to models explicitly trained for their representations. For example, similarity measures like Centered Kernel Alignment (CKA; Kornblith et al. 2019) make it possible to derive kernels from intermediate activations of LLMs trained on next-token prediction. This falls under the broader field of representation-based interpretability, which includes other techniques such as supervised “probes” that test for specific properties of activations, and unsupervised methods, like activation atlases (Carter et al., 2019) or sparse autoencoders (SAEs; Bricken et al. 2023). As described in Section 1, these representation-based interpretability techniques offer other ways to construct kernels.

The loss kernel offers a perspective complementary to these representation-based methods. Where representation-based methods learn similarity based on what data points look like in an embedding space, the loss kernel defines similarity based on how the model treats them across the set of low-loss points. Understanding the relationship between activation-space similarity and weight-space functional coupling is a key open question. An interesting direction for future work is to bridge between these different kernel approaches. For example, Multiple Kernel Learning (MKL; Gönen & Alpaydin 2011) techniques could be adapted to learn a meta-kernel that combines information from both representations and weight-space geometry.

Mechanistic and causal interventions.

Mechanistic interpretability aims to identify circuits and algorithms via targeted interventions such as activation patching and ablations (Wang et al., 2022). Our SGLD-based probe can be viewed as a complementary, weight-space analogue to these activation-space ablations. That said, our aims are different: we seek to use the loss kernel as an exploratory tool for discovering structure in data, rather than as a confirmatory tool for testing a mechanistic hypothesis.

Developmental interpretability.

Developmental interpretability is an approach to interpretability that models the SGD learning process as an idealized Bayesian learning process, then applies SLT to derive theoretical predictions, and finally verifies those predictions empirically on models trained using standard stochastic optimization techniques. This approach has been used successfully to detect and interpret phase transitions in stagewise learning in toy models of superposition (Chen et al., 2023; Elhage et al., 2022), transformers trained on algorithmic tasks like list-sorting and in-context regression (Carroll et al., 2025; Urdshals & Urdshals, 2025), and small language models (Hoogland et al., 2024; Wang et al., 2025b; Baker et al., 2025; Wang et al., 2025a).

The loss kernel is part of this broader agenda, particularly through its connection to key SLT quantities like the singular fluctuation (Section A.1).

6 Discussion & Conclusion

We introduced a new technique, the loss kernel, for mapping and interpreting learned functional relationships between samples in a trained neural network. The kernel is defined as the covariance matrix of per-sample losses, computed under a distribution of parameter perturbations localized to the set of low-loss points. We first validated this method on a synthetic multitask problem, demonstrating that the kernel separates inputs by their underlying task, consistent with theoretical predictions for functionally independent mechanisms. Applied to an Inception-v1 model trained on ImageNet, we show that the loss kernel can be used to visualize the structure of the data distribution and that this structure reflects the WordNet semantic hierarchy. These findings highlight the loss kernel as a useful practical tool for interpretability.

Limitations. The SGLD sampling procedure can be computationally intensive, although it is a one-time, post-hoc cost (for instance, the kernel used in the ImageNet results, Section 4, took three hours to compute on four A100 GPUs). Moreover, the results depend on the hyperparameters of the local posterior, particularly the localization strength $\gamma$ (see Section D.3). We also emphasize that our method is intentionally local, designed to interpret the specific solution found by training, not the entire global loss landscape. Finally, the kernel reveals functional correlation, not causation; it is a tool for discovering related behaviors and generating hypotheses for more targeted mechanistic investigations.

Future directions.

This work opens several promising avenues for future research. A primary theoretical direction is to deepen the connections to singular learning theory, and to extend this methodology beyond pairwise statistics to explore higher-order correlations. We might also hope to formalize the relationship between weight-space coupling, as measured by our kernel, and representation similarity in activation space. On an applied front, the kernel can serve as a discovery tool to guide mechanistic interpretability by identifying functionally-coupled inputs that warrant circuit-level analysis. Its ability to identify functional outliers suggests applications in anomaly and out-of-distribution detection, and the core method can be adapted to other domains like language models using token-level losses. Finally, a key direction is to apply the kernel across training checkpoints to create a developmental view of how a model’s internal functional geometry emerges and solidifies over time.

In summary, the loss kernel offers a window into the way neural networks perceive their input data, helping to understand what data the model treats similarly, and what data the model treats differently.

Appendix

Appendix A Theory Extra: Provides additional detail on the theoretical foundations for the paper’s methodology.
1. Section A.1 Singular Learning Theory: Introduces the core concepts of SLT for singular models like neural networks, connects the loss kernel to two key quantities from SLT (the empirical variance and singular fluctuation), and sketches what a population version of the loss kernel would look like.
2. Section A.2 Training Data Attribution: Introduces influence functions from training data attribution and compares the loss kernel against a type of influence function known as the (local) Bayesian Influence Function (BIF).
3. Section A.3 From Sublevel Sets to Gibbs Distribution: Establishes the formal relationship between expectations under the Gibbs distribution and integrals over low-loss sets, justifying the use of our probe distribution as a tractable probe of the low-loss parameter set.
4. Section A.4 Decoupling of Disjoint Mechanisms: Formalizes conditions under which the loss covariance between data points from independent subtasks is zero.
Appendix B Stochastic-Gradient MCMC Estimator: Provides additional details on the SGMCMC-based estimator we use to estimate the loss kernel.
Appendix C Synthetic Task Extra: Provides additional methodology and results for the synthetic multi-task arithmetic setting.
Appendix D ImageNet Extra: Provides additional methodology, hyperparameter values and ablations, and additional results for the ImageNet setting.

Appendix A Theory Extra

A.1 Singular Learning Theory

Singular learning theory (SLT) is concerned with the theory of machine learning models which are singular: very roughly, models for which their parameterization map is not one-to-one. Neural networks of virtually any architecture are examples of singular models. Singular models break many of the assumptions of traditional statistical learning theory (Watanabe, 2009; 2018). From an interpretability perspective, they have rich geometrical structure (e.g. in their loss landscape), which often reflects information about their internal structure (Murfet & Troiani, 2025) and their training data (Pepin Lehalleur et al., 2025).

A.1.1 Setup

Classically, the setting of singular learning theory is parametric Bayesian learning. We review the setup here briefly. See Watanabe (2009; 2018) for a more in-depth treatment.

We begin with a parameter space $W\subset\mathbb{R}^{d}$ (assumed compact) and a sample space $\mathcal{Z}$ . A parametric statistical model assigns a probability $p({\mathbf{z}}\mid{\bm{w}})$ to samples ${\mathbf{z}}\in\mathcal{Z}$ for a given parameter ${\bm{w}}\in W$ . In singular learning theory, we typically assume that $p({\mathbf{z}}\mid{\bm{w}})$ is analytic or at the very least piecewise-analytic, which holds for most statistical models including the vast majority of neural networks.

To quantify sensitivity of $p({\mathbf{z}}\mid{\bm{w}})$ to infinitesimal parameter perturbations, we define the Fisher information matrix:

I_{jk}({\bm{w}})=\int\left(\frac{\partial}{\partial w_{j}}\log p({\mathbf{z}}\mid{\bm{w}})\right)\left(\frac{\partial}{\partial w_{k}}\log p({\mathbf{z}}\mid{\bm{w}})\right)\,p({\mathbf{z}}\mid{\bm{w}})\,dx.

A model is regular at a parameter ${\bm{w}}\in W$ if the Fisher information matrix is positive-definite at ${\bm{w}}$ , and singular at ${\bm{w}}$ otherwise. We often say that a model $p({\mathbf{z}}\mid{\bm{w}})$ (withoutI specifying any parameter ${\bm{w}}$ ) is regular if it is regular for all ${\bm{w}}$ , and singular otherwise.

Note that the notion of a singular model is a purely geometric property: we have yet to discuss learning or Bayesian learning. We proceed to discuss that now. We aim to learn a data distribution $q({\mathbf{z}})$ over $\mathcal{Z}$ , which we have access to only indirectly via $n$ IID samples $\mathcal{D}=\{{\mathbf{z}}_{1},\dots,{\mathbf{z}}_{n}\}$ from $q({\mathbf{z}})$ . Our performance on this task is quantified by the negative log-likelihood or training loss, $L_{n}({\bm{w}})=-\sum_{i=0}^{n}\log(p({\mathbf{z}}_{i}\mid{\bm{w}}))$ .

In a Bayesian setting, we have a prior distribution $\varphi({\bm{w}})$ , and a (tempered) posterior distribution $p({\mathbf{z}}\mid\mathcal{D}_{n})$ obtained via Bayes rule:

p({\mathbf{z}}\mid\mathcal{D}_{n})=\frac{1}{Z}\int_{W}\exp(-\beta L_{n}({\bm{w}}))\,\varphi({\bm{w}})\,d{\bm{w}}

where $Z$ is a normalizing constant and $\beta$ is a hyperparameter known as the inverse temperature. When $\beta=1$ this is the ordinary Bayesian posterior. Note that one sometimes chooses $\varphi({\bm{w}})$ to be supported only in a neighborhood of a chosen point ${{\bm{w}}^{\ast}}\in W$ , in which case we call this a local posterior distribution (Lau et al., 2025).

A.1.2 Empirical Variance and the Singular Fluctuation

Define the Bayesian training error as the empirical Kullback-Leibler divergence from the posterior predictive distribution to the true distribution:

B_{t}=\frac{1}{n}\sum_{i=0}^{n}\log\left(\frac{q({\mathbf{z}})}{\mathbb{E}_{w}[p({\mathbf{z}}_{i}\mid{\bm{w}})]}\right)

Define the Bayesian generalization error as the population Kullback-Leibler divergence from the posterior predictive distribution to the true distribution:

B_{t}=\mathbb{E}_{x}\log\left(\frac{q({\mathbf{z}})}{\mathbb{E}_{w}[p({\mathbf{z}}\mid{\bm{w}})]}\right)

The expected asymptotic difference between these quantities is given by the singular fluctuation:

\nu(\beta)=\frac{1}{2}\lim_{n\rightarrow\infty}(\mathbb{E}_{\mathcal{D}}[B_{g}]-\mathbb{E}_{\mathcal{D}}[B_{t}]).

The singular fluctuation is a birational invariant appearing in many generalization formulas within SLT, including the difference between the Bayes and Gibbs generalization errors, or the difference between the Gibbs training error and Bayes generalization error.

A.1.3 Connection to the Loss Kernel

The loss kernel can be seen as a generalization of the empirical variance, the empirical estimator of the singular fluctuation. The empirical variance is defined as:

V=\sum_{i=0}^{n}\operatorname{Var}_{w}[\log(p({\mathbf{z}}_{i}\mid{\bm{w}})],

which estimates the singular fluctuation via

\frac{2\nu(\beta)}{\beta}=\lim_{n\rightarrow\infty}\mathbb{E}_{\mathcal{D}}[V].

If we treat the negative log-likelihood as a per-sample loss, $\ell({\mathbf{z}};{\bm{w}})=-\log(p({\mathbf{z}}\mid{\bm{w}}))$ , and recall that the probe distribution coincides with the Bayesian posterior, this can be seen as the trace of the loss kernel evaluated on the training dataset $\mathcal{D}$ :

	$\displaystyle V$	$\displaystyle=\sum_{i=0}^{n}\operatorname{Var}_{w}[\log(p({\mathbf{z}}_{i}\mid{\bm{w}})]$
		$\displaystyle=\sum_{i=0}^{n}\operatorname{Cov}_{w}[\ell({\mathbf{z}}_{i};{\bm{w}}),\ell({\mathbf{z}}_{i};{\bm{w}})]$
		$\displaystyle=\sum_{i=0}^{n}K({\mathbf{z}}_{i},{\mathbf{z}}_{i})$

From this perspective, the loss kernel can be seen as a per-sample generalization of the empirical variance, which further allows taking the covariance of two different samples, including possibly samples outside the training dataset $\mathcal{D}$ .

A.1.4 Towards a Population Loss Kernel

The loss kernel introduced in the main text is an empirical object, computed from a finite training dataset $\mathcal{D}$ of size $n$ . This section sketches the link between the empirical tool and what a population version might look like in the limit as $n\rightarrow\infty$ , which is the natural setting of singular learning theory. We expect this to be an interesting direction for future theoretical work.

From Empirical to Population Loss. The loss kernel probes the geometry of the empirical loss landscape, $L_{n}({\bm{w}})=\sum_{i=1}^{n}\ell({\mathbf{z}}_{i};{\bm{w}})$ . In the asymptotic limit, the law of large numbers implies that this converges to a function known as the population loss, $L({\bm{w}})$ . If the per-sample loss is the negative log-likelihood $\ell({\mathbf{z}};{\bm{w}})=-\log p({\mathbf{z}}\mid{\bm{w}})$ , it converges to the cross entropy (equivalently, KL divergence, up to a constant) from the true distribution $q({\mathbf{z}})$ to the model’s distribution $p({\mathbf{z}}\mid{\bm{w}})$ :

L({\bm{w}})=-\int q({\mathbf{z}})\log p({\mathbf{z}}\mid{\bm{w}})\,dx.

Let $L_{0}=\min L({\bm{w}})$ . The geometry of the set $\tilde{W}_{\epsilon}=\{{\bm{w}}\mid L({\bm{w}})-L_{0}\ \leq\epsilon\}$ as $\epsilon\rightarrow 0$ is intimately connected to the singularity theory of the function $L({\bm{w}})$ . The geometry in $L({\bm{w}})$ and $\tilde{W}_{\epsilon}$ is rich, often reflecting interpretable computational structure, which we might hope to use for interpretability (Murfet & Troiani, 2025).

Posterior Concentration. The set $\tilde{W}_{\epsilon}$ has statistical meaning as well as geometric meaning. As the sample size $n$ goes to infinity, the posterior concentrates around $\tilde{W}_{\epsilon}$ for increasingly small $\epsilon$ . The intuition behind this is simple (the posterior increasingly concentrates around better and better hypotheses as it gets more data), and we describe part of this connection in Section A.3. However, we note that actually proving convergence is highly nontrivial for singular models and that Watanabe (2009) spends multiple chapters proving similar results. From the perspective of Bayesian statistics, this convergence means that the asymptotic geometry of $\tilde{W}_{\epsilon}$ controls statistical quantities like the generalization error (Watanabe, 2009). For our purposes, it means that we can use the (local) posterior (the probe distribution, as we call it in the main text), whose properties can be estimated empirically using SGLD, to study the asymptotic properties of $\tilde{W}_{\epsilon}$ .

From Empirical Observables to Population Geometry. We have said that one can use the local posterior (empirical) to probe the local asymptotic properties of $\tilde{W}_{\epsilon}$ (theoretical). To ground our discussion, we give a concrete example of how one does so for a different tool, the local learning coefficient (LLC; Lau et al. 2025). Let $B({{\bm{w}}^{\ast}})$ be a closed ball about ${{\bm{w}}^{\ast}}$ . The local learning coefficient $\lambda({{\bm{w}}^{\ast}})$ can be defined as the unique $\lambda({{\bm{w}}^{\ast}})$ such that asymptotically as $\epsilon\rightarrow 0$ :

\operatorname{Vol}(\tilde{W}_{\epsilon}\cap B({{\bm{w}}^{\ast}}))\approx c\,\epsilon^{\lambda({{\bm{w}}^{\ast}})}(-\log\epsilon)^{m-1}

for some constant $c$ and positive integer $m$ . This is the population quantity. It may be estimated in practice with a local posterior expectation value:

\hat{\lambda}({{\bm{w}}^{\ast}})=n\beta\big[\mathbb{E}_{{\bm{w}}\sim p({\bm{w}}\mid\mathcal{D})}[L_{n}({\bm{w}})]-L_{n}({{\bm{w}}^{\ast}})\big].

This type of relationship is precisely what we conjecture to hold for some suitably-defined “population” version of the loss kernel.

A Population Loss Kernel. In this paper, we do not define a population version of the loss kernel, but we expect this to be the start of a promising direction for future work. It seems conceivable that one could define such an object, and prove that it converges to the empirical loss kernel under some limit. From this perspective, the loss kernel as we have defined it in the main text would merely be an empirical estimator of the population loss kernel. By analogy to quantities like the LLC, we might expect the population version to have desirable theoretical properties, such as reparameterization invariance (see Appendix C of Lau et al. 2025). Most speculatively, one might even hope that such a population loss kernel could connect to information like “computational structure” reflected in the population geometry (Murfet & Troiani, 2025).

A.2 Training Data Attribution

The loss kernel is a natural generalization of a class of techniques known as influence functions, which are used for training data attribution (TDA; Cheng et al. 2025). This section clarifies the relationship between these objects.

A.2.1 Classical Influence Functions

Classical influence functions (IFs) measure how a model’s parameters and, consequently, any observable quantity, would change if a single training point were infinitesimally up-weighted (Cook, 1977; Cook & Weisberg, 1982). To formalize this, consider a training set $\{z_{i}\}_{i=1}^{n}$ and a tempered empirical average loss $L_{n,\bm{\beta}}({\bm{w}})=\sum_{i=1}^{n}\beta_{i}\ell_{i}({\bm{w}})$ . Let ${{\bm{w}}^{\ast}}(\bm{\beta})$ be the parameter vector that minimizes this average loss. The influence of a training point $z_{i}$ on an observable $\phi({\bm{w}})$ (e.g., the loss on a test point) is defined as the sensitivity of the observable evaluated at this new minimum to a change in the weight $\beta_{i}$ :

\operatorname{IF}(z_{i},\phi):=\frac{\partial\phi({{\bm{w}}^{\ast}}({\bm{\beta}}))}{\partial\beta_{i}}\bigg\rvert_{{\bm{\beta}}=\bm{1}}.

(3)

Applying the chain rule and the implicit function theorem, one arrives at the well-known formula involving the Hessian of the training loss, ${\bm{H}}({{{\bm{w}}^{\ast}}})$ :

\operatorname{IF}(z_{i},\phi)=-\nabla_{\bm{w}}\phi({{\bm{w}}^{\ast}})^{\top}{\bm{H}}({{{\bm{w}}^{\ast}}})^{-1}\nabla_{{\bm{w}}}\ell_{i}({{\bm{w}}^{\ast}}).

(4)

This approach faces significant challenges with modern neural networks, where the Hessian is typically singular (non-invertible) and computationally intractable to compute.

A.2.2 Bayesian Influence Functions

The Bayesian Influence Function (BIF) offers a principled, Hessian-free alternative (Giordano et al., 2017; Iba, 2025). Instead of tracking a single point estimate ${{\bm{w}}^{\ast}}(\bm{\beta})$ , the BIF measures the sensitivity of the expectation of an observable under a tempered posterior distribution $p_{\bm{\beta}}({\bm{w}}\mid\mathcal{D})\propto\exp(-L_{n,\bm{\beta}}({\bm{w}}))\varphi({\bm{w}})$ :

\operatorname{BIF}(z_{i},\phi):=\frac{\partial\mathbb{E}_{{\bm{w}}\sim p_{\bm{\beta}}({\bm{w}}\mid\mathcal{D})}[\phi({\bm{w}})]}{\partial\beta_{i}}\bigg\rvert_{{\bm{\beta}}=\bm{1}}.

(5)

A standard result from statistical physics shows that this derivative is equal to the negative covariance over the untempered ( $\bm{\beta}=\bm{1}$ ) posterior:

\operatorname{BIF}(z_{i},\phi)=-\mathrm{Cov}_{{\bm{w}}\sim p({\bm{w}}\mid\mathcal{D})}(\phi({\bm{w}}),\ell_{i}({\bm{w}})).

(6)

As proposed in Kreer et al. (2025), this method can be adapted to analyze standard, non-Bayesian models by defining a local posterior that is constrained to the neighborhood of the trained parameters ${{\bm{w}}^{\ast}}$ when combined with scalable SGMCMC-based estimators. This “local BIF” provides a practical tool for TDA that is well-defined even for singular models.

A.2.3 Connection to the Loss Kernel

The loss kernel differs from the BIF in three primary ways:

First, the BIF is unidirectional, measuring the influence of training points on (held-out) query points. This is because TDA focuses on provenance—tracing a behavior back to individual training samples. The loss kernel, in contrast, drops this distinction and directionality; it is the full symmetric, positive semidefinite kernel where entries $K({\mathbf{z}},{\mathbf{z}}^{\prime})=\mathrm{Cov}[\ell({\mathbf{z}};{\bm{w}}),\ell({\mathbf{z}}^{\prime};{\bm{w}})]$ measure functional coupling between arbitrary inputs—whether the model has encountered those samples during training or not.

Second, while influence functions focus on individual interactions between (groups of) samples, the loss kernel, as a kernel, shifts the focus to global functional organization. By applying techniques from kernel methods (e.g., UMAP), we use the loss kernel as a primary tool for interpreting the global structure of the data manifold “as seen by the model.” This comes with a caveat: it is possible to promote classical influence functions to a symmetric kernel and thereby to pull in these same kernel-derived methods. But in the classical paradigm, this operation lacks the same justification as we’re able to provide for the loss kernel in Sections 2 and A.1.

Finally, the loss kernel has deep theoretical grounding in singular learning theory (SLT) (see Section A.1). The diagonal of the loss kernel, $K({\mathbf{z}},{\mathbf{z}})$ , represents the per-sample loss variance, and its trace over the training set is an empirical variance, which is an estimator of the singular fluctuation, a key quantity that governs the model’s generalization error. We describe this connection in more detail in Section A.1.3

A.3 From Sublevel Sets to Gibbs Distribution

This appendix establishes the formal relationship between expectations under the Gibbs distribution and integrals over the low-loss sets of an analytic loss function $L({\bm{w}})$ . We demonstrate that these quantities are related by the Laplace transform, which justifies our use of a statistical expectation about the probe distribution as a tractable tool for probing the geometry of the loss landscape.

We consider two related quantities for analyzing an observable $f({\bm{w}})$ . The first is the integral of $f({\bm{w}})$ over the $\epsilon$ -low-loss set $W_{\epsilon}=\{{\bm{w}}\in\mathbb{R}^{d}\mid L({\bm{w}})-\min_{{\bm{w}}^{\prime}}L({\bm{w}}^{\prime})<\epsilon\}$ , which defines a function of $\epsilon$ :

g(\epsilon)=\int_{W_{\epsilon}}f({\bm{w}})\,d{\bm{w}}.

(7)

The second is the expectation of $f({\bm{w}})$ under the Gibbs distribution $p_{\text{gibbs}}({\bm{w}})=\frac{1}{Z(\beta)}\exp(-\beta L({\bm{w}}))$ , which defines a function of the inverse temperature $\beta$ :

\mathbb{E}_{\beta}[f({\bm{w}})]=\frac{1}{Z(\beta)}\int_{W}f({\bm{w}})e^{-\beta L({\bm{w}})}\,d{\bm{w}}

(8)

where $Z(\beta)$ is a normalizing constant and $W$ is the parameter space.

The following proposition details the precise relationship between $g(\epsilon)$ and $\mathbb{E}_{\beta}[f({\bm{w}})]$ .

Proposition 1.

The Gibbs expectation $\mathbb{E}_{\beta}[f({\bm{w}})]$ is the Laplace transform of the low-loss integral $g(\epsilon)$ , up to a known factor:

\mathbb{E}_{\beta}[f({\bm{w}})]=\frac{\beta}{Z(\beta)}\mathcal{L}\left\{g(\epsilon)\right\}(\beta),

(9)

where $\mathcal{L}\{\cdot\}(\beta)$ denotes the Laplace transform with respect to $\epsilon$ .

Proof.

By definition, the Gibbs expectation is given by

\mathbb{E}_{\beta}[f({\bm{w}})]=\frac{1}{Z(\beta)}\int_{W}f({\bm{w}})e^{-\beta L({\bm{w}})}\,d{\bm{w}}.

Using the coarea formula, we may rewrite the integral over $\mathbb{R}^{d}$ as an iterated integral over the level sets of the loss function:

\mathbb{E}_{\beta}[f({\bm{w}})]=\frac{1}{Z(\beta)}\int_{0}^{\infty}e^{-\beta\epsilon}\left(\frac{d}{d\epsilon}\int_{L({\bm{w}})<\epsilon}f({\bm{w}})\,d{\bm{w}}\right)d\epsilon.

Recognizing that $\int_{L({\bm{w}})<\epsilon}f({\bm{w}})\,d{\bm{w}}=g(\epsilon)$ , the expression becomes the Laplace transform of the derivative of $g(\epsilon)$ :

\mathbb{E}_{\beta}[f({\bm{w}})]=\frac{1}{Z(\beta)}\int_{0}^{\infty}e^{-\beta\epsilon}g^{\prime}(\epsilon)\,d\epsilon=\frac{1}{Z(\beta)}\mathcal{L}\{g^{\prime}(\epsilon)\}(\beta).

The derivative property of the Laplace transform states that $\mathcal{L}\{g^{\prime}(\epsilon)\}(\beta)=\beta\mathcal{L}\{g(\epsilon)\}(\beta)-g(0)$ . This yields:

\mathbb{E}_{\beta}[f({\bm{w}})]=\frac{1}{Z(\beta)}\left(\beta\mathcal{L}\{g(\epsilon)\}(\beta)-g(0)\right).

The term $g(0)=\int_{W_{0}}f({\bm{w}})\,d{\bm{w}}$ is an integral over the set of global minima. If $L({\bm{w}})$ is analytic, $W_{0}$ has Lebesgue measure zero, which implies $g(0)=0$ . The proposition follows.

Proposition 1 provides the theoretical basis for our methodology. The invertibility of the Laplace transform implies that the family of Gibbs expectations contains the same information as the family of low-loss-set integrals. We opt for the statistical quantity for practical reasons: $\mathbb{E}_{\beta}[f({\bm{w}})]$ is amenable to gradient-based MCMC methods, making it computationally tractable for high-dimensional models. Furthermore, it provides a summary of the observable’s behavior over all loss levels, weighted naturally by the Gibbs factor, thereby obviating the need to select an arbitrary threshold $\epsilon$ . The Gibbs expectation is thus a practical and well-founded object for analyzing the properties of the low-loss subset.

A.4 Decoupling of Disjoint Mechanisms

This section provides justification for the prediction in Section 3 that a model that has learned disjoint mechanisms for independent tasks should have zero loss covariance between samples from different tasks, under the condition that the mechanisms involve non-overlapping weights.

Proposition 2.

Let a model’s parameters ${\bm{w}}$ be partitioned into two disjoint sets, ${\bm{w}}=({\bm{w}}_{A},{\bm{w}}_{B})$ . Let the training data $\mathcal{D}$ be partitioned into two disjoint sets $\mathcal{D}_{A}$ and $\mathcal{D}_{B}$ , corresponding to two independent subtasks. Assume the model has learned disjoint mechanisms, such that for any data point ${\mathbf{z}}\in\mathcal{D}_{A}$ , its loss $\ell({\mathbf{z}};{\bm{w}})$ is a function only of ${\bm{w}}_{A}$ , and for any ${\mathbf{z}}^{\prime}\in\mathcal{D}_{B}$ , its loss $\ell({\mathbf{z}}^{\prime};{\bm{w}})$ is a function only of ${\bm{w}}_{B}$ . Then, under the probe distribution, the loss covariance between ${\mathbf{z}}$ and ${\mathbf{z}}^{\prime}$ is zero:

K({\mathbf{z}},{\mathbf{z}}^{\prime})=\mathrm{Cov}_{{\bm{w}}\sim p({\bm{w}}\mid\mathcal{D})}[\ell({\mathbf{z}};{\bm{w}}_{A}),\ell({\mathbf{z}}^{\prime};{\bm{w}}_{B})]=0

Proof.

Under the stated assumptions, the total loss $L({\bm{w}})$ on the dataset $\mathcal{D}=\mathcal{D}_{A}\cup\mathcal{D}_{B}$ is additively separable:

L({\bm{w}})=\sum_{{\mathbf{z}}\in\mathcal{D}}\ell({\mathbf{z}};{\bm{w}})=\sum_{{\mathbf{z}}\in\mathcal{D}_{A}}\ell({\mathbf{z}};{\bm{w}}_{A})+\sum_{{\mathbf{z}}\in\mathcal{D}_{B}}\ell({\mathbf{z}};{\bm{w}}_{B})=L_{A}({\bm{w}}_{A})+L_{B}({\bm{w}}_{B})

The probe distribution $p({\bm{w}}\mid\mathcal{D})$ is given by:

p({\bm{w}}\mid\mathcal{D})\propto\exp(-\beta L({\bm{w}}))\cdot\mathcal{N}({\bm{w}}|{{\bm{w}}^{\ast}},\gamma^{-1}I)

The spherical Gaussian localization term $\mathcal{N}({\bm{w}}|{{\bm{w}}^{\ast}},\gamma^{-1}I)$ also factorizes over the disjoint parameter sets:

\mathcal{N}({\bm{w}}|{{\bm{w}}^{\ast}},\gamma^{-1}I)\propto\exp\left(-\frac{\gamma}{2}\|{\bm{w}}-{{\bm{w}}^{\ast}}\|^{2}\right)=\exp\left(-\frac{\gamma}{2}\|{\bm{w}}_{A}-{\bm{w}}_{A}^{*}\|^{2}\right)\exp\left(-\frac{\gamma}{2}\|{\bm{w}}_{B}-{\bm{w}}_{B}^{*}\|^{2}\right)

Substituting the separable loss and the factorized Gaussian into the probe distribution definition, we find that the probe distribution itself factorizes:

	$\displaystyle p({\bm{w}}_{A},{\bm{w}}_{B}\|\mathcal{D})$	$\displaystyle\propto\exp(-\beta[L_{A}({\bm{w}}_{A})+L_{B}({\bm{w}}_{B})])\cdot\mathcal{N}({\bm{w}}_{A}\|{\bm{w}}_{A}^{},\gamma^{-1}I_{A})\cdot\mathcal{N}({\bm{w}}_{B}\|{\bm{w}}_{B}^{},\gamma^{-1}I_{B})$
		$\displaystyle\propto\left[\exp(-\beta L_{A}({\bm{w}}_{A}))\mathcal{N}({\bm{w}}_{A}\|{\bm{w}}_{A}^{},\gamma^{-1}I_{A})\right]\cdot\left[\exp(-\beta L_{B}({\bm{w}}_{B}))\mathcal{N}({\bm{w}}_{B}\|{\bm{w}}_{B}^{},\gamma^{-1}I_{B})\right]$
		$\displaystyle\propto p_{A}({\bm{w}}_{A}\|\mathcal{D}_{A})\cdot p_{B}({\bm{w}}_{B}\|\mathcal{D}_{B})$

where $p_{A}$ and $p_{B}$ are the probe distributions for each sub-problem. This factorization implies that ${\bm{w}}_{A}$ and ${\bm{w}}_{B}$ are independent random variables under the joint posterior $p({\bm{w}}\mid\mathcal{D})$ .

The covariance between the losses $\ell({\mathbf{z}};{\bm{w}}_{A})$ and $\ell({\mathbf{z}}^{\prime};{\bm{w}}_{B})$ is defined as:

\mathrm{Cov}[\ell({\mathbf{z}};{\bm{w}}_{A}),\ell({\mathbf{z}}^{\prime};{\bm{w}}_{B})]=\mathbb{E}[\ell({\mathbf{z}};{\bm{w}}_{A})\ell({\mathbf{z}}^{\prime};{\bm{w}}_{B})]-\mathbb{E}[\ell({\mathbf{z}};{\bm{w}}_{A})]\mathbb{E}[\ell({\mathbf{z}}^{\prime};{\bm{w}}_{B})]

Since $\ell({\mathbf{z}};-)$ is a function only of ${\bm{w}}_{A}$ , and $\ell({\mathbf{z}}^{\prime};-)$ is a function only of ${\bm{w}}_{B}$ , and ${\bm{w}}_{A}$ and ${\bm{w}}_{B}$ are independent, the expectation of their product is the product of their expectations:

\mathbb{E}[\ell({\mathbf{z}};{\bm{w}}_{A})\ell({\mathbf{z}}^{\prime};{\bm{w}}_{B})]=\mathbb{E}[\ell({\mathbf{z}};{\bm{w}}_{A})]\mathbb{E}[\ell({\mathbf{z}}^{\prime};{\bm{w}}_{B})]

Therefore, the covariance is zero:

\mathrm{Cov}[\ell({\mathbf{z}};{\bm{w}}_{A}),\ell({\mathbf{z}}^{\prime};{\bm{w}}_{B})]=\mathbb{E}[\ell({\mathbf{z}};{\bm{w}}_{A})]\mathbb{E}[\ell({\mathbf{z}}^{\prime};{\bm{w}}_{B})]-\mathbb{E}[\ell({\mathbf{z}};{\bm{w}}_{A})]\mathbb{E}[\ell({\mathbf{z}}^{\prime};{\bm{w}}_{B})]=0

This holds for any ${\mathbf{z}}\in\mathcal{D}_{A}$ and ${\mathbf{z}}^{\prime}\in\mathcal{D}_{B}$ .

While this sketch is illustrative, we note that it may be somewhat unrealistic to believe that deep learning models implement distinct mechanisms in disjoint sets of weights. See for instance the phenomenon of polysemanticity (Elhage et al., 2022). It may require a change of coordinates before mechanisms cleanly factorize. From a singular learning theory perspective, the correct remedy here is likely found at the level of population quantities, which are often invariant to arbitrary (diffeomorphic) coordinate change (see for example Appendix C of Lau et al. (2025)). We discuss the possibility of a population loss kernel with such a property in Section A.1.4, but we largely leave that to future work.

Appendix B Stochastic-Gradient MCMC Estimator

Evaluating the loss kernel $K({\mathbf{z}},{\mathbf{z}}^{\prime})=\mathrm{Cov}_{{\bm{w}}}[\ell({\mathbf{z}};{\bm{w}}),\ell({\mathbf{z}}^{\prime};{\bm{w}})]$ requires Monte-Carlo samples from the probe distribution $p({\bm{w}}\mid\mathcal{D})$ . Following Lau et al. (2025), we use Stochastic Gradient Langevin Dynamics (SGLD; Welling & Teh 2011).

Update rule.

With stochastic mini-batch $B_{t}\subset[n]$ of size $m$ and step size $\epsilon$ , SGLD performs

{\bm{w}}_{t+1}\;=\;{\bm{w}}_{t}-\frac{\epsilon}{2}\Bigl(\frac{n}{m}\!\!\sum_{{\mathbf{z}}\in B_{t}}\!\nabla_{w}\ell({\mathbf{z}};{\bm{w}}_{t})+\gamma\bigl({\bm{w}}_{t}-{{\bm{w}}^{\ast}}\bigr)\Bigr)+\sqrt{\epsilon}\,\bm{\xi}_{t},\qquad\bm{\xi}_{t}\sim\mathcal{N}(0,I).

(10)

The first term is the stochastic gradient of the loss; the second is the gradient of the Gaussian localization potential $\frac{\gamma}{2}\|{\bm{w}}-{{\bm{w}}^{\ast}}\|^{2}$ ; the injected Gaussian noise ensures asymptotic convergence to $p({\bm{w}}\mid\mathcal{D})$ as $\epsilon\to 0$ .

Parallel chains and burn-in.

To improve mixing we run $C$ independent chains, each initialized at ${{\bm{w}}^{\ast}}$ . After discarding a burn-in of $b$ iterations, we retain $T$ draws $\{{\bm{w}}_{c,t}\}_{t=1}^{T}$ per chain. For every retained weight we record the vectors $\ell({\mathbf{z}}_{i};{\bm{w}}_{c,t})$ .

Estimators.

The unbiased plug-in estimators for $K({\mathbf{z}},{\mathbf{z}}^{\prime})$ and $R({\mathbf{z}},{\mathbf{z}}^{\prime})$ are:

	$\displaystyle\hat{K}({\mathbf{z}},{\mathbf{z}}^{\prime})$	$\displaystyle=\frac{1}{CT-1}\sum_{c=1}^{C}\sum_{t=1}^{T}\bigl(\ell({\mathbf{z}};{\bm{w}}_{c,t})-\hat{\mu}({\mathbf{z}}))\bigl(\ell({\mathbf{z}}^{\prime};{\bm{w}}_{c,t})-\hat{\mu}({\mathbf{z}}^{\prime})),$
	$\displaystyle\hat{R}({\mathbf{z}},{\mathbf{z}}^{\prime})$	$\displaystyle=\hat{K}({\mathbf{z}},{\mathbf{z}}^{\prime})/\sqrt{\hat{K}({\mathbf{z}},{\mathbf{z}})\hat{K}({\mathbf{z}}^{\prime},{\mathbf{z}}^{\prime})},$

where $\hat{\mu}({\mathbf{z}})$ is the estimated average loss:

\hat{\mu}({\mathbf{z}})=\frac{1}{CT}\sum_{c,t}\ell({\mathbf{z}};{\bm{w}}_{c,t}).

Batched evaluation.

At each retained iteration ${\bm{w}}_{c,t}$ , a full forward pass is performed over the entire dataset of interest to compute and store the loss vector $(\ell({\mathbf{z}};{\bm{w}}_{c,t}))_{i}$ . In contrast, the SGLD update in Equation 10 only requires a single backward pass on a small, random minibatch $B_{t}$ .

Contrast this with the local Bayesian Influence Function (BIF; Kreer et al. 2025), which requires computing forward passes over two separate “attribution” and “query” datasets. We compute forward passes only over a single set, yielding an $n\times n$ covariance kernel. This is effectively the same as treating every sample as both an “attribution” and a “query” point to measure the functional coupling between all pairs of inputs.

Avoiding spurious correlations.

We observe that a high correlation between inputs of the same label often is spurious. At some SGLD hyperparameters, noise injected in the unembedding weights causes inputs of the same label to always slightly increase or decrease in loss together. This can dominate the observed correlations. Similar issues apply to per-sample gradient and activation based methods, where often the unembedding weights aren’t used in computation for the same reason. For example, we find that we can perfectly recover input labels by running SGLD for 10 steps on an untrained model. UMAP works via a fuzzy nearest neighbors lookup, and so to deconfound our UMAPs we delete edges between same label inputs during the neighbor finding step. This means two inputs of the same label will never be neighbors just because they share a label.

Hyperparameters Overview

Table 1 summarizes the hyperparameter settings for the correlation kernel experiments. We sample with SGLD: $m$ is the batch size, $C$ is the number of chains, $T$ the number of draws per chain, $b$ is the number of burn-in steps, $\epsilon$ is the learning rate, $\beta$ is the inverse temperature, and $\gamma$ is the localization strength.

Table 1: Summary of hyperparameter settings for correlation kernel experiments. Hyperparameters are defined in Appendix B and Section 2.3.

Section Model Dataset $m$ $C$ $T$ $b$ $\epsilon$ $n\beta$ $\gamma$ Section 3 2 Layer Transformer Modular Addition and Modular Division mod 97. 512 30 800 200 $2\times 10^{-7}$ 500 30,000 Section 4 Inception-v1 ImageNet 256 15 500 100 $5\times 10^{-5}$ 20 4,000 Section D.4 Inception-v1 ImageNet with 1,000 random samples mislabeled. 256 8 1000 100 $1\times 10^{-5}$ 20 4,000 Section D.3 Inception-v1 ImageNet 256 5 1200 2000 $1\times 10^{-4}$ 20 Varied

Appendix C Synthetic Task Extra

This section provides additional details for the synthetic multitask experiment presented in Section 3.

Model architecture.

We use a two-layer transformer with the same architecture as that used in the original grokking experiments by Power et al. (2022). We refer the reader to their work for specific architectural details. We make one modification which is to double the vocabulary, so that each task uses an independent set of tokens.

Tasks and dataset.

The model was trained on a multitask problem comprising modular addition and modular division, both over the prime modulus $p=97$ . Inputs for both tasks are sequences of the form a, b, result. The use of non-overlapping vocabularies is sufficient to for the model to distinguish which operation must be performed.

Training and evaluation data were generated by sampling integers $a,b\in\{0,\dots,96\}$ uniformly at random. For modular division $a/b$ , we compute $a\cdot b^{-1}\pmod{97}$ , where $b^{-1}$ is the modular multiplicative inverse of $b$ . We exclude cases where $b=0$ .

Training.

The model was trained on both tasks simultaneously using the Adam optimizer until it achieved 100% accuracy on the training set.

Loss kernel estimation.

After training, we estimated the loss kernel to analyze the learned functional structure. We used SGLD to draw samples from the local posterior distribution, localized around the final trained weights ${{\bm{w}}^{\ast}}$ . We collected a total of 30,000 posterior weight samples after an initial burn-in period of 200 steps for each chain. The loss kernel was then computed over an evaluation set of 10,000 randomly selected inputs, evenly split between the modular addition and modular division tasks. The specific SGLD hyperparameters, including learning rate $\epsilon$ , inverse temperature $\beta$ , and localization strength $\gamma$ , are provided in the main hyperparameter summary (Table 1).

Appendix D ImageNet Extra

D.1 Inception-v1

We apply our method to Inception-v1 (Szegedy et al., 2014). Each Inception-v1 experiment evaluates posterior correlations over 10,000 ImageNet validation samples, while sampling over the full ImageNet (Deng et al., 2009) training dataset. To reduce memory overhead, we downscale all images to 256x256 resolution. Full hyperparameters are included in Table 1. We find that the quality of correlations depends significantly on total draws used: see Section D.3 for extended discussion.

D.2 Quantifying Hierarchical Structure

To move beyond visual inspection, we quantitatively assess how well the kernel’s structure aligns with the WordNet hierarchy.

Taxonomic lift construction.

For each validation image $i$ with WordNet label $y(i)$ at depth $d(i)$ , we take its top- $k$ neighbors under the correlation kernel $R$ (we use $k{=}30$ ). For any candidate ancestor depth $d^{\prime}$ , define

\mathrm{Lift}(d,d^{\prime})\;=\;\frac{\Pr\!\left[\exists\,a\text{ at depth }d^{\prime}\text{ such that }a\preceq y(i)\ \wedge\ a\preceq y(j)\ \middle|\ j\in\mathrm{NN}_{k}(i),\ d(i){=}d\right]}{\Pr\!\left[\exists\,a\text{ at depth }d^{\prime}\text{ such that }a\preceq y(i)\ \wedge\ a\preceq y(j)\right]},

where $a\preceq y$ means “ $a$ is an ancestor of label $y$ ” in the ImageNet–WordNet hierarchy (equivalently, $y\in\mathrm{Descendants}(a)$ ). We condition on query depth $d$ to avoid confounding from the uneven leaf-depth distribution. Curves in Fig. 5 report $\mathrm{Lift}(d,d^{\prime})$ versus the depth of the shared ancestor (i.e., tree distance from the root), with one curve per query depth $d$ .

D.3 Hyperparameter Dependence

Convergence of the estimator.

Centered Kernel Alignment (CKA) is a similarity measure between two kernel (or Gram) matrices. Given kernels $K,L\in\mathbb{R}^{n\times n}$ , the CKA is defined as

\mathrm{CKA}(K,L)=\frac{\langle K_{c},L_{c}\rangle_{F}}{\|K_{c}\|_{F}\,\|L_{c}\|_{F}},

where $K_{c}$ and $L_{c}$ denote the centered versions of $K$ and $L$ , and $\langle\cdot,\cdot\rangle_{F}$ is the Frobenius inner product. This normalization ensures that $\mathrm{CKA}(K,L)\in[0,1]$ , with 1 indicating identical representational structure.

CKA analysis reveals the consistency and similarity of representations across different training runs and sampling procedures. We can use it to compare the kernels we get at different hyperparameters, but also how the kernel evolves as total SGLD step count increases. Figure 6 A shows how the CKA between the kernel at step $t$ of SGLD and the kernel at the final step changes as a function of $t$ . Note that $t$ is total steps over all chains and that we limit individual chain. We find that higher $\gamma$ leads to faster convergence. Similarly, Figure 6 B shows the CKA between kernels computed using different $\gamma$ parameters. At high $\gamma$ the CKA between kernels is close to 1, meaning the kernel is robust to specific choice of $\gamma$ .

The effect of $\gamma$ .

Recall that the hyperparameter $\gamma$ controls how tightly the probe distribution is concentrated around ${{\bm{w}}^{\ast}}$ in parameter space. Empirically, Figure 6 D quantifies this trade-off with a simple lift metric (the weighted probability that a sample’s nearest neighbors under the loss kernel $R$ share an attribute, divided by that attribute’s base rate). At low $\gamma$ , neighbors are disproportionately matched by low-level cues such as color (high color-lift); as $\gamma$ increases, color-lift falls while hierarchical coherence (neighbors sharing nearby nodes in WordNet) rises sharply. We detail how we group inputs by color in Section D.6 – we use the groupings to compute the same way we compute per-node lift in Section D.2.

UMAP snapshots beneath each curve show the same transition qualitatively: low $\gamma$ yields broad, texture/color-organized neighborhoods, while high $\gamma$ foregrounds semantically tight groupings aligned with the taxonomy. Specific per-experiment hyperparameter settings are detailed in the below section.

D.4 Detecting Memorization

We test whether the loss kernel is sensitive to changes in the functional constraints imposed on the model by making a targeted change to the model’s training data distribution. We randomly mislabel a subset of the training data, forcing the model to memorize in order to achieve a low loss.

Memorization imposes a strict functional constraint on our model. A very precise weight setting is required to achieve high performance – put simply, the set of parameters that achieve low loss on the mislabeled set forms a much narrower region (a sharper basin) than the region that preserves low loss when the mapping can be supported by shared features.

As detailed in Section A.1, the trace of our kernel is an estimator for the singular fluctuation, a quantity that appears in the asymptotic formula for the Gibbs generalization error. The kernel itself can be seen as measuring the first-order change in a related quantity known as the Bayes generalization error, with respect to the importance of each data point. While these notions of generalization are not immediately related to the type of memorization we study empirically, this provides some intuitive support to the idea that memorized examples will show up with a large self-correlation $K({\mathbf{z}},{\mathbf{z}})$ .

D.5 The loss kernel over Development.

To visualize how functional geometry emerges during learning, we compute the loss-correlation kernel at fixed training checkpoints and embed the induced distances $d({\mathbf{z}},{\mathbf{z}}^{\prime})=1-R({\mathbf{z}},{\mathbf{z}}^{\prime})$ with the same UMAP hyperparameters across time (and with same-label edges removed; see Appendix B).

Figure 8 shows a coarse-to-fine trajectory. Bar a handful of curious outliers, at initialization the kernel is essentially structureless. We leave study of these outliers to future work. By step 710, a weak global anisotropy appears that roughly separates animate from inanimate classes. By step 1388, coherent clusters begin to form (e.g., dogs). At step 3290, multiple subgroups sharpen and separate, and by step 5298 the geometry stabilizes into well-defined, semantically coherent regions that mirror the WordNet hierarchy.

D.6 Quantifying Color Lift.

We describe our method for computing the average per-color lift as shown in Figure 6 and Figure 9. In order to compute the lift we must bucket images into discrete color groups. To do so, for each input image, we compute

\bm{\mu}_{i}=\left(\frac{1}{P}\sum_{p=1}^{P}R_{i,p},\;\frac{1}{P}\sum_{p=1}^{P}G_{i,p},\;\frac{1}{P}\sum_{p=1}^{P}B_{i,p}\right),

where $P$ is the number of pixels in image $i$ , and $R_{i,p}$ , $G_{i,p}$ , $B_{i,p}$ are the red, green, and blue values of pixel $p$ in image $i$ . (Equivalently, $\bm{\mu}_{i}=(\overline{R}_{i},\,\overline{G}_{i},\,\overline{B}_{i})$ , where each bar denotes the mean over all pixels in image $i$ .) 2) Cluster the set $\{\bm{\mu}_{i}\}$ into $k$ groups using farthest point sampling (FPS). FPS ensures that cluster centers are spread out over the uneven distribution of RGB means (e.g. many gray/brown tones).

D.7 Extra ImageNet Examples

We provide more examples of the top correlated inputs from the visualization experiment in Section 4 and Figure 4. These inputs were randomly selected in chunks of 10 from between the 600th and 700th inputs of the 2500 for which we computed the loss kernel. The full set top-correlated inputs for all 2500 inputs is available at https://github.com/singfluence-anon/sf_imagenet_corrs.

Acknowledgments

We would like to thank Simon Pepin Lehalleur and Daniel Murfet for their detailed feedback and insightful discussions, and Rumi Salazar for helpful feedback and input. We are also grateful to Wilson Wu and Philipp Alexander Kreer for valuable discussions. We thank Stan van Wingerden for his assistance with the compute infrastructure.

Zach Furman was supported by the Melbourne Research Scholarship and Rowden White Scholarship during the completion of this research.

Author Contributions

Maxwell Adam led the implementation, analysis, and visualization of all the experiments. Zach Furman led the theoretical development and definition of the loss kernel. Jesse Hoogland led the project and writing. All authors contributed to the writing.

Reproducibility Statement

To ensure our work is reproducible, we provide detailed descriptions of our methodology throughout the paper and its appendices. The core SGLD-based estimation procedure for the loss kernel is formally presented in Sections 2.3 and B. All experiments were conducted on a public dataset (ImageNet; Deng et al. 2009) and a standard model architecture (Inception-v1; Szegedy et al. 2014), or on a synthetic, fully described multitask arithmetic problem (Sections 3 and C). A complete summary of the SGLD hyperparameters used for each experiment is available in Table 1 in Appendix B, with further implementation details and sensitivity analyses for the ImageNet setting discussed in Section D.3. The setup for our main ImageNet analysis, including the quantitative evaluation against the WordNet hierarchy, is detailed in Appendix D.

LLM Usage Statement

We used Large Language Models (LLMs) to help produce this paper. We used them to edit our writing by fixing errors and improving phrasing. We also used them to brainstorm the paper’s structure and get feedback on our arguments. For our experiments, LLMs helped us write code and create figures. They also assisted us in strengthening the math and proofs. The authors checked all AI-generated suggestions and are fully responsible for the content of this paper.

References

Ancona et al. (2018) Marco Ancona, Enea Ceolini, Cengiz Öztireli, and Markus Gross. Towards better understanding of gradient-based attribution methods for Deep Neural Networks, March 2018. URL http://arxiv.org/abs/1711.06104. arXiv:1711.06104 [cs].
Baker et al. (2025) Garrett Baker, George Wang, Jesse Hoogland, and Daniel Murfet. Structural Inference: Studying Small Language Models with Susceptibilities, April 2025. URL http://arxiv.org/abs/2504.18274. arXiv:2504.18274 [cs].
Bengio et al. (2014) Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation Learning: A Review and New Perspectives, April 2014. URL http://arxiv.org/abs/1206.5538. arXiv:1206.5538 [cs].
Bricken et al. (2023) Trenton Bricken, Adly Templeton, Joshua Batson, Brian Chen, Adam Jermyn, Tom Conerly, Nick Turner, Cem Anil, Carson Denison, Amanda Askell, Robert Lasenby, Yifan Wu, Shauna Kravec, Nicholas Schiefer, Tim Maxwell, Nicholas Joseph, Zac Hatfield-Dodds, Alex Tamkin, Karina Nguyen, Brayden McLean, Josiah E Burke, Tristan Hume, Shan Carter, Tom Henighan, and Christopher Olah. Towards monosemanticity: Decomposing language models with dictionary learning. Transformer Circuits Thread, 2023.
Carroll et al. (2025) Liam Carroll, Jesse Hoogland, Matthew Farrugia-Roberts, and Daniel Murfet. Dynamics of Transient Structure in In-Context Linear Regression Transformers, January 2025. URL http://arxiv.org/abs/2501.17745. arXiv:2501.17745 [cs].
Carter et al. (2019) Shan Carter, Zan Armstrong, Ludwig Schubert, Ian Johnson, and Chris Olah. Activation atlas. Distill, 2019. doi: 10.23915/distill.00015.
Chen et al. (2020) Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A Simple Framework for Contrastive Learning of Visual Representations, July 2020. URL http://arxiv.org/abs/2002.05709. arXiv:2002.05709 [cs].
Chen et al. (2023) Zhongtian Chen, Edmund Lau, Jake Mendel, Susan Wei, and Daniel Murfet. Dynamical versus bayesian phase transitions in a toy model of superposition. arXiv preprint arXiv:2310.06301, 2023.
Cheng et al. (2025) Deric Cheng, Juhan Bae, Justin Bullock, and David Kristofferson. Training Data Attribution (TDA): Examining Its Adoption & Use Cases, January 2025. URL http://arxiv.org/abs/2501.12642. arXiv:2501.12642 [cs].
Cook (1977) R. Dennis Cook. Detection of influential observation in linear regression. Technometrics : a journal of statistics for the physical, chemical, and engineering sciences, February 1977. ISSN 0040-1706. URL https://www.tandfonline.com/doi/abs/10.1080/00401706.1977.10489493. tex.copyright: Copyright Taylor and Francis Group, LLC.
Cook & Weisberg (1982) R. Dennis Cook and Sanford Weisberg. Residuals and influence in regression. Monographs on statistics and applied probability. Chapman and Hall, New York, 1982. ISBN 0-412-24280-0. URL https://hdl.handle.net/11299/37076.
Deng et al. (2009) Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255, 2009. doi: 10.1109/CVPR.2009.5206848.
Elhage et al. (2022) Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, Roger Grosse, Sam McCandlish, Jared Kaplan, Dario Amodei, Martin Wattenberg, and Christopher Olah. Toy models of superposition. Transformer Circuits Thread, 2022.
Giordano et al. (2017) Ryan Giordano, Tamara Broderick, and Michael I. Jordan. Covariances, robustness, and variational Bayes. Journal of Machine Learning Research, 19:51:1–51:49, 2017. URL https://api.semanticscholar.org/CorpusID:53238793.
Gorton (2024) Liv Gorton. Interpretable Features and Circuits in InceptionV1’s Mixed5b: A preliminary exploration of sparse autoencoders in late vision, August 2024. URL https://livgorton.com/inceptionv1-mixed5b-sparse-autoencoders/. tex.howpublished: Blog.
Gönen & Alpaydin (2011) Mehmet Gönen and Ethem Alpaydin. Multiple Kernel Learning Algorithms. Journal of Machine Learning Research, 12(64):2211–2268, 2011. ISSN 1533-7928. URL http://jmlr.org/papers/v12/gonen11a.html.
Hofmann et al. (2006) Thomas Hofmann, Bernhard Schölkopf, and Alexander J Smola. A Review of Kernel Methods in Machine Learning. Technical report, Max Planck Institute for Biological Cybernetics, Max Planck Institute for Biological Cybernetics, December 2006.
Hoogland et al. (2024) Jesse Hoogland, George Wang, Matthew Farrugia-Roberts, Liam Carroll, Susan Wei, and Daniel Murfet. The developmental landscape of in-context learning. arXiv preprint arXiv:2402.02364, 2024.
Iba (2025) Yukito Iba. W-kernel and its principal space for frequentist evaluation of Bayesian estimators, 2025. URL https://arxiv.org/abs/2311.13017.
Khatib & Alkhatib (2024) Omar EL Khatib and Nabeel Alkhatib. A Comprehensive Overview of Kernels in Machine Learning: Mathematical Foundations and Applications. International Journal of Computer (IJC), 53(1):150–172, December 2024. ISSN 2307-4523. URL https://ijcjournal.org/InternationalJournalOfComputer/article/view/2334.
Koh & Liang (2020) Pang Wei Koh and Percy Liang. Understanding black-box predictions via influence functions, December 2020. URL http://arxiv.org/abs/1703.04730. arXiv:1703.04730 [stat] CitationKey: deep-influence-functions.
Kornblith et al. (2019) Simon Kornblith, Mohammad Norouzi, Honglak Lee, and Geoffrey Hinton. Similarity of Neural Network Representations Revisited, July 2019. URL http://arxiv.org/abs/1905.00414. arXiv:1905.00414 [cs].
Kovaleva et al. (2021) Olga Kovaleva, Saurabh Kulshreshtha, Anna Rogers, and Anna Rumshisky. BERT Busters: Outlier Dimensions that Disrupt Transformers. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pp. 3392–3405, Online, 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.findings-acl.300. URL https://aclanthology.org/2021.findings-acl.300.
Kreer et al. (2025) Philipp Alexander Kreer, Wilson Wu, Maxwell Adam, Zach Furman, and Jesse Hoogland. Bayesian influence functions for hessian-free data attribution, 2025.
Kulis (2013) Brian Kulis. Metric Learning : A Survey By. Foundations and Trends® in Machine Learning, 5(4):287–364, 2013. URL https://www.semanticscholar.org/paper/Metric-Learning-%3A-A-Survey-By-Kulis/4fffbf5406482305d9adcf8e24887e6f1773027a.
Lau et al. (2025) Edmund Lau, Zach Furman, George Wang, Daniel Murfet, and Susan Wei. The local learning coefficient: a singularity-aware complexity measure. In The 28th international conference on artificial intelligence and statistics, 2025. URL https://openreview.net/forum?id=1av51ZlsuL.
McInnes et al. (2018) Leland McInnes, John Healy, and James Melville. Umap: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426, 2018.
Mikolov et al. (2013) Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient Estimation of Word Representations in Vector Space, September 2013. URL http://arxiv.org/abs/1301.3781. arXiv:1301.3781 [cs].
Murfet & Troiani (2025) Daniel Murfet and Will Troiani. Programs as singularities. arXiv preprint arXiv:2504.08075, 2025.
Olah (2015) Christopher Olah. Visualizing representations: Deep learning and human beings, January 2015. URL https://colah.github.io/posts/2015-01-Visualizing-Representations/.
Pepin Lehalleur et al. (2025) Simon Pepin Lehalleur, Jesse Hoogland, Matthew Farrugia-Roberts, Susan Wei, Alexander Gietelink Oldenziel, George Wang, Liam Carroll, and Daniel Murfet. You are what you eat–AI alignment requires understanding how data shapes structure and generalisation. arXiv preprint arXiv:2502.05475, 2025.
Power et al. (2022) Alethea Power, Yuri Burda, Harri Edwards, Igor Babuschkin, and Vedant Misra. Grokking: Generalization beyond overfitting on small algorithmic datasets. arXiv preprint arXiv:2201.02177, 2022.
Princeton University (2010) Princeton University. About WordNet, 2010. URL https://wordnet.princeton.edu/.
Schölkopf et al. (1997) Bernhard Schölkopf, Alexander Smola, and Klaus-Robert Müller. Kernel principal component analysis. In Gerhard Goos, Juris Hartmanis, Jan Van Leeuwen, Wulfram Gerstner, Alain Germond, Martin Hasler, and Jean-Daniel Nicoud (eds.), International conference on artificial neural networks, volume 1327, pp. 583–588, Berlin, Heidelberg, 1997. Springer Berlin Heidelberg. ISBN 978-3-540-63631-1 978-3-540-69620-9. doi: 10.1007/BFb0020217. URL http://link.springer.com/10.1007/BFb0020217. Book Title: Artificial Neural Networks — ICANN’97 Series Title: Lecture Notes in Computer Science.
Sundararajan et al. (2017) Mukund Sundararajan, Ankur Taly, and Qiqi Yan. Axiomatic Attribution for Deep Networks, June 2017. URL http://arxiv.org/abs/1703.01365. arXiv:1703.01365 [cs].
Szegedy et al. (2014) Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going Deeper with Convolutions, September 2014. URL http://arxiv.org/abs/1409.4842. arXiv:1409.4842 [cs].
Templeton et al. (2024) Adly Templeton, Tom Conerly, Jonathan Marcus, Jack Lindsey, Trenton Bricken, Brian Chen, Adam Pearce, Craig Citro, Emmanuel Ameisen, Andy Jones, Hoagy Cunningham, Nicholas L Turner, Callum McDougall, Monte MacDiarmid, C. Daniel Freeman, Theodore R. Sumers, Edward Rees, Joshua Batson, Adam Jermyn, Shan Carter, Chris Olah, and Tom Henighan. Scaling monosemanticity: Extracting interpretable features from claude 3 sonnet. Transformer Circuits Thread, 2024. URL https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html.
Urdshals & Urdshals (2025) Einar Urdshals and Jasmina Urdshals. Structure development in list-sorting transformers. arXiv preprint arXiv:2501.18666, 2025.
Urdshals et al. (2025) Einar Urdshals, Edmund Lau, Jesse Hoogland, Stan van Wingerden, and Daniel Murfet. Compressibility measures complexity: Minimum description length meets singular learning theory, sep 2025.
Wang et al. (2025a) George Wang, Garrett Baker, Andrew Gordon, and Daniel Murfet. Embryology of a Language Model, August 2025a. URL http://arxiv.org/abs/2508.00331. arXiv:2508.00331 [cs].
Wang et al. (2025b) George Wang, Jesse Hoogland, Stan van Wingerden, Zach Furman, and Daniel Murfet. Differentiation and Specialization of Attention Heads via the Refined Local Learning Coefficient. In Proceedings of The 13th International Conference on Learning Representations, 2025b.
Wang et al. (2022) Kevin Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, and Jacob Steinhardt. Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small, November 2022. URL http://arxiv.org/abs/2211.00593. arXiv:2211.00593 [cs].
Watanabe (2009) Sumio Watanabe. Algebraic Geometry and Statistical Learning Theory. Cambridge Monographs on Applied and Computational Mathematics. Cambridge University Press, 2009.
Watanabe (2018) Sumio Watanabe. Mathematical theory of Bayesian statistics. Chapman and Hall, 2018.
Welling & Teh (2011) Max Welling and Yee Whye Teh. Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on International Conference on Machine Learning, ICML’11, pp. 681–688, Madison, WI, USA, 2011. Omnipress. ISBN 9781450306195.

Cite as

@article{adam2025the,
  title = {The Loss Kernel: A Geometric Probe for Deep Learning Interpretability},
  author = {Maxwell Adam and Zach Furman and Jesse Hoogland},
  year = {2025},
  abstract = {We introduce the loss kernel, an interpretability method for measuring similarity between data points according to a trained neural network. The kernel is the covariance matrix of per-sample losses computed under a distribution of low-loss-preserving parameter perturbations. We first validate our method on a synthetic multitask problem, showing it separates inputs by task as predicted by theory. We then apply this kernel to Inception-v1 to visualize the structure of ImageNet, and we show that the kernel's structure aligns with the WordNet semantic hierarchy. This establishes the loss kernel as a practical tool for interpretability and data attribution.},
  eprint = {2509.26537},
  archivePrefix = {arXiv},
  url = {https://arxiv.org/abs/2509.26537}
}

Click to copy

	$\displaystyle p({\bm{w}}_{A},{\bm{w}}_{B}\|\mathcal{D})$	$\displaystyle\propto\exp(-\beta[L_{A}({\bm{w}}_{A})+L_{B}({\bm{w}}_{B})])\cdot\mathcal{N}({\bm{w}}_{A}\|{\bm{w}}_{A}^{},\gamma^{-1}I_{A})\cdot\mathcal{N}({\bm{w}}_{B}\|{\bm{w}}_{B}^{},\gamma^{-1}I_{B})$
		$\displaystyle\propto\left[\exp(-\beta L_{A}({\bm{w}}_{A}))\mathcal{N}({\bm{w}}_{A}\|{\bm{w}}_{A}^{},\gamma^{-1}I_{A})\right]\cdot\left[\exp(-\beta L_{B}({\bm{w}}_{B}))\mathcal{N}({\bm{w}}_{B}\|{\bm{w}}_{B}^{},\gamma^{-1}I_{B})\right]$
		$\displaystyle\propto p_{A}({\bm{w}}_{A}\|\mathcal{D}_{A})\cdot p_{B}({\bm{w}}_{B}\|\mathcal{D}_{B})$

The Loss Kernel: A Geometric Probe for Deep Learning Interpretability

Authors

Publication Details

Access

Abstract

Automated Conversion Notice

1 Introduction

Contributions.

2 The Loss Kernel

2.1 Interpretability and Degeneracy

Singular learning theory.

2.2 Constructing a Practical Probe

From hard to soft constraints.

From global to local.

2.3 The Loss Kernel

The loss kernel.

Interpretation.

Practical estimation.

3 Validation on a Synthetic Task

Multitask arithmetic.

Reducing dimensionality.

Interpreting the kernel.

4 Application to ImageNet

Visualizing the loss kernel.

Hierarchical structure in ImageNet.

Hierarchical structure in the loss kernel.

The kernel as a developmental tool.

5 Related Works

Bayesian influence functions and training data attribution.

Data-similarity kernels and metric learning.

Representation-based interpretability.

Mechanistic and causal interventions.

Developmental interpretability.

6 Discussion & Conclusion

Future directions.

Appendix

Appendix A Theory Extra

A.1 Singular Learning Theory

A.1.1 Setup

A.1.2 Empirical Variance and the Singular Fluctuation

A.1.3 Connection to the Loss Kernel

A.1.4 Towards a Population Loss Kernel

A.2 Training Data Attribution

A.2.1 Classical Influence Functions

A.2.2 Bayesian Influence Functions

A.2.3 Connection to the Loss Kernel

A.3 From Sublevel Sets to Gibbs Distribution

Proposition 1.

Proof.

A.4 Decoupling of Disjoint Mechanisms

Proposition 2.

Proof.

Appendix B Stochastic-Gradient MCMC Estimator

Update rule.

Parallel chains and burn-in.

Estimators.

Batched evaluation.

Avoiding spurious correlations.

Hyperparameters Overview

Appendix C Synthetic Task Extra

Model architecture.

Tasks and dataset.

Training.

Loss kernel estimation.

Appendix D ImageNet Extra

D.1 Inception-v1

D.2 Quantifying Hierarchical Structure

Taxonomic lift construction.

D.3 Hyperparameter Dependence

Convergence of the estimator.

The effect of γ\gamma.

D.4 Detecting Memorization

D.5 The loss kernel over Development.

D.6 Quantifying Color Lift.

D.7 Extra ImageNet Examples

Acknowledgments

Author Contributions

Reproducibility Statement

LLM Usage Statement

References

The effect of $\gamma$ .