Loss Landscape Degeneracy and Stagewise Development of Transformers

Authors

Jesse Hoogland =
Timaeus
George Wang =
Timaeus
Matthew Farrugia-Roberts
Timaeus
Liam Carroll
Timaeus
Susan Wei
University of Melbourne
Daniel Murfet
University of Melbourne
See Contributions

Publication Details

Published:
February 4, 2024
Venue:
TMLR

Abstract

We show that in-context learning emerges in transformers in discrete developmental stages, when they are trained on either language modeling or linear regression tasks. We introduce two methods for detecting the milestones that separate these stages, by probing the geometry of the population loss in both parameter space and function space. We study the stages revealed by these new methods using a range of behavioral and structural metrics to establish their validity.

1 Introduction

A striking phenomenon in modern deep learning is the sudden shift in a model’s internal computational structure and associated changes in input/output behavior (e.g., Wei et al., 2022; Olsson et al., 2022; McGrath et al., 2022). As large models become more deeply integrated into real-world applications, understanding this phenomenon is a priority for the science of deep learning.

A key feature of the loss landscape of neural networks is degeneracy—parameters for which some local perturbations do not affect the loss. Motivated by the perspectives of singular learning theory (SLT; Watanabe, 2009) and nonlinear dynamics (Waddington, 1957; Thom, 1972), where degeneracy plays a fundamental role in governing development, we believe that studying degeneracy in the local geometry of the loss landscape is key to understanding the development of structure and behavior in modern deep learning.

Stage                 | LM1   | LM2   | LM3   | LM4   | LM5
End $t$               | 900   | 6.5k  | 8.5k  | 17k   | 50k
$\Delta\hat{\ell}$    | -2.33 | -1.22 | -0.18 | -0.40 | -0.34
$\Delta\hat{\lambda}$ | +26.4 | +22.5 | -1.57 | +8.62 | +1.77

(a) Two-layer attention-only language transformer (LM).

Stage                 | LR1   | LR2   | LR3   | LR4   | LR5
End $t$               | 1k    | 40k   | 126k  | 320k  | 500k
$\Delta\hat{\ell}$    | -0.32 | -2.21 | -0.07 | -0.05 | -0.029
$\Delta\hat{\lambda}$ | +21.4 | +149  | -12.3 | -44.1 | +3.56

(b) In-context linear regression transformer (LR).
Figure 1: Tracking loss landscape degeneracy reveals developmental stages. We train transformer models on both (a) natural language data and (b) synthetic in-context linear regression data. In addition to test loss (top row), we track loss landscape degeneracy as quantified by the local learning coefficient (LLC) (middle row; Section 4). Critical points in the LLC curve mark boundaries between distinct developmental stages (bottom row; warm hues for increasing LLC, cold for decreasing LLC; Section 5). We show in Sections 6 and 7 that most of these stages coincide with the formation of significant internal structures or changes in input/output behavior. The language model first learns to predict using bigram statistics (LM1), then common $n$-grams (LM2), before forming the induction circuit studied by Olsson et al. (2022) (LM3 & LM4). The regression model first learns the optimal context-independent solution (LR1), then acquires robust in-context learning (LR2), then specializes to the pre-training distribution (LR3 & LR4). These stage divisions and interpretations are specific to the above training runs, but we show in Section B.4 that similar divisions arise with different training seeds.

In this paper, we contribute an empirical investigation of the link between degeneracy and development for transformers in two learning settings. We track loss landscape degeneracy along with model structure and behavior throughout training, using the following methodology.

  1. Transformer training (Section 3): We train two transformers, a language model (LM) with around 3M parameters trained on a subset of the Pile (Gao et al., 2020; Xie et al., 2023), and an in-context linear regression model (LR) with around 50k parameters trained on synthetic regression data following Garg et al. (2022).

  2. Degeneracy tracking (Section 4): We quantify loss landscape degeneracy throughout training by estimating the local learning coefficient (LLC; Lau et al., 2025), a measure of degeneracy derived from SLT.

  3. Degeneracy-based stage division (Section 5): Motivated by the singular learning process in Bayesian inference (Watanabe, 2009, §7.6; Chen et al., 2023), we use critical points in the LLC curve to divide training into approximate developmental stages.

  4. Developmental analysis (Sections 6, 7): We track shifts in each model’s internal computational structure and input/output behavior across training, quantified using various setting-specific metrics.

Crucially, we discover that most of the developmental stages identified by changes in loss landscape degeneracy coincide with significant, interpretable shifts in the internal computational structure and input/output behavior of the transformers, showing that the stage division is meaningful. Our investigations are motivated by the hypothesis of a fundamental link between degeneracy and development in deep learning. This hypothesis is theoretically grounded in SLT but has so far been empirically validated only in toy models (Chen et al., 2023). We view the above discoveries as preliminary evidence for this hypothesis in larger models, and an indication of the potential of degeneracy as a lens for understanding modern neural network development. Section 8 discusses this and other implications of our investigation.

Degeneracy and development in singular learning theory

Our hypothesis that degeneracy and development are fundamentally linked is motivated by singular learning theory (SLT; Watanabe, 2009), a framework for studying singular statistical models, a class that includes neural networks (Hagiwara et al., 1993; Watanabe, 2007; Wei et al., 2023). SLT proves that in singular models, Bayesian inference follows the singular learning process, in which degeneracy in the likelihood governs stagewise development in the posterior as the number of samples increases (Watanabe, 2009, §7.6; Lau et al., 2025; Chen et al., 2023). While there are many differences between Bayesian inference and modern neural network training, an analogy to the singular learning process informs our methodology for stage division.

Degeneracy and development in nonlinear dynamics

Further motivation for our hypothesis comes from viewing neural network training as a stochastic dynamical system, in which the population loss is a governing potential encoding the data distribution. It is well understood in nonlinear dynamics that degeneracy in the local geometry of a potential can give rise to stagewise development of system structure (Waddington, 1957; Thom, 1972; cf. Franceschelli, 2010). This connection has been observed in biological systems at significant scale and in the presence of stochasticity (Freedman et al., 2023). We emphasize changes in degeneracy over a stage, whereas in bifurcation theory the focus is more on the degeneracy at stage boundaries (Rand et al., 2021; MacArthur, 2022; Sáez et al., 2022).

Stagewise development in deep learning

The idea that neural network development occurs in stages goes back decades (Raijmakers et al., 1996) and has received renewed attention in modern deep learning (e.g., Wei et al., 2022; Olsson et al., 2022; McGrath et al., 2022; Odonnat et al., 2024; Chen et al., 2024; Edelman et al., 2024). In the case of deep linear networks, we understand theoretically that models learn progressively higher-rank approximations of their data distribution (see, e.g., Baldi & Hornik, 1989; Rogers & McClelland, 2004; Saxe et al., 2019) throughout training. Our findings suggest that studying degeneracy could help generalize this understanding to modern architectures that exhibit more complex internal computational structure, such as transformers.

Studying loss landscape geometry

Given the central role played by the loss landscape in deep learning, it is unsurprising that there have been many attempts to study its geometry.

One approach is to visualize low-dimensional slices of the loss landscape (Erhan et al., 2010; Goodfellow et al., 2014; Lipton, 2016; Li et al., 2018; Tikeng Notsawo et al., 2024). Unfortunately, a random slice is, with high probability, a quadratic form associated with nonzero eigenvalues of the Hessian, and is thus biased against geometric features that we know are important, such as degeneracy (Wei et al., 2023). Moreover, Antognini & Sohl-Dickstein (2018) have emphasized the difficulty of probing the loss landscape of neural networks with dimensionality reduction tools.

Other standard methods of quantifying the geometry of the loss landscape, such as via the Hessian, are insensitive to important aspects of degeneracy. For example, the Hessian trace or maximum eigenvalue quantifies the curvature at a critical point but ignores degenerate dimensions, and the Hessian rank counts the number of degenerate dimensions but fails to distinguish between dimensions by the order of their degeneracy (e.g., quartic vs. zero). In contrast, the LLC is a principled quantitative measure of loss landscape degeneracy. Section B.5 includes experiments showing that Hessian statistics do not reveal the clear stage boundaries revealed by the LLC in our in-context linear regression setting.

3 Training transformers in two settings

We study transformers trained in two learning settings, namely language modeling and in-context linear regression. These settings have been the subject of recent work on the emergence of in-context learning (ICL), a compelling example of a sudden shift in a model’s internal computational structure in modern deep learning (Olsson et al., 2022).

In this section, we describe both settings and introduce their loss functions and data distributions. Common to both settings is a transformer model denoted $f_w$ with parameters $w$, which takes as input a sequence of tokens, also called a context. We describe specific architecture details and training hyperparameters in Sections F.1 and F.2.

Language modeling

Elhage et al. (2021) and Olsson et al. (2022) observed that two-layer attention-only transformers (transformers without MLP layers) form interesting internal computational structures supporting ICL, including induction heads. In order to compare with their behavioral and structural analysis we adopt the same architecture. In Appendix E we also study one-layer attention-only transformers. We note that, while we don't study language models with MLP layers (following prior work), we do use MLP layers for in-context linear regression.

We consider the standard task of next-token prediction for token sequences taken from a subset of the Pile (Gao et al., 2020; Xie et al., 2023). We denote the input context by $S_K = (t_1, \ldots, t_K)$, where $K$ is the context length, and by $S_{\le k}$ the prefix context $(t_1, \ldots, t_k)$ of $S_K$. Our data is a collection of length-$K$ contexts, $\{S_K^i\}_{i=1}^n$. Thus $S_{\le k}^i$ denotes a prefix of the $i$th context, $S_K^i$.

Given the context $S_{\le k}^i$, the transformer model $f_w$ outputs a vector of logits $f_w(S_{\le k}^i)$ such that $\mathrm{softmax}(f_w(S_{\le k}^i))$ is a probability distribution over all tokens (we denote by $\mathrm{softmax}(f_w(S_{\le k}^i))[t]$ the probability of token $t$). The per-token empirical loss for language modeling is then the average cross-entropy between this distribution and the true next token at each index $k \in \{1, \ldots, K-1\}$,

$\ell_{n,k}(w) = \frac{1}{n} \sum_{i=1}^{n} -\log\left(\mathrm{softmax}(f_w(S_{\le k}^i))[t_{k+1}^i]\right).$ (1)

The empirical loss is then $\ell_n(w) = \frac{1}{K-1} \sum_{k=1}^{K-1} \ell_{n,k}(w)$, with the test loss $\hat{\ell}(w)$ defined analogously on a held-out set of examples. The corresponding population loss $\ell(w)$ is defined by taking the expectation with respect to the true distribution of contexts (see also Section A.6).
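
As a concrete illustration, the following sketch computes the per-token losses of equation (1) and their average $\ell_n$ from precomputed model logits. The array shapes and function name are our own illustrative choices, not the paper's code.

    import numpy as np

    def lm_empirical_loss(logits, tokens):
        """Per-token cross-entropy losses of equation (1), averaged over positions.

        logits: array of shape (n, K, V); logits[i, k] are the model's
                next-token logits after reading the prefix S^i_{<=k+1}.
        tokens: integer array of shape (n, K) holding token ids t^i_1..t^i_K.
        """
        n, K, _ = logits.shape
        # numerically stable log-softmax over the vocabulary
        z = logits - logits.max(axis=-1, keepdims=True)
        log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
        per_position = []
        for k in range(K - 1):
            # ell_{n,k}: mean negative log-probability of the true next token
            nll = -log_probs[np.arange(n), k, tokens[:, k + 1]]
            per_position.append(nll.mean())
        return float(np.mean(per_position))  # ell_n: average over the K-1 positions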

In-context linear regression

Following Garg et al. (2022), a number of recent works have explored ICL in the setting of learning simple function classes, such as linear functions. This setting is of interest because optimal (in-context) linear regression is well understood theoretically, and because simple transformers achieve good ICL performance on it in practice (see, e.g., Garg et al., 2022; Raventós et al., 2023).

We consider a standard synthetic in-context linear regression problem. A task is a vector $\mathbf{t} \in \mathbb{R}^D$, and an example is a pair $(x, y) \in \mathbb{R}^D \times \mathbb{R}$. We sample a context by sampling one task $\mathbf{t} \sim \mathcal{N}(0, I_D)$ and then sampling $K$ i.i.d. inputs $x_1, \ldots, x_K \sim \mathcal{N}(0, I_D)$ and outputs $y_1, \ldots, y_K$ with $y_k \sim \mathcal{N}(\mathbf{t}^\top x_k, \sigma^2)$. This results in the context $S_K = (x_1, y_1, \ldots, x_{K-1}, y_{K-1}, x_K)$ with label $y_K$. We denote by $S_{\le k}$ the prefix context $(x_1, y_1, \ldots, x_k)$ of context $S_K$; its label is $y_k$. Section F.2.2 describes how we encode the $x_i$ and $y_i$ as tokens. Our data is a set of contexts $\{(\mathbf{t}_i, S_K^i, y_K^i)\}_{i=1}^n$ sampled i.i.d. as described above.
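
A minimal sketch of this sampling procedure is given below. The dimension D, context length K, and function name are illustrative choices rather than the paper's settings; the noise variance sigma^2 = 0.125 matches the value quoted in Section 7.

    import numpy as np

    def sample_regression_context(D=4, K=16, noise_var=0.125, rng=None):
        """Sample one task and one context for in-context linear regression.

        Draws a task t ~ N(0, I_D), inputs x_1..x_K ~ N(0, I_D), and noisy
        outputs y_k ~ N(t^T x_k, sigma^2). The context shown to the model is
        S_K = (x_1, y_1, ..., x_{K-1}, y_{K-1}, x_K); y_K is the final label.
        """
        rng = rng or np.random.default_rng()
        t = rng.normal(size=D)                                  # task vector
        xs = rng.normal(size=(K, D))                            # inputs
        ys = xs @ t + np.sqrt(noise_var) * rng.normal(size=K)   # noisy outputs
        return t, xs, ys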

Running a context $S_{\le k}^i$ through the transformer yields a prediction $\hat{y}_k^i = f_w(S_{\le k}^i)$, leading to the per-token empirical loss for in-context linear regression for $k \in \{1, \ldots, K\}$,

$\ell_{n,k}(w) = \frac{1}{n} \sum_{i=1}^{n} \left(\hat{y}_k^i - y_k^i\right)^2.$ (2)

The associated empirical loss is $\ell_n(w) = \frac{1}{K} \sum_{k=1}^{K} \ell_{n,k}(w)$. The corresponding test loss $\hat{\ell}(w)$ and population loss $\ell(w)$ are defined analogously as in the language modeling setting.

Figure 2: The local learning coefficient (LLC) measures loss landscape degeneracy. The LLC can be defined in terms of the rate at which the parameter space volume (within a given neighborhood and with a given maximum loss) shrinks as the loss threshold is reduced to that of the local minimum. We show four population loss landscapes for a two-dimensional parameter space with decreasing LLC (increasing degeneracy). In these examples, the local multiplicity is 1. See Section A.2 for a detailed description of each example, as well as several additional examples.

4 Quantifying degeneracy with the local learning coefficient

We track the evolution of degeneracy in the local geometry of the loss landscape throughout training by estimating the local learning coefficient (LLC; Watanabe, 2009; Lau et al., 2025) at model checkpoints. In this section, we review the LLC and the estimation procedure of Lau et al. (2025).

The local learning coefficient (LLC)

Given a local minimum $w^*$ of a population loss $\ell$ (a negative log likelihood), the LLC of $w^*$, denoted $\lambda(w^*)$, is a positive rational number that measures the amount of degeneracy in $\ell$ near $w^*$ (Watanabe, 2009; Lau et al., 2025), i.e., how many ways $w$ can be varied near $w^*$ such that $\ell(w)$ remains equal to $\ell(w^*)$. Formally, the LLC is defined as the volume-scaling rate near $w^*$. This is illustrated in Figure 2, further described in Section A.1, and treated in full detail in Lau et al. (2025). Informally, the LLC is a measure of minimum "flatness." It improves over conventional (second-order) Hessian-based measures of flatness because the LLC is sensitive to more significant, higher-order contributions to volume-scaling.

Estimating the LLC

Lau et al. (2025) introduced an estimator for the LLC based on stochastic-gradient Langevin dynamics (SGLD; Welling & Teh, 2011), which we use in our experiments. Let $w^*$ be a local minimum of the population loss $\ell$. The LLC estimate $\hat{\lambda}(w^*)$ is

$\hat{\lambda}(w^*) = n\beta\left[\mathbb{E}_{w|w^*,\gamma}^{\beta}[\ell_n(w)] - \ell_n(w^*)\right],$ (3)

where $\mathbb{E}_{w|w^*,\gamma}^{\beta}$ denotes the expectation with respect to the localized Gibbs posterior

$p(w; w^*, \beta, \gamma) \propto \exp\left\{-n\beta\,\ell_n(w) - \frac{\gamma}{2}\|w - w^*\|_2^2\right\}$

with inverse temperature $\beta$ (controlling the contribution of the empirical loss landscape) and localization strength $\gamma$ (controlling proximity to $w^*$). The basic idea behind this estimator is the following: the more degenerate the loss landscape, the easier it is for a sampler exploring the Gibbs posterior to find points of low loss, and, in turn, the lower $\hat{\lambda}(w^*)$. Section A.3 discusses technical SGLD details, Section A.4 documents the hyperparameters used in our experiments, and Section A.5 outlines our hyperparameter tuning procedure.
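
To make the estimator concrete, here is a minimal sketch of SGLD-based LLC estimation as in equation (3). It assumes a flattened parameter vector and oracle functions for the empirical loss and a stochastic gradient; the default hyperparameters, including the choice beta = 1/log n, are illustrative rather than the settings documented in Section A.4.

    import numpy as np

    def estimate_llc(w_star, grad_loss, loss, n, beta=None, gamma=100.0,
                     eps=1e-4, steps=2000, rng=None):
        """SGLD estimate of the local learning coefficient, equation (3).

        w_star: flat parameter vector at the checkpoint being measured.
        grad_loss(w): stochastic (minibatch) estimate of grad ell_n(w).
        loss(w): estimate of ell_n(w).
        """
        rng = rng or np.random.default_rng()
        beta = beta if beta is not None else 1.0 / np.log(n)
        w = w_star.copy()
        sampled_losses = []
        for _ in range(steps):
            # gradient of the localized, tempered potential
            # U(w) = n * beta * ell_n(w) + (gamma / 2) * ||w - w_star||^2
            grad_U = n * beta * grad_loss(w) + gamma * (w - w_star)
            # SGLD step: half-step of gradient descent plus Gaussian noise
            w = w - 0.5 * eps * grad_U + np.sqrt(eps) * rng.normal(size=w.shape)
            sampled_losses.append(loss(w))
        # the chain average approximates the localized Gibbs expectation in (3)
        return n * beta * (float(np.mean(sampled_losses)) - loss(w_star))

A practical implementation would typically also discard burn-in draws and average over several chains; Section A.3 covers such technical details.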

Assumptions of LLC estimation

Strictly speaking, the LLC is defined only for loss functions arising as a negative log likelihood, whereas our loss function includes terms from overlapping context prefixes. It is possible to define a negative log likelihood-based loss for transformer training; we show empirically in Section A.6 that this does not have a significant effect on LLC estimates, and so we proceed with overlapping contexts for efficiency.

Moreover, the LLC is defined only for local minima of such loss functions, whereas we note that equation (3) is defined for arbitrary $w^*$ and we apply the estimator throughout training. This approach has precedent in prior work on LLC estimation: Lau et al. (2025) showed that when applied to trained parameters, the estimator accurately recovers the learning coefficient associated with a nearby minimum, and Chen et al. (2023) found that the estimator produces reliable results for parameters throughout training. In our case, we obtain stable estimates throughout training given sufficiently strong localization $\gamma$. See Section A.7 for more details.

Figure 3: In the singular learning process, the Bayesian posterior can shift between neighborhoods with different degeneracy. Watanabe's free energy formula (4) highlights a tradeoff between loss $\ell_n$ (the linear term coefficient) and degeneracy $\lambda$ (the LLC, the logarithmic term coefficient). Consider two local minima $w_1^*, w_2^*$ with neighborhoods $W_1^*, W_2^*$. As the number of samples $n$ increases, if $w_2^*$ has lower loss and higher LLC than $w_1^*$, $W_2^*$ will suddenly achieve lower free energy than $W_1^*$ at some critical sample size $n_{\text{crit}}$, causing the Bayesian posterior to shift from concentrating in $W_1^*$ to $W_2^*$.

5 Degeneracy-based stage division

We use critical points (that is, plateaus, where the first derivative vanishes) in the LLC curve to define stage boundaries that divide training into developmental stages. This approach is motivated by the singular learning process in Bayesian inference, which we review below.

Bayesian local free energy

Let $W^*$ be a neighborhood of a local minimum $w^*$ of the population loss $\ell$ (a negative log likelihood). Given $n$ samples, we can define the local free energy of the neighborhood (Lau et al., 2025),

$F_n(W^*) = -\log \int_{W^*} \exp\left(-n \ell_n(w)\right) \varphi(w)\, dw,$

where $\varphi(w)$ is a prior positive on the neighborhood $W^*$. The lower the local free energy of a neighborhood $W^*$, the higher the Bayesian posterior mass of $W^*$. In fact, by a log-sum-exp approximation, the Bayesian posterior is approximately concentrated on the neighborhood with the lowest local free energy (cf. Chen et al., 2023).

The singular learning process

Watanabe's free energy formula gives, under certain technical conditions, an asymptotic expansion in $n$ of the local free energy (Watanabe, 2018, Theorem 11; Lau et al., 2025):

$F_n(W^*) = n \ell_n(w^*) + \lambda(w^*) \log n + O_p(\log\log n)$ (4)

Here, $\ell_n(w^*)$ is the empirical loss, $\lambda(w^*)$ is the LLC, and the lower-order terms include a constant contribution from the prior mass of $W^*$.

The first two terms in equation (4) create a tradeoff between accuracy ($\ell_n$) and degeneracy ($\lambda$). Moreover, as $n$ increases, the linear term becomes increasingly important relative to the logarithmic term, changing the nature of the tradeoff. At certain critical $n$ the neighborhood with the lowest local free energy may rapidly change to a neighborhood with decreased loss and increased LLC, as illustrated in Figure 3.
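
A small numerical example, keeping only the two leading terms of equation (4) and using made-up loss and LLC values, shows how the preferred neighborhood flips at a critical sample size:

    import numpy as np

    # Hypothetical neighborhoods: W1 has higher loss but lower LLC, W2 the reverse.
    loss1, llc1 = 0.50, 5.0
    loss2, llc2 = 0.45, 20.0

    def free_energy(n, loss, llc):
        # leading terms of Watanabe's free energy formula, equation (4)
        return n * loss + llc * np.log(n)

    ns = np.arange(10, 5000)
    w2_preferred = free_energy(ns, loss1, llc1) > free_energy(ns, loss2, llc2)
    n_crit = ns[np.argmax(w2_preferred)]
    print(n_crit)  # beyond this n the posterior prefers W2 (lower loss, higher LLC)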

A sequence of such posterior transitions between increasingly degenerate neighborhoods is a prime example of the singular learning process (Watanabe, 2009, §7.6). We note that this is not the only possible dynamic—lower-order terms may also play a role in the evolving competition.

LLC plateaus separate developmental stages

While the general connection between the singular learning process in Bayesian inference and stagewise development in deep learning remains to be understood, Chen et al. (2023) showed that, in small autoencoders, both Bayesian inference and stochastic gradient descent undergo rapid transitions between encoding schemes, and these transitions are reflected as sudden changes in the estimated LLC.

This perspective suggests that changes in the loss landscape degeneracy, as measured by the LLC, reflect qualitative changes in the model. In larger models, we expect that these qualitative changes may be more gradual, while still being delineated by brief moments in which the posterior is stably concentrated around a given local minimum. This motivates our approach of identifying plateaus in the estimated LLC curve—brief pauses before and after a given increase or decrease in degeneracy—as stage boundaries which divide training into approximate developmental stages. The resolution of these stage boundaries depends on the density of checkpoints used for LLC estimation and the precision of those estimates.

Results

In our experiments, we identify plateaus in the estimated LLC curve by first lightly smoothing the LLC curve with a Gaussian process to facilitate stable numerical differentiation with respect to log training time. We identify plateaus as approximate zeros of this derivative, namely local minima of the absolute derivative that fall below a small threshold (see Section B.1). Figures 1, B.2 and B.3 show the results. Section B.4 shows that similar stage divisions arise for independent training runs.
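
The sketch below illustrates this plateau-finding step. It substitutes a simple Gaussian kernel smoother for the Gaussian process used in the paper, and the smoothing width and derivative threshold are illustrative values rather than those of Section B.1.

    import numpy as np
    from scipy.ndimage import gaussian_filter1d
    from scipy.signal import argrelmin

    def find_llc_plateaus(steps, llc_estimates, sigma=2.0, threshold=0.05):
        """Candidate stage boundaries from an LLC-versus-training-time curve.

        Returns the training steps at which |d(LLC)/d(log t)| has a local
        minimum falling below the threshold, i.e. approximate plateaus.
        """
        log_t = np.log(np.asarray(steps, dtype=float))
        smoothed = gaussian_filter1d(np.asarray(llc_estimates, dtype=float), sigma)
        deriv = np.gradient(smoothed, log_t)      # derivative w.r.t. log training time
        abs_deriv = np.abs(deriv)
        minima = argrelmin(abs_deriv)[0]          # local minima of |derivative|
        return [steps[i] for i in minima if abs_deriv[i] < threshold]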

6 Results for language modeling

Plateaus in LLC estimates (Figure 1(a)) reveal five developmental stages for our language model. In order to validate that this stage division is meaningful, we search for concomitant changes in the model's input/output behavior and its internal computational structure. In this section, we report a range of setting-specific metrics that reveal the following significant, interpretable changes coinciding with each stage: in LM1 the model learns to predict according to bigram statistics; in LM2 the model learns to predict frequent $n$-grams and use the positional embedding; in LM3 and LM4 the model respectively forms "previous-token heads" and "induction heads" as part of the same induction circuit studied by Olsson et al. (2022). Note that we did not discover significant changes in LM5, and we do not claim that these are the only interesting developmental changes occurring throughout training. There may be other interesting developmental changes that are not captured by our metrics, or that are not significant enough to show up in the LLC curve.

6.1 Stage LM1 (0–900 steps)

Learning bigram statistics

Figure 4(a) shows that the bigram score (the average cross entropy between model logits and empirical bigram frequencies; see Section C.1.1) is minimized around the boundary between LM1 and LM2, with a value only 0.3 nats above the irreducible entropy of the empirical bigram distribution. This suggests that during LM1 the model learns to predict using bigram statistics (the optimal next-token prediction given only the current token).
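
As a rough illustration of this metric, the sketch below computes the average cross-entropy between the empirical bigram distribution and the model's next-token distribution; the evaluation set and averaging details of Section C.1.1 may differ from this simplified version.

    import numpy as np

    def bigram_score(model_log_probs, bigram_probs, tokens):
        """Cross-entropy of the model against empirical bigram statistics.

        model_log_probs: (n, K, V) log-probabilities from the model.
        bigram_probs: (V, V) empirical bigram distribution; bigram_probs[t]
                      is the empirical distribution of the token following t.
        tokens: (n, K) token ids of the evaluation contexts.
        """
        _, K, _ = model_log_probs.shape
        total = 0.0
        for k in range(K - 1):
            # cross-entropy H(bigram(. | t_k), model(. | prefix up to t_k))
            p = bigram_probs[tokens[:, k]]                       # (n, V)
            total += -(p * model_log_probs[:, k, :]).sum(axis=-1).mean()
        return total / (K - 1)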

6.2 Stage LM2 (900–6.5k steps)

Using positional information

During LM2 the positional embedding becomes structurally important. Figure 4(b) shows that here the test loss for the model with the positional embedding zero-ablated diverges from the test loss of the unablated model (see Section C.2.1). By zero-ablation we mean setting the learned positional embeddings to zero during evaluation; conditional on our architecture, this establishes whether the model effectively uses positional information, and a similar method could be used in a model without learned positional embeddings. There is also an uptick in previous-token attention among some first-layer attention heads, shown in green in Figure 4(d).

Learning common nnitalic_n-grams

We define an $n$-gram score as the ratio of final-position token loss on (1) a baseline set of samples from a validation set truncated to $n$ tokens, and (2) a fixed set of common $n$-grams (see Section C.1.2). Concretely, after extracting the top 1000 common $n$-grams, we compute the loss on contexts like

            [<bos_token>, <token_1>, <token_2>, ..., <token_n>]
    

using the loss on <token_n>. To normalize, we take the average loss on the $n$-th token of similar-length contexts drawn from the pretraining distribution and divide this baseline loss by the $n$-gram loss, which gives the $n$-gram score.
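
A minimal sketch of this computation follows; the <bos> token id, the loss_on_final_token helper, and the container types are hypothetical stand-ins for the actual evaluation code.

    import numpy as np

    def ngram_score(loss_on_final_token, common_ngrams, baseline_contexts, bos_id=0):
        """n-gram score: baseline final-token loss divided by n-gram final-token loss.

        loss_on_final_token(context): model loss on the last token of a context.
        common_ngrams: the extracted common n-grams, as tuples of token ids.
        baseline_contexts: length-n prefixes drawn from the pretraining
            distribution (truncated to n tokens).
        Higher scores mean the model predicts common n-grams better than
        typical contexts of the same length.
        """
        ngram_loss = np.mean([loss_on_final_token([bos_id, *ng]) for ng in common_ngrams])
        baseline_loss = np.mean([loss_on_final_token([bos_id, *ctx]) for ctx in baseline_contexts])
        return baseline_loss / ngram_loss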

Figure 4(c) shows a large improvement in $n$-gram score for $n = 3, 4$ during LM2. This suggests that during LM2 the model memorizes and learns to predict common $n$-grams for $n > 2$ (note this requires using the positional encoding and may also involve previous-token heads).

Figure 4: Language model stages coincide with significant structural and behavioral changes. (a) The model learns bigram statistics in LM1, (b) then the positional embedding becomes useful from LM2, (c) enabling the learning of common $n$-grams. Induction circuit formation begins with (d) previous-token heads in LM3, followed by (e) induction heads in LM4, leading to (f) a drop in ICL score indicating the acquisition of in-context learning. Note: in (d, e), $l{:}h$ denotes attention head $h$ in layer $l$; dark lines indicate heads comprising the induction circuit.
Foundations of induction circuit

In this stage, the heads that eventually become previous-token and induction heads in future stages begin to compose (that is, read from and write to a shared residual stream subspace; see Figure C.4 and Section C.2.2). This suggests that the foundations for the induction circuit are laid in advance of any measurable change in model outputs or attention patterns.

6.3 Stages LM3 & LM4 (6.5k–8.5k & 8.5k–17k steps)

Formation of induction circuit as studied in Olsson et al. (2022)

Figure 4(d) shows the previous-token matching score (Section C.2.3) rises over LM3 and LM4 for the two first-layer heads that eventually participate in the induction circuit (as distinguished by their composition scores, Section C.2.2). Figure 4(e) shows that during LM4 there is an increase in the prefix-matching score (Section C.2.4) for the two second-layer induction heads that complete the induction circuit. Figure 4(f) shows a corresponding drop in the ICL score (Section C.1.3) as the model begins to perform in-context learning.
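
As one simple example of an attention-pattern metric, the sketch below scores each head by the average attention it assigns to the immediately preceding token; the paper's previous-token matching and prefix-matching scores (Sections C.2.3 and C.2.4) are computed on particular evaluation distributions and may differ in details.

    import numpy as np

    def previous_token_attention(attention):
        """Average attention from each position to the position directly before it.

        attention: array of shape (heads, K, K) of row-stochastic attention
                   patterns for one context (query position x key position).
        Returns one score per head; values near 1 indicate previous-token heads.
        """
        heads, K, _ = attention.shape
        # attention weight from position k back to position k-1, for k >= 1
        prev = attention[:, np.arange(1, K), np.arange(K - 1)]   # (heads, K-1)
        return prev.mean(axis=-1)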

The LLC decreases during LM3, suggesting an increase in degeneracy (a decrease in model complexity). This may be related to interaction between heads. It would be interesting to study this stage further via mechanistic interpretability.

7 Results for in-context linear regression

Figure 5: In-context linear regression model stages coincide with significant structural and behavioral changes. (a) During LR1, the model learns to make context-independent predictions, $x_k \mapsto \hat{y}_k = 0$. (b) During LR2, ICL performance improves, then during LR3 the model becomes worse at ICL on OOD inputs $x_k \sim \mathcal{N}(0, g I_D)$ for $g > 3$. (c) During LR3 and LR4, layer normalization weights "collapse," possibly contributing to the LLC decrease.

Plateaus in the LLC estimates (Figure 1(b)) reveal five developmental stages for our in-context linear regression model. We validate that this stage division is meaningful by identifying significant, concomitant changes in the model's structure and behavior: in LR1 the model learns to predict without looking at the context; in LR2 the model acquires a robust in-context learning ability; and in LR3 and LR4 the model becomes more fragile to out-of-distribution inputs. We did not discover significant changes in LR5, nor do we claim this is an exhaustive list of developments.

7.1 Stage LR1 (0–1k steps)

Learning to predict without context

Figure 5(a) shows that the mean square prediction over all tokens, $\mathbb{E}[\hat{y}_k^2]$, decreases during LR1, reaching a minimum of $0.1$ (smaller than the target noise $\sigma^2 = 0.125$) slightly after the end of LR1. Similar to how the language model learned bigram statistics in LM1, this suggests the model first learns the optimal context-independent prediction $\hat{y}_k = \bar{\mathbf{t}}^\top x_k$, where $\bar{\mathbf{t}}$ is the mean of the task distribution (zero in this case).

7.2 Stage LR2 (1k–40k steps)

Acquiring in-context learning

Figure 5(b) shows that during LR2 there is a drop in ICL score (Section D.1.2), indicating that the model acquires in-context learning.

Embedding and attention collapse

Section D.2 documents additional changes. Near the end of LR2, token and positional embeddings begin to "collapse," effectively losing singular values and aligning with the same activation subspace (Sections D.2.1 and D.2.2). At the same time, several attention heads form concentrated, input-independent attention patterns (Section D.2.3).

7.3 Stages LR3 & LR4 (40k–126k & 126k–320k steps)

Reduced robustness to input magnitude

While performance continues to improve on typical sequences, Figure 5(b) shows that during LR3 and LR4, the model's in-context learning ability deteriorates for outlier sequences with higher-than-average $|x_k|$.
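
The sketch below shows one way to run this kind of robustness check: it measures the final-token squared error on contexts whose inputs are drawn with inflated variance $g$. The function signature, dimensions, and sample counts are illustrative assumptions rather than the paper's evaluation code.

    import numpy as np

    def ood_icl_loss(predict, D=4, K=16, gain=3.0, noise_var=0.125,
                     num_contexts=256, rng=None):
        """Mean squared error of the final-token prediction when x_k ~ N(0, g I_D).

        predict(xs, ys): the model's prediction for y_K given the context
            (x_1, y_1, ..., x_{K-1}, y_{K-1}, x_K), i.e. all K inputs and
            the first K-1 outputs.
        """
        rng = rng or np.random.default_rng()
        errors = []
        for _ in range(num_contexts):
            t = rng.normal(size=D)
            xs = np.sqrt(gain) * rng.normal(size=(K, D))   # out-of-distribution inputs
            ys = xs @ t + np.sqrt(noise_var) * rng.normal(size=K)
            y_hat = predict(xs, ys[:-1])
            errors.append((y_hat - ys[-1]) ** 2)
        return float(np.mean(errors))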

Layer-normalization collapse

Figure 5(c) shows the individual weights in the final layer normalization module. A large fraction of these weights go to zero in LR3 and LR4. This occurs in tandem with a similar collapse in the weights of the unembedding transforms (Section D.2.4). This results in the model learning to read its prediction $\hat{y}_k$ from a handful of privileged dimensions of the residual stream. Since this means that the network outputs become insensitive to changes in many of the parameters, we conjecture that this explains part of the striking decrease in estimated LLC over these stages (Section D.2.4).

This collapse is most pronounced and affects the largest proportion of weights in the unembedding module, but in LR4 it spreads to earlier layer normalization modules, particularly the layer normalization module before the first attention block (Section D.2.5).

8 Discussion

In this paper, we have examined the development of transformer models in two distinct learning settings. We quantified the changes in loss landscape degeneracy throughout transformer training by estimating the local learning coefficient (LLC). Motivated by the singular learning process in Bayesian inference, we divided these training runs into developmental stages at critical points of the LLC curve. We found that these developmental stages roughly coincided with significant changes in the internal computational structure and the input/output behavior of each model. In this section, we discuss several implications of these findings.

Towards a degeneracy-based understanding of deep learning

That significant structural and behavioral changes show up in the LLC curve is evidence that the development of our transformers is closely linked to loss landscape degeneracy. This finding underscores the potential of loss landscape degeneracy as a crucial lens through which to study the development of deep learning models.

While we studied two distinct learning settings (including language modeling with a nontrivial transformer architecture), it remains necessary to verify the connection between degeneracy and development across a more diverse range of emergent model structures and behaviors. Moreover, future work should investigate this connection in more depth, seeking to establish a causal connection between changes in degeneracy and changes in structure and behavior.

Towards developmental interpretability

We showed that degeneracy can reveal meaningful changes in transformers. We emphasize that our analysis is not exhaustive—we expect only certain “macroscopic” changes, such as the emergence of in-context learning, will have a significant enough effect on loss landscape degeneracy to appear separated by plateaus in the LLC curve. Recent work has extended these ideas by measuring the LLC with respect to network sub-modules and with different data distributions, providing a more refined picture of model development (Wang et al., 2025). We expect this research direction will lead to insights into the development of more complex models.

Loss landscape degeneracy offers a setting-agnostic, “unsupervised” alternative to setting-specific progress measures such as those derived by Barak et al. (2022) or developed using mechanistic insights from similar models by Nanda et al. (2023). Both approaches can reveal developments invisible in the loss, but loss landscape degeneracy is able to detect changes without requiring a mechanistic understanding in advance. Of course, once a change is detected through its effect on degeneracy, it remains to interpret the change.

Case studies in transformer development

We do not claim that the structural and behavioral developments we observed in each setting are universal phenomena. Transformers trained with different architectures, data distributions, algorithms, or hyperparameters may develop differently. Rather, our detailed analysis contributes two “case studies” to the growing empirical literature on the emergence of structure in transformers.

On this note, our observations extend those of Olsson et al. (2022) and Elhage et al. (2021). We show that before the induction circuit forms, our 2-layer language model learns simpler interpretable strategies (based on bigram statistics and common $n$-grams). This shows that a single training run follows a progression akin to that found by Olsson et al. (2022) for fully-developed models of increasing depth (they showed that "0-layer" models learn bigram statistics and 1-layer models learn "skip-trigrams"). A similar progression was observed by Edelman et al. (2024) in a Markovian sequence modeling task.

Moreover, in both settings, we saw that before in-context learning emerges, the model learns to predict tokens using the optimal prediction given only the current token (bigram statistics for language modeling, zero for in-context linear regression with this distribution of tasks).

Development and model complexity

While we have described the LLC as a measure of loss landscape degeneracy, it can also be understood as a measure of model complexity (cf. Section A.2). It is natural for changes in a model's internal structure to show up as a change in complexity. For example, Chen et al. (2024) showed that the emergence of syntactic attention structure coincides with a spike in two model complexity measures, namely the model's Fisher information and the intrinsic dimension (Facco et al., 2017) of the model's embeddings.

Notably, we observe stages in which the LLC decreases, corresponding to a simplification of the computational structure of the model. Such model simplification has empirical precedent, for instance with Chen et al. (2024) and the recent literature on grokking (Power et al., 2022; Nanda et al., 2023; Tikeng Notsawo et al., 2024). In our case, the mechanistic nature of the simplification is not fully clear, with the collapse of various weights and attention patterns arising as candidates in the in-context linear regression setting.

This phenomenon is currently not accounted for by theories of neural network development. In the theory of saddle-to-saddle dynamics, deep linear networks learn progressively more complex approximations of the data (Saxe et al., 2019). Likewise, the example transitions in the singular learning process outlined in Section 5 and Figure 3 describe LLC increases. While decreasing the LLC with the loss held constant would be another way to decrease the free energy according to equation (4), providing a full theoretical account of these stages is an open problem.

Appendix

Appendix A reviews the learning coefficient, providing some simple toy examples contrasting the learning coefficient with Hessian-based measures. This section also discusses SGLD-based LLC estimation including experiment hyperparameters (Section A.4), and offers a detailed example of the calibrations involved in applying LLC estimation to regression transformers to serve as a reference (Section A.5). Appendix B provides further detail on our procedure for LLC-based stage identification, including stages identified in additional training runs and a brief comparison with Hessian statistics. Appendices C and D examine the developmental stages of language models and in-context linear regression in more detail and explain the various metrics we use to track behavioral and structural development. Appendix E describes some additional experiments on a one-layer language model. Appendix F covers transformer training experimental details, such as model architectures, training procedures, and hyperparameters.

To facilitate reproduction of our analyses, we have made our codebase available. A repository containing additional figures and code can be accessed at the URL https://github.com/timaeus-research/icl.


Appendix A The local learning coefficient (LLC)

A.1 Formal Definition of the LLC

In the setting of Section 4, let $B$ be a closed ball around $w^*$ such that $w^*$ is a global minimum on $B$, by which we mean a point with (equal) lowest loss. If there are multiple such global minima, the volume asymptotics are determined by the geometry of one that is most degenerate in the precise sense of SLT, formalised in Lau et al. (2025), roughly corresponding to having the lowest LLC. We call this minimum the maximally degenerate global minimum on $B$. Consider the volume of the set of nearby low-loss parameters,

$V(\epsilon) = \int_B \mathbb{1}\{\ell(w) \le \ell(w^*) + \epsilon\}\, dw.$

As $\epsilon \to 0$, $V(\epsilon)$ is asymptotically equivalent to

$c\, \epsilon^{\lambda(w^*)} \log(1/\epsilon)^{m(w^*)-1},$

where $\lambda(w^*)$ is the LLC, $m(w^*)$ is another geometric quantity called the local multiplicity, and $c > 0$ is a constant.
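
For simple analytic potentials this volume scaling can be checked numerically. The sketch below estimates the exponent $\lambda$ by Monte Carlo integration of $V(\epsilon)$ over a cube around the minimum and a log-log fit; it ignores the $\log(1/\epsilon)$ correction, and the sample sizes and threshold grid are illustrative.

    import numpy as np

    def volume_scaling_exponent(loss, dim, radius=1.0, num_samples=200_000, rng=None):
        """Fit log V(eps) ~ lambda * log eps for a potential with minimum loss 0 at the origin."""
        rng = rng or np.random.default_rng()
        eps_grid = np.geomspace(3e-4, 3e-2, 8)
        # uniform samples from the cube [-radius, radius]^dim around the minimum
        w = rng.uniform(-radius, radius, size=(num_samples, dim))
        losses = np.apply_along_axis(loss, 1, w)
        volumes = [(losses <= eps).mean() for eps in eps_grid]   # V(eps) up to a constant
        slope, _ = np.polyfit(np.log(eps_grid), np.log(volumes), 1)
        return slope   # approximately the LLC lambda for these toy examples

    # e.g. for l(w) = w_1^2 + w_2^4 + w_3^4 the fitted exponent is close to 1:
    # print(volume_scaling_exponent(lambda w: w[0]**2 + w[1]**4 + w[2]**4, dim=3))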

A.2 Interpretations and examples of the LLC

In Section 4, we introduced the LLC as a quantification of geometric degeneracy. In this section, we discuss an additional perspective on the LLC as a count of the "effective" dimensionality of a parameter, and we give additional examples of the LLC. We refer the reader to Watanabe (2009) and Lau et al. (2025) for more discussion.

The LLC has some similarity to an effective parameter count. If the population loss $\ell$ looks like a quadratic form near $w^*$, then $\lambda(w^*) = \frac{d}{2}$ is half the number of parameters, which we can think of as $d$ contributions of $\frac{1}{2}$ from every independent quadratic direction. If there are only $d-1$ independent quadratic directions, and one coordinate $w_i$ such that small variations in $w_i$ near $w_i^*$ do not change the model relative to the truth (this dimension is "unused"), then $\lambda(w^*) = \frac{d-1}{2}$.

The situation becomes more intricate when certain dimensions are degenerate but not completely unused, varying to quartic or higher order near the parameter (rather than being quadratic or flat). While every unused coordinate reduces the LLC by $\frac{1}{2}$, changing the dependency on a coordinate from quadratic ($w_i^2$) to quartic ($w_i^4$), increasing its degeneracy while still "using" it, reduces its contribution to the LLC from $\frac{1}{2}$ to $\frac{1}{4}$.

As a source of intuition, we provide several examples of exact LLCs:

  • $\ell(w_1, w_2, w_3) = a w_1^2 + b w_2^2 + c w_3^2$ with $a, b, c > 0$. This function is nondegenerate, and $\lambda(0,0,0) = \frac{1}{2} + \frac{1}{2} + \frac{1}{2} = \frac{3}{2}$. This is independent of $a, b, c$. That is, the LLC $\lambda$ does not measure curvature. For this reason, it is better to avoid an intuition that centers on "basin broadness," since this tends to suggest that lowering $a, b, c$ should affect the LLC.

  • $\ell(w_1, w_2, w_3) = w_1^2 + w_2^2 + 0$ in $\mathbb{R}^3$ is degenerate, but its level sets are still submanifolds and $\lambda(0,0,0) = \frac{1}{2} + \frac{1}{2}$. In this case the variable $w_3$ is unused, and so does not contribute to the LLC.

  • (w1,w2,w3)=w12+w24+w34\ell(w_{1},w_{2},w_{3})=w_{1}^{2}+w_{2}^{4}+w_{3}^{4}roman_ℓ ( italic_w start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_w start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , italic_w start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT ) = italic_w start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT + italic_w start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 4 end_POSTSUPERSCRIPT + italic_w start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 4 end_POSTSUPERSCRIPT is degenerate and its level sets are, for our purposes, not submanifolds. The singular function germ (,0)(\ell,0)( roman_ℓ , 0 ) is an object of algebraic geometry, and the appropriate mathematical object is not a manifold or a variety but a scheme. The quartic terms contribute 14\tfrac{1}{4}divide start_ARG 1 end_ARG start_ARG 4 end_ARG to the LLC, so that λ(0,0,0)=12+14+14=1\lambda(0,0,0)=\tfrac{1}{2}+\tfrac{1}{4}+\tfrac{1}{4}=1italic_λ ( 0 , 0 , 0 ) = divide start_ARG 1 end_ARG start_ARG 2 end_ARG + divide start_ARG 1 end_ARG start_ARG 4 end_ARG + divide start_ARG 1 end_ARG start_ARG 4 end_ARG = 1. The higher the power of a variable, the greater the degeneracy and the lower the LLC.

Figure 2 offers several additional examples, from left to right:

  • A quadratic potential $\ell_1(w_1,w_2)=w_1^2+w_2^2$, for which the LLC is maximal in two dimensions, $\lambda_1(0,0)=d/2=1$.

  • A quartic potential $\ell_2(w_1,w_2)=w_1^4+w_2^4$, for which the LLC is $\lambda_2(0,0)=1/2$.

  • An even more degenerate potential $\ell_3(w_1,w_2)=w_1^2 w_2^4$, for which $\lambda_3(0,0)=1/4$. We note that Hessian-derived metrics cannot distinguish between this degeneracy and the preceding quartic degeneracy.

  • A qualitatively distinct potential $\ell_4(w_1,w_2)=(w_1-1)^2(w_1^2+w_2^2)^4$ from Lau et al. (2025) with the same LLC at the origin, $\lambda_4(0,0)=1/4$.

While nondegenerate functions can be locally written as quadratic forms by the Morse Lemma (and are thus qualitatively similar to the approximation obtained from their Hessians), there is no simple equivalent for degenerate functions, such as the population losses of deep neural networks.

A.3 Estimating LLCs with SGLD

We follow Lau et al. (2025) in using SGLD to estimate the expectation value of the loss in the estimator of the LLC. For a given choice of weights $w^*$ we sample $C$ independent chains with $T_{\mathrm{SGLD}}$ steps per chain. Each chain $c$ is a sequence of weights $\{w^{(c)}_\tau\}_{\tau=1}^{T_{\mathrm{SGLD}}}$. From these samples, we estimate the expectation $\mathbb{E}^\beta_{w|w^*,\gamma}[\mathcal{O}(w)]$ of an observable $\mathcal{O}$ by

$$\frac{1}{C\,T_{\mathrm{SGLD}}}\sum_{c=1}^{C}\sum_{\tau=1}^{T_{\mathrm{SGLD}}}\mathcal{O}(w_\tau^{(c)}), \tag{5}$$

with an optional burn-in period. Dropping the chain index $c$, each sample in a chain is generated according to:

$$w_{\tau+1} = w_\tau + \Delta w_\tau, \tag{6}$$
$$w_1 = w^*, \tag{7}$$

where the step $\Delta w_\tau$ comes from an SGLD update

$$\Delta w_\tau = \frac{\epsilon}{2}\left(\beta n \nabla \ell_m^{(\tau)}(w_\tau) + \frac{\gamma}{2}\left(w_\tau - w^*\right)\right) + \mathcal{N}(0,\epsilon). \tag{8}$$

In each step $\tau$ we sample a mini-batch of size $m$, and the associated empirical loss, denoted $\ell_m^{(\tau)}$, is used to compute the gradient in the SGLD update. We note that the LLC estimator defined in Equation 3 uses the expectation $\mathbb{E}^\beta[\ell_n(w)]$, which in the current notation means we should take $\mathcal{O}(w)$ to be $\ell_n(w)$. For computational efficiency we follow Lau et al. (2025) in recycling the mini-batch losses $\ell_m(w^{(c)}_\tau)$ computed during the SGLD process. That is, we take $\mathcal{O}=\ell_m^{(\tau)}$ rather than $\mathcal{O}=\ell_n$.
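As an illustration, the following sketch shows one way to implement this sampler and the resulting LLC estimate in PyTorch. It is a minimal reading of Equations (5)–(8), not our exact implementation: the helper `loss_fn(model, batch)` is an assumption, `batches` is assumed to be a list of mini-batches, and the update is applied in the descent direction of the localized, tempered objective.

```python
import itertools
import math
import torch

def estimate_llc(model, loss_fn, batches, n_beta, epsilon=1e-3, gamma=100.0,
                 num_chains=10, num_steps=200):
    """Minimal SGLD-based LLC estimate (sketch of Eqs. 5-8).

    loss_fn(model, batch): returns the mean mini-batch loss (assumed helper).
    n_beta: the product n * beta (the inverse temperature).
    batches: a list of mini-batches, cycled through as needed.
    """
    params = list(model.parameters())
    w_star = [p.detach().clone() for p in params]
    with torch.no_grad():
        init_loss = loss_fn(model, batches[0]).item()  # loss at w*, on one batch

    chain_means = []
    for _ in range(num_chains):
        with torch.no_grad():
            for p, p0 in zip(params, w_star):          # each chain restarts at w* (Eq. 7)
                p.copy_(p0)
        losses = []
        for batch in itertools.islice(itertools.cycle(batches), num_steps):
            loss = loss_fn(model, batch)
            grads = torch.autograd.grad(loss, params)
            with torch.no_grad():
                for p, g, p0 in zip(params, grads, w_star):
                    # drift of the localized, tempered potential (cf. Eq. 8),
                    # followed in the descent direction, plus Gaussian noise N(0, eps)
                    drift = n_beta * g + 0.5 * gamma * (p - p0)
                    p.add_(-0.5 * epsilon * drift
                           + math.sqrt(epsilon) * torch.randn_like(p))
            losses.append(loss.item())                 # recycle the mini-batch losses
        chain_means.append(sum(losses) / len(losses))

    expected_loss = sum(chain_means) / len(chain_means)   # Eq. (5)
    return n_beta * (expected_loss - init_loss)           # the LLC estimate
```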

Time and Space Complexity.

The computational cost per LLC estimate is proportional to that of a standard training step, denoted $S$. We expect a constant number $C\,T_{\mathrm{SGLD}}$ of samples (on the order of $10^2$–$10^4$) to yield robust estimates, independent of model size. Using logarithmically spaced checkpoints, the total computational complexity for generating an LLC curve over the entire training process scales as $O(S\,C\,T_{\mathrm{SGLD}}\log N)$, where $N$ is the total number of training steps. The space complexity incurs a modest overhead compared to standard SGD, linear in the number of parameters, since we store one additional copy of the weights to enable localization.

A.4 LLC estimation experiment details

A.4.1 LLC estimation details for language models

For language models, we use SGLD to sample 20 independent chains with 200 steps per chain and 1 sample per step. For the one-layer model, we used $\epsilon=0.003$ and $\gamma=300$, and for the two-layer model we used $\epsilon=0.001$ and $\gamma=100$. Estimating the LLC across all checkpoints took around 200 GPU hours for the two-layer model on a single A100 and around 125 GPU hours for the one-layer model. For additional runs of the two-layer model, we ran fewer chains, bringing the time down to about 2 TPU hours per training run.

We sampled a separate set of 1 million lines (lines 10m–11m) from the DSIR-filtered Pile, denoted $D_{\text{sgld}}$. The first 100,000 lines from this SGLD set (lines 10m–10.1m) were used as a validation set. The sampling of batches for SGLD mirrored the approach taken during the primary training phase. Each SGLD estimation pass was seeded identically, so that at different checkpoints the SGLD chains encounter the same selection of batches and injected Gaussian noise.

Table 1: Hyperparameters for estimating the LLC for language models.

Hyperparameter | Category | Description/Notes | 1-Layer | 2-Layer
$C$ | Sampler | Number of chains | 20 | 20
$T_{\mathrm{SGLD}}$ | Sampler | Number of SGLD steps per chain | 200 | 200
$\epsilon$ | SGLD | Step size | 0.003 | 0.001
$\gamma$ | SGLD | Localization strength | 300 | 100
$n\beta$ | SGLD | Inverse temperature | 21.7 | 21.7
$m$ | SGLD | Size of each SGLD batch | 100 | 100
$\mu$ | Data | Dataset size for gradient minibatches | 13m | 13m
A.4.2 LLC estimation details for in-context linear regression

For in-context linear regression models, we generate a fixed dataset of $2^{20}$ samples. Using SGLD, we sample 10 independent chains with 5,000 steps per chain, of which the first 1,000 are discarded as burn-in, after which we draw observations once per step, at an inverse temperature $n\beta = 66.7$, step size $\epsilon = 0.0003$, and localization strength $\gamma = 13.3$, over batches of size $m = 1024$. LLC estimation takes up to 72 CPU-hours per training run.

Table 2: LLC estimation hyperparameters. A summary of the hyperparameters involved in estimating the LLC and the default values we use.

Hyperparameter | Category | Description/Notes | Default Value
$C$ | Sampler | Number of chains | 10
$T_{\mathrm{SGLD}}$ | Sampler | Number of SGLD steps per chain | 5000
$\epsilon$ | SGLD | Step size | 0.0003
$\gamma$ | SGLD | Localization strength | 13.3
$n\beta$ | SGLD | Inverse temperature | 66.7
$m$ | SGLD | Size of each SGLD batch | 1024
$\mu$ | Data | Dataset size for gradient minibatches | $2^{20}$

A.5 A guide to SGLD-based LLC estimation

This section walks through some of the hyperparameter choices and sweeps involved in calibrating LLC estimates. We provide it as a reference for others seeking to adjust LLC estimation to novel settings.

Refer to caption
Figure A.1: Past some threshold, the choice of validation set size from which SGLD batches are sampled has little effect on learning coefficient estimates. Estimation hyperparameters are $C=8$, $T_{\mathrm{SGLD}}=2{,}000$, $m=1024$, $\epsilon=0.0003$, $\tilde{\gamma}=0.01$, $\tilde{\beta}=0.01$. Loss is evaluated over gradient minibatches at a representative selection of checkpoints. LLCs quickly converge to a constant value as the size increases.
Refer to caption
Figure A.2: The size of SGLD minibatches has a negligible effect on LLC estimates (at least among the batch sizes considered). Top: Loss is evaluated on the same minibatch as the SGLD gradients. Bottom: Loss is evaluated on a newly sampled minibatch of the same size, distinct from the SGLD gradient minibatch. Estimation hyperparameters are $C=8$, $T_{\mathrm{SGLD}}=2{,}000$, $\mu=2^{20}$.
A.5.1 Varying the temperature

In Lau et al. (2025), the inverse temperature $\beta$ is set to a fixed "optimal" value $\beta^* = 1/\log n$, where $n$ is the number of training samples. In practice, we find that it can be advantageous to sample at a higher temperature.

Since $\beta$ always appears in a product with $n$ (in Equation 8 for the SGLD step and in Equation 3 for the LLC), we can view the inverse temperature as a multiplier that adjusts the effective size of the dataset. In a Bayesian setting, $\beta = 2$ would mean updating twice on each of the samples in the dataset.

The problem with the default choice of $\beta^*$ is that as we increase $n$ we have to decrease the SGLD step size $\epsilon$ to prevent the update from becoming ill-conditioned, which eventually causes the gradient term to suppress the noise term. This, in turn, requires larger batches to suppress the gradient noise and longer chains to sufficiently explore the local posterior (Section A.5.3).

Instead of $n\beta = n/\log n$, we perform LLC estimation at $n\beta = m/\log m$, where $m$ is the SGLD batch size.
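For example, with the language-model SGLD batch size $m = 100$, this gives $n\beta = 100/\log 100 \approx 21.7$ (natural logarithm), matching the inverse temperature reported in Table 1.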

A.5.2 Seeding the random noise

To smooth out the $\hat{\lambda}_t$ curves, we reset the random seed before the LLC estimation run at each checkpoint. This means the sequence of injected Gaussian noise is the same for LLC estimation runs at different checkpoints. Additionally, if the batch size is held constant, the batch schedule will also be constant across different estimation runs. Figure A.3 shows that this does not affect the overall shape of the learning coefficient curves; it simply smooths them out.
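A minimal sketch of this seeding scheme; the `estimate_llc` helper stands in for a full estimation run (as in the sketch of Section A.3), and all names here are ours:

```python
import torch

def llc_curve(checkpoints, estimate_llc, seed=0):
    """Estimate the LLC at each checkpoint, re-seeding the sampler each time so
    that every run sees the same Gaussian noise (and, for a fixed batch size,
    the same batch schedule)."""
    estimates = []
    for model in checkpoints:
        torch.manual_seed(seed)       # identical noise sequence across checkpoints
        estimates.append(estimate_llc(model))
    return estimates
```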

Refer to caption
Figure A.3: Consistently seeding SGLD estimates at each checkpoint smooths out the resulting LLC-over-time curve. Except towards the end of training (this is plotted over a log time axis), the difference is barely noticeable. Variable seeds yield a noisier set of estimates.
A.5.3 Calibrating $\epsilon$, $\beta$, and $\gamma$

As a rule of thumb, $\epsilon$ should be large enough that the $\hat{\lambda}$ estimate converges within the $T_{\mathrm{SGLD}}$ steps of each chain, but not so large that you run into issues with numerical stability and divergent estimates. Subject to this constraint, $\gamma$ should be as small as possible to encourage exploration without enabling the chains to "escape" to nearby better optima, and $\beta$ should be as large as possible (but no greater than $1/\log n$).

To determine the optimal SGLD hyperparameters, we perform a grid sweep over a reparametrization of the SGLD step in terms of $\tilde{\beta}$, $\tilde{\gamma}$, $\varepsilon$:

$$\Delta w_t = \tilde{\beta}\,\nabla\ell_m^{(\tau)} + \tilde{\gamma}\,(w^* - w_t) + \mathcal{N}(0,\varepsilon),$$

where $\tilde{\beta} = \varepsilon\beta n/2$ and $\tilde{\gamma} = \varepsilon\gamma/4$.
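As a sketch, the swept values can be converted back to the sampler's native hyperparameters by inverting the definitions above; the grid of values below is illustrative, not the exact grid we swept:

```python
import itertools

def to_native(epsilon, beta_tilde, gamma_tilde):
    """Invert beta_tilde = eps*beta*n/2 and gamma_tilde = eps*gamma/4."""
    n_beta = 2.0 * beta_tilde / epsilon
    gamma = 4.0 * gamma_tilde / epsilon
    return {"epsilon": epsilon, "n_beta": n_beta, "gamma": gamma}

# illustrative sweep grid (not the values used in the paper)
grid = itertools.product([1e-4, 3e-4, 1e-3],      # epsilon
                         [0.003, 0.01, 0.03],      # beta_tilde
                         [0.003, 0.01, 0.03])      # gamma_tilde
configs = [to_native(*point) for point in grid]
```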

The results of this hyperparameter sweep are illustrated in Figure A.4 for final checkpoints. Separately (not pictured), we check the resulting hyperparameters on a subset of earlier checkpoints. This is necessary because, for example, a set of hyperparameters that is well-behaved at the end of training may lead to failures such as divergent estimates (Figure A.5) earlier in training, when the geometry is more complex and the chains are therefore less stable.

Refer to caption
Figure A.4: Results of a grid sweep over SGLD hyperparameters for model 0 at $t = 27\mathrm{k}$.
A.5.4 LLC traces

As a useful diagnostic when calibrating LLC estimates, we propose an online variant of learning coefficient estimation. When overlaid on individual-chain LLC traces, this helps reveal common failure modes such as divergent estimates, non-converged estimates, and escapes (Figure A.5). These traces display the running estimate of $\hat{\lambda}$ as a function of the number of steps taken in a chain (with the estimate averaged across independent chains).

Refer to caption
(a) Numerical instability
Refer to caption
(b) Non-convergence
Refer to caption
(c) Negative estimates
Figure A.5: Failure modes of SGLD estimation. (a) The gradient term is too large, leading to numerical instability and exploding $\hat{\lambda}$ estimates. (b) $\epsilon$ is too small, so $\hat{\lambda}$ does not converge within each chain. (c) The localization term is too small, which allows the chain to escape to a nearby better optimum, producing negative estimates.

Define $\hat{\lambda}_\tau(w_0)$, the LLC estimate at $w_0$ after $\tau$ steps of a single SGLD chain, as follows (Lau et al., 2025):

$$\hat{\lambda}_\tau(w_0) = n\beta\left(\frac{1}{\tau}\sum_{\tau'=1}^{\tau}\ell_n(w_{\tau'}) - \ell_n(w_0)\right).$$

Rearranging, we obtain

$$\hat{\lambda}_\tau(w_0) = n\beta\left(\frac{1}{\tau}\sum_{\tau'=1}^{\tau}\ell_n(w_{\tau'}) - \ell_n(w_0)\right) \tag{9}$$
$$= n\beta\left(\frac{\tau-1}{\tau}\left(\frac{1}{\tau-1}\sum_{\tau'=1}^{\tau-1}\ell_n(w_{\tau'}) - \ell_n(w_0) + \ell_n(w_0)\right) + \frac{1}{\tau}\ell_n(w_\tau) - \ell_n(w_0)\right) \tag{10}$$
$$= \frac{\tau-1}{\tau}\hat{\lambda}_{\tau-1}(w_0) + n\beta\left(\frac{1}{\tau}\ell_n(w_\tau) + \left(\frac{\tau-1}{\tau}-1\right)\ell_n(w_0)\right) \tag{11}$$
$$= \frac{1}{\tau}\left((\tau-1)\,\hat{\lambda}_{\tau-1}(w_0) + n\beta\bigl(\ell_n(w_\tau) - \ell_n(w_0)\bigr)\right), \tag{12}$$

where

$$\hat{\lambda}_0(w_0) = 0.$$

This can easily be extended to an online estimate over chains by averaging the update $n\beta\left(\ell_n(w_\tau) - \ell_n(w_0)\right)$ over multiple chains.
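A minimal sketch of this online update (Equation 12) in plain Python; the inputs are the recycled per-step losses of a single chain:

```python
def llc_trace(losses, init_loss, n_beta):
    """Running LLC estimate after each SGLD step, following Eq. (12).

    losses:    per-step (recycled mini-batch) losses ell(w_1), ell(w_2), ...
    init_loss: loss at the initial parameter w_0 = w*
    n_beta:    the product n * beta (inverse temperature)
    """
    trace, lam = [], 0.0                       # lambda_hat_0(w_0) = 0
    for tau, loss in enumerate(losses, start=1):
        lam = ((tau - 1) * lam + n_beta * (loss - init_loss)) / tau
        trace.append(lam)
    return trace

# Averaging the per-step traces over chains gives the multi-chain online estimate.
def mean_trace(traces):
    return [sum(step_vals) / len(step_vals) for step_vals in zip(*traces)]
```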

A.6 LLC estimates for a non-log-likelihood-based loss

In the main body, we apply the LLC to empirical loss functions that do not arise as the log likelihood of independent random variables, due to the repeated use of dependent sub-sequences. Here we explain that it is possible to define a proper negative log likelihood over independent observations for the in-context linear regression setting: similar observations can be made in the language modeling setting.

Let $\Pi(k)$ be a probability distribution over the context length $k$. Ideally, the transformer would be trained to make predictions $y_k$ given a context of length $k$, where $k$ is sampled from $\Pi$. With the given distribution over contexts, this leads to a negative log likelihood of the form

$$L(w) = \sum_k p_k L_{[k]}(w) \tag{13}$$

where $p_k$ is the probability of sampling $k$ from $\Pi$ and

$$L_{[k]}(w) = \int q(S_k, y_k \mid \mathbf{t}, k)\, q(\mathbf{t}) \left[f_w(S_k) - y_k\right]^2 dS_k\, dy_k\, d\mathbf{t} \tag{14}$$

using the notation of Section 3, so that $S_k = (x_1, y_1, \ldots, x_{k-1}, y_{k-1}, x_k)$ is a context of length $k$. It is straightforward to check that this negative log likelihood $L$ agrees with the population loss $\ell$ associated with the empirical loss defined in Section 3. However, the empirical quantities $L_n(w)$ and $\ell_n(w)$ defined for a set of samples of size $n$ are not the same.

Since we use the empirical loss $\ell_n$ in our calculation of the estimated LLC, whereas the foundational theory of SLT is written in terms of the empirical negative log likelihood $L_n$, it is natural to wonder how much of a difference this makes in practice. Figure A.6 depicts LLC traces (Section A.5) for a highlighted selection of checkpoints using either a likelihood-based estimate (with variable sequence length) or a loss-based estimate (with fixed sequence length). The relative ordering of complexities does not change, and even the values of the LLC estimates differ little, except at the final checkpoint, which has a higher value for the sub-sequence-based estimate.

Refer to caption
Figure A.6: Loss-based (left) and likelihood-based (right) LLC estimation yield identically ordered LLC estimates. With the exception of the final checkpoint's LLC estimate (which is larger for the loss-based estimate), the values are close to identical. These plots display LLC traces, which show the LLC estimate as a function of SGLD steps; this is a useful tool for calibrating LLC estimation (Section A.5).

A.7 LLC estimates away from local minima

Our methodology for detecting stages is to apply LLC estimation to compute $\hat{\lambda}(w^*)$ at neural network parameters $w^* = w_t$ across training. In the typical case these parameters will not be local minima of the population loss, violating the theoretical conditions under which the LLC is defined.

It is not surprising that the estimator appears to work if $w^*$ is approximately a local minimum. Lau et al. (2025) validated their estimator both at parameters constructed to be local minima of the population loss and at parameters found through training with stochastic gradient descent (possibly not local minima of the empirical loss, let alone the population loss). They showed that in both cases the estimator recovers the true learning coefficient associated with the global minimum of the population loss.

On the other hand, if $w^*$ is far from any local minimum, it is a priori quite surprising that the SGLD-based estimation procedure works at all, since in this situation one might expect the chains to explore directions in which the loss decreases. Nevertheless, Chen et al. (2023) found that, empirically, LLC estimation away from local minima appears to give sensible results in practice. In our case, with sufficient localization we see stable estimates throughout training.

Theoretically accounting for this phenomenon is an interesting open problem. Perhaps there is a notion of stably evolving equilibrium in the setting of neural network training, echoing some of the ideas of Waddington (1957), such that the LLC estimation procedure is effectively giving us the LLC of a different potential to the population loss—a potential for which the current parameter actually is at a critical point. We leave addressing this question to future work.

Appendix B LLC-based stage boundary identification

B.1 Procedure for stage boundary identification

To identify stage boundaries, we look for plateaus in the LLC: checkpoints at which the slope of $\hat{\lambda}(w_t)$ over $t$ vanishes. To mitigate noise in the LLC estimates, we first fit a Gaussian process with some smoothing to the LLC-over-time curve. Then we numerically calculate the slope of this Gaussian process with respect to $\log t$. The logarithm corrects for the fact that the learning coefficient, like the loss, changes less as training progresses. We identify stage boundaries by looking for checkpoints at which this estimated slope equals zero. The results of this procedure are depicted in Figure B.1 for language and Figure B.2 for in-context linear regression.

At a local minimum or maximum of the estimated LLC curve, identifying a plateau from this estimated slope is straightforward, since the derivative crosses zero. However, at a saddle point the slope may not exactly reach zero, so we have to specify a "tolerance" for the absolute value of the derivative, below which we treat the point as an effective plateau.

In this case, we additionally require that the plateau be at a local minimum of the absolute first derivative. Otherwise, we may identify several adjacent points as all constituting a stage boundary.

To summarize, identifying stage boundaries is sensitive to the following choices: the intervals between checkpoints, the amount of smoothing, whether to differentiate with respect to $t$ or $\log t$, and the choice of tolerance. However, once a given choice of these hyperparameters is fixed, stages can be identified automatically, without further human judgment.
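A minimal sketch of this procedure using scikit-learn's Gaussian process regressor; the kernel, evaluation grid, and tolerance below are illustrative choices of ours, not necessarily those used for the figures:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

def stage_boundaries(steps, llc_estimates, tol=0.05):
    """Return training steps at which the smoothed LLC curve plateaus.
    `steps` are checkpoint steps (assumed > 0); `llc_estimates` are the LLCs."""
    log_t = np.log(np.asarray(steps, dtype=float)).reshape(-1, 1)
    gp = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(), normalize_y=True)
    gp.fit(log_t, np.asarray(llc_estimates, dtype=float))

    grid = np.linspace(log_t.min(), log_t.max(), 1000).reshape(-1, 1)
    smoothed = gp.predict(grid)
    slope = np.gradient(smoothed, grid.ravel())           # d(LLC) / d(log t)

    boundaries = []
    for i in range(1, len(slope) - 1):
        crosses_zero = slope[i - 1] * slope[i + 1] < 0
        near_zero_dip = (abs(slope[i]) < tol
                         and abs(slope[i]) <= abs(slope[i - 1])
                         and abs(slope[i]) <= abs(slope[i + 1]))
        if crosses_zero or near_zero_dip:                  # plateau of the LLC curve
            boundaries.append(float(np.exp(grid[i, 0])))
    return boundaries
```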

B.2 Stage boundary identification details for language model

Figure B.1 displays the test loss and LLC curves from Figure 1(a), in addition to the weight norm over time and the associated slopes. Stage boundaries coincide with where the slope of the LLC crosses zero, that is, where there is a plateau in the LLC.

Refer to caption
Figure B.1: A more detailed version of Figure 1(a) for two-layer language models. Top: Loss, LLC, and weight norm, along with an overlaid Gaussian process fit to these curves (red dotted lines). Bottom: Associated slopes, both numerically estimated finite differences (transparent blue) and the slope of the Gaussian process fit (red dotted lines). Note that stage LM5 may be subdivided into further stages (Section B.1). However, the noise in LLC estimates late in training is high, so we do not draw any conclusions from this.

B.3 Stage boundary identification details for in-context linear regression

Figure B.2 displays the test loss and LLC curves from Figure 1(b), in addition to the weight norm over time and the numerically estimated slopes associated with these three metrics. As in the case of language models, we identify stage boundaries by looking for plateaus in the LLC. Unlike for the language models, here the boundaries LR1–LR2 and LR2–LR3 are clearly visible in the loss.

Refer to caption
Figure B.2: A more detailed version of Figure 1(b) for in-context linear regression. Top: Loss, LLC, and weight norm, along with an overlaid Gaussian process fit to these curves (red dotted lines). Bottom: Associated slopes, both numerically estimated finite differences (transparent blue) and the slope of the Gaussian process fit (red dotted lines). Top middle: Error bars displaying the standard deviation over the 10 SGLD chains are shown in the background. Note that large error bars across chains are to be expected; between different SGLD estimations, the variance is much lower. For example, averaged over training, the standard deviation over different seeds is only 4.2.

B.4 Stage identification for additional training runs

Refer to caption
(a) Two-layer attention-only language transformers.
Refer to caption
(b) In-context linear regression transformers.
Figure B.3: Figure 1(a) and Figure 1(b) for multiple seeds. In both settings, the LLC reveals a consistent set of stages across five seeds. Late-training behavior shows more variance across seeds (see Section B.4).

Figure B.3(a) shows loss and LLC curves for five seeds (differing in model initialization and batch schedule). For each seed, LLC estimation reveals stages LM1–LM4. In three of the five seeds, stage LM5 is subdivided into two additional stages.

Figure B.3(b) shows loss and LLC curves for five seeds (differing in model initialization and batch schedule). For each seed, LLC estimation reveals stages LR1–LR5. There is remarkably little variance across the different seeds.

B.5 Comparison to Hessian statistics

Figure B.4 shows a quantification of the curvature-based notion of flatness captured by the Hessian (in contrast to the degeneracy-based notion of flatness captured by the LLC) for our in-context linear regression transformer. To estimate the trace and maximum eigenvalues shown in this figure, we use the PyHessian library (Yao et al., 2020) over a batch of $m = 1024$ samples.

Crucially, we observe that these Hessian-derived metrics (henceforth, "curvature") and the LLC are not consistently correlated. During the first part of LR2, the LLC and the curvature increase together. Starting at around $t = 20\mathrm{k}$, while the LLC is still increasing, the curvature starts decreasing. In the first part of LR3, both metrics decrease in tandem, but from around $t = 120\mathrm{k}$ onward, the curvature turns around and starts increasing.

The Hessian fails to detect three of the four stage boundaries identified by our LLC-based methodology. Since these Hessian-based metrics are dominated by the largest eigenvalues—the directions of maximum curvature—they fail to capture the finer-grained degeneracy that dominates the LLC. Moreover, LLC estimation is more scalable than estimating the full Hessian: empirically, its cost appears to grow roughly linearly in parameter count, whereas the Hessian is quadratic.
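As a sketch of how such statistics can be computed on a single batch, assuming PyHessian exposes a `hessian(model, criterion, data=...)` interface with `eigenvalues()` and `trace()` methods (we treat the exact API as an assumption; consult the PyHessian documentation):

```python
import torch
from pyhessian import hessian  # PyHessian (Yao et al., 2020); interface assumed

def hessian_stats(model, criterion, batch):
    """Top Hessian eigenvalue and a Hutchinson-style trace estimate (sketch)."""
    inputs, targets = batch
    comp = hessian(model, criterion, data=(inputs, targets),
                   cuda=torch.cuda.is_available())
    top_eigenvalues, _ = comp.eigenvalues(top_n=1)   # power iteration
    trace_estimates = comp.trace()                   # list of stochastic trace samples
    return top_eigenvalues[0], sum(trace_estimates) / len(trace_estimates)
```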

Refer to caption
Figure B.4: Hessian-based statistics reveal only one stage boundary in the development of our in-context linear regression transformer.

Appendix C Developmental analysis of language models

In this section, we present further evidence on the behavioral (Section C.1) and structural (Section C.2) development of the language model over the course of training.

C.1 Behavioral development

C.1.1 Bigram score

We empirically estimate the conditional bigram distribution by counting instances of bigrams over the training data. From this, we obtain the conditional distribution $\tilde{q}(t'|t)$, the likelihood that a token $t'$ follows $t$. The bigram score $B_k^S$ at index $k$ of an input context $S$ is the cross entropy between the model's predictions $p(t_{k+1}|t_k)$ at that position and the empirical bigram distribution,

$$B_k^S = -\sum_{i=1}^{d_{\mathrm{vocab}}} \tilde{q}(t_{k+1}^{(i)} \mid t_k)\, \log p(t_{k+1}^{(i)} \mid t_k), \tag{15}$$

where the $t_{k+1}^{(i)}$ range over the possible second tokens from the tokenizer vocabulary. From this we obtain the average bigram score

$$\bar{B} = \frac{1}{n}\sum_{i=1}^{n} B_{k_i}^{S_i}, \tag{16}$$

where we take fixed random choices of $k_i$ and $S_i$ for $1 \le i \le n = 5{,}000$; this average is displayed over training in Figure 4(a). It is compared against the best-achievable bigram score, which is the entropy of the bigram distribution itself, averaged over the validation set.
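A sketch of this computation (Equations 15–16), assuming a HuggingFace-style causal language model whose output has a `.logits` attribute of shape `[batch, seq, vocab]`, and a precomputed empirical bigram table `q_bigram`; both are assumptions on our part:

```python
import torch
import torch.nn.functional as F

def bigram_score(model, contexts, positions, q_bigram):
    """Average cross entropy between model next-token predictions and the
    empirical conditional bigram distribution.

    contexts:  LongTensor [n, seq_len] of token ids
    positions: LongTensor [n] of indices k_i at which to score
    q_bigram:  FloatTensor [vocab, vocab], rows q(. | t) summing to 1
    """
    with torch.no_grad():
        logits = model(contexts).logits                   # [n, seq, vocab]
    log_p = F.log_softmax(logits, dim=-1)                 # model's p(t_{k+1} | context)
    idx = torch.arange(contexts.size(0))
    log_p_k = log_p[idx, positions]                       # predictions at position k_i
    q_k = q_bigram[contexts[idx, positions]]              # q(. | t_k), shape [n, vocab]
    scores = -(q_k * log_p_k).sum(dim=-1)                 # Eq. (15), one score per sample
    return scores.mean().item()                           # Eq. (16)
```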

C.1.2 $n$-gram scores

In stage LM2 we consider $n$-grams, which are sequences of $n$ consecutive tokens, so 2-grams and bigrams are the same. Specifically, we consider common $n$-grams, which we define heuristically by comparing our 5,000-token-vocabulary tokenizer with the full GPT-2 tokenizer. We use the GPT-2 tokenizer as our heuristic because its vocabulary is constructed iteratively by merging the most frequent pairs of tokens.

We first tokenize the tokens in the full GPT-2 vocabulary with our tokenizer to obtain a list of 50,257 $n$-grams for various $n$. The first 5,000 such $n$-grams are all 1-grams, after which 2-grams begin appearing, then 3-grams, 4-grams, and so on (2-grams and 3-grams may still continue to appear later in the vocabulary). We then define the set of common $n$-grams, for a fixed $n \ge 2$, as the first 1,000 $n$-grams that appear in this list.

If we track the performance on $n$-grams and see it improve, we may ask whether this is simply a consequence of the model learning to use more context in general, rather than specifically improving on the set of $n$-grams being tracked. We measure performance against this baseline by defining an $n$-gram score. For a fixed $n$, we compute the average loss $\ell_{\mathrm{gram}}^n$ of the model on predicting the final tokens of our set of 1,000 $n$-grams, and also the average loss $\ell_{\mathrm{test}}^n$ of the model on a validation set at position $n$ of each validation sequence. The $n$-gram score is then defined to be $\ell_{\mathrm{test}}^n / \ell_{\mathrm{gram}}^n$.

C.1.3 In-context learning score

The in-context learning score is a behavioral measure of the relative performance of a model later in a sequence versus earlier in the sequence. We define $\mathrm{ICL}_{k_1:k_2}$ to be the loss on token $k_2$ minus the loss on token $k_1$, so a more negative score indicates better relative performance later in the sequence. A more negative ICL score does not, however, mean that a model achieves better overall loss on later tokens; it only measures the relative improvement. For the language model, we follow a construction similar to that of Olsson et al. (2022), taking $k_2$ to be the 500th token and $k_1$ to be the 50th token. This is then averaged over a 100k-row validation dataset. The performance of the language model over the course of training can be seen at the bottom of Figure 4(f).
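A sketch of this score, assuming per-token losses on the validation set have already been collected into a tensor of shape `[num_sequences, seq_len]` (token positions 1-indexed as in the text):

```python
import torch

def icl_score(token_losses, k1=50, k2=500):
    """ICL_{k1:k2}: mean loss at token k2 minus mean loss at token k1.
    More negative values indicate better relative performance late in context."""
    return (token_losses[:, k2 - 1] - token_losses[:, k1 - 1]).mean().item()
```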

C.1.4 Visualizing behavioral changes

In Figure C.1, we visualize changes in the model's input/output behavior by comparing model predictions before and after developmental stages and highlighting tokens with the greatest differences.

Refer to caption
Figure C.1: Samples are shown with tokens highlighted to indicate changes in logits during a given range. Red is improved performance (higher logit output for the true next token) and blue is worse. Sample (a): improvement in bigrams (LM1) such as "te/ll, ab/out, des/ire, mot/ion, eng/aged, strugg/le, etc." Sample (b): improvement in common $n$-grams (LM2) such as "L/in/ux, P/y/th/on, h/on/or/able, S/up/reme, dat/ab/ase, f/ram/ew/ork." Sample (c): development of in-context learning via induction circuits (LM3, LM4), visible in the improved predictions for the word "D/urs/ley" after the first time it appears in the context, as initially observed by Olsson et al. (2022).

C.2 Structural development

C.2.1 Positional embedding

In Figure C.2, we measure the effect of the positional embedding on model performance by comparing the model's performance at particular context positions on a validation set over the course of training against its performance on the same validation set with the positional embedding zero-ablated. The full context length is 1024, and we measure test loss at positions 1, 2, 3, 5, 10, 20, 30, 50, 100, 200, 300, 500, and 1000. In the transition from stage LM1 to LM2, the model begins using the learnable positional embedding to improve performance: the difference between test loss with and without the positional ablation is negligible at all measured positions until the LM1–LM2 boundary.

Refer to caption
Figure C.2: The model learns to start using the positional encoding in LM2, at which point performance begins to worsen when the positional encoding is ablated. In both plots, earlier token positions are colored more purple, later token positions are more yellow, and the overall mean loss is colored red. Both sets of per-token losses are shown in both graphs for ease of comparison. Left: original test loss is emphasized. Right: test loss with the positional embedding ablated is emphasized.
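A sketch of the zero-ablation described above; `pos_embedding` is the model's learnable positional embedding parameter (for example, something like `model.pos_embed.W_pos` in a TransformerLens-style implementation, which is an assumption on our part), and `loss_fn` is a hypothetical evaluation helper:

```python
import torch

def loss_with_positional_ablation(model, pos_embedding, batch, loss_fn):
    """Evaluate loss with the learnable positional embedding zeroed out."""
    original = pos_embedding.detach().clone()
    with torch.no_grad():
        pos_embedding.zero_()                 # zero-ablate the positional embedding
        loss = loss_fn(model, batch).item()
        pos_embedding.copy_(original)         # restore the original weights
    return loss
```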

Structurally, we might predict that the positional embeddings should organize themselves in a particular way: in order to understand relative positions, adjacent positions should be embedded close to each other, and far-away positions should be embedded far apart.

In Figure C.3, we examine the development of the positional embedding itself over time from two angles. The first is to take the embeddings of each position in the context and run PCA on those embeddings. As training progresses, the positional embedding PCAs gradually resolve into Lissajous curves, suggesting that the positional embeddings might resemble a random walk (Antognini & Sohl-Dickstein, 2018; Shinn, 2023). However, if we look at the explained variance, we see that it grows very large for PC1, reaching 94.2% at training step 6400. This is much higher than we would expect for Brownian motion, where we would expect about 61% explained variance in PC1 (Antognini & Sohl-Dickstein, 2018).

The second perspective is to look at how the magnitudes of the positional embeddings develop over the context length. Here we observe that the magnitudes have a fairly regular structure. In conjunction with the PCAs and explained variance, we might infer that the positional embeddings look approximately like a (possibly curved) line in $d_{\mathrm{model}} = 256$ dimensional space. A positional embedding organized in this way would make it easier for an attention head to attend to multiple recent tokens, which is necessary if a single head is to learn $n$-grams.

Refer to caption
Figure C.3: Columns progress through training time at training steps 0, 400, 800, 1600, 3200, and 6400. The first three rows are plots of the first three principal components of PCA on the positional embedding weights, while the fourth row shows the explained variance for each of the principal components. The fifth row plots the magnitude of the embedding of each position in the context length of 1024.
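A sketch of both diagnostics from Figure C.3, treating the positional embedding as a NumPy array `W_pos` of shape `[n_ctx, d_model]`:

```python
import numpy as np

def positional_embedding_diagnostics(W_pos, n_components=3):
    """PCA projections, explained variance, and per-position norms of W_pos."""
    X = W_pos - W_pos.mean(axis=0, keepdims=True)     # center before PCA
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    projections = X @ Vt[:n_components].T             # [n_ctx, n_components]
    explained_variance = (S ** 2) / np.sum(S ** 2)    # fraction per component
    magnitudes = np.linalg.norm(W_pos, axis=1)        # one norm per context position
    return projections, explained_variance[:n_components], magnitudes
```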
C.2.2 Composition scores

Let $W_Q^h$, $W_K^h$, $W_V^h$ be the query, key, and value weights of attention head $h$, respectively. Elhage et al. (2021) describe three types of composition between attention heads in transformer models:

  • Q-Composition: the query matrix $W_Q^h$ of an attention head reads in a subspace affected by a previous head.

  • K-Composition: the key matrix $W_K^h$ of an attention head reads in a subspace affected by a previous head.

  • V-Composition: the value matrix $W_V^h$ of an attention head reads in a subspace affected by a previous head.

If $W_O^h$ is the output matrix of an attention head, then $W_{QK}^h = W_Q^{h\,T} W_K^h$ and $W_{OV}^h = W_O^h W_V^h$. The composition scores are

$$\|M W_{OV}^{h_1}\|_F \,/\, \left(\|M\|_F\, \|W_{OV}^{h_1}\|_F\right), \tag{17}$$

where $M = W_{QK}^{h_2\,T}$, $M = W_{QK}^{h_2}$, and $M = W_{OV}^{h_2}$ for Q-, K-, and V-Composition, respectively. See Figure C.4 for K-composition scores over time between attention heads in the induction circuits.
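A sketch of the K-composition score of Equation (17), using plain PyTorch tensors for the head weight matrices; the shape convention noted in the comments is an assumption of ours:

```python
import torch

def k_composition_score(W_Q2, W_K2, W_O1, W_V1):
    """K-composition score (Eq. 17) between a layer-1 head (W_O1, W_V1)
    and a layer-2 head (W_Q2, W_K2).
    Assumed shapes: W_Q2, W_K2, W_V1: [d_head, d_model]; W_O1: [d_model, d_head]."""
    M = W_Q2.T @ W_K2                # W_QK of the layer-2 head; M = W_QK for K-composition
    W_OV1 = W_O1 @ W_V1              # W_OV of the layer-1 head
    num = torch.linalg.norm(M @ W_OV1)                     # Frobenius norm
    return (num / (torch.linalg.norm(M) * torch.linalg.norm(W_OV1))).item()
```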

Refer to caption
Figure C.4: The K-composition scores (Elhage et al., 2021) between first- and second-layer attention heads. The $h$th attention head in layer $l$ is indexed by $l{:}h$. The attention heads that eventually become previous-token heads are $h = 2, 5$ in layer 1 (subplot rows 2 and 3), and the attention heads that eventually become induction heads are $h = 7, 8$ in layer 2 (subplot columns 2 and 3). The attention heads 1:1 and 2:1 are included for comparison. The induction heads 2:7 and 2:8 begin K-composing with first-layer heads near the start of stage LM2. They continue to compose with the previous-token heads in stages LM3 and LM4 (highlighted in green), while their K-composition scores with the other layer-1 attention heads drop in later stages.
C.2.3 Previous-token matching score

The previous-token matching score is a structural measure of induction head attention. It is the attention score given to $[A]$ by an attention head at $[B]$ in the sequence $\ldots[A][B]$, i.e., how much the head attends to the immediately preceding token.

We compute this score using a synthetic data-generating process, generating 10k fixed random sequences with lengths between 16 and 64. The first token is a special "beginning of string" token, and the remaining tokens are sampled uniformly at random from the rest of the vocabulary.

For each sample in this synthetic dataset, we measure the attention score that an attention head gives to the previous token when at the last token in the sequence. These scores are averaged across the dataset to produce the previous-token matching score for that attention head at a given checkpoint. The progression of previous-token matching scores over time can be seen in Figure 4(d).
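A sketch of this measurement; the helper `get_attention_pattern`, which returns the post-softmax attention pattern of shape `[n_heads, seq_len, seq_len]` for a given layer, is hypothetical (in practice it would be implemented with hooks):

```python
def previous_token_matching_score(model, sequences, layer, head, get_attention_pattern):
    """Average attention from the final token to its immediate predecessor."""
    scores = []
    for seq in sequences:                                    # seq: LongTensor [seq_len]
        pattern = get_attention_pattern(model, seq, layer)   # [n_heads, seq_len, seq_len]
        last = seq.shape[0] - 1
        scores.append(pattern[head, last, last - 1].item())  # attention to previous token
    return sum(scores) / len(scores)
```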

C.2.4 Prefix matching score

The prefix matching score from Olsson et al. (2022) is defined similarly to the previous-token matching score. Given a sequence $[A][B]\ldots[A]$, the prefix matching score of a particular attention head is how much that head attends back to the first instance of $[A]$ when at the second instance of $[A]$.

We compute this score using a synthetic data-generating process. We first generate 10k fixed random sequences of length 128. The first token is always a special "beginning of string" token, and the $[A]$ and $[B]$ tokens are selected and placed randomly: one $[A]$ token is placed in the first half of the sequence, the other in the second half, and the $[B]$ token is placed directly after the first $[A]$ token. The remaining tokens are sampled at random from the tokenizer vocabulary, excluding the $[A]$, $[B]$, and beginning-of-string tokens.

For each sample in this synthetic dataset, we measure the attention score that each attention head assigns to the earlier instance of $[A]$ when at the later instance of $[A]$. These scores are averaged across the dataset to produce the prefix matching score for that attention head at a given checkpoint. The progression of prefix matching scores over time can be seen in Figure 4(e).

Appendix D Developmental analysis of regression transformers

In this section, we present further evidence on the behavioral (Section D.1) and structural (Section D.2) development of the transformer in the setting of in-context linear regression.

D.1 Behavioral development

D.1.1 Task prior score

In addition to training models on a data distribution in which tasks $\mathbf{t}$ are generated on the fly, we examine the setting of Raventós et al. (2023), in which a finite set of $M$ tasks is generated ahead of time and each training sample involves a task selected at random from this set.

Figure D.1 depicts (a) the mean square distance between the model's predictions and the zero prediction, and (b) the mean square distance between the model's predictions and the "task prior" prediction, which uses the component-wise average $\bar{\mathbf{t}}$ over the set of tasks encountered during training. For all models, the minimum distance to the task-prior prediction is lower than the minimum distance to the zero prediction. Hence, we call stage LR1 "learning the task prior" rather than simply learning the zero prediction.

Refer to caption
Figure D.1: Learning the task prior is universal across models trained on very different data distributions. Each line represents a model trained on a data distribution with a different number $M$ of distinct tasks ("task diversity" in Raventós et al., 2023). In addition to using a finite $M$, the models depicted here differ from the other models considered in this paper in that they were trained with a maximum learning rate of 0.01 and (inadvertently) lack an output matrix after the multi-head attention layer.
D.1.2 ICL

We consider two variants of the ICL score: $\operatorname{ICL}_{1:D}$ and $\operatorname{ICL}_{D:K}$.

If the noise term $\sigma^2$ equals zero and both tasks $\mathbf{t}$ and inputs $x_k$ are normalized (i.e., $\mathbf{t}\in S^{D-1}$), then $D-1$ observations of input/output pairs are enough to precisely identify $\mathbf{t}$. Therefore, $\operatorname{ICL}_{1:D}$ measures how successful the model is at initially locating the task. The fact that the tasks and inputs are not normalized changes this only slightly: the task will still sit near $S^{D-1}$, within a shell of vanishing thickness as $D\to\infty$.

Once the task is localized, $\operatorname{ICL}_{D:K}$ measures how successfully the model refines its internal estimate of $\mathbf{t}$ with additional examples, which it can use to reduce the error due to noise.

In terms of implementation, it is not necessary for the model to internally make a distinction between locating and refining its estimate of the task; ridge regression, for example, makes no such distinction. Still, we find the distinction useful for reasoning about the progression of the model. In particular, we note that early in stage LR2, while the model begins to develop ICL for early tokens, it becomes worse at ICL over tokens late in the context. Later, at around 23k steps, $\operatorname{ICL}_{D:K}$ stabilizes, while $\operatorname{ICL}_{1:D}$ continues improving over the entire training run.
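One way to make the distinction operational is to compare per-position losses at the endpoints of each range. The sketch below is our own simplified stand-in for the ICL scores defined in the main text, not the exact definition.

import numpy as np

def icl_score(preds, targets, a, b):
    # preds, targets: (batch, K) per-position predictions and regression targets.
    # a, b:           1-indexed context positions, a < b.
    # Returns the change in mean squared error from position a to position b;
    # negative values indicate the model improves as it sees more examples.
    mse = ((preds - targets) ** 2).mean(axis=0)   # (K,) per-position MSE
    return mse[b - 1] - mse[a - 1]

# e.g. ICL_{1:D} ~ icl_score(preds, ys, 1, D) and ICL_{D:K} ~ icl_score(preds, ys, D, K)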

Figure D.2: ICL scores for the in-context linear regression model. Right: ICL scores between inputs 1 and 4 and inputs 4 and 8 over time. We see that ICL emerges during the first half of LR2. Left: Highlighted ICL score curves from the end of LR1 to halfway through LR2. Note that when the model first starts improving on early tokens, it temporarily becomes worse at predicting later tokens. Note also that the model ceases to become better at later tokens as of the second half of LR2, whereas ICL on early tokens continues to improve throughout training.
D.1.3 OOD generalization

To further investigate behavior in stages LR2 and LR3, we probe the model on data sampled from different distributions than those encountered during training (cf. Raventós et al. (2023), who evaluate models trained on a set of discrete tasks on the “true” distribution consisting of novel tasks). We evaluate behavior on two families of perturbations: “OOD inputs” $x_k$, sampled according to a different scale

$x_k \sim \mathcal{N}(0, g I_D),$ (18)

for some gain parameter $g$, and “OOD tasks”

$\mathbf{t} \sim \mathcal{N}(0, g I_D).$ (19)

Note that these inputs and tasks are not out-of-distribution in the sense of coming from a distribution with a different support than the training distribution. However, samples drawn from these “extreme” distributions are exponentially suppressed under the original training distribution. Figure D.3 plots the normalized MSE for these two distributions over training time.
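A sketch of the evaluation behind Figure D.3, under the assumption that the trained model is wrapped in a callable model_fn(xs, ys) returning per-position predictions; the gain $g$ scales the covariance, the loss is normalized by $g^2$ as in the figure, and the defaults D=4, K=8, sigma2=0.125 mirror the training configuration in Table 4.

import numpy as np

def ood_normalized_mse(model_fn, g, perturb="inputs", D=4, K=8,
                       n_batch=1024, sigma2=0.125, seed=0):
    rng = np.random.default_rng(seed)
    x_std = np.sqrt(g) if perturb == "inputs" else 1.0   # x_k ~ N(0, g I_D)
    t_std = np.sqrt(g) if perturb == "tasks" else 1.0    # t   ~ N(0, g I_D)
    xs = rng.normal(0.0, x_std, size=(n_batch, K, D))
    ts = rng.normal(0.0, t_std, size=(n_batch, D))
    ys = np.einsum("bkd,bd->bk", xs, ts)
    ys += rng.normal(0.0, np.sqrt(sigma2), size=ys.shape)
    preds = model_fn(xs, ys)                             # (n_batch, K) predictions
    return np.mean((preds - ys) ** 2) / g**2             # normalized as in Figure D.3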

Between $t=1\mathrm{k}$ and $t=4\mathrm{k}$, the model's outputs rapidly diminish in scale for out-of-distribution samples, both for $g>1$ and $g<1$, especially for out-of-distribution inputs. While the model is moving away from predicting with the task prior for in-distribution samples, it moves closer to predicting with the task prior for out-of-distribution samples.

Between $t=4\mathrm{k}$ and $t=23\mathrm{k}$, the model recovers on moderately out-of-distribution inputs ($g<10^{1.5}$), with performance remaining close to constant beyond this range. Past this stage, performance improves steadily for out-of-distribution tasks.

For out-of-distribution inputs, performance eventually worsens for some ranges of $g$. Between $t=23\mathrm{k}$ and $t=80\mathrm{k}$, the model further approaches the task prior prediction for extreme out-of-distribution inputs ($g>10^{1.5}$). Subsequently, between $t=75\mathrm{k}$ and $t=130\mathrm{k}$, the model moves away from the task prior prediction for extreme inputs, and performance deteriorates for inputs with $g>10^{0.5}$. As of LR5, performance is roughly constant.

Figure D.3: Performance on extreme inputs over time may reveal additional substages in LR2 and in LR3. Left: The model first becomes better, then worse, at ICL on inputs sampled from $\mathcal{N}(0, g I_D)$ for large $g$. Right: The model continues to improve on ICL for tasks sampled from $\mathcal{N}(0, g I_D)$. Top: Normalized loss (divided by $g^2$) over time for OOD inputs and tasks. Bottom: Average $|\hat{y}|$ over time for OOD inputs and tasks.

D.2 Structural development

D.2.1 Embedding

The embedding matrix $W_E$ is a linear transformation $\mathbb{R}^{D+1}\to\mathbb{R}^{d_{\text{embed}}}$. Plotting the $D+1$ singular values of this matrix, we notice that the embedding partially loses one of its components starting at the end of LR2 (Figure D.4a).

The input “tokens” $x_k$ span a $D$-dimensional subspace of the $(D+1)$-dimensional “token space.” The target tokens $y_k$ span an orthogonal 1-dimensional subspace. The collapse of one of the embedding matrix's singular values means that the model learns to redundantly encode the inputs and targets in the same $D$-dimensional subspace of the space of residual stream activations. The almost order-of-magnitude separation in the magnitudes of the squared singular values means that the $(D+1)$th component of the token embedding explains only 2.9% of the variance in the residual stream activations immediately after the embedding, whereas the dominant components explain roughly 24% each.
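A simple way to inspect this at each checkpoint is to look at the squared singular values of $W_E$ directly; with roughly isotropic token inputs, their normalized values approximate the variance fractions quoted above. This is a sketch under that assumption, not the exact analysis code.

import numpy as np

def embedding_sv_fractions(W_E):
    # W_E: (D + 1, d_embed) embedding matrix.
    svals = np.linalg.svd(W_E, compute_uv=False)      # D + 1 singular values
    fractions = svals**2 / np.sum(svals**2)           # share of each component
    return svals, fractions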

Contributions to degeneracy

Given a linear transformation $T_1:\mathbb{R}^{D_1}\to\mathbb{R}^{D_2}$ followed by another linear transformation $T_2:\mathbb{R}^{D_2}\to\mathbb{R}^{D_3}$, reducing the rank of $T_1$ from $r$ to $r'<r$ renders $D_3(r-r')$ components of the second transformation irrelevant. This would mean a decrease in the learning coefficient of $D_3(r-r')/2$, since a decrease of $d$ in effective dimensionality leads to a decrease of $d/2$ in the LLC. (Note that this is not the only possible way for the LLC to decrease: changing the local loss landscape from quadratic to quartic or some higher power would also lower the LLC, by a fractional amount.) In the actual model, we do not see an exact decrease in the rank, and a layer normalization sits between the linear transformation of the embedding and the linear transformations of each transformer block and the unembedding. It is unclear what the precise relation between structure and degeneracy is in this case (Section D.2.6). Still, suggestively, the onset of embedding collapse coincides with a decrease in the rate of increase of $\hat{\lambda}(w_t)$.
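As a purely hypothetical illustration of the magnitudes this argument predicts (the numbers below are not measured from the model): if the downstream transformation has output dimension $D_3 = 64$ and the effective rank of the embedding drops by one, then

$\Delta\lambda = -\tfrac{1}{2}\,D_3\,(r - r') = -\tfrac{1}{2}\cdot 64\cdot 1 = -32.$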

Figure D.4: Left: The embedding partially “collapses” during the second half of LR2. At the start of stage LR2, the minimum singular value explains only 3% of the variance in residual stream activations due to the sample. By the end of training, it explains half that. Middle: The positional encoding goes through a similar shift during LR3 (that begins earlier, during LR2). Right: The cosine similarity between the 5 rows of $W_{\text{embed}}$ and the projection of those rows onto the subspace spanned by $W_{\text{unembed}}$ shows that the model learns to write token and positional information to the same subspace.
D.2.2 Positional encoding

The positional encoding goes through a similar collapse to the unembedding, starting during the second part of LR2 and continuing into LR3 (Figure D.4b). Additionally, throughout these stages, the subspace spanned by the embedding becomes more aligned with the subspace spanned by the positional encoding (Figure D.4c).

Contributions to degeneracy

For the same reason as with the token embedding, a decrease in the dimensionality of the subspace occupied by activations reduces the effective number of dimensions and thus the learning coefficient. This occurs both as the positional encoding's effective dimensionality decreases (vanishing singular values, Figure D.4b) and as the token embedding subspace and positional embedding subspace align (increasing cosine similarity, Figure D.4c).

D.2.3 Attention collapse

Over the course of training, we observe that some attention heads learn to attend solely (soft attention becomes hard attention) and consistently (the attention pattern becomes content-independent) to certain positions. We call this phenomenon attention collapse, in parallel with the other observed forms of collapse. Not only does this potentially contribute to a decrease in the LLC, but it also makes the attention heads identifiable: we find a self-attention head, previous-token attention heads, previous-$x$-attention heads, and previous-$y$-attention heads.

$x$-attention vs. $y$-attention

For convenience, we separate each attention head into two parts: one for the $x$-tokens and the other for the $y$-tokens.

Attention entropy score

To quantify attention hardness, we use the attention entropy score (Ghader & Monz, 2017; Vig & Belinkov, 2019). Given the attention pattern $\alpha_{k,k'}^{(b,h)}$ for how much token $k$ in head $h$ in block $b$ attends back to token $k'$, its attention entropy score $H_k^{(b,h)}$ is the Shannon entropy over preceding indices $k'\leq k$,

$H_k^{(b,h)} = -\sum_{k'\leq k} \alpha_{k,k'}^{(b,h)} \log_2 \alpha_{k,k'}^{(b,h)}.$ (20)

From this, we compute the normalized entropy $\hat{H}_k^{(b,h)}$, which divides the attention entropy by the maximum entropy for the given context length,

$\hat{H}_k^{(b,h)} = \frac{H_k^{(b,h)}}{\log_2(k)}.$ (21)

This accounts for the entropy being calculated over different numbers of tokens; the result is displayed in Figure D.5. Notably, the identified stages line up closely with the stages visible in these attention entropy curves.
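A sketch of this computation, assuming a causal attention array of shape (batch, heads, seq, seq); a small constant guards the logarithm, and position 0, whose maximum entropy is zero, is normalized by 1 so that its score is simply 0.

import numpy as np

def normalized_attention_entropy(attn, eps=1e-12):
    # attn: (batch, n_heads, seq, seq); each row of the last axis sums to 1
    #       over the allowed (causal) key positions.
    H = -(attn * np.log2(attn + eps)).sum(axis=-1)     # Eq. (20), per query position
    q = np.arange(attn.shape[-1])                      # 0-based query positions
    max_H = np.log2(np.maximum(q + 1, 2))              # log2 of # of attendable tokens
    return (H / max_H).mean(axis=0)                    # Eq. (21), shape (n_heads, seq)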

Figure D.5: Attention hardening as measured by the normalized attention entropy score (Section D.2.3). Block 1 heads 1y/3y and block 2 head 1y harden over training. In combination with the fact that these attention heads become less variable (Figure D.6), this may contribute to a decrease in the LLC (discussed in Section D.2.3). The x-components of the attention heads remain much softer over the entire training run.
Constant attention

Accomplishing constant attention requires the presence of biases in the query and key transformations or, if there are no biases (as is the case for the models we investigated), attending to the positional embedding. With the Shortformer-style positional encoding used for the language models (Section F.1.1), this is straightforward: the positional information is injected directly into the query and key matrices. With the in-context linear regression models, where the positional embedding is added to the residual stream activations, this is less straightforward: achieving constant attention requires separating residual stream activations into orthogonal positional- and input-dependent subspaces, then reading from the former with the query and key weight matrices.

Attention variability score

To quantify how constant the attention pattern is, we measure attention variability (Vig & Belinkov, 2019),

$V_k^{(b,h)} = \frac{\sum_{i=1}^{n}\sum_{k'\leq k}\left|\alpha_{k,k'}^{(b,h)}(S_K^{(i)}) - \bar{\alpha}_{k,k'}^{(b,h)}\right|}{2n\sum_{k'\leq k}\bar{\alpha}_{k,k'}^{(b,h)}},$ (22)

where the division by 2 ensures the variability lies in the range $[0,1]$. This is displayed in Figure D.6. Though attention hardness and variability are in principle independent axes of differentiation, empirically we observe that hard attention is correlated with low variability.

Figure D.6: Attention variability over time. The heads that develop hard attention in Figure D.5 (block 1 heads 1y, 3y, and 4y) also become less variable over time.
Self-attention score

Self-attention is measured by the average amount a token $k$ attends to itself, $\alpha_{k,k}^{(b,h)}$.

Previous-token attention score

Previous-token attention is measured in the same way as in the language model setting (Section C.2), with one difference: we compute the previous-token score not over a synthetic dataset but over a validation batch.

$x$-attention score

The total amount attended to inputs $x_k$, that is, $\alpha_{k,x}^{(b,h)} = \sum_{k'=1}^{K} \alpha_{k,2k'}^{(b,h)}$.

$y$-attention score

Defined analogously as $\alpha_{k,y}^{(b,h)} = \sum_{k'=1}^{K} \alpha_{k,2k'+1}^{(b,h)}$.
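A sketch of the $x$- and $y$-attention scores under the tokenization of Section F.2.2, where $x$-tokens sit at even 0-based positions and $y$-tokens at odd positions; the exact indexing convention in the analysis code may differ.

import numpy as np

def xy_attention_scores(attn):
    # attn: (batch, n_heads, seq, seq) attention over the interleaved context
    #       (x_1, y_1, x_2, y_2, ...).
    x_mass = attn[..., 0::2].sum(axis=-1)     # total attention to x-tokens
    y_mass = attn[..., 1::2].sum(axis=-1)     # total attention to y-tokens
    # Average over the batch and over query positions: one score per head.
    return x_mass.mean(axis=(0, 2)), y_mass.mean(axis=(0, 2))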

Classifying attention heads

Several attention heads are easy to identify by virtue of being both concentrated and consistent. These are depicted in Figure D.7 and include: (B1H3y) previous-token heads (also present in the language model case), (B1H1y) previous-$x$ heads, and (B1H4x, B2H1y) previous-$y$ heads. Other training runs also include self-attention heads.

Figure D.7: Collection of attention heads identified by their consistent and recognizable attention patterns. Left to right: previous-$x$ head, previous-token head, previous-$y$ head, previous-$y$ head.
Contributions to degeneracy

Suppose an attention head $h$ in block $b$ has the constant attention pattern (after the softmax) $A^{(b,h)} = \sum_i \delta_{l(i)\,i}$. That is, for each token $i$, the attention head attends solely to a single earlier token $l(i)\leq i$ and no others. Restricting to single-head attention (the argument generalizes straightforwardly), the final contribution of this attention head to the residual stream is the following (Phuong & Hutter, 2022):

$O = W_O\cdot(V\cdot A),$ (23)

where $A\in\mathbb{R}^{\ell_z\times\ell_x}$ is the attention pattern, $V\in\mathbb{R}^{d_{\text{out}}\times\ell_z}$ is the value matrix, and $W_O$ is the output matrix; the result is added back into the residual stream. Plugging in the hard and constant attention pattern, writing out the matrix multiplication, and filling in the definition of $A$, we get

$O_{ij} = \sum_k (W_O)_{ik} V_{k\,l(j)}\,\delta_{l(j)\,j}.$ (24)

For each column $j$ of $A$, the hard attention picks out a single column $l(j)$ of $V$. Now suppose that there is a token $l'$ that receives no attention from any position $j$; that is, there exists no $j$ such that $l' = l(j)$. Then there is a column $l'$ in $V$ which does not contribute to the result of $V\cdot A$ and, in turn, a column $l'$ in $W_O$ which does not contribute to the output of the head. As discussed for the embedding and layer norm, this decrease in effective dimensionality leads to a decrease in the learning coefficient.

Note that this argument does not hold for all hard and constant attention patterns. It holds solely for attention patterns that consistently ignore some earlier token across all positions, such as the previous-$x$ and previous-$y$ heads, but not the self-attention and previous-token heads. As discussed in Section D.2.6, it remains unclear what exactly the threshold for “ignoring” a token should be before it contributes to degeneracy, and whether any of the heads we examine actually meet this threshold.

D.2.4 Unembedding collapse

The unembedding block consists of a layer normalization $\operatorname{LN}(z)$ followed by a linear transformation $W_U z + b_U$ and finally a projection $\pi_y$ to extract the $y$-component. Given the 64-dimensional vector of activations $z$ in the residual stream right before the unembedding (for a specific token), the full unembedding operation is:

$\pi_y\left[W_U\left(\frac{z-\mathbb{E}[z]}{\sqrt{\mathbb{V}[z]+\epsilon}}\odot\gamma+\beta\right)+b_U\right],$

where $\odot$ denotes element-wise multiplication of two vectors and $\gamma,\beta$ are the layer normalization weights and biases, respectively.

Effective unembedding weights and biases

Moving terms around, we can represent this as

$\left((W_U)_{[0,:]}\odot\gamma\right)\left(\frac{z-\mathbb{E}[z]}{\sqrt{\mathbb{V}[z]+\epsilon}}\right)+\left((W_U)_{[0,:]}\beta\right)+(b_U)_{[0]},$

where we order the outputs so that the $y$-token corresponds to the 0th row. Because we are reading out a single $y$-component, we can express the unembedding transformation in terms of “effective” unembedding weights and biases

$\tilde{W}_U = (W_U)_{[0,:]}\odot\gamma,$
$\tilde{b}_U = (W_U)_{[0,:]}\beta + (b_U)_{[0]}.$
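Extracting these quantities from a checkpoint is a one-liner each; a sketch with illustrative parameter names, not the exact ones used in our training code.

import numpy as np

def effective_unembedding(W_U, b_U, gamma, beta):
    # W_U:   (n_out, d_embed) unembedding weights, y-component in row 0.
    # b_U:   (n_out,) unembedding bias.
    # gamma: (d_embed,) layer norm weights;  beta: (d_embed,) layer norm biases.
    W_tilde = W_U[0, :] * gamma             # element-wise product
    b_tilde = W_U[0, :] @ beta + b_U[0]
    return W_tilde, b_tilde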
Unembedding weights over time

In Figure D.8, we plot $(\gamma,\beta)$, $((W_U)_{[0,:]}, (b_U)_{[0]})$, and $(\tilde{W}_U,\tilde{b}_U)$ as a function of training steps, along with the mean weight over time. These are 64- and 1-dimensional vectors, so we can display the entire set of components. During stage LR3, the majority of the weights $\beta$ and $W_U$ “collapse” to zero. Additionally, the layer normalization biases temporarily experience a large increase in variance before returning to small values. Despite this, the mean of the linear weights, layer normalization biases, and effective weights remains remarkably constant and close to zero throughout the entire process.

Contributions to degeneracy

Suppose that $D$ of the layer normalization weights have vanished, say $\gamma_i = 0$ for $1\leq i\leq D$. Then the corresponding columns of $W_U$ only contribute to the unembedding via their product $(W_U)_{[:,1:D]}\beta_{[1:D]}$ with the first $D$ rows of $\beta$. This creates a typical form of degeneracy studied in SLT and found, for example, in deep linear networks, where we can change the weights to $(W_U)_{[:,1:D]}A,\ A^{-1}\beta_{[1:D]}$ for any invertible $D\times D$ matrix $A$ without changing the function computed by the network. If in addition the $\beta_i$ vanish for $1\leq i\leq D$, then the entries of $(W_U)_{[:,1:D]}$ are completely unconstrained, creating further degeneracy.
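This reparametrization invariance is easy to verify numerically; the following sketch checks, for random matrices, that replacing $(W,\beta)$ with $(WA,\,A^{-1}\beta)$ leaves the product, and hence the computed function, unchanged.

import numpy as np

rng = np.random.default_rng(0)
D, n_out = 8, 5
W = rng.normal(size=(n_out, D))         # stands in for (W_U)_[:, 1:D]
beta = rng.normal(size=D)               # stands in for beta_[1:D]
A = rng.normal(size=(D, D))             # a random matrix is invertible almost surely

original = W @ beta
reparametrized = (W @ A) @ (np.linalg.inv(A) @ beta)
assert np.allclose(original, reparametrized)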

Figure D.8: Unembedding weights over time for the RT1 transformer undergo a “collapse” that begins towards the end of LR2. When these weights reach zero in LR3 and LR4, this may contribute to the observed decrease in the LLC. Top: Weights over time. The outlier in the positive direction is the weight for the $y$-token output. Bottom: Biases over time. Left: Unembedding layer normalization weights over time. Middle: Unembedding linear weights over time (restricted to the $y$-subspace). Right: Effective unembedding weights over time (obtained by element-wise multiplication of the preceding columns, focusing on the bias for only the $y$-token).
D.2.5 Layer normalization collapse

The “collapse” in layer normalization weights is not unique to the unembedding. As depicted in Figure D.9, this behavior occurs in all layer norms except for the second MLP's. The biases also remain centered close to zero even as the variance in biases grows much larger. Unlike the unembedding, these layers begin to change earlier (starting halfway through LR2).

What is most striking about the layer normalization collapse is that it occurs without any explicit regularization (neither weight decay nor dropout). As such, it demonstrates a clear example of implicit regularization, i.e., inductive biases in the optimizer or model that favor simpler solutions.

Figure D.9: Layer norm weights over time. Top: After LR3, the layer normalization collapse expands from the unembedding to earlier layers, most notably in the first pre-attention layer norm. This occurs without explicit regularization and may contribute to the concurrent decrease in LLC. Bottom: During layer normalization collapse, the variance of layer normalization biases increases drastically while the mean of the biases remains relatively constant. Inset: Plotting the fraction of weights or biases whose magnitude is less than 0.1 over time reveals that the collapse is more measured for intermediate layer norms: weights shrink to small values but not extremely close to zero as in the unembedding and first attention layer.
Contributions to degeneracy

In the previous section, we describe how layer norm collapse in the unembedding is linked to an increase in degeneracy because it ensures that parameters in the subsequent linear layer become irrelevant. The same is true for the layer norms which precede the attention and MLP blocks.

D.2.6 Degeneracy and development

In the previous subsections, we provide a set of theoretical arguments for how embedding collapse (Section D.2.1), layer normalization collapse (Section D.2.5), and attention collapse (Section D.2.3) can lead to an increase in degeneracy, even while leaving the implemented function unchanged.

The free energy formula tells us that, for two different solutions (sets of weights) with the same loss, the Bayesian posterior will asymptotically prefer the one with the lower learning coefficient (i.e., higher degeneracy). This suggests that these different forms of collapse may be driven by a bias towards higher degeneracy, as captured in the free energy formula. However, in idealized Bayesian inference, we do not expect the posterior to be concentrated around the neighborhood of an equal-loss-but-higher-degeneracy local minimum to begin with. That this kind of transition arises in practice might be due to one of the various differences between Bayesian inference and gradient-based training.

Actually establishing a causal link between increasing degeneracy and structure development is beyond the scope of this paper. For one, the theoretical arguments hinge on the collapse being complete, that is, the components that go to zero must become exactly zero in the limit where the number of samples used to compute the loss is taken to infinity. In practice, we expect there to be some threshold $\epsilon$ below which we can treat weights as effectively zero. Second, even if these explanations are correct, we do not know that they account for all of the empirically observed decrease in the LLC during these stages. There may be other drivers we missed. Finally, establishing a causal link requires theoretical progress in relating the Bayesian learning process to the SGD learning process. The arguments are suggestive, but currently only a source of intuition for how structure and degeneracy can be related, and a starting point for future research.

Appendix E One-layer language model experiments

We also trained and ran some experiments on a one-layer language model (see Section F.1.1 for details). We aggregate results for the one-layer language model here, mirroring the experiments for the two-layer language model where possible. The early development of the one-layer model has many parallels with the two-layer model. At a single stage boundary, just as in the two-layer model, the one-layer model minimizes its bigram score (see Section C.1.1), begins utilizing the positional embedding to noticeably improve performance (see Section C.2.1), and starts making sudden improvements to the same $n$-gram scores (see Section C.1.2). Remarkably, this occurs at the same checkpoint as in the two-layer model (at 900 training steps).

One key difference, however, is that this occurs at the second stage boundary as discerned by the plateaus of the LLC estimation. We did not closely investigate why the LLC estimation appears to drop between steps 400 and 900 in this model. We do, however, observe an interesting qualitative similarity to the LLC drop in stage LM3 of the two-layer model: in both cases, the drop precedes a noticeable bump in the loss.

Figure E.1: We train a one-layer transformer model in the language setting to compare with the two-layer model. The development of certain behavioral and structural metrics over time closely mirrors the development of the same metrics in the early stages of the two-layer language model. Top: test loss and LLC estimations over time for the one-layer attention-only transformer; compare with Figure 1(a). Bottom: bigram score, test loss with the positional embedding ablated, and $n$-gram scores for the one-layer attention-only transformer; compare with Figure 4(a,b,c).
Figure E.2: A more detailed version of Figure E.1 for the one-layer language model. Top: Loss, LLC, and weight norm, along with an overlaid Gaussian process fit to these curves (red dotted lines). Bottom: Associated slopes, both from numerically estimated finite differences (transparent blue) and from the Gaussian process fit (red dotted lines).

Appendix F Transformer training experiment details

F.1 Language models

F.1.1 Architecture

The language model architectures we consider are one- and two-layer attention-only transformers. They have a context length of 1024, a residual stream dimension of $d_{\text{model}}=256$, $H=8$ attention heads per layer, and include layer normalization layers. We also used a learnable Shortformer positional embedding (Press et al., 2021). The resulting models have a total of $d=3{,}091{,}336$ parameters for $L=1$ and $d=3{,}355{,}016$ parameters for $L=2$. We used an implementation provided by TransformerLens (Nanda & Bloom, 2022).

Component | 1-Layer | 2-Layer
Token Embedding Weights | 1,280,000 | 1,280,000
Positional Embedding Weights | 262,144 | 262,144
Layer 1 Layer Norm Weights | 256 | 256
Layer 1 Layer Norm Bias | 256 | 256
Layer 1 Attention Query Weights | 65,536 | 65,536
Layer 1 Attention Key Weights | 65,536 | 65,536
Layer 1 Attention Value Weights | 65,536 | 65,536
Layer 1 Attention Output Weights | 65,536 | 65,536
Layer 1 Attention Query Bias | 256 | 256
Layer 1 Attention Key Bias | 256 | 256
Layer 1 Attention Value Bias | 256 | 256
Layer 1 Attention Output Bias | 256 | 256
Layer 2 Layer Norm Weights | N/A | 256
Layer 2 Layer Norm Bias | N/A | 256
Layer 2 Attention Query Weights | N/A | 65,536
Layer 2 Attention Key Weights | N/A | 65,536
Layer 2 Attention Value Weights | N/A | 65,536
Layer 2 Attention Output Weights | N/A | 65,536
Layer 2 Attention Query Bias | N/A | 256
Layer 2 Attention Key Bias | N/A | 256
Layer 2 Attention Value Bias | N/A | 256
Layer 2 Attention Output Bias | N/A | 256
Final Layer Norm Weights | 256 | 256
Final Layer Norm Bias | 256 | 256
Unembedding Weights | 1,280,000 | 1,280,000
Unembedding Bias | 5,000 | 5,000
Figure F.1: Attention-only transformers with Shortformer position-infused attention and pre-layer norm. The one-layer model has a total of 3,091,336 trainable parameters, while the two-layer model has 3,355,016.
F.1.2 Tokenization

For tokenization, we used a truncated variant of the GPT-2 tokenizer that cuts the original vocabulary of 50,000 tokens down to 5,000 (Eldan & Li, 2023) to reduce the size of the model. We think this may contribute to the prominence of the plateau at the end of LM1: the frequency of bigram statistics depends on the choice of tokens, and a larger vocabulary leads to bigrams that are individually much less frequent.

F.1.3 Training

The models are trained for a single epoch of 50,000 steps on approximately 5 billion tokens from a resampled subset of the Pile (Gao et al., 2020; Xie et al., 2023), using a batch size of 100. A snapshot was saved every 10 steps for a total of 5,000 checkpoints, though the majority of the analysis used checkpoints every 100 steps. The training time was around 6 GPU-hours per model on an A100. Additional seeds were trained on v4 TPUs at around 1.5 TPU-hours per model.

Training was conducted on the first 10 million lines of the DSIR-filtered Pile (Xie et al., 2023; Gao et al., 2020) but did not exhaust all 10 million lines. The models were trained with weight decay but without dropout. We did not employ a learning rate scheduler.

Table 3: Summary of hyperparameters and their values for transformer language model training experiments.
Hyperparameter | Category | Description/Notes | Value
$n$ | Data | # of training samples | 5,000,000
$T$ | Data | # of training steps | 50,000
$N_{\text{test}}$ | Data | # of test samples | 512
Tokenizer Type | Data | Type of tokenizer | Truncated GPT-2 tokenizer
$D$ | Data | Vocabulary size | 5,000
$K$ | Data | Context size | 1,024
$L$ | Model | # of layers in the model | 2
$H$ | Model | # of heads per layer | 8
$d_{\mathrm{mlp}}$ | Model | MLP hidden layer size | N/A
$d_{\mathrm{embed}}$ | Model | Embedding size | 256
$d_{\text{head}}$ | Model | Head size | 32
seed | Model | Model initialization | 1
m | Training | Batch size | 100
Optimizer Type | Optimizer | Type of optimizer | AdamW
$\eta$ | Optimizer | Learning rate | 0.001
$\lambda_{\mathrm{wd}}$ | Optimizer | Weight decay | 0.05
$\beta_{1,2}$ | Optimizer | Betas | (0.9, 0.999)

F.2 In-context linear regression transformers

F.2.1 Architecture

In the following, $L$ refers to the number of layers (blocks) in the transformer, $H$ is the number of heads in each layer, $D$ is the dimension of the inputs $x\in\mathbb{R}^D$, and $K$ is the number of $(x,y)$ pairs provided to the transformer in-context.

The architecture is a pre-layer-norm decoder-only transformer modeled after NanoGPT (Karpathy, 2022; see also Phuong & Hutter, 2022) with a learnable positional embedding. For the models discussed in the main body, we consider $L=2$, $H=4$ transformers (with $d=51{,}717$ parameters), i.e., two transformer blocks with four attention heads each.

Component | # of Parameters
Token Embedding Weight | 320
Positional Embedding Weight | 1,024
Layer 1 Layer Norm Weight 2 | 64
Layer 1 Layer Norm Bias 1 | 64
Layer 1 Attention Weights | 12,288
Layer 1 Attention Output Weights | 4,096
Layer 1 Layer Norm Weight 1 | 64
Layer 1 Layer Norm Bias 2 | 64
Layer 1 Feed-Forward MLP Weight | 4,096
Layer 1 Feed-Forward MLP Bias | 64
Layer 1 Feed-Forward Output Weight | 4,096
Layer 1 Feed-Forward Output Bias | 64
Layer 2 Layer Norm Weight 1 | 64
Layer 2 Layer Norm Bias 1 | 64
Layer 2 Attention Weights | 12,288
Layer 2 Attention Output Weights | 4,096
Layer 2 Layer Norm Weight 2 | 64
Layer 2 Layer Norm Bias 2 | 64
Layer 2 Feed-Forward MLP Weight | 4,096
Layer 2 Feed-Forward MLP Bias | 64
Layer 2 Feed-Forward Output Weight | 4,096
Layer 2 Feed-Forward Output Bias | 64
Unembedding Layer Norm Weight 1 | 64
Unembedding Layer Norm Bias 1 | 64
Unembedding Weight 2 | 320
Unembedding Bias 2 | 5
Figure F.2: Transformer parameters in the in-context linear regression setting. The model has two transformer blocks for a total of 51,717 trainable parameters.
F.2.2 Tokenization

Running a context $S_K$ through the above model requires an initial encoding or “tokenization” step and a final “projection” step. The context is encoded as a sequence of “tokens” $T_k$ as follows:

$T_k = \left(\begin{pmatrix}0\\ x_1\end{pmatrix}, \begin{pmatrix}y_1\\ 0\\ \vdots\\ 0\end{pmatrix}, \cdots, \begin{pmatrix}0\\ x_k\end{pmatrix}, \begin{pmatrix}y_k\\ 0\\ \vdots\\ 0\end{pmatrix}\right).$

Throughout the main text, we write $f_w(S_k)$ for $f_w(T_k)$. Note that this tokenization includes the final $y_k$ token even though it receives no training signal. For this reason, we omit this token from the attention entropy and variability plots (Figures D.5 and D.6).

The transformer outputs a series of tokens of the same shape as $T_k$. To read out the $\hat{y}_k$ predictions, we take the first component of every other output token, i.e.,

$\pi_Y : \mathbb{R}^{(D+1)\times 2K} \to \mathbb{R}^{K},$ (25)

$\left(\begin{pmatrix}\hat{y}_1\\ \vdots\end{pmatrix}, \begin{pmatrix}\cdot\\ \vdots\end{pmatrix}, \cdots, \begin{pmatrix}\hat{y}_K\\ \vdots\end{pmatrix}, \begin{pmatrix}\cdot\\ \vdots\end{pmatrix}\right) \mapsto (\hat{y}_1,\dots,\hat{y}_K).$ (26)
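A sketch of this encoding and read-out, assuming the context is stored as arrays xs of shape (K, D) and ys of shape (K,); the interleaving and projection conventions follow the equations above.

import numpy as np

def tokenize(xs, ys):
    # Encode K (x, y) pairs as 2K tokens of dimension D + 1: x-tokens carry x_k
    # in the last D components, y-tokens carry y_k in the first component.
    K, D = xs.shape
    tokens = np.zeros((2 * K, D + 1))
    tokens[0::2, 1:] = xs              # x-tokens at even positions
    tokens[1::2, 0] = ys               # y-tokens at odd positions
    return tokens

def project_predictions(outputs):
    # pi_Y: read y_hat_k from the first component of every other output token
    # (the outputs at the x-token positions).
    return outputs[0::2, 0]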
F.2.3 Training

We train from a single seed for each choice of architecture and optimizer hyperparameters using minibatch stochastic gradient descent. We train without explicit regularization and use the Adam optimizer (Kingma & Ba, 2014). The training runs take 1 to 5 TPU-hours on TPUs provided by Google Research. Models are trained from the same initialization and on the same data vectors within each batch (but with different sets of tasks and task orderings).

Models are trained on a single epoch: each of the $T=500{,}000$ batches consists of a new set of sequences with batch size 256. For the LLC estimates, we save 190 checkpoints: 100 are linearly spaced over the training run, and the remaining 90 are logarithmically spaced. We perform LLC estimation and other analyses on these checkpoints.

Table 4: Summary of hyperparameters and their default values for in-context linear regression transformer model training experiments.
Hyperparameter | Category | Description/Notes | Default Value
$n$ | Data | # of training samples | 128,000,000
$B$ | Data | Batch size during training | 256
$T$ | Data | # of training steps | 500k
$N_{\text{test}}$ | Data | # of eval samples | 2048
$D$ | Data | Dimension of the linear regression task (task size) | 4
$K$ | Data | Maximum in-context examples | 8
$\sigma^2$ | Data | Variance of noise in data generation | 0.125
$L$ | Model | # of layers in the model | 2
$H$ | Model | # of attention heads per layer | 4
$d_{\mathrm{mlp}}$ | Model | Size of the hidden layer in the MLP | 64
$d_{\mathrm{embed}}$ | Model | Embedding size | 64
seed | Misc | Training run seeds | {0, 1, 2, 3, 4}
Optimizer Type | Optimizer | Type of optimizer | Adam
$\eta$ | Optimizer | Maximum learning rate | 0.003
$\lambda_{\mathrm{wd}}$ | Optimizer | Weight decay | 0
$\beta_{1,2}$ | Optimizer | Betas | (0.9, 0.999)
Scheduler Type | Scheduler | Type of learning rate scheduler | OneCycleLR
Strategy | Scheduler | Strategy for annealing the learning rate | Linear
% start | Scheduler | Percentage of the cycle when the learning rate is increasing | 0.5

Cite as

@article{hoogland2024loss,
  title = {Loss Landscape Degeneracy and Stagewise Development of Transformers},
  author = {Jesse Hoogland and George Wang and Matthew Farrugia-Roberts and Liam Carroll and Susan Wei and Daniel Murfet},
  year = {2024},
  url = {https://tmlr.infinite-conf.org/paper_pages/45qJyBG8Oj.html}
}