Dynamics of Transient Structure in In-Context Linear Regression Transformers
Authors: Liam Carroll (Timaeus & Gradient Institute), Jesse Hoogland (Timaeus), Matthew Farrugia-Roberts (University of Oxford), Daniel Murfet (University of Melbourne)

Published: Jan 29, 2025
Abstract
Modern deep neural networks display striking examples of rich internal computational structure. Uncovering principles governing the development of such structure is a priority for the science of deep learning. In this paper, we explore the transient ridge phenomenon: when transformers are trained on in-context linear regression tasks with intermediate task diversity, they initially behave like ridge regression before specializing to the tasks in their training distribution. This transition from a general solution to a specialized solution is revealed by joint trajectory principal component analysis. Further, we draw on the theory of Bayesian internal model selection to suggest a general explanation for the phenomena of transient structure in transformers, based on an evolving tradeoff between loss and complexity. We empirically validate this explanation by measuring the model complexity of our transformers as defined by the local learning coefficient.
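To make the setting concrete, below is a minimal numpy sketch of an in-context linear regression task distribution, the ridge regression predictor that (per the abstract) the transformer initially approximates, and a predictor specialized to a finite pool of training tasks. The dimensions, task count, noise level, and the `pool_posterior_prediction` helper are illustrative assumptions for exposition, not the paper's exact configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

d = 8           # input dimension (illustrative)
K = 16          # number of training tasks, i.e. "task diversity" (illustrative)
n_ctx = 16      # in-context examples per prompt (illustrative)
noise_std = 0.1 # observation noise (illustrative)

# A finite pool of regression tasks, each defined by a weight vector w_k.
tasks = rng.standard_normal((K, d))

def sample_prompt():
    """Draw one prompt: context pairs (X, y) generated by a random pool task,
    plus a held-out query input."""
    w = tasks[rng.integers(K)]
    X = rng.standard_normal((n_ctx, d))
    y = X @ w + noise_std * rng.standard_normal(n_ctx)
    x_query = rng.standard_normal(d)
    return X, y, x_query, float(x_query @ w)

def ridge_prediction(X, y, x_query, lam=noise_std**2):
    """Ridge regression fit to the context: the general-purpose solution."""
    w_hat = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
    return float(x_query @ w_hat)

def pool_posterior_prediction(X, y, x_query):
    """Posterior-weighted average over the finite task pool: a predictor
    specialized to the tasks in the training distribution (assuming Gaussian
    noise; a hypothetical stand-in for the specialized solution)."""
    sq_err = ((y[:, None] - X @ tasks.T) ** 2).sum(axis=0)   # shape (K,)
    log_w = -sq_err / (2 * noise_std**2)
    p = np.exp(log_w - log_w.max())
    p /= p.sum()
    return float(p @ (tasks @ x_query))

X, y, x_query, y_true = sample_prompt()
print("ridge:", ridge_prediction(X, y, x_query))
print("pool :", pool_posterior_prediction(X, y, x_query))
print("true :", y_true)
```

With few in-context examples relative to the task dimension, the two predictors can disagree noticeably; the transient ridge phenomenon concerns which of these two kinds of solution the trained transformer implements at different points in training.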