TLDR: We demonstrate patterning, a principled approach for interpretability-guided training, in Anthropic’s toy model of superposition.
Reading: from developmental biology to developmental interpretability. In 1957, Conrad Waddington proposed the epigenetic landscape as an analogy for cell differentiation. Picture a marble at the top of a hillside carved with branching valleys. At each fork the cell commits to one developmental path or another. The final resting place determines what kind of cell it has become.
Neural networks develop in a comparable way. Training is the marble’s descent, here rolling down a landscape carved by the loss. Through a series of transitions, the model commits to one internal structure over another. These transitions dictate the ways in which the model will generalize.
The developmental perspective gives biologists enormous leverage: instead of cataloguing every adult cell by its full transcriptome, they trace it back to a handful of commitment events on a lineage tree. Developmental interpretability (MISSING 0000) pursues an analogous move for neural networks — read internal structure in trained networks by characterizing the developmental stages that produced it.

Writing: from interpretability to patterning. Models today are grown, not programmed: we pick a loss and a dataset, and the model’s structure is whatever training produces. Developmental interpretability lets us read what we got. But reading doesn’t let us change anything — for that we need to write. The goal is models that are crafted, not grown.
“Patterning” is our answer to what it means to “write” neural networks. The methodology combines two ideas. First, measure internal structure not by point evaluations but by expectation values averaged over the local loss landscape; this is what we call “spectroscopy,” following the tradition of statistical physics. Second, optimize not just over model parameters but also over the data mixture, using the derivatives of these expectation values to compute which data changes produce which structural changes. By repeatedly measuring and intervening over the course of training, we close a feedback loop connecting data interventions to structural outcomes. This is control in the precise sense of the word.
In this post, we demonstrate patterning in Anthropic’s Toy Models of Superposition (TMS)—a deliberately simple setting where internal structure is immediately visible. In this setting, we show:
- Patterning allows choosing between developmental branches. Ordinary training commits to a branch at random. Via patterning, we are able to perfectly control these crucial developmental moments.
- Patterning allows moving a trained model between solutions. We show that patterning can move a model at a global minimum between equivalent but disconnected solutions, crossing loss barriers that gradient descent cannot.
- Patterning works where classical methods don’t. We show that at the bifurcation points—where control matters most—influence functions and second-order methods get no signal. Patterning finds its signal in the higher-order geometry these methods miss.
Toy Models of Superposition
We begin with a recap of the toy model of superposition (Section 2.1; Elhage et al. (2022)), define “internal structure” for this model (Section 2.2), and describe the model’s stagewise development (Section 2.3).
The Model
Anthropic’s Toy Models of Superposition (Elhage et al. 2022) introduced a minimal setting for studying superposition, the phenomenon where neural networks represent more features than they have dimensions. This motivated a subsequent line of work that culminated in Anthropic’s sparse autoencoders, which characterize the modern paradigm of mechanistic interpretability (MISSING 0000). We use this model for two reasons. Its internal structure is visible in two dimensions, making it easy to simply ‘see’ the structure of the model. And the model exhibits clear developmental stages, with a choice among equivalent solutions at each transition—a clean instance of the kind of structural choice that we expect to find, in more complex forms, in larger models.
Following Elhage et al. (2022), we study a tiny ReLU autoencoder (without a bias for simplicity). Inputs are binary feature vectors $x \in \{0,1\}^n$, where each feature is active with an independent probability $1 - S$, where $S$ is the “sparsity”. The model learns a feature dictionary $W \in \mathbb{R}^{m \times n}$ with representation $h = Wx$ and reconstruction $\hat{x} = \mathrm{ReLU}(W^\top h)$. We train against the feature-weighted MSE,
$$\mathcal{L}(W) = \mathbb{E}_x \left[ \sum_{i=1}^{n} I_i \,(x_i - \hat{x}_i)^2 \right],$$
where $I_i$ is the feature importance. By default, we focus on the uniform importance regime $I_i = 1$.
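To make the setup concrete, here is a minimal NumPy sketch of the model and the feature-weighted loss. All names are ours rather than the original code, and the sparsity value is illustrative; we use the five-feature, two-dimensional configuration studied below.

```python
import numpy as np

rng = np.random.default_rng(0)

n_feat, n_hidden = 5, 2   # 5 features, 2 hidden dimensions
S = 0.95                  # each feature active with probability 1 - S

def sample_batch(batch_size, rng):
    """Binary feature vectors with independently active features."""
    return (rng.random((batch_size, n_feat)) < (1.0 - S)).astype(float)

def forward(W, x):
    """h = W x, then xhat = ReLU(W^T h); no bias, as in the text."""
    h = x @ W.T                     # (batch, n_hidden)
    return np.maximum(0.0, h @ W)   # (batch, n_feat)

def loss(W, x, importance=None):
    """Feature-weighted MSE; uniform importance by default."""
    if importance is None:
        importance = np.ones(n_feat)
    return np.mean(((x - forward(W, x)) ** 2) @ importance)

W = 0.01 * rng.standard_normal((n_hidden, n_feat))  # small random init
x = sample_batch(256, rng)
```

Training this sketch with any gradient-based optimizer is the setting in which the polygon solutions below arise.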
In the high-sparsity regime, learned feature vectors arrange into a polygonal geometry: $n$ features in a 2D hidden space form a regular $n$-gon when optimally packed. In this work, we focus on $n = 5$ features in two hidden dimensions.
Because the encoder matrix $W$ has shape $2 \times 5$, each feature’s hidden representation $W_i$ is a two-dimensional vector, which allows for easy visualization. The above figure shows the ideal digon, triangle, square, and pentagon forms of the model.
$W$ has more degrees of freedom than the polygon’s shape: rotating all feature vectors by the same angle gives a completely different $W$ with identical behavior and loss. The polygon also has a discrete symmetry — under uniform importance, permuting the feature labels or reflecting the polygon doesn’t change the loss. When features have different importances, however, this breaks: which feature sits where starts to matter. For example, correlated features will tend to be represented orthogonally when possible, and side by side when not (Elhage et al. 2022).
Defining Structure
Patterning targets structure. So we need a language for describing structure. Here we borrow our first concept from statistical physics, and define an appropriate set of observables: functions that map a model’s parameters to real numbers. What makes an observable “right” is that it captures a degree of freedom that matters for how the model behaves.
The structure of the toy model of superposition is simple enough that we can write down the right observables by hand. In general, this is a deep and arguably central problem; we return to it in the Discussion.
Together, these 15 values — the 10 pairwise dot products $\langle W_i, W_j \rangle$ and the 5 norms $\lVert W_i \rVert$, for our 5-feature model — form a coordinate system for a model’s structure—a macrostate, in the language of statistical physics, that captures what matters about the model while discarding what doesn’t. This is the level of description at which we will define targets, measure responses, and steer development.
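As a sketch, these observables can be read off directly from the columns of $W$; the ordering of pairs below is our own convention.

```python
import numpy as np
from itertools import combinations

def observables(W):
    """15 structural observables for a (2, 5) encoder W:
    10 pairwise dot products, then 5 column norms."""
    dots = [W[:, i] @ W[:, j] for i, j in combinations(range(5), 2)]
    norms = [np.linalg.norm(W[:, i]) for i in range(5)]
    return np.array(dots + norms)

# A perfect 5-gon at unit norm: feature i at angle 2*pi*i/5.
angles = 2 * np.pi * np.arange(5) / 5
W_pentagon = np.stack([np.cos(angles), np.sin(angles)])
phi = observables(W_pentagon)
```

For the perfect pentagon, all norms are 1 and neighboring features have dot product $\cos(2\pi/5)$.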
Under uniform importance and uncorrelated sparsity, the loss has an additional symmetry: rotating or reflecting the feature labels gives different observable values but identical loss. This means many distinct configurations are behaviorally equivalent — any permutation of the ordering of features in an n-gon doesn’t affect the loss.
Stagewise Development
Training from a small, random initialization, the model usually passes through some sequence of polygon phases: 2-gon (antipodal pair), 3-gon (triangle), 4-gon (square), and so on, with each stage separated by discrete jumps. As in biological systems, the model develops in stages from simple, undifferentiated states to progressively more complex and specialized states.
Singular learning theory (SLT) provides a framework for understanding this (Watanabe, 2009). MISSING (0000) showed that each polygon is a critical point of the loss, characterized by its loss together with its local learning coefficient (LLC) $\lambda$—a geometric invariant measuring the ‘degeneracy’ or ‘complexity’ of the critical point. Early in training, the posterior concentrates on low-$\lambda$ (simple) solutions despite their higher loss. As training progresses, it shifts toward low-loss solutions with higher $\lambda$.
Development is stochastic: at each transition, the model can proceed down multiple possible paths. A symmetric $n$-gon has $n$ equivalent ways to become an $(n+1)$-gon, corresponding to the $n$ gaps between features. In ordinary training, noise breaks the symmetry and picks one branch randomly.
In biological development, the analogous problem is solved by patterning — the process by which chemical signals called morphogens concentrate along gradients that direct cells toward specific fates. In Wolpert’s French flag model, a single morphogen forms a gradient across a field of identical cells. Cells exposed to high concentrations adopt one fate, cells at medium concentrations another, and cells at low concentrations a third. The morphogen doesn’t build anything, per se; rather, it biases which of several equivalent developmental paths each cell takes.
[French flag / TMS patterning figure]
This is the same problem we face in TMS, and the technique we introduce takes its name from this analogy. The per-sample weights are our morphogen field: a concentration profile over data space that biases which of several equivalent branches the model takes at a developmental fork. In Choosing Bifurcations, we show how to compute the right profile.
Spectroscopy
To control development, we first need to be able to predict what effect our control inputs will have on the model’s behavior. That is, we need to characterize how behavior changes in response to perturbations, or, equivalently, the geometry of the local loss landscape. Spectroscopy is our name for the set of techniques that provide this signal.
The key object is the local posterior: the Bayesian distribution over parameters in the neighborhood of a trained model. In Section 3.1, we show how to measure this distribution and thus the local landscape geometry through local expectation values. In Section 3.2, we define susceptibilities, derivatives of these expectation values with respect to the data distribution (MISSING (0000), MISSING (0000), MISSING (0000)), which tell us how that structure responds to changes in the training data. In Section 3.3, we apply these tools to read the development of a toy model of superposition.
In concurrent work (Scaling Spectroscopy), we have applied this framework to language models at scale, where the same SGLD sampling methodology recovers tens of thousands of interpretable features in Pythia models, on par with what sparse autoencoders find. In this work, we go further by showing that susceptibilities are not only diagnostic but actionable and that the same measurements used to read structure can be inverted to write it.
From Points to Expectation Values
Classically, we evaluate observables point-wise at a specific choice of weights $w$. The next idea we will borrow from statistical physics is to instead ask about the distribution of our observable, as summarized by expectation values like
$$\langle \phi \rangle = \int \phi(w)\, p(w)\, dw,$$
taken over the local loss landscape via the local posterior distribution
$$p(w) = \frac{1}{Z}\, \varphi(w)\, e^{-n\beta L_n(w)},$$
where $\varphi(w)$ is a prior and $Z$ is a normalizing constant (the “partition function” or “marginal likelihood”).
We are interested in the distribution in the neighborhood of a particular choice of weights $w^*$, which we assume is a local minimum. We restrict the distribution by replacing the prior with a “localizing” prior $\varphi_\gamma(w) \propto e^{-\frac{\gamma}{2}\lVert w - w^*\rVert^2}$, a Gaussian centered at $w^*$ with precision $\gamma$. For simplicity, we will often drop explicit dependence on $\gamma$ and $w^*$ from the resulting expectations $\langle \phi \rangle$.
In the limit $\beta \to \infty$, the posterior concentrates and $\langle \phi \rangle$ reduces to $\phi(w^*)$, so expectation values contain strictly more information than point values. For finite $\beta$, the expectation values carry corrections determined by the local geometry of the loss landscape, including from the Hessian and all higher-order terms. These corrections are what make these expectations a richer object than point estimates.
Estimating Expectation Values
To estimate these expectation values in practice, we compute averages over a finite set of draws from the local posterior. We sample these draws using Stochastic Gradient Langevin Dynamics (SGLD), which augments gradient descent with calibrated Gaussian noise:
$$w_{t+1} = w_t - \frac{\eta}{2}\Big( n\beta\, \nabla L_n(w_t) + \gamma\,(w_t - w^*) \Big) + \mathcal{N}(0, \eta),$$
where $\eta$ is the step size. As $\eta \to 0$, these iterates converge (in distribution) to the posterior $p(w)$. The noise injection prevents our estimates from collapsing to a point estimate and lets us explore the landscape around the trained weights.
In practice, we initialize at the trained model and run multiple chains in parallel. After burn-in, we collect samples per chain. At each sample $w_t$, we can record any observable $\phi(w_t)$.
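A minimal sketch of such a sampler, run here on a toy quadratic loss rather than the TMS model (the step size and precision values are illustrative, and the function names are ours):

```python
import numpy as np

def sgld_chain(grad_loss, w_star, n_steps=2000, burn_in=500,
               eta=1e-3, nbeta=10.0, gamma=100.0, seed=0):
    """Localized SGLD: drift from n*beta times the loss gradient plus a
    Gaussian localizer of precision gamma around w_star, plus noise."""
    rng = np.random.default_rng(seed)
    w = w_star.copy()
    draws = []
    for t in range(n_steps):
        drift = nbeta * grad_loss(w) + gamma * (w - w_star)
        w = w - 0.5 * eta * drift + np.sqrt(eta) * rng.standard_normal(w.shape)
        if t >= burn_in:
            draws.append(w.copy())
    return np.array(draws)

# Toy check on L(w) = 0.5 * ||w||^2, minimized at the origin.
draws = sgld_chain(lambda w: w, np.zeros(2))
```

For this Gaussian target the chain should hover around the minimum with variance roughly $1/(n\beta + \gamma)$ per coordinate, which is easy to verify from the collected draws.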
When the observable is the loss itself, $\phi = L_n$, the expectation value gives us an estimator for the local learning coefficient $\lambda$:
$$\hat{\lambda}(w^*) = n\beta \left( \langle L_n(w) \rangle - L_n(w^*) \right).$$
In Section 2.3, we used the LLC to characterize each polygon phase by its complexity. This is natural since the LLC is the leading-order correction to the local geometry. Expectation values for other observables tell us additional, finer-grained structure in this geometry.
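As a sanity check on this estimator, consider a case where the localized posterior can be sampled exactly: a nondegenerate quadratic loss in $d$ dimensions, whose learning coefficient is the classical $d/2$. In the sketch below, exact Gaussian draws stand in for SGLD, and the constants are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

# Quadratic loss L(w) = 0.5 * ||w||^2 in d dims, minimized at w* = 0.
# The localized posterior ~ exp(-n*beta*L - gamma/2 * ||w||^2) is then a
# Gaussian with per-coordinate variance 1 / (n*beta + gamma).
d, nbeta, gamma = 2, 1000.0, 1.0
draws = rng.standard_normal((100_000, d)) / np.sqrt(nbeta + gamma)
L = 0.5 * np.sum(draws ** 2, axis=1)

# lambda_hat = n*beta * (E[L] - L(w*)), with L(w*) = 0 at the origin.
lambda_hat = nbeta * L.mean()
```

For $n\beta$ large relative to $\gamma$, this recovers $\hat{\lambda} \approx d/2 = 1$, the expected value for a regular two-dimensional quadratic.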
From Expectations to Susceptibilities
Standard deep learning optimizes point values like $L_n(w)$. Patterning proposes to optimize the expectations $\langle \phi \rangle$ instead. Where SGD differentiates the loss with respect to model weights, patterning differentiates structural expectations with respect to the data distribution. These derivatives tell us how applying a given control input changes behavior.
To make this concrete, we introduce importance weights $\epsilon = (\epsilon_1, \dots, \epsilon_n)$ into the empirical loss:
$$L_\epsilon(w) = \frac{1}{n} \sum_{k=1}^{n} (1 + \epsilon_k)\, \ell_k(w),$$
where $\ell_k(w)$ is the loss on sample $x_k$. When integer-valued, the importance weight $\epsilon_k$ can be interpreted as the number of additional copies of a particular sample $x_k$. But we can relax this to allow arbitrary real-valued importances, making the posterior expectations $\langle \phi \rangle_\epsilon$ a smooth function of $\epsilon$.
The derivative of $\langle \phi \rangle_\epsilon$ with respect to $\epsilon$ (up to a constant multiple) is known as a susceptibility. More generally, susceptibilities can be defined as derivatives of expectation values with respect to any parameters the posterior depends on, but throughout this work we use only per-sample weights:
$$\chi_{jk} = \frac{\partial \langle \phi_j \rangle_\epsilon}{\partial \epsilon_k}\bigg|_{\epsilon = 0}.$$
The entry $\chi_{jk}$ measures how observable $\phi_j$ responds when we infinitesimally upweight sample $x_k$. It maps changes in data importance $\epsilon$ to changes in expectations $\langle \phi \rangle$.
From Derivatives to Covariances
This derivative has a far more practical alternative form. The fluctuation–dissipation theorem converts the susceptibility into a covariance under the unperturbed posterior,
$$\chi_{jk} = -\beta \Big( \langle \phi_j\, \ell_k \rangle - \langle \phi_j \rangle \langle \ell_k \rangle \Big) = -\beta\, \mathrm{Cov}\big( \phi_j, \ell_k \big).$$
We can learn how the system would respond to a perturbation by observing how it fluctuates in the absence of one. Fluctuations encode the local geometry of the loss landscape; responses encode how that geometry shifts under changes in data. Susceptibilities connect the two.
Methodologically, this makes it possible to estimate susceptibilities in practice. Without it, computing susceptibilities independently for each perturbation would be intractable. The covariance form means we never need to actually change the data distribution. We can repurpose the same SGLD draws used to estimate the original expectation values:
$$\hat{\chi}_{jk} = -\beta \left( \frac{1}{T} \sum_{t=1}^{T} \phi_j(w_t)\, \ell_k(w_t) - \bar{\phi}_j\, \bar{\ell}_k \right),$$
where $\bar{\phi}_j$ and $\bar{\ell}_k$ denote averages over the draws $w_1, \dots, w_T$.
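A sketch of this estimator over stored draws; the overall constant is folded into the `beta` argument, and the names are ours.

```python
import numpy as np

def susceptibility(phi_draws, loss_draws, beta):
    """chi[j, k] ~= -beta * Cov(phi_j, ell_k) over posterior draws.

    phi_draws:  (T, J) observable values at each draw
    loss_draws: (T, K) per-sample losses at each draw"""
    phi_c = phi_draws - phi_draws.mean(axis=0)
    ell_c = loss_draws - loss_draws.mean(axis=0)
    return -beta * (phi_c.T @ ell_c) / phi_draws.shape[0]
```

The same arrays of recorded observables and per-sample losses thus yield the entire susceptibility matrix in one pass, with no re-sampling per perturbation.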
To visualize these susceptibilities, we return to the trained 5-gon. We run SGLD and color each posterior sample by its per-input loss for each of the unique inputs (darker means higher loss).
The colored clouds reveal the loss landscape from each input’s perspective. When feature 0 is the only active feature (input (1,0,0,0,0)), the loss is highest when the posterior fluctuates away from its optimal position. When features 0 and 1 co-fire (input (1,1,0,0,0)), the loss is highest when either feature drifts away from the other.
In Section 2.3, we described development as a sequence of polygon phases separated by discrete transitions. With susceptibilities in hand, we can now examine one such transition with more resolution.
Tracking the susceptibilities over the course of training makes this developmental structure visible.
During stable phases, susceptibilities are small: the model sits at a critical point and is relatively insensitive to changes in the data. At transitions, they spike. The loss landscape is reorganizing, the posterior spreads out, and the model becomes maximally responsive to perturbations at exactly the moment it is deciding what to learn next.
In biological development, these transitions are where morphogens act. A cell at a bifurcation is maximally sensitive to molecular signals in its environment. A small concentration gradient, applied at the right moment, is enough to commit a tissue to one fate, while the same signal during a stable phase would do almost nothing. Susceptibility peaks tell us both how to intervene (which samples to reweight) and when (at transitions, when the model is responsive).
What’s left is to turn these measurements into an actual control signal. In Section 6, we show that some of these bifurcations constitute second-order phase transitions in the precise sense of statistical physics, and that a small bias to the data distribution can deterministically choose which branch the system takes.
Inverting Susceptibilities
Spectroscopy asks how structural coordinates respond to data. Patterning inverts this to ask what change to the data distribution would be needed to produce a desired change in the observables.
The basic idea of optimizing the data mixture rather than model weights alone is not new (see curriculum learning, data augmentation, and domain adaptation). What’s new is that susceptibilities provide a way to compute the right change. MISSING (0000) introduced this under the heading of “patterning” and demonstrated it in small language models and synthetic models, showing that inverting the susceptibility matrix can both accelerate and delay the formation of structure during training, and that susceptibilities can be used to choose among solutions that achieve the same loss. Susceptibilities are to patterning what the gradient of the loss is to SGD: they tell you which direction to move.
In Section 4.1, we present the patterning equation, which maps a desired structural change to a set of per-sample weights. In Section 4.2, we apply this to the 4-to-5-gon transition, controlling which gap the new feature grows into. In Section 4.3, we analytically solve this transition to show why the full posterior is necessary: the Hessian is exactly zero at the bifurcation point, and the signal that patterning uses to steer lives entirely in higher-order geometry.
The Patterning Equation
Given a target change $\Delta\phi$, we want the minimum-norm sample reweighting $\delta\epsilon$ such that $\chi\, \delta\epsilon = \Delta\phi$. The solution is:
$$\delta\epsilon = \chi^\top (\chi \chi^\top)^{-1}\, \Delta\phi = \chi^{+}\, \Delta\phi.$$
In practice, we form the Gram matrix $G = \chi \chi^\top$ and add ridge regularization for numerical stability. We solve $(G + \lambda_{\text{ridge}} I)\, a = \Delta\phi$ and project back: $\delta\epsilon = \chi^\top a$. The per-sample scores tell us how much to upweight ($\delta\epsilon_k > 0$) or downweight ($\delta\epsilon_k < 0$) each training sample.
From $\delta\epsilon$ to training weights. The patterning equation produces a vector $\delta\epsilon$ that can take arbitrary sign and magnitude. To convert it into valid per-sample training weights, we start from the natural data weights (the Bernoulli base distribution over input types, normalized to mean 1), add the perturbation, floor at a small positive value to prevent zero or negative weights, and renormalize:
$$\epsilon \;\leftarrow\; \mathrm{normalize}\big( \max(\epsilon^{\text{base}} + \alpha\, \delta\epsilon,\; \epsilon_{\min}) \big).$$
The scalar $\alpha$ controls the overall perturbation scale. Renormalization to mean 1 ensures the effective learning rate is unchanged. Note that the flooring and renormalization modify the actual applied perturbation: the prediction is the effective change in observables, not the raw output of the pseudoinverse. For small $\alpha$ these coincide; for large $\alpha$ the floor clips extreme negative entries and the prediction accounts for this.
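The two steps above (the minimum-norm solve, then conversion to valid weights) can be sketched as follows; `patterning_step` and `to_sample_weights` are hypothetical helper names of ours.

```python
import numpy as np

def patterning_step(chi, delta_phi, ridge=1e-6):
    """Minimum-norm reweighting with chi @ d_eps ~= delta_phi:
    solve (chi chi^T + ridge*I) a = delta_phi, then d_eps = chi^T a."""
    G = chi @ chi.T + ridge * np.eye(chi.shape[0])
    a = np.linalg.solve(G, delta_phi)
    return chi.T @ a

def to_sample_weights(base, d_eps, alpha=1.0, floor=1e-3):
    """Perturb the natural data weights, floor, renormalize to mean 1."""
    w = np.maximum(base + alpha * d_eps, floor)
    return w / w.mean()

# Toy example: 2 observables, 3 input types.
chi = np.array([[1.0, 0.0, 1.0],
                [0.0, 2.0, 0.0]])
d_eps = patterning_step(chi, np.array([1.0, -1.0]))
weights = to_sample_weights(np.ones(3), d_eps)
```

For small perturbations the achieved change `chi @ d_eps` matches the target up to the ridge tolerance, and the resulting weights remain positive with mean 1.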
Choosing Bifurcations
Susceptibilities tell us when the model is most sensitive to intervention and what effect different data perturbations will have. The patterning equation tells us which data perturbations are necessary to achieve a target effect. Now we test whether we can use this to choose the outcome of a bifurcation.
Starting from a perfect 4-gon with the fifth feature at the origin, there are four gaps it can grow into. In the language of observables, placing f4 between f0 and f1 means targeting the f0–f4 and f1–f4 angles at 72° (the angle between neighboring features in a regular pentagon), leaving the angles between f4 and non-neighbors unconstrained.
In biological development, bifurcations are controlled by the concentrations of different morphogens. Here, the per-sample weights play the same role. They define a concentration field over data space that the model is exposed to during training, and the patterning equation tells us which field to apply.
We compute the reweighting only once at the bifurcation point and train forward with the modified distribution (we extend this to an iterative procedure in the next section). Across all four gaps, patterning places the new feature in the specified location. Precision is high, though it varies somewhat with sparsity: in some regimes the transitions are sharper and easier to control than others.
Analyzing the Bifurcation
At the 4-to-5-gon transition, the dead feature sits at the origin. The total loss near this point turns out to be purely quartic in the dead feature’s norm: the quadratic term vanishes exactly due to a cancellation between interference costs and self-reconstruction benefits. The Hessian at the origin is the zero matrix.
This has two consequences. First, any method that relies on at most second-order information (including adaptive stochastic optimizers like Adam and classical influence functions) gets no signal at this point, and will struggle even in its vicinity: the loss gradients with respect to the dead feature are zero, so there is nothing for the Hessian to invert. Second, the angular structure that determines which gap the feature grows into lives entirely in the per-sample losses, not the total loss. This is the structure that susceptibilities detect and that patterning exploits.
Analytical treatment of the 4-to-5-gon transition
At the transition point, the system sits at a critical point—a saddle in the loss landscape. Near this saddle, the loss has the form:
$$L(r) = L_0 + \frac{a}{2}\, r^2 + \frac{b}{4}\, r^4 + \cdots,$$
where $r$ is the magnitude of the new feature. When $a < 0$, the saddle becomes unstable and the feature spontaneously grows—but the critical point doesn’t specify which branch. That choice depends on which valley the system descends into.
Four active features form a regular square at unit norm, with the dead feature parametrised in polar coordinates: $W_4 = r\,(\cos\theta, \sin\theta)$. By parameterising in this way, the total loss turns out to be remarkably clean:
$$L(r, \theta) = L_0 + c\, r^4.$$
The quadratic term in $r$ vanishes exactly, for all angles $\theta$. This cancellation has a satisfying origin: the interference that a growing $W_4$ causes on the four existing features contributes a positive $O(r^2)$ term to the loss (summing over the four quadrants via the tight-frame property of the square), while the self-reconstruction benefit of having a nonzero $W_4$ contributes an equal negative $O(r^2)$ term (from the reconstruction of its own feature). These cancel perfectly, leaving only the quartic. The Hessian at the origin is the zero matrix.
The posterior at this point factorizes in a useful way: the angle $\theta$ is uniformly distributed and independent of $r$, while $r$ follows the quartic distribution $p(r) \propto r\, e^{-n\beta c\, r^4}$. While the total loss is independent of $\theta$, the individual per-feature losses are not. Each feature’s loss fills exactly one quadrant of the angle.
Feature 0’s loss is largest when $W_4$ points toward $W_0$, because that is where the interference is worst. This angular structure — invisible in the total loss but present in each per-sample loss — is what allows data reweighting to steer the direction of growth.
Computing the covariances under the quartic posterior gives the susceptibility of the dot product observables in closed form, depicted above. The susceptibilities tell us that increasing the weight on feature $i$’s data pushes the dead feature away from feature $i$ and toward the diametrically opposite feature, because the added weight penalizes interference in that direction. Orthogonal features have zero effect on each other’s angular susceptibility. The dead feature’s own weight does not bias the angular direction at all — it is a uniform destabilizer, controlling whether the feature grows but not where.
The moments where developmental control matters most are precisely the singular points where classical methods fail. This is not a limitation of a particular implementation; it is a structural feature of bifurcations: the signal that patterning needs lives in higher-order geometry that the Hessian cannot see.
We can similarly pattern the 3-to-4-gon transition, targeting an angle of 60 degrees between the new feature and its two neighbors.
Patterning
The patterning equation gives us a gradient in the space of data distributions. Sometimes a single step along this gradient tilts the loss landscape such that our targets are immediately obtainable. In the previous section, we applied a single patterning step to control a bifurcation – compute the reweighting once, train forward. This is enough when the path to the desired posterior is direct. However, the real power lies in iteratively following this gradient: estimate susceptibilities at the current weights, compute a new reweighting, train, and repeat. This carves a trajectory through data-space, much like how weight-space gradient descent is used to train models, and is patterning proper.
In Section 5.1, we describe this loop. In Section 5.2, we demonstrate it by swapping two features in a trained 5-gon, moving between two global minima separated by a large loss barrier. In Section 5.3, we step back and consider what patterning is doing geometrically: tracing a trajectory not through weight space but through the space of loss landscapes.
The Patterning Loop
The simplest version of this loop is as follows:
- Sample the posterior. Run SGLD at the current weights to collect draws $\{w_t\}_{t=1}^{T}$.
- Measure and estimate. From these draws, compute the current posterior expectations $\langle \phi \rangle$ and estimate the susceptibility matrix $\chi$.
- Compute the tracking error. $\Delta\phi = \phi^{\text{target}} - \langle \phi \rangle$.
- Solve for the data perturbation. $\delta\epsilon = \chi^\top (\chi \chi^\top + \lambda_{\text{ridge}} I)^{-1}\, \Delta\phi$.
- Update sample weights. Add $\alpha\, \delta\epsilon$ to the current weights, then floor and renormalize as in Section 4.1.
- Train. Take gradient steps on the reweighted loss $L_\epsilon(w)$.
- Return to step 1.
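The loop above can be sketched as a generic proportional controller. The “plant” below is a deliberately trivial stand-in in which observables respond linearly, and only partially, to each reweighting; it is not the TMS experiment, and all names are ours.

```python
import numpy as np

def patterning_loop(measure, estimate_chi, train, target,
                    n_rounds=20, ridge=1e-8):
    """Measure expectations, estimate chi, solve for d_eps, train, repeat."""
    for _ in range(n_rounds):
        phi = measure()                   # current expectations
        err = target - phi                # tracking error
        chi = estimate_chi()              # susceptibility matrix
        G = chi @ chi.T + ridge * np.eye(len(err))
        d_eps = chi.T @ np.linalg.solve(G, err)   # minimum-norm reweighting
        train(d_eps)                      # train on the reweighted loss
    return measure()

# Trivial plant: each training round realizes half the predicted change.
state = {"phi": np.zeros(2)}
chi = np.array([[1.0, 0.5, 0.0],
                [0.0, 1.0, 1.0]])
final = patterning_loop(
    measure=lambda: state["phi"].copy(),
    estimate_chi=lambda: chi,
    train=lambda d: state.__setitem__("phi", state["phi"] + 0.5 * (chi @ d)),
    target=np.array([1.0, -1.0]),
)
```

Because each round closes the remaining tracking error, this toy plant converges geometrically to the target.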
[Bilevel optimization: a series of panels showing the contour plot of a loss landscape. The first panel labels a starting mode and a target mode. From left to right, the landscape morphs through a second-order phase transition that annihilates the two modes against each other in the middle and then pulls them apart again. Within each panel, a small SGD run converges from the previous local minimum to the new, modified local minimum. As we cross the annihilation, SGD ends up on the other side of the newly forming energy barrier, arriving at the final target mode.]
Rewriting Minima
By controlling bifurcations we choose which valley Waddington’s marble descends into. But what about moving between valleys — taking a model that has already settled and pushing it somewhere else?
In Waddington’s original picture, this is forbidden. The ridges between valleys grow steeper as the marble descends; a differentiated cell cannot become a different cell type. For decades, biologists believed this was a one-way street. Then in 2006, Shinya Yamanaka showed that a handful of transcription factors could reprogram adult cells back into stem cells — the marble could be pushed back up the hill.
We take a trained 5-gon and ask: can we rearrange its features into a specified configuration? The target has the same loss as the original (by symmetry of the polygon), but there is no low-loss path between the two under the natural data distribution — each is a global minimum separated by a loss barrier.
Because patterning can reweight data samples, we don’t need to follow the gradient on the original loss — we can reshape the landscape itself. We change which data the model sees, and what was a ridge becomes a valley. The marble still rolls downhill, but downhill now leads somewhere new.
Classifying Targets
Our observables are the ten pairwise cosine similarities $\cos\angle(W_i, W_j)$ and the five feature norms $\lVert W_i \rVert$. Because every target is a rearrangement of the same regular pentagon, all norms are identical across targets — the targets differ only in which features are adjacent and which are not.
On a regular pentagon, the cosine between two features takes one of just two values, determined by whether the features are adjacent (gap 1, one edge apart) or non-adjacent (gap 2, two edges apart) on the original ring. A target is therefore fully specified by its gap pattern: for each pair of positions on the ring, are the two features placed there gap-1 or gap-2 apart?
Many different-looking rearrangements produce the same gap pattern. For instance, the 3-cycle and the double swap place different features at every position, yet at every position pair the gap between the features placed there is the same — so the two rearrangements define the same target. This happens because one arrangement can be obtained from the other by relabeling all features via a rotation of the pentagon, which preserves all gaps.
Working out which rearrangements are observationally distinct, we find exactly 12 gap patterns — the 120 permutations of five features, modulo the 10-element dihedral group of rotations and reflections. One is the identity. The remaining 11 non-trivial targets are naturally organized by how many of the five original neighbor pairs they destroy — the break count.
2-breaks. Two features of the starting polygon exchange positions. The rest of the polygon is undisturbed — three of the five original neighbor pairs survive. Any single swap between features falls in this class, and all are equivalent under the dihedral symmetry. For example, swapping f2 ↔ f3 and swapping f4 ↔ f1 are the same task: both swap a pair of adjacent features, and the starting polygon’s rotational symmetry maps one to the other.
3-breaks. Only two of the five original neighbor pairs survive. Each 3-break target can be described equally well as a 3-cycle rotating three features or as two simultaneous swaps — these very different operations produce the same gap pattern at every position pair, so they define the same observable target. All five 3-break targets are equivalent under the dihedral symmetry.
A rearrangement can break 0, 2, 3, or 5 of the five original neighbor pairs. Break counts of 1 and 4 are impossible: four intact edges of the pentagon necessarily form a path whose two endpoints are adjacent, but the single remaining step would need to connect non-adjacent features — a contradiction. (By a duality that swaps gap-1 and gap-2 steps, the impossibility of 1-break implies the impossibility of 4-break.)
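This classification is small enough to check by brute force. The sketch below (function names ours) enumerates all 120 placements, groups them by gap pattern, and tallies break counts:

```python
from itertools import permutations

def gap(a, b):
    """Ring distance on the pentagon: 1 = adjacent, 2 = non-adjacent."""
    d = abs(a - b) % 5
    return min(d, 5 - d)

def gap_pattern(perm):
    """For each pair of ring positions, the original-pentagon gap between
    the two features placed there; equal patterns = same target."""
    return tuple(gap(perm[i], perm[j])
                 for i in range(5) for j in range(i + 1, 5))

def break_count(perm):
    """How many of the five original neighbor pairs end up non-adjacent."""
    pos = {f: i for i, f in enumerate(perm)}
    return sum(1 for f in range(5) if gap(pos[f], pos[(f + 1) % 5]) != 1)

classes = {}
for p in permutations(range(5)):
    classes.setdefault(gap_pattern(p), []).append(p)
```

Running this confirms the counts above: 12 classes of 10 permutations each, with break counts drawn from {0, 2, 3, 5}.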
The dihedral symmetry of the starting pentagon makes all targets with the same break count equivalent patterning problems: a rotation or reflection maps any one to any other, so the susceptibility structure, data reweightings, and patterning trajectory are identical up to relabeling of features.
We test one representative target from each class, comparing three patterning regimes: single-step (one data reweighting, then train to convergence), periodic recomputation (re-estimate every 10 epochs), and adaptive recomputation (re-estimate when the weighted loss plateaus).
The figure above shows this trajectory for the feature swap. Each row of the heatmap is an input type (labelled by which features are active); each column is a patterning step; color indicates the accumulated sample weight for that input.
Two things are immediately visible. First, we can clearly see several phases in which different inputs are upweighted. Early on, while still approaching the crossover (which happens at epoch 97), the weights fluctuate: work is being done to bring features zero and one together. Once feature zero is on the right-hand side of feature one (the features have crossed), the work is mostly done; the natural, unweighted structure of the loss landscape would be enough to reach the target. After the successful crossover the weights stabilize until the permuted 5-gon is reached.
Second, the weightings are interpretable.
Let’s examine what the controller ‘thinks’ its data weights are going to do to the model. We can check this, as before, by reading each input’s predicted effect off the susceptibilities.
Both the upweighted ($\delta\epsilon_k > 0$) and downweighted ($\delta\epsilon_k < 0$) inputs are easy to understand. On the upweighting front, input (1,0,0,0,1) pushes f0 and f4 apart, and in doing so moves f0 inward towards f1 (nice!); input (0,1,1,0,0) does a symmetric thing, pushing f1 and f2 apart, similarly moving f1 and f0 together; input (1,0,1,0,0) drags f0 and f2 upwards; finally, input (0,1,0,0,1) is its symmetric partner, driving f1 up and across. At epoch 7 these four inputs are nearly the only inputs being upweighted, and the model’s movement is a linear combination of their effects.
Each column of χ shows a single input’s predicted effect on all 15 observables. The patterning equation solves for a linear combination of these columns: find per-input weights such that the weighted sum of the per-input effects adds up to the desired structural change.
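In NumPy, this solve is a one-liner. A minimal sketch — the 15 observables come from the post, but the number of input types (31) and the matrix entries here are hypothetical stand-ins, not measured susceptibilities:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative shapes: 15 observables (as in the post); the number of
# input types (31) and all matrix entries are hypothetical stand-ins.
n_obs, n_inputs = 15, 31
chi = rng.normal(size=(n_obs, n_inputs))   # chi[i, j]: effect of input j on observable i
delta_target = rng.normal(size=n_obs)      # desired structural change

# Patterning equation as a least-squares solve: find per-input weights h
# whose combined predicted effect chi @ h matches the target.
h, *_ = np.linalg.lstsq(chi, delta_target, rcond=None)

predicted = chi @ h                         # linear combination of per-input effects
```

With more input types than observables the system is underdetermined, and `lstsq` returns the minimum-norm reweighting that achieves the target exactly.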
The Patterning Controller
The idea that statistical observations of a system’s behaviour can provide the information needed to control it — without requiring a mechanistic model of its internals — is among the founding insights of modern control theory (Wiener, 1948). The framework is general: a plant is the system being steered, a controller observes its output, compares it to a target, and computes an intervention. The gap between target and output is the tracking error. The loop closes when the intervention changes the output, which changes the tracking error, which changes the next intervention.
The patterning loop wraps the susceptibility computation in a feedback controller. At each step, the controller observes the current structure (measuring the observables), estimates the plant (computing χ), computes an intervention (solving for the sample weights), applies it (training on the reweighted loss), and repeats. This is proportional control in its simplest form.
Rather than estimating the plant on a fixed schedule, we can let the system tell us when it needs new instructions. Between recomputation points, the model trains on fixed sample weights — following the gradient downhill on a static reweighted loss landscape. As long as the landscape is guiding the model toward the target, no intervention is needed. It is only when training plateaus — when the current weights have done all they can — that we need to re-estimate χ, reassess the tracking error, and compute a fresh perturbation.
This is a form of event-triggered control (Åström, 2008): rather than sampling the plant at a fixed rate (which wastes computation when the system is already moving in the right direction), the controller fires only when the output stalls. The trigger condition is simple: when the weighted loss stops decreasing, the current data distribution has been fully exploited and the landscape needs reshaping.
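The trigger condition fits in a few lines. A hedged sketch — the window size and tolerance below are hypothetical choices, not the post's calibrated values:

```python
def plateaued(weighted_losses, window=5, tol=1e-4):
    """Event trigger: fire when the weighted loss has stopped decreasing.

    Compares the mean loss over the last `window` epochs to the mean over
    the preceding `window`; fires when the improvement falls below `tol`.
    (Window and tolerance are hypothetical, not the post's calibration.)
    """
    if len(weighted_losses) < 2 * window:
        return False  # not enough history to judge
    recent = sum(weighted_losses[-window:]) / window
    previous = sum(weighted_losses[-2 * window:-window]) / window
    return previous - recent < tol
```

Averaging over a window rather than comparing single epochs keeps the trigger from firing on minibatch noise.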
The resulting controller is remarkably efficient. When patterning toward a 3-break target, a fixed schedule of one susceptibility estimate every 10 epochs takes about 120 epochs to reach the correct permutation, and around 230 to settle into a perfect 5-gon: 12 and 23 patterning steps, respectively. With event-triggered recomputation, as few as 3 patterning steps suffice to reach the correct permutation, and a single extra step hits the perfect 5-gon. This works because the trigger, when properly calibrated, fires precisely at structural boundaries: when features begin to move, when they cross, and when they need to settle into their new positions.
TODO: Talk about minimal weight interventions and phases
Related Work
This work sits at the intersection of several research areas: singular learning theory, data attribution and optimization, constrained optimization, and alignment. We survey each below, organized around how patterning relates to and differs from existing approaches.
Singular Learning Theory and Developmental Interpretability
Patterning comes most directly out of a research agenda we have been developing over the past three years [MISSING (0000), MISSING (0000)], applying singular learning theory (SLT; Watanabe (2009)) to interpretability and alignment.
What is singular learning theory? First, some background: many classical results in learning theory (including asymptotic normality, the Bayesian Information Criterion, the Bernstein–von Mises theorem) assume regular models, that is, models whose loss landscapes have isolated, locally quadratic minima. For neural networks, this assumption fails. The loss landscape is degenerate, with non-isolated minima whose Hessians have zero eigenvalues. The classical results built on regularity do not apply.
Singular learning theory relaxes the regularity assumption and generalizes many of the classical results to the singular setting. It is a theory of Bayesian learning, in which the learning process is governed by Bayes’ rule as the dataset size n grows, rather than by gradient descent dynamics as the training timestep t increases. Watanabe’s central result (Watanabe 2009) derives closed-form asymptotics for (in-distribution) generalization error in the large-data limit, replacing the parameter count in the BIC with a geometric invariant known as the real log canonical threshold (RLCT) or, alternatively, the learning coefficient.
The developmental interpretability agenda. The developmental interpretability agenda (MISSING 0000) began with several key hypotheses. First, that Watanabe’s Bayesian theory could serve as a productive idealized model of SGD-based training. This has borne out empirically across a series of works spanning toy models and now increasingly larger language models (MISSING 0000). Second, that SLT could be extended beyond the asymptotic, in-distribution, Bayesian setting into a more general theory of deep learning, even at finite dataset sizes and in other limiting regimes, even under distribution shifts, and even for non-Bayesian learning algorithms.
Susceptibilities [MISSING (0000), MISSING (0000), MISSING (0000), MISSING (0000), MISSING (0000), MISSING (0000)] are the first step towards the second extension: they measure the first-order correction to posterior expectations under a vanishing shift in the data distribution, and form the basis of our approach to interpretability, which we call “Spectroscopy”. This provides a principled route towards understanding how model internals and behaviors respond to changes in the data distribution.
Patterning, introduced by MISSING (0000) and applied to the TMS here, provides evidence for the last extension. The results in Section TODO demonstrate that predictions derived from the Bayesian posterior successfully predict how SGD training responds to perturbations of the data distribution. This was not guaranteed by the theory. It is an empirical finding that supports the eventual unification of Bayesian and optimization-based perspectives on deep learning.
Influence Functions and Data Attribution
Susceptibilities are a generalization of influence functions. Influence functions [MISSING (0000), MISSING (0000)] measure the sensitivity of model predictions to changes in training data importance. This can be seen as a special case of a susceptibility where the perturbation is a per-sample weight (the same as considered here) and the posterior can be approximated as Gaussian (MISSING 0000):
$$\mathrm{IF}(z_i) = -H^{-1}\,\nabla_w \ell(z_i, w^*),$$

where $H$ is the Hessian of the training loss at the local minimum $w^*$.
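The classical formula can be checked end-to-end on a regular 1-D model, where the Hessian is a positive scalar. A sketch under our own toy construction (the data values are arbitrary):

```python
import numpy as np

# Regular toy model: 1-D least squares, mean loss L(w) = mean((y - w*x)**2) / 2.
x = np.array([1.0, 2.0, 3.0])
y = np.array([1.1, 1.9, 3.2])

w_star = (x @ y) / (x @ x)        # exact minimizer
H = np.mean(x**2)                 # Hessian of the mean loss (a positive scalar)
grads = -(y - w_star * x) * x     # per-sample gradients at the minimum

# Influence of upweighting sample i on the learned parameter: -H^{-1} grad_i
influence = -grads / H
```

Upweighting sample i by a small epsilon and retraining moves the minimizer by approximately `eps * influence[i]`, which is the defining property of the influence function.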
Influence functions face two major problems, one practical and one theoretical. The practical problem is that Hessian inversion is intractable for large models. The theoretical problem is that the Hessian inverse is not even well-defined for neural networks. Susceptibilities solve both.
Susceptibilities bypass the Hessian inversion. Many methods within the field of training data attribution (TDA) can be understood as approximations to the full Hessian inverse. This includes EK-FAC (MISSING 0000), TRAK (MISSING 0000), DataInf (MISSING 0000), and even the simple inner product of gradients.
Estimating susceptibilities bypasses the Hessian inversion step entirely by directly computing statistics over the local loss landscape. This comes with a different set of scaling exponents that in many cases outperforms Hessian-based methods (MISSING 0000).
Susceptibilities measure influence at all orders. Classical influence functions assume that the model is at an isolated local minimum with an invertible Hessian H. These are precisely the same regularity assumptions we discussed previously that break down for neural networks. As a consequence of this breakdown, the empirical Hessian inverse is technically ill-defined. This can be “fixed” by regularizing the Hessian, H → H + λI, but, as we saw in Section TODO, this is not enough: at the 4-to-5-gon bifurcation, the Hessian is exactly zero and contains no information about which branch the model will take. Regularizing makes the matrix invertible but does not conjure any new signal.
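A one-line computation makes the point concrete. For the quartic loss L(w) = w⁴/4 the Hessian at the minimum is exactly zero, so the regularized inverse is determined entirely by the damping constant (the value of λ here is illustrative):

```python
# Degenerate 1-D critical point: L(w) = w**4 / 4.
# d2L/dw2 = 3 * w**2, which is exactly zero at the minimum w = 0.
def hessian(w):
    return 3 * w**2

H = hessian(0.0)        # 0.0: the quadratic term carries no signal
lam = 1e-3              # damping coefficient (arbitrary illustrative choice)
H_reg = H + lam         # invertible, but purely an artifact of lam
inv = 1.0 / H_reg       # = 1/lam, independent of the actual landscape

# The information lives at fourth order: d4L/dw4 = 6 at w = 0.
fourth_derivative = 6.0
```

Whatever damping you pick, the "inverse Hessian" is just 1/λ; the quartic structure that actually governs the local geometry never enters.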
Understanding the response of the model in general, especially at bifurcations, requires going to fourth order in loss landscape geometry. This is exactly the information that is available to susceptibilities.
Data Optimization
The idea of optimizing the data distribution to shape what models learn has a long history.
Data optimization as bilevel optimization. The most general framing is bilevel optimization over data weights [MISSING (0000), MISSING (0000)]: minimize an outer objective defined in terms of a model trained on a weighted dataset:
$$\boldsymbol{h}^* = \operatorname*{argmin}_{\boldsymbol h}\; L(w^*(\boldsymbol h)).$$

By applying the chain rule and invoking the implicit function theorem, the hypergradient for sample $i$ is precisely the classical influence function.
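The hypergradient can be verified on a tiny inner problem with a closed-form argmin. A sketch under our own construction — the data, the held-out point, and the weighting scheme are all hypothetical:

```python
import numpy as np

# Inner problem: weighted 1-D least squares with closed-form argmin.
x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.5, 2.0])
x_val, y_val = 1.5, 1.6                  # hypothetical held-out point (outer objective)

def w_star(h):
    """argmin_w of sum_i h_i * (y_i - w * x_i)**2 / 2."""
    return (h * x * y).sum() / (h * x**2).sum()

def outer(h):
    """Outer objective L(w*(h)): loss on the held-out point."""
    return 0.5 * (y_val - w_star(h) * x_val) ** 2

h0 = np.ones(3)
w0 = w_star(h0)
H = (h0 * x**2).sum()                    # inner Hessian at the optimum
per_sample_grad = -(y - w0 * x) * x      # d(loss_i)/dw at w0

# Implicit function theorem: dw*/dh_i = -H^{-1} * d(loss_i)/dw,
# so the hypergradient is the chain rule through w*.
dL_dw = -(y_val - w0 * x_val) * x_val
hypergrad = dL_dw * (-per_sample_grad / H)
```

The per-sample factor `-per_sample_grad / H` is exactly the influence function from the previous section; the hypergradient is just its projection onto the outer objective.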
Data optimization in practice. Actually implementing this is highly intractable, even with state-of-the-art influence function approximations, because the influence functions must be recalculated repeatedly over the course of training. In practice, therefore, influence functions are reserved for one-time interventions and ad-hoc rules (e.g., filtering out samples below a certain threshold, or implementing a heuristic schedule based on influence rankings; see, e.g., MISSING (0000), MISSING (0000), MISSING (0000)).
In fact, most work abandons influence functions altogether, relying instead on heuristic proxies rather than actual response signals. The current state-of-the-art in actual practice optimizes at the level of domain mixture weights rather than individual samples (e.g., DoReMi, DoGE, RegMix, and related methods). This trains proxy models to predict loss contributions or extrapolates from granular mixture-level scaling laws.
There are several problems closely related to data mixture optimization. For example, curriculum learning focuses on ordering samples rather than weighting them; here, it is common to order samples by an increasing measure of difficulty [MISSING (0000), MISSING (0000), MISSING (0000)]. Meanwhile, data pruning and coreset selection [MISSING (0000), MISSING (0000)] restrict reweightings to binary include/exclude decisions.
How patterning generalizes this. The preceding section established that susceptibilities generalize influence functions as a response signal. Patterning can be understood as enforcing two changes on the bilevel optimization framing. First, we replace the point estimate of the outer objective with an expectation value. Second, we replace the scalar target with a specification of multiple objectives. How the resulting reweighting is applied (additive update, mirror descent on the simplex, or other) is a separate design choice downstream of both frameworks.
Constrained Optimization
Training against multi-objective specifications rather than loss alone connects patterning to a broader literature on constrained learning. The key distinction is between what is being optimized (parameters vs. data distribution) and where the constraints are imposed (behavioral outputs vs. internal structure via expectation values).
Penalty methods are the simplest approach: add weighted regularization terms to the loss, one for each additional target. This approach is limited, however (MISSING 0000): there need not exist penalty coefficients that satisfy all constraints simultaneously, and tuning the relative coefficients introduces costly hyperparameter overhead.
Lagrangian and primal-dual methods (MISSING 0000; MISSING 0000; MISSING 0000; MISSING 0000) address this limitation by making the penalty coefficients themselves learnable. The dual variables adapt during training: when a constraint is violated, its multiplier increases; when it is satisfied, the multiplier decreases or vanishes. Constrained Policy Optimization (MISSING 0000) and its successors enforce similar cost constraints in an RL setting throughout training via trust-region or primal-dual updates over policy parameters.
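The adapt-the-multiplier dynamic can be illustrated with a minimal primal-dual loop on a 1-D constrained problem (our own toy, with arbitrary step sizes):

```python
# Toy problem: minimize f(x) = x**2 subject to g(x) = 1 - x <= 0 (i.e. x >= 1).
# Optimum: x = 1 with multiplier lam = 2 (stationarity: f'(1) = lam * 1).
x, lam = 0.0, 0.0
eta = 0.05                      # step size (arbitrary illustrative choice)

for _ in range(5000):
    x -= eta * (2 * x - lam)    # primal descent on the Lagrangian x**2 + lam*(1 - x)
    lam += eta * (1 - x)        # dual ascent: lam grows while the constraint is violated
    lam = max(lam, 0.0)         # multipliers stay nonnegative
```

While x < 1 the violation 1 − x is positive, so the multiplier rises and pushes x toward feasibility; once the constraint is satisfied the ascent term shrinks and the pair settles at (1, 2).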
Patterning shares the core insight that training should be formulated against specifications, and its iterative loop — re-estimate susceptibilities, compute target-driven weights, train, repeat — has the same feedback-control structure as primal-dual optimization: the per-sample weights in patterning play an analogous role to the dual variables in Lagrangian methods, both adapting iteratively to drive the system toward a specification. But patterning differs in what is being optimized and where the specification lives. Constrained learning methods optimize over model parameters, subject to constraints on model outputs (e.g., fairness rates, robustness margins, worst-case loss on subpopulations). Patterning optimizes over the data distribution, targeting internal structure via expectation values of observables under the local posterior. The constraints are on the geometry of the loss landscape and thus internal structure, not on predictions. And because the specification is expressed in terms of posterior expectations rather than point-valued constraint functions, patterning remains well-defined at the singular points — bifurcations, phase transitions — where the constraint gradients that Lagrangian methods rely on degenerate.
Alignment
All current alignment techniques are effectively soft forms of constrained data optimization.
Alignment is data optimization. We use the same basic deep learning techniques but with empirically and heuristically chosen datasets. SFT is just finetuning on curated examples; RLHF is RL against a feedback signal coming from a learned reward model that has distilled an implicit alignment specification; DPO collapses this into a direct loss on preference pairs; Constitutional AI and deliberative alignment generate training data conditioned on a specification of natural-language principles. The alignment signal in each case lives in the data.
The design of these curricula is driven largely by evolutionary search: labs try different mixtures, train different models, and iterate on what works best. This may very well work. Or we may need to substitute a more principled way of learning data curricula. Traditional bilevel optimization using influence functions is both intractable and underpowered; patterning with susceptibilities offers a route to addressing both problems.
Alignment is constrained optimization. What these methods share is an indirect relationship between the alignment specification and the resulting model structure. In RLHF, the specification is implicit in the preference data — whatever patterns annotators happen to reward, compressed into a scalar signal by the reward model. In Constitutional AI, the specification is more explicit (natural language principles) but the translation to training signal passes through an opaque chain of AI critique, revised responses, and preference labels. In all cases, the practitioner specifies desired behavior through examples and hopes the model develops the right internal structure to generalize that behavior correctly.
Patterning offers a more direct path. Instead of specifying alignment through examples of preferred outputs, it specifies alignment through structural observables — expectation values that characterize the model’s internal geometry. Susceptibilities provide a differentiable mapping from “what structural change do I want” to “what data change produces it.” The claim is not that patterning replaces RLHF or DPO. It operates at a different level: current methods specify what outputs should look like; patterning can specify what internal structure should look like. These are complementary.
The open question is the compilation step: translating alignment desiderata (“be honest,” “don’t pursue power”) into constraints on expectation values. For the toy model, this is trivial. For real models, this is the hard part, and it is the same hard part that interpretability faces. Susceptibility-based interpretability [MISSING (0000), MISSING (0000)] provides a starting point by discovering observables from the posterior itself. The vision is a closed loop: interpretability discovers the structural vocabulary, patterning targets it, and the result is alignment that operates on internal structure rather than surface behavior.
Discussion
In this section, we step back from the specific results and consider what they mean for the broader project of understanding and controlling neural network development.
From Reading to Writing
The conceptual arc of this post mirrors a progression in biology: from anatomy (describing structure) to developmental biology (understanding how structure forms) to synthetic biology (controlling what structure forms). In the language of neural networks, this is the progression from mechanistic interpretability to developmental interpretability to patterning — and from models that are grown to models that are crafted.
The theoretical foundation for this progression is Structural Bayesianism (MISSING 0000): the hypothesis that the geometry of the loss landscape faithfully reflects a model’s internal computational structure. If this is right, then posterior expectation values provide a coordinate system over the space of possible internal structures. Optimizing these coordinates is optimizing structure itself; this is the natural way to do interpretability-guided training.
Patterning makes this actionable. Over the course of a patterning run, the per-sample weights trace a trajectory through the space of data distributions. But a data distribution is more than a list of weights. It defines a loss landscape, and therefore a posterior, each with its own critical points, basins, and bifurcation structure. A trajectory through data space is a trajectory through the space of loss landscapes.
If we had complete access to this space, we could ask structural questions directly. Which internal configurations exist as stable minima under some data distribution? Given two configurations, is there a continuous path through data space along which one minimum deforms into the other without crossing a barrier? Where are the catastrophe surfaces — the data weightings at which the qualitative structure of the landscape changes, where minima appear, merge, or vanish?
These questions are intractable to answer exhaustively, even for our 10-parameter model. What patterning provides is a tractable way to search through this space, guided by local measurements of its geometry. The feature swap in Section TODO demonstrates that this search can find paths between configurations separated by loss barriers — paths that exist in data space even when no gradient path connects them in weight space.
Why Singularities Are Where Control Lives
A recurring theme in this work is that the moments where control matters most are precisely the moments where standard methods fail. The 4-to-5-gon bifurcation is the clearest illustration. At the bifurcation point, the Hessian is exactly zero. Any method that approximates the loss landscape as locally quadratic — including influence functions, natural gradient methods, and second-order optimizers — gets no signal. The bifurcation is invisible to them.
The posterior, by contrast, captures the full geometry: the quartic potential, the angular structure of per-sample losses, and the resulting susceptibilities that encode the branching structure of the bifurcation. The anomalous n-scaling of the susceptibilities (exponents that deviate from the regular case) is the fingerprint of this singularity — a direct measurement of higher-order geometry that no Gaussian approximation can reproduce.
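The contrast between regular and singular scaling can be reproduced by quadrature in one dimension. In this illustrative construction (ours, not the post's measurement), the posterior second moment ⟨w²⟩ shrinks like n^(-1/2) under a quartic potential, versus the regular n^(-1) under a quadratic one:

```python
import numpy as np

def second_moment(n, power):
    """<w^2> under p(w) ∝ exp(-n * w**power), by grid quadrature."""
    w = np.linspace(-5.0, 5.0, 200_001)
    p = np.exp(-n * w**power)
    p /= p.sum()
    return float((w**2 * p).sum())

# Regular quadratic potential: <w^2> = 1/(2n), so 16x the data shrinks it 16x.
ratio_regular = second_moment(160.0, 2) / second_moment(10.0, 2)

# Singular quartic potential: <w^2> ∝ n**-0.5, so 16x the data shrinks it only 4x.
ratio_singular = second_moment(160.0, 4) / second_moment(10.0, 4)
```

The quartic exponent follows from the substitution u = n^(1/4) w, which rescales the whole integral; no Gaussian fit to the minimum can reproduce it.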
This is not a peculiarity of toy models. Neural network loss landscapes are generically singular: weight-space symmetries, parameter redundancies, and phase transitions all produce degenerate critical points where the Hessian fails to capture the relevant geometry. Much of the machine learning literature implicitly assumes regularity — that the Hessian at a trained model tells you what you need to know about the local landscape. For neural networks, this assumption breaks down at the points that matter most: the bifurcations, phase transitions, and symmetry-breaking moments where the model is deciding what internal structure to develop.
A common objection runs: “I accept that symmetries reduce the effective dimensionality. But you’re claiming the geometry involves terms beyond the quadratic. Why should I believe higher-order structure matters?” The burden of proof runs in the opposite direction. Regularity — the assumption that the geometry is entirely captured by the Hessian — is a very strong condition that holds only for a measure-zero set of models. The question is not “why should singularities matter?” but “why would you assume they don’t?” The bifurcation analysis in this post provides what is, to our knowledge, the strongest concrete demonstration of the practical consequences: a setting where the qualitative outcome of training depends entirely on geometry that is invisible to second-order methods.
This is the domain of singular learning theory. The tools we have used throughout — the local posterior, expectation values, the local learning coefficient, susceptibilities — are all SLT constructions. They were designed for exactly this situation: characterizing and exploiting the geometry of singular loss landscapes. The results of this post suggest that the same tools can be used not only to read that geometry (spectroscopy) but to write it (patterning).
Limitations and Open Problems
Why Not Just Train Against the Observables?
In this toy model, one could achieve the same level of structural control via standard constrained optimization techniques like adding penalty terms for our dot product observables. At the bifurcation, this would break the rotational symmetry and pick out a unique direction. Why then invoke the machinery of susceptibilities and patterning?
This objection is valid for the TMS setting in particular, but not in general. The model’s structure is simple enough that structural observables can be written as differentiable functions of the weights. In this case, constrained optimization would work. The objection fails in general for two reasons.
First, expectation values carry information that point estimates do not. It may not always be possible to write down simple observables that isolate the structural information in a point value. MISSING (0000) demonstrates that patterning can choose between two minima with identical loss that are distinguishable only by their local geometry. This case requires the finer control that expectation values provide.
Second, in practice, the primary way we modify the training objective is already by modifying the data. As discussed in Section TODO, when we lack explicit structural observables (which is the generic case for real models), “optimizing against the target” means reweighting data, and doing so in a principled way is exactly what the patterning equation provides.
How to Choose Observables?
In the toy model of superposition, an appropriate set of structural observables is easy to identify: dot products and norms fully characterize the polygon geometry. Real neural networks do not come with such a convenient reparameterization. Discovering the appropriate set of structural observables for a given model is an open problem, perhaps the central difficulty: this is what interpretability is about.
Prior work on susceptibility-based interpretability [MISSING (0000), MISSING (0000), MISSING (0000)] suggests a path forward: begin with a broad (highly redundant) set of observables (per-sample losses, per-component losses, activation statistics, etc.), compute susceptibilities, and use the mode decomposition to discover which combinations of observables and data perturbations couple most strongly. This analysis itself guides the search toward more informative observables, for instance, by telling us how to explicitly target the low-gain modes to resolve them further.
There is a hard constraint here. The patterning equation can only achieve target changes that lie in the column space of the susceptibility matrix χ. Any component of the target outside the column space of χ is unreachable by data reweighting alone. This means the observable set must be expressive enough to span the structural directions you care about. Expanding this coverage requires more observables and more compute (additional forward passes per SGLD draw), creating a direct tradeoff between control resolution and computational cost.
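Reachability can be tested directly with an SVD. A sketch — the matrix below is a random rank-deficient stand-in for a susceptibility matrix, not a measured one:

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in susceptibility matrix: 6 observables x 4 data directions,
# built rank-3 on purpose so that some targets are unreachable.
chi = rng.normal(size=(6, 3)) @ rng.normal(size=(3, 4))

U, s, Vt = np.linalg.svd(chi)
rank = int(np.sum(s > 1e-10 * s[0]))
col_basis = U[:, :rank]                  # orthonormal basis of the column space

def reachable(target, tol=1e-8):
    """True iff `target` lies in the column space of chi (no residual
    after projection), i.e. some data reweighting can produce it."""
    proj = col_basis @ (col_basis.T @ target)
    return bool(np.linalg.norm(target - proj) < tol * max(np.linalg.norm(target), 1.0))

in_span = chi @ rng.normal(size=4)       # reachable by construction
off_span = U[:, rank]                     # orthogonal to the column space
```

The same SVD also exposes the gains: targets aligned with leading singular vectors need small weight changes, while targets in trailing directions need disproportionately large ones.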
Can We Resolve the Relevant Observables?
The SVD of χ reveals that the directions easiest to estimate are also the directions easiest to control, and vice versa (see Section TODO). Leading modes have high gain and high signal-to-noise; trailing modes have low gain and are potentially indistinguishable from sampling noise.
This poses a problem for alignment. If the structural features most relevant to safety (honesty, corrigibility, etc.) live in the low-gain tail of χ’s spectrum, then patterning lacks leverage precisely where it matters most. We expect that patterning may enable an essentially complete solution to the Sharp Left Turn (MISSING 0000) in principle, by giving us direct control over which structures form and how they generalize. But that control is only as good as our measurements. If the susceptibility spectrum cannot resolve alignment-relevant structure, we have traded an inductive bias problem in the model for a resolution problem in the measuring apparatus, and the outcome is the same.
There are reasons to be cautiously optimistic. Behavioral patterns that manifest broadly across the data distribution will couple to many samples and therefore appear as high-gain modes. The most consequential alignment-relevant patterns, such as power-seeking, are plausibly of this kind. More fundamentally, susceptibilities measure how the posterior would respond to a perturbation, not whether that perturbation has been observed. We expect that a catastrophic behavior does not have to be exhibited for its structural tendency to be detectable. We are actively investigating connections with elicitation (MISSING 0000), where the goal is to find data perturbations that surface rare but dangerous behaviors. Whether patterning’s resolution extends to the full alignment-relevant spectrum remains open.
How Will it Scale?
The computational cost of patterning is dominated by SGLD sampling, which requires many forward passes to estimate posterior covariances. In the toy model, this is trivial. At scale, it is the primary bottleneck.
Our concurrent work on scaling spectroscopy demonstrates that susceptibility estimation is tractable for language models with over a billion parameters, using the same SGLD methodology. The patterning side — computing reweightings and training on modified distributions — has been demonstrated in small language models (MISSING 0000) but not yet at the scale where it would be relevant for frontier alignment. Further engineering challenges include adapting to mini-batch training (where you cannot evaluate all samples simultaneously), handling the gap between population and empirical risk, and extending to reinforcement learning settings where the “data distribution” is generated by the policy itself.
Outlook
The results in this post demonstrate patterning in a setting where internal structure is immediately visible and the right observables are known in advance. The path to impact requires relaxing both of these conditions.
The most immediate next step is reward models. Current alignment pipelines depend critically on reward models that are known to develop simple, biased proxies for human preferences — the “love of lists,” sycophancy, and related phenomena documented in the reward model literature. These biases are plausibly instances of the same structural dynamics visible in TMS: at decision points in the reward model’s development, the model commits to simple proxy features rather than more complex but accurate representations of human values. If susceptibilities can identify these decision points and patterning can bias them toward the more complex branch, the result would be reward models that better represent what humans actually want. We are actively pursuing this direction.
More broadly, patterning suggests a different framing for what alignment could look like. Current methods specify desired behavior through examples and hope the model develops the right internal structure to generalize correctly. Patterning inverts this: specify the desired internal structure and compute the data distribution that produces it. The missing piece is the compiler — the map from alignment desiderata to structural observables. Building this compiler is the central open problem, and it is where interpretability and alignment meet. Susceptibility-based interpretability discovers the structural vocabulary; patterning targets it; the result is a closed loop from specification to structure.
We note that the ability to precisely control what models learn is inherently dual-use. The same techniques that could enforce alignment constraints could be used to train models toward undesirable structure — inserting backdoors, amplifying biases, or shaping models to pursue goals that are not aligned with their operators’ intentions. This is a general property of any increase in training methodology capability, not specific to patterning, but it is worth stating clearly given the precision of control demonstrated here.
Appendix
Additional Hyperparameters
TODO: Table with and other TMS parameters
Patterning as Policy Gradient
For readers familiar with reinforcement learning, a natural interpretation is that patterning is policy gradient on data distributions.
Consider the expected value of an observable $O$ under the posterior:

$$\langle O \rangle_{\boldsymbol h} = \int O(w)\, p(w \mid \boldsymbol h)\, dw,$$

where $\boldsymbol h$ parameterizes the data distribution (e.g., sample weights). We want to find $\partial \langle O \rangle_{\boldsymbol h} / \partial \boldsymbol h$ to know how shifting the data affects the observable.

Using the REINFORCE trick, à la Williams, 1992:

$$\frac{\partial \langle O \rangle_{\boldsymbol h}}{\partial \boldsymbol h} = \int O(w)\, p(w \mid \boldsymbol h)\, \frac{\partial \log p(w \mid \boldsymbol h)}{\partial \boldsymbol h}\, dw = \left\langle O(w)\, \frac{\partial \log p(w \mid \boldsymbol h)}{\partial \boldsymbol h} \right\rangle.$$

The posterior is $p(w \mid \boldsymbol h) = \frac{1}{Z(\boldsymbol h)}\, e^{-n L_{\boldsymbol h}(w)}\, \pi(w)$, so:

$$\log p(w \mid \boldsymbol h) = -n L_{\boldsymbol h}(w) + \log \pi(w) - \log Z(\boldsymbol h).$$

Taking the derivative:

$$\frac{\partial \log p(w \mid \boldsymbol h)}{\partial \boldsymbol h} = -n\, \frac{\partial L_{\boldsymbol h}(w)}{\partial \boldsymbol h} - \frac{\partial \log Z(\boldsymbol h)}{\partial \boldsymbol h}.$$

The partition function term is a baseline:

$$\frac{\partial \log Z(\boldsymbol h)}{\partial \boldsymbol h} = -n\, \left\langle \frac{\partial L_{\boldsymbol h}(w)}{\partial \boldsymbol h} \right\rangle.$$

Substituting back:

$$\frac{\partial \langle O \rangle_{\boldsymbol h}}{\partial \boldsymbol h} = -n \left( \left\langle O(w)\, \frac{\partial L_{\boldsymbol h}(w)}{\partial \boldsymbol h} \right\rangle - \langle O \rangle \left\langle \frac{\partial L_{\boldsymbol h}(w)}{\partial \boldsymbol h} \right\rangle \right).$$

Therefore:

$$\frac{\partial \langle O \rangle_{\boldsymbol h}}{\partial \boldsymbol h} = -n\, \mathrm{Cov}\!\left( O,\; \frac{\partial L_{\boldsymbol h}}{\partial \boldsymbol h} \right).$$
This is exactly the susceptibility formula. In RL terms:
- Policy: the data distribution
- Environment: training dynamics that map data → trained model
- Reward: the observable we want to maximize
- Policy gradient: the susceptibility, which tells us how to adjust the data distribution
The key difference from standard RL: we don’t need to sample trajectories. The posterior (estimated via SGLD) gives us the entire distribution over “outcomes” (trained models) for a given “policy” (data distribution). We can compute the policy gradient analytically as a covariance.
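As a sanity check on computing the policy gradient as a covariance, consider a 1-D Gaussian toy (our construction: weighted loss L_h(w) = (w − h)²/2, so the posterior is N(h, 1/n) and the exact susceptibility d⟨w⟩/dh equals 1):

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy: weighted loss L_h(w) = (w - h)**2 / 2, posterior p(w|h) = N(h, 1/n).
# Exact susceptibility: d<w>/dh = 1. The covariance formula should agree.
n = 50.0
h = 0.3
w = rng.normal(loc=h, scale=1.0 / np.sqrt(n), size=2_000_000)  # posterior samples

O = w                   # observable: the weight itself
dL_dh = -(w - h)        # derivative of L_h(w) with respect to h

# Susceptibility / policy-gradient formula: d<O>/dh = -n * Cov(O, dL/dh)
est = -n * np.mean((O - O.mean()) * (dL_dh - dL_dh.mean()))
```

Here Cov(w, −(w − h)) = −Var(w) = −1/n, so the estimator recovers the exact answer of 1 up to Monte Carlo noise; in practice the samples would come from SGLD rather than an exact Gaussian.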
Work with us
We're hiring Research Scientists, Engineers & more to join the team full-time.
Senior researchers can also express interest in a part-time affiliation through our new Research Fellows Program.
@article{adam2026patterning,
title={Patterning Toy Models of Superposition},
author={Max Adam and Jesse Hoogland and Daniel Murfet},
year={2026}
}

- 1.⚠ ERROR: MISSING CITATION ENTRY — fix this in references.bibMISSING, 0000.
- 2.⚠ ERROR: MISSING CITATION ENTRY — fix this in references.bibMISSING, 0000.
- 3.⚠ ERROR: MISSING CITATION ENTRY — fix this in references.bibMISSING, 0000.
- 4. Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec et al., 2022. Toy Models of Superposition. Transformer Circuits Thread.
- 12. Algebraic Geometry and Statistical Learning Theory. Sumio Watanabe, 2009. Cambridge University Press. DOI: 10.1017/CBO9780511800474.