
Learn about SLT

Deep dive into Singular Learning Theory, Developmental Interpretability, and AI Alignment research

Choose Your Learning Path

Start here: Dialogue Introduction to SLT

Get a friendly introduction to SLT before diving into specific learning paths

Theoretical Path

For those with backgrounds in mathematics (algebraic geometry), physics, or statistical learning theory

Jump to Theoretical SLT

Applied Path

For ML engineers and practitioners wanting hands-on experience with SLT techniques

Start with LLC Estimation
Then try Demo Notebooks

SLT for Alignment

Why SLT matters for AI safety and alignment research

Theoretical SLT

Mathematical foundations and formal theory

Applied SLT

Practical techniques and implementation

AI Alignment

Broader context and safety research

SLT for Alignment

Start here to understand why SLT matters for AI safety and alignment

SLT for Alignment

Why does SLT matter at all for alignment? Essential reading for understanding the connection.

Towards Developmental Interpretability

Why study how neural networks change over training? A foundational perspective on developmental interpretability.

Start Your Journey

Begin with understanding why SLT is crucial for AI alignment before diving into the technical details.

A conversational introduction to the key concepts

Theoretical SLT

For those with backgrounds in mathematics (esp. algebraic geometry), physics, or statistical learning theory

Essential Reading

Distilling Singular Learning Theory 0-4

(by Liam Carroll) introduces SLT and explains what it says about phases and phase transitions (in the sense of the Bayesian learning process).

Singular Learning Theory: Exercises

(by Zach Furman). Reading is not enough. If you are serious about this, do the pen-and-paper exercises.

Advanced Materials

Watanabe (2022)

A good survey of the major results of SLT, by the master himself.

Additional Publications

See the publications here for more advanced materials and research papers.

Textbooks

The textbooks in SLT are:

The Grey Book

Sumio Watanabe "Algebraic Geometry and Statistical Learning Theory" 2009

• This is where all the details of the proofs of the main results of SLT are contained. It is a research monograph distilling the results proven over more than a decade. This is not an easy book to read.

Chapter 1 provides a coarse treatment of the underlying proof ideas and mechanics.

Chapters 2-5: The results of SLT depend on many results from other fields of mathematics (algebraic geometry, distribution theory, manifold theory, empirical processes, etc.). The book gives some background in each of these fields rather quickly. Scattered through these introductions is some material on how these fields relate to the core results of SLT.

Chapter 6 contains the main proofs of SLT.

Chapter 7 contains applications of the main results and examples of various learning phenomena in singular models.

The Green Book

Sumio Watanabe "Mathematical Theory of Bayesian Statistics" 2018

• This more recent book is much more focused on learning in singular models (esp. Bayesian learning).

• There are many exercises at the end of each chapter.

• This is also where Watanabe handles the non-realisable case. This requires the introduction of a new technical condition known as "relatively finite variance".

• While not recapitulating the full proof given in the Grey Book, the Green Book does go through slightly different formulations of the theory and, by assuming some technical results in the Grey Book, it walks through the proofs of most results.

Applied/Experimental SLT

For ML engineers and practitioners wanting to apply SLT techniques

Ready to get hands-on? Start with the theoretical foundation if you haven't already, then dive into practical techniques below.

Starter Notebooks

Before diving into a new project, we recommend building familiarity by going through some of the starter notebooks in the devinterp repo. These notebooks can also serve as a starting point for further investigation.

LLC Estimation

Currently, the key experimental technique in applying SLT to real-world models is local learning coefficient (LLC) estimation, introduced in Lau et al. (2023).

Quantifying degeneracy in singular models via the learning coefficient

(Lau et al. 2023) introduces the local learning coefficient (LLC) along with an SGLD-based estimator for the LLC.

[Distillation] You're Measuring Model Complexity Wrong

by Jesse Hoogland and Stan van Wingerden explains why you should care about model complexity, why the local learning coefficient is arguably the correct measure of model complexity, and how to estimate its value.

Estimating the local learning coefficient at scale

(Furman & Lau 2024) is a follow-up to Lau et al. (2023) that checks how accurate LLC estimation is in the setting of deep linear networks (DLNs).

SLT High 3: The Learning Coefficient

(Optional) provides some intuitions for how to think about the learning coefficient.
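Before moving on to the notebooks, it may help to see the whole pipeline in miniature. Below is a minimal, self-contained sketch of SGLD-based LLC estimation in the spirit of Lau et al. (2023), applied to two analytic two-dimensional potentials. This is not the devinterp implementation; the hyperparameters (sample size n, inverse temperature β = 1/log n, localization strength γ, step size) are illustrative choices, and the estimates are noticeably biased at this toy scale.

```python
import numpy as np

rng = np.random.default_rng(0)

def llc_estimate(loss, grad, w_star, n=1000, chains=4, steps=5000,
                 eps=1e-3, gamma=10.0):
    """Sketch of SGLD-based LLC estimation (in the spirit of Lau et al. 2023).

    Runs Langevin dynamics targeting the localized, tempered posterior
        p(w) ∝ exp(-n * beta * loss(w) - (gamma / 2) * ||w - w_star||^2)
    with inverse temperature beta = 1/log(n), then returns
        lambda_hat = n * beta * (E[loss(w)] - loss(w_star)).
    """
    beta = 1.0 / np.log(n)
    draws = []
    for _ in range(chains):
        w = w_star.copy()
        for t in range(steps):
            # Langevin step: gradient drift plus localization plus noise.
            drift = -n * beta * grad(w) - gamma * (w - w_star)
            w = w + 0.5 * eps * drift + np.sqrt(eps) * rng.standard_normal(w.shape)
            if t >= steps // 2:  # discard the first half as burn-in
                draws.append(loss(w))
    return n * beta * (np.mean(draws) - loss(w_star))

# Degenerate minimum: L(w) = (w1 * w2)^2 has learning coefficient 1/2 at 0.
deg_loss = lambda w: (w[0] * w[1]) ** 2
deg_grad = lambda w: np.array([2 * w[0] * w[1] ** 2, 2 * w[0] ** 2 * w[1]])

# Regular minimum: L(w) = w1^2 + w2^2 has learning coefficient d/2 = 1.
reg_loss = lambda w: float(w @ w)
reg_grad = lambda w: 2 * w

w_star = np.zeros(2)
lam_deg = llc_estimate(deg_loss, deg_grad, w_star)
lam_reg = llc_estimate(reg_loss, reg_grad, w_star)
print(f"degenerate minimum: lambda_hat ~ {lam_deg:.2f}")
print(f"regular minimum:    lambda_hat ~ {lam_reg:.2f}")
```

The true values are λ = 1/2 for the degenerate potential and λ = 1 for the regular quadratic. With these toy settings both localization and discretization bias the estimates, but the ordering is preserved: the degenerate minimum reads as less complex than the regular one, which is the kind of comparison LLC estimation is typically used for in practice.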

Putting it in practice

Once you've read the above materials, get some hands-on practice with the example notebooks in devinterp.

Developmental Interpretability

Developmental interpretability proposes to study changes in neural network structure over the course of training (rather than trying to interpret isolated snapshots). This draws on ideas and methods from a range of areas of mathematics, statistics, and the (biological) sciences.

At the moment, the key technique, namely applying LLC estimation over the course of training, comes from Singular Learning Theory (SLT), with further inspiration drawn, to a lesser extent, from developmental biology and statistical physics.

The readings focus on SLT:

[Lecture] SLT High 1: The Logic of Phase Transitions

explains how to apply the free energy formula in practice to reason about the singular learning process.
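For orientation, the free energy formula in question is Watanabe's asymptotic expansion of the Bayesian free energy $F_n = -\log Z_n$ (stated here without proof):

```latex
F_n = n L_n(w_0) + \lambda \log n - (m - 1) \log \log n + O_P(1)
```

where $L_n$ is the empirical negative log likelihood, $w_0$ an optimal parameter, $\lambda$ the learning coefficient, and $m$ its multiplicity. Reasoning about phase transitions amounts to comparing the accuracy term $n L_n(w_0)$ against the complexity term $\lambda \log n$ across competing regions of parameter space.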

Dynamical versus Bayesian Phase Transitions in a Toy Model of Superposition

(Chen et al. 2023) studies Anthropic's Toy Model of Superposition using SLT. This work 1) shows, in a theoretically tractable but non-trivial model, that knowing the leading-order terms of the free energy expansion does allow us to predict phases and phase transitions in Bayesian learning, and 2) demonstrates that we can use the learning coefficient to track the development of neural networks.

[Distillation] Growth and Form in a Toy Model of Superposition (by Liam Carroll and Edmund Lau)

The Developmental Landscape of In-Context Learning

(Hoogland et al. 2024) shows that the development of neural networks is organized into discrete stages that we can detect with local learning coefficient estimation and essential dynamics.

[Distillation] Stagewise Development in Neural Networks.
Start with the ICML 2024 workshop version before reading the arXiv version.


AI Alignment

Understanding the broader context of AI safety and alignment research

Connection to SLT: Understanding why SLT matters for alignment provides crucial context for these broader AI safety readings.

Basics

AI Alignment Metastrategy

(Kosoy 2023): provides a strong overview of the different philosophical strands of AI safety research.

Risks from Learned Optimization in Advanced Machine Learning Systems

(Hubinger et al. 2019): shows how dangerous behaviors could arise naturally in capable systems trained by gradient descent; introduces the idea of deceptive alignment.

Interpretability

Zoom In: An Introduction to Circuits

(Olah et al. 2020): makes the case for interpretability as a science.

A Transparency and Interpretability Tech Tree

(Hubinger 2022): makes the case for interpretability contributing to alignment.

In-Context Learning and Induction Heads

(Olsson et al. 2022): establishes a link between high-level changes in model behavior (in-context learning) and structural changes (induction heads).

Toy Models of Superposition

(Elhage et al. 2022): describes the problem of "superposition" in interpretability.

A Mathematical Framework for Transformer Circuits

(Elhage et al. 2021): if you want to understand how transformers compute, you need to be fluent with how attention works.

Formal Algorithms for Transformers

(Phuong and Hutter 2022): for precise definitions of the ingredients of transformers, often difficult to extract from other literature.

Progress measures for grokking via mechanistic interpretability

(Nanda et al. 2023): one of the most in-depth examples of reverse-engineering the algorithm learned by a neural network.

Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small

(Wang et al. 2022): interpretability tools can be successfully applied to large(ish) models.


Community & Resources

MetaUni SLT Seminars

Weekly seminars in Roblox featuring dozens of talks on SLT

DevInterp Conferences

Lectures from two SLT conferences with expert presentations

Additional Programs

Alignment 101/201, ARENA program, and MetaUni AI-safety seminars

Ready to Get Started?

Join our community and start contributing to the future of AI safety research