Learn about SLT
Deep dive into Singular Learning Theory, Developmental Interpretability, and AI Alignment research
Get a friendly introduction to SLT before diving into specific learning paths
For those with backgrounds in mathematics (algebraic geometry), physics, or statistical learning theory
For ML engineers and practitioners wanting hands-on experience with SLT techniques
Start here to understand why SLT matters for AI safety and alignment
For those with backgrounds in mathematics (esp. algebraic geometry), physics, or statistical learning theory
A good survey by the master himself of the major results of SLT.
See the publications here for more advanced materials and research papers.
The textbooks in SLT are:
Sumio Watanabe "Algebraic Geometry and Statistical Learning Theory" 2009
• This is where all the details of the proofs of the main results of SLT are contained. It is a research monograph distilling the results proven over more than a decade. This is not an easy book to read.
• Chapter 1 provides a coarse treatment of the underlying proof ideas and mechanics.
• Chapters 2-5: The results of SLT depend on many results from other fields of mathematics (algebraic geometry, distribution theory, manifold theory, empirical processes, etc.). The book gives some background in each of these fields rather quickly. Scattered through these introductions is some material on how these fields relate to the core results in SLT.
• Chapter 6 contains the main proofs of SLT.
• Chapter 7 contains applications of the main results and examples of various learning phenomena in singular models.
Sumio Watanabe "Mathematical Theory of Bayesian Statistics" 2018
• This more recent book is much more focused on learning in singular models (esp. Bayesian learning).
• There are many exercises at the end of each chapter.
• This is also where Watanabe handles the non-realisable case. This requires the introduction of a new technical condition known as "relatively finite variance".
• While not recapitulating the full proof given in the Grey Book, the Green Book does go through slightly different formulations of the theory and, by assuming some technical results in the Grey Book, it walks through the proofs of most results.
Joe Suzuki, "WAIC and WBIC with R Stan: 100 Exercises for Building Logic" 2019
For ML engineers and practitioners wanting to apply SLT techniques
Ready to get hands-on? Start with the theoretical foundation if you haven't already, then dive into practical techniques below.
Before diving into a new project, we recommend building familiarity by going through some of the starter notebooks in the devinterp repo. These notebooks can also serve as a starting point for further investigation.
Currently, the key experimental technique in applying SLT to real-world models is local learning coefficient (LLC) estimation, introduced in Lau et al. (2023).
(Lau et al. 2023) introduces the local learning coefficient (LLC) along with an SGLD-based estimator for the LLC.
This post by Jesse Hoogland and Stan van Wingerden explains why you should care about model complexity, why the local learning coefficient is arguably the correct measure of model complexity, and how to estimate its value.
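To make the estimator concrete, here is a minimal self-contained sketch of SGLD-based LLC estimation in the spirit of Lau et al. (2023), using only numpy rather than the devinterp library. The toy model, hyperparameters (step size, localization strength, chain length), and the specific data-generating process are all illustrative choices, not prescriptions; real LLC estimation is sensitive to these hyperparameters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy singular model: f(w, x) = w^2 * x, fitted to data generated with w_true = 0.
# Near w = 0 the loss behaves like w^4, so the learning coefficient is
# lambda = 1/4, rather than d/2 = 1/2 as for a regular 1-parameter model.
n = 1000
x = rng.normal(size=n)
y = rng.normal(scale=0.1, size=n)  # true function is 0, plus observation noise

def loss(w):
    """Empirical loss L_n(w): mean squared error of the toy model."""
    return np.mean((y - (w ** 2) * x) ** 2) / 2

def grad_loss(w):
    """Gradient of L_n with respect to the single parameter w."""
    return np.mean(-(y - (w ** 2) * x) * 2 * w * x)

# SGLD sampling from the tempered, localized posterior around w* = 0,
# following hat(lambda) = n * beta * (E[L_n(w)] - L_n(w*)) with
# inverse temperature beta = 1 / log n.
w_star = 0.0
beta = 1.0 / np.log(n)
eps = 1e-3     # SGLD step size (illustrative; needs tuning in practice)
gamma = 1.0    # localization strength pulling samples back toward w*
w = w_star
draws = []
for step in range(10_000):
    drift = -(eps / 2) * (n * beta * grad_loss(w) + gamma * (w - w_star))
    w = w + drift + rng.normal(scale=np.sqrt(eps))
    if step >= 2000:  # discard burn-in
        draws.append(loss(w))

lambda_hat = n * beta * (np.mean(draws) - loss(w_star))
print(f"estimated LLC: {lambda_hat:.3f}")
```

With these settings the estimate should land on the order of the theoretical value λ = 1/4, though SGLD estimates fluctuate with the seed and hyperparameters; the devinterp repo's notebooks show the corresponding library-based workflow for real networks.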
Developmental interpretability proposes to study changes in neural network structure over the course of training (rather than trying to interpret isolated snapshots). This draws on ideas and methods from a range of areas of mathematics, statistics, and the (biological) sciences.
At the moment, the key techniques, namely applying LLC estimation over the course of training, come from Singular Learning Theory (SLT) and to a lesser extent developmental biology and statistical physics.
The readings focus on SLT:
This post explains how to apply the free energy formula in practice to reason about the singular learning process.
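For reference, the asymptotic expansion of the Bayesian free energy (negative log marginal likelihood) at the heart of this reasoning is, in the realizable case,

```latex
F_n = n L_n(w_0) + \lambda \log n - (m - 1) \log \log n + O_p(1)
```

where $L_n$ is the empirical loss, $w_0$ a true parameter, $\lambda$ the learning coefficient (the RLCT), and $m$ its multiplicity. The leading-order terms $n L_n(w_0) + \lambda \log n$ are what the post uses to compare competing regions of parameter space: a region with higher loss but lower $\lambda$ can win at small $n$ and lose at large $n$, which is the mechanism behind phase transitions in Bayesian learning.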
(Chen et al. 2023) studies Anthropic's Toy Model of Superposition using SLT. This paper 1) demonstrates, in a theoretically tractable but non-trivial model, that knowing the leading-order terms in the free energy expansion allows us to predict phases and phase transitions in Bayesian learning, and 2) demonstrates that we can use the learning coefficient to track the development of neural networks.
[Distillation] Growth and Form in a Toy Model of Superposition (by Liam Carroll and Edmund Lau)
(Hoogland et al. 2024) shows that the development of neural networks is organized into discrete stages that we can detect with local learning coefficient estimation and essential dynamics.
[Distillation] Stagewise Development in Neural Networks.
Start with the ICML 2024 workshop version before reading the arXiv version.
Understanding the broader context of AI safety and alignment research
Connection to SLT: Understanding why SLT matters for alignment provides crucial context for these broader AI safety readings.
(Olah et al. 2020): makes the case for interpretability as a science.
(Hubinger 2022): makes the case for interpretability contributing to alignment.
(Olsson et al. 2022): establishes a link between high-level changes in model behavior (in-context learning) and structural changes (induction heads).
(Elhage et al. 2022): describes the problem of "superposition" in interpretability.
(Elhage et al. 2021): if you want to understand how transformers compute, you need to be fluent in how attention works.
(Phuong and Hutter 2022): for precise definitions of the ingredients of transformers, often difficult to extract from other literature.
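As a companion to the transformer readings above, here is a minimal numpy sketch of the scaled dot-product attention operation that those papers build on. The shapes and single-head setup are illustrative simplifications (no batching, masking, or learned projections).

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (seq, seq) attention logits
    weights = softmax(scores, axis=-1)   # each row is a distribution over positions
    return weights @ V, weights

rng = np.random.default_rng(0)
seq_len, d_k = 4, 8
Q, K, V = (rng.normal(size=(seq_len, d_k)) for _ in range(3))
out, attn_w = attention(Q, K, V)
print(out.shape)                 # output has one d_k-dimensional vector per position
print(attn_w.sum(axis=-1))       # each row of the attention weights sums to 1
```

Each output position is a convex combination of the value vectors, weighted by how strongly its query matches each key; induction heads (Olsson et al. 2022) are a particular learned pattern of these weights.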