Developmental interpretability is a new alignment research agenda grounded in singular learning theory (SLT), statistical physics, and developmental biology. Its aim is to build tools for detecting, locating, and interpreting the phase transitions that govern training and in-context learning. This has the potential to reduce the alignment tax of existing techniques and to inform scalable new methods for interpreting neural networks.
Check out the research agenda.
2023 SLT & Alignment Summit
The SLT & Alignment Summit ("Singularities against the Singularity") began in June 2023 and is still ongoing. In the first week, we recorded more than 20 hours of lectures on the necessary background, all of which you can find here. In the second week, we are starting research collaborations on the open problems.
We'll post a review, along with lecture notes and more informal posts, over the course of the next month.