Project Ideas

Various project ideas to inspire you if you're interested in getting involved in DevInterp and SLT research.

A good place to start is with the open problems lecture from the 2023 DevInterp conference. This comprehensive overview covers key research directions and opportunities in developmental interpretability.

DevInterp Conference 2023 - Open Problems

Starter Notebooks

Before diving into a new project, we recommend building familiarity by going through some of the starter notebooks in the devinterp repo. These notebooks can also serve as a starting point for further investigation.

Active Projects

We encourage replication but discourage scooping each other: there are enough interesting problems that we shouldn't be unnecessarily duplicating effort, which slows progress in AI safety and is bad for the community. That said, if you're particularly interested in one of the active projects, please reach out to see if there's an opportunity to get involved and collaborate.

Natural Gradient Descent

Lead: Moosa & Zach
Discord: cxtraa
Type: Applied
Difficulty: Easy
Status: In-progress

SLT suggests natural gradient descent should generalize worse. Can we formalize this with the RLCT?
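
For reference, "natural gradient descent" here means preconditioning the gradient with the (inverse) Fisher information. Below is a minimal sketch of a single NGD step on a toy linear model using the empirical Fisher; the model, data, learning rate, and damping value are all illustrative and not part of the project itself.

```python
import torch

# Minimal sketch (not the project code): one natural-gradient step on a tiny
# linear model, using the empirical Fisher as the preconditioner.
torch.manual_seed(0)
X = torch.randn(64, 3)
y = X @ torch.tensor([1.0, -2.0, 0.5]) + 0.1 * torch.randn(64)

theta = torch.zeros(3, requires_grad=True)
lr, damping = 0.1, 1e-3

def loss_fn(theta, x, y):
    return 0.5 * (x @ theta - y) ** 2  # per-sample squared error

# Empirical Fisher: average of per-sample gradient outer products.
grads = []
for i in range(len(X)):
    g, = torch.autograd.grad(loss_fn(theta, X[i], y[i]), theta)
    grads.append(g)
G = torch.stack(grads)                       # (n, d) per-sample gradients
fisher = G.T @ G / len(X)                    # (d, d) empirical Fisher
mean_grad = G.mean(dim=0)

# Natural-gradient step: precondition the mean gradient by the inverse Fisher.
ng_step = torch.linalg.solve(fisher + damping * torch.eye(3), mean_grad)
with torch.no_grad():
    theta -= lr * ng_step
print(theta)
```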

DevInterp-Flavored Projects

The number one thing we encourage for people who want to get involved in DevInterp, especially on the empirical side, is to just go out and study the development of models that haven't been studied yet.

The easiest place to start is with transformers trained on algorithmic tasks. Just choose one (or come up with your own) and start applying tools from devinterp, such as local learning coefficient estimation, essential dynamics (coming soon), and Oku-Aihara covariance analysis (also coming soon), alongside more "traditional" tools from mechinterp, such as progress measures.
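
To give a concrete sense of the main tool, here is a hand-rolled sketch of SGLD-based local learning coefficient (LLC) estimation on a toy model. The devinterp package provides a polished, tested version of this estimator, so treat this purely as an illustration of the idea; the architecture, hyperparameters, and sample counts below are arbitrary.

```python
import math
import torch
import torch.nn as nn

# Hand-rolled sketch of SGLD-based LLC estimation on a toy regression model.
# The devinterp package implements this properly; hyperparameters here are
# purely illustrative.
torch.manual_seed(0)
n = 1024
X = torch.randn(n, 2)
y = torch.sin(X[:, :1]) + 0.1 * torch.randn(n, 1)

model = nn.Sequential(nn.Linear(2, 16), nn.Tanh(), nn.Linear(16, 1))
loss_fn = nn.MSELoss()

# Train the model to (approximately) a local minimum w*.
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
for _ in range(500):
    opt.zero_grad()
    loss_fn(model(X), y).backward()
    opt.step()

w_star = [p.detach().clone() for p in model.parameters()]
init_loss = loss_fn(model(X), y).item()

# SGLD sampling localized around w*; lambda_hat = n * beta * (E[L] - L(w*)).
eps, gamma, beta = 1e-4, 100.0, 1.0 / math.log(n)
draws = []
for step in range(2000):
    idx = torch.randint(0, n, (256,))
    loss = loss_fn(model(X[idx]), y[idx])
    model.zero_grad()
    loss.backward()
    with torch.no_grad():
        for p, p0 in zip(model.parameters(), w_star):
            drift = -(eps / 2) * (n * beta * p.grad + gamma * (p - p0))
            p.add_(drift + math.sqrt(eps) * torch.randn_like(p))
    if step >= 500:  # discard burn-in
        with torch.no_grad():
            draws.append(loss_fn(model(X), y).item())

llc_hat = n * beta * (sum(draws) / len(draws) - init_loss)
print(f"estimated LLC: {llc_hat:.2f}")
```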

Type: Applied
Difficulty: Easy
Status: Unstarted

Can we detect phase transitions in settings like modular arithmetic, multitask sparse parity, and greatest common divisor using the learning coefficient?
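
As an illustration of one of these settings, the modular-addition dataset used in grokking experiments can be built in a few lines; the phase-transition question then amounts to estimating the learning coefficient at a series of training checkpoints and looking for sharp changes. The prime and train fraction below are illustrative.

```python
import torch

# Illustrative modular-addition dataset (a + b mod p), the classic grokking
# setup. Each example is the pair (a, b); the label is (a + b) % p.
p = 113
a, b = torch.meshgrid(torch.arange(p), torch.arange(p), indexing="ij")
inputs = torch.stack([a.flatten(), b.flatten()], dim=1)   # (p*p, 2)
labels = (inputs[:, 0] + inputs[:, 1]) % p                # (p*p,)

# Random train/test split; grokking-style runs often train on ~30% of pairs.
perm = torch.randperm(p * p)
split = int(0.3 * p * p)
train_idx, test_idx = perm[:split], perm[split:]
train_x, train_y = inputs[train_idx], labels[train_idx]
test_x, test_y = inputs[test_idx], labels[test_idx]

# A phase-transition study would then train a small transformer on
# (train_x, train_y), save checkpoints, and estimate the LLC at each
# checkpoint, looking for jumps alongside the train/test loss curves.
```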

Type: Applied
Difficulty: Medium
Status: Unstarted

LayerNorm can have a large impact on learning dynamics. Can we characterize this in a simple toy model?
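
One simple way to set this up (a sketch only; the task, widths, and optimizer are arbitrary) is to train two otherwise identical toy models, with and without LayerNorm, on the same data and compare their loss curves, and later their LLC trajectories.

```python
import torch
import torch.nn as nn

# Sketch: two otherwise-identical toy MLPs, with and without LayerNorm,
# trained on the same synthetic task so their dynamics can be compared.
def make_model(use_layernorm: bool) -> nn.Sequential:
    layers = [nn.Linear(8, 64)]
    if use_layernorm:
        layers.append(nn.LayerNorm(64))
    layers += [nn.ReLU(), nn.Linear(64, 1)]
    return nn.Sequential(*layers)

torch.manual_seed(0)
X = torch.randn(2048, 8)
y = (X[:, :4].sum(dim=1, keepdim=True) > 0).float()  # arbitrary toy target

for use_ln in (False, True):
    model = make_model(use_ln)
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    loss_fn = nn.BCEWithLogitsLoss()
    for step in range(1000):
        opt.zero_grad()
        loss = loss_fn(model(X), y)
        loss.backward()
        opt.step()
        if step % 200 == 0:
            print(f"layernorm={use_ln} step={step} loss={loss.item():.4f}")
```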

SLT-Flavored Projects

If you're interested in something slightly more theoretical, there are many interesting questions in the context of SLT.

Type: Applied
Difficulty: Medium
Status: Unstarted

Comparing LLC estimation in weight space to different forms of ablation in activation space.
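
On the activation side, a basic intervention to compare against is mean ablation of a layer's output, which can be implemented with a PyTorch forward hook. The model and choice of layer below are placeholders.

```python
import torch
import torch.nn as nn

# Sketch: mean-ablating one layer's activations with a forward hook, as an
# example of the kind of activation-level intervention to compare against
# weight-space LLC estimates. The model and layer choice are placeholders.
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))
X = torch.randn(256, 10)

# Precompute the mean activation of the hidden layer over a reference batch.
acts = {}
def save_hook(module, inputs, output):
    acts["hidden"] = output.detach()

handle = model[1].register_forward_hook(save_hook)
model(X)
handle.remove()
mean_act = acts["hidden"].mean(dim=0)

# Replace the layer's output with its mean on subsequent forward passes.
def ablate_hook(module, inputs, output):
    return mean_act.expand_as(output)

handle = model[1].register_forward_hook(ablate_hook)
ablated_logits = model(X)
handle.remove()
```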

Type: Applied
Difficulty: Medium
Status: Unstarted

Investigating how unlearning procedures like LEACE affect the Local Learning Coefficient.
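
LEACE itself is available as the concept-erasure package; the sketch below instead shows a much cruder stand-in, projecting a least-squares concept direction out of a batch of activations, just to illustrate the kind of edit whose effect on the LLC one would measure. The synthetic activations and binary concept are made up for the example.

```python
import torch

# Simplified sketch of linear concept erasure (a crude stand-in for LEACE):
# fit the least-squares direction predicting a binary concept from activations,
# then project that direction out of every activation vector.
torch.manual_seed(0)
n, d = 1000, 32
concept = torch.randint(0, 2, (n,)).float()                    # binary concept
direction_true = torch.randn(d)
acts = torch.randn(n, d) + concept[:, None] * direction_true   # fake activations

# Least-squares fit of the concept from activations gives the direction to erase.
w = torch.linalg.lstsq(acts, concept[:, None]).solution.squeeze()
w = w / w.norm()

# Project the learned direction out of the activations.
erased = acts - (acts @ w)[:, None] * w[None, :]

# Sanity check: the concept should be much harder to read off linearly now.
print((acts @ w).var().item(), (erased @ w).var().item())
```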

Engineering Projects

Are you more of a research engineer than a research scientist? Consider filing a PR to add features or fix bugs in the devinterp repo. There's plenty to do.

Theoretical Projects

We discourage you from working on more theoretical projects unless you really, really know what you're doing. Reach out to us.

Completed Projects

Just because a project is marked as "completed" here doesn't mean that direction is closed off. It's often very helpful to begin with replications because they give you a clear reference to compare results against. You're also sure to run into follow-up questions that the original authors didn't address, so you can always go deeper.

Induction Heads

Lead: George Wang
Discord: @_protocol
Type: Applied
Difficulty: Medium
Status: Completed

Detect the formation of induction heads in the original in-context learning setting.