Induction Heads
Detect the formation of induction heads in their original context.
Update
See our paper, The Developmental Landscape of In-context Learning, for a developmental analysis of induction head formation.
Background
If grokking is the first example that comes to mind when thinking of phase transitions in neural networks, then induction heads are the second example.
What does the “induction bump” look like from the perspective of the learning coefficient? Can we detect the formation of induction heads using this quantity? When comparing models of different sizes, does the learning coefficient reveal a difference between single-layer and multi-layer transformers?
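As a starting point for these questions, here is a minimal sketch of how a learning coefficient estimate could be computed at a single training checkpoint. It assumes the commonly used estimator of the form n * beta * (mean sampled loss - checkpoint loss), with inverse temperature beta = 1 / log(n), and it assumes you have already drawn loss samples from a localized sampler (for example SGLD) around the checkpoint. The function name, its arguments, and the numbers in the usage lines are hypothetical and chosen only for illustration; they are not taken from the paper or from any library.

```python
import math


def estimate_llc(sampled_losses, checkpoint_loss, dataset_size, beta=None):
    """Estimate the local learning coefficient at one checkpoint.

    Assumed estimator: n * beta * (mean(sampled_losses) - checkpoint_loss),
    where n is the dataset size and beta defaults to 1 / log(n).

    sampled_losses  -- losses at weights sampled near the checkpoint
                       (e.g. from an SGLD chain; sampling is not done here)
    checkpoint_loss -- loss at the checkpoint weights themselves
    dataset_size    -- number of training samples n (must be at least 2)
    """
    if dataset_size < 2:
        raise ValueError("dataset_size must be at least 2 so that log(n) > 0")
    if len(sampled_losses) == 0:
        raise ValueError("sampled_losses must not be empty")
    if beta is None:
        beta = 1.0 / math.log(dataset_size)
    mean_sampled_loss = sum(sampled_losses) / len(sampled_losses)
    return dataset_size * beta * (mean_sampled_loss - checkpoint_loss)


# Usage with made-up placeholder numbers (not real measurements):
# one estimate per checkpoint gives a curve over training time.
placeholder_checkpoints = [
    {"step": 1000, "checkpoint_loss": 4.0, "sampled_losses": [4.02, 4.03, 4.01]},
    {"step": 2000, "checkpoint_loss": 3.2, "sampled_losses": [3.26, 3.25, 3.27]},
]
for ckpt in placeholder_checkpoints:
    llc = estimate_llc(ckpt["sampled_losses"], ckpt["checkpoint_loss"], dataset_size=10_000)
    print(ckpt["step"], llc)
```

Plotting such estimates against training step, next to the training loss, is one way to look for a change in the learning coefficient around the point where induction heads form. The quality of the estimate depends heavily on the sampler settings, which this sketch deliberately leaves out.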
Where to begin:
If you have decided to start working on this, please let us know in the Discord. We'll update this listing so that other people who are interested in this project can find you.