Induction Heads

Lead: George Wang @_protocol

Detect the formation of induction heads in the setting where they were originally studied.

Type: Applied
Difficulty: Medium
Status: Completed

Update

See our paper, The Developmental Landscape of In-context Learning, for a developmental analysis of induction head formation.


Background

If grokking is the first example that comes to mind when thinking of phase transitions in neural networks, then induction heads are the second.

What does the “induction bump” look like from the perspective of the learning coefficient? Can we detect the formation of induction heads using this quantity? When comparing models of different sizes, does the learning coefficient reflect the difference between single-layer and multi-layer transformers?
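
As a rough starting point, the two quantities one would want to plot against each other over training checkpoints can be sketched as below. This is a minimal sketch under stated assumptions, not the project's own code: the function names estimate_llc and icl_score are hypothetical, the learning coefficient is estimated with the common form n * beta * (mean sampled loss - loss at the checkpoint) with beta defaulting to 1 / log(n), and the losses near the checkpoint are assumed to come from a sampler such as SGLD that is not shown here. The in-context learning score is assumed to be the loss at the 500th token minus the loss at the 50th token, which requires contexts of at least 500 tokens.

```python
import math

import numpy as np


def estimate_llc(init_loss, chain_losses, num_samples, beta=None):
    """Estimate a local learning coefficient from sampled losses.

    init_loss:    average loss at the checkpoint being probed
    chain_losses: average losses at points sampled near the checkpoint
                  (assumed to come from a sampler such as SGLD, not shown)
    num_samples:  dataset size n
    beta:         inverse temperature; defaults to 1 / log(n)
    """
    if beta is None:
        beta = 1.0 / math.log(num_samples)
    chain_losses = np.asarray(chain_losses, dtype=float)
    # n * beta * (expected loss near the checkpoint - loss at the checkpoint)
    return num_samples * beta * (chain_losses.mean() - init_loss)


def icl_score(per_token_loss, early=50, late=500):
    """Loss at a late context position minus loss at an early one.

    per_token_loss: loss per context position, already averaged over
                    examples; positions are 1-indexed in the arguments.
    More negative values mean the model benefits more from context.
    """
    per_token_loss = np.asarray(per_token_loss, dtype=float)
    return per_token_loss[late - 1] - per_token_loss[early - 1]
```

Computing both quantities at every saved checkpoint gives two curves over training time. A sharp drop in the in-context learning score marks the point where induction heads are expected to form, and the question above is whether the learning coefficient estimate changes visibly around the same checkpoints.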

Where to begin:

If you have decided to start working on this, please let us know in the Discord. We'll update this listing so that other people who are interested in this project can find you.