Dynamics of Transient Structure in In-Context Linear Regression Transformers
Authors: Liam Carroll (Timaeus & Gradient Institute), Jesse Hoogland (Timaeus), Matthew Farrugia-Roberts (University of Oxford), Daniel Murfet (University of Melbourne)

Published: Jan 29, 2025
Abstract
Modern deep neural networks display striking examples of rich internal computational structure. Uncovering principles governing the development of such structure is a priority for the science of deep learning. In this paper, we explore the transient ridge phenomenon: when transformers are trained on in-context linear regression tasks with intermediate task diversity, they initially behave like ridge regression before specializing to the tasks in their training distribution. This transition from a general solution to a specialized solution is revealed by joint trajectory principal component analysis. Further, we draw on the theory of Bayesian internal model selection to suggest a general explanation for the phenomena of transient structure in transformers, based on an evolving tradeoff between loss and complexity. We empirically validate this explanation by measuring the model complexity of our transformers as defined by the local learning coefficient.
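To make the setting concrete, below is a minimal numpy sketch of an in-context linear regression task distribution, the ridge regression predictor that (per the abstract) the transformer initially approximates, and a predictor specialized to a finite pool of training tasks. The dimensions, task count, noise level, and the `pool_posterior_prediction` helper are illustrative assumptions for exposition, not the paper's exact configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

d = 8           # input dimension (illustrative)
K = 16          # number of training tasks, i.e. "task diversity" (illustrative)
n_ctx = 16      # in-context examples per prompt (illustrative)
noise_std = 0.1 # observation noise (illustrative)

# A finite pool of regression tasks, each defined by a weight vector w_k.
tasks = rng.standard_normal((K, d))

def sample_prompt():
    """Draw one prompt: context pairs (X, y) generated by a random pool task,
    plus a held-out query input."""
    w = tasks[rng.integers(K)]
    X = rng.standard_normal((n_ctx, d))
    y = X @ w + noise_std * rng.standard_normal(n_ctx)
    x_query = rng.standard_normal(d)
    return X, y, x_query, float(x_query @ w)

def ridge_prediction(X, y, x_query, lam=noise_std**2):
    """Ridge regression fit to the context: the general-purpose solution."""
    w_hat = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
    return float(x_query @ w_hat)

def pool_posterior_prediction(X, y, x_query):
    """Posterior-weighted average over the finite task pool: a predictor
    specialized to the tasks in the training distribution (assuming Gaussian
    noise; a hypothetical stand-in for the specialized solution)."""
    sq_err = ((y[:, None] - X @ tasks.T) ** 2).sum(axis=0)   # shape (K,)
    log_w = -sq_err / (2 * noise_std**2)
    p = np.exp(log_w - log_w.max())
    p /= p.sum()
    return float(p @ (tasks @ x_query))

X, y, x_query, y_true = sample_prompt()
print("ridge:", ridge_prediction(X, y, x_query))
print("pool :", pool_posterior_prediction(X, y, x_query))
print("true :", y_true)
```

With few in-context examples relative to the task dimension, the two predictors can disagree noticeably; the transient ridge phenomenon concerns which of these two kinds of solution the trained transformer implements at different points in training.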