Learning Coefficient Analysis of Double Descent Phenomena

Examining epoch-wise and model-wise double descent through the lens of the learning coefficient

Type: Applied
Difficulty: Hard
Status: Unstarted

This project aims to investigate the double descent phenomenon using the learning coefficient as a measure of effective model complexity. We’ll examine both epoch-wise double descent (EWDD) and model-wise double descent (MWDD) to gain insights into their underlying mechanisms and potential differences.

Key research questions:

  1. How does the learning coefficient vary with model size and training time in double descent scenarios?
  2. Can we recover a typical bias-variance tradeoff by plotting loss against the effective parameter count derived from the learning coefficient?
  3. Are there detectable differences in learning coefficient behavior between EWDD and MWDD?
  4. How does label noise affect the learning coefficient trajectory in EWDD scenarios?
  5. Can learning coefficient analysis provide evidence for or against the hypothesis that EWDD and MWDD have distinct causes?

Methodology:

  1. Implement experimental setups for both EWDD and MWDD, replicating key results from the literature.
  2. Estimate learning coefficients throughout training for various model sizes and hyperparameter settings (see the estimation sketch after this list).
  3. Create heatmaps of loss, error, and learning coefficient as functions of both training time and model size.
  4. Analyze the relationship between learning coefficient, model performance, and traditional measures of model complexity.
  5. Investigate the impact of label noise on learning coefficient trajectories in EWDD scenarios.
  6. Compare learning coefficient behavior in settings where EWDD and MWDD occur separately and simultaneously.
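
To make step 2 concrete, below is a minimal sketch of local learning coefficient (LLC) estimation via SGLD, in the spirit of the WBIC-style estimator λ̂ = nβ*(E_β[L_n(w)] − L_n(w*)) with β* = 1/log n (Lau et al., 2023). Everything here is illustrative: the function name `estimate_llc` is ours, and the step size `eps`, localization strength `gamma`, chain count, and burn-in all need tuning per model and dataset. A maintained implementation exists in the `devinterp` package.

```python
import copy
import math
import torch

def estimate_llc(model, loader, loss_fn, n, num_chains=4, num_steps=1000,
                 burn_in=500, eps=1e-4, gamma=100.0, device="cpu"):
    """Sketch of an SGLD-based estimator of the local learning coefficient
    at the current parameters w*:
        lambda_hat = n * beta * (E_beta[L_n(w)] - L_n(w*)),  beta = 1/log(n).
    `loss_fn` should return the mean loss over a batch; `n` is the number
    of training samples.
    """
    model = model.to(device).eval()
    beta = 1.0 / math.log(n)
    w_star = [p.detach().clone() for p in model.parameters()]

    # L_n(w*): average loss at the center of the local neighborhood.
    with torch.no_grad():
        batch_losses = [loss_fn(model(x.to(device)), y.to(device)).item()
                        for x, y in loader]
    loss_star = sum(batch_losses) / len(batch_losses)

    chain_means = []
    for _ in range(num_chains):
        sampler = copy.deepcopy(model)  # each chain restarts from w*
        draws = []
        batches = iter(loader)
        for step in range(num_steps):
            try:
                x, y = next(batches)
            except StopIteration:
                batches = iter(loader)
                x, y = next(batches)
            loss = loss_fn(sampler(x.to(device)), y.to(device))
            sampler.zero_grad()
            loss.backward()
            with torch.no_grad():
                for p, p0 in zip(sampler.parameters(), w_star):
                    # Localized SGLD: tempered gradient, restoring force
                    # toward w*, and Gaussian noise.
                    drift = beta * n * p.grad + gamma * (p - p0)
                    p.add_(-0.5 * eps * drift)
                    p.add_(math.sqrt(eps) * torch.randn_like(p))
            if step >= burn_in:
                draws.append(loss.item())
        chain_means.append(sum(draws) / len(draws))

    expected_loss = sum(chain_means) / len(chain_means)
    return n * beta * (expected_loss - loss_star)
```

Running this at checkpoints across training (step 2) and across a model-size sweep (step 3) yields the λ̂ values needed for the heatmaps.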

Expected outcomes:

  1. A more precise characterization of “effective model complexity” in double descent scenarios using the learning coefficient.
  2. Insights into the potentially distinct mechanisms underlying EWDD and MWDD.
  3. A refined understanding of how label noise and other factors contribute to double descent phenomena.
  4. Potential development of new techniques for analyzing and predicting double descent behavior using learning coefficient analysis.

This research could provide valuable insights into the nature of model complexity and generalization in deep learning, potentially reconciling the double descent phenomenon with classical learning theory through the lens of Singular Learning Theory.

Background

In double descent, the test loss goes down, then up, then down again as model size or training time grows, defying the classical bias-variance tradeoff.

1. The right form of model complexity

The traditional explanation appeals to a (vague and nebulous) “effective model complexity.” Knowing that the learning coefficient is the “correct” notion of effective model complexity, can we make this more precise? How does the learning coefficient vary with model size? Can we recover a typical bias-variance tradeoff if, instead of parameter count, we plot loss against effective parameter count?
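
One way to make this precise, assuming the standard free-energy asymptotics of Singular Learning Theory: the learning coefficient λ plays exactly the role that d/2 plays for a regular model with d parameters, which suggests reading 2λ̂ as an effective parameter count.

```latex
% Asymptotic expansion of the Bayes free energy (Watanabe):
F_n = n L_n(w^\ast) + \lambda \log n + O(\log\log n)
% A regular model with d parameters has \lambda = d/2, matching the
% BIC penalty (d/2) \log n, so define the effective parameter count
d_{\mathrm{eff}} := 2\hat{\lambda}
```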

2. Two kinds of double descent

There seem to be deeper issues with the double descent literature. In particular, epoch-wise double descent (where the double descent occurs as a function of training time) and model-wise double descent (where the double descent occurs as a function of parameter count) appear to have distinct causes.

Several authors have convincingly argued that EWDD is caused by ill-conditioning: a separation in learning speeds between different parts of the model results in fast memorization followed by very slow generalization.

Meanwhile, MWDD seems to be caused by instability in fitting random noise. There is a critical model size at which the model can first achieve perfect training error (the interpolation threshold). At this point, the model has just enough capacity to fit the noise in the dataset, which forces a contrived solution that generalizes poorly. For much larger models, many different parameter settings achieve perfect training error, and the inductive biases of training can select one that generalizes well (even while memorizing the noise).
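
For the model-wise side, here is a hypothetical sweep sketch: train a family of models of increasing width on the same noisy dataset and record test error alongside λ̂. The helpers `train_loader` (with label noise already applied) and `test_error` are assumed to exist, `make_mlp` and `train` are illustrative, and `estimate_llc` refers to the sketch above.

```python
import torch
import torch.nn as nn

def make_mlp(width, in_dim=28 * 28, out_dim=10):
    """One member of the model family; width is the swept complexity axis."""
    return nn.Sequential(nn.Flatten(),
                         nn.Linear(in_dim, width), nn.ReLU(),
                         nn.Linear(width, out_dim))

def train(model, loader, epochs=200, lr=1e-3):
    """Plain training to (near) interpolation on the noisy data."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()

# Hypothetical sweep over widths.
results = []
for width in [4, 8, 16, 32, 64, 128, 256, 512]:
    model = make_mlp(width)
    train(model, train_loader)
    lam = estimate_llc(model, train_loader, nn.CrossEntropyLoss(),
                       n=len(train_loader.dataset))
    results.append({"width": width, "test_error": test_error(model), "llc": lam})
# If the LLC tracks "effective complexity", it should peak near the
# interpolation threshold, where the test error also peaks.
```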

Can we see differences between these two kinds of double descent in the learning coefficient? In other developmental measures?

In particular, Nakkiran et al. study heatmaps that show loss and error as a function of both training time and model size, showing that EWDD and MWDD can accompany one another. If EWDD has a different origin than classical MWDD, then perhaps some of the model-wise double descent in these heatmaps is really an artifact of EWDD. Would eliminating EWDD in these cases also eliminate MWDD? If so, that would be evidence that reported instances of MWDD have several distinct explanations.

EWDD typically appears to require some amount of label noise. When does the model memorize these noisy samples? Is learning separated into three distinct stages (generalizing a little, memorizing, generalizing a lot)?
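
To study this empirically, the corrupted samples need to be identifiable. Below is a minimal sketch, with the illustrative helper `add_label_noise`, of the symmetric label noise used in the EWDD literature (e.g., Nakkiran et al.): each label is independently replaced with a uniformly random class with some probability.

```python
import torch

def add_label_noise(labels, noise_frac=0.15, num_classes=10, seed=0):
    """Symmetric label noise: replace a fraction of labels with uniformly
    random classes. Returns the noisy labels and a mask marking which
    samples were corrupted, so memorization of noisy samples can be
    tracked separately from the clean ones across training.
    """
    g = torch.Generator().manual_seed(seed)
    n = labels.shape[0]
    mask = torch.rand(n, generator=g) < noise_frac
    random_labels = torch.randint(0, num_classes, (n,), generator=g)
    return torch.where(mask, random_labels, labels), mask
```

Tracking the training loss separately on the corrupted subset (via the returned mask) across epochs is one direct way to test for the three-stage picture.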

Where to begin:

If you have decided to start working on this, please let us know in the Discord. We'll update this listing so that other people who are interested in this project can find you.
