The Local Learning Coefficient: A Singularity-Aware Complexity Measure

Authors

Edmund Lau
University of Melbourne
Zach Furman
Timaeus
George Wang
Timaeus
Daniel Murfet
University of Melbourne
Susan Wei
University of Melbourne

Publication Details

Published:
August 23, 2023
Venue:
AISTATS 2025

Abstract

Deep neural networks (DNNs) are singular statistical models that exhibit complex degeneracies. In this work, we illustrate how a quantity known as the learning coefficient, introduced in singular learning theory, precisely quantifies the degree of degeneracy in deep neural networks. Importantly, we demonstrate that degeneracy in DNNs cannot be accounted for by simply counting the number of 'flat' directions.


1 Introduction

Occam’s razor, a foundational principle in scientific inquiry, suggests that the simplest among competing hypotheses should be selected. This principle has emphasized simplicity and parsimony in scientific thinking for centuries. Yet, the advent of deep neural networks (DNNs), with their multi-million parameter configurations and complex architectures, poses a stark challenge to this time-honored principle. The question arises: how do we reconcile the effectiveness of these intricate models with the pursuit of simplicity?

It is natural to pose this question in terms of model complexity. Unfortunately, many existing definitions of model complexity are problematic for DNNs. For instance, the parameter count, which in classical statistical learning theory is a commonly-used measure of the amount of information captured in a fitted model, is well-known to be inappropriate in deep learning. This is clear from technical results on generalization (Zhang et al., 2017), pruning (Blalock et al., 2020) and distillation (Hinton et al., 2015): the amount of information captured in a trained network is in some sense decoupled from the number of parameters.

As forcefully argued in (Wei et al., 2022), the gap in our understanding of DNN model complexity is induced by singularities, hence the need for a singularity-aware complexity measure. This motivates our appeal to Singular Learning Theory (SLT), which recognizes the necessity for sophisticated tools in studying statistical models exhibiting singularities. Roughly speaking, a model is singular if there are many ways to vary the parameters without changing the function; the more ways to vary without a change, the more singular (and thus more degenerate). DNNs are prime examples of such singular statistical models, characterized by complex degeneracies that make them highly singular.

SLT continues a longstanding tradition of using free energy as the starting point for measuring model complexity (Bialek et al., 2001), and here we see the implications of singularities. Specifically, the free energy, also known as the negative log marginal likelihood, can be shown to diverge asymptotically with sample size $n$ according to the law $an+b\log n+o(\log\log n)$; the coefficient $a$ of the linear term is the minimal loss achievable, and the coefficient $b$ of the remaining logarithmic divergence is then taken as the model complexity of the model class. In regular statistical models, the coefficient $b$ is the number of parameters (divided by 2). In singular statistical models, $b$ is not tied to the number of parameters, indicating a different kind of complexity at play.
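
For orientation, in a regular model with $d$ parameters this expansion reduces to the familiar Bayesian Information Criterion; this is a standard fact, recalled here only for context:

$F_{n}=nL_{n}(\hat{w})+\frac{d}{2}\log n+O_{p}(1),$

where $\hat{w}$ is the maximum likelihood estimate. In singular models, the coefficient of $\log n$ is instead the learning coefficient, which can be strictly smaller than $d/2$.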

The local learning coefficient (LLC) arises out of a consideration of the local free energy. We explore the mathematical underpinnings of the LLC in Section 3, utilizing intuitive concepts like volume scaling to aid understanding. Our contributions encompass 1) the definition of the new LLC complexity measure, 2) the development of a scalable estimator for the LLC, and 3) empirical validations that underscore the accuracy and practicality of the LLC estimator. In particular, we demonstrate that the estimator is accurate and scalable to modern network sizes in a setting where theoretical learning coefficients are available for comparison. Furthermore, we show empirically that some common training heuristics effectively control the LLC.

Figure 1: Impact of SGD learning rate (top), batch size (middle) and momentum (bottom) when training ResNet18 on CIFAR10. We plot the LLC estimate (left), test accuracy (middle) and train loss (right) across training time. As the strength of the implicit regularization increases — through higher learning rate, lower batch size and higher momentum — LLC decreases (the network gets “simpler”) and test accuracy increases. Even though most training losses collapse to zero, the LLC can discern the implicit regularization pressure applied by various training heuristics.

On this last point, we preview some results; Figure 1 displays the LLC estimate, test accuracy, and training loss over the course of training ResNet18 on CIFAR10. Loosely speaking, a lower LLC means a less complex, and thus more degenerate, neural network. In each of the rows, lighter colors represent stronger implicit regularization, e.g., higher learning rate, lower batch size, higher SGD momentum, which we see corresponds to a preference for lower LLC, i.e., simpler neural networks.

2 Setup

Let $W\subset\mathbb{R}^{d}$ be a compact space of parameters $w\in W$. Consider the model-truth-prior triplet

$(p(x,y|w),\ q(x,y),\ \varphi(w)),$ (1)

where $q(x,y)=q(y|x)q(x)$ is the true data-generating mechanism, $p(x,y|w)=p(y|x,w)q(x)$ is the posited model with parameter $w$ representing the neural network weights, and $\varphi$ is a prior on $w$. Suppose we are given a training dataset of $n$ input-output pairs, $\mathcal{D}_{n}=\{(x_{i},y_{i})\}_{i=1}^{n}$, drawn i.i.d. from $q(x,y)$.

To these objects, we can associate the sample negative log likelihood function defined as $L_{n}(w)=-\frac{1}{n}\sum_{i=1}^{n}\log p(y_{i}|x_{i},w)$, and its theoretical counterpart defined as $L(w)=-\mathrm{E}_{q(x,y)}\log p(y|x,w)$. It is appropriate to also call $L_{n}$ and $L$ the training and population loss, respectively, since the negative log likelihood encompasses many classic loss functions used in machine learning and deep learning, such as mean squared error (MSE) and cross-entropy.
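
To make this concrete, here is a small illustration (ours, not from the paper) of how $L_{n}$ specializes to familiar losses: for a classifier it is exactly the cross-entropy of the predicted class probabilities, and for a Gaussian regression model it is the MSE up to a scale and an additive constant.

```python
# Small illustration (ours): the empirical negative log likelihood L_n specializes
# to cross-entropy for classification and to (scaled, shifted) MSE for Gaussian regression.
import numpy as np

def nll_categorical(probs, labels):
    """Average negative log probability of the observed labels, i.e. cross-entropy."""
    return -np.mean(np.log(probs[np.arange(len(labels)), labels]))

def nll_gaussian(preds, targets, sigma=1.0):
    """NLL of y ~ N(f_w(x), sigma^2): MSE / (2 sigma^2) plus a constant in w."""
    const = 0.5 * np.log(2 * np.pi * sigma**2)
    return np.mean((preds - targets) ** 2) / (2 * sigma**2) + const

probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])
labels = np.array([0, 1])
print(nll_categorical(probs, labels))                  # ~0.29, the cross-entropy
print(nll_gaussian(np.array([0.5]), np.array([0.0])))  # 0.125 + 0.5*log(2*pi) ~ 1.04
```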

The behavior of the training and population losses is highly nontrivial for neural networks. To properly account for the model complexity of neural network models, it is critical to engage with the challenges posed by singularities. To appreciate this, we follow Watanabe (2009) and make the following distinction between regular and singular statistical models. A statistical model $p(x,y|w)$ is called regular if it is 1) identifiable, i.e. the parameter-to-distribution map $w\mapsto p(x,y|w)$ is one-to-one, and 2) its Fisher information matrix $I(w)$ is everywhere positive definite. We call a model singular if it is not regular. For an introduction to the implications of singular learning theory for deep learning, we refer the reader to Appendix A and further reading in (Wei et al., 2022). We shall assume throughout that the triplet (1) satisfies a few fundamental conditions in SLT (Watanabe, 2009). These conditions are stated and discussed in an accessible manner in Appendix A.1.

3 The local learning coefficient

In this paper, we introduce the Local Learning Coefficient (LLC), an extension of the global learning coefficient of Watanabe (2009), referred to there simply as the learning coefficient. Below we focus on explaining the LLC through its geometric intuition, specifically as an invariant based on the volume of a basin of the loss landscape.

For readers interested in the detailed theoretical foundations, we have included comprehensive explanations in the appendices. Appendix A offers a short introduction to the basics of Singular Learning Theory (SLT), and Appendix B sets out formal conditions for the well-definedness of the LLC. This structure ensures that readers with varying levels of familiarity with SLT can engage with the content at their own pace.


3.1 Complexity via counting low loss parameters

At a local minimum of the population loss landscape there is a natural notion of complexity, given by the number of bits required to specify the minimum to within a tolerance $\epsilon$. This idea is well-known in the literature on minimum description length (Grünwald and Roos, 2019) and was used by Hochreiter and Schmidhuber (1997) in an attempt to quantify the complexity of a trained neural network. However, a correct treatment has to take into consideration the degeneracy of the geometry of the population loss $L$, as we now explain.

Consider a local minimum $w^{*}$ of the population loss $L$ and a closed ball $B(w^{*})$ centered on $w^{*}$ such that for all $w\in B(w^{*})$ we have $L(w)\geq L(w^{*})$. Given a tolerance $\epsilon>0$ we can consider the set of parameters $B(w^{*},\epsilon)=\{w\in B(w^{*})\mid L(w)-L(w^{*})<\epsilon\}$ whose loss is within the tolerance, the volume of which we define to be

$V(\epsilon):=\operatorname{Vol}(B(w^{*},\epsilon))=\int_{B(w^{*},\epsilon)}dw.$ (2)

The minimal number of bits to specify this set within the ball is

$-\log_{2}\big(V(\epsilon)/\operatorname{Vol}(B(w^{*}))\big).$ (3)

This can be taken as a measure of the complexity of the set of low-loss parameters near $w^{*}$. However, as it stands this notion depends on $\epsilon$ and has no intrinsic meaning. Classically, this is addressed as follows: if a model is regular and thus $L(w)$ is locally quadratic around $w^{*}$, the volume satisfies a law of the form $V(\epsilon)\approx c\epsilon^{d/2}$, where $c$ is a constant that depends on the curvature of the basin around the local minimum $w^{*}$, and $d$ is the dimension of $W$.

This explains why $\frac{d}{2}$ is a valid measure of complexity in the regular case, albeit one that cannot distinguish $w^{*}$ from any other local minimum. The curvature constant $c$ is less significant than the scaling exponent, but it can distinguish local minima, and $\log c$ is sometimes used as a complexity measure.

The population loss of a neural network is not locally quadratic near its local minima, from which it follows that the functional form of the volume $V(\epsilon)$ is more complicated than in the regular case. The correct functional form was discovered by Watanabe (2009) and we adapt it here to a local neighborhood of a parameter (see Appendix A for details):

Definition 1 (The Local Learning Coefficient (LLC), $\lambda(w^{*})$).

There exists a unique rational number $\lambda(w^{*})$ (that it is rational-valued, and not merely real-valued as one would naively assume, is a deep fact of algebraic geometry derived from Hironaka's celebrated resolution of singularities), a positive integer $m(w^{*})$ and a constant $c>0$ such that asymptotically as $\epsilon\to 0$,

$V(\epsilon)=c\epsilon^{\lambda(w^{*})}(-\log\epsilon)^{m(w^{*})-1}+o\!\left(\epsilon^{\lambda(w^{*})}(-\log\epsilon)^{m(w^{*})-1}\right).$ (4)

We call $\lambda(w^{*})$ the Local Learning Coefficient (LLC), and $m(w^{*})$ the local multiplicity.

In the case where $m(w^{*})=1$, the formula simplifies, and

$V(\epsilon)\propto\epsilon^{\lambda(w^{*})}$ (5)

Thus, the LLC $\lambda(w^{*})$ is the (asymptotic) volume-scaling exponent near a minimum $w^{*}$ in the loss landscape: increasing the error tolerance by a factor of $a$ increases the volume by a factor of $a^{\lambda(w^{*})}$. Applying Equation (3), the number of bits needed to specify the region of volume $V(\epsilon)$ within $B(w^{*})$, for sufficiently small $\epsilon$ and in the case $m(w^{*})=1$, is approximated by

$-\lambda(w^{*})\log_{2}\epsilon+O(\log_{2}\log_{2}\epsilon).$

Informally, the LLC tells us the number of additional bits needed to halve an already small error tolerance $\epsilon$:

$-\log_{2}\big[V(\tfrac{\epsilon}{2})/V(\epsilon)\big]\approx-\lambda(w^{*})\log_{2}\tfrac{\epsilon}{2}+\lambda(w^{*})\log_{2}\epsilon=\lambda(w^{*})$

See Figure LABEL:fig:volume_scaling for simple examples of the LLC as a scaling exponent. We note that, in contrast to the regular case, the scaling exponent $\lambda(w^{*})$ depends on the underlying data distribution $q(x,y)$.
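
As a toy illustration of the scaling exponent (ours, not from the paper), one can estimate it by brute force: sample uniformly from a box around $w^{*}=0$, record the fraction of samples whose loss lies below each tolerance $\epsilon$, and fit the slope of $\log V(\epsilon)$ against $\log\epsilon$. The three losses below are illustrative stand-ins for a population loss, not models from the paper.

```python
# Toy sketch (ours): estimate the volume-scaling exponent of a loss around w* = 0
# by Monte Carlo, i.e. fit the slope of log V(eps) versus log eps.
import numpy as np

rng = np.random.default_rng(0)

def volume_scaling_exponent(loss, d=2, n_samples=2_000_000):
    w = rng.uniform(-1.0, 1.0, size=(n_samples, d))               # uniform samples in a box around 0
    losses = loss(w)
    eps_grid = np.logspace(-4, -1, 7)
    vols = np.array([np.mean(losses < eps) for eps in eps_grid])  # V(eps), up to a constant factor
    slope, _ = np.polyfit(np.log(eps_grid), np.log(vols), 1)
    return slope

print(volume_scaling_exponent(lambda w: np.sum(w**2, axis=1)))    # regular quadratic: ~1.0 = d/2
print(volume_scaling_exponent(lambda w: w[:, 0]**2))              # one flat direction: ~0.5
print(volume_scaling_exponent(lambda w: (w[:, 0] * w[:, 1])**2))  # ~0.36 over this eps range; drifts
                                                                  # to lambda = 1/2 only as eps -> 0
```

The last example illustrates the point made in the abstract: the degenerate loss $(w_{1}w_{2})^{2}$ has $\lambda=\tfrac{1}{2}$ but multiplicity $m=2$, so the finite-$\epsilon$ slope is depressed by the $(-\log\epsilon)^{m-1}$ factor, and no simple count of flat directions reproduces this behavior.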

The (global) learning coefficient $\lambda$ was defined by Watanabe (2001). If $w^{*}$ is taken to be a global minimum of the population loss $L(w)$, and the ball $B(w^{*})$ is taken to be the entire parameter space $W$, then we obtain the global learning coefficient $\lambda$ as the scaling exponent of the volume $V(\epsilon)$. The learning coefficient and related quantities like the WBIC (Watanabe, 2013) have historically seen significant application in Bayesian model selection (e.g. Endo et al., 2020; Fontanesi et al., 2019; Hooten and Hobbs, 2015; Sharma, 2017; Kafashan et al., 2021; Semenova et al., 2020).

In Appendix C, we prove that the LLC is invariant to local diffeomorphism: that is, roughly, a locally smooth and invertible change of variables. This property is motivated by the desire that a good complexity measure should not be confounded by superficial differences in how a model is represented: two models which are essentially the same should have the same complexity.

4 LLC estimation

Having established the intuition behind the theoretical LLC in terms of volume scaling, we now turn to the task of estimating it. As described in the introduction, the LLC is a coefficient in the asymptotic expansion of the local free energy. It is this fact that we leverage for estimation. There is a mathematically rigorous link between volume scaling and the appearance of the LLC in the local free energy which we will not discuss in detail; see (Watanabe, 2009, Theorem 7.1).

We first introduce what we call the idealized LLC estimator, which is theoretically sound but computationally intractable. In the sections that follow, we walk through the steps taken to engineer a practically implementable version of this idealized estimator.

4.1 Idealized LLC estimator

Consider the following integral

$Z_{n}(B_{\gamma}(w^{*}))=\int_{B_{\gamma}(w^{*})}\exp\{-nL_{n}(w)\}\varphi(w)\,dw,$ (6)

where $L_{n}(w)$ is again the sample negative log likelihood, $B_{\gamma}(w^{*})$ denotes a small ball of radius $\gamma$ around $w^{*}$ and $\varphi$ is a prior over model parameters $w$. If (6) is high, there is high posterior concentration around $w^{*}$. In this sense, (6) is a measure of the concentration of low-loss solutions near $w^{*}$. Next consider a negative log transformation of $Z_{n}(B_{\gamma}(w^{*}))$, i.e.,

$F_{n}(B_{\gamma}(w^{*}))=-\log Z_{n}(B_{\gamma}(w^{*})).$ (7)

This quantity is sometimes called the (negative) log marginal likelihood or the free energy, depending on the discipline. Given a local minimum $w^{*}$ of the population negative log likelihood $L(w)$, it can be shown using the machinery of SLT that, asymptotically in $n$, we have

$F_{n}(B_{\gamma}(w^{*}))=n\underbrace{L_{n}(w^{*})}_{\text{energy}}+\underbrace{\lambda(w^{*})}_{\text{entropy}}\log n+o_{p}(\log\log n),$ (8)

where $\lambda(w^{*})$ is the theoretical LLC expounded in Section 3. Remarkably, the asymptotic approximation in (8) holds even for singular models such as neural networks; further discussion on this can be found in Appendix B.

The asymptotic approximation in (8) suggests that a reasonable estimator of $\lambda(w^{*})$ might come from re-arranging (8) to give what we call the idealized LLC estimator,

$\hat{\lambda}^{\mathrm{idealized}}(w^{*})=\frac{F_{n}(B_{\gamma}(w^{*}))-nL_{n}(w^{*})}{\log n}.$ (9)

But as indicated by the name, the idealized LLC estimator cannot be easily implemented; computing, or even MCMC sampling from, the posterior to estimate $F_{n}(B_{\gamma}(w^{*}))$ is made no less challenging by the need to confine sampling to the neighborhood $B_{\gamma}(w^{*})$. In what follows, we use the idealized LLC estimator as inspiration for a practically-minded LLC estimator.

4.2 Surrogate for enforcing $B_{\gamma}(w^{*})$

The first step towards a practically-minded LLC estimator is to circumvent the constraint posed by the neighborhood $B_{\gamma}(w^{*})$. To this end, we introduce a localizing Gaussian prior that acts as a surrogate for enforcing the domain of integration given by $B_{\gamma}(w^{*})$. Specifically, let

$\varphi_{\gamma}(w)\propto\exp\left\{-\frac{\gamma}{2}\|w\|_{2}^{2}\right\}$

be a Gaussian prior centered at the origin with scale parameter $\gamma>0$. We replace (6) with

$Z_{n}(w^{*},\gamma)=\int\exp\{-nL_{n}(w)\}\varphi_{\gamma}(w-w^{*})\,dw,$

which, for $\beta=1$, can also be recognized as the normalizing constant of the posterior distribution given by

$p(w|w^{*},\beta,\gamma)\propto\exp\left\{-n\beta L_{n}(w)-\frac{\gamma}{2}\|w-w^{*}\|_{2}^{2}\right\},$ (10)

where $\beta>0$ plays the role of an inverse temperature. Large values of $\gamma$ force the posterior distribution in (10) to stay close to $w^{*}$. A word on the notation $p(w|w^{*},\beta,\gamma)$: this is a distribution in $w$ alone; the parameters $w^{*},\beta,\gamma$ are fixed, hence the normalizing constant of (10) is an integral over $w$ only. As $Z_{n}(w^{*},\gamma)$ is to be viewed as a proxy for (6), we shall accordingly treat

$F_{n}(w^{*},\gamma)\coloneqq-\log Z_{n}(w^{*},\gamma)$

as a proxy for $F_{n}(B_{\gamma}(w^{*}))$ in (7). Although it is tempting at this stage to simply drop $F_{n}(w^{*},\gamma)$ into the idealized LLC estimator in place of $F_{n}(B_{\gamma}(w^{*}))$, we still have to address the estimation of $F_{n}(w^{*},\gamma)$, the subject of the next section.

4.3 The LLC estimator

Let us denote the expectation of a function $f(w)$ with respect to the posterior distribution in (10) as

$\mathrm{E}_{w|w^{*},\beta,\gamma}f(w)\coloneqq\int f(w)\,p(w|w^{*},\beta,\gamma)\,dw.$

Consider the quantity

$\mathrm{E}_{w|w^{*},\beta^{*},\gamma}[nL_{n}(w)]$ (11)

where the inverse temperature is deliberately set to $\beta^{*}=1/\log n$. The quantity in (11) may be regarded as a localized version of the widely applicable Bayesian information criterion (WBIC) first introduced in Watanabe (2013). It can be shown that (11) is a good estimator of $F_{n}(w^{*},\gamma)$ in the following sense: the leading-order terms of (11) match those of $F_{n}(w^{*},\gamma)$ when we perform an asymptotic expansion in the sample size $n$. This justifies using (11) to estimate $F_{n}(w^{*},\gamma)$. Further discussion can be found in Appendix D.
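
To see informally why the choice $\beta^{*}=1/\log n$ is the right one, recall the standard WBIC-style expansion (Watanabe, 2013), stated here loosely and without the localization details:

$\mathrm{E}_{w|w^{*},\beta,\gamma}[nL_{n}(w)]=nL_{n}(w^{*})+\frac{\lambda(w^{*})}{\beta}+(\text{lower-order fluctuation terms}),$

so setting $\beta=\beta^{*}=1/\log n$ reproduces the leading terms $nL_{n}(w^{*})+\lambda(w^{*})\log n$ of the local free energy in (8).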

Going back to (9), we approximate $F_{n}(B_{\gamma}(w^{*}))$ first with $F_{n}(w^{*},\gamma)$, which is in turn estimated by (11). We are finally ready to define the LLC estimator.

Definition 2 (Local Learning Coefficient (LLC) estimator).

Let $w^{*}$ be a local minimum of $L(w)$. Let $\beta^{*}=1/\log n$. The associated local learning coefficient estimator is given by

$\hat{\lambda}(w^{*})\coloneqq n\beta^{*}\left[\mathrm{E}_{w|w^{*},\beta^{*},\gamma}L_{n}(w)-L_{n}(w^{*})\right].$ (12)

Note that $\hat{\lambda}(w^{*})$ depends on $\gamma$, but we have suppressed this in the notation. Let us ponder the pleasingly simple form of the LLC estimator. The expectation term in (12) measures the loss $L_{n}$ under perturbation near $w^{*}$. If the perturbed loss, under this expectation, is very close to $L_{n}(w^{*})$, then $\hat{\lambda}(w^{*})$ is small. This accords with our intuition that if $w^{*}$ is simple, its loss should not change too much under reasonable perturbations. Finally, we note that in applications we use the empirical loss $L_{n}(w)$ to determine a critical point of interest, i.e., $\hat{w}_{n}^{*}\coloneqq\operatorname*{arg\,min}_{w}L_{n}(w)$. We lose something by plugging $\hat{w}_{n}^{*}$ into (12) directly, since we end up using the dataset $\mathcal{D}_{n}$ twice. However, we do not observe adverse effects in our experiments; see Figure 4 for an example.

4.4 The SGLD-based LLC estimator

The LLC estimator defined in (12) is not prescriptive as to how the expectation with respect to the posterior distribution should actually be approximated. A wide array of MCMC techniques are possible. However, to be able to estimate the LLC at the scale of modern deep learning, we must look at efficiency. In practice, the computational bottleneck to implementing (12) is the MCMC sampler. In particular, traditional MCMC samplers must compute log-likelihood gradients across the entire training dataset, which is prohibitively expensive at modern dataset sizes.

If one modifies these samplers to take minibatch gradients instead of full-batch gradients, the result is stochastic-gradient MCMC, the prototypical example of which is Stochastic Gradient Langevin Dynamics (SGLD) (Welling and Teh, 2011). The computational cost of this sampler is much lower: roughly the cost of a single SGD step times the number of samples required. The standard SGLD update applied to sampling (10) at the inverse temperature $\beta^{*}$ required for the LLC estimator is given by

$\Delta w_{t}=\frac{\epsilon}{2}\left(\frac{\beta^{*}n}{m}\sum_{(x,y)\in B_{t}}\nabla\log p(y|x,w_{t})+\gamma(w^{*}-w_{t})\right)+N(0,\epsilon)$

where $B_{t}=\{(x_{i},y_{i})\}_{i=1}^{m}$ is a randomly sampled minibatch of size $m$ for step $t$ and $\epsilon$ controls both the step size and the variance of the injected Gaussian noise. Crucially, the log-likelihood gradient is evaluated using minibatches. In practice, we shuffle the dataset once and partition it into segments of size $m$ to use as minibatches, instead of drawing fresh random samples of size $m$ at each step.

Let us now suppose we have obtained $T$ approximate samples $\{w_{1},w_{2},\dots,w_{T}\}$ from the tempered posterior distribution at inverse temperature $\beta^{*}$ via SGLD. These are usually taken from the SGLD trajectory after burn-in. We can then form what we call the SGLD-based LLC estimator,

$\hat{\lambda}^{\mathrm{SGLD}}(w^{*}):=n\beta^{*}\left[\frac{1}{T}\sum_{t=1}^{T}L_{n}(w_{t})-L_{n}(w^{*})\right].$ (13)

For further computational savings, we also recycle the forward passes that compute $L_{m}(w_{t})$, which are required for computing $\nabla_{w}L_{m}(w_{t})$ via back-propagation, as unbiased estimates of $L_{n}(w_{t})$. Here by $L_{m}(w_{t})$ we mean $-\frac{1}{m}\sum_{(x,y)\in B_{t}}\log p(y|x,w_{t})$, though the notation suppresses the dependence on $B_{t}$ for brevity. Pseudocode for this minibatch version of the SGLD-based LLC estimator is provided in Appendix G. Henceforth, when we say the SGLD-based LLC estimator, we are referring to the minibatch version. In Appendix H, we give a comprehensive guide on best practices for implementing the SGLD-based LLC estimator, including choices for $\gamma$, $\epsilon$, the number of SGLD iterations, and the required burn-in.
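
To make the recipe concrete, here is a PyTorch-style sketch of the minibatch SGLD-based LLC estimator (ours; the authoritative pseudocode is in Appendix G). The helper loss_fn(model, batch), the data loader, and all hyperparameter values are assumptions made for illustration only; in particular, $L_{n}(w^{*})$ is approximated below by a single minibatch loss at initialization.

```python
# Sketch (ours) of the minibatch SGLD-based LLC estimator; see Appendix G for the paper's pseudocode.
# Assumes: loss_fn(model, batch) returns the average negative log likelihood L_m on the batch,
# loader yields minibatches of size m, and n is the training set size. Hyperparameters are illustrative.
import copy
import math
import torch

def iterate_forever(loader):
    while True:
        for batch in loader:
            yield batch

def estimate_llc_sgld(model, loader, loss_fn, n, epsilon=1e-5, gamma=100.0,
                      num_steps=2000, burn_in=1000):
    beta_star = 1.0 / math.log(n)                               # beta* = 1 / log n
    anchor = [p.detach().clone() for p in model.parameters()]   # w*, the point being measured
    sampler = copy.deepcopy(model)                              # SGLD chain initialized at w*

    batches = iterate_forever(loader)
    loss_at_anchor, chain_losses = None, []
    for step in range(num_steps):
        loss = loss_fn(sampler, next(batches))                  # minibatch estimate of L_n(w_t)
        if loss_at_anchor is None:
            loss_at_anchor = loss.item()                        # single-batch estimate of L_n(w*)
        sampler.zero_grad()
        loss.backward()
        with torch.no_grad():
            for p, p0 in zip(sampler.parameters(), anchor):
                # drift = -(beta* n) grad L_m(w_t) + gamma (w* - w_t), as in the SGLD update above
                drift = -(beta_star * n) * p.grad + gamma * (p0 - p)
                p.add_(0.5 * epsilon * drift)
                p.add_(epsilon ** 0.5 * torch.randn_like(p))    # injected noise N(0, epsilon)
        if step >= burn_in:
            chain_losses.append(loss.item())                    # recycle the forward pass

    mean_loss = sum(chain_losses) / len(chain_losses)
    return n * beta_star * (mean_loss - loss_at_anchor)         # n beta* (E[L_n(w)] - L_n(w*))
```

In practice, averaging the baseline $L_{n}(w^{*})$ over several minibatches (or the full dataset) reduces the variance of the estimate; the hyperparameter choices are discussed in Appendix H.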

5 Experiments

The goal of our experiments is to give evidence that the LLC estimator is accurate, scalable and can reveal insights on deep learning practice. Throughout the experiments, we implement the minibatch version of the SGLD-based LLC estimator presented in the pseudo-algorithm in Appendix G.

There are also a number of experiments we performed that are relegated to the appendices due to space constraints. They include

  • deploying LLC estimation on transformer language models (Appendix L)

  • an experiment on a small ReLU network verifying that the SGLD sampler is just as accurate as one that uses full-batch gradients (Appendix M)

  • an experiment comparing the optimizers SGD and entropy-SGD, the latter having an intimate connection to our notion of complexity (Appendix N)

  • an experiment verifying the scaling invariance property (Appendix O) set out in Appendix C

Every experiment described in the main text below has an accompanying section in the Appendix that offers full experimental details, further discussion, and possible additional figures/tables.

5.1 LLC for Deep Linear Networks (DLNs)

In this section, we verify the accuracy and scalability of our LLC estimator against theoretical LLC values in deep linear networks (DLNs) with up to 100M parameters. Recall that DLNs are fully-connected feedforward neural networks with the identity activation function. The input-output behavior of a DLN is trivial: it is equivalent to a single-layer linear network obtained by multiplying together the weight matrices. However, the geometry of such a model is highly non-trivial — in particular, the optimization dynamics and inductive biases of such networks have seen significant research interest (Saxe et al., 2013; Ji and Telgarsky, 2018; Arora et al., 2018).
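
As a quick illustration of this triviality (ours, with arbitrary layer sizes chosen for the example), a DLN's forward map collapses to a single matrix product even though its parameterization is much larger:

```python
# Minimal sketch (ours): a deep linear network computes a single matrix product,
# so its input-output behavior is that of a one-layer linear map.
from functools import reduce
import numpy as np

rng = np.random.default_rng(0)
layer_sizes = [784, 512, 512, 10]                        # arbitrary sizes for illustration
weights = [rng.standard_normal((out, inp)) / np.sqrt(inp)
           for inp, out in zip(layer_sizes[:-1], layer_sizes[1:])]

x = rng.standard_normal(layer_sizes[0])
h = x
for W in weights:                                        # forward pass through the DLN
    h = W @ h

collapsed = reduce(lambda acc, W: W @ acc, weights)      # W_L ... W_2 W_1
assert np.allclose(collapsed @ x, h)
print(sum(W.size for W in weights), "parameters, but only", collapsed.size, "in the collapsed map")
```

The parameter count vastly exceeds the dimension of the collapsed map, which is precisely why the geometry, and hence the learning coefficient, of a DLN is non-trivial.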

Thus one reason we chose to study DLN model complexity is that they have long served as an important sandbox for deep learning theory. Another key factor is the recent derivation of theoretical LLCs for DLNs by Aoyagi (2024), making DLNs the most realistic setting in which theoretical learning coefficients are available. The significance of Aoyagi (2024) lies in the substantial technical difficulty of deriving theoretical global learning coefficients (and, by extension, theoretical LLCs), which means that these coefficients are generally unavailable except in a few exceptional cases, with most of the research conducted decades ago (Yamazaki and Watanabe, 2005b; Aoyagi et al., 2005; Yamazaki and Watanabe, 2003; Yamazaki and Watanabe, 2005a). Further details and discussion of the theoretical result in Aoyagi (2024) can be found in Appendix I.

Figure 4: Estimated LLC against the true learning coefficient; model dimension shown in color. On the left, we evaluate the LLC estimator at a global minimum, $w^{*}$, of the population loss. On the right, we evaluate the LLC estimator at a minimum, $\hat{w}_{n}^{*}$, found by SGD. Fortunately, we do not see an adverse effect of using the training data twice, a minor concern we had raised at the end of Section 4.3. The estimated LLC accurately measures the learning coefficient $\lambda$ up to 100 million parameters in deep linear networks, as compared to known theoretical values (dashed line). See Figure J.1 for linear-scale plots.

The results are summarized in Figure 4. Our LLC estimator is able to accurately estimate the learning coefficient in DLNs with up to 100M parameters. We further show that accuracy is maintained even if (as is typical in practice) one does not evaluate the LLC at a local minimum of the population loss, but is instead forced to use SGD to first find a local minimum $\hat{w}_{n}^{*}$. Further experimental details and results for this section are given in Appendix J. Overall, it is quite remarkable that our LLC estimator, after the series of engineering steps described in Section 4, maintains such high levels of accuracy.

5.2 LLC for ResNet

Here, we empirically test whether implicit regularization effectively induces a preference for simplicity as manifested by a lower LLC. We examine the effects of SGD learning rate, batch size, and momentum on the LLC. Note that we do so in isolation, i.e., we do not look at interactions between these factors. For instance, when we vary the learning rate, we hold everything else constant including batch size and momentum values.

We perform experiments on ResNet18 trained on CIFAR10 and show the results in Figure 1. We see that the training loss reaches zero in most instances and therefore cannot, on its own, distinguish between the effects of implicit regularization. In contrast, we see a consistent pattern of "stronger implicit regularization = higher test accuracy = lower LLC". Specifically, higher learning rate, lower batch size, and higher momentum all apply stronger implicit regularization, which is reflected in a lower LLC. Full experimental details can be found in Appendix K. Note that in Figure 1 we employed SGD without momentum in the top two rows. We repeat the experiments in these top two rows for SGD with momentum; the associated results in Figure K.1 support very similar conclusions. We also conducted explicit regularization experiments involving L2 regularization (Figure K.2 in Appendix K) and again conclude that stronger regularization is accompanied by a lower LLC.

In contrast to the LLC estimates for DLNs, the LLC estimates for ResNet18 cannot be calibrated, because we do not know the true LLC values. For many models in practice, the value of LLC estimates lies in comparing their relative magnitudes between models with a shared context. For instance, when comparing LLC values we might hold everything constant while varying one factor, such as SGD batch size, learning rate, or momentum, as we have done in this section.

6 Related work

We briefly cover related work here; more detail may be found in Appendix E. The primary reference for the theoretical foundation of this work, known as SLT, is Watanabe (2009). The global learning coefficient, first introduced by Watanabe (2001), provides the asymptotic expansion of the free energy, which is equivalent to the negative log Bayes marginal likelihood, an all-important quantity in Bayesian analysis. Subsequent research has utilized algebraic-geometric tools to calculate the learning coefficient for various machine learning models (Yamazaki and Watanabe, 2005b; Aoyagi et al., 2005; Aoyagi, 2024; Yamazaki and Watanabe, 2003; Yamazaki and Watanabe, 2005a). SLT has also enhanced the understanding of model selection criteria in Bayesian statistics. Of particular relevance to this work is Watanabe (2013), which introduced the WBIC estimator of the free energy. This estimator has been applied in various practical settings (Endo et al., 2020; Fontanesi et al., 2019; Hooten and Hobbs, 2015; Sharma, 2017; Kafashan et al., 2021; Semenova et al., 2020).

The LLC can be seen as a singularity-aware version of basin broadness measures, which attempt to connect geometric "broadness" or "flatness" with model complexity (Hochreiter and Schmidhuber, 1997; Jiang et al., 2019). In particular, the LLC estimator bears resemblance to PAC-Bayes inspired flatness/sharpness measures (Neyshabur et al., 2017), but takes into account the non-Gaussian nature of the posterior distribution in singular models.

The LLC is set apart from classic model complexity measures such as Rademacher complexity (Koltchinskii and Panchenko, 2000) and the VC dimension (Vapnik and Chervonenkis, 1971) because it measures the complexity of a specific model $p(y|x,w)$ rather than the complexity of the function class $\{f(x|w):w\in W\}$, where $f$ is the neural network function. This makes the LLC an appealing tool for understanding the interplay between function class, data properties, and training heuristics.

Concurrent work by Chen et al. (2023a) proposes a measure called the learning capacity, which can be viewed as a finite-$n$ version of the learning coefficient, and investigates its behavior as a function of the training set size $n$.

7 Outlook

An exciting direction for future research is to study the role of the LLC in detecting phase transitions and emergent abilities in deep learning models. A first step in this direction was undertaken in Chen et al. (2023b), where the energy (training loss) and entropy (both estimated and theoretical LLC) were tracked as training progressed; it was observed that energy and entropy proceed along staircases in opposing directions. Further, Hoogland et al. (2024) showed how the estimated LLC can be used to detect phase transitions in the formation of in-context learning in transformer language models. It is natural to wonder whether the LLC could shed light on the hypothesis that SGD-trained neural networks sequentially learn the target function with a "saddle-to-saddle" dynamic. Previous theoretical works had to devise complexity measures on a case-by-case basis (Abbe et al., 2023; Berthier, 2023). We posit that the free energy perspective could offer a more unified and general approach to understanding the intricate dynamics of learning in deep neural networks by accounting for the competition between model fit and model complexity during training.

Appendix A Background on Singular Learning Theory

Most models in machine learning are singular: they contain parameters where the Fisher information matrix is singular. While these parameters where the Fisher information is degenerate form a measure-zero subset, their effect is far from negligible. Singular Learning Theory (SLT) shows that the geometry in the neighbourhood of these degenerate points determines the asymptotics of learning (Watanabe, 2009). The theory explains observable effects of degeneracy in common machine learning models under practical settings (Watanabe, 2018) and has the potential to account for important phenomena in deep learning (Wei et al., 2022). The central quantity of SLT is the learning coefficient $\lambda$. Many notable SLT results are conveyed through the learning coefficient, which can be thought of as the complexity of the model class relative to the true distribution. In this section we carefully define the (global) learning coefficient $\lambda$, which we then contrast with the local learning coefficient $\lambda(w^{*})$.

For notational simplicity, we consider here the unsupervised setting where we have the model $p(x|w)$ parameterised by a compact parameter space $W\subset\mathbb{R}^{d}$. We assume fundamental conditions I and II of (Watanabe, 2009, §6.1, §6.2). In particular, $W$ is defined by a finite set of real analytic inequalities, at every parameter $w\in W$ the distribution $p(x|w)$ should have the same support as the true density $q(x)$, and the prior density $\varphi(w)=\varphi_{1}(w)\varphi_{2}(w)$ is a product of a positive smooth function $\varphi_{1}(w)$ and a non-negative real analytic function $\varphi_{2}(w)$. We refer to

$$(p(x|w),\ q(x),\ \varphi(w)) \tag{14}$$

as the model-truth-prior triplet.

Let $K(w)$ be the Kullback-Leibler divergence between the truth and the model

$$K(w) \coloneqq \operatorname{KL}\big(q(x)\,\|\,p(x|w)\big) = \int q(x)\log\frac{q(x)}{p(x|w)}\,dx$$

and define the (average) negative log likelihood to be

$$L(w) \coloneqq -\int q(x)\log p(x|w)\,dx = K(w) + S$$

where $S$ is the entropy of the true distribution. Set $K_{0} = \inf_{w \in W} K(w)$. Let

$$W_{0} \coloneqq \{w \in W : K(w) = K_{0}\}$$

be the set of optimal parameters. We say the truth is realizable by the model if $K_{0} = 0$. We do not assume that our models are regular or that the truth is realizable, but we do assume the model-truth-prior triplet satisfies the more general condition of relatively finite variance (Watanabe, 2013). We also assume that there exists $w_{0}^{*}$ in the interior of $W$ satisfying $K(w_{0}^{*}) = K_{0}$.

Following Watanabe (2009) we define:

Definition 3.

The zeta function of (14) is defined for $\operatorname{Re}(z) > 0$ by

$$\zeta(z) = \int_{W} \left(K(w) - K_{0}\right)^{z} \varphi(w)\,dw$$

and can be analytically continued to a meromorphic function on the complex plane whose poles are all real, negative and rational (Watanabe, 2009, Theorem 6.6). Let $-\lambda \in \mathbb{R}$ be the largest pole of $\zeta$ and $m$ its multiplicity. Then the learning coefficient and multiplicity of the triplet (14) are defined to be $\lambda$ and $m$ respectively.

When $p(x|w)$ is a singular model, $W_{0}$ is an analytic variety which is in general positive dimensional (not a collection of isolated points). As long as $\varphi > 0$ on $W_{0}$, the learning coefficient $\lambda$ is equal to a birational invariant of $W_{0}$ known in algebraic geometry as the Real Log Canonical Threshold (RLCT). We will always assume this is the case, and now recall how $\lambda$ is described geometrically.

With $W_{\epsilon} = \{w \in W : K(w) - K_{0} \leq \epsilon\}$ for sufficiently small $\epsilon$, resolution of singularities (Hironaka, 1964) gives the existence of a proper birational map $g : M \to W_{\epsilon}$ from an analytic manifold $M$ which monomializes $K(w) - K_{0}$ in the following sense, described precisely in (Watanabe, 2009, Theorem 6.5): there are local coordinate charts $M_{\alpha}$ covering $g^{-1}(W_{0})$ with coordinates $u$ such that the reparameterisation $w = g(u)$ puts $K(w) - K_{0}$ and $\varphi(w)\,dw$ into normal crossing form

$$K(g(u)) - K_{0} = u_{1}^{2k_{1}} \cdots u_{d}^{2k_{d}} \tag{15}$$
$$|g'(u)| = b(u)\, u_{1}^{h_{1}} \cdots u_{d}^{h_{d}} \tag{16}$$
$$\varphi(w)\,dw = \varphi(g(u))\,|g'(u)|\,du \tag{17}$$

for some positive smooth function $b(u)$. The RLCT of $K - K_{0}$ is independent of the (non-unique) resolution map $g$, and may be computed as (Watanabe, 2009, Definition 6.4)

$$\lambda = \min_{\alpha} \min_{j=1,\dots,d} \frac{h_{j} + 1}{2k_{j}}. \tag{18}$$

The multiplicity is defined as

$$m = \max_{\alpha} \#\Big\{j : \frac{h_{j}+1}{2k_{j}} = \lambda\Big\}.$$

For $P \in W_{0}$ there exist coordinate charts $M_{\alpha^{*}}$ such that $g(0) = P$ and (15), (16) hold. The RLCT of $K - K_{0}$ at $P$ is then (Watanabe, 2009, Definition 2.7)

$$\lambda(P) = \min_{\alpha^{*}} \min_{j=1,\dots,d} \frac{h_{j} + 1}{2k_{j}}. \tag{19}$$

We then have the RLCT $\lambda = \inf_{P \in W_{0}} \lambda(P)$. In regular models, all $\lambda(P)$ equal $d/2$, and hence the RLCT is $\lambda = d/2$ with $m = 1$; see (Watanabe, 2009, Remark 1.15).
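As a simple illustration (our own example, not taken from the main text), suppose $d = 2$ and that $K(w) - K_{0} = w_{1}^{2} w_{2}^{4}$ near the origin, with a positive smooth prior, so that $K - K_{0}$ is already in normal crossing form and $g$ may be taken to be the identity ($k_{1} = 1$, $k_{2} = 2$, $h_{1} = h_{2} = 0$). Then (18) gives

$$\lambda = \min\left(\frac{0+1}{2 \cdot 1},\ \frac{0+1}{2 \cdot 2}\right) = \frac{1}{4}, \qquad m = 1,$$

whereas a regular two-parameter model would have $\lambda = d/2 = 1$.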

A.1 Background assumptions in SLT

There are a few technical assumptions used throughout SLT which we collect in this section. We note that these are sufficient but not necessary conditions for many of the results in SLT; most of them can be relaxed on a case-by-case basis without invalidating the main conclusions of SLT. For an in-depth discussion see Watanabe (2009, 2010, 2018).

With the same hypotheses as above, the log likelihood ratio

$$r(x,w) := \log\frac{p(x|w_{0})}{p(x|w)}$$

is assumed to be an $L^{s}(q(x))$-valued analytic function of $w$, with $s \geq 2$, that can be extended to a complex analytic function on $W_{\mathbb{C}} \subset \mathbb{C}^{d}$. A more conceptually significant assumption is the condition of relatively finite variance, which consists of the following two requirements:

  1. For any optimal parameters $w_{1}, w_{2} \in W_{0}$, we have $p(x|w_{1}) = p(x|w_{2})$ almost everywhere. This is also known as essential uniqueness.

  2. There exists $c > 0$ such that for all $w \in W$, we have

    $$\mathbb{E}_{q(x)}\left[r(x,w)\right] \geq c\, \mathbb{E}_{q(x)}\left[r(x,w)^{2}\right].$$

Note that if the true density $q(x)$ is realizable by the model, i.e. there exists $w^{*} \in W$ such that $q(x) = p(x|w^{*})$ almost everywhere, then these conditions are automatically satisfied. This includes settings in which the training labels are synthetically generated by passing inputs through a target model (so that realizability holds by construction), such as our experiments involving DLNs and some of the experiments involving MLPs. For experiments involving real data, it is unclear how reasonable it is to assume that relatively finite variance holds.

Appendix B Well-definedness of the theoretical LLC

Here we remark on how we adapt the (global) learning coefficient of Definition 3 to define the local learning coefficient in terms of the poles of a zeta function, and how this relates to the asymptotic volume formula in Definition 1 of the main text.

In addition to the setup described above, suppose now that we have a local minimum $w^{*}$ of the negative log likelihood $L(w)$, and as in Section 3.1 let $B(w^{*})$ be a closed ball centered on $w^{*}$ such that $L(w^{*})$ is the minimum value of $L$ on the ball. We define

$$V := \operatorname{Vol}(B(w^{*})) = \int_{B(w^{*})} \varphi(w)\,dw, \qquad \bar{\varphi}(w) = \tfrac{1}{V}\varphi(w).$$

Then we can form the local triplet $(p, q, \bar{\varphi})$ with parameter space $B(w^{*})$. Note that $W$ is cut out by a finite number of inequalities between analytic functions, and hence so is $B(w^{*})$.

Provided $\varphi(w^{*}) > 0$, the prior does not contribute to the leading terms of the asymptotic expansions considered; in particular it does not appear in the asymptotic formula for the volume in Definition 1, and so we disregard it in the main text for simplicity.

Assuming relatively finite variance for the local triplet, we can apply the discussion of Section A. In particular, we can define the LLC $\lambda(w^{*})$ and the local multiplicity $m(w^{*})$ in terms of the poles of the "local" zeta function

$$\zeta(z, w^{*}) = \int_{B(w^{*})} \left(K(w) - K(w^{*})\right)^{z} \bar{\varphi}(w)\,dw.$$

Note that $L(w) - L(w^{*}) = K(w) - K(w^{*})$ since $K$ and $L$ differ by $S$, which does not depend on $w$.

In order for the LLC to be a learning coefficient in its own right, it must be related, asymptotically, to the local free energy in the manner stipulated in (8) in the main text. We verify this next. To derive the asymptotic expansion of the local free energy $F_{n}(B(w^{*}))$, we assume in addition that $\lambda(w^{*}) \leq \lambda(P)$ for every $P \in B(w^{*})$ with $L(P) = L(w^{*})$; that is, $w^{*}$ is at least as degenerate as any nearby minimiser. Note that the KL divergence for this triplet is just the restriction of $K : W \to \mathbb{R}$ to $B(w^{*})$, but the local triplet has its own set of optimal parameters

$$W_{0}(w^{*}) = \{w \in B(w^{*}) : L(w) = L(w^{*})\}. \tag{20}$$

Borrowing the proof in Watanabe (2009, §3), we can show that

$$F_{n}(B(w^{*})) = nL_{n}(w^{*}) + \lambda(w^{*})\log n - (m(w^{*}) - 1)\log\log n + O_{P}(1) \tag{21}$$

where the difference between $\varphi$ and $\bar{\varphi}$ contributes a summand $\log V$ to the constant term. Note that the condition of relatively finite variance is used to establish (21). This explains why we can regard $\lambda(w^{*})$ as a local learning coefficient, following the ideas sketched in (Watanabe, 2009, Section 7.6). The presentation of the LLC in terms of volume scaling given in the main text now follows from (Watanabe, 2009, Theorem 7.1).

Appendix C Reparameterization invariance of the LLC

The LLC is invariant to local diffeomorphisms of the parameter space, that is, roughly, to locally smooth and invertible changes of variables.

Note that this automatically implies several weaker notions of invariance, such as rescaling invariance in feedforward neural networks; other complexity measures, e.g. Hessian-based ones, have been undermined by their failure to remain invariant under this symmetry (Dinh et al., 2017). We now show that the LLC is invariant to local diffeomorphism. The fact that the LLC is an asymptotic quantity is crucial; this property would not hold for the volume $V(\epsilon)$ itself, for any value of $\epsilon$.

Let $U \subset W$ and $\widetilde{U} \subset \widetilde{W}$ be open subsets of parameter spaces $W$ and $\widetilde{W}$. A local diffeomorphism is an invertible map $\phi : U \to \widetilde{U}$ such that both $\phi$ and $\phi^{-1}$ are infinitely differentiable. We further require that $\phi$ respect the loss function: that is, if $L : U \to \mathbb{R}$ and $\widetilde{L} : \widetilde{U} \to \mathbb{R}$ are the loss functions on each space, we insist that $L(u) = \widetilde{L}(\phi(u))$ for all $u \in U$.

Supposing such a $\phi$ exists, the statement to be proved is that the LLC at $u^{*} \in U$ under $L(u)$ is equal to the LLC at $\widetilde{u}^{*} = \phi(u^{*}) \in \widetilde{U}$ under $\widetilde{L}(\widetilde{u})$. Define

$$V(\epsilon) = \int_{L(u) - L(u^{*}) < \epsilon} du, \qquad \widetilde{V}(\epsilon) = \int_{\widetilde{L}(\widetilde{u}) - \widetilde{L}(\widetilde{u}^{*}) < \epsilon} d\widetilde{u}.$$

Now note that by the change of variables formula for diffeomorphisms, we have

$$\widetilde{V}(\epsilon) = \int_{L(u) - L(u^{*}) < \epsilon} |\det D\phi(u)|\,du$$

where $\det D\phi(u)$ is the Jacobian determinant of $\phi$ at $u$.

The fact that $\phi$ is a local diffeomorphism implies that there exist constants $c_{1}, c_{2} > 0$ such that $c_{1} \leq |\det D\phi(u)| \leq c_{2}$ for all $u \in U$. This means that

$$c_{1} V(\epsilon) \leq \widetilde{V}(\epsilon) \leq c_{2} V(\epsilon).$$

Finally, applying the definition of the LLC $\lambda$ and its multiplicity $m$, and leveraging the fact that this definition is asymptotic as $\epsilon \to 0$, we conclude that

$$V(\epsilon) \propto \widetilde{V}(\epsilon) \propto \epsilon^{\lambda}(-\log\epsilon)^{m-1}$$

which demonstrates that the LLC is preserved by the local diffeomorphism $\phi$.
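As an informal numerical illustration of this argument (our own sketch, not part of the original analysis), one can estimate the volume-scaling exponent by Monte Carlo for a toy loss and for its pullback under a loss-preserving diffeomorphism, and check that the fitted exponents agree. The sketch below assumes the hypothetical loss $L(u) = u_{1}^{2}u_{2}^{4}$ (so $\lambda = 1/4$) and the diffeomorphism $\phi(u) = (\tanh u_{1}, \tanh u_{2})$, so that $\widetilde{L}(v) = L(\operatorname{arctanh} v)$.

import numpy as np

rng = np.random.default_rng(0)
n_samples = 1_000_000
a = 0.5  # half-width of the sampling box around the minimum at the origin

# Original parameterisation: L(u) = u1^2 * u2^4 on [-a, a]^2.
u = rng.uniform(-a, a, size=(n_samples, 2))
L = u[:, 0] ** 2 * u[:, 1] ** 4

# Reparameterised loss on the image box [-tanh(a), tanh(a)]^2.
v = rng.uniform(-np.tanh(a), np.tanh(a), size=(n_samples, 2))
w = np.arctanh(v)
L_tilde = w[:, 0] ** 2 * w[:, 1] ** 4

def fitted_exponent(losses, eps_grid):
    # V(eps) is proportional to the fraction of samples with loss below eps;
    # the slope of log V against log eps estimates lambda (up to log factors).
    vols = np.array([(losses < eps).mean() for eps in eps_grid])
    slope, _ = np.polyfit(np.log(eps_grid), np.log(vols), 1)
    return slope

eps_grid = np.geomspace(1e-8, 1e-5, 8)
print("lambda estimate, original loss:       ", fitted_exponent(L, eps_grid))
print("lambda estimate, reparameterised loss:", fitted_exponent(L_tilde, eps_grid))
# Both estimates should be close to 1/4, up to Monte Carlo noise.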

Appendix D Consistency of local WBIC

In the main text, we introduced (11) as an estimator of $F_{n}(w^{*}, \gamma)$. Our motivation for this estimator comes directly from the well-known widely applicable Bayesian information criterion (WBIC) (Watanabe, 2013). In this section, we refer to (11) as the local WBIC and denote it $\operatorname{WBIC}(w^{*})$.

It is a direct extension of the proofs in Watanabe (2013) to show that the first two terms in the asymptotic expansion of the local WBIC match those of $F_{n}(w^{*}, \gamma)$. By this we mean that it can be shown that

$$F_{n}(w^{*}, \gamma) = nL_{n}(w^{*}) + \tilde{\lambda}(w^{*})\log n - (m-1)\log\log n + R_{n}$$

and

$$\operatorname{WBIC}(w^{*}) = nL_{n}(w^{*}) + \tilde{\lambda}(w^{*})\log n + U_{n}\sqrt{\tilde{\lambda}(w^{*})\log n / 2} + O_{P}(1). \tag{22}$$

This firmly establishes that (11) is a good estimator of $F_{n}(w^{*}, \gamma)$. However, it is important to understand that we cannot immediately conclude that it is a good estimator of $F_{n}(B_{\gamma}(w^{*}))$.

We conjecture that $\tilde{\lambda}(w^{*})$ is equal to the LLC $\lambda(w^{*})$ under certain conditions on $\gamma$. So far, all our empirical findings support this. A detailed proof of this conjecture, in particular ascertaining the exact conditions on $\gamma$, is left to future theoretical work.

Appendix E Expanded related work

We review both the singular learning theory literature directly involving the learning coefficient and research from other areas of machine learning that may be relevant. This is an expanded version of the discussion in Section 6.

Singular learning theory. Our work builds upon the singular learning theory (SLT) of Bayesian statistics; good references are Watanabe (2009, 2018). The global learning coefficient, first introduced by Watanabe (2001), provides the asymptotic expansion of the free energy, which is equivalent to the negative log Bayes marginal likelihood, an all-important quantity in Bayesian analysis. Later work used algebro-geometric tools to bound or exactly calculate the learning coefficient for a wide range of machine learning models, including Boltzmann machines (Yamazaki and Watanabe, 2005b), single-hidden-layer neural networks (Aoyagi et al., 2005), DLNs (Aoyagi, 2024), Gaussian mixture models (Yamazaki and Watanabe, 2003), and hidden Markov models (Yamazaki and Watanabe, 2005a).

SLT has also enhanced the understanding of model selection criteria in Bayesian statistics. Of particular relevance to this work is Watanabe (2013), which introduced the WBIC estimator of the free energy. This estimator has been applied in various practical settings (e.g. Endo et al., 2020; Fontanesi et al., 2019; Hooten and Hobbs, 2015; Sharma, 2017; Kafashan et al., 2021; Semenova et al., 2020). Some of the estimation methodology in this paper can be seen as a localized extension of the WBIC. Several other papers have explored improvements or alternatives to this estimator (Iriguchi and Watanabe, 2007; Imai, 2019a; Imai, 2019b).

Basin broadness. The learning coefficient can be seen as a Bayesian version of basin broadness measures, which typically attempt to empirically connect notions of geometric "broadness" or "flatness" with model complexity (Hochreiter and Schmidhuber, 1997; Jiang et al., 2019). However, the evidence supporting the (global) learning coefficient in the Bayesian setting is significantly stronger: the learning coefficient provably determines the Bayesian free energy to leading order (Watanabe, 2009). We expect the utility of the learning coefficient as a geometric measure to extend beyond the Bayesian setting, but whether the connection with generalization continues to hold there is unknown.

Neural network identifiability. A core observation leading to singular learning theory is that the map $w \mapsto p(x|w)$ from parameters $w$ to statistical models $p(x|w)$ may not be one-to-one (in which case the model is singular). This observation has been made in parallel by researchers studying neural network identifiability (Sussmann, 1992; Fefferman, 1994; Kůrková and Kainen, 1994; Phuong and Lampert, 2020). Recent work has shown that the degree to which a network is identifiable (or inverse stable) is not uniform across parameter space (Berner et al., 2019; Petersen et al., 2020; Farrugia-Roberts, 2023); within the context of singular learning theory, this fact was known at least as early as Fukumizu (1996). From this perspective, the LLC can be viewed as a quantitative measure of "how identifiable" the network is near a particular parameter.

Statistical mechanics of the loss landscape. A handful of papers have explored related ideas from a statistical mechanics perspective. Jules et al. (2023) use Langevin dynamics to probe the geometry of the loss landscape. Zhang et al. (2018) show how a bias towards "wide minima" may be explained by free energy minimization. These observations may be formalized using singular learning theory (LaMont and Wiggins, 2019). In particular, the learning coefficient may be viewed as a heat capacity (LaMont and Wiggins, 2019), and learning coefficient estimation corresponds to measuring the heat capacity by molecular dynamics sampling.

Other model complexity measures. The LLC is set apart from a number of classic model complexity measures, such as Rademacher complexity (Koltchinskii and Panchenko, 2000) and the VC dimension (Vapnik and Chervonenkis, 1971), because the latter measures act on an entire class of functions, while the LLC measures the complexity of a specific individual function within the context of the function class carved out by the model (e.g. via the DNN architecture). This affords the LLC a better position for unraveling the theoretical mysteries of deep learning, which cannot be disentangled from the way in which DNNs are trained or the data that they are trained on.

In the context studied here, our proposed LLC measures the complexity of a trained neural network rather than complexity over the entire function class of neural networks. It is also sensitive to the data distribution, making it ideal for understanding the intricate dance between function class, data properties, and implicit biases baked into different training heuristics.

Like the earlier investigations of Le (2018), Zhang et al. (2018) and LaMont and Wiggins (2019), our notion of model complexity appeals to the correspondence between parameter inference and free energy minimization. This refers mainly to the fact that the posterior distribution over $w$ concentrates around $w^{*}$ when the associated local free energy $F_{n}(B_{\gamma}(w^{*}))$ is low. Viewed through the free energy lens, it is thus not surprising that "flatter" minima (low $\lambda(w^{*})$) might be preferred over "sharper" minima (high $\lambda(w^{*})$) even if the former have higher training loss (higher $L_{n}(w^{*})$). Put another way, (8) reveals that parameter inference is not merely about seeking the solution with the lowest loss. In the terminology of Zhang et al. (2018), parameter inference plays out as a competition between energy (loss) and entropy (Occam's factor). Despite these spiritual similarities, our work departs from that of Zhang et al. (2018) and Le (2018) in its technical treatment of the local free energy. These prior works rely on the Laplace approximation to arrive at the result

$$F_{n}(B_{\gamma}(w^{*})) = nL_{n}(w^{*}) + \frac{d}{2}\log n + \frac{1}{2}\log\det H(w^{*}) + O_{P}(1), \tag{23}$$

where $H$ is the Hessian of the loss. The flatness of a local minimum is then measured by $\log\det H(w^{*})$, which is, notably, ill-defined for neural networks. (Balasubramanian (1997) includes all $O(1)$ terms in the Laplace expansion as a type of complexity measure.) Indeed, a concluding remark in Zhang et al. (2018) points out that "a more nuanced metric is needed to characterise flat minima with singular Hessian matrices." Le (2018) likewise states in their introduction that "to compute the (model) evidence, we must carefully account for this degeneracy", but then argues that degeneracy is not a major limitation to applying (23). This is only partially true. For a very benign type of degeneracy, (23) is indeed valid. However, under general conditions, the correct asymptotic expansion of the local free energy is given by (8). It might be said that while Zhang et al. (2018) and Le (2018) make an effort to account for degeneracies in the DNN loss landscape, they take only a small step up the degeneracy ladder while we take a full leap.

Similarity to PAC-Bayes. We have just described how the theoretical LLC is the sought-after notion of model complexity arising from earlier works that adopt the energy-entropy competition perspective. Interestingly, the LLC estimator itself also has connections to another familiar notion of model complexity. Among the diverse cast of complexity measures (see e.g. Jiang et al. (2019) for a comprehensive overview of over forty complexity measures in modern deep learning), the LLC estimator bears the most resemblance to PAC-Bayes inspired flatness/sharpness measures (Neyshabur et al., 2017). Indeed, other than the scaling by $n\beta^{*}$, $\hat{\lambda}(w^{*})$ can be viewed as a PAC-Bayes flatness measure which utilises a very specific posterior distribution localised to $w^{*}$. Recall that the canonical PAC-Bayes flatness measure is based on

$$\lambda_{\mathrm{PAC\text{-}Bayes}}(w^{*}) = \mathrm{E}_{q(w|w^{*})}\,\ell_{n}(w) - \ell_{n}(w^{*}), \tag{24}$$

where $\ell_{n}$ is a general empirical loss function (in our case the sample negative log likelihood) and the "posterior" distribution $q$ is often taken to be Gaussian, i.e. $q(w|w^{*}) = \mathcal{N}(w^{*}, \sigma^{2}I)$. A simple derivation shows that the quantity in (24), with a Gaussian $q$ around $w^{*}$, reduces approximately to $\frac{1}{2}\sigma^{2}\operatorname{Tr}(H(w^{*}))$, where $H$ is the Hessian of the loss, i.e. $H(w^{*}) = \nabla^{2}_{w}\ell_{n}(w)|_{w^{*}}$. However, for singular models, the posterior distribution around $w^{*}$, e.g. (10), is decidedly not Gaussian. This calls into question the standard choice of a Gaussian posterior in (24).
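As a small illustration of the derivation above (our own sketch, not code from the paper), the Gaussian-perturbation flatness measure in (24) can be estimated by Monte Carlo and compared against $\tfrac{1}{2}\sigma^{2}\operatorname{Tr}(H(w^{*}))$ for a hypothetical quadratic loss, for which the approximation is exact:

import numpy as np

# Hypothetical quadratic loss ell_n(w) = 0.5 * w^T A w with fixed PSD matrix A,
# so the Hessian is H = A everywhere and the minimum is at w* = 0.
rng = np.random.default_rng(0)
d = 10
B = rng.normal(size=(d, d))
A = B @ B.T / d

def loss(w):
    return 0.5 * w @ A @ w

w_star = np.zeros(d)
sigma = 1e-2

# Monte Carlo estimate of E_{w ~ N(w*, sigma^2 I)}[ell_n(w)] - ell_n(w*), as in (24).
samples = w_star + sigma * rng.normal(size=(100_000, d))
losses = 0.5 * np.einsum("ni,ij,nj->n", samples, A, samples)
pac_bayes_flatness = losses.mean() - loss(w_star)

# Second-order approximation 0.5 * sigma^2 * Tr(H(w*)).
trace_term = 0.5 * sigma**2 * np.trace(A)

print(pac_bayes_flatness, trace_term)  # the two values should agree closely

For singular models the Hessian-based approximation breaks down, which is precisely the point of the discussion above.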

Learning capacity. Finally, we briefly discuss concurrent work that measures a quantity related to the learning coefficient. In Chen et al. (2023a), a measure called the learning capacity is proposed to estimate the complexity of a hypothesis class. The learning capacity can be viewed as a finite-$n$ version of the learning coefficient; the latter only appears in the $n \to \infty$ limit. Chen et al. (2023a) are largely interested in the learning capacity as a function of training size $n$. They discover that the learning capacity saturates at very small and very large $n$, with a sharp transition in between.

Applications. Recently, the LLC estimation method we introduce here has been used to empirically detect "phase transitions" in toy ReLU networks (Chen et al., 2023b) and the development of in-context learning in transformers (Hoogland et al., 2024).

Appendix F Model complexity vs model-independent complexity

In this paper, we have described the LLC as a measure of "model complexity." It is worth clarifying what we mean here, or rather, what we do not mean. This clarification is in part a response to Skalse (2023).

We distinguish measures of "model complexity," such as those traditionally found in statistical learning theory, from measures of "model-independent complexity," such as those found in algorithmic information theory. Measures of model complexity, like the parameter count, describe the expressivity or degrees of freedom available to a particular model. Measures of model-independent complexity, like Kolmogorov complexity, describe the complexity inherent to the task itself.

In particular, we emphasize that, a priori, the LLC is a measure of model complexity, not model-independent complexity. It can be seen as the amount of information required to nudge a model towards $w^{*}$ and away from other parameters. Parameters with higher LLC are more complex for that particular model to implement.

Alternatively, the model is inductively biased towards parameters with lower LLC (the role of the LLC in inductive biases is only rigorously established for Bayesian learning, but we suspect it also applies to learning with SGD), but a different model could have different inductive biases, and thus a different LLC for the same task. This is why it is not sensible to conclude that a bias towards low LLC would, on its own, explain the observed "simplicity bias" in neural networks (Valle-Perez et al., 2018); as Skalse (2023) noted, this would be tautological.

To highlight this distinction, we construct a statistical model in which these two notions of complexity diverge. Let $f_{1}(x)$ be a Kolmogorov-simple function, like the identity function. Let $f_{2}(x)$ be a Kolmogorov-complex function, like a random lookup table. Then consider the following regression model with a single parameter $w \in [0, 1]$:

$$f(x, w) = w^{8} f_{1}(x) + (1 - w^{8}) f_{2}(x)$$

For this model, $f_{1}(x)$ (implemented at $w = 1$) has a learning coefficient of $\lambda = \frac{1}{2}$, whereas $f_{2}(x)$ (implemented at $w = 0$) has a learning coefficient of $\lambda = \frac{1}{16}$. Therefore, despite $f_{1}(x)$ being more Kolmogorov-simple, it is more complex for $f(x, w)$ to implement: the model is biased towards $f_{2}(x)$ instead of $f_{1}(x)$, and so $f_{1}(x)$ requires relatively more information to learn.
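To see where these values come from, here is a short derivation we add for completeness, assuming a standard regression setup so that $K(w)$ is proportional to the expected squared difference between $f(x, w)$ and the function being implemented, which we denote $f(x, w^{\dagger})$:

$$K(w) \propto \mathbb{E}_{x}\big[(f(x,w) - f(x,w^{\dagger}))^{2}\big] = \begin{cases} (1 - w^{8})^{2}\, c & \text{for } w^{\dagger} = 1 \ (\text{implementing } f_{1}),\\ w^{16}\, c & \text{for } w^{\dagger} = 0 \ (\text{implementing } f_{2}), \end{cases}$$

where $c = \mathbb{E}_{x}[(f_{1}(x) - f_{2}(x))^{2}]$. Near $w = 1$, $(1 - w^{8})^{2} \approx 64(1 - w)^{2}$ vanishes quadratically, so $\lambda = \frac{1}{2}$ as for a regular one-parameter model; near $w = 0$, $K(w) \propto w^{16}$, so the volume $\{w : K(w) < \epsilon\}$ scales as $\epsilon^{1/16}$ and $\lambda = \frac{1}{16}$.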

Yet this example feels contrived: in realistic deep learning settings, the parameters $w$ do not merely interpolate between handpicked algorithms, but themselves define an internal algorithm through their values. That is, it seems intuitively that the parameters play a role closer to "source code" than to "tuning constants."

Thus, while in general the LLC is not a model-independent complexity measure, it seems distinctly possible that for neural networks (perhaps even for models in some broader "universality class"), the LLC could be model-independent in some way. This would theoretically establish the inductive biases of neural networks. We believe this to be an intriguing direction for future work.

Appendix G SGLD-based LLC estimator: minibatch version pseudocode

Algorithm 1: Computing $\hat{\lambda}(w^{*})$

Input:

  • initialization point: $w^{*}$

  • scale: $\gamma$

  • step size: $\epsilon$

  • number of iterations: SGLD_iters

  • batch size: $m$

  • dataset of size $n$: $\mathcal{D}_{n} = \{(x_{i}, y_{i})\}_{i=1,\dots,n}$

  • averaged log-likelihood function for $w \in \mathbb{R}^{d}$ and an arbitrary subset $D$ of the data:

    $$\mathrm{logL}(D, w) = \frac{1}{|D|}\sum_{(x_{i}, y_{i}) \in D} \log p(y_{i} | x_{i}, w)$$

Output: $\hat{\lambda}(w^{*})$

1:  $\beta^{*} \leftarrow \frac{1}{\log n}$ {Optimal sampling inverse temperature.}
2:  $w \leftarrow w^{*}$ {Initialize the chain at the given parameter.}
3:  $\mathrm{arrayLogL} \leftarrow [\ ]$
4:  for $t = 1 \dots \mathrm{SGLD\_iters}$ do
5:     $B \leftarrow$ random minibatch of size $m$
6:     append $\mathrm{logL}(B, w)$ to $\mathrm{arrayLogL}$
7:     $\eta \sim N(0, \epsilon I_{d})$ {$d$-dimensional Gaussian noise with variance $\epsilon$}
8:     $\Delta w \leftarrow \frac{\epsilon}{2}\left[\gamma(w^{*} - w) + n\beta^{*}\nabla_{w}\mathrm{logL}(B, w)\right] + \eta$
9:     $w \leftarrow w + \Delta w$
10:  end for
11:  $\widehat{\operatorname{WBIC}} \leftarrow -n \cdot \mathrm{Mean}(\mathrm{arrayLogL})$
12:  $nL_{n}(w^{*}) \leftarrow -n \cdot \mathrm{logL}(\mathcal{D}_{n}, w^{*})$
13:  $\hat{\lambda}(w^{*}) \leftarrow \big(\widehat{\operatorname{WBIC}} - nL_{n}(w^{*})\big) / \log n$
14:  return $\hat{\lambda}(w^{*})$
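To make Algorithm 1 concrete, here is a minimal PyTorch-style sketch (our own illustration, not the implementation used for the experiments; `avg_log_likelihood` is an assumed user-supplied function computing $\mathrm{logL}(D, w)$ differentiably in $w$, and the burn-in handling follows the recommendation of Appendix H.2 rather than the pseudocode above):

import math
import torch

def estimate_llc(w_star, avg_log_likelihood, dataset, gamma=1.0, eps=1e-5,
                 sgld_iters=2000, batch_size=512, burn_in=500):
    """Sketch of Algorithm 1: SGLD-based estimation of the LLC at w_star."""
    n = len(dataset)
    beta = 1.0 / math.log(n)                    # optimal inverse temperature
    w_star = w_star.detach().clone()
    w = w_star.clone().requires_grad_(True)     # initialize the chain at w*
    log_likes = []

    for t in range(sgld_iters):
        batch_idx = torch.randint(n, (batch_size,))
        batch = [dataset[i] for i in batch_idx]
        logL = avg_log_likelihood(batch, w)     # logL(B, w), differentiable in w
        if t >= burn_in:                        # discard burn-in (Appendix H.2)
            log_likes.append(logL.item())
        grad = torch.autograd.grad(logL, w)[0]
        noise = math.sqrt(eps) * torch.randn_like(w)
        with torch.no_grad():
            # SGLD step with localizing term gamma * (w* - w), as in line 8.
            w += 0.5 * eps * (gamma * (w_star - w) + n * beta * grad) + noise

    wbic_hat = -n * sum(log_likes) / len(log_likes)
    nLn_star = -n * avg_log_likelihood(dataset, w_star).item()
    return (wbic_hat - nLn_star) / math.log(n)

On top of this skeleton, the main practical additions are the step-size diagnostic and troubleshooting advice of Appendix H.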

Appendix H Recommendations for accurate LLC estimation and troubleshooting

In this section, we collect several recommendations for estimating the LLC accurately in practice. Note that these recommendations are largely based on our experience with LLC estimation for DLNs (see Section 5.1), as this is the only realistic model (being both wide and deep) for which the theoretical learning coefficient is available.

H.1 Step size

From experience, the most important hyperparameter for the performance and accuracy of the method is the step size $\epsilon$. If the step size is too low, the sampler may not equilibrate, leading to underestimation. If the step size is too high, the sampler can become numerically unstable, causing overestimation or even "blowing up" to NaN values.

Manual tuning of the step size is possible. However, we strongly recommend a particular diagnostic based on the acceptance criterion of the Metropolis-adjusted Langevin algorithm (MALA). This criterion is used to correct numerical errors in traditional MCMC, but here we use it only to detect them.

In traditional (full-gradient) MCMC, numerical errors caused by the step size are completely corrected by a secondary step in the algorithm, the acceptance check or Metropolis correction, which accepts or rejects steps with a probability roughly based on the likelihood of numerical error. (Technically, the acceptance probability is based on maintaining detailed balance, not numerical error per se, as can be seen in the case of e.g. Metropolis-Hastings; but this is a fine intuition for gradient-based algorithms like MALA or HMC.) The proportion of steps accepted then becomes an important diagnostic of the health of the algorithm: a low acceptance ratio indicates that the acceptance check is having to compensate for a high level of numerical error.

The acceptance probability between the current state $X_{k}$ and the proposed state $X_{k+1}$ is calculated as:

$$\min\left(1,\ \frac{\pi(X_{k+1})\, q(X_{k} \mid X_{k+1})}{\pi(X_{k})\, q(X_{k+1} \mid X_{k})}\right)$$

where $\pi(x)$ is the target probability density at $x$ (in our case, $\log\pi(x) = -\beta n L_{n}(x)$ up to an additive constant), and $q(x'|x)$ is the probability of the sampler proposing a transition from $x$ to $x'$.

In the case of MALA, $q(x'|x) \neq q(x|x')$, so we must calculate this term explicitly. For MALA it is:

$$q(x'|x) \propto \exp\left(-\frac{1}{4\epsilon}\,\|x' - x - \epsilon\nabla\log\pi(x)\|^{2}\right)$$

We choose to use MALA’s formula because we are using SGLD, and both MALA and SGLD propose steps using Langevin dynamics. MALA’s formula is the correct one to use when attempting to apply Metropolis correction to Langevin dynamics.

For various reasons, directly implementing such an acceptance check for stochastic-gradient MCMC (while possible) is typically either ineffective or inefficient. Instead we use the acceptance probability merely as a diagnostic.

We recommend tuning the step size such that the average acceptance probability is in the range 0.90-0.95. Below this range, decrease the step size to avoid numerical error. Above this range, consider increasing the step size for computational efficiency (to save on the number of steps required). For efficiency, we recommend calculating the acceptance probability for only a fraction of steps, say one out of every twenty.

Note that since we are not actually using an acceptance check, these acceptance "probabilities" are not really probabilities, but merely diagnostic values.
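To make the diagnostic concrete, here is a minimal sketch (our own, under the convention $\log\pi(x) = -\beta n L_{n}(x)$ used above; `log_pi` and `grad_log_pi` are assumed callables) of the acceptance value for a single proposed SGLD step:

import numpy as np

def mala_acceptance_diagnostic(x, x_new, log_pi, grad_log_pi, eps):
    # Diagnostic MALA acceptance value for a proposed move x -> x_new.
    # No accept/reject step is actually performed; the value is only logged.
    def log_q(x_to, x_from):
        # Log of the Langevin proposal density q(x_to | x_from), up to an
        # additive constant, matching the formula above.
        drift = x_from + eps * grad_log_pi(x_from)
        return -np.sum((x_to - drift) ** 2) / (4.0 * eps)

    log_ratio = (log_pi(x_new) + log_q(x, x_new)) - (log_pi(x) + log_q(x_new, x))
    return min(1.0, float(np.exp(log_ratio)))

In practice one evaluates this every ten or twenty SGLD steps and monitors the running average, targeting the 0.90-0.95 range recommended above.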

H.2 Step count and burn-in

The step count for sampling should be chosen such that the sampler has time to equilibrate or "burn in". An insufficient step count may lead to underestimating the LLC. An excessive step count will not degrade accuracy, but is unnecessarily time-consuming.

We recommend increasing the number of steps until the loss, $L_{m}(w_{t})$, stops increasing after some period of time. This can be done by manual inspection of the loss trace. See Figure H.1 for examples of the loss trace and MALA acceptance probability over SGLD trajectories for DLN models at different scales.

It is worth noting that the loss trace should truly be flat: a slow upward slope can still be indicative of significant underestimation.

We also recommend that samples during this burn-in period be discarded. That is, loss values should only be tallied once they have flattened out. This avoids underestimation.

H.3 Other issues and troubleshooting

We note some other miscellaneous issues and troubleshooting recommendations:

  • Negative LLC estimates: This can happen when $w^*$ fails to be near a local minimum of $L(w)$. Even when $w^*$ is a local minimum, we may still get negative LLC estimates if we are not careful. This occurs when the SGLD trajectory wanders into a region with lower loss than the initialization, making the numerator in (12) negative. It can be alleviated by a smaller step size or a shorter chain, though both risk under-exploration, or by a larger restoring force $\gamma$, which risks the issue discussed below.

  • Large $\gamma$: An overly concentrated localizing prior ($\gamma$ too large) can overwhelm the gradient signal coming from the log-likelihood. The resulting samples then differ from the posterior, destroying SGLD's sensitivity to the local geometry.

To sum up, in pathological cases such as the SGLD trajectory falling into a lower-loss region or blowing up beyond floating-point limits, we recommend keeping $\gamma$ small (1.0 to 10.0) and gradually lowering the step size $\epsilon$ while lengthening the sampling chain so that the loss trace still equilibrates.


Figure H.1: Sample loss trace (blue, left axis) and MALA acceptance probability (red, right axis) over SGLD trajectories for DLN models at different sizes.

Appendix I Learning coefficient of DLN (Aoyagi, 2024)

A DLN is a feedforward neural network without nonlinear activations. Specifically, a biasless DLN with $M$ hidden layers, layer sizes $H_1, H_2, \dots, H_M$ and input dimension $H_0$ is given by:

$$y = f(x,w) = W_{M}\cdots W_{2}W_{1}x \tag{25}$$

where $x \in \mathbb{R}^{H_0}$ is the input vector and the model parameter $w$ consists of the weight matrices $W_j$ of shape $H_j \times H_{j-1}$ for $j = 1, \dots, M$. Given a DLN $f(x,w)$ as in Equation 25, the associated regression model with additive Gaussian noise is given by

$$p(x,y|w) = \frac{q(x)}{\left(\sqrt{2\pi\sigma^{2}}\right)^{H_{M}}}\, e^{-\frac{1}{2\sigma^{2}}\|y - W_{M}\cdots W_{2}W_{1}x\|^{2}} \tag{26}$$

where $q(x)$ is some distribution on the input $x \in \mathbb{R}^{H_0}$, $w = (W_1, \dots, W_M)$ is the parameter consisting of the weight matrices $W_j$ of shape $H_j \times H_{j-1}$ for $j = 1, \dots, M$, and $\sigma^2$ is the variance of the additive Gaussian noise. Let $q(x,y)$ be the density of the true data-generating process and $w^* = (W_1^*, \dots, W_M^*)$ be an optimal parameter that minimizes the KL divergence between $q(x,y)$ and $p(x,y|w)$.

Here we shall pause and emphasize that this result gives us the (global) learning coefficient, which is conceptually distinct from the LLC. They are related: the learning coefficient is the minimum of the LLCs over the global minima of the population loss. In our experiments, we measure the LLC at a randomly chosen global minimum of the population loss. While we do not expect LLCs to differ much among global minima for DLNs, we do not know this for certain, and it is of independent interest what the estimated LLC can tell us about the learning coefficient.

Theorem 1 (DLN learning coefficient, Aoyagi, 2024).

Let $r := \mathrm{rank}\left(W_M^* \cdots W_2^* W_1^*\right)$ be the rank of the linear transformation implemented by the true DLN, $f(x,w^*)$, and set $\Delta_j := H_j - r$ for $j = 0, \dots, M$. There exists a subset $\Sigma \subset \{0, 1, \dots, M\}$ of indices, $\Sigma = \{\sigma_1, \dots, \sigma_{\ell+1}\}$ with cardinality $\ell + 1$, that satisfies the following conditions:

$$\begin{aligned}
\max\{\Delta_{\sigma} \mid \sigma \in \Sigma\} &< \min\{\Delta_{k} \mid k \notin \Sigma\},\\
\sum_{\sigma\in\Sigma}\Delta_{\sigma} &\geq \ell\cdot\max\{\Delta_{\sigma} \mid \sigma\in\Sigma\},\\
\sum_{\sigma\in\Sigma}\Delta_{\sigma} &< \ell\cdot\min\{\Delta_{\sigma} \mid \sigma\notin\Sigma\}.
\end{aligned}$$

Assuming that the DLN truth-model pair $\left(q(x,y), p(x,y|w)\right)$ satisfies the relatively finite variance condition (Appendix A.1), their learning coefficient is then given by

$$\lambda = \frac{-r^{2} + r(H_{0}+H_{M})}{2} + \frac{a(\ell-a)}{4\ell} - \frac{\ell(\ell-1)}{4}\left(\frac{1}{\ell}\sum_{j=1}^{\ell+1}\Delta_{\sigma_{j}}\right)^{2} + \frac{1}{2}\sum_{1\leq i<j\leq \ell+1}\Delta_{\sigma_{i}}\Delta_{\sigma_{j}}.$$
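The formula is straightforward to evaluate once $r$, $\ell$, $\Sigma$ and $a$ have been determined as in the theorem (with $a$ as defined in Aoyagi, 2024). The following is a direct transcription into code, a sketch rather than the implementation used for our figures; deltas_sigma holds $\Delta_{\sigma_1}, \dots, \Delta_{\sigma_{\ell+1}}$ and $a$ is taken as given.

```python
# Evaluate the DLN learning coefficient formula above, given the quantities
# from Theorem 1. Assumes ell = len(deltas_sigma) - 1 >= 1.
def dln_learning_coefficient(r, H0, HM, deltas_sigma, a):
    ell = len(deltas_sigma) - 1                      # |Sigma| = ell + 1
    total = sum(deltas_sigma)
    cross = sum(deltas_sigma[i] * deltas_sigma[j]
                for i in range(len(deltas_sigma))
                for j in range(i + 1, len(deltas_sigma)))
    return ((-r**2 + r * (H0 + HM)) / 2
            + a * (ell - a) / (4 * ell)
            - ell * (ell - 1) / 4 * (total / ell) ** 2
            + cross / 2)
```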

As we mention in the introduction, trained neural networks are less complex than they seem. It is natural to expect that this, if true, is reflected in deeper networks having more degenerate loss landscapes (regions of good parameters occupying higher volume). The theorem above gives us a window into the volume-scaling behaviour in the case of DLNs, allowing us to investigate one aspect of this hypothesis.

Figure I.1 shows the true learning coefficient, $\lambda$, and the multiplicity, $m$, of many randomly drawn DLNs with different numbers of hidden layers. Observe that $\lambda$ decreases with network depth.

This plot is generated by creating networks with 2-800 hidden layers, with widths (including the input dimension) randomly drawn from 100-2000. The overall rank of the DLN is randomly drawn between zero and the maximum allowed rank, which is the minimum of the layer widths. See also Figure 1 in Aoyagi (2024) for more theoretical examples of this phenomenon.


Figure I.1: The top graph shows $\lambda$ decreasing as the DLN becomes deeper, even though the parameter count increases with the number of layers. The bottom graph shows the true multiplicities, $m$. Since regular models can only have $m=1$, the graph shows that most of these randomly generated DLNs are singular.

Appendix J Experiment: DLN

We compare the estimated $\hat{\lambda}(w^*)$ against the theoretical $\lambda$ (with $w^*$ a randomly generated true parameter) by randomly generating many DLNs with architectures and model sizes spanning several orders of magnitude (OOM). Each DLN is constructed randomly as follows. Draw an integer $M \sim U(M_{low}, \dots, M_{high})$ as the number of hidden layers, where $U(a, \dots, b)$ denotes the discrete uniform distribution on the finite set $\{a, a+1, \dots, b\}$. Then draw layer sizes $H_j \sim U(H_{low}, \dots, H_{high})$ for each $j = 0, \dots, M$, where $H_0$ denotes the input dimension. The weight matrix $W_j$ for layer $j$ is then an $H_j \times H_{j-1}$ matrix with each element independently sampled from $N(0,1)$ (random initialization). To obtain a more realistic true parameter, with probability 0.5 each matrix $W_j^*$ is modified to have a random rank $r \sim U(0, \dots, \min(H_{j-1}, H_j))$. For each DLN generated, a corresponding synthetic training dataset of size $n$ is generated to be used in SGLD sampling. A sketch of this construction is given below.
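The sketch below illustrates the construction just described. The helper names are ours, and the use of SVD truncation to force a given rank is an assumption; the text does not specify how the rank reduction is performed.

```python
# A sketch of the random DLN construction described above (not the paper's code).
import numpy as np

def random_dln(M_low, M_high, H_low, H_high, rng):
    M = rng.integers(M_low, M_high + 1)                # number of hidden layers
    H = rng.integers(H_low, H_high + 1, size=M + 1)    # H[0] is the input dimension
    weights = []
    for j in range(1, M + 1):
        W = rng.normal(size=(H[j], H[j - 1]))          # random initialization
        if rng.random() < 0.5:                         # with probability 0.5, force a random rank
            r = rng.integers(0, min(H[j - 1], H[j]) + 1)
            U, s, Vt = np.linalg.svd(W, full_matrices=False)
            W = U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]  # truncated SVD gives a rank-r matrix
        weights.append(W)
    return weights

def dln_forward(weights, x):
    for W in weights:                                  # y = W_M ... W_2 W_1 x
        x = W @ x
    return x
```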

The configuration values $M_{low}, M_{high}, H_{low}, H_{high}$ are chosen differently for each set of experiments, targeting DLN sizes of different orders of magnitude. See Table 1 for the values used in the experiments. The SGLD hyperparameters $\epsilon$, $\gamma$, and the number of steps are chosen to suit each set of experiments according to the recommendations outlined in Appendix H. The exact hyperparameter values are given in the following section.

We emphasize that hyperparameters must be tuned independently for different model sizes. In particular, the required step size for numerical stability tends to decrease for larger models, forcing a compensatory increase in the step count. See Table 1 for an example of this tuning with scale.

Future work using, e.g., $\mu$-parameterization (Yang et al., 2022) may be able to alleviate this issue.

J.1 Hyperparameters and further details

As described in the prior section, the experiments shown in Figure 4 consist of randomly constructed DLNs. For each target order of magnitude of DLN parameter count, we randomly sample the number of layers and their widths from a different range. We also use a different set of SGLD hyperparameters chosen according to the recommendations made in Appendix H. Configuration values that vary across OOM are shown in Table 1; the other configurations are as follows:

  • Batch size used in SGLD is 500.

  • The number of burn-in steps for SGLD is set to 90% of the total SGLD chain length, i.e. only the last 10% of SGLD samples are used in estimating the LLC.

  • The parameter $\gamma$ is set to 1.0.

  • For each DLN $f(x,w^*)$ with a chosen true parameter $w^*$, a synthetic dataset $\{(x_i, y_i)\}_{i=1,\dots,n}$ is generated by sampling each element of the input vector $x$ uniformly from the interval $[-10, 10]$ and setting the output to $y = f(x, w^*)$; this effectively corresponds to a very small noise variance $\sigma^2$.

  • For LLC estimation done at a trained parameter instead of the true parameter (shown in the right plot of Figure 4), the network is first trained using SGD with learning rate 0.01 and momentum 0.9 for 50000 steps.

For each target OOM, a number of experiments are run with different random seeds. The number of such experiments is determined by our compute resources and is reported in Table 1, with some experiments failing due to SGLD chains "blowing up" (see discussion in Appendix H) for the SGLD hyperparameters used. Figure J.3 shows that the distribution of mean MALA acceptance rates has a left tail, hinting at instability in the SGLD chains encountered in some $\hat{\lambda}(w^*)$ estimation runs.

OOM | Num layers, $M_{low}$-$M_{high}$ | Widths, $H_{low}$-$H_{high}$ | $\epsilon$ | Num SGLD steps | $n$ | Num experiments
1k | 2-5 | 5-50 | $5\times10^{-7}$ | 10k | $10^5$ | 99
10k | 2-10 | 5-100 | $5\times10^{-7}$ | 10k | $10^5$ | 100
100k | 2-10 | 50-500 | $1\times10^{-7}$ | 50k | $10^6$ | 100
1M | 5-20 | 100-1000 | $5\times10^{-8}$ | 50k | $10^6$ | 99
10M | 2-20 | 500-2000 | $2\times10^{-8}$ | 50k | $10^6$ | 93
100M | 2-40 | 500-3000 | $2\times10^{-8}$ | 50k | $10^6$ | 54

Table 1: Experimental configuration for each batch of experiments at a different order of magnitude (OOM) of DLN model size. $n$ denotes the training dataset size and $\epsilon$ the SGLD step size.

J.2 Additional plots for DLN experiments

  • Figure J.1 is a linear-scale version of Figure 4 in the main text. It shows the estimated LLC against the true learning coefficient for experiments at different model size ranges without log-scale distortion.

  • Figure J.2 shows the relative error $(\lambda - \hat{\lambda}(w^*))/\lambda$ across multiple orders of magnitude of DLN model size.


Figure J.1: Supplementary plot to Figure 4. Each plot shows a single batch of DLN experiments with model sizes at a different order of magnitude. The SGLD hyperparameters are tuned once for each batch; their values are listed in Table 1. In contrast to Figure 4, which is in log scale, all plots here are in linear scale.


Figure J.2: Relative error of estimated LLC compared to the theoretical learning coefficient, for DLNs across different orders of magnitude of model size.


Figure J.3: Mean MALA acceptance probability over the entire SGLD trajectory for every DLN experiment. Model size is not the only factor affecting the correct scale for the SGLD step size: local geometry varies significantly among models and among neighbourhoods in parameter space. Without tuning SGLD hyperparameters individually for each experiment, we get a spread of (mean) MALA acceptance probabilities over all experiments. Those with low acceptance probability may indicate poor $\hat{\lambda}(w^*)$ estimation quality.

Appendix K Experiment: LLC for ResNet

This section provides details for the experiments investigating how the LLC varies over neural network training under different training configurations. These experiments parallel those performed in Dherin et al. (2022), who propose the geometric complexity measure.

We train ResNet18 (He et al., 2016) on the CIFAR10 dataset (Krizhevsky, 2009) using SGD with cross-entropy loss. We vary SGD hyperparameters such as learning rate, batch size, momentum and $L^2$-regularization rate, and track the resulting LLC estimates over evenly spaced checkpoints of SGD iterations.

For the LLC estimates to be comparable, we need to ensure that the SGLD hyperparameters used in the LLC algorithm are the same. To this end, for every set of experiments where we vary a single optimizer hyperparameter, we first perform a calibration run to select SGLD hyperparameters according to the recommendations outlined in Appendix H. Once selected, this set of SGLD hyperparameters is used for all LLC estimation within that set of experiments, including the LLC estimation for every checkpoint of every ResNet18 training run at each optimizer configuration. Also following the same recommendations, we burn away 90% of the sample trajectory and only use the last 10% of samples. We note that, since the SGLD hyperparameters are not tuned for each experiment within a set, negative LLC estimates or divergent SGLD chains are possible. See Appendix H.3 for a discussion of such cases and how to troubleshoot them. We manually remove these cases. They are rare enough that they do not change the LLC curves, but they do widen the error bars (confidence intervals) due to the reduced number of repeated experiments.

Each experiment is repeated with 5 different random seeds. While the model architecture and dataset (including the train-test split) are fixed, other factors such as the network initialization, training trajectory and SGLD samples are randomized. In each plot, the error bars show the 95% confidence interval of the plotted statistic over the 5 repeated experiments. The error bars were calculated using the built-in error bar functionality of the Python Seaborn library (Waskom, 2021), i.e. the plotting call seaborn.lineplot(..., errorbar=("ci", 95), ...).
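For reference, a minimal plotting sketch is given below; the data frame and its column names ("step", "llc_estimate", "config") are hypothetical, but the errorbar argument matches the call quoted above.

```python
# Sketch of plotting LLC curves with 95% CI error bars over repeated seeds.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("llc_checkpoints.csv")      # hypothetical file: one row per (seed, checkpoint)
ax = sns.lineplot(data=df, x="step", y="llc_estimate",
                  hue="config",               # one curve per optimizer configuration
                  errorbar=("ci", 95))        # 95% CI over the 5 repeated seeds
ax.set_xlabel("SGD iteration")
ax.set_ylabel("LLC estimate")
plt.show()
```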

K.1 Details for Figure 1 in the main text

For the experiments that vary the learning rate in Figure 1 (top), for each learning rate value in [0.005, 0.05, 0.01, 0.1, 0.2] we run SGD without momentum with a fixed batch size of 512 for 30000 iterations. LLC estimation was performed every 1000 iterations with the following SGLD hyperparameters: step size $\epsilon = 2\times10^{-7}$, chain length of 3000 iterations, batch size of 2048 and $\gamma = 1.0$.

For the experiments that vary the batch size in Figure 1 (middle), for each batch size value in [16, 32, 64, 128, 256, 512, 1024] we run SGD without momentum with a fixed learning rate of 0.01 for 100000 iterations. LLC estimation was performed every 2500 iterations with the following SGLD hyperparameters: step size $\epsilon = 2\times10^{-7}$, chain length of 2500 iterations, batch size of 2048 and $\gamma = 1.0$.

For the experiments that vary the SGD momentum in Figure 1 (bottom), for each momentum value in [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9] we run SGD with momentum with a fixed learning rate of 0.05 and a fixed batch size of 512 for 20000 iterations. LLC estimation was performed every 1000 iterations with the following SGLD hyperparameters: step size $\epsilon = 2\times10^{-7}$, chain length of 3000 iterations, batch size of 2048 and $\gamma = 1.0$.

K.2 Additional ResNet18 + CIFAR10 LLC experiments

Experiments using SGD with momentum

We repeat the experiments varying learning rate and batch size shown in Figure 1 (top and middle) of the main text, but this time we train ResNet18 on CIFAR10 using SGD with momentum. The results are shown in Figure K.1 and the experimental details are as follows:

  • For the experiments that vary the learning rate (top), for each learning rate value in [0.005, 0.05, 0.01, 0.1, 0.2] we run SGD with momentum of 0.9 and a fixed batch size of 512 for 30000 iterations. LLC estimation was performed every 1000 iterations with the following SGLD hyperparameters: step size $\epsilon = 2\times10^{-7}$, chain length of 3000 iterations, batch size of 2048 and $\gamma = 1.0$.

  • For the experiments that vary the batch size (bottom), for each batch size value in [16, 32, 64, 128, 256, 512, 1024] we run SGD with momentum of 0.9 and a fixed learning rate of 0.01 for 100000 iterations. LLC estimation was performed every 2500 iterations with the following SGLD hyperparameters: step size $\epsilon = 2\times10^{-7}$, chain length of 2500 iterations, batch size of 2048 and $\gamma = 1.0$.

Explicit $L^2$-regularization

We also run analogous experiments using explicit $L^2$-regularization. We train ResNet18 on the CIFAR10 dataset using SGD, both with and without momentum, using the usual cross-entropy loss with an added $L^2$-regularization term $\alpha\|w\|_2^2$, where $w$ denotes the network weight vector and $\alpha$ is the regularization rate hyperparameter that we vary in this set of experiments. As in the other experiments in this section, we track LLC estimates over evenly spaced training checkpoints. Note, however, that the loss function used for SGLD sampling during LLC estimation is the original cross-entropy loss, without the added $L^2$-regularization term.

The results are shown in Figure K.2 and the details are as follows:

  • For the experiment with momentum (Figure K.2 top), for each regularization rate $\alpha \in [0.0, 0.01, 0.025, 0.05, 0.075, 0.1]$ we run SGD with momentum of 0.9, a fixed learning rate of 0.0005 and a batch size of 512 for 15000 iterations. LLC estimation was performed every 500 iterations with the following SGLD hyperparameters: step size $\epsilon = 5\times10^{-8}$, chain length of 2000 iterations, batch size of 2048 and $\gamma = 1.0$.

  • For the experiment without momentum (Figure K.2 bottom), for each regularization rate $\alpha \in [0.0, 0.01, 0.025, 0.05, 0.075, 0.1]$ we run SGD without momentum with a fixed learning rate of 0.001 and a batch size of 512 for 50000 iterations. LLC estimation was performed every 1000 iterations with the following SGLD hyperparameters: step size $\epsilon = 5\times10^{-8}$, chain length of 2000 iterations, batch size of 2048 and $\gamma = 1.0$.

Figure K.1: Impact of varying training configurations believed to exert implicit regularization pressure when training ResNet18 on CIFAR10 using SGD with momentum (contrast with the runs without momentum reported in Figure 1 in the main text). Top: varying learning rate. Bottom: varying batch size.
Figure K.2: Impact of varying the explicit $L^2$-regularization rate when training ResNet18 on CIFAR10 using SGD with (top) and without (bottom) momentum.

Appendix L Additional Experiment: LLC for language model

The purpose of this experiment is simply to verify that the LLC can be estimated consistently for a transformer trained on language data. In Figure L.1, we show the LLC estimates for $\hat{w}^*_n$ at the end of training over several training runs. The LLC estimates are a small fraction of the 3.3m total parameters. We also note that the values of the LLC estimates are remarkably stable over multiple training runs. Experimental details are given below.

Figure L.1: SGLD-based LLC estimates for $\hat{w}^*_n$ at the end of training an attention-only transformer on a subset of the Pile dataset. The distribution reports LLC estimates over 10 training repetitions. Again, the LLC is a tiny fraction of the 3.3m parameters in the transformer.

We trained a two-layer attention-only (no MLP layers) transformer architecture on a resampled subset of the Pile dataset (Gao et al., 2020; Xie et al., 2023) with a context length of 1024, a residual stream dimension of $d_{\text{model}} = 256$, and 8 attention heads per layer. The architecture also uses a learnable Shortformer embedding (Press et al., 2021) and includes layer norm layers. Additionally, we truncated the full GPT-2 tokenizer, which has a vocabulary of around 50,000 tokens, down to the first 5,000 tokens in the vocabulary to reduce the size of the model. The resulting model has a total parameter count of $d = 3{,}355{,}016$. We instantiated these models using an implementation provided by TransformerLens (Nanda and Bloom, 2022).

We trained 10 different seeds over a single epoch of 50,000 steps with a minibatch size of 100, resulting in about 5 billion tokens used during training for each model. We used the AdamW optimizer with a weight decay of 0.05 and a learning rate of 0.001 with no scheduler.

We run SGLD-based LLC estimation once at the end of training for each seed at a temperature of $\beta = 1/\log(100)$. We set $\gamma = 100$ and $\epsilon = 0.001$. We take samples over 20 SGLD chains with 200 draws per chain, using a validation set.

Appendix M Additional Experiment: MALA versus SGLD

Here we verify empirically that our SGLD-based LLC estimator (Algorithm 1) does not suffer from using the minibatch loss for both SGLD sampling and LLC calculation. Specifically, we compare against LLC estimation via Equation 12 and the Metropolis-adjusted Langevin algorithm (MALA), a standard gradient-based MCMC algorithm (Roberts and Rosenthal, 1998). Notice that this comparison rather stacks the odds against the minibatch SGLD-based LLC estimator of Algorithm 1, so it is all the more surprising that we see such good agreement below.

We test with a two-hidden-layer ReLU network with ten inputs, ten outputs, and twenty neurons per hidden layer. Denote the inputs by $x$, the parameters by $w$, and the output of this network by $f(x,w)$. The data are generated to create a "realizable" data-generating process with "true parameter" $w^*$: inputs $X$ are generated from a uniform distribution, and labels $Y$ are (noiselessly) generated by the true network, so that $Y_i = f(X_i, w^*)$.

Hyperparameters and experimental details are as follows. We sweep the dataset size from 100 to 100000 and compare our LLC estimator with the MALA-based one. For all dataset sizes, the SGLD batch size was set to 32 and $\gamma = 1.0$, and MALA and SGLD shared the same true parameter $w^*$ (set at random according to a normal distribution). Both MALA and SGLD used a step size of $10^{-5}$ and the asymptotically optimal inverse temperature $\beta^* = 1/\log n$. Experiments were run on CPU.


Figure M.1: We compare LLC estimation using SGLD with LLC estimation using MALA. They make similar estimates (top) but the SGLD-based method is significantly faster (bottom), especially for large dataset sizes.

The results are summarized in Figure M.1. We find that across all dataset sizes, the SGLD and MALA estimates of the LLC agree (Figure M.1, top), but the SGLD-based estimate has far lower computational cost, especially as the dataset size grows (Figure M.1, bottom).

Appendix N Additional Experiment: SGD versus eSGD

We fit a two-hidden-layer feedforward ReLU network with 1.9m parameters to MNIST using two stochastic optimizers: SGD and entropy-SGD (Chaudhari et al., 2019). We choose entropy-SGD because its objective is to minimize $F_n(w^*, \gamma)$ over $w^*$, so we expect the local minima found by entropy-SGD to have lower $\hat{\lambda}(w^*)$.

Figure N.1 shows the LLC estimates $\hat{\lambda}(\hat{w}_n^*)$ for $\hat{w}_n^*$ at the end of training, optimized by either entropy-SGD or standard SGD. Notice that the LLC estimates, for both stochastic optimizers, are on the order of 1000, much lower than the 1.9m parameters in the ReLU network.

Figure N.1: LLC estimates for $\hat{w}_n^*$ at the end of training a feedforward ReLU network on MNIST. The distribution reports $\hat{\lambda}(\hat{w}_n^*)$ over 80 training repetitions in which the training data remains fixed; only the randomness of the stochastic optimizer varies. We compare two stochastic optimizers, SGD and entropy-SGD. Note that all $\hat{\lambda}(\hat{w}_n^*)$ are on the order of 1000, while the parameter count of the ReLU network is 1.9m.

Figure N.1 confirms our expectation that entropy-SGD finds local minima with lower LLC, i.e., entropy-SGD is attracted to more degenerate (simpler) critical points than SGD. Interestingly, Figure N.1 also reveals that the LLC estimate has remarkably low variance over the randomness of the stochastic optimizer. Finally, it is noteworthy that, for both stochastic optimizers, the LLC estimate of the learned network is on average a tiny fraction of the total number of weights: $\hat{\lambda}(w^*) \approx 1000$.

In this experiment, we trained a feedforward ReLU network on the MNIST dataset (Deng, 2012). The dataset consists of 60000 training samples and 10000 testing samples. The network has 2 hidden layers of sizes [1024, 1024] and contains a total of 1863690 parameters.

For training, we employed two different optimizers, SGD and entropy-SGD, minimizing the cross-entropy loss. Both optimizers use a learning rate of 0.01, a momentum parameter of 0.9 and a batch size of 512. SGD is used with the Nesterov look-ahead gradient estimator. The number of samples $L$ used by entropy-SGD for local free energy estimation is set to 5. The network is trained for 200 epochs, a number chosen so that the classification error rate on the training set falls below $10^{-4}$.

The hyperparameters used for SGLD are as follows: $\epsilon$ is set to $10^{-5}$, the chain length to 400, the minibatch size to 512, and $\gamma = 100$. We repeat each SGLD chain 4 times to compute the variance of the estimated quantities and also as a diagnostic tool, a proxy for estimation stability.

Appendix O Additional Experiment: scaling invariance

In Appendix C, we show theoretically that the LLC is invariant to local diffeomorphism. Here we empirically verify that our LLC estimator (with preconditioning) satisfies this property in a specific easy-to-test case (though we do not test it in general).

The function implemented by a feedforward ReLU network is invariant to a type of reparameterization known as rescaling symmetry. Invariance of other measures to this symmetry is not trivial or automatic; other geometric measures, such as Hessian-based basin broadness, have been undermined by their failure to remain invariant to these symmetries (Dinh et al., 2017).

For simplicity, suppose we have a two-layer ReLU network with weights $W_1, W_2$ and biases $b_1, b_2$. Then rescaling symmetry is captured by the following fact, for an arbitrary scalar $\alpha > 0$:

$$W_{2}\,\mathrm{ReLU}(W_{1}x + b_{1}) + b_{2} = \alpha W_{2}\,\mathrm{ReLU}\!\left(\tfrac{1}{\alpha}W_{1}x + \tfrac{1}{\alpha}b_{1}\right) + b_{2}$$

That is, we may choose new parameters $W_1' = \frac{1}{\alpha}W_1$, $b_1' = \frac{1}{\alpha}b_1$, $W_2' = \alpha W_2$, $b_2' = b_2$ without affecting the input-output behavior of the network in any way. This symmetry generalizes to any two adjacent layers in ReLU networks of arbitrary depth.
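This identity is easy to check numerically. The following sketch (with arbitrary layer sizes and $\alpha$, not the paper's code) verifies that the rescaled parameters produce identical outputs:

```python
# Minimal numerical check of the rescaling symmetry above.
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden, d_out = 5, 7, 3
W1, b1 = rng.normal(size=(d_hidden, d_in)), rng.normal(size=d_hidden)
W2, b2 = rng.normal(size=(d_out, d_hidden)), rng.normal(size=d_out)
x = rng.normal(size=d_in)

def relu(z):
    return np.maximum(z, 0.0)

def forward(W1, b1, W2, b2, x):
    return W2 @ relu(W1 @ x + b1) + b2

alpha = 17.3  # any positive scalar
out_original = forward(W1, b1, W2, b2, x)
out_rescaled = forward(W1 / alpha, b1 / alpha, alpha * W2, b2, x)  # (W1', b1', W2', b2')
assert np.allclose(out_original, out_rescaled)  # identical input-output behavior
```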

Given that these symmetries do not affect the function implemented by the network, and are present globally throughout parameter space, these degrees of freedom seem "superfluous" and should ideally not affect our tools. Importantly, this is the case for the LLC.

We verify empirically that this property also appears to hold for our LLC estimator, when proper preconditioning is used.

O.1 Preconditioned SGLD

We must slightly modify our SGLD sampler to perform this experiment tractably. This is because applying rescaling symmetry makes the loss landscape significantly anisotropic, forcing prohibitively small step sizes with the original algorithm.

Thus we add preconditioning to the SGLD sampler from Section 4.4, with the only modification being a fixed preconditioning matrix A𝐴Aitalic_A:

$$\Delta w_{t} = A\,\frac{\epsilon}{2}\left(\frac{\beta^{*} n}{m}\sum_{(x,y)\in B_{t}}\nabla\log p(y|x,w_{t}) + \gamma(w^{*}-w_{t})\right) + N(0,\epsilon) \tag{27}$$

In the experiment to follow, this preconditioning matrix is hardcoded for convenience, but if this algorithm is to be used in practice, the preconditioning matrix must be learned adaptively using standard methods for adaptive preconditioning (Haario et al., 1999).
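For concreteness, a single update of Eq. (27) might look as follows, assuming a diagonal preconditioner and a precomputed minibatch gradient of the log-likelihood; the function and argument names are ours, not the paper's code.

```python
# A minimal sketch of the preconditioned SGLD update in Eq. (27).
import numpy as np

def preconditioned_sgld_step(w, w_star, grad_log_lik_minibatch, A_diag,
                             eps, beta_star, n, m, gamma, rng):
    """One preconditioned SGLD update localized at w_star.

    grad_log_lik_minibatch: sum over the minibatch B_t of grad_w log p(y|x, w).
    A_diag: diagonal of the fixed preconditioning matrix A.
    """
    drift = (beta_star * n / m) * grad_log_lik_minibatch + gamma * (w_star - w)
    noise = rng.normal(scale=np.sqrt(eps), size=w.shape)  # N(0, eps) noise, as in Eq. (27)
    return w + A_diag * (eps / 2.0) * drift + noise
```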

O.2 Experiment

We take a small feedforward ReLU network and rescale two adjacent layers by $\alpha$ in the fashion described above. We vary $\alpha$ across eight orders of magnitude and measure the estimated LLC, using Eq. 27 for SGLD sampling and Eq. 12 for calculating the LLC from samples. (Note that this means we are not using Algorithm 1 here, both because of the preconditioned SGLD and because we calculate the LLC using the full-batch loss instead of the minibatch loss.)

Crucially, we must use preconditioning here to avoid prohibitively small step-size requirements. In this case, the preconditioning matrix $A$ is set manually to be a diagonal matrix with entries $\alpha^2$ for parameters corresponding to $W_1$ and $b_1$, entries $\frac{1}{\alpha^2}$ for $W_2$, and entries 1 otherwise. (In practical situations, the preconditioning matrix cannot be set manually and must be learned adaptively; standard methods for adaptive preconditioning exist in the MCMC literature (Haario et al., 2001, 1999).)
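A sketch of this hand-set diagonal preconditioner, assuming each parameter coordinate carries a label for the block it belongs to (the labeling scheme is ours):

```python
# Diagonal of A: alpha^2 for coordinates of W1 and b1, 1/alpha^2 for W2, 1 otherwise.
import numpy as np

def preconditioner_diag(param_labels, alpha):
    scale = {"W1": alpha**2, "b1": alpha**2, "W2": 1.0 / alpha**2}
    return np.array([scale.get(label, 1.0) for label in param_labels])
```

The resulting vector can be passed as A_diag to the preconditioned SGLD step sketched above.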



Figure O.1: LLC estimation is invariant to rescaling symmetries in ReLU networks. As the rescaling parameter $\alpha$ is varied over eight orders of magnitude, the estimated value of the LLC remains invariant (up to statistical error). The small error bars across multiple SGLD runs illustrate the stability of the estimation method. Model layer sizes, including the input dimension, are shown in the legend.

The results can be found in Figure O.1. We conclude that LLC estimation appears invariant to ReLU network rescaling symmetries.

Appendix P Compute resources disclosure

Specific details about each type of experiment we carried out are listed below. Additional compute was required for finding suitable SGLD hyperparameters for LLC estimation, but it did not make a significant difference to the overall resource requirements.

ResNet18 + CIFAR10 experiments.

Each experiment, i.e. training a ResNet18 for the reported number of iterations and performing LLC estimation at the stated number of checkpoints, is run on a node with either a single V100 or A100 NVIDIA GPU (depending on availability) hosted on an internal HPC cluster, with 2 CPU cores and 16GB memory allocated. No significant storage is required. Each experiment took between a few minutes and 3 hours depending on the configuration (mean: 87.6 minutes, median: 76.7 minutes).

Estimated total compute is 287 GPU hours spread across 208 experiments.

DLN experiments.

Each experiment, estimating the LLC either at the true parameter or at a trained SGD parameter (the latter requiring training), is run on a single A100 NVIDIA GPU node hosted on an internal HPC cluster with 2 CPU cores and no more than 8GB memory allocated. No significant storage is required. Each experiment took less than 15 minutes.

Estimated total compute is 150 GPU hours: 600 experiments (including failed ones) each around 15 GPU minutes.

MNIST experiments.

Each repetition of training a feedforward ReLU network with the SGD or entropy-SGD optimizer on MNIST and estimating the LLC at the end of training is run on a single A100 NVIDIA GPU node hosted on an internal HPC cluster with 8 CPU cores and no more than 16GB memory allocated. No significant storage is required. Each experiment took less than 10 minutes.

Estimated total compute is 27 GPU hours: 2 sets of 80 repetitions, 10 GPU minutes each.

Language model experiments.

Training the language model and estimating its LLC at the last checkpoint took around 30 minutes on Google Colab on a single A100 NVIDIA GPU with 84GB memory and 1 CPU allocated. The storage space used for the training data is around 27GB.

Estimated total compute is 5 GPU hours: 10 repetitions of 30 GPU minutes each.

Cite as

@article{lau2023the,
  title = {The Local Learning Coefficient: A Singularity-Aware Complexity Measure},
  author = {Edmund Lau and Zach Furman and George Wang and Daniel Murfet and Susan Wei},
  year = {2023},
  abstract = {Deep neural networks (DNN) are singular statistical models which exhibit complex degeneracies. In this work, we illustrate how a quantity known as the learning coefficient introduced in singular learning theory quantifies precisely the degree of degeneracy in deep neural networks. Importantly, we will demonstrate that degeneracy in DNN cannot be accounted for by simply counting the number of 'flat' directions.},
  url = {https://proceedings.mlr.press/v258/lau25a.html}
}