Understanding relatively finite variance in simple models
SLT requires the assumption of "relatively finite variance". How commonly is this assumption actually satisfied?
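For reference, the condition (as formulated by Watanabe; the precise regularity assumptions are in the original sources, so treat this as a paraphrase) is the following. Let $q(x)$ be the true distribution, $p(x \mid w)$ the model with parameter $w \in W$, and $w_0$ a parameter minimising $\mathrm{KL}(q \,\|\, p(\cdot \mid w))$. Writing $f(x, w) = \log \frac{p(x \mid w_0)}{p(x \mid w)}$, the model is said to have relatively finite variance if there is a constant $c_0 > 0$ such that

$$\mathbb{E}_{X \sim q}\left[ f(X, w)^2 \right] \;\le\; c_0 \, \mathbb{E}_{X \sim q}\left[ f(X, w) \right] \quad \text{for all } w \in W.$$

In the realisable case $q = p(\cdot \mid w_0)$, so $\mathbb{E}_{X \sim q}[f(X, w)]$ is exactly $\mathrm{KL}(q \,\|\, p(\cdot \mid w))$ and the condition is known to hold; the non-realisable case is where it becomes a genuine assumption (see Daniel's comment below).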
Question from Garrett Baker:
A key assumption in SLT is relatively finite variance. Do we have any sense of how common this is?
From Daniel in the Discord:
It’s of course true in the realisable case, so if the posterior is dominated (for some range of n) by a region of parameter space around a critical point then one can assume relatively finite variance (this is how we’re justifying e.g. the free energy formula when we use lambda hat). For large n or long training times, you might expect this to be less valid, because the learning process is “coming to grips” with the fact that it is approximating the truth as well as it can using the model, but there is still some distance to go…
All this is heuristic, of course. Right now I don’t even know how to prove relatively finite variance for 1-hidden-layer tanh in the non-realisable case; that would be a nice little project.
Go for it!
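For context, the free energy formula and lambda hat that Daniel refers to are, roughly, the local free energy asymptotics around a critical point $w^*$ and the local learning coefficient estimator (stated here schematically; the exact statement depends on the estimator used):

$$F_n(B) \;\approx\; n L_n(w^*) + \lambda(w^*) \log n, \qquad \hat\lambda(w^*) \;=\; \frac{\mathbb{E}^{\beta^*}_{w}\left[ n L_n(w) \right] - n L_n(w^*)}{\log n}, \quad \beta^* = \frac{1}{\log n},$$

where $B$ is a neighbourhood of $w^*$, $L_n$ is the empirical negative log likelihood, $\mathbb{E}^{\beta^*}_{w}$ is an expectation over the local posterior at inverse temperature $\beta^*$, and $\lambda(w^*)$ is the local learning coefficient. Relatively finite variance is among the assumptions used to justify these asymptotics, which is why the question matters here.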
Where to begin:
If you have decided to start working on this, please let us know in the Discord. We'll update this listing so that other people who are interested in this project can find you.
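If you want a feel for the problem before attempting a proof, one option is numerical: pick a deliberately non-realisable truth, fit a small 1-hidden-layer tanh regression model, and estimate the ratio $\mathbb{E}[f(X, w)^2] / \mathbb{E}[f(X, w)]$ by Monte Carlo at sampled parameters to see whether it looks bounded. This only probes the condition (which asks for a uniform bound over all of $W$), and everything in the sketch below (the choice of truth, network size, noise model, and the crude approximation of the KL minimiser $w_0$) is an illustrative assumption rather than a prescription:

```python
# Monte Carlo probe of the relatively-finite-variance ratio E[f^2] / E[f]
# for a 1-hidden-layer tanh regression model in a non-realisable setting.
# Illustrative sketch only: the truth, sizes, and optimiser are assumptions.
import numpy as np

rng = np.random.default_rng(0)

H = 3          # hidden units
D = 1          # input dimension
SIGMA = 1.0    # known noise scale of the Gaussian likelihood

def g(x, w):
    """1-hidden-layer tanh network: x -> tanh(x a^T) b."""
    a, b = w
    return np.tanh(x @ a.T) @ b          # shape (N,)

def true_mean(x):
    """Non-realisable truth: not expressible by an H-unit tanh net."""
    return np.sin(3.0 * x[:, 0]) + 0.3 * x[:, 0] ** 2

def unpack(theta):
    a = theta[: H * D].reshape(H, D)
    b = theta[H * D:]
    return a, b

def neg_log_lik(theta, x, y):
    """Average negative log likelihood (up to a constant) under N(g(x,w), SIGMA^2)."""
    r = y - g(x, unpack(theta))
    return np.mean(r ** 2) / (2 * SIGMA ** 2)

# --- Crudely approximate the KL minimiser w0 by fitting on a large sample. ---
x_fit = rng.uniform(-2, 2, size=(10000, D))
y_fit = true_mean(x_fit) + SIGMA * rng.normal(size=10000)

theta = 0.1 * rng.normal(size=H * D + H)
lr, eps = 0.05, 1e-4
for _ in range(2000):                     # finite-difference gradient descent
    base = neg_log_lik(theta, x_fit, y_fit)
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        tp = theta.copy()
        tp[i] += eps
        grad[i] = (neg_log_lik(tp, x_fit, y_fit) - base) / eps
    theta -= lr * grad
theta0 = theta                            # stand-in for w0 (only approximate)

# --- Monte Carlo estimates of E[f] and E[f^2] over fresh data from the truth. ---
x_mc = rng.uniform(-2, 2, size=(50000, D))
y_mc = true_mean(x_mc) + SIGMA * rng.normal(size=50000)

def ratio(theta_w):
    """Estimate E[f^2] / E[f] where f = log p(.|w0) - log p(.|w) (Gaussian normalisers cancel)."""
    f = ((y_mc - g(x_mc, unpack(theta_w))) ** 2
         - (y_mc - g(x_mc, unpack(theta0))) ** 2) / (2 * SIGMA ** 2)
    Ef, Ef2 = f.mean(), (f ** 2).mean()
    # theta0 is only an approximate minimiser, so tiny or negative E[f] is filtered out.
    return Ef2 / Ef if Ef > 1e-8 else np.nan

# Probe the ratio at random perturbations of theta0. Relatively finite variance
# would require a uniform bound over all of W; this only samples a few regions.
for scale in [0.1, 0.5, 1.0, 3.0]:
    rs = [ratio(theta0 + scale * rng.normal(size=theta0.shape)) for _ in range(200)]
    rs = [r for r in rs if np.isfinite(r)]
    print(f"perturbation scale {scale}: max observed E[f^2]/E[f] ~ {max(rs, default=float('nan')):.2f}")
```

Comparing this against a realisable control (replace true_mean with an H-unit tanh network) and looking at which directions in parameter space make the ratio grow could suggest where a proof, or a counterexample, might come from.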