Guide for Sampling Hyperparameter Selection

An introduction to SGMCMC-based sampling and practical guide for choosing hyperparameters.

Authors
Rohan Hitchcock*†, Jesse Hoogland*, George Wang*, Andrew Gordon*†
Timaeus · * Research Contributor · † Writing Contributor · Correspondence to rohan@timaeus.co
Published
April 21, 2026

This document is a practical guide for those wanting to replicate or extend various Timaeus projects, or to apply tools such as susceptibilities in their own work. There are open theoretical questions about the behavior and limitations of the sampling algorithms; this is an important and active area of research. In this guide, though, we focus on the practical methodology needed to use SLT-based tools. This also means we focus on sampling methods which can be scaled to models with billions of parameters. For a more general introduction to sampling we recommend Betancourt (2018).

[Interactive figure: click a quantity to explore how an SGMCMC sampler (e.g. SGLD or RMSProp-SGLD) feeds the WBIC estimator, which in turn yields the local learning coefficient, as well as the susceptibility estimator (susceptibilities) and the BIF estimator (Bayesian influence function).]

Sampler. Estimates \mathbb{E}_{\beta}[L_n(w)]. Posterior localization is implemented using a Gaussian prior. Tested empirically.

WBIC Estimator. \hat{\lambda}_n = n\beta(\mathbb{E}_{\beta}[L_n(w)] - L_n(w_0)), where the expectation is taken over a local posterior supported on a neighborhood of w_0 at inverse temperature \beta, and L_n(w) is the empirical loss on n samples. Asymptotically unbiased.

Local Learning Coefficient. A geometric invariant of the population loss L(w) at a local minimum w_0.

Basics of Sampling

Many quantities we care about can be expressed as expectations with respect to a localized posterior distribution on the parameter space of the model. Specifically, we want to take expectations with respect to the distribution

\begin{aligned} p(w) &\propto \exp(-n\beta L_n(w))\,\varphi(w; \gamma, w_0) \\ &\propto \exp\left(-n\beta L_n(w) - \tfrac{\gamma}{2} \|w - w_0\|^2 \right) \end{aligned}

where L_n(w) = \frac{1}{n} \sum_{i=1}^n \ell(X_i, w) is the empirical loss of a parameter w, and \varphi(w; \gamma, w_0) is a Gaussian prior with variance 1/\gamma centered at w_0. The term involving L_n(w) means the posterior reflects the geometry of the loss landscape, while the prior localizes the distribution to a neighborhood of a particular point w_0 in parameter space. In practice, w_0 is a relevant point on the model's training trajectory, such as the parameters of the model at the end of training.
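To make the target concrete, here is a minimal sketch of the unnormalized negative log-density of this localized posterior. This is our own illustrative code, not the devinterp implementation, and a toy quadratic loss stands in for L_n:

```python
import numpy as np

def neg_log_posterior(w, w0, empirical_loss, n, beta, gamma):
    """Unnormalized negative log-density of the localized posterior:
    n * beta * L_n(w) + (gamma / 2) * ||w - w0||^2."""
    localization = 0.5 * gamma * np.sum((w - w0) ** 2)
    return n * beta * empirical_loss(w) + localization

# Toy example: quadratic loss around the origin (a stand-in for L_n).
w0 = np.zeros(2)
loss = lambda w: 0.5 * np.sum(w ** 2)
val = neg_log_posterior(np.array([0.1, -0.2]), w0, loss, n=1000, beta=1.0, gamma=100.0)
```

Smaller values of this quantity correspond to higher posterior density; the two terms trade off fit to the loss landscape against staying near w_0.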

These posterior expectations are estimated using samplers: algorithms which explore the geometry of the loss landscape in a (relatively) efficient way. The idea is that, when it is working correctly, the amount of time a sampler spends in any given region is proportional to the posterior probability of that region. Then, by running the sampler for many steps we are able to approximate what we would get if we could sample independent points directly from the target distribution.

The samplers we consider resemble optimizers like SGD and Adam: they use mini-batch estimates of \nabla L_n(w) to push the sampler towards regions where L_n(w) is small (and hence the posterior density is high). Samplers differ from optimizers in that they inject appropriately scaled noise, which means they explore the region around a minimum rather than converging to it.

A sampler in a toy loss landscape. The goal when estimating quantities from SLT is for the sampler to explore the geometry around a point w_0. It should spend most of its time in the deep blue regions, and it should neither concentrate on w_0 nor explore different parts of parameter space. Adjust n\beta and \gamma (from the posterior distribution above) using the sliders.

Formally, a sampler produces a sequence w_0, w_1, \ldots, w_T of points in parameter space. Assuming the time the sampler spends in any given region is proportional to the posterior probability of that region, the expectation of a function F : \mathbb{R}^d \to \mathbb{R} is estimated as

\hat{F} = \frac{1}{T} \sum_{t=1}^T F(w_t).

Because these methods need to scale to billion-parameter models, we are limited to approximate stochastic gradient MCMC (SG-MCMC) sampling algorithms. These are samplers where a single sampling step has approximately the same cost as a training step of the same model. In this guide we focus on two: Stochastic Gradient Langevin Dynamics (SGLD; Welling & Teh (2011)), and RMSProp-preconditioned SGLD (RMSProp-SGLD; Li et al. (2016)).

Samplers

Below we explain two sampling algorithms: SGLD (which is conceptually simpler) and RMSProp-preconditioned SGLD (which is more complex, but our default recommendation for most settings).

SGLD

SGLD is stochastic gradient descent plus appropriately scaled Gaussian noise. This is the simplest SG-MCMC sampler; most sampling algorithms can be understood as extensions of SGLD. A sequence of points w_0, w_1, \ldots, w_T is generated as

w_{t+1} = w_t + \Delta w_t \qquad \Delta w_t = -\frac{\epsilon}{2} \left( \gamma(w_t - w_0) + n\beta \nabla L_{m, t}(w_t) \right) + \sqrt{\epsilon}\, \eta_t

where:

  • \epsilon > 0 is the sampler's step size;
  • \nabla L_{m, t}(w_t) is an estimate of the loss gradient using a batch of data of size m. That is, L_{m, t}(w) = \frac{1}{m} \sum_{i=1}^m \ell(X_{i, t}, w), where X_{i, t} denotes a data sample in the t-th batch;
  • \eta_0, \ldots, \eta_{T-1} is a sequence of iid standard normal vectors in \mathbb{R}^d;
  • n\beta > 0 and \gamma are the inverse temperature and localization respectively, coming from the posterior distribution.
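As a concrete reference, the SGLD update above can be sketched in a few lines of NumPy. This is a toy illustration with a quadratic loss standing in for a mini-batch gradient estimate, not the devinterp implementation:

```python
import numpy as np

def sgld_step(w, w0, grad_minibatch_loss, eps, n_beta, gamma, rng):
    """One SGLD step: drift from the localization and loss-gradient terms,
    plus Gaussian noise scaled by sqrt(eps)."""
    drift = gamma * (w - w0) + n_beta * grad_minibatch_loss(w)
    return w - 0.5 * eps * drift + np.sqrt(eps) * rng.standard_normal(w.shape)

# Toy loss L(w) = 0.5 ||w||^2, whose gradient is simply w.
rng = np.random.default_rng(0)
w0 = np.zeros(3)
w = w0.copy()
for _ in range(1000):
    w = sgld_step(w, w0, lambda v: v, eps=1e-3, n_beta=100.0, gamma=10.0, rng=rng)
```

Rather than converging to the minimum, the iterates hover in a noise ball around w_0 whose size is set by \epsilon, n\beta and \gamma.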

RMSProp-preconditioned SGLD

RMSProp-preconditioned SGLD is SGLD with an adaptive step size that is updated according to the loss gradient on a per-parameter basis. It has been shown to work better than SGLD and a number of other samplers in settings similar to where we are interested in applying SLT-based methods (Hitchcock & Hoogland 2025), and it is our default recommendation.

RMSProp-preconditioned SGLD is a specific version of "preconditioned SGLD", which refers to the idea of improving SGLD by rescaling parameter space to account for the different roles of parameters in the loss L_n(w). In general, a preconditioned SGLD algorithm has the update rule

w_{t+1} = w_t + \Delta w_t \qquad \Delta w_t = -\frac{\epsilon}{2} G(t) \left( \gamma(w_t - w_0) + n\beta \nabla L_{m, t}(w_t) \right) + \sqrt{\epsilon G(t)}\, \eta_t

where G(t) is a square matrix, \sqrt{\epsilon G(t)} denotes the entry-wise square root, and all other terms are as in SGLD. In RMSProp preconditioning, we track an exponential moving average of the squared gradient of each parameter

v_i(t+1) = \alpha v_i(t) + (1-\alpha) \left(\nabla L_{m, t}(w_t)\right)_i^2 \qquad \alpha \in (0, 1)

and take G(t) to be a diagonal matrix with G_{ii}(t) = \frac{1}{\sqrt{\hat{v}_i(t)} + a}, where \hat{v}_i(t) = \frac{1}{1 - \alpha^{t+1}} v_i(t). The quantity \epsilon_i(t) = \frac{\epsilon}{\sqrt{\hat{v}_i(t)} + a} is the step size for parameter i at step t. This is analogous to the RMSProp optimizer.
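The preconditioner can be sketched as follows. This is again our own illustrative code on a toy quadratic loss, with the bias-corrected EMA and stability parameter a as defined above:

```python
import numpy as np

def rmsprop_sgld_step(w, w0, v, t, grad_minibatch_loss, eps, n_beta, gamma,
                      alpha, a, rng):
    """One RMSProp-preconditioned SGLD step; v is the EMA of squared gradients."""
    g = grad_minibatch_loss(w)
    v = alpha * v + (1 - alpha) * g ** 2           # EMA of squared gradients
    v_hat = v / (1 - alpha ** (t + 1))             # bias correction
    G = 1.0 / (np.sqrt(v_hat) + a)                 # diagonal preconditioner G_ii(t)
    drift = gamma * (w - w0) + n_beta * g
    w = w - 0.5 * eps * G * drift + np.sqrt(eps * G) * rng.standard_normal(w.shape)
    return w, v

# Toy loss L(w) = 0.5 ||w||^2, whose gradient is w.
rng = np.random.default_rng(0)
w0 = np.zeros(3)
w, v = w0.copy(), np.zeros(3)
for t in range(1000):
    w, v = rmsprop_sgld_step(w, w0, v, t, lambda u: u, eps=1e-4,
                             n_beta=100.0, gamma=10.0, alpha=0.99, a=0.1, rng=rng)
```

Since the maximum per-parameter step size is \epsilon / a, choosing a very small can make early steps large while v is still near zero; we use a = 0.1 here purely for stability in this toy setting.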

Using Posterior Samples

Let w_0, w_1, \ldots, w_T be a sequence of points produced by a sampler. To estimate the expectation of a function F : \mathbb{R}^d \to \mathbb{R} with respect to the posterior, we replace the "spatial" average over \mathbb{R}^d with a "temporal" average over the sampling process. The idea is that the time the sampler spends in a given region should be proportional to the probability of that region under the posterior distribution. In the simplest case, we might consider

\mathbb{E}_{w\sim p}[F(w)] \approx \frac{1}{T} \sum_{t=1}^T F(w_t).

Burn-in

In practice, approximating \mathbb{E}_{w\sim p}[F(w)] in this way biases the estimate toward F(w_0). It is helpful to think of the initial phase of sampling as converging to some equilibrium distribution or "typical set". This is a good qualitative description of the behavior of samplers we observe in practice, though it is important to note that the theoretical story around this convergence is unclear and an active area of research. To account for the convergence phase of the sampler, we consider the approximation

\mathbb{E}_{w\sim p}[F(w)] \approx \frac{1}{T - B} \sum_{t=B+1}^{T} F(w_t)

where B is a hyperparameter specifying the burn-in period of the sampler. We give details on how to adjust B below.

Thinning

Another thing to consider is that the parameter w_{t+1} is highly correlated with the parameter w_t. In some situations it may make sense to compute F every k steps after burn-in, rather than every step (for example, if computing F is expensive and this allows you to increase T). This is called thinning. We can generalize the above as

\mathbb{E}_{w\sim p}[F(w)] \approx \frac{1}{|D|} \sum_{t \in D} F(w_t)

where D = \{B, B+k, B+2k, \ldots, T\}. In general we refer to a step which is used to estimate the expectation as a draw.
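Putting burn-in and thinning together, the estimator might look like this in code (an illustrative sketch with our own helper names; a synthetic scalar chain stands in for sampler output):

```python
import numpy as np

def posterior_estimate(samples, F, burnin, thin):
    """Estimate E[F(w)] from one chain: discard the first `burnin` steps,
    then use every `thin`-th remaining step as a draw."""
    draws = samples[burnin::thin]
    return np.mean([F(w) for w in draws])

# Toy "chain" of scalar parameters converging to an equilibrium around 1.0.
rng = np.random.default_rng(0)
chain = 1.0 - np.exp(-0.01 * np.arange(5000)) + 0.05 * rng.standard_normal(5000)
est = posterior_estimate(chain, lambda w: w, burnin=1000, thin=10)
```

With burnin=1000 the drifting early phase is excluded and est lands near the equilibrium value; with burnin=0 the converging phase would pull the estimate toward the starting value.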

Independent Chains

Finally, estimates can often be improved by repeating the sampling process multiple times, with independent sources of randomness each time. Each sequence of steps produced in this way is called a sampling chain. Each chain should use independent Gaussian noise and independently drawn data to estimate the loss gradient. Let C denote the number of chains and \hat{F}_c the estimate of \mathbb{E}_{w\sim p}[F(w)] produced by chain c. The overall estimate is

\mathbb{E}_{w\sim p}[F(w)] \approx \frac{1}{C} \sum_{c=1}^C \hat{F}_c

Terminology Summary

| Term | Description |
| --- | --- |
| Step | Any parameter w_t produced by the sampler. |
| Draw | A parameter which is used to estimate the expectation. |
| Chain | A sequence of steps produced by a sampler. Typically, multiple chains with independent noise, gradient data (used to estimate the loss gradient), and observable data (used to estimate F, if required) are started from the same parameter w_0. |

Summary of Hyperparameters

Before moving on to the next section we summarize all hyperparameters in one place. We have two sources of hyperparameters: those arising from the sampling process and those which control how we use samples to estimate FF. We further separate the sampling hyperparameters into ‘main’ sampling hyperparameters (which do not have sensible defaults and must always be tuned) and ‘secondary’ sampling hyperparameters (which have sensible defaults and are not always adjusted).

Sampling is implemented in the updated devinterp library (coming soon); the sample function can be accessed via from devinterp.slt.llc import sample. The argument names of hyperparameters in the sample function are given below.

Main Sampling Hyperparameters

| Parameter | devinterp | Notes |
| --- | --- | --- |
| Step size \epsilon | lr | Controls the size of steps taken by the sampler. Smaller is generally better, but decreasing \epsilon slows down convergence. In RMSProp-SGLD, per-parameter step sizes are set adaptively, so the role of \epsilon is less direct, but in practice it is adjusted in a similar way when tuning hyperparameters. |
| Inverse temperature n\beta | n_beta | Controls the influence of the gradient term. Must be set large enough that the loss landscape geometry influences sampling, but setting it too large can cause convergence issues or prevent the sampler from remaining in a neighborhood of w_0. |
| Localization \gamma | localization | Controls the localization strength. Must be set large enough to keep the sampler in a neighborhood of w_0. |

Secondary Sampling Hyperparameters

| Parameter | devinterp | Notes |
| --- | --- | --- |
| Batch size m | batch_size | The batch size used to compute the gradient of the loss. The batch size used in training is usually a good choice. |
| Preconditioner stability parameter a | rmsprop_eps | Controls the maximum size of a per-parameter step size, appearing in the expression \epsilon_i(t) = \frac{\epsilon}{\sqrt{\hat{v}_i(t)} + a}. RMSProp-SGLD only. |
| Preconditioner EMA decay rate \alpha | rmsprop_alpha | Controls how quickly the exponential moving average of the squared gradients responds, appearing in the expression v_i(t+1) = \alpha v_i(t) + (1-\alpha)\left(\nabla L_{m, t}(w_t)\right)_i^2. Generally set to \alpha = 0.99 and not adjusted. RMSProp-SGLD only. |

Sample Collection Hyperparameters

| Parameter | devinterp | Notes |
| --- | --- | --- |
| Total steps T | Set automatically based on num_burnin_steps, num_draws and num_steps_bw_draws. | The total number of steps for each sampling chain. More is better. |
| Number of burn-in steps B | num_burnin_steps | The number of steps before samples start being collected. Must be chosen large enough that the sampler has converged to the typical set before samples begin being collected. |
| Number of draws | num_draws | The number of draws collected per chain after burn-in. |
| Number of steps between draws k | num_steps_bw_draws | Controls how frequently samples are taken after burn-in. |
| Number of chains C | num_chains | The number of independent sampling chains. More is better. |

Hyperparameter Selection

The question of how best to choose hyperparameters for sampling is not resolved and remains an active and important area of research. There is, however, still much to be said about how to detect various failure modes and how to find hyperparameters efficiently. The purpose of the discussion below is to (a) provide a guide to sampling hyperparameter selection for those wishing to use tools like susceptibilities in their work, and (b) publicly document our current approach to hyperparameter selection.

The overall process is roughly as follows:

  1. Sweep over the primary sampling parameters \epsilon (step size), \gamma (localization) and n\beta (inverse temperature).
  2. If necessary, adjust secondary sampling parameters such as the batch size (for computing loss gradients) and sampler-specific parameters, and repeat step 1.
  3. For the chosen hyperparameters, choose the total number of steps, number of burn-in steps, and number of steps between draws.

We will now discuss this process in more detail. By default we suggest using RMSProp-SGLD, with initial values for the secondary sampling parameters as given in the table above.

Sampling Parameter Sweep

First, run sampling for several different values of \epsilon, \gamma and n\beta, in each case spanning several orders of magnitude. For example, below we use \epsilon \in \{10^{-6}, 10^{-5}, 10^{-4}\}, n\beta \in \{1, 10, 100\} and \gamma \in \{10, 100, 1000\}. We use loss traces as our primary tool for understanding the sampler's behavior. A loss trace is a plot of the sampling step t versus the loss L_{m, t}(w_t) at step t (note that this loss must be computed each step anyway, to compute the gradient term in the sampler). We are looking for loss traces that:

  • On average increase monotonically above the loss at w_0.
  • Level off quickly (i.e. don't continue to increase over the whole sampling run).
  • Don't have large spikes.
  • Have a signal: the average increase in loss over sampling is larger than the local variation within each chain.
  • Level off at approximately the same value across different chains.

Ensure that you are taking enough total steps to see the sampler behavior (consider increasing this if none of the loss traces are converging over several orders of magnitude of ϵ\epsilon), but otherwise don’t worry about the sample collection hyperparameters at this point: these can be chosen once the final sampling hyperparameters are found. Usually these sweeps can be done with a smaller number of chains to save on compute, if required.
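A sweep of this kind is just a nested loop over the three grids, recording a loss trace per configuration. The sketch below is our own illustration: a bare-bones SGLD chain on a toy quadratic loss stands in for a real model and sampler.

```python
import itertools
import numpy as np

def run_chain(eps, n_beta, gamma, steps=2000, dim=10, seed=0):
    """Toy SGLD chain on L(w) = 0.5 ||w||^2; returns the loss trace."""
    rng = np.random.default_rng(seed)
    w0 = np.zeros(dim)
    w = w0.copy()
    trace = np.empty(steps)
    for t in range(steps):
        drift = gamma * (w - w0) + n_beta * w          # grad L(w) = w
        w = w - 0.5 * eps * drift + np.sqrt(eps) * rng.standard_normal(dim)
        trace[t] = 0.5 * np.sum(w ** 2)
    return trace

grids = ([1e-6, 1e-5, 1e-4], [1, 10, 100], [10, 100, 1000])
traces = {cfg: run_chain(*cfg) for cfg in itertools.product(*grids)}
# Inspect each trace against the criteria above: monotone rise, quick
# level-off, no spikes, adequate signal-to-noise, agreement across chains.
```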

[Interactive figure: a 3×3 grid of loss traces over 5000 sampler steps, one column per n\beta \in \{1, 10, 100\} and one row per \gamma \in \{10, 100, 1000\}. Click a plot for details.]
| Loss trace behavior | Suggestion |
| --- | --- |
| Loss trace not levelling off. | Increase \epsilon and/or take more total steps. If this doesn't work, try increasing \gamma. |
| Loss becomes inf/nan. | First decrease \epsilon. Also consider whether n\beta and/or \gamma are too large. |
| Loss trace has spikes. | First decrease \epsilon. Also consider whether n\beta and/or \gamma are too large. |
| Loss trace trend is not non-decreasing (e.g. the loss goes below the loss at w_0). | Increase \gamma and/or decrease n\beta. |
| Low signal-to-noise ratio. | Increase n\beta, possibly while decreasing \gamma and \epsilon. Can also indicate a need to increase the batch size. |

Based on the loss traces, continue sweeping over intermediate values until you find acceptable hyperparameters. You are looking for parameters where the loss increases as quickly as possible to a stable value (to minimize the compute spent on burn-in) and does not exhibit any of the bad behavior discussed above. If you don't find any such parameters, consider adjusting the secondary sampling hyperparameters. Increasing the batch size can sometimes improve sampling quality overall, but has the trade-off of increasing the compute needed for sampling runs. Increasing a in RMSProp-SGLD can help when there seems to be no range of \epsilon values for which the loss does not diverge to infinity, but if set too large it limits the ability of the preconditioner to set per-parameter step sizes: the maximum per-parameter step size is \epsilon / a. You should not set a larger than 1.

As a rule of thumb, you should prefer parameters where n\beta is larger. We need n\beta to be large enough that the loss gradients affect sampling. It is not always obvious from the loss traces when n\beta is set too low (see TODO: ref).

Choosing Sample Collection Parameters

Having found primary and secondary sampling hyperparameters, we choose the sample collection parameters so that sufficiently many samples are collected after the sampler has reached equilibrium. The most important parameter here is the number of burn-in steps: it must be chosen large enough that no draws are taken before the sampler has reached the typical set. This can be checked by inspecting the loss trace. For the sake of the demonstration, we use a set of hyperparameters from the sweep above which converge more slowly; there are better hyperparameters for this model which allow less burn-in.

After choosing the number of burn-in steps, you need to run the chain long enough that adding more steps doesn’t meaningfully change your estimate of the sample mean of FF. A simple way to check this is to plot the running sample mean as a function of the number of steps and look for where the plot flattens out.
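The running-mean check is a one-liner with cumulative sums. Here synthetic values stand in for F(w_t) evaluated on the post-burn-in draws:

```python
import numpy as np

# Stand-in draws of F(w_t) after burn-in; in practice these come from the sampler.
rng = np.random.default_rng(0)
values = 1.0 + 0.1 * rng.standard_normal(2000)

# Running mean after t draws; plot this and look for where it flattens out.
running_mean = np.cumsum(values) / np.arange(1, len(values) + 1)
```

If the tail of running_mean is flat, adding further steps is unlikely to change the estimate much.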

It's also often worthwhile to apply thinning. To pick the spacing k between draws, compute the correlation between pairs of loss values k steps apart, and choose k large enough that this correlation is close to zero.
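One way to automate this choice is to take the smallest k at which the lag-k correlation of the post-burn-in loss values falls below some small threshold. A sketch (the 0.1 threshold is an arbitrary choice of ours; an AR(1) process mimics a sticky chain):

```python
import numpy as np

def lag_correlation(losses, k):
    """Pearson correlation between loss values k steps apart."""
    return np.corrcoef(losses[:-k], losses[k:])[0, 1]

def choose_thinning(losses, threshold=0.1, max_lag=500):
    """Smallest lag k at which the lag-k correlation falls below threshold."""
    for k in range(1, max_lag):
        if abs(lag_correlation(losses, k)) < threshold:
            return k
    return max_lag

# Toy losses with per-step correlation 0.9, so the lag-k correlation
# decays roughly like 0.9^k.
rng = np.random.default_rng(0)
x = np.zeros(10000)
for t in range(1, 10000):
    x[t] = 0.9 * x[t - 1] + rng.standard_normal()
k = choose_thinning(x)
```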

You should also run multiple chains. To decide how many you need, estimate the expectation of F separately from each chain. Let \hat{F}_c denote the estimate from chain c = 1, \ldots, C and look at the variation between the \hat{F}_c (for example by computing the sample standard deviation or interquartile range).

Comparing variation across and within chains can also be used to check convergence of the sampling chains themselves. The Gelman-Rubin statistic (Gelman & Rubin 1992)

\hat{R} = \sqrt{\frac{\frac{N-1}{N}(\text{within-chain variance}) + \frac{1}{N}(\text{between-chain variance})}{\text{within-chain variance}}}

formalizes this; if R^\hat{R} is not close to one then this is a signal that the chains are converging to different distributions.
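A sketch of computing \hat{R} from per-draw values of F across chains (our own illustrative implementation of the formula above, with N the number of draws per chain):

```python
import numpy as np

def gelman_rubin(chains):
    """R-hat from an array of shape (num_chains, N) of per-draw values."""
    chains = np.asarray(chains)
    m, n = chains.shape
    W = chains.var(axis=1, ddof=1).mean()        # within-chain variance
    B = n * chains.mean(axis=1).var(ddof=1)      # between-chain variance
    var_hat = (n - 1) / n * W + B / n
    return np.sqrt(var_hat / W)

rng = np.random.default_rng(0)
good = rng.standard_normal((4, 2000))                 # chains targeting the same distribution
bad = good + np.array([[0.0], [0.0], [3.0], [3.0]])   # two chains offset by 3
```

Here gelman_rubin(good) is close to one, while gelman_rubin(bad) is well above one, flagging that the chains disagree.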

Your choices about sample collection parameters are constrained by compute. From the perspective of accurately estimating the expectation of FF:

  • More chains are better
  • More steps are better
  • More burn-in steps can’t hurt (keeping the total number of draws the same)
  • More steps between draws can’t hurt (keeping the total number of draws the same)

Choosing Parameters for Multiple Weight Restrictions and Training Steps

Weight Restrictions

Instead of taking expectations with respect to the “full” posterior distribution — a distribution over all parameters in the model — we often need to estimate expectations with respect to “restricted” posteriors. The restricted posterior is obtained by treating most of the model weights as fixed values, and considering the loss only as a function of the weights within a component of the model (e.g. an attention head). The above procedure for sampling and hyperparameter selection is identical, except that only the weights within the chosen component are updated by the sampler.

Often multiple restricted posterior distributions are used together or compared, for example with susceptibilities or when comparing the restricted LLC of different attention heads. In this case, we recommend choosing a single set of hyperparameters for all weight restrictions of interest. Choosing sampling hyperparameters on a per-component basis may also be acceptable, but this needs to be investigated further.

Training Steps

It is often interesting to investigate how a quantity changes over training. As for weight restrictions, we recommend choosing a single set of hyperparameters for all training steps of interest. In practice, hyperparameter sweeps can be done at a small set of checkpoints distributed over training. When doing the main sampling run, the loss trace at every checkpoint should be recorded and checked as a diagnostic.

Verifying Hyperparameter Choices

Having chosen hyperparameters, ideally you should find a way to verify that the sampler is giving you meaningful information about the posterior. It is difficult to be prescriptive here because it depends on what you know about the model, but we give some examples below.

Check the Sampler Can Detect Known Structures

Suppose you are doing restricted posterior sampling on the attention heads of an LLM (e.g. to estimate susceptibilities) and you know that induction heads develop at a certain point over training. We would expect the LLC for an induction head to change (as a function of the training step) when the induction head develops. We should verify that the chosen sampler configuration is able to detect this change.

Check the Loss Gradient in the Sampler Affects the Estimated Quantities

One failure mode we have observed, which is not detectable by inspecting loss traces, is that for certain hyperparameter settings with n\beta > 0, the estimated quantities are the same as if n\beta = 0. The most comprehensive way to detect this failure mode is to re-run sampling with the same hyperparameter settings except with n\beta = 0 and compare the final results. There should be a qualitative difference between the results of sampling with n\beta > 0 and n\beta = 0.
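In a toy setting the check looks like this: run the same chain twice, once with the loss gradient switched on and once with n\beta = 0, and compare the resulting loss statistics. This is illustrative code of our own, with a quadratic loss standing in for the model:

```python
import numpy as np

def mean_sampled_loss(eps, n_beta, gamma, dim=20, steps=4000, seed=0):
    """Toy SGLD run on L(w) = 0.5 ||w||^2; mean loss over the second half."""
    rng = np.random.default_rng(seed)
    w0 = np.zeros(dim)
    w = w0.copy()
    losses = np.empty(steps)
    for t in range(steps):
        drift = gamma * (w - w0) + n_beta * w        # grad L(w) = w
        w = w - 0.5 * eps * drift + np.sqrt(eps) * rng.standard_normal(dim)
        losses[t] = 0.5 * np.sum(w ** 2)
    return losses[steps // 2:].mean()

with_grad = mean_sampled_loss(eps=1e-3, n_beta=100.0, gamma=10.0)
no_grad = mean_sampled_loss(eps=1e-3, n_beta=0.0, gamma=10.0)
# With n_beta = 0 the sampler ignores the loss landscape entirely, so the two
# runs should differ qualitatively; if they do not, the gradient term is not
# actually influencing sampling.
```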

Work with us

We're hiring Research Scientists, Engineers & more to join the team full-time.

Senior researchers can also express interest in a part-time affiliation through our new Research Fellows Program.

Acknowledgements

We thank Sandy Fraser for creating an earlier version of the figure illustrating chains, burn-in and draws.

Citation Information
Please cite as:
Rohan Hitchcock, Jesse Hoogland, George Wang, Andrew Gordon, "Guide for Sampling Hyperparameter Selection", 2026.
BibTeX Citation:
@article{hitchcock2026guide,
  title={Guide for Sampling Hyperparameter Selection},
  author={Rohan Hitchcock and Jesse Hoogland and George Wang and Andrew Gordon},
  year={2026}
}
References
  1. The Local Learning Coefficient: A Singularity-Aware Complexity Measure [link]. Edmund Lau, Zach Furman, George Wang, Daniel Murfet, Susan Wei, 2025. In Proceedings of The 28th International Conference on Artificial Intelligence and Statistics, Vol 258, pp. 244–252. PMLR.
  2. A widely applicable Bayesian information criterion. Sumio Watanabe, 2013. The Journal of Machine Learning Research, Vol 14(1), pp. 867–897.
  3. Linear response estimators for singular statistical models. Chris Elliott, Daniel Murfet, 2026. To appear.
  4. Bayesian Influence Functions for Hessian-Free Data Attribution [link]. Philipp Alexander Kreer, Wilson Wu, Maxwell Adam, Zach Furman, Jesse Hoogland, 2025.
  5. Structural Inference: Interpreting Small Language Models with Susceptibilities [link]. Garrett Baker, George Wang, Jesse Hoogland, Vinayak Pathak, Daniel Murfet, 2025.
  6. Bayesian learning via stochastic gradient Langevin dynamics. Max Welling, Yee W Teh, 2011. In Proceedings of the 28th International Conference on Machine Learning, pp. 681–688.
  7. Preconditioned stochastic gradient Langevin dynamics for deep neural networks. Chunyuan Li, Changyou Chen, David Carlson, Lawrence Carin, 2016. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol 30.
  8. Inference from iterative simulation using multiple sequences. Andrew Gelman, Donald B Rubin, 1992. Statistical Science, Vol 7(4), pp. 457–472. Institute of Mathematical Statistics.