Guide for Sampling Hyperparameter Selection

An introduction to SGMCMC-based sampling and practical guide for choosing hyperparameters.

Authors
Rohan Hitchcock*†, Jesse Hoogland*, George Wang*, Andrew Gordon*†
Timaeus · * Research Contributor · † Writing Contributor · Correspondence to rohan@timaeus.co
Published
April 21, 2026

This document is a practical guide for those wanting to replicate or extend various Timaeus projects, or to apply tools such as susceptibilities in their own work. There are open theoretical questions about the behavior and limitations of the sampling algorithms; this is an important and active area of research. In this guide, though, we focus on the practical methodology needed to use SLT-based tools. This also means we focus on sampling methods which can be scaled to models with billions of parameters. For a more general introduction to sampling we recommend Betancourt (2018).

[Interactive figure: click a quantity to explore how an SGMCMC sampler (e.g. SGLD or RMSProp-SGLD) feeds the WBIC estimator, which in turn yields the local learning coefficient, as well as the susceptibility estimator (susceptibilities) and the BIF estimator (Bayesian influence function).]

Sampler. Estimates \mathbb{E}_{\beta}[L_n(w)]. Posterior localization is implemented using a Gaussian prior. Tested empirically.

WBIC Estimator. \hat{\lambda}_n = n\beta(\mathbb{E}_{\beta}[L_n(w)] - L_n(w_0)), where the expectation is taken over a local posterior supported on a neighborhood of w_0 at inverse temperature \beta, and L_n(w) is the empirical loss on n samples. Asymptotically unbiased.

Local Learning Coefficient. A geometric invariant of the population loss L(w) at a local minimum w_0.

Basics of Sampling

Many quantities we care about can be expressed as expectations with respect to a localized posterior distribution on the parameter space of the model. Specifically, we want to take expectations with respect to the distribution

\begin{aligned} p(w) &\propto \exp(-n\beta L_n(w))\,\varphi(w; \gamma, w_0) \\ &\propto \exp\left(-n\beta L_n(w) - \tfrac{\gamma}{2} \|w - w_0\|^2 \right) \end{aligned}

where L_n(w) = \frac{1}{n} \sum_{i=1}^n \ell(X_i, w) is the empirical loss of a parameter w, and \varphi(w; \gamma, w_0) is a Gaussian prior with variance 1/\gamma centered at w_0. The term involving L_n(w) means the posterior reflects the geometry of the loss landscape, while the prior localizes the distribution to a neighborhood of a particular point w_0 in parameter space. In practice, w_0 is a relevant point on the model's training trajectory, such as the parameters of the model at the end of training.
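To make the target concrete, here is a minimal sketch of the unnormalized negative log-density of this localized posterior. This is our own illustrative code, not the devinterp implementation, and a toy quadratic loss stands in for L_n:

```python
import numpy as np

def neg_log_posterior(w, w0, empirical_loss, n, beta, gamma):
    """Unnormalized negative log-density of the localized posterior:
    n * beta * L_n(w) + (gamma / 2) * ||w - w0||^2."""
    localization = 0.5 * gamma * np.sum((w - w0) ** 2)
    return n * beta * empirical_loss(w) + localization

# Toy example: quadratic loss around the origin (a stand-in for L_n).
w0 = np.zeros(2)
loss = lambda w: 0.5 * np.sum(w ** 2)
val = neg_log_posterior(np.array([0.1, -0.2]), w0, loss, n=1000, beta=1.0, gamma=100.0)
```

Smaller values of this quantity correspond to higher posterior density; the two terms trade off fit to the loss landscape against staying near w_0.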

These posterior expectations are estimated using samplers: algorithms which explore the geometry of the loss landscape in a (relatively) efficient way. The idea is that, when it is working correctly, the amount of time a sampler spends in any given region is proportional to the posterior probability of that region. Then, by running the sampler for many steps we are able to approximate what we would get if we could sample independent points directly from the target distribution.

The samplers we consider resemble optimizers like SGD and Adam: they use mini-batch estimates of \nabla L_n(w) to push the sampler towards regions where L_n(w) is small (and hence the posterior density is high). Samplers differ from optimizers in that they inject appropriately scaled noise, which means they explore the region around a minimum rather than converging to it.

A sampler in a toy loss landscape. The goal when estimating quantities from SLT is for the sampler to explore the geometry around a point w_0. It should spend most of its time in the deep blue regions, and it should neither concentrate on w_0 nor explore different parts of parameter space. Adjust n\beta and \gamma (from the posterior distribution above) using the sliders.

Formally, a sampler produces a sequence w_0, w_1, \ldots, w_T of points in parameter space. Assuming the time the sampler spends in any given region is proportional to the posterior probability of that region, the expectation of a function F : \mathbb{R}^d \to \mathbb{R} is estimated as

\hat{F} = \frac{1}{T} \sum_{t=1}^T F(w_t).

Because these methods need to scale to billion-parameter models, we are limited to approximate stochastic gradient MCMC (SG-MCMC) sampling algorithms. These are samplers where a single sampling step has approximately the same cost as a training step of the same model. In this guide we focus on two: Stochastic Gradient Langevin Dynamics (SGLD; Welling & Teh (2011)), and RMSProp-preconditioned SGLD (RMSProp-SGLD; Li et al. (2016)).

Samplers

Below we explain two sampling algorithms: SGLD (which is conceptually simpler) and RMSProp-preconditioned SGLD (which is more complex, but our default recommendation for most settings).

SGLD

SGLD is stochastic gradient descent plus appropriately scaled Gaussian noise. This is the simplest SG-MCMC sampler; most sampling algorithms can be understood as extensions of SGLD. A sequence of points w_0, w_1, \ldots, w_T is generated as

w_{t+1} = w_t + \Delta w_t \qquad \Delta w_t = -\frac{\epsilon}{2} \left( \gamma(w_t - w_0) + n\beta \nabla L_{m, t}(w_t) \right) + \sqrt{\epsilon}\, \eta_t

where:

  • \epsilon > 0 is the sampler's step size;
  • \nabla L_{m, t}(w_t) is an estimate of the loss gradient using a batch of data of size m. That is, L_{m, t}(w) = \frac{1}{m} \sum_{i=1}^m \ell(X_{i, t}, w), where X_{i, t} denotes a data sample in the t-th batch;
  • \eta_0, \ldots, \eta_{T-1} is a sequence of iid standard normal vectors in \mathbb{R}^d;
  • n\beta > 0 and \gamma are the inverse temperature and localization respectively, coming from the posterior distribution.
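As a concrete reference, the SGLD update above can be sketched in a few lines of NumPy. This is a toy illustration with a quadratic loss standing in for a mini-batch gradient estimate, not the devinterp implementation:

```python
import numpy as np

def sgld_step(w, w0, grad_minibatch_loss, eps, n_beta, gamma, rng):
    """One SGLD step: drift from the localization and loss-gradient terms,
    plus Gaussian noise scaled by sqrt(eps)."""
    drift = gamma * (w - w0) + n_beta * grad_minibatch_loss(w)
    return w - 0.5 * eps * drift + np.sqrt(eps) * rng.standard_normal(w.shape)

# Toy loss L(w) = 0.5 ||w||^2, whose gradient is simply w.
rng = np.random.default_rng(0)
w0 = np.zeros(3)
w = w0.copy()
for _ in range(1000):
    w = sgld_step(w, w0, lambda v: v, eps=1e-3, n_beta=100.0, gamma=10.0, rng=rng)
```

Rather than converging to the minimum, the iterates hover in a noise ball around w_0 whose size is set by \epsilon, n\beta and \gamma.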

RMSProp-preconditioned SGLD

RMSProp-preconditioned SGLD is SGLD with an adaptive step size that is updated according to the loss gradient on a per-parameter basis. It has been shown to work better than SGLD and a number of other samplers in settings similar to where we are interested in applying SLT-based methods (Hitchcock & Hoogland 2025), and it is our default recommendation.

RMSProp-preconditioned SGLD is a specific version of "preconditioned SGLD", which refers to the idea of improving SGLD by rescaling parameter space to account for the different roles of parameters in the loss L_n(w). In general, a preconditioned SGLD algorithm has the update rule

w_{t+1} = w_t + \Delta w_t \qquad \Delta w_t = -\frac{\epsilon}{2} G(t) \left( \gamma(w_t - w_0) + n\beta \nabla L_{m, t}(w_t) \right) + \sqrt{\epsilon G(t)}\, \eta_t

where G(t) is a square matrix, \sqrt{\epsilon G(t)} denotes the entry-wise square root, and all other terms are as in SGLD. In RMSProp preconditioning, we track an exponential moving average of the squared gradient of each parameter

v_i(t+1) = \alpha v_i(t) + (1-\alpha) \left(\nabla L_{m, t}(w_t)\right)_i^2 \qquad \alpha \in (0, 1)

and take G(t) to be a diagonal matrix with G_{ii}(t) = \frac{1}{\sqrt{\hat{v}_i(t)} + a}, where \hat{v}_i(t) = \frac{1}{1 - \alpha^{t+1}} v_i(t). The quantity \epsilon_i(t) = \frac{\epsilon}{\sqrt{\hat{v}_i(t)} + a} is the step size for parameter i at step t. This is analogous to the RMSProp optimizer.
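The preconditioner can be sketched as follows. This is again our own illustrative code on a toy quadratic loss, with the bias-corrected EMA and stability parameter a as defined above:

```python
import numpy as np

def rmsprop_sgld_step(w, w0, v, t, grad_minibatch_loss, eps, n_beta, gamma,
                      alpha, a, rng):
    """One RMSProp-preconditioned SGLD step; v is the EMA of squared gradients."""
    g = grad_minibatch_loss(w)
    v = alpha * v + (1 - alpha) * g ** 2           # EMA of squared gradients
    v_hat = v / (1 - alpha ** (t + 1))             # bias correction
    G = 1.0 / (np.sqrt(v_hat) + a)                 # diagonal preconditioner G_ii(t)
    drift = gamma * (w - w0) + n_beta * g
    w = w - 0.5 * eps * G * drift + np.sqrt(eps * G) * rng.standard_normal(w.shape)
    return w, v

# Toy loss L(w) = 0.5 ||w||^2, whose gradient is w.
rng = np.random.default_rng(0)
w0 = np.zeros(3)
w, v = w0.copy(), np.zeros(3)
for t in range(1000):
    w, v = rmsprop_sgld_step(w, w0, v, t, lambda u: u, eps=1e-4,
                             n_beta=100.0, gamma=10.0, alpha=0.99, a=0.1, rng=rng)
```

Since the maximum per-parameter step size is \epsilon / a, choosing a very small can make early steps large while v is still near zero; we use a = 0.1 here purely for stability in this toy setting.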

Using Posterior Samples

Let w_0, w_1, \ldots, w_T be a sequence of points produced by a sampler. To estimate the expectation of a function F : \mathbb{R}^d \to \mathbb{R} with respect to the posterior, we replace the "spatial" average over \mathbb{R}^d with a "temporal" average over the sampling process. The idea is that the time the sampler spends in a given region should be proportional to the probability of that region under the posterior distribution. In the simplest case, we might consider

\mathbb{E}_{w\sim p}[F(w)] \approx \frac{1}{T} \sum_{t=1}^T F(w_t).

Burn-in

In practice, approximating \mathbb{E}_{w\sim p}[F(w)] in this way biases the estimate toward F(w_0). It is helpful to think of the initial phase of sampling as converging to some equilibrium distribution or "typical set". This is a good qualitative description of the behavior of samplers we observe in practice, though it is important to note that the theoretical story around this convergence is unclear and an active area of research. To account for the convergence phase of the sampler, we consider the approximation

\mathbb{E}_{w\sim p}[F(w)] \approx \frac{1}{T - B} \sum_{t=B+1}^{T} F(w_t)

where B is a hyperparameter specifying the burn-in period of the sampler. We give details on how to adjust B below.

Thinning

Another thing to consider is that the parameter w_{t+1} is highly correlated with the parameter w_t. In some situations it may make sense to compute F every k steps after burn-in, rather than every step (for example, if computing F is expensive and this allows you to increase T). This is called thinning. We can generalize the above as

\mathbb{E}_{w\sim p}[F(w)] \approx \frac{1}{|D|} \sum_{t \in D} F(w_t)

where D = \{B, B+k, B+2k, \ldots, T\}. In general we refer to a step which is used to estimate the expectation as a draw.
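Putting burn-in and thinning together, the estimator might look like this in code (an illustrative sketch with our own helper names; a synthetic scalar chain stands in for sampler output):

```python
import numpy as np

def posterior_estimate(samples, F, burnin, thin):
    """Estimate E[F(w)] from one chain: discard the first `burnin` steps,
    then use every `thin`-th remaining step as a draw."""
    draws = samples[burnin::thin]
    return np.mean([F(w) for w in draws])

# Toy "chain" of scalar parameters converging to an equilibrium around 1.0.
rng = np.random.default_rng(0)
chain = 1.0 - np.exp(-0.01 * np.arange(5000)) + 0.05 * rng.standard_normal(5000)
est = posterior_estimate(chain, lambda w: w, burnin=1000, thin=10)
```

With burnin=1000 the drifting early phase is excluded and est lands near the equilibrium value; with burnin=0 the converging phase would pull the estimate toward the starting value.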

Independent Chains

Finally, estimates can often be improved by repeating the sampling process multiple times, with independent sources of randomness each time. Each sequence of steps produced in this way is called a sampling chain. Each chain should use independent Gaussian noise and independently drawn data to estimate the loss gradient. Let C denote the number of chains and \hat{F}_c the estimate of \mathbb{E}_{w\sim p}[F(w)] produced by chain c. The overall estimate is

\mathbb{E}_{w\sim p}[F(w)] \approx \frac{1}{C} \sum_{c=1}^C \hat{F}_c

Terminology Summary

| Term | Description |
| --- | --- |
| Step | Any parameter w_t produced by the sampler. |
| Draw | A parameter which is used to estimate the expectation. |
| Chain | A sequence of steps produced by a sampler. Typically, multiple chains with independent noise, gradient data (used to estimate the loss gradient), and observable data (used to estimate F, if required) are started from the same parameter w_0. |

Summary of Hyperparameters

Before moving on to the next section we summarize all hyperparameters in one place. We have two sources of hyperparameters: those arising from the sampling process and those which control how we use samples to estimate FF. We further separate the sampling hyperparameters into ‘main’ sampling hyperparameters (which do not have sensible defaults and must always be tuned) and ‘secondary’ sampling hyperparameters (which have sensible defaults and are not always adjusted).

Sampling is implemented in the updated devinterp library (coming soon); the sample function can be accessed via from devinterp.slt.llc import sample. The argument names of hyperparameters in the sample function are given below.

Main Sampling Hyperparameters

| Parameter | devinterp | Notes |
| --- | --- | --- |
| Step size \epsilon | lr | Controls the size of steps taken by the sampler. Smaller is generally better, but decreasing \epsilon slows down convergence. In RMSProp-SGLD, per-parameter step sizes are set adaptively, so the role of \epsilon is less direct, but in practice it is adjusted in a similar way when tuning hyperparameters. |
| Inverse temperature n\beta | n_beta | Controls the influence of the gradient term. Must be set large enough that the loss landscape geometry influences sampling, but setting it too large can cause convergence issues or prevent the sampler from remaining in a neighborhood of w_0. |
| Localization \gamma | localization | Controls the localization strength. Must be set large enough to keep the sampler in a neighborhood of w_0. |

Secondary Sampling Hyperparameters

| Parameter | devinterp | Notes |
| --- | --- | --- |
| Batch size m | batch_size | The batch size used to compute the gradient of the loss. The batch size used in training is usually a good choice. |
| Preconditioner stability parameter a | rmsprop_eps | Controls the maximum size of a per-parameter step size, appearing in the expression \epsilon_i(t) = \frac{\epsilon}{\sqrt{\hat{v}_i(t)} + a}. RMSProp-SGLD only. |
| Preconditioner EMA decay rate \alpha | rmsprop_alpha | Controls how quickly the exponential moving average of the squared gradients responds, appearing in the expression v_i(t+1) = \alpha v_i(t) + (1-\alpha)\left(\nabla L_{m, t}(w_t)\right)_i^2. Generally set to \alpha = 0.99 and not adjusted. RMSProp-SGLD only. |

Sample Collection Hyperparameters

| Parameter | devinterp | Notes |
| --- | --- | --- |
| Total steps T | Set automatically based on num_burnin_steps, num_draws and num_steps_bw_draws. | The total number of steps for each sampling chain. More is better. |
| Number of burn-in steps B | num_burnin_steps | The number of steps before samples start being collected. Must be chosen large enough that the sampler has converged to the typical set before samples begin being collected. |
| Number of draws | num_draws | The number of draws collected per chain after burn-in. |
| Number of steps between draws k | num_steps_bw_draws | Controls how frequently samples are taken after burn-in. |
| Number of chains C | num_chains | The number of independent sampling chains. More is better. |

Hyperparameter Selection

The question of how best to choose hyperparameters for sampling is not resolved and remains an active and important area of research. There is, however, still much to be said about how to detect various failure modes and how to find hyperparameters efficiently. The purpose of the discussion below is to (a) provide a guide to sampling hyperparameter selection for those wishing to use tools like susceptibilities in their work, and (b) publicly document our current approach to hyperparameter selection.

The overall process is roughly as follows:

  1. Sweep over the primary sampling parameters \epsilon (step size), \gamma (localization) and n\beta (inverse temperature).
  2. If necessary, adjust secondary sampling parameters such as the batch size (for computing loss gradients) and sampler-specific parameters, and repeat step 1.
  3. For the chosen hyperparameters, choose the total number of steps, number of burn-in steps, and number of steps between draws.

We will now discuss this process in more detail. By default we suggest using RMSProp-SGLD, with initial values for the secondary sampling parameters as given in the table above.

Sampling Parameter Sweep

First, run sampling for several different values of \epsilon, \gamma and n\beta, in each case spanning several orders of magnitude. For example, below we use \epsilon \in \{10^{-6}, 10^{-5}, 10^{-4}\}, n\beta \in \{1, 10, 100\} and \gamma \in \{10, 100, 1000\}. We use loss traces as our primary tool for understanding the sampler's behavior. A loss trace is a plot of the sampling step t versus the loss L_{m, t}(w_t) at step t (note that this loss must be computed each step anyway, to compute the gradient term in the sampler). We are looking for loss traces that:

  • On average increase monotonically above the loss at w_0.
  • Level off quickly (i.e. don't continue to increase over the whole sampling run).
  • Don't have large spikes.
  • Have a signal: the average increase in loss over sampling is larger than the local variation within each chain.
  • Level off at approximately the same value across different chains.

Ensure that you are taking enough total steps to see the sampler behavior (consider increasing this if none of the loss traces are converging over several orders of magnitude of ϵ\epsilon), but otherwise don’t worry about the sample collection hyperparameters at this point: these can be chosen once the final sampling hyperparameters are found. Usually these sweeps can be done with a smaller number of chains to save on compute, if required.
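A sweep of this kind is just a nested loop over the three grids, recording a loss trace per configuration. The sketch below is our own illustration: a bare-bones SGLD chain on a toy quadratic loss stands in for a real model and sampler.

```python
import itertools
import numpy as np

def run_chain(eps, n_beta, gamma, steps=2000, dim=10, seed=0):
    """Toy SGLD chain on L(w) = 0.5 ||w||^2; returns the loss trace."""
    rng = np.random.default_rng(seed)
    w0 = np.zeros(dim)
    w = w0.copy()
    trace = np.empty(steps)
    for t in range(steps):
        drift = gamma * (w - w0) + n_beta * w          # grad L(w) = w
        w = w - 0.5 * eps * drift + np.sqrt(eps) * rng.standard_normal(dim)
        trace[t] = 0.5 * np.sum(w ** 2)
    return trace

grids = ([1e-6, 1e-5, 1e-4], [1, 10, 100], [10, 100, 1000])
traces = {cfg: run_chain(*cfg) for cfg in itertools.product(*grids)}
# Inspect each trace against the criteria above: monotone rise, quick
# level-off, no spikes, adequate signal-to-noise, agreement across chains.
```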

[Interactive figure: a 3×3 grid of loss traces over 5000 sampler steps, one column per n\beta \in \{1, 10, 100\} and one row per \gamma \in \{10, 100, 1000\}. Click a plot for details.]
| Loss trace behavior | Suggestion |
| --- | --- |
| Loss trace not levelling off. | Increase \epsilon and/or take more total steps. If this doesn't work, try increasing \gamma. |
| Loss becomes inf/nan. | First decrease \epsilon. Also consider whether n\beta and/or \gamma are too large. |
| Loss trace has spikes. | First decrease \epsilon. Also consider whether n\beta and/or \gamma are too large. |
| Loss trace trend is not non-decreasing (e.g. the loss goes below the loss at w_0). | Increase \gamma and/or decrease n\beta. |
| Low signal-to-noise ratio. | Increase n\beta, possibly while decreasing \gamma and \epsilon. Can also indicate a need to increase the batch size. |

Based on the loss traces, continue sweeping over intermediate values until you find acceptable hyperparameters. You are looking for parameters where the loss increases as quickly as possible to a stable value (to minimize the compute spent on burn-in) and does not exhibit any of the bad behavior discussed above. If you don't find any such parameters, consider adjusting the secondary sampling hyperparameters. Increasing the batch size can sometimes improve sampling quality overall, but has the trade-off of increasing the compute needed for sampling runs. Increasing a in RMSProp-SGLD can help when there seems to be no range of \epsilon values for which the loss does not diverge to infinity, but if set too large it limits the ability of the preconditioner to set per-parameter step sizes: the maximum per-parameter step size is \epsilon / a. You should not set a larger than 1.

As a rule of thumb, you should prefer parameters where n\beta is larger. We need n\beta to be large enough that the loss gradients affect sampling. It is not always obvious from the loss traces when n\beta is set too low (see TODO: ref).

Choosing Sample Collection Parameters

Having found primary and secondary sampling hyperparameters, we choose the sample collection parameters so that sufficiently many samples are collected after the sampler has reached equilibrium. The most important parameter here is the number of burn-in steps: it must be chosen large enough that no draws are taken before the sampler has reached the typical set. This can be checked by inspecting the loss trace. For the sake of the demonstration, we use a set of hyperparameters from the sweep above which converge more slowly; there are better hyperparameters for this model which allow less burn-in.

After choosing the number of burn-in steps, you need to run the chain long enough that adding more steps doesn’t meaningfully change your estimate of the sample mean of FF. A simple way to check this is to plot the running sample mean as a function of the number of steps and look for where the plot flattens out.
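The running-mean check is a one-liner with cumulative sums. Here synthetic values stand in for F(w_t) evaluated on the post-burn-in draws:

```python
import numpy as np

# Stand-in draws of F(w_t) after burn-in; in practice these come from the sampler.
rng = np.random.default_rng(0)
values = 1.0 + 0.1 * rng.standard_normal(2000)

# Running mean after t draws; plot this and look for where it flattens out.
running_mean = np.cumsum(values) / np.arange(1, len(values) + 1)
```

If the tail of running_mean is flat, adding further steps is unlikely to change the estimate much.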

It's also often worthwhile to apply thinning. To pick the spacing k between draws, compute the correlation between pairs of loss values k steps apart, and choose k large enough that this correlation is close to zero.
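One way to automate this choice is to take the smallest k at which the lag-k correlation of the post-burn-in loss values falls below some small threshold. A sketch (the 0.1 threshold is an arbitrary choice of ours; an AR(1) process mimics a sticky chain):

```python
import numpy as np

def lag_correlation(losses, k):
    """Pearson correlation between loss values k steps apart."""
    return np.corrcoef(losses[:-k], losses[k:])[0, 1]

def choose_thinning(losses, threshold=0.1, max_lag=500):
    """Smallest lag k at which the lag-k correlation falls below threshold."""
    for k in range(1, max_lag):
        if abs(lag_correlation(losses, k)) < threshold:
            return k
    return max_lag

# Toy losses with per-step correlation 0.9, so the lag-k correlation
# decays roughly like 0.9^k.
rng = np.random.default_rng(0)
x = np.zeros(10000)
for t in range(1, 10000):
    x[t] = 0.9 * x[t - 1] + rng.standard_normal()
k = choose_thinning(x)
```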

You should also run multiple chains. To decide how many you need, estimate the expectation of F separately from each chain. Let \hat{F}_c denote the estimate from chain c = 1, \ldots, C and look at the variation between the \hat{F}_c (for example by computing the sample standard deviation or interquartile range).

Comparing variation across and within chains can also be used to check convergence of the sampling chains themselves. The Gelman-Rubin statistic (Gelman & Rubin 1992)

\hat{R} = \sqrt{\frac{\frac{N-1}{N}(\text{within-chain variance}) + \frac{1}{N}(\text{between-chain variance})}{\text{within-chain variance}}}

formalizes this; if R^\hat{R} is not close to one then this is a signal that the chains are converging to different distributions.
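A sketch of computing \hat{R} from per-draw values of F across chains (our own illustrative implementation of the formula above, with N the number of draws per chain):

```python
import numpy as np

def gelman_rubin(chains):
    """R-hat from an array of shape (num_chains, N) of per-draw values."""
    chains = np.asarray(chains)
    m, n = chains.shape
    W = chains.var(axis=1, ddof=1).mean()        # within-chain variance
    B = n * chains.mean(axis=1).var(ddof=1)      # between-chain variance
    var_hat = (n - 1) / n * W + B / n
    return np.sqrt(var_hat / W)

rng = np.random.default_rng(0)
good = rng.standard_normal((4, 2000))                 # chains targeting the same distribution
bad = good + np.array([[0.0], [0.0], [3.0], [3.0]])   # two chains offset by 3
```

Here gelman_rubin(good) is close to one, while gelman_rubin(bad) is well above one, flagging that the chains disagree.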

Your choices about sample collection parameters are constrained by compute. From the perspective of accurately estimating the expectation of FF:

  • More chains are better
  • More steps are better
  • More burn-in steps can’t hurt (keeping the total number of draws the same)
  • More steps between draws can’t hurt (keeping the total number of draws the same)

Choosing Parameters for Multiple Weight Restrictions and Training Steps

Weight Restrictions

Instead of taking expectations with respect to the “full” posterior distribution — a distribution over all parameters in the model — we often need to estimate expectations with respect to “restricted” posteriors. The restricted posterior is obtained by treating most of the model weights as fixed values, and considering the loss only as a function of the weights within a component of the model (e.g. an attention head). The above procedure for sampling and hyperparameter selection is identical, except that only the weights within the chosen component are updated by the sampler.

Often multiple restricted posterior distributions are used together or compared, for example with susceptibilities or when comparing the restricted LLC of different attention heads. In this case, we recommend choosing a single set of hyperparameters for all weight restrictions of interest. Choosing sampling hyperparameters on a per-component basis may also be acceptable, but this needs to be investigated further.

Training Steps

It is often interesting to investigate how a quantity changes over training. As for weight restrictions, we recommend choosing a single set of hyperparameters for all training steps of interest. In practice, hyperparameter sweeps can be done at a small set of checkpoints distributed over training. When doing the main sampling run, the loss trace at every checkpoint should be recorded and checked as a diagnostic.

Verifying Hyperparameter Choices

Having chosen hyperparameters, ideally you should find a way to verify that the sampler is giving you meaningful information about the posterior. It is difficult to be prescriptive here because it depends on what you know about the model, but we give some examples below.

Check the Sampler Can Detect Known Structures

Suppose you are doing restricted posterior sampling on the attention heads of an LLM (e.g. to estimate susceptibilities) and you know that induction heads develop at a certain point over training. We would expect the LLC for an induction head to change (as a function of the training step) when the induction head develops. We should verify that the chosen sampler configuration is able to detect this change.

Check the Loss Gradient in the Sampler Affects the Estimated Quantities

One failure mode we have observed, which is not detectable by inspecting loss traces, is that for certain hyperparameter settings with n\beta > 0, the estimated quantities are the same as if n\beta = 0. The most comprehensive way to detect this failure mode is to re-run sampling with the same hyperparameter settings except with n\beta = 0 and compare the final results. There should be a qualitative difference between the results of sampling with n\beta > 0 and n\beta = 0.
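In a toy setting the check looks like this: run the same chain twice, once with the loss gradient switched on and once with n\beta = 0, and compare the resulting loss statistics. This is illustrative code of our own, with a quadratic loss standing in for the model:

```python
import numpy as np

def mean_sampled_loss(eps, n_beta, gamma, dim=20, steps=4000, seed=0):
    """Toy SGLD run on L(w) = 0.5 ||w||^2; mean loss over the second half."""
    rng = np.random.default_rng(seed)
    w0 = np.zeros(dim)
    w = w0.copy()
    losses = np.empty(steps)
    for t in range(steps):
        drift = gamma * (w - w0) + n_beta * w        # grad L(w) = w
        w = w - 0.5 * eps * drift + np.sqrt(eps) * rng.standard_normal(dim)
        losses[t] = 0.5 * np.sum(w ** 2)
    return losses[steps // 2:].mean()

with_grad = mean_sampled_loss(eps=1e-3, n_beta=100.0, gamma=10.0)
no_grad = mean_sampled_loss(eps=1e-3, n_beta=0.0, gamma=10.0)
# With n_beta = 0 the sampler ignores the loss landscape entirely, so the two
# runs should differ qualitatively; if they do not, the gradient term is not
# actually influencing sampling.
```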

Work with us

We're hiring Research Scientists, Engineers & more to join the team full-time.

Senior researchers can also express interest in a part-time affiliation through our new Research Fellows Program.

Acknowledgements

We thank Sandy Fraser for creating an earlier version of the figure illustrating chains, burn-in and draws.

Citation Information
Please cite as:
Rohan Hitchcock, Jesse Hoogland, George Wang, Andrew Gordon, "Guide for Sampling Hyperparameter Selection", 2026.
BibTeX Citation:
@article{hitchcock2026guide,
  title={Guide for Sampling Hyperparameter Selection},
  author={Rohan Hitchcock and Jesse Hoogland and George Wang and Andrew Gordon},
  year={2026}
}
References
  1. The Local Learning Coefficient: A Singularity-Aware Complexity Measure [link]. Edmund Lau, Zach Furman, George Wang, Daniel Murfet, Susan Wei, 2025. In Proceedings of The 28th International Conference on Artificial Intelligence and Statistics, Vol 258, pp. 244–252. PMLR.
  2. A widely applicable Bayesian information criterion. Sumio Watanabe, 2013. The Journal of Machine Learning Research, Vol 14(1), pp. 867–897.
  3. Linear response estimators for singular statistical models. Chris Elliott, Daniel Murfet, 2026. To appear.
  4. Bayesian Influence Functions for Hessian-Free Data Attribution [link]. Philipp Alexander Kreer, Wilson Wu, Maxwell Adam, Zach Furman, Jesse Hoogland, 2025.
  5. Structural Inference: Interpreting Small Language Models with Susceptibilities [link]. Garrett Baker, George Wang, Jesse Hoogland, Vinayak Pathak, Daniel Murfet, 2025.
  6. Bayesian learning via stochastic gradient Langevin dynamics. Max Welling, Yee W Teh, 2011. In Proceedings of the 28th International Conference on Machine Learning, pp. 681–688.
  7. Preconditioned stochastic gradient Langevin dynamics for deep neural networks. Chunyuan Li, Changyou Chen, David Carlson, Lawrence Carin, 2016. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol 30.
  8. Inference from iterative simulation using multiple sequences. Andrew Gelman, Donald B Rubin, 1992. Statistical Science, Vol 7(4), pp. 457–472. Institute of Mathematical Statistics.