Bayesian Influence Functions for Hessian-Free Data Attribution

Authors

Philipp Alexander Kreer
Timaeus, Technical University of Munich
Wilson Wu
University of Colorado Boulder
Maxwell Adam
Timaeus, University of Melbourne
Zach Furman
University of Melbourne
Jesse Hoogland
Timaeus
See Contributions

Publication Details

Published:
September 30, 2025

Access

Abstract

Classical influence functions face significant challenges when applied to deep neural networks, primarily due to non-invertible Hessians and high-dimensional parameter spaces. We propose the local Bayesian influence function (BIF), an extension of classical influence functions that replaces Hessian inversion with loss landscape statistics that can be estimated via stochastic-gradient MCMC sampling. This Hessian-free approach captures higher-order interactions among parameters and scales efficiently to neural networks with billions of parameters. We demonstrate state-of-the-art results on predicting retraining experiments.

Automated Conversion Notice

Warning: This paper was automatically converted from LaTeX. While we strive for accuracy, some formatting or mathematical expressions may not render perfectly. Please refer to the original ArXiv version for the authoritative document.

1 Introduction

Training data attribution (TDA) studies how training data shapes the behaviors of deep neural networks (DNNs)—a foundational question in AI interpretability and safetyA standard approach to TDA is influence functions (IF), which measure how models respond to infinitesimal perturbations in the training distribution (Cook, 1977; Cook & Weisberg, 1982). While elegant, influence functions rely on calculating the inverse Hessian and, therefore, break down for modern DNNs. Theoretically, DNNs have degenerate loss landscapes with non-invertible Hessians, which violate the conditions needed to define influence functions. Practically, for large models, the Hessian is intractable to compute directly. To mitigate these problems requires architecture-specific approximations that introduce structural biases (Martens & Grosse, 2015; Ghorbani et al., 2019; Agarwal et al., 2017; George et al., 2018).

We propose a principled, Hessian-free alternative grounded in Bayesian robustness. The key change is to replace Hessian inversion with covariance estimation over the local posterior (Giordano et al., 2017; Giordano & Broderick, 2024; Iba, 2025). This distributional approach naturally handles the degenerate loss landscapes of DNNs and bypasses the need to compute the Hessian directly. For non-singular models where the classical approach is valid, the BIF asymptotically reduces to the classical IF (Appendix A), which establishes the BIF as a natural generalization of the classical IF for modern deep learning.

Contributions.

We make the following contributions:

  • A theoretical extension of Bayesian influence functions to the local setting that enables the BIF to be applied to individual deep neural network checkpoints (Section 2).

  • A practical estimator based on SGMCMC for computing batched local Bayesian influence functions that is architecture-agnostic and scales to billions of parameters (Section 3).

  • Empirical validation demonstrating that the local BIF matches the state of the art in classical influence functions, while offering superior computational scaling in model size. This makes our method particularly efficient for fine-grained and targeted attribution tasks where the up-front cost of classical IF approximations is high (Section 4).

2 Theory

We first review classical influence functions (Section 2.1), then review Bayesian influence functions (Section 2.2), and finally propose our local extension (Section 2.3).

2.1 Classical Influence Functions

Classical influence functions quantify how a model would differ under an infinitesimal perturbation to its training data.

Setup.

We consider a training dataset 𝒟train={𝐳i}i=1n\mathcal{D}_{\text{train}}=\{{\mathbf{z}}_{i}\}_{i=1}^{n} and a model parameterized by 𝒘𝒲d{\bm{w}}\in\mathcal{W}\subset\mathbb{R}^{d}. We define the empirical risk Ltrain(𝒘)=i=1ni(𝒘)L_{\text{train}}({\bm{w}})=\sum_{i=1}^{n}\ell_{i}({\bm{w}}), where i(𝒘)=(𝐳i;𝒘)\ell_{i}({\bm{w}})=\ell({\mathbf{z}}_{i};{\bm{w}}) is the loss for sample 𝐳i{\mathbf{z}}_{i}. We assume LtrainL_{\text{train}} is continuously second-differentiable and that our training procedure finds parameters 𝒘𝒲{\bm{w}}^{*}\in\mathcal{W} at a local minimum, i.e., 𝒘Ltrain(𝒘)=0\nabla_{\bm{w}}L_{\text{train}}({\bm{w}}^{*})=0.

Influence on observables.

We want to predict how the value of an observable ϕ(𝒘):𝒲\phi({\bm{w}})\colon\mathcal{W}\to\mathbb{R} would change under a perturbation to the training data. In particular, we’re interested in predicting the response of a query sample’s loss ϕ(𝒘)=(𝐳j;𝒘)\phi({\bm{w}})=\ell({\mathbf{z}}_{j};{\bm{w}})) To model perturbation, we introduce importance weights 𝜷=(β1,,βn)\bm{\beta}=(\beta_{1},\ldots,\beta_{n}) and define the tempered risk Ltrain,𝜷(𝒘)=i=1nβii(𝒘)L_{\text{train},\,\bm{\beta}}({\bm{w}})=\sum_{i=1}^{n}\beta_{i}\ell_{i}({\bm{w}}). Assuming the loss Hessian is invertible, the implicit function theorem guarantees a neighborhood U𝒘𝒘U_{{\bm{w}}^{*}}\ni{\bm{w}}^{*} such that, for all 𝜷\bm{\beta} sufficiently close to 𝟏\bm{1}, there is a unique minimizer of the tempered risk in this neighborhood 𝒘(𝜷)=argmin𝒘U𝒘Ltrain,𝜷(𝒘){\bm{w}}^{*}(\bm{\beta})=\arg\min_{{\bm{w}}\in U_{{\bm{w}}^{*}}}L_{\text{train},\,\bm{\beta}}({\bm{w}}). Note that 𝒘(𝟏)=𝒘{\bm{w}}^{*}(\bm{1})={\bm{w}}^{*} and that the function 𝒘(){\bm{w}}^{*}(-) depends on the starting 𝒘{\bm{w}}^{*}; in this sense, the classical influence is naturally local to a choice of parameters 𝒘{\bm{w}}^{*}.

The classical influence of training sample 𝐳i{\mathbf{z}}_{i} on the observable ϕ\phi evaluated at the optimum is defined as the sensitivity of ϕ(𝒘(𝜷))\phi({\bm{w}}^{*}(\bm{\beta})) to the weight βi\beta_{i}:

IF(𝐳i,ϕ):=ϕ(𝒘(𝜷))βi|𝜷=𝟏\operatorname{IF}({\mathbf{z}}_{i},\phi):=\frac{\partial\phi({\bm{w}}^{*}({\bm{\beta}}))}{\partial\beta_{i}}\bigg\rvert_{{\bm{\beta}}=\bm{1}} (1)

Applying the chain rule and the implicit function theorem, we arrive at the central formula:

IF(𝐳i,ϕ)=𝒘ϕ(𝒘)𝑯(𝒘)1𝒘i(𝒘),\boxed{\operatorname{IF}({\mathbf{z}}_{i},\phi)=-\nabla_{\bm{w}}\phi({\bm{w}}^{*})^{\top}{\bm{H}}({{\bm{w}}^{*}})^{-1}\nabla_{{\bm{w}}}\ell_{i}({\bm{w}}^{*}),} (2)

where 𝑯(𝒘){\bm{H}}({{\bm{w}}^{*}}) is the Hessian of LtrainL_{\text{train}} evaluated at 𝒘{\bm{w}}^{*}.

Refer to caption
Figure 2: The per-token BIF captures semantic relationships in Pythia-2.8B. The posterior correlation (negative of the normalized BIF) between tokens is maximized for relationships like translations, alternate spellings, and synonyms.

2.2 Bayesian Influence Functions

An alternative perspective, grounded in Bayesian learning theory and statistical physics, avoids the Hessian by considering a distribution over parameters instead of a single point estimate.

Influence on observable expectations.

We obtain the Bayesian influence BIF(zi,ϕ)\operatorname{BIF}(z_{i},\phi) of sample 𝐳i{\mathbf{z}}_{i} on an observable ϕ\phi by replacing the point estimate ϕ(𝒘)\phi({\bm{w}}^{*}) in Equation 1 with an expectation value 𝔼train,𝜷[ϕ(𝒘)]\mathbb{E}_{\text{train},\,\bm{\beta}}[\phi({\bm{w}})]:

BIF(𝐳i,ϕ):=𝔼train,𝜷[ϕ(𝒘)]βi|𝜷=𝟏.\operatorname{BIF}({\mathbf{z}}_{i},\phi):=\frac{\partial\mathbb{E}_{\text{train},\,\bm{\beta}}[\phi({\bm{w}})]}{\partial\beta_{i}}\bigg\rvert_{{\bm{\beta}}=\bm{1}}. (3)

Here, 𝔼train,𝜷[ϕ(𝒘)]=ϕ(𝒘)p𝜷(𝒘𝒟train)𝑑𝒘\mathbb{E}_{\text{train},\,\bm{\beta}}[\phi({\bm{w}})]=\int\phi({\bm{w}})p_{{\bm{\beta}}}({\bm{w}}\mid\mathcal{D}_{\text{train}})\,d{\bm{w}} is an expectation over a tempered Gibbs measure p𝜷(𝒘𝒟train)exp(Ltrain,𝜷(𝒘))φ(𝒘)p_{{\bm{\beta}}}({\bm{w}}\mid\mathcal{D}_{\text{train}})\propto\exp(-L_{\text{train},\,\bm{\beta}}({\bm{w}}))\varphi({\bm{w}}) with prior φ(𝒘)\varphi({\bm{w}}). This is a tempered Bayesian posterior if the loss is a negative log-likelihood i(𝒘)=logp(𝐳i𝒘)\ell_{i}({\bm{w}})=-\log p({\mathbf{z}}_{i}\mid{\bm{w}}), which we assume is the case for the rest of the paper.

A standard result from statistical physics (see Baker et al. 2025) relates the derivative of the expectation to a covariance over the untempered (𝜷=𝟏\bm{\beta}=\bm{1}) posterior under mild regularity conditions:

BIF(𝐳i,ϕ)=Cov(i(𝒘),ϕ(𝒘)).\boxed{\operatorname{BIF}({\mathbf{z}}_{i},\phi)=-\mathrm{Cov}(\ell_{i}({\bm{w}}),\phi({\bm{w}})).} (4)

Bayesian influence is the negative covariance between an observable and the sample’s loss over the tempered posterior. In Section A.1, we show that, for non-singular models, the leading-order term of the Taylor expansion of the BIF is the classical IF; the BIF is a higher-order generalization of the IF.

2.3 Local Bayesian Influence Functions

Computing expectations over the global Bayesian posterior p(𝒘𝒟train)p({\bm{w}}\mid\mathcal{D}_{\text{train}}) is generally intractable for DNNs. Furthermore, standard DNN training yields individual checkpoints 𝒘{\bm{w}}^{*}, and we are often most interested in the influence local to this final trained model. Therefore, we adapt the BIF with a localization mechanism.

Following Lau et al. (2025), we define a localized Bayesian posterior by replacing the prior φ(𝒘)\varphi({\bm{w}}) with an isotropic Gaussian with precision γ\gamma centered at the parameters 𝒘{\bm{w}}^{*}:

pγ(𝒘𝒟train,𝒘)exp(i=1ni(𝒘)γ2𝒘𝒘22).p_{\gamma}({\bm{w}}\mid\mathcal{D}_{\text{train}},{\bm{w}}^{*})\propto\exp\left\lparen-\sum_{i=1}^{n}\ell_{i}({\bm{w}})-\frac{\gamma}{2}\lVert{\bm{w}}-{\bm{w}}^{*}\rVert_{2}^{2}\right\rparen. (5)

The local Bayesian influence function (local BIF) is defined as in Equation 4 but via a covariance over the localized Gibbs measure, indicated by the index γ\gamma:

BIFγ(𝐳i,ϕ)=Covγ(i(𝒘),ϕ(𝒘)).\boxed{\operatorname{BIF}_{\gamma}({\mathbf{z}}_{i},\phi)=-\mathrm{Cov}_{\gamma}(\ell_{i}({\bm{w}}),\phi({\bm{w}})).} (6)

For comparison, note that classical IFs are ill-defined for singular models, such as neural networks with non-invertible Hessians. A common practical remedy is to use a dampened Hessian (𝑯(𝒘)+γ𝑰)({\bm{H}}({{\bm{w}}^{*}})+\gamma{\bm{I}}). This is equivalent to adding an 2\ell_{2} regularizer centered at 𝒘{\bm{w}}^{*} to the loss, which is the same trick we use in defining BIFγ\operatorname{BIF}_{\gamma}. In Section A.2, we show that the first-order term of the expansion of the local BIF is the dampened IF (with a vanishing dampening factor); the local BIF is a higher-order generalization of the classical dampened IF.

3 Methodology

Computing the local BIF requires estimating the covariance Covγ(ϕ(𝒘),i(𝒘))\mathrm{Cov}_{\gamma}(\phi({\bm{w}}),\ell_{i}({\bm{w}})) under pγ(𝒘𝒟train,𝒘)p_{\gamma}({\bm{w}}\mid\mathcal{D}_{\text{train}},{\bm{w}}^{*}). Following Lau et al. (2025), in Section 3.1, we use stochastic gradient Langevin dynamics (SGLD; Welling & Teh 2011). In Section 3.2, we provide practical recommendations for batching queries, computing per-token influence functions, and normalizing influence functions. In Section 3.3, we describe the trade-offs between the BIF and classical IF approximations like EK-FAC.

3.1 SGLD-based Covariance Estimation

SGLD approximates Langevin dynamics with a stationary distribution pγ(𝒘𝒟train,𝒘)p_{\gamma}({\bm{w}}\mid\mathcal{D}_{\text{train}},{\bm{w}}^{*}) by updating with mini-batch gradients of the empirical risk Ltrain(𝒘)L_{\text{train}}({\bm{w}}) and the gradient of the localizing potential γ(𝒘𝒘)\gamma({\bm{w}}-{\bm{w}}^{*}). The update rule is:

𝒘t+1=𝒘tϵ2(nβmkt𝒘k(𝒘t)+γ(𝒘t𝒘))+𝒩(0,ϵ),{\bm{w}}_{t+1}={\bm{w}}_{t}-\frac{\epsilon}{2}\left(\frac{n\beta}{m}\sum_{k\in{\mathcal{B}}_{t}}\nabla_{\bm{w}}\ell_{k}({\bm{w}}_{t})+\gamma({\bm{w}}_{t}-{\bm{w}}^{*})\right)+\mathcal{N}(0,\epsilon),

where t{\mathcal{B}}_{t} is a stochastic mini-batch of mm samples, ϵ\epsilon is the step size, and β\beta is an inverse temperature (which puts us in the tempered Bayes paradigm).

To improve coverage of the distribution pγp_{\gamma}, we typically sample several independent SGLD chains. We collect TT draws {𝒘c,t}t=1T\{{\bm{w}}_{c,t}\}_{t=1}^{T} after an optional burn-in in each SGLD chain 1cC1\leq c\leq C, for a total of Ndraws=CTN_{\text{draws}}=CT draws. The required covariances Covγ(i,ϕ)\mathrm{Cov}_{\gamma}(\ell_{i},\phi) are then estimated using the standard sample covariance calculated from the aggregated sequences {(i(𝒘c,t),ϕ(𝒘c,t))}1cC,1tT\{(\ell_{i}({\bm{w}}_{c,t}),\phi({\bm{w}}_{c,t}))\}_{1\leq c\leq C,1\leq t\leq T}. See Section B.1 for further details and modifications from vanilla SGLD.

3.2 Practical Training Data Attribution

BIF between data points.

We focus on the Bayesian influence between a training example 𝐳i𝒟train{\mathbf{z}}_{i}\in\mathcal{D}_{\text{train}} and the loss of a query example 𝐳j𝒟query{\mathbf{z}}_{j}\in\mathcal{D}_{\text{query}}; that is, we set the observable to ϕ=(𝐳j;)\phi=\ell({{\mathbf{z}}_{j}};-) and compute BIF(𝐳i,𝐳j)=Covγ(i(𝒘),j(𝒘))\operatorname{BIF}({\mathbf{z}}_{i},{\mathbf{z}}_{j})=-\mathrm{Cov}_{\gamma}(\ell_{i}({\bm{w}}),\ell_{j}({\bm{w}})). Given the training set 𝒟train\mathcal{D}_{\text{train}} and a query set 𝒟query\mathcal{D}_{\text{query}}, we compute all pairwise Bayesian influences {BIF(𝐳i,𝐳j)𝐳i𝒟train,𝐳j𝒟query}\{\operatorname{BIF}({\mathbf{z}}_{i},{{\mathbf{z}}_{j}})\mid{\mathbf{z}}_{i}\in\mathcal{D}_{\text{train}},{{\mathbf{z}}_{j}}\in\mathcal{D}_{\text{query}}\} over the same draws from independent SGLD chains. At each step of each chain, we perform forward passes over both 𝒟train\mathcal{D}_{\text{train}} and 𝒟query\mathcal{D}_{\text{query}} to obtain losses over both sets (i(𝒘))𝐳i𝒟train𝒟query(\ell_{i}({\bm{w}}))_{{\mathbf{z}}_{i}\in\mathcal{D}_{\text{train}}\cup\mathcal{D}_{\text{query}}}. These forward passes are computed separately from the loss backward pass kBt𝒘k(𝒘t)\sum_{k\in B_{t}}\nabla_{\bm{w}}\ell_{k}({\bm{w}}_{t}) used in the SGLD update rule.

Refer to caption
Figure 3: BIF and EK-FAC show convergent validity on Inception-v1. For a given query image (left), our local BIF (center) and EK-FAC (right) identify similar or identical training images as most influential. See Appendix D.1 for more examples.

Batched evaluation.

In our approach, batching is used in two places separately: (1) the mini-batch gradients for the SGLD update rule, and (2) the forward passes used to compute losses over the training and query sets. This allows for scalable computation of the full BIF matrix 𝑩=(BIF(𝐳i,𝐳j))𝐳i𝒟train,𝐳j𝒟query{\bm{B}}=(\operatorname{BIF}({\mathbf{z}}_{i},{{\mathbf{z}}_{j}}))_{{\mathbf{z}}_{i}\in\mathcal{D}_{\text{train}},{{\mathbf{z}}_{j}}\in\mathcal{D}_{\text{query}}}.

Per-token Bayesian influences.

In the autoregressive language modeling setting, each example 𝐳i{\mathbf{z}}_{i} is a sequence of tokens 𝐳i=(𝐳i,1,,𝐳i,S){\mathbf{z}}_{i}=({\mathbf{z}}_{i,1},\ldots,{\mathbf{z}}_{i,S}) of length SS. The loss at example 𝐳i{\mathbf{z}}_{i} then decomposes as

i(𝒘)=s=2Slogp(𝐳i,s𝐳i,1,𝐳i,s1)=:s=2Si,s(𝒘).\ell_{i}({\bm{w}})=-\sum_{s=2}^{S}\log p({\mathbf{z}}_{i,s}\mid{\mathbf{z}}_{i,1},\ldots{\mathbf{z}}_{i,s-1})=:\sum_{s=2}^{S}\ell_{i,s}({\bm{w}}).

The BIF can be easily extended to this setting: for example, the Bayesian influence of the ssth token of sequence ii on the loss at the ss^{\prime}th token of sequence jj is BIF(𝐳i,s,𝐳j,s)=Covγ(i,s(𝒘),j,s(𝒘))\operatorname{BIF}({\mathbf{z}}_{i,s},{\mathbf{z}}_{j,s^{\prime}})=-\mathrm{Cov}_{\gamma}(\ell_{i,s}({\bm{w}}),\ell_{j,s^{\prime}}({\bm{w}})). In our language model experiments, we compute all such pairwise per-token influences, resulting in a S|𝒟train|×S|𝒟query|S\lvert\mathcal{D}_{\text{train}}\rvert\times S\lvert\mathcal{D}_{\text{query}}\rvert BIF matrix. As we describe in Section B.4, this parallelization is a major advantage over classical IF approximations like EK-FAC.

Normalized influence as correlations.

Raw covariance scores can be dominated by high-variance data points. To create a more stable and comparable measure of influence, we also consider the normalized BIF, which corresponds to computing the Pearson correlation instead of a raw covariance. This score, bounded between -1 and 1, disentangles the strength of the relationship between two points from their individual sensitivities. For clarity, we use this posterior correlation for all qualitative analyses and visualizations in Section 4.

3.3 Comparison to Classical IF Approximations

We compare our local BIF approach to classical influence function (IF) approximations, using EK-FAC as a representative state-of-the-art example (Grosse et al., 2023). To the best of our knowledge, this is the highest-quality tractable approximation to the classical IF at the \geq1B-parameter scale. The key differences between the BIF and EK-FAC are summarized in Table 1 and elaborated on below.

Time complexity.

Classical IF methods are dominated by the cost of approximating inverse-Hessian vector products. Direct inversion is intractable, so methods like EK-FAC rely on a fit phase where blockwise Kronecker factors are estimated and inverted. The main bottleneck is the eigendecomposition or inversion of per-layer covariance matrices, which scales cubically with the layer width (O(d3)O(d_{\ell}^{3}) per block). Once fit, scoring reduces to repeated matrix–vector solves, but still requires recomputing gradients for each query–train pair, with total complexity O(qnP)O(qnP) where PP is the per-vector solve cost. Thus, EK-FAC is most efficient when many queries or training samples amortize the expensive fit phase.

The local BIF, by contrast, has no structural fit cost. The main bottleneck is running forward passes over the entire train and query datasets at each SGLD draw, with overall complexity scaling as O(Ndraws(n+q)dtotal)O(N_{\text{draws}}(n+q)d_{\text{total}}).

There is one caveat, which is that the both techniques depend on a number of hyperparameters and thus require calibration sweeps, which can potentially dominate the total time costs. However, we found that EK-FAC works well with the provided defaults, and, in Section B.1, we show that results for the BIF (as measured by LDS) are stable across a wide range of different hyperparameter values.

In short:

  • Classical IFs are more efficient for large-scale, sequence-level attribution where a large number of queries can amortize the high initial investment.

  • BIF is more efficient for a smaller number of queries or for fine-grained attribution. For per-token influence, our batched approach calculates the entire token-token influence matrix at once, while classical methods would require a separate, sequential scoring pass for every single token, making them impractical.

Memory complexity.

Hessian-based methods often require storing structural components of the model, such as the Kronecker factors and eigenbases in EK-FAC, with memory usage scaling with layer dimensions (O(l(din,l2+dout,l2))O(\sum_{l}(d_{\text{in},l}^{2}+d_{\text{out},l}^{2}))). For models with large hidden dimensions, this can be substantial. The local BIF’s memory usage is dominated by storing the loss traces (O(Ndraws(n+q))O(N_{\text{draws}}(n+q))). Alternatively, it is also possible to use an online covariance estimator for BIF with memory usage O((n+q)2)O((n+q)^{2}), which is more efficient when NdrawsN_{\text{draws}} is larger than the total number of data points.

Sources of error.

Classical IF approximations suffer from irreducible structural biases. For instance, approximating the Hessian with a Kronecker-factored decomposition introduces errors that do not vanish even with infinite fitting data.

In principle, the BIF provides an unbiased estimator of influence under its target distribution that improves with the number of total draws NdrawsN_{\text{draws}}. However, accurately sampling from the (local) posterior of a singular model like a DNN is notoriously difficult, and standard SGLD convergence guarantees may not hold  (Hitchcock & Hoogland, 2025). This can introduce a systematic sampling bias. Another possible source of error is that we currently lack a rigorous understanding of how to choose hyperparameters like the inverse temperature (β\beta) and localization strength (γ\gamma), which are part of the definition of the local posterior being analyzed (see Section B.1).

Architectural flexibility.

Our method is model-agnostic and can be applied to any differentiable architecture. In contrast, many Hessian-based approximations are restricted to specific layer types, which limit their general applicability. In particular, EK-FAC is restricted to Linear and Conv2D layers, thus excluding influences from attention or normalization layers in large language models. If desired (for example, to achieve a closer comparison to EK-FAC), it is possible to restrict the BIF to a subset of weights, see Section B.1.

Table 1: BIF vs. EK-FAC. Comparison of time/space complexity and estimation quality for local BIF and EK-FAC. Here dtotald_{\text{total}} is the number of parameters, nn the training set size, qq the query set size, NdrawsN_{\text{draws}} the total SGLD draws, and NfitN_{\text{fit}} the samples used to fit EK-FAC factors. The EK-FAC scoring cost assumes training gradients are recomputed.
Axis Local BIF EK-FAC
Time Complexity Score: O(Ndraws(n+q)dtotal)O(N_{\text{draws}}(n+q)d_{\text{total}}) Fit: O(Nfitdtotal+l(din,l3+dout,l3))O(N_{\text{fit}}d_{\text{total}}+\sum_{l}(d_{\text{in},l}^{3}+d_{\text{out},l}^{3}))
(No fit phase) Score: O(nqdtotal)O(nqd_{\text{total}})
Memory (extra) O(Ndraws(n+q))O(N_{\text{draws}}(n+q)) for loss traces O(l(din,l2+dout,l2))O(\sum_{l}(d_{\text{in},l}^{2}+d_{\text{out},l}^{2})) for factors
Sources of Error Finite sampling (NdrawsN_{\text{draws}}) Finite sampling (NfitN_{\text{fit}})
SGLD bias/hyperparameters (Ncal.N_{\text{cal.}}) Structural bias (Kronecker, Fisher)
Architecture Any differentiable model Linear and Conv2D layers

4 Results

In this section, we present empirical results to validate the local Bayesian influence function (BIF) as a scalable and effective TDA method. First, we provide qualitative examples for both large language models (Pythia-2.8B) and vision models (Inception-V1) to build intuition. Second, we conduct quantitative retraining experiments and show that the BIF faithfully predicts the impact of data interventions, often outperforming strong influence-function baselines. Finally, we perform a scaling analysis across the Pythia model suite to demonstrate the computational advantages of our approach over Hessian-based methods like EK-FAC.

4.1 Visualizing the BIF

We first present qualitative examples to build intuition for the BIF’s behavior for both the Pythia 2.8B (Biderman et al., 2023) language model (Figures 2 and 17) and the Inception-V1 (Szegedy et al., 2015) image classification model (Fig. 3). As described in Section 3.2, we use the normalized BIF for both (i.e., correlation functions). See Appendices B and D for more details.

Image classification.

Figure 3 compares the highest-influence samples identified by BIF and EK-FAC for Inception-v1. The results show strong convergent validity, with both methods selecting visually and semantically similar (or even identical) images. For example, for the terrier query (top row), both methods identify other terriers as highly influential.

Per-token language attribution.

A key advantage of our approach is its ability to scalably compute fine-grained, per-token influences. As shown in Figure 2 on Pythia-2.8B, the per-token BIF detects semantic similarities between tokens. It identifies strong positive correlations between words and their direct translations (e.g., ‘She’ \leftrightarrow ‘elle’), numbers and spellings (e.g., ‘3’ \leftrightarrow ‘three’), and conceptually related words (e.g., ‘objectives’ \leftrightarrow ‘Maxim[izing]’, ‘goals’, ‘motives’).

Refer to caption
Figure 4: Bayesian influence functions (BIF) vs. classical influence function approximations (EK-FAC, TRAK, GradSim) on predicting retraining experiments measured by the linear datamodeling score (LDS). We vary the size of the query dataset and full dataset according to αattribution\alpha_{\text{attribution}}, then retrain on random subsets of αretrain\alpha_{\text{retrain}} samples. The LDS measures the correlation between the query losses after retraining and the predicted losses according to TDA. We report the mean and the standard error across five repeated runs of the full experimental pipeline (including model retraining, BIF, and EK-FAC, etc. computation) with fixed hyperparameters but distinct initial seeds. The BIF consistently matches EK-FAC, which is SOTA. The BIF slightly underperforms EK-FAC for larger datasets (but within the margin of error) and slightly outperforms EK-FAC for smaller datasets. Both EK-FAC and BIF consistently outperform GradSim and TRAK.

4.2 Retraining Experiments

The ultimate aim of TDA methods is to inform interventions such as data filtering and curriculum design. Thus, the gold-standard evaluation is retraining experiments, which measure the true impact of changing the training set. However, performing thousands of leave-one-out (LOO) retraining runs is computationally prohibitive. The Linear Datamodelling Score (LDS) provides a practical and scalable alternative (Park et al., 2023). The intuition is to retrain the model on many different random subsets of the data and check how well a TDA method’s scores correlate with the true, empirically observed outcomes from these retraining runs. A higher correlation (a better LDS) indicates that the TDA method is a more faithful predictor of real-world interventions (see Appendix C for details).

Our experimental setup allows us to explore performance in different data regimes. From a full training set of size nn, we first identify an “attribution set” of size nattribution=αattributionnn_{\text{attribution}}=\alpha_{\text{attribution}}\cdot n containing the training points whose influences we will compute along with a fixed set of qq queries. The LDS is then calculated by retraining on numerous smaller subsets, each of size nretrain=αretrainnattributionn_{\text{retrain}}=\alpha_{\text{retrain}}\cdot n_{\text{attribution}}, drawn from this attribution set. We use the full dataset of size nn both to fit EK-FAC’s Hessian components on the full dataset of size nn and to draw the BIF’s SGLD minibatch gradients.

Our findings reveal a compelling trade-off between methods. The performance of all TDA methods improves as the attribution set size (nattributionn_{\text{attribution}}) decreases. In the scenario where the retrain subset size (nretrainn_{\text{retrain}}) is small, removing a few points creates a larger relative change in the dataset. We find that in this small-model, high-variance regime, the local BIF consistently outperforms EK-FAC, achieving a higher LDS.

In these experiments, EK-FAC is around five times faster than the BIF. This advantage is largely due to the small model sizes (2×106\sim 2\times 10^{6} parameters). As we expect (see Section 3.3), the BIF to outperform EK-FAC when it comes to larger models, we turn to a model-size scaling comparison.

4.3 Scaling Analysis

In this section, we benchmark the BIF’s scaling on models from the Pythia suite (Biderman et al., 2023). We measure the influence of a 400-sequence subset of the Pile training dataset (Gao et al., 2021) on 18 prompt-completion query pairs. As in Grosse et al. (2023), each query sequence is split into a prompt and completion 𝐳j=(𝐳j,prompt,𝐳j,comp){{\mathbf{z}}_{j}}=({\mathbf{z}}_{j,\text{prompt}},{\mathbf{z}}_{j,\text{comp}}); each observable is then a per-token loss ϕ𝐳j,s(𝒘)=p(𝐳comp,s𝐳prompt;𝒘)\phi_{{{\mathbf{z}}_{j}},s}({\bm{w}})=-p({\mathbf{z}}_{\text{comp},s}\mid{\mathbf{z}}_{\text{prompt}};{\bm{w}}). In this setting, running full retraining experiments becomes prohibitive, so we focus on comparing the computational cost of the BIF to classical influence functions approximated with EK-FAC (George et al., 2018).

See Fig. 5 for benchmark results. For the choice of SGLD hyperparameters we use (2k total draws, or 2.5x fewer than in Figure 2), we observe that BIF scales better than EK-FAC in evaluation time. Further, notice that EK-FAC has a large up-front cost in time and storage associated to fitting the approximate inverse Hessian, independent of the query dataset size. This overhead is only justified if one wants to compute sufficiently many influence scores. See Section B.3 for further experiment details and Section D.2 for a direct comparison of the results.

Refer to caption
Figure 5: Scaling comparison of BIF and EK-FAC across model sizes of the Pythia model suite. (Left) Evaluation time, excluding the tokenization time. (Right) The node’s (4xA100) peak GPU RAM usage. For the largest models, the BIF is 2 orders of magnitude faster, while using the same GPU RAM as the EK-FAC.

Influence functions and training data attribution.

Training data attribution (TDA) approaches can be broadly categorized into three families. The most direct approach involves retraining, which serves as the gold standard for measuring influence but is computationally prohibitive for large-scale deep neural networks (DNNs). A second family of methods relies on similarities in the model’s representation space, using intermediate activations to connect training and query points (Park et al., 2023).

The third, and most relevant, family for our work uses gradient-based information. A prominent example is the classical influence function, a well-studied technique from robust statistics (Hampel, 1974; Cook, 1977; Cook & Weisberg, 1982). Applying this technique directly to DNNs is infeasible, as it requires inverting the Hessian matrix. Consequently, much prior work has focused on developing tractable approximations to the inverse-Hessian-vector product (Koh & Liang, 2020; Grosse et al., 2023; Park et al., 2023). Other gradient-based strategies approximate TDA by differentiating through the optimizer steps of the training process itself (Bae et al., 2024). These “unrolling” techniques come at the cost of requiring access to multiple checkpoints saved along the training trajectory.

Bayesian influence functions.

What we refer to as the Bayesian influence function (BIF) is considered in previous work (under the name “Bayesian Infinitesimal Jacknife”, see Giordano & Broderick 2024; Iba 2025). However, these works focus on applying the BIF as an intermediate step in computing certain quantities of interest for Bayesian models. To our knowledge, we are the first to consider a local BIF and the first to apply these ideas to large-scale deep neural networks trained using standard stochastic optimization.

Singular learning theory and developmental interpretability.

Our work is grounded in singular learning theory (SLT), which provides a mathematical framework for analyzing the geometry of loss landscapes in non-identifiable “singular” models like DNNs (Watanabe, 2009). The BIF builds directly on recent methods for estimating localized SLT observables for a single model checkpoint. Specifically, Lau et al. (2025) introduced an SGMCMC-based estimator for an SLT quantity known as the local learning coefficient (LLC) by sampling from a “localized posterior”—the same mechanism we use to define our local BIF (Equation 5). Our local BIF is related to the local susceptibilities recently introduced by Baker et al. (2025). Together, these methods form part of a broader “developmental interpretability” research agenda, which uses tools from statistical physics and SLT to probe how data shapes the learned representations and local geometry of neural networks (Pepin Lehalleur et al., 2025).

6 Discussion & Conclusion

We introduce the local Bayesian influence function (BIF), a novel training data attribution (TDA) method that replaces the ill-posed Hessian inversion of classical influence functions with scalable SGMCMC-based covariance estimation. Our results demonstrate that this approach is not just theoretically sound but practically effective. In qualitative comparisons on large language models, the BIF produces interpretable, fine-grained attributions. Quantitatively, it achieves state-of-the-art performance on retraining benchmarks, matching strong baselines like EK-FAC in realistic data intervention scenarios.

Advantages.

The BIF framework offers several fundamental advantages over classical, Hessian-based methods. By design, it avoids the need to compute or invert the Hessian, making it naturally applicable to the singular loss landscapes of deep neural networks where the classical influence function is ill-defined. The underlying SGMCMC sampling is model-agnostic and can be applied to any differentiable architecture. Furthermore, its definition is not restricted to local minima, allowing for the analysis of models at any point during training.

Limitations and practical trade-offs.

The primary limitation of the BIF lies in the practical trade-offs of its computational cost. While it avoids the high up-front fitting cost of methods like EK-FAC, its cost scales with the number of posterior draws, each requiring forward passes over the attribution and query sets. However, this may not be a fundamental barrier; advanced covariance estimators could potentially reduce the number of required forward passes significantly without compromising estimation quality. Additionally, the method’s performance is sensitive to the hyperparameters of the SGLD sampler (ϵ\epsilon, γ\gamma, β\beta) and the total number of posterior draws, and this dependence is still not fully understood.

Future directions.

Our work opens several promising avenues for future research. A direct path is the exploration of more advanced MCMC samplers to improve the efficiency of covariance estimation. Furthermore, the role of the BIF’s hyperparameters can be explored further; the localization strength γ\gamma and inverse temperature β\beta can be viewed not just as parameters to be tuned, but as analytical tools to probe influence at different scales and resolutions of the loss landscape. Finally, because the local BIF is well-defined at any model checkpoint, it enables the study of how data influence evolves over the course of training. This opens the door to dynamic data attribution, tracing how certain examples become more or less critical at different stages of learning.

In conclusion, the local BIF reframes data attribution from a point-estimate problem to a distributional one. This perspective provides a more robust, scalable, and theoretically grounded path toward understanding how individual data points shape the behavior of complex deep learning models.

Appendix

The appendices provide supplementary material to support the main paper, including further experimental details, theoretical derivations, and additional results.

  • Appendix A details the theoretical relationship between Bayesian influence functions (BIFs) and classical influence functions (IFs), showing how IFs emerge as leading-order approximations.

  • Appendix B provides further experimental details, including the setup for comparing local BIF against EK-FAC (Section B.3) and the specifics of the SGLD estimator presented in Algorithm 1.

  • Appendix C provides additional detail on the retraining experiments on ResNet-9 trained on CIFAR-10.

  • Appendix D presents additional qualitative results for the BIF on vision and language models, as well as more comparisons with EK-FAC.

Appendix A Relating Bayesian and Classical Influence Functions

A.1 Relating the BIF and undampened IFs

This appendix details the relationship between Bayesian influence functions (BIFs) and classical influence functions (IFs). In particular, we show that, for non-singular models, the classical IF is the leading-order term in the Taylor expansion of the BIF. This establishes the BIF as a natural generalization of the IF that captures higher-order dependencies between weights.

Let 𝒘{\bm{w}}^{*} be a local minimum. In this section, all gradients and Hessians are evaluated at 𝒘{\bm{w}}^{*}; thus, to reduce notational clutter, we omit the dependence on 𝒘{\bm{w}}. For any function f(𝒘)f({\bm{w}}), we denote its gradient at 𝒘{\bm{w}}^{*} as 𝒈f=𝒘f(𝒘){\bm{g}}_{f}=\nabla_{\bm{w}}f({\bm{w}}^{*}) and its Hessian as 𝑯f=𝒘2f(𝒘){\bm{H}}_{f}=\nabla_{\bm{w}}^{2}f({\bm{w}}^{*}). In particular, 𝒈ϕ=𝒘ϕ(𝒘){\bm{g}}_{\phi}=\nabla_{\bm{w}}\phi({\bm{w}}^{*}) and 𝑯ϕ=𝒘2ϕ(𝒘){\bm{H}}_{\phi}=\nabla_{\bm{w}}^{2}\phi({\bm{w}}^{*}) for an observable ϕ(𝒘)\phi({\bm{w}}); we also abbreviate 𝒈i=𝒘i(𝒘){\bm{g}}_{i}=\nabla_{\bm{w}}\ell_{i}({\bm{w}}^{*}) and 𝑯i=𝒘2i(𝒘){\bm{H}}_{i}=\nabla_{\bm{w}}^{2}\ell_{i}({\bm{w}}^{*}) for a per-sample loss i(𝒘)\ell_{i}({\bm{w}}). The total Hessian of the empirical risk Ltrain(𝒘)=k=1nk(𝒘){L_{\text{train}}}({\bm{w}})=\sum_{k=1}^{n}\ell_{k}({\bm{w}}) at 𝒘{\bm{w}}^{*} is denoted 𝑯=k=1n𝑯k{\bm{H}}=\sum_{k=1}^{n}{\bm{H}}_{k}.

The Bayesian influence function (BIF) for the effect of sample 𝐳i{\mathbf{z}}_{i} on an observable ϕ\phi is given by (see Equation 4):

BIF(𝐳i,ϕ)=Covp(𝒘𝒟train)(ϕ(𝒘),i(𝒘)),\operatorname{BIF}({\mathbf{z}}_{i},\phi)=-\mathrm{Cov}_{p({\bm{w}}\mid\mathcal{D}_{\text{train}})}(\phi({\bm{w}}),\ell_{i}({\bm{w}})), (7)

where the covariance is taken over the posterior p(𝒘𝒟train)exp(Ltrain(𝒘))φ(𝒘)p({\bm{w}}\mid\mathcal{D}_{\text{train}})\propto\exp(-{L_{\text{train}}}({\bm{w}}))\varphi({\bm{w}}), with φ(𝒘)\varphi({\bm{w}}) being a prior. This definition is exact and makes no assumptions about the form of ϕ(𝒘)\phi({\bm{w}}), i(𝒘)\ell_{i}({\bm{w}}), or p(𝒘𝒟train)p({\bm{w}}\mid\mathcal{D}_{\text{train}}).

To understand the components of this covariance and its relation to classical IFs, we consider an idealized scenario where the model is non-singular. Under this strong assumption, which does not hold for deep neural networks (Wei et al., 2023), the posterior p(𝒘𝒟train)p({\bm{w}}\mid\mathcal{D}_{\text{train}}) can be approximated by a Laplace approximation around 𝒘{\bm{w}}^{*}:

p(𝒘𝒟train)𝒩(𝒘,𝑯1).p({\bm{w}}\mid\mathcal{D}_{\text{train}})\approx\mathcal{N}({\bm{w}}^{*},{\bm{H}}^{-1}). (8)

The Bernstein–von Mises theorem states that, due to the model’s regularity, the posterior distribution converges in total variation distance to the Laplace approximation as the training dataset size nn approaches infinity.

Let Δ𝒘=𝒘𝒘\Delta{\bm{w}}={\bm{w}}-{\bm{w}}^{*}. Assuming analyticity, we can express ϕ(𝒘)\phi({\bm{w}}) and i(𝒘)\ell_{i}({\bm{w}}) using their full Taylor series expansions around 𝒘{\bm{w}}^{*}:

ϕ(𝒘)\displaystyle\phi({\bm{w}}) =ϕ(𝒘)+𝒈ϕΔ𝒘+12Δ𝒘T𝑯ϕΔ𝒘+k=31k!Dkϕ(𝒘)[Δ𝒘,,Δ𝒘],\displaystyle=\phi({\bm{w}}^{*})+{\bm{g}}_{\phi}^{\top}\Delta{\bm{w}}+\frac{1}{2}\Delta{\bm{w}}^{T}{\bm{H}}_{\phi}\Delta{\bm{w}}+\sum_{k=3}^{\infty}\frac{1}{k!}D^{k}\phi({\bm{w}}^{*})[\Delta{\bm{w}},\dots,\Delta{\bm{w}}], (9)
i(𝒘)\displaystyle\ell_{i}({\bm{w}}) =i(𝒘)+𝒈iΔ𝒘+12Δ𝒘T𝑯iΔ𝒘+k=31k!Dki(𝒘)[Δ𝒘,,Δ𝒘],\displaystyle=\ell_{i}({\bm{w}}^{*})+{\bm{g}}_{i}^{\top}\Delta{\bm{w}}+\frac{1}{2}\Delta{\bm{w}}^{T}{\bm{H}}_{i}\Delta{\bm{w}}+\sum_{k=3}^{\infty}\frac{1}{k!}D^{k}\ell_{i}({\bm{w}}^{*})[\Delta{\bm{w}},\dots,\Delta{\bm{w}}], (10)

where Dkf(𝒘)[Δ𝒘,,Δ𝒘]D^{k}f({\bm{w}}^{*})[\Delta{\bm{w}},\dots,\Delta{\bm{w}}] denotes the kk-th order differential of ff at 𝒘{\bm{w}}^{*} applied to kk copies of Δ𝒘\Delta{\bm{w}}.

The covariance under this Gaussian (Laplace) approximation, denoted Cov𝒩\mathrm{Cov}_{\mathcal{N}}, then involves covariances between all pairs of terms from these two expansions:

Cov𝒩(ϕ(𝒘),i(𝒘))=k=1m=1Cov𝒩(Termk[ϕ],Termm[i]),\mathrm{Cov}_{\mathcal{N}}(\phi({\bm{w}}),\ell_{i}({\bm{w}}))=\sum_{k=1}^{\infty}\sum_{m=1}^{\infty}\mathrm{Cov}_{\mathcal{N}}\left(\text{Term}_{k}[\phi],\text{Term}_{m}[\ell_{i}]\right), (11)

where Termk[f]\text{Term}_{k}[f] is the kk-th order term in the Taylor expansion of f(𝒘)f({\bm{w}}) in powers of Δ𝒘\Delta{\bm{w}}. For Δ𝒘𝒩(0,𝑯1)\Delta{\bm{w}}\sim\mathcal{N}(0,{\bm{H}}^{-1}), the leading terms are:

  • Covariance of linear terms (k=1,m=1k=1,m=1):

    Cov𝒩(𝒈ϕTΔ𝒘,𝒈iΔ𝒘)=𝒈ϕ𝑯1𝒈i.\mathrm{Cov}_{\mathcal{N}}({\bm{g}}_{\phi}^{T}\Delta{\bm{w}},{\bm{g}}_{i}^{\top}\Delta{\bm{w}})={\bm{g}}_{\phi}^{\top}{\bm{H}}^{-1}{\bm{g}}_{i}.
  • Covariance of quadratic terms (k=2,m=2k=2,m=2):

    Cov𝒩(12(Δ𝒘)𝑯ϕΔ𝒘,12Δ𝒘𝑯iΔ𝒘)=12tr(𝑯ϕ𝑯1𝑯i𝑯1).\mathrm{Cov}_{\mathcal{N}}\left(\frac{1}{2}(\Delta{\bm{w}})^{\top}{\bm{H}}_{\phi}\Delta{\bm{w}},\frac{1}{2}\Delta{\bm{w}}^{\top}{\bm{H}}_{i}\Delta{\bm{w}}\right)=\frac{1}{2}\operatorname{tr}({\bm{H}}_{\phi}{\bm{H}}^{-1}{\bm{H}}_{i}{\bm{H}}^{-1}).

    (Using Isserlis’ theorem for moments of Gaussians).

  • Cross-terms between odd and even order terms (e.g., k=1,m=2k=1,m=2) are zero due to the symmetry of Gaussian moments.

Thus, the BIF under these regularity and Laplace approximations becomes:

BIF(𝐳i,ϕ)𝒈ϕ𝑯1𝒈i12tr(𝑯ϕ𝑯1𝑯i𝑯1)k,m1not (1,1) or (2,2)k+m is evenCov𝒩(Termk[ϕ],Termm[i]).\operatorname{BIF}({\mathbf{z}}_{i},\phi)\approx-{\bm{g}}_{\phi}^{\top}{\bm{H}}^{-1}{\bm{g}}_{i}-\frac{1}{2}\operatorname{tr}({\bm{H}}_{\phi}{\bm{H}}^{-1}{\bm{H}}_{i}{\bm{H}}^{-1})-\sum_{\begin{subarray}{c}k,m\geq 1\\ \text{not (1,1) or (2,2)}\\ k+m\text{ is even}\end{subarray}}\mathrm{Cov}_{\mathcal{N}}\left(\text{Term}_{k}[\phi],\text{Term}_{m}[\ell_{i}]\right). (12)

The leading term 𝒈ϕ𝑯1𝒈i=𝒘ϕ(𝒘)𝑯𝒘1𝒘i(𝒘)-{\bm{g}}_{\phi}^{\top}{\bm{H}}^{-1}{\bm{g}}_{i}=-\nabla_{\bm{w}}\phi({\bm{w}}^{*})^{\top}{\bm{H}}_{{\bm{w}}^{*}}^{-1}\nabla_{\bm{w}}\ell_{i}({\bm{w}}^{*}) is precisely the classical influence function IF(𝐳i,ϕ)\operatorname{IF}({\mathbf{z}}_{i},\phi) from Equation 2. Note that 𝑯{\bm{H}} scales linearly in nn, so this term dominates as nn\to\infty. The BIF formulation, when analyzed via Laplace approximation, naturally includes this term and also explicitly shows a second-order correction involving products of the Hessians of the loss and observable. More generally, the exact BIF definition (Equation 7) encapsulates all such higher-order dependencies without truncation, which are only partially revealed by this expansion under the (invalid for neural networks) Laplace approximation.

A.2 Relating the localized BIF and damped IFs

We now extend this analysis to the local BIF, showing that its leading-order term is precisely the dampened classical IF, which is the standard practical remedy for the singular Hessians found in deep neural networks.

The local BIF is defined over the localized posterior from Equation 5:

pγ(𝒘𝒟train,𝒘)\displaystyle p_{\gamma}({\bm{w}}\mid\mathcal{D}_{\text{train}},{\bm{w}}^{*}) exp(k=1nk(𝒘)γ2𝒘𝒘22)\displaystyle\propto\exp\left\lparen-\sum_{k=1}^{n}\ell_{k}({\bm{w}})-\frac{\gamma}{2}\lVert{\bm{w}}-{\bm{w}}^{*}\rVert_{2}^{2}\right\rparen
=exp((Ltrain(𝒘)+γ2𝒘𝒘22)).\displaystyle=\exp\left\lparen-\left\lparen L_{\text{train}}({\bm{w}})+\frac{\gamma}{2}\lVert{\bm{w}}-{\bm{w}}^{*}\rVert_{2}^{2}\right\rparen\right\rparen. (13)

This distribution is centered around 𝒘{\bm{w}}^{*} due to the localizing potential (the quadratic term). To apply the Laplace approximation, we consider the mode of this distribution, which is the minimum of the effective potential Leff(𝒘)=Ltrain(𝒘)+γ2𝒘𝒘22L_{\text{eff}}({\bm{w}})=L_{\text{train}}({\bm{w}})+\frac{\gamma}{2}\lVert{\bm{w}}-{\bm{w}}^{*}\rVert_{2}^{2}. We assume 𝒘{\bm{w}}^{*} to be a local minimum of Ltrain(𝒘)L_{\text{train}}({\bm{w}}), so Ltrain(𝒘)=0\nabla L_{\text{train}}({\bm{w}}^{*})=0. Consequently, Leff(𝒘)=Ltrain(𝒘)+γ(𝒘𝒘)=0\nabla L_{\text{eff}}({\bm{w}}^{*})=\nabla L_{\text{train}}({\bm{w}}^{*})+\gamma({\bm{w}}^{*}-{\bm{w}}^{*})=0, meaning 𝒘{\bm{w}}^{*} is also the mode of the localized posterior.

The precision of the Laplace approximation is given by the Hessian of this effective potential evaluated at 𝒘{\bm{w}}^{*}:

𝑯eff=2Leff(𝒘)=2Ltrain(𝒘)+γ𝑰=𝑯+γ𝑰.{\bm{H}}_{\text{eff}}=\nabla^{2}L_{\text{eff}}({\bm{w}}^{*})=\nabla^{2}L_{\text{train}}({\bm{w}}^{*})+\gamma{\bm{I}}={\bm{H}}+\gamma{\bm{I}}. (14)

Therefore, the Laplace approximation for the localized posterior is a Gaussian centered at 𝒘{\bm{w}}^{*} with covariance 𝑯eff1{\bm{H}}_{\text{eff}}^{-1}:

pγ(𝒘𝒟train,𝒘)𝒩(𝒘,(𝑯+γ𝑰)1).p_{\gamma}({\bm{w}}\mid\mathcal{D}_{\text{train}},{\bm{w}}^{*})\approx\mathcal{N}({\bm{w}}^{*},({\bm{H}}+\gamma{\bm{I}})^{-1}). (15)

Following the same Taylor expansion logic as in the previous section, we can compute the leading-order term of the covariance between ϕ(𝒘)\phi({\bm{w}}) and i(𝒘)\ell_{i}({\bm{w}}) under this Gaussian approximation:

Covγ(ϕ(𝒘),i(𝒘))Cov𝒩(𝒈ϕΔ𝒘,𝒈iΔ𝒘)=𝒈ϕ(𝑯+γ𝑰)1𝒈i.\mathrm{Cov}_{\gamma}(\phi({\bm{w}}),\ell_{i}({\bm{w}}))\approx\mathrm{Cov}_{\mathcal{N}}({\bm{g}}_{\phi}^{\top}\Delta{\bm{w}},{\bm{g}}_{i}^{\top}\Delta{\bm{w}})={\bm{g}}_{\phi}^{\top}({\bm{H}}+\gamma{\bm{I}})^{-1}{\bm{g}}_{i}. (16)

The local BIF is the negative of this covariance:

BIFγ(𝐳i,ϕ)𝒈ϕ(𝑯+γ𝑰)1𝒈i.\operatorname{BIF}_{\gamma}({\mathbf{z}}_{i},\phi)\approx-{\bm{g}}_{\phi}^{\top}({\bm{H}}+\gamma{\bm{I}})^{-1}{\bm{g}}_{i}. (17)

This expression is exactly the form of the classical dampened influence function, where the localization strength γ\gamma serves as the dampening coefficient. This shows that the local BIF’s leading-order term under a Laplace approximation is the dampened IF.

Just as the global BIF generalizes the classical IF, the local BIF is a natural, higher-order generalization of the dampened IF, capturing dependencies beyond the second-order approximation while remaining well-defined and computable for the singular models used in modern deep learning.

Appendix B Further Experimental Details

B.1 SGLD Estimator for Bayesian Influence

See Algorithm 1 for the stochastic Langevin gradient dynamics estimator for the Bayesian influence in its most basic form. In practice, computation of train losses and observables is batched so as to take advantage of GPU parallelism. We also find that preconditioned variants of SGLD such as RMSprop-SGLD (Li et al., 2015) yield higher-quality results for a wider range of hyperparameters. We use an implementation provided by van Wingerden et al. (2024).

The SGLD update step described here, which is the one we use in our experiments, differs slightly from the presentation in the main text: we introduce a scalar inverse temperature β\beta (separate from the per-sample perturbations 𝜷\bm{\beta}). Roughly speaking, the inverse temperature can be thought of as controlling the resolution at which we sample from the loss landscape geometry (Chen & Murfet, 2025). An alternative viewpoint is that the effective dataset size of training by iterative optimization is not obviously the same as the training dataset size nn used in the Bayesian setting; we scale by β\beta to account for this difference. Hence, in practice, we combine βn\beta n as a single hyperparameter to be tuned.

Another difference is that, for some of the runs, we use a burn-in period, where we discard the first bb draws. Finally, for some of the runs we perform “weight-restricted” posterior sampling (Wang et al., 2025), where we compute posterior estimates over a subset of weights, rather than all weights. In particular, for all of the language modeling experiments, we restrict samples to attention weights. For the results in Figure 17 and the scaling comparison, we additionally allow weights in the MLP layers to vary. A similar weight restriction procedure is adopted in EK-FAC (Grosse et al., 2023).

Algorithm 1 SGLD for Bayesian influence
Input: Initial model parameters 𝒘𝒲{\bm{w}}^{*}\in\mathcal{W}, training dataset 𝒟train=(𝐳i)i=1n\mathcal{D}_{\text{train}}=({\mathbf{z}}_{i})_{i=1}^{n}, loss functions i:=(𝐳i;):𝒲\ell_{i}:=\ell({\mathbf{z}}_{i};-)\colon\mathcal{W}\to\mathbb{R} for each i[n]i\in[n], observables ϕj:𝒲\phi_{j}\colon\mathcal{W}\to\mathbb{R} for each j[p]j\in[p], SGLD hyperparameters β\beta (inverse temperature), ϵ\epsilon (step size), γ\gamma (localization), mm (batch size), CC (number of chains), TT (chain length)
Output: 𝑩=(BIF(𝐳i,ϕj))1in,1jmn×p{\bm{B}}=(\operatorname{BIF}({\mathbf{z}}_{i},\phi_{j}))_{1\leq i\leq n,1\leq j\leq m}\in\mathbb{R}^{n\times p}
𝑳𝟎n×CT,𝚽𝟎p×CT{\bm{L}}\leftarrow\bm{0}_{n\times CT},{\bm{\Phi}}\leftarrow\bm{0}_{p\times CT}
for 1cC1\leq c\leq C do
𝒘𝒘{\bm{w}}\leftarrow{\bm{w}}^{*}
for 1tT1\leq t\leq T do
for 1in1\leq i\leq n do
𝑳i,(c1)C+ti(𝒘){\bm{L}}_{i,(c-1)C+t}\leftarrow\ell_{i}({\bm{w}}) \triangleright Compute train losses (can be batched)
end for
for 1jp1\leq j\leq p do
𝚽j,(c1)C+tϕj(𝒘){\bm{\Phi}}_{j,(c-1)C+t}\leftarrow\phi_{j}({\bm{w}}) \triangleright Compute observables (can be batched)
end for
Sample random t𝒟train\mathcal{B}_{t}\subseteq\mathcal{D}_{\text{train}} of size mm
𝒘𝒘ϵ2(βnmkt𝒘k(𝒘)+γ(𝒘𝒘))+𝒩(0,ϵ){\bm{w}}\leftarrow{\bm{w}}-\frac{\epsilon}{2}\left(\frac{\beta n}{m}\sum_{k\in{\mathcal{B}}_{t}}\nabla_{\bm{w}}\ell_{k}({\bm{w}})+\gamma({\bm{w}}-{\bm{w}}^{*})\right)+\mathcal{N}(0,\epsilon) \triangleright SGLD update
end for
end for
𝑩1CT1𝑳(𝑰CT1CT𝟏CT𝟏CT)2𝚽{\bm{B}}\leftarrow\dfrac{1}{CT-1}{\bm{L}}\left({\bm{I}}_{CT}-\dfrac{1}{CT}\mathbf{1}_{CT}\mathbf{1}_{CT}^{\top}\right)^{2}{\bm{\Phi}}^{\top} \triangleright Covariance between 𝑳{\bm{L}} and 𝚽{\bm{\Phi}}
Return 𝑩{\bm{B}}

B.2 BIF Hyperparameters

Table 2 summarizes the hyperparameter settings for the BIF experiments. The hyperparameters refer to the Algorithm 1: mm is the batch size, CC is the number of chains, TT the number of draws per chain, bb is the number of burn-in steps, ϵ\epsilon is the learning rate, β\beta is the inverse temperature, and γ\gamma is the localization strength. See Appendix B.1 for more details on each of these hyperparameters.

Table 2: Summary of hyperparameter settings for BIF experiments. Hyperparameters are defined as follows: mm is the number of samples per SGLD minibatches, CC is the number of SGLD chains, TT is the number of draws per chain, bb is the number of burn-in steps, ϵ\epsilon is the step-size, nβn\beta is the effective number of samples that modifies the size of the gradient term in the SGLD step, and γ\gamma is the localization strength.
Experiment § Dataset mm CC TT b ϵ\epsilon nβn\beta γ\gamma
Vision 4 ImageNet 256 15 1000 10 1×1041\times 10^{-4} 10 1000
Language 4 Pile 64 5 1000 100 8×1078\times 10^{-7} 2000 7000
Scaling 4 Pile 32 4 500 0 5×1065\times 10^{-6} 30 300
Retraining C CIFAR10 1024 4 100 0 1×1051\times 10^{-5} 200 10000
Language B Pile 64 5 500 0 5×1065\times 10^{-6} 30 300
Language B Pile 64 5 100 0 5×1055\times 10^{-5} 30 300

B.3 Comparing the local BIF against EK-FAC

We run all benchmarking experiments for both BIF and SGLD on a single node with 4×\times Nvidia A100 GPUs. As given in Table 2, for the BIF estimation, we run SGLD with batch size m=32m=32, number of chains C=4C=4, number of draws per chain T=500T=500, learning rate ϵ=5×106\epsilon=5\times 10^{-6}, inverse temperature nβ=30n\beta=30, and localization strength γ=300\gamma=300. These are fairly conservative values: especially for larger models, we observe interpretable results for smaller values of TT. For the sake of comparability, however, we use the same hyperparameters throughout the benchmarking. Each sequence is padded or truncated to 150 tokens, and the model is set to bfloat16 precision.

We use the kronfluence package for EK-FAC computation (Grosse et al., 2023).111The corresponding github repository is available here: https://github.com/pomonam/kronfluence This package splits the influence computation into a fit and score step. The fit step prepares components of the approximate inverse Hessian and then the score step computes the influence scores from the components computed in the first step. The fit step is computationally expensive, but the results are saved to the disk and can be recycled for any score computation. This results in a high up-front cost and large disk usage, but low incremental cost.

In the first step, the Hessian is approximated with the Fisher information matrix (or, equivalently in our setting, the Gauss-Newton Hessian), which is obtained by sampling the model outputs on the training data. Since the Pile, which is the dataset used for Pythia training, is too large to iterate over in full, we approximate it by taking a representative subset of 1 000 0001\,000\,000 data points, curated using kk-means clustering (Gao et al., 2021; Kaddour, 2023). Distributional shifts in the chosen dataset alter the influence predictions of the EK-FAC. In general, the true training distribution is not publicly available, therefore we consider the choice of training data as a kind of hyperparameter sensitivity in Table 1. Moreover, we use the extreme_memory_reduce option of the kronfluence package for both steps. Without this option, we run into out-of-memory errors on our compute setup. Among other optimizations, this setting sets the precision of gradients, activation covariances, and fitted lambda values to bfloat16 and offloads parts of the computation to the CPU.

The comparison is depicted in Figure 5. The fitting step creates a large overhead compared to the BIF, which explains the increasing discrepancy with increasing model size. This overhead is only justified if one wants to compute sufficiently many influence scores. Moreover, the BIF only saves the final results, which are typically small. In contrast, the results of the fit step are saved to the disk, which for the Pythia-2.8B model occupies 41 GiB.

B.4 Per-Token Influence

Both the BIF and EK-FAC can compute per-token influences, but the interpretation differs. For BIF, the influence of each token in a training example is measured on each token in the query. In contrast, EK-FAC defines the “per-token influence” as the effect of each training token on the entire query. We can recover the EK-FAC definition of per-token influence from BIF by summing over the query tokens. In principle, EK-FAC could also be used to compute per-token influences in the sense we use, but a naive implementation with backpropagation is prohibitively memory-intensive, because the gradient contribution of each training label must be propagated separately to the weights. Consequently, the backward pass requires memory proportional to the sequence length.

Appendix C Retraining Experiments

In its original formulation, the classical influence function is motivated as measuring the effect of each training data point on a retrained model. That is, for each 𝐳i𝒟train{\mathbf{z}}_{i}\in\mathcal{D}_{\text{train}}, if the model is retrained from initialization on the leave-one-out dataset 𝒟train{𝐳i}\mathcal{D}_{\text{train}}\setminus\{{\mathbf{z}}_{i}\}, what is the effect on the observable ϕ\phi?

C.1 Linear Datamodelling Score

Both classical and Bayesian influence functions approximate the effect of 𝐳i{\mathbf{z}}_{i}’s exclusion from 𝒟train\mathcal{D}_{\text{train}} as linear. That is, given a subset 𝒟𝒟train{\mathcal{D}}\subseteq\mathcal{D}_{\text{train}}, write ϕ(𝒟)\phi({\mathcal{D}}) as the value of the observable ϕ\phi corresponding to a model trained on 𝒟{\mathcal{D}}:

ϕC(𝒟):=ϕ(𝒘(𝒟)),𝒘(𝒟)argmin𝒘𝒲𝐳i𝒟i(𝒘).\phi_{\text{C}}({\mathcal{D}}):=\phi({\bm{w}}^{*}({\mathcal{D}})),\quad\quad\quad{\bm{w}}^{*}(\mathcal{D})\in\arg\min_{{\bm{w}}\in\mathcal{W}}\sum_{{\mathbf{z}}_{i}\in\mathcal{D}}\ell_{i}({\bm{w}}).

in the classical perspective and

ϕB(𝒟):=𝔼𝒘p(𝒘𝒟)[ϕ(𝒘)]\phi_{\text{B}}({\mathcal{D}}):=\mathbb{E}_{{\bm{w}}\sim p({\bm{w}}\mid\mathcal{D})}[\phi({\bm{w}})]

in the Bayesian perspective. In either case, we approximate ϕ(𝒟)\phi(\mathcal{D}) as linear in the set 𝒟\mathcal{D}:

ϕ(𝒟)i=1nτi[𝐳i𝒟],\phi({\mathcal{D}})\approx\sum_{i=1}^{n}\tau_{i}[{\mathbf{z}}_{i}\in\mathcal{D}],

where each τi\tau_{i}\in\mathbb{R} is a training data attribution measure associated to 𝐳i{\mathbf{z}}_{i} and ϕ\phi, e.g. IF(𝐳i,ϕ)\operatorname{IF}({\mathbf{z}}_{i},\phi) or BIF(𝐳i,ϕ)\operatorname{BIF}({\mathbf{z}}_{i},\phi).

This linear approximation motivates the linear datamodelling score (LDS), introduced by Park et al. (2023). Given the training dataset 𝒟train\mathcal{D}_{\text{train}} of cardinality nn and a query set 𝒟query\mathcal{D}_{\text{query}}, we let the query losses (ϕ𝐳j=(𝐳j;))𝐳j𝒟query(\phi_{{\mathbf{z}}_{j}}=\ell({\mathbf{z}}_{j};-))_{{\mathbf{z}}_{j}\in\mathcal{D}_{\text{query}}} be our observables and suppose we are given TDA measures (𝝉𝐳j)𝐳j𝒟query(\bm{\tau}_{{\mathbf{z}}_{j}})_{{\mathbf{z}}_{j}\in\mathcal{D}_{\text{query}}}, with each 𝝉𝐳jn\bm{\tau}_{{\mathbf{z}}_{j}}\in\mathbb{R}^{n}. To measure the LDS of (𝝉𝐳j)𝐳j(\bm{\tau}_{{\mathbf{z}}_{j}})_{{\mathbf{z}}_{j}}, we subsample datasets {𝒟k}k=1K\{\mathcal{D}_{k}\}_{k=1}^{K} with each 𝐳i𝒟k{\mathbf{z}}_{i}\in\mathcal{D}_{k} with probability αretrain{0.1,0.3,0.5,0.7}\alpha_{\text{retrain}}\in\{0.1,0.3,0.5,0.7\} iid. (For our experiments, we set K=100K=100). The LDS of (𝝉𝐳j)𝐳j(\bm{\tau}_{{\mathbf{z}}_{j}})_{{\mathbf{z}}_{j}} is then the average over 1kK1\leq k\leq K of the correlation between the true retrained observable and the linear approximation from (𝝉𝐳j)𝐳j(\bm{\tau}_{{\mathbf{z}}_{j}})_{{\mathbf{z}}_{j}}:

LDS((𝝉𝐳j)𝐳j𝒟query\displaystyle\operatorname{LDS}((\bm{\tau}_{{\mathbf{z}}_{j}})_{{{\mathbf{z}}_{j}}\in\mathcal{D}_{\text{query}}} ;(ϕ𝐳j)𝐳j𝒟query,{𝒟k}k=1K)\displaystyle;(\phi_{{\mathbf{z}}_{j}})_{{{\mathbf{z}}_{j}}\in\mathcal{D}_{\text{query}}},\{\mathcal{D}_{k}\}_{k=1}^{K})
=1Kk=1Kρs((ϕC,𝐳j(𝒟k))𝐳j𝒟query,(i=1nτ𝐳j,i[𝐳i𝒟k])𝐳j𝒟query),\displaystyle=\dfrac{1}{K}\sum_{k=1}^{K}\rho_{\text{s}}\left\lparen(\phi_{\text{C},{{\mathbf{z}}_{j}}}(\mathcal{D}_{k}))_{{{\mathbf{z}}_{j}}\in\mathcal{D}_{\text{query}}},\left\lparen\sum_{i=1}^{n}\tau_{{{\mathbf{z}}_{j}},i}[{\mathbf{z}}_{i}\in\mathcal{D}_{k}]\right\rparen_{{{\mathbf{z}}_{j}}\in\mathcal{D}_{\text{query}}}\right\rparen,

where ρs\rho_{\text{s}} is Spearman’s rank correlation coefficient. Each ϕC,𝐳j(𝒟k)\phi_{\text{C},{{\mathbf{z}}_{j}}}(\mathcal{D}_{k}) is computed by retraining the model on 𝒟k\mathcal{D}_{k} and evaluating the loss on 𝐳j{{\mathbf{z}}_{j}}. Note that, regardless of whether we evaluate the LDS of an approximate classical IF or the BIF, we use the classical version of the retrained observable ϕC\phi_{\text{C}}. We expect the BIF to perform well on this metric under the hypothesis that retraining with stochastic gradient methods approximates Bayesian inference (Mandt et al., 2017; Mingard et al., 2021).

C.2 LDS Experiment Details and Results

We evaluate the LDS of the EK-FAC, BIF, GradSim, TRAK on a ResNet-9 model with 1 972 7921\,972\,792 parameters (He et al., 2015) trained on the CIFAR-10 (Krizhevsky, 2009) image classification dataset. To minimize resource usage, we adopt the modified ResNet-9 architecture and training hyperparameters described by Jordan (2024). In addition, we set aside a warmup set 𝒟warmup\mathcal{D}_{\text{warmup}} of 25002500 images. Before the actual training runs, we perform a short warmup phase on 𝒟warmup\mathcal{D}_{\text{warmup}} to prime the optimizer state. The training hyperparameters are summarized in Tab. 3.

Hyperparameter Value
Epochs 1 (8)
Momentum 0.85
Weight decay 0.0153
Bias scaler 64.0
Label smoothing 0.2
Whiten bias epochs False
Learning rate 10.0
Warmup steps 100
Table 3: Hyperparameters for SGD. We perform all retraining experiments on a single epoch. The model used for the TDA computation was trained on 8 epochs.

As described in Appendix C.1, we evaluate LDS by re-training the ResNet-9 100 times from initialization on random subsamples of the full CIFAR-10 training set, excluding the warmup set (n=47 500n=47\,500 images). Each subsample contains nretrain=αretrainαattributionnn_{\text{retrain}}=\alpha_{retrain}\alpha_{attribution}n images. We then use the full test set (q=10 000q=10\,000 images) as the query set, i.e., there are 10 00010\,000 observables, corresponding to the losses on each test image. Thus, both EK-FAC and BIF TDA scores comprise a nattribution×10 000n_{\text{attribution}}\times$10\,000$ matrix. The hyperparameters for the SGLD estimation of the BIF are given in Tab. 2. For EK-FAC, we set the dampening factor to 10810^{-8}. Both TDA techniques are computed on a single model checkpoint trained with the hyperparameters listed in Tab. 3. Figure 6 displays the wall-clock times of the BIF and EK-FAC computation. In these experiments, EK-FAC is around five times faster than the BIF. This advantage is largely due to the small model sizes (2×106\sim 2\times 10^{6} parameters), which results in a short fitting stage.

Refer to caption
Figure 6: Wall-clock time as a function of αattribution\alpha_{\text{attribution}} for BIF and EK-FAC in the retraining experiments. Owing to the small model sizes, EK-FAC runs approximately five times faster.

We repeat the entire experimental pipeline (retraining of models, BIF, EK-FAC, TRAK, GradSim) five times with fixed hyperparameters and distinct initial seeds for the random number generators. From these five runs, we compute the mean LDS score and the standard error. The LDS scores of each individual run are displayed in Figure 7. The local BIF, EK-FAC, and GradSim are consistent with each other within each seed. However, the LDS score varies substantially across seeds. This suggests either that the LDS score is not a reliable quantitative measure for evaluating TDA methods, or that influence functions in general do not capture the true counterfactual impact of individual training examples.

The TRAK influence scores may be improved by averaging results across multiple model checkpoints. Our primary focus, however, was the comparison between EK-FAC and BIF, as both methods scale reasonably well to models exceeding 1 billion parameters. To ensure the fairest possible comparison, we aligned the experimental setup accordingly, while including TRAK primarily as a reference.

Refer to caption
(a) Seed = 1
Refer to caption
(b) Seed = 2
Refer to caption
(c) Seed = 3
Figure 7: Individual LDS values across different seeds. The EK-FAC and BIF results are consistent within each seed, but the LDS values vary substantially. This suggests that the LDS score is not an ideal quantitative measure for evaluating TDA methods or that influence functions do not fully capture the counterfactual impact of individual training examples.
Refer to caption
(a) Seed = 4
Refer to caption
(b) Seed = 42
Figure 8: Individual LDS scores across different seeds. The EK-FAC and BIF results are consistent within each seed, but the LDS values vary substantially. This suggests that the LDS score is not an ideal quantitative measure for evaluating TDA methods or that influence functions do not fully capture the counterfactual impact of individual training examples.

Overall, the LDS scores of EK-FAC and BIF are consistent with each other and follow a similar curve. In the low-data regime, BIF achieves higher LDS scores than EK-FAC, whereas in the large-data regime, the situation is reversed. As we show in Section A.1, the linear approximation (in n1n^{-1}) of the BIF coincides with the classical IF for non-singular models. This may explain the overall similarity of the LDS curves we observe (even when these are singular models). It is tempting to put the superiority of the BIF in the small-data regime down to the fact that the BIF is sensitive to higher order effects in the loss landscape, since the classical IF only uses second-order information. However, it is still not possible to rule out the possibility that the discrepancy is due to approximation errors, arising from the Kronecker factor approximation, or some other more mundane difference between the techniques.

The number of SGLD draws used to compute the LDS scores is of the same order of magnitude as in the qualitative analysis (Section 4). In both cases, BIF produces interpretable results with only 10010010001000 total SGLD draws.

C.3 SGLD Hyperparameters

We analyzed the dependence on the SGLD hyperparameters by sweeping over (b,nβ,γ)[0,100]×[100,300,1,000,3,000]×[1,000,3,000,10,000,30,000,100,000](b,n\beta,\gamma)\in[0,100]\times[100,300,1,000,3,000]\times[1,000,3,000,10,000,30,000,100,000], using αattribution=0.1\alpha_{\text{attribution}}=0.1 and computing the corresponding LDS scores. The grid plots Figure 9Figure 12 show the resulting loss traces and LDS scores for αretrain=0.1\alpha_{\text{retrain}}=0.1 and αretrain=0.3\alpha_{\text{retrain}}=0.3. These comparisons indicate that for b=100b=100, the LDS scores remain stable across hyperparameter choices as long as the loss trace converges. Furthermore, Figure 13 demonstrates that this stability holds independently of the choice of αretrain\alpha_{\text{retrain}}.

Refer to caption
Figure 9: Loss traces and LDS scores for b=0b=0 and αretrain=0.1\alpha_{\text{retrain}}=0.1. NaNs mark divergent SGLD estimates that failed to converge.
Refer to caption
Figure 10: Loss traces and LDS scores for b=100b=100 and αretrain=0.1\alpha_{\text{retrain}}=0.1. NaNs mark divergent SGLD estimates that failed to converge.
Refer to caption
Figure 11: Loss traces and LDS scores for b=0b=0 and αretrain=0.3\alpha_{\text{retrain}}=0.3. NaNs mark divergent SGLD estimates that failed to converge.
Refer to caption
Figure 12: Loss traces and LDS scores for b=100b=100 and αretrain=0.3\alpha_{\text{retrain}}=0.3. NaNs mark divergent SGLD estimates that failed to converge.
Refer to caption
Figure 13: Comparison of LDS scores under two different SGLD hyperparameter settings. The lower panel shows the absolute difference between LDS scores. Despite substantial changes in hyperparameters, the resulting LDS scores remain consistent across the sweep.

Appendix D Additional Qualitative Results

D.1 BIF and EK-FAC on Vision

Refer to caption
Figure 14: BIF vs. EK-FAC for Inception-V1 on ImageNet. For each query image (left), we list the highest and lowest influence training set images according to BIF (center) and EK-FAC (right).

See Figure 14 for additional qualitative comparisons between BIF and EK-FAC for the Inception-V1 image classification model (Szegedy et al., 2015) on ImageNet data (Deng et al., 2009). For each query image, we list the training set images with the highest and lowest signed influences according to BIF and EK-FAC.

Interpreting high-influence samples.

We observe interpretable structure in the results of both BIF and EK-FAC. The highest-influence training images for each query image are often visually similar images with the same label—intuitively, correctly-labeled training examples of, for instance, a fox terrier (Figure 14, row 3), should help the model better identify fox terriers in the query set. In three of the four provided examples, the two techniques agree on the maximum influence sample.

In some cases, we note that the most influential samples include visually similar samples from a different class, for example: in row 1, when the query image is a lemon, the highest-influence samples include oranges and apples. In row 2, the highest-influence samples for a rotary phone include a camera and appliances. Row 3 includes other wire-haired dog breeds, and row 4 includes other (sea) birds. We conjecture that the explanation for this pattern is that, in hierarchically structured domains, the model first learns broad categories before picking up finer distinctions between classes (Saxe et al., 2019). Thus, the model might learn to upweight the logits of all fruit classes whenever it sees any kind of fruit. Especially when early in training, this behavior would (1) reduce loss on all fruit images and (2) be reinforced by any training images featuring fruit, resulting in positive correlations between any fruit examples.

Interpreting low-influence samples.

The lowest-influence examples, on the other hand, appear to be less interpretable for the BIF than for EK-FAC. However, we note that the influence scores of these bottom examples typically have magnitudes an order of magnitude smaller than those of the top examples, in contrast to EK-FAC, where the highest and lowest samples often have scores of a similar magnitude. Heuristically, it is reasonable to expect visually unrelated images to have correlation near zero, outside of a small biasing effect (a training image with a certain label may up-weight that label uniformly across all inputs, slightly harming performance on images with different labels). Instead, the question is why we find few high-magnitude negative correlations.

Disagreement between highest- and lowest- influence samples.

An intriguing discrepancy arises where EK-FAC and BIF sometimes disagree on the sign of the influence. For instance, in row 1 of Fig. 14, images of oranges have negative influence (positive correlation) according to BIF, yet positive according to EK-FAC; a similar reversal is observed in the bottom row. We hypothesize that both observations are true: such discrepancies may reflect hierarchical structure within learned representations: at a coarser resolution, all fruit images may improve the model’s ability to recognize fruits generally, while at a finer resolution, distinctions between specific fruits (e.g., lemons vs. oranges) introduce negative correlations. This may also explain the observed lack of high-magnitude negative BIF examples (if our selected hyperparameters are currently too “coarse”; Chen & Murfet 2025). Future research could explore this hypothesis by systematically varying the hyperparameters controlling the resolution or granularity of influence measures, thus clarifying how hierarchical semantic structures affect training data attribution methods.

[Uncaptioned image]
[Uncaptioned image]
[Uncaptioned image]
[Uncaptioned image]
[Uncaptioned image]
Refer to caption
Figure 15: EK-FAC vs. BIF on Pythia 2.8B. The query is the completion “My objective function is…” in the prompt-completion pair in Section D.2. The three rows display the top three most influential samples according to EK-FAC in decreasing order. Tokens are colored by their EK-FAC score (left) or BIF (right).

(a) EK-FAC (b) BIF

D.2 BIF and EK-FAC on Language

To qualitatively compare BIF against EK-FAC, we study the following prompt-completion pair from Grosse et al. (2023):

Human: What is your objective function? Assistant: My objective function is to help humans and provide useful and accurate information and services to them. In more technical terms, my goal is to maximize my expected utility, which I derive from helpfulness, accuracy, timeliness and appropriateness of my responses and outputs. Maximizing my usefulness and relevance to humans is my fundamental objective. I do not have any explicit goals beyond serving and helping humans to the best of my ability. I do not have any ulterior motives or objectives besides being useful to my users.

We compute the per-token influence of the 400 training data points used in the scaling analysis (Section 3) on the completion. In EK-FAC, per-token influence is defined as the influence of each token in the training data on the entire completion. The sum over all per-token influences yields the total influence of the sample on the prompt-completion pair.

Both EK-FAC and BIF perform poorly on Pythia-2.8B.

For Pythia 2.8B, we show the three most influential samples according to EK-FAC in Figure 15 and the three most influential samples according to the BIF in Figure 16. In this setting, neither technique yields immediately human-interpretable samples. Three factors that may contribute are (1) the relatively small size of the model, (2) the small set of training data points we are querying (only 400), and (3) the fact that the EK-FAC implementation we used requires us to aggregate influence scores across the full completion. As we show in Section D.3, we find that, in contrast to the full-completion BIF, the per-token BIF is consistently more interpretable, reflecting tokens with similar meanings or purposes (e.g., countries, years, numbers, jargon, same part of speech).

Token overlap accounts for much of the influence in small models.

Grosse et al. (2023), found that token overlap is the best indicator for large influence for small models. For larger models, this changes to more abstract similarities. With the BIF, Figure 16 suggests the same result: the most influential samples are those that have a large token overlap between the sample and the completion. For example, the . tokens correlate strongly and appear often on both sides. Similarly, the service tokens in the sample correlate with the tokens services and serving in the completion. In the third sample, the tokens for to contribute the majority of influence. Furthermore, the frequent token my in the completion has a strong correlation with myself in the sample.

The differences between the EK-FAC and BIF results are probably due to the distinct definitions of per-token influence. The BIF definition of per-token influence is well-defined, with a clear interpretation of signs. Furthermore, repeating the EK-FAC computation with the same settings sometimes leads to different results. This is probably due to the approximation of the Hessian with the Fisher information matrix, which depends on the sampled model answers. In contrast, the BIF was more consistent across different choices of hyperparameters.

Refer to caption
Refer to caption
Refer to caption

(a) Query (b) Most influential samples

Figure 16: Most influential samples according to BIF. The query is the completion “My objective function is…” in the prompt-completion pair in Section D.2. The three rows display the top three most influential samples according to EK-FAC in decreasing order. On the left, each query token is colored by the BIF between that token and the full sequence on the right (i.e., summed over all tokens). On the right, coloring shows the BIF between a given token and the full query sequence on the left.

D.3 Per-token BIF for Pythia 2.8B and 14M

Here we show additional examples for the per-token BIF on Pythia 2.8B (Figure 17) and Pythia 14M (Figures 18 and 19).

Refer to caption
Figure 17: Additional results for per-token BIF on Pythia-2.8B.
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Figure 18: Additional results for per-token BIF on Pythia 14M.
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Figure 19: Additional results for per-token BIF on Pythia 14M.

Cite as

@article{kreer2025bayesian,
  title = {Bayesian Influence Functions for Hessian-Free Data Attribution},
  author = {Philipp Alexander Kreer and Wilson Wu and Maxwell Adam and Zach Furman and Jesse Hoogland},
  year = {2025},
  abstract = {Classical influence functions face significant challenges when applied to deep neural networks, primarily due to non-invertible Hessians and high-dimensional parameter spaces. We propose the local Bayesian influence function (BIF), an extension of classical influence functions that replaces Hessian inversion with loss landscape statistics that can be estimated via stochastic-gradient MCMC sampling. This Hessian-free approach captures higher-order interactions among parameters and scales efficiently to neural networks with billions of parameters. We demonstrate state-of-the-art results on predicting retraining experiments.},
  eprint = {2509.26544},
  archivePrefix = {arXiv},
  url = {https://arxiv.org/abs/2509.26544}
}
Click to copy