Training Data Influence Analysis and Estimation: A Survey

Good models require good training data. For overparameterized deep models, the causal relationship between training data and model predictions is increasingly opaque and poorly understood. Influence analysis partially demystifies training's underlying interactions by quantifying the amount each training instance alters the final model. Measuring the training data's influence exactly can be provably hard in the worst case; this has led to the development and use of influence estimators, which only approximate the true influence. This paper provides the first comprehensive survey of training data influence analysis and estimation. We begin by formalizing the various, and in places orthogonal, definitions of training data influence. We then organize state-of-the-art influence analysis methods into a taxonomy; we describe each of these methods in detail and compare their underlying assumptions, asymptotic complexities, and overall strengths and weaknesses. Finally, we propose future research directions to make influence analysis more useful in practice as well as more theoretically and empirically sound. A curated, up-to-date list of resources related to influence analysis is available at https://github.com/ZaydH/influence_analysis_papers.


Introduction
Machine learning is built on training data [Red+21]. Without good training data, nothing else works. How modern models learn from and use training data is increasingly opaque [KL17; Zha+21a; Xia22]. Regarding state-of-the-art black-box models, Yampolskiy [Yam20] notes, "If all we have is a 'black box' it is impossible to understand causes of failure and improve system safety." Large modern models require tremendous amounts of training data [BF21]. Today's uncurated, internet-derived datasets commonly contain numerous anomalous instances [Ple+20]. These anomalies can arise from multiple potential sources. For example, training data anomalies may have a natural cause such as distribution shift [RL87; Yan+21], measurement error, or non-representative samples drawn from the tail of the data distribution [Hub81; Fel20; Jia+21b]. Anomalous training instances also occur due to human or algorithmic labeling errors, even on well-known, highly curated datasets [EGH17]. Malicious adversaries can insert anomalous poison instances into the training data with the goal of manipulating specific model predictions [BNL12; Che+17; Sha+18a; HL23]. Regardless of the cause, anomalous training instances degrade a model's overall generalization performance.
Today's large datasets also generally overrepresent established and dominant viewpoints [Ben+21]. Models trained on these huge public datasets encode and exhibit biases based on protected characteristics, including gender, race, religion, and disability [BCC19; Kur+19; TC19; Zha+20; Hut+20]. These training data biases can translate into real-world harm, where, as an example, a recidivism model falsely flagged black defendants as high risk at twice the rate of white defendants [Ang+16].
Understanding the data and its relationship to trained models is essential for building trustworthy ML systems. However, it can be very difficult to answer even basic questions about the relationship between training data and model predictions; for example:
1. Is a prediction well-supported by the training data, or was the prediction just random?
2. Which portions of the training data improve a prediction? Which portions make it worse?
3. Which instances in the training set caused the model to make a specific prediction?
One strategy to address basic questions like those above is to render them moot by exclusively using simple, transparent model classes [Lip18]. Evidence exists that this "interpretable-only" strategy may be appropriate in some settings [Kni17]. However, even interpretable model classes can be grossly affected by training data issues [Hub81;CHW82;CW82]. Moreover, as the performance penalty of interpretable models grows, their continued use becomes harder to justify.
With the growing use of black-box models, we need better methods to analyze and understand black-box model decisions. Otherwise, society must carry the burden of black-box failures.

Relating Models and Their Training Data
All model decisions are rooted in the training data. Training data influence analysis partially demystifies the relationship between training data and model predictions by determining how to apportion credit (and blame) for specific model behavior to the training instances [SR88; KL17; Yeh+18; Pru+20]. Essentially, influence analysis tries to answer the question: What is each training instance's effect on a model? An instance's "effect" is with respect to a specific perspective. For example, an instance's effect may be quantified as the change in model performance when that instance is deleted from the training data. The effect can also be relative, e.g., whether one training instance changes the model more than another.
Influence analysis emerged alongside the initial study of linear models and regression [Jae72;CW82]. This early analysis focused on quantifying how worst-case perturbations to the training data affected the final model parameters. The insights gained from early influence analysis contributed to the development of numerous methods that improved model robustness and reduced model sensitivity to training outliers [Hog79;Rou94].
Since these early days, machine learning models have grown substantially in complexity and opacity [Dev+19; KSH12;Dos+21]. Training datasets have also exploded in size [BF21]. These factors combine to make training data influence analysis significantly more challenging where, for multilayer parametric models (e.g., neural networks), determining a single training instance's exact effect can be NP-complete in the worst case [BR92].
In practice, influence may not need to be measured exactly. Influence estimation methods provide an approximation of training instances' true influence. Influence estimation is generally much more computationally efficient and is now the approach of choice [Sch+22]. However, modern influence estimators achieve their efficiency via various assumptions about the model's architecture and learning environment [KL17; Yeh+18; GZ19]. These varied assumptions result in influence estimators having different advantages and disadvantages as well as, in some cases, even orthogonal perspectives on the definition of influence itself [Pru+20].

Our Contributions
To the best of our knowledge, there has not yet been a comprehensive review of these differing perspectives of training data influence, much less of the various methods themselves. This paper fills that gap by providing the first comprehensive survey of existing influence analysis techniques. We describe how these various methods overlap and, more importantly, the consequences, both positive and negative, that arise out of their differences. We provide this broad and nuanced understanding of influence analysis so that ML researchers and practitioners can better decide which influence analysis method best suits their specific application objectives [Sch+22].
In the remainder of this paper, we first standardize the general notation used throughout this work (Sec. 2). Section 3 reviews the various general formulations through which training data influence is viewed. We also categorize and summarize the properties of the seven most impactful influence analysis methods. Sections 4 and 5 describe these foundational influence methods in detail. For each method, we (1) formalize the associated definition of influence and how it is measured, (2) detail the formulation's strengths and weaknesses, (3) enumerate any related or derivative methods, and (4) explain the method's time, space, and storage complexities. Section 6 reviews various learning tasks where influence analysis has been applied. We provide our perspective on future directions for influence analysis research in Section 7.

General Notation
This section details our primary notation. In cases where a single influence method requires custom nomenclature, we introduce the unique notation alongside discussion of that method. Supplemental Section A provides a full nomenclature reference. Let [r] denote the set of integers {1, . . . , r}. A ∼_m B denotes that set A has cardinality m and is drawn uniformly at random (u.a.r.) from set B. For singleton set A (i.e., |A| = 1), the sampling notation simplifies to A ∼ B. Let 2^A denote the power set of any set A. Set subtraction is denoted A \ B. For singleton B = {b}, set subtraction is simplified to A \ b.
The zero vector is denoted 0, with the vector's dimension implicit from context. 1[a] is the indicator function, where 1[a] = 1 if predicate a is true and 0 otherwise. Let x ∈ X ⊆ R^d denote an arbitrary feature vector, and let y ∈ Y be a dependent value (e.g., label, target). Training set D := {z_i}_{i=1}^n consists of n training instances, where each instance is a tuple z_i := (x_i, y_i) ∈ Z and Z := X × Y. (Arbitrary) test instances are denoted z_te := (x_te, y_te) ∈ Z. Note that y_te need not be x_te's true dependent value; y_te can be any value in Y. Throughout this work, subscripts "i" and "te" entail that the corresponding symbol applies to an arbitrary training and test instance, respectively.
Model f : X → Y is parameterized by θ ∈ R^p, where p := |θ|; f is trained on (a subset of) dataset D. Most commonly, f performs either classification or regression, although more advanced model classes (e.g., generative models) are also considered. Model performance is evaluated using a loss function ℓ : Y × Y → R. Let L(z; θ) := ℓ(f(x; θ), y) denote the empirical risk of instance z = (x, y) w.r.t. parameters θ. By convention, a smaller risk is better.
This work primarily focuses on overparameterized models where p ≫ d. Such models are almost exclusively trained using first-order optimization algorithms (e.g., gradient descent), which proceed iteratively over T iterations. Starting from initial parameters θ^(0), the optimizer returns at the end of each iteration t ∈ [T] updated model parameters θ^(t), where θ^(t) is generated from previous parameters θ^(t−1), loss function ℓ, batch B^(t) ⊆ D, learning rate η^(t) > 0, and weight decay (L2) strength λ ≥ 0. Training gradients are denoted ∇_θ L(z_i; θ^(t)). The training set's empirical risk Hessian for iteration t is denoted H_θ^(t) := (1/n) Σ_{z_i ∈ D} ∇²_θ L(z_i; θ^(t)), with the corresponding inverse risk Hessian denoted (H_θ^(t))^−1. Throughout this work, superscript "(t)" entails that the corresponding symbol applies to training iteration t.
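To make this notation concrete, the minimal sketch below runs plain minibatch gradient descent with L2 weight decay and records every intermediate parameter vector θ^(0), . . . , θ^(T); gradient_fn is a hypothetical helper returning ∇_θ L(z; θ). Storing these checkpoints is exactly what the dynamic influence estimators of Section 5.2 reuse.

```python
import numpy as np

def train_and_checkpoint(theta0, batches, lrs, gradient_fn, weight_decay=0.0):
    """Minimal first-order training loop (plain minibatch SGD, no momentum).
    batches[t] and lrs[t] are batch B^(t+1) and learning rate eta^(t+1).
    Returns the list [theta^(0), theta^(1), ..., theta^(T)]."""
    thetas = [np.asarray(theta0, dtype=float)]
    theta = thetas[0]
    for batch, lr in zip(batches, lrs):
        # Average training gradient over the minibatch B^(t).
        grad = np.mean([gradient_fn(z, theta) for z in batch], axis=0)
        # L2 weight decay adds lambda * ||theta||^2 to the risk,
        # i.e., 2 * lambda * theta to its gradient.
        grad = grad + 2.0 * weight_decay * theta
        theta = theta - lr * grad          # theta^(t)
        thetas.append(theta)
    return thetas
```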
Some models may be trained on data other than full training set D, e.g., subset D \ z i . Let D ⊂ D denote an alternate training set, and denote model parameters trained on D as θ (t) D . For example, θ (T ) D \i are the final parameters for a model trained on all of D except training instance z i ∈ D (note this alternate notation for compactness). When training on all of the training data, subscript D is dropped, i.e., θ (t) ≡ θ (t) D .

Overview of Influence and Influence Estimation
As explained in Section 1, training data influence quantifies the "effect" of one or more training instances. This effect's scope can be as localized as an individual model prediction, e.g., f(x_te; θ^(T)); the effect's scope can also be so broad as to encompass the entire test data distribution.
Positive influence entails that the training instance(s) improve some quality measure, e.g., risk L(z_te; θ^(T)). Negative influence means that the training instance(s) make the quality measure worse. Highly expressive, overparameterized models remain functionally black boxes [KL17]. Understanding why a model behaves in a specific way remains a significant challenge [BP21], and the inclusion or removal of even a single training instance can drastically change a trained model's behavior [Rou94; BF21]. In the worst case, quantifying one training instance's influence may require repeating all of training.
Since measuring influence exactly may be intractable or unnecessary, influence estimators -which only approximate the true influence -are commonly used in practice. As with any approximation, influence estimation requires making trade-offs, and the various influence estimators balance these design choices differently. This in turn leads influence estimators to make different assumptions and rely on different mathematical formulations.
When determining which influence analysis methods to highlight in this work, we relied on two primary criteria: (1) a method's overall impact and (2) the method's degree of novelty in relation to other approaches. We also chose to focus on influence analysis methods targeted toward parametric models since those models generally achieve state-of-the-art results in most domains. Nonetheless, where applicable, we also note related non-parametric influence analysis methods.
The remainder of this section considers progressively more general definitions of influence.

Pointwise Training Data Influence
Pointwise influence is the simplest and most commonly studied definition of influence. It quantifies how a single training instance affects a model's prediction on a single test instance according to some quality measure (e.g., test loss). Formally, a pointwise influence analysis method is a function I : Z × Z → R, with the pointwise influence of training instance z_i on test instance z_te denoted I(z_i, z_te). Pointwise influence estimates are denoted Î(z_i, z_te) ∈ R.
Below, we briefly review early pointwise influence analysis contributions and then transition to a discussion of more recent pointwise methods.

Early Pointwise Influence Analysis
The earliest notions of pointwise influence emerged out of robust statistics, specifically the analysis of training outliers' effects on linear regression models [Coo77]. Given training set D, the least-mean-squares model parameters are

θ* := argmin_θ (1/n) Σ_{(x_i, y_i) ∈ D} (y_i − θ^⊤ x_i)².    (1)

Observe that this least-squares estimator has a breakdown point of 0 [Rou94]. This means that least-squares regression is completely non-robust: a single training data outlier can shift model parameters θ* arbitrarily. For example, Figure 1 visualizes how a single training data outlier can induce a nearly orthogonal least-squares model. Put simply, an outlier training instance's potential pointwise influence on a least-squares model is unbounded. Unbounded influence on the model parameters equates to unbounded influence on model predictions.
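To make the breakdown-point claim concrete, the short NumPy sketch below (with illustrative synthetic data) fits a one-dimensional least-squares model and shows that corrupting a single label shifts the fitted slope without bound:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.linspace(0.0, 1.0, 20)
y = 2.0 * X + rng.normal(scale=0.05, size=X.shape)     # clean data: slope ~ 2

def least_squares_slope(X, y):
    A = np.stack([X, np.ones_like(X)], axis=1)          # design matrix [x, 1]
    theta, *_ = np.linalg.lstsq(A, y, rcond=None)       # least-mean-squares fit
    return theta[0]

print(least_squares_slope(X, y))                         # approximately 2.0
for outlier in (10.0, 100.0, 1000.0):                    # corrupt one label
    y_bad = y.copy()
    y_bad[-1] = outlier
    print(outlier, least_squares_slope(X, y_bad))        # slope grows without bound
```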
Early influence analysis methods sought to identify the training instance that was most likely to be an outlier [Sri61; TMB73]. A training outlier can be defined as the training instance with the largest negative influence on prediction f(x_te; θ*). Intuitively, each training instance's pointwise influence can be measured by training n := |D| models, where each model's training set leaves out a different training instance. These n models would then be compared to identify the outlier. However, such repeated retraining is expensive, and so more efficient pointwise influence analysis methods were studied. Different assumptions about the training data distribution lead to different definitions of the most likely outlier. For example, Srikantan [Sri61], Snedecor and Cochran [SC68], and Ellenberg [Ell76] all assume that training data outliers arise from mean shifts in normally distributed training data. Under this constraint, their methods all identify the maximum likelihood outlier as the training instance with the largest absolute residual, |y_i − θ*^⊤ x_i|. However, Cook and Weisberg [CW82] prove that under different distributional assumptions (e.g., a variance shift instead of a mean shift), the maximum likelihood outlier may not have the largest residual.
These early influence analysis results demonstrating least-squares fragility spurred development of more robust regressors. For instance, Rousseeuw [Rou94] replaces Eq. (1)'s mean operation with median; this simple change increases the breakdown point of model parameters θ* to the maximum value, 50%. In addition, multiple robust loss functions have been proposed that constrain or cap outliers' pointwise influence [Hub64; BT74; JW78; Lec89].

Figure 2 (taxonomy of influence analysis methods; branches include "Retraining-Based" and "Leave-One-Out"): See Table 5 for the formal mathematical definition of all influence methods and estimators. Due to space, each method's citation is in supplemental Table 6.
Modern Pointwise Influence Analysis

As more complex models grew in prevalence, influence analysis methods similarly grew in complexity. In recent years, numerous influence analysis methods targeting deep models have been proposed. We briefly review the most impactful modern pointwise influence analysis methods next; below each method, we list closely related and derivative approaches. Modern influence analysis methods broadly categorize into two primary classes, namely:

• Retraining-Based Methods: Measure the training data's influence by repeatedly retraining model f using different subsets of training set D.
• Gradient-Based Influence Estimators: Estimate influence via the alignment of training and test instance gradients, either throughout or at the end of training.
An in-depth comparison of these influence analysis methods requires detailed analysis so we defer the extensive discussion of these two categories to Sections 4 and 5, respectively. Table 1 summarizes the key properties of Figure 2's seven methods -including comparing each method's assumptions (if any), strengths/weaknesses, and asymptotic complexities. These three criteria are also discussed when detailing each of these methods in the later sections.

Alternative Perspectives on Influence and Related Concepts
Note that pointwise effects are only one perspective on how to analyze the training data's influence. Below we briefly summarize six alternate, albeit less common, perspectives of training data influence. While pointwise influence is this work's primary focus, later sections also contextualize existing influence methods w.r.t. these alternate perspectives where applicable.
(1) Recall that pointwise influence quantifies the effect of a single training instance on a single test prediction. In reality, multiple related training instances generally influence a prediction as a group [FZ20], where group members have a total effect much larger than the sum of their individual effects [BYF20; Das+21;HL22]. Group influence quantifies a set of training instances' total, combined influence on a specific test prediction.
We use very similar notation to denote group and pointwise influence. The only difference is that for group influence, the first parameter of function I is a training (sub)set instead of an individual training instance; the same applies to group influence estimates Î. For example, given some test instance z_te, the entire training set's group influence and group influence estimate are denoted I(D, z_te) and Î(D, z_te), respectively.
In terms of magnitude, previous work has shown that the group influence of a related set of training instances is generally lower bounded by the sum of the set's pointwise influences [Koh+19]. Put simply, the true influence of a coherent group is more than the sum of its parts; formally, for coherent D′ ⊆ D, it often holds that

I(D′, z_te) > Σ_{z_i ∈ D′} I(z_i, z_te).

Existing work studying group influence is limited. Later sections note examples where any of the seven primary pointwise influence methods have been extended to consider group effects.
(2) Joint influence extends influence to consider multiple test instances collectively [Jia+22;Che+22]. These test instances may be a specific subpopulation within the test distribution -for example in targeted data poisoning attacks [Jag+21; Wal+21]. The test instances could also be a representative subset of the entire test data distribution -for example in core-set selection [BMK20] or indiscriminate poisoning attacks [BNL12;Fow+21].
Most (pointwise) influence analysis methods are additive, meaning that for target set D_te ⊆ Z, the joint (pointwise) influence simplifies to

I(z_i, D_te) = Σ_{z_te ∈ D_te} I(z_i, z_te).

Additivity is not a requirement of influence analysis, and there are provably non-additive influence estimators [YP21].
(3) Overparameterized models like deep networks are capable of achieving near-zero training loss in most settings [Bar+20; Fel20; DAm+20]. This holds even if the training set is large and randomly labeled [Zha+17;Arp+17]. Near-zero training loss occurs because deep models often memorize some training instances.
Both Pruthi et al. [Pru+20] and Feldman and Zhang [FZ20] separately define a model's memorization of training instance z_i as the pointwise influence of z_i on itself. Formally,

I_Mem(z_i) := I(z_i, z_i).    (4)

(4) Cook's distance measures the effect of training instances on the model parameters themselves [Coo77, Eq. (5)]. Formally, the pointwise Cook's distance of z_i ∈ D is the change in the trained model parameters induced by deleting z_i, i.e., θ^(T) − θ^(T)_{D\i} (5). Eq. (5) trivially extends to groups of training instances, where for any D′ ⊆ D the group Cook's distance is θ^(T) − θ^(T)_{D\D′}. Cook's distance is particularly relevant for interpretable model classes where feature weights are most transparent. This includes linear regression [RL87; Woj+16] and decision trees [BHL22].

Table 1: Influence Analysis Method Comparison: Comparison of the complexity, assumptions, strengths, and limitations of the seven primary influence estimators detailed in Sections 4 and 5. Recall that n, p, and T are the training-set size, model parameter count, and training iteration count, respectively. LOO is the leave-one-out influence. "Full time complexity" denotes the time required to calculate influence for the first test instance; "incremental complexity" is the added time required for each subsequent test instance. "Storage complexity" represents the amount of memory required to persistently save any additional model parameters needed by the influence method. For HyDRA and the three retraining-based methods, the storage complexity is implementation dependent and is marked with an asterisk. We report each method's worst-case storage complexity; for the retraining-based methods, this worst-case storage complexity yields the best-case incremental complexity. Static, gradient-based estimators (influence functions and representer point) require no additional storage. Differentiability encompasses both model f and loss function ℓ, except in the case of representer point where only the loss function must be differentiable. All criteria below are discussed in detail alongside the description of each method.
(5) All definitions of influence above consider training instances' effects w.r.t. a single instantiation of a model. Across repeated stochastic retrainings, a training instance's influence may vary -potentially substantially [BPF21]. Expected influence is the average influence across all possible instantiations within a given model class [KS21]. Expected influence is particularly useful in domains where the random component of training is unknowable a priori. For example, with poisoning and backdoor attacks, an adversary crafts malicious training instances to be highly influential in expectation across all random parameter initializations and batch orderings [Che+17; Sha+18a; Fow+21].
(6) Observe that all preceding definitions view influence as a specific numerical value to measure/estimate. Influence analysis often simplifies to a relative question of whether one training instance is more influential than another. An influence ranking orders (groups of) training instances from most positively influential to most negatively influential. These rankings are useful in a wide range of applications [KZ22], including data cleaning and poisoning attack defenses as discussed in Section 6.
With this broad perspective on influence analysis in mind, we transition to focusing on specific influence analysis methods in the next two sections.

Retraining-Based Influence Analysis
Training instances can only influence a model if they are used during training. As Section 3.1.2 describes in the context of linear regression, one method to measure influence is simply to train a model with and without some instance; influence is then defined as the difference in these two models' behavior. This basic intuition is the foundation of retraining-based influence analysis, and this simple formulation can be applied to any model class, parametric or non-parametric.
Observe that the retraining-based framework makes no assumptions about the learning environment. In fact, this simplicity is one of the primary advantages of retraining-based influence. For comparison, Table 1 shows that all gradient-based influence estimators make strong assumptions -some of which are known not to hold for deep models (e.g., convexity). However, retraining's flexibility comes at the expense of high (sometimes prohibitive) computational cost.
Below, we describe three progressively more complex retraining-based influence analysis methods. Each method mitigates weaknesses of the preceding method, in particular by devising techniques to make retraining-based influence more viable computationally.
Remark 1: This section treats model training as deterministic where, given a fixed training set, training always yields the same output model. Since the training of modern models is mostly stochastic, retraining-based estimators should be represented as expectations over different random initializations and batch orderings. Therefore, (re)training should be repeated multiple times for each relevant training (sub)set with a probabilistic average taken over the valuation metric [Lin+22]. For simplicity of presentation, expectation over randomness is dropped from the influence and influence estimator definitions below.
Remark 2: Section 2 defines D as a supervised training set. The three primary retraining-based influence analysis methods detailed below also generalize to unsupervised and semi-supervised training.
Remark 3: When calculating retraining's time complexity below, each training iteration's time complexity is treated as a constant cost. This makes the time complexity of training a single model O(T ). Depending on the model architecture and hyperparameter settings, a training iteration's complexity may directly depend on training-set size n or model parameter count p.
Remark 4: It may be possible to avoid full model retraining by using machine unlearning methods capable of certifiably "forgetting" training instances [Guo+20; BL21; Ngu+22; Eis+22]. The asymptotic complexity of such methods is model-class specific and beyond the scope of this work. Nonetheless, certified deletion methods can drastically reduce the overhead of retraining-based influence analysis.

Leave-One-Out Influence
Leave-one-out (LOO) is the simplest influence measure described in this work. LOO is also the oldest, dating back to Cook and Weisberg [CW82] who term it case deletion diagnostics.
As its name indicates, leave-one-out influence is the change in z_te's risk due to the removal of a single instance, z_i, from the training set [KL17; BF21; Jia+21a]. Formally,

I_LOO(z_i, z_te) := L(z_te; θ^(T)_{D\i}) − L(z_te; θ^(T)),    (7)

where θ^(T)_{D\i} are the final parameters of a model trained on D \ z_i. Measuring the entire training set's LOO influence requires training (n + 1) models. Given a deterministic model class and training algorithm (e.g., convex model optimization [BV04]), LOO is one of the few influence measures that can be computed exactly in polynomial time w.r.t. training-set size n and iteration count T.
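A minimal brute-force sketch of Eq. (7), assuming a deterministic, hypothetical train(D) routine that returns final parameters and a loss(z, theta) function computing the risk L(z; θ); D is a Python list of training instances:

```python
def loo_influence(train, loss, D, z_te):
    """Leave-one-out influence of every z_i in D on test instance z_te (Eq. (7)).
    `train` must be deterministic for the estimate to be exact."""
    theta_full = train(D)                        # one model on all of D
    base_risk = loss(z_te, theta_full)
    influences = []
    for i in range(len(D)):                      # n additional retrainings
        theta_minus_i = train(D[:i] + D[i + 1:])
        # Positive influence: removing z_i increases z_te's risk.
        influences.append(loss(z_te, theta_minus_i) - base_risk)
    return influences
```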

Time, Space, and Storage Complexity
Training a single model has time complexity O(T ) (see Remark 3). By additivity, training (n + 1) models has total time complexity O(nT ). Since these (n + 1) models are independent, they can be trained in parallel.
Pointwise influence analysis always has space complexity of at least O(n), i.e., the space taken by the n influence values I(z_i, z_te) for all i ∈ [n]. Training a single model has space complexity O(p), where p := |θ|; this complexity scales linearly with the number of models trained in parallel. Table 1 treats the level of training concurrency as a constant factor, which is why LOO's total space complexity is listed as O(n + p).
A naive implementation of LOO would train the n additional models and immediately discard them after measuring z_te's test loss. This simple version of LOO has O(1) storage complexity. If instead the (n + 1) models are stored, analysis of subsequent test instances requires no additional retraining. This drastically reduces LOO's incremental time complexity for subsequent instances to just O(n) forward passes, a huge saving. Note that this amortization of the retraining cost induces an O(np) storage complexity, as listed in Table 1.

Strengths and Weaknesses
Leave-one-out influence's biggest strength is its simplicity. LOO is human-intelligible -even by laypersons. For that reason, LOO has been applied to ensure the fairness of algorithmic decisions [BF21].
LOO's theoretical simplicity comes at the price of huge upfront computational cost. Training some state-of-the-art models from scratch even once is prohibitive for anyone beyond industrial actors [Dis+21]. For even the biggest players, it is impractical to train (n + 1) such models given huge modern datasets [BPF21]. The climate effects of such retraining also cannot be ignored [SGM20].
LOO's simple definition in Eq. (7) is premised on deterministic training. However, even when training on the same data, modern models may have significant predictive variance for a given test instance [BF21]. This variance makes it difficult to disentangle the effect of an instance's deletion from training's intrinsic variability [BPF21]. For a single training instance, estimating the LOO influence within a standard deviation of σ requires training Ω(1/σ²) models. Therefore, estimating the entire training set's LOO influence requires training Ω(n/σ²) models, further exacerbating LOO's computational infeasibility [FZ20].

Related Methods
Although LOO has poor upfront and storage complexities in general, it can be quite efficient for some model classes, particularly instance-based learners [AKA91]. For example, Jia et al. [Jia+21a] propose the kNN leave-one-out (kNN LOO) estimator, which calculates the LOO influence over a surrogate k-nearest neighbors classifier instead of over target model f. kNN LOO relies on a simple two-step process. First, the features of test instance z_te and training set D are extracted using a pretrained model. Next, the LOO influence is calculated over a kNN classifier fit to those extracted features.

In addition, Sharchilev et al. [Sha+18b] propose LeafRefit, an efficient LOO estimator for decision-tree ensembles. LeafRefit's efficiency derives from the simplifying assumption that instance deletions do not affect the trees' structure. In cases where this assumption holds, LeafRefit's tree influence estimates are exact. To the best of our knowledge, LeafRefit's suitability for surrogate influence analysis of deep models has not yet been explored.
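For intuition, the sketch below computes the surrogate kNN LOO influence by brute force (0/1 risk under a majority vote). It illustrates the quantity kNN LOO estimates rather than Jia et al.'s efficient algorithm; feats_train, y_train, feat_te, and y_te are hypothetical arrays of penultimate-layer features and labels produced by a pretrained feature extractor.

```python
import numpy as np

def knn_loo_influence(feats_train, y_train, feat_te, y_te, k=5):
    """Brute-force LOO influence over a surrogate kNN classifier (0/1 risk)."""
    def knn_risk(feats, labels):
        dists = np.linalg.norm(feats - feat_te, axis=1)
        neigh = labels[np.argsort(dists)[:k]]
        # 0/1 risk: 1 if the majority vote over the k neighbors is wrong.
        values, counts = np.unique(neigh, return_counts=True)
        return float(values[np.argmax(counts)] != y_te)

    base = knn_risk(feats_train, y_train)
    influences = []
    for i in range(len(y_train)):
        mask = np.arange(len(y_train)) != i            # drop training instance z_i
        influences.append(knn_risk(feats_train[mask], y_train[mask]) - base)
    return influences
```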
Furthermore, for least-squares linear regression, Wojnowicz et al. [Woj+16] show that each training instance's LOO influence on the model parameters (i.e., Cook's distance) can be efficiently estimated by mapping training set D into a lower-dimensional subspace. By the Johnson-Lindenstrauss lemma [JL84], these influence sketches approximately preserve the pairwise distances between the training instances in D provided the projected dimension is on the order of log n.
To measure group influence, leave-one-out can be extended to leave-m-out for any integer m ≤ n.
Leave-m-out has time complexity O(C(n, m)) (C denoting the binomial coefficient), which is exponential in the worst case. Shapley value influence [Sha53] (Sec. 4.3) shares significant similarity with leave-m-out.
As mentioned above, LOO influence serves as the reference influence value for multiple influence estimators including Downsampling, which we describe next.

Downsampling
Proposed by Feldman and Zhang [FZ20], Downsampling 7 mitigates leave-one-out influence's two primary weaknesses: (1) computational complexity dependent on n and (2) instability due to stochastic training variation.
Downsampling relies on an ensemble of K submodels, each trained on a u.a.r. subset of full training set D of size m. 8 Denote the k-th such subset D_k ∼_m D, with corresponding final submodel parameters θ^(T)_{D_k}, and define K_i := |{k ∈ [K] : z_i ∈ D_k}| as the number of submodels that used instance z_i during training. The Downsampling pointwise influence estimator 9 is then

Î_Down(z_i, z_te) := (1 / (K − K_i)) Σ_{k : z_i ∉ D_k} L(z_te; θ^(T)_{D_k}) − (1 / K_i) Σ_{k : z_i ∈ D_k} L(z_te; θ^(T)_{D_k}).    (8)

Intuitively, Eq. (8) is the change in z_te's average risk when z_i is not used in submodel training. By holding out multiple instances simultaneously and then averaging, each Downsampling submodel provides insight into the influence of all training instances. This allows Downsampling to require (far) fewer retrainings than LOO.
Since each of the K training subsets is drawn i.i.d., Eq. (8) converges to the expected leave-one-out influence (over random size-m training subsets) as K grows. Hence, Downsampling is a statistically consistent estimator of the expected LOO influence. This means that Downsampling does not estimate the influence of training instance z_i on a single model instantiation. Rather, Downsampling estimates z_i's influence on the training algorithm and model architecture as a whole. By considering influence in expectation, Downsampling addresses LOO's inaccuracy caused by stochastic training's implicit variance.
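Given K already-trained submodels, the Downsampling estimate itself is a simple difference of averages. The sketch below computes Eq. (8) (as reconstructed above) from hypothetical inputs: submodel_losses holds z_te's risk under each submodel and membership records which training instances each submodel used.

```python
import numpy as np

def downsampling_influence(submodel_losses, membership, i):
    """Downsampling estimate of z_i's influence on z_te (Eq. (8)).
    submodel_losses: shape (K,) risk of z_te under each submodel.
    membership:      shape (K, n) boolean; membership[k, i] means z_i was in D_k.
    Assumes z_i appears in at least one but not all of the K subsets."""
    losses = np.asarray(submodel_losses, dtype=float)
    used = np.asarray(membership, dtype=bool)[:, i]
    # Average risk of submodels trained WITHOUT z_i minus those trained WITH z_i.
    return losses[~used].mean() - losses[used].mean()
```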
In practice, K, n, and m are finite. Nonetheless, Feldman and Zhang [FZ20, Lemma 2.1] prove that, with high probability, Downsampling's LOO influence estimation error is bounded given K and m/n.

Remark 5: Downsampling trains models on data subsets of size m under the assumption that statements made about those models generalize to models trained on a dataset of size n (i.e., all of D). This assumption may not hold for small m. To increase the likelihood this assumption holds, Feldman and Zhang propose fixing m/n = 0.7. This choice balances satisfying the aforementioned assumption against the number of submodels Downsampling requires since K and ratio m/n combined dictate K_i, the number of submodels that are trained on z_i.

7 Feldman and Zhang [FZ20] do not specify a name for their influence estimator. Previous work has referred to Feldman and Zhang's method as "subsampling" [BHL22] and as "counterfactual influence" [Zha+21b]. We use "Downsampling" to differentiate Feldman and Zhang's method from the existing, distinct task of dataset subsampling [TB18] while still emphasizing the methods' reliance on repeated training-set sampling.
8 Feldman and Zhang [FZ20] propose setting m = 0.7n.
9 Feldman and Zhang [FZ20] define their estimator specifically for classification. Downsampling's definition in Eq. (8) uses a more general form to cover additional learning tasks such as regression. Feldman and Zhang's original formulation would be equivalent to defining the risk as L(z_te; θ^(T)) := 1 − 1[f(x_te; θ^(T)) = y_te], i.e., the accuracy subtracted from one.

Time, Space, and Storage Complexity
Downsampling's complexity analysis is identical to that of LOO (Sec. 4.1.1) except, instead of the time and storage complexities being dependent on training-set size n, Downsampling depends on submodel count K. For perspective, Feldman and Zhang's empirical evaluation used K = 2,000 for ImageNet [Den+09] (n > 14M) as well as K = 4,000 for MNIST [LeC+98] (n = 60,000) and CIFAR10 [KNH14] (n = 50,000), a savings of one to four orders of magnitude over vanilla LOO.
Remark 6: Downsampling's incremental time complexity is technically O(K + n) ∈ O(n) since pointwise influence is calculated w.r.t. each training instance. The time complexity analysis above focuses on the difference in the number of forward passes required by LOO and Downsampling; fewer forward passes translate to Downsampling being much faster than LOO in practice.

Strengths and Weaknesses
Although more complicated than LOO, Downsampling is still comparatively simple to understand and implement. Downsampling makes only a single assumption that should generally hold in practice (see Remark 5). Downsampling is also flexible and can be applied to most applications.
While Downsampling's formulation above is w.r.t. a single training instance, Downsampling trivially extends to estimate the expected group influence of multiple training instances. Observe, however, that the expected fraction of u.a.r. training subsets that either contain all instances in a group or none of a group decays geometrically (with base m/n and (1 − m/n), respectively) as the group size grows. Therefore, for large group sizes, K needs to be exponential in n to cover all group combinations.
Another strength of Downsampling is its low incremental time complexity. Each test example requires only K forward passes. These forward passes can use large batch sizes to further reduce the per-instance cost. This low incremental cost allows Downsampling to be applied at much larger scales than other methods. For example, Feldman and Zhang [FZ20] measure all pointwise influence estimates for the entire ImageNet dataset [Den+09] (n > 14M). These large-scale experiments enabled Feldman and Zhang to draw novel conclusions about neural training dynamics -including that training instance memorization (4) by overparameterized models is not a bug, but a feature, that is currently necessary to achieve state-of-the-art generalization results.
In terms of weaknesses, while Downsampling is less computationally expensive than LOO, Downsampling still has a high upfront computational cost. Training multiple models may be prohibitively expensive even when K n. Amortization of this upfront training overhead across multiple test instances is beneficial but by no means a panacea.

Related Methods
Downsampling has two primary related methods.
First, training instance memorization also occurs in generative models where the generated outputs are (nearly) identical copies of training instances [KWR22]. van den Burg and Williams [vW21] extend Downsampling to deep generative models -specifically, density models (e.g., variational autoencoders [KW14;RMW14]) that estimate posterior probability p(x|P, θ), where P denotes the training data distribution.
Like Downsampling, van den Burg and Williams's approach relies on training multiple submodels. The primary difference is that van den Burg and Williams consider generative risk, i.e., the instance's negative log-likelihood under the trained model, rather than a supervised test loss. Beyond that, van den Burg and Williams's method is the same as Downsampling as both methods consider the LOO influence (8).
Downsampling's second closely related method is Jiang et al.'s [Jia+21b] consistency profile, defined formally as

C_{D,m}(z_te) := E_{D′ ∼_m D}[ −L(z_te; θ^(T)_{D′}) ].    (14)

By negating the risk in Eq. (14), a higher expected risk corresponds to a lower consistency profile. Consistency profile differs from Downsampling in two ways.
(1) Downsampling implicitly considers a single submodel training-set size m while consistency profile disentangles the estimator from m.
(2) Downsampling estimates z i 's influence on z te while consistency profile considers all of D as a group and estimates the expected group influence of a random subset D ⊆ D given m := |D|.
Jiang et al. [Jia+21b] also propose the consistency score (C-score), defined formally as the expectation of the consistency profile over the training-set size, where m is drawn uniformly from set [n]. By taking the expectation over all training-set sizes, C-score provides a total ordering over all test instances. A large C-score entails that z_te is harder for the model to confidently predict. Large C-scores generally correspond to rare/atypical test instances from the tails of the data distribution. Since Downsampling considers the effect of each training instance individually, Downsampling may be unable to identify these hard-to-predict test instances, in particular if m is large enough to cover most data distribution modes.
The next section introduces the Shapley value, which in essence merges the ideas of Downsampling and C-score.

Shapley Value
Derived from cooperative game theory, Shapley value (SV) quantifies the increase in value when a group of players cooperates to achieve some shared objective [Sha53; SR88]. Given n total players, characteristic function ν : 2^[n] → R defines the value of any player coalition A ⊆ [n] [Dub75]. By convention, a larger ν(A) is better. Formally, player i's Shapley value w.r.t. ν is

SV(i) := (1/n) Σ_{A ⊆ [n] \ i} C(n−1, |A|)^{−1} ( ν(A ∪ i) − ν(A) ),    (16)

where C(n−1, |A|) is the binomial coefficient. Ghorbani and Zou [GZ19] adapt SV to model training by treating the n instances in training set D as n cooperative players with the shared objective of training the "best" model. For any A ⊆ D, let

ν(A) := −L(z_te; θ^(T)_A),

where the negation is needed because more "valuable" training subsets have lower risk. Then, z_i's Shapley value pointwise influence on z_te is

I_SV(z_i, z_te) := (c/n) Σ_{A ⊆ D \ z_i} C(n−1, |A|)^{−1} ( ν(A ∪ z_i) − ν(A) ),    (17)

where c is a constant, multiplicative scaling factor. More intuitively, SV is the weighted change in z_te's risk when z_i is added to a random training subset; the weighting ensures all training subset sizes (|A|) are prioritized equally. Eq. (17) can be viewed as generalizing the leave-one-out influence, where rather than considering only full training set D, Shapley value averages the LOO influence across all possible subsets of D.
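The following sketch evaluates Eq. (16) exactly by enumerating every coalition; it is exponential in n and therefore only usable for toy problems. value_fn is a hypothetical stand-in for the characteristic function ν; for influence estimation it would train a model on the given subset of D and return the negated test risk −L(z_te; θ^(T)_A).

```python
from itertools import combinations
from math import comb

def exact_shapley(n, value_fn):
    """Exact Shapley value of each player i in [n] (Eq. (16)); O(2^n) calls to value_fn."""
    players = list(range(n))
    sv = [0.0] * n
    for i in players:
        others = [j for j in players if j != i]
        for size in range(len(others) + 1):
            for A in combinations(others, size):
                # Uniform weight over coalition sizes, then over coalitions of that size.
                weight = 1.0 / (n * comb(n - 1, size))
                sv[i] += weight * (value_fn(set(A) | {i}) - value_fn(set(A)))
    return sv
```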

Time, Space, and Storage Complexity
Deng and Papadimitriou [DP94] prove that computing Shapley values is #P-complete. Therefore, in the worst case, SV requires exponential time to determine exactly, assuming P ≠ NP. There has been significant follow-on work to develop tractable SV estimators, many of which are reviewed in Sec. 4.3.3.
The analysis of SV's space, storage, and incremental time complexities follows that of LOO (Sec. 4.1.1) with the exception that SV requires up to 2^n models, not just n models as with LOO.

Strengths and Weaknesses
Among all influence analysis methods, SV may have the strongest theoretical foundation with the chain of research extending back several decades. SV's dynamics and limitations are well understood, providing confidence in the method's quality and reliability. In addition, SV makes minimal assumptions about the nature of the cooperative game (i.e., model to be trained), meaning SV is very flexible. This simplicity and flexibility allow SV to be applied to many domains beyond dataset influence as discussed in the next section.
SV is provably a linear operation [Sha53; Dub75]. Given any two data valuation metrics ν and ν′, it holds that SV_{ν+ν′}(i) = SV_ν(i) + SV_{ν′}(i). SV's additivity property makes it possible to estimate both pointwise and joint SV influences without repeating any data collection [GZ19].
Furthermore, by evaluating training sets of different sizes, SV can detect subtle influence behavior that is missed by methods like Downsampling and LOO, which evaluate a single training-set size. Lin et al. [Lin+22] demonstrate this phenomenon empirically by showing that some backdoor training instances are best detected with small SV training subsets.
Concerning weaknesses, SV's computational intractability is catastrophic for non-trivial dataset sizes [KZ22]. For that reason, numerous (heuristic) SV speed-ups have been proposed, with the most prominent ones detailed next.

Related Methods
Ghorbani and Zou [GZ19] propose two SV estimators. First, truncated Monte Carlo Shapley (TMC-Shapley) relies on randomized subset sampling from training set D. As a simplified description of the algorithm, TMC-Shapley relies on random permutations of D; for simplicity, denote the permutation ordering z_1, . . . , z_n. For each permutation, n models are trained where the i-th model's training set is instances {z_1, . . . , z_i}. To measure each z_i's marginal contribution for a given permutation, TMC-Shapley compares the performance of the (i − 1)-th and i-th models, i.e., the models trained on datasets {z_1, . . . , z_{i−1}} and {z_1, . . . , z_i}, respectively. TMC-Shapley generates additional training-set permutations and trains new models until the SV estimates converge. Ghorbani and Zou's second estimator, gradient Shapley (G-Shapley), further reduces the cost of each marginal-contribution measurement by approximating each model's training with a single gradient-descent update per training instance, i.e., one pass through the permuted data.

Since TMC-Shapley and G-Shapley rely on heuristics and assumptions to achieve tractability, neither method provides approximation guarantees. In contrast, Jia et al. [Jia+19a] prove that for k-nearest neighbors classification, SV pointwise influence can be calculated exactly in O(n lg n) time. Formally, SV's characteristic function for kNN classification is

ν_kNN(A) := Σ_{z_j ∈ Neigh(x_te; A)} 1[y_j = y_te],    (20)

where Neigh(x_te; A) is the set of k neighbors in A nearest to x_te and 1[·] is the indicator function. Each training instance either has no effect on Eq. (20)'s value (z_i ∉ Neigh(x_te; A) or y_i ≠ y_te), or it increases the value by one (z_i ∈ Neigh(x_te; A) and y_i = y_te).
Assuming the training instances are sorted by increasing distance from x_te (i.e., x_1 is closest to x_te and x_n is furthest), z_i's pointwise kNN Shapley influence can be computed exactly via a simple recursion that starts from the furthest instance x_n and works inward toward x_1; see Jia et al. [Jia+19a] for the closed-form recurrence.

Recent work has also questioned the optimality of SV assigning uniform weight to each training subset size (see Eq. (16)). Counterintuitively, Kwon and Zou [KZ22] show theoretically and empirically that influence estimates on larger training subsets are more affected by training noise than influence estimates on smaller subsets. As such, rather than assigning all data subset sizes uniform weight, Kwon and Zou argue that smaller training subsets should be prioritized. Specifically, Kwon and Zou propose Beta Shapley, which modifies vanilla SV by weighting the training-set sizes according to a positively skewed (i.e., left-leaning) beta distribution.
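For intuition, the sketch below implements the permutation-sampling core of TMC-Shapley described above, omitting the truncation heuristic that gives the method its name. train and utility are hypothetical placeholders: train([]) is assumed to return a baseline (e.g., randomly initialized) model, and utility returns the negated test risk of a trained model.

```python
import random

def monte_carlo_shapley(D, train, utility, num_permutations=100):
    """Permutation-sampling Shapley estimate (TMC-Shapley minus truncation).
    Each permutation attributes marginal utility gains to the instance just added."""
    n = len(D)
    sv = [0.0] * n
    for _ in range(num_permutations):
        perm = random.sample(range(n), n)              # random ordering of D
        prev_value = utility(train([]))                # empty-training-set baseline
        subset = []
        for idx in perm:
            subset.append(D[idx])
            value = utility(train(list(subset)))       # retrain on the grown prefix
            sv[idx] += (value - prev_value) / num_permutations
            prev_value = value
    return sv
```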
SV has also been applied to study other types of influence beyond training set membership. For example, Neuron Shapley applies SV to identify the model neurons that are most critical for a given prediction [GZ20]. Lundberg and Lee's [LL17] SHAP is a very well-known tool that applies SV to measure feature importance. For a comprehensive survey of Shapley value applications beyond training data influence, see the work of Sundararajan and Najmi [SN20] and a more recent update by Rozemberczki et al. [Roz+22].

Gradient-Based Influence Estimation
For modern models, retraining even a few times to tune hyperparameters is very expensive. In such cases, it is prohibitive to retrain an entire model just to gain insight into a single training instance's influence.
For models trained using gradient descent, training instances only influence a model through training gradients. Intuitively then, training data influence should be measurable when the right training gradients are analyzed. This basic idea forms the basis of gradient-based influence estimation. As detailed below, gradientbased influence estimators rely on Taylor-series approximations or risk stationarity. These estimators also assume some degree of differentiability -either of just the loss function [Yeh+18] or both the model and loss [KL17; Pru+20; Che+21].
The exact analytical framework each gradient-based method employs depends on the set of model parameters considered [HL22]. Static, gradient-based methods, discussed first, estimate the effect of retraining by studying gradients w.r.t. final model parameters θ^(T). Obviously, a single set of model parameters provides limited insight into the entire optimization landscape, meaning static methods generally must make stronger assumptions. In contrast, dynamic, gradient-based influence estimators reconstruct the training data's influence by studying model parameters throughout training, e.g., θ^(0), . . . , θ^(T). Analyzing these intermediary model parameters makes dynamic methods more computationally expensive in general, but it enables dynamic methods to make fewer assumptions.
This section concludes with a discussion of a critical limitation common to all existing gradient-based influence estimators -both static and dynamic. This common weakness can cause gradient-based estimators to systematically overlook highly influential (groups of) training instances.

Static, Gradient-Based Influence Estimation
As mentioned above, static estimators are so named because they measure influence using only final model parameters θ (T ) . Static estimators' theoretical formulations assume stationarity (i.e., the model parameters have converged to a risk minimizer) and convexity.
Below we focus on two static estimators: influence functions [KL17] and representer point [Yeh+18]. The two methods take very different approaches to influence estimation, with the former being more general and the latter more scalable. Both estimators' underlying assumptions are generally violated in deep networks.

Influence Functions
Along with Shapley value (Sec. 4.3), Koh and Liang's [KL17] influence functions is one of the best-known influence estimators. The estimator derives its name from influence functions (also known as the infinitesimal jackknife [Jae72]) in robust statistics [Ham74]. These early statistical analyses consider how a model changes if training instance z_i's weight is infinitesimally perturbed by ε_i. More formally, consider the change in the empirical risk minimizer from θ^(T) to

θ^(T)_{ε_i} := argmin_θ (1/n) Σ_{z ∈ D} L(z; θ) + ε_i L(z_i; θ).    (22)

Under the assumption that model f and loss function ℓ are twice-differentiable and strictly convex, Cook and Weisberg [CW82] prove via a first-order Taylor expansion that

dθ^(T)_{ε_i} / dε_i |_{ε_i = 0} = −(H^(T)_θ)^{−1} ∇_θ L(z_i; θ^(T)),

where empirical risk Hessian H^(T)_θ is as defined in Section 2. Removing training instance z_i from D is equivalent to setting ε_i = −1/n, making the pointwise influence functions estimator

Î_IF(z_i, z_te) := (1/n) ∇_θ L(z_te; θ^(T))^⊤ (H^(T)_θ)^{−1} ∇_θ L(z_i; θ^(T)),

i.e., each training instance's pointwise influence is estimated via the alignment of its training gradient and the test gradient, preconditioned by the inverse risk Hessian. In practice, Koh and Liang estimate the product s_test := (H^(T)_θ)^{−1} ∇_θ L(z_te; θ^(T)) with Hessian-vector products (HVPs) rather than explicitly inverting the Hessian.

More generally, Basu et al. [BPF21] show that training hyperparameters also significantly affect influence functions' performance. Specifically, Basu et al. empirically demonstrate that model initialization, model width, model depth, weight-decay strength, and even the test instance being analyzed (z_te) all can negatively affect influence functions' LOO estimation accuracy. Their finding is supported by the analysis of Zhang and Zhang [ZZ22], who show that HVP estimation's accuracy depends heavily on the model's training regularizer, with HVP accuracy "pretty low under weak regularization." Bae et al. [Bae+22] also empirically analyze the potential sources of influence functions' fragility on deep models. Bae et al. identify five common error sources, the first three of which are the most important.
• Warm-start gap: Influence functions more closely resembles performing fine-tuning close to final model parameters θ^(T) than retraining from scratch (i.e., starting from initial parameters θ^(0)). This difference in starting conditions can have a significant effect on the LOO estimate.
• Proximity gap: The error introduced by the dampening term included in the HVP (s test ) estimation algorithm.
• Linearization error : The error induced by considering only a first-order Taylor approximation when deleting z i and ignoring the potential effects of curvature on I IF (z i , z te ) [BYF20].
• Solver error: General error introduced by the specific solver used to estimate s_test.

One approach to speeding up influence functions is to accelerate the estimation of s_test; for example, recent work scales influence functions to large transformer networks (e.g., ViT-L32 with 300M parameters [Dos+21]), which are orders of magnitude larger than the simple networks Koh and Liang [KL17] consider [Sch+22].
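As a minimal sketch under the strong assumption that the empirical risk Hessian can be formed and solved explicitly (feasible only for small p; in practice a dampened HVP-based estimate of s_test is used instead), the routine below computes the influence-functions estimate given hypothetical grad_fn and hessian_fn helpers:

```python
import numpy as np

def influence_function_estimate(grad_fn, hessian_fn, D, theta, z_te):
    """Influence-functions estimate for every training instance (small-p sketch).
    grad_fn(z, theta) -> gradient of L(z; theta); hessian_fn(D, theta) -> p x p Hessian."""
    n = len(D)
    H = hessian_fn(D, theta)                               # empirical risk Hessian
    # s_test = H^{-1} grad L(z_te; theta); explicit solve instead of HVP estimation.
    s_test = np.linalg.solve(H, grad_fn(z_te, theta))
    # Influence of z_i: (1/n) * grad L(z_i)^T H^{-1} grad L(z_te)  (H symmetric).
    return [float(grad_fn(z_i, theta) @ s_test) / n for z_i in D]
```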
Another approach to speed up influence functions is to specialize the estimator to model architectures with favorable computational properties. For example, Sharchilev et al.'s [Sha+18b] LeafInfluence method adapts influence functions to gradient boosted decision tree ensembles. By assuming a fixed tree structure and then focusing only on the trees' leaves, LeafInfluence's tree-based estimates are significantly faster than influence functions on deep models [BHL22].
A further major strength of influence functions is that it is one of the few influence analysis methods that has been studied beyond the pointwise domain. For example, Koh et al.'s [Koh+19] follow-on paper analyzes influence functions' empirical performance estimating group influence. In particular, Koh et al. [Koh+19] consider coherent training data subpopulations whose removal is expected to have a large, broad effect on the model. Even under naive assumptions of (pointwise) influence additivity, Koh et al. [Koh+19] observe that simply summing influence functions estimates tends to underestimate the true group influence. More formally, let D′ ⊆ D be a coherent training-set subpopulation; then it generally holds that

Σ_{z_i ∈ D′} Î_IF(z_i, z_te) ⪅ I(D′, z_te).

Nonetheless, influence functions' additive group estimates tend to have strong rank correlation w.r.t. subpopulations' true group influence. In addition, Basu et al. [BYF20] extend influence functions to directly account for subpopulation group effects by considering higher-order terms in influence functions' Taylor-series approximation.

Representer Point Methods

Representer-based methods rely on kernels, which are functions K : X × X → R that measure the similarity between two vectors [HSS08]. Schölkopf et al.'s representer theorem proves that the optimal solution to a class of L2-regularized functions can be reformulated as a weighted sum of the training data in kernel form. Put simply, representer methods decompose the predictions of specific model classes into the individual contributions (i.e., influence) of each training instance. This makes influence estimation a natural application of the representer theorem.
Consider regularized empirical risk minimization where the optimal parameters satisfy

θ* := argmin_θ (1/n) Σ_{z_i ∈ D} L(z_i; θ) + λ ||θ||²,    (31)

with λ > 0 the L2 regularization strength. Note that Eq. (31) defines minimizer θ* slightly differently than the last section's Eq. (22) since the representer theorem requires regularization.
Empirical risk minimizers are stationary points, meaning

∇_θ ( (1/n) Σ_{z_i ∈ D} L(z_i; θ*) + λ ||θ*||² ) = 0,

where 0 is the p-dimensional zero vector. The above simplifies to

(1/n) Σ_{z_i ∈ D} ∂L(z_i; θ*)/∂θ + 2λθ* = 0.    (33)

Solving Eq. (33) for the stationary parameters yields

θ* = −(1/(2λn)) Σ_{z_i ∈ D} ∂L(z_i; θ*)/∂θ.    (34)

For a linear model where f(x; θ) = θ^⊤ x =: ŷ, Eq. (34) further simplifies via the chain rule to

θ* = Σ_{z_i ∈ D} α_i x_i^⊤,  where  α_i := −(1/(2λn)) ∂ℓ(ŷ_i, y_i)/∂ŷ_i,

and ℓ is any once-differentiable loss function. Using Schölkopf et al.'s [SHS01] representer theorem kernelized notation, the training set's group influence on test instance z_te for any linear model is

Î(D, z_te) := Σ_{z_i ∈ D} K(x_i, x_te, α_i) |_{y_te},

where kernel function K(x_i, x_te, α_i) := α_i x_i^⊤ x_te returns a vector and K(x_i, x_te, α_i) |_{y_te} denotes the kernel value's y_te-th dimension. Then, z_i's pointwise linear representer point influence on z_te is

Î_RP(z_i, z_te) := K(x_i, x_te, α_i) |_{y_te}.

Extending Representer Point to Multilayer Models  Often, linear models are insufficiently expressive, with multilayer models used instead. In such cases, the representer theorem above does not directly apply. To work around this limitation, Yeh et al. [Yeh+18] rely on what they (later) term last layer similarity [Yeh+22]: all layers except the final linear (classification) layer are treated as a fixed feature extractor, and the linear representer decomposition above is applied to the final layer using the penultimate-layer feature representations. Hence, the majority of representer point's computation is forward-pass only and can be sped up using batching. This translates to representer point being very fast, several orders of magnitude faster than influence functions and Section 5.2's dynamic estimators [HL21].
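A minimal last-layer sketch of the representer decomposition above, assuming the final linear layer is a stationary point of the regularized objective; feats_train, loss_grad_train, and feat_te are hypothetical arrays holding penultimate-layer features and per-logit loss gradients:

```python
import numpy as np

def representer_values(feats_train, loss_grad_train, feat_te, weight_decay):
    """Per-instance representer contributions to the test logits (last-layer view).
    feats_train:     (n, d) penultimate-layer features of the training instances.
    loss_grad_train: (n, c) d loss / d logits for each training instance.
    Returns (n, c): row i is alpha_i * (x_i . x_te), z_i's contribution per class."""
    n = feats_train.shape[0]
    alphas = -loss_grad_train / (2.0 * weight_decay * n)   # alpha_i, shape (n, c)
    sims = feats_train @ feat_te                            # x_i . x_te, shape (n,)
    return alphas * sims[:, None]                           # kernel values per class
```

Row i's entry in the y_te column is z_i's pointwise representer influence on z_te; when θ* is truly a stationary point, summing each column over all training instances recovers the last layer's output for that class.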
However, representer point's simplicity comes at a cost. First, at the end of training, it is uncommon that a model's final linear layer has converged to a stationary point. Before applying their method, Yeh et al. [Yeh+18] recommend freezing all model layers except the final one and then fine-tuning the classification layer until convergence/stationarity. Without this extra fine-tuning, representer point's stationarity assumption does not hold, and poor influence estimation accuracy is expected. Beyond just complicating the training procedure itself, this extra fine-tuning also complicates comparison with other influence methods since it may require evaluating the approaches on different model parameters.
Moreover, by focusing exclusively on the model's final linear (classification) layer, representer point methods may miss influential behavior that is clearly visible in other layers. For example, Hammoudeh and Lowd [HL22] demonstrate that while some training-set attacks are clearly visible in a network's final layer, other attacks are only visible in a model's first layer, despite both attacks targeting the same model architecture and dataset. In their later paper, Yeh et al. [Yeh+22] acknowledge the disadvantages of considering only the last layer, writing "that choice critically affects the similarity component of data influence and leads to inferior results". Yeh et al. [Yeh+22] further state that the feature representations in the final layer, and by extension representer point's influence estimates, can be "too reductive." In short, Yeh et al.'s [Yeh+18] representer point method is highly scalable and efficient but is only suitable to detect behaviors that are obvious in the model's final linear layer. In addition, because α_i is defined directly in terms of training instance z_i's loss gradient, the resulting estimates are largely determined by each instance's loss; this leads to all test instances from a given class having near identical top-k influence rankings. A derivative method, representer point selection via a local Jacobian expansion (RPS-LJE), uses an alternate definition of α_i that is less influenced by a training instance's loss value, which enables RPS-LJE to generate more semantically meaningful influence rankings.

Dynamic, Gradient-Based Influence Estimation
All preceding influence methods, static gradient-based and retraining-based alike, define and estimate influence using only final model parameters θ^(T). These final parameters only provide a snapshot into a training instance's possible effect. Since neural network training is NP-complete [BR92], it can be provably difficult to reconstruct how each training instance affected the training process.
As an intuition, an influence estimator that only considers the final model parameters is akin to only reading the ending of a book. One might be able to draw some big-picture insights, but the finer details of the story are most likely lost. Applying a dynamic influence estimator is like reading a book from beginning to end. By comprehending the whole influence "story," dynamic methods can observe training data relationships -both fine-grained and general -that other estimators miss.
Since test instance $z_{te}$ may not be known before model training, in-situ influence analysis may not be possible. Instead, as shown in Alg. 1, intermediate model parameters $\Theta \subseteq \{\theta^{(0)}, \ldots, \theta^{(T-1)}\}$ are stored during training for post hoc influence analysis. Below we examine two divergent approaches to dynamic influence estimation - the first introduces a novel definition of influence while the second estimates leave-one-out influence with fewer assumptions than influence functions.

TracIn -Tracing Gradient Descent
Fundamentally, all preceding methods define influence w.r.t. changes to the training set. Pruthi et al. [Pru+20] take an orthogonal perspective. They treat training set D as fixed, and consider the change in model parameters as a function of time, or more precisely, the training iterations.
Vacuously, the training set's group influence on test instance $z_{te}$ is
$$\mathcal{I}(D, z_{te}) := L(z_{te}; \theta^{(0)}) - L(z_{te}; \theta^{(T)}). \qquad (42)$$
In words, training set $D$ causes the entire change in test loss between random initial parameters $\theta^{(0)}$ and final parameters $\theta^{(T)}$. Eq. (42) decomposes by training iteration $t$ as
$$\mathcal{I}(D, z_{te}) = \sum_{t=1}^{T} \big( L(z_{te}; \theta^{(t-1)}) - L(z_{te}; \theta^{(t)}) \big). \qquad (43)$$
Consider training a model with vanilla stochastic gradient descent, where each training minibatch $B^{(t)}$ is a single instance and gradient updates have no momentum [RHW86]. Here, each iteration $t$ has no effect on any other iteration beyond the model parameters themselves. Combining this with singleton batches enables attribution of each parameter change to a single training instance, namely whichever instance was in $B^{(t)}$. Under this regime, Pruthi et al. [Pru+20] define the ideal TracIn pointwise influence as
$$\mathcal{I}_{\text{TracIn-Ideal}}(z_i, z_{te}) := \sum_{t \,:\, B^{(t)} = \{z_i\}} L(z_{te}; \theta^{(t-1)}) - L(z_{te}; \theta^{(t)}), \qquad (44)$$
where the name "TracIn" derives from "tracing gradient descent influence." Eq. (43) under vanilla stochastic gradient descent then decomposes into the sum of all pointwise influences, i.e., $\mathcal{I}(D, z_{te}) = \sum_{z_i \in D} \mathcal{I}_{\text{TracIn-Ideal}}(z_i, z_{te})$. While the ideal TracIn influence has a strong theoretical motivation, its assumption of singleton batches and vanilla stochastic gradient descent is unrealistic in practice. To achieve reasonable training times, modern models train on batches of up to hundreds of thousands or millions of instances. Training on a single instance at a time would be far too slow [YGG17; Goy+17; Bro+20].
A naive fix to Eq. (44) to support non-singleton batches assigns the same influence to all instances in the minibatch; more formally, divide the change in loss $L(z_{te}; \theta^{(t-1)}) - L(z_{te}; \theta^{(t)})$ by batch size $|B^{(t)}|$ for each $z_i \in B^{(t)}$. This naive approach does not differentiate those instances in batch $B^{(t)}$ that had positive influence on the prediction from those that made the prediction worse.
Instead, Pruthi et al. [Pru+20] estimate the contribution of each training instance within a minibatch via a first-order Taylor approximation. Formally,
$$L(z_{te}; \theta^{(t)}) \approx L(z_{te}; \theta^{(t-1)}) + \nabla_\theta L(z_{te}; \theta^{(t-1)})^\top \big(\theta^{(t)} - \theta^{(t-1)}\big).$$
Under gradient descent without momentum, the change in model parameters is directly determined by the batch instances' gradients, i.e.,
$$\theta^{(t)} - \theta^{(t-1)} = -\frac{\eta^{(t)}}{|B^{(t)}|} \sum_{z_i \in B^{(t)}} \nabla_\theta L(z_i; \theta^{(t-1)}),$$
where $\eta^{(t)}$ is iteration $t$'s learning rate. Combining the two expressions and attributing to each $z_i$ its own gradient term yields the TracIn influence estimator
$$\mathcal{I}_{\text{TracIn}}(z_i, z_{te}) := \sum_{t \,:\, z_i \in B^{(t)}} \frac{\eta^{(t)}}{|B^{(t)}|} \, \nabla_\theta L(z_{te}; \theta^{(t-1)})^\top \nabla_\theta L(z_i; \theta^{(t-1)}). \qquad (48)$$
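Under the stated assumptions (SGD without momentum, stored per-iteration gradients), Eq. (48) reduces to a sum of weighted dot products. The sketch below is illustrative only - it presumes the per-instance training gradients and the test gradient at each retained iteration have been flattened into NumPy arrays, and the data-structure names are hypothetical.

```python
import numpy as np

def tracin_influence(train_grads, test_grads, batch_ids, lrs, i):
    """First-order TracIn estimate (Eq. (48)) of training instance i's influence.

    train_grads[t]: dict mapping instance index -> flattened gradient
                    of L(z_j; theta^(t-1)) for each j in minibatch B^(t)
    test_grads[t]:  flattened test gradient of L(z_te; theta^(t-1))
    batch_ids[t]:   indices of the instances in minibatch B^(t)
    lrs[t]:         learning rate eta^(t)
    """
    score = 0.0
    for t in range(len(batch_ids)):
        if i in batch_ids[t]:  # z_i only contributes in iterations where it was trained on
            score += (lrs[t] / len(batch_ids[t])) * float(test_grads[t] @ train_grads[t][i])
    return score
```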
A More "Practical" TracIn Training's stochasticity can negatively affect the performance of both ideal TracIn (44) and the TracIn influence estimator (48). As an intuition, consider when the training set contains two identical copies of some instance. All preceding gradient-based methods assign those two identical instances the same influence score. However, it is unlikely that those two training instances will always appear together in the same minibatch. Therefore, ideal TracIn almost certainly assigns these identical training instances different influence scores. These assigned scores may even be vastly different -by up to several orders of magnitude [HL22]. This is despite identical training instances always having the same expected TracIn influence.
Pruthi et al. [Pru+20] recognize randomness's effect on TracIn and propose the TracIn Checkpoint influence estimator (TracInCP) as a "practical" alternative. Rather than retrace all of gradient descent, TracInCP considers only a subset of the training iterations (i.e., checkpoints) $\mathcal{T} \subseteq [T]$. More importantly, at each checkpoint TracInCP computes the gradient of every training instance, regardless of whether that instance appeared in the corresponding minibatch:
$$\mathcal{I}_{\text{TracInCP}}(z_i, z_{te}) := \sum_{t \in \mathcal{T}} \eta^{(t)} \, \nabla_\theta L(z_{te}; \theta^{(t)})^\top \nabla_\theta L(z_i; \theta^{(t)}).$$
Observe that, unlike TracIn, TracInCP assigns identical training instances the same influence estimate. Therefore, TracInCP more closely estimates expected influence than TracIn. Pruthi et al. [Pru+20] use TracInCP over TracIn in much of their empirical evaluation. Other work has also shown that TracInCP routinely outperforms TracIn on many tasks [HL22]. Recall that, by definition, $|\mathcal{T}| \le T$, meaning TracInCP is asymptotically faster than TracIn. However, this is misleading. In practice, TracInCP is generally slower than TracIn, as Pruthi et al. note.
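As a rough sketch of the checkpoint-based variant (again with hypothetical array layouts), TracInCP evaluates every training instance's gradient at each stored checkpoint, so the estimate no longer depends on which minibatch an instance happened to land in:

```python
import numpy as np

def tracincp_influence(ckpt_train_grads, ckpt_test_grads, ckpt_lrs):
    """TracInCP estimate for all n training instances at once.

    ckpt_train_grads: list over checkpoints of (n, p) gradient matrices
    ckpt_test_grads:  list over checkpoints of (p,) test gradients
    ckpt_lrs:         learning rate associated with each checkpoint
    Returns an (n,) array: sum over checkpoints of eta * <grad z_i, grad z_te>.
    """
    scores = np.zeros(ckpt_train_grads[0].shape[0])
    for G, g_te, lr in zip(ckpt_train_grads, ckpt_test_grads, ckpt_lrs):
        scores += lr * (G @ g_te)
    return scores
```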
Since each gradient calculation is independent, TracIn and TracInCP are fully parallelizable. Table 1 treats the level of concurrency as a constant factor, making the space complexity of both TracIn and TracInCP O(n + p).
Lastly, as detailed in Alg. 1, dynamic influence estimators require that intermediate model parameters Θ be saved during training for post hoc influence estimation. In the worst case, each training iteration's parameters are stored resulting in a storage complexity of O(pT ). In practice however, TracIn only considers a small fraction of these T training parameter vectors, meaning TracIn's actual storage complexity is generally (much) lower than the worst case.

Strengths and Weaknesses
TracIn and TracInCP avoid many of the primary pitfalls associated with static, gradient-based estimators.
First, recall from Section 5.1.1 that the Hessian-vector product $s_{\text{test}}$ significantly increases the computational overhead and potential inaccuracy of influence functions. TracIn's theoretical simplicity avoids the need to compute any Hessian.
Second, representer point's theoretical formulation necessitates considering only a model's final linear layer, at the risk of (significantly) worse performance. TracIn has the flexibility to use only the final linear layer in scenarios where that provides sufficient accuracy as well as the option to use the full model gradient when needed.
Third, by measuring influence during the training process, TracIn requires no assumptions about stationarity or convergence. In fact, TracIn can be applied to a model that is only partially trained. TracIn can also be used to study when during training an instance is most influential. For example, TracIn can identify whether a training instance is most influential early or late in training.
Fourth, due to how gradient-based methods estimate influence, highly influential instances can actually appear uninfluential at the end of training. Unlike static estimators, dynamic methods like TracIn may still be able to detect these instances. See Section 5.3 for more details.
In terms of weaknesses, TracIn's theoretical motivation assumes stochastic gradient descent without momentum. However, momentum and adaptive optimization (e.g., Adam [KB15]) significantly accelerate model convergence [Qia99; DHS11; KB15]. To align more closely with these sophisticated optimizers, Eq. (48) and Alg. 2 would need to change significantly. For context, Section 5.2.2 details another dynamic estimator, HyDRA, which incorporates support for momentum alone, and even that results in a substantial increase in estimator complexity.
As a counter to the disadvantages of solely considering a model's last layer, TracIn's authors subsequently proposed TracIn word embeddings (TracInWE), which targets large language models and considers only the gradients in those models' word embedding layer [Yeh+22]. Since language-model word embeddings can still be very large (e.g., BERT-Base's word embedding layer has 23M parameters [Dev+19]), the authors specifically use the gradients of only those tokens that appear in both training instance $z_i$ and test instance $z_{te}$.
Pruthi et al. [Pru+20] also propose TracIn Random Projection (TracInRP) - a low-memory version of TracIn that provides unbiased estimates of $\mathcal{I}_{\text{TracIn}}$ (i.e., an estimate of an estimate). Intuitively, TracInRP maps gradient vectors into a $d$-dimensional subspace ($d \ll p$) via multiplication by a $d \times p$ random matrix whose entries are sampled i.i.d. from Gaussian distribution $\mathcal{N}(0, \frac{1}{d})$. These low-memory gradient "sketches" are used in place of the full gradient vectors in Eq. (48) [Woo14]. TracInRP is primarily targeted at applications where $p$ is sufficiently large that storing the full training set's gradient vectors ($\forall_{t,i}\, \nabla_\theta L(z_i; \theta^{(t-1)})$) is prohibitive.
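The random projection step can be sketched as follows; the $\mathcal{N}(0, 1/d)$ scaling makes dot products between sketches unbiased estimates of the original gradient dot products (function names are illustrative):

```python
import numpy as np

def make_projection(p: int, d: int, seed: int = 0) -> np.ndarray:
    """d x p Gaussian matrix with i.i.d. N(0, 1/d) entries, so that
    E[(P g1) . (P g2)] = g1 . g2 for any fixed gradients g1, g2."""
    rng = np.random.default_rng(seed)
    return rng.normal(loc=0.0, scale=np.sqrt(1.0 / d), size=(d, p))

def sketch_gradient(grad: np.ndarray, proj: np.ndarray) -> np.ndarray:
    """Compress a p-dimensional gradient into a d-dimensional sketch; the
    sketches replace the full gradients in the TracIn dot products."""
    return proj @ grad
```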
Note also that TracIn can be applied to any iterative, gradient-based model, including those that are non-parametric. For example, Brophy et al.'s [BHL22] BoostIn adapts TracIn for gradient-boosted decision tree ensembles.
TracIn has also been used outside of supervised settings. For example, Kong and Chaudhuri [KC21] apply TracIn to unsupervised learning, in particular density estimation; they propose variational autoencoder TracIn (VAE-TracIn), which quantifies the TracIn influence in β-VAEs [Hig+17]. Moreover, Thimonier et al.'s [Thi+22] TracIn anomaly detector (TracInAD) functionally estimates the distribution of influence estimates -using either TracInCP or VAE-TracIn. TracInAD then marks as anomalous any test instance in the tail of this "influence distribution".

HyDRA -Hypergradient Data Relevance Analysis
Unlike TracIn which uses a novel definition of influence (44), Chen et al.'s [Che+21] hypergradient data relevance analysis (HyDRA) estimates the leave-one-out influence (7). HyDRA leverages the same Taylor series-based analysis as Koh and Liang's [KL17] influence functions. The key difference is that HyDRA addresses a fundamental mismatch between influence functions' assumptions and deep models.
Section 5.1.1 explains that influence functions consider infinitesimally perturbing the weight of training sample $z_i$ by $\epsilon_i$. Recall that the change in $z_{te}$'s test risk w.r.t. this infinitesimal perturbation is
$$\frac{\partial L(z_{te}; \theta^{(T)}_{\epsilon_i})}{\partial \epsilon_i} = \nabla_\theta L(z_{te}; \theta^{(T)})^\top h_i^{(T)},$$
where $h_i^{(T)} := \frac{\partial \theta^{(T)}}{\partial \epsilon_i}$ is $z_i$'s hypergradient, i.e., the derivative of the final model parameters w.r.t. $z_i$'s training weight. Under vanilla gradient descent with weight decay, the parameters evolve as
$$\theta^{(t)} = \theta^{(t-1)} - \eta^{(t)} \big( g^{(t-1)} + \lambda \theta^{(t-1)} \big), \qquad (51)$$
where $g^{(t-1)}$ denotes iteration $t$'s batch gradient. The exact definition of gradient $g^{(t-1)}$ depends on the specific contents of batch $B^{(t)}$, so for simplicity, we encapsulate the batch's contribution to the gradient using the catch-all term $\nabla_\theta L(B^{(t)}; \theta^{(t-1)})$.
Using Eq. (51), hypergradient $h_i^{(T)}$ can be defined recursively: differentiating the update rule w.r.t. $\epsilon_i$ expresses $h_i^{(t)}$ in terms of $h_i^{(t-1)}$, iteration $t$'s empirical risk Hessian, and, when $z_i \in B^{(t)}$, $z_i$'s training gradient. This recursive definition of hypergradient $h_i^{(T)}$ needs to be unrolled all the way back to initial parameters $\theta^{(0)}$.
The key takeaway from Eq. (54) is that training hypergradients affect the model parameters throughout all of training. By assuming a convex model and loss, Koh and Liang's [KL17] simplified formulation ignores this very real effect. As Chen et al. [Che+21] observe, hypergradients often cause non-convex models to converge to a vastly different risk minimizer. By considering the hypergradients' cumulative effect, HyDRA can provide more accurate LOO estimates than influence functions on non-convex models -albeit via a significantly more complicated and computationally expensive formulation.
Unrolling Gradient Descent Hypergradients The exact procedure to unroll HyDRA's hypergradient $h_i^{(T)}$ is non-trivial. For the interested reader, supplemental Section C provides hypergradient unrolling's full derivation for vanilla gradient descent without momentum. Below, we briefly summarize Section C's important takeaways; Section C's full derivation can be skipped with minimal loss of understanding.

[Algorithm 4 - Fast HyDRA influence estimation for gradient descent without momentum. Input: training parameter set $\Theta$; final parameters $\theta^{(T)}$; training set size $n$; iteration count $T$; batches $B^{(1)}, \ldots, B^{(T)}$; learning rates $\eta^{(1)}, \ldots, \eta^{(T)}$; weight decay $\lambda$; training instance $z_i$; and test example $z_{te}$. Output: HyDRA influence estimate $\widehat{\mathcal{I}}_{\text{HyDRA}}(z_i, z_{te})$. The algorithm initializes the hypergradient to the zero vector, iterates over $t = 1, \ldots, T$ accumulating $z_i$'s contribution whenever $z_i \in B^{(t)}$, and returns the resulting influence estimate.]

Exact unrolling requires the empirical risk Hessian at every training iteration, which is intractable for large models. As a workaround, Chen et al. [Che+21] propose treating these risk Hessians as all zeros, proving that, under mild assumptions, the approximation error of this simplified version of HyDRA is bounded. Alg. 4 shows HyDRA's fast approximation algorithm without Hessians for vanilla gradient descent. (See HyDRA's original paper [Che+21] for the fast approximation algorithm with momentum.) After calculating the final hypergradient $h_i^{(T)}$, the influence estimate is obtained by taking the dot product of $h_i^{(T)}$ with the test gradient $\nabla_\theta L(z_{te}; \theta^{(T)})$.

Relating HyDRA and TracIn When $\lambda = 0$ or weight decay's effects are ignored (as done by TracIn), HyDRA's fast approximation for vanilla gradient descent simplifies to
$$\widehat{\mathcal{I}}_{\text{HyDRA}}(z_i, z_{te}) \approx \sum_{t \,:\, z_i \in B^{(t)}} \frac{\eta^{(t)}}{|B^{(t)}|} \, \nabla_\theta L(z_{te}; \theta^{(T)})^\top \nabla_\theta L(z_i; \theta^{(t-1)}). \qquad (57)$$
Eq. (57) is very similar to TracIn's definition in Eq. (48), despite the two methods estimating different definitions of influence (LOO vs. ideal TracIn (44)). The only difference between (57) and (48) is that HyDRA always uses the final test gradient $\nabla_\theta L(z_{te}; \theta^{(T)})$ while TracIn uses each iteration's test gradient $\nabla_\theta L(z_{te}; \theta^{(t-1)})$. The key takeaway is that, while theoretically different, HyDRA and TracIn are in practice very similar, where HyDRA can be viewed as trading (incremental) speed for lower precision w.r.t. $z_{te}$.
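The following sketch mirrors Alg. 4's Hessian-free approximation for vanilla SGD in the $\lambda = 0$ case of Eq. (57); it is an illustrative reading of the algorithm, not Chen et al.'s code, and the data-structure names are hypothetical:

```python
import numpy as np

def hydra_hypergradient(train_grads, batch_ids, lrs, i, p):
    """Hessian-free approximation of z_i's hypergradient: accumulate
    eta^(t)/|B^(t)| * grad L(z_i; theta^(t-1)) over iterations containing z_i.
    Computed once per training instance and reused for every test instance."""
    h = np.zeros(p)
    for t in range(len(batch_ids)):
        if i in batch_ids[t]:
            h += (lrs[t] / len(batch_ids[t])) * train_grads[t][i]
    return h

def hydra_influence(h_i: np.ndarray, final_test_grad: np.ndarray) -> float:
    """Fast HyDRA estimate: dot product with the test gradient at the *final*
    parameters theta^(T), per Eq. (57)."""
    return float(final_test_grad @ h_i)
```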
Observe that each hypergradient $h_i^{(T)}$ only needs to be computed once and can be reused for each test instance. Therefore, the fast and standard versions of HyDRA have an incremental time complexity of just $n$ gradient dot products - $O(np)$ complexity total. This incremental complexity is much faster than TracIn's and asymptotically equivalent to influence functions'. In practice though, HyDRA's incremental cost is much lower than that of influence functions.
Alg. 4 requires storing a hypergradient vector $h_i^{(T)}$ for each training instance, giving HyDRA a fully-parallelized space complexity of $O(np)$ versus the $O(n + p)$ of estimators like TracIn (see Table 1). This difference is substantial for large models and training sets. In cases where the fully-parallelized space complexity is prohibitive, each training instance's hypergradient can be analyzed separately, resulting in a reduced space complexity of $O(p)$ for both fast and standard HyDRA.
Like TracIn, HyDRA requires storing model parameters Θ ⊆ {θ (0) , . . . , θ (T −1) } making its minimum storage complexity O(pT ). Since hypergradients are reused for each test instance, they can be stored to eliminate the need to recalculate them; this introduces an additional storage complexity of O(np). This makes HyDRA's total storage complexity O(pT + np).
Remark 8: Storing both the training checkpoints and hypergradients is unnecessary. Once all hypergradients have been calculated, serialized training parameters Θ are no longer needed and can be discarded. Therefore, a more typical storage complexity is O(pT ) or O(np) -both of which are still substantial.

Strengths and Weaknesses
HyDRA and TracIn share many of the same strengths. For example, HyDRA does not require assumptions of convexity or stationarity. Moreover, as a dynamic method, HyDRA may be able to detect influential examples that are missed by static methods -in particular when those instances have low loss at the end of training (see Section 5.3 for more discussion).
HyDRA also has some advantages over TracIn. For example, as shown in Alg. 2, TracIn requires that each test instance be retraced through the entire training process. This significantly increases TracIn's incremental time complexity. In contrast, HyDRA only unrolls gradient descent for the training instances, i.e., not the test instances. Hypergradient unrolling is a one-time cost for each training instance; this upfront cost is amortized over all test instances. Once the hypergradients have been calculated, HyDRA is much faster than TracIn - potentially by orders of magnitude. In addition, HyDRA's overall design allows it to natively support momentum with few additional changes. Integrating momentum into TracIn, while theoretically possible, requires substantial algorithmic changes and makes TracIn considerably more complicated. This would undermine a core strength of TracIn - its simplicity.
HyDRA does have two weaknesses in comparison to TracIn. First, HyDRA's standard (i.e., non-fast) algorithm requires calculating many HVPs. Second, HyDRA's O(np) space complexity is much larger than the O(n + p) space complexity of other influence analysis methods (see Table 1). For large models, this significantly worse space complexity may be prohibitive.

Related Methods The method most closely related to HyDRA is Hara et al.'s [HNM19] SGD-influence. Both approaches estimate the leave-one-out influence by unrolling gradient descent using empirical risk Hessians. There are, however, a few key differences. First, unlike HyDRA, Hara et al. assume that the model and loss function are convex. Next, SGD-influence primarily applies unrolling to quantify the parameter change $\theta^{(T)} - \theta^{(T)}_{D \setminus i}$ (Cook's distance). To better align their approach with dataset influence, Hara et al. propose a surrogate (linear) influence estimator which they incrementally update throughout unrolling. This means the full training process must be unrolled for each test instance individually, significantly increasing SGD-influence's incremental time complexity.
Terashita et al. [Ter+21] adapt the ideas of SGD-influence to estimate training data influence in generative adversarial networks (GANs).
Although proposed exclusively in the context of influence functions (Sec. 5.1.1.3), Schioppa et al.'s [Sch+22] basic approach to scale up influence functions via faster Hessian calculation could similarly be applied to speed up HyDRA's standard (non-fast) algorithm.

Trade-off between Gradient Magnitude and Direction
This section details a limitation common to existing gradient-based influence estimators that can cause these estimators to systematically overlook highly influential (groups of) training instances.
Observe that all gradient-based methods in this section rely on some vector dot product. For a dot product to be large, one of two criteria must be met: (1) The vector directions align (i.e., have high cosine similarity). More specifically, for influence analysis, vectors pointing in similar directions are expected to encode similar information. This is the ideal case.
(2) Either vector has a large magnitude, e.g., $\|\nabla_\theta L(z; \theta)\|$. Large gradient magnitudes can occur for many reasons, but the most common cause is that the instance is either incorrectly or not confidently predicted. These two criteria follow from the decomposition shown below.
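The two criteria simply reflect the standard decomposition of a dot product into a magnitude term and a direction (cosine) term:
$$\nabla_\theta L(z_i;\theta)^\top \nabla_\theta L(z_{te};\theta) = \underbrace{\lVert \nabla_\theta L(z_i;\theta) \rVert \, \lVert \nabla_\theta L(z_{te};\theta) \rVert}_{\text{magnitude}} \cdot \underbrace{\cos\angle\big(\nabla_\theta L(z_i;\theta),\, \nabla_\theta L(z_{te};\theta)\big)}_{\text{direction}}$$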
Across the training set, gradient magnitudes can vary by several orders of magnitude [SWS21]. To overcome such a magnitude imbalance, training instances that actually influence a specific prediction may need to have orders of magnitude better vector alignment. In reality, what commonly happens is that incorrectly predicted or abnormal training instances appear highly influential to all test instances [SWS21]. Barshan et al. [BBD20] describe such training instances as globally influential. However, globally influential training instances provide very limited insight into individual model predictions. As Barshan et al. note, locally influential training instances are generally much more relevant and insightful when analyzing specific predictions.
To emphasize locally over globally influential training instances, Barshan et al. [BBD20] propose RelatIF (relative influence), which normalizes each training instance's influence functions estimate by that instance's overall (global) effect on the model. RelatIF's biggest limitation is the need to estimate an HVP for every training instance (note that this HVP is different than $s_{\text{test}} := (H_\theta^{(T)})^{-1} \nabla_\theta L(z_{te}; \theta^{(T)})$ in Eq. (28)). As discussed in Section 5.1.1.2, HVP estimation is expensive and often highly inaccurate in deep models, and Barshan et al. must work around these issues in their evaluation of RelatIF.

Separately, Hammoudeh and Lowd note that for many common loss functions (e.g., squared error, binary cross-entropy), the loss value $\ell(f(x; \theta), y)$ induces a strict ordering over the loss-gradient norm $\big\| \frac{\partial \ell(f(x;\theta), y)}{\partial f(x;\theta)} \big\|$. Hammoudeh and Lowd term this phenomenon a low-loss penalty, where confidently predicted training instances have smaller gradient magnitudes and, by consequence, consistently appear uninfluential to gradient-based influence estimators.
To account for the low-loss penalty, Hammoudeh and Lowd [HL22] propose renormalized influence, which replaces all gradient vectors - both training and test - in an influence estimator with the corresponding unit vectors. Renormalization can be applied to any gradient-based estimator. For example, Hammoudeh and Lowd observe that renormalized TracInCP, which they term gradient aggregated similarity (GAS), is particularly effective at generating influence rankings. Hammoudeh and Lowd also provide a renormalized version of influence functions. Since renormalized influence functions do not require estimating additional HVPs, they are considerably faster than RelatIF. Renormalized influence functions also do not have the additional error associated with estimating RelatIF's additional HVPs.

This section should not be interpreted to mean that gradient magnitude is unimportant for influence analysis. On the contrary, gradient magnitude has a significant effect on training. However, the approximations made by existing influence estimators often overemphasize gradient magnitude, leading to influence rankings that are not semantically meaningful.
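As a minimal sketch of renormalization (here applied to a TracInCP-style checkpoint sum, i.e., a GAS-like estimator; whether the learning-rate weighting is retained and other details follow the description above, and all names are illustrative):

```python
import numpy as np

def unit_rows(grads: np.ndarray) -> np.ndarray:
    """Replace each gradient (row) with its unit vector."""
    norms = np.linalg.norm(grads, axis=-1, keepdims=True)
    return grads / np.maximum(norms, 1e-12)      # guard against zero gradients

def renormalized_checkpoint_influence(ckpt_train_grads, ckpt_test_grads, ckpt_lrs):
    """Checkpoint-summed influence where every gradient - training and test -
    is replaced by its unit vector, removing the low-loss penalty."""
    scores = np.zeros(ckpt_train_grads[0].shape[0])
    for G, g_te, lr in zip(ckpt_train_grads, ckpt_test_grads, ckpt_lrs):
        g_te_unit = g_te / max(np.linalg.norm(g_te), 1e-12)
        scores += lr * (unit_rows(G) @ g_te_unit)
    return scores
```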

Applications of Influence Analysis
This section briefly reviews different settings where influence analysis has been applied. We focus on higher-level learning tasks as opposed to the specific application environments where influence analysis has been used, including toxic speech detection [HT21], social network graph labeling [Zha+21c], user engagement detection [LLY21], medical imaging annotation [Bra+22], etc.
First, data cleaning aims to improve a machine learning model's overall performance by removing "bad" training data. These "bad" instances arise due to disparate non-malicious causes including human/algorithmic labeling error, non-representative instances, noisy features, missing features, etc. [Kri+16; LDG18; KW19]. Intuitively, "bad" training instances are generally anomalous, and their features clash with the feature distribution of typical "clean" data [Woj+16]. In practice, overparameterized neural networks commonly memorize these "bad" instances to achieve zero training loss [HNM19; FZ20; Pru+20; Thi+22]. As explained in Section 3.2, memorization can be viewed as the influence of a training instance on itself. Therefore, influence analysis can be used to detect these highly memorized training instances. These memorized "bad" instances are then either removed from the training data or simply relabeled [KSH22] and the model retrained.
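One concrete (and hedged) instantiation of this cleaning pipeline scores each training instance by a checkpoint-based self-influence and flags the most memorized instances for inspection; the selection policy and names below are illustrative, not a prescribed method:

```python
import numpy as np

def checkpoint_self_influence(ckpt_train_grads, ckpt_lrs):
    """Self-influence of each training instance: sum over checkpoints of
    eta * ||grad L(z_i; theta^(t))||^2 (an instance's influence on itself)."""
    scores = np.zeros(ckpt_train_grads[0].shape[0])
    for G, lr in zip(ckpt_train_grads, ckpt_lrs):
        scores += lr * np.einsum("ij,ij->i", G, G)
    return scores

def flag_suspect_instances(self_influence: np.ndarray, k: int) -> np.ndarray:
    """Indices of the k most memorized instances - candidates to inspect,
    relabel, or remove before retraining."""
    return np.argsort(-self_influence)[:k]
```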
Poisoning and backdoor attacks craft malicious training instances that manipulate a model to align with some attacker objective. For example, a company may attempt to trick a spam filter so all emails sent by a competitor are erroneously classified as spam [Sha+18a]. Obviously, only influential (malicious) training instances affect a model's prediction. Some training set attacks rely on influence analysis to craft better (i.e., more influential) poison instances [FGL20; Jag+21; Oh+22]. Since most training set attacks do not assume the adversary knows training's random seed or even necessarily the target model's architecture, poison instances are crafted to maximize their expected group influence [Che+17].
Influence and memorization analysis have also been used to improve membership inference attacks, where the adversary attempts to extract sensitive training data provided only a pretrained (language) model [Dem+19; CG22].
Training set attack defenses detect and mitigate poisoning and backdoor attacks [Li+22]. Since malicious training instances must be influential to achieve the attacker's objective, defending against adversarial attacks reduces to identifying abnormally influential training instances. If attackers are constrained in the number of training instances they can insert [Wal+21], the target of a training set attack can be identified by searching for test instances that have a few exceptionally influential training instances [HL22]. The training set attack mitigation removes these anomalously influential instances from the training data and then retrains the model [Wan+19]. In addition, influence estimation has been applied to the related task of evasion attack detection, where the training set is pristine and only test instances are perturbed [CSG20].
Algorithmic fairness promotes techniques that enable machine learning models to make decisions free of prejudices and biases based on inherited characteristics such as race, religion, and gender [Meh+21]. A classic example of model unfairness is the COMPAS software tool, which estimated the recidivism risk of incarcerated individuals. COMPAS was shown to be biased against black defendants, falsely flagging them as future criminals at twice the rate of white defendants [Ang+16]. Widespread adoption of algorithmic decision making in domains critical to human safety and well-being is predicated on the public's perception and understanding of the algorithms' inherent ethical principles and fairness [Awa+18]. Yet, how to quantify the extent to which an algorithm is "fair" remains an area of active study [Dwo+12; GH19; Sax+19]. Black and Fredrikson [BF21] propose leave-one-out unfairness as a measure of a prediction's fairness. Intuitively, when a model's decision (e.g., not granting a loan, hiring an employee) is fundamentally changed by the inclusion of a single instance in a large training set, such a decision may be viewed as unfair or even capricious. Leave-one-out influence is therefore useful to measure and improve a model's robustness and fairness.
Explainability attempts to make a black-box model's decisions understandable by humans [BH21]. One class of explanations justifies a model's predictions via canonical, representative examples [Ren14]. Influence estimation can assist in the selection of canonical training instances that are particularly important for a given class in general or a single test prediction specifically. Similarly, normative explanations - which collectively establish a "standard" for a given class [CJH19] - can be selected from those training instances with the highest average influence on a held-out validation set. In cases where a test instance is misclassified, influence analysis can identify those training instances that most influenced the misprediction.
Subsampling reduces the computational requirements of large datasets by training models using only a subset of the training data [TB18]. Existing work has shown that high-quality training subsets can be created by greedily selecting training instances based on their overall influence [Kha+18;Wan+20]. Under mild assumptions, Wang et al. [Wan+20] even show that, in expectation, influence-based subsampling performs at least as well as training on the full training set.
Annotating unlabeled data can be expensive -in particular for domains like medical imaging where the annotators must be domain experts [Bra+22]. Compared to labeling instances u.a.r., active learning reduces labeling costs by prioritizing annotation of particularly salient unlabeled data. In practice, active learning often simplifies to maximizing the add-one-in influence where each unlabeled instance's marginal influence must be estimated. Obviously, retraining for each possible unlabeled instance combination has exponential complexity and is intractable. Instead, a greedy strategy can be used where the influence of each unlabeled instance is estimated to identify the next candidate to label [Liu+21; Jia+21a; Zha+21c].
To enhance the benefit of limited labeled data, influence analysis has been used to create better augmented training data [Lee+20;Oh+21]. These influence-guided data augmentation methods outperform traditional random augmentations, albeit with a higher computational cost.

Future Directions
The trend of consistently increasing model complexity and opacity will likely continue for the foreseeable future. Simultaneously, there are increased societal and regulatory demands for algorithmic transparency and explainability. Influence analysis sits at the nexus of these competing trajectories [Zho+19], which points to the field growing in importance and relevance. This section identifies important directions we believe influence analysis research should take going forward.
Emphasizing Group Influence over Pointwise Influence: Most existing methods target pointwise influence, which apportions credit for a prediction to training instances individually. However, for overparameterized models trained on large datasets, only the tails of the data distribution are heavily influenced by an individual instance [Fel20]. Instead, most predictions are moderately influenced by multiple training instances working in concert [FZ20; Das+21; BYF20].
As an additional complicating factor, pointwise influence within data-distribution modes is often approximately supermodular, where the marginal effect of a training instance's deletion increases as more instances from a group are removed [HL22]. This makes pointwise influence a particularly poor choice for understanding most model behavior. To date, very limited work has systematically studied group influence [Koh+19; BYF20; HL22]. Better group influence estimators could be immediately applied in various domains such as poisoning attacks, core-set selection, and model explainability.
Certified Influence Estimation: Certified defenses against poisoning and backdoor attacks guarantee that deleting a fixed number of instances from the training data will not change a model's prediction [SKL17; LF21; Jia+22; WLF22; HL23]. These methods can be viewed as upper bounding the training data's group influence -albeit very coarsely. Most certified poisoning defenses achieve their bounds by leveraging "tricks" associated with particular model architectures (e.g., instance-based learners [Jia+22] and ensembles [LF21; WLF22]) as opposed to a detailed analysis of a prediction's stability [HL23]. With limited exception [Jia+19a], today's influence estimators do not provide any meaningful guarantee of their accuracy. Rather, most influence estimates should be viewed as only providing -at best -guidance on an instance's "possible influence." Guaranteed or even probabilistic bounds on an instance's influence would enable influence estimation to be applied in settings where more than a "heuristic approximation" is required [HL23].
Improved Scalability: Influence estimation is slow. Analyzing each training instance's influence on a single test instance can take several hours or more [BBD20; Kob+20; Guo+21; HL22]. For influence estimation to be a practical tool, it must be at least an order of magnitude faster. Heuristic influence analysis speed-ups could prove very useful [Guo+21;Sch+22]. However, the consequences (and limitations) of any empirical shortcuts need to be thoroughly tested, verified, and understood. Similarly, limited existing work has specialized influence methods to particular model classes [Jia+21a] or data modalities [Yeh+22]. While application-agnostic influence estimators are useful, their flexibility limits their scalability and accuracy. Both of these performance metrics may significantly improve via increased influence estimator specialization.
Surrogate Influence and Influence Transferability: An underexplored opportunity to improve influence analysis lies in the use of surrogate models [Sha+18b; Jia+19b; Jia+21a; BHL22]. For example, linear surrogates have proven quite useful for model explainability [LL17]. While using only a model's linear layer as a surrogate may be "overly reductive" [Yeh+22], it remains an open question whether other compact surrogate models are a viable option. Any surrogate method must be accompanied by rigorous empirical evaluation to identify any risks and "blind spots" the surrogate may introduce [Rud19].
Increased Evaluation Diversity: Influence analysis has the capability to provide salient insights into why models behave as they do [FZ20]. As an example, Black and Fredrikson [BF21] demonstrate how influence analysis can identify potential unfairness in an algorithmic decision. However, influence estimation evaluation is too often superficial and focuses on a very small subset of possible applications. For instance, most influence estimation evaluation focuses primarily on contrived data cleaning and mislabeled training data experiments [Woj+16; KL17; Kha+18; GZ19; Yeh+18; Pru+20; Che+21; Ter+21; KS21; SWS21; BHL22; Yeh+22; KSH22; KZ22]. It is unclear how these experiments translate into real-world or adversarial settings, with recent work pointing to generalization fragility [BPF21;Bae+22]. We question whether these data cleaning experiments -where specialized methods already exist [Kri+16; KW19; Wan+19] -adequately satisfy influence analysis's stated promise of providing "understanding [of] black-box predictions" [KL17].
Objective Over Subjective Evaluation Criteria: A common trope when evaluating an influence analysis method is to provide a test example and display training instances the estimator identified as most similar or dissimilar. These "eye test" evaluations are generally applied to vision datasets [KL17; Yeh+18; Jia+19a; Pru+20; FZ20] and to a limited extent other modalities. Such experiments are unscientific. They provide limited meaningful insight given the lack of a ground truth by which to judge the results. Most readers do not have detailed enough knowledge of a dataset to know whether the selected instances are especially representative. Rather, there may exist numerous training instances that are much more similar to the target that the influence estimator overlooked. Moreover, such visual assessments are known to be susceptible to confirmation and expectancy biases [Mah77; NDM13; KDK13].
Influence analysis evaluation should focus on experiments that are quantifiable and verifiable w.r.t. a ground truth.

Conclusions
While influence analysis has received increased attention in recent years, significant progress remains to be made. Influence estimation is computationally expensive and can be prone to inaccuracy. Going forward, fast certified influence estimators are needed. Nonetheless, despite these shortcomings, existing applications already demonstrate influence estimation's capabilities and promise.
This work reviews numerous methods with different perspectives on -and even definitions of -training data influence. It would be a mistake to view this diversity of approaches as a negative. While no single influence analysis method can be applied to all situations, most use cases should have at least one method that fits well. An obvious consequence then is the need for researchers and practitioners to understand the strengths and limitations of the various methods so as to know which method best fits their individual use case. This survey is intended to provide that insight from both empirical and theoretical viewpoints.