Incremental permutation feature importance (iPFI): towards online explanations on data streams

Explainable artificial intelligence has mainly focused on static learning scenarios so far. We are interested in dynamic scenarios where data is sampled progressively, and learning is done in an incremental rather than a batch mode. We seek efficient incremental algorithms for computing feature importance (FI). Permutation feature importance (PFI) is a well-established model-agnostic measure to obtain global FI based on feature marginalization of absent features. We propose an efficient, model-agnostic algorithm called iPFI to estimate this measure incrementally and under dynamic modeling conditions including concept drift. We prove theoretical guarantees on the approximation quality in terms of expectation and variance. To validate our theoretical findings and the efficacy of our approaches in incremental scenarios dealing with streaming data rather than traditional batch settings, we conduct multiple experimental studies on benchmark data with and without concept drift.


Introduction
Online learning from dynamic data streams is a prevalent machine learning (ML) approach for various application domains (Bahri et al. 2021). For instance, predicting energy consumption for individual households can foster energy-saving strategies such as load-shifting. Concept drift resulting from environmental changes, such as pandemic-induced lock-downs, drastically impacts energy consumption patterns, necessitating online ML (García-Martín et al. 2019). Explaining these predictions yields a greater understanding of an individual's energy use and enables prescriptive modeling for further energy-saving measures (Wastensteiner et al. 2021). For black-box machine learning methods, so-called post-hoc XAI methods seek to explain single predictions or entire models in terms of the contribution of specific features (Adadi and Berrada 2018). In this paper, we are interested in feature importance (FI) as a global assessment of features, which indicates their respective relevance to the given task and model. A prominent representative of global FI is the permutation feature importance (PFI) (Breiman 2001), which, in its original form, requires a holistic view of the entire dataset in a static batch learning environment. More generally, explainable artificial intelligence (XAI) has been studied mainly in the batch setting, where learning algorithms operate on static datasets. In scenarios where data does not fit into memory or computation time is strictly limited, as in progressive data science for big datasets (Turkay et al. 2018) or rapid online learning from data streams (Bahri et al. 2021), this assumption prohibits the use of traditional FI or XAI measures. Incremental, time- and memory-efficient implementations that provide anytime results have received much attention in recent years (Losing, Hammer, and Wersing 2018; Montiel et al. 2020). In this article, we are interested in efficient incremental algorithms for FI. Especially in the context of drifting data distributions, this task is particularly relevant, but also challenging, as many common FI methods are already computationally costly in the batch setting. Our contributions are as follows:
• We draw a connection between the model reliance measure (Fisher, Rudin, and Dominici 2019) and permutation tests to conclude that only properly scaled permutation tests are unbiased estimates of global FI.
• We introduce an incremental estimator for PFI (iPFI) with two sampling strategies to create marginal feature distributions in an incremental learning scenario.
• We provide theoretical guarantees regarding bias, variance, and approximation error, which can be controlled by a single sensitivity parameter, and analyze the estimation quality in the case of a static and an incrementally learned model.
• We implement iPFI and conduct experiments on its approximation quality compared to batch permutation tests, as well as its ability to efficiently provide anytime FI values under different types of concept drift.
All experiments and algorithms are publicly available and integrated into the well-known incremental learning framework river (Montiel et al. 2020).
Related Work. A variety of model-agnostic local FI methods (Ribeiro, Singh, and Guestrin 2016; Lundberg and Lee 2017; Lundberg et al. 2020; Covert and Lee 2021) exist that provide relevance values for single instances. In addition, model-specific variants have been proposed for neural networks (Bach et al. 2015; Selvaraju et al. 2017). PFI and its extensions (Molnar et al. 2020; König et al. 2021) are among global FI methods that provide relevance values across all instances. SAGE, a popular Shapley-based approach, has been proposed and compared with existing methods (Covert, Lundberg, and Lee 2020). As calculating FI values is computationally expensive, especially for Shapley-based methods, more efficient approaches such as FastSHAP (Jethani et al. 2021) have been introduced. Yet, none of the above methods and extensions natively supports an incremental or dynamic setting in which the underlying model and its FI can rapidly change due to concept drift.
An initial approach to explaining model changes by computing differences in FI with the help of drift detection methods is given by Muschalik et al. (2022). However, this does not constitute an incremental FI measure: the explanations are created with a time delay and without efficient anytime calculations. A first step towards anytime FI values has been proposed for online random forests, where separate test data from online bagging is used to compute changes in impurity and accuracy (Cassidy and Deviney 2014). However, this method is limited to online random forests and provides neither theoretical guarantees nor a truly incremental approach. Similar to batch learning, incremental FI is also relevant to the field of incremental feature selection, where FI is calculated periodically with a sliding window to retain features for the incrementally fitted model (Barddal et al. 2019; Yuan, Pfahringer, and Barddal 2018).
In this work, we provide a truly incremental FI measure whose time sensitivity can be controlled by a single smoothing parameter. Moreover, we establish necessary theoretical guarantees on its approximation quality.

Global Feature Importance
We consider a supervised learning scenario, where X is the feature space and Y the target space, e.g., X = R^d and Y = R or Y = {0, 1}. Let h : X → Y be a model, which is learned from a set or stream of observed data points z = (x, y) ∈ X × Y. Let D = {1, . . . , d} be the set of feature indices for the vector-wise feature representation x = (x^(i) : i ∈ D) ∈ X. Consider a subset S ⊂ D and its complement S̄ := D \ S, which partition the features, and denote x^(S) := (x^(i) : i ∈ S) as the feature subset of S for a sample x. We write h(x^(S̄), x^(S)) := h(x) to distinguish between features from S and S̄. For the basic setting, we assume that N observations are drawn independently and identically distributed (iid) from the joint distribution of unknown random variables (X, Y), and denote by P_S the marginal distribution of the features in S, i.e., the distribution of X^(S).
Feature importance refers to the relevance of a set of features S for a model h. To quantify FI, the key idea of measures such as PFI is to compare the model's performance when the features in S are "removed" (i.e., not provided to the model) with its performance when using all features in D = S ∪ S̄. The idea is that the removal of an important feature substantially decreases a model's performance. The model performance or risk is measured based on a norm ‖·‖ on Y, i.e., the risk of h is E[‖h(X) − Y‖]. As the model is trained on all features and retraining is computationally expensive, a common method to restrict h to the features in S̄ is to marginalize h over the features in S. We denote the marginalized risk
f_S(x^(S̄), y) := E_{X∼P_S}[‖h(x^(S̄), X) − y‖]. (1)
A popular way to define FI for a model h and a feature set S is to compare the marginalized risk with the model's inherent risk (Covert, Lundberg, and Lee 2020). For a model h and a subset S ⊂ D, the FI becomes
φ^(S)(h) := E_{(X,Y)}[f_S(X^(S̄), Y)] − E_{(X,Y)}[‖h(X) − Y‖].
This FI measures the increase in risk when the features in S are marginalized (Covert, Lundberg, and Lee 2020).
Empirical estimation of FI. Given observations (x_1, y_1), . . . , (x_N, y_N), we estimate the FI for a given model h with the canonical estimator
φ̂^(S)_ϕ := (1/N) Σ_{n=1}^{N} [‖h(x_n^(S̄), x_{ϕ(n)}^(S)) − y_n‖ − ‖h(x_n) − y_n‖], (2)
where ϕ : {1, . . . , N} → {1, . . . , N} represents the realization of a (possibly random) sampling strategy that decides for each observation which observation should be taken to approximate X^(S). Given the iid assumption, it is clear that, due to X_n ⊥ X_{n′} for n ≠ n′, the estimator is an unbiased estimator of the FI φ^(S)(h) if ϕ(n) ≠ n for all n = 1, . . . , N. In the case of ϕ(n) = n, the corresponding term in the sum is zero, as is its expectation, which implies E[φ̂^(S)_ϕ] ≤ φ^(S)(h) for any ϕ. We will now discuss a well-understood choice of feature subsets S ⊂ D and sampling strategy ϕ, and two estimators for φ^(S)(h).

Permutation Feature Importance (PFI)
A popular example of FI is the well-known PFI (Breiman 2001) that measures the importance of each feature j ∈ D by using a set S_j := {j}. More precisely, the FI for each feature j ∈ D is given by φ^(S_j) with sets S_j = {j} and their complements S̄_j = D \ {j}. The sampling strategy ϕ used in PFI samples uniformly generated permutations ϕ ∈ S_N over the set {1, . . . , N}, where each permutation has a probability of 1/N!.
(Empirical) PFI. Permutation tests, as proposed initially by Breiman (2001), effectively approximate E_ϕ[φ̂^(S_j)_ϕ] by averaging over M uniformly sampled random permutations. We introduce a scaled version of the initially proposed method as the PFI estimator, that is,
φ̂^(S_j) := (N / (N − 1)) · (1/M) Σ_{m=1}^{M} φ̂^(S_j)_{ϕ_m}, (3)
with ϕ_1, . . . , ϕ_M iid ∼ unif(S_N). As discussed above, the estimator φ̂^(S_j)_ϕ for a given ϕ is an unbiased estimator for the FI φ^(S_j)(h) if the permutation is a derangement. In the following, we show that the above estimator is an unbiased estimator of FI, in contrast to the original method without scaling.
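As an illustrative sketch (not the paper's implementation), the scaled estimator in (3) can be computed as follows; the absolute-error loss and the callable `model` interface are simplifying assumptions:

```python
import numpy as np

def pfi_permutation_estimate(model, X, y, j, M=10, rng=None):
    """Sketch of the scaled PFI estimator: average the risk difference over
    M uniformly sampled random permutations and rescale by N/(N-1), which
    corrects for the fixed points of a permutation contributing zero."""
    rng = np.random.default_rng(rng)
    N = len(X)
    base_err = np.abs(model(X) - y)          # ||h(x_n) - y_n|| per observation
    estimates = []
    for _ in range(M):
        perm = rng.permutation(N)            # uniformly random permutation
        X_perm = X.copy()
        X_perm[:, j] = X[perm, j]            # replace feature j by permuted values
        perm_err = np.abs(model(X_perm) - y)
        estimates.append((perm_err - base_err).mean())
    # the unique normalizing constant N/(N-1) makes the estimator unbiased
    return N / (N - 1) * float(np.mean(estimates))
```

For a model that depends only on feature 0, the estimate for feature 0 is large while the estimate for an unused feature is (numerically) zero.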
Expected PFI. The PFI estimator highly depends on the sampled permutations. Therefore, we take the expectation over ϕ to analyze its theoretical properties. We can show that the expectation is the model reliance φ̃^(S_j) := ê_switch − ê_orig, which compares the model error ê_orig = (1/N) Σ_{n=1}^{N} ‖h(x_n) − y_n‖ with the error of the model if averaged over all feature instantiations, ê_switch = (1 / (N(N − 1))) Σ_{n=1}^{N} Σ_{m≠n} ‖h(x_n^(S̄_j), x_m^(S_j)) − y_n‖. This quantity has been introduced and extensively studied by Fisher, Rudin, and Dominici (2019).
Theorem 1. The expected PFI (model reliance) can be rewritten as a normalized expectation over uniformly random permutations, i.e.,
φ̃^(S_j) = (N / (N − 1)) · E_{ϕ∼unif(S_N)}[φ̂^(S_j)_ϕ].
Due to space restrictions, all proofs are deferred to the supplementary material. Both ê_switch and ê_orig, as well as the estimator φ̃^(S_j), are U-statistics, which implies unbiasedness, asymptotic normality, and finite-sample bounds under weak conditions (Fisher, Rudin, and Dominici 2019). The variance can thus be directly computed, and it is easy to show that V[φ̃^(S_j)] = O(1/N), which by Chebyshev's inequality implies a bound on the approximation error, P(|φ̃^(S_j) − φ^(S_j)(h)| ≥ ε) = O(1/(N ε^2)). Hence, the approximation error of the expected PFI is directly controlled by the number of observations N used for computation. The link between permutation tests and the U-statistic φ̃^(S_j) was already discussed by Fisher, Rudin, and Dominici (2019, Appendix A.3), where it was shown that the sum over permutations without fixed points is proportional to ê_switch. The biased estimator (1/M) Σ_{m=1}^{M} φ̂^(S_j)_{ϕ_m} appears in (Breiman 2001; Fisher, Rudin, and Dominici 2019; Gregorutti, Michel, and Saint-Pierre 2017). However, to our knowledge, the unbiased version in (3) has not yet been introduced, and Theorem 1 directly yields the unique normalizing constant N/(N − 1), which ensures that the estimator is unbiased. In particular, Theorem 1 justifies averaging over repeatedly sampled realizations of ϕ in order to approximate the computationally prohibitive estimator φ̃^(S_j). In the following, we will pick up this notion when constructing an incremental FI estimator.

Incremental Permutation Feature Importance
We now consider a sequence of models (h_t)_{t∈N} from an incremental learning algorithm. At time t, the observed data is {(x_0, y_0), . . . , (x_t, y_t)}. The model is incrementally learned over time, such that at time t the observation (x_t, y_t) is used to update h_t to h_{t+1}. Our goal is to efficiently provide an estimate of PFI at each time step t for each feature j ∈ D using subsets S_j := {j}. Note that our results can immediately be extended to arbitrary feature subsets S ⊂ D.
In the following, we construct an efficient incremental estimator for PFI. We first discuss how (2) can be efficiently approximated in the incremental learning scenario, given a sampling strategy ϕ_t. In the sequel, we will rely on a random sampling strategy which is specifically suitable for the incremental setting and easier to implement than permutation-based approaches. Note that a permutation-based approach at time t is difficult to replicate in the incremental setting, as at time s < t not all samples until time t are available. As the model changes over time, naively computing (2) at each time step t using N previous observations results in N model evaluations per time step. Instead, we aim for an estimator that averages the terms in (2) over time rather than over multiple data points at one time step, i.e., we evaluate the current model only twice to compute the time-dependent quantity
λ^(S_j)_t(x_t, x_{ϕ_t}, y_t) := ‖h_t(x_t^(S̄_j), x_{ϕ_t}^(S_j)) − y_t‖ − ‖h_t(x_t) − y_t‖,
where ϕ_t : Ω → {0, . . . , t − 1} is a sampling strategy to select a previous observation. We propose to average these calculations over time by using exponential smoothing, i.e.

iPFI:
φ̂^(S_j)_t := (1 − α) · φ̂^(S_j)_{t−1} + α · λ^(S_j)_t(x_t, x_{ϕ_t}, y_t) (4)
for t > t_0, with φ̂^(S_j)_{t_0} := λ^(S_j)_{t_0}(x_{t_0}, x_{ϕ_{t_0}}, y_{t_0}) and α ∈ (0, 1). The parameter α is a hyperparameter that should be chosen based on the application. Note that a specific choice of α corresponds to a window size N, where α = 2/(N + 1) based on the well-known conversion formula, see e.g. (Nahmias and Olsen 2015, p. 73).
Algorithm 1: iPFI explanation at time t for feature j
Require: α ∈ (0, 1), sampling strategy ϕ_t, and φ̂^(S_j)_{t−1}
1: procedure EXPLAINONE(h_t, x_t, y_t, j)
2: sample a previous observation x_{ϕ_t} using ϕ_t
3: λ^(S_j)_t ← ‖h_t(x_t^(S̄_j), x_{ϕ_t}^(S_j)) − y_t‖ − ‖h_t(x_t) − y_t‖
4: φ̂^(S_j)_t ← (1 − α) · φ̂^(S_j)_{t−1} + α · λ^(S_j)_t
5: ϕ_{t+1} ← UpdateSampler(ϕ_t, x_t)
6: end procedure
Given a realization ϕ_s and the observations, λ^(S_j)_s is an unbiased estimate of φ^(S_j)(h_s). We further require ϕ_s ⊥ (X, Y) and denote ϕ_s : Ω → {0, . . . , s − 1} with
p_{s,r} := P(ϕ_s = r), (5)
for s = t_0, . . . , t to select previous observations. Note that t_0 > 0 is the first time step where φ̂^(S_j)_t can be computed, as we need previous observations for the sampling process. In the following, we assume that the sampling strategy (ϕ_s)_{t_0≤s≤t} is fixed and clear from the context, and thus omit the dependence in φ̂^(S_j)_t. We illustrate one explanation step at time t in Algorithm 1. This directly corresponds to (3) with M = 1 and can be extended to M > 1 by repeatedly running the procedure in parallel and averaging the results. Next, we discuss two possible sampling strategies.
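A minimal sketch of the incremental update for one feature may help; the class name, the absolute-error loss, and the simple list-based reservoir are illustrative assumptions rather than the paper's implementation:

```python
import random

class IncrementalPFI:
    """Sketch of one iPFI explainer for feature j: at each time step the
    current model is evaluated twice -- once on the original observation and
    once with feature j replaced by a sampled previous value -- and the risk
    difference is exponentially smoothed with parameter alpha."""

    def __init__(self, j, alpha=0.001, reservoir_size=100, seed=None):
        self.j = j
        self.alpha = alpha            # smoothing parameter, alpha in (0, 1)
        self.reservoir = []           # stored previous values of feature j
        self.L = reservoir_size
        self.phi = None               # current iPFI estimate
        self.rng = random.Random(seed)

    def explain_one(self, model, x, y):
        if self.reservoir:
            x_perm = list(x)
            x_perm[self.j] = self.rng.choice(self.reservoir)  # sample past value
            delta = abs(model(x_perm) - y) - abs(model(x) - y)
            # exponential smoothing: phi_t = (1 - alpha) * phi_{t-1} + alpha * delta
            self.phi = delta if self.phi is None else \
                (1 - self.alpha) * self.phi + self.alpha * delta
        # reservoir update: fill, then replace a uniformly chosen slot
        if len(self.reservoir) < self.L:
            self.reservoir.append(x[self.j])
        else:
            self.reservoir[self.rng.randrange(self.L)] = x[self.j]
        return self.phi
```

Running several such explainers in parallel and averaging their estimates corresponds to the M > 1 extension described above.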

Incremental Sampling Strategies ϕ
Since random permutations cannot easily be realized in an incremental setting, as they would require knowledge of future events, we now present two alternative types of sampling strategies. We formalize (ϕ_s)_{t_0≤s≤t} to choose the previous observation r at time s for the calculation in λ^(S_j)_s. To do so, we will specify the probabilities p_{s,r} in (5).

Uniform Sampling
In uniform sampling, we assume that each previous observation is equally likely to be sampled at time s, i.e., p_{s,r} = 1/s for s = t_0, . . . , t and r = 0, . . . , s − 1. It can be naively implemented by storing all previous observations and uniformly sampling at each time step. However, when memory is limited, it can be implemented with histograms for features of known cardinality. For others, a reservoir of fixed length can be maintained, known as reservoir sampling (Vitter 1985). The probability of a new observation to be included in the reservoir then decreases over time. Clearly, observations are drawn independently, but can be sampled more than once. In a data stream scenario, where changes to the underlying data distribution occur over time, the uniform sampling strategy may be inappropriate, and sampling strategies that prefer recent observations may be better suited.
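A memory-bounded implementation of the uniform strategy can be sketched with classic reservoir sampling; the class structure is an illustrative assumption:

```python
import random

class ReservoirSampler:
    """Sketch of uniform sampling under a fixed memory budget via reservoir
    sampling (Vitter 1985): after seeing s observations, each of them is in
    the length-L reservoir with probability L/s, so drawing uniformly from
    the reservoir approximates p_{s,r} = 1/s."""

    def __init__(self, L, seed=None):
        self.L = L
        self.reservoir = []
        self.seen = 0
        self.rng = random.Random(seed)

    def update(self, value):
        self.seen += 1
        if len(self.reservoir) < self.L:
            self.reservoir.append(value)
        else:
            # keep the new value with probability L / seen
            k = self.rng.randrange(self.seen)
            if k < self.L:
                self.reservoir[k] = value

    def sample(self):
        return self.rng.choice(self.reservoir)
```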
Geometric Sampling

Geometric sampling arises from the idea of maintaining a reservoir of size L, which is updated at each time step by randomly replacing a reservoir observation with the newly observed one. Until time t_0, the first L observations are stored in the reservoir. At each sampling step (t ≥ t_0), an observation is uniformly chosen from the reservoir with probability p := 1/L. Independently, a sample from the reservoir is selected with the same probability p := 1/L for replacement with the new observation. The resulting probabilities are of the geometric form p_{s,r} = p(1 − p)^{s−r−1} for r ≥ t_0 and p_{s,r} = p(1 − p)^{s−t_0} for r < t_0. Clearly, the geometric sampling strategy yields increasing probabilities for more recent observations, and we demonstrate in our experiments that this can be beneficial in scenarios with concept drift.
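The geometric strategy can be sketched as follows; the class structure is again an illustrative assumption:

```python
import random

class GeometricSampler:
    """Sketch of the geometric sampling strategy: a length-L reservoir in
    which, at every time step, a uniformly chosen slot is overwritten by the
    new observation. A stored value survives each step with probability
    1 - 1/L, giving geometrically decaying sampling probabilities
    p_{s,r} = (1/L)(1 - 1/L)^{s-r-1} for older observations."""

    def __init__(self, L, seed=None):
        self.L = L
        self.reservoir = []
        self.rng = random.Random(seed)

    def update(self, value):
        if len(self.reservoir) < self.L:
            self.reservoir.append(value)     # fill phase until t_0 = L
        else:
            self.reservoir[self.rng.randrange(self.L)] = value

    def sample(self):
        return self.rng.choice(self.reservoir)
```

After many updates, the reservoir is dominated by recent observations (expected age on the order of L steps), which is exactly the recency preference motivated above.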

Theoretical Results of Estimation Quality
The estimator φ̂^(S_j)_t picks up the notion of the PFI estimator φ̂^(S_j) in (3), which approximates the expectation over the random sampling strategy (ϕ_s)_{t_0≤s≤t} by averaging repeated realizations. While φ̂^(S_j)_t only considers one realization of the sampling strategy, it is easy to extend the approach in the incremental learning scenario by computing the estimator φ̂^(S_j)_t in multiple separate runs in parallel. While this yields an efficient estimate of PFI, it is difficult to analyze the estimator theoretically, as each estimator highly depends on the realizations of the sampling strategy. We thus again study the expectation over the sampling strategy and introduce the expected iPFI φ̄^(S_j)_t, similar to the expected PFI (model reliance) φ̃^(S_j). To evaluate the estimation quality, we will analyze the bias |E[φ̄^(S_j)_t] − φ^(S_j)(h)| and the variance V[φ̄^(S_j)_t]. Both can be combined by Chebyshev's inequality to obtain bounds on the approximation error of φ^(S_j)(h), i.e.,
P(|φ̄^(S_j)_t − φ^(S_j)(h)| ≥ ε) ≤ (V[φ̄^(S_j)_t] + |E[φ̄^(S_j)_t] − φ^(S_j)(h)|²) / ε². (6)
As noted above, all proofs are deferred to the supplementary material. Our theoretical results are stated and proven in a general manner, which allows one to extend our approach to other sampling strategies, other feature subsets, and even other aggregation techniques.
Static Model. Given iid observations from a data stream, we consider an incremental model that learns over time. We begin under the simplified assumption that the model does not change over time, i.e., h_t ≡ h for all t.
Theorem 2 (Bias for static model). If h_t ≡ h, then the bias |E[φ̄^(S_j)_t] − φ^(S_j)(h)| decreases geometrically in t.
From the above theorem, it is clear that the bias of the expected iPFI φ̄^(S_j)_t decreases exponentially towards zero for t → ∞, and we thus continue to study the asymptotic estimator lim_{t→∞} φ̄^(S_j)_t. While the bias does not depend on the sampling strategy, our next result analyzes the variance of the asymptotic estimator, which does depend on the sampling strategy.
The variance is therefore directly controlled by the choice of parameters α and p. As the asymptotic estimator is unbiased, it is clear that these parameters control the approximation error, as shown in (6).
Changing Model. So far, we discussed properties of φ̄^(S_j)_t under the simplified assumption that h_t does not change over time. In an incremental learning scenario, h_t is updated incrementally at each time step. In cases where no concept drift affects the underlying data-generating distribution, we can assume that an incremental learning algorithm gradually converges to an optimal model. We thus assume that the change of the model is controlled and show results similar to the case where h_t is static. To control model change formally, we introduce quantities ∆_S and ∆ that measure the change of the model over time. We show that ∆_S and ∆ bound the difference of the FI of two models h_t and h_s as well as the bias of our estimator.
Theorem 4 (Bias for changing model). The bias |E[φ̄^(S_j)_t] − φ^(S_j)(h_t)| is bounded in terms of ∆_S and ∆.
In the case of a changing model, the estimator is therefore only asymptotically unbiased if h_t → h as t → ∞. For results on the variance, we control the variability of the models at different points in time. In the case of a static model, the covariances can be uniformly bounded, as they do not change over time. For a changing model, instead, we introduce a time-dependent function (9) bounding the covariances for t_0 ≤ s, s′ ≤ t, r < s and r′ < s′.
Theorem 5 (Variance for changing model). Given (9) for a sequence of models (h_t)_{t≥0}, the results of Theorem 3 apply.
Summary. We have shown that the approximation error of iPFI for FI is controlled by the parameters α and p. In the case of drifting data, the approximation error is additionally affected by the changes in the model, as it is then possibly biased and the covariances may change. As the expected PFI estimator has an approximation error of order O(1/N) for FI, we conclude that the above bounds on the approximation error of the expected iPFI are also valid when compared with the expected PFI, if α is chosen according to α = 2/(N + 1). In the next section, we corroborate our theoretical findings with empirical evaluations and showcase the efficacy of iPFI in scenarios with concept drift. We also elaborate on the differences between the two sampling strategies.
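The correspondence between a nominal window size N and the smoothing parameter α used throughout can be written as a one-line helper, sketched here for convenience:

```python
def alpha_from_window(N):
    """Convert a nominal window size N into the smoothing parameter via the
    standard exponential-smoothing correspondence alpha = 2 / (N + 1)."""
    return 2.0 / (N + 1)
```

For example, a window of about 200 observations corresponds to α ≈ 0.01, and a window of about 2000 observations to α ≈ 0.001, matching the conservative/reactive range used in the experiments.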

Experiments
We conduct multiple experimental studies to validate our theoretical findings and present our approach on real data. We consider three benchmark datasets, which are well-established in the FI literature (Covert, Lundberg, and Lee 2020; Lundberg and Lee 2017), one real-world data stream, and one synthetic data stream. As our approach is inherently model-agnostic, we present experimental results for different model types. As classification problems, we use adult (Kohavi 1996) with a Gradient Boosting Tree (GBT) (Friedman 2001) and bank (Moro, Cortez, and Laureano 2011) with a small 2-layer Neural Network (NN) with layer sizes (128, 64). As a regression problem, we use bike (Fanaee-T and Gama 2014) with LightGBM (LGBM) (Ke et al. 2017). The real-world electricity-price classification data stream mentioned in the introduction is called elec2 (Harries 1999). In the static case, an LGBM model performed best, and in the online setting, an Adaptive Random Forest classifier (ARF) (Gomes et al. 2017) was used. The synthetic data stream is constructed with the agrawal (Agrawal, Imielinski, and Swami 1993) classification data generator. Like elec2, an LGBM was used in the static scenario, and an ARF was applied in the dynamic setting. The models' and data streams' implementations are based on scikit-learn (Pedregosa et al. 2011), River (Montiel et al. 2020), and OpenML (Feurer et al. 2020). We mainly rely on default parameters, and the supplement contains detailed information about the datasets and applied models. In all our experiments, we compute the iPFI estimator φ̂^(S_j)_iPFI as the average over ten realizations φ̂^(S_j)_t of the incremental sampling strategies (uniform or geometric). All baseline approaches are chosen such that they require the same number of model evaluations as iPFI.
Experiment A: Approximation of Batch PFI

First, we consider the static model setting, where models are pre-trained before they are explained on the whole dataset (no incremental learning). This experiment demonstrates that iPFI correctly approximates batch PFI estimation. We compare iPFI with the classical batch PFI φ̂^(S_j)_batch for feature j ∈ D, which is computed using the whole static dataset over ten random permutations. We normalize φ̂^(S_j)_iPFI and φ̂^(S_j)_batch between 0 and 1 and compute the sum of the feature-wise absolute approximation errors; Table 1 shows the median and interquartile range (IQR) (difference between the first and third quartiles) of the error based on ten random orderings of each dataset. Figure 2 shows the approximation quality of iPFI with geometric and uniform sampling per feature for the bank dataset. In the static modeling case, there is no clear difference between geometric and uniform sampling. However, in the dynamic modeling context under drift, the sampling strategy has a substantial effect on the iPFI estimates.

Experiment B: Online PFI Calculation under Drift
In this experiment, we consider a dynamic modeling scenario. Here, instead of a pre-trained model, we fit ARF models incrementally on real data streams and compute iPFI on the fly. For the sake of clarity and simplicity, we only present results for ARF models here. However, as our approach is inherently model-agnostic, any incremental model (implemented, for example, in river) can be explained. As a baseline, we compare our approach to the interval PFI for feature j ∈ D, which computes the PFI over fixed time intervals during the online learning process with ten random permutations in each interval. This can be seen as a naive implementation of iPFI with large gaps of uncertainty and a substantial time delay. With the synthetic agrawal stream, we induce two kinds of real concept drift: First, we switch the classification function of the data generator, which we refer to as function-drift (changing the functional dependency but retaining the distribution of X). Second, we switch the values of two or more features with each other, which we refer to as feature-drift (changing the functional dependency by changing the distribution of X). Note that feature-drift can also be applied to datasets where the classification function is unknown. Figure 3 showcases how well iPFI reacts to both concept drift scenarios. Both concept drifts are induced in the middle of the data stream (after 10,000 samples). For the function-drift example (Figure 3, left), the agrawal classification function was switched from Agrawal, Imielinski, and Swami (1993)'s concept 1 to concept 2.
Theoretically, only two features should be important for both concepts: the first concept needs the pink salary and the purple age features, and for the second concept the classification function relies on the cyan education and the purple age features. However, the ARF model also relies on the blue commission feature, which can be explained as commission directly depends on salary. In the feature-drift scenario (Figure 3, right), the ARF model adapts to a sudden drift where both important features (education and age) are switched with two unimportant features (car and salary). In both scenarios, iPFI instantly detects the shifts in importance.
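The feature-drift construction described above (swapping the values of two feature columns from a drift point onward) can be sketched as follows; the function name and the numpy array representation are illustrative assumptions:

```python
import numpy as np

def induce_feature_drift(X, i, j, t_drift):
    """Sketch of the feature-drift construction: after time step t_drift,
    the values of features i and j are swapped, changing the distribution
    of X (and thus the functional dependency) while the labelling function
    itself stays untouched."""
    X_drift = X.copy()
    # fancy indexing evaluates the right-hand side first, so this swaps columns
    X_drift[t_drift:, [i, j]] = X_drift[t_drift:, [j, i]]
    return X_drift
```

This construction requires no knowledge of the classification function, which is why feature-drift can also be applied to real datasets such as elec2.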
From both simulations, it is clear that iPFI and its anytime computation have clear advantages over interval PFI. In fact, iPFI quickly reacts to changes in the data distribution while still closely matching the "ground-truth" results of the batch-interval computation. For further concept drift scenarios, we refer to the supplementary material.

Experiment C: Geometric vs. Uniform Sampling
Lastly, we focus on the question of which sampling strategy to prefer in which learning environment. We conclude that geometric sampling should be applied in feature-drift scenarios, as the choice of sampling strategy substantially impacts iPFI's performance in concept drift scenarios where feature distributions change. If a dynamic model adapts to changing feature distributions and the PFI is estimated with samples from the outdated distribution, the resulting replacement samples lie outside the current data manifold. Estimating PFI using this data can result in skewed estimates, as illustrated in Figure 4. There, we induce a feature-drift by switching the values of the most important feature for an ARF model on elec2 with a random feature. The uniform sampling strategy (Figure 4, left) is incapable of matching the "ground-truth" interval PFI estimation, unlike the geometric sampling strategy (Figure 4, right). Hence, in dynamic learning environments like data stream analytics or continual learning, we recommend applying a sampling strategy that focuses on more recent samples, such as geometric sampling. For applications without drift in the feature space, like progressive data science, uniform sampling strategies, which evenly distribute the probability of a data point being sampled across the data stream, may still be preferred.
For further experiments on different parameters, we again refer to the supplement. Therein, we show that the smoothing parameter α substantially affects iPFI's FI estimates. Like any smoothing mechanism, this parameter trades off the reactivity of iPFI's estimates against their variance and should be set individually for the task at hand. In our experiments, values between α = 0.001 (conservative) and α = 0.01 (reactive) appeared to be reasonable.

Conclusion and Future Work
In this work, we considered global FI as a statistical measure of the change in the model's risk when features are marginalized. We discussed PFI as an approach to estimate feature importance and proved that only appropriately scaled permutation tests are unbiased estimators. In this case, the expectation over the sampling strategy (expected PFI) corresponds to the model reliance U-statistic (Fisher, Rudin, and Dominici 2019).
Based on this notion, we presented iPFI, an efficient model-agnostic algorithm to incrementally estimate FI by averaging over repeated realizations of a sampling strategy. We introduced two incremental sampling strategies and established theoretical results for the expectation over the sampling strategy (expected iPFI) to control the approximation error using iPFI's parameters. On various benchmark datasets, we demonstrated the efficacy of our algorithms by comparing them with the batch PFI baseline method in a static progressive setting, as well as with interval-based PFI in a dynamic incremental learning scenario with different types of concept drift and parameter choices.
Applying XAI methods incrementally to data stream analytics offers unique insights into models that change over time. In this work, we rely on PFI as an established and inexpensive FI measure. Other, computationally more expensive approaches (such as SHAP) address some limitations of PFI. As our theoretical results can be applied to arbitrary feature subsets, analyzing these methods in the dynamic environment offers interesting research opportunities. In contrast to this work's technical focus, analyzing the dynamic XAI scenario through a human-focused lens with human-grounded experiments is paramount (Doshi-Velez and Kim 2017).

Technical Supplement Incremental Permutation Feature Importance (iPFI):
Towards Online Explanations on Data Streams

This is the technical supplement for the contribution Incremental Permutation Feature Importance (iPFI): Towards Online Explanations on Data Streams. The supplement contains proofs of all of our theoretical claims, a description of the datasets and models used, summary information on the contribution's experiments, and a variety of further experiments.

Proofs
In the following, we provide the proofs of all theorems. We further present more general results that are stated as propositions.
Theorem 6. The expected PFI (model reliance) can be rewritten as a normalized expectation over uniformly random permutations, i.e.,
φ̃^(S_j) = (N / (N − 1)) · E_{ϕ∼unif(S_N)}[φ̂^(S_j)_ϕ].
Proof. We write f(z_n, z_m) := ‖h(x_n^(S̄_j), x_m^(S_j)) − y_n‖ − ‖h(x_n) − y_n‖ and compute the expectation over randomly sampled permutations ϕ ∈ S_N. Each permutation has probability 1/N!, which yields
E_ϕ[φ̂^(S_j)_ϕ] = (1/N!) Σ_{ϕ∈S_N} (1/N) Σ_{n=1}^{N} f(z_n, z_{ϕ(n)}) = (1/N) Σ_{n=1}^{N} Σ_{m≠n} ((N − 1)!/N!) f(z_n, z_m) = (1/N²) Σ_{n=1}^{N} Σ_{m≠n} f(z_n, z_m) = ((N − 1)/N) · φ̃^(S_j),
where we used in the third line that there are (N − 1)! permutations with ϕ(n) = m, and that f(z_n, z_n) = 0. We thus conclude φ̃^(S_j) = (N/(N − 1)) · E_ϕ[φ̂^(S_j)_ϕ].
Theorem 7 (Bias for static model). If h ≡ h_t, then the bias of the expected iPFI decreases geometrically in t.
Proof. We consider the more general estimator φ̄^(S)_t := Σ_{s=t_0}^{t} w_s Σ_{r<s} p_{s,r} λ^(S)_s(x_s, x_r, y_s) and prove a more general result that can be used for arbitrary sampling and aggregation techniques.
Theorem 8 (Variance for static model). If h_t ≡ h, the variance of the asymptotic estimator is controlled by α and the collision probabilities of the sampling strategy.
Proof. We again consider the more general estimator φ̄^(S)_t = Σ_{s=t_0}^{t} w_s Σ_{r<s} p_{s,r} λ^(S)_s(x_s, x_r, y_s) and prove a result that can be used for arbitrary sampling and aggregation techniques.
Proposition 2. For ϕ from (5) with ϕ_s ⊥ ϕ_r for r < s and p_{s,r} ≤ p_{s′,r} for s > s′, i.e., the probability to sample a previous observation r is non-increasing over time, the variance of φ̄^(S)_t can be bounded in terms of the weights w_s and the collision probabilities of the sampling strategy.
Proof. Using p_{s,r} := P(ϕ_s = r) and properties of the variance, we can write the variance of φ̄^(S)_t as a weighted sum of covariance terms, where cov((s, r), (s′, r′)) denotes the covariance of λ^(S)_s(x_s, x_r, y_s) and λ^(S)_{s′}(x_{s′}, x_{r′}, y_{s′}). The sum ranges over all possible combinations of pairs (s, r), where s = t_0, . . . , t and r = 0, . . . , s − 1. As r < s and r′ < s′, it holds that |{s, s′, r, r′}| ≥ 2. When |{s, s′, r, r′}| = 2, then s = s′ and r = r′, and the covariance reduces to the variance. When none of the indices match, i.e., |{s, s′, r, r′}| = 4, the covariance is zero due to the independence assumption. When exactly one pair of indices matches, there are three possible cases:
• case 1: s = s′, r ≠ r′,
• case 2: s ≠ s′, r ≠ r′ with r = s′ or s = r′,
• case 3: s ≠ s′, r = r′.
Case 2 yields the same covariances as case 3 due to the iid assumption and the symmetry of the covariance. For case 1, the covariance is bounded using
$$\mathbb{E}_{Z_s}\big[\mathbb{E}_{Z_r}[f(Z_s, Z_r)]\big] = \phi^{(S)}(h)$$
as well as the iid assumption multiple times. The same arguments apply for the second argument in case 3, as the covariance is symmetric in its arguments. We thus summarize the contributions of all cases in the weighted sum $\sum w_s w_{s'}\, p_{s,r}\, p_{s',r'}\, \mathrm{cov}((s,r),(s',r'))$.
For the first sum, the matching pairs with $s = s'$ and $r = r'$ contribute the variance terms. For the second sum, $Q_3$ decomposes into the three cases. For case 1, we sum over $(s, s', r, r') \in Q_3$ with $s = s'$ and $r \ne r'$. For case 2, w.l.o.g. assume $r = s'$, which implies $s > s'$ and thus $w_s \ge w_{s'}$. Case 3 is treated analogously. In summary, we conclude that the last sum depends on both the choice of weights $w_s$ and the collision probability, which is related to the Rényi entropy (Rényi 1961). The variance increases with the collision probabilities of the sampling strategy for uniform and geometric sampling, respectively.

Lemma 1. For geometric sampling and $p \in (0, 1)$, the collision probability is of order $O(p)$.

Proof. The probabilities for geometric sampling are $p_{s,r} \propto (1-p)^{s-1-r}$, and the claim follows from the properties of the geometric progression.

We now apply Proposition 2 to our particular estimator $\hat{\phi}_t^{(S)}$ with $w_s := \alpha (1-\alpha)^{t-s}$ and take the limit for $t \to \infty$. Note that both uniform and geometric sampling fulfill the condition of the theorem.

Geometric Sampling. For geometric sampling, it is enough to show for the second term that $0 < \lim_{t \to \infty} q(\alpha) < \infty$ to prove the result, as $I_{\mathrm{geom}}(s) = O(p)$. Using the properties of the geometric progression, we obtain this limit, which concludes the proof.

Theorem 9 (Bias for changing Model). For a changing model, the bias bound stated in the main contribution holds.
Proof. We again consider the more general estimator $\hat{\phi}_t^{(S)} = \sum_{s=t_0}^{t} w_s\, f_s(z_s, z_{\phi_s})$ and prove a more general result.
Proof. We first show that, for two models $h_s$, $h_t$ and a subset $S \subseteq D$, the difference $|\phi^{(S)}(h_s) - \phi^{(S)}(h_t)|$ is bounded. The result then follows directly by definition and the observation that $\lambda_s^{(S)}$ is an unbiased estimate of $\phi^{(S)}(h_s)$, as in the bias result for the static model.
If the covariances of the time-dependent functions are uniformly bounded for $t_0 \le s, s' \le t$, $r < s$, and $r' < s'$, then for a sequence of models $(h_t)_{t \ge 0}$ the results of Theorem 8 apply.
Proof. In all proofs, a changing model $h_t$ adds a time dependency to the function $f_s(Z_s, Z_r) := \big(h_s(X_s^{(\bar{S})}, X_r^{(S)}) - Y_s\big)^2 - \big(h_s(X_s) - Y_s\big)^2$. Instead of bounding the covariances by $\sigma_2^2$, we now bound the covariances of the time-dependent functions by $\sigma_{\max}^2$. This only directly affects Proposition 2, as
$$p_{s,r}\, p_{s',r'}\, \mathrm{cov}\big((s,r),(s',r')\big) \le \sigma_{\max}^2\, p_{s,r}\, p_{s',r'}.$$
All remaining arguments and proofs are still valid for a changing model due to the iid assumption.
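The geometric sampling strategy analyzed in Lemma 1 can be sketched as follows. This is a minimal illustration assuming sampling probabilities proportional to $(1-p)^{s-1-r}$ over the full history; the actual implementation combines this with a reservoir, which is omitted here.

```python
import random

def sample_past_index(s, p, rng):
    """Sample a past index r < s with geometrically decaying probability.

    p_{s,r} is proportional to (1 - p)**(s - 1 - r), so recent observations
    are preferred; the probability of any fixed past observation r is
    non-increasing over time, which is the condition of Proposition 2.
    """
    weights = [(1 - p) ** (s - 1 - r) for r in range(s)]
    total = sum(weights)
    u = rng.random() * total
    acc = 0.0
    for r, w in enumerate(weights):
        acc += w
        if u <= acc:
            return r
    return s - 1  # numerical safeguard for floating-point round-off
```

A smaller $p$ spreads the probability mass over a longer history, which lowers the collision probability entering the variance bound.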

Link to FI used in SAGE
Our definition of FI aligns with the definition $v_h(S)$ given by Covert, Lundberg, and Lee (2020), which measures the increase in risk when including the features in $S$ compared with marginalizing all features. The conditional distribution $X^{(S)} \mid X^{(\bar{S})}$ coincides with the marginal distribution if $X^{(\bar{S})} \perp X^{(S)}$, which is often assumed in practice (Lundberg and Lee 2017; Covert, Lundberg, and Lee 2020, 2021).
It was even suggested that the marginal distribution is conceptually the right choice (Janzing, Minorics, and Bloebaum 2020).
Approximation Error for Expected PFI

Writing the expected PFI as a double sum over $f(z_n, z_m)$, we obtain the basic form of a U-statistic, and therefore its variance can be computed following Hoeffding (1948), where the second moments of $f$ are assumed to be finite. For $\epsilon > 0$, we then obtain the approximation guarantee by Chebyshev's inequality.

Ground-truth PFI for the agrawal Stream
River (Montiel et al. 2020) implements the agrawal (Agrawal, Imielinski, and Swami 1993) data stream generator. The theoretical PFIs can be calculated as the base probability of a sample belonging to concept A ($P(A_1) = P(A_2) = P(A_3) = \frac{5}{39}$) times the probability of switching the class by changing a feature ($P(A_i \to B_{n,m})$), plus the corresponding term for a sample originally belonging to concept B.
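The class-flip probabilities entering this ground-truth calculation can be estimated by Monte Carlo. The sketch below assumes the classification function is the standard agrawal function 2, whose rectangular concept regions are consistent with $P(A_i) = \frac{5}{39}$ under the uniform feature distributions; the region bounds and all names are assumptions for illustration.

```python
import random

def concept(age, salary):
    """Concept assignment, assuming the standard agrawal classification
    function 2: concept A iff the sample falls into one of three
    rectangles in the (age, salary) plane (salary in units of 1000)."""
    in_a1 = age < 40 and 50 <= salary <= 100
    in_a2 = 40 <= age < 60 and 75 <= salary <= 125
    in_a3 = age >= 60 and 25 <= salary <= 75
    return "A" if (in_a1 or in_a2 or in_a3) else "B"

def flip_probability(feature, n=50_000, seed=0):
    """Monte Carlo estimate of P(class changes | `feature` is replaced by an
    independent draw) -- the class-flip term of the ground-truth PFI."""
    rng = random.Random(seed)
    flips = 0
    for _ in range(n):
        age, sal = rng.uniform(20, 80), rng.uniform(20, 150)
        if feature == "age":
            age2, sal2 = rng.uniform(20, 80), sal
        else:
            age2, sal2 = age, rng.uniform(20, 150)
        flips += concept(age, sal) != concept(age2, sal2)
    return flips / n
```

Under these region assumptions, the flip probability evaluates analytically to $\frac{40}{117} \approx 0.34$ for age and $\frac{80}{169} \approx 0.47$ for salary, which the Monte Carlo estimate reproduces.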

Experiments
In the following, we give more comprehensive details about the datasets and models used in our experiments.
Dataset Description

adult (Kohavi 1996) Binary classification dataset that classifies 48842 individuals based on 14 features into yearly salaries above and below 50k. There are six numerical features and eight nominal features.
bank (Moro, Cortez, and Laureano 2011) Binary classification dataset of 45211 marketing phone calls based on 17 features to decide whether the customer subscribed to a term deposit. There are seven numerical features and ten nominal features.

elec2 (Harries 1999) Binary classification dataset that classifies whether the electricity price will go up or down. The data was collected for 45312 time stamps from the Australian New South Wales Electricity Market and is based on eight features, six numerical and two nominal.
agrawal (Agrawal, Imielinski, and Swami 1993) Synthetic data stream generator that creates binary classification problems deciding whether an individual will be granted a loan based on nine features, six numerical and three nominal. Ten different decision functions are available.
stagger (Schlimmer and Granger 1986) The stagger concepts form a simple toy classification data stream. The synthetic data stream generator consists of three independent categorical features that describe the shape, size, and color of an artificial object. Different classification functions can be derived from these sharp distinctions.

Model Description
All models are implemented with the default parameters from scikit-learn (Pedregosa et al. 2011) and River (Montiel et al. 2020) unless otherwise stated.
ARF The Adaptive Random Forest classifier (ARF) uses an ensemble of 50 trees with binary splits, ADWIN drift detection, and the information gain split criterion. We used the default implementation AdaptiveRandomForestClassifier from River with n_models=50 and binary_split=True.
NN The Neural Network classifier (NN) was implemented with two hidden layers of size 128 × 64 and ReLU activation functions, and optimized with Adam. We used the default implementation MLPClassifier from scikit-learn.
GBT The Gradient Boosting Tree (GBT) uses 200 estimators and additively builds a decision tree ensemble using log-loss optimization. We used the GradientBoostingClassifier from scikit-learn with n_estimators=200.
LGBM The LightGBM (LGBM) constitutes a more lightweight implementation of GBT. We used HistGradientBoostingRegressor for regression tasks and HistGradientBoostingClassifier for classification tasks from scikit-learn with the default parameters.

Hardware Details
The experiments were mainly run on a computation cluster with hyperthreaded Intel Xeon E5-2697 v3 CPUs clocked at 2.6 GHz. In total, the experiments took around 300 CPU hours (30 CPUs for 10 hours) on the cluster. This mainly stems from the number of parameters and different initializations. Before running the experiments on the cluster, the implementations were validated on a Dell XPS 15 9510 containing an Intel i7-11800H at 2.3 GHz. The laptop ran for around 12 hours for this validation.

Parameter Analysis
In addition to the analysis of the sampling strategy, we further analyzed the effect of the smoothing parameter α and the length of the reservoir used for sampling. Figure 6 shows that the smoothing parameter α has a substantial effect on the iPFI estimates, whereas the length of the reservoir has no large effect. For both sensitivity analyses, iPFI with geometric sampling is applied to an ARF model on elec2; only the important nswprice feature is plotted. For each parameter value, a single run is presented.
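The role of α can be illustrated with the exponential smoothing update that iPFI applies to its per-sample importance estimates; the function name `smooth` and the constant input stream are illustrative.

```python
def smooth(values, alpha):
    """Exponential smoothing of a stream of per-sample importance values:
    phi_t = (1 - alpha) * phi_{t-1} + alpha * f_t. A larger alpha tracks
    changes faster but yields noisier estimates; a smaller alpha averages
    over roughly 1/alpha recent samples."""
    phi = 0.0
    trace = []
    for f_t in values:
        phi = (1 - alpha) * phi + alpha * f_t
        trace.append(phi)
    return trace
```

This trade-off between reactivity and variance is exactly what the sensitivity analysis in Figure 6 (left) shows for different values of α.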

Figure 1: Incremental feature importance on an electricity data stream to create anytime explanations. Concept drift in the data (rectangles) leads to model adaptation without visible changes in the model's performance.

Figure 2: Boxplot of PFI estimates per feature of the bank dataset for batch PFI (left), geometric sampling iPFI (middle), and uniform sampling iPFI (right) on a pre-trained static NN.

Figure 3: iPFI on two agrawal data streams with induced concept drift. The most important features are colored. The dashed line denotes the batch calculation at set intervals. The dashed vertical line denotes the concept drift.

Figure 4: iPFI with uniform (left) and geometric sampling (right) on elec2 with a feature-drift.

Figure 5: Two-Dimensional Classification Problem of the agrawal Data Stream.

Figure 6: The importance of the nswprice feature for an ARF model trained on elec2 for different values of the α (left) and reservoir length (right) iPFI parameters.

Figure 15: Boxplot of PFI estimates per feature of the agrawal dataset for batch baseline (left), iPFI with geometric sampling (middle), and iPFI with uniform sampling (right) on a pre-trained static LGBM.

Figure 16: Boxplot of PFI estimates per feature of the elec2 dataset for batch baseline (left), iPFI with geometric sampling (middle), and iPFI with uniform sampling (right) on a pre-trained static LGBM.

Figure 17: Boxplot of PFI estimates per feature of the adult dataset for batch baseline (left), iPFI with geometric sampling (middle), and iPFI with uniform sampling (right) on a pre-trained static GBT.

Figure 18: Boxplot of PFI estimates per feature of the bank dataset for batch baseline (left), iPFI with geometric sampling (middle), and iPFI with uniform sampling (right) on a pre-trained static NN.

Figure 19: Boxplot of PFI estimates per feature of the bike dataset for batch baseline (left), iPFI with geometric sampling (middle), and iPFI with uniform sampling (right) on a pre-trained static LGBM.
The agrawal generator provides a data stream with multiple classification functions. In our experiments, we consider, among others, a classification function that depends only on the features age and salary. Both features are uniformly distributed with $X^{(\mathrm{age})} \sim U[20, 80]$ and $X^{(\mathrm{salary})} \sim U[20, 150]$. Given iid samples from the data stream, the classification problem can be transformed into a two-dimensional problem following this classification function. The two-dimensional classification problem is illustrated in Figure 5. A sample is classified as concept A when it is contained in $A_1$, $A_2$, or $A_3$; otherwise, the sample is classified as concept B.
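Under the uniform distributions above, the base probabilities follow directly from the areas of the concept-A rectangles. The bounds below are an assumption (those of the standard agrawal classification function 2), chosen because they reproduce $P(A_1) = P(A_2) = P(A_3) = \frac{5}{39}$; exact fractions avoid floating-point error.

```python
from fractions import Fraction

# (age_lo, age_hi, sal_lo, sal_hi) of the three concept-A rectangles,
# assuming the bounds of the standard agrawal classification function 2
regions = {
    "A1": (20, 40, 50, 100),
    "A2": (40, 60, 75, 125),
    "A3": (60, 80, 25, 75),
}

AGE_RANGE = Fraction(80 - 20)   # age ~ U[20, 80]
SAL_RANGE = Fraction(150 - 20)  # salary ~ U[20, 150]

def prob(region):
    """P(sample falls into the rectangle) for independent uniform features."""
    a_lo, a_hi, s_lo, s_hi = region
    return Fraction(a_hi - a_lo) / AGE_RANGE * Fraction(s_hi - s_lo) / SAL_RANGE

p_a = sum(prob(r) for r in regions.values())  # P(concept A) = 15/39 = 5/13
```

Each rectangle contributes $\frac{20}{60} \cdot \frac{50}{130} = \frac{5}{39}$, so $P(A) = \frac{5}{13}$.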

Table 2 contains summary information about the supplementary experiments. Table 3 contains additional information for Experiments B and C in the main contribution.

Table 3: Summary of Experiment B's and C's iPFI error against the interval PFI (solid line vs. dashed line in the figures).