1 Introduction

Online learning from dynamic data streams is a prevalent machine learning (ML) approach for various application domains (Bahri et al., 2021). For instance, predicting energy consumption for individual households can foster energy-saving strategies such as load-shifting. Concept drift resulting from environmental changes, such as pandemic-induced lock-downs, drastically impacts energy consumption patterns, necessitating online ML (García-Martín et al., 2019). Explaining these predictions yields a greater understanding of an individual’s energy use and enables prescriptive modeling for further energy-saving measures (Wastensteiner et al., 2021). For black-box ML methods, so-called post-hoc XAI methods seek to explain single predictions or entire models in terms of the contribution of specific features (Adadi & Berrada, 2018).

We are interested in feature importance (FI) as a global assessment of features, which indicates their respective relevance to the given task and model. A prominent representative of model-agnostic FI measures is the permutation feature importance (PFI), which, in its original form, was introduced for tree-based models in Breiman (2001) with various applications and extensions (Strobl et al., 2007, 2008; Altmann et al., 2010; Hapfelmeier et al., 2014; Zhu et al., 2015; Gregorutti et al., 2015, 2017). Recent work (Fisher et al., 2019) adapts PFI to a model-agnostic FI measure (model reliance) and establishes important theoretical guarantees. Despite its limitations (Hooker et al., 2019; Fisher et al., 2019), we focus on PFI as a well-established, efficiently implementable, model-agnostic FI measure, which has served as a baseline for various more powerful extensions (Casalicchio et al., 2018; Covert et al., 2020; Molnar et al., 2020; König et al., 2021). So far, PFI requires a holistic view of the entire dataset in a static batch learning environment, which does not account for changes in the model structure, efficient anytime computations or sparse storage capabilities in data stream settings.

More generally, explainable artificial intelligence (XAI) has been studied mainly in the batch setting, where learning algorithms operate on static datasets. In scenarios where data does not fit into memory or computation time is strictly limited, like in progressive data science for big datasets (Turkay et al., 2018), or rapid online learning from data streams (Bahri et al., 2021), high computation times prohibit the use of traditional FI or XAI measures. Incremental, time- and memory-efficient implementations that provide anytime results have received much attention in recent years (Losing et al., 2018; Montiel et al., 2020). In particular, incremental algorithms enable a lifelong adaptation of machine learning technologies and their applications to possibly infinite data streams, addressing computational challenges as well as the challenge of dealing with changes of the underlying data distribution called drift. In this article, we are interested in efficient incremental algorithms for FI (see Fig. 1). Especially in the context of drifting data distributions, this task is particularly relevant—but also challenging, as many common FI methods are already computationally costly in the batch setting. While we focus with our implementation and theoretical results on PFI, our methodology of incremental FI could also be extended to other XAI measures.

Fig. 1
Incremental feature importance on an electricity data stream to create anytime explanations. Concept drift in the data (rectangles) leads to model adaptation without visible changes in the model’s performance

Contribution

We propose an incremental variant of PFI as a model-agnostic global FI estimator which is capable of dealing with data streams of arbitrary length in limited memory and linear time. Our algorithm can be applied to any model that is incrementally learned on a data stream and provides anytime explanations that immediately react to changes in the model and the underlying data distribution in case of concept drift. The main idea is that these estimates are efficiently updated at each time step by computing a one-sample estimate of FI, which is then exponentially averaged over time. To approximate marginal feature distributions we introduce two sampling strategies, which can be applied in scenarios with and without concept drift. The core idea, inspired by reservoir sampling (Vitter, 1985), is to efficiently maintain a reservoir to sample observations that are used to approximate the marginal distribution. Our core contributions include:

  • We introduce iPFI as an incremental and model-agnostic estimator for global FI by constructing an online variant of PFI (Sect. 3). To our knowledge, this constitutes the first mathematically substantiated approach for online global FI estimation. In contrast to PFI, iPFI reacts to concept drift in non-stationary environments and provides an explanation stream alongside the data stream in linear time and constant memory. The explanation stream can be utilized for further downstream tasks, such as inspecting possible causes for observed drift.

  • We motivate iPFI by establishing the concrete connection between permutation tests (Definition 2) (Breiman, 2001) and model reliance (Definition 3) (Fisher et al., 2019) in the batch setting (Theorem 1). This finding extends (Fisher et al., 2019, Appendix A.3) and shows that only properly scaled permutation tests are unbiased estimates of global FI.

  • We provide two sampling strategies for iPFI to incrementally compute marginal feature distributions and establish theoretical guarantees regarding bias, variance, and approximation error in terms of a single sensitivity parameter in a static and dynamic learning scenario.

  • We implement iPFI and conduct experiments on its ability to efficiently provide anytime global FI values under different types of concept drift, as well as its approximation quality compared to batch permutation tests in static modeling scenarios.

All experiments and algorithms are publicly available and integrated natively into the well-known incremental learning framework river (Montiel et al., 2020).

Related work A variety of model-agnostic local FI methods (Ribeiro et al., 2016; Lundberg & Lee, 2017; Lundberg et al., 2020; Covert & Lee, 2021) exist that provide relevance values for single instances. In addition, model-specific variants have been proposed for neural networks (Bach et al., 2015; Selvaraju et al., 2017) and trees (Lundberg et al., 2020). In contrast, global FI methods provide relevance values across all instances. PFI (or permutation tests) (Breiman, 2001) is a prominent global FI approach that has been widely applied (Archer & Kimes, 2008; Calle & Urrea, 2011; Wang et al., 2016), studied and extended (Strobl et al., 2007, 2008; Altmann et al., 2010; Hapfelmeier et al., 2014; Zhu et al., 2015; Gregorutti et al., 2015, 2017) for tree-based models. The method has recently been introduced as a model-agnostic approach in Fisher et al. (2019) and extended to scenarios with strongly correlated features in Molnar et al. (2020) and König et al. (2021). In this regard, our definition of global FI also relates to an ongoing debate on whether absent features should be marginalized using the conditional distribution (Aas et al., 2021; Frye et al., 2021) or the marginal distribution (Janzing et al., 2020), as proposed by PFI, where it was argued that the choice should depend on the application (Chen et al., 2020), and the marginal distribution was used as an approximation of the conditional distribution (Covert et al., 2020; Lundberg & Lee, 2017). A particularly popular extension is SAGE, a Shapley-based (Shapley, 1953) approach, which averages marginal feature contributions over arbitrary subsets of marginalized features. It has been proposed and compared with existing methods in Covert et al. (2020) and a closely related idea was previously introduced in Casalicchio et al. (2018), called SFIMP. 
As calculating FI values is computationally expensive, especially for Shapley-based methods, more efficient approaches such as FastSHAP (Jethani et al., 2021) have been introduced. Yet, none of the above methods and extensions natively support an incremental or dynamic setting in which the underlying model and its global FI can rapidly change due to concept drift.

An initial approach to explaining model changes by computing differences in FI utilizing drift detection methods is Muschalik et al. (2022). However, this does not constitute an incremental FI measure. The explanations are created with a time delay and without efficient anytime calculations. A first step towards anytime FI values has been proposed for online random forests by computing mean decrease in impurity (MDI) and accuracy (MDA) over time by using online confusion matrices (Cassidy & Deviney, 2014) or maintaining node statistics incrementally (Gomes et al., 2019). While online feature scores are of particular interest in streaming scenarios, these methods are limited to a specific model class, need access to the inherent model structure and cannot be extended to a model-agnostic approach. They further do not provide any theoretical guarantees about the approximation quality in comparison to the batch versions.

Similar to batch learning, incremental FI values could be used in further downstream tasks. As an example, incremental FI is also relevant to the field of incremental feature selection, where FI is calculated periodically with a sliding window to retain features for the incrementally fitted model (Barddal et al., 2019; Yuan et al., 2018). Lastly, changes in FI can also be used for concept drift detection, as described in Haug et al. (2022).

In this work, we provide a mathematically substantiated model-agnostic incremental FI measure, whose time sensitivity can be controlled by a single smoothing parameter. To our knowledge, this is the first approach that combines online ML and model-agnostic XAI measures and provides extensive theoretical guarantees on its approximation quality.

2 Global feature importance

We consider a supervised learning scenario, where \({\mathscr {X}}\) is the feature space and \({\mathscr {Y}}\) the target space, e.g., \({\mathscr {X}} = {\mathbb {R}}^d\) and \({\mathscr {Y}} = {\mathbb {R}}\) (regression), \({\mathscr {Y}} = \{0,1\}\) (binary classification) or \({\mathscr {Y}} = \{0,1\}^c\) (multiclass classification). Let \(h: {\mathscr {X}} \rightarrow {\mathscr {Y}}\) be a model, which is learned from a set or stream of observed data points \(z=(x,y) \in {\mathscr {X}} \times {\mathscr {Y}}\). Let \(D = \{1,\dots ,d\}\) be the set of feature indices for the vector-wise feature representations of \(x = (x^{(i)}:i \in D) \in {\mathscr {X}}\). Consider a subset \(S \subset D\) and its complement \({\bar{S}}:= D {\setminus } S\), which partitions the features, and denote \(x^{(S)}:= (x^{(i)}: i \in S)\) as the feature subset of S for a sample x. We write \(h(x^{({\bar{S}})},x^{(S)}):= h(x)\) to distinguish between features from \({\bar{S}}\) and S. For the basic setting, we assume that N observations are drawn independently and identically distributed (iid) from the joint distribution of unknown random variables \((X,Y)\) and denote by \({\mathbb {P}}_S\) the marginal distribution of the features in S, i.e., \(z_n:= (x_n,y_n)\) from \(Z_n:= (X_n,Y_n) \overset{iid}{\sim }\ {\mathbb {P}}_{(X,Y)}\) and \(x_n^{(S)} \text { from } X_n^{(S)} \overset{iid}{\sim }\ {\mathbb {P}}_S\) for samples \(n=1,\dots ,N\).

Feature importance refers to the relevance of a set of features S for a model h. To quantify FI, the key idea of measures such as PFI is to compare the model’s performance when using only features in \({\bar{S}}\) with the performance when using all features in \(D = S \cup {\bar{S}}\). The idea is that the “removal” of an important feature (i.e., the feature is not provided to a model) substantially decreases a model’s performance. The model performance or risk is measured based on a norm \(\Vert \cdot \Vert :{\mathscr {Y}} \rightarrow {\mathbb {R}}\) on \({\mathscr {Y}}\), e.g., the Euclidean norm, as \({\mathbb {E}}_{(X,Y)}[\, \Vert h(X)-Y\Vert \, ]\).

As the model is trained on all features and retraining is computationally expensive, a common method to restrict h to \(\bar{S}\) is to marginalize h over the features in S. We denote the marginalized risk

$$\begin{aligned} f_S \big (x^{({\bar{S}})},y \big ):= {\mathbb {E}}_{{\tilde{X}} \sim {\mathbb {P}}_S} \left[ \Vert h(x^{({\bar{S}})},{\tilde{X}})-y\Vert \right] . \end{aligned}$$
(1)

We then define FI for a model h and a feature set S as the difference of the marginalized risk and the inherent risk.

Definition 1

(Global FI) For a model h and a subset \(S \subset D\), the global feature importance (global FI) is defined as

$$\begin{aligned} \phi ^{(S)}(h):= \underbrace{{\mathbb {E}}_{(X,Y)} \Big [f_S(X^{(\bar{S})},Y) \Big ]}_{\hbox { marginalized risk over}\ {\mathbb {P}}_S} - \underbrace{{\mathbb {E}}_{(X,Y)} \Big [\Vert h(X)-Y\Vert \Big ]}_{\text {risk}}. \end{aligned}$$

This global FI measures the increase in risk when the features in S are marginalized.

Remark 1

Our definition is best suited and inherent for PFI (Breiman, 2001; Fisher et al., 2019) with single feature subsets. However, it is also related to a more general definition of FI given in Covert et al. (2020). Therein, FI is based on the reduction in risk, if features in S are included compared to marginalizing all features. In contrast to our definition, it relies on the conditional distribution \(X^{({\bar{S}})} \mid X^{(S)}\). In practice, however, the marginal distribution is often used to approximate the conditional distribution, where both coincide, if feature independence is assumed (Lundberg & Lee, 2017; Covert et al., 2020, 2021). In this case, it directly corresponds to our definition with different notation. In the literature, it was argued that the choice of distribution should depend on the application (Chen et al., 2020). The conditional distribution was preferred in Aas et al. (2021); Frye et al. (2021), which includes causal relationships in the explanation (Chen et al., 2020), whereas the marginal distribution was preferred in Janzing et al. (2020), which explains the model independent of the relationships between the features (Janzing et al., 2020; Chen et al., 2020).

2.1 Empirical estimation of global FI

Given observations \((x_1,y_1),\dots ,(x_N,y_N)\), we estimate global FI for a given model h with the canonical estimator

$$\begin{aligned} {\hat{\phi }}^{(S)}_{\varphi }:= \frac{1}{N}\sum _{n=1}^N {\hat{\lambda }}^{(S)}(x_n,x_{\varphi (n)},y_n), \end{aligned}$$
(2)

where \(\varphi :\{1,\dots ,N\} \rightarrow \{1,\dots ,N\}\) represents the realization of a (possibly random) sampling strategy that chooses for \(x_n\) an observation \(x_{\varphi (n)}\) as a replacement value with

$$\begin{aligned} {\hat{\lambda }}^{(S)}(x_n,x_m,y_n):= \Vert h(x_n^{(\bar{S})},x^{(S)}_m)-y_n\Vert - \Vert h(x_n)-y_n\Vert . \end{aligned}$$

Given the iid assumption, it is clear that due to \(X_n \perp X_{n'}\) for \(n\ne n'\), the estimator is an unbiased estimator of the global FI \(\phi ^{(S)}(h)\), if \(\varphi (n) \ne n\) for all \(n=1,\dots ,N\). In the case of \(\varphi (n)=n\), the term in the sum is zero as well as its expectation, which implies \({\mathbb {E}}[{\hat{\phi }}^{(S)}_{\varphi }] \le \phi ^{(S)}(h)\) for any \(\varphi\). We will now discuss a well understood choice of feature subsets \(S \subset D\), sampling strategy \(\varphi\) and two estimators for \(\phi ^{(S)}(h)\).
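The canonical estimator (2) can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: the absolute error stands in for the norm on a scalar target, features are indexed from 0, and the toy model `h`, the data, and the helper names `lambda_hat` and `fi_estimate` are assumptions introduced here.

```python
import random


def lambda_hat(h, x_n, x_m, y_n, S):
    """One-sample term: loss with the features in S taken from x_m, minus the original loss."""
    x_mixed = list(x_n)
    for i in S:
        x_mixed[i] = x_m[i]
    return abs(h(x_mixed) - y_n) - abs(h(x_n) - y_n)


def fi_estimate(h, xs, ys, phi, S):
    """Estimator (2): average the one-sample terms, where phi[n] is the replacement index."""
    n = len(xs)
    return sum(lambda_hat(h, xs[i], xs[phi[i]], ys[i], S) for i in range(n)) / n


def h(x):
    # Hypothetical toy model: depends on feature 0 only.
    return 2.0 * x[0] + 0.0 * x[1]


rng = random.Random(0)
xs = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(200)]
ys = [2.0 * x[0] for x in xs]

shift = [(i + 1) % len(xs) for i in range(len(xs))]  # cyclic shift, no fixed points
identity = list(range(len(xs)))                      # every index is a fixed point

fi_feature0 = fi_estimate(h, xs, ys, shift, S=[0])
fi_feature1 = fi_estimate(h, xs, ys, shift, S=[1])
fi_identity = fi_estimate(h, xs, ys, identity, S=[0])
```

With the cyclic shift, the relevant feature 0 receives a positive score and the ignored feature 1 receives exactly zero. Under the identity map every term vanishes, which illustrates why fixed points pull the estimate towards zero.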

2.2 Permutation feature importance (PFI)

A popular example of global FI is the well-known PFI (Breiman, 2001) that measures the importance of each feature \(j \in D\) by using a set \(S_j:= \{j\}\). More precisely, the FI for each feature \(j \in D\) is given by \(\phi ^{(S_{j})}\) with sets \(S_{j}\) and their complement \({\bar{S}}_{j}:= D \setminus \{j\}\). The sampling strategy \(\varphi\) used in PFI samples uniformly generated permutations \(\varphi \in {\mathfrak {S}}_N\) over the set \(\{1,\dots ,N\}\), where each permutation has a probability of 1/N!.

2.2.1 Empirical PFI

Permutation tests, as proposed initially in Breiman (2001), effectively approximate \({\mathbb {E}}_{\varphi }[{\hat{\phi }}^{(S_j)}_\varphi ]\) by averaging over M uniformly sampled random permutations. We now introduce a corrected version of the originally proposed estimator that includes a normalizing factor \(\frac{N}{N-1}\), and refer to it as PFI.

Definition 2

(PFI) Given samples \((x_1,y_1),\dots ,(x_N,y_N)\) and uniformly sampled permutations \(\varphi _1,\dots ,\varphi _M \overset{iid}{\sim } \text {unif}({\mathfrak {S}}_N)\), we define the PFI estimator as

$$\begin{aligned} {\textbf {PFI}}: {\hat{\phi }}^{(S_j)}:= \frac{N}{N-1}\underbrace{\frac{1}{M} \sum _{m=1}^M {\hat{\phi }}^{(S_j)}_{\varphi _m}}_{\approx {\mathbb {E}}_{\varphi }[{\hat{\phi }}^{(S_j)}_\varphi ]}. \end{aligned}$$
(3)

As discussed above, the estimator \({\hat{\phi }}^{(S_j)}_{\varphi }\) for a given \(\varphi\) is an unbiased estimator for global FI \(\phi ^{(S_j)}(h)\), if the permutation is a derangement (no fixed points). Our version differs by the factor \(\frac{N}{N-1}\) from the initially proposed approach (Breiman, 2001; Fisher et al., 2019). In the following, we show that, if the expectation over uniformly sampled permutations \(\varphi \sim \text {unif}({\mathfrak {S}}_{N})\) is taken, our definition is an unbiased estimator of global FI. This expectation directly links PFI to model reliance (Fisher et al., 2019), which we thus refer to as expected PFI. While our definition of PFI is closely related to the original method (Breiman, 2001), the link to expected PFI allows us to provide further theoretical results. We utilize this link in an incremental learning setting to provide theoretical guarantees.

2.2.2 Expected PFI

The PFI estimator can be efficiently computed but highly depends on the sampled permutations complicating the theoretical analysis. Another definition of PFI (model reliance), which was given and extensively studied in Fisher et al. (2019), is independent of sampled permutations. We refer to it as the expected PFI.

Definition 3

(Expected PFI) Given observations \((x_1,y_1),\dots ,(x_N,y_N)\) the expected PFI is defined as

$$\begin{aligned} {\bar{\phi }}^{(S_j)}:= \underbrace{\frac{1}{N(N-1)} \sum _{n=1}^N\sum _{m \ne n}\Vert h(x_n^{(\bar{S_j})},x_m^{(S_j)})-y_n\Vert }_{=: {\hat{e}}_{\text {switch}}} - \underbrace{\frac{1}{N} \sum _{n=1}^N\Vert h(x_n)-y_n\Vert }_{=:{\hat{e}}_{\text {orig}}} \end{aligned}$$

The expected PFI computes the difference of the error of the model averaged over all feature instantiations \({\hat{e}}_{\text {switch}}\) with the model error \({\hat{e}}_{\text {orig}}\). We now show that the expected PFI is actually the expectation over the sampling procedure \(\varphi\) of PFI, which directly links Definition 2 and Definition 3. As expected PFI is an unbiased estimator for global FI, we conclude that PFI is an unbiased estimator, if properly scaled as in Definition 2.

Theorem 1

The expected PFI (model reliance) can be rewritten as a normalized expectation over uniformly sampled permutations

$$\begin{aligned} {\bar{\phi }}^{(S_j)} = \frac{N}{N-1}{\mathbb {E}}_{\varphi \sim \text {unif}({\mathfrak {S}}_N)} \left[ {\hat{\phi }}^{(S_j)}_{\varphi } \right] \approx {\hat{\phi }}^{(S_j)} \end{aligned}$$
(4)

i.e. expected PFI is canonically estimated by the PFI estimator and in particular \({\bar{\phi }}^{(S_j)} = {\mathbb {E}}_\varphi [{\hat{\phi }}^{(S_j)}]\).

Due to space restrictions, all proofs are deferred to the supplementary material in Sect. A. Theorem 1 shows that the PFI estimator \({\hat{\phi }}^{(S_j)}\) is a canonical Monte-Carlo estimate of the theoretically well understood expected PFI estimator \({\bar{\phi }}^{(S_j)}\). Both \({\hat{e}}_{\text {switch}}\) and \({\hat{e}}_{\text {orig}}\) as well as the estimator \({\bar{\phi }}^{(S_j)}\) are U-statistics, which implies unbiasedness, asymptotic normality and finite-sample bounds under weak conditions (Fisher et al., 2019). The variance can, thus, be directly computed and it is easy to show that \({\mathbb {V}}[{\bar{\phi }}^{(S_j)}]= {\mathcal {O}}(1/N)\), which by Chebyshev’s inequality implies a bound on the approximation error as \({\mathbb {P}}(\vert {\bar{\phi }}^{(S_j)}- \phi ^{(S_j)}(h)\vert > \epsilon ) = {\mathcal {O}}(1/N)\). Hence, the approximation error of the expected PFI is directly controlled by the number of observations N used for computation. A possible link between permutation tests and the U-statistic \({\bar{\phi }}^{(S_j)}\) was already suggested in (Fisher et al., 2019, Appendix A.3), where it was shown that the sum over permutations without fixed points is proportional to \({\hat{e}}_{\text {switch}}\). Theorem 1 shows that both approaches are directly linked, if permutation tests are properly scaled (Definition 2). The biased estimator \(\frac{1}{M} \sum _{m=1}^M {\hat{\phi }}^{(S_j)}_{\varphi _m}\) appears in Breiman (2001); Fisher et al. (2019); Gregorutti et al. (2017). To our knowledge, the unbiased version in Definition 2 has not yet been introduced. In practice, while this factor does not change the relative importance scores, it should be considered when comparing PFI estimates with varying N. Furthermore, Theorem 1 justifies averaging over repeatedly sampled realizations of \(\varphi\) in order to approximate the computationally prohibitive estimator \({\bar{\phi }}^{(S_j)}\). 
In the following, we will pick up this notion when constructing an incremental FI estimator.
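The identity in Theorem 1 can be checked numerically on a tiny dataset by enumerating all \(N!\) permutations instead of sampling them. The following sketch makes its own assumptions: the absolute error stands in for the norm, features are indexed from 0, and the toy model `h`, the five data points, and all helper names are hypothetical.

```python
from itertools import permutations


def loss(h, x, y):
    return abs(h(x) - y)


def swap_feature(x_n, x_m, j):
    """Copy of x_n whose feature j is replaced by the value from x_m."""
    x = list(x_n)
    x[j] = x_m[j]
    return x


def phi_hat(h, xs, ys, perm, j):
    """Estimator (2) for the single-feature set S_j and one fixed permutation."""
    n = len(xs)
    return sum(
        loss(h, swap_feature(xs[i], xs[perm[i]], j), ys[i]) - loss(h, xs[i], ys[i])
        for i in range(n)
    ) / n


def expected_pfi(h, xs, ys, j):
    """Definition 3: e_switch over all pairs with m != n, minus e_orig."""
    n = len(xs)
    e_switch = sum(
        loss(h, swap_feature(xs[a], xs[b], j), ys[a])
        for a in range(n) for b in range(n) if a != b
    ) / (n * (n - 1))
    e_orig = sum(loss(h, xs[a], ys[a]) for a in range(n)) / n
    return e_switch - e_orig


def h(x):
    return 1.5 * x[0] - 0.5 * x[1]  # hypothetical toy model


xs = [[0.2, 1.0], [1.3, -0.7], [-0.4, 0.5], [2.1, 0.0], [0.9, -1.2]]
ys = [0.0, 2.0, -1.0, 3.0, 2.5]
N = len(xs)

all_perms = list(permutations(range(N)))  # 5! = 120 permutations
mean_over_perms = sum(phi_hat(h, xs, ys, p, 0) for p in all_perms) / len(all_perms)
scaled = N / (N - 1) * mean_over_perms
target = expected_pfi(h, xs, ys, 0)
```

Up to floating-point error, `scaled` coincides with `target`, whereas the unscaled `mean_over_perms` is smaller by the factor \((N-1)/N\).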

3 Incremental permutation feature importance

In incremental learning, one deals with an a priori unlimited stream of training data. The challenge is to infer a model at any time point t based on the previous model and the currently observed data point, thereby using a fixed, limited amount of memory and efficient update schemes for the model. While incremental classification and regression models have been proposed (Bahri et al., 2021; Losing et al., 2018), technologies that accompany such methods with incremental explanations are rare. In the following, we introduce an efficient incremental scheme for the popular PFI supported by theoretical guarantees using the link to expected PFI (model reliance) (Fisher et al., 2019).

We now consider a sequence of models \((h_t)_{t\in {\mathbb {N}}}\) from an incremental learning algorithm. At time t the observed data is \(\{(x_0,y_0),\dots ,(x_t,y_t)\}\). The model is incrementally learned over time, such that at time t the observation \((x_t,y_t)\) is used to update \(h_t\) to \(h_{t+1}\). Our goal is to efficiently provide an estimate of PFI at each time step t for each feature \(j \in D\) using subsets \(S_j:= \{j\}\). Note that our results can immediately be extended to arbitrary feature subsets \(S \subset D\).

In the following, we construct an efficient incremental estimator for PFI. We first discuss how (2) can be efficiently approximated in the incremental learning scenario, given a sampling strategy \(\varphi _t\). In the sequel, we will rely on a random sampling strategy which is specifically suitable for the incremental setting and easier to implement than permutation-based approaches. Note that a permutation-based approach at time t is difficult to replicate in the incremental setting, as at time \(s<t\) not all samples until time t are available. Moreover, as the model changes over time, naively computing (2) at each time step t using N previous observations results in N model evaluations per time step. Instead, we propose to use an estimator that averages the terms in (2) over time rather than over multiple data points at one time step. That means, we evaluate the current model only twice to compute the time-dependent quantity

$$\begin{aligned} {\hat{\lambda }}^{(S_j)}_t(x_t,x_{\varphi _t},y_t):= \Vert h_t(x_t^{(\bar{S_j})},x^{(S_j)}_{\varphi _t})-y_t\Vert - \Vert h_t(x_t)-y_t\Vert , \end{aligned}$$

where \(\varphi _t\) is a stochastic sampling strategy to select a previous observation with values in \(\{0,\dots ,t-1\}\), which we discuss in a second step in Sect. 3.1. We propose to average these calculations over time (rather than iterations over multiple data points) by using exponential smoothing. This leads to the definition of the incremental PFI (iPFI) estimator.

Fig. 2
Illustration of the incremental explanation procedure

Definition 4

(iPFI) For a data stream at time t with previous observations \((x_0,y_0),\dots ,(x_t,y_t)\) and a sampling strategy \((\varphi _s)_{s=t_0,\dots ,t}\) for \(t_0>0\) the incremental PFI (iPFI) estimator is recursively defined as

$$\begin{aligned} {\textbf {iPFI}}: {\hat{\phi }}^{(S_j)}_t:= (1-\alpha ){\hat{\phi }}^{(S_j)}_{t-1} + \alpha {\hat{\lambda }}^{(S_j)}_t(x_t,x_{\varphi _t},y_t), \end{aligned}$$

for \(t \ge t_0\), \({\hat{\phi }}^{(S_j)}_{t_0-1}:= 0\), and \(\alpha \in (0,1)\).

Algorithm 1

The parameter \(\alpha\) is a hyperparameter that should be chosen based on the application. Note that a specific choice of \(\alpha\) corresponds to a window size N, where \(\alpha = \frac{2}{N+1}\) based on the well-known conversion formula, see e.g. (Nahmias & Olsen, 2015, p.73). Given a realization \(\varphi _s\), observations \(z_s:= (x_s,y_s)\) from \(Z_s:= (X_s,Y_s) \overset{iid}{\sim }\ {\mathbb {P}}_{(X,Y)}\) and \(x_s^{(S_j)}\) from \(X_s^{(S_j)} \overset{iid}{\sim }\ {\mathbb {P}}_{S_j}\), each \({\hat{\lambda }}^{(S_j)}_s\) is an unbiased estimate of \(\phi ^{(S_j)}(h_s)\). We further require \(\varphi _s \perp (X,Y)\) and denote

$$\begin{aligned} p_{s,r}:= {\mathbb {P}}(\varphi _s = r) \text { for } s=t_0,\dots ,t \text { and } r=0,\dots ,s-1, \end{aligned}$$
(5)

i.e. the probability of selecting a previous observation from time r at time s. Note that \(t_0>0\) is the first time step where \({\hat{\phi }}^{(S_j)}_t\) can be computed, as we need previous observations for the sampling process. In the following, we assume that the sampling strategy \((\varphi _s)_{t_0\le s\le t}\) is fixed and clear from the context, and thus omit this dependence in the notation \({\hat{\phi }}^{(S_j)}_t\). We illustrate one explanation step at time t in Algorithm 1 and Fig. 2. This directly corresponds to (3) with \(M=1\) and can be extended to \(M>1\) by repeatedly running the procedure in parallel and averaging the results. Next, we discuss two possible sampling strategies, which are illustrated in Fig. 3.
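One explanation step of Definition 4 can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation shipped with river: the class name `IncrementalPFI`, the absolute error as norm, drawing a separate replacement observation per feature, and storing new observations by random replacement in a fixed-size list are all choices made here for brevity.

```python
import random


class IncrementalPFI:
    """Sketch of the iPFI recursion with a fixed-size store of past observations."""

    def __init__(self, n_features, alpha, reservoir_size, seed=0):
        self.alpha = alpha                    # smoothing parameter in (0, 1)
        self.reservoir_size = reservoir_size  # plays the role of t_0 in this sketch
        self.reservoir = []                   # stored past feature vectors
        self.fi = [0.0] * n_features          # current estimate per feature
        self.rng = random.Random(seed)

    def explain_one(self, model, x, y):
        """Process one observation (x, y) using the current model."""
        if len(self.reservoir) < self.reservoir_size:
            self.reservoir.append(list(x))    # warm-up: no estimate before t_0
            return list(self.fi)
        base_loss = abs(model(x) - y)
        for j in range(len(self.fi)):
            x_ref = self.rng.choice(self.reservoir)  # sampled previous observation
            x_perturbed = list(x)
            x_perturbed[j] = x_ref[j]                # replace feature j only
            lam = abs(model(x_perturbed) - y) - base_loss
            # Exponential smoothing of the one-sample estimates.
            self.fi[j] = (1 - self.alpha) * self.fi[j] + self.alpha * lam
        # Store the new observation by overwriting a random slot.
        self.reservoir[self.rng.randrange(self.reservoir_size)] = list(x)
        return list(self.fi)


def model(x):
    return 3.0 * x[0]  # static toy model that ignores feature 1


data_rng = random.Random(42)
explainer = IncrementalPFI(n_features=2, alpha=0.01, reservoir_size=50)
for _ in range(3000):
    x = [data_rng.gauss(0, 1), data_rng.gauss(0, 1)]
    fi = explainer.explain_one(model, x, y=3.0 * x[0])
```

In a real stream, the model would be updated with \((x_t,y_t)\) after each call, so the explanation at time t is always computed with \(h_t\). Here the model is static, and the estimate for the unused feature 1 stays at exactly zero while the estimate for feature 0 settles at a positive value.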

3.1 Incremental sampling strategies \(\varphi\)

Fig. 3
figure 3

Comparison of uniform (left) and geometric (right) sampling strategies. A reservoir of length L summarizes the data stream (rectangles) until time t. The insertion probability denotes the probability that a data point is added to the reservoir at time s when it is observed. The sampling probability denotes the likelihood of drawing the individual observations at time t

Since random permutations cannot easily be realized in an incremental setting as they require infinite memory of previous observations and knowledge of future events, we now present two alternative types of sampling strategies. We formalize \((\varphi _s)_{t_0\le s\le t}\) to choose the previous observation r at time s for the calculation in \({\hat{\lambda }}^{(S_j)}_s\). To do so, we will specify the probabilities \(p_{s,r}\) in (5). An illustration of both approaches can be found in Fig. 3.

3.1.1 Uniform sampling

In uniform sampling we assume that each previous observation is equally likely to be sampled at time s, i.e., \(p_{s,r}=1/s\) for \(s=t_0,\dots ,t\) and \(r=0,\dots ,s-1\). It could be naively implemented by storing all previous observations and uniformly sampling at each time step. However, since memory is limited, uniform sampling may be implemented with histograms for categorical features of known and small cardinality. For others, a reservoir of fixed length L can be maintained, known as reservoir sampling (Vitter, 1985). The probability of a new observation to be included in the reservoir, referred to as insertion probability, then decreases over time, see Fig. 3. Clearly, observations are drawn independently, but can be sampled more than once. In a data stream scenario, where changes to the underlying data distribution occur over time, the uniform sampling strategy may be inappropriate, and sampling strategies that prefer recent observations may be better suited.
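A fixed-length reservoir for uniform sampling can be maintained with the classic replacement rule from the reservoir sampling literature. The sketch below is an illustration only; the class name `UniformReservoir` and its methods are hypothetical and do not refer to an existing library API.

```python
import random


class UniformReservoir:
    """Fixed-size reservoir in which every observation seen so far is equally likely to be stored."""

    def __init__(self, size, seed=0):
        self.size = size
        self.items = []
        self.n_seen = 0
        self.rng = random.Random(seed)

    def insert(self, x):
        self.n_seen += 1
        if len(self.items) < self.size:
            self.items.append(x)  # the first `size` observations are always kept
        else:
            # Keep x with insertion probability size / n_seen, which decreases over time.
            k = self.rng.randrange(self.n_seen)
            if k < self.size:
                self.items[k] = x

    def sample(self):
        """Draw one stored observation uniformly at random."""
        return self.rng.choice(self.items)


reservoir = UniformReservoir(size=10, seed=1)
for t in range(1000):
    reservoir.insert(t)  # the time index stands in for the observation x_t
drawn = reservoir.sample()
```

After s insertions, each past observation is stored with probability \(L/s\) and drawn from the reservoir with probability \(1/L\), which gives the marginal sampling probability \(1/s\).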

3.1.2 Geometric sampling

Geometric sampling arises from the idea to maintain a reservoir of size L, which is updated by a new observation at each time step by randomly replacing a reservoir observation with the newly observed one. Until time \(t_0\) the first L observations are stored in the reservoir. At each sampling step (\(t \ge t_0\)) an observation is uniformly chosen from the reservoir with probability \(p:= 1/L\). Independently, a sample from the reservoir is selected with the same probability \(p:= 1/L\) for replacement with the new observation. The resulting probabilities are of the geometric form \(p_{s,r}=p(1-p)^{s-r-1}\) for \(r\ge t_0\) and \(p_{s,r}=p(1-p)^{s-t_0}\) for \(r < t_0\). Clearly, the geometric sampling strategy yields increasing probabilities for more recent observations and we demonstrate in our experiments that this can be beneficial in scenarios with concept drift.
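The geometric strategy and its sampling probabilities can be sketched as follows. The class name `GeometricReservoir` and the helper `geometric_probabilities` are hypothetical, and the sketch assumes \(t_0 = L\), i.e. the first L observations (indexed \(0,\dots ,L-1\)) fill the reservoir before the first sampling step.

```python
import random


class GeometricReservoir:
    """Fixed-size reservoir in which each new observation overwrites a random slot."""

    def __init__(self, size, seed=0):
        self.size = size
        self.items = []
        self.rng = random.Random(seed)

    def sample(self):
        """Draw one stored observation; each slot has probability 1 / size."""
        return self.rng.choice(self.items)

    def insert(self, x):
        if len(self.items) < self.size:
            self.items.append(x)  # initial filling until time t_0
        else:
            self.items[self.rng.randrange(self.size)] = x  # random replacement


def geometric_probabilities(s, t0, L):
    """Probabilities p_{s,r} for r = 0, ..., s-1 as stated in the text, with p = 1/L."""
    p = 1.0 / L
    return [
        p * (1 - p) ** (s - t0) if r < t0 else p * (1 - p) ** (s - r - 1)
        for r in range(s)
    ]


probs = geometric_probabilities(s=50, t0=10, L=10)

reservoir = GeometricReservoir(size=10, seed=3)
for t in range(50):
    reservoir.insert(t)
```

The probabilities sum to one and grow towards the most recent observation, so older observations are forgotten at a geometric rate governed by \(p = 1/L\).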

3.2 Theoretical results of estimation quality

The estimator \({\hat{\phi }}^{(S_j)}_t\) picks up the notion of the PFI estimator \({\hat{\phi }}^{(S_j)}\) in (3), which approximates the expectation over the random sampling strategy \((\varphi _s)_{t_0 \le s \le t}\) by averaging repeated realizations. While \({\hat{\phi }}^{(S_j)}_t\) only considers one realization of the sampling strategy, it is easy to extend the approach in the incremental learning scenario by computing the estimator \({\hat{\phi }}^{(S_j)}_t\) in multiple separate runs in parallel. While this yields an efficient estimate of PFI, it is difficult to analyze the estimator theoretically as each estimator highly depends on the realizations of the sampling strategy. We, thus, again study the expectation over the sampling strategy and introduce the expected iPFI.

Definition 5

(Expected iPFI) For a data stream at time t with previous observations \((x_0,y_0),\dots ,(x_t,y_t)\) and a sampling strategy \(\varphi := (\varphi _s)_{s=t_0,\dots ,t}\) for \(t_0>0\), we define the expected iPFI as

$$\begin{aligned} {\bar{\phi }}^{(S_j)}_t:= {\mathbb {E}}_{\varphi }[{\hat{\phi }}^{(S_j)}_t], \end{aligned}$$

which corresponds to the expected PFI (model reliance) \({\bar{\phi }}^{(S_j)}\) in the batch setting.

To evaluate the estimation quality, we will analyze the bias \(\vert {\bar{\phi }}^{(S_j)}_t - \phi ^{(S_j)}(h_t) \vert\) and the variance of \({\bar{\phi }}^{(S_j)}_t\). Both can be combined by Chebyshev’s inequality to obtain bounds on the approximation error of \(\phi ^{(S_j)}(h_t)\) for \(\epsilon > \vert {\bar{\phi }}^{(S_j)}_t - \phi ^{(S_j)}(h_t) \vert\) as

$$\begin{aligned} {\mathbb {P}}(\vert {\bar{\phi }}^{(S_j)}_t - \phi ^{(S_j)}(h_t) \vert > \epsilon )= {\mathcal {O}} ({\mathbb {V}}[{\bar{\phi }}^{(S_j)}_t]). \end{aligned}$$
(6)

As noted above, all proofs are deferred to the supplementary material in Sect. A. Our theoretical results are stated and proven in a general manner, which allows one to extend our approach to other sampling strategies, other feature subsets, and even other aggregation techniques.

Static model Given iid observations from a data stream, we consider an incremental model that learns over time. We begin under the simplified assumption that the model does not change over time, i.e., \(h_t \equiv h\) for all t.

Theorem 2

(Bias for static Model) If \(h \equiv h_t\), then

$$\begin{aligned} \phi ^{(S_j)}(h) - {\mathbb {E}}[{\bar{\phi }}^{(S_j)}_t] = (1-\alpha )^{t-t_0+1} \phi ^{(S_j)}(h). \end{aligned}$$

From the above theorem, it is clear that the bias of the expected iPFI \({\bar{\phi }}^{(S_j)}_t\) decreases exponentially towards zero for \(t \rightarrow \infty\), and we thus continue to study the asymptotic estimator \(\lim _{t\rightarrow \infty }{\bar{\phi }}_t^{(S_j)}\). While the bias does not depend on the sampling strategy, our next result analyzes the variance of the asymptotic estimator, which does depend on the sampling strategy.
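The decay stated in Theorem 2 can be checked numerically. The following minimal sketch assumes that the estimate is aggregated by exponential smoothing with rate \(\alpha\), initialized at zero just before \(t_0\) (an assumption that is consistent with the factor \((1-\alpha )^{t-t_0+1}\), not a restatement of Algorithm 1), and replaces every incoming score by its expectation \(\phi ^{(S_j)}(h)\). All function names are illustrative.

```python
def expected_smoothed_estimate(phi: float, alpha: float, t0: int, t: int) -> float:
    """Expected value of an exponentially smoothed estimate started at zero.

    Assumption: every incoming score has expectation `phi`, so the expected
    update is est <- (1 - alpha) * est + alpha * phi for s = t0, ..., t.
    """
    estimate = 0.0
    for _ in range(t0, t + 1):
        estimate = (1.0 - alpha) * estimate + alpha * phi
    return estimate


def closed_form_bias(phi: float, alpha: float, t0: int, t: int) -> float:
    """Bias as given in Theorem 2: (1 - alpha)^(t - t0 + 1) * phi."""
    return (1.0 - alpha) ** (t - t0 + 1) * phi


if __name__ == "__main__":
    phi, alpha, t0 = 0.3, 0.01, 1  # hypothetical values for illustration
    for t in (10, 100, 1000):
        simulated_bias = phi - expected_smoothed_estimate(phi, alpha, t0, t)
        print(t, simulated_bias, closed_form_bias(phi, alpha, t0, t))
```

Under these assumptions, the simulated and closed-form bias agree up to floating-point error for every t.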

Theorem 3

(Variance for static Model) If \(h_t \equiv h\) and \({\mathbb {V}}[\Vert h(X_s^{(\bar{S_j})},X_r^{(S_j)})-Y_s\Vert -\Vert h(X_s)-Y_s\Vert ] <\infty\), then

$$\begin{aligned} \text {Uniform: }&{\mathbb {V}} \left[ \lim _{t\rightarrow \infty }{\bar{\phi }}_t^{(S_j)} \right] = {\mathcal {O}} (-\alpha \log (\alpha )). \\ \text {Geometric: }&{\mathbb {V}} \left[ \lim _{t\rightarrow \infty }{\bar{\phi }}_t^{(S_j)} \right] = {\mathcal {O}} (\alpha ) + {\mathcal {O}} (p). \end{aligned}$$

The variance is therefore directly controlled by the choice of parameters \(\alpha\) and p. As the asymptotic estimator is unbiased, it is clear that these parameters control the approximation error, as shown in (6).

Changing model So far, we discussed properties of \({\bar{\phi }}_t^{(S_j)}\) under the simplified assumption that \(h_t\) does not change over time. In an incremental learning scenario, \(h_t\) is updated incrementally at each time step. In cases where no concept drift affects the underlying data generating distribution, we can assume that an incremental learning algorithm gradually converges to an optimal model. We thus assume that the change of the model is controlled and show results similar to the case where \(h_t\) is static. To control model change formally, we introduce \(f^{\Delta }_S(x^{(\bar{S_j})},h_s,h_t):= {\mathbb {E}}_{{\tilde{X}} \sim {\mathbb {P}}_S}[\Vert h_t(x^{(\bar{S_j})},\tilde{X})-h_s(x^{(\bar{S_j})},{\tilde{X}})\Vert ]\). The expectation of \(f^\Delta _S\) is denoted \(\Delta _S(h_s,h_t):= {\mathbb {E}}_X[f^{\Delta }_S(X,h_s,h_t)]\) and \(\Delta (h_s,h_t):= \Delta _\emptyset (h_s,h_t)\). We show that \(\Delta _S\) and \(\Delta\) bound the difference of FI of two models \(h_t\) and \(h_s\) and the bias of our estimator.

Theorem 4

(Bias for changing Model) If \(\Delta (h_s,h_t) \le \delta\) and \(\Delta _S(h_s,h_t) \le \delta _S\) for \(t_0 \le s \le t\), then

$$\begin{aligned} \vert {\mathbb {E}}[{\bar{\phi }}^{(S_j)}_t] - \phi ^{(S_j)}(h_t)\vert \le \delta _S + \delta +{\mathcal {O}}((1-\alpha )^{t}). \end{aligned}$$

In the case of a changing model the estimator is therefore only unbiased if \(h_t \rightarrow h\) as \(t \rightarrow \infty\). For results on the variance, we control the variability of the models at different points in time. In the case of a static model, the covariances can be uniformly bounded, as they do not change over time. Instead, for a changing model, we introduce the time-dependent function

$$\begin{aligned} f_s(Z_s,Z_r):= \Vert h_s(X_s^{(\bar{S_j})},X_r^{(S_j)})-Y_s\Vert -\Vert h_s(X_s)-Y_s\Vert \end{aligned}$$

and assume existence of some \(\sigma _{\text {max}}^2\) such that

$$\begin{aligned} \text {cov}(f_s(Z_s,Z_r),f_{s'}(Z_{s'},Z_{r'})) \le \sigma _{\text {max}}^2 \end{aligned}$$
(7)

for \(t_0\le s,s' \le t\), \(r<s\) and \(r'<s'\).

Theorem 5

(Variance for changing Model) Given (7) for a sequence of models \((h_t)_{t\ge 0}\), the results of Theorem 3 apply.

Summary We have shown that the approximation error of iPFI for FI is controlled by the parameters \(\alpha\) and p. In the case of drifting data, the approximation error is additionally affected by the changes in the model, as it is then possibly biased and the covariances may change over time. As the expected PFI estimator has an approximation error of order \({\mathcal {O}}(1/N)\) for FI, we conclude that the above bounds on the approximation error of expected iPFI are also valid when compared with the expected PFI, if \(\alpha\) is chosen according to \(\alpha = \frac{2}{N+1}\). In the next section, we corroborate our theoretical findings with empirical evaluations and showcase the efficacy of iPFI in scenarios with concept drift. We also elaborate on the differences between the two sampling strategies.
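The correspondence \(\alpha = \frac{2}{N+1}\) can be turned into a small helper for choosing \(\alpha\) from a desired comparison size N, or for reading off the batch size that a given \(\alpha\) corresponds to. The sketch below only restates this formula; the function names are hypothetical.

```python
def alpha_from_window(n: int) -> float:
    """Smoothing parameter matching a batch of n observations: 2 / (n + 1)."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    return 2.0 / (n + 1)


def window_from_alpha(alpha: float) -> float:
    """Inverse relation: the batch size N that corresponds to a given alpha."""
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must lie in (0, 1]")
    return 2.0 / alpha - 1.0


if __name__ == "__main__":
    for n in (199, 1999):
        print(n, alpha_from_window(n))
```

For instance, \(N = 1999\) corresponds to \(\alpha = 0.001\) and \(N = 199\) to \(\alpha = 0.01\).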

4 Experiments

We conduct multiple experimental studies to validate our theoretical findings and present our approach on real data. We consider three benchmark datasets that are well established in the FI literature (Covert et al., 2020; Lundberg & Lee, 2017): adult (Kohavi, 1996), bank (Moro et al., 2011), and bike (Fanaee-T & Gama, 2014), where bike constitutes a regression task. We further consider two real-world binary classification data streams, elec2 (Harries, 1999) and ozone (de Souza et al., 2020). Moreover, we use the multi-class insects (de Souza et al., 2020) data stream. Lastly, we create multiple synthetic data streams based on the agrawal (Agrawal et al., 1993) and stagger (Schlimmer & Granger, 1986) concept generators, where we manually induce concept drifts. As our approach is inherently model-agnostic, we present experimental results for different model types. In the static batch scenario, we apply Gradient Boosting Tree (GBT) (Friedman, 2001) and LightGBM (LGBM) (Ke et al., 2017) ensembles and train small 2-layer Neural Networks (NN) with layer sizes (128, 64). In the dynamic incremental learning setting, we apply Adaptive Random Forest classifiers (ARF) (Gomes et al., 2017), small-scale 3-layer NNs with layer sizes (100, 100, 10), and Hoeffding Adaptive Trees (HATs) (Bifet & Gavaldà, 2009). The implementation of the models and data streams is based on scikit-learn (Pedregosa et al., 2011), river (Montiel et al., 2020), PyTorch (Paszke et al., 2017), and OpenML (Feurer et al., 2020). We mainly rely on default parameters; the supplement in Sect. C contains additional information about the datasets and details about the applied models. In all our experiments, we compute the iPFI estimator \(\hat{\phi }_{\text {iPFI}}^{(S_j)}\) as the average over ten realizations \({\hat{\phi }}^{(S_j)}_t\) of the incremental sampling strategies (uniform or geometric). All baseline approaches are chosen such that they require the same number of model evaluations as iPFI.

4.1 Experiment A: online PFI calculation under drift

First, we consider a dynamic modeling scenario. Here, instead of a pre-trained model, we fit different models incrementally on real data streams and compute iPFI on the fly. We incrementally train ARF, HAT and NN models. However, as our approach is inherently model-agnostic, any incremental model (implemented for example in river) can be explained. As a baseline, we compare our approach to the interval PFI for feature \(j \in D\), which computes the PFI over fixed time intervals during the online learning process with ten random permutations in each interval. This can be seen as a naive implementation of iPFI with large gaps of uncertainty and a substantial time delay.
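The interval PFI baseline described above can be sketched as follows. This is a minimal illustration, not the implementation used in the experiments: the fixed `predict` callable stands in for the model snapshot available at the end of each interval, and all function and parameter names are hypothetical.

```python
import numpy as np


def interval_pfi(predict, loss, X, y, feature, n_permutations=10, seed=0):
    """Permutation importance of one feature column on a single interval.

    predict: callable mapping an (n, d) array to n predictions.
    loss: callable mapping (y_true, y_pred) to a scalar mean loss.
    """
    rng = np.random.default_rng(seed)
    baseline = loss(y, predict(X))
    increases = []
    for _ in range(n_permutations):
        X_perm = X.copy()
        # Shuffling the column breaks its link to the target.
        X_perm[:, feature] = rng.permutation(X_perm[:, feature])
        increases.append(loss(y, predict(X_perm)) - baseline)
    return float(np.mean(increases))


def interval_pfi_over_stream(predict, loss, X, y, feature, interval_length):
    """Evaluate interval PFI on consecutive, non-overlapping intervals."""
    scores = []
    for start in range(0, len(X) - interval_length + 1, interval_length):
        stop = start + interval_length
        scores.append(interval_pfi(predict, loss, X[start:stop], y[start:stop], feature))
    return scores
```

Each score only becomes available once its interval is complete, which illustrates the time delay of this baseline compared with an anytime estimate.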

With the synthetic agrawal stream we induce two kinds of real concept drifts: First, we switch the classification function of the data generator, which we refer to as function-drift (changing the functional dependency but retaining the distribution of X). Second, we switch the values of two or more features with each other, which we refer to as feature-drift (changing the functional dependency by changing the distribution of X). Note that feature-drift can be applied to datasets, where the classification function is unknown (like elec2).
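A feature-drift of this kind can be induced on any stream without knowledge of the classification function. The sketch below assumes observations arrive as (feature dictionary, label) pairs, as is common for dict-based streaming libraries; the wrapper name and its arguments are hypothetical and not taken from the experimental code.

```python
from typing import Dict, Iterable, Iterator, Tuple


def induce_feature_drift(
    stream: Iterable[Tuple[Dict[str, float], int]],
    feature_a: str,
    feature_b: str,
    drift_position: int,
) -> Iterator[Tuple[Dict[str, float], int]]:
    """Swap the values of two features from `drift_position` onward."""
    for index, (x, y) in enumerate(stream):
        if index >= drift_position:
            x = dict(x)  # copy so the caller's observation is not mutated
            x[feature_a], x[feature_b] = x[feature_b], x[feature_a]
        yield x, y


if __name__ == "__main__":
    toy_stream = [({"age": float(i), "car": -1.0}, 0) for i in range(4)]
    for x, y in induce_feature_drift(toy_stream, "age", "car", drift_position=2):
        print(x, y)
```

Swapping more than two features can be obtained by chaining the wrapper.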

Fig. 4 iPFI on two agrawal concept drift data streams for ARF classifiers. The most important features are highlighted in color. The dashed line denotes the batch calculation at set intervals (Color figure online)

Figure 4 showcases how well iPFI reacts to both concept drift scenarios. Both concept drifts are induced in the middle of the data stream (after 10,000 samples). For the function-drift example (Fig. 4, left), the agrawal classification function was switched from concept 2 to concept 3 of Agrawal et al. (1993). Theoretically, only two features should be important for each concept: For the first concept, the pink salary and the purple age features are needed, and for the second concept, the classification function relies on the cyan education and the purple age features. However, the ARF model also relies on the blue commission feature, which can be explained by the fact that commission directly depends on salary and, thus, is transitively correlated with the target variable.

In the feature-drift scenario (Fig. 4, right), the ARF model adapts to a sudden drift where both important features (education and age) are switched with two unimportant features (car and salary). In both scenarios, iPFI instantly detects the shifts in importance. Both simulations show that iPFI and its anytime computation have clear advantages over interval PFI: iPFI quickly reacts to changes in the data distribution while still closely matching the “ground-truth” results of the interval-wise computation.

Beyond the synthetic concept drifts on agrawal, Fig. 5 illustrates that iPFI explanations are model-agnostic on the original elec2 data stream. There, we incrementally train a NN and an ARF classifier on the stream without inducing an additional feature-drift. For further concept drift scenarios, we refer to the supplementary material in Sect. C.

Fig. 5 iPFI on elec2 (without inducing a feature-drift) for an incrementally fitted NN (left) and an ARF (right)

Table 1 Summary of the additional time complexity of iPFI

Time complexity

Aside from the approximation quality in the incremental setting, we also summarize the additional time complexity of iPFI in Table 1 and observe a linear relationship (\(0.104 \cdot \vert D \vert , R^{2} = 0.966\)) over the feature count \(\vert D\vert\). For a detailed illustration of the linear relationship, we refer to Sect. C.4. We run the explanation procedure ten times for seven datasets and track the run-time with and without iPFI explanations. To attribute the variability of the run-times to the explanation procedure alone, we use the same ARF classification model for all seven datasets. We further decompose the explanation time into the time it takes to run the model in inference (line 3 in Algorithm 1) and the remaining storing and sampling overhead. Most of the explanation time (95% to 99%) is spent on model inference, for which performance gains cannot be easily achieved without parallelization.

Fig. 6 Comparison of iPFI (solid) and MDI (dotted) on an agrawal concept drift stream (concept 2 to 3 (Agrawal et al., 1993), left) and elec2 (right). For each stream a single HAT classifier is trained and explained

Sanity check with tree-specific mean decrease in impurity To further illustrate the efficacy of our approach, we also compare our model-agnostic iPFI explainer to the model-specific baseline of Mean Decrease in Impurity (MDI). Earlier works (Cassidy & Deviney, 2014; Gomes et al., 2019) leverage MDI as an importance measure in the incremental setting. Similar to Gomes et al. (2019), we manually compute the MDI on incremental summary statistics stored at each split-node of a HAT classifier. As an impurity measure, we compute the Gini impurity index as in Cassidy and Deviney (2014). Figure 6 shows the comparison of iPFI and MDI for an agrawal concept drift data stream and elec2. Aside from the differing scales, both measures detect the same importance rankings and react to concept drift. However, as MDI can only be computed for tree-based models such as HATs and ARFs, its applicability is strictly limited compared to the model-agnostic approach of calculating iPFI, which can be applied to any model class and loss function.
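For reference, the Gini-based impurity decrease underlying MDI can be sketched from per-node class counts. The code below is a generic illustration of the textbook definition; the tuple layout for split nodes is a hypothetical stand-in for the summary statistics stored in a HAT, and weighting each decrease by the share of observations reaching the node is an assumption.

```python
from typing import Dict, Hashable, Iterable, Sequence, Tuple


def gini(class_counts: Sequence[float]) -> float:
    """Gini impurity 1 - sum_k p_k^2 computed from class counts."""
    total = sum(class_counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((count / total) ** 2 for count in class_counts)


def impurity_decrease(parent: Sequence[float], left: Sequence[float],
                      right: Sequence[float]) -> float:
    """Parent impurity minus the size-weighted impurity of both children."""
    n_parent, n_left, n_right = sum(parent), sum(left), sum(right)
    if n_parent == 0:
        return 0.0
    return (gini(parent)
            - (n_left / n_parent) * gini(left)
            - (n_right / n_parent) * gini(right))


Split = Tuple[Hashable, Sequence[float], Sequence[float], Sequence[float]]


def mean_decrease_in_impurity(splits: Iterable[Split], n_root: float) -> Dict[Hashable, float]:
    """Sum impurity decreases per split feature, weighted by node size.

    Each split is (feature, parent_counts, left_counts, right_counts).
    """
    importance: Dict[Hashable, float] = {}
    for feature, parent, left, right in splits:
        weight = sum(parent) / n_root
        importance[feature] = importance.get(feature, 0.0) + weight * impurity_decrease(parent, left, right)
    return importance
```

Because these scores are sums of impurity decreases rather than changes in loss, their scale differs from that of PFI-based scores.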

4.2 Experiment B: Geometric vs. uniform sampling

Second, we address the question of which sampling strategy to prefer in which learning environment. We conclude that geometric sampling should be applied under feature-drift scenarios, as the choice of sampling strategy substantially impacts iPFI’s performance in concept drift scenarios where feature distributions change over time. If a dynamic model adapts to changing feature distributions and the PFI is estimated with samples from the outdated distribution, the resulting replacement samples lie outside the current data manifold. Estimating PFI with such data can result in skewed estimates, as illustrated in Fig. 7. There, we induce a feature-drift by switching the values of the most important feature for an ARF model on elec2 with a random feature. Unlike the geometric sampling strategy (Fig. 7, right), the uniform sampling strategy (Fig. 7, left) fails to match the “ground-truth” interval PFI estimation. Hence, in dynamic learning environments like data stream analytics or continual learning, we recommend applying a sampling strategy that focuses on more recent samples, such as geometric distributions. For applications without drift in the feature-space, like progressive data science, uniform sampling strategies, which evenly distribute the probability of a data point being sampled across the data stream, may still be preferred.
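The difference between the two strategies can be illustrated with two fixed-size reservoirs. The sketch below uses standard constructions that we assume here for illustration: classical reservoir sampling (Algorithm R) for uniform sampling, and random replacement of a stored observation by every new one for geometric sampling. Class and attribute names are hypothetical.

```python
import random


class UniformReservoir:
    """Algorithm R: every observation seen so far is stored with equal probability."""

    def __init__(self, capacity: int, seed: int = 0):
        self.capacity = capacity
        self.items = []
        self.n_seen = 0
        self._rng = random.Random(seed)

    def update(self, item) -> None:
        self.n_seen += 1
        if len(self.items) < self.capacity:
            self.items.append(item)
            return
        # Keep the new item with probability capacity / n_seen.
        slot = self._rng.randrange(self.n_seen)
        if slot < self.capacity:
            self.items[slot] = item

    def sample(self):
        return self._rng.choice(self.items)


class GeometricReservoir:
    """Always stores the newest observation by evicting a random stored one."""

    def __init__(self, capacity: int, seed: int = 0):
        self.capacity = capacity
        self.items = []
        self._rng = random.Random(seed)

    def update(self, item) -> None:
        if len(self.items) < self.capacity:
            self.items.append(item)
        else:
            self.items[self._rng.randrange(self.capacity)] = item

    def sample(self):
        return self._rng.choice(self.items)
```

Under random replacement, a stored observation survives each later update with probability 1 - 1/capacity, so the age of stored observations follows a geometric distribution and old, possibly outdated observations are quickly forgotten. Relating 1/capacity to the parameter p of Theorem 3 is our assumption.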

Fig. 7 iPFI with uniform (left) and geometric sampling (right) on elec2 with a feature-drift

Parameter considerations We further analyze the two most important hyperparameters on the elec2 data stream. The results are shown in Fig. 8. They show that the smoothing parameter \(\alpha\) substantially affects iPFI’s FI estimates. Like any smoothing mechanism, this parameter controls the deviation of iPFI’s estimates and should be set individually for the task at hand. In our experiment, values between \(\alpha = 0.001\) (conservative) and \(\alpha = 0.01\) (reactive) appeared to be reasonable. The size of the reservoir does not substantially affect the estimation quality for values between 50 and \(2\,000\).

Fig. 8 The importance of the nswprice feature for an ARF model training on elec2 for different values of \(\alpha\) (left) and reservoir length (right)

4.3 Experiment C: Approximation of batch PFI

Table 2 Median error of iPFI compared to batch PFI (IQR between \(Q_{1}\) and \(Q_{3}\) in braces)
Fig. 9 Boxplot of PFI estimates per feature of the bike regression dataset for batch PFI (left), geometric sampling iPFI (middle), and uniform sampling iPFI (right) on a pre-trained LGBM regressor

We further consider the static model setting, where models are pre-trained before they are explained on the whole dataset (no incremental learning). This experiment demonstrates that iPFI correctly approximates batch PFI estimation. We compare iPFI with the classical batch PFI \({\hat{\phi }}^{(S_j)}_{\text {batch}}\) for feature \(j \in D\), which is computed using the whole static dataset over ten random permutations. We normalize \(\hat{\phi }_{\text {iPFI}}^{(S_j)}\) and \(\hat{\phi }_{\text {batch}}^{(S_j)}\) between 0 and 1, and compute the sum over the feature-wise absolute approximation errors \(\sum _{j \in D}{\vert \hat{\phi }_{\text {iPFI}}^{(S_j)}-\hat{\phi }_{\text {batch}}^{(S_j)}\vert }\). Table 2 shows the median and interquartile range (IQR, the range between the first and third quartile) of the error based on ten random orderings of each dataset. Figure 9 illustrates the approximation quality of iPFI with geometric and uniform sampling per feature for the bike regression dataset. Further results can be found in the supplementary material in Sect. C. In the static modeling case, there is no clear difference between geometric and uniform sampling. However, in the dynamic modeling context under drift, the sampling strategy has a substantial effect on the iPFI estimates.
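The error measure above can be sketched in a few lines. The text does not specify how the scores are normalized between 0 and 1; the sketch below assumes min-max scaling per method, which is only one possible choice, and all function names are hypothetical.

```python
import numpy as np


def min_max_normalize(scores):
    """Scale scores to [0, 1] (assumed normalization; other choices are possible)."""
    scores = np.asarray(scores, dtype=float)
    span = scores.max() - scores.min()
    if span == 0:
        return np.zeros_like(scores)
    return (scores - scores.min()) / span


def approximation_error(ipfi_scores, batch_scores):
    """Sum over features of the absolute difference of normalized scores."""
    diff = min_max_normalize(ipfi_scores) - min_max_normalize(batch_scores)
    return float(np.sum(np.abs(diff)))


def median_and_quartiles(errors):
    """Median together with the first and third quartile of repeated runs."""
    q1, median, q3 = np.percentile(errors, [25, 50, 75])
    return float(median), float(q1), float(q3)
```

In such a setup, `approximation_error` would be evaluated once per random ordering of a dataset, and `median_and_quartiles` would summarize the resulting errors.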

5 Conclusion and future work

In this work, we considered global FI as a statistical measure of change in the model’s risk when features are marginalized. We discussed PFI as an approach to estimate feature importance and proved that only appropriately scaled permutation tests are unbiased estimators of global FI (Theorem 1). In this case, the expectation over the sampling strategy (expected PFI) corresponds to the model reliance U-statistic (Fisher et al., 2019).

Based on this notion, we presented iPFI, which is a model-agnostic algorithm to incrementally estimate global FI with PFI by averaging importance scores for individual observations over repeated realizations of a sampling strategy. We introduced two incremental sampling strategies and established theoretical results for the expectation over the sampling strategy (expected iPFI) to control the approximation error using iPFI’s parameters. On various benchmark datasets, we demonstrated the efficacy of our algorithms by comparing them with the batch PFI baseline method in a static progressive setting as well as with interval-based PFI in a dynamic incremental learning scenario with different types of concept drift and parameter choices.

Applying XAI methods incrementally to data stream analytics offers unique insights into models that change over time. In this work, we rely on PFI as an established and inexpensive FI measure. Other computationally more expensive approaches (such as SAGE) address some limitations of PFI. As our theoretical results can be applied to arbitrary feature subsets, analyzing these methods in the dynamic environment offers interesting research opportunities. In contrast to this work’s technical focus, analyzing the dynamic XAI scenario through a human-focused lens with human-grounded experiments is paramount (Doshi-Velez & Kim, 2017).