Incremental permutation feature importance (iPFI): towards online explanations on data streams

Fumagalli, Fabian; Muschalik, Maximilian; Hüllermeier, Eyke; Hammer, Barbara

doi:10.1007/s10994-023-06385-y

Open access
Published: 20 September 2023

Volume 112, pages 4863–4903, (2023)

Machine Learning

Abstract

Explainable artificial intelligence has mainly focused on static learning scenarios so far. We are interested in dynamic scenarios where data is sampled progressively, and learning is done in an incremental rather than a batch mode. We seek efficient incremental algorithms for computing feature importance (FI). Permutation feature importance (PFI) is a well-established model-agnostic measure to obtain global FI based on feature marginalization of absent features. We propose an efficient, model-agnostic algorithm called iPFI to estimate this measure incrementally and under dynamic modeling conditions including concept drift. We prove theoretical guarantees on the approximation quality in terms of expectation and variance. To validate our theoretical findings and the efficacy of our approaches in incremental scenarios dealing with streaming data rather than traditional batch settings, we conduct multiple experimental studies on benchmark data with and without concept drift.

1 Introduction

Online learning from dynamic data streams is a prevalent machine learning (ML) approach for various application domains (Bahri et al., 2021). For instance, predicting energy consumption for individual households can foster energy-saving strategies such as load-shifting. Concept drift resulting from environmental changes, such as pandemic-induced lock-downs, drastically impacts the energy consumption patterns necessitating online ML (García-Martín et al., 2019). Explaining these predictions yields a greater understanding of an individual’s energy use and enables prescriptive modeling for further energy-saving measures (Wastensteiner et al., 2021). For black-box ML methods, so-called post-hoc XAI methods seek to explain single predictions or entire models in terms of the contribution of specific features (Adadi & Berrada, 2018).

We are interested in feature importance (FI) as a global assessment of features, which indicates their respective relevance to the given task and model. A prominent representative of model-agnostic FI measures is the permutation feature importance (PFI), which, in its original form, has been introduced for tree-based models in Breiman (2001) with various applications and extensions (Strobl et al., 2007, 2008; Altmann et al., 2010; Hapfelmeier et al., 2014; Zhu et al., 2015; Gregorutti et al., 2015, 2017). Recent work (Fisher et al., 2019) adapts PFI to a model-agnostic FI measure (model reliance) and establishes important theoretical guarantees. Despite its limitations (Hooker et al., 2019; Fisher et al., 2019), we focus on PFI as a well-established, efficiently implementable, model-agnostic FI measure, which has served as a baseline for various more powerful extensions (Casalicchio et al., 2018; Covert et al., 2020; Molnar et al., 2020; König et al., 2021). So far, PFI requires a holistic view of the entire dataset in a static batch learning environment, which does not account for changes in the model structure, efficient anytime computations or sparse storage capabilities in data stream settings.

More generally, explainable artificial intelligence (XAI) has been studied mainly in the batch setting, where learning algorithms operate on static datasets. In scenarios where data does not fit into memory or computation time is strictly limited, like in progressive data science for big datasets (Turkay et al., 2018), or rapid online learning from data streams (Bahri et al., 2021), high computation times prohibit the use of traditional FI or XAI measures. Incremental, time- and memory-efficient implementations that provide anytime results have received much attention in recent years (Losing et al., 2018; Montiel et al., 2020). In particular, incremental algorithms enable a lifelong adaptation of machine learning technologies and their applications to possibly infinite data streams, addressing computational challenges as well as the challenge of dealing with changes of the underlying data distribution called drift. In this article, we are interested in efficient incremental algorithms for FI (see Fig. 1). Especially in the context of drifting data distributions, this task is particularly relevant—but also challenging, as many common FI methods are already computationally costly in the batch setting. While we focus with our implementation and theoretical results on PFI, our methodology of incremental FI could also be extended to other XAI measures.

Contribution

We propose an incremental variant of PFI as a model-agnostic global FI estimator which is capable of dealing with data streams of arbitrary length in limited memory and linear time. Our algorithm can be applied to any model that is incrementally learned on a data stream and provides anytime explanations that immediately react to changes in the model and the underlying data distribution in case of concept drift. The main idea is that these estimates are efficiently updated at each time step by computing a one-sample estimate of FI, which is then exponentially averaged over time. To approximate marginal feature distributions we introduce two sampling strategies, which can be applied in scenarios with and without concept drift. The core idea, inspired by reservoir sampling (Vitter, 1985), is to efficiently maintain a reservoir to sample observations that are used to approximate the marginal distribution. Our core contributions include:

We introduce iPFI as an incremental and model-agnostic estimator for global FI by constructing an online variant of PFI (Sect. 3). To our knowledge, this constitutes the first mathematically substantiated approach for online global FI estimation. In contrast to PFI, iPFI reacts to concept drift in non-stationary environments and provides an explanation stream alongside the data stream in linear time and constant memory. The explanation stream can be utilized for further downstream tasks, such as inspecting possible causes for observed drift.
We motivate iPFI by establishing the concrete connection of permutation tests (Definition 2) (Breiman, 2001) and model reliance (Definition 3) (Fisher et al., 2019) in the batch setting (Theorem 1). This finding extends (Fisher et al., 2019, Appendix A.3) and shows that only properly scaled permutation tests are unbiased estimates of global FI.
We provide two sampling strategies for iPFI to incrementally compute marginal feature distributions and establish theoretical guarantees regarding bias, variance, and approximation error in terms of a single sensitivity parameter in a static and dynamic learning scenario.
We implement iPFI and conduct experiments on its ability to efficiently provide anytime global FI values under different types of concept drift, as well as its approximation quality compared to batch permutation tests in static modeling scenarios.

All experiments and algorithms are publicly available and integrated natively into the well-known incremental learning framework river (Montiel et al., 2020).^{Footnote 1}

Related work

A variety of model-agnostic local FI methods (Ribeiro et al., 2016; Lundberg & Lee, 2017; Lundberg et al., 2020; Covert & Lee, 2021) exist that provide relevance values for single instances. In addition, model-specific variants have been proposed for neural networks (Bach et al., 2015; Selvaraju et al., 2017) and trees (Lundberg et al., 2020). In contrast, global FI methods provide relevance values across all instances. PFI (or permutation tests) (Breiman, 2001) is a prominent global FI approach that has been widely applied (Archer & Kimes, 2008; Calle & Urrea, 2011; Wang et al., 2016), studied and extended (Strobl et al., 2007, 2008; Altmann et al., 2010; Hapfelmeier et al., 2014; Zhu et al., 2015; Gregorutti et al., 2015, 2017) for tree-based models. The method has recently been introduced as a model-agnostic approach in Fisher et al. (2019) and extended to scenarios with strongly correlated features in Molnar et al. (2020); König et al. (2021). In this regard, our definition of global FI also relates to an ongoing debate on whether absent features should be marginalized using the conditional distribution (Aas et al., 2021; Frye et al., 2021) or the marginal distribution (Janzing et al., 2020), as proposed by PFI, where it was argued that the choice should depend on the application (Chen et al., 2020), and the marginal distribution was used as an approximation of the conditional distribution (Covert et al., 2020; Lundberg & Lee, 2017). A particularly popular extension is SAGE, a Shapley-based (Shapley, 1953) approach, which averages marginal feature contributions over arbitrary subsets of marginalized features. It has been proposed and compared with existing methods in Covert et al. (2020) and a closely related idea was previously introduced in Casalicchio et al. (2018), called SFIMP. As calculating FI values is computationally expensive, especially for Shapley-based methods, more efficient approaches such as FastSHAP (Jethani et al., 2021) have been introduced. Yet, none of the above methods and extensions natively support an incremental or dynamic setting in which the underlying model and its global FI can rapidly change due to concept drift.

An initial approach to explaining model changes by computing differences in FI utilizing drift detection methods is Muschalik et al. (2022). However, this does not constitute an incremental FI measure. The explanations are created with a time delay and without efficient anytime calculations. A first step towards anytime FI values has been proposed for online random forests by computing mean decrease in impurity (MDI) and accuracy (MDA) over time by using online confusion matrices (Cassidy & Deviney, 2014) or maintaining node statistics incrementally (Gomes et al., 2019). While online feature scores are of particular interest in streaming scenarios, these methods are limited to a specific model class, need access to the inherent model structure and cannot be extended to a model-agnostic approach. They further do not provide any theoretical guarantees about the approximation quality in comparison to the batch versions.

Similar to batch learning, incremental FI values could be used in further downstream tasks. As an example, incremental FI is also relevant to the field of incremental feature selection, where FI is calculated periodically with a sliding window to retain features for the incrementally fitted model (Barddal et al., 2019; Yuan et al., 2018). Lastly, changes in FI can also be used for concept drift detection, as described in Haug et al. (2022).

In this work, we provide a mathematically substantiated model-agnostic incremental FI measure, whose time sensitivity can be controlled by a single smoothing parameter. To our knowledge, this is the first approach that combines online ML and model-agnostic XAI measures and provides extensive theoretical guarantees on its approximation quality.

2 Global feature importance

We consider a supervised learning scenario, where ${\mathscr {X}}$ is the feature space and ${\mathscr {Y}}$ the target space, e.g., ${\mathscr {X}} = {\mathbb {R}}^d$ and ${\mathscr {Y}} = {\mathbb {R}}$ (regression), ${\mathscr {Y}} = \{0,1\}$ (binary classification) or ${\mathscr {Y}} = \{0,1\}^c$ (multiclass classification). Let $h: {\mathscr {X}} \rightarrow {\mathscr {Y}}$ be a model, which is learned from a set or stream of observed data points $z=(x,y) \in {\mathscr {X}} \times {\mathscr {Y}}$. Let $D = \{1,\dots ,d\}$ be the set of feature indices for the vector-wise feature representations of $x = (x^{(i)}:i \in D) \in {\mathscr {X}}$. Consider a subset $S \subset D$ and its complement ${\bar{S}}:= D {\setminus } S$, which partitions the features, and denote $x^{(S)}:= (x^{(i)}: i \in S)$ as the feature subset of S for a sample x. We write $h(x^{({\bar{S}})},x^{(S)}):= h(x)$ to distinguish between features from ${\bar{S}}$ and S. For the basic setting, we assume that N observations are drawn independently and identically distributed (iid) from the joint distribution of unknown random variables (X, Y) and denote by ${\mathbb {P}}_S$ the marginal distribution of the features in S, i.e., $z_n:= (x_n,y_n)$ from $Z_n:= (X_n,Y_n) \overset{iid}{\sim }\ {\mathbb {P}}_{(X,Y)}$ and $x_n^{(S)} \text { from } X_n^{(S)} \overset{iid}{\sim }\ {\mathbb {P}}_S$ for samples $n=1,\dots ,N$.

Feature importance refers to the relevance of a set of features S for a model h. To quantify FI, the key idea of measures such as PFI is to compare the model’s performance when using only features in ${\bar{S}}$ with the performance when using all features in $D = S \cup {\bar{S}}$. The idea is that the “removal” of an important feature (i.e., the feature is not provided to a model) substantially decreases a model’s performance. The model performance or risk is measured based on a norm $\Vert \cdot \Vert :{\mathscr {Y}} \rightarrow {\mathbb {R}}$ on ${\mathscr {Y}}$, e.g., the Euclidean norm, as ${\mathbb {E}}_{(X,Y)}[\, \Vert h(X)-Y\Vert \, ]$.

As the model is trained on all features and retraining is computationally expensive, a common method to restrict h to $\bar{S}$ is to marginalize h over the features in S. We denote the marginalized risk

$$\begin{aligned} f_S \big (x^{({\bar{S}})},y \big ):= {\mathbb {E}}_{{\tilde{X}} \sim {\mathbb {P}}_S} \left[ \Vert h(x^{({\bar{S}})},{\tilde{X}})-y\Vert \right] . \end{aligned} \tag{1}$$

We then define FI for a model h and a feature set S as the difference of the marginalized risk and the inherent risk.

Definition 1

(Global FI) For a model h and a subset $S \subset D$, the global feature importance (global FI) is defined as

$$\begin{aligned} \phi ^{(S)}(h):= \underbrace{{\mathbb {E}}_{(X,Y)} \Big [f_S(X^{(\bar{S})},Y) \Big ]}_{\hbox { marginalized risk over}\ {\mathbb {P}}_S} - \underbrace{{\mathbb {E}}_{(X,Y)} \Big [\Vert h(X)-Y\Vert \Big ]}_{\text {risk}}. \end{aligned}$$

This global FI measures the increase in risk when the features in S are marginalized.
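
As a small worked example of Definition 1 (the toy setting is an assumption made here for illustration and does not appear in the experiments), let $d=2$, let $X^{(1)}$ take the values 0 and 1 with probability $1/2$ each, let $Y = X^{(1)}$ and $h(x) = x^{(1)}$, and let $\Vert \cdot \Vert$ be the absolute value. The risk of $h$ is zero, and marginalizing feature 1 replaces it by an independent copy ${\tilde{X}}$:

$$\begin{aligned} \phi ^{(\{1\})}(h)&= {\mathbb {E}}_{(X,Y)} \Big [ {\mathbb {E}}_{{\tilde{X}} \sim {\mathbb {P}}_{\{1\}}} \big [ \vert {\tilde{X}} - Y \vert \big ] \Big ] - 0 = {\mathbb {P}}({\tilde{X}} \ne Y) = \tfrac{1}{2}, \\ \phi ^{(\{2\})}(h)&= {\mathbb {E}}_{(X,Y)} \big [ \vert X^{(1)} - Y \vert \big ] - 0 = 0. \end{aligned}$$

The feature that the model relies on receives a positive importance, whereas the feature that $h$ ignores receives an importance of zero.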

Remark 1

Our definition is best suited and inherent for PFI (Breiman, 2001; Fisher et al., 2019) with single feature subsets. However, it is also related to a more general definition of FI given in Covert et al. (2020). Therein, FI is based on the reduction in risk, if features in S are included compared to marginalizing all features. In contrast to our definition, it relies on the conditional distribution $X^{({\bar{S}})} \mid X^{(S)}$. In practice, however, the marginal distribution is often used to approximate the conditional distribution, where both coincide, if feature independence is assumed (Lundberg & Lee, 2017; Covert et al., 2020, 2021). In this case, it directly corresponds to our definition with different notation. In the literature, it was argued that the choice of distribution should depend on the application (Chen et al., 2020). The conditional distribution was preferred in Aas et al. (2021); Frye et al. (2021), which includes causal relationships in the explanation (Chen et al., 2020), whereas the marginal distribution was preferred in Janzing et al. (2020), which explains the model independent of the relationships between the features (Janzing et al., 2020; Chen et al., 2020).

2.1 Empirical estimation of global FI

Given observations $(x_1,y_1),\dots ,(x_N,y_N)$, we estimate global FI for a given model h with the canonical estimator

$$\begin{aligned} {\hat{\phi }}^{(S)}_{\varphi }:= \frac{1}{N}\sum _{n=1}^N {\hat{\lambda }}^{(S)}(x_n,x_{\varphi (n)},y_n), \end{aligned} \tag{2}$$

where $\varphi :\{1,\dots ,N\} \rightarrow \{1,\dots ,N\}$ represents the realization of a (possibly random) sampling strategy that chooses for $x_n$ an observation $x_{\varphi (n)}$ as a replacement value with

$$\begin{aligned} {\hat{\lambda }}^{(S)}(x_n,x_m,y_n):= \Vert h(x_n^{(\bar{S})},x^{(S)}_m)-y_n\Vert - \Vert h(x_n)-y_n\Vert . \end{aligned}$$

Given the iid assumption, it is clear that due to $X_n \perp X_{n'}$ for $n\ne n'$, the estimator is an unbiased estimator of the global FI $\phi ^{(S)}(h)$, if $\varphi (n) \ne n$ for all $n=1,\dots ,N$. In the case of $\varphi (n)=n$, the term in the sum is zero as well as its expectation, which implies ${\mathbb {E}}[{\hat{\phi }}^{(S)}_{\varphi }] \le \phi ^{(S)}(h)$ for any $\varphi$. We will now discuss a well understood choice of feature subsets $S \subset D$, sampling strategy $\varphi$ and two estimators for $\phi ^{(S)}(h)$.
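
The canonical estimator in (2) can be sketched in a few lines of Python. This is a minimal illustration under assumptions made here: scalar targets with the absolute error as norm, features stored as plain lists, and hypothetical helper names (`abs_loss`, `canonical_fi`) that are not taken from the authors' implementation.

```python
def abs_loss(prediction, target):
    """Absolute error, i.e. the norm ||h(x) - y|| for scalar targets."""
    return abs(prediction - target)


def canonical_fi(model, xs, ys, subset, phi, loss=abs_loss):
    """Canonical estimator of global FI for the feature indices in `subset`.

    phi[n] is the index of the observation whose values replace the
    features in `subset` for observation n (the sampling strategy).
    """
    total = 0.0
    for n, (x, y) in enumerate(zip(xs, ys)):
        x_replaced = list(x)
        for j in subset:
            x_replaced[j] = xs[phi[n]][j]  # take feature j from x_{phi(n)}
        total += loss(model(x_replaced), y) - loss(model(x), y)
    return total / len(xs)


# Toy data (assumed): the model only reads feature 0, and y equals feature 0.
model = lambda x: x[0]
xs = [[0, 5], [1, 7], [2, 9]]
ys = [0, 1, 2]

print(canonical_fi(model, xs, ys, subset=[0], phi=[1, 2, 0]))  # derangement
print(canonical_fi(model, xs, ys, subset=[0], phi=[0, 1, 2]))  # identity map
print(canonical_fi(model, xs, ys, subset=[1], phi=[1, 2, 0]))  # unused feature
```

With the derangement the three terms are 1, 1 and 2, so the estimate is 4/3. With the identity map every term vanishes, which illustrates why fixed points of $\varphi$ pull the estimate towards zero.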

2.2 Permutation feature importance (PFI)

A popular example of global FI is the well-known PFI (Breiman, 2001) that measures the importance of each feature $j \in D$ by using a set $S_j:= \{j\}$. More precisely, the FI for each feature $j \in D$ is given by $\phi ^{(S_{j})}$ with sets $S_{j}$ and their complement ${\bar{S}}_{j}:= D \setminus \{j\}$. The sampling strategy $\varphi$ used in PFI samples uniformly generated permutations $\varphi \in {\mathfrak {S}}_N$ over the set $\{1,\dots ,N\}$, where each permutation has a probability of 1/N!.

2.2.1 Empirical PFI

Permutation tests, as proposed initially in Breiman (2001), effectively approximate ${\mathbb {E}}_{\varphi }[{\hat{\phi }}^{(S_j)}_\varphi ]$ by averaging over M uniformly sampled random permutations. We now introduce a corrected version of the originally proposed estimator, which we refer to as PFI by introducing a normalizing factor $\frac{N}{N-1}$.

Definition 2

(PFI) Given samples $(x_1,y_1),\dots ,(x_N,y_N)$ and uniformly sampled permutations $\varphi _1,\dots ,\varphi _M \overset{iid}{\sim } \text {unif}({\mathfrak {S}}_N)$, we define the PFI estimator as

$$\begin{aligned} {\textbf {PFI}}: {\hat{\phi }}^{(S_j)}:= \frac{N}{N-1}\underbrace{\frac{1}{M} \sum _{m=1}^M {\hat{\phi }}^{(S_j)}_{\varphi _m}}_{\approx {\mathbb {E}}_{\varphi }[{\hat{\phi }}^{(S_j)}_\varphi ]}. \end{aligned} \tag{3}$$

As discussed above, the estimator ${\hat{\phi }}^{(S_j)}_{\varphi }$ for a given $\varphi$ is an unbiased estimator for global FI $\phi ^{(S_j)}(h)$, if the permutation is a derangement (no fixed points). Our version differs by the factor $\frac{N}{N-1}$ from the initially proposed approach (Breiman, 2001; Fisher et al., 2019). In the following, we show that, if the expectation over uniformly sampled permutations $\varphi \sim \text {unif}({\mathfrak {S}}_{N})$ is taken, our definition is an unbiased estimator of global FI. This expectation directly links PFI to model reliance (Fisher et al., 2019), which we thus refer to as expected PFI. While our definition of PFI is closely related to the original method (Breiman, 2001), the link to expected PFI allows us to provide further theoretical results. We utilize this link in an incremental learning setting to provide theoretical guarantees.
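
A minimal sketch of the PFI estimator from Definition 2 follows. It assumes scalar targets with the absolute error as norm and list-valued features; the function name `pfi` and its parameters are hypothetical and chosen only for this illustration.

```python
import random


def pfi(model, xs, ys, j, n_permutations=100, seed=0):
    """PFI estimate for feature j, including the N/(N-1) correction."""
    rng = random.Random(seed)
    n = len(xs)
    base = [abs(model(x) - y) for x, y in zip(xs, ys)]  # ||h(x_n) - y_n||
    acc = 0.0
    for _ in range(n_permutations):
        perm = list(range(n))
        rng.shuffle(perm)  # uniformly random permutation of the indices
        for i in range(n):
            x_perm = list(xs[i])
            x_perm[j] = xs[perm[i]][j]  # feature j taken from x_{perm(i)}
            acc += abs(model(x_perm) - ys[i]) - base[i]
    mean_over_permutations = acc / (n_permutations * n)
    return n / (n - 1) * mean_over_permutations
```

Calling `pfi(model, xs, ys, j)` once per feature index `j` yields one importance score per feature. Because the correction factor is the same for every feature, it leaves the ranking of features unchanged and only affects the scale of the scores.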

2.2.2 Expected PFI

The PFI estimator can be efficiently computed but highly depends on the sampled permutations complicating the theoretical analysis. Another definition of PFI (model reliance), which was given and extensively studied in Fisher et al. (2019), is independent of sampled permutations. We refer to it as the expected PFI.

Definition 3

(Expected PFI) Given observations $(x_1,y_1),\dots ,(x_N,y_N)$ the expected PFI is defined as

$$\begin{aligned} {\bar{\phi }}^{(S_j)}:= \underbrace{\frac{1}{N(N-1)} \sum _{n=1}^N\sum _{m \ne n}\Vert h(x_n^{(\bar{S_j})},x_m^{(S_j)})-y_n\Vert }_{=: {\hat{e}}_{\text {switch}}} - \underbrace{\frac{1}{N} \sum _{n=1}^N\Vert h(x_n)-y_n\Vert }_{=:{\hat{e}}_{\text {orig}}} \end{aligned}$$

The expected PFI computes the difference of the error of the model averaged over all feature instantiations ${\hat{e}}_{\text {switch}}$ with the model error ${\hat{e}}_{\text {orig}}$^{Footnote 2}. We now show that the expected PFI is actually the expectation over the sampling procedure $\varphi$ of PFI, which directly links Definition 2 and Definition 3. As expected PFI is an unbiased estimator for global FI, we conclude that PFI is an unbiased estimator, if properly scaled as in Definition 2.

Theorem 1

The expected PFI (model reliance) can be rewritten as a normalized expectation over uniformly sampled permutations

$$\begin{aligned} {\bar{\phi }}^{(S_j)} = \frac{N}{N-1}{\mathbb {E}}_{\varphi \sim \text {unif}({\mathfrak {S}}_N)} \left[ {\hat{\phi }}^{(S_j)}_{\varphi } \right] \approx {\hat{\phi }}^{(S_j)} \end{aligned} \tag{4}$$

i.e. expected PFI is canonically estimated by the PFI estimator and in particular ${\bar{\phi }}^{(S_j)} = {\mathbb {E}}_\varphi [{\hat{\phi }}^{(S_j)}]$.

Due to space restrictions, all proofs are deferred to the supplementary material in Sect. A. Theorem 1 shows that the PFI estimator ${\hat{\phi }}^{(S_j)}$ is a canonical Monte-Carlo estimate of the theoretically well understood expected PFI estimator ${\bar{\phi }}^{(S_j)}$. Both ${\hat{e}}_{\text {switch}}$ and ${\hat{e}}_{\text {orig}}$ as well as the estimator ${\bar{\phi }}^{(S_j)}$ are U-statistics, which implies unbiasedness, asymptotic normality and finite-sample bounds under weak conditions (Fisher et al., 2019). The variance can, thus, be directly computed and it is easy to show that ${\mathbb {V}}[{\bar{\phi }}^{(S_j)}]= {\mathcal {O}}(1/N)$, which by Chebyshev’s inequality implies a bound on the approximation error as ${\mathbb {P}}(\vert {\bar{\phi }}^{(S_j)}- \phi ^{(S_j)}(h)\vert > \epsilon ) = {\mathcal {O}}(1/N)$. Hence, the approximation error of the expected PFI is directly controlled by the number of observations N used for computation. A possible link between permutation tests and the U-statistic ${\bar{\phi }}^{(S_j)}$ was already suggested in (Fisher et al., 2019, Appendix A.3), where it was shown that the sum over permutations without fixed points is proportional to ${\hat{e}}_{\text {switch}}$. Theorem 1 shows that both approaches are directly linked, if permutation tests are properly scaled (Definition 2). The biased estimator $\frac{1}{M} \sum _{m=1}^M {\hat{\phi }}^{(S_j)}_{\varphi _m}$ appears in Breiman (2001); Fisher et al. (2019); Gregorutti et al. (2017). To our knowledge, the unbiased version in Definition 2 has not yet been introduced. In practice, while this factor does not change the relative importance scores, it should be considered when comparing PFI estimates with varying N. Furthermore, Theorem 1 justifies averaging over repeatedly sampled realizations of $\varphi$ in order to approximate the computationally prohibitive estimator ${\bar{\phi }}^{(S_j)}$. In the following, we will pick up this notion when constructing an incremental FI estimator.
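
The identity in Theorem 1 can be checked numerically on a tiny dataset by enumerating all N! permutations. The sketch below is an illustration rather than the proof from the supplementary material; it assumes scalar targets with the absolute error as norm, and the function names and toy data are hypothetical.

```python
from itertools import permutations


def loss(prediction, target):
    return abs(prediction - target)


def expected_pfi(model, xs, ys, j):
    """Expected PFI (Definition 3): e_switch minus e_orig."""
    n = len(xs)
    e_switch = 0.0
    for a in range(n):
        for b in range(n):
            if a != b:
                x = list(xs[a])
                x[j] = xs[b][j]  # feature j of x_a replaced by that of x_b
                e_switch += loss(model(x), ys[a])
    e_switch /= n * (n - 1)
    e_orig = sum(loss(model(x), y) for x, y in zip(xs, ys)) / n
    return e_switch - e_orig


def mean_over_all_permutations(model, xs, ys, j):
    """Exact expectation of the canonical estimator over uniform permutations."""
    n = len(xs)
    values = []
    for perm in permutations(range(n)):
        total = 0.0
        for i in range(n):
            x = list(xs[i])
            x[j] = xs[perm[i]][j]
            total += loss(model(x), ys[i]) - loss(model(xs[i]), ys[i])
        values.append(total / n)
    return sum(values) / len(values)


# Toy data (assumed): an imperfect linear model, so that e_orig is non-zero.
model = lambda x: 2 * x[0] + x[1]
xs = [[0, 1], [1, 0], [2, 2], [3, 1]]
ys = [1.5, 1.0, 7.0, 6.0]
n = len(xs)
for j in range(2):
    lhs = expected_pfi(model, xs, ys, j)
    rhs = n / (n - 1) * mean_over_all_permutations(model, xs, ys, j)
    print(j, lhs, rhs)  # both columns agree up to floating-point error
```

The enumeration is only feasible for very small N, which is why averaging over M sampled permutations is used in practice.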

3 Incremental permutation feature importance

In incremental learning, one deals with an a priori unlimited stream of training data. The challenge is to infer a model at any time point t based on the previous model and the currently observed data point, thereby using a fixed, limited amount of memory and efficient update schemes for the model. While incremental classification and regression models have been proposed (Bahri et al., 2021; Losing et al., 2018), incremental explanation methods that accompany such models are rare. In the following, we introduce an efficient incremental scheme for the popular PFI supported by theoretical guarantees using the link to expected PFI (model reliance) (Fisher et al., 2019).

We now consider a sequence of models $(h_t)_{t\in {\mathbb {N}}}$ from an incremental learning algorithm. At time t the observed data is $\{(x_0,y_0),\dots ,(x_t,y_t)\}$. The model is incrementally learned over time, such that at time t the observation $(x_t,y_t)$ is used to update $h_t$ to $h_{t+1}$. Our goal is to efficiently provide an estimate of PFI at each time step t for each feature $j \in D$ using subsets $S_j:= \{j\}$. Note that our results can immediately be extended to arbitrary feature subsets $S \subset D$.

In the following, we construct an efficient incremental estimator for PFI. We first discuss how (2) can be efficiently approximated in the incremental learning scenario, given a sampling strategy $\varphi _t$. In the sequel, we will rely on a random sampling strategy which is specifically suitable for the incremental setting and easier to implement than permutation-based approaches. Note that a permutation-based approach at time t is difficult to replicate in the incremental setting, as at time $s<t$ not all samples until time t are available. Moreover, as the model changes over time, naively computing (2) at each time step t using N previous observations results in N model evaluations per time step. Instead, we propose to use an estimator that averages the terms in (2) over time rather than over multiple data points at one time step. That means, we evaluate the current model only twice to compute the time-dependent quantity

$$\begin{aligned} {\hat{\lambda }}^{(S_j)}_t(x_t,x_{\varphi _t},y_t):= \Vert h_t(x_t^{(\bar{S_j})},x^{(S_j)}_{\varphi _t})-y_t\Vert - \Vert h_t(x_t)-y_t\Vert , \end{aligned}$$

where $\varphi _t$ is a stochastic sampling strategy to select a previous observation with values in $\{0,\dots ,t-1\}$, which we discuss in a second step in Sect. 3.1. We propose to average these calculations over time (rather than iterations over multiple data points) by using exponential smoothing. This leads to the definition of the incremental PFI (iPFI) estimator.

Definition 4

(iPFI) For a data stream at time t with previous observations $(x_0,y_0),\dots ,(x_t,y_t)$ and a sampling strategy $(\varphi _s)_{s=t_0,\dots ,t}$ for $t_0>0$ the incremental PFI (iPFI) estimator is recursively defined as

$$\begin{aligned} {\textbf {iPFI}}: {\hat{\phi }}^{(S_j)}_t:= (1-\alpha ){\hat{\phi }}^{(S_j)}_{t-1} + \alpha {\hat{\lambda }}^{(S_j)}_t(x_t,x_{\varphi _t},y_t), \end{aligned}$$

for $t \ge t_0$, ${\hat{\phi }}^{(S_j)}_{t_0-1}:= 0$, and $\alpha \in (0,1)$.

The parameter $\alpha$ is a hyperparameter that should be chosen based on the application. Note that a specific choice of $\alpha$ corresponds to a window size N, where $\alpha = \frac{2}{N+1}$ based on the well-known conversion formula, see e.g. (Nahmias & Olsen, 2015, p.73). Given a realization $\varphi _s$, observations $z_s:= (x_s,y_s)$ from $Z_s:= (X_s,Y_s) \overset{iid}{\sim }\ {\mathbb {P}}_{(X,Y)}$ and $x_s^{(S_j)}$ from $X_s^{(S_j)} \overset{iid}{\sim }\ {\mathbb {P}}_{S_j}$, each ${\hat{\lambda }}^{(S_j)}_s$ is an unbiased estimate of $\phi ^{(S_j)}(h_s)$. We further require $\varphi _s \perp (X,Y)$ and denote

$$\begin{aligned} p_{s,r}:= {\mathbb {P}}(\varphi _s = r) \text { for } s=t_0,\dots ,t \text { and } r=0,\dots ,s-1, \end{aligned} \tag{5}$$

i.e. the probability to select a previous observation from time r at time s. Note that $t_0>0$ is the first time step where ${\hat{\phi }}^{(S_j)}_t$ can be computed, as we need previous observations for the sampling process. In the following, we assume that the sampling strategy $(\varphi _s)_{t_0\le s\le t}$ is fixed and clear from the context, and thus omit it from the notation ${\hat{\phi }}^{(S_j)}_t$. We illustrate one explanation step at time t in Algorithm 1 and Fig. 2. This directly corresponds to (3) with $M=1$ and can be extended to $M>1$ by repeatedly running the procedure in parallel and averaging the results. Next, we discuss two possible sampling strategies, which are illustrated in Fig. 3.
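
One explanation step of Definition 4 could look as follows in Python. This is a hedged sketch, not the river implementation: the class `IncrementalPFI`, the helper `alpha_from_window`, the absolute-error norm and the simple storage of previous observations are assumptions made for illustration. In particular, the sketch starts explaining as soon as one previous observation is available rather than at a fixed $t_0$.

```python
import random


def alpha_from_window(window_size):
    """Conversion alpha = 2 / (N + 1) between window size N and smoothing."""
    return 2.0 / (window_size + 1)


class IncrementalPFI:
    """One-feature iPFI sketch: exponential smoothing of one-sample estimates."""

    def __init__(self, feature, alpha, reservoir_size=50, seed=0):
        self.feature = feature
        self.alpha = alpha
        self.reservoir_size = reservoir_size
        self.reservoir = []  # previously seen feature vectors
        self.rng = random.Random(seed)
        self.value = 0.0  # initial value of the estimate

    def explain_one(self, model, x, y):
        """Update the estimate with (x, y) using the current model."""
        if self.reservoir:  # a previous observation is required
            x_prev = self.rng.choice(self.reservoir)
            x_replaced = list(x)
            x_replaced[self.feature] = x_prev[self.feature]
            # the current model is evaluated twice per step
            lam = abs(model(x_replaced) - y) - abs(model(x) - y)
            self.value = (1 - self.alpha) * self.value + self.alpha * lam
        self._store(list(x))
        return self.value

    def _store(self, x):
        # Placeholder storage: fill up, then overwrite a random slot.
        if len(self.reservoir) < self.reservoir_size:
            self.reservoir.append(x)
        else:
            self.reservoir[self.rng.randrange(self.reservoir_size)] = x
```

In a streaming loop, `explain_one(model, x, y)` would be called with the current model $h_t$ before the model is updated with $(x_t, y_t)$. A window size of 99 corresponds to `alpha_from_window(99)`, i.e. $\alpha = 0.02$.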

3.1 Incremental sampling strategies $\varphi$

Since random permutations cannot easily be realized in an incremental setting as they require infinite memory of previous observations and knowledge of future events, we now present two alternative types of sampling strategies. We formalize $(\varphi _s)_{t_0\le s\le t}$ to choose the previous observation r at time s for the calculation in ${\hat{\lambda }}^{(S_j)}_s$. To do so, we will specify the probabilities $p_{s,r}$ in (5). An illustration of both approaches can be found in Fig. 3.

3.1.1 Uniform sampling

In uniform sampling we assume that each previous observation is equally likely to be sampled at time s, i.e., $p_{s,r}=1/s$ for $s=t_0,\dots ,t$ and $r=0,\dots ,s-1$. It could be naively implemented by storing all previous observations and uniformly sampling at each time step. However, since memory is limited, uniform sampling may be implemented with histograms for categorical features of known and small cardinality. For others, a reservoir of fixed length L can be maintained, known as reservoir sampling (Vitter, 1985). The probability of a new observation to be included in the reservoir, referred to as insertion probability, then decreases over time, see Fig. 3. Clearly, observations are drawn independently, but can be sampled more than once. In a data stream scenario, where changes to the underlying data distribution occur over time, the uniform sampling strategy may be inappropriate, and sampling strategies that prefer recent observations may be better suited.
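
A fixed-length reservoir for uniform sampling can be maintained with the classic reservoir sampling scheme (Algorithm R). The sketch below is a generic illustration with a hypothetical class name, not the authors' code: after more than L observations, a new observation enters the reservoir with insertion probability L divided by the number of observations seen so far.

```python
import random


class UniformReservoir:
    """Reservoir of fixed size holding a uniform sample of the stream so far."""

    def __init__(self, size, seed=0):
        self.size = size
        self.items = []
        self.n_seen = 0
        self.rng = random.Random(seed)

    def update(self, x):
        """Offer a new observation to the reservoir."""
        self.n_seen += 1
        if len(self.items) < self.size:
            self.items.append(x)  # the first `size` observations are stored
        else:
            k = self.rng.randrange(self.n_seen)  # uniform in {0, ..., n_seen - 1}
            if k < self.size:  # happens with probability size / n_seen
                self.items[k] = x

    def sample(self):
        """Draw one stored observation uniformly at random."""
        return self.rng.choice(self.items)
```

Each previous observation is contained in the reservoir with the same probability, so drawing uniformly from the reservoir selects every previous time step with equal marginal probability.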

3.1.2 Geometric sampling

Geometric sampling arises from the idea to maintain a reservoir of size L, which is updated by a new observation at each time step by randomly replacing a reservoir observation with the newly observed one. Until time $t_0$ the first L observations are stored in the reservoir. At each sampling step ($t \ge t_0$) an observation is uniformly chosen from the reservoir with probability $p:= 1/L$. Independently, a sample from the reservoir is selected with the same probability $p:= 1/L$ for replacement with the new observation. The resulting probabilities are of the geometric form $p_{s,r}=p(1-p)^{s-r-1}$ for $r\ge t_0$ and $p_{s,r}=p(1-p)^{s-t_0}$ for $r < t_0$. Clearly, the geometric sampling strategy yields increasing probabilities for more recent observations and we demonstrate in our experiments that this can be beneficial in scenarios with concept drift.
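
The geometric strategy can be sketched as a reservoir in which every new observation overwrites a uniformly chosen slot. The code below is an illustration under the assumption that $t_0 = L$, i.e. the reservoir is exactly filled by the observations $0,\dots ,t_0-1$; the class and function names are hypothetical. The helper evaluates the probabilities $p_{s,r}$ stated above, which sum to one under this assumption.

```python
import random


class GeometricReservoir:
    """Reservoir of fixed size; each new observation replaces a random slot."""

    def __init__(self, size, seed=0):
        self.size = size
        self.items = []
        self.rng = random.Random(seed)

    def sample(self):
        """Draw one stored observation; each slot has probability 1 / size."""
        return self.rng.choice(self.items)

    def update(self, x):
        """Store x; once full, overwrite a uniformly chosen slot."""
        if len(self.items) < self.size:
            self.items.append(x)
        else:
            self.items[self.rng.randrange(self.size)] = x


def geometric_probabilities(s, t0, size):
    """Probabilities p_{s,r} for r = 0, ..., s-1 with p = 1 / size."""
    p = 1.0 / size
    return [
        p * (1 - p) ** (s - t0) if r < t0 else p * (1 - p) ** (s - r - 1)
        for r in range(s)
    ]


probs = geometric_probabilities(s=30, t0=5, size=5)
print(sum(probs))           # close to 1 when t0 equals the reservoir size
print(probs[-1], probs[5])  # recent observations are more likely than old ones
```

Sampling at time s is meant to happen before the observation $x_s$ is written into the reservoir, so that only previous observations can be selected.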

3.2 Theoretical results of estimation quality

The estimator ${\hat{\phi }}^{(S_j)}_t$ picks up the notion of the PFI estimator ${\hat{\phi }}^{(S_j)}$ in (3), which approximates the expectation over the random sampling strategy $(\varphi _s)_{t_0 \le s \le t}$ by averaging repeated realizations. While ${\hat{\phi }}^{(S_j)}_t$ only considers one realization of the sampling strategy, it is easy to extend the approach in the incremental learning scenario by computing the estimator ${\hat{\phi }}^{(S_j)}_t$ in multiple separate runs in parallel. While this yields an efficient estimate of PFI, it is difficult to analyze the estimator theoretically as each estimator highly depends on the realizations of the sampling strategy. We, thus, again study the expectation over the sampling strategy and introduce the expected iPFI.

Definition 5

(Expected iPFI) For a data stream at time t with previous observations $(x_0,y_0),\dots ,(x_t,y_t)$ and a sampling strategy $\varphi := (\varphi _s)_{s=t_0,\dots ,t}$ for $t_0>0$, we define the expected iPFI as

$$\begin{aligned} {\bar{\phi }}^{(S_j)}_t:= {\mathbb {E}}_{\varphi }[{\hat{\phi }}^{(S_j)}_t], \end{aligned}$$

which corresponds to the expected PFI (model reliance) ${\bar{\phi }}^{(S_j)}$ in the batch setting.

To evaluate the estimation quality, we will analyze the bias $\vert {\bar{\phi }}^{(S_j)}_t - \phi ^{(S_j)}(h_t) \vert$ and the variance of ${\bar{\phi }}^{(S_j)}_t$. Both can be combined by Chebyshev’s inequality to obtain bounds on the approximation error of $\phi ^{(S_j)}(h_t)$ for $\epsilon > \vert {\bar{\phi }}^{(S_j)}_t - \phi ^{(S_j)}(h_t) \vert$ as

$$\begin{aligned} {\mathbb {P}}(\vert {\bar{\phi }}^{(S_j)}_t - \phi ^{(S_j)}(h_t) \vert > \epsilon )= {\mathcal {O}} ({\mathbb {V}}[{\bar{\phi }}^{(S_j)}_t]). \end{aligned}$$

(6)
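As a brief sketch of how bias and variance combine in (6) (the formal argument is in the supplementary material), write $b_t:= \vert {\mathbb {E}}[{\bar{\phi }}^{(S_j)}_t] - \phi ^{(S_j)}(h_t) \vert$ for the bias and assume $\epsilon > b_t$. Reading the bias as a deviation of the expectation is an interpretive assumption here. The triangle inequality followed by Chebyshev's inequality then gives

$$\begin{aligned} {\mathbb {P}}(\vert {\bar{\phi }}^{(S_j)}_t - \phi ^{(S_j)}(h_t) \vert > \epsilon ) \le {\mathbb {P}}(\vert {\bar{\phi }}^{(S_j)}_t - {\mathbb {E}}[{\bar{\phi }}^{(S_j)}_t] \vert > \epsilon - b_t) \le \frac{{\mathbb {V}}[{\bar{\phi }}^{(S_j)}_t]}{(\epsilon - b_t)^2}, \end{aligned}$$

so that, for fixed $\epsilon$ and a bias bounded away from $\epsilon$, the right-hand side is of order ${\mathbb {V}}[{\bar{\phi }}^{(S_j)}_t]$.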

As noted above, all proofs are deferred to the supplementary material in Sect. A. Our theoretical results are stated and proven in a general manner, which allows one to extend our approach to other sampling strategies, other feature subsets, and even other aggregation techniques.

Static model Given iid observations from a data stream, we consider an incremental model that learns over time. We begin under the simplified assumption that the model does not change over time, i.e., $h_t \equiv h$ for all t.

Theorem 2

(Bias for static Model) If $h \equiv h_t$, then

$$\begin{aligned} \phi ^{(S_j)}(h) - {\mathbb {E}}[{\bar{\phi }}^{(S_j)}_t] = (1-\alpha )^{t-t_0+1} \phi ^{(S_j)}(h). \end{aligned}$$

From the above theorem it is clear that the bias of the expected iPFI ${\bar{\phi }}^{(S_j)}_t$ decreases exponentially towards zero for $t \rightarrow \infty$, and we thus continue to study the asymptotic estimator $\lim _{t\rightarrow \infty }{\bar{\phi }}_t^{(S_j)}$. While the bias does not depend on the sampling strategy, our next result analyzes the variance of the asymptotic estimator, which does depend on the sampling strategy.
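To get a feeling for how quickly the bias factor $(1-\alpha )^{t-t_0+1}$ from Theorem 2 vanishes, the following small sketch evaluates it and computes the number of update steps needed to push it below a tolerance. The function names are hypothetical and only illustrate the arithmetic.

```python
import math


def static_bias_factor(alpha, t, t0):
    """Relative bias (1 - alpha)^(t - t0 + 1) of the expected iPFI for a static model."""
    return (1.0 - alpha) ** (t - t0 + 1)


def steps_until_bias_below(alpha, tol):
    """Smallest number of steps n such that (1 - alpha)^n <= tol."""
    return math.ceil(math.log(tol) / math.log(1.0 - alpha))


if __name__ == "__main__":
    for alpha in (0.001, 0.01):
        n = steps_until_bias_below(alpha, tol=0.01)
        print(f"alpha={alpha}: bias factor <= 1% after {n} steps")
```

Smaller values of $\alpha$ therefore require proportionally more observations before the initial bias becomes negligible.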

Theorem 3

(Variance for static Model) If $h_t \equiv h$ and ${\mathbb {V}}[\Vert h(X_s^{(\bar{S_j})},X_r^{(S_j)})-Y_s\Vert -\Vert h(X_s)-Y_s\Vert ] <\infty$, then

$$\begin{aligned} \text {Uniform: }&{\mathbb {V}} \left[ \lim _{t\rightarrow \infty }{\bar{\phi }}_t^{(S_j)} \right] = {\mathcal {O}} (-\alpha \log (\alpha )). \\ \text {Geometric: }&{\mathbb {V}} \left[ \lim _{t\rightarrow \infty }{\bar{\phi }}_t^{(S_j)} \right] = {\mathcal {O}} (\alpha ) + {\mathcal {O}} (p). \end{aligned}$$

The variance is therefore directly controlled by the choice of parameters $\alpha$ and p. As the asymptotic estimator is unbiased, it is clear that these parameters control the approximation error, as shown in (6).

Changing model So far, we discussed properties of ${\bar{\phi }}_t^{(S_j)}$ under the simplified assumption that $h_t$ does not change over time. In an incremental learning scenario, $h_t$ is updated incrementally at each time step. In cases where no concept drift affects the underlying data generating distribution, we can assume that an incremental learning algorithm gradually converges to an optimal model. We thus assume that the change of the model is controlled and show results similar to the case where $h_t$ is static. To control model change formally, we introduce $f^{\Delta }_S(x^{(\bar{S_j})},h_s,h_t):= {\mathbb {E}}_{{\tilde{X}} \sim {\mathbb {P}}_S}[\Vert h_t(x^{(\bar{S_j})},\tilde{X})-h_s(x^{(\bar{S_j})},{\tilde{X}})\Vert ]$. The expectation of $f^\Delta _S$ is denoted $\Delta _S(h_s,h_t):= {\mathbb {E}}_X[f^{\Delta }_S(X,h_s,h_t)]$ and $\Delta (h_s,h_t):= \Delta _\emptyset (h_s,h_t)$. We show that $\Delta _S$ and $\Delta$ bound the difference of FI of two models $h_t$ and $h_s$ and the bias of our estimator.

Theorem 4

(Bias for changing Model) If $\Delta (h_s,h_t) \le \delta$ and $\Delta _S(h_s,h_t) \le \delta _S$ for $t_0 \le s \le t$, then

$$\begin{aligned} \vert {\mathbb {E}}[{\bar{\phi }}^{(S_j)}_t] - \phi ^{(S_j)}(h_t)\vert \le \delta _S + \delta +{\mathcal {O}}((1-\alpha )^{t}). \end{aligned}$$

In the case of a changing model the estimator is therefore only unbiased if $h_t \rightarrow h$ as $t \rightarrow \infty$. For results on the variance, we control the variability of the models at different points in time. In the case of a static model, the covariances can be uniformly bounded, as they do not change over time. Instead, for a changing model, we introduce the time-dependent function

$$\begin{aligned} f_s(Z_s,Z_r):= \Vert h_s(X_s^{(\bar{S_j})},X_r^{(S_j)})-Y_s\Vert -\Vert h_s(X_s)-Y_s\Vert \end{aligned}$$

and assume existence of some $\sigma _{\text {max}}^2$ such that

$$\begin{aligned} \text {cov}(f_s(Z_s,Z_r),f_{s'}(Z_{s'},Z_{r'})) \le \sigma _{\text {max}}^2 \end{aligned}$$

(7)

for $t_0\le s,s' \le t$, $r<s$ and $r'<s'$.

Theorem 5

(Variance for changing Model) Given (7) for a sequence of models $(h_t)_{t\ge 0}$, the results of Theorem 3 apply.

Summary We have shown that the approximation error of iPFI for FI is controlled by the parameters $\alpha$ and p. In the case of drifting data, the approximation error is additionally affected by the changes in the model, as it is then possibly biased and the covariances may change over time. As the expected PFI estimator has an approximation error of order ${\mathcal {O}}(1/N)$ for FI, we conclude that the above bounds on the approximation error of expected iPFI are also valid when compared with the expected PFI, if $\alpha$ is chosen according to $\alpha = \frac{2}{N+1}$. In the next section, we corroborate our theoretical findings with empirical evaluations and showcase the efficacy of iPFI in scenarios with concept drift. We also elaborate on the differences between the two sampling strategies.
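The choice $\alpha = \frac{2}{N+1}$ is the usual correspondence between exponential smoothing and a moving average over N observations: both then weight data with the same mean age, $(N-1)/2$. This is standard background on exponential smoothing rather than a result of this paper; the sketch below, with hypothetical function names, checks the correspondence numerically.

```python
def alpha_from_window(n):
    """Smoothing parameter matched to a window of n observations."""
    return 2.0 / (n + 1)


def mean_age_exponential(alpha, horizon=10_000):
    """Mean age of data under exponential weights alpha * (1 - alpha)^k (truncated sum)."""
    return sum(k * alpha * (1.0 - alpha) ** k for k in range(horizon))


def mean_age_window(n):
    """Mean age of data under a uniform window over the last n observations."""
    return sum(range(n)) / n


if __name__ == "__main__":
    n = 100
    alpha = alpha_from_window(n)
    print(alpha, mean_age_exponential(alpha), mean_age_window(n))
```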

4 Experiments

We conduct multiple experimental studies to validate our theoretical findings and present our approach on real data. We consider three benchmark datasets that are well established in the FI literature (Covert et al., 2020; Lundberg & Lee, 2017): adult (Kohavi, 1996), bank (Moro et al., 2011), and bike (Fanaee-T & Gama, 2014), where bike constitutes a regression task. We further consider two binary classification real-world data streams called elec2 (Harries, 1999) and ozone (de Souza et al., 2020). Moreover, we apply the multi-class insects (de Souza et al., 2020) data stream. Lastly, we create multiple synthetic data streams based on the agrawal (Agrawal et al., 1993) and stagger (Schlimmer & Granger, 1986) concept generators where we manually induce concept drifts. As our approach is inherently model-agnostic, we present experimental results for different model types. In the static batch scenario we apply Gradient Boosting Tree (GBT) (Friedman, 2001) and LightGBM (LGBM) (Ke et al., 2017) ensembles and train small 2-layer Neural Networks (NN) with layer sizes (128, 64). In the dynamic incremental learning setting, we apply Adaptive Random Forest classifiers (ARF) (Gomes et al., 2017), small-scale 3-layer NNs with layer sizes (100, 100, 10) and Hoeffding Adaptive Trees (HATs) (Bifet & Gavaldà, 2009). The implementations of the models and data streams are based on scikit-learn (Pedregosa et al., 2011), river (Montiel et al., 2020), PyTorch (Paszke et al., 2017), and OpenML (Feurer et al., 2020). We mainly rely on default parameters, yet the supplement in Sect. C contains additional information about the datasets and details about the applied models.^{Footnote 3} In all our experiments, we compute the iPFI estimator $\hat{\phi }_{\text {iPFI}}^{(S_j)}$ as the average over ten realizations ${\hat{\phi }}^{(S_j)}_t$ of the incremental sampling strategies (uniform or geometric). All baseline approaches are chosen such that they require the same number of model evaluations as iPFI.

4.1 Experiment A: online PFI calculation under drift

First, we consider a dynamic modeling scenario. Here, instead of a pre-trained model, we fit different models incrementally on real data streams and compute iPFI on the fly. We incrementally train ARF, HAT and NN models. However, as our approach is inherently model-agnostic, any incremental model (implemented for example in river) can be explained. As a baseline, we compare our approach to the interval PFI for feature $j \in D$, which computes the PFI over fixed time intervals during the online learning process with ten random permutations in each interval. This can be seen as a naive implementation of iPFI with large gaps of uncertainty and a substantial time delay.
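The on-the-fly computation can be pictured with the following minimal sketch. It assumes an exponential-smoothing update with parameter $\alpha$ applied to the per-observation loss difference, which is consistent with the $(1-\alpha )^{t-t_0+1}$ factor in Theorem 2 but is not a reproduction of Algorithm 1. The function names, the dictionary-based feature representation, and the toy model are all hypothetical.

```python
import random


def ipfi_step(predict, loss, x, y, reservoir, importance, alpha, rng):
    """One incremental update of the per-feature importance estimates (sketch)."""
    base_loss = loss(predict(x), y)
    for feature in importance:
        # Replace the feature value with one drawn from a stored observation.
        donor = rng.choice(reservoir)
        x_perturbed = dict(x)
        x_perturbed[feature] = donor[feature]
        delta = loss(predict(x_perturbed), y) - base_loss
        # Exponential smoothing of the observed loss increase (assumed update rule).
        importance[feature] = (1.0 - alpha) * importance[feature] + alpha * delta
    return importance


def run_demo(n_steps=500, alpha=0.05, reservoir_size=50, seed=0):
    """Toy stream in which the model uses only feature 'a'."""
    rng = random.Random(seed)
    predict = lambda x: x["a"]
    loss = lambda prediction, target: abs(prediction - target)
    importance = {"a": 0.0, "b": 0.0}
    reservoir = []
    for _ in range(n_steps):
        x = {"a": rng.random(), "b": rng.random()}
        y = x["a"]
        if reservoir:
            ipfi_step(predict, loss, x, y, reservoir, importance, alpha, rng)
        # Maintain the reservoir with uniform random replacement.
        if len(reservoir) < reservoir_size:
            reservoir.append(x)
        else:
            reservoir[rng.randrange(reservoir_size)] = x
    return importance


if __name__ == "__main__":
    print(run_demo())
```

In this toy setting the estimate for the unused feature `b` stays at zero, while the estimate for `a` becomes positive, because replacing `a` changes the prediction and hence the loss.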

With the synthetic agrawal stream we induce two kinds of real concept drifts: First, we switch the classification function of the data generator, which we refer to as function-drift (changing the functional dependency but retaining the distribution of X). Second, we switch the values of two or more features with each other, which we refer to as feature-drift (changing the functional dependency by changing the distribution of X). Note that feature-drift can be applied to datasets, where the classification function is unknown (like elec2).
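A feature-drift of this kind can be induced on any stream of observations, as the following sketch illustrates. The generator `induce_feature_drift` is a hypothetical helper: it assumes observations are `(x, y)` pairs with `x` a dictionary of feature values, and it leaves the label unchanged, which is our reading of the description above.

```python
def induce_feature_drift(stream, swap, drift_at):
    """Yield (x, y) pairs; from position drift_at onwards, swap the given feature pairs."""
    for i, (x, y) in enumerate(stream):
        if i >= drift_at:
            # Copy so that the original observation is not modified.
            x = dict(x)
            for first, second in swap:
                x[first], x[second] = x[second], x[first]
        yield x, y


if __name__ == "__main__":
    toy_stream = [({"age": 30 + i, "car": i}, i % 2) for i in range(4)]
    for x, y in induce_feature_drift(toy_stream, swap=[("age", "car")], drift_at=2):
        print(x, y)
```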

Figure 4 showcases how well iPFI reacts to both concept drift scenarios. Both concept drifts are induced in the middle of the data stream (after 10,000 samples). For the function-drift example (Fig. 4, left), the agrawal classification function was switched from Agrawal et al. (1993)’s concept 2 to concept 3. Theoretically, only two features should be important for both concepts: For the first concept the pink salary and the purple age features are needed, and for the second concept the classification function relies on the cyan education and the purple age features. However, the ARF model also relies on the blue commission feature, which can be explained as commission directly depends on salary and, thus, is transitively correlated with the target variable.

In the feature-drift scenario (Fig. 4, right), the ARF model adapts to a sudden drift where both important features (education and age) are switched with two unimportant features (car and salary). In both scenarios iPFI instantly detects the shifts in importance. From both simulations, it is clear that iPFI and its anytime computation have clear advantages over interval PFI. In fact, iPFI quickly reacts to changes in the data distribution while still closely matching the “ground-truth” results of the interval-wise computation.

In addition to the synthetic concept drifts on agrawal, Fig. 5 illustrates how iPFI explanations are model-agnostic on the original elec2 data stream. There, we incrementally train a NN and an ARF classifier on the stream without inducing an additional feature drift. For further concept drift scenarios, we refer to the supplementary material in Sect. C.

Table 1 Summary of the additional time complexity of iPFI


Time complexity

Aside from the approximation quality in the incremental setting, we also summarize the additional time complexity of iPFI in Table 1 and observe a linear relationship ($0.104 \cdot \vert D \vert , R^{2} = 0.966$) over the feature count $\vert D\vert$. For a detailed illustration of the linear relationship we refer to Sect. C.4. We run the explanation procedure ten times for seven datasets and track the run-time with and without iPFI explanations. To isolate the variability of the run-times to the explanation procedure, we use the same ARF classification model for all seven datasets. We further decompose the explanation time into the time it takes to run the model in inference (line 3 in Algorithm 1) and the remaining storing and sampling overhead. Most of the explanation time (95% to 99%) is dedicated to the inference time of the models for which performance gains cannot be easily achieved without parallelization.

Sanity check with tree-specific mean decrease in impurity To further illustrate the efficacy of our approach, we also compare our model-agnostic iPFI explainer to the model-specific baseline of Mean Decrease in Impurity (MDI). Earlier works (Cassidy & Deviney, 2014; Gomes et al., 2019) leverage MDI as an importance measure in the incremental setting. Similar to Gomes et al. (2019), we manually compute the MDI on incremental summary statistics stored at each split-node of a HAT classifier. As an impurity measure, we compute the Gini impurity index as in Cassidy and Deviney (2014). Figure 6 shows the comparison of iPFI and MDI for an agrawal concept drift data stream and elec2. Aside from the differing scales, both measures detect the same importance rankings and react to concept drift. However, as MDI can only be computed for tree-based models such as HATs and ARFs, its applicability is strictly limited compared to the model-agnostic approach of calculating iPFI, which can be applied to any model class and loss function.
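For reference, the per-split quantity underlying MDI can be sketched as follows. The sketch computes the Gini impurity from class counts and the impurity decrease of a single split, using the common weighting of child nodes by their share of samples. How these decreases are aggregated over the nodes of a HAT is not specified here, so the function names and the weighting are assumptions for illustration only.

```python
def gini(counts):
    """Gini impurity 1 - sum_k p_k^2 computed from per-class counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((count / total) ** 2 for count in counts)


def impurity_decrease(parent, left, right):
    """Gini decrease of one split, with children weighted by their share of samples."""
    n = sum(parent)
    n_left, n_right = sum(left), sum(right)
    return gini(parent) - (n_left / n) * gini(left) - (n_right / n) * gini(right)


if __name__ == "__main__":
    # A split that separates two balanced classes perfectly.
    print(impurity_decrease(parent=[5, 5], left=[5, 0], right=[0, 5]))
```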

4.2 Experiment B: Geometric vs. uniform sampling

Second, we examine which sampling strategy is preferable in which learning environment. We conclude that geometric sampling should be applied under feature-drift scenarios, as the choice of sampling strategy substantially impacts iPFI’s performance in concept drift scenarios where feature distributions change over time. If a dynamic model adapts to changing feature distributions, and the PFI is estimated with samples from the outdated distribution, the resulting replacement samples are outside the current data manifold. Estimating PFI by using this data can result in skewed estimates, as illustrated in Fig. 7. There, we induce a feature-drift by switching the values of the most important feature for an ARF model on elec2 with a random feature. Unlike the geometric sampling strategy (Fig. 7, right), the uniform sampling strategy (Fig. 7, left) fails to match the “ground-truth” interval PFI estimation. Hence, in dynamic learning environments like data stream analytics or continual learning, we recommend applying a sampling strategy that focuses on more recent samples, such as geometric distributions. For applications without drift in the feature-space like progressive data science, uniform sampling strategies, which evenly distribute the probability of a data point being sampled across the data stream, may still be preferred.

Parameter considerations We further analyze the two most important hyperparameters on the elec2 data stream. The results are shown in Fig. 8. Therein, we show that the smoothing parameter $\alpha$ substantially affects iPFI’s FI estimates. Like any smoothing mechanism, this parameter controls the deviation of iPFI’s estimates. This parameter should be set individually for the task at hand. In our experiment, values between $\alpha = 0.001$ (conservative) and $\alpha = 0.01$ (reactive) appeared to be reasonable. The size of the reservoir does not substantially affect the estimation quality for values between 50 and $2\,000$.

4.3 Experiment C: Approximation of batch PFI

Table 2 Median error of iPFI compared to batch PFI (IQR between $Q_{1}$ and $Q_{3}$ in braces)


We further consider the static model setting where models are pre-trained before they are explained on the whole dataset (no incremental learning). This experiment demonstrates that iPFI correctly approximates batch PFI estimation. We compare iPFI with the classical batch PFI ${\hat{\phi }}^{(S_j)}_{\text {batch}}$ for feature $j \in D$, which is computed using the whole static dataset over ten random permutations. We normalize $\hat{\phi }_{\text {iPFI}}^{(S_j)}$ and $\hat{\phi }_{\text {batch}}^{(S_j)}$ between 0 and 1, and compute the sum over the feature-wise absolute approximation errors $\sum _{j \in D}{\vert \hat{\phi }_{\text {iPFI}}^{(S_j)}-\hat{\phi }_{\text {batch}}^{(S_j)}\vert }$. Table 2 shows the median and interquartile range (IQR) (difference between the first and third quartile) of the error based on ten random orderings of each dataset. Figure 9 illustrates the approximation quality of iPFI with geometric and uniform sampling per feature for the bike regression dataset. Further results can be found in the supplementary material in Sect. C. In the static modeling case, there is no clear difference between geometric and uniform sampling. However, in the dynamic modeling context under drift, the sampling strategy has a substantial effect on the iPFI estimates.
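The error measure can be sketched as follows. The text only states that both importance vectors are normalized between 0 and 1; the sketch assumes min-max normalization per vector, which is one possible reading, and the function names are hypothetical.

```python
def min_max_normalize(scores):
    """Rescale a dict of feature scores to [0, 1] (assumed normalization)."""
    lowest, highest = min(scores.values()), max(scores.values())
    if highest == lowest:
        return {feature: 0.0 for feature in scores}
    return {feature: (value - lowest) / (highest - lowest)
            for feature, value in scores.items()}


def summed_abs_error(ipfi_scores, batch_scores):
    """Sum over features of |normalized iPFI - normalized batch PFI|."""
    ipfi_norm = min_max_normalize(ipfi_scores)
    batch_norm = min_max_normalize(batch_scores)
    return sum(abs(ipfi_norm[j] - batch_norm[j]) for j in batch_scores)


if __name__ == "__main__":
    ipfi = {"a": 0.2, "b": 0.1, "c": 0.0}
    batch = {"a": 0.4, "b": 0.1, "c": 0.0}
    print(summed_abs_error(ipfi, batch))
```

With a different normalization, for example dividing by the sum of the scores, the absolute values of the error would change, but the comparison between sampling strategies would be carried out in the same way.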

5 Conclusion and future work

In this work, we considered global FI as a statistical measure of change in the model’s risk when features are marginalized. We discussed PFI as an approach to estimate feature importance and proved that only appropriately scaled permutation tests are unbiased estimators of global FI (Theorem 1). In this case, the expectation over the sampling strategy (expected PFI) then corresponds to the model reliance U-Statistic (Fisher et al., 2019).

Based on this notion, we presented iPFI, which is a model-agnostic algorithm to incrementally estimate global FI with PFI by averaging importance scores for individual observations over repeated realizations of a sampling strategy. We introduced two incremental sampling strategies and established theoretical results for the expectation over the sampling strategy (expected iPFI) to control the approximation error using iPFI’s parameters. On various benchmark datasets, we demonstrated the efficacy of our algorithms by comparing them with the batch PFI baseline method in a static progressive setting as well as with interval-based PFI in a dynamic incremental learning scenario with different types of concept drift and parameter choices.

Applying XAI methods incrementally to data stream analytics offers unique insights into models that change over time. In this work, we rely on PFI as an established and inexpensive FI measure. Other computationally more expensive approaches (such as SAGE) address some limitations of PFI. As our theoretical results can be applied to arbitrary feature subsets, analyzing these methods in the dynamic environment offers interesting research opportunities. In contrast to this work’s technical focus, analyzing the dynamic XAI scenario through a human-focused lens with human-grounded experiments is paramount (Doshi-Velez & Kim, 2017).

Data availability

Not applicable, as all data sets used in this paper are publicly available or synthetically created. However, the mechanism for creating the data is described in detail in the paper.

Code availability

The code is already publicly available at https://github.com/mmschlk/iPFI.

Notes

We provide iPFI as an open-source implementation in the iXAI online explanation framework available at https://github.com/mmschlk/iXAI.
As compared to Fisher et al. (2019), we consider the loss function $L(f,(y,x_n,x_m)):= \Vert h(x_n^{(\bar{S_j})},x_m^{(S_j)}) - y \Vert$ and denote ${\bar{\phi }}^{(S_j)}:= {\widehat{MR}}_{\text {difference}}(h)$ in our case.
All experiments can be found at https://github.com/mmschlk/iPFI.

References

Aas, K., Jullum, M., & Løland, A. (2021). Explaining individual predictions when features are dependent: More accurate approximations to Shapley values. Artificial Intelligence, 298, 103502. https://doi.org/10.1016/j.artint.2021.103502
Adadi, A., & Berrada, M. (2018). Peeking inside the black-box: A survey on explainable artificial intelligence (XAI). IEEE Access, 6, 52138–52160. https://doi.org/10.1109/ACCESS.2018.2870052
Agrawal, R., Imielinski, T., & Swami, A. (1993). Database mining: A performance perspective. IEEE Transactions on Knowledge and Data Engineering, 5(6), 914–925. https://doi.org/10.1109/69.250074
Altmann, A., Toloşi, L., Sander, O., & Lengauer, T. (2010). Permutation importance: A corrected feature importance measure. Bioinformatics, 26(10), 1340–1347. https://doi.org/10.1093/bioinformatics/btq134
Archer, K. J., & Kimes, R. V. (2008). Empirical characterization of random forest variable importance measures. Computational Statistics & Data Analysis, 52(4), 2249–2260. https://doi.org/10.1016/j.csda.2007.08.015
Bach, S., Binder, A., Montavon, G., Klauschen, F., Müller, K.-R., & Samek, W. (2015). On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PloS one, 10(7), 0130140. https://doi.org/10.1371/journal.pone.0130140
Bahri, M., Bifet, A., Gama, J., Gomes, H. M., & Maniu, S. (2021). Data stream analysis: Foundations, major tasks and tools. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 11(3), 1405. https://doi.org/10.1002/widm.1405
Barddal, J. P., Enembreck, F., Gomes, H. M., Bifet, A., & Pfahringer, B. (2019). Boosting decision stumps for dynamic feature selection on data streams. Information Systems, 83, 13–29. https://doi.org/10.1016/j.is.2019.02.003
Bifet, A., Gavaldà, R. (2009). Adaptive learning from evolving data streams. In Advances in intelligent data analysis VIII, 8th international symposium on intelligent data analysis (IDA 2009), pp. 249–260. https://doi.org/10.1007/978-3-642-03915-7_22
Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5–32. https://doi.org/10.1023/A:1010933404324
Calle, M. L., & Urrea, V. (2011). Letter to the editor: Stability of random forest importance measures. Briefings in Bioinformatics, 12(1), 86–89. https://doi.org/10.1093/bib/bbq011
Casalicchio, G., Molnar, C., Bischl, B. (2018). Visualizing the feature importance for black box models. In Proceedings of machine learning and knowledge discovery in databases - European conference, (ECML PKDD 2018), pp. 655–670. https://doi.org/10.1007/978-3-030-10925-7_40
Cassidy, A. P., Deviney, F. A. (2014). Calculating feature importance in data streams with concept drift using online random forest. In 2014 IEEE international conference on big data (Big Data 2014), pp. 23–28. https://doi.org/10.1109/BigData.2014.7004352
Chen, H., Janizek, J. D., Lundberg, S. M., Lee, S. (2020). True to the model or true to the data? https://arxiv.org/abs/2006.16234
Covert, I., Lee, S.-I. (2021). Improving kernelshap: Practical shapley value estimation using linear regression. In Proceedings of international conference on artificial intelligence and statistics (AISTATS 2021), pp. 3457–3465.
Covert, I., Lundberg, S. M., Lee, S. -I. (2020). Understanding global feature contributions with additive importance measures. In Proceedings of international conference on neural information processing systems (NeurIPS 2020), pp. 17212–17223.
Covert, I., Lundberg, S., & Lee, S.-I. (2021). Explaining by removing: A unified framework for model explanation. Journal of Machine Learning Research, 22(209), 1–90.
de Souza, V. M. A., dos Reis, D. M., Maletzke, A. G., Batista, G. E. A. P. A. (2020). Challenges in benchmarking stream learning algorithms with real-world data. Data Mining and Knowledge Discovery, 34(6), 1805–1858. https://doi.org/10.1007/s10618-020-00698-5
Doshi-Velez, F., Kim, B. (2017). Towards a rigorous science of interpretable machine learning. https://arxiv.org/abs/1702.08608
Fanaee-T, H., & Gama, J. (2014). Event labeling combining ensemble detectors and background knowledge. Progress in Artificial Intelligence, 2(2), 113–127. https://doi.org/10.1007/s13748-013-0040-3
Feurer, M., van Rijn, J. N., Kadra, A., Gijsbers, P., Mallik, N., Ravi, S., Mueller, A., Vanschoren, J., Hutter, F. (2020). OpenML-python: An extensible python API for OpenML. https://arxiv.org/abs/1911.02490
Fisher, A., Rudin, C., & Dominici, F. (2019). All models are wrong, but many are useful: Learning a Variable’s importance by studying an entire class of prediction models simultaneously. Journal of Machine Learning Research, 20(177), 1–81.
Friedman, J. H. (2001). Greedy function approximation: A gradient boosting machine. The Annals of Statistics, 29(5), 1189–1232. https://doi.org/10.1214/aos/1013203451
Frye, C., Mijolla, D. D., Begley, T., Cowton, L., Stanley, M., Feige, I. (2021). Shapley explainability on the data manifold. In International conference on learning representations (ICLR 2021). https://openreview.net/forum?id=OPyWRrcjVQw.
García-Martín, E., Rodrigues, C. F., Riley, G., & Grahn, H. (2019). Estimation of energy consumption in machine learning. Journal of Parallel and Distributed Computing, 134, 75–88. https://doi.org/10.1016/j.jpdc.2019.07.007
Gomes, H. M., Mello, R. F. D., Pfahringer, B., Bifet, A. (2019). Feature scoring using tree-based ensembles for evolving data streams. In 2019 IEEE international conference on big data (Big Data 2019), pp. 761–769.
Gomes, H. M., Bifet, A., Read, J., Barddal, J. P., Enembreck, F., Pfahringer, B., Holmes, G., & Abdessalem, T. (2017). Adaptive random forests for evolving data stream classification. Machine Learning, 106(9), 1469–1495. https://doi.org/10.1007/s10994-017-5642-8
Gregorutti, B., Michel, B., & Saint-Pierre, P. (2015). Grouped variable importance with random forests and application to multiple functional data analysis. Computational Statistics & Data Analysis, 90, 15–35. https://doi.org/10.1016/j.csda.2015.04.002
Gregorutti, B., Michel, B., & Saint-Pierre, P. (2017). Correlation and variable importance in random forests. Statistics and Computing, 27(3), 659–678. https://doi.org/10.1007/s11222-016-9646-1
Hapfelmeier, A., Hothorn, T., Ulm, K., & Strobl, C. (2014). A new variable importance measure for random forests with missing data. Statistics and Computing, 24(1), 21–34. https://doi.org/10.1007/s11222-012-9349-1
Harries, M. (1999). Splice-2 comparative evaluation: Electricity pricing. Technical report, The University of New South Wales.
Haug, J., Braun, A., Zürn, S., Kasneci, G. (2022). Change detection for local explainability in evolving data streams. In Proceedings of the 31st ACM international conference on information and knowledge management (CIKM 2022), pp. 706–716.
Hoeffding, W. (1948). A class of statistics with asymptotically normal distribution. The Annals of Mathematical Statistics, 19(3), 293–325. https://doi.org/10.1007/978-1-4612-0919-5_20
Hooker, G., Mentch, L., Zhou, S. (2019). Unrestricted permutation forces extrapolation: Variable importance requires at least one more model, or there is no free variable importance. https://arxiv.org/abs/1905.03151
Janzing, D., Minorics, L., Bloebaum, P. (2020). Feature relevance quantification in explainable AI: A causal problem. In Proceedings of international conference on artificial intelligence and statistics (AISTATS 2020), pp. 2907–2916.
Jethani, N., Sudarshan, M., Covert, I. C., Lee, S.-I., Ranganath, R. (2021). Fastshap: Real-time shapley value estimation. In Proceedings of international conference on learning representations (ICLR 2021).
Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., Ye, Q., Liu, T.-Y. (2017). Lightgbm: A highly efficient gradient boosting decision tree. In Proceedings of international conference on neural information processing systems (NeurIPS 2017).
Kohavi, R. (1996). Scaling up the accuracy of Naive–Bayes classifiers: A decision-tree hybrid. In Proceedings of international conference on knowledge discovery and data mining (KDD 1996), pp. 202–207.
König, G., Molnar, C., Bischl, B., Grosse-Wentrup, M. (2021). Relative feature importance. In Proceedings of international conference on pattern recognition (ICPR 2021), pp. 9318–9325.
Losing, V., Hammer, B., & Wersing, H. (2018). Incremental on-line learning: A review and comparison of state of the art algorithms. Neurocomputing, 275, 1261–1274. https://doi.org/10.1016/j.neucom.2017.06.084
Lundberg, S. M., Lee, S.-I. (2017). A unified approach to interpreting model predictions. In Proceedings of international conference on neural information processing systems (NeurIPS 2017), pp. 4768–4777.
Lundberg, S. M., Erion, G., Chen, H., DeGrave, A., Prutkin, J. M., Nair, B., Katz, R., Himmelfarb, J., Bansal, N., & Lee, S.-I. (2020). From local explanations to global understanding with explainable AI for trees. Nature Machine Intelligence, 2(1), 56–67. https://doi.org/10.1038/s42256-019-0138-9
Molnar, C., König, G., Bischl, B., Casalicchio, G. (2020). Model-agnostic feature importance and effects with dependent features: A conditional subgroup approach.
Montiel, J., Halford, M., Mastelini, S. M., Bolmier, G., Sourty, R., Vaysse, R., Zouitine, A., Gomes, H. M., Read, J., Abdessalem, T., Bifet, A. (2020). River: Machine learning for streaming data in Python. https://arxiv.org/abs/2012.04740.
Moro, S., Laureano, R. M. S., Cortez, P. (2011). Using data mining for bank direct marketing: An application of the CRISP-DM methodology. In Proceedings of the European simulation and modelling conference (ESM 2011).
Muschalik, M., Fumagalli, F., Hammer, B., & Hüllermeier, E. (2022). Agnostic explanation of model change based on feature importance. KI - Künstliche Intelligenz. https://doi.org/10.1007/s13218-022-00766-6
Nahmias, S., & Olsen, T. L. (2015). Production and operations analysis. Illinois: Waveland Press.
Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., Lerer, A. (2017). Automatic differentiation in pytorch. In NIPS 2017 workshop on autodiff.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., & Duchesnay, E. (2011). Scikit-learn: Machine learning in python. Journal of Machine Learning Research, 12, 2825–2830. https://doi.org/10.5555/1953048.2078195
Rényi, A. (1961). On measures of entropy and information. In Proceedings of the fourth Berkeley symposium on mathematical statistics and probability, volume 1: Contributions to the theory of statistics, pp. 547–562.
Ribeiro, M. T., Singh, S., Guestrin, C. (2016). "Why Should I Trust You?": Explaining the predictions of any classifier. In Proceedings of international conference on knowledge discovery and data mining (KDD 2016), pp. 1135–1144.
Schlimmer, J. C., & Granger, R. H. (1986). Incremental learning from noisy data. Machine Learning, 1(3), 317–354. https://doi.org/10.1007/BF00116895
Selvaraju, R. R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., Batra, D. (2017). Grad-cam: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE international conference on computer vision (ICCV 2017), pp. 618–626.
Shapley, L. S. (1953). A value for n-person games. In Contributions to the Theory of Games, Volume II (AM-28) (pp. 307–318). New Jersey, USA: Princeton University Press.
Strobl, C., Boulesteix, A.-L., Zeileis, A., & Hothorn, T. (2007). Bias in random forest variable importance measures: Illustrations, sources and a solution. BMC Bioinformatics, 8(1), 25. https://doi.org/10.1186/1471-2105-8-25
Strobl, C., Boulesteix, A.-L., Kneib, T., Augustin, T., & Zeileis, A. (2008). Conditional variable importance for random forests. BMC Bioinformatics, 9(1), 307. https://doi.org/10.1186/1471-2105-9-307
Turkay, C., Pezzotti, N., Binnig, C., Strobelt, H., Hammer, B., Keim, D. A., Fekete, J.-D., Palpanas, T., Wang, Y., Rusu, F. (2018). Progressive data science: Potential and challenges. https://arxiv.org/abs/1812.08032
Vitter, J. S. (1985). Random sampling with a reservoir. ACM Transactions on Mathematical Software, 11(1), 37–57. https://doi.org/10.1145/3147.3165
Wang, H., Yang, F., & Luo, Z. (2016). An experimental study of the intrinsic stability of random forest variable importance measures. BMC Bioinformatics, 17(1), 60.
Wastensteiner, J., Weiss, T. M., Haag, F., Hopf, K. (2021). Explainable AI for tailored electricity consumption feedback: An experimental evaluation of visualizations. In European conference on information systems (ECIS 2021), vol. 55.
Yuan, L., Pfahringer, B., Barddal, J. P. (2018). Iterative subset selection for feature drifting data streams. In Proceedings of the 33rd annual ACM symposium on applied computing, pp. 510–517.
Zhu, R., Zeng, D., & Kosorok, M. R. (2015). Reinforcement learning trees. Journal of the American Statistical Association, 110(512), 1770–1784. https://doi.org/10.1080/01621459.2015.1036994

Acknowledgements

We gratefully acknowledge funding by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation): TRR 318/1 2021-438445824.

Funding

Open Access funding enabled and organized by Projekt DEAL. We gratefully acknowledge funding by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation): TRR 318/1 2021 - 438445824.

Author information

Fabian Fumagalli and Maximilian Muschalik have contributed equally to this work.

Authors and Affiliations

Bielefeld University, 33594, Bielefeld, Germany
Fabian Fumagalli & Barbara Hammer
LMU Munich, 80539, Munich, Germany
Maximilian Muschalik & Eyke Hüllermeier

Authors

Fabian Fumagalli
Maximilian Muschalik
Eyke Hüllermeier
Barbara Hammer

Contributions

All authors contributed to the conception of the problem setting and overall design of the work. Fabian Fumagalli and Maximilian Muschalik elaborated on the technical details and main theoretical results, and drafted a first version of the manuscript. This version was revised and improved by all authors, who also read and approved the final manuscript.

Corresponding authors

Correspondence to Fabian Fumagalli or Maximilian Muschalik.

Ethics declarations

Conflicts of interest

The authors have the following conflicts regarding the editorial board of the Machine Learning Journal and the journal track chair of the ECML-PKDD 2023: Willem Waegeman.

Ethics approval

The authors approve that this submission does not raise any potential ethical concerns.

Consent to participate

All authors agreed with the content and all gave explicit consent to submit. Moreover, the authors obtained consent from the responsible authorities at the institute/organization where the work has been carried out, before the work was submitted. Finally, the authors consent that at least one author will participate at the ECML-PKDD 2023 in case of acceptance.

Consent for publication

All authors consent to publish an individual’s data or image, if this is required.

Additional information

Editors: Fabio Vitale, Tania Cerquitelli, Marcello Restelli, Charalampos Tsourakakis.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Organisation of the appendix

The supplementary material is organized as follows. Section A contains all proofs of the theoretical analysis conducted in the main body of the work. Section B covers the approximation error of expected PFI. Further experimental results and detailed descriptions of the datasets and models used for the empirical analysis are given in Sect. C. Lastly, Sect. D shows how PFI can be computed analytically for a pre-defined classification function, illustrated with the agrawal concepts.

A Proofs

In the following, we provide the proofs of all theorems. We further present more general results that are stated as propositions.

Theorem 6

The expected PFI (model reliance) can be rewritten as a normalized expectation over uniformly random permutations, i.e.

$$\begin{aligned} {\bar{\phi }}^{(S_j)} = \frac{N}{N-1}{\mathbb {E}}_{\varphi \sim \text {unif}({\mathfrak {S}}_N)} \left[ {\hat{\phi }}^{(S_j)}_{\varphi } \right] \approx {\hat{\phi }}^{(S_j)} \end{aligned}$$

(8)

i.e. expected PFI is canonically estimated by the PFI estimator and in particular ${\bar{\phi }}^{(S_j)} = {\mathbb {E}}_\varphi [{\hat{\phi }}^{(S_j)}]$.

Proof

We write $f(z_n,z_m):= \Vert h(x_n^{(\bar{S_j})},x_m^{(S_j)})-y_n\Vert -\Vert h(x_n)-y_n\Vert$ and compute the expectation over randomly sampled permutations $\varphi \in {\mathfrak {S}}_N$. Each permutation has probability $\frac{1}{N!}$, which yields

$$\begin{aligned} {\mathbb {E}}_{\varphi }[{\hat{\phi }}^{(S_j)}_\varphi ]&= \frac{1}{N!} \sum _{\varphi \in {\mathfrak {S}}_N}{\hat{\phi }}^{(S_j)}_\varphi \\&= \frac{1}{N} \frac{1}{N!} \sum _{n=1}^N \sum _{\varphi \in {\mathfrak {S}}_N} f(z_n,z_{\varphi (n)}) \\&= \frac{1}{N} \frac{1}{N!} \sum _{n=1}^N \sum _{m=1}^N (N-1)! f(z_n,z_m) \\&= \frac{1}{N} \frac{1}{N} \sum _{n=1}^N \sum _{m \ne n} f(z_n,z_m) \\&= \frac{1}{N} \frac{1}{N} \sum _{n=1}^N \sum _{m \ne n} \Vert h(x_n,x_{m})-y_n\Vert \\&- \frac{N-1}{N^2} \sum _{n=1}^N \Vert h(x_n)-y_n\Vert , \end{aligned}$$

where we used in the third line that there are $(N-1)!$ permutations with $\varphi (n)=m$. We thus conclude,

$$\begin{aligned} \frac{N}{N-1} {\mathbb {E}}_{\varphi }[{\hat{\phi }}^{(S_j)}_\varphi ]&= {\hat{e}}_{\text {switch}} -{\hat{e}}_{\text {orig}} = {\bar{\phi }}^{(S_j)}. \end{aligned}$$

$\square$
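
The identity can be checked numerically for a small sample by enumerating all $N!$ permutations. The following Python sketch does so under stated assumptions: the toy model `h`, the absolute-error loss, the random data, and the function name `check_expected_pfi` are hypothetical, and ${\hat{e}}_{\text {switch}}$ is taken as the average over all ordered pairs $n \ne m$, as the last step of the proof suggests.

```python
import itertools
import numpy as np


def check_expected_pfi(X, y, h, j):
    """Compare N/(N-1) * E_phi[PFI_phi] with e_switch - e_orig for feature j."""
    N = len(y)

    def switched_loss(n, m):
        # Loss of observation n when feature j is taken from observation m.
        x = X[n].copy()
        x[j] = X[m, j]
        return abs(h(x) - y[n])

    orig = np.array([abs(h(X[n]) - y[n]) for n in range(N)])

    # Left-hand side: average the permutation estimator over all N! permutations.
    per_perm = [
        np.mean([switched_loss(n, phi[n]) - orig[n] for n in range(N)])
        for phi in itertools.permutations(range(N))
    ]
    lhs = N / (N - 1) * np.mean(per_perm)

    # Right-hand side: average switched loss over ordered pairs n != m,
    # minus the average original loss.
    e_switch = sum(
        switched_loss(n, m) for n in range(N) for m in range(N) if m != n
    ) / (N * (N - 1))
    rhs = e_switch - orig.mean()
    return lhs, rhs


rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
y = rng.normal(size=5)
h = lambda x: 2.0 * x[0] - x[1] + 0.5 * x[2]  # hypothetical toy model
lhs, rhs = check_expected_pfi(X, y, h, j=0)
```

Both returned values should agree up to floating-point error. Full enumeration is only feasible for very small $N$, which is why a single sampled permutation is used in practice.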

Theorem 7

(Bias for static Model) If $h \equiv h_t$, then

$$\begin{aligned} \phi ^{(S_j)}(h) - {\mathbb {E}}[{\bar{\phi }}^{(S_j)}_t] = (1-\alpha )^{t-t_0+1} \phi ^{(S_j)}(h). \end{aligned}$$

Proof

We consider the more general estimator ${{\tilde{\phi }}}^{(S)}_t:= {\mathbb {E}}_{\varphi }[\sum _{s=t_0}^t w_s {\hat{\lambda }}^{(S)}_s(x_s,x_{\varphi _s},y_s)]$ and prove a more general result that can be used for arbitrary sampling and aggregation techniques.

Proposition 8

If $h \equiv h_t$, then

$$\begin{aligned} \phi ^{(S)}(h) - {\mathbb {E}}[{{\tilde{\phi }}}^{(S)}_t] = (1-\mu _w) \phi ^{(S)}(h) \end{aligned}$$

with $\mu _w:= \sum _{s=t_0}^t w_s$.

Proof

As each ${\hat{\lambda _s}}^{(S)}$ is an unbiased estimator of $\phi ^{(S)}(h_s)$ and $h_s \equiv h$, we have ${\mathbb {E}}[{{\tilde{\phi }}}^{(S)}_t]=\sum _{s=t_0}^{t} w_s \phi ^{(S)}(h)= \mu _w \phi ^{(S)}(h)$, where we used $(\varphi _s)_{t_0 \le s \le t} \perp (X,Y)$. $\square$

The result then follows directly, as ${\bar{\phi }}^{(S)} = {{\tilde{\phi }}}^{(S)}$ for $w_s:= \alpha (1-\alpha )^{t-s}$, $\mu _w=1-(1-\alpha )^{t-t_0+1}$ and $S:= S_j$. $\square$
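
As a small numerical illustration of the weight sum $\mu _w$, the following Python sketch compares the summed weights with the closed form. The parameter values and the helper name `smoothing_weights` are arbitrary choices for illustration, and the final loop assumes that the weights are realised by an exponential moving average initialised at zero.

```python
import numpy as np


def smoothing_weights(alpha, t0, t):
    """Weights w_s = alpha * (1 - alpha)**(t - s) for s = t0, ..., t."""
    s = np.arange(t0, t + 1)
    return alpha * (1.0 - alpha) ** (t - s)


alpha, t0, t = 0.05, 10, 60
w = smoothing_weights(alpha, t0, t)
mu_w = w.sum()
closed_form = 1.0 - (1.0 - alpha) ** (t - t0 + 1)

# Relative bias of the estimator for a static model: 1 - mu_w.
relative_bias = 1.0 - mu_w

# The same weights arise from an exponential moving average started at zero.
# `lam` stands in for the per-step estimates (random placeholder values).
rng = np.random.default_rng(0)
lam = rng.normal(size=t - t0 + 1)
ema = 0.0
for value in lam:
    ema = (1.0 - alpha) * ema + alpha * value
```

The relative bias decays geometrically in $t-t_0$, so a larger $\alpha$ removes the initialisation bias faster.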

Theorem 9

(Variance for static Model) If $h_t \equiv h$ and ${\mathbb {V}}[\Vert h(X_s^{(\bar{S_j})},X_r^{(S_j)})-Y_s\Vert -\Vert h(X_s)-Y_s\Vert ] <\infty$, then

$$\begin{aligned} \text {Uniform: }&{\mathbb {V}} \left[ \lim _{t\rightarrow \infty }{\bar{\phi }}_t^{(S_j)} \right] = {\mathcal {O}} (-\alpha \log (\alpha )). \\ \text {Geometric: }&{\mathbb {V}} \left[ \lim _{t\rightarrow \infty }{\bar{\phi }}_t^{(S_j)} \right] = {\mathcal {O}} (\alpha ) + {\mathcal {O}} (p). \end{aligned}$$

Proof

We again consider the more general estimator ${{\tilde{\phi }}}^{(S)}_t:= {\mathbb {E}}_\varphi [\sum _{s=t_0}^t w_s {\hat{\lambda }}^{(S)}_s(x_s,x_{\varphi _s},y_s)]$ and prove a result that can be used for arbitrary sampling and aggregation techniques.

Proposition 10

For $\varphi$ from (5) with $\varphi _s \perp \varphi _r$ for $r<s$ and $p_{s,r}\le p_{s',r}$ for $s>s'$, i.e., the probability to sample a previous observation r is non-increasing over time, it holds

$$\begin{aligned} {\mathbb {V}}\left[ {{\tilde{\phi }}}_t^{(S)}\right] \le 4\sigma _w^2\sigma _2^2 + 2\sigma _{2}^2\sum _{s=t_0}^t\sum _{s'=t_0}^{s-1} w_s w_{s'}\underbrace{\sum _{r=0}^{s'-1}p_{s',r}^2}_{=: \mathcal I_\varphi (s')}, \end{aligned}$$

provided that $\sigma _2^2:= {\mathbb {V}}[f(Z_s,Z_r)] <\infty$ and with $\sigma ^{2}_w:= \sum _{s=t_0}^t w_s^2$.

Proof

We denote $f(Z_s,Z_r):= \Vert h(X_s^{({\bar{S}})},X_r^{(S)})-Y_s\Vert -\Vert h(X_s)-Y_s\Vert$. Using $p_{s,r}:= {\mathbb {P}}(\varphi _s=r)$ and properties of variance, we can write

$$\begin{aligned}&{\mathbb {V}}[{{\tilde{\phi }}}_t^{(S)}] = {\mathbb {V}}[\sum _{s=t_0}^t w_s \sum _{r=0}^{s-1} p_{s,r} f(Z_s,Z_r)] \\&\quad = \sum _{s,s'=t_0}^t w_s w_{s'} \sum _{r=0}^{s-1}\sum _{r'=0}^{s'-1} p_{s,r}p_{s',r'}\text {cov}((s,r),(s',r')), \end{aligned}$$

where $\text {cov}((s,r),(s',r')):= \text {cov}(f(Z_s,Z_r),f(Z_{s'},Z_{r'}))$ denotes the covariance of the two random variables. The above sum ranges over all possible combinations of pairs (s, r), where $s=t_0,\dots ,t$ and $r=0,\dots ,s-1$. As $r<s$ and $r'<s'$, it holds $\vert \{s,s',r,r'\}\vert \ge 2$. When $\vert \{s,s',r,r'\}\vert =2$, then $s=s'$ and $r=r'$ and the covariance reduces to the variance. When none of the indices match, i.e., $\vert \{s,s',r,r'\}\vert = 4$, then the covariance is zero due to the independence assumption. When exactly one index matches, there are three possible cases:

Case 1: $s=s',r\ne r'$,
Case 2: $s\ne s',r\ne r'$ with $r'=s$ or $s'=r$,
Case 3: $s \ne s', r=r'$.

Case 2 yields the same covariances due to the iid assumption and the symmetry of the covariance. For case 1, with ${\mathbb {E}}_{(Z_s,Z_r)}[f(Z_s,Z_r)] = {\mathbb {E}}_{Z_s} {\mathbb {E}}_{Z_r}[f(Z_s,Z_r)] = \phi ^{(S)}(h)$, we denote $\tilde{f}(Z_s,Z_r):= f(Z_s,Z_r) - \phi ^{(S)}(h)$ to compute the covariance as

$$\begin{aligned} \text {cov}((s,r),(s',r'))&= {\mathbb {E}}[{\tilde{f}}(Z_{s},Z_{r}){\tilde{f}}(Z_{s},Z_{r'})] \\&={\mathbb {E}}_{Z_{s}}[{\mathbb {E}}_{Z_{r}}[\tilde{f}(Z_{s},Z_{r})]{\mathbb {E}}_{Z_{r'}}[{\tilde{f}}(Z_{s},Z_{r'})]] \\&={\mathbb {E}}_{Z_{s}}[{\mathbb {E}}_{Z_{r}}[{\tilde{f}}(Z_{s},Z_{r})]^2] \\&= {\mathbb {V}}_{Z_{s}}[{\mathbb {E}}_{Z_{r}}[f(Z_{s},Z_{r})]], \end{aligned}$$

where we have used ${\mathbb {E}}_{Z_{s}}[{\mathbb {E}}_{Z_{r}}[f(Z_{s},Z_{r})]]=\phi ^{(S)}(h)$ as well as the iid assumption multiple times, in particular for ${\mathbb {E}}_{Z_{r}}[f(Z_{s},Z_{r})]={\mathbb {E}}_{Z_{r'}}[f(Z_{s},Z_{r'})]$. The same reasoning, applied to the second argument, yields for case 3

$$\begin{aligned} \text {cov}((s,r),(s',r')) = {\mathbb {V}}_{Z_{r}}[{\mathbb {E}}_{Z_{s}}[f(Z_{s},Z_{r})]]. \end{aligned}$$

We thus summarize

$$\begin{aligned} \text {cov}((s,r),(s',r')) = {\left\{ \begin{array}{ll} {\mathbb {V}}[f(Z_s,Z_r)], \text { if } s=s', r=r' \\ {\mathbb {V}}_{Z_s}[{\mathbb {E}}_{Z_r}[f(Z_s,Z_r)]], \text { if case 1} \\ \text {cov}((s,r),(s',r')), \text { if case 2} \\ {\mathbb {V}}_{Z_r}[{\mathbb {E}}_{Z_s}[f(Z_s,Z_r)]], \text { if case 3} \\ 0, \text { if } \vert \{s,s',r,r'\}\vert = 4. \end{array}\right. } \end{aligned}$$

By the Cauchy–Schwarz inequality, all covariances are bounded by $\sigma _2^2:= {\mathbb {V}}[f(Z_s,Z_r)]$. Let $I:= \{t_0,\dots ,t\}$, $I_{s}:= \{0,\dots ,s-1\}$, $Q_2:= \{(s,r):s=s' \in I, r=r' \in I_s\}$ and $Q_3:= \{(s,s',r,r'): s,s' \in I, r \in I_s, r' \in I_{s'}, \vert \{s,s',r,r'\}\vert =3\}$. We thus obtain

$$\begin{aligned} {\mathbb {V}}[{{\tilde{\phi }}}_t^{(S)}]&= \sigma _2^2\sum _{(s,r) \in Q_2}w_s^2p_{s,r}^2 \\&\quad + \sum _{(s,s',r,r') \in Q_3}w_s w_{s'}p_{s,r}p_{s',r'}\text {cov}((s,r),(s',r')). \end{aligned}$$

For the first sum, we have

$$\begin{aligned} \sum _{(s,r) \in Q_2}w_s^2 p_{s,r}^2 \le \sum _{(s,r) \in Q_2}w_s^2 p_{s,r} = \sum _{s=t_0}^t w_s^2 = \sigma _w^2. \end{aligned}$$

For the second sum, $Q_3$ decomposes into the three cases. For case 1,

$$\begin{aligned} \sum _{\begin{array}{c} (s,s',r,r') \in Q_3 \\ s=s', r\ne r' \end{array}} w_s w_{s'} p_{s,r}p_{s,r'}&= \sum _{s=t_0}^t w_s w_{s'} \sum _{\begin{array}{c} (r,r')\in I_s^2 \\ r\ne r' \end{array}} p_{s,r}p_{s,r'} \\&\quad \le \sum _{s=t_0}^t w_s^2(\sum _{r=0}^{s-1}p_{s,r})^2 = \sigma _w^2. \end{aligned}$$

For case 2 w.l.o.g assume $r=s'$, which implies $s>s'$ and thus $w_s\ge w_{s'}$, then

$$\begin{aligned} \sum _{\begin{array}{c} (s,s',r,r') \in Q_3 \\ s \ne s', r\ne r', s'=r \end{array}}w_s w_{s'} p_{s,s'}p_{s',r'}&= \sum _{s =t_0}^t w_s \sum _{s'=t_0}^{s-1} w_{s'}p_{s,s'} \\&\quad \le \sum _{s =t_0}^t w_s^2 = \sigma _w^2. \end{aligned}$$

For case 3, we have

$$\begin{aligned} \sum _{\begin{array}{c} (s,s',r,r') \in Q_3 \\ s \ne s', r=r' \end{array}}w_s w_{s'}p_{s,r}p_{s',r}&=\sum _{\begin{array}{c} (s,s')\in I^2 \\ s\ne s' \end{array}}w_s w_{s'}\sum _{r=0}^{\min (s,s')-1}p_{s,r}p_{s',r} \\&\quad = 2\sum _{\begin{array}{c} (s,s')\in I^2 \\ s> s' \end{array}}w_s w_{s'}\sum _{r=0}^{s'-1}p_{s,r}p_{s',r} \\&\quad \le 2\sum _{\begin{array}{c} (s,s')\in I^2 \\ s> s' \end{array}}w_s w_{s'}\sum _{r=0}^{s'-1}p_{s',r}^2. \end{aligned}$$

In summary, we conclude

$$\begin{aligned} {\mathbb {V}}\left[ {{\tilde{\phi }}}_t^{(S)}\right] \le 4\sigma _w^2\sigma _2^2 + 2\sigma _{2}^2\sum _{s=t_0}^t\sum _{s'=t_0}^{s-1} w_s w_{s'}\sum _{r=0}^{s'-1}p_{s',r}^2. \end{aligned}$$

$\square$
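
To make the bound concrete, the following Python sketch evaluates its right-hand side for given weights and a given collision probability. The function name `variance_bound`, the parameter values, and the choice $\sigma _2^2 = 1$ are assumptions made only for illustration. The example uses exponential weights together with uniform sampling, for which the collision probability of step $s$ is $1/s$.

```python
import numpy as np


def variance_bound(w, t0, collision, sigma2_sq):
    """Evaluate 4*sigma_w^2*sigma_2^2 + 2*sigma_2^2 * sum_{s>s'} w_s w_s' I(s').

    w[i] is the weight of time step s = t0 + i; collision(s) returns I_phi(s).
    """
    w = np.asarray(w, dtype=float)
    sigma_w_sq = np.sum(w ** 2)
    cross = 0.0
    for i in range(len(w)):          # s  = t0 + i
        for k in range(i):           # s' = t0 + k < s
            cross += w[i] * w[k] * collision(t0 + k)
    return 4.0 * sigma_w_sq * sigma2_sq + 2.0 * sigma2_sq * cross


# Exponential weights and uniform sampling, for which I(s) = 1/s.
alpha, t0, t = 0.05, 1, 500
s = np.arange(t0, t + 1)
w = alpha * (1.0 - alpha) ** (t - s)
bound = variance_bound(w, t0, lambda step: 1.0 / step, sigma2_sq=1.0)
```

Passing a different `collision` function allows the same bound to be evaluated for other sampling strategies.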

The last sum depends on both the choices of weights $w_s$ and the collision probability ${\mathcal {I}}_\varphi (s) = \sum _{r=0}^{s-1} p_{s,r}^2 = P(Q_1=Q_2)$ for $Q_1,Q_2 \overset{iid}{\sim }\ {\mathbb {P}}_{\varphi _s}$, which is related to the Rényi entropy (Rényi, 1961). The variance increases with the collision probabilities of the sampling strategy, in particular ${\mathcal {I}}_{\text {unif}}(s) = \frac{1}{s}$ and ${\mathcal {I}}_{\text {geom}}(s) = \frac{p}{2-p} (1 + (1-p)^{2(s-t_0)+1})$ for uniform and geometric sampling, respectively.

Lemma 1

For geometric sampling and $p \in (0,1)$ it holds

$$\begin{aligned} {\mathcal {I}}_{\text {geom}}(s) = \sum _{r=0}^{s-1}p^2_{s,r} =\frac{p}{2-p}(1 + (1-p)^{2(s-t_0)+1}). \end{aligned}$$

Proof

The probabilities for geometric sampling are

$$\begin{aligned} p_{s,r} = {\left\{ \begin{array}{ll} p \cdot (1-p)^{s-r-1}, &amp; r\ge t_0 \\ p \cdot (1-p)^{s-t_0}, &amp; r< t_0, \end{array}\right. } \quad \text {with } t_0= \frac{1}{p}. \end{aligned}$$

Then

$$\begin{aligned} {\mathcal {I}}_{\text {geom}}(s)&= \sum _{r=0}^{s-1}p^2_{s,r} \\&= \sum _{r=0}^{t_0-1} p^2 \cdot (1-p)^{2(s-t_0)} + \sum _{r=t_0}^{s-1} p^2 (1-p)^{2(s-r-1)} \\&=t_0 \cdot p^2 \cdot (1-p)^{2(s-t_0)} + \sum _{r=t_0}^{s-1} p^2 (1-p)^{2(s-r-1)} \\&= p\cdot (1-p)^{2(s-t_0)} + p^2\sum _{r=0}^{s-t_0-1} (1-p)^{2r} \\&= p \cdot (1-p)^{2(s-t_0)} + p^2 \frac{1-(1-p)^{2(s-t_0)}}{1-(1-p)^2} \\&= p \cdot (1-p)^{2(s-t_0)} + \frac{p}{2-p} (1-(1-p)^{2(s-t_0)}) \\&= \frac{p}{2-p}(1 + (1-p)^{2(s-t_0)+1}). \end{aligned}$$

$\square$
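
The closed form can be verified numerically. The Python sketch below builds the geometric sampling probabilities, assuming the split used in the sums of the proof (constant probability for $r < t_0$, geometric decay for $r \ge t_0$) and an integer $t_0 = 1/p$. The function names and the values of `p` and `s` are illustrative choices.

```python
import numpy as np


def geometric_probs(s, p):
    """Sampling probabilities p_{s,r} for r = 0, ..., s-1 with t0 = 1/p."""
    t0 = round(1.0 / p)
    r = np.arange(s)
    recent = p * (1.0 - p) ** (s - r - 1)   # used for r >= t0
    initial = p * (1.0 - p) ** (s - t0)     # used for r <  t0
    return np.where(r >= t0, recent, initial)


def collision_closed_form(s, p):
    """Closed form of the collision probability from Lemma 1."""
    t0 = round(1.0 / p)
    return p / (2.0 - p) * (1.0 + (1.0 - p) ** (2 * (s - t0) + 1))


p, s = 0.1, 37                      # t0 = 10; any s >= t0 works
probs = geometric_probs(s, p)
collision_geom = np.sum(probs ** 2)

# Uniform sampling over the s previous observations for comparison.
collision_unif = np.sum(np.full(s, 1.0 / s) ** 2)
```

For growing $s$ the geometric collision probability approaches $\frac{p}{2-p}$, whereas the uniform one vanishes as $1/s$.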

We now apply Proposition 10 to our particular estimator ${\bar{\phi }}^{(S)} = {{\tilde{\phi }}}^{(S)}$ with $w_s:= \alpha (1-\alpha )^{t-s}$ and take the limit for $t\rightarrow \infty$. Note that both uniform and geometric sampling fulfill the conditions of the proposition. Furthermore, we have $\sigma ^2_w = \alpha ^2 \sum _{s=0}^{t-t_0}(1-\alpha )^{2s} \nearrow \frac{\alpha }{2-\alpha }$.

Uniform sampling For uniform sampling, we have

$$\begin{aligned} {\mathbb {V}}[{\bar{\phi }}_t^{(S)}]&\le \frac{\alpha }{2-\alpha }4\sigma _2^2 + 2\sigma _{2}^2\sum _{s=t_0}^t\sum _{s'=t_0}^{s-1} \alpha ^2 \frac{(1-\alpha )^{t-s+t-s'}}{s'} \\&\le \frac{\alpha }{2-\alpha }4\sigma _2^2 + 2\sigma _{2}^2\alpha ^2\sum _{s=0}^{t-t_0} (1-\alpha )^s\sum _{s'=0}^{t-t_0} \frac{(1-\alpha )^{s'}}{t-s'} \end{aligned}$$

For the first sum, we have $\alpha \sum _{s=0}^{t-t_0} (1-\alpha )^s \nearrow 1$ for $t\rightarrow \infty$. For the second sum

$$\begin{aligned} \alpha \sum _{s'=0}^{t-t_0} \frac{(1-\alpha )^{s'}}{t-s'}&\le \alpha (\sum _{\begin{array}{c} s'=0 \\ s' \ge t/2 \end{array}}^{t-t_0} (1-\alpha )^{s'} + 1+ \sum _{\begin{array}{c} s'=1 \\ s' < t/2 \end{array}}^{t-t_0} \frac{(1-\alpha )^{s'}}{s'}) \\&\le (1-\alpha )^{t/2}-(1-\alpha )^{t-t_0+1} + \alpha -\alpha \log (\alpha ) \\&\overset{t \rightarrow \infty }{\longrightarrow }\ \alpha - \alpha \log (\alpha ). \end{aligned}$$

Hence,

$$\begin{aligned} {\mathbb {V}}[\lim _{t\rightarrow \infty }{\bar{\phi }}_t^{(S)}] = {\mathcal {O}} (-\alpha \log (\alpha )). \end{aligned}$$

Geometric sampling For geometric sampling, we have

$$\begin{aligned} {\mathbb {V}}[{\bar{\phi }}_t^{(S)}]&\le \underbrace{\frac{\alpha }{2-\alpha }4\sigma _2^2}_{= {\mathcal {O}}(\alpha )} \\ &\quad + 2\sigma _{2}^2 \underbrace{\alpha ^2 \sum _{s=t_0}^t \sum _{s'=t_0}^{s-1} (1-\alpha )^{t-s+t-s'}}_{=: q(\alpha )}\mathcal I_{\text {geom}}(s). \end{aligned}$$

For the second term it is enough to show that $0<\lim _{t\rightarrow \infty } q(\alpha )<\infty$ to prove the result, as $\mathcal I_{\text {geom}}(s) = {\mathcal {O}}(p)$. By using the properties of geometric progression, we obtain

$$\begin{aligned} q(\alpha )&= \alpha \sum _{s=t_0}^t (1-\alpha )^{t-s} \alpha \sum _{s'=t-s}^{t-t_0}(1-\alpha )^{s'} \\&=\alpha \sum _{s=t_0}^t (1-\alpha )^{t-s} ((1-\alpha )^{t-s}-(1-\alpha )^{t-t_0+1}) \\&=\alpha \sum _{s=0}^{t-t_0} (1-\alpha )^{s} ((1-\alpha )^{s}-(1-\alpha )^{t-t_0+1}) \\&= \underbrace{\alpha \sum _{s=0}^{t-t_0} (1-\alpha )^{2s}}_{\nearrow \frac{1}{2-\alpha }} - (1-\alpha )^{t-t_0+1}\underbrace{\alpha \sum _{s=0}^{t-t_0}(1-\alpha )^{s}}_{\nearrow 1} \\&\overset{t \rightarrow \infty }{\longrightarrow }\ \frac{1}{2-\alpha }. \end{aligned}$$

Hence,

$$\begin{aligned} {\mathbb {V}}[\lim _{t\rightarrow \infty }{\bar{\phi }}_t^{(S)}] \le \mathcal O(\alpha ) + 2 \sigma _2^2 \frac{2}{2-\alpha } \frac{p}{2-p} = \mathcal O(\alpha ) + {\mathcal {O}}(p). \end{aligned}$$

$\square$
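
The two limits used in this proof, $\sigma _w^2 \nearrow \frac{\alpha }{2-\alpha }$ and $q(\alpha ) \rightarrow \frac{1}{2-\alpha }$, can be inspected numerically. The Python sketch below uses arbitrary illustrative values for $\alpha$ and $t$ and evaluates $q(\alpha )$ as the double sum over $s' < s$, which stays below the limit $\frac{1}{2-\alpha }$ used in the proof.

```python
import numpy as np

alpha, t0, t = 0.05, 0, 2000
k = np.arange(t - t0 + 1)                 # k = t - s is the age of step s
w = alpha * (1.0 - alpha) ** k            # weights w_s, ordered by age

sigma_w_sq = np.sum(w ** 2)
sigma_w_limit = alpha / (2.0 - alpha)

# q(alpha) as the double sum over s' < s, written in terms of ages a < b.
tail = np.cumsum(w[::-1])[::-1]           # tail[a] = sum of w_b for b >= a
q = np.sum(w[:-1] * tail[1:])             # sum_a w_a * sum_{b > a} w_b
q_limit_bound = 1.0 / (2.0 - alpha)
```

Since both quantities are bounded independently of $t$, the variance bound is driven by $\alpha$ and by the collision probability of the sampling strategy.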

Theorem 11

(Bias for changing Model) If $\Delta (h_s,h_t) \le \delta$ and $\Delta _S(h_s,h_t) \le \delta _S$ for $t_0 \le s \le t$, then

$$\begin{aligned} \vert {\mathbb {E}}[{\bar{\phi }}^{(S_j)}_t] - \phi ^{(S_j)}(h_t)\vert \le \delta _S + \delta +{\mathcal {O}}((1-\alpha )^{t}). \end{aligned}$$

Proof

We again consider the more general estimator ${{\tilde{\phi }}}^{(S)}_t:= {\mathbb {E}}_{\varphi }[\sum _{s=t_0}^t w_s {\hat{\lambda }}^{(S)}_s(x_s,x_{\varphi _s},y_s)]$ and prove a more general result.

Proposition 12

If $\Delta (h_s,h_t) \le \delta$ and $\Delta _S(h_s,h_t) \le \delta _S$ for $t_0 \le s \le t$, then $\vert {\mathbb {E}}[{{\tilde{\phi }}}^{(S)}_t] - \phi ^{(S)}(h_t)\vert \le \mu _w (\delta _S + \delta ) + \vert (1-\mu _w) \phi ^{(S)}(h_t)\vert$.

Proof

For the proof, we first show that for two models $h_s,h_t$ and a subset $S \subset D$, it holds that $\vert \phi ^{(S)}(h_t)-\phi ^{(S)}(h_s)\vert \le \Delta _S(h_s,h_t) + \Delta (h_s,h_t)$. This follows directly from the reverse triangle inequality for $f_S^\Delta (x^{({\bar{S}})},h_s,h_t) \ge {\mathbb {E}}_{{\tilde{X}}}[ \Vert h_t(x^{({\bar{S}})},{\tilde{X}})-y\Vert - \Vert y-h_s(x^{(\bar{S})},{\tilde{X}})\Vert ]$. The result then follows by definition and the observation that ${\hat{\lambda _s}}^{(S)}$ is an unbiased estimate of $\phi ^{(S)}(h_s)$, as

$$\begin{aligned} \vert {\mathbb {E}}[{{\tilde{\phi }}}^{(S)}_t] - \phi ^{(S)}(h_t) \vert &= \vert \left( \sum _{s=t_0}^t w_s \phi ^{(S)}(h_s)\right) - \phi ^{(S)}(h_t)\vert \\ & \le \sum _{s=t_0}^t w_s\underbrace{\vert \phi ^{(S)}(h_s) - \phi ^{(S)}(h_t)\vert }_{\le \delta + \delta _S} \\ &\quad + \vert \left( \sum _{s=t_0}^t w_s-1\right) \phi ^{(S)}(h_t) \vert \\ & \le \mu _w(\delta +\delta _S) + \underbrace{\vert (1-\mu _w)\phi ^{(S)}(h_t)\vert }_{\text {bias for static model}}. \end{aligned}$$

$\square$

With $\mu _w = 1 - (1-\alpha )^{t-t_0+1}$ our special case follows immediately. $\square$

Theorem 13

(Variance for changing Model) If

$$\begin{aligned} \text {cov}(f_s(Z_s,Z_r),f_{s'}(Z_{s'},Z_{r'})) \le \sigma _{\text {max}}^2 \end{aligned}$$

(9)

for $t_0\le s,s' \le t$, $r<s$ and $r'<s'$, then for a sequence of models $(h_t)_{t\ge 0}$ the results of Theorem 3 apply.

Proof

In all proofs, a changing model $h_t$ adds a time dependency to the function $f_s(Z_s,Z_r):= \Vert h_s(X_s^{(\bar{S})},X_r^{(S)})-Y_s\Vert -\Vert h_s(X_s)-Y_s\Vert$. Instead of bounding the covariances by $\sigma _2^2$, we now bound the covariances of the time-dependent functions by $\sigma _{\text {max}}^2$. This only directly affects Proposition 10, as

$$\begin{aligned} {\mathbb {V}}[{\bar{\phi }}_t^{(S)}]&= {\mathbb {V}}\left[ \sum _{s=t_0}^t w_s \sum _{r=0}^{s-1} p_{s,r} f_s(Z_s,Z_r)\right] \\ &= \sum _{s,s'=t_0}^t w_s w_{s'} \sum _{r=0}^{s-1}\sum _{r'=0}^{s'-1} p_{s,r}p_{s',r'} \text {cov}((s,r),(s',r')) \\&\le \sigma ^2_{\text {max}}\sum _{s,s'=t_0}^t w_s w_{s'} \sum _{r=0}^{s-1}\sum _{r'=0}^{s'-1} p_{s,r}p_{s',r'}. \end{aligned}$$

All remaining arguments and proofs are still valid for a changing model due to the iid assumption. $\square$

B Approximation error for expected PFI

With $f(Z_n,Z_m):= \Vert h(X_n^{(\bar{S_j})},X_m^{(S_j)})-Y_n\Vert -\Vert h(X_n)-Y_n\Vert$ and symmetric U-statistic kernel $f_0(Z_n,Z_m):= \frac{f(Z_n,Z_m)+f(Z_m,Z_n)}{2}$, we can write

$$\begin{aligned} {\bar{\phi }}^{(S_j)} = \left( {\begin{array}{c}N\\ 2\end{array}}\right) ^{-1} \sum _{1\le n < m \le N} f_0(Z_n,Z_m), \end{aligned}$$

which is the basic form of a U-statistic and therefore the variance can be computed as

$$\begin{aligned} {\mathbb {V}}\left[ {\bar{\phi }}^{(S_j)}\right] = \left( {\begin{array}{c}N\\ 2\end{array}}\right) ^{-1}\sum _{c=1}^2 \left( {\begin{array}{c}2\\ c\end{array}}\right) \left( {\begin{array}{c}N-2\\ 2-c\end{array}}\right) \sigma ^2_c = {\mathcal {O}}(1/N), \end{aligned}$$

where $\sigma _1^2:= {\mathbb {V}}_{Z_n}[{\mathbb {E}}_{Z_m}[f_0(Z_n,Z_m)]]$ and $\sigma _2^2:= {\mathbb {V}}[f_0(Z_n,Z_m)]$ are assumed to be finite (Hoeffding, 1948). For $\epsilon > 0$, we then obtain by Chebyshev’s inequality ${\mathbb {P}}(\vert {\bar{\phi }}^{(S_j)} - \phi ^{(S_j)}(h)\vert > \epsilon ) = {\mathcal {O}}(1/N)$, as ${\bar{\phi }}^{(S_j)}$ is unbiased.
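
The $\mathcal {O}(1/N)$ rate can be read off by evaluating the variance formula for growing $N$. The Python sketch below does this with placeholder values for $\sigma _1^2$ and $\sigma _2^2$, which are chosen purely for illustration. The function name `u_statistic_variance` is likewise hypothetical.

```python
from math import comb


def u_statistic_variance(N, sigma1_sq, sigma2_sq):
    """Variance of a U-statistic with a symmetric kernel of degree 2."""
    sigma_sq = {1: sigma1_sq, 2: sigma2_sq}
    total = sum(comb(2, c) * comb(N - 2, 2 - c) * sigma_sq[c] for c in (1, 2))
    return total / comb(N, 2)


# Placeholder variance components, chosen only for illustration.
sigma1_sq, sigma2_sq = 0.3, 1.0
for N in (10, 100, 1000, 10000):
    v = u_statistic_variance(N, sigma1_sq, sigma2_sq)
    print(N, v, N * v)  # N * v approaches 4 * sigma1_sq as N grows
```

Combined with Chebyshev's inequality, the deviation probability for a fixed $\epsilon$ is bounded by this variance divided by $\epsilon ^2$.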

C Experiments

In the following, we give more comprehensive details about the datasets and models used in our experiments.

C.1 Dataset description

Adult (Kohavi, 1996) Binary classification dataset that classifies 48,842 individuals based on 14 features into yearly salaries above and below 50k. There are six numerical features and eight nominal features.

Bank (Moro et al., 2011) Binary classification dataset that classifies 45,211 marketing phone calls based on 17 features according to whether the contacted client subscribed to a term deposit. There are seven numerical features and ten nominal features.

Bike (Fanaee-T & Gama, 2014) Regression dataset that collects the number of bikes in different bike stations of Toulouse over 187,470 time stamps. There are six numerical features and two nominal features.

elec2 (Harries, 1999) Binary classification dataset for predicting whether the electricity price will go up or down. The data was collected for 45,312 time stamps from the Australian New South Wales Electricity Market and is based on eight features, six numerical and two nominal.

agrawal (Agrawal et al., 1993) Synthetic data stream generator to create binary classification problems to decide whether an individual will be granted a loan based on nine features, six numerical and three nominal. There are ten different decision functions available.

stagger (Schlimmer & Granger, 1986) The stagger concepts form a simple toy classification data stream. The synthetic data stream generator consists of three independent categorical features that describe the shape, size, and color of an artificial object. Different classification functions can be derived from these sharp distinctions.

insects (de Souza et al., 2020) The insects concept drift data streams capture flight information about different kinds of mosquito in various experimental settings. In total, 11 different variants of this stream (i.e. experimental settings) are available. The streams were created in a synthetic experiment with real mosquitoes and sensors. The dataset contains 33 numerical features. The variant called “abrupt balanced” used here contains 52,848 samples.

ozone (de Souza et al., 2020) The ozone dataset contains air measurement values from 1998 to 2004. The learning task is a binary classification problem of determining the ozone level (“ozone” day or “normal” day). In total, the dataset contains 72 numerical features for 2,534 days.

C.2 Model description

All models are implemented with the default parameters from scikit-learn (Pedregosa et al., 2011) and River (Montiel et al., 2020) unless otherwise stated.

ARF The Adaptive Random Forest Classifier (ARF) uses an ensemble of 50 trees with binary splits, ADWIN drift detection and information gain split criterion. We used the default implementation AdaptiveRandomForestClassifier from River with n_models=50 and binary_split=True.

NN The Neural Network classifier (NN) was implemented with two hidden layers of sizes 128 and 64, ReLU activation function, and optimized with the stochastic gradient-based optimizer Adam. We used the default implementation MLPClassifier from scikit-learn.

GBT The Gradient Boosting Tree (GBT) uses 200 estimators and additively builds a decision tree ensemble using log-loss optimization. We used the GradientBoostingClassifier from scikit-learn with n_estimators=200.

LGBM The LightGBM (LGBM) constitutes a more lightweight implementation of GBT. We used HistGradientBoostingRegressor for regression tasks and HistGradientBoostingClassifier for classification tasks from scikit-learn with the standard parameters.

C.3 Hardware details

The experiments were mainly run on a computation cluster with hyperthreaded Intel Xeon E5-2697 v3 CPUs clocked at 2.6 GHz. In total, the experiments took around 300 CPU hours (30 CPUs for 10 h) on the cluster. This mainly stems from the number of parameters and different initializations. Before running the experiments on the cluster, the implementations were validated on a Dell XPS 15 9510 with an Intel i7-11800H at 2.30 GHz. The laptop ran for around 12 h for the validation.

C.4 Additional time complexity

As described in Sect. 4.1, the runtime of iPFI scales linearly with the number of features. This relationship is illustrated in Fig. 10. Each dataset or data stream was explained in ten independent iterations. The explanation time relative to the time without explaining was averaged and plotted over the feature count. A linear regression of the relative explanation time on the feature count (fitted as $0.104 \cdot \vert D\vert$) yields $R^2 = 0.966$, indicating a linear effect.

C.5 Summary of incremental experiments

Table 3 contains summary information about the supplementary experiments conducted in the incremental learning scenario (cf. Sect. 4.1). Figures 11, 12, 13, 14, 15, and 16 illustrate the experiments conducted on the agrawal concept drift data streams. Figure 17 shows our additional experiments conducted on the synthetic stagger concept drift data streams. Lastly, Fig. 18 shows the experiments conducted on the elec2 data stream with an induced concept drift. The corresponding entries in Table 3 denote the approximation qualities for these experiments.

C.6 Summary of batch experiments

In addition to the single batch experiment showcased in Sect. 4.3 and Fig. 9, we also show the results for the other datasets. Figures 19, 20, 21, 22 and 23 show the static batch model experiments for the corresponding datasets.

Table 3 Summary of additional concept drift experiments on agrawal, stagger, and elec2

D Ground-truth PFI for the agrawal stream

River (Montiel et al., 2020) implements the agrawal (Agrawal et al., 1993) data stream with multiple classification functions. In our experiments we consider the following classification function (among others):

$$\begin{aligned} \text {Class A:}&((\text {age}< 40) \wedge (50K \le \text {salary} \le 100K ))\ \vee \\&((40 \le \text {age} < 60) \wedge (75K \le \text {salary} \le 125K ))\ \vee \\&((\text {age} \ge 60) \wedge (25K \le \text {salary} \le 75K )) \end{aligned}$$

Both features, age and salary, are uniformly distributed with $X^{(\text {age})} \sim {\mathcal {U}}_{[20, 80]}$ and $X^{(\text {salary})} \sim {\mathcal {U}}_{[20, 150]}$. Given iid samples from the data stream, the classification problem can be transformed into a two-dimensional problem following the classification function defined above. The two-dimensional classification problem is illustrated in Fig. 24. A sample is classified as concept A when it falls within $A_1$, $A_2$, or $A_3$. Otherwise, the sample is classified as concept B.

The theoretical PFIs can be calculated as the base probability of a sample belonging to concept A ($P(A_1) = P(A_2) = P(A_3) = \frac{5}{39}$) times the probability of switching the class by changing a feature ($P(A_i \rightarrow B_{n,m})$), plus the corresponding terms for a sample originally belonging to concept B.

$$\begin{aligned} \phi ^{(\text {age})}&= P(A_1) \cdot P(A_1 \rightarrow B_{11}) + P(B_{11}) \cdot P(B_{11} \rightarrow A_1) \\&\quad + P(A_2) \cdot P(A_2 \rightarrow B_{21}) + P(B_{21}) \cdot P(B_{21} \rightarrow A_2) \\&\quad + P(A_3) \cdot P(A_3 \rightarrow B_{31}) + P(B_{31}) \cdot P(B_{31} \rightarrow A_3) \\&= \frac{5}{39} \cdot \frac{1}{3} + (\frac{5}{13} \cdot \frac{1}{3} ) \cdot \frac{1}{3} \\&\quad + 2 \cdot (\frac{5}{39} \cdot \frac{1}{2} + (\frac{5}{13}\cdot \frac{1}{3}+\frac{5}{13}\cdot \frac{1}{3}\cdot \frac{1}{2})\cdot \frac{1}{3}) \\&\approx 0.3419 \\ \phi ^{(\text {salary})}&= P(A_1) \cdot P(A_1 \rightarrow B_{12}) + P(B_{12}) \cdot P(B_{12} \rightarrow A_1) \\&\quad + P(A_2) \cdot P(A_2 \rightarrow B_{22}) + P(B_{22}) \cdot P(B_{22} \rightarrow A_2) \\&\quad + P(A_3) \cdot P(A_3 \rightarrow B_{32}) + P(B_{32}) \cdot P(B_{32} \rightarrow A_3) \\&= 3 \cdot (\frac{5}{39} \cdot \frac{8}{13} + (\frac{8}{13} \cdot \frac{1}{3}) \cdot \frac{5}{13}) \\&\approx 0.4734 \end{aligned}$$
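
These analytical values can be cross-checked with a simple Monte Carlo simulation. The Python sketch below is an independent illustration rather than the River generator: it samples age and salary from the stated uniform distributions, treats the classification function itself as the model under a 0-1 loss, and permutes one feature at a time. The function names, the sample size, and the seed are arbitrary choices.

```python
import numpy as np


def is_class_a(age, salary):
    """Classification function from above; salary is given in thousands."""
    a1 = (age < 40) & (salary >= 50) & (salary <= 100)
    a2 = (age >= 40) & (age < 60) & (salary >= 75) & (salary <= 125)
    a3 = (age >= 60) & (salary >= 25) & (salary <= 75)
    return a1 | a2 | a3


def monte_carlo_pfi(n_samples=400_000, seed=0):
    rng = np.random.default_rng(seed)
    age = rng.uniform(20, 80, n_samples)
    salary = rng.uniform(20, 150, n_samples)
    label = is_class_a(age, salary)

    # Permuting one feature breaks its link to the label; under 0-1 loss the
    # original error is zero, so PFI is the fraction of changed predictions.
    perm = rng.permutation(n_samples)
    pfi_age = np.mean(is_class_a(age[perm], salary) != label)
    pfi_salary = np.mean(is_class_a(age, salary[perm]) != label)
    return pfi_age, pfi_salary


pfi_age, pfi_salary = monte_carlo_pfi()
```

With a sample of this size, both estimates should land close to the analytical values of roughly 0.3419 for age and 0.4734 for salary.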

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Fumagalli, F., Muschalik, M., Hüllermeier, E. et al. Incremental permutation feature importance (iPFI): towards online explanations on data streams. Mach Learn 112, 4863–4903 (2023). https://doi.org/10.1007/s10994-023-06385-y

Received: 05 June 2023
Revised: 05 June 2023
Accepted: 12 July 2023
Published: 20 September 2023
Issue Date: December 2023
DOI: https://doi.org/10.1007/s10994-023-06385-y
