1 Introduction

The last decade has witnessed a rapid integration and growing impact of machine learning in everyday life. Machine learning algorithms work well in many applications and grow ever more complex, with models having millions of parameters (Devlin et al. 2018; Brown et al. 2020). Data quality is key, and checking for data leakage or bias is important for building robust models. However, we still lag behind in explaining why these algorithms work so well, why they occasionally fail, and what in the data leads multiple classifiers to predict a certain class (Goodfellow et al. 2015). The evaluation of explanation methods is still an open problem: although many new explanation methods and methodologies exist, it remains difficult to decide which explainer is best for a given problem and dataset.

This mismatch between the growing complexity of many machine learning algorithms and our ability to explain them and their data, including for time series, undermines the application of these technologies in critical, human-related areas such as healthcare, sports and finance (Caruana et al. 2015; Lipton 2018). As time series data is prevalent in these applications (Petitjean et al. 2014; Ramgopal et al. 2014; Avci et al. 2010), Time Series Classification (TSC) algorithms often call for reliable explanations (Bostrom and Yudkowsky 2018; Ismail Fawaz et al. 2019a). Such an explanation is usually presented as feature importance or saliency weights (Adebayo et al. 2018), highlighting the parts of the time series which are informative for the classification decision. Saliency-based explanations have been shown to be useful for finding important motifs in data (Le Nguyen et al. 2019; Ismail Fawaz et al. 2019a), and as a starting point for prioritising features for further investigation in counterfactual explanation methods (Delaney et al. 2021).

Recent efforts in designing intrinsically explainable machine learning algorithms, as well as building post-hoc saliency-based explainers for black-box algorithms, have gained significant attention (Zhou et al. 2016; Selvaraju et al. 2017; Ribeiro et al. 2016; Lundberg et al. 2017). Most works focus on explaining one particular classification algorithm, with a lot of emphasis on deep learning methods (Boniol et al. 2022; Zhou et al. 2016; Selvaraju et al. 2017; Sundararajan et al. 2017; Smilkov et al. 2017; Springenberg et al. 2015). Often we are faced with a set of saliency maps for explanation, coming either from domain experts or from a diverse set of classifiers. In particular, for time series classification, a variety of classifiers are required for high accuracy, depending on the application domain (Bagnall et al. 2016; Middlehurst et al. 2023). Each of these classifiers may be tied in accuracy, can be explained with different methods (e.g., LIME or SHAP), and often the resulting explanations disagree, e.g., pointing to different parts of a time series as most relevant for a predicted class. We thus face the challenge: How to assess and objectively compare many explanation methods? In other words, if two or more explanation techniques give different explanations (i.e., two different saliency maps coming from the same classifier or different classifiers, Fig. 1), which explanation is best for our task? In this paper, we focus on the time series classification task and propose a methodology appropriate for time series. Some of the ideas we investigate are relevant to recommending explainers beyond the TSC task, but this is beyond the scope of this paper.

Fig. 1

Saliency map explanation is a vector of feature importance weights overlaid over the original time series, where each point in the time series is coloured according to its importance. The saliency is obtained by classifying a motion time series using different classifiers and explainers. The most discriminative parts according to the explanation method are colored in deep red, and the non-discriminative parts are colored in deep blue (Color figure online)

We propose a methodology to compute a standardized evaluation measure, which enables quantitative comparison and ranking of explainers (Table 1). From the application users’ perspective, having this recommendation can support short-listing of useful explanations for further analysis and optimisation (Doshi-Velez and Kim 2017). At the very least, we want to know that a given explanation is better (more informative) than a random explanation, and in general we want to be able to select the best explainer for a given dataset.

Table 1 Outcome of AMEE: a measure to evaluate multiple explanation methods

In this paper, we present A Model-agnostic framework for Explanation Evaluation for Time Series Classification (AMEE). Specifically, we focus on explanations in the form of a saliency map and consider their informativeness within a defined computational scope, in which a more informative explanation means a higher capacity to influence classifiers to identify a class. We show that the saliency-guided perturbation of discriminative subsequences results in a reduced accuracy of classifiers. The higher the impact of a perturbation, the more informative are the perturbed time series subsequences. Estimation of this impact, measured by a committee of highly accurate referee classifiers, can reveal the informativeness of the explanation. This is the key idea behind AMEE, a post-hoc approach which uses a set of classifiers and explainers to recommend the best explainer for a given time series classification dataset.

Our work addresses an overlooked area of research: robust comparison and ranking of multiple explanation methods for time series classification. Our main contributions are:

  • A robust, model-agnostic, ensemble-based explanation evaluation framework. First, we leverage the use of multiple data perturbation strategies to create explanation-guided noisy data. Using synthetic data, we empirically show that applying multiple data perturbation strategies is particularly useful when the data is hard-to-classify, as such data is often more sensitive to the data perturbation type. We also show that a committee of referee classifiers is useful to reduce the potential bias that one single referee classifier may have. Our experiments demonstrate that a committee approach involving multiple types of data perturbations and multiple classifiers leads to explanation evaluation and ranking that better agrees with the explanation ground truth (synthetic data) and domain expert ground truth (real data).

  • A standardised evaluation measure (explanation power) that is comparable across different explanation methods, referee classifiers and datasets.

  • An empirical study on both synthetic and real datasets with recent state-of-the-art time series classifiers and explanation methods. We verify the evaluation methodology with annotated, real datasets. All data, code and detailed results are available (Footnote 1).

In the next sections we review related Explainable AI research including both time series specific and general methods (Sect. 2). We then define related concepts (Sect. 3) and describe our proposed solution (Sect. 4). We discuss experiments on both synthetic and real time series datasets, with detailed case studies (Sect. 5). We discuss important considerations for practitioners when using AMEE to evaluate and recommend explanations for time series classification in Sect. 6. Finally, we summarize our results and discuss future work in Sect. 7.

2 Related work

2.1 Explanation methods for time series classification

As deep learning has achieved high performance in machine learning domains such as computer vision (Szegedy et al. 2015; Krizhevsky et al. 2012), the research community started to develop techniques to explain these black-box models to understand why they work so well (Zhou et al. 2016; Selvaraju et al. 2017; Sundararajan et al. 2017; Smilkov et al. 2017; Springenberg et al. 2015). These explanations are in the form of a saliency map, visualizing the important pixels in an image by computing a saliency weight for each pixel (a type of feature importance). This saliency map, combined with the original image, can reveal whether a black-box model focuses on the correct area of the image and explain the model in a visually friendly way.

Explainable AI (XAI) methods for Time Series Classification have advanced in parallel with general XAI progress (Theissler et al. 2022). Although recent works on instance-based methods, such as factual and counterfactual explanations (Guidotti et al. 2020; Delaney et al. 2021; Zhendong et al. 2021) have become popular, the majority of explanation methods exist in the form of saliency maps (Kokhlikyan et al. 2020; Mishra et al. 2017; Parvatharaju et al. 2021; Rooke et al. 2021; Guillemé et al. 2019; Zhou et al. 2016; Smilkov et al. 2017), where a map visualizes the importance weight vector w and highlights the discriminative areas of a time series for the classification task. These saliency-based explanations can either be extracted directly from the classifier (intrinsic explanation), or indirectly by applying a post-hoc explanation method to the black-box classifier (post-hoc explanation).

2.1.1 Intrinsic explanation

Explanation from MrSEQL Time Series Classifier. MrSEQL (Le Nguyen et al. 2019) is a time series classification algorithm that is intrinsically explainable. The algorithm converts the numeric time series vector into strings, e.g., by using the SAX (Lin et al. 2007) transform with varying parameters to create multiple symbolic representations of the time series. The symbolic representations are then used as input for SEQL (Ifrim and Wiuf 2011), a sequence learning algorithm, which selects the most discriminative subsequences for training a classifier with logistic regression. The symbolic features combined with the classifier weights learned by logistic regression make this classification algorithm explainable. For a time series, the explanation weight of each data point is the accumulated weight of the SAX features it maps to. These weights can be mapped back to the original time series to create a saliency map that highlights the time series parts important for the classification decision. We call the saliency map explanation obtained this way MrSEQL-SM. To use the weight vector from MrSEQL-SM, we take the absolute value of the weights to obtain a vector of non-negative weights. Figures 2 and 3 show examples of the saliency map explanation obtained directly from the MrSEQL classifier weights, for the Coffee and GunPoint datasets from the UCR Archive (Dau et al. 2018).

Explanation from a Generic, White-box Classifier. A generic, white-box classifier such as Logistic Regression or Ridge Regression has been the primary source of providing feature importance (by using the learned model weights), especially for tabular data (Hosmer et al. 2013). These classifiers and their explanations are computationally cheap and can be useful for time series data (Frizzarin et al. 2023).

2.1.2 Post-hoc explanation

Gradient-based Explanation. This approach uses the gradients from a trained deep neural network to infer explanations. Notable methods are Integrated Gradient (Sundararajan et al. 2017), GradientSHAP (Lundberg et al. 2018), GradCAM (Selvaraju et al. 2017), CAM (Zhou et al. 2016). For time series classification and explanation, the most common classifier is ResNet (Ismail Fawaz et al. 2019b) combined with some of the explanation methods mentioned.

Perturbation-based Explanation. This type of method infuses noise into the data to create data variations and infer the degree of data point importance (Castro et al. 2009; Abanda et al. 2022). Notable methods are Feature Occlusion (Suresh et al. 2017) and LIME (Ribeiro et al. 2016). One of the most popular post-hoc explanation methods is SHAP (Lundberg et al. 2017), which explains any machine learning model using a game-theoretic approach in which all feature coalitions are evaluated. Feature importance is then calculated using the classic Shapley value (Štrumbelj and Kononenko 2014). Figures 2 and 3 show examples of the saliency map explanation obtained by applying SHAP to the MrSEQL classifier to get a post-hoc explanation for the Coffee and GunPoint datasets from the UCR Archive (Dau et al. 2018).
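
To make this concrete, the following is a minimal sketch of obtaining a post-hoc SHAP saliency map for one test time series. It assumes a fitted scikit-learn-style classifier `clf` exposing `predict_proba` and a training matrix `X_train` of shape (n_series, n_timesteps); it is an illustration, not the exact pipeline used in this paper.

```python
import numpy as np
import shap

def shap_saliency(clf, X_train, x_test, class_idx, n_background=20):
    # A small background sample keeps KernelExplainer tractable.
    background = shap.sample(X_train, n_background)
    explainer = shap.KernelExplainer(clf.predict_proba, background)
    sv = explainer.shap_values(x_test.reshape(1, -1), nsamples=200)
    # Depending on the SHAP version, sv is a list with one array per class
    # or a single 3-D array; the list form is assumed here.
    weights = np.abs(np.asarray(sv[class_idx])[0])
    # Rescale to [0, 1] so maps from different explainers are comparable.
    return (weights - weights.min()) / (weights.max() - weights.min() + 1e-12)
```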

Fig. 2

Saliency map from two explanation methods on two examples from the Coffee dataset: the bottom row is an explanation from MrSEQL Classifier (intrinsic explanation); the top row is an explanation from SHAP, a post-hoc explanation method based on MrSEQL Classifier

Fig. 3

Saliency map from two explanation methods on two examples of the GunPoint dataset: the bottom row is the explanation from the MrSEQL Classifier (intrinsic explanation); the top row is the explanation from SHAP, a post-hoc explanation method based on the MrSEQL Classifier

2.2 Quantitative evaluation of saliency-based explanation

Quantitative evaluation of explanations for time series data was a relatively untouched topic until recently. Unlike image and text, time series data often lack annotated ground-truth explanations; hence, it remains a challenge to determine whether a saliency-based explanation is correct. Approaches to benchmark and evaluate the faithfulness of recent explanation methods overcome this problem by using synthetic datasets with assigned ground truth (Ismail et al. 2020; Crabbé and Van Der Schaar 2021). Other research ventures into real datasets, yet these efforts focus on examining explanations from a single classifier (Crabbé and Van Der Schaar 2021) or averaging a non-comparable metric across multiple datasets (Schlegel et al. 2019). The approach in Guidotti (2021) uses a white-box classifier to obtain a pseudo ground-truth explanation (a) and evaluates a post-hoc, localized explanation method (b) by estimating the cosine distance between (a) and (b). However, this method assumes that white-box classifiers can always produce explanations of ground-truth quality; we show in our experiments that this is not the case. Notably, Schlegel et al. (2019), Nguyen et al. (2020) and Agarwal et al. (2021) propose methods to quantify explanation methods, but these comparisons have a few problems: a single perturbation type cannot always distinguish between explanations, the metric used (change in accuracy) is not comparable across the selected datasets, individual effects are not separated (only the average change in accuracy is reported), and explanation ground truth is not discussed. Additionally, there is little discussion in previous work about the impact of classifier accuracy on evaluating the explanation methods based on those classifiers. This is an important point, as the evaluation can only be trusted if the classifiers are reliable. Furthermore, there are cases where multiple classifiers have high accuracy and are tied in this regard, but the explanations obtained from them disagree, and in some cases can be ranked worse than a random explanation. Hence it is not clear which classifier and explanation to select in such cases.

3 Background and definitions

3.1 Time series and time series dataset

A time series \(X = [x_0,x_1,\ldots ,x_{l-1}], x_i \in \mathbb {R}\), is a sequence of \(l \in \mathbb {N}\) real values recorded from a synthetic or real process. In this definition, l is also called the number of time steps or the length of the time series X, and the \(x_i\) are the data points or time points.

A time series dataset D consists of \(n \in \mathbb {N}\) time series of equal length l that are recorded from a single process. If the time series are not of equal length, it is common to pad with zeroes or use resampling to bring them to equal length.

3.2 Saliency-based explanations for time series

In the context of this paper, we only consider explanations in the form of saliency maps. A saliency map explaining time series X is a vector of numerical weights \(M = [w_0, \dots ,w_{l-1}]\), where \(w_i \in \mathbb {R}\) and l is the length of X. The value \(w_i\) indicates the importance (or saliency) of time point i for the prediction made for X. This vector can be obtained from annotation (by a human) or computed by an explanation method. The explanation method can come from a white-box classification model (intrinsic explanation) or a black-box classifier coupled with a post-hoc explanation method (post-hoc explanation). The weights \(w_i\) are typically rescaled to [0, 1].

3.2.1 Random explanation

For sanity checks, we use saliency maps generated through random sampling as a lower bound on explanation quality. Here, the weights \(w_i\) are drawn from a random uniform distribution. Like a dummy classifier, this random explanation serves as a baseline for any reasonable explanation method, i.e., they all should be better than random guessing. Nonetheless, there are situations where a random explanation outperforms a method-based explanation. Specifically, when a method-based explanation highlights non-discriminative parts, or fails to identify any discriminative parts, that explanation can be considered worse than a random explanation.
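
As a concrete illustration, such a random baseline can be generated in a couple of lines; this is a minimal sketch and the function name is ours.

```python
import numpy as np

def random_explanation(length, seed=0):
    """Uniform-random saliency map in [0, 1], used as a sanity-check lower bound."""
    rng = np.random.default_rng(seed)
    return rng.uniform(0.0, 1.0, size=length)
```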

3.2.2 Oracle explanation

In cases where explanation ground truth is available (e.g., for synthetic datasets or from domain experts), this should be the gold standard for any explanation method. We generally expect any explanation method to rank between the random and the oracle explanations.

4 Methodology

In this section, we describe our proposed methodology using concepts described in Sect. 3. Specifically, we present the blueprint of AMEE in Fig. 5. The framework involves a labelled time series dataset (split into training and test datasets), a set of explanation methods to be compared, and a set of evaluating classifiers (referee classifiers). The output of the framework is the explanation power of each explanation method (see Table 1).

4.1 Explanation-guided data perturbation

A good saliency-based explanation for a time series should highlight its discriminative part(s) that contain class-specific information to distinguish from other classes. Data perturbation is the process of adding noise to the data by replacing selected time points in the time series. Explanation-guided data perturbation uses a saliency-based explanation to determine the specific time points of the time series to be perturbed. As a result, the more informative the explanation, the higher the decrease in classifier accuracy is expected, because that perturbation removes important class-specific information in the respective time series. Given a threshold k (\(0 \le k \le 100\)), the discriminative parts of a time series of l steps are segmented using the top k-percentiles in M. This is a set of \(k * l / 100\) time steps that have the highest weights in the saliency map M. Varying k allows us to control the scope of the perturbation. At \(k=0\), the time series is the original; at \(k=10\), only 10 percent of the time steps (that are most discriminative according to the explanation) are perturbed; at \(k=100\) the entire time series is perturbed.
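
A minimal sketch of this thresholding step is shown below, assuming a saliency map M already rescaled to [0, 1]; the helper name is ours.

```python
import numpy as np

def top_k_timesteps(saliency, k):
    """Indices of the k% most salient time steps of one series (0 <= k <= 100)."""
    l = len(saliency)
    n_perturb = int(round(k * l / 100))
    # Sort in descending order of saliency weight; ties are broken arbitrarily.
    return np.argsort(saliency)[::-1][:n_perturb]
```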

4.2 Referee classifiers

In our work we employ a set of independent and accurate classifiers that are trained with the original training set and are used to evaluate the target explanations on the test set. This committee is formed of member classifiers that we call Referee Classifiers. In order to evaluate the explanation methods, our framework measures the impact of each explanation-guided data perturbation on the accuracy of the referee classifiers R. We select the referees based on recent empirical benchmarks on TSC (Middlehurst et al. 2023).

4.3 Data perturbation strategy: multiple perturbations

In Fig. 4 we explain and visualize four strategies to perturb the discriminative areas of a time series, as guided by a given explanation (Mujkanovic et al. 2020). These strategies are either time-step dependent (local perturbation, using only the t-th step information) or time-step independent (global perturbation), using either Gaussian-based or single value replacement. With these strategies, discriminative time steps are replaced with noisy values, either by replacing the original time series values with a patch of constant values (like a grey mask in an image) or a patch of random Gaussian noise values (like a noise mask in an image). Let n be the number of time series in a dataset D, each with l time steps. We want to perturb one test time series of size \(1\times l\), so its t-th value \(x_t\) is replaced with a new value \(r_t\). We define the global and local profile for this time step perturbation as follows.

Local perturbation:

$$\begin{aligned} \mu _t = \frac{1}{n}\sum _{x \in D}x_t; \quad \sigma _t^{2} = \frac{1}{n-1} \sum _{x \in D} (x_t - \mu _t)^2 \end{aligned}$$
(1)

Global perturbation:

$$\begin{aligned} \mu = \frac{1}{l\,n} \sum _{i=1}^{l} \sum _{x \in D}x_i; \quad \sigma ^{2} = \frac{1}{l\,n-1} \sum _{i=1}^{l} \sum _{x \in D} (x_i - \mu )^2 \end{aligned}$$
(2)

With these local and global profiles, we can define the perturbed value \(r_t\) accordingly. We use four perturbation strategies, two local and two global. Local mean: \(r_t^{(1)} = \mu _t\); Local Gaussian: \(r_t^{(2)} \sim \mathcal {N}(\mu _t,\,\sigma _t^{2})\); Global mean: \(r_t^{(3)} = \mu\); Global Gaussian: \(r_t^{(4)} \sim \mathcal {N}(\mu ,\,\sigma ^{2})\). Figure 4 illustrates how the four strategies modify the original time series in the regions identified by the explanation weights. We show in our experiments that it is important to use a set of perturbation strategies, rather than a single fixed perturbation.
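
The four strategies can be sketched as follows; we assume the perturbation profiles are estimated from a training matrix `D_train` of shape (n, l) and that `idx` holds the time steps selected by the saliency threshold of Sect. 4.1. The helper names are ours and this is an illustrative sketch, not the reference implementation.

```python
import numpy as np

def perturb(x, idx, D_train, strategy, rng=None):
    """Replace the time steps in `idx` of series x with one of the four noise types."""
    rng = rng if rng is not None else np.random.default_rng(0)
    x = x.copy()
    mu_t = D_train.mean(axis=0)           # per-time-step mean (local profile)
    sd_t = D_train.std(axis=0, ddof=1)    # per-time-step standard deviation
    mu_g = D_train.mean()                 # dataset-wide mean (global profile)
    sd_g = D_train.std(ddof=1)            # dataset-wide standard deviation
    if strategy == "local_mean":
        x[idx] = mu_t[idx]
    elif strategy == "local_gaussian":
        x[idx] = rng.normal(mu_t[idx], sd_t[idx])
    elif strategy == "global_mean":
        x[idx] = mu_g
    elif strategy == "global_gaussian":
        x[idx] = rng.normal(mu_g, sd_g, size=len(idx))
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return x
```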

Fig. 4

Time Series Data Perturbation strategy: an example time series with a known saliency map (left) is perturbed with mean or Gaussian noise computed from local time steps (local) or from all time steps across the entire dataset (global), applied to its most discriminative region (in this example we perturb the top 20% of values according to the highest saliency weights)

4.4 The AMEE framework for evaluating explanations

Figure 5 summarizes the components and steps in the AMEE framework. Our framework requires a labeled time series dataset (D), a set of explanations (M) to evaluate, and a set of referee classifiers (R) to be trained on a subset of the dataset. With these elements, the following steps are done to record the necessary information to calculate evaluation metrics:

  0. Split the labeled dataset D into training (\(D_{train}\)) and test (\(D_{test}\)) sets;

  1. Train the Referee Classifiers (R) with \(D_{train}\);

  2. Use each explanation in M to create a step-wise, explanation-based perturbation of \(D_{test}\);

  3. Measure the accuracy of each trained referee in R on these perturbed datasets \(D'_{test}\).

The output of this process is the accuracy on the perturbed dataset \(D'_{test}\) at various thresholds (k), serving as an indicator of how much an explanation-based perturbation impacts the referees. A significant drop in accuracy in the first few steps of the explanation-guided perturbation (e.g., at \(k=10\) or \(k=20\)) signals that meaningful, salient data points have been disturbed based on the explanation. Hence, explanations that correctly identify such salient regions are likely to be informative.
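
A minimal sketch of this loop for a single referee, a single explanation method and a single perturbation strategy is given below, reusing the helpers sketched in Sects. 4.1 and 4.3; all helper names are ours.

```python
import numpy as np

def perturbation_accuracies(referee, D_test, y_test, saliency_maps, D_train,
                            strategy, ks=range(0, 101, 10)):
    """Referee accuracy on D_test perturbed at each threshold k in `ks`."""
    accs = []
    for k in ks:
        # Perturb each test series on its own top-k% salient time steps.
        D_pert = np.array([
            perturb(x, top_k_timesteps(m, k), D_train, strategy)
            for x, m in zip(D_test, saliency_maps)
        ])
        accs.append(float((referee.predict(D_pert) == y_test).mean()))
    return np.array(accs)
```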

Fig. 5

The AMEE evaluation framework requires 3 elements: (a) a dataset that requires explanation evaluation, (b) a set of saliency-based explanations, and (c) a set of referee classifiers trained on a subset of (a)

4.5 Explanation AUC

We measure the impact of each explanation by estimating the Area Under the Curve (AUC) of its explanation-guided perturbation. Specifically, the accuracy scores at each threshold (k) are translated into an Explanation-AUC (EAUC) using the trapezoidal rule.

$$\begin{aligned} EAUC = \frac{1}{2} \Delta k_0 \sum _{i=1}^q (acc_{i-1} + acc_{i} ) \end{aligned}$$
(3)

Here \(\Delta k_0\) denotes the step size normalized to the 0–1 range (\(\Delta k_0 = \frac{1}{100} \Delta k\)); q denotes the number of steps (\(q = \frac{100}{\Delta k}\)); \(acc_i\) is the accuracy at step i. If we perturb the dataset in q steps, we have a total of \(q+1\) accuracy scores. For example, if the perturbation is done in \(q=10\) steps, each step corresponds to a difference of \(\Delta k = 10\) percentage points in the perturbation threshold (i.e., 0%, 10%, ..., 100%). The step for \(k=0\) corresponds to the original test dataset, while the step for \(k=100\) corresponds to adding noise to the entire time series.

With this estimation, a smaller EAUC means higher impact (accuracy loss) of the explanation method (Fig. 6). The Explanation AUC is computed for each combination of Perturbation–Referee–Explanation (Fig. 7: Step 1).
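
Given the accuracy curve computed above, Eq. (3) reduces to a trapezoidal integration over the normalized thresholds. The sketch below assumes equally spaced thresholds; the function name is ours.

```python
import numpy as np

def explanation_auc(accs, ks=range(0, 101, 10)):
    """Explanation AUC (Eq. 3): trapezoidal area under the accuracy-vs-threshold curve."""
    k_norm = np.asarray(list(ks)) / 100.0   # thresholds rescaled to the [0, 1] range
    return float(np.trapz(accs, k_norm))    # lower EAUC = more impactful explanation
```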

Fig. 6

Change in accuracy measured by a referee classifier for two explanation methods (red and blue) at each threshold level k. When a signal is perturbed based on a more informative explanation, it becomes harder for the referee to classify correctly, leading to a more severe drop in accuracy. This impact is measured by the Explanation AUC, the area under the curve (AUC) of these changes in accuracy at different thresholds k. The curve with the lower Explanation AUC (red curve) results from perturbation guided by a more informative explanation method (Color figure online)

4.6 Robustness of the AMEE framework

Two key aspects of AMEE are aimed to make the framework more robust by employing multiple Data Perturbation strategies and multiple Referee Classifiers. Specifically, for each explanation in M, we use different data perturbation strategies (as described in Sect. 4.3) to create explanation-based perturbations on the dataset \(D_{test}\). Additionally, multiple referee classifiers are trained on \(D_{train}\) and their accuracy is measured on the perturbed \(D'_{test}\).

The various data perturbation strategies represent different ways that salient parts of the data can be replaced with noise. Unlike image data, which is standardized in RGB, time series data is more heterogeneous: it can belong to many different domains, be collected from various sources, or be preprocessed in different ways. These characteristics of time series data make it harder to use one single method to mask out a specific part of the data. Using a variety of data perturbations ensures that data is perturbed in ways that completely mask out the relevant parts of the signal. We further investigate using multiple Data Perturbation strategies in Sect. 5.3.2.

Referee classifiers, pre-trained on the original, non-perturbed data, are used to evaluate the impact of the data perturbation. Thus, the evaluation by referee classifiers depends on the properties of these classifiers, such as in-classifier data normalization, feature extraction, and feature processing. Having multiple referee classifiers can reduce the potential biases introduced by using one single classifier. We analyze this characteristic and show the benefits of using multiple referee classifiers in Sect. 5.3.3.

4.6.1 Standardization and explanation power

AMEE employs multiple perturbation strategies and multiple referee classifiers. As the EAUC measures depend on the choice of referees and perturbation strategies, they are not directly comparable. The next steps (Fig. 7: Step 2–5) standardize and aggregate the EAUC to compute the final output of the framework, the explanation power.

Step 2 rescales the Explanation AUC to the same range [0, 1] for each row (i.e., each pair of Referee and Perturbation). Since each referee responds differently to changes in the perturbed dataset, this normalization is performed for each pair of Referee and Perturbation to ensure that the Explanation AUC is comparable across the different explanation methods in the evaluation. The red highlighted row is an example. After rescaling the Explanation AUC, the Average Scaled EAUC is computed in Step 3; it is simply the average of each column in Step 2. This average can be computed because the individual Explanation AUCs are already normalized to the [0, 1] range and comparable across each Referee-Perturbation pair. For example, the Average Scaled EAUC of Rocket-SHAP is \((0.43 + 0.26 + 0.69 + 0.42 + 0.33 + 0.67)/6 = 0.47\).

In Step 4, the Average Scaled EAUC is again rescaled to the range between 0 and 1. The result is the Average Scaled Rank (lower is better). The explanation power is simply the complement of the Average Scaled Rank (\(1-\) Average Scaled Rank), i.e., higher is better. Details of this calculation are summarised in Algorithm 1.
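
The aggregation can be sketched in a few lines of numpy; `eauc` is assumed to be a matrix with one row per (Referee, Perturbation) pair and one column per explanation method. This mirrors Algorithm 1 but is our own illustrative code, not the reference implementation.

```python
import numpy as np

def explanation_power(eauc):
    """Explanation power per explainer from a (referee x perturbation, explainer) EAUC matrix."""
    # Step 2: min-max rescale each row so EAUCs are comparable across Referee-Perturbation pairs.
    lo = eauc.min(axis=1, keepdims=True)
    hi = eauc.max(axis=1, keepdims=True)
    scaled = (eauc - lo) / (hi - lo)
    # Step 3: Average Scaled EAUC = column mean over all Referee-Perturbation pairs.
    avg = scaled.mean(axis=0)
    # Step 4: rescale the averages to [0, 1], giving the Average Scaled Rank (lower is better).
    rank = (avg - avg.min()) / (avg.max() - avg.min())
    # Step 5: explanation power = 1 - Average Scaled Rank (higher is better).
    return 1.0 - rank
```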

Fig. 7

Measure standardisation and explanation power calculation. Example of how explanation power is derived in a typical evaluation assessment, involving 2 perturbation strategies (local, global) and 3 referees (MrSEQL, k-NN, ROCKET)

Algorithm 1

AMEE: Calculate Explanation Power

5 Experiments

In this section, we evaluate the performance of the AMEE framework in three groups of experiments in ascending order of difficulty. In the simplest case, we want to validate AMEE with synthetic datasets with known explanation ground-truth (Ismail et al. 2020). Next, we measure the performance of the framework with a diverse set of time series classification datasets from the UCR Time Series Classification Archive covering popular domains that require explanation (Dau et al. 2018). Finally, we test our framework on a real dataset and compare the result with ground-truth explanations provided by domain experts. Our experiments are repeated 5 times and the reported results are the average of these repetitions.

5.1 Referee classifiers

We employ 5 candidate referee classifiers in our experiments, selected based on their accuracy, speed and diversity of approach (Schäfer and Leser 2023): baseline 1NN-DTW (distance-based) (Cover and Hart 1967), MrSEQL (dictionary-based, time domain) (Le Nguyen et al. 2019), ROCKET (convolution-based) (Dempster et al. 2020), RESNET (deep learning) (He et al. 2016; Ismail Fawaz et al. 2019b) and WEASEL 2.0 (dictionary-based, frequency domain) (Schäfer and Leser 2023). As the choice of referees is a critical component of our framework, we carefully select classifiers that achieve high accuracy on all studied datasets. For a classifier to be selected into the referee committee, it has to achieve at least the average accuracy of all candidate referee classifiers, and this number has to be higher than the theoretical accuracy achieved by a random classifier. In case the average accuracy is over 90%, the threshold to choose referees is set to 90%. By capping the threshold when the average accuracy is relatively high, we include referees that perform well but fall slightly below the average accuracy. For example, in a theoretical case where the accuracies of the 5 candidate classifiers are 0.90, 0.95, 0.97, 0.98 and 0.99, we want to include all of them as referees, since they all perform well even though one is slightly below the average accuracy. For some datasets, all the classifiers are tied or very close in accuracy. Details of the referee accuracy are presented in the Appendix.
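
The selection rule described above can be summarised in a short helper; this is an illustrative sketch and the function name and interface are ours.

```python
import numpy as np

def select_referees(candidate_accuracies, n_classes):
    """Keep candidates whose test accuracy reaches the selection threshold.

    The threshold is the mean candidate accuracy, capped at 0.90 when that mean
    exceeds 0.90, and it must also beat the accuracy of random guessing.
    """
    accs = np.array(list(candidate_accuracies.values()))
    threshold = min(accs.mean(), 0.90)
    threshold = max(threshold, 1.0 / n_classes)   # must beat a random classifier
    return [name for name, acc in candidate_accuracies.items() if acc >= threshold]
```

For instance, with hypothetical accuracies `{'ROCKET': 0.98, '1NN-DTW': 0.85}` on a 2-class dataset, the mean (0.915) is capped at 0.90, so only ROCKET would be retained as a referee.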

5.2 Explanation methods

In our experiments, we evaluate 8 popular explanation methods with diverse properties as described in Table 2.

We use the authors' implementations for LIME (Ribeiro et al. 2016) and MrSEQL (Le Nguyen et al. 2019), the captum (Kokhlikyan et al. 2020) library for gradient-based explainers, the time-explain library (Mujkanovic et al. 2020) for SHAP, and sklearn (Buitinck et al. 2013) for the remaining classifiers and explainers. We considered a few other recent explainers, e.g., LIMESegment (Sivill and Flach 2022a), but they proved too slow to run on all our datasets. Since our goal is to rank a set of given explainers, rather than to promote any particular explainer, we consider this explainer set sufficient to validate our methodology.

Table 2 Summary of properties of explanation methods

5.3 Evaluation for synthetic data with known ground truth

5.3.1 Data

We work with 10 synthetic univariate time series classification datasets, selected by taking the mid-channel from the time series benchmark generated by Ismail et al. (2020). The datasets are created using five processes: (a) a standard continuous autoregressive time series with Gaussian noise (CAR), (b) sequences of standard non-linear autoregressive moving average (NARMA) time series with Gaussian noise, (c) non-uniform sampling from a harmonic function (Harmonic), (d) non-uniform sampling from a pseudo periodic function with Gaussian noise (Pseudo Periodic), and (e) a Gaussian process with zero mean and unit variance (Gaussian Process). The important areas, either a Small Middle part (30% of the time series length) or a very small part, Rare Time (10% of the time series length), are created by adding or subtracting a constant \(\mu\) (\(\mu = 1\)) for the positive and negative class. The number of time steps is T = 50. Each dataset comprises 500 training samples and 100 test samples. Figure 8 visualizes the two classes in the 10 datasets.

Fig. 8

Visualization of the synthetic datasets. The columns correspond to the process used to create the dataset, while the rows specify the salient areas

Before presenting the experiment result of our evaluation on the synthetic datasets, we discuss the effect of the Data Perturbation strategy (Sect. 5.3.2), investigate the impact of Referees (Sect. 5.3.3), and perform a sanity check for the classifier quality used for model-agnostic post-hoc explanation methods such as LIME and SHAP (Sect. 5.3.4).

5.3.2 Impact of data perturbation strategy

Figure 9 shows the boxplots of explanation power for different data perturbation strategies. In datasets which are "easier" to classify (i.e., most classifiers get close to 100% accuracy), such as CAR and NARMA, the explanation power does not change with the perturbation strategy. On the other hand, we observe a larger change in explanation power when data is harder to classify (for example, in the GaussianProcess datasets). We additionally present plots showing changes of explanation power in two extreme cases, the SmallMiddle_CAR dataset (easy-to-classify) and the RareTime_GaussianProcess dataset (hard-to-classify), when different perturbations are gradually introduced (Fig. 10). Notably, for the harder dataset RareTime_GaussianProcess, having more perturbation methods brings the evaluation results closer to the ground truth. Specifically, for the Oracle explanation, if only a single perturbation method was used, such as Local Mean or Local Gaussian, the evaluation result would rank the Oracle explanation as the 6th best explanation method. However, when more perturbations are introduced, the Oracle explanation is evaluated more robustly, placing this method at the top and better aligned with the ground truth. For the Oracle (upper bound) and Random (lower bound) explanations, we generally observe that these are the most and least informative methods, respectively.

Fig. 9

Impact of data perturbation strategy on explanation power for each explanation method. A smaller box-range (which comes from 4 perturbation methods) indicates a smaller change of the Explanation Power with different perturbation strategies. For datasets that are “easier” to classify by referees, this range is often smaller than that of “harder”-to-classify datasets

Fig. 10

Changes of explanation power when different Perturbations are sequentially introduced. The sequence of perturbations is: Local Mean, Local Gaussian, Global Mean, and Global Gaussian (less to more extreme perturbation). The two example datasets are SmallMiddle_CAR (easy-to-classify dataset) and RareTime_GaussianProcess (hard-to-classify dataset). For the harder dataset RareTime_GaussianProcess, the relative position of the Explanation Methods changes, indicating that having multiple types of perturbation methods is helpful when the dataset is hard to classify. Specifically, for the Oracle explanation, if only a single perturbation method was used, such as Local Mean or Local Gaussian, the evaluation result would rank the Oracle explanation as the 6th best explanation method. However, when more perturbations are introduced, the Oracle explanation is evaluated more robustly, placing this method at the top and closer to the ground truth

5.3.3 Impact of referee classifiers

Similar to the previous investigation of the impact of the perturbation strategy, we now inspect how the explanation power changes with respect to the set of referees, and present the results in Fig. 11. Here, we also notice a relatively consistent explanation power among different referee classifiers in datasets that are easier to classify (such as the CAR and NARMA datasets). In datasets that are harder to classify (for example, the Gaussian Process datasets), we observe a wider range of explanation power across referee classifiers.

Fig. 11

Impact of referees on explanation power for each explanation method. A smaller box-range (which comes from 5 referee classifiers) signals a smaller change of explanation power across referees. In addition, for a specific dataset, the relative position of this range indicates how strongly the referees disagree in their votes. Nevertheless, having a committee of referees that are highly accurate is generally desirable

Fig. 12

Changes of explanation power when different Referees are sequentially introduced. The two example datasets are SmallMiddle_CAR (easy-to-classify dataset) and RareTime_GaussianProcess (hard-to-classify dataset). The referees are added in order from lowest to highest accuracy, filtered to represent the most reliable classifiers among the ones used for evaluation. Details of this sequence are in Section 8. For the harder dataset RareTime_GaussianProcess, the relative position of the Explanation Methods changes, indicating that having a set of referees is helpful and leads to more stable results. Specifically, for the Oracle explanation, if only a single referee is employed, the evaluation result could rank the Oracle explanation as the 2nd best explanation method. However, when more referees are introduced, the Oracle explanation is evaluated more robustly, placing this method at the top and closer to the ground truth

Random and Oracle explanations both achieve explanation power values in the expected range for the evaluated datasets. We present the change of explanation power when different referees are sequentially introduced for the two extreme cases, the SmallMiddle_CAR dataset (easy-to-classify) and the RareTime_GaussianProcess dataset (hard-to-classify) (Fig. 12). We observe that for the RareTime_GaussianProcess dataset, which is hard to classify, having a committee of highly accurate referees is desirable: it helps reduce the potential bias of a single referee and leads to a more stable evaluation. Specifically, for the Oracle explanation, if only a single referee was employed, the evaluation result would have ranked the Oracle explanation as the 2nd best explanation method. However, when more referees are introduced, the Oracle explanation is evaluated more robustly, placing this method at the top and closer to the ground truth.

Similarly, we note that for some real datasets, several referee classifiers that are highly accurate can disagree in their evaluation ranking. In such cases, having multiple referees leads to a considerably more robust and reliable result. We show an example using a real dataset with domain expert ground truth (the Counter Movement Jump dataset) in a later section (Sect. 5.5.3).

5.3.4 Sanity check for the impact of the base classifier quality

Model-agnostic post-hoc methods such as LIME and SHAP derive explanations based on a classifier of any type. Thus, these explanations depend on the performance of the base classifier. For example, the ROCKET-SHAP explanation is created by applying SHAP (explanation method) to ROCKET (base classifier). If the base classifier has low accuracy on the sample dataset, the explanation based on that classifier may not be as good as one based on a more accurate classifier. In our experiment, we get LIME and SHAP explanations from two sources: the MrSEQL classifier (Le Nguyen et al. 2019) and the ROCKET classifier (Dempster et al. 2020). We observe that ROCKET achieves higher accuracy than MrSEQL on the datasets created from the Pseudo Periodic, Harmonic, and Gaussian Process functions (Table 10 in the Appendix). We compare the two pairs of explanations (MrSEQL-based and ROCKET-based) from LIME and SHAP as a sanity check. Our experiment confirms that in both cases, under the AMEE evaluation approach, ROCKET-LIME and ROCKET-SHAP are considered better explanation methods than MrSEQL-LIME and MrSEQL-SHAP, respectively (Fig. 13). This sanity check confirms our intuition that the quality of the base classifier is an important factor in model-agnostic, post-hoc explanation methods such as LIME and SHAP.

Fig. 13

Sanity Check of Base Model Quality with LIME and SHAP. Higher accuracy of the base classifier (ROCKET > MrSEQL) leads to higher explanation power for the corresponding explanations (ROCKET-LIME and ROCKET-SHAP have higher explanation power than MrSEQL-LIME and MrSEQL-SHAP)

5.3.5 Results

Using a committee of 5 referee classifiers and 4 data perturbation strategies, we evaluate 10 explanation methods (8 computed explainers plus the lower bound explanation (Random) and upper bound explanation (Oracle)) using AMEE. The resulting explanation power is presented in Table 3.

Using a threshold of 0.5 on the min-max normalized saliency score in [0, 1], we determine the ground truth of whether a time point is salient. We compare the explanation methods with the ground truth explanation for each time point and calculate the F1-score (Table 4) to measure how good each method is at determining the saliency of a time point. For example, in the SmallMiddle_CAR dataset, AMEE selects the best explanation method to be Oracle with an explanation power of 1.00 (Table 3), and the second best explanation is the Saliency Map from RidgeCV (explanation power of 0.79, Table 3). This result is similar to the F1-score of these explanations using the ground truth (Table 4): the Oracle explanation achieves an F1-score of 1.00 (highest), followed by the RidgeCV saliency map with an F1-score of 0.86 (second-best). Similarly, we find that the ranks of the methods in Tables 3 and 4 are in high agreement. Moreover, even for the hardest dataset to classify (RareTime_GaussianProcess), we observe that adding more referees brings the relative ranking of the evaluated explanations closer to the ground truth ranking (Table 5). This result further reinforces that using multiple referees is desirable: we observe that a highly accurate set of referees brings the explanation ranking closer to the ground truth. Overall, our results show a high agreement between AMEE's computed explanation power and the F1-score based on ground-truth time-point importance, and confirm that the committee of referees is a desirable property of the explainer recommendation framework.
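
The per-time-point comparison with the ground truth can be sketched as follows, assuming saliency maps that are already min-max normalized to [0, 1]; the helper name is ours.

```python
import numpy as np
from sklearn.metrics import f1_score

def saliency_f1(saliency, ground_truth, threshold=0.5):
    """F1-score of a saliency map against binary per-time-point ground truth."""
    pred = (np.asarray(saliency) >= threshold).astype(int)   # predicted salient points
    return f1_score(np.asarray(ground_truth).astype(int), pred)
```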

Table 3 Synthetic datasets: explanation power for each of the 10 explanation methods evaluated
Table 4 Synthetic datasets: F1-score of explanation methods using explanation ground truth
Table 5 Dataset RareTime_GaussianProcess: explanation power rank when referees are added sequentially for the Top 5 explanation methods

5.3.6 Comparison with previous work

Previous work (Nguyen et al. 2020) presented initial results on comparing explanation methods. However, that method uses only one type of perturbation, which replaces salient areas with Gaussian noise of low magnitude. While the magnitude of the Gaussian noise can be customized, determining this parameter requires extra work from users. That initial work also does not propose a way to standardise the explanation AUC across methods and datasets. Our new framework employs a combination of perturbation types that has a higher impact in changing the original signals, resulting in a more robust framework and better results. We include the results of the framework (using the default settings) introduced in Nguyen et al. (2020) in Table 6.

This table shows that for many of the synthetic datasets, the perturbation added to the signal under the default setting is too small and fails to trigger changes in classification accuracy. As a result, the previous approach is unable to distinguish differences in the informativeness of the explanation methods.

Table 6 Synthetic datasets: using default perturbation settings with previous work in Nguyen et al. (2020) to get explanation power for each of the 10 explanation methods evaluated

Even with a much larger noise level (Fig. 14), the previous framework does not provide a result as accurate as AMEE, especially for datasets that are difficult to classify. We include results of the framework introduced in Nguyen et al. (2020) using higher noise magnitude in Table 7.

Fig. 14

Sample time series from the SmallMiddle_CAR dataset perturbed by (a) the Gaussian noise addition proposed in Nguyen et al. (2020) at very high magnitude (left) and (b) Global Gaussian noise, a non-parametric perturbation (right)

Table 7 Synthetic datasets: using higher perturbation magnitude with previous work in Nguyen et al. (2020) to get explanation power for each of the 10 explanation methods evaluated

5.4 Evaluation for real time series data

5.4.1 Data

We work with 15 datasets from the UCR Archive (Dau et al. 2018) that represent a variety of data sources and domains. These datasets are of 5 types: electrocardiogram (ECG), human motion (MOTION), device usage (DEVICE), device activities tracked by sensors (SENSOR) and spectroscopy (SPECTRO). Oracle explanation is not available for these datasets.

5.4.2 Results

We test explanations for these datasets with AMEE and report the results in Table 8. Since we do not have ground truth for the majority of these datasets, we use this experiment to show how AMEE can be applied to real datasets. We note that the Random explanation sometimes outperforms a method-based explanation. This can happen as some explanation methods may not work well with certain datasets, resulting in unreasonable explanations that misleadingly highlight non-discriminative parts as discriminative, or fail to identify any significant discriminative parts at all. In this situation, the evaluation of random explanations can serve as a filter for reasonable explanation methods, and any method that performs worse than random should be filtered out.

Table 8 Explanation power on UCR datasets

5.5 Evaluation for real dataset with expert ground truth

The Oracle explanation is the upper bound for any explanation method; however, it is only available for synthetic datasets. For real datasets, the explanation ground truth is often available only at an approximate level of precision, e.g., specifying the relative position of the shape and areas of importance. Such approximate ground truth is widely used in other papers that evaluate explanation methods for images (Kim et al. 2018; Zhou et al. 2016; Selvaraju et al. 2017); however, it is not readily available for time series data without opinions from domain experts. Among the datasets evaluated in Sect. 5.4, Coffee (Briandet et al. 1996), Counter Movement Jump (CMJ) (Le Nguyen et al. 2019), and GunPoint (Ratanamahatana and Keogh 2004) come with this information about the true important areas for each class. In this section, we compare the saliency-based explanations evaluated by AMEE with the expert ground truth of important areas.

5.5.1 Spectroscopy dataset: coffee

The Coffee dataset contains spectroscopy samples of two types of coffee: Arabica and Robusta. It was first introduced in Briandet et al. (1996) and is part of the UCR time series archive (Dau et al. 2018). Figure 16 shows the top 3 and bottom 2 explanation methods ranked by AMEE. Notably, the discriminating region of the two classes of Coffee produced by the best explainer, ROCKET-SHAP, is the last peak region of the time series. This region was confirmed in the original paper (Briandet et al. 1996) to contain information about the chlorogenic acid content of the sample, which contributes to the difference between the two types of coffee (Fig. 15). Arabica has a lower caffeine and chlorogenic acid content, which contributes to its finer taste and greater market value. The region that MrSEQL-SHAP and MrSEQL-SM also highlight is another part of the spectrum that contains information about the chlorogenic acid content (Briandet et al. 1996). The worst explanation methods among those evaluated, GradientSHAP and the random explanation, show either very small, non-contiguous, or randomly scattered regions of interest and do not focus on parts of the time series that discriminate the two coffee types.

Fig. 15

Coffee dataset: ground truth from the Coffee dataset (Briandet et al. 1996). According to the original paper (Briandet et al. 1996), the chlorogenic acid content (region of approximately time steps 150–240, marked in red) is the major region that contributes to the difference between the two coffee types. The caffeine content (region of approximately time steps 40–75, marked in orange) is also discriminative, but to a lesser extent (Color figure online)

Fig. 16

Coffee dataset: visualization of the top 3 best explanation methods (top 3 rows) and the worst explanation methods (bottom 2 rows) ranked by AMEE. It can be observed that the top explanation methods are all able to point out the discriminative regions confirmed by the domain expert, as represented in Fig. 15

5.5.2 Video motion retrieval dataset: GunPoint

The famous GunPoint dataset is the time-series translation from a video sequence involving actors performing two distinct actions: pointing to a target with a gun (Gun class) and pointing with their index fingers only (Point class). This dataset was introduced in Ratanamahatana and Keogh (2004) and is part of the UCR time series archive (Dau et al. 2018). Figure 17 visualizes the examples of explanations from the best three methods, worst method, and random method for this dataset. The expert ground truth for the GunPoint dataset conveys that the two classes differ in the steps where the Gun class requires the actor/actress to lift their hand above a holster, then reach down for the gun. This distinct action creates a subtle difference in the time steps right before the action of hands moving to the shoulder level (the sharp increase in time series values) to pointing the gun or hand (the plateau in the middle of the time series). The detailed description can be found in Ratanamahatana and Keogh (2004). In this dataset, AMEE identifies explanations from the IntegratedGradient method as the most computationally informative explanation, followed by MrSEQL-SHAP and MrSEQL-SM. The least informative method is ROCKET-LIME, which is even less informative than a random saliency explanation as this method refers to the wrong area of importance, failing to point out any of the salient regions of the time series.

Fig. 17

GunPoint dataset: visualization of the top 3 best explanation methods (top 3 rows) and the worst explanation methods (bottom 2 rows) identified by AMEE. According to the description of the dataset in its original paper (Ratanamahatana and Keogh 2004), the discriminative region lies right before the high plateau of the two classes. For the Gun class, this region reflects the actors' hands moving above the holsters and down to grasp the gun. For the Point class, there is no such action, resulting in a smoother curve from rest to point motion

5.5.3 Motion sensing dataset: counter movement jump (CMJ)

The CMJ dataset records the counter movement jumps of participants of 3 classes: Normal (jump done correctly), Bend (jump with knee bend), and Stumble (stumble at landing) (Fig. 18).

Fig. 18

Examples of 3 classes of the Counter Movement Jump (CMJ) dataset

Fig. 19

Counter Movement Jump (CMJ) dataset: visualization of the best explanation methods (top 3 rows), the worst explanation method and the random explanation (bottom 2 rows) identified by AMEE. Visualizations of the other explanations are presented in Section 8

According to the domain experts who recorded this data (Le Nguyen et al. 2019), the critical area for the first two classes (NORMAL and BEND) is the middle part, while that of the final class (STUMBLE) is at the end of the time series. In class NORMAL, this region is completely flat. The same region in class BEND is characterized by a hump, which appears when participants' knees are in a bending posture. In the STUMBLE class, the end of the time series differs from the previous two classes because of its very high, sharp peak caused by a wrong landing position.

The result of AMEE for all studied explanation methods is also given in Table 8. The top 3 rows of Fig. 19 show that the top 3 explanations for this dataset are MrSEQL-SHAP (SHAP explanation based on the MrSEQL classifier), MrSEQL-LIME (LIME explanation based on the MrSEQL classifier), and MrSEQL-SM (saliency map obtained directly from the MrSEQL classifier). We see a high agreement between these explainers as they all correctly highlight the discriminative areas provided by the expert (Fig. 19). In addition, methods that AMEE identifies as unreliable are also shown to highlight incorrect regions that do not agree with the opinion of the domain expert (e.g., the explanation provided by the Integrated Gradient method).

5.5.4 Impact of using multiple referees

The Counter Movement Jump (CMJ) dataset is an example of a real dataset with known domain expert ground truth (Le Nguyen et al. 2019). In our experiment, all of the referee classifiers achieve very high performance, ranging from 0.92 to 0.97 accuracy (Table 11). Hence, this dataset presents an opportunity to investigate the benefit of using a committee of referee classifiers in comparison with using a single referee classifier. Figure 20 shows the explanation power using two approaches: (a) using an ensemble of referees and (b) using a single referee independently. Consider the case of the Random explanation (displayed in blue), which should clearly be worse than the MrSEQL-based explanations (as shown in Sect. 5.5.3). It is interesting that one of the referee classifiers that is quite accurate (Resnet with 0.92 accuracy) ranks the Random explanation as the best, with a significant difference in explanation power to the others. If only a single referee was employed here, the recommendation could select the Random explanation. However, this risk decreases when an ensemble of referees is used. From this real example we observe that the benefit of using multiple referees is to improve the confidence and reliability of the evaluation, reducing the risk that a single referee is wrong by aggregating evaluations from multiple referees.

Fig. 20

CMJ dataset: explanation power in (a) Ensemble Referee Mode versus (b) Single Referee Mode. The sequence of referees in both figures is (1) RESNET, (2) K-NN, (3) MrSEQL Classifier, (4) ROCKET, and (5) WEASEL 2.0. In (a), explanation power is calculated in an ensemble approach with sequential addition of the referee classifiers. For example, the explanation power (x-axis) of the Random method (displayed in blue) when 2 classifiers are used (y-axis) is aggregated from both (1) RESNET and (2) K-NN. In (b), explanation power is calculated using the evaluation from a single referee, without any aggregation over other referees. For example, the explanation power (x-axis) of the Random explanation (displayed in blue) at the second position (y-axis) is obtained using only the second classifier (K-NN). Dashed lines are used in (b) to indicate that there is no connection between the explanation power values obtained from different referee classifiers (Color figure online)

5.6 Discussion

Our study, carried out over both synthetic and UCR datasets, shows that AMEE can be used to computationally evaluate and rank different explanation methods. We recommend using AMEE with full knowledge of the essential elements of the method. First, referees should be selected carefully, using classifiers of acceptable accuracy as determined by the application requirements. Using a committee of multiple accurate referee classifiers is recommended to reduce possible biases that a single referee could introduce, and results in a more reliable evaluation. Second, having a variety of data perturbation methods is helpful, especially for hard-to-classify datasets. In addition, adding a random explanation while carrying out the evaluation with AMEE helps identify unreliable explanations. A worse-than-random explanation fails to trigger a larger change in the referee classifiers than even a random explanation, either because it does not identify the important areas or because it does not focus on any important areas at all. Finally, we recommend applying SHAP-based methods to accurate base classifiers for testing and further evaluation, as our experiments show that SHAP-based explanations often outperform other explanations using the same base classifiers.

6 Recommendations for practitioners

In this section, we present our recommendations for using the AMEE framework to evaluate and recommend explanation methods. These are some of the lessons learned during the process of developing, designing, and conducting the experiments in this paper.

  • Time Series Classifiers One of the key elements of our evaluation framework is the set of referee classifiers. The more accurate the referees, the more reliable the result we can potentially expect. Hence, choosing the right set of referees is a very important step before we even start to evaluate the explanation methods. We recommend using state-of-the-art time series classifiers that are well studied and compared in the latest empirical benchmarks (Middlehurst et al. 2023). When selecting referees, we recommend choosing classifiers which are both accurate and computationally efficient, since AMEE requires repeated inference on perturbed versions of the original dataset.

  • Explanation Methods Unless the application users already have their preferred explanations pre-computed and only require AMEE for evaluation, we recommend selecting explainers based on the extensive survey in Theissler et al. (2022). We recommend using diverse explanation methods, covering both intrinsic and post-hoc explanations. From a computational perspective, we recommend considering the cost of obtaining explanations. In our experience, explainers that use data segments (chunking) to explain time series seem useful, but some of them are not efficient (for example, LIMESegment (Sivill and Flach 2022b)). Additionally, we strongly recommend adding SHAP-based explanations to the list of explainers, as we observed these are highly informative on many of the datasets we have tested. Finally, we recommend adding a random explanation to the evaluation (in addition to method-based explanations), as a simple sanity check.

  • Perturbation Strategy The perturbation strategy plays a critical role both in obtaining an explainer and in our recommendation framework. An effective perturbation strategy is one that, when used to perturb the informative parts of the time series, leads to a change in prediction. In our experience, this effectiveness strongly depends on the specific dataset and classifiers, so choosing the right perturbation can be tricky and time-consuming. Therefore, we recommend using multiple perturbation strategies in our framework.

  • Datasets and Optimal number of Referees and Perturbation Strategies Our experiments cover a wide range of datasets of different classification difficulty levels. We observe that when the dataset is easy to classify (e.g., many classification algorithms can achieve high accuracy), a lower number of referees and perturbation strategies can generally be used without affecting the evaluation results. However, when the dataset is harder to classify, using more referees and more perturbation methods is recommended to get a more reliable and less biased result.

  • Adaptability We present AMEE as a robust explainer recommendation system for the Time Series Classification task. However, the framework is adaptable and could be generally applied to other types of data (such as images and text). For adapting to other data types, practitioners can consider more suitable perturbation methods and referee classifiers that work well with the target data.

7 Conclusion

In this work we proposed AMEE, a Model-Agnostic Explanation Evaluation framework, for computationally assessing and ranking explanation methods for the time series classification task. We test the framework on 25 synthetic and UCR archive datasets to obtain explanation evaluations for a wide variety of common explanation methods for time series, covering different aspects of explanation including type, scope and model dependency. Our experiments show a high agreement of the explanation power (measured by AMEE) with the Oracle explanation (ground truth for each time point) on the synthetic datasets and with the Expert explanation (ground truth provided by a domain expert) on a real dataset. We also find that perturbation-based explainers based on SHAP generally perform better than gradient-based explainers for time series classification (given similar performance of the base models), but are computationally expensive. AMEE can be used to select appropriate explanation methods for application users. It could also potentially pinpoint inherent problems, such as bias, that may exist in the training data and subsequently enhance the trustworthiness of AI systems in critical tasks. This framework further empowers machine learning to discover new knowledge from the data. Another potential application is to use the best explainer recommended by AMEE as a proxy for downstream tasks to identify opportunities to compress data and optimize data storage, transmission, and analysis. Finally, since AMEE relies on the response to perturbation to evaluate explanation methods, it can potentially be adapted to other types of data (such as images) and other machine learning tasks (such as time series regression and clustering). Future work includes devising a robust, AMEE-optimized explanation method and working with data experts to evaluate the validity and potential of knowledge discovery with this framework in biomedical and healthcare-related tasks such as genetic data understanding and sports analytics.