1 Introduction

The last decade has witnessed a rapid integration and growing impact of machine learning in everyday life. Machine learning algorithms work well in many applications and grow ever more complex, with models having millions of parameters (Devlin et al. 2018; Brown et al. 2020). Data quality is key, and checking for data leakage or bias is important for building robust models. However, we still lag behind in explaining why these algorithms work so well, why they occasionally fail, and what in the data leads multiple classifiers to predict a certain class (Goodfellow et al. 2015). The evaluation of explanation methods is still an open problem: although many new explanation methods and methodologies exist, it remains difficult to decide which explainer is best for a given problem and dataset.

This mismatch between the growing complexity of many machine learning algorithms and our ability to explain them and their data, including for time series, undermines the application of these technologies in critical, human-related areas such as healthcare, sports and finance (Caruana et al. 2015; Lipton 2018). As time series data is prevalent in these applications (Petitjean et al. 2014; Ramgopal et al. 2014; Avci et al. 2010), Time Series Classification (TSC) algorithms often call for reliable explanations (Bostrom and Yudkowsky 2018; Ismail Fawaz et al. 2019a). Such an explanation is usually presented as feature importance or saliency weights (Adebayo et al. 2018), highlighting the parts of the time series which are informative for the classification decision. Saliency-based explanations have been shown to be useful for finding important motifs in data (Le Nguyen et al. 2019; Ismail Fawaz et al. 2019a), and as a starting point for prioritising features for further investigation in counterfactual explanation methods (Delaney et al. 2021).

Recent efforts in designing intrinsically explainable machine learning algorithms, as well as building post-hoc saliency-based explainers for black-box algorithms, have gained significant attention (Zhou et al. 2016; Selvaraju et al. 2017; Ribeiro et al. 2016; Lundberg et al. 2017). Most works focus on explaining one particular classification algorithm, with a lot of emphasis on deep learning methods (Boniol et al. 2022; Zhou et al. 2016; Selvaraju et al. 2017; Sundararajan et al. 2017; Smilkov et al. 2017; Springenberg et al. 2015). Often we are faced with a set of saliency maps for explanation, coming either from domain experts or from a diverse set of classifiers. In particular, for time series classification, a variety of classifiers are required for high accuracy, depending on the application domain (Bagnall et al. 2016; Middlehurst et al. 2023). Each of these classifiers may be tied in accuracy, can be explained with different methods (e.g., LIME or SHAP), and often the resulting explanations disagree, e.g., pointing to different parts of a time series as most relevant for a predicted class. We thus face the challenge: How to assess and objectively compare many explanation methods? In other words, if two or more explanation techniques give different explanations (i.e., two different saliency maps coming from the same classifier or different classifiers, Fig. 1), which explanation is best for our task? In this paper, we focus on the time series classification task and propose a methodology appropriate for time series. Some of the ideas we investigate are relevant to recommending explainers beyond the TSC task, but this is beyond the scope of this paper.

Fig. 1

Saliency map explanation is a vector of feature importance weights overlaid over the original time series, where each point in the time series is coloured according to its importance. The saliency is obtained by classifying a motion time series using different classifiers and explainers. The most discriminative parts according to the explanation method are colored in deep red, and the non-discriminative parts are colored in deep blue (Color figure online)

We propose a methodology to compute a standardized evaluation measure, which enables quantitative comparison and ranking of explainers (Table 1). From the application users’ perspective, having this recommendation can support short-listing of useful explanations for further analysis and optimisation (Doshi-Velez and Kim 2017). At the very least, we want to know that a given explanation is better (more informative) than a random explanation, and in general we want to be able to select the best explainer for a given dataset.

Table 1 Outcome of AMEE: a measure to evaluate multiple explanation methods

In this paper, we present A Model-agnostic framework for Explanation Evaluation for Time Series Classification (AMEE). Specifically, we focus on explanations in the form of a saliency map and consider their informativeness within a defined computational scope, in which a more informative explanation means a higher capacity to influence classifiers to identify a class. We show that the saliency-guided perturbation of discriminative subsequences results in a reduced accuracy of classifiers. The higher the impact of a perturbation, the more informative are the perturbed time series subsequences. Estimation of this impact, measured by a committee of highly accurate referee classifiers, can reveal the informativeness of the explanation. This is the key idea behind AMEE, a post-hoc approach which uses a set of classifiers and explainers to recommend the best explainer for a given time series classification dataset.

Our work addresses an overlooked area of research: robust comparison and ranking of multiple explanation methods for time series classification. Our main contributions are:

  • A robust, model-agnostic, ensemble-based explanation evaluation framework. First, we leverage the use of multiple data perturbation strategies to create explanation-guided noisy data. Using synthetic data, we empirically show that applying multiple data perturbation strategies is particularly useful when the data is hard-to-classify, as such data is often more sensitive to the data perturbation type. We also show that a committee of referee classifiers is useful to reduce the potential bias that one single referee classifier may have. Our experiments demonstrate that a committee approach involving multiple types of data perturbations and multiple classifiers leads to explanation evaluation and ranking that better agrees with the explanation ground truth (synthetic data) and domain expert ground truth (real data).

  • A standardised evaluation measure (explanation power) that is comparable across different explanation methods, referee classifiers and datasets.

  • An empirical study on both synthetic and real datasets with recent state-of-the-art time series classifiers and explanation methods. We verify the evaluation methodology with annotated, real datasets. All data, code and detailed results are available (Footnote 1).

In the next sections we review related Explainable AI research including both time series specific and general methods (Sect. 2). We then define related concepts (Sect. 3) and describe our proposed solution (Sect. 4). We discuss experiments on both synthetic and real time series datasets, with detailed case studies (Sect. 5). We discuss important considerations for practitioners when using AMEE to evaluate and recommend explanations for time series classification in Sect. 6. Finally, we summarize our results and discuss future work in Sect. 7.

2 Related work

2.1 Explanation methods for time series classification

As deep learning has achieved high performance in machine learning domains such as computer vision (Szegedy et al. 2015; Krizhevsky et al. 2012), the research community started to develop techniques to explain these black-box models to understand why they work so well (Zhou et al. 2016; Selvaraju et al. 2017; Sundararajan et al. 2017; Smilkov et al. 2017; Springenberg et al. 2015). These explanations are in the form of a saliency map, visualizing the important pixels in an image by computing a saliency weight for each pixel (a type of feature importance). This saliency map, combined with the original image, can reveal whether a black-box model focuses on the correct area of the image and explain the model in a visually friendly way.

Explainable AI (XAI) methods for Time Series Classification have advanced in parallel with general XAI progress (Theissler et al. 2022). Although recent works on instance-based methods, such as factual and counterfactual explanations (Guidotti et al. 2020; Delaney et al. 2021; Zhendong et al. 2021) have become popular, the majority of explanation methods exist in the form of saliency maps (Kokhlikyan et al. 2020; Mishra et al. 2017; Parvatharaju et al. 2021; Rooke et al. 2021; Guillemé et al. 2019; Zhou et al. 2016; Smilkov et al. 2017), where a map visualizes the importance weight vector w and highlights the discriminative areas of a time series for the classification task. These saliency-based explanations can either be extracted directly from the classifier (intrinsic explanation), or indirectly by applying a post-hoc explanation method to the black-box classifier (post-hoc explanation).

2.1.1 Intrinsic explanation

Explanation from MrSEQL Time Series Classifier. MrSEQL (Le Nguyen et al. 2019) is a time series classification algorithm that is intrinsically explainable. The algorithm converts the numeric time series vector into strings, e.g., by using the SAX (Lin et al. 2007) transform with varying parameters to create multiple symbolic representations of the time series. The symbolic representations are then used as input for SEQL (Ifrim and Wiuf 2011), a sequence learning algorithm, which selects the most discriminative subsequences for training a classifier with logistic regression. The symbolic features combined with the classifier weights learned by logistic regression make this classification algorithm explainable. For a time series, the explanation weight of each data point is the accumulated weight of the SAX features it maps to. These weights can be mapped back to the original time series to create a saliency map that highlights the time series parts important for the classification decision. We call the saliency map explanation obtained this way MrSEQL-SM. To use the weight vector from MrSEQL-SM, we take the absolute value of the weights to obtain a vector of non-negative weights. Figures 2 and 3 show examples of the saliency map explanation obtained directly from the MrSEQL classifier weights, for the Coffee and GunPoint datasets from the UCR Archive (Dau et al. 2018).

Explanation from a Generic, White-box Classifier. A generic, white-box classifier such as Logistic Regression or Ridge Regression has been the primary source of providing feature importance (by using the learned model weights), especially for tabular data (Hosmer et al. 2013). These classifiers and their explanations are computationally cheap and can be useful for time series data (Frizzarin et al. 2023).

2.1.2 Post-hoc explanation

Gradient-based Explanation. This approach uses the gradients from a trained deep neural network to infer explanations. Notable methods are Integrated Gradient (Sundararajan et al. 2017), GradientSHAP (Lundberg et al. 2018), GradCAM (Selvaraju et al. 2017), CAM (Zhou et al. 2016). For time series classification and explanation, the most common classifier is ResNet (Ismail Fawaz et al. 2019b) combined with some of the explanation methods mentioned.

Perturbation-based Explanation. This type of method infuses noise into the data to create data variations and infer the degree of data point importance (Castro et al. 2009; Abanda et al. 2022). Notable methods are Feature Occlusion (Suresh et al. 2017) and LIME (Ribeiro et al. 2016). One of the most popular post-hoc explanation methods is SHAP (Lundberg et al. 2017), which explains any machine learning model using a game-theoretic approach in which all feature coalitions are evaluated. Feature importance is then calculated using the classic Shapley value (Štrumbelj and Kononenko 2014). Figures 2 and 3 show examples of the saliency map explanation obtained by applying SHAP to the MrSEQL classifier to get a post-hoc explanation for the Coffee and GunPoint datasets from the UCR Archive (Dau et al. 2018).
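
To make this concrete, the following is a minimal sketch of obtaining a post-hoc SHAP saliency map for one test time series. It assumes a fitted scikit-learn-style classifier `clf` exposing `predict_proba` and a training matrix `X_train` of shape (n_series, n_timesteps); it is an illustration, not the exact pipeline used in this paper.

```python
import numpy as np
import shap

def shap_saliency(clf, X_train, x_test, class_idx, n_background=20):
    # A small background sample keeps KernelExplainer tractable.
    background = shap.sample(X_train, n_background)
    explainer = shap.KernelExplainer(clf.predict_proba, background)
    sv = explainer.shap_values(x_test.reshape(1, -1), nsamples=200)
    # Depending on the SHAP version, sv is a list with one array per class
    # or a single 3-D array; the list form is assumed here.
    weights = np.abs(np.asarray(sv[class_idx])[0])
    # Rescale to [0, 1] so maps from different explainers are comparable.
    return (weights - weights.min()) / (weights.max() - weights.min() + 1e-12)
```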

Fig. 2

Saliency map from two explanation methods on two examples from the Coffee dataset: the bottom row is an explanation from MrSEQL Classifier (intrinsic explanation); the top row is an explanation from SHAP, a post-hoc explanation method based on MrSEQL Classifier

Fig. 3

Saliency map from two explanation methods on two examples of the GunPoint dataset: the bottom row is the explanation from the MrSEQL Classifier (intrinsic explanation); the top row is the explanation from SHAP, a post-hoc explanation method based on the MrSEQL Classifier

2.2 Quantitative evaluation of saliency-based explanation

Quantitative evaluation of explanations for time series data was a relatively untouched topic until recently. Unlike image and text, time series data often lack annotated ground-truth explanations; hence, it remains a challenge to determine whether a saliency-based explanation is correct. Approaches to benchmark and evaluate the faithfulness of recent explanation methods overcome this problem by using synthetic datasets with assigned ground truth (Ismail et al. 2020; Crabbé and Van Der Schaar 2021). Other research ventures into real datasets, yet these efforts focus on examining explanations from a single classifier (Crabbé and Van Der Schaar 2021) or averaging a non-comparable metric across multiple datasets (Schlegel et al. 2019). The approach in Guidotti (2021) uses a white-box classifier to obtain a pseudo ground-truth explanation (a) and evaluates a post-hoc, localized explanation method (b) by estimating the cosine distance between (a) and (b). However, this method assumes that white-box classifiers can always produce explanations of ground-truth quality; we show in our experiments that this is not the case. Notably, Schlegel et al. (2019), Nguyen et al. (2020) and Agarwal et al. (2021) propose methods to quantify explanation methods, but these comparisons have a few problems: a single perturbation type cannot always distinguish between explanations, the metric used (change in accuracy) is not comparable across the selected datasets, individual effects are not separated (only the average change in accuracy is reported), and explanation ground truth is not discussed. Additionally, there is little discussion in previous work about the impact of classifier accuracy on evaluating the explanation methods based on those classifiers. This is an important point, as the evaluation can only be trusted if the classifiers are reliable. Furthermore, there are cases where multiple classifiers have high accuracy and are tied in this regard, but the explanations obtained from them disagree, and in some cases can be ranked worse than a random explanation. Hence it is not clear which classifier and explanation to select in such cases.

3 Background and definitions

3.1 Time series and time series dataset

A time series \(X = [x_0,x_1,\ldots ,x_{l-1}], x_i \in \mathbb {R}\), is a sequence of \(l \in \mathbb {N}\) real values recorded from a synthetic or real process. In this definition, l is also called the number of time steps or the length of the time series X, and the \(x_i\) are the data points or time points.

A time series dataset D consists of \(n \in \mathbb {N}\) time series of equal length l that are recorded from a single process. If the time series are not of equal length, it is common to pad with zeroes or use resampling to bring them to equal length.

3.2 Saliency-based explanations for time series

In the context of this paper, we only consider explanations in the form of saliency maps. A saliency map explaining time series X is a vector of numerical weights \(M = [w_0, \dots ,w_{l-1}]\), where \(w_i \in \mathbb {R}\) and l is the length of X. The value \(w_i\) indicates the importance (or saliency) of time point i for the prediction made for X. This vector can be obtained from annotation (by a human) or computed by an explanation method. The explanation method can come from a white-box classification model (intrinsic explanation) or a black-box classifier coupled with a post-hoc explanation method (post-hoc explanation). The weights \(w_i\) are typically rescaled to [0, 1].

3.2.1 Random explanation

For sanity checks, we use saliency maps generated through random sampling as a lower bound on explanation quality. Here, the weights \(w_i\) are drawn from a random uniform distribution. Like a dummy classifier, this random explanation serves as a baseline for any reasonable explanation method, i.e., they all should be better than random guessing. Nonetheless, there are situations where a random explanation outperforms a method-based explanation. Specifically, when a method-based explanation highlights non-discriminative parts, or fails to identify any discriminative parts, that explanation can be considered worse than a random explanation.
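
As a concrete illustration, such a random baseline can be generated in a couple of lines; this is a minimal sketch and the function name is ours.

```python
import numpy as np

def random_explanation(length, seed=0):
    """Uniform-random saliency map in [0, 1], used as a sanity-check lower bound."""
    rng = np.random.default_rng(seed)
    return rng.uniform(0.0, 1.0, size=length)
```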

3.2.2 Oracle explanation

In cases where explanation ground truth is available (e.g., for synthetic datasets or from domain experts), this should be the gold standard for any explanation method. We generally expect any explanation method to rank between the random and the oracle explanations.

4 Methodology

In this section, we describe our proposed methodology using concepts described in Sect. 3. Specifically, we present the blueprint of AMEE in Fig. 5. The framework involves a labelled time series dataset (split into training and test datasets), a set of explanation methods to be compared, and a set of evaluating classifiers (referee classifiers). The output of the framework is the explanation power of each explanation method (see Table 1).

4.1 Explanation-guided data perturbation

A good saliency-based explanation for a time series should highlight its discriminative part(s) that contain class-specific information to distinguish from other classes. Data perturbation is the process of adding noise to the data by replacing selected time points in the time series. Explanation-guided data perturbation uses a saliency-based explanation to determine the specific time points of the time series to be perturbed. As a result, the more informative the explanation, the higher the decrease in classifier accuracy is expected, because that perturbation removes important class-specific information in the respective time series. Given a threshold k (\(0 \le k \le 100\)), the discriminative parts of a time series of l steps are segmented using the top k-percentiles in M. This is a set of \(k * l / 100\) time steps that have the highest weights in the saliency map M. Varying k allows us to control the scope of the perturbation. At \(k=0\), the time series is the original; at \(k=10\), only 10 percent of the time steps (that are most discriminative according to the explanation) are perturbed; at \(k=100\) the entire time series is perturbed.
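
A minimal sketch of this thresholding step is shown below, assuming a saliency map M already rescaled to [0, 1]; the helper name is ours.

```python
import numpy as np

def top_k_timesteps(saliency, k):
    """Indices of the k% most salient time steps of one series (0 <= k <= 100)."""
    l = len(saliency)
    n_perturb = int(round(k * l / 100))
    # Sort in descending order of saliency weight; ties are broken arbitrarily.
    return np.argsort(saliency)[::-1][:n_perturb]
```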

4.2 Referee classifiers

In our work we employ a set of independent and accurate classifiers that are trained with the original training set and are used to evaluate the target explanations on the test set. This committee is formed of member classifiers that we call Referee Classifiers. In order to evaluate the explanation methods, our framework measures the impact of each explanation-guided data perturbation on the accuracy of the referee classifiers R. We select the referees based on recent empirical benchmarks on TSC (Middlehurst et al. 2023).

4.3 Data perturbation strategy: multiple perturbations

In Fig. 4 we explain and visualize four strategies to perturb the discriminative areas of a time series, as guided by a given explanation (Mujkanovic et al. 2020). These strategies are either time-step dependent (local perturbation, using only the t-th step information) or time-step independent (global perturbation), using either Gaussian-based or single value replacement. With these strategies, discriminative time steps are replaced with noisy values, either by replacing the original time series values with a patch of constant values (like a grey mask in an image) or a patch of random Gaussian noise values (like a noise mask in an image). Let n be the number of time series in a dataset D, each with l time steps. We want to perturb one test time series of size \(1\times l\), so its t-th value \(x_t\) is replaced with a new value \(r_t\). We define the global and local profile for this time step perturbation as follows.

Local perturbation:

$$\begin{aligned} \mu _t = \frac{1}{n}\sum _{x \in D}x_t; \quad \sigma _t^{2} = \frac{1}{n-1} \sum _{x \in D} (x_t - \mu _t)^2 \end{aligned}$$
(1)

Global perturbation:

$$\begin{aligned} \mu = \frac{1}{l\,n} \sum _{i=1}^{l} \sum _{x \in D}x_i; \quad \sigma ^{2} = \frac{1}{l\,n-1} \sum _{i=1}^{l} \sum _{x \in D} (x_i - \mu )^2 \end{aligned}$$
(2)

With these local and global profiles, we can define the perturbed value \(r_t\) accordingly. We use four perturbation strategies, two local and two global. Local mean: \(r_t^{(1)} = \mu _t\); Local Gaussian: \(r_t^{(2)} \sim \mathcal {N}(\mu _t,\,\sigma _t^{2})\); Global mean: \(r_t^{(3)} = \mu\); Global Gaussian: \(r_t^{(4)} \sim \mathcal {N}(\mu ,\,\sigma ^{2})\). Figure 4 illustrates how the four strategies modify the original time series in the regions identified by the explanation weights. We show in our experiments that it is important to use a set of perturbation strategies, rather than a single fixed perturbation.
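
The four strategies can be sketched as follows; we assume the perturbation profiles are estimated from a training matrix `D_train` of shape (n, l) and that `idx` holds the time steps selected by the saliency threshold of Sect. 4.1. The helper names are ours and this is an illustrative sketch, not the reference implementation.

```python
import numpy as np

def perturb(x, idx, D_train, strategy, rng=None):
    """Replace the time steps in `idx` of series x with one of the four noise types."""
    rng = rng if rng is not None else np.random.default_rng(0)
    x = x.copy()
    mu_t = D_train.mean(axis=0)           # per-time-step mean (local profile)
    sd_t = D_train.std(axis=0, ddof=1)    # per-time-step standard deviation
    mu_g = D_train.mean()                 # dataset-wide mean (global profile)
    sd_g = D_train.std(ddof=1)            # dataset-wide standard deviation
    if strategy == "local_mean":
        x[idx] = mu_t[idx]
    elif strategy == "local_gaussian":
        x[idx] = rng.normal(mu_t[idx], sd_t[idx])
    elif strategy == "global_mean":
        x[idx] = mu_g
    elif strategy == "global_gaussian":
        x[idx] = rng.normal(mu_g, sd_g, size=len(idx))
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return x
```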

Fig. 4

Time Series Data Perturbation strategy: an example time series with a known saliency map (left) is perturbed with mean or Gaussian noise computed from local time steps (local) or from all time steps across the entire dataset (global), applied to its most discriminative region (in this example we perturb the top 20% of values according to the highest saliency weights)

4.4 The AMEE framework for evaluating explanations

Figure 5 summarizes the components and steps in the AMEE framework. Our framework requires a labeled time series dataset (D), a set of explanations (M) to evaluate, and a set of referee classifiers (R) to be trained on a subset of the dataset. With these elements, the following steps are done to record the necessary information to calculate evaluation metrics:

  0. Split the labeled dataset D into training (\(D_{train}\)) and test (\(D_{test}\)) sets;

  1. Train the Referee Classifiers (R) with \(D_{train}\);

  2. Use each explanation in M to create a step-wise, explanation-based perturbation of \(D_{test}\);

  3. Measure the accuracy of each trained referee in R on these perturbed datasets \(D'_{test}\).

The output of this process is the accuracy on the perturbed dataset \(D'_{test}\) at various thresholds (k), serving as an indicator of how much an explanation-based perturbation impacts the referees. A significant drop in accuracy in the first few steps of the explanation-guided perturbation (e.g., at \(k=10\) or \(k=20\)) signals that meaningful, salient data points have been disturbed based on the explanation. Hence, explanations that correctly identify such salient regions are likely to be informative.
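
A minimal sketch of this loop for a single referee, a single explanation method and a single perturbation strategy is given below, reusing the helpers sketched in Sects. 4.1 and 4.3; all helper names are ours.

```python
import numpy as np

def perturbation_accuracies(referee, D_test, y_test, saliency_maps, D_train,
                            strategy, ks=range(0, 101, 10)):
    """Referee accuracy on D_test perturbed at each threshold k in `ks`."""
    accs = []
    for k in ks:
        # Perturb each test series on its own top-k% salient time steps.
        D_pert = np.array([
            perturb(x, top_k_timesteps(m, k), D_train, strategy)
            for x, m in zip(D_test, saliency_maps)
        ])
        accs.append(float((referee.predict(D_pert) == y_test).mean()))
    return np.array(accs)
```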

Fig. 5

The AMEE evaluation framework requires 3 elements: (a) a dataset that requires explanation evaluation, (b) a set of saliency-based explanations, and (c) a set of referee classifiers trained on a subset of (a)

4.5 Explanation AUC

We measure the impact of each explanation by estimating the Area Under the Curve (AUC) of its explanation-guided perturbation. Specifically, the accuracy scores at each threshold (k) are translated into an Explanation-AUC (EAUC) using the trapezoidal rule.

$$\begin{aligned} EAUC = \frac{1}{2} \Delta k_0 \sum _{i=1}^q (acc_{i-1} + acc_{i} ) \end{aligned}$$
(3)

Here \(\Delta k_0\) denotes the step size normalized to the 0–1 range (\(\Delta k_0 = \frac{1}{100} \Delta k\)); q denotes the number of steps (\(q = \frac{100}{\Delta k}\)); \(acc_i\) is the accuracy at step i. If we perturb the dataset in q steps, we have a total of \(q+1\) accuracy scores. For example, if the perturbation is done in \(q=10\) steps, each step corresponds to a difference of \(\Delta k = 10\) percentage points in the perturbation threshold (i.e., 0%, 10%, ..., 100%). The step for \(k=0\) corresponds to the original test dataset, while the step for \(k=100\) corresponds to adding noise to the entire time series.

With this estimation, a smaller EAUC means higher impact (accuracy loss) of the explanation method (Fig. 6). The Explanation AUC is computed for each combination of Perturbation–Referee–Explanation (Fig. 7: Step 1).
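
Given the accuracy curve computed above, Eq. (3) reduces to a trapezoidal integration over the normalized thresholds. The sketch below assumes equally spaced thresholds; the function name is ours.

```python
import numpy as np

def explanation_auc(accs, ks=range(0, 101, 10)):
    """Explanation AUC (Eq. 3): trapezoidal area under the accuracy-vs-threshold curve."""
    k_norm = np.asarray(list(ks)) / 100.0   # thresholds rescaled to the [0, 1] range
    return float(np.trapz(accs, k_norm))    # lower EAUC = more impactful explanation
```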

Fig. 6

Change in accuracy measured by a referee classifier for two explanation methods (red and blue) at each threshold level k. When a signal is perturbed based on a more informative explanation, it becomes harder for the referee to classify correctly, leading to a more severe drop in accuracy. This impact is measured by the Explanation AUC, the area under the curve (AUC) of these changes in accuracy at different thresholds k. The curve with the lower Explanation AUC (red curve) results from perturbation guided by a more informative explanation method (Color figure online)

4.6 Robustness of the AMEE framework

Two key aspects of AMEE are aimed to make the framework more robust by employing multiple Data Perturbation strategies and multiple Referee Classifiers. Specifically, for each explanation in M, we use different data perturbation strategies (as described in Sect. 4.3) to create explanation-based perturbations on the dataset \(D_{test}\). Additionally, multiple referee classifiers are trained on \(D_{train}\) and their accuracy is measured on the perturbed \(D'_{test}\).

The various data perturbation strategies represent different ways that salient parts of the data can be replaced with noise. Unlike image data, which is standardized in RGB, time series data is more heterogeneous: it can belong to many different domains, be collected from various sources, or be preprocessed in different ways. These characteristics of time series data make it harder to use one single method to mask out a specific part of the data. Using a variety of data perturbations ensures that data is perturbed in ways that completely mask out the relevant parts of the signal. We further investigate using multiple Data Perturbation strategies in Sect. 5.3.2.

Referee classifiers, pre-trained on the original, non-perturbed data, are used to evaluate the impact of the data perturbation. Thus, the evaluation by referee classifiers depends on the properties of these classifiers, such as in-classifier data normalization, feature extraction, and feature processing. Having multiple referee classifiers can reduce the potential biases introduced by using one single classifier. We analyze this characteristic and show the benefits of using multiple referee classifiers in Sect. 5.3.3.

4.6.1 Standardization and explanation power

AMEE employs multiple perturbation strategies and multiple referee classifiers. As the EAUC measures depend on the choice of referees and perturbation strategies, they are not directly comparable. The next steps (Fig. 7: Step 2–5) standardize and aggregate the EAUC to compute the final output of the framework, the explanation power.

Step 2 rescales the Explanation AUC to the same range [0, 1] for each row (i.e., each pair of Referee and Perturbation). Since each referee responds differently to changes in the perturbed dataset, this normalization is performed for each pair of Referee and Perturbation to ensure that the Explanation AUC is comparable across the different explanation methods in the evaluation. The red highlighted row is an example. After rescaling the Explanation AUC, the Average Scaled EAUC is computed in Step 3; it is simply the average of each column in Step 2. This average can be computed because the individual Explanation AUCs are already normalized to the [0, 1] range and comparable across each Referee-Perturbation pair. For example, the Average Scaled EAUC of Rocket-SHAP is \((0.43 + 0.26 + 0.69 + 0.42 + 0.33 + 0.67)/6 = 0.47\).

In Step 4, the Average Scaled EAUC is again rescaled to the range between 0 and 1. The result is the Average Scaled Rank (lower is better). The explanation power is simply the complement of the Average Scaled Rank (\(1-\) Average Scaled Rank), i.e., higher is better. Details of this calculation are summarised in Algorithm 1.
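
The aggregation can be sketched in a few lines of numpy; `eauc` is assumed to be a matrix with one row per (Referee, Perturbation) pair and one column per explanation method. This mirrors Algorithm 1 but is our own illustrative code, not the reference implementation.

```python
import numpy as np

def explanation_power(eauc):
    """Explanation power per explainer from a (referee x perturbation, explainer) EAUC matrix."""
    # Step 2: min-max rescale each row so EAUCs are comparable across Referee-Perturbation pairs.
    lo = eauc.min(axis=1, keepdims=True)
    hi = eauc.max(axis=1, keepdims=True)
    scaled = (eauc - lo) / (hi - lo)
    # Step 3: Average Scaled EAUC = column mean over all Referee-Perturbation pairs.
    avg = scaled.mean(axis=0)
    # Step 4: rescale the averages to [0, 1], giving the Average Scaled Rank (lower is better).
    rank = (avg - avg.min()) / (avg.max() - avg.min())
    # Step 5: explanation power = 1 - Average Scaled Rank (higher is better).
    return 1.0 - rank
```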

Fig. 7

Measure standardisation and explanation power calculation. Example of how explanation power is derived in a typical evaluation assessment, involving 2 perturbation strategies (local, global) and 3 referees (MrSEQL, k-NN, ROCKET)

Algorithm 1

AMEE: Calculate Explanation Power

5 Experiments

In this section, we evaluate the performance of the AMEE framework in three groups of experiments in ascending order of difficulty. In the simplest case, we want to validate AMEE with synthetic datasets with known explanation ground-truth (Ismail et al. 2020). Next, we measure the performance of the framework with a diverse set of time series classification datasets from the UCR Time Series Classification Archive covering popular domains that require explanation (Dau et al. 2018). Finally, we test our framework on a real dataset and compare the result with ground-truth explanations provided by domain experts. Our experiments are repeated 5 times and the reported results are the average of these repetitions.

5.1 Referee classifiers

We employ 5 candidate referee classifiers in our experiments, selected based on their accuracy, speed and diversity of approach (Schäfer and Leser 2023): baseline 1NN-DTW (distance-based) (Cover and Hart 1967), MrSEQL (dictionary-based, time domain) (Le Nguyen et al. 2019), ROCKET (convolution-based) (Dempster et al. 2020), RESNET (deep learning) (He et al. 2016; Ismail Fawaz et al. 2019b) and WEASEL 2.0 (dictionary-based, frequency domain) (Schäfer and Leser 2023). As the choice of referees is a critical component of our framework, we carefully select classifiers that achieve high accuracy on all studied datasets. For a classifier to be selected into the referee committee, it has to achieve at least the average accuracy of all candidate referee classifiers, and this number has to be higher than the theoretical accuracy achieved by a random classifier. In case the average accuracy is over 90%, the threshold to choose referees is set to 90%. By capping the threshold when the average accuracy is relatively high, we include referees that perform well but fall slightly below the average accuracy. For example, in a theoretical case where the accuracies of the 5 candidate classifiers are 0.90, 0.95, 0.97, 0.98 and 0.99, we want to include all of them as referees, since they all perform well even though one is slightly below the average accuracy. For some datasets, all the classifiers are tied or very close in accuracy. Details of the referee accuracy are presented in the Appendix.
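
The selection rule described above can be summarised in a short helper; this is an illustrative sketch and the function name and interface are ours.

```python
import numpy as np

def select_referees(candidate_accuracies, n_classes):
    """Keep candidates whose test accuracy reaches the selection threshold.

    The threshold is the mean candidate accuracy, capped at 0.90 when that mean
    exceeds 0.90, and it must also beat the accuracy of random guessing.
    """
    accs = np.array(list(candidate_accuracies.values()))
    threshold = min(accs.mean(), 0.90)
    threshold = max(threshold, 1.0 / n_classes)   # must beat a random classifier
    return [name for name, acc in candidate_accuracies.items() if acc >= threshold]
```

For instance, with hypothetical accuracies `{'ROCKET': 0.98, '1NN-DTW': 0.85}` on a 2-class dataset, the mean (0.915) is capped at 0.90, so only ROCKET would be retained as a referee.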

5.2 Explanation methods

In our experiments, we evaluate 8 popular explanation methods with diverse properties as described in Table 2.

We use the authors' implementations for LIME (Ribeiro et al. 2016) and MrSEQL (Le Nguyen et al. 2019), the captum (Kokhlikyan et al. 2020) library for gradient-based explainers, the time-explain library (Mujkanovic et al. 2020) for SHAP, and sklearn (Buitinck et al. 2013) for the remaining classifiers and explainers. We considered a few other recent explainers, e.g., LIMESegment (Sivill and Flach 2022a), but they proved too slow to run on all our datasets. Since our goal is to rank a set of given explainers, rather than to promote any particular explainer, we consider this explainer set sufficient to validate our methodology.

Table 2 Summary of properties of explanation methods

5.3 Evaluation for synthetic data with known ground truth

5.3.1 Data

We work with 10 synthetic univariate time series classification datasets, selected by taking the mid-channel from the time series benchmark generated by Ismail et al. (2020). The datasets are created using five processes: (a) a standard continuous autoregressive time series with Gaussian noise (CAR), (b) sequences of standard non-linear autoregressive moving average (NARMA) time series with Gaussian noise, (c) non-uniform sampling from a harmonic function (Harmonic), (d) non-uniform sampling from a pseudo periodic function with Gaussian noise (Pseudo Periodic), and (e) a Gaussian process with zero mean and unit variance (Gaussian Process). The important areas, either a Small Middle part (30% of the time series length) or a very small part, Rare Time (10% of the time series length), are created by adding or subtracting a constant \(\mu\) (\(\mu = 1\)) for the positive and negative class. The number of time steps is T = 50. Each dataset comprises 500 training samples and 100 test samples. Figure 8 visualizes the two classes in the 10 datasets.

Fig. 8

Visualization of the synthetic datasets. The columns correspond to the process used to create the dataset, while the rows specify the salient areas

Before presenting the experiment result of our evaluation on the synthetic datasets, we discuss the effect of the Data Perturbation strategy (Sect. 5.3.2), investigate the impact of Referees (Sect. 5.3.3), and perform a sanity check for the classifier quality used for model-agnostic post-hoc explanation methods such as LIME and SHAP (Sect. 5.3.4).

5.3.2 Impact of data perturbation strategy

Figure 9 shows the boxplots of explanation power for different data perturbation strategies. In datasets which are "easier" to classify (i.e., most classifiers get close to 100% accuracy), such as CAR and NARMA, the explanation power does not change with the perturbation strategy. On the other hand, we observe a larger change in explanation power when data is harder to classify (for example, in the GaussianProcess datasets). We additionally present plots showing changes of explanation power in two extreme cases, the SmallMiddle_CAR dataset (easy-to-classify) and the RareTime_GaussianProcess dataset (hard-to-classify), when different perturbations are gradually introduced (Fig. 10). Notably, for the harder dataset RareTime_GaussianProcess, having more perturbation methods brings the evaluation results closer to the ground truth. Specifically, for the Oracle explanation, if only a single perturbation method was used, such as Local Mean or Local Gaussian, the evaluation result would rank the Oracle explanation as the 6th best explanation method. However, when more perturbations are introduced, the Oracle explanation is evaluated more robustly, placing this method at the top and better aligned with the ground truth. For the Oracle (upper bound) and Random (lower bound) explanations, we generally observe that these are the most and least informative methods, respectively.

Fig. 9

Impact of data perturbation strategy on explanation power for each explanation method. A smaller box-range (which comes from 4 perturbation methods) indicates a smaller change of the Explanation Power with different perturbation strategies. For datasets that are “easier” to classify by referees, this range is often smaller than that of “harder”-to-classify datasets

Fig. 10

Changes of explanation power when different Perturbations are sequentially introduced. The sequence of perturbations is: Local Mean, Local Gaussian, Global Mean, and Global Gaussian (less to more extreme perturbation). The two example datasets are SmallMiddle_CAR (easy-to-classify dataset) and RareTime_GaussianProcess (hard-to-classify dataset). For the harder dataset RareTime_GaussianProcess, the relative position of the Explanation Methods changes, indicating that having multiple types of perturbation methods is helpful when the dataset is hard to classify. Specifically, for the Oracle explanation, if only a single perturbation method was used, such as Local Mean or Local Gaussian, the evaluation result would rank the Oracle explanation as the 6th best explanation method. However, when more perturbations are introduced, the Oracle explanation is evaluated more robustly, placing this method at the top and closer to the ground truth

5.3.3 Impact of referee classifiers

Similar to the previous investigation of the impact of the perturbation strategy, we now inspect how the explanation power changes with respect to the set of referees, and present the results in Fig. 11. Here, we also notice a relatively consistent explanation power among different referee classifiers in datasets that are easier to classify (such as the CAR and NARMA datasets). In datasets that are harder to classify (for example, the Gaussian Process datasets), we observe a wider range of explanation power across referee classifiers.

Fig. 11

Impact of referees on explanation power for each explanation method. A smaller box-range (which comes from 5 referee classifiers) signals a smaller change of explanation power across referees. In addition, for a specific dataset, the relative position of this range indicates how strongly the referees disagree in their votes. Nevertheless, having a committee of referees that are highly accurate is generally desirable

Fig. 12

Changes of explanation power when different Referees are sequentially introduced. The two example datasets are SmallMiddle_CAR (easy-to-classify dataset) and RareTime_GaussianProcess (hard-to-classify dataset). The referees are added in order from lowest to highest accuracy, filtered to represent the most reliable classifiers among the ones used for evaluation. Details of this sequence are in Section 8. For the harder dataset RareTime_GaussianProcess, the relative position of the Explanation Methods changes, indicating that having a set of referees is helpful and leads to more stable results. Specifically, for the Oracle explanation, if only a single referee is employed, the evaluation result could rank the Oracle explanation as the 2nd best explanation method. However, when more referees are introduced, the Oracle explanation is evaluated more robustly, placing this method at the top and closer to the ground truth

Random and Oracle explanations both achieve explanation power values in the expected range for the evaluated datasets. We present the change of explanation power when different referees are sequentially introduced for the two extreme cases, the SmallMiddle_CAR dataset (easy-to-classify) and the RareTime_GaussianProcess dataset (hard-to-classify) (Fig. 12). We observe that for the RareTime_GaussianProcess dataset, which is hard to classify, having a committee of highly accurate referees is desirable: it helps reduce the potential bias of a single referee and leads to a more stable evaluation. Specifically, for the Oracle explanation, if only a single referee was employed, the evaluation result would have ranked the Oracle explanation as the 2nd best explanation method. However, when more referees are introduced, the Oracle explanation is evaluated more robustly, placing this method at the top and closer to the ground truth.

Similarly, we note that for some real datasets, several referee classifiers that are highly accurate can disagree in their evaluation ranking. In such cases, having multiple referees leads to a considerably more robust and reliable result. We show an example using a real dataset with domain expert ground truth (the Counter Movement Jump dataset) in a later section (Sect. 5.5.3).

5.3.4 Sanity check for the impact of the base classifier quality

Model-agnostic post-hoc methods such as LIME and SHAP derive explanations based on a classifier of any type. Thus, these explanations depend on the performance of the base classifier. For example, the ROCKET-SHAP explanation is created by applying SHAP (explanation method) to ROCKET (base classifier). If the base classifier has low accuracy on the sample dataset, the explanation based on that classifier may not be as good as one based on a more accurate classifier. In our experiment, we get LIME and SHAP explanations from two sources: the MrSEQL classifier (Le Nguyen et al. 2019) and the ROCKET classifier (Dempster et al. 2020). We observe that ROCKET achieves higher accuracy than MrSEQL on the datasets created from the Pseudo Periodic, Harmonic, and Gaussian Process functions (Table 10 in the Appendix). We compare the two pairs of explanations (MrSEQL-based and ROCKET-based) from LIME and SHAP as a sanity check. Our experiment confirms that in both cases, under the AMEE evaluation approach, ROCKET-LIME and ROCKET-SHAP are considered better explanation methods than MrSEQL-LIME and MrSEQL-SHAP, respectively (Fig. 13). This sanity check confirms our intuition that the quality of the base classifier is an important factor in model-agnostic, post-hoc explanation methods such as LIME and SHAP.

Fig. 13

Sanity Check of Base Model Quality with LIME and SHAP. Higher accuracy of the base classifier (ROCKET > MrSEQL) leads to higher explanation power for the corresponding explanations (ROCKET-LIME and ROCKET-SHAP have higher explanation power than MrSEQL-LIME and MrSEQL-SHAP)

5.3.5 Results

Using a committee of 5 referee classifiers and 4 data perturbation strategies, we evaluate 10 explanation methods (8 computed explainers plus the lower bound explanation (Random) and upper bound explanation (Oracle)) using AMEE. The resulting explanation power is presented in Table 3.

Using a threshold of 0.5 on the min-max normalized saliency score in [0, 1], we determine the ground truth of whether a time point is salient. We compare the explanation methods with the ground truth explanation for each time point and calculate the F1-score (Table 4) to measure how good each method is at determining the saliency of a time point. For example, in the SmallMiddle_CAR dataset, AMEE selects the best explanation method to be Oracle with an explanation power of 1.00 (Table 3), and the second best explanation is the Saliency Map from RidgeCV (explanation power of 0.79, Table 3). This result is similar to the F1-score of these explanations using the ground truth (Table 4): the Oracle explanation achieves an F1-score of 1.00 (highest), followed by the RidgeCV saliency map with an F1-score of 0.86 (second-best). Similarly, we find that the ranks of the methods in Tables 3 and 4 are in high agreement. Moreover, even for the hardest dataset to classify (RareTime_GaussianProcess), we observe that adding more referees brings the relative ranking of the evaluated explanations closer to the ground truth ranking (Table 5). This result further reinforces that using multiple referees is desirable: we observe that a highly accurate set of referees brings the explanation ranking closer to the ground truth. Overall, our results show a high agreement between AMEE's computed explanation power and the F1-score based on ground-truth time-point importance, and confirm that the committee of referees is a desirable property of the explainer recommendation framework.
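
The per-time-point comparison with the ground truth can be sketched as follows, assuming saliency maps that are already min-max normalized to [0, 1]; the helper name is ours.

```python
import numpy as np
from sklearn.metrics import f1_score

def saliency_f1(saliency, ground_truth, threshold=0.5):
    """F1-score of a saliency map against binary per-time-point ground truth."""
    pred = (np.asarray(saliency) >= threshold).astype(int)   # predicted salient points
    return f1_score(np.asarray(ground_truth).astype(int), pred)
```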

Table 3 Synthetic datasets: explanation power for each of the 10 explanation methods evaluated
Table 4 Synthetic datasets: F1-score of explanation methods using explanation ground truth
Table 5 Dataset RareTime_GaussianProcess: explanation power rank when referees are added sequentially for the Top 5 explanation methods

5.3.6 Comparison with previous work

Previous work (Nguyen et al. 2020) presented initial results on comparing explanation methods. However, that method uses only one type of perturbation, which replaces salient areas with Gaussian noise of low magnitude. While the magnitude of the Gaussian noise can be customized, determining this parameter requires extra work from users. That initial work also does not propose a way to standardise the explanation AUC across methods and datasets. Our new framework employs a combination of perturbation types that has a higher impact in changing the original signals, resulting in a more robust framework and better results. We include the results of the framework (using the default settings) introduced in Nguyen et al. (2020) in Table 6.

This table shows that for many of the synthetic datasets, the perturbation added to the signal under the default setting is too small and fails to trigger changes in classification accuracy. As a result, the previous approach is unable to distinguish differences in the informativeness of the explanation methods.

Table 6 Synthetic datasets: using default perturbation settings with previous work in Nguyen et al. (2020) to get explanation power for each of the 10 explanation methods evaluated

Even with a much larger noise level (Fig. 14), the previous framework does not provide a result as accurate as AMEE, especially for datasets that are difficult to classify. We include results of the framework introduced in Nguyen et al. (2020) using higher noise magnitude in Table 7.

Fig. 14

Sample time series from the SmallMiddle_CAR dataset perturbed by (a) the Gaussian noise addition proposed in Nguyen et al. (2020) at very high magnitude (left) and (b) Global Gaussian noise, a non-parametric perturbation (right)

Table 7 Synthetic datasets: using higher perturbation magnitude with previous work in Nguyen et al. (2020) to get explanation power for each of the 10 explanation methods evaluated

5.4 Evaluation for real time series data

5.4.1 Data

We work with 15 datasets from the UCR Archive (Dau et al. 2018) that represent a variety of data sources and domains. These datasets are of 5 types: electrocardiogram (ECG), human motion (MOTION), device usage (DEVICE), device activities tracked by sensors (SENSOR) and spectroscopy (SPECTRO). Oracle explanation is not available for these datasets.

5.4.2 Results

We test explanations for these datasets with AMEE and report the results in Table 8. Since we do not have ground truth for the majority of these datasets, we use this experiment to show how AMEE can be applied to real datasets. We note that the Random explanation sometimes outperforms a method-based explanation. This can happen as some explanation methods may not work well with certain datasets, resulting in unreasonable explanations that misleadingly highlight non-discriminative parts as discriminative, or fail to identify any significant discriminative parts at all. In this situation, the evaluation of random explanations can serve as a filter for reasonable explanation methods, and any method that performs worse than random should be filtered out.

Table 8 Explanation power on UCR datasets

5.5 Evaluation for real dataset with expert ground truth

The Oracle explanation is the upper bound for any explanation method; however, it is only available for synthetic datasets. For real datasets, the explanation ground truth is often available only at an approximate level of precision, e.g., specifying the relative position of the shape and areas of importance. Such approximate ground truth is widely used in other papers that evaluate explanation methods for images (Kim et al. 2018; Zhou et al. 2016; Selvaraju et al. 2017); however, it is not readily available for time series data without opinions from domain experts. Among the datasets evaluated in Sect. 5.4, Coffee (Briandet et al. 1996), Counter Movement Jump (CMJ) (Le Nguyen et al. 2019), and GunPoint (Ratanamahatana and Keogh 2004) come with this information about the true important areas for each class. In this section, we compare the saliency-based explanations evaluated by AMEE with the expert ground truth of important areas.

5.5.1 Spectroscopy dataset: coffee

The Coffee dataset contains spectroscopy samples of two types of coffee: Arabica and Robusta. It was first introduced in Briandet et al. (1996) and is part of the UCR time series archive (Dau et al. 2018). Figure 16 shows the top 3 and bottom 2 explanation methods ranked by AMEE. Notably, the discriminating region of the two classes of Coffee produced by the best explainer, ROCKET-SHAP, is the last peak region of the time series. This region was confirmed in the original paper (Briandet et al. 1996) to contain information about the chlorogenic acid content of the sample, which contributes to the difference between the two types of coffee (Fig. 15). Arabica has a lower caffeine and chlorogenic acid content, which contributes to its finer taste and greater market value. The region that MrSEQL-SHAP and MrSEQL-SM also highlight is another part of the spectrum that contains information about the chlorogenic acid content (Briandet et al. 1996). The worst explanation methods among those evaluated, GradientSHAP and the random explanation, show either very small, non-contiguous, or randomly scattered regions of interest and do not focus on parts of the time series that discriminate the two coffee types.

Fig. 15

Coffee dataset: ground truth from the Coffee dataset (Briandet et al. 1996). According to the original paper (Briandet et al. 1996), the chlorogenic acid content (region of approximately time steps 150–240, marked in red) is the major region that contributes to the difference between the two coffee types. The caffeine content (region of approximately time steps 40–75, marked in orange) is also discriminative, but to a lesser extent (Color figure online)

Fig. 16

Coffee dataset: visualization of the top 3 best explanation methods (top 3 rows) and the worst explanation methods (bottom 2 rows) ranked by AMEE. It can be observed that the top explanation methods are all able to point out the discriminative regions confirmed by the domain expert, as represented in Fig. 15

5.5.2 Video motion retrieval dataset: GunPoint

The famous GunPoint dataset is the time-series translation from a video sequence involving actors performing two distinct actions: pointing to a target with a gun (Gun class) and pointing with their index fingers only (Point class). This dataset was introduced in Ratanamahatana and Keogh (2004) and is part of the UCR time series archive (Dau et al. 2018). Figure 17 visualizes the examples of explanations from the best three methods, worst method, and random method for this dataset. The expert ground truth for the GunPoint dataset conveys that the two classes differ in the steps where the Gun class requires the actor/actress to lift their hand above a holster, then reach down for the gun. This distinct action creates a subtle difference in the time steps right before the action of hands moving to the shoulder level (the sharp increase in time series values) to pointing the gun or hand (the plateau in the middle of the time series). The detailed description can be found in Ratanamahatana and Keogh (2004). In this dataset, AMEE identifies explanations from the IntegratedGradient method as the most computationally informative explanation, followed by MrSEQL-SHAP and MrSEQL-SM. The least informative method is ROCKET-LIME, which is even less informative than a random saliency explanation as this method refers to the wrong area of importance, failing to point out any of the salient regions of the time series.

Fig. 17

GunPoint dataset: visualization of the top 3 best explanation methods (top 3 rows) and the worst explanation methods (bottom 2 rows) identified by AMEE. According to the description of the dataset in its original paper (Ratanamahatana and Keogh 2004), the discriminative region lies right before the high plateau of the two classes. For the Gun class, this region reflects the actors' hands moving above the holsters and down to grasp the gun. For the Point class, there is no such action, resulting in a smoother curve from rest to point motion

5.5.3 Motion sensing dataset: counter movement jump (CMJ)

The CMJ dataset records the counter movement jumps of participants of 3 classes: Normal (jump done correctly), Bend (jump with knee bend), and Stumble (stumble at landing) (Fig. 18).

Fig. 18

Examples of 3 classes of the Counter Movement Jump (CMJ) dataset

Fig. 19

Counter Movement Jump (CMJ) dataset: visualization of the best explanation methods (top 3 rows), the worst explanation method and the random explanation (bottom 2 rows) identified by AMEE. Visualizations of the other explanations are presented in Section 8

According to the domain experts who recorded this data (Le Nguyen et al. 2019), the critical area for the first two classes (NORMAL and BEND) is the middle part, while that of the final class (STUMBLE) is at the end of the time series. In class NORMAL, this region is completely flat. The same region in class BEND is characterized by a hump, which appears when participants' knees are in a bending posture. In the STUMBLE class, the end of the time series differs from the previous two classes because of its very high, sharp peak caused by a wrong landing position.

The result of AMEE for all studied explanation methods is also given in Table 8. The top 3 rows of Fig. 19 show that the top 3 explanations for this dataset are MrSEQL-SHAP (SHAP explanation based on the MrSEQL classifier), MrSEQL-LIME (LIME explanation based on the MrSEQL classifier), and MrSEQL-SM (saliency map obtained directly from the MrSEQL classifier). We see a high agreement between these explainers as they all correctly highlight the discriminative areas provided by the expert (Fig. 19). In addition, methods that AMEE identifies as unreliable are also shown to highlight incorrect regions that do not agree with the opinion of the domain expert (e.g., the explanation provided by the Integrated Gradient method).

5.5.4 Impact of using multiple referees

The Counter Movement Jump (CMJ) dataset is an example of a real dataset with known domain expert ground truth (Le Nguyen et al. 2019). In our experiment, all of the referee classifiers achieve very high performance, ranging from 0.92 to 0.97 accuracy (Table 11). Hence, this dataset presents an opportunity to investigate the benefit of using a committee of referee classifiers in comparison with using a single referee classifier. Figure 20 shows the explanation power using two approaches: (a) using an ensemble of referees and (b) using a single referee independently. Consider the case of the Random explanation (displayed in blue), which should clearly be worse than the MrSEQL-based explanations (as shown in Sect. 5.5.3). It is interesting that one of the referee classifiers that is quite accurate (Resnet with 0.92 accuracy) ranks the Random explanation as the best, with a significant difference in explanation power to the others. If only a single referee was employed here, the recommendation could select the Random explanation. However, this risk decreases when an ensemble of referees is used. From this real example we observe that the benefit of using multiple referees is to improve the confidence and reliability of the evaluation, reducing the risk that a single referee is wrong by aggregating evaluations from multiple referees.

Fig. 20

CMJ dataset: explanation power in (a) Ensemble Referee Mode versus (b) Single Referee Mode. The sequence of referees in both figures is (1) RESNET, (2) K-NN, (3) MrSEQL Classifier, (4) ROCKET, and (5) WEASEL 2.0. In (a), explanation power is calculated in an ensemble approach with sequential addition of the referee classifiers. For example, the explanation power (x-axis) of the Random method (displayed in blue) when 2 classifiers are used (y-axis) is aggregated from both (1) RESNET and (2) K-NN. In (b), explanation power is calculated using the evaluation from a single referee, without any aggregation over other referees. For example, the explanation power (x-axis) of the Random explanation (displayed in blue) at the second position (y-axis) is obtained using only the second classifier (K-NN). Dashed lines are used in (b) to indicate that there is no connection between the explanation power values obtained from different referee classifiers (Color figure online)

5.6 Discussion

Our study, carried out over both synthetic and UCR datasets, shows that AMEE can be used to computationally evaluate and rank different explanation methods. We recommend using AMEE with full knowledge of the essential elements of the method. First, referees should be selected carefully, using classifiers of acceptable accuracy as determined by the application requirements. Using a committee of multiple accurate referee classifiers is recommended to reduce possible biases that a single referee could introduce, and results in a more reliable evaluation. Second, having a variety of data perturbation methods is helpful, especially for hard-to-classify datasets. In addition, adding a random explanation while carrying out the evaluation with AMEE helps identify unreliable explanations. A worse-than-random explanation fails to trigger a larger change in the referee classifiers than even a random explanation, either because it does not identify the important areas or because it does not focus on any important areas at all. Finally, we recommend applying SHAP-based methods to accurate base classifiers for testing and further evaluation, as our experiments show that SHAP-based explanations often outperform other explanations using the same base classifiers.

6 Recommendations for practitioners

In this section, we present our recommendations for using the AMEE framework to evaluate and recommend explanation methods. These are some of the lessons learned during the process of developing, designing, and conducting the experiments in this paper.

  • Time Series Classifiers One of the key elements of our evaluation framework is the set of referee classifiers. The more accurate the referees, the more reliable the result we can potentially expect. Hence, choosing the right set of referees is a very important step before we even start to evaluate the explanation methods. We recommend using state-of-the-art time series classifiers that are well studied and compared in the latest empirical benchmarks (Middlehurst et al. 2023). When selecting referees, we recommend choosing classifiers which are both accurate and computationally efficient, since AMEE requires repeated inference on perturbed versions of the original dataset.

  • Explanation Methods Unless the application users already have their preferred explanations pre-computed and only require AMEE for evaluation, we recommend selecting explainers based on the extensive survey in Theissler et al. (2022). We recommend using diverse explanation methods, covering both intrinsic and post-hoc explanations. From a computational perspective, we recommend considering the cost of obtaining explanations. In our experience, explainers that use data segments (chunking) to explain time series seem useful, but some of them are not efficient (for example, LIMESegment (Sivill and Flach 2022b)). Additionally, we strongly recommend adding SHAP-based explanations to the list of explainers, as we observed these are highly informative on many of the datasets we have tested. Finally, we recommend adding a random explanation to the evaluation (in addition to method-based explanations), as a simple sanity check.

  • Perturbation Strategy The perturbation strategy plays a critical role both in obtaining an explainer and in our recommendation framework. An effective perturbation strategy is one that, when used to perturb the informative parts of the time series, leads to a change in prediction. In our experience, this effectiveness strongly depends on the specific dataset and classifiers, so choosing the right perturbation can be tricky and time-consuming. Therefore, we recommend using multiple perturbation strategies in our framework.

  • Datasets and Optimal number of Referees and Perturbation Strategies Our experiments cover a wide range of datasets of different classification difficulty levels. We observe that when the dataset is easy to classify (e.g., many classification algorithms can achieve high accuracy), a lower number of referees and perturbation strategies can generally be used without affecting the evaluation results. However, when the dataset is harder to classify, using more referees and more perturbation methods is recommended to get a more reliable and less biased result.

  • Adaptability We present AMEE as a robust explainer recommendation system for the Time Series Classification task. However, the framework is adaptable and could be generally applied to other types of data (such as images and text). For adapting to other data types, practitioners can consider more suitable perturbation methods and referee classifiers that work well with the target data.

7 Conclusion

In this work we proposed AMEE, a Model-Agnostic Explanation Evaluation framework, for computationally assessing and ranking explanation methods for the time series classification task. We test the framework on 25 synthetic and UCR archive datasets to obtain explanation evaluations for a wide variety of common explanation methods for time series, covering different aspects of explanation including type, scope and model dependency. Our experiments show a high agreement of the explanation power (measured by AMEE) with the Oracle explanation (ground truth for each time point) on the synthetic datasets and with the Expert explanation (ground truth provided by a domain expert) on a real dataset. We also find that perturbation-based explainers based on SHAP generally perform better than gradient-based explainers for time series classification (given similar performance of the base models), but are computationally expensive. AMEE can be used to select appropriate explanation methods for application users. It could also potentially pinpoint inherent problems, such as bias, that may exist in the training data and subsequently enhance the trustworthiness of AI systems in critical tasks. This framework further empowers machine learning to discover new knowledge from the data. Another potential application is to use the best explainer recommended by AMEE as a proxy for downstream tasks to identify opportunities to compress data and optimize data storage, transmission, and analysis. Finally, since AMEE relies on the response to perturbation to evaluate explanation methods, it can potentially be adapted to other types of data (such as images) and other machine learning tasks (such as time series regression and clustering). Future work includes devising a robust, AMEE-optimized explanation method and working with data experts to evaluate the validity and potential of knowledge discovery with this framework in biomedical and healthcare-related tasks such as genetic data understanding and sports analytics.