Robust Explainer Recommendation for Time Series Classification

Time series classification deals with temporal sequences, a data type prevalent in domains such as human activity recognition, sports analytics and general sensing. In this area, interest in explainability has been growing, as explanation is key to understanding the data and the model better. Recently, a great variety of techniques have been proposed and adapted for time series to provide explanations in the form of saliency maps, where the importance of each data point in the time series is quantified with a numerical value. However, saliency maps can, and often do, disagree, so it is unclear which one to use. This paper provides a novel framework to quantitatively evaluate and rank explanation methods for time series classification. We show how to robustly evaluate the informativeness of a given explanation method (i.e., relevance for the classification task), and how to compare explanations side-by-side. The goal is to recommend the best explainer for a given time series classification dataset. We propose AMEE, a Model-Agnostic Explanation Evaluation framework, for recommending saliency-based explanations for time series classification. In this approach, data perturbation is added to the input time series guided by each explanation. Our results show that perturbing discriminative parts of the time series leads to significant changes in classification accuracy, which can be used to evaluate each explanation. To be robust to different types of perturbations and different types of classifiers, we aggregate the accuracy loss across perturbations and classifiers. This novel approach allows us to recommend the best explainer among a set of different explainers, including random and oracle explainers. We provide a quantitative and qualitative analysis for synthetic datasets, a variety of time series datasets, as well as a real-world case study with known expert ground truth.


Introduction
The last decade witnessed a rapid integration and increased impact of machine learning in everyday life. Machine learning algorithms work well in many applications and grow ever more complex, with models having millions of parameters [9,18]. Data quality is key, and checking against data leakage or bias is important to enable robust models. However, we are still behind in explaining why these algorithms work so well and occasionally fail to perform well, and what is in the data that leads multiple classifiers to predict a certain class [21]. The evaluation of explanation methods is still an open problem. While we have many new explanation methods and methodologies, it is still difficult to decide which is the best explainer for a given problem and dataset.
This mismatch between the growing complexity of machine learning algorithms and data, including those for time series, and our ability to explain them undermines the application of these technologies in critical, human-related areas such as healthcare, sports and finance [11,36]. As time series data is prevalent in these applications [4,44,45], Time Series Classification (TSC) algorithms often call for reliable explanations [7,29]. This explanation is usually presented in the form of feature importance or as saliency weights [2], highlighting the parts of the time series which are informative for the classification decision. Saliency-based explanations were shown to be useful to find important motifs in data [29,34], and as a starting point to prioritise features for further investigation in counterfactual explanation methods [16].
Recent efforts in designing intrinsically explainable machine learning algorithms, as well as building post-hoc saliency-based explainers for black-box algorithms, have gained significant attention [37,47,51,62]. Most works focus on explaining one particular classification algorithm, with a lot of emphasis on deep learning methods [6,51,54,55,57,62]. Often we are faced with a set of saliency maps for explanation, coming either from domain experts or from a diverse set of classifiers. In particular, for time series classification, a variety of classifiers are required for high accuracy, depending on the application domain [5,39]. These classifiers may be tied in accuracy, can be explained with different methods (e.g., LIME or SHAP), and often the resulting explanations disagree, e.g., pointing to different parts of a time series as most relevant for a predicted class. We thus face the challenge: How to assess and objectively compare many explanation methods? In other words, if two or more explanation techniques give different explanations (i.e., two different saliency maps coming from the same classifier or different classifiers, Figure 1), which explanation is best for our task? In this paper, we focus on the time series classification task and propose a methodology appropriate for time series. Some of the ideas we investigate are relevant to recommending explainers beyond the TSC task, but this is beyond the scope of this paper.
Fig. 1: Saliency map explanation is a vector of feature importance weights overlaid over the original time series, where each point in the time series is coloured according to its importance. The saliency is obtained by classifying a motion time series using different classifiers and explainers. The most discriminative parts according to the explanation method are colored in deep red, and the non-discriminative parts are colored in deep blue.
We propose a methodology to compute a standardized evaluation measure, which enables quantitative comparison and ranking of explainers (Table 1). From the application users' perspective, having this recommendation can support short-listing of useful explanations for further analysis and optimisation [19]. At the very least, we want to know that a given explanation is better (more informative) than a random explanation, and in general we want to be able to select the best explainer for a given dataset.
Table 1: Outcome of AMEE: a measure to evaluate multiple explanation methods. Explanation Power measures the informativeness of each explanation, taking values from 0 to 1, where 0 is worst and 1 is best.
In this paper, we present A Model-agnostic framework for Explanation Evaluation for Time Series Classification (AMEE). Specifically, we focus on explanations in the form of a saliency map and consider their informativeness within a defined computational scope, in which a more informative explanation means a higher capacity to influence classifiers to identify a class. We show that the saliency-guided perturbation of discriminative subsequences results in a reduced accuracy of classifiers. The higher the impact of a perturbation, the more informative are the perturbed time series subsequences. Estimation of this impact, measured by a committee of highly accurate referee classifiers, can reveal the informativeness of the explanation. This is the key idea behind AMEE, a post-hoc approach which uses a set of classifiers and explainers to recommend the best explainer for a given time series classification dataset.
Our work addresses an overlooked area of research: robust comparison and ranking of multiple explanation methods for time series classification. Our main contributions are:
- A robust, model-agnostic, ensemble-based explanation evaluation framework. First, we leverage the use of multiple data perturbation strategies to create explanation-guided noisy data. Using synthetic data, we empirically show that applying multiple data perturbation strategies is particularly useful when the data is hard to classify, as such data is often more sensitive to the data perturbation type. We also show that a committee of referee classifiers is useful to reduce the potential bias that one single referee classifier may have. Our experiments demonstrate that a committee approach involving multiple types of data perturbations and multiple classifiers leads to explanation evaluation and ranking that better agrees with the explanation ground truth (synthetic data) and domain expert ground truth (real data).
- A standardised evaluation measure (Explanation Power) that is comparable across different explanation methods, referee classifiers and datasets.
- An empirical study on both synthetic and real datasets with recent state-of-the-art time series classifiers and explanation methods. We verify the evaluation methodology with annotated, real datasets.
All data, code and detailed results are available.
In the next sections we review related Explainable AI research including both time series specific and general methods (Section 2). We then define related concepts (Section 3) and describe our proposed solution (Section 4). We discuss experiments on both synthetic and real time series datasets, with detailed case studies (Section 5). We discuss important considerations for practitioners when using AMEE to evaluate and recommend explanations for time series classification in Section 6. Finally, we summarize our results and discuss future work in Section 7.

Explanation Methods for Time Series Classification
As deep learning has achieved high performance in machine learning domains such as computer vision [33,59], the research community started to develop techniques to explain these black-box models to understand why they work so well [51,54,55,57,62]. These explanations are in the form of a saliency map, visualizing the important pixels in an image by computing a saliency weight for each pixel (a type of feature importance). This saliency map, combined with the original image, can reveal whether a black-box model focuses on the correct area of the image and explain the model in a visually friendly way.
Explainable AI (XAI) methods for Time Series Classification have advanced in parallel with general XAI progress [60]. Although recent works on instance-based methods, such as factual and counterfactual explanations [16,23,61], have become popular, the majority of explanation methods exist in the form of saliency maps [24,32,40,43,48,54,62], where a map visualizes the importance weight vector w and highlights the discriminative areas of a time series for the classification task. These saliency-based explanations can either be extracted directly from the classifier (intrinsic explanation), or indirectly by applying a post-hoc explanation method to the black-box classifier (post-hoc explanation).

Intrinsic Explanation
Explanation from MrSEQL Time Series Classifier. MrSEQL [34] is a time series classification algorithm that is intrinsically explainable. The algorithm converts the numeric time series vector into strings, e.g., by using the SAX [35] transform with varying parameters to create multiple symbolic representations of the time series. The symbolic representations are then used as input for SEQL [27], a sequence learning algorithm, to select the most discriminative subsequences for training a classifier using logistic regression. The symbolic features combined with the classifier weights learned by logistic regression make this classification algorithm explainable. For a time series, the explanation weight of each data point is the accumulated weight of the SAX features that it maps to. These weights can be mapped back to the original time series to create a saliency map to highlight the time series parts important for the classification decision. We call the saliency map explanation obtained this way MrSEQL-SM. When using the weight vector from MrSEQL-SM, we take the absolute value of the weights to obtain a vector of non-negative weights. Figures 2 and 3 show an example of the saliency map explanation obtained directly from the MrSEQL classifier weights, for the Coffee and GunPoint datasets from the UCR Archive [15].
Explanation from a Generic, White-box Classifier. A generic, white-box classifier such as Logistic Regression or Ridge Regression has been the primary source of feature importance (by using the learned model weights), especially for tabular data [26]. These classifiers and their explanations are computationally cheap and can be useful for time series data [20].

Posthoc Explanation
Gradient-based Explanation. This approach uses the gradients from a trained deep neural network to infer explanations. Notable methods are Integrated Gradient [57], GradientSHAP [38], GradCAM [51] and CAM [62]. For time series classification and explanation, the most common classifier is ResNet [30] combined with some of the explanation methods mentioned.
Perturbation-based Explanation. This type of method infuses noise into the data to create data variations and to infer the degree of data point importance [1,12]. Notable methods are Feature Occlusion [58] and LIME [47]. One of the most popular post-hoc explanation methods is SHAP [37], which explains any machine learning model using a game theoretic approach in which all feature coalitions are evaluated. Feature importance is then calculated using the classic Shapley value [56]. Figures 2 and 3 show an example of the saliency map explanation obtained by applying SHAP to the MrSEQL classifier to get a post-hoc explanation for each of the Coffee and GunPoint datasets from the UCR Archive [15].
Fig. 2: Saliency map from two explanation methods on two examples from the Coffee dataset: the bottom row is an explanation from MrSEQL Classifier (intrinsic explanation); the top row is an explanation from SHAP, a post-hoc explanation method based on MrSEQL Classifier.

Quantitative Evaluation of Saliency-based Explanation
Quantitative evaluation of explanations for time series data was a relatively untouched topic until recently. Unlike image and text, time series data often do not have annotated ground truth explanation; hence, it remains a challenge to determine whether a saliency-based explanation is correct. Approaches to benchmark and evaluate faithfulness of recent explanation methods overcome this problem by using synthetic datasets with assigned ground-truth [14,28].
Fig. 3: Saliency map from two explanation methods on two examples from the GunPoint dataset: the bottom row is an explanation from MrSEQL Classifier (intrinsic explanation); the top row is an explanation from SHAP, a post-hoc explanation method based on MrSEQL Classifier.
Other research ventures into real datasets, yet these efforts focus on examining explanations by a single classifier [14] or averaging a non-comparable metric across multiple datasets [50]. The approach in [22] uses a white-box classifier to get a pseudo ground-truth explanation (a) and evaluates a post-hoc, localized explanation method (b) by estimating cosine distance between (a) and (b). However, this method assumes that the white-box classifiers can always produce explanations of ground-truth quality. We show in our experiments that this is not the case. Notably, [3,42,50] propose methods to quantify explanation methods; however, there are a few problems with the comparison: the use of a single perturbation type is problematic as it cannot always distinguish between explanations, the metric used (change in accuracy) is not comparable across the selected datasets, the individual effect is not separated (only average change in accuracy is reported), and there is no discussion involving explanation ground-truth. Additionally, there is little discussion in previous work about the impact of classifier accuracy on evaluating the explanation methods that are based on those classifiers. This is an important point, as the evaluation can only be trusted if the classifiers are reliable. Furthermore, there are cases where multiple classifiers have high accuracy and are tied in this regard, but the explanations obtained from the classifiers may disagree, and in some cases could be ranked worse than a random explanation. Hence it is not clear which classifier and explanation to select in such cases.

Time Series & Time Series Dataset
A time series $X = [x_0, x_1, \ldots, x_{l-1}]$, $x_i \in \mathbb{R}$, is a sequence of $l \in \mathbb{N}$ real values that are recorded values of a synthetic or real process. In this definition, $l$ is also called the number of time steps or the length of the time series $X$, and $x_i$ are the data points or time points.
A time series dataset $D$ consists of $n \in \mathbb{N}$ time series of equal length $l$ that are recorded from a single process. If the time series are not of equal length, it is common to pad with zeroes or use resampling to bring them to equal length.

Saliency-based Explanations for Time Series
In the context of this paper, we only consider explanations in the form of saliency maps. A saliency map to explain time series $X$ is a vector of numerical weights $M = [w_0, \ldots, w_{l-1}]$ where $w_i \in \mathbb{R}$ and $l$ is the length of $X$. The value $w_i$ expresses the importance (or saliency) of time point $i$ in making the prediction for $X$. This vector can be obtained from annotation (by a human) or computed by an explanation method. The explanation method can come from a white-box classification model (intrinsic explanation) or a black-box classifier coupled with a post-hoc explanation method (post-hoc explanation). The weights $w_i$ are typically rescaled to [0, 1].
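As a minimal sketch of the rescaling step, the snippet below assumes min-max scaling, which is one common way to bring weights into [0, 1]; the text does not state which rescaling is used. The function name `rescale_saliency` is hypothetical.

```python
import numpy as np

def rescale_saliency(weights):
    """Min-max rescale a saliency map to [0, 1] (assumed scheme).

    A constant map carries no ranking information and is returned as zeros.
    """
    w = np.asarray(weights, dtype=float)
    span = w.max() - w.min()
    if span == 0:
        return np.zeros_like(w)
    return (w - w.min()) / span

# Hypothetical raw weights for a series of length 4.
m = rescale_saliency([-2.0, 0.0, 1.0, 2.0])
```

Min-max scaling preserves the ordering of the weights, which is what the percentile-based selection of time steps relies on.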

Random Explanation
For sanity checks, we use saliency maps generated through random sampling as a lower bound on explanation quality. Here, the weights $w_i$ are drawn from a random uniform distribution. Like a dummy classifier, this random explanation serves as a baseline for any reasonable explanation method, i.e., they all should be better than random guessing. Nonetheless, there are situations where a random explanation outperforms a method-based explanation. Specifically, when a method-based explanation highlights non-discriminative parts, or fails to identify any discriminative parts, that explanation can be considered worse than a random explanation.
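A random explanation of this kind can be sketched in a few lines. The sampling interval [0, 1) and the function name `random_saliency` are assumptions made for illustration; the text only states that the weights are uniform.

```python
import numpy as np

def random_saliency(n_series, length, seed=None):
    """Draw one saliency map per series, weights uniform on [0, 1)."""
    rng = np.random.default_rng(seed)
    return rng.uniform(0.0, 1.0, size=(n_series, length))

# One random map for each of 3 series of length 8.
maps = random_saliency(n_series=3, length=8, seed=0)
```

Fixing the seed makes the baseline reproducible across repeated runs.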

Oracle Explanation
In cases where explanation ground truth is available (e.g., for synthetic datasets or from domain experts), this should be the gold standard for any explanation method. We generally expect any explanation method to rank between the random and the oracle explanations.

Methodology
In this section, we describe our proposed methodology using concepts described in Section 3. Specifically, we present the blueprint of AMEE in Figure 5. The framework involves a labelled time series dataset (split into training and test datasets), a set of explanation methods to be compared, and a set of evaluating classifiers (referee classifiers). The output of the framework is the explanation power of each explanation method (see Table 1).

Explanation-Guided Data Perturbation
A good saliency-based explanation for a time series should highlight its discriminative part(s) that contain class-specific information to distinguish from other classes. Data perturbation is the process of adding noise to the data by replacing selected time points in the time series. Explanation-guided data perturbation uses a saliency-based explanation to determine the specific time points of the time series to be perturbed. As a result, the more informative the explanation, the larger the expected decrease in classifier accuracy, because that perturbation removes important class-specific information in the respective time series. Given a threshold k (0 ≤ k ≤ 100), the discriminative parts of a time series of l steps are segmented using the top k percent of the weights in M. This is the set of k · l/100 time steps that have the highest weights in the saliency map M. Varying k allows us to control the scope of the perturbation. At k = 0, the time series is the original; at k = 10, only 10 percent of the time steps (those that are most discriminative according to the explanation) are perturbed; at k = 100 the entire time series is perturbed.
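The selection of the time steps to perturb can be sketched as follows. The helper name `top_k_mask` is hypothetical, and rounding k · l/100 to the nearest integer is an assumption, since the text does not say how fractional counts are handled.

```python
import numpy as np

def top_k_mask(saliency, k):
    """Boolean mask marking the k percent of time steps with the highest saliency."""
    w = np.asarray(saliency, dtype=float)
    n_select = int(round(k * w.size / 100))  # assumed rounding rule
    mask = np.zeros(w.size, dtype=bool)
    if n_select > 0:
        # argsort is ascending, so the last n_select indices hold the largest weights
        mask[np.argsort(w, kind="stable")[-n_select:]] = True
    return mask

# Hypothetical saliency map for a series of length 10.
saliency = [0.1, 0.9, 0.8, 0.2, 0.0, 0.3, 0.7, 0.4, 0.5, 0.6]
mask_20 = top_k_mask(saliency, 20)  # marks the two most salient steps
```

At k = 0 the mask is empty and the series stays untouched; at k = 100 every step is marked.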

Referee Classifiers
In our work we employ a set of independent and accurate classifiers that are trained with the original training set and are used to evaluate the target explanations on the test set. This committee is formed of member classifiers that we call Referee Classifiers. In order to evaluate the explanation methods, our framework measures the impact of each explanation-guided data perturbation on the accuracy of the referee classifiers R. We select the referees based on recent empirical benchmarks on TSC [39].

Data Perturbation Strategy: Multiple Perturbations
In Figure 4 we explain and visualize four strategies to perturb the discriminative areas of a time series, as guided by a given explanation [41]. These strategies are either time-step dependent (local perturbation, using only the t-th step information) or time-step independent (global perturbation), using either Gaussian-based or single value replacement. With these strategies, discriminative time steps are replaced with noisy values, either by replacing the original time series values with a patch of constant values (like a grey mask in an image) or a patch of random Gaussian noise values (like a noise mask in an image). Let n be the number of time series in a dataset D, each with l time steps. We want to perturb one test time series of size 1 × l, so its t-th value $x_t$ is replaced with a new value $r_t$. We define the global and local profile for this time step perturbation as follows.
Fig. 4: Time Series Data Perturbation strategy: An example time series with a known saliency map (left) is perturbed using mean or Gaussian noise using local time steps (local) or global time steps (global) across the entire dataset on its most discriminative region (in this example we perturb the top 20% values according to the highest saliency weights).
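Local perturbation (statistics of the t-th time step across the n time series of the dataset, where $x_{i,t}$ denotes the t-th value of the i-th time series):
$\mu_t = \frac{1}{n}\sum_{i=1}^{n} x_{i,t}, \quad \sigma_t^2 = \frac{1}{n}\sum_{i=1}^{n} (x_{i,t} - \mu_t)^2$
Global perturbation (statistics over all time steps of all time series in the dataset):
$\mu = \frac{1}{n \cdot l}\sum_{i=1}^{n}\sum_{t=0}^{l-1} x_{i,t}, \quad \sigma^2 = \frac{1}{n \cdot l}\sum_{i=1}^{n}\sum_{t=0}^{l-1} (x_{i,t} - \mu)^2$
With these local-based and global-based profiles, we can define the perturbation $r_t$ accordingly. We use four perturbation strategies, two local and two global perturbation types.
Local mean: $r_t = \mu_t$.
Local Gaussian: $r_t \sim N(\mu_t, \sigma_t^2)$.
Global mean: $r_t = \mu$.
Global Gaussian: $r_t \sim N(\mu, \sigma^2)$.
Figure 4 illustrates an example of how the four strategies effectively modify the original time series in the regions identified by the explanation weights. We show in our experiments that it is important to use a set of perturbation strategies, rather than a single fixed perturbation.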
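The four strategies can be sketched as below. This is an illustration rather than the authors' implementation: it assumes that local statistics are the per-time-step mean and standard deviation across the dataset, and that global statistics are computed over all values in the dataset. The function name `perturb` and the strategy strings are hypothetical.

```python
import numpy as np

STRATEGIES = ("local_mean", "local_gaussian", "global_mean", "global_gaussian")

def perturb(x, mask, dataset, strategy, rng):
    """Replace the masked steps of one series x (length l) using statistics of dataset (n x l)."""
    if strategy not in STRATEGIES:
        raise ValueError(f"unknown strategy: {strategy}")
    x_new = np.array(x, dtype=float)
    scope, kind = strategy.split("_")
    if scope == "local":
        # one mean and one standard deviation per time step
        mu, sigma = dataset.mean(axis=0), dataset.std(axis=0)
    else:
        # a single mean and standard deviation shared by all time steps
        mu = np.full(x_new.size, dataset.mean())
        sigma = np.full(x_new.size, dataset.std())
    if kind == "mean":
        x_new[mask] = mu[mask]
    else:
        x_new[mask] = rng.normal(mu[mask], sigma[mask])
    return x_new

rng = np.random.default_rng(0)
dataset = np.array([[0.0, 2.0, 4.0], [2.0, 4.0, 6.0]])  # toy dataset, n = 2, l = 3
x = np.array([10.0, 10.0, 10.0])
mask = np.array([True, False, True])
local_mean = perturb(x, mask, dataset, "local_mean", rng)
global_mean = perturb(x, mask, dataset, "global_mean", rng)
local_gauss = perturb(x, mask, dataset, "local_gaussian", rng)
```

In this toy case the per-step means are 1, 3 and 5 and the global mean is 3, so the two mean strategies replace the masked steps with different constants while the unmasked step keeps its original value.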

The AMEE Framework for Evaluating Explanations
Figure 5 summarizes the components and steps in the AMEE framework. Our framework requires a labeled time series dataset ($D$), a set of explanations ($M$) to evaluate, and a set of referee classifiers ($R$) to be trained on a subset of the dataset. With these elements, the following steps are done to record the necessary information to calculate evaluation metrics:
0. Split the labeled dataset $D$ into training ($D_{train}$) and test ($D_{test}$);
1. Train Referee Classifier(s) ($R$) with $D_{train}$;
2. Use each explanation in $M$ to create a step-wise, explanation-based perturbation on $D_{test}$;
3. Measure the accuracy of each trained referee in $R$ on these perturbed datasets $D'_{test}$.
The output of this process is the accuracy on the perturbed dataset $D'_{test}$ at various thresholds (k), serving as an indicator of how much an explanation-based perturbation impacts the referees. A significant drop in accuracy in the first few steps of the explanation-guided perturbation (e.g., at k = 10 or k = 20) signals that meaningful, salient data points are disturbed based on the explanation. Hence, explanations that correctly identify such salient regions are likely to be informative.
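The steps above can be sketched end to end on toy data. Everything here is illustrative: `NearestCentroid` is a toy stand-in for a referee (not one of the classifiers used in the paper), the replacement value is a per-step mean, the noise-free data are hypothetical, and the test set reuses the training series purely to keep the example short.

```python
import numpy as np

class NearestCentroid:
    """Toy referee: predicts the class whose mean series is closest."""
    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.centroids_ = np.stack([X[y == c].mean(axis=0) for c in self.classes_])
        return self

    def predict(self, X):
        d = ((X[:, None, :] - self.centroids_[None, :, :]) ** 2).sum(axis=2)
        return self.classes_[d.argmin(axis=1)]

def accuracy_curve(referee, X_test, y_test, saliency, fill, thresholds):
    """Accuracy of a trained referee after perturbing the top-k% salient steps."""
    n, l = X_test.shape
    order = np.argsort(-saliency, axis=1, kind="stable")  # most salient first
    rows = np.arange(n)[:, None]
    accs = []
    for k in thresholds:
        cols = order[:, : int(round(k * l / 100))]
        X_pert = X_test.copy()
        X_pert[rows, cols] = fill[cols]  # replace selected steps with the fill value
        accs.append(float((referee.predict(X_pert) == y_test).mean()))
    return np.array(accs)

# Toy data: the two classes differ only at time steps 4 and 5.
proto0 = np.zeros(10)
proto1 = np.zeros(10)
proto1[4:6] = 3.0
X_train = np.vstack([np.tile(proto0, (4, 1)), np.tile(proto1, (4, 1))])
y_train = np.array([0] * 4 + [1] * 4)
X_test, y_test = X_train.copy(), y_train.copy()

referee = NearestCentroid().fit(X_train, y_train)  # step 1
fill = X_train.mean(axis=0)                        # per-step mean as replacement value
informative = np.tile((proto1 > 0).astype(float), (8, 1))    # highlights steps 4 and 5
uninformative = np.tile(np.eye(10)[0] + np.eye(10)[1], (8, 1))  # highlights steps 0 and 1
thresholds = [0, 20, 100]
curve_inf = accuracy_curve(referee, X_test, y_test, informative, fill, thresholds)
curve_unf = accuracy_curve(referee, X_test, y_test, uninformative, fill, thresholds)
```

In this toy setting, perturbing the steps marked by the informative map removes all class information at k = 20, whereas the uninformative map leaves the referee's accuracy unchanged until the whole series is perturbed.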

Explanation AUC
We measure the impact of each explanation by estimating the Area Under the Curve (AUC) of its explanation-guided perturbation. Specifically, the accuracy scores at each threshold (k) are translated into an Explanation-AUC (EAUC) using the trapezoidal rule:
$EAUC = \frac{1}{2}\,\Delta k_0 \sum_{i=1}^{q} (acc_{i-1} + acc_i)$
Here $\Delta k_0$ denotes the difference in value of each step normalized to the 0-1 range ($\Delta k_0 = \frac{1}{100}\Delta k$); $q$ denotes the number of steps ($q = \frac{100}{\Delta k}$); $acc_i$ is the accuracy at step $i$. If we perturb the dataset with $q$ steps, we will have a total of $q + 1$ data points for accuracy scores. For example, if the perturbation is done in $q = 10$ steps, each step will correspond to a difference of $\Delta k = 10$ percentage points in perturbation threshold (i.e., 0%, 10%, ..., 100%). The step for k = 0 corresponds to the original test dataset, while the step for k = 100 corresponds to adding noise to the entire time series.
With this estimation, a smaller EAUC means a higher impact (accuracy loss) of the explanation method (Figure 6). The Explanation AUC is computed for each combination of Perturbation, Referee and Explanation (Figure 7: Step 1).
Fig. 6: Changes of accuracy measured by a referee classifier for two explanation methods (red and blue) at each threshold level k. When a signal is perturbed based on a more informative explanation, this signal becomes harder for the referee to classify correctly, leading to a more severe drop in accuracy. This impact is measured by the Explanation AUC, or the area under the curve (AUC) of these changes in accuracy at different thresholds k. The curve with lower explanation AUC (red curve) results from perturbation guided by a more informative explanation method.
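A minimal sketch of the trapezoidal estimate is given below, assuming equally spaced thresholds from 0 to 100. The function name `explanation_auc` and the accuracy values are hypothetical.

```python
import numpy as np

def explanation_auc(accuracies, delta_k):
    """Trapezoidal area under the accuracy curve, with thresholds rescaled to [0, 1].

    accuracies: q + 1 accuracy scores at thresholds 0, delta_k, ..., 100.
    """
    acc = np.asarray(accuracies, dtype=float)
    q = acc.size - 1
    if not np.isclose(q * delta_k, 100):
        raise ValueError("expected q + 1 scores with q * delta_k == 100")
    dk0 = delta_k / 100.0
    return float(0.5 * dk0 * np.sum(acc[:-1] + acc[1:]))

# Hypothetical accuracy curves at k = 0, 50, 100.
eauc_informative = explanation_auc([1.0, 0.5, 0.5], delta_k=50)
eauc_uninformative = explanation_auc([1.0, 1.0, 0.5], delta_k=50)
```

The curve that drops earlier yields the smaller area, which matches the reading that a lower EAUC indicates a more informative explanation.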

Robustness of the AMEE Framework
Two key aspects of AMEE aim to make the framework more robust: employing multiple Data Perturbation strategies and multiple Referee Classifiers. Specifically, for each explanation in $M$, we use different data perturbation strategies (as described in Section 4.3) to create explanation-based perturbations on the dataset $D_{test}$. Additionally, multiple referee classifiers are trained on $D_{train}$ and their accuracy is measured on the perturbed $D'_{test}$. The various data perturbation strategies represent different ways that salient parts of the data can be replaced with noise. Unlike image data that is standardized in RGB, time series data is more dynamic: it can belong to many different domains, be collected from various sources, or be preprocessed in different ways. These characteristics of time series data make it harder to use one single method to mask out a specific part of the data. Using a variety of data perturbations makes it more likely that the relevant parts of the signal are masked out effectively. We further investigate using multiple Data Perturbation strategies in Section 5.3.2.
Referee classifiers, pre-trained on the original, non-perturbed data, are used to evaluate the impact of data perturbation. Thus, the evaluation by referee classifiers is dependent on the properties of these classifiers, such as in-classifier data normalization, feature extraction, and feature processing. Having multiple referee classifiers can reduce potential biases introduced by using one single classifier. We analyze this characteristic and show the benefits of using multiple referee classifiers in Section 5.3.3.

Standardization and Explanation Power
AMEE employs multiple perturbation strategies and multiple referee classifiers. As the EAUC measures depend on the choice of referees and perturbation strategies, they are not directly comparable. The next steps (Figure 7: Steps 2-5) standardize and aggregate the EAUC to compute the final output of the framework, the Explanation Power.
Step 2 rescales the Explanation AUC to the same range [0, 1] for each row (i.e., each pair of Referee and Perturbation). Since each referee responds to changes in the perturbed dataset differently, this normalization is performed for each pair of Referee and Perturbation to ensure that the Explanation AUC is comparable across the different explanation methods in the evaluation. The red highlighted row is an example. After rescaling the Explanation AUC, the Average Scaled EAUC is computed in Step 3 as the average of each column in Step 2. This simple average can be computed because the individual Explanation AUC values are already normalized to the [0, 1] range and comparable across each Referee-Perturbation pair. For example, the Average Scaled EAUC of Rocket-SHAP is (0.43 + 0.26 + 0.69 + 0.42 + 0.33 + 0.67)/6 = 2.80/6 ≈ 0.47.
In Step 4, the Average Scaled EAUC is again rescaled to the range between 0 and 1. The result is the Average Scaled Rank (lower is better). The Explanation Power is simply the complement of the Average Scaled Rank (1 − Average Scaled Rank), i.e., higher is better. Details of this calculation are summarised in Algorithm 1.
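The standardization and aggregation steps can be sketched as follows, assuming that "rescaling to [0, 1]" means min-max scaling. The function names and the EAUC values are hypothetical, and the sketch is not a transcription of Algorithm 1.

```python
import numpy as np

def minmax(a, axis=None):
    """Min-max rescale to [0, 1] along an axis; constant slices map to 0."""
    a = np.asarray(a, dtype=float)
    lo = a.min(axis=axis, keepdims=True)
    span = a.max(axis=axis, keepdims=True) - lo
    return (a - lo) / np.where(span == 0, 1.0, span)

def explanation_power(eauc):
    """eauc: rows are (referee, perturbation) pairs, columns are explanation methods."""
    scaled = minmax(eauc, axis=1)   # Step 2: rescale each row to [0, 1]
    avg_scaled = scaled.mean(axis=0)  # Step 3: average each column
    avg_rank = minmax(avg_scaled)   # Step 4: rescale the averages to [0, 1]
    return 1.0 - avg_rank           # Step 5: higher is better

# Hypothetical EAUC values: 2 referee-perturbation pairs x 3 explanation methods.
eauc = np.array([[0.2, 0.5, 0.8],
                 [0.3, 0.4, 0.9]])
power = explanation_power(eauc)
```

With these values, the first column has the lowest EAUC in every row and receives power 1, the last column has the highest and receives power 0, and the middle column lands at about 0.67.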

Experiments
In this section, we evaluate the performance of the AMEE framework in three groups of experiments in ascending order of difficulty. In the simplest case, we want to validate AMEE with synthetic datasets with known explanation ground-truth [28]. Next, we measure the performance of the framework with a diverse set of time series classification datasets from the UCR Time Series Classification Archive covering popular domains that require explanation [15]. Finally, we test our framework on a real dataset and compare the result with ground-truth explanations provided by domain experts. Our experiments are repeated 5 times and the reported results are the average of these repetitions.

Referee Classifiers
We employ 5 candidates for referee classifiers in our experiment, selected based on their accuracy, speed and diversity of approach [49]: baseline 1NN-DTW (distance-based) [13], MrSEQL (dictionary-based, time domain) [34], ROCKET (convolution-based) [17], RESNET (deep learning) [25,30] and WEASEL 2.0 (dictionary-based, frequency domain) [49]. As the choice of referees is a critical component in our framework, we carefully select classifiers that perform well in accuracy on all studied datasets. For a classifier to be selected in the referee committee, it has to achieve at least the average accuracy of all candidates for referee classifiers, and this number has to be higher than the theoretical accuracy achieved by a random classifier. In case the average accuracy is over 90%, the threshold to choose referees is set to 90%. Capping the threshold at 90% when the average accuracy is relatively high lets us include referees that do have high performance but fall slightly below the average accuracy. For example, in a theoretical case when the accuracies of 5 candidate classifiers are 0.90, 0.95, 0.97, 0.98 and 0.99 (average 0.958), we want to include all the referees as they all have relatively high performance, even though two of them are slightly below the average accuracy. For some datasets, all the classifiers are tied or very close in accuracy. Details of the referee accuracy are presented in the Appendix.
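The selection rule can be sketched as below. Two points are interpretations rather than statements from the text: the chance level is taken as 1/(number of classes), which assumes balanced classes, and the chance-level condition is applied to each candidate's own accuracy. The function name and candidate labels are hypothetical.

```python
def select_referees(accuracies, n_classes, cap=0.90):
    """Pick referees from a dict mapping candidate name to test accuracy.

    The threshold is the mean candidate accuracy, capped at `cap`.
    """
    mean_acc = sum(accuracies.values()) / len(accuracies)
    threshold = min(mean_acc, cap)
    chance = 1.0 / n_classes  # assumed chance level for balanced classes
    return [name for name, acc in accuracies.items()
            if acc >= threshold and acc > chance]

# The theoretical case from the text: the mean is 0.958, so the cap of 0.90 applies.
high = {"c1": 0.90, "c2": 0.95, "c3": 0.97, "c4": 0.98, "c5": 0.99}
committee_high = select_referees(high, n_classes=2)

# A hypothetical harder dataset: the mean is 0.69, so only candidates at or above it stay.
low = {"c1": 0.50, "c2": 0.72, "c3": 0.85}
committee_low = select_referees(low, n_classes=2)
```

In the first case all five candidates are kept; in the second, the candidate at chance level is dropped.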

Explanation Methods
In our experiments, we evaluate 8 popular explanation methods with diverse properties as described in Table 2.
We use the authors' implementations for LIME [47] and MrSEQL [34], the captum [32] library for gradient-based explainers, the time-explain library [41] for SHAP, and sklearn [10] to implement the remaining classifiers and explainers. We have considered a few other recent explainers, e.g., LIMESegment [52], but they proved too slow to be feasible to run on all our datasets. Since our goal is to rank a set of given explainers, rather than promote any particular explainer, we consider this explainer set to be sufficient to validate our methodology. Before presenting the experiment result of our evaluation on the synthetic datasets, we discuss the effect of the Data Perturbation strategy (Section 5.3.2), investigate the impact of Referees (Section 5.3.3), and perform a sanity check for the classifier quality used for model-agnostic post-hoc explanation methods such as LIME and SHAP (Section 5.3.4).

Impact of Data Perturbation Strategy
Figure 9 shows the boxplots of Explanation Power for different data perturbation strategies. In datasets which are "easier" to classify (i.e., most classifiers get close to 100% accuracy) such as CAR and NARMA, the Explanation Power does not change with the perturbation strategy. On the other hand, we observe a larger change in Explanation Power when data is harder to classify (for example, in GaussianProcess datasets). We additionally present plots showing changes of Explanation Power in two extreme cases, the SmallMiddle CAR dataset (easy-to-classify) and the RareTime GaussianProcess dataset (hard-to-classify), when different perturbations are gradually introduced (Figure 10). Notably, for the harder dataset RareTime GaussianProcess, having more perturbation methods brings the evaluation results closer to the ground truth. Specifically, for the Oracle explanation, if only a single perturbation method was used, such as Local Mean or Local Gaussian, the evaluation result would rank the Oracle explanation as the 6th best explanation method. However, when more perturbations are introduced, the Oracle explanation is evaluated more robustly, ranking as the best explanation and better aligned with the ground truth. For the Oracle (upper bound of explanation) and Random (lower bound of explanation) explanations, we generally observe that they are the most and least informative methods, respectively.
Fig. 9: Impact of data perturbation strategy on Explanation Power for each explanation method. A smaller box-range (which comes from 4 perturbation methods) indicates a smaller change of the Explanation Power with different perturbation strategies. For datasets that are "easier" to classify by referees, this range is often smaller than that of "harder"-to-classify datasets.

Impact of Referee Classifiers
Fig. 10: Changes of Explanation Power when different Perturbations are sequentially introduced. The sequence of perturbations is: Local Mean, Local Gaussian, Global Mean, and Global Gaussian (less to more extreme perturbation). The two example datasets are SmallMiddle CAR (easy-to-classify dataset) and RareTime GaussianProcess (hard-to-classify dataset). For the harder dataset RareTime GaussianProcess, the relative position of the Explanation Methods changes, indicating that having multiple types of perturbation methods is helpful when the dataset is hard to classify. Specifically, if only a single perturbation method was used, such as Local Mean or Local Gaussian, the evaluation would rank the Oracle explanation as the 6th best explanation method. However, when more perturbations are introduced, the Oracle explanation is evaluated more robustly and is ranked as the best explanation, closer to the ground truth.

Similar to the previous investigation on the impact of the perturbation strategy, we now inspect how the Explanation Power changes with respect to the set of referees, and present the result in Figure 11. Here, we also notice a relatively consistent Explanation Power among different referee classifiers in datasets that are easier to classify (such as the CAR and NARMA datasets). In datasets that are harder to classify (for example, the Gaussian Process datasets), we observe a larger spread in the Explanation Power of the explanation methods across referee classifiers.
The Explanation Power of the Random and Oracle explanations takes the expected values for the evaluated datasets. We present the change of Explanation Power when different referees are sequentially introduced in the two extreme cases, the SmallMiddle CAR dataset (easy-to-classify) and the RareTime GaussianProcess dataset (hard-to-classify) (Figure 12). We observe that for the RareTime GaussianProcess dataset, which is hard to classify, a committee of highly accurate referees is desirable: it helps to reduce the potential bias of a single referee and can lead to a more stable evaluation. Specifically, if only a single referee was employed, the evaluation would have ranked the Oracle explanation as the 2nd best explanation method. However, when more referees are introduced, the Oracle explanation is evaluated more robustly and is ranked as the best explanation, closer to the ground truth.

Fig. 11: Impact of referees on Explanation Power for each explanation method. A smaller box-range (which comes from 5 referee classifiers) signals a smaller change of Explanation Power across referees. Besides, in a specific dataset, the relative position of this range indicates the level of critical difference in opinions of referees in their votes. Nevertheless, having a committee of referees that are highly accurate is generally desirable.
Similarly, we note that for some real datasets, several referee classifiers that are highly accurate can disagree in their evaluation ranking. In such cases, having multiple referees leads to a considerably more robust and reliable result. We show an example using a real dataset with domain expert ground truth (the Counter Movement Jump dataset) in a later section (Section 5.5.3).

Sanity Check for the Impact of the Base Classifier Quality
Model-agnostic post-hoc methods such as LIME and SHAP derive explanations based on a classifier of any type. Thus, these explanations are dependent on the performance of the base classifier. For example, the ROCKET-SHAP explanation is created by applying SHAP (explanation method) on ROCKET (base classifier). If the base classifier has low accuracy on the sample dataset, the explanation based on that classifier may not be as good as one based on a more accurate classifier. In our experiment, we get LIME and SHAP explanations from two sources: the MrSEQL classifier [34] and the ROCKET classifier [17]. We observe that ROCKET achieves higher accuracy than MrSEQL in datasets created from Pseudo Periodic, Harmonic, and Gaussian Process (Table 10 in Appendix). We compare the two pairs of explanations (MrSEQL-based and ROCKET-based) from LIME and SHAP as a sanity check. Our experiment confirms that in both cases, under the AMEE evaluation approach, ROCKET-LIME and ROCKET-SHAP are considered better explanation methods than MrSEQL-LIME and MrSEQL-SHAP, respectively (Figure 13). This sanity check confirms our intuition that the quality of the base classifier is an important factor in model-agnostic, post-hoc explanation methods such as LIME and SHAP.

Fig. 12: Changes of Explanation Power when different Referees are sequentially introduced. The two example datasets are SmallMiddle CAR (easy-to-classify dataset) and RareTime GaussianProcess (hard-to-classify dataset). The sequence of addition of referees is from lowest to highest accuracy, filtered to represent the most reliable classifiers among the ones used for evaluation. Details of this sequence are in Section 8. For the harder dataset RareTime GaussianProcess, the relative position of the Explanation Methods changes, indicating that having a set of referees is helpful and leads to more stable results. Specifically, if only a single referee is employed, the evaluation could have ranked the Oracle explanation as the 2nd best explanation method. However, when more referees are introduced, the Oracle explanation is evaluated more robustly and is ranked as the best explanation, closer to the ground truth.

Results
Using a committee of 5 referee classifiers and 4 data perturbation strategies, we evaluate 10 explanation methods (8 computed explainers plus the lower bound explanation (Random) and upper bound explanation (Oracle)) using AMEE. The resulting Explanation Power is presented in Table 3.
Using a threshold of 0.5 of the min-max normalized saliency score in [0,1], we determine the ground truth of whether a time point is salient. We compare the explanation methods with the ground truth explanation for each time point and calculate the F1-score (Table 4) to measure how good each method is in determining the saliency of a time point. For example, in the SmallMiddle CAR dataset, AMEE selects the best explanation method to be Oracle with Explanation Power of 1.00 (Table 3), and the second best explanation is the Saliency Map from RidgeCV (Explanation Power of 0.79, Table 3). This result is similar to the F1-score of these explanations using the ground truth (Table 4). Here, the Oracle explanation achieves 1.00 in F1-score (highest), followed by the RidgeCV saliency map, achieving 0.86 in F1-score (second-best). Similarly, we find that the ranks of the methods in Tables 3 and 4 are in high agreement. Moreover, even for the hardest dataset to classify (RareTime GaussianProcess), we observe that adding more referees brings the relative ranking of the evaluated explanations closer to the ground truth ranking (Table 5). This result further reinforces that using multiple referees is desirable: we observe that a highly accurate set of referees brings the explanation ranking closer to the ground truth. Overall, our result shows a high agreement between AMEE's computed Explanation Power and the F1-score using ground truth time-point importance evaluation, and confirms that the committee of referees is a desirable property of the explainer recommendation framework.

Fig. 13: Sanity Check of Base Model Quality with LIME and SHAP. Higher accuracy of the classifier base-model (ROCKET > MrSEQL) leads to higher Explanation Power for the corresponding explanations (ROCKET-LIME and ROCKET-SHAP have higher Explanation Power than MrSEQL-LIME and MrSEQL-SHAP).

Table 5: Dataset RareTime Gaussian Process: Explanation Power rank when referees are added sequentially for Top 5 Explanation methods. The sequence of referees is determined by the order of accuracy (least to most accurate, see Table 10). With the introduction of more referees, the rank of Explanation Power gets closer to the rank using F1 Ground Truth.
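The ground-truth comparison described in this section (binarizing min-max normalized saliency at 0.5 and scoring each method with the F1-score) can be sketched as follows. The helper names `binarize` and `saliency_f1` are hypothetical, and two details are assumptions: a point is treated as salient when its normalized score is at or above the threshold, and the explainer's saliency is binarized in the same way as the ground truth.

```python
def binarize(scores, threshold=0.5):
    """Min-max normalize scores to [0, 1] and mark points at or above the threshold."""
    lo, hi = min(scores), max(scores)
    if hi == lo:  # constant saliency: no point stands out
        return [0] * len(scores)
    return [1 if (s - lo) / (hi - lo) >= threshold else 0 for s in scores]

def saliency_f1(saliency, ground_truth, threshold=0.5):
    """F1-score of the thresholded saliency against the thresholded ground truth."""
    pred = binarize(saliency, threshold)
    true = binarize(ground_truth, threshold)
    tp = sum(1 for p, t in zip(pred, true) if p and t)
    fp = sum(1 for p, t in zip(pred, true) if p and not t)
    fn = sum(1 for p, t in zip(pred, true) if t and not p)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0
```

For a single series, `saliency_f1(method_saliency, oracle_saliency)` returns 1.0 when the two binarized maps coincide and decreases as the method misses salient points or flags non-salient ones.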

Comparison with Previous Work
Previous work [42] showcases initial results towards comparing Explanation Methods. However, the method utilizes only one type of perturbation, which replaces salient areas with Gaussian noise of low magnitude. While the magnitude of the Gaussian noise can be customized, it requires extra work from users to determine this parameter. This initial work also does not propose a way to standardise the explanation AUC across methods and datasets. Our new framework employs a combination of perturbation types that allows a higher impact in changing the original signals, resulting in a more robust framework and better results. We include the results of the framework (using the default settings) introduced in [42] in Table 6. This table shows that for many of the synthetic datasets, the small perturbation (default setting) added into the signal is too little and it fails to trigger changes in the classification accuracy. As a result, the past approach is unable to distinguish differences in the informativeness of different explanation methods.
Even with a much larger noise level (Figure 14), the previous framework does not provide a result as accurate as AMEE, especially for datasets that are difficult to classify. We include results of the framework introduced in [42] using higher noise magnitude in Table 7.

Data
We work with 15 datasets from the UCR Archive [15] that represent a variety of data sources and domains. These datasets are of 5 types: electrocardiogram (ECG), human motion (MOTION), device usage (DEVICE), device activities tracked by sensors (SENSOR) and spectroscopy (SPECTRO). Oracle explanation is not available for these datasets.

Table 7: Synthetic Datasets: Using higher perturbation magnitude with previous work in [42] to get Explanation Power for each of the 10 explanation methods evaluated. Even with extreme perturbation, results of previous work did not agree with the F1-measure in Table 4 as AMEE's Explanation Power does.

Results
We test explanations for these datasets with AMEE and report the result in Table 8. Since we do not have ground truth for the majority of these datasets, we use this experiment to show how AMEE can apply to real datasets. We note that the Random explanation sometimes outperforms a method-based explanation. This can happen as some explanation methods may not work well with certain datasets, resulting in unreasonable explanations that misleadingly highlight non-discriminative parts as discriminative, or fail to identify any significant discriminative parts at all. In this situation, the evaluation of random explanations can serve as a filter for reasonable explanation methods, and any methods that have lower performance than random should be filtered out.
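The filtering rule suggested above can be written in a few lines. This is only a sketch: `filter_by_random` is a hypothetical helper, and the scores in the usage note are invented for illustration rather than taken from Table 8.

```python
def filter_by_random(explanation_power, baseline="Random"):
    """Drop methods scoring below the random baseline; rank the rest, best first."""
    floor = explanation_power[baseline]
    kept = {m: p for m, p in explanation_power.items()
            if m != baseline and p >= floor}
    return sorted(kept, key=kept.get, reverse=True)

# Invented Explanation Power values for three placeholder methods.
example = {"method_a": 0.80, "method_b": 0.10, "method_c": 0.50, "Random": 0.30}
recommended = filter_by_random(example)
```

In this made-up case `method_b` scores below the random baseline and is removed, leaving `method_a` and `method_c` as candidates.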

Evaluation for Real Dataset with Expert Ground Truth
The Oracle explanation is the upper bound for any explanation method; however, it is only available in synthetic datasets. For real datasets the explanation ground truth is often available at an approximate level of precision, e.g., specifying the relative position of the shape and areas of importance. This approximate ground truth is widely used in other papers in evaluating explanation methods for images [31,51,62]. However, it is not readily available for time series data without opinions from data experts. Among the datasets evaluated in Section 5.4, Coffee [8], Counter Movement Jump (CMJ) [34], and GunPoint [46] have this information of the true important areas for each class of the dataset. In this section, we compare the saliency-based explanations evaluated by AMEE with the expert ground truth of important areas. In Table 8, the last row shows the count of occurrences when a method is selected as the most informative explanation according to AMEE.

Spectroscopy Dataset: Coffee
The Coffee dataset contains the spectroscopy samples of two types of coffee: Arabica and Robusta. This dataset was first introduced in [8] and is part of the UCR time series archive [15]. Figure 16 shows the top 3 and bottom 2 explanation methods ranked by AMEE. Notably, the discriminating region of the two classes of Coffee produced by the best explainer, ROCKET-SHAP, is the last peak region of the time series. This region was confirmed in the original paper [8] to contain information about the chlorogenic acid content of the sample that contributes to the difference between the two types of coffee (Figure 15). Arabica has a lower caffeine and chlorogenic acid content that contributes to its finer taste and greater market value. The region that MrSEQL-SHAP and MrSEQL-SM also highlight is another part of the spectrum that contains information about the chlorogenic acid content [8]. The worst explanation methods among those evaluated, GradientSHAP and a random explanation, show either very small, non-contiguous regions or randomly scattered regions of interest, and do not focus on parts of the time series that discriminate the two coffee types.

Video Motion Retrieval Dataset: GunPoint
The famous GunPoint dataset is the time-series translation from a video sequence involving actors performing two distinct actions: pointing to a target with a gun (Gun class) and pointing with their index fingers only (Point class). This dataset was introduced in [46] and is part of the UCR time series archive [15]. Figure 17 visualizes the examples of explanations from the best three methods, worst method, and random method for this dataset. The expert ground truth for the GunPoint dataset conveys that the two classes differ in the steps where the Gun class requires the actor/actress to lift their hand above a holster, then reach down for the gun. This distinct action creates a subtle difference in the time steps right before the action of hands moving to the shoulder level (the sharp increase in time series values) to pointing the gun or hand (the plateau in the middle of the time series). The detailed description can be found in [46]. In this dataset, AMEE identifies explanations from the IntegratedGradient method as the most computationally informative explanation, followed by MrSEQL-SHAP and MrSEQL-SM. The least informative method is ROCKET-LIME, which is even less informative than a random saliency explanation, as this method refers to the wrong area of importance, failing to point out any of the salient regions of the time series.

Fig. 15: Coffee Dataset: Ground truth from Coffee dataset [8]. According to the original paper [8], the chlorogenic acid content (region approximately of time steps 150-240, marked in red) is the major region that contributes to the difference in two coffee types. The caffeine content (region approximately of time steps 40-75, marked in orange) is also discriminative, but to a lesser extent.

Motion Sensing Dataset: Counter Movement Jump (CMJ)
The CMJ dataset records the counter movement jumps of participants of 3 classes: Normal (jump done correctly), Bend (jump with knee bend), and Stumble (stumble at landing) (Figure 18). According to the domain experts who recorded this data [34], the critical area for the first two classes (NORMAL and BEND) is the middle part, while that of the final class (STUMBLE) is at the end of the time series. In class NORMAL, this region is completely flat. The same region in class BEND is characterized by a hump when participants' knees are in a bending posture. In the STUMBLE class, the end of the time series is different from the previous two classes because of its very high, sharp peak due to a wrong landing position.
The result of AMEE for all studied explanation methods is also given in Table 8. The top 3 explanations for this dataset are MrSEQL-SHAP (SHAP explanation based on MrSEQL classifier), MrSEQL-LIME (LIME explanation based on MrSEQL classifier), and MrSEQL-SM (saliency map obtained directly from MrSEQL classifier). We see a high agreement between these explainers as they all correctly highlight the corresponding discriminative areas provided by the expert (Figure 19). In addition, methods that are pointed out by AMEE as unreliable are also shown to highlight incorrect regions and do not agree with the opinion of the domain expert (e.g., the explanation provided by the Integrated Gradient method).

Fig. 16: Coffee Dataset: Visualization of top 3 best explanation methods (top 3 rows) and worst explanation method (bottom 2 rows) ranked by AMEE. It can be observed that the top explanation methods are all able to point out the discriminative regions that are confirmed by the domain expert, as represented in Figure 15.

Impact of Using Multiple Referees
The Counter Movement Jump (CMJ) dataset is an example of a real dataset with known domain expert ground truth [34]. In our experiment, all of the referee classifiers achieve very high performance, ranging from 0.92 to 0.97 accuracy (Table 11). Hence, this dataset presents an opportunity to investigate the benefit of using a committee of referee classifiers in comparison with using a single referee classifier. Figure 20 shows the Explanation Power using two approaches: (a) using an ensemble of referees and (b) using a single referee independently. Consider the case of the Random explanation (displayed in blue), which should clearly be worse than the MrSEQL-based explanations (as shown in Section 5.5.3). It is interesting that one of the referee classifiers that is quite accurate (Resnet with 0.92 accuracy) ranks the Random explanation as the best, with a significant Explanation Power difference to the others. If only a single referee was employed here, the recommendation could select the Random explanation. However, the risk decreases when an ensemble of referees is used. From this real example we observe that the benefit of using multiple referees is to improve the confidence and reliability of the evaluation, reducing the risk that a single referee is wrong by instead aggregating evaluations from multiple referees.

Fig. 17: GunPoint Dataset: Visualization of top 3 best explanation methods (top 3 rows) and worst explanation method (bottom 2 rows) identified by AMEE. According to the description of the dataset in its original paper [46], the discriminative region is right before the high plateau of the two classes. For the Gun class, this region reflects the action of actors' hands moving above the holsters, and moving down to grasp the gun. For the Point class, there is no such action, resulting in a smoother curve from rest to point motion.
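The contrast between the single-referee and ensemble modes can be illustrated with a toy sketch. The scores below are invented and are not taken from Figure 20 or Table 8; the referee and method names are placeholders, `best_method` is a hypothetical helper, and plain averaging stands in for AMEE's aggregation across referees.

```python
def best_method(scores, referees=None):
    """Return the method with the highest mean score over the chosen referees.

    scores[referee][method] -> per-referee score (hypothetical values).
    """
    referees = list(scores) if referees is None else list(referees)
    methods = list(scores[referees[0]])
    avg = {m: sum(scores[r][m] for r in referees) / len(referees) for m in methods}
    return max(avg, key=avg.get)

# Invented scores: referee_1 alone prefers the random explanation.
scores = {
    "referee_1": {"method_a": 0.40, "method_b": 0.30, "random": 0.90},
    "referee_2": {"method_a": 0.90, "method_b": 0.60, "random": 0.10},
    "referee_3": {"method_a": 0.80, "method_b": 0.70, "random": 0.20},
}
single = best_method(scores, ["referee_1"])   # decision of one referee
committee = best_method(scores)               # decision of all three together
```

With these made-up numbers the single referee selects the random explanation, while the committee selects `method_a`, which mirrors the risk reduction discussed above.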

Discussion
Our study, carried out on both synthetic and UCR datasets, shows that AMEE can be used to computationally evaluate and rank different explanation methods. We recommend the use of AMEE with full knowledge about the essential elements of the method. First, referees should be selected carefully, using classifiers of acceptable accuracy as determined by the application requirements. Using a committee of multiple accurate referee classifiers is recommended to reduce possible biases that one referee could introduce, and results in a more reliable evaluation. Second, having a variety of data perturbation methods is helpful, especially for hard-to-classify datasets. In addition, adding a random explanation while carrying out the evaluation with AMEE is helpful in identifying unreliable explanations. A worse-than-random explanation means that the explanation fails to trigger a change in referee classifiers when compared even to a random explanation, either misidentifying the important areas or not focusing on any important areas at all. Finally, we recommend adding SHAP-based methods to accurate base classifiers for testing and further evaluation, as our experiments show that SHAP-based explanations often outperform other explanations using the same base classifiers.

Recommendations for Practitioners
In this section, we present our recommendations for using the AMEE framework to evaluate and recommend explanation methods. These are some of the lessons learned during the process of developing, designing, and conducting the experiments in this paper.
- Time Series Classifiers. One of the key elements of our evaluation framework is the set of referee classifiers. The more accurate the referees, the more reliable the result that we can potentially expect. Hence, choosing the right set of referees is a very important step before we even start to evaluate the explanation methods. We recommend using state-of-the-art time series classifiers that are well studied and compared in the latest empirical benchmarks [39]. When selecting referees, we recommend choosing classifiers which are both accurate and computationally efficient, since AMEE requires repeated inference on perturbed versions of the original dataset.
- Explanation Methods. Unless the application users already have their preferred explanations pre-computed and only require AMEE for evaluation, we recommend selecting explainers based on the extensive survey [60]. We recommend using diverse explanation methods, covering both intrinsic explanation and post-hoc explanation. From a computational perspective, we recommend considering the cost of obtaining explanations. From our experience, explainers that use data segments (chunking) to explain time series seem useful, but some of them are not efficient (for example, LIME-Segment [53]). Additionally, we strongly recommend adding SHAP-based explanations to the list of explainers, as we observed these are highly informative in many datasets that we have tested. Finally, we recommend adding a random explanation to the evaluation (in addition to method-based explanations), for a simple sanity check.
- Perturbation Strategy. The perturbation strategy plays a critical role both in obtaining an explainer and in our recommendation framework. An effective perturbation strategy is one that, when used for perturbing the informative parts of the time series, leads to a change in prediction. In our experience, this effectiveness strongly depends on the specific dataset and classifiers, thus choosing the right perturbation can be tricky and time-consuming. Therefore, we recommend using multiple perturbation strategies in our framework.
- Datasets & Optimal number of Referees and Perturbation Strategies. Our experiment covers a wide range of datasets of different classification difficulty levels. We observe that when the dataset is easy to classify (e.g., many classification algorithms can achieve high accuracy), generally a lower number of referees and perturbation strategies can be used without affecting the evaluation results. However, when the dataset is harder to classify, using more referees and more perturbation methods is recommended to get a more reliable and less biased result.
- Adaptability. We present AMEE as a robust explainer recommendation system for the Time Series Classification task. However, the framework is adaptable and could be generally applied to other types of data (such as images and text). For adapting to other data types, practitioners can consider more suitable perturbation methods and referee classifiers that work well with the target data.

Conclusion
In this work we proposed AMEE, a Model-Agnostic Explanation Evaluation framework, for computationally assessing and ranking explanation methods for the time series classification task. We test the framework on 25 synthetic and UCR archive datasets to obtain explanation evaluations for a wide variety of common explanation methods for time series, covering different aspects of explanation including type, scope and model dependency. Our experiments show a high agreement of the Explanation Power (measured by AMEE) in the synthetic datasets with the Oracle explanation (ground truth for each time point) and the Expert explanation in a real dataset (ground truth provided by a domain expert). We also find that perturbation-based explainers based on SHAP generally perform better than gradient-based explainers for time series classification (given similar performance of the base models), but are computationally expensive. AMEE can be used to select appropriate explanation methods for application users. It could also potentially pinpoint inherent problems, such as bias, that may exist in the training data and subsequently enhance the trustworthiness of AI systems in critical tasks. This framework further empowers machine learning to discover new knowledge from the data. Another potential application is to use the best explainer recommended by AMEE as a proxy for downstream tasks to identify opportunities to compress data and optimize data storage, transmission, and analysis. Finally, since AMEE relies on the response to perturbation to evaluate the importance of explanation methods, it can potentially be adapted to other types of data (such as images) and other machine learning tasks (such as time series regression and clustering). Future work includes devising a robust, AMEE-optimized explanation method and using data experts to evaluate the validity and potential of knowledge discovery using this framework in biomedical and healthcare-related tasks such as genetic data understanding and sports analytics.

Explanation Power Calculation using Standard Scaler
In Section 4.6.1, we employ the Min/Max Scaler for re-scaling the metrics in Step 2 and Step 4. While this standardization method is not the only available option, its advantage lies in the intuitive final metric in the [0,1] range, where more informative methods achieve a higher Explanation Power. Nevertheless, using another standardization method, such as Standard Scaling, is always an option to consider. Note that for this design choice, Step 5 is no longer logical and should be removed from the calculation (Algorithm 2). The final metric can now be interpreted in a reverse fashion to Explanation Power, with a lower metric reflecting a better explanation method. Table 9 presents the results of the evaluation metric on the Synthetic datasets using the Standard Scaler as the standardization method for Steps 2 and 4, with Step 5 eliminated.
Results in Table 9 show a similar trend and agreement with the Explanation Power presented in Table 3 and the ground truth shown in Table 4. For each dataset, the methods with the lowest values of the metric in Table 9 are computationally the most informative explanations. For example, for the SmallMiddle CAR dataset, Random has the highest metric value of 2.19 and is the worst explanation method according to Table 4. This result is similar to Table 3, in which this method has zero Explanation Power.

Additional Tables and Figures
We include the full accuracy table for our experiments in Section 5 in Table 10 (Synthetic Data) and Table 11 (Real Time Series Data). Figure 21 shows the visualization of all examined explanation methods on 3 classes of CMJ dataset in Section 5.5.

Fig. 5: The AMEE evaluation framework requires 3 elements: (a) a dataset that requires explanation evaluation, (b) a set of saliency-based explanations, and (c) a set of referee classifiers trained on a subset of (a).

Fig. 7: Measure Standardisation and Explanation Power Calculation. Example of how Explanation Power is derived in a typical evaluation assessment, involving 2 perturbation strategies (local, global) and 3 referees (MrSEQL, k-nn, ROCKET).

Algorithm 1: AMEE: Calculate Explanation Power
Input: Set of XAI methods M, set of Perturbations T, set of Referees R, set of thresholds for important area k, test accuracy (acc_{M,T,R,k})
Output: Average Scaled Explanation AUC (asEAUC_M), Average Scaled Rank (asRank_M), Explanation Power (ePower_M)
1 Calculate Explanation AUC (EAUC_{M,T,R}) using acc_{M,T,R,k}
2 Calculate rescaled AUC (sEAUC_{M,T,R}) by Min/Max Rescaling EAUC_{M,T,R}
3 Calculate Average Scaled Explanation AUC (asEAUC_M) of each M by averaging sEAUC_{M,T,R} across R and T
4 Calculate Average Scaled Rank (asRank_M) of each M by Min/Max Rescaling asEAUC_M
5 Calculate Explanation Power (ePower_M) by 1 - asEAUC_M
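The five steps of Algorithm 1 can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' code: the function names and the dictionary layout are hypothetical, the Explanation AUC is taken as the trapezoidal area under the accuracy-versus-threshold curve, and the Min/Max rescaling in Step 2 is assumed to be applied across methods within each (perturbation, referee) pair. Step 5 is computed here from the rescaled value of Step 4 (1 - asRank_M), which is an assumption made so that the best method scores exactly 1 and the worst exactly 0, as in the reported results; the listing itself writes 1 - asEAUC_M.

```python
def min_max(values):
    """Rescale a dict of values to [0, 1]; constant inputs map to 0."""
    lo, hi = min(values.values()), max(values.values())
    return {k: (v - lo) / (hi - lo) if hi > lo else 0.0 for k, v in values.items()}

def explanation_auc(ks, accs):
    """Step 1: area under the accuracy-vs-threshold curve (trapezoidal rule, assumed)."""
    return sum((ks[i + 1] - ks[i]) * (accs[i] + accs[i + 1]) / 2
               for i in range(len(ks) - 1))

def explanation_power(acc, ks):
    """acc[(method, perturbation, referee)] -> list of accuracies, one per threshold k."""
    methods = sorted({m for m, _, _ in acc})
    settings = sorted({(t, r) for _, t, r in acc})
    eauc = {key: explanation_auc(ks, a) for key, a in acc.items()}   # step 1
    s_eauc = {}
    for t, r in settings:                                            # step 2
        scaled = min_max({m: eauc[(m, t, r)] for m in methods})
        for m in methods:
            s_eauc[(m, t, r)] = scaled[m]
    as_eauc = {m: sum(s_eauc[(m, t, r)] for t, r in settings) / len(settings)
               for m in methods}                                     # step 3
    as_rank = min_max(as_eauc)                                       # step 4
    return {m: 1.0 - as_rank[m] for m in methods}                    # step 5 (assumed form)
```

A method whose perturbation makes referee accuracy drop quickly has a small Explanation AUC and therefore a high Explanation Power in this sketch.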

Fig. 8: Visualization of the Synthetic Datasets. The columns describe the process used to create the dataset, while the rows specify the salient areas.

Fig. 14: Sample time series from dataset SmallMiddle CAR that was perturbed by (a) Gaussian noise addition proposed in [42] at very high magnitude (left) and (b) Global Gaussian noise, a non-parametric perturbation (right).

Fig. 19: Counter Movement Jump (CMJ) dataset: Visualization of best explanation methods (top 3 rows), worst explanation method and random explanation (bottom 2 rows) identified by AMEE. Visualizations of other explanations are presented in Section 8.

Fig. 20: CMJ dataset: Explanation Power in (a) Ensemble Referee Mode vs. (b) Single Referee Mode. The sequence of referees in both figures is (1) RESNET, (2) K-NN, (3) MrSEQL Classifier, (4) ROCKET, and (5) WEASEL 2.0. In figure (a), Explanation Power is calculated in an ensemble approach with sequential addition of the referee classifiers. For example, the value of Explanation Power (reflected in the x-axis) of the Random method (displayed in blue) when 2 classifiers are used (reflected in the y-axis) is aggregated from using both (1) RESNET and (2) K-NN. In figure (b), Explanation Power is calculated by using the evaluation from a single referee, without any aggregation from other referees. For example, the value of Explanation Power (reflected in the x-axis) of the Random explanation (displayed in blue) at the second classifier (reflected in the y-axis) is computed using only K-NN. The dashed line is used to display the difference in Explanation Power in figure (b) to reflect that there is no connection between the Explanation Power values obtained from the individual referee classifiers.

Algorithm 2: Calculate comparison metric using Standard Scaling option
Input: Set of XAI methods M, set of Perturbations T, set of Referees R, set of thresholds for important area k, test accuracy (acc_{M,T,R,k})
Output: Average Scaled Explanation AUC (asEAUC_M), Average Scaled Rank (asRank_M), Explanation Power (ePower_M)
1 Calculate Explanation AUC (EAUC_{M,T,R}) using acc_{M,T,R,k}
2 Calculate rescaled AUC (sEAUC_{M,T,R}) by Standard Scaling EAUC_{M,T,R}
3 Calculate Average Scaled Explanation AUC (asEAUC_M) of each M by averaging sEAUC_{M,T,R} across R and T
4 Calculate Average Scaled Rank (asRank_M) of each M by Standard Scaling asEAUC_M
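The Standard Scaling variant (Steps 2 to 4 of Algorithm 2) can be sketched in the same style. The function names and the dictionary layout are hypothetical, the scaling in Step 2 is assumed to be applied across methods within each (perturbation, referee) pair, and the population standard deviation is used, which matches the behaviour of sklearn's StandardScaler.

```python
from statistics import mean, pstdev

def standard_scale(values):
    """Z-score a dict of values (population std); constant inputs map to 0."""
    vals = list(values.values())
    mu, sigma = mean(vals), pstdev(vals)
    return {k: (v - mu) / sigma if sigma > 0 else 0.0 for k, v in values.items()}

def standard_scaled_metric(eauc):
    """eauc[(method, perturbation, referee)] -> Explanation AUC (output of step 1)."""
    methods = sorted({m for m, _, _ in eauc})
    settings = sorted({(t, r) for _, t, r in eauc})
    averaged = {m: 0.0 for m in methods}
    for t, r in settings:
        scaled = standard_scale({m: eauc[(m, t, r)] for m in methods})  # step 2
        for m in methods:
            averaged[m] += scaled[m] / len(settings)                    # step 3
    return standard_scale(averaged)                                     # step 4
```

Because there is no Step 5, the returned values are read in the reverse direction to Explanation Power: the lowest value marks the most informative method.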

Table 2 :
Summary of properties of Explanation Methods.

Table 3 :
Synthetic Datasets: Explanation Power for each of the 10 explanation methods evaluated.

Table 4 :
Synthetic Datasets: F1-score of explanation methods using explanation ground truth. Abbreviations: SM - Small Middle, RT - Rare Time.

Table 6 :
Synthetic Datasets: Using default perturbation settings with previous work in [42] to get Explanation Power for each of the 10 explanation methods evaluated.

Table 8 :
Explanation Power on UCR Datasets. Most informative method is highlighted in bold.

Table 9 :
Synthetic Datasets: Evaluation Metric using Standard Scaler for each of the 10 explanation methods evaluated on Synthetic datasets.

Table 10 :
Classifier accuracy on synthetic datasets. Classifiers that are selected as referees are in bold.

Table 11 :
Classifier accuracy on UCR datasets. Classifiers that are selected as referees are in bold. Abbreviations: CounterMovementJump - CMJ, Italy-Power - ItalyPowerDemand, Sony1/2 - SonyAIBORobotSurface1/2.

Fig. 21: Visualization of all examined explanation methods on 3 classes of CMJ dataset (ordered by Explanation Power, high to low). This figure is best read in combination with results in Table 8.