1 Introduction

Machine learning is developing rapidly to address real-world classification problems and automate decisions in different fields. In particular, time series classification (TSC) has gained popularity in several application domains, such as electrocardiogram (ECG) signal classification (Kampouraki et al., 2009), sensor signal classification (Yao et al., 2017), and stream monitoring (Rebbapragada et al., 2009). Nevertheless, most machine learning models remain opaque, while model interpretability is crucial for end-users and practitioners to gain trust in the predictions. Furthermore, in several domains, such as healthcare, providing actionable explanations is essential for practitioners to not only understand the causes of an unfavourable prediction but also to prevent and counteract it. Towards this direction, Wachter et al. (2017) suggested counterfactual explanations as a solution for providing sample-based explanations. Counterfactual explanations are mainly recommendations on which features of the original test example should be modified, and how, in order to achieve the desired target prediction. Recent works on counterfactual explanations have demonstrated great applicability in different prediction tasks, such as the classification of tabular data (Pawelczyk et al., 2020; Karimi et al., 2020; Kanamori et al., 2020), images (Vermeire et al., 2022; Van Looveren & Klaise, 2021) and time series (Karlsson et al., 2020; Ates et al., 2021; Delaney et al., 2021).

In the TSC domain, counterfactual explanations have been studied extensively (Karlsson et al., 2020; Delaney et al., 2021; Ates et al., 2021). Given a target time series example and a classifier that predicts a class label for the time series, the objective is to generate a new version (i.e., a counterfactual) of the time series example, such that the classifier predicts an alternative class label. Time series counterfactuals are highly applicable in different domains, such as healthcare (Karlsson et al., 2020; Delaney et al., 2021) and spectroscopy (Delaney et al., 2021). For example, they can be used by cardiologists to understand why a given ECG (example time series) has been classified as “myocardial infarction” by comparing it to a normal ECG generated by applying several adjustments to the values of the original example. This comparison will result in identifying which time series regions (and values in those regions) are discriminative for the classification outcome. As another example, we can consider that of spectroscopy analysis of food, where infrared spectrographs are used to determine the origin of, e.g., different types of coffee or honey. In this case, counterfactuals can be used to indicate which time series regions have the highest discriminative power, and can hence lead to the use (or design) of cheaper sensors considering a smaller portion of the light spectra (Delaney et al., 2021).

Several earlier approaches towards time series counterfactual explanations have been proposed in the literature, including two techniques presented by Karlsson et al. (2020). The first technique is defined for the random shapelet forest (RSF) (Karlsson et al., 2016) classifier. Counterfactuals are generated by identifying the time series subsequences that need to be changed based on the internal decision nodes of each tree, so that the decision of the ensemble also changes. The second technique simply defines the counterfactual to be the k nearest neighbour (k-NN) of the target class in the training set. Another approach, named Native Guide (NG) (Delaney et al., 2021), generates time series counterfactuals by utilizing the nearest neighbour and class-activation-map (CAM) feature-weight vectors. The CAM vectors are applied to identify the most influential subsequence and then iteratively perturb the original sample to generate the counterfactual. This technique is however only suitable for convolutional neural networks (CNNs). Despite their promising performance on a large collection of time series datasets, all three techniques are hampered by the fact that they are model-specific and thus cannot be applied to any other time series classifiers.

A model-agnostic approach, called LatentCF++ (Wang et al., 2021), has recently been proposed for generating time series counterfactuals by applying perturbations in a learned latent space using an auto-encoder. Despite its promising performance, one limitation of LatentCF++ is that perturbations are applied to the whole time series, without imposing any constraints on which regions should be given a higher priority to be perturbed when generating the counterfactuals. This limitation could result in perturbations that may not be relevant or valid in the application domain of interest, hence compromising the reliability of the generated counterfactuals and resulting in unnecessary modifications of the original time series.

Imposing constraints on the time series regions with the most discriminative power for the classifier can potentially lead to more reliable counterfactuals, while increasing the efficiency of the counterfactual generation process. We consider two ways to define constraints for time series counterfactuals. The first is called example-specific, referring to constraints obtained from a given test example, e.g., using a local time series explainer; the second is called global, referring to constraints obtained from the classifier, e.g., by using temporal interval importance. For instance, example-specific constraints are relevant in the case of a patient suffering from “QT prolongation”, where the QT interval region of the patient’s ECG is more relevant than the rest of the ECG curve (Nachimuthu et al., 2012). Hence this region is more discriminative for that patient and should be constrained. On the other hand, in the engine manufacturing industry, noise signals are usually monitored through standard sensors to detect product failures; in that case, introducing global constraints that generalise to all the sensors would be more advantageous. Finally, an additional way of imposing constraints is by letting a domain expert provide them; in this case, the constraints are referred to as expert-based.

In this paper, we extend LatentCF++ to a general time series counterfactual generation method, called Glacier (Guided LocAlly ConstraIned countERfactual explanations for time series classification), that is model-agnostic, and can provide counterfactual explanations for any deep learning classifier using gradient descent search either on the original space or on a learned latent space (e.g., by using an auto-encoder). Our method has the additional flexibility of including both example-specific and global constraints on the counterfactual generation. We exemplify Glacier using a state-of-the-art LSTM-FCN classifier (Karim et al., 2018). Note, however, that the method proposed in this paper can, without loss of generality, generate counterfactuals for any deep neural network with or without an auto-encoder.

1.1 Examples

To highlight the importance of the problem presented and solved in this paper, we demonstrate two examples of time series counterfactuals generated by Glacier using two datasets from the UCR time series repository: TwoLeadECG (Fig. 1) and FordA (Fig. 2). The original time series are illustrated in blue colour, while their corresponding counterfactuals of the opposite class are illustrated in red, and the local regions where changes are favoured are indicated as “red” bars.

Fig. 1 Examples of counterfactuals for ECG measurements: unconstrained and with example-specific constraints in Glacier. The “red” bars suggest the time series points for which changes are favourable (Color figure online)

In Fig. 1, we present two examples of counterfactuals for ECG measurements: unconstrained (Fig. 1a) and with example-specific constraints (Fig. 1b). Example-specific constraints are applied to an abnormal ECG to reach the desired target class (i.e., normal ECG), where changes to time points inside the “red” bars are favoured when generating the counterfactual. Example-specific constraints can be distinctive for each patient and can be imposed by, e.g., using a local time series explainer, such as LIMESegment (Sivill et al., 2022), or by following doctors’ expert advice. By contrast, Fig. 1a shows that in the absence of example-specific constraints, all the local regions are favoured to change, indicated by a larger “red” area. In addition, in Fig. 2, we present two examples of counterfactuals for engine signals: unconstrained (Fig. 2a) and with global constraints (Fig. 2b). Global constraints are introduced when generating counterfactuals for an instance of engine failure. As we can observe in Fig. 2b, the majority of changes occurred inside the “red” bars. In this case, the global constraints are applied to all the samples of the FordA dataset, and can be determined by, e.g., using temporal interval importance on the training set.

Fig. 2 Examples of counterfactuals for engine signals: unconstrained and with global constraints in Glacier. The “red” bars suggest the time series points for which changes are favourable (Color figure online)

1.2 Contributions

The main contributions of this paper are summarized as follows:

  • We provide a generalized formulation for generating univariate time series counterfactuals allowing for temporal constraints on the counterfactual generation.

  • Extending our previously proposed method, called LatentCF++ (Wang et al., 2021), we propose Glacier, a model-agnostic method for univariate time series counterfactuals. Our method can be applied to any deep learning classifier and can generate counterfactuals by inducing perturbations on the input time series using gradient descent search either on the original or on a latent feature space, e.g., by employing an auto-encoder.

  • Glacier has the additional flexibility of defining and imposing two particular types of temporal constraints, example-specific or global, on the input time series example, that encourage perturbations at specific time points.

  • Glacier optimizes the trade-off between validity and proximity by employing a weighted loss function.

  • We provide an extensive experimental evaluation of Glacier against the aforementioned model-specific competitors (Karlsson et al., 2020; Delaney et al., 2021) on 40 univariate time series datasets from the UCR repository, using a state-of-the-art classification architecture (Karim et al., 2018). We generate counterfactual explanations using the original space (i.e., without an auto-encoder), or using a learned latent space (either through a 1dCNN or an LSTM-based auto-encoder). We additionally explore four variants of Glacier: unconstrained (referring to the case of favouring changes in the whole time series), example-specific, global, and uniform (referring to the case of disfavouring changes in the whole time series).

2 Related work

A wide range of TSC algorithms using different ML techniques has been proposed in recent years (Bagnall et al., 2017). Symbolic aggregate approximation (SAX) is a time series summarization technique that transforms the input time series into a string. Due to its nature, SAX has recently been used for designing explainable time series classifiers, such as XEM (Fauvel et al., 2022) and PETSC (Feremans et al., 2022). Shapelet-based methods employ time series subsequences (called shapelets) as discriminatory features to train classifiers, such as random forests and SVMs (Karlsson et al., 2016; Bagnall et al., 2017). Moreover, HIVE-COTE (Lines et al., 2016) was proposed as an ensemble of different classifiers (e.g., elastic ensembles and shapelet transform) together with a hierarchical voting structure, outperforming all previous methods. In addition, several deep learning based TSC algorithms have been proposed, such as LSTM-FCN (Karim et al., 2018), InceptionTime (Ismail Fawaz et al., 2020), or convolutional feature transforms, such as ROCKET (Dempster et al., 2020), demonstrating comparable benchmark metrics with improved model scalability. Nevertheless, due to their black-box nature, they lack model transparency and interpretability (Molnar, 2019).

Interest in counterfactual explanations has surged in the last few years in several high-stake applications (Stepin et al., 2021). For TSC problems, Ates et al. (2021) presented a counterfactual explanation approach for sample-based predictions using CORELS, focusing on multivariate time series datasets. Moreover, Delaney et al. (2021) proposed the NG method for CNN-based time series classification, using CAM feature-weight vectors to iteratively perturb a subsequence of the time series and generate counterfactual explanations. Similarly, Karlsson et al. (2020) proposed a solution that perturbs time series locally or globally, guided by the random shapelet forest classifier or the k-nearest neighbour classifier, respectively. However, both approaches are model-specific, i.e., they cannot be applied to any other classifiers.

A large number of counterfactual approaches have been proposed that can provide model-agnostic explanations for any black-box classifier. For an extensive survey, the reader can refer to Stepin et al. (2021) and Verma et al. (2020). Moreover, there are several counterfactual generation methods (Pawelczyk et al., 2020; Joshi et al., 2019; Balasubramanian et al., 2020; Van Looveren & Klaise, 2021) that learn latent representations of each class, using an auto-encoder (AE) or a variational auto-encoder (VAE). However, they mainly focus on tabular or image data, and none of them has been applied to TSC. Recently, a few researchers have investigated exploiting latent representations to provide explainable results in the TSC domain. LASTS (Guidotti et al., 2020) was proposed to utilize auto-encoders for generating factual and counterfactual rules by learning a local latent decision tree classifier; these rules require the original time series to contain (or not) certain shapelets to reach the desired or original class. Wang et al. (2021) proposed LatentCF++ for learning latent space representations for time series counterfactuals, which we extend in this paper by providing a generalized approach allowing for temporal constraints and perturbations either in a latent or in the original space.

Moreover, other types of explainable machine learning methods have been proposed for explaining model predictions by providing feature importance scores, such as generating local model-agnostic explanations (LIME) by randomly perturbing input samples to fit surrogate models (Ribeiro et al., 2016) or computing Shapley values to approximate the feature importance for a given classifier (Lundberg & Lee, 2017). In order to adapt these methods to the time series domain, Sivill et al. (2022) proposed LIMESegment, applying a nearest neighbour-based technique together with harmonic analysis to generate robust perturbations for the surrogate model, obtaining temporal segments of the time series together with their local importance scores. On the other hand, Bento et al. (2021) presented TimeSHAP to provide feature-level, timestep-level and cell-level importance scores for recurrent neural network models. To the best of our knowledge, none of these feature importance methods has been considered or incorporated into time series counterfactual generation.

3 Preliminaries

In this section, we first provide a description of the auto-encoder, and then we formulate the problems of time series counterfactual explanations and locally-constrained time series counterfactuals.

3.1 Auto-encoder

An auto-encoder (AE) is a deep learning architecture that can be applied for dimensionality reduction, feature learning and generative modelling using latent representations (Goodfellow et al., 2016). Previous work has demonstrated the high utility of auto-encoders on time series data, such as time series clustering (Tavakoli et al., 2020), anomaly detection (Yin et al., 2022) and learning representative features from fMRI data (Huang et al., 2018). A traditional auto-encoder consists of two main procedures: the Encode() function for encoding the original input X into the latent representation z, and the Decode() function for reconstructing the latent representation back to the original feature space (with the same dimension as X), denoted as \(X_r\):

$$\begin{aligned} z = \texttt {Encode}({X}) , \text { and } X_r = \texttt {Decode}({z}) , \end{aligned}$$
(1)

where Encode() and Decode() can be a special case of feedforward networks. More specifically, a simplified notation is defined as:

$$\begin{aligned} z = \sigma (W_1 \cdot X + b_1), \text { and } X_r = \sigma (W_2 \cdot z + b_2) , \end{aligned}$$
(2)

where \(\sigma\) is the activation function, and \(W_{\cdot }\) and \(b_{\cdot }\) represent the weight matrices and bias vectors. In particular, the latent representation z learns influential features from the original sample X that can be decoded back as \(X_r\). During training, a loss function \(L(X, X_r)\), such as the mean squared error, is applied to minimize the distance between the input X and the reconstructed sample \(X_r\). Without loss of generality, the feedforward networks can be replaced by other deep learning models. In the time series domain, two main types of auto-encoders are applied: long short-term memory (LSTM) (Tavakoli et al., 2020; Yin et al., 2022) and CNN (Yin et al., 2022; Huang et al., 2018) models.
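
To make the Encode()/Decode() decomposition concrete, the following is a minimal Keras/TensorFlow sketch of a 1dCNN auto-encoder for univariate time series. The layer sizes and function names are illustrative assumptions and do not correspond to the exact architecture used in our experiments.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_1dcnn_autoencoder(n_timesteps, latent_filters=16):
    # Encode(): X -> z
    enc_in = layers.Input(shape=(n_timesteps, 1))
    h = layers.Conv1D(32, 3, padding="same", activation="relu")(enc_in)
    h = layers.MaxPooling1D(2, padding="same")(h)
    z = layers.Conv1D(latent_filters, 3, padding="same", activation="relu")(h)
    encoder = Model(enc_in, z, name="encode")

    # Decode(): z -> X_r in the original feature space
    dec_in = layers.Input(shape=encoder.output_shape[1:])
    h = layers.Conv1D(latent_filters, 3, padding="same", activation="relu")(dec_in)
    h = layers.UpSampling1D(2)(h)
    h = layers.Conv1D(32, 3, padding="same", activation="relu")(h)
    x_r = layers.Conv1D(1, 3, padding="same", activation="linear")(h)
    # Crop back to the original length (upsampling may overshoot for odd n)
    x_r = layers.Cropping1D((0, x_r.shape[1] - n_timesteps))(x_r)
    decoder = Model(dec_in, x_r, name="decode")

    # Train end-to-end to minimize the reconstruction loss L(X, X_r)
    ae = Model(enc_in, decoder(encoder(enc_in)), name="autoencoder")
    ae.compile(optimizer="adam", loss="mse")
    return ae, encoder, decoder
```

The decoupled encoder and decoder returned here are exactly the two components that Glacier later reuses for its latent-space counterfactual search.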

3.2 Problem formulation

Let \(X=x_1, \ldots , x_n\) be a univariate time series of n values with each \(x_i\in \mathbb {R}\), and let \(\mathcal {Y}\) be a set of target class labels. We consider a binary classification problem where \(\mathcal {Y} = \{\)\(+\)’, ‘−’\(\}\). Let the function \(f(\cdot ): \mathbb {R}^n \rightarrow [0,1]\) be given through a trained deep learning (black-box) model, which takes as input a time series X and outputs the prediction probability that X is of the positive class. Consequently, the probability that X is of the negative class is \(1-f(X)\). The classification of X is then determined by comparing f(X) to a threshold \(\tau\). If \(f(X) < \tau\), then X is of class ‘−’. Otherwise, X is predicted to be class ‘\(+\)’. In this paper, we study the problem of generating counterfactual time series \(X'\) reflecting the modifications required on X to change the classification result from an undesired state (e.g., negative) to a desired state (e.g., positive). More formally, we study the following problems:

Problem 1

(time series counterfactual explanations) Given \(f(\cdot )\) and a time series sample X, such that \(f(X) < \tau\), i.e., X is predicted to be of class ‘−’, we define a counterfactual time series \(X'\), such that \(f(X') \ge \tau\), with \(f(X')\) being as close to the decision threshold \(\tau\) as possible, while the distance of \(X'\) to its original counterpart X is minimized, i.e.,

$$\begin{aligned} X' = \underset{X^{*}}{{\textrm{arg min}}} \quad |f(X^{*}) - \tau | + ||X-X^*|| \ . \end{aligned}$$
(3)

We additionally define a binary constraint vector \(\varvec{c}=[c_1, \ldots , c_n]\), where \(c_i=0\) indicates that changing the value at time point i is favourable, while \(c_i=1\) discourages any changes to this time point. More specifically, let \(\varvec{c}\) be defined by a function \(h(\cdot )\), i.e., \(\varvec{c}=h(\cdot )\), that maps any input to \(\{0,1\}^{n}\). Different instantiations of \(h(\cdot )\) are provided in Sect. 4. Applying \(\varvec{c}\) when generating a counterfactual for X produces a locally-constrained time series counterfactual \(X'_{\varvec{c}}\).

Problem 2

(locally-constrained time series counterfactual explanations) Given \(f(\cdot )\), a time series sample X, such that \(f(X) < \tau\), i.e., X is assigned with class ‘−’, and a constraint vector \(\varvec{c}\), we define a locally-constrained counterfactual time series \(X'_{\varvec{c}}\), such that \(f(X'_{\varvec{c}}) \ge \tau\) and

$$\begin{aligned} X'_{\varvec{c}} = \underset{X^{*}}{{\textrm{arg min}}} \quad |f(X^{*})-\tau | + |\varvec{c}^{T}(X-X^*)| \ . \end{aligned}$$
(4)

As a practical example for Problem 2 in abnormal ECG classification, consider a classifier \(f(\cdot )\) trained on ECGs. The goal is then to find the counterfactual \(X'_{\varvec{c}}\) (class ‘\(+\)’) for the abnormal sample X (class ‘−’), where the counterfactual imposes constraints on the region represented by the vector \(\varvec{c}\), such that changes to the remaining time points (i.e., outside the constrained region) are favoured.
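
As an illustration, the following minimal NumPy sketch evaluates the objective of Eq. 4 for a candidate counterfactual. The classifier f and the toy series are placeholders introduced only for this example.

```python
import numpy as np

def constrained_objective(x_orig, x_cand, c, f, tau=0.5):
    """Evaluate the locally-constrained objective of Eq. 4 for a candidate
    counterfactual x_cand. `f` is assumed to return the positive-class
    probability of a univariate time series; `c` is the binary constraint
    vector (1 = change discouraged, 0 = change favoured)."""
    prediction_margin = abs(f(x_cand) - tau)
    constrained_distance = abs(np.dot(c, x_orig - x_cand))
    return prediction_margin + constrained_distance

# Toy usage: a length-5 series where only the last two points may change freely.
x = np.array([0.1, 0.4, 0.3, 0.8, 0.6])
x_star = np.array([0.1, 0.4, 0.3, 0.2, 0.1])      # candidate counterfactual
c = np.array([1, 1, 1, 0, 0])                     # first three points constrained
f = lambda s: 1 / (1 + np.exp(-s.sum()))          # stand-in classifier
print(constrained_objective(x, x_star, c, f))
```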

4 Glacier: guided locally constrained counterfactual explanations

We present Glacier: Guided LocAlly ConstraIned countERfactual explanations for time series classification. Compared to its predecessor LatentCF++ (Wang et al., 2021), Glacier can also generate locally-constrained time series counterfactuals, as defined in Problem 2. Next, we provide the main steps of Glacier, including two instantiations of the way the counterfactuals are generated using gradient descent search, followed by two approaches for defining constraints.

4.1 Gradient counterfactual search

Glacier supports two counterfactual generation techniques, both based on gradient search: one that employs a latent representation space (e.g., by using a 1dCNN or LSTM auto-encoder), and one that uses the original feature space (denoted as NoAE). The detailed steps of our method are described in Algorithm 1.

Algorithm 1

For the first technique, we first decouple the pre-trained auto-encoder into the Decode() and Encode() functions. After that, we apply Encode() to transform the input time series sample X into the latent representation z, followed by Decode(), which reconstructs it back to the original feature space as \(X^{*}\) (Lines 2 and 3). Next, the function \(f(\cdot )\) predicts the class probability of the reconstructed sample, denoted as \(y^{*}\) (Line 6).

Following our problem formulation (Sect. 3.2), we define the loss of the prediction margin between the counterfactual prediction \(y^{*}\) and the decision boundary \(\tau\) as follows:

$$\begin{aligned} L_{p}(y^{*}, \tau ) = ||y^{*} - \tau || , \end{aligned}$$
(5)

and the loss based on the constraint vector \(\varvec{c}\) as follows:

$$\begin{aligned} L_{c}(X^{*}, X, \varvec{c}) = |\varvec{c}^{T} (X^{*} - X)|, \end{aligned}$$
(6)

where \(X^{*}\) and X correspond to the counterfactual and the original sample, respectively, while \(\varvec{c}\) is the constraint vector.

Using these two loss functions, the total loss after the counterfactual search process is computed as follows (Line 7):

$$\begin{aligned} L(X^{*}, y^{*}, X, w, \tau , \varvec{c}) = w \cdot L_{p}(y^{*}, \tau ) + (1 - w) \cdot L_{c}(X^{*}, X, \varvec{c}), \end{aligned}$$
(7)

where w is the prediction margin weight for \(L_{p}\) and \((1-w)\) is the weight for \(L_{c}\).

Subsequently, in Lines 9–17, the loop conditions are validated to guarantee that the loss iteratively decreases and that the output probability \(y^{*}\) crosses the decision boundary \(\tau\), where Adam optimization (Kingma & Ba, 2015) is applied (i.e., AdamOptimize() in Line 11) to update the latent representation z. The AdamOptimize() function combines two other momentum-based algorithms, AdaGrad and RMSProp, to accelerate the gradient descent approach, and the latent representation z is updated directly through the iterations of the while loop. The final output counterfactual is the decoded sample \(X^{*}\) when the while loop condition breaks (i.e., either the loss is lower than tol, \(y^{*}\) is larger than or equal to \(\tau\), or the loop reaches the maximum number of defined iterations).

The second technique (referred to as NoAE) for generating counterfactuals corresponds to Algorithm 1 but without the encoding and decoding steps. More precisely, in this case, an auto-encoder is not provided, hence the counterfactual search space changes. Instead of perturbing the input sample X on latent space z, the algorithm performs the search directly on the original feature space \(X^{*}\) with the AdamOptimize() function (see Lines 5 and 14). The loss calculation and the while loop conditions remain the same.
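
The following is a minimal TensorFlow sketch of the latent-space search described above. The function names and tensor shapes are illustrative assumptions rather than our released implementation; dropping the encoder/decoder and optimizing the sample directly yields the NoAE variant.

```python
import tensorflow as tf

def glacier_search(x, f, encoder, decoder, c, w=0.5, tau=0.5,
                   alpha=1e-3, tol=1e-6, max_iter=100):
    """Sketch of Glacier's gradient-based counterfactual search in a latent
    space. `f`, `encoder` and `decoder` are assumed to be Keras models taking
    inputs of shape (1, n_timesteps, 1); `c` is the binary constraint vector.
    For the NoAE variant, one would declare x_star = tf.Variable(x) instead
    and apply the Adam updates to it directly."""
    x = tf.convert_to_tensor(x, dtype=tf.float32)
    c = tf.reshape(tf.convert_to_tensor(c, dtype=tf.float32), (1, -1, 1))
    z = tf.Variable(encoder(x))                               # latent representation z
    opt = tf.keras.optimizers.Adam(learning_rate=alpha)

    for _ in range(max_iter):
        with tf.GradientTape() as tape:
            x_star = decoder(z)                               # candidate counterfactual X*
            y_star = f(x_star)                                # positive-class probability y*
            loss_p = tf.abs(y_star - tau)                     # prediction margin loss (Eq. 5)
            loss_c = tf.abs(tf.reduce_sum(c * (x_star - x)))  # constraint loss (Eq. 6)
            loss = w * loss_p + (1.0 - w) * loss_c            # weighted total loss (Eq. 7)
        if float(tf.squeeze(loss)) < tol or float(tf.squeeze(y_star)) >= tau:
            break                                             # stopping conditions
        grads = tape.gradient(loss, [z])
        opt.apply_gradients(zip(grads, [z]))                  # Adam update of z
    return decoder(z).numpy()
```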

Among the five hyperparameters in Algorithm 1, two are used to customize the desiderata of the counterfactuals: the prediction margin weight w and the decision boundary threshold \(\tau\). More concretely, the prediction margin weight w controls the trade-off between proximity and validity in Eq. 7, while the decision boundary threshold \(\tau\) controls the minimum required value of the counterfactual prediction in order to cross the decision boundary of the classifier. In addition, the other three hyperparameters jointly control the convergence of the counterfactual generation, i.e., the learning rate \(\alpha\), the loss tolerance tol, and the maximum number of iterations \(max\_iter\). In the experimental setup, it is desirable to have relatively small values for \(max\_iter\) and tol. To select the optimal hyperparameters for faster convergence, we first conduct an empirical search to determine \(max\_iter\). Then, for the learning rate \(\alpha\), we apply a brute-force search between 0.001 and 0.0001, while fixing \(max\_iter\) and tol. The optimal \(\alpha\) is then selected for each specific dataset for the final comparison.

4.2 Constraints

As already stated in Sects. 3.2 and 4.1, Glacier requires an additional input, i.e., a constraint vector \(\varvec{c}\), that is defined by the function \(h(\cdot )\). By default, \(\varvec{c}=[0, 0, \ldots , 0]\) for all steps in the time series sample X, referred to as the unconstrained variant of Glacier, solving Problem 1, where all time steps are favoured to change. Depending on the definition and nature of \(\varvec{c}\), we can have constraints that are example-specific, i.e., imposed on a given test example, or global, e.g., determined from \(f(\cdot )\) by using interval importance. Finally, we define the uniform variant of Glacier, where \(\varvec{c}=[1, 1, \ldots , 1]\) for all steps in the time series sample X. In this case, we are interested in counterfactual time series with very few changes with respect to the original time series, and thus no time step is favoured to change. For example, in certain domains like finance, every time step can be equally costly to change, hence the need to penalize the change of all time steps equally.

4.2.1 Example-specific constraints

The first variant of the constraint generation function, denoted as \(h_1(\cdot )\), is defined for the case of example-specific constraints. These constraints are generated by applying LIMESegment (Sivill et al., 2022) to the input time series X, which provides a set of time segments \(\mathcal {B}\) alongside a set of segment importance weights \(\varvec{v}\). LIMESegment is an adaptation of LIME (Ribeiro et al., 2016) to time series classification for generating meaningful and realistic segment importance scores for a given time series. More concretely, given X, LIMESegment() first applies a nearest-neighbour search to identify change points on X using the classifier \(f(\cdot )\); secondly, it adopts the Short-Time Fourier Transform (STFT) from harmonic analysis to find realistic background perturbations in the frequency domain. After that, the function results in a partitioning of X into b segments, with \(\mathcal {B}^{start}=\{B_1^{start}, \ldots , B_b^{start}\}\) and \(\mathcal {B}^{end}=\{B_1^{end}, \ldots , B_b^{end}\}\) denoting the sets of start and end time points of the segments. Finally, normalized Dynamic Time Warping (DTW) is used to calculate the segment importance weights \(\varvec{v}\) of the perturbed samples for the underlying surrogate model (i.e., ridge regression in LIMESegment). A simplified notation to illustrate the inputs and outputs of LIMESegment() is shown below:

$$\begin{aligned} \{\varvec{v}, \mathcal {B}\} = \texttt {LIMESegment}(X, f, y'), \end{aligned}$$
(8)

where \(y'\) is the desired target class (i.e., ‘\(+\)’). Since the size of \(\varvec{v}\) is \(b \le n\), i.e., less than or equal to the size of X, the resulting constraint vector \(\varvec{c}\) is obtained by mapping the normalized segment importance scores to time point importance scores \(\varvec{\hat{v}}\) by repeating the same importance value for each time point in the corresponding segment. More concretely:

$$\begin{aligned} \hat{v}_i = v_j, \forall i \in [B_j^{start}, B_j^{end}] \quad \texttt { and } \quad \forall j\in [1,b] . \end{aligned}$$

Given a threshold \(\gamma \in [0,1]\), the value of each \(c_i\) in the constraint vector \(\varvec{c}\), \(\forall i\in [1,n]\) is obtained as follows:

$$\begin{aligned} c_i = {\left\{ \begin{array}{ll} 0, & \quad \text {if } \hat{v}_i \le \gamma \\ 1, & \quad \text {otherwise } \\ \end{array}\right. } \end{aligned}$$
(9)

We should note that in vector \(\varvec{c}\), \(c_i = 0\) means that the modification at time point i is desirable (with respect to Eq. 7), while \(c_i = 1\) indicates that changing the value at time point i is not favourable. The threshold parameter \(\gamma\) can be customized based on the given dataset. In the experiment, it is set to 0.25 (i.e., corresponding to the \(25^{th}\) percentile). Hence, the resulting constraint generation function \(h_1(\cdot )\) requires five inputs, i.e., \(h_1(X, f, y', \texttt {LIMESegment}, \gamma )\).
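
The following sketch illustrates how segment importance weights (e.g., as produced by LIMESegment) can be mapped to a binary constraint vector according to Eq. 9. The min-max normalization and the helper signature are assumptions introduced for illustration; the LIMESegment call itself is not shown.

```python
import numpy as np

def example_specific_constraints(seg_weights, seg_starts, seg_ends, n, gamma=None):
    """Map segment importance weights to a binary constraint vector c of
    length n, following Eq. 9. By default gamma is taken as the 25th
    percentile of the normalized weights, as in our experiments."""
    v = np.asarray(seg_weights, dtype=float)
    v = (v - v.min()) / (v.max() - v.min() + 1e-12)        # normalize weights to [0, 1]
    if gamma is None:
        gamma = np.percentile(v, 25)
    v_hat = np.zeros(n)
    for vj, start, end in zip(v, seg_starts, seg_ends):
        v_hat[start:end + 1] = vj                          # repeat weight for each time point
    return np.where(v_hat <= gamma, 0, 1)                  # c_i = 0: change favoured (Eq. 9)

# Toy usage with three segments over a series of length 10
c = example_specific_constraints([0.9, 0.1, 0.5], [0, 4, 7], [3, 6, 9], n=10)
print(c)   # -> [1 1 1 1 0 0 0 1 1 1]
```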

4.2.2 Global constraints

The second variant of the constraint generation function, denoted as \(h_2(\cdot )\), is defined for the case of global constraints. In this case, we employ an idea similar to feature permutation importance (Breiman, 2001), by computing interval importance. More specifically, given a pre-trained classifier \(f(\cdot )\), and a collection of time series \(\mathcal {D}\), we compute a reference score s for \(\mathcal {D}^a\), where \(\mathcal {D}^a \subseteq \mathcal {D}\). Reference score s can be any metric of the model performance, and in our case s is the accuracy. For each uniform and consecutive time interval \(i \in X\) and for each repetition \(k \in [1, \ldots , K]\), where K is the maximum number of repetitions, we randomly shuffle \(\mathcal {D}^a\) along the interval i to generate a perturbed dataset \(\mathcal {D}_{k,i}^{a}\). Next, we compute the score \(s_{k,i}\) using model \(f(\cdot )\) on the perturbed dataset. Finally, we compute the importance of interval i as follows:

$$\begin{aligned} v_i = s - \frac{1}{K}\sum ^K_{k=1} s_{k,i} \ . \end{aligned}$$
(10)

Similar to the example-specific constraints, interval importance \(v_i\) is converted to the binary constraint vector \(\varvec{c}\), given a threshold \(\gamma \in [0,1]\), as shown below:

$$\begin{aligned} c_i = {\left\{ \begin{array}{ll} 0, & \text {if } v_i \ge \gamma \\ 1, & \text {otherwise } \\ \end{array}\right. } \end{aligned}$$
(11)

where the generated constraint vector \(\varvec{c}\) is derived from the dataset \(\mathcal {D}^a\). Note that in the case of global constraints, the threshold condition in Eq. 11 is different from that in Eq. 9. This is because example-specific and global constraints obtain feature importance scores in different ranges. Without loss of generality, threshold \(\gamma\) for global constraints is set to 0.75 (i.e., 75th percentile) for our experimental evaluation. Hence, the resulting global constraint generation function \(h_2(\cdot )\) requires four inputs, i.e., \(h_2(f, y', K, \gamma )\).
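
A minimal NumPy sketch of the interval importance computation (Eq. 10) and its thresholding into a constraint vector (Eq. 11) is given below. The interval length, function names, and the form of the classifier interface are illustrative assumptions.

```python
import numpy as np

def interval_importance(f_predict, X, y, interval_len=10, K=5, seed=None):
    """Permutation-style interval importance (Eq. 10). For each consecutive
    interval, the values are shuffled across samples and the drop in accuracy
    (the reference score s) is recorded. `f_predict` is assumed to map an
    (n_samples, n_timesteps) array to predicted class labels."""
    rng = np.random.default_rng(seed)
    s = np.mean(f_predict(X) == y)                         # reference score s
    n = X.shape[1]
    v = np.zeros(n)                                        # per-time-point importance
    for start in range(0, n, interval_len):
        end = min(start + interval_len, n)
        scores = []
        for _ in range(K):
            X_perm = X.copy()
            perm = rng.permutation(X.shape[0])
            X_perm[:, start:end] = X[perm, start:end]      # shuffle interval i across samples
            scores.append(np.mean(f_predict(X_perm) == y))
        v[start:end] = s - np.mean(scores)                 # Eq. 10
    return v

def global_constraints(v, gamma=None):
    """Binary constraint vector from interval importance (Eq. 11); gamma
    defaults to the 75th percentile, as used in our experiments."""
    if gamma is None:
        gamma = np.percentile(v, 75)
    return np.where(v >= gamma, 0, 1)                      # c_i = 0: change favoured
```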

5 Experimental evaluation

We conduct our experiments using the UCR time series archive (Dau et al., 2018). We mainly focus on the problem of univariate time series counterfactual generation for binary classification, and thus select a subset of 40 datasets representing different data sources. For example, TwoLeadECG represents ECG measurements in the medical domain, and Wafer exemplifies sensor data in semiconductor manufacturing. The selected datasets have different time series lengths, varying from 24 timesteps (ItalyPowerDemand) to 2709 timesteps (HandOutlines). For the evaluation, we apply fivefold cross-validation with stratified splits, using \(80/20\%\) for training and testing for all the datasets. Moreover, to ensure the same evaluation set of counterfactual samples between Glacier and the baseline methods, we generate counterfactual explanations for a subset of 50 test samples selected with a fixed random seed. Furthermore, to compensate for imbalanced target classes, we apply an up-sampling technique that resamples the minority class during training.

5.1 Experiment setup

There are three main deep neural network architectures that have been adopted for time series classification tasks in recent years: multi-layer perceptron (MLP), convolutional neural networks (CNN) and recurrent neural networks (RNN) (Fawaz et al., 2019). In our experiment, we choose to train one of the most recent state-of-the-art CNN models, LSTM-FCN (Karim et al., 2018), which was benchmarked over 85 UCR datasets and showed outstanding performance. Additionally, to transform the original samples into counterfactuals in a latent space representation, we employ two auto-encoders: a 1-dimensional convolutional neural network (1dCNN) model and an LSTM model.

Furthermore, we explore four variants of Glacier: unconstrained, example-specific, global, and uniform. In our main experiment, we set the decision boundary threshold \(\tau\) to 0.5, and the prediction margin weight w to 0.5 in the optimization function (see Eq. 7). We apply the optimal learning rate \(\alpha\) obtained from the brute-force search for each specific dataset (either 0.0001 or 0.001), and set \(max\_iter\) to 100 and tol to 1e-6 based on the empirical search. Finally, we conduct an ablation study to examine the effects of two hyperparameters in Glacier for the TwoLeadECG dataset: the prediction margin weight w and the decision boundary threshold \(\tau\). For the prediction margin weight w, we gradually decrease the value from 1 to 0 with a step of 0.1, while for the decision boundary threshold \(\tau\) we apply a value range from 0.4 to 0.99, with an average step of 0.1. The ablation study shows that our parameter setup (i.e., \(w = 0.5\) and \(\tau = 0.5\)) provides a balanced trade-off among the evaluation metrics in the main experiment.
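
For reference, the hyperparameter values of the main experiment can be summarized in a small configuration sketch; the dictionary keys below are ours and do not come from the released code.

```python
# Illustrative configuration of the main experiment's hyperparameters
GLACIER_CONFIG = {
    "tau": 0.5,              # decision boundary threshold
    "w": 0.5,                # prediction margin weight in Eq. 7
    "alpha": [1e-4, 1e-3],   # candidate learning rates (brute-force search per dataset)
    "max_iter": 100,         # maximum gradient iterations
    "tol": 1e-6,             # loss tolerance
}
```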

5.2 Baseline models

Firstly, we compare Glacier with two different approaches from Karlsson et al. (2020), involving local and global time series tweaking: the random shapelet forest (RSF) and the k-NN counterfactual method (k-NN). We apply the same parameter setup across all datasets for these baseline models.Footnote 1 For RSF, we set the number of estimators to 50 and the max depth to 5, while the other parameters are kept at their default values. For k-NN, we first train a k-NN classifier with k equal to 5 and the distance metric set to Euclidean; the trained classifier is then used to find the counterfactual samples for further evaluation. In addition, we compare Glacier to the Native Guide (NG) method (Delaney et al., 2021), with an implementation and default parameters adopted from the authors’ supporting website.Footnote 2

5.3 Evaluation metrics

To evaluate the performance of our proposed approach in terms of explainability, we consider three metrics, as motivated by Keane et al. (2021): validity, proximity, and compactness. A minimal computation sketch of the three metrics is given after the list below.

  • Validity measures whether the generated counterfactual leads to a valid transformation to the desired target class (Verma et al., 2020; Mothilal et al., 2020). More specifically, it reports the fraction of counterfactuals predicted as the opposite class (i.e., having crossed the decision boundary \(\tau\)), which is defined as:

    $$\begin{aligned} validity (\mathcal {D}', \tau ) = \frac{\#(f(X_i') \ge \tau , X_i' \in \mathcal {D}') }{ n } , \end{aligned}$$
    (12)

    where \(f(X_i')\) is the model prediction for the counterfactual sample \(X_i'\) and n is the number of samples in the set of test counterfactuals \(\mathcal {D}'\). Higher validity is better.

  • Proximity measures the feature-wise distance between the generated counterfactuals and the corresponding original samples (Verma et al., 2020; Pawelczyk et al., 2020; Delaney et al., 2021). In our case, we define proximity as the average Euclidean distance between the transformed and the original time series:

    $$\begin{aligned} proximity (\mathcal {D}, \mathcal {D}') = \frac{1}{n}\sum _{i=1}^n ||X_i - X_i'|| , \end{aligned}$$
    (13)

    where \(X_i \in \mathcal {D}\) is the original time series and \(X_i' \in \mathcal {D}'\) is the corresponding generated counterfactual. Lower proximity is better.

  • Compactness refers to the fraction of time series steps that remain unchanged in the generated counterfactuals compared to the original samples (Karlsson et al., 2020). A similar metric has been proposed in previous literature (Delaney et al., 2021; Keane et al., 2021), reported as sparsity. This metric captures the amount of information that remains unchanged from the original time series, which is then defined as:

    $$\begin{aligned} compactness (\mathcal {D}, \mathcal {D}') = \frac{\#(|X_i - X_i'| \le tol,\ X_i \in \mathcal {D} \text { and } X_i' \in \mathcal {D}') }{ n } , \end{aligned}$$
    (14)

    where tol is the tolerance parameter for the difference (set to 0.01 in the experiments), and n is the number of samples in \(\mathcal {D}\) and \(\mathcal {D}'\). Higher compactness is better.
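
The following minimal NumPy sketch illustrates how the three metrics can be computed for a batch of counterfactuals; the classifier interface is an assumption made for this example.

```python
import numpy as np

def evaluation_metrics(X, X_cf, f_prob, tau=0.5, tol=0.01):
    """Sketch of the three metrics for originals X and counterfactuals X_cf of
    shape (n_samples, n_timesteps); `f_prob` is assumed to return the
    positive-class probability for each sample."""
    validity = np.mean(f_prob(X_cf) >= tau)                  # Eq. 12
    proximity = np.mean(np.linalg.norm(X - X_cf, axis=1))    # Eq. 13
    compactness = np.mean(np.abs(X - X_cf) <= tol)           # Eq. 14 (fraction of unchanged time steps)
    return validity, proximity, compactness
```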

In addition, we report the classification performance of all models as the balanced accuracy, and the reconstruction loss of the auto-encoder models at our supporting website.Footnote 3

5.4 Results

We first compare the validity, proximity, and compactness of our proposed method Glacier, with its three instantiations 1dCNN, LSTM, and NoAE, against the baseline methods RSF, k-NN, and NG. Note that the 1dCNN and LSTM instantiations are different from the previous work (Wang et al., 2021), since we apply a model-agnostic approach to explain the same LSTM-FCN classifier in this experiment. For a detailed comparison, we report the average performance metrics over the fivefold cross-validation for the 40 UCR datasets. In addition, we report the average score across all 40 datasets (denoted as Total avg.) and the winning or drawing method count (indicated as Win/draw ct.) for each model.Footnote 4

Table 1 Summary of validity for the 40 datasets in the experiment of unconstrained Glacier
Table 2 Summary of proximity for the 40 datasets in the experiment of unconstrained Glacier
Table 3 Summary of compactness for the 40 datasets in the experiment of unconstrained Glacier

In Table 1, we can observe that the baseline RSF method achieved the best average validity of 0.994 over the 40 datasets in terms of the total average, with a win/draw count of 37. Meanwhile, the 1dCNN instantiation of Glacier obtained the second-best win/draw count of 17, and the NoAE instantiation obtained the second-highest average validity of 0.964 in terms of the total average, which indicates that approximately \(96\%\) of the generated counterfactuals are considered valid for the LSTM-FCN classifier. In Table 2, we can observe that the NoAE instantiation of our proposed method Glacier achieved the highest win/draw count of 22, with the second-best total average proximity value of 0.240. In comparison, RSF, 1dCNN, and NG achieved the first (0.222), the third (0.679), and the fourth (0.735) best scores concerning the total average over the 40 datasets. In contrast, LSTM received the worst average proximity of 2.109 among all. This evidence indicates that the generated counterfactuals from the proposed NoAE solution are the closest to the original samples. Table 3 shows the compactness score for each dataset, together with the average score and win/draw count at the bottom. NoAE from Glacier achieved the best win/draw count of 18 with a total average compactness score of 0.593, suggesting that approximately \(59\%\) of the time steps of the counterfactuals remained close to the original time series, compared to the other instantiations with scores of 0.366 and 0.138; meanwhile, the RSF and NG methods achieved total average scores of 0.801 and 0.733, respectively, suggesting that their counterfactual samples are relatively more compact than those of the other approaches. LSTM received the worst compactness of 0.138, which means that approximately \(14\%\) of each time series remains unchanged compared to the original sample.

Table 4 Summary of validity for the 40 datasets in the experiment of locally-constrained time series counterfactuals
Table 5 Summary of proximity for the 40 datasets in the experiment of locally-constrained time series counterfactuals
Table 6 Summary of compactness for the 40 datasets in the experiment of locally-constrained time series counterfactuals

Next, we compare the unconstrained variant of Glacier with the locally-constrained counterfactual variants and the uniform variant in Tables 4, 5, and 6. Based on the previous results, we found that the LSTM instantiation obtained the worst average scores on all three metrics, which indicates that the LSTM auto-encoder model cannot guarantee a successful generation of counterfactual explanations. Therefore, we chose to exclude it from the following comparison and focus only on the 1dCNN and NoAE instantiations. The four variants are compared separately for these two instantiations.

Table 4 shows the validity scores for the 1dCNN and NoAE instantiations. We can observe that the unconstrained variant outperformed the other variants, with total average scores of 0.938 and 0.964, respectively. Moreover, we observed that for some specific datasets, the validity improved with additional example-specific or global constraints in the counterfactual generation. For example, the example-specific variant obtained the best validity for the FreezerR dataset for both 1dCNN and NoAE, and the global variant achieved a validity of 1.000 in the case of the Coffee dataset. Meanwhile, the uniform variant obtained the worst validity scores of 0.392 and 0.351 for the 1dCNN and NoAE instantiations, respectively.

In Table 5, we can observe that the uniform variant achieved the best proximity score for both 1dCNN and NoAE across all 40 datasets, with average scores of 0.352 and 0.079, respectively. For 1dCNN, the example-specific and global variants obtained the second-best proximity of 0.594 and the third-best proximity of 0.610, respectively, over all 40 datasets. However, the difference between the example-specific and global variants is relatively small, which suggests that both of them had a similar performance. On the other hand, we found that for the NoAE instantiation, the global variant slightly improved the unconstrained Glacier’s proximity from 0.240 to 0.238. Table 6 shows that the uniform variant obtained the highest compactness among the variants, with total averages of 0.764 and 0.921 for 1dCNN and NoAE, suggesting that approximately \(76\%\) and \(92\%\) of the time steps of the counterfactuals remained close to the original time series. While the unconstrained variant received the worst compactness score, we observed that both the example-specific and global variants improved on the unconstrained Glacier by a margin of approximately \(20\%\). For the NoAE instantiation, the global variant had the best compactness for the SonyAIBO2 dataset, indicating that including additional constraints in Glacier can further improve the counterfactual compactness for some specific datasets.

Fig. 3 Critical difference diagrams for proximity, validity, and compactness over the 40 UCR datasets. a, c, e depict the comparison between the unconstrained Glacier, k-NN, RSF, and NG; b, d, f show the comparison for different variants of weighting constraints: unconstrained, example-specific, global, and uniform

5.4.1 Critical difference diagrams

In this subsection, we show critical difference diagrams (Fawaz et al., 2019) demonstrating the pairwise significance of the counterfactual generation methods from the Wilcoxon-Holm post-hoc analysis over the 40 UCR datasets. In Fig. 3, we present two sets of comparisons in terms of proximity, validity, and compactness: the first compares the unconstrained variant of Glacier with k-NN, RSF, and NG; the second compares the different variants of weighting constraints in Glacier. In Fig. 3a, we observe that the unconstrained NoAE achieved an average rank of 1.8 in terms of proximity, with no significant difference between the RSF method and the NoAE instantiation. From Fig. 3c we can see that RSF outperformed the unconstrained variant of Glacier with both the 1dCNN and NoAE instantiations in terms of validity. Moreover, RSF obtained the best rank of 1.8 in compactness, compared to NG and the NoAE instantiation of Glacier with average ranks of 2.2 and 2.7 (Fig. 3e).

Figure 3b, d shows that the unconstrained variant of Glacier outperformed the others in validity with the highest average rank (1.2) while obtaining the lowest rank (3.3) in proximity; vice versa, the uniform variant obtained the highest average rank in proximity (1.0) but the lowest rank (3.9) in validity. This evidence suggests that there is a trade-off between the performance of proximity and validity for different variants of Glacier. In addition, we observed that the global and example-specific variants both outperformed the unconstrained variant in Fig. 3b, f, which suggests that introducing weighting constraints in Glacier can further improve the performance in terms of proximity and compactness. On the other hand, the global variant appeared to outperform the example-specific variant in general, as it ranked relatively higher in both proximity and validity. Meanwhile, for compactness, these two variants of weighting constraints received the same average rank, as shown in Fig. 3f.

Fig. 4 A trade-off analysis for different variants of Glacier; each point in the scatter plot represents one specific dataset from the 40 UCR datasets in our experiment

In the previous comparison of critical difference diagrams, we observed a trade-off between the different constraint variants of Glacier. Next, we provide a detailed trade-off analysis for the weighting constraints, demonstrating the relationship between proximity and validity, as well as between compactness and validity. Figure 4a shows the trade-off between proximity and validity across the 40 UCR datasets for the unconstrained, example-specific, global and uniform variants of Glacier. We can observe that the example-specific and global variants showed a balanced performance compared to the other two variants: the unconstrained variant reached high validity at the cost of higher proximity, while the uniform variant obtained a low (i.e., good) proximity score but also low validity. In Fig. 4b, we found a similar relationship between compactness and validity. The figure shows that even though the uniform variant obtained high compactness, it received low validity as a trade-off. On the other hand, the example-specific and global variants of Glacier both improved the compactness of the unconstrained variant, while retaining high validity for most of the datasets in the experiment. In this case, the global variant showed slightly denser clustering of points compared to the example-specific variant, indicating that its performance is slightly more robust.

5.4.2 Ablation study

Next, we perform an ablation study to examine the effects of the prediction margin weight w and the decision boundary threshold \(\tau\) in the Glacier method. We choose to investigate the proximity, validity, and compactness for the TwoLeadECG dataset. The prediction margin weight w decreases from 1 to 0, with a step of 0.1, while the decision boundary threshold \(\tau\) increases from 0.4 to 0.99, with an average step of 0.1. Note that when the prediction margin weight \(w = 1\), the constraint term in Eq. 7 receives zero weight and Glacier reduces to its unconstrained variant.

Fig. 5 Ablation study over the prediction margin weight w, ranging from 1 to 0, with a step of 0.1. a–c show the proximity, validity, and compactness of different constraint variants for the 1dCNN and NoAE instantiations

Fig. 6 Ablation study over the decision boundary threshold \(\tau\), ranging from 0.4 to 0.99, with an average step of 0.1. a–c show the proximity, validity, and compactness of different constraint variants for the 1dCNN and NoAE instantiations

In Fig. 5b, we observed that the validity did not decrease substantially until the margin weight w became lower than 0.5, across the three Glacier variants for the 1dCNN instantiation; for NoAE, the validity of the global and example-specific variants already started to drop sharply at \(w \ge 0.5\). Moreover, Fig. 5a shows that all of the variants obtained better proximity scores when the prediction margin weight w ranged from 0.5 to 0.2. On the other hand, we found that when \(w = 0.5\) for the example-specific and global variants, the compactness remained high (Fig. 5c) without a decrease in the validity score. This evidence suggests that \(w = 0.5\) is a suitable margin weight value that provides an improvement in proximity and compactness without sacrificing validity.

In Fig. 6, we can observe the proximity, validity, and compactness when increasing the decision boundary threshold \(\tau\). Figure 6b shows that when we changed the value of \(\tau\) from 0.4 to 0.5, the validity rose from \(0\%\) to above \(80\%\) for seven out of the eight variants presented in the figure. In combination with Fig. 6a, we found that when we increased \(\tau\) from 0.5 to 0.99, the validity barely increased while the proximity became much higher for both the 1dCNN and NoAE instantiations. Meanwhile, we observed in Fig. 6c that the compactness score constantly decreased as we increased the value of \(\tau\). This evidence demonstrates that \(\tau = 0.5\) is a reasonable threshold value for comparing Glacier’s performance in our main experiment. This comparison also suggests that the parameter \(\tau\) should be controlled thoughtfully in the Glacier framework: the benefit of a higher \(\tau\) is that we can be more confident that the generated counterfactual explanations have crossed the decision boundary of the classifier; at the same time, the counterfactuals become more distant from the original samples and less compact.

Fig. 7 Examples of generated counterfactuals by the 1dCNN (a–d) and NoAE (e–h) instantiations, with different weighting constraints: unconstrained, example-specific, global and uniform. Illustrated in blue are the original time series and in red the generated counterfactuals of the opposite class. The “red” bars highlight the area of time series points for which changes are favourable by each variant (Color figure online)

Fig. 8 Residual plots of differences between generated counterfactuals and original samples by the 1dCNN (a–d) and NoAE (e–h) instantiations, with different weighting constraints: unconstrained, example-specific, global and uniform, where the blue line is the mean and the purple area represents the residual error. The “red” bars highlight the area of time series points for which changes are favourable by each variant (Color figure online)

5.4.3 Use-case study on ECGs

In Fig. 7, we show examples of counterfactuals generated for the TwoLeadECG dataset by both the 1dCNN and NoAE instantiations, using the different constraint variants. From the figures in the top row (Fig. 7a–d), we can observe that the majority of modifications to the original ECG occurred between the 10th and 30th timesteps across the four variants of the 1dCNN instantiation. We found that the counterfactuals in both Fig. 7b, c contained fewer changes at the beginning and the end of the original time series (i.e., between timesteps 0–5 and 60–82) compared to Fig. 7a, which suggests that both the example-specific and global variants can constrain the counterfactual changes to only the important regions (i.e., within the “red” bars). However, the uniform variant (see Fig. 7d) appeared to restrict most of the time points of the counterfactual sample. Figure 7e–h shows the counterfactuals for the four variants of the NoAE instantiation. We observed that most of the modifications were small spikes across different time points of the ECG, and the counterfactual samples appeared less realistic compared to 1dCNN. Our assumption is that the auto-encoder learns important time series features from the ECG training data distribution, making the counterfactual generation more effective, and thus the counterfactuals from the 1dCNN instantiation are relatively closer to the original ECG samples.

In addition, we conducted an in-depth analysis of the residual plots to examine the effects of the different variants of Glacier. Figure 8 shows the differences between the counterfactual samples and the original samples from the test set, where the blue line is the mean, the purple band represents the residual error, and the “red” bars are regions that are favourable for changes (note that for Fig. 8b, f the red bars were aggregated over all important regions from the test set). In Fig. 8a, e, we observed that the residual error bands are relatively large and the band width is consistent across all time points, indicating that all time points are equally favourable to counterfactual changes for the unconstrained variants of Glacier. On the contrary, the uniform variants have narrower bands (Fig. 8d, h), which means that the whole time series is not favourable to changes, due to the constraint vector penalizing every time point. From Fig. 8b, f, c, g, we observed that for both the example-specific and global variants, the majority of changes occurred within the red bars, indicating that they favoured changes in the important time series regions. The main difference between them is that the global variant has a fixed region for applying the constraints, while for the example-specific variant, different counterfactual samples have slightly different red bars. In addition, we compared the 1dCNN and NoAE instantiations and observed that the 1dCNN instantiation (Fig. 8a–d) contained larger spikes in the middle area of the time series (i.e., between timesteps 30–40). This is aligned with our assumption that applying an auto-encoder can potentially prevent feature changes towards unreasonable values by learning the training data distribution. NoAE, on the other hand, appeared to have more evenly distributed residuals (i.e., an equal amount of change) along the time series; still, the majority of the changes appeared within the red bars (see Fig. 8f, g).

Finally, we investigated the medical relevance of the red bar areas of the example-specific and global variants, and we observed that they correspond to QT prolongation regions (Nachimuthu et al., 2012). These regions are characterized by delayed ventricular repolarization, which, when excessive, can in turn lead to the development of early after-depolarization in the myocardium. Such conditions can trigger tachycardias (such as Torsades de Pointes) or can degenerate into different types of arrhythmias, such as ventricular fibrillation (Li & Ramos, 2017). Hence, in this particular use-case dataset, our two proposed variants of Glacier manage to identify critical ECG regions and provide counterfactual samples that attempt to correct medically relevant ECG abnormalities, such as QT prolongation.

6 Discussion

In the previous section, we observed that the 1dCNN and NoAE instantiations of Glacier exhibit a trade-off in performance regarding proximity, validity, and compactness. The proximity of NoAE was better than that of 1dCNN for most of the comparisons, while, on the other hand, 1dCNN outperformed NoAE in terms of validity for each constraint variant of Glacier. This evidence further suggests that utilizing an auto-encoder can make the counterfactual generation more efficient and yield more reliable counterfactuals, with trade-offs in proximity and compactness. Nevertheless, we would like to emphasise that our primary focus is to propose the general Glacier framework, where the counterfactual search process can be conducted either on the original or on a latent feature space, e.g., with an auto-encoder. In addition, our objective is to examine the performance differences across the different variants of weighting constraints in Glacier; hence, our main empirical evaluation presented the 1dCNN and NoAE comparisons separately. This is also the reason we utilized a simplified auto-encoder structure across the 40 UCR datasets. As a general framework, our assumption is that, given an auto-encoder with a more refined, possibly domain-adapted structure, the performance of Glacier in latent spaces can be improved with similar constraint types and hyperparameters.

In our experimental design, we decided to limit the scope of the experiments to the scenario of binary constraint weights (i.e., 0s and 1s), so that we focus on examining the performance differences among the unconstrained, example-specific, global and uniform constraints. Moreover, in the current version of Glacier, we did not impose any explicit restrictions, such as a reasonable value range, on the time steps that are desired to change (i.e., when the constraint weight is 1). A side note is that the auto-encoder version of Glacier could potentially prevent feature changes towards unreasonable values by following the training data distribution (using 1dCNN or LSTM), even though this is not an explicit restriction. Future work could introduce more specific restrictions on the counterfactual generation, such as providing a desired value range for each important time point.

7 Conclusions

We presented a model-agnostic method for univariate time series counterfactual generation, called Glacier, that extends our previous method LatentCF++. Glacier can be applied to any deep learning classifier and can generate counterfactuals by perturbing the input time series using gradient descent search either on the original or on a latent feature space (using an auto-encoder). Additionally, Glacier allows the definition of temporal constraints, either example-specific or global, on the input time series example, that encourage perturbations at specific time points.

Our experimental results on the UCR archive, focusing on binary classification, showed that Glacier outperforms the baseline models in terms of proximity and compactness, providing robust counterfactuals. Additionally, our proposed approach achieved comparable validity to the state-of-the-art time series tweaking approach RSF. Furthermore, the comparative evaluation of the unconstrained variant against the locally-constrained counterfactual variants showed that including example-specific and global constraints yields a balanced performance compared to the other two variants, where the uniform variant had better proximity but low validity, while the unconstrained variant provided more valid but less compact counterfactuals. It is important to highlight that there are alternative ways to define the temporal constraints, e.g., based on domain knowledge; in this study, however, we explored two: one that employs LIMESegment and one that employs temporal interval importance.

For future work, we plan to extend our work to generalize Glacier to broader counterfactual problems using other types of data, e.g., multivariate time series, textual, or tabular data. In addition, we intend to expand the adaptability of the Glacier method, such as setting constraint vectors as continuous values to allow flexible changes, incorporating additional restrictions for the value ranges of counterfactuals, and extending to a multi-class scenario. Also, we intend to conduct a qualitative analysis with the help of domain experts to validate that the produced counterfactuals are meaningful in different application fields, such as ECG measurements in healthcare or sensor data from signal processing.

8 Reproducibility

All our code to reproduce the experiments and the full results are publicly available at our supporting website.Footnote 5