1 Introduction

Machine learning models are increasingly being used in high-stakes decision-making settings such as healthcare, law or finance. Many of these models are black boxes and therefore do not explain how they arrive at decisions in a way that humans can understand. Nowadays, an increasing number of laws and regulations (Goodman and Flaxman 2017) are coming into place to require the decisions of algorithms to be interpretable (a.k.a. transparent) (Du et al. 2019; Eiras-Franco et al. 2019; Fu et al. 2022; Miller 2019; Zhdanov et al. 2022). Interpretability is enhanced by selecting the features that have the greatest impact on the model as a whole (Benítez-Peña et al. 2021; Bertsimas et al. 2016; Zheng et al. 2021), but also by identifying, locally, the features that drive the decision made for each individual (Lundberg and Lee 2017; Lundberg et al. 2020; Ribeiro et al. 2016).

In this paper, we specifically address the problem of interpretability when data are functions. This type of data arises in important domains such as econometrics, energy and marketing (Jank and Shmueli 2006; Sood et al. 2009; Sunar and Swaminathan 2021). There is a rather extensive literature on the use of machine learning to analyse functional data, e.g., adapting Support Vector Machine models to functional data (Blanquero et al. 2019; Chaovalitwongse et al. 2008), using regression trees to detect critical intervals (Blanquero et al. 2023) or novel forms of interpretability when dealing with functional data (Carrizosa et al. 2022; Martín-Barragán et al. 2014). See also Aneiros et al. (2022), Ramsay (2006) for an overview of methods for functional data analysis.

A specific type of interpretability tool is the counterfactual explanation (Carrizosa et al. 2023; Martens and Provost 2014; Wachter et al. 2017), where one seeks the minimum-cost changes that can be made to an instance so that the given machine learning model would classify it in a different class. For instance, in a credit scoring application one may be interested in knowing how the debt history of a person should have looked for the prediction to change to "loan granted". See Guidotti (2022), Karimi et al. (2022), Verma et al. (2020) for recent surveys on Counterfactual Analysis.

Apart from the advantage mentioned above, namely, guidance on how to change the predicted class to the desired one, finding counterfactuals for a given instance offers further benefits to the stakeholder. First, it allows us to know how robust the prediction is, i.e., how much the record should be perturbed to make the classifier label it in a different class. Second, imposing some sort of sparsity in the process of building counterfactuals allows us to identify the most relevant features, i.e., those that, for this particular instance, force the classifier to place it in the desired class.

While the literature on machine learning for functional data is extensive, this is not the case for counterfactual analysis. Most of the work on counterfactual explanations focuses on tabular, image or text data (Karimi et al. 2022; Ramon et al. 2020; Tolkachev et al. 2022), and much less on functional data. In principle, one could apply the methods developed for tabular data to functional data as well, simply by treating each feature as the measurement of the function at a time instant. However, doing so would lose fundamental information, such as the autocorrelation structure along consecutive time instants. For this reason, some works on counterfactual explanations exploiting the functional nature of data have been suggested, e.g., Ates et al. (2021), Delaney et al. (2021), but, as far as the authors know, none of them uses the structure and properties of the machine learning model. Moreover, when working with functional data, other types of distance measures may be relevant, such as the Dynamic Time Warping distance (Xing et al. 2010), which our methodology can accommodate.

An instance \(\varvec{x} \in {\mathcal {X}} \subset {\mathcal {F}}^{J}\) is defined as a vector of J functional features. The counterfactual explanation \(\varvec{x}\in {\mathcal {X}}^{0} \subset {\mathcal {X}}\) of the instance \(\varvec{x}^0\) is a hypothetical instance generated by combining existing instances in the dataset, hereafter prototypes, so that the cost \(C(\varvec{x},\varvec{x}^0)\) of perturbing the features in \(\varvec{x}^0\) to yield \(\varvec{x}\) is minimal. With this, we achieve several interpretability goals. First, we can deal with multivariate functional data, i.e., our data are functions taking values in some \({\mathbb {R}}^J.\) Second, we are able to identify the instances of the dataset that generate the counterfactual for each instance, controlling how sparse the counterfactual explanation is, in terms of both the number of prototypes used to create the counterfactual \(\varvec{x}\) and the number of functional features changed to move from \(\varvec{x}^0\) to \(\varvec{x}\). Third, we can model the cost function C by means of different distance measures, including popular distances in functional data analysis such as the Dynamic Time Warping distance. We will show that, under mild assumptions, obtaining counterfactual explanations reduces to solving a Mixed Integer Convex Quadratic Model with linear constraints, which can be solved with standard optimization packages.

The remainder of the paper is organized as follows. In Sect. 2, we model the problem of finding counterfactual explanations when data are functions through an optimization problem. In Sect. 3, we focus on counterfactual analysis for additive tree models. In Sect. 4, a numerical illustration using real-world datasets is provided. Finally, conclusions and possible lines of future research are provided in Sect. 5.

2 Counterfactual analysis for functional data

In this section, we detail the mathematical optimization formulation for generating counterfactual explanations when dealing with functional data. To do this, we need to model the structure of the counterfactual instances, the constraints associated with them, as well as the cost function C. This is done in what follows. We postpone to the next section the analysis of the case in which a state-of-the-art score-based classifier, namely, an additive classification tree model, is used.

Recall that an instance \(\varvec{x} \in {\mathcal {X}} \subset {\mathcal {F}}^{J}\) is defined as a vector of J functional features. Hence, \(\varvec{x}=(x_1(t),\dots ,x_J(t))\), where \(x_j:[0,T] \rightarrow {\mathbb {R}}\), \(j=1,\dots ,J\), are Riemann integrable functions defined on the interval [0, T]. Notice that \(x_j(t)\) may be a static feature, e.g., birth date, defined then as a constant function. Note also that, for a given time instant \(t \in [0,T],\) \(\varvec{x}(t)\) is a vector in \({\mathbb {R}}^J\) whose J components may represent measurements of independent attributes, or they may be related; e.g., one can include an attribute \(x_j\) together with some of its derivatives, to provide information also on, e.g., the growth speed or the convexity of the function \(x_j.\)
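As a purely illustrative sketch (the grid, the sizes and the example curve are assumptions, not data from the paper), such an instance can be stored on a regular time grid as a \(J\times n\) array, with a derivative appended as an extra functional feature:

```python
import numpy as np

# Hypothetical example: one functional feature sampled at n = 24 instants on [0, T],
# plus its derivative appended as a second functional feature.
T, n = 24.0, 24
t = np.linspace(0.0, T, n)

x1 = np.sin(2 * np.pi * t / T)   # a functional feature, e.g., a daily demand profile
x2 = np.gradient(x1, t)          # its derivative, carrying information on the growth speed

x = np.vstack([x1, x2])          # instance as a (J, n) array: row j holds x_j on the grid
print(x.shape)                   # (2, 24)
```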

As mentioned in the introduction, the construction of counterfactual solutions depends on the (multiclass) classifier used. Assuming a score-based classifier, we are given a function \(f:{\mathcal {X}} \subset {\mathcal {F}}^{J} \rightarrow \{1,\dots , K\}\) based on score functions \((f_1,\dots ,f_K)\), where K is the number of classes. Given an instance \(\varvec{x}^0 \in {\mathcal {X}}\), let \(f(\varvec{x}^0)\in \textrm{arg}\,\textrm{max}_k f_k(\varvec{x}^0)\) denote the class assigned by the classifier to \(\varvec{x}^0\). For a fixed class \(k^{*}\), the counterfactual instance of \(\varvec{x}^0\) is defined in this paper as the feasible \(\varvec{x}\) obtained with a minimal cost of perturbation of \(\varvec{x}^0\) and classified by the score-based classifier in class \(k^{*}\). This yields the following optimization problem:

$$\begin{aligned} \left\{ \begin{array}{ll} \min _{\varvec{x}} &{} C(\varvec{x},\varvec{x}^0)\\ \text {s.t.} &{} f_{k^{*}}(\varvec{x}) \ge f_k(\varvec{x}) \quad \forall k=1,\dots , K, \quad k\ne k^{*}\\ &{} \varvec{x}\in {\mathcal {X}}^{0}. \end{array} \right. \end{aligned}$$
(1)

The objective function \(C(\varvec{x},\varvec{x}^0)\) is a cost function that measures the dissimilarity between the given instance \(\varvec{x}^0\) and the counterfactual instance \(\varvec{x}\). In the feasible region, we have two types of constraints. The first ensures that the counterfactual \(\varvec{x}\) is classified in class \(k^{*}\), by imposing that the score \(f_{k^{*}}(\varvec{x})\) is the maximum across all k. The second ensures that the counterfactual is in \({\mathcal {X}}^{0} \subset {\mathcal {F}}^{J}\), the set defined through the actionability and plausibility constraints (Mohammadi et al. 2021; Wachter et al. 2017), i.e., constraints ensuring that a counterfactual does not change immovable features and guaranteeing that counterfactual explanations are realistic.

2.1 Counterfactual instances and constraints

Let us discuss the constraints on \(\varvec{x}\) in Problem (1). First, we need to ensure that the counterfactual explanation \(\varvec{x}\) is realistic. In the case of functional data, this yields an infinite-dimensional optimization problem. To enhance the tractability of this requirement, we propose the use of instances of the dataset, i.e., prototypes, to generate the counterfactual explanation. Let \(\varvec{x}^b\), \(b=1,\dots ,B\), be all the instances that have been classified by the model in class \(k^{*}\) and are close enough to \(\varvec{x}^0\) so that they can be seen as references for \(\varvec{x}^0\). For an instance \(\varvec{x}^0=(x_1^0,\dots ,x_J^0)\), feature j of the counterfactual explanation, \(x_j\), is defined as a convex combination of the original feature \(x_j^0\) and feature j of the B prototypes, \(x_j^b\). Thus, the counterfactual explanation \(\varvec{x}\) is defined for each feature j as \(x_j=\alpha _{0j} x^0_j +\sum _{b=1}^B \alpha ^b_j x_j^b\), where \(\sum _{b=0}^B \alpha ^b_j=1, \quad \forall j=1,\dots ,J\).
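As a purely numerical illustration (hypothetical dimensions, random curves and arbitrary weights \(\alpha\), none of which come from the paper's data), assembling \(\varvec{x}\) from \(\varvec{x}^0\) and the prototypes on a common time grid might look as follows:

```python
import numpy as np

rng = np.random.default_rng(0)
J, B, n = 3, 5, 51                        # features, prototypes, grid length (hypothetical)

x0 = rng.normal(size=(J, n))              # original instance x^0
xb = rng.normal(size=(B, J, n))           # prototypes x^1, ..., x^B of the desired class k*

# Example weights alpha[b, j], with alpha[0, j] attached to x^0_j; each column sums to 1.
alpha = rng.random(size=(B + 1, J))
alpha /= alpha.sum(axis=0, keepdims=True)

# x_j = alpha_{0j} x^0_j + sum_b alpha^b_j x^b_j, for every feature j
x = alpha[0][:, None] * x0 + np.einsum('bj,bjn->jn', alpha[1:], xb)
print(x.shape)                            # (3, 51)
```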

To enhance the interpretability of the counterfactual explanation \(\varvec{x}\) thus obtained, we want to use as few prototypes \(\varvec{x}^b\) as possible in its construction. For this reason, we impose a maximum of \(B^{\textrm{max}}\) prototypes to be used, where \(B^{\textrm{max}}\) is a user-defined parameter.

In \({\mathcal {X}}^0\) we may also impose constraints fixing the immovable features, or other constraints such as upper or lower limits on the static variables.

2.2 Cost function

Recall that \(C(\varvec{x},\varvec{x}^0)\) is the cost of changing \(\varvec{x}^0\) to \(\varvec{x}\), which can be measured by the proximity between the curves defining \(\varvec{x}\) and \(\varvec{x}^0\).

The proximity between curves can be measured in several ways. One can use for instance the squared Euclidean distance:

$$\begin{aligned} \Vert \varvec{x}-\varvec{x}^0\Vert _2^2=\int _{0}^T \sum _{j=1}^J (x_j(t)-x^0_j(t))^2 dt \end{aligned}$$
(2)

Needless to say, different weights can be assigned to each feature in the expression above.
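On a discretised time grid, the weighted version of (2) can be approximated, e.g., by the trapezoidal rule; a minimal sketch, where the feature weights and the grid are illustrative assumptions:

```python
import numpy as np

def weighted_sq_l2(x, x0, t, w=None):
    """Approximate int_0^T sum_j w_j (x_j(t) - x0_j(t))^2 dt for (J, n) arrays x, x0 on grid t."""
    w = np.ones(x.shape[0]) if w is None else np.asarray(w)   # one weight per functional feature
    integrand = (w[:, None] * (x - x0) ** 2).sum(axis=0)      # integrand evaluated on the grid
    return np.trapz(integrand, t)                             # trapezoidal rule in time
```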

Another popular distance used in the literature (Esling and Agon 2012; Xing et al. 2010) is the Dynamic Time Warping (DTW) distance, which measures the dissimilarity between two functions that may be inspected at different speeds, see Fig. 1. More explicitly, suppose we have \(\varvec{x}\) and \(\varvec{x}'\), discretised into two sequences of length n, so that the J-variate functions \(\varvec{x}\) and \(\varvec{x}'\) are replaced by \((\varvec{x}(t_1),\ldots ,\varvec{x}(t_n))\) and \((\varvec{x}'(t_1),\ldots ,\varvec{x}'(t_n)).\) Observe that \(\varvec{x}(t)\) and \(\varvec{x}'(t)\) are vectors in \({\mathbb {R}}^J.\) A warping path \(\pi\) is a chain of pairs of the form \(\pi =(q_{11},q_{21}) \rightarrow (q_{12},q_{22}) \rightarrow \ldots \rightarrow (q_{1Q},q_{2Q})\) of length Q, \(n \le Q \le 2n-1,\) satisfying the following two conditions:

1. \((q_{11},q_{21}) =(t_1,t_1)\) and \((q_{1Q},q_{2Q}) =(t_n,t_n)\);

2. \(q_{1r} \le q_{1 (r+1)} \le q_{1r}+1\) and \(q_{2r} \le q_{2 (r+1)} \le q_{2r}+1\), for \(r=1,2,\ldots ,Q-1\).

Let \({\mathcal {W}}\) denote the set of all warping paths. Then, the DTW distance \(\text {DTW}(\varvec{x},\varvec{x}')\) between \(\varvec{x}\) and \(\varvec{x}'\) is the minimal squared Euclidean distance between pairs of the form \((\varvec{x}(q_{11}),\ldots ,\varvec{x}(q_{1Q}))\) and \((\varvec{x}'(q_{21}),\ldots , \varvec{x}'(q_{2Q}))\) when \((q_{11},q_{21}) \rightarrow (q_{12},q_{22}) \rightarrow \ldots \rightarrow (q_{1Q},q_{2Q})\) is a warping path, i.e.,

$$\begin{aligned} \text {DTW}(\varvec{x},\varvec{x}') = \min \quad & \sum _{r=1}^Q \sum _{j=1}^J \left( x_j(q_{1r}) - x'_j(q_{2r})\right) ^2 \\ \text {s.t.} \quad & \pi =(q_{11},q_{21}) \rightarrow (q_{12},q_{22}) \rightarrow \ldots \rightarrow (q_{1Q},q_{2Q}) \in {\mathcal {W}} \end{aligned}$$
(3)

Observe that DTW can be efficiently evaluated using dynamic programming (Berndt and Clifford 1994).
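A minimal sketch of this dynamic-programming evaluation for two J-variate series sampled on the same grid of n instants, with the squared Euclidean ground cost of (3); the function name and array layout are assumptions for illustration:

```python
import numpy as np

def dtw(x, xp):
    """DTW distance (3) between two J-variate series of shape (J, n), via dynamic programming."""
    n = x.shape[1]
    # cost[i, k] = squared Euclidean distance between x(t_{i+1}) and x'(t_{k+1})
    cost = ((x[:, :, None] - xp[:, None, :]) ** 2).sum(axis=0)
    D = np.full((n + 1, n + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for k in range(1, n + 1):
            # allowed warping-path moves: diagonal, vertical or horizontal step
            D[i, k] = cost[i - 1, k - 1] + min(D[i - 1, k - 1], D[i - 1, k], D[i, k - 1])
    return D[n, n]
```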

Fig. 1 Comparison between different warping paths between the functions \(\varvec{x}\) (in blue) and \(\varvec{x}'\) (in orange)

Additionally, C may contain, on top of the distance-based term described above, other terms measuring, e.g., the number of features altered when moving from \(\varvec{x}^0\) to \(\varvec{x}\). In particular, in Sect. 3 we will discuss in detail the cases

$$\begin{aligned} C(\varvec{x},\varvec{x}^0)=\lambda _0 \Vert \varvec{x}^0-\varvec{x}\Vert _0 +\lambda _2 \int _{0}^T \sum _{j=1}^J (x_j(t)-x^0_j(t))^2 dt, \end{aligned}$$
(4)

and

$$\begin{aligned} C(\varvec{x},\varvec{x}^0)=\lambda _0 \Vert \varvec{x}^0-\varvec{x}\Vert _0 +\lambda _2 \text {DTW}(\varvec{x},\varvec{x}^0), \end{aligned}$$
(5)

where \(\Vert \varvec{x}^0-\varvec{x}\Vert _0\) indicates how many components of \(\varvec{x}^0 = (x^0_1,\ldots ,x^0_J)\) and \(\varvec{x} = (x_1,\ldots ,x_J)\) are not equal,

$$\begin{aligned} \Vert \varvec{x}^0-\varvec{x}\Vert _0 = \vert \{j: \, x^0_j \ne x_j \}\vert , \end{aligned}$$
(6)

and \(\lambda _0,\lambda _2 \ge 0,\) not simultaneously 0.

3 Additive tree models

Problem (1) under the modelling assumptions in Sect. 2 can be addressed for several score-based classifiers. These include, among others, additive tree models (ATM) such as Random Forest (Breiman 2001) or XGBoost models (Chen and Guestrin 2016), as well as linear models such as logistic regression and linear support vector machines. Below, we focus on ATM, and extend to functional data the analysis for tabular data described in Carrizosa et al. (2021).

The ATM is composed of T classification trees. Each tree t has a set of branching nodes s, each with an associated feature v(s), a time instant \(t_s,\) and a threshold value \(c_s,\) so that a record \(\varvec{x}\) goes to the left or to the right of the branching node depending on whether \(x_{v(s)}(t_s)\le c_s\) or not. Moreover, each tree t has an associated weight \(w^t \ge 0,\) so that the class predicted for an instance \(\varvec{x}\) is the most voted class according to the weights \(w^t\). The ATM can be viewed as a score-based classifier by associating to class k the score \(f_k\) defined as:

$$\begin{aligned} f_k(\varvec{x})=\sum _{t\in \{1,\dots ,T\}:\, t\in {\mathcal {T}}_k(\varvec{x})} w^t, \end{aligned}$$
(7)

where \({\mathcal {T}}_k(\varvec{x})\) denotes the subset of trees that classify \(\varvec{x}\) in class k.
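For illustration, assuming a scikit-learn Random Forest trained on the discretised and flattened curves (so that all tree weights \(w^t\) can be taken equal to 1), the scores (7) could be computed as sketched below; the function name and the flattening convention are assumptions, not part of our method:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def atm_scores(rf: RandomForestClassifier, x_flat: np.ndarray) -> dict:
    """Scores f_k(x) of (7): total weight of the trees voting for class k (here w^t = 1)."""
    scores = {k: 0.0 for k in rf.classes_}
    for tree in rf.estimators_:
        proba = tree.predict_proba(x_flat.reshape(1, -1))[0]   # vote of tree t
        scores[rf.classes_[int(np.argmax(proba))]] += 1.0      # add its weight w^t = 1
    return scores

# The predicted class is then argmax_k f_k(x), as for any score-based classifier.
```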

To model Problem (1) for functional data and additive tree models, the parameters and decision variables in Fig. 2 will be used.

Fig. 2 Parameters and decision variables used to model Problem (1) for additive tree models, when data are functions

Recall that the ATM is already known, i.e., the whole structure, including the topology of the trees and the feature and threshold used at each split, is given. Thus, in order to compute the score of the counterfactual instance, the only requirement is to know in which leaf node of each tree it ends up. When the counterfactual ends up in leaf l of tree t, the branching conditions along the corresponding root-to-leaf path are activated: for each split \(s\in \text {Left}(l,t)\) we have \(x_{v(s)}(t_s)\le c_s\), whereas for each split \(s\in \text {Right}(l,t)\) we have \(x_{v(s)}(t_s) >c_s\). To introduce these logical conditions, we use the following big-M constraints:

$$\begin{aligned}&x_{v(s)}(t_s) - M_1(1-z_l^t) + \epsilon \le c_s \quad s \in \text {Left}(l,t) \end{aligned}$$
(8)
$$\begin{aligned}&x_{v(s)}(t_s)+ M_2(1-z_l^t) - \epsilon \ge c_s \quad s \in \text {Right}(l,t). \end{aligned}$$
(9)

Since Mixed-Integer Optimization solvers cannot model strict inequalities, a small positive quantity \(\epsilon\) is introduced in Eqs. (8) and (9), as is done in Bertsimas and Dunn (2017). With this, the counterfactual variable \(x_{v(s)}\) at the point \(t_s\) is not allowed to take values around the threshold value \(c_s\) of split s. Note that the values of \(M_1\) and \(M_2\) can be tightened for each split.

The score function in (7) can be rewritten as a linear expression as follows:

$$\begin{aligned} \sum _{t=1}^T \sum _{l\in {\mathcal {L}}^t_k} w^t z_l^t, \end{aligned}$$

for \(k=1,\dots ,K\).

Recall that one type of sparsity we seek is to use few prototypes to build the counterfactual explanation. To model this, we introduce binary decision variables \(u_b\), which control, through the parameter \(B^{\textrm{max}}\), the number of prototypes that can be used in the convex combination yielding \(\varvec{x}\).

Given an instance \(\varvec{x}^0\) and a cost function C, the formulation associated with Problem (1), i.e., the problem of finding the minimal-cost perturbation of \(\varvec{x}^0\) that the classifier assigns to class \(k^{*}\), is as follows:

$$\begin{aligned}&\min _{\varvec{x}, \varvec{z}, \varvec{\alpha }, \varvec{u}} \quad C(\varvec{x},\varvec{x}^0) \end{aligned}$$
(10)
$$\begin{aligned}&\qquad \text {s.t.} \quad x_{v(s)}(t_s) - M_1(1-z_l^t) +\epsilon \le c_s \quad \forall s\in \text {Left}(l,t) \quad \forall l\in {\mathcal {L}}^t \quad \forall t=1,\dots ,T \end{aligned}$$
(11)
$$\begin{aligned}&\qquad \qquad x_{v(s)}(t_s) + M_2(1-z_l^t)-\epsilon \ge c_s \quad \forall s\in \text {Right}(l,t) \quad \forall l \in {\mathcal {L}}^t \quad \forall t=1,\dots ,T \end{aligned}$$
(12)
$$\begin{aligned}&\qquad \qquad \sum _{l\in {\mathcal {L}}^t} z_l^t=1 \quad \forall t=1,\dots ,T \end{aligned}$$
(13)
$$\begin{aligned}&\qquad \qquad \sum _{t=1}^T \sum _{l\in {\mathcal {L}}^t_{k^{*}}} w^t z_l^t \ge \sum _{t=1}^T \sum _{l\in {\mathcal {L}}^t_k} w^t z_l^t \quad \forall k=1,\dots , K \quad k\ne k^{*} \end{aligned}$$
(14)
$$\begin{aligned}&\qquad \qquad x_j=\alpha _{0j} x_j^0 +\sum _{b=1}^B \alpha ^b_j x_j^{b} \quad \forall j=1,\dots ,J \end{aligned}$$
(15)
$$\begin{aligned}&\qquad \qquad \sum _{b=0}^B \alpha ^b_j = 1 \quad \forall j=1,\dots ,J \end{aligned}$$
(16)
$$\begin{aligned}&\qquad \qquad \alpha ^b_j \le u_b \quad \forall b=1,\dots ,B \quad \forall j=1,\dots ,J \end{aligned}$$
(17)
$$\begin{aligned}&\qquad \qquad \sum _{b=1}^B u_b \le B^{\textrm{max}} \end{aligned}$$
(18)
$$\begin{aligned}&\qquad \qquad u_b \in \{0,1\} \quad \forall b=1,\dots , B \end{aligned}$$
(19)
$$\begin{aligned}&\qquad \qquad z_l^t \in \{0,1\} \quad \forall l \in {\mathcal {L}}^t \quad \forall t=1,\dots ,T \end{aligned}$$
(20)
$$\begin{aligned}&\qquad \qquad \alpha ^b_j\ge 0 \quad \forall b=1,\dots ,B \quad \forall j=1,\dots ,J \end{aligned}$$
(21)
$$\begin{aligned}&\qquad \qquad \varvec{x}\in {\mathcal {X}}^{0}. \end{aligned}$$
(22)

The cost function in (10) is discussed in more detail below, where we measure the movement from the original instance \(\varvec{x}^0\) to its counterfactual explanation \(\varvec{x}\) for functional data. Constraints (11) and (12) control to which leaf the counterfactual instance is assigned and constraint (13) enforces that only one leaf is active for each tree. Constraint (14) ensures that the counterfactual instance is assigned to class \(k^{*}\), i.e., the score of class \(k^{*}\) is the highest one among all classes. Constraints (15) and (16) define for each feature j the counterfactual instance as the convex combination of \(x^0_j\) and the prototypes \(x_j^b\). To ensure sparsity in the prototypes, constraints (17)–(18) restrict the number of prototypes used in the convex combination to \(B^{\textrm{max}}\). Constraints (19) and (20) ensure that all \(u_b\) and \(z_l^t\) are binary, constraint (21) that the coefficients \(\alpha ^b_j\) are nonnegative and constraint (22) that the counterfactual \(\varvec{x}\) is in \({\mathcal {X}}^{0}\), the set containing the rest of the actionability and plausibility constraints. An overview of all the constraints is detailed in Fig. 3.
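To make the formulation concrete, the following Pyomo sketch (with hypothetical sizes, random data and a toy forest of two depth-one trees, none of which reproduce the experiments of Sect. 4) shows how constraints (11)–(13) and (15)–(21) can be written on a discretised time grid; constraint (14) and the set \({\mathcal {X}}^0\) in (22) would be added analogously from the trained model and the application at hand.

```python
import numpy as np
from pyomo.environ import (ConcreteModel, Var, Constraint, RangeSet,
                           Binary, NonNegativeReals, Reals)

# Hypothetical discretised data: J features on a grid of n instants, B prototypes.
rng = np.random.default_rng(0)
J, B, n, Bmax = 2, 4, 24, 2
x0 = rng.normal(size=(J, n))                       # the instance x^0
xb = rng.normal(size=(B, J, n))                    # prototypes x^1, ..., x^B of class k*

# Hypothetical ATM: two depth-one trees; each tuple is (feature v(s), grid index of t_s,
# threshold c_s); leaf 1 is the left leaf and leaf 2 the right leaf of each tree.
splits = [(1, 5, 0.1), (2, 10, -0.3)]
eps, M1, M2 = 1e-6, 10.0, 10.0

m = ConcreteModel()
m.Jset = RangeSet(1, J)
m.Bset = RangeSet(1, B)
m.B0 = RangeSet(0, B)                              # index 0 stands for x^0 itself
m.Tgrid = RangeSet(1, n)
m.Trees = RangeSet(1, len(splits))
m.Leaves = RangeSet(1, 2)

m.x = Var(m.Jset, m.Tgrid, within=Reals)           # counterfactual curves on the grid
m.alpha = Var(m.B0, m.Jset, within=NonNegativeReals, bounds=(0, 1))   # (21)
m.u = Var(m.Bset, within=Binary)                   # (19): is prototype b used?
m.z = Var(m.Leaves, m.Trees, within=Binary)        # (20): leaf assignment z_l^t

# (15)-(16): each feature is a convex combination of x^0_j and the prototypes x^b_j
m.comb = Constraint(m.Jset, m.Tgrid, rule=lambda m, j, i:
                    m.x[j, i] == m.alpha[0, j] * x0[j - 1, i - 1]
                    + sum(m.alpha[b, j] * xb[b - 1, j - 1, i - 1] for b in m.Bset))
m.one = Constraint(m.Jset, rule=lambda m, j: sum(m.alpha[b, j] for b in m.B0) == 1)

# (17)-(18): at most Bmax prototypes may enter the combination
m.link = Constraint(m.Bset, m.Jset, rule=lambda m, b, j: m.alpha[b, j] <= m.u[b])
m.card = Constraint(expr=sum(m.u[b] for b in m.Bset) <= Bmax)

# (11)-(13): big-M leaf-assignment constraints and one active leaf per tree
def left_rule(m, t):
    v, i, c = splits[t - 1]
    return m.x[v, i] - M1 * (1 - m.z[1, t]) + eps <= c
def right_rule(m, t):
    v, i, c = splits[t - 1]
    return m.x[v, i] + M2 * (1 - m.z[2, t]) - eps >= c
m.left = Constraint(m.Trees, rule=left_rule)
m.right = Constraint(m.Trees, rule=right_rule)
m.one_leaf = Constraint(m.Trees, rule=lambda m, t: m.z[1, t] + m.z[2, t] == 1)
```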

Fig. 3 Description of constraints (11)–(22), used to model Problem (1) for additive tree models, when data are functions

Let us now discuss the objective function in (10) for the particular choices of C introduced in Sect. 2, namely, (4) and (5). In order to model the \(\ell _0\) term defined in (6), binary decision variables \(\xi _j\) are introduced. For every feature \(j=1,\ldots ,J\), the constraints below force \(\xi _j=1\) whenever \(\alpha _{0j}\ne 1\), i.e., whenever \(x_j\ne x_j^0\); since \(\xi _j\) is penalized in the objective, at the optimum \(\xi _j=0\) if and only if \(\alpha _{0j}=1\), i.e., if \(x_j=x_j^0\). This is expressed as

$$\begin{aligned}&-\xi _j \le 1- \alpha _{0j} \le \xi _j \quad j=1,\dots ,J \end{aligned}$$
(23)
$$\begin{aligned}&\xi _j \in \{0,1\}, \quad j=1,\dots , J. \end{aligned}$$
(24)

Moreover, we have that

$$\begin{aligned} \Vert \varvec{x}^0-\varvec{x} \Vert _0 = \sum _{j=1}^J \xi _j. \end{aligned}$$

Thus, for the cost function C in (4), we have the following reformulation of (10)–(22):

$$\begin{aligned} \begin{array}{ll} \displaystyle \min _{\varvec{x},\varvec{z}, \varvec{\alpha },\varvec{u}, \varvec{\xi }} &{} \quad \lambda _0 \displaystyle \sum _{j=1}^J \xi _j +\lambda _2 \displaystyle \int _{0}^T \sum _{j=1}^J (x_j(t)-x^0_j(t))^2 \, dt\\ \text {s.t.} &{} \quad (11)-(22), (23)-(24). \end{array} \end{aligned}$$
(CEF)

For the particular case of (CEF) where only the \(\ell _0\) distance is considered, i.e., \(\lambda _2=0\), the objective function as well as the constraints are linear (assuming \({\mathcal {X}}^0\) is also defined through linear constraints), while we have both binary and continuous decision variables. Therefore, Problem (CEF) can be solved using a Mixed Integer Linear Programming (MILP) solver. For arbitrary \(\lambda _2\ge 0\), taking into account that, by (15),

$$\begin{aligned} (x_j(t)-x^0_j(t))^2 = \left( \sum _{b=0}^B \alpha ^b_j x^b_j(t) - x^0_j(t)\right) ^2, \end{aligned}$$

the second term in the objective can be expressed as a convex quadratic function in the decision variables \(\alpha ^b_j,\) and thus (again, assuming \({\mathcal {X}}^0\) is also defined through linear constraints) Problem (CEF) is a Mixed Integer Convex Quadratic Model with linear constraints.
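Continuing the hypothetical Pyomo sketch above, the (CEF) objective, i.e., the \(\ell _0\) term linearised through (23)–(24) plus the discretised quadratic term, could be added as follows; the quadrature weights and the parameter values are illustrative assumptions.

```python
import numpy as np
from pyomo.environ import Var, Constraint, Objective, Binary, minimize

def add_cef_objective(m, x0, t, lam0=1.0, lam2=0.005):
    """Add xi variables, constraints (23)-(24) and the (CEF) objective to the sketch model m."""
    w = np.zeros(len(t))                       # trapezoidal quadrature weights on the grid t
    w[:-1] += np.diff(t) / 2.0
    w[1:] += np.diff(t) / 2.0

    m.xi = Var(m.Jset, within=Binary)          # (24): xi_j = 1 if feature j is changed

    # (23): xi_j may be 0 only when alpha_{0j} = 1, i.e., when x_j = x^0_j
    m.xi_lb = Constraint(m.Jset, rule=lambda m, j: -m.xi[j] <= 1 - m.alpha[0, j])
    m.xi_ub = Constraint(m.Jset, rule=lambda m, j: 1 - m.alpha[0, j] <= m.xi[j])

    # lambda_0 * sum_j xi_j + lambda_2 * discretised integral of sum_j (x_j - x^0_j)^2
    m.obj = Objective(
        expr=lam0 * sum(m.xi[j] for j in m.Jset)
        + lam2 * sum(w[i - 1] * (m.x[j, i] - x0[j - 1, i - 1]) ** 2
                     for j in m.Jset for i in m.Tgrid),
        sense=minimize,
    )
```

The resulting model can then be passed to a solver, e.g., `SolverFactory('gurobi').solve(m)` in Pyomo, since it is a Mixed Integer Convex Quadratic Model with linear constraints.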

Let us now address Problem (1) when the cost function C has the form (5), and thus the DTW distance is involved. As in Sect. 2, the time interval [0, T] is discretised into time instants \(t_1,\ldots ,t_n,\) and thus the DTW distance is the minimal squared Euclidean distance among the warping paths in \({\mathcal {W}}\), yielding

$$\begin{aligned} \begin{array}{ll} \displaystyle \min _{\varvec{x},\varvec{z}, \varvec{\alpha }, \varvec{\xi },\varvec{u}} &{} \quad \lambda _0 \displaystyle \sum _{j=1}^J \xi _j +\lambda _2 \displaystyle \sum _{r=1}^Q \sum _{j=1}^J \left( x_j(q_{1r}) - x^0_j(q_{2r})\right) ^2 \\ \text {s.t.} &{} \quad (11)-(22), (23)-(24) \\ &{} \quad \pi =(q_{11},q_{21}) \rightarrow (q_{12},q_{22}) \rightarrow \ldots \rightarrow (q_{1Q},q_{2Q}) \in {\mathcal {W}} \end{array} \end{aligned}$$
(CEFDTW)

Notice that, for a fixed warping path in \({\mathcal {W}}\), constraints (11)–(24) are all linear, while we have both binary and continuous variables. Hence, if \({\mathcal {X}}^0\) is again defined by linear constraints, and since the objective function is convex quadratic, Problem (CEFDTW) for a fixed warping path is a Mixed Integer Convex Quadratic Model with linear constraints, which can be solved using standard optimization packages. However, in (CEFDTW) the warping path is itself a decision, and optimizing jointly over \(\varvec{x}\) and the warping path is considerably harder. For this reason we propose an alternating heuristic to solve Problem (CEFDTW):

Algorithm 1 Algorithm to calculate counterfactual explanations with the DTW-based cost function (5)
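The alternating scheme can be sketched as follows: the warping path is fixed and the resulting Mixed Integer Convex Quadratic Model is solved for \(\varvec{x}\) (here through the hypothetical placeholder `solve_cef_for_fixed_path`, e.g., a Pyomo model as above with the DTW-based objective), and then the optimal warping path for the current \(\varvec{x}\) is recomputed by dynamic programming. This is a sketch of the idea under these assumptions, not a verbatim transcription of Algorithm 1.

```python
import numpy as np

def optimal_warping_path(x, x0):
    """Backtrack the optimal warping path between two (J, n) series (same DP as the DTW sketch)."""
    n = x.shape[1]
    cost = ((x[:, :, None] - x0[:, None, :]) ** 2).sum(axis=0)
    D = np.full((n + 1, n + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for k in range(1, n + 1):
            D[i, k] = cost[i - 1, k - 1] + min(D[i - 1, k - 1], D[i - 1, k], D[i, k - 1])
    path, (i, k) = [], (n, n)
    while (i, k) != (1, 1):
        path.append((i - 1, k - 1))
        moves = {(i - 1, k - 1): D[i - 1, k - 1], (i - 1, k): D[i - 1, k], (i, k - 1): D[i, k - 1]}
        i, k = min(moves, key=moves.get)
    path.append((0, 0))
    return path[::-1]                                  # list of index pairs (q_1r, q_2r)

def alternating_heuristic(x0, solve_cef_for_fixed_path, max_iter=10, tol=1e-6):
    """Alternate between solving (CEFDTW) for a fixed warping path and updating the path."""
    path = [(i, i) for i in range(x0.shape[1])]        # start from the identity warping path
    best = np.inf
    for _ in range(max_iter):
        x, obj = solve_cef_for_fixed_path(x0, path)    # hypothetical MICQP solve (Pyomo + Gurobi)
        path = optimal_warping_path(x, x0)             # re-optimise the path for the current x
        if best - obj <= tol:                          # stop when the objective no longer improves
            break
        best = obj
    return x, obj
```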

4 Numerical illustration

We illustrate our methodology on two real-world datasets with functional data, one univariate and one multivariate, from the UCR archive (Dau et al. 2019). For a given instance, we identify the individuals of the dataset from which the corresponding counterfactual is built and what their contribution is. Furthermore, we show the two different types of sparsity that our model can provide, namely, in the number of prototypes used for the counterfactual and in the number of functional features that change. The use of different distances, i.e., the Euclidean and the DTW distances, is also illustrated.

All the mathematical optimization problems have been implemented using the Pyomo optimization modeling language (Hart et al. 2011, 2017) in Python 3.8. As solver, we have used Gurobi 9.0 (Gurobi Optimization 2021). A value of \(\epsilon =1\textrm{e}{-6}\) has been imposed in (11) and (12). The values of the big-M constants in (11) and (12) are node dependent, and they have been tightened following the process described in Carrizosa et al. (2021). For all the computational experiments, the classification model considered has been a Random Forest with \(T=200\) trees and a maximum depth of 4. Our experiments have been conducted on a PC with an Intel Core i7-1065G7 CPU @ 1.30 GHz and 16 GB of RAM, running a 64-bit operating system.

The first dataset, ItalyPowerDemand (Keogh et al. 2006), has one functional feature. There are 1096 instances, and each instance is a time series of length 24, representing the power demand in Italy in six months. The binary classification task is to distinguish days from October to March (response value \(-1\)) from days from April to September (response value \(+1\)).

The second dataset, NATOPS (Ghouaiel et al. 2017), has 24 functional features, each a time series of length 51, representing the X, Y, and Z coordinates of the left and right hands, wrists, thumbs and elbows, as captured by a Kinect 2 sensor. There are 260 instances, and we chose two of the six classes in the dataset. The binary classification task is thus to distinguish the gesture "All Clear" (response value \(-1\)) from "Not Clear" (response value \(+1\)).

4.1 Experimental results

4.1.1 ItalyPowerDemand

We present the counterfactual for an instance \(\varvec{x}^0\) of the dataset ItalyPowerDemand in Fig. 4. In each case, we represent the original curve, the prototypes, and the final counterfactual.

Fig. 4 Counterfactual explanations for \(\varvec{x}^0\) of the ItalyPowerDemand data set, which has been predicted by the Random Forest in \(k^0=-1\) and whose counterfactual \(\varvec{x}\) has to be predicted in class \(k^{*}=+1\). Different values of \(B^{\textrm{max}}\), i.e., the number of prototypes used for the convex combination, have been imposed. The cost function is model (4) with \(\lambda _0 =0, \lambda _2 = 1\)

The first cost model analysed is the squared Euclidean model (4) with \(\lambda _0=0\) (since we have only one feature, \(\lambda _0 > 0\) is meaningless). Different values of \(B^{\textrm{max}}\) have been used. The smaller the value of \(B^{\textrm{max}}\), the sparser the counterfactual is in terms of prototypes, while the larger the value of \(B^{\textrm{max}}\), the greater the freedom to use prototypes and therefore the smaller the distances obtained. In Fig. 5a we plot the relation between the number of prototypes and the distance. It illustrates that using more than one prototype may be beneficial, but using more than 4 prototypes yields less sparsity without smaller distances.

Fig. 5 Distance obtained versus number of prototypes used in a counterfactual explanation \(\varvec{x}\) for \(\varvec{x}^0\) of the ItalyPowerDemand data set, which has been predicted by the Random Forest in \(k^0=-1\) and for which \(k^{*}=+1\) is imposed

To show the flexibility of our model, the same experiments have been carried out with the DTW-based cost (5), again with \(\lambda _0=0\). The counterfactual solutions have been calculated with the heuristic procedure described in Algorithm 1. The results are depicted in Fig. 6. As before, one can see how the objective function decreases as the number of prototypes \(B^{\textrm{max}}\) increases. However, in this case, it is sufficient to use 2 prototypes, as 3 or more do not improve the objective function much, see Fig. 5b.

Fig. 6 Counterfactual explanations for \(\varvec{x}^0\) of the ItalyPowerDemand data set, which has been predicted by the Random Forest in \(k^0=-1\) and whose counterfactual \(\varvec{x}\) has to be predicted in class \(k^{*}=+1\). Different values of \(B^{\textrm{max}}\), i.e., the number of prototypes used for the convex combination, have been imposed. The cost function is model (5) with \(\lambda _0 =0, \lambda _2 = 1\)

4.1.2 NATOPS

We present now the counterfactual for an instance \(\varvec{x}^0\) of the multivariate dataset NATOPS. The cost function used is the squared Euclidean model (4) with \(\lambda _0=1\) and \(\lambda _2= 0.005.\) As we are interested in detecting the critical functional features to flip the classifier's decision, we give, as an illustration, more weight in the cost function to the \(\ell _0\) term.

In Fig. 7 the counterfactual instance \(\varvec{x}\) for \(\varvec{x}^0\) with \(B^{\textrm{max}}=1\) is shown. As the cost function C contains the \(\ell _0\) norm as its first term, we obtain a sparse solution in terms of the features that need to change to move from \(\varvec{x}^0\) to \(\varvec{x}\). Indeed, to change its class, only three functional features have to be modified. The changed features are presented in Fig. 8.

Fig. 7 Counterfactual explanation for \(\varvec{x}^0\) of the NATOPS data set, which has been predicted by the Random Forest in \(k^0=+1\) and whose counterfactual \(\varvec{x}\) has to be predicted in class \(k^{*}=-1\). \(B^{\textrm{max}}=1\) prototype has been imposed. The cost function is model (4) with \(\lambda _0 =1, \lambda _2 = 0.005\)

Fig. 8 Changed features in the counterfactual explanation for \(\varvec{x}^0\) of the NATOPS data set, which has been predicted by the Random Forest in \(k^0=+1\) and whose counterfactual \(\varvec{x}\) has to be predicted in class \(k^{*}=-1\), with \(B^{\textrm{max}}=1\). The cost function is model (4) with \(\lambda _0 =1, \lambda _2 = 0.005\)

As in the univariate case, we can impose different values of \(B^{\textrm{max}}\). In Fig. 9 we show the counterfactual explanation for \(B^{\textrm{max}}=2\) and the same cost function. Note how giving the flexibility to use more than one prototype results in only two features having to be changed, see Fig. 10.

Fig. 9 Counterfactual explanation for \(\varvec{x}^0\) of the NATOPS data set, which has been predicted by the Random Forest in \(k^0=+1\) and whose counterfactual \(\varvec{x}\) has to be predicted in class \(k^{*}=-1\). \(B^{\textrm{max}}=2\) prototypes have been imposed. The cost function is model (4) with \(\lambda _0 =1, \lambda _2 = 0.005\)

Fig. 10 Changed features in the counterfactual explanation for \(\varvec{x}^0\) of the NATOPS data set, which has been predicted by the Random Forest in \(k^0=+1\) and whose counterfactual \(\varvec{x}\) has to be predicted in class \(k^{*}=-1\), with \(B^{\textrm{max}}=2\). The cost function is model (4) with \(\lambda _0 =1, \lambda _2 = 0.005\)

5 Conclusions

In this paper, we have proposed a novel approach, based on mathematical optimization, to build counterfactual explanations for classification problems with multivariate functional data. With our method, we ensure plausible and sparse explanations, controlling not only the number of prototypes of the dataset used to create the counterfactuals, but also the number of features that need to be changed. Our model is also flexible enough to be used with different distance measures, e.g., the Euclidean distance or the DTW distance. Moreover, our methodology is applicable to score-based classifiers, including additive tree models, such as Random Forest or XGBoost models, as well as linear models, such as logistic regression and linear support vector machines. We have illustrated our methodology on two real-world datasets.

There are several interesting lines of future research. First, the extension to non-score-based classifiers, such as k-NN classifiers, deserves study. Second, when defining counterfactual explanations for functional data, one may be interested in keeping part of the curves defining the features fixed. With our method we build the counterfactuals from scratch using combinations of prototypes on the interval [0, T]; suppose instead that we have an instance defined on the interval \([0,t_0)\), and we want to find out what the rest of the curve on the interval \([t_0,T]\) would have to look like for the overall curve to be classified in class \(k^*\). When constructing the rest of the curve, one would need to maintain its smoothness and other properties. Finally, the case in which other distances, such as the optimal transportation distance, are used to measure closeness is a topic of interest.