1 Introduction

Time series classification has received a lot of attention from researchers in recent years due to the increasing amount of data available and its applicability in many domains such as medicine, environment and business. Classical methods for time series classification have been designed to make a decision based on complete time series. However, in many applications where observations arrive one at a time, it is rewarding to be able to classify a time series even without knowing it entirely. In this case, each time a new sample arrives, one can decide either to perform classification or to wait for more data. It is more valuable to make a decision as soon as possible, but premature decisions are more likely to be erroneous. Consequently, two antagonistic notions are at stake. For example, in applications related to business, higher profits can be made thanks to early classification of customer behaviours, provided that the classification is accurate. Hence, two different costs clearly appear from this point of view: the cost of delaying the decision and the cost of misclassification.

Most of the approaches for early classification in the literature focus on the design of algorithms that make a reliable decision from incomplete time series, without explicitly accounting for the cost of delaying the decision [1, 4–7, 10, 13, 14]. In these works, a sufficient guarantee of accuracy triggers the decision.

In this paper, we focus on cost-aware early classification of time series. Let us illustrate its advantages on a concrete example. In a surveillance scenario, one would want the system to warn the police as soon as a crime scene is detected. The optimal time to make a decision depends on two antagonistic notions: accuracy and earliness. For this scenario, warning the police too early would induce a cost if the scene is actually not a crime scene, but waiting too long would prevent the police from intervening in time. It hence becomes desirable to allow end-users to set a trade-off between earliness and accuracy, which is the goal of cost-aware early classification of time series.

Recently, Dachraoui et al. [3] introduced a framework for cost-aware early classification that is dedicated to optimizing a trade-off between the accuracy of the prediction and the time at which it is performed. This framework hence involves both costs defined above: a misclassification cost and a cost of delaying the decision. The authors of [3] derive an algorithm that decides, at test time, whether the decision should be made at a given time instant t or if more data should be waited for (based on both costs defined above). In addition, the method of [3] has two other interesting properties: it is adaptive and non-myopic. It is adaptive in the sense that the optimal time of decision depends on the time series to be classified. It is non-myopic because at every time instant t, the algorithm not only decides if t is the optimal time instant to classify, but it also estimates when this optimal time is likely to happen, if not at time t. Such a property is essential since it can help end users of such systems to get prepared for action when the decision should soon be made (e.g. for medical applications).

In this paper, we propose two new early-classification schemes built upon the framework defined in [3]. This framework is based on approximating an expected cost through clustering, which tends to introduce imprecision into the process. Our schemes improve on this existing work by removing the clustering step. They optimize a trade-off between a misclassification cost and a cost of delaying the decision, and they are adaptive and non-myopic.

The rest of this paper is organized as follows. Section 2 explains the problem statement and reviews the related work on early classification of time series. Section 3 presents both proposed approaches and how they differ from [3]. Experiments on real time series data sets are presented in Sect. 4. Notations used throughout this paper are summarized in Table 1.

Table 1. Mathematical notations used throughout the paper.

2 Related Work

A wide range of methods tackle the early classification problem without explicitly accounting for the cost of delaying the decision [1, 4–7, 10, 13, 14]. These methods differ in (i) the design of the early classifiers and (ii) the estimation of the optimal time to make a decision. They mostly wait for sufficient confidence to make their decision, using varied procedures to define this confidence. The authors of [7] design an early classification scheme based on the boosting procedure for classification, where one weak classifier is built at each time stamp. This method is able to predict the class of a test time series at any time, but the issue of estimating the optimal time to make the decision is not addressed. Hatami et al. [5] propose an ensemble classifier with a reject option. A decision is made as soon as the agreement between all classifiers is above a certain threshold. Otherwise, the reject option is activated and more data is waited for. In [14], the authors rely on feature extraction for early classification. They first extract, from a training set of time series, a set of features called shapelets with a high discriminating power. Shapelets having high utility are then selected, where utility is an extension of the F-measure which also takes earliness into account. At test time, classification is performed as soon as one of the selected shapelets is found in the testing time series. Ghalwash et al. [4] extend this work so that it also provides an estimation of uncertainty associated with the decision. A variant of these works designed for multivariate time series is proposed in [6]. Antonucci et al. [1] use Hidden Markov Model classifiers with set-valued parameters to determine whether the model is confident enough about its prediction (if not, several possible outputs are returned, which illustrates the uncertainty of the model). Early classification happens when no ambiguity remains about the prediction.

These methods do not attempt to infer whether future observations could improve classification accuracy. For difficult classification problems, they might tend to delay the decision even if future observations are unlikely to help. To bypass this limitation, Xing et al. [13] present a method that relies on neighborhood properties. For a given training time series \(\mathbf {x^i} \in \mathcal {T}\), optimal prediction time \(\tau (\mathbf {x^i})\) is set to the smallest \(\tau ^i\) such that the set of reverse nearest neighbors for \(\mathbf {x^i}\) is stable for all \(t \ge \tau ^i\). At test time, if a time series has \(\mathbf {x_{t}^i}\) as nearest neighbor after \(t \ge \tau (\mathbf {x^i})\) time steps, classification is performed at this point. To improve on this setting, clustering can be performed on training time series so that stability can be robustly evaluated at the cluster level. Parrish et al. [10] design a probabilistic approach based on Quadratic Discriminant Analysis which aims at maximizing the reliability, i.e. the probability that the early decision leads to the same classification result as the classification of the complete time series.

One drawback of these methods is that they do not explicitly optimize a trade-off between earliness and accuracy, hence leading to sub-optimal solutions in this regard. In order to overcome this limitation, Dachraoui et al. [3] propose a framework for early classification, where the cost of delaying the decision is included in the optimization function. This framework considers a time series classification task for which two cost functions are given:

  • \(C_m(\hat{y}, y)\) is the misclassification cost function, i.e. the cost of predicting class label \(\hat{y}\) whereas the effective class label was y;

  • \(C_d(t)\) is the cost of delaying the decision to time t.

In this framework, the cost of classifying a time series \(\mathbf {x}\) of class y using a set \(\mathcal {H} = \{h_t\}_{1 \le t \le T}\) of classifiers is given by:

$$\begin{aligned} C(\mathbf {x}, y) = C_m(h_{\tau (\mathbf {x})}(\mathbf {x}), y) + C_d(\tau (\mathbf {x})), \end{aligned}$$
(1)

where \(\tau (\mathbf {x})\) is the time index at which classification is performed and \(h_{\tau (\mathbf {x})}(\mathbf {x})\) is the class predicted at time \(\tau (\mathbf {x})\) by the ad hoc classifier. To optimize on Eq. (1), the authors compute an expected cost for current and future time instants. If the current expected cost is lower than all future ones, classification is performed. This method is further discussed in Sect. 3.

The authors of [9] also design an early classification scheme that takes the cost of delaying the decision into account. They define a stopping rule that depends on 4 parameters (to be fine tuned), on the time of the decision and on the posterior probabilities output by the classifiers. The optimal parameters for the stopping rule are learned through cross-validation so that they minimize a cost function that is similar in spirit to Eq. (1):

$$\begin{aligned} C'(\mathbf {x},y) = \alpha \times C_m(h_{\tau (\mathbf {x})}(\mathbf {x}), y) + (1-\alpha ) \times C_d(\tau (\mathbf {x})), \end{aligned}$$
(2)

where \(\alpha \) is a parameter that sets the trade-off between \(C_m\) and \(C_d\). Unlike the method proposed in [3], the one from [9] is myopic in the sense that at a given time t, if the algorithm decides that it is better to wait to perform classification, it cannot provide an estimate \(\hat{\tau }(\mathbf {x_t}) > t\) of the optimal classification time.
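To make Eqs. (1) and (2) concrete, the following minimal Python sketch evaluates both costs for a single decision. It assumes a 0/1 misclassification cost and a delay cost that is linear in time; both choices, as well as all function and parameter names, are illustrative assumptions and are not prescribed by [3] or [9].

```python
def misclassification_cost(y_pred, y_true):
    """C_m: 0/1 cost (an assumption made for this sketch)."""
    return 0.0 if y_pred == y_true else 1.0


def delay_cost(t, beta=0.01):
    """C_d: delay cost, assumed linear in the decision time t."""
    return beta * t


def cost_eq1(y_pred, y_true, tau, beta=0.01):
    """Cost of Eq. (1) for a decision y_pred taken at time tau."""
    return misclassification_cost(y_pred, y_true) + delay_cost(tau, beta)


def cost_eq2(y_pred, y_true, tau, alpha=0.5, beta=0.01):
    """Cost of Eq. (2): alpha weights C_m against C_d."""
    return (alpha * misclassification_cost(y_pred, y_true)
            + (1.0 - alpha) * delay_cost(tau, beta))
```

With these assumptions, a correct decision at time 10 costs only the delay term, while a wrong decision at the same time adds a unit penalty.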

In this paper, we propose two non-myopic early classification methods taking into account the cost of delaying the decision. These methods are improvements over the framework defined in [3]. In the next section, we detail this framework and position the methods we propose with respect to it.

Fig. 1. Illustration of proposed methods on Gun_Point time series.

3 Proposed Early Classification Schemes

The problem tackled in this paper is to estimate, for an incoming time series \(\mathbf {x}\), the optimal time \(\tau ^*(\mathbf {x})\) at which the classification of \(\mathbf {x}\) should be done in order to minimize the cost given by Eq. (1). In Sect. 3.1, we first give a detailed review of the method proposed in [3] and its limitations. We then introduce an improved version of this approach from which we derive two early classification methods presented in Sects. 3.2 and 3.3.

3.1 Computing Expected Costs for Training Time Series

As an attempt to optimize on Eq. (1), Dachraoui et al. [3] intend to perform classification at a time t such that the expected cost:

$$\begin{aligned} f(\mathbf {x_t}) = \sum _{y \in \mathcal {Y}}P(y|\mathbf {x_t}) \sum _{\hat{y} \in \mathcal {Y}} P(\hat{y}|y,\mathbf {x_t}) \, C_m(\hat{y}, y) + C_d(t) \end{aligned}$$
(3)

is minimized. The term \(P(y|\mathbf {x_t})\) is unknown by definition of the learning problem, which prevents us from computing this expected cost directly.

In order to approximate Eq. (3), the authors rely on a clustering \({\mathcal {C} = \{\mathbf {c^1}, \dots , \mathbf {c^K}\}}\) of the set \(\mathcal {T}\) of complete training time series. At any given instant t, the expected cost of classifying \(\mathbf {x}\) at future time instants \(t+\delta \) (for \(\delta \ge 0\)) is computed as:

$$\begin{aligned} \nonumber f_{\delta }(\mathbf {x_t}) =&\sum _{k = 1}^K P(\mathbf {c^k}|\mathbf {x_t}) \sum _{y \in \mathcal {Y}}P(y|\mathbf {c^k})\sum _{\hat{y} \in \mathcal {Y}} P_{t+\delta }(\hat{y}|y,\mathbf {c^k}) \, C_m(\hat{y}, y) + C_d(t+\delta )\\ =&\sum _{k = 1}^K P(\mathbf {c^k}|\mathbf {x_t}) \, E_{t+\delta }(\mathbf {c^k}), \end{aligned}$$
(4)

where \(E_{t+\delta }(\mathbf {c^k})\) is the expected cost of performing classification at time \(t+\delta \) for series in cluster \(\mathbf {c^k}\), which is computed as follows. At every time t, probabilities \(P_t(\hat{y} \, | \, y, \mathbf {c^k})\) are estimated using cross-validation on the training set. Each of these probabilities corresponds to a cell of the confusion matrix associated with cluster \(\mathbf {c^k}\) at time t. The computation of the expected cost function \(f_\delta (\mathbf {x_t})\) also depends on the probabilities \(P(\mathbf {c^k} \, | \, \mathbf {x_t})\). In [3], these probabilities are computed as:

$$\begin{aligned} P(\mathbf {c^k}|\mathbf {x_t}) = \frac{s^k}{\sum _{i=1}^K s^i}, \, \text { where } s^k = \frac{1}{1+\exp ^{-\lambda \varDelta ^k_t}}, \end{aligned}$$
(5)

\(\lambda \) is some positive constant and \(\varDelta ^k_t=(\bar{D_t}-d^k_t)/\bar{D_t}\) is the normalized difference between the average \(\bar{D_t}\) of the distances between \(\mathbf {x_t}\) and all the clusters, and the distance \(d^k_t\) between \(\mathbf {x_t}\) and the cluster \(\mathbf {c^k}\). We refer the interested reader to [3] for more details on these equations.
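As an illustration of Eq. (5), the sketch below turns a list of distances between \(\mathbf {x_t}\) and the K clusters into membership probabilities. It is a minimal sketch under assumptions: the distances are taken as given (the distance measure itself is not fixed here), the function names are hypothetical, and a numerically stable form of the logistic function is used because \(\lambda \varDelta ^k_t\) can be large in magnitude.

```python
import math


def stable_sigmoid(z):
    """Logistic function 1 / (1 + exp(-z)), written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def membership_probabilities(distances, lam=100.0):
    """Eq. (5): map distances d^k_t to probabilities P(c^k | x_t)."""
    mean_d = sum(distances) / len(distances)
    if mean_d == 0.0:
        # All distances are zero: fall back to uniform weights (assumption).
        return [1.0 / len(distances)] * len(distances)
    # Normalized differences Delta^k_t = (mean - d^k_t) / mean.
    deltas = [(mean_d - d) / mean_d for d in distances]
    scores = [stable_sigmoid(lam * delta) for delta in deltas]
    total = sum(scores)
    return [s / total for s in scores]
```

Clusters that are closer than average receive a score above 0.5 before normalization, and larger values of \(\lambda \) concentrate the probability mass on the closest clusters.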

In other words, the expected cost function \(f_\delta (\mathbf {x_t})\) for a time series to be classified is a weighted sum of the per-cluster expected cost functions \(E_{t+\delta }(\mathbf {c^k})\) computed from the training set. Classification is finally performed at time t if, for all \(\delta > 0\), \(f_{\delta }(\mathbf {x_t}) \ge f_0(\mathbf {x_t})\).

This approach hence relies on the assumption that intra-cluster variability is sufficiently low for distance to centroid to be an acceptable proxy for the distance between time series. In practice, such an assumption is unlikely to hold. Furthermore, the clustering \(\mathcal {C}\) is obtained using complete time series whereas distances between \(\mathbf {x_t}\) and the centroids are computed from incomplete series. This mismatch is likely to affect the estimation of \(\tau ^*(\mathbf {x})\).

In order to overcome these two weaknesses, we propose to get rid of the clustering step used in [3]. Training expected costs are then computed on a per-series basis as:

$$\begin{aligned} E_{t}(\mathbf {x^i})= & {} \sum _{y \in \mathcal {Y}} P(y|\mathbf {x^i}) \sum _{\hat{y} \in \mathcal {Y}} P_{t}(\hat{y} \, | \, \mathbf {x^i_{t}}) \, C_m(\hat{y}, y) + C_d(t) \nonumber \\= & {} \sum _{\hat{y} \in \mathcal {Y}} P_{t}(\hat{y} \, | \, \mathbf {x^i_{t}}) \, C_m(\hat{y}, y^i) + C_d(t). \end{aligned}$$
(6)

Indeed, as we do not tackle the multilabel case in this work, for every individual time series \(\mathbf {x^i}\) (of class \(y^i\)) in the training set, we have \(P(y\, | \,\mathbf {x^i}) = 1\) if \(y = y^i\) and 0 otherwise, which removes the summation over y. Then, \(P_{t}(\hat{y}\, | \,\mathbf {x_{t}^i})\) can be obtained from any classifier with probabilistic outputs. To do so, we learn a set of classifiers \(\mathcal {H}^{-i} = \{h_t^{-i}\}_{1\le t \le T}\). In order to get reliable estimations of \(P_{t}(\hat{y}\, | \,\mathbf {x_{t}^i})\), classifiers \(h_t^{-i}\) are built on subsets of \(\mathcal {T}_t\) that do not contain \(\mathbf {x_t^i}\). In practice, we use a standard cross-validation procedure to make sure that this condition is fulfilled. In this case, a reliable estimation of \(P_{t}(\hat{y}\, | \,\mathbf {x_t^i})\) is given by the probabilities output by \(h_{t}^{-i}\).
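The second line of Eq. (6) can be sketched in a few lines of Python. This is a minimal sketch under assumptions: the held-out probabilities \(P_{t}(\hat{y}\, | \,\mathbf {x_{t}^i})\) are supposed to have been produced beforehand by the classifiers \(h_t^{-i}\) and are passed in as plain dictionaries, and the function name and argument layout are hypothetical.

```python
def expected_cost_curve(proba_per_time, true_label, cost_m, cost_d):
    """Per-series expected costs E_t(x^i) of Eq. (6) for t = 1..T.

    proba_per_time[t - 1] maps each label to the held-out probability
    P_t(label | x^i_t); cost_m(y_pred, y_true) and cost_d(t) are the
    misclassification and delay cost functions.
    """
    curve = []
    for t, proba in enumerate(proba_per_time, start=1):
        # Expected misclassification cost under the held-out probabilities.
        expected_cm = sum(p * cost_m(label, true_label)
                          for label, p in proba.items())
        curve.append(expected_cm + cost_d(t))
    return curve
```

For a series of true class "a" whose held-out probability of class "a" rises from 0.5 to 0.9 between t = 1 and t = 2, the expected misclassification term drops from 0.5 to 0.1 under a 0/1 cost, while the delay term grows with t.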

An example of 4 sample time series together with their expected cost functions is given respectively in Fig. 1a and b.

3.2 NoCluster

Our first proposed method, which we denote NoCluster, consists in computing the expected cost for a test time series as a weighted sum of training expected costs calculated using Eq. (6). In other words, present and future expected costs for a time series to be classified are computed, at time t, as:

$$\begin{aligned} \forall \delta \ge 0, \, \, \, f_{\delta }(\mathbf {x_t}) = \sum _{i=1}^N \frac{1}{K_0}\frac{1}{1+\exp ^{-\lambda \varDelta ^i_t}} \, E_{t+\delta }(\mathbf {x^i}), \end{aligned}$$
(7)

where \(K_0 = \sum _{i=1}^N \frac{1}{1+\exp ^{-\lambda \varDelta ^i_t}}\) and \(\varDelta ^i_t\) is computed the way \(\varDelta ^k_t\) is in Eq. (5), with distances to centroids replaced by distances to training time series. As in [3], classification is performed at the smallest time t such that, for all \(\delta > 0\), we get \(f_{\delta }(\mathbf {x_t}) \ge f_0(\mathbf {x_t})\). This time instant is denoted \(\tau (\mathbf {x_t})\). At test time, N normalized distances \(\varDelta ^i_t\) (one per training time series) are computed for every time t up to \(\tau (\mathbf {x})\), whereas only K such computations were required per time t when clustering was used. As this method is derived from [3], it preserves its non-myopia and adaptiveness properties.
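One NoCluster step at test time (Eq. (7) followed by the decision rule) could be sketched as below. This is a minimal sketch, not the released implementation: it assumes a Euclidean distance between the incoming prefix and the truncated training series, training expected costs stored as lists indexed by time, and hypothetical function names.

```python
import math


def _sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def _euclidean(a, b):
    """Euclidean distance (an assumed choice of distance)."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))


def nocluster_step(x_t, train_series, train_costs, lam=100.0):
    """Return (decide_now, estimated_tau) for the prefix x_t of length t.

    train_series[i] is the complete i-th training series and
    train_costs[i][s - 1] is its expected cost E_s(x^i) from Eq. (6).
    """
    t = len(x_t)
    horizon = len(train_costs[0])
    dists = [_euclidean(x_t, x_i[:t]) for x_i in train_series]
    mean_d = sum(dists) / len(dists)
    if mean_d == 0.0:
        scores = [1.0] * len(dists)  # degenerate case: uniform weights
    else:
        scores = [_sigmoid(lam * (mean_d - d) / mean_d) for d in dists]
    k0 = sum(scores)
    weights = [s / k0 for s in scores]
    # f[delta] is the expected cost of deciding at time t + delta (Eq. (7)).
    f = [sum(w * costs[t + delta - 1]
             for w, costs in zip(weights, train_costs))
         for delta in range(horizon - t + 1)]
    # First index of the minimum: 0 exactly when f_delta >= f_0 for all delta.
    best_delta = min(range(len(f)), key=f.__getitem__)
    return best_delta == 0, t + best_delta
```

The second returned value is what makes the scheme non-myopic: when the decision is postponed, it still provides an estimate of the future time at which the expected cost is lowest.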

Figure 1c shows a sample test time series of length t, and Fig. 1d shows its associated expected cost from time t, and the estimation of the optimal decision time \(\hat{\tau }(\mathbf {x_t})\). Figure 1e depicts the histogram of \(\tau \) values obtained by NoCluster for all the test time series of the Gun_Point dataset. Note that peaks of this histogram (which also correspond to valleys of the expected cost presented in Fig. 1d) are related to regions of high interest in the time series for this dataset: a first increasing part of the time series, followed by a plateau and a final decrease. An algorithmic description of this method is provided in Algorithms 1 and 2.

3.3 2Step

As explained above, the computational complexity of NoCluster at test time is higher than that of the baseline, which can be a limitation for some applications. In this section, we present another early classification method that we call 2Step. This method has a lower complexity at test time, at the cost of possibly higher training complexity. 2Step also relies on the computation of the expected costs for every training time series (Eq. (6)), but uses them in a different manner. A set of classifiers \(\mathcal D = \{d_t\}_{1 \le t \le T}\) is built as follows. The classifier \(d_t\) is built on the training set \(\mathcal {T}_t\). Its target variable is not the class of the time series but a binary variable \(\gamma _t\) indicating if t is the expected optimal time to classify the time series. In other words,

$$\begin{aligned} \forall \, \mathbf {x^i_t} \in \mathcal {T}_t, \, \, \gamma _t(\mathbf {x^i_t}) = \left\{ \begin{aligned} 1&\text { if }E_t(\mathbf {x^i}) = \displaystyle \min _{\delta \ge 0} E_{t+\delta }(\mathbf {x^i}) \\ 0&\text { otherwise.} \\ \end{aligned} \right. \end{aligned}$$
(8)

A graphical example showing how \(\gamma _t\) can be computed from the expected cost of classification of a time series is given in Fig. 2. In particular, this figure illustrates that \(\gamma _t\) is not monotonically increasing with t.
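The labels of Eq. (8) can be obtained from an expected cost curve with a single backward pass that tracks the minimum over the remaining time instants. The sketch below is a minimal illustration with a hypothetical function name; the cost curve is assumed to be a list whose entry at index t - 1 holds \(E_t(\mathbf {x^i})\).

```python
def gamma_labels(cost_curve):
    """Eq. (8): labels[t - 1] is 1 if E_t equals min over delta >= 0 of E_{t+delta}."""
    labels = [0] * len(cost_curve)
    running_min = float("inf")
    # Walk backwards so that running_min is the minimum of E_t, ..., E_T.
    for idx in range(len(cost_curve) - 1, -1, -1):
        running_min = min(running_min, cost_curve[idx])
        labels[idx] = 1 if cost_curve[idx] == running_min else 0
    return labels
```

For the toy curve `[0.6, 0.3, 0.5, 0.4, 0.7]`, the labels are `[0, 1, 0, 1, 1]`: the label switches back to 0 at the third time instant, which shows on a small example why \(\gamma _t\) need not be monotonic in t.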

Fig. 2. Expected early classification cost for a training time series from the Gun_Point dataset, together with the corresponding \(\gamma _t\) values. For all time instants that are depicted in pink, \(\gamma _t = 0\), and for time instants depicted in green, \(\gamma _t = 1\). Best viewed in color.

At test time, time series \(\mathbf {x_t}\) is given to classifier \(d_t\). If the output of \(d_t\) is 1, then \(\tau (\mathbf {x})=t\) and \(h_t\) is used to classify \(\mathbf {x_t}\). Otherwise, the classification of \(\mathbf {x_t}\) is delayed. Figure 1f depicts the histogram of \(\tau \) values obtained by 2Step for all the test time series of the Gun_Point dataset. One can note from Fig. 1e and f that both proposed methods are adaptive, in the sense that predicted timings \(\tau \) vary across test time series. An algorithmic description of this method is provided in Algorithms 3 and 4.
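The test-time procedure of 2Step can be sketched as a simple loop over incoming observations. This is a minimal sketch under assumptions: the classifiers \(d_t\) and \(h_t\) are represented as plain Python callables taking the current prefix, the names are hypothetical, and a decision is forced at the last time instant T (consistent with Eq. (8), where \(\gamma _T\) is always 1, but stated here as an assumption).

```python
def two_step_classify(stream, stop_classifiers, label_classifiers, t_min=1):
    """Return (predicted_label, tau) for an incoming time series.

    stop_classifiers[t - 1] plays the role of d_t (returns 0 or 1) and
    label_classifiers[t - 1] plays the role of h_t (returns a class label).
    """
    horizon = len(stop_classifiers)
    prefix = []
    for t, observation in enumerate(stream, start=1):
        prefix.append(observation)
        if t < t_min:
            continue  # no early decision before t_min
        last_step = (t == horizon)
        if last_step or stop_classifiers[t - 1](prefix) == 1:
            return label_classifiers[t - 1](prefix), t
    raise ValueError("stream ended before a decision could be made")
```

Each time step costs a single call to \(d_t\), with no distance computation against the training set, which is the source of the lower test-time complexity discussed above.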

The early classification method as presented in this section is myopic. In order to keep the non-myopic property, a set of regressors \(\mathcal M = \{m_t\}_{1\le t \le T}\) should be learned in addition to the set \(\mathcal {D}\) of classifiers. The target variable of regressor \(m_t\) should represent the time difference between t and the expected time of minimal cost, that is \(\underset{\delta \ge 0}{\arg \min } \left( E_{t+\delta }(\mathbf {x_t})\right) \).
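The regression targets mentioned above can be derived from the same expected cost curves as the \(\gamma _t\) labels. The sketch below is a minimal illustration with a hypothetical function name; ties between equal minima are resolved in favor of the earliest time instant, which is an assumption of this sketch.

```python
def time_to_optimum(cost_curve):
    """targets[t - 1] is the delta >= 0 minimizing E_{t+delta} for a training series."""
    targets = [0] * len(cost_curve)
    best_idx = len(cost_curve) - 1
    # Walk backwards, keeping the index of the minimum over E_t, ..., E_T.
    for idx in range(len(cost_curve) - 1, -1, -1):
        if cost_curve[idx] <= cost_curve[best_idx]:
            best_idx = idx  # ties go to the earliest time instant
        targets[idx] = best_idx - idx
    return targets
```

A target equal to 0 corresponds to \(\gamma _t = 1\) in Eq. (8), so the regressors refine the binary decision of \(d_t\) with an estimate of how long to wait.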

4 Experimental Results

In this section, we design extensive experiments in order to evaluate the performance of both NoCluster and 2Step early classification methods. The method presented in [3] is used as our main baseline, as (i) our proposed methods are built on top of it and (ii) all three methods share the desirable properties of non-myopia and adaptiveness. However, comparison with other state-of-the-art methods is also provided in Sects. 4.4 and 4.5. The cost function defined in Eq. (1) (which defines a trade-off between accuracy and earliness of the decision) is used as our main performance indicator. Other criteria such as classification accuracy, earliness (expressed as the average prediction time \(\bar{\tau }\)) and timings are also considered.

4.1 Experimental Setup

Experiments are conducted on all 76 datasets from the UCR archive [8] that have sufficient training data for cross-validation on the method parameters to be conducted (as a rule of thumb, we keep all datasets with at least 10 training time series per class). The complete list of datasets on which experiments are run is provided in the Supplementary material, which details per-dataset performance of the methods. Datasets from this archive cover a wide range of application domains. They are also diverse in terms of the number and the lengths of time series in each dataset. The datasets are split into a training and a test set in the archive and we keep this separation in our experiments. For this set of experiments, the cost of delaying the decision is chosen linear in t:

$$\begin{aligned} C_d(t) = \beta \times t. \end{aligned}$$
(9)

So as not to focus on a single trade-off between accuracy and earliness, we vary \(\beta \) in the set \(\{0.0005, 0.001, 0.005, 0.01\}\) in all experiments. In real-world applications, \(\beta \) should be set according to expert knowledge and the effective cost of waiting for more observations. For instance, in remote sensing applications where buying new images can be costly, high \(\beta \) values should be used, unlike applications such as ECG monitoring for which new observations are made at almost no cost.

Results are reported for all setups (i.e., all datasets and \(\beta \) values) for which any of the compared methods finishes in less than 7 days. Presented results (for all criteria) are medians over 4 runs for each method.

Classifiers from sets \(\mathcal {D}\) and \(\mathcal {H}\) are linear Support Vector Machines trained using scikit-learn and LIBSVM Python binding [2, 11]. Parameters C from the SVM and \(\lambda \) from the expected cost functions are learned through cross-validation on the training set. Five values for parameter C are sampled regularly from a logarithmic scale between \(10^{-1}\) and \(10^1\). Five values for \(\lambda \) are sampled in a similar way between \(10^1\) and \(10^3\). As done in [3], we do not perform early classification before time \(t_\text {min}=4\). For the baseline method, the number of clusters is chosen so as to maximize the silhouette factor [12], as suggested in [3].
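The parameter grids described above can be generated as follows. This is a minimal sketch that interprets "sampled regularly from a logarithmic scale" as evenly spaced exponents, which is an assumption; the helper name is hypothetical.

```python
def log_grid(low_exp, high_exp, n=5):
    """Return n values evenly spaced on a log10 scale from 10**low_exp to 10**high_exp."""
    step = (high_exp - low_exp) / (n - 1)
    return [10.0 ** (low_exp + i * step) for i in range(n)]


# Candidate values for the SVM parameter C and for lambda.
C_grid = log_grid(-1, 1)
lambda_grid = log_grid(1, 3)
```

Under this interpretation, the candidate values for C range from 0.1 to 10 with 1 as the middle value, and those for \(\lambda \) range from 10 to 1000 with 100 as the middle value.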

Following the principle of reproducible research, the Python code used in these experiments (including both proposed methods and our implementation of the baseline) is made publicly available for download.

Fig. 3. Early classification performance as a function of \(\beta \) for dataset FISH.

4.2 Sensitivity to \(\beta \)

Figure 3 presents results obtained for dataset FISH in terms of accuracy, earliness and overall early classification cost function (\(C(\mathbf {x},y)\)) for different values of \(\beta \). As expected, when parameter \(\beta \) increases, accuracy and earliness both drop, for all methods. Indeed, high values of \(\beta \) indicate that it is more costly to wait for more data. Decisions are hence made earlier, leading to worse accuracy. On the contrary, the overall early classification cost increases with \(\beta \), whatever the method. For this dataset (see Sect. 4.5 for more results) and for the whole range of \(\beta \) values considered here, 2Step and NoCluster methods both outperform the baseline in terms of early classification cost, which is the indicator all three methods optimize (recall that the lower the cost, the better).

Fig. 4. Cumulative distribution functions of training (a) and test (b) times over all datasets and \(\beta \) values.

4.3 Timings

Figure 4 presents both training and test timings of the two proposed methods using cumulative distribution functions (CDF), and compares them to the baseline method. At a given probability level, a smaller execution time corresponds to a faster method, hence leftmost CDF curves are the most desirable. This figure shows that the baseline method runs faster than NoCluster for both training and test. The difference in test timings is due to the fact that NoCluster requires computing more distances than the baseline, as explained in Sect. 3.2. For training timings, the difference is due to the number of expected cost function calculations, which is lower for the baseline (one per cluster for the baseline versus one per training time series for NoCluster).

Concerning the 2Step method, training consists in computing the same expected cost functions as for NoCluster and then building classifiers on top of them, so it is expected that this method has higher training timings. Yet, it is important to notice that the use of classifiers has a positive impact on test timings, for which 2Step even outperforms the baseline. As explained in Sect. 3, this feature can be important for some applications where decisions need to be made quickly.

Table 2. Win / Tie / Lose scores based on early classification costs. Results from all datasets using all \(\beta \) values are used. For these scores, “Win” means that the proposed method outperforms the baseline in terms of cost minimization.

4.4 Comparisons with Classical Early Classification Methods

We first compare 2Step and NoCluster methods with classical early classification methods that do not directly optimize a mixed cost function. To do so, we use earliness and accuracy scores published in the Supplementary material associated with [9]. Such scores are available for EDSC [14], strict and loose variants of ECTS [13] (denoted ECTS1 and ECTS2 respectively) and RelClass [10] with four different reliability thresholds (0.001, 0.1, 0.5, 0.9 for methods RelClass1 to RelClass4 respectively). Based on these scores, we can compute early classification costs for all methods with varying \(\beta \) values. Win/Tie/Lose scores computed from these costs are presented in Table 2. These results show that both proposed methods outperform all these baselines. To assess statistical significance, we perform one-sided Wilcoxon signed rank tests between each proposed method and each baseline. All tests show significance with p-values lower than \(10^{-6}\).

This performance improvement was expected, as baseline methods considered in this section do not optimize the cost function that mixes earliness and accuracy. To further evaluate performance of our proposed methods, we compare them to other cost-aware methods in the following section.
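The conversion from published accuracy and earliness scores to early classification costs, and the resulting Win/Tie/Lose counts, can be sketched as follows. This is a minimal sketch under assumptions that are not stated in this section: a 0/1 misclassification cost, so that the average of Eq. (1) with the linear delay cost of Eq. (9) is the error rate plus \(\beta \) times the average prediction time \(\bar{\tau }\), and earliness expressed in time steps. Function names are hypothetical.

```python
def average_cost(accuracy, mean_tau, beta):
    """Average of Eq. (1) over a test set, assuming a 0/1 cost and C_d(t) = beta * t."""
    return (1.0 - accuracy) + beta * mean_tau


def win_tie_lose(costs_proposed, costs_baseline, tol=1e-12):
    """Count setups where the proposed method has lower, equal or higher cost."""
    wins = ties = losses = 0
    for c_prop, c_base in zip(costs_proposed, costs_baseline):
        if abs(c_prop - c_base) <= tol:
            ties += 1
        elif c_prop < c_base:
            wins += 1
        else:
            losses += 1
    return wins, ties, losses
```

Each entry of the two cost lists corresponds to one setup, that is, one dataset paired with one \(\beta \) value.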

Fig. 5. Comparison of early classification costs between proposed methods and the baselines from [3] and [9]. Results from all datasets using all \(\beta \) values are gathered here. Solid line indicates equal costs and dashed lines correspond to the +/-20 % cost interval.

4.5 Comparisons with Cost-Aware Methods

Figure 5 presents comparisons of early classification costs between both our proposed methods and cost-aware baselines. Presented results show that the proposed methods reach a lower early classification cost than the baseline from [3] in most cases: 2Step outperforms the baseline in 128 out of the 198 cases (i.e., in \(64.6\,\%\) of the cases), while NoCluster has better results in 135 cases (\(68.2\,\%\)). Moreover, the number of setups for which the baseline is improved by more than 20 % is another indicator showing the benefit of both proposed methods (cf. points outside the dashed lines in Fig. 5a and b).

One-sided Wilcoxon signed rank tests between each proposed method and the baseline show significance with a p-value lower than \(10^{-6}\), which confirms that both our methods outperform the baseline in terms of cost minimization.

We then compare both proposed methods to EarlyOpt.SVM [9], which optimizes a slightly different trade-off between accuracy and earliness, in Fig. 5. When doing so, evaluation can be performed with respect to two different cost functions: the one that EarlyOpt.SVM optimizes or the one used for our methods. Figure 5c and d present results obtained when using the cost function from EarlyOpt.SVM, which favors the latter in the comparison. In spite of this, both proposed methods improve on EarlyOpt.SVM performance and these improvements are statistically significant at the 5 % significance level.

Finally, when comparing 2Step and NoCluster costs as shown in Fig. 5e, NoCluster gets slightly better results that are considered significant when tested at the 5 % significance level (\(p=0.037\)).

5 Conclusion

In this paper, we build upon the framework introduced in [3]. This framework deals with the problem of early classification from a new point of view: it aims at minimizing a cost function that includes a cost of misclassification and a cost of delaying the decision.

Following this framework, our proposition consists in designing two different methods (called NoCluster and 2Step) that optimize such a cost function. These methods decide online for an incoming testing time series if the current time instant is the optimal moment to classify the time series (with respect to the cost function) or if it is more valuable to wait for more samples of the time series before classifying it. In the latter case, they also give an estimate of when the optimal moment is likely to occur. Both methods are hence adaptive and non-myopic.

We have performed extensive experiments on a large and widely used set of datasets in order to evaluate the appropriateness of these methods, in terms of performance and timings. It turns out that (i) 2Step outperforms the state of the art in terms of both cost minimization and classification time and (ii) NoCluster is an even better option in terms of early classification cost at the expense of slower classification. As future work, we will consider the use of time series specific classifiers, which should improve the overall performance of these approaches.