TEASER: early and accurate time series classification

Early time series classification (eTSC) is the problem of classifying a time series after as few measurements as possible with the highest possible accuracy. The most critical issue of any eTSC method is to decide when enough data of a time series has been seen to take a decision: Waiting for more data points usually makes the classification problem easier but delays the time in which a classification is made; in contrast, earlier classification has to cope with less input data, often leading to inferior accuracy. The state-of-the-art eTSC methods compute a fixed optimal decision time assuming that every time series has the same defined start time (like turning on a machine). However, in many real-life applications measurements start at arbitrary times (like measuring heartbeats of a patient), implying that the best time for taking a decision varies widely between time series. We present TEASER, a novel algorithm that models eTSC as a two-tier classification problem: In the first tier, a classifier periodically assesses the incoming time series to compute class probabilities. However, these class probabilities are only used as output label if a second-tier classifier decides that the predicted label is reliable enough, which can happen after a different number of measurements. In an evaluation using 45 benchmark datasets, TEASER is two to three times earlier at predictions than its competitors while reaching the same or an even higher classification accuracy. We further show TEASER's superior performance using real-life use cases, namely energy monitoring and gait detection.


INTRODUCTION
A time series (TS) is a collection of values sequentially ordered in time. One strong force behind their rising importance is the increasing use of sensors for automatic and high-resolution monitoring in domains like smart homes [15], starlight observations [24], machine surveillance [20], or smart grids [14,32]. Time series classification (TSC) is the problem of assigning one of a predefined set of classes to a time series, like recognizing the electronic device producing a certain temporal pattern of energy consumption [9,11] or classifying a signal of earth motions as either an earthquake or a passing lorry [23]. Conventional TSC works on time series of a given, fixed length and assumes access to the entire input at classification time. In contrast, early time series classification (eTSC), which we study in this work, tries to solve the TSC problem after seeing as few measurements as possible [34]. This need arises when the classification decision is time-critical, for instance to prevent damage (the earlier a warning system can predict an earthquake from seismic data [23], the more time there is for preparation), to speed up diagnosis (the earlier an abnormal heartbeat is detected, the more time there is for prevention of fatal attacks [13]), or to protect markets and systems (the earlier a crisis of a particular stock is detected, the faster it can be banned from trading [10]).
Figure 1: Traces of microwaves (acsf1) taken from [11]. The operational state of the microwave starts between 5% and 50% of the whole trace length. To have at least one event (typically a high burst of energy consumption) for each microwave, the threshold has to be set to that of the latest seen operational state (after seeing more than 46.2%). Predictions are only safe after this fraction of the data, as we have seen events from all traces.
eTSC has two goals: classifying TS early and with high accuracy. However, these two goals are contradictory in nature: The earlier a TS has to be classified, the less data is available for this classification, usually leading to lower accuracy. In contrast, the higher the desired classification accuracy, the more data points have to be inspected and the later eTSC will be able to take a decision. Thus, a critical issue in any eTSC method is the determination of the point in time at which an incoming TS can be classified. State-of-the-art methods in eTSC [19,34,35] assume that all time series being classified have a defined start time. Consequently, these methods assume that characteristic patterns appear roughly at the same offset in all TS, and try to learn the fixed fraction of the TS that is needed to make high-accuracy predictions, i.e., when the accuracy of classification most likely is close to the accuracy on the full TS. However, in many real-life applications this assumption is wrong. For instance, sensors often start their observations at arbitrary points in time of an essentially indefinite time series. Intuitively, existing methods expect to see a TS from the point in time when the observed system starts working, while in many applications, this system has already been working for an unknown period of time when measurements start. In such settings, it is suboptimal to wait for a fixed number of measurements; instead, the algorithm should wait for the characteristic patterns, which may occur early (in which case an early classification is possible) or later (in which case the eTSC algorithm has to wait longer). As an example, Figure 1 illustrates traces for the operational state of microwaves [11]. Observations started while the microwaves were already under power; the concrete operational state, characterized by high bursts of energy consumption, happened after 5% to 50% of the whole trace length.
In this paper we present TEASER, a Two-tier Early and Accurate Series classifiER, that is robust regarding the start time of a TS's recording. It models eTSC as a two-tier classification problem (see Figure 3). In the first tier, a slave classifier periodically assesses the input series and computes class probabilities. In the second tier, a master classifier takes the series of class probabilities of the slave as input and computes a binary decision on whether to report these as final result or continue with the observation. As such, TEASER does not presume a fixed starting time of the recordings nor does it rely on a fixed decision time for predictions, but takes its decisions whenever it is confident of its prediction. On a popular benchmark of 45 datasets [36], TEASER is two to three times as early while keeping a competitive level of accuracy, and for some datasets reaching an even higher one, when compared to the state of the art [18,19,22,34,35]. Overall, TEASER achieves the highest average accuracy, lowest average rank, and highest number of wins among all competitors. We furthermore evaluate TEASER's performance on the basis of real use cases, namely device classification of energy load monitoring traces, and classification of walking motions into normal and abnormal.
The rest of the paper is organized as follows: In Section 2 we formally describe the background of eTSC. Section 3 introduces TEASER and its building blocks. Section 4 presents evaluation results including benchmark data and real use cases. Section 5 discusses related work and Section 6 presents the conclusion.

BACKGROUND: TIME SERIES AND ETSC
In this section, we formally introduce time series (TS) and early time series classification (eTSC). We also describe the typical learning framework used in eTSC.
Definition 2.1. A time series T = (t_1, ..., t_n) is a sequence of n values ordered in time. The values are also called data points. A dataset D is a collection of time series.
We assume that all TS of a dataset have the same sampling frequency, i.e., every i'th data point was measured at the same temporal distance from the first point. In accordance with all previous approaches [19,34,35], we will measure earliness in the number of data points and from now on disregard the actual time of data points. A central assumption of eTSC is that TS data arrives incrementally. If a classifier is to classify a TS after s data points, it has access to these s data points only. This is called a snapshot.
Definition 2.2. A snapshot T(s) = (t_1, ..., t_s) of a time series T, s ≤ n, is the prefix of T available for classification after seeing s data points.
In principle, an eTSC system could try to classify a time series after every new data point that was measured. However, it is more practical and efficient to call the eTSC only after the arrival of a fixed number of new data points [19,34,35]. We call this number the interval length w. Typical values are 5, 10, 20, ...
eTSC is commonly approached as a supervised learning problem [18,19,22,34,35]. Thus, we assume the existence of a set D_train of training TS, where each one is assigned to one of a predefined set of class labels Y = {c_1, ..., c_k}. The eTSC system learns a model from D_train that can separate the different classes. Its performance is estimated by applying this model to all instances of a test set D_test.
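The snapshot and interval-length definitions above can be illustrated with a minimal sketch. The helper name `snapshots` is hypothetical and not taken from the TEASER implementation; the sketch also assumes that a trailing partial interval (fewer than w new points) does not trigger a classification.

```python
from typing import Iterator, List, Sequence


def snapshots(series: Sequence[float], w: int) -> Iterator[List[float]]:
    """Yield the prefixes T(s) for s = w, 2w, 3w, ... (hypothetical helper).

    Assumption: a trailing partial interval is ignored, so the last
    snapshot has length floor(len(series) / w) * w.
    """
    if w <= 0:
        raise ValueError("interval length w must be positive")
    for s in range(w, len(series) + 1, w):
        yield list(series[:s])


ts = [0.1, 0.3, 0.2, 0.9, 1.1, 1.0, 0.2]
print([len(snap) for snap in snapshots(ts, w=2)])  # [2, 4, 6]
```

Each yielded prefix is what a classifier would be allowed to see at that point in time; nothing beyond index s is accessible.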
The quality of an eTSC system can be measured by different indicators. The accuracy of an eTSC is calculated as the percentage of correct predictions of the test instances, where higher is better:

$$\text{accuracy} = \frac{\text{number of correct predictions}}{|D_{test}|}$$

The earliness of an eTSC is defined as the mean number of data points s after which a label is assigned (expressed as a fraction of the length of the TS), where lower is better:

$$\text{earliness} = \frac{1}{|D_{test}|} \sum_{T \in D_{test}} \frac{s}{\text{len}(T)}$$

We can now formally define the problem of eTSC.
Definition 2.3. Early time series classification (eTSC) is the problem of assigning all time series T ∈ D_test a label from Y as early and as accurately as possible.
eTSC thus has two optimization goals that are contradictory in nature, as later classification typically allows for more accurate predictions and vice versa. Accordingly, eTSC methods can be evaluated in different ways, such as comparing accuracies at a fixed-length snapshot (keeping earliness constant), comparing earliness at which a fixed accuracy is reached (keeping accuracy constant), or by combining these two measures. A popular choice for the latter is the harmonic mean of earliness and accuracy:

$$HM = \frac{2 \cdot (1 - \text{earliness}) \cdot \text{accuracy}}{(1 - \text{earliness}) + \text{accuracy}}$$

An HM of 1 is equal to an earliness of 0% and an accuracy of 100%.
Figure 2: [...] These have three characteristic patterns (a) to (c). In the bottom part, eTSC is performed on a snapshot of the time series of a digital receiver. In its first snapshot it is easily confused with pattern (c) of a microwave. However, the trace later contains pattern (a) which is characteristic for a receiver.
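The three evaluation measures of this section can be sketched in a few lines. The function names are hypothetical. The sketch assumes that earliness is normalized by the length of each series, and that the harmonic mean combines accuracy with (1 − earliness), which is consistent with the statement that an HM of 1 corresponds to an earliness of 0% and an accuracy of 100%.

```python
from typing import Sequence


def accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Fraction of test instances whose predicted label is correct."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def earliness(decision_lengths: Sequence[int], full_lengths: Sequence[int]) -> float:
    """Mean fraction of each series seen before a label was assigned.

    Assumption: earliness is normalized by the length of each series.
    """
    fractions = [s / n for s, n in zip(decision_lengths, full_lengths)]
    return sum(fractions) / len(fractions)


def harmonic_mean(acc: float, earl: float) -> float:
    """Harmonic mean of accuracy and (1 - earliness); 1.0 is the optimum."""
    timeliness = 1.0 - earl
    if acc + timeliness == 0.0:
        return 0.0
    return 2.0 * timeliness * acc / (timeliness + acc)


acc = accuracy(["a", "b", "a", "b"], ["a", "b", "b", "b"])
earl = earliness([2, 4, 1, 1], [8, 8, 8, 8])
print(acc, earl, harmonic_mean(acc, earl))
```

Because the harmonic mean penalizes imbalance, a method that is very accurate but always waits for the full series (earliness of 100%) obtains an HM of 0.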
an underlying oscillating pattern and in total there are three important patterns (a), (b), (c) which are different among the appliances. The characteristic part of a receiver trace is an energy burst with two plateaus (a), which can appear at different offsets. If an eTSC classifies a trace too early (Figure 2, second from bottom), the signal is easily confused with that of microwaves based on the similarity to the (c) pattern. However, if an eTSC always waits until the offset at which all training traces of microwaves can be correctly classified, the first receiver trace will be classified much later than possible (eventually after seeing the full trace). To achieve optimal earliness at high accuracy, an eTSC system must determine its decision times individually for each TS it analyses.

EARLY & ACCURATE TS CLASSIFICATION: TEASER
TEASER addresses the problem of finding optimal and individual decision times by following a two-tier approach. Intuitively, it trains a pair of classifiers for each snapshot s: A slave classifier computes class probabilities which are passed on to a master classifier that decides whether these probabilities are high enough that a safe classification can be emitted. TEASER monitors these predictions and predicts a class c if it occurred v times in a row; the minimum length v of a series of predictions is an important parameter of TEASER. Intuitively, the slave classifiers give their best prediction based on the data they have seen, whereas the master classifiers decide if these results can be trusted, and the final filter suppresses spurious predictions.
Formally, let w be the user-defined interval length and let n_max be the length of the longest time series in the training set D_train. We then extract snapshots T(s_i) = T[1..i · w], i.e., time series snapshots of lengths s_i = i · w. A TEASER model consists of a set of S = [n_max/w] pairs of slave/master classifiers, trained on the snapshots of the TS in D_train (see below for details). When confronted with a new time series, TEASER waits for the next w data points to arrive and then calls the appropriate slave classifier which outputs probabilities for all classes. Next, TEASER passes these probabilities to the slave's paired master classifier which either returns a class label or NIL, meaning that no decision could be derived. If the answer is a class label c and this answer was also given for the last v − 1 snapshots, TEASER returns c as result; otherwise, it keeps waiting.
Before going into the details of TEASER's components, consider the example shown in Figure 3. The first slave classifier sc_1 falsely labels this trace of a digital receiver as a microwave (by computing a higher probability of the latter class than for the former class) after seeing the first w data points. However, the master classifier mc_1 decides that this prediction is unsafe and TEASER continues to wait. After i − 1 further intervals, the i'th pair of slave and master classifiers sc_i and mc_i are called. Because the TS contained characteristic patterns in the i'th interval, the slave now computes a high probability for the digital receiver class, and the master decides that this prediction is safe. TEASER counts the number of consecutive predictions for this class and, if a threshold is passed, outputs the predicted class.
Clearly, the interval length w and the threshold v for the number of consecutive predictions are two important yet opposing parameters of TEASER. A smaller w results in more frequent predictions, due to smaller prediction intervals. However, a classification decision usually only changes after seeing a sufficient number of novel data points; thus, a too small value for w leads to series of very similar class probabilities at the slave classifiers, which may trick the master classifier. This can be compensated by increasing v. In contrast, a large value for w leads to fewer predictions, where each one has seen more new data and thus is probably more reliable. For such settings, v may be reduced without harming earliness or accuracy. In our experiments, we shall analyze the influence of w on accuracy and earliness in Section 4.3. In all experiments v is treated as a hyper-parameter that is learned by performing a grid-search and maximizing HM on the training dataset.
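The prediction procedure described in this section can be summarized in a short sketch. It is an illustration under stated assumptions, not the authors' implementation: the name `teaser_predict`, the stub slaves with fixed toy probabilities, and the per-class confidence thresholds used as a stand-in master are all hypothetical. The sketch also assumes that no label is returned if the series ends before a class was accepted in the required number of consecutive snapshots.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Slave = Callable[[Sequence[float]], List[float]]  # snapshot -> class probabilities
Master = Callable[[List[float]], bool]            # probabilities -> reliable or not


def teaser_predict(series: Sequence[float], w: int, v: int,
                   slaves: List[Slave], masters: List[Master],
                   classes: List[str]) -> Tuple[Optional[str], int]:
    """Return (label, number of points seen); label is None if never accepted."""
    if w < 1 or v < 1:
        raise ValueError("w and v must be at least 1")
    streak, last_label = 0, None
    n_pairs = min(len(slaves), len(masters), len(series) // w)
    for i in range(n_pairs):
        s = (i + 1) * w                       # snapshot length s_i = i * w (1-based i)
        probs = slaves[i](series[:s])
        best = max(range(len(probs)), key=lambda j: probs[j])
        label = classes[best] if masters[i](probs) else None  # None plays the role of NIL
        if label is None:
            streak = 0
        elif label == last_label:
            streak += 1
        else:
            streak = 1
        last_label = label
        if streak >= v:                       # same class accepted v times in a row
            return label, s
    return None, len(series)                  # assumption: no forced final decision


# Toy setup (all values hypothetical): stub slaves return fixed probabilities.
classes = ["microwave", "receiver"]
toy_probs = [[0.7, 0.3], [0.55, 0.45], [0.2, 0.8], [0.1, 0.9]]
slaves = [lambda snapshot, p=p: p for p in toy_probs]
thresholds = [0.95, 0.6]                      # stand-in for a learned master model


def accept(probs: List[float]) -> bool:
    best = max(range(len(probs)), key=lambda j: probs[j])
    return probs[best] >= thresholds[best]


masters = [accept] * len(slaves)
print(teaser_predict([0.0] * 8, w=2, v=2, slaves=slaves, masters=masters, classes=classes))
# ('receiver', 8)
```

With v = 1 the same toy series is labeled two points earlier, after 6 data points, which shows how v trades earliness against robustness to spurious predictions.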

Slave Classifier
Each slave classifier sc_i, with i ≤ S, is a full-fledged time series classifier of its own, trained to predict classes after seeing a fixed snapshot length. Given a snapshot T(s_i) of length s_i = i · w, the slave classifier sc_i computes class probabilities P(s_i) = (p_{c_1}(s_i), ..., p_{c_k}(s_i)) for this time series for each of the predefined classes and determines the class c(s_i) with highest probability. Furthermore, it computes the difference d(s_i) between the highest and second highest class probabilities:

$$d(s_i) = p_{c(s_i)}(s_i) - \max_{c_j \neq c(s_i)} p_{c_j}(s_i)$$

In TEASER, the most probable class label c(s_i), the vector of class probabilities P(s_i), and the difference d(s_i) are passed as features to the paired i'th master classifier mc_i, which then has to decide if the prediction is reliable (see Figure 4) or not.
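As a small illustration of the slave output, the following sketch derives the 3-tuple (most probable class, probability vector, difference between the two highest probabilities) from a dictionary of class probabilities. The function name `slave_features` and the dictionary-based interface are assumptions made for this example only.

```python
from typing import Dict, List, Tuple


def slave_features(probs: Dict[str, float]) -> Tuple[str, List[float], float]:
    """Build the features handed to the master: (c, P, d).

    d is the gap between the highest and the second highest probability.
    """
    if len(probs) < 2:
        raise ValueError("at least two classes are required")
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    (best_class, best_p), (_, second_p) = ranked[0], ranked[1]
    return best_class, list(probs.values()), best_p - second_p


print(slave_features({"microwave": 0.25, "receiver": 0.75}))
# ('receiver', [0.25, 0.75], 0.5)
```

A small d signals that the slave could barely separate its two top candidates, which is exactly the kind of evidence a master can use to withhold a prediction.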

Master Classifier
A master classifier mc_i, with i ≤ S, learns whether the results of its paired slave classifier should be trusted or not. We model this task as a classification problem of its own, where the i'th master classifier uses the results of the i'th slave classifier as features for learning its model (see Section 3.3 for the details on training). However, training this classifier is tricky. To learn accurate decisions, it needs to be trained with a sufficient number of correct and false predictions. However, the more accurate a slave classifier is, the fewer misclassifications are produced, and the worse the expected performance of the paired master classifier becomes. Figure 6 illustrates this problem by showing a typical slave's train accuracy with an increasing number of data points. In this example, the slave classifiers start with an accuracy of around 70% and quickly reach 100% on the train data. Once the train accuracy reaches 100%, there are no negative samples left for the master classifier to train its decision boundary.
To overcome this issue, we use a so-called one-class classifier as master classifier [16]. One-class classification refers to the problem of classifying positive samples in the absence of negative samples. It is closely related to, but not identical to, outlier/anomaly detection [4]. In TEASER, we use a one-class Support Vector Machine (oc-SVM) [31] which does not determine a separating hyperplane between positive and negative samples but instead computes a hypersphere around the positively labeled samples with minimal dilation that has maximal distance to the origin. At classification time, all samples that fall outside this hypersphere are considered as negative samples; in our case, this implies that the master learns a model of positive samples and regards all results not fitting this model as negative. The major challenge is to determine a hypersphere that is neither too large nor too small: a too large one produces false positives, leading to lower accuracy, whereas a too small one produces dismissals, which lead to delayed predictions.
In Figure 5 a trace is either labeled as microwave or receiver, and the master classifier learns that its paired slave is very precise at predicting receiver traces but produces many false predictions for microwave traces. Thus, only receiver predictions with class probability p(receiver) ≥ 0.6 and microwave predictions with p(microwave) ≥ 0.95 are accepted. As can be seen in this example, using a one-class SVM leads to very flexible decision boundaries.
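To make the idea of learning from positive samples only more tangible, the following sketch fits a simple centroid-and-radius model. This is deliberately not a one-class SVM: it is a much cruder stand-in that only mirrors the intuition of enclosing the positive samples and rejecting everything outside. The class name `HypersphereMaster`, the feature layout, and the use of `nu` as the fraction of training samples allowed to fall outside are assumptions of this example.

```python
import math
from typing import List, Sequence


class HypersphereMaster:
    """Toy one-class model: a sphere around the mean of the positive samples."""

    def __init__(self, nu: float = 0.05) -> None:
        self.nu = nu                      # fraction of training samples that may be dismissed
        self.center: List[float] = []
        self.radius = 0.0

    def fit(self, positives: Sequence[Sequence[float]]) -> "HypersphereMaster":
        dim = len(positives[0])
        self.center = [sum(x[j] for x in positives) / len(positives) for j in range(dim)]
        dists = sorted(math.dist(x, self.center) for x in positives)
        keep = max(1, math.ceil((1.0 - self.nu) * len(dists)))
        self.radius = dists[keep - 1]     # smallest radius enclosing `keep` samples
        return self

    def accepts(self, x: Sequence[float]) -> bool:
        return math.dist(x, self.center) <= self.radius


# Features of correctly classified training snapshots: [p_microwave, p_receiver, d]
positives = [[0.1, 0.9, 0.8], [0.2, 0.8, 0.6], [0.05, 0.95, 0.9], [0.15, 0.85, 0.7]]
master = HypersphereMaster(nu=0.25).fit(positives)
print(master.accepts([0.12, 0.88, 0.76]))  # close to the training samples
print(master.accepts([0.5, 0.5, 0.0]))     # an undecided slave output
```

A kernelized one-class SVM can form far more flexible boundaries than a single sphere in feature space, which is what makes class-specific acceptance regions such as those in Figure 5 possible.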

Training Slave and Master Classifiers
Consider a labeled set of time series D_train with class labels Y = {c_1, ..., c_k} and an interval length w. As before, n_max is the length of the longest training instance. Then, the i'th pair of slave/master classifiers is trained as follows:
(1) First, we truncate the train dataset D_train to the prefix length determined by i (snapshot s_i), yielding D_train(s_i) = {T(s_i) | T ∈ D_train}. In Figure 4 (bottom) the four TS are truncated.
(2) Next, these truncated snapshots are z-normalized. This is a critical step for training to remove any bias resulting from values that will only be available in the future. That is, if a time series is first z-normalized like all UCR time series, and then a truncated snapshot is generated, this snapshot may not make use of the absolute values resulting from the z-normalization of the whole series (as opposed to [18]).
(3) The hyper-parameters of the slave classifier are trained on D_train(s_i) using 10-fold cross-validation. Using the derived hyper-parameters we can build the final slave classifier sc_i producing its 3-tuple output (c(s_i), P(s_i), d(s_i)) for each T ∈ D_train (Figure 4).
In accordance with prior works [18,22,34,35], we consider the interval length w to be a user-specified parameter. However, we will also investigate the impact of varying w in Section 4.3.
The pseudo-code for training TEASER is given in Algorithm 1. The aim of the training is to obtain S pairs of slave/master classifiers, and the threshold v for consecutive predictions. First, for all z-normalized snapshots (line 4), the slaves are trained and the predicted labels and class probabilities are kept (line 5). Prior to training the master, incorrectly classified instances are removed (line 6). The feature vectors of correctly labeled samples are passed on to train the master (one-class SVM) classifier (line 7). Finally, an optimal value for v is determined using grid-search.
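Two of the training steps above lend themselves to a compact sketch: z-normalizing a snapshot after truncation, and keeping only correctly classified instances as training data for the master. The helper names are hypothetical, and the use of the population standard deviation is an assumption, since the text does not state which variant is used.

```python
import statistics
from typing import List, Sequence


def z_normalize(values: Sequence[float]) -> List[float]:
    """Scale values to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0.0:
        return [0.0 for _ in values]      # constant snapshot: nothing to scale
    return [(x - mean) / std for x in values]


def training_snapshot(series: Sequence[float], s: int) -> List[float]:
    """Truncate first, then normalize, so no future values leak into the snapshot."""
    return z_normalize([float(x) for x in series[:s]])


def master_training_set(features: Sequence[Sequence[float]],
                        predicted: Sequence[str],
                        true_labels: Sequence[str]) -> List[Sequence[float]]:
    """Keep only the feature vectors of correctly classified instances."""
    return [f for f, p, t in zip(features, predicted, true_labels) if p == t]


series = [1.0, 3.0, 1.0, 3.0, 10.0, 10.0]
print(training_snapshot(series, 4))       # [-1.0, 1.0, -1.0, 1.0]
print(z_normalize(series)[:4])            # differs: influenced by the later burst

feats = [[0.9, 0.1, 0.8], [0.6, 0.4, 0.2], [0.3, 0.7, 0.4]]
print(master_training_set(feats, ["a", "a", "b"], ["a", "b", "b"]))
```

The second print statement shows why the order matters: normalizing the whole series first lets the two large values at the end shift and compress the prefix, information that would not be available at prediction time.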

EXPERIMENTAL EVALUATION
We first evaluate TEASER using the 45 datasets from the UCR archive that have also been used in prior works on eTSC [19,22,34,35]. Each UCR dataset provides a train and test split which we use unchanged. Note that most of these datasets were preprocessed to create approximately aligned patterns of equal length and scale [27]. Such an alignment is advantageous for methods that make use of a fixed decision time but also requires additional effort and introduces new parameters that must be determined, steps that are not required with TEASER. We also evaluate on additional real-life datasets where no such alignment was performed.
We compared our approach to the state-of-the-art methods ECTS [34], RelClass [22], EDSC [35], and ECDIRE [19]. On the UCR datasets, we use published numbers on accuracy and earliness of these methods to compare to TEASER's performance. As in these papers, we use w = n_max/20 as default interval length. For ECDIRE, ECTS, and RelClass, the respective authors also released their code, which we use to compute their performance on our additional use cases. We were not able to obtain runnable code of EDSC, but note that EDSC was the least accurate eTSC method on the UCR data. All experiments ran on a server running Linux with 2x Intel Xeon E5-2630v3 and 64GB RAM, using Java JDK x64 1.8.
TEASER is a two-tier model using a slave and a master classifier. As a first tier, TEASER requires a TSC which produces class probabilities as output. Thus, we performed our experiments using three different time series classifiers: WEASEL [30], BOSS [28] and 1-NN Dynamic Time Warping (DTW). As a second tier, we have benchmarked three master classifiers: one-class SVM using LIBSVM [2], linear regression using liblinear [6], and an SVM using an RBF kernel [2].
For each experiment, we report the evaluation metrics accuracy, earliness, their harmonic mean HM, and Pareto optimality. The Pareto optimality criterion counts a method as better than a competitor whenever it obtains better results in at least one metric without being worse in any other metric. All performance metrics were computed using only results on the test split. To support reproducibility, we provide the TEASER source code and the raw measurement sheets [?].
In our first experiment we tested the influence of different slave and master classifiers. We compared the three different slave TS classifiers: DTW, BOSS, WEASEL. As a master classifier we have used one-class SVM (ocSVM), SVM with an RBF kernel (RBF-SVM) and linear regression (Regression). We performed these experiments using default hyper-parameters to ease comparisons. We compare performances in terms of HM to the other competitors ECTS, RelClass, EDSC and ECDIRE.
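The Pareto criterion stated above can be written down in a few lines. The sketch assumes that the two compared metrics are accuracy (higher is better) and earliness (lower is better); the function name `dominates` and the tuple layout are hypothetical.

```python
from typing import Tuple

Result = Tuple[float, float]  # (accuracy, earliness) on one dataset


def dominates(a: Result, b: Result) -> bool:
    """True if a is at least as good as b in both metrics and better in one."""
    acc_a, earl_a = a
    acc_b, earl_b = b
    no_worse = acc_a >= acc_b and earl_a <= earl_b
    strictly_better = acc_a > acc_b or earl_a < earl_b
    return no_worse and strictly_better


print(dominates((0.75, 0.23), (0.74, 0.71)))  # True: more accurate and earlier
print(dominates((0.90, 0.60), (0.80, 0.20)))  # False: more accurate but later
print(dominates((0.80, 0.20), (0.90, 0.60)))  # False: earlier but less accurate
```

The last two calls show that neither method dominates the other when each wins in one metric, so such dataset pairs count for neither side.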

Choice of Slave and Master classifiers
We first fixed the master classifier to ocSVM and compared all three different slave classifiers (DTW+ocSVM, BOSS+ocSVM, WEASEL+ocSVM) in Figure 7. Out of these, TEASER using WEASEL (WEASEL+ocSVM) has the best (lowest) rank. Next, we fixed the slave classifier to WEASEL and compared the three different master classifiers (ocSVM, RBF-SVM, Regression). Again, TEASER using ocSVM performed best. The most significant improvement over the state of the art was achieved by TEASER+WEASEL+ocSVM, which justifies our design decision to model early classification as a one-class classification problem.
Based on these results we use WEASEL [30] as a slave classifier and ocSVM as a master classifier for all remaining experiments and refer to this combination as TEASER. A nice aspect of WEASEL is that it is comparably fast, highly accurate, and works with variable-length time series. As a hyper-parameter we learn the best word length between 4 and 6 for WEASEL on each dataset using 10-fold cross-validation on the train data. The ocSVM parameters for the remaining experiments were determined as follows: the nu value was fixed to 0.05, i.e., 5% of the samples may be dismissed; the kernel was fixed to RBF; and the optimal gamma value was obtained by grid-search within {1 ... 100} on the train dataset.

Performance on the UCR Datasets
eTSC is about predicting accurately and early. Figure 8 shows two critical difference diagrams (as introduced in [5]) for earliness and accuracy over the average ranks of the different eTSC methods. The best classifiers are shown to the right of the diagram and have the lowest (best) average ranks. The groups of classifiers that are not significantly different in their rankings are connected by a bar. The critical difference (CD) length represents statistically significant differences using a Wilcoxon signed-rank test. With a rank of 1.44 (earliness) and 2.38 (accuracy), TEASER is significantly earlier than all other methods and overall is among the most accurate approaches. On our webpage we published all raw measurements [?] for each of the 45 datasets. TEASER is the most accurate on 22 datasets, followed by ECDIRE and RelClass being best on 12 and 10 datasets, respectively. TEASER also has the highest average accuracy of 75%, followed by RelClass (74%), ECDIRE (72.6%) and ECTS (71%). EDSC is clearly inferior to all other methods in terms of accuracy with 62%. TEASER provides the earliest predictions in 32 cases, followed by ECDIRE with 7 cases and the remaining competitors with 2 cases each. On average, TEASER takes its decision after seeing 23% of the test time series, whereas the second and third earliest methods, i.e., EDSC and ECDIRE, have to wait for 49% and 50%, respectively. It is also noteworthy that the second most accurate method, RelClass, provides the overall latest predictions with 71%.
Note that all competitors have been designed for highest possible accuracy, whereas TEASER was optimized for the harmonic mean of earliness and accuracy (recall that TEASER nevertheless is also the most accurate eTSC method on average). It is thus not surprising that TEASER beats all competitors in terms of HM in 36 of the 45 cases. Figure 9 visualizes the HM value achieved by TEASER (black line) vs. the four other eTSC methods. This graphic sorts the datasets according to a predefined grouping of the benchmark data into four types, namely synthetic, motion sensors, sensor readings and image outlines. TEASER has the best average HM value in all four of these groups; only in the group composed of synthetic datasets does EDSC come close, with a difference of just 3 percentage points (pp). In all other groups TEASER improves the HM by at least 20 pp when compared to its best performing competitor.
In some of the UCR datasets classifiers excel in one metric (accuracy or earliness) but are beaten in another. To determine cases where a method is clearly better than a given competitor, we also computed the number of datasets where a method is Pareto optimal over this competitor. Results are shown in Table 2.
TEASER is dominated in only two cases by another method, whereas it dominates in 19 to 30 out of the 45 cases.
In the context of eTSC the most runtime-critical aspect is the prediction phase, in which we wish to be able to provide an answer as soon as possible, before new data points arrive. As all competitors were implemented using different languages, it would not be entirely fair to compare wall-clock times of implementations. Thus, we count the number of master predictions that are needed on average for TEASER to accept a master's prediction. TEASER requires 3.6 predictions on average (median 3.0) to accept the prediction after seeing 23% of the TS on average. Thus, regardless of the used master classifier, TEASER would on average need an infrastructure that is roughly 3.6 times as fast (360%) as one making a single prediction at a fixed threshold (like the ECTS framework with earliness of 70%).

Impact of Interval Length
To make results comparable to those of previous publications, all experiments described so far used a fixed value for the interval length w derived from breaking the time series into S = 20 intervals. Figure 10 shows boxplot diagrams for earliness (left) and accuracy (right) when varying the value of w so that predictions are made after seeing multiples from 10% of a dataset down to multiples of 3%. Thus, in the latter case TEASER outputs a prediction after seeing 3%, 6%, 9%, etc. of the entire time series. Interestingly, accuracy decreases whereas earliness improves with decreasing w, meaning that TEASER tends to make earlier predictions, thus seeing less data, with shorter interval length. Thus, changing w influences the trade-off between earliness and accuracy: If early (accurate) predictions are needed, one should choose a low (high) w value. We further plot the upper bound of TEASER, that is the accuracy at w = 100, equal to always using the full TS to do the classification. The difference between w = 100 and w = 10 is surprisingly small at 5 pp. Overall, TEASER gets to 95% of the optimum using on average 40% of the time series.
Figure 10: Average earliness (left; lower is better) and accuracy (right; higher is better) for TEASER on the 45 TS datasets.
Table 3: Samples N, TS length n, and Classes of the real-life datasets.

Three Real-Life Datasets
The UCR datasets used so far have all been preprocessed to make their analysis easier and, in particular, to achieve roughly the same offsets for the most characteristic patterns. This setting is very favorable for those methods that expect equal offsets, which is true for all eTSC methods discussed here except TEASER; it is all the more reassuring that even under such unfavorable settings TEASER generally outperforms its competitors. In the following we describe an experiment performed on three additional datasets, namely ACS-F1 [11], PLAID [9], and CMU [3]. As can be seen from Table 3, these datasets have interesting characteristics which are quite distinct from those of the UCR data, as all UCR datasets have a fixed length and were preprocessed for approximate alignment. The former two use cases were generated in the context of appliance load monitoring and capture the power consumption of common household appliances over time, whereas the latter records the z-axis accelerometer values of either the right or the left toe of four persons while walking to discriminate normal from abnormal walking styles.
ACS-F1 monitored about 100 home appliances divided into 10 appliance types (mobile phones, coffee machines, personal computers, fridges and freezers, Hi-Fi systems, lamps, laptops, microwave ovens, printers, and televisions) over two sessions of one hour each. The time series are very long and have no defined start points. No preprocessing was applied. We expect all eTSC methods to require only a fraction of the overall TS, and we expect TEASER to outperform other methods in terms of earliness.
PLAID monitored 537 home appliances divided into 11 types (air conditioner, compact fluorescent lamp, fridge, hairdryer, laptop, microwave, washing machine, bulb, vacuum, fan, and heater). For each device, there are two concatenated time series, where the first was taken at start-up of the device and the second during steady-state operation. The resulting TS were preprocessed to create approximately aligned patterns of equal length and scale. We expect eTSC methods to require a larger fraction of the data and the advantage of TEASER to be less pronounced due to the alignment.
CMU recorded time series taken from four walking persons, with some short walks that last only three seconds and some longer walks that last up to 52 seconds. Each walk is composed of multiple gait cycles. The difficulties in this dataset result from variable-length gait cycles, gait styles and paces due to different subjects performing different activities including stops and turns. No preprocessing was applied. Here, we expect TEASER to strongly outperform the other eTSC methods due to the higher heterogeneity of the measurements and the lack of defined start times.
We fixed w to 5% of the maximal time series length of the dataset for each experiment. Table 4 shows results of all methods on these three datasets. TEASER requires 19% (ACS-F1) and 64% (PLAID) of the length of the sessions to make reliable predictions with accuracies of 83% and 91.6%, respectively. As expected, a smaller fraction of the TS is necessary for ACS-F1 than for PLAID. All competitors are considerably less accurate than TEASER, with a difference of 10 to 20 percentage points (pp) on ACS-F1 and 29 to 34 pp on PLAID. In terms of earliness, TEASER is the earliest method on the ACS-F1 dataset but the slowest on the PLAID dataset; although its accuracy on this dataset is far better than that of the other methods, it is only third best in terms of HM value.
As ECDIRE has an earliness of 21% for the PLAID dataset, we performed an additional experiment where we forced TEASER to always output a prediction after seeing at most 20% of the data, which is roughly equal to the earliness of ECDIRE. In this case TEASER achieves an accuracy of 78.2%, which is still higher than that of all competitors. Recall that TEASER and its competitors have different optimization goals: HM vs. accuracy. Still, if we set the earliness of TEASER to that of its earliest competitor, TEASER obtains a higher accuracy.
The advantages of TEASER become even more visible on the difficult CMU dataset. Here, TEASER is 15 to 40 pp more accurate while requiring 35 to 54 pp fewer data points than its competitors. The reasons become visible when inspecting some examples of this dataset (see Figure 11). A normal walking motion consists of roughly three repeated similar patterns. TEASER is able to detect normal walking motions after seeing 34% of the walking patterns on average, which is roughly equal to one out of the three gait cycles. Abnormal walking motions take much longer to classify due to the absence of a gait cycle. Also, one of the normal walking motions (third from top) requires a longer inspection time of two gait cycles, as the first gait cycle seems to start with an abnormal spike.

Table 4: Accuracy and harmonic mean (HM), where higher is better, and earliness, where lower is better, on three real world use cases. TEASER has the highest accuracy on all datasets, and the best earliness on all but the PLAID dataset.
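
To make the HM trade-off in these results concrete, the following minimal sketch computes a harmonic mean of accuracy and earliness. It assumes the form HM = 2 · (1 − earliness) · accuracy / ((1 − earliness) + accuracy), which is not spelled out in this section; the function name `harmonic_mean` is illustrative. The inputs are the accuracy and earliness values quoted in the text for TEASER, so the resulting numbers are derived under that assumption and need not match the HM column of Table 4.

```python
def harmonic_mean(accuracy: float, earliness: float) -> float:
    """Harmonic mean of accuracy and (1 - earliness); inputs are fractions in [0, 1].

    Assumed definition: earliness is the fraction of the series that was
    seen before deciding, so (1 - earliness) rewards early decisions.
    """
    timeliness = 1.0 - earliness
    if accuracy + timeliness == 0.0:
        return 0.0  # avoid division by zero for the degenerate case
    return 2.0 * accuracy * timeliness / (accuracy + timeliness)


# (accuracy, earliness) for TEASER as quoted in the text.
reported = {"ACS-F1": (0.83, 0.19), "PLAID": (0.916, 0.64)}
scores = {name: harmonic_mean(acc, early) for name, (acc, early) in reported.items()}

for name, hm in scores.items():
    print(f"{name}: HM = {hm:.3f}")
```

Under this assumed definition, a late but accurate decision (as on PLAID) is penalized noticeably, which is consistent with TEASER ranking well on accuracy but not on HM for that dataset.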

RELATED WORK
The techniques used for time series classification (TSC) can be broadly categorized into two classes: whole series-based methods and feature-based methods [17]. Whole series-based methods make use of a point-wise comparison of entire TS, like 1-NN Dynamic Time Warping (DTW) [25]. In contrast, feature-based classifiers rely on comparing features generated from substructures of TS. These approaches can be grouped as using either shapelets or bag-of-patterns (BOP). Shapelets are defined as TS subsequences that are maximally representative of a class [12,37]. The bag-of-patterns (BOP) model [17,28,29,30] breaks up a TS into a bag of substructures, represents these substructures as discrete features, and finally builds a histogram of feature counts as the basis for classification. The recent Word ExtrAction for time SEries cLassification (WEASEL) [30] also conceptually builds on the BOP approach and is one of the fastest and most accurate classifiers. In [33], deep learning networks are applied to TSC. Their best performing fully convolutional network (FCN) does not perform significantly differently from the state of the art. [7] presents an overview of deep learning approaches.
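
The bag-of-patterns idea described above can be sketched in a few lines. This is a deliberately simplified illustration, not the discretization used by WEASEL or any of the cited methods: each sliding window is cut into segments, and each segment is mapped to one of two symbols depending on whether its mean lies below the mean of the whole series. The function name `bag_of_patterns` and the two-symbol alphabet are assumptions made for this example.

```python
from collections import Counter
from typing import Sequence


def bag_of_patterns(series: Sequence[float], window: int, word_len: int) -> Counter:
    """Histogram of discretized sliding-window words (illustrative only)."""
    if window % word_len != 0:
        raise ValueError("window must be a multiple of word_len")
    seg = window // word_len              # number of points per symbol
    series_mean = sum(series) / len(series)
    histogram: Counter = Counter()
    for start in range(len(series) - window + 1):
        sub = series[start:start + window]
        # One symbol per segment: 'a' if the segment mean is below the
        # series mean, otherwise 'b'.
        word = "".join(
            "a" if sum(sub[k * seg:(k + 1) * seg]) / seg < series_mean else "b"
            for k in range(word_len)
        )
        histogram[word] += 1
    return histogram


hist = bag_of_patterns([0, 0, 1, 1, 0, 0, 1, 1], window=4, word_len=2)
print(dict(hist))
```

The resulting word counts form the feature vector that a standard classifier would consume; real BOP methods use finer alphabets and more careful approximations of each window.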
Early classification of time series (eTSC) [26] is important when data becomes available over time and decisions need to be taken as early as possible. It addresses two conflicting goals: maximizing accuracy typically reduces earliness and vice versa. Early Classification on Time Series (ECTS) [34] is one of the first papers to introduce the problem. The authors adopt a 1-nearest neighbor (1-NN) approach and introduce the concept of minimum prediction length (MPL) in combination with clustering. Time series with the same 1-NN are clustered. The optimal prefix length for each cluster is obtained by analyzing the stability of the 1-NN decision for increasing time stamps. Only those clusters with stable and accurate offsets are kept. To give a prediction for an unlabeled TS, the 1-NN is searched among clusters. Reliable Early Classification (RelClass) [22] presents a method based on quadratic discriminant analysis (QDA). A reliability score is defined as the probability that the predicted class for the truncated and the whole time series will be the same. At each time stamp, RelClass then checks if the reliability is higher than a user-defined threshold. Early Classification of Time Series based on Discriminating Classes Over Time (ECDIRE) [19] trains classifiers at certain time stamps, i.e., at percentages of the full time series length. It learns a safe time stamp, i.e., the fraction of the time series after which a prediction is considered safe. Furthermore, a reliability threshold is learned using the difference between the two highest class probabilities. Only predictions passing this threshold after the safe time stamp are chosen. The idea of EDSC [35] is to learn shapelets that appear early in the time series and that discriminate between classes as early as possible.
[18] approaches early classification as an optimization problem. The authors combine a set of probabilistic classifiers with a stopping rule that is optimized using a cost function on earliness and accuracy. Their best performing model SR1-CF1 is significantly earlier than the state of the art, but its accuracy falls behind. However, the code has a logical design flaw that makes the results hard to compare against and apparently leads to good scores on the UCR datasets. Their algorithm uses z-normalized time series, which are then truncated to build a training set. Thereby, a truncated subsequence makes use of information about values that will only be available in the future for normalization, i.e., its absolute values are a result of z-normalization with data that has not arrived yet. In TEASER, the truncated train series are themselves z-normalized, thus removing any bias from values that have not been seen. We decided to omit SR1-CF1 from our evaluation due to this normalization issue.
A problem related to eTSC is the classification of streaming time series [8,21]. In these works, the task is to assign class labels to time windows of a potentially infinite data stream, which is similar to event detection in streams [1]. The data enclosed in a time window is considered to be an instance for a classifier. Due to the windowing, multiple class labels can be assigned to a data stream. In contrast, eTSC aims at assigning a label to an entire TS as soon as possible.
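
The normalization issue discussed above for [18] can be illustrated with a small sketch. It contrasts normalizing the full series and then truncating it (which leaks future values into the prefix) with truncating first and normalizing only what has been observed. The helper names `prefix_leaky` and `prefix_causal` and the toy series are assumptions for this example and are not taken from either code base.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def z_normalize(values: Sequence[float]) -> List[float]:
    """Zero-mean, unit-variance scaling; constant input maps to zeros."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0.0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def prefix_leaky(series: Sequence[float], length: int) -> List[float]:
    """Normalize the full series, then truncate: uses values from the future."""
    return z_normalize(series)[:length]


def prefix_causal(series: Sequence[float], length: int) -> List[float]:
    """Truncate, then normalize: uses only the values seen so far."""
    return z_normalize(series[:length])


series = [1.0, 2.0, 3.0, 100.0]  # the late spike has not been observed at length 3
leaky = prefix_leaky(series, 3)
causal = prefix_causal(series, 3)
print("leaky :", [round(v, 3) for v in leaky])
print("causal:", [round(v, 3) for v in causal])
```

In the leaky variant all three prefix values are pushed below zero by a spike that has not yet arrived, so the prefix already encodes information about the future; the causal variant is centered on the observed values only.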

CONCLUSION
We presented TEASER, a novel method for early classification of time series. TEASER's decision for the safety (accuracy) of a prediction is treated as a classification problem, in which master classifiers continuously analyze the output of probabilistic slave classifiers to decide if their predictions should be trusted or not. By means of this technique, TEASER adapts to the characteristics of classes irrespective of the moments at which they occur in a TS. In an extensive experimental evaluation using altogether 48 datasets, TEASER outperforms all other methods, often by a large margin and often both in terms of earliness and accuracy.

Figure 2: eTSC on a trace of a digital receiver. The top of the figure shows traces of a digital receiver and of microwaves. These have three characteristic patterns (a) to (c). In the bottom part, eTSC is performed on a snapshot of the time series of a digital receiver. In its first snapshot it is easily confused with pattern (c) of a microwave. However, the trace later contains pattern (a), which is characteristic of a receiver.

Figure 3: TEASER is given a snapshot of an energy consumption time series. After seeing the first s measurements, the first slave classifier sc_1 performs a prediction which the master classifier mc_1 rejects due to low class probabilities. After observing the i'th interval, which includes a characteristic energy burst, the slave classifier sc_i (correctly) predicts RECEIVER, and the master classifier mc_i eventually accepts this prediction. When the prediction of RECEIVER has been consistently derived v times, it is output as the final prediction.

Figure 5: The master computes a hypersphere around the correctly predicted samples. A novel sample is accepted/rejected if its probabilities fall into/outside the orange hypersphere.

Figure 6: The accuracy of the slave classifier reaches 100% after seeing 13 time stamps on the train data, resulting in one-class classification.

Figure 7: Average Harmonic Mean (HM) over earliness and accuracy for all 45 TS datasets (lower rank is better).

Figure 8: Average ranks over earliness (left) and accuracy (right) for 45 TS datasets (lower rank is better).

Influence of w on earliness and accuracy: (a) Boxplot for earliness for varying parameter w over all 45 datasets. (b) Boxplot for accuracy for varying parameter w over all 45 datasets.

Figure 11: Earliness of predictions on the walking motion dataset. Orange (top): abnormal walking motions. Green (bottom, dashed): normal walking motions. In bold color: the fraction of the TS needed for classification. In light color: the whole series.

arXiv:1908.03405v2 [cs.LG] 16 Aug 2019

Table 1: Symbols and Notations.
Symbol          Meaning
sc_i / mc_i     a slave / master classifier at the i'th snapshot
of the whole trace (amounting to 1 hour). Current eTSC methods trained on this data would always wait for 30 minutes, because this was the last time they had seen the important event in the training data. But actually most TS could be classified safely much earlier; instead of assuming a fixed classification time, an algorithm should adapt its decision time to the individual time series.
(4) To train the paired master classifier, we first remove all instances which were incorrectly classified by the slave. Assume that there were N' ≤ N correct predictions. We then train a one-class SVM on the N' training samples, where each sample is represented by the 3-tuple (c(s_i), P(s_i), d(s_i)) produced by the slave classifier. (5) Finally, we perform a grid-search over values v ∈ {1...5} to find the threshold which yields the highest harmonic mean HM of earliness and accuracy on D_train.
Training phase of TEASER using S time stamps, and a labeled train dataset.
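
Steps (4) and (5) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the 3-tuple is taken to be the predicted class c, the class probability vector P, and d as the gap between the two highest probabilities; "consistently derived v times" is read as v consecutive accepted predictions of the same label, with a rejection resetting the count. All function names are hypothetical, and the one-class SVM fit itself is left out.

```python
from typing import List, Optional, Sequence, Tuple

Sample = Tuple[int, Tuple[float, ...], float]  # assumed layout: (c, P, d)


def slave_output(probabilities: Sequence[float]) -> Sample:
    """Build the 3-tuple from one slave's class probabilities."""
    ranked = sorted(probabilities, reverse=True)
    predicted = max(range(len(probabilities)), key=probabilities.__getitem__)
    # Assumption: d is the margin between the best and second-best class.
    margin = ranked[0] - ranked[1] if len(ranked) > 1 else ranked[0]
    return predicted, tuple(probabilities), margin


def master_training_samples(
    all_probs: Sequence[Sequence[float]], true_labels: Sequence[int]
) -> List[Sample]:
    """Step (4): keep only instances the slave classified correctly."""
    samples = [slave_output(p) for p in all_probs]
    return [s for s, y in zip(samples, true_labels) if s[0] == y]


def first_consistent(accepted: Sequence[Optional[int]], v: int) -> Optional[int]:
    """Snapshot index at which one label has been accepted v times in a row.

    `accepted[i]` is the label accepted by the master at snapshot i,
    or None if the master rejected the slave's prediction.
    """
    run_label, run_len = None, 0
    for i, label in enumerate(accepted):
        if label is None:
            run_label, run_len = None, 0      # a rejection resets the run
        elif label == run_label:
            run_len += 1
        else:
            run_label, run_len = label, 1
        if run_label is not None and run_len >= v:
            return i
    return None


train = master_training_samples([[0.7, 0.2, 0.1], [0.4, 0.5, 0.1]], [0, 0])
decision_at = first_consistent([None, 1, 1, 0, 0, 0], v=3)
```

For step (5), one would rerun `first_consistent` for each v in {1, ..., 5} on the training data and keep the value with the highest HM; larger v trades earliness for more stable decisions.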

Harmonic mean (HM) for TEASER vs. the four eTSC classifiers (ECTS, EDSC, RelClass and ECDIRE). Red dots indicate where TEASER has a higher HM than the other classifiers. In total there are 36 wins for TEASER.

Table 2: Summary of domination counts (wins/ties/losses) using earliness and accuracy (Pareto Optimality):