ELIS++: a shapelet learning approach for accurate and efficient time series classification

In recent years, time series classification with shapelets has attracted considerable interest due to its high accuracy and good interpretability. These approaches extract or learn shapelets from the training time series. Although they can achieve higher accuracy than other approaches, they still confront several challenges. First, they may suffer from low accuracy when the training dataset is small. Second, some parameters, like the number of shapelets and the length of each shapelet, and some hyper-parameters, like the learning rate and the regularization weight, must be set manually beforehand, which is difficult without prior knowledge. Third, extracting or learning shapelets incurs a huge computation cost due to the huge search space. In this paper, we extend our previous shapelet learning approach ELIS to ELIS++. To improve the accuracy on small training datasets, we propose a data augmentation approach. To learn higher-quality shapelets, based on the PAA shapelet candidate search technique proposed in ELIS, ELIS++ first applies a novel entropy-based shapelet candidate selection mechanism to discover shapelet candidates, and then applies the logistic regression model to adjust the shapelets. To avoid setting the remaining parameters manually, we propose a Bayesian Optimization based approach. Moreover, two techniques are proposed to improve the efficiency: coarse-grained shapelet adjustment and SIMD-based parallel computation. We conduct extensive experiments on 35 UCR datasets, and the results verify the effectiveness and efficiency of ELIS++.

In the first phase, ELIS enumerates subsequences in the training time series and represents them by PAA [28]. For each class, all PAA words are ranked by TFIDF-style scores. The mechanism of the false set is proposed to determine the number of shapelets automatically. The second phase is the shapelet adjustment phase, in which ELIS utilizes a machine learning approach to tune the shapelets and build the classification model. ELIS achieves higher accuracy and efficiency than LTS. However, ELIS has the following drawbacks. First, it may not work well in the case of a small training dataset. The reason is that the training time series will be covered by a small number of shapelet candidates, so some "true" shapelets will be missed. Second, although ELIS can determine the key parameters, the number of shapelets and the shapelet lengths, we still need to manually tune some parameters, like the learning rate and the regularization weight. Third, for a large training dataset, training the model in ELIS is time consuming. The reason is that, just as LTS [7], ELIS needs to adjust every data point of each shapelet.
In this paper, we extend ELIS further to ELIS++, which is more accurate and efficient. In our experiments on 35 datasets of the UCR repository [3], ELIS++ is always more accurate than ELIS. The contributions we make in this paper can be summarized as follows.
- To improve the accuracy on small training datasets, we propose a data augmentation approach (Section 5.1). Considering the high expense of obtaining labelled data, in many applications analysts can only get a small training dataset, so improving the effectiveness in this case is important. In ELIS++, we propose a two-step approach to augment the training dataset. A parameter, the augmentation amount, is used to control the augmentation scale. Besides improving the accuracy, our approach even improves the efficiency in most cases, because fewer iteration rounds are needed for the model training to converge.
- To address the problem of manually setting the parameters, we propose an approach to set the parameters automatically (Section 5.2). ELIS++ has several parameters, including the data augmentation amount, the regularization λ, the learning rate η and the iteration rounds I. Tuning them manually is a tedious task. We propose a Bayesian Optimization based approach to set the parameters automatically. Experimental results verify the effectiveness of the proposed approach.
- We propose two techniques to improve the efficiency. First, we propose the coarse-grained shapelet adjustment mechanism (Section 4.4). Instead of adjusting every data point in the shapelets, we adjust a fixed number of contiguous data points together, determined by the adjustment granularity. Besides improving the efficiency, it can also improve the accuracy by reducing the overfitting caused by fine-grained adjustment. Second, we utilize SIMD-based parallel computation to accelerate the model training (Section 4.5).
- To improve the quality of the shapelet candidates, we propose a novel shapelet candidate selection mechanism (Section 3.5). In ELIS, the mechanism of the false set was used to evaluate the discrimination ability of candidate sets and determine the number of shapelets. In ELIS++, we replace it with an entropy-based approach, since the format of the new measurement is consistent with that in the shapelet adjustment stage.
- We expand the experimental evaluation greatly (Section 6). First, we add three baseline approaches: 1NN-cDTW [4], Proximity Forest [17] and the deep learning based approach mWDN [27]. Second, we extend the datasets from 15 to 35.
The rest of the paper is organized as follows. First, essential definitions are presented in Section 2. The ELIS++ approach is introduced in Sections 3, 4 and 5. In Section 6, we present the experimental setting and results. Next, we review related work in Section 7. Finally, we conclude the paper in Section 8.

Preliminaries
In this section, we present the necessary definitions.
Definition 1 (Time Series and Subsequence) A time series $T$ is a sequence of ordered values $T = (t_1, t_2, \cdots, t_i, \cdots, t_n)$, where $n$ is the length of $T$. A subsequence of time series $T$, $(t_b, t_{b+1}, \cdots, t_{e-1}, t_e)$, can be denoted as $T[b:e]$ or $T(b, e-b+1)$, in which $b$ and $e$ are the beginning and ending points of the subsequence respectively.

Definition 2 (Dataset) A dataset $D$ is a set of pairs of a time series $T_i$ and its class label $c_i$. Formally, $D = \{\langle T_1, c_1 \rangle, \langle T_2, c_2 \rangle, \cdots, \langle T_N, c_N \rangle\}$, where $N$ is the number of time series in $D$. $C = \{c_1, c_2, \cdots, c_{|C|}\}$ is the set of class labels, and $|C|$ denotes the number of labels.

Definition 3 (PAA Sequence) A subsequence $T(b, m)$ can be represented by a sequence $V = (v_1, v_2, \cdots, v_w)$ by PAA. The main idea of PAA is to replace the original subsequence with the mean values of disjoint windows; the $i$-th element of $V$ is

$$v_i = \frac{w}{m} \sum_{j = \frac{m}{w}(i-1)+1}^{\frac{m}{w} i} t_{b+j-1}. \qquad (1)$$

The PAA window number, $w$, is a parameter that determines the length of the PAA sequence. Figure 1 shows two PAA sequences with $w$ equal to 5 and 10 respectively.
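To make the computation concrete, the following C++ sketch averages disjoint windows to obtain a PAA sequence. It assumes $m$ is divisible by $w$, since the text does not specify how remainders are handled.

```cpp
#include <vector>

// Compute the PAA sequence of subsequence T(b, m) with w windows.
// Assumes m is divisible by w; remainder handling is not specified
// in the text.
std::vector<double> paa(const std::vector<double>& T, int b, int m, int w) {
    std::vector<double> V(w, 0.0);
    int win = m / w;                        // length of each disjoint window
    for (int i = 0; i < w; ++i) {
        double sum = 0.0;
        for (int j = 0; j < win; ++j)
            sum += T[b - 1 + i * win + j];  // b is 1-based as in the paper
        V[i] = sum / win;                   // mean value of the window
    }
    return V;
}
```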
For two sequences of equal length, we can measure the distance between them by the traditional Euclidean distance. Moreover, we need to define the distance between two sequences of different lengths.

Definition 4 (Distance Between Two Sequences) The distance between sequences $T_u$ of length $L_u$ and $T_v$ of length $L_v$ ($L_v \ge L_u$), whether they are time series or subsequences, is

$$dist(T_u, T_v) = \min_{j=1}^{J} \sqrt{\frac{1}{L_u} \sum_{i=1}^{L_u} \left(tu_i - tv_{i+j-1}\right)^2}, \qquad (2)$$

where $J = L_v - L_u + 1$, and $tu_i$ and $tv_i$ are the $i$-th values of $T_u$ and $T_v$ respectively.
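A direct reading of Definition 4 yields the sketch below: slide the shorter sequence over the longer one and keep the minimum per-alignment Euclidean distance (length-normalized, matching the reconstruction of (2) above).

```cpp
#include <algorithm>
#include <cmath>
#include <limits>
#include <vector>

// Distance between a shorter sequence Tu and a longer sequence Tv
// (Definition 4): minimum over all J = Lv - Lu + 1 alignments.
double dist(const std::vector<double>& Tu, const std::vector<double>& Tv) {
    size_t Lu = Tu.size(), Lv = Tv.size();
    double best = std::numeric_limits<double>::max();
    for (size_t j = 0; j + Lu <= Lv; ++j) {   // each alignment position
        double sum = 0.0;
        for (size_t i = 0; i < Lu; ++i) {
            double d = Tu[i] - Tv[j + i];
            sum += d * d;
        }
        best = std::min(best, sum / Lu);      // length-normalized squared distance
    }
    return std::sqrt(best);
}
```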
The PAA word and the SAX word are two symbolic representations of a subsequence. The difference lies in how the range of the Y-axis (all the values of time series in $D$) is split.
Definition 5 (PAA Word) A PAA word is obtained by transforming a PAA sequence into an equal-length sequence of discrete integers. To simplify the expression of the PAA word, we represent these discrete integers as alphabets, i.e., 0 as "A", 1 as "B" and so on. A subsequence $T(b, m)$ is first transformed into a PAA sequence $V$, and then represented by a PAA word $P_m^w = (p_1, p_2, \cdots, p_w)$. The $i$-th element of $P_m^w$ is

$$p_i = \left\lfloor \frac{v_i - MIN}{(MAX - MIN + \delta)/d} \right\rfloor, \qquad (3)$$

in which $\delta$ is a small enough number compared to $MAX - MIN$.
The transformation is reversible, that is, PAA word $P_m^w$ can be reverted into a length-$m$ numeric subsequence $\tilde{T} = (\tilde{t}_1, \tilde{t}_2, \cdots, \tilde{t}_m)$. The $i$-th element of $\tilde{T}$ takes the center value of the bucket of the corresponding PAA word element:

$$\tilde{t}_i = MIN + \left(p_{\lceil i \cdot w / m \rceil} + 0.5\right) \cdot \frac{MAX - MIN + \delta}{d}. \qquad (4)$$

Based on $\tilde{T}$, we define the distance between a PAA word $P_m^w$ and a length-$m$ sequence $T$ as $dist(\tilde{T}, T)$ in (2). Figure 2 shows an example of the PAA word "CDHGGHFIFA" with $w = 10$ and $d = 10$.

Definition 7 (Shapelet) A shapelet $S$ is a time series subsequence which is maximally representative of a class. It semantically represents intelligence on how to discriminate one class against others.
Interpretability is an important property of shapelets. That means a shapelet can be interpreted as a shape characteristic of time series in a class.

ELIS++ overview
In this paper, we extend ELIS to ELIS++. Similar to ELIS, ELIS++ combines the advantages of both structure-based and shapelet-based approaches through the shapelet granularity. The benefit of the shapelet-based approach is that the shapelets, no matter whether selected from the dataset [30] or learned by gradient descent [7], have more distinguishing power. However, obtaining high-quality shapelets is time consuming. In contrast, the structure-based approach is more efficient, since both transforming the subsequences into approximate representations and determining their occurrences in time series are much easier than computing the distance between the raw time series. However, the loss of detailed information may lower the model accuracy. ELIS++ consists of three phases: the data augmentation phase, the shapelet candidate discovery phase and the model training phase.
Data augmentation module Although we can train the classification model directly on dataset D, the trained model may suffer from low accuracy when D is small. So, before training the model, we first augment the training dataset by two operations, rotation and noise insertion.
Candidate generation module To learn the shapelets, we first need some initial shapelets, which are called shapelet candidates. High-quality candidates not only lead to higher accuracy, but also make the learning process easier to converge. In this phase, we find high-quality shapelet candidates in an efficient way. Specifically, we first enumerate and rank the candidates in the PAA word space, which makes it easy to determine the similarity between two subsequences and calculate their occurrences efficiently. Then we transform them into the raw numeric space, and select the final candidates.

Model training module
In this phase, we apply the logistic regression model to adjust the shapelets to improve the accuracy. The classifier is initialized based on the shapelet candidates, and then learns better shapelets on the raw numeric time series. Moreover, to avoid the high expense of adjusting each point of the shapelets, as in LTS and ELIS, we propose the mechanism of coarse-grained shapelet adjustment. We use the adjustment granularity to control how finely the shapelets represent the raw time series.
Moreover, we propose a Bayesian Optimization based approach to set the parameters automatically. It first learns a model to represent the relationship between the parameters and the classifier accuracy, and then uses the model to find the best parameters.
The proposed ELIS++ approach is outlined in Figure 3. In the next sections, we introduce the technical details in turn.

Shapelet candidates discovery
In this section, we introduce the shapelet candidate discovery module. Specifically, for each class, we generate the candidates from the PAA word space, and then revert them into the raw numeric space.

Overview
The rationale of candidate discovery is that if a subsequence appears frequently in one class and only occasionally in the others, it is a high-quality shapelet candidate for this class. The challenge is to determine the occurrences of each subsequence. The brute-force approach, which processes all possible subsequences, is too expensive, $O(N^2 n^4)$, where $N$ is the number of time series and $n$ is the length of a time series [30]. So we transform subsequences to PAA words to discover candidates efficiently. However, we need to solve three problems. First, the PAA representation may cause the problem of mismatching, that is, two similar subsequences may be represented as two different PAA words. Second, we need to keep the number of shapelets as small as possible, which guarantees high interpretability and high training efficiency, while keeping the shapelets sufficiently discriminative. The third problem is how to alleviate the burden of parameter tuning. We need to determine the number and length of the shapelets, as well as the granularity of the PAA representation, all of which have a great influence on the accuracy.
In this paper, we propose a three-step approach to discover the shapelet candidates. In the first step, we transform all subsequences in $D$ to PAA words. Unlike previous structure-based approaches, in which the window number $w$ of the PAA or SAX word is fixed, we generate PAA words with multiple granularities for each subsequence. This solves the problem of improper granularity introduced in Section 1. In the second step, we compute the TFIDF score for each PAA word in each class and rank them accordingly. A PAA word with a higher score is more discriminative. We propose the concept of fuzzy counting to avoid the mismatching issue. In the third step, for each class, we use a coverage-based approach to select some of the sorted PAA words as the final set of candidates. We use the cover gain to measure the contribution of each PAA word, and use entropy to determine the number of candidates to be selected. The selected candidates are different from each other, and together they provide high classification accuracy.
Previous shapelet works [7,23] need to set two important parameters: the length of the shapelets and the total number of shapelets. Unfortunately, these parameters are difficult to determine in the absence of prior knowledge. In our approach, we rank and select shapelet candidates of variable lengths together to avoid setting the shapelet length manually. Also, we use entropy to determine the number of shapelets adaptively.

PAA word generation
First, we generate a set of PAA words from the raw time series. We only select subsequences of length $i \cdot \phi \cdot n$ ($i = 1, 2, \cdots$) and transform them to PAA words, where $\phi$ is a fixed step ratio, such as 0.05. The rationale is that subsequence $T(b, m)$ will be very similar to $T(b, m')$ if their lengths $m$ and $m'$ are very similar. In most cases, these two subsequences will have similar discrimination ability.
For each time series $T$ in $D$, we traverse it from left to right to generate PAA words. Formally, from each beginning point $b$ ($1 \le b \le n$), we enumerate subsequences of length $i \cdot \phi \cdot n$ ($i = 1, 2, \cdots$), that is, $T(b, \phi \cdot n), T(b, 2 \cdot \phi \cdot n), \cdots$. We transform each of these subsequences to multiple PAA words with variable PAA window numbers, $w \in \{2, 4, 8, 16, 32, 64\}$, whenever possible. In fact, the values of $w$ can be set freely. The only thing that needs to be guaranteed is that the values of $w$ reflect the granularity from coarse to fine.
Moreover, we eliminate the PAA words whose shapes are significantly different from the subsequences they correspond to. In such cases, the PAA word cannot represent the shape of the subsequence. For example, assume the subsequence is $(3.5, -3.5, -2.5, 2.5)$ and $w = 2$; the PAA sequence will be $(0, 0)$. This PAA word should be dismissed because it cannot represent the shape of the raw subsequence. We use the PAA bucket width, $\epsilon = \frac{MAX - MIN}{d}$, to constrain the distance. Specifically, for subsequence $T(b, m)$ and the transformed PAA word $P_m^w$, if the distance between $T(b, m)$ and $P_m^w$ exceeds $\epsilon$, we discard this PAA word to avoid the deformation caused by PAA.
Finally, we insert the PAA words into an inverted table, denoted as $H$. Each entry of $H$ takes a PAA word as the key, and the set of indexes of the time series in which this PAA word occurs, called the cover set, as the value. Note that if a PAA word occurs in a time series many times, it is counted only once.
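A minimal sketch of the inverted table using standard containers; the type and function names are illustrative, not those of our implementation.

```cpp
#include <set>
#include <string>
#include <unordered_map>

// Inverted table H: PAA word -> cover set (indexes of the time series
// in which the word occurs). Using std::set deduplicates automatically,
// so a word occurring several times in one series is counted once.
using InvertedTable = std::unordered_map<std::string, std::set<int>>;

void insertWord(InvertedTable& H, const std::string& paaWord, int seriesIdx) {
    H[paaWord].insert(seriesIdx);  // no-op if (word, series) already present
}
```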

TFIDF score
In the second step, we use a TFIDF-style score to evaluate how discriminative a PAA word is for a class. The TFIDF score of PAA word $P$ for class $c$ is defined as the product of two factors: the term frequency (TF) and the inverse document frequency (IDF). The term frequency is defined as

$$TF_{P,c} = \frac{F_{P,c}}{N_c},$$

where $F_{P,c}$ is the frequency with which PAA word $P$ occurs in the time series of class $c$ and $N_c$ is the total number of time series of class $c$. The inverse document frequency is defined in terms of $I_{P,c}$ and $O_{P,c}$, where $F_P$ is the frequency with which PAA word $P$ occurs in the time series of $D$. When computing all the frequencies above, we also add the occurrence times of similar PAA words, which solves the problem of mismatching [20]. For example, the occurrences of the PAA word "CCDDB" should also be counted when computing the frequency of the PAA word "CCDDC". Formally, we define similar PAA words as follows.

Definition 8 (Similar PAA Words) Let $P = (p_1, p_2, \cdots, p_w)$ and $P' = (p'_1, p'_2, \cdots, p'_w)$ be two different PAA words with the same PAA window number $w$ and raw subsequence length $m$. $P$ and $P'$ are similar if each pair of corresponding elements differs by at most one alphabet level.

After computing all TFIDF scores, we sort the PAA words of each class by their TFIDF scores in descending order, and maintain the sorted PAA words of each class $c$ in a list, denoted as $L_c$.

Candidate selection
In the third step, we generate the final set of shapelet candidates for each class. We intend to find sufficient candidates to cover the class characteristics while guaranteeing the diversity between candidates. We use two techniques, cover gain and entropy, to select shapelets for each class.
For class $c$, we traverse the PAA words in $L_c$ from top to bottom and select some of them to form the candidate set, denoted as $\Omega_c$. Note that for different classes, the number of candidates, denoted as $|\Omega_c|$, may differ. The challenge here is how to determine $|\Omega_c|$ for each class. The rationale behind our approach is to form multiple candidate sets for each class. By evaluating the discrimination ability of each generated candidate set with the coverage, we choose the best one as the optimal candidate set. First, we give some definitions.
Definition 9 (Coverage) The coverage, denoted as $\theta$, is a threshold that determines whether the characteristics of a time series are covered by enough candidates. Specifically, if each time series of class $c$ is covered at least $\theta$ times by candidates selected from $L_c$, we believe that the candidates already cover all the features of the time series in class $c$ with threshold $\theta$.

Definition 10 (Cover Gain) The cover gain of a PAA word $P$ equals the number of time series covered by $P$ whose coverages are still lower than $\theta$.
For each class $c$ and coverage threshold $\theta$, we generate a candidate set, denoted as $\Omega_c^\theta$. We convert the problem of finding the optimal set of shapelet candidates into the problem of finding the optimal value of $\theta$. That is, we define a score to measure the discrimination ability of each candidate set, $score(\Omega_c^\theta)$, and select the candidate set with the maximal $score(\Omega_c^\theta)$. We have two tasks: i) find $\Omega_c^\theta$ for each $\theta$; ii) define $score(\Omega_c^\theta)$. Next, we introduce them in turn.
We generate $\Omega_c^\theta$ for each $\theta$ as follows. Initially, $\Omega_c^\theta$ is empty. Then we traverse the PAA words in $L_c$ from top to bottom. For each PAA word $P$, we have three cases.
1. If $P$ has no cover gain, we simply discard it. In this case $P$ is meaningless, since it only covers time series that have already been covered at least $\theta$ times by candidates with higher TFIDF scores.
2. If $P$ has cover gain, but it is similar to a certain candidate, say $P'$, which is already in $\Omega_c^\theta$, we just add the time series covered by $P$ to those covered by $P'$. $P$ itself will not be added into $\Omega_c^\theta$. By eliminating overlapping PAA words, we keep the size of the candidate set small, which is good for interpretability.
3. If $P$ has cover gain and is not similar to any candidate in $\Omega_c^\theta$, we put it into $\Omega_c^\theta$.

This process terminates when all time series are covered at least $\theta$ times, or when all PAA words in $L_c$ are processed. A sketch of this selection loop is given below. We discuss the definition of $score(\Omega_c^\theta)$ in the next section.
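The selection loop can be sketched as follows. The similarity test is a hypothetical stand-in for Definition 8, and whether the cover set merged in case 2 also counts toward $\theta$ is not spelled out in the text, so the sketch keeps the counts unchanged in that case.

```cpp
#include <cstdlib>
#include <set>
#include <string>
#include <vector>

struct Word {
    std::string paa;       // the PAA word
    std::set<int> cover;   // indexes of the time series it covers
};

// Hypothetical stand-in for Definition 8: elements at most one level apart.
bool isSimilar(const std::string& a, const std::string& b) {
    if (a.size() != b.size()) return false;
    for (size_t i = 0; i < a.size(); ++i)
        if (std::abs(a[i] - b[i]) > 1) return false;
    return true;
}

// Greedy selection for one class and threshold theta, following the three
// cases above. words must be sorted by TFIDF score (descending); N is the
// number of time series of the class.
std::vector<Word> selectCandidates(std::vector<Word> words, int N, int theta) {
    std::vector<int> count(N, 0);              // per-series cover counts
    std::vector<Word> omega;
    for (auto& P : words) {
        int gain = 0;                          // cover gain of P (Definition 10)
        for (int t : P.cover) if (count[t] < theta) ++gain;
        if (gain == 0) continue;               // case 1: discard P
        bool merged = false;
        for (auto& Q : omega)
            if (isSimilar(P.paa, Q.paa)) {     // case 2: fold P's cover into Q
                Q.cover.insert(P.cover.begin(), P.cover.end());
                merged = true;
                break;
            }
        if (!merged) {                         // case 3: accept P as a candidate
            for (int t : P.cover) ++count[t];
            omega.push_back(P);
        }
        bool done = true;                      // stop once all covered >= theta times
        for (int cnt : count) if (cnt < theta) { done = false; break; }
        if (done) break;
    }
    return omega;
}
```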

The score of the shapelet candidate set $score(\Omega_c^\theta)$
Now we discuss the second task, defining $score(\Omega_c^\theta)$. Let $\Omega_c^\theta = \{S_1, S_2, \cdots, S_p\}$, where $p = |\Omega_c^\theta|$ and $S_j$ is the $j$-th shapelet candidate, reverted from the $j$-th PAA word with Eq. 4. The rationale behind our approach is to use $score(\Omega_c^\theta)$ to mimic the classification process, using entropy to evaluate the discrimination ability of the candidate set. We first introduce the entropy of a single shapelet, and then extend it to the shapelet set.
Given shapelet $S_j$, we compute the distances between it and all time series in $D$, sort the distances, and place them on an ordered line, as shown in Figure 4. The blue square and red circle objects correspond to time series of different classes. Objects further to the left correspond to time series with smaller distance to $S_j$.
We use entropy to measure the distinguishing power of a shapelet $S$. Specifically, we split all objects on the ordered line into a left part and a right part by a split point, denoted as $SP$, as shown in Figure 4. That is, we classify all time series $T$ which satisfy $dist(S, T) \le SP$ into one group, and the other time series into another group. For any $SP$, we can compute the entropy based on the time series grouping induced by $SP$. For each shapelet $S_j$ ($1 \le j \le p$), we select the split point with the smallest entropy, which produces two groups as "pure" as possible. We call this split point the optimal split point of $S_j$, denoted as $SP_j$, and denote its entropy as $ENT_j$. To define the score of a shapelet candidate set, $score(\Omega_c^\theta)$, we still project all time series onto an ordered line, denoted as $OL$, find an optimal split point $SP$, and use the entropy to measure the purity of the two time series groups. The challenge here is how to project the time series onto $OL$, since the distances of a time series to different shapelets are different.
The basic idea of our approach is to generate the ordered lines for all candidates in $\Omega_c^\theta$, and merge them to form the final $OL$. First, we generate the ordered line $OL_j$ for each candidate $S_j$ by computing and sorting the distances between it and all the time series. Then we find the optimal split point $SP_j$ and compute the corresponding entropy $ENT_j$.
Then we align these ordered lines by aligning the split points $SP_j$. Moreover, we normalize them by extending or shrinking the ordered lines based on the entropies $ENT_j$. Formally, for the ordered line $OL_j$ of candidate $S_j$, the normalized distance between $S_j$ and time series $T_i$ is $\frac{dist(S_j, T_i) - SP_j}{ENT_j}$. We merge all normalized ordered lines to form the final ordered line, $OL$, on which the position of time series $T_i$ is

$$pos(T_i) = \sum_{j=1}^{p} \frac{dist(S_j, T_i) - SP_j}{ENT_j}. \qquad (11)$$

After obtaining the positions of all time series on $OL$, we find the optimal split point and compute the entropy, denoted as $ENT$. We set $score(\Omega_c^\theta)$ to $\frac{1}{ENT}$. The larger $score(\Omega_c^\theta)$ is, the more discriminative the candidate set $\Omega_c^\theta$.
Speedup of the computation of $score(\Omega_c^\theta)$ When the number of shapelet candidates and the number of time series are large, computing $ENT_j$ by finding the optimal split point for each shapelet candidate $S_j$ is time consuming. Therefore, we compute $score(\Omega_c^\theta)$ in a more efficient way. By a simple transformation of (11), we have

$$pos(T_i) = \sum_{j=1}^{p} \frac{dist(S_j, T_i)}{ENT_j} - \sum_{j=1}^{p} \frac{SP_j}{ENT_j}. \qquad (12)$$

Note that the right term in (12) is a constant for all time series, so removing it does not influence the relative positions of the time series on $OL$. As for the denominator $ENT_j$, it represents the discrimination power of candidate $S_j$. To improve the efficiency, we replace it with the TFIDF value of the corresponding PAA word. So the position of time series $T_i$ becomes

$$pos(T_i) = \sum_{j=1}^{p} \frac{dist(S_j, T_i)}{TFIDF_j}, \qquad (13)$$

where $TFIDF_j$ is the TFIDF value of the $j$-th PAA word, as introduced in Section 3.3. Based on this adapted ordered line, we find the optimal split point, compute the entropy and set $score(\Omega_c^\theta)$ to $\frac{1}{ENT}$. We compute $score(\Omega_c^\theta)$ for various values of $\theta$. The candidate set with the maximal $score(\Omega_c^\theta)$ can best discriminate the time series of different classes, so we choose it as the final candidate set. Ties are broken by choosing the set with fewer candidates, because the fewer candidates we have, the more efficient the follow-up operations will be and the more interpretable the shapelets will be.
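Under the reconstruction of (13) above, computing the positions on the adapted ordered line reduces to a weighted sum of precomputed distances, as the sketch below shows; the array layout is an assumption for illustration.

```cpp
#include <vector>

// Position of each time series on the adapted ordered line (13):
// pos(T_i) = sum_j dist(S_j, T_i) / TFIDF_j. distances[j][i] holds
// dist(S_j, T_i); tfidf[j] is the TFIDF score of the j-th PAA word.
std::vector<double> positionsOnOL(const std::vector<std::vector<double>>& distances,
                                  const std::vector<double>& tfidf) {
    size_t p = distances.size(), N = distances[0].size();
    std::vector<double> pos(N, 0.0);
    for (size_t j = 0; j < p; ++j)
        for (size_t i = 0; i < N; ++i)
            pos[i] += distances[j][i] / tfidf[j];  // TFIDF replaces ENT_j
    return pos;
}
```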
According to extensive experiments, the average and standard deviation of $\theta$ for the best candidate set are 2.021 and 1.266 respectively; therefore we limit the range of $\theta$ to 1 through 5.
Analysis Later, in Section 5, we use the logistic regression model to adjust the shapelets, and (13) is very close to the model (16). Moreover, when training the model, we initialize the weight of each shapelet candidate based on $\frac{1}{TFIDF}$. In this sense, $score(\Omega_c^\theta)$ gives the shapelet selection and the shapelet adjustment a uniform criterion, which improves the accuracy. The pseudocode of the candidate selection is shown in Algorithm 2.
We use the example in Figure 5 to illustrate the candidate discovery process with $\theta = 1$. All subsequences are transformed to PAA words and organized in a hash table, as in Section 3.2. We rank the PAA words by their TFIDF scores and select candidates for each class, as in Sections 3.3 and 3.4 respectively. It can be seen that a PAA word that occurs in many classes has a smaller TFIDF value, such as "CB". A PAA word which has no cover gain, such as "CEDDE", or is similar to a selected candidate, such as "DCCCC", is ignored when selecting candidates.

Shapelet adjustment and model training
In this section, we introduce the shapelet adjustment phase. The main idea of this stage is that the true shapelets, which have better discrimination ability, may not occur in the training time series. So we adjust the shape of the shapelets by learning a logistic regression model to achieve high-quality classification.

Redefining the distance between shapelets and time series
According to (2), the distance between two sequences is not differentiable because of the minimum function, so we replace the minimum with the soft-minimum function. The distance between the $i$-th shapelet $S_i$ of length $l$ and the time series $T$ is then

$$dist(S_i, T) = \frac{\sum_{j=1}^{J} D_{i,l,j} \, e^{\alpha D_{i,l,j}}}{\sum_{j'=1}^{J} e^{\alpha D_{i,l,j'}}}, \qquad (15)$$

where $D_{i,l,j}$ is the distance between $S_i$ and the $j$-th length-$l$ window of $T$. When the parameter $\alpha \to -\infty$, the soft-minimum function is the same as the minimum function. We omit the correctness proof of the soft-minimum function. Nevertheless, considering the precision loss of floating-point numbers, we cannot assign a very small number to $\alpha$, to avoid the case that $e^{\alpha D_{i,l,j}}$ equals 0. We have found that $\alpha = -25$ is small enough and does not cause precision loss in our experiments. Therefore, $\alpha$ is kept fixed throughout all our experiments.
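A minimal sketch of the soft-minimum, assuming the standard form in which each window distance is weighted by $e^{\alpha D}$; with $\alpha = -25$ the weights concentrate almost entirely on the smallest distance.

```cpp
#include <cmath>
#include <vector>

// Soft-minimum distance between shapelet S_i and time series T.
// D[j] holds the distance between S_i and the j-th length-l window
// of T; alpha = -25 as used in the paper.
double softMin(const std::vector<double>& D, double alpha = -25.0) {
    double num = 0.0, den = 0.0;
    for (double d : D) {
        double e = std::exp(alpha * d);  // weights favor the smallest d
        num += d * e;
        den += e;
    }
    return num / den;                    // -> min(D) as alpha -> -infinity
}
```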

The learning model
We leverage a linear learning model to predict the approximate target values $\hat{Y}$:

$$\hat{Y}_c = W_0 + \sum_{i=1}^{K_c} dist(S_i, T) \cdot W_i, \qquad (16)$$

where $T$ is the time series to be classified, $W_0$ and $W_i$ are linear weights and $K_c$ is the number of shapelets for class $c$. We use the sigmoid function $S(\cdot)$ for the prediction of the target variables via a logistic regression loss.
We convert the multi-class problem into $|C|$ one-vs-rest sub-problems. Each sub-problem is a binary classification problem that discriminates one class against the others. We regard $S(\hat{Y}_c)$ for class $c$ as a confidence probability. Therefore, the class with the highest confidence probability is chosen as the predicted target class for a test time series.
The logistic loss is

$$L(Y, \hat{Y}) = -Y \ln S(\hat{Y}) - (1 - Y) \ln\left(1 - S(\hat{Y})\right),$$

where $Y$ is the true target value. Specifically, if the time series belongs to class $c$, $Y_c = 1$.
The objective function for class $c$ with regularization terms is

$$F = \sum_{i=1}^{N} L\left(Y^{(i)}, \hat{Y}^{(i)}\right) + \lambda \, W^{\top} W,$$

where $S$ is the set of shapelets for class $c$, $W = (W_0, W_1, \cdots, W_{K_c})$ is the weight vector, $\lambda$ is the L2 regularization coefficient, and $\hat{Y}^{(i)}$ and $Y^{(i)}$ are the predicted and true target values of the $i$-th time series for class $c$.
We adopt a stochastic gradient descent approach for optimization, which remedies the classification error caused by one time series at a time. The decomposed objective function $F^{(i)}$ for the $i$-th time series and class $c$ is

$$F^{(i)} = L\left(Y^{(i)}, \hat{Y}^{(i)}\right) + \frac{\lambda}{N} \, W^{\top} W.$$
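The following sketch shows one stochastic gradient step for a single one-vs-rest classifier under the reconstruction above; it updates only the linear weights, while the shapelet points are updated analogously via the chain rule through (15).

```cpp
#include <cmath>
#include <vector>

double sigmoid(double y) { return 1.0 / (1.0 + std::exp(-y)); }

// Linear model (16) for one class: Yhat = W0 + sum_i dist(S_i, T) * W_i,
// where d[i] is the soft-minimum distance of the i-th shapelet to T.
double predict(const std::vector<double>& d, const std::vector<double>& W, double W0) {
    double y = W0;
    for (size_t i = 0; i < d.size(); ++i) y += d[i] * W[i];
    return y;
}

// One SGD step on the weights for one training series (label Y in {0,1}).
// The gradient of the logistic loss w.r.t. W_i is (sigmoid(Yhat) - Y) * d[i].
void sgdStep(const std::vector<double>& d, double Y,
             std::vector<double>& W, double& W0,
             double eta, double lambda) {
    double err = sigmoid(predict(d, W, W0)) - Y;
    for (size_t i = 0; i < W.size(); ++i)
        W[i] -= eta * (err * d[i] + 2.0 * lambda * W[i]);  // L2-regularized update
    W0 -= eta * err;
}
```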

Initialization
Since the gradient descent technique cannot theoretically guarantee finding the globally optimal solution of non-linear functions, different initial shapelet candidates may greatly affect the classification accuracy. For each binary classifier, we initialize the shapelets based on the shapelet candidates discovered in Section 3, which can achieve high accuracy. We transform the $i$-th candidate of class $c$ back into a subsequence to initialize the $i$-th shapelet of class $c$. The $i$-th linear weight of class $c$ is initialized with the opposite (negative) of the candidate's TFIDF score. The reason for using the opposite number is that the greater the distance between the shapelet and the time series, the higher the probability that $S(\hat{Y}) < 0.5$.

Coarse-grained Shapelet adjustment
In our previous work ELIS [5], the shapelets are adjusted at the finest granularity, that is, each point of a shapelet can increase or decrease individually. Although the fine granularity gives us the most flexibility to refine the shapelets, it has two drawbacks. First, it may cause overfitting, which hurts the accuracy, especially on small datasets. Second, it leads to long training times, since every point of every shapelet is a potential dimension to tune.
To solve these problems, in ELIS++ we propose the mechanism of coarse-grained shapelet adjustment. In Figure 6, we illustrate the basic idea with a toy example. Figure 6a shows a shapelet candidate $S$, which is reverted from a PAA word. We amplify and show its length-12 prefix, denoted as $S[1:12]$. In Figure 6b, we show the shapelet $X$, which is obtained by adjusting $S$ at fine granularity, and the comparison between $S[1:12]$ and $X[1:12]$. It can be seen that some points decrease while others increase. In Figure 6c, we show the shapelet $Y$ after coarse-grained adjustment of $S$. Instead of adjusting each point of $S$, we group contiguous points of $S$ as a whole and adjust them together. In this example, every 4 data points are grouped. Now we present the concept of the adjustment granularity.
In the example of Figure 6, we set $g = 4$. Note that the example in Figure 6b can be seen as a special case with $g = 1$. Next, we introduce how the coarse-grained adjustment works.

$$dist(S', T(i, l)) = \frac{1}{l} \sum_{k=1}^{l/g} \sum_{r=1}^{g} \left(s'_k - t_{i + (k-1) \cdot g + r - 1}\right)^2, \qquad (21)$$

where $S' = (s'_1, \cdots, s'_{l/g})$ is the coarse-grained shapelet and each of its values represents $g$ consecutive points.
It is worth noting that although shapelet $S'$ has a different length from $S$ and $T(i, l)$, $S'$ in fact represents a length-$l$ shapelet, so we can define the Euclidean distance between $S'$ and $T(i, l)$. Obviously, the distance in (21) is differentiable. By replacing the distance function in (15) with (21), we achieve the desired coarse-grained shapelet adjustment. In other words, $g$ contiguous data points are adjusted together.
To make the learning rate $\eta$ uniform for all shapelets, we apply the parameter $g$ to all shapelets. The value of $g$ is tuned by the parameter auto-tuning approach introduced in Section 5.2.
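A sketch of the coarse-grained distance under the reconstruction of (21): each value of the coarse shapelet is compared against $g$ consecutive points of the window.

```cpp
#include <vector>

// Euclidean distance between a coarse-grained shapelet S' (length l/g,
// one value per group of g points) and window T(i, l), following (21).
double coarseDist(const std::vector<double>& Sg,   // coarse shapelet, length l/g
                  const std::vector<double>& T, int i, int g) {
    int l = static_cast<int>(Sg.size()) * g;        // length of the represented shapelet
    double sum = 0.0;
    for (size_t k = 0; k < Sg.size(); ++k)
        for (int r = 0; r < g; ++r) {
            double d = Sg[k] - T[i + k * g + r];    // i is 0-based here
            sum += d * d;
        }
    return sum / l;                                 // length-normalized
}
```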

Accelerate the learning by parallel computation
At the candidate discovery stage, the time complexity of PAA word generation and TFIDF score calculation is $O(|C| N n^2 \log(N n^2))$ and the time complexity of candidate selection is $O(|C| K N n^2)$, where $|C|$ is the number of classes, $K$ is the size of the candidate set, $N$ is the number of time series and $n$ is the length of a time series. The shapelet adjustment stage has a total time complexity of $O(|C| I K N n^2)$, where $I$ is the number of iteration rounds.
The shapelet adjustment phase is the more time consuming one. In detail, the most expensive parts are the calculation of the distances between subsequences and of the gradients. Specifically, we use for loops many times to calculate the distance at each matching position of two sequences in (15). Also, when we train the logistic regression classifiers, the gradient at each position of each shapelet (line 19 in Algorithm 3) also needs to be calculated by a for loop.
In ELIS++, we accelerate these computations by parallel computation [26]. We utilize the ubiquitous Single Instruction Multiple Data (SIMD) architecture, which applies the same instructions to multiple data elements at once. Any software package or hardware supporting SIMD can be used to accelerate our approach. Concretely, for the distance calculation, we compute the per-position differences of the two matched sequences simultaneously: when we calculate the distance between $S = (s_1, s_2, \cdots, s_l)$ and $T(j, l) = (t_j, t_{j+1}, \cdots, t_{j+l-1})$, all terms $(s_i - t_{j+i-1})^2$ are computed in parallel. For the gradient calculation, we can simultaneously compute the gradients of all positions of each shapelet. That means that when we use gradient descent to train the logistic regression classifiers, we calculate $\frac{\partial F}{\partial S_c[k]}$ for every position of the shapelet at the same time. We implement our approach with Eigen [8], which makes our approach over 8 times faster in the training stage and over 19 times faster in the testing stage than the naive implementation. We show the experimental results in Section 6.
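As a minimal sketch of the idea using Eigen, the inner per-point loop can be replaced by one vectorized expression, which Eigen compiles to SIMD instructions; the actual code organization in ELIS++ may differ.

```cpp
#include <Eigen/Dense>

// Length-normalized squared distance between shapelet s and every
// length-l window of t, vectorized by Eigen.
Eigen::ArrayXd windowDistances(const Eigen::ArrayXd& s, const Eigen::ArrayXd& t) {
    const Eigen::Index l = s.size(), J = t.size() - l + 1;
    Eigen::ArrayXd D(J);
    for (Eigen::Index j = 0; j < J; ++j)
        // one vectorized expression instead of an inner per-point loop
        D(j) = (s - t.segment(j, l)).square().mean();
    return D;
}
```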

Data augmentation and parameter auto-tuning
In this section, we introduce our approaches for data augmentation and parameter auto-tuning.

Data augmentation
When the size of the training set is small, ELIS may not work well. The reason is that the training time series are covered by a small number of shapelets which already achieve perfect purity, so some "true" shapelets will be missed due to TFIDF score ties. Therefore, we propose a data augmentation approach to improve the quality of the training data before the candidate discovery.
Since we only focus on the best matching position of a shapelet, it is reasonable to rotate a suffix subsequence of any length to the front to generate a new time series. The rotation operation tends to select shorter candidates, which benefits generalization. Moreover, we add some noise obeying a normal distribution to the generated time series to make them different from the original ones. Specifically, the data augmentation consists of two operations:

- Rotation. We choose a cut point $\tau$ with uniform probability on time series $T$. Then we slice $T$ into two subsequences $T[1:\tau]$ and $T[\tau+1:n]$. Finally, they are concatenated to generate a new time series $T' = (t_{\tau+1}, t_{\tau+2}, \cdots, t_n, t_1, t_2, \cdots, t_\tau)$.
- Noise insertion. We generate a length-$n$ series of noise whose values follow a normal distribution and add it to $T'$.
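The two operations can be sketched as follows; the noise scale sigma is a hypothetical default, as the text only states that the noise follows a normal distribution.

```cpp
#include <random>
#include <vector>

// Generate one augmented series from T: rotate at a uniformly chosen
// cut point, then add Gaussian noise.
std::vector<double> augment(const std::vector<double>& T, std::mt19937& rng,
                            double sigma = 0.01) {
    size_t n = T.size();
    std::uniform_int_distribution<size_t> cut(1, n - 1);
    std::normal_distribution<double> noise(0.0, sigma);
    size_t tau = cut(rng);
    std::vector<double> out;
    out.reserve(n);
    // rotation: suffix T[tau+1 : n] moved to the front, prefix appended
    for (size_t i = tau; i < n; ++i) out.push_back(T[i]);
    for (size_t i = 0; i < tau; ++i) out.push_back(T[i]);
    for (double& v : out) v += noise(rng);  // noise insertion
    return out;
}
```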
Data augmentation helps us overcome the overfitting problem. We use a parameter, the data augmentation amount, to control how many time series are generated from each time series. That is, given a dataset $D$ of size $N$, after the data augmentation the dataset contains the augmentation amount times $N$ time series.
An unexpected benefit is that data generation takes over the role of regularization. That means a fixed small regularization, such as 0.001, is enough for the shapelet adjustment stage in most instances. Since the augmentation amount is an integer searched in a small range while λ is a float searched in a large range, this makes the parameter tuning much easier. We compare our approach with and without data augmentation in Section 6 to show the effectiveness of data augmentation.

Bayesian optimization based parameter tuning
Our approach has several parameters, including the data augmentation amount, the regularization λ, the learning rate η, the iteration rounds I and the adjustment granularity g. Tuning them manually is a tedious task. To solve this problem, we propose a Bayesian Optimization based approach to determine the values of these parameters automatically. Bayesian Optimization is a popular technique for optimizing objective functions that are costly to evaluate [2,24]. To ease the description, we denote the set of parameters as a vector $H$, with one dimension per parameter. Initially, we define a value range for each parameter, which forms an $|H|$-dimensional search space $D_H$.
Before introducing the approach, we first describe the relationships among $H$, the classification model $M$ and the accuracy $y$. For each parameter vector $H$, we can learn a model $M$, which is composed of a set of shapelets $S$ and weights $W$. Then we can apply model $M$ to dataset $D$ and obtain the accuracy $y$. So the optimization goal is to find an optimal vector $H^*$ which achieves the highest value of $y$.
The Bayesian Optimization approach considers the vector H as the input and the model accuracy y as the output, which can be modeled by certain regression model, such as linear regression or Gaussian process regression. In this paper, we utilize the latter due to its popularity.
We learn the optimal $H^*$ with an iterative process, which consists of two phases. The pseudocode is shown in Algorithm 4. In the first phase, we randomly sample $N_1$ parameter vectors $H$. For each $H_i$ ($1 \le i \le N_1$), the model $M_i$ and the corresponding accuracy $y_i$ are computed (lines 4-7). After that, we maintain a candidate set of size $N_1$, denoted as $R_H$, each element of which is a triplet $(H_i, M_i, y_i)$.
In the second phase, we generate another $N_2$ candidates. Different from the first phase, we learn a model of the relationship between the parameter vector and the accuracy based on $R_H$, and use this model to find the best parameters. Specifically, in the $i$-th round, we first learn the Gaussian process regression model $M_G$ based on the current $R_H$ (line 9). Then, based on $M_G$, we compute the optimal parameter vector $H_i$ with the acquisition function (line 10). The acquisition function $F_a$ balances exploring new areas of the search space against exploiting areas that are already known to have favorable values. In this paper, we use the Upper Confidence Bound (UCB) [25] as the acquisition function. After $H_i$ is obtained, we compute the corresponding model $M_i$ and accuracy $y_i$, and add the triplet $(H_i, M_i, y_i)$ to $R_H$.
As more and more triplets are added to $R_H$, the learned $M_G$ gets better, which in turn generates better models $M$. After the second phase, we have $N_1 + N_2$ triplets in $R_H$ in total. We select the optimal $H$ and $M$ from it as the final result, denoted as $H^*$ and $M^*$. In this paper, we set $N_1$ to 20 and $N_2$ to 50 by default.
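A structural sketch of the two-phase search (cf. Algorithm 4). The Surrogate type is a placeholder for the Gaussian process regression model with the UCB acquisition; trainAndEvaluate stands for training ELIS++ with a parameter vector and returning its accuracy, and the five-dimensional unit-cube search space is a simplification of $D_H$.

```cpp
#include <random>
#include <vector>

struct Triplet { std::vector<double> H; double y; };  // (parameters, accuracy)

// Placeholder surrogate; a real implementation would fit a Gaussian
// process to (H, y) pairs and maximize the UCB acquisition function.
struct Surrogate {
    void fit(const std::vector<Triplet>&) { /* fit GP regression here */ }
    std::vector<double> argmaxUCB(std::mt19937& rng) {
        // stand-in: a random point; a real search maximizes
        // mean(H) + kappa * stddev(H) over the search space D_H
        std::uniform_real_distribution<double> u(0.0, 1.0);
        return {u(rng), u(rng), u(rng), u(rng), u(rng)};
    }
};

// Two-phase parameter search: N1 random samples, then N2 model-guided ones.
template <typename Eval>
Triplet tuneParameters(Eval trainAndEvaluate, int N1 = 20, int N2 = 50) {
    std::mt19937 rng(42);
    std::uniform_real_distribution<double> u(0.0, 1.0);
    std::vector<Triplet> R;
    for (int i = 0; i < N1; ++i) {                 // phase 1: random sampling
        std::vector<double> H = {u(rng), u(rng), u(rng), u(rng), u(rng)};
        R.push_back({H, trainAndEvaluate(H)});
    }
    Surrogate gp;
    for (int i = 0; i < N2; ++i) {                 // phase 2: model-guided
        gp.fit(R);
        std::vector<double> H = gp.argmaxUCB(rng);
        R.push_back({H, trainAndEvaluate(H)});
    }
    Triplet best = R[0];                           // return the best triplet
    for (const auto& t : R) if (t.y > best.y) best = t;
    return best;
}
```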

Experimental results
In this section, we verify the effectiveness and efficiency of our approach.

Datasets and baselines
We conduct experiments on the UCR archive [3], a popular benchmark for most time series classification works. We randomly select 35 datasets from the benchmark for the comparative experiments; their details are shown in Table 1. We compare our approach with 7 baselines of 4 different types.

- Distance-based approaches. Among the 1NN approaches, we select the best one, 1NN-cDTW [4], which uses the cDTW distance as the similarity measurement. Moreover, we select Proximity Forest [17] (Pforest for short), a novel random forest based approach. It uses diverse distances between time series to construct each tree in the forest.
- Shapelet-based approaches. We select two traditional shapelet-based approaches and one shapelet learning approach. The first is the shapelet decision tree (NSD) [30], which uses information gain to measure the shapelet quality at each decision tree node. The second is Fast Shapelet (FSH) [20], which exploits a fast random projection technique on the SAX representation instead of searching all subsequences, to improve the efficiency. The learning shapelet approach (LTS) [7] first initializes a fixed number of shapelets and then uses a logistic regression model to optimize them.
- Structure-based approach. Of this type, we select the best structure-based approach, SAX-VFSEQL [19], which applies the SAX technique to discretize all subsequences and discovers the most discriminative subsequences by training SEQL [11,12].
- Deep learning approach. We select mWDN [27] as the baseline because it is the state-of-the-art deep learning approach for time series classification. It uses multi-level wavelet decomposition to obtain features and deep residual networks [9] as the classification model. We use the authors' default wavelet function and parameter values.
We do not compare with the ensemble-based approach HIVE-COTE [16], which focuses on how to combine multiple classifiers to improve the accuracy. It is orthogonal to our approach.

Experiment setting
All experiments are conducted on an Ubuntu PC with an Intel(R) Core(TM) i7-6700K CPU (8-core, 4.00GHz), 32GB memory and 2TB disk space. The implementation of our approach is written in C++. The code of most baselines is publicly available, and we use the public code. Since we cannot find public code for LTS [7], we implemented it in C++ ourselves.
There are 5 parameters in our approach. Instead of tuning them manually one by one, we use the Bayesian Optimization based approach to learn the optimal values automatically, so we only need to give a range for each parameter. The learning rate is searched in the range $\eta \in [0.001, 1]$ and the regularization in $\lambda \in [0.001, 1]$. The iteration rounds vary from 5 to 500, that is, $I \in \{5, \cdots, 500\}$. The augmentation amount is searched from 1 to 5, where 1 represents no data augmentation and 5 represents generating 4 additional time series for each time series. The adjustment granularity $g$ is an integer within $[1, 8]$. We show the tuned parameter values for each dataset and the number of shapelets for each class in Table 1. The source code and the parameter values of our approach are made publicly available.

Accuracy
In the first experiment, we compare the accuracy of all approaches on the 35 datasets; the results are shown in Table 2. The best approach for each dataset is highlighted in bold. Moreover, we also present the average accuracy and the number of wins for each approach.
The superiority of our algorithm is obvious. ELIS++ achieves the highest accuracy on 17 datasets, and the average accuracy is also the highest. The results verify that our approach obtains high-quality shapelets, which can distinguish the classes well.
Among all the baselines, LTS is the most competitive one, winning on 7 datasets. The results thus demonstrate the advantage of shapelet learning approaches. mWDN and Pforest achieve the highest accuracy on 4 and 5 datasets respectively.
In essence, mWDN uses the wavelet bases as features and builds a network model on top of them. Considering a wavelet basis as a subsequence, mWDN can also be regarded as a "shapelet" based approach. Although network models have good learning performance in many applications, they have some limitations for the time series classification problem.
First, each wavelet basis can only simulate a specific subsequence shape, and is only suitable for time series datasets in which the shapes of the shapelets are similar to this wavelet basis. It may not work well if the wavelet basis is totally different from the discriminative shapelets.
Second, since the wavelet decomposition is lossless, theoretically, if all wavelet coefficients are used as features to build the deep learning network, the mWDN approach can model any time series perfectly. However, the large number of features incurs a large network, which needs a large number of training time series to make the learning process converge; this is not the case for many time series datasets. Moreover, it is difficult to determine the optimal network structure for an arbitrary training dataset.
In contrast, ELIS++ can learn shapelets of arbitrary shape, and it is better able to learn the shapelets from a small number of training time series.

Efficiency
In this experiment, we compare the efficiency of the shapelet learning approaches, LTS and ELIS++, because they are more accurate than the other baselines. In order to verify the efficiency improvement of ELIS++, we also compare the efficiency of ELIS. Figure 7 shows the average training and testing time over the 35 datasets for all three approaches. Moreover, we compare the parameter tuning efficiency of ELIS and ELIS++ (the "tune" bar). ELIS++ tunes the parameters with the Bayesian Optimization based approach, while ELIS uses a brute-force approach to find the best combination of parameters in the given ranges [27].
In Table 3, we show the training time of ELIS++ (with and without SIMD optimization) and ELIS on each dataset. ELIS++ is 373 times faster in the training phase and 15.7 times faster in the testing phase than LTS. ELIS is 37 times faster in the training phase and 1.2 times faster in the testing phase than LTS. One reason why ELIS++ is more efficient is that initializing the shapelets with our candidate discovery approach produces a smaller shapelet set than both LTS and ELIS. Moreover, Table 3 shows that ELIS++ with SIMD is about 10 times faster in the training phase than ELIS. It means that our approach has the capability to improve efficiency with SIMD, even when only the CPU is used.
To find the best parameters, ELIS++ trains 70 models, as introduced in Section 5.2. ELIS has two parameters to tune, the regularization λ and the learning rate η. For each parameter, ELIS tries 3 possible values, so ELIS trains 9 models in total. Although ELIS++ trains many more models than ELIS, it is still slightly faster than ELIS, which verifies the effectiveness of the SIMD-based parallel computation.

ELIS++ vs. ELIS
In this experiment, we compare ELIS++ with ELIS in terms of accuracy and training time; the results are shown in Table 3. ELIS++ achieves higher accuracy on all 35 datasets, especially on the small training datasets, such as "BeetleFly", "BirdChicken", "DiatomSizeReduction" and so on. It can be seen in Table 1 that on these datasets the data augmentation amount is larger than 1, which verifies the effectiveness of the data augmentation mechanism. We can also find some datasets, like "Ham" and "Herring", which satisfy 1) no data augmentation (the amount equals 1) and 2) a coarse-grained adjustment granularity (g > 1), on which ELIS++ is more accurate than ELIS. These results demonstrate that the coarse-grained adjustment can alleviate the overfitting.
As for the efficiency, when SIMD-based parallel computation is utilized, ELIS++ is clearly faster than ELIS. One interesting phenomenon is that, on 16 out of the 35 datasets, ELIS++ without SIMD still has better efficiency, although more training time series are used due to the data augmentation. The major reason is that, due to the entropy-based shapelet candidate selection, ELIS++ generates better shapelet candidates, which makes the shapelet adjustment phase converge faster.

Related work
The time series classification works can be divided into four major categories.
The first category is the distance-based approach. It computes the similarity between time series based on the raw numeric data. The traditional approaches use the Euclidean distance [29] or the DTW distance [21] to build a 1NN classification model. They are easy to implement and understand, but the accuracy is lower than that of other types of approaches. To improve the accuracy, some novel distances have been proposed [4,31]. These approaches still utilize 1NN as the classification model, but with different similarity definitions. Also, Benjamin et al. propose Proximity Forest [17], which uses a random forest model to combine multiple similarity measurements. Recently, distance-based semi-supervised learning has gradually attracted the attention of researchers [13].
The second category, the structure-based approach, uses high-level time series representations as features to describe the characteristics of classes. SAX-VFSEQL [19] and SAX-VSM [23] build classifiers based on the SAX representation [14]. BOSS VS [22] transforms time series from the time domain to the frequency domain. Since the structure-based approaches may lose some detailed information of the time series, they suffer from lower accuracy.
The third category comprises the deep learning based approaches [6,27], which construct complex network models. They are usually end-to-end algorithms. They typically require a large amount of labelled data, an appropriate objective function, an effective network structure and good parameters to achieve an extraordinary fitting effect for classification; the capability of a neural network model is highly related to the size of the training dataset. Some deep learning models, like CNN and ResNet [6], have been applied to time series classification. However, interpretability is the weak point of DL-based approaches, and reasonable hyper-parameter tuning methods have rarely been discussed in former DL-based works. mWDN [27] is one of the state-of-the-art approaches in time series classification. It constructs a model that automatically extracts time-frequency characteristics of time series from the perspective of digital signal processing, and achieves a good classification effect through a deep learning model. But its fundamental wave's shape needs to be specified manually, which is its weakness.
The fourth category is the shapelet-based approach [7,18,20,30]. The concept of the "shapelet" was proposed in [30]. It first extracts some typical subsequences and then uses them to build the classification model. Other metrics have also been utilized to evaluate the discrimination power of a subsequence, such as F-Stats [15], Kruskal-Wallis and Mood's median [10]. To find shapelet candidates efficiently, Fast Shapelet [20] transforms time series subsequences to SAX words to estimate the frequency of the subsequences.
Unlike other shapelet-based approaches, LTS [7] learns the discriminative shapelets instead of searching for them in the raw numeric data. So LTS can generate shapelets that do not appear in the training time series, which can improve the accuracy significantly. However, the training phase is computationally intensive and many hyper-parameters, such as the shapelet number and the shapelet length, need to be tuned manually. Moreover, LTS generates too many shapelets, which weakens the interpretability. SAX-VFSEQL [19], a novel shapelet-learning based approach, generates a pool of SAX words as the shapelets, and iteratively optimizes the weights of the shapelets. However, this approach still cannot generate shapelets which are absent from the training data. Also, the discovered shapelets are numerous and redundant, which makes interpreting the classification decision hard.
The ensemble-based approach [16] has also been proposed; it focuses on how to combine multiple classifiers to improve the accuracy, and is orthogonal to our approach.

Conclusion
In this paper, we introduce the shapelet learning approach ELIS++. ELIS++ first discovers shapelet candidates in the PAA space in the candidate generation phase, and then applies the logistic regression model to adjust the shapelets in the model training phase, by which we utilize the advantages of both structure-based and shapelet-based approaches. To improve the accuracy on small training datasets, we propose a data augmentation approach which avoids missing some "true" shapelets. Moreover, we propose a Bayesian Optimization based approach to set the parameters automatically. The extensive experimental results show that ELIS++ achieves higher accuracy and efficiency. In the future, we aim to improve ELIS++ in two directions. First, we will try different learning models in the model training phase, especially deep learning based models. Second, we hope to extend ELIS++ to multi-dimensional time series classification.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.