1 Introduction

Time series classification is an important topic in time series analysis and mining. A plethora of classifiers have been developed for it [1, 2], e.g., decision trees, nearest neighbor (1NN), naive Bayes, Bayesian networks, random forests, support vector machines, rotation forests, etc. However, recent empirical evidence [3,4,5] strongly suggests that the simple 1NN classifier employing a generic time series similarity measure, with the merits of robustness, high accuracy, and being parameter-free, is exceptionally difficult to beat. Moreover, owing to the high precision of the dynamic time warping distance (DTW), the 1NN classifier based on DTW has been found to outperform an exhaustive list of alternatives [5], including decision trees, multi-scale histograms, multi-layer perceptron neural networks, first-order logic rules with boosting, as well as 1NN classifiers based on many other similarity measures. However, the computational complexity of DTW is quadratic in the time series length, i.e., O(n^2), and the 1NN classifier has to search the entire dataset to classify an object. As a result, the 1NN classifier based on DTW is inefficient for high-dimensional time series. To address this problem, researchers have proposed to compute DTW in an alternative piecewise approximation space (PA-DTW) [6,7,8,9], which transforms the raw data into a segmentation-based feature space and extracts discriminatory, low-dimensional features for similarity measurement. If the original time series of length n is segmented into N (N ≪ n) subsequences, the computational complexity of PA-DTW reduces to O(N^2).

Many piecewise approximation methods have been proposed so far, e.g., piecewise aggregate approximation (PAA) [6], piecewise linear approximation (PLA) [7, 10], adaptive piecewise constant approximation (APCA) [8], derivative time series segment approximation (DSA) [9], piecewise cloud approximation (PWCA) [11], etc. The most prominent merit of piecewise approximation is its ability to capture the local characteristics of time series. However, most of the existing piecewise approximation methods need a fixed segment length, which is hard to predefine for different kinds of time series, and they focus on simple statistical features, which only capture the aggregation characteristics of time series. For example, PAA and APCA extract mean values, PLA extracts linear fitting slopes, and DSA extracts the mean values of the derivative subsequences. If PA-DTW is computed on these methods, its precision suffers accordingly.
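To make the flavor of such mean-value features concrete, the following minimal sketch (in Python with NumPy; the function name paa and the equal-length splitting via np.array_split are our own illustrative choices, not code from the cited works) computes a PAA-style representation:

```python
import numpy as np

def paa(series: np.ndarray, n_segments: int) -> np.ndarray:
    """Piecewise aggregate approximation: the mean of each segment."""
    # Split into n_segments roughly equal-length chunks and average each.
    return np.array([chunk.mean() for chunk in np.array_split(series, n_segments)])

t = np.sin(np.linspace(0, 4 * np.pi, 128))
print(paa(t, 8))  # 8 mean-value features approximating 128 samples
```

A single mean per segment is exactly the aggregation-only behavior criticized above: two segments with identical means but different fluctuations become indistinguishable.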

In this paper, we propose a novel piecewise factorization model for time series, named piecewise Chebyshev approximation (PCHA), in which a novel code-based segmentation method adaptively segments time series. Rather than focusing on statistical features, we factorize the subsequences with Chebyshev polynomials and employ the Chebyshev coefficients as features to approximate the raw data. Besides, the PA-DTW based on PCHA (ChebyDTW) is proposed for 1NN classification. Since Chebyshev polynomials of different degrees represent different fluctuation components of time series, the ChebyDTW measure can capture local fluctuation information. Comprehensive experimental results show that ChebyDTW supports accurate and fast 1NN classification.

The structure of this paper is as follows: related work on data representation and similarity measures for time series is reviewed in Sect. 2; Sect. 3 presents the proposed methodology framework; the details of PCHA are given in Sect. 4; Sect. 5 describes the ChebyDTW measure; Sect. 6 provides the comprehensive experimental results and analysis; Sect. 7 concludes this paper.

2 Related Work

2.1 Data Representation

In many application fields, the high dimensionality of time series has limited the performance of a myriad of algorithms. To address this problem, a great number of data representation methods have been proposed to reduce the dimensionality of time series [1, 2]. Among them, piecewise approximation methods are prevalent for their simplicity and effectiveness. The first attempt is the PAA representation [6], which segments time series into equal-length subsequences and extracts the mean values of the subsequences as features to approximate the raw data. However, this single sort of feature only indicates the height of each subsequence, which may cause loss of local information. Subsequently, an adaptive version of PAA, named adaptive piecewise constant approximation (APCA) [8], was proposed, which segments time series into subsequences of adaptive lengths and can thus approximate time series with less error. As well, a multi-resolution version of PAA, named MPAA [12], was proposed, which iteratively segments time series into 2^i subsequences. However, both variations inherit the poor expressivity of PAA. Another pioneering piecewise representation is PLA [7, 10], which extracts the linear fitting slopes of the subsequences as features to approximate the raw data. However, the fitting slopes only reflect the movement trends of the subsequences; for time series that fluctuate sharply at high frequency, the dimension reduction effect of PLA is not prominent. In addition, two novel piecewise approximation methods were proposed recently. One is the DSA representation [9], which takes the mean values of the derivative subsequences of time series as features; however, it is sensitive to the small fluctuations caused by noise. The other is the PWCA representation [11], which employs cloud models to fit the data distribution of subsequences; however, the extracted features only reflect data distribution characteristics and cannot capture the fluctuation information of time series.

2.2 Similarity Measure

DTW [1, 2, 5] is one of the most prevalent similarity measures for time series; it is computed by realigning the indices of time series. It is robust to time warping and phase shift, and has high measure precision. However, it is computed by a dynamic programming algorithm and thus has an expensive O(n^2) computational complexity, which largely limits its application to high-dimensional time series [13]. To overcome this shortcoming, the PA-DTW measures were proposed. The PAA-based PDTW [14] and the PLA-based SDTW [10] are the early pioneers, and the DSA-based DSADTW [9] is the state-of-the-art method. Rather than in the raw data space, they compute DTW in the PAA, PLA, and DSA spaces respectively. Since the segment numbers are much smaller than the original time series lengths, the PA-DTW methods can greatly decrease the computational complexity of the original DTW. Nonetheless, the precision of PA-DTW greatly depends on the underlying piecewise approximation method, where both the segmentation method and the extracted features are crucial factors. As a result, given the weaknesses of the existing piecewise approximation methods, the existing PA-DTWs cannot achieve high precision. In our proposed ChebyDTW, a novel adaptive segmentation method and the Chebyshev factorization are used, which overcome the drawback of fixed segmentation and can capture the fluctuation information of time series for similarity measurement.

3 Methodology Framework

Figure 1 shows the framework of the methods proposed in this paper, which consists of two parts:

Fig. 1. The framework of the proposed methods.

(a) Piecewise Chebyshev approximation (PCHA). The time series is first coded into a binary sequence and then segmented into subsequences of adaptive lengths by matching the turning patterns. After that, the subsequences are factorized with the Chebyshev polynomials and projected into the Chebyshev factorization domain, where the Chebyshev coefficients are extracted as features to approximate the raw data.

(b) ChebyDTW computation. DTW is computed in the Chebyshev factorization domain. Concretely, in the dynamic programming computation of DTW, subsequence matching over the Chebyshev features is taken as the subroutine, for which the squared Euclidean distance can be employed.

4 Piecewise Factorization

Without loss of generality, the relevant definitions are first given as follows.

Definition 1.

(Time Series): The sample sequence of a variable X over n contiguous time moments is called a time series, denoted as T = {t_1, t_2, …, t_i, …, t_n}, where t_i ∈ R denotes the sample value of X at the i-th moment, and n is the length of T.

Definition 2.

(Subsequence): Given a time series T = {t_1, t_2, …, t_i, …, t_n}, the subset S of T that consists of the continuous samples {t_{i+1}, t_{i+2}, …, t_{i+l}}, where 0 ≤ i ≤ n − l and 0 ≤ l ≤ n, is called a subsequence of T.

Definition 3.

(Piecewise Approximation): Given a time series T = {t_1, t_2, …, t_i, …, t_n}, which is segmented into the subsequence set S = {S_1, S_2, …, S_j, …, S_N}, if ∃ f: S_j → V_j = [v_1, …, v_m] ∈ R^m, the set V = {V_1, V_2, …, V_j, …, V_N} is called the piecewise approximation of T.

4.1 Adaptive Segmentation

Inspired by Marr's theory of vision [15], we regard the turning points, where the trend of the time series changes, as a good choice for segmenting time series. However, practical time series are mixed with a mass of noise, which results in many trivial turning points with small fluctuations. This problem can be simply addressed by the efficient moving average (MA) smoothing method [16].
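As a reference point, the following is a minimal MA sketch; interpreting the smooth degree sd as the window length is our assumption, and the exact formulation in [16] may differ:

```python
import numpy as np

def moving_average(series: np.ndarray, sd: int) -> np.ndarray:
    """Smooth a series with a simple moving average of window length sd.

    Assumption: sd (the smooth degree of Sect. 4.1) is the window length.
    """
    kernel = np.ones(sd) / sd
    # mode="same" keeps the output the same length as the input
    return np.convolve(series, kernel, mode="same")
```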

In order to recognize the significant turning points, we first exhaustively enumerate the location relationships of three adjacent samples t_1–t_3 with their mean μ in a time series, as shown in Fig. 2. Six basic cell codes can be defined as in Fig. 2(a), each composed of the binary codes δ_1–δ_3 of t_1–t_3 and denoted as Φ(t_1, t_2, t_3) = (δ_1δ_2δ_3)_b. Six special relationships, in which one of t_1–t_3 equals μ, are encoded as in Fig. 2(b).

Fig. 2. Three adjacent samples with the basic cell codes of (a) basic relationships, and (b) specific relationships.

Based on the cell codes, all the minimum turning patterns (composed of two cell codes) at the turning points can be enumerated as in Fig. 3. Note that the basic cell codes 010 and 101 are turning patterns by themselves. Then, we employ a sliding window of length 3 to scan the time series and encode the samples within each window according to Fig. 2. In this process, all the significant turning points can be found by matching against Fig. 3, with which the time series can be segmented into subsequences of adaptive lengths; a simplified sketch follows Fig. 3.

Fig. 3. The minimum turning patterns composed of two cell codes.
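The following simplified sketch illustrates the scan-and-match idea. It encodes each length-3 window against its mean and flags only the single-code patterns 010 and 101 (local peak/valley); the full two-code matching of Fig. 3 and the equal-to-μ codes of Fig. 2(b) are omitted, so this is an assumption-laden approximation of the method rather than a faithful implementation:

```python
import numpy as np

def cell_code(w: np.ndarray) -> str:
    """Encode three adjacent samples against their mean (cf. Fig. 2).

    Assumption: delta_i = '1' if the sample lies strictly above the window
    mean, else '0'; the special equal-to-mean codes are folded into '0'.
    """
    mu = w.mean()
    return "".join("1" if v > mu else "0" for v in w)

def turning_points(series: np.ndarray) -> list:
    """Scan with a length-3 sliding window and flag turning points."""
    points = []
    for i in range(len(series) - 2):
        # Codes 010 and 101 are turning patterns by themselves (Sect. 4.1)
        if cell_code(series[i:i + 3]) in ("010", "101"):
            points.append(i + 1)  # the middle sample is the turning point
    return points

t = np.sin(np.linspace(0, 4 * np.pi, 100))
print(turning_points(t))  # indices near the peaks and valleys
```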

However, the above segmentation is not perfect. Although the trivial turning points can be removed by MA smoothing, "singular" turning patterns may remain, i.e., turning patterns that appear very close to each other. As shown in Fig. 4, a Cricket time series from the UCR time series archive [17] is segmented by the turning patterns (dashed lines), where the raw data is first smoothed with smooth degree 10 (sd = 10).

Fig. 4. Segmentation of the Cricket time series (sd = 10).

Obviously, the dashed lines segment the time series significantly, but the two black dashed lines are so close that the segment between them can be ignored. In view of this, we introduce a segment threshold ρ that stipulates the minimum segment length. This parameter can be set as a ratio of the time series length. Since time series from a specific field exhibit the same fluctuation characteristics, ρ is data-adaptive and can be learned from the labeled dataset. Nevertheless, the segmentation is still primarily established on the recognition of turning patterns, which determines the segment number and lengths adaptively, and is essentially different from the principles of the existing segmentation methods.
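A minimal sketch of how such a threshold could be enforced, assuming a greedy left-to-right filter (the text does not specify the filtering rule, so the function enforce_min_length and its greedy strategy are illustrative assumptions):

```python
def enforce_min_length(points: list, n: int, rho: float) -> list:
    """Drop turning points that would create segments shorter than rho * n.

    points: candidate turning-point indices in ascending order;
    n: time series length; rho: segment threshold as a length ratio.
    """
    min_len = max(1, int(rho * n))
    kept, last = [], 0
    for p in points:
        if p - last >= min_len:  # keep only sufficiently long segments
            kept.append(p)
            last = p
    return kept
```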

4.2 Chebyshev Factorization

To begin with, the obtained subsequences are z-normalized as a pre-processing step. Rather than focusing on statistical features, PCHA factorizes each subsequence with Chebyshev polynomials of the first kind and takes the Chebyshev coefficients as features. Since Chebyshev polynomials of different degrees represent different fluctuation components, the local fluctuation information of time series can be captured by PCHA.

Chebyshev polynomials of the first kind are derived from the trigonometric identity T_n(cos(θ)) = cos(nθ), which can be rewritten as a polynomial in the variable t with degree n, as in Formula (1).

$$ T_{n}(t) = \begin{cases} \cos (n\cos^{-1}(t)), & t \in [-1,\,1] \\ \cosh (n\cosh^{-1}(t)), & t \ge 1 \\ (-1)^{n}\cosh (n\cosh^{-1}(-t)), & t \le -1 \end{cases} $$
(1)

For the sake of consistent approximation, we only employ the first sub-expression, defined over the interval [−1, 1], to factorize the subsequences. With the Chebyshev polynomials, a function F(t) can be factorized as in Formula (2).

$$ F(t) \cong \sum\limits_{i = 0}^{n} {c_{i} T_{i} (t)} $$
(2)

The approximation is exact if F(t) is a polynomial of degree less than or equal to n. The coefficients c_i can be calculated from the Gauss–Chebyshev formula (3), where k is 1 for c_0 and 2 for the other c_i, and t_j is one of the n roots of T_n(t), which can be obtained from the formula t_j = cos[(j − 0.5)π/n].

$$ c_{i} = \frac{k}{n}\sum\limits_{j = 1}^{n} {F(t_{j} )T_{i} (t_{j} )} $$
(3)

However, the employed Chebyshev polynomials are defined over the interval [−1, 1]. If the subsequences are to be factorized with this "interval function", they must be scaled into the time interval [−1, 1]. Besides, the Chebyshev polynomials are defined everywhere on the interval, whereas a time series is a discrete function whose values are defined only at the sample moments. To compute the Chebyshev coefficients, we therefore process each subsequence with the method proposed in [18], which extends a time series into an interval function. Given a scaled subsequence S = {(v_1, t_1), …, (v_m, t_m)}, where −1 ≤ t_1 < … < t_m ≤ 1, we first divide the interval [−1, 1] into m disjoint subintervals as follows:

$$ I_{i} = \begin{cases} [-1,\,\frac{t_{1} + t_{2}}{2}), & i = 1 \\ [\frac{t_{i-1} + t_{i}}{2},\,\frac{t_{i} + t_{i+1}}{2}), & 2 \le i \le m - 1 \\ [\frac{t_{m-1} + t_{m}}{2},\,1], & i = m \end{cases} $$

Then, the original subsequence can be extended into a step function as in Formula (4), where each interval [t_i, t_{i+1}] is divided at the mid-point (t_i + t_{i+1})/2: the first half takes the value v_i, and the second half takes v_{i+1}.

$$ F(t) = v_{i} , \, t \in I_{i} , \, 1 \le i \le m $$
(4)

After the above processing, the Chebyshev coefficients c_i can be computed. For the sake of dimension reduction, we only take the first several coefficients, which reflect the principal fluctuation components of the time series, to approximate the raw data.
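Putting Formulas (3) and (4) together, the following sketch computes the PCHA features of one z-normalized subsequence. The function name pcha_features and the quadrature order n = max(d, m) are our own choices for the sketch; the text does not fix them:

```python
import numpy as np

def pcha_features(sub: np.ndarray, d: int = 4) -> np.ndarray:
    """Sketch of PCHA featurization for one (z-normalized) subsequence.

    Steps: scale the sample moments into [-1, 1], extend the samples into
    the step function of Formula (4), then take the first d Gauss-Chebyshev
    coefficients via Formula (3). n = max(d, len(sub)) is an assumption.
    """
    m = len(sub)
    t = np.linspace(-1.0, 1.0, m)                 # scaled sample moments
    n = max(d, m)                                 # quadrature order (assumed)
    roots = np.cos((np.arange(1, n + 1) - 0.5) * np.pi / n)  # roots of T_n

    # Step-function extension: F(root) = value of the subinterval I_i
    # containing the root, with the subintervals split at the mid-points.
    edges = np.concatenate(([-1.0], (t[:-1] + t[1:]) / 2, [1.0]))
    idx = np.clip(np.searchsorted(edges, roots, side="right") - 1, 0, m - 1)
    F = sub[idx]

    coeffs = np.empty(d)
    for i in range(d):
        k = 1.0 if i == 0 else 2.0                # k of Formula (3)
        # T_i(t) = cos(i * arccos(t)) on [-1, 1]
        coeffs[i] = (k / n) * np.sum(F * np.cos(i * np.arccos(roots)))
    return coeffs

sub = np.sin(np.linspace(0, np.pi, 25))           # one subsequence
sub = (sub - sub.mean()) / sub.std()              # z-normalize (Sect. 4.2)
print(pcha_features(sub, d=4))                    # 4-coefficient feature vector
```

With d = 4, this yields four-coefficient feature vectors analogous to those shown for PCHA in Fig. 5(d).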

Figure 5 shows examples of the (a) PAA, (b) APCA, (c) PLA, and (d) PCHA representations for the stock time series of Google Inc. (symbol: GOOG) from the NASDAQ Stock Market, which consists of the close prices of 800 consecutive trading days (2010/10/4–2013/12/5). As shown in Fig. 5(a), PAA extracts the mean values of equal-length subsequences as features. In Fig. 5(b), APCA takes the mean values and spans of adaptive-length subsequences as features, e.g., [−0.62, 134] for the first subsequence. In Fig. 5(c), PLA takes the linear fitting slopes and spans of adaptive-length subsequences as features, e.g., [−0.0035, 96] for the first subsequence. In Fig. 5(d), PCHA factorizes each subsequence and takes the first four Chebyshev coefficients as features, e.g., [−3.8, 0.34, 3, −0.39] for the first subsequence. The approximation produced by PCHA clearly differs from the others: it fits the local fluctuation characteristics of the time series well.

Fig. 5. PAA/APCA/PLA/PCHA representation examples.

In the entire procedure, the time series only needs to be scanned once for the adaptive segmentation and factorization. Thus, the computational complexity of PCHA is O(kn), where k is the number of extracted Chebyshev coefficients and is much smaller than the time series length n.

5 Similarity Measure

DTW is one of the most prevalent similarity measures for time series [5]. It exploits a one-to-many aligning scheme to find the optimal alignment between time series, as shown in Fig. 6. Thus, DTW can deal with the intractable basic shape variations, e.g., time warping and phase shift. Given a sample space F and time series T = {t_1, t_2, …, t_i, …, t_m} and Q = {q_1, q_2, …, q_j, …, q_n}, t_i, q_j ∈ F, a local distance measure d: (x, y) → R^+ must first be set in DTW for measuring two samples. Then, a distance matrix C ∈ R^{m×n} is computed, where each cell records the distance between a pair of samples from T and Q respectively, i.e., C(i, j) = d(t_i, q_j). There is an optimal warping path in C, which has the minimal sum of cells.

Fig. 6. One-to-many aligning scheme of DTW.

Definition 4.

(Warping Path): Given the distance matrix C ∈ R^{m×n}, if the sequence p = {c_1, …, c_l, …, c_L}, where c_l = (a_l, b_l) ∈ [1: m] × [1: n] for l ∈ [1: L], satisfies the following conditions:

(i) c_1 = (1, 1) and c_L = (m, n);

(ii) c_{l+1} − c_l ∈ {(1, 0), (0, 1), (1, 1)} for l ∈ [1: L − 1];

(iii) a_1 ≤ a_2 ≤ … ≤ a_L and b_1 ≤ b_2 ≤ … ≤ b_L;

then p is called a warping path. The sum of cells along p is defined as Formula (5).

$$ \varPhi_{p} = \varvec{C}(c_{1} ) + \varvec{C}(c_{2} ) + \cdots + \varvec{C}(c_{L} ) $$
(5)

Definition 5.

(Dynamic Time Warping Distance): Given the distance matrix C ∈ R^{m×n} over time series T and Q, and its warping path set P = {p_1, …, p_i, …, p_x}, i, x ∈ Z^+, the minimal sum of cells over all warping paths, Φ_min = min{Φ_p | p ∈ P}, is defined as the DTW distance between T and Q.

The computation of DTW is performed with a dynamic programming algorithm, which leads to a computational complexity quadratic in the time series length, i.e., O(n^2). Figure 7(a) shows the dynamic programming table with the optimal warping path of the DTW computation.
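For reference, a minimal sketch of this dynamic programming computation; the local distance d(x, y) = (x − y)^2 is an assumed choice, and the step conditions of Definition 4 appear in the inner minimum:

```python
import numpy as np

def dtw(T: np.ndarray, Q: np.ndarray) -> float:
    """Classic DTW via dynamic programming, O(m*n) time.

    Assumption: local distance d(x, y) = (x - y)**2; any local measure
    d: (x, y) -> R+ from Sect. 5 fits in its place.
    """
    m, n = len(T), len(Q)
    D = np.full((m + 1, n + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = (T[i - 1] - Q[j - 1]) ** 2
            # Step condition (ii) of Definition 4: (1,0), (0,1), (1,1)
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return float(D[m, n])
```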

Fig. 7. (a) Dynamic programming table with the optimal-aligned path (red shadow) of DTW, (b) against that of ChebyDTW. (Color figure online)

Based on PCHA, we propose a novel PA-DTW measure, named ChebyDTW, which consists of two layers: subsequence matching and dynamic programming computation. Figure 7(b) shows the dynamic programming table with the optimal-aligned path (red shadow) of ChebyDTW, where each cell records a subsequence matching result over the Chebyshev coefficients. As the intuitive comparison with Fig. 7(a) suggests, ChebyDTW operates on a much smaller table and thus has much lower computational cost than the original DTW.

Owing to its high computational efficiency, the squared Euclidean distance is a proper measure for the subsequence matching. Given that d Chebyshev coefficients are employed in PCHA, for subsequences S_1 and S_2, approximated as C = [c_1, …, c_d] and Ĉ = [ĉ_1, …, ĉ_d] respectively, the squared Euclidean distance between them is computed as Formula (6).

$$ D(\varvec{C},{\hat{\varvec{C}}}) = \sum\limits_{i = 1}^{d} {(c_{i} - \hat{c}_{i} )^{2} } $$
(6)

On top of the subsequence matching, the dynamic programming computation is performed. Given that time series T of length m is segmented into M subsequences and time series Q of length n is segmented into N subsequences, ChebyDTW can be computed as Formula (7), where C^T and C^Q are the PCHA representations of T and Q respectively; C_1^T and C_1^Q are the first coefficient vectors of C^T and C^Q respectively; and rest(C^T) denotes the coefficient vectors of C^T other than C_1^T, and likewise for rest(C^Q).

$$ ChebyDTW(T,Q) = \begin{cases} 0, & \text{if } m = n = 0 \\ \infty, & \text{if } m = 0 \text{ or } n = 0 \\ D(C_{1}^{T}, C_{1}^{Q}) + \min \begin{cases} ChebyDTW[rest(C^{T}), C^{Q}] \\ ChebyDTW[C^{T}, rest(C^{Q})] \\ ChebyDTW[rest(C^{T}), rest(C^{Q})] \end{cases}, & \text{otherwise} \end{cases} $$
(7)
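An iterative equivalent of this recursion, as a sketch; the rows of CT and CQ are per-subsequence coefficient vectors, e.g., as produced by a PCHA featurizer like the pcha_features sketch in Sect. 4.2:

```python
import numpy as np

def chebydtw(CT: np.ndarray, CQ: np.ndarray) -> float:
    """Sketch of ChebyDTW: DTW over PCHA coefficient vectors.

    CT (M x d) and CQ (N x d) hold one row of d Chebyshev coefficients per
    subsequence. The subsequence-matching subroutine is the squared
    Euclidean distance of Formula (6); the loop is an iterative equivalent
    of the recursion in Formula (7).
    """
    M, N = len(CT), len(CQ)
    D = np.full((M + 1, N + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            cost = float(np.sum((CT[i - 1] - CQ[j - 1]) ** 2))  # Formula (6)
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[M, N]
```

Since the table is M × N rather than m × n, the cost of the dynamic programming layer shrinks quadratically with the data compression rate, which is exactly the speedup analyzed in Sect. 6.2.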

6 Experiments

We evaluate the 1NN classifier based on ChebyDTW in terms of accuracy and efficiency. Twelve real-world datasets provided by the UCR time series archive [17] are employed; they come from various application domains and are characterized by different series profiles and dimensionalities. All the datasets have been z-normalized and partitioned into training and testing sets by the provider. Figure 8 shows representative instances from each class of the datasets.

Fig. 8. Sample representative instances from each class of the 12 datasets.

All parameters of the measures are learned on the training datasets by the DIRECT global optimization algorithm [19], which seeks the global minimum of a multivariate function within a constrained domain. The experiment environment is: Intel(R) Core(TM) i5-2400 CPU @ 3.10 GHz; 8 GB memory; Windows 7 64-bit OS; MATLAB 8.0_R2012b.

6.1 Classification Accuracy

Firstly, we take four PA-DTWs based on statistical features as baselines, i.e., PDTW [14], SDTW [10], DTWAPCA [8], and DTWDSA [9], which are based on the PAA, PLA, APCA, and DSA representations respectively. Secondly, since PA-DTW is computed over an approximate representation, its precision is commonly assumed to be lower than that of measures computed on the raw data. To test this assumption, we also take four DTW measures computed on the raw data as baselines, including the original DTW and its variations, i.e., CDTW [3], CIDDTW [20], and DDTW [21].

Tables 1 and 2 present the 1NN classification accuracy based on ChebyDTW and the baselines respectively. The best results on each dataset are highlighted in bold. The learned parameters that make each classifier achieve the highest accuracy on each training dataset are also presented, including the segment threshold (ρ), the smooth degree (sd), and the number of extracted Chebyshev coefficients (θ). For the sake of dimension reduction, we learn the parameter θ over the range [1, 10] for ChebyDTW.

Table 1. The accuracy of 1NN classifiers based on ChebyDTW and four PA-DTW baselines.
Table 2. The accuracy of 1NN classifiers based on ChebyDTW and four DTW baselines.

By comparison, we find that: (1) the 1NN classifier based on ChebyDTW wins on all datasets against those based on the PA-DTW baselines. The superiority mainly derives from the distinctive features extracted by ChebyDTW, which capture the fluctuation information for similarity measurement. Concretely, as shown in Fig. 8, the practical time series in the datasets have relatively complicated fluctuations, which can be transformed into a wide Chebyshev domain; thus the differences between time series can easily be captured by the Chebyshev coefficients. In contrast, the statistical features extracted by the baselines only focus on the aggregation characteristics of time series, which results in much loss of fluctuation information.

(2) The classifier based on ChebyDTW achieves higher accuracy on more datasets than the original DTW and its variations. The reason is apparent: the noise mixed into the time series, which is one of the principal factors affecting the precision of similarity measures, can be effectively filtered out by the Chebyshev factorization. Thus, the assumption above, that ChebyDTW has lower precision than the measures computed on the raw data, is not supported.

6.2 Computational Efficiency

The speedup in computational complexity gained by PA-DTW over the original DTW is O(n^2/w^2), where n is the time series length and w is the segment number. It is positively correlated with the data compression rate (DCR = n/w) of the piecewise approximation over the raw data; e.g., with n = 512 and w = 16, the DCR is 32 and the speedup is about 32^2 = 1024×. In Table 3, we present the segment numbers and the DCRs of the five PA-DTWs on all datasets. As above, the optimal segment numbers for the 1NN classifiers based on PDTW, SDTW, and DTWAPCA are learned on the training datasets, while the average segment numbers on each dataset are computed for ChebyDTW and DTWDSA.

Table 3. The DCR results of five PA-DTWs.

As shown in Table 3, the DCRs of ChebyDTW are not only much larger than those of the baselines on all datasets, but also robust to the time series length. Thus, ChebyDTW has the highest computational efficiency among the five PA-DTWs. This efficiency superiority mainly derives from the precise approximation of PCHA over the raw data and from the data-adaptive segmentation method, which segments time series into fewer subsequences with adaptive lengths.

In addition, the average runtimes of 1NN classification based on DTW and ChebyDTW are presented in Table 4. According to the results, the efficiency speedup (Ω) of ChebyDTW over DTW can reach as much as three orders of magnitude.

Table 4. The average runtime of 1NN classification based on DTW and ChebyDTW (ms).

7 Conclusions

We proposed a novel piecewise factorization model for time series, i.e., PCHA, with a novel adaptive segmentation method, in which the subsequences are factorized with Chebyshev polynomials. We employed the Chebyshev coefficients as features for the PA-DTW measure, resulting in the proposed ChebyDTW for 1NN classification. Comprehensive experimental results show that ChebyDTW supports accurate and fast 1NN classification.