1 Introduction

Today, time-series data management has become an interesting research topic for data miners. In particular, the clustering of time series has attracted considerable interest.

Clustering is the process of finding natural groups, called clusters, such that the grouping maximizes inter-cluster variance while minimizing intra-cluster variance [1]. Most clustering techniques fall into two major categories, partition-based clustering and hierarchical clustering [2]. Many of the traditional clustering algorithms use the Euclidean distance or Pearson's correlation coefficient to measure the proximity between data points. However, for time-series data these measures involve the individual magnitudes at each time point, so the traditional algorithms perform poorly on time-series expression data. To overcome these limitations, the proposed work represents the variations in the measurements of the time series for a fast implementation of an efficient agglomerative nesting algorithm. The focus of this work is on fast whole-sequence similarity search over the changes with respect to time rather than over the values in the time-series data.

The rest of the paper is organized as follows: Sect. 2 presents a brief review of related work. Sections 3 and 4 demonstrate the basic concepts and present the analysis of the proposed algorithm respectively. Sections 5 and 6 present the experimental results and the conclusions along with some future directions.

2 Related Work

Many clustering algorithms have been proposed, such as k-means, DBSCAN, STING, p-cluster and COD [46]. One of the recently proposed algorithms is the VCD algorithm [3], which analyzes the trends of expressions based on their variation over time using the cosine similarity measure with two user inputs. It was later enhanced as the EVCD algorithm [2] for the same purpose with a single user input; EVCD provides results at several levels, which allows the user to select the most appropriate level by using different parameters such as the silhouette coefficient, the number of clusters and the cluster density. Both the Enhanced Variation Co-expression Detection (EVCD) and VCD algorithms [2, 3] inferred that the cosine similarity measure was the most appropriate similarity measure for clustering time-varying microarray data.

3 Concepts and Definition

In order to determine the variation patterns in a time series based on the changes in the values observed at fixed time points, binarization of the change has been proposed. Some related definitions are presented in this section.

3.1 Variation Vector

Let a univariate time series be a sequence of n + 1 measurements observed at time periods t0, t1, t2, …, tn, say \( {\text{Y}} = \left\langle {{\text{y}}_{0} , {\text{y}}_{1} , {\text{y}}_{2} \ldots {\text{y}}_{\text{n}} } \right\rangle \in {\mathbb{R}}^{{{\text{n}} + 1}} \). A variation vector \( {\text{Y}}_{\text{v}} \in {\mathbb{R}}^{\text{n}} \) of Y is the sequence of differences denoted by \( {\text{Y}}_{\text{v}} = \left\langle {{\text{d}}_{1} , {\text{d}}_{2} \ldots {\text{d}}_{\text{n}} } \right\rangle \), where \( {\text{d}}_{\text{i}} = {\text{y}}_{\text{i}} - {\text{y}}_{{{\text{i}} - 1}} \) for \( 1\le {\text{i}} \le {\text{n}} \). An increase in the measurement \( \left( {{\text{y}}_{\text{i}} \ge {\text{y}}_{{{\text{i}} - 1}} } \right) \) and its magnitude are represented by a difference \( {\text{d}}_{\text{i}} \ge 0 \); similarly, a decrease \( \left( {{\text{y}}_{\text{i}} < {\text{y}}_{{{\text{i}} - 1}} } \right) \) yields di < 0.

The trend is the tendency of a continuous process that is measured during a fixed time interval. Trend analysis may traditionally be carried out by plotting a trend curve or a trend line and by monitoring the increase (decrease) in the values. Thus trend analyses involve observing the tendencies of the values by analyzing the changes that occur, in terms of the quantum of the change and/or the nature of the change. The pattern of increase or decrease in the values of the measurements may play a significant role in trend analyses. A variation vector quantifies the difference in the measurements at two consecutive time periods, say ti and ti+1, in terms of the magnitude di. The direction of change, increase or decrease, is captured by the positive or negative sign of di respectively. Therefore, a binary representation of the direction of change is suitable for computational efficiency. Binarization of the change of any time series is proposed in the form of a direction vector. Further, trend similarity based on a distance metric on n-dimensional binary vectors is defined.

3.2 Direction Vector

For a variation vector, \( {\text{Y}}_{\text{v}} = \left\langle {{\text{v}}_{1} , {\text{v}}_{2} , \ldots ,{\text{v}}_{\text{n}} } \right\rangle \in {\mathbb{R}}^{\text{n}} \), a direction vector \( {\text{Y}}_{\text{d}} \in \left\{ {0, 1} \right\}^{\text{n}} \) is defined as \( {\text{Y}}_{\text{d}} = \left\langle {{\text{b}}_{1} , {\text{b}}_{2} , \ldots ,{\text{b}}_{\text{n}} } \right\rangle \),

where,

$$ {\text{b}}_{\text{i}} = \begin{cases} 0 & {\text{if }}{\text{v}}_{\text{i}} \ge 0 \\ 1 & {\text{if }}{\text{v}}_{\text{i}} < 0 \end{cases} $$
(1)

Example 1:

Consider two time series \( {\text{T}}_{1} = \left\langle {3, 7, 2, 0, 4, 5, 9, 7, 2} \right\rangle \) and \( {\text{T}}_{2} = \left\langle {10, 15, 11, 5, 19, 25, 27, 24, 13} \right\rangle \). The corresponding variation vectors are \( {\text{V}}_{1} = \left\langle {4, - 5, - 2, 4, 1, 4, - 2, - 5} \right\rangle \) and \( {\text{V}}_{2} = \left\langle {5, - 4, - 6, 14, 6, 2, - 3, - 11} \right\rangle \). The direction vectors of T1 and T2 are \( {\text{D}}_{1} = \left\langle {0, 1,1, 0, 0, 0, 1, 1} \right\rangle \) and \( {\text{D}}_{2} = \left\langle {0, 1,1, 0, 0, 0, 1, 1} \right\rangle \) respectively.
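
As an illustration, a minimal Python sketch of the two constructions is given below (Python is used here purely for illustration and the function names are ours; the paper's implementation is on the .NET platform, see Sect. 5.2). It reproduces the vectors of Example 1.

```python
def variation_vector(y):
    """Variation vector of Sect. 3.1: differences d_i = y_i - y_{i-1}."""
    return [y[i] - y[i - 1] for i in range(1, len(y))]

def direction_vector(y):
    """Direction vector of Eq. (1): 0 for a non-decrease, 1 for a decrease."""
    return [0 if d >= 0 else 1 for d in variation_vector(y)]

T1 = [3, 7, 2, 0, 4, 5, 9, 7, 2]
T2 = [10, 15, 11, 5, 19, 25, 27, 24, 13]
print(variation_vector(T1))   # [4, -5, -2, 4, 1, 4, -2, -5]
print(direction_vector(T1))   # [0, 1, 1, 0, 0, 0, 1, 1]
print(direction_vector(T2))   # [0, 1, 1, 0, 0, 0, 1, 1]
```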

3.3 Trend Similarity

Let two time series \( {\text{X}} = \left\langle {{\text{x}}_{0} , {\text{x}}_{1} , {\text{x}}_{2} , \ldots ,{\text{x}}_{\text{n}} } \right\rangle \) and \( {\text{Y}} = \left\langle {{\text{y}}_{0} , {\text{y}}_{1} , {\text{y}}_{2} , \ldots ,{\text{y}}_{\text{n}} } \right\rangle \) be measured at times t0, t1, …, tn. Let \( {\text{X}}_{\text{v}} = \left\langle {{\text{v}}_{1} , {\text{v}}_{2} , \ldots ,{\text{v}}_{\text{n}} } \right\rangle \) and \( {\text{Y}}_{\text{v }} = \left\langle {{\text{u}}_{1} , {\text{u}}_{2} , \ldots ,{\text{u}}_{\text{n}} } \right\rangle \) be the corresponding variation vectors, and \( {\text{X}}_{\text{d}} = \left\langle {{\text{l}}_{1} , {\text{l}}_{2} , \ldots ,{\text{l}}_{\text{n}} } \right\rangle \) and \( {\text{Y}}_{\text{d}} = \left\langle {{\text{s}}_{1} , {\text{s}}_{2} , \ldots ,{\text{s}}_{\text{n}} } \right\rangle \) the corresponding direction vectors. Then X and Y are said to be similar in trend if and only if li = si for 1 ≤ i ≤ n.

Both direction vectors Xd and Yd are n-bit binary vectors. For each i, if xi ≥ xi−1 in series X, i.e. vi ≥ 0, then li = 0, and li = 1 otherwise. In the case of the time series Y, the bit value of si depicts an increase of the value at ti over the value at ti−1, i.e. ui ≥ 0 and correspondingly si = 0, and vice versa. If li = si for each i, then Y is said to be trend similar to X. It may be noted that the magnitudes of the differences in the two time series are not considered in this definition of similarity; only the direction of change, i.e. increase or decrease, is considered. The information in the direction vector may be utilized to determine the degree of similarity.

Example 2:

Consider the direction vectors D1 and D2 of the above example, corresponding to the two time series T1 and T2, each of length 9. The magnitudes of the differences are represented by the variation vectors V1 and V2. It may be noted that for each i, \( 1 \le {\text{i}} \le 8 \), \( {\text{V}}_{{1{\text{i}}}} \ne {\text{V}}_{{2{\text{i}}}} \). However, D1 and D2 are bit-wise equal, i.e. D1i = D2i for \( 1 \le {\text{i}} \le 8 \); therefore, the two series T1 and T2 are observed to be similar in trend.
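
Under this definition, and reusing the helper functions sketched after Example 1 (our own illustrative code, not the authors'), the trend-similarity test reduces to an equality check on the direction vectors:

```python
def trend_similar(x, y):
    """True iff the two series have identical direction vectors (Sect. 3.3)."""
    return direction_vector(x) == direction_vector(y)

print(trend_similar(T1, T2))  # True, as observed in Example 2
```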

The following metric for measuring the distance between two n-dimensional binary vectors has been considered in this work. Let \( \beta = \left\{ {0, 1} \right\} \) and \( I_{n} = \left\{ {0, 1, 2 \ldots n} \right\} \). Then the binary function \( d_{binary} : \beta \times \beta \to \beta \) is defined, for \( b_{1} , b_{2} \in \beta \), as

$$ d_{binary} \left( {b_{1} , b_{2} } \right) = \begin{cases} 0 & {\text{if }} b_{1} = b_{2} \\ 1 & {\text{otherwise}} \end{cases} $$
(2)

Then the distance function between a pair of n-dimensional binary vectors is \( d_{n} :\beta^{n} \times \beta^{n} \to I_{n} \). Consider two n-dimensional binary vectors, say \( D_{1} , D_{2} \in \beta^{n} \):

$$ d_{n} \left( {D_{1} , D_{2} } \right) = \sum\nolimits_{j = 1}^{n} { d_{binary} \left( {b_{1j} , b_{2j} } \right)} $$
(3)

Let \( d_{n} \left( {D_{1} , D_{2} } \right) = k \). Then k = 0 if \( \sum\nolimits_{i = 1}^{n} {d_{binary} \left( {b_{1i} , b_{2i} } \right) = 0} \) and k = n if \( \sum\nolimits_{i = 1}^{n} {d_{binary} \left( {b_{1i} , b_{2i} } \right) = n} \). Therefore, \( 0 \le k \le n \).

Example 3:

Consider the following two time series, \( T_{1} = \left\langle { 3, 7, 2, 0, 4, 5, 9, 7, 2} \right\rangle \) and \( T_{3} = \left\langle {45, 80, 22, 10, 40, 63, 45, 90, 10} \right\rangle \). The variation vectors V1 and V3 of T1 and T3 are \( V_{1} = \left\langle {4, - 5, - 2, 4, 1, 4, - 2, - 5} \right\rangle \) and \( V_{3} = \left\langle {35, - 58, - 12, 30, 23, - 18, 45, - 80} \right\rangle \), and the direction vectors D1 and D3 are \( D_{1} = \left\langle {0, 1, 1, 0, 0, 0, 1, 1} \right\rangle \) and \( D_{3} = \left\langle {0, 1, 1, 0, 0, 1, 0, 1} \right\rangle \).

For \( D_{1} , D_{3} \in \beta^{8} \), the dissimilarity between D1 and D3 may be computed using the distance function d8,

$$ d_{8} \left( {D_{1} , D_{3} } \right) = 2 $$
(4)

where,

$$ d_{binary} \left( {b_{1i} , b_{3i} } \right) = 1\quad {\text{for }}i \in \left\{ {6, 7} \right\} $$
(5)

and

$$ d_{binary} \left( {b_{1i} , b_{3i} } \right) = 0\quad {\text{for }}i \in \left\{ {1, 2, 3, 4, 5, 8} \right\} $$
(6)
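
The distance of Eqs. (2) and (3) simply counts the bit positions at which the two direction vectors disagree. A small sketch (again our own illustrative code) reproduces d8(D1, D3) = 2 of Example 3:

```python
def d_binary(b1, b2):
    """Eq. (2): 0 if the two bits agree, 1 otherwise."""
    return 0 if b1 == b2 else 1

def d_n(D_a, D_b):
    """Eq. (3): number of positions at which two n-bit direction vectors differ."""
    return sum(d_binary(a, b) for a, b in zip(D_a, D_b))

D1 = [0, 1, 1, 0, 0, 0, 1, 1]
D3 = [0, 1, 1, 0, 0, 1, 0, 1]
print(d_n(D1, D3))  # 2 (mismatches at positions 6 and 7)
```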

To allow for differences in trend at some of the n bits, the concept of trend dissimilarity of degree-k has been considered, where k ≤ n is the number of bits at which the two n-dimensional direction vectors encounter a bit mismatch.

3.4 Trend Dissimilarity of Degree K

Given two n-dimensional time series Ti and Tj, and their respective direction vectors Di and Dj, Ti and Tj are said to have dissimilarity of degree k, if \( {\text{d}}_{\text{n}} \left( {{\text{D}}_{\text{i}} , {\text{D}}_{\text{j}} } \right) = {\text{k,}}\quad {\text{for 1}} \le {\text{k}} \le n \).

The clusters at level-0 may contain identical objects. Consider any two arbitrary objects x and y and the Euclidean distance function d, the traditional measure of dissimilarity. Then \( d(\varvec{x},\varvec{y}) = 0 \), i.e. \( \sqrt {\sum \left( {x_{i} - y_{i} } \right)^{2} } = 0 \), if the two objects are identical. Therefore, the objects x and y must be grouped in the same cluster at level-0, say the ith cluster, denoted by C0,i. Let Ci,j denote the cluster with id j at level-i. Then the m clusters at level-0 are C0,1, C0,2, C0,3, …, C0,m. Let a measure of dissimilarity of 1 bit, represented by the distance d1, be associated with the clusters at level-1, a dissimilarity of 2 bits with d2, and so on. Then any two arbitrary objects x, y may be in the same cluster at level-1, C1,j, only if \( 0 < d\left( {x, y} \right) \le d_{1} \). In this section the concept of a trend cluster of level-k using the dissimilarity of degree-k is defined.

3.5 Trend Cluster of Level-K

For \( {\mathcal{T}} = \{ {\text{T}}_{1} , {\text{T}}_{2} , \ldots , {\text{T}}_{\text{m}} \} \), a set of n-dimensional time series of cardinality m, and the set of corresponding direction vectors \( \varGamma = \{ {\text{D}}_{1} , {\text{D}}_{2} , \ldots , {\text{D}}_{\text{m}} \} \), a trend cluster of level-k, Ck,j, includes the time series Ti and Tj in the same cluster if \( {\text{d}}_{\text{n}} \left( {{\text{D}}_{\text{i}} , {\text{D}}_{\text{j}} } \right) = {\text{k}} \). In that case \( {\text{d}}_{\text{n}} \left( {{\text{D}}_{\text{i}} , {\text{D}}_{\text{j}} } \right) \ne k^{{\prime }} \) for all \( {\text{k}}^{{\prime }} \), \( 0 \le {\text{k}}^{{\prime }} < k \); hence Ti and Tj are allocated to distinct trend clusters of level-0, level-1, up to level-(k − 1), say Ck′,i and Ck′,j, but are grouped in the same trend cluster of level-k, say Ck,i.

Example 4:

Consider the time series T1, T2 and T3 as in Examples 1 and 3. Their direction vectors are \( {\text{D}}_{1} = \left\langle {0, 1, 1, 0, 0, 0, 1, 1} \right\rangle \), \( {\text{D}}_{2} = \left\langle {0, 1, 1, 0, 0, 0, 1, 1} \right\rangle \) and \( {\text{D}}_{3} = \left\langle {0, 1, 1, 0, 0, 1, 0, 1} \right\rangle \). Consider D1 and D2: \( {\text{d}}_{8} \left( {{\text{D}}_{1} , {\text{D}}_{2} } \right) = 0 \); therefore, T1 and T2 must be grouped in the same cluster of level-0. Consider D1 and D3: \( {\text{d}}_{8} \left( {{\text{D}}_{1} , {\text{D}}_{3} } \right) = 2 \), i.e. the series T1 and T3 have trend dissimilarity of degree-2. Therefore, T1 and T3 must be grouped in different trend clusters of level-0 and level-1, say C0,1 and C0,3, and C1,1 and C1,3, respectively. However, the two must be grouped in the same trend cluster of level-2, say C2,1.
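
A level-0 trend cluster therefore collects all series whose direction vectors are bit-wise identical. A possible sketch of this grouping step, reusing the helpers above (the dictionary-based grouping is our own illustration, not necessarily the authors' data structure), is:

```python
def level0_clusters(series_list):
    """Map each distinct direction vector (the level-0 medoid) to the indices
    of the series that share it."""
    groups = {}
    for idx, series in enumerate(series_list):
        groups.setdefault(tuple(direction_vector(series)), []).append(idx)
    return groups

T3 = [45, 80, 22, 10, 40, 63, 45, 90, 10]
print(level0_clusters([T1, T2, T3]))
# {(0, 1, 1, 0, 0, 0, 1, 1): [0, 1], (0, 1, 1, 0, 0, 1, 0, 1): [2]}
```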

Example 5:

Consider the 5-dimensional view of the four gene expressions a, b, c and d shown in Fig. 1. The direction vectors Da and Dc are identical; therefore genes a and c are trend similar. Even visually, the vectors a and c are more similar to each other than to the vectors b and d.

Fig. 1. Trend similarity in gene expressions

An advantage of this approach is the simplicity of the representation of the objects of an m-dimensional time-series database, using only one bit to represent the change in value from time ti to ti+1:

$$ {\text{b}}_{\text{i}} = \begin{cases} 0 & {\text{if }}{\text{x}}_{{{\text{i}} + 1}} \ge {\text{x}}_{\text{i}} \\ 1 & {\text{if }}{\text{x}}_{{{\text{i}} + 1}} < {\text{x}}_{\text{i}} \end{cases}; \quad 0 \le {\text{i}} \le {\text{m}} - 2 $$
(7)

The direction vectors are a lossy transformation of the original data, from which the original values cannot be retrieved. Thus it is a novel representation from the perspective of security and privacy preservation of the original data.

4 Fast Trend Similarity-Based Clustering (FTSC) Algorithm

The FTSC algorithm starts by generating the variation vectors; the second step is the binarization of the variation vectors; in the third step, identical direction vectors indicate similarity in trend among the time series, thus forming the trend clusters of level-0 in the hierarchy of clusters. The higher-level clusters result from merging the closest clusters of the previous level, starting with the smaller clusters. Each cluster is represented by a direction vector as the medoid of the cluster, and the distance between two clusters is computed as the distance between their medoids. A simplified sketch of this procedure is given below.
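
The following Python sketch outlines the hierarchy construction as we read the description above; the exact merge rule (which cluster absorbs which, and whether the threshold at level-k is exactly k or at most k) is our assumption and may differ from the authors' implementation. It reuses level0_clusters and d_n from the sketches in Sect. 3.

```python
def ftsc(series_list, max_level):
    """Sketch of FTSC: level-0 clusters group identical direction vectors;
    at level k, a cluster is absorbed by the closest already-retained cluster
    whose medoid lies within distance k; the absorbing cluster keeps its medoid."""
    level = [{"medoid": list(m), "members": list(ids)}
             for m, ids in level0_clusters(series_list).items()]
    hierarchy = [level]
    for k in range(1, max_level + 1):
        retained = []
        # Visit larger clusters first so that the smaller clusters merge into them.
        for c in sorted(level, key=lambda c: len(c["members"]), reverse=True):
            close = [t for t in retained if d_n(t["medoid"], c["medoid"]) <= k]
            if close:
                target = min(close, key=lambda t: d_n(t["medoid"], c["medoid"]))
                target["members"] = target["members"] + c["members"]
            else:
                retained.append({"medoid": c["medoid"], "members": list(c["members"])})
        level = retained
        hierarchy.append(level)
    return hierarchy
```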

The FTSC algorithm is a nonparametric algorithm; it does not require any prior information about the data or the number of clusters.

The asymptotic time complexity of the algorithm is quadratic in the product of the dimension of the time series and the number of clusters at level-i, ni < n; therefore, the complexity of the algorithm is O((mn)2). However, due to the binarization of the variation in the time series, the bit-wise comparisons and the distance computation may be implemented using fast bit operators, as sketched below.
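
For instance, each direction vector can be packed into a single machine word, so that the degree of dissimilarity reduces to an XOR followed by a population count. The sketch below illustrates this kind of bit-level implementation; it is our illustration, not the authors' code.

```python
def pack_bits(direction):
    """Pack an n-bit direction vector into a single integer."""
    word = 0
    for b in direction:
        word = (word << 1) | b
    return word

def fast_distance(w1, w2):
    """Degree of trend dissimilarity via XOR and population count."""
    return bin(w1 ^ w2).count("1")

w1 = pack_bits([0, 1, 1, 0, 0, 0, 1, 1])
w3 = pack_bits([0, 1, 1, 0, 0, 1, 0, 1])
print(fast_distance(w1, w3))  # 2, as in Example 3
```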

5 Experiments and Results

5.1 Data Sets

The experiments have been carried out to perform clustering on two microarray data sets and two financial data sets. Table 1 describes the data sets.

Table 1. Data set

5.2 System Configuration

The experiments were run on Windows 8 Enterprise © 2012, 64-bit, with an Intel® Core™ i7 CPU U640 @ 1.20 GHz. The .NET platform has been used for the implementation.

5.3 Design of Experiments

The experiments have been designed to assess the performance of the FTSC algorithm in terms of efficiency and accuracy. Efficiency is mainly observed in terms of execution time. The accuracy of the algorithm is considered to be the consistency of the cluster allocation of a time series irrespective of the number of re-executions, of the cluster allocation of multiple copies of the time-series data, and of the order in which the time series are input to the algorithm. The second experiment compares the FTSC and EVCD algorithms.

5.4 Efficiency and Accuracy of FTSC

The first experiment has been designed to examine the speed of the Fast Trend Similarity-based Clustering algorithm in clustering the four data sets. The program implementing the algorithm was run five times; the average running times to yield the hierarchical clusters for the four data sets Affymetrix, Drosophila genome, Exchange Rates and PPPs over GDP, and NSE were 00:00:02.66, 00:00:01.72, 00:00:10.11 and 00:00:01.34 respectively.

The outcomes of running the FTSC algorithm on the Affymetrix data are presented in Tables 2, 3 and 4. In Table 2, the 7-bit direction vector of gene Id 11251 is 0000001, which is in cluster \( C_{0,0} \), while the two genes 11152 and 12182 at serial numbers 7 and 8 have identical direction vectors 0000101; therefore, \( C_{0,3} \) includes two genes. The total number of clusters at level-0 is 115.

Table 2. Direction vectors, clusters of level-0 of AffyMetrix data
Table 3. Level-3 cluster formation
Table 4. Level-4 cluster formation

Tables 3 and 4 present the clusters of level-3 and level-4 respectively. In the two tables the rows display all the clusters \( {\text{C}}_{{{\text{i}}, {\text{j}}}} \), with i denoting the cluster level and j the cluster id. The cluster medoid is presented in the second column by the identifier of the direction vector representing the corresponding cluster of level-0. In Table 3, the 3rd, 4th, 5th, 6th and 7th columns display the clusters of level-2 that are merged to form the cluster of level-3. Thus the cluster \( {\text{C}}_{3,0} \), represented by the medoid 0, is formed by merging the clusters of level-2 represented by the medoids 4, 7, 14, 29 and 58, yielding a cluster with a total of 486 genes. The cluster \( {\text{C}}_{3,1} \) is the outcome of merging the three clusters of level-2 represented by the medoids 26, 52 and 81 into the cluster represented by the medoid 20, having a total of 689 genes. To obtain the clusters \( {\text{C}}_{3,6} \) to \( {\text{C}}_{3,15} \), no other clusters of level-2 were merged into the ones represented by the respective medoids indicated in column two. The blank ‘−’ entries in the table indicate that no clusters of level-2 were merged. Therefore, the row pertaining to the cluster \( {\text{C}}_{3,6} \) with medoid 16 indicates that no cluster of level-2 satisfied the criterion for the merge operation, although the total number of genes in the cluster \( {\text{C}}_{3,6} \) is 2. The total number of clusters at level-3 is 16.

The clusters \( {\text{C}}_{3,6} \) to \( {\text{C}}_{3,15} \) at level-3 are unchanged from the previous level, retaining the same medoids and densities.

Similarly, Table 4 exhibits the details of the clusters of level-4. From Tables 3 and 4 it may be observed that the cluster C4,0 with medoid 0 has been formed by merging the clusters C3,0, C3,5, C3,6, C3,7 and C3,11, referred to by the medoids 0, 9, 16, 31 and 60 respectively. It may also be observed that the density of C4,0 is the sum of the densities of C3,0, C3,5, C3,6, C3,7 and C3,11. Similarly, the cluster C4,2 is formed by merging C3,13 and C3,4 into C3,3, resulting in a density of 1880.

The FTSC algorithm is an agglomerative clustering algorithm, yielding a hierarchical clustering of levels 0–7 for the Affymetrix data. The cluster at the highest level, C6,0, represented by the medoid 0, includes all the 12488 genes (Figs. 2 and 3).

Fig. 2. Random clusters plot for DS 1 level 0

Fig. 3. Random clusters plot for DS 2 level 0

In order to estimate the efficiency, accuracy and sensitivity to the order of the data inputs, all the rows of the Affymetrix data set were duplicated four times and randomly shuffled. Therefore, the algorithm was executed with a total of 4 × 12488 = 49952 rows with 8 dimensions. The output of the program was a hierarchical clustering with levels 0−7 with the same number of clusters at each level as before, but with the density of each cluster four times the previous density; e.g. the cluster C4,5, with inputs four times those of the first run, was represented by a gene whose direction vector was identical to that of gene 9 and contained 192 genes. The same phenomenon was observed for all the clusters of each level from level-0 to level-7. Thus the accuracy of the algorithm has been assessed. The average running time of the repeated executions on four times the original data set was 00:00:10.714.

The repeated execution of the program after randomly shuffling the rows yielded the same number of clusters. However, each time the execution time differed only in the 3rd or 4th decimal place, with the mean being 00:00:02.6599 (Figs. 4 and 5).

Fig. 4. Random clusters plot for DS 3 level 0

Fig. 5. Random clusters plot for DS 4 level 0

5.5 Comparison of FTSC and EVCD Algorithms

In this experiment the results of the EVCD algorithm and the FTSC algorithm have been compared. The two real-world data sets Affymetrix and Drosophila, as described in Table 1, are used in this experiment to assess the novelty of trend dissimilarity, as the changes in the time series are represented by direction vectors. The EVCD algorithm is a parametric algorithm, while the FTSC algorithm is not; EVCD requires one user input, the parameter ε. The experiment has been repeated for three values of ε, namely 0.01, 0.05 and 0.1. As EVCD performs a hierarchical clustering, for ε = 0.01, 10 clusters and 6 singletons were obtained at level 14; for ε = 0.05, 10 clusters and 6 singletons were obtained at level 2; and finally, 11 clusters and 2 singletons were obtained at level 1 for ε = 0.1.

6 Conclusions

The experiments indicate that although the FTSC algorithm has the complexity O((mn)2), it is fast in terms of execution time due to the binarization of the change in the time series. The binary representation in terms of the direction vector allows the distance computation to be implemented using bit-level operators. The binarization also helps in preserving the privacy and security of the actual data. The nonparametric characteristic of the algorithm spares the end user the exercise of parameter tuning; the user also does not require any prior knowledge of the data or the clusters. The FTSC algorithm is time efficient and has the potential to yield accurate clusters of time-series data. The scalability of the algorithm for multi-dimensional time series and its behaviour in the presence of noise shall be investigated in the future. Selecting a better medoid for the clusters at each higher level is also considered as future work.