1 Introduction

Trend analysis is a form of comparative analysis that is often employed to identify current and future movements in various applications. In business, a trend is the general direction in which the market is headed, and technical analysis uses it to predict future stock movements from past data. In weather, air temperature and precipitation are the principal elements, and examining their behavior is important for understanding climate variability. Trend analysis relies on past behavior to give information about what is likely to happen in the future.

In most real-world applications, such as the stock market, weather records, customer buying patterns and the growth patterns of diseases, data is present in time-series form. A time series traces the values taken by a variable over a time period, such as a month or a year. These applications generate a huge amount of data even in a short period of time [1], so applying trend analysis to that data takes a lot of time. The time series generated by these applications are found to share some properties: they exhibit similar behavior over a specific time period in the case of the stock market or customer buying patterns, or similar behavior within a region in the case of weather data. Hence we can exploit the similarity present among time sequences for trend analysis.

Why analyze each sequence individually if we can put similar time sequences into a cluster and then form an approximate representation for that cluster? With this approach we lose only a small amount of information and can save a lot of time.

A lot of work on trend analysis over time-series data has been performed in the past [2–5]. All of these works perform trend analysis without forming any representation, which costs a lot of time, since each sequence is analyzed and its results derived separately. Hence we introduce the Representative Time Sequence (RTS). An RTS is a time sequence that represents a cluster formed using a clustering technique. Using the RTS, we can perform many kinds of regression tests and trend detection tests on such time-series data and analyze it to predict future behavior.

In this paper, we introduce an algorithm for finding the representative time sequence for clusters of sequences with similar behavior. Our algorithm uses the concept of agglomerative hierarchical clustering, which we describe later in this paper. We verify the representative sequence formed by the algorithm by calculating its similarity with all the original sequences from which it is formed. The representative sequence is used for trend analysis. As we form the RTS by merging all the sequences of a cluster, we need to validate it against the constraint that it must follow the same behavior (i.e. trend) as all or most of the sequences in the cluster. For validation, we use Sen’s median slope estimation method.

2 Related work

Very little work has been done in the past on constructing a representative series. Many methods exist for representing a time series, such as the discrete Fourier transform (DFT) [6], the discrete wavelet transform (DWT) [7] and singular value decomposition (SVD) [8]. However, all of these methods are used to reduce the dimensionality of the data, so that a large time series is reduced to a small one. We cannot use these dimensionality reduction methods to find the representation, because the trend test could then no longer be applied to each attribute, as some attributes might be eliminated.

To merge time sequences, a merging algorithm is proposed in [9]. In this algorithm the author describes an “influence term”, associated with every time series, that expresses its significance in the resulting time series. The advantage of this algorithm is that the user can define the influence term for each time series, or it can be defined from properties of the time-series domain. But this advantage might not help when the user does not know which sequence should contribute more to the result (e.g. for a large data set), or when the time-series domain does not give enough knowledge to define the influence term (e.g. a time-series data set with random values at each data point).

The rest of the paper is organized as follows. Section 3 gives the background for the algorithm that finds the representative time sequence and details the data set used in this paper. Section 4 describes the algorithm for finding the representative sequence. In Sect. 5 we perform experiments on the RTS formed by our algorithm and, to verify and validate the methodology, present verification, cross-verification and validation tests on a real data set. Section 6 concludes the paper and gives directions for future work.

3 Data-set and background

In this section, we briefly describe the data set used in this work and explain agglomerative hierarchical clustering [1]. Our proposed algorithm follows the same hierarchical procedure as agglomerative hierarchical clustering, but uses it to build the RTS. We then discuss the similarity measures for time-series data and Sen’s median slope estimation method, which are used to verify and to validate the Representative Time Sequence (RTS), respectively.

3.1 Data-set

We used 100-year-long rainfall time-series data for 28 states (624 districts) of India. The data set used in this study was obtained from the Indian Meteorological Department (IMD), Pune [10]. The time series considered in the experiments are of equal length and, being real-world data, may contain outliers, noise, etc. In this paper, we discuss the results and experiments for 5 states, Maharashtra, Madhya Pradesh, Orissa, Punjab and West Bengal (136 districts in total).

Many difficulties arise when defining a similarity measure for time-series data [11], and to remove them to some extent we need to normalize the time sequences [1]. The Z-score normalization technique [12] can be used to ensure that all values of the input time series \(T = \{t_{1},t_{2},t_{3}, \ldots , t_{n}\}\) (\(n\) = number of data points in the series) are transformed into a series whose mean \(\mu (T)\) is approximately 0 and whose standard deviation \(\sigma (T)\) is close to 1. Using Eq. (1), the input time series \(T\) is replaced by the normalized series \(T'\), where

$$\begin{aligned} t'_{i} = \frac{t_{i}-\mu (T)}{\sigma (T)} \quad \text {for\,i}=1\,\text {to n} \end{aligned}$$
(1)
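As a minimal sketch of Eq. (1), the normalization could be written as follows. The function name `z_normalize` and the sample values are hypothetical, and the population form of the standard deviation is an assumption, since the text does not say whether the population or the sample form is meant.

```python
import math

def z_normalize(series):
    """Return a z-score normalized copy of a time series (Eq. 1)."""
    n = len(series)
    mean = sum(series) / n
    # Assumption: population standard deviation (divide by n, not n - 1).
    std = math.sqrt(sum((x - mean) ** 2 for x in series) / n)
    if std == 0:
        raise ValueError("a constant series cannot be z-normalized")
    return [(x - mean) / std for x in series]

# Made-up annual rainfall values, for illustration only.
rainfall = [820.0, 910.0, 760.0, 1005.0, 880.0]
normalized = z_normalize(rainfall)
```

After this step the series has mean 0 and standard deviation 1 up to floating-point error, which is what makes sequences from districts with very different rainfall totals comparable.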

3.2 Agglomerative hierarchical clustering

Agglomerative Nesting (AGNES) is an agglomerative hierarchical clustering method [13]. It starts from the assumption that the data contains as many clusters as there are data points, placing each data point into a separate cluster, and then merges these clusters step by step according to some criterion [14]. AGNES terminates when only a single cluster remains. In this way AGNES creates a binary tree structure known as a dendrogram (a simplified model in which data that are “close” have been grouped into a hierarchical tree).
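The bottom-up procedure can be illustrated with a short sketch. It assumes group-average linkage as the merging criterion and adds a hypothetical `num_clusters` stopping parameter so that the tree can be cut early; the function name `agnes` and the toy data are likewise illustrative and not taken from [13].

```python
def agnes(points, distance, num_clusters=1):
    """Merge the two closest clusters until num_clusters remain."""
    clusters = [[i] for i in range(len(points))]  # one cluster per data point
    merges = []  # merge history, i.e. the dendrogram read bottom-up

    def average_linkage(a, b):
        # Assumption: group-average linkage between two clusters of indices.
        total = sum(distance(points[i], points[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > num_clusters:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = average_linkage(clusters[x], clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        d, x, y = best
        merges.append((clusters[x], clusters[y], d))
        merged = clusters[x] + clusters[y]
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)]
        clusters.append(merged)
    return clusters, merges

# Toy one-dimensional data: two tight groups and one isolated point.
points = [1.0, 1.2, 5.0, 5.1, 10.0]
clusters, merges = agnes(points, lambda a, b: abs(a - b), num_clusters=2)
```

With `num_clusters=1` the loop runs until a single cluster remains, and `merges` then holds the complete dendrogram.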

3.3 Similarity/distance measure

Many similarity measures can be used for time-series data, such as dynamic time warping (DTW) [15], the triangle distance measure (TDM) [16], the Spearman rank correlation coefficient, the Pearson correlation coefficient and the Euclidean distance [16]. In this paper, we use TDM and DTW for verification and cross verification, respectively. For \(n\) time series, a similarity measure gives an \(n \times n\) matrix as output, in which each cell represents the distance between two time series.

TDM is used to generate the RTS and to verify the results produced by our algorithm. TDM considers each time series as a vector in \(n\)-dimensional space. Let \(r_{i}\) be a time-series object of dimension \(n\), \(r_{i} = \{r_{i1},r_{i2},r_{i3}, \ldots , r_{in}\}\). The standardized time-series object is \(\widehat{r}_{i} = \{\widehat{r}_{i1},\widehat{r}_{i2},\widehat{r}_{i3}, \ldots , \widehat{r}_{in}\}\), where

$$\begin{aligned} \widehat{r}_{ij} = \frac{r_{ij}}{{\left( \sum _{k=1}^{n} r_{ik}^{2}\right) }^{1/2}} \end{aligned}$$
(2)

The TDM between \(r_{i}\) and \(r_{j}\) is defined by Eq. (3)

$$\begin{aligned} {d(r_{i} , r_{j})} = 1 - \frac{\sum _{k=1}^{n}{r_{ik}r_{jk}}}{{{\left( \sum _{k=1}^{n} r_{ik}^{2}\right) }^{1/2}}{{\left( \sum _{k=1}^{n} r_{jk}^{2}\right) }^{1/2}}} = 1 - \sum _{k=1}^{n}\widehat{r}_{ik}\widehat{r}_{jk} \end{aligned}$$
(3)

TDM is one minus the cosine of the angle between the two vectors, so its value lies between 0 and 2 [15]. The value is 0 if the two vectors point in the same direction, which shows that the two time-series vectors are almost identical in shape. On the other hand, if the two time series point in opposite directions, the value is 2, which marks the two most different time-series vectors.
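A small sketch of the measure follows, reading the TDM as one minus the cosine similarity, which is what the stated range of 0 to 2 implies. The function name `tdm` and the example vectors are hypothetical.

```python
import math

def tdm(r, s):
    """Triangle distance measure: 1 - cosine of the angle between r and s."""
    if len(r) != len(s):
        raise ValueError("TDM needs series of equal length")
    dot = sum(a * b for a, b in zip(r, s))
    norm_r = math.sqrt(sum(a * a for a in r))
    norm_s = math.sqrt(sum(b * b for b in s))
    if norm_r == 0 or norm_s == 0:
        raise ValueError("TDM is undefined for an all-zero series")
    return 1.0 - dot / (norm_r * norm_s)

same_direction = tdm([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])   # close to 0
opposite = tdm([1.0, 2.0, 3.0], [-1.0, -2.0, -3.0])      # close to 2
orthogonal = tdm([1.0, 0.0], [0.0, 1.0])                 # close to 1
```

Because the measure depends only on direction, two series that differ by a positive scale factor have distance 0, which is one reason to normalize the sequences before comparing them.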

DTW is a classical option for calculating the similarity between two time series [15]. DTW is important because it does not require the time-series objects to have the same length. DTW can, for example, detect similarities in walking patterns, because it still gives accurate results when one time series varies in time and speed relative to another. It optimally aligns (‘warps’) two time sequences \(r = \{r_{1},r_{2},r_{3}, \ldots , r_{n}\}\) and \(s = \{s_{1},s_{2},s_{3}, \ldots , s_{m}\}\) of length \(n\) and \(m\) respectively, so that the difference between them is minimized. This difference can be obtained efficiently by dynamic programming, using a matrix \(D\) that is initialized to \(D_{0,0} = 0\) with all other cells \(D_{i,j} = \infty \), and recursively applying

$$\begin{aligned} D_{i,j} = d(r_{i},s_{j}) + \min \{D_{i,j-1},D_{i-1,j},D_{i-1,j-1}\} \end{aligned}$$
(4)

where \(i=1\) to \(n\) and \(j = 1\) to \(m\). We obtain an \(n \times m\) matrix, in which the distance between two points \(r_{i}\) and \(s_{j}\) is calculated using the Euclidean distance function [1], and the distance between the sequences \(r\) and \(s\) is the value of the last cell \(D_{n,m}\).
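The recurrence in Eq. (4) can be sketched as below. This is a plain quadratic-time version without any warping-window constraint; the function name `dtw_distance` and the test series are hypothetical, and for scalar data points the Euclidean distance reduces to an absolute difference.

```python
import math

def dtw_distance(r, s):
    """DTW distance between r (length n) and s (length m), following Eq. (4)."""
    n, m = len(r), len(s)
    # Extra row and column so that D[0][0] = 0 is the only finite start cell.
    D = [[math.inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(r[i - 1] - s[j - 1])  # point-to-point distance
            D[i][j] = cost + min(D[i][j - 1], D[i - 1][j], D[i - 1][j - 1])
    return D[n][m]

identical = dtw_distance([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
stretched = dtw_distance([1.0, 2.0, 3.0], [1.0, 2.0, 2.0, 3.0])
shifted = dtw_distance([0.0, 0.0, 0.0], [1.0, 1.0, 1.0])
```

The `stretched` case shows the property the text relies on: a series that repeats a value, and therefore has a different length, is still at distance 0 from the original.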

3.4 Sen’s median slope estimator

Sen’s median slope estimator is commonly used together with the Mann-Kendall (MK) test [17] to detect the trend present in a series. In this paper it is used to measure the magnitude of the trend in order to validate the RTS.

The MK test indicates the presence of a trend in the series, while the sign and magnitude of the slope show the nature of the trend, i.e. whether it is increasing or decreasing. To estimate the nature of the trend, Sen’s median slope estimator, which is not affected by outliers [18], is used. For \(N\) pairs of data, the slopes are estimated by Eq. (5):

$$\begin{aligned} {Q_{i}} = \frac{(x_{j} - x_{k})}{j-k} \quad \text {for\,i}=1\,\text {to N} \end{aligned}$$
(5)

Here \(x_{j}\) and \(x_{k}\) are the annual values in years \(j\) and \(k\) of a series, respectively, with \(j > k\). The median of these \(N\) values of \(Q_{i}\) is Sen’s median estimator of slope, where, for a series of \(n\) values,

$$\begin{aligned} {N} = \frac{n(n-1)}{2} \end{aligned}$$
(6)
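Eqs. (5) and (6) translate into a few lines of code. The sketch below enumerates all pairs with \(j > k\) and takes the median of their slopes; the function name `sens_slope` and the sample series are hypothetical.

```python
from statistics import median

def sens_slope(series):
    """Median of the pairwise slopes (x_j - x_k) / (j - k) for all j > k."""
    n = len(series)
    if n < 2:
        raise ValueError("need at least two observations")
    slopes = [
        (series[j] - series[k]) / (j - k)
        for k in range(n - 1)
        for j in range(k + 1, n)
    ]
    assert len(slopes) == n * (n - 1) // 2  # N in Eq. (6)
    return median(slopes)

rising = sens_slope([1.0, 2.0, 3.0, 4.0, 50.0])  # last value is an outlier
falling = sens_slope([9.0, 7.0, 5.0, 3.0])
```

In the `rising` series six of the ten pairwise slopes equal 1, so the median stays at 1 despite the outlier, which illustrates the robustness that motivates the choice of this estimator.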

4 Finding representative time sequence

Growing time-series data depict the trends present in the observed values over time, and hence it is important to capture the valuable information that users may wish to analyze and understand. Because of the huge amount of time-series data generated by the applications described in the introduction, trend analysis becomes an important and challenging problem.

Since we are not interested in the exact values of each time series, we introduce a hierarchical time-series merging algorithm, which is useful for analyzing time-series data because it removes the overhead of considering every time sequence in the analysis. Many clustering techniques exist for time-series data [11], such as k-means, hierarchical and relocation clustering. By applying a clustering technique to the time series, we obtain clusters whose time sequences are similar to each other but dissimilar to the sequences in other clusters.

Fig. 1 Dendrogram for Representative Time Sequence

The clusters contain the time sequences that are used to form the representative sequence. For the sequences of each cluster we compute a distance matrix using the similarity measure; every cell of the matrix gives the distance between two sequences. The two time sequences with the least distance between them are then combined to form a new sequence. To combine two sequences, we take the average of each pair of corresponding data points. Let \(R = \{r_{1},r_{2},r_{3}, \ldots , r_{n}\}\) and \(S = \{s_{1},s_{2},s_{3}, \ldots , s_{n}\}\) be two time sequences of length \(n\); then the new sequence \(T = \{t_{1},t_{2},t_{3}, \ldots , t_{n}\}\) is formed using Eq. (7).

$$\begin{aligned} t_{i} = \frac{(r_{i}+s_{i})}{2} \quad \text {for\,i}=1\,\text {to n} \end{aligned}$$
(7)
Algorithm 1

Algorithm 1 explains the complete procedure for producing the RTS. Fig. 1 shows how Algorithm 1 works. It is quite similar to the hierarchical clustering algorithm, because each time the two most similar time sequences are merged and the similarity matrix is updated for the new time sequence against the original time sequences that are left to process. The algorithm terminates when a single sequence is left, which is the RTS of the cluster on which the algorithm operates. Algorithm 1 is run on each cluster formed in the time-series data.
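The procedure described in the text can be sketched as follows. This is one reading of the prose description, not a reproduction of Algorithm 1 itself: it assumes TDM (taken as one minus the cosine similarity) as the distance, recomputes all pairwise distances in every round rather than updating a matrix incrementally, and uses hypothetical names (`tdm`, `merge_pair`, `representative_sequence`) and toy data.

```python
import math

def tdm(r, s):
    """Triangle distance measure, read here as 1 - cosine similarity."""
    dot = sum(a * b for a, b in zip(r, s))
    norms = math.sqrt(sum(a * a for a in r)) * math.sqrt(sum(b * b for b in s))
    return 1.0 - dot / norms

def merge_pair(r, s):
    """Point-wise average of two equal-length sequences (Eq. 7)."""
    return [(a + b) / 2.0 for a, b in zip(r, s)]

def representative_sequence(cluster, distance=tdm):
    """Merge the two closest sequences until one sequence, the RTS, is left."""
    pending = [list(seq) for seq in cluster]
    if not pending:
        raise ValueError("cluster is empty")
    while len(pending) > 1:
        # Find the closest pair among the sequences still to be processed.
        _, i, j = min(
            (distance(pending[a], pending[b]), a, b)
            for a in range(len(pending))
            for b in range(a + 1, len(pending))
        )
        merged = merge_pair(pending[i], pending[j])
        pending = [seq for k, seq in enumerate(pending) if k not in (i, j)]
        pending.append(merged)  # the merged sequence competes in later rounds
    return pending[0]

# Toy cluster: two near-identical rising sequences and one falling sequence.
cluster = [
    [1.0, 2.0, 3.0, 4.0],
    [1.1, 2.1, 2.9, 4.2],
    [4.0, 3.0, 2.0, 1.0],
]
rts = representative_sequence(cluster)
```

In this example the two rising sequences are merged first, and their average is then merged with the falling sequence, mirroring the bottom-up order of the dendrogram in Fig. 1.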

Table 1 Cluster 1 information for Maharashtra State
Fig. 2 Representative time sequence

We use the clustering results in our algorithm and form a dendrogram for each cluster to combine its time series. All series of a cluster are merged in a hierarchical fashion. As an example, we show the RTS for Cluster 1 (MH_C1) of Maharashtra state, which contains 4 time sequences. Table 1 gives the information on Cluster 1 for Maharashtra state. Figure 2 shows the time sequences of MH_C1 together with its RTS, and demonstrates graphically how well the RTS follows the other sequences of the cluster.

5 Experiment and results

We introduced an algorithm for finding the RTS of clusters of time-series data. We verify the RTS formed by the proposed algorithm by calculating its similarity with all the original sequences from which it is formed. The representative sequence will be used for trend analysis, and because it is formed by merging all the sequences of a cluster, we need to validate it against the constraint that it must follow the same behavior (i.e. trend) as all or most of the sequences in the cluster. For this we use Sen’s median slope estimation method.

5.1 Clustering result

To perform verification and validation of the RTS, we first need to form clusters of the time-series data; for this we use group-average hierarchical clustering. In this paper, we show results for 100-year-long rainfall time-series data of 108 districts in 4 states: Madhya Pradesh, Orissa, Punjab and West Bengal. Table 2 shows the clustering results for these four states.

Table 2 Clustering result

5.2 Verification

To verify the RTS formed by our algorithm, we measure its similarity with all the original time sequences of its cluster. TDM is used as the similarity measure because it produces output in the range from 0 to 2; a value near zero indicates that the RTS is close to the original sequences of the cluster.

We show the results of the similarity measure in Tables 3 and 4 for the rainfall time series of 4 states: Madhya Pradesh, Orissa, Punjab and West Bengal. The columns in Tables 3 and 4 represent the RTS of each cluster, and the rows represent the original time sequences of the cluster corresponding to the column number. Almost all similarity values are near 0, which indicates that the representative series is well formed for each cluster and that the RTS can usefully represent its cluster.

Table 3 Similarity of representative time sequence with original sequence of cluster
Table 4 Similarity of representative time sequence with original sequence of cluster

5.3 Cross verification

To cross verify that the RTS found for a particular cluster fits only that cluster best, we calculated the similarity measure between the RTS of one cluster and the original time sequences of the other clusters.

Tables 5, 6, 7 and 8 show the cross-verification results of the RTS for the rainfall time series of Maharashtra state, divided into four clusters. In these tables, each cell contains the similarity of the RTS of a cluster (e.g. the RTS of cluster 1 is called RTS 1, and so on) to the original time sequences of the cluster. DTW is used as the similarity measure.

The results of this experiment clearly show that the RTS of a particular cluster represents that cluster very well and is not suitable for any other cluster.

Table 5 Cross verification for cluster 1
Table 6 Cross verification for cluster 2
Table 7 Cross verification for cluster 3
Table 8 Cross verification for cluster 4

5.4 Validation

The RTS will be used further to analyze the trends in the series. To validate that the RTS of a cluster and the original sequences of that cluster give approximately the same trend results, we perform a validation test on the RTS.

As our time sequences have 100 data points, we divide each sequence and the RTS of the cluster into 4 segments of 25 data points each. Then, using Sen’s median slope estimator, the magnitude of the trend line of each segment is measured. We assign the letters ‘P’, ‘N’ and ‘L’ to positive, negative and plateau trends, respectively. Table 9 shows the result, which indicates that the pattern formed by the RTS of each cluster approximately follows the pattern formed by the original time series in the respective cluster.
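The segment-wise labelling can be sketched as follows. The text does not state how small a slope must be to count as a plateau, so the tolerance `plateau_tol` is purely an assumption, as are the function names and the synthetic series.

```python
from statistics import median

def sens_slope(series):
    """Sen's median slope: median of (x_j - x_k) / (j - k) over all j > k."""
    n = len(series)
    return median(
        [(series[j] - series[k]) / (j - k)
         for k in range(n - 1) for j in range(k + 1, n)]
    )

def trend_pattern(series, segments=4, plateau_tol=1e-3):
    """Label each equal-length segment 'P', 'N' or 'L' from its Sen's slope."""
    size = len(series) // segments  # trailing points beyond segments * size are ignored
    labels = []
    for s in range(segments):
        slope = sens_slope(series[s * size:(s + 1) * size])
        if slope > plateau_tol:        # assumed threshold, not from the paper
            labels.append("P")
        elif slope < -plateau_tol:
            labels.append("N")
        else:
            labels.append("L")
    return "".join(labels)

# Synthetic 100-point series: rising, flat, falling, rising.
series = (
    [0.1 * t for t in range(25)]
    + [2.4] * 25
    + [2.4 - 0.1 * t for t in range(25)]
    + [0.05 * t for t in range(25)]
)
pattern = trend_pattern(series)
```

Comparing the pattern string of an RTS with the pattern strings of the original sequences of its cluster gives a segment-by-segment comparison of the kind reported in Table 9.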

To validate, we used 172 districts of India. In Table 9, the first column is the cluster number and the second column shows the segment number of the time sequence. The third column (a wide column containing 15 sub-columns for the original time sequences of the cluster) and the fifth column give the trend results of the original sequences and of the RTS, respectively, for each segment. The fourth column contains the number of positive (P), negative (N) and plateau (L) trends that occur in each segment among the original time sequences of the cluster.

Table 9 Trends on representative time sequence with original sequence of cluster

5.5 Comparison of RTS

In Table 9 we also compare the results of the merging algorithm with those of the average algorithm for forming the RTS. In the average algorithm, the RTS is formed by taking the average of all sequences simultaneously. To highlight the results, we use bold font where the merging algorithm and the average algorithm differ. Out of 48 cases, the average algorithm gives the wrong answer 8 times, while the merging algorithm does so 2 times.
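A small worked case shows why the two procedures can produce different sequences. Consider a hypothetical cluster of three sequences \(A\), \(B\) and \(C\) in which \(A\) and \(B\) are the closest pair and are therefore merged first. Applying Eq. (7) twice gives

$$\begin{aligned} t_{i} = \frac{1}{2}\left( \frac{a_{i}+b_{i}}{2} + c_{i}\right) = \frac{a_{i}}{4} + \frac{b_{i}}{4} + \frac{c_{i}}{2} \end{aligned}$$

whereas the average algorithm gives

$$\begin{aligned} t_{i} = \frac{a_{i}+b_{i}+c_{i}}{3} \end{aligned}$$

The merged result therefore weights the sequences unequally, with the sequence that joins last contributing the most, while the simple average weights every sequence equally. The two results coincide only in special cases, for example when \(c_{i}\) equals the mean of \(a_{i}\) and \(b_{i}\).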

In Table 10, we highlight the differences between the results for the RTS formed using the hierarchical merging algorithm and the RTS formed by taking the simple average. The table also includes the possible reason behind each result. Many types of combinations can occur in a cluster of time sequences. We explain some of the combinations that might change the results of the two algorithms:

  • Highly positive (HP): sequence containing high positive values.

  • Highly negative (HN): sequence containing high negative values.

  • Low positive (LP): sequence containing low positive values.

  • Low negative (LN): sequence containing low negative values.

Table 10 Comparison between hierarchical merging vs simple average method

6 Conclusion and future work

Trend analysis is one of the most common subjects in the time-series domain, and time-series data sets are often very large, which makes trend analysis time consuming. To reduce that time to some extent, in this paper we introduced an algorithm that takes clusters of time series as input and generates a representative time sequence for each cluster. In this way, the many time sequences that are similar to each other need not be considered again and again in trend analysis. We also demonstrated the quality of the RTS empirically by calculating its similarity with the original sequences, and derived the trend results for the RTS and the original time sequences of the clusters.

Using the RTS, we can perform many kinds of regression tests and trend detection tests and analyze it to predict future behavior.

In future, we may apply our algorithm to other time-series data sets. The concept of the RTS is not limited to trend analysis; it can also be useful in other analyses performed on time-series data, such as finding regular patterns and forecasting.