Fuzzy Clustering Stability Evaluation of Time Series
Abstract
The discovery of knowledge by analyzing time series is an important field of research. In this paper we investigate multiple multivariate time series, because we expect them to carry more information than a single time series considered in isolation. There are several approaches which make use of Granger causality or cross-correlation in order to analyze the influence of time series on each other. In this paper we extend the idea of mutual influence and present FCSETS (Fuzzy Clustering Stability Evaluation of Time Series), a new approach which makes use of the membership degrees produced by the fuzzy c-means (FCM) algorithm. We first cluster time series per timestamp and then compare the relative assignment agreement (introduced by Eyke Hüllermeier and Maria Rifqi) of all subsequences. This leads us to a stability score for every time series, which can itself be used to evaluate single time series in the data set. It is then used to rate the stability of the entire clustering. The stability score of a time series is higher the more the time series sticks to its peers over time. This not only reveals a new idea of mutual time series impact but also enables the identification of an optimal number of clusters per timestamp. We applied our model to different data, such as financial, country-related economic and generated data, and present the results.
Keywords
Time series analysis, Fuzzy clustering, Evaluation
1 Introduction
However, in some cases the exact course of time series is not relevant but rather the detection of groups of time series that follow the same trend. Additionally, time-dependent information can be meaningful for the identification of patterns or anomalies. For this purpose it is necessary to cluster the time series data per time point, as the comparison of whole (sub-)sequences at once leads to a loss of information. For example, in the case of the Euclidean distance the mean distance over all time points is considered. In the case of Dynamic Time Warping (DTW) the smallest distance is relevant. The information at a single timestamp therefore has barely any impact. The approach of clustering time series per time point enables an advanced analysis of their temporal correlation, since the behavior of sequences relative to their cluster peers can be examined. In the following, this procedure will be called over-time clustering. An example is shown in Fig. 1. Note that, for simplicity, only univariate time series are illustrated. However, over-time clustering is especially valuable for multivariate time series analysis.
Unfortunately, new problems arise, such as the right choice of parameters. Often the comparison of clusterings with different parameter settings is difficult since there is no evaluation function which distinguishes the quality of clusterings properly. In addition, some methods, such as outlier detection, require a good clustering as a basis, whereby the quality can contextually be equated with the stability of the clusters.
In this paper, we focus on multiple multivariate time series of the same length and with equivalent time steps. We introduce an evaluation measure named FCSETS (Fuzzy Clustering Stability Evaluation of Time Series) for the over-time stability of a fuzzy clustering per time point. For this purpose our approach rates the over-time stability of all sequences considering their cluster memberships. To the best of our knowledge this is the first approach that enables the stability evaluation of clusterings and sequences regarding the temporal linkage of clusters.
Over-time clustering can be helpful in many applications. For example, the development of relationships between different terms can be examined when tracking topics in online forums. Another application example is the analysis of financial data. The over-time clustering of different companies’ financial data can be helpful regarding the detection of anomalies or even fraud. If the courses of different companies’ financial data can be divided into groups, e.g. regarding their success, the investigation of clusters and their members’ transitions might be a fundamental step for further analysis. As probably not all fraud cases are known (some may remain uncovered) this problem cannot be solved with fully supervised learning.
The stability evaluation of temporal clusterings offers a great benefit as it not only enables the identification of suitable hyper-parameters for different algorithms but also ensures a reliable clustering as a basis for further analysis.
2 Related Work
In the field of time series analysis, different techniques for clustering time series data were proposed. However, to the best of our knowledge, there does not exist any approach similar to ours. The approaches described in [8, 19, 28] cluster entire sequences of multiple time series. This procedure is not well suited for our context because potential correlations between subsequences of different time series are not revealed. Additionally, the exact course of the time series is not relevant, but rather the trend they show. The problem of not recognizing interrelated subsequences also persists in a popular method where the entire sequences are first transformed to feature vectors and then clustered [17]. Methods for clustering streaming data like the ones proposed in [14] and [25] are not comparable to our method because they consider only one time series at a time and deal with other problems such as high memory requirements and time complexity. Another area related to our work is community detection in dynamic networks. While approaches presented in [12, 13, 26, 36] aim to detect and track local communities in graphs over time, the goal of our method is finding a stable partitioning of time series over the entire period so that time series following the same trend are assigned to the same cluster.
In this section, we first briefly describe the fuzzy c-means clustering algorithm, which we use for clustering time series objects at different time points. Then, we review related work on time-independent evaluation measures for clusterings. Finally, we describe a resampling approach for cluster validation and a fuzzy variant of the Rand index that we use in our method.
2.1 Fuzzy C-Means (FCM)
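As a point of reference, the standard fuzzy c-means iteration (Bezdek [4]) can be sketched as follows. This is a minimal illustration, not the implementation used in the experiments: the function name `fuzzy_c_means`, the fuzzifier default `m=2.0`, the random initialisation from data points, and the stopping rule are all assumptions of this sketch.

```python
import math
import random


def _memberships(points, centers, m):
    """Membership update: u_ij = 1 / sum_c (d_ij / d_ic)^(2 / (m - 1))."""
    exponent = 2.0 / (m - 1.0)
    rows = []
    for p in points:
        dists = [math.dist(p, c) for c in centers]
        if any(d == 0.0 for d in dists):
            # The point coincides with at least one prototype:
            # share its membership among those prototypes only.
            hits = [1.0 if d == 0.0 else 0.0 for d in dists]
            rows.append([h / sum(hits) for h in hits])
        else:
            rows.append([
                1.0 / sum((d_j / d_c) ** exponent for d_c in dists)
                for d_j in dists
            ])
    return rows


def fuzzy_c_means(points, k, m=2.0, max_iter=100, tol=1e-6, seed=0):
    """Return (centers, memberships) for `points`, a list of float vectors.

    memberships[i][j] is the degree to which point i belongs to cluster j;
    every row sums to 1.
    """
    rng = random.Random(seed)
    dim = len(points[0])
    # Assumption of this sketch: start from k randomly chosen data points.
    centers = [list(p) for p in rng.sample(points, k)]
    for _ in range(max_iter):
        u = _memberships(points, centers, m)
        # Prototype update: mean of all points weighted by u_ij^m.
        new_centers = []
        for j in range(k):
            weights = [row[j] ** m for row in u]
            total = sum(weights)
            new_centers.append([
                sum(w * p[d] for w, p in zip(weights, points)) / total
                for d in range(dim)
            ])
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:
            break
    return centers, _memberships(points, centers, m)
```

For over-time clustering, such a routine would be called once per timestamp on the data points of that timestamp, each time with its own number of clusters.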
2.2 Internal Evaluation Measures
Many different external and internal evaluation measures for evaluating clusters and clusterings were proposed in the literature. In the case of the external evaluation, the clustering results are compared with a ground truth which is already known. In the internal evaluation, no information about the actual partitioning of the data set is known, so that the clusters are often evaluated primarily on the basis of characteristics such as compactness and separation.
One metric that evaluates the compactness of clusters is the Sum of Squared Errors. It calculates the overall distance between the data points and the cluster prototype. In the case of fuzzy clustering, these distances are additionally weighted by the membership degrees. The better the data objects are assigned to clusters, the smaller the error, the greater the compactness. However, this measure does not explicitly take the separation of different clusters into account.
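The membership-weighted Sum of Squared Errors described above can be sketched in a few lines. The function name `fuzzy_sse` and the fuzzifier exponent `m` (default 2) are assumptions of this illustration; with crisp memberships and `m = 1` it reduces to the ordinary SSE.

```python
import math


def fuzzy_sse(points, centers, memberships, m=2.0):
    """Sum over all points i and clusters j of u_ij^m * ||x_i - c_j||^2."""
    return sum(
        (memberships[i][j] ** m) * math.dist(p, c) ** 2
        for i, p in enumerate(points)
        for j, c in enumerate(centers)
    )


# Two points that sit exactly on the two prototypes.
points = [[0.0], [2.0]]
centers = [[0.0], [2.0]]
crisp = [[1.0, 0.0], [0.0, 1.0]]   # each point fully in its own cluster
vague = [[0.5, 0.5], [0.5, 0.5]]   # each point split evenly
print(fuzzy_sse(points, centers, crisp))  # 0.0
print(fuzzy_sse(points, centers, vague))  # 2.0
```

The crisp assignment yields no error, whereas the evenly split memberships are penalised because each point contributes a weighted distance to the far prototype.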
There are dozens of fuzzy cluster validity indices that evaluate the compactness as well as the separation of different clusters in the partitioning. Some validity measures use only membership degrees [20, 21], others include the distances between the data points and cluster prototypes [3, 5, 11, 35]. None of these measures can be directly compared to our method because they lack a temporal aspect. However, they can be applied in FCSETS for producing an initial partitioning of a data set for different time points.
2.3 Stability Evaluation
Since we deal with fuzzy partitionings, our approach uses a modified version of the Hüllermeier-Rifqi Index [18]. There are other similarity indices for comparing fuzzy partitions, such as Campello’s Fuzzy Rand Index [6] or Frigui’s Fuzzy Rand Index [10], but they are not reflexive, that is, comparing a fuzzy partition with itself does not necessarily yield a similarity of 1.
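To make the idea behind the Hüllermeier-Rifqi Index concrete, the following sketch computes the unmodified index for two fuzzy partitions of the same objects. It assumes that the distance between two membership vectors is half their L1 distance (one common choice that stays in [0, 1] when memberships sum to 1); the function names are hypothetical, and the modification used in FCSETS is not reproduced here.

```python
from itertools import combinations


def equivalence(u_a, u_b):
    """Degree to which two objects are clustered together in one partition.

    u_a and u_b are the membership vectors of the two objects.
    """
    return 1.0 - 0.5 * sum(abs(a - b) for a, b in zip(u_a, u_b))


def hr_index(U_p, U_q):
    """Similarity of two fuzzy partitions given as membership matrices.

    U_p[i] and U_q[i] are the membership vectors of object i; the two
    partitions may use different numbers of clusters.
    """
    pairs = list(combinations(range(len(U_p)), 2))
    disagreement = sum(
        abs(equivalence(U_p[a], U_p[b]) - equivalence(U_q[a], U_q[b]))
        for a, b in pairs
    )
    return 1.0 - disagreement / len(pairs)


U = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]]
print(hr_index(U, U))  # 1.0, since a partition always agrees with itself
```

Because only pairwise equivalence degrees are compared, the index does not depend on how the clusters are labelled in either partition.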
3 Fundamentals
In this section we clarify our understanding of some basic concepts regarding our approach. For this purpose we supplement the definitions from [32]. Our method considers multivariate time series, so instead of a definition with real values we use the following definition.
Definition 1 (Time Series)
A time series \(T = o_{t_1}, ... , o_{t_n}\) is an ordered set of n real valued data points of arbitrary dimension. The data points are chronologically ordered by their time of recording, with \(t_1\) and \(t_n\) indicating the first and last timestamp, respectively.
Definition 2 (Data Set)
A data set \(D = T_1, ..., T_m\) is a set of m time series of the same length n and equal points in time.
The vectors of all time series are denoted as the set \(O = \{o_{t_1,1},..., o_{t_n,m}\}\), with the second index indicating the time series the data point originates from. We write \(O_{t_i}\) for all data points at a certain point in time.
Definition 3 (Cluster)
A cluster \(C_{t_i, j} \subseteq O_{t_i}\) at time \(t_i\), with \(j \in \{1,...,k_{t_i}\}\) and \(k_{t_i}\) being the number of clusters at time \(t_i\), is a set of similar data points identified by a clustering algorithm.
Definition 4 (Fuzzy Cluster Membership)
The membership degree \(u_{C_{t_i,j}}(o_{t_i,l}) \in [0,1]\) expresses the relative degree of belonging of the data object \(o_{t_i,l}\) of time series \(T_l\) to cluster \(C_{t_i,j}\) at time \(t_i\).
Definition 5 (Fuzzy Time Clustering)
A fuzzy time clustering is the result of a fuzzy clustering algorithm at one timestamp. Concretely, it is the membership matrix \(U_{t_i} = [u_{C_{t_i,j}}(o_{t_i,l})]\).
Definition 6 (Fuzzy Clustering)
A fuzzy clustering of time series is the overall result of a fuzzy clustering algorithm for all timestamps. Concretely, it is the ordered set \(\zeta = U_{t_1}, ..., U_{t_n}\) of all membership matrices.
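The definitions above map naturally onto nested lists. The sketch below is one possible representation, not one prescribed by the paper: the data set is a list of m time series, each a list of n vectors; a fuzzy clustering \(\zeta\) is a list of n membership matrices, with one row per time series. The helper names and the row-per-series orientation are assumptions, as is the check that memberships sum to 1 (which holds for fuzzy c-means).

```python
def points_at(dataset, i):
    """Return O_{t_i}: the i-th data point of every series (0-based i)."""
    return [series[i] for series in dataset]


def is_valid_fuzzy_clustering(zeta, n_series, tol=1e-9):
    """Check that every U_{t_i} has one row per series, with degrees in
    [0, 1] that sum to 1. The number of columns (k_{t_i}) may vary."""
    for U in zeta:
        if len(U) != n_series:
            return False
        for row in U:
            if any(u < -tol or u > 1.0 + tol for u in row):
                return False
            if abs(sum(row) - 1.0) > tol:
                return False
    return True


# Three bivariate time series observed at two timestamps.
dataset = [
    [[0.0, 1.0], [0.1, 1.1]],
    [[0.2, 0.9], [0.3, 1.0]],
    [[5.0, 4.0], [5.2, 4.1]],
]
# Two clusters at the first timestamp, three at the second.
zeta = [
    [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]],
    [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1], [0.1, 0.1, 0.8]],
]
print(points_at(dataset, 0))
print(is_valid_fuzzy_clustering(zeta, n_series=len(dataset)))
```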
4 Method
An obvious disadvantage of creating clusters for every timestamp is the missing temporal link. In our approach we assume that clusterings with different parameter settings show differences in the connectedness of clusters and that this connection can be measured. In order to do so, we make use of a stability function. Given a fuzzy clustering \(\zeta \), we first analyze the behavior of every subsequence of a time series \(T = o_{t_1}, ..., o_{t_i}\), with \(t_i \le t_n\), starting at the first timestamp. In this way we rate the temporal linkage of time series to each other. Time series that are clustered together at all timestamps have a high temporal linkage, while time series which often separate from their cluster peers have a low temporal linkage. One could say we rate the team spirit of the individual time series and therefore their cohesion with other sequences over time. In the example shown in Fig. 2, the time series \(T_a\) and \(T_b\) show a good team spirit because they move together over the entire period of time. In contrast, the time series \(T_c\) and \(T_d\) show a lower temporal linkage. While they are clustered together at time points \(t_i\) and \(t_k\), they are assigned to different clusters in between at time point \(t_j\). After the evaluation of the individual sequences, we assign a score to the fuzzy clustering \(\zeta \), depending on the over-time stability of every time series.
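The notion of temporal linkage can be illustrated with a deliberately simplified score. The sketch below is not necessarily the exact stability function of FCSETS: it assumes that the agreement of two series at a timestamp is one minus half the L1 distance of their membership vectors, and that a series is stable if its agreements with all other series stay the same across every pair of timestamps. All function names are hypothetical.

```python
def agreement(u_a, u_b):
    """Degree to which two series share clusters at one timestamp."""
    return 1.0 - 0.5 * sum(abs(a - b) for a, b in zip(u_a, u_b))


def series_stability(zeta, l):
    """Simplified over-time stability of series l.

    zeta[i][s] is the membership vector of series s at timestamp i.
    The score is 1 if the agreement of l with every other series is
    identical at all timestamps, and drops as those agreements change.
    """
    n_times, n_series = len(zeta), len(zeta[0])
    total, count = 0.0, 0
    for i in range(n_times - 1):
        for r in range(i + 1, n_times):
            for s in range(n_series):
                if s == l:
                    continue
                e_i = agreement(zeta[i][l], zeta[i][s])
                e_r = agreement(zeta[r][l], zeta[r][s])
                total += 1.0 - abs(e_i - e_r)
                count += 1
    return total / count


def clustering_stability(zeta):
    """Score of the whole clustering: mean stability of all series."""
    n_series = len(zeta[0])
    return sum(series_stability(zeta, l) for l in range(n_series)) / n_series


# Four series (a, b, c, d) at three timestamps; d leaves c at the
# middle timestamp and joins a and b, then returns.
toy = [
    [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]],
    [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 0.0]],
    [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]],
]
scores = [series_stability(toy, l) for l in range(4)]
print(scores, clustering_stability(toy))
```

In this toy data the series that changes its cluster receives the lowest score, and the overall score falls below 1 because every other series is affected by that transition.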
5 Experiments
In the following, we present the results on an artificially generated data set that demonstrates a meaningful usage of our measure and shows the impact of the stability evaluation. Additionally, we discuss experiments on two real-world data sets. One consists of financial figures from balance sheets and the other one contains country-related economic data. In all cases fuzzy c-means was used with different parameter combinations for the number of clusters per time point.
5.1 Artificially Generated Data Set
Table 1. Stability scores for the generated data set depending on \(k_{t_i}\).
\({k_{t_1}}\) | \({k_{t_2}}\) | \({k_{t_3}}\) | \({k_{t_4}}\) | FCSETS score |
---|---|---|---|---|
2 | 2 | 2 | 2 | 0.995 |
2 | 3 | 2 | 2 | 0.951 |
2 | 3 | 3 | 2 | 0.876 |
2 | 3 | 3 | 3 | 0.829 |
3 | 3 | 2 | 2 | 0.967 |
3 | 3 | 3 | 3 | 0.9 |
2 | 3 | 4 | 5 | 0.71 |
5 | 3 | 4 | 2 | 0.908 |
3 | 10 | 3 | 10 | 0.577 |
To find the best stability score for the data set, FCM was used with various settings for the number of clusters per time point. All combinations with \(k_{t_i} \in [2,5]\) were investigated. Figure 3 shows the resulting fuzzy clustering with the highest FCSETS score of 0.995. For illustration purposes the clustering was defuzzified. Although it might seem intuitive to use a partitioning with three clusters at time points 1 and 2, regarding the over-time stability it is beneficial to choose only two clusters. This can be explained by the fact that there are time series that move between the two apparent groups of the upper (blue) cluster. The stability is therefore higher when these two groups are clustered together.
In Table 1 a part of the corresponding scores for the different parameter settings of \(k_{t_i}\) is listed. As shown in Fig. 3, the best score is achieved with \(k_{t_i}\) being set to 2 for all time points. Within the investigated range, the worst score results from the setting \(k_{t_1} = 2\), \(k_{t_2} = 3\), \(k_{t_3} = 4\) and \(k_{t_4} = 5\). The score is not only decreased because the upper (blue) cluster is divided in this case, but also because the number of clusters varies and therefore sequences get separated from their peers. The stability score is clearly negatively affected if the number of clusters changes significantly over time. This influence is also expressed by the score of 0.577 for the extreme example in the last row.
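The exhaustive search over cluster counts described above can be sketched as follows. The scoring callback `score_fn` is a hypothetical placeholder: in practice it would cluster each timestamp with the given number of clusters and return the stability score of the resulting fuzzy clustering. With four admissible values per timestamp, the search covers 4^4 = 256 combinations for four timestamps and 4^8 = 65,536 for eight.

```python
from itertools import product


def best_cluster_counts(n_timestamps, score_fn, k_min=2, k_max=5):
    """Try every combination of cluster counts and keep the best one.

    score_fn takes a tuple (k_{t_1}, ..., k_{t_n}) and returns a score,
    where higher means more stable.
    """
    best_ks, best_score = None, float("-inf")
    for ks in product(range(k_min, k_max + 1), repeat=n_timestamps):
        score = score_fn(ks)
        if score > best_score:
            best_ks, best_score = ks, score
    return best_ks, best_score


# Stand-in score for demonstration only: prefers two clusters everywhere.
def toy_score(ks):
    return -sum((k - 2) ** 2 for k in ks)


print(best_cluster_counts(4, toy_score))  # ((2, 2, 2, 2), 0)
```

Since the number of combinations grows exponentially with the number of timestamps, caching the per-timestamp clusterings for each value of k avoids recomputing them for every combination.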
5.2 EIKON Financial Data Set
Table 2. Stability scores for the EIKON financial data set depending on \(k_{t_i}\).
\({k_{t_1}}\) | \({k_{t_2}}\) | \({k_{t_3}}\) | \({k_{t_4}}\) | \({k_{t_5}}\) | \({k_{t_6}}\) | \({k_{t_7}}\) | \({k_{t_8}}\) | FCSETS score |
---|---|---|---|---|---|---|---|---|
2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 0.929 |
3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 0.9 |
3 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | **0.945** |
5 | 4 | 3 | 2 | 2 | 2 | 2 | 2 | 0.924 |
2 | 2 | 4 | 3 | 2 | 4 | 5 | 5 | 0.72 |
We generated the clusterings for all combinations of \(k_{t_i}\) from two to five clusters per timestamp. Selected results can be seen in Table 2. The actual maximum retrieved from the iterations (in the third row) is printed bold. The worst score can be found in the last row and represents an unstable clustering. It can be seen that the underlying data is well separated into three clusters at the first point in time and into two clusters at the following timestamps. This is a rare case, but it can be explained by the selection of features and companies: TR-TtlPlanExpectedReturn is rarely provided by Thomson Reuters, and we only chose companies with complete data for all regarded points in time. This may have reduced the number of companies that might have lower membership degrees.
5.3 GlobalEconomy Data Set
Table 3. Stability scores for the GlobalEconomy data set depending on \(k_{t_i}\).
\({k_{t_1}}\) | \({k_{t_2}}\) | \({k_{t_3}}\) | \({k_{t_4}}\) | \({k_{t_5}}\) | \({k_{t_6}}\) | \({k_{t_7}}\) | \({k_{t_8}}\) | FCSETS score |
---|---|---|---|---|---|---|---|---|
2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | **0.978** |
3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 0.963 |
3 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 0.945 |
5 | 3 | 4 | 2 | 2 | 2 | 2 | 2 | 0.955 |
2 | 3 | 2 | 2 | 4 | 5 | 5 | 5 | 0.837 |
The results are shown in Table 3. It can be seen that the best score is achieved with two clusters at every point in time. Evidently the chosen countries can be well separated into two groups at every point in time. More clusters or different numbers of clusters for different timestamps performed worse. In this experiment we also iterated over all combinations of \(k_{t_i}\) for the given points in time. The bold printed maximum, and the minimum, which can be found in the last row of the table, represent the actual maximum and minimum within the range of the iterated combinations.
6 Conclusion and Future Work
In this paper we presented a new method for analyzing multiple multivariate time series with the help of fuzzy clustering per timestamp. Our approach defines a new target function for sequence-based clustering tasks, namely the stability of sequences. In our experiments we have shown that this enables the identification of optimal \(k_{t_i}\)s per timestamp and that our measure can not only rate time series and clusterings but also can be used to evaluate the stability of data sets. The latter is possible by examining the maximum achieved FCSETS score. Our approach can be applied whenever similar behavior for groups of time series can be assumed. As it is based on membership degrees, clusterings with overlapping clusters and soft transitions can be handled. With the help of our evaluation measure a stable over-time clustering can be achieved, which can be used for further analysis such as outlier detection.
Future work could include the development of a fuzzy clustering algorithm which is based on our formulated target function. The temporal linkage could then already be taken into account when determining groups of time series. Another interesting field of research could be the examination of other fuzzy clustering algorithms like the Possibilistic Fuzzy c-Means algorithm [27]. This algorithm can also handle outliers, which can be useful for certain data sets. In the experiment with the GlobalEconomy data set we faced the problem that one outlier would form a cluster on its own at every point in time. This led to very high FCSETS scores. Proper handling of outliers could overcome such misbehavior. Future work should also include the application of our approach to incomplete data, since appropriate fuzzy clustering approaches already exist [15, 16, 33]. We faced this problem when applying our algorithm to the EIKON financial data set. Also, the identification of time series that show a good team spirit for a specific time period could be useful in some applications and might therefore be investigated. Finally, the examination and optimization of FCSETS’ computational complexity would be of great interest, as it currently seems to be fairly high.
Notes
Acknowledgement
We thank the Jürgen Manchot Foundation, which supported this work by funding the AI research group Decision-making with the help of Artificial Intelligence at Heinrich Heine University Düsseldorf.
References
- 1. Global economy, world economy. https://www.theglobaleconomy.com/
- 2. Banerjee, A., Ghosh, J.: Clickstream clustering using weighted longest common subsequences. In: Proceedings of the Web Mining Workshop at the 1st SIAM Conference on Data Mining, pp. 33–40 (2001)
- 3. Beringer, J., Hüllermeier, E.: Adaptive optimization of the number of clusters in fuzzy clustering. In: Proceedings of the IEEE International Conference on Fuzzy Systems, pp. 1–6 (2007)
- 4. Bezdek, J.C.: Pattern Recognition with Fuzzy Objective Function Algorithms. Kluwer Academic Publishers, Norwell (1981)
- 5. Bouguessa, M., Wang, S., Sun, H.: An objective approach to cluster validation. Pattern Recogn. Lett. 27, 1419–1430 (2006)
- 6. Campello, R.: A fuzzy extension of the rand index and other related indexes for clustering and classification assessment. Pattern Recogn. Lett. 28(7), 833–841 (2007)
- 7. Dunn, J.C.: A fuzzy relative of the isodata process and its use in detecting compact well-separated clusters. J. Cybern. 3(3), 32–57 (1973)
- 8. Ernst, J., Nau, G.J., Bar-Joseph, Z.: Clustering short time series gene expression data. Bioinformatics 21(suppl-1), i159–i168 (2005)
- 9. Ferreira, L.N., Zhao, L.: Time series clustering via community detection in networks. Inf. Sci. 326, 227–242 (2016)
- 10. Frigui, H., Hwang, C., Rhee, F.C.H.: Clustering and aggregation of relational data with applications to image database categorization. Pattern Recogn. 40(11), 3053–3068 (2007)
- 11. Fukuyama, Y., Sugeno, M.: A new method of choosing the number of clusters for the fuzzy c-mean method. In: Proceedings of the 5th Fuzzy Systems Symposium, pp. 247–250 (1989)
- 12. Granell, C., Darst, R., Arenas, A., Fortunato, S., Gomez, S.: Benchmark model to assess community structure in evolving networks. Phys. Rev. E 92, 012805 (2015)
- 13. Greene, D., Doyle, D., Cunningham, P.: Tracking the evolution of communities in dynamic social networks. In: Proceedings - 2010 International Conference on Advances in Social Network Analysis and Mining, ASONAM 2010, vol. 2010, pp. 176–183 (2010)
- 14. Guha, S., Meyerson, A., Mishra, N., Motwani, R., O’Callaghan, L.: Clustering data streams: theory and practice. IEEE Trans. Knowl. Data Eng. 15(3), 515–528 (2003)
- 15. Hathaway, R., Bezdek, J.: Fuzzy c-means clustering of incomplete data. IEEE Trans. Syst. Man Cybern. Part B (Cybern.) 31, 735–44 (2001)
- 16. Himmelspach, L., Conrad, S.: Fuzzy c-means clustering of incomplete data using dimension-wise fuzzy variances of clusters. In: Carvalho, J.P., Lesot, M.-J., Kaymak, U., Vieira, S., Bouchon-Meunier, B., Yager, R.R. (eds.) IPMU 2016. CCIS, vol. 610, pp. 699–710. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-40596-4_58
- 17. Huang, X., Ye, Y., Xiong, L., Lau, R.Y., Jiang, N., Wang, S.: Time series k-means: a new k-means type smooth subspace clustering for time series data. Inf. Sci. 367–368, 1–13 (2016)
- 18. Hüllermeier, E., Rifqi, M.: A fuzzy variant of the rand index for comparing clustering structures. In: Proceedings of the Joint 2009 International Fuzzy Systems Association World Congress and 2009 European Society of Fuzzy Logic and Technology Conference, pp. 1294–1298 (2009)
- 19. Izakian, H., Pedrycz, W., Jamal, I.: Fuzzy clustering of time series data using dynamic time warping distance. Eng. Appl. Artif. Intell. 39, 235–244 (2015)
- 20. Kim, Y.I., Kim, D.W., Lee, D., Lee, K.: A cluster validation index for GK cluster analysis based on relative degree of sharing. Inf. Sci. 168, 225–242 (2004)
- 21. Le Capitaine, H., Frelicot, C.: A cluster-validity index combining an overlap measure and a separation measure based on fuzzy-aggregation operators. IEEE Trans. Fuzzy Syst. 19, 580–588 (2011)
- 22. Lloyd, S.: Least squares quantization in PCM. IEEE Trans. Inf. Theory 28(2), 129–137 (1982)
- 23. MacQueen, J.: Some methods for classification and analysis of multivariate observations. In: Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, pp. 281–297. University of California Press (1967)
- 24. Möller-Levet, C.S., Klawonn, F., Cho, K.-H., Wolkenhauer, O.: Fuzzy clustering of short time-series and unevenly distributed sampling points. In: Berthold, M.R., Lenz, H.-J., Bradley, E., Kruse, R., Borgelt, C. (eds.) IDA 2003. LNCS, vol. 2810, pp. 330–340. Springer, Heidelberg (2003). https://doi.org/10.1007/978-3-540-45231-7_31
- 25. O’Callaghan, L., Mishra, N., Meyerson, A., Guha, S., Motwani, R.: Streaming-data algorithms for high-quality clustering. In: Proceedings of IEEE International Conference on Data Engineering, p. 685 (2001)
- 26. Orlinski, M., Filer, N.: The rise and fall of spatio-temporal clusters in mobile ad hoc networks. Ad Hoc Netw. 11(5), 1641–1654 (2013)
- 27. Pal, N., Pal, K., Keller, J., Bezdek, J.: A possibilistic fuzzy c-means clustering algorithm. IEEE Trans. Fuzzy Syst. 13, 517–530 (2005)
- 28. Paparrizos, J., Gravano, L.: k-shape: efficient and accurate clustering of time series. In: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, SIGMOD 2015, pp. 1855–1870. ACM, New York (2015)
- 29. Rand, W.M.: Objective criteria for the evaluation of clustering methods. J. Am. Stat. Assoc. 66(336), 846–850 (1971)
- 30. Roth, V., Lange, T., Braun, M., Buhmann, J.: A resampling approach to cluster validation. In: Härdle, W., Rönz, B. (eds.) COMPSTAT, pp. 123–128. Springer, Heidelberg (2002). https://doi.org/10.1007/978-3-642-57489-4_13
- 31. Runkler, T.A.: Comparing partitions by subset similarities. In: Hüllermeier, E., Kruse, R., Hoffmann, F. (eds.) IPMU 2010. LNCS (LNAI), vol. 6178, pp. 29–38. Springer, Heidelberg (2010). https://doi.org/10.1007/978-3-642-14049-5_4
- 32. Tatusch, M., Klassen, G., Bravidor, M., Conrad, S.: Show me your friends and I’ll tell you who you are. Finding anomalous time series by conspicuous cluster transitions. In: Le, T.D., et al. (eds.) AusDM 2019. CCIS, vol. 1127, pp. 91–103. Springer, Singapore (2019). https://doi.org/10.1007/978-981-15-1699-3_8
- 33. Timm, H., Döring, C., Kruse, R.: Different approaches to fuzzy clustering of incomplete datasets. Int. J. Approx. Reason. 35, 239–249 (2004)
- 34. Truong, C.D., Anh, D.T.: A novel clustering-based method for time series motif discovery under time warping measure. Int. J. Data Sci. Anal. 4(2), 113–126 (2017). https://doi.org/10.1007/s41060-017-0060-3
- 35. Xie, X.L., Beni, G.: A validity measure for fuzzy clustering. IEEE Trans. Pattern Anal. Mach. Intell. 13(8), 841–847 (1991)
- 36. Zakrzewska, A., Bader, D.: A dynamic algorithm for local community detection in graphs. In: 2015 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), pp. 559–564 (2015)