Iterative Incremental Clustering of Time Series

  • Jessica Lin
  • Michail Vlachos
  • Eamonn Keogh
  • Dimitrios Gunopulos
Part of the Lecture Notes in Computer Science book series (LNCS, volume 2992)

Abstract

We present a novel anytime version of partitional clustering algorithms, such as k-Means and EM, for time series. The algorithm works by leveraging the multi-resolution property of wavelets. The dilemma of choosing the initial centers is mitigated by initializing the centers at each approximation level with the final centers returned by the coarser representation. In addition to casting the clustering algorithms as anytime algorithms, this approach has two other very desirable properties. First, by working at lower dimensionalities we can efficiently avoid local minima, so the quality of the clustering is usually better than that of the batch algorithm. Second, even if the algorithm is run to completion, our approach is much faster than its batch counterpart. We explain and empirically demonstrate these surprising and desirable properties with comprehensive experiments on several publicly available real data sets. We further demonstrate that our approach can be generalized to a framework covering a much broader range of algorithms and data mining problems.
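
The coarse-to-fine idea described in the abstract can be illustrated with a minimal sketch for k-Means. It is not the authors' implementation: the function names (`haar_approx`, `kmeans`, `multires_kmeans`), the random choice of initial centers, the starting resolution of 2, and the requirement that the series length be a power of two are all assumptions made for illustration. Block averages stand in for the Haar wavelet approximation, which they match up to a per-level scale factor.

```python
import numpy as np

def haar_approx(X, dims):
    """Reduce each series to `dims` values by averaging equal-sized blocks.
    Assumes the series length is divisible by `dims`."""
    n, length = X.shape
    return X.reshape(n, dims, length // dims).mean(axis=2)

def assign(X, centers):
    """Label each row of X with the index of its nearest center."""
    dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)

def kmeans(X, centers, max_iter=100):
    """Plain Lloyd-style k-Means starting from the given centers."""
    for _ in range(max_iter):
        labels = assign(X, centers)
        new_centers = centers.copy()
        for j in range(len(centers)):
            members = X[labels == j]
            if len(members):  # an empty cluster keeps its old center
                new_centers[j] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, assign(X, centers)

def multires_kmeans(X, k, start_dims=2, seed=0):
    """Yield (dims, centers, labels) after each resolution level.
    Assumes the length of X is start_dims times a power of two."""
    rng = np.random.default_rng(seed)
    n, length = X.shape
    dims = start_dims
    # Hypothetical initialization: k random series at the coarsest level.
    centers = haar_approx(X, dims)[rng.choice(n, size=k, replace=False)]
    while True:
        centers, labels = kmeans(haar_approx(X, dims), centers)
        yield dims, centers, labels  # anytime: a usable answer at every level
        if dims == length:
            break
        dims *= 2
        # Reuse the final centers as the initial centers one level finer.
        centers = np.repeat(centers, 2, axis=1)
```

A caller can stop consuming the generator at any level and keep the most recent labels, which is what makes the procedure anytime. Most of the iterations happen on short, coarse vectors, and the finer levels start from centers that are already close to their final positions.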

Copyright information

© Springer-Verlag Berlin Heidelberg 2004

Authors and Affiliations

  • Jessica Lin (1)
  • Michail Vlachos (1)
  • Eamonn Keogh (1)
  • Dimitrios Gunopulos (1)

  1. Computer Science & Engineering Department, University of California, Riverside, Riverside