Knowledge and Information Systems, Volume 15, Issue 2, pp 181–214

Tracking clusters in evolving data streams over sliding windows

  • Aoying Zhou (Email author)
  • Feng Cao
  • Weining Qian
  • Cheqing Jin
Regular Paper

Abstract

Mining data streams poses great challenges because memory is limited and queries must be answered in real time. Clustering an evolving data stream is especially interesting because it captures not only the changing distribution of clusters but also the evolving behavior of individual clusters. In this paper, we present a novel method for tracking the evolution of clusters over sliding windows. In our SWClustering algorithm, we combine the exponential histogram with temporal cluster features to propose a novel data structure, the Exponential Histogram of Cluster Features (EHCF). The exponential histogram handles the in-cluster evolution, and the temporal cluster features represent the change of the cluster distribution. Our approach has several advantages over existing methods: (1) the quality of the clusters is improved because the EHCF captures the distribution of recent records precisely; (2) compared with previous methods, the mechanism used to adaptively maintain the in-cluster synopsis tracks cluster evolution better while consuming much less memory; (3) the EHCF provides a flexible framework for analyzing cluster evolution and for tracking a specific cluster efficiently without interfering with other clusters, thus reducing the computing resources consumed by data stream clustering. Both theoretical analysis and extensive experiments show the effectiveness and efficiency of the proposed method.
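
The abstract does not spell out how an EHCF is laid out, so the following is only a minimal sketch of the general idea under stated assumptions, not the authors' implementation. It assumes a temporal cluster feature holds a count, per-dimension linear and squared sums, and the timestamp of its newest record; that at most ceil(1/eps) + 1 features of each size (1, 2, 4, ...) are kept, as in a standard exponential histogram; and that the window is count-based. The names `TCF`, `EHCF`, `eps`, and `window` are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class TCF:
    """Temporal cluster feature (assumed layout): additive summary of n records."""
    n: int            # number of records summarised
    ls: List[float]   # per-dimension linear sum
    ss: List[float]   # per-dimension sum of squares
    t: int            # timestamp of the most recent record in this feature

    def merge(self, other: "TCF") -> "TCF":
        # Cluster features are additive; the merged feature keeps the newest timestamp.
        return TCF(self.n + other.n,
                   [a + b for a, b in zip(self.ls, other.ls)],
                   [a + b for a, b in zip(self.ss, other.ss)],
                   max(self.t, other.t))


class EHCF:
    """Exponential histogram of TCFs for one cluster (illustrative sketch)."""

    def __init__(self, eps: float, window: int):
        self.cap = math.ceil(1 / eps) + 1  # assumed limit on features per size
        self.window = window               # assumed count-based window length
        self.buckets: List[TCF] = []       # oldest first, sizes non-increasing

    def insert(self, x: Sequence[float], t: int) -> None:
        self.buckets.append(TCF(1, list(x), [v * v for v in x], t))
        size = 1
        while True:
            idx = [i for i, b in enumerate(self.buckets) if b.n == size]
            if len(idx) <= self.cap:
                break
            i = idx[0]  # the two oldest features of this size are adjacent
            self.buckets[i:i + 2] = [self.buckets[i].merge(self.buckets[i + 1])]
            size *= 2   # the merge may overflow the next size level
        # Drop features whose newest record has left the window.
        while self.buckets and self.buckets[0].t <= t - self.window:
            self.buckets.pop(0)

    def summary(self) -> TCF:
        total = self.buckets[0]
        for b in self.buckets[1:]:
            total = total.merge(b)
        return total


if __name__ == "__main__":
    h = EHCF(eps=0.5, window=100)
    for t in range(1, 11):
        h.insert([float(t)], t)
    print([b.n for b in h.buckets])  # [4, 2, 2, 1, 1]
    print(h.summary().n)             # 10
```

In this sketch, only the oldest feature can straddle the window boundary, so the error in the summary is bounded by the size of that one feature. With `window=5` the same ten inserts leave features of sizes 2, 2, 1, 1, which cover six records rather than five.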

Keywords

Cluster tracking · Evolving · Data streams · Sliding windows

Copyright information

© Springer-Verlag London Limited 2007

Authors and Affiliations

  • Aoying Zhou (1), Email author
  • Feng Cao (2)
  • Weining Qian (1)
  • Cheqing Jin (1, 3)

  1. Department of Computer Science and Engineering, Fudan University, Shanghai, P.R. China
  2. IBM China Research Lab, Beijing, P.R. China
  3. Department of Computer Science, East China University of Science and Technology, Shanghai, P.R. China