Tracking clusters in evolving data streams over sliding windows

  • Regular Paper
Knowledge and Information Systems

Abstract

Mining data streams poses great challenges because memory is limited and queries must be answered in real time. Clustering an evolving data stream is especially interesting because it captures not only the changing distribution of clusters but also the evolving behavior of individual clusters. In this paper, we present a novel method for tracking the evolution of clusters over sliding windows. In our SWClustering algorithm, we combine the exponential histogram with temporal cluster features to form a novel data structure, the Exponential Histogram of Cluster Features (EHCF). The exponential histogram handles the in-cluster evolution, and the temporal cluster features represent the change of the cluster distribution. Our approach has several advantages over existing methods: (1) the quality of the clusters is improved because the EHCF captures the distribution of recent records precisely; (2) the mechanism used to adaptively maintain the in-cluster synopsis tracks cluster evolution better than previous methods while consuming much less memory; (3) the EHCF provides a flexible framework for analyzing cluster evolution and for tracking a specific cluster efficiently without interfering with other clusters, which reduces the computing resources needed for data stream clustering. Both theoretical analysis and extensive experiments show the effectiveness and efficiency of the proposed method.
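
To make the idea of an exponential histogram of cluster features more concrete, the following Python sketch summarizes one cluster over a sliding window. It is an illustration under stated assumptions, not the SWClustering implementation: the abstract does not give the EHCF definition, so the bucket layout (a DGIM-style exponential histogram), the contents of a temporal cluster feature (count, linear sum, squared sum, latest timestamp), the one-dimensional records, and all names (`TCF`, `EHCF`, `max_per_size`, `summary`) are hypothetical choices made for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TCF:
    """Assumed temporal cluster feature: an additive summary of 1-D records."""
    n: int      # number of records summarized
    ls: float   # linear sum of the records
    ss: float   # sum of the squared records
    t: int      # timestamp of the most recent record

    def merge(self, other: "TCF") -> "TCF":
        # Cluster features are additive, so merging is component-wise.
        return TCF(self.n + other.n, self.ls + other.ls,
                   self.ss + other.ss, max(self.t, other.t))


class EHCF:
    """Hypothetical exponential histogram of TCF buckets for one cluster."""

    def __init__(self, window: int, max_per_size: int = 2):
        self.window = window              # sliding window length in time units
        self.max_per_size = max_per_size  # allowed buckets per bucket size
        self.buckets: List[TCF] = []      # oldest bucket first

    def insert(self, x: float, t: int) -> None:
        # Drop buckets whose most recent record has left the window.
        self.buckets = [b for b in self.buckets if b.t > t - self.window]
        self.buckets.append(TCF(1, x, x * x, t))
        # Cascade merges: when a size has too many buckets, merge the two oldest.
        size = 1
        while True:
            idx = [i for i, b in enumerate(self.buckets) if b.n == size]
            if len(idx) <= self.max_per_size:
                break
            i, j = idx[0], idx[1]  # same-size buckets are adjacent
            self.buckets[i:j + 1] = [self.buckets[i].merge(self.buckets[j])]
            size *= 2

    def summary(self) -> TCF:
        # Merge all live buckets into one feature for the whole window.
        total = TCF(0, 0.0, 0.0, 0)
        for b in self.buckets:
            total = total.merge(b)
        return total


eh = EHCF(window=100)
for i in range(1, 11):
    eh.insert(float(i), i)
print([b.n for b in eh.buckets])  # bucket sizes, oldest first: [4, 2, 2, 1, 1]
```

In this sketch recent records sit in small buckets and older records in exponentially larger ones, so the number of buckets grows only logarithmically with the window size, and only the oldest bucket can be partially expired. The cluster center over the window can be read from the summary as `ls / n`.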

Author information

Corresponding author

Correspondence to Aoying Zhou.

Additional information

Aoying Zhou is currently a Professor of Computer Science at Fudan University, Shanghai, P.R. China. He received his Bachelor's and Master's degrees in Computer Science from Sichuan University in Chengdu, Sichuan, P.R. China, in 1985 and 1988, respectively, and his Ph.D. degree from Fudan University in 1993. He has served as a member or chair of the program committees of many international conferences, such as WWW, SIGMOD, VLDB, EDBT, ICDCS, ER, DASFAA, PAKDD, and WAIM. His papers have been published in ACM SIGMOD, VLDB, ICDE, and several international journals. His research interests include data mining and knowledge discovery, XML data management, Web mining and searching, data stream analysis and processing, and peer-to-peer computing.

Feng Cao is currently an R&D engineer at IBM China Research Laboratories. He received a B.E. degree from Xi'an Jiao Tong University, Xi'an, P.R. China, in 2000 and an M.E. degree from Huazhong University of Science and Technology, Wuhan, P.R. China, in 2003. From October 2004 to March 2005, he worked at the Fudan-NUS Competency Center for Peer-to-Peer Computing, Singapore. In 2006, he received his Ph.D. degree from Fudan University, Shanghai, P.R. China. His current research interests include data mining and data streams.

Weining Qian is currently an Assistant Professor of Computer Science at Fudan University, Shanghai, P.R. China. He received his M.S. and Ph.D. degrees in computer science from Fudan University in 2001 and 2004, respectively. He is supported by the Shanghai Rising-Star Program under Grant No. 04QMX1404 and the National Natural Science Foundation of China (NSFC) under Grant No. 60673134. He has served as a program committee member of several international conferences, including DASFAA 2006, 2007 and 2008, APWeb/WAIM 2007, INFOSCALE 2007, and ECDM 2007. His papers have been published in ICDE, SIAM DM, and CIKM. His research interests include data stream query processing and mining, and large-scale distributed computing for database applications.

Cheqing Jin is currently an Assistant Professor of Computer Science at East China University of Science and Technology. He received his Bachelor's and Master's degrees in Computer Science from Zhejiang University in Hangzhou, P.R. China, in 1999 and 2002, respectively, and his Ph.D. degree from Fudan University, Shanghai, P.R. China. He worked as a Research Assistant at the E-business Technology Institute, the Hong Kong University, from December 2003 to May 2004. His current research interests include data mining and data streams.

About this article

Cite this article

Zhou, A., Cao, F., Qian, W. et al. Tracking clusters in evolving data streams over sliding windows. Knowl Inf Syst 15, 181–214 (2008). https://doi.org/10.1007/s10115-007-0070-x

  • DOI: https://doi.org/10.1007/s10115-007-0070-x
