Tracking clusters in evolving data streams over sliding windows

Zhou, Aoying; Cao, Feng; Qian, Weining; Jin, Cheqing

doi:10.1007/s10115-007-0070-x

Regular Paper
Published: 09 March 2007

Volume 15, pages 181–214, (2008)
Knowledge and Information Systems

Aoying Zhou¹, Feng Cao², Weining Qian¹ & Cheqing Jin¹,³

Abstract

Mining data streams poses great challenges because memory is limited and queries must be answered in real time. Clustering an evolving data stream is especially interesting because it captures not only the changing distribution of clusters but also the evolving behavior of individual clusters. In this paper, we present a novel method for tracking the evolution of clusters over sliding windows. In our SWClustering algorithm, we combine the exponential histogram with temporal cluster features to form a novel data structure, the Exponential Histogram of Cluster Features (EHCF). The exponential histogram handles the in-cluster evolution, and the temporal cluster features represent the change of the cluster distribution. Our approach has several advantages over existing methods: (1) the quality of the clusters is improved because the EHCF captures the distribution of recent records precisely; (2) compared with previous methods, the mechanism that adaptively maintains the in-cluster synopsis tracks cluster evolution better while consuming much less memory; (3) the EHCF provides a flexible framework for analyzing cluster evolution and for tracking a specific cluster efficiently without interfering with other clusters, thus reducing the computing resources consumed by data stream clustering. Both theoretical analysis and extensive experiments show the effectiveness and efficiency of the proposed method.
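The idea of pairing an exponential histogram with temporal cluster features can be illustrated with a small sketch. The abstract does not give the exact definitions, so everything below is an assumption made for illustration: the class names TCF and EHCF, the fields (count, linear sum, squared sum, newest timestamp), the parameters eps and window, and the merge rule, which follows the generic exponential-histogram pattern of allowing at most ceil(1/eps) + 1 buckets of the same size and merging the two oldest when that cap is exceeded. It is not the SWClustering algorithm itself.

```python
from dataclasses import dataclass
from math import ceil
from typing import List, Optional, Sequence


@dataclass
class TCF:
    """Hypothetical temporal cluster feature for a set of records."""
    n: int            # number of records summarized
    ls: List[float]   # per-dimension linear sum
    ss: List[float]   # per-dimension squared sum
    t: int            # timestamp of the newest record summarized

    @staticmethod
    def of(point: Sequence[float], t: int) -> "TCF":
        return TCF(1, list(point), [v * v for v in point], t)

    def merge(self, other: "TCF") -> "TCF":
        # Cluster features are additive, so merging is a field-wise sum.
        return TCF(
            self.n + other.n,
            [a + b for a, b in zip(self.ls, other.ls)],
            [a + b for a, b in zip(self.ss, other.ss)],
            max(self.t, other.t),
        )


class EHCF:
    """Sketch of an exponential histogram whose buckets are TCFs."""

    def __init__(self, eps: float, window: int) -> None:
        self.limit = ceil(1 / eps) + 1  # assumed cap on same-size buckets
        self.window = window            # time-based window length
        self.buckets: List[TCF] = []    # oldest bucket first

    def expire(self, now: int) -> None:
        # Drop buckets whose newest record has left the window (now - window, now].
        while self.buckets and self.buckets[0].t <= now - self.window:
            self.buckets.pop(0)

    def insert(self, point: Sequence[float], t: int) -> None:
        self.expire(t)
        self.buckets.append(TCF.of(point, t))
        size = 1
        while True:
            idx = [i for i, b in enumerate(self.buckets) if b.n == size]
            if len(idx) <= self.limit:
                break
            # Same-size buckets are contiguous; merge the two oldest ones.
            i, j = idx[0], idx[1]
            self.buckets[i:j + 1] = [self.buckets[i].merge(self.buckets[j])]
            size *= 2

    def summary(self) -> Optional[TCF]:
        # Merge all live buckets into one feature describing the cluster.
        total: Optional[TCF] = None
        for b in self.buckets:
            total = b if total is None else total.merge(b)
        return total


if __name__ == "__main__":
    eh = EHCF(eps=0.5, window=100)
    for t in range(1, 11):
        eh.insert([float(t)], t)
    s = eh.summary()
    print([b.n for b in eh.buckets], s.n, s.ls[0] / s.n)
```

In this sketch older records end up in coarser buckets while recent records stay in fine-grained ones, so expiring the oldest bucket removes only a bounded fraction of the summarized records. Each cluster would own its own structure of this kind, which is one way to read the abstract's claim that a specific cluster can be tracked without touching the others.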

Author information

Authors and Affiliations

Department of Computer Science and Engineering, Fudan University, Shanghai, 200433, P.R. China
Aoying Zhou, Weining Qian & Cheqing Jin
IBM China Research Lab, Beijing, 100094, P.R. China
Feng Cao
Department of Computer Science, East China University of Science and Technology, Shanghai, 200237, P.R. China
Cheqing Jin

Corresponding author

Correspondence to Aoying Zhou.

Additional information

Aoying Zhou is currently a Professor of Computer Science at Fudan University, Shanghai, P.R. China. He received his Bachelor's and Master's degrees in Computer Science from Sichuan University in Chengdu, Sichuan, P.R. China in 1985 and 1988, respectively, and his Ph.D. degree from Fudan University in 1993. He has served as a member or chair of the program committee for many international conferences, such as WWW, SIGMOD, VLDB, EDBT, ICDCS, ER, DASFAA, PAKDD, and WAIM. His papers have been published in ACM SIGMOD, VLDB, ICDE, and several international journals. His research interests include data mining and knowledge discovery, XML data management, Web mining and searching, data stream analysis and processing, and peer-to-peer computing.

Feng Cao is currently an R&D engineer at IBM China Research Laboratories. He received a B.E. degree from Xi'an Jiao Tong University, Xi'an, P.R. China, in 2000 and an M.E. degree from Huazhong University of Science and Technology, Wuhan, P.R. China, in 2003. From October 2004 to March 2005, he worked at the Fudan-NUS Competency Center for Peer-to-Peer Computing, Singapore. In 2006, he received his Ph.D. degree from Fudan University, Shanghai, P.R. China. His current research interests include data mining and data streams.

Weining Qian is currently an Assistant Professor of Computer Science at Fudan University, Shanghai, P.R. China. He received his M.S. and Ph.D. degrees in computer science from Fudan University in 2001 and 2004, respectively. He is supported by the Shanghai Rising-Star Program under Grant No. 04QMX1404 and the National Natural Science Foundation of China (NSFC) under Grant No. 60673134. He has served as a program committee member of several international conferences, including DASFAA 2006, 2007 and 2008, APWeb/WAIM 2007, INFOSCALE 2007, and ECDM 2007. His papers have been published in ICDE, SIAM DM, and CIKM. His research interests include data stream query processing and mining, and large-scale distributed computing for database applications.

Cheqing Jin is currently an Assistant Professor of Computer Science at East China University of Science and Technology. He received his Bachelor's and Master's degrees in Computer Science from Zhejiang University in Hangzhou, P.R. China in 1999 and 2002, respectively, and his Ph.D. degree from Fudan University, Shanghai, P.R. China. He worked as a Research Assistant at the E-business Technology Institute, the University of Hong Kong, from December 2003 to May 2004. His current research interests include data mining and data streams.

About this article

Cite this article

Zhou, A., Cao, F., Qian, W. et al. Tracking clusters in evolving data streams over sliding windows. Knowl Inf Syst 15, 181–214 (2008). https://doi.org/10.1007/s10115-007-0070-x

Received: 05 October 2005
Revised: 02 September 2006
Accepted: 17 January 2007
Published: 09 March 2007
Issue Date: May 2008
DOI: https://doi.org/10.1007/s10115-007-0070-x
