A survey on data stream clustering and classification

Nguyen, Hai-Long; Woon, Yew-Kwong; Ng, Wee-Keong

doi:10.1007/s10115-014-0808-1

A survey on data stream clustering and classification

Survey Paper
Published: 17 December 2014

Volume 45, pages 535–569, (2015)
Cite this article

Knowledge and Information Systems Aims and scope Submit manuscript

Hai-Long Nguyen¹,
Yew-Kwong Woon² &
Wee-Keong Ng³

9079 Accesses
211 Citations
3 Altmetric
Explore all metrics

Abstract

Nowadays, with the advance of technology, many applications generate huge amounts of data streams at very high speed. Examples include network traffic, web click streams, video surveillance, and sensor networks. Data stream mining has become a hot research topic. Its goal is to extract hidden knowledge/patterns from continuous data streams. Unlike traditional data mining where the dataset is static and can be repeatedly read many times, data stream mining algorithms face many challenges and have to satisfy constraints such as bounded memory, single-pass, real-time response, and concept-drift detection. This paper presents a comprehensive survey of the state-of-the-art data stream mining algorithms with a focus on clustering and classification because of their ubiquitous usage. It identifies mining constraints, proposes a general model for data stream mining, and depicts the relationship between traditional data mining and data stream mining. Furthermore, it analyzes the advantages as well as limitations of data stream algorithms and suggests potential areas for future research.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Data stream clustering: a review

Article 21 July 2020

Optimizing Data Stream Representation: An Extensive Survey on Stream Clustering Algorithms

Article 21 January 2019

State-of-the-art on clustering data streams

Article Open access 01 December 2016

Notes

References

Adomavicius G, Bockstedt J (2008) C-trend: temporal cluster graphs for identifying and visualizing trends in multiattribute transactional data. IEEE Trans Knowl Data Eng 20:721–735
Article Google Scholar
Aggarwal CC (2009) Data streams: an overview and scientific applications. Springer, Berlin, pp 377–397
Google Scholar
Aggarwal CC, Han J, Wang J, Yu PS (2003) A framework for clustering evolving data streams. In: Proceedings of the 29th international conference on very large data bases, vol 29, pp 81–92. VLDB Endowment
Aggarwal CC, Han J, Wang J, Yu PS (2004) A framework for projected clustering of high dimensional data streams. In: Proceedings of the 13th international conference on very large data bases, vol 30, pp 852–863, 1316763. VLDB Endowment
Aggarwal CC, Han J, Wang J, Yu PS (2006) A framework for on-demand classification of evolving data streams. IEEE Trans Knowl Data Eng 18(5):577–589
Article Google Scholar
Aggarwal CC, Philip SY (2010) On clustering massive text and categorical data streams. Knowl Inf Syst 24(2):171–196
Article Google Scholar
Aggarwal CC, Subbian K (2012) Event detection in social streams. In: Proceedings of the SIAM international conference on data mining, pp 624–635
Aggarwal CC, Yu P (2005) Online analysis of community evolution in data streams. In: Proceedings of the SIAM international conference on data mining, pp 56–67
Aksoy C, Can F, Kocberber S (2012) Novelty detection for topic tracking. J Am Soc Inf Sci Technol 63(4):777–795
Article Google Scholar
Alahakoon D, Halgamuge SK, Srinivasan B (2000) Dynamic self-organizing maps with controlled growth for knowledge discovery. IEEE Trans Neural Netw 11(3):601–614
Article Google Scholar
Alexander H, Er H, Daniel AK (1998) An efficient approach to clustering in large multimedia databases with noise. In: Proceeding of the 1998 international conference knowledge discovery and data mining, pp 58–65
Allan J, Papka R, Lavrenko V (1998) On-line new event detection and tracking. In: Proceedings of the 21st annual international ACM SIGIR conference on research and development in information retrieval, pp 37–45. ACM
Amigo E, Gonzalo J, Artiles J (2009) A comparison of extrinsic clustering evaluation metrics based on formal constraints. Inf Retr 12(4):461–486
Article Google Scholar
Ankerst M, Breunig MM, Kriegel H-P, Sander J (1999) Optics: ordering points to identify the clustering structure. In: Proceedings of the 1999 ACM SIGMOD international conference on management of data, vol 28, pp 49–60, 304187. ACM
Beckmann N, Kriegel H-P, Schneider R, Seeger B (1990) The R*-tree: an efficient and robust access method for points and rectangles, vol 19. ACM
Bifet A (2010) Adaptive stream mining: pattern learning and mining from evolving data streams. In: Proceeding of the 2010 conference on adaptive stream mining: pattern learning and mining from evolving data streams. IOS Press
Bifet A, Holmes G, Kirkby R (2012) Moa: massive online analysis. J Mach Learn Res 11:1601–1604
Google Scholar
Bohm C, Kailing K, Kriegel H-P, Kroger P (2004) Density connected clustering with local subspace preferences. In: Proceedings of the 4th IEEE international conference on data mining, pp 27–34, 1033433. IEEE Computer Society
Can F (1993) Incremental clustering for dynamic information processing. ACM Trans Inf Syst (TOIS) 11(2):143–164
Article MathSciNet Google Scholar
Can F, Kocberber S, Baglioglu O, Kardas S, Ocalan HC, Uyar E (2010) New event detection and topic tracking in turkish. J Am Soc Inf Sci Technol 61(4):802–819
Google Scholar
Can F, Ozkarahan EA (1990) Concepts and effectiveness of the cover-coefficient-based clustering methodology for text databases. ACM Trans Database Syst (TODS) 15(4):483–517
Article Google Scholar
Cao F, Ester M, Qian W, Zhou A (2006) Density-based clustering over an evolving data stream with noise. In: Proceedings of the 2006 SIAM international conference on data mining, pp 328–339
Chakrabarti D, Kumar R, Tomkins A (2006) Evolutionary clustering. In: Proceedings of the 12th ACM SIGKDD international conference on knowledge discovery and data mining, pp 554–560. ACM
Chen Y, Tu L (2007) Density-based clustering for real-time stream data. In: Proceedings of the 13th ACM SIGKDD international conference on knowledge discovery and data mining, pp 133–142, 1281210. ACM
Chi Y, Song X, Zhou D, Hino K, Tseng B (2007) Evolutionary spectral clustering by incorporating temporal smoothness. In: Proceedings of the 13th ACM SIGKDD international conference on knowledge discovery and data mining, pp 153–162. ACM
Chi Y, Song X, Zhou D, Hino K, Tseng B (2009) On evolutionary spectral clustering. ACM Trans Knowl Discov Data 3(4):1–30
Article Google Scholar
Chien S, Immorlica N (2005) Semantic similarity between search engine queries using temporal correlation. In: Proceedings of the 14th international conference on World Wide Web, pp 2–11, 1060752. ACM
Chow TWS, Sitao W (2004) An online cellular probabilistic self-organizing map for static and dynamic data sets. IEEE Trans Circuits Syst I Regul Pap 51(4):732–747
Article Google Scholar
Chris G, Jiaweihan Y, Jianpei Z, Xifeng Yan Y, Philip SY (2003) Mining frequent patterns in data streams at multiple time granularities. Next Gener Data Min 212:191–212
Google Scholar
Ciaccia P, Patella M, Zezula P (1997) M-tree: An efficient access method for similarity search in metric spaces. In: Proceedings of the 23rd international conference on very large data bases, pP 426–435, 671005. Morgan Kaufmann Publishers Inc
Cui W, Liu S, Tan L, Shi C, Song Y, Gao Z, Qu H, Tong X (2011) Textflow: Towards better understanding of evolving topics in text. IEEE Trans Vis Comput Graph 17(12):2412–2421
Article Google Scholar
Dang X, Lee V, Ng W, Ciptadi A, Ong K (2009) An EM-based algorithm for clustering data streams in sliding windows, vol 5463 of Lecture notes in computer science, pp 230–235. Springer, Berlin
de Leeuw J (2005) Applications of convex analysis to multidimensional scaling. North Holland Publishing Company, Amsterdam, pp 133–146
Google Scholar
Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the em algorithm. J R Stat Soc Ser B (Methodol) 39(1):1–38
MathSciNet MATH Google Scholar
Domingos P, Hulten G (2000) Mining high-speed data streams. In: Proceedings of the sixth ACM SIGKDD international conference on knowledge discovery and data mining, pp 71–80, 347107. ACM
Dork M, Gruen D, Williamson C, Carpendale S (2010) A visual backchannel for large-scale events. IEEE Trans Vis Comput Graph 16(6):1129–1138
Article Google Scholar
Fung BC, Wang K, Yu PS (2005) Top-down specialization for information and privacy preservation. In: 21st international conference on data engineering, 2005. ICDE 2005. Proceedings, pp 205–216. IEEE
Gaber MM, Zaslavsky A, Krishnaswamy S (2005) Mining data streams: a review. ACM Sigmod Rec 34(2):18–26
Article Google Scholar
Gama J (2013) Data stream mining: the bounded rationality. Informatica (Slovenia) 37(1):21–25
Google Scholar
Gama J, Gaber MM (2007) Learning from data streams—processing techniques in sensor networks. Springer, Berlin
MATH Google Scholar
Gama J, Pinto C (2006) Discretization from data streams: applications to histograms and data mining. In: Proceedings of the 2006 ACM symposium on applied computing, pp 662–667. ACM
Gama J, Rodrigues P (2009) An overview on mining data streams, vol 206 of Studies in computational intelligence, pp 29–45. Springer, Berlin
Gama J, Sebastião R, Rodrigues P (2013) On evaluating stream learning algorithms. Mach Learn 90(3):317–346
Article MathSciNet MATH Google Scholar
Guha S, Kim C, Shim K (2004) Xwave: optimal and approximate extended wavelets. In: Proceedings of the thirtieth international conference on very large data bases, vol 30, pp 288–299, 1316716. VLDB Endowment
Guha S, Meyerson A, Mishra N, Motwani R, O’Callaghan L (2003) Clustering data streams: theory and practice. IEEE Trans Knowl Data Eng 15:515–528
Article Google Scholar
Guha S, Mishra N, Motwani R, O’Callaghan L (2000) Clustering data streams. In: Proceedings of the 41st annual symposium on foundations of computer science, pp 359–366
Guha S, Rastogi R, Shim K (1998) Cure: an efficient clustering algorithm for large databases. In: Proceedings of the 1998 ACM SIGMOD international conference on management of data, pp 73–84, 276312. ACM
Guha S, Rastogi R, Shim K (1999) Rock: a robust clustering algorithm for categorical attributes. In: Proceedings of the 15th international conference on data engineering, p 512, 847264. IEEE Computer Society
Guttman A (1984) R-trees: a dynamic index structure for spatial searching. In: Proceedings of the 1984 ACM SIGMOD international conference on management of data, pp 47–57
Hahsler M, Dunham M (2011) Temporal structure learning for clustering massive data streams in real-time. In: SIAM conference on data mining, pp 664–675. SIAM
Han J (2005) Data mining: concepts and techniques. Morgan Kaufmann Publishers Inc., San Francisco
Google Scholar
He Q, Chang K, Lim E-P, Zhang J (2007) Bursty feature representation for clustering text streams. In: SIAM international conference on data mining
Hulten G, Spencer L, Domingos P (2001) Mining time-changing data streams. In: Proceedings of the seventh ACM SIGKDD international conference on knowledge discovery and data mining, pp 97–106, 502529. ACM
Jain AK, Murty MN, Flynn PJ (1999) Data clustering: a review. ACM Comput Surv (CSUR) 31(3):264–323
Article Google Scholar
Karypis G, Eui-Hong H, Kumar V (1999) Chameleon: hierarchical clustering using dynamic modeling. Computer 32(8):68–75
Article Google Scholar
Kaufman L, Rousseeuw P (1990) Finding groups in data: an introduction to cluster analysis, vol 39. Wiley, New York
Book Google Scholar
Kelly MG, Hand DJ, Adams NM (1999) The impact of changing populations on classifier performance. In: Proceedings of the fifth ACM SIGKDD international conference on knowledge discovery and data mining, pp 367–371, 312285. ACM
Kleinberg J (2003) Bursty and hierarchical structure in streams. Data Min Knowl Discov 7(4):373–397
Article MathSciNet Google Scholar
Kohonen T (1990) The self-organizing map. Proc IEEE 78(9):1464–1480
Article Google Scholar
Kranen P, Assent I, Baldauf C, Seidl T (2011) The clustree: indexing micro-clusters for anytime stream mining. Knowl Inf Syst 29(2):249–272
Article Google Scholar
Kranen P, Günnemann S, Fries S, Seidl T (2010) MC-tree: improving Bayesian anytime classification, vol 6187 of Lecture notes in computer science, pp 252–269. Springer, Berlin
Kremer H, Kranen P, Jansen T, Seidl T, Bifet A, Holmes G, Pfahringer B (2011) An effective evaluation measure for clustering on evolving data streams. In: Proceedings of the 17th ACM SIGKDD international conference on knowledge discovery and data mining, pp 868–876, 2020555. ACM
Kriegel H-P, Kröger P, Ntoutsi I, Zimek A (2011) Density based subspace clustering over dynamic data, vol 6809 of Lecture notes in computer science, pp 387–404. Springer, Berlin
Lam W, Meng H, Wong K, Yen J (2001) Using contextual analysis for news event detection. Int J Intell Syst 16(4):525–546
Article MATH Google Scholar
Last M (2002) Online classification of nonstationary data streams. Intell Data Anal 6(2):129–147
MathSciNet MATH Google Scholar
Lei Y, Huan L (2003) Feature selection for high-dimensional data: a fast correlation-based filter solution. In: The 20th international conference on machine learning, pp 856–863
Leite D, Costa P, Gomide F (2010) Evolving granular neural network for semi-supervised data stream classification. In: The 2010 international joint conference on neural networks (IJCNN), pp 1–8. IEEE
Leite DF, Costa P, Gomide F (2009) Evolving granular classification neural networks. In: International joint conference on neural networks, 2009. IJCNN 2009, pp 1736–1743
Lühr S, Lazarescu M (2009) Incremental clustering of dynamic data streams using connectivity based representative points. Data Knowl Eng 68(1):1–27
Article Google Scholar
Liu H, Yu L (2005) Toward integrating feature selection algorithms for classification and clustering. IEEE Trans Knowl Data Eng 17(4):491–502
Article Google Scholar
Martin E, Hans-Peter K, Jörg S, Xiaowei X (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proceedings of the 2nd international conference on knowledge discovery and data mining, pp 226–231. AAAI Press
Muthukrishnan SM (2003) Data streams: algorithms and applications. Now Publishers Inc
Narasimhamurthy A, Kuncheva LI (2007) A framework for generating data to simulate changing environments. In: Proceedings of the 25th IASTED international multi-conference: artificial intelligence and applications, vol 549, p 389. ACTA Press
Nguyen H-L, Woon Y-K, Ng W-K, Wan L (2012) Heterogeneous ensemble for feature drifts in data streams. In: Proceedings of the 16th Pacific-Asia conference on advances in knowledge discovery and data mining, vol 2, pp 1–12, 2342648. Springer
O’Callaghan L, Mishra N, Meyerson A, Guha S, Motwani R (2002) Streaming-data algorithms for high-quality clustering. In: Proceedings of the 18th international conference on data engineering, pp 685–694
Oza NC (2005) Online bagging and boosting. In: 2005 IEEE international conference on systems, man and cybernetics, vol 3, pp 2340–2345. IEEE
Park NH, Lee WS (2004) Statistical grid-based clustering over data streams. ACM SIGMOD Rec 33(1):32–37
Article Google Scholar
Park NH, Lee WS (2007) Cell trees: an adaptive synopsis structure for clustering multi-dimensional on-line data streams. Data Knowl Eng 63(2):528–549
Article Google Scholar
Rai P, Daum H, Venkatasubramanian S (2009) Streamed learning: one-pass svms. In: Proceedings of the 21st international jont conference on artificial intelligence, pp 1211–1216, 1661639. Morgan Kaufmann Publishers Inc
Rodrigues P, Gama J, Pedroso J (2006) Odac: hierarchical clustering of time series data streams. In: SIAM international conference on data mining, Bethesda, Maryland, USA, pp 499–503
Sakaki T, Okazaki M, Matsuo Y (2010) Earthquake shakes twitter users: real-time event detection by social sensors. In: Proceedings of the 19th international conference on World wide web, pp 851–860. ACM
Sattar H, Ying Y, Zahra M, Mohammadreza K (2009) Adapted one-vs-all decision trees for data stream classification. IEEE Trans Knowl Data Eng 21:624–637
Article Google Scholar
Sebastiani F (2002) Machine learning in automated text categorization. ACM Comput Surv (CSUR) 34(1):1–47
Article Google Scholar
Seidl T, Assent I, Kranen P, Krieger R, Herrmann J (2009) Indexing density models for incremental learning and anytime classification on data streams. In: Proceedings of the 12th international conference on extending database technology: advances in database technology, pp 311–322, 1516397. ACM
Sheikholeslami G, Chatterjee S, Zhang A (2000) Wavecluster: a wavelet-based clustering approach for spatial data in very large databases. VLDB J 8(3–4):289–304
Article Google Scholar
Silva JA, Faria ER, Barros RC, Hruschka ER, Carvalho ACd, Gama J (2013) Data stream clustering: a survey. ACM Comput Surv (CSUR) 46(1):13
Article Google Scholar
Smith T, Alahakoon D (2009) Growing self-organizing map for online continuous clustering, vol 204 of Studies in computational intelligence, pp 49–83. Springer, Berlin
Sow D, Biem A, Blount M, Ebling M, Verscheure O (2010) Body sensor data processing using stream computing. In: Proceedings of the international conference on multimedia information retrieval, pp 449–458, 1743465. ACM
Spiliopoulou M, Ntoutsi I, Theodoridis Y, Schult R (2006) Monic: modeling and monitoring cluster transitions. In: Proceedings of the 12th ACM SIGKDD international conference on knowledge discovery and data mining, pp 706–711. ACM
Street WN, Kim Y (2001) A streaming ensemble algorithm (sea) for large-scale classification. In: Proceedings of the seventh ACM SIGKDD international conference on knowledge discovery and data mining, pp 377–382, 502568. ACM
Sun J, Faloutsos C, Papadimitriou S, Yu PS (2007) Graphscope: parameter-free mining of large time-evolving graphs. In: Proceedings of the 13th ACM SIGKDD international conference on knowledge discovery and data mining, pp 687–696. ACM
Tasoulis DK, Ross G, Adams NM (2007) Visualising the cluster structure of data streams. In: Proceedings of the 7th international conference on intelligent data analysis, pp 81–92, 1771633. Springer
Teh Y, Jordan M, Beal M, Blei D (2006) Hierarchical dirichlet processes. J Am Stat Assoc 101(476):1566–1581
Article MathSciNet MATH Google Scholar
Tsang IW, Kocsor A, Kwok JT (2007) Simpler core vector machines with enclosing balls. In: Proceedings of the 24th international conference on machine learning, pp 911–918, 1273611. ACM
Udommanetanakit K, Rakthanmanon T, Waiyamai K (2007) E-stream: evolution-based technique for stream clustering, vol 4632 of Lecture notes in computer science, pp 605–615. Springer, Berlin
van Rijsbergen C (1979) Information retrieval, 2nd edn. Butterworths, Massachusetts
Google Scholar
Wan L, Ng WK, Dang XH, Yu PS, Zhang K (2009) Density-based clustering of data streams at multiple resolutions. ACM Trans Knowl Discov Data (TKDD) 3(3):1–28
Article Google Scholar
Wang H, Fan W, Yu PS, Han J (2003) Mining concept-drifting data streams using ensemble classifiers. In: Proceedings of the ninth ACM SIGKDD international conference on knowledge discovery and data mining, pp 226–235, 956778. ACM
Wang H, Yu PS, Han J (2005) Mining data streams. Springer, US, pp 777–792
Google Scholar
Wang W, Yang J, Muntz R (1997) Sting: a statistical information grid approach to spatial data mining. In: Proceedings of the international conference on very large data bases, pp 186–195
Yang Y, Pierce T, Carbonell J (1998) A study of retrospective and on-line event detection. In: Proceedings of the 21st annual international ACM SIGIR conference on research and development in information retrieval, pp 28–36. ACM
Yang Y, Zhang J, Carbonell J, Jin C (2002) Topic-conditioned novelty detection. In: Proceedings of the eighth ACM SIGKDD international conference on knowledge discovery and data mining, pp 688–693. ACM
Zeng H-J, He Q-C, Chen Z, Ma W-Y, Ma J (2004) Learning to cluster web search results. In: Proceedings of the 27th annual international ACM SIGIR conference on research and development in information retrieval, pp 210–217, 1009030. ACM
Zhang P, Gao B, Zhu X, Guo L (2011) Enabling fast lazy learning for data streams. In: Proceedings of the 11th IEEE international conference on data mining (ICDM-11), pp 932–941. IEEE
Zhang P, Li J, Wang P, Gao BJ, Zhu X, Guo L (2011) Enabling fast prediction for ensemble models on data streams. In: Proceedings of the 17th ACM SIGKDD international conference on knowledge discovery and data mining, pp 177–185, 2020442. ACM
Zhang P, Zhu X, Shi Y, Guo L, Wu X (2011) Robust ensemble learning for mining noisy data streams. Decis Support Syst 50(2):469–479
Article Google Scholar
Zhang P, Zhu X, Shi Y, Wu X (2009) An aggregate ensemble for mining concept drifting data streams with noise, vol 5476 of Lecture notes in computer science, pp 1021–1029. Springer, Berlin
Zhang T, Ramakrishnan R, Livny M (1996) Birch: an efficient data clustering method for very large databases. In: Proceedings of the 1996 ACM SIGMOD international conference on management of data, pp 103–114, 233324. ACM
Zhong S (2005) Efficient streaming text clustering. Neural Netw 18(5):790–798
Article MATH Google Scholar
Zhou A, Cao F, Qian W, Jin C (2008) Tracking clusters in evolving data streams over sliding windows. Knowl Inf Syst 15(2):181–214
Article Google Scholar

Download references

Acknowledgments

This work is partially supported by the Wattalyzer grant project of I2R, Singapore with Grant reference: NRF2012EWT-EIRP002-044.

Author information

Authors and Affiliations

Data Analytics Department, Institute for Infocomm Research (I2R) Agency for Science Technology and Research (A*STAR), Singapore, Singapore
Hai-Long Nguyen
Airbus Group Innovations South Asia, Singapore, Singapore
Yew-Kwong Woon
School of Computer Engineering, Nanyang Technological University, Singapore, Singapore
Wee-Keong Ng

Authors

Hai-Long Nguyen
View author publications
You can also search for this author in PubMed Google Scholar
Yew-Kwong Woon
View author publications
You can also search for this author in PubMed Google Scholar
Wee-Keong Ng
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Hai-Long Nguyen.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Nguyen, HL., Woon, YK. & Ng, WK. A survey on data stream clustering and classification. Knowl Inf Syst 45, 535–569 (2015). https://doi.org/10.1007/s10115-014-0808-1

Download citation

Received: 22 August 2012
Revised: 31 August 2014
Accepted: 15 November 2014
Published: 17 December 2014
Issue Date: December 2015
DOI: https://doi.org/10.1007/s10115-014-0808-1

Keywords

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

A survey on data stream clustering and classification

Abstract

Access this article

Similar content being viewed by others

Data stream clustering: a review

Optimizing Data Stream Representation: An Extensive Survey on Stream Clustering Algorithms

State-of-the-art on clustering data streams

Notes

References

Acknowledgments

Author information

Authors and Affiliations

Corresponding author

Rights and permissions

About this article

Cite this article

Keywords

Navigation

A survey on data stream clustering and classification

Abstract

Access this article

Similar content being viewed by others

Data stream clustering: a review

Optimizing Data Stream Representation: An Extensive Survey on Stream Clustering Algorithms

State-of-the-art on clustering data streams

Notes

References

Acknowledgments

Author information

Authors and Affiliations

Corresponding author

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation