Optimizing Data Stream Representation: An Extensive Survey on Stream Clustering Algorithms

  • Matthias Carnein
  • Heike Trautmann
State of the Art

Abstract

Analyzing data streams has received considerable attention over the past decades due to the widespread use of sensors, social media and other streaming data sources. A core research area in this field is stream clustering, which aims to recognize patterns in an unordered, infinite and evolving stream of observations. Clustering can be a crucial support in decision making, since it aims for an optimized aggregated representation of a continuous data stream over time and makes it possible to identify patterns in large and high-dimensional data. A multitude of algorithms and approaches have been developed that are able to find and maintain clusters over time in the challenging streaming scenario. This survey explores, summarizes and categorizes a total of 51 stream clustering algorithms and identifies core research threads over the past decades. In particular, it identifies categories of algorithms based on distance thresholds, density grids and statistical models as well as algorithms for high-dimensional data. Furthermore, it discusses application scenarios, available software and how to configure stream clustering algorithms. This survey is considerably more extensive and more up-to-date than comparable studies, and it highlights how concepts are interrelated and have been developed over time.

Keywords

Stream clustering, Data streams, Online clustering, Pattern recognition, Decision support, Data representation

Notes

Acknowledgements

The authors would like to thank Karsten Kraume and the ERCIS Omni-Channel Lab – powered by Arvato (https://omni-channel.ercis.org/) for their support.

Copyright information

© Springer Fachmedien Wiesbaden GmbH, part of Springer Nature 2019

Authors and Affiliations

  1. Information Systems and Statistics, University of Münster, Münster, Germany
