Skip to main content
Log in

On Density-Based Data Streams Clustering Algorithms: A Survey

  • Regular Paper
  • Published:
Journal of Computer Science and Technology Aims and scope Submit manuscript

Abstract

Clustering data streams has drawn lots of attention in the last few years due to their ever-growing presence. Data streams put additional challenges on clustering such as limited time and memory and one pass clustering. Furthermore, discovering clusters with arbitrary shapes is very important in data stream applications. Data streams are infinite and evolving over time, and we do not have any knowledge about the number of clusters. In a data stream environment due to various factors, some noise appears occasionally. Density-based method is a remarkable class in clustering data streams, which has the ability to discover arbitrary shape clusters and to detect noise. Furthermore, it does not need the number of clusters in advance. Due to data stream characteristics, the traditional density-based clustering is not applicable. Recently, a lot of density-based clustering algorithms are extended for data streams. The main idea in these algorithms is using density-based methods in the clustering process and at the same time overcoming the constraints, which are put out by data stream’s nature. The purpose of this paper is to shed light on some algorithms in the literature on density-based clustering over data streams. We not only summarize the main density-based clustering algorithms on data streams, discuss their uniqueness and limitations, but also explain how they address the challenges in clustering data streams. Moreover, we investigate the evaluation metrics used in validating cluster quality and measuring algorithms’ performance. It is hoped that this survey will serve as a steppingstone for researchers studying data streams clustering, particularly density-based algorithms.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Similar content being viewed by others

References

  1. Han J, Kamber M, Pei J. Data Mining: Concepts and Techniques (3rd edition). San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2011.

    Google Scholar 

  2. Gaber M, Zaslavsky A, Krishnaswamy S. Mining data streams: A review. ACM SIGMOD Record, 2005, 34(2): 18-26.

    Article  Google Scholar 

  3. Cao F, Ester M, Qian W, Zhou A. Density-based clustering over an evolving data stream with noise. In Proc. the 2006 SIAM Conference on Data Mining, April 2006, pp. 328-339.

  4. Chen Y, Tu L. Density-based clustering for real-time stream data. In Proc. the 13th ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining, Aug. 2007, pp. 133-142.

  5. Aggarwal C C (ed.). Data Streams: Models and Algorithms. New York, USA: Springer, 2007.

  6. Hahsler M, Dunham M H. Temporal structure learning for clustering massive data streams in real-time. In Proc. the 11th SIAM Conference on Data Mining, April 2011, pp. 664-675.

  7. O Ćallaghan L, Mishra N, Meyerson A et al. Streaming-data algorithms for high-quality clustering. In Proc. the 18th Int. Conf. Data Engineering, Feb. 26-Mar. 1, 2002, pp. 685-694.

  8. Barbará D. Requirements for clustering data streams. SIGKDD Explorations Newsletter, 2002, 3(2): 23-27.

    Article  Google Scholar 

  9. Guha S, Meyerson A, Mishra N et al. Clustering data streams: Theory and practice. IEEE Trans. Knowledge and Data Engineering, 2003, 15(3): 515-528.

    Article  Google Scholar 

  10. Aggarwal C C, Han J, Wang J, Yu P S. A framework for clustering evolving data streams. In Proc. the 29th International Conference on Very Large Data Bases, Sept. 2003, pp. 81-92.

  11. Ackermann M R, Lammersen C, Märtens M, Raupach C, Sohler C, Swierkot K. StreamKM++: A clustering algorithm for data streams. In Proc. the 12th Workshop on Algorithm Engineering and Experiments, Jan. 2010, pp. 173-187.

  12. Ikonomovska E, Loskovska S, Gjorgjevik D. A survey of stream data mining. In Proc. the 8th National Conference with International Participation, Sept. 2007, pp. 19-25.

  13. Gaber M, Zaslavsky A, Krishnaswamy S. Data stream mining. Data Mining and Knowledge Discovery Handbook, 2010, pp. 759-787.

  14. Babcock B, Babu S, Datar M, Motwani R, Widom J. Models and issues in data stream systems. In Proc. the 21st ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, June 2002, pp. 1-16.

  15. Jain A K, Dubes R C. Algorithms for Clustering Data. Upper Saddle River, NJ, USA: Prentice-Hall, Inc., 1988.

  16. Jain A K. Data clustering: 50 years beyond K-means. Pattern Recognition Letter, 2010, 31(8): 651-666.

    Article  Google Scholar 

  17. Mahdiraji A. Clustering data stream: A survey of algorithms. Int. J. Knowledge-Based and Intelligent Engineering Systems, 2009, 13(2): 39-44.

    Google Scholar 

  18. Amini A, Wah T, Saybani M et al. A study of density-grid based clustering algorithms on data streams. In Proc. the 8th Int. Conf. Fuzzy Systems and Knowledge Discovery, July 2011, pp. 1652-1656.

  19. Amini A, Wah T Y. Density micro-clustering algorithms on data streams: A review. In Proc. Int. Multiconf. Data Mining and Applications, March 2011, pp. 410-414.

  20. Amini A, Wah T Y. A comparative study of density-based clustering algorithms on data streams: Micro-clustering approaches. In Lecture Notes in Electrical Engineering 110, Ao S, Castillo O, Huang X (eds.), Springer, 2012, pp. 275-287.

  21. Aggarwal C C. A survey of stream clustering algorithms. In Data Clustering: Algorithms and Applications, Aggarwal C C, Reddy C (eds.), CRC Press, 2013, pp. 457-482.

  22. Han J, Kamber M. Data Mining: Concepts and Techniques (2nd edition). Morgan Kaufmann, 2006.

    Google Scholar 

  23. MacQueen J. Some methods for classification and analysis of multivariate observations. In Proc. the 5th Berkeley Symposium on Mathematical Statistics and Probability, June 21-July 18, 1967, pp. 281-297.

  24. Lloyd S P. Least squares quantization in PCM. IEEE Transactions on Information Theory, 1982, 28(2): 129-137.

    Article  MATH  MathSciNet  Google Scholar 

  25. Guha S, Mishra N, Motwani R, O’Callaghan L. Clustering data streams. In Proc. the 41st Annual Symposium on Foundations of Computer Science, Nov. 2000, pp. 359-366.

  26. Zhang T, Ramakrishnan R, Livny M. BIRCH: An efficient data clustering method for very large databases. In Proc. the 1996 ACM SIGMOD International Conference on Management of Data, June 1996, pp. 103-114.

  27. Karypis G, Han E, Kumar V. Chameleon: Hierarchical clustering using dynamic modeling. Computer, 1999, 32(8): 68-75.

    Article  Google Scholar 

  28. Kranen P, Assent I, Baldauf C, Seidl T. The clustree: Indexing micro-clusters for anytime stream mining. Knowl. Inf. Syst., 2011, 29(2): 249-272.

    Article  Google Scholar 

  29. Wang W, Yang J, Muntz R R. STING: A statistical information grid approach to spatial data mining. In Proc. the 23rd Int. Conf. Very Large Data Bases, Aug. 1997, pp. 186-195.

  30. Sheikholeslami G, Chatterjee S, Zhang A. Wavecluster: A wavelet-based clustering approach for spatial data in very large databases. The VLDB Journal, 2000, 8(3/4): 289-304.

    Article  Google Scholar 

  31. Agrawal R, Gehrke J, Gunopulos D, Raghavan P. Automatic subspace clustering of high dimensional data for data mining applications. ACM SIGMOD Record, 1998, 27(2): 94-105.

    Article  Google Scholar 

  32. Tu L, Chen Y. Stream data clustering based on grid density and attraction. ACM Transactions on Knowledge Discovery Data, 2009, 3(3): Article No. 12.

  33. Wan L, Ng W K, Dang X H et al. Density-based clustering of data streams at multiple resolutions. ACM Trans. Knowledge Discovery from Data, 2009, 3(3): Article No. 14.

  34. Dempster A P, Laird N M, Rubin D B. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 1977, 39(1): 1-38.

    MATH  MathSciNet  Google Scholar 

  35. Dang X, Lee V, Ng W K et al. An EM-based algorithm for clustering data streams in sliding windows. In Proc. the Int. Conf. Database Systems for Advanced Applications, Apr. 2009, pp. 230-235.

  36. Ester M, Kriegel H, Sander J, Xu X. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proc. the 2nd International Conference on Knowledge Discovery and Data Mining, Aug. 1996, pp. 226-231.

  37. Ankerst M, Breunig M M, Kriegel H P, Sander J. Optics: Ordering points to identify the clustering structure. ACM SIGMOD Record, 1999, 28(2): 49-60.

    Article  Google Scholar 

  38. Hinneburg A, Keim D A. An efficient approach to clustering in large multimedia databases with noise. In Proc. the 4th KDD, Sept. 1998, pp. 58-65.

  39. Matysiak M. Data stream mining: Basic methods and techniques. Technical Report, RWTH Aachen University, 2012.

    Google Scholar 

  40. Zhou A, Cao F, Qian W, Jin C. Tracking clusters in evolving data streams over sliding windows. Knowledge and Information Systems, 2008, 15(2): 181-214.

    Article  Google Scholar 

  41. Ren J, Ma R. Density-based data streams clustering over sliding windows. In Proc. the 6th Int. Conf. Fuzzy systems and Knowledge Discovery, Aug. 2009, pp. 248-252.

  42. Charikar M, O’Callaghan L, Panigrahy R. Better streaming algorithms for clustering problems. In Proc. the 35th Annual ACM Symp. Theory of Computing, June 2003, pp. 30-39.

  43. Gao J, Li J, Zhang Z, Tan P N. An incremental data stream clustering algorithm based on dense units detection. In Proc. the 9th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining, May 2005, pp. 420-425.

  44. Aggarwal C C, Han J, Wang J, Yu P S. A framework for projected clustering of high dimensional data streams. In Proc. the 30th International Conference on Very Large Data Bases, Volume 30, Aug. 29-Sept. 3, 2004, pp. 852-863.

  45. Aggarwal C C, Han J, Wang J, Yu P S. On high dimensional projected clustering of data streams. Data Mining and Knowledge Discovery, 2005, 10(3): 251-273.

    Article  MathSciNet  Google Scholar 

  46. Babcock B, Datar M, Motwani R, O’Callaghan L. Maintaining variance and k-medians over data stream windows. In Proc. the 22nd ACM SIGMOD-SIGACT-SIGART Symp. Principles of Database Systems, June 2003, pp. 234-243.

  47. Ng W, Dash M. Discovery of frequent patterns in transactional data streams. In Lecture Notes in Computer Science 6380, Hameurlain A, Küng J, Wagner R et al. (eds.), Springer Berlin/Heidelberg, 2010, pp. 1-30.

  48. Vitter J S. Random sampling with a reservoir. ACM Trans. Math. Softw., 1985, 11(1): 37-57.

    Article  MATH  MathSciNet  Google Scholar 

  49. Garofalakis M, Gehrke J, Rastogi R. Querying and mining data streams: You only get one look: A tutorial. In Proc. the 2002 ACM SIGMOD Int. Conf. Management of Data, June 2002, pp. 635-635.

  50. Aggarwal C C, Yu P S. A survey of synopsis construction in data streams. In Advances in Database Systems 31, Aggarwal C C (ed.), Springer, 2007, pp. 169-207.

  51. Garofalakis M N. Wavelets on streams. In Encyclopedia of Database Systems, Springer US, 2009, pp. 3446-3451.

  52. Gilbert A C, Kotidis Y, Muthukrishnan S, Strauss M J. One-pass wavelet decompositions of data streams. IEEE Trans. Knowl. and Data Eng., 2003, 15(3): 541-554.

    Article  Google Scholar 

  53. Gama J, Gaber M M (eds.). Learning from Data Streams - Processing Techniques in Sensor Networks. Springer, 2007.

  54. Rosset S, Inger A. KDD-cup 99: Knowledge discovery in a charitable organization’s donor database. SIGKDD Explorations Newsletter, 2000, 1(2): 85-90.

    Article  Google Scholar 

  55. Hubert L J, Levin J R. A general statistical framework for assessing categorical clustering in free recall. Psychological Bulletin, 1976, 83(6): 1072-1080.

    Article  Google Scholar 

  56. Kaufman L, Rousseeuw P J. Finding Groups in Data: An Introduction to Cluster Analysis. Wiley-Interscience, 2005.

  57. Wu J, Xiong H, Chen J. Adapting the right measures for K-means clustering. In Proc. the 15th ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining, June 2009, pp. 877-886.

  58. Rand W M. Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association, 1971, 66(336): 846-850.

    Article  Google Scholar 

  59. Zhao Y, Karypis G. Empirical and theoretical comparisons of selected criterion functions for document clustering. Machine Learning, 2004, 55(3): 311-331.

    Article  MATH  Google Scholar 

  60. Dongen S. Performance criteria for graph clustering and Markov cluster experiments. Technical Report, National Research Institute for Mathematics and Computer Science, Stichting Mathematisch Centrum, Netherlands, 2000.

    Google Scholar 

  61. Rosenberg A, Hirschberg J. V-measure: A conditional entropy-based external cluster evaluation measure. In Proc. the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, June 2007, pp. 410-420.

  62. Meilǎ M. Comparing clusterings: An axiomatic view. In Proc. the 22nd Int. Conf. Machine Learning, Aug. 2005, pp. 577-584.

  63. Rijsbergen C J V. Information Retrieval. Newton, MA, USA: Butterworth-Heinemann, 1979.

  64. Milligan G. A Monte Carlo study of thirty internal criterion measures for cluster analysis. Psychometrika, 1981, 46(2): 187-199.

    Article  MATH  MathSciNet  Google Scholar 

  65. Pereira C M M, de Mello R F. A comparison of clustering algorithms for data streams. In Proc. the 1st Int. Conf. Integrated Comp. Tech., May 31-June 2, 2011, pp. 59-74.

  66. Manning C D, Raghavan P, Schtze H. Introduction to Information Retrieval. New York, NY, USA: Cambridge University Press, 2008.

  67. Forestiero A, Pizzuti C, Spezzano G. A single pass algorithm for clustering evolving data streams based on swarm intelligence. Data Mining and Knowledge Discovery, 2013, 26(1): 1-26.

    Article  MathSciNet  Google Scholar 

  68. Bifet A, Holmes G, Pfahringer B et al. MOA: Massive online analysis, a framework for stream classification and clustering. Journal of Machine Learning Research, 2010, 11: 44-50.

    Google Scholar 

  69. Holmes G, Donkin A, Witten I H. WEKA: A machine learning workbench. In Proc. the 2nd Australian and New Zealand Conference on Intelligent Information Systems, Nov. 29-Dec. 3, 1994, pp. 357-361.

  70. Kranen P, Kremer H, Jansen T et al. Clustering performance on evolving data streams: Assessing algorithms and evaluation measures within MOA. In Proc. the IEEE Int. Conf. Data Mining Workshops, Dec. 2010, pp. 1400-1403.

  71. Kremer H, Kranen P, Jansen T et al. An effective evaluation measure for clustering on evolving data streams. In Proc. the 17th ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining, July 2011, pp. 868-876.

  72. De Francisci Morales G. SAMOA: A platform for mining big data streams. In Proc. the 22nd Int. Conf. World Wide Web Companion, May 2013, pp. 777-778.

  73. Tasoulis D K, Ross G, Adams N M. Visualising the cluster structure of data streams. In Proc. the 7th International Conference on Intelligent Data Analysis, Sept. 2007, pp. 81-92.

  74. Ruiz C, Menasalvas E, Spiliopoulou M. C-DenStream: Using domain knowledge on a data stream. In Proc. the 12th International Conference on Discovery Science, Oct. 2009, pp. 287-301.

  75. Liu L, Jing K, Guo Y et al. A three-step clustering algorithm over an evolving data stream. In Proc. the IEEE Int. Conf. Intelligent Computing and Intelligent Systems, Nov. 2009, pp. 160-164.

  76. Lin J, Lin H. A density-based clustering over evolving heterogeneous data stream. In Proc. the 2nd Int. Colloquium on Computing, Communication, Control, and Management, Aug. 2009, pp. 275-277.

  77. Isaksson C, Dunham M, Hahsler M. SOStream: Self organizing density-based clustering over data stream. In Lecture Notes in Computer Science 7376, Perner P (ed.), Springer Berlin Heidelberg, 2012, pp. 264-278.

  78. Ntoutsi I, Zimek A, Palpanas T et al. Density-based projected clustering over high dimensional data streams. In Proc. the 12th SIAM Int. Conf. Data Mining, April 2012, pp. 987-998.

  79. Hassani M, Spaus P, Gaber M M, Seidl T. Density-based projected clustering of data streams. In Proc. the 6th Int. Conf. Scalable Uncertainty Management, Sept. 2012, pp. 311-324.

  80. Jia C, Tan C, Yong A. A grid and density-based clustering algorithm for processing data stream. In Proc. the 2nd Int. Conf. Genetic and Evolutionary Computing, Sept. 2008, pp. 517-521.

  81. Ren J, Cai B, Hu C. Clustering over data streams based on grid density and index tree. Journal of Convergence Information Technology, 2011, 6(1): 83-93.

    Article  Google Scholar 

  82. Yang Y, Liu Z, Zhang J et al. Dynamic density-based clustering algorithm over uncertain data streams. In Proc. the 9th Int. Conf. Fuzzy Systems and Knowledge Discovery, May 2012, pp. 2664-2670.

  83. Amini A, Teh Ying W. DENGRIS-Stream: A density-grid based clustering algorithm for evolving data streams over sliding window. In Proc. International Conference on Data Mining and Computer Engineering, Dec. 2012, pp. 206-210.

  84. Bhatnagar V, Kaur S, Chakravarthy S. Clustering data streams using grid-based synopsis. Knowledge and Information Systems, June 2013.

    Google Scholar 

  85. Ruiz C, Spiliopoulou M, Menasalvas E. C-DBSCAN: Density-based clustering with constraints. In Proc. the 11th International Conference on Rough Sets, Fuzzy Sets, Data Mining and Granular Computing, May 2007, pp. 216-223.

  86. Yang C, Zhou J. HClustream: A novel approach for clustering evolving heterogeneous data stream. In Proc. the 6th IEEE Int. Conf. Data Mining Workshops, Dec. 2006, pp. 682-688.

  87. Kohonen T. Self-organized formation of topologically correct feature maps. Biological Cybernetics, 1982, 43(1): 59-69.

    Article  MATH  MathSciNet  Google Scholar 

  88. Bohm C, Kailing K, Kriegel H P, Kroger P. Density connected clustering with local subspace preferences. In Proc. the 4th IEEE Int. Conf. Data Mining, Nov. 2004, pp. 27-34.

  89. Kennedy J F, Kennedy J, Eberhart R C. Swarm Intelligence. Morgan Kaufmann Pub, 2001.

  90. Shamshirband S, Anuar N, Kiah M et al. An appraisal and design of a multi-agent system based cooperative wireless intrusion detection computational intelligence technique. Engineering Applications of Artificial Intelligence, 2013, 26(9): 2105-2127.

    Article  Google Scholar 

  91. Sander J, Ester M, Kriegel H P, Xu X. Density-based clustering in spatial databases: The algorithm GDBSCAN and its applications. Data Mining and Knowledge Discovery, 1998, 2(2): 169-194.

    Article  Google Scholar 

  92. Plant C, Teipel S J, Oswald A et al. Automated detection of brain atrophy patterns based on MRI for the prediction of Alzheimer’s disease. NeuroImage, 2010, 50(1): 162-174.

    Article  Google Scholar 

  93. Mete M, Kockara S, Aydin K. Fast density-based lesion detection in dermoscopy images. Computerized Medical Imaging and Graphics, 2011, 35(2): 128-136.

    Article  Google Scholar 

  94. Yang D, Rundensteiner E A, Ward M O. Summarization and matching of density-based clusters in streaming environments. Proc. VLDB Endow., 2011, 5(2): 121-132.

    Google Scholar 

  95. Lee C H. Mining spatio-temporal information on microblogging streams using a density-based online clustering method. Expert Systems with Applications, 2012, 39(10): 9623-9641.

    Article  Google Scholar 

  96. Yu Y, Wang Q, Wang X, Wang H, He J. Online clustering for trajectory data stream of moving objects. Computer Science and Information Systems, 2013, 10(3): 1293-1317.

    Article  MathSciNet  Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Amineh Amini.

Additional information

The work is supported by the University of Malaya Research under Grant No. RG097-12ICT.

Electronic supplementary material

Below is the link to the electronic supplementary material.

ESM 1

(DOC 27 kb)

Rights and permissions

Reprints and permissions

About this article

Cite this article

Amini, A., Wah, T.Y. & Saboohi, H. On Density-Based Data Streams Clustering Algorithms: A Survey. J. Comput. Sci. Technol. 29, 116–141 (2014). https://doi.org/10.1007/s11390-014-1416-y

Download citation

  • Received:

  • Revised:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s11390-014-1416-y

Keywords

Navigation