Journal of Computer Science and Technology

, Volume 29, Issue 1, pp 116–141 | Cite as

On Density-Based Data Streams Clustering Algorithms: A Survey

Regular Paper

Abstract

Clustering data streams has drawn lots of attention in the last few years due to their ever-growing presence. Data streams put additional challenges on clustering such as limited time and memory and one pass clustering. Furthermore, discovering clusters with arbitrary shapes is very important in data stream applications. Data streams are infinite and evolving over time, and we do not have any knowledge about the number of clusters. In a data stream environment due to various factors, some noise appears occasionally. Density-based method is a remarkable class in clustering data streams, which has the ability to discover arbitrary shape clusters and to detect noise. Furthermore, it does not need the number of clusters in advance. Due to data stream characteristics, the traditional density-based clustering is not applicable. Recently, a lot of density-based clustering algorithms are extended for data streams. The main idea in these algorithms is using density-based methods in the clustering process and at the same time overcoming the constraints, which are put out by data stream’s nature. The purpose of this paper is to shed light on some algorithms in the literature on density-based clustering over data streams. We not only summarize the main density-based clustering algorithms on data streams, discuss their uniqueness and limitations, but also explain how they address the challenges in clustering data streams. Moreover, we investigate the evaluation metrics used in validating cluster quality and measuring algorithms’ performance. It is hoped that this survey will serve as a steppingstone for researchers studying data streams clustering, particularly density-based algorithms.

Keywords

data stream density-based clustering grid-based clustering micro-clustering 

Preview

Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.

Supplementary material

11390_2014_1416_MOESM1_ESM.doc (27 kb)
ESM 1(DOC 27 kb)

References

  1. 1.
    Han J, Kamber M, Pei J. Data Mining: Concepts and Techniques (3rd edition). San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2011.Google Scholar
  2. 2.
    Gaber M, Zaslavsky A, Krishnaswamy S. Mining data streams: A review. ACM SIGMOD Record, 2005, 34(2): 18-26.CrossRefGoogle Scholar
  3. 3.
    Cao F, Ester M, Qian W, Zhou A. Density-based clustering over an evolving data stream with noise. In Proc. the 2006 SIAM Conference on Data Mining, April 2006, pp. 328-339.Google Scholar
  4. 4.
    Chen Y, Tu L. Density-based clustering for real-time stream data. In Proc. the 13th ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining, Aug. 2007, pp. 133-142.Google Scholar
  5. 5.
    Aggarwal C C (ed.). Data Streams: Models and Algorithms. New York, USA: Springer, 2007.Google Scholar
  6. 6.
    Hahsler M, Dunham M H. Temporal structure learning for clustering massive data streams in real-time. In Proc. the 11th SIAM Conference on Data Mining, April 2011, pp. 664-675.Google Scholar
  7. 7.
    O Ćallaghan L, Mishra N, Meyerson A et al. Streaming-data algorithms for high-quality clustering. In Proc. the 18th Int. Conf. Data Engineering, Feb. 26-Mar. 1, 2002, pp. 685-694.Google Scholar
  8. 8.
    Barbará D. Requirements for clustering data streams. SIGKDD Explorations Newsletter, 2002, 3(2): 23-27.CrossRefGoogle Scholar
  9. 9.
    Guha S, Meyerson A, Mishra N et al. Clustering data streams: Theory and practice. IEEE Trans. Knowledge and Data Engineering, 2003, 15(3): 515-528.CrossRefGoogle Scholar
  10. 10.
    Aggarwal C C, Han J, Wang J, Yu P S. A framework for clustering evolving data streams. In Proc. the 29th International Conference on Very Large Data Bases, Sept. 2003, pp. 81-92.Google Scholar
  11. 11.
    Ackermann M R, Lammersen C, Märtens M, Raupach C, Sohler C, Swierkot K. StreamKM++: A clustering algorithm for data streams. In Proc. the 12th Workshop on Algorithm Engineering and Experiments, Jan. 2010, pp. 173-187.Google Scholar
  12. 12.
    Ikonomovska E, Loskovska S, Gjorgjevik D. A survey of stream data mining. In Proc. the 8th National Conference with International Participation, Sept. 2007, pp. 19-25.Google Scholar
  13. 13.
    Gaber M, Zaslavsky A, Krishnaswamy S. Data stream mining. Data Mining and Knowledge Discovery Handbook, 2010, pp. 759-787.Google Scholar
  14. 14.
    Babcock B, Babu S, Datar M, Motwani R, Widom J. Models and issues in data stream systems. In Proc. the 21st ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, June 2002, pp. 1-16.Google Scholar
  15. 15.
    Jain A K, Dubes R C. Algorithms for Clustering Data. Upper Saddle River, NJ, USA: Prentice-Hall, Inc., 1988.Google Scholar
  16. 16.
    Jain A K. Data clustering: 50 years beyond K-means. Pattern Recognition Letter, 2010, 31(8): 651-666.CrossRefGoogle Scholar
  17. 17.
    Mahdiraji A. Clustering data stream: A survey of algorithms. Int. J. Knowledge-Based and Intelligent Engineering Systems, 2009, 13(2): 39-44.Google Scholar
  18. 18.
    Amini A, Wah T, Saybani M et al. A study of density-grid based clustering algorithms on data streams. In Proc. the 8th Int. Conf. Fuzzy Systems and Knowledge Discovery, July 2011, pp. 1652-1656.Google Scholar
  19. 19.
    Amini A, Wah T Y. Density micro-clustering algorithms on data streams: A review. In Proc. Int. Multiconf. Data Mining and Applications, March 2011, pp. 410-414.Google Scholar
  20. 20.
    Amini A, Wah T Y. A comparative study of density-based clustering algorithms on data streams: Micro-clustering approaches. In Lecture Notes in Electrical Engineering 110, Ao S, Castillo O, Huang X (eds.), Springer, 2012, pp. 275-287.Google Scholar
  21. 21.
    Aggarwal C C. A survey of stream clustering algorithms. In Data Clustering: Algorithms and Applications, Aggarwal C C, Reddy C (eds.), CRC Press, 2013, pp. 457-482.Google Scholar
  22. 22.
    Han J, Kamber M. Data Mining: Concepts and Techniques (2nd edition). Morgan Kaufmann, 2006.Google Scholar
  23. 23.
    MacQueen J. Some methods for classification and analysis of multivariate observations. In Proc. the 5th Berkeley Symposium on Mathematical Statistics and Probability, June 21-July 18, 1967, pp. 281-297.Google Scholar
  24. 24.
    Lloyd S P. Least squares quantization in PCM. IEEE Transactions on Information Theory, 1982, 28(2): 129-137.CrossRefMATHMathSciNetGoogle Scholar
  25. 25.
    Guha S, Mishra N, Motwani R, O’Callaghan L. Clustering data streams. In Proc. the 41st Annual Symposium on Foundations of Computer Science, Nov. 2000, pp. 359-366.Google Scholar
  26. 26.
    Zhang T, Ramakrishnan R, Livny M. BIRCH: An efficient data clustering method for very large databases. In Proc. the 1996 ACM SIGMOD International Conference on Management of Data, June 1996, pp. 103-114.Google Scholar
  27. 27.
    Karypis G, Han E, Kumar V. Chameleon: Hierarchical clustering using dynamic modeling. Computer, 1999, 32(8): 68-75.CrossRefGoogle Scholar
  28. 28.
    Kranen P, Assent I, Baldauf C, Seidl T. The clustree: Indexing micro-clusters for anytime stream mining. Knowl. Inf. Syst., 2011, 29(2): 249-272.CrossRefGoogle Scholar
  29. 29.
    Wang W, Yang J, Muntz R R. STING: A statistical information grid approach to spatial data mining. In Proc. the 23rd Int. Conf. Very Large Data Bases, Aug. 1997, pp. 186-195.Google Scholar
  30. 30.
    Sheikholeslami G, Chatterjee S, Zhang A. Wavecluster: A wavelet-based clustering approach for spatial data in very large databases. The VLDB Journal, 2000, 8(3/4): 289-304.CrossRefGoogle Scholar
  31. 31.
    Agrawal R, Gehrke J, Gunopulos D, Raghavan P. Automatic subspace clustering of high dimensional data for data mining applications. ACM SIGMOD Record, 1998, 27(2): 94-105.CrossRefGoogle Scholar
  32. 32.
    Tu L, Chen Y. Stream data clustering based on grid density and attraction. ACM Transactions on Knowledge Discovery Data, 2009, 3(3): Article No. 12.Google Scholar
  33. 33.
    Wan L, Ng W K, Dang X H et al. Density-based clustering of data streams at multiple resolutions. ACM Trans. Knowledge Discovery from Data, 2009, 3(3): Article No. 14.Google Scholar
  34. 34.
    Dempster A P, Laird N M, Rubin D B. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 1977, 39(1): 1-38.MATHMathSciNetGoogle Scholar
  35. 35.
    Dang X, Lee V, Ng W K et al. An EM-based algorithm for clustering data streams in sliding windows. In Proc. the Int. Conf. Database Systems for Advanced Applications, Apr. 2009, pp. 230-235.Google Scholar
  36. 36.
    Ester M, Kriegel H, Sander J, Xu X. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proc. the 2nd International Conference on Knowledge Discovery and Data Mining, Aug. 1996, pp. 226-231.Google Scholar
  37. 37.
    Ankerst M, Breunig M M, Kriegel H P, Sander J. Optics: Ordering points to identify the clustering structure. ACM SIGMOD Record, 1999, 28(2): 49-60.CrossRefGoogle Scholar
  38. 38.
    Hinneburg A, Keim D A. An efficient approach to clustering in large multimedia databases with noise. In Proc. the 4th KDD, Sept. 1998, pp. 58-65.Google Scholar
  39. 39.
    Matysiak M. Data stream mining: Basic methods and techniques. Technical Report, RWTH Aachen University, 2012.Google Scholar
  40. 40.
    Zhou A, Cao F, Qian W, Jin C. Tracking clusters in evolving data streams over sliding windows. Knowledge and Information Systems, 2008, 15(2): 181-214.CrossRefGoogle Scholar
  41. 41.
    Ren J, Ma R. Density-based data streams clustering over sliding windows. In Proc. the 6th Int. Conf. Fuzzy systems and Knowledge Discovery, Aug. 2009, pp. 248-252.Google Scholar
  42. 42.
    Charikar M, O’Callaghan L, Panigrahy R. Better streaming algorithms for clustering problems. In Proc. the 35th Annual ACM Symp. Theory of Computing, June 2003, pp. 30-39.Google Scholar
  43. 43.
    Gao J, Li J, Zhang Z, Tan P N. An incremental data stream clustering algorithm based on dense units detection. In Proc. the 9th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining, May 2005, pp. 420-425.Google Scholar
  44. 44.
    Aggarwal C C, Han J, Wang J, Yu P S. A framework for projected clustering of high dimensional data streams. In Proc. the 30th International Conference on Very Large Data Bases, Volume 30, Aug. 29-Sept. 3, 2004, pp. 852-863.Google Scholar
  45. 45.
    Aggarwal C C, Han J, Wang J, Yu P S. On high dimensional projected clustering of data streams. Data Mining and Knowledge Discovery, 2005, 10(3): 251-273.CrossRefMathSciNetGoogle Scholar
  46. 46.
    Babcock B, Datar M, Motwani R, O’Callaghan L. Maintaining variance and k-medians over data stream windows. In Proc. the 22nd ACM SIGMOD-SIGACT-SIGART Symp. Principles of Database Systems, June 2003, pp. 234-243.Google Scholar
  47. 47.
    Ng W, Dash M. Discovery of frequent patterns in transactional data streams. In Lecture Notes in Computer Science 6380, Hameurlain A, Küng J, Wagner R et al. (eds.), Springer Berlin/Heidelberg, 2010, pp. 1-30.Google Scholar
  48. 48.
    Vitter J S. Random sampling with a reservoir. ACM Trans. Math. Softw., 1985, 11(1): 37-57.CrossRefMATHMathSciNetGoogle Scholar
  49. 49.
    Garofalakis M, Gehrke J, Rastogi R. Querying and mining data streams: You only get one look: A tutorial. In Proc. the 2002 ACM SIGMOD Int. Conf. Management of Data, June 2002, pp. 635-635.Google Scholar
  50. 50.
    Aggarwal C C, Yu P S. A survey of synopsis construction in data streams. In Advances in Database Systems 31, Aggarwal C C (ed.), Springer, 2007, pp. 169-207.Google Scholar
  51. 51.
    Garofalakis M N. Wavelets on streams. In Encyclopedia of Database Systems, Springer US, 2009, pp. 3446-3451.Google Scholar
  52. 52.
    Gilbert A C, Kotidis Y, Muthukrishnan S, Strauss M J. One-pass wavelet decompositions of data streams. IEEE Trans. Knowl. and Data Eng., 2003, 15(3): 541-554.CrossRefGoogle Scholar
  53. 53.
    Gama J, Gaber M M (eds.). Learning from Data Streams - Processing Techniques in Sensor Networks. Springer, 2007.Google Scholar
  54. 54.
    Rosset S, Inger A. KDD-cup 99: Knowledge discovery in a charitable organization’s donor database. SIGKDD Explorations Newsletter, 2000, 1(2): 85-90.CrossRefGoogle Scholar
  55. 55.
    Hubert L J, Levin J R. A general statistical framework for assessing categorical clustering in free recall. Psychological Bulletin, 1976, 83(6): 1072-1080.CrossRefGoogle Scholar
  56. 56.
    Kaufman L, Rousseeuw P J. Finding Groups in Data: An Introduction to Cluster Analysis. Wiley-Interscience, 2005.Google Scholar
  57. 57.
    Wu J, Xiong H, Chen J. Adapting the right measures for K-means clustering. In Proc. the 15th ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining, June 2009, pp. 877-886.Google Scholar
  58. 58.
    Rand W M. Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association, 1971, 66(336): 846-850.CrossRefGoogle Scholar
  59. 59.
    Zhao Y, Karypis G. Empirical and theoretical comparisons of selected criterion functions for document clustering. Machine Learning, 2004, 55(3): 311-331.CrossRefMATHGoogle Scholar
  60. 60.
    Dongen S. Performance criteria for graph clustering and Markov cluster experiments. Technical Report, National Research Institute for Mathematics and Computer Science, Stichting Mathematisch Centrum, Netherlands, 2000.Google Scholar
  61. 61.
    Rosenberg A, Hirschberg J. V-measure: A conditional entropy-based external cluster evaluation measure. In Proc. the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, June 2007, pp. 410-420.Google Scholar
  62. 62.
    Meilǎ M. Comparing clusterings: An axiomatic view. In Proc. the 22nd Int. Conf. Machine Learning, Aug. 2005, pp. 577-584.Google Scholar
  63. 63.
    Rijsbergen C J V. Information Retrieval. Newton, MA, USA: Butterworth-Heinemann, 1979.Google Scholar
  64. 64.
    Milligan G. A Monte Carlo study of thirty internal criterion measures for cluster analysis. Psychometrika, 1981, 46(2): 187-199.CrossRefMATHMathSciNetGoogle Scholar
  65. 65.
    Pereira C M M, de Mello R F. A comparison of clustering algorithms for data streams. In Proc. the 1st Int. Conf. Integrated Comp. Tech., May 31-June 2, 2011, pp. 59-74.Google Scholar
  66. 66.
    Manning C D, Raghavan P, Schtze H. Introduction to Information Retrieval. New York, NY, USA: Cambridge University Press, 2008.Google Scholar
  67. 67.
    Forestiero A, Pizzuti C, Spezzano G. A single pass algorithm for clustering evolving data streams based on swarm intelligence. Data Mining and Knowledge Discovery, 2013, 26(1): 1-26.CrossRefMathSciNetGoogle Scholar
  68. 68.
    Bifet A, Holmes G, Pfahringer B et al. MOA: Massive online analysis, a framework for stream classification and clustering. Journal of Machine Learning Research, 2010, 11: 44-50.Google Scholar
  69. 69.
    Holmes G, Donkin A, Witten I H. WEKA: A machine learning workbench. In Proc. the 2nd Australian and New Zealand Conference on Intelligent Information Systems, Nov. 29-Dec. 3, 1994, pp. 357-361.Google Scholar
  70. 70.
    Kranen P, Kremer H, Jansen T et al. Clustering performance on evolving data streams: Assessing algorithms and evaluation measures within MOA. In Proc. the IEEE Int. Conf. Data Mining Workshops, Dec. 2010, pp. 1400-1403.Google Scholar
  71. 71.
    Kremer H, Kranen P, Jansen T et al. An effective evaluation measure for clustering on evolving data streams. In Proc. the 17th ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining, July 2011, pp. 868-876.Google Scholar
  72. 72.
    De Francisci Morales G. SAMOA: A platform for mining big data streams. In Proc. the 22nd Int. Conf. World Wide Web Companion, May 2013, pp. 777-778.Google Scholar
  73. 73.
    Tasoulis D K, Ross G, Adams N M. Visualising the cluster structure of data streams. In Proc. the 7th International Conference on Intelligent Data Analysis, Sept. 2007, pp. 81-92.Google Scholar
  74. 74.
    Ruiz C, Menasalvas E, Spiliopoulou M. C-DenStream: Using domain knowledge on a data stream. In Proc. the 12th International Conference on Discovery Science, Oct. 2009, pp. 287-301.Google Scholar
  75. 75.
    Liu L, Jing K, Guo Y et al. A three-step clustering algorithm over an evolving data stream. In Proc. the IEEE Int. Conf. Intelligent Computing and Intelligent Systems, Nov. 2009, pp. 160-164.Google Scholar
  76. 76.
    Lin J, Lin H. A density-based clustering over evolving heterogeneous data stream. In Proc. the 2nd Int. Colloquium on Computing, Communication, Control, and Management, Aug. 2009, pp. 275-277.Google Scholar
  77. 77.
    Isaksson C, Dunham M, Hahsler M. SOStream: Self organizing density-based clustering over data stream. In Lecture Notes in Computer Science 7376, Perner P (ed.), Springer Berlin Heidelberg, 2012, pp. 264-278.Google Scholar
  78. 78.
    Ntoutsi I, Zimek A, Palpanas T et al. Density-based projected clustering over high dimensional data streams. In Proc. the 12th SIAM Int. Conf. Data Mining, April 2012, pp. 987-998.Google Scholar
  79. 79.
    Hassani M, Spaus P, Gaber M M, Seidl T. Density-based projected clustering of data streams. In Proc. the 6th Int. Conf. Scalable Uncertainty Management, Sept. 2012, pp. 311-324.Google Scholar
  80. 80.
    Jia C, Tan C, Yong A. A grid and density-based clustering algorithm for processing data stream. In Proc. the 2nd Int. Conf. Genetic and Evolutionary Computing, Sept. 2008, pp. 517-521.Google Scholar
  81. 81.
    Ren J, Cai B, Hu C. Clustering over data streams based on grid density and index tree. Journal of Convergence Information Technology, 2011, 6(1): 83-93.CrossRefGoogle Scholar
  82. 82.
    Yang Y, Liu Z, Zhang J et al. Dynamic density-based clustering algorithm over uncertain data streams. In Proc. the 9th Int. Conf. Fuzzy Systems and Knowledge Discovery, May 2012, pp. 2664-2670.Google Scholar
  83. 83.
    Amini A, Teh Ying W. DENGRIS-Stream: A density-grid based clustering algorithm for evolving data streams over sliding window. In Proc. International Conference on Data Mining and Computer Engineering, Dec. 2012, pp. 206-210.Google Scholar
  84. 84.
    Bhatnagar V, Kaur S, Chakravarthy S. Clustering data streams using grid-based synopsis. Knowledge and Information Systems, June 2013.Google Scholar
  85. 85.
    Ruiz C, Spiliopoulou M, Menasalvas E. C-DBSCAN: Density-based clustering with constraints. In Proc. the 11th International Conference on Rough Sets, Fuzzy Sets, Data Mining and Granular Computing, May 2007, pp. 216-223.Google Scholar
  86. 86.
    Yang C, Zhou J. HClustream: A novel approach for clustering evolving heterogeneous data stream. In Proc. the 6th IEEE Int. Conf. Data Mining Workshops, Dec. 2006, pp. 682-688.Google Scholar
  87. 87.
    Kohonen T. Self-organized formation of topologically correct feature maps. Biological Cybernetics, 1982, 43(1): 59-69.CrossRefMATHMathSciNetGoogle Scholar
  88. 88.
    Bohm C, Kailing K, Kriegel H P, Kroger P. Density connected clustering with local subspace preferences. In Proc. the 4th IEEE Int. Conf. Data Mining, Nov. 2004, pp. 27-34.Google Scholar
  89. 89.
    Kennedy J F, Kennedy J, Eberhart R C. Swarm Intelligence. Morgan Kaufmann Pub, 2001.Google Scholar
  90. 90.
    Shamshirband S, Anuar N, Kiah M et al. An appraisal and design of a multi-agent system based cooperative wireless intrusion detection computational intelligence technique. Engineering Applications of Artificial Intelligence, 2013, 26(9): 2105-2127.CrossRefGoogle Scholar
  91. 91.
    Sander J, Ester M, Kriegel H P, Xu X. Density-based clustering in spatial databases: The algorithm GDBSCAN and its applications. Data Mining and Knowledge Discovery, 1998, 2(2): 169-194.CrossRefGoogle Scholar
  92. 92.
    Plant C, Teipel S J, Oswald A et al. Automated detection of brain atrophy patterns based on MRI for the prediction of Alzheimer’s disease. NeuroImage, 2010, 50(1): 162-174.CrossRefGoogle Scholar
  93. 93.
    Mete M, Kockara S, Aydin K. Fast density-based lesion detection in dermoscopy images. Computerized Medical Imaging and Graphics, 2011, 35(2): 128-136.CrossRefGoogle Scholar
  94. 94.
    Yang D, Rundensteiner E A, Ward M O. Summarization and matching of density-based clusters in streaming environments. Proc. VLDB Endow., 2011, 5(2): 121-132.Google Scholar
  95. 95.
    Lee C H. Mining spatio-temporal information on microblogging streams using a density-based online clustering method. Expert Systems with Applications, 2012, 39(10): 9623-9641.CrossRefGoogle Scholar
  96. 96.
    Yu Y, Wang Q, Wang X, Wang H, He J. Online clustering for trajectory data stream of moving objects. Computer Science and Information Systems, 2013, 10(3): 1293-1317.CrossRefMathSciNetGoogle Scholar

Copyright information

© Springer Science+Business Media New York & Science Press, China 2014

Authors and Affiliations

  1. 1.Department of Information Systems, Faculty of Computer Science and Information Technology, University of MalayaKuala LumpurMalaysia

Personalised recommendations