# On Density-Based Data Streams Clustering Algorithms: A Survey

- 1.6k Downloads
- 83 Citations

## Abstract

Clustering data streams has drawn lots of attention in the last few years due to their ever-growing presence. Data streams put additional challenges on clustering such as limited time and memory and one pass clustering. Furthermore, discovering clusters with arbitrary shapes is very important in data stream applications. Data streams are infinite and evolving over time, and we do not have any knowledge about the number of clusters. In a data stream environment due to various factors, some noise appears occasionally. Density-based method is a remarkable class in clustering data streams, which has the ability to discover arbitrary shape clusters and to detect noise. Furthermore, it does not need the number of clusters in advance. Due to data stream characteristics, the traditional density-based clustering is not applicable. Recently, a lot of density-based clustering algorithms are extended for data streams. The main idea in these algorithms is using density-based methods in the clustering process and at the same time overcoming the constraints, which are put out by data stream’s nature. The purpose of this paper is to shed light on some algorithms in the literature on density-based clustering over data streams. We not only summarize the main density-based clustering algorithms on data streams, discuss their uniqueness and limitations, but also explain how they address the challenges in clustering data streams. Moreover, we investigate the evaluation metrics used in validating cluster quality and measuring algorithms’ performance. It is hoped that this survey will serve as a steppingstone for researchers studying data streams clustering, particularly density-based algorithms.

## Keywords

data stream density-based clustering grid-based clustering micro-clustering## Preview

Unable to display preview. Download preview PDF.

## Supplementary material

## References

- 1.Han J, Kamber M, Pei J. Data Mining: Concepts and Techniques (3rd edition). San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2011.Google Scholar
- 2.Gaber M, Zaslavsky A, Krishnaswamy S. Mining data streams: A review.
*ACM SIGMOD Record*, 2005, 34(2): 18-26.CrossRefGoogle Scholar - 3.Cao F, Ester M, Qian W, Zhou A. Density-based clustering over an evolving data stream with noise. In
*Proc. the 2006 SIAM Conference on Data Mining*, April 2006, pp. 328-339.Google Scholar - 4.Chen Y, Tu L. Density-based clustering for real-time stream data. In
*Proc. the 13th ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining*, Aug. 2007, pp. 133-142.Google Scholar - 5.Aggarwal C C (ed.). Data Streams: Models and Algorithms. New York, USA: Springer, 2007.Google Scholar
- 6.Hahsler M, Dunham M H. Temporal structure learning for clustering massive data streams in real-time. In
*Proc. the 11th SIAM Conference on Data Mining*, April 2011, pp. 664-675.Google Scholar - 7.O Ćallaghan L, Mishra N, Meyerson A
*et al*. Streaming-data algorithms for high-quality clustering. In*Proc. the 18th Int. Conf. Data Engineering*, Feb. 26-Mar. 1, 2002, pp. 685-694.Google Scholar - 8.Barbará D. Requirements for clustering data streams.
*SIGKDD Explorations Newsletter*, 2002, 3(2): 23-27.CrossRefGoogle Scholar - 9.Guha S, Meyerson A, Mishra N
*et al*. Clustering data streams: Theory and practice.*IEEE Trans. Knowledge and Data Engineering*, 2003, 15(3): 515-528.CrossRefGoogle Scholar - 10.Aggarwal C C, Han J, Wang J, Yu P S. A framework for clustering evolving data streams. In
*Proc. the 29th International Conference on Very Large Data Bases*, Sept. 2003, pp. 81-92.Google Scholar - 11.Ackermann M R, Lammersen C, Märtens M, Raupach C, Sohler C, Swierkot K. StreamKM++: A clustering algorithm for data streams. In
*Proc. the 12th Workshop on Algorithm Engineering and Experiments*, Jan. 2010, pp. 173-187.Google Scholar - 12.Ikonomovska E, Loskovska S, Gjorgjevik D. A survey of stream data mining. In
*Proc. the 8th National Conference with International Participation*, Sept. 2007, pp. 19-25.Google Scholar - 13.Gaber M, Zaslavsky A, Krishnaswamy S. Data stream mining.
*Data Mining and Knowledge Discovery Handbook*, 2010, pp. 759-787.Google Scholar - 14.Babcock B, Babu S, Datar M, Motwani R, Widom J. Models and issues in data stream systems. In
*Proc. the 21st ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems*, June 2002, pp. 1-16.Google Scholar - 15.Jain A K, Dubes R C. Algorithms for Clustering Data. Upper Saddle River, NJ, USA: Prentice-Hall, Inc., 1988.Google Scholar
- 16.Jain A K. Data clustering: 50 years beyond K-means.
*Pattern Recognition Letter*, 2010, 31(8): 651-666.CrossRefGoogle Scholar - 17.Mahdiraji A. Clustering data stream: A survey of algorithms.
*Int. J. Knowledge-Based and Intelligent Engineering Systems*, 2009, 13(2): 39-44.Google Scholar - 18.Amini A, Wah T, Saybani M
*et al*. A study of density-grid based clustering algorithms on data streams. In*Proc. the 8th Int. Conf. Fuzzy Systems and Knowledge Discovery*, July 2011, pp. 1652-1656.Google Scholar - 19.Amini A, Wah T Y. Density micro-clustering algorithms on data streams: A review. In
*Proc. Int. Multiconf. Data Mining and Applications*, March 2011, pp. 410-414.Google Scholar - 20.Amini A, Wah T Y. A comparative study of density-based clustering algorithms on data streams: Micro-clustering approaches. In
*Lecture Notes in Electrical Engineering 110*, Ao S, Castillo O, Huang X (eds.), Springer, 2012, pp. 275-287.Google Scholar - 21.Aggarwal C C. A survey of stream clustering algorithms. In
*Data Clustering: Algorithms and Applications*, Aggarwal C C, Reddy C (eds.), CRC Press, 2013, pp. 457-482.Google Scholar - 22.Han J, Kamber M. Data Mining: Concepts and Techniques (2nd edition). Morgan Kaufmann, 2006.Google Scholar
- 23.MacQueen J. Some methods for classification and analysis of multivariate observations. In
*Proc. the 5th Berkeley Symposium on Mathematical Statistics and Probability*, June 21-July 18, 1967, pp. 281-297.Google Scholar - 24.Lloyd S P. Least squares quantization in PCM.
*IEEE Transactions on Information Theory*, 1982, 28(2): 129-137.CrossRefzbMATHMathSciNetGoogle Scholar - 25.Guha S, Mishra N, Motwani R, O’Callaghan L. Clustering data streams. In
*Proc. the 41st Annual Symposium on Foundations of Computer Science*, Nov. 2000, pp. 359-366.Google Scholar - 26.Zhang T, Ramakrishnan R, Livny M. BIRCH: An efficient data clustering method for very large databases. In
*Proc. the 1996 ACM SIGMOD International Conference on Management of Data*, June 1996, pp. 103-114.Google Scholar - 27.Karypis G, Han E, Kumar V. Chameleon: Hierarchical clustering using dynamic modeling.
*Computer*, 1999, 32(8): 68-75.CrossRefGoogle Scholar - 28.Kranen P, Assent I, Baldauf C, Seidl T. The clustree: Indexing micro-clusters for anytime stream mining.
*Knowl. Inf. Syst.*, 2011, 29(2): 249-272.CrossRefGoogle Scholar - 29.Wang W, Yang J, Muntz R R. STING: A statistical information grid approach to spatial data mining. In
*Proc. the 23rd Int. Conf. Very Large Data Bases*, Aug. 1997, pp. 186-195.Google Scholar - 30.Sheikholeslami G, Chatterjee S, Zhang A. Wavecluster: A wavelet-based clustering approach for spatial data in very large databases.
*The VLDB Journal*, 2000, 8(3/4): 289-304.CrossRefGoogle Scholar - 31.Agrawal R, Gehrke J, Gunopulos D, Raghavan P. Automatic subspace clustering of high dimensional data for data mining applications.
*ACM SIGMOD Record*, 1998, 27(2): 94-105.CrossRefGoogle Scholar - 32.Tu L, Chen Y. Stream data clustering based on grid density and attraction.
*ACM Transactions on Knowledge Discovery Data*, 2009, 3(3): Article No. 12.Google Scholar - 33.Wan L, Ng W K, Dang X H
*et al*. Density-based clustering of data streams at multiple resolutions.*ACM Trans. Knowledge Discovery from Data*, 2009, 3(3): Article No. 14.Google Scholar - 34.Dempster A P, Laird N M, Rubin D B. Maximum likelihood from incomplete data via the EM algorithm.
*Journal of the Royal Statistical Society, Series B*, 1977, 39(1): 1-38.zbMATHMathSciNetGoogle Scholar - 35.Dang X, Lee V, Ng W K
*et al*. An EM-based algorithm for clustering data streams in sliding windows. In*Proc. the Int. Conf. Database Systems for Advanced Applications*, Apr. 2009, pp. 230-235.Google Scholar - 36.Ester M, Kriegel H, Sander J, Xu X. A density-based algorithm for discovering clusters in large spatial databases with noise. In
*Proc. the 2nd International Conference on Knowledge Discovery and Data Mining*, Aug. 1996, pp. 226-231.Google Scholar - 37.Ankerst M, Breunig M M, Kriegel H P, Sander J. Optics: Ordering points to identify the clustering structure.
*ACM SIGMOD Record*, 1999, 28(2): 49-60.CrossRefGoogle Scholar - 38.Hinneburg A, Keim D A. An efficient approach to clustering in large multimedia databases with noise. In
*Proc. the 4th KDD*, Sept. 1998, pp. 58-65.Google Scholar - 39.Matysiak M. Data stream mining: Basic methods and techniques. Technical Report, RWTH Aachen University, 2012.Google Scholar
- 40.Zhou A, Cao F, Qian W, Jin C. Tracking clusters in evolving data streams over sliding windows.
*Knowledge and Information Systems*, 2008, 15(2): 181-214.CrossRefGoogle Scholar - 41.Ren J, Ma R. Density-based data streams clustering over sliding windows. In
*Proc. the 6th Int. Conf. Fuzzy systems and Knowledge Discovery*, Aug. 2009, pp. 248-252.Google Scholar - 42.Charikar M, O’Callaghan L, Panigrahy R. Better streaming algorithms for clustering problems. In
*Proc. the 35th Annual ACM Symp. Theory of Computing*, June 2003, pp. 30-39.Google Scholar - 43.Gao J, Li J, Zhang Z, Tan P N. An incremental data stream clustering algorithm based on dense units detection. In
*Proc. the 9th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining*, May 2005, pp. 420-425.Google Scholar - 44.Aggarwal C C, Han J, Wang J, Yu P S. A framework for projected clustering of high dimensional data streams. In
*Proc. the 30th International Conference on Very Large Data Bases, Volume 30*, Aug. 29-Sept. 3, 2004, pp. 852-863.Google Scholar - 45.Aggarwal C C, Han J, Wang J, Yu P S. On high dimensional projected clustering of data streams.
*Data Mining and Knowledge Discovery*, 2005, 10(3): 251-273.CrossRefMathSciNetGoogle Scholar - 46.Babcock B, Datar M, Motwani R, O’Callaghan L. Maintaining variance and k-medians over data stream windows. In
*Proc. the 22nd ACM SIGMOD-SIGACT-SIGART Symp. Principles of Database Systems*, June 2003, pp. 234-243.Google Scholar - 47.Ng W, Dash M. Discovery of frequent patterns in transactional data streams. In
*Lecture Notes in Computer Science 6380*, Hameurlain A, Küng J, Wagner R*et al*. (eds.), Springer Berlin/Heidelberg, 2010, pp. 1-30.Google Scholar - 48.Vitter J S. Random sampling with a reservoir.
*ACM Trans. Math. Softw.*, 1985, 11(1): 37-57.CrossRefzbMATHMathSciNetGoogle Scholar - 49.Garofalakis M, Gehrke J, Rastogi R. Querying and mining data streams: You only get one look: A tutorial. In
*Proc. the 2002 ACM SIGMOD Int. Conf. Management of Data*, June 2002, pp. 635-635.Google Scholar - 50.Aggarwal C C, Yu P S. A survey of synopsis construction in data streams. In
*Advances in Database Systems 31*, Aggarwal C C (ed.), Springer, 2007, pp. 169-207.Google Scholar - 51.Garofalakis M N. Wavelets on streams. In
*Encyclopedia of Database Systems*, Springer US, 2009, pp. 3446-3451.Google Scholar - 52.Gilbert A C, Kotidis Y, Muthukrishnan S, Strauss M J. One-pass wavelet decompositions of data streams.
*IEEE Trans. Knowl. and Data Eng.*, 2003, 15(3): 541-554.CrossRefGoogle Scholar - 53.Gama J, Gaber M M (eds.). Learning from Data Streams - Processing Techniques in Sensor Networks. Springer, 2007.Google Scholar
- 54.Rosset S, Inger A. KDD-cup 99: Knowledge discovery in a charitable organization’s donor database.
*SIGKDD Explorations Newsletter*, 2000, 1(2): 85-90.CrossRefGoogle Scholar - 55.Hubert L J, Levin J R. A general statistical framework for assessing categorical clustering in free recall.
*Psychological Bulletin*, 1976, 83(6): 1072-1080.CrossRefGoogle Scholar - 56.Kaufman L, Rousseeuw P J. Finding Groups in Data: An Introduction to Cluster Analysis. Wiley-Interscience, 2005.Google Scholar
- 57.Wu J, Xiong H, Chen J. Adapting the right measures for K-means clustering. In
*Proc. the 15th ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining*, June 2009, pp. 877-886.Google Scholar - 58.Rand W M. Objective criteria for the evaluation of clustering methods.
*Journal of the American Statistical Association*, 1971, 66(336): 846-850.CrossRefGoogle Scholar - 59.Zhao Y, Karypis G. Empirical and theoretical comparisons of selected criterion functions for document clustering.
*Machine Learning*, 2004, 55(3): 311-331.CrossRefzbMATHGoogle Scholar - 60.Dongen S. Performance criteria for graph clustering and Markov cluster experiments. Technical Report, National Research Institute for Mathematics and Computer Science, Stichting Mathematisch Centrum, Netherlands, 2000.Google Scholar
- 61.Rosenberg A, Hirschberg J. V-measure: A conditional entropy-based external cluster evaluation measure. In
*Proc. the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning*, June 2007, pp. 410-420.Google Scholar - 62.Meilǎ M. Comparing clusterings: An axiomatic view. In
*Proc. the 22nd Int. Conf. Machine Learning*, Aug. 2005, pp. 577-584.Google Scholar - 63.Rijsbergen C J V. Information Retrieval. Newton, MA, USA: Butterworth-Heinemann, 1979.Google Scholar
- 64.Milligan G. A Monte Carlo study of thirty internal criterion measures for cluster analysis.
*Psychometrika*, 1981, 46(2): 187-199.CrossRefzbMATHMathSciNetGoogle Scholar - 65.Pereira C M M, de Mello R F. A comparison of clustering algorithms for data streams. In
*Proc. the 1st Int. Conf. Integrated Comp. Tech.*, May 31-June 2, 2011, pp. 59-74.Google Scholar - 66.Manning C D, Raghavan P, Schtze H. Introduction to Information Retrieval. New York, NY, USA: Cambridge University Press, 2008.Google Scholar
- 67.Forestiero A, Pizzuti C, Spezzano G. A single pass algorithm for clustering evolving data streams based on swarm intelligence.
*Data Mining and Knowledge Discovery*, 2013, 26(1): 1-26.CrossRefMathSciNetGoogle Scholar - 68.Bifet A, Holmes G, Pfahringer B
*et al*. MOA: Massive online analysis, a framework for stream classification and clustering.*Journal of Machine Learning Research*, 2010, 11: 44-50.Google Scholar - 69.Holmes G, Donkin A, Witten I H. WEKA: A machine learning workbench. In
*Proc. the 2nd Australian and New Zealand Conference on Intelligent Information Systems*, Nov. 29-Dec. 3, 1994, pp. 357-361.Google Scholar - 70.Kranen P, Kremer H, Jansen T
*et al*. Clustering performance on evolving data streams: Assessing algorithms and evaluation measures within MOA. In*Proc. the IEEE Int. Conf. Data Mining Workshops*, Dec. 2010, pp. 1400-1403.Google Scholar - 71.Kremer H, Kranen P, Jansen T
*et al*. An effective evaluation measure for clustering on evolving data streams. In*Proc. the 17th ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining*, July 2011, pp. 868-876.Google Scholar - 72.De Francisci Morales G. SAMOA: A platform for mining big data streams. In
*Proc. the 22nd Int. Conf. World Wide Web Companion*, May 2013, pp. 777-778.Google Scholar - 73.Tasoulis D K, Ross G, Adams N M. Visualising the cluster structure of data streams. In
*Proc. the 7th International Conference on Intelligent Data Analysis*, Sept. 2007, pp. 81-92.Google Scholar - 74.Ruiz C, Menasalvas E, Spiliopoulou M. C-DenStream: Using domain knowledge on a data stream. In
*Proc. the 12th International Conference on Discovery Science*, Oct. 2009, pp. 287-301.Google Scholar - 75.Liu L, Jing K, Guo Y
*et al*. A three-step clustering algorithm over an evolving data stream. In*Proc. the IEEE Int. Conf. Intelligent Computing and Intelligent Systems*, Nov. 2009, pp. 160-164.Google Scholar - 76.Lin J, Lin H. A density-based clustering over evolving heterogeneous data stream. In
*Proc. the 2nd Int. Colloquium on Computing, Communication, Control, and Management*, Aug. 2009, pp. 275-277.Google Scholar - 77.Isaksson C, Dunham M, Hahsler M. SOStream: Self organizing density-based clustering over data stream. In
*Lecture Notes in Computer Science 7376*, Perner P (ed.), Springer Berlin Heidelberg, 2012, pp. 264-278.Google Scholar - 78.Ntoutsi I, Zimek A, Palpanas T
*et al*. Density-based projected clustering over high dimensional data streams. In*Proc. the 12th SIAM Int. Conf. Data Mining*, April 2012, pp. 987-998.Google Scholar - 79.Hassani M, Spaus P, Gaber M M, Seidl T. Density-based projected clustering of data streams. In
*Proc. the 6th Int. Conf. Scalable Uncertainty Management*, Sept. 2012, pp. 311-324.Google Scholar - 80.Jia C, Tan C, Yong A. A grid and density-based clustering algorithm for processing data stream. In
*Proc. the 2nd Int. Conf. Genetic and Evolutionary Computing*, Sept. 2008, pp. 517-521.Google Scholar - 81.Ren J, Cai B, Hu C. Clustering over data streams based on grid density and index tree.
*Journal of Convergence Information Technology*, 2011, 6(1): 83-93.CrossRefGoogle Scholar - 82.Yang Y, Liu Z, Zhang J
*et al*. Dynamic density-based clustering algorithm over uncertain data streams. In*Proc. the 9th Int. Conf. Fuzzy Systems and Knowledge Discovery*, May 2012, pp. 2664-2670.Google Scholar - 83.Amini A, Teh Ying W. DENGRIS-Stream: A density-grid based clustering algorithm for evolving data streams over sliding window. In
*Proc. International Conference on Data Mining and Computer Engineering*, Dec. 2012, pp. 206-210.Google Scholar - 84.Bhatnagar V, Kaur S, Chakravarthy S. Clustering data streams using grid-based synopsis.
*Knowledge and Information Systems*, June 2013.Google Scholar - 85.Ruiz C, Spiliopoulou M, Menasalvas E. C-DBSCAN: Density-based clustering with constraints. In
*Proc. the 11th International Conference on Rough Sets, Fuzzy Sets, Data Mining and Granular Computing*, May 2007, pp. 216-223.Google Scholar - 86.Yang C, Zhou J. HClustream: A novel approach for clustering evolving heterogeneous data stream. In
*Proc. the 6th IEEE Int. Conf. Data Mining Workshops*, Dec. 2006, pp. 682-688.Google Scholar - 87.Kohonen T. Self-organized formation of topologically correct feature maps.
*Biological Cybernetics*, 1982, 43(1): 59-69.CrossRefzbMATHMathSciNetGoogle Scholar - 88.Bohm C, Kailing K, Kriegel H P, Kroger P. Density connected clustering with local subspace preferences. In
*Proc. the 4th IEEE Int. Conf. Data Mining*, Nov. 2004, pp. 27-34.Google Scholar - 89.Kennedy J F, Kennedy J, Eberhart R C. Swarm Intelligence. Morgan Kaufmann Pub, 2001.Google Scholar
- 90.Shamshirband S, Anuar N, Kiah M
*et al*. An appraisal and design of a multi-agent system based cooperative wireless intrusion detection computational intelligence technique.*Engineering Applications of Artificial Intelligence*, 2013, 26(9): 2105-2127.CrossRefGoogle Scholar - 91.Sander J, Ester M, Kriegel H P, Xu X. Density-based clustering in spatial databases: The algorithm GDBSCAN and its applications.
*Data Mining and Knowledge Discovery*, 1998, 2(2): 169-194.CrossRefGoogle Scholar - 92.Plant C, Teipel S J, Oswald A
*et al*. Automated detection of brain atrophy patterns based on MRI for the prediction of Alzheimer’s disease.*NeuroImage*, 2010, 50(1): 162-174.CrossRefGoogle Scholar - 93.Mete M, Kockara S, Aydin K. Fast density-based lesion detection in dermoscopy images.
*Computerized Medical Imaging and Graphics*, 2011, 35(2): 128-136.CrossRefGoogle Scholar - 94.Yang D, Rundensteiner E A, Ward M O. Summarization and matching of density-based clusters in streaming environments.
*Proc. VLDB Endow.*, 2011, 5(2): 121-132.Google Scholar - 95.Lee C H. Mining spatio-temporal information on microblogging streams using a density-based online clustering method.
*Expert Systems with Applications*, 2012, 39(10): 9623-9641.CrossRefGoogle Scholar - 96.Yu Y, Wang Q, Wang X, Wang H, He J. Online clustering for trajectory data stream of moving objects.
*Computer Science and Information Systems*, 2013, 10(3): 1293-1317.CrossRefMathSciNetGoogle Scholar