Clustering Large Datasets Using Data Stream Clustering Techniques

  • Matthew Bolaños
  • John Forrest
  • Michael Hahsler (email author)
Conference paper
Part of the Studies in Classification, Data Analysis, and Knowledge Organization book series (STUDIES CLASS)


Unsupervised identification of groups in large data sets is important for many machine learning and knowledge discovery applications. Conventional clustering approaches (k-means, hierarchical clustering, etc.) typically do not scale well to very large data sets. In recent years, data stream clustering algorithms have been proposed that can deal efficiently with potentially unbounded streams of data. This paper is the first to investigate the use of data stream clustering algorithms as light-weight alternatives to conventional algorithms on large non-streaming data. We discuss important issues, including order dependence, and report the results of an initial study using several synthetic and real-world data sets.
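To make the idea of order dependence concrete, the following is a minimal sketch of a generic single-pass, threshold-based micro-clustering scheme. It is an illustration only and is not one of the algorithms evaluated in the paper; the function name `stream_cluster` and the `radius` parameter are hypothetical. Each point is seen once and is either absorbed by the nearest existing micro-cluster within `radius` or used to start a new one, so only a count and a running mean are kept per micro-cluster.

```python
import math


def stream_cluster(points, radius):
    """Single-pass micro-clustering sketch (illustrative assumption, not the paper's method).

    Each point is absorbed by the nearest micro-cluster whose center lies
    within `radius`; otherwise it seeds a new micro-cluster. Only a running
    mean and a count are stored per micro-cluster.
    """
    centers = []  # running mean of each micro-cluster
    counts = []   # number of points absorbed by each micro-cluster
    for p in points:
        best, best_d = None, radius
        for i, c in enumerate(centers):
            d = math.dist(p, c)
            if d <= best_d:
                best, best_d = i, d
        if best is None:
            # No micro-cluster is close enough: start a new one at this point.
            centers.append(list(p))
            counts.append(1)
        else:
            # Incrementally update the running mean of the chosen micro-cluster.
            counts[best] += 1
            n = counts[best]
            centers[best] = [ci + (pi - ci) / n for ci, pi in zip(centers[best], p)]
    return centers, counts


data = [(0.0,), (1.0,), (2.0,)]

# Same data, two processing orders, two different sets of micro-clusters.
forward = stream_cluster(data, radius=1.2)
backward = stream_cluster(list(reversed(data)), radius=1.2)
print(forward)   # ([[0.5], [2.0]], [2, 1])
print(backward)  # ([[1.5], [0.0]], [2, 1])
```

Because the data are processed in one pass and never revisited, the result depends on the order in which points arrive. For a non-streaming data set that order is arbitrary, which is why order dependence matters when such methods are applied to stored data.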





This work is supported in part by the U.S. National Science Foundation as a research experience for undergraduates (REU) under contract number IIS-0948893 and by the National Institutes of Health under contract number R21HG005912.



Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  • Matthew Bolaños (1)
  • John Forrest (2)
  • Michael Hahsler (1), email author

  1. Southern Methodist University, Dallas, USA
  2. Microsoft, Redmond, USA
