A Comparison of Clustering Algorithms for Data Streams

  • Conference paper
Integrated Computing Technology (INTECH 2011)

Part of the book series: Communications in Computer and Information Science ((CCIS,volume 165))

Abstract

In this paper we present a comparative study of three data stream clustering algorithms: STREAM, CluStream and MR-Stream. We used a total of 90 synthetic data sets generated from spatial point processes following Gaussian distributions or mixtures of Gaussians. The algorithms were executed in three main scenarios: 1) low dimensional; 2) low dimensional with concept drift; and 3) high dimensional with concept drift. In general, CluStream outperformed the other algorithms in terms of clustering quality, at the cost of a higher execution time. Our results are analyzed with the non-parametric Friedman test and the post-hoc Nemenyi test, both at a significance level of α = 5%. Recommendations and future research directions are also explored.
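As a rough illustration of the statistical procedure named in the abstract, the sketch below computes average ranks, the Friedman chi-square statistic, and the Nemenyi critical difference as described by Demšar (2006). It is a minimal sketch, not the authors' code: the function names, the algorithm labels A, B, C and the score values are hypothetical placeholders, the statistic is the version without tie correction, and the constant 2.343 is the tabulated Nemenyi critical value for three algorithms at α = 0.05.

```python
import math

def average_ranks(scores):
    """Rank algorithms within each data set (1 = best, higher score is
    better) and average the ranks over data sets. Ties share the mean rank."""
    n, k = len(scores), len(scores[0])
    totals = [0.0] * k
    for row in scores:
        order = sorted(range(k), key=lambda j: -row[j])
        i = 0
        while i < k:
            j = i
            # Extend the run of tied scores starting at position i.
            while j + 1 < k and row[order[j + 1]] == row[order[i]]:
                j += 1
            mean_rank = (i + j) / 2 + 1
            for t in range(i, j + 1):
                totals[order[t]] += mean_rank
            i = j + 1
    return [t / n for t in totals]

def friedman_statistic(avg_ranks, n):
    """Friedman chi-square statistic (no tie correction) for k algorithms
    compared over n data sets."""
    k = len(avg_ranks)
    sum_sq = sum(r * r for r in avg_ranks)
    return 12 * n / (k * (k + 1)) * (sum_sq - k * (k + 1) ** 2 / 4)

def nemenyi_cd(k, n, q_alpha):
    """Nemenyi critical difference: two algorithms differ significantly
    if their average ranks differ by at least this value."""
    return q_alpha * math.sqrt(k * (k + 1) / (6 * n))

# Hypothetical quality scores (rows: data sets, columns: algorithms A, B, C).
# These numbers are invented for illustration and are not results from the paper.
scores = [
    [0.90, 0.70, 0.80],
    [0.80, 0.60, 0.70],
    [0.95, 0.50, 0.60],
    [0.70, 0.90, 0.60],
]

ranks = average_ranks(scores)
stat = friedman_statistic(ranks, len(scores))
cd = nemenyi_cd(len(ranks), len(scores), q_alpha=2.343)
print(ranks, stat, cd)
```

The Friedman statistic is compared against a chi-square distribution with k − 1 degrees of freedom; only if that test rejects the null hypothesis are pairwise differences in average rank compared against the critical difference.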

Copyright information

© 2011 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Pereira, C.M.M., de Mello, R.F. (2011). A Comparison of Clustering Algorithms for Data Streams. In: Hruschka, E.R., Watada, J., do Carmo Nicoletti, M. (eds) Integrated Computing Technology. INTECH 2011. Communications in Computer and Information Science, vol 165. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-22247-4_6

  • DOI: https://doi.org/10.1007/978-3-642-22247-4_6

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-22246-7

  • Online ISBN: 978-3-642-22247-4

  • eBook Packages: Computer Science, Computer Science (R0)
