A Comparison of Clustering Algorithms for Data Streams

Pereira, Cássio M. M.; de Mello, Rodrigo F.

doi:10.1007/978-3-642-22247-4_6


Part of the book series: Communications in Computer and Information Science (CCIS, volume 165)

Included in the following conference series:

International Conference on Integrated Computing Technology


Abstract

In this paper we present a comparative study of three data stream clustering algorithms: STREAM, CluStream and MR-Stream. We used a total of 90 synthetic data sets generated from spatial point processes following Gaussian distributions or mixtures of Gaussians. The algorithms were executed in three main scenarios: 1) low dimensional; 2) low dimensional with concept drift; and 3) high dimensional with concept drift. In general, CluStream outperformed the other algorithms in terms of clustering quality, at the cost of a higher execution time. Our results are analyzed with the non-parametric Friedman test and the post-hoc Nemenyi test, both at a significance level of α = 5%. Recommendations and future research directions are also discussed.
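
As an illustration of the statistical procedure named in the abstract, the following is a minimal sketch of the Friedman test and the Nemenyi critical difference for three algorithms. It is not the authors' code: the score table is made up for the example, the function names are hypothetical, and the column order (STREAM, CluStream, MR-Stream) is only a labeling assumption. The constant 2.343 is the tabulated Nemenyi critical value for three methods at α = 0.05, and the closed-form p-value used here holds only for 2 degrees of freedom (that is, exactly three algorithms).

```python
import math


def average_ranks(scores):
    """Average rank of each algorithm over all data sets.

    scores: one row per data set, one quality score per algorithm
    (higher is better). Rank 1 is best; tied scores share the mean rank.
    """
    k = len(scores[0])
    totals = [0.0] * k
    for row in scores:
        order = sorted(range(k), key=lambda j: -row[j])
        i = 0
        while i < k:
            j = i
            # Extend j over the run of scores tied with position i.
            while j + 1 < k and row[order[j + 1]] == row[order[i]]:
                j += 1
            mean_rank = (i + j) / 2 + 1
            for t in range(i, j + 1):
                totals[order[t]] += mean_rank
            i = j + 1
    return [t / len(scores) for t in totals]


def friedman_statistic(avg_ranks, n):
    """Friedman chi-square statistic for k algorithms over n data sets."""
    k = len(avg_ranks)
    sum_sq = sum(r * r for r in avg_ranks)
    return 12 * n / (k * (k + 1)) * (sum_sq - k * (k + 1) ** 2 / 4)


def nemenyi_cd(k, n, q_alpha=2.343):
    """Critical difference in average rank (default q_alpha: k = 3, alpha = 0.05)."""
    return q_alpha * math.sqrt(k * (k + 1) / (6 * n))


# Hypothetical quality scores, NOT results from the paper.
# Assumed column order: STREAM, CluStream, MR-Stream.
scores = [
    [0.6, 0.9, 0.7],
    [0.5, 0.8, 0.7],
    [0.6, 0.9, 0.8],
    [0.4, 0.7, 0.6],
]
n, k = len(scores), len(scores[0])
ranks = average_ranks(scores)
chi2 = friedman_statistic(ranks, n)
# With k = 3 the statistic has 2 degrees of freedom, for which the
# chi-square survival function reduces to exp(-x / 2).
p_value = math.exp(-chi2 / 2)
cd = nemenyi_cd(k, n)
print("average ranks:", ranks)
print("Friedman chi2:", chi2, "p-value:", round(p_value, 4))
print("Nemenyi critical difference:", round(cd, 4))
```

If the Friedman p-value falls below α, two algorithms are then judged significantly different when their average ranks differ by more than the critical difference.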



Author information

Authors and Affiliations

Institute of Mathematical and Computer Sciences, University of São Paulo, São Carlos, SP, 13566-590, Brazil
Cássio M. M. Pereira & Rodrigo F. de Mello


Editor information

Editors and Affiliations

Federal University of São Carlos, CP 676, 13.565-905, São Carlos, SP, Brazil
Estevam Rafael Hruschka Jr.
Graduate School of Information, Production & Systems, Waseda University, 2-7 Hibikino, Wakamatsu, 808-0135, Kitakyushu, Japan
Junzo Watada
Universidade Federal de Sao Carlos /DC, C.P. 676 - 13565-905, S. Carlos, SP, Brazil
Maria do Carmo Nicoletti


Cite this paper

Pereira, C.M.M., de Mello, R.F. (2011). A Comparison of Clustering Algorithms for Data Streams. In: Hruschka, E.R., Watada, J., do Carmo Nicoletti, M. (eds) Integrated Computing Technology. INTECH 2011. Communications in Computer and Information Science, vol 165. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-22247-4_6

DOI: https://doi.org/10.1007/978-3-642-22247-4_6
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-22246-7
Online ISBN: 978-3-642-22247-4
eBook Packages: Computer Science, Computer Science (R0)
