Abstract
This paper presents a comparative study of three data stream clustering algorithms: STREAM, CluStream and MR-Stream. We used a total of 90 synthetic data sets generated from spatial point processes following Gaussian distributions or mixtures of Gaussians. The algorithms were executed in three main scenarios: 1) low dimensional; 2) low dimensional with concept drift; and 3) high dimensional with concept drift. In general, CluStream outperformed the other algorithms in terms of clustering quality, at the cost of a higher execution time. Our results are analyzed with the non-parametric Friedman test and the post-hoc Nemenyi test, both at a significance level of α = 5%. Recommendations and future research directions are also explored.
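The statistical procedure named in the abstract can be illustrated with a minimal sketch. It follows the formulation in Demšar (2006): algorithms are ranked within each data set, the Friedman statistic is computed from the average ranks, and the Nemenyi critical difference (CD) decides which pairs differ. The function names, the generic algorithm labels, and the score matrix below are hypothetical and are not taken from the paper's experiments; the statistic omits the tie correction, and the p-value shortcut holds only for three algorithms (two degrees of freedom).

```python
import math

# Two-tailed Nemenyi critical values q_alpha at alpha = 0.05, indexed by the
# number of algorithms k (values from Demsar, 2006).
Q_ALPHA_005 = {2: 1.960, 3: 2.343, 4: 2.569}


def average_ranks(scores):
    """Rank algorithms within each data set (rank 1 = highest score),
    give tied algorithms their mean rank, and average over data sets."""
    k = len(scores[0])
    totals = [0.0] * k
    for row in scores:
        order = sorted(range(k), key=lambda j: -row[j])
        i = 0
        while i < k:
            j = i
            # Extend j over the run of algorithms tied with position i.
            while j + 1 < k and row[order[j + 1]] == row[order[i]]:
                j += 1
            mean_rank = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
            for t in range(i, j + 1):
                totals[order[t]] += mean_rank
            i = j + 1
    return [t / len(scores) for t in totals]


def friedman_statistic(avg_ranks, n):
    """Friedman chi-square statistic (no tie correction) for n data sets."""
    k = len(avg_ranks)
    sum_sq = sum(r * r for r in avg_ranks)
    return 12 * n / (k * (k + 1)) * (sum_sq - k * (k + 1) ** 2 / 4)


def nemenyi_cd(k, n):
    """Two average ranks differ significantly if their gap exceeds this CD."""
    return Q_ALPHA_005[k] * math.sqrt(k * (k + 1) / (6 * n))


# Hypothetical quality scores: rows are data sets, columns are three
# unnamed algorithms (A, B, C). These numbers are made up for illustration.
scores = [
    [0.60, 0.85, 0.70],
    [0.55, 0.90, 0.65],
    [0.62, 0.80, 0.75],
    [0.58, 0.88, 0.60],
    [0.65, 0.83, 0.72],
    [0.50, 0.79, 0.68],
]

ranks = average_ranks(scores)
chi2 = friedman_statistic(ranks, len(scores))
# For k = 3 the statistic is approximately chi-square with 2 degrees of
# freedom, whose survival function is exactly exp(-x / 2).
p_value = math.exp(-chi2 / 2)
cd = nemenyi_cd(3, len(scores))
print(ranks, chi2, p_value, cd)
```

With only six data sets the chi-square approximation is rough; the sketch is meant to show the mechanics, not to reproduce the paper's analysis over its 90 data sets.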
Copyright information
© 2011 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Pereira, C.M.M., de Mello, R.F. (2011). A Comparison of Clustering Algorithms for Data Streams. In: Hruschka, E.R., Watada, J., do Carmo Nicoletti, M. (eds) Integrated Computing Technology. INTECH 2011. Communications in Computer and Information Science, vol 165. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-22247-4_6
DOI: https://doi.org/10.1007/978-3-642-22247-4_6
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-22246-7
Online ISBN: 978-3-642-22247-4
eBook Packages: Computer Science (R0)