Encyclopedia of Machine Learning

2010 Edition
Editors: Claude Sammut, Geoffrey I. Webb

Clustering from Data Streams

  • João Gama
Reference work entry
DOI: https://doi.org/10.1007/978-0-387-30164-8_127

Definition

 Clustering is the process of grouping objects into subsets such that the similarity among objects within each subset is high and the similarity between objects in different subsets is low. The data stream clustering problem is to maintain a consistently good clustering of the sequence observed so far, using a small amount of memory and time. The difficulties arise from the continuous arrival of data points and the need to analyze them in real time. These characteristics require incremental clustering, which maintains cluster structures that evolve over time. Moreover, the stream itself may change over time: new clusters may appear and others may disappear, reflecting the dynamics of the stream.
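One common way to keep memory small is to store a fixed-size summary per cluster instead of the points themselves. The sketch below illustrates that idea in the spirit of the cluster feature vectors of Zhang et al. (1996), listed under Recommended Reading. The class name `ClusterSummary` and its methods are hypothetical names chosen for illustration; they are not taken from this entry or from any library.

```python
from math import sqrt


class ClusterSummary:
    """Constant-size summary of one cluster (hypothetical illustration).

    Stores only the point count, the per-dimension linear sum, and the
    sum of squared norms, so memory does not grow with the stream.
    """

    def __init__(self, dim):
        self.n = 0
        self.linear_sum = [0.0] * dim
        self.squared_sum = 0.0

    def add(self, point):
        """Absorb one arriving point; the point itself is not kept."""
        self.n += 1
        for i, x in enumerate(point):
            self.linear_sum[i] += x
        self.squared_sum += sum(x * x for x in point)

    def merge(self, other):
        """Combine two summaries; all three statistics are additive."""
        self.n += other.n
        self.linear_sum = [a + b for a, b in zip(self.linear_sum, other.linear_sum)]
        self.squared_sum += other.squared_sum

    def centroid(self):
        return [s / self.n for s in self.linear_sum]

    def radius(self):
        """Root mean squared distance of the absorbed points to the centroid."""
        c = self.centroid()
        variance = self.squared_sum / self.n - sum(v * v for v in c)
        return sqrt(max(variance, 0.0))  # guard against rounding below zero
```

Because the statistics are additive, a summary can be updated per point or merged with another summary in time that depends only on the dimension, not on how many points have been seen.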

Main Techniques

Major clustering approaches in data stream cluster analysis include:
  • Partitioning algorithms: construct a partition of a set of objects into k clusters that minimizes some objective function (e.g., the sum of squared distances to the centroid representative). Examples include k-means (Far...
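As an illustration of a partitioning method that fits the one-pass setting, the following is a minimal sketch of sequential (online) k-means. It is not necessarily the variant this entry goes on to discuss. It assumes numeric points, Euclidean distance, and seeding from the first k points; the function names `nearest` and `sequential_kmeans` are hypothetical. Each arriving point moves its nearest centroid toward itself by a step of 1/count, which keeps every centroid equal to the mean of the points assigned to it so far.

```python
def nearest(centroids, point):
    """Index of the centroid with the smallest squared Euclidean distance."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(c, point))
    return min(range(len(centroids)), key=lambda j: sq_dist(centroids[j]))


def sequential_kmeans(stream, k):
    """One pass over the stream, keeping only k centroids and k counters."""
    centroids, counts = [], []
    for point in stream:
        point = [float(x) for x in point]
        if len(centroids) < k:
            # Assumption: the first k points serve as initial centroids.
            centroids.append(point)
            counts.append(1)
            continue
        j = nearest(centroids, point)
        counts[j] += 1
        rate = 1.0 / counts[j]
        # Running-mean update: move centroid j toward the new point.
        centroids[j] = [c + rate * (x - c) for c, x in zip(centroids[j], point)]
    return centroids, counts


if __name__ == "__main__":
    data = [(0, 0), (10, 10), (2, 0), (10, 12)]
    print(sequential_kmeans(data, k=2))
```

This sketch never revisits a point and does not forget old data, so by itself it does not handle clusters that appear or disappear over time.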


Recommended Reading

  1. Aggarwal, C., Han, J., Wang, J., & Yu, P. (2003). A framework for clustering evolving data streams. In Proceedings of the 29th international conference on very large data bases (pp. 81–92). San Mateo, CA: Morgan Kaufmann.
  2. Domingos, P., & Hulten, G. (2001). A general method for scaling up machine learning algorithms and its application to clustering. In Proceedings of international conference on machine learning (pp. 106–113). San Mateo, CA: Morgan Kaufmann.
  3. Farnstrom, F., Lewis, J., & Elkan, C. (2000). Scalability for clustering algorithms revisited. SIGKDD Explorations, 2(1), 51–57.
  4. Guha, S., Meyerson, A., Mishra, N., Motwani, R., & O’Callaghan, L. (2003). Clustering data streams: Theory and practice. IEEE Transactions on Knowledge and Data Engineering, 15(3), 515–528.
  5. Spiliopoulou, M., Ntoutsi, I., Theodoridis, Y., & Schult, R. (2006). Monic: Modeling and monitoring cluster transitions. In Proceedings of ACM SIGKDD international conference on knowledge discovery and data mining (pp. 706–711). New York: ACM Press.
  6. Zhang, T., Ramakrishnan, R., & Livny, M. (1996). Birch: An efficient data clustering method for very large databases. In Proceedings of ACM SIGMOD international conference on management of data (pp. 103–114). New York: ACM Press.

Copyright information

© Springer Science+Business Media, LLC 2011
