Clustering Large Datasets Using Data Stream Clustering Techniques

Bolaños, Matthew; Forrest, John; Hahsler, Michael

doi:10.1007/978-3-319-01595-8_15


Part of the book series: Studies in Classification, Data Analysis, and Knowledge Organization (STUDIES CLASS)


Abstract

Unsupervised identification of groups in large data sets is important for many machine learning and knowledge discovery applications. Conventional clustering approaches (k-means, hierarchical clustering, etc.) typically do not scale well for very large data sets. In recent years, data stream clustering algorithms have been proposed which can deal efficiently with potentially unbounded streams of data. This paper is the first to investigate the use of data stream clustering algorithms as light-weight alternatives to conventional algorithms on large non-streaming data. We discuss important issues, including order dependence, and report the results of an initial study using several synthetic and real-world data sets.
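To make the order-dependence issue concrete, the following is a minimal sketch in Python of a generic single-pass, threshold-based micro-clustering step. It is an illustrative assumption, not the authors' method and not the API of the stream package: the function name `micro_cluster` and the `radius` parameter are hypothetical. Each point is seen once and either joins the nearest micro-cluster within `radius` or opens a new one, so the result can change when the same points arrive in a different order.

```python
def micro_cluster(points, radius):
    """Cluster `points` in a single pass (hypothetical illustration).

    Each point joins the nearest existing micro-cluster whose center lies
    within `radius`; otherwise it starts a new micro-cluster.
    Returns a list of (center, weight) pairs in creation order.
    """
    clusters = []  # each entry is [coordinate_sums, point_count]
    for p in points:
        best, best_d = None, radius
        for c in clusters:
            center = [s / c[1] for s in c[0]]
            # Euclidean distance from the point to the current center
            d = sum((a - b) ** 2 for a, b in zip(p, center)) ** 0.5
            if d <= best_d:
                best, best_d = c, d
        if best is None:
            clusters.append([list(p), 1])
        else:
            # Update the running sums so the center is the mean of members
            best[0] = [s + a for s, a in zip(best[0], p)]
            best[1] += 1
    return [([s / n for s in sums], n) for sums, n in clusters]


# The same three 1-D points, presented in two different orders
first = micro_cluster([(0.0,), (1.0,), (2.0,)], radius=1.2)
second = micro_cluster([(1.0,), (2.0,), (0.0,)], radius=1.2)
print(first)   # [([0.5], 2), ([2.0], 1)]
print(second)  # [([1.5], 2), ([0.0], 1)]
```

Both runs read the data once and store only a few summaries, which is what makes this style of algorithm attractive for large data sets, but the two orderings yield different micro-clusters. For non-streaming data, where no natural arrival order exists, the order is an arbitrary choice that can affect the outcome.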


Notes

1. stream is available at http://R-Forge.R-Project.org/projects/clusterds/.
2. Created with the default settings of function DSD_Gaussian_Static() in stream.
3. Obtained from UCI Machine Learning Repository at http://archive.ics.uci.edu/ml/datasets/Covertype.
4. Obtained from Greengenes at http://greengenes.lbl.gov/Download/Sequence_Data/Fasta_data_files/current_GREENGENES_gg16S_unaligned.fasta.gz.


Acknowledgements

This work is supported in part by the U.S. National Science Foundation as a research experience for undergraduates (REU) under contract number IIS-0948893 and by the National Institutes of Health under contract number R21HG005912.

Author information

Authors and Affiliations

Southern Methodist University, Dallas, TX, USA
Matthew Bolaños & Michael Hahsler
Microsoft, Redmond, WA, USA
John Forrest


Corresponding author

Correspondence to Michael Hahsler.

Editor information

Editors and Affiliations

Faculty of Computer Science, Otto-von-Guericke-Universität Magdeburg, Magdeburg, Germany
Myra Spiliopoulou
Institute of Computer Science, University of Hildesheim, Hildesheim, Germany
Lars Schmidt-Thieme
Institute of Computer Science, University of Hildesheim, Hildesheim, Germany
Ruth Janning


About this paper

Cite this paper

Bolaños, M., Forrest, J., Hahsler, M. (2014). Clustering Large Datasets Using Data Stream Clustering Techniques. In: Spiliopoulou, M., Schmidt-Thieme, L., Janning, R. (eds) Data Analysis, Machine Learning and Knowledge Discovery. Studies in Classification, Data Analysis, and Knowledge Organization. Springer, Cham. https://doi.org/10.1007/978-3-319-01595-8_15


DOI: https://doi.org/10.1007/978-3-319-01595-8_15
Published: 10 October 2013
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-01594-1
Online ISBN: 978-3-319-01595-8
eBook Packages: Mathematics and Statistics (R0)
