Abstract
Unsupervised identification of groups in large data sets is important for many machine learning and knowledge discovery applications. Conventional clustering approaches (k-means, hierarchical clustering, etc.) typically do not scale well to very large data sets. In recent years, data stream clustering algorithms have been proposed that deal efficiently with potentially unbounded streams of data. This paper is the first to investigate the use of data stream clustering algorithms as light-weight alternatives to conventional algorithms on large non-streaming data. We discuss important issues, including order dependence, and report the results of an initial study using several synthetic and real-world data sets.
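The order dependence mentioned in the abstract can be illustrated with a minimal Python sketch. This is not one of the algorithms evaluated in the paper: it is a simple single-pass, threshold-based ("leader"-style) scheme on one-dimensional data, and the names `stream_cluster` and `radius` are hypothetical, chosen only for this illustration. Each point is seen once; it joins the nearest existing center within `radius` (which is then updated as a running mean) or opens a new cluster.

```python
from typing import List, Sequence


def stream_cluster(points: Sequence[float], radius: float) -> List[float]:
    """Single-pass clustering of 1-D points (illustrative assumption,
    not the method studied in the paper).

    Returns the sorted list of cluster centers after one pass.
    """
    centers: List[float] = []  # running mean of each cluster
    counts: List[int] = []     # number of points absorbed by each cluster

    for x in points:
        # Find the nearest existing center that lies within `radius`.
        best = None
        for i, c in enumerate(centers):
            if abs(x - c) <= radius:
                if best is None or abs(x - c) < abs(x - centers[best]):
                    best = i

        if best is None:
            # No center is close enough: open a new cluster.
            centers.append(float(x))
            counts.append(1)
        else:
            # Absorb the point and update the running mean incrementally.
            counts[best] += 1
            centers[best] += (x - centers[best]) / counts[best]

    return sorted(centers)


if __name__ == "__main__":
    # The same four points, presented in two different orders.
    print(stream_cluster([0.0, 1.0, 2.0, 3.0], radius=1.0))  # [0.5, 2.5]
    print(stream_cluster([1.0, 2.0, 0.0, 3.0], radius=1.0))  # [0.0, 1.5, 3.0]
```

The same four points yield two clusters in one arrival order and three in the other, because each assignment depends on the centers formed by the points seen so far. On non-streaming data the arrival order is arbitrary, which is why it matters when stream algorithms are applied to stored data sets.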
Notes
1. stream is available at http://R-Forge.R-Project.org/projects/clusterds/.
2. Created with the default settings of function DSD_Gaussian_Static() in stream.
3. Obtained from UCI Machine Learning Repository at http://archive.ics.uci.edu/ml/datasets/Covertype.
4. Obtained from Greengenes at http://greengenes.lbl.gov/Download/Sequence_Data/Fasta_data_files/current_GREENGENES_gg16S_unaligned.fasta.gz.
Acknowledgements
This work is supported in part by the U.S. National Science Foundation as a research experience for undergraduates (REU) under contract number IIS-0948893 and by the National Institutes of Health under contract number R21HG005912.
Copyright information
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Bolaños, M., Forrest, J., Hahsler, M. (2014). Clustering Large Datasets Using Data Stream Clustering Techniques. In: Spiliopoulou, M., Schmidt-Thieme, L., Janning, R. (eds) Data Analysis, Machine Learning and Knowledge Discovery. Studies in Classification, Data Analysis, and Knowledge Organization. Springer, Cham. https://doi.org/10.1007/978-3-319-01595-8_15
DOI: https://doi.org/10.1007/978-3-319-01595-8_15
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-01594-1
Online ISBN: 978-3-319-01595-8
eBook Packages: Mathematics and Statistics (R0)