Parallel K-Means Clustering Based on MapReduce

Zhao, Weizhong; Ma, Huifang; He, Qing

doi:10.1007/978-3-642-10665-1_71


Part of the book series: Lecture Notes in Computer Science (LNCCN, volume 5931)

Included in the following conference series:

IEEE International Conference on Cloud Computing


Abstract

Data clustering has received considerable attention in many applications, such as data mining, document retrieval, image segmentation and pattern classification. The growing volume of information produced by technological progress makes clustering very large datasets a challenging task. To deal with this problem, many researchers have tried to design efficient parallel clustering algorithms. In this paper, we propose a parallel k-means clustering algorithm based on MapReduce, which is a simple yet powerful parallel programming technique. The experimental results demonstrate that the proposed algorithm scales well and efficiently processes large datasets on commodity hardware.
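The abstract does not describe the internals of the algorithm. As a rough illustration only, one common way to express a single k-means iteration in MapReduce style is sketched below in plain Python, with no Hadoop involved. The function names (`map_phase`, `combine_phase`, `reduce_phase`, `kmeans_iteration`), the use of a per-split combine step, and the choice of Euclidean distance are all assumptions made for this sketch, not details taken from the paper.

```python
from math import dist  # Euclidean distance, Python 3.8+


def nearest(point, centers):
    """Index of the center closest to point (Euclidean distance assumed)."""
    return min(range(len(centers)), key=lambda i: dist(point, centers[i]))


def map_phase(split, centers):
    """Map: emit (center index, (point, 1)) for every point in one input split."""
    return [(nearest(p, centers), (p, 1)) for p in split]


def combine_phase(pairs):
    """Combine: sum vectors and counts per center index.

    Input and output share the shape (index, (vector, count)), so the same
    summation can be reused when merging partial results in the reducer.
    """
    acc = {}
    for idx, (vec, n) in pairs:
        total, count = acc.get(idx, ([0.0] * len(vec), 0))
        acc[idx] = ([a + b for a, b in zip(total, vec)], count + n)
    return list(acc.items())


def reduce_phase(partials, centers):
    """Reduce: merge partial sums from all splits and compute new centers.

    A center that received no points keeps its previous position
    (an assumption of this sketch).
    """
    new_centers = list(centers)
    for idx, (total, count) in combine_phase(partials):
        new_centers[idx] = tuple(x / count for x in total)
    return new_centers


def kmeans_iteration(splits, centers):
    """Run one map -> combine -> reduce pass over all input splits."""
    partials = []
    for split in splits:  # each split would be handled by a separate mapper
        partials.extend(combine_phase(map_phase(split, centers)))
    return reduce_phase(partials, centers)


if __name__ == "__main__":
    # Hypothetical toy data: two splits, two initial centers.
    splits = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 10.0), (10.0, 12.0)]]
    centers = [(1.0, 1.0), (9.0, 9.0)]
    print(kmeans_iteration(splits, centers))
```

In a real deployment the loop over splits would run as parallel map tasks, and the driver would repeat `kmeans_iteration` until the centers stop moving. Summing within each split before the reduce step is one way to reduce the data sent between machines.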



Author information

Authors and Affiliations

The Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences,
Weizhong Zhao, Huifang Ma & Qing He
Graduate University of Chinese Academy of Sciences,
Weizhong Zhao & Huifang Ma


Editor information

Editors and Affiliations

SINTEF ICT, NO-7465, Trondheim, Norway
Martin Gilje Jaatun
School of Computer Science, South China Normal University, Guangzhou, China
Gansen Zhao
Department of Electrical Engineering and Computer Science, University of Stavanger, NO-4036, Stavanger, Norway
Chunming Rong


About this paper

Cite this paper

Zhao, W., Ma, H., He, Q. (2009). Parallel K-Means Clustering Based on MapReduce. In: Jaatun, M.G., Zhao, G., Rong, C. (eds) Cloud Computing. CloudCom 2009. Lecture Notes in Computer Science, vol 5931. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-10665-1_71


DOI: https://doi.org/10.1007/978-3-642-10665-1_71
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-10664-4
Online ISBN: 978-3-642-10665-1
eBook Packages: Computer Science, Computer Science (R0)
