Parallel K-Means Clustering Based on MapReduce

  • Weizhong Zhao
  • Huifang Ma
  • Qing He
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5931)


Data clustering has received considerable attention in many applications, such as data mining, document retrieval, image segmentation and pattern classification. The growing volume of information produced by technological progress makes clustering very large-scale data a challenging task. To deal with this problem, many researchers have tried to design efficient parallel clustering algorithms. In this paper, we propose a parallel k-means clustering algorithm based on MapReduce, a simple yet powerful parallel programming technique. The experimental results demonstrate that the proposed algorithm scales well and efficiently processes large datasets on commodity hardware.
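The abstract does not spell out the algorithm itself. As a rough illustration only, and not the authors' implementation, one k-means iteration is often phrased as a map step that assigns each point to its nearest centroid and a reduce step that averages the points assigned to each centroid. The function names (`nearest`, `map_phase`, `reduce_phase`, `kmeans`) are hypothetical, and the sketch runs in a single Python process rather than on Hadoop.

```python
from collections import defaultdict


def nearest(point, centroids):
    """Return the index of the centroid closest to point (squared Euclidean)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])),
    )


def map_phase(points, centroids):
    """Map: emit one (centroid index, point) pair per input point."""
    for point in points:
        yield nearest(point, centroids), point


def reduce_phase(pairs, centroids):
    """Reduce: group points by centroid index and average each group."""
    groups = defaultdict(list)
    for idx, point in pairs:
        groups[idx].append(point)
    # Assumption: a centroid that receives no points keeps its old position.
    updated = list(centroids)
    for idx, pts in groups.items():
        updated[idx] = tuple(sum(col) / len(pts) for col in zip(*pts))
    return updated


def kmeans(points, centroids, iterations=10):
    """Run a fixed number of map/reduce rounds and return the centroids."""
    for _ in range(iterations):
        centroids = reduce_phase(map_phase(points, centroids), centroids)
    return centroids


points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
initial = [(0.0, 0.0), (10.0, 10.0)]
print(kmeans(points, initial))
```

In a distributed setting the map calls would run on separate data splits and the grouping by centroid index would be handled by the framework's shuffle; here a dictionary stands in for that step.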


Data mining, Parallel clustering, K-means, Hadoop, MapReduce



Copyright information

© Springer-Verlag Berlin Heidelberg 2009

Authors and Affiliations

  • Weizhong Zhao (1, 2)
  • Huifang Ma (1, 2)
  • Qing He (1)
  1. The Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences
  2. Graduate University of Chinese Academy of Sciences
