Optimized big data K-means clustering using MapReduce

The Journal of Supercomputing

Abstract

Cluster analysis is one of the most commonly used data processing techniques. For over half a century, K-means has remained the most popular clustering algorithm because of its simplicity. Recently, as data volumes have continued to rise, some researchers have turned to MapReduce for high performance. However, MapReduce is poorly suited to iterative algorithms, because every iteration restarts jobs and re-reads and re-shuffles the large data set. In this paper, we address the problems of processing large-scale data with the K-means clustering algorithm and propose a novel processing model in MapReduce that eliminates the iteration dependence and achieves high performance. We analyze and implement our idea. Extensive experiments on our cluster demonstrate that the proposed methods are efficient, robust and scalable.
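To make the iteration cost mentioned above concrete, the following minimal sketch expresses conventional K-means as a map phase and a reduce phase in plain Python. It illustrates the standard iterative formulation only, not the processing model proposed in the paper, which the abstract does not detail. The function names (`nearest`, `map_phase`, `reduce_phase`, `kmeans`) and the in-memory lists standing in for distributed data are assumptions made for illustration.

```python
from collections import defaultdict
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def nearest(point: Point, centroids: Sequence[Point]) -> int:
    """Index of the centroid closest to `point` (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])),
    )


def map_phase(points: Sequence[Point], centroids: Sequence[Point]) -> List[Tuple[int, Point]]:
    """Map: emit (cluster index, point) for every input point."""
    return [(nearest(p, centroids), p) for p in points]


def reduce_phase(pairs: Sequence[Tuple[int, Point]], centroids: Sequence[Point]) -> List[Point]:
    """Shuffle and reduce: group points by cluster index and average each group."""
    groups = defaultdict(list)
    for idx, p in pairs:
        groups[idx].append(p)
    updated = list(centroids)  # a cluster with no members keeps its old centroid
    for idx, members in groups.items():
        dim = len(members[0])
        updated[idx] = tuple(
            sum(m[d] for m in members) / len(members) for d in range(dim)
        )
    return updated


def kmeans(points: Sequence[Point], centroids: Sequence[Point], max_iters: int = 20):
    """Driver loop: returns the final centroids and the number of full data passes."""
    centroids = list(centroids)
    passes = 0
    for _ in range(max_iters):
        passes += 1  # each pass stands in for one complete MapReduce job
        updated = reduce_phase(map_phase(points, centroids), centroids)
        if updated == centroids:  # converged: centroids did not move
            break
        centroids = updated
    return centroids, passes
```

In this sketch every loop iteration touches all points again. In a real MapReduce deployment, each such pass would correspond to a separate job with its own startup, input read and shuffle, which is the overhead the abstract refers to.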

Acknowledgments

This work is supported by the National Science Foundation for Distinguished Young Scholars of China (Grant No. 61225010) and by the NSFC under Grants 61173160, 61173161, 61173162, 61103234, 61272417, 61300084 and 61300189.

Author information

Corresponding author

Correspondence to Xiaoli Cui.

About this article

Cite this article

Cui, X., Zhu, P., Yang, X. et al. Optimized big data K-means clustering using MapReduce. J Supercomput 70, 1249–1259 (2014). https://doi.org/10.1007/s11227-014-1225-7

  • DOI: https://doi.org/10.1007/s11227-014-1225-7
