Abstract
Clustering analysis is one of the most commonly used data processing techniques. After more than half a century, K-means remains the most popular clustering algorithm because of its simplicity. Recently, as data volumes have continued to rise, some researchers have turned to MapReduce to obtain high performance. However, MapReduce is ill suited to iterative algorithms, because every iteration restarts jobs and rereads and reshuffles large volumes of data. In this paper, we address the problem of processing large-scale data with the K-means clustering algorithm and propose a novel processing model in MapReduce that eliminates the iteration dependence and achieves high performance. We analyze and implement our idea. Extensive experiments on our cluster demonstrate that the proposed methods are efficient, robust and scalable.
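To make the iteration cost concrete, the following is a minimal single-process sketch of conventional iterative K-means written in a map/reduce style. It is not the processing model proposed in this paper, which the abstract does not detail. The function names (`map_phase`, `reduce_phase`, `kmeans`), the squared Euclidean distance, and the convergence tolerance are illustrative assumptions. Each pass through the loop stands in for one MapReduce job that must reread every point and shuffle it to a reducer.

```python
def nearest(point, centroids):
    """Return the index of the centroid closest to point (squared Euclidean)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])),
    )


def map_phase(points, centroids):
    """Map: emit (cluster index, point) for every input point."""
    for point in points:
        yield nearest(point, centroids), point


def reduce_phase(pairs, centroids):
    """Reduce: average the points of each cluster to obtain new centroids."""
    dim = len(centroids[0])
    sums = [[0.0] * dim for _ in centroids]
    counts = [0] * len(centroids)
    for idx, point in pairs:
        counts[idx] += 1
        for d in range(dim):
            sums[idx][d] += point[d]
    # Assumption: an empty cluster keeps its previous centroid.
    return [
        tuple(s / counts[i] for s in sums[i]) if counts[i] else tuple(centroids[i])
        for i in range(len(centroids))
    ]


def kmeans(points, centroids, max_iters=100, tol=1e-9):
    """Run map/reduce rounds until the centroids stop moving.

    Returns the final centroids and the number of full passes over the
    data, i.e. the number of jobs a MapReduce cluster would have to launch.
    """
    centroids = [tuple(c) for c in centroids]
    passes = 0
    for _ in range(max_iters):
        updated = reduce_phase(map_phase(points, centroids), centroids)
        passes += 1
        shift = max(
            sum((a - b) ** 2 for a, b in zip(old, new))
            for old, new in zip(centroids, updated)
        )
        centroids = updated
        if shift <= tol:
            break
    return centroids, passes


points = [(0, 0), (0, 2), (10, 0), (10, 2)]
final_centroids, passes = kmeans(points, [(0, 0), (10, 0)])
```

Even on this four-point toy input the loop needs two full passes over the data: one to move the centroids and one to confirm that they no longer change. On a real cluster each such pass would be a separate job with its own startup, read and shuffle cost.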
Acknowledgments
This work is supported by the National Science Foundation for Distinguished Young Scholars of China (Grant No. 61225010) and by NSFC under Grants 61173160, 61173161, 61173162, 61103234, 61272417, 61300084 and 61300189.
Cite this article
Cui, X., Zhu, P., Yang, X. et al. Optimized big data K-means clustering using MapReduce. J Supercomput 70, 1249–1259 (2014). https://doi.org/10.1007/s11227-014-1225-7
DOI: https://doi.org/10.1007/s11227-014-1225-7