Optimized big data K-means clustering using MapReduce

Cui, Xiaoli; Zhu, Pingfei; Yang, Xin; Li, Keqiu; Ji, Changqing

doi:10.1007/s11227-014-1225-7

Published: 19 June 2014

The Journal of Supercomputing, Volume 70, pages 1249–1259 (2014)

Abstract

Clustering analysis is one of the most commonly used data processing techniques. For over half a century, K-means has remained the most popular clustering algorithm because of its simplicity. Recently, as data volumes continue to rise, some researchers have turned to MapReduce for high performance. However, MapReduce is poorly suited to iterative algorithms, because every iteration restarts a job and reads and shuffles the full data set again. In this paper, we address the problem of processing large-scale data with the K-means clustering algorithm and propose a novel processing model in MapReduce that eliminates the iteration dependence and achieves high performance. We analyze and implement our idea. Extensive experiments on our cluster demonstrate that the proposed methods are efficient, robust and scalable.
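
To make the iteration cost mentioned in the abstract concrete, the following is a minimal single-machine sketch of the conventional iterative K-means expressed as a map step and a reduce step. It is not the processing model proposed in the paper, which the abstract does not describe; it only illustrates why the baseline needs one full pass over the data per iteration. All function names (nearest, map_phase, reduce_phase, kmeans) and the sample points are hypothetical, chosen for illustration.

```python
from collections import defaultdict


def nearest(point, centroids):
    """Return the index of the centroid closest to point (squared Euclidean)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])),
    )


def map_phase(points, centroids):
    """Map: emit (cluster_id, (point, 1)) for every input point."""
    for point in points:
        yield nearest(point, centroids), (point, 1)


def reduce_phase(pairs, centroids):
    """Reduce: average the points assigned to each cluster id."""
    sums = {}
    counts = defaultdict(int)
    for cid, (point, n) in pairs:
        if cid not in sums:
            sums[cid] = [0.0] * len(point)
        sums[cid] = [s + p for s, p in zip(sums[cid], point)]
        counts[cid] += n
    # A cluster that received no points keeps its previous centroid.
    return [
        tuple(s / counts[i] for s in sums[i]) if i in sums else centroids[i]
        for i in range(len(centroids))
    ]


def kmeans(points, centroids, max_iters=20):
    """Driver: each loop iteration stands in for one full MapReduce job,
    so the whole data set is read and shuffled once per iteration."""
    for iteration in range(1, max_iters + 1):
        new_centroids = reduce_phase(map_phase(points, centroids), centroids)
        if new_centroids == centroids:
            return centroids, iteration
        centroids = new_centroids
    return centroids, max_iters


# Hypothetical toy data: two well-separated groups in the plane.
points = [(1.0, 1.0), (1.0, 2.0), (9.0, 9.0), (10.0, 8.0)]
final, jobs = kmeans(points, [(0.0, 0.0), (10.0, 10.0)])
print(final, jobs)
```

In this toy run the driver needs a second pass only to confirm that the centroids have stopped moving. On a real cluster each such pass would be a separate job with its own startup, input read and shuffle, which is the overhead the abstract refers to.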

Acknowledgments

This work is supported by the National Science Foundation for Distinguished Young Scholars of China (Grant No. 61225010) and by the NSFC under Grants 61173160, 61173161, 61173162, 61103234, 61272417, 61300084 and 61300189.

Author information

Authors and Affiliations

School of Computer Science and Technology, Dalian University of Technology, Dalian, 116024, China
Xiaoli Cui, Xin Yang & Keqiu Li
Beijing China-Power Information Technology Co., Ltd., State Grid Electric Power Research Institute, Beijing, 100192, China
Pingfei Zhu
School of Physical Science and Technology, Dalian University, Dalian, 116600, China
Changqing Ji

Corresponding author

Correspondence to Xiaoli Cui.

About this article

Cite this article

Cui, X., Zhu, P., Yang, X. et al. Optimized big data K-means clustering using MapReduce. J Supercomput 70, 1249–1259 (2014). https://doi.org/10.1007/s11227-014-1225-7

Issue Date: December 2014
DOI: https://doi.org/10.1007/s11227-014-1225-7
