Abstract
The K-Means algorithm is one of the most efficient and widely used algorithms for clustering data. However, K-Means becomes slower as the dataset grows. Moreover, the rapid increase in the size of data has motivated the scientific and industrial communities to develop novel technologies for storing, managing, and analysing large-scale datasets, known as Big Data. This paper describes the implementation of parallel K-Means on the MapReduce framework, a distributed framework best known for its reliability in processing large-scale datasets. In addition, a detailed analysis of the effect of distance computations on the performance of K-Means on MapReduce is presented. Finally, two optimisation techniques are suggested to accelerate K-Means on MapReduce by reducing the number of distance computations per iteration while producing the same deterministic results.
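To make the role of distance computations concrete, the following is a minimal single-process sketch of one K-Means iteration written in a map/reduce style. It is not the authors' implementation and does not use Hadoop; the function names (`map_phase`, `reduce_phase`, `kmeans_iteration`), the use of squared Euclidean distance, and the rule of keeping an empty cluster's centroid unchanged are all assumptions made for illustration. The sketch shows that an unoptimised iteration performs one distance computation per point per centroid, which is the cost that optimisation techniques of the kind described above aim to reduce.

```python
from collections import defaultdict
from typing import List, Tuple

Point = Tuple[float, ...]


def squared_distance(a: Point, b: Point) -> float:
    """Squared Euclidean distance (assumed metric for this sketch)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def map_phase(points: List[Point], centroids: List[Point]):
    """Mapper: emit (nearest centroid index, (point, 1)) for every point.

    Also returns how many distance computations were performed.
    """
    emitted = []
    distance_count = 0
    for p in points:
        best_idx, best_dist = 0, float("inf")
        for idx, c in enumerate(centroids):
            d = squared_distance(p, c)
            distance_count += 1
            if d < best_dist:
                best_idx, best_dist = idx, d
        emitted.append((best_idx, (p, 1)))
    return emitted, distance_count


def reduce_phase(emitted, centroids: List[Point]) -> List[Point]:
    """Reducer: average the points assigned to each centroid index."""
    sums = {}
    counts = defaultdict(int)
    for idx, (p, n) in emitted:
        if idx in sums:
            sums[idx] = [s + x for s, x in zip(sums[idx], p)]
        else:
            sums[idx] = list(p)
        counts[idx] += n
    new_centroids = []
    for idx, old in enumerate(centroids):
        if counts[idx] == 0:
            # Assumption: an empty cluster keeps its previous centroid.
            new_centroids.append(old)
        else:
            new_centroids.append(tuple(s / counts[idx] for s in sums[idx]))
    return new_centroids


def kmeans_iteration(points: List[Point], centroids: List[Point]):
    """Run one map step and one reduce step; return centroids and cost."""
    emitted, distance_count = map_phase(points, centroids)
    return reduce_phase(emitted, centroids), distance_count


if __name__ == "__main__":
    data = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
    start = [(1.0, 1.0), (9.0, 1.0)]
    centroids, cost = kmeans_iteration(data, start)
    print(centroids, cost)
```

With n points and k centroids, the mapper in this sketch always performs n × k distance computations per iteration. In a real MapReduce job the points would be split across many mappers, but the total count would be the same.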
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Al Ghamdi, S., Di Fatta, G., Stahl, F. (2015). Optimisation Techniques for Parallel K-Means on MapReduce. In: Di Fatta, G., Fortino, G., Li, W., Pathan, M., Stahl, F., Guerrieri, A. (eds) Internet and Distributed Computing Systems. IDCS 2015. Lecture Notes in Computer Science, vol 9258. Springer, Cham. https://doi.org/10.1007/978-3-319-23237-9_17
DOI: https://doi.org/10.1007/978-3-319-23237-9_17
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-23236-2
Online ISBN: 978-3-319-23237-9
eBook Packages: Computer Science, Computer Science (R0)