Optimisation Techniques for Parallel K-Means on MapReduce

Al Ghamdi, Sami; Di Fatta, Giuseppe; Stahl, Frederic

doi:10.1007/978-3-319-23237-9_17

Conference paper
First Online: 01 January 2015

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9258)

Abstract

The K-Means algorithm is one of the most efficient and widely used algorithms for clustering data. However, the running time of K-Means increases as the dataset grows. Moreover, the rapid increase in the size of data has motivated the scientific and industrial communities to develop novel technologies that meet the needs of storing, managing, and analysing large-scale datasets known as Big Data. This paper describes the implementation of parallel K-Means on the MapReduce framework, a distributed framework best known for its reliability in processing large-scale datasets. It also presents a detailed analysis of the effect of distance computations on the performance of K-Means on MapReduce. Finally, two optimisation techniques are suggested to accelerate K-Means on MapReduce by reducing the number of distance computations per iteration while producing the same deterministic results.
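To make the role of distance computations concrete, the following is a minimal single-process sketch of how one K-Means iteration can be split into a map step (assign each point to its nearest centroid) and a reduce step (average the points assigned to each centroid). It is not the authors' implementation and does not include their two optimisation techniques, which the abstract does not describe. The function names map_phase, reduce_phase and kmeans are hypothetical, and the distance counter is added only to show that the unoptimised map step performs n × k distance computations per iteration for n points and k centroids.

```python
from collections import defaultdict
from math import dist  # Euclidean distance (Python 3.8+)


def map_phase(points, centroids):
    """Map step: emit (index of nearest centroid, point) for every point.

    Also returns the number of distance computations performed, which is
    len(points) * len(centroids) in this unoptimised version.
    """
    pairs = []
    distance_count = 0
    for p in points:
        best_idx, best_d = 0, float("inf")
        for idx, c in enumerate(centroids):
            d = dist(p, c)
            distance_count += 1
            if d < best_d:  # ties keep the lowest centroid index
                best_idx, best_d = idx, d
        pairs.append((best_idx, p))
    return pairs, distance_count


def reduce_phase(pairs, centroids):
    """Reduce step: the new centroid is the mean of the points sharing a key."""
    groups = defaultdict(list)
    for idx, p in pairs:
        groups[idx].append(p)
    updated = []
    for idx, old in enumerate(centroids):
        members = groups.get(idx)
        if not members:
            # Assumption of this sketch: an empty cluster keeps its centroid.
            updated.append(old)
            continue
        dim = len(old)
        updated.append(
            tuple(sum(m[j] for m in members) / len(members) for j in range(dim))
        )
    return updated


def kmeans(points, centroids, max_iter=100):
    """Repeat map and reduce until the centroids stop changing."""
    total_distances = 0
    for _ in range(max_iter):
        pairs, count = map_phase(points, centroids)
        total_distances += count
        updated = reduce_phase(pairs, centroids)
        if updated == centroids:
            break
        centroids = updated
    return centroids, total_distances


if __name__ == "__main__":
    data = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
    final, n_dist = kmeans(data, [(0.0, 0.0), (10.0, 0.0)])
    print(final, n_dist)
```

On a real cluster the map step would run in parallel over input splits and the per-key averaging would run in reducers. The sketch keeps everything in one process so that the per-iteration cost of the distance computations is easy to see.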

Author information

Authors and Affiliations

School of Systems Engineering, University of Reading, Whiteknights, Reading, RG6 6AY, UK
Sami Al Ghamdi, Giuseppe Di Fatta & Frederic Stahl

Corresponding author

Correspondence to Sami Al Ghamdi.

Editor information

Editors and Affiliations

School of Systems Engineering, University of Reading, Reading, Berkshire, United Kingdom
Giuseppe Di Fatta
Dipartimento di Ingegneria Informatica, Modellistica, Elettronica e Sistemistica, University of Calabria, Rende, Italy
Giancarlo Fortino
School of Logistics and Engineer, University of Technology Wuhan, Wuhan, China
Wenfeng Li
CSIRO ICT, Acton, Australia
Mukaddim Pathan
School of Systems Engineering, University of Reading, Whiteknights, Reading, United Kingdom
Frederic Stahl
Dipartimento di Ingegneria Informatica, Modellistica, Elettronica e Sistemistica, University of Calabria, Rende, Italy
Antonio Guerrieri

About this paper

Cite this paper

Al Ghamdi, S., Di Fatta, G., Stahl, F. (2015). Optimisation Techniques for Parallel K-Means on MapReduce. In: Di Fatta, G., Fortino, G., Li, W., Pathan, M., Stahl, F., Guerrieri, A. (eds) Internet and Distributed Computing Systems. IDCS 2015. Lecture Notes in Computer Science, vol 9258. Springer, Cham. https://doi.org/10.1007/978-3-319-23237-9_17

DOI: https://doi.org/10.1007/978-3-319-23237-9_17
Published: 25 August 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-23236-2
Online ISBN: 978-3-319-23237-9
eBook Packages: Computer Science, Computer Science (R0)