Optimisation Techniques for Parallel K-Means on MapReduce

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9258)

Abstract

The K-Means algorithm is one of the most efficient and widely used algorithms for clustering data. However, K-Means slows down as datasets grow larger. Moreover, the rapid increase in the size of data has motivated the scientific and industrial communities to develop novel technologies that meet the needs of storing, managing, and analysing large-scale datasets, known as Big Data. This paper describes the implementation of parallel K-Means on the MapReduce framework, a distributed framework best known for its reliability in processing large-scale datasets. It also presents a detailed analysis of the effect of distance computations on the performance of K-Means on MapReduce. Finally, two optimisation techniques are suggested that accelerate K-Means on MapReduce by reducing the number of distance computations per iteration while producing the same deterministic results.
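To make the MapReduce formulation concrete, the following is a minimal single-process sketch in Python of one K-Means iteration written as a map step and a reduce step. It is an illustration under stated assumptions, not the authors' implementation: the function names (`map_point`, `reduce_cluster`, `kmeans_iteration`) are hypothetical, the shuffle phase is simulated with a dictionary, and the two optimisation techniques mentioned in the abstract are not reproduced because the abstract does not describe them. The sketch also returns the number of distance computations, which for the unoptimised algorithm is the number of points multiplied by the number of centroids in every iteration.

```python
from collections import defaultdict
from math import dist  # Euclidean distance, available in Python 3.8+


def map_point(point, centroids):
    """Map step: emit (index of the nearest centroid, (point, 1))."""
    nearest = min(range(len(centroids)), key=lambda j: dist(point, centroids[j]))
    return nearest, (point, 1)


def reduce_cluster(values):
    """Reduce step: average all points emitted for one centroid index."""
    dims = len(values[0][0])
    total = [0.0] * dims
    count = 0
    for point, n in values:
        for d in range(dims):
            total[d] += point[d]
        count += n
    return tuple(t / count for t in total)


def kmeans_iteration(points, centroids):
    """One K-Means iteration as map, simulated shuffle, and reduce."""
    groups = defaultdict(list)  # stands in for the shuffle phase
    for p in points:
        key, value = map_point(p, centroids)
        groups[key].append(value)
    # Assumption: a centroid that receives no points keeps its old position.
    new_centroids = [
        reduce_cluster(groups[j]) if j in groups else tuple(centroids[j])
        for j in range(len(centroids))
    ]
    # Every point is compared with every centroid in the unoptimised version.
    distance_computations = len(points) * len(centroids)
    return new_centroids, distance_computations
```

In a real deployment the map calls would run in parallel over data partitions and the current centroids would be distributed to every mapper at the start of each iteration; the sketch keeps everything in one process so that the data flow is easy to follow.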

Author information

Corresponding author

Correspondence to Sami Al Ghamdi.

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Al Ghamdi, S., Di Fatta, G., Stahl, F. (2015). Optimisation Techniques for Parallel K-Means on MapReduce. In: Di Fatta, G., Fortino, G., Li, W., Pathan, M., Stahl, F., Guerrieri, A. (eds) Internet and Distributed Computing Systems. IDCS 2015. Lecture Notes in Computer Science, vol 9258. Springer, Cham. https://doi.org/10.1007/978-3-319-23237-9_17

  • DOI: https://doi.org/10.1007/978-3-319-23237-9_17

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-23236-2

  • Online ISBN: 978-3-319-23237-9

  • eBook Packages: Computer Science, Computer Science (R0)
