Accelerating Lloyd’s Algorithm for k-Means Clustering

Chapter in Partitional Clustering Algorithms

Abstract

The k-means clustering algorithm, a staple of data mining and unsupervised learning, is popular because it is simple to implement, fast, easily parallelized, and offers intuitive results. Lloyd's algorithm is the standard batch, hill-climbing approach for minimizing the k-means optimization criterion. It spends the vast majority of its time computing distances between each of the k cluster centers and the n data points. Much of this work is unnecessary, because points usually stay in the same clusters after the first few iterations. Over the last decade, researchers have developed a number of optimizations that speed up Lloyd's algorithm for both low- and high-dimensional data. In this chapter we survey some of these optimizations and present new ones, focusing in particular on those that avoid distance calculations by exploiting the triangle inequality. By caching known distances and updating them efficiently with the triangle inequality, these algorithms can provably avoid many unnecessary distance calculations. All the optimizations examined produce the same results as Lloyd's algorithm given the same input and initialization, so they are suitable as drop-in replacements. These algorithms can run many times faster and compute far fewer distances than the standard unoptimized implementation; in our experiments, speedups of 30–50x over Lloyd's algorithm are common. We examine the trade-offs of these methods with respect to the number of examples n, the number of dimensions d, the number of clusters k, and the structure of the data.
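To see why caching distances helps, note the geometric fact these methods exploit: for a point x assigned to center c, the triangle inequality gives d(x, c') ≥ d(c, c') − d(x, c) for any other center c', so whenever d(c, c') ≥ 2·d(x, c) the point provably stays with c and d(x, c') never needs to be computed. The sketch below illustrates the bound-caching idea in the spirit of Hamerly's single-bound algorithm: one upper bound on each point's distance to its assigned center and one lower bound on its distance to any other center. It is a minimal NumPy illustration under our own assumptions (the function name and structure are ours, not the chapter's implementation); Elkan's earlier algorithm instead keeps k lower bounds per point.

```python
import numpy as np

def hamerly_style_kmeans(X, centers, max_iters=100):
    """Lloyd's algorithm with one upper and one lower distance bound per
    point. A minimal sketch assuming float arrays X (n, d) and centers
    (k, d); illustrative only, not the chapter's implementation."""
    n, _ = X.shape
    k = centers.shape[0]
    rows = np.arange(n)

    # One full O(nk) distance pass to initialize assignments and bounds.
    dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    assign = dist.argmin(axis=1)
    upper = dist[rows, assign]        # exact distance to assigned center
    dist[rows, assign] = np.inf
    lower = dist.min(axis=1)          # distance to second-closest center

    for _ in range(max_iters):
        # Move each center to the mean of its points (keep empty clusters).
        new_centers = centers.copy()
        for j in range(k):
            members = X[assign == j]
            if len(members) > 0:
                new_centers[j] = members.mean(axis=0)
        shift = np.linalg.norm(new_centers - centers, axis=1)
        centers = new_centers
        if not shift.any():
            break                     # no center moved: converged

        # Triangle inequality: after centers move, each cached upper bound
        # can grow by at most its own center's shift, and each lower bound
        # can shrink by at most the largest shift of any center.
        upper += shift[assign]
        lower -= shift.max()

        # Points with upper <= lower provably keep their assignment; only
        # the remaining "stale" points need fresh distance computations.
        stale = upper > lower
        if stale.any():
            sub = np.linalg.norm(X[stale, None, :] - centers[None, :, :],
                                 axis=2)
            srows = np.arange(sub.shape[0])
            best = sub.argmin(axis=1)
            assign[stale] = best
            upper[stale] = sub[srows, best]
            sub[srows, best] = np.inf
            lower[stale] = sub.min(axis=1)
    return assign, centers
```

After the first few iterations most points satisfy upper ≤ lower and are skipped entirely, so the inner distance pass touches only a shrinking stale subset. Decaying every lower bound by the single largest center shift is a deliberately loose but valid simplification of the finer per-point bookkeeping the chapter describes.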


Notes

  1. Note that the k in k-d trees and the k in k-means are two different (clashing) variable names. In k-means, k refers to the number of centers/clusters sought; in k-d trees, k refers to the dimension of the data the structure is built on.


Author information


Correspondence to Greg Hamerly.


Copyright information

© 2015 Springer International Publishing Switzerland

About this chapter

Cite this chapter

Hamerly, G., Drake, J. (2015). Accelerating Lloyd’s Algorithm for k-Means Clustering. In: Celebi, M.E. (ed.) Partitional Clustering Algorithms. Springer, Cham. https://doi.org/10.1007/978-3-319-09259-1_2

  • DOI: https://doi.org/10.1007/978-3-319-09259-1_2

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-09258-4

  • Online ISBN: 978-3-319-09259-1

  • eBook Packages: Engineering, Engineering (R0)
