Abstract
The k-means clustering algorithm, a staple of data mining and unsupervised learning, is popular because it is simple to implement, fast, easily parallelized, and offers intuitive results. Lloyd’s algorithm is the standard batch, hill-climbing approach for minimizing the k-means optimization criterion. It spends the vast majority of its time computing distances between each of the k cluster centers and the n data points. Much of this work is unnecessary, because points usually stay in the same clusters after the first few iterations. In the last decade researchers have developed a number of optimizations to speed up Lloyd’s algorithm for both low- and high-dimensional data. In this chapter we survey some of these optimizations and present new ones. In particular we focus on those that avoid distance calculations using the triangle inequality. By caching known distances and updating them efficiently with the triangle inequality, these algorithms can provably avoid many unnecessary distance calculations. All the optimizations examined produce the same results as Lloyd’s algorithm given the same input and initialization, so they are suitable as drop-in replacements. These new algorithms can run many times faster and compute far fewer distances than the standard unoptimized implementation. In our experiments, speedups of over 30–50x compared to Lloyd’s algorithm are common. We examine the trade-offs of these methods with respect to the number of examples n, dimensions d, clusters k, and the structure of the data.
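To make the caching idea concrete, the sketch below implements one simplified triangle-inequality acceleration in the style of Hamerly's single-bound method: each point keeps an upper bound on the distance to its assigned center and a lower bound on the distance to any other center, and both bounds are updated cheaply from how far the centers moved. This is an illustrative sketch only (function name, structure, and the choice of a single lower bound are ours, not the chapter's exact algorithms), but it shows how the bounds let most points skip all distance computations once assignments stabilize.

```python
import numpy as np

def bounded_kmeans(X, centers, iters=10):
    """Simplified Hamerly-style accelerated k-means (illustrative sketch).

    Each point x keeps:
      upper[i] >= d(x_i, assigned center)
      lower[i] <= d(x_i, any other center)
    If upper[i] <= lower[i], the triangle inequality proves the
    assignment cannot change, so no distances need be computed.
    """
    n, k = X.shape[0], centers.shape[0]
    # Initial assignment: compute all n*k distances once.
    d = np.linalg.norm(X[:, None] - centers[None], axis=2)
    assign = d.argmin(1)
    upper = d[np.arange(n), assign]
    d[np.arange(n), assign] = np.inf
    lower = d.min(1)                    # distance to second-closest center
    skipped = 0
    for _ in range(iters):
        # Move each center to the mean of its assigned points.
        new_centers = np.array([X[assign == j].mean(0) if np.any(assign == j)
                                else centers[j] for j in range(k)])
        shift = np.linalg.norm(new_centers - centers, axis=1)
        centers = new_centers
        # Triangle inequality: moving c_j by shift[j] changes d(x, c_j)
        # by at most shift[j], so the bounds can be updated in O(1) per point.
        upper += shift[assign]
        lower -= shift.max()
        for i in range(n):
            if upper[i] <= lower[i]:
                skipped += 1            # bounds prove assignment unchanged
                continue
            # Bounds failed: recompute exact distances and tighten them.
            di = np.linalg.norm(X[i] - centers, axis=1)
            assign[i] = di.argmin()
            upper[i] = di[assign[i]]
            di[assign[i]] = np.inf
            lower[i] = di.min()
    return assign, centers, skipped
```

On well-separated data the skip condition fires for nearly every point after the first iteration, which is the source of the large observed speedups; the full algorithms in the chapter maintain tighter per-center bounds at the cost of more bookkeeping.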
Notes
1. Note that the k in k-d trees and the k in k-means are two different (clashing) variable names. In k-means, k refers to the number of centers/clusters sought; in k-d trees, k refers to the dimension of the data the structure is built on.
© 2015 Springer International Publishing Switzerland
About this chapter
Cite this chapter
Hamerly, G., Drake, J. (2015). Accelerating Lloyd’s Algorithm for k-Means Clustering. In: Celebi, M. (eds) Partitional Clustering Algorithms. Springer, Cham. https://doi.org/10.1007/978-3-319-09259-1_2
DOI: https://doi.org/10.1007/978-3-319-09259-1_2
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-09258-4
Online ISBN: 978-3-319-09259-1
eBook Packages: Engineering (R0)