
Accelerating Lloyd’s Algorithm for k-Means Clustering

  • Greg Hamerly
  • Jonathan Drake

Abstract

The k-means clustering algorithm, a staple of data mining and unsupervised learning, is popular because it is simple to implement, fast, easily parallelized, and offers intuitive results. Lloyd’s algorithm is the standard batch, hill-climbing approach for minimizing the k-means optimization criterion. It spends the vast majority of its time computing distances between each of the k cluster centers and the n data points. Much of this work is unnecessary, because points usually stay in the same clusters after the first few iterations. In the last decade researchers have developed a number of optimizations to speed up Lloyd’s algorithm for both low- and high-dimensional data. In this chapter we survey some of these optimizations and present new ones. In particular we focus on those that use the triangle inequality to avoid distance calculations. By caching known distances and updating the resulting bounds efficiently with the triangle inequality, these algorithms can provably avoid many unnecessary distance calculations. All the optimizations examined produce the same results as Lloyd’s algorithm given the same input and initialization, so they are suitable as drop-in replacements. These algorithms can run many times faster and compute far fewer distances than the standard unoptimized implementation; in our experiments, speedups of 30–50x or more compared to Lloyd’s algorithm are common. We examine the trade-offs of these methods with respect to the number of examples n, the number of dimensions d, the number of clusters k, and the structure of the data.
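To make the bound-caching idea concrete: by the triangle inequality, if a center c moves to c′, then for any point x, |d(x, c′) − d(x, c)| ≤ d(c, c′), so a cached distance yields valid upper and lower bounds without being recomputed. Below is a minimal Python/NumPy sketch of this pruning in the spirit of Hamerly’s algorithm; it is an illustration only, not the chapter’s reference implementation, and the function name and structure are our own. It assumes k ≥ 2 and Euclidean distance.

```python
import numpy as np

def kmeans_bounds_sketch(X, k, n_iters=50, seed=0):
    """Lloyd's algorithm with Hamerly-style triangle-inequality pruning.

    Caches, per point, an upper bound on the distance to its assigned
    center and a lower bound on the distance to any other center.
    Illustrative sketch; assumes k >= 2.
    """
    rng = np.random.default_rng(seed)
    n, _ = X.shape
    centers = X[rng.choice(n, size=k, replace=False)].copy()

    # Initial assignment: one full n-by-k distance computation.
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    assign = dists.argmin(axis=1)
    upper = dists[np.arange(n), assign]        # d(x, assigned center)
    dists[np.arange(n), assign] = np.inf
    lower = dists.min(axis=1)                  # d(x, second-closest center)

    for _ in range(n_iters):
        # Move each center to the mean of its points (empty clusters stay put).
        new_centers = np.array([X[assign == j].mean(axis=0)
                                if np.any(assign == j) else centers[j]
                                for j in range(k)])
        shift = np.linalg.norm(new_centers - centers, axis=1)
        centers = new_centers
        if shift.max() == 0.0:
            break  # converged: no center moved

        # Triangle inequality: moving center j by shift[j] changes any
        # point-to-center distance by at most shift[j].
        upper += shift[assign]   # loosen the bound on the assigned center
        lower -= shift.max()     # conservatively loosen the other-center bound

        # Points with upper < lower provably keep their assignment; only
        # the remainder need fresh distances and tightened bounds.
        stale = upper >= lower
        if np.any(stale):
            dd = np.linalg.norm(X[stale, None, :] - centers[None, :, :], axis=2)
            order = np.sort(dd, axis=1)
            assign[stale] = dd.argmin(axis=1)
            upper[stale] = order[:, 0]
            lower[stale] = order[:, 1]
    return centers, assign
```

On data where clusters stabilize quickly, the stale mask shrinks to a small fraction of the points after a few iterations, which is where the speedups described above come from; the output matches an unpruned Lloyd’s run on the same input and initialization.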

Keywords

k-Means · Triangle inequality · Caching · Accelerating Lloyd’s algorithm · Clustering · Unsupervised learning

References

  1. Agarwal PK, Har-Peled S, Varadarajan KR (2005) Geometric approximation via coresets. Comb Comput Geom 52:1–30
  2. Apache Mahout. http://mahout.apache.org/. Version 0.8, accessed 24 Jan 2014
  3. Arthur D, Vassilvitskii S (2007) k-means++: the advantages of careful seeding. In: ACM-SIAM symposium on discrete algorithms, pp 1027–1035
  4. Arthur D, Manthey B, Röglin H (2011) Smoothed analysis of the k-means method. J ACM 58(5):19
  5. Bei C-D, Gray RM (1985) An improvement of the minimum distortion encoding algorithm for vector quantization. IEEE Trans Commun 33(10):1121–1133
  6. Bottou L, Bengio Y (1995) Convergence properties of the k-means algorithms. In: Advances in neural information processing systems, vol 7. MIT Press, Cambridge, pp 585–592
  7. Celebi ME (2011) Improving the performance of k-means for color quantization. Image Vis Comput 29(4):260–271
  8. Celebi ME, Kingravi HA, Vela PA (2013) A comparative study of efficient initialization methods for the k-means clustering algorithm. Expert Syst Appl 40(1):200–210
  9. Coates A, Ng AY, Lee H (2011) An analysis of single-layer networks in unsupervised feature learning. In: International conference on artificial intelligence and statistics, pp 215–223
  10. Dhillon I, Guan Y, Kulis B (2005) A unified view of kernel k-means, spectral clustering and graph cuts. Technical Report TR-04-25, University of Texas at Austin
  11. Drake J, Hamerly G (2012) Accelerated k-means with adaptive distance bounds. In: 5th NIPS workshop on optimization for machine learning
  12. Elkan C (2003) Using the triangle inequality to accelerate k-means. In: Proceedings of the twentieth international conference on machine learning (ICML), pp 147–153
  13. Forgy EW (1965) Cluster analysis of multivariate data: efficiency versus interpretability of classifications. In: Biometric society meeting, Riverside
  14. Fu K-S, Mui JK (1981) A survey on image segmentation. Pattern Recognit 13(1):3–16
  15. Hamerly G (2010) Making k-means even faster. In: SIAM international conference on data mining
  16. Hamerly G, Elkan C (2002) Alternatives to the k-means algorithm that find better clusterings. In: Proceedings of the eleventh international conference on information and knowledge management. ACM, New York, pp 600–607
  17. Hartigan JA, Wong MA (1979) Algorithm AS 136: a k-means clustering algorithm. J R Stat Soc Ser C Appl Stat 28(1):100–108
  18. Hochbaum DS, Shmoys DB (1985) A best possible heuristic for the k-center problem. Math Oper Res 10(2):180–184
  19. Kanungo T, Mount DM, Netanyahu NS, Piatko CD, Silverman R, Wu AY (2002) An efficient k-means clustering algorithm: analysis and implementation. IEEE Trans Pattern Anal Mach Intell 24:881–892
  20. Kaukoranta T, Franti P, Nevalainen O (2000) A fast exact GLA based on code vector activity detection. IEEE Trans Image Process 9(8):1337–1342
  21. Lai JZC, Liaw Y-C (2008) Improvement of the k-means clustering filtering algorithm. Pattern Recognit 41(12):3677–3681
  22. Lai JZC, Liaw Y-C, Liu J (2008) A fast VQ codebook generation algorithm using codeword displacement. Pattern Recognit 41(1):315–319
  23. Linde Y, Buzo A, Gray R (1980) An algorithm for vector quantizer design. IEEE Trans Commun 28(1):84–95
  24. Lloyd S (1982) Least squares quantization in PCM. IEEE Trans Inf Theory 28:129–137
  25. Low Y, Gonzalez J, Kyrola A, Bickson D, Guestrin C, Hellerstein JM (2010) GraphLab: a new parallel framework for machine learning. In: Conference on uncertainty in artificial intelligence (UAI)
  26. MacQueen JB (1967) Some methods for classification and analysis of multivariate observations. In: 5th Berkeley symposium on mathematical statistics and probability, vol 1. University of California Press, Berkeley, pp 281–297
  27. Moore AW (1991) An introductory tutorial on kd-trees. Technical Report 209, Carnegie Mellon University
  28. Moore AW (2000) The anchors hierarchy: using the triangle inequality to survive high dimensional data. In: Sixteenth conference on uncertainty in artificial intelligence. AAAI Press, Stanford, pp 397–405
  29. Ng AY, Jordan MI, Weiss Y et al (2002) On spectral clustering: analysis and an algorithm. Adv Neural Inf Process Syst 2:849–856
  30. Pan J-S, Lu Z-M, Sun S-H (2003) An efficient encoding algorithm for vector quantization based on subvector technique. IEEE Trans Image Process 12(3):265–270
  31. Pelleg D, Moore A (1999) Accelerating exact k-means algorithms with geometric reasoning. In: ACM SIGKDD fifth international conference on knowledge discovery and data mining, pp 277–281
  32. Phillips SJ (2002) Acceleration of k-means and related clustering algorithms. In: Mount D, Stein C (eds) Algorithm engineering and experiments. Lecture notes in computer science, vol 2409. Springer, Berlin/Heidelberg, pp 61–62
  33. Ra S-W, Kim JK (1993) A fast mean-distance-ordered partial codebook search algorithm for image vector quantization. IEEE Trans Circuits Syst II 40(9):576–579
  34. Schölkopf B, Smola A, Müller K-R (1998) Nonlinear component analysis as a kernel eigenvalue problem. Neural Comput 10(5):1299–1319
  35. Sculley D (2010) Web-scale k-means clustering. In: Proceedings of the 19th international conference on World Wide Web. ACM, New York, pp 1177–1178
  36. Sherwood T, Perelman E, Hamerly G, Calder B (2002) Automatically characterizing large scale program behavior. SIGOPS Oper Syst Rev 36(5):45–57
  37. Tai S-C, Lai CC, Lin Y-C (1996) Two fast nearest neighbor searching algorithms for image vector quantization. IEEE Trans Commun 44(12):1623–1628
  38. Vattani A (2011) k-means requires exponentially many iterations even in the plane. Discrete Comput Geom 45(4):596–616
  39. Wettschereck D, Dietterich T (1991) Improving the performance of radial basis function networks by learning center locations. Adv Neural Inf Process Syst 4:1133–1140
  40. Wu X, Kumar V, Quinlan JR, Ghosh J, Yang Q, Motoda H, McLachlan GJ, Ng A, Liu B, Yu PS, Zhou Z-H, Steinbach M, Hand DJ, Steinberg D (2008) Top 10 algorithms in data mining. Knowl Inf Syst 14(1):1–37
  41. Yael. https://gforge.inria.fr/projects/yael/. Version v1845, accessed 24 Jan 2014

Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  1. Baylor University, Waco, USA
