Worst-Case and Smoothed Analysis of k-Means Clustering with Bregman Divergences

  • Bodo Manthey
  • Heiko Röglin
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5878)

Abstract

The k-means algorithm is the method of choice for clustering large-scale data sets and it performs exceedingly well in practice. Most of the theoretical work is restricted to the case that squared Euclidean distances are used as similarity measure. In many applications, however, data is to be clustered with respect to other measures like, e.g., relative entropy, which is commonly used to cluster web pages. In this paper, we analyze the running-time of the k-means method for Bregman divergences, a very general class of similarity measures including squared Euclidean distances and relative entropy. We show that the exponential lower bound known for the Euclidean case carries over to almost every Bregman divergence. To narrow the gap between theory and practice, we also study k-means in the semi-random input model of smoothed analysis. For the case that n data points in ℝd are perturbed by noise with standard deviation σ, we show that for almost arbitrary Bregman divergences the expected running-time is bounded by \({\rm poly}(n^{\sqrt k}, 1/\sigma)\) and kkd ·poly(n, 1/σ).

Preview

Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.

References

  1. 1.
    Ackermann, M.R., Blömer, J.: Coresets and approximate clustering for Bregman divergences. In: Proc. of the 20th ACM-SIAM Symp. on Discrete Algorithms (SODA), pp. 1088–1097 (2009)Google Scholar
  2. 2.
    Ackermann, M.R., Blömer, J., Sohler, C.: Clustering for metric and non-metric distance measures. In: Proc. of the 19th ACM-SIAM Symp. on Discrete Algorithms (SODA), pp. 799–808 (2008)Google Scholar
  3. 3.
    Arthur, D., Manthey, B., Röglin, H.: k-means has polynomial smoothed complexity. In: Proc. of the 50th Ann. IEEE Symp. on Found. of Computer Science, FOCS (to appear, 2009)Google Scholar
  4. 4.
    Arthur, D., Vassilvitskii, S.: Worst-case and smoothed analysis of the ICP algorithm, with an application to the k-means method. SIAM Journal on Computing 39(2), 766–782 (2009)CrossRefMathSciNetGoogle Scholar
  5. 5.
    Banerjee, A., Merugu, S., Dhillon, I.S., Ghosh, J.: Clustering with Bregman divergences. Journal of Machine Learning Research 6, 1705–1749 (2005)MathSciNetGoogle Scholar
  6. 6.
    Berkhin, P.: Survey of clustering data mining techniques. Technical report, Accrue Software, San Jose, CA, USA (2002)Google Scholar
  7. 7.
    Dhillon, I.S., Mallela, S., Kumar, R.: A divisive information-theoretic feature clustering algorithm for text classification. Journal of Machine Learning Research 3, 1265–1287 (2003)MATHCrossRefGoogle Scholar
  8. 8.
    Duda, R.O., Hart, P.E., Stork, D.G.: Pattern Classification. John Wiley & Sons, Chichester (2000)Google Scholar
  9. 9.
    Feller, W.: An Introduction to Probability Theory and Its Applications, vol. II. John Wiley & Sons, Chichester (1971)MATHGoogle Scholar
  10. 10.
    Gray, R.M., Buzo, A., Gray Jr., A.H., Matsuyama, Y.: Distortion measures for speech processing. IEEE Transactions on Acoustics, Speech, and Signal Processing 28(4), 367–376 (1980)MATHCrossRefGoogle Scholar
  11. 11.
    Inaba, M., Katoh, N., Imai, H.: Variance-based k-clustering algorithms by Voronoi diagrams and randomization. IEICE Transactions on Information and Systems E83-D(6), 1199–1206 (2000)Google Scholar
  12. 12.
    Lloyd, S.P.: Least squares quantization in PCM. IEEE Transactions on Information Theory 28(2), 129–137 (1982)MATHCrossRefMathSciNetGoogle Scholar
  13. 13.
    Manthey, B., Röglin, H.: Improved smoothed analysis of the k-means method. In: Proc. of the 20th ACM-SIAM Symp. on Discrete Algorithms (SODA), pp. 461–470 (2009)Google Scholar
  14. 14.
    Spielman, D.A., Teng, S.-H.: Smoothed analysis of algorithms: Why the simplex algorithm usually takes polynomial time. Journal of the ACM 51(3), 385–463 (2004)CrossRefMathSciNetGoogle Scholar
  15. 15.
    Vattani, A.: k-means requires exponentially many iterations even in the plane. In: Proc. of the 25th ACM Symp. on Computational Geometry (SoCG), pp. 324–332 (2009)Google Scholar

Copyright information

© Springer-Verlag Berlin Heidelberg 2009

Authors and Affiliations

  • Bodo Manthey
    • 1
  • Heiko Röglin
    • 2
  1. 1.Department of Applied MathematicsUniversity of Twente 
  2. 2.Department of Quantitative EconomicsMaastricht University 

Personalised recommendations