
Machine Learning, Volume 80, Issue 2–3, pp 213–243

Stability and model selection in k-means clustering

  • Ohad Shamir
  • Naftali Tishby

Abstract

Clustering stability methods are a family of widely used model selection techniques for data clustering. Their unifying theme is that an appropriate model should result in a clustering which is robust with respect to various kinds of perturbations. Despite their relative success, not much is known theoretically about why or when they work, or even what kind of assumptions they make in choosing an ‘appropriate’ model. Moreover, recent theoretical work has shown that they might ‘break down’ for large enough samples. In this paper, we focus on the behavior of clustering stability using k-means clustering. Our main technical result is an exact characterization of the distribution to which suitably scaled measures of instability converge, based on a sample drawn from any distribution in ℝⁿ satisfying mild regularity conditions. From this, we can show that clustering stability does not ‘break down’ even for arbitrarily large samples, at least for the k-means framework. Moreover, it allows us to identify the factors which eventually determine the behavior of clustering stability. This leads to some basic observations about what kind of assumptions are made when using these methods. While often reasonable, these assumptions might also lead to unexpected consequences.
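
As a rough illustration of the kind of instability measure discussed above, the following minimal sketch estimates instability for k-means by clustering two independent samples and comparing the induced labelings on a third sample. It is not the scaled measure analyzed in the paper. The one-dimensional data, the two-Gaussian mixture, the permutation-minimized disagreement distance, and all function names (`kmeans_1d`, `instability`, `two_gaussians`) are assumptions made for the example.

```python
import itertools
import random


def kmeans_1d(points, k, rng, iters=20, restarts=3):
    """Lloyd's algorithm on 1-D data; returns the centers of the lowest-cost restart."""
    best_cost, best_centers = float("inf"), None
    for _ in range(restarts):
        centers = rng.sample(points, k)  # initialize from k distinct sample points
        for _ in range(iters):
            groups = [[] for _ in range(k)]
            for x in points:
                j = min(range(k), key=lambda c: (x - centers[c]) ** 2)
                groups[j].append(x)
            # An empty cluster keeps its previous center.
            centers = [sum(g) / len(g) if g else centers[i]
                       for i, g in enumerate(groups)]
        cost = sum(min((x - c) ** 2 for c in centers) for x in points)
        if cost < best_cost:
            best_cost, best_centers = cost, centers
    return best_centers


def assign(points, centers):
    """Label each point with the index of its nearest center."""
    return [min(range(len(centers)), key=lambda c: (x - centers[c]) ** 2)
            for x in points]


def clustering_distance(labels_a, labels_b, k):
    """Fraction of points labelled differently, minimized over label permutations."""
    n = len(labels_a)
    return min(
        sum(perm[a] != b for a, b in zip(labels_a, labels_b)) / n
        for perm in itertools.permutations(range(k))
    )


def instability(draw_sample, k, n, trials, rng):
    """Average distance between clusterings fitted on two independent samples."""
    total = 0.0
    for _ in range(trials):
        s1, s2, held_out = (draw_sample(n, rng) for _ in range(3))
        c1, c2 = kmeans_1d(s1, k, rng), kmeans_1d(s2, k, rng)
        total += clustering_distance(assign(held_out, c1), assign(held_out, c2), k)
    return total / trials


def two_gaussians(n, rng):
    """Hypothetical data source: equal mixture of N(-5, 1) and N(5, 1)."""
    return [rng.gauss(-5.0 if rng.random() < 0.5 else 5.0, 1.0) for _ in range(n)]


rng = random.Random(0)
stab_k2 = instability(two_gaussians, 2, 150, 10, rng)
stab_k3 = instability(two_gaussians, 3, 150, 10, rng)
print("instability k=2:", stab_k2)
print("instability k=3:", stab_k3)
```

On this toy mixture one would expect the estimate for k=2 to be lower than for k=3, since with three centers the choice of which Gaussian gets split can vary from sample to sample. A stability-based model selection rule would then prefer k=2.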

Keywords

Clustering, Model selection, Stability, Statistical learning theory

Copyright information

© The Author(s) 2010

Authors and Affiliations

  1. School of Computer Science and Engineering, The Hebrew University, Jerusalem, Israel
  2. School of Computer Science and Engineering, and Interdisciplinary Center for Neural Computation, The Hebrew University, Jerusalem, Israel
