Mathematical Programming, Volume 165, Issue 2, pp 605–642

Probably certifiably correct k-means clustering

  • Takayuki Iguchi
  • Dustin G. Mixon
  • Jesse Peterson
  • Soledad Villar
Full Length Paper Series A

Abstract

Recently, Bandeira (C R Math, 2015) introduced a new type of algorithm, the so-called probably certifiably correct algorithm, that combines fast solvers with the optimality certificates provided by convex relaxations. In this paper, we devise such an algorithm for the problem of k-means clustering. First, we prove that Peng and Wei’s semidefinite relaxation of k-means (SIAM J Optim 18(1):186–205, 2007) is tight with high probability under a distribution of planted clusters called the stochastic ball model. Our proof follows from a new dual certificate for integral solutions of this semidefinite program. Next, we show how to use this dual certificate to test the optimality of a proposed k-means solution in quasilinear time. Finally, we analyze a version of spectral clustering from Peng and Wei (SIAM J Optim 18(1):186–205, 2007) that is designed to solve k-means in the case of two clusters. In particular, we show that this quasilinear-time method typically recovers planted clusters under the stochastic ball model.
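
To make the notion of an "integral solution" concrete, the following minimal Python sketch draws points from a simple instance of the stochastic ball model and checks that the planted partition yields a feasible point of the relaxation whose objective matches the k-means value. It is an illustration under stated assumptions, not the authors' code: it assumes the relaxation takes its commonly stated form (minimize Tr(DX) subject to Tr(X) = k, X1 = 1, X entrywise nonnegative, X positive semidefinite, with D the matrix of squared pairwise distances), it uses the uniform distribution on the unit ball as the rotation-invariant distribution, and all function names are hypothetical. It does not solve the semidefinite program or construct the dual certificate.

```python
import math
import random


def sample_unit_ball(dim, rng):
    """Draw one point uniformly from the unit ball in R^dim."""
    g = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(v * v for v in g))
    radius = rng.random() ** (1.0 / dim)  # radial law of the uniform ball
    return [radius * v / norm for v in g]


def stochastic_ball_model(centers, n_per_ball, rng):
    """Assumed model: n_per_ball points per center, each center plus ball noise."""
    points, labels = [], []
    for t, center in enumerate(centers):
        for _ in range(n_per_ball):
            noise = sample_unit_ball(len(center), rng)
            points.append([c + r for c, r in zip(center, noise)])
            labels.append(t)
    return points, labels


def sq_dist(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))


def kmeans_value(points, labels, k):
    """Sum of squared distances from each point to its cluster centroid."""
    total = 0.0
    for t in range(k):
        cluster = [p for p, lab in zip(points, labels) if lab == t]
        centroid = [sum(col) / len(cluster) for col in zip(*cluster)]
        total += sum(sq_dist(p, centroid) for p in cluster)
    return total


def partition_matrix(labels, k):
    """Integral candidate X: entry 1/|A_t| when i and j share cluster t, else 0."""
    sizes = [labels.count(t) for t in range(k)]
    n = len(labels)
    return [[1.0 / sizes[labels[i]] if labels[i] == labels[j] else 0.0
             for j in range(n)] for i in range(n)]


def trace_DX(points, X):
    """Tr(DX), where D holds the squared pairwise distances."""
    n = len(points)
    return sum(sq_dist(points[i], points[j]) * X[i][j]
               for i in range(n) for j in range(n))


rng = random.Random(0)
centers = [[0.0, 0.0], [3.0, 0.0]]  # hypothetical, well-separated centers
points, labels = stochastic_ball_model(centers, 20, rng)
X = partition_matrix(labels, 2)

direct = kmeans_value(points, labels, 2)
via_trace = 0.5 * trace_DX(points, X)  # k-means value equals Tr(DX) / 2
trace_X = sum(X[i][i] for i in range(len(X)))
row_sums = [sum(row) for row in X]
print(direct, via_trace, trace_X)
```

The two printed objective values agree up to floating-point error because the sum of squared distances to a centroid equals the sum of squared pairwise distances within the cluster divided by twice the cluster size. Tightness of the relaxation means that, with high probability under the model, no non-integral feasible X achieves a smaller objective than this planted one.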

Mathematics Subject Classification

65-XX 90-XX 46N10 68Q87 

Notes

Acknowledgements

The authors thank the anonymous referees, whose suggestions significantly improved this paper’s presentation and literature review. The authors also thank Afonso S. Bandeira and Nicolas Boumal for interesting discussions and valuable comments on an earlier version of this manuscript, and Xiaodong Li and Yang Li for interesting comments on our dual certificate. DGM was supported by an AFOSR Young Investigator Research Program award, NSF Grant No. DMS-1321779, and AFOSR Grant No. F4FGA05076J002. SV was supported by Rachel Ward’s NSF CAREER award and AFOSR Young Investigator Research Program award. The views expressed in this article are those of the authors and do not reflect the official policy or position of the United States Air Force, Department of Defense, or the U.S. Government.

References

  1. Abbe, E., Bandeira, A.S., Hall, G.: Exact recovery in the stochastic block model. IEEE Trans. Inf. Theory 62(1), 471–487 (2016)
  2. Abbe, E., Sandon, C.: Community detection in general stochastic block models: fundamental limits and efficient algorithms for recovery. In: IEEE 56th Annual Symposium on Foundations of Computer Science, FOCS 2015, Berkeley, CA, USA, pp. 670–688, 17–20 October 2015
  3. Arthur, D., Vassilvitskii, S.: k-Means++: the advantages of careful seeding. In: Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms (2007)
  4. Awasthi, P., Bandeira, A.S., Charikar, M., Krishnaswamy, R., Villar, S., Ward, R.: Relax, no need to round: integrality of clustering formulations. In: Proceedings of the 2015 Conference on Innovations in Theoretical Computer Science, pp. 191–200. ACM (2015)
  5. Bandeira, A.S.: A note on probably certifiably correct algorithms. C. R. Math. 354(3), 329–333 (2015)
  6. Chen, H., Peng, J.: 0–1 Semidefinite programming for graph-cut clustering: modelling and approximation. In: Data Mining and Mathematical Programming. CRM Proceedings and Lecture Notes of the American Mathematical Society, pp. 15–40 (2008)
  7. Dhillon, I.S., Guan, Y., Kulis, B.: Kernel k-means: spectral clustering and normalized cuts. In: Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 551–556. ACM (2004)
  8. Dhillon, I.S., Guan, Y., Kulis, B.: Weighted graph cuts without eigenvectors: a multilevel approach. IEEE Trans. Pattern Anal. Mach. Intell. 29(11), 1944–1957 (2007)
  9. Elhamifar, E., Sapiro, G., Vidal, R.: Finding exemplars from pairwise dissimilarities via simultaneous sparse recovery. In: Advances in Neural Information Processing Systems, pp. 19–27 (2012)
  10. Golub, G.H., Van Loan, C.F.: Matrix Computations, vol. 3. JHU Press, Baltimore (2012)
  11. Grant, M., Boyd, S., Ye, Y.: Graph implementations for nonsmooth convex programs. In: Blondel, V., Boyd, S., Kimura, H. (eds.) Recent Advances in Learning and Control. Lecture Notes in Control and Information Sciences. Springer, London, pp. 95–110 (2008)
  12. Grant, M., Boyd, S.: CVX: Matlab software for disciplined convex programming, version 2.1 (2014). http://cvxr.com/cvx
  13. Iguchi, T., Mixon, D.G., Peterson, J., Villar, S.: On the tightness of an SDP relaxation of k-means. arXiv preprint arXiv:1505.04778 (2015)
  14. Jain, K., Mahdian, M., Saberi, A.: A new greedy approach for facility location problems. In: Proceedings of the 34th Annual ACM Symposium on Theory of Computing (2002)
  15. Laurent, B., Massart, P.: Adaptive estimation of a quadratic functional by model selection. Ann. Stat. 28, 1302–1338 (2000)
  16. Lloyd, S.: Least squares quantization in PCM. IEEE Trans. Inf. Theory 28(2), 129–137 (1982)
  17. Mallat, S.G., Zhang, Z.: Matching pursuits with time-frequency dictionaries. IEEE Trans. Sig. Process. 41(12), 3397–3415 (1993)
  18. Mixon, D.G.: Cone programming cheat sheet. Short, Fat Matrices (weblog) (2015)
  19. Nellore, A., Ward, R.: Recovery guarantees for exemplar-based clustering. Inf. Comput. 245, 165–180 (2015)
  20. Nesterov, Y., Nemirovskii, A.: Interior-Point Polynomial Algorithms in Convex Programming, vol. 13. SIAM, Philadelphia (1994). doi: 10.1137/1.9781611970791
  21. Ostrovsky, R., Rabani, Y., Schulman, L., Swamy, C.: The effectiveness of Lloyd-type methods for the k-means problem. In: Proceedings of the 47th Annual IEEE Symposium on Foundations of Computer Science (2006)
  22. Peng, J., Wei, Y.: Approximating k-means-type clustering via semidefinite programming. SIAM J. Optim. 18(1), 186–205 (2007)
  23. Tropp, J.A.: User-friendly tail bounds for sums of random matrices. Found. Comput. Math. 12(4), 389–434 (2012)
  24. Vershynin, R.: Introduction to the non-asymptotic analysis of random matrices. arXiv:1011.3027v7 (2011)
  25. Vinayak, R.K., Hassibi, B.: Similarity clustering in the presence of outliers: exact recovery via convex program. In: IEEE International Symposium on Information Theory, ISIT 2016, Barcelona, Spain, pp. 91–95, 10–15 July 2016
  26. Wang, H., Song, M.: Ckmeans.1d.dp: optimal k-means clustering in one dimension by dynamic programming. R J. 3(2), 29–33 (2011)

Copyright information

© Springer-Verlag Berlin Heidelberg and Mathematical Optimization Society (outside the USA) 2016

Authors and Affiliations

  1. Department of Mathematics and Statistics, Air Force Institute of Technology, Wright-Patterson AFB, USA
  2. Department of Mathematics, University of Texas at Austin, Austin, USA
