Mathematical Programming

, Volume 165, Issue 2, pp 605–642 | Cite as

Probably certifiably correct k-means clustering

  • Takayuki Iguchi
  • Dustin G. Mixon
  • Jesse Peterson
  • Soledad Villar
Full Length Paper Series A

Abstract

Recently, Bandeira (C R Math, 2015) introduced a new type of algorithm (the so-called probably certifiably correct algorithm) that combines fast solvers with the optimality certificates provided by convex relaxations. In this paper, we devise such an algorithm for the problem of k-means clustering. First, we prove that Peng and Wei’s semidefinite relaxation of k-means Peng and Wei (SIAM J Optim 18(1):186–205, 2007) is tight with high probability under a distribution of planted clusters called the stochastic ball model. Our proof follows from a new dual certificate for integral solutions of this semidefinite program. Next, we show how to test the optimality of a proposed k-means solution using this dual certificate in quasilinear time. Finally, we analyze a version of spectral clustering from Peng and Wei (SIAM J Optim 18(1):186–205, 2007) that is designed to solve k-means in the case of two clusters. In particular, we show that this quasilinear-time method typically recovers planted clusters under the stochastic ball model.

Mathematics Subject Classification

65-XX 90-XX 46N10 68Q87 

References

  1. 1.
    Abbe, E., Bandeira, A.S., Hall, G.: Exact recovery in the stochastic block model. IEEE Trans. Inf. Theory 62(1), 471–487 (2016)MathSciNetCrossRefMATHGoogle Scholar
  2. 2.
    Abbe, E., Sandon, C.: Community detection in general stochastic block models: fundamental limits and efficient algorithms for recovery. In: IEEE 56th Annual Symposium on Foundations of Computer Science, FOCS 2015, Berkeley, CA, USA, pp. 670–688, 17–20 October 2015Google Scholar
  3. 3.
    Arthur, D., Vassilvitskii, S.: k-Means++: the advantages of careful seeding. In: Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete algorithms (2007)Google Scholar
  4. 4.
    Awasthi, P., Bandeira, A.S., Charikar, M., Krishnaswamy, R., Villar, S., Ward, R.: Relax, no need to round: integrality of clustering formulations. In: Proceedings of the 2015 Conference on Innovations in Theoretical Computer Science, pp. 191–200. ACM (2015)Google Scholar
  5. 5.
    Bandeira, A.S.: A note on probably certifiably correct algorithms. C. R. Math. 354(3), 329–333 (2015)MathSciNetCrossRefGoogle Scholar
  6. 6.
    Chen, H., Peng, J.: 0–1 Semidefinite programming for graph-cut clustering: modelling and approximation. In: Data Mining and Mathematical Programming. CRM Proceedings and Lecture Notes of the American Mathematical Society, pp. 15–40 (2008)Google Scholar
  7. 7.
    Dhillon, I.S., Guan, Y., Kulis, B.: Kernel k-means: spectral clustering and normalized cuts. In: Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 551–556. ACM (2004)Google Scholar
  8. 8.
    Dhillon, I.S., Guan, Y., Kulis, B.: Weighted graph cuts without eigenvectors a multilevel approach. IEEE Trans. Pattern Anal. Mach. Intell. 29(11), 1944–1957 (2007)CrossRefGoogle Scholar
  9. 9.
    Elhamifar, E., Sapiro, G., Vidal, R.: Finding exemplars from pairwise dissimilarities via simultaneous sparse recovery. In: Advances in Neural Information Processing Systems, pp. 19–27 (2012)Google Scholar
  10. 10.
    Golub, G.H., Van Loan, C.F.: Matrix Computations, vol. 3. JHU Press, Baltimore (2012)MATHGoogle Scholar
  11. 11.
    Grant, M., Boyd, S., Ye, Y.: Graph implementations for nonsmooth convex programs. In: Blondel, V., Boyd, S., Kimura,H., (eds.) Recent Advances in Learning and Control. Lecture Notes in Control and Information Sciences. Springer, London, pp. 95–110 (2008)Google Scholar
  12. 12.
    Grant, M., Boyd, S.: CVX: matlab software for disciplined convex programming, version 2.1 (2014). http://cvxr.com/cvx
  13. 13.
    Iguchi, T., Mixon, D.G., Peterson, J., Villar, S.: On the tightness of an SDP relaxation of k-means. arXiv preprint arXiv:1505.04778 (2015)
  14. 14.
    Jain, K., Mahdian, M., Saberi, A.: A new greedy approach for facility location problems. In: Proceedings of the 34th Annual ACM Symposium on Theory of Computing (2002)Google Scholar
  15. 15.
    Laurent, B., Massart, P.: Adaptive estimation of a quadratic functional by model selection. Ann. Stat. 28, 1302–1338 (2000)MathSciNetCrossRefMATHGoogle Scholar
  16. 16.
    Lloyd, S.: Least squares quantization in PCM. IEEE Trans. Inf. Theory 28(2), 129–137 (1982)MathSciNetCrossRefMATHGoogle Scholar
  17. 17.
    Mallat, S.G., Zhang, Z.: Matching pursuits with time-frequency dictionaries. IEEE Trans. Sig. Process. 41(12), 3397–3415 (1993)CrossRefMATHGoogle Scholar
  18. 18.
    Mixon, D.G.: Cone programming cheat sheet. Short, Fat Matrices (weblog) (2015)Google Scholar
  19. 19.
    Nellore, A., Ward, R.: Recovery guarantees for exemplar-based clustering. Inf. Comput. 245, 165–180 (2015)MathSciNetCrossRefMATHGoogle Scholar
  20. 20.
    Nesterov, Y., Nemirovskii, A.: Interior-Point Polynomial Algorithms in Convex Programming, vol. 13. SIAM, Philadelphia (1994). doi:10.1137/1.9781611970791
  21. 21.
    Ostrovsky, R., Rabani, Y., Schulman, L., Swamy, C.: The effectiveness of lloyd-type methods for the k-means problem. In: Proceedings of the 47th Annual IEEE Symposium on Foundations of Computer Science (2006)Google Scholar
  22. 22.
    Peng, J., Wei, Y.: Approximating k-means-type clustering via semidefinite programming. SIAM J. Optim. 18(1), 186–205 (2007)MathSciNetCrossRefMATHGoogle Scholar
  23. 23.
    Tropp, J.A.: User-friendly tail bounds for sums of random matrices. Found. Comput. Math. 12(4), 389–434 (2012)MathSciNetCrossRefMATHGoogle Scholar
  24. 24.
    Vershynin, R.: Introduction to the non-asymptotic analysis of random matrices. arXiv:1011.3027v7 (2011)
  25. 25.
    Vinayak, R.K., Hassibi, B.: Similarity clustering in the presence of outliers: Exact recovery via convex program. In: IEEE International Symposium on Information Theory, ISIT 2016, Barcelona, Spain, pp. 91–95, 10–15 July 2016Google Scholar
  26. 26.
    Wang, H., Song, M.: Ckmeans.1d.dp: optimal k-means clustering in one dimension by dynamic programming. R J. 3(2), 29–33 (2011)Google Scholar

Copyright information

© Springer-Verlag Berlin Heidelberg and Mathematical Optimization Society (outside the USA) 2016

Authors and Affiliations

  1. 1.Department of Mathematics and StatisticsAir Force Institute of TechnologyWright-Patterson AFBUSA
  2. 2.Department of MathematicsUniversity of Texas at AustinAustinUSA

Personalised recommendations