An alternative to EM for Gaussian mixture models: batch and stochastic Riemannian optimization

  • Reshad Hosseini
  • Suvrit Sra
Full Length Paper Series A


We consider maximum likelihood estimation for Gaussian mixture models (GMMs). This task is almost invariably solved (in theory and practice) via the Expectation Maximization (EM) algorithm. EM owes its success to various factors, of which its ability to fulfill positive definiteness constraints in closed form is of key importance. We propose an alternative to EM grounded in the Riemannian geometry of positive definite matrices, which we use to cast GMM parameter estimation as a Riemannian optimization problem. Surprisingly, such an out-of-the-box Riemannian formulation completely fails and proves much inferior to EM. This motivates us to take a closer look at the problem geometry and derive a better formulation that is much more amenable to Riemannian optimization. We then develop Riemannian batch and stochastic gradient algorithms that outperform EM, often substantially. We provide a non-asymptotic convergence analysis for our stochastic method, which is also the first (to our knowledge) such global analysis for Riemannian stochastic gradient. Numerous empirical results are included to demonstrate the effectiveness of our methods.
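To make "optimization over positive definite matrices" concrete, here is a minimal sketch of Riemannian gradient descent. It is not the reformulated GMM objective or the algorithms developed in the paper. It is a toy assumption: fitting the covariance of a single zero-mean Gaussian under the affine-invariant metric with the exponential map. The function names, step size, and iteration count are all hypothetical choices made for illustration.

```python
import numpy as np

def _sym_fun(S, fun):
    """Apply a scalar function to a symmetric matrix via its eigendecomposition."""
    w, V = np.linalg.eigh(S)
    return (V * fun(w)) @ V.T

def riemannian_step(S, C, step):
    """One Riemannian gradient step for f(S) = 0.5 * (logdet(S) + tr(S^{-1} C))."""
    S_half = _sym_fun(S, np.sqrt)
    S_inv_half = _sym_fun(S, lambda w: 1.0 / np.sqrt(w))
    # Euclidean gradient is 0.5 * (S^{-1} - S^{-1} C S^{-1}); under the
    # affine-invariant metric the Riemannian gradient is S * egrad * S.
    rgrad = 0.5 * (S - C)
    # Exponential map: S^{1/2} expm(-step * S^{-1/2} rgrad S^{-1/2}) S^{1/2}.
    inner = S_inv_half @ rgrad @ S_inv_half
    inner = 0.5 * (inner + inner.T)  # guard against round-off asymmetry
    S_new = S_half @ _sym_fun(-step * inner, np.exp) @ S_half
    return 0.5 * (S_new + S_new.T)

def fit_covariance(X, step=0.5, iters=300):
    """Minimize the negative average log-likelihood of zero-mean Gaussian data."""
    C = X.T @ X / X.shape[0]   # sample covariance (the known minimizer)
    S = np.eye(X.shape[1])     # start at the identity, which is positive definite
    for _ in range(iters):
        S = riemannian_step(S, C, step)
    return S, C

# Synthetic data (an assumption of this sketch, not data from the paper).
rng = np.random.default_rng(0)
A = np.array([[1.0, 0.3], [0.0, 0.8]])
X = rng.standard_normal((2000, 2)) @ A
S_hat, C = fit_covariance(X)
```

Because every iterate is produced by the exponential map, it stays positive definite without any projection step. That property is the constraint handling the abstract credits EM with achieving in closed form.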


Stochastic optimization; Riemannian optimization; Gaussian mixture models; Positive definite matrices; Retraction; Non-asymptotic rate of convergence

Mathematics Subject Classification

49 53 58 62 90 




Copyright information

© Springer-Verlag GmbH Germany, part of Springer Nature and Mathematical Optimization Society 2019

Authors and Affiliations

  1. School of ECE, College of Engineering, University of Tehran, Tehran, Iran
  2. School of Computer Science, Institute of Research in Fundamental Sciences (IPM), Tehran, Iran
  3. Massachusetts Institute of Technology, Cambridge, USA
