Geometric Optimization in Machine Learning

  • Suvrit Sra
  • Reshad Hosseini
Part of the Advances in Computer Vision and Pattern Recognition book series (ACVPR)


Machine learning models often rely on sparsity, low rank, orthogonality, correlation, or graphical structure. The structure of interest in this chapter is geometric, specifically the manifold of positive definite (PD) matrices. Although these matrices recur throughout the applied sciences, our focus is on more recent developments in machine learning and optimization. In particular, we study (i) models that may be nonconvex in the Euclidean sense but are convex along geodesics of the PD manifold; and (ii) models that are neither Euclidean nor geodesically convex but are nevertheless amenable to global optimization. We cover basic theory for (i) and (ii); subsequently, we present a scalable Riemannian limited-memory BFGS algorithm (which also applies to other manifolds). We highlight applications from statistics and machine learning that benefit from the geometric structures studied.
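To illustrate the kind of geodesically convex problem the abstract refers to, the sketch below runs Riemannian gradient descent for the geometric (Karcher) mean of PD matrices under the affine-invariant metric: the objective is nonconvex in the Euclidean sense yet geodesically convex on the PD manifold. This is a minimal illustration, not code from the chapter; the function name `karcher_mean` and the step size `eta` are illustrative choices.

```python
import numpy as np
from scipy.linalg import sqrtm, logm, expm

def karcher_mean(mats, eta=1.0, iters=50, tol=1e-10):
    """Riemannian gradient descent for the geometric (Karcher) mean
    of PD matrices under the affine-invariant metric."""
    X = sum(mats) / len(mats)  # arithmetic mean as a starting point
    for _ in range(iters):
        Xh = sqrtm(X)
        Xhi = np.linalg.inv(Xh)
        # Riemannian gradient direction: average of log-maps of the data at X
        G = sum(logm(Xhi @ A @ Xhi) for A in mats) / len(mats)
        if np.linalg.norm(G) < tol:
            break
        # geodesic (exponential-map) step on the PD manifold
        X = Xh @ expm(eta * G) @ Xh
    return np.real(X)
```

For two commuting matrices such as `diag(1, 4)` and `diag(4, 1)`, the iteration recovers their entrywise geometric mean `diag(2, 2)`, which is easy to verify by hand.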


Keywords: Riemannian manifold · Tangent space · Expectation maximization · Descent direction · Positive definite



SS acknowledges partial support from NSF grant IIS-1409802.



Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  1. Laboratory for Information & Decision Systems (LIDS), Massachusetts Institute of Technology, Cambridge, USA
  2. School of ECE, College of Engineering, University of Tehran, Tehran, Iran
