Related Work on Geometry of Non-Convex Programs

  • Bin Shi
  • S. S. Iyengar


Over the past few years, there has been increasing interest in understanding the geometry of non-convex programs that arise naturally in machine learning. Of particular interest are additional properties of a non-convex objective under which popular optimization methods (such as gradient descent) escape saddle points and converge to a local minimum. The strict saddle property (Definition 5.6) is one such property, and it has been shown to hold in a broad range of applications.
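To make the saddle-escape behaviour concrete, the following is a minimal sketch on a toy objective. The function f(x, y) = x²/2 + y⁴/4 − y²/2 is a hypothetical example chosen for illustration and does not come from the chapter. It assumes the usual reading of a strict saddle as a critical point whose Hessian has a strictly negative eigenvalue; Definition 5.6 may state the condition differently. Plain gradient descent started from a small random perturbation of the saddle at the origin moves away along the negative-curvature direction and settles at one of the two local minima (0, ±1).

```python
import numpy as np

# Hypothetical toy objective (not from the chapter):
#   f(x, y) = 0.5*x^2 + 0.25*y^4 - 0.5*y^2
# Critical points: a saddle at (0, 0) and local minima at (0, +1), (0, -1).

def grad(p):
    """Gradient of the toy objective."""
    x, y = p
    return np.array([x, y**3 - y])

def hessian(p):
    """Hessian of the toy objective (diagonal for this f)."""
    _, y = p
    return np.array([[1.0, 0.0], [0.0, 3.0 * y**2 - 1.0]])

def gradient_descent(p0, step=0.1, iters=500):
    """Fixed-step gradient descent; step and iters are illustrative choices."""
    p = np.array(p0, dtype=float)
    for _ in range(iters):
        p = p - step * grad(p)
    return p

saddle = np.array([0.0, 0.0])
# The Hessian at the origin has one negative eigenvalue, so the origin
# is a saddle with a direction of strictly negative curvature.
saddle_eigs = np.linalg.eigvalsh(hessian(saddle))

# Start from a small random perturbation of the saddle point.
rng = np.random.default_rng(0)
start = saddle + 1e-3 * rng.standard_normal(2)
end = gradient_descent(start)

# At the limit point every Hessian eigenvalue is positive (a local minimum).
end_eigs = np.linalg.eigvalsh(hessian(end))

print("Hessian eigenvalues at saddle:", saddle_eigs)
print("final iterate:", end)
print("Hessian eigenvalues at final iterate:", end_eigs)
```

The random perturbation matters: an iterate started exactly at the origin has zero gradient and never moves, which is why results of this kind are typically stated for random initialization or perturbed iterates.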


Keywords: Newton’s method; Cholesky’s method; Bayesian network; Regression model; Elastic-net regularizer; Sequential Monte Carlo (SMC); Sparse-GEV model



Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  • Bin Shi (1)
  • S. S. Iyengar (2)

  1. University of California, Berkeley, USA
  2. Florida International University, Miami, USA
