
Simple Algorithms for Optimization on Riemannian Manifolds with Constraints


Abstract

We consider optimization problems on manifolds with equality and inequality constraints. A large body of work treats constrained optimization in Euclidean spaces. In this work, we consider extensions of existing algorithms from the Euclidean case to the Riemannian case. Thus, the variable lives on a known smooth manifold and is further constrained. In doing so, we exploit the growing literature on unconstrained Riemannian optimization. For the special case where the manifold is itself described by equality constraints, one could in principle treat the whole problem as a constrained problem in a Euclidean space. The main hypothesis we test here is whether it is sometimes better to exploit the geometry of the constraints, even if only for a subset of them. Specifically, this paper extends an augmented Lagrangian method and smoothed versions of an exact penalty method to the Riemannian case, together with some fundamental convergence results. Numerical experiments indicate some gains in computational efficiency and accuracy in some regimes for minimum balanced cut, non-negative PCA and k-means, especially in high dimensions.
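In the notation used throughout the appendices below, the problem class described in the abstract can be written as

$$\begin{aligned} \min _{x\in \mathcal{{M}}} f(x) \quad \text {subject to} \quad g_i(x) \le 0 \ \text { for } i\in \mathcal{{I}}, \qquad h_j(x) = 0 \ \text { for } j\in \mathcal{{E}}, \end{aligned}$$

where \(\mathcal{{M}}\) is a known smooth Riemannian manifold. This restatement is only meant to fix the symbols \(f\), \(g_i\), \(h_j\), \(\mathcal{{I}}\), \(\mathcal{{E}}\) and \(\mathcal{{M}}\) that the appendices use; the precise assumptions are stated in the main text.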


Notes

  1. Note that this condition involves \(\mathcal {L}\) as defined in Sect. 2.2, not \(\mathcal{{L}}_\rho \).

  2. https://github.com/losangle/Optimization-on-manifolds-with-extra-constraints.

  3. When the step size is of order \(10^{-10}\), we believe that the current point is close to convergence. We also conducted experiments with minimum step size \(10^{-7}\) for minimum balanced cut and non-negative PCA, and the performance profiles are visually similar to those displayed here.

  4. The proof follows an argument laid out by John M. Lee: https://math.stackexchange.com/questions/2307289/parallel-transport-along-radial-geodesics-yields-a-smooth-vector-field.

References

  1. Absil, P.-A., Hosseini, S.: A collection of nonsmooth Riemannian optimization problems. Technical Report UCL-INMA-2017.08, Université catholique de Louvain (2017)

  2. Absil, P.-A., Mahony, R., Sepulchre, R.: Optimization Algorithms on Matrix Manifolds. Princeton University Press, Princeton (2008)

  3. Agarwal, N., Boumal, N., Bullins, B., Cartis, C.: Adaptive regularization with cubics on manifolds (2018). arXiv preprint arXiv:1806.00065

  4. Albert, R., Barabási, A.-L.: Statistical mechanics of complex networks. Rev. Mod. Phys. 74, 47–97 (2002)

  5. Andreani, R., Birgin, E.G., Martínez, J.M., Schuverdt, M.L.: On augmented Lagrangian methods with general lower-level constraints. SIAM J. Optim. 18(4), 1286–1309 (2007)

  6. Andreani, R., Haeser, G., Martínez, J.M.: On sequential optimality conditions for smooth constrained optimization. Optimization 60(5), 627–641 (2011)

  7. Andreani, R., Haeser, G., Ramos, A., Silva, P.J.: A second-order sequential optimality condition associated to the convergence of optimization algorithms. IMA J. Numer. Anal. 37, 1902–1929 (2017)

  8. Bento, G., Ferreira, O., Melo, J.: Iteration-complexity of gradient, subgradient and proximal point methods on Riemannian manifolds. J. Optim. Theory Appl. 173(2), 548–562 (2017)

  9. Bento, G.C., Ferreira, O.P., Melo, J.G.: Iteration-complexity of gradient, subgradient and proximal point methods on Riemannian manifolds. J. Optim. Theory Appl. 173(2), 548–562 (2017)

  10. Bergmann, R., Herzog, R.: Intrinsic formulation of KKT conditions and constraint qualifications on smooth manifolds (2018). arXiv preprint arXiv:1804.06214

  11. Bergmann, R., Persch, J., Steidl, G.: A parallel Douglas-Rachford algorithm for minimizing ROF-like functionals on images with values in symmetric Hadamard manifolds. SIAM J. Imaging Sci. 9(3), 901–937 (2016)

  12. Bertsekas, D.P.: Constrained Optimization and Lagrange Multiplier Methods. Athena Scientific, Belmont (1982)

  13. Bertsekas, D.P.: Nonlinear Programming. Athena Scientific, Belmont (1999)

  14. Birgin, E., Haeser, G., Ramos, A.: Augmented Lagrangians with constrained subproblems and convergence to second-order stationary points. Optimization Online (2016)

  15. Birgin, E.G., Floudas, C.A., Martínez, J.M.: Global minimization using an augmented Lagrangian method with variable lower-level constraints. Math. Program. 125(1), 139–162 (2010)

  16. Birgin, E.G., Martínez, J.M.: Practical Augmented Lagrangian Methods for Constrained Optimization. SIAM (2014)

  17. Boumal, N., Absil, P.-A., Cartis, C.: Global rates of convergence for nonconvex optimization on manifolds. IMA J. Numer. Anal. (2018)

  18. Boumal, N., Mishra, B., Absil, P.-A., Sepulchre, R.: Manopt, a Matlab toolbox for optimization on manifolds. J. Mach. Learn. Res. 15(1), 1455–1459 (2014)

  19. Burer, S., Monteiro, R.D.: A nonlinear programming algorithm for solving semidefinite programs via low-rank factorization. Math. Program. 95(2), 329–357 (2003)

  20. Byrd, R.H., Nocedal, J., Waltz, R.A.: Knitro: An integrated package for nonlinear optimization. In: Large-Scale Nonlinear Optimization, pp. 35–59. Springer (2006)

  21. Cambier, L., Absil, P.-A.: Robust low-rank matrix completion by Riemannian optimization. SIAM J. Sci. Comput. 38(5), S440–S460 (2016)

  22. do Carmo, M.P.: Riemannian Geometry. Birkhäuser, Boston (1992)

  23. Carson, T., Mixon, D.G., Villar, S.: Manifold optimization for k-means clustering. In: Sampling Theory and Applications (SampTA), 2017 International Conference on, pp. 73–77. IEEE (2017)

  24. Chatterjee, A., Govindu, V.M.: Efficient and robust large-scale rotation averaging. In: IEEE International Conference on Computer Vision (ICCV) (2013)

  25. Chen, C., Mangasarian, O.L.: Smoothing methods for convex inequalities and linear complementarity problems. Math. Program. 71(1), 51–69 (1995)

  26. Clarke, F.H.: Optimization and Nonsmooth Analysis. SIAM (1990)

  27. Conn, A.R., Gould, N.I.M., Toint, P.L.: LANCELOT: A Fortran Package for Large-Scale Nonlinear Optimization (Release A), vol. 17. Springer, New York (2013)

  28. Dolan, E.D., Moré, J.J.: Benchmarking optimization software with performance profiles. Math. Program. 91(2), 201–213 (2002)

  29. Dreisigmeyer, D.W.: Equality constraints, Riemannian manifolds and direct search methods. Optimization Online (2007)

  30. Gould, N.I., Toint, P.L.: A note on the convergence of barrier algorithms to second-order necessary points. Math. Program. 85(2), 433–438 (1999)

  31. Grohs, P., Hosseini, S.: \(\varepsilon \)-subgradient algorithms for locally Lipschitz functions on Riemannian manifolds. Adv. Comput. Math. 42(2), 333–360 (2016)

  32. Guo, L., Lin, G.-H., Jane, J.Y.: Second-order optimality conditions for mathematical programs with equilibrium constraints. J. Optim. Theory. Appl. 158(1), 33–64 (2013)

  33. Hosseini, S., Huang, W., Yousefpour, R.: Line search algorithms for locally Lipschitz functions on Riemannian manifolds. SIAM J. Optim. 28(1), 596–619 (2018)

  34. Hosseini, S., Pouryayevali, M.: Generalized gradients and characterization of epi-Lipschitz sets in Riemannian manifolds. Nonlinear Anal. 74(12), 3884–3895 (2011)

  35. Huang, W., Absil, P.-A., Gallivan, K., Hand, P.: ROPTLIB: an object-oriented C++ library for optimization on Riemannian manifolds. Technical Report FSU16-14.v2, Florida State University (2016)

  36. Huang, W., Gallivan, K.A., Absil, P.-A.: A Broyden class of quasi-Newton methods for Riemannian optimization. SIAM J. Optim. 25(3), 1660–1685 (2015)

  37. Johnstone, I.M., Lu, A.Y.: On consistency and sparsity for principal components analysis in high dimensions. J. Am. Stat. Assoc. 104(486), 682–693 (2009)

  38. Kanzow, C., Steck, D.: An example comparing the standard and safeguarded augmented Lagrangian methods. Oper. Res. Lett. 45(6), 598–603 (2017)

  39. Khuzani, M.B., Li, N.: Stochastic primal-dual method on Riemannian manifolds with bounded sectional curvature (2017). arXiv preprint arXiv:1703.08167

  40. Kovnatsky, A., Glashoff, K., Bronstein, M.M.: Madmm: a generic algorithm for non-smooth optimization on manifolds. In: European Conference on Computer Vision, pp. 680–696. Springer (2016)

  41. Lang, K.: Fixing two weaknesses of the spectral method. In: Advances in Neural Information Processing Systems, pp. 715–722 (2006)

  42. Lee, J.: Introduction to Smooth Manifolds. Graduate Texts in Mathematics, vol. 218, 2nd edn. Springer, New York (2012)

  43. Lee, J.M.: Smooth manifolds. In: Introduction to Smooth Manifolds, pp. 1–29. Springer (2003)

  44. Lewis, A.S., Overton, M.L.: Nonsmooth optimization via BFGS. Submitted to SIAM J. Optim. (2009)

  45. Lichman, M.: UCI machine learning repository (2013)

  46. Montanari, A., Richard, E.: Non-negative principal component analysis: message passing algorithms and sharp asymptotics. IEEE Trans. Inf. Theory 62(3), 1458–1484 (2016)

  47. Nocedal, J., Wright, S.J.: Numerical Optimization, 2nd edn. Springer, New York (2006)

  48. Parikh, N., Boyd, S.: Proximal Algorithms, vol. 1. Now Publishers Inc., Hanover (2014)

  49. Pinar, M.Ç., Zenios, S.A.: On smoothing exact penalty functions for convex constrained optimization. SIAM J. Optim. 4(3), 486–511 (1994)

  50. Ruszczyński, A.P.: Nonlinear Optimization, vol. 13. Princeton University Press, Princeton (2006)

  51. Townsend, J., Koep, N., Weichwald, S.: Pymanopt: a Python toolbox for optimization on manifolds using automatic differentiation. J. Mach. Learn. Res. 17, 1–5 (2016)

  52. Weber, M., Sra, S.: Frank–Wolfe methods for geodesically convex optimization with application to the matrix geometric mean (2017). arXiv preprint arXiv:1710.10770

  53. Yang, W.H., Zhang, L.-H., Song, R.: Optimality conditions for the nonlinear programming problems on Riemannian manifolds. Pac. J. Optim. 10(2), 415–434 (2014)

  54. Zass, R., Shashua, A.: Nonnegative sparse PCA. In: Advances in Neural Information Processing Systems, pp. 1561–1568 (2007)

  55. Zhang, J., Ma, S., Zhang, S.: Primal-dual optimization algorithms over Riemannian manifolds: an iteration complexity analysis (2017). arXiv preprint arXiv:1710.02236

  56. Zhang, J., Zhang, S.: A cubic regularized Newton’s method over Riemannian manifolds (2018). arXiv preprint arXiv:1805.05565

Acknowledgements

We thank an anonymous reviewer for detailed and helpful comments on the first version of this paper. NB is partially supported by NSF grant DMS-1719558.

Corresponding author

Correspondence to Nicolas Boumal.


Appendices

A Proof of Proposition 3.2

We first introduce two supporting lemmas. The first lemma is a well-known fact for which we provide a proof for completeness (see Footnote 4).

Lemma A.1

Let p be a point on a Riemannian manifold \(\mathcal{{M}}\), and let v be a tangent vector at p. Let \(\mathcal{{U}}\) be a normal neighborhood of p, that is, the exponential map maps a neighbourhood of the origin of \(\mathrm {T}_p\mathcal{{M}}\) diffeomorphically to \(\mathcal{{U}}\). Define the following vector field on \(\mathcal{{U}}\):

$$\begin{aligned} \forall q \in \mathcal{{U}}, \qquad V(q)&= \mathcal{{P}}_{p \rightarrow q} v, \end{aligned}$$

where parallel transport is done along the (unique) minimizing geodesic from p to q. Then, V is a smooth vector field on \(\mathcal{{U}}\).

Proof

Parallel transport from p is along geodesics passing through p. To facilitate their study, set up normal coordinates \(\phi :U \subset {\mathbb {R}}^d \rightarrow \mathcal{{U}}\) around p (in particular, \(\phi (0) = p\)), where d is the dimension of the manifold. For a point \(\phi (x_1,\dots ,x_d)\), by definition of normal coordinates, the radial geodesic from p is \(c(t) = \phi (tx_1,\dots , tx_d)\). Our vector field of interest is defined by \(V(p) = v\) and the fact that it is parallel along every radial geodesic c as described.

For a choice of point \(\phi (x)\) and corresponding geodesic, let

$$\begin{aligned} V(c(t)) = \sum _{k=1}^d v_k(t) \partial _k(c(t)) \end{aligned}$$

for some coordinate functions \(v_1, \ldots , v_d\), where \(\partial _k\) is the kth coordinate vector field. These coordinate functions satisfy the following ordinary differential equations (ODE) [22, Prop. 2.6, eq. (2)]:

$$\begin{aligned} 0&= \frac{dv_k(t)}{dt} + \sum _{i,j}\Gamma ^k_{ij}(tx_1,\dots , tx_d) v_j(t) x_i,&k = 1,\dots , d, \end{aligned}$$

where \(\Gamma \) denotes Christoffel symbols. Expand V(p) into the coordinate vector fields: \(v = \sum _{k=1}^d w_k \partial _k(p)\). Then, the initial conditions are \(v_k(0) = w_k\) for each k. Because these ODEs are smooth, solutions \(v_k(t; w)\) exist, and they are smooth in both t and the initial conditions w [42, Thm. D.6]. But this is not enough for our purpose.

Crucially, we wish to show smoothness also in the choice of \(x \in U\). To this end, following a classical trick, we extend the set of equations to let x be part of the variables, as follows:

$$\begin{aligned} {\left\{ \begin{array}{ll} 0 = \frac{dv_k(t)}{dt} + \sum _{i,j}\Gamma ^k_{ij}(tu_1(t),\ldots , tu_d(t)) v_j(t) u_i(t), &{} k=1,\ldots , d,\\ 0 = \frac{du_k(t)}{dt}, &{} k=1, \ldots , d. \end{array}\right. } \end{aligned}$$

The extended initial conditions are:

$$\begin{aligned} v_k(0)&= w_k,&u_k(0)&= x_k,&k&= 1, \ldots , d. \end{aligned}$$

Clearly, the functions \(u_k(t)\) are constant: \(u_k(t) = x_k\). These ODEs are still smooth, hence solutions \(v_k(t; w, x)\) still exist and are identical to those of the previous set of ODEs, except we now see they are also smooth in the choice of x. Specifically, for every \(x \in U\),

$$\begin{aligned} V(\phi (x))&= V(c(1)) = \sum _{k=1}^{d} v_k(1; w, x) \partial _k(\phi (x)), \end{aligned}$$

and each \(v_k(1; w, x)\) depends smoothly on x. Hence, V is smooth on \(\mathcal{{U}} = \phi (U)\).

\(\square \)

Lemma A.2

Given a Riemannian manifold \(\mathcal{{M}}\), a continuously differentiable function \(f :\mathcal {M} \rightarrow {\mathbb {R}}\), and a point \(p\in \mathcal{{M}}\), if \(p_0, p_1, p_2, \ldots \) is a sequence of points in a normal neighborhood of p converging to p, then the following holds:

$$\begin{aligned} \lim _{k\rightarrow \infty } \left\| \mathcal{{P}}_{p_k\rightarrow p}{\text {grad}}\, f(p_k) - {\text {grad}}\,f(p)\right\| _p = 0, \end{aligned}$$

where \(\mathcal{{P}}_{p_k\rightarrow p}\) is the parallel transport from \(\mathrm {T}_{p_k}\mathcal {M}\) to \(\mathrm {T}_p\mathcal {M}\) along the minimizing geodesic.

Proof of Lemma A.2

As parallel transport is an isometry, it is equivalent to show

$$\begin{aligned} \lim _{k\rightarrow \infty } \left\| {\text {grad}}\, f(p_k) - \mathcal{{P}}_{p\rightarrow p_k} {\text {grad}}\, f(p)\right\| _{p_k} = 0. \end{aligned}$$
(30)

Under our assumptions, \({\text {grad}}\, f\) is a continuous vector field. Furthermore, by Lemma A.1, in a normal neighborhood of p, the vector field \(V(y) = \mathcal{{P}}_{p\rightarrow y} {\text {grad}}\, f(p)\) is continuous as well. Hence, \({\text {grad}}\, f - V\) is a continuous vector field around p; since \({\text {grad}}\, f(p) - V(p) = 0\), we get \(\lim _{k \rightarrow \infty } {\text {grad}}\, f(p_k) - V(p_k) = {\text {grad}}\, f(p) - V(p) = 0\), which proves the claim. \(\square \)
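Both lemmas are easy to visualize numerically on the unit sphere, where parallel transport along minimizing geodesics has a closed form. The following sketch is purely illustrative: the choice of manifold, the closed-form transport, and the test function \(f(x) = \langle c, x\rangle \) are ours, not constructions from the paper.

```python
# Numerical illustration of Lemmas A.1 and A.2 on the unit sphere S^{n-1}.
# The sphere, the test function f(x) = <c, x>, and all tolerances are
# illustrative choices, not objects taken from the paper.
import numpy as np

def transport(p, q, v):
    """Parallel transport of v in T_p S^{n-1} to T_q S^{n-1} along the
    minimizing geodesic from p to q (assumes p and q are not antipodal)."""
    w = q - np.dot(p, q) * p                    # direction of the geodesic at p
    nw = np.linalg.norm(w)
    if nw < 1e-16:                              # q == p: nothing to transport
        return v.copy()
    u = w / nw                                  # unit tangent at p pointing toward q
    theta = np.arccos(np.clip(np.dot(p, q), -1.0, 1.0))
    a = np.dot(u, v)
    # the component along u rotates in the plane span{p, u}; the rest is unchanged
    return v + a * ((np.cos(theta) - 1.0) * u - np.sin(theta) * p)

rng = np.random.default_rng(0)
n = 5
p = rng.standard_normal(n); p /= np.linalg.norm(p)
v = rng.standard_normal(n); v -= np.dot(p, v) * p        # v in T_p S^{n-1}

# Lemma A.1: V(q) = P_{p->q} v varies smoothly with q; transport is an isometry.
q1 = rng.standard_normal(n); q1 /= np.linalg.norm(q1)
q2 = q1 + 1e-6 * rng.standard_normal(n); q2 /= np.linalg.norm(q2)
print(abs(np.linalg.norm(transport(p, q1, v)) - np.linalg.norm(v)))   # ~ 0
print(np.linalg.norm(transport(p, q1, v) - transport(p, q2, v)))      # ~ 1e-6

# Lemma A.2: for f(x) = <c, x>, with Riemannian gradient c - <c, x> x,
# P_{p_k -> p} grad f(p_k) -> grad f(p) as p_k -> p.
c = rng.standard_normal(n)
grad_f = lambda x: c - np.dot(c, x) * x
for t in [1e-1, 1e-2, 1e-3, 1e-4]:
    pk = p + t * v; pk /= np.linalg.norm(pk)
    print(t, np.linalg.norm(transport(pk, p, grad_f(pk)) - grad_f(p)))  # -> 0
```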

Proof of Proposition 3.2

Restrict to a convergent subsequence if needed, so that \(\lim _{k\rightarrow \infty }x_k = {\overline{x}}\). Further exclude a (finite) number of \(x_k\)’s so that all the remaining points are in a neighborhood of \({\overline{x}}\) where the exponential map is a diffeomorphism. In this proof, let \(\mathcal{{A}}\) denote \(\mathcal{{A}}({\overline{x}})\) for ease of notation: this is the set of active constraints at the limit point. Then, there exist constants \(c, k_1\) such that \(g_i(x_k)< c < 0\) for all \(k>k_1\) with \(i \in \mathcal{{I}} \setminus \mathcal{{A}}\).

When \(\{\rho _k\}\) is unbounded, since multipliers are bounded, there exists \(k_2>k_1\) such that \(\lambda _i^k+\rho _kg_i(x_{k+1}) < 0\) for all \(k\ge k_2\), \(i\in \mathcal{{I}}\setminus \mathcal{{A}}\). Thus, by definition, \(\lambda _i^{k+1} = 0\) for all \(k\ge k_2\), \(i\in \mathcal{{I}}\setminus \mathcal{{A}}\).

When instead \(\{\rho _k\}\) is bounded, \(\lim _{k\rightarrow \infty } |\sigma _i^k| = 0\). Thus for \(i\in \mathcal{{I}}\setminus \mathcal{{A}}\), in view of \(g_i(x_k)< c<0\) for all \(k>k_1\), we have \(\lim _{k\rightarrow \infty } \frac{-\lambda _i^k}{\rho _k}= 0\). Then, for large enough k, \(\lambda _i^k +\rho _kg_i(x_{k+1})<0\) and thus there exists \(k_2>k_1\) such that \(\lambda _i^k = 0\) for all \(k\ge k_2\). So in either case, we can find such \(k_2\).

As LICQ is satisfied at \({\overline{x}}\), by continuity of the gradients of \(\{g_i\}\) and \(\{h_j\}\), there exists \(k_3>k_2\) such that the tangent vectors \(\{{\text {grad}}\, h_j(x_k)\}_{j\in \mathcal{{E}}}\cup \{{\text {grad}}\, g_i(x_k)\}_{i\in \mathcal{{I}}\cap \mathcal{{A}}}\) are linearly independent for all \(k>k_3\). Define

$$\begin{aligned} {\overline{\lambda }}_i^k&= \max \left\{ 0, \lambda _i^{k-1} + \rho _{k-1} g_i(x_{k})\right\} ,&\text { and }&{\overline{\gamma }}_j^k&= \gamma _j^{k-1}+\rho _{k-1} h_j(x_{k}) \end{aligned}$$

as the unclipped update. Define \(S_k := \max \{\Vert {\overline{\gamma }}^k\Vert _\infty ,\Vert {\overline{\lambda }}^k\Vert _\infty \}\). We treat separately the cases where \(\{S_k\}\) is bounded and where it is unbounded. If it is bounded, denote by \({\overline{\lambda }}\) and \({\overline{\gamma }}\) a limit point of \(({\overline{\lambda }}^k, {\overline{\gamma }}^k)\). Let

$$\begin{aligned} v = {\text {grad}}\, f({\overline{x}}) + \sum _{j\in \mathcal{{E}}} {\overline{\gamma }}_j{\text {grad}}\, h_j({\overline{x}}) + \sum _{i\in \mathcal{{I}}\cap \mathcal{{A}}}{\overline{\lambda }}_i {\text {grad}}\, g_i({\overline{x}}). \end{aligned}$$

In order to prove that v is zero, we compare it to a similar vector defined at \(x_k\), for all large k, and consider the limit \(k \rightarrow \infty \). Unlike the Euclidean case in the proof in [5], we cannot directly compare tangent vectors in the tangent spaces at \(x_k\) and \({\overline{x}}\): we use parallel transport to bring all tangent vectors to the tangent space at \({\overline{x}}\):

$$\begin{aligned} \Vert v\Vert\le & {} \left\| {\text {grad}}\, f({\overline{x}}) - \mathcal{{P}}_{x_k\rightarrow {\overline{x}}} {\text {grad}}\, f(x_k) {+}\!\! \sum _{j\in \mathcal{{E}}} {\overline{\gamma }}_j\left( {\text {grad}}\, h_j({\overline{x}})-\mathcal{{P}}_{x_k\rightarrow {\overline{x}}} {\text {grad}}\,h_j(x_k)\right) \right. \\&+ \left. \sum _{i\in \mathcal{{I}}\cap \mathcal{{A}}}{\overline{\lambda }}_i \left( {\text {grad}}\, g_i({\overline{x}}) - \mathcal{{P}}_{x_k\rightarrow {\overline{x}}} {\text {grad}}\, g_i(x_k)\right) \right\| \\&+ \left\| \mathcal{{P}}_{x_k\rightarrow {\overline{x}}} {\text {grad}}\, f(x_k) + \sum _{j\in \mathcal{{E}}} {\overline{\gamma }}_j\mathcal{{P}}_{x_k\rightarrow {\overline{x}}} {\text {grad}}\, h_j(x_k)\right. \\&\left. + \sum _{i\in \mathcal{{I}}\cap \mathcal{{A}}}{\overline{\lambda }}_i \mathcal{{P}}_{x_k\rightarrow {\overline{x}}} {\text {grad}}\, g_i(x_k)\right\| . \end{aligned}$$

By Lemma A.2, the first term vanishes in the limit \(k \rightarrow \infty \) since \(x_k \rightarrow {\overline{x}}\). We can understand the second term using isometry of parallel transport and linearity:

$$\begin{aligned}&\left\| \mathcal{{P}}_{x_k\rightarrow {\overline{x}}} {\text {grad}}\, f(x_k) + \sum _{j\in \mathcal{{E}}} {\overline{\gamma }}_j\mathcal{{P}}_{x_k\rightarrow {\overline{x}}} {\text {grad}}\, h_j(x_k) + \sum _{i\in \mathcal{{I}}\cap \mathcal{{A}}}{\overline{\lambda }}_i \mathcal{{P}}_{x_k\rightarrow {\overline{x}}} {\text {grad}}\, g_i(x_k)\right\| _{{\overline{x}}}\\&\quad = \left\| {\text {grad}}\, f(x_k) + \sum _{j\in \mathcal{{E}}} {\overline{\gamma }}_j{\text {grad}}\, h_j(x_k) + \sum _{i\in \mathcal{{I}}\cap \mathcal{{A}}}{\overline{\lambda }}_i {\text {grad}}\, g_i(x_k)\right\| _{x_k}\\&\quad \le \left\| \sum _{j\in \mathcal{{E}}} ({\overline{\gamma }}_j - {\overline{\gamma }}^k_j){\text {grad}}\, h_j(x_k) + \sum _{i\in \mathcal{{I}}\cap \mathcal{{A}}}({\overline{\lambda }}_i-{\overline{\lambda }}_i^k) {\text {grad}}\, g_i(x_k)\right\| _{x_k} \\&\qquad + \left\| {\text {grad}}\, f(x_k) + \sum _{j\in \mathcal{{E}}} {\overline{\gamma }}_j^k{\text {grad}}\, h_j(x_k) + \sum _{i\in \mathcal{{I}}}{\overline{\lambda }}_i^k {\text {grad}}\, g_i(x_k)\right\| _{x_k}\\&\qquad + \left\| \sum _{i\in \mathcal{{I}}\setminus \mathcal{{A}}} {\overline{\lambda }}_i^k {\text {grad}}\, g_i(x_k)\right\| _{x_k}. \end{aligned}$$

Here, the second term vanishes in the limit because it is upper bounded by \(\epsilon _{k}\) (by assumption) and we let \(\lim _{k\rightarrow \infty } \epsilon _k = 0\); the last term vanishes in the limit because of the discussion in the second paragraph; and the first term attains arbitrarily small values for large k as norms of gradients are bounded in a neighbourhood of \({\overline{x}}\) and by definition of \({\overline{\lambda }}\) and \({\overline{\gamma }}\). Since v is independent of k, we conclude that \(\Vert v\Vert = 0\). Therefore, \({\overline{x}}\) satisfies KKT conditions.

On the other hand, if \(\{S_k\}\) is unbounded, then for \(k\ge k_3\), we have

$$\begin{aligned} \left\| \frac{1}{S_k}{\text {grad}}\, f({\overline{x}}) + \sum _{j\in \mathcal{{E}}} \frac{{\overline{\gamma }}_j}{S_k}{\text {grad}}\, h_j({\overline{x}}) + \sum _{i\in \mathcal{{I}}}\frac{{\overline{\lambda }}_i}{S_k} {\text {grad}}\, g_i({\overline{x}})\right\| \le \frac{\epsilon _k}{S_k}. \end{aligned}$$

All the coefficients on the left-hand side are bounded in \([-1,1]\) and, by definition of \(S_k\), the coefficient vector has a nonzero limit point: denote it by \({\overline{\lambda }}\) and \({\overline{\gamma }}\). By an argument similar to the one above, taking the limit in k, we obtain

$$\begin{aligned} \left\| \sum _{j\in \mathcal{{E}}} {\overline{\gamma }}_j{\text {grad}}\, h_j({\overline{x}}) + \sum _{i\in \mathcal{{I}}\cap \mathcal{{A}}}{\overline{\lambda }}_i {\text {grad}}\, g_i({\overline{x}})\right\| = 0, \end{aligned}$$

which contradicts the LICQ condition at \({\overline{x}}\). Hence, \(\{S_k\}\) cannot be unbounded, and we are left with the bounded case, for which we already showed that \({\overline{x}}\) satisfies the KKT conditions. \(\square \)
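For concreteness, the following is a minimal sketch of the kind of safeguarded augmented Lagrangian outer loop to which Proposition 3.2 applies, specialized to the unit sphere and a toy non-negative-PCA-style instance. The inner subproblem is solved only crudely by a fixed number of retracted gradient steps, and all parameter values are illustrative stand-ins; this is not the authors' implementation (their code is linked in Footnote 2).

```python
# Sketch of a safeguarded Riemannian augmented Lagrangian loop on the sphere,
# for the toy problem  min -x'Ax  s.t.  ||x|| = 1,  x >= 0  (non-negative PCA
# style).  All parameters and the crude inner solver are illustrative choices.
import numpy as np

rng = np.random.default_rng(1)
n = 20
B = rng.standard_normal((n, n))
A = (B @ B.T) / n                                        # toy covariance matrix

f     = lambda x: -x @ A @ x
egrad = lambda x: -2.0 * A @ x                           # Euclidean gradient of f
g     = lambda x: -x                                     # inequality constraints g_i(x) = -x_i <= 0
proj  = lambda x, u: u - (x @ u) * x                     # projection onto T_x S^{n-1}
retr  = lambda x, u: (x + u) / np.linalg.norm(x + u)     # retraction by normalization

def alm_egrad(x, lam, rho):
    # Euclidean gradient of L_rho(x, lam) = f(x) + (rho/2) sum_i max(0, lam_i/rho + g_i(x))^2.
    w = np.maximum(0.0, lam + rho * g(x))                # clipped multiplier estimates
    return egrad(x) - w                                  # since grad g_i(x) = -e_i

def inner_solve(x, lam, rho, steps=300):
    # Crude inner solver: fixed-step retracted (Riemannian) gradient descent.
    # The step size shrinks with rho to keep the iteration stable.
    alpha = 1e-2 / (1.0 + rho)
    for _ in range(steps):
        x = retr(x, -alpha * proj(x, alm_egrad(x, lam, rho)))
    return x

# Outer loop: clipped multiplier update lam <- clip(lam + rho*g(x), 0, lam_max),
# with rho increased when the constraint violation does not shrink enough.
x = np.abs(rng.standard_normal(n)); x /= np.linalg.norm(x)
lam, rho = np.zeros(n), 1.0
lam_max, theta, tau = 100.0, 0.25, 2.0
viol_old = np.inf
for k in range(30):
    x = inner_solve(x, lam, rho)
    viol = np.linalg.norm(np.maximum(0.0, g(x)))
    lam = np.clip(lam + rho * g(x), 0.0, lam_max)        # clipped (safeguarded) update
    if viol > theta * viol_old:
        rho *= tau
    viol_old = viol
print("f(x) =", f(x), " max violation =", np.max(np.maximum(0.0, g(x))))
```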

B Proof of Proposition 3.4

Proof

The proof is adapted from Section 3 in [7]. Define \({\overline{\gamma }}_j^k = \gamma _j^{k-1}+\rho _{k-1} h_j(x_k)\). By Proposition 3.2, \({\overline{x}}\) is a KKT point and by taking a subsequence of \(\{x_k\}\) if needed, \({\overline{\gamma }}^k\) is bounded and converges to \({\overline{\gamma }}\).

For any tangent vector \(d\in \mathcal{{C}}^W({\overline{x}})\), we have \(\langle d, {\text {grad}}\, h_j({\overline{x}})\rangle = 0\) for all \(j\in \mathcal{{E}}\). Let \(m = |\mathcal{{E}}|\) and let \(n\ge m\) be the dimension of \(\mathcal{{M}}\). Let \(\varphi \) be a chart such that \(\varphi ({\overline{x}}) = 0\). From [42, Prop. 8.1], the component functions of \(h_j\) with respect to this chart are smooth. Let \(\partial _1,\dots ,\partial _{n}\) be the coordinate vector fields of this chart, and write \(d = (d_1\partial _1,\dots , d_n\partial _n)\). Define \(\mathcal{{F}}:{\mathbb {R}}^{n+m} \rightarrow {\mathbb {R}}^m\), for \(x\in {\mathbb {R}}^n\), \(y\in {\mathbb {R}}^m\) and \(j\in \{1,\dots , m\}\), by

$$\begin{aligned} \mathcal{{F}}_j(x,y) = \langle (y_1\partial _1, \dots , y_m\partial _m, d_{m+1}\partial _{m+1}, \dots , d_n\partial _n), {\text {grad}}\, h_j(\varphi ^{-1}(x))\rangle _{\varphi ^{-1}(x)}. \end{aligned}$$

If we denote by \(h_{j}^l\) the lth coordinate of \({\text {grad}}\, h_j\) in this coordinate system, and by \(G_x\) the Gram matrix of the metric, with entries \((G_x)_{p,q} = \langle \partial _p,\partial _q\rangle _x\), then the above expression can be written as

$$\begin{aligned} \mathcal{{F}}_j(x,y) = [y_1, \dots , y_m, d_{m+1}, \dots , d_n] G_{\varphi ^{-1}(x)} [h_j^1(\varphi ^{-1}(x)), \dots h_j^n(\varphi ^{-1}(x))]^T. \end{aligned}$$

and, with the abuse of notation that \([1\cdots m]\) denotes extraction of the first m columns, we have

$$\begin{aligned} \frac{\partial \mathcal{{F}}_j}{\partial y} = \left( [h_j^1(\varphi ^{-1}(x)), \dots h_j^n(\varphi ^{-1}(x))]G_{\varphi ^{-1}(x)}\right) _{[1\cdots m]}. \end{aligned}$$

Notice that the matrix \([h^1(\varphi ^{-1}({\overline{x}})), \dots , h^n(\varphi ^{-1}({\overline{x}}))]\) has full row rank m (by LICQ). As \(G_{\varphi ^{-1}({\overline{x}})}\) is invertible, \(\frac{\partial \mathcal{{F}}}{\partial y}({\overline{x}})\) must be invertible (reindexing the columns of this \(m\times n\) matrix from the top of the proof, if needed, so that its first m columns form a full-rank matrix). Then, by the implicit function theorem, there are a small neighbourhood U of \(\varphi ^{-1}({\overline{x}})\) and a continuously differentiable function \(g:U\rightarrow {\mathbb {R}}^m\) such that \(g(\varphi ^{-1}({\overline{x}})) = [d_1, \dots , d_m]\) and

$$\begin{aligned} \mathcal{{F}}(x, g(\varphi ^{-1}(x))) = 0. \end{aligned}$$

For each x locally around \({\overline{x}}\), let

$$\begin{aligned} d_x = [g(\varphi ^{-1}(x))_1\partial _1, \dots , g(\varphi ^{-1}(x))_m\partial _m, d_{m+1}\partial _{m+1}, \dots , d_n\partial _n]\in \mathrm {T}_{x}\mathcal{{M}}. \end{aligned}$$

These vectors then form a smooth vector field such that \(\langle d_x, {\text {grad}}\, h_j(x)\rangle = 0\) for all \(j\in \mathcal{{E}}\), with \(d = d_{{\overline{x}}}\). Then we have

$$\begin{aligned}&\mathrm {Hess}_x \mathcal{{L}}_{\rho _{k-1}}(x_k,\gamma ^{k-1})(d_{x_k},d_{x_k}) \\&\quad = \langle d_{x_k}, \mathrm {Hess} f(x_k) d_{x_k}\rangle + \rho _{k-1} \sum _{j\in \mathcal{{E}}} \langle d_{x_k}, \nabla _{d_{x}}(h_j(x)+\frac{\gamma _j^{k-1}}{\rho _{k-1}}){\text {grad}}\, h_j(x)\rangle _{x_k} \\&\quad = \langle d_{x_k}, \mathrm {Hess} f(x_k) d_{x_k}\rangle + \rho _{k-1}\sum _{j\in \mathcal{{E}}} d_{x}[h_j(x)+\frac{\gamma _j^{k-1}}{\rho _{k-1}}](x_k)\langle d_{x}, {\text {grad}}\, h_j(x)\rangle _{x_k} \\&\quad \quad + \sum _{j\in \mathcal{{E}}} (\rho _{k-1} h_j(x_k)+\gamma _j^{k-1})\langle d_{x}, \nabla _{d_x}{\text {grad}}\, h_j(x)\rangle _{x_k}\\&\quad = \langle d_{x_k}, \mathrm {Hess} f(x_k) d_{x_k}\rangle + \sum _{j\in \mathcal{{E}}} (\rho _{k-1} h_j(x_k)+\gamma _j^{k-1})\langle d_{x}, \nabla _{d_{x}}{\text {grad}}\, h_j(x)\rangle _{x_k}\\&\quad = \langle d_{x_k}, \nabla _{d_{x_k}} {\text {grad}}\, f(x_k)\rangle + \sum _{j\in \mathcal{{E}}} {\overline{\gamma }}_j^{k}\langle d_{x}, \nabla _{d_{x}}{\text {grad}}\, h_j(x)\rangle _{x_k} \end{aligned}$$

where the second equality is by definition of the connection; the third by orthogonality of \(d_x\) with \(\{{\text {grad}}\, h_j\}\); and the fourth from the definition of the Hessian and of \({\overline{\gamma }}^k\). Therefore we have

$$\begin{aligned} \langle d_{x}, \nabla _{d_{x}} {\text {grad}}\, f(x)\rangle _{x_k} + \sum _{j\in \mathcal{{E}}} {\overline{\gamma }}_j^k\langle d_{x}, \nabla _{d_{x}}{\text {grad}}\, h_j(x)\rangle _{x_k} \ge -\epsilon _k \Vert d_{x_k}\Vert ^2 \end{aligned}$$

Since the connection maps two continuously differentiable vector fields to a continuous vector field, we can take a limit and state:

$$\begin{aligned} \langle d, \nabla _{d} {\text {grad}}\, f({\overline{x}})\rangle + \sum _{j\in \mathcal{{E}}} {\overline{\gamma }}_j\langle d, \nabla _{d}{\text {grad}}\, h_j({\overline{x}})\rangle \ge 0 \end{aligned}$$

which is just \(\mathrm {Hess} \mathcal{{L}}({\overline{x}}, {\overline{\gamma }})(d,d) \ge 0\). \(\square \)

C Proof of Proposition 4.1

In the proof below, we use the following notation:

$$\begin{aligned} v \in F'(x^*,\lambda ^*, \gamma ^*) \Leftrightarrow {\left\{ \begin{array}{ll} v\in \mathrm {T}_{x^*}\mathcal{{M}}, &{}\\ \langle {\text {grad}}\, h_j(x^*), v\rangle = 0&{} \text { for all }j\in \mathcal{{E}},\text { and} \\ \langle {\text {grad}}\, g_i(x^*), v\rangle \le 0&{} \text { for all }i\in \mathcal{{A}}(x^*)\cap \mathcal{{I}}.\\ \end{array}\right. } \end{aligned}$$
(31)

Proof

Consider the function Q, defined by:

$$\begin{aligned} Q(x, \rho ) = f(x) + \rho \left( \sum _{i\in \mathcal{{I}}}\text {max}\{0, g_i(x)\} + \sum _{j\in \mathcal{{E}}}|h_j(x)|\right) . \end{aligned}$$

In a small enough neighbourhood of \(x^*\), terms for inactive constraints disappear and Q is just:

$$\begin{aligned} Q(x, \rho ) = f(x) + \rho \left( \sum _{i\in \mathcal{{A}}(x^*)\cap \mathcal{{I}}}\text {max}\{0, g_i(x)\} + \sum _{j\in \mathcal{{E}}}|h_j(x)|\right) . \end{aligned}$$

Although Q is nonsmooth, it is easy to verify that it admits directional derivatives in all directions:

$$\begin{aligned} \lim _{\tau \rightarrow 0 }\frac{\text {max}\{0,g_i(\mathrm {Exp}_{x^*}(\tau d))\}-\text {max}\{0,g_i(x^*)\}}{\tau }= & {} \lim _{\tau \rightarrow 0 }\frac{\text {max}\{0,g_i(\mathrm {Exp}_{x^*}(\tau d))\}}{\tau } \end{aligned}$$

and since \(g_i\circ \mathrm {Exp}_{x^*}\) is sufficiently smooth, distinguishing the sign of \(\frac{d}{d\tau }(g_i\circ \mathrm {Exp}_{x^*})(\tau d)\) at \(\tau = 0\), the right-hand side equals \(\max \{0, \frac{d}{d\tau }(g_i\circ \mathrm {Exp}_{x^*})(\tau d)\} = \max \{0,\langle {\text {grad}}\, g_i(x^*), d\rangle \}\). Similarly, we have

$$\begin{aligned} \lim _{\tau \rightarrow 0 }\frac{|h_j(\mathrm {Exp}_{x^*}(\tau d))|-|h_j(x^*)|}{\tau } = \left| \frac{d}{d\tau }(h_j\circ \mathrm {Exp}_{x^*})(\tau d)\right| = |\langle {\text {grad}}\,h_j(x^*), d\rangle |. \end{aligned}$$

Hence, the directional derivative along direction d, \(Q(x^*,\rho ; d)\), is well defined:

$$\begin{aligned} Q(x^*,\rho ;d)= & {} \langle {\text {grad}}\,f(x^*), d\rangle \nonumber \\&+ \rho \left( \sum _{i\in \mathcal{{A}}(x^*)\cap \mathcal{{I}}}\max \{0,\langle {\text {grad}}\,g_i(x^*), d\rangle \} + \sum _{j\in \mathcal{{E}}}|\langle {\text {grad}}\,h_j(x^*), d\rangle |\right) .\nonumber \\ \end{aligned}$$
(32)

As \(x^*\) is a KKT point,

$$\begin{aligned} {\text {grad}}\,f(x^*) + \sum _{i\in \mathcal{{A}}(x^*)\cap \mathcal{{I}}} \lambda _i^* {\text {grad}}\,g_i(x^*) + \sum _{j\in \mathcal{{E}}} \gamma _j^* {\text {grad}}\,h_j(x^*) = 0. \end{aligned}$$

Thus,

$$\begin{aligned} 0= & {} \langle {\text {grad}}\,f(x^*),d\rangle + \sum _{i\in \mathcal{{A}}(x^*)\cap \mathcal{{I}}} \lambda _i^* \langle {\text {grad}}\,g_i(x^*),d\rangle + \sum _{j\in \mathcal{{E}}} \gamma _j^* \langle {\text {grad}}\,h_j(x^*),d\rangle \\\le & {} \langle {\text {grad}}\,f(x^*),d\rangle + \sum _{i\in \mathcal{{A}}(x^*)\cap \mathcal{{I}}} \lambda _i^* \max \{0, \langle {\text {grad}}\,g_i(x^*),d\rangle \}\\&+ \sum _{j\in \mathcal{{E}}} \gamma _j^* |\langle {\text {grad}}\,h_j (x^*),d\rangle |. \end{aligned}$$

Combining with equation (32), we have

$$\begin{aligned} Q(x^*, \rho ;d)\ge & {} \sum _{i\in \mathcal{{A}}(x^*)\cap \mathcal{{I}}} (\rho -\lambda _i^*) \max \{0, \langle {\text {grad}}\,g_i(x^*),d\rangle \}\nonumber \\&+ \sum _{j\in \mathcal{{E}}} (\rho -\gamma _j^*) |\langle {\text {grad}}\,h_j(x^*),d\rangle |. \end{aligned}$$
(33)

For contradiction, suppose \(x^*\) is not a local minimum of Q. Then, there exists a sequence \(\{y_k\}_{k=1}^\infty \) with \(\lim _{k\rightarrow \infty } y_k = x^*\) such that \(Q(y_k, \rho ) < Q(x^*, \rho ) = f(x^*)\). Restricting to a small enough neighbourhood, \(\eta _k = \mathrm {Exp}^{-1}_{x^*}(y_k)\) is well defined. Considering only a subsequence if needed, we have \(\lim _{k\rightarrow \infty } \frac{\eta _k}{\Vert \eta _k\Vert } = {\bar{\eta }}\). It is easy to see that \(Q(\mathrm {Exp}_{x^*}(\cdot ),\rho )\) is locally Lipschitz continuous at \(0_{x^*}\), which gives

$$\begin{aligned} Q(\mathrm {Exp}_{x^*}(\Vert \eta _k\Vert {\bar{\eta }}), \rho ) = Q(\mathrm {Exp}_{x^*}(\eta _k), \rho ) + o(\Vert \eta _k\Vert ) = Q(y_k, \rho ) + o(\Vert \eta _k\Vert ). \end{aligned}$$

Subtract \(Q(x^*,\rho )\) and take the limit:

$$\begin{aligned}&\lim _{k\rightarrow \infty } \frac{Q(\mathrm {Exp}_{x^*}(\Vert \eta _k\Vert {\bar{\eta }}), \rho ) - Q(x^*,\rho )}{\Vert \eta _k\Vert } =\lim _{k\rightarrow \infty } \frac{Q(y_k, \rho ) - Q(x^*,\rho )}{\Vert \eta _k\Vert } \\&\quad + \lim _{k\rightarrow \infty } \frac{o(\Vert \eta _k\Vert )}{\Vert \eta _k\Vert }\le 0. \end{aligned}$$

Notice that the left-most expression is just \(Q(x^*, \rho ; {\bar{\eta }})\). Since the coefficients on the right-hand side of (33) are strictly positive, we must have \(\langle {\text {grad}}\,g_i(x^*), {\bar{\eta }}\rangle \le 0\) and \(\langle {\text {grad}}\,h_j(x^*), {\bar{\eta }}\rangle = 0\). Since the exponential map is a second-order retraction, we have a Taylor expansion for f,

$$\begin{aligned} f(y_k) = f(x^*) + \langle {\text {grad}}\,f(x^*), \eta _k\rangle + \frac{1}{2}\langle \eta _k, \mathrm {Hess} f(x^*)[\eta _k]\rangle + o(\Vert \eta _k\Vert ^2), \end{aligned}$$

and similarly for \(g_i\) and \(h_j\). Notice that

$$\begin{aligned} Q(y_k, \rho )= & {} f(y_k) + \rho \left( \sum _{i\in \mathcal{{A}}(x^*)\cap \mathcal{{I}}}\max \{0, g_i(y_k)\} + \sum _{j\in \mathcal{{E}}}|h_j(y_k)|\right) \\\ge & {} \left( f(y_k) + \sum _{i\in \mathcal{{A}}(x^*)\cap \mathcal{{I}}}\lambda _i^* g_i(y_k) + \sum _{j\in \mathcal{{E}}}\gamma _j^*h_j(y_k)\right) \\&+ \sum _{i\in \mathcal{{A}}(x^*)\cap \mathcal{{I}}}(\rho - \lambda _i^*)\max \{0, g_i(y_k)\} \\&+ \sum _{j\in \mathcal{{E}}}(\rho - \gamma _j^*)|h_j(y_k)|\\\ge & {} f(x^*) + \langle {\text {grad}}\,f(x^*) + \sum _{i\in \mathcal{{A}}(x^*)\cap \mathcal{{I}}} \lambda _i^* {\text {grad}}\,g_i(x^*) \\&+ \sum _{j\in \mathcal{{E}}} \gamma _j^* {\text {grad}}\,h_j(x^*), \eta _k\rangle \\&+ \frac{1}{2}\langle \eta _k, \mathrm {Hess} (f(x^*)+\sum _{i\in \mathcal{{A}}(x^*)\cap \mathcal{{I}}} \lambda _i^* g_i(x^*)\\&+\sum _{j\in \mathcal{{E}}}\gamma _{j}^* h_j(x^*))[\eta _k]\rangle + o(\Vert \eta _k\Vert ^2) + P(y_k)\\= & {} f(x^*) + 0 + \frac{1}{2}\langle \eta _k, \mathrm {Hess}(\mathcal{{L}}(x,\lambda ^*,\gamma ^*)(x ^*)[\eta _k])\rangle + o(\Vert \eta _k\Vert ^2) + P(y_k) \end{aligned}$$

where \(P(y_k) = \sum _{i\in \mathcal{{A}}(x^*)\cap \mathcal{{I}}}(\rho - \lambda _i^*)\max \{0, g_i(y_k)\} + \sum _{j\in \mathcal{{E}}}(\rho - \gamma _j^*)|h_j(y_k)|\). The second inequality follows from the quadratic approximation of \(f + \sum _{i\in \mathcal{{A}}(x^*)\cap \mathcal{{I}}} \lambda _{i} g_i+\sum _{j\in \mathcal{{E}}} \gamma _j h_j\) and bilinearity of the metric. The last equality comes from the definition of KKT points. Dividing through by \(\Vert \eta _k\Vert ^2\), we obtain

$$\begin{aligned} \lim _{k\rightarrow \infty } \frac{Q(y_k,\rho ) - f(x^*)}{\Vert \eta _k\Vert ^2}= & {} \lim _{k\rightarrow \infty }\frac{1}{2}\left\langle \frac{\eta _k}{\Vert \eta _k\Vert }, \mathrm {Hess}(\mathcal{{L}}(x,\lambda ^*,\gamma ^*)(x ^*)\left[ \frac{\eta _k}{\Vert \eta _k\Vert }\right] )\right\rangle \nonumber \\&+\, 0 + \lim _{k\rightarrow \infty }\frac{P(y_k)}{\Vert \eta _k\Vert ^2}. \end{aligned}$$
(34)

If \({\bar{\eta }}\in F'\), then as \(P(y_k)\ge 0\), the first term on the right-hand side is strictly positive, which contradicts \(Q(y_k,\rho ) < f(x^*)\) for all k. If \({\bar{\eta }}\in F-F'\), then there exists \(g_{i'}\) such that \(\langle {\text {grad}}\,g_{i'}(x^*), {\bar{\eta }}\rangle > 0\). Then,

$$\begin{aligned} g_{i'}(y_k) = g_{i'}(x^*) + \langle {\text {grad}}\,g_{i'}(x^*), \eta _k\rangle + o(\Vert \eta _k\Vert ) = \langle {\text {grad}}\,g_{i'}(x^*), \eta _k\rangle + o(\Vert \eta _k\Vert ). \end{aligned}$$

Hence, dividing the above expression by \(\Vert \eta _k\Vert \) gives

$$\begin{aligned} \lim _{k\rightarrow \infty } \frac{g_{i'}(y_k)}{\Vert \eta _k\Vert } \ge \lim _{k\rightarrow \infty } \left\langle {\text {grad}}\,g_{i'}(x^*), \frac{\eta _k}{\Vert \eta _k\Vert }\right\rangle + 0 = \langle {\text {grad}}\,g_{i'}(x^*), {\bar{\eta }}\rangle > 0. \end{aligned}$$

Notice that \(\frac{P(y_k)}{\Vert \eta _k\Vert ^2} \ge \frac{g_{i'}(y_k)}{\Vert \eta _k\Vert }\) for large enough k, so a contradiction is obtained by plugging this into (34). \(\square \)
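The directional derivative formula (32) is easy to check by finite differences along geodesics. Below is a small numerical sketch on the unit sphere; the particular constraints \(h(x) = x_1\) and \(g(x) = -x_3\), the point, and the cost are our own illustrative choices, unrelated to the paper's experiments.

```python
# Finite-difference check of the directional derivative (32) of the exact
# penalty Q on the unit sphere S^2, at a point where h(x) = x_1 = 0 and
# g(x) = -x_3 <= 0 are both active.  All choices below are illustrative.
import numpy as np

rho = 5.0
c = np.array([1.0, 2.0, 3.0])
f = lambda x: c @ x
h = lambda x: x[0]                      # equality constraint h(x) = 0
g = lambda x: -x[2]                     # inequality constraint g(x) <= 0
Q = lambda x: f(x) + rho * (max(0.0, g(x)) + abs(h(x)))

proj = lambda x, u: u - (x @ u) * x     # projection onto the tangent space at x

def Exp(x, v):                          # exponential map of the sphere
    nv = np.linalg.norm(v)
    return x if nv == 0.0 else np.cos(nv) * x + np.sin(nv) * (v / nv)

x = np.array([0.0, 1.0, 0.0])           # both constraints are active here
gf = proj(x, c)                          # Riemannian gradients at x ...
gh = proj(x, np.array([1.0, 0.0, 0.0]))
gg = proj(x, np.array([0.0, 0.0, -1.0]))

rng = np.random.default_rng(2)
d = proj(x, rng.standard_normal(3)); d /= np.linalg.norm(d)

rhs = gf @ d + rho * (max(0.0, gg @ d) + abs(gh @ d))    # right-hand side of (32)
t = 1e-6
fd = (Q(Exp(x, t * d)) - Q(x)) / t                       # finite difference along the geodesic
print(rhs, fd)   # the two values agree up to O(t)
```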

D Proof of Proposition 4.2

Proof

We give a proof for \(Q^{\mathrm {lse}}\); the proof for \(Q^{\mathrm {lqh}}\) is analogous. For each iteration k and for each \(i\in \mathcal{{I}}\) and \(j\in \mathcal{{E}}\), define the following coefficients:

$$\begin{aligned} \lambda _i^k&= \frac{e^{g_i(x_{k+1})/u_k}}{1+e^{g_i(x_{k+1})/u_k}},&\text { and }&\gamma ^k_j&= \frac{e^{h_j(x_{k+1})/u_k}-e^{-h_j(x_{k+1})/u_k}}{e^{h_j(x_{k+1})/u_k}+e^{-h_j(x_{k+1})/u_k}}. \end{aligned}$$

Then, a simple calculation shows that (under our assumptions, \(\rho _k = \rho _0\) for all k; we simply write \(\rho \)):

$$\begin{aligned} {\text {grad}}\, Q^{\text {lse}}(x_{k+1},\rho _k,u_k) =&{\text {grad}}\, f(x_{k+1}) + \rho \sum _{i\in \mathcal{{I}}} \lambda _i^k {\text {grad}}\, g_i(x_{k+1}) \\&+ \rho \sum _{j\in \mathcal{{E}}} \gamma ^k_j {\text {grad}}\,h_j(x_{k+1}). \end{aligned}$$

Notice that the multipliers are bounded: \(\gamma ^k_j \in [-1,1]\) and \(\lambda _i^k\in [0,1]\). Hence, as sequences indexed by k, they have limit points: we denote them by \({\overline{\gamma }}\in [-1,1]\) and \({\overline{\lambda }}\in [0,1]\). Furthermore, since \({\overline{x}}\) is feasible, there exists \(k_1\) such that for any \(k>k_1\) and \(i\in \mathcal{{I}}\setminus \mathcal{{A}}({\overline{x}})\), \(g_i(x_k) < c\) for some constant \(c<0\). Then, as \(u_k\rightarrow 0\), by definition, \(\lambda ^k_i\) goes to 0 for \(i\in \mathcal{{I}}\setminus \mathcal{{A}}({\overline{x}})\). This shows \({\overline{\lambda }}_i = 0\) for \(i\in \mathcal{{I}}\setminus \mathcal{{A}}({\overline{x}})\). Considering a convergent subsequence if needed, there exists \(k_2>k_1\) such that, for all \(k>k_2\), \(\text {dist}(x_k,{\overline{x}}) < \mathrm {inj}({\overline{x}})\), the injectivity radius at \({\overline{x}}\). Thus, parallel transport from each \(x_k\) to \({\overline{x}}\) is well defined. Consider

$$\begin{aligned} v = {\text {grad}}\,f({\overline{x}}) + \rho \sum _{i\in \mathcal{{I}}\cap \mathcal{{A}}({\overline{x}})}{\overline{\lambda }}_i {\text {grad}}\,g_i({\overline{x}}) + \rho \sum _{j\in \mathcal{{E}}} {\overline{\gamma }}_j{\text {grad}}\,h_j({\overline{x}}). \end{aligned}$$

Notice that its coefficients are bounded, so we can show \(\Vert v\Vert = 0\) as in the proof of Proposition 3.2. \(\square \)
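To see where the coefficients \(\lambda _i^k\) and \(\gamma ^k_j\) come from, note that they are exactly the derivatives of the smoothings \(u\log (1+e^{g/u})\) of \(\max \{0,g\}\) and \(u\log (e^{h/u}+e^{-h/u})\) of |h| (this is our reading of the coefficients above; the precise definition of \(Q^{\mathrm {lse}}\) is given in the main text). A small numerical sketch:

```python
# Sanity check: the weights lambda = sigma(g/u) and gamma = tanh(h/u) used above
# are the derivatives of the smoothed terms  u*log(1 + exp(g/u))  and
# u*log(exp(h/u) + exp(-h/u)),  assuming these log-sum-exp smoothings (our
# reading of the coefficients; the exact definition of Q^lse is in the paper).
import numpy as np

u = 0.05
smooth_max = lambda g: u * np.log1p(np.exp(g / u))                   # smooths max{0, g}
smooth_abs = lambda h: u * np.log(np.exp(h / u) + np.exp(-h / u))    # smooths |h|

lam   = lambda g: np.exp(g / u) / (1.0 + np.exp(g / u))              # d/dg of smooth_max
gamma = lambda h: np.tanh(h / u)   # equals (e^{h/u}-e^{-h/u})/(e^{h/u}+e^{-h/u})

eps = 1e-6
for gv in [-0.3, -0.01, 0.0, 0.02, 0.4]:
    fd = (smooth_max(gv + eps) - smooth_max(gv - eps)) / (2 * eps)
    print(gv, fd, lam(gv))        # finite difference matches the sigmoid weight
for hv in [-0.2, 0.0, 0.1]:
    fd = (smooth_abs(hv + eps) - smooth_abs(hv - eps)) / (2 * eps)
    print(hv, fd, gamma(hv))      # finite difference matches the tanh weight

# As u -> 0 these weights sharpen toward 1{g > 0} and sign(h), so the gradient
# of the smoothed penalty approaches (a subgradient of) the exact penalty's.
```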


Cite this article

Liu, C., Boumal, N. Simple Algorithms for Optimization on Riemannian Manifolds with Constraints. Appl Math Optim 82, 949–981 (2020). https://doi.org/10.1007/s00245-019-09564-3

