Newton acceleration on manifolds identified by proximal gradient methods

Bareilles, Gilles; Iutzeler, Franck; Malick, Jérôme

doi:10.1007/s10107-022-01873-w

Newton acceleration on manifolds identified by proximal gradient methods

Full Length Paper
Series A
Published: 30 August 2022

Volume 200, pages 37–70, (2023)
Cite this article

Mathematical Programming Submit manuscript

884 Accesses
3 Citations
2 Altmetric
Explore all metrics

Abstract

Proximal methods are known to identify the underlying substructure of nonsmooth optimization problems. Even more, in many interesting situations, the output of a proximity operator comes with its structure at no additional cost, and convergence is improved once it matches the structure of a minimizer. However, it is impossible in general to know whether the current structure is final or not; such highly valuable information has to be exploited adaptively. To do so, we place ourselves in the case where a proximal gradient method can identify manifolds of differentiability of the nonsmooth objective. Leveraging this manifold identification, we show that Riemannian Newton-like methods can be intertwined with the proximal gradient steps to drastically boost the convergence. We prove the superlinear convergence of the algorithm when solving some nondegenerated nonsmooth nonconvex optimization problems. We provide numerical illustrations on optimization problems regularized by $\ell _1$-norm or trace-norm.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Riemannian proximal gradient methods

Article 09 March 2021

An inexact Riemannian proximal gradient method

Article 16 January 2023

A nonmonotone accelerated proximal gradient method with variable stepsize strategy for nonsmooth and nonconvex minimization problems

Article 05 March 2024

Notes

We choose the term “Newton acceleration” to emphasize the similarity with the celebrated Nesterov acceleration [34]. Indeed both methods add an acceleration step after the proximal gradient iteration. But, unlike Nesterov’s method where the acceleration is provided by an inertial step, the Newton acceleration comes from a second-order step on a smooth substructure, as we detail in this paper.
The weaker assumption of prox-boundedness (ie. $g+r\Vert \cdot \Vert ^2$ is bounded below for some r) implies that $ \mathbf {prox}_{\gamma g}(y) $ is non-empty when $\gamma $ is taken sufficiently small; see [36, Chap. 1.G].
The following example shows that in the nonconvex setting, ii) and iii) do not necessarily imply i). Take f null and g as follows, then the proximity operator of g at 0 writes:
$$\begin{aligned} g(x) = {\left\{ \begin{array}{ll} x^{2}/2 &{} \text {if } |x| \le 1\\ 1-3x/2 &{} \text {if } x \ge 1\\ 1+3x/2 &{} \text {if }x \le -1 \end{array}\right. }, \qquad \mathbf {prox}_{\gamma g}(0) = {\left\{ \begin{array}{ll} 0 &{} \text {if } \gamma \in (0, 8/9) \\ \{-3\gamma / 2, 0, 3\gamma / 2\} &{} \text {if } \gamma = 8/9 \\ \{-3\gamma / 2, 3\gamma / 2\} &{} \text {if } \gamma > 8/9. \end{array}\right. } \end{aligned}$$
The function g is 1-prox-regular at 0, there holds $0 \in {{\,\mathrm{ri}\,}}\partial g(0) = \{0\}$, and yet 0 is not a fixed point of the proximal operator with stepsizes close to 1.
Indeed, in many applications of Riemannian optimization, computing geodesics and the exponential map can be costly and then retractions provide an efficient alternative. For this reason, we consider here second-order retractions [2, 14].
Each CG iteration requires one Hessian-vector product, avoiding to form the Hessian matrix. A test on this ratio is used to detect a direction of quasi-negative curvature for the (Riemannian) Hessian, which is a stopping criterion of the Conjugate Gradient. In our implementation, we require this quantity to be smaller than $10^{-15}$ for the Newton method. For the truncated version, we reduce the threshold when getting close to the solution: initialized at 1, the threshold is decreased by a factor 10 each time the unit-step is accepted by the line search.

References

Absil, P.A., Baker, C.G., Gallivan, K.A.: Trust-region methods on riemannian manifolds. Found. Comput. Math. 7(3), 303–330 (2007)
Article MathSciNet MATH Google Scholar
Absil, P.A., Mahony, R., Sepulchre, R.: Optimization algorithms on matrix manifolds. Princeton University Press, NJ (2009)
MATH Google Scholar
Absil, P.A., Malick, J.: Projection-like retractions on matrix manifolds. SIAM J. Optim. 22, 135–158 (2012)
Article MathSciNet MATH Google Scholar
Agarwal, N., Boumal, N., Bullins, B., Cartis, C.: Adaptive regularization with cubics on manifolds. Mathematical Programming (2020)
Aravkin, A.Y., Baraldi, R., Orban, D.: A proximal quasi-newton trust-region method for nonsmooth regularized optimization. SIAM J. Optim. 32(2), 900–929 (2022)
Article MathSciNet MATH Google Scholar
Bach, F., Jenatton, R., Mairal, J., Obozinski, G.: Optimization with sparsity-inducing penalties. Foundations and Trends® in Machine Learning 4(1), 1–106 (2012)
Bareilles, G., Iutzeler, F.: On the interplay between acceleration and identification for the proximal gradient algorithm. Comput. Optim. Appl. 77, 351–378 (2020)
Article MathSciNet MATH Google Scholar
Beck, A.: First-order methods in optimization, vol. 25. SIAM (2017)
Beck, A., Teboulle, M.: A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imag. Sci. 2(1), 183–202 (2009)
Article MathSciNet MATH Google Scholar
Becker, S., Fadili, J., Ochs, P.: On quasi-newton forward-backward splitting: Proximal calculus and convergence. SIAM J. Optim. 29(4), 2445–2481 (2019)
Article MathSciNet MATH Google Scholar
Bezanson, J., Edelman, A., Karpinski, S., Shah, V.B.: Julia: A fresh approach to numerical computing. SIAM Rev. 59(1), 65–98 (2017)
Article MathSciNet MATH Google Scholar
Bolte, J., Nguyen, T.P., Peypouquet, J., Suter, B.W.: From error bounds to the complexity of first-order descent methods for convex functions. Mathematical Programming (2015)
Bonnans, J.F., Gilbert, J.C., Lemaréchal, C., Sagastizábal, C.A.: Numerical optimization: theoretical and practical aspects. Springer Science & Business Media, Berlin (2006)
MATH Google Scholar
Boumal, N.: An introduction to optimization on smooth manifolds. To appear with Cambridge University Press (2022). URL http://www.nicolasboumal.net/book
Briant, O., Lemaréchal, C., Meurdesoif, P., Michel, S., Perrot, N., Vanderbeck, F.: Comparison of bundle and classical column generation. Math. Program. 113(2), 299–344 (2008)
Article MathSciNet MATH Google Scholar
Burke, J.V., Moré, J.J.: On the identification of active constraints. SIAM J. Numer. Anal. 25(5), 1197–1211 (1988)
Article MathSciNet MATH Google Scholar
Daniilidis, A., Hare, W., Malick, J.: Geometrical interpretation of the predictor-corrector type algorithms in structured optimization problems. Optimization 55(5–6), 481–503 (2006)
Article MathSciNet MATH Google Scholar
Dembo, R.S., Steihaug, T.: Truncated-newtono algorithms for large-scale unconstrained optimization. Math. Program. 26(2), 190–212 (1983)
Article MATH Google Scholar
Dennis Jr, J.E., Schnabel, R.B.: Numerical methods for unconstrained optimization and nonlinear equations. SIAM (1996)
Dolan, E.D., Moré, J.J.: Benchmarking optimization software with performance profiles. Math. Program. 91(2), 201–213 (2002)
Article MathSciNet MATH Google Scholar
Drusvyatskiy, D., Lewis, A.S.: Optimality, identifiability, and sensitivity. Math. Program. 147(1–2), 467–498 (2014)
Article MathSciNet MATH Google Scholar
Hare, W., Sagastizábal, C.: Computing proximal points of nonconvex functions. Math. Program. 116(1–2), 221–258 (2009)
Article MathSciNet MATH Google Scholar
Iutzeler, F., Malick, J.: Nonsmoothness in machine learning: specific structure, proximal identification, and applications. Set-Valued and Variational Analysis 28(4), 661–678 (2020)
Article MathSciNet MATH Google Scholar
Lee, C.p.: Accelerating inexact successive quadratic approximation for regularized optimization through manifold identification. arXiv preprint arXiv:2012.02522 (2020)
Lee, J.D., Sun, Y., Saunders, M.A.: Proximal newton-type methods for minimizing composite functions. SIAM J. Optim. 24(3), 1420–1443 (2014)
Article MathSciNet MATH Google Scholar
Lemaréchal, C., Oustry, F., Sagastizábal, C.: The u-lagrangian of a convex function. Trans. Am. Math. Soc. 352(2), 711–729 (2000)
Article MathSciNet MATH Google Scholar
Lewis, A., Wylie, C.: A simple newton method for local nonsmooth optimization. arXiv preprint arXiv:1907.11742 (2019)
Lewis, A.S.: Active sets, nonsmoothness, and sensitivity. SIAM J. Optim. 13(3), 702–725 (2002)
Article MathSciNet MATH Google Scholar
Lewis, A.S., Liang, J., Tian, T.: Partial smoothness and constant rank. SIAM J. Optim. 32(1), 276–291 (2022)
Article MathSciNet MATH Google Scholar
Lewis, A.S., Wright, S.J.: A proximal method for composite minimization. Math. Program. 158(1), 501–546 (2016). https://doi.org/10.1007/s10107-015-0943-9
Article MathSciNet MATH Google Scholar
Liang, J., Fadili, J., Peyré, G.: Activity identification and local linear convergence of forward-backward-type methods. SIAM J. Optim. 27(1), 408–437 (2017)
Article MathSciNet MATH Google Scholar
Mifflin, R., Sagastizábal, C.: A VU-algorithm for convex minimization. Math. Program. 104(2–3), 583–608 (2005)
Article MathSciNet MATH Google Scholar
Miller, S.A., Malick, J.: Newton methods for nonsmooth convex minimization: connections among U-lagrangian, riemannian newton and sqp methods. Math. Program. 104(2–3), 609–633 (2005)
Article MathSciNet MATH Google Scholar
Nesterov, Y.: A method of solving a convex programming problem with convergence rate O$(1/k^2)$. Soviet Mathematics Doklady 27(2), 372–376 (1983)
MATH Google Scholar
Poliquin, R., Rockafellar, R.: Prox-regular functions in variational analysis. Trans. Am. Math. Soc. 348(5), 1805–1838 (1996)
Article MathSciNet MATH Google Scholar
Rockafellar, R.T., Wets, R.J.B.: Variational analysis, vol. 317. Springer, Berlin (2009)
Wright, S.J.: Identifiable surfaces in constrained optimization. SIAM J. Control. Optim. 31(4), 1063–1079 (1993)
Article MathSciNet MATH Google Scholar

Download references

Author information

Authors and Affiliations

Univ. Grenoble Alpes, CNRS, Grenoble INP, LJK, 38000, Grenoble, France
Gilles Bareilles, Franck Iutzeler & Jérôme Malick

Authors

Gilles Bareilles
View author publications
You can also search for this author in PubMed Google Scholar
Franck Iutzeler
View author publications
You can also search for this author in PubMed Google Scholar
Jérôme Malick
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Franck Iutzeler.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This work is partly funded by the ANR JCJC project STROLL (ANR-19-CE23-0008).

Appendices

Preliminary results on the proximal gradient

The first result shows the local Lipschitz continuity of the proximity operator. It can be proven by applying [35, Th. 4.4] with the assumption that ${\bar{x}}=\mathbf {prox}_{{{\bar{\gamma }}} g}({\bar{y}})$, following the arguments of [22, Th. 1]. We provide here a self-contained proof.

Lemma A.1

Consider a function $g:{\mathbb {R}}^n\rightarrow {\mathbb {R}}$, a pair of points ${\bar{x}}, {\bar{y}}$ and a step length ${\bar{\gamma }}>0$ such that ${\bar{x}}=\mathbf {prox}_{{{\bar{\gamma }}} g}({\bar{y}})$ and $g$ is r prox-regular at ${\bar{x}}$ for subgradient ${\bar{v}} \triangleq ({\bar{y}}-{\bar{x}}) / {\bar{\gamma }}$.

Then, for any $\gamma \in (0, \min (1/r, {{\bar{\gamma }}}))$, there exists a neighborhood ${\mathcal {N}}_{{\bar{y}}}$ of ${\bar{y}}$ over which $\mathbf {prox}_{\gamma g}$ is single-valued and $(1-\gamma r)^{-1}$-Lipschitz continuous. Furthermore, there holds

$$\begin{aligned} x = \mathbf {prox}_{\gamma g}(y) \Leftrightarrow (y-x)/\gamma \in \partial g(x) \end{aligned}$$

for $y\in {\mathcal {N}}_{{\bar{y}}}$ and x near ${\bar{x}}$ in the sense $\Vert x-{\bar{x}}\Vert <\varepsilon $, $|g(x)-g({\bar{x}})|<\varepsilon $ and $\Vert (y-x)/\gamma - {\bar{v}}\Vert <\varepsilon $.

Proof

One can easily check that prox-regularity of $g$ at ${\bar{x}}$ for subgradient ${\bar{v}}$ is equivalent to prox-regularity of function ${\tilde{g}}$ around 0 for subgradient 0, with ${\tilde{g}} = g(\cdot + {\bar{x}}) - \langle {\bar{v}}, \cdot \rangle - g({\bar{x}})$ and a change of variable ${\tilde{x}} = x-{\bar{x}}$. Similarly, ${\bar{x}}=\mathbf {prox}_{{{\bar{\gamma }}} g}({\bar{y}})$ is characterized by its global optimality condition

$$\begin{aligned} g(x) + \frac{1}{2{{\bar{\gamma }}}}\Vert x-{\bar{y}}\Vert > g({\bar{x}}) + \frac{1}{2{{\bar{\gamma }}}}\Vert {\bar{x}}-{\bar{y}}\Vert ^2 \qquad \text { for all } x\ne {\bar{x}}, \end{aligned}$$

which we may write as

$$\begin{aligned} g(x) > g({\bar{x}}) + \langle {\bar{v}}, x-{\bar{x}} \rangle - \frac{1}{2{{\bar{\gamma }}}}\Vert x - {\bar{x}}\Vert ^2 \qquad \text { for all } x\ne {\bar{x}}. \end{aligned}$$

Under that same change of variables, since ${\tilde{g}}(0) = 0$, this optimality condition rewrites as

$$\begin{aligned} {\tilde{g}}({\tilde{x}}) > - \frac{1}{2{{\bar{\gamma }}}}\Vert {\tilde{x}}\Vert ^2 \qquad \text { for all } {\tilde{x}}\ne 0. \end{aligned}$$

We may thus apply Theorem 4.4 from [35] to get the claimed result on ${\tilde{g}}$, which transfers back to $g$ as our change of function and variable is bijective. We thus obtain that for $\gamma \in (0, \min (1/r, {{\bar{\gamma }}}))$, on a neighborhood ${\mathcal {N}}_{{\bar{y}}}$ of ${\bar{y}}$, $\mathbf {prox}_{\gamma g}$ is single-valued, $(1-\gamma r)^{-1}$-Lipschitz continuous and $\mathbf {prox}_{\gamma g}(y)=[I+\gamma T]^{-1}(y)$, where T denotes the $g$-attentive $\varepsilon $-localization of $\partial g$ at ${\bar{x}}$. Taking y near ${\bar{y}}$ and x near ${\bar{x}}$ such that $\Vert x-{\bar{x}}\Vert <\varepsilon $, $|g(x)-g({\bar{x}})|<\varepsilon $ and $\Vert (y-x)/\gamma - {\bar{v}}\Vert <\varepsilon $ allows to identify the localization of $\partial g(x)$ with $\partial g(x)$, so that

$$\begin{aligned} \frac{y-x}{\gamma }\in \partial g(x)\Leftrightarrow \frac{y-x}{\gamma }\in T(x)\Leftrightarrow (I+\gamma T)(x)=y\Leftrightarrow x=\mathbf {prox}_{\gamma g}(y). \end{aligned}$$

Note that the proof of [35, Th. 4.4] includes a minor error relative to the Lipschitz constant computation, we report here a corrected value. $\square $

Now, we show that critical points of prox-regular functions are strong local minimizers; this result appears more or less explicitly in some articles, including [17].

Lemma A.2

Let f and g denote two functions and ${\bar{x}}$, ${\bar{y}}$ two points such that f is differentiable at ${\bar{y}}$ and g is r-prox-regular at ${\bar{x}}$ for subgradient $\frac{1}{\gamma }({\bar{y}}-{\bar{x}}) - \nabla f({\bar{y}}) \in \partial g({\bar{x}})$ with $\gamma \in (0, 1/r)$. Then, the function $\rho _{{\bar{y}}} : x\mapsto g(x) + \frac{1}{2\gamma }\Vert {\bar{y}}-\gamma \nabla f({\bar{y}}) - x\Vert ^2$ satisfies

$$\begin{aligned} \rho _{{\bar{y}}}(x) \ge \rho _{{\bar{y}}}({\bar{x}}) + \frac{1}{2} \left( \frac{1}{\gamma } - r \right) \Vert x-{\bar{x}} \Vert ^2, \qquad \text {for all } x\text { near }{\bar{x}}. \end{aligned}$$

Proof

Prox-regularity of g at ${\bar{x}}$ with subgradient $\frac{1}{\gamma } ({\bar{y}}-\gamma \nabla f({\bar{y}}) - {\bar{x}}) \in \partial g({\bar{x}})$ writes

$$\begin{aligned} g(x) \ge g({\bar{x}}) + \frac{1}{\gamma }\langle {\bar{y}}-\gamma \nabla f({\bar{y}}) - {\bar{x}} , x-{\bar{x}} \rangle - \frac{r}{2} \Vert x-{\bar{x}} \Vert ^2. \end{aligned}$$

The identity $ 2\langle b-a, c-a\rangle = \Vert b-a\Vert ^2 + \Vert c-a\Vert ^2 - \Vert b-c\Vert ^2 $ applied to the previous scalar product yields:

$$\begin{aligned} g(x) \ge g({\bar{x}}) + \frac{1}{2\gamma } \Vert {\bar{y}}-\gamma \nabla f({\bar{y}}) - {\bar{x}} \Vert ^2 + \frac{1}{2\gamma } \Vert x-{\bar{x}} \Vert ^2 - \frac{1}{2\gamma } \Vert {\bar{y}}-\gamma \nabla f({\bar{y}}) - x \Vert ^2 - \frac{r}{2} \Vert x-{\bar{x}} \Vert ^2, \end{aligned}$$

which rewrites

$$\begin{aligned} \underbrace{ g(x) + \frac{1}{2\gamma } \Vert {\bar{y}}-\gamma \nabla f({\bar{y}}) - x \Vert ^2 }_{=\rho _{{\bar{y}}}(x) } \ge \underbrace{ g({\bar{x}}) + \frac{1}{2\gamma } \Vert {\bar{y}}-\gamma \nabla f({\bar{y}}) - {\bar{x}} \Vert ^2 }_{=\rho _{{\bar{y}}}({\bar{x}}) } + \frac{1}{2} \left( \frac{1}{\gamma } - r \right) \Vert x-{\bar{x}} \Vert ^2, \end{aligned}$$

which is the claimed inequality. $\square $

Technical results on Riemannian methods.

In this section, we provide basic results on Riemannian optimization that simplify our developments and that we have not been able to find in the existing literature.

1.1 Euclidean spaces and manifolds, back and forth

We establish here a connection between the Riemannian and the Euclidean distances.

Lemma B.1

Consider a point ${\bar{x}}$ of a Riemannian manifold ${\mathcal {M}}$, equipped with a retraction ${\text {R}}$ such that ${\text {R}}_{{\bar{x}}}$ is ${\mathcal {C}}^2$. For any $\varepsilon >0$, there exists a neighborhood ${\mathcal {U}}$ of ${\bar{x}}$ in ${\mathcal {M}}$ such that

where ${\text {R}}^{-1}_{{\bar{x}}}:{\mathcal {M}}\rightarrow T_{{\bar{x}}} {\mathcal {M}}$ is the smooth inverse of ${\text {R}}_{{\bar{x}}}$ defined locally around ${\bar{x}}$.

Proof

The retraction at ${\bar{x}}$ can be inverted locally around 0. Indeed, as ${\text {D}}{\text {R}}_{{\bar{x}}}(0_{T_{{\bar{x}}} {\mathcal {M}}})=I$ is invertible and ${\text {R}}_{{\bar{x}}}$ is ${\mathcal {C}}^2$, the implicit function theorem provides the existence of a ${\mathcal {C}}^2$ inverse function ${\text {R}}^{-1}_{{\bar{x}}}:{\mathcal {M}}\rightarrow T_{{\bar{x}}} {\mathcal {M}}$ defined locally around ${\bar{x}}$. Furthermore, one shows by differentiating the relation ${\text {R}}_{{\bar{x}}}\circ {\text {R}}_{{\bar{x}}}^{-1}$ that the differential of ${\text {R}}^{-1}_{{\bar{x}}}$ at ${\bar{x}}$ is the identity.

We consider the function $f:{\mathcal {M}}\rightarrow {\mathbb {R}}$ defined by $f(x)=\Vert \log _{{\bar{x}}}(x)\Vert -\Vert {\text {R}}_{{\bar{x}}}^{-1}(x)\Vert $. Clearly $f({\bar{x}})=0$, and ${\text {D}}f({\bar{x}})=0$ as the differentials of both ${\text {R}}_{{\bar{x}}}^{-1}$ and logarithm at ${\bar{x}}$ are the identity. In local coordinates ${\hat{x}}=\log _{{\bar{x}}}{x}$ around ${\bar{x}}$, f is represented by the function ${\hat{f}}=f\circ \exp _{{\bar{x}}}: T_{{\bar{x}}} {\mathcal {M}}\rightarrow {\mathbb {R}}$. As ${\hat{f}}(\hat{{\bar{x}}})=0$, ${\text {D}}{\hat{f}}(\hat{{\bar{x}}})=0$ and ${\hat{f}}$ is ${\mathcal {C}}^2$, there exists some $C>0$ such that

$$\begin{aligned} -C \Vert {\hat{x}} - \hat{{\bar{x}}}\Vert ^2 \le {\hat{f}}({\hat{x}}) \le C \Vert {\hat{x}} - \hat{{\bar{x}}}\Vert ^2 \qquad \text {in a neighborhood } {\hat{{\mathcal {U}}}}\text { of }\hat{{\bar{x}}},. \end{aligned}$$

For any $\varepsilon >0$, by taking a small enough neighborhood ${\hat{{\mathcal {U}}}}'\subset {\hat{{\mathcal {U}}}}$, there holds

$$\begin{aligned} -\varepsilon \Vert {\hat{x}} - \hat{{\bar{x}}}\Vert \le {\hat{f}}({\hat{x}}) \le \varepsilon \Vert {\hat{x}} - \hat{{\bar{x}}}\Vert . \end{aligned}$$

Thus for all x in ${\mathcal {U}}= {\text {R}}_{{\bar{x}}}({\hat{{\mathcal {U}}}}')$,

$$\begin{aligned} -\varepsilon \Vert \log _{{\bar{x}}}(x)\Vert \le \Vert \log _{{\bar{x}}}(x)\Vert -\Vert {\text {R}}_{{\bar{x}}}^{-1}(x)\Vert \le \varepsilon \Vert \log _{{\bar{x}}}(x)\Vert , \end{aligned}$$

as ${\hat{x}}=\log _{{\bar{x}}}(x)$, $\hat{{\bar{x}}}=0$. We conclude with . $\square $

We recall a slightly specialized version of [33, Th. 2.2], which is essentially the application of the implicit function theorem around a point of a manifold.

Proposition B.1

Consider a p-dimensional ${\mathcal {C}}^k$-submanifold ${\mathcal {M}}$ of ${\mathbb {R}}^n$ around a point ${\bar{x}}\in {\mathcal {M}}$. The mapping ${\text {R}}: T {\mathcal {B}}\rightarrow {\mathcal {M}}$, defined for $(x, \eta )\in T {\mathcal {B}}$ near $({\bar{x}}, 0)$ by $\mathrm {proj}_x({\text {R}}(x, \eta )) = \eta $ defines a second-order retraction near $({\bar{x}}, 0)$. The point-wise retraction, defined as ${\text {R}}_x = {\text {R}}(x, \cdot )$, is locally invertible with inverse ${\text {R}}_{x}^{-1}=\mathrm {proj}_{x}$.

Proof

Let $\varPsi : {\mathbb {R}}^n\rightarrow {\mathbb {R}}^{n-p}$ denote a ${\mathcal {C}}^k$ function defining ${\mathcal {M}}$ around ${\bar{x}}$: for all x close enough to ${\bar{x}}$, there holds $x\in {\mathcal {M}}\Leftrightarrow \varPsi (x)=0$, and ${\text {D}}\varPsi (x)$ is surjective. Consider the equation $\varPhi (x, \eta _t, \eta _n) = 0$ around $({\bar{x}}, 0, 0)$, with

The partial differential ${\text {D}}_{\eta _n}\varPhi ({\bar{x}}, 0, 0)$ is, for ,

$$\begin{aligned} {\text {D}}_{\eta _n}\varPhi ({\bar{x}}, 0, 0)[\xi _n] = {\text {D}}\varPsi ({\bar{x}})[\xi _n]. \end{aligned}$$

Since ${\bar{x}}\in {\mathcal {M}}$, ${\text {D}}_{\eta _n}\varPhi ({\bar{x}}, 0, 0)$ is surjective from to ${\mathbb {R}}^{n-p}$ so its a bijection. The implicit function theorem provides the existence of neighborhoods ${\mathcal {N}}_{{\bar{x}}}^1\subset {\mathcal {M}}$, , and a unique ${\mathcal {C}}^k$ function $\eta _n : {\mathcal {N}}_{{\bar{x}}}^1 \times {\mathcal {N}}_0^2 \rightarrow {\mathcal {N}}_0^3$ such that, for all $x\in {\mathcal {N}}_{{\bar{x}}}^1$, $\eta _t\in {\mathcal {N}}_0^2$ and $\eta _n\in {\mathcal {N}}_0^3$, $\eta _n({\bar{x}}, 0) = 0$ and

$$\begin{aligned} \varPhi (x, \eta _t, \eta _n(x, \eta _t)) = 0 \Leftrightarrow x+\eta _t+\eta _n(x, \eta _t) \in {\mathcal {M}}. \end{aligned}$$

It also provides an expression for the partial derivative of $\eta _n$ at (x, 0) along $\eta _t$: for ,

$$\begin{aligned} {\text {D}}_{\eta _t} \eta _n(x, 0)[\xi _t] = - \left[ {\text {D}}_{\eta _n}\varPhi (x, 0, 0) \right] ^{-1} {\text {D}}_{\eta _t} \varPhi (x, 0, 0)[\xi _t]. \end{aligned}$$

As noted before, ${\text {D}}_{\eta _n}\varPhi (x, 0, 0)$ is bijective since $x\in {\mathcal {M}}$. Besides, ${\text {D}}_{\eta _t} \varPhi (x, 0, 0) = {\text {D}}\varPhi (x)[\xi _t]=0$ since identifies as the kernel of ${\text {D}}\varPhi (x)$. Thus ${\text {D}}_{\eta _t} \eta _n(x, 0) = 0$.

Now, define a map ${\text {R}}:{\mathcal {N}}_{{\bar{x}}}^1\times {\mathcal {N}}_0^2 \rightarrow {\mathcal {M}}$ by ${\text {R}}(x, \eta _t) = x + \eta _t + \eta _n(x, \eta _t)$. This map has degree of smoothness ${\mathcal {C}}^k$ since $\eta _n$ is ${\mathcal {C}}^k$, satisfies ${\text {R}}(x, 0) = x$ since $\eta _n(x, 0)=0$ and satisfies ${\text {D}}_{\eta _t}\eta _n(x, 0) = I + {\text {D}}_{\eta _t}(x, 0) = I$. Thus ${\text {R}}$ defines a retraction on a neighborhood of $({\bar{x}}, 0)$.

We turn to show the second-order property of ${\text {R}}$. Consider the smooth curve $c$ defined as $c(t) = {\text {R}}(x, t\eta )$ for some $x\in {\mathcal {N}}_{{\bar{x}}}^1$, . It’s first derivative writes

$$\begin{aligned} c'(t) = \eta + {\text {D}}_{\eta _t}\eta _n(x, t\eta )[\eta ] = \eta . \end{aligned}$$

The acceleration of the curve c is obtained by computing the derivative of $c'(\cdot )$ in the ambient space and then projecting onto . Thus $c''(t)=0$ and in particular, $c''(0)=0$ which makes ${\text {R}}$ a second-order retraction. $\square $

Lemma B.2

Consider a point ${\bar{x}}$ of a Riemannian manifold ${\mathcal {M}}$. For any $\varepsilon >0$, there exists a neighborhood ${\mathcal {U}}$ of ${\bar{x}}$ in ${\mathcal {M}}$ such that, for all $x\in {\mathcal {U}}$,

where $\Vert x-{\bar{x}}\Vert $ is the Euclidean distance in the ambient space.

Proof

Let ${\bar{x}}$, x denote two close points on ${\mathcal {M}}$. Consider the tangential retraction introduced in Proposition B.1. As a retraction, it satisfies:

$$\begin{aligned} {\text {R}}_{{\bar{x}}}(\eta ) = {\text {R}}_{{\bar{x}}}(0) + {\text {D}}{\text {R}}_{{\bar{x}}}(0)[\eta ] + {\mathcal {O}}(\Vert \eta \Vert ^2) = {\bar{x}} + {\mathcal {O}}(\Vert \eta \Vert ^2). \end{aligned}$$

Taking $x={\text {R}}_{{\bar{x}}}(\eta )$ allows to write $x = {\bar{x}} + {\mathcal {O}}(\Vert {\text {R}}_{{\bar{x}}}^{-1}(x)\Vert ^2)$, so that for any small $\varepsilon _1>0$, there exists a small enough neighborhood ${\mathcal {U}}_1\subset {\mathcal {U}}$ of ${\bar{x}}$ in ${\mathcal {M}}$ such that

$$\begin{aligned} (1-\varepsilon _1) \Vert {\text {R}}_{{\bar{x}}}^{-1}(x)\Vert \le \Vert x-{\bar{x}}\Vert \le (1+\varepsilon _1)\Vert {\text {R}}_{{\bar{x}}}^{-1}(x)\Vert . \end{aligned}$$

By Lemma B.1, for $\varepsilon _2>0$ small enough, there exists a neighborhood ${\mathcal {U}}_2\subset {\mathcal {U}}$ of ${\bar{x}}$ such that,

With $\varepsilon _1$, $\varepsilon _2$ such that $1-\varepsilon =(1-\varepsilon _1)(1-\varepsilon _2)$, we combine the two estimates to conclude. $\square $

1.2 Two technical results on Riemannian descent algorithms

We provide here two technical results used in the proofs of Sect. 4. First, Theorem B.1 adapts [13, Th. 4.16] to the Riemannian setting. Second, Lemma B.3 adapts the proof of [18, Lemma A.2] to the Riemannian setting.

Theorem B.1

(Soundness of the Riemannian line search) Consider a manifold ${\mathcal {M}}$ equipped with a retraction ${\text {R}}$ and a twice differentiable function $F:{\mathcal {M}}\rightarrow {\mathbb {R}}$ that admits a strong local minimizer $x^\star $, that is, a point such that $\hbox {Hess }F(x^\star )$ is positive definite. If x is close to $x^\star $, $\eta $ brings a superlinear improvement towards $x^\star $, that is as $x\rightarrow x^\star $, and $0<m_1<1/2$, then $\eta $ is acceptable by the Armijo rule (4.1) with unit stepsize $\alpha =1$.

Proof

Let $x, \eta \in T {\mathcal {B}}$ denote a pair such that x is close to $x^\star $ and . For convenience, let $x_+ = {\text {R}}_x(\eta )$ denote the next point.

Following [2] (see e.g. the proof of Th. 6.3.2), we work in local coordinates around $x^\star $, representing any point $x\in {\mathcal {M}}$ by ${\widehat{x}} = \log _{x^\star }(x)$ and any tangent vector by ${\widehat{\eta }}_x = {\text {D}}\log _{x^\star }(x)[\eta ]$. The function $F$ is represented by . Defining the coordinates via the logarithm grants the useful property that the Riemannian distance of any two points $x, y\in {\mathcal {M}}$ matches the euclidean distance between their representatives: . Besides, there holds

$$\begin{aligned} {\text {D}}F(x)[\eta ]={\text {D}}{\widehat{F}}({\widehat{x}})[{\widehat{\eta }}] \quad \text { and } \quad \hbox {Hess }F(x)[\eta , \eta ]={\text {D}}{\widehat{F}}({\widehat{x}})[{\widehat{\eta }}, {\widehat{\eta }}]. \end{aligned}$$

(B.1)

Indeed, ${\text {D}}F(x)[\eta ] = (F\circ \gamma )'(0)$ and $\hbox {Hess }F(x)[\eta , \eta ] = (F\circ \gamma )''(0)$, where $\gamma $ denotes the geodesic curve defined by ${\widehat{\gamma }}(t) = {\widehat{x}} + t{\widehat{\eta }}$. Using $F\circ \gamma ={\widehat{F}}\circ {\widehat{\gamma }}$, one obtains the result.

Step 1. We derive an approximation of ${\text {D}}F(x)[\eta ] = \langle \hbox {grad }F(x), \eta \rangle $ in terms of ${\text {D}}^2 {\widehat{F}}(\widehat{x^\star })[{\widehat{x}}-\widehat{x^\star }]^2$. To do so, we go through the intermediate quantity ${\text {D}}{\widehat{F}}({\widehat{x}}) [\widehat{x_+}-{\widehat{x}}]$, and handle precisely the $o(\cdot )$ terms. By smoothness of ${\widehat{F}}$ and since ${\text {D}}{\widehat{F}}(\widehat{x^\star })=0$, Taylor’s formula for ${\text {D}}{\widehat{F}}$ writes

$$\begin{aligned} {\text {D}}{\widehat{F}}({\widehat{x}}) [\widehat{x_+}-{\widehat{x}}]&= {\text {D}}^2 {\widehat{F}}(\widehat{x^\star })[\widehat{x_+}-{\widehat{x}}, {\widehat{x}}-\widehat{x^\star }] + o(\Vert {\widehat{x}}-\widehat{x^\star }\Vert ^2)\\&= -{\text {D}}^2 {\widehat{F}}(\widehat{x^\star })[{\widehat{x}}-\widehat{x^\star }]^2 + {\text {D}}^2 {\widehat{F}}(\widehat{x^\star })[\widehat{x_+}-\widehat{x^\star }, {\widehat{x}}-\widehat{x^\star }] + o(\Vert {\widehat{x}}-\widehat{x^\star }\Vert ^2)\\&= -{\text {D}}^2 {\widehat{F}}(\widehat{x^\star })[{\widehat{x}}-\widehat{x^\star }]^2 + o(\Vert {\widehat{x}}-\widehat{x^\star }\Vert ^2), \end{aligned}$$

where, in the last step, we used that $\Vert \widehat{x_+}-\widehat{x^\star }\Vert = o(\Vert {\widehat{x}}-\widehat{x^\star }\Vert )$ to get that $\Vert {\text {D}}^2 {\widehat{F}}(\widehat{x^\star })[\widehat{x_+}-\widehat{x^\star }, {\widehat{x}}-\widehat{x^\star }]\Vert = \Vert {\text {D}}^2 {\widehat{F}}(\widehat{x^\star })\Vert \Vert \widehat{x_+}-\widehat{x^\star }\Vert \Vert {\widehat{x}}-\widehat{x^\star }\Vert = o(\Vert {\widehat{x}}-\widehat{x^\star }\Vert ^2)$. We now turn to show that ${\text {D}}{\widehat{F}}({\widehat{x}}) [\widehat{x_+}-{\widehat{x}}]$ behaves as ${\text {D}}F(x)[\eta ]$ up to $o(\Vert \widehat{x_+}-{\widehat{x}}\Vert ^2)$. Since ${\text {D}}F(x)[\eta ]={\text {D}}{\widehat{F}}({\widehat{x}})[{\widehat{\eta }}]$ by (B.1), there holds:

$$\begin{aligned} \Vert {\text {D}}F(x)[\eta ] - {\text {D}}{\widehat{F}}({\widehat{x}})[\widehat{x_+} - {\widehat{x}}]\Vert = \Vert {\text {D}}{\widehat{F}}({\widehat{x}})[{\widehat{\eta }} - (\widehat{x_+} - {\widehat{x}})]\Vert \le \Vert {\text {D}}{\widehat{F}}({\widehat{x}}) \Vert \Vert {\widehat{\eta }} - (\widehat{x_+} - {\widehat{x}})\Vert . \end{aligned}$$

As $F$ is twice differentiable and $\exp $ is ${\mathcal {C}}^\infty $, ${\widehat{F}}$ is twice differentiable as well. In particular its derivative is locally Lipschitz continuous, so that for ${\widehat{x}}$ near $\widehat{x^\star }$, we obtain a first estimate:

$$\begin{aligned} \Vert {\text {D}}{\widehat{F}}({\widehat{x}})\Vert = \Vert {\text {D}}{\widehat{F}}({\widehat{x}}) - {\text {D}}{\widehat{F}}(\widehat{x^\star })\Vert = {\mathcal {O}} (\Vert {\widehat{x}}-\widehat{x^\star } \Vert ). \end{aligned}$$

Besides, the following estimate holds $\Vert {\widehat{\eta }} - (\widehat{x_+} - {\widehat{x}})\Vert = o (\Vert {\widehat{x}}-\widehat{x^\star } \Vert )$. Indeed, as the function is differentiable, there holds for small,

$$\begin{aligned} \log _{x^\star }({\text {R}}_x(\eta )) = \log _{x^\star }({\text {R}}_x(0)) + {\text {D}}\log _{x^\star }({\text {R}}_x(0))[{\text {D}}{\text {R}}_x(0)[\eta ]] + o(\Vert \eta \Vert ), \end{aligned}$$

which simplifies to $\widehat{x_+} = {\widehat{x}} + {\widehat{\eta }} + o(\Vert \eta \Vert )$. Lemma B.1 allows to write . Using the triangular inequality and the assumption that we get

so that the second estimate holds.

Combining the two above estimates allows to conclude that

$$\begin{aligned} \Vert {\text {D}}F(x)[\eta ] - {\text {D}}{\widehat{F}}({\widehat{x}})[\widehat{x_+} - {\widehat{x}}]\Vert = o(\Vert {\widehat{x}}-\widehat{x^\star }\Vert ^2), \end{aligned}$$

so that overall,

$$\begin{aligned} {\text {D}}F(x)[\eta ] = {\text {D}}{\widehat{F}}({\widehat{x}}) [\widehat{x_+}-{\widehat{x}}] + o(\Vert {\widehat{x}}-\widehat{x^\star }\Vert ^2) = -{\text {D}}^2 {\widehat{F}}(\widehat{x^\star })[{\widehat{x}}-\widehat{x^\star }]^2 + o(\Vert {\widehat{x}}-\widehat{x^\star }\Vert ^2). \end{aligned}$$

Using that and ${\text {D}}^2 {\widehat{F}}(\widehat{x^\star }) = \hbox {Hess }F(x^\star )$ (B.1), we obtain

(B.2)

Step 2. The function $F$ admits a second-order development around $x^\star $: applying Eq. (2.3) with the exponential map $\exp _{x^\star }$ as a second-order retraction yields

(B.3)

where we used that . Denote $0 < l \le L$ the lower and upper eigenvalues of $\hbox {Hess }F(x^\star )$. The combination (B.3) $+ m_1$ (B.2) writes

Let $\varepsilon >0$ such that $\frac{1}{2}L \varepsilon ^2 < (\frac{1}{2}-m_1) l$. As , for x close enough to $x^\star $ there holds . Combining this with the second-order development of f at $x_+$, there holds:

Subtracting the two estimates yields

which ensures that the Armijo condition is satisfied. $\square $

Lemma B.3

(Riemannian Newton-CG a descent direction) Let Assumption 1 hold and consider a manifold ${\mathcal {M}}$ and a point $x\in {\mathcal {M}}$. If $F$ is twice differentiable on ${\mathcal {M}}$ at x and x is not a stationary point of $F$, then there holds:

$$\begin{aligned} \langle \hbox {grad }F(x), d \rangle \le -\min (1, \Vert \hbox {Hess }F(x)\Vert ^{-1}) \Vert \hbox {grad }F(x)\Vert ^2, \end{aligned}$$

where d was obtained solving (Inexact Newton eq.) with any forcing parameter $\eta $.

Proof

The result is obtained by applying the analysis of [18, Lemma A.2] to the approximate resolution of (Inexact Newton eq.) on the euclidean space , with constant specified according to the proof. $\square $

Complements to the experimental section

1.1 Oracles of section 5.1

We detail here the oracles of $f(x)\triangleq 2x_1^2+x_2^2$ and $g(x) \triangleq |x_1^2-x_2|$:

proximity operator: For $\gamma <1/2$, there holds
$$\begin{aligned} \mathbf {prox}_{\gamma g}(x) = {\left\{ \begin{array}{ll} (\frac{x_1}{1+2\gamma }, x_2+\gamma ) &{}\text { if } x_2 \le \frac{x_1^2}{(1+2\gamma )^2} - \gamma \\ (\frac{x_1}{1+4\gamma t-2\gamma }, x_2 + 2\gamma t - \gamma ) &{}\text { if } \frac{x_1^2}{(1+2\gamma )^2} - \gamma \le x_2 \le \frac{x_1^2}{(1-2\gamma )^2} + \gamma \\ (\frac{x_1}{1-2\gamma }, x_2-\gamma ) &{}\text { if } \frac{x_1^2}{(1-2\gamma )^2} + \gamma \le x_2 \end{array}\right. } \end{aligned}$$
where t solves $x_2^2 + (-2\gamma t + \gamma - x_2)(1+4\gamma t -2\gamma )^2 = 0$.
Riemannian gradient and Hessian: Since $g$ is identically null on ${\mathcal {M}}$, for any point $(x, \eta )\in T {\mathcal {B}}$, $ \hbox {grad }g(x) = 0 \text { and } \hbox {Hess }g(x)[\eta ] = 0 $. Moreover, Euclidean gradient and Hessian-vector product are converted to Riemannian ones using Eqs. (2.1) and (2.2):
$$\begin{aligned} \hbox {grad }f(x)&= \mathrm {proj}_{x}(\nabla f(x)) \\ \hbox {Hess }f(x)[\eta ]&= \mathrm {proj}_{x}\left( \nabla ^2 f(x)[\eta ] - \begin{pmatrix} 2\eta _1 \\ 0 \end{pmatrix} \left\langle \nabla f(x), \begin{pmatrix} 2x_1 \\ -1 \end{pmatrix}\right\rangle \frac{1}{1+4x_1^2} \right) , \end{aligned}$$
and the orthogonal projection onto writes
$$\begin{aligned} \mathrm {proj}_{x}(d) = d - \left\langle d, \begin{pmatrix} 2x_1 \\ -1 \end{pmatrix}\right\rangle \frac{1}{1+4x_1^2} \begin{pmatrix} 2x_1 \\ -1 \end{pmatrix}. \end{aligned}$$

1.2 Differentiating the singular-value decomposition

We establish the expressions of the derivative of the matrices involved in the singular value decomposition. These results may be seen as part of folklore, but, up to our knowledge, there are not explicitly written in the literature. We need them for the computations related to trace-norm regularized problems.

Lemma C.1

Consider the manifold of fixed rank matrices ${\mathcal {M}}_r$, a pair $x, \eta \in T {\mathcal {B}}$ and a smooth curve $c: I\rightarrow {\mathcal {M}}_r$ such that $c(0)=x$, $c'(0)=\eta $. Besides, let U(t), $\varSigma (t)$, V(t) denote smooth curves of St(m, r), ${\mathbb {R}}^{r\times r}$, St(n, r) such that $\gamma (t) = U(t)\varSigma (t) V(t)^\top $. The derivatives of the decomposition factors at $t=0$ write

$$\begin{aligned} U'&= U \left( F \circ \left[ U^\top \eta V \varSigma + \varSigma V^\top \eta ^\top U \right] \right) + (I_m - U U^\top ) \eta V \varSigma ^{-1} \\ V'&= V \left( F \circ \left[ \varSigma U^\top \eta V + V^\top \eta ^\top U \varSigma \right] \right) + (I_n - V V^\top ) \eta ^\top U\varSigma ^{-1} \\ \varSigma '&= I_k \circ \left[ U^\top \eta V \right] , \end{aligned}$$

where $I_k$ is the identity of ${\mathbb {R}}^{k\times k}$, $\circ $ denotes the Hadamard product and $F\in {\mathbb {R}}^{r\times r}$ is such that $F_{ij} = 1 / (\varSigma _{jj}^2 - \varSigma _{ii}^2)$ if $\varSigma _{jj} \ne \varSigma _{ii}$, and $F_{ij}=0$ otherwise. Equivalently, when the tangent vector is represented as $\eta = UMV^\top + U_pV^\top + UV_p^\top $, the above expressions simplify to

$$\begin{aligned} U'&= U \left( F \circ \left[ M \varSigma + \varSigma M^\top \right] \right) + U_p \varSigma ^{-1} \\ V'&= V \left( F \circ \left[ \varSigma M + M^\top \varSigma \right] \right) + V_p \varSigma ^{-1} \\ \varSigma '&= I_k \circ M, \end{aligned}$$

Proof

We consider the curve $\gamma $ and all components and derivatives at $t=0$, therefore we don’t mention evaluation time. Differentiating $\gamma = U\varSigma V^\top $ yields

$$\begin{aligned} \eta = U'\varSigma V^\top + U\varSigma ' V^\top + U\varSigma V'^\top \end{aligned}$$

(C.1)

As a tangent vector to the Stiefel manifold at point U, $U'$ can be expressed as [2, Ex. 3.5.2]

$$\begin{aligned} U'=U\varOmega _U + U_\perp B_U, \end{aligned}$$

(C.2)

where $\varOmega _U\in {\mathbb {R}}^{r\times r}$ is a skew-symmetric matrix, $B_U\in {\mathbb {R}}^{m-r \times m-r}$, and $U_\perp $ is any matrix such that $U^\top U_\perp = 0$ and $U_\perp ^\top U_\perp = I_{m-r}$. Similarly, $V' = V\varOmega _V + V_\perp B_V$, where $\varOmega _V\in {\mathbb {R}}^{r\times r}$ is skew-symmetric, $B_V\in {\mathbb {R}}^{n-r \times n-r}$, and $V_\perp $ is any matrix such that $V^\top V_\perp = 0$ and $V_\perp ^\top V_\perp = I_{n-r}$.

Computing $U^\top \times $ (C.1) $\times V$ yields

$$\begin{aligned} U^\top \eta V = \varOmega _U\varSigma + \varSigma ' + \varSigma \varOmega _V^\top . \end{aligned}$$

Looking at the diagonal elements of this equation yields the derivative of the diagonal component of $\eta $. This is done by taking the Hadamard product of both sides of previous equation with the identity matrix of ${\mathbb {R}}^{r\times r}$, and writes

$$\begin{aligned} \varSigma ' = I_r\circ \left[ U^\top \eta V \right] . \end{aligned}$$

The off-diagonal elements of this equation write

$$\begin{aligned} {{\bar{I}}}_r\circ \left[ U^\top \eta V \right] = \varOmega _U\varSigma + \varSigma \varOmega _V^\top , \end{aligned}$$

(C.3)

where ${{\bar{I}}}_r$ has zeros on the diagonal and ones elsewhere. Adding (C.3)$\varSigma $ and $\varSigma $(C.3)$^\top $ yields

$$\begin{aligned} {{\bar{I}}}_r\circ \left[ U^\top \eta V \varSigma + \varSigma V^\top \eta ^\top U \right] = \varOmega _U\varSigma ^2 - \varSigma ^2\varOmega _U, \end{aligned}$$

which decouples coefficient-wise. At coefficient (ij), with $i\ne j$,

$$\begin{aligned} \left[ U^\top \eta V \varSigma + \varSigma V^\top \eta ^\top U \right] _{ij} = [\varOmega _U]_{ij}\left( \varSigma ^2_{jj} - \varSigma ^2_{ii}\right) , \end{aligned}$$

hence $\varOmega _U = F \circ \left[ U^\top \eta V \varSigma + \varSigma V^\top \eta ^\top U \right] $, where $F\in {\mathbb {R}}^{m-r\times r}$ has zeros on the diagonal and for $i\ne j$, $F_{ij} = 1/(\varSigma ^2_{jj} - \varSigma ^2_{ii})$ if $\varSigma ^2_{jj} \ne \varSigma ^2_{ii}$, 0 otherwise. Besides, left-multiplying (C.1) by $U_\perp ^\top $ yields $U_\perp ^\top \eta = U^\top _\perp U'\varSigma V^\top $, which rewrites, using the decomposition (C.2) of $U'$, as $U_\perp ^\top \eta = B_U \varSigma V^\top $. Hence $B_U = U_\perp ^\top \eta V \varSigma ^{-1}$ and we get the complete expression for $U'$ by assembling the expressions of $\varOmega _U$ and $B_U$ with the decomposition (C.2). The term $U_\perp ^\top U_\perp $ is eliminated using that $U^\top U + U_\perp ^\top U_\perp = I_m$.

Let’s follow the same steps to get expressions for $V'$. Adding $\varSigma $(C.3) and (C.3)$^\top \varSigma $ yields

$$\begin{aligned} {{\bar{I}}}_r\circ \left[ \varSigma U^\top \eta V + V^\top \eta ^\top U \varSigma \right] = \varOmega _V \varSigma ^2 - \varSigma ^2 \varOmega _V, \end{aligned}$$

from which we get $\varOmega _V = F \circ \left[ \varSigma U^\top \eta V + V^\top \eta ^\top U \varSigma \right] $. Besides, right-multiplying (C.1) by $V_\perp $ yields $\eta V_\perp = U \varSigma V'^\top V_\perp $, which rewrites using the decomposition $V' = V\varOmega _V + V_\perp B_V $ as $\eta V_\perp = U \varSigma B_V^\top $. Hence $B_V = V_\perp ^\top \eta ^\top U \varSigma ^{-1}$, and we get the claimed formula by eliminating the $V_\perp $ terms with $V^\top V + V_\perp ^\top V_\perp = I_n$. The simplified expressions are obtained using that $U^\top U=I_m$, $U^\top U_p=0$, $V^\top V=I_n$ and $V^\top V_p=0$. $\square $

We are now ready to give the expression of the Riemannian gradient and Hessian of the nuclear norm.

Proposition C.1

The nuclear norm $g=\Vert \cdot \Vert _*$ restricted to ${\mathcal {M}}_r$ is ${\mathcal {C}}^2$ and admits a smooth second-order development of the form (2.3) near any point $x = U\varSigma V^\top \in {\mathcal {M}}_r$. Denoting a tangent vector, there holds:

$$\begin{aligned} \hbox {grad }g(x)&= U V^\top \\ \hbox {Hess }g(x)[\eta ]&= U \left[ {\tilde{F}} \circ (M-M^\top ) \right] V^\top + U_p \varSigma ^{-1}V^\top + U\varSigma ^{-1}V_p^T, \end{aligned}$$

where $\circ $ denotes the Hadamard product and ${\tilde{F}}\in {\mathbb {R}}^{r\times r}$ is such that ${\tilde{F}}_{ij} = 1 / (\varSigma _{jj} + \varSigma _{ii})$ if $\varSigma _{jj} \ne \varSigma _{ii}$, and ${\tilde{F}}_{ij}=0$ otherwise.

Proof

Let $c: I\rightarrow {\mathcal {M}}_r$ denote a smooth curve over ${\mathcal {M}}_r$ such that $\gamma (0)=x$ and $\gamma '(0)=\eta $, and consider $\varphi = \Vert c(\cdot )\Vert _* : I\rightarrow {\mathbb {R}}$. Writing the decomposition $c(t) = U(t)\varSigma (t) V(t)^\top $, for U(t), $\varSigma (t)$, V(t) smooth curves of St(m, r), ${\mathbb {R}}^{r\times r}$, St(n, r) allows to write $\varphi (t) = {{\,\mathrm{Tr}\,}}(\varSigma (t))$. Applying Lemma C.1 yields

$$\begin{aligned} \varphi '(0) = {{\,\mathrm{Tr}\,}}(\varSigma '(0)) = {{\,\mathrm{Tr}\,}}(U^\top \eta V) = {{\,\mathrm{Tr}\,}}(\eta V U^\top ) = \langle \eta , UV^\top \rangle , \end{aligned}$$

so that .

In order to obtain the Riemannian Hessian, let ${{\bar{Z}}} : I\rightarrow {\mathbb {R}}^n$ denote a smooth extension of $\hbox {grad }g(c(\cdot ))$, defined by ${{\bar{Z}}}(t) = U(t)V(t)^\top $. The Riemannian Hessian is then obtained as $\hbox {Hess }g(x)[\eta ] = \mathrm {proj}_x {{\bar{Z}}}'(0)$. The derivative of ${{\bar{Z}}}$ at 0 is simply ${{\bar{Z}}}'(0) = U' V^\top + U V'^\top $ and thus writes, applying Lemma C.1

$$\begin{aligned} {{\bar{Z}}}'(0)= & {} U \left( F \circ \left[ M \varSigma + \varSigma M^\top \right] \right) V^\top + U_p \varSigma ^{-1} V^\top + U \left( F \circ \left[ \varSigma M + M^\top \varSigma \right] \right) ^\top V^\top \\&+ U \varSigma ^{-1} V_p^\top \end{aligned}$$

This expression simplifies to the statement by using the fact that F is antisymmetric and applying the identity $(A\circ B)^\top = A^\top \circ B^\top $. $\square $

1.3 Additional numerical experiment

We illustrate in this appendix the robustness of the Newton acceleration on several instances of the same problem. More precisely, in the set-up of Sect. 5.3, we compare the 4 algorithms on 20 random instances of the tracenorm problem, in terms of wallclock time required to reach a suboptimality of $10^{-9}$. We then provide in Fig. 5 a performance profile (i.e. the ordinate of a curve at absciss $t \ge 1$ indicates the proportion of problems for which the corresponding algorithm was able to satisfy the criterion within t times the best algorithm time for each problem; see [20]).

We observe the following on Fig. 5. The ordinate at origin of a curve gives the proportion of problems for which the corresponding algorithm performed best: methods with Newton acceleration are the most efficient in $95\% (= 75\%+20\%)$ of the instances. Furthermore, they require about $2.5\times $ less time to converge in half of the instances. Note also that the proximal gradient is completely outperformed by the others algorithms since it takes $5\times $ more time than the best algorithm, for all instances.

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article

Cite this article

Bareilles, G., Iutzeler, F. & Malick, J. Newton acceleration on manifolds identified by proximal gradient methods. Math. Program. 200, 37–70 (2023). https://doi.org/10.1007/s10107-022-01873-w

Download citation

Received: 15 January 2021
Accepted: 22 July 2022
Published: 30 August 2022
Issue Date: June 2023
DOI: https://doi.org/10.1007/s10107-022-01873-w

Keywords

Mathematics Subject Classification

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Newton acceleration on manifolds identified by proximal gradient methods

Abstract

Access this article

Similar content being viewed by others

Riemannian proximal gradient methods

An inexact Riemannian proximal gradient method

A nonmonotone accelerated proximal gradient method with variable stepsize strategy for nonsmooth and nonconvex minimization problems

Notes

References

Author information

Authors and Affiliations

Corresponding author

Additional information

Publisher's Note

Appendices

Preliminary results on the proximal gradient

Lemma A.1

Proof

Lemma A.2

Proof

Technical results on Riemannian methods.

1.1 Euclidean spaces and manifolds, back and forth

Lemma B.1

Proof

Proposition B.1

Proof

Lemma B.2

Proof

1.2 Two technical results on Riemannian descent algorithms

Theorem B.1

Proof

Lemma B.3

Proof

Complements to the experimental section

1.1 Oracles of section 5.1

1.2 Differentiating the singular-value decomposition

Lemma C.1

Proof

Proposition C.1

Proof

1.3 Additional numerical experiment

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Mathematics Subject Classification

Search

Navigation