1 Introduction

Riemannian optimization, i.e., the solution of minimization problems whose decision variables are constrained to lie on a Riemannian manifold, is an important and active area of research: numerous problems in data science, robotics, and other settings possess a geometric structure characterizing the allowable inputs. Derivative free optimization (DFO), or zeroth-order optimization, concerns algorithms that rely only on function evaluations rather than gradient computations, and is designed for applications where accurate approximations of the gradient are unavailable due to noise or high computational cost. This paper specializes existing direct search DFO algorithms to Riemannian optimization problems. For background on Riemannian optimization and on DFO, see, e.g., [1] and [6, 11, 22], respectively.

Direct search methods (see, e.g., [21] and references therein) belong to the class of derivative free algorithms that build neither models of the objective nor gradient approximations. They are therefore particularly suitable for problems in which function evaluations come from a black box and little prior information is available to suggest how accurate different interpolation models would be, since both the differentiability and the conditioning properties of the function are unknown.

To the best of our knowledge, thorough studies of derivative free optimization (DFO) on Riemannian manifolds have appeared only recently in the literature. The closest related work is the direct search method confined to a subspace presented in [4]. In [23], the authors focus on a model-based method using a two-point function approximation for the gradient. The paper [31] presents a specialized Polak–Ribière–Polyak procedure for finding a zero of a tangent vector field on a Riemannian manifold. In [13], it is claimed that the convergence analysis of mesh adaptive direct search methods (MADS; see, e.g., [5, 6]) for unconstrained objectives can be extended to the case of Riemannian manifolds using the exponential map. In the subsequent work [14], the author focuses on a specific class of manifolds (reductive homogeneous spaces, including several matrix manifolds), discussing in more detail how, thanks to the properties of exponential maps, a straightforward extension of MADS is possible at least for that class. Some nonsmooth problems on Riemannian manifolds and references to derivative free optimization methods without convergence analyses can be found in [18].

Thus, our paper presents the first analysis of retraction-based direct search strategies on Riemannian manifolds, and the first analysis of a direct search algorithm for minimizing nonsmooth objectives in Riemannian optimization. In particular, a classic direct search scheme (see, e.g., [11, 21]) and a linesearch-based scheme (see, e.g., [12, 24,25,26] for further details on this class of methods) for the minimization of a given smooth function over a manifold are adapted from analogous methods in the unconstrained setting. Then, inspired by the ideas in [15], the two proposed strategies are extended to the nonsmooth case. The introduction of the geometric constraint presents significant challenges: in the Euclidean setting, the fixed vector space structure makes it natural for a fixed set of coordinate-like directions to consistently approximate desired directions by spanning the space in a uniform way. Since on a manifold this geometric structure changes from point to point, the poll directions must be carefully adjusted to track this change, at minimal computational expense. The associated convergence theory presents some novel results that could be of independent interest.

The remainder of this paper is organized as follows. In Sect. 2, some definitions are presented. In Sect. 3, a direct search method applicable for continuously differentiable f is presented, with a convergence proof. In Sect. 4, the case of f not being continuously differentiable but rather only Lipschitz continuous is considered. Some numerical results are presented in Sect. 5. Detailed proofs can be found in the Appendix.

The codes relevant to the numerical tests are available at the following link: https://github.com/DamianoZeffiro/riemannian-ds.

2 Definitions and Notation

This section introduces some notation for the formalism used in this article. The reader is referred to, e.g., [1, 8] for an overview of the relevant background.

Let \(\mathcal {M}\) be a smooth finite dimensional connected manifold.

The problem of interest here is

$$\begin{aligned} \min _{x \in \mathcal {M}} f(x) , \end{aligned}$$
(1)

with f being continuous and bounded below. Both the case of f(x) being continuously differentiable and a more general nonsmooth case are considered. For \(x\in \mathcal {M}\), let \(T_x\mathcal {M}\) be the tangent vector space at x and \(T\mathcal {M}\) be the tangent bundle \(\cup _{x\in \mathcal {M}} T_x\mathcal {M}\). \(\mathcal {M}\) is assumed to be a Riemannian manifold, so that for x in \(\mathcal {M}\), there is a scalar product \(\langle \cdot , \cdot \rangle _x: T_x\mathcal {M}\times T_x\mathcal {M}\rightarrow \mathbb {R}\) and a norm \(\Vert \cdot \Vert _x\) on \(T_x\mathcal {M}\) smoothly depending on x. Let \({{\,\textrm{dist}\,}}(\cdot , \cdot )\) be the distance induced by the scalar product, so that for \(x, y \in \mathcal {M}\) the distance \({{\,\textrm{dist}\,}}(x, y)\) is the length of the shortest geodesic connecting x and y. Furthermore, let \(\nabla _{\mathcal {M}}\) be the Levi-Civita connection for \(\mathcal {M}\) (see [8, Theorem 5.5] for a precise definition), and \(\Gamma : T\mathcal {M}\times \mathcal {M}\rightarrow T\mathcal {M}\) be a parallel transport with respect to \(\nabla _{\mathcal {M}}\), with \(\Gamma _x^y(v) \in T_y\mathcal {M}\) being the transport of the vector \(v \in T_x\mathcal {M}\) to \(T_y\mathcal {M}\) along a fixed curve connecting x and y. The parallel transport \(\Gamma \) is assumed to always operate along a distance-minimizing geodesic when one exists. Consequently, for any \(x \in \mathcal {M}\) there is a neighborhood U of x such that the parallel transport \(\Gamma _{y}^{z}(v)\) is well defined and depends smoothly on y, z, v for \(y, z \in U\) and \(v \in T_y\mathcal {M}\). Any nonuniqueness in the definition of \(\Gamma \) is either explicitly accounted for or inconsequential in the given context.

When \(\mathcal {M}\) is embedded in \(\mathbb {R}^n\), \(\mathsf P_x\) is defined as the orthogonal projection from \(\mathbb {R}^n\) to \(T_x\mathcal {M}\), and \(S(x, r) \subset \mathbb {R}^n\) as the sphere centered at x and with radius r.

\(\{a_k\}\) is used as a shorthand for \(\{a_k\}_{k \in I}\) when the index set I is clear from the context. The shorthand notations \(T_k\mathcal {M}, \mathsf P_k, \langle \cdot , \cdot \rangle _k, \Vert \cdot \Vert _k\), \(\Gamma _i^j\) are also employed, in place of \(T_{x_k}\mathcal {M}, \mathsf P_{x_k}, \langle \cdot , \cdot \rangle _{x_k}, \Vert \cdot \Vert _{x_k}\) and \(\Gamma _{x_i}^{x_j}\). For a given point \(x_0 \in \mathcal {M}\) serving as initialization of the algorithms presented in this manuscript, the sublevel set relative to \(f(x_0)\) is denoted by \(\mathcal {L}_0= \{ x \in \mathcal {M}\ | \ f(x) \le f(x_0)\}\). When there is no ambiguity on the value of x, \(\Vert \cdot \Vert \) is used instead of \(\Vert \cdot \Vert _x\).

The distance \({{\,\textrm{dist}\,}}^*\) is defined between vectors in different tangent spaces in a standard way using parallel transport (see, for instance, [7]): for \(x, y \in \mathcal {M}\), \(v \in T_x\mathcal {M}\) and \(w \in T_y\mathcal {M}\),

$$\begin{aligned} {{\,\textrm{dist}\,}}^*(v, w) = \left\| v - \Gamma _y^x w\right\| = \left\| w - \Gamma _x^y v\right\| , \end{aligned}$$
(2)

and for a sequence \(\{(y_k, v_k)\}\) in \(T\mathcal {M}\) the notation \(v_k \rightarrow v\), with \(v \in T_y\mathcal {M}\), means \(y_k \rightarrow y\) in \(\mathcal {M}\) and \({{\,\textrm{dist}\,}}^*(v_k, v) \rightarrow 0\). On compact subsets of \(\mathcal {M}\), for \({{\,\textrm{dist}\,}}(x, y)\) small enough the minimizing geodesic between x and y is uniquely defined, and consequently so are the parallel transport \(\Gamma \) and the distance \({{\,\textrm{dist}\,}}^*\). As is common in the Riemannian optimization literature (see, e.g., [2]), to define our tentative descent directions a retraction \(R: T\mathcal {M}\rightarrow \mathcal {M}\) is used. This retraction R is assumed to be in \(C^1(T\mathcal {M}, \mathcal {M})\), with

$$\begin{aligned} {{\,\textrm{dist}\,}}(R(x, d), x) \le L_r\left\| d\right\| , \end{aligned}$$
(3)

(true in any compact subset of \(T\mathcal {M}\) given the \(C^1\) regularity of R, without any further assumptions).

For a scalar-valued function \(f:\mathcal {M}\rightarrow \mathbb {R}\), the gradient \(\text {grad}f(x)\) is defined as the unique element of \(T_x\mathcal {M}\) such that for all \(v\in T_x\mathcal {M}\), it holds that

$$\begin{aligned} D f(x)[v] = \langle v,\text {grad}f(x)\rangle _x . \end{aligned}$$

When \(\mathcal {M}\) is embedded in \(\mathbb {R}^n\), the (Riemannian) gradient is simply the projection of the Euclidean gradient onto \(T_x\mathcal {M}\), i.e., \(\text {grad}f(x) = \mathsf P_x(\nabla f(x))\).
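To fix ideas, the following minimal sketch (in Python, not part of the formal development) instantiates the objects above for the unit sphere \(\mathcal {S}^{n-1}\) embedded in \(\mathbb {R}^n\): the tangent projection \(\mathsf P_x\), the metric projection retraction, and the Riemannian gradient obtained by projecting the Euclidean gradient. All function names are ours and purely illustrative; the same helpers are reused in the sketches accompanying the algorithms below.

```python
import numpy as np

def tangent_projection(x, v):
    """Orthogonal projection P_x(v) onto T_x S^{n-1} = {v : <x, v> = 0}."""
    return v - np.dot(x, v) * x

def retraction(x, d):
    """Metric projection retraction R(x, d) = (x + d) / ||x + d|| on the sphere."""
    y = x + d
    return y / np.linalg.norm(y)

def riemannian_gradient(egrad, x):
    """grad f(x) = P_x(egrad(x)) for an embedded submanifold with the induced metric."""
    return tangent_projection(x, egrad(x))

# Example: Rayleigh quotient f(x) = x^T A x restricted to the sphere.
rng = np.random.default_rng(0)
n = 5
A = rng.standard_normal((n, n)); A = (A + A.T) / 2
f = lambda x: x @ A @ x
egrad = lambda x: 2 * A @ x
x = rng.standard_normal(n); x /= np.linalg.norm(x)
print(abs(np.dot(x, riemannian_gradient(egrad, x))))  # ~0: the gradient lies in T_x
```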

3 Smooth Optimization Problems

In this section, methods for the solution of problem (1) with the objective \(f \in C^1(\mathcal {M})\) are considered. In particular, the gradient \(\text {grad}f(x)\) is assumed to be continuous along \(\mathcal {M}\) as a function of x.

3.1 Preliminaries

A Lipschitz continuous gradient assumption is first presented.

Assumption 1

There exists \(L_f>0\) such that for all \(x,y\in \mathcal {M}\)

$$\begin{aligned} {{\,\textrm{dist}\,}}^*(\text {grad}f(x), \text {grad}f(y)) = \left\| \Gamma _x^y \text {grad}f(x) - \text {grad}f(y)\right\| \le L_f{{\,\textrm{dist}\,}}(x, y) . \end{aligned}$$
(4)

The next assumption generalizes the standard descent property.

Assumption 2

There exists \(L>0\) so that for every \(x \in \mathcal {M}\cap \mathcal {L}_0, d \in T_x\mathcal {M}\)

$$\begin{aligned} f(R(x, d)) \le f(x) + \langle {\text {grad}} f(x), d\rangle + \frac{L}{2} \left\| d\right\| ^2 . \end{aligned}$$
(5)

Under suitable assumptions, the Lipschitz gradient property implies the generalized standard descent property.

Proposition 3.1

Assume that \(\mathcal {L}_0\) is compact, f is Lipschitz continuous and that R is a \(C^2\) retraction. Then, Assumption 1 implies Assumption 2.

The proof can be found in the Appendix. It should be noted that Proposition 3.1 is a key tool to extend convergence properties from the unconstrained case to the Riemannian case. To the best of our knowledge, this result is new to the literature. Under the stronger assumption that f has Lipschitz gradient as a function in \(\mathbb {R}^n\), the standard descent property (5) was proven for retractions in [9].

For each algorithm in this section, it is further assumed that, at each iteration k, a positive spanning set (as defined, e.g., in [11]) \(\{p_k^j\}_{j \in [1:K]}\) is available for the tangent space \(T_{k}\mathcal {M}\) (a concrete construction by projection of a fixed positive spanning set of the ambient space is sketched after Assumption 3 below and discussed in Sect. 3.4). This positive spanning set is assumed to stay bounded and not become degenerate during the algorithm, that is,

Assumption 3

There exists \(B>0\) such that

$$\begin{aligned} \max _{j \in [1:K]} \left\| p_k^j\right\| \le B, \end{aligned}$$
(6)

for every \(k \in \mathbb {N}\). Furthermore, there is a constant \(\tau > 0\) such that

$$\begin{aligned} \max _{j \in [1:K]} \langle r, p_k^j\rangle \ge \tau \left\| r\right\| , \end{aligned}$$
(7)

for every \(k \in \mathbb {N}\) and \(r \in T_{x_k}\mathcal {M}\).
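As a concrete illustration of Assumption 3, the sketch below builds a positive spanning set of the tangent space for the sphere example of Sect. 2 by projecting the fixed set \(\{\pm e_1,\dots ,\pm e_n\}\) onto \(T_x\mathcal {S}^{n-1}\), reusing the helpers defined there, and estimates the constant \(\tau \) in (7) by sampling. This is only a numerical check under our own naming, not a proof that (7) holds.

```python
import numpy as np
# Reuses tangent_projection from the sphere sketch in Sect. 2.

def positive_spanning_set(x):
    """Project the fixed positive spanning set {+-e_1, ..., +-e_n} of R^n onto T_x S^{n-1}."""
    n = x.size
    candidates = np.vstack([np.eye(n), -np.eye(n)])
    projected = [tangent_projection(x, e) for e in candidates]
    return [p for p in projected if np.linalg.norm(p) > 1e-12]   # drop degenerate directions

def empirical_tau(x, n_samples=1000, seed=0):
    """Monte Carlo estimate of the constant tau in (7) at the point x."""
    rng = np.random.default_rng(seed)
    ps = positive_spanning_set(x)
    worst = np.inf
    for _ in range(n_samples):
        r = tangent_projection(x, rng.standard_normal(x.size))
        r /= np.linalg.norm(r)                                   # random unit tangent vector
        worst = min(worst, max(np.dot(r, p) for p in ps))        # max_j <r, p^j>
    return worst
```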

3.2 Direct Search Algorithm

Here, the Riemannian Direct Search method based on Spanning Bases (RDS-SB) for smooth objectives is presented as Algorithm 1.

[Algorithm 1: Riemannian Direct Search based on Spanning Bases (RDS-SB); pseudocode figure not reproduced]

This procedure resembles the standard direct search algorithm for unconstrained derivative free optimization (see, e.g., [11, 21]) with two significant modifications. First, at every iteration a positive spanning set is computed for the current tangent vector space \(T_k\mathcal {M}\). As this space is expected to change at every iteration, it is not possible to use the same standard positive spanning sets appearing in the classic algorithms. Second, the candidate point \(x_k^j\) is computed by retracting the step \(\alpha _k p_k^j\) from the current tangent space \(T_k\mathcal {M}\) to the manifold, ensuring satisfaction of the geometric constraint.
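As an illustration only, the following sketch mirrors the loop just described for the sphere example of Sect. 2, reusing retraction and positive_spanning_set from the earlier sketches. The sufficient decrease test \(f(R(x_k, \alpha _k p_k^j)) \le f(x_k) - \gamma \alpha _k^2\) and the expansion/contraction factors \(\gamma _2> 1> \gamma _1 > 0\) follow the analysis below, but the exact polling and stepsize update rules of Algorithm 1 may differ.

```python
# Reuses retraction and positive_spanning_set from the previous sketches.

def rds_sb(f, x0, alpha0=1.0, gamma=1e-4, gamma1=0.5, gamma2=2.0, max_iter=500):
    """Direct search sketch in the spirit of RDS-SB: poll a positive spanning
    set of the current tangent space and accept a point on sufficient decrease."""
    x, alpha = x0, alpha0
    for _ in range(max_iter):
        success = False
        for p in positive_spanning_set(x):          # poll directions p_k^j
            y = retraction(x, alpha * p)            # tentative point on the manifold
            if f(y) <= f(x) - gamma * alpha**2:     # sufficient decrease test
                x, success = y, True
                break                               # opportunistic polling
        alpha = gamma2 * alpha if success else gamma1 * alpha
    return x
```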

3.3 Convergence Analysis

In this section, asymptotic global convergence of the method is shown. First, it is proved that the gradient norm at unsuccessful iterations must be bounded by a constant proportional to the stepsize (Lemma 3.2). This is a well-known condition in the unconstrained case (see, e.g., [30, Theorem 1]), extended to the Riemannian case thanks to Proposition 3.1. Given that the stepsize converges to zero, the bound implies that the gradient converges to zero along unsuccessful steps. It is then proved, using the Lipschitz continuity of the gradient, that the gradient converges to zero along successful steps as well.

The first lemma states a bound on the scalar product between the gradient and a poll direction that fails the sufficient decrease test.

Lemma 3.1

Let \(f\in C^1(\mathcal {M})\), let \(\{x_k\}\) be generated by Algorithm 1, and let Assumptions 2, 3 hold.

If \(f(R(x_k, \alpha _k p_k^j)) > f(x_k) - \gamma \alpha _k^2\), then

$$\begin{aligned} \alpha _k(LB^2/2 + \gamma ) > - \langle {\text {grad}} f(x_k), p_k^j\rangle . \end{aligned}$$
(8)

Proof

To start with, we have

$$\begin{aligned} f(x_k) - \gamma \alpha _k^2&< f(R(x_k, \alpha _k p_k^j)) \le f(x_k) + \alpha _k\langle \text {grad} f(x_k), p_k^j\rangle + \frac{L}{2} \alpha _k^2 \left\| p_k^j\right\| ^2 \\ &\le f(x_k) + \alpha _k \langle \text {grad} f(x_k), p_k^j\rangle + \frac{L}{2} \alpha _k^2 B^2 , \end{aligned}$$
(9)

where we used (5) in the second inequality, and (6) in the third one. The above inequality can be rewritten as

$$\begin{aligned} \alpha _k\langle {\text {grad}} f(x_k), p_k^j\rangle + \alpha _k^2 (LB^2/2 + \gamma ) > 0. \end{aligned}$$
(10)

Given that \(\alpha _k > 0\), the above is true if and only if

$$\begin{aligned} \alpha _k > - \frac{\langle {\text {grad}} f(x_k), p_k^j\rangle }{(LB^2/2 + \gamma )} , \end{aligned}$$
(11)

which rearranged gives the thesis. \(\square \)

From this, a bound on the gradient with respect to the stepsize is inferred.

Lemma 3.2

Let \(f\in C^1(\mathcal {M})\), let \(\{x_k\}\) be generated by Algorithm 1, and let Assumptions 2, 3 hold. If iteration k is unsuccessful, then

$$\begin{aligned} \left\| \text {grad}f(x_k)\right\| \le \frac{\alpha _k(LB^2/2 + \gamma )}{\tau } . \end{aligned}$$
(12)

Proof

If iteration k is unsuccessful, Eq. (8) must hold for every \(j \in [1:K]\). We obtain the thesis by applying the positive spanning property (7), with \(r = -\text {grad}f(x_k)\), to the RHS:

$$\begin{aligned} \alpha _k(LB^2/2 + \gamma ) > \max _{j \in [1:K]} - \langle {\text {grad}} f(x_k), p_k^j\rangle \ge \tau \left\| \text {grad}f(x_k)\right\| . \end{aligned}$$
(13)

\(\square \)

Finally, convergence of the gradient norm to zero is shown using the lemmas above and appropriate arguments regarding the stepsizes.

Theorem 3.1

Let \(f\in C^1(\mathcal {M})\), let \(\{x_k\}\) be generated by Algorithm 1, and let Assumptions 1, 2, 3 hold. Then

$$\begin{aligned} \lim _{k \rightarrow \infty } \left\| \text {grad}f(x_k)\right\| = 0 . \end{aligned}$$
(14)

Proof

To start with, it holds that \(\alpha _k \rightarrow 0\): the objective is bounded below and \(\{f(x_k)\}\) is nonincreasing, with \(f(x_{k + 1}) \le f(x_k) - \gamma \alpha _k^2\) if step k is successful, and so there can only be finitely many successful steps with \(\alpha _k \ge \varepsilon \) for any \(\varepsilon > 0\).

For a fixed \(\varepsilon > 0\), let \(\bar{k}\) be such that \(\alpha _k \le \varepsilon \) for every \(k \ge \bar{k}\). We now show that, for every \(\varepsilon > 0\) and \(k \ge \bar{k}\) large enough, we have

$$\begin{aligned} \left\| \text {grad}f(x_k)\right\| \le \varepsilon \left( \frac{ (LB^2/2 + \gamma )}{\tau } + L_fL_r B \frac{\gamma _2}{\gamma _2 - 1}\right) , \end{aligned}$$
(15)

which implies the thesis given that \(\varepsilon \) is arbitrary.

First, Eq. (15) is satisfied for \(k \ge \bar{k}\) if the step k is unsuccessful by Lemma 3.2:

$$\begin{aligned} \left\| \text {grad}f(x_k)\right\| \le \frac{\alpha _k(LB^2/2 + \gamma )}{\tau } \le \frac{\varepsilon (LB^2/2 + \gamma )}{\tau } , \end{aligned}$$
(16)

using \(\alpha _k \le \varepsilon \) in the second inequality.

If step k is successful, then let j be the minimum positive index such that step \(k + j\) is unsuccessful. Notice that such a j exists because \(\alpha _k\rightarrow 0\), which by the algorithm's construction implies the existence of an infinite subsequence of unsuccessful steps. We have that \(\alpha _{k + i} = \alpha _k \gamma _2^{i}\) for \(i \in [0:j - 1]\), and since \(\alpha _{k + j - 1} \le \varepsilon \), by backward induction we get \(\alpha _{k + i} \le \varepsilon \gamma _2^{i - j + 1}\). Therefore,

$$\begin{aligned} \sum _{i= 0}^{j - 1} \alpha _{k + i} \le \sum _{i = 0}^{j - 1} \varepsilon \gamma _2^{i - j + 1} \le \varepsilon \sum _{h= 0}^{\infty } \gamma _2^{-h} = \varepsilon \frac{\gamma _2}{\gamma _2 - 1} . \end{aligned}$$
(17)

Then,

$$\begin{aligned} {{\,\textrm{dist}\,}}(x_k, x_{k + j})&\le \sum _{i = 0}^{j - 1} {{\,\textrm{dist}\,}}(x_{k + i}, x_{k + i + 1}) = \sum _{i = 0}^{j - 1} {{\,\textrm{dist}\,}}(x_{k + i}, R(x_{k + i}, \alpha _{k + i}p_{k + i}^{j(k + i)} )) \\ &\le \sum _{i = 0}^{j - 1} L_r \alpha _{k + i}B \le L_rB \varepsilon \frac{\gamma _2}{\gamma _2 - 1} , \end{aligned}$$
(18)

where we used (3) together with (6) in the second inequality, and (17) in the third one.

In turn,

$$\begin{aligned} \left\| \text {grad}f(x_k)\right\|&\le {{\,\textrm{dist}\,}}^*(\text {grad}f(x_k), \text {grad}f(x_{k + j})) + \left\| \text {grad}f(x_{k + j})\right\| \\ &\le L_f {{\,\textrm{dist}\,}}(x_k, x_{k + j}) + \frac{ \varepsilon (LB^2/2 + \gamma )}{\tau } \\ &\le \varepsilon \left( \frac{ LB^2/2 + \gamma }{\tau } + L_fL_r B \frac{\gamma _2}{\gamma _2 - 1} \right) , \end{aligned}$$
(19)

where we used (4) and (16) with \(k + j\) instead of k for the first and second summand, respectively, in the second inequality, and (18) in the last one. \(\square \)

3.4 Incorporating an Extrapolation Linesearch

The works [25, 26] introduced the use of an extrapolating linesearch that tests the objective at points farther away from the current iterate than the tentative point obtained by direct search along a given direction (i.e., an element of the positive spanning set). Such a thorough exploration of the search directions ultimately yields better performance in practice, since longer objective-decreasing steps are computed. In this work, it is shown that the same technique can be applied to good effect in the Riemannian setting. In particular, in this section our Riemannian Direct Search with Extrapolation method based on Spanning Bases (RDSE-SB) for smooth objectives is presented. The scheme is described in detail as Algorithm 2, which can be viewed as a Riemannian version of [26, Algorithm 2].

The method uses a specific stepsize for each direction in the positive spanning set, so that instead of \(\alpha _k\) there is a set of stepsizes \(\{\alpha _k^j\}_{j \in [1:K]}\) for every \(k \in \mathbb {N}_0\). Furthermore, a retraction-based linesearch procedure (see Algorithm 3) is used to better explore a given direction in case a sufficient decrease in the objective is obtained.
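A plausible rendering of the retraction-based extrapolation linesearch (in the spirit of Algorithm 3, whose pseudocode figure is not reproduced here) is sketched below for the sphere example, reusing retraction from Sect. 2: the step is expanded by \(\gamma _2\) as long as the sufficient decrease condition keeps holding, consistently with the two exit cases used in Lemma 3.4. The exact bookkeeping of Algorithm 3 may differ.

```python
# Reuses retraction from the sphere sketch in Sect. 2.

def extrapolation_linesearch(f, x, p, alpha, gamma=1e-4, gamma2=2.0):
    """Retraction-based extrapolation linesearch sketch: return the largest tested
    stepsize along p giving sufficient decrease, or 0.0 if the initial step fails."""
    if f(retraction(x, alpha * p)) > f(x) - gamma * alpha**2:
        return 0.0                                   # initial step rejected
    while f(retraction(x, gamma2 * alpha * p)) <= f(x) - gamma * (gamma2 * alpha)**2:
        alpha *= gamma2                              # extrapolate while decrease holds
    return alpha
```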

When analyzing the RDSE-SB method, the same positive spanning set cannot be kept across different iterates, as is done in the unconstrained case (see [26, Algorithm 2, Steps 2 and 3]), because of the changes in the tangent space. Therefore, using the distance \(\text {dist}^*\) to compare vectors in different tangent spaces, a novel condition ensuring some continuity in the choice of the positive spanning sets is introduced here.

Assumption 4

There exists a constant \(L_{\Gamma }>0\) such that, for every \(k \in \mathbb {N}\) and \(j \in [1:K]\),

$$\begin{aligned} {{\,\textrm{dist}\,}}^*(p^j_k, p_{k + 1}^j) \le L_{\Gamma }{{\,\textrm{dist}\,}}(x_k, x_{k + 1}) . \end{aligned}$$
(20)

When \(\mathcal {M}\) is embedded in \(\mathbb {R}^n\) and \(\mathcal {L}_0\) is compact, it is easy to see that condition (20) holds if \(\{p_k^j\}_{j \in [1:K]}\) is the projection of a positive spanning set of \(\mathbb {R}^n\) (independent from k) into \(T_{k}\mathcal {M}\), using that \(T_x\mathcal {M}\) varies smoothly with x.
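For the sphere example, parallel transport along the minimizing geodesic has a well-known closed form, which makes \({{\,\textrm{dist}\,}}^*\) in (2), and hence condition (20), easy to check numerically. The helpers below are a sketch under that specific choice of manifold, with names of our own; they are not used in the analysis.

```python
import numpy as np
# Reuses tangent_projection and retraction from the sphere sketch in Sect. 2.

def parallel_transport(x, y, v):
    """Transport v in T_x S^{n-1} to T_y S^{n-1} along the minimizing geodesic
    (closed form for the sphere, valid whenever x != -y)."""
    c = np.dot(x, y)
    return v - (np.dot(y, v) / (1.0 + c)) * (x + y)

def dist_star(x, y, v, w):
    """dist*(v, w) = ||w - Gamma_x^y(v)|| as in (2), for v in T_x and w in T_y."""
    return np.linalg.norm(w - parallel_transport(x, y, v))

# Rough numerical check of (20) for one projected direction at two nearby points.
x = np.array([1.0, 0.0, 0.0])
y = retraction(x, np.array([0.0, 0.05, 0.0]))
e = np.array([0.0, 1.0, 0.0])
p_x, p_y = tangent_projection(x, e), tangent_projection(y, e)
print(dist_star(x, y, p_x, p_y), np.arccos(np.clip(np.dot(x, y), -1.0, 1.0)))  # LHS of (20) vs dist(x, y)
```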

It is now convenient to define, for \(k \le l\), \(\tilde{\Gamma }_k^l = \Gamma _{l-1}^{l} \circ \ldots \circ \Gamma _{k}^{k + 1}\), where for \(k = l\) the composition on the RHS is empty and we set \(\tilde{\Gamma }_k^l\) equal to the identity. Let also

$$\begin{aligned} d(k, l) = \sum _{i = 0}^{l - k - 1} {{\,\textrm{dist}\,}}(x_{k + i}, x_{k + i + 1}) . \end{aligned}$$
(21)

The following lemma, which links the directions of the positive spanning sets in different iterates, holds:

Lemma 3.3

Let \(f \in C^1(\mathcal {M})\), \(\{x_k\}\) be generated by Algorithm 2, and Assumptions 1, 3, 4 hold. For \(k \in \mathbb {N}\), \(j \ge 0\), \(i \in [1:K]\):

$$\begin{aligned} |\langle \text {grad}f(x_k), p_k^i\rangle - \langle \text {grad}f(x_{k + j}), p_{k + j}^i\rangle | \le L_{\Gamma } \left\| \text {grad}f(x_k)\right\| d(k, k + j) + B L_f d(k, k + j) . \end{aligned}$$
(22)

Proof

First,

$$\begin{aligned} \left\| \tilde{\Gamma }_k^{k + h}p^i_{k} - p^i_{k + h}\right\|&= \left\| \sum _{j = 0}^{h - 1}\left( \tilde{\Gamma }_{k + j}^{k + h}p^i_{k + j} - \tilde{\Gamma }_{k + j + 1}^{k + h}p_{k + j + 1}^i\right) \right\| \le \sum _{j = 0}^{h - 1} \left\| \tilde{\Gamma }_{k + j}^{k + h}p^i_{k + j} - \tilde{\Gamma }_{k + j + 1}^{k + h}p_{k + j + 1}^i\right\| \\ &= \sum _{j = 0}^{h - 1} \left\| \tilde{\Gamma }_{k + j + 1}^{k + h}\left( \Gamma _{k + j}^{k + j + 1}p^i_{k + j} - p_{k + j + 1}^i\right) \right\| = \sum _{j = 0}^{h - 1} \left\| \Gamma _{k + j}^{k + j + 1}p^i_{k + j} - p_{k + j + 1}^i\right\| \\ &\le \sum _{j = 0}^{h - 1} L_{\Gamma } {{\,\textrm{dist}\,}}(x_{k +j}, x_{k + j + 1}) = L_{\Gamma } d(k, k + h) , \end{aligned}$$
(23)

where we used (20) in the last inequality. Analogously, from (4) it follows

$$\begin{aligned} \left\| \text {grad}f(x_{k + h}) - \tilde{\Gamma }_k^{k + h} \text {grad}f(x_k)\right\| \le L_f d(k, k+ h) . \end{aligned}$$
(24)

We can then conclude

$$\begin{aligned} |\langle \text {grad}f(x_{k + h}), p^i_{k + h}\rangle - \langle \text {grad}f(x_k), p^i_k\rangle |&= |\langle \text {grad}f(x_{k + h}), p^i_{k + h}\rangle - \langle \tilde{\Gamma }_k^{k + h} \text {grad}f(x_k), \tilde{\Gamma }_k^{k + h}p^i_{k}\rangle | \\ &= |\langle \text {grad}f(x_{k + h}) - \tilde{\Gamma }_k^{k + h} \text {grad}f(x_k), p^i_{k + h}\rangle - \langle \tilde{\Gamma }_k^{k + h} \text {grad}f(x_k), \tilde{\Gamma }_k^{k + h}p^i_{k} - p^i_{k + h}\rangle | \\ &\le |\langle \text {grad}f(x_{k + h}) - \tilde{\Gamma }_k^{k + h} \text {grad}f(x_k), p^i_{k + h}\rangle | + |\langle \tilde{\Gamma }_k^{k + h} \text {grad}f(x_k), \tilde{\Gamma }_k^{k + h}p^i_{k} - p^i_{k + h}\rangle | \\ &\le \left\| \text {grad}f(x_{k + h}) - \tilde{\Gamma }_k^{k + h} \text {grad}f(x_k)\right\| \left\| p^i_{k + h}\right\| + \left\| \tilde{\Gamma }_k^{k + h} \text {grad}f(x_k)\right\| \left\| \tilde{\Gamma }_k^{k + h}p^i_{k} - p^i_{k + h}\right\| \\ &\le B L_f d(k, k + h) + L_{\Gamma } d(k, k+ h) \left\| \text {grad}f(x_k)\right\| , \end{aligned}$$
(25)

where we used (23), (24) and (6) in the last inequality. \(\square \)

[Algorithm 2: Riemannian Direct Search with Extrapolation based on Spanning Bases (RDSE-SB); pseudocode figure not reproduced]
[Algorithm 3: Retraction-based extrapolation linesearch; pseudocode figure not reproduced]

Asymptotic convergence of this method is proved in the remaining part of this section.

Lemma 3.4

Let \(f\in C^1(\mathcal {M})\), \(\{x_k\}\) generated by Algorithm 2, and let Assumptions 2, 3 hold. At every iteration k, the following inequality holds:

$$\begin{aligned} -\langle \text {grad}f(x_k), p_k^{j(k)}\rangle < \tilde{\alpha }_{k + 1}^{j(k)} \frac{\gamma _2}{\gamma _1} (LB^2/2 + \gamma ). \end{aligned}$$
(26)

Proof

It is immediate to check that we must always have

$$\begin{aligned} f(R(x_k, \Delta _k p_k^{j(k)})) > f(x_k) - \gamma \Delta _k^2, \end{aligned}$$
(27)

for \(\Delta _k = \frac{1}{\gamma _1} \tilde{\alpha }_{k + 1}^{j(k)}\) if the linesearch procedure terminates at the second line, and \(\Delta _k = \gamma _2\tilde{\alpha }_{k + 1}^{j(k)} \) if the linesearch procedure terminates in the last line. Then in both cases

$$\begin{aligned} -\langle \text {grad}f(x_k), p_k^{j(k)}\rangle < \Delta _k (LB^2/2 + \gamma ) \le \tilde{\alpha }_{k + 1}^{j(k)} \frac{\gamma _2}{\gamma _1} (LB^2/2 + \gamma ) , \end{aligned}$$
(28)

where we used Lemma 3.1 in the first inequality. \(\square \)

Assumption 4 makes it possible to extend [26, Proposition 5.2] to the Riemannian case.

Theorem 3.2

Let \(f\in C^1(\mathcal {M})\), \(\{x_k\}\) be generated by Algorithm 2, and let Assumptions 1, 2, 3 and 4 hold. We have

$$\begin{aligned} \lim _{k \rightarrow \infty } \left\| \text {grad}f(x_k)\right\| = 0 . \end{aligned}$$
(29)

Proof

Let \(\bar{\alpha }_k = \max _{j \in [1:K]} \tilde{\alpha }_{k + 1}^{j}\), so that \( \bar{\alpha }_k \rightarrow 0\) since \(\tilde{\alpha }_{k}^{j} \rightarrow 0\) for every \(j \in [1:K]\), reasoning as in the proof of Theorem 3.1. As a consequence of Lemma 3.4, we have

$$\begin{aligned} -\langle \text {grad}f(x_k), p_k^{j(k)}\rangle < \bar{\alpha }_k c_1 , \end{aligned}$$
(30)

for the constant \(c_1 = \frac{\gamma _2}{\gamma _1} (LB^2/2 + \gamma )\) independent from j(k).

It remains to bound \(\langle \text {grad}f(x_k), p_k^i\rangle \) for \(i \ne j(k)\). To start with, we have the following bound:

$$\begin{aligned} -\langle \text {grad}f(x_{k}), p^i_k\rangle&\le -\langle \text {grad}f(x_{k + h}), p^i_{k + h}\rangle + |\langle \text {grad}f(x_{k + h}), p^i_{k + h}\rangle - \langle \text {grad}f(x_{k}), p^i_k\rangle | \\ &\le c_1 \bar{\alpha }_{k + h} + |\langle \text {grad}f(x_{k + h}), p^i_{k + h}\rangle - \langle \text {grad}f(x_{k}), p^i_k\rangle | , \end{aligned}$$
(31)

for \(h \le K\) such that \(i = j(k + h)\), and where in the second inequality we used (30) with \(k + h\) instead of k. For the second summand appearing in the RHS of (31), from Lemma 3.3 it follows

$$\begin{aligned} |\langle \text {grad}f(x_{k + h}), p^i_{k + h}\rangle - \langle \text {grad}f(x_k), p^i_k\rangle | \le B L_f d(k, k + h) + L_{\Gamma } \left\| \text {grad}f(x_k)\right\| d(k, k + h) . \end{aligned}$$
(32)

We can now bound \(d(k, k + h)\) as follows

$$\begin{aligned} d(k, k + h)&= \sum _{l = 0} ^ {h - 1} {{\,\textrm{dist}\,}}(x_{k + l + 1}, x_{k + l}) = \sum _{l= 0}^{h - 1} {{\,\textrm{dist}\,}}(x_{k + l}, R(x_{k + l}, \bar{\alpha }_{k + l} p_{k + l}^{j(k + l)})) \\ &\le \sum _{l= 0}^{h - 1} L_r\bar{\alpha }_{k + l} \left\| p_{k + l}^{j(k + l)}\right\| \le B L_r\sum _{l= 0}^{h - 1} \bar{\alpha }_{k + l} \le hBL_r\max _{l \in [0:h-1]} \bar{\alpha }_{k + l} \le KBL_r\max _{l \in [0:K]} \bar{\alpha }_{k + l} , \end{aligned}$$
(33)

where we used (3) in the first inequality, (6) in the second one, and \(h \le K\) in the last one.

Let \(\Delta _k = \max _{l \in [0:K]} \bar{\alpha }_{k + l} \), so that in particular \(\Delta _k \rightarrow 0\).

For every \(i \in [1:K]\):

$$\begin{aligned} - \langle \text {grad}f(x_{k}), p^i_k\rangle&\le c_1 \bar{\alpha }_{k + h} + B L_f d(k, k + h) + L_{\Gamma } \left\| \text {grad}f(x_k)\right\| d(k, k + h) \\ &\le c_2 \Delta _k + c_3 \Delta _k \left\| \text {grad}f(x_k)\right\| , \end{aligned}$$
(34)

for \(c_2 = c_1 + L_fB^2KL_r\) and \(c_3 = KBL_rL_{\Gamma } \). Then, applying (7) and (34), we get

$$\begin{aligned} \tau \left\| \text {grad}f(x_k)\right\| \le \max _{i \in [1:K]} -\langle \text {grad}f(x_{k}), p^i_k\rangle \le c_2 \Delta _k + c_3 \Delta _k \left\| \text {grad}f(x_k)\right\| \end{aligned}$$
(35)

and rearranging, for k large enough so that \(\tau - c_3 \Delta _k > 0\),

$$\begin{aligned} \left\| \text {grad}f(x_k)\right\| \le \frac{c_2 \Delta _k }{\tau - c_3 \Delta _k} \rightarrow 0 , \end{aligned}$$
(36)

as desired. \(\square \)

4 Nonsmooth Objectives

In this section, some direct search methods are studied in the context where f is Lipschitz continuous, and bounded from below, but not necessarily continuously differentiable. The algorithms detailed here are built around the ideas given in [15], where the authors consider direct search methods for nonsmooth objectives in Euclidean space.

4.1 Clarke Stationarity for Nonsmooth Functions on Riemannian Manifolds

In order to perform our analysis, a definition of the Clarke directional derivative at a point \(x \in \mathcal {M}\) is needed. The standard approach is to write the function in coordinate charts and take the standard Clarke derivative in a Euclidean space (see, e.g., [19, 20]). Formally, given a chart \((\varphi , U)\) at \(x \in \mathcal {M}\) and \(v \in T_x\mathcal {M}\),

$$\begin{aligned} f^{\circ }(x; v) = \tilde{f}^{\circ }(\varphi (x); \textrm{d} \varphi (x)v) , \end{aligned}$$
(37)

for \(\tilde{f}(y) = f(\varphi ^{- 1}(y))\). The following lemma shows the relationship between definition (37) and a directional-derivative-like object defined via retractions. This nontrivial result is the key tool allowing us to extend the analysis of direct search methods on \(\mathbb {R}^n\) to the Riemannian setting.

Lemma 4.1

Let f be Lipschitz continuous. If \((y_k, q_k) \rightarrow (x, d)\) and \(t_k \rightarrow 0\),

$$\begin{aligned} f^{\circ }(x; d) \ge \limsup _{k \rightarrow \infty } \frac{f(R(y_k, t_kq_k)) - f(y_k)}{t_k} . \end{aligned}$$
(38)

The proof is rather technical and thus deferred to the Appendix.

4.2 Refining Subsequences

The definition of refining subsequence used in the analysis of direct search methods (see, e.g., [3, 15]) is adapted here to the Riemannian setting. Let \((x_k, d_k)\) be a sequence in \(T\mathcal {M}\).

Definition 4.1

The subsequence \(\{x_{i(k)}\}\) is refining if \(x_{i(k)} \rightarrow x^* \) and iteration i(k) is unsuccessful for every k. In this case, the limit \(x^*\) is called a refined point.

Definition 4.2

Given a refining subsequence \(\{x_{i(k)}\}\) with refined point \(x^*\), a direction \(d \in T_{x^*}\mathcal {M}\) with \(\Vert d\Vert _{x^*} = 1\) is said to be a refining direction if for a further subsequence \(\{j(i(k))\}\)

$$\begin{aligned} \lim _{k \rightarrow \infty } {{\,\textrm{dist}\,}}^*(d_{j(i(k))}, d) = 0 . \end{aligned}$$
(39)

A sufficient condition for every unit-norm direction at a refined point to be refining is now given, assuming that the manifold is embedded in \(\mathbb {R}^n\) and that the directions are obtained by projecting vectors of the unit sphere onto the tangent spaces.

Proposition 4.1

If \(\{x_{i(k)}\}\) is a refining subsequence, \(\{\bar{d}_{i(k)}\}\) is dense in the unit sphere, and

$$\begin{aligned} d_{i(k)} = \frac{\mathsf P_{k}(\bar{d}_{i(k)})}{\Vert \mathsf P_k(\bar{d}_{i(k)})\Vert _k} , \end{aligned}$$

whenever \(\mathsf P_k(\bar{d}_{i(k)}) \ne 0\), with \(d_{i(k)} = 0\) otherwise, then every \(d \in T_{x^*}\mathcal {M}\) with \(\Vert d\Vert _{x^*} = 1\) is a refining direction.

Proof

Fix \(d \in T_{x^*} \mathcal {M}\), with \(\Vert d\Vert _{x^*} = 1\), and let \(\bar{d} = d/ \Vert d\Vert \), where \(\Vert \cdot \Vert \) denotes the Euclidean norm of the ambient space, so that \(\bar{d}\) belongs to the unit sphere. By density, \(\bar{d}_{j(i(k))} \rightarrow \bar{d}\) for a proper choice of the subsequence \(\{j(i(k))\}\). Then,

$$\begin{aligned} \lim _{k \rightarrow \infty } d_{j(i(k))} = \lim _{k \rightarrow \infty } \frac{\mathsf P_k(\bar{d}_{j(i(k))})}{\left\| \mathsf P_k(\bar{d}_{j(i(k))})\right\| _k} = \frac{\mathsf P_{x^*}(\bar{d})}{\left\| \mathsf P_{x^*}(\bar{d})\right\| _{x^*}} = \frac{\bar{d}}{\left\| \bar{d}\right\| _{x^*}} = d , \end{aligned}$$
(40)

where in the second equality we used the continuity of \(\mathsf P_x\) and of the norm \(\Vert \cdot \Vert _x\), and in the third equality we used \(\mathsf P_{x^*}(\bar{d}) = \bar{d}\) since \(\bar{d} \in T_{x^*} \mathcal {M}\) by construction. \(\square \)

4.3 Direct Search for Nonsmooth Objectives

Our Riemannian Direct Search method based on Dense Directions (RDS-DD) for nonsmooth objectives is presented here. The scheme is presented in detail as Algorithm 4. The algorithm performs three simple steps at iteration k. First, a search direction is selected randomly in the current tangent space. Then, a tentative point is generated by retracting the step \(\alpha _k d_k\) from the tangent space to the manifold. Such a point is accepted as the new iterate if a sufficient decrease condition on the objective function is satisfied, in which case the stepsize is expanded; otherwise, the iterate stays the same and the stepsize is reduced.

[Algorithm 4: Riemannian Direct Search based on Dense Directions (RDS-DD); pseudocode figure not reproduced]
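For illustration, a sketch of such an iteration on the sphere example is given below, reusing the helpers from the earlier sketches; the dense directions are obtained by normalizing Gaussian samples, which are dense in the unit sphere with probability one, and the stepsize update factors are our own placeholders rather than the exact rules of Algorithm 4.

```python
import numpy as np
# Reuses tangent_projection and retraction from the sphere sketch in Sect. 2.

def rds_dd(f, x0, alpha0=1.0, gamma=1e-4, gamma1=0.5, gamma2=2.0,
           max_iter=2000, seed=0):
    """Sketch in the spirit of RDS-DD: one dense random direction per iteration,
    projected onto the current tangent space and tested for sufficient decrease."""
    rng = np.random.default_rng(seed)
    x, alpha = x0, alpha0
    for _ in range(max_iter):
        d_bar = rng.standard_normal(x.size)
        d_bar /= np.linalg.norm(d_bar)              # dense in the unit sphere (a.s.)
        d = tangent_projection(x, d_bar)
        if np.linalg.norm(d) > 0:
            d /= np.linalg.norm(d)                  # normalization as in Proposition 4.1
        y = retraction(x, alpha * d)
        if f(y) <= f(x) - gamma * alpha**2:         # sufficient decrease: successful step
            x, alpha = y, gamma2 * alpha
        else:                                       # unsuccessful step
            alpha = gamma1 * alpha
    return x
```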

Thanks to the theoretical tools previously introduced, and in particular to the relation between retractions and the Clarke directional derivative proved in Lemma 4.1, it is shown in a straightforward way that a suitable subsequence of unsuccessful iterations of the RDS-DD method converges to a Clarke stationary point.

Theorem 4.1

Let f be Lipschitz continuous and \(\{x_k\}\) be generated by Algorithm 4. If \(\{x_{i(k)}\}\) is refining, with \( x_{i(k)} \rightarrow x^* \), and every \(d \in T_{x^*}\mathcal {M}\) with \(\Vert d\Vert _{x^*} = 1\) is a refining direction, then \(x^*\) is Clarke stationary.

Proof

By the same arguments as in the smooth case, \(\alpha _k \rightarrow 0\) and in particular \(\alpha _{i(k)} \rightarrow 0\). Since by assumption i(k) is an unsuccessful step, we have, for every i(k),

$$\begin{aligned} f(R(x_{i(k)}, \alpha _{i(k)} d_{i(k)})) - f(x_{i(k)}) > -\gamma \alpha _{i(k)}^2 . \end{aligned}$$
(41)

Let \(d \in T_{x^*}\mathcal {M}\) with \(\Vert d\Vert _{x^*} = 1\), let \(\{j(i(k)) \}\) be such that \(d_{j(i(k))} \rightarrow d\), and let \(y_k = x_{j(i(k))} \), \(q_k = d_{j(i(k))}\), \(t_k = \alpha _{j(i(k))}\). We have

$$\begin{aligned} \limsup _{k \rightarrow \infty } \frac{f(R(y_k, t_kq_k)) - f(y_k)}{t_k} \ge \limsup _{k \rightarrow \infty } -\gamma t_k = 0 , \end{aligned}$$
(42)

thanks to (41), and by applying Lemma 4.1 we get

$$\begin{aligned} f^{\circ }(x^*; d) \ge \limsup _{k \rightarrow \infty } \frac{f(R(y_k, t_kq_k)) - f(y_k)}{t_k} \ge 0 , \end{aligned}$$
(43)

which implies the thesis since d is arbitrary. \(\square \)

4.4 Direct Search with Linesearch Extrapolation for Nonsmooth Objectives

Our Riemannian Direct Search method with linesearch Extrapolation based on Dense Directions (RDSE-DD) for nonsmooth objectives is presented here. It can be seen as an extension to the Riemannian setting of the \(\text {DFN}_{simple}\) algorithm introduced in [15] for the Euclidean setting with bound constraints. The detailed scheme is given in Algorithm 5. The algorithm performs just two simple steps at iteration k. First, a given search direction is suitably projected onto the current tangent space. Then, a linesearch is performed using Algorithm 3, in the hope of obtaining a new point that guarantees a sufficient decrease.

[Algorithm 5: Riemannian Direct Search with Extrapolation based on Dense Directions (RDSE-DD); pseudocode figure not reproduced]
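A sketch in the same spirit is given below, combining the projected dense directions with the extrapolation linesearch sketched in Sect. 3.4; how the tentative stepsize is carried over between iterations is our own guess and may differ from Algorithm 5.

```python
import numpy as np
# Reuses tangent_projection, retraction, and extrapolation_linesearch from earlier sketches.

def rdse_dd(f, x0, alpha0=1.0, gamma=1e-4, gamma1=0.5, gamma2=2.0,
            max_iter=2000, seed=0):
    """Sketch in the spirit of RDSE-DD: project a dense random direction onto the
    tangent space and explore it with the extrapolation linesearch."""
    rng = np.random.default_rng(seed)
    x, alpha = x0, alpha0
    for _ in range(max_iter):
        d = tangent_projection(x, rng.standard_normal(x.size))
        if np.linalg.norm(d) > 0:
            d /= np.linalg.norm(d)
        step = extrapolation_linesearch(f, x, d, alpha, gamma=gamma, gamma2=gamma2)
        if step > 0:                                 # sufficient decrease found
            x, alpha = retraction(x, step * d), step
        else:                                        # reduce the tentative stepsize
            alpha = gamma1 * alpha
    return x
```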

Once again, by exploiting the theoretical tools previously introduced, it is proved in a straightforward way that a suitable subsequence of the RDSE-DD iterates converges to a Clarke stationary point. Thanks to the use of the linesearch strategy, the following result is not restricted to unsuccessful iterations. Since Algorithm 5 has no unsuccessful iterations, for the purposes of Definition 4.1 every convergent subsequence it generates is considered refining.

Theorem 4.2

Let f be Lipschitz continuous and \(\{x_k\}\) be generated by Algorithm 5. If \(\{x_{i(k)}\}\) is refining, with \( x_{i(k)} \rightarrow x^* \) and every \(d \in T_{x^*}\mathcal {M}\) with \(\Vert d\Vert _{x^*} = 1\) is a refining direction, then \(x^*\) is Clarke stationary.

Proof

Let \(\beta _k = \tilde{\alpha }_{k + 1}/\gamma _1\) if the linesearch procedure exits before the loop, and \(\beta _k = \gamma _2 \tilde{\alpha }_{k + 1}\) otherwise, so that in particular \(\beta _k \rightarrow 0\). Then, by definition of the linesearch procedure, for every k

$$\begin{aligned} f(R(x_k, \beta _k d_k)) - f(x_k) > -\gamma \beta _k^2 . \end{aligned}$$
(44)

The rest of the proof is analogous to that of Theorem 4.1. \(\square \)

5 Numerical Results

In this section, results of numerical experiments with the algorithms described in this paper on a set of simple but illustrative example problems are presented. The comparison among the algorithms is carried out using data and performance profiles [27]. Specifically, let S be a set of algorithms and P a set of problems. For each \(s\in S\) and \(p \in P\), let \(t_{p,s}\) be the number of function evaluations required by algorithm s on problem p to satisfy the condition

$$\begin{aligned} f(x_k) \le f_L + \tau (f(x_0) - f_L) , \end{aligned}$$
(45)

where \(0< \tau < 1\) and \(f_L\) is the best objective function value achieved by any solver on problem p. Then, the performance and data profiles of solver s are defined, respectively, by the following functions

$$\begin{aligned} \rho _s(\alpha )&= \frac{1}{|P|}\left| \left\{ p\in P: \frac{t_{p,s}}{\min \{t_{p,s'}:s'\in S\}}\le \alpha \right\} \right| ,\\ d_s(\kappa )&= \frac{1}{|P|}\left| \left\{ p\in P: t_{p,s}\le \kappa (n_p+1)\right\} \right| , \end{aligned}$$

where \(n_p\) is the dimension of problem p.
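For completeness, both profiles can be computed directly from a matrix of the values \(t_{p,s}\). The short sketch below assumes such a matrix has already been collected (with np.inf when a solver never satisfies (45)); it is only meant to make the two definitions concrete and is not the evaluation code used for the figures.

```python
import numpy as np

def profiles(t, n_p, alphas, kappas):
    """Performance and data profiles from a matrix t of shape (|P|, |S|), where
    t[p, s] is the number of evaluations solver s needs on problem p to satisfy
    (45) (np.inf if never satisfied; each problem is solved by at least one solver)."""
    best = t.min(axis=1, keepdims=True)                                 # best solver per problem
    rho = np.array([(t / best <= a).mean(axis=0) for a in alphas])      # rho_s(alpha), one row per alpha
    d = np.array([(t <= k * (n_p[:, None] + 1)).mean(axis=0) for k in kappas])  # d_s(kappa)
    return rho, d
```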

A budget of \(100(n_p+1)\) function evaluations is used in all cases, together with two levels of precision for condition (45), namely \(\tau \in \{10^{-1},10^{-3}\}\). Randomly generated instances of well-known optimization problems over manifolds from [1, 8, 18] are considered. A brief description of those problems as well as the details of our implementation can be found in the Appendix (see Sects. 7.3, 7.4 and 7.5). The size of the ambient space for the instances varies from 2 to 200. In the results, the problems are split by ambient space dimension: between 2 and 15 for small instances, between 16 and 50 for medium instances, and between 51 and 200 for large instances.

5.1 Smooth Problems

In Fig. 1, the results related to 8 smooth instances of problem (1) from [1, 8] are included, each with 15 different problem dimensions (from 2 to 200), for a total number of 60 tested instances, split as described above. Our methods, that is, RDS-SB and RDSE-SB, are compared with the zeroth-order gradient descent (ZO-RGD, [23, Algorithm 1]).

The results clearly show that RDSE-SB performs better than RDS-SB and ZO-RGD both in efficiency and reliability for both levels of precision. It can also be seen how the gap between RDSE-SB and the other two algorithms gets larger as the problem dimension grows.

Fig. 1: From top to bottom, results for small, medium, and large instances in the smooth case [figure not reproduced]

5.2 Nonsmooth Problems

Here, a preliminary comparison is reported between a direct search strategy, a linesearch strategy, and ZO-RGD on two nonsmooth instances of (1) from [18], each with 15 different problem sizes (from 2 to 200), thus getting a total number of 30 tested instances, split by dimension as for smooth instances. It should be noted that while in the unconstrained setting the performance of zeroth-order (sub)gradient descent methods on nonsmooth objectives has been analyzed (see, e.g., [28]), there are, to the best of our knowledge, no convergence guarantees in the Riemannian setting.

In the direct search strategy (RDS-DD+), the RDS-SB method is applied until \(\alpha _{k + 1} \le \alpha _{\epsilon }\), at which point the nonsmooth version RDS-DD is used. Analogously, in the linesearch strategy (RDSE-DD+), the RDSE-SB method is applied until \(\max _{j \in [1:K]} \tilde{\alpha }_{k + 1}^j \le \alpha _{\epsilon }\), at which point the nonsmooth version RDSE-DD is used. Both strategies use a threshold parameter \(\alpha _{\epsilon } > 0\) to switch from the smooth to the nonsmooth DFO algorithm. The reader is referred to [15] and references therein for other direct search strategies combining coordinate and dense directions.

In Fig. 2, the comparison between the considered strategies is reported. As in the smooth case, the linesearch-based strategy outperforms both the simple direct search strategy and ZO-RGD. It can once again be seen how the gap between the algorithms grows with the problem dimension.

Fig. 2: From top to bottom, results for small, medium, and large instances in the nonsmooth case [figure not reproduced]

6 Conclusion

In this paper, direct search algorithms, with and without an extrapolation linesearch, for minimizing functions over a Riemannian manifold are presented. It was found that, modulo modifications accounting for the vector space structure changing across iterations, direct search strategies provide convergence guarantees for both smooth and nonsmooth objectives. It was also found that, in our numerical experiments, the extrapolation linesearch speeds up direct search in both cases, and even outperforms a gradient-approximation-based zeroth-order Riemannian algorithm in the smooth case. A natural direction for future work is the extension to the stochastic setting.