1 Introduction

Riemannian optimization, i.e., the solution of minimization problems whose decision variables are constrained to lie on a Riemannian manifold, is an important and active area of research: numerous problems in data science, robotics, and other settings possess a geometric structure characterizing the allowable inputs. Derivative free optimization (DFO), or zeroth-order optimization, concerns algorithms that rely only on function evaluations rather than gradient computations, and is designed for applications where accurate approximations of the gradient are unavailable due to noise or high computational cost. This paper specializes existing direct search DFO algorithms to Riemannian optimization problems. For background on Riemannian optimization and on DFO, see, e.g., [1] and [6, 11, 22], respectively.

Direct search methods (see, e.g., [21] and references therein) belong to the class of derivative free algorithms that build neither models of the objective nor gradient approximations. They are therefore particularly suitable for problems in which function evaluations come from a black box and little prior information is available to suggest how accurate different interpolation models would be, since both the differentiability and the conditioning properties of the function are unknown.

To the best of our knowledge, thorough studies of derivative free optimization (DFO) on Riemannian manifolds have appeared only recently in the literature. The closest related work is the direct search method confined to a subspace presented in [4]. In [23], the authors focus on a model-based method using a two-point function approximation for the gradient. The paper [31] presents a specialized Polak–Ribière–Polyak procedure for finding a zero of a tangent vector field on a Riemannian manifold. In [13], it is claimed that the convergence analysis of mesh adaptive direct search methods (MADS; see, e.g., [5, 6]) for unconstrained objectives can be extended to the case of Riemannian manifolds using the exponential map. In the subsequent work [14], the author focuses on a specific class of manifolds (reductive homogeneous spaces, including several matrix manifolds), discussing in more detail how, thanks to the properties of exponential maps, a straightforward extension of MADS is possible at least for that class. Some nonsmooth problems on Riemannian manifolds and references to derivative free optimization methods without convergence analyses can be found in [18].

Thus, our paper presents the first analysis of retraction-based direct search strategies on Riemannian manifolds, and the first analysis of a direct search algorithm for minimizing nonsmooth objectives in Riemannian optimization. In particular, a classic direct search scheme (see, e.g., [11, 21]) and a linesearch-based scheme (see, e.g., [12, 24,25,26] for further details on this class of methods) for the minimization of a given smooth function over a manifold are adapted from analogous methods in the unconstrained setting. Then, inspired by the ideas in [15], the two proposed strategies are extended to the nonsmooth case. The introduction of the geometric constraint presents significant challenges: in the Euclidean setting, the fixed vector space structure makes it natural for a fixed set of coordinate-like directions to consistently approximate desired directions by spanning the space in a uniform way. Since on a manifold this geometric structure changes from point to point, the poll directions must be carefully adjusted to track this change, at minimal computational expense. The associated convergence theory presents some novel results that could be of independent interest.

The remainder of this paper is organized as follows. In Sect. 2, some definitions are presented. In Sect. 3, a direct search method applicable for continuously differentiable f is presented, with a convergence proof. In Sect. 4, the case of f not being continuously differentiable but rather only Lipschitz continuous is considered. Some numerical results are presented in Sect. 5. Detailed proofs can be found in the Appendix.

The codes relevant to the numerical tests are available at the following link: https://github.com/DamianoZeffiro/riemannian-ds.

2 Definitions and Notation

This section introduces some notation for the formalism used in this article. The reader is referred to, e.g., [1, 8] for an overview of the relevant background.

Let \(\mathcal {M}\) be a smooth finite dimensional connected manifold.

The problem of interest here is

$$\begin{aligned} \min _{x \in \mathcal {M}} f(x) , \end{aligned}$$
(1)

with f being continuous and bounded below. Both the case of f(x) being continuously differentiable and a more general nonsmooth case are considered. For \(x\in \mathcal {M}\), let \(T_x\mathcal {M}\) be the tangent vector space at x and \(T\mathcal {M}\) be the tangent bundle \(\cup _{x\in \mathcal {M}} T_x\mathcal {M}\). \(\mathcal {M}\) is assumed to be a Riemannian manifold, so that for x in \(\mathcal {M}\), there is a scalar product \(\langle \cdot , \cdot \rangle _x: T_x\mathcal {M}\times T_x\mathcal {M}\rightarrow \mathbb {R}\) and a norm \(\Vert \cdot \Vert _x\) on \(T_x\mathcal {M}\) smoothly depending on x. Let \({{\,\textrm{dist}\,}}(\cdot , \cdot )\) be the distance induced by the scalar product, so that for \(x, y \in \mathcal {M}\) the distance \({{\,\textrm{dist}\,}}(x, y)\) is the length of the shortest geodesic connecting x and y. Furthermore, let \(\nabla _{\mathcal {M}}\) be the Levi-Civita connection for \(\mathcal {M}\) (see [8, Theorem 5.5] for a precise definition), and \(\Gamma : T\mathcal {M}\times \mathcal {M}\rightarrow T\mathcal {M}\) be a parallel transport with respect to \(\nabla _{\mathcal {M}}\), with \(\Gamma _x^y(v) \in T_y\mathcal {M}\) being the transport of the vector \(v \in T_x\mathcal {M}\) to \(T_y\mathcal {M}\) along a fixed curve connecting x and y. The parallel transport \(\Gamma \) is assumed to always operate along a distance-minimizing geodesic when one exists. Consequently, for any \(x \in \mathcal {M}\) there is a neighborhood U of x such that the parallel transport \(\Gamma _{y}^{z}(v)\) is well defined and depends smoothly on y, z, v for \(y, z \in U\) and \(v \in T_y\mathcal {M}\). Any nonuniqueness in the definition of \(\Gamma \) is either explicitly accounted for or inconsequential in the given context.

When \(\mathcal {M}\) is embedded in \(\mathbb {R}^n\), \(\mathsf P_x\) is defined as the orthogonal projection from \(\mathbb {R}^n\) to \(T_x\mathcal {M}\), and \(S(x, r) \subset \mathbb {R}^n\) as the sphere centered at x and with radius r.

\(\{a_k\}\) is used as a shorthand for \(\{a_k\}_{k \in I}\) when the index set I is clear from the context. The shorthand notations \(T_k\mathcal {M}, \mathsf P_k, \langle \cdot , \cdot \rangle _k, \Vert \cdot \Vert _k\), \(\Gamma _i^j\) are also employed, in place of \(T_{x_k}\mathcal {M}, \mathsf P_{x_k}, \langle \cdot , \cdot \rangle _{x_k}, \Vert \cdot \Vert _{x_k}\) and \(\Gamma _{x_i}^{x_j}\). For a given point \(x_0 \in \mathcal {M}\) serving as initialization of the algorithms presented in this manuscript, the sublevel set relative to \(f(x_0)\) is denoted by \(\mathcal {L}_0= \{ x \in \mathcal {M}\ | \ f(x) \le f(x_0)\}\). When there is no ambiguity on the value of x, \(\Vert \cdot \Vert \) is used instead of \(\Vert \cdot \Vert _x\).

The distance \({{\,\textrm{dist}\,}}^*\) is defined between vectors in different tangent spaces in a standard way using parallel transport (see, for instance, [7]): for \(x, y \in \mathcal {M}\), \(v \in T_x\mathcal {M}\) and \(w \in T_y\mathcal {M}\),

$$\begin{aligned} {{\,\textrm{dist}\,}}^*(v, w) = \left\| v - \Gamma _y^x w\right\| = \left\| w - \Gamma _x^y v\right\| , \end{aligned}$$
(2)

and for a sequence \(\{(y_k, v_k)\}\) in \(T\mathcal {M}\) the notation \(v_k \rightarrow v\), with \(v \in T_y\mathcal {M}\), means \(y_k \rightarrow y\) in \(\mathcal {M}\) and \({{\,\textrm{dist}\,}}^*(v_k, v) \rightarrow 0\). On compact subsets of \(\mathcal {M}\), for \({{\,\textrm{dist}\,}}(x, y)\) small enough the minimizing geodesic between x and y is uniquely defined, and consequently so are the parallel transport \(\Gamma \) and the distance \({{\,\textrm{dist}\,}}^*\). As is common in the Riemannian optimization literature (see, e.g., [2]), to define our tentative descent directions a retraction \(R: T\mathcal {M}\rightarrow \mathcal {M}\) is used. This retraction R is assumed to be in \(C^1(T\mathcal {M}, \mathcal {M})\), with

$$\begin{aligned} {{\,\textrm{dist}\,}}(R(x, d), x) \le L_r\left\| d\right\| , \end{aligned}$$
(3)

(true in any compact subset of \(T\mathcal {M}\) given the \(C^1\) regularity of R, without any further assumptions).

For a scalar-valued function \(f:\mathcal {M}\rightarrow \mathbb {R}\), the gradient \(\text {grad}f(x)\) is defined as the unique element of \(T_x\mathcal {M}\) such that for all \(v\in T_x\mathcal {M}\), it holds that

$$\begin{aligned} D f(x)[v] = \langle v,\text {grad}f(x)\rangle _x . \end{aligned}$$

When \(\mathcal {M}\) is embedded in \(\mathbb {R}^n\), the (Riemannian) gradient is simply the projection of the Euclidean gradient onto \(T_x\mathcal {M}\), i.e., \(\text {grad}f(x) = \mathsf P_x(\nabla f(x))\).
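To fix ideas, the following minimal sketch (in Python, not part of the formal development) instantiates the objects above for the unit sphere \(\mathcal {S}^{n-1}\) embedded in \(\mathbb {R}^n\): the tangent projection \(\mathsf P_x\), the metric projection retraction, and the Riemannian gradient obtained by projecting the Euclidean gradient. All function names are ours and purely illustrative; the same helpers are reused in the sketches accompanying the algorithms below.

```python
import numpy as np

def tangent_projection(x, v):
    """Orthogonal projection P_x(v) onto T_x S^{n-1} = {v : <x, v> = 0}."""
    return v - np.dot(x, v) * x

def retraction(x, d):
    """Metric projection retraction R(x, d) = (x + d) / ||x + d|| on the sphere."""
    y = x + d
    return y / np.linalg.norm(y)

def riemannian_gradient(egrad, x):
    """grad f(x) = P_x(egrad(x)) for an embedded submanifold with the induced metric."""
    return tangent_projection(x, egrad(x))

# Example: Rayleigh quotient f(x) = x^T A x restricted to the sphere.
rng = np.random.default_rng(0)
n = 5
A = rng.standard_normal((n, n)); A = (A + A.T) / 2
f = lambda x: x @ A @ x
egrad = lambda x: 2 * A @ x
x = rng.standard_normal(n); x /= np.linalg.norm(x)
print(abs(np.dot(x, riemannian_gradient(egrad, x))))  # ~0: the gradient lies in T_x
```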

3 Smooth Optimization Problems

In this section, methods for the solution of problem (1) with the objective \(f \in C^1(\mathcal {M})\) are considered. In particular, the gradient \(\text {grad}f(x)\) is assumed to be continuous along \(\mathcal {M}\) as a function of x.

3.1 Preliminaries

A Lipschitz continuous gradient assumption is first presented.

Assumption 1

There exists \(L_f>0\) such that for all \(x,y\in \mathcal {M}\)

$$\begin{aligned} {{\,\textrm{dist}\,}}^*(\text {grad}f(x), \text {grad}f(y)) = \left\| \Gamma _x^y \text {grad}f(x) - \text {grad}f(y)\right\| \le L_f{{\,\textrm{dist}\,}}(x, y) . \end{aligned}$$
(4)

The next assumption generalizes the standard descent property.

Assumption 2

There exists \(L>0\) so that for every \(x \in \mathcal {M}\cap \mathcal {L}_0, d \in T_x\mathcal {M}\)

$$\begin{aligned} f(R(x, d)) \le f(x) + \langle {\text {grad}} f(x), d\rangle + \frac{L}{2} \left\| d\right\| ^2 . \end{aligned}$$
(5)

Under suitable assumptions, the Lipschitz gradient property implies the generalized standard descent property.

Proposition 3.1

Assume that \(\mathcal {L}_0\) is compact, f is Lipschitz continuous and that R is a \(C^2\) retraction. Then, Assumption 1 implies Assumption 2.

The proof can be found in the Appendix. It should be noted that Proposition 3.1 is a key tool to extend convergence properties from the unconstrained case to the Riemannian case. To the best of our knowledge, this result is new to the literature. Under the stronger assumption that f has Lipschitz gradient as a function in \(\mathbb {R}^n\), the standard descent property (5) was proven for retractions in [9].

For each algorithm in this section, it is further assumed that, at each iteration k, a positive spanning set (as defined, e.g., in [11]) \(\{p_k^j\}_{j \in [1:K]}\) is available for the tangent space \(T_{k}\mathcal {M}\) (a concrete construction by projection of a fixed positive spanning set of the ambient space is sketched after Assumption 3 below and discussed in Sect. 3.4). This positive spanning set is assumed to stay bounded and not become degenerate during the algorithm, that is,

Assumption 3

There exists \(B>0\) such that

$$\begin{aligned} \max _{j \in [1:K]} \left\| p_k^j\right\| \le B, \end{aligned}$$
(6)

for every \(k \in \mathbb {N}\). Furthermore, there is a constant \(\tau > 0\) such that

$$\begin{aligned} \max _{j \in [1:K]} \langle r, p_k^j\rangle \ge \tau \left\| r\right\| , \end{aligned}$$
(7)

for every \(k \in \mathbb {N}\) and \(r \in T_{x_k}\mathcal {M}\).
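As a concrete illustration of Assumption 3, the sketch below builds a positive spanning set of the tangent space for the sphere example of Sect. 2 by projecting the fixed set \(\{\pm e_1,\dots ,\pm e_n\}\) onto \(T_x\mathcal {S}^{n-1}\), reusing the helpers defined there, and estimates the constant \(\tau \) in (7) by sampling. This is only a numerical check under our own naming, not a proof that (7) holds.

```python
import numpy as np
# Reuses tangent_projection from the sphere sketch in Sect. 2.

def positive_spanning_set(x):
    """Project the fixed positive spanning set {+-e_1, ..., +-e_n} of R^n onto T_x S^{n-1}."""
    n = x.size
    candidates = np.vstack([np.eye(n), -np.eye(n)])
    projected = [tangent_projection(x, e) for e in candidates]
    return [p for p in projected if np.linalg.norm(p) > 1e-12]   # drop degenerate directions

def empirical_tau(x, n_samples=1000, seed=0):
    """Monte Carlo estimate of the constant tau in (7) at the point x."""
    rng = np.random.default_rng(seed)
    ps = positive_spanning_set(x)
    worst = np.inf
    for _ in range(n_samples):
        r = tangent_projection(x, rng.standard_normal(x.size))
        r /= np.linalg.norm(r)                                   # random unit tangent vector
        worst = min(worst, max(np.dot(r, p) for p in ps))        # max_j <r, p^j>
    return worst
```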

3.2 Direct Search Algorithm

Here, the Riemannian Direct Search method based on Spanning Bases (RDS-SB) for smooth objectives is presented as Algorithm 1.

[Algorithm 1: Riemannian Direct Search based on Spanning Bases (RDS-SB); pseudocode figure not reproduced]

This procedure resembles the standard direct search algorithm for unconstrained derivative free optimization (see, e.g., [11, 21]) with two significant modifications. First, at every iteration a positive spanning set is computed for the current tangent vector space \(T_k\mathcal {M}\). As this space is expected to change at every iteration, it is not possible to use the same standard positive spanning sets appearing in the classic algorithms. Second, the candidate point \(x_k^j\) is computed by retracting the step \(\alpha _k p_k^j\) from the current tangent space \(T_k\mathcal {M}\) to the manifold, ensuring satisfaction of the geometric constraint.
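As an illustration only, the following sketch mirrors the loop just described for the sphere example of Sect. 2, reusing retraction and positive_spanning_set from the earlier sketches. The sufficient decrease test \(f(R(x_k, \alpha _k p_k^j)) \le f(x_k) - \gamma \alpha _k^2\) and the expansion/contraction factors \(\gamma _2> 1> \gamma _1 > 0\) follow the analysis below, but the exact polling and stepsize update rules of Algorithm 1 may differ.

```python
# Reuses retraction and positive_spanning_set from the previous sketches.

def rds_sb(f, x0, alpha0=1.0, gamma=1e-4, gamma1=0.5, gamma2=2.0, max_iter=500):
    """Direct search sketch in the spirit of RDS-SB: poll a positive spanning
    set of the current tangent space and accept a point on sufficient decrease."""
    x, alpha = x0, alpha0
    for _ in range(max_iter):
        success = False
        for p in positive_spanning_set(x):          # poll directions p_k^j
            y = retraction(x, alpha * p)            # tentative point on the manifold
            if f(y) <= f(x) - gamma * alpha**2:     # sufficient decrease test
                x, success = y, True
                break                               # opportunistic polling
        alpha = gamma2 * alpha if success else gamma1 * alpha
    return x
```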

3.3 Convergence Analysis

In this section, asymptotic global convergence of the method is shown. First, it is proved that the gradient norm at unsuccessful iterations must be bounded by a constant proportional to the stepsize (Lemma 3.2). This is a well-known condition in the unconstrained case (see, e.g., [30, Theorem 1]), extended to the Riemannian case thanks to Proposition 3.1. Given that the stepsize converges to zero, the bound implies that the gradient converges to zero along unsuccessful steps. It is then proved, using the Lipschitz continuity of the gradient, that the gradient converges to zero along successful steps as well.

The first lemma states a bound on the scalar product between the gradient and a poll direction that fails the sufficient decrease test.

Lemma 3.1

Let \(f\in C^1(\mathcal {M})\), let \(\{x_k\}\) be generated by Algorithm 1, and let Assumptions 2, 3 hold.

If \(f(R(x_k, \alpha _k p_k^j)) > f(x_k) - \gamma \alpha _k^2\), then

$$\begin{aligned} \alpha _k(LB^2/2 + \gamma ) > - \langle {\text {grad}} f(x_k), p_k^j\rangle . \end{aligned}$$
(8)

Proof

To start with, we have

$$\begin{aligned} f(x_k) - \gamma \alpha _k^2&< f(R(x_k, \alpha _k p_k^j)) \le f(x_k) + \alpha _k\langle \text {grad} f(x_k), p_k^j\rangle + \frac{L}{2} \alpha _k^2 \left\| p_k^j\right\| ^2 \\ &\le f(x_k) + \alpha _k \langle \text {grad} f(x_k), p_k^j\rangle + \frac{L}{2} \alpha _k^2 B^2 , \end{aligned}$$
(9)

where we used (5) in the second inequality, and (6) in the third one. The above inequality can be rewritten as

$$\begin{aligned} \alpha _k\langle {\text {grad}} f(x_k), p_k^j\rangle + \alpha _k^2 (LB^2/2 + \gamma ) > 0. \end{aligned}$$
(10)

Given that \(\alpha _k > 0\), the above is true if and only if

$$\begin{aligned} \alpha _k > - \frac{\langle {\text {grad}} f(x_k), p_k^j\rangle }{(LB^2/2 + \gamma )} , \end{aligned}$$
(11)

which rearranged gives the thesis. \(\square \)

From this, a bound on the gradient with respect to the stepsize is inferred.

Lemma 3.2

Let \(f\in C^1(\mathcal {M})\), let \(\{x_k\}\) be generated by Algorithm 1, and let Assumptions 2, 3 hold. If iteration k is unsuccessful, then

$$\begin{aligned} \left\| \text {grad}f(x_k)\right\| \le \frac{\alpha _k(LB^2/2 + \gamma )}{\tau } . \end{aligned}$$
(12)

Proof

If iteration k is unsuccessful, Eq. (8) must hold for every \(j \in [1:K]\). We obtain the thesis by applying the positive spanning property (7), with \(r = -\text {grad}f(x_k)\), to the RHS:

$$\begin{aligned} \alpha _k(LB^2/2 + \gamma ) > \max _{j \in [1:K]} - \langle {\text {grad}} f(x_k), p_k^j\rangle \ge \tau \left\| \text {grad}f(x_k)\right\| . \end{aligned}$$
(13)

\(\square \)

Finally, convergence of the gradient norm to zero is shown using the lemmas above and appropriate arguments regarding the stepsizes.

Theorem 3.1

Let \(f\in C^1(\mathcal {M})\), let \(\{x_k\}\) be generated by Algorithm 1, and let Assumptions 1, 2, 3 hold. Then

$$\begin{aligned} \lim _{k \rightarrow \infty } \left\| \text {grad}f(x_k)\right\| = 0 . \end{aligned}$$
(14)

Proof

To start with, it holds that \(\alpha _k \rightarrow 0\): the objective is bounded below and \(\{f(x_k)\}\) is nonincreasing, with \(f(x_{k + 1}) \le f(x_k) - \gamma \alpha _k^2\) if step k is successful, and so there can only be finitely many successful steps with \(\alpha _k \ge \varepsilon \) for any \(\varepsilon > 0\).

For a fixed \(\varepsilon > 0\), let \(\bar{k}\) be such that \(\alpha _k \le \varepsilon \) for every \(k \ge \bar{k}\). We now show that, for every \(\varepsilon > 0\) and \(k \ge \bar{k}\) large enough, we have

$$\begin{aligned} \left\| \text {grad}f(x_k)\right\| \le \varepsilon \left( \frac{ (LB^2/2 + \gamma )}{\tau } + L_fL_r B \frac{\gamma _2}{\gamma _2 - 1}\right) , \end{aligned}$$
(15)

which implies the thesis given that \(\varepsilon \) is arbitrary.

First, Eq. (15) is satisfied for \(k \ge \bar{k}\) if the step k is unsuccessful by Lemma 3.2:

$$\begin{aligned} \left\| \text {grad}f(x_k)\right\| \le \frac{\alpha _k(LB^2/2 + \gamma )}{\tau } \le \frac{\varepsilon (LB^2/2 + \gamma )}{\tau } , \end{aligned}$$
(16)

using \(\alpha _k \le \varepsilon \) in the second inequality.

If step k is successful, then let j be the minimum positive index such that step \(k + j\) is unsuccessful. Notice that such a j exists because \(\alpha _k\rightarrow 0\), which by the algorithm's construction implies the existence of an infinite subsequence of unsuccessful steps. We have that \(\alpha _{k + i} = \alpha _k \gamma _2^{i}\) for \(i \in [0:j - 1]\), and since \(\alpha _{k + j - 1} \le \varepsilon \), by backward induction we get \(\alpha _{k + i} \le \varepsilon \gamma _2^{i - j + 1}\). Therefore,

$$\begin{aligned} \sum _{i= 0}^{j - 1} \alpha _{k + i} \le \sum _{i = 0}^{j - 1} \varepsilon \gamma _2^{i - j + 1} \le \varepsilon \sum _{h= 0}^{\infty } \gamma _2^{-h} = \varepsilon \frac{\gamma _2}{\gamma _2 - 1} . \end{aligned}$$
(17)

Then,

$$\begin{aligned} {{\,\textrm{dist}\,}}(x_k, x_{k + j})&\le \sum _{i = 0}^{j - 1} {{\,\textrm{dist}\,}}(x_{k + i}, x_{k + i + 1}) = \sum _{i = 0}^{j - 1} {{\,\textrm{dist}\,}}(x_{k + i}, R(x_{k + i}, \alpha _{k + i}p_{k + i}^{j(k + i)} )) \\ &\le \sum _{i = 0}^{j - 1} L_r \alpha _{k + i}B \le L_rB \varepsilon \frac{\gamma _2}{\gamma _2 - 1} , \end{aligned}$$
(18)

where we used (3) together with (6) in the second inequality, and (17) in the third one.

In turn,

$$\begin{aligned} \left\| \text {grad}f(x_k)\right\|&\le {{\,\textrm{dist}\,}}^*(\text {grad}f(x_k), \text {grad}f(x_{k + j})) + \left\| \text {grad}f(x_{k + j})\right\| \\ &\le L_f {{\,\textrm{dist}\,}}(x_k, x_{k + j}) + \frac{ \varepsilon (LB^2/2 + \gamma )}{\tau } \\ &\le \varepsilon \left( \frac{ LB^2/2 + \gamma }{\tau } + L_fL_r B \frac{\gamma _2}{\gamma _2 - 1} \right) , \end{aligned}$$
(19)

where we used (4) and (16) with \(k + j\) instead of k for the first and second summand, respectively, in the second inequality, and (18) in the last one. \(\square \)

3.4 Incorporating an Extrapolation Linesearch

The works [25, 26] introduced the use of an extrapolating linesearch that tests the objective at points farther away from the current iterate than the tentative point obtained by direct search along a given direction (i.e., an element of the positive spanning set). Such a thorough exploration of the search directions ultimately yields better performance in practice, since longer objective-decreasing steps are computed. In this work, it is shown that the same technique can be applied to good effect in the Riemannian setting. In particular, in this section our Riemannian Direct Search with Extrapolation method based on Spanning Bases (RDSE-SB) for smooth objectives is presented. The scheme is described in detail as Algorithm 2, which can be viewed as a Riemannian version of [26, Algorithm 2].

The method uses a specific stepsize for each direction in the positive spanning set, so that instead of \(\alpha _k\) there is a set of stepsizes \(\{\alpha _k^j\}_{j \in [1:K]}\) for every \(k \in \mathbb {N}_0\). Furthermore, a retraction-based linesearch procedure (see Algorithm 3) is used to better explore a given direction in case a sufficient decrease in the objective is obtained.
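A plausible rendering of the retraction-based extrapolation linesearch (in the spirit of Algorithm 3, whose pseudocode figure is not reproduced here) is sketched below for the sphere example, reusing retraction from Sect. 2: the step is expanded by \(\gamma _2\) as long as the sufficient decrease condition keeps holding, consistently with the two exit cases used in Lemma 3.4. The exact bookkeeping of Algorithm 3 may differ.

```python
# Reuses retraction from the sphere sketch in Sect. 2.

def extrapolation_linesearch(f, x, p, alpha, gamma=1e-4, gamma2=2.0):
    """Retraction-based extrapolation linesearch sketch: return the largest tested
    stepsize along p giving sufficient decrease, or 0.0 if the initial step fails."""
    if f(retraction(x, alpha * p)) > f(x) - gamma * alpha**2:
        return 0.0                                   # initial step rejected
    while f(retraction(x, gamma2 * alpha * p)) <= f(x) - gamma * (gamma2 * alpha)**2:
        alpha *= gamma2                              # extrapolate while decrease holds
    return alpha
```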

When analyzing the RDSE-SB method, the same positive spanning set cannot be kept across different iterates, as is done in the unconstrained case (see [26, Algorithm 2, Steps 2 and 3]), because of the changes in the tangent space. Therefore, using the distance \(\text {dist}^*\) to compare vectors in different tangent spaces, a novel condition ensuring some continuity in the choice of the positive spanning sets is introduced here.

Assumption 4

There exists a constant \(L_{\Gamma }>0\) such that, for every \(k \in \mathbb {N}\) and \(j \in [1:K]\),

$$\begin{aligned} {{\,\textrm{dist}\,}}^*(p^j_k, p_{k + 1}^j) \le L_{\Gamma }{{\,\textrm{dist}\,}}(x_k, x_{k + 1}) . \end{aligned}$$
(20)

When \(\mathcal {M}\) is embedded in \(\mathbb {R}^n\) and \(\mathcal {L}_0\) is compact, it is easy to see that condition (20) holds if \(\{p_k^j\}_{j \in [1:K]}\) is the projection of a positive spanning set of \(\mathbb {R}^n\) (independent from k) into \(T_{k}\mathcal {M}\), using that \(T_x\mathcal {M}\) varies smoothly with x.
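For the sphere example, parallel transport along the minimizing geodesic has a well-known closed form, which makes \({{\,\textrm{dist}\,}}^*\) in (2), and hence condition (20), easy to check numerically. The helpers below are a sketch under that specific choice of manifold, with names of our own; they are not used in the analysis.

```python
import numpy as np
# Reuses tangent_projection and retraction from the sphere sketch in Sect. 2.

def parallel_transport(x, y, v):
    """Transport v in T_x S^{n-1} to T_y S^{n-1} along the minimizing geodesic
    (closed form for the sphere, valid whenever x != -y)."""
    c = np.dot(x, y)
    return v - (np.dot(y, v) / (1.0 + c)) * (x + y)

def dist_star(x, y, v, w):
    """dist*(v, w) = ||w - Gamma_x^y(v)|| as in (2), for v in T_x and w in T_y."""
    return np.linalg.norm(w - parallel_transport(x, y, v))

# Rough numerical check of (20) for one projected direction at two nearby points.
x = np.array([1.0, 0.0, 0.0])
y = retraction(x, np.array([0.0, 0.05, 0.0]))
e = np.array([0.0, 1.0, 0.0])
p_x, p_y = tangent_projection(x, e), tangent_projection(y, e)
print(dist_star(x, y, p_x, p_y), np.arccos(np.clip(np.dot(x, y), -1.0, 1.0)))  # LHS of (20) vs dist(x, y)
```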

It is now convenient to define, for \(k \le l\), \(\tilde{\Gamma }_k^l = \Gamma _{l-1}^{l} \circ \ldots \circ \Gamma _{k}^{k + 1}\), where for \(k = l\) the composition on the RHS is empty and we set \(\tilde{\Gamma }_k^l\) equal to the identity. Let also

$$\begin{aligned} d(k, l) = \sum _{i = 0}^{l - k - 1} {{\,\textrm{dist}\,}}(x_{k + i}, x_{k + i + 1}) . \end{aligned}$$
(21)

The following lemma, which links the directions of the positive spanning sets in different iterates, holds:

Lemma 3.3

Let \(f \in C^1(\mathcal {M})\), \(\{x_k\}\) be generated by Algorithm 2, and Assumptions 1, 3, 4 hold. For \(k \in \mathbb {N}\), \(j \ge 0\), \(i \in [1:K]\):

$$\begin{aligned} |\langle \text {grad}f(x_k), p_k^i\rangle - \langle \text {grad}f(x_{k + j}), p_{k + j}^i\rangle | \le L_{\Gamma } \left\| \text {grad}f(x_k)\right\| d(k, k + j) + B L_f d(k, k + j) . \end{aligned}$$
(22)

Proof

First,

$$\begin{aligned} \left\| \tilde{\Gamma }_k^{k + h}p^i_{k} - p^i_{k + h}\right\|&= \left\| \sum _{j = 0}^{h - 1}\left( \tilde{\Gamma }_{k + j}^{k + h}p^i_{k + j} - \tilde{\Gamma }_{k + j + 1}^{k + h}p_{k + j + 1}^i\right) \right\| \le \sum _{j = 0}^{h - 1} \left\| \tilde{\Gamma }_{k + j}^{k + h}p^i_{k + j} - \tilde{\Gamma }_{k + j + 1}^{k + h}p_{k + j + 1}^i\right\| \\ &= \sum _{j = 0}^{h - 1} \left\| \tilde{\Gamma }_{k + j + 1}^{k + h}\left( \Gamma _{k + j}^{k + j + 1}p^i_{k + j} - p_{k + j + 1}^i\right) \right\| = \sum _{j = 0}^{h - 1} \left\| \Gamma _{k + j}^{k + j + 1}p^i_{k + j} - p_{k + j + 1}^i\right\| \\ &\le \sum _{j = 0}^{h - 1} L_{\Gamma } {{\,\textrm{dist}\,}}(x_{k +j}, x_{k + j + 1}) = L_{\Gamma } d(k, k + h) , \end{aligned}$$
(23)

where we used (20) in the last inequality. Analogously, from (4) it follows

$$\begin{aligned} \left\| \text {grad}f(x_{k + h}) - \tilde{\Gamma }_k^{k + h} \text {grad}f(x_k)\right\| \le L_f d(k, k+ h) . \end{aligned}$$
(24)

We can then conclude

$$\begin{aligned} |\langle \text {grad}f(x_{k + h}), p^i_{k + h}\rangle - \langle \text {grad}f(x_k), p^i_k\rangle |&= |\langle \text {grad}f(x_{k + h}), p^i_{k + h}\rangle - \langle \tilde{\Gamma }_k^{k + h} \text {grad}f(x_k), \tilde{\Gamma }_k^{k + h}p^i_{k}\rangle | \\ &= |\langle \text {grad}f(x_{k + h}) - \tilde{\Gamma }_k^{k + h} \text {grad}f(x_k), p^i_{k + h}\rangle - \langle \tilde{\Gamma }_k^{k + h} \text {grad}f(x_k), \tilde{\Gamma }_k^{k + h}p^i_{k} - p^i_{k + h}\rangle | \\ &\le |\langle \text {grad}f(x_{k + h}) - \tilde{\Gamma }_k^{k + h} \text {grad}f(x_k), p^i_{k + h}\rangle | + |\langle \tilde{\Gamma }_k^{k + h} \text {grad}f(x_k), \tilde{\Gamma }_k^{k + h}p^i_{k} - p^i_{k + h}\rangle | \\ &\le \left\| \text {grad}f(x_{k + h}) - \tilde{\Gamma }_k^{k + h} \text {grad}f(x_k)\right\| \left\| p^i_{k + h}\right\| + \left\| \tilde{\Gamma }_k^{k + h} \text {grad}f(x_k)\right\| \left\| \tilde{\Gamma }_k^{k + h}p^i_{k} - p^i_{k + h}\right\| \\ &\le B L_f d(k, k + h) + L_{\Gamma } d(k, k+ h) \left\| \text {grad}f(x_k)\right\| , \end{aligned}$$
(25)

where we used (23), (24) and (6) in the last inequality. \(\square \)

[Algorithm 2: Riemannian Direct Search with Extrapolation based on Spanning Bases (RDSE-SB); pseudocode figure not reproduced]
[Algorithm 3: Retraction-based extrapolation linesearch; pseudocode figure not reproduced]

Asymptotic convergence of this method is proved in the remaining part of this section.

Lemma 3.4

Let \(f\in C^1(\mathcal {M})\), \(\{x_k\}\) generated by Algorithm 2, and let Assumptions 2, 3 hold. At every iteration k, the following inequality holds:

$$\begin{aligned} -\langle \text {grad}f(x_k), p_k^{j(k)}\rangle < \tilde{\alpha }_{k + 1}^{j(k)} \frac{\gamma _2}{\gamma _1} (LB^2/2 + \gamma ). \end{aligned}$$
(26)

Proof

It is immediate to check that we must always have

$$\begin{aligned} f(R(x_k, \Delta _k p_k^{j(k)})) > f(x_k) - \gamma \Delta _k^2, \end{aligned}$$
(27)

for \(\Delta _k = \frac{1}{\gamma _1} \tilde{\alpha }_{k + 1}^{j(k)}\) if the linesearch procedure terminates at the second line, and \(\Delta _k = \gamma _2\tilde{\alpha }_{k + 1}^{j(k)} \) if the linesearch procedure terminates in the last line. Then in both cases

$$\begin{aligned} -\langle \text {grad}f(x_k), p_k^{j(k)}\rangle < \Delta _k (LB^2/2 + \gamma ) \le \tilde{\alpha }_{k + 1}^{j(k)} \frac{\gamma _2}{\gamma _1} (LB^2/2 + \gamma ) , \end{aligned}$$
(28)

where we used Lemma 3.1 in the first inequality. \(\square \)

Assumption 4 makes it possible to extend [26, Proposition 5.2] to the Riemannian case.

Theorem 3.2

Let \(f\in C^1(\mathcal {M})\), \(\{x_k\}\) be generated by Algorithm 2, and let Assumptions 1, 2, 3 and 4 hold. We have

$$\begin{aligned} \lim _{k \rightarrow \infty } \left\| \text {grad}f(x_k)\right\| = 0 . \end{aligned}$$
(29)

Proof

Let \(\bar{\alpha }_k = \max _{j \in [1:K]} \tilde{\alpha }_{k + 1}^{j}\), so that \( \bar{\alpha }_k \rightarrow 0\) since \(\tilde{\alpha }_{k}^{j} \rightarrow 0\) for every \(j \in [1:K]\), reasoning as in the proof of Theorem 3.1. As a consequence of Lemma 3.4, we have

$$\begin{aligned} -\langle \text {grad}f(x_k), p_k^{j(k)}\rangle < \bar{\alpha }_k c_1 , \end{aligned}$$
(30)

for the constant \(c_1 = \frac{\gamma _2}{\gamma _1} (LB^2/2 + \gamma )\) independent from j(k).

It remains to bound \(\langle \text {grad}f(x_k), p_k^i\rangle \) for \(i \ne j(k)\). To start with, we have the following bound:

$$\begin{aligned} -\langle \text {grad}f(x_{k}), p^i_k\rangle&\le -\langle \text {grad}f(x_{k + h}), p^i_{k + h}\rangle + |\langle \text {grad}f(x_{k + h}), p^i_{k + h}\rangle - \langle \text {grad}f(x_{k}), p^i_k\rangle | \\ &\le c_1 \bar{\alpha }_{k + h} + |\langle \text {grad}f(x_{k + h}), p^i_{k + h}\rangle - \langle \text {grad}f(x_{k}), p^i_k\rangle | , \end{aligned}$$
(31)

for \(h \le K\) such that \(i = j(k + h)\), and where in the second inequality we used (30) with \(k + h\) instead of k. For the second summand appearing in the RHS of (31), from Lemma 3.3 it follows

$$\begin{aligned} |\langle \text {grad}f(x_{k + h}), p^i_{k + h}\rangle - \langle \text {grad}f(x_k), p^i_k\rangle | \le B L_f d(k, k + h) + L_{\Gamma } \left\| \text {grad}f(x_k)\right\| d(k, k + h) . \end{aligned}$$
(32)

We can now bound \(d(k, k + h)\) as follows

$$\begin{aligned} d(k, k + h)&= \sum _{l = 0} ^ {h - 1} {{\,\textrm{dist}\,}}(x_{k + l + 1}, x_{k + l}) = \sum _{l= 0}^{h - 1} {{\,\textrm{dist}\,}}(x_{k + l}, R(x_{k + l}, \bar{\alpha }_{k + l} p_{k + l}^{j(k + l)})) \\ &\le \sum _{l= 0}^{h - 1} L_r\bar{\alpha }_{k + l} \left\| p_{k + l}^{j(k + l)}\right\| \le B L_r\sum _{l= 0}^{h - 1} \bar{\alpha }_{k + l} \le hBL_r\max _{l \in [0:h-1]} \bar{\alpha }_{k + l} \le KBL_r\max _{l \in [0:K]} \bar{\alpha }_{k + l} , \end{aligned}$$
(33)

where we used (3) in the first inequality, (6) in the second one, and \(h \le K\) in the last one.

Let \(\Delta _k = \max _{l \in [0:K]} \bar{\alpha }_{k + l} \), so that in particular \(\Delta _k \rightarrow 0\).

For every \(i \in [1:K]\):

$$\begin{aligned} - \langle \text {grad}f(x_{k}), p^i_k\rangle&\le c_1 \bar{\alpha }_{k + h} + B L_f d(k, k + h) + L_{\Gamma } \left\| \text {grad}f(x_k)\right\| d(k, k + h) \\ &\le c_2 \Delta _k + c_3 \Delta _k \left\| \text {grad}f(x_k)\right\| , \end{aligned}$$
(34)

for \(c_2 = c_1 + L_fB^2KL_r\) and \(c_3 = KBL_rL_{\Gamma } \). Then, applying (7) and (34), we get

$$\begin{aligned} \tau \left\| \text {grad}f(x_k)\right\| \le \max _{i \in [1:K]} -\langle \text {grad}f(x_{k}), p^i_k\rangle \le c_2 \Delta _k + c_3 \Delta _k \left\| \text {grad}f(x_k)\right\| \end{aligned}$$
(35)

and rearranging, for k large enough so that \(\tau - c_3 \Delta _k > 0\),

$$\begin{aligned} \left\| \text {grad}f(x_k)\right\| \le \frac{c_2 \Delta _k }{\tau - c_3 \Delta _k} \rightarrow 0 , \end{aligned}$$
(36)

as desired. \(\square \)

4 Nonsmooth Objectives

In this section, some direct search methods are studied in the context where f is Lipschitz continuous, and bounded from below, but not necessarily continuously differentiable. The algorithms detailed here are built around the ideas given in [15], where the authors consider direct search methods for nonsmooth objectives in Euclidean space.

4.1 Clarke Stationarity for Nonsmooth Functions on Riemannian Manifolds

In order to perform our analysis, a definition of the Clarke directional derivative at a point \(x \in \mathcal {M}\) is needed. The standard approach is to write the function in coordinate charts and take the standard Clarke derivative in a Euclidean space (see, e.g., [19, 20]). Formally, given a chart \((\varphi , U)\) at \(x \in \mathcal {M}\) and \(v \in T_x\mathcal {M}\),

$$\begin{aligned} f^{\circ }(x; v) = \tilde{f}^{\circ }(\varphi (x); \textrm{d} \varphi (x)v) , \end{aligned}$$
(37)

for \(\tilde{f}(y) = f(\varphi ^{- 1}(y))\). The following lemma shows the relationship between definition (37) and a directional-derivative-like object defined via retractions. This nontrivial result is the key tool allowing us to extend the analysis of direct search methods on \(\mathbb {R}^n\) to the Riemannian setting.

Lemma 4.1

Let f be Lipschitz continuous. If \((y_k, q_k) \rightarrow (x, d)\) and \(t_k \rightarrow 0\),

$$\begin{aligned} f^{\circ }(x; d) \ge \limsup _{k \rightarrow \infty } \frac{f(R(y_k, t_kq_k)) - f(y_k)}{t_k} . \end{aligned}$$
(38)

The proof is rather technical and thus deferred to the Appendix.

4.2 Refining Subsequences

The definition of refining subsequence used in the analysis of direct search methods (see, e.g., [3, 15]) is adapted here to the Riemannian setting. Let \((x_k, d_k)\) be a sequence in \(T\mathcal {M}\).

Definition 4.1

The subsequence \(\{x_{i(k)}\}\) is refining if \(x_{i(k)} \rightarrow x^* \) and iteration i(k) is unsuccessful for every k. In this case, the limit \(x^*\) is called a refined point.

Definition 4.2

Given a refining subsequence \(\{x_{i(k)}\}\) with refined point \(x^*\), a direction \(d \in T_{x^*}\mathcal {M}\) with \(\Vert d\Vert _{x^*} = 1\) is said to be a refining direction if for a further subsequence \(\{j(i(k))\}\)

$$\begin{aligned} \lim _{k \rightarrow \infty } {{\,\textrm{dist}\,}}^*(d_{j(i(k))}, d) = 0 . \end{aligned}$$
(39)

A sufficient condition for every unit-norm direction at a refined point to be refining is now given, assuming that the manifold is embedded in \(\mathbb {R}^n\) and that the directions are obtained by projecting vectors of the unit sphere onto the tangent spaces.

Proposition 4.1

If \(\{x_{i(k)}\}\) is a refining subsequence, \(\{\bar{d}_{i(k)}\}\) is dense in the unit sphere, and

$$\begin{aligned} d_{i(k)} = \frac{\mathsf P_{k}(\bar{d}_{i(k)})}{\Vert \mathsf P_k(\bar{d}_{i(k)})\Vert _k} , \end{aligned}$$

whenever \(\mathsf P_k(\bar{d}_{i(k)}) \ne 0\), with \(d_{i(k)} = 0\) otherwise, then every \(d \in T_{x^*}\mathcal {M}\) with \(\Vert d\Vert _{x^*} = 1\) is a refining direction.

Proof

Fix \(d \in T_{x^*} \mathcal {M}\), with \(\Vert d\Vert _{x^*} = 1\), and let \(\bar{d} = d/ \Vert d\Vert \), where \(\Vert \cdot \Vert \) denotes the Euclidean norm of the ambient space, so that \(\bar{d}\) belongs to the unit sphere. By density, \(\bar{d}_{j(i(k))} \rightarrow \bar{d}\) for a proper choice of the subsequence \(\{j(i(k))\}\). Then,

$$\begin{aligned} \lim _{k \rightarrow \infty } d_{j(i(k))} = \lim _{k \rightarrow \infty } \frac{\mathsf P_k(\bar{d}_{j(i(k))})}{\left\| \mathsf P_k(\bar{d}_{j(i(k))})\right\| _k} = \frac{\mathsf P_{x^*}(\bar{d})}{\left\| \mathsf P_{x^*}(\bar{d})\right\| _{x^*}} = \frac{\bar{d}}{\left\| \bar{d}\right\| _{x^*}} = d , \end{aligned}$$
(40)

where in the second equality we used the continuity of \(\mathsf P_x\) and of the norm \(\Vert \cdot \Vert _x\), and in the third equality we used \(\mathsf P_{x^*}(\bar{d}) = \bar{d}\) since \(\bar{d} \in T_{x^*} \mathcal {M}\) by construction. \(\square \)

4.3 Direct Search for Nonsmooth Objectives

Our Riemannian Direct Search method based on Dense Directions (RDS-DD) for nonsmooth objectives is presented here. The scheme is presented in detail as Algorithm 4. The algorithm performs three simple steps at iteration k. First, a search direction is selected randomly in the current tangent space. Then, a tentative point is generated by retracting the step \(\alpha _k d_k\) from the tangent space to the manifold. Such a point is accepted as the new iterate if a sufficient decrease condition on the objective function is satisfied, in which case the stepsize is expanded; otherwise, the iterate stays the same and the stepsize is reduced.

[Algorithm 4: Riemannian Direct Search based on Dense Directions (RDS-DD); pseudocode figure not reproduced]
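For illustration, a sketch of such an iteration on the sphere example is given below, reusing the helpers from the earlier sketches; the dense directions are obtained by normalizing Gaussian samples, which are dense in the unit sphere with probability one, and the stepsize update factors are our own placeholders rather than the exact rules of Algorithm 4.

```python
import numpy as np
# Reuses tangent_projection and retraction from the sphere sketch in Sect. 2.

def rds_dd(f, x0, alpha0=1.0, gamma=1e-4, gamma1=0.5, gamma2=2.0,
           max_iter=2000, seed=0):
    """Sketch in the spirit of RDS-DD: one dense random direction per iteration,
    projected onto the current tangent space and tested for sufficient decrease."""
    rng = np.random.default_rng(seed)
    x, alpha = x0, alpha0
    for _ in range(max_iter):
        d_bar = rng.standard_normal(x.size)
        d_bar /= np.linalg.norm(d_bar)              # dense in the unit sphere (a.s.)
        d = tangent_projection(x, d_bar)
        if np.linalg.norm(d) > 0:
            d /= np.linalg.norm(d)                  # normalization as in Proposition 4.1
        y = retraction(x, alpha * d)
        if f(y) <= f(x) - gamma * alpha**2:         # sufficient decrease: successful step
            x, alpha = y, gamma2 * alpha
        else:                                       # unsuccessful step
            alpha = gamma1 * alpha
    return x
```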

Thanks to the theoretical tools previously introduced, and in particular to the relation between retractions and the Clarke directional derivative proved in Lemma 4.1, it is shown in a straightforward way that a suitable subsequence of unsuccessful iterations of the RDS-DD method converges to a Clarke stationary point.

Theorem 4.1

Let f be Lipschitz continuous and \(\{x_k\}\) be generated by Algorithm 4. If \(\{x_{i(k)}\}\) is refining, with \( x_{i(k)} \rightarrow x^* \), and every \(d \in T_{x^*}\mathcal {M}\) with \(\Vert d\Vert _{x^*} = 1\) is a refining direction, then \(x^*\) is Clarke stationary.

Proof

By the same arguments as in the smooth case, \(\alpha _k \rightarrow 0\) and in particular \(\alpha _{i(k)} \rightarrow 0\). Since by assumption i(k) is an unsuccessful step, we have, for every i(k),

$$\begin{aligned} f(R(x_{i(k)}, \alpha _{i(k)} d_{i(k)})) - f(x_{i(k)}) > -\gamma \alpha _{i(k)}^2 . \end{aligned}$$
(41)

Let \(d \in T_{x^*}\mathcal {M}\) with \(\Vert d\Vert _{x^*} = 1\), let \(\{j(i(k)) \}\) be such that \(d_{j(i(k))} \rightarrow d\), and let \(y_k = x_{j(i(k))} \), \(q_k = d_{j(i(k))}\), \(t_k = \alpha _{j(i(k))}\). We have

$$\begin{aligned} \limsup _{k \rightarrow \infty } \frac{f(R(y_k, t_kq_k)) - f(y_k)}{t_k} \ge \limsup _{k \rightarrow \infty } -\gamma t_k = 0 , \end{aligned}$$
(42)

thanks to (41), and by applying Lemma 4.1 we get

$$\begin{aligned} f^{\circ }(x^*; d) \ge \limsup _{k \rightarrow \infty } \frac{f(R(y_k, t_kq_k)) - f(y_k)}{t_k} \ge 0 , \end{aligned}$$
(43)

which implies the thesis since d is arbitrary. \(\square \)

4.4 Direct Search with Linesearch Extrapolation for Nonsmooth Objectives

Our Riemannian Direct Search method with linesearch Extrapolation based on Dense Directions (RDSE-DD) for nonsmooth objectives is presented here. It can be seen as an extension to the Riemannian setting of the \(\text {DFN}_{simple}\) algorithm introduced in [15] for the Euclidean setting with bound constraints. The detailed scheme is given in Algorithm 5. The algorithm performs just two simple steps at iteration k. First, a given search direction is suitably projected onto the current tangent space. Then, a linesearch is performed using Algorithm 3, in the hope of obtaining a new point that guarantees a sufficient decrease.

[Algorithm 5: Riemannian Direct Search with Extrapolation based on Dense Directions (RDSE-DD); pseudocode figure not reproduced]
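A sketch in the same spirit is given below, combining the projected dense directions with the extrapolation linesearch sketched in Sect. 3.4; how the tentative stepsize is carried over between iterations is our own guess and may differ from Algorithm 5.

```python
import numpy as np
# Reuses tangent_projection, retraction, and extrapolation_linesearch from earlier sketches.

def rdse_dd(f, x0, alpha0=1.0, gamma=1e-4, gamma1=0.5, gamma2=2.0,
            max_iter=2000, seed=0):
    """Sketch in the spirit of RDSE-DD: project a dense random direction onto the
    tangent space and explore it with the extrapolation linesearch."""
    rng = np.random.default_rng(seed)
    x, alpha = x0, alpha0
    for _ in range(max_iter):
        d = tangent_projection(x, rng.standard_normal(x.size))
        if np.linalg.norm(d) > 0:
            d /= np.linalg.norm(d)
        step = extrapolation_linesearch(f, x, d, alpha, gamma=gamma, gamma2=gamma2)
        if step > 0:                                 # sufficient decrease found
            x, alpha = retraction(x, step * d), step
        else:                                        # reduce the tentative stepsize
            alpha = gamma1 * alpha
    return x
```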

Once again, by exploiting the theoretical tools previously introduced, it is proved in a straightforward way that a suitable subsequence of the RDSE-DD iterates converges to a Clarke stationary point. Thanks to the use of the linesearch strategy, the following result is not restricted to unsuccessful iterations. Since Algorithm 5 has no unsuccessful iterations, for the purposes of Definition 4.1 every convergent subsequence it generates is considered refining.

Theorem 4.2

Let f be Lipschitz continuous and \(\{x_k\}\) be generated by Algorithm 5. If \(\{x_{i(k)}\}\) is refining, with \( x_{i(k)} \rightarrow x^* \) and every \(d \in T_{x^*}\mathcal {M}\) with \(\Vert d\Vert _{x^*} = 1\) is a refining direction, then \(x^*\) is Clarke stationary.

Proof

Let \(\beta _k = \tilde{\alpha }_{k + 1}/\gamma _1\) if the linesearch procedure exits before the loop, and \(\beta _k = \gamma _2 \tilde{\alpha }_{k + 1}\) otherwise, so that in particular \(\beta _k \rightarrow 0\). Then, by definition of the linesearch procedure, for every k

$$\begin{aligned} f(R(x_k, \beta _k d_k)) - f(x_k) > -\gamma \beta _k^2 . \end{aligned}$$
(44)

The rest of the proof is analogous to that of Theorem 4.1. \(\square \)

5 Numerical Results

In this section, results of numerical experiments with the algorithms described in this paper on a set of simple but illustrative example problems are presented. The comparison among the algorithms is carried out using data and performance profiles [27]. Specifically, let S be a set of algorithms and P a set of problems. For each \(s\in S\) and \(p \in P\), let \(t_{p,s}\) be the number of function evaluations required by algorithm s on problem p to satisfy the condition

$$\begin{aligned} f(x_k) \le f_L + \tau (f(x_0) - f_L) , \end{aligned}$$
(45)

where \(0< \tau < 1\) and \(f_L\) is the best objective function value achieved by any solver on problem p. Then, the performance and data profiles of solver s are defined, respectively, by the following functions

$$\begin{aligned} \rho _s(\alpha )&= \frac{1}{|P|}\left| \left\{ p\in P: \frac{t_{p,s}}{\min \{t_{p,s'}:s'\in S\}}\le \alpha \right\} \right| ,\\ d_s(\kappa )&= \frac{1}{|P|}\left| \left\{ p\in P: t_{p,s}\le \kappa (n_p+1)\right\} \right| , \end{aligned}$$

where \(n_p\) is the dimension of problem p.
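For completeness, both profiles can be computed directly from a matrix of the values \(t_{p,s}\). The short sketch below assumes such a matrix has already been collected (with np.inf when a solver never satisfies (45)); it is only meant to make the two definitions concrete and is not the evaluation code used for the figures.

```python
import numpy as np

def profiles(t, n_p, alphas, kappas):
    """Performance and data profiles from a matrix t of shape (|P|, |S|), where
    t[p, s] is the number of evaluations solver s needs on problem p to satisfy
    (45) (np.inf if never satisfied; each problem is solved by at least one solver)."""
    best = t.min(axis=1, keepdims=True)                                 # best solver per problem
    rho = np.array([(t / best <= a).mean(axis=0) for a in alphas])      # rho_s(alpha), one row per alpha
    d = np.array([(t <= k * (n_p[:, None] + 1)).mean(axis=0) for k in kappas])  # d_s(kappa)
    return rho, d
```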

A budget of \(100(n_p+1)\) function evaluations is used in all cases, together with two levels of precision for condition (45), namely \(\tau \in \{10^{-1},10^{-3}\}\). Randomly generated instances of well-known optimization problems over manifolds from [1, 8, 18] are considered. A brief description of those problems as well as the details of our implementation can be found in the Appendix (see Sects. 7.3, 7.4 and 7.5). The size of the ambient space for the instances varies from 2 to 200. In the results, the problems are split by ambient space dimension: between 2 and 15 for small instances, between 16 and 50 for medium instances, and between 51 and 200 for large instances.

5.1 Smooth Problems

In Fig. 1, the results related to 8 smooth instances of problem (1) from [1, 8] are included, each with 15 different problem dimensions (from 2 to 200), for a total number of 60 tested instances, split as described above. Our methods, that is, RDS-SB and RDSE-SB, are compared with the zeroth-order gradient descent (ZO-RGD, [23, Algorithm 1]).

The results clearly show that RDSE-SB performs better than RDS-SB and ZO-RGD both in efficiency and reliability for both levels of precision. It can also be seen how the gap between RDSE-SB and the other two algorithms gets larger as the problem dimension grows.

Fig. 1: From top to bottom, results for small, medium, and large instances in the smooth case [figure not reproduced]

5.2 Nonsmooth Problems

Here, a preliminary comparison is reported between a direct search strategy, a linesearch strategy, and ZO-RGD on two nonsmooth instances of (1) from [18], each with 15 different problem sizes (from 2 to 200), thus getting a total number of 30 tested instances, split by dimension as for smooth instances. It should be noted that while in the unconstrained setting the performance of zeroth-order (sub)gradient descent methods on nonsmooth objectives has been analyzed (see, e.g., [28]), there are, to the best of our knowledge, no convergence guarantees in the Riemannian setting.

In the direct search strategy (RDS-DD+), the RDS-SB method is applied until \(\alpha _{k + 1} \le \alpha _{\epsilon }\), at which point the nonsmooth version RDS-DD is used. Analogously, in the linesearch strategy (RDSE-DD+), the RDSE-SB method is applied until \(\max _{j \in [1:K]} \tilde{\alpha }_{k + 1}^j \le \alpha _{\epsilon }\), at which point the nonsmooth version RDSE-DD is used. Both strategies use a threshold parameter \(\alpha _{\epsilon } > 0\) to switch from the smooth to the nonsmooth DFO algorithm. The reader is referred to [15] and references therein for other direct search strategies combining coordinate and dense directions.

In Fig. 2, the comparison between the considered strategies is reported. As in the smooth case, the linesearch-based strategy outperforms both the simple direct search strategy and ZO-RGD. It can once again be seen how the gap between the algorithms grows with the problem dimension.

Fig. 2: From top to bottom, results for small, medium, and large instances in the nonsmooth case [figure not reproduced]

6 Conclusion

In this paper, direct search algorithms, with and without an extrapolation linesearch, for minimizing functions over a Riemannian manifold are presented. It was found that, modulo modifications accounting for the vector space structure changing across iterations, direct search strategies provide convergence guarantees for both smooth and nonsmooth objectives. It was also found that, in our numerical experiments, the extrapolation linesearch speeds up direct search in both cases, and even outperforms a gradient-approximation-based zeroth-order Riemannian algorithm in the smooth case. A natural direction for future work is the extension to the stochastic setting.