1 Introduction

The space of \(n \times n\) real symmetric matrices \({\mathcal {S}}_{n}\) is endowed with the trace inner product \(\langle A, B\rangle :={\text {trace}}(A B)\). A matrix \(A \in {\mathcal {S}}_{n}\) is called completely positive if for some \(r \in {\mathbb {N}}\) there exists an entrywise nonnegative matrix \(B \in {\mathbb {R}}^{n \times r}\) such that \(A=B B^{\top }\), and we call B a CP factorization of A. We define \(\mathcal{CP}_{n}\) as the set of \(n \times n\) completely positive matrices, equivalently characterized as

$$\begin{aligned} \mathcal{CP}\mathcal{}_{n} :=\{B B^{\top } \in {\mathcal {S}}_{n} \mid B \text{ is } \text{ a } \text{ nonnegative } \text{ matrix } \} ={\text {conv}}\{x x^{\top } \mid x \in {\mathbb {R}}_{+}^{n}\}, \end{aligned}$$

where \({\text {conv}}(S)\) denotes the convex hull of a given set S. We denote the set of \(n \times n\) copositive matrices by \(\mathcal {COP}_{n}:=\{A \in {\mathcal {S}}_{n} \mid x^{\top } A x \ge 0 \text{ for } \text{ all } x \in {\mathbb {R}}_{+}^{n}\}.\) It is known that \(\mathcal {COP}_n\) and \(\mathcal{CP}\mathcal{}_{n}\) are duals of each other under the trace inner product; moreover, both \(\mathcal{CP}\mathcal{}_{n}\) and \(\mathcal {COP}_{n}\) are proper convex cones [1, Section 2.2]. For any positive integer n, we have the following inclusion relationships among these and other important cones in conic optimization:

$$\begin{aligned} \mathcal{CP}\mathcal{}_{n} \subseteq {\mathcal {S}}_{n}^{+} \cap {\mathcal {N}}_{n} \subseteq {\mathcal {S}}_{n}^{+} \subseteq {\mathcal {S}}_{n}^{+}+{\mathcal {N}}_{n} \subseteq \mathcal {COP}_{n}, \end{aligned}$$

where \({\mathcal {S}}_{n}^{+}\) is the cone of \(n \times n\) symmetric positive semidefinite matrices and \({\mathcal {N}}_{n}\) is the cone of \(n \times n\) symmetric nonnegative matrices. See the monograph [1] for a comprehensive description of \(\mathcal{CP}\mathcal{}_{n}\) and \(\mathcal {COP}_{n}\).
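As a small numerical illustration of these definitions, the following sketch (Python/NumPy; the matrix sizes are arbitrary choices of ours) builds a completely positive matrix from a nonnegative factor and checks that it is doubly nonnegative, i.e., lies in \({\mathcal {S}}_{n}^{+} \cap {\mathcal {N}}_{n}\).

```python
import numpy as np

rng = np.random.default_rng(0)
n, r = 5, 8
B = np.abs(rng.standard_normal((n, r)))          # entrywise nonnegative factor B
A = B @ B.T                                      # A = B B^T is completely positive by construction

# A completely positive matrix is in particular doubly nonnegative:
print(np.all(A >= 0))                            # A is entrywise nonnegative
print(np.all(np.linalg.eigvalsh(A) >= -1e-12))   # A is positive semidefinite
```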

Conic optimization is a subfield of convex optimization that studies the minimization of linear functions over proper cones. Here, if the proper cone is \(\mathcal{CP}\mathcal{}_n\) or its dual cone \(\mathcal {COP}_n\), we call the conic optimization problem a copositive programming problem. Copositive programming is closely related to many nonconvex, NP-hard quadratic and combinatorial optimization problems [2]. For example, consider the so-called standard quadratic optimization problem,

$$\begin{aligned} \min \{x^{\top } M x \mid {{\textbf {e}}}^{\top } x=1, x \in {\mathbb {R}}_{+}^{n} \}, \end{aligned}$$
(1)

where \(M \in {\mathcal {S}}_{n}\) is possibly not positive semidefinite and \({{\textbf {e}}}\) is the all-ones vector. Bomze et al. [3] showed that the following completely positive reformulation,

$$\begin{aligned} \min \{\langle M, X\rangle \mid \langle E, X\rangle =1, X \in {\mathcal {C}} {\mathcal {P}}_{n} \}, \end{aligned}$$

where E is the all-ones matrix, is equivalent to (1). Burer [4] reported a more general result, where any quadratic problem with binary and continuous variables can be rewritten as a linear program over \(\mathcal{CP}\mathcal{}_n\). As an application to combinatorial problems, consider the problem of computing the independence number \(\alpha (G)\) of a graph G with n nodes. De Klerk and Pasechnik [5] showed that

$$\begin{aligned} \alpha (G)=\max \{\langle E, X\rangle \mid \langle A+I, X\rangle =1, X \in {\mathcal {C}} {\mathcal {P}}_{n}\}, \end{aligned}$$

where A is the adjacency matrix of G. For surveys on applications of copositive programming, see [2, 6,7,8,9].

The difficulty of the above problems lies entirely in the completely positive conic constraint. Note that because neither \(\mathcal {COP}_{n}\) nor \(\mathcal{CP}\mathcal{}_{n}\) is self-dual, the primal-dual interior point method for conic optimization cannot be applied directly. Besides this, there are many open problems related to the completely positive cone. One is checking membership in \(\mathcal{CP}\mathcal{}_{n}\), which was shown to be NP-hard by [10]. Computing or estimating the cp-rank, as defined later in (3), is also an open problem. We refer the reader to [9, 11] for a detailed discussion of these unresolved issues.

In this paper, we focus on finding a CP factorization for a given \(A \in \mathcal{CP}\mathcal{}_{n}\), i.e., the CP factorization problem:

$$\begin{aligned} \text{ Find } B \in {\mathbb {R}}^{n \times r} \text{ s.t. } A=B B^{\top } \text{ and } B \ge 0, \end{aligned}$$
(CPfact)

which seems to be closely related to the membership problem \(A \in \mathcal{CP}\mathcal{}_n\). Sometimes a matrix is shown to be completely positive via duality, i.e., by verifying that \(\langle A, X\rangle \ge 0\) for all \(X \in \mathcal {COP}_n\), but in that case a CP factorization is not necessarily obtained.

1.1 Related work on CP factorization

Various methods of solving CP factorization problems have been studied. Jarre and Schmallowsky [12] stated a criterion for complete positivity, based on the augmented primal dual method to solve a particular second-order cone problem. Dickinson and Dür [13] dealt with complete positivity of matrices that possess a specific sparsity pattern and proposed a method for finding CP factorizations of these special matrices that can be performed in linear time. Nie [14] formulated the CP factorization problem as an \({\mathcal {A}}\)-truncated K-moment problem, for which the author developed an algorithm that solves a series of semidefinite optimization problems. Sponsel and Dür [15] considered the problem of projecting a matrix onto \(\mathcal{CP}\mathcal{}_{n}\) and \(\mathcal {COP}_{n}\) by using polyhedral approximations of these cones. With the help of these projections, they devised a method to compute a CP factorization for any matrix in the interior of \(\mathcal{CP}\mathcal{}_{n}\). Bomze [16] showed how to construct a CP factorization of an \(n \times n\) matrix based on a given CP factorization of an \((n-1) \times (n-1)\) principal submatrix. Dutour Sikirić et al. [17] developed a simplex-like method for a rational CP factorization that works if the input matrix allows a rational CP factorization.

In 2020, Groetzner and Dür [18] applied the alternating projection method to the CP factorization problem by posing it as an equivalent feasibility problem (see (FeasCP)). Shortly afterwards, Chen et al. [19] reformulated the split feasibility problem as a difference-of-convex optimization problem and solved (FeasCP) as a specific application. In fact, we will solve this equivalent feasibility problem (FeasCP) by other means in this paper. In 2021, Boţ and Nguyen [20] proposed a projected gradient method with relaxation and inertia parameters for the CP factorization problem, aimed at solving

$$\begin{aligned} \min _{X}\{\Vert A-X X^{\top }\Vert ^{2} \mid X \in {\mathbb {R}}_{+}^{n \times r} \cap {\mathcal {B}}({0}, \sqrt{{\text {trace}}(A)})\}, \end{aligned}$$
(2)

where \({\mathcal {B}}(0, \varepsilon ):=\{X \in {\mathbb {R}}^{n \times r} \mid \Vert X\Vert \le \varepsilon \}\) is the closed ball centered at 0. The authors argued that its optimal value is zero if and only if \(A \in \mathcal{CP}\mathcal{}_{n}\).

1.2 Our contributions and organization of the paper

Inspired by the idea of Groetzner and Dür [18], wherein (CPfact) is shown to be equivalent to a feasibility problem called (FeasCP), we treat the problem (FeasCP) as a nonsmooth Riemannian optimization problem and solve it through a general Riemannian smoothing method. Our contributions are summarized as follows:

  1. Although it is not explicitly stated in [18], (FeasCP) is actually a Riemannian optimization formulation. We propose a new Riemannian optimization technique and apply it to this problem.

  2. In particular, we present a general framework of Riemannian smoothing for the nonsmooth Riemannian optimization problem and show convergence to a stationary point of the original problem.

  3. We apply the general framework of Riemannian smoothing to CP factorization. Numerical experiments show that our method is competitive with other efficient CP factorization methods, especially for large-scale matrices.

In Sect. 2, we review how (CPfact) is reformulated as an equivalent feasibility problem; in particular, we take a different approach to this problem from those in other studies. In Sect. 3, we describe the general framework of smoothing methods for Riemannian optimization. To apply it to the CP factorization problem, we employ a smoothing function named LogSumExp. Section 4 is a collection of numerical experiments on CP factorization. As a supplement, in Sect. 5, we conduct further experiments (the FSV problem and robust low-rank matrix completion) to explore the numerical performance of various sub-algorithms and smoothing functions on different applications.

2 Preliminaries

2.1 cp-rank and cp-plus-rank

First, let us recall some basic properties of completely positive matrices. Generally, many CP factorizations of a given A may exist, and they may vary in their numbers of columns. This gives rise to the following definitions: the cp-rank of \(A \in {\mathcal {S}}_{n}\), denoted by \({\text {cp}}(A)\), is defined as

$$\begin{aligned} {\text {cp}}(A) := \min \{r \in {\mathbb {N}} \mid A=B B^{\top }, B \in {\mathbb {R}}^{n \times r}, B \ge 0 \}, \end{aligned}$$
(3)

where \({\text {cp}}(A) = \infty\) if \(A \notin \mathcal{CP}\mathcal{}_{n}.\) Similarly, we can define the cp-plus-rank as

$$\begin{aligned} {\text {cp}}^{+}(A):= \min \{r \in {\mathbb {N}} \mid A=B B^{\top }, B \in {\mathbb {R}}^{n \times r}, B > 0 \}. \end{aligned}$$

Immediately, for all \(A \in {\mathcal {S}}_{n}\), we have

$$\begin{aligned} {\text {rank}}(A) \le \mathrm {cp}(A) \le \mathrm {cp}^{+}(A). \end{aligned}$$
(4)

Every CP factorization B of A is of the same rank as A since \({\text {rank}}(X X^{\top })={\text {rank}}(X)\) holds for any matrix X. The first inequality of (4) comes from the fact that for any CP factorization B,

$$\begin{aligned} {\text {rank}}(A) = {\text {rank}}(B) \le \text { the number of columns of } B . \end{aligned}$$

The second is trivial by definition.

Note that computing or estimating the cp-rank of any given \(A \in \mathcal{CP}\mathcal{}_{n}\) is still an open problem. The following result gives a tight upper bound of the cp-rank for \(A \in \mathcal{CP}\mathcal{}_n\) in terms of the order n.

Theorem 2.1

(Bomze, Dickinson, and Still [21, Theorem 4.1]) For all \(A \in {\mathcal {C}} {\mathcal {P}}_{n}\), we have

$$\begin{aligned} {\text {cp}}(A) \le \mathrm {cp}_{n}:=\left\{ \begin{array}{ll} n &{} \text{ for } n \in \{2,3,4\} \\ \frac{1}{2} n(n+1)-4 &{} \text{ for } n \ge 5. \end{array}\right. \end{aligned}$$
(5)

The following result is useful for distinguishing completely positive matrices in the interior of \(\mathcal{CP}\mathcal{}_n\) from those on its boundary.

Theorem 2.2

(Dickinson [22, Theorem 3.8]) We have

$$\begin{aligned} \begin{aligned} {\text {int}}({\mathcal {C}} {\mathcal {P}}_{n})&=\{A \in {\mathcal {S}}_{n} \mid {\text {rank}}(A)=n, {\text {cp}}^{+}(A)<\infty \} \\&=\{ A \in {\mathcal {S}}_{n} \mid {\text {rank}}(A)=n, A=BB^{\top }, B \in {\mathbb {R}}^{n \times r}, B \ge 0, \\&\qquad b_{j}>0 \hbox { for at least one column } b_{j} \hbox { of } B \}. \end{aligned} \end{aligned}$$

2.2 CP factorization as a feasibility problem

Groetzner and Dür [18] reformulated the CP factorization problem as an equivalent feasibility problem containing an orthogonality constraint.

Given \(A \in \mathcal{CP}\mathcal{}_{n}\) together with a CP factorization B with r columns, we can easily obtain another CP factorization \({\widehat{B}}\) with \(r^{\prime }\) columns for every integer \(r^{\prime } \ge r\). The simplest way to construct such an \(n \times r^{\prime }\) matrix \({\widehat{B}}\) is to append \(k:=r^{\prime }-r\) zero columns to B,  i.e., \({\widehat{B}}:=\left[ B, 0_{n \times k}\right] \ge 0.\) Another way is called column replication, i.e.,

$$\begin{aligned} {\widehat{B}}:=[b_{1}, \ldots , b_{n-1}, \underbrace{\frac{1}{\sqrt{m}} b_{n}, \ldots , \frac{1}{\sqrt{m}} b_{n}}_{m:=r^{\prime }-n+1 \text{ columns } }], \end{aligned}$$
(6)

where \(b_{i}\) denotes the i-th column of B. It is easy to see that \({\widehat{B}} {\widehat{B}}^{\top }=B B^{\top }=A.\) The next lemma is easily derived from the previous discussion, and it implies that there always exists an \(n \times \mathrm {cp}_{n}\) CP factorization for any \(A \in \mathcal{CP}\mathcal{}_{n}\). Recall that the definition of \(\mathrm {cp}_{n}\) is given in (5).

Lemma 2.3

Suppose that \(A \in {\mathcal {S}}_{n}\), \(r \in {\mathbb {N}}\). Then \(r \ge {\text {cp}}(A)\) if and only if A has a CP factorization B with r columns.

Let \({\mathcal {O}}(r)\) denote the orthogonal group of order r, i.e., the set of \(r \times r\) orthogonal matrices. The following lemma is essential to our study. Note that many authors have proved the existence of such an orthogonal matrix X (see, e.g., [23, Lemma 2.1] and [18, Lemma 2.6]).

Lemma 2.4

Let \(B, C \in {\mathbb {R}}^{n \times r}\). \(B B^{\top }=C C^{\top }\) if and only if there exists \(X \in {\mathcal {O}}(r) \text { with }B X=C\).

The next proposition puts the previous two lemmas together.

Proposition 2.5

Let \(A \in \mathcal{CP}\mathcal{}_n\), \(r \ge {\text {cp}}(A)\), and \(A = {\bar{B}} {\bar{B}}^{\top }\), where \({\bar{B}} \in {\mathbb {R}}^{n \times r}\) is not necessarily nonnegative. Then there exists an orthogonal matrix \(X \in {\mathcal {O}}(r)\) such that \({\bar{B}} X \ge 0\) and \(A=({\bar{B}}X)({\bar{B}}X)^{\top }.\)

This proposition tells us that one can find an orthogonal matrix X which can turn a “bad” factorization \({\bar{B}}\) into a “good” factorization \({\bar{B}}X\). Let \(r \ge {\text {cp}}(A)\) and \({\bar{B}} \in {\mathbb {R}}^{n \times r}\) be an arbitrary (possibly not nonnegative) initial factorization \(A={\bar{B}} {\bar{B}}^{\top }\). The task of finding a CP factorization of A can then be formulated as the following feasibility problem,

$$\begin{aligned} \text{ Find } X \text{ s.t. } {\bar{B}} X \ge 0 \text{ and } X \in {\mathcal {O}}(r). \end{aligned}$$
(FeasCP)

We should notice that the condition \(r \ge {\text {cp}}(A)\) is necessary; otherwise, (FeasCP) has no solution even if \(A \in \mathcal{CP}\mathcal{}_{n}.\) Since the exact value of \({\text {cp}}(A)\) is often unknown, one can use \(r=\mathrm {cp}_{n}\) from (5). Note that finding an initial matrix \({\bar{B}}\) is not difficult. Since a completely positive matrix is necessarily positive semidefinite, one can use a Cholesky decomposition or a spectral decomposition and then extend the resulting factor to r columns by using (6). The following corollary shows that the feasibility of (FeasCP) is precisely a criterion for complete positivity.

Corollary 2.6

Set \(r \ge {\text {cp}}(A)\), and let \({\bar{B}} \in {\mathbb {R}}^{n \times r}\) be an arbitrary initial factorization of A. Then \(A \in \mathcal{CP}\mathcal{}_{n}\) if and only if (FeasCP) is feasible. In this case, for any feasible solution X of (FeasCP), \({\bar{B}}X\) is a CP factorization of A.

In this study, solving (FeasCP) is the key to finding a CP factorization, but it is still a hard problem because \({\mathcal {O}}(r)\) is nonconvex.
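The construction of the initial factorization \({\bar{B}}\) described above can be sketched as follows. This is a minimal illustration in Python/NumPy under our own naming and design choices (Cholesky with a spectral fallback, followed by column replication (6), assuming \(r \ge n\)); it is not the authors' reference implementation. Given such a \({\bar{B}}\) and any feasible X for (FeasCP), Corollary 2.6 says that \({\bar{B}}X\) is a CP factorization.

```python
import numpy as np

def initial_factorization(A, r):
    """Return an n x r matrix Bbar with A = Bbar @ Bbar.T (not necessarily nonnegative)."""
    n = A.shape[0]
    try:
        B = np.linalg.cholesky(A)              # n x n factor when A is positive definite
    except np.linalg.LinAlgError:
        w, V = np.linalg.eigh(A)               # spectral decomposition otherwise
        B = V @ np.diag(np.sqrt(np.clip(w, 0.0, None)))
    # column replication (6): keep the first n-1 columns and copy the last one m times
    m = r - n + 1                              # assumes r >= n
    return np.hstack([B[:, : n - 1], np.tile(B[:, [n - 1]] / np.sqrt(m), (1, m))])

rng = np.random.default_rng(1)
C = np.abs(rng.standard_normal((6, 12)))
A = C @ C.T                                    # a completely positive test matrix
Bbar = initial_factorization(A, r=2 * A.shape[0])
print(np.allclose(Bbar @ Bbar.T, A))           # Bbar is a valid (signed) factorization
```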

2.3 Approaches to solving (FeasCP)

Groetzner and Dür [18] applied the so-called alternating projections method to (FeasCP). They defined the polyhedral cone, \({\mathcal {P}}:=\{X \in {\mathbb {R}}^{r \times r} : {\bar{B}} X \ge 0\},\) and rewrote (FeasCP) as

$$\begin{aligned} \text{ Find } X \text{ s.t. } X \in {\mathcal {P}} \cap {\mathcal {O}}(r). \end{aligned}$$

The alternating projections method is as follows: choose a starting point \(X_{0} \in {\mathcal {O}}(r)\); then compute \(P_{0}={\text {proj}}_{{\mathcal {P}}}(X_{0})\) and \(X_{1}={\text {proj}}_{{\mathcal {O}}(r)}(P_{0})\), and iterate this process. Computing the projection onto \({\mathcal {P}}\) amounts to solving a second-order cone problem (SOCP), while computing the projection onto \({\mathcal {O}}(r)\) amounts to a singular value decomposition. Note that an SOCP must be solved at every iteration, which is still expensive in practice. A modified version, which lacks a convergence guarantee, computes an approximation of \({\text {proj}}_{{\mathcal {P}}}(X_{k})\) by using the Moore-Penrose inverse of \({\bar{B}}\); for details, see [18, Algorithm 2].
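For reference, the projection onto \({\mathcal {O}}(r)\) has a closed form via an SVD (the polar factor), whereas the projection onto \({\mathcal {P}}\) requires an SOCP solver and is not sketched here. A minimal NumPy version of the former (the function name is ours):

```python
import numpy as np

def proj_orthogonal(Y):
    """Nearest matrix in O(r) to Y in the Frobenius norm: the polar factor U @ Vt of Y."""
    U, _, Vt = np.linalg.svd(Y)
    return U @ Vt

Y = np.random.default_rng(2).standard_normal((4, 4))
X = proj_orthogonal(Y)
print(np.allclose(X.T @ X, np.eye(4)))   # X is orthogonal
```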

Our way is to use the optimization form. Here, we denote by \(\max (\cdot )\) (resp. \(\min (\cdot )\)) the max-function (resp. min-function) that selects the largest (resp. smallest) entry of a vector or matrix. Notice that \(-\min (\cdot )=\max (-(\cdot )).\) We associate (FeasCP) with the following optimization problem:

$$\begin{aligned} \max _{X \in {\mathcal {O}}(r)} \{\min \,({\bar{B}}X)\}. \end{aligned}$$

For consistency of notation, we turn the maximization problem into a minimization problem:

$$\begin{aligned} \min _{X \in {\mathcal {O}}(r)} \{\max \,(-{\bar{B}}X)\}. \end{aligned}$$
(OptCP)

The feasible set \({\mathcal {O}}(r)\) is known to be compact [24, Observation 2.1.7]. In accordance with the extreme value theorem [25, Theorem 4.16], (OptCP) attains its global minimum, say t. Summarizing these observations together with Corollary 2.6 yields the following proposition.

Proposition 2.7

Set \(r \ge {\text {cp}}(A)\), and let \({\bar{B}} \in {\mathbb {R}}^{n \times r}\) be an arbitrary initial factorization of A. Then the following statements are equivalent:

  1. \(A \in \mathcal{CP}\mathcal{}_{n}\).

  2. (FeasCP) is feasible.

  3. (OptCP) has a feasible solution X such that \(\max (-{\bar{B}}X) \le 0\) or, equivalently, \(\min ({\bar{B}}X) \ge 0\).

  4. The global minimum t of (OptCP) satisfies \(t \le 0\).

3 Riemannian smoothing method

The problem of minimizing a real-valued function over a Riemannian manifold \({\mathcal {M}}\), which is called Riemannian optimization, has been actively studied during the last few decades. In particular, the Stiefel manifold,

$$\begin{aligned} {\text {St}}(n, p)=\{X \in {\mathbb {R}}^{n \times p} \mid X^{\top } X=I\}, \end{aligned}$$

(when \(n=p\), it reduces to the orthogonal group) is an important case and is our main interest here. We treat the CP factorization problem, i.e., (OptCP), as a problem of minimizing a nonsmooth function over a Riemannian manifold, for which variants of subgradient methods [26], proximal gradient methods [27], and the alternating direction method of multipliers (ADMM) [28] have been studied.

Smoothing methods [29], which use a parameterized smoothing function to approximate the objective function, are effective on a class of nonsmooth optimizations in Euclidean space. Recently, Zhang, Chen and Ma [30] extended a smoothing steepest descent method to the case of Riemannian submanifolds in \({\mathbb {R}}^{n}\). This is not the first time that smoothing methods have been studied on manifolds. Liu and Boumal [31] extended the augmented Lagrangian method and exact penalty method to the Riemannian case. The latter leads to a nonsmooth Riemannian optimization problem to which they applied smoothing techniques. Cambier and Absil [32] dealt with the problem of robust low-rank matrix completion by solving a Riemannian optimization problem, wherein they applied a smoothing conjugate gradient method.

In this section, we propose a general Riemannian smoothing method and apply it to the CP factorization problem.

3.1 Notation and terminology of Riemannian optimization

Let us briefly review some concepts in Riemannian optimization, following the notation of [33]. Throughout this paper, \({\mathcal {M}}\) will refer to a complete Riemannian submanifold of Euclidean space \({\mathbb {R}}^{n}\). Thus, \({\mathcal {M}}\) is endowed with a Riemannian metric induced from the Euclidean inner product, i.e., \(\langle \xi , \eta \rangle _{x}:=\xi ^{\top } \eta\) for any \(\xi , \eta \in \mathrm{T}_{x} {\mathcal {M}}\), where \(\mathrm{T}_{x} {\mathcal {M}} \subseteq {\mathbb {R}}^{n}\) is the tangent space to \({\mathcal {M}}\) at x. The Riemannian metric induces the usual Euclidean norm \(\Vert \xi \Vert _{x}:=\Vert \xi \Vert =\sqrt{\langle \xi , \xi \rangle _{x}}\) for \(\xi \in \mathrm{T}_{x} {\mathcal {M}}\). The tangent bundle \(\mathrm{T} {\mathcal {M}} :=\bigsqcup _{x \in {\mathcal {M}}} \mathrm{T}_{x} {\mathcal {M}}\) is a disjoint union of the tangent spaces of \({\mathcal {M}}\). Let \(f: {\mathcal {M}} \rightarrow {\mathbb {R}}\) be a smooth function on \({\mathcal {M}}\). The Riemannian gradient of f is a vector field \({\text {grad}} f\) on \({\mathcal {M}}\) that is uniquely defined by the identities: for all \((x, v) \in \mathrm {T} {\mathcal {M}}\),

$$\begin{aligned} \mathrm {D}f(x)[v]=\langle v, {\text {grad}} f(x) \rangle _{x} \end{aligned}$$

where \(\mathrm {D}f(x):\mathrm{T}_{x} {\mathcal {M}} \rightarrow \mathrm{T}_{f(x)} {\mathbb {R}} \cong {\mathbb {R}}\) is the differential of f at \(x \in {\mathcal {M}}\). Since \({\mathcal {M}}\) is an embedded submanifold of \({\mathbb {R}}^{n}\), there is a simpler expression when f is also well defined on the whole of \({\mathbb {R}}^{n}\):

$$\begin{aligned} {\text {grad}} f(x)={\text {Proj}}_{x}(\nabla f(x)), \end{aligned}$$

where \(\nabla f(x)\) is the usual gradient in \({\mathbb {R}}^{n}\) and \({\text {Proj}}_{x}\) denotes the orthogonal projector from \({\mathbb {R}}^{n}\) to \(\mathrm{T}_{x} {\mathcal {M}}\). For a subset \(D \subseteq {\mathbb {R}}^{n}\), we write \(h \in C^{1}(D)\) if h is smooth, i.e., continuously differentiable, on D. Given a point \(x \in {\mathbb {R}}^{n}\) and \(\delta >0\), \({\mathcal {B}}(x, \delta )\) denotes the closed ball of radius \(\delta\) centered at x. \({\mathbb {R}}_{++}\) denotes the set of positive real numbers. We use subscript notation \(x_{i}\) to select the ith entry of a vector and superscript notation \(x^{k}\) to designate an element in a sequence \(\{x^{k}\}\).
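For the Stiefel manifold (and hence for \({\mathcal {O}}(r)={\text {St}}(r,r)\)) with the metric inherited from the embedding, the orthogonal projector onto the tangent space has the well-known closed form \({\text {Proj}}_{X}(Z)=Z-X\,{\text {sym}}(X^{\top }Z)\) with \({\text {sym}}(M)=(M+M^{\top })/2\), so the Riemannian gradient is easily obtained from the Euclidean one. A minimal sketch (NumPy; egrad stands for \(\nabla f(X)\) and is supplied by the caller):

```python
import numpy as np

def sym(M):
    return (M + M.T) / 2.0

def riemannian_grad_stiefel(X, egrad):
    """Riemannian gradient on St(n, p) embedded in R^{n x p}: the orthogonal
    projection of the Euclidean gradient egrad onto the tangent space at X."""
    return egrad - X @ sym(X.T @ egrad)
```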

3.2 Ingredients

Now let us consider the nonsmooth Riemannian optimization problem (NROP):

$$\begin{aligned} \min _{x \in {\mathcal {M}}} f(x), \end{aligned}$$
(NROP)

where \({\mathcal {M}} \subseteq {\mathbb {R}}^{n}\) and \(f: {\mathbb {R}}^{n} \rightarrow {\mathbb {R}}\) is a proper lower semi-continuous function (possibly nonsmooth or even non-Lipschitz) on \({\mathbb {R}}^{n}\). For convenience, the term smooth Riemannian optimization problem (SROP) refers to NROP when \(f(\cdot )\) is continuously differentiable on \({\mathbb {R}}^{n}\). To avoid confusion in this case, we use g instead of f,

$$\begin{aligned} \min _{x \in {\mathcal {M}}} g(x). \end{aligned}$$
(SROP)

Throughout this subsection, we will refer to many of the concepts in [30].

First, let us review the usual concepts and properties related to generalized subdifferentials in \({\mathbb {R}}^{n}\). For a proper lower semi-continuous function \(f: {\mathbb {R}}^{n} \rightarrow {\mathbb {R}}\), the Fréchet subdifferential and the limiting subdifferential of f at \(x \in {\mathbb {R}}^{n}\) are defined as

$$\begin{aligned}&\begin{aligned} {\hat{\partial }} f(x):=\{\nabla h(x) \mid \exists \delta&>0 \text{ such } \text{ that } h \in C^{1}({\mathcal {B}}(x, \delta )) \text{ and } \\&f-h \text{ attains } \text{ a } \text{ local } \text{ minimum } \text{ at } x \text{ on } {\mathbb {R}}^{n}\}, \end{aligned} \\&\partial f(x):=\{\lim _{\ell \rightarrow \infty } v^{\ell }\mid v^{\ell } \in {\hat{\partial }} f\left( x^{\ell }\right) ,\left( x^{\ell }, f\left( x^{\ell }\right) \right) \rightarrow (x, f(x))\}. \end{aligned}$$

The definition of \({\hat{\partial }} f(x)\) above is not the standard one: the standard definition follows [34, 8.3 Definition]. But these definitions are equivalent by [34, 8.5 Proposition]. For locally Lipschitz functions, the Clarke subdifferential at \(x \in {\mathbb {R}}^{n}\), \(\partial ^{\circ } f(x)\), is the convex hull of the limiting subdifferential. Their relationship is as follows:

$$\begin{aligned} {\hat{\partial }} f(x) \subseteq \partial f(x) \subseteq \partial ^{\circ } f(x). \end{aligned}$$

Notice that if f is convex, \(\partial f(x)\) and \(\partial ^{\circ } f(x)\) coincide with the classical subdifferential in convex analysis [34, 8.12 Proposition].

Example 1

(Bagirov, Karmitsa, and Mäkelä [35, Theorem 3.23]) From a result on the pointwise max-function in convex analysis, we have

$$\begin{aligned} \partial \max (x)={\text {conv}}\{e_i \mid i \in {\mathcal {I}}(x) \}, \end{aligned}$$

where \(e_i\)’s are the standard bases of \({\mathbb {R}}^{n}\) and \({\mathcal {I}}(x)=\{i \mid x_{i}=\max (x)\}.\)

Next, we extend our discussion to include generalized subdifferentials of a nonsmooth function on submanifolds \({\mathcal {M}}\). The Riemannian Fréchet subdifferential and the Riemannian limiting subdifferential of f at \(x \in {\mathcal {M}}\) (see, e.g., [30, Definition 3.1]) are defined as

$$\begin{aligned}&\begin{aligned} {\hat{\partial }}_{{\mathcal {R}}} f(x):=\{{\text {grad}} h(x) \mid \exists \delta&>0 \text{ such } \text{ that } h \in C^{1}({\mathcal {B}}(x, \delta )) \text{ and } \\&f-h \text{ attains } \text{ a } \text{ local } \text{ minimum } \text{ at } x \text{ on } {\mathcal {M}}\}, \end{aligned} \\&\partial _{{\mathcal {R}}} f(x):=\{\lim _{\ell \rightarrow \infty } v^{\ell } \mid v^{\ell } \in {\hat{\partial }}_{{\mathcal {R}}} f\left( x^{\ell }\right) ,\left( x^{\ell }, f\left( x^{\ell }\right) \right) \rightarrow (x, f(x))\}. \end{aligned}$$

If \({\mathcal {M}}={\mathbb {R}}^{n}\), the above definitions coincide with the usual Fréchet and limiting subdifferentials in \({\mathbb {R}}^{n}\). Moreover, it follows directly that, for all \(x \in {\mathcal {M}}\), one has \({\hat{\partial }}_{{\mathcal {R}}} f(x) \subseteq \partial _{{\mathcal {R}}} f(x)\). According to [30, Proposition 3.2], if x is a local minimizer of f on \({\mathcal {M}}\), then \(0 \in {\hat{\partial }}_{{\mathcal {R}}} f({x})\). Thus, we call a point \(x \in {\mathcal {M}}\) a Riemannian limiting stationary point of NROP if

$$\begin{aligned} 0 \in \partial _{{\mathcal {R}}} f(x). \end{aligned}$$
(7)

In this paper, we treat (7) as a necessary condition for x to be a local solution of NROP.

The smoothing function is the most important tool of the smoothing method.

Definition 3.1

(Zhang and Chen [36, Definition 3.1]) A function \({\tilde{f}}(\cdot , \cdot ): {\mathbb {R}}^{n} \times {\mathbb {R}}_{++} \rightarrow {\mathbb {R}}\) is called a smoothing function of \(f: {\mathbb {R}}^{n} \rightarrow {\mathbb {R}}\), if \({\tilde{f}}(\cdot , \mu )\) is continuously differentiable in \({\mathbb {R}}^{n}\) for any \(\mu >0\),

$$\begin{aligned} \lim _{z \rightarrow x, \mu \downarrow 0} {\tilde{f}}(z, \mu )=f(x) \end{aligned}$$

and there exist a constant \(\kappa >0\) and a function \(\omega : {\mathbb {R}}_{++} \rightarrow {\mathbb {R}}_{++}\) such that

$$\begin{aligned} |{\tilde{f}}(x, \mu )-f(x)|\le \kappa \omega (\mu ) \quad \text{ with } \quad \lim _{\mu \downarrow 0} \omega (\mu )=0. \end{aligned}$$
(8)

Example 2

(Chen, Wets, and Zhang [37, Lemma 4.4]) The LogSumExp function, \({\text {lse}}(x,\mu ) : {\mathbb {R}}^n \times {\mathbb {R}}_{++} \rightarrow {\mathbb {R}}\), given by

$$\begin{aligned} {\text {lse}}(x,\mu ) := \mu \log ( \textstyle \sum _{i=1}^{n} \exp ( x_i/\mu ) ), \end{aligned}$$

is a smoothing function of \(\max (x)\), as can be seen from the following properties:

(i) \({\text {lse}}(\cdot , \mu )\) is smooth on \({\mathbb {R}}^{n}\) for any \(\mu >0\). Its gradient \(\nabla _{x} {\text {lse}}(x,\mu )\) is given by \(\sigma (\cdot ,\mu ) : {\mathbb {R}}^n \rightarrow \Delta ^{n-1}\),

$$\begin{aligned} \nabla _{x} {\text {lse}}(x,\mu ) =\sigma (x,\mu ):=\frac{1}{\sum _{\ell =1}^{n} \exp (x_\ell /\mu )} [\begin{array}{c} \exp ( x_{1}/\mu ), \ldots , \exp ( x_{n}/\mu ) \end{array}]^{\top }, \end{aligned}$$
(9)

where \(\Delta ^{n-1}:=\{x \in {\mathbb {R}}^{n} \mid \sum _{i=1}^{n} x_i =1, x_{i} \ge 0\}\) is the unit simplex.

(ii) For all \(x \in {\mathbb {R}}^n\) and \(\mu > 0\), we have \(\max (x) < {\text {lse}}(x,\mu ) \le \max ( x) + \mu \log (n).\) Thus, (8) holds with the constant \(\kappa = \log (n)\) and \(\omega (\mu )=\mu\). The above inequalities also imply that \(\lim _{z \rightarrow x, \mu \downarrow 0} {\text {lse}}(z,\mu )= \max (x).\)
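A direct numerical transcription of the formulas above (NumPy; variable names and test data are ours) that evaluates \({\text {lse}}(x,\mu )\) and \(\sigma (x,\mu )\) and checks the bounds in (ii):

```python
import numpy as np

def lse(x, mu):
    """LogSumExp smoothing function of max(x)."""
    return mu * np.log(np.sum(np.exp(x / mu)))

def lse_grad(x, mu):
    """Gradient (9): a point of the unit simplex (the softmax of x / mu)."""
    w = np.exp(x / mu)
    return w / w.sum()

x = np.array([0.3, -1.0, 0.3, 2.0])
for mu in (1.0, 0.3, 0.1):
    val = lse(x, mu)
    # max(x) < lse(x, mu) <= max(x) + mu * log(n), and the gradient sums to one
    print(mu, x.max() < val <= x.max() + mu * np.log(x.size), lse_grad(x, mu).sum())
```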

Gradient sub-consistency or consistency is crucial to showing that any limit point of the Riemannian smoothing method is also a limiting stationary point of NROP.

Definition 3.2

(Zhang, Chen and Ma [30, Definition 3.4 & 3.9]) A smoothing function \({\tilde{f}}\) of f is said to satisfy gradient sub-consistency on \({\mathbb {R}}^{n}\) if, for any \(x \in {\mathbb {R}}^{n}\),

$$\begin{aligned} G_{{\tilde{f}}}(x) \subseteq \partial f(x), \end{aligned}$$
(10)

where the subdifferential of f associated with \({\tilde{f}}\) at \(x \in {\mathbb {R}}^{n}\) is given by

$$\begin{aligned} G_{{\tilde{f}}}(x):=\{u \in {\mathbb {R}}^{n} \mid \nabla _{x} {\tilde{f}}\left( z_{k}, \mu _{k}\right) \rightarrow u \text{ for } \text{ some } z_{k} \rightarrow x, \mu _{k} \downarrow 0\}. \end{aligned}$$

Similarly, \({\tilde{f}}\) is said to satisfy Riemannian gradient sub-consistency on \({\mathcal {M}}\) if, for any \(x \in {\mathcal {M}}\),

$$\begin{aligned} G_{{\tilde{f}}, {\mathcal {R}}}(x) \subseteq \partial _{{\mathcal {R}}} f(x), \end{aligned}$$
(11)

where the Riemannian subdifferential of f associated with \({\tilde{f}}\) at \(x \in {\mathcal {M}}\) is given by

$$\begin{aligned} G_{{\tilde{f}}, {\mathcal {R}}}(x)=\{v \in {\mathbb {R}}^{n}\mid {\text {grad}} {\tilde{f}}\left( z_{k}, \mu _{k}\right) \rightarrow v \text{ for } \text{ some } z_{k} \in {\mathcal {M}}, z_{k} \rightarrow x, \mu _{k} \downarrow 0\}. \end{aligned}$$

If the inclusion in (10) holds with equality, then \({\tilde{f}}\) is said to satisfy gradient consistency on \({\mathbb {R}}^{n}\), and similarly in (11) for \({\mathcal {M}}\). Thanks to the following useful proposition from [30], we can deduce Riemannian gradient sub-consistency on \({\mathcal {M}}\) from gradient sub-consistency on \({\mathbb {R}}^{n}\) if f is locally Lipschitz.

Proposition 3.3

(Zhang, Chen and Ma [30, Proposition 3.10]) Let f be a locally Lipschitz function and \({\tilde{f}}\) a smoothing function of f. For \({\tilde{f}}\), if gradient sub-consistency holds on \({\mathbb {R}}^{n}\), then Riemannian gradient sub-consistency holds on \({\mathcal {M}}\) as well.

The next example gives gradient consistency on \({\mathbb {R}}^{n}\) for \({\text {lse}}(x,\mu )\) from Example 2; since \(\max (x)\) is convex and hence locally Lipschitz continuous, Proposition 3.3 then yields Riemannian gradient sub-consistency on \({\mathcal {M}}\).

Example 3

(Chen, Wets, and Zhang [37, Lemma 4.4]) The smoothing function \({\text {lse}}(x,\mu )\) of \(\max (x)\) satisfies gradient consistency on \({\mathbb {R}}^{n}\). That is, for any \(x \in {\mathbb {R}}^n\),

$$\begin{aligned} \partial \max (x)=G_{{\text {lse}}}(x)=\{\lim _{x^{k} \rightarrow x, \mu _{k} \downarrow 0} \sigma (x^{k},\mu _{k})\}. \end{aligned}$$

Note that the original assertion of [37, Lemma 4.4] is gradient consistency in the Clarke sense, i.e., \(\partial ^{\circ } \max (x)=G_{{\text {lse}}}(x)\).

3.3 Riemannian smoothing method

Motivated by the previous papers [30,31,32] on smoothing methods and Riemannian manifolds, we propose a general Riemannian smoothing method. Algorithm 1 is the basic framework of this general method.

[Algorithm 1: basic framework of the Riemannian smoothing method]
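The algorithm listing itself appears only as a figure; the following schematic sketch (Python) paraphrases the basic framework as it is used in Theorems 3.4 and 3.5. The function solve_subproblem is a placeholder for an exact or warm-started solver of the smoothed problem (12), and the parameter names are our own illustrative choices.

```python
def riemannian_smoothing_basic(f_tilde, solve_subproblem, x0, mu0=1.0, theta=0.5, n_outer=50):
    """Basic framework (Algorithm 1, schematically): repeatedly minimize the
    smoothed function f_tilde(., mu_k) over the manifold M and shrink mu."""
    x, mu = x0, mu0
    for k in range(n_outer):
        # subproblem (12): (approximately) minimize f_tilde(., mu) over M, warm-started at x
        x = solve_subproblem(lambda y: f_tilde(y, mu), x)
        mu = theta * mu   # mu_{k+1} = theta * mu_k with theta in (0, 1), so mu_k -> 0
    return x
```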

Now let us describe the convergence properties of the basic method. First, let us assume that the function \({\tilde{f}}(x, \mu _{k})\) has a minimizer on \({\mathcal {M}}\) for each value of \(\mu _{k}\).

Theorem 3.4

Suppose that each \(x^{k}\) is an exact global minimizer of (12) in Algorithm 1. Then every limit point \(x^{*}\) of the sequence \(\{x^{k}\}\) is a global minimizer of the problem NROP.

Proof

Let \({\bar{x}}\) be a global solution of NROP, that is,

$$\begin{aligned} f({\bar{x}}) \le f(x) \quad \text{ for } \text{ all } x \in {\mathcal {M}}. \end{aligned}$$

From the Definition 3.1 of the smoothing function, there exist a constant \(\kappa >0\) and a function \(\omega : {\mathbb {R}}_{++} \rightarrow {\mathbb {R}}_{++}\) such that, for all \(x \in {\mathcal {M}}\),

$$\begin{aligned} -\kappa \omega (\mu ) \le {\tilde{f}}(x, \mu )-f(x) \le \kappa \omega (\mu ) \end{aligned}$$
(13)

with \(\lim _{\mu \downarrow 0} \omega (\mu )=0.\) Substituting \(x=x^{k}\) into (13) and using the global optimality of \({\bar{x}}\), we have that

$$\begin{aligned} {\tilde{f}}(x^{k}, \mu _{k}) \ge f(x^{k})-\kappa \omega (\mu _{k}) \ge f({\bar{x}})-\kappa \omega (\mu _{k}). \end{aligned}$$

By rearranging this expression, we obtain

$$\begin{aligned} -\kappa \omega (\mu _{k}) \le {\tilde{f}}(x^{k}, \mu _{k}) -f({\bar{x}}). \end{aligned}$$
(14)

Since \(x^{k}\) minimizes \({\tilde{f}}(x, \mu _{k})\) on \({\mathcal {M}}\) for each \(\mu _{k}\), we have that \({\tilde{f}}(x^{k}, \mu _{k}) \le {\tilde{f}}({\bar{x}}, \mu _{k})\), which leads to

$$\begin{aligned} {\tilde{f}}(x^{k}, \mu _{k}) - f({\bar{x}}) \le {\tilde{f}}({\bar{x}}, \mu _{k}) - f({\bar{x}}) \le \kappa \omega (\mu _{k}). \end{aligned}$$
(15)

The second inequality above follows from (13). Combining (14) and (15), we obtain

$$\begin{aligned} |{\tilde{f}}(x^{k}, \mu _{k})-f({\bar{x}})|\le \kappa \omega (\mu _{k}). \end{aligned}$$
(16)

Now, suppose that \(x^{*}\) is a limit point of \(\{x^{k}\}\), so that there is an infinite subsequence \({\mathcal {K}}\) such that \(\lim _{k \in {\mathcal {K}}} x^{k}=x^{*}.\) Note that \(x^{*} \in {\mathcal {M}}\) because \({\mathcal {M}}\) is complete. By taking the limit as \(k \rightarrow \infty , k \in {\mathcal {K}}\), on both sides of (16), again by the definition of the smoothing function, we obtain

$$\begin{aligned} |f(x^{*})-f({\bar{x}}) |= \lim _{k \in {\mathcal {K}}} |{\tilde{f}}(x^{k}, \mu _{k})-f({\bar{x}})|\le \lim _{k \in {\mathcal {K}}} \kappa \omega (\mu _{k}) = 0. \end{aligned}$$

Thus, it follows that \(f(x^{*})=f({\bar{x}}).\) Since \(x^{*} \in {\mathcal {M}}\) is a point whose objective value is equal to that of the global solution \({\bar{x}}\), we conclude that \(x^{*}\), too, is a global solution. \(\square\)

This strong result requires us to find a global minimizer of each subproblem, which, however, cannot always be done. The next result concerns the convergence properties of the sequence \({\tilde{f}}(x^{k}, \mu _{k})\) under the condition that \({\tilde{f}}\) has the following additional property:

$$\begin{aligned} 0< \mu _{2}< \mu _{1} \Longrightarrow {\tilde{f}}(x,\mu _{2}) < {\tilde{f}}(x,\mu _{1})\text { for all } x \in {\mathbb {R}}^n. \end{aligned}$$
(17)

Example 4

The above property holds for \({\text {lse}}(x,\mu )\) in Example 2; i.e., we have \({\text {lse}}(x,\mu _{2}) < {\text {lse}}(x,\mu _{1})\) on \({\mathbb {R}}^n\), provided that \(0< \mu _{2} < \mu _{1}\).

To see this, note that, by the identity

$$\begin{aligned} \sum _{l=1}^{n} \exp (x_l/{\mu } ) =\exp \{ {\text {lse}}(x,\mu ) / \mu \}, \end{aligned}$$

the i-th component of \(\sigma (x,\mu )\) can be rewritten as

$$\begin{aligned} \sigma _{i}(x,\mu )= \exp \{ {( x_i - {\text {lse}}(x,\mu )) } /\mu \}. \end{aligned}$$

For any fixed \(x \in {\mathbb {R}}^n\), consider the derivative of the real function \(\mu \mapsto {\text {lse}}(x,\mu ) : {\mathbb {R}}_{++} \rightarrow {\mathbb {R}}\). We have

$$\begin{aligned} \begin{aligned} \nabla _{\mu } {\text {lse}}(x,\mu ) = {\text {lse}} / \mu - \frac{\sum _{i=1}^{n} x_{i} \exp (x_{i}/{\mu })}{\mu \exp {({\text {lse}} / \mu )}} =&( {\text {lse}} - \textstyle \sum _{i=1}^{n} x_{i} \exp \{( x_i - {\text {lse}})/\mu \} ) /\mu \\ =&( {\text {lse}} - \textstyle \sum _{i=1}^{n} x_{i} \sigma _i ) /\mu > 0, \end{aligned} \end{aligned}$$

where “\({\text {lse}}\), \(\sigma\)” are shorthand for \({\text {lse}}(x,\mu )\) and \(\sigma (x,\mu )\). For the last inequality above, we observe that \(\sigma \in \Delta ^{n-1}\); hence, the term \(\sum _{i=1}^{n} x_{i} \sigma _i\) is a convex combination of the entries of x, which implies that \(\sum _{i=1}^{n} x_{i} \sigma _i \le \max (x) < {\text {lse}}.\) Therefore \({\text {lse}}(x,\mu )\) is strictly increasing in \(\mu\), which completes the proof of our claims.
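A quick numerical check of this monotonicity (NumPy; the test vector is an arbitrary choice of ours):

```python
import numpy as np

x = np.array([1.0, -0.5, 0.2])
mus = [2.0, 1.0, 0.5, 0.1, 0.01]
vals = [mu * np.log(np.sum(np.exp(x / mu))) for mu in mus]
print(np.all(np.diff(vals) < 0))   # lse(x, mu) decreases as mu decreases ...
print(vals[-1] - x.max())          # ... and approaches max(x) from above
```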

In [32], the authors considered a special case of Algorithm 1, wherein the smoothing function \({\tilde{f}}(x,\mu ) =\sqrt{\mu ^{2}+x^{2}}\) of \(|x |\) also satisfies (17) and a Riemannian conjugate gradient method is used for (12).

Theorem 3.5

Suppose that \(f^{*}:=\inf _{x \in {\mathcal {M}}} f(x)\) exists and the smoothing function \({\tilde{f}}\) has property (17). Let \(f^{k}:={\tilde{f}}(x^{k}, \mu _{k})\). Then the sequence \(\{f^{k}\}\) generated by Algorithm 1 is strictly decreasing and bounded below by \(f^{*}\); hence,

$$\begin{aligned} \lim _{k \rightarrow \infty } |f^{k} - f^{k-1} |=0. \end{aligned}$$

Proof

For each \(k \ge 1\), \(x^{k}\) is obtained by approximately solving

$$\begin{aligned} \min _{x \in {\mathcal {M}}} {\tilde{f}}(x, \mu _{k}), \end{aligned}$$

starting at \(x^{k-1}\). Then at least, we have

$$\begin{aligned} {\tilde{f}}(x^{k-1}, \mu _{k}) \ge {\tilde{f}}(x^{k}, \mu _{k})=f^{k}. \end{aligned}$$

Since \(\mu _{k}=\theta \mu _{k-1} < \mu _{k-1}\), property (17) ensures

$$\begin{aligned} f^{k-1}={\tilde{f}}(x^{k-1}, \mu _{k-1}) > {\tilde{f}}(x^{k-1}, \mu _{k}). \end{aligned}$$

The claim that the sequence \(\{f^{k}\}\) is strictly decreasing follows from these two inequalities.

Suppose that, for all \(\mu > 0\) and for all \(x \in {\mathbb {R}}^n\),

$$\begin{aligned} {\tilde{f}}(x, \mu ) \ge {f}(x). \end{aligned}$$
(18)

Then for each k,

$$\begin{aligned} f^{k}={\tilde{f}}(x^{k}, \mu _{k}) \ge {f}(x^{k}) \ge \inf _{x \in {\mathcal {M}}} f(x) = f^{*}, \end{aligned}$$

which proves our claims.

Now, we show that (18) holds if the smoothing function has property (17). Fix any \(x \in {\mathbb {R}}^n\); (17) says that \({\tilde{f}}(x, \cdot )\) is strictly increasing on \(\mu >0\), so \({\tilde{f}}(x, \mu )\) decreases strictly as \(\mu \downarrow 0.\) On the other hand, from the definition of the smoothing function, we have that

$$\begin{aligned} \lim _{\mu \downarrow 0} {\tilde{f}}(x, \mu )=f(x). \end{aligned}$$

Hence, we have \(\inf _{\mu > 0} {\tilde{f}}(x, \mu )=f(x),\) as claimed. \(\square\)

Note that the above weak result does not ensure that \(\{ f^k\} \rightarrow f^*\). Next, for better convergence (compared with Theorem 3.5) and an easier implementation (compared with Theorem 3.4), we propose an enhanced Riemannian smoothing method: Algorithm 2. This is closer to the version in [30], where the authors use the Riemannian steepest descent method for solving the smoothed problem (19).

[Algorithm 2: enhanced Riemannian smoothing method]
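As with Algorithm 1, the listing appears as a figure; the sketch below (Python) reflects the structure assumed in Theorem 3.6 and the tolerance updates (22)/(23). Here sub_solver is a placeholder for any Riemannian solver with property (21), e.g. one of the Manopt solvers used in Sect. 4, run until the gradient-norm tolerance is met; the interface and names are our own illustrative choices.

```python
def riemannian_smoothing_enhanced(f_tilde, sub_solver, x0, mu0=1.0, theta=0.5,
                                  delta0=0.1, rho=0.5, gamma=None, n_outer=50):
    """Enhanced framework (Algorithm 2, schematically): at outer iteration k, a
    Riemannian sub-solver drives ||grad f_tilde(., mu_k)|| below delta_k
    (condition (20)); then mu and delta are decreased."""
    x, mu, delta = x0, mu0, delta0
    for k in range(n_outer):
        # smoothed subproblem (19), solved until the Riemannian gradient norm is <= delta
        x = sub_solver(lambda y: f_tilde(y, mu), x, grad_tol=delta)
        mu = theta * mu                                            # mu_{k+1} = theta * mu_k
        delta = gamma * mu if gamma is not None else rho * delta   # adaptive rule (23) or rule (22)
    return x
```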

The following result is adapted from [30, Proposition 4.2 & Theorem 4.3]. Readers are encouraged to refer to [30] for a discussion on the stationary point associated with \({\tilde{f}}\) on \({\mathcal {M}}\).

Theorem 3.6

In Algorithm 2, suppose that the chosen sub-algorithm has the following general convergence property for SROP:

$$\begin{aligned} \liminf _{\ell \rightarrow \infty }\Vert {\text {grad}} g(x^{\ell })\Vert =0. \end{aligned}$$
(21)

Moreover, suppose that, for all \(\mu _{k}\), the function \({\tilde{f}}(\cdot ,\mu _{k})\) satisfies the convergence assumptions of the sub-algorithm needed for g above and \({\tilde{f}}\) satisfies the Riemannian gradient sub-consistency on \({\mathcal {M}}\). Then

  1. For each k, there exists an \(x^{k}\) satisfying (20); hence, Algorithm 2 is well defined.

  2. Every limit point \(x^{*}\) of the sequence \(\{x^{k}\}\) generated by Algorithm 2 is a Riemannian limiting stationary point of NROP (see (7)).

Proof

Fix any \(\mu _{k}\). By (21), we have \(\liminf _{\ell \rightarrow \infty }\Vert {\text {grad}} {\tilde{f}}(x^{\ell },\mu _{k})\Vert =0\). Hence, there is a convergent subsequence of \(\Vert {\text {grad}} {\tilde{f}}(x^{\ell },\mu _{k})\Vert\) whose limit is 0. This means that, for any \(\epsilon >0\), there exists an integer \(\ell _\epsilon\) such that \(\Vert {\text {grad}} {\tilde{f}}(x^{\ell _\epsilon },\mu _{k})\Vert < \epsilon .\) Taking \(\epsilon = \delta _{k}\), we can set \(x^{k}:= x^{\ell _\epsilon }\), which satisfies (20). Thus, statement (1) holds.

Next, suppose that \(x^{*}\) is a limit point of \(\{x^{k}\}\) generated by Algorithm 2, so that there is an infinite subsequence \({\mathcal {K}}\) such that \(\lim _{k \in {\mathcal {K}}} x^{k}=x^{*}.\) From (20), we have

$$\begin{aligned} \lim _{k \in {\mathcal {K}}} \Vert {\text {grad}} {\tilde{f}}(x^{k}, \mu _{k})\Vert \le \lim _{k \in {\mathcal {K}}} \delta _{k}=0, \end{aligned}$$

and we find that \({\text {grad}} {\tilde{f}}(x^{k}, \mu _{k}) \rightarrow 0\) for \(k \in {\mathcal {K}}, x^{k} \in {\mathcal {M}}, x^{k} \rightarrow x^{*}, \mu _{k} \downarrow 0\). Hence,

$$\begin{aligned} 0 \in G_{{\tilde{f}}, {\mathcal {R}}}(x^{*}) \subseteq \partial _{{\mathcal {R}}} f(x^{*}). \end{aligned}$$

\(\square\)

Now let us consider the selection strategy of the nonnegative sequence \(\{\delta _{k}\}\) with \(\delta _{k} \rightarrow 0\). In [30], when \(\mu _{k+1}=\theta \mu _{k}\) shrinks, the authors set

$$\begin{aligned} \delta _{k+1}:=\rho \delta _{k} \end{aligned}$$
(22)

with an initial value of \(\delta _{0}\) and constant \(\rho \in (0,1)\). In the spirit of the usual smoothing methods described in [29], one can set

$$\begin{aligned} \delta _{k}:=\gamma \mu _{k} \end{aligned}$$
(23)

with a constant \(\gamma >0\). The latter is an adaptive rule, because \(\mu _{k}\) determines subproblem (19) and its stopping criterion at the same time. The merits and drawbacks of the two rules require more discussion, but the latter seems to be more reasonable.

We conclude this section by discussing the connections with [32] and [30]. Our work unifies the two approaches. [32] focused on a specific algorithm and did not discuss the underlying general framework, whereas we study a general framework for Riemannian smoothing. Recall that the “smoothing function” is the core tool of the smoothing method. In addition to the requirements of its definition (see Definition 3.1), it needs to have the following “additional properties” (AP) in order for the algorithms to converge:

  (AP1) Approximate from above, i.e., (17). (Needed in Algorithm 1)

  (AP2) (Riemannian) gradient sub-consistency, i.e., Definition 3.2. (Needed in Algorithm 2)

We find that not all smoothing functions satisfy (AP1), and for some functions it is hard to prove whether (AP2) holds. For example, all the functions in Table 1 are smoothing functions of |x|, but only the first three meet (AP1); the last two do not. In [29], the authors showed that the first one in Table 1, \({\tilde{f}}_{1}(x, \mu )\), has property (AP2). The others remain to be verified, but doing so will not be a trivial exercise. To a certain extent, Algorithm 1, together with Theorem 3.5, guarantees a fundamental convergence result even if one has difficulty in showing whether one's smoothing function satisfies (AP2). Therefore, it makes sense to consider Algorithms 1 and 2 together for the sake of the completeness of the general framework.

Table 1 List of smoothing functions of the absolute value function |x| with \(\kappa\) and \(\omega (\mu )\) in (8)

Algorithm 2 expands on the results of [30]. It allows us to use any standard method for SROP, not just steepest descent, to solve the smoothed problem (19). Various standard Riemannian algorithms for SROP, such as the Riemannian conjugate gradient method [38] (which often performs better than Riemannian steepest descent), the Riemannian Newton method [39, Chapter 6], and the Riemannian trust region method [39, Chapter 7], have extended the concepts and techniques used in Euclidean space to Riemannian manifolds. As shown by Theorem 3.6, the choice of sub-algorithm for (19) does not affect the final convergence as long as it has property (21). On the other hand, we advocate that the sub-algorithm be viewed as a “black box”, so that the user need not be concerned with its implementation details. We can directly use an existing solver, e.g., Manopt [40], which includes the standard Riemannian algorithms mentioned above. Hence, we can choose the most suitable sub-algorithm for the application and quickly implement it with minimal effort.

4 Numerical experiments on CP factorization

The numerical experiments in Sects. 4 and 5 were performed on a computer equipped with an Intel Core i7-10700 at 2.90GHz with 16GB of RAM using Matlab R2022a. Our Algorithm 2 is implemented in the Manopt framework [40] (version 7.0). The iterations performed by the sub-algorithm when solving the smoothed problem (19) are counted in the total number of iterations. We refer readers to the supplementary material of this paper for the available code.

In this section, we describe numerical experiments that we conducted on CP factorization in which we solved OptCP using Algorithm 2, where different Riemannian algorithms were employed as sub-algorithms and \({\text {lse}}(-{\bar{B}}X,\mu )\) was used as the smoothing function. To be specific, we used three built-in Riemannian solvers of Manopt 7.0 — steepest descent (SD), conjugate gradient (CG), and trust regions (RTR), denoted by SM_SD, SM_CG and SM_RTR, respectively. We compared our algorithms with the following non-Riemannian numerical algorithms for CP factorization that were mentioned in Sect. 1.1. We followed the settings used by the authors in their papers.

  • SpFeasDC_ls [19]: A difference-of-convex functions approach for solving the split feasibility problem, it can be applied to (FeasCP). The implementation details regarding the parameters we used are the same as in the numerical experiments reported in [19, Section 6.1].

  • RIPG_mod [20]: This is a projected gradient method with relaxation and inertia parameters for solving (2). As shown in [20, Section 4.2], RIPG_mod is the best among the many strategies of choosing parameters.

  • APM_mod [18]: A modified alternating projection method for CP factorization; it is described in Sect. 2.3.

We have shown that \({\text {lse}}(x,\mu )\) is a smoothing function of \(\max (x)\) with gradient consistency. For a matrix argument, \({\text {lse}}(\cdot , \mu )\) is obtained simply by applying the same entrywise operations to all entries of the matrix. Then, from the properties of compositions of smoothing functions [41, Proposition 1 (3)], we have that \({\text {lse}}(-{\bar{B}}X,\mu )\) is a smoothing function of \(\max (-{\bar{B}}X)\) with gradient consistency. In practice, it is important to avoid numerical overflow and underflow when evaluating \({\text {lse}}(x,\mu )\). Overflow occurs when any \(x_i\) is large and underflow occurs when all \(x_i\) are small. To avoid these problems, we can shift each component \(x_i\) by \(\max (x)\) and use the following formula:

$$\begin{aligned} {\text {lse}}(x,\mu ) = \mu \log (\textstyle \sum _{i=1}^{n} \exp ((x_{i}-\max (x))/{\mu } ) ) + \max (x), \end{aligned}$$

whose validity is easy to show.
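For illustration, a shift-stabilized evaluation of the smoothed objective \({\text {lse}}(-{\bar{B}}X,\mu )\) in (OptCP), together with its Euclidean gradient with respect to X, can be sketched as follows (NumPy; the function name is ours, and the Riemannian gradient on \({\mathcal {O}}(r)\) is then obtained by projecting onto the tangent space as in Sect. 3.1):

```python
import numpy as np

def smoothed_objective_and_egrad(X, Bbar, mu):
    """Shift-stabilized lse(-Bbar @ X, mu) and its Euclidean gradient w.r.t. X.

    f(X) = mu * log( sum_{ij} exp((-Bbar X)_{ij} / mu) ) is a smooth surrogate of
    max(-Bbar X); the softmax matrix S plays the role of sigma in (9).
    """
    M = -Bbar @ X
    m = M.max()                      # shift by the largest entry to avoid overflow
    W = np.exp((M - m) / mu)
    f = mu * np.log(W.sum()) + m
    S = W / W.sum()                  # entrywise softmax of M / mu
    egrad = -Bbar.T @ S              # d f / d X
    return f, egrad
```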

The details of the experiments are as follows. If \(A \in \mathcal{CP}\mathcal{}_{n}\) was of full rank, for accuracy reasons, we obtained an initial \({\bar{B}}\) by using Cholesky decomposition. Otherwise, \({\bar{B}}\) was obtained by using spectral decomposition. Then we extended \({\bar{B}}\) to r columns by column replication (6). We set \(r = {\text {cp}}(A)\) if \({\text {cp}}(A)\) was known; otherwise, we chose a sufficiently large r. We used RandOrthMat.m [42] to generate a random starting point \(X^0\) on the basis of the Gram-Schmidt process.

For our three algorithms, we set \(\mu _{0}=100, \theta =0.8\) and used the adaptive rule (23), \(\delta _{k}:=\gamma \mu _{k}\), with \(\gamma =0.5\). Except for RIPG_mod, all the algorithms were deemed to terminate successfully at iteration k if \(\min ({\bar{B}}X^{k}) \ge -10^{-15}\) was attained before the maximum number of iterations (5,000) was reached. In addition, SpFeasDC_ls failed when \({\bar{L}}_{k}>10^{10}\). RIPG_mod terminated successfully when \(\Vert A-X_{k} X_{k}^{\top }\Vert ^{2}/\Vert A\Vert ^{2}<10^{-15}\) was attained within at most 10,000 iterations for \(n<100\), and within at most 50,000 iterations in all other cases. In the tables of this section, we report the rounded success rate (Rate) over the total number of trials, although the definition of "Rate" varies slightly among the experiments described in Sects. 4.1-4.4; we specify it in each case below.

4.1 Randomly generated instances

We examined the case of randomly generated matrices to see how the methods were affected by the order n or by r. The instances were generated in the same way as in [18, Section 7.7]. We computed C by setting \(C_{i j}:=|B_{i j}|\) for all i, j,  where B is a random \(n \times 2n\) matrix generated with the Matlab command randn, and we took \(A=C C^{\top }\) to be factorized. In Table 2, we set \(r=1.5 n\) and \(r=3 n\) for the values \(n \in \{20,30,40,100,200,400,600,800\}\). For each pair of n and r, we generated 50 instances if \(n \le 100\) and 10 instances otherwise. For each instance, we initialized all the algorithms at the same random starting point \(X^{0}\) and the same initial decomposition \({\bar{B}}\), except for RIPG_mod. Note that each instance A was assigned only one starting point.

Table 2 lists the average time in seconds (\(\hbox {Time}_{\mathrm {s}}\)) and the average number of iterations (\(\hbox {Iter}_{\mathrm {s}}\)) among the successful instances. For our three Riemannian algorithms, \(\hbox {Iter}_{\mathrm {s}}\) contains the number of iterations of the sub-algorithm. Table 2 also lists the rounded success rate (Rate) over the total number (50 or 10) of instances for each pair of n and r. Boldface highlights the two best results in each row.

As shown in Table 2, except for APM_mod, each method had a success rate of 1 for all pairs of n and r. Our three algorithms outperformed the other methods on the large-scale matrices with \(n \ge 100\). In particular, SM_CG with the conjugate-gradient method gave the best results.

4.2 A specifically structured instance

Let \({{\textbf {e}}}_{n}\) denote the all-ones vector in \({\mathbb {R}}^{n}\) and consider the matrix [18, Example 7.1],

$$\begin{aligned} A_{n} =\left( \begin{array}{cc} {0} &{} {{{\textbf {e}}}_{n-1}^{\top }} \\ {{{\textbf {e}}}_{n-1}} &{} {I_{n-1}} \end{array}\right) ^{\top }\left( \begin{array}{cc} {0} &{} {{{\textbf {e}}}_{n-1}^{\top }} \\ {{{\textbf {e}}}_{n-1}} &{} {I_{n-1}} \end{array}\right) \in \mathcal{CP}\mathcal{}_{n}. \end{aligned}$$

Theorem 2.2 shows that \(A_{n} \in {\text {int}}(\mathcal{CP}\mathcal{}_{n})\) for every \(n \ge 2.\) By construction, it is obvious that \({\text {cp}}(A_{n})=n\). We tried to factorize \(A_{n}\) for the values \(n \in \{10,20,50,75,100,150\}\) in Table 3. For each \(A_{n}\), using \(r={\text {cp}}(A_{n})=n\) and the same initial decomposition \({\bar{B}}\), we tested all the algorithms on the same 50 randomly generated starting points, except for RIPG_mod. Note that each instance was assigned 50 starting points.

Table 3 lists the average time in seconds (\(\hbox {Time}_{\mathrm {s}}\)) and the average number of iterations (\(\hbox {Iter}_{\mathrm {s}}\)) among the successful starting points. It also lists the rounded success rate (Rate) over the total number (50) of starting points for each n. Boldface highlights the two best results for each n.

We can see from Table 3 that the success rates of our three algorithms were always 1, whereas the success rates of the other methods decreased as n increased. Likewise, SM_CG with the conjugate-gradient method gave the best results.

Table 2 CP factorization of random completely positive matrices
Table 3 CP factorization of a family of specifically structured instances

4.3 An easy instance on the boundary of \(\mathcal{CP}\mathcal{}_n\)

Consider the following matrix from [43, Example 2.7]:

$$\begin{aligned} A=\left( \begin{array}{ccccc} 41 &{} 43 &{} 80 &{} 56 &{} 50 \\ 43 &{} 62 &{} 89 &{} 78 &{} 51 \\ 80 &{} 89 &{} 162 &{} 120 &{} 93 \\ 56 &{} 78 &{} 120 &{} 104 &{} 62 \\ 50 &{} 51 &{} 93 &{} 62 &{} 65 \end{array}\right) . \end{aligned}$$

The sufficient condition from [43, Theorem 2.5] ensures that this matrix is completely positive and \({\text {cp}}(A)= {\text {rank}}(A)=3.\) Theorem 2.2 tells us that \(A \in {\text {bd}}(\mathcal {C P}_5)\), since \({\text {rank}}(A) \ne 5\).

We found that all the algorithms could easily factorize this matrix. Notably, our three algorithms returned a CP factorization B whose smallest entry was as large as possible. In fact, they also maximize the smallest entry over the \(n \times r\) symmetric factorizations of A, since (OptCP) is equivalent to

$$\begin{aligned} \max _{A=XX^{\top }, X \in \mathbb {R}^{ n \times r }} \{\min \,(X)\}. \end{aligned}$$

When we did not terminate as soon as \(\min (\bar{B}X^{k}) \ge -10^{-15}\) but instead ran, for example, 1000 iterations, our algorithms gave the following CP factorization, whose smallest entry is around \(2.8573 \gg -10^{-15}\):

$$\begin{aligned} A=B B^{\top } \text{, } \text{ where } B \approx \left( \begin{array}{lll} 3.5771 &{} 4.4766 &{} {\textbf {2.8573}} \\ 2.8574 &{} 3.0682 &{} 6.6650 \\ 8.3822 &{} 7.0001 &{} 6.5374 \\ 5.7515 &{} 2.8574 &{} 7.9219 \\ 2.8574 &{} 6.7741 &{} 3.3085 \end{array}\right) . \end{aligned}$$

4.4 A hard instance on the boundary of \(\mathcal{CP}\mathcal{}_n\)

Next, we examined how well these methods worked on a hard matrix on the boundary of \(\mathcal{CP}\mathcal{}_n\). Consider the following matrix on the boundary taken from [44]:

$$\begin{aligned}A=\left( \begin{array}{lllll} 8 &{} 5 &{} 1 &{} 1 &{} 5 \\ 5 &{} 8 &{} 5 &{} 1 &{} 1 \\ 1 &{} 5 &{} 8 &{} 5 &{} 1 \\ 1 &{} 1 &{} 5 &{} 8 &{} 5 \\ 5 &{} 1 &{} 1 &{} 5 &{} 8 \end{array}\right) \in {\text {bd}} (\mathcal{CP}\mathcal{}_{5}). \end{aligned}$$

Since \(A \in \mathrm {bd}(\mathcal {C P}_{5})\) and A is of full rank, it follows from Theorem 2.2 that \({\text {cp}}^{+}(A) = \infty\); i.e., there is no strictly positive CP factorization for A. Hence, the global minimum of (OptCP) is \(t=0\). None of the algorithms could decompose this matrix with our tolerance of \(10^{-15}\) in the stopping criterion. As was done in [18, Example 7.3], we investigated slight perturbations of this matrix. Given

$$\begin{aligned} M M^{\top }=:C \in {\text {int}}(\mathcal {C P}_{5}) \text{ with } M=\left( \begin{array}{cccccc} 1 &{} 1 &{} 0 &{} 0 &{} 0 &{} 0 \\ 1 &{} 0 &{} 1 &{} 0 &{} 0 &{} 0 \\ 1 &{} 0 &{} 0 &{} 1 &{} 0 &{} 0 \\ 1 &{} 0 &{} 0 &{} 0 &{} 1 &{} 0 \\ 1 &{} 0 &{} 0 &{} 0 &{} 0 &{} 1 \end{array}\right) , \end{aligned}$$

we factorized \(A_{\lambda }:=\lambda A+(1-\lambda ) C\) for different values of \(\lambda \in [0,1)\) using \(r= 12 > \mathrm {cp}_{5}=11.\) Note that \(A_{\lambda } \in {\text {int}}(\mathcal {C P}_{5})\) provided \(0 \le \lambda <1\), and \(A_{\lambda }\) approached the boundary as \(\lambda \rightarrow 1\). The largest value we used was \(\lambda = 0.9999\). For each \(A_{\lambda },\) we tested all of the algorithms on 50 randomly generated starting points and computed the success rate over the total number of starting points.

Table 4 shows how the success rate of each algorithm changes as \(A_{\lambda }\) approaches the boundary. The table sorts the results from left to right according to overall performance. Except for SM_RTR, whose success rate was always 1, the success rates of all the other algorithms decreased significantly as \(\lambda\) increased to 0.9999. Surprisingly, SM_CG, which performed well in the previous examples, seemed unable to handle instances close to the boundary.

Table 4 Success rate of CP factorization of \(A_{\lambda }\) for values of \(\lambda\) from 0.6 to 0.9999

5 Further numerical experiments: comparison with [30, 32]

As described at the end of Sect. 3, the algorithms in [30] and [32] are both special cases of our algorithm. In this section, we compare them with our general framework to see whether it performs better when other sub-algorithms or other smoothing functions are used. We applied Algorithm 2 to two problems: finding a sparse vector (FSV) in a subspace and robust low-rank matrix completion, which are the problems treated in [30] and [32], respectively. Since they both involve approximations to the \(\ell _{1}\) norm, we applied the smoothing functions listed in Table 1.

We used the six solvers built into Manopt 7.0, namely, steepest descent, Barzilai-Borwein (i.e., gradient descent with the BB step size), conjugate gradient, trust regions, BFGS (a limited-memory version), and ARC (i.e., adaptive regularization by cubics).

5.1 FSV problem

The FSV problem is to find the sparsest vector in an n-dimensional linear subspace \(W\subseteq {\mathbb {R}}^{m}\); it has applications in robust subspace recovery, dictionary learning, and many other problems in machine learning and signal processing [45, 46]. Let \(Q \in {\mathbb {R}}^{m \times n}\) denote a matrix whose columns form an orthonormal basis of W; then the problem can be formulated as

$$\begin{aligned} \min _{ x \in S^{n-1}} \Vert Q x\Vert _{0}, \end{aligned}$$

where \(S^{n-1}:=\left\{ x \in {\mathbb {R}}^{n}\mid \Vert x\Vert =1\right\}\) is the sphere manifold and \(\Vert z\Vert _{0}\) counts the number of nonzero components of z. Because this discontinuous objective function is difficult to handle, the literature instead focuses on solving the following \(\ell _{1}\) norm relaxation:

$$\begin{aligned} \min _{ x \in S^{n-1}} \Vert Q x\Vert _{1}, \end{aligned}$$
(24)

where \(\Vert z\Vert _{1}:=\sum _{i}\left| z_{i}\right|\) is the \(\ell _{1}\) norm of the vector z.

Our synthetic problems for the \(\ell _{1}\) minimization model (24) were generated in the same way as in [30]: we chose \(m \in \{4 n, 6 n, 8 n, 10 n\}\) for \(n=5\) and \(m \in \{6 n, 8 n, 10 n, 12n\}\) for \(n=10\). We defined a sparse vector \(e_{n}:=(1, \ldots , 1,0, \ldots , 0)^{\top } \in {\mathbb {R}}^{m}\), whose first n components are 1 and whose remaining components are 0, and let the subspace W be the span of \(e_{n}\) and \(n-1\) random vectors in \({\mathbb {R}}^{m}\). Using mgson.m [47], we generated an orthonormal basis of W to form the matrix \(Q \in {\mathbb {R}}^{m \times n}\). With this construction, the minimum value of \(\Vert Q x\Vert _{0}\) should equal n. We chose the initial points with the M.rand() tool of Manopt 7.0, which returns a random point on the manifold M, and set x0 = abs(M.rand()); the nonnegative initial points appeared to work better in our experiments. Regarding the settings of our Algorithm 2, we chose the same smoothing function \({\tilde{f}}_{1}(x, \mu )\) from Table 1 and the same gradient tolerance strategy (22) as in [30]: \(\mu _{0}=1, \theta =0.5, \delta _{0}=0.1, \rho =0.5.\) We then compared the numerical performance of the different sub-algorithms. Note that with steepest descent as the sub-algorithm, our Algorithm 2 is exactly the method of [30].
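The following MATLAB sketch generates one instance and sets up one smoothed subproblem in Manopt. MATLAB's built-in orth stands in for mgson.m, and the smoothing \(\sqrt{z^{2}+\mu ^{2}}\) of \(|z|\) stands in for \({\tilde{f}}_{1}\) in Table 1; both substitutions are assumptions made for illustration only.

```matlab
% Sketch: one FSV instance and one smoothed subproblem on the sphere (Manopt 7.0).
n = 5; m = 4*n;
e_n = [ones(n, 1); zeros(m - n, 1)];            % sparse vector with n nonzeros
W = [e_n, randn(m, n - 1)];                     % W = span(e_n, n-1 random vectors)
Q = orth(W);                                    % orthonormal basis (stand-in for mgson.m)

mu = 1;                                         % smoothing parameter, mu_0 = 1
problem.M = spherefactory(n);                   % sphere manifold S^{n-1}
problem.cost  = @(x) sum(sqrt((Q*x).^2 + mu^2));             % smoothed ||Qx||_1
problem.egrad = @(x) Q' * ((Q*x) ./ sqrt((Q*x).^2 + mu^2));  % Euclidean gradient

x0 = abs(problem.M.rand());                     % nonnegative random initial point
options.tolgradnorm = 0.1;                      % delta_0 = 0.1; shrink by rho = 0.5 each outer step
x = steepestdescent(problem, x0, options);      % the sub-algorithm used in [30]
% Outer loop (not shown): set mu <- theta*mu with theta = 0.5 and re-solve from x.
```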

For each (n, m), we generated 50 pairs of random instances and random initial points. We say that an algorithm terminates successfully if \(\Vert Q x_{k}\Vert _{0}=n\), where \(x_{k}\) is the k-th iterate. Here, when counting the number of nonzero entries of \(Q x_{k}\), we truncated the entries as

$$\begin{aligned} (Q x_{k})_{i}=0 \quad \text{ if } \left| (Q x_{k})_{i}\right| <\tau , \end{aligned}$$
(25)

where \(\tau >0\) is a tolerance related to the precision of the solution, taking values from \(10^{-5}\) to \(10^{-12}\). Tables 5 and 6 report the number of successes out of the 50 cases; boldface highlights the best result in each row.
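In code, the success check based on (25) amounts to the following minimal sketch (Q and n as in the sketch above):

```matlab
% Success check: after truncation (25), Q*x should have exactly n nonzero entries.
tau = 1e-8;                                   % one of the tolerances 1e-5, ..., 1e-12
success = (nnz(abs(Q * x) >= tau) == n);      % x: the final iterate of the smoothing loop
```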

Table 5 Number of successes from 50 pairs of random instances and random initial points for the \(\ell _{1}\) minimization model (24) and \(n=5\)

As shown in Tables 5 and 6, surprisingly, the conjugate-gradient method, which performed best on the CP factorization problem in Sect. 4, performed worst on the FSV problem; in fact, it was almost useless. Moreover, although the steepest-descent method employed in [30] was not bad at obtaining low-precision solutions with \(\tau \in \{ 10^{-5}, 10^{-6}, 10^{-7}, 10^{-8}\}\), it had difficulty obtaining high-precision solutions with \(\tau \in \{ 10^{-9}, 10^{-10}, 10^{-11}, 10^{-12}\}\). The remaining four sub-algorithms easily obtained high-precision solutions, with the Barzilai-Borwein method performing best in most cases. Combined with the results in Sect. 4, this shows that, in practice, the choice of sub-algorithm in the Riemannian smoothing method (Algorithm 2) is highly problem-dependent. For the other smoothing functions in Table 1, we obtained results similar to those in Tables 5 and 6.

Table 6 Number of successes from 50 pairs of random instances and random initial points for the \(\ell _{1}\) minimization model (24) and \(n=10\)
Fig. 1 Perfect low-rank matrix completion of a rank-10 \(5000 \times 5000\) matrix without any outliers, using the different smoothing functions in Table 1. Panels (a)–(e) compare the number of iterations; panels (f)–(j) compare the running time

5.2 Robust low-rank matrix completion

Low-rank matrix completion consists of recovering a rank-r matrix M of size \(m \times n\), with \(r \ll \min (m, n)\), from only a fraction of its entries. In robust low-rank matrix completion, a few of the observed entries, called outliers, are perturbed, i.e.,

$$\begin{aligned} M=M_{0}+S, \end{aligned}$$

where \(M_{0}\) is the unperturbed original data matrix of rank r and S is a sparse matrix. This amounts to adding non-Gaussian noise, for which the traditional \(\ell _{2}\) minimization model,

$$\begin{aligned} \min _{X \in {\mathcal {M}}_{r}}\left\| {\mathcal {P}}_{\Omega }(X-M)\right\| _{2} \end{aligned}$$

is not well suited for recovering \(M_{0}\). Here, \({\mathcal {M}}_{r}:=\left\{ X \in {\mathbb {R}}^{m \times n} \mid {\text {rank}}(X)=r\right\}\) is the fixed-rank manifold, \(\Omega\) denotes the set of indices of the observed entries, and \({\mathcal {P}}_{\Omega }: {\mathbb {R}}^{m \times n} \rightarrow {\mathbb {R}}^{m \times n}\) is the projection onto the entries indexed by \(\Omega\), defined as

$$\begin{aligned} ({\mathcal {P}}_{\Omega }(Z))_{ij} := {\left\{ \begin{array}{ll} Z_{ij} &{} \text{ if } (i, j) \in \Omega, \\ 0 &{} \text{ if } (i, j) \notin \Omega . \end{array}\right. } \end{aligned}$$

In [32], the authors try to solve

$$\begin{aligned} \min _{X \in {\mathcal {M}}_{r}}\left\| {\mathcal {P}}_{\Omega }(X-M)\right\| _{1}, \end{aligned}$$

because the sparsity-inducing property of the \(\ell _{1}\) norm leads one to expect exact recovery when the noise consists of just a few outliers.

In all of the experiments, the problems were generated in the same way as in [32]. In particular, after picking the values of m, n, and r, we generated the ground truth \(U \in {\mathbb {R}}^{m \times r}\) and \(V \in {\mathbb {R}}^{n \times r}\) with independent and identically distributed (i.i.d.) Gaussian entries of zero mean and unit variance and set \(M:=U V^{\top }\). We then sampled \(k:=\rho r(m+n-r)\) observed entries uniformly at random, where \(\rho\) is the oversampling factor; we set \(\rho =5\) throughout. We chose the initial point \(X_{0}\) as the rank-r truncated SVD of \({\mathcal {P}}_{\Omega }(M)\).
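A minimal sketch of this construction, using randperm for the uniform sampling and svds for the rank-r truncated SVD (the sizes are those of the smaller instances below; variable names are ours):

```matlab
% Sketch: generate a robust low-rank matrix completion instance and the initial point X0.
m = 500; n = 500; r = 10; rho = 5;            % the larger experiments use m = n = 5000
U = randn(m, r); V = randn(n, r);             % i.i.d. standard Gaussian factors
M = U * V';                                   % ground truth, rank r
k = round(rho * r * (m + n - r));             % number of observed entries
Omega = false(m, n);
Omega(randperm(m * n, k)) = true;             % observed entries, uniformly at random
P_Omega_M = zeros(m, n);
P_Omega_M(Omega) = M(Omega);                  % P_Omega(M)
[U0, S0, V0] = svds(P_Omega_M, r);            % rank-r truncated SVD
X0 = U0 * S0 * V0';                           % initial point
```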

Regarding the settings of our Algorithm 2, we tested all combinations of the five smoothing functions in Table 1 and the six sub-algorithms mentioned above (30 cases in total). We set \(\mu _{0}=100\) and chose an aggressive value of \(\theta =0.05\) for reducing \(\mu\), as in [32]. The stopping criterion for the sub-algorithm loop was a maximum of 40 iterations or the gradient tolerance (23), whichever was reached first. We monitored the iterates \(X_{k}\) through the root mean square error (RMSE), defined as the error over all entries between \(X_{k}\) and the original low-rank matrix \(M_{0}\), i.e.,

$$\begin{aligned} {\text {RMSE}}\left( X_{k}, M_{0}\right) :=\sqrt{\frac{\sum _{i=1}^{m}\sum _{j=1}^{n}\left( X_{k, i j}-M_{0, i j}\right) ^{2}}{m n}}. \end{aligned}$$
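Equivalently, the RMSE is the Frobenius norm of the error divided by \(\sqrt{mn}\); in MATLAB:

```matlab
% RMSE between an iterate X_k and the original low-rank matrix M0 (M in the sketch
% above, before any outliers are added).
rmse = norm(X_k - M0, 'fro') / sqrt(m * n);
```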
Fig. 2 Low-rank matrix completion with outliers for two rank-10 \(500 \times 500\) matrices, using the different smoothing functions in Table 1. Panels (a)–(j) correspond to the matrix with outliers created using \(\mu _{N}=\sigma _{N}=0.1\); panels (k)–(t) correspond to the matrix with outliers created using \(\mu _{N}=\sigma _{N}=1\). Panels (a)–(e) and (k)–(o) compare the number of iterations; panels (f)–(j) and (p)–(t) compare the running time

Fig. 3 Low-rank matrix completion with outliers for two rank-10 \(5000 \times 5000\) matrices, using the different smoothing functions in Table 1. Panels (a)–(j) correspond to the matrix with outliers created using \(\mu _{N}=\sigma _{N}=0.1\); panels (k)–(t) correspond to the matrix with outliers created using \(\mu _{N}=\sigma _{N}=1\). Panels (a)–(e) and (k)–(o) compare the number of iterations; panels (f)–(j) and (p)–(t) compare the running time

5.2.1 Perfect low-rank matrix completion

As in [32], we first tested all the methods on a simple perfect matrix M (without any outliers) of size \(5000 \times 5000\) and rank 10. The results are shown in Figure 1. We can see that the choice of smoothing function does not have much effect on numerical performance. In terms of the number of iterations (panels (a)–(e)), our Algorithm 2 inherits the at-least-Q-superlinear convergence of its sub-algorithm when trust regions or ARC is used. However, the per-iteration cost of trust regions and ARC is high, so they are not efficient in terms of time. In particular, the conjugate-gradient method employed in [32] stagnates at a lower precision. Overall, Barzilai-Borwein performed best in terms of time and accuracy.

5.2.2 Low-rank matrix completion with outliers

Given a \(500 \times 500\) matrix whose entries we observed uniformly at random with an oversampling factor \(\rho\) of 5, we perturbed \(5 \%\) of the observed entries by adding noise to them in order to create outliers. The added noise was a random variable defined as \({\mathcal {O}}={\mathcal {S}}_{\pm 1} \cdot {\mathcal {N}}(\mu _{N}, \sigma _{N}^{2})\), where \({\mathcal {S}}_{\pm 1}\) takes the values \(+1\) and \(-1\) with equal probability and \({\mathcal {N}}(\mu _{N}, \sigma _{N}^{2})\) is a Gaussian random variable with mean \(\mu _{N}\) and variance \(\sigma _{N}^{2}\). A minimal sketch of this construction is given below.
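Continuing the sketch from the beginning of Sect. 5.2 (reusing M and Omega), the outliers can be generated as follows; this is our reading of the construction described above, not code from [32].

```matlab
% Perturb 5% of the observed entries with outliers O = S_{+-1} * N(mu_N, sigma_N^2).
mu_N = 0.1; sigma_N = 0.1;                        % the second instance uses mu_N = sigma_N = 1
obs  = find(Omega);                               % linear indices of observed entries
sel  = obs(randperm(numel(obs), round(0.05 * numel(obs))));
s    = 2 * (rand(numel(sel), 1) < 0.5) - 1;       % S_{+-1}: +1 or -1 with equal probability
M(sel) = M(sel) + s .* (mu_N + sigma_N * randn(numel(sel), 1));   % add N(mu_N, sigma_N^2) noise
```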

Figure 2 reports the results of two \(500 \times 500\) instances with outliers generated using \(\mu _{N}=\sigma _{N}=0.1\) and \(\mu _{N}=\sigma _{N}=1\). Again, we can see that the choice of smoothing function does not have much effect. In most cases, BFGS and trust regions were better than the other methods in terms of the number of iterations, and BFGS was the fastest. Furthermore, the conjugate-gradient method employed in [32] still stagnated at lower-precision solutions (around \(10^{-6}\)), while steepest descent, BFGS, and trust regions always obtained solutions with at least \(10^{-8}\) precision.

Next, we ran the same experiment on larger \(5000 \times 5000\) matrices with \(5 \%\) outliers. Figure 3 illustrates the results of these experiments with \(\mu _{N}=\sigma _{N}=0.1\) and \(\mu _{N}=\sigma _{N}=1\). In most cases, trust regions still outperformed the other methods in terms of the number of iterations, while BFGS performed poorly. In terms of time, Barzilai-Borwein and the conjugate-gradient method performed almost equally well.

6 Concluding remarks

We examined the problem of finding a CP factorization of a given completely positive matrix and treated it as a nonsmooth Riemannian optimization problem. To this end, we studied a general framework of Riemannian smoothing for Riemannian optimization. The numerical experiments showed that our method can compete with other efficient CP factorization methods, in particular on large-scale matrices.

Let us summarize the relation of our approach to the existing CP factorization methods. Groetzner and Dür [18] and Chen et al. [19] proposed different methods for solving (FeasCP), and Boţ and Nguyen [20] addressed another model, (2). However, the methods they used are not Riemannian optimization techniques but rather Euclidean ones, since they treat the set \({\mathcal {O}}(r) := \{ X \in {\mathbb {R}}^{r \times r} : X^{\top }X=I\}\) as an ordinary constraint in Euclidean space. By contrast, we recognize the underlying manifold structure, namely, the Stiefel manifold \({\mathcal {M}}= {\mathcal {O}}(r)\), and use optimization techniques tailored to it. This change in perspective opens the door to the rich variety of Riemannian optimization techniques. As the experiments in Sect. 4 show, our Riemannian approach is faster and more reliable than the Euclidean methods.

In the future, we plan to extend Algorithm 2 to general manifolds and, in particular, to quotient manifolds. This extension seems possible, although effort will be needed to establish analogous convergence results. In fact, convergence has been observed for a built-in example in Manopt 7.0 [40]: robust_pca.m computes a robust version of PCA on data and optimizes a nonsmooth function over a Grassmann manifold. The nonsmooth term is the unsquared \(\ell _2\) norm, used for robustness. In robust_pca.m, Riemannian smoothing with a pseudo-Huber loss function is used in place of the \(\ell _2\) norm.

As with other numerical methods, there is no guarantee that Algorithm 2 will find a CP factorization for every \(A \in {\mathcal {C}} {\mathcal {P}}_{n}\). It follows from Proposition 2.7 that \(A \in {\mathcal {C}} {\mathcal {P}}_{n}\) if and only if the global minimum of (OptCP), say t, satisfies \(t \leqslant 0\). Since our methods converge only to a stationary point, Algorithm 2 provides at best a local minimizer. Finding a global minimizer of (OptCP) is a direction for future work.