1 Introduction

In this study, we consider the constrained nonlinear least-squares problem:

$$\begin{aligned} \min _{x \in {\mathbb {R}}^d} \ f(x) :=\frac{1}{2}\Vert F(x)\Vert ^2 \quad \mathrm {subject\ to}\quad x \in {\mathcal {C}}, \end{aligned}$$
(1)

where \(\Vert \cdot \Vert\) denotes the \(\ell _2\)-norm, \(F: {\mathbb {R}}^d \rightarrow {\mathbb {R}}^n\) is a continuously differentiable function, and \({\mathcal {C}} \subseteq {\mathbb {R}}^d\) is a closed convex set. If there exists a point \(x \in {\mathcal {C}}\) such that \(F(x) = {\textbf{0}}\), the problem is said to be zero-residual and reduces to the constrained nonlinear equation:

$$\begin{aligned} \text {find} \quad x \in {\mathcal {C}}\quad \text {such that}\quad F(x) = {\textbf{0}}. \end{aligned}$$

Such problems cover a wide range of applications, including chemical equilibrium systems [48], economic equilibrium problems [20], power flow equations [61], nonnegative matrix factorization [7, 42], phase retrieval [11, 63], nonlinear compressed sensing [8], and learning constrained neural networks [17].

Levenberg–Marquardt (LM) methods [43, 47] are efficient iterative algorithms for solving problem (1); they were originally developed for unconstrained cases (i.e., \({\mathcal {C}} = {\mathbb {R}}^d\)) and later extended to constrained cases in [40]. Given a current point \(x_k \in {\mathcal {C}}\), an LM method defines a model function \(m^k_\uplambda : {\mathbb {R}}^d \rightarrow {\mathbb {R}}\) with a damping parameter \(\uplambda > 0\):

$$\begin{aligned} m^k_\uplambda (x) :=\frac{1}{2}\Vert F_k + J_k(x-x_k)\Vert ^2 + \frac{\uplambda }{2}\Vert x-x_k\Vert ^2, \end{aligned}$$
(2)

where \(F_k :=F(x_k) \in {\mathbb {R}}^n\) and \(J_k :=J(x_k) \in {\mathbb {R}}^{n \times d}\) with \(J: {\mathbb {R}}^d \rightarrow {\mathbb {R}}^{n \times d}\) being the Jacobian matrix function of F. The next point \(x_{k+1} \in {\mathcal {C}}\) is set to an exact or approximate solution to the convex subproblem:

$$\begin{aligned} \min _{x \in {\mathbb {R}}^d}\ m^k_\uplambda (x)\quad \mathrm {subject\ to}\quad x \in {\mathcal {C}} \end{aligned}$$
(3)

for some \(\uplambda = \uplambda _k\). Various versions of this method have been proposed, and their theoretical and practical performances largely depend on how the damping parameter \(\uplambda _k\) is updated.

1.1 Our contribution

We propose an LM method with a new rule for updating \(\uplambda _k\). Our method is based on majorization-minimization (MM) methods, which successively minimize a majorization, i.e., an upper bound on the objective function. The key to our method is the fact that the model \(m^k_\uplambda\) defined in (2) is a majorization of the objective f under certain standard assumptions. This MM perspective enables us to create an LM method with desirable properties, including global and local convergence guarantees. Although there exist several MM methods for problem (1) and related problems [3, 4, 38, 50, 53], as far as we know, no studies have shown that the model in (2) is a majorization of f. Another feature of our LM method is the way it generates an approximate solution to subproblem (3): a single iteration of a projected gradient method applied to (3) suffices to derive the iteration complexity of our LM method, which in turn leads to an overall complexity bound.

Our contributions are summarized as follows:

  1. (i)

    A new MM-based LM method: We prove that the LM model defined in (2) is a majorization of f if the damping parameter \(\uplambda\) is sufficiently large. See Lemma 1 for a precise statement. This result provides us with a new update rule for \(\uplambda\), bringing about a new LM method for solving problem (1).

  2. (ii)

    Iteration and overall complexity for finding a stationary point: The iteration complexity of our LM method for finding an \(\varepsilon\)-stationary point (see Definition 1) is proved to be \(O(\varepsilon ^{-2})\) under mild assumptions on the Jacobian. Because the computational complexity per iteration of our method does not depend on \(\varepsilon\), the overall complexity is also evaluated as \(O(\varepsilon ^{-2})\) through

    $$\begin{aligned} (\text {Overall complexity}) = (\text {Iteration complexity}) \times (\text {Complexity per iteration}). \end{aligned}$$

    See Corollaries 1 and 2 for a precise statement.

  3. (iii)

    Local quadratic convergence: For zero-residual problems, assume that a starting point \(x_0 \in {\mathcal {C}}\) is sufficiently close to an optimal solution, and assume standard conditions, including a local error bound condition. Then, if the subproblems are solved with sufficient accuracy, a solution sequence \((x_k)\) generated by our method converges quadratically to an optimal solution. See Theorem 2 for a precise statement.

  4. (iv)

    Improved convergence results even for unconstrained problems: Our method achieves both the \(O(\varepsilon ^{-2})\) iteration complexity bound and local quadratic convergence. An LM method having such global and local convergence results is new for unconstrained and constrained problems, as shown in Table 1.

Numerical results show that our method converges faster and is more robust than existing LM-type methods [22, 26, 36, 40], a projected gradient method, and a trust-region reflective method [10, 58].

Table 1 Comparison of methods for problem (1)

1.2 Oracle model for overall complexity bounds

To evaluate the overall complexity of LM methods, we count the number of basic operations (evaluation of F(x), Jacobian-vector multiplications J(x)u and \(J(x)^\top v\), and projection onto \({\mathcal {C}}\)) required to find an \(\varepsilon\)-stationary point, following [21, Sect. 6]. The important point is that we do not assume an evaluation of the full matrix \(J_k :=J(x_k)\); we access the Jacobian only through the products \(J_k u\) and \(J_k^\top v\) to solve subproblem (3). Computing the vectors \(J_k u\) and \(J_k^\top v\) for given \(u \in {\mathbb {R}}^d\) and \(v \in {\mathbb {R}}^n\) is much cheaper than evaluating the matrix \(J_k\). Avoiding the computation of the \(n \times d\) matrix \(J_k\) makes algorithms practical for large-scale problems where n and d amount to thousands or millions. We note that some existing LM-type methods [3, 4, 12–16, 36] compute the Jacobian explicitly.

1.3 Paper organization

In Sect. 2, we review LM methods and related algorithms for problem (1). In Sect. 3, a key lemma is presented and the LM method (Algorithm 1) is derived based on the lemma. Sections 4 and 5 show theoretical results for Algorithm 1: iteration complexity, overall complexity, and local quadratic convergence. In Sect. 6, we generalize Algorithm 1 and present a more practical variant of Algorithm 1. This variant also achieves the theoretical guarantees given for Algorithm 1 in Sects. 4 and 5. Section 7 provides some numerical results and Sect. 8 concludes the paper.

1.4 Notation

Let \({\mathbb {R}}^d\) denote the d-dimensional Euclidean space equipped with the \(\ell _2\)-norm \(\Vert \cdot \Vert\) and the standard inner product \(\langle \cdot , \cdot \rangle\). For a matrix \(A \in {\mathbb {R}}^{m \times n}\), let \(\Vert A\Vert\) denote its spectral norm, i.e., its largest singular value. For \(a \in {\mathbb {R}}\), let \(\lceil a\rceil\) denote the least integer greater than or equal to a.

2 Comparison with related works

We review existing methods for problem (1) and compare them with our work.

2.1 General methods

Algorithms for general nonconvex optimization problems, not just for least-squares problems, also solve problem (1). For example, the projected gradient method has an overall complexity bound of \(O(\varepsilon ^{-2})\); our LM method enjoys local quadratic convergence in addition to that bound, which seems difficult to achieve with general first-order methods. Figure 1 illustrates that our LM successfully minimizes the Rosenbrock function, a valley-like function that is notoriously difficult to minimize numerically. Although quadratic convergence is proved only locally around an optimal solution, in practice, the LM method may perform considerably better than general first-order methods, even when started far from the optimum.

Fig. 1 Minimization of the Rosenbrock function [56], \(f(x, y) = (x - 1)^2 + 100 (y - x^2)^2\). Both the gradient descent (GD) and our LM start from \((-1, 1)\) and converge to the optimal solution, (1, 1). One marker corresponds to one iteration, and the GD and LM are truncated after 1000 and 20 iterations, respectively

Some methods, such as the Newton method, achieve local quadratic convergence using second-order or higher-order derivatives of f; our LM achieves it without second-order derivatives. Besides the fact that our LM does not require a computationally demanding Hessian matrix, it has another advantage: subproblem (3) is very tractable. Whereas our subproblem is smooth and strongly convex, those in second- or higher-order methods are nonconvex in general. The matter becomes more severe in the presence of constraints because the subproblems may be NP-hard, as pointed out in [15].

2.2 Specialized methods for least squares

Several methods, including the LM method, utilize the least-squares structure of problem (1). Focusing on those algorithms without second-order derivatives, we review them from three points of view: (i) subproblem, (ii) complexity for finding a stationary point, and (iii) local superlinear convergence. Most of the methods discussed in this section are summarized in Table 1. The table shows the following:

  • Our method can achieve an overall computational complexity bound, \(O(\varepsilon ^{-2}) \times O(1) = O(\varepsilon ^{-2})\), for finding an \(\varepsilon\)-stationary point for constrained problems.

  • To the best of our knowledge, this is the first LM that achieves such a complexity bound with local quadratic convergence, even for unconstrained problems.

2.2.1 Subproblems

Most algorithms for the nonlinear least-squares problem (1) generate a solution sequence \((x_k)_{k \in {\mathbb {N}}}\) by repeatedly solving convex subproblems, and we focus on such algorithms. There are three popular subproblems, in addition to the LM subproblem (3):

$$\begin{aligned} \min _{x \in {\mathbb {R}}^d}\ \Vert F_k + J_k(x - x_k)\Vert + \frac{\uplambda }{2} \Vert x - x_k\Vert ^2 \quad \mathrm {subject\ to}\quad x \in {\mathcal {C}}, \end{aligned}$$
(4)
$$\begin{aligned} \min _{x \in {\mathbb {R}}^d}\ \Vert F_k + J_k(x - x_k)\Vert ^2 + \frac{\uplambda }{2} \Vert x - x_k\Vert ^3 \quad \mathrm {subject\ to}\quad x \in {\mathcal {C}}, \end{aligned}$$
(5)
$$\begin{aligned} \min _{x \in {\mathbb {R}}^d}\ \Vert F_k + J_k(x - x_k)\Vert ^2 \quad \mathrm {subject\ to}\quad x \in {\mathcal {C}}, \ \Vert x - x_k\Vert \le \Delta , \end{aligned}$$
(6)

where \(\uplambda , \Delta > 0\) are properly defined constants. Methods using subproblems (4), (5), and (6) have been proposed and analyzed in [3, 4, 16, 50], [3], and [12, 24, 32, 64], respectively. Other works [13–15] propose methods with a more general version of (5). These four subproblems (3)–(6) are closely related in theory; one subproblem becomes equivalent to the others with specific choices of the parameters \(\uplambda\) and \(\Delta\).

In practice, these four subproblems are quite different, and the LM subproblem (3) is the most tractable one because the objective function \(m^k_\uplambda\) is smooth and strongly convex. Thanks to smoothness and strong convexity, we can efficiently solve subproblem (3) with linearly convergent methods such as the projected gradient method. Note that the objective function of (4) is nonsmooth, and (5) and (6) are not necessarily strongly convex. Although some algorithms for subproblems (4)–(6) without constraints have been proposed [4, 13, 64], efficient algorithms are nontrivial in the presence of constraints. Hence, the LM method is more practical than methods using other subproblems.

2.2.2 Complexity for finding a stationary point

For unconstrained zero-residual problems, Nesterov [50] proposed a method with subproblem (4) and showed that the method finds an \(\varepsilon\)-stationary point after \(O(\varepsilon ^{-2})\) iterations under a strong assumption (see footnote 2 of Table 1 for details). After that, for unconstrained (possibly) nonzero-residual problems, several methods with subproblems (3), (4), and (6) have been proposed [12, 57, 66], and they achieve the same iteration complexity bound under weaker assumptions such as the Lipschitz continuity of J or \(\nabla f\). The method of [12] has been extended to constrained problems [16]. These methods [12, 16, 50, 57, 66] have the iteration complexity bound, but their computational complexity per iteration, i.e., the complexity of solving a subproblem, is unclear.

The key to bounding the complexity per iteration is that the subproblems need not be solved very accurately to derive the iteration complexity bound. Several algorithms have been proposed based on this fact for both unconstrained [5, 6, 13, 14] and constrained [15] problems. They use a point that decreases the model function value sufficiently compared with the value at the current iterate \(x_k\). Such a point can be computed with an \(\varepsilon\)-independent number of basic operations: evaluation of F(x), Jacobian-vector multiplications J(x)u and \(J(x)^\top v\), and projection onto \({\mathcal {C}}\). Thus, the methods in [5, 13–15] achieve the overall complexity \(O(\varepsilon ^{-2}) \times O(1) = O(\varepsilon ^{-2})\).

Our LM method also finds an \(\varepsilon\)-stationary point within \(O(\varepsilon ^{-2})\) iterations, and the complexity per iteration is O(1) when the subproblems are solved approximately as in [5, 6, 13–15]. Thus, the overall complexity amounts to \(O(\varepsilon ^{-2})\), the same as in [5, 13–15].

2.2.3 Local superlinear convergence

For unconstrained zero-residual problems, many methods with subproblems (3)–(6) have achieved local quadratic convergence under a local error bound condition [3, 19, 23, 29–35, 62]. These local convergence results have been extended to constrained problems [1, 22, 26, 40]. Some methods [24, 64] have local convergence of an arbitrary order less than 2. Other methods [25, 27, 28] achieve local (nearly) cubic convergence by solving two subproblems in one iteration. We note that the local convergence analyses in [4, 50] assume that the solution sequence \((x_k)\) is in the neighborhood of a solution \(x^*\) such that \(F(x^*) = {\textbf{0}}\) and \({{\,\textrm{rank}\,}}J(x^*) = n\), which is a stronger assumption than the local error bound.

Among these methods, some [1, 3, 4, 19, 22, 29, 31, 35] use an approximate solution to subproblems while preserving local quadratic convergence. The approximate solution is more accurate than that used to derive the global complexity mentioned in the previous section. We also use the same kind of approximate solution as [1, 19, 29, 31, 35] to prove local quadratic convergence. See Condition 2 in Sect. 6 for the details of the approximate solution.

3 Majorization lemma and proposed method

Here, we will prove a majorization lemma that shows that the LM model \(m^k_\uplambda\) defined in (2) is an upper bound on the objective function. In view of this lemma, we can characterize our LM method as a majorization-minimization (MM) method.

For \(a, b \in {\mathbb {R}}^d\), we denote the sublevel set and the line segment by

$$\begin{aligned} {\mathcal {S}}(a)&:=\{ x \in {\mathbb {R}}^d \,|\, f(x) \le f(a) \}, \end{aligned}$$
(7)
$$\begin{aligned} {\mathcal {L}}(a, b)&:=\{ (1-\theta ) a + \theta b \in {\mathbb {R}}^d \,|\, \theta \in [0, 1] \}. \end{aligned}$$
(8)

3.1 LM method as majorization-minimization

MM is a framework for nonconvex optimization that successively performs (approximate) minimization of an upper bound on the objective function. The following lemma, a majorization lemma, shows that the model \(m^k_\uplambda\) defined in (2) is an upper bound on the objective f over some region under certain assumptions.

Lemma 1

Let \({\mathcal {X}} \subseteq {\mathbb {R}}^d\) be any closed convex set, and suppose \(x_k \in {\mathcal {X}}\). Moreover, assume that for some constant \(L > 0\),

$$\begin{aligned} \Vert J(y) - J(x)\Vert \le L\Vert y - x\Vert ,\quad \forall x,y \in {\mathcal {X}} \ \ \text {s.t.} \ \ {\mathcal {L}}(x, y) \subseteq {\mathcal {S}}(x_k). \end{aligned}$$
(9)

Then for any \(\uplambda > 0\) and \(x \in {\mathcal {X}}\) such that

$$\begin{aligned}&\uplambda \ge L\Vert F_k\Vert \quad \text {and} \end{aligned}$$
(10)
$$\begin{aligned}&m^k_\uplambda (x) \le m^k_\uplambda (x_k), \end{aligned}$$
(11)

the following bound holds:

$$\begin{aligned} f(x) \le m^k_\uplambda (x). \end{aligned}$$
(12)

The proof is given in Sect. A.2.

The assumption in (9) is the Lipschitz continuity of J and is analogous to the Lipschitz continuity of \(\nabla f\), which is often used in the analysis of first-order methods. Equation (10) requires a sufficiently large damping parameter, which corresponds to a sufficiently small step-size for first-order methods. Equation (11) requires the point \(x \in {\mathcal {X}}\) to be a solution that is at least as good as the current point \(x_k \in {\mathcal {X}}\) in terms of the model function value.

3.2 Proposed LM method

Algorithm 1 (pseudocode)

Based on Lemma 1, we propose an LM method that solves problem (1). The proposed LM is formally described in Algorithm 1 and is outlined below. First, in Line 1, three parameters are initialized: an estimate M of the Lipschitz constant L of J, a parameter \(\eta\) used for solving subproblems, and the iteration counter k. Line 3 sets \(\uplambda\) using M as an estimate of L based on (10). Then, the inner loop of Lines 4–10 solves subproblem (3) approximately by a projected gradient method. The details of the inner loop will be described later. Lines 12–15 check whether the current \(\uplambda\) and the computed solution x are acceptable. If \(\uplambda\) and x satisfy (12), they are accepted as \(\uplambda _k\) and \(x_{k+1}\). Otherwise, the current value of M is judged to be too small an estimate of L in light of Lemma 1 and is increased. We refer to the former case as a “successful” iteration and the latter as an “unsuccessful” iteration. Note that k counts not all outer iterations but only the successful ones. As shown later in Lemma 5(ii) and Theorem 2(i), the number of unsuccessful iterations is upper-bounded by a constant under certain assumptions.

Inner loop for subproblem: In the inner loop of Lines 4–10, subproblem (3) is solved approximately by the projected gradient method. Here, the operator \({{\,\textrm{proj}\,}}_{{\mathcal {C}}}\) in Line 5 is the projection operator defined by

$$\begin{aligned} {{\,\textrm{proj}\,}}_{{\mathcal {C}}}(x) :=\mathop {\textrm{argmin}}\limits _{y \in {\mathcal {C}}}\Vert y - x\Vert . \end{aligned}$$

The parameter t is the inner iteration counter, and the parameter \(\eta\) is the inverse step-size that is adaptively chosen by a standard backtracking technique in Lines 6–9. As shown in Lemma 6(ii) later, Line 9 is executed a finite number of times under certain standard assumptions. Hence, the inner loop must stop after a finite number of iterations.

Input parameters: Algorithm 1 has several input parameters. The parameters \(M_0\) and \(\alpha\) are used to estimate the Lipschitz constant of the Jacobian J, and the parameters \(\eta _0\) and \(\alpha _{\textrm{in}}\) are used to control the step-size in the inner loop. The parameters T and c control how accurately the subproblems are solved through the stopping criteria of the inner loop. Here, note that we allow for \(T = \infty\). As we will prove in Sect. 4, the algorithm has an iteration complexity bound for an \(\varepsilon\)-stationary point regardless of the choice of the input parameters. However, to obtain an overall complexity bound or local quadratic convergence, there are restrictions on the choice of T, as explained in the next paragraph.

Stopping criteria for inner loop: There are two types of stopping criteria, as in Line 10, and the inner loop terminates when at least one of them is satisfied. If \(T < \infty\), the projected gradient method stops after executing Line 7 at most T times, and then the overall complexity for an \(\varepsilon\)-stationary point is guaranteed to be \(O(\varepsilon ^{-2})\). If \(T = \infty\), we have to solve the subproblems more accurately to find a \((c\uplambda \Vert F_k\Vert )\)-stationary point of the subproblem, and then Algorithm 1 achieves local quadratic convergence.

Remark 1

To make the algorithm more practical, we can introduce other parameters \(0< \beta < 1\) and \(M_{\min } > 0\), and update \(M \leftarrow \max \{ \beta M, M_{\min } \}\) after every successful iteration. As with the gradient descent method, such an operation prevents the estimate M from being too large and eliminates the need to choose \(M_0\) carefully. Inserting this operation never deteriorates the complexity bounds described in Sect. 4 and the local quadratic convergence in Sect. 5.

Remark 2

Some methods (e.g., [57, 66]) use the condition

$$\begin{aligned} \frac{m^k_\uplambda (x_k) - f(x)}{m^k_\uplambda (x_k) - m^k_\uplambda (x)} \ge \theta \end{aligned}$$
(13)

with some \(0< \theta < 1\) to determine whether the computed solution x to the subproblem is acceptable. Our acceptance condition (12) is stronger than the classical one since (12) is equivalent to

$$\begin{aligned} \frac{m^k_\uplambda (x_k) - f(x)}{m^k_\uplambda (x_k) - m^k_\uplambda (x)} \ge 1 \end{aligned}$$

under condition (11). Therefore, Lemma 1 is stronger than the classical statement that condition (13) holds if \(\uplambda\) is sufficiently large.

4 Iteration complexity and overall complexity

We will prove that Algorithm 1 finds an \(\varepsilon\)-stationary point of problem (1) within \(O(\varepsilon ^{-2})\) outer iterations. Furthermore, we will prove that, under \(T < \infty\), the overall complexity for an \(\varepsilon\)-stationary point is also \(O(\varepsilon ^{-2})\). Throughout this section, \((x_k)\) and \((\uplambda _k)\) denote the sequences generated by the algorithm.

4.1 Assumptions

We make the following assumptions to derive the complexity bound. Recall that the sublevel set \({\mathcal {S}}(a)\) and the line segment \({\mathcal {L}}(a, b)\) are defined in (7) and (8) and that \(x_0 \in {\mathcal {C}}\) denotes the starting point of Algorithm 1.

Assumption 1

For some constants \(\sigma , L > 0\),

  1. (i)

    \(\Vert J(x)\Vert \le \sigma\), \(\forall x \in {\mathcal {C}} \cap {\mathcal {S}}(x_0)\),

  2. (ii)

    \(\Vert J(y) - J(x)\Vert \le L\Vert y - x\Vert\), \(\forall x,y \in {\mathcal {C}}\) s.t.  \({\mathcal {L}}(x, y) \subseteq {\mathcal {S}}(x_0)\).

Assumption 1(i) means the \(\sigma\)-boundedness of J on \({\mathcal {C}} \cap {\mathcal {S}}(x_0)\). Assumption 1(ii) is similar to the L-Lipschitz continuity of J on \({\mathcal {C}} \cap {\mathcal {S}}(x_0)\) but weaker due to the condition \({\mathcal {L}}(x, y) \subseteq {\mathcal {S}}(x_0)\). Assumption 1 is milder than the assumptions in previous works that discuss the iteration complexity, even when \({\mathcal {C}} = {\mathbb {R}}^d\). For example, the analysis in [66] assumes that f and J are Lipschitz continuous on \({\mathbb {R}}^d\), which implies the boundedness of J on \({\mathbb {R}}^d\).

4.2 Approximate stationary point

Before analyzing the algorithm, we define an \(\varepsilon\)-stationary point for constrained optimization problems. Let \(\iota _{{\mathcal {C}}}: {\mathbb {R}}^d \rightarrow {\mathbb {R}}\cup \{ +\infty \}\) be the indicator function of the closed convex set \({\mathcal {C}} \subseteq {\mathbb {R}}^d\). For a convex function \(g: {\mathbb {R}}^d \rightarrow {\mathbb {R}}\cup \{ +\infty \}\), its subdifferential at \(x \in {\mathbb {R}}^d\) is the set defined by \(\partial g(x) :=\{ p \in {\mathbb {R}}^d \,|\, g(y) \ge g(x) + \langle p, y - x \rangle , \ \forall y \in {\mathbb {R}}^d \}\).

Definition 1

(see, e.g., Definition 1 in [51]) For \(\varepsilon > 0\), a point \(x \in {\mathcal {C}}\) is said to be an \(\varepsilon\)-stationary point of the problem \(\min _{x \in {\mathcal {C}}} f(x)\) if

$$\begin{aligned} \min _{p \in \partial \iota _{{\mathcal {C}}}(x)} \Vert \nabla f(x) + p\Vert \le \varepsilon . \end{aligned}$$
(14)

This definition is consistent with the unconstrained case; the above inequality is equivalent to \(\Vert \nabla f(x)\Vert \le \varepsilon\) when \({\mathcal {C}} = {\mathbb {R}}^d\). There is another equivalent definition of an \(\varepsilon\)-stationary point, which we will also use.

Lemma 2

For \(x \in {\mathcal {C}}\) and \(\varepsilon > 0\), condition (14) is equivalent to

$$\begin{aligned} \langle \nabla f(x), y - x \rangle \ge - \varepsilon \Vert y - x\Vert , \quad \forall y \in {\mathcal {C}}. \end{aligned}$$
(15)

Proof

The tangent cone \({\mathcal {T}}(x)\) of \({\mathcal {C}}\) at \(x \in {\mathcal {C}}\) is defined by

$$\begin{aligned} {\mathcal {T}}(x)&:=\{ \beta (y - x) \,|\, y \in {\mathcal {C}}, \ \beta \ge 0 \}. \end{aligned}$$
(16)

Note that

$$\begin{aligned} {\mathcal {T}}(x)&= \{ z \in {\mathbb {R}}^d \,|\, \langle y, z \rangle \le 0,\ \forall y \in \partial \iota _{{\mathcal {C}}}(x) \} \end{aligned}$$
(17)

because \({\mathcal {C}}\) is a closed convex set and \(\partial \iota _{{\mathcal {C}}}(x)\) is the normal cone of \({\mathcal {C}}\). We have

$$\begin{aligned}&\min _{p \in \partial \iota _{{\mathcal {C}}}(x)} \Vert \nabla f(x) + p\Vert \\&\quad = \min _{p \in \partial \iota _{{\mathcal {C}}}(x)} \max _{u: \Vert u\Vert \le 1} \langle - \nabla f(x) - p, u \rangle \\&\quad = \max _{u: \Vert u\Vert \le 1} \inf _{p \in \partial \iota _{{\mathcal {C}}}(x)} \Big \{ \langle - \nabla f(x), u \rangle - \langle p, u \rangle \Big \}&\quad&(\text {by a minimax theorem})\\&\quad = \max _{u \in {\mathcal {T}}(x),\,\Vert u\Vert \le 1} \langle - \nabla f(x), u \rangle&\quad&(\text {by}\, (17))\\&\quad = \sup _{y \in {\mathcal {C}} \setminus \{ x \}} \frac{\langle - \nabla f(x), y - x \rangle }{\Vert y - x\Vert }&\quad&(\text {by}\, (16)). \end{aligned}$$

Therefore, condition (14) is equivalent to

$$\begin{aligned} \sup _{y \in {\mathcal {C}} \setminus \{ x \}} \frac{\langle - \nabla f(x), y - x \rangle }{\Vert y - x\Vert } \le \varepsilon , \end{aligned}$$

which is also equivalent to (15). \(\square\)

A useful tool for deriving iteration complexity bounds is the gradient mapping (see, e.g., [49]), also known as the projected gradient [41] or the reduced gradient [52]. For \(\eta > 0\), the projected gradient operator \({\mathcal {P}}_\eta : {\mathcal {C}} \rightarrow {\mathcal {C}}\) and the gradient mapping \({\mathcal {G}}_\eta : {\mathcal {C}} \rightarrow {\mathbb {R}}^d\) for problem (1) are defined by

$$\begin{aligned} {\mathcal {P}}_\eta (x)&:=\mathop {\textrm{argmin}}\limits _{y \in {\mathcal {C}}} \Big \{ \langle \nabla f(x), y - x \rangle + \frac{\eta }{2}\Vert y - x\Vert ^2 \Big \} = {{\,\textrm{proj}\,}}_{{\mathcal {C}}} \Big ( x - \frac{1}{\eta } \nabla f(x) \Big ), \end{aligned}$$
(18)
$$\begin{aligned} {\mathcal {G}}_\eta (x)&:=\eta (x - {\mathcal {P}}_\eta (x)). \end{aligned}$$
(19)

The following lemma shows the relationship between an \(\varepsilon\)-stationary point and the gradient mapping.

Lemma 3

Suppose that Assumption 1 holds, and let

$$\begin{aligned} L_{f} :=\sigma ^2 + L \Vert F_0\Vert . \end{aligned}$$
(20)

Then, for any \(x \in {\mathcal {C}} \cap {\mathcal {S}}(x_0)\) and \(\eta \ge L_f\), the point \({\mathcal {P}}_\eta (x)\) is a \((2 \Vert {\mathcal {G}}_\eta (x)\Vert )\)-stationary point of problem (1).

The proof is given in Sect. A.4. This lemma will be used for the proof of Theorem 1(ii).

Although Lemma 3 looks quite similar to [51, Corollary 1], there exists a significant difference in their assumptions. Indeed, Lemma 3 assumes the boundedness and the Lipschitz property of J only on a (possibly) nonconvex set \({\mathcal {C}} \cap {\mathcal {S}}(x_0)\), whereas [51, Corollary 1] assumes the Lipschitz continuity on the whole space \({\mathbb {R}}^d\). This makes our proof more complicated than in [51, Corollary 1].

4.3 Preliminary lemmas

First, we bound the decrease in the model function value due to the inner loop. For \(\eta > 0\), we define the function \({\mathcal {D}}_\eta : {\mathcal {C}} \rightarrow {\mathbb {R}}\) by

$$\begin{aligned} {\mathcal {D}}_\eta (x) :=- \min _{y \in {\mathcal {C}}} \Big \{ \langle \nabla f(x), y - x \rangle + \frac{\eta }{2}\Vert y - x\Vert ^2 \Big \}. \end{aligned}$$
(21)

We see that \({\mathcal {D}}_\eta (x) \ge -\langle \nabla f(x), x - x \rangle - \frac{\eta }{2}\Vert x - x\Vert ^2 = 0\) for all \(x \in {\mathcal {C}}\). In addition, \({\mathcal {D}}_\eta (x)\) is decreasing with respect to \(\eta\).

Lemma 4

The solution x obtained in Line 11 of Algorithm 1 satisfies

$$\begin{aligned} m^k_{\uplambda } (x) \le m^k_{\uplambda } (x_k) - {\mathcal {D}}_\eta (x_k) \le m^k_{\uplambda } (x_k), \end{aligned}$$
(22)

where k, \(\uplambda\), and \(\eta\) are parameters in Algorithm 1.

Proof

The second inequality in (22) follows from the nonnegativity of \({\mathcal {D}}_\eta (x)\), and therefore we will prove the first one. Let \(T'\) denote the value of t when the inner loop is completed, and for each \(0 \le t \le T'\), let \(\eta _{k,t}\) denote the value of \(\eta\) when \(x_{k,t}\) is obtained through Line 7. Our aim is to prove the first inequality in (22) with \((x, \eta )=(x_{k,T'},\eta _{k,T'})\). We have

$$\begin{aligned} m^k_{\uplambda }(x_{k,1})&\le m^k_\uplambda (x_k) + \langle \nabla m^k_\uplambda (x_k), x_{k,1} - x_k \rangle + \frac{\eta _{k,1}}{2} \Vert x_{k,1} - x_k\Vert ^2&\quad&(\text {by Line 6})\\&= m^k_\uplambda (x_k) + \min _{z \in {\mathcal {C}}} \Big \{ \langle \nabla m^k_\uplambda (x_k), z - x_k \rangle + \frac{\eta _{k,1}}{2}\Vert z - x_k\Vert ^2 \Big \}&\quad&(\text {by the definition of } x_{k,1})\\&= m^k_\uplambda (x_k) - {\mathcal {D}}_{\eta _{k,1}}(x_k)&\quad&(\text {by } \nabla m^k_\uplambda (x_k) = \nabla f(x_k)). \end{aligned}$$

Since \({\mathcal {D}}_\eta (x_k)\) is decreasing in \(\eta\) and \(\eta _{k,1} \le \eta _{k,2} \le \dots \le \eta _{k,T'}\), we have \({\mathcal {D}}_{\eta _{k,1}}(x_k) \ge {\mathcal {D}}_{\eta _{k,T'}}(x_k)\). On the other hand, we have \(m^k_\uplambda (x_{k,1}) \ge \dots \ge m^k_\uplambda (x_{k, T'})\). Combining these inequalities, we obtain the desired result. \(\square\)

From the above lemma and Line 12, it follows that for all k,

$$\begin{aligned} f(x_{k+1}) \le m^k_{\uplambda _k}(x_{k+1}) \le m^k_{\uplambda _k}(x_k) = f(x_k). \end{aligned}$$
(23)

This monotonicity of \(f(x_k)\) in k is an important property of the majorization-minimization and will be used in our analysis.

The following two lemmas show that the parameters M and \(\eta\) in the algorithm are upper-bounded, and hence Lines 9 and 15 are executed only a finite number of times per single run.

Lemma 5

Suppose that Assumption 1(ii) holds, and let

$$\begin{aligned} {\bar{M}} :=\max \{M_0, \alpha L\}, \end{aligned}$$
(24)

where \(M_0\) and \(\alpha\) are the inputs of Algorithm 1. Then,

  1. (i)

    The parameter M in Algorithm 1 always satisfies \(M \le {\bar{M}}\);

  2. (ii)

    Throughout the algorithm, the number of unsuccessful iterations is at most \(\lceil \log _\alpha ({\bar{M}} / M_0)\rceil = O(1)\).

Proof

We have \({\mathcal {S}}(x_k) \subseteq {\mathcal {S}}(x_0)\) from (23), and therefore Assumption 1(ii) implies (9) with \({\mathcal {X}} = {\mathcal {C}}\). On the other hand, (22) directly implies (11). Hence, by Lemma 1 with \({\mathcal {X}} = {\mathcal {C}}\) and Lemma 4, if \(M \ge L\) holds at Line 3, the condition in Line 12 must be true. Therefore, if \(M_0 \ge L\), no unsuccessful iterations occur and the parameter M always satisfies \(M = M_0\). Otherwise, there exists an integer \(l \ge 1\) such that \(L \le \alpha ^l M_0 < \alpha L\). Since \(M = \alpha ^l M_0\) after l unsuccessful iterations, the parameter M always satisfies \(M < \alpha L\). Consequently, we obtain the first result, and the second follows from the first. \(\square\)

Lemma 6

Suppose that Assumption 1 holds, and let

$$\begin{aligned} {\bar{\eta }} :=\max \{ \eta _0, \alpha _{\textrm{in}} (\sigma ^2 + {\bar{M}} \Vert F_0\Vert ) \}, \end{aligned}$$
(25)

where \(\eta _0\) and \(\alpha _{\textrm{in}}\) are the inputs of Algorithm 1 and \({\bar{M}}\) is defined in (24). Then,

  1. (i)

    the parameter \(\eta\) in Algorithm 1 always satisfies \(\eta \le {\bar{\eta }}\);

  2. (ii)

    throughout the algorithm, Line 9 will be executed at most \(\lceil \log _{\alpha _\textrm{in}} ({\bar{\eta }} / \eta _0)\rceil = O(1)\) times.

Proof

Since the function \(m^k_\uplambda\) defined by (2) has the \((\Vert J_k\Vert ^2 + \uplambda )\)-Lipschitz continuous gradient, we have

$$\begin{aligned} m^k_\uplambda (y) \le m^k_\uplambda (x) + \langle \nabla m^k_\uplambda (x), y - x \rangle + \frac{\Vert J_k\Vert ^2 + \uplambda }{2} \Vert y - x\Vert ^2,\quad \forall x, y \in {\mathbb {R}}^d \end{aligned}$$

(see, e.g., [52, Eq. (2.1.9)]). We also have \(\Vert J_k\Vert ^2 + \uplambda \le \sigma ^2 + {\bar{M}} \Vert F_0\Vert\) from Assumption 1(i) and Lemma 5. Therefore, the inequality in Line 6 must hold if \(\eta \ge \sigma ^2 + {\bar{M}} \Vert F_0\Vert\). With the same arguments as in Lemma 5, we obtain the desired results. \(\square\)

As we can see from the proofs of Lemmas 5 and 6, if \(M_0 \ge L\) and \(\eta _0 \ge \sigma ^2 + M_0 \Vert F_0\Vert\), then no unsuccessful iterations occur in either the outer or the inner loop. Adjusting M and \(\eta\) adaptively as in the presented algorithm avoids an unnecessarily small step-size in practice.

4.4 Iteration complexity and overall complexity

We use the following lemma for the analysis.

Lemma 7

$$\begin{aligned} {\mathcal {D}}_\eta (x) \ge \frac{1}{2\eta }\Vert {\mathcal {G}}_\eta (x)\Vert ^2,\quad \forall x \in {\mathcal {C}}, \quad \eta > 0. \end{aligned}$$

Proof

By the first-order optimality condition on (18) and the convexity of \({\mathcal {C}}\), we have

$$\begin{aligned} \langle \nabla f(x) + \eta ({\mathcal {P}}_\eta (x) - x), y - {\mathcal {P}}_\eta (x) \rangle \ge 0, \quad \forall y \in {\mathcal {C}}. \end{aligned}$$
(26)

Using this inequality, we obtain

$$\begin{aligned} {\mathcal {D}}_\eta (x)&= \langle \nabla f(x), x - {\mathcal {P}}_\eta (x) \rangle - \frac{\eta }{2}\Vert x - {\mathcal {P}}_\eta (x)\Vert ^2&\quad&(\text {by}\, (18)\, \text {and} \,(21))\\&\ge \frac{\eta }{2}\Vert x - {\mathcal {P}}_\eta (x)\Vert ^2&\quad&(\text {by}\, (26)\, \text {with }y = x)\\&= \frac{1}{2\eta }\Vert {\mathcal {G}}_\eta (x)\Vert ^2&\quad&(\text {by}\, (19)). \end{aligned}$$

\(\square\)

We show the asymptotic global convergence and the iteration complexity bound of Algorithm 1.

Theorem 1

Suppose that Assumption 1 holds, and define \({\bar{\eta }}\) by (25). Then,

  1. (i)

    \(\displaystyle \lim _{k \rightarrow \infty } \Vert {\mathcal {G}}_{{\bar{\eta }}}(x_k)\Vert = 0\), and therefore, any accumulation point of \((x_k)\) is a stationary point of problem (1);

  2. (ii)

    \({\mathcal {P}}_{{\bar{\eta }}}(x_k)\) is an \(\varepsilon\)-stationary point of problem (1) for some \(k = O(\varepsilon ^{-2})\).

Proof

We have

$$\begin{aligned} f(x_{k+1}) - f(x_k)&\le m^k_{\uplambda _k}(x_{k+1}) - m^k_{\uplambda _k}(x_k)&\quad&(\text {by Line 12 and}\, m^k_{\uplambda _k}(x_k) = f(x_k))\\&\le - {\mathcal {D}}_{{\bar{\eta }}}(x_k)&\quad&(\text {by Lemmas 6(i) and 4})\\&\le - \frac{1}{2 {\bar{\eta }}}\Vert {\mathcal {G}}_{{\bar{\eta }}}(x_k)\Vert ^2&\quad&(\text {by Lemma 7}). \end{aligned}$$

Summing up this inequality for \(k=0,1,\dots ,K-1\), we obtain

$$\begin{aligned} \sum _{k=0}^{K-1} \Vert {\mathcal {G}}_{{\bar{\eta }}}(x_k)\Vert ^2 \le 2 {\bar{\eta }} (f(x_0) - f(x_K)) \le 2 {\bar{\eta }} f(x_0) \end{aligned}$$
(27)

for all \(K \ge 0\). Therefore, we also have \(\sum _{k=0}^{\infty } \Vert {\mathcal {G}}_{{\bar{\eta }}}(x_k)\Vert ^2 \le 2 {\bar{\eta }} f(x_0)\), yielding \(\lim _{k \rightarrow \infty } \Vert {\mathcal {G}}_{{\bar{\eta }}}(x_k)\Vert = 0\), the first result.

Combining (27) with \(\min _{0 \le k < K} \Vert {\mathcal {G}}_{{\bar{\eta }}}(x_k)\Vert ^2 \le \frac{1}{K} \sum _{k=0}^{K-1} \Vert {\mathcal {G}}_{{\bar{\eta }}}(x_k)\Vert ^2\), we have \(\Vert {\mathcal {G}}_{{\bar{\eta }}}(x_k)\Vert \le \varepsilon / 2\) for some \(k = O(\varepsilon ^{-2})\). For such \(x_k\), the point \({\mathcal {P}}_{{\bar{\eta }}}(x_k)\) is an \(\varepsilon\)-stationary point from Lemma 3 and \({\bar{\eta }} \ge L_f\). Thus, we have obtained the second result. \(\square\)

From Lemma 5(ii) and Theorem 1(ii), we obtain the iteration complexity bound of our algorithm as follows.

Corollary 1

Under Assumption 1, Algorithm 1 finds an \(\varepsilon\)-stationary point within \(O(\varepsilon ^{-2})\) outer iterations, namely, \(O(\varepsilon ^{-2})\) successful and unsuccessful iterations.

From this iteration complexity bound and Lemma 6(ii), we also obtain the overall complexity bound.

Corollary 2

Suppose that Assumption 1holds. Then, Algorithm 1 with \(T < \infty\) finds an \(\varepsilon\)-stationary point after \(O(\varepsilon ^{-2} T)\) basic operations.

We use the term basic operations to refer to evaluation of F(x), Jacobian-vector multiplications J(x)u and \(J(x)^\top v\), and projection onto \({\mathcal {C}}\) as in Sect. 1.2.

In order to compute an \(\varepsilon\)-stationary point based on Theorem 1(ii), knowledge of the value of \({\bar{\eta }}\) is required. However, this requirement can be circumvented with a slight modification of the algorithm. See Sect. A.5 for the details.

5 Local quadratic convergence

For zero-residual problems, we will prove that the sequence \((x_k)\) generated by Algorithm 1 with \(T = \infty\) converges locally quadratically to an optimal solution. Let us denote the set of optimal solutions to problem (1) by \({\mathcal {X}}^* :=\{ x \in {\mathcal {C}} \,|\, F(x) = {\textbf{0}} \}\) and the distance between \(x \in {\mathbb {R}}^d\) and \({\mathcal {X}}^*\) simply by \({{\,\textrm{dist}\,}}(x) :=\min _{y \in {\mathcal {X}}^*} \Vert y - x\Vert\). Throughout this section, we fix a point \(x^* \in {\mathcal {X}}^*\) and denote a neighborhood of \(x^*\) by \({\mathcal {B}}(r) :=\{ x \in {\mathbb {R}}^d \,|\, \Vert x - x^*\Vert \le r \}\) for \(r > 0\). As in the previous section, we denote the sequences generated by Algorithm 1 with \(T = \infty\) by \((x_k)\) and \((\uplambda _k)\).

5.1 Assumptions

We make the following assumptions to prove local quadratic convergence.

Assumption 2

  1. (i)

    There exists \(x \in {\mathcal {C}}\) such that \(F(x) = {\textbf{0}}\).

For some constants \(\rho , L, r > 0\),

  1. (ii)

    \(\rho {{\,\textrm{dist}\,}}(x) \le \Vert F(x)\Vert\), \(\forall x \in {\mathcal {C}} \cap {\mathcal {B}}(r)\),

  2. (iii)

    \(\Vert J(y) - J(x)\Vert \le L\Vert y - x\Vert\), \(\forall x,y \in {\mathcal {C}} \cap {\mathcal {B}}(r)\).

Assumption 2(i) requires the problem to be zero-residual, Assumption 2(ii) is called a local error bound condition, and Assumption 2(iii) is the local Lipschitz continuity of J. These assumptions are used in the previous analyses of LM methods [1, 3, 19, 22, 23, 29, 30, 33, 34, 40, 62].

5.2 Fundamental inequalities for analysis

Since \({\mathcal {C}} \cap {\mathcal {B}}(r)\) is compact, there exists a constant \(\sigma > 0\) such that

$$\begin{aligned} \Vert J(x)\Vert \le \sigma , \quad \forall x \in {\mathcal {C}} \cap {\mathcal {B}}(r), \end{aligned}$$
(28)

which implies

$$\begin{aligned} \Vert F(y) - F(x)\Vert \le \sigma \Vert y - x\Vert , \quad \forall x, y \in {\mathcal {C}} \cap {\mathcal {B}}(r). \end{aligned}$$
(29)

Let \(\sigma\) denote such a constant in the rest of this section.

For a point \(x \in {\mathbb {R}}^d\), let \({{\tilde{x}}} \in {\mathcal {X}}^*\) denote an optimal solution closest to x, i.e., \(\Vert {{\tilde{x}}} - x\Vert = {{\,\textrm{dist}\,}}(x)\). In particular, \({{\tilde{x}}}_k\) denotes one of the closest solutions to \(x_k\) for each \(k \ge 0\). Since \(\Vert {{\tilde{a}}} - x^*\Vert \le \Vert {{\tilde{a}}} - a\Vert + \Vert a - x^*\Vert \le 2\Vert a - x^*\Vert\), where the second inequality uses \(\Vert {{\tilde{a}}} - a\Vert = {{\,\textrm{dist}\,}}(a) \le \Vert a - x^*\Vert\), we have

$$\begin{aligned} a \in {\mathcal {B}}(r/2) \quad \Longrightarrow \quad {{\tilde{a}}} \in {\mathcal {B}}(r). \end{aligned}$$
(30)

Therefore, (29) with \(y :={{\tilde{x}}}\) implies

$$\begin{aligned} \Vert F(x)\Vert \le \sigma \Vert x - {{\tilde{x}}}\Vert = \sigma {{\,\textrm{dist}\,}}(x), \quad \forall x \in {\mathcal {C}} \cap {\mathcal {B}}(r/2). \end{aligned}$$
(31)

From the stopping criterion in Line 10 of Algorithm 1 with \(T = \infty\) and Definition 1, the solution x obtained in Line 11 satisfies

$$\begin{aligned} \langle \nabla m^k_\uplambda (x), y - x \rangle \ge - c\uplambda \Vert F_k\Vert \Vert y - x\Vert , \quad \forall y \in {\mathcal {C}}. \end{aligned}$$
(32)

From the definition of \(x_{k+1}\) and \(\uplambda _k\), we also have the inequality with \((x, \uplambda ) = (x_{k+1}, \uplambda _k)\), i.e.,

$$\begin{aligned} \langle \nabla m^k_{\uplambda _k}(x_{k+1}), y - x_{k+1} \rangle \ge - c\uplambda _k \Vert F_k\Vert \Vert y - x_{k+1}\Vert , \quad \forall y \in {\mathcal {C}}. \end{aligned}$$
(33)

5.3 Preliminary lemma

Lemma 8

Suppose that Assumption 2 holds, and define \({\bar{M}}\) by (24). Define the constants \(C_1, C_2, \delta > 0\) by

$$\begin{aligned} C_1&:=\sqrt{1 + c^2 \sigma ^2 + \frac{L^2 r}{16 \rho M_0}}, \end{aligned}$$
(34a)
$$\begin{aligned} C_2&:=\frac{1}{c^2} \bigg ( \sigma ^2 \Big ( c{\bar{M}} + \frac{L}{2\rho } \Big ) + \frac{L\sigma C_1^2}{2} + (L + {\bar{M}})\sigma C_1 \bigg ), \end{aligned}$$
(34b)
$$\begin{aligned} \delta&:=\frac{r}{2(1 + C_1)}, \end{aligned}$$
(34c)

where \(M_0\) and c are the inputs of Algorithm 1. Assume that \(x_k \in {\mathcal {B}}(\delta )\) and \(M \le {\bar{M}}\) hold at Line 3. Then,

  1. (i)

    The solution x obtained in Line 11 satisfies

    $$\begin{aligned} \Vert x - x_k\Vert \le C_1 {{\,\textrm{dist}\,}}(x_k); \end{aligned}$$
    (35)
  2. (ii)

    \(M \le {\bar{M}}\) holds when \(x_{k+1}\) is obtained;

  3. (iii)

    The following hold:

    $$\begin{aligned} \Vert x_{k+1} - x_k\Vert&\le C_1 {{\,\textrm{dist}\,}}(x_k), \end{aligned}$$
    (36)
    $$\begin{aligned} {{\,\textrm{dist}\,}}(x_{k+1})&\le C_2 {{\,\textrm{dist}\,}}(x_k)^2. \end{aligned}$$
    (37)

Proof of Lemma 8(i)

From \(x_k \in {\mathcal {B}}(\delta )\), \(\delta \le r/2\), and (30), we have

$$\begin{aligned} x_k \in {\mathcal {B}}(r/2) \quad \text {and}\quad {{\tilde{x}}}_k \in {\mathcal {B}}(r). \end{aligned}$$
(38)

Moreover, we have from \(\nabla m^k_{\uplambda }(x) = J_k^\top (F_k + J_k(x - x_k)) + \uplambda (x - x_k)\) that

$$\begin{aligned} \underbrace{\langle \nabla m^k_{\uplambda }(x), x - {{\tilde{x}}}_k \rangle }_{\text {(A)}} = \underbrace{\langle F_k + J_k(x - x_k), J_k(x - {{\tilde{x}}}_k) \rangle }_{\text {(B)}} + \uplambda \underbrace{\langle x - x_k, x - {{\tilde{x}}}_k \rangle }_{\text {(C)}}. \end{aligned}$$

We bound the terms (A)–(C) as follows:

$$\begin{aligned} \text {(A)} \le c\uplambda \Vert F_k\Vert \Vert x - {{\tilde{x}}}_k\Vert&\le c\sigma \uplambda \Vert x_k - {{\tilde{x}}}_k\Vert \Vert x - {{\tilde{x}}}_k\Vert \\&\le \frac{c^2 \sigma ^2 \uplambda }{2} \Vert x_k - {{\tilde{x}}}_k\Vert ^2 + \frac{\uplambda }{2} \Vert x - {{\tilde{x}}}_k\Vert ^2, \end{aligned}$$

where the first and second inequalities follow from (32) and (31), respectively, and the last inequality follows from the inequality of arithmetic and geometric means;

$$\begin{aligned} \text {(B)} \ge - \frac{1}{4} \Vert F_k + J_k({{\tilde{x}}}_k - x_k)\Vert ^2 \ge - \frac{L^2}{16}\Vert {{\tilde{x}}}_k - x_k\Vert ^4, \end{aligned}$$

where the first inequality follows from \(4\langle a, b \rangle = \Vert a+b\Vert ^2-\Vert a-b\Vert ^2 \ge -\Vert a-b\Vert ^2\) and the second inequality from Lemma 9(ii), (38), and Assumption 2(iii);

$$\begin{aligned} \text {(C)} = \frac{1}{2} \Big ( \Vert x - x_k\Vert ^2 + \Vert x - {{\tilde{x}}}_k\Vert ^2 - \Vert {{\tilde{x}}}_k - x_k\Vert ^2 \Big ). \end{aligned}$$

Combining these bounds and rearranging terms yield

$$\begin{aligned} \Vert x - x_k\Vert ^2 \le (1 + c^2 \sigma ^2) \Vert {{\tilde{x}}}_k - x_k\Vert ^2 + \frac{L^2}{8\uplambda }\Vert {{\tilde{x}}}_k - x_k\Vert ^4. \end{aligned}$$
(39)

From (38), Assumption 2(ii), and \(\uplambda = M \Vert F_k\Vert \ge M_0 \Vert F_k\Vert\), we have

$$\begin{aligned} \Vert {{\tilde{x}}}_k - x_k\Vert ^2 \le \frac{r}{2} \times \frac{\Vert F_k\Vert }{\rho } \le \frac{r \uplambda }{2 \rho M_0}. \end{aligned}$$

Applying this bound to the second term on the right-hand side of (39), we obtain the desired result (35). \(\square\)

Proof of Lemma 8(ii)

As in the proof of Lemma 8(i), let x denote the solution obtained in Line 11. By (34c), (35), and \(x_k \in {\mathcal {B}}(\delta )\), we have

$$\begin{aligned} \Vert x - x^*\Vert&\le \Vert x_k - x^*\Vert + \Vert x - x_k\Vert \\&\le \Vert x_k - x^*\Vert + C_1 {{\,\textrm{dist}\,}}(x_k)\\&\le (1 + C_1) \Vert x_k - x^*\Vert \le (1 + C_1) \delta = r/2, \end{aligned}$$

i.e.,

$$\begin{aligned} x \in {\mathcal {B}}(r/2). \end{aligned}$$
(40)

We now have \(x_k, x \in {\mathcal {C}} \cap {\mathcal {B}}(r)\). As in the proof of Lemma 5(i), by using Lemma 1 with \({\mathcal {X}} :={\mathcal {C}} \cap {\mathcal {B}}(r)\), we see that if \(M \ge L\) holds at Line 3, the outer iteration must be successful. This leads to the desired result. \(\square\)

Proof of Lemma 8(iii)

Equation (36) follows from Lemmas 8(i) and 8(ii). We prove (37) below. From (30) and (40), we have \(x_{k+1}, {{\tilde{x}}}_{k+1} \in {\mathcal {B}}(r)\). Moreover, we have

$$\begin{aligned}&\Vert F_{k+1}\Vert ^2 - \overbrace{\langle \nabla m^k_{\uplambda _k}(x_{k+1}), x_{k+1} - {{\tilde{x}}}_{k+1} \rangle }^{\text {(D)}}\\&= \langle F_{k+1}, F_{k+1} + J_{k+1}({{\tilde{x}}}_{k+1} - x_{k+1}) \rangle + \langle J_{k+1}^\top F_{k+1} - \nabla m^k_{\uplambda _k}(x_{k+1}), x_{k+1} - {{\tilde{x}}}_{k+1} \rangle \\&\le \underbrace{\Vert F_{k+1}\Vert \Vert F_{k+1} + J_{k+1}({{\tilde{x}}}_{k+1} - x_{k+1})\Vert }_{\text {(E)}} + \underbrace{\Vert J_{k+1}^\top F_{k+1} - \nabla m^k_{\uplambda _k}(x_{k+1})\Vert }_{\text {(F)}} {{\,\textrm{dist}\,}}(x_{k+1}) \end{aligned}$$

and bound the terms (D)–(F) as follows:

$$\begin{aligned} \text {(D)} \le c\uplambda _k \Vert F_k\Vert {{\,\textrm{dist}\,}}(x_{k+1}) \le c{\bar{M}} \Vert F_k\Vert ^2 {{\,\textrm{dist}\,}}(x_{k+1}) \end{aligned}$$

by (33) and Lemma 8(ii);

$$\begin{aligned} \text {(E)} \le \frac{L}{2} \Vert F_{k+1}\Vert {{\,\textrm{dist}\,}}(x_{k+1})^2 \le \frac{L}{2\rho } \Vert F_{k+1}\Vert ^2 {{\,\textrm{dist}\,}}(x_{k+1}) \le \frac{L}{2\rho } \Vert F_k\Vert ^2 {{\,\textrm{dist}\,}}(x_{k+1}) \end{aligned}$$

by Lemma 9(ii), Assumption 2(ii), and \(\Vert F_{k+1}\Vert \le \Vert F_k\Vert\) from (23); and

$$\begin{aligned} \text {(F)}&= \Vert J_{k+1}^\top F_{k+1} - J_k^\top (F_k + J_ku) - \uplambda _k u\Vert&\quad&(\text {by letting } u :=x_{k+1} - x_k)\\&\le \Vert J_k^\top (F_{k+1} - F_k - J_ku)\Vert + \Vert (J_{k+1} - J_k)^\top F_{k+1}\Vert + \uplambda _k \Vert u\Vert \\&\le \frac{L\sigma }{2} \Vert u\Vert ^2 + L\Vert F_{k+1}\Vert \Vert u\Vert + \uplambda _k \Vert u\Vert&\quad&(\text {by (28), Lemma 9(ii), and Assumption 2(iii)})\\&\le \frac{L\sigma }{2} \Vert u\Vert ^2 + (L + {\bar{M}}) \Vert F_k\Vert \Vert u\Vert&\quad&(\text {by } \Vert F_{k+1}\Vert \le \Vert F_k\Vert \text { and Lemma 8(ii)})\\&\le \Big ( \frac{L\sigma C_1^2}{2} + (L + {\bar{M}})\sigma C_1 \Big ) {{\,\textrm{dist}\,}}(x_k)^2&\quad&(\text {by (31) and (36)}). \end{aligned}$$

Combining these bounds yields

$$\begin{aligned} \Vert F_{k+1}\Vert ^2 \le \bigg ( \Big ( c{\bar{M}} + \frac{L}{2\rho } \Big ) \Vert F_k\Vert ^2 + \Big ( \frac{L\sigma C_1^2}{2} + (L + {\bar{M}})\sigma C_1 \Big ) {{\,\textrm{dist}\,}}(x_k)^2 \bigg ) {{\,\textrm{dist}\,}}(x_{k+1}). \end{aligned}$$

We bound \(\Vert F_k\Vert\) and \(\Vert F_{k+1}\Vert\) in the above inequality by using Assumption 2(ii) and (31), yielding

$$\begin{aligned} \rho ^2 {{\,\textrm{dist}\,}}(x_{k+1})^2 \le \bigg ( \sigma ^2 \Big ( c{\bar{M}} + \frac{L}{2\rho } \Big ) + \frac{L\sigma C_1^2}{2} + (L + {\bar{M}})\sigma C_1 \bigg ) {{\,\textrm{dist}\,}}(x_k)^2{{\,\textrm{dist}\,}}(x_{k+1}), \end{aligned}$$

which implies the desired result (37). \(\square\)

5.4 Local quadratic convergence

Let us state the local quadratic convergence result of Algorithm 1.

Theorem 2

Suppose that Assumption 2 holds, and define \({\bar{M}}\) by (24). Set \(x_0 \in {\mathcal {B}}(\delta _0)\) for a sufficiently small constant \(\delta _0 > 0\) such that

$$\begin{aligned} C_2 \delta _0 < 1,\quad \delta _0 + \frac{C_1 \delta _0}{1 - C_2 \delta _0} \le \delta , \end{aligned}$$
(41)

where \(C_1\), \(C_2\), and \(\delta\) are the constants defined in (34a)–(34c). Then,

  1. (i)

    The number of unsuccessful iterations is at most \(\lceil \log _\alpha ({\bar{M}} / M_0)\rceil = O(1)\), and

  2. (ii)

    The sequence \((x_k)\) converges quadratically to an optimal solution \({\hat{x}} \in {\mathcal {X}}^*\).

Proof of Theorem 2(i)

First, we will prove that

$$\begin{aligned}&x_k \in {\mathcal {B}}(\delta ), \text {and} \end{aligned}$$
(42a)
$$\begin{aligned}&M \le {\bar{M}}\text { holds when }x_k\text { is obtained} \end{aligned}$$
(42b)

for all \(k \ge 0\) by induction. For \(k=0\), (42a) and (42b) are obvious. For a fixed \(K \ge 0\), assume (42a) and (42b) for all \(k \le K\). We then have (36), (37) and (42b) for \(k \le K+1\) by Lemma 8. To complete the induction, we prove (42a) for \(k = K+1\). Solving the recursion of (37) and using \({{\,\textrm{dist}\,}}(x_0) \le \delta _0\), we have

$$\begin{aligned} {{\,\textrm{dist}\,}}(x_k) \le {{\,\textrm{dist}\,}}(x_0) (C_2 {{\,\textrm{dist}\,}}(x_0))^{2^k - 1} \le \delta _0 (C_2 \delta _0)^{2^k - 1} \le \delta _0 (C_2 \delta _0)^k \end{aligned}$$
(43)

for all \(k \le K+1\). We obtain (42a) for \(k = K+1\) as follows:

$$\begin{aligned} \Vert x_{K+1} - x^*\Vert&\le \Vert x_0 - x^*\Vert + \sum _{k=0}^K \Vert x_{k+1} - x_k\Vert&\quad&(\text {by the triangle inequality})\\&\le \delta _0 + C_1 \sum _{k=0}^K {{\,\textrm{dist}\,}}(x_k)&\quad&(\text {by (36)})\\&\le \delta _0 + \frac{C_1 \delta _0}{1 - C_2 \delta _0} \le \delta&\quad&(\text {by (41) and (43)}). \end{aligned}$$

Now, we have proved (42a) and (42b) for all \(k \ge 0\). In particular, claim (i) follows from (42b) by the same argument as in the proof of Lemma 5(ii). \(\square\)

Proof of Theorem 2(ii)

Note that we have proved (37) and (43) for all \(k \ge 0\) in the proof of Theorem 2(i). By (43) and \(C_2\delta _0<1\) in (41), we have

$$\begin{aligned} \lim _{k \rightarrow \infty }{{\,\textrm{dist}\,}}(x_k) = 0. \end{aligned}$$
(44)

As with (43), we have for \(i \ge k\),

$$\begin{aligned} {{\,\textrm{dist}\,}}(x_i) \le {{\,\textrm{dist}\,}}(x_k) (C_2 {{\,\textrm{dist}\,}}(x_k))^{2^{i-k} - 1} \le {{\,\textrm{dist}\,}}(x_k) (C_2 \delta _0)^{i-k}. \end{aligned}$$

Using this bound and (35), we obtain

$$\begin{aligned} \Vert x_k - x_l\Vert \le \sum _{i=k}^{l-1} \Vert x_{i+1} - x_i\Vert \le C_1 \sum _{i=k}^{l-1} {{\,\textrm{dist}\,}}(x_i) \le \frac{C_1}{1 - C_2\delta _0} {{\,\textrm{dist}\,}}(x_k) \end{aligned}$$
(45)

for all \(k, l\) such that \(0 \le k < l\). Equations (45) and (44) imply that \((x_k)\) is a Cauchy sequence. Accordingly, the sequence \((x_k)\) converges to a point \({\hat{x}}\), which belongs to \({\mathcal {X}}^*\) by (44) and the closedness of \({\mathcal {X}}^*\). Thus, we obtain

$$\begin{aligned} \Vert x_{k+1} - {\hat{x}}\Vert&= \lim _{l \rightarrow \infty } \Vert x_{k+1} - x_l\Vert&\quad&(\text {by the continuity of $\Vert \cdot \Vert$} )\\&\le \frac{C_1}{1 - C_2\delta _0} {{\,\textrm{dist}\,}}(x_{k+1})&\quad&(\text {by (45)})\\&\le \frac{C_1C_2}{1 - C_2\delta _0} {{\,\textrm{dist}\,}}(x_k)^2&\quad&(\text {by (37)})\\&\le \frac{C_1C_2}{1 - C_2\delta _0} \Vert x_k - {\hat{x}}\Vert ^2&\quad&(\text {by}\, {\hat{x}} \in {\mathcal {X}}^*), \end{aligned}$$

which implies Theorem 2(ii). \(\square\)

6 Practical variant of the proposed method

We present a more practical variant (Algorithm 3) of Algorithm 1, which also achieves the theoretical guarantees given for Algorithm 1 in Sects. 4 and 5.

6.1 Generalized version of Algorithm 1

To obtain the practical variant, we first present a generalized framework of Algorithm 1. Algorithm 1 runs the vanilla projected gradient (PG) method in the inner loop. This PG can be replaced with other algorithms while keeping the \(O(\varepsilon ^{-2})\) iteration complexity and the quadratic convergence established for Algorithm 1. Indeed, these theoretical results rely on the fact that the x obtained in Line 11 of Algorithm 1 satisfies the following conditions:

Condition 1

(for \(O(\varepsilon ^{-2})\) iteration complexity bound) There exists a constant \(\gamma > 0\) such that for all k,

$$\begin{aligned} m^k_{\uplambda } (x) - m^k_{\uplambda } (x_k) \le - {\mathcal {D}}_\gamma (x_k). \end{aligned}$$

Condition 2

(for local quadratic convergence) Both of the following hold:

  1. (i)

    \(m^k_\uplambda (x) \le m^k_\uplambda (x_k)\) for all k;

  2. (ii)

    There exists a constant \(c > 0\) such that x is a \((c\uplambda \Vert F_k\Vert )\)-stationary point of subproblem (3) for all k.

This fact yields a general algorithmic framework, Algorithm 2, that achieves the \(O(\varepsilon ^{-2})\) iteration complexity bound together with the quadratic convergence.

Algorithm 2 (pseudocode)
Algorithm 3 (pseudocode)

In Line 4 of Algorithm 2, any globally convergent algorithm for subproblem (3) can be employed. For example, we may use (block) coordinate descent methods, Frank-Wolfe methods, interior point methods, active set methods, or augmented Lagrangian methods. For unconstrained cases, since the subproblem reduces to solving a system of linear equations, we may use conjugate gradient methods or direct methods, including Gaussian elimination.

6.2 Proposed method with an accelerated projected gradient

A practical example of Algorithm 2 is presented in Algorithm 3. This algorithm employs the accelerated projected gradient (APG) method [45, Algorithm 1] with the adaptive restarting technique [54, Sect. 3.2] to solve the subproblems and adopts the additional parameters mentioned in Remark 1. Since the solution x obtained in Line 19 of Algorithm 3 satisfies Condition 1, this algorithm enjoys the \(O(\varepsilon ^{-2})\) iteration complexity bound. In addition, it achieves the \(O(\varepsilon ^{-2})\) overall complexity bound if \(T < \infty\), as in Corollary 2, and local quadratic convergence if \(T = \infty\). Algorithm 3 will be used for the numerical experiments in the next section.

7 Numerical experiments

We examine the practical performance of the proposed method. We implemented all methods in Python with SciPy [58] and JAX [9] and executed them on a computer with an Apple M1 chip (8 cores, 3.2 GHz) and 16 GB RAM.

7.1 Problem setting

We consider three types of instances: (i) compressed sensing with quadratic measurement, (ii) nonnegative matrix factorization with missing values, and (iii) autoencoder with MNIST dataset.

7.1.1 Compressed sensing with quadratic measurement

Given \(A_1,\dots ,A_n \in {\mathbb {R}}^{r \times d}\), \(b_1,\dots ,b_n \in {\mathbb {R}}^d\), and \(c_1,\dots ,c_n, R \in {\mathbb {R}}\), we consider the following problem:

$$\begin{aligned} \min _{x \in {\mathbb {R}}^d}\ \sum _{i=1}^n \Big ({ \frac{1}{2r} \Vert A_i x\Vert ^2 + \langle b_i, x \rangle - c_i }\Big )^2 \quad \mathrm {subject\ to}\quad \Vert x\Vert _1 \le R, \end{aligned}$$
(46)

where \(\Vert \cdot \Vert _1\) denotes the \(\ell _1\)-norm. Problem (46) formulates the situation where a sparse vector \(x^* \in {\mathbb {R}}^d\) is recovered from a small number (i.e., \(n < d\)) of quadratic observations, \(\frac{1}{2r} \Vert A_i x^*\Vert ^2 + \langle b_i, x^* \rangle\) for \(i = 1, \dots , n\). Such a problem arises in the context of compressed sensing [8, 44] and phase retrieval [11, 63]. Problem (46) can be transformed into the form of problem (1).

Generating instances: First, we generate the optimal solution \(x^* \in {\mathbb {R}}^d\) with only \(d_{\textrm{nnz}} \,(< d)\) nonzero entries. The indices of the nonzero entries are chosen uniformly at random, and the values of those entries are drawn independently from the uniform distribution on \([-x_{\textrm{max}}, x_{\textrm{max}}]\). Each entry of the \(A_i\)’s and \(b_i\)’s is drawn independently from the standard normal distribution \({\mathcal {N}}(0, 1)\). Then, we set \(R = \Vert x^*\Vert _1\) and \(c_i = \frac{1}{2r} \Vert A_i x^*\Vert ^2 + \langle b_i, x^* \rangle\) for all i. We fix \(d = 200\), \(r = 10\), and \(n = 50\), and set \(d_{\textrm{nnz}} \in \{ 5, 10, 20 \}\) and \(x_{\textrm{max}} \in \{ 0.1, 1 \}\). We set the starting point for each algorithm as \(x_0 = {\textbf{0}}\).

7.1.2 Nonnegative matrix factorization with missing values

Given \(A \in {\mathbb {R}}^{m \times n}\) and \(H \in \{ 0, 1 \}^{m \times n}\), we consider the following problem:

$$\begin{aligned} \min _{X \in {\mathbb {R}}^{m \times r},\ Y \in {\mathbb {R}}^{n \times r}} \Vert H \odot (X Y^\top - A) \Vert _{\textrm{F}}^2 \quad \mathrm {subject\ to}\quad X \ge O,\quad Y \ge O, \end{aligned}$$
(47)

where \(\odot\) denotes the elementwise product, \(X \ge O\) and \(Y \ge O\) denote elementwise inequalities, and \(\Vert \cdot \Vert _{\textrm{F}}\) denotes the Frobenius norm. Problem (47) formulates the situation where a data matrix A with some missing entries is approximated by the product \(X Y^\top\) of two nonnegative matrices. Such a problem is called nonnegative matrix factorization (NMF) with missing values and is widely used for nonnegative data analysis, especially for collaborative filtering [46, 65]. For more information on NMF, see [7, 59] and the references therein. Problem (47) can also be written as problem (1).

Generating instances To generate A and H, we introduce two parameters: \(\gamma \ge 1\) and \(0 < p \le 1\). The parameters \(\gamma\) and p control the condition number of A and the number of 1's in H, respectively. Let \(l :=\min \{ m, n \}\). First, a matrix \({{\tilde{A}}} \in {\mathbb {R}}^{m \times n}\) is generated by \({{\tilde{A}}} = U D V^\top\), and then the matrix A is obtained by normalizing \({{\tilde{A}}} = ({{\tilde{a}}}_{ij})_{i,j}\) as \(A = {{\tilde{A}}} / \max _{i, j} {{\tilde{a}}}_{ij}\). Here, each entry of \(U \in {\mathbb {R}}^{m \times l}\) and \(V \in {\mathbb {R}}^{n \times l}\) is drawn independently from the uniform distribution on [0, 1], and \(D = {{\,\textrm{diag}\,}}(\gamma ^{0}, \gamma ^{-1/l}, \gamma ^{-2/l},\dots , \gamma ^{-(l-1)/l} ) \in {\mathbb {R}}^{l \times l}\) is a diagonal matrix. H is a random matrix whose entries are drawn independently from the Bernoulli distribution with parameter p, i.e., each entry of H is 1 with probability p. We fix \(m = n = 50\) and \(\gamma = 10^5\), and set \(r \in \{ 10, 40 \}\) and \(p \in \{ 0.02, 0.1, 0.5 \}\). Since \((X, Y) = (O, O)\) is a stationary point of problem (47), we set the starting point to random matrices whose entries independently follow the uniform distribution on \([0, 10^{-3}]\).
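The construction of A, H, and the starting point can be sketched as follows (again, the function name is ours).

```python
import numpy as np

def generate_nmf_instance(m=50, n=50, r=10, gamma=1e5, p=0.1, seed=0):
    """Generate A, H, and a starting point for problem (47) as described above."""
    rng = np.random.default_rng(seed)
    l = min(m, n)
    U = rng.uniform(size=(m, l))
    V = rng.uniform(size=(n, l))
    D = np.diag(gamma ** (-np.arange(l) / l))    # diag(gamma^0, ..., gamma^{-(l-1)/l})
    A_tilde = U @ D @ V.T
    A = A_tilde / A_tilde.max()                  # normalize by the largest entry
    H = (rng.uniform(size=(m, n)) < p).astype(float)   # Bernoulli(p) mask
    X0 = rng.uniform(0.0, 1e-3, size=(m, r))     # avoid the stationary point (O, O)
    Y0 = rng.uniform(0.0, 1e-3, size=(n, r))
    return A, H, X0, Y0
```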

7.1.3 Autoencoder with MNIST dataset

The third instance is highly nonlinear and large-scale. In machine learning, autoencoders (see, e.g., [37, Section 14]) are a popular model for compressing real-world data, represented as high-dimensional vectors, into low-dimensional vectors. Given p-dimensional data \(a_1, \dots , a_N \in {\mathbb {R}}^p\), an autoencoder learns an encoder \(\phi _x^{\textrm{enc}}: {\mathbb {R}}^p \rightarrow {\mathbb {R}}^q\) and a decoder \(\phi _y^{\textrm{dec}}: {\mathbb {R}}^q \rightarrow {\mathbb {R}}^p\), where \(q < p\). Here, x and y are parameters to be learned by solving the following optimization problem:

$$\begin{aligned} \min _{x, y}\ \sum _{i=1}^N \Vert a_i - \phi _y^{\textrm{dec}}(\phi _x^{\textrm{enc}}(a_i))\Vert ^2. \end{aligned}$$
(48)

As we see from the optimization problem above, the autoencoder aims to extract latent features that can be used to reconstruct the original data.

For this experiment, we use the MNIST hand-written digit dataset. Each datum is a \(28 \times 28\) pixel grayscale image, represented as a vector \(a_i \in [0, 1]^p\) with \(p = 28 \times 28 = 784\). The dataset contains 60,000 training images, of which \(N = 1000\) were randomly chosen for use. We set \(q = 16\); our model encodes 784-dimensional data into 16 dimensions. Both encoder and decoder are two-layer neural networks with a hidden layer of size 64 and logistic sigmoid activation functions. Specifically, the encoder \(\phi _x^{\textrm{enc}}\) is written as

$$\begin{aligned} \phi _x^{\textrm{enc}}(a) = \phi _x^2 \circ \phi _x^1 (a), \quad \text {where}\quad \phi _x^i(a)&:=S( W_i a + b_i ). \end{aligned}$$

Here, S is the elementwise logistic sigmoid function, and \(W_1 \in {\mathbb {R}}^{64 \times 784}\), \(b_1 \in {\mathbb {R}}^{64}\), \(W_2 \in {\mathbb {R}}^{16 \times 64}\), and \(b_2 \in {\mathbb {R}}^{16}\) are the parameters of the network; \(x = ((W_i, b_i))_{i=1}^2\). The decoder \(\phi _y^{\textrm{dec}}\) is formulated analogously. When we rewrite problem (48) in the form of (1), the dimensions of the function \(F: {\mathbb {R}}^d \rightarrow {\mathbb {R}}^n\) are \(d = 103{,}328\) and \(n = Np = 784{,}000\).
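Since the experiments use JAX, the residual map F of problem (48) can be sketched as below; the helper names are ours, and the decoder is built from the same two-layer block as the encoder.

```python
import jax
import jax.numpy as jnp

def two_layer(params, a):
    """Two-layer block S(W2 S(W1 a + b1) + b2) with logistic sigmoid S;
    used for both the encoder and the decoder."""
    for W, b in params:  # params = [(W1, b1), (W2, b2)]
        a = jax.nn.sigmoid(W @ a + b)
    return a

def residual(enc_params, dec_params, data):
    """Stacked residual vector F for problem (48); data has shape (N, p)."""
    recon = jax.vmap(lambda a: two_layer(dec_params, two_layer(enc_params, a)))(data)
    return jnp.ravel(data - recon)
```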

7.2 Algorithms and implementation

We compare the proposed method with six existing methods; the details are given below.

Proposed (Algorithm 3) and Proposed-NA (Algorithm 1) methods To assess the effect of acceleration on the subproblems, we implemented both Algorithms 1 and 3; Algorithm 3 is, of course, expected to be faster. In Line 10 of Algorithm 1 and Line 18 of Algorithm 3, we must check whether \(x_{k,t}\) is a \((c\uplambda \Vert F_k\Vert )\)-stationary point, which is not easy to verify directly. We therefore replace the criterion with one based on the gradient mapping, i.e., we check whether \(\eta \Vert x_{k,t} - y\Vert \le c\uplambda \Vert F_k\Vert\). The input parameters of Algorithms 1 and 3 are set to \(M_0 = \eta _0 = 1\), \(\alpha = \alpha _{\textrm{in}} = 2\), \(\beta = \beta _{\textrm{in}} = 0.9\), \(M_{\min } = 10^{-10}\), \(T = 100\), and \(c = 1\).
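In code, the replaced termination test might look like the following sketch, where grad_m is the gradient of the model \(m^k_\uplambda\), proj is the projection onto \({\mathcal {C}}\), and y is the projected gradient step with step size \(1/\eta\) (all names are ours).

```python
import numpy as np

def stationarity_test(x, grad_m, proj, eta, c, lam, norm_Fk):
    """Surrogate for the (c*lam*||F_k||)-stationarity check: take one projected
    gradient step y with step size 1/eta and test the gradient-mapping norm."""
    y = proj(x - grad_m(x) / eta)
    return eta * np.linalg.norm(x - y) <= c * lam * norm_Fk
```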

Fan method [26, Algorithm 2.1] and KYF method [40, Algorithm 2.12] The Fan and KYF methods are constrained LM methods with a global convergence guarantee. To solve subproblem (3), an APG method is used, as in Algorithm 3, for a fair comparison. The only difference from the APG in Algorithm 3 lies in the stopping criterion: the condition "\(x_{k,t}\) is a \((c\uplambda \Vert F_k\Vert )\)-stationary point" in Line 18 of Algorithm 3 is replaced with \(\eta \Vert x_{k,t} - y\Vert \le 10^{-9}\).Footnote 4 The input parameters in [26, 40] are set to \(\mu = 10^{-4}\), \(\beta = 0.9\), \(\sigma = 10^{-4}\), \(\gamma = 0.99995\), and \(\delta \in \{ 1, 2 \}\), following the recommendations of [26, 40].Footnote 5

Facchinei method [22, Algorithm 3] This is a constrained LM method that allows subproblems to be solved inexactly. We solve the subproblems in almost the same way as the Fan and KYF methods. The input parameters in [22] are set to \(\gamma _0 = 1\) and \(S = 2\).

GGO method [36, Algorithm G-LMA-IP] This is an LM-type method that requires the solution of a linear system at each iteration. The main advantage of this algorithm is that it does not require exact projection and can be applied to problems with a complex feasible region; still, it is reported to perform well even when the projection is easy to compute exactly [36]. The linear systems are solved via QR decomposition (scipy.linalg.qr [58]), and the input parameters are set to \(M \in \{ 1, 15 \}\), \(\eta _1 = 10^{-4}\), \(\eta _2 = 10^{-2}\), \(\eta _3 = 10^{10}\), \(\gamma = 10^{-3}\), \(\beta = 1/2\), and \(\theta _k = 0\), following [36].Footnote 6

Projected gradient (PG) method The PG method is one of the most standard first-order methods for problem (1). The step size is chosen adaptively, in a manner similar to the APG in Algorithm 3, with \(\eta _0 = 1\), \(\alpha _{\textrm{in}} = 2\), and \(\beta _{\textrm{in}} = 0.9\).
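A minimal sketch of such an adaptive PG loop is given below; the backtracking scheme is our guess at a standard curvature-estimation rule matching the parameters above, not the exact implementation.

```python
import numpy as np

def projected_gradient(f, grad, proj, x0, eta0=1.0, alpha_in=2.0, beta_in=0.9,
                       max_iter=1000):
    """Projected gradient with a backtracking estimate eta of the local
    curvature (step size 1/eta)."""
    x, eta = x0, eta0
    for _ in range(max_iter):
        fx, g = f(x), grad(x)
        while True:
            y = proj(x - g / eta)
            d = y - x
            # descent test against the quadratic upper bound with modulus eta
            if f(y) <= fx + g @ d + 0.5 * eta * (d @ d):
                break
            eta *= alpha_in          # estimate too small: backtrack
        x = y
        eta *= beta_in               # let the estimate shrink again
    return x
```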

Trust-region reflective (TRF) method This is an interior trust-region method for box-constrained nonlinear optimization. It was proposed in [10] and is implemented in SciPy [58] with several improvements. For the TRF method, we call scipy.optimize.least_squares [58] with the option gtol=1e-5 to avoid long execution times caused by searching for an overly precise solution.

Other information As mentioned in Sect. 1.2, there are two ways to handle Jacobian matrices: explicitly computing \(J_k :=J(x_k)\) or using Jacobian-vector products \(J_k u\) and \(J_k^\top v\). In our experiments, the latter implementation outperformed the former, so we adopted the latter whenever possible (i.e., for Proposed, Proposed-NA, Fan, KYF, Facchinei, and PG).Footnote 7 We note that GGO relies on a QR decomposition, which cannot readily be implemented using only Jacobian-vector products.
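With JAX, matrix-free products with \(J_k\) and \(J_k^\top\) are available via forward- and reverse-mode differentiation; a minimal sketch (the wrapper name is ours):

```python
import jax

def matrix_free_jacobian(F, x):
    """Return F(x) and matrix-free maps u -> J(x) u (forward mode) and
    v -> J(x)^T v (reverse mode), without forming J(x)."""
    Fx, jvp = jax.linearize(F, x)   # jvp(u) = J(x) @ u
    _, vjp = jax.vjp(F, x)          # vjp(v) = (J(x).T @ v,)
    return Fx, jvp, lambda v: vjp(v)[0]
```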

For projection onto the feasible region of problem (46), we employ [18, Algorithm 1], whose time complexity is \(O(d \log d)\).
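For reference, a standard \(O(d \log d)\) sort-based projection onto the \(\ell _1\)-ball, in the spirit of, though not necessarily identical to, [18, Algorithm 1]:

```python
import numpy as np

def project_l1_ball(x, R):
    """Euclidean projection onto {z : ||z||_1 <= R} in O(d log d) time."""
    if np.abs(x).sum() <= R:
        return x.copy()
    u = np.sort(np.abs(x))[::-1]                 # sorted magnitudes, descending
    css = np.cumsum(u)
    j = np.arange(1, x.size + 1)
    rho = np.nonzero(u > (css - R) / j)[0][-1]   # last index kept active
    theta = (css[rho] - R) / (rho + 1.0)         # soft-threshold level
    return np.sign(x) * np.maximum(np.abs(x) - theta, 0.0)
```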

7.3 Results

7.3.1 Compressed sensing and NMF

Figures 2 and 3 show the results for compressed sensing in (46) and NMF in (47), respectively. Each figure consists of six subfigures, each containing two plots: the upper plot shows the worst case among ten randomly generated instances, and the lower plot shows the best case.Footnote 8

Fig. 2 Results of compressed sensing (problem (46))

Fig. 3 Results of NMF (problem (47))

Tables 2 and 3 provide more detailed information. For the tables, each algorithm is stopped when either of the following conditions is fulfilled: (i) the algorithm finds a point at which the norm of the gradient mapping is less than \(10^{-5}\); (ii) the execution time exceeds 10 seconds. The "Success" column indicates the percentage of instances (out of 10) that satisfied condition (i). The remaining columns show the averages of the following values: the objective function value reached, the gradient-mapping norm, the execution time, the number of iterations, and the number of basic operations; JVP stands for Jacobian-vector products.

Table 2 Results of compressed sensing (problem (46))
Table 3 Results of NMF (problem (47))

A remarkable feature of our method is its stability, in addition to its fast convergence. For example, while the Fan and KYF methods perform well in most cases, they sometimes converge slowly, as shown in Table 2(e) and (f). The proposed method shows the best or comparable performance in all settings compared with the other methods, which suggests that it is stable without careful parameter tuning.

As seen from Tables 3(d)–(f), the Facchinei and GGO methods do not work well in some cases. For the Facchinei method, the reason is presumably that it does not guarantee global convergence. For GGO, the tables show that the number of iterations completed within the time limit is small (3 or 4); this is because the method explicitly computes a Jacobian and solves a linear system at each iteration, resulting in a high per-iteration cost for large-scale problems. In contrast, our method guarantees global convergence and repeats relatively low-cost iterations without Jacobian computation, which also leads to stable performance.

Figure 4 shows the results of the TRF method. Since this method handles only box constraints, results are presented only for problem (47). Each marker corresponds to one instance and represents the elapsed time and the objective value obtained.Footnote 9 TRF takes more time to converge than the proposed method; in particular, comparing Table 3(f) and Fig. 4f, we see that its elapsed time is about 1000 times longer than ours. This result may be due to the difference in how TRF and our method handle the constraint: when the optimal solution or a stationary point lies on the boundary of the constraint set, our method can reach the boundary in a finite number of iterations, whereas TRF cannot, as it is an interior-point method.

Fig. 4 Results of NMF (problem (47)) by the TRF method

7.3.2 Autoencoder with MNIST

Figure 5 shows the results for problem (48). The results of the GGO method are omitted because it explicitly computes the Jacobian, which was infeasible in this large-scale setting with \(d = 103{,}328\) and \(n = 784{,}000\). Among the existing methods, the PG method converges the fastest, but the proposed method converges about five times faster than PG. This result suggests that our method is also effective for large-scale and highly nonlinear problems.

Fig. 5 Results of autoencoder with MNIST (problem (48))

8 Conclusion and future work

We proposed an LM method for solving constrained nonlinear least-squares problems. Our method finds an \(\varepsilon\)-stationary point of (possibly) nonzero-residual problems within \(O(\varepsilon ^{-2})\) computation and also achieves local quadratic convergence for zero-residual problems. Few LM methods possess both an overall complexity bound and local quadratic convergence, even for unconstrained problems; indeed, our investigation found only one such algorithm [6]. The key to our analysis is a simple update rule for \((\uplambda _k)\) together with the majorization lemma (Lemma 1).

The convergence analysis presented in this paper may extend to other problem settings. For example, it would be interesting to derive an overall complexity bound for LM methods with a nonsmooth function F. It would also be interesting to integrate stochastic techniques into our LM method for problems in which F is of huge size. Finally, local convergence analysis for nonzero-residual problems has been progressing in recent years [2, 6, 39], and it is important to investigate our LM method further along this line.