1 Introduction

In this study, we consider the constrained nonlinear least-squares problem:

$$\begin{aligned} \min _{x \in {\mathbb {R}}^d} \ f(x) :=\frac{1}{2}\Vert F(x)\Vert ^2 \quad \mathrm {subject\ to}\quad x \in {\mathcal {C}}, \end{aligned}$$
(1)

where \(\Vert \cdot \Vert\) denotes the \(\ell _2\)-norm, \(F: {\mathbb {R}}^d \rightarrow {\mathbb {R}}^n\) is a continuously differentiable function, and \({\mathcal {C}} \subseteq {\mathbb {R}}^d\) is a closed convex set. If there exists a point \(x \in {\mathcal {C}}\) such that \(F(x) = {\textbf{0}}\), the problem is said to be zero-residual and reduces to the constrained nonlinear equation:

$$\begin{aligned} \text {find} \quad x \in {\mathcal {C}}\quad \text {such that}\quad F(x) = {\textbf{0}}. \end{aligned}$$

Such problems cover a wide range of applications, including chemical equilibrium systems [48], economic equilibrium problems [20], power flow equations [61], nonnegative matrix factorization [7, 42], phase retrieval [11, 63], nonlinear compressed sensing [8], and learning constrained neural networks [17].

Levenberg–Marquardt (LM) methods [43, 47] are efficient iterative algorithms for solving problem (1); they were originally developed for unconstrained cases (i.e., \({\mathcal {C}} = {\mathbb {R}}^d\)) and later extended to constrained cases in [40]. Given a current point \(x_k \in {\mathcal {C}}\), an LM method defines a model function \(m^k_\uplambda : {\mathbb {R}}^d \rightarrow {\mathbb {R}}\) with a damping parameter \(\uplambda > 0\):

$$\begin{aligned} m^k_\uplambda (x) :=\frac{1}{2}\Vert F_k + J_k(x-x_k)\Vert ^2 + \frac{\uplambda }{2}\Vert x-x_k\Vert ^2, \end{aligned}$$
(2)

where \(F_k :=F(x_k) \in {\mathbb {R}}^n\) and \(J_k :=J(x_k) \in {\mathbb {R}}^{n \times d}\) with \(J: {\mathbb {R}}^d \rightarrow {\mathbb {R}}^{n \times d}\) being the Jacobian matrix function of F. The next point \(x_{k+1} \in {\mathcal {C}}\) is set to an exact or approximate solution to the convex subproblem:

$$\begin{aligned} \min _{x \in {\mathbb {R}}^d}\ m^k_\uplambda (x)\quad \mathrm {subject\ to}\quad x \in {\mathcal {C}} \end{aligned}$$
(3)

for some \(\uplambda = \uplambda _k\). Various versions of this method have been proposed, and their theoretical and practical performances largely depend on how the damping parameter \(\uplambda _k\) is updated.

1.1 Our contribution

We propose an LM method with a new rule for updating \(\uplambda _k\). Our method is based on majorization-minimization (MM) methods, which successively minimize a majorization, i.e., an upper bound on the objective function. The key to our method is the fact that the model \(m^k_\uplambda\) defined in (2) is a majorization of the objective f under certain standard assumptions. This MM perspective enables us to create an LM method with desirable properties, including global and local convergence guarantees. Although there exist several MM methods for problem (1) and related problems [3, 4, 38, 50, 53], as far as we know, no studies have shown that the model in (2) is a majorization of f. Another feature of our LM method is the way it generates an approximate solution to subproblem (3): a single iteration of a projected gradient method applied to (3) suffices to derive the iteration complexity of our LM method, which in turn leads to an overall complexity bound.

Our contributions are summarized as follows:

  1. (i)

    A new MM-based LM method: We prove that the LM model defined in (2) is a majorization of f if the damping parameter \(\uplambda\) is sufficiently large. See Lemma 1 for a precise statement. This result provides us with a new update rule for \(\uplambda\), bringing about a new LM method for solving problem (1).

  2. (ii)

    Iteration and overall complexity for finding a stationary point: The iteration complexity of our LM method for finding an \(\varepsilon\)-stationary point (see Definition 1) is proved to be \(O(\varepsilon ^{-2})\) under mild assumptions on the Jacobian. Because the computational complexity per iteration of our method does not depend on \(\varepsilon\), the overall complexity is also evaluated as \(O(\varepsilon ^{-2})\) through

    $$\begin{aligned} (\text {Overall complexity}) = (\text {Iteration complexity}) \times (\text {Complexity per iteration}). \end{aligned}$$

    See Corollaries 1 and 2 for a precise statement.

  3. (iii)

    Local quadratic convergence: For zero-residual problems, assume that a starting point \(x_0 \in {\mathcal {C}}\) is sufficiently close to an optimal solution, and assume standard conditions, including a local error bound condition. Then, if the subproblems are solved with sufficient accuracy, a solution sequence \((x_k)\) generated by our method converges quadratically to an optimal solution. See Theorem 2 for a precise statement.

  4. (iv)

    Improved convergence results even for unconstrained problems: Our method achieves both the \(O(\varepsilon ^{-2})\) iteration complexity bound and local quadratic convergence. An LM method having such global and local convergence results is new for unconstrained and constrained problems, as shown in Table 1.

Numerical results show that our method converges faster and is more robust than existing LM-type methods [22, 26, 36, 40], a projected gradient method, and a trust-region reflective method [10, 58].

Table 1 Comparison of methods for problem (1)

1.2 Oracle model for overall complexity bounds

To evaluate the overall complexity of LM methods, we count the number of basic operations (evaluation of F(x), Jacobian-vector multiplications J(x)u and \(J(x)^\top v\), and projection onto \({\mathcal {C}}\)) required to find an \(\varepsilon\)-stationary point, following [21, Sect. 6]. The important point is that we do not assume an evaluation of the full matrix \(J_k :=J(x_k)\); we access the Jacobian only through the products \(J_k u\) and \(J_k^\top v\) to solve subproblem (3). Computing the vectors \(J_k u\) and \(J_k^\top v\) for given \(u \in {\mathbb {R}}^d\) and \(v \in {\mathbb {R}}^n\) is much cheaper than evaluating the matrix \(J_k\). Avoiding the computation of the \(n \times d\) matrix \(J_k\) makes algorithms practical for large-scale problems where n and d amount to thousands or millions. We note that some existing LM-type methods [3, 4, 12–16, 36] compute the Jacobian explicitly.

1.3 Paper organization

In Sect. 2, we review LM methods and related algorithms for problem (1). In Sect. 3, a key lemma is presented and the LM method (Algorithm 1) is derived based on the lemma. Sections 4 and 5 show theoretical results for Algorithm 1: iteration complexity, overall complexity, and local quadratic convergence. In Sect. 6, we generalize Algorithm 1 and present a more practical variant of Algorithm 1. This variant also achieves the theoretical guarantees given for Algorithm 1 in Sects. 4 and 5. Section 7 provides some numerical results and Sect. 8 concludes the paper.

1.4 Notation

Let \({\mathbb {R}}^d\) denote the d-dimensional Euclidean space equipped with the \(\ell _2\)-norm \(\Vert \cdot \Vert\) and the standard inner product \(\langle \cdot , \cdot \rangle\). For a matrix \(A \in {\mathbb {R}}^{m \times n}\), let \(\Vert A\Vert\) denote its spectral norm, i.e., its largest singular value. For \(a \in {\mathbb {R}}\), let \(\lceil a\rceil\) denote the least integer greater than or equal to a.

2 Comparison with related works

We review existing methods for problem (1) and compare them with our work.

2.1 General methods

Algorithms for general nonconvex optimization problems, not just for least-squares problems, also solve problem (1). For example, the projected gradient method has an overall complexity bound of \(O(\varepsilon ^{-2})\); our LM method enjoys local quadratic convergence in addition to that bound, which seems difficult to achieve with general first-order methods. Figure 1 illustrates that our LM successfully minimizes the Rosenbrock function, a valley-like function that is notoriously difficult to minimize numerically. Although quadratic convergence is proved only locally around an optimal solution, in practice, the LM method may perform considerably better than general first-order methods, even when started far from the optimum.

Fig. 1 Minimization of the Rosenbrock function [56], \(f(x, y) = (x - 1)^2 + 100 (y - x^2)^2\). Both the gradient descent (GD) and our LM start from \((-1, 1)\) and converge to the optimal solution, (1, 1). One marker corresponds to one iteration, and the GD and LM are truncated after 1000 and 20 iterations, respectively

Some methods, such as the Newton method, achieve local quadratic convergence using second-order or higher-order derivatives of f; our LM achieves it without second-order derivatives. Besides the fact that our LM does not require a computationally demanding Hessian matrix, it has another advantage: subproblem (3) is very tractable. Whereas our subproblem is smooth and strongly convex, those in second- or higher-order methods are nonconvex in general. The matter becomes more severe in the presence of constraints because the subproblems may be NP-hard, as pointed out in [15].

2.2 Specialized methods for least squares

Several methods, including the LM method, utilize the least-squares structure of problem (1). Focusing on those algorithms without second-order derivatives, we review them from three points of view: (i) subproblem, (ii) complexity for finding a stationary point, and (iii) local superlinear convergence. Most of the methods discussed in this section are summarized in Table 1. The table shows the following:

  • Our method can achieve an overall computational complexity bound, \(O(\varepsilon ^{-2}) \times O(1) = O(\varepsilon ^{-2})\), for finding an \(\varepsilon\)-stationary point for constrained problems.

  • To the best of our knowledge, this is the first LM that achieves such a complexity bound with local quadratic convergence, even for unconstrained problems.

2.2.1 Subproblems

Most algorithms for the nonlinear least-squares problem (1) generate a solution sequence \((x_k)_{k \in {\mathbb {N}}}\) by repeatedly solving convex subproblems, and we focus on such algorithms. There are three popular subproblems, in addition to the LM subproblem (3):

$$\begin{aligned} \min _{x \in {\mathbb {R}}^d}\ \Vert F_k + J_k(x - x_k)\Vert + \frac{\uplambda }{2} \Vert x - x_k\Vert ^2 \quad \mathrm {subject\ to}\quad x \in {\mathcal {C}}, \end{aligned}$$
(4)
$$\begin{aligned} \min _{x \in {\mathbb {R}}^d}\ \Vert F_k + J_k(x - x_k)\Vert ^2 + \frac{\uplambda }{2} \Vert x - x_k\Vert ^3 \quad \mathrm {subject\ to}\quad x \in {\mathcal {C}}, \end{aligned}$$
(5)
$$\begin{aligned} \min _{x \in {\mathbb {R}}^d}\ \Vert F_k + J_k(x - x_k)\Vert ^2 \quad \mathrm {subject\ to}\quad x \in {\mathcal {C}}, \ \Vert x - x_k\Vert \le \Delta , \end{aligned}$$
(6)

where \(\uplambda , \Delta > 0\) are properly defined constants. Methods using subproblems (4), (5), and (6) have been proposed and analyzed in [3, 4, 16, 50], [3], and [12, 24, 32, 64], respectively. Other works [13–15] propose methods with a more general version of (5). These four subproblems (3)–(6) are closely related in theory; one subproblem becomes equivalent to the others with specific choices of the parameters \(\uplambda\) and \(\Delta\).

In practice, these four subproblems are quite different, and the LM subproblem (3) is the most tractable one because the objective function \(m^k_\uplambda\) is smooth and strongly convex. Thanks to smoothness and strong convexity, we can efficiently solve subproblem (3) with linearly convergent methods such as the projected gradient method. Note that the objective function of (4) is nonsmooth, and (5) and (6) are not necessarily strongly convex. Although some algorithms for subproblems (4)–(6) without constraints have been proposed [4, 13, 64], efficient algorithms are nontrivial in the presence of constraints. Hence, the LM method is more practical than methods using other subproblems.

2.2.2 Complexity for finding a stationary point

For unconstrained zero-residual problems, Nesterov [50] proposed a method with subproblem (4) and showed that the method finds an \(\varepsilon\)-stationary point after \(O(\varepsilon ^{-2})\) iterations under a strong assumption (see footnote 2 of Table 1 for details). After that, for unconstrained (possibly) nonzero-residual problems, several methods with subproblems (3), (4), and (6) have been proposed [12, 57, 66], and they achieve the same iteration complexity bound under weaker assumptions such as the Lipschitz continuity of J or \(\nabla f\). The method of [12] has been extended to constrained problems [16]. These methods [12, 16, 50, 57, 66] have the iteration complexity bound, but their computational complexity per iteration, i.e., the complexity of solving a subproblem, is unclear.

The key to bounding the complexity per iteration is that the subproblems need not be solved very accurately to derive the iteration complexity bound. Several algorithms have been proposed based on this fact for both unconstrained [5, 6, 13, 14] and constrained [15] problems. They use a point that decreases the model function value sufficiently compared with the value at the current iterate \(x_k\). Such a point can be computed with an \(\varepsilon\)-independent number of basic operations: evaluation of F(x), Jacobian-vector multiplications J(x)u and \(J(x)^\top v\), and projection onto \({\mathcal {C}}\). Thus, the methods in [5, 13–15] achieve the overall complexity \(O(\varepsilon ^{-2}) \times O(1) = O(\varepsilon ^{-2})\).

Our LM method also finds an \(\varepsilon\)-stationary point within \(O(\varepsilon ^{-2})\) iterations, and the complexity per iteration is O(1) when the subproblems are solved approximately as in [5, 6, 13–15]. Thus, the overall complexity amounts to \(O(\varepsilon ^{-2})\), the same as in [5, 13–15].

2.2.3 Local superlinear convergence

For unconstrained zero-residual problems, many methods with subproblems (3)–(6) have achieved local quadratic convergence under a local error bound condition [3, 19, 23, 29–35, 62]. These local convergence results have been extended to constrained problems [1, 22, 26, 40]. Some methods [24, 64] have local convergence of an arbitrary order less than 2. Other methods [25, 27, 28] achieve local (nearly) cubic convergence by solving two subproblems in one iteration. We note that the local convergence analyses in [4, 50] assume that the solution sequence \((x_k)\) is in the neighborhood of a solution \(x^*\) such that \(F(x^*) = {\textbf{0}}\) and \({{\,\textrm{rank}\,}}J(x^*) = n\), which is a stronger assumption than the local error bound.

Among these methods, some [1, 3, 4, 19, 22, 29, 31, 35] use an approximate solution to subproblems while preserving local quadratic convergence. The approximate solution is more accurate than that used to derive the global complexity mentioned in the previous section. We also use the same kind of approximate solution as [1, 19, 29, 31, 35] to prove local quadratic convergence. See Condition 2 in Sect. 6 for the details of the approximate solution.

3 Majorization lemma and proposed method

Here, we will prove a majorization lemma that shows that the LM model \(m^k_\uplambda\) defined in (2) is an upper bound on the objective function. In view of this lemma, we can characterize our LM method as a majorization-minimization (MM) method.

For \(a, b \in {\mathbb {R}}^d\), we denote the sublevel set and the line segment by

$$\begin{aligned} {\mathcal {S}}(a)&:=\{ x \in {\mathbb {R}}^d \,|\, f(x) \le f(a) \}, \end{aligned}$$
(7)
$$\begin{aligned} {\mathcal {L}}(a, b)&:=\{ (1-\theta ) a + \theta b \in {\mathbb {R}}^d \,|\, \theta \in [0, 1] \}. \end{aligned}$$
(8)

3.1 LM method as majorization-minimization

MM is a framework for nonconvex optimization that successively performs (approximate) minimization of an upper bound on the objective function. The following lemma, a majorization lemma, shows that the model \(m^k_\uplambda\) defined in (2) is an upper bound on the objective f over some region under certain assumptions.

Lemma 1

Let \({\mathcal {X}} \subseteq {\mathbb {R}}^d\) be any closed convex set, and suppose \(x_k \in {\mathcal {X}}\). Moreover, assume that for some constant \(L > 0\),

$$\begin{aligned} \Vert J(y) - J(x)\Vert \le L\Vert y - x\Vert ,\quad \forall x,y \in {\mathcal {X}} \ \ \text {s.t.} \ \ {\mathcal {L}}(x, y) \subseteq {\mathcal {S}}(x_k). \end{aligned}$$
(9)

Then for any \(\uplambda > 0\) and \(x \in {\mathcal {X}}\) such that

$$\begin{aligned}&\uplambda \ge L\Vert F_k\Vert \quad \text {and} \end{aligned}$$
(10)
$$\begin{aligned}&m^k_\uplambda (x) \le m^k_\uplambda (x_k), \end{aligned}$$
(11)

the following bound holds:

$$\begin{aligned} f(x) \le m^k_\uplambda (x). \end{aligned}$$
(12)

The proof is given in Sect. A.2.

The assumption in (9) is the Lipschitz continuity of J and is analogous to the Lipschitz continuity of \(\nabla f\), which is often used in the analysis of first-order methods. Equation (10) requires a sufficiently large damping parameter, which corresponds to a sufficiently small step-size for first-order methods. Equation (11) requires the point \(x \in {\mathcal {X}}\) to be a solution that is at least as good as the current point \(x_k \in {\mathcal {X}}\) in terms of the model function value.

3.2 Proposed LM method

Algorithm 1 (pseudocode)

Based on Lemma 1, we propose an LM method that solves problem (1). The proposed LM is formally described in Algorithm 1 and is outlined below. First, in Line 1, three parameters are initialized: an estimate M of the Lipschitz constant L of J, a parameter \(\eta\) used for solving subproblems, and the iteration counter k. Line 3 sets \(\uplambda\) using M as an estimate of L based on (10). Then, the inner loop of Lines 4–10 solves subproblem (3) approximately by a projected gradient method. The details of the inner loop will be described later. Lines 12–15 check whether the current \(\uplambda\) and the computed solution x are acceptable. If \(\uplambda\) and x satisfy (12), they are accepted as \(\uplambda _k\) and \(x_{k+1}\). Otherwise, the current value of M is judged to be too small an estimate of L in light of Lemma 1 and is increased. We refer to the former case as a “successful” iteration and the latter as an “unsuccessful” iteration. Note that k counts not all outer iterations but only the successful ones. As shown later in Lemma 5(ii) and Theorem 2(i), the number of unsuccessful iterations is upper-bounded by a constant under certain assumptions.

Inner loop for subproblem: In the inner loop of Lines 4–10, subproblem (3) is solved approximately by the projected gradient method. Here, the operator \({{\,\textrm{proj}\,}}_{{\mathcal {C}}}\) in Line 5 is the projection operator defined by

$$\begin{aligned} {{\,\textrm{proj}\,}}_{{\mathcal {C}}}(x) :=\mathop {\textrm{argmin}}\limits _{y \in {\mathcal {C}}}\Vert y - x\Vert . \end{aligned}$$

The parameter t is the inner iteration counter, and the parameter \(\eta\) is the inverse step-size that is adaptively chosen by a standard backtracking technique in Lines 6–9. As shown in Lemma 6(ii) later, Line 9 is executed a finite number of times under certain standard assumptions. Hence, the inner loop must stop after a finite number of iterations.

Input parameters: Algorithm 1 has several input parameters. The parameters \(M_0\) and \(\alpha\) are used to estimate the Lipschitz constant of the Jacobian J, and the parameters \(\eta _0\) and \(\alpha _{\textrm{in}}\) are used to control the step-size in the inner loop. The parameters T and c control how accurately the subproblems are solved through the stopping criteria of the inner loop. Here, note that we allow for \(T = \infty\). As we will prove in Sect. 4, the algorithm has an iteration complexity bound for an \(\varepsilon\)-stationary point regardless of the choice of the input parameters. However, to obtain an overall complexity bound or local quadratic convergence, there are restrictions on the choice of T, as explained in the next paragraph.

Stopping criteria for inner loop: There are two types of stopping criteria, as in Line 10, and the inner loop terminates when at least one of them is satisfied. If \(T < \infty\), the projected gradient method stops after executing Line 7 at most T times, and then the overall complexity for an \(\varepsilon\)-stationary point is guaranteed to be \(O(\varepsilon ^{-2})\). If \(T = \infty\), we have to solve the subproblems more accurately to find a \((c\uplambda \Vert F_k\Vert )\)-stationary point of the subproblem, and then Algorithm 1 achieves local quadratic convergence.

Remark 1

To make the algorithm more practical, we can introduce other parameters \(0< \beta < 1\) and \(M_{\min } > 0\), and update \(M \leftarrow \max \{ \beta M, M_{\min } \}\) after every successful iteration. As with the gradient descent method, such an operation prevents the estimate M from being too large and eliminates the need to choose \(M_0\) carefully. Inserting this operation never deteriorates the complexity bounds described in Sect. 4 and the local quadratic convergence in Sect. 5.

Remark 2

Some methods (e.g., [57, 66]) use the condition

$$\begin{aligned} \frac{m^k_\uplambda (x_k) - f(x)}{m^k_\uplambda (x_k) - m^k_\uplambda (x)} \ge \theta \end{aligned}$$
(13)

with some \(0< \theta < 1\) to determine whether the computed solution x to the subproblem is acceptable. Our acceptance condition (12) is stronger than the classical one since (12) is equivalent to

$$\begin{aligned} \frac{m^k_\uplambda (x_k) - f(x)}{m^k_\uplambda (x_k) - m^k_\uplambda (x)} \ge 1 \end{aligned}$$

under condition (11). Therefore, Lemma 1 is stronger than the classical statement that condition (13) holds if \(\uplambda\) is sufficiently large.

4 Iteration complexity and overall complexity

We will prove that Algorithm 1 finds an \(\varepsilon\)-stationary point of problem (1) within \(O(\varepsilon ^{-2})\) outer iterations. Furthermore, we will prove that, under \(T < \infty\), the overall complexity for an \(\varepsilon\)-stationary point is also \(O(\varepsilon ^{-2})\). Throughout this section, \((x_k)\) and \((\uplambda _k)\) denote the sequences generated by the algorithm.

4.1 Assumptions

We make the following assumptions to derive the complexity bound. Recall that the sublevel set \({\mathcal {S}}(a)\) and the line segment \({\mathcal {L}}(a, b)\) are defined in (7) and (8) and that \(x_0 \in {\mathcal {C}}\) denotes the starting point of Algorithm 1.

Assumption 1

For some constants \(\sigma , L > 0\),

  1. (i)

    \(\Vert J(x)\Vert \le \sigma\), \(\forall x \in {\mathcal {C}} \cap {\mathcal {S}}(x_0)\),

  2. (ii)

    \(\Vert J(y) - J(x)\Vert \le L\Vert y - x\Vert\), \(\forall x,y \in {\mathcal {C}}\) s.t.  \({\mathcal {L}}(x, y) \subseteq {\mathcal {S}}(x_0)\).

Assumption 1(i) means the \(\sigma\)-boundedness of J on \({\mathcal {C}} \cap {\mathcal {S}}(x_0)\). Assumption 1(ii) is similar to the L-Lipschitz continuity of J on \({\mathcal {C}} \cap {\mathcal {S}}(x_0)\) but weaker due to the condition \({\mathcal {L}}(x, y) \subseteq {\mathcal {S}}(x_0)\). Assumption 1 is milder than the assumptions in previous works that discuss the iteration complexity, even when \({\mathcal {C}} = {\mathbb {R}}^d\). For example, the analysis in [66] assumes that f and J are Lipschitz continuous on \({\mathbb {R}}^d\), which implies the boundedness of J on \({\mathbb {R}}^d\).

4.2 Approximate stationary point

Before analyzing the algorithm, we define an \(\varepsilon\)-stationary point for constrained optimization problems. Let \(\iota _{{\mathcal {C}}}: {\mathbb {R}}^d \rightarrow {\mathbb {R}}\cup \{ +\infty \}\) be the indicator function of the closed convex set \({\mathcal {C}} \subseteq {\mathbb {R}}^d\). For a convex function \(g: {\mathbb {R}}^d \rightarrow {\mathbb {R}}\cup \{ +\infty \}\), its subdifferential at \(x \in {\mathbb {R}}^d\) is the set defined by \(\partial g(x) :=\{ p \in {\mathbb {R}}^d \,|\, g(y) \ge g(x) + \langle p, y - x \rangle , \ \forall y \in {\mathbb {R}}^d \}\).

Definition 1

(see, e.g., Definition 1 in [51]) For \(\varepsilon > 0\), a point \(x \in {\mathcal {C}}\) is said to be an \(\varepsilon\)-stationary point of the problem \(\min _{x \in {\mathcal {C}}} f(x)\) if

$$\begin{aligned} \min _{p \in \partial \iota _{{\mathcal {C}}}(x)} \Vert \nabla f(x) + p\Vert \le \varepsilon . \end{aligned}$$
(14)

This definition is consistent with the unconstrained case; the above inequality is equivalent to \(\Vert \nabla f(x)\Vert \le \varepsilon\) when \({\mathcal {C}} = {\mathbb {R}}^d\). There is another equivalent definition of an \(\varepsilon\)-stationary point, which we will also use.

Lemma 2

For \(x \in {\mathcal {C}}\) and \(\varepsilon > 0\), condition (14) is equivalent to

$$\begin{aligned} \langle \nabla f(x), y - x \rangle \ge - \varepsilon \Vert y - x\Vert , \quad \forall y \in {\mathcal {C}}. \end{aligned}$$
(15)

Proof

The tangent cone \({\mathcal {T}}(x)\) of \({\mathcal {C}}\) at \(x \in {\mathcal {C}}\) is defined by

$$\begin{aligned} {\mathcal {T}}(x)&:=\{ \beta (y - x) \,|\, y \in {\mathcal {C}}, \ \beta \ge 0 \}. \end{aligned}$$
(16)

Note that

$$\begin{aligned} {\mathcal {T}}(x)&= \{ z \in {\mathbb {R}}^d \,|\, \langle y, z \rangle \le 0,\ \forall y \in \partial \iota _{{\mathcal {C}}}(x) \} \end{aligned}$$
(17)

because \({\mathcal {C}}\) is a closed convex set and \(\partial \iota _{{\mathcal {C}}}(x)\) is the normal cone of \({\mathcal {C}}\). We have

$$\begin{aligned}&\min _{p \in \partial \iota _{{\mathcal {C}}}(x)} \Vert \nabla f(x) + p\Vert \\&\quad = \min _{p \in \partial \iota _{{\mathcal {C}}}(x)} \max _{u: \Vert u\Vert \le 1} \langle - \nabla f(x) - p, u \rangle \\&\quad = \max _{u: \Vert u\Vert \le 1} \inf _{p \in \partial \iota _{{\mathcal {C}}}(x)} \Big \{ \langle - \nabla f(x), u \rangle - \langle p, u \rangle \Big \}&\quad&(\text {by a minimax theorem})\\&\quad = \max _{u \in {\mathcal {T}}(x),\,\Vert u\Vert \le 1} \langle - \nabla f(x), u \rangle&\quad&(\text {by}\, (17))\\&\quad = \sup _{y \in {\mathcal {C}} \setminus \{ x \}} \frac{\langle - \nabla f(x), y - x \rangle }{\Vert y - x\Vert }&\quad&(\text {by}\, (16)). \end{aligned}$$

Therefore, condition (14) is equivalent to

$$\begin{aligned} \sup _{y \in {\mathcal {C}} \setminus \{ x \}} \frac{\langle - \nabla f(x), y - x \rangle }{\Vert y - x\Vert } \le \varepsilon , \end{aligned}$$

which is also equivalent to (15). \(\square\)

A useful tool for deriving iteration complexity bounds is the gradient mapping (see, e.g., [49]), also known as the projected gradient [41] or the reduced gradient [52]. For \(\eta > 0\), the projected gradient operator \({\mathcal {P}}_\eta : {\mathcal {C}} \rightarrow {\mathcal {C}}\) and the gradient mapping \({\mathcal {G}}_\eta : {\mathcal {C}} \rightarrow {\mathbb {R}}^d\) for problem (1) are defined by

$$\begin{aligned} {\mathcal {P}}_\eta (x)&:=\mathop {\textrm{argmin}}\limits _{y \in {\mathcal {C}}} \Big \{ \langle \nabla f(x), y - x \rangle + \frac{\eta }{2}\Vert y - x\Vert ^2 \Big \} = {{\,\textrm{proj}\,}}_{{\mathcal {C}}} \Big ( x - \frac{1}{\eta } \nabla f(x) \Big ), \end{aligned}$$
(18)
$$\begin{aligned} {\mathcal {G}}_\eta (x)&:=\eta (x - {\mathcal {P}}_\eta (x)). \end{aligned}$$
(19)

The following lemma shows the relationship between an \(\varepsilon\)-stationary point and the gradient mapping.

Lemma 3

Suppose that Assumption 1 holds, and let

$$\begin{aligned} L_{f} :=\sigma ^2 + L \Vert F_0\Vert . \end{aligned}$$
(20)

Then, for any \(x \in {\mathcal {C}} \cap {\mathcal {S}}(x_0)\) and \(\eta \ge L_f\), the point \({\mathcal {P}}_\eta (x)\) is a \((2 \Vert {\mathcal {G}}_\eta (x)\Vert )\)-stationary point of problem (1).

The proof is given in Sect. A.4. This lemma will be used for the proof of Theorem 1(ii).

Although Lemma 3 looks quite similar to [51, Corollary 1], there exists a significant difference in their assumptions. Indeed, Lemma 3 assumes the boundedness and the Lipschitz property of J only on a (possibly) nonconvex set \({\mathcal {C}} \cap {\mathcal {S}}(x_0)\), whereas [51, Corollary 1] assumes the Lipschitz continuity on the whole space \({\mathbb {R}}^d\). This makes our proof more complicated than in [51, Corollary 1].

4.3 Preliminary lemmas

First, we bound the decrease in the model function value due to the inner loop. For \(\eta > 0\), we define the function \({\mathcal {D}}_\eta : {\mathcal {C}} \rightarrow {\mathbb {R}}\) by

$$\begin{aligned} {\mathcal {D}}_\eta (x) :=- \min _{y \in {\mathcal {C}}} \Big \{ \langle \nabla f(x), y - x \rangle + \frac{\eta }{2}\Vert y - x\Vert ^2 \Big \}. \end{aligned}$$
(21)

We see that \({\mathcal {D}}_\eta (x) \ge -\langle \nabla f(x), x - x \rangle - \frac{\eta }{2}\Vert x - x\Vert ^2 = 0\) for all \(x \in {\mathcal {C}}\). In addition, \({\mathcal {D}}_\eta (x)\) is decreasing with respect to \(\eta\).

Lemma 4

The solution x obtained in Line 11 of Algorithm 1 satisfies

$$\begin{aligned} m^k_{\uplambda } (x) \le m^k_{\uplambda } (x_k) - {\mathcal {D}}_\eta (x_k) \le m^k_{\uplambda } (x_k), \end{aligned}$$
(22)

where k, \(\uplambda\), and \(\eta\) are parameters in Algorithm 1.

Proof

The second inequality in (22) follows from the nonnegativity of \({\mathcal {D}}_\eta (x)\), and therefore we will prove the first one. Let \(T'\) denote the value of t when the inner loop is completed, and for each \(0 \le t \le T'\), let \(\eta _{k,t}\) denote the value of \(\eta\) when \(x_{k,t}\) is obtained through Line 7. Our aim is to prove the first inequality in (22) with \((x, \eta )=(x_{k,T'},\eta _{k,T'})\). We have

$$\begin{aligned} m^k_{\uplambda }(x_{k,1})&\le m^k_\uplambda (x_k) + \langle \nabla m^k_\uplambda (x_k), x_{k,1} - x_k \rangle + \frac{\eta _{k,1}}{2} \Vert x_{k,1} - x_k\Vert ^2&\quad&(\text {by Line 6})\\&= m^k_\uplambda (x_k) + \min _{z \in {\mathcal {C}}} \Big \{ \langle \nabla m^k_\uplambda (x_k), z - x_k \rangle + \frac{\eta _{k,1}}{2}\Vert z - x_k\Vert ^2 \Big \}&\quad&(\text {by the definition of } x_{k,1})\\&= m^k_\uplambda (x_k) - {\mathcal {D}}_{\eta _{k,1}}(x_k)&\quad&(\text {by } \nabla m^k_\uplambda (x_k) = \nabla f(x_k)). \end{aligned}$$

Since \({\mathcal {D}}_\eta (x_k)\) is decreasing in \(\eta\) and \(\eta _{k,1} \le \eta _{k,2} \le \dots \le \eta _{k,T'}\), we have \({\mathcal {D}}_{\eta _{k,1}}(x_k) \ge {\mathcal {D}}_{\eta _{k,T'}}(x_k)\). On the other hand, we have \(m^k_\uplambda (x_{k,1}) \ge \dots \ge m^k_\uplambda (x_{k, T'})\). Combining these inequalities, we obtain the desired result. \(\square\)

From the above lemma and Line 12, it follows that for all k,

$$\begin{aligned} f(x_{k+1}) \le m^k_{\uplambda _k}(x_{k+1}) \le m^k_{\uplambda _k}(x_k) = f(x_k). \end{aligned}$$
(23)

This monotonicity of \(f(x_k)\) in k is an important property of the majorization-minimization and will be used in our analysis.

The following two lemmas show that the parameters M and \(\eta\) in the algorithm are upper-bounded, and hence Lines 9 and 15 are executed only a finite number of times per single run.

Lemma 5

Suppose that Assumption 1(ii) holds, and let

$$\begin{aligned} {\bar{M}} :=\max \{M_0, \alpha L\}, \end{aligned}$$
(24)

where \(M_0\) and \(\alpha\) are the inputs of Algorithm 1. Then,

  1. (i)

    The parameter M in Algorithm 1 always satisfies \(M \le {\bar{M}}\);

  2. (ii)

    Throughout the algorithm, the number of unsuccessful iterations is at most \(\lceil \log _\alpha ({\bar{M}} / M_0)\rceil = O(1)\).

Proof

We have \({\mathcal {S}}(x_k) \subseteq {\mathcal {S}}(x_0)\) from (23), and therefore Assumption 1(ii) implies (9) with \({\mathcal {X}} = {\mathcal {C}}\). On the other hand, (22) directly implies (11). Hence, by Lemma 1 with \({\mathcal {X}} = {\mathcal {C}}\) and Lemma 4, if \(M \ge L\) holds at Line 3, the condition in Line 12 must be true. Therefore, if \(M_0 \ge L\), no unsuccessful iterations occur and the parameter M always satisfies \(M = M_0\). Otherwise, there exists an integer \(l \ge 1\) such that \(L \le \alpha ^l M_0 < \alpha L\). Since \(M = \alpha ^l M_0\) after l unsuccessful iterations, the parameter M always satisfies \(M < \alpha L\). Consequently, we obtain the first result, and the second follows from the first. \(\square\)

Lemma 6

Suppose that Assumption 1 holds, and let

$$\begin{aligned} {\bar{\eta }} :=\max \{ \eta _0, \alpha _{\textrm{in}} (\sigma ^2 + {\bar{M}} \Vert F_0\Vert ) \}, \end{aligned}$$
(25)

where \(\eta _0\) and \(\alpha _{\textrm{in}}\) are the inputs of Algorithm 1 and \({\bar{M}}\) is defined in (24). Then,

  1. (i)

    the parameter \(\eta\) in Algorithm 1 always satisfies \(\eta \le {\bar{\eta }}\);

  2. (ii)

    throughout the algorithm, Line 9 will be executed at most \(\lceil \log _{\alpha _\textrm{in}} ({\bar{\eta }} / \eta _0)\rceil = O(1)\) times.

Proof

Since the function \(m^k_\uplambda\) defined by (2) has the \((\Vert J_k\Vert ^2 + \uplambda )\)-Lipschitz continuous gradient, we have

$$\begin{aligned} m^k_\uplambda (y) \le m^k_\uplambda (x) + \langle \nabla m^k_\uplambda (x), y - x \rangle + \frac{\Vert J_k\Vert ^2 + \uplambda }{2} \Vert y - x\Vert ^2,\quad \forall x, y \in {\mathbb {R}}^d \end{aligned}$$

(see, e.g., [52, Eq. (2.1.9)]). We also have \(\Vert J_k\Vert ^2 + \uplambda \le \sigma ^2 + {\bar{M}} \Vert F_0\Vert\) from Assumption 1(i) and Lemma 5. Therefore, the inequality in Line 6 must hold if \(\eta \ge \sigma ^2 + {\bar{M}} \Vert F_0\Vert\). With the same arguments as in Lemma 5, we obtain the desired results. \(\square\)

As we can see from the proofs of Lemmas 5 and 6, if \(M_0 \ge L\) and \(\eta _0 \ge \sigma ^2 + M_0 \Vert F_0\Vert\), then no unsuccessful iterations occur in either the outer or the inner loop. Adjusting M and \(\eta\) adaptively as in the presented algorithm avoids an unnecessarily small step-size in practice.

4.4 Iteration complexity and overall complexity

We use the following lemma for the analysis.

Lemma 7

$$\begin{aligned} {\mathcal {D}}_\eta (x) \ge \frac{1}{2\eta }\Vert {\mathcal {G}}_\eta (x)\Vert ^2,\quad \forall x \in {\mathcal {C}}, \quad \eta > 0. \end{aligned}$$

Proof

By the first-order optimality condition on (18) and the convexity of \({\mathcal {C}}\), we have

$$\begin{aligned} \langle \nabla f(x) + \eta ({\mathcal {P}}_\eta (x) - x), y - {\mathcal {P}}_\eta (x) \rangle \ge 0, \quad \forall y \in {\mathcal {C}}. \end{aligned}$$
(26)

Using this inequality, we obtain

$$\begin{aligned} {\mathcal {D}}_\eta (x)&= \langle \nabla f(x), x - {\mathcal {P}}_\eta (x) \rangle - \frac{\eta }{2}\Vert x - {\mathcal {P}}_\eta (x)\Vert ^2&\quad&(\text {by}\, (18)\, \text {and} \,(21))\\&\ge \frac{\eta }{2}\Vert x - {\mathcal {P}}_\eta (x)\Vert ^2&\quad&(\text {by}\, (26)\, \text {with }y = x)\\&= \frac{1}{2\eta }\Vert {\mathcal {G}}_\eta (x)\Vert ^2&\quad&(\text {by}\, (19)). \end{aligned}$$

\(\square\)

We show the asymptotic global convergence and the iteration complexity bound of Algorithm 1.

Theorem 1

Suppose that Assumption 1 holds, and define \({\bar{\eta }}\) by (25). Then,

  1. (i)

    \(\displaystyle \lim _{k \rightarrow \infty } \Vert {\mathcal {G}}_{{\bar{\eta }}}(x_k)\Vert = 0\), and therefore, any accumulation point of \((x_k)\) is a stationary point of problem (1);

  2. (ii)

    \({\mathcal {P}}_{{\bar{\eta }}}(x_k)\) is an \(\varepsilon\)-stationary point of problem (1) for some \(k = O(\varepsilon ^{-2})\).

Proof

We have

$$\begin{aligned} f(x_{k+1}) - f(x_k)&\le m^k_{\uplambda _k}(x_{k+1}) - m^k_{\uplambda _k}(x_k)&\quad&(\text {by Line 12 and}\, m^k_{\uplambda _k}(x_k) = f(x_k))\\&\le - {\mathcal {D}}_{{\bar{\eta }}}(x_k)&\quad&(\text {by Lemmas 6(i) and 4})\\&\le - \frac{1}{2 {\bar{\eta }}}\Vert {\mathcal {G}}_{{\bar{\eta }}}(x_k)\Vert ^2&\quad&(\text {by Lemma 7}). \end{aligned}$$

Summing up this inequality for \(k=0,1,\dots ,K-1\), we obtain

$$\begin{aligned} \sum _{k=0}^{K-1} \Vert {\mathcal {G}}_{{\bar{\eta }}}(x_k)\Vert ^2 \le 2 {\bar{\eta }} (f(x_0) - f(x_K)) \le 2 {\bar{\eta }} f(x_0) \end{aligned}$$
(27)

for all \(K \ge 0\). Therefore, we also have \(\sum _{k=0}^{\infty } \Vert {\mathcal {G}}_{{\bar{\eta }}}(x_k)\Vert ^2 \le 2 {\bar{\eta }} f(x_0)\), yielding \(\lim _{k \rightarrow \infty } \Vert {\mathcal {G}}_{{\bar{\eta }}}(x_k)\Vert = 0\), the first result.

Combining (27) with \(\min _{0 \le k < K} \Vert {\mathcal {G}}_{{\bar{\eta }}}(x_k)\Vert ^2 \le \frac{1}{K} \sum _{k=0}^{K-1} \Vert {\mathcal {G}}_{{\bar{\eta }}}(x_k)\Vert ^2\), we have \(\Vert {\mathcal {G}}_{{\bar{\eta }}}(x_k)\Vert \le \varepsilon / 2\) for some \(k = O(\varepsilon ^{-2})\). For such \(x_k\), the point \({\mathcal {P}}_{{\bar{\eta }}}(x_k)\) is an \(\varepsilon\)-stationary point from Lemma 3 and \({\bar{\eta }} \ge L_f\). Thus, we have obtained the second result. \(\square\)

From Lemma 5(ii) and Theorem 1(ii), we obtain the iteration complexity bound of our algorithm as follows.

Corollary 1

Under Assumption 1, Algorithm 1 finds an \(\varepsilon\)-stationary point within \(O(\varepsilon ^{-2})\) outer iterations, namely, \(O(\varepsilon ^{-2})\) successful and unsuccessful iterations.

From this iteration complexity bound and Lemma 6(ii), we also obtain the overall complexity bound.

Corollary 2

Suppose that Assumption 1holds. Then, Algorithm 1 with \(T < \infty\) finds an \(\varepsilon\)-stationary point after \(O(\varepsilon ^{-2} T)\) basic operations.

We use the term basic operations to refer to evaluation of F(x), Jacobian-vector multiplications J(x)u and \(J(x)^\top v\), and projection onto \({\mathcal {C}}\) as in Sect. 1.2.

In order to compute an \(\varepsilon\)-stationary point based on Theorem 1(ii), knowledge of the value of \({\bar{\eta }}\) is required. However, this requirement can be circumvented with a slight modification of the algorithm. See Sect. A.5 for the details.

5 Local quadratic convergence

For zero-residual problems, we will prove that the sequence \((x_k)\) generated by Algorithm 1 with \(T = \infty\) converges locally quadratically to an optimal solution. Let us denote the set of optimal solutions to problem (1) by \({\mathcal {X}}^* :=\{ x \in {\mathcal {C}} \,|\, F(x) = {\textbf{0}} \}\) and the distance between \(x \in {\mathbb {R}}^d\) and \({\mathcal {X}}^*\) simply by \({{\,\textrm{dist}\,}}(x) :=\min _{y \in {\mathcal {X}}^*} \Vert y - x\Vert\). Throughout this section, we fix a point \(x^* \in {\mathcal {X}}^*\) and denote a neighborhood of \(x^*\) by \({\mathcal {B}}(r) :=\{ x \in {\mathbb {R}}^d \,|\, \Vert x - x^*\Vert \le r \}\) for \(r > 0\). As in the previous section, we denote the sequences generated by Algorithm 1 with \(T = \infty\) by \((x_k)\) and \((\uplambda _k)\).

5.1 Assumptions

We make the following assumptions to prove local quadratic convergence.

Assumption 2

  1. (i)

    There exists \(x \in {\mathcal {C}}\) such that \(F(x) = {\textbf{0}}\).

For some constants \(\rho , L, r > 0\),

  1. (ii)

    \(\rho {{\,\textrm{dist}\,}}(x) \le \Vert F(x)\Vert\), \(\forall x \in {\mathcal {C}} \cap {\mathcal {B}}(r)\),

  2. (iii)

    \(\Vert J(y) - J(x)\Vert \le L\Vert y - x\Vert\), \(\forall x,y \in {\mathcal {C}} \cap {\mathcal {B}}(r)\).

Assumption 2(i) requires the problem to be zero-residual, Assumption 2(ii) is called a local error bound condition, and Assumption 2(iii) is the local Lipschitz continuity of J. These assumptions are used in the previous analyses of LM methods [1, 3, 19, 22, 23, 29, 30, 33, 34, 40, 62].

5.2 Fundamental inequalities for analysis

Since \({\mathcal {C}} \cap {\mathcal {B}}(r)\) is compact, there exists a constant \(\sigma > 0\) such that

$$\begin{aligned} \Vert J(x)\Vert \le \sigma , \quad \forall x \in {\mathcal {C}} \cap {\mathcal {B}}(r), \end{aligned}$$
(28)

which implies

$$\begin{aligned} \Vert F(y) - F(x)\Vert \le \sigma \Vert y - x\Vert , \quad \forall x, y \in {\mathcal {C}} \cap {\mathcal {B}}(r). \end{aligned}$$
(29)

Let \(\sigma\) denote such a constant in the rest of this section.

For a point \(x \in {\mathbb {R}}^d\), let \({{\tilde{x}}} \in {\mathcal {X}}^*\) denote an optimal solution closest to x, i.e., \(\Vert {{\tilde{x}}} - x\Vert = {{\,\textrm{dist}\,}}(x)\). In particular, \({{\tilde{x}}}_k\) denotes one of the closest solutions to \(x_k\) for each \(k \ge 0\). Since \(\Vert {{\tilde{a}}} - x^*\Vert \le \Vert {{\tilde{a}}} - a\Vert + \Vert a - x^*\Vert \le 2\Vert a - x^*\Vert\), where the second inequality uses \(\Vert {{\tilde{a}}} - a\Vert = {{\,\textrm{dist}\,}}(a) \le \Vert a - x^*\Vert\), we have

$$\begin{aligned} a \in {\mathcal {B}}(r/2) \quad \Longrightarrow \quad {{\tilde{a}}} \in {\mathcal {B}}(r). \end{aligned}$$
(30)

Therefore, (29) with \(y :={{\tilde{x}}}\) implies

$$\begin{aligned} \Vert F(x)\Vert \le \sigma \Vert x - {{\tilde{x}}}\Vert = \sigma {{\,\textrm{dist}\,}}(x), \quad \forall x \in {\mathcal {C}} \cap {\mathcal {B}}(r/2). \end{aligned}$$
(31)

From the stopping criterion in Line 10 of Algorithm 1 with \(T = \infty\) and Definition 1, the solution x obtained in Line 11 satisfies

$$\begin{aligned} \langle \nabla m^k_\uplambda (x), y - x \rangle \ge - c\uplambda \Vert F_k\Vert \Vert y - x\Vert , \quad \forall y \in {\mathcal {C}}. \end{aligned}$$
(32)

From the definition of \(x_{k+1}\) and \(\uplambda _k\), we also have the inequality with \((x, \uplambda ) = (x_{k+1}, \uplambda _k)\), i.e.,

$$\begin{aligned} \langle \nabla m^k_{\uplambda _k}(x_{k+1}), y - x_{k+1} \rangle \ge - c\uplambda _k \Vert F_k\Vert \Vert y - x_{k+1}\Vert , \quad \forall y \in {\mathcal {C}}. \end{aligned}$$
(33)

5.3 Preliminary lemma

Lemma 8

Suppose that Assumption 2 holds, and define \({\bar{M}}\) by (24). Define the constants \(C_1, C_2, \delta > 0\) by

$$\begin{aligned} C_1&:=\sqrt{1 + c^2 \sigma ^2 + \frac{L^2 r}{16 \rho M_0}}, \end{aligned}$$
(34a)
$$\begin{aligned} C_2&:=\frac{1}{c^2} \bigg ( \sigma ^2 \Big ( c{\bar{M}} + \frac{L}{2\rho } \Big ) + \frac{L\sigma C_1^2}{2} + (L + {\bar{M}})\sigma C_1 \bigg ), \end{aligned}$$
(34b)
$$\begin{aligned} \delta&:=\frac{r}{2(1 + C_1)}, \end{aligned}$$
(34c)

where \(M_0\) and c are the inputs of Algorithm 1. Assume that \(x_k \in {\mathcal {B}}(\delta )\) and \(M \le {\bar{M}}\) hold at Line 3. Then,

  1. (i)

    The solution x obtained in Line 11 satisfies

    $$\begin{aligned} \Vert x - x_k\Vert \le C_1 {{\,\textrm{dist}\,}}(x_k); \end{aligned}$$
    (35)
  2. (ii)

    \(M \le {\bar{M}}\) holds when \(x_{k+1}\) is obtained;

  3. (iii)

    The following hold:

    $$\begin{aligned} \Vert x_{k+1} - x_k\Vert&\le C_1 {{\,\textrm{dist}\,}}(x_k), \end{aligned}$$
    (36)
    $$\begin{aligned} {{\,\textrm{dist}\,}}(x_{k+1})&\le C_2 {{\,\textrm{dist}\,}}(x_k)^2. \end{aligned}$$
    (37)

Proof of Lemma 8(i)

From \(x_k \in {\mathcal {B}}(\delta )\), \(\delta \le r/2\), and (30), we have

$$\begin{aligned} x_k \in {\mathcal {B}}(r/2) \quad \text {and}\quad {{\tilde{x}}}_k \in {\mathcal {B}}(r). \end{aligned}$$
(38)

Moreover, we have from \(\nabla m^k_{\uplambda }(x) = J_k^\top (F_k + J_k(x - x_k)) + \uplambda (x - x_k)\) that

$$\begin{aligned} \underbrace{\langle \nabla m^k_{\uplambda }(x), x - {{\tilde{x}}}_k \rangle }_{\text {(A)}} = \underbrace{\langle F_k + J_k(x - x_k), J_k(x - {{\tilde{x}}}_k) \rangle }_{\text {(B)}} + \uplambda \underbrace{\langle x - x_k, x - {{\tilde{x}}}_k \rangle }_{\text {(C)}}. \end{aligned}$$

We bound the terms (A)–(C) as follows:

$$\begin{aligned} \text {(A)} \le c\uplambda \Vert F_k\Vert \Vert x - {{\tilde{x}}}_k\Vert&\le c\sigma \uplambda \Vert x_k - {{\tilde{x}}}_k\Vert \Vert x - {{\tilde{x}}}_k\Vert \\&\le \frac{c^2 \sigma ^2 \uplambda }{2} \Vert x_k - {{\tilde{x}}}_k\Vert ^2 + \frac{\uplambda }{2} \Vert x - {{\tilde{x}}}_k\Vert ^2, \end{aligned}$$

where the first and second inequalities follow from (32) and (31), respectively, and the last inequality follows from the inequality of arithmetic and geometric means;

$$\begin{aligned} \text {(B)} \ge - \frac{1}{4} \Vert F_k + J_k({{\tilde{x}}}_k - x_k)\Vert ^2 \ge - \frac{L^2}{16}\Vert {{\tilde{x}}}_k - x_k\Vert ^4, \end{aligned}$$

where the first inequality follows from \(4\langle a, b \rangle = \Vert a+b\Vert ^2-\Vert a-b\Vert ^2 \ge -\Vert a-b\Vert ^2\) and the second inequality from Lemma 9(ii), (38), and Assumption 2(iii);

$$\begin{aligned} \text {(C)} = \frac{1}{2} \Big ( \Vert x - x_k\Vert ^2 + \Vert x - {{\tilde{x}}}_k\Vert ^2 - \Vert {{\tilde{x}}}_k - x_k\Vert ^2 \Big ). \end{aligned}$$

Combining these bounds and rearranging terms yield

$$\begin{aligned} \Vert x - x_k\Vert ^2 \le (1 + c^2 \sigma ^2) \Vert {{\tilde{x}}}_k - x_k\Vert ^2 + \frac{L^2}{8\uplambda }\Vert {{\tilde{x}}}_k - x_k\Vert ^4. \end{aligned}$$
(39)

From (38), Assumption 2(ii), and \(\uplambda = M \Vert F_k\Vert \ge M_0 \Vert F_k\Vert\), we have

$$\begin{aligned} \Vert {{\tilde{x}}}_k - x_k\Vert ^2 \le \frac{r}{2} \times \frac{\Vert F_k\Vert }{\rho } \le \frac{r \uplambda }{2 \rho M_0}. \end{aligned}$$

Applying this bound to the second term on the right-hand side of (39), we obtain the desired result (35). \(\square\)

Proof of Lemma 8(ii)

As in the proof of Lemma 8(i), let x denote the solution obtained in Line 11. By (34c), (35), and \(x_k \in {\mathcal {B}}(\delta )\), we have

$$\begin{aligned} \Vert x - x^*\Vert&\le \Vert x_k - x^*\Vert + \Vert x - x_k\Vert \\&\le \Vert x_k - x^*\Vert + C_1 {{\,\textrm{dist}\,}}(x_k)\\&\le (1 + C_1) \Vert x_k - x^*\Vert \le (1 + C_1) \delta = r/2, \end{aligned}$$

i.e.,

$$\begin{aligned} x \in {\mathcal {B}}(r/2). \end{aligned}$$
(40)

We now have \(x_k, x \in {\mathcal {C}} \cap {\mathcal {B}}(r)\). As in the proof of Lemma 5(i), by using Lemma 1 with \({\mathcal {X}} :={\mathcal {C}} \cap {\mathcal {B}}(r)\), we see that if \(M \ge L\) holds at Line 3, the outer iteration must be successful. This leads to the desired result. \(\square\)

Proof of Lemma 8(iii)

Equation (36) follows from Lemmas 8(i) and 8(ii). We prove (37) below. From (30) and (40), we have \(x_{k+1}, {{\tilde{x}}}_{k+1} \in {\mathcal {B}}(r)\). Moreover, we have

$$\begin{aligned}&\Vert F_{k+1}\Vert ^2 - \overbrace{\langle \nabla m^k_{\uplambda _k}(x_{k+1}), x_{k+1} - {{\tilde{x}}}_{k+1} \rangle }^{\text {(D)}}\\&= \langle F_{k+1}, F_{k+1} + J_{k+1}({{\tilde{x}}}_{k+1} - x_{k+1}) \rangle + \langle J_{k+1}^\top F_{k+1} - \nabla m^k_{\uplambda _k}(x_{k+1}), x_{k+1} - {{\tilde{x}}}_{k+1} \rangle \\&\le \underbrace{\Vert F_{k+1}\Vert \Vert F_{k+1} + J_{k+1}({{\tilde{x}}}_{k+1} - x_{k+1})\Vert }_{\text {(E)}} + \underbrace{\Vert J_{k+1}^\top F_{k+1} - \nabla m^k_{\uplambda _k}(x_{k+1})\Vert }_{\text {(F)}} {{\,\textrm{dist}\,}}(x_{k+1}) \end{aligned}$$

and bound the terms (D)–(F) as follows:

$$\begin{aligned} \text {(D)} \le c\uplambda _k \Vert F_k\Vert {{\,\textrm{dist}\,}}(x_{k+1}) \le c{\bar{M}} \Vert F_k\Vert ^2 {{\,\textrm{dist}\,}}(x_{k+1}) \end{aligned}$$

by (33) and Lemma 8(ii);

$$\begin{aligned} \text {(E)} \le \frac{L}{2} \Vert F_{k+1}\Vert {{\,\textrm{dist}\,}}(x_{k+1})^2 \le \frac{L}{2\rho } \Vert F_{k+1}\Vert ^2 {{\,\textrm{dist}\,}}(x_{k+1}) \le \frac{L}{2\rho } \Vert F_k\Vert ^2 {{\,\textrm{dist}\,}}(x_{k+1}) \end{aligned}$$

by Lemma 9(ii), Assumption 2(ii), and \(\Vert F_{k+1}\Vert \le \Vert F_k\Vert\) from (23); and

$$\begin{aligned} \text {(F)}&= \Vert J_{k+1}^\top F_{k+1} - J_k^\top (F_k + J_ku) - \uplambda _k u\Vert&\quad&(\text {by letting } u :=x_{k+1} - x_k)\\&\le \Vert J_k^\top (F_{k+1} - F_k - J_ku)\Vert + \Vert (J_{k+1} - J_k)^\top F_{k+1}\Vert + \uplambda _k \Vert u\Vert \\&\le \frac{L\sigma }{2} \Vert u\Vert ^2 + L\Vert F_{k+1}\Vert \Vert u\Vert + \uplambda _k \Vert u\Vert&\quad&(\text {by (28), Lemma 9(ii), and Assumption 2(iii)})\\&\le \frac{L\sigma }{2} \Vert u\Vert ^2 + (L + {\bar{M}}) \Vert F_k\Vert \Vert u\Vert&\quad&(\text {by } \Vert F_{k+1}\Vert \le \Vert F_k\Vert \text { and Lemma 8(ii)})\\&\le \Big ( \frac{L\sigma C_1^2}{2} + (L + {\bar{M}})\sigma C_1 \Big ) {{\,\textrm{dist}\,}}(x_k)^2&\quad&(\text {by (31) and (36)}). \end{aligned}$$

Combining these bounds yields

$$\begin{aligned} \Vert F_{k+1}\Vert ^2 \le \bigg ( \Big ( c{\bar{M}} + \frac{L}{2\rho } \Big ) \Vert F_k\Vert ^2 + \Big ( \frac{L\sigma C_1^2}{2} + (L + {\bar{M}})\sigma C_1 \Big ) {{\,\textrm{dist}\,}}(x_k)^2 \bigg ) {{\,\textrm{dist}\,}}(x_{k+1}). \end{aligned}$$

We bound \(\Vert F_k\Vert\) and \(\Vert F_{k+1}\Vert\) in the above inequality by using Assumption 2(ii) and (31), yielding

$$\begin{aligned} \rho ^2 {{\,\textrm{dist}\,}}(x_{k+1})^2 \le \bigg ( \sigma ^2 \Big ( c{\bar{M}} + \frac{L}{2\rho } \Big ) + \frac{L\sigma C_1^2}{2} + (L + {\bar{M}})\sigma C_1 \bigg ) {{\,\textrm{dist}\,}}(x_k)^2{{\,\textrm{dist}\,}}(x_{k+1}), \end{aligned}$$

which implies the desired result (37). \(\square\)

5.4 Local quadratic convergence

Let us state the local quadratic convergence result of Algorithm 1.

Theorem 2

Suppose that Assumption 2 holds, and define \({\bar{M}}\) by (24). Set \(x_0 \in {\mathcal {B}}(\delta _0)\) for a sufficiently small constant \(\delta _0 > 0\) such that

$$\begin{aligned} C_2 \delta _0 < 1,\quad \delta _0 + \frac{C_1 \delta _0}{1 - C_2 \delta _0} \le \delta , \end{aligned}$$
(41)

where \(C_1\), \(C_2\), and \(\delta\) are the constants defined in (34a)–(34c). Then,

  1. (i)

    The number of unsuccessful iterations is at most \(\lceil \log _\alpha ({\bar{M}} / M_0)\rceil = O(1)\), and

  2. (ii)

    The sequence \((x_k)\) converges quadratically to an optimal solution \({\hat{x}} \in {\mathcal {X}}^*\).

Proof of Theorem 2(i)

First, we will prove that

$$\begin{aligned}&x_k \in {\mathcal {B}}(\delta ), \text {and} \end{aligned}$$
(42a)
$$\begin{aligned}&M \le {\bar{M}}\text { holds when }x_k\text { is obtained} \end{aligned}$$
(42b)

for all \(k \ge 0\) by induction. For \(k=0\), (42a) and (42b) are obvious. For a fixed \(K \ge 0\), assume (42a) and (42b) for all \(k \le K\). We then have (36), (37) and (42b) for \(k \le K+1\) by Lemma 8. To complete the induction, we prove (42a) for \(k = K+1\). Solving the recursion of (37) and using \({{\,\textrm{dist}\,}}(x_0) \le \delta _0\), we have

$$\begin{aligned} {{\,\textrm{dist}\,}}(x_k) \le {{\,\textrm{dist}\,}}(x_0) (C_2 {{\,\textrm{dist}\,}}(x_0))^{2^k - 1} \le \delta _0 (C_2 \delta _0)^{2^k - 1} \le \delta _0 (C_2 \delta _0)^k \end{aligned}$$
(43)

for all \(k \le K+1\). We obtain (42a) for \(k = K+1\) as follows:

$$\begin{aligned} \Vert x_{K+1} - x^*\Vert&\le \Vert x_0 - x^*\Vert + \sum _{k=0}^K \Vert x_{k+1} - x_k\Vert&\quad&(\text {by the triangle inequality})\\&\le \delta _0 + C_1 \sum _{k=0}^K {{\,\textrm{dist}\,}}(x_k)&\quad&(\text {by (36)})\\&\le \delta _0 + \frac{C_1 \delta _0}{1 - C_2 \delta _0} \le \delta&\quad&(\text {by (41) and (43)}). \end{aligned}$$

Now, we have proved (42a) and (42b) for all \(k \ge 0\). In particular, claim (i) follows from (42b) by the same argument as in the proof of Lemma 5(ii). \(\square\)

Proof of Theorem 2(ii)

Note that we have proved (37) and (43) for all \(k \ge 0\) in the proof of Theorem 2(i). By (43) and \(C_2\delta _0<1\) in (41), we have

$$\begin{aligned} \lim _{k \rightarrow \infty }{{\,\textrm{dist}\,}}(x_k) = 0. \end{aligned}$$
(44)

As with (43), we have for \(i \ge k\),

$$\begin{aligned} {{\,\textrm{dist}\,}}(x_i) \le {{\,\textrm{dist}\,}}(x_k) (C_2 {{\,\textrm{dist}\,}}(x_k))^{2^{i-k} - 1} \le {{\,\textrm{dist}\,}}(x_k) (C_2 \delta _0)^{i-k}. \end{aligned}$$

Using this bound and (35), we obtain

$$\begin{aligned} \Vert x_k - x_l\Vert \le \sum _{i=k}^{l-1} \Vert x_{i+1} - x_i\Vert \le C_1 \sum _{i=k}^{l-1} {{\,\textrm{dist}\,}}(x_i) \le \frac{C_1}{1 - C_2\delta _0} {{\,\textrm{dist}\,}}(x_k) \end{aligned}$$
(45)

for all \(k, l\) such that \(0 \le k < l\). Equations (45) and (44) imply that \((x_k)\) is a Cauchy sequence. Accordingly, the sequence \((x_k)\) converges to a point \({\hat{x}}\), which belongs to \({\mathcal {X}}^*\) by (44) and the closedness of \({\mathcal {X}}^*\). Thus, we obtain

$$\begin{aligned} \Vert x_{k+1} - {\hat{x}}\Vert&= \lim _{l \rightarrow \infty } \Vert x_{k+1} - x_l\Vert&\quad&(\text {by the continuity of $\Vert \cdot \Vert$} )\\&\le \frac{C_1}{1 - C_2\delta _0} {{\,\textrm{dist}\,}}(x_{k+1})&\quad&(\text {by (45)})\\&\le \frac{C_1C_2}{1 - C_2\delta _0} {{\,\textrm{dist}\,}}(x_k)^2&\quad&(\text {by (37)})\\&\le \frac{C_1C_2}{1 - C_2\delta _0} \Vert x_k - {\hat{x}}\Vert ^2&\quad&(\text {by}\, {\hat{x}} \in {\mathcal {X}}^*), \end{aligned}$$

which implies Theorem 2(ii). \(\square\)

6 Practical variant of the proposed method

We present a more practical variant (Algorithm 3) of Algorithm 1, which also achieves the theoretical guarantees given for Algorithm 1 in Sects. 4 and 5.

6.1 Generalized version of Algorithm 1

To obtain the practical variant, we first present a generalized framework of Algorithm 1. Algorithm 1 runs the vanilla projected gradient (PG) method in the inner loop. This PG can be replaced with other algorithms while keeping the \(O(\varepsilon ^{-2})\) iteration complexity and the quadratic convergence established for Algorithm 1. Indeed, these theoretical results rely on the fact that the x obtained in Line 11 of Algorithm 1 satisfies the following conditions:

Condition 1

(for \(O(\varepsilon ^{-2})\) iteration complexity bound) There exists a constant \(\gamma > 0\) such that for all k,

$$\begin{aligned} m^k_{\uplambda } (x) - m^k_{\uplambda } (x_k) \le - {\mathcal {D}}_\gamma (x_k). \end{aligned}$$

Condition 2

(for local quadratic convergence) Both of the following hold:

  1. (i)

    \(m^k_\uplambda (x) \le m^k_\uplambda (x_k)\) for all k;

  2. (ii)

    There exists a constant \(c > 0\) such that x is a \((c\uplambda \Vert F_k\Vert )\)-stationary point of subproblem (3) for all k.

This fact yields a general algorithmic framework, Algorithm 2, that achieves the \(O(\varepsilon ^{-2})\) iteration complexity bound together with the quadratic convergence.

Algorithm 2 (pseudocode)
Algorithm 3 (pseudocode)

In Line 4 of Algorithm 2, any globally convergent algorithm for subproblem (3) can be employed. For example, we may use (block) coordinate descent methods, Frank-Wolfe methods, interior point methods, active set methods, or augmented Lagrangian methods. For unconstrained cases, since the subproblem reduces to solving a system of linear equations, we may use conjugate gradient methods or direct methods, including Gaussian elimination.

6.2 Proposed method with an accelerated projected gradient

A practical example of Algorithm 2 is presented in Algorithm 3. This algorithm employs the accelerated projected gradient (APG) method [45, Algorithm 1] with the adaptive restarting technique [54, Sect. 3.2] to solve the subproblems and adopts the additional parameters mentioned in Remark 1. Since the solution x obtained in Line 19 of Algorithm 3 satisfies Condition 1, this algorithm enjoys the \(O(\varepsilon ^{-2})\) iteration complexity bound. In addition, it achieves the \(O(\varepsilon ^{-2})\) overall complexity bound if \(T < \infty\), as in Corollary 2, and local quadratic convergence if \(T = \infty\). Algorithm 3 will be used for the numerical experiments in the next section.

7 Numerical experiments

We examine the practical performance of the proposed method. We implemented all methods in Python with SciPy [58] and JAX [9] and executed them on a computer with an Apple M1 chip (8 cores, 3.2 GHz) and 16 GB RAM.

7.1 Problem setting

We consider three types of instances: (i) compressed sensing with quadratic measurement, (ii) nonnegative matrix factorization with missing values, and (iii) autoencoder with MNIST dataset.

7.1.1 Compressed sensing with quadratic measurement

Given \(A_1,\dots ,A_n \in {\mathbb {R}}^{r \times d}\), \(b_1,\dots ,b_n \in {\mathbb {R}}^d\), and \(c_1,\dots ,c_n, R \in {\mathbb {R}}\), we consider the following problem:

$$\begin{aligned} \min _{x \in {\mathbb {R}}^d}\ \sum _{i=1}^n \Big ({ \frac{1}{2r} \Vert A_i x\Vert ^2 + \langle b_i, x \rangle - c_i }\Big )^2 \quad \mathrm {subject\ to}\quad \Vert x\Vert _1 \le R, \end{aligned}$$
(46)

where \(\Vert \cdot \Vert _1\) denotes the \(\ell _1\)-norm. Problem (46) formulates the situation where a sparse vector \(x^* \in {\mathbb {R}}^d\) is recovered from a small number (i.e., \(n < d\)) of quadratic observations, \(\frac{1}{2r} \Vert A_i x^*\Vert ^2 + \langle b_i, x^* \rangle\) for \(i = 1, \dots , n\). Such a problem arises in the context of compressed sensing [8, 44] and phase retrieval [11, 63]. Problem (46) can be transformed into the form of problem (1).

Generating instances: First, we generate the optimal solution \(x^* \in {\mathbb {R}}^d\) with only \(d_{\textrm{nnz}} \,(< d)\) nonzero entries. The indices of the nonzero entries are chosen uniformly at random, and the values of those entries are drawn independently from the uniform distribution on \([-x_{\textrm{max}}, x_{\textrm{max}}]\). Each entry of the \(A_i\)’s and \(b_i\)’s is drawn independently from the standard normal distribution \({\mathcal {N}}(0, 1)\). Then, we set \(R = \Vert x^*\Vert _1\) and \(c_i = \frac{1}{2r} \Vert A_i x^*\Vert ^2 + \langle b_i, x^* \rangle\) for all i. We fix \(d = 200\), \(r = 10\), and \(n = 50\), and set \(d_{\textrm{nnz}} \in \{ 5, 10, 20 \}\) and \(x_{\textrm{max}} \in \{ 0.1, 1 \}\). We set the starting point for each algorithm as \(x_0 = {\textbf{0}}\).

7.1.2 Nonnegative matrix factorization with missing values

Given \(A \in {\mathbb {R}}^{m \times n}\) and \(H \in \{ 0, 1 \}^{m \times n}\), we consider the following problem:

$$\begin{aligned} \min _{X \in {\mathbb {R}}^{m \times r},\ Y \in {\mathbb {R}}^{n \times r}} \Vert H \odot (X Y^\top - A) \Vert _{\textrm{F}}^2 \quad \mathrm {subject\ to}\quad X \ge O,\quad Y \ge O, \end{aligned}$$
(47)

where \(\odot\) denotes the elementwise product, \(X \ge O\) and \(Y \ge O\) denote elementwise inequalities, and \(\Vert \cdot \Vert _{\textrm{F}}\) denotes the Frobenius norm. Problem (47) formulates the situation where a data matrix A with some missing entries is approximated by the product \(X Y^\top\) of two nonnegative matrices. Such a problem is called nonnegative matrix factorization (NMF) with missing values and is widely used for nonnegative data analysis, especially for collaborative filtering [46, 65]. For more information on NMF, see [7, 59] and the references therein. Problem (47) can also be written as problem (1).

Generating instances To generate A and H, we introduce two parameters: \(\gamma \ge 1\) and \(0 < p \le 1\). The parameters \(\gamma\) and p control the condition number of A and the number of 1's in H, respectively. Let \(l :=\min \{ m, n \}\). First, a matrix \({{\tilde{A}}} \in {\mathbb {R}}^{m \times n}\) is generated by \({{\tilde{A}}} = U D V^\top\), and then the matrix A is obtained by normalizing \({{\tilde{A}}} = ({{\tilde{a}}}_{ij})_{i,j}\) as \(A = {{\tilde{A}}} / \max _{i, j} {{\tilde{a}}}_{ij}\). Here, each entry of \(U \in {\mathbb {R}}^{m \times l}\) and \(V \in {\mathbb {R}}^{n \times l}\) is drawn independently from the uniform distribution on [0, 1], and \(D = {{\,\textrm{diag}\,}}(\gamma ^{0}, \gamma ^{-1/l}, \gamma ^{-2/l},\dots , \gamma ^{-(l-1)/l} ) \in {\mathbb {R}}^{l \times l}\) is a diagonal matrix. H is a random matrix whose entries are drawn independently from the Bernoulli distribution with parameter p, i.e., each entry of H is 1 with probability p. We fix \(m = n = 50\) and \(\gamma = 10^5\), and set \(r \in \{ 10, 40 \}\) and \(p \in \{ 0.02, 0.1, 0.5 \}\). Since \((X, Y) = (O, O)\) is a stationary point of problem (47), we set the starting point to random matrices whose entries independently follow the uniform distribution on \([0, 10^{-3}]\).
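The construction of A, H, and the starting point can be sketched as follows (again, the function name is ours).

```python
import numpy as np

def generate_nmf_instance(m=50, n=50, r=10, gamma=1e5, p=0.1, seed=0):
    """Generate A, H, and a starting point for problem (47) as described above."""
    rng = np.random.default_rng(seed)
    l = min(m, n)
    U = rng.uniform(size=(m, l))
    V = rng.uniform(size=(n, l))
    D = np.diag(gamma ** (-np.arange(l) / l))    # diag(gamma^0, ..., gamma^{-(l-1)/l})
    A_tilde = U @ D @ V.T
    A = A_tilde / A_tilde.max()                  # normalize by the largest entry
    H = (rng.uniform(size=(m, n)) < p).astype(float)   # Bernoulli(p) mask
    X0 = rng.uniform(0.0, 1e-3, size=(m, r))     # avoid the stationary point (O, O)
    Y0 = rng.uniform(0.0, 1e-3, size=(n, r))
    return A, H, X0, Y0
```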

7.1.3 Autoencoder with MNIST dataset

The third instance is highly nonlinear and large-scale. In machine learning, autoencoders (see, e.g., [37, Section 14]) are a popular model for compressing real-world data, represented as high-dimensional vectors, into low-dimensional vectors. Given p-dimensional data \(a_1, \dots , a_N \in {\mathbb {R}}^p\), an autoencoder learns an encoder \(\phi _x^{\textrm{enc}}: {\mathbb {R}}^p \rightarrow {\mathbb {R}}^q\) and a decoder \(\phi _y^{\textrm{dec}}: {\mathbb {R}}^q \rightarrow {\mathbb {R}}^p\), where \(q < p\). Here, x and y are parameters to be learned by solving the following optimization problem:

$$\begin{aligned} \min _{x, y}\ \sum _{i=1}^N \Vert a_i - \phi _y^{\textrm{dec}}(\phi _x^{\textrm{enc}}(a_i))\Vert ^2. \end{aligned}$$
(48)

As we see from the optimization problem above, the autoencoder aims to extract latent features that can be used to reconstruct the original data.

For this experiment, we use the MNIST hand-written digit dataset. Each datum is a \(28 \times 28\) pixel grayscale image, represented as a vector \(a_i \in [0, 1]^p\) with \(p = 28 \times 28 = 784\). The dataset contains 60,000 training images, of which \(N = 1000\) were randomly chosen for use. We set \(q = 16\); our model encodes 784-dimensional data into 16 dimensions. Both encoder and decoder are two-layer neural networks with a hidden layer of size 64 and logistic sigmoid activation functions. Specifically, the encoder \(\phi _x^{\textrm{enc}}\) is written as

$$\begin{aligned} \phi _x^{\textrm{enc}}(a) = \phi _x^2 \circ \phi _x^1 (a), \quad \text {where}\quad \phi _x^i(a)&:=S( W_i a + b_i ). \end{aligned}$$

Here, S is the elementwise logistic sigmoid function, and \(W_1 \in {\mathbb {R}}^{64 \times 784}\), \(b_1 \in {\mathbb {R}}^{64}\), \(W_2 \in {\mathbb {R}}^{16 \times 64}\), and \(b_2 \in {\mathbb {R}}^{16}\) are the parameters of the network; \(x = ((W_i, b_i))_{i=1}^2\). The decoder \(\phi _y^{\textrm{dec}}\) is formulated analogously. When we rewrite problem (48) in the form of (1), the dimensions of the function \(F: {\mathbb {R}}^d \rightarrow {\mathbb {R}}^n\) are \(d = 103{,}328\) and \(n = Np = 784{,}000\).
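Since the experiments use JAX, the residual map F of problem (48) can be sketched as below; the helper names are ours, and the decoder is built from the same two-layer block as the encoder.

```python
import jax
import jax.numpy as jnp

def two_layer(params, a):
    """Two-layer block S(W2 S(W1 a + b1) + b2) with logistic sigmoid S;
    used for both the encoder and the decoder."""
    for W, b in params:  # params = [(W1, b1), (W2, b2)]
        a = jax.nn.sigmoid(W @ a + b)
    return a

def residual(enc_params, dec_params, data):
    """Stacked residual vector F for problem (48); data has shape (N, p)."""
    recon = jax.vmap(lambda a: two_layer(dec_params, two_layer(enc_params, a)))(data)
    return jnp.ravel(data - recon)
```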

7.2 Algorithms and implementation

We compare the proposed method with six existing methods; the details are given below.

Proposed (Algorithm 3) and Proposed-NA (Algorithm 1) methods To assess the effect of acceleration on the subproblems, we implemented both Algorithms 1 and 3; Algorithm 3 is, of course, expected to be faster. In Line 10 of Algorithm 1 and Line 18 of Algorithm 3, we must check whether \(x_{k,t}\) is a \((c\uplambda \Vert F_k\Vert )\)-stationary point, which is not easy to verify directly. We therefore replace the criterion with one based on the gradient mapping, i.e., we check whether \(\eta \Vert x_{k,t} - y\Vert \le c\uplambda \Vert F_k\Vert\). The input parameters of Algorithms 1 and 3 are set to \(M_0 = \eta _0 = 1\), \(\alpha = \alpha _{\textrm{in}} = 2\), \(\beta = \beta _{\textrm{in}} = 0.9\), \(M_{\min } = 10^{-10}\), \(T = 100\), and \(c = 1\).
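In code, the replaced termination test might look like the following sketch, where grad_m is the gradient of the model \(m^k_\uplambda\), proj is the projection onto \({\mathcal {C}}\), and y is the projected gradient step with step size \(1/\eta\) (all names are ours).

```python
import numpy as np

def stationarity_test(x, grad_m, proj, eta, c, lam, norm_Fk):
    """Surrogate for the (c*lam*||F_k||)-stationarity check: take one projected
    gradient step y with step size 1/eta and test the gradient-mapping norm."""
    y = proj(x - grad_m(x) / eta)
    return eta * np.linalg.norm(x - y) <= c * lam * norm_Fk
```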

Fan method [26, Algorithm 2.1] and KYF method [40, Algorithm 2.12] The Fan and KYF methods are constrained LM methods with a global convergence guarantee. To solve subproblem (3), an APG method is used, as in Algorithm 3, for a fair comparison. The only difference from the APG in Algorithm 3 lies in the stopping criterion: the condition "\(x_{k,t}\) is a \((c\uplambda \Vert F_k\Vert )\)-stationary point" in Line 18 of Algorithm 3 is replaced with \(\eta \Vert x_{k,t} - y\Vert \le 10^{-9}\).Footnote 4 The input parameters in [26, 40] are set to \(\mu = 10^{-4}\), \(\beta = 0.9\), \(\sigma = 10^{-4}\), \(\gamma = 0.99995\), and \(\delta \in \{ 1, 2 \}\), following the recommendations of [26, 40].Footnote 5

Facchinei method [22, Algorithm 3] This is a constrained LM method that allows subproblems to be solved inexactly. We solve the subproblems in almost the same way as the Fan and KYF methods. The input parameters in [22] are set to \(\gamma _0 = 1\) and \(S = 2\).

GGO method [36, Algorithm G-LMA-IP] This is an LM-type method that requires the solution of a linear system at each iteration. The main advantage of this algorithm is that it does not require exact projection and can be applied to problems with a complex feasible region; still, it is reported to perform well even when the projection is easy to compute exactly [36]. The linear systems are solved via QR decomposition (scipy.linalg.qr [58]), and the input parameters are set to \(M \in \{ 1, 15 \}\), \(\eta _1 = 10^{-4}\), \(\eta _2 = 10^{-2}\), \(\eta _3 = 10^{10}\), \(\gamma = 10^{-3}\), \(\beta = 1/2\), and \(\theta _k = 0\), following [36].Footnote 6

Projected gradient (PG) method The PG method is one of the most standard first-order methods for problem (1). The step size is chosen adaptively, in a manner similar to the APG in Algorithm 3, with \(\eta _0 = 1\), \(\alpha _{\textrm{in}} = 2\), and \(\beta _{\textrm{in}} = 0.9\).
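A minimal sketch of such an adaptive PG loop is given below; the backtracking scheme is our guess at a standard curvature-estimation rule matching the parameters above, not the exact implementation.

```python
import numpy as np

def projected_gradient(f, grad, proj, x0, eta0=1.0, alpha_in=2.0, beta_in=0.9,
                       max_iter=1000):
    """Projected gradient with a backtracking estimate eta of the local
    curvature (step size 1/eta)."""
    x, eta = x0, eta0
    for _ in range(max_iter):
        fx, g = f(x), grad(x)
        while True:
            y = proj(x - g / eta)
            d = y - x
            # descent test against the quadratic upper bound with modulus eta
            if f(y) <= fx + g @ d + 0.5 * eta * (d @ d):
                break
            eta *= alpha_in          # estimate too small: backtrack
        x = y
        eta *= beta_in               # let the estimate shrink again
    return x
```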

Trust-region reflective (TRF) method This is an interior trust-region method for box-constrained nonlinear optimization. It was proposed in [10] and is implemented in SciPy [58] with several improvements. For the TRF method, we call scipy.optimize.least_squares [58] with the option gtol=1e-5 to avoid long execution times caused by searching for an overly precise solution.

Other information As mentioned in Sect. 1.2, there are two ways to handle Jacobian matrices: explicitly computing \(J_k :=J(x_k)\) or using Jacobian-vector products \(J_k u\) and \(J_k^\top v\). In our experiments, the latter implementation outperformed the former, so we adopted the latter whenever possible (i.e., for Proposed, Proposed-NA, Fan, KYF, Facchinei, and PG).Footnote 7 We note that GGO relies on a QR decomposition, which cannot readily be implemented using only Jacobian-vector products.
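With JAX, matrix-free products with \(J_k\) and \(J_k^\top\) are available via forward- and reverse-mode differentiation; a minimal sketch (the wrapper name is ours):

```python
import jax

def matrix_free_jacobian(F, x):
    """Return F(x) and matrix-free maps u -> J(x) u (forward mode) and
    v -> J(x)^T v (reverse mode), without forming J(x)."""
    Fx, jvp = jax.linearize(F, x)   # jvp(u) = J(x) @ u
    _, vjp = jax.vjp(F, x)          # vjp(v) = (J(x).T @ v,)
    return Fx, jvp, lambda v: vjp(v)[0]
```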

For projection onto the feasible region of problem (46), we employ [18, Algorithm 1], whose time complexity is \(O(d \log d)\).
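For reference, a standard \(O(d \log d)\) sort-based projection onto the \(\ell _1\)-ball, in the spirit of, though not necessarily identical to, [18, Algorithm 1]:

```python
import numpy as np

def project_l1_ball(x, R):
    """Euclidean projection onto {z : ||z||_1 <= R} in O(d log d) time."""
    if np.abs(x).sum() <= R:
        return x.copy()
    u = np.sort(np.abs(x))[::-1]                 # sorted magnitudes, descending
    css = np.cumsum(u)
    j = np.arange(1, x.size + 1)
    rho = np.nonzero(u > (css - R) / j)[0][-1]   # last index kept active
    theta = (css[rho] - R) / (rho + 1.0)         # soft-threshold level
    return np.sign(x) * np.maximum(np.abs(x) - theta, 0.0)
```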

7.3 Results

7.3.1 Compressed sensing and NMF

Figures 2 and 3 show the results for compressed sensing in (46) and NMF in (47), respectively. Each figure consists of six subfigures, each containing two plots: the upper plot shows the worst case among ten randomly generated instances, and the lower plot shows the best case.Footnote 8

Fig. 2 Results of compressed sensing (problem (46))

Fig. 3 Results of NMF (problem (47))

Tables 2 and 3 provide more detailed information. For the tables, each algorithm is stopped when either of the following conditions is fulfilled: (i) the algorithm finds a point at which the norm of the gradient mapping is less than \(10^{-5}\); (ii) the execution time exceeds 10 seconds. The "Success" column indicates the percentage of instances (out of 10) that satisfied condition (i). The remaining columns show the averages of the following values: the objective function value reached, the gradient-mapping norm, the execution time, the number of iterations, and the number of basic operations; JVP stands for Jacobian-vector products.

Table 2 Results of compressed sensing (problem (46))
Table 3 Results of NMF (problem (47))

A remarkable feature of our method is its stability, in addition to its fast convergence. For example, while the Fan and KYF methods perform well in most cases, they sometimes converge slowly, as shown in Table 2(e) and (f). The proposed method shows the best or comparable performance in all settings compared with the other methods, which suggests that it is stable without careful parameter tuning.

As seen from Tables 3(d)–(f), the Facchinei and GGO methods do not work well in some cases. For the Facchinei method, the reason is presumably that it does not guarantee global convergence. For GGO, the tables show that the number of iterations completed within the time limit is small (3 or 4); this is because the method explicitly computes a Jacobian and solves a linear system at each iteration, resulting in a high per-iteration cost for large-scale problems. In contrast, our method guarantees global convergence and repeats relatively low-cost iterations without Jacobian computation, which also leads to stable performance.

Figure 4 shows the results of the TRF method. Since this method handles only box constraints, results are presented only for problem (47). Each marker corresponds to one instance and represents the elapsed time and the objective value obtained.Footnote 9 TRF takes more time to converge than the proposed method; in particular, comparing Table 3(f) and Fig. 4f, we see that its elapsed time is about 1000 times longer than ours. This result may be due to the difference in how TRF and our method handle the constraint: when the optimal solution or a stationary point lies on the boundary of the constraint set, our method can reach the boundary in a finite number of iterations, whereas TRF cannot, as it is an interior-point method.

Fig. 4 Results of NMF (problem (47)) by the TRF method

7.3.2 Autoencoder with MNIST

Figure 5 shows the results for problem (48). The results of the GGO method are omitted because it explicitly computes the Jacobian, which was infeasible in this large-scale setting with \(d = 103{,}328\) and \(n = 784{,}000\). Among the existing methods, the PG method converges the fastest, but the proposed method converges about five times faster than PG. This result suggests that our method is also effective for large-scale and highly nonlinear problems.

Fig. 5 Results of autoencoder with MNIST (problem (48))

8 Conclusion and future work

We proposed an LM method for solving constrained nonlinear least-squares problems. Our method finds an \(\varepsilon\)-stationary point of (possibly) nonzero-residual problems within \(O(\varepsilon ^{-2})\) computation and also achieves local quadratic convergence for zero-residual problems. Few LM methods possess both an overall complexity bound and local quadratic convergence, even for unconstrained problems; indeed, our investigation found only one such algorithm [6]. The key to our analysis is a simple update rule for \((\uplambda _k)\) together with the majorization lemma (Lemma 1).

The convergence analysis presented in this paper may extend to other problem settings. For example, it would be interesting to derive an overall complexity bound for LM methods with a nonsmooth function F. It would also be interesting to integrate stochastic techniques into our LM method for problems in which F is of huge size. Finally, local convergence analysis for nonzero-residual problems has been progressing in recent years [2, 6, 39], and it is important to investigate our LM method further along this line.