1 Introduction

This paper is concerned with the solution of nonlinear least-squares problems of the form

$$\begin{aligned} \min _{\textbf{x}\in \mathbb {R}^n} \left\| \textbf{y}-f(\textbf{x})\right\| ^2, \end{aligned}$$
(1.1)

where \(f:\mathbb {R}^n\rightarrow \mathbb {R}^m\) is a differentiable nonlinear function, \(\textbf{x}\in \mathbb {R}^n\) denotes the unknown vector that we would like to determine, \(\textbf{y}\in \mathbb {R}^m\) represents measured data, and \(\left\| \cdot \right\| \) stands for the Euclidean vector norm. Problems of this kind arise in several fields of science and engineering, such as image reconstruction and geophysics. In some applications, the solution of the problem (1.1) is very sensitive to perturbations, e.g., to errors in the data vector \(\textbf{y}\); these errors may, for instance, be caused by measurement inaccuracies. In this paper, we only consider problems that are not overly sensitive to perturbations; if the given problem is very sensitive to perturbations in \(\textbf{y}\), then we assume that the problem has been regularized to make the solution less sensitive. For an example of a well-posed problem of the form (1.1) we refer the reader to the so-called Bratu problem; see, e.g., [29] and Sect. 4. If the problem is ill-posed, a simple approach to regularize it is to use Tikhonov regularization (see, e.g., [21]), i.e., to replace (1.1) by

$$\begin{aligned} \min _{\textbf{x}\in \mathbb {R}^n}\left\{ \left\| \textbf{y}-f(\textbf{x})\right\| ^2+\mu \left\| \textbf{x}\right\| ^2\right\} , \end{aligned}$$

for a suitable \(\mu >0\). In the following, we will assume that this substitution, if required, has been performed.

We consider the Gauss–Newton method [20, 34] for the solution of (1.1). This is an iterative method that at each step solves a linear least-squares problem obtained from a first-order Taylor expansion of the nonlinear function \(r(\textbf{x})=\textbf{y}-f(\textbf{x})\). Convergence is ensured by the use of the Armijo-Goldstein condition; see below. When the problem (1.1) is underdetermined, i.e., when n is larger than m, the solution is not unique. Pes and Rodriguez [32, 33] describe a modification of the Gauss–Newton method that can be applied in this situation to determine the minimal-norm solution; they also describe some regularized variants of the Gauss–Newton method.

Nevertheless, we would like to briefly discuss the situation when the Jacobian matrix is severely ill-conditioned or even rank-deficient. Since we project a linearization of the (large) problem (1.1) into solution subspaces of fairly small dimension, the projected problems usually are better conditioned than the original large problem. As will be illustrated by numerical examples in Sect. 4, when the data \(\textbf{y}\) are not corrupted by noise, we are able to solve ill-conditioned problems quite accurately by employing the projection into a small linear subspace as the only type of regularization. Working with noise-free data in this manner corresponds to committing an inverse crime. Nevertheless, we illustrate that our solution method is less sensitive to ill-conditioning than the standard Gauss–Newton method, and that, when the problem (1.1) is severely ill-conditioned, the standard Gauss–Newton method may fail to determine an accurate solution even under the scenario of an inverse crime.

The main computational effort of the Gauss–Newton method is the solution of linear least-squares problems of possibly large dimensions at each iteration. We would like to address this issue by using Generalized Krylov Subspaces (GKS). The subspaces we consider were first applied by Lampe et al. [30] to the solution of large linear discrete ill-posed problems by Tikhonov regularization. Subsequently they were used for the solution of nonconvex minimization problems in imaging and statistics; see, e.g., [9, 10, 27, 31]. The approximate solution \(\textbf{x}^{(k)}\in \mathbb {R}^n\) computed at iteration k lives in a generalized Krylov subspace \(\mathcal {V}_{k-1}\), and a projected residual error is included in the next generalized Krylov subspace \(\mathcal {V}_k\). The solution subspaces generated in this manner are nested and typically contain accurate approximations of the desired solution of (1.1) already when they are of fairly small dimension; in particular, the use of generalized Krylov subspaces performs better than using standard Krylov subspaces; see [7, 31] for illustrations.

Another possible bottleneck of the Gauss–Newton method for large-scale problems is the computation of the Jacobian matrix at each iteration. If this is very time-consuming, then instead of forming the Jacobian by evaluating all of its entries at each iteration, we approximate the Jacobian by using a secant (Broyden) update; see [6, 20]. When the iterates \(\textbf{x}^{(k)}\) and \(\textbf{x}^{(k+1)}\) are close, the secant update allows us to approximate the Jacobian of f at \(\textbf{x}^{(k+1)}\) by the Jacobian (or an approximation thereof) at \(\textbf{x}^{(k)}\) plus a low-rank correction. Therefore, we update the Jacobian for \(\tilde{k}\) iterations, and compute the Jacobian “from scratch” only when \(k\equiv 0\;\text{ mod }\; \tilde{k}\).

Finally, we observe that, if many iterations of the algorithms are performed, the dimension of the space \(\mathcal {V}_k\) may increase significantly and this, in turn, can slow down the computations. To avoid this situation, we propose a restarted variant based on the algorithms proposed in [11]. Every \(k_{\textrm{rest}}\) iterations the space \(\mathcal {V}_k\) is restarted, i.e., it is replaced by a one-dimensional space; see below. This strategy ensures that the dimension of the space \(\mathcal {V}_k\) is bounded regardless of the number of iterations. We will illustrate that this reduces the computing time.

It is the purpose of this paper to show that projection into generalized Krylov subspaces may be beneficial when solving large nonlinear least-squares problems (1.1). This approach is compared to a standard implementation of the Gauss–Newton method. We remark that many approaches to Newton-Krylov methods have been discussed in the literature; among the first papers on this topic are those written by Brown and Saad [4, 5], who consider the application of Arnoldi methods, which requires \(m=n\). Darvishi and Shin [13] and Kan et al. [28] provide more recent contributions. However, these methods do not focus on nonlinear least-squares problems and, moreover, consider the application of standard Krylov subspaces.

This paper is organized as follows. Section 2 briefly recalls the classical Gauss–Newton method. Our new solution approaches are described in Sect. 3, where we also show some properties of these methods. Section 4 presents a few numerical examples that illustrate the performance of our methods, and concluding remarks can be found in Sect. 5.

2 The Gauss–Newton Method

This section reviews the classical Gauss–Newton method applied to the solution of nonlinear least-squares problems (1.1). Let \(r(\textbf{x})=\textbf{y}-f(\textbf{x})\) be the residual function, where \(\textbf{x}\in \mathbb {R}^n\) represents an approximate solution of the problem (1.1), the function \(f:\mathbb {R}^n\rightarrow \mathbb {R}^m\) is nonlinear and Fréchet differentiable, and the vector \(\textbf{y}\in \mathbb {R}^m\) represents available data. We disregard for the moment that in some applications, the problem (1.1) might be severely ill-conditioned. The Gauss–Newton method computes the solution \(\textbf{x}^*\) of (1.1) by minimizing the Euclidean norm of the residual \(r(\textbf{x})\), i.e., it solves

$$\begin{aligned} \textbf{x}^*= \arg \min _{\textbf{x}\in \mathbb {R}^n} \left\| r(\textbf{x})\right\| ^2, \end{aligned}$$
(2.1)

where, for simplicity, we assume that the solution \(\textbf{x}^*\) is unique. The Gauss–Newton method is an iterative procedure that, at each iteration, linearizes the problem (1.1) and minimizes the norm of the residual of the linearized problem. Since we assumed that f is Fréchet differentiable, it follows that \(r(\textbf{x})\) satisfies this property as well. Therefore, we use the approximation

$$\begin{aligned} r\left( \textbf{x}^{(k+1)}\right) \simeq r\left( \textbf{x}^{(k)}\right) +J^{(k)} \textbf{q}^{(k)}, \end{aligned}$$

where \(\textbf{x}^{(k)}\) is the current approximation of \(\textbf{x}^*\). The vector \(\textbf{q}^{(k)}\) is referred to as the step, and \(J^{(k)} = J(\textbf{x}^{(k)}) \in \mathbb {R}^{m\times n}\) represents the Jacobian matrix of \(r(\textbf{x}) = [r_1(\textbf{x}),\ldots ,r_m(\textbf{x})]^T\) at \(\textbf{x}=\textbf{x}^{(k)}\). It is defined by

$$\begin{aligned} \left[ J(\textbf{x})\right] _{ij} = \frac{\partial r_i(\textbf{x})}{\partial x_j}, \quad i = 1, \ldots , m, \quad j = 1, \ldots , n. \end{aligned}$$

Let \(\textbf{x}^{(0)}\in \mathbb {R}^n\) be an initial approximation of the solution \(\textbf{x}^*\) of the nonlinear least-squares problem (2.1). We determine, for \(k=0,1,2,\ldots ~\), the step \(\textbf{q}^{(k)}\) by solving the linear least-squares problem

$$\begin{aligned} \min _{\textbf{q} \in \mathbb {R}^n} \left\| r\left( \textbf{x}^{(k)}\right) +J^{(k)} \textbf{q}\right\| ^2, \end{aligned}$$

and we compute the next approximation \(\textbf{x}^{(k+1)}\) of \(\textbf{x}^*\) as

$$\begin{aligned} \textbf{x}^{(k+1)} = \textbf{x}^{(k)} + \alpha ^{(k)} \textbf{q}^{(k)},\qquad k = 0,1,2,\ldots ~, \end{aligned}$$

where the coefficient \(\alpha ^{(k)}\) is determined according to the Armijo-Goldstein principle, see below, to ensure convergence to a stationary point of

$$\begin{aligned} \mathcal {J}(\textbf{x})=\left\| r(\textbf{x})\right\| ^2; \end{aligned}$$

see, e.g., [20, 25] for discussions. Since the functional \(\mathcal {J}\) might not be convex, it is in general not possible to show that the iterates \(\textbf{x}^{(k)}\) converge to a minimizer of \(\mathcal {J}\). We remark that if one knows that the solution \(\textbf{x}^*\) is nonnegative, then it is possible to determine \(\alpha ^{(k)}\) such that \(\textbf{x}^{(k)}\ge 0\) for all k. We will not dwell on this situation here, but we note that it is straightforward to extend our solution method to compute nonnegative solutions.

We say that \(\alpha ^{(k)}\) satisfies the Armijo-Goldstein condition if

$$\begin{aligned} \left\| r\left( \textbf{x}^{(k)}\right) \right\| ^2 - \left\| r\left( \textbf{x}^{(k)} + \alpha ^{(k)} \textbf{q}^{(k)}\right) \right\| ^2 \ge \frac{1}{2} \alpha ^{(k)} \left\| J^{(k)} \textbf{q}^{(k)}\right\| ^2. \end{aligned}$$
(2.2)

To determine a suitable \(\alpha ^{(k)}\), we apply a line search. Let \(\alpha _0^{(k)}>0\) be large enough. If \(\alpha _0^{(k)}\) satisfies the condition (2.2), then we set \(\alpha ^{(k)}=\alpha _0^{(k)}\), otherwise we let

$$\begin{aligned} \alpha _1^{(k)}=\frac{\alpha _0^{(k)}}{2}. \end{aligned}$$

We iterate in this manner until we find an \(\alpha _j^{(k)}\) that satisfies the condition (2.2) and then set \(\alpha ^{(k)}=\alpha _j^{(k)}\). The Gauss–Newton method with line search based on the Armijo-Goldstein condition yields a converging sequence of iterates \(\textbf{x}^{(k)}\), \(k=0,1,2,\ldots ~\), for any differentiable function f; see [25] for a proof. Algorithm 1 summarizes the computations.

Algorithm 1 The classical Gauss–Newton method
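For concreteness, the following Python sketch outlines one possible implementation of Algorithm 1 with the line search just described. It is an illustration only, not the implementation used for the experiments in Sect. 4 (which is written in MATLAB); in particular, the stopping rule based on the relative change of the iterates and the cap on the number of step-length halvings are choices of ours, not prescribed by the algorithm.

```python
import numpy as np

def gauss_newton(f, jac_f, y, x0, max_iter=100, tol=1e-5, alpha0=1.0):
    """Sketch of Algorithm 1: classical Gauss-Newton with an
    Armijo-Goldstein backtracking line search, cf. (2.2)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        r = y - f(x)                          # residual r(x) = y - f(x)
        J = -jac_f(x)                         # Jacobian of r is -J_f
        # Gauss-Newton step: minimize ||r(x^(k)) + J^(k) q||^2 over q
        q, *_ = np.linalg.lstsq(J, -r, rcond=None)
        # line search: halve alpha until the condition (2.2) is satisfied
        alpha, Jq2 = alpha0, np.linalg.norm(J @ q) ** 2
        for _ in range(60):                   # safeguard on the number of halvings
            if (np.linalg.norm(r) ** 2
                    - np.linalg.norm(y - f(x + alpha * q)) ** 2
                    >= 0.5 * alpha * Jq2):
                break
            alpha /= 2.0
        x_new = x + alpha * q
        # stopping rule (our choice): small relative change of the iterates
        if np.linalg.norm(x_new - x) <= tol * np.linalg.norm(x_new):
            return x_new
        x = x_new
    return x
```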

Theorem 1

Let \(f:\mathbb {R}^n\rightarrow \mathbb {R}^m\) be a Fréchet differentiable function and let \(\textbf{x}^{(k)}\), \(k=1,2,\ldots ~\), denote the iterates generated by Algorithm 1. There exists \(\textbf{x}^*\) such that

$$\begin{aligned} \lim _{k\rightarrow \infty }\left\| \textbf{x}^{(k)}-\textbf{x}^*\right\| =0, \end{aligned}$$

and \(\textbf{x}^*\) is a stationary point of \(\mathcal {J}(\textbf{x})=\left\| \textbf{y}-f(\textbf{x})\right\| ^2\). If \(\mathcal {J}\) is convex, then \(\textbf{x}^*\) is a global minimizer of \(\mathcal {J}\). Moreover, if \(\mathcal {J}\) is strictly convex, then there is a unique global minimizer that coincides with \(\textbf{x}^*\).

Proof

Proofs can be found in [3, 25]. \(\square \)

3 The Gauss–Newton Method in Generalized Krylov Subspaces

We would like to reduce the computational cost of the iterations with Algorithm 1 when applied to the solution of large-scale problems (1.1). We achieve this by determining approximate solutions of the problem (2.1) in a sequence of fairly low-dimensional solution subspaces \(\mathcal {V}_k\), \(k=1,2,\ldots ~\), whose dimensions increase with k. At iteration k, we first compute the approximation \(\textbf{x}^{(k+1)}\in \mathcal {V}_k\) of the solution \(\textbf{x}^*\) of (2.1), and then expand the solution subspace by including an appropriately chosen vector in \(\mathcal {V}_k\). This defines the next solution subspace \(\mathcal {V}_{k+1}\). Since \(d_k=\textrm{dim}(\mathcal {V}_k)\ll n\), the computation of \(\textbf{x}^{(k+1)}\) is less expensive than computing a new iterate with the standard Gauss–Newton method, which seeks to determine a solution in \(\mathbb {R}^n\).

Assume that at the k-th step an approximation \(\textbf{x}^{(k)}\in \mathcal {V}_{k-1}\subset \mathcal {V}_k\) of the solution \(\textbf{x}^*\) is known. Let the columns of the matrix \(V_k\in \mathbb {R}^{n\times d_k}\) form an orthonormal basis for the subspace \(\mathcal {V}_k\). Then

$$\begin{aligned} V_k^T V_k = I_{d_k}\quad \text{ and }\quad \textbf{x}^{(k)}=V_k\textbf{z}^{(k)}, \end{aligned}$$

where \(I_{d_k}\) denotes the identity matrix of order \(d_k\), \(\textbf{z}^{(k)}\in \mathbb {R}^{d_k}\), and the last entry of \(\textbf{z}^{(k)}\) vanishes since \(\textbf{x}^{(k)}\in \mathcal {V}_{k-1}\).

The residual at the k-th step is given by

$$\begin{aligned} r\left( V_k\textbf{z}^{(k)}\right) = \textbf{y}-f\left( \textbf{x}^{(k)}\right) =\textbf{y}-f\left( V_k \textbf{z}^{(k)}\right) . \end{aligned}$$
(3.1)

We determine a new approximate solution \(\textbf{x}^{(k+1)}\in \mathcal {V}_k\) as

$$\begin{aligned} \textbf{x}^{(k+1)}=V_k\textbf{z}^{(k+1)}, \end{aligned}$$

where the vector \(\textbf{z}^{(k+1)}\) is computed by applying the Gauss–Newton method to (3.1), i.e.,

$$\begin{aligned} \textbf{z}^{(k+1)}=\textbf{z}^{(k)}+\alpha ^{(k)} \textbf{q}^{(k)}, \end{aligned}$$
(3.2)

and the step \(\textbf{q}^{(k)}\) is obtained by solving the least-squares problem

$$\begin{aligned} \min _{\textbf{q}\in \mathbb {R}^{d_k}}\left\| r\left( V_k\textbf{z}^{(k)}\right) +J^{(k)}\textbf{q}\right\| ^2. \end{aligned}$$

With a slight abuse of notation, we denote the step by \(\textbf{q}\), as in the previous section, even though the vectors \(\textbf{q}\) are of different dimensions; in this section, the dimension depends on k. The matrix \(J^{(k)}\in \mathbb {R}^{m\times d_k}\) represents the Jacobian of r in (3.1) evaluated at the point \(\textbf{z}^{(k)}\). We have

$$\begin{aligned} J^{(k)} = -J_f\left( V_k\textbf{z}^{(k)}\right) V_k, \end{aligned}$$

where \(J_f\) denotes the Jacobian matrix of the nonlinear function f. To ensure convergence, the parameter \(\alpha ^{(k)}\) in (3.2) is determined by applying the Armijo-Goldstein principle, i.e., \(\alpha ^{(k)}\) is the largest member of the sequence \(\alpha ^{(0)}2^{-i}\), \(i=0,1,\ldots ~\), for which

$$\begin{aligned} \left\| r\left( V_k\textbf{z}^{(k)}\right) \right\| ^2 - \left\| r\left( V_k\left( \textbf{z}^{(k)} + \alpha ^{(k)} \textbf{q}^{(k)}\right) \right) \right\| ^2 \ge \frac{1}{2} \alpha ^{(k)} \left\| J^{(k)} \textbf{q}^{(k)}\right\| ^2. \end{aligned}$$

In our numerical experiments we set \(\alpha ^{(0)}=1\).
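The projected step can be expressed compactly; the following sketch shows one iteration in the coefficient space under the same assumptions as the sketch of Algorithm 1 (again, an illustration rather than the implementation used in Sect. 4).

```python
import numpy as np

def gks_step(f, jac_f, y, V, z, alpha0=1.0):
    """One Gauss-Newton step in the subspace spanned by the columns of
    V (of size n x d_k): solve the small least-squares problem for the
    step q and apply the Armijo-Goldstein backtracking rule."""
    r = y - f(V @ z)                          # r(V_k z^(k)), eq. (3.1)
    Jk = -jac_f(V @ z) @ V                    # projected Jacobian, m x d_k
    q, *_ = np.linalg.lstsq(Jk, -r, rcond=None)
    alpha, Jq2 = alpha0, np.linalg.norm(Jk @ q) ** 2
    for _ in range(60):                       # safeguard on the halvings
        if (np.linalg.norm(r) ** 2
                - np.linalg.norm(y - f(V @ (z + alpha * q))) ** 2
                >= 0.5 * alpha * Jq2):
            break
        alpha /= 2.0
    return z + alpha * q                      # z^(k+1), eq. (3.2)
```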

Once the new iterate \(\textbf{x}^{(k+1)}\) has been evaluated, we enlarge the solution subspace \(\mathcal {V}_k\) by determining a vector \(\textbf{v}^{(k+1)}\) that is orthogonal to \(\mathcal {V}_{k}\). To this end, let us first compute the Jacobian \(J_f\left( \textbf{x}^{(k+1)}\right) \) of f at \(\textbf{x}^{(k+1)}\). This computation also is required for determining \(\textbf{x}^{(k+2)}\) and, therefore, is carried out only once. Let

$$\begin{aligned} \textbf{g}^{(k+1)} = J_f\left( \textbf{x}^{(k+1)}\right) ^T\left( \textbf{y}-f\left( \textbf{x}^{(k)}\right) \right) . \end{aligned}$$

The intuition is that \(\textbf{g}^{(k+1)}\) is the residual of the normal equations associated with the linearized problem. However, in general, this vector is not orthogonal to the subspace \(\mathcal {V}_k\). We therefore explicitly orthogonalize it against this subspace, i.e., we compute

$$\begin{aligned} \tilde{\textbf{g}}^{(k+1)}=\textbf{g}^{(k+1)}-V_kV_k^T \textbf{g}^{(k+1)}, \end{aligned}$$

and then normalize the vector so obtained,

$$\begin{aligned} \textbf{v}^{(k+1)} = \frac{\tilde{\textbf{g}}^{(k+1)}}{\left\| \tilde{\textbf{g}}^{(k+1)}\right\| }. \end{aligned}$$

We store an orthonormal basis for \(\mathcal {V}_{k+1}\) as columns of the matrix \(V_{k+1}=\left[ V_k,\textbf{v}^{(k+1)}\right] \in \mathbb {R}^{n\times (d_k+1)}\).
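A sketch of the expansion step follows; the tolerance used to decide whether \(\tilde{\textbf{g}}^{(k+1)}\) vanishes numerically, and the extra reorthogonalization pass, are implementation choices of ours.

```python
import numpy as np

def expand_subspace(jac_f, f, y, V, x_new, x_old, tol=1e-12):
    """Append the normalized vector v^(k+1) to V, where v^(k+1) is the
    orthogonalized version of g^(k+1) = J_f(x^(k+1))^T (y - f(x^(k))).
    If g^(k+1) (numerically) lies in the current subspace, V is returned
    unchanged and the space is not enlarged in this iteration."""
    g = jac_f(x_new).T @ (y - f(x_old))
    g_tilde = g - V @ (V.T @ g)               # explicit orthogonalization
    g_tilde -= V @ (V.T @ g_tilde)            # one reorthogonalization pass
    nrm = np.linalg.norm(g_tilde)
    if nrm <= tol * np.linalg.norm(g):
        return V
    return np.column_stack([V, g_tilde / nrm])
```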

Although it is an extremely rare event when solving problems that arise from real applications, it is possible that \(\tilde{\textbf{g}}^{(k+1)}=\textbf{0}\), i.e., that \(\textbf{g}^{(k+1)}\in \mathcal {V}_k\). In this case, we simply do not enlarge the solution subspace in this iteration and proceed with the computations. In contrast to Krylov methods for the solution of linear systems of equations, such as GMRES, this is not a case of a “lucky breakdown” and one generally cannot terminate the iterations.

The last point to address is how to determine \(V_0\), i.e., the basis for the initial subspace. In our computed examples, we will choose an initial approximate solution \(\textbf{x}^{(0)}\ne \textbf{0}\) and set

$$\begin{aligned} V_0 = \frac{\textbf{x}^{(0)}}{\left\| \textbf{x}^{(0)}\right\| }. \end{aligned}$$

The computations are summarized by Algorithm 2.

Algorithm 2 A Gauss–Newton method using generalized Krylov subspaces
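Assuming the helper sketches `gks_step` and `expand_subspace` given above, an illustrative driver for Algorithm 2 might read as follows; the stopping rule is the same heuristic used in the sketch of Algorithm 1.

```python
import numpy as np
# gks_step and expand_subspace are the illustrative helpers sketched above

def gauss_newton_gks(f, jac_f, y, x0, max_iter=100, tol=1e-5):
    """Sketch of Algorithm 2: Gauss-Newton in generalized Krylov subspaces."""
    nrm0 = np.linalg.norm(x0)
    V = (np.asarray(x0, dtype=float) / nrm0).reshape(-1, 1)   # V_0
    z = np.array([nrm0])                                      # x^(0) = V_0 z^(0)
    x = V @ z
    for _ in range(max_iter):
        z = gks_step(f, jac_f, y, V, z)        # projected Gauss-Newton step
        x_new = V @ z
        V_new = expand_subspace(jac_f, f, y, V, x_new, x)
        if V_new.shape[1] > V.shape[1]:        # the space was enlarged:
            z = np.append(z, 0.0)              # last entry of z^(k+1) vanishes
        if np.linalg.norm(x_new - x) <= tol * np.linalg.norm(x_new):
            return x_new
        V, x = V_new, x_new
    return x
```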

We are now in a position to show some theoretical results.

Lemma 1

Let f be a Fréchet differentiable function. With the notation of Algorithm 2, it holds that for every k, there is a finite positive integer j such that

$$\begin{aligned} \left\| r\left( V_k\textbf{z}^{(k)}\right) \right\| ^2 - \left\| r\left( V_k\left( \textbf{z}^{(k)} + \alpha _j^{(k)} \textbf{q}^{(k)}\right) \right) \right\| ^2 \ge \frac{1}{2} \alpha _j^{(k)} \left\| J^{(k)} \textbf{q}^{(k)}\right\| ^2. \end{aligned}$$

Proof

This result follows from [25, Proposition 4.1]. \(\square \)

Theorem 2

Let f be Fréchet differentiable and let \(\textbf{z}^{(k)}\), \(k=1,2,\ldots ~\), denote the iterates generated by Algorithm 2. Define

$$\begin{aligned} \textbf{x}^{(k)}=V_k\textbf{z}^{(k)}. \end{aligned}$$

Then there is a vector \(\textbf{x}^*\) such that

$$\begin{aligned} \lim _{k\rightarrow \infty }\left\| \textbf{x}^{(k)}-\textbf{x}^*\right\| =0, \end{aligned}$$

and \(\textbf{x}^*\) is a stationary point of \(\mathcal {J}(\textbf{x})=\left\| \textbf{y}-f(\textbf{x})\right\| ^2\). If \(\mathcal {J}\) is convex, then \(\textbf{x}^*\) is a global minimizer of \(\mathcal {J}\). Moreover, if \(\mathcal {J}\) is strictly convex, then \(\textbf{x}^*\) is the unique global minimizer.

Proof

We first discuss the case when there is an index \(k_0\) such that \(\textbf{g}^{(k)}\in \mathcal {V}_{k_0}\) for all \(k>k_0\) and \(\textrm{dim}\left( \mathcal {V}_{k_0}\right) <n\). Then, after iteration \(k_0\), all subsequent iterates belong to \(\mathcal {V}_{k_0}\). Since the Armijo-Goldstein principle is satisfied, Algorithm 2 reduces to Algorithm 1 after \(k_0\) iterations and, therefore, the iterates converge to a stationary point of \(\mathcal {J}\) by Theorem 1.

On the other hand, if there is no index \(k_0\) such that \(\textbf{g}^{(k)}\in \mathcal {V}_{k_0}\) for all \(k>k_0\) and \(\textrm{dim}\left( \mathcal {V}_{k_0}\right) <n\), then there is an index \(k_1\) such that \(\textrm{dim}\left( \mathcal {V}_{k_1}\right) =n\) and, therefore, \(\mathcal {V}_{k}=\mathbb {R}^n\) for all \(k>k_1\). Convergence to a stationary point now follows from Theorem 1. \(\square \)

Remark 1

The proof of Theorem 2 would appear to suggest that one may need n iterations with Algorithm 2 to obtain an accurate approximation of the limit point. However, in problems that stem from real applications, typically only a few iterations are required to satisfy the stopping criterion. This is illustrated by numerical experiments in Sect. 4.

3.1 Secant Acceleration

In some applications the evaluation of the Jacobian \(J_f\) of f may be computationally expensive. We therefore discuss the updating strategy proposed by Broyden [6] that makes it possible to avoid the computation of \(J_f\) from scratch every iteration. A nice discussion of this updating strategy is provided by Dennis and Schnabel [20, Chapter 8]. Broyden [6] proposed the following secant update of the Jacobian matrix

$$\begin{aligned} J_f\left( V_{k+1}\textbf{z}^{(k+1)}\right) \approx J_f\left( V_k\textbf{z}^{(k)}\right) +H^{(k)}, \end{aligned}$$
(3.3)

where \(H^{(k)}\) is a rank-one matrix defined by

$$\begin{aligned} H^{(k)}=\dfrac{\Delta \textbf{r}^{(k)}-J_f\left( V_k \textbf{z}^{(k)}\right) \Delta \textbf{x}^{(k)}}{\left\| \Delta \textbf{x}^{(k)}\right\| ^2}\left( \Delta \textbf{x}^{(k)}\right) ^T, \end{aligned}$$
(3.4)

with \(\Delta \textbf{x}^{(k)}=\textbf{x}^{(k+1)}-\textbf{x}^{(k)}\) and \(\Delta \textbf{r}^{(k)}=\textbf{r}^{(k+1)}-\textbf{r}^{(k)}\). We then can compute an approximation of \(J_f(V_{k+1}\textbf{z}^{(k+1)})V_{k+1}\) as follows:

$$\begin{aligned} J_f\left( V_{k+1}\textbf{z}^{(k+1)}\right) V_{k+1} \approx \left( J_f\left( V_k\textbf{z}^{(k)}\right) +H^{(k)}\right) V_{k+1}. \end{aligned}$$
These updating formulas can be applied repeatedly, for \(k=0,1,\ldots ~\), with a suitable choice of \(H^{(0)}\). Then \(H^{(k)}\) is a low-rank matrix when k is fairly small; see [20]. The matrix \(H^{(k)}\) typically is not explicitly formed; instead the vectors that make up \(H^{(k)}\) are stored and used when evaluating matrix–vector products with \(H^{(k)}\). When the number of updating steps increases, the quality of the approximation of \(J_f\) by \(H^{(k)}\) at desired points \(\textbf{x}\) may deteriorate, but not by very much. Dennis and Schnabel [20, p. 176] refer to this behavior as “bounded deterioration”; see [20, Lemma 8.2.1] for bounds on the deterioration. When the approximation \(H^{(0)}\) of the Jacobian at the initial point \(\textbf{x}_{\textrm{init}}\) is sufficiently close to the Jacobian at the desired solution, \(J_f(\textbf{x}^*)\), the bounded deterioration of the quality of the approximations of \(H^{(k)}\) is mild enough to secure convergence to the solution \(\textbf{x}^*\) as k increases when repeatedly using the updating formula; see [20, Theorem 8.2.2] for details. Nevertheless, since it is difficult to assess whether the conditions of this theorem are satisfied, we recompute the Jacobian from scratch every \(\tilde{k}\) iterations. Algorithm 3 summarizes the computations.
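The update (3.3)-(3.4), together with the periodic recomputation of the Jacobian, can be sketched as follows. The dense representation is used only for readability; as remarked above, in practice one stores the vectors that define \(H^{(k)}\) and uses them in matrix-vector products.

```python
import numpy as np

def broyden_update(Jf, x_old, x_new, r_old, r_new):
    """Secant (Broyden) update of the current Jacobian approximation Jf,
    following (3.3)-(3.4); the correction is the rank-one matrix H^(k)."""
    dx = x_new - x_old
    dr = r_new - r_old
    u = (dr - Jf @ dx) / np.dot(dx, dx)
    return Jf + np.outer(u, dx)

# usage inside the iteration: recompute the Jacobian "from scratch" only
# every k_tilde iterations and apply the secant update in between
# if k % k_tilde == 0:
#     Jf = jac_f(x_new)
# else:
#     Jf = broyden_update(Jf, x_old, x_new, r_old, r_new)
```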

Algorithm 3 Gauss–Newton in GKS with secant update

3.2 Restarting Strategy

The main computational cost of the outlined solution method, if we ignore the computation of the Jacobian of f, is the solution of a linear system of equations of fairly small size at every iteration. However, if many iterations are carried out, these systems may become large and require a non-negligible computational effort to solve. To avoid this issue, we employ the restart strategy proposed in [11] for the Majorization–Minimization algorithm. Specifically, the solution method described in [11] restarts the generalized Krylov subspace when a fixed, user-specified number of iterations has been carried out; we denote by \(k_{\textrm{rest}}>1\) the number of iterations after which we restart the GKS. Since we set \(V_0\in \mathbb {R}^{n\times 1}\), we have that, if no restarting is used, \(\textrm{dim}(\mathcal {V}_k)=k+1\). To prevent the dimension of \(\mathcal {V}_k\) and, therefore, the size of \(J^{(k)}\) from becoming too large, we set, if \(k\equiv 0 \,\text{ mod }\, k_{\textrm{rest}}\),

$$\begin{aligned} V_k=\frac{\textbf{x}^{(k)}}{\left\| \textbf{x}^{(k)}\right\| }. \end{aligned}$$

This ensures that the number of columns of \(V_k\) never exceeds \(k_{\textrm{rest}}\) and that the computational effort per iteration does not increase too much with k. For large-scale problems, this also has the added benefit of reducing the memory requirement of the method. This may be important if only a small amount of fast memory is available or if the size of the problem to be solved is very large; see [11] for a discussion.
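The restart itself amounts to a few lines; the sketch below also resets the coefficient vector so that \(\textbf{x}^{(k)}=V_k\textbf{z}^{(k)}\) is preserved across the restart.

```python
import numpy as np

def maybe_restart(V, z, k, k_rest):
    """Every k_rest iterations replace V_k by the one-column matrix
    x^(k)/||x^(k)|| and reset z^(k) accordingly, so that V z is unchanged."""
    if k > 0 and k % k_rest == 0:
        x = V @ z
        nrm = np.linalg.norm(x)
        return (x / nrm).reshape(-1, 1), np.array([nrm])
    return V, z
```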

Since the spaces \(\mathcal {V}_k\) are not nested when restarts are used, the proof of Theorem 2 does not carry over. We can only state that, thanks to the Armijo-Goldstein rule, it holds that

$$\begin{aligned} \left\| \textbf{r}^{(k+1)}\right\| \le \left\| \textbf{r}^{(k)}\right\| ,\quad \forall k>0. \end{aligned}$$

Algorithm 4 summarizes the computations.

Algorithm 4 A restarted Gauss–Newton method using generalized Krylov subspaces

4 Numerical Examples

In this section, we apply the algorithms described in the previous sections to three nonlinear least-squares problems: a real-world problem from applied geophysics, a partial differential equation that appears in a number of applications such as in fuel ignition models of thermal combustion theory, and a simple problem for which the Jacobian is a very sparse matrix. For each problem we compare the algorithms in terms of accuracy of the computed solution, number of iterations required to satisfy the stopping criterion, and CPU time. We measure the accuracy by means of the Relative Reconstruction Error (RRE)

$$\begin{aligned} \textrm{RRE}(\textbf{x})=\frac{\left\| \textbf{x}-\textbf{x}_{\textrm{true}}\right\| }{\left\| \textbf{x}_{\textrm{true}}\right\| }, \end{aligned}$$

where \(\textbf{x}_{\textrm{true}}\) denotes the desired solution of the problem. For each algorithm we set the maximum number of iterations to \(K=100\) and the tolerance for the stopping criterion to \(\tau =10^{-5}\). In Algorithm 3 we set \(\tilde{k}=10\), i.e., we compute the Jacobian from scratch every \(\tilde{k}=10\) iterations. The updating formulas perform best if the updated Jacobian is close to the Jacobian at \(\textbf{x}_{\textrm{true}}\). We therefore do not use the updating formulas in Algorithm 3 for \(k=1,2,\ldots ,\tilde{k}\). Finally, we set in Algorithm 4 the number of iterations after which the space \(\mathcal {V}_k\) is restarted to \(k_{\textrm{rest}}=20\). All computations were carried out in MATLAB 2021b running on a laptop computer with an AMD Ryzen 7 5800HS CPU and 16GB of RAM.

A geophysics problem. We consider a nonlinear model from geophysics with the aim of reconstructing the distribution of the electrical conductivity and the magnetic permeability of the subsoil from measured data recorded with a ground conductivity meter (GCM). This device is made up of two coils, a transmitter and a receiver. The transmitter sends electromagnetic waves into the subsoil, while the receiver measures the induced electromagnetic field; see, e.g., [8, 14, 17,18,19] and references therein for more details on the mathematical model and [15, 16] for numerical results obtained.

We briefly recall how the model is defined. It is based on Maxwell’s equations. The subsoil is assumed to have a layered structure with n layers of thickness \(d_\ell \), \(\ell =1,2,\ldots ,n\). Layer \(\ell \) has electric conductivity \(\sigma _\ell \) and magnetic permeability \(\mu _\ell \). The last layer is assumed to be infinite. We define the so-called propagation constant as

$$\begin{aligned} u_\ell (\lambda ) = \sqrt{\lambda ^2 + {\textrm{i}}\sigma _\ell \mu _\ell \omega },\qquad {\textrm{i}}=\sqrt{-1}, \end{aligned}$$

where \(\lambda \) is the integration variable ranging from zero to infinity. It represents the ratio of the depth below the surface measured in meters and the inter-coil distance \(\rho \) (the distance between the transmitter and the receiver). The parameter \(\omega \) stands for the angular frequency, that is, \(2\pi \) times the frequency in Hertz.

The model, which describes the interaction between the GCM and the subsoil for the vertical and horizontal orientation of the device coils, is given by

$$\begin{aligned} {\left\{ \begin{array}{ll} \displaystyle M^V(\varvec{\sigma },\varvec{\mu };h, \omega , \rho ) = -\rho ^3 \int _0^\infty e^{-2\lambda h} \lambda ^2 R_{\omega ,0}(\lambda ) J_0(\rho \lambda )~d\lambda , \\ \displaystyle M^H(\varvec{\sigma },\varvec{\mu };h, \omega , \rho ) = -\rho ^2 \int _0^\infty e^{-2\lambda h} \lambda R_{\omega ,0}(\lambda ) J_1(\rho \lambda )~d\lambda , \\ \end{array}\right. } \end{aligned}$$

where \(\varvec{\sigma }=[\sigma _1,\ldots ,\sigma _n]^T\), \(\varvec{\mu }=[\mu _1,\ldots ,\mu _n]^T\), h represents the height of the device while measuring above the ground, \(J_s\) denotes the Bessel function of the first kind of order s, and \(R_{\omega ,0}(\lambda )\) is the reflection factor defined as

$$\begin{aligned} R_{\omega ,0}(\lambda ) = \frac{N_0(\lambda ) - Y_1(\lambda )}{N_0(\lambda ) + Y_1(\lambda )}. \end{aligned}$$

Here \(N_0(\lambda )=\lambda /({\textrm{i}}\mu _0\omega )\), \(\mu _0=4\pi \cdot 10^{-7}\)H/m is the magnetic permeability of free space and \(Y_1(\lambda )\) is computed by the recursive formula

$$\begin{aligned} Y_\ell (\lambda ) = N_\ell (\lambda )\frac{Y_{\ell +1}(\lambda )+N_\ell (\lambda )\tanh (d_\ell u_\ell (\lambda ))}{N_\ell (\lambda ) + Y_{\ell +1}(\lambda )\tanh (d_\ell u_\ell (\lambda ))}, \end{aligned}$$
(4.1)

for \(\ell =n-1,n-2,\ldots ,1\), where \(Y_\ell (\lambda )\) is the surface admittance at the top of each layer. The coefficient \(N_\ell (\lambda )= u_\ell (\lambda )/({\textrm{i}}\mu _\ell \omega )\) represents the characteristic admittance of the \(\ell \)-th layer and provides the initialization of the recursion formula (4.1): for \(\ell =n\), we set \(Y_n(\lambda )=N_n(\lambda )\).
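As an illustration of the forward model, the reflection factor can be evaluated with the recursion (4.1) as in the following sketch; the layer data and the angular frequency are placeholders, and the responses \(M^V\) and \(M^H\) would then be obtained by numerical quadrature of the integrals above. This is not the code used for the experiments below.

```python
import numpy as np

MU0 = 4e-7 * np.pi          # magnetic permeability of free space [H/m]

def reflection_factor(lam, sigma, mu, d, omega):
    """Reflection factor R_{omega,0}(lambda) of a layered subsoil, computed
    with the recursion (4.1). sigma and mu contain the conductivities and
    permeabilities of the n layers, d the thicknesses of the first n-1
    layers (the last layer is unbounded)."""
    sigma, mu, d = map(np.asarray, (sigma, mu, d))
    n = len(sigma)
    u = np.sqrt(lam ** 2 + 1j * sigma * mu * omega)   # propagation constants
    N = u / (1j * mu * omega)                         # characteristic admittances
    Y = N[n - 1]                                      # initialization Y_n = N_n
    for ell in range(n - 2, -1, -1):                  # layers n-1, ..., 1
        t = np.tanh(d[ell] * u[ell])
        Y = N[ell] * (Y + N[ell] * t) / (N[ell] + Y * t)
    N0 = lam / (1j * MU0 * omega)
    return (N0 - Y) / (N0 + Y)                        # Y is now Y_1(lambda)
```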

For simplicity, we focus on the reconstruction of the distribution of the electrical conductivity and assume the magnetic permeability to be known. To this end, we construct three test cases and consider three different profiles for the electrical conductivity: a Gaussian function, a continuous, but not everywhere differentiable function (which we refer to as “triangular”), and a step function. It is known that in this application the model requires the solution to be differentiable; therefore, the last two cases are very challenging to solve. When we construct our test cases, we assume that there is no noise except for round-off errors; thus, we commit an inverse crime. However, we remark that the Jacobian of f is so ill-conditioned that the perturbations introduced by round-off errors produce significant fluctuations in the computed solutions.

We set \(\textbf{x}^{(0)}\) to be a constant vector. Since the problem is non-convex and underdetermined, the choice of \(\textbf{x}^{(0)}\) is important. For the Gaussian and triangular profiles, we set \(\textbf{x}^{(0)}=\left[ 0.5,\,0.5,\ldots ,0.5\right] ^T\), while for the step function we set \(\textbf{x}^{(0)}=\left[ 1.5,\,1.5,\ldots ,1.5\right] ^T\). Moreover, we calibrate the instrument so that measurements are taken at 10 different heights, i.e., \(m=10\), and the subsoil is assumed to be composed of 100 layers, i.e., \(n=100\).

Table 1 reports results obtained with Algorithms 1, 2, 3, and 4. Since the problems considered are extremely ill-conditioned, the Gauss–Newton algorithm produces poor reconstructions. Therefore, similarly to the Levenberg-Marquardt scheme (see, e.g., [26] and references therein) we regularize the inversion of the Jacobian by applying Tikhonov regularization; see, e.g., [21] for a discussion and analysis of Tikhonov regularization. In particular, we solve

$$\begin{aligned} \textbf{q}^{(k+1)}=\arg \min _{\textbf{q}\in \mathbb {R}^{n}} \left\{ \left\| \textbf{r}^{(k)}+J^{(k)}\textbf{q}\right\| ^2+10^{-4}\left\| \textbf{q}\right\| ^2 \right\} , \end{aligned}$$

and refer to the algorithm so defined as “Algorithm 1 Regularized”. Here \(10^{-4}\) is the regularization parameter and \(\left\| \textbf{q}\right\| ^2\) is the regularization term.

We can observe that, even when regularized, the Gauss–Newton algorithm fails to provide an accurate approximation of the desired solution, and it is significantly slower than Algorithms 2, 3, and 4 both in terms of the number of iterations required and CPU time. The latter three algorithms provide satisfactory reconstructions for all three problems and are computationally inexpensive. Since very few iterations are performed, Algorithms 2 and 4 produce essentially indistinguishable results. Moreover, the use of the Broyden update reduces the computational cost with little to no impact on the accuracy of the computed solution. We point out that Algorithms 2, 3, and 4 do not require fine-tuning of any parameters to perform well, and they do not require Tikhonov regularization. The number of restart iterations \(k_{\textrm{rest}}\) can be fixed to 20 in all cases and, in our experience, the method is quite robust with respect to the choice of this parameter.

Figure 1 displays the reconstructions determined by the regularized Gauss–Newton method and Algorithms 2 and 3. We do not show the graphs determined by Algorithm 1, as these computed solutions are not meaningful approximations of the desired solutions. Moreover, we do not display the approximations computed by Algorithm 4, as they are nearly identical to those obtained by Algorithm 2.

Note that we do not propose here to combine the secant update with the restarting of the solution subspace. The reason is twofold. First, we do not want to complicate the algorithm further by introducing multiple parameters that need to be tuned. Second, in our computed examples few iterations are performed, so the possible computational advantage of this combination would be negligible. Moreover, the resulting algorithm would be highly heuristic and might behave erratically. Therefore, at this time we do not consider combining these two acceleration techniques.

Table 1 Geophysics problem: RRE, number of iterations performed, and CPU times obtained with the considered algorithms for the three different profiles of the electrical conductivity
Fig. 1 Geophysics problem: reconstruction of the Gaussian (a), triangular (b), and step function (c) test profiles. The black dotted curves show the exact solutions, the magenta dashed-dotted curves are approximations determined by the regularized Gauss–Newton method, the blue solid curves show approximate solutions computed by Algorithm 2, and the red dashed curves display approximate solutions obtained with Algorithm 3. We do not report the approximations computed by Algorithm 1 as they are not meaningful, nor the ones obtained by Algorithm 4 since they are indistinguishable from those returned by Algorithm 2 (Color figure online)

Partial differential equation. This example describes a nonlinear partial differential equation that can be solved with the Gauss–Newton method. Specifically, we consider the Bratu problem

$$\begin{aligned} \left\{ \begin{array}{rcl} -\Delta x(s,t)+\alpha x_s(s,t)+\lambda \textrm{e}^{x(s,t)}&{}=&{}y,\quad (s,t)\in \Omega ,\\ x(s,t)&{}=&{}0,\quad (s,t)\in \partial \Omega , \end{array} \right. \end{aligned}$$
(4.2)

where \(\Delta x(s,t)\) stands for the Laplacian of x, \(x_s\) denotes the partial derivative of x along the s-direction, and \(\alpha \) is a constant. We are interested in investigating how the solution changes when the parameter \(\lambda \) is varied; see, e.g., [12, 24, 29] for discussions and illustrations of these kinds of path-following problems.

We let \(\Omega = [-3,3]^2\) and discretize the problem with standard 2nd order finite differences on an \(n\times n\) equispaced grid. The discretized problem can be written as

$$\begin{aligned} \arg \min _{\textbf{x}\in \mathbb {R}^{n^2}}\left\| \textbf{y}-\left( L\textbf{x}+\alpha D\textbf{x}+\lambda \textrm{e}^{\textbf{x}}\right) \right\| ^2, \end{aligned}$$

where the entries of \(\textbf{x}\) represent the discretized solution arranged in lexicographical order, \(L=L_1\otimes I+I\otimes L_1\), \(D=D_1\otimes I\), where \(\otimes \) denotes the Kronecker product, and the exponential is meant element-wise. We let

$$\begin{aligned} L_1=\begin{bmatrix} 2 & -1 & & & \\ -1 & 2 & -1 & & \\ & \ddots & \ddots & \ddots & \\ & & -1 & 2 & -1 \\ & & & -1 & 2 \end{bmatrix}\in \mathbb {R}^{n\times n}\quad \text{ and }\quad D_1=\begin{bmatrix} -1 & 1 & & \\ & \ddots & \ddots & \\ & & -1 & 1 \\ & & & -1 \end{bmatrix}\in \mathbb {R}^{n\times n}. \end{aligned}$$

For simplicity, once we compute the sampling of the function x and denote it by \(\textbf{x}_{\textrm{true}}\), we construct \(\textbf{y}\) by

$$\begin{aligned} \textbf{y}=L\textbf{x}_{\textrm{true}}+\alpha D\textbf{x}_{\textrm{true}}+\lambda \textrm{e}^{\textbf{x}_{\textrm{true}}}, \end{aligned}$$

i.e., we disregard approximation errors in the PDE as well as the scaling.

In our test, we let \(x(s,t)=\textrm{e}^{-10(s^2+t^2)}\) and sample x on a \(100\times 100\) equispaced grid on the square \([-3,3]^2\). Therefore, in (1.1), \(f:\mathbb {R}^{10^4}\rightarrow \mathbb {R}^{10^4}\). The Jacobian of f at the point \(\textbf{x}\) is given by

$$\begin{aligned} J_f=L+\alpha \cdot D+\lambda \cdot \textrm{diag}(\textrm{e}^{\textbf{x}}). \end{aligned}$$
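The discretized operator and its Jacobian are inexpensive to assemble with sparse Kronecker products; the following sketch (with the grid scaling disregarded, as above) shows one way to do so.

```python
import numpy as np
import scipy.sparse as sp

def bratu_operators(n):
    """Sparse matrices L = L1 (x) I + I (x) L1 and D = D1 (x) I."""
    I = sp.identity(n, format="csr")
    L1 = sp.diags([-np.ones(n - 1), 2 * np.ones(n), -np.ones(n - 1)],
                  offsets=[-1, 0, 1], format="csr")
    D1 = sp.diags([-np.ones(n), np.ones(n - 1)], offsets=[0, 1], format="csr")
    return (sp.kron(L1, I) + sp.kron(I, L1)).tocsr(), sp.kron(D1, I).tocsr()

def bratu_f(x, L, D, alpha, lam):
    """Discretized operator f(x) = L x + alpha D x + lam exp(x)."""
    return L @ x + alpha * (D @ x) + lam * np.exp(x)

def bratu_jacobian(x, L, D, alpha, lam):
    """Jacobian J_f = L + alpha D + lam diag(exp(x)), a sparse matrix."""
    return L + alpha * D + lam * sp.diags(np.exp(x))
```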

Since we impose zero Dirichlet boundary conditions, L and D are Block Toeplitz with Toeplitz Block (BTTB) matrices. Therefore, the matrix \(J_f\) is a Generalized Locally Toeplitz (GLT) matrix; see [1, 2, 22, 23]. The GLT theory furnishes a tool for studying the behavior of the singular values of \(J_f\). Providing a complete and precise analysis of the behavior of the singular values of \(J_f\) is outside the scope of this paper. Here we just note that the singular values of \(J_f\) can be approximated quite accurately, if \(n^2\) is large enough, by a uniform sampling over \([0,\pi ]^2\) of the modulus of the GLT symbol of \(J_f\). Let \(x:[0,1]^2\rightarrow \mathbb {R}\) be the function that \(\textbf{x}\) is a sampling of. Then, the GLT symbol of \(J_f\) is given by

$$\begin{aligned} S(\theta ,\phi ,s,t)=(2-2\cos (\theta ))+(2-2\cos (\phi ))+\alpha (-1+\cos (\theta )- {\textrm{i}}\sin (\theta ))+\lambda \textrm{e}^{x(s,t)}, \end{aligned}$$

where \({\textrm{i}}=\sqrt{-1}\), \((\theta ,\phi )\in [-\pi ,\pi ]^2\), and \((s,t)\in [0,1]^2\). It is straightforward to see that, if \(\alpha \) is large and \(\lambda \) is small, then the singular values of \(J_f\) decay rapidly to 0 with increasing index and, therefore, \(J_f\) is ill-conditioned. On the other hand, if \(\lambda \) is large and \(\alpha \) is small, then the singular values decay slowly with increasing index and, therefore, \(J_f\) is well-conditioned. This is confirmed by Fig. 2, which displays the approximated singular values of \(J_f\) for two choices of \(\alpha \) and \(\lambda \), namely \((\alpha ,\lambda )=(1,10)\) and \((\alpha ,\lambda )=(10,1)\), with \(\textbf{x}=\textbf{x}_{\textrm{true}}\). We observe that in the first case the matrix \(J_f\) is very well-conditioned, while in the latter case the matrix is poorly conditioned. We can estimate the condition number of \(J_f\) in these two cases with the MATLAB function condest. In the first case, i.e., the well-conditioned one, the computed condition number is \(\kappa _2\approx 3.3\), while in the latter case, i.e., the ill-conditioned one, we obtain \(\kappa _2\approx 1.7\cdot 10^{30}\). This illustrates that the conditioning for large values of \(\alpha \) is far worse than what the GLT theory predicts.
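The described sampling of the symbol can be imitated as follows; this is a rough sketch in which the grid sizes and the rescaled profile `x_func` are placeholders, not the procedure used to produce Fig. 2.

```python
import numpy as np

def symbol_singular_values(alpha, lam, x_func, n_grid=20):
    """Approximate the singular values of J_f by sampling the modulus of
    the GLT symbol S(theta, phi, s, t) on a uniform grid of
    [0, pi]^2 x [0, 1]^2 and sorting the samples in decreasing order."""
    theta = np.linspace(0.0, np.pi, n_grid)
    s = np.linspace(0.0, 1.0, n_grid)
    TH, PH, SS, TT = np.meshgrid(theta, theta, s, s, indexing="ij")
    symbol = ((2 - 2 * np.cos(TH)) + (2 - 2 * np.cos(PH))
              + alpha * (-1 + np.cos(TH) - 1j * np.sin(TH))
              + lam * np.exp(x_func(SS, TT)))
    return np.sort(np.abs(symbol).ravel())[::-1]
```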

We illustrate the performance of Algorithms 1, 2, and 4 for several choices of \(\alpha \) and \(\lambda \). In particular, we let \((\alpha ,\lambda )\in \{1,\;2,\ldots ,\;10\}^2\). Since we observed that the standard Gauss–Newton method stops after a single iteration for some choices of \(\alpha \) and \(\lambda \), we forced this method to carry out at least 5 iterations.

Table 2 reports the means, standard deviations, maximum values, and minimum values for the RRE, number of iterations, and CPU times required for Algorithms 1, 2, and 4. Since the computation of \(J_f\) is very cheap, we do not consider Algorithm 3 in this example. We note that, on average, Algorithms 1 and 4 yield less accurate computed solutions than Algorithm 2. However, for some parameter pairs \((\alpha ,\lambda )\) the accuracy obtained with Algorithms 1 and 4 is much higher than with Algorithm 2. Nevertheless, the approximate solutions computed by the latter algorithms are always fairly accurate, even in the worst case. On the other hand, the Gauss–Newton method determines very poor reconstructions for several choices of \(\alpha \) and \(\lambda \). Figure 3a, d, g show \(\log _{10}(\textrm{RRE})\) for each choice of the parameters. We can observe that, for a fairly large number of choices of \(\alpha \) and \(\lambda \), the Gauss–Newton method performs quite poorly in terms of accuracy, while Algorithms 2 and 4 always provide accurate reconstructions. This is due to the fact that the projection into the generalized Krylov subspace regularizes the problem without reducing the accuracy of the computed solutions.

Fig. 2 Partial differential equation test case: approximation of the singular values of \(J_f\) via the GLT theory for two different pairs of parameters \((\alpha ,\lambda )\). The solid black line is obtained for \(\alpha =1\) and \(\lambda =10\), while the dashed gray line is for \(\alpha =10\) and \(\lambda =1\)

Table 2 Partial differential equation test case: means, standard deviations, maximum values, and minimum values for the RRE, number of iterations, and CPU times obtained for Algorithms 1, 2, and 4

The CPU times reported in Table 2 and in Fig. 3c, f, i illustrate that Algorithms 2 and 4 are faster in terms of CPU time than Algorithm 1, even though the first two methods usually require more iterations to converge than Algorithm 1. We can also observe that the speed-up obtained by using the restarting strategy is significant, as it reduces the average computational cost by a factor of almost 10.

Fig. 3 Partial differential equation test case: RRE, number of iterations, and CPU times obtained with Algorithms 1, 2, and 4 for each choice of \((\alpha ,\lambda )\in \{1,\;2,\ldots ,\;10\}^2\) in (4.2). The first row reports the results obtained with Algorithm 1, the second row shows results for Algorithm 2, and the third row contains results for Algorithm 4

We now discuss how the CPU time changes when the dimension of the problem increases. We fix \(\alpha =5\) and \(\lambda =10\) and let \(n\in \{100, 150,\ldots ,1000\}\). Therefore, when \(n=1000\), we have \(f:\mathbb {R}^{10^6}\rightarrow \mathbb {R}^{10^6}\). We run the considered algorithms and plot the CPU times in Fig. 4. We can observe that, obviously, the CPU time required to solve the problem increases with n. However, the rate of increase is substantially lower when the projection into the generalized Krylov subspace is employed. In particular, using Algorithm 4, we are able to solve a nonlinear problem with \(10^6\) unknowns in less than 10 seconds on a laptop.

Fig. 4 Partial differential equation test case: CPU times obtained with Algorithms 1, 2, and 4 with \((\alpha ,\lambda )=(5,10)\) in (4.2) for increasing dimensions of the problem. The dotted gray line reports the results obtained with Algorithm 1, the solid black line shows results for Algorithm 2, and the dashed black line shows results for Algorithm 4

An extremely sparse problem. In this example, we would like to show a very particular situation in which Algorithms 2 and 3 are not faster than Algorithm 1. However, we hasten to point out that this situation arises very seldom. We introduce the nonlinear differentiable function \(f:\mathbb {R}^n\rightarrow \mathbb {R}^{n-1}\) given by

$$\begin{aligned} f(\textbf{x})=\sin \left( \begin{bmatrix} x_1+x_2\\ x_2+x_3\\ \vdots \\ x_{n-1}+x_n \end{bmatrix}\right) , \end{aligned}$$

where the \(\sin \) operation is meant element-wise. We construct the problem by choosing as the exact solution of (1.1) the vector \(\textbf{x}_{\textrm{true}}\in \mathbb {R}^n\) obtained by sampling the function \(t\mapsto \frac{1}{2}\sin (t)\), \(t\in (-\pi ,\pi )\), on an equispaced grid with \(n=10^{3}\) points. The Jacobian \(J_f(\textbf{x})\in \mathbb {R}^{(n-1)\times n}\) of f is bidiagonal and given by

$$\begin{aligned} J_f(\textbf{x})=\begin{bmatrix} \cos (x_1+x_2) & \cos (x_1+x_2) & & & \\ & \cos (x_2+x_3) & \cos (x_2+x_3) & & \\ & & \ddots & \ddots & \\ & & & \cos (x_{n-1}+x_n) & \cos (x_{n-1}+x_n) \end{bmatrix}. \end{aligned}$$
Least-squares problems with such a matrix can be solved very inexpensively.
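For completeness, the test function and its sparse Jacobian can be set up as in the following sketch; the sparse least-squares problems could then be solved, for instance, with scipy.sparse.linalg.lsqr.

```python
import numpy as np
import scipy.sparse as sp

def f_sparse(x):
    """f(x) = sin([x_1 + x_2, ..., x_{n-1} + x_n]), element-wise."""
    return np.sin(x[:-1] + x[1:])

def jac_f_sparse(x):
    """Bidiagonal (n-1) x n Jacobian: row i has the entry cos(x_i + x_{i+1})
    in columns i and i+1 (1-based indexing)."""
    c = np.cos(x[:-1] + x[1:])
    n = x.size
    return sp.diags([c, c], offsets=[0, 1], shape=(n - 1, n), format="csr")
```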

Table 3 reports results obtained for Algorithms 1 and 2. We note that the latter algorithm determines a more accurate approximate solution than the former. However, since the solution of least-squares problems with the sparse matrix \(J_f\) is very inexpensive to compute and Algorithm 1 carries out fewer iterations than Algorithm 2, the overall cost of the former algorithm is slightly smaller than that of the latter. Nevertheless, Algorithm 2 reduces the cost per iteration significantly, when compared with Algorithm 1, and provides more accurate reconstructions. Therefore, one may still want to use Algorithm 2 even if it is slightly more expensive.

Note that, since Algorithm 2 carried out fewer than \(k_{\textrm{rest}}\) iterations, we do not report the results for Algorithm 4 as they are identical to the ones obtained with Algorithm 2.

Table 3 An extremely sparse problem: RRE, number of iterations performed, and CPU times for Algorithms 1 and 2

5 Conclusion and Extensions

This paper presents new implementations of the Gauss–Newton algorithm based on the use of generalized Krylov subspaces. The approach described can easily be extended to other nonlinear optimization methods, such as the Levenberg-Marquardt method. We have shown that Algorithm 2 determines approximate solutions that converge to a stationary point of the minimized functional. Several numerical examples show that Algorithms 2, 3, and 4 outperform the standard Gauss–Newton method. Extensions to regularized problems as well as to other iterative algorithms are presently being developed.