A Flow Perspective on Nonlinear Least-Squares Problems

Just as the damped Newton method for the numerical solution of nonlinear algebraic problems can be interpreted as a forward Euler timestepping on the Newton flow equations, the damped Gauß–Newton method for nonlinear least squares problems is equivalent to forward Euler timestepping on the corresponding Gauß–Newton flow equations. We highlight the advantages of the Gauß–Newton flow and the Gauß–Newton method from a statistical and a numerical perspective in comparison with the Newton method, steepest descent, and the Levenberg–Marquardt method, which are respectively equivalent to Newton flow forward Euler, gradient flow forward Euler, and gradient flow backward Euler. We finally show an unconditional descent property for a generalized Gauß–Newton flow, which is linked to Krylov–Gauß–Newton methods for large-scale nonlinear least squares problems. We provide numerical results for large-scale problems: An academic generalized Rosenbrock function and a real-world bundle adjustment problem from 3D reconstruction based on 2D images.


Introduction
We consider nonlinear least-squares problems of the form where f : R n → R m is twice continuously differentiable. The necessary optimality conditions of (1) require that in a minimum x * ∈ R n the gradient of the objective function vanishes where f denotes the Jacobian of f . Equation (2) is called the normal equation. It is important to realize that the normal equation is only a necessary condition for a minimum, which is also satisfied in other spurious stationary points (maxima, saddle points) of the objective. Stationary points can often be characterized by their Hessian which splits up into a first order term H based solely on f and a second-order term Q, in which the current residual f and its Hessian tensor enter (but no first order derivatives).
Example 1 (Parameter estimation) The main area of application for this type of problem is maximum likelihood estimation for parameter estimation problems with normally distributed measurement errors (see, e.g., [4]): In this case, we would like to determine an unknown but deterministic model parameterx ∈ R n , which parametrizes a deterministic model response function h : R n → R m . We assume we can obtain measurements from the real-world system modeled by h andx with normally distributed measurements η ∈ R m with expected value h(x) and known covariance , which is an m-by-m symmetric positive semi-definite matrix. With = − 1 2 , the maximum likelihood estimation problem becomes a nonlinear least squares problem of the form (1) by setting Obviously, the measurements η enter linearly into f and thus all derivatives of f are independent of η. In particular, they only enter in the Hessian F (x) through the second order term Q in a multiplicative fashion.

Related previous work
The Gauß-Newton method (see, e.g., [4,5,17,25]), which solves a sequence of linear least-squares problems (linearized within the norm), is the method of choice to solve problem (1) in many cases. Like all Newton-type methods, it is in general only locally convergent. Because its search direction is a descent direction for the objective, line-search methods such as [3] are applicable to enforce convergence from arbitrary starting points. Alternatively, the Levenberg-Marquardt method [15,16] uses an adaptive Tikhonov-type regularization to enforce convergence, which is equivalent to a trust-region globalization [17,27]. Starting with the seminal paper by Davidenko [7], a large body of literature is dedicated to the construction of globalization methods by applying time-stepping methods for ordinary differential equations to the Newton flow equations (see, e.g., [6,[8][9][10]14]) or generalized Newton flow equations [22,23,28]. Similarly, gradient descent and proximal point algorithms can be interpreted as forward or backward Euler time-stepping on the gradient flow equations (we refer the reader to the discussion in [20,Sec. 4]). A detailed study of the Gauß-Newton flow equations appears to be missing up to now.
Contributions and structure of the paper We review the local convergence of the Gauß-Newton method and the notion of statistically stable solutions, which are defined by a spectral radius condition (Section 2). We introduce the Gauß-Newton flow and compare it with the Newton flow and gradient flow (Section 3). The Levenberg-Marquardt method is recovered as a backward Euler timestepping on the gradient flow, which means that it is a proximal point method. The damped Gauß-Newton method is a forward Euler timestepping scheme on the Gauß-Newton flow equations. We illustrate certain salient features of the different flows on the well-known Rosenbrock function. We also learn from this example that the local spectral radius condition has no meaningful global counterpart along the Gauß-Newton flow, which in turn implies that full steps have to be taken locally to guarantee convergence to only statistically stable solutions (Section 4). For the case of inexact Gauß-Newton methods based on approximate solutions of the linearized least-squares systems with the aid of Krylov subspace methods, we show that the LSQR search direction [19] is always a descent direction (Section 5). Finally, we provide in Section 6 numerical results for Krylov-Gauß-Newton methods on large-scale academic and real-world problems: A generalized Rosenbrock function and a bundle adjustment problem from 3D reconstruction based on 2D images.

The Gauß-Newton Method
The (local) Gauß-Newton method generates a sequence (x k ) k ⊂ R n from an initial solution guess x 0 ∈ R n via the iteration where the increments s k ∈ R n are the solutions of the linear least-squares problems (linearization under the norm) This problem is convex and always has a solution. If f (x k ) has full column rank, the solution is uniquely given by the Moore-Penrose pseudoinverse f ( There is also a different viewpoint: The Gauß-Newton method is a Newton-type method on the normal equation (2), where we discard the second-order term Q in the approximation of the inverse of the Hessian F , i.e., At the center of the analysis of the Gauß-Newton method and the Gauß-Newton flow (to be described in Section 4) is the matrix Lemma 1 If f (x) has full column rank, then the spectrum of G(x) is real and its eigenvectors are real and orthogonal with respect to the scalar product endowed by is symmetric and f (x) has full rank, H (x) = f (x) T f (x) is positive definite, which implies that (λ, v) are real and the orthogonality of the eigenvectors with respect to H (x) (see, e.g., [21]).
It is well-known that the local convergence of the Gauß-Newton method to a solution x * ∈ R n of the normal equations (2) is only guaranteed by Ostrowski's theorem [13,18] if the spectral radius condition κ GN := ρ G(x * ) < 1 ( 4 ) holds. The convergence is linear with rate equal to κ GN . Conversely, if κ GN > 1, it is easy to see that there exists an arbitrarily small perturbation of x * in the direction of principal eigenvector of the matrix in (4), which leads the iterates away from x * .
Because the residuals f (x) enter multiplicatively in Q(x), solutions with small residual will satisfy the condition (4). It might appear as a disadvantage of the Gauß-Newton method that it will not be attracted by solutions of the normal equation (2) that violate (4) in comparison with, e.g., a Newton method, which exhibits local convergence to any solution to the normal equation (2). For parameter estimation problems, however, this behavior is actually an advantage: Example 2 (Mirror trick, see [5]) We continue with Example 1. Assume we have obtained a minimizer x * ∈ R n of (1) for which κ GN > 1. By Lemma 1, there exists an eigenvalue λ ∈ R n of G(x * ) with modulus greater than one. Let v ∈ R n \ {0} denote the eigenvector of G(x * ) corresponding to λ, which implies that Because x * is a minimizer, the Hessian must be positive semi-definite, which together with (5) leads to If f (x * ) has full column rank, then λ > −1 and because |λ| = κ GN > 1 it follows that λ > 1.
By our statistical assumptions, the measurement errors are symmetrically distributed around the true model response. Let us, in a thought experiment, mirror the measurements η i around h i (x * ), i.e., we generate statistically plausible measurements The corresponding least-squares residuals arẽ Because we changed f only by a constant, the derivatives off and f coincide. Hence, x * also solves the normal equations (2) forf due toF For the Hessian, the sign of the Q term swaps becausẽ From (6) and (5), we obtain Because λ > 1, this shows that the HessianF (x * ) has negative curvature in the direction of v if f (x * ) has full column rank. It is, thus, not a minimizer of (1) with the mirrored measurements. In this sense, x * is not a statistically stable solution of (1) for the original measurements η. It is indeed advantageous that the Gauß-Newton method will not be attracted to such spurious solutions of the normal equation (2), in contrast to, e.g., the Newton method.

Continuous Flows and Timestepping Methods for Least-Squares Problems
In connection with (1), we give three examples of ordinary differential initial value problems. The corresponding flows are the Newton flow, the gradient flow, and the Gauß-Newton flow. Together with appropriate timestepping in the form of forward and backward Euler methods, we can derive the most popular globalization techniques for solving nonlinear least-squares problems.

The Newton Flow
We can derive the Newton flow by embedding the normal equation (2) into a family of root-finding problems that depend on a parameter t ∈ [0, ∞) according to Obviously, x 0 is a solution for t = 0 and we obtain a solution of the normal equations for t = ∞. If F (x 0 ) is invertible, the Implicit Function Theorem guarantees a local solution A forward Euler step on the Newton flow equation (7) delivers one Newton step We strongly discourage the use of the Newton method for solving (1), because the iterates are attracted to spurious solutions of the normal equation (2) such as maxima or saddle points of (1). Furthermore, the use of the full Hessian is necessarily positive semi-definite, must entail a pointt at which F (x(t)) is singular by the intermediate value theorem applied to the (real) eigenvalues of F (x(t)). In this case, the Newton flow and the Newton step computation break down.

The Gradient Flow
A flow that is only attracted to minima is the gradient floẇ Its equilibria are exactly the solutions of the normal equation (2) and equilibria x * with indefinite or negative definite Hessian F (x * ) are not asymptotically stable. It is interesting to note that the objective decreases along the gradient flow d dt A forward Euler discretization leads to the method of steepest descent (also called gradient descent) which can be extremely slow for medium to badly conditioned problems. A backward Euler discretization leads to the step equations These equations are the necessary optimality conditions of Thus, backward Euler on the gradient flow delivers the Levenberg-Marquardt method, which adds a Tikhonov regularization term. This method has a strong link to trust-region methods, there the inverse steplength 1/t k plays the role of an optimal Lagrange multiplier for a quadratic trust-region constraint.

The Gauß-Newton Flow
The Gauß-Newton flow is given bẏ As for the gradient flow (8), the objective decreases along the Gauß-Newton flow because d dt Forward Euler timestepping delivers the damped Gauß-Newton method Backward Euler timestepping yields the step equations which could be interpreted as a variable metric Levenberg-Marquard method. Unfortunately, the step equation cannot be readily interpreted as an optimality condition. If we approximate H (x k+1 ) by H (x k ), we essentially arrive again at a damped Gauß-Newton method with damping factor 1/(1 + t k ). An interesting alternative interpretation of the Gauß-Newton method is to embed the original problem (1) in a family of problems parameterized by t ∈ [0, ∞), whose solution is x 0 for t = 0 and a solution x * of (1) for t = ∞: This gives rise to a residual in the normal equations of the form Using the implicit function theorem analogously to the Newton flow case, we arrive at the initial value problem (11) If f (x 0 ) has full column rank, then a local solution x(t) exists and the initial flow direction equals the Gauß-Newton flowẋ There is no guarantuee, however, that the matrix on the left-hand side of (11) stays invertible for all t ≥ 0. Nonetheless, we can embed the solution of (10) with a finite t in an outer loop, in which we sequentially update x 0 by an approximate solution of (10). This leads to a sequential homotopy method for Gauß-Newton methods similar to the one proposed in [24] for inexact Sequential Quadratic Programming methods.

Comparison of the Newton, Gradient, and Gauß-Newton Flows
We compare the different flows for the classical Rosenbrock function, which fits into the class of parameter estimation problems (Example 1) with n = m = 2 and and which yields the well-known optimization problem The point x * = (1, 1) T is a unique global minimum, because it is the only point which attains the lower bound 0 of the objective. We can easily compute Solving for det F (x) = 0, we obtain that the Hessian F (x) is singular along the parabola x 2 = x 2 1 + 1 200 , which is just slightly above the trough of the banana valley x 2 = x 2 1 . The right-hand sides of the gradient, Newton, and Gauß-Newton flows can be verified to be Obviously, the Newton flow has a singularity along the shifted parabola x 2 = x 2 1 + 1 200 . Thus, the Newton flow cannot cross this barrier if started from above it (cf. Fig. 1). Moreover, we see that in the Rosenbrock example, the Gauß-Newton flow does not suffer from stiffness, which is induced in the gradient and Newton flow by fast transients towards the trough of the banana valley (cf. Fig. 1), where they take a sharp turn.
Two remarks are appropriate here: First, fast attraction to the banana-shaped bottom of the valley is detrimental for line-search and trust-region based globalization if no secondorder correction is employed, because stepsizes and trust-region radii are required to be rather tiny to ensure descent in the objective. Second, from a stability point of view, forward Euler timestepping requires tiny stepsizes for stiff equations, which thus applies for the gradient and Newton method but not for the Gauß-Newton method. These two properties speak clearly in favor of using the Gauß-Newton method.
The obtuse angle of very nearly 135 • between the Newton and the Gauß-Newton flows atx = (−1, −1) in Fig. 1 hints at another important point: The convergence theory of inexact Newton methods based on generalized Newton flows (see [22,23]) is not helpful in the analysis of the Gauß-Newton method. More rigorously, we can algebraically check that This means that the Gauß-Newton direction does not satisfy the contravariant κ-condition A2 in [23]. We also learn that its norm-free linearized counterpart, the κ GN -condition (4), is meaningful only in the vicinity of a solution x * and that ρ(G(x)) has no relevance for the globalization of Gauß-Newton methods for x far away from a solution x * .

Asymptotic Stability of the Gauß-Newton Flow
The interpretation of the Gauß-Newton method as a Newton-type method provided in Section 2 allows for the study of generalized Newton flows [22,23], which determine the behavior of the damped iteration x k+1 = x k + t k s k for small stepsizes t k ∈ (0, 1]. We can then interpret the Gauß-Newton iteration as a forward Euler timestepping method with stepsize t k on the continuous Gauß-Newton flow x(t) determined by the initial value problem (9). (9) and f (x * ) has full column rank. With λ ∈ R we denote the smallest eigenvalue of the matrix G(x * ). Then, x * is asymptotically stable if λ > −1 and unstable if λ < −1.

Lemma 2 (Asymptotic stability of critical points) Assume x * ∈ R n is an equilibrium point of the Gauß-Newton flow x(t) in
Proof We study the spectrum of the linearization of the flow right-hand side about x * . To simplify notation, we abbreviate A(x) = f (x). The Moore-Penrose pseudoinverse can be differentiated with the formula [12, Theorem 4.3] At x * , we can exploit A(x * ) + f (x * ) = 0 and in addition the general identities dA Hence, we obtain for the linearization of the flow right-hand side around the critical point which has purely real eigenvalues by Lemma 1. This matrix has only negative eigenvalues for λ > −1, which implies that x * is asymptotically stable, and at least one positive eigenvalue for λ < −1, in which case x * is unstable.
We believe a warning is appropriate at this point: In contrast to the spectral radius condition (4) on G(x * ) for the discrete Gauß-Newton iteration, the condition for asymptotic stability of the Gauß-Newton flow concerns only the negative part of the spectrum of G(x * ). In the light of the mirror trick of Example 2, the flow might be attracted to statistically unstable solutions. We illustrate this behavior in the following example by the means of a simple parameter estimation problem that we want to solve with a damped Gauß-Newton method using a simple Armijo backtracking line-search.
Example 3 (Frequency reconstruction) Given a sequence of n + 1 measurements (η i ) 0≤i≤n taken at time t i = 2πi n from the signal h i (a * , ω * , * ) = a * sin(ω * t i + * ) we aim to reconstruct the amplitude, frequency and phase x * = (a * , ω * , * ) ∈ R 3 . Under the assumption that the measurement noise η i − h i (x * ) is normally distributed around 0 with covariance σ = 0.1, this leads to the maximum likelihood parameter estimation problem . For n = 100 and measurements generated with the true signal parameters x * = (1, 4, 1) we apply a Gauß-Newton method with simple Armijo backtracking line-search using α = 1 2 as backtracking factor and β = 0.1 for the descent condition and compare the behavior with the full-step Gauß-Newton method for two different initial values (see Fig. 2).
Example 3 suggests how to interpret the convergence behavior of a damped Gauß-Newton method with line-search. The line-search condition will always guarantee descent and thus convergence to a stationary point within the current level-set. However, only if the convergence finally happens with full-steps, the solution will be statistically stable -a property that can be used in algorithms to distinguish statistically stable from statistically unstable solutions. To summarize it, a GN method with stepsize damping can help to globalize the convergence but in general it can not always prevent the iterates from getting trapped at a statistically unstable critical point, in particular when the initial guess is far away from  (3, 2, 2) . The upper plots represent the residual-norm of all iterates, the stepnorm of all iterates, and the stepsize of the iterates using the line-search method until the termination criterion s k 2 ≤ 5 · 10 −8 or k ≥ 40 is reached. The lower plot represents the measurements and the model response of the final iterates. It can be seen that for both starting points the full-step GN method failed to converge and just diverges. The backtracking GN method converges for both points but the stepsize of the Armijo backtracking line-search only converges with full steps for the first point (green line) the solution. In such a case domain knowledge has to be used to find a good initial guess for the iterates.
Two particular Krylov subspace methods for the approximate solution of (3) are LSQR and LSMR [11,19]. In their i-th subiteration, they construct an orthonormal basis V i (x k ) ∈ R n×i of the i-dimensional Krylov subspace by a Golub-Kahan bidiagonalization process. (If the dimension of K i (H (x k ), F (x k )) is smaller than i, then the solution is already contained in K i−1 (H (x k ), F (x k )).) We then solve the reduced space linear least-squares problems In that sense, LSQR strives for the best linearized fit, while LSMR strives for the lowest violation of the linearized normal equation. For LSQR, the normal equation provided that f (x k ) has full column rank. For fixed i, we can consider the LSQR-Gauß-Newton flow equationsẋ For i ≥ 1, the objective value is guaranteed to decrease along the flow due to d dt whenever F (x) = 0 and f (x) has full column rank (suppressing the arguments of x = x(t)). We see that there is no further condition on the required accuracy to solve (3) other than i ≥ 1: We always obtain a descent direction. This observation suggests the following numerical strategy: Far away from a solution, we approximate solutions to (3) with rather low accuracy. When we are close to a solution, indicated for instance by stagnation in the nonlinear objective value, we tighten the accuracy more and more to finally benefit from the statistical stability properties of the exact Gauß-Newton direction. A simple implementation of this approach is illustrated in Algorithm 1.

Algorithm 1 LSQR-Gauß-Newton method.
1: function LSQRGAUSSNEWTON(x 0 , xtol, otol, k max , α, β, γ, σ, τ, τ min ) 2: Evaluate f (x 0 ) 3: for k = 0, . . . , k max do 4: Compute s k as an approximate solution of (3) by LSQR [11] with BTOL = 0 and ATOL = τ 5: Set t ← 1, evaluate f (x k ) T f (x k )s k 6: loop Armijo backtracking line-search 7: Set For LSMR, the normal equation reads Repetition of the steps above leads to an objective value variation along the LSMR-Gauß-Newton flow of d dt From here, it is not clear whether we always get a descent direction. We have found LSMR to be working well in practice without giving rise to directions of ascent. This is in line with observation 1 in [11,Sec. 7.1], where the authors report that the residual r(x) = f (x)V i (x)y * i (x) + f (x) for LSMR "seems to be monotonic (no counterexamples were found)".

Numerical Results
We report on our numerical experience with a line-search LSQR-Gauß-Newton method applied to two test cases: The first is an extended Rosenbrock function in variable dimensions with and without synthetic random measurements. The second is a challenging real-world problem called Bundle Adjustment for the construction of a 3D environment from a large number of markers in 2D pictures.
The results were obtained with a Python implementation of Algorithm 1. We used CasADi [2] for the computation of function derivatives. The default parameters were α = 1 2 for the backtracking factor, β = 1 10 for the Armijo descent condition, σ = 10 −4 for the objective stagnation test to trigger a reduction by a factor of γ = 1 10 of the LSQR tolerance τ , initially set to 10 −3 and bounded below by τ min = 10 −12 .
All computations were performed on a 64bit Ubuntu 16.04 Linux machine with 128 GB of RAM and Intel Core i7-5820K computational cores at 3.3 GHz. No parallelization was used.

An Extended Rosenbrock Function
We consider the following n-dimensional extension of the Rosenbrock function (cf. [26]) in the form of Example 1 with with measurements η ∈ R 2n−2 that are normally distributed with expectation 0 and covariance −2 . For η = 0, the unique global optimum is x * i = 1 for i = 1, . . . , n. We run Algorithm 1 on this example with xtol = 10 −5 and otol = 10 −12 for 20 runs with different random realizations of η each for n = 10 1 , . . . , 10 7 with an initial guess (x 0 ) i = 1 for i = 1, . . . , n, which is not a solution because η = 0. The resulting statistics are displayed in Table 1. We see that we can find solutions of the large instances with a relatively small total number of LSQR iterations in comparison to the problem dimension. The required CPU time appears to grow linearly in n.

Large Scale Bundle Adjustment
Bundle Adjustment is a method of Structure from Motion (SfM) for estimating geometric 3D data from big sequences of 2D images. The core ingredient that couples the real world 3D position of an object and the resulting 2D position of the corresponding marker in the image is the camera model that depends on camera position, orientation, and some additional parameters such as focal lengths and lense distortion parameters. The camera model gives a prediction of the 2D positions of the markers depending on the real world 3D coordinates of an object and the camera parameters. By defining the residual function f as the difference of predicted 2D positions and the measurements, a large scale nonlinear least squares problem can be set up to estimate the most likely 3D positions and camera parameters. For a detailed description of the camera model and how the least-squares problem is defined we refer the reader to [1]. We note that, since 3D translation of the realworld coordinates and the camera positions by an arbitrary vector results in the same 2D image measurements, the bundle adjustment problems exhibit an intrinsic ambiguity, which results in a structural violation of the full-rank condition on f (x). As in [1], we rely on the regularizing properties of Krylov-space methods for our computations.
We use the bundle adjustment datasets Ladybug, Trafalgar Square, Dubrovnik, and Venice, which are freely available at http://grail.cs.washington.edu/projects/bal and consist  of several problems with varying numbers of 2D pictures and 3D objects (see Fig. 3 for a visualization resulting from the largest Venice dataset). The evaluation of the function f is partially based on the SciPy Cookbook Python Notebook, which is available at https://scipy-cookbook.readthedocs.io/items/bundle adjustment.html. We use the same Python implementation of Algorithm 1 as in the previous example. The derivatives are obtained using CasADi [2].

Results
We run Algorithm 1 with the parameters α = 1 2 for the back-tracking factor, β = 10 −3 for the Armijo descent condition, σ = 10 −2 for the objective stagnation test to trigger a reduction by a factor of τ = 1 10 of the LSQR tolerance τ , initially set to 1 10 and bounded from below by τ min = 10 −4 . We chose xtol = 10 −10 and otol = 10 −7 . As in the synthetic extended Rosenbrock example, variable and residual dimensions are very nearly proportional for all problems.
In all cases, Algorithm 1 terminates within a maximum of 56 Krylov-Gauß-Newton iterations, always taking full steps eventually. As can be seen in Table 2, even the  problems with large problem dimension are solved within relatively few Krylov-Gauß-Newton steps. We observe that the total CPU time appears to grow linearly with the problem dimension, see Fig. 4. When it comes to the total number of inner LSQR iterations, we note that again relatively few are needed compared to the problem dimensions, see Table 2.

Conclusion
In the full-rank case, the full-step Gauß-Newton method has the favorable property of not being attracted to statistically unstable minima. The nonlinear least-squares objective decreases along the the Gauß-Newton flow and the Krylov-Gauß-Newton flow if LSQR is used. The damped (Krylov-)Gauß-Newton method is equivalent to forward Euler timestepping on the (Krylov-)Gauß-Newton flow, while the Levenberg-Marquardt method is equivalent to backward Euler timestepping on the gradient flow. We showed for the popular example of the Rosenbrock function that the Gauß-Newton flow equations constitute a non-stiff differential equation, in contrast to the Newton and gradient flow equations, which give rise to fast transients towards the trough of the notorious banana-shaped valley. From this vantage point, line-search globalization for the Newton and gradient methods suffer from much stricter stepsize restrictions than the Gauß-Newton method. Krylov-Gauß-Newton methods have an intrinsic regularizing property to make them appropriate also for ill-conditioned large-scale nonlinear least-squares problems.