1 Introduction

The problem we consider is a nonlinear least squares problem

$$\begin{aligned} \min _{{\textbf{x}}^{}\in {\mathbb {R}}^N}\frac{1}{2}\sum _{j=1}^m r_j({\textbf{x}}^{})^2 = \frac{1}{2}\min _{{\textbf{x}}^{}\in {\mathbb {R}}^N} \Vert {\textbf{R}}({\textbf{x}}^{})\Vert ^2_2 = \min _{{\textbf{x}}^{}\in {\mathbb {R}}^N} {\textbf{F}}({\textbf{x}}^{}) \end{aligned}$$
(1)

where for every \(j=1,\ldots ,m\), \(r_j: {\mathbb {R}}^N\rightarrow {\mathbb {R}}\) is a \(C^{1}\) function, \({\textbf{R}}({\textbf{x}}^{}) = \left( r_1({\textbf{x}}^{}),\dots ,r_m({\textbf{x}}^{})\right) ^\top \in {\mathbb {R}}^{m}\) is the vector of residuals, and \({\textbf{F}}\) is the aggregated residual function. We assume a special structure of the residuals that is relevant in many applications: the residuals are very sparse functions, each depending only on a small number of variables, while the whole problem is large-scale, i.e. N is very large. We do not assume any particular sparsity pattern, which allows many practical problems to fit the framework we consider.

Typical problems of this kind are Least Squares Network Adjustment [17], Bundle Adjustment [13], and Wireless Sensor Network Localization [15], where the variables correspond to the coordinates of physical points in a region of two- or three-dimensional space and the residuals correspond to observations of geometrical quantities involving these points. In these cases each observation typically involves a small (often fixed) number of points, and thus each residual function involves a small number of variables. Moreover, when considering problems of large dimension with points deployed over a large region of space, the number of observations involving each point is small with respect to the total number of observations. That is, each variable is involved in a relatively small number of residual functions. This leads to problems that are very sparse. Given that the measurements are prone to errors or to different kinds of noise, the residuals are in general weighted, with the weights being the reciprocal of the measurement precision. Furthermore, a typical property of such problems is that they are nearly separable, meaning that it is possible to partition the points into subsets that are connected by a small number of observations. The dominant properties of all these problems, namely the very large dimension N and sparsity, motivated the modification of the classical Levenberg-Marquardt method presented in this paper.

The Levenberg-Marquardt (LM) method is commonly used to solve least squares problems of large dimension. The method is based on a regularization of the Gauss-Newton method. In each iteration of the Gauss-Newton method one has to solve a linear least squares problem. The LM method further adds a regularization (damping) parameter that facilitates the direction computation. Thus in each iteration of the LM method one has to solve a system of linear equations to compute the step or the step direction. The regularization parameter plays a fundamental role and its choice is the subject of many studies.

Many modifications of the classical Levenberg-Marquardt scheme have been proposed in the literature to retain convergence while relaxing the assumptions on the objective function and to improve the performance of the method. In [6, 7, 18] the damping parameter is defined as a multiple of the objective function. With this choice of the parameter, local superlinear or quadratic convergence is proved under a local error bound assumption for zero-residual problems, while global convergence is achieved by employing a line search strategy. In [11] the authors propose an updating strategy for the parameter that, in combination with an Armijo line search, ensures global convergence and q-quadratic local convergence under the same assumptions as the previous papers. In [2] the non-zero residual case is considered and a Levenberg-Marquardt scheme is proposed that achieves local convergence with order depending on the rank of the Jacobian matrix and on a combined measure of nonlinearity and residual size around stationary points. In [4] an inexact Levenberg-Marquardt method is considered and local convergence is proved under a local error bound condition. In [3] the authors propose an approximated Levenberg-Marquardt method, suitable for large-scale problems, that relies on inaccurate function values and derivatives.

The problems we are interested in are of very large dimension and sparse. Sparsity very often induces the property we call near-separability, i.e. it is possible to partition the variables into subsets such that most residual functions depend only on variables within a single subset, while only a limited number of residual functions depends on variables from different subsets. This property is mainly a natural consequence of the problem origin. For example, in the Network Adjustment problem the physical distance between points determines the set of points that are connected by observations. Although complete separability of the residuals is not common, the number of residuals that depend on points from different subsets is in general rather small compared to the problem size N. On the other hand, for a very large N, solving the LM linear system in each iteration can be costly even for very sparse problems. A particular example is the refinement of cadastral maps, where N is prohibitively large for direct application of the LM method. For instance, the Dutch Kadaster is pursuing a more accurate cadastral map by making the map consistent with accurate field surveyor measurements [8, 10]. This application yields a nonlinear least squares problem which is known as ‘least squares adjustment’ in the field of geography. If the entire Netherlands were considered as one big adjustment problem, the number of variables would be twice the number of feature points in the Netherlands, which is on the order of 1 billion variables, and even considering separate parts of the Netherlands still yields a very large-scale problem.

The method we propose here is designed to exploit sparsity and near-separability in the following way. Assuming that we can split the variables in such a way that a large number of residual functions depends only on a particular subset of variables, while a relatively small number of residual functions depends on variables from different subsets, the system of linear equations in the LM iteration has a particular block structure. The variable subsets and the corresponding residuals imply strong dominance of relatively dense diagonal blocks, while the off-diagonal blocks are very sparse. This structure motivated the idea of splitting: we decompose the LM system matrix into K independent systems of linear equations, determined by the diagonal blocks of the LM system. This way we get K linear systems of significantly smaller dimensions and we can solve them faster, either sequentially or in parallel. Thus, one can consider this approach as a kind of inexact LM method. However, this kind of splitting might be too inaccurate given that we completely disregard the off-diagonal blocks of the LM matrix. Therefore, we modify the K independent linear systems in such a way that the off-diagonal blocks are included in the right-hand sides of these independent systems. The modification of the right-hand sides is based on a correction parameter proposed for the Newton method in [14] and for the distributed Newton method in [1], and attempts to minimize the difference between the full LM linear system residual and the residual of the modified method. Having an affordable search direction computed by solving the K independent linear systems, we proceed in the usual way, applying a line search to get a globally convergent method. Furthermore, under a set of standard assumptions we prove local linear convergence.

A proper splitting of the variables into suitable subsets is a key assumption for the efficiency of this method but the problems we are interested in very often offer a natural way of meaningful splitting. For example in the localization problems or network adjustment problems the geometry of points dictates meaningful subsets. In the experiments we present here one can see that a graph partitioning algorithm provides a good subset division in a cost-efficient way.

The numerical experiments presented in the paper demonstrate clearly the advantage of the proposed method with respect to the full-size LM method. We show that the splitting method successfully copes with very large dimensions, testing problems with up to 1 million variables. Furthermore, we demonstrate that even for dimensions that can be handled by the classical LM method, say around a couple of tens of thousands of variables, the splitting method works faster. Additionally, we investigate the robustness of the splitting: we demonstrate empirically that the number of subsets plays an important role in the efficiency of the method, but that there is a reasonable degree of freedom in choosing that number without affecting the performance of the method. The experiments presented here are carried out sequentially, i.e. we did not solve the independent systems of linear equations in parallel, which would further enhance the proposed method. A parallel implementation will be the subject of future research.

The paper is organized as follows. In Sect. 2 we present the framework that we are considering. The proposed method is described in Sect. 3 while the theoretical analysis of the proposed method is carried out in Sect. 4. In Sect. 5 we discuss some implementation details and present numerical results.

The notation we use is the following. Mappings defined on \({\mathbb {R}}^N,\) vectors from \({\mathbb {R}}^N\) and matrices with at least one dimension being N are denoted by boldfaced letters \({\textbf{F}}, {\textbf{R}}, {\textbf{x}}^{}, {\textbf{B}}, \ldots\) while their block-elements are denoted by the same letters in italics and indices so \({\textbf{x}}^{} = (x_1,\ldots ,x_s), \; {\textbf{x}}^{} \in {\mathbb {R}}^N, \; x_s \in {\mathbb {R}}^{n_s}.\) The dimensions are clearly stated to avoid confusion. We use \(\lambda _{\min }(\cdot )\) and \(\lambda _{\max }(\cdot )\) to denote the smallest and largest eigenvalue of a matrix, respectively. The Euclidean norm is denoted by \(\Vert \cdot \Vert\) for both matrices and vectors.

2 Nearly separable problems

The problem we consider is stated in (1). Denote with \({\mathcal {I}} = \{1,\dots ,N\}\) and with \({\mathcal {J}} = \{1,\dots ,m\}\). Given a partition \(I_1,\dots ,I_{K}\) of \({\mathcal {I}}\) we define the corresponding partition of \({\mathcal {J}}\) into \(E_1,\dots , E_{K}\) as follows:

$$\begin{aligned} \begin{aligned}&E_s = \{j\in {\mathcal {J}} | r_j \text { only depends on variables in } I_s\},\ s=1,\ldots ,K\\&{\widehat{E}} = {\mathcal {J}} \setminus \bigcup _{i=1}^{ K }E_i. \end{aligned} \end{aligned}$$
(2)

That is, given a partition of the set of variables, each of the subsets \(E_s\) contains the indices corresponding to residual functions that only involve variables in \(I_s\), while \({\widehat{E}}\) contains the indices of residuals that involve variables belonging to different subsets \(I_s\). We say that problem (1) is separable if there exist \(K \ge 2\) and a partition \(\{I_s\}_{s=1,\ldots ,K }\) of \({\mathcal {I}}\) such that \({\widehat{E}} = \emptyset\), while we say that it is nearly-separable if there exist \(K \ge 2\) and a partition \(\{I_s\}_{s=1,\ldots ,K }\) of \({\mathcal {I}}\) such that the cardinality of \({\widehat{E}}\) is small with respect to the cardinality of \(\bigcup _{s=1}^{ K }E_s.\) The term “nearly-separable” is not defined precisely and should be understood in the same fashion as sparsity, i.e. assuming that we can identify the corresponding partitions.
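As a minimal illustration of (2), let \(N=4\), \(m=3\) and suppose that \(r_1\) depends on variables \(\{1,2\}\), \(r_2\) on \(\{3,4\}\) and \(r_3\) on \(\{2,3\}\). The partition \(I_1=\{1,2\}\), \(I_2=\{3,4\}\) then yields \(E_1=\{1\}\), \(E_2=\{2\}\) and \({\widehat{E}}=\{3\}\): there is a single coupling residual, so the problem is nearly-separable, and it would be separable if \(r_3\) were removed.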

The above described splitting can be interpreted as follows. Given the least squares problem (1), we define the corresponding underlying network as the undirected graph \({\mathcal {G}} = ({\mathcal {I}}, {\mathcal {E}})\), where \({\mathcal {I}}\) and \({\mathcal {E}}\) denote the set of nodes and the set of edges, respectively. The graph \({\mathcal {G}}\) has a node for each variable \(x_i\) and an edge between node i and node k if there is a residual function \(r_j\) that involves both \(x_i\) and \(x_k\). With this definition in mind, the partition of \({\mathcal {I}}\) and \({\mathcal {J}}\) described above corresponds to a partition of the sets of nodes and edges of the network, where \(E_s\) contains the indices of edges between nodes in the same subset \(I_s\) and \({\widehat{E}}\) contains the edges that connect different subsets. The problem is separable if the underlying network \({\mathcal {G}}\) is not connected (and the number K is equal to the number of connected components of \({\mathcal {G}}\)). The problem is nearly-separable if we can partition the set of nodes of the network in such a way that the number of edges that connect different subsets is small with respect to the number of edges that are internal to the subsets.
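The bookkeeping behind (2) is simple to implement. The following sketch (a simplified illustration, not the code used in our experiments) takes, for each residual, the indices of the variables it involves, together with a given variable partition, and returns the sets \(E_s\) and \({\widehat{E}}\); the variable partition itself can be produced, for instance, by a graph partitioning algorithm applied to \({\mathcal {G}}\), as mentioned in Sect. 1.

```python
# Sketch: given, for each residual r_j, the variable indices it depends on, and a
# partition I_1,...,I_K of the variables, compute the sets E_1,...,E_K and E_hat of (2).
def split_residuals(residual_vars, variable_blocks):
    """residual_vars[j] : iterable of variable indices appearing in r_j
       variable_blocks  : list of K disjoint sets covering {0, ..., N-1}"""
    block_of = {i: s for s, I_s in enumerate(variable_blocks) for i in I_s}
    K = len(variable_blocks)
    E, E_hat = [[] for _ in range(K)], []
    for j, vars_j in enumerate(residual_vars):
        blocks = {block_of[i] for i in vars_j}
        if len(blocks) == 1:
            E[blocks.pop()].append(j)   # r_j involves only variables of one I_s
        else:
            E_hat.append(j)             # r_j couples different blocks
    return E, E_hat

# toy usage: 3 residuals on 4 variables, blocks I_1 = {0, 1}, I_2 = {2, 3}
E, E_hat = split_residuals([(0, 1), (2, 3), (1, 2)], [{0, 1}, {2, 3}])
# E == [[0], [1]], E_hat == [2]: a single coupling residual
```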

Given the partition \(\{I_s\}_{s=1,\ldots ,K }\) of the variables and the corresponding partition \(\{E_s\}_{s=1,\ldots ,K },\ {\widehat{E}}\) of the residuals, for \(s=1,\ldots , K\) we define \(x_{s}\in {\mathbb {R}}^{n_s}\) as the vector of the variables in \(I_s\) where \(n_s\) denotes the cardinality of \(I_s\), and we introduce the following functions

$$\begin{aligned} \begin{aligned}&R_s(x_{s}):= (r_j({\textbf{x}}^{}))_{j\in E_s}, \quad \quad&\rho ({\textbf{x}}^{}):= (r_j({\textbf{x}}^{}))_{j\in {\widehat{E}}}\\&F_s(x_{s}):=\frac{1}{2}\Vert R_s(x_s)\Vert ^2_2 \quad \quad&\Phi ({\textbf{x}}^{}):= \frac{1}{2}\Vert \rho ({\textbf{x}}^{})\Vert ^2_2 \end{aligned} \end{aligned}$$
(3)

so that for every \(s=1,\ldots ,K\), \(R_s: {\mathbb {R}}^{n_s} \rightarrow {\mathbb {R}}^{|E_s|}\) is the vector of residuals involving only variables in \(I_s\), while \(\rho : {\mathbb {R}}^N\rightarrow {\mathbb {R}}^{|{\widehat{E}}|}\) is the vector of residuals in \({\widehat{E}}\), and \(F_s\), \(\Phi\) are the corresponding local aggregated residual functions. Notice that \(\sum _{s=1}^K n_s = N.\) With this notation problem (1) can be rewritten as

$$\begin{aligned} \begin{aligned}&\min _{{\textbf{x}}^{}\in {\mathbb {R}}^N}\left( \Phi ({\textbf{x}}^{})+ \sum _{s=1}^{K } F_s(x_{s})\right) = \min _{{\textbf{x}}^{}\in {\mathbb {R}}^N}\left( \frac{1}{2}\Vert \rho ({\textbf{x}}^{})\Vert ^2_2 + \sum _{s=1}^{K } \frac{1}{2}\Vert R_s(x_{s})\Vert ^2_2 \right) \end{aligned} \end{aligned}$$
(4)

In particular, if the problem is separable (and therefore \(\widehat{E}\) is empty) \(\Phi \equiv 0\) and solving problem (4) is equivalent to solving K independent least squares problems given by

$$\begin{aligned} \begin{aligned}&\min _{x_{s}\in {\mathbb {R}}^{n_s}}F_s(x_{s}) = \min _{x_{s}\in {\mathbb {R}}^{n_s}} \frac{1}{2}\Vert R_s(x_{s})\Vert ^2_2\ \text { for}\ s=1,\ldots , K. \end{aligned} \end{aligned}$$
(5)

If the problem is not separable then in general \(\Phi\) is not identically zero, and that is the case we are interested in.

3 LMS: the Levenberg-Marquardt method with splitting

Let \(\{I_s\}_{s=1,\ldots ,K }\) be a partition of \({\mathcal {I}}\) and \(\{E_s\}_{s=1,\ldots ,K },\ {\widehat{E}}\) be the corresponding partition of \({\mathcal {J}}\) as defined in (2). To ease the notation we assume that the variables and the residual functions have been reordered according to the given partitions, so that for \({\textbf{x}}^{} \in {\mathbb {R}}^N\)

$$\begin{aligned} {\textbf{x}}^{} = \left( \begin{array}{cc} x_1\\ \vdots \\ x_{ K } \end{array}\right) , \quad {\textbf{R}}({\textbf{x}}^{}) = \left( \begin{array}{cc} R_1(x_1)\\ \vdots \\ R_{ K }(x_{ K })\\ \rho ({\textbf{x}}^{}) \end{array}\right) \end{aligned}$$

With this reordering, denoting with \(J_{sR_s}\) the Jacobian of the partial residual vector \(R_s\) defined in (3) with respect to the variables in \(I_s\), and with \(J_{s \rho }\) the Jacobian of the partial residual \(\rho\) with respect to \(x_s,\) we have

$$\begin{aligned} {\textbf {{J}}}({\textbf{x}}^{}) = \left( \begin{array}{cccc} J_{1R_1}(x_1)&{} &{} &{} 0 \\ &{}J_{2R_2}(x_2)&{} &{} \\ &{} &{} \ddots &{} \\ 0 &{} &{} &{}J_{ K R_{ K }}(x_{ K })\\ J_{1\rho }({\textbf{x}}^{})&{} J_{2\rho }({\textbf{x}}^{}) &{} \dots &{}J_{ K \rho }({\textbf{x}}^{}) \end{array}\right) . \end{aligned}$$

From this structure of \({\textbf{R}}\) and \({\textbf {{J}}}\) we get the corresponding block structure of the gradient \({\textbf{g}}({\textbf{x}}^{}) = {\textbf {{J}}}({\textbf{x}}^{})^\top {\textbf{R}}({\textbf{x}}^{})\) and the matrix \({\textbf {{J}}}({\textbf{x}}^{})^\top {\textbf {{J}}}({\textbf{x}}^{})\):

$$\begin{aligned}{} & {} {\textbf{g}}({\textbf{x}}^{})^\top = \left( g_1^\top ({\textbf{x}}^{}), g_2^\top ({\textbf{x}}^{}), \ldots , g_{ K }^\top ({\textbf{x}}^{})\right) , \end{aligned}$$
(6)
$$\begin{aligned}{} & {} {\textbf {{J}}}({\textbf{x}}^{})^\top {\textbf {{J}}}({\textbf{x}}^{}) = \left( \begin{array}{cccc} H_1({\textbf{x}}^{})&{} B_{12}({\textbf{x}}^{})&{}\dots &{} B_{1 K }({\textbf{x}}^{}) \\ B_{21}({\textbf{x}}^{}) &{}H_2({\textbf{x}}^{})&{} \ddots &{} \vdots \\ \vdots &{}\ddots &{} \ddots &{} B_{ K -1 K } ({\textbf{x}}^{}) \\ B_{ K 1}({\textbf{x}}^{}) &{}\dots &{} B_{ K K -1}({\textbf{x}}^{}) &{}H_{ K }({\textbf{x}}^{})\\ \end{array}\right) , \end{aligned}$$
(7)

with

$$\begin{aligned} \begin{aligned}&g_s({\textbf{x}}^{})= J_{sR_s}(x_s)^\top R_s(x_s) + J_{s\rho }({\textbf{x}}^{})^\top \rho ({\textbf{x}}^{})\qquad \text {for }\ s=1,\ldots , K,\\&H_{s}({\textbf{x}}^{}) = J_{sR_s}(x_{s})^\top J_{sR_s}(x_{s}) + J_{s\rho }({\textbf{x}}^{})^\top J_{s\rho }({\textbf{x}}^{}) \qquad \text {for }\ s=1,\ldots , K,\\&B_{ij}({\textbf{x}}^{}) = J_{i\rho }({\textbf{x}}^{})^\top J_{j\rho }({\textbf{x}}^{}) \qquad \text {for }\ i,j=1,\ldots , K,\ i\ne j. \end{aligned} \end{aligned}$$
(8)

In the following, we denote with \({\textbf{g}}^k = {\textbf {{J}}}_k^\top {\textbf{R}}_k\) the vector with s-th block component equal to \(g_s({\textbf{x}}^{k})\), with \({\textbf{H}}_k = {\textbf{H}}({\textbf{x}}^{k})\) the block diagonal matrix with diagonal blocks given by \(H_s({\textbf{x}}^{k})\) for \(s=1,\dots , K\), and with \({\textbf{B}}_k = {\textbf{B}}({\textbf{x}}^{k})\) the block partitioned matrix with diagonal blocks equal to zero and off-diagonal blocks equal to \(B_{ij}({\textbf{x}}^{k})\). That is,

$$\begin{aligned} \begin{aligned}&{\textbf{H}}_k = \left( \begin{array}{ccccc} H_1({\textbf{x}}^{k})&{} &{} &{} \\ &{}H_2({\textbf{x}}^{k})&{} &{} &{} \\ &{} &{}\ddots &{} \\ &{} &{} &{}H_{ K }({\textbf{x}}^{k})\\ \end{array}\right) ,\\&{\textbf{B}}_k = \left( \begin{array}{ccccc} 0&{} B_{12}({\textbf{x}}^{k})&{}\dots &{} B_{1 K }({\textbf{x}}^{k}) \\ B_{21}({\textbf{x}}^{k}) &{}0&{} \ddots &{} \vdots \\ \vdots &{}\ddots &{} \ddots &{} B_{ K -1 K } ({\textbf{x}}^{k}) \\ B_{ K 1}({\textbf{x}}^{k}) &{}\dots &{} B_{ K K -1}({\textbf{x}}^{k}) &{}0\\ \end{array}\right) . \end{aligned} \end{aligned}$$
(9)
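For illustration, a schematic dense assembly of the blocks in (8) could look as in the following sketch (illustrative only, with container names of our choosing; in an actual implementation the Jacobian blocks would be stored and multiplied as sparse matrices).

```python
import numpy as np

# Schematic assembly of the blocks in (8) (dense and purely illustrative).
# J_R[s]  : Jacobian of R_s w.r.t. x_s, shape (|E_s|, n_s)
# J_rho[s]: Jacobian of rho w.r.t. x_s, shape (|E_hat|, n_s)
# R[s], rho: residual vectors evaluated at the current iterate.
def assemble_blocks(J_R, J_rho, R, rho):
    K = len(J_R)
    g = [J_R[s].T @ R[s] + J_rho[s].T @ rho for s in range(K)]         # gradient blocks g_s
    H = [J_R[s].T @ J_R[s] + J_rho[s].T @ J_rho[s] for s in range(K)]  # diagonal blocks H_s
    B = {(i, j): J_rho[i].T @ J_rho[j]                                 # coupling blocks B_ij
         for i in range(K) for j in range(K) if i != j}
    return g, H, B
```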

The algorithm we introduce here is motivated by the near-separability property and hence we state the formal assumption below.

Assumption 1 There exists a constant \(M > 0\) such that for all \({\textbf{x}}^{} \in {\mathbb {R}}^N\)

$$\begin{aligned} \Vert {\textbf{B}}({\textbf{x}}^{})\Vert \le M \Vert {\textbf {{J}}}({\textbf{x}}^{})^\top {\textbf {{J}}}({\textbf{x}}^{})\Vert . \end{aligned}$$
(10)

The assumption above is not restrictive as \({\textbf{B}}({\textbf{x}}^{})\) is a submatrix of \({\textbf {{J}}}({\textbf{x}}^{})^\top {\textbf {{J}}}({\textbf{x}}^{}).\) Furthermore, the global convergence of the algorithm we propose does not depend on it, in the sense that we do not use the assumption in the convergence proof. In fact the proposed algorithm works even for problems that are not nearly-separable, as it can be seen as a kind of quasi-Newton method; however, the efficiency of the algorithm depends on the near-separability of the problem and on the value of M. Moreover, the value of M plays an important role in the analysis of local convergence and in achieving a linear rate.

Consider a standard iteration of the LM method at a given iterate \({\textbf{x}}^{k}\)

$$\begin{aligned} {\textbf{x}}^{k+1} = {\textbf{x}}^{k}+ {\textbf{d}}^k, \end{aligned}$$

where \({\textbf{d}}^k\in {\mathbb {R}}^N\) is the solution of

$$\begin{aligned} \left( {\textbf {{J}}}_k^\top {\textbf {{J}}}_k+\mu _k {\textbf{I}}\right) {\textbf{d}}^k = -{\textbf {{J}}}_k^\top {\textbf{R}}_k, \end{aligned}$$
(11)

where \({\textbf {{J}}}_k = {\textbf {{J}}}({\textbf{x}}^{k})\in {\mathbb {R}}^{m\times N}\) denotes the Jacobian matrix of \({\textbf{R}}\) at \({\textbf{x}}^{k}\), \({\textbf{R}}_k = {\textbf{R}}({\textbf{x}}^{k})\), and \(\mu _k\) is a positive scalar. When N is very large, solving (11) at each iteration of the method may be prohibitively expensive. In the following we propose a modification of the Levenberg-Marquardt method that exploits near-separability of the problem to approximate the linear system (11) with a set of independent linear systems of smaller size.
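For reference, a single computation of the full LM direction (11) with a sparse direct solver could be sketched as below (this is only an illustration, not the implementation used in our experiments); for very large N, forming and factorizing the \(N\times N\) matrix in this sketch is precisely the cost we wish to avoid.

```python
import scipy.sparse as sp
import scipy.sparse.linalg as spla

# Reference sketch: the full (undecomposed) LM direction of (11).
# J is the sparse m-by-N Jacobian at x_k, R the residual vector, mu > 0 the damping.
def lm_direction_full(J, R, mu):
    N = J.shape[1]
    A = (J.T @ J + mu * sp.identity(N)).tocsc()   # N-by-N coefficient matrix
    g = J.T @ R                                   # gradient J^T R
    return spla.spsolve(A, -g)                    # d^k solving (J^T J + mu I) d = -J^T R
```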

Using the block structure (6)–(9), the linear system (11) at iteration k can be rewritten as

$$\begin{aligned} ({\textbf{H}}_k +\mu _k {\textbf{I}}+{\textbf{B}}_k){\textbf{d}}^k = -{\textbf{g}}^k. \end{aligned}$$
(12)

The matrix \({\textbf{B}}_k\) depends only on the derivatives of the residual vector \(\rho = (r_j({\textbf{x}}^{}))_{j\in {\widehat{E}}}.\) If the problem is separable then \({\widehat{E}} = \emptyset\) and \({\textbf{B}}_k = 0\) so the coefficient matrix of (12) is block diagonal, the system can be decomposed into K independent linear systems, and the solution \({\textbf{d}}^k\) is a vector with the s-th block component equal to the solution \(d^k_s\) of

$$\begin{aligned} (H_s^k+\mu _k I)d_s^k = -g_s^k. \end{aligned}$$
(13)

If the problem is nearly-separable, the number of nonzero elements in \({\textbf{B}}_k\) is small compared to the size N of the matrix and to the number of nonzero elements in \({\textbf{H}}_k\). Thus the solution of (13) may provide an approximation of the Levenberg-Marquardt direction defined by (11), with the quality of the approximation depending on the number and magnitude of the nonzero elements in \({\textbf {{B}}}_k.\)

Given that the information contained in \({\textbf {{B}}}_k\) might be relevant and that solving K systems of smaller dimension is much cheaper than solving the system of dimension N, we propose the following modification of the right-hand side of (13), which attempts to exploit the information contained in the off-diagonal blocks while retaining separability of the approximated linear system. The idea underlying the right-hand side correction is analogous to the one proposed in [14] for systems of nonlinear equations and in [1] for distributed optimization problems.

Our goal is to split the LM linear system into separable systems of smaller dimension. Starting from the LM linear system (12) and aiming at a separable system of linear equations, i.e. a system with the matrix \({\textbf{H}}_k + \mu _k {\textbf{I}}\) as in (13), we need to take into account the fact that \({\textbf{B}}_k\) is not zero. Clearly, moving \({\textbf{B}}_k {\textbf{d}}^k\) to the right-hand side would be ideal, but \({\textbf{d}}^k\) is unknown. Therefore we add \({\textbf{B}}_k {\textbf{g}}^k\) to the right-hand side of (13), as this way we maintain separability and retain the information contained in \({\textbf{B}}_k.\) Intuitively, \({\textbf{B}}_k {\textbf{g}}^k\) is the best approximation of \({\textbf{B}}_k {\textbf{d}}^k\) that we have available, so we use it to obtain a separable system of linear equations. To compensate (at least partially) for the substitution of \({\textbf{B}}_k {\textbf{d}}^k\) with \({\textbf{B}}_k {\textbf{g}}^k\) we also introduce a correction factor \(\beta _k\), as explained in (14) and further on.

Consider the system

$$\begin{aligned} ({\textbf{H}}_k+\mu _k {\textbf{I}}){\textbf{d}}^k = (\beta _k {\textbf{B}}_k-{\textbf{I}}){\textbf{g}}^k \end{aligned}$$
(14)

where \(\beta _k\in {\mathbb {R}}\) is a correction coefficient that we can choose. Once the right-hand side has been computed, since \({\textbf{H}}_k+\mu _k {\textbf{I}}\) is a block diagonal matrix, (14) can still be decomposed into K independent linear systems. That is, the solution \({\textbf{d}}^k\) of (14) is given by

$$\begin{aligned} {\textbf{d}}^k = \left( \begin{array}{cc} d_1^k\\ \vdots \\ d_{ K }^k\\ \end{array}\right) \quad \text {with}\quad (H_s^k + \mu _k I) d^k_s = \beta _k \sum _{j=1,\ j\ne s}^{K}B^k_{sj}g^k_j - g^k_s. \end{aligned}$$
(15)
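In code, the direction computation (15) amounts to K independent solves sharing the same correction coefficient; a minimal dense sketch, reusing the illustrative block containers introduced above, is the following.

```python
import numpy as np

# Sketch of the direction computation (15): K independent solves with the
# corrected right-hand sides.  H[s], g[s] are the blocks of (8), B[(s, j)] the
# off-diagonal blocks, mu > 0 the damping and beta the correction coefficient.
def lms_direction(H, B, g, mu, beta):
    K = len(H)
    d = []
    for s in range(K):
        rhs = -g[s] + beta * sum(B[(s, j)] @ g[j] for j in range(K) if j != s)
        d.append(np.linalg.solve(H[s] + mu * np.eye(H[s].shape[0]), rhs))
    return d   # block components d_1^k, ..., d_K^k; the solves can also run in parallel
```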

So far the correction coefficient \(\beta _k\) can be chosen freely at each iteration. However, we will see that it is of fundamental importance for both the convergence analysis of the method and its practical performance. This parameter is further specified in the algorithm we propose and discussed in detail in Sect. 4.3. Let us now give only a rough reasoning behind its introduction. With \(\beta _k\) we are trying to preserve some of the information contained in \({\textbf{B}}_k\) in a cheap way and without destroying separability. One possibility is the following choice of \(\beta _k\), which ensures that the residual of the solution of (14) with respect to the exact linear system (12) is minimized. That is,

$$\begin{aligned} \beta _k = {{\,\mathrm{arg\,min}\,}}_{\beta \in {\mathbb {R}}} \Vert \varphi _k(\beta )\Vert ^2_2 \end{aligned}$$

with

$$\begin{aligned} \varphi _k(\beta ) = ({\textbf{H}}_k+\mu _k {\textbf{I}}+{\textbf{B}}_k)({\textbf{H}}_k +\mu _k {\textbf{I}})^{-1}(\beta {\textbf{B}}_k-{\textbf{I}}){\textbf{g}}^k + {\textbf{g}}^k. \end{aligned}$$
(16)

Further details on this choice are presented later on. One can ask why we use a single coefficient \(\beta _k\) in each iteration, i.e. whether it would perhaps be more efficient to try to "correct" the right-hand side with more than one parameter, for instance allowing a diagonal matrix \(\beta _k\) in the right-hand side instead of a single scalar. In fact, if we take a diagonal matrix \(\beta _k \in {\mathbb {R}}^{N \times N}\) and plug it into the above minimization problem, then by solving this problem we could recover the LM iteration exactly. However, such a procedure would imply the same cost per iteration as the full LM iteration. Clearly, there are other alternatives between these two extremes of a single number and N numbers, but our experience shows that the choice of a single coefficient \(\beta _k\) brings the best results in terms of the cost-benefit trade-off. The convergence analysis presented in the next section will further restrict the values of the parameter \(\beta _k.\)
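For completeness we note that, since \(\varphi _k\) is affine in \(\beta\), the minimization above is a one-dimensional linear least squares problem. Writing \({\textbf{u}}^k = ({\textbf{H}}_k+\mu _k {\textbf{I}}+{\textbf{B}}_k)({\textbf{H}}_k +\mu _k {\textbf{I}})^{-1}{\textbf{B}}_k{\textbf{g}}^k\) and \({\textbf{v}}^k = {\textbf{g}}^k - ({\textbf{H}}_k+\mu _k {\textbf{I}}+{\textbf{B}}_k)({\textbf{H}}_k +\mu _k {\textbf{I}})^{-1}{\textbf{g}}^k\), so that \(\varphi _k(\beta ) = \beta {\textbf{u}}^k + {\textbf{v}}^k\), the unconstrained minimizer (whenever \({\textbf{u}}^k\ne 0\)) is

$$\begin{aligned} \beta _k = -\frac{({\textbf{u}}^k)^\top {\textbf{v}}^k}{\Vert {\textbf{u}}^k\Vert ^2}. \end{aligned}$$

In Algorithm 1 the chosen \(\beta _k\) must in addition satisfy the bound (18).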

Consider the block diagonal matrix \({\textbf{H}}_k\) defined in (9). Since the s-th diagonal block is \(H_s^k = (J_{sR_s}^k)^\top J_{sR_s}^k + (J_{s\rho }^k)^\top J_{s\rho }^k\), the matrix \({\textbf{H}}_k\) is symmetric and positive semi-definite, and therefore there exists a matrix \({{\textbf {S}}}_k\in {\mathbb {R}}^{m\times N}\) such that \({{\textbf {S}}}_{k}^{\top } {{\textbf {S}}}_k={\textbf{H}}_k\). We denote with \({\textbf {C}}_k\) the matrix \({\textbf {C}}_k = {\textbf {{J}}}_k - {{\textbf {S}}}_k.\)

Let us now state the algorithm. We assume that the splitting into suitable sets is done before the iterative procedure and is kept fixed throughout the process. Thus, the block diagonal matrix \({\textbf{H}}_k\) and the off-diagonal part \({\textbf{B}}_k\) are well defined at each iteration.

The regularization parameter \(\mu _k\) plays an important role in (14) and consequently in the resolution of the independent problems stated in (15). In the algorithm below we adopt a choice based on the same principles proposed in [11], although other options are possible. The parameter is computed using the values defined as

$$\begin{aligned} \begin{aligned}&{\hat{a}}_0^k = \frac{\ell _k^2}{4}\Vert {\textbf{H}}_k\Vert \Vert {\textbf {{J}}}_k\Vert \Vert {\textbf{R}}_k\Vert \\&{\hat{a}}_1^k = \frac{\ell _k^2}{4}\Vert {\textbf {{J}}}_k\Vert \Vert {\textbf{R}}_k\Vert + \ell _k\Vert {\textbf{H}}_k\Vert \Vert {\textbf {{J}}}_k\Vert ^2\Vert {\textbf{R}}_k\Vert \\&{\hat{a}}_2^k = \Vert {\textbf{H}}_k\Vert \Vert {\textbf {{J}}}_k\Vert ^2+\ell _k\Vert {\textbf{H}}_k\Vert \Vert {\textbf{R}}_k\Vert +\ell _k\Vert {\textbf {{J}}}_k\Vert ^2\Vert {\textbf{R}}_k\Vert \\&{\hat{a}}_3^k = \Vert {\textbf {{J}}}_k\Vert ^2+\ell _k\Vert {\textbf{R}}_k\Vert \\ \end{aligned} \end{aligned}$$
(17)

for each iteration, with \(\ell _k\) specified in the algorithm below.
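In terms of the quantities in (17), and consistently with the expression for \(\mu _k\) used later in the proof of Lemma 7, the computation of the damping parameter can be sketched as follows (the norms \(\Vert {\textbf{H}}_k\Vert , \Vert {\textbf {{J}}}_k\Vert , \Vert {\textbf{R}}_k\Vert\) and the constants \(\ell _k\), b are supplied by Algorithm 1).

```python
# Sketch of the damping parameter: mu_k = 1 + (1+b)^2/(1-b) * max_i a_hat_i^k,
# with the quantities a_hat_i^k of (17); cf. the proof of Lemma 7.
def damping_parameter(H_norm, J_norm, R_norm, ell, b):
    a0 = ell**2 / 4 * H_norm * J_norm * R_norm
    a1 = ell**2 / 4 * J_norm * R_norm + ell * H_norm * J_norm**2 * R_norm
    a2 = H_norm * J_norm**2 + ell * H_norm * R_norm + ell * J_norm**2 * R_norm
    a3 = J_norm**2 + ell * R_norm
    return 1.0 + (1.0 + b)**2 / (1.0 - b) * max(a0, a1, a2, a3)
```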

Algorithm 1 LMS: Levenberg-Marquardt method with splitting (presented as a figure)

The key novelty is in (19), where we compute the search direction. The damping parameter \(\mu _k\) is defined at step 1, while the values of \(\ell _k\), updated at lines 12–16, resemble a trust-region approach. Roughly speaking, if the decrease is sufficient the damping parameter in the next iteration is decreased, otherwise \(\ell _k\) is increased (in particular, \(\ell _{k+1} = 2\ell _k\) whenever the full step is not accepted, cf. the proof of Lemma 8). In fact the choice of \({\mu }_k\) will be crucial for the convergence proof, see Lemma 8. The correction parameter \(\beta _k\) is specified in line 2. Assuming the standard properties of the objective function, see Assumptions 2 and 3 below, one can always choose \(\beta _k\) such that (18) holds. However, the value of \(\beta _k\) depends on the norm of the off-diagonal blocks. Thus the method essentially exploits the sparse structure we assume in this paper, see Assumption 1. The search direction \({\textbf{d}}^k\) is computed in line 4. Clearly, the system (19) is in fact completely separable and solving it requires solving K independent linear systems of the form (15).

The right-hand side correction vector in each system is also stated in (15). Thus, for the parameter \(\beta _k\) chosen in line 2, to compute \({\textbf{d}}^k\) we need to solve K systems of linear equations

$$\begin{aligned} (H_s^k +\mu _k I) d^k_s = \beta _k \sum _{j=1,\ j\ne s}^{K}B^k_{sj}g^k_j - g^k_s, \end{aligned}$$

each one of dimension \(n_s,\) with \(\sum _{s=1}^{K} n_s = N.\) These systems can be solved independently, either in parallel or sequentially, and the cost of solving them is in general significantly smaller than the cost of solving the N-dimensional LM system of linear equations. These savings are meaningful since the direction \({\textbf{d}}^k\) obtained this way is a descent direction, as we show in the convergence analysis. After that we invoke backtracking to fulfill the modified Armijo condition given in (20) and define the new iterate. The modification of the Armijo condition again depends on the norm of the off-diagonal blocks, as the step size is bounded above by \(1/\gamma _k.\) In the case \({\textbf{B}}_k = 0,\) i.e. if the system is completely separable, we get \(\gamma _k = 1\) and the classical Armijo condition is recovered. In this case the system (19) is the classical LM system and the algorithm reduces to the classical LM method with line search. On the other hand, for \({\textbf{B}}_k \ne 0\) the value of \(\Vert {\textbf{B}}_k\Vert\) fundamentally influences the values of \(\beta _k\) and \(\gamma _k\), and the algorithm allows non-negligible values of \(\beta _k, \gamma _k\) only if \(\Vert {\textbf{B}}_k\Vert\) is not too large, i.e. if the problem has a certain level of separability.
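For concreteness, the backtracking on the modified Armijo condition (20), with the trial step capped at \(1/\gamma _k\), can be sketched as follows (the default values of c and \(\nu\) in the sketch are merely illustrative; in Algorithm 1 they are input parameters).

```python
# Sketch of the line search: backtracking on the modified Armijo condition (20)
# with the trial step capped at 1/gamma_k.  F is the objective (1), d the split
# direction, g_dot_d = (g^k)^T d^k < 0, and gamma_k >= ||I - beta_k B_k||.
def line_search(F, x, d, g_dot_d, gamma_k, c=1e-4, nu=0.5):
    t = min(1.0, 1.0 / gamma_k)
    F_x = F(x)
    while F(x + t * d) >= F_x + c * t * g_dot_d:
        t *= nu                     # shrink until the sufficient decrease (20) holds
    return x + t * d, t
```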

4 Convergence analysis

The convergence analysis is divided into two parts: in Sect. 4.1 we prove that the algorithm is well defined and globally convergent under a set of standard assumptions, while the local convergence analysis is presented in Sect. 4.2. The choice of \(\beta _k\) and its influence are discussed in Sect. 4.3.

4.1 Global convergence

The following assumptions are regularity assumptions commonly used in LM methods.

Assumption 2 The vector of residuals \({\textbf{R}}: {\mathbb {R}}^N\rightarrow {\mathbb {R}}^{m}\) is continuously differentiable.

Assumption 3 The Jacobian matrix \({\textbf {{J}}}\in {\mathbb {R}}^{m\times N}\) of \({\textbf{R}}\) is L-Lipschitz continuous. That is, for every \({\textbf{x}}^{},{\textbf{y}}_{} \in {\mathbb {R}}^N\)

$$\begin{aligned} \Vert {\textbf {{J}}}({\textbf{x}}^{}) - {\textbf {{J}}}({\textbf{y}}) \Vert \le L\Vert {\textbf{x}}^{} - {\textbf{y}}\Vert . \end{aligned}$$

For the rest of this subsection we assume that \(\{{\textbf{x}}^{k}\}\) is the sequence generated by Algorithm 1 with an arbitrary initial guess \({\textbf{x}}^{0}\in {\mathbb {R}}^N\).

The following Lemma, proved in [5], is needed for the convergence analysis.

Lemma 1

[5] If Assumptions A2 and A3 hold, for every \({\textbf{x}}^{}\) and \({\textbf{y}}\) in \({\mathbb {R}}^N\) we have

$$\begin{aligned} \Vert {\textbf{R}}({\textbf{x}}^{}+{\textbf{y}}) - {\textbf{R}}({\textbf{x}}^{}) - {\textbf {{J}}}({\textbf{x}}^{}){\textbf{y}}\Vert \le \frac{L}{2}\Vert {\textbf{y}}\Vert ^2. \end{aligned}$$
(21)

Lemma 2

Assume that \({\textbf{d}}^k\) is computed as in (19) with \(\beta _k\) satisfying (18), and that Assumption A2 holds. Then \({\textbf{d}}^k\) is a descent direction for \({\textbf{F}}\) at \({\textbf{x}}^{k}.\) Moreover the following inequalities hold

  1. (i)

    \(\displaystyle ({\textbf{g}}^k)^\top {\textbf{d}}^k \le - \frac{(1-b)\Vert {\textbf{g}}^k\Vert ^2}{\Vert {\textbf{H}}_k\Vert + \mu _k}\).

  2. (ii)

    \(\displaystyle {\textbf{F}}({\textbf{x}}^{k+1})\le {\textbf{F}}({\textbf{x}}^{k})-ct_k\frac{(1-b)\Vert {\textbf{g}}^k\Vert ^2}{\Vert {\textbf{H}}_k\Vert + \mu _k}\).

Proof

We want to prove that \(({\textbf{g}}^k)^\top {\textbf{d}}^k\le 0\) for every index k. By definition of \({\textbf{d}}^k\) and using the fact that \(\Vert A^{-1}\Vert \ge \Vert A\Vert ^{-1}\) for every invertible matrix A, we have

$$\begin{aligned} \begin{aligned}&({\textbf{g}}^k)^\top {\textbf{d}}^k= -({\textbf{g}}^k)^\top ({\textbf{H}}_k + \mu _k I)^{-1} (I-\beta _k {\textbf{B}}_k) {\textbf{g}}^k \\&= -({\textbf{g}}^k)^\top ({\textbf{H}}_k + \mu _k I)^{-1}{\textbf{g}}^k + \beta _k({\textbf{g}}^k)^\top ({\textbf{H}}_k + \mu _k I)^{-1}{\textbf{B}}_k {\textbf{g}}^k \\&\le \Vert {\textbf{g}}^k\Vert ^2\left( \Vert \beta _k {\textbf{B}}_k\Vert \Vert ({\textbf{H}}_k + \mu _k I)^{-1}\Vert - \frac{1}{\Vert {\textbf{H}}_k + \mu _k I\Vert }\right) . \end{aligned} \end{aligned}$$
(22)

Since \(\mu _k>0\) and \({\textbf{H}}_k\) is symmetric and positive semidefinite, we have

$$\begin{aligned} \Vert ({\textbf{H}}_k + \mu _k I)^{-1}\Vert \le \frac{1}{\lambda _{\min }({\textbf{H}}_k + \mu _k I)}\le \frac{1}{\mu _k} \end{aligned}$$

and

$$\begin{aligned} \frac{1}{\Vert {\textbf{H}}_k + \mu _k I\Vert } = \frac{1}{\Vert {\textbf{H}}_k\Vert +\mu _k}. \end{aligned}$$

Using these two facts and the bound (18) on \(\Vert \beta _k{\textbf{B}}_k\Vert\) in inequality (22), we get

$$\begin{aligned} \begin{aligned}&({\textbf{g}}^k)^\top {\textbf{d}}^k\le \Vert {\textbf{g}}^k\Vert ^2\left( \Vert \beta _k {\textbf{B}}_k\Vert \Vert ({\textbf{H}}_k + \mu _k I)^{-1}\Vert - \frac{1}{\Vert {\textbf{H}}_k + \mu _k I\Vert }\right) \\&\le \Vert {\textbf{g}}^k\Vert ^2\left( \frac{b\mu _k}{\Vert {\textbf{H}}_k\Vert +\mu _k}\frac{1}{\mu _k} - \frac{1}{\Vert {\textbf{H}}_k\Vert +\mu _k} \right) = \frac{b-1}{\Vert {\textbf{H}}_k\Vert +\mu _k}\Vert {\textbf{g}}^k\Vert ^2, \end{aligned} \end{aligned}$$

which is part i) of the Lemma. Since \(b<1\) this also implies that \({\textbf{d}}^k\) is a descent direction at iteration k. By (20) we have that for every iteration index k

$$\begin{aligned} {\textbf{F}}({\textbf{x}}^{k}+t_k {\textbf{d}}^k)<{\textbf{F}}({\textbf{x}}^{k})+c t_k ({\textbf{d}}^k)^\top {\textbf{g}}^k. \end{aligned}$$

Bounding \(({\textbf{g}}^k)^\top {\textbf{d}}^k\) by part (i) of the statement we get (ii).

\(\square\)

Remark 4.1

Lemma 2 states that if the right-hand side correction coefficient \(\beta _k\) is chosen to satisfy the condition (18), then \({\textbf{d}}^k\) is a descent direction and therefore the backtracking procedure can always find a step size \(t_k\) such that the Armijo condition (20) is satisfied. In particular this implies that Algorithm 1 is well defined. In Lemma 9 we will also prove that under the current assumptions the step size \(t_k\) is bounded from below.

Lemma 3

If Assumption A2 holds and \({\textbf{d}}^k\) is the solution of (19), then for every iteration k we have

$$\begin{aligned} \Vert {\textbf{d}}^k\Vert \le \frac{\gamma _k}{\mu _k}\Vert {\textbf{g}}^k\Vert \le \frac{\gamma _k}{\mu _k}\Vert {\textbf {{J}}}_k\Vert \Vert {\textbf{R}}_k\Vert . \end{aligned}$$
(23)

Proof

By definition of \({\textbf{d}}^k\) and \(\gamma _k\) we have

$$\begin{aligned} \begin{aligned} \Vert {\textbf{d}}^k\Vert&= \Vert ({\textbf{H}}_k +\mu _k {\textbf{I}})^{-1}(I-\beta _k {\textbf{B}}_k){\textbf{g}}^k\Vert \\&\le \Vert ({\textbf{H}}_k +\mu _k {\textbf{I}})^{-1}\Vert \Vert (I-\beta _k {\textbf{B}}_k)\Vert \Vert {\textbf{g}}^k\Vert \\&\le \frac{1}{\lambda _{\min }({\textbf{H}}_k +\mu _k {\textbf{I}})}\gamma _k\Vert {\textbf{g}}^k\Vert \le \frac{\gamma _k}{\mu _k}\Vert {\textbf{g}}^k\Vert , \end{aligned} \end{aligned}$$
(24)

which is the first inequality in the thesis. The second inequality follows directly from the fact that \({\textbf{g}}^k = {\textbf {{J}}}_k^\top {\textbf{R}}_k\). \(\square\)

Lemma 4

If Assumption A2 holds, for every \(t\in [0,1/\gamma _k]\) we have

$$\begin{aligned} \begin{aligned} \Vert {\textbf{R}}_k + t{\textbf {{J}}}_k{\textbf{d}}^k\Vert ^2&\le \Vert {\textbf{R}}_k\Vert ^2 + t ({\textbf{g}}^k)^\top {\textbf{d}}^k + t^2\Vert {\textbf {{J}}}_k\Vert ^2\Vert {\textbf{d}}^k\Vert ^2\\ {}&\,\,\,- t\frac{(1-b)}{\gamma _k^2}\frac{\mu _k^2}{\Vert {\textbf{H}}_k\Vert +\mu _k}\Vert {\textbf{d}}^k\Vert ^2 \end{aligned} \end{aligned}$$
(25)

Proof

By Lemma 3 we have

$$\begin{aligned} \Vert {\textbf{g}}^k\Vert \ge \frac{\mu _k}{\gamma _k}\Vert {\textbf{d}}^k\Vert . \end{aligned}$$

Using this inequality in part i) of Lemma 2, we get

$$\begin{aligned} ({\textbf{g}}^k)^\top {\textbf{d}}^k\le -\frac{1-b}{\Vert {\textbf{H}}_k\Vert +\mu _k}\Vert {\textbf{g}}^k\Vert ^2\le -\frac{1-b}{\gamma _k^2}\frac{\mu _k^2}{\Vert {\textbf{H}}_k\Vert +\mu _k}\Vert {\textbf{d}}^k\Vert ^2. \end{aligned}$$
(26)

Using this inequality and the fact that \({\textbf{g}}^k = {\textbf {{J}}}_k^\top {\textbf{R}}_k\) we then have

$$\begin{aligned} \begin{aligned}&\Vert {\textbf{R}}_k+t{\textbf {{J}}}_k{\textbf{d}}_k\Vert ^2 = \Vert {\textbf{R}}_k\Vert ^2+2t({\textbf{g}}^k)^\top {\textbf{d}}^k+t^2\Vert {\textbf {{J}}}_k{\textbf{d}}^k\Vert ^2\\&\le \Vert {\textbf{R}}_k\Vert ^2 + t ({\textbf{g}}^k)^\top {\textbf{d}}^k + t^2\Vert {\textbf {{J}}}_k\Vert ^2\Vert {\textbf{d}}_k\Vert ^2- t\frac{(1-b)}{\gamma _k^2}\frac{\mu _k^2}{\Vert {\textbf{H}}_k\Vert +\mu _k}\Vert {\textbf{d}}^k\Vert ^2 \end{aligned} \end{aligned}$$
(27)

which is the thesis. \(\square\)

Lemma 5

If Assumptions A2 and A3 hold, for every \(t\in [0,1/\gamma _k]\) we have

$$\begin{aligned} \begin{aligned}&\Vert {\textbf{R}}({\textbf{x}}^{k}+ t{\textbf{d}}^k)\Vert ^2\le \Vert {\textbf{R}}_k\Vert ^2 + t ({\textbf{g}}^k)^\top {\textbf{d}}^k + t\Vert {\textbf{d}}^k\Vert ^2\Bigg (\frac{L^2}{4}t^3\Vert {\textbf{d}}^k\Vert ^2+t\Vert {\textbf {{J}}}_k\Vert ^2\\ {}&+Lt\Vert {\textbf{R}}_k\Vert +Lt^2\Vert {\textbf {{J}}}_k\Vert \Vert {\textbf{d}}^k\Vert -\frac{1-b}{\gamma _k^2}\frac{\mu _k^2}{\Vert {\textbf{H}}_k\Vert +\mu _k}\Bigg ) \end{aligned} \end{aligned}$$
(28)

Proof

Let us introduce the following function from \({\mathbb {R}}\) to \({\mathbb {R}}^m\)

$$\begin{aligned} \Psi (t)={\textbf{R}}({\textbf{x}}^{k}+t{\textbf{d}}^k)-{\textbf{R}}_k-t{\textbf {{J}}}_k{\textbf{d}}^k \end{aligned}$$
(29)

and let us notice that by Lemma 1 we have that \(\Vert \Psi (t)\Vert \le \frac{L}{2}t^2\Vert {\textbf{d}}^k\Vert ^2.\) Using this bound on \(\Psi\) and Lemma 4 we get

$$\begin{aligned} \begin{aligned}&\Vert {\textbf{R}}({\textbf{x}}^{k} + t{\textbf{d}}^k)\Vert ^2 = \Vert \Psi (t) + {\textbf{R}}_k + t{\textbf {{J}}}_k{\textbf{d}}^k\Vert ^2 \\&\le \Vert \Psi (t)\Vert ^2 + \Vert {\textbf{R}}_k + t{\textbf {{J}}}_k{\textbf{d}}^k\Vert ^2 + 2\Vert \Psi (t)\Vert \Vert {\textbf{R}}_k + t{\textbf {{J}}}_k{\textbf{d}}^k\Vert \\&\le \frac{1}{4}L^2t^4\Vert {\textbf{d}}^k\Vert ^4 + Lt^2\Vert {\textbf{d}}^k\Vert ^2(\Vert {\textbf{R}}_k\Vert +t\Vert {\textbf {{J}}}_k\Vert \Vert {\textbf{d}}^k\Vert )+\Vert {\textbf{R}}_k\Vert ^2 \\&+ t ({\textbf{g}}^k)^\top {\textbf{d}}^k + t^2\Vert {\textbf {{J}}}_k\Vert ^2\Vert {\textbf{d}}^k\Vert ^2- t\frac{(1-b)}{\gamma _k^2}\frac{\mu _k^2}{\Vert {\textbf{H}}_k\Vert +\mu _k}\Vert {\textbf{d}}^k\Vert ^2. \end{aligned} \end{aligned}$$
(30)

The thesis follows immediately. \(\square\)

Lemma 6

Let us assume that Assumptions A2 and A3 hold, and let us denote with \(\mu ^*_k\) the largest root of the polynomial \(q_k(\mu ) = \sum _{j=0}^4a^k_j\mu ^j\) with

$$\begin{aligned} \begin{aligned}&a_0^k = \frac{L^2}{4}\Vert {\textbf{H}}_k\Vert \Vert {\textbf {{J}}}_k\Vert \Vert {\textbf{R}}_k\Vert \\&a_1^k = \frac{L^2}{4}\Vert {\textbf {{J}}}_k\Vert \Vert {\textbf{R}}_k\Vert + L\Vert {\textbf{H}}_k\Vert \Vert {\textbf {{J}}}_k\Vert ^2\Vert {\textbf{R}}_k\Vert \\&a_2^k = \Vert {\textbf{H}}_k\Vert \Vert {\textbf {{J}}}_k\Vert ^2+L\Vert {\textbf{H}}_k\Vert \Vert {\textbf{R}}_k\Vert +L\Vert {\textbf {{J}}}_k\Vert ^2\Vert {\textbf{R}}_k\Vert \\&a_3^k = \Vert {\textbf {{J}}}_k\Vert ^2+L\Vert {\textbf{R}}_k\Vert \\&a_4^k = -\frac{1-b}{\gamma _k^2} \end{aligned} \end{aligned}$$
(31)

If \(\mu _k>\mu _k^*\), then \(t_k = \min \{1,1/\gamma _k\}.\)

Proof

Using the bound on \(\Vert {\textbf{d}}^k\Vert\) given by Lemma 3, and the fact that \(t\le \min \{1,1/\gamma _k\}\), we have

$$\begin{aligned} \begin{aligned}&\frac{L^2}{4}t^3\Vert {\textbf{d}}^k\Vert ^2+t\Vert {\textbf {{J}}}_k\Vert ^2+Lt\Vert {\textbf{R}}_k\Vert +Lt^2\Vert {\textbf {{J}}}_k\Vert \Vert {\textbf{d}}^k\Vert -\frac{1-b}{\gamma _k^2}\frac{\mu _k^2}{\Vert {\textbf{H}}_k\Vert +\mu _k}\\&\le t^3\frac{L^2\gamma _k^2}{4\mu _k^2}\Vert {\textbf {{J}}}_k\Vert ^2\Vert {\textbf{R}}^k\Vert ^2+t\Vert {\textbf {{J}}}_k\Vert ^2+Lt\Vert {\textbf{R}}_k\Vert +t^2\frac{L\gamma _k}{\mu _k}\Vert {\textbf {{J}}}_k\Vert ^2\Vert {\textbf{R}}^k\Vert \\&\,\,\,-\frac{1-b}{\gamma _k^2}\frac{\mu _k^2}{\Vert {\textbf{H}}_k\Vert +\mu _k}\\&\le \frac{L^2}{4\mu _k^2}\Vert {\textbf {{J}}}_k\Vert ^2\Vert {\textbf{R}}^k\Vert ^2+\Vert {\textbf {{J}}}_k\Vert ^2+L\Vert {\textbf{R}}_k\Vert +\frac{L}{\mu _k}\Vert {\textbf {{J}}}_k\Vert ^2\Vert {\textbf{R}}^k\Vert -\frac{1-b}{\gamma _k^2}\frac{\mu _k^2}{\Vert {\textbf{H}}_k\Vert +\mu _k}\\&= \frac{1}{\mu _k^2(\Vert {\textbf{H}}_k\Vert +\mu _k)}q_k(\mu _k), \end{aligned} \end{aligned}$$
(32)

where \(q_k\) is the polynomial with coefficients defined in (31). Together with Lemma 5, this implies that for every \(t\le \min \{1,1/\gamma _k\}\) we have

$$\begin{aligned} \Vert {\textbf{R}}({\textbf{x}}^{k}+ t{\textbf{d}}^k)\Vert ^2\le \Vert {\textbf{R}}_k\Vert ^2 + t ({\textbf{g}}^k)^\top {\textbf{d}}^k + t\frac{q_k(\mu _k)}{\mu _k^2(\Vert {\textbf{H}}_k\Vert +\mu _k)}\Vert {\textbf{d}}^k\Vert ^2 \end{aligned}$$
(33)

Since \(a^k_4<0\) we have that \(q_k(\mu )\rightarrow -\infty\) as \(\mu \rightarrow +\infty\). This implies that if \(\mu _k>\mu _k^*\) then \(q_k(\mu _k)<0\) and, by the inequality above, we have

$$\begin{aligned} \Vert {\textbf{R}}({\textbf{x}}^{k}+ t{\textbf{d}}^k)\Vert ^2\le \Vert {\textbf{R}}_k\Vert ^2 + t ({\textbf{g}}^k)^\top {\textbf{d}}^k \end{aligned}$$
(34)

for every \(t\le \min \{1,1/\gamma _k\}.\) Since \(c\in (0,1)\), this implies in particular that \(\min \{1,1/\gamma _k\}\) satisfies Armijo condition (20) and therefore \(t_k = \min \{1,1/\gamma _k\}.\) \(\square\)

Lemma 7

If Assumptions A2 and A3 hold and at iteration k we have \(\ell _k\ge L\), then \(t_k = \min \{1,1/\gamma _k\}\).

Proof

By the previous Lemma, in order to prove the thesis it is enough to show that when \(\ell _k\ge L\) we have \(\mu _k\ge \mu ^*_k\), with \(\mu _k^*\) the largest root of the polynomial \(q_k\) defined in (31). Using the Cauchy bound for the roots of polynomials [16], we have that

$$\begin{aligned} |\mu ^*_k|\le 1 + \max _{i=0:3}\frac{|a^k_i|}{|a^k_4|}. \end{aligned}$$
(35)

Since \(a^k_4 = -\frac{1-b}{\gamma _k^2}\) and \(\gamma _k\le 1+b\), we have that for every k

$$\begin{aligned} \frac{(1+b)^2}{1-b}\ge \frac{1}{|a^k_4|}. \end{aligned}$$

Using this inequality, the definition of \(\mu _k\), and the fact that \({\hat{a}}^k_i\ge 0\) for every \(i=0,\dots ,3\), we have

$$\begin{aligned} \mu _k = 1 + \frac{(1+b)^2}{1-b}\max _{i=0:3}{\hat{a}}^k_i = 1 + \frac{(1+b)^2}{1-b}\max _{i=0:3}|{\hat{a}}^k_i|\ge 1 + \max _{i=0:3}\frac{|{\hat{a}}^k_i|}{|a^k_4|}. \end{aligned}$$

Since \(\ell _k\ge L\) implies \({\hat{a}}^k_i\ge |a^k_i|\) for every \(i=0,\dots ,3\), this, together with (35), implies that \(\mu _k\ge \mu _k^*\), which concludes the proof. \(\square\)

Lemma 8

If Assumptions A2 and A3 hold then we have \(t_k = \min \{1,1/\gamma _k\}\) for infinitely many values of k.

Proof

By Lemma 7 we have that \(t_k = \min \{1,1/\gamma _k\}\) whenever \(\ell _k\ge L.\) Assume by contradiction that there exists an iteration index \({\bar{k}}\) such that for every \(k\ge {\bar{k}}\) the step size \(t_k\) is strictly smaller than \(\min \{1,1/\gamma _k\}.\) Since in Algorithm 1 we have \(\ell _{k+1} = 2\ell _k\) whenever \(t_k<\min \{1,1/\gamma _k\}\), this implies that for every \(k\ge {\bar{k}}\) we have

$$\begin{aligned} \ell _{k+1} = 2\ell _{k} = 2^{k+1-{\bar{k}}}\ell _{{\bar{k}}}. \end{aligned}$$

Therefore there exists \(k'\ge {\bar{k}}\) such that \(\ell _{k'}\ge L\), which by Lemma 7 implies \(t_{k'} = \min \{1,1/\gamma _{k'}\}\), contradicting the fact that \(t_k<\min \{1,1/\gamma _k\}\) for every \(k\ge {\bar{k}}\). \(\square\)

The above Lemma allows us to prove the first global statement below. Namely, we prove that any bounded sequence of iterates has at least one accumulation point which is stationary.

Theorem 1

Assume that Assumptions A2, A3 hold and that \(\{{\textbf{x}}^{k}\}\) is a sequence generated by Algorithm 1 with arbitrary \({\textbf{x}}^{0}\in {\mathbb {R}}^N\). If \(\{{\textbf{x}}^{k}\}\) is bounded, then it has at least one accumulation point that is also a stationary point for \({\textbf{F}}({\textbf{x}}^{}).\)

Proof

Since \(\{{\textbf{x}}^{k}\}\subset {\mathbb {R}}^N\) is bounded and by Lemma 8 the sequence of step sizes \(\{t_k\}\) takes the value \(\min \{1,1/\gamma _k\}\) infinitely many times, we can take a subsequence \(\{{\textbf{x}}^{k_j}\}\subset \{{\textbf{x}}^{k}\}\) such that \(t_{k_j} = \min \{1,1/\gamma _{k_j}\}\) for every j and such that \({\textbf{x}}^{k_j}\) converges to \(\bar{{\textbf{x}}^{}}\) as j tends to infinity. By Lemma 2 we have

$$\begin{aligned} c(1-b)\sum _{j=0}^\infty \min \{1,1/\gamma _{k_j}\}\frac{\Vert {\textbf{g}}_{k_j}\Vert ^2}{\Vert {\textbf{H}}_{k_j}\Vert + \mu _{k_j}}\le {\textbf{F}}_0 <\infty \end{aligned}$$

which implies that

$$\begin{aligned} \lim _{j\rightarrow +\infty } \min \{1,1/\gamma _{k_j}\}\frac{\Vert {\textbf{g}}_{k_j}\Vert ^2}{\Vert {\textbf{H}}_{k_j}\Vert + \mu _{k_j}}= 0. \end{aligned}$$
(36)

By definition of \(\gamma _k\) and (18) we have that \(\min \{1,1/\gamma _{k_j}\}\ge 1/(1+b)>0\). Since \(\{{\textbf{x}}^{k_j}\}\) is contained in a compact subset of \({\mathbb {R}}^N\) and \({\textbf{R}}({\textbf{x}}^{})\) is continuously differentiable, the sequences \(\Vert {\textbf{H}}_{k_j}\Vert\), \(\Vert {\textbf{R}}_{k_j}\Vert\), and \(\Vert {\textbf {{J}}}_{k_j}\Vert\) are bounded from above, which by definition of \(\mu _k\) implies that \(\Vert {\textbf{H}}_{k_j}\Vert + \mu _{k_j}\) is also bounded from above. This, together with (36), implies that \(\Vert {\textbf{g}}_{k_j}\Vert\) vanishes as j tends to infinity and therefore \(\bar{{\textbf{x}}^{}}\) is a stationary point of \({\textbf{F}}({\textbf{x}}^{}).\) \(\square\)

Lemma 9

If Assumptions A2 and A3 hold and \(\ell _{min}>0\) then for every index k we have

$$\begin{aligned} t_k\ge \min \left\{ {\nu }\frac{\ell _{min}^8}{4L^8}, \frac{1}{1+b}\right\} . \end{aligned}$$

Proof

From inequality (32), since \(t\le \min \{1,1/\gamma _k\}\), we have

$$\begin{aligned} \begin{aligned}&\frac{L^2}{4}t^3\Vert {\textbf{d}}^k\Vert ^2+t\Vert {\textbf {{J}}}_k\Vert ^2+Lt\Vert {\textbf{R}}_k\Vert +Lt^2\Vert {\textbf {{J}}}_k\Vert \Vert {\textbf{d}}^k\Vert -\frac{1-b}{\gamma _k^2}\frac{\mu _k^2}{\Vert {\textbf{H}}_k\Vert +\mu _k} \\&\le \frac{1}{\mu _k^2(\Vert {\textbf{H}}_k\Vert +\mu _k)}\left( \left( q_k(\mu _k)+\frac{1-b}{\gamma _k^2}\mu _k^4\right) t - \frac{1-b}{\gamma _k^2}\mu _k^4\right) , \end{aligned} \end{aligned}$$
(37)

where \(q_k\) is defined in (31). Using the inequality above together with Lemma 7 we have

$$\begin{aligned} \begin{aligned}&\Vert {\textbf{R}}({\textbf{x}}^{k}+ t{\textbf{d}}^k)\Vert ^2\le \Vert {\textbf{R}}_k\Vert ^2 + t ({\textbf{g}}^k)^\top {\textbf{d}}^k \\ {}&+ t\Vert {\textbf{d}}^k\Vert ^2\frac{1}{\mu _k^2(\Vert {\textbf{H}}_k\Vert +\mu _k)}\left( \left( q_k(\mu _k)+\frac{1-b}{\gamma _k^2}\mu _k^4\right) t - \frac{1-b}{\gamma _k^2}\mu _k^4\right) \end{aligned} \end{aligned}$$
(38)

Let us define

$$\begin{aligned} {\bar{t}}_k:= \frac{\mu _k^4}{\frac{\gamma _k^2}{1-b}q_k(\mu _k)+\mu _k^4}. \end{aligned}$$

We can easily see that if \(t\le {\bar{t}}_k\) then the term between parentheses of the previous inequality is non-positive and therefore

$$\begin{aligned} \begin{aligned}&\Vert {\textbf{R}}({\textbf{x}}^{k}+ t{\textbf{d}}^k)\Vert ^2\le \Vert {\textbf{R}}_k\Vert ^2 + t ({\textbf{g}}^k)^\top {\textbf{d}}^k. \end{aligned} \end{aligned}$$
(39)

Since in Algorithm 1 we take \(c\in (0,1)\), this implies that if \(t\le {\bar{t}}_k\) the Armijo condition (20) holds, and therefore the accepted step size \(t_k\) satisfies \(t_k\ge \nu {\bar{t}}_k\).

By Lemma 7 we have that if \(\ell _k\ge L\) then \(t_k = \min \{1,1/\gamma _{k}\}\ge 1/(1+b)\). Let us consider the case \(\ell _k <L,\) which also implies \(\mu _k\le {{\bar{\mu }}}_k\) with

$$\begin{aligned} {{\bar{\mu }}}_k = 1+\max _{i=0:3}\frac{|a_i^k|}{|a_4^k|}. \end{aligned}$$

Using the definition of \({{\bar{\mu }}}_k\) and the fact that \({{\bar{\mu }}}_k\ge 1\) and \(\mu _k\le {{\bar{\mu }}}_k\), we have

$$\begin{aligned} \begin{aligned} \frac{\gamma _k^2}{1-b}q_k(\mu _k) + \mu _k^4&= -\mu _k^4+\frac{a^k_3}{|a^k_4|}\mu _k^3+\frac{a^k_2}{|a^k_4|}\mu _k^2+\frac{a^k_1}{|a^k_4|}\mu _k +\frac{a^k_0}{|a^k_4|}+\mu _k^4\\&\le \frac{a^k_3}{|a^k_4|}{{\bar{\mu }}}_k^3+\frac{a^k_2}{|a^k_4|}{{\bar{\mu }}}_k^2+\frac{a^k_1}{|a^k_4|}{{\bar{\mu }}}_k +\frac{a^k_0}{|a^k_4|}\\&\le 4 \left( 1+\max _{i=0:3}\frac{|a_i^k|}{|a_4^k|}\right) {{\bar{\mu }}}_k^3 = 4{{\bar{\mu }}}_k^{4}. \end{aligned} \end{aligned}$$
(40)

Since we are considering the case \(\ell _k<L\), we have \({\hat{a}}^k_i\ge \frac{\ell _k^2}{L^2}|a^k_i|\) for every \(i=0,\dots ,3.\) Moreover, \(\frac{1}{|a^k_4|} = \frac{\gamma _k^2}{1-b}\le \frac{(1+b)^2}{1-b}.\) This implies

$$\begin{aligned} \mu _{k}&= 1 + \frac{(1+b)^{2}}{1-b}\max _{i=0:3}{\hat{a}}_{i}^{k} \ge 1 + \frac{(1+b)^{2}}{1-b}\frac{\ell _{k}^{2}}{L^{2}}\max _{i=0:3}|a_{i}^{k}| \\&\ge 1 + \frac{\ell _{k}^{2}}{L^{2}}\max _{i=0:3}\frac{|a_{i}^{k}|}{|a_{4}^{k}|} \ge \frac{\ell _{k}^{2}}{L^{2}}\left( 1 + \max _{i=0:3}\frac{|a_{i}^{k}|}{|a_{4}^{k}|}\right) = \frac{\ell _{k}^{2}}{L^{2}}{{\bar{\mu }}}_{k}. \end{aligned}$$

Using this inequality and (40) in the definition of \({\bar{t}}_k\) we get

$$\begin{aligned} \begin{aligned} {\bar{t}}_k = \frac{\mu _k^4}{\frac{\gamma _k^2}{1-b}q_k(\mu _k)+\mu _k^4}\ge \frac{\mu _k^4}{4{{\bar{\mu }}}_k^4} \ge \left( \frac{\ell _k^2}{L^2}{{\bar{\mu }}}_k\right) ^4\frac{1}{4{{\bar{\mu }}}_k^4} \ge \frac{\ell _{min}^8}{4L^8} \end{aligned} \end{aligned}$$
(41)

which gives us the thesis. \(\square\)

Finally, we can state the global convergence results.

Theorem 2

If Assumptions A2 and A3 hold and \(\ell _{min}>0\) then every accumulation point of the sequence \(\{{\textbf{x}}^{k}\}\) is a stationary point of \({\textbf{F}}({\textbf{x}}^{}).\)

Proof

Let \(\bar{{\textbf{x}}^{}}\) be an accumulation point of \(\{{\textbf{x}}^{k}\}\) and let \(\{{\textbf{x}}^{k_j}\}\) be a subsequence converging to \(\bar{{\textbf{x}}^{}}.\) By Lemma 2 we have

$$\begin{aligned} c(1-b)\sum _{j=0}^\infty t_{k_j}\frac{\Vert {\textbf{g}}_{k_j}\Vert ^2}{\Vert {\textbf{H}}_{k_j}\Vert + \mu _{k_j}}\le {\textbf{F}}_0 <\infty \end{aligned}$$

and therefore that

$$\begin{aligned} \lim _{j\rightarrow +\infty }t_{k_j}\frac{\Vert {\textbf{g}}_{k_j}\Vert ^2}{\Vert {\textbf{H}}_{k_j}\Vert + \mu _{k_j}} = 0. \end{aligned}$$

By Lemma 9 the sequence \(\{t_k\}\) is bounded away from zero, while by continuity of \({\textbf {{J}}}({\textbf{x}}^{}), {\textbf{R}}({\textbf{x}}^{}), {\textbf{H}}({\textbf{x}}^{})\) and of the 2-norm, we have that \(\Vert {\textbf{H}}_{k_j}\Vert + \mu _{k_j}\) is bounded from above. This implies

$$\begin{aligned} 0=\lim _{j\rightarrow +\infty }\Vert {\textbf{g}}_{k_j}\Vert = \Vert {\textbf{g}}(\bar{{\textbf{x}}^{}})\Vert \end{aligned}$$

and thus \(\bar{{\textbf{x}}^{}}\) is a stationary point of \({\textbf{F}}({\textbf{x}}^{}).\) \(\square\)

4.2 Local convergence

Let us now analyze the local convergence. We are going to show that the LMS method generates a linearly convergent sequence under a set of suitable assumptions. Notice that the assumptions we use are standard, see [2], plus the sparsity assumption that we already stated. Let S denote the set of all stationary points of \({\textbf{F}}\), namely \(S = \{{\textbf{x}}^{}\in {\mathbb {R}}^{N}| {\textbf {{J}}}({\textbf{x}}^{})^\top {\textbf{R}}({\textbf{x}}^{}) = 0\}.\) Consider a stationary point \({\textbf{x}}^{*}\in S\) and a ball \(B({\textbf{x}}^{*},r)\) of radius \(r > 0\) around it. In the rest of the section we make the following assumptions, see [2].

Assumption 4 There exists \(\omega >0\) such that for every \({\textbf{x}}^{} \in B( {\textbf{x}}^{*},r)\)

$$\begin{aligned} \omega {{\,\textrm{dist}\,}}({\textbf{x}}^{}, S)\le \Vert {\textbf {{J}}}({\textbf{x}}^{})^\top {\textbf{R}}({\textbf{x}}^{})\Vert \end{aligned}$$

Assumption 5 There exists \(\sigma >0\) such that for every \({\textbf{x}}^{} \in B( {\textbf{x}}^{*},r)\) and every \({\bar{{\textbf{z}}}} \in B({\textbf{x}}^{*},r) \cap S\)

$$\begin{aligned} \Vert ({\textbf {{J}}}({\textbf{x}}^{}) - {\textbf {{J}}}({\bar{{\textbf{z}}}}))^\top {\textbf{R}}({\bar{{\textbf{z}}}})\Vert \le \sigma \Vert {\textbf{x}}^{}-{\bar{{\textbf{z}}}}\Vert . \end{aligned}$$

From now on we denote with \(\rho _k\) the relative residual of the linear system (11) attained by the approximate solution \({\textbf{d}}^k\). That is

$$\begin{aligned} \Vert {\textbf{g}}^k + ({\textbf {{J}}}_k^\top {\textbf {{J}}}_k +\mu _k {\textbf{I}}){\textbf{d}}^k\Vert \le \rho _k\Vert {\textbf{g}}^k\Vert . \end{aligned}$$
(42)

This residual was already considered in (16), where we briefly mentioned that we determine \(\beta _k\) so that this residual is minimized. Further details on this choice are presented in Sect. 4.3; in this part we keep \(\rho _k\) without further specification, i.e. we only assume that it is small enough that the local convergence requirements can be fulfilled. Clearly, for completely separable problems, i.e. \({\textbf{B}}_k = 0\), we get \(\rho _k = 0\), and hence the value of \(\rho _k\) depends on the constant M of Assumption 1: if M is small enough, i.e. if the problem is nearly-separable to a sufficient degree, it is reasonable to expect that the values of \(\rho _k\) will be small enough for a suitable choice of \(\beta _k.\)
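In an implementation, the relative residual appearing in (42) is cheap to monitor once \({\textbf{d}}^k\) has been computed, since it only requires two additional matrix-vector products with the sparse Jacobian; a minimal sketch is the following.

```python
import numpy as np

# Sketch: relative residual of the full LM system (11) attained by the split
# direction d^k, i.e. the smallest rho_k for which (42) holds.
def relative_residual(J, R, mu, d):
    g = J.T @ R
    res = g + (J.T @ (J @ d) + mu * d)   # g^k + (J^T J + mu I) d^k
    return np.linalg.norm(res) / np.linalg.norm(g)
```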

The inequalities in the Lemma below are direct consequences of Assumption 3; their proofs can be found in [2].

Lemma 10

Let Assumptions 2 and 3 hold. There exist positive constants \(L_2, L_3\) and \(L_4\) such that for every \({\textbf{x}}^{}, {\textbf{y}}\in B({\textbf{x}}^{*},r),\ {\bar{{\textbf{z}}}}\in B({\textbf{x}}^{*},r) \cap S\) the inequalities below hold:

$$\begin{aligned}{} & {} \Vert {\textbf{R}}({\textbf{x}}^{})-{\textbf{R}}({\textbf{y}}) - {\textbf {{J}}}({\textbf{y}})({\textbf{x}}^{}-{\textbf{y}})\Vert \le \frac{L}{2}\Vert {\textbf{x}}^{}-{\textbf{y}}\Vert ^2 \end{aligned}$$
(43)
$$\begin{aligned}{} & {} \Vert {\textbf{R}}({\textbf{x}}^{})-{\textbf{R}}({\textbf{y}})\Vert \le L_2\Vert {\textbf{x}}^{}-{\textbf{y}}\Vert \end{aligned}$$
(44)
$$\begin{aligned}{} & {} \Vert {\textbf {{J}}}({\textbf{x}}^{})^\top {\textbf{R}}({\textbf{x}}^{}) - {\textbf {{J}}}({\textbf{y}})^\top {\textbf{R}}({\textbf{y}})\Vert \le L_3\Vert {\textbf{x}}^{}-{\textbf{y}}\Vert \end{aligned}$$
(45)
$$\begin{aligned}{} & {} \Vert {\textbf {{J}}}({\textbf{x}}^{})^\top {\textbf{R}}({\textbf{x}}^{}) - {\textbf {{J}}}({\textbf{y}})^\top {\textbf{R}}({\textbf{y}}) - {\textbf {{J}}}({\textbf{x}}^{})^\top {\textbf {{J}}}({\textbf{x}}^{})({\textbf{x}}^{}-{\textbf{y}}) \Vert \nonumber \\{} & {} \le L_4\Vert {\textbf{x}}^{}-{\textbf{y}}\Vert ^2+\Vert ({\textbf {{J}}}({\textbf{x}}^{}) - {\textbf {{J}}}({\textbf{y}}))^\top {\textbf{R}}({\textbf{y}})\Vert \end{aligned}$$
(46)
$$\begin{aligned}{} & {} \Vert ({\textbf {{J}}}({\textbf{x}}^{}) - {\textbf {{J}}}({\textbf{y}}))^\top {\textbf{R}}({\textbf{y}})\Vert \le LL_2\Vert {\textbf{x}}^{}-{\bar{{\textbf{z}}}}\Vert \Vert {\textbf{y}}-{\bar{{\textbf{z}}}}\Vert + LL_2 \Vert {\textbf{y}}-{\bar{{\textbf{z}}}}\Vert ^2 \nonumber \\{} & {} \quad + \Vert {\textbf {{J}}}({\textbf{x}}^{})^\top {\textbf{R}}({\bar{{\textbf{z}}}})\Vert + \Vert {\textbf {{J}}}({\textbf{y}})^\top {\textbf{R}}({\bar{{\textbf{z}}}})\Vert \end{aligned}$$
(47)

From now on, given any point \({\textbf{x}}^{}\in {\mathbb {R}}^N\), we denote with \(\bar{{\textbf{x}}^{}}\) the point in S that realizes \(\Vert {\textbf{x}}^{} - \bar{{\textbf{x}}^{}}\Vert = {{\,\textrm{dist}\,}}({\textbf{x}}^{}, S).\)

Lemma 11

There exists \(r>0\) and \(c_1>0\) such that if \({\textbf{x}}^{k}\in B({\textbf{x}}^{*},r)\) then \(\Vert {\textbf{d}}^k\Vert \le c_1{{\,\textrm{dist}\,}}({\textbf{x}}^{k},S).\)

Proof

Let us define \({\textbf {H}}^* = {\textbf{H}}({\textbf{x}}^{*})\). Consider the eigendecomposition of \({\textbf {H}}^* = {{\textbf {S}}}_*^\top {{\textbf {S}}}_* = {\textbf{Q}}_*{\varvec{{\Lambda }}}_*{\textbf{Q}}_*^\top\) where \({\varvec{{\Lambda }}}_*\) is a diagonal matrix containing the ordered eigenvalues of \({{\textbf {S}}}_*^\top {{\textbf {S}}}_*\) and \({\textbf{Q}}_*\) is the matrix containing the orthogonal eigenvectors corresponding to the eigenvalues in \({\varvec{{\Lambda }}}_*.\) Denoting with p the rank of \({{\textbf {S}}}_*^\top {{\textbf {S}}}_*\) we have that

$$\begin{aligned} {\varvec{{\Lambda }}}_* =\left( \begin{array}{ccc} \Lambda _{*1} &{} 0\\ 0 &{} 0 \end{array}\right) \end{aligned}$$

with \(\Lambda _{*1} = {{\,\textrm{diag}\,}}(\lambda ^*_1,\dots , \lambda ^*_p)\) and \(\lambda ^*_1\ge \lambda ^*_2\ge \dots \ge \lambda ^*_p>0.\) Consider the eigendecomposition of \({\textbf{H}}_k = {\textbf{Q}}_k{\varvec{{\Lambda }}}_k {\textbf{Q}}_k^\top\) and consider the partition of \({\textbf{Q}}_k\) and \({\varvec{{\Lambda }}}_k\) corresponding to the partition of \({\varvec{{\Lambda }}}_*:\)

$$\begin{aligned} {\textbf{Q}}_k = \left( Q_{k1}, Q_{k2}\right) ,\ {\varvec{{\Lambda }}}_k =\left( \begin{array}{ccc} \Lambda _{k1} &{} 0\\ 0 &{} \Lambda _{k2} \end{array}\right) \end{aligned}$$

with \(\Lambda _{k1} = {{\,\textrm{diag}\,}}(\lambda ^k_1,\dots ,\lambda ^k_p)\in {\mathbb {R}}^{p\times p}\), \(\Lambda _{k2} = {{\,\textrm{diag}\,}}(\lambda ^k_{p+1},\dots ,\lambda ^k_N)\in {\mathbb {R}}^{(N-p)\times (N-p)}\), \(Q_{k1}\in {\mathbb {R}}^{N\times p}\), \(Q_{k2}\in {\mathbb {R}}^{N\times (N-p)}\) and \(\lambda ^k_1\ge \dots \ge \lambda ^k_p\). Since \({\textbf{R}}\) is continuously differentiable on \(B({\textbf{x}}^{*},r)\), the entries of \({{\textbf {S}}}({\textbf{x}}^{})^\top {{\textbf {S}}}({\textbf{x}}^{})\) are continuous functions of \({\textbf{x}}^{}\) and thus the eigenvalues of \({{\textbf {S}}}({\textbf{x}}^{})^\top {{\textbf {S}}}({\textbf{x}}^{})\) are continuous functions of \({\textbf{x}}^{}\). Therefore, for r small enough we have \(\lambda ^k_i\ge \frac{1}{2}\lambda ^*_p\) for every \(i=1,\ldots ,p.\) Moreover, since \({\textbf{Q}}_k\) is an orthogonal matrix, we have that \(\Vert {\textbf{d}}^k\Vert ^2 = \Vert Q_{k1}^\top {\textbf{d}}^k\Vert ^2 + \Vert Q_{k2}^\top {\textbf{d}}^k\Vert ^2.\) For \(i=1,2\), by definition of \({\textbf{d}}^k\), we have

$$\begin{aligned} {\textbf{Q}}_{ki}^\top (\beta _k{\textbf{B}}_k - {\textbf{I}}){\textbf{g}}^k = {\textbf{Q}}_{ki}^\top ({\textbf{H}}_k + \mu _k {\textbf{I}}){\textbf{d}}^k = ({\varvec{{\Lambda }}}_{ki} + \mu _k {\textbf{I}}){\textbf{Q}}_{ki}^\top {\textbf{d}}^k \end{aligned}$$
(48)

so that

$$\begin{aligned} \begin{aligned} \Vert {\textbf{Q}}_{ki}^\top {\textbf{d}}^k\Vert = \Vert ({\varvec{{\Lambda }}}_{ki} + \mu _k {\textbf{I}})^{-1}{\textbf{Q}}_{ki}^\top (\beta _k{\textbf{B}}_k - {\textbf{I}}){\textbf{g}}^k\Vert . \end{aligned} \end{aligned}$$
(49)

By definition of \(\gamma _k\), inequality (45), and the fact that \(\lambda ^k_p\ge \frac{1}{2}\lambda ^*_p\) we have

$$\begin{aligned} \begin{aligned} \Vert Q_{k1}^\top {\textbf{d}}^k\Vert&\le \Vert ({\varvec{{\Lambda }}}_{k1} + \mu _k {\textbf{I}})^{-1}\Vert \Vert (\beta _k{\textbf{B}}_k - {\textbf{I}})\Vert \Vert {\textbf{g}}^k\Vert \le \frac{\gamma _k}{\lambda ^k_p+\mu _k}\Vert {\textbf{g}}^k\Vert \\&\le \frac{2\gamma _k L_3}{\lambda ^*_p} \Vert {\textbf{x}}^{k}-\bar{{\textbf{x}}}^{k}\Vert \le \frac{2\gamma _k L_3}{\lambda ^*_p}{{\,\textrm{dist}\,}}({\textbf{x}}^{k},S) \end{aligned} \end{aligned}$$
(50)

and analogously,

$$\begin{aligned} \Vert Q_{k2}^\top {\textbf{d}}^k\Vert \le \frac{\gamma _k}{\mu _k}\Vert {\textbf{g}}^k\Vert \le \frac{\gamma _k L_3}{\mu _k} {{\,\textrm{dist}\,}}({\textbf{x}}^{k},S). \end{aligned}$$

Therefore the thesis holds with

$$\begin{aligned} c_1 = \gamma L_3\left( \frac{4}{(\lambda ^*_p)^2} + \frac{1}{\mu _{\min }^2} \right) ^{1/2} \end{aligned}$$

for \(\gamma = 1+b \ge \gamma _k,\) and \(\mu _{\min } = \inf _{k}\mu _k\ge 1.\)\(\square\)

Lemma 12

If \({\textbf{x}}^{k}, {\textbf{x}}^{k+1}\in B({\textbf{x}}^{*},r/2)\) then

$$\begin{aligned} \begin{aligned} \omega {{\,\textrm{dist}\,}}({\textbf{x}}^{k+1}, S) \le (L_4c_1^2 + L L_2(2+c_1)(1+c_1)) \Vert {\textbf{x}}^{k}-\bar{{\textbf{x}}}^{k}\Vert ^2 \\ +(\mu _kc_1 + \rho _{\max }L_3)\Vert {\textbf{x}}^{k}-\bar{{\textbf{x}}}^{k}\Vert + \Vert {\textbf {{J}}}_k^\top {\textbf{R}}(\bar{{\textbf{x}}}^{k})\Vert + \Vert ({\textbf {{J}}}_{k+1})^\top {\textbf{R}}(\bar{{\textbf{x}}}^{k})\Vert \end{aligned} \end{aligned}$$
(51)

where \(\rho _{max} = \max _{k}\rho _k\) with \(\rho _k\) defined in (42) and \(\bar{{\textbf{x}}}^{k}\) is a point in S such that \({{\,\textrm{dist}\,}}({\textbf{x}}^{k}, S) = \Vert {\textbf{x}}^{k} - \bar{{\textbf{x}}}^{k}\Vert\).

Proof

We first observe that

$$\begin{aligned} \begin{aligned} \omega {{\,\textrm{dist}\,}}({\textbf{x}}^{k+1}, S)&\le \Vert {\textbf{g}}_{k+1}\Vert \le \Vert {\textbf{g}}_{k}+{\textbf {{J}}}_{k}^\top {\textbf {{J}}}_{k}{\textbf{d}}^k\Vert +\Vert {\textbf{g}}_{k+1} - {\textbf{g}}_{k} - ({\textbf {{J}}}_{k})^\top {\textbf {{J}}}_{k}({\textbf{x}}^{k+1}-{\textbf{x}}^{k})\Vert \\&\le L_4\Vert {\textbf{d}}^k\Vert ^2 + \Vert ({\textbf {{J}}}_k-{\textbf {{J}}}_{k+1})^\top {\textbf{R}}_{k+1}\Vert +\mu _k \Vert {\textbf{d}}^k\Vert \\&\,\,\,+\Vert {\textbf{g}}^k + ({\textbf {{J}}}_k^\top {\textbf {{J}}}_k +\mu _k {\textbf{I}}){\textbf{d}}^k\Vert .\end{aligned} \end{aligned}$$
(52)

Moreover, by inequality (47) and Lemma 11, we have

$$\begin{aligned} \begin{aligned} \Vert ({\textbf {{J}}}_k-{\textbf {{J}}}_{k+1})^\top {\textbf{R}}_{k+1}\Vert&\le LL_2(\Vert {\textbf{x}}^{k}- \bar{{\textbf{x}}}^{k}\Vert +\Vert {\textbf{x}}^{k+1}- \bar{{\textbf{x}}}^{k}\Vert )\Vert {\textbf{x}}^{k+1}- \bar{{\textbf{x}}}^{k}\Vert \\ {}&+ \Vert {\textbf {{J}}}_k^\top {\textbf{R}}(\bar{{\textbf{x}}}^{k})\Vert +\Vert ({\textbf {{J}}}_{k+1})^\top {\textbf{R}}(\bar{{\textbf{x}}}^{k})\Vert \\ {}&\le LL_2(1+c_1)(2+c_1)\Vert {\textbf{x}}^{k}- \bar{{\textbf{x}}}^{k}\Vert ^2 \\ {}&+ \Vert {\textbf {{J}}}_k^\top {\textbf{R}}(\bar{{\textbf{x}}}^{k})\Vert +\Vert ({\textbf {{J}}}_{k+1})^\top {\textbf{R}}(\bar{{\textbf{x}}}^{k})\Vert . \end{aligned} \end{aligned}$$
(53)

By definition of \(\rho _k\) there holds

$$\begin{aligned} \begin{aligned} \Vert {\textbf{g}}^k + ({\textbf {{J}}}_k^\top {\textbf {{J}}}_k +\mu _k {\textbf{I}}){\textbf{d}}^k\Vert \le \rho _{\max }\Vert {\textbf{g}}^k\Vert \le \rho _{\max }L_3\Vert {\textbf{x}}^{k}-\bar{{\textbf{x}}}^{k}\Vert \end{aligned} \end{aligned}$$
(54)

Replacing these two inequalities in (52) and using Lemma 11 we get the thesis. \(\square\)

Lemma 13

Assume that there exists \(\eta \in (0,1)\) such that \(\eta \omega >c_1\mu _{\max } + \rho _{\max }L_3 + (2+c_1)\sigma ,\) and define

$$\begin{aligned} \varepsilon = \frac{\eta \omega -(c_1\mu _{\max } + \rho _{\max }L_3 + (2+c_1)\sigma )}{L_4c_1^2 + LL_2(1+c_1)(2+c_1)}. \end{aligned}$$

If \({\textbf{x}}^{k}, {\textbf{x}}^{k+1}\in B({\textbf{x}}^{*},r/2)\) and \({{\,\textrm{dist}\,}}({\textbf{x}}^{k}, S)\le \varepsilon\) then

$$\begin{aligned} {{\,\textrm{dist}\,}}({\textbf{x}}^{k+1}, S)\le \eta {{\,\textrm{dist}\,}}({\textbf{x}}^{k}, S). \end{aligned}$$

Proof

By Assumption 5 and Lemma 11

$$\begin{aligned} \Vert {\textbf {{J}}}_k^\top {\textbf{R}}(\bar{{\textbf{x}}}^{k})\Vert \le \Vert ({\textbf {{J}}}_k-{\textbf {{J}}}(\bar{{\textbf{x}}}^{k}) )^\top {\textbf{R}}(\bar{{\textbf{x}}}^{k})\Vert \le \sigma \Vert {\textbf{x}}^{k} - \bar{{\textbf{x}}}^{k}\Vert \end{aligned}$$

and

$$\begin{aligned} \Vert ({\textbf {{J}}}_{k+1})^\top {\textbf{R}}(\bar{{\textbf{x}}}^{k})\Vert \le \Vert ({\textbf {{J}}}_{k+1}-{\textbf {{J}}}(\bar{{\textbf{x}}}^{k}) )^\top {\textbf{R}}(\bar{{\textbf{x}}}^{k})\Vert \le \sigma (1+c_1)\Vert {\textbf{x}}^{k} - \bar{{\textbf{x}}}^{k}\Vert \end{aligned}$$

Therefore, from Lemma 12, since we are assuming \({{\,\textrm{dist}\,}}({\textbf{x}}^{k}, S)\le \varepsilon ,\) there follows

$$\begin{aligned} \begin{aligned}&\omega {{\,\textrm{dist}\,}}({\textbf{x}}^{k+1}, S) \le (L_4c_1^2 + L L_2(2+c_1)(1+c_1)) \Vert {\textbf{x}}^{k}-\bar{{\textbf{x}}}^{k}\Vert ^2 \\&+(\mu _kc_1 + \rho _{\max }L_3 + (2+c_1)\sigma )\Vert {\textbf{x}}^{k}-\bar{{\textbf{x}}}^{k}\Vert \\&\le \left( (L_4c_1^2 + L L_2(2+c_1)(1+c_1)) \varepsilon +(\mu _kc_1 + \rho _{\max }L_3 + (2+c_1)\sigma )\right) \Vert {\textbf{x}}^{k}-\bar{{\textbf{x}}}^{k}\Vert \end{aligned} \end{aligned}$$
(55)

and we get the thesis by definition of \(\varepsilon .\) \(\square\)

The above Lemmas allow us to prove the local linear convergence.

Theorem 3

Assume that Assumptions 2-5 hold and that there exists \(\eta \in (0,1)\) such that \(\eta \omega >\mu _{\max } c_1+L_3\rho _{\max }+(2+c_1)\sigma\) and let us define

$$\begin{aligned} \varepsilon =\min \left\{ \frac{\eta \omega -(c_1\mu _{\max } + \rho _{\max }L_3 + (2+c_1)\sigma )}{L_4c_1^2 + LL_2(1+c_1)(2+c_1)}, \frac{1}{2}\frac{r(1-\eta )}{1+c_1-\eta } \right\} . \end{aligned}$$

If \({\textbf{x}}^{0}\in B({\textbf{x}}^{*},\varepsilon )\) then \({{\,\textrm{dist}\,}}({\textbf{x}}^{k},S)\rightarrow 0\) linearly and \({\textbf{x}}^{k}\rightarrow \bar{{\textbf{x}}^{}}\in S\cap B({\textbf{x}}^{*}, r/2)\).

Proof

We prove by induction on k that \({\textbf{x}}^{k}\in B({\textbf{x}}^{*}, r/2)\) for every k.

For \(k = 1\), by Lemma 11 and the definition of \(\varepsilon\), we have

$$\begin{aligned} \Vert {\textbf{x}}^{1} - {\textbf{x}}^{*}\Vert \le \Vert {\textbf{x}}^{1} - {\textbf{x}}^{0}\Vert + \Vert {\textbf{x}}^{0} - {\textbf{x}}^{*}\Vert \le \Vert {\textbf {d}}^0\Vert + \varepsilon \le \varepsilon (1+c_1)\le \frac{r}{2}. \end{aligned}$$

Given \(k\ge 1\), assume that for every \(j=1,\ldots ,k-1\) there holds \({{\,\textrm{dist}\,}}({\textbf{x}}^{j}, S)\le \varepsilon\) and \({\textbf{x}}^{j}\in B(x^*, r/2)\). Then we have

$$\begin{aligned} \Vert {\textbf{x}}^{k+1} - {\textbf{x}}^{*}\Vert \le \Vert {\textbf{x}}^{1} - {\textbf{x}}^{*}\Vert + \sum _{j=1}^k\Vert {\textbf {d}}^j\Vert \end{aligned}$$

and the fact that the right-hand side is smaller than r/2 follows again from Lemma 11 and the definition of \(\varepsilon\). So, \({\textbf{x}}^{k}\in B({\textbf{x}}^{*}, r/2)\) for every k and to prove the first part of the thesis it is enough to apply Lemma 13.

Therefore, if there exists \(\bar{{\textbf{x}}}^{} = \lim {\textbf{x}}^{k}\), then the limit has to belong to \(S\cap B(x^*, r/2)\), and to prove the second part of the thesis we only need to prove that such a limit exists. For every index k we have that \(\Vert {\textbf {d}}^k\Vert \le c_1\varepsilon \eta ^k\), so given any two indices \(l > q\), we have

$$\begin{aligned} \Vert {\textbf{x}}^{l} - {\textbf{x}}^{q}\Vert \le \sum _{j=q}^l\Vert {\textbf{d}}^j\Vert \le \sum _{j=q}^\infty \Vert {\textbf{d}}^j\Vert \le c_1\varepsilon \sum _{j=q}^\infty \eta ^j \end{aligned}$$

and \(\{{\textbf{x}}^{k}\}\) is a Cauchy sequence in \({\mathbb {R}}^N.\) So, it is convergent. \(\square\)

Remark 4.2

We notice that the condition

$$\begin{aligned} \eta \omega >\mu _{\max } c_1+L_3\rho _{\max }+(2+c_1)\sigma \end{aligned}$$

in Theorem 3 is analogous to the condition used to prove local linear convergence in [2], namely \(\eta \omega >(2+c_1)\sigma\). The two additional terms in the condition in Theorem 3 are a consequence of the main differences between Algorithm 1 and the method considered in [2]. In particular, \(\mu _{\max } c_1\) depends on the different choice of \(\mu _k\), while \(L_3\rho _{\max }\) arises from the fact that at each iteration the Levenberg-Marquardt system is solved inexactly. We also notice that, recalling the definition of \(c_1\) in Lemma 11, the condition above implies

$$\begin{aligned} \sigma \le \frac{1}{2+c_1}(\eta \omega -c_1\mu _{\max }-L_3\rho _{\max })\le \frac{\eta \omega }{c_1}<\frac{\eta \omega }{2\gamma L_3}\lambda ^*_p<\lambda ^*_p, \end{aligned}$$

which in turn is analogous to the condition \(\sigma <\lambda ^*_n\) used for the convergence analysis of classical Levenberg-Marquardt method in the case of problems with nonsingular Jacobian and nonzero residual at the solution [5].

4.3 Choice of \(\beta _k\)

The choice of \(\beta _k\) has been mentioned several times as a crucial ingredient of the algorithm we consider. Recall that the role of \(\beta _k\) is to compensate, if possible, for the information that we disregarded by splitting the original LM system into K separable systems in a computationally efficient way. Furthermore, due to condition (18), \(\beta _k\) can have a non-negligible value only if \(\Vert {\textbf{B}}_k\Vert\) is not too large, i.e., if the problem is sparse enough, and that is enough for global convergence. To obtain local linear convergence we need to make the residual small enough, recall (42). An intuitive approach is to determine \(\beta _k\) such that the residual of the solution of (14) with respect to the exact linear system (12) is minimized. That is, we have

$$\begin{aligned} \beta _k = {{\,\mathrm{arg\,min}\,}}_{\beta \in {\mathbb {R}}} \Vert \varphi _k(\beta )\Vert ^2_2 \end{aligned}$$
(56)

with

$$\begin{aligned} \varphi _k(\beta ) = ({\textbf{H}}_k+{\textbf{B}}_k+\mu _k {\textbf{I}})({\textbf{H}}_k+\mu _k {\textbf{I}})^{-1}(\beta {\textbf{B}}_k-I){\textbf{g}}^k + {\textbf{g}}^k. \end{aligned}$$
(57)

Defining \({\textbf{u}}= {\textbf{B}}_k{\textbf{g}}^k,\ {\textbf{v}}= {\textbf{B}}_k({\textbf{H}}_k +\mu _k {\textbf{I}})^{-1}{\textbf{B}}_k{\textbf{g}}^k,\ {\textbf{w}}={\textbf{B}}_k({\textbf{H}}_k +\mu _k {\textbf{I}})^{-1}{\textbf{g}}^k,\) a direct computation shows that \(\varphi _k(\beta ) = \beta ({\textbf{u}}+{\textbf{v}}) - {\textbf{w}}\), so that

$$\begin{aligned} \Vert \varphi _k(\beta )\Vert ^2 = \beta ^2\Vert {\textbf{u}}+{\textbf{v}}\Vert ^2-2\beta ({\textbf{u}}+{\textbf{v}})^\top {\textbf{w}}+\Vert {\textbf{w}}\Vert ^2 \end{aligned}$$

and the solution of (56) is given by

$$\begin{aligned} \beta _k = \frac{({\textbf{u}}+{\textbf{v}})^\top {\textbf{w}}}{\Vert {\textbf{u}}+{\textbf{v}}\Vert ^2}. \end{aligned}$$
(58)

Let us now consider the actual computation of \(\beta _k.\) To compute the vectors \({\textbf{u}},\ {\textbf{v}},\ {\textbf{w}}\) we first compute \({\textbf{u}}= {\textbf{B}}_k{\textbf{g}}^k\) directly, then we find \(\widehat{\textbf{v}},\ \widehat{\textbf{w}}\) such that \(({\textbf{H}}_k +\mu _k {\textbf{I}})\widehat{\textbf{v}}= {\textbf{u}}\) and \(({\textbf{H}}_k +\mu _k {\textbf{I}})\widehat{\textbf{w}}= {\textbf{g}}^k\), and finally we take \({\textbf{v}}= {\textbf{B}}_k\widehat{\textbf{v}}\), \({\textbf{w}}= {\textbf{B}}_k\widehat{\textbf{w}}.\) Since \(({\textbf{H}}_k +\mu _k {\textbf{I}})\) is block-diagonal, \(({\textbf{H}}_k +\mu _k {\textbf{I}})\widehat{\textbf{v}}= {\textbf{u}}\) can be decomposed into independent linear systems with coefficient matrices \(H^k_s + \mu _k I\) for \(s=1,\ldots ,K\) and we can proceed analogously for \(({\textbf{H}}_k +\mu _k {\textbf{I}})\widehat{\textbf{w}}= {\textbf{g}}^k\). Moreover, if we solve \(({\textbf{H}}_k +\mu _k {\textbf{I}})\widehat{\textbf{v}}= {\textbf{u}}\) by computing a factorization of \(H^k_s+ \mu _k I,\) then the same factorization can be used to also solve the linear system for \(\widehat{\textbf{w}}\) and later to solve (15), so the computation of \(\beta _k\) is not expensive.
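The procedure just described can be sketched as follows. This is a schematic illustration only, under the assumption that the diagonal blocks \(H^k_s\) and the off-diagonal part \({\textbf{B}}_k\) are stored as SciPy sparse matrices and that the variables are ordered block by block; the function name `beta_correction` and the use of SciPy's `splu` are our choices for the sketch, not a prescription of the actual implementation.

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def beta_correction(H_blocks, B, g, mu):
    """Sketch of the computation of beta_k in (58).

    H_blocks : list of the diagonal blocks H^k_s (sparse), s = 1,...,K
    B        : off-diagonal part B_k (sparse)
    g        : gradient g^k
    mu       : damping parameter mu_k
    """
    # Factorize each block H^k_s + mu I once; the factorization is reused below.
    factors = [spla.splu((H + mu * sp.identity(H.shape[0])).tocsc()) for H in H_blocks]

    def block_solve(rhs):
        # Solve (H_k + mu I) x = rhs block by block (block-diagonal system).
        out, start = np.empty_like(rhs), 0
        for lu in factors:
            n = lu.shape[0]
            out[start:start + n] = lu.solve(rhs[start:start + n])
            start += n
        return out

    u = B @ g                   # u = B_k g^k
    v = B @ block_solve(u)      # v = B_k (H_k + mu I)^{-1} B_k g^k
    w = B @ block_solve(g)      # w = B_k (H_k + mu I)^{-1} g^k
    uv = u + v
    denom = float(uv @ uv)
    beta = float(uv @ w) / denom if denom > 0 else 0.0
    return beta, factors
```

In line with the discussion above, the block factorizations returned by this sketch could then be reused to solve the decoupled systems in (15).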

Having \(\beta _k\) computed as above, for the residual \(\varphi _k(\beta _k)\) we have

$$\begin{aligned} \begin{aligned} \Vert \varphi _k(\beta _k)\Vert ^2&= \Vert {\textbf{B}}_k({\textbf{H}}_k +\mu _k {\textbf{I}})^{-1}{\textbf{g}}^k\Vert ^2 + \beta _k^2\Vert {\textbf{B}}_k(I+({\textbf{H}}_k +\mu _k {\textbf{I}})^{-1}{\textbf{B}}_k){\textbf{g}}^k\Vert ^2\\&-2\beta _k ({\textbf{g}}^k)^\top ({\textbf{H}}_k +\mu _k {\textbf{I}})^{-\top }{\textbf{B}}_k^\top {\textbf{B}}_k(I+({\textbf{H}}_k +\mu _k {\textbf{I}})^{-1}{\textbf{B}}_k){\textbf{g}}^k. \end{aligned} \end{aligned}$$
(59)

If the vector \(({\textbf{H}}_k +\mu _k {\textbf{I}})^{-1}{\textbf{g}}^k\) is in the null space of \({\textbf{B}}_k\), we have \(\beta _k = 0\) and \(\Vert \varphi _k(\beta _k)\Vert = 0\), so in this case the direction \({\textbf{d}}^k\) is equal to the Levenberg-Marquardt direction. If \({\textbf{g}}^k\) is in the kernel of \({\textbf{B}}_k\), then the residual \(\Vert \varphi _k(\beta )\Vert\) is equal to \(\Vert {\textbf{B}}_k({\textbf{H}}_k +\mu _k {\textbf{I}})^{-1}{\textbf{g}}^k\Vert\) for any choice of the parameter \(\beta .\) If neither \(({\textbf{H}}_k +\mu _k {\textbf{I}})^{-1}{\textbf{g}}^k\) nor \({\textbf{g}}^k\) is in the null space of \({\textbf{B}}_k\), then the optimal \(\beta _k\) in (58) is in general nonzero and the right-hand side correction is effective in reducing the residual of the linear system. In general we have that

$$\begin{aligned} \begin{aligned} \Vert \varphi _k(\beta _k)\Vert&\le \Vert {\textbf{B}}_k({\textbf{H}}_k +\mu _k {\textbf{I}})^{-1}{\textbf{g}}^k\Vert \le \Vert {\textbf{B}}_k\Vert \Vert ({\textbf{H}}_k +\mu _k {\textbf{I}})^{-1}\Vert \Vert {\textbf{g}}^k\Vert \end{aligned} \end{aligned}$$
(60)

so \(\rho _k\) in (42) is bounded from above by \(\Vert {\textbf{B}}_k\Vert \Vert ({\textbf{H}}_k +\mu _k {\textbf{I}})^{-1}\Vert .\) Taking into account Assumption 1, the definition of \(\mu _k\) and the fact that this implies in particular \(\mu _k\ge \frac{(b+1)^2}{1-b}\Vert {\textbf {{J}}}_k\Vert ^2\), we have

$$\begin{aligned} \begin{aligned}\rho _k&\le \Vert {\textbf{B}}_k\Vert \Vert ({\textbf{H}}_k +\mu _k {\textbf{I}})^{-1}\Vert \le \frac{M\Vert {\textbf {{J}}}_k\Vert ^2}{\mu _k} \le \frac{M(1-b)}{(1+b)^2}<M. \end{aligned} \end{aligned}$$
(61)

From the inequalities above we see that the relative residual \(\rho _k\) depends on the norm of \({\textbf{B}}_k\), i.e., on the constant M, which measures the relevance of the part that we disregard when approximating the Levenberg-Marquardt system with a block-diagonal one. We can also notice that the residual is smaller for larger values of the damping parameter \(\mu _k\).

5 Implementation and numerical results

In this section we present the results of a set of numerical experiments carried out to investigate the performance of the proposed method, compare it with the classical Levenberg-Marquardt method and analyze the effectiveness of the right-hand side correction. For all the tests presented here we consider Network Adjustment problems [17], briefly described in Subsection 5.1. The LMS method is defined assuming that we can take advantage of the sparsity by a suitable partition of variables and residuals and that we are able to apply the efficient right-hand-side correction described in Subsection 4.3, i.e., computing \(\beta _k\) as in (58).

5.1 Least squares network adjustment problem

Consider a set of points \(\{P_1,\dots , P_n\}\) in \({\mathbb {R}}^2\) with unknown coordinates, and assume that a set of observations of geometrical quantities involving the points is available. Least squares adjustment consists in using the available measurements to find accurate coordinates of the points, by minimizing the residual with respect to the given observations in the least squares sense.

We consider here network adjustment problems with three kinds of observations: point-point distance, angle formed by three points and point-line distance.

In order to consider problems of suitably increasing size, the problems are generated artificially, taking into account the information about average connectivity and structure of the network obtained from the analysis of real cadastral networks. The problems are generated as follows. Given the number of points n, we take \(\{P_1,\dots , P_n\}\) by uniformly sampling \(25\%\) of the points of a regular \(2\sqrt{n}\times 2\sqrt{n}\) grid, and we generate observations of the three kinds mentioned above until the average degree of the points is equal to 6. Each observation is generated by randomly selecting the points involved and drawing a random number from a Gaussian distribution with mean equal to the true measurement and a given standard deviation. We use a standard deviation equal to 0.01 and 1 degree for distance and angle observations, respectively. For all points we also add coordinate observations: for \(1\%\) of the points we use standard deviation 0.01, while for the remaining \(99\%\) we use standard deviation 1.
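As an illustration, a much simplified version of this generation procedure, restricted to distance and coordinate observations, could look as follows; the function and parameter names are ours, and the actual generator also produces angle and point-line observations.

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_network(n, dist_sigma=0.01, coord_sigma_good=0.01, coord_sigma_rough=1.0):
    """Simplified generator: n points sampled from a regular 2*sqrt(n) x 2*sqrt(n) grid
    (i.e. 25% of the grid nodes), plus noisy distance and coordinate observations."""
    side = int(np.ceil(2 * np.sqrt(n)))
    grid = np.array([(i, j) for i in range(side) for j in range(side)], dtype=float)
    points = grid[rng.choice(len(grid), size=n, replace=False)]

    # Distance observations until the average degree of the points reaches 6.
    dist_obs = []
    while 2 * len(dist_obs) / n < 6:
        a, b = rng.choice(n, size=2, replace=False)
        true_d = np.linalg.norm(points[a] - points[b])
        dist_obs.append((a, b, true_d + rng.normal(0.0, dist_sigma), dist_sigma))

    # Coordinate observations: accurate for 1% of the points, rough for the rest.
    accurate = set(rng.choice(n, size=max(1, n // 100), replace=False).tolist())
    coord_obs = []
    for i in range(n):
        s = coord_sigma_good if i in accurate else coord_sigma_rough
        coord_obs.append((i, points[i] + rng.normal(0.0, s, size=2), s))
    return points, dist_obs, coord_obs
```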

The optimization problem is defined as a weighted least squares problem

$$\begin{aligned} \min _{{\textbf{x}}^{}\in {\mathbb {R}}^N}\frac{1}{2}\sum _{j=1}^m r_j({\textbf{x}}^{})^2 = \min _{{\textbf{x}}^{}\in {\mathbb {R}}^N} \frac{1}{2}\Vert {\textbf{R}}({\textbf{x}}^{})\Vert ^2_2 \end{aligned}$$
(62)

with \(r_j({\textbf{x}}^{}) = w_j^{-1}{\widehat{r}}_j({\textbf{x}}^{})\), where \({\widehat{r}}_j\) is the residual function of the j-th observation and \(w_j\) is the corresponding standard deviation.

In Fig. 1 we present the sparsity plot of the matrix \({\textbf {{J}}}^\top {\textbf {{J}}}\) for a problem of size 35,000.

Fig. 1 Sparsity plot of the coefficient matrix for N = 35,000

5.2 Comparison with Levenberg-Marquardt method

In all the tests that follow we use a Python implementation of Algorithm LMS and of the classical LM method, and PyPardiso [9] to solve the sparse linear systems that arise at each iteration. All the tests were performed on a computer with an Intel(R) Core(TM) i7-1165G7 processor @ 2.80GHz and 16.0 GB of RAM running Windows 10. All the methods that we consider have the same iteration structure. The main difference is that, while in the LM method the linear system is solved directly using PyPardiso, in LMS we first perform the splitting and then use the same PyPardiso function to solve the resulting linear systems; therefore the time comparisons that we present are meaningful.
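Schematically, the per-iteration linear algebra of the two methods can be illustrated as follows. The sketch assumes the relevant matrices are available as SciPy sparse matrices, uses only `pypardiso.spsolve`, and leaves the construction of the blocks and of the (possibly corrected) right-hand sides outside its scope.

```python
import numpy as np
import scipy.sparse as sp
import pypardiso

def lm_step(JTJ, g, mu):
    # Classical LM: one large sparse solve (J^T J + mu I) d = -g.
    A = (JTJ + mu * sp.identity(JTJ.shape[0])).tocsr()
    return pypardiso.spsolve(A, -g)

def lms_step(H_blocks, rhs_blocks, mu):
    # LMS: K independent small solves (H^k_s + mu I) d_s = rhs_s,
    # where the right-hand sides carry the beta_k correction (not shown here).
    parts = []
    for H, rhs in zip(H_blocks, rhs_blocks):
        A = (H + mu * sp.identity(H.shape[0])).tocsr()
        parts.append(pypardiso.spsolve(A, rhs))
    return np.concatenate(parts)
```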

The partition of variables and residuals into sets \(E_s, s=1,\ldots ,K\) is assumed to be given before the application of the LMS algorithm. To compute the partitioning of the variables, we use METIS [12] which, given a network and an integer \(K>1\), finds a partition of the vertices of the network into K subsets of similar sizes that approximately minimizes the number of edges between nodes in different subsets. The partition is computed by METIS in a multilevel fashion. Starting from a coarse representation of the graph, an initial partition is computed, projected onto a denser representation of the network and then refined. This process is repeated on a sequence of progressively denser networks, up to the original graph.
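For illustration, such a partition can be obtained through any METIS interface; the following minimal sketch uses the pymetis binding on a toy adjacency list (the choice of wrapper is an assumption on our side, since the implementation detail is not specified above).

```python
import pymetis

# Adjacency list of the observation graph: adjacency[i] lists the points connected
# to point i by at least one observation (tiny illustrative example).
adjacency = [[1, 2], [0, 2, 3], [0, 1], [1]]

K = 2
edge_cuts, membership = pymetis.part_graph(K, adjacency=adjacency)
# membership[i] is the index s of the subset E_s to which point i is assigned.
print(edge_cuts, membership)
```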

In all the tests that we considered, the time needed to compute the partitioning is negligible with respect to the overall computation time. This is in part due to the fact that the partitioning is computed only once at the beginning of the procedure and not repeated at each iteration.

We now consider a set of problems of increasing size and solve each problem with the LMS method, with the correction coefficient \(\beta _k\) computed as in (58). The problems are also solved with the LM method. We consider problems with size between 20,000 and 120,000 and we plot the time taken by the two methods to reach termination. Both methods use as initial guess the coordinate observations available in the problem description, and they stop when at least \(68\%, 95\%\) and \(99.5\%\) of the residuals are smaller than 1, 2 and 3 times the standard deviation, respectively. The obtained results are in the first plot of Fig. 2. To give a better comparison, in the second plot we extend the size of the problems solved with the proposed method up to 1 million variables. Clearly, the LM method could not cope with such large problems (in our testing environment), while LMS successfully solved problems of increasing dimensions up to the final value of 1 million variables. In Fig. 3 we show the log-log plot of the time necessary to solve each problem, compared with different rates of growth. For the method with \(K>1\), a small number of values of the parameter K was tested and the best one was selected to perform the comparison. The value of K used at each dimension is reported in Fig. 8.
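For clarity, the stopping test just described can be written as a small check on the unweighted residuals \({\widehat{r}}_j\) scaled by their standard deviations \(w_j\); the sketch below is illustrative only.

```python
import numpy as np

def converged(raw_residuals, sigmas):
    """Stopping test: at least 68% / 95% / 99.5% of the residuals within
    1 / 2 / 3 standard deviations, respectively."""
    ratio = np.abs(np.asarray(raw_residuals)) / np.asarray(sigmas)
    return (np.mean(ratio < 1.0) >= 0.68
            and np.mean(ratio < 2.0) >= 0.95
            and np.mean(ratio < 3.0) >= 0.995)
```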

Fig. 2 Time comparison between classical Levenberg-Marquardt (\(K=1\)) and the proposed method (\(K>1\)) with optimal \(\beta _k\) for problems of increasing size

Fig. 3 Dependence of time on the size of the problem - log-log plot, \(N \le 10^6\)

From Fig. 2 one can see that the LMS method with \(K>1\) results in a significant reduction of the time necessary to reach the desired accuracy, compared to the Levenberg-Marquardt method. Moreover, from the second plot of Fig. 2 and from Fig. 3 we can notice that, on the problems that we considered, the time taken by the proposed method grows approximately as \(N^{1.3}\), which suggests that the method discussed in this paper is suitable for problems of very large dimension.

To better understand the behaviour of the method, in Fig. 4 we plot the mentioned percentages and the value of the relative residual \(F_k/F_0\) at each iteration, for a problem of size \(N=10^5\) and \(K = 15.\) For the same problem, in Fig. 5 we show the distribution of the coordinate error with respect to the true solution, at the initial guess and at the estimated solution (left and right plot, respectively).

Fig. 4 The percentage of points within standard deviation measure (left) and the relative residual (right) per iteration for \(N=10^5\) and \(K=15\)

Fig. 5 Distribution plot of the coordinate error for all points, at the initial guess (left) and at the final iterate (right)

5.3 Influence of the parameters K and \(\beta _k\)

Let us now study how the number of subproblems K influences the performance of the method. We consider two problems with 100,000 and 200,000 variables, respectively, and we solve them with the proposed algorithm for a set of increasing values of K. For each considered K we plot in Fig. 6 the time taken by the method to reach the desired accuracy. The initial guess and the stopping criterion are defined as in the previous test.

Fig. 6 Time to compute a solution for optimal \(\beta _k\) and different values of K, \(N = 10^5\) and \(N = 2 \cdot 10^5\)

One can notice that the time decreases as K increases, up to an optimal value (\(K=15\) for the first problem and \(K=30\) for the second one), after which the time starts to increase again. The reason behind this behavior is that larger values of K yield smaller linear systems and therefore cheaper iterations, but also a less accurate search direction \({\textbf{d}}^k\), resulting in a larger number of iterations necessary to achieve the desired accuracy. For large values of K the increase in the number of iterations outweighs the saving in the solution of the linear systems, and the overall computational cost increases. Finally, we can notice that, despite the existence of an optimal value of the parameter K, there appears to be an interval of values for which the cost of the method is comparable. This suggests that fine-tuning of the parameter K is not necessary and that, given a problem, choosing K according to the number of variables should be enough to achieve good performance.

To show that the proposed right-hand side correction improves the performance of the method, we repeat the test presented in Subsection 5.2 for sizes up to \(N = 10^6\), but the comparison is here carried out with the case \(\beta _k = 0\), that is, when the linear system is approximated as in (13) but no right-hand side correction is applied.

For both methods, a few different values of the parameter K were tested. In Fig. 7 we report the time needed by the two methods to satisfy the convergence criterion, for the best K among the considered values. In Fig. 8 we plot, for each method and each size, the value of the parameter K corresponding to the timings in Fig. 7.

Fig. 7 Time to compute the solution with \(\beta _k = 0\) and optimal \(\beta _k\)

Fig. 8 Selected values of the number of subproblems K

We can see that applying the proposed right-hand side correction effectively reduces the time necessary to satisfy the stopping condition. From Fig. 8 one can notice that the optimal K for the method with right-hand side correction is generally larger than for the method without correction. These two results together suggest that the method with right-hand side correction achieves better performance because it allows the set of variables to be partitioned into smaller subsets, which implies a faster computation of the direction at each iteration, before incurring a decrease in performance due to the additional iterations necessary to reach the desired accuracy.

6 Conclusions

We presented a method of inexact Levenberg-Marquardt type for sparse problems of very large dimension. Assuming that the problem is nearly separable, i.e., sufficiently sparse so that each component of the residual depends only on a few variables, the proposed method is defined through a splitting into a set of K independent systems of equations of smaller dimension. The decoupling is done by keeping the dense diagonal blocks of the LM system and disregarding the hopefully very sparse off-diagonal blocks. To compensate for the disregarded off-diagonal blocks we introduced a correction on the right-hand side of the system, in such a way that the decoupling is maintained while the information contained in the off-diagonal matrix is preserved in a computationally affordable way, using a single parameter that can be computed in the same fashion, i.e., by solving a sequence of small systems of linear equations. The key idea is that solving K systems of smaller dimension, which can be done sequentially or in parallel, is significantly cheaper than solving a large system of linear equations, even if the system is sparse.

The presented algorithm is globally convergent under a set of standard assumptions for a suitable choice of the regularization parameter in the LM system. In fact, the global convergence does not rely on the separability assumption at all, as one can show that the direction computed by the decoupled sequence of LM systems is a descent direction. To achieve global convergence we rely on a line search and on a regularization parameter update based on a trust-region-like scheme, similarly to [11]. Local linear convergence is proved under the standard conditions and assuming that the residual of the linear system is small enough in each iteration. Hence, the near-separability assumption plays a role in the local convergence. To achieve small residuals for the decoupled problem we rely heavily on the right-hand side correction and discuss the optimal choice of the parameter employed in the correction. Theoretical considerations are supported by numerical examples. We consider network adjustment problems of growing size on simulated data, inspired by a real-world problem of cadastral maps, and with the proposed method we solve problems of up to one million variables. A comparison with the classical LM method is presented, and it is shown that the proposed method is significantly faster and able to cope with large dimensions. The experiments reported in this paper are done in a sequential way, while a parallel implementation will be the subject of further research.