Abstract
We consider large-scale nonlinear least squares problems with sparse residuals, each depending on a small number of variables. A decoupling procedure that splits the original problem into a sequence of independent problems of smaller size is proposed and analysed. The smaller problems are modified in a way that offsets the error made by disregarding the dependencies that allow us to split the original problem. The resulting method is a modification of the Levenberg-Marquardt method with smaller computational costs. Global convergence is proved, as well as local linear convergence under suitable assumptions on sparsity. The method is tested on simulated network localization problems with up to one million variables and its efficiency is demonstrated.
1 Introduction
The problem we consider is a nonlinear least squares problem
where for every \(j=1,\ldots ,m\), \(r_j: {\mathbb {R}}^N\rightarrow {\mathbb {R}}\) is a \(C^{1}\) function, \({\textbf{R}}({\textbf{x}}) = \left( r_1({\textbf{x}}),\dots ,r_m({\textbf{x}})\right) ^\top \in {\mathbb {R}}^{m}\) is the vector of residuals, and \({\textbf{F}}\) is the aggregated residual function. We assume a special structure of the residuals that is relevant in many applications: the residuals are very sparse functions, each depending only on a small number of variables, while the whole problem is large-scale, i.e. N is very large. We do not assume any particular sparsity pattern, which allows many practical problems to fit the framework we consider.
Typical problems of this kind are Least Squares Network Adjustment [17], Bundle Adjustment [13], and Wireless Sensor Network Localization [15], where the variables correspond to the coordinates of physical points in a region of two- or three-dimensional space and the residuals correspond to observations of geometrical quantities involving these points. In these cases each observation typically involves a small (often fixed) number of points, and thus each residual function involves a small number of variables. Moreover, when considering problems of large dimension with points deployed in a large region of space, the number of observations involving each point is small with respect to the total number of observations. That is, each variable appears in a relatively small number of residual functions, which leads to problems that are very sparse. Given that the measurements are prone to errors or to different kinds of noise, the residuals are in general weighted, the weights being reciprocals of the measurement precisions. Furthermore, a typical property of such problems is that they are nearly separable, meaning that it is possible to partition the points into subsets that are connected by a small number of observations. The dominant properties of all these problems, very large dimension N and sparsity, motivated the modification of the classical Levenberg-Marquardt method presented in this paper.
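To make the sparsity pattern concrete, here is a minimal Python sketch (our own illustration, not code from the paper) of distance residuals in a 2-D localization problem. The helper `make_residual` is hypothetical; the point is that each residual touches only the four coordinates of the two points it connects, however large N is.

```python
import numpy as np

def make_residual(i, k, dist, weight=1.0):
    """Residual r(x) = weight * (||p_i - p_k|| - dist), where x is the
    flat vector of all 2-D point coordinates (x has length N = 2 * points)."""
    def r(x):
        pi, pk = x[2 * i:2 * i + 2], x[2 * k:2 * k + 2]
        return weight * (np.linalg.norm(pi - pk) - dist)
    return r

# Three points, two observations: each residual involves 4 of the 6 variables.
x = np.array([0.0, 0.0, 3.0, 4.0, 6.0, 0.0])
r01 = make_residual(0, 1, 5.0)   # ||(0,0)-(3,4)|| = 5, so residual is 0
r12 = make_residual(1, 2, 4.0)   # ||(3,4)-(6,0)|| = 5, so residual is 1
print(r01(x), r12(x))            # -> 0.0 1.0
```

The Jacobian of the stacked residual vector is correspondingly sparse: each row has at most four nonzero entries.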
The Levenberg-Marquardt (LM) method is widely used to solve least squares problems of large dimension. The method is based on a regularization strategy that improves the Gauss-Newton method. In each iteration of the Gauss-Newton method one has to solve a linear least squares problem. The LM method adds a regularization (damping) parameter that facilitates the direction computation. Thus in each iteration of the LM method one has to solve a system of linear equations to compute the step or the step direction. The regularization parameter plays a fundamental role and its choice is the subject of many studies.
Many modifications of the classical Levenberg-Marquardt scheme have been proposed in the literature to retain convergence while relaxing the assumptions on the objective function and to improve the performance of the method. In [6, 7, 18] the damping parameter is defined as a multiple of the objective function. With this choice of the parameter, local superlinear or quadratic convergence is proved under a local error bound assumption for zero-residual problems, while global convergence is achieved by employing a line search strategy. In [11] the authors propose an updating strategy for the parameter that, in combination with Armijo line search, ensures global convergence and q-quadratic local convergence under the same assumptions as the previous papers. In [2] the nonzero-residual case is considered and a Levenberg-Marquardt scheme is proposed that achieves local convergence with order depending on the rank of the Jacobian matrix and on a combined measure of nonlinearity and residual size around stationary points. In [4] an inexact Levenberg-Marquardt method is considered and local convergence is proved under a local error bound condition. In [3] the authors propose an approximated Levenberg-Marquardt method, suitable for large-scale problems, that relies on inaccurate function values and derivatives.
The problems we are interested in are of very large dimension and sparse. Sparsity very often induces the property we call near-separability, i.e. it is possible to partition the variables into subsets such that a subset of residual functions depends only on a subset of variables, while only a limited number of residual functions depends on variables in different subsets. This property is mainly a natural consequence of the problem origin. For example, in the Network Adjustment problem the physical distance between points determines the set of points that are connected by observations. Although complete separability of the residuals is not common, the number of residuals that depend on points from different subsets is in general rather small compared to the problem size N. On the other hand, for a very large N, solving the LM linear system in each iteration can be costly even for very sparse problems. A particular example is the refinement of cadastral maps, where N is prohibitively large for direct application of the LM method. For example, the Dutch Kadaster is pursuing a more accurate cadastral map by making the map consistent with accurate field surveyor measurements [8, 10]. This application yields a nonlinear least squares problem known as ‘least squares adjustment’ in the field of geography. If the entire Netherlands were considered as one big adjustment problem, the number of variables would be twice the number of feature points in the Netherlands, which is on the order of 1 billion variables, and even considering separate parts of the Netherlands still yields a very large-scale problem.
The method we propose here is designed to exploit sparsity and near-separability in the following way. Assuming that we can split the variables in such a way that a large number of residual functions depend only on a particular subset of variables, while a relatively small number of residual functions depend on variables from different subsets, the system of linear equations in the LM iteration has a particular block structure. The variable subsets and corresponding residuals imply strong dominance of relatively dense diagonal blocks, while the off-diagonal blocks are very sparse. This structure motivated the idea of splitting: we decompose the LM system matrix into K independent systems of linear equations, determined by the diagonal blocks of the LM system. This way we get K linear systems of significantly smaller dimensions and we can solve them faster, either sequentially or in parallel. Thus one can consider this approach a kind of inexact LM method. However, this kind of splitting might be too inaccurate, given that we completely disregarded the off-diagonal blocks of the LM matrix. Therefore, we modify the K independent linear systems in such a way that the off-diagonal blocks are included in the right-hand sides of these independent systems. The modification of the right-hand sides is based on a correction parameter proposed for the Newton method in [14] and for the distributed Newton method in [1], and attempts to minimize the difference between the residual of the full LM linear system and the residual of the modified method. Having an affordable search direction computed by solving the K independent linear systems, we can proceed in the usual way, applying a line search to get a globally convergent method. Furthermore, under a set of standard assumptions we can prove local linear convergence.
A proper splitting of the variables into suitable subsets is a key assumption for the efficiency of this method, but the problems we are interested in very often offer a natural way of meaningful splitting. For example, in localization or network adjustment problems the geometry of the points dictates meaningful subsets. In the experiments presented here one can see that a graph partitioning algorithm provides a good subset division in a cost-efficient way.
The numerical experiments presented in the paper clearly demonstrate the advantage of the proposed method with respect to the full-size LM method. We show that the splitting method successfully copes with very large dimensions, testing problems with up to 1 million variables. Furthermore, we demonstrate that even for dimensions that can be handled by the classical LM method, say around a couple of tens of thousands, the splitting method works faster. Additionally, we investigate the robustness of the splitting: we demonstrate empirically that the number of subsets plays an important role in the efficiency of the method, but that there is a reasonable degree of freedom in choosing that number without affecting the performance of the method. The experiments presented here are sequential, i.e. we did not solve the independent systems of linear equations in parallel, which would further enhance the proposed method. A parallel implementation will be a subject of further research.
The paper is organized as follows. In Sect. 2 we present the framework that we are considering. The proposed method is described in Sect. 3 while the theoretical analysis of the proposed method is carried out in Sect. 4. In Sect. 5 we discuss some implementation details and present numerical results.
The notation we use is the following. Mappings defined on \({\mathbb {R}}^N,\) vectors from \({\mathbb {R}}^N\), and matrices with at least one dimension equal to N are denoted by boldface letters \({\textbf{F}}, {\textbf{R}}, {\textbf{x}}, {\textbf{B}}, \ldots\) while their block elements are denoted by the same letters in italics with indices, so \({\textbf{x}} = (x_1,\ldots ,x_K), \; {\textbf{x}} \in {\mathbb {R}}^N, \; x_s \in {\mathbb {R}}^{n_s}.\) The dimensions are clearly stated to avoid confusion. We use \(\lambda _{\min }(\cdot )\) and \(\lambda _{\max }(\cdot )\) to denote the smallest and largest eigenvalue of a matrix, respectively. The Euclidean norm is denoted by \(\Vert \cdot \Vert\) for both matrices and vectors.
2 Nearly separable problems
The problem we consider is stated in (1). Denote with \({\mathcal {I}} = \{1,\dots ,N\}\) and with \({\mathcal {J}} = \{1,\dots ,m\}\). Given a partition \(I_1,\dots ,I_{K}\) of \({\mathcal {I}}\) we define the corresponding partition of \({\mathcal {J}}\) into \(E_1,\dots , E_{K}\) as follows:
That is, given a partition of the set of variables, each of the subsets \(E_s\) contains the indices corresponding to residual functions that only involve variables in \(I_s\), while \({\widehat{E}}\) contains the indices of residuals that involve variables belonging to different subsets \(I_s\). We say that problem (1) is separable if there exist \(K \ge 2\) and a partition \(\{I_s\}_{s=1,\ldots ,K }\) of \({\mathcal {I}}\) such that \({\widehat{E}} = \emptyset\), while we say that it is nearly separable if there exist \(K \ge 2\) and a partition \(\{I_s\}_{s=1,\ldots ,K }\) of \({\mathcal {I}}\) such that the cardinality of \({\widehat{E}}\) is small with respect to the cardinality of \(\bigcup _{s=1}^{ K }E_s.\) The term “nearly separable” is not defined precisely and should be understood in the same fashion as sparsity, i.e. we assume that we can identify the corresponding partitions.
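The construction of the sets \(E_s\) and \({\widehat{E}}\) from a given partition \(\{I_s\}\) can be sketched as follows. The helper `split_residuals` is our own illustration (not from the paper); each residual is represented by the list of variable indices it involves.

```python
def split_residuals(partition, vars_of_residual):
    """partition: list of lists I_s of variable indices.
    vars_of_residual: for each residual j, the variable indices it involves.
    Returns the internal index sets E_s and the coupling set E_hat."""
    block_of = {i: s for s, I_s in enumerate(partition) for i in I_s}
    E = [[] for _ in partition]
    E_hat = []
    for j, V in enumerate(vars_of_residual):
        blocks = {block_of[i] for i in V}
        if len(blocks) == 1:
            E[blocks.pop()].append(j)   # residual j touches a single block
        else:
            E_hat.append(j)             # residual j couples several blocks
    return E, E_hat

partition = [[0, 1], [2, 3]]
vars_of_residual = [[0, 1], [2, 3], [1, 2]]
E, E_hat = split_residuals(partition, vars_of_residual)
print(E, E_hat)   # -> [[0], [1]] [2]
```

The problem is nearly separable for this partition when `E_hat` is short relative to the total number of residuals.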
The splitting described above can be interpreted as follows. Given the least squares problem (1) we define the corresponding underlying network as the undirected graph \({\mathcal {G}} = ({\mathcal {I}}, {\mathcal {E}})\), where \({\mathcal {I}}\) and \({\mathcal {E}}\) denote the set of nodes and the set of edges, respectively. The graph \({\mathcal {G}}\) has a node for each variable \(x_i\) and an edge between node i and node k if there is a residual function \(r_j\) that involves \(x_i\) and \(x_k\). With this definition in mind, the partition of \({\mathcal {I}}\) and \({\mathcal {J}}\) described above corresponds to a partition of the sets of nodes and edges of the network, where \(E_s\) contains the indices of edges between nodes in the same subset \(I_s\) and \({\widehat{E}}\) contains the edges that connect different subsets. The problem is separable if the underlying network \({\mathcal {G}}\) is not connected (and the number K is equal to the number of connected components of \({\mathcal {G}}\)). The problem is nearly separable if we can partition the set of nodes of the network in such a way that the number of edges that connect different subsets is small with respect to the number of edges that are internal to the subsets.
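Under this graph interpretation, a quick way to assess near-separability of a candidate partition is to count edges crossing between subsets versus internal edges. The helper `coupling_ratio` below is our own illustrative sketch, not part of the paper.

```python
from itertools import combinations

def coupling_ratio(partition, vars_of_residual):
    """Build the underlying graph (an edge {i,k} whenever some residual
    involves both x_i and x_k) and count cross vs. internal edges."""
    block_of = {i: s for s, I_s in enumerate(partition) for i in I_s}
    edges = {frozenset(e) for V in vars_of_residual
             for e in combinations(V, 2)}
    cross = sum(1 for e in edges
                if len({block_of[i] for i in e}) > 1)
    return cross, len(edges) - cross   # (coupling edges, internal edges)

partition = [[0, 1, 2], [3, 4, 5]]
vars_of_residual = [[0, 1], [1, 2], [3, 4], [4, 5], [2, 3]]
print(coupling_ratio(partition, vars_of_residual))   # -> (1, 4)
```

A partition is a good candidate when the first count is small relative to the second; in practice such partitions would come from a graph partitioning tool rather than being checked by hand.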
Given the partition \(\{I_s\}_{s=1,\ldots ,K }\) of the variables and the corresponding partition \(\{E_s\}_{s=1,\ldots ,K },\ {\widehat{E}}\) of the residuals, for \(s=1,\ldots , K\) we define \(x_{s}\in {\mathbb {R}}^{n_s}\) as the vector of the variables in \(I_s\) where \(n_s\) denotes the cardinality of \(I_s\), and we introduce the following functions
so that for every \(s=1,\ldots ,K\), \(R_s: {\mathbb {R}}^{n_s} \rightarrow {\mathbb {R}}^{|E_s|}\) is the vector of residuals involving only variables in \(I_s\), while \(\rho : {\mathbb {R}}^N\rightarrow {\mathbb {R}}^{|{\widehat{E}}|}\) is the vector of residuals in \({\widehat{E}}\), and \(F_s\), \(\Phi\) are the corresponding local aggregated residual functions. Notice that \(\sum _{s=1}^K n_s = N.\) With this notation problem (1) can be rewritten as
In particular, if the problem is separable (and therefore \(\widehat{E}\) is empty) \(\Phi \equiv 0\) and solving problem (4) is equivalent to solving K independent least squares problems given by
If the problem is not separable then in general \(\Phi\) is not equal to zero and that is the case we are interested in.
3 LMS: the Levenberg-Marquardt method with splitting
Let \(\{I_s\}_{s=1,\ldots ,K }\) be a partition of \({\mathcal {I}}\) and \(\{E_s\}_{s=1,\ldots ,K }\) be the corresponding partition of \({\mathcal {J}}\) as defined in (2). To ease the notation we assume that the variables and the residual functions have been reordered according to the given partitions, so that for \({\textbf{x}} \in {\mathbb {R}}^N\)
With this reordering, denoting with \(J_{sR_s}\) the Jacobian of the partial residual vector \(R_s\) defined in (3) with respect to the variables in \(I_s\), and with \(J_{s \rho }\) the Jacobian of the partial residual \(\rho\) with respect to \(x_s,\) we have
From this structure of \({\textbf{R}}\) and \({\textbf {{J}}}\) we get the corresponding block structure of the gradient \({\textbf{g}}({\textbf{x}}^{}) = {\textbf {{J}}}({\textbf{x}}^{})^\top {\textbf{R}}({\textbf{x}}^{})\) and the matrix \({\textbf {{J}}}({\textbf{x}}^{})^\top {\textbf {{J}}}({\textbf{x}}^{})\):
with
In the following, we denote with \({\textbf{g}}^k = {\textbf {{J}}}_k^\top {\textbf{R}}_k\) the vector with sth block component equal to \(g_s({\textbf{x}}^{k})\), with \({\textbf{H}}_k = {\textbf{H}}({\textbf{x}}^{k})\) the block diagonal matrix with diagonal blocks given by \(H_s({\textbf{x}}^{k})\) for \(s=1,\dots , K\), and with \({\textbf{B}}_k = {\textbf{B}}({\textbf{x}}^{k})\) the block partitioned matrix with diagonal blocks equal to zero and offdiagonal blocks equal to \(B_{ij}({\textbf{x}}^{k})\). That is,
The algorithm we introduce here is motivated by the near-separability property, hence we state the formal assumption below.
Assumption 1 There exists a constant \(M > 0\) such that for all \({\textbf{x}}^{} \in {\mathbb {R}}^N\)
The assumption above is not restrictive, as \({\textbf{B}}({\textbf{x}})\) is a submatrix of \({\textbf {{J}}}({\textbf{x}})^\top {\textbf {{J}}}({\textbf{x}}).\) Furthermore, the global convergence of the algorithm we propose does not depend on it, in the sense that we do not use the assumption in the convergence proof. In fact, the proposed algorithm works even for problems that are not nearly separable, as it can be seen as a kind of quasi-Newton method. However, the efficiency of the algorithm depends on the near-separability of the problem and on the value of M. Moreover, the value of M plays an important role in the analysis of local convergence and in achieving a linear rate.
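The block decomposition described above can be sketched in NumPy. Assuming the variables are reordered so that each subset is contiguous, the following splits \({\textbf {{J}}}^\top {\textbf {{J}}}\) into its block-diagonal part \({\textbf{H}}\) and the off-diagonal remainder \({\textbf{B}}\); dense matrices are used for illustration only, whereas in practice everything would be stored sparse.

```python
import numpy as np

def split_normal_matrix(J, sizes):
    """Split A = J^T J into its block-diagonal part H (blocks of the
    given sizes, variables assumed reordered) and the remainder B = A - H."""
    A = J.T @ J
    H = np.zeros_like(A)
    off = np.cumsum([0] + list(sizes))
    for a, b in zip(off[:-1], off[1:]):
        H[a:b, a:b] = A[a:b, a:b]      # copy diagonal block H_s
    return H, A - H                    # B keeps only the coupling blocks

J = np.array([[1.0, 2.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 1.0],
              [0.0, 1.0, 1.0, 0.0]])   # last residual couples the two blocks
H, B = split_normal_matrix(J, [2, 2])
print(np.diag(B))                      # diagonal blocks of B are zero
```

For a nearly separable problem only the few rows of \({\textbf {{J}}}\) indexed by \({\widehat{E}}\) contribute to B, so \(\Vert B\Vert\) is small, in line with Assumption 1.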
Consider a standard iteration of the LM method for a given iterate \({\textbf{x}}^{k}\)
where \({\textbf{d}}^k\in {\mathbb {R}}^N\) is the solution of
where \({\textbf {{J}}}_k = {\textbf {{J}}}({\textbf{x}}^{k})\in {\mathbb {R}}^{m\times N}\) denotes the Jacobian matrix of \({\textbf{R}}\) at \({\textbf{x}}^{k}\), \({\textbf{R}}_k = {\textbf{R}}({\textbf{x}}^{k})\), and \(\mu _k\) is a positive scalar. When N is very large, solving (11) at each iteration of the method may be prohibitively expensive. In the following we propose a modification of the Levenberg-Marquardt method that exploits near-separability of the problem to approximate the linear system (11) with a set of independent linear systems of smaller size.
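For reference, the classical damped direction of (11) amounts to a single linear solve with the full normal matrix. A minimal dense sketch (the sign convention \(({\textbf {{J}}}^\top {\textbf {{J}}} + \mu {\textbf{I}})\,{\textbf{d}} = -{\textbf {{J}}}^\top {\textbf{R}}\) is assumed here):

```python
import numpy as np

def lm_direction(J, R, mu):
    """Classical Levenberg-Marquardt direction: one N x N linear solve."""
    N = J.shape[1]
    return np.linalg.solve(J.T @ J + mu * np.eye(N), -J.T @ R)

J = np.array([[1.0, 0.0],
              [0.0, 2.0]])
R = np.array([1.0, 2.0])
print(lm_direction(J, R, mu=1.0))   # -> [-0.5 -0.8]
```

It is exactly this N-dimensional solve that becomes prohibitive for very large N and that the splitting replaces by K smaller solves.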
The linear system (11) at iteration k can therefore be rewritten as
The matrix \({\textbf{B}}_k\) depends only on the derivatives of the residual vector \(\rho = (r_j({\textbf{x}}^{}))_{j\in {\widehat{E}}}.\) If the problem is separable then \({\widehat{E}} = \emptyset\) and \({\textbf{B}}_k = 0\) so the coefficient matrix of (12) is block diagonal, the system can be decomposed into K independent linear systems, and the solution \({\textbf{d}}^k\) is a vector with the sth block component equal to the solution \(d^k_s\) of
If the problem is nearly separable, the number of nonzero elements in \({\textbf{B}}_k\) is small compared to the size N of the matrix and to the number of nonzero elements in \({\textbf{H}}_k\). Thus the solution of (13) may provide an approximation of the Levenberg-Marquardt direction (11), with the quality of the approximation depending on the number and magnitude of the nonzero elements in \({\textbf {{B}}}_k.\)
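Dropping \({\textbf{B}}_k\) entirely, the approximate direction of (13) reduces to K small independent solves, one per diagonal block. A sketch under the same assumed sign convention as before:

```python
import numpy as np

def block_diagonal_direction(H, g, mu, sizes):
    """Solve (H_s + mu I) d_s = -g_s for each diagonal block independently
    (the blocks could equally well be solved in parallel)."""
    d = np.empty_like(g)
    off = np.cumsum([0] + list(sizes))
    for a, b in zip(off[:-1], off[1:]):
        d[a:b] = np.linalg.solve(H[a:b, a:b] + mu * np.eye(b - a), -g[a:b])
    return d

H = np.diag([4.0, 4.0, 9.0, 9.0])
g = np.array([1.0, 2.0, 3.0, 4.0])
d = block_diagonal_direction(H, g, mu=1.0, sizes=[2, 2])
print(d)   # -> [-0.2 -0.4 -0.3 -0.4]
```

Each solve costs only what an \(n_s\)-dimensional system costs, which is the source of the savings discussed in the text.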
Given that the information contained in \({\textbf {{B}}}_k\) might be relevant, and that solving K systems of smaller dimension is much cheaper than solving one system of dimension N, we propose the following modification of the right-hand side of (13), which attempts to exploit the information contained in the off-diagonal blocks while retaining separability of the approximated linear system. The idea underlying the right-hand side correction is analogous to the one proposed in [1, 14] for systems of nonlinear equations and distributed optimization problems.
Our goal is to split the LM linear system into separable systems of smaller dimension. Starting from the LM linear system (12) and aiming at a separable system of linear equations, i.e. a system with the matrix \({\textbf{H}}_k + \mu _k {\textbf{I}}\) as in (13), we need to take into account the fact that \({\textbf{B}}_k\) is not zero. Clearly, moving \({\textbf{B}}_k {\textbf{d}}^k\) to the right-hand side would be ideal, but \({\textbf{d}}^k\) is unknown. Therefore we add \({\textbf{B}}_k {\textbf{g}}^k\) to the right-hand side of (13), as this way we maintain separability and the information contained in \({\textbf{B}}_k.\) Intuitively, \({\textbf{B}}_k {\textbf{g}}^k\) is the best approximation of \({\textbf{B}}_k {\textbf{d}}^k\) that we have available, so we use it to get a separable system of linear equations. To compensate (at least partially) for the substitution of \({\textbf{B}}_k {\textbf{d}}^k\) with \({\textbf{B}}_k {\textbf{g}}^k\) we also add a correction factor \(\beta _k\), as explained in (14) and further on.
Consider the system
where \(\beta _k\in {\mathbb {R}}\) is a correction coefficient that we can choose. Once the right-hand side has been computed, since \({\textbf{H}}_k+\mu _k {\textbf{I}}\) is a block diagonal matrix, (14) can still be decomposed into K independent linear systems. That is, the solution \({\textbf{d}}^k\) of (14) is given by
So far the correction coefficient \(\beta _k\) can be chosen freely at each iteration. However, we will see that it is of fundamental importance for both the convergence analysis of the method and its practical performance. This parameter is further specified in the algorithm we propose and discussed in detail in Sect. 4.3. Let us now give only a rough reasoning behind its introduction. With \(\beta _k\) we are trying to preserve, in a cheap way and without destroying separability, some of the information contained in \({\textbf{B}}_k\). One possibility is the following choice of \(\beta _k\), which minimizes the residual of the solution of (14) with respect to the exact linear system (12). That is,
with
Further details on this choice are presented later on. One can ask why we use a single coefficient \(\beta _k\) in each iteration, i.e. whether it would be more efficient to "correct" the right-hand side with more than one parameter, perhaps allowing a diagonal matrix \(\beta _k\) in the right-hand side instead of a single scalar. In fact, if we take a diagonal matrix \(\beta _k \in {\mathbb {R}}^{N \times N}\) and plug it into the above minimization problem, then solving this problem we could recover the LM iteration exactly. However, such a procedure would imply the same cost per iteration as the full LM iteration. Clearly, there are other alternatives between these two extremes of a single number and N numbers, but our experience shows that the choice of a single coefficient \(\beta _k\) brings the best results in terms of the cost-benefit trade-off. The convergence analysis presented in the next section will further restrict the values of the parameter \(\beta _k.\)
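The scalar correction is cheap to compute because the solution of the split system is affine in \(\beta\), so the residual with respect to the full system is minimized by a one-dimensional least squares formula. The sketch below assumes one plausible sign convention for (14), namely \(({\textbf{H}}+\mu {\textbf{I}})\,{\textbf{d}} = -{\textbf{g}} + \beta \, {\textbf{B}}{\textbf{g}}\) (the exact signs are fixed by equations not reproduced here), so it illustrates the mechanism rather than the paper's exact formula.

```python
import numpy as np

def best_beta(H, B, g, mu):
    """Minimize || (H + B + mu I) d(beta) + g || over scalar beta, where
    d(beta) solves the split system (H + mu I) d = -g + beta * B g."""
    M = H + mu * np.eye(len(g))
    d0 = np.linalg.solve(M, -g)        # split direction for beta = 0
    d1 = np.linalg.solve(M, B @ g)     # sensitivity of d to beta
    a = B @ d0                         # full-system residual at beta = 0
    b = B @ g + B @ d1                 # residual slope in beta
    return 0.0 if not b.any() else -(a @ b) / (b @ b)

H = np.diag([2.0, 2.0])
B = np.array([[0.0, 0.5], [0.5, 0.0]])
g = np.array([1.0, -1.0])
beta = best_beta(H, B, g, mu=1.0)
print(beta)   # -> 0.4
```

The two extra block solves reuse the same block-diagonal factorizations as the direction itself, so the overhead of computing \(\beta\) this way is modest.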
Consider the block diagonal matrix \({\textbf{H}}_k\) defined in (7). Since the sth diagonal block is \(H_s^k = (J_{sR_s}^k)^\top J_{sR_s}^k + (J_{s\rho }^k)^\top J_{s\rho }^k\), the matrix \({\textbf{H}}_k\) is symmetric and positive semidefinite, and therefore there exists a matrix \({{\textbf {S}}}_k\in {\mathbb {R}}^{m\times N}\) such that \({{\textbf {S}}}_{k}^{\top } {{\textbf {S}}}_k={\textbf{H}}_k\). We denote with \({\textbf {C}}_k\) the matrix \({\textbf {C}}_k = {\textbf {{J}}}_k - {{\textbf {S}}}_k.\)
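Such a factor can be computed blockwise. For a positive definite block a Cholesky factorization gives one choice of square root; this is a sketch of the factor property only (for merely semidefinite blocks an eigenvalue-based square root would be needed instead).

```python
import numpy as np

# Toy positive definite diagonal block H_s of H_k.
H = np.array([[4.0, 2.0],
              [2.0, 3.0]])
L = np.linalg.cholesky(H)    # H = L L^T, L lower triangular
S = L.T                      # then S^T S = L L^T = H
print(np.allclose(S.T @ S, H))   # -> True
```

Since \({\textbf{H}}_k\) is block diagonal, a valid \({{\textbf {S}}}_k\) is obtained by placing such block factors on the corresponding block positions.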
Let us now state the algorithm. We will assume that the splitting into suitable sets is done before the iterative procedure and is kept fixed throughout the process. Thus, the block diagonal matrix \({\textbf{H}}_k\) and the off-diagonal part \({\textbf{B}}_k\) are already defined at each iteration.
The regularization parameter \(\mu _k\) plays an important role in (14) and consequently in the resolution of our independent problems stated in (15). In the algorithm below we adopt a choice based on the same principles proposed in [11], although other options are possible. The parameter is computed using the values defined as
for each iteration with \(\ell _k\) specified in the algorithm below.
The key novelty is in (19), where we compute the search direction. The damping parameter \(\mu _k\) is defined through step 1, while the values of \(\ell _k\), updated at lines 12–16, resemble a trust-region approach. Roughly speaking, if the decrease is sufficient the damping parameter in the next iteration is decreased, otherwise we keep \(\ell _{k+1} = \ell _k.\) In fact the choice of \(\mu _k\) is crucial for the convergence proof, see Lemma 8. The correction parameter \(\beta _k\) is specified in line 2. Assuming the standard properties of the objective function, see Assumptions 2 and 3 ahead, one can always choose \(\beta _k\) such that (18) holds. However, the value of \(\beta _k\) depends on the norm of the off-diagonal blocks. Thus the method essentially exploits the sparse structure we assume in this paper, see Assumption 1. The search direction \({\textbf{d}}^k\) is computed in line 4. Clearly, the system (19) is in fact completely separable and solving it requires solving K independent linear systems of the form (15).
The right-hand side correction vector in each system is also stated in (15). Thus, for the parameter \(\beta _k\) chosen in line 2, to compute \({\textbf{d}}^k\) we need to solve K systems of linear equations
each of dimension \(n_s,\) with \(\sum _{s=1}^{K} n_s = N.\) These systems can be solved independently, either in parallel or sequentially, and the cost of solving them is in general significantly smaller than the cost of solving the N-dimensional LM system of linear equations. These savings are meaningful since the direction \({\textbf{d}}^k\) obtained this way is a descent direction, as we will show in the convergence analysis. After that we invoke backtracking to fulfill the modified Armijo condition given in (20) and define a new iterate. The modification of the Armijo condition again depends on the norm of the off-diagonal blocks, as the step size is bounded above by \(1/\gamma _k.\) In the case \({\textbf{B}}_k = 0,\) i.e. if the system is completely separable, we get \(\gamma _k = 1\) and the classical Armijo condition is recovered. In this case the system (19) is the classical LM system and the algorithm reduces to the classical LM method with line search. On the other hand, for \({\textbf{B}}_k \ne 0\) the value of \(\Vert {\textbf{B}}_k\Vert\) fundamentally influences the values of \(\beta _k\) and \(\gamma _k\), and the algorithm allows non-negligible values of \(\beta _k, \gamma _k\) only if \(\Vert {\textbf{B}}_k\Vert\) is not too large, i.e. if the problem has a certain level of separability.
4 Convergence analysis
The convergence analysis is divided into two parts: in Sect. 4.1 we prove that the algorithm is well defined and globally convergent under a set of standard assumptions, while the local convergence analysis is presented in Sect. 4.2. The choice of \(\beta _k\) and its influence are discussed in Sect. 4.3.
4.1 Global convergence
The following regularity assumptions are commonly used in LM methods.
Assumption 2 The vector of residuals \({\textbf{R}}: {\mathbb {R}}^N\rightarrow {\mathbb {R}}^{m}\) is continuously differentiable.
Assumption 3 The Jacobian matrix \({\textbf {{J}}}\in {\mathbb {R}}^{m\times N}\) of \({\textbf{R}}\) is L-Lipschitz continuous. That is, for every \({\textbf{x}},{\textbf{y}} \in {\mathbb {R}}^N\)
For the rest of this subsection we assume that \(\{{\textbf{x}}^{k}\}\) is the sequence generated by Algorithm 1 with an arbitrary initial guess \({\textbf{x}}^{0}\in {\mathbb {R}}^N\).
The following Lemma, proved in [5], is needed for the convergence analysis.
Lemma 1
[5] If Assumptions A2 and A3 hold, for every \({\textbf{x}}^{}\) and \({\textbf{y}}\) in \({\mathbb {R}}^N\) we have
Lemma 2
Assume that \({\textbf{d}}^k\) is computed as in (19) with \(\beta _k\) satisfying (18), and that Assumption A2 holds. Then \({\textbf{d}}^k\) is a descent direction for \({\textbf{F}}\) at \({\textbf{x}}^{k}.\) Moreover the following inequalities hold

(i)
\(\displaystyle ({\textbf{g}}^k)^\top {\textbf{d}}^k \le - \frac{(1-b)\Vert {\textbf{g}}^k\Vert ^2}{\Vert {\textbf{H}}_k\Vert + \mu _k}\).

(ii)
\(\displaystyle {\textbf{F}}({\textbf{x}}^{k+1})\le {\textbf{F}}({\textbf{x}}^{k})-ct_k\frac{(1-b)\Vert {\textbf{g}}^k\Vert ^2}{\Vert {\textbf{H}}_k\Vert + \mu _k}\).
Proof
We want to prove that \(({\textbf{g}}^k)^\top {\textbf{d}}^k\le 0\) for every index k. By the definition of \({\textbf{d}}^k\) and using the fact that \(\Vert A^{-1}\Vert \ge \Vert A\Vert ^{-1}\) for every invertible matrix A, we have
Since \(\mu _k>0\) and \({\textbf{H}}_k\) is symmetric and positive semidefinite, we have
and
Using these two facts and the bound (18) on \(\Vert \beta _k B_k\Vert\) in inequality (22), we get
which is part (i) of the Lemma. Since \(b<1\), this also implies that \({\textbf{d}}^k\) is a descent direction at iteration k. By (20) we have that for every iteration index k
Bounding \(({\textbf{g}}^k)^\top {\textbf{d}}^k\) using part (i) of the statement, we get (ii).
\(\square\)
Remark 4.1
Lemma 2 states that if the righthand side correction coefficient \(\beta _k\) is chosen to satisfy the condition (18), then \({\textbf{d}}^k\) is a descent direction and therefore the backtracking procedure can always find a step size \(t_k\) such that the Armijo condition (20) is satisfied. In particular this implies that Algorithm 1 is well defined. In Lemma 9 we will also prove that under the current assumptions the step size \(t_k\) is bounded from below.
Lemma 3
If Assumption A2 holds and \({\textbf{d}}^k\) is the solution of (19), then for every iteration k we have
Proof
By definition of \({\textbf{d}}^k\) and \(\gamma _k\) we have
which is the first inequality in the statement. The second inequality follows directly from the fact that \({\textbf{g}}^k = {\textbf {{J}}}_k^\top {\textbf{R}}_k\). \(\square\)
Lemma 4
If Assumption A2 holds, for every \(t\in [0,1/\gamma _k]\) we have
Proof
By Lemma 3 we have
Using this inequality in part i) of Lemma 2, we get
Using this inequality and the fact that \({\textbf{g}}^k = {\textbf {{J}}}_k^\top {\textbf{R}}_k\) we then have
which is the statement. \(\square\)
Lemma 5
If Assumptions A2 and A3 hold, for every \(t\in [0,1/\gamma _k]\) we have
Proof
Let us introduce the following function from \({\mathbb {R}}\) to \({\mathbb {R}}^m\)
and let us notice that by Lemma 1 we have that \(\Vert \Psi (t)\Vert \le \frac{L}{2}t^2\Vert {\textbf{d}}^k\Vert ^2.\) Using this bound on \(\Psi\) and Lemma 4 we get
The statement follows immediately. \(\square\)
Lemma 6
Let us assume that Assumptions A2 and A3 hold, and let us denote with \(\mu ^*_k\) the largest root of the polynomial \(q_k(\mu ) = \sum _{j=0}^4a^k_j\mu ^j\) with
If \(\mu _k>\mu _k^*\), then \(t_k = \min \{1,1/\gamma _k\}.\)
Proof
Using the bound on \(\Vert {\textbf{d}}^k\Vert\) given by Lemma 3, and the fact that \(t_k\le \min \{1,1/\gamma _k\}\), we have
where \(q_k\) is the polynomial with coefficients defined in (31). Together with Lemma 5, this implies that for every \(t\le \min \{1,1/\gamma _k\}\) we have
Since \(a^k_4<0\) we have that \(q_k(\mu )\rightarrow -\infty\) as \(\mu \rightarrow +\infty\). This implies that if \(\mu _k>\mu _k^*\) then \(q_k(\mu _k)<0\) and, by the inequality above, we have
for every \(t\le \min \{1,1/\gamma _k\}.\) Since \(c\in (0,1)\), this implies in particular that \(\min \{1,1/\gamma _k\}\) satisfies Armijo condition (20) and therefore \(t_k = \min \{1,1/\gamma _k\}.\) \(\square\)
Lemma 7
If Assumptions A2 and A3 hold and at iteration k we have \(\ell _k\ge L\), then \(t_k = \min \{1,1/\gamma _k\}\).
Proof
By the previous lemma, in order to prove the statement it is enough to show that when \(\ell _k\ge L\) we have \(\mu _k\ge \mu ^*_k\), with \(\mu _k^*\) the largest root of the polynomial \(q_k\) defined in (31). Using the Cauchy bound for the roots of polynomials [16], we have that
Since \(a^k_4 = -\frac{1-b}{\gamma _k^2}\) and \(\gamma _k\le 1+b\), we have that for every k
Using this inequality, the definition of \(\mu _k\), and the fact that \({\hat{a}}^k_i\ge 0\) for every \(i=0,\dots ,3\), we have
This, together with (35) implies that \(\mu _k\ge \mu _k^*\) which concludes the proof. \(\square\)
Lemma 8
If Assumptions A2 and A3 hold then we have \(t_k = \min \{1,1/\gamma _k\}\) for infinitely many values of k.
Proof
By Lemma 7 we have that \(t_k = \min \{1,1/\gamma _k\}\) whenever \(\ell _k\ge L.\) Assume by contradiction that there exists an iteration index \({\bar{k}}\) such that for every \(k\ge {\bar{k}}\) the step size \(t_k\) is strictly smaller than \(\min \{1,1/\gamma _k\}.\) Since in Algorithm 1 we have \(\ell _{k+1} = 2\ell _k\) whenever \(t_k<\min \{1,1/\gamma _k\}\), this implies that for every \(k\ge {\bar{k}}\) we have
Therefore there exists \(k'\ge {\bar{k}}\) such that \(\ell _{k'}\ge L\), which implies \(t_{k'} = \min \{1,1/\gamma _{k'}\}\), contradicting the fact that \(t_k<\min \{1,1/\gamma _k\}\) for every \(k\ge {\bar{k}}\). \(\square\)
The above lemma allows us to prove the first global convergence statement. Namely, we prove that any bounded sequence of iterates has at least one accumulation point which is stationary.
Theorem 1
Assume that Assumptions A2, A3 hold and that \(\{{\textbf{x}}^{k}\}\) is a sequence generated by Algorithm 1 with arbitrary \({\textbf{x}}^{0}\in {\mathbb {R}}^N\). If \(\{{\textbf{x}}^{k}\}\) is bounded, then it has at least one accumulation point that is also a stationary point for \({\textbf{F}}({\textbf{x}}^{}).\)
Proof
Since \(\{{\textbf{x}}^{k}\}\subset {\mathbb {R}}^N\) is bounded and by Lemma 8 the sequence of step sizes \(\{t_k\}\) takes the value \(\min \{1,1/\gamma _k\}\) infinitely many times, we can take a subsequence \(\{{\textbf{x}}^{k_j}\}\subset \{{\textbf{x}}^{k}\}\) such that \(t_{k_j} = \min \{1,1/\gamma _{k_j}\}\) for every j and \({\textbf{x}}^{k_j}\) converges to \(\bar{{\textbf{x}}^{}}\) as j tends to infinity. By Lemma 2 we have
which implies that
By the definition of \(\gamma _k\) and (18) we have that \(\min \{1,1/\gamma _{k}\}\ge 1/(1+b)\). Since \(\{{\textbf{x}}^{k_j}\}\) is contained in a compact subset of \({\mathbb {R}}^N\), and \({\textbf{R}}({\textbf{x}}^{})\) is twice continuously differentiable, the sequences \(\Vert {\textbf{H}}_{k_j}\Vert\), \(\Vert {\textbf{R}}_{k_j}\Vert\), and \(\Vert {\textbf {{J}}}_{k_j}\Vert\) are bounded from above, which by the definition of \(\mu _k\) implies that \(\Vert {\textbf{H}}_{k_j}\Vert + \mu _{k_j}\) is also bounded from above. This, together with (36), implies that \(\Vert {\textbf{g}}_{k_j}\Vert\) vanishes as j tends to infinity and therefore \(\bar{{\textbf{x}}^{}}\) is a stationary point of \({\textbf{F}}({\textbf{x}}^{}).\) \(\square\)
Lemma 9
If Assumptions A2 and A3 hold and \(\ell _{min}>0\) then for every index k we have
Proof
From inequality (32), since \(t\le \min \{1,1/\gamma _k\}\), we have
where \(q_k\) is defined in (31). Using the inequality above together with Lemma 7 we have
Let us define
We can easily see that if \(t\le {\bar{t}}_k\) then the term in parentheses in the previous inequality is nonpositive and therefore
Since in Algorithm 1 we take \(c\in (0,1)\), this implies that if \(t\le {\bar{t}}_k\) the Armijo condition (20) holds, and therefore the accepted step size \(t_k\) satisfies \(t_k\ge \nu {\bar{t}}_k\).
By Lemma 7 we have that if \(\ell _k\ge L\) then \(t_k = \min \{1,1/\gamma _{k}\}\ge 1/(b+1)\). Let us consider the case when \(\ell _k <L,\) which also implies \(\mu _k\le {{\bar{\mu }}}_k\) with
Using the definition of \({{\bar{\mu }}}_k\) and the fact that \({{\bar{\mu }}}_k\ge 1\) and \(\mu _k\le {{\bar{\mu }}}_k\), we have
Since we are considering the case \(\ell _k<L\), we have \({\hat{a}}^k_i\le \frac{\ell _k^2}{L^2}a^k_i\) for every \(i=0,\dots ,3.\) Moreover, \(\frac{1}{|a^k_4|} = \frac{\gamma _k^2}{1-b}\le \frac{(1+b)^2}{1-b}.\) This implies
Using this inequality and (40) in the definition of \({\bar{t}}_k\) we get
which gives us the thesis. \(\square\)
Finally, we can state the global convergence results.
Theorem 2
If Assumptions A2 and A3 hold and \(\ell _{min}>0\) then every accumulation point of the sequence \(\{{\textbf{x}}^{k}\}\) is a stationary point of \({\textbf{F}}({\textbf{x}}^{}).\)
Proof
Let \(\bar{{\textbf{x}}^{}}\) be an accumulation point of \(\{{\textbf{x}}^{k}\}\) and let \(\{{\textbf{x}}^{k_j}\}\) be a subsequence converging to \(\bar{{\textbf{x}}^{}}.\) By Lemma 2 we have
and therefore that
By Lemma 9 the sequence \(\{t_k\}\) is bounded from below by a positive constant, while by continuity of \({\textbf {{J}}}({\textbf{x}}^{}), {\textbf{R}}({\textbf{x}}^{}), {\textbf{H}}({\textbf{x}}^{})\), and of the 2-norm, we have that \(\Vert {\textbf{H}}_{k_j}\Vert + \mu _{k_j}\) is bounded from above. This implies
and thus \(\bar{{\textbf{x}}^{}}\) is a stationary point of \({\textbf{F}}({\textbf{x}}^{}).\) \(\square\)
4.2 Local convergence
Let us now analyze the local convergence. We are going to show that the LMS method generates a linearly convergent sequence under a set of suitable assumptions. Notice that the assumptions we use are standard, see [2], together with the sparsity assumption already stated. Let S denote the set of all stationary points of \(\Vert {\textbf{R}}({\textbf{x}}^{})\Vert\), namely \(S = \{{\textbf{x}}^{}\in {\mathbb {R}}^{N} : {\textbf {{J}}}({\textbf{x}}^{})^\top {\textbf{R}}({\textbf{x}}^{}) = 0\}.\) Consider a stationary point \({\textbf{x}}^{*}\in S\) and a ball \(B({\textbf{x}}^{*},r)\) with radius \(r > 0\) around it. In the rest of the section we make the following assumptions, see [2].
Assumption 4
There exists \(\omega >0\) such that for every \({\textbf{x}}^{} \in B( {\textbf{x}}^{*},r)\)
Assumption 5
There exists \(\sigma >0\) such that for every \({\textbf{x}}^{} \in B( {\textbf{x}}^{*},r)\) and every \({\bar{{\textbf{z}}}} \in B({\textbf{x}}^{*},r) \cap S\)
From now on we denote by \(\rho _k\) the relative residual of the linear system (11) attained by the approximate solution \({\textbf{d}}^k\). That is
This residual was already considered in (16), where we briefly mentioned that \(\beta _k\) will be determined so that this residual is minimized. Further details on this choice are presented in Sect. 4.3; in this part we keep \(\rho _k\) unspecified, assuming only that it is small enough for the local convergence requirements to be fulfilled. Clearly, for completely separable problems, i.e. \({\textbf{B}}_k = 0\), we get \(\rho _k = 0\); hence the value of \(\rho _k\) depends on the constant M of Assumption 1: if M is small enough, i.e., if the problem is nearly separable to a sufficient degree, it is reasonable to expect that the values of \(\rho _k\) will be small enough for a suitable choice of \(\beta _k.\)
The inequalities in the lemma below are direct consequences of Assumption 3; their proofs can be found in [2].
Lemma 10
Let Assumption 2 hold. There exist positive constants \(L_2, L_3\) and \(L_4\) such that for every \({\textbf{x}}^{}, {\textbf{y}}\in D_r,\ {\bar{{\textbf{z}}}}\in B({\textbf{x}}^{*},r) \cap S\) the inequalities below hold:
From now on, given any point \({\textbf{x}}^{}\in {\mathbb {R}}^N\), we denote by \(\bar{{\textbf{x}}^{}}\) the point in S that realizes \(\Vert {\textbf{x}}^{}  \bar{{\textbf{x}}^{}}\Vert = {{\,\textrm{dist}\,}}({\textbf{x}}^{}, S).\)
Lemma 11
There exists \(r>0\) and \(c_1>0\) such that if \({\textbf{x}}^{k}\in B({\textbf{x}}^{*},r)\) then \(\Vert {\textbf{d}}^k\Vert \le c_1{{\,\textrm{dist}\,}}({\textbf{x}}^{k},S).\)
Proof
Let us define \({\textbf {H}}^* = {\textbf{H}}({\textbf{x}}^{*})\). Consider the eigendecomposition of \({\textbf {H}}^* = {{\textbf {S}}}_*^\top {{\textbf {S}}}_* = {\textbf{Q}}_*{\varvec{{\Lambda }}}_*{\textbf{Q}}_*^\top\) where \({\varvec{{\Lambda }}}_*\) is a diagonal matrix containing the ordered eigenvalues of \({{\textbf {S}}}_*^\top {{\textbf {S}}}_*\) and \({\textbf{Q}}_*\) is the matrix containing the orthogonal eigenvectors corresponding to the eigenvalues in \({\varvec{{\Lambda }}}_*.\) Denoting with p the rank of \({{\textbf {S}}}_*^\top {{\textbf {S}}}_*\) we have that
with \(\Lambda _{*1} = {{\,\textrm{diag}\,}}(\lambda ^*_1,\dots , \lambda ^*_p)\) and \(\lambda ^*_1\ge \lambda ^*_2\ge \dots \ge \lambda ^*_p>0.\) Consider the eigendecomposition of \({\textbf{H}}_k = {\textbf{Q}}_k{\varvec{{\Lambda }}}_k {\textbf{Q}}_k^\top\) and consider the partition of \({\textbf{Q}}_k\) and \({\varvec{{\Lambda }}}_k\) corresponding to the partition of \({\varvec{{\Lambda }}}_*:\)
with \(\Lambda _{k1} = {{\,\textrm{diag}\,}}(\lambda ^k_1,\dots ,\lambda ^k_p)\in {\mathbb {R}}^{p\times p}\), \(\Lambda _{k2} = {{\,\textrm{diag}\,}}(\lambda ^k_{p+1},\dots ,\lambda ^k_N)\in {\mathbb {R}}^{(N-p)\times (N-p)}\), \(Q_{k1}\in {\mathbb {R}}^{N\times p}\), \(Q_{k2}\in {\mathbb {R}}^{N\times (N-p)}\) and \(\lambda ^k_1\ge \dots \ge \lambda ^k_p\). Since \({\textbf{R}}\) is continuously differentiable on \(B({\textbf{x}}^{*},r)\), the entries of \({{\textbf {S}}}({\textbf{x}}^{})^\top {{\textbf {S}}}({\textbf{x}}^{})\) are continuous functions of \({\textbf{x}}^{}\) and thus the eigenvalues of \({{\textbf {S}}}({\textbf{x}}^{})^\top {{\textbf {S}}}({\textbf{x}}^{})\) are continuous functions of \({\textbf{x}}^{}\). Therefore, for r small enough we have \(\lambda ^k_i\ge \frac{1}{2}\lambda ^*_p\) for every \(i=1,\ldots ,p.\) Moreover, since \({\textbf{Q}}_k\) is an orthogonal matrix, we have that \(\Vert {\textbf{d}}^k\Vert ^2 = \Vert Q_{k1}^\top {\textbf{d}}^k\Vert ^2 + \Vert Q_{k2}^\top {\textbf{d}}^k\Vert ^2.\) For \(i=1,2\), by definition of \({\textbf{d}}^k\), we have
so that
By definition of \(\gamma _k\), inequality (45), and the fact that \(\lambda ^k_p\ge \frac{1}{2}\lambda ^*_p\) we have
and analogously,
Therefore the thesis holds with
for \(\gamma = 1+b \ge \gamma _k,\) and \(\mu _{\min } = \inf _{k}\mu _k\ge 1.\)\(\square\)
Lemma 12
If \({\textbf{x}}^{k}, {\textbf{x}}^{k+1}\in B({\textbf{x}}^{*},r/2)\) then
where \(\rho _{max} = \max _{k}\rho _k\) with \(\rho _k\) defined in (42) and \(\bar{{\textbf{x}}}^{k}\) is a point in S such that \({{\,\textrm{dist}\,}}({\textbf{x}}^{k}, S) = \Vert {\textbf{x}}^{k}  \bar{{\textbf{x}}}^{k}\Vert\).
Proof
Since
we have
By definition of \(\rho _k\) there holds
Replacing these two inequalities in (52) and using Lemma 11 we get the thesis. \(\square\)
Lemma 13
Assume that there exists \(\eta \in (0,1)\) such that \(\eta \omega >c_1\mu _{max} + \rho _{\max }L_3 + (2+c_1)\sigma\),
If \({\textbf{x}}^{k}, {\textbf{x}}^{k+1}\in B({\textbf{x}}^{*},r/2)\) and \({{\,\textrm{dist}\,}}({\textbf{x}}^{k}, S)\le \varepsilon\) then
Proof
By Assumption 5 and Lemma 11
and
Therefore, from Lemma 12, since we are assuming \({{\,\textrm{dist}\,}}({\textbf{x}}^{k}, S)\le \varepsilon ,\) there follows
and we get the thesis by definition of \(\varepsilon .\) \(\square\)
The above Lemmas allow us to prove the local linear convergence.
Theorem 3
Assume that Assumptions 25 hold and that there exists \(\eta \in (0,1)\) such that \(\eta \omega >\mu _{\max } c_1+L_3\rho _{\max }+(2+c_1)\sigma\) and let us define
If \({\textbf{x}}^{0}\in B({\textbf{x}}^{*},\varepsilon )\) then \({{\,\textrm{dist}\,}}({\textbf{x}}^{k},S)\rightarrow 0\) linearly and \({\textbf{x}}^{k}\rightarrow \bar{{\textbf{x}}^{}}\in S\cap B({\textbf{x}}^{*}, r/2)\).
Proof
We prove by induction on k that \({\textbf{x}}^{k}\in B({\textbf{x}}^{*}, r/2)\) for every k.
For \(k = 1\), by Lemma 11 and the definition of \(\varepsilon\), we have
Given \(k\ge 1\), assume that for every \(j=1,\ldots ,k-1\) there holds \({{\,\textrm{dist}\,}}({\textbf{x}}^{j}, S)\le \varepsilon\) and \({\textbf{x}}^{j}\in B(x^*, r/2)\). Then we have
and the fact that the righthand side is smaller than r/2 follows again from Lemma 11 and the definition of \(\varepsilon\). So, \({\textbf{x}}^{k}\in B({\textbf{x}}^{*}, r/2)\) for every k and to prove the first part of the thesis it is enough to apply Lemma 13.
Therefore, if there exists \(\bar{{\textbf{x}}}^{} = \lim {\textbf{x}}^{k}\), then the limit has to belong to \(S\cap B(x^*, r/2)\), and to prove the second part of the thesis we only need to prove that such a limit exists. For every index k we have that \(\Vert {\textbf {d}}^k\Vert \le c_1\varepsilon \eta ^k\), so given any two indices l, q, we have
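The last step is the standard geometric-series estimate: under the bound \(\Vert {\textbf {d}}^k\Vert \le c_1\varepsilon \eta ^k\) and \(t_k\le 1\), for any \(q>l\) it reads

```latex
\Vert \mathbf{x}^{q}-\mathbf{x}^{l}\Vert
\;\le\; \sum_{k=l}^{q-1} t_k\,\Vert \mathbf{d}^{k}\Vert
\;\le\; c_1\varepsilon \sum_{k=l}^{q-1} \eta^{k}
\;\le\; \frac{c_1\varepsilon\,\eta^{l}}{1-\eta}
\;\xrightarrow{\;l\to\infty\;}\; 0 .
```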
and \(\{{\textbf{x}}^{k}\}\) is a Cauchy sequence in \({\mathbb {R}}^N.\) So, it is convergent. \(\square\)
Remark 4.2
We notice that the condition
in Theorem 3 is analogous to the condition used to prove local linear convergence in [2], namely \(\eta \omega >(2+c_1)\sigma\). The two additional terms in the condition of Theorem 3 are a consequence of the main differences between Algorithm 1 and the method considered in [2]. In particular, \(\mu _{\max } c_1\) depends on the different choice of \(\mu _k\), while \(L_3\rho _{\max }\) arises from the fact that at each iteration the Levenberg-Marquardt system is solved inexactly. We also notice that, recalling the definition of \(c_1\) in Lemma 11, the condition above implies
which in turn is analogous to the condition \(\sigma <\lambda ^*_n\) used in the convergence analysis of the classical Levenberg-Marquardt method for problems with nonsingular Jacobian and nonzero residual at the solution [5].
4.3 Choice of \(\beta _k\)
The choice of \(\beta _k\) has been mentioned several times as a crucial ingredient of the algorithm we consider. Recall that the role of \(\beta _k\) is to compensate, as far as possible, for the information disregarded when splitting the original LM system into K separable systems in a computationally efficient way. Furthermore, due to condition (18), \(\beta _k\) can take a non-negligible value only if \(\Vert {\textbf{B}}_k\Vert\) is not too large, i.e., if the problem is sparse enough, and that is enough for global convergence. To obtain local linear convergence we need to make the residual small enough, recall (42). An intuitive approach is to determine \(\beta _k\) so that the residual of the solution of (14) with respect to the exact linear system (12) is minimized. That is, we have
with
Defining \({\textbf{u}}= {\textbf{B}}_k{\textbf{g}}^k,\ {\textbf{v}}= {\textbf{B}}_k({\textbf{H}}_k +\mu _k {\textbf{I}})^{-1}{\textbf{B}}_k{\textbf{g}}^k,\ {\textbf{w}}={\textbf{B}}_k({\textbf{H}}_k +\mu _k {\textbf{I}})^{-1}{\textbf{g}}^k,\) we have
and the solution of (56) is given by
Let us now consider the actual computation of \(\beta _k.\) To compute the vectors \({\textbf{u}},\ {\textbf{v}},\ {\textbf{w}}\) we first compute \({\textbf{u}}= {\textbf{B}}_k{\textbf{g}}^k\) directly, then we find \(\widehat{\textbf{v}},\ \widehat{\textbf{w}}\) such that \(({\textbf{H}}_k +\mu _k {\textbf{I}})\widehat{\textbf{v}}= {\textbf{u}}\) and \(({\textbf{H}}_k +\mu _k {\textbf{I}})\widehat{\textbf{w}}= {\textbf{g}}^k\), and finally we take \({\textbf{v}}= {\textbf{B}}_k\widehat{\textbf{v}}\), \({\textbf{w}}= {\textbf{B}}_k\widehat{\textbf{w}}.\) Since \(({\textbf{H}}_k +\mu _k {\textbf{I}})\) is block-diagonal, \(({\textbf{H}}_k +\mu _k {\textbf{I}})\widehat{\textbf{v}}= {\textbf{u}}\) can be decomposed into independent linear systems with coefficient matrices \(H^k_s + \mu _k I\) for \(s=1,\ldots ,K\), and we can proceed analogously for \(({\textbf{H}}_k +\mu _k {\textbf{I}})\widehat{\textbf{w}}= {\textbf{g}}^k\). Moreover, if we solve \(({\textbf{H}}_k +\mu _k {\textbf{I}})\widehat{\textbf{v}}= {\textbf{u}}\) by computing a factorization of \(H^k_s+ \mu _k I,\) then the same factorization can be used to solve the linear system for \(\widehat{\textbf{w}}\) and later to solve (15), so the computation of \(\beta _k\) is not expensive.
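As an illustration, the procedure above can be sketched in a few lines of NumPy/SciPy. The sketch assumes that the corrected split system has the form \(({\textbf{H}}_k+\mu _k{\textbf{I}}){\textbf{d}} = -({\textbf{g}}^k-\beta {\textbf{B}}_k{\textbf{g}}^k)\), so that its residual in the full system is linear in \(\beta\); under this assumption the minimizer (58) reduces to \(\beta _k = \langle {\textbf{u}}+{\textbf{v}},{\textbf{w}}\rangle /\Vert {\textbf{u}}+{\textbf{v}}\Vert ^2\). The exact signs and scalings of (58) in the paper may differ; the function names are ours.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def beta_correction(H_blocks, B, g, mu):
    """Sketch of the right-hand-side correction described above.

    Assumes the corrected split system is (H_k + mu I) d = -(g - beta * B g),
    whose residual in the full system (H_k + B_k + mu I) d = -g is
    beta*(u + v) - w, linear in beta; beta is its 1-D least-squares minimizer.
    """
    # Factor each diagonal block once; the factors are reused for both
    # auxiliary solves and for the final direction.
    factors = [cho_factor(Hs + mu * np.eye(Hs.shape[0])) for Hs in H_blocks]

    def blockdiag_solve(rhs):
        # Solve (H_k + mu I) x = rhs block by block (independent systems).
        out = np.empty_like(rhs)
        i = 0
        for f, Hs in zip(factors, H_blocks):
            n = Hs.shape[0]
            out[i:i + n] = cho_solve(f, rhs[i:i + n])
            i += n
        return out

    u = B @ g                      # u = B_k g^k
    w = B @ blockdiag_solve(g)     # w = B_k (H_k + mu I)^{-1} g^k
    v = B @ blockdiag_solve(u)     # v = B_k (H_k + mu I)^{-1} B_k g^k
    uv = u + v
    beta = float(uv @ w) / float(uv @ uv) if uv @ uv > 0 else 0.0
    d = -blockdiag_solve(g - beta * u)   # corrected direction
    return beta, d
```

By construction, the residual attained with this \(\beta\) is never larger than the one obtained with \(\beta = 0\).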
Having \(\beta _k\) computed as above, for the residual \(\varphi _k(\beta _k)\) we have
If the vector \(({\textbf{H}}_k +\mu _k {\textbf{I}})^{-1}{\textbf{g}}^k\) lies in the null space of \({\textbf{B}}_k\), we have \(\beta _k = 0\) and \(\Vert \varphi _k(\beta _k)\Vert = 0\), so in this case the direction \({\textbf{d}}^k\) is equal to the Levenberg-Marquardt direction. If \({\textbf{g}}^k\) is in the kernel of \({\textbf{B}}_k\), then the residual \(\Vert \varphi _k(\beta _k)\Vert\) is equal to \(\Vert {\textbf{B}}_k({\textbf{H}}_k +\mu _k {\textbf{I}})^{-1}{\textbf{g}}^k\Vert ^2\) for any choice of the parameter \(\beta .\) If neither \(({\textbf{H}}_k +\mu _k {\textbf{I}})^{-1}{\textbf{g}}^k\) nor \({\textbf{g}}^k\) lies in the null space of \({\textbf{B}}_k\), then the optimal \(\beta _k\) given by (58) is nonzero, and the right-hand side correction is effective in reducing the residual of the linear system. In general we have that
so \(\rho _k\) in (42) is bounded from above by \(\Vert {\textbf{B}}_k\Vert \Vert ({\textbf{H}}_k +\mu _k {\textbf{I}})^{-1}\Vert .\) Taking into account Assumption 1 and the definition of \(\mu _k\), which implies in particular \(\mu _k\ge \frac{(b+1)^2}{1-b}\Vert {\textbf {{J}}}_k\Vert ^2\), we have
From the inequalities above we see the dependence of the relative residual \(\rho _k\) on the norm \(\Vert {\textbf{B}}_k\Vert\), i.e., on the constant M which measures the magnitude of the part we disregard when approximating the Levenberg-Marquardt system with a block-diagonal one. We also notice that the residual is smaller for larger values of the damping parameter \(\mu _k\).
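The decay of the bound with \(\mu _k\) is easy to check numerically; the matrices below are synthetic stand-ins, not the paper's data. Since \({\textbf{H}}_k\) is positive semidefinite, \(\Vert ({\textbf{H}}_k +\mu _k {\textbf{I}})^{-1}\Vert \le 1/\mu _k\), so the bound \(\Vert {\textbf{B}}_k\Vert \Vert ({\textbf{H}}_k +\mu _k {\textbf{I}})^{-1}\Vert\) decays at least like \(1/\mu _k\):

```python
import numpy as np

rng = np.random.default_rng(1)
J = rng.standard_normal((10, 6))
H = J.T @ J                        # positive semidefinite, as H_k = J_k^T J_k
B = 0.1 * rng.standard_normal((6, 6))
normB = np.linalg.norm(B, 2)

bounds = []
for mu in (0.5, 2.0, 8.0):
    inv_norm = np.linalg.norm(np.linalg.inv(H + mu * np.eye(6)), 2)
    assert inv_norm <= 1.0 / mu + 1e-12   # PSD H  =>  ||(H + mu I)^{-1}|| <= 1/mu
    bounds.append(normB * inv_norm)

# The upper bound on the relative residual decreases as mu grows.
assert bounds[0] > bounds[1] > bounds[2]
```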
5 Implementation and numerical results
In this section we present the results of a set of numerical experiments carried out to investigate the performance of the proposed method, compare it with the classical Levenberg-Marquardt method, and analyze the effectiveness of the right-hand side correction. For all the tests presented here we consider Network Adjustment problems [17], briefly described in Subsection 5.1. The LMS method is defined assuming that we can take advantage of sparsity through a suitable partition of variables and residuals, and that we are able to apply the efficient right-hand-side correction described in Subsection 4.3, i.e., computing \(\beta _k\) as in (58).
5.1 Least squares network adjustment problem
Consider a set of points \(\{P_1,\dots , P_n\}\) in \({\mathbb {R}}^2\) with unknown coordinates, and assume that a set of observations of geometrical quantities involving the points is available. Least squares adjustment consists in using the available measurements to find accurate coordinates of the points, by minimizing the residual with respect to the given observations in the least squares sense.
We consider here network adjustment problems with three kinds of observations: point-point distance, the angle formed by three points, and point-line distance.
In order to consider suitably increasing sizes, the problems are generated artificially, taking into account information about the average connectivity and structure of the network obtained from the analysis of real cadastral networks. The problems are generated as follows. Given the number of points n, we obtain \(\{P_1,\dots , P_n\}\) by uniformly sampling \(25\%\) of the points of a regular \(2\sqrt{n}\times 2\sqrt{n}\) grid, and we generate observations of the three kinds mentioned above until the average degree of the points equals 6. Each observation is generated by randomly selecting the points involved and drawing a Gaussian random number with mean equal to the true measurement and a given standard deviation. We use standard deviations of 0.01 and 1 degree for distance and angle observations, respectively. For all points we also add coordinate observations: for \(1\%\) of the points we use standard deviation 0.01, while for the remaining \(99\%\) we use standard deviation 1.
The optimization problem is defined as a weighted least squares problem
with \(r_j({\textbf{x}}^{}) = w_j^{-1}{\widehat{r}}_j({\textbf{x}}^{})\), where \({\widehat{r}}_j\) is the residual function of the jth observation and \(w_j\) is the corresponding standard deviation.
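A minimal sketch of this setup, restricted to distance observations (angle, point-line, and coordinate observations are omitted for brevity; the helper names and default parameters are ours, not the paper's):

```python
import numpy as np

def generate_network(n, avg_degree=6, dist_sigma=0.01, seed=0):
    """Toy generator in the spirit of Sect. 5.1: sample n points as 25%
    of a regular 2*sqrt(n) x 2*sqrt(n) grid, then add noisy point-point
    distance observations until the average degree reaches avg_degree."""
    rng = np.random.default_rng(seed)
    side = int(2 * np.sqrt(n))
    grid = np.array([(i, j) for i in range(side) for j in range(side)], dtype=float)
    pts = grid[rng.choice(len(grid), size=n, replace=False)]
    obs = []
    while 2 * len(obs) < avg_degree * n:   # each edge adds 2 to the total degree
        a, b = rng.choice(n, size=2, replace=False)
        true_d = np.linalg.norm(pts[a] - pts[b])
        obs.append((a, b, true_d + rng.normal(0.0, dist_sigma), dist_sigma))
    return pts, obs

def weighted_residuals(x, obs):
    """r_j = w_j^{-1} * (||P_a - P_b|| - measured distance), as in the text."""
    x = x.reshape(-1, 2)
    return np.array([(np.linalg.norm(x[a] - x[b]) - d) / w for a, b, d, w in obs])
```

Evaluated at the true coordinates, the weighted residuals are standard normal by construction, which is what the stopping criterion of Subsection 5.2 exploits.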
In Fig. 1 we present the spy plot of the matrix \({\textbf {{J}}}^\top {\textbf {{J}}}\) for a problem of size 35,000.
5.2 Comparison with LevenbergMarquardt method
In all the tests that follow we use a Python implementation of Algorithm LMS and of the classical LM method, and PyPardiso [9] to solve the sparse linear systems that arise at each iteration. All the tests were performed on a computer with an Intel(R) Core(TM) i7-1165G7 processor @ 2.80GHz and 16.0 GB of RAM running Windows 10. All the methods that we consider have the same iteration structure. The main difference is that while in the LM method the linear system is solved directly using PyPardiso, in LMS we first perform the splitting and then use the same PyPardiso function to solve the resulting linear systems; therefore the time comparisons that we present are meaningful.
The partition of variables and residuals into sets \(E_s, s=1,\ldots ,K\) is assumed to be given before the LMS algorithm is applied. To compute the partitioning of the variables we use METIS [12] which, given a network and an integer \(K>1\), finds a partition of the vertices of the network into K subsets of similar sizes that approximately minimizes the number of edges between nodes in different subsets. The partition is computed by METIS in a multilevel fashion: starting from a coarse representation of the graph, an initial partition is computed, projected onto a denser representation of the network, and then refined. This process is repeated on a sequence of progressively denser networks, up to the original graph.
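METIS itself is a compiled library; as a rough illustration of what a graph partitioner produces (parts of similar size, few cut edges), here is a crude pure-Python stand-in based on breadth-first search. It is not the multilevel algorithm METIS actually uses, and its edge cuts are far worse:

```python
from collections import deque

def bfs_partition(adj, K):
    """Grow K contiguous parts of roughly equal size by BFS.
    `adj` maps each vertex 0..n-1 to a list of neighbours."""
    n = len(adj)
    target = -(-n // K)                # ceil(n / K): target part size
    part = [-1] * n
    label = 0
    for start in range(n):
        if part[start] != -1:
            continue
        queue, size = deque([start]), 0
        while queue and size < target:
            v = queue.popleft()
            if part[v] != -1:
                continue
            part[v] = label
            size += 1
            queue.extend(u for u in adj[v] if part[u] == -1)
        label = min(label + 1, K - 1)  # last label absorbs any leftovers
    return part

def edge_cut(adj, part):
    """Number of edges whose endpoints fall in different parts."""
    return sum(1 for v in range(len(adj)) for u in adj[v]
               if u > v and part[u] != part[v])
```

On the variable-interaction graph of Sect. 3, the cut edges correspond exactly to the entries of \({\textbf{B}}_k\) that the splitting disregards, which is why a good partitioner matters.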
In all the tests that we considered, the time needed to compute the partitioning is negligible with respect to the overall computation time. This is in part due to the fact that the partitioning is computed only once at the beginning of the procedure and not repeated at each iteration.
We now consider a set of problems of increasing size and solve each problem with the LMS method, with the correction coefficient \(\beta _k\) computed as in (58). The problems are also solved with the LM method. We consider problems with sizes between 20,000 and 120,000 and plot the time taken by the two methods to reach termination. Both methods use as initial guess the coordinate observations available in the problem description, and they stop when at least \(68\%, 95\%\), and \(99.5\%\) of the residuals are smaller than 1, 2, and 3 times the standard deviation, respectively. The obtained results are in the first plot of Fig. 2. To give a better comparison, in the second plot we extend the size of the problems solved with the proposed method up to 1 million variables. The LM method could not cope with such large problems (in our testing environment), while LMS successfully solved problems of increasing dimension up to 1 million variables. In Fig. 3 we show the log-log plot of the time necessary to solve each problem, compared with different rates of growth. For the method with \(K>1\), a small number of values of the parameter K was tested and the best one was selected for the comparison. The value of K used at each dimension is reported in Fig. 8.
From Fig. 2 one can see that the LMS method with \(K>1\) results in a significant reduction of the time necessary to reach the desired accuracy, compared to the Levenberg-Marquardt method. Moreover, from the second plot of Fig. 2 and from Fig. 3 we notice that, on the problems we considered, the time taken by the proposed method grows approximately as \(n^{1.3}\), which suggests that the method discussed in this paper is suitable for problems of very large dimension.
To better understand the behaviour of the method, in Fig. 4 we plot the percentages mentioned above and the value of the relative residual \(F_k/F_0\) at each iteration, for a problem of size \(N=10^5\) and \(K = 15.\) For the same problem, in Fig. 5 we show the distribution of the coordinate error with respect to the true solution, for the initial guess and the estimated solution (left- and right-hand plots, respectively).
5.3 Influence of the parameters K and \(\beta _k\)
Let us now study how the number of subproblems K influences the performance of the method. We consider two problems, with 100,000 and 200,000 variables respectively, and solve them with the proposed algorithm for a set of increasing values of K. For each K we plot in Fig. 6 the time taken by the method to reach the desired accuracy. The initial guess and the stopping criterion are defined as in the previous test.
One can notice that the time decreases as K increases, up to an optimal value (\(K=15\) for the first problem and \(K=30\) for the second one), after which the time starts to increase again. The reason behind this behavior is that larger values of K yield smaller linear systems and therefore cheaper iterations, but also less accurate search directions \({\textbf{d}}^k\), resulting in a larger number of iterations needed to achieve the desired accuracy. For large values of K, the increase in the number of iterations outweighs the saving in the solution of the linear systems, and the overall computational cost increases. Finally, we notice that, despite the existence of an optimal value of the parameter K, there appears to be an interval of values for which the cost of the method is comparable. This suggests that fine-tuning of the parameter K is not necessary and that, given a problem, choosing K according to the number of variables should be enough to achieve good performance.
To see that the proposed right-hand side correction improves the performance of the method, we repeat the test presented in Subsection 5.2 for sizes up to \(N = 10^6\), but the comparison is here carried out against the case \(\beta _k = 0\), that is, when the linear system is approximated as in (13) but no right-hand side correction is applied.
For both methods, a few different values of the parameter K were tested. In Fig. 7 we report the time needed by the two methods to satisfy the convergence criterion, for the best K among the considered values. In Fig. 8 we plot, for each method and each size, the value of the parameter K corresponding to the timings in Fig. 7.
We can see that applying the proposed right-hand side correction effectively reduces the time necessary to satisfy the stopping condition. From Fig. 8 one can notice that the optimal K for the method with right-hand side correction is generally higher than for the method without correction. These two results together suggest that the method with right-hand side correction achieves better performance because it allows the set of variables to be partitioned into smaller subsets, which implies a faster computation of the direction at each iteration, before incurring a decrease in performance due to the additional iterations necessary to reach the desired accuracy.
6 Conclusions
We presented a method of inexact Levenberg-Marquardt type for sparse problems of very large dimension. Assuming that the problem is nearly separable, i.e., sufficiently sparse so that each component of the residual depends only on a few variables, the proposed method is defined through a splitting into a set of K independent systems of equations of smaller dimension. The decoupling is done by keeping the dense diagonal blocks of the LM system and disregarding the, hopefully very sparse, off-diagonal blocks. To compensate for the disregarded off-diagonal blocks we introduced a correction of the right-hand side of the system, in such a way that the decoupling is maintained but the information contained in the off-diagonal matrix is preserved in a computationally affordable way, using a single parameter that can be computed in the same fashion, by solving a sequence of small-dimensional systems of linear equations. The key idea is that solving K systems of smaller dimension, which can be done sequentially or in parallel, is significantly cheaper than solving one large system of linear equations, even if the system is sparse.
The presented algorithm is globally convergent under a set of standard assumptions for a suitable choice of the regularization parameter in the LM system. In fact, global convergence does not rely on the separability assumption at all, as one can show that the direction computed by the decoupled sequence of LM systems is a descent direction. To achieve global convergence we rely on a line search and a regularization-parameter update by a trust-region-like scheme, similarly to [11]. Local linear convergence is proved under standard conditions, assuming that the residual of the linear system is small enough at each iteration. Hence, the near-separability assumption plays a role in local convergence. To achieve small residuals for the decoupled problem we rely heavily on the right-hand side correction and discuss the optimal choice of the parameter employed in the correction. The theoretical considerations are supported by numerical examples. We consider the network adjustment problem on simulated data of growing size, inspired by a real-world problem of cadastral maps, and with the proposed method solve problems of up to one million variables. A comparison with the classical LM method is presented, showing that the proposed method is significantly faster and able to cope with large dimensions. The experiments reported in this paper were run sequentially; a parallel implementation will be the subject of further research.
Data Availability
The testcases used to obtain the numerical results presented in Sect. 5 are generated by the authors and publicly available at the following address https://cloud.pmf.uns.ac.rs/s/GaSNnns9fdJeXqD. The code is available at https://github.com/gretamalaspina/LMS.
References
Bajović, D., Jakovetić, D., Krejić, N., Krklec Jerinkić, N.: Newton-like method with diagonal correction for distributed optimization. SIAM J. Optim. 27(2), 1171–1203 (2017)
Behling, R., Gonçalves, D.S., Santos, S.A.: Local convergence analysis of the Levenberg-Marquardt framework for nonzero-residue nonlinear least-squares problems under an error bound condition. J. Optim. Theory Appl. 183, 1099–1122 (2019)
Bellavia, S., Gratton, S., Riccietti, E.: A Levenberg-Marquardt method for large nonlinear least-squares problems with dynamic accuracy in functions and gradients. Numer. Math. 140(3), 791–825 (2018)
Dan, H., Yamashita, N., Fukushima, M.: Convergence properties of the inexact Levenberg-Marquardt method under local error bound conditions. Optim. Methods Softw. 17(4), 605–626 (2002)
Dennis, J.E., Jr., Schnabel, R.B.: Numerical Methods for Unconstrained Optimization and Nonlinear Equations. Classics Appl. Math. 16, SIAM, Philadelphia (1996)
Fan, J., Pan, J.: Convergence properties of a self-adaptive Levenberg-Marquardt algorithm under local error bound condition. Comput. Optim. Appl. 34(1), 47–62 (2006)
Fan, J., Yuan, Y.: On the quadratic convergence of the Levenberg-Marquardt method without nonsingularity assumption. Computing 74(1), 23–29 (2005)
Franken, J., Florijn, W., Hoekstra, M., Hagemans, E.: Rebuilding the Cadastral Map of the Netherlands: The Artificial Intelligence Solution, FIG working week 2021 Proceedings, (2021)
van den Heuvel, F., Vestjens, G., Verkuijl, G., van den Broek, M.: Rebuilding the Cadastral Map of the Netherlands: The Geodetic Concept. FIG Working Week 2021 Proceedings, (2021)
Karas, E.W., Santos, S.A., Svaiter, B.F.: Algebraic rules for computing the regularization parameter of the LevenbergMarquardt method. Comput. Optim. Appl. 65, 723–751 (2016)
Karypis, G., Kumar, V.: Graph Partitioning and Sparse Matrix Ordering System. University of Minnesota, Minneapolis (2009)
Konolige, K.: Sparse Bundle Adjustment. British Machine Vision Conference (BMVC), Aberystwyth, Wales (2010).
Krejić, N., Lužanin, Z.: Newton-like method with modification of the right-hand side vector. Math. Comput. 71, 237 (2002)
Mao, G., Fidan, B., Anderson, B.D.O.: Wireless sensor network localization techniques. Comput. Netw. 51(10), 2529–2553 (2007)
Marden, M.: Geometry of Polynomials. American Mathematical Society, Providence (1966)
Teunissen, P.J.G.: Adjustment Theory. Series on Mathematical Geodesy and Positioning, DUP Blueprint (2003).
Yamashita, N., Fukushima, M.: On the rate of convergence of the Levenberg-Marquardt method. In: Topics in Numerical Analysis: With Special Emphasis on Nonlinear Problems, Comput. Suppl. 15, 239 (2001)
Acknowledgements
We are grateful to the referees whose constructive comments helped us to improve the results presented in this paper.
Funding
This work is supported by the European Union’s Horizon 2020 programme under the Marie SkłodowskaCurie Grant Agreement no. 812912. The work of Krejić is partially supported by the Serbian Ministry of Education, Science and Technological Development.
Ethics declarations
Conflict of interest
The authors declare that they have no competing interests that are relevant to the content of this article.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Krejić, N., Malaspina, G. & Swaenen, L. A split LevenbergMarquardt method for largescale sparse problems. Comput Optim Appl 85, 147–179 (2023). https://doi.org/10.1007/s10589023004609