A split Levenberg-Marquardt method for large-scale sparse problems

We consider large-scale nonlinear least squares problems with sparse residuals, each of them depending on a small number of variables. A decoupling procedure which results in a splitting of the original problems into a sequence of independent problems of smaller sizes is proposed and analysed. The smaller size problems are modified in a way that offsets the error made by disregarding dependencies that allow us to split the original problem. The resulting method is a modification of the Levenberg-Marquardt method with smaller computational costs. Global convergence is proved as well as local linear convergence under suitable assumptions on sparsity. The method is tested on the network localization simulated problems with up to one million variables and its efficiency is demonstrated.


Introduction
The problem we consider is a nonlinear least squares problem 1 3 where for every j = 1, … , m , r j ∶ ℝ N → ℝ is a C 1 function, R(x) = r 1 (x), … , r m (x) ⊤ ∈ ℝ m is the vector of residuals, and F is the aggregated residual function. We are assuming a special structure of the residuals that is relevant in many applications. That is, we assume that the residuals are very sparse functions, each depending only on a small number of variables while the whole problem is large-scale, i.e. with N very large. We are not assuming any particular sparsity pattern allowing many practical problems to fit the framework we consider. Typical problems of this kind are Least Squares Network Adjustment [17], Bundle Adjustment [13], Wireless Sensor Network Localization [15], where the variables correspond to the coordinates of physical points in a region of the 2 or 3 dimensional space and residuals correspond to observations of geometrical quantities involving these points. In these cases each observation typically involves a small (often fixed) number of points, and thus each residual function involves a small number of variables. Moreover, when considering problems of large dimension with points deployed in a large region of the space, the number of observations involving each point is small with respect to the total amount of observations. That is, each variable is involved in a relatively small number of residual functions. This leads to problems that are very sparse. Given that the measurements are prone to errors or to different kinds of noise, the residuals are in general weighted with the weights being reciprocal of the measurement precision. Furthermore, a typical property of such problems is that they are nearly separable, meaning that it is possible to partition the points into subsets that are connected by a small number of observations. The dominant properties of all these problems, a very large dimension N and sparsity, motivated the modification of classical Levenberg-Marquardt method presented in this paper.
The Levenberg-Marquardt (LM) method is generally used to solve the least squares problems of large dimension. The method is based on a regularization strategy for improvement of Gauss-Newton method. In each iteration of Gauss-Newton method one has to solve a linear least squares problem. The LM method further adds a regularization (damping) parameter that facilitate the direction computation. Thus in each iteration of the LM method one has to solve a system of linear equations to compute the step or step direction. The regularization parameter plays a fundamental role and its derivation is subject of many studies.
Many modifications of the classical Levenberg-Marquardt scheme have been proposed in literature to retain convergence while relaxing the assumptions on the objective function and to improve the performance of the method. In [6,7,18] the damping parameter is defined as a multiple of the objective function. With this choice of the parameter local superlinear or quadratic convergence is proved under a local error bound assumption for zero residual problems, while global convergence is achieved by employing a line search strategy. In [11] the authors propose an updating strategy for the parameter that, in combination with Armijo line search, ensures global convergence and q-quadratic local convergence under the same assumption of the previous papers. In [2] the non-zero residual case is considered and a Levenberg-Marquardt scheme is proposed that achieves local convergence with order depending on the rank of the Jacobian matrix and of a combined measure of nonlinearity and residual size around stationary points. In [4] an inexact Levenberg-Marquardt is considered and local convergence is proved under a local error bound condition. In [3] the authors propose an approximated Levenberg-Marquardt method, suitable for large-scale problems, that relies on inaccurate function values and derivatives.
The problems we are interested in are of very large dimension and sparse. Sparsity very often induces the property we call near-separability, i.e. it is possible to partition the variables into subsets such that a subset of residual functions depends only on a subset of variables while only a limited number of residual functions depends on variables in different subsets. This property is mainly a natural consequence of the problem origin. For example in the Network Adjustment problem physical distance of the points determines the set of points that are connected by observations. Although the complete separability of the residuals is not common, the number of residuals that depend on points from different subsets is in general rather small compared to the problem size N. On the other hand, for a very large N solving the LM linear system in each iteration, even for very sparse problems can be costly. The particular example is the refinement of cadastral maps and in this case N is prohibitively large for direct application of the LM method. For example, the Dutch Kadaster is pursuing making the cadastral map more accurate by making the map consistent with accurate field surveyor measurements [8,10]. This application yields a non-linear least squares problem which is known as 'least squares adjustment' in the field of geography. If the entire Netherlands were to be considered for one big adjustment problem, the number of variables would be twice the number of feature points in the Netherlands, which is on the order of 1 billion variables and even considering separate parts of Netherlands still yields a very large-scale problem.
The method we propose here is designed to exploit the sparsity and near-separability in the following way. Assuming that we can split the variables in such a way that a large number of residual function depends only on a particular subset of variables, while a relatively small number of residual functions depends on variables from different subsets, the system of linear equations in LM iteration has a particular block structure. The variable subsets and corresponding residuals imply strong dominance of relatively dense diagonal blocks while the non-diagonal blocks are very sparse. This structure motivated the idea of splitting: we decompose the LM system matrix into K independent systems of linear equations, determined by the diagonal blocks of the LM system. This way we get K linear systems of significantly smaller dimensions and we can solve them faster, either sequentially or in parallel. Thus, one can consider this approach as a kind of inexact LM method. However, this kind of splitting might be too inaccurate given that we completely disregarded the off-diagonal blocks of the LM matrix. Therefore, we modify the K independent linear systems in such a way that off-diagonal blocks are included in the righthand sides of these independent systems. The modification of the right-hand sides of independent linear systems is based on a correction parameter proposed for the Newton method in [14] and for the distributed Newton method in [1] and attempts to minimize the difference between the full LM linear system residual and the residual of the modified method. Having an affordable search direction computed by solving the K independent linear systems we can proceed in the usual way -applying a line search to get a globally convergent method. Furthermore, under a set of standard assumptions we can prove local linear convergence.
A proper splitting of the variables into suitable subsets is a key assumption for the efficiency of this method but the problems we are interested in very often offer a natural way of meaningful splitting. For example in the localization problems or network adjustment problems the geometry of points dictates meaningful subsets. In the experiments we present here one can see that a graph partitioning algorithm provides a good subset division in a cost-efficient way.
The numerical experiments presented in the paper demonstrate clearly the advantage of the proposed method with respect to the full-size LM method. We show that the splitting method successfully copes with very large dimensions, testing the problems of up to 1 million variables. Furthermore, we demonstrate that even in the case of large dimensions that can be solved by classical LM method, say around a couple of tens of thousands, the splitting method works faster. Additionally, we investigate the robustness of the splitting in the sense that we demonstrate empirically that the number of subsets plays an important role in the efficiency of the method but that there is a reasonable degree of freedom in choosing that number without affecting the performance of the method. The experiments presented here are done in a sequential way, i.e. we did not solve the independent systems of linear equations in parallel which would further enhance the proposed method. Parallel implementation will be a subject of further research.
The paper is organized as follows. In Sect. 2 we present the framework that we are considering. The proposed method is described in Sect. 3 while the theoretical analysis of the proposed method is carried out in Sect. 4. In Sect. 5 we discuss some implementation details and present numerical results.
The notation we use is the following. Mappings defined on ℝ N , vectors from ℝ N and matrices with at least one dimension being N are denoted by boldfaced letters F, R, x, B, … while their block-elements are denoted by the same letters in italics and indices so x = (x 1 , … , x s ), x ∈ ℝ N , x s ∈ ℝ n s . The dimensions are clearly stated to avoid confusion. We use min (⋅) and max (⋅) to denote the smallest and largest eigenvalue of a matrix, respectively. The Euclidean norm is denoted by ‖ ⋅ ‖ for both matrices and vectors.

Nearly separable problems
The problem we consider is stated in (1). Denote with I = {1, … , N} and with J = {1, … , m} . Given a partition I 1 , … , I K of I we define the corresponding partition of J into E 1 , … , E K as follows: That is, given a partition of the set of variables, each of the subsets E s contains the indices corresponding to residual functions that only involve variables in I s , while Ê contains the indices of residuals that involve variables belonging to different

3
A split Levenberg-Marquardt method for large-scale sparse… subsets I s . We say that problem (1) is separable if there exist K ≥ 2 and a partition {I s } s=1,…,K of I such that Ê = � , while we say that it is nearly-separable if there exist K ≥ 2 and a partition {I s } s=1,…,K of I such that the cardinality of Ê is small with respect to the cardinality of ⋃ K s=1 E s . The term "nearly-separable" is not defined precisely and should be understood in the same fashion as sparsity, i.e. assuming that we can identify the corresponding partitions.
The above described splitting can be interpreted as follows. Given the least squares problem in (1) we define the corresponding underlying network as the undirected graph G = (I, E) where I and E denote the set of nodes and set of edges respectively. The graph G that has a node for each variable x i , and an edge between node i and node k if there is a residual function r j that involves x i and x k . With this definition in mind the partition of I and J that we described above corresponds to a partition of the sets of nodes and edges of the network, where E s contains the indices of edges that are between nodes in the same subset I s and Ê contains the edges that connect different subsets. The problem is separable if the underlying network G is not connected (and the number K is equal to the number of connected components of G ). The problem is nearly-separable if we can partition the set of nodes of the network in such a way that the number of edges that connect different subsets is small with respect to the number of edges that are internal to the subsets.
Given the partition {I s } s=1,…,K of the variables and the corresponding partition {E s } s=1,…,K ,Ê of the residuals, for s = 1, … , K we define x s ∈ ℝ n s as the vector of the variables in I s where n s denotes the cardinality of I s , and we introduce the following functions so that for every s = 1, … , K , R s ∶ ℝ n s → ℝ is the vector of residuals involving only variables in I s , while ∶ ℝ N → R is the vector of residuals in Ê and F s , Φ are the corresponding local aggregated residual functions. Notice that ∑ K s=1 n s = N. With this notation problem (1) can be rewritten as In particular, if the problem is separable (and therefore Ê is empty) Φ ≡ 0 and solving problem (4) is equivalent to solving K independent least squares problems given by If the problem is not separable then in general Φ is not equal to zero and that is the case we are interested in.

LMS: the Levenberg-Marquardt method with splitting
Let {I s } s=1,…,K be a partition of I and {E s } s=1,…,K be the corresponding partition of J as defined in (2). To ease the notation we assume that we reordered the variables and the residuals functions according to the given partitions, such that for x ∈ ℝ N With this reordering, denoting with J sR j the Jacobian of the partial residual vector R j defined in (3) with respect to the variables in I s , and with J s the Jacobian of the partial residual with respect to x s , we have From this structure of R and J we get the corresponding block structure of the gradient g(x) = J(x) ⊤ R(x) and the matrix J(x) ⊤ J(x): with In the following, we denote with g k = J ⊤ k R k the vector with s-th block component equal to g s (x k ) , with H k = H(x k ) the block diagonal matrix with diagonal blocks given by H s (x k ) for s = 1, … , K , and with B k = B(x k ) the block partitioned matrix with diagonal blocks equal to zero and off-diagonal blocks equal to B ij (x k ) . That is,

3
A split Levenberg-Marquardt method for large-scale sparse… The algorithm we introduce here is motivated by near-separability property and hence we state the formal assumption below.
The assumption above is not restrictive as B(x) is a submatrix of J(x) ⊤ J(x). Furthermore, the global convergence of the algorithm we propose does not depend on it in the sense that we do not use the assumption for the convergence proof. In fact the proposed algorithm works even for problems that are not nearly-separable as it can be seen as a kind of quasi-Newton method, however the efficiency of the algorithm depends on the near-separability of the problem and the value of M. Moreover the value of M plays an important role in the analysis of local convergence and in achieving linear rate. Consider a standard iteration of LM method for a given iteration x k where d k ∈ ℝ N is the solution of where J k = J(x k ) ∈ ℝ m×N denotes the Jacobian matrix of R k = R(x k ) and k is a positive scalar. When N is very large solving (11) at each iteration of the method may be prohibitively expensive. In the following we propose a modification of the Levenberg-Marquardt method that exploits near-separability of the problem to approximate the linear system (11) with a set of independent linear systems of smaller size. The linear system (11) at iteration k can therefore be rewritten as The matrix B k depends only on the derivatives of the residual vector = (r j (x)) j∈Ê . If the problem is separable then Ê = � and B k = 0 so the coefficient matrix of (12) is block diagonal, the system can be decomposed into K independent linear systems, and the solution d k is a vector with the s-th block component equal to the solution d k s of (9) If the problem is nearly-separable, the number of nonzero elements in B k is small compared to the size N of the matrix and to the number of nonzero elements in H k . Thus the solution of (13) may provide an approximation of the solution of the Levenberg-Marquardt direction (11), with the quality of the approximation depending on the number and magnitude of the nonzero elements in B k . Given that information contained in B k might be relevant and that solving K systems of smaller dimension is much cheaper than solving the system of dimension N, we propose the following modification of the right hand side of (13), which attempts to exploit the information contained in the off-diagonal blocks, while retaining separability of the approximated linear system. The idea underlying the right-hand side correction is analogous to the one proposed in [1,14] for systems of nonlinear equations and distributed optimization problems.
Our goal is to split the LM linear system into separable systems of smaller dimension. Starting from the LM linear system (12) and aiming at a separable system of linear equations, i.e. a system with the matrix H k + k I as in (13), we need to take into account the fact that B k is not zero. Clearly putting B k to the right hand side would be ideal but d k is unknown. Therefore we add B k at the right hand side of (13) as this way we maintain separability and information contained in B k . Intuitively, B k is the best approximation for B k that we have available, so we use it to get a separable system of linear equations. To compensate (at least partially) for the substitution of B k with B k we also add a correction factor k as explained in (14) and further on.
Consider the system where k ∈ ℝ is a correction coefficient that we can choose. Once the right-hand side has been computed, since H k + k I is a block diagonal matrix, (14) can still be decomposed into K independent linear system. That is, the solution d k of (14) is given by The correction coefficient k can be chosen freely at each iteration so far. However we will see that it is of fundamental importance for both the convergence analysis of the method and practical performance. This parameter is further specified in the algorithm we propose and discussed in detail in Subsection 4.3. Let us now give only a rough reasoning behind its introduction. With k we are trying to preserve some information contained in B k in a cheap way and without destroying separability. One possibility is the following choice of k , which ensures that the residual given by the solution of (14) with respect to the exact linear system (12) is minimized. That is, Further details on this choice are presented later on. One can ask why we use a single coefficient k in each iteration, i.e. if it would perhaps be more efficient to try to "correct" the right hand side with more than one parameter, maybe allowing a diagonal matrix k in the right hand side instead of a single scalar. In fact, if we take a diagonal matrix k ∈ ℝ N×N and plug it in the above minimization problem, then solving this problem we could recover the LM iteration exactly. However such procedure would imply the same cost for one iteration as the full LM iteration. Clearly, there are other alternatives between these two extremes -a single number and N numbers, but our experience show that the choice with a single coefficient k brings the best results in terms of cost benefit function. The convergence analysis presented in the next Section will further restrict the values of parameter k . Consider the block diagonal matrix H k defined in (7). Since the s-th diagonal block we have that H k is symmetric and positive semi-definite, and therefore there exists a matrix Let us now state the algorithm. We will assume that the splitting into suitable sets is done before the iterative procedure and it is kept fixed through the process. Thus, the diagonal matrix H k and off-diagonal B k are already defined in each iteration.
The regularization parameter k plays an important role in (14) and consequently in the resolution of our independent problem stated in (15). In the algorithm below we adopt a choice based on the same principles proposed in [11] although other options are possible. The parameter is thus computed using the values defined as for each iteration with k specified in the algorithm below.
The key novelty is in (19) where we compute the search direction. The damping parameter k is defined through steps 1 while the values of k updated at lines 12-16 resemble trust-region approach. Roughly speaking, if the decrease is sufficient the damping parameter in the next iteration is decreased, otherwise we keep k+1 = k . In fact the choice of k will be crucial for the convergence proof, see Lemma 8. The correction parameter k is specified in line 2. Assuming the standard properties of the objective function, see ahead Assumption 2 and 3, one can always choose k such that (18) holds. However, the value of k depend on the norm of off-diagonal blocks. Thus the method essentially exploits the sparse structure we assume in this paper, see Assumption 1. The search direction d k is computed in line 4. Clearly, the system (19) is in fact completely separable and solving it requires solving K independent linear systems of the form (15). The right-hand side correction vector in each system is also stated in (15). Thus, for the parameter k chosen in line 2, to compute d k we need to solve K systems of linear equations each one of dimension n s , and ∑ K s=1 n s = N. These systems can be solved independently, either in parallel or sequentially, and the cost of their solving is in general significantly smaller than the cost of solving N dimensional LM system of linear equations. These savings are meaningful since the direction d k obtained this way is a descent direction as we will show in the convergence analysis. After that we invoke backtracking to fulfill the modified Armijo condition given in (20) and define a new iteration. Modification of the Armijo condition again depends on the norm of off-diagonal blocks as the step size is bounded above by 1∕ k . In the case of B k = 0, i.e. if the system is completely separable we get k = 1 and the classical Armijo condition is recovered. In this case the system (19) is the classical LM system and the algorithm reduces to the classical LM with line search. On the other hand, for B k ≠ 0 the value of ‖B k ‖ fundamentally influences the values of k and k and the algorithm allows non-negligible values of k , k only if ‖B k ‖ is not loo large, i.e. if the problem has a certain level of separability

Convergence analysis
The convergence analysis is divided in 2 parts, in Sect. 4.1. we prove that the algorithm is well defined and globally convergent under a set of standard assumptions, while the local convergence analysis is presented in Sect. 4.2. The choice of k and its influence are discussed in Sect. 4.3.

Global convergence
The following assumptions are regularity assumptions commonly used in LM methods

Assumption 2
The vector of residuals For the rest of this subsection we assume that {x k } is the sequence generated by Algorithm 1 with an arbitrary initial guess x 0 ∈ ℝ N .
The following Lemma, proved in [5], is needed for the convergence analysis.
Lemma 1 [5] If Assumptions A2 and A3 hold, for every x and y in ℝ N we have (19) with k satisfying (18), and that Assumption A2 holds. Then d k is a descent direction for F at x k . Moreover the following inequalities hold

Lemma 2 Assume that d k is computed as in
Proof We want to prove that (g k ) ⊤ d k ≤ 0 for every index k. By definition of d k and using the fact that ‖A −1 ‖ ≥ ‖A‖ −1 for every invertible matrix A, we have Since k > 0 and H k is symmetric and positive semidefinite, we have and Using this two facts and the bound (18) on ‖ k B k ‖ in inequality (22), we get which is part i) of the Lemma. Since b < 1 this also implies that d k is a descent direction at iteration k. By (20) we have that for every iteration index k Replacing (g k ) ⊤ d k with part (i) of the statement we get ii). ◻ Remark 4.1 Lemma 2 states that if the right-hand side correction coefficient k is chosen to satisfy the condition (18), then d k is a descent direction and therefore the backtracking procedure can always find a step size t k such that the Armijo condition (20) is satisfied. In particular this implies that Algorithm 1 is well defined. In Lemma 9 we will also prove that under the current assumptions the step size t k is bounded from below.
Proof Let us introduce the following function from ℝ to ℝ m and let us notice that by Lemma 1 we have that ‖Ψ(t)‖ ≤ L 2 t 2 ‖d k ‖ 2 . Using this bound on Ψ and Lemma 4 we get The thesis follows immediately.
Proof Using the bound to ‖d k ‖ given by Lemma 3, and the fact that t k ≤ min{1, 1∕ k } , we have (28) where q k is the polynomial with coefficients defined in (31). Together with Lemma 5, this implies that for every t ≤ min{1, 1∕ k } we have Since a k 4 < 0 we have that q( ) → −∞ as → +∞ . This implies that if k > * k then q( k ) < 0 and, by the inequality above, we have for every t ≤ min{1, 1∕ k }. Since c ∈ (0, 1) , this implies in particular that min{1, 1∕ k } satisfies Armijo condition (20) and therefore t k = min{1, 1∕ k }. ◻

Lemma 7
If Assumptions A2 and A3 hold and at iteration k we have k ≥ L , then t k = min{1, 1∕ k }.
Proof By the previous Lemma, in order to prove the thesis it is enough to show that when k ≥ L we have k ≥ * k with * k largest root of the polynomial q k defined in (31). Using the Cauchy bound for the roots of polynomials [16], we have that Since a k 4 = − 1−b 2 k and k ≤ 1 + b , we have that for every k Using this inequality, the definition of k , and the fact that â k i ≥ 0 for every i = 0, … , 3 , we have (32) This, together with (35) implies that k ≥ * k which concludes the proof. ◻

Lemma 8
If Assumptions A2 and A3 hold then we have t k = min{1, 1∕ k } for infinitely many values of k.
Proof By Lemma 7 we have that t k = min{1, 1∕ k } whenever k ≥ L. Assume by contradiction that there exists an iteration index k such that for every k ≥k the step size t k is strictly smaller than min{1, 1∕ k }. Since in Algorithm 1 we have k+1 = 2 k whenever t k < min{1, 1∕ k } , this implies that for every k ≥k we have Therefore there exists k ′ ≥k such that k ′ ≥ L which implies t k = min{1, 1∕ k } , contradicting the fact that t k < min{1, 1∕ k } for every k ≥k . ◻ The above Lemma allows us to prove the first global statement below. Namely, we prove that any bounded iterative sequence has at least one accumulation point which is stationary.

Theorem 1 Assume that Assumptions A2, A3 hold and that {x k } is a sequence generated by Algorithm 1 with arbitrary x 0 ∈ ℝ N . If {x k } is bounded, then it has at least one accumulation point that is also a stationary point for F(x).
Proof Since {x k } ⊂ ℝ n is bounded and by Lemma 8 the sequence of step sizes {t k } takes value min{1, 1∕ k } infinitely many times, we can take a subsequence {x k j } ⊂ {x k } such that t k j = min{1, 1∕ k j } for every j and that x k j converges to x as j tends to infinity. By Lemma 2 we have which implies that By definition of k and (18) we have that min{1, 1∕ k } ≤ (1 + b) . Since {x k j } is a compact subset of ℝ N , and R(x) is twice continuously differentiable, we have that the sequences ‖H k j ‖ , ‖R k j ‖ , and ‖J k j ‖ are bounded from above, which by definition of k imply that ‖H k j ‖ + k j is also bounded from above. This, together with (36) implies that ‖g k j ‖ vanishes as j tends to infinity and therefore x is a stationary point of F(x). ◻

3
A split Levenberg-Marquardt method for large-scale sparse…

Lemma 9 If Assumptions A2 and A3 hold and min > 0 then for every index k we have
Proof From inequality (32), since t ≤ min{1, 1∕ k } , we have where q k is defined in (31). Using the inequality above together with Lemma 7 we have

Let us define
We can easily see that if t ≤t k then the term between parentheses of the previous inequality is non-positive and therefore Since in Algorithm 1 we take c ∈ (0, 1) this implies that if t k ≤t k Armijo condition (20) holds, and therefore the accepted stepsize t k satisfies t k ≥t k . By Lemma 8 we have that if k ≥ L then t k = min{1, 1∕ k } ≥ 1∕(b + 1) . Let us consider the case when k < L, which also implies k ≤̄k with Using the definition of ̄k and the fact that ̄k ≥ 1 and k ≤̄k , we have t k ≥ min 8 min Since we are considering the case k < L , we have Using this inequality and (40) in the definition of t k we get which gives us the thesis. ◻ Finally, we can state the global convergence results.

Theorem 2 If Assumptions A2 and A3 hold and min > 0 then every accumulation point of the sequence {x k } is a stationary point of F(x).
Proof Let x be an accumulation point of {x k } and let {x k j } be a subsequence converging to x. By Lemma 2 we have and therefore that By Lemma 9 the sequence {t k } is bounded from below while by continuity of J(x), R(x), H(x) , and of the norm 2, we have that ‖H k j ‖ + k j is bounded from above. This implies

3
A split Levenberg-Marquardt method for large-scale sparse… and thus x is a stationary point of F(x). ◻

Local convergence
Let us now analyze the local convergence. We are going to show that the LMS method generates a linearly converging sequence under a set of suitable assumptions. Notice that the assumptions we use are standard, see [2] plus the sparsity assumption that we already stated. Let S denote the set of all stationary points of Consider a stationary point x * ∈ S and a ball B(x * , r) with radius r > 0 around it. In the rest of the section we make the following assumptions, see [2]. Assumption 4 There exists > 0 such that for every x ∈ B(x * , r) Assumption 5 There exists > 0 such that for every x ∈ B(x * , r) and every z ∈ B(x * , r) ∩ S From now on we denote with k the relative residual of the linear system (11) by the approximate solution d k . That is This residual is already considered in (16), where we briefly mentioned that we will determine k such that this residual is minimized. Further details on this choice are presented in Sect. 4.3 and in this part we will keep k without further specification, i.e., assuming only that it is small enough that local convergence requirements can be fulfilled. Clearly, for the completely separable problems, i.e. B k = 0 we get k = 0 and hence the value of k depends on M stated in Assumption 1 -if M is small enough, i.e., if the problem is nearly-separable to the sufficient degree it is reasonable to expect that the values of k will be small enough with a suitable choice of k . The inequalities in the Lemma below are direct consequences of Assumption 3, their proofs can be found in [2].
Lemma 10 Let Assumption 2 hold. There exist positive constants L 2 , L 3 and L 4 such that for every x, y ∈ D r ,z ∈ B(x * , r) ∩ S the inequalities below hold: From now on, given any point x ∈ ℝ N , we denote with x the point in S that realizes ‖x −x‖ = dist (x, S).

Lemma 11
There exists r > 0 and Proof Let us define H * = H(x * ) . Consider the eigendecomposition of H * = S ⊤ * S * = Q * * Q ⊤ * where * is a diagonal matrix containing the ordered eigenvalues of S ⊤ * S * and Q * is the matrix containing the orthogonal eigenvectors corresponding to the eigenvalues in * . Denoting with p the rank of S ⊤ * S * we have that Consider the eigendecomposition of H k = Q k k Q ⊤ k and consider the partition of Q k and k corresponding to the partition of * ∶ Since R is continuously differentiable on B(x * , r) the entries of S(x) ⊤ S(x) are continuous functions of x and thus the eigenvalues of S(x) ⊤ S(x) are continuous function of x . Therefore, for r small enough we and analogously, Therefore the thesis holds with Proof Since we have By definition of k there holds Replacing these two inequalities in (52) and using Lemma 11 we get the thesis. ◻ (50) S). r∕2) and dist (x k , S) ≤ then Proof By Assumption 5 and Lemma 11 and Therefore, from Lemma 12, since we are assuming dist (x k , S) ≤ , there follows and we get the thesis by definition of . ◻ The above Lemmas allow us to prove the local linear convergence.

Theorem 3
Assume that Assumptions 2-5 hold and that there exists ∈ (0, 1) such that > max c 1 + L 3 max + (2 + c 1 ) and let us define Proof We prove by induction on k that x k ∈ B(x * , r∕2) for every k.
For k = 1 , by Lemma 11 and the definition of , we have Given k ≥ 1 , assume that for every j = 1, … , k − 1 there holds dist (x j , S) ≤ and x j ∈ B(x * , r∕2) . Then we have = − (c 1 max + max L 3 + (2 + c 1 ) ) ‖d j ‖ and the fact that the right-hand side is smaller than r/2 follows again from Lemma 11 and the definition of . So, x k ∈ B(x * , r∕2) for every k and to prove the first part of the thesis it is enough to apply Lemma 13. Therefore, if there exists x = lim x k , then the limit has to belong to S ∩ B(x * , r∕2) , and to prove the second part of the thesis we only need to prove that such a limit exists. For every index k we have that ‖d k ‖ ≤ c 1 k , so given any two indices l, q, we have and {x k } is a Cauchy sequence in ℝ N . So, it is convergent. ◻

Remark 4.2
We notice that the condition in Theorem 3 is analogous to the condition used to prove local linear convergence in [2], namely > (2 + c 1 ) . The two additional terms in the condition in Theorem 3 are a consequence of the main differences between Algorithm 1 and the method considered in [2]. In particular max c 1 depends on the different choice of k , while L 3 max arise from the fact that at each iteration the Levenberg Marquardt system is solved inexactly. We also notice that, recalling the definition of c 1 in Lemma 11, the condition above implies which in turn is analogous to the condition < * n used for the convergence analysis of classical Levenberg-Marquardt method in the case of problems with nonsingular Jacobian and nonzero residual at the solution [5].

Choice of ˇk
The choice of is mentioned several times as a crucial ingredient of the algorithm we consider. Recall that the role of k is to compensate, if possible, information that we disregarded by splitting the original LM system into k separable systems in a computationally efficient way. Furthermore, due to condition (18) k can have non-negligible value only if ‖B k ‖ is not too large, i.e., if the problem is sparse enough and that is enough for global convergence. To obtain local linear convergence we need to make the residual small enough, recall (42). An intuitive approach is to determine k such that the residual given by the solution of (14) with respect to the exact linear system (12) is minimized. That is, we have and the solution of (56) is given by Let us now consider the actual computation of k . To compute the vectors u, v, w we first compute u = B k g k directly, then we find v,ŵ such that (H k + k I)v = u and (H k + k I)ŵ = g k , and finally we take v = B kv , w = B kŵ . Since (H k + k I) is blockdiagonal, (H k + k I)v = u can be decomposed into independent linear systems with coefficient matrices H k s + k I for s = 1, … , K and we can proceed analogously for (H k + k I)ŵ = g k . Moreover, if we solve (H k + k I)v = u by computing a factorization of H k s + k I, then the same factorization can be used to also solve the linear system for ŵ and later to solve (15), so the computation of k is not expensive.
Having k computed as above, for the residual k ( k ) we have If the vector (H k + k I) −1 g k is in the null space of B k , we have that k = 0 and � k ( k )‖ = 0 , so in this case the direction d k is equal to the Levenberg-Marquardt direction. If g k is in the kernel of B k , then the residual ‖ ( k )‖ is equal to ‖B k (H k + k I) −1 g k ‖ 2 for any choice of the parameter . If neither (H k + k I) −1 g k nor g k are in the null space of B k , then the optimal k (58) is nonzero and so the righthand side correction is effective in reducing the residual in the linear system. In general we have that so k in (42) is bounded from above by ‖B k ‖‖(H k + k I) −1 ‖. Taking into account Assumption 1, the definition of k and the fact that this implies in particular From the inequalities above, we have the dependence of the relative residual k on the norm of the matrix ‖B k ‖ , i.e., on the constant M which measures the importance of the part that we disregard when approximating the Levenberg-Marquardt system with a block diagonal one. We can also notice that the residual is smaller for larger values of the damping parameter k .
In this section we present the results of a set of numerical experiments carried out to investigate the performance of the proposed method, compare it with classical Levenberg-Marquardt method and analyze the effectiveness of the righthand side correction. For all the tests presented here we consider the case of Network Adjustment problems [17], briefly described in the Subsection 5.1. The LMS method is defined assuming that we can take advantage of the sparsity by suitable partition of variables and residuals and that we are able to apply the efficient right-hand-side correction as described in the Subsection 4.3, i.e., computing k as in (58).

Least squares network adjustment problem
Consider a set of points {P 1 , … , P n } in ℝ 2 with unknown coordinates, and assume that a set of observations of geometrical quantities involving the points are available. Least Squares adjustments consists into using the available measurements to find accurate coordinates of the points, by minimizing the residual with respect to the given observations in the least squares sense. We consider here network adjustment problems with three kinds of observations: point-point distance, angle formed by three points and point-line distance.
In order to be able to consider suitable increasing sizes, the problems are generated artificially, taking into account the information about average connectivity and structure of the network obtained from the analysis of real cadastral networks. The problems are generated as follows. Given the number of points n we take {P 1 , … , P n } by uniformly sampling 25% of the points on a regular 2 √ n × 2 √ n grid and we generate observations of the three kinds mentioned above until the average degree of the points is equal to 6. Each observation is generated by randomly selecting the points involved and generating a random number with Gaussian distribution with mean equal to the true measurement and given standard deviation. We use a standard deviation equal to 0.01 and 1 degree for distance and angle observations respectively. For all points we also add coordinates observations: for 1% of the points we use standard deviation 0.01, while for the remaining 99% we use standard deviation 1.
The optimization problem is defined as a weighted least squares problem with r j (x) = w −1 jr j (x) , where r j is the residual function of the j-th observation and w j is the corresponding standard deviation.
In Fig. 1 we present the spyplot of the matrix J ⊤ J for a problem of size 35,000.

Comparison with Levenberg-Marquardt method
In all the tests that follow we use a Python implementation of Algorithm LMS and classical LM method, and PyPardiso [9] to solve the sparse linear systems that arises at each iteration. All the tests were performed on a computer with Intel(R) Core(TM) i7-1165G7 processor @ 2.80GHz and 16.0 GB of RAM running Windows 10. All the methods that we consider have the same iteration structure. The main difference is the fact that while in LM method the linear system is solved directly using PyPardiso, in LMS we first perform the splitting and then use the same PyPardiso function to solve the resulting linear systems, therefore the comparisons in terms of time that we present are meaningful. The partition of variables and residuals into sets E s , s = 1, … , K is assumed to be given before application of LMS algorithm. To compute the partitioning of the variables, we use METIS [12] which, given a network and an integer K > 1 finds a partition of the vertices of the network into K subsets of similar sizes, that approximately minimizes the number of edges between nodes in different subsets. The partition is computed by METIS in a multilevel fashion. Starting from a coarse representation of the graph, an initial partition is computed, projected onto a denser representation of the network and then refined. This process is repeated on a sequence of progressively more dense networks, up until the original graph. In all the tests that we considered, the time needed to compute the partitioning is negligible with respect to the overall computation time. This is in part due to the fact that the partitioning is computed only once at the beginning of the procedure and not repeated at each iteration.
We now consider a set of problems of increasing size and we solve each problem with the LMS method and correction coefficient k computed as in (58). The problems are also solved with LM method. We consider problems with size between 20,000 and 120,000 and we plot the time taken by the two methods to reach termination. Both methods use as initial guess the coordinate observations available in the problem description and they stop when at least 68%, 95%, 99.5% of the residuals is smaller than 1, 2 and 3 times the standard deviation respectively. The obtained results are in the first plot of Fig. 2. To give a better comparison, in the second plot we extend the size of the problems solved with the proposed method up to 1 million variables. Clearly, LM method could not cope with such large problems (in our testing environment) while LMS successfully solved problems of increasing dimensions up to final value of 1 million variables. In Fig. 3 we have the log-log plot of the time  necessary to solve each problem, compared with different rates of growth. For the method with K > 1 , a small number of values of the parameter K was tested and the best one was selected to perform the comparison. The value K used at each dimension is reported in Fig. 8.
From Fig. 2 one can see that the LMS method with K > 1 results in a significant reduction of the time necessary to reach the desired accuracy, compared to Levenberg-Marquardt method. Moreover, from the second plot of Fig. 2 and from Fig. 3 we can notice that, on the problems that we considered, the time taken by the proposed method grows approximately as n 1.3 , which suggests the fact that the method discussed in this paper would be suitable for problems of very large dimensions.
To better understand the behaviour of the method, in Fig. 4 we plot the mentioned percentages and the value of the relative residual F k ∕F 0 at each iteration, for a problem of size N = 10 5 and K = 15. For the same problem, in Fig. 5 we plot the  distribution plot of the coordinate error with respect to the true solution, for the initial guess and the estimated solution (left and right-hand plot, respectively).

Influence of the parameters K and ˇk
Let us now study how the number of subproblems K influences the performance of the method. We consider two problems with 100,000 and 200,000 variables respectively and we solve them with the proposed algorithm for a set of increasing values of K. For each considered K we plot in Fig. 6 the time taken by the method to arrive at the desired accuracy. The initial guess and the stopping criterion are defined as in the previous test.
One can notice that the time decreases as K increases up to an optimal value ( K = 15 for the first problem and K = 30 for the second one) after which the time starts to increase again. The reason behind this behavior is that larger values of K yield smaller linear system and therefore cheaper iterations, but also less accurate search direction d k resulting in a larger number of iterations necessary to achieve the desired accuracy. For large values of K the increase in the number of iterations outweights the saving in the solution of the linear system and the overall computation cost increases. Finally we can notice that, despite the existence of an optimal value of the parameter K, it appears from this test that there exists an interval of values for which the cost of the method is comparable. This suggests that fine-tuning of the parameter K is not necessary and that, given a problem, choosing K according to the number of variables should be enough to achieve good performance.
To see that the proposed right-hand side correction improves the performance of the method, we repeat the test presented in Subsection 5.2 for N = 10 6 , but the comparison is here carried out with the case k = 0 that is, when the linear system is approximated as in (13) but no right-hand side correction is applied.
For both methods, a few different values of the parameter K were tested. In Fig. 7 we report the time needed by the two methods to satisfy the convergence criterion, for the best K among the considered values. In figure 8 we plot for each method and each size the value of the parameter K corresponding to the timings in Fig. 7.
We can see that applying the proposed right-hand side correction effectively reduce the time necessary to satisfy the stopping condition. From Fig. 7 one can notice that the optimal K for the method with right-hand side correction is Fig. 6 Time to compute a solution for optimal k and different values of K, N = 10 5 and N = 2 ⋅ 10 5 generally higher than the method without correction. These two results together suggests that the method with right-hand side correction is able to achieve a better performance because it allows the set of variables to be partitioned into smaller subsets, which implies a faster computation of the direction at each iteration, before incurring into a decrease in the performance due to the additional number of iterations necessary to reach the desired accuracy.

Conclusions
We presented a method of inexact Levenberg-Marquardt type for sparse problems of very large dimension. Assuming that the problem is nearly separable, i.e., sufficiently sparse such that each component of the residual depends only on few variables, the proposed methods is defined through splitting into a set of K independent systems of equations of smaller dimension. The decoupling is done taking into account dense diagonal blocks of LM system and disregarding hopefully very sparse off-diagonal blocks. To compensate for disregarded off-diagonal block we introduced a correction on the right-hand side of the system in such way that decoupling is maintained but information contained in the off-diagonal matrix is preserved in a computationally affordable way, using a single parameter that can be computed in the same fashion -solving a sequence of small dimensional systems of linear equations. The key idea is that solving K systems of smaller dimensions, that can be done sequentially or in parallel, is significantly cheaper than solving a large system of linear equations even if the system is sparse.
The presented algorithm is globally convergent under the set of standard assumptions for a suitable choice of regularization parameter in LM system. In fact the global convergence does not rely on separability assumption at all as one can show that the direction computed by decoupled sequence of LM systems is descent direction. To achieve global convergence we rely on line-search and regularization parameter update by a trust-region like scheme, similarly to [11]. Local linear convergence is proved under the standard conditions and assuming that the residual of linear system is small enough in each iteration. Hence, the near-separability assumption plays a role in local convergence. To achieve small residuals for the decoupled problem we rely heavily on the right-hand side correction and discuss the optimal choice of parameter that is employed in the correction. Theoretical considerations are supported by numerical examples. We consider the network adjustment problem on simulated data, inspired by a real-world problem of cadaster maps, of growing size and with the proposed method solve problems of up to one million of variables. Comparison with the classical LM is presented and it is shown that the proposed method is significantly faster and able to cope with large dimensions. The experiments reported in this paper are done in sequential way while the parallel implementation will be a subject of further research.