1 Introduction

Our interest lies in efficient and robust methods for solving large-scale linear least squares problems with linear equality constraints. We assume that \(A \in {{\mathbb {R}}}^{m \times n}\) and \(C \in {{\mathbb {R}}}^{p \times n}\), with \(m > n \gg p\). We further assume that A is large and sparse and C represents a few, possibly dense, linear constraints. Given \(b \in {{\mathbb {R}}}^{m}\) and \(d \in {{\mathbb {R}}}^{p}\), the least squares problem with equality constraints (the LSE problem) is

$$\begin{aligned}&\min _{x\in {{\mathbb {R}}}^{n}} \left\| A\, x -b \right\| ^2_2 \end{aligned}$$
(1.1)
$$\begin{aligned}&\text{ s.t. } \;\; C\, x = d. \end{aligned}$$
(1.2)

A solution exists if and only if (1.2) is consistent. For simplicity, we assume that C has full row rank (although the proposed approaches can be made more general). In this case, (1.2) is consistent for any d. A solution to the LSE problem (1.1)–(1.2) is unique if and only if \({{{\mathscr {N}}}}(A) \cap {{{\mathscr {N}}}}(C) =\{0\}\), where for any matrix B, \({{{\mathscr {N}}}}(B)\) denotes its null space. This is equivalent to the extended matrix

$$\begin{aligned} {{{\mathscr {A}}}} = \begin{pmatrix} A \\ C \end{pmatrix} \end{aligned}$$
(1.3)

having full column rank. In the case of non-uniqueness, there is a unique minimum-norm solution.

LSE problems arise in a variety of practical applications, including scattered data approximation [13], fitting curves to data [16], surface fitting problems [35], real-time signal processing, and control and communication leading to recursive problems [50], as well as nonlinear least squares problems and least squares problems with inequality constraints. For example, in fitting curves to data, equality constraints may arise from the need to interpolate some data or from a requirement for adjacent fitted curves to match with continuity of the curves and possibly of some derivatives. Motivations for LSE problems together with solution strategies are summarized in the research monographs [7, 8, 29].

Classical approaches for solving LSE problems derive an equivalent unconstrained linear least squares (LS) problem of lower dimension. There are two standard ways to perform this reduction: the null-space approach [21, 29] and the method of direct elimination [10], both of which, with suitable implementation, offer good numerical stability. These methods, termed constraint substitution methods, consider the constraints (1.2) as the primary data and substitute from them into the LS problem (1.1). The former performs a substitution using a null-space basis of C obtained from a QR factorization, while the latter is based on substituting an expression for selected solution components from the constraints into (1.1). This can be done using a QR factorization of C [7, 10]. If there are a large number of constraints, a pivoted LU factorization might also be an option [31]. Other solution methods, which may be regarded as complementary to the constraint substitution approaches, reverse the direction of the substitution, substituting from the LS problem into the constraints. These methods involve the use of an augmented system and include a Lagrange multiplier formulation [22], updating procedures that force the constraints to be satisfied a posteriori [6, 7], and a weighting approach [5, 36, 46].

Solving large-scale LS problems is typically much harder than solving systems of linear algebraic equations, in part because key issues such as ill-conditioning or dense structures within an otherwise sparse problem can vary significantly between different problem classes. Consequently, we do not expect that there will be a single method that is optimal for all LSE problems, and having a range of approaches available that target different problems is important. Our main objective is to revisit classical solution strategies and to propose new ideas and modifications that enable large-scale systems to be solved, with an emphasis first on the possibility that the constraints may be dense, and second on requiring that the constraints be tightly satisfied. In Sects. 2 and 3, we consider the null-space method and the direct elimination approach, respectively. We review the methods and show how they can be used for large-scale problems. In Sect. 4, we present complementary solution approaches within an augmented system framework. This allows us to treat the constraints and the least squares part of the problem using a single extended system of equations or via a global updating scheme. Both direct and iterative methods are discussed.

Much of the published literature related to LSE problems lacks numerical results. For instance, Björck [6] remarks “no attempt has yet been made to implement the (general updating LSE) algorithm”, and, as far as we are aware, this remains the case. We suspect this is because implementing the algorithms is far from straightforward. While it is not our intention here to offer a full general comparison of the different approaches, throughout our study we use numerical experiments on problems arising from real applications to highlight key features that may make a method attractive (or unsuitable) for particular problems and to illustrate the effectiveness of the different approaches. Our key findings and recommendations are summarized in Sect. 5.

We end this introduction by describing our test environment. The test matrices are taken from the SuiteSparse Matrix Collection [15] and comprise a subset of those used by Gould and Scott in their study of numerical methods for solving large-scale LS problems [20]. If necessary, the matrix is transposed to give an overdetermined system. Basic information on our test set is given in Table 1.

Table 1 Statistics for our test set. m, n and \(nnz({{\mathscr {A}}})\) are, respectively, the row and column counts and the number of entries in the matrix \({{{\mathscr {A}}}}\) given by (1.3). dratio is the ratio of the nonzero counts of the densest row to the sparsest row of \({{{\mathscr {A}}}}\). \(^\dag \) indicates at least one column was removed to ensure there are no null columns in A

The problems in the top half of the table contain rows that are identified as dense by Algorithm 1 of [42] (with the density parameter set to 0.05). These rows are taken to form the constraint matrix C and all other rows form A. For the other problems, we form A by removing the 20 densest rows of the SuiteSparse matrix; some or all of these rows are used to form C (and the rest are discarded). Table 1 reports data for \(p=5\) and 20 (denoted, for example, by deter_5 and deter_20, respectively). Although the densest rows are not necessarily very dense, we make this choice because it corresponds to the typical situation in which the constraints couple many of the solution components together. For some of our test examples, splitting the supplied matrix into a sparse part and a dense part results in the sparse part A containing a small number of null columns (at most 7 such columns for our test examples). For the purpose of our experiments, we remove the corresponding columns from the extended matrix (1.3) (the data in Table 1 is for the modified problem). In all our tests, we check that the norms of the computed solution x and least squares residual \(r = b- A\,x\) are consistent with the values given in Table 1.

In our experiments, we prescale the extended matrix \({\mathscr {A}}\) given by (1.3) by normalizing each of its columns. That is, we replace \({\mathscr {A}}\) by \({{\mathscr {A}}}{\mathscr {D}}\), where \({\mathscr {D}}\) is the diagonal matrix with entries \({\mathscr {D}}_{ii}\) satisfying \({\mathscr {D}}_{ii} =1/ \Vert {\mathscr {A}}e_i\Vert _2 \) (\(e_i\) denotes the i-th unit vector). The entries of \({{\mathscr {A}}}{\mathscr {D}}\) are at most one in absolute value. The vectors b and d are set to be vectors of 1’s (so that \(\Vert b\Vert _2\) and \(\Vert d\Vert _2\) are O(1)).
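For concreteness, this prescaling can be expressed in a few lines of Python/SciPy (the function name is ours and the sketch assumes the null columns mentioned above have already been removed, so that no column norm is zero):

import numpy as np
import scipy.sparse as sp

def prescale(A_ext):
    # Replace the extended matrix by A_ext * D, where D_ii = 1 / || column i of A_ext ||_2.
    A_ext = sp.csc_matrix(A_ext)
    col_norms = np.sqrt(np.asarray(A_ext.multiply(A_ext).sum(axis=0))).ravel()
    D = sp.diags(1.0 / col_norms)
    return A_ext @ D, D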

For the substitution approaches described in Sects. 2 and 3, we have developed prototype Fortran codes; in Sect. 4, the augmented system methods are implemented using the SuiteSparseQR package of Davis [14] and Fortran software from the HSL mathematical software library [26]. The prototype codes are not optimised for efficiency and so computational times are not reported. Developing library quality implementations is far from trivial and is outside the scope of the current study, which focuses rather on determining which approaches are sufficiently promising for sophisticated implementations to be considered in the future.

Notation All norms are 2-norms and in the rest of the paper, to simplify the notation, \(\Vert .\Vert _2\) is denoted by \(\Vert .\Vert \). I is used to denote the identity matrix of appropriate dimension. The entries of any matrix B are \((B)_{i,j}\) and its columns are denoted by \(b_1,b_2, \ldots \). The null space of B is \({{{\mathscr {N}}}}(B)\) and Z is used to denote a matrix whose columns form a basis for the null space (i.e., Z satisfies \(B\,Z = 0\)). Permutation matrices are denoted by P (possibly with a subscript). The normal matrix for (1.1) is \(H = A^T\,A\).

2 The null-space approach

The null-space approach is a standard technique for solving least squares problems. It is based on constructing a matrix \(Z \in {{\mathbb {R}}}^{n \times (n-p)}\) such that its columns form a basis for \({{{\mathscr {N}}}}(C)\). Any \(x \in {{\mathbb {R}}}^{n}\) satisfying the constraints can be written in the form

$$\begin{aligned} x=x_1 + Z\, x_2, \end{aligned}$$
(2.1)

where \(x_1 \in {{\mathbb {R}}}^{n}\) is a particular solution of the underdetermined system \( C\, x_1 = d.\) The minimum-norm solution can be obtained from the QR factorization of C, that is, \(C\,P_C = Q_C \begin{pmatrix}R_C&0 \end{pmatrix}\), where the permutation \(P_C \in {{\mathbb {R}}}^{n \times n}\) represents the pivoting, \(R_C \in {{\mathbb {R}}}^{p \times p}\) is an upper triangular matrix and \(Q_C \in {{\mathbb {R}}}^{p \times p}\) is an orthogonal matrix. \(x_1\) is then given by

$$\begin{aligned} x_1 = P_C \begin{pmatrix} R_C^{-1}Q_C^T\, d \\ 0 \end{pmatrix}. \end{aligned}$$

Substituting (2.1) into (1.1) gives the transformed LS problem

$$\begin{aligned} \min _{x_2} \left\| A \,Z\, x_2 - (b -A\, x_1) \right\| ^2. \end{aligned}$$
(2.2)

The method is summarized as Algorithm 1.

Algorithm 1 (Null-space method for the LSE problem). Step 1: compute a matrix Z whose columns form a basis for \({{{\mathscr {N}}}}(C)\). Step 2: compute a particular solution \(x_1\) of \(C\, x_1 = d\). Step 3: solve the reduced problem (2.2) for \(x_2\), for example via the normal equations \(Z^TH\,Z\, x_2 = (A\,Z)^T (b- A\, x_1)\). Step 4: set \(x=x_1 + Z\, x_2\).
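To make the structure of Algorithm 1 concrete, the following dense Python/SciPy sketch (the function name and the use of null_space and lstsq are ours, purely for illustration; for large sparse problems the basis Z would instead be constructed as described below) carries out the four steps:

import numpy as np
from scipy.linalg import null_space

def lse_nullspace(A, C, b, d):
    # Dense prototype of the null-space method (Algorithm 1).
    Z = null_space(C)                                        # columns span N(C), so C @ Z = 0
    x1, *_ = np.linalg.lstsq(C, d, rcond=None)               # minimum-norm particular solution of C x1 = d
    x2, *_ = np.linalg.lstsq(A @ Z, b - A @ x1, rcond=None)  # reduced LS problem (2.2)
    return x1 + Z @ x2                                       # solution (2.1)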

In the 1970s, the null-space method was developed and discussed by a number of authors, including in relation to quadratic programming [21, 29, 39, 45]. These and subsequent contributions formulate the approach via the orthogonal null-space basis obtained, for example, from the QR factorization of \(C^T\) given by

$$\begin{aligned} C^T = Q \begin{pmatrix} R \\ 0 \end{pmatrix}, \end{aligned}$$

where \(Q \in {{\mathbb {R}}}^{n \times n}\) is an orthogonal matrix. Z is equal to the last \(n-p\) columns of Q and consequently is dense. Note that although it is possible to store Q implicitly using, for example, Householder transformations, the memory demands and implied operation counts are generally too high. Our interest is in large LS problems and therefore it may not be practical to solve the \((n-p) \times (n-p)\) system in Step 3 if Z is dense. To make the approach feasible for large problems we can exploit our recent work [44] on constructing sparse null-space bases of “wide” matrices such as C that have many more columns than rows and may include some dense rows.

Scott and Tůma [44] propose a number of ways to construct sparse Z. In our experiments, we employ Algorithm 3 from Section 3 of [44]. This algorithm first computes a QR factorization of C with column pivoting. The chosen pivots correspond to p columns of R. Then each of the remaining \(n-p\) columns of C induces a column \(z \in Z\) that is computed independently of the other columns as follows. While in the trivial case of a zero column the corresponding z contains a single nonzero entry, for any nonzero column \(c \in C\) a linearly dependent set involving other columns of C is constructed. The smallest such set is called a circuit; circuits play an important role in the problem of the sparsest null-space basis [12]. The coefficients of the linear combination of c and other columns of C that sum to zero are the row entries of the column \(z \in Z\) corresponding to c. The linearly dependent sets are found using a partial pivoted QR factorization of C (with at most p steps) that involves the column c. To obtain Z with a narrow bandwidth so that \(Z^TH\,Z\) is sparse when H is sparse, a pivoting threshold \(\theta \in [0,1]\) is employed in these partial QR factorizations. The role of \(\theta \) is to balance the locality of the dependent sets (combining columns of C whose indices are close to c) with the stability of a QR factorization with column pivoting (which maximizes the absolute values of the diagonal entries of R). Small values of \(\theta \) result in Z having a narrow bandwidth.
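To illustrate the principle that each non-pivot column contributes one column of Z whose entries are the coefficients of such a linear combination, the following Python sketch builds a null-space basis from a single global pivoted QR factorization of C. This is a simplified stand-in for Algorithm 3 of [44]: it does not use the per-column partial factorizations or the threshold \(\theta \), so the resulting Z is generally not banded.

import numpy as np
from scipy.linalg import qr, solve_triangular

def simple_nullspace_basis(C):
    # Assumes C has full row rank p.
    p, n = C.shape
    Q, R, perm = qr(C, pivoting=True)        # C[:, perm] = Q [R1 R2]
    R1, R2 = R[:, :p], R[:, p:]
    T = solve_triangular(R1, R2)             # coefficients expressing each non-pivot column via the pivot columns
    Z = np.zeros((n, n - p))
    Z[perm[:p], :] = -T
    Z[perm[p:], :] = np.eye(n - p)
    return Z                                  # satisfies C @ Z = 0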

Fig. 1 The number of entries in \(Z^TH\,Z\) (left) and the constraints residual \(\left\| r_c\right\| \) (right) for problems deter3 (top) and gemat1 (bottom) as the threshold pivoting parameter \(\theta \) used in the computation of the null-space basis increases from 0.1 to 1. The four curves correspond to \(p=2\) (black dotted line), 5 (blue full line), 10 (red dashed line) and 20 (green dash-dotted line). Observe that using a small \(\theta \) can improve the sparsity of \(Z^TH\,Z\). For deter3, the constraint residuals are small for all the tested \(\theta \), and for gemat1, they are satisfactory for \(\theta > 0.2\) (colour figure online)

Our first results are for problems deter3 and gemat1. As discussed in the Introduction, we form the constraint matrix C by taking the \(p = 2\), 5, 10, 20 densest rows of \({{\mathscr {A}}}\). The sparse block A is the same for each case. In Fig. 1, we plot the number of entries \(nnz(Z^TH\,Z)\) in \(Z^TH\,Z\) and the norm of the constraints residual \(\left\| r_c\right\| = \left\| d - C \,x \right\| \). As expected, \(nnz(Z^TH\,Z)\) increases with \(\theta \), and this increase grows with p. This is illustrated further by the results in Table 2. We see that, independently of the choice of \(\theta \), for some problems (including lp_fit2p and sctap1-2r) the constraints are not tightly satisfied. This demonstrates an inherent limitation of the null-space approach of [44]: it constructs the columns of Z so as to keep \(Z^TH\,Z\) sparse, but the resulting Z does not have orthogonal columns.

Table 2 The density of \(Z^TH\,Z\) (that is, \(nnz(Z^TH\,Z)/(n-p)^2\)) and constraint residual \(\left\| r_c \right\| \) for two values of the threshold pivoting parameter \(\theta \) used in the computation of the null-space basis. \(\ddag \) indicates insufficient memory for the sparse direct solver HSL_MA87

The matrix \(Z^TH\,Z\) in Step 3 of Algorithm 1 is symmetric positive definite. In the above experiments, we employ the sparse direct solver HSL_MA87 [23] (combined with an approximate minimum degree ordering). However, for large problems, the memory demands mean it may not be possible to use a direct method; this is illustrated by problem south31 with \(\theta = 1\). If a preconditioned iterative solver is used instead, the solver memory requirements are much lower, explicitly forming the potentially ill-conditioned normal matrix H can be avoided and, because Z only needs to be applied implicitly, the need for sparsity can be relaxed. Currently, finding a good preconditioner for use in this case remains an open problem [32].

If a sequence of LSE problems is to be solved with the same set of constraints but different A, the null-space basis can be reused, substantially reducing the work required. But if the constraints are changed, then Z will also change. In [44], we present a strategy that allows Z to be updated when a row (or block of rows) is added to C.

3 The method of direct elimination

The second method we look at is direct elimination [29]. The basic idea is to express the dependency of p selected components of the vector x on the remaining \(n-p\) components and to substitute this into the LS problem (1.1). Here we propose how to choose the p components so as to retain sparsity in the transformed problem.

Consider the constraints (1.2). The method starts by permuting and splitting the solution components as follows:

$$\begin{aligned} C\,x = C\,P_c \,y = \begin{pmatrix} C_1&C_2 \end{pmatrix} \begin{pmatrix} y_{1} \\ y_{2} \end{pmatrix} = d, \end{aligned}$$

where \(P_c\in {{\mathbb {R}}}^{n \times n} \) is a permutation matrix chosen so that \(C_1 \in {{\mathbb {R}}}^{p \times p}\) is nonsingular. Let \(A\,P_c = \begin{pmatrix} A_1&A_2 \end{pmatrix}\) be a conformal partitioning of \(AP_c\). Substituting the expression

$$\begin{aligned} y_1 = C_1^{-1}(d - C_2\,y_2) \end{aligned}$$
(3.1)

into (1.1) gives the transformed LS problem

$$\begin{aligned} \min _{y_2} \left\| A_T y_2 -(b - A_1\,C_1^{-1}d)\right\| ^2, \end{aligned}$$
(3.2)

with the transformed matrix

$$\begin{aligned} A_T = A_2 - A_1\,C_1^{-1}C_2 \in {{\mathbb {R}}}^{m \times (n-p)}. \end{aligned}$$
(3.3)

Note that if \(C_1\) is irreducible, the transformation combines all the rows of \(C_2\). If C is composed of dense rows then \(A_T\) has more dense rows than A. We thus seek to introduce into \(A_T\) as few rows as possible that replicate the (possibly) dense pattern of C. If both A and C are sparse, the substitution leads to a sparse LS problem. We have the following straightforward result.

Lemma 3.1

Let \(A \in {{\mathbb {R}}}^{m \times n}\) be sparse. Let \(m>n > p\) and assume a conformal column splitting induced by the permutation \(P_c\) is such that \(CP_c = \begin{pmatrix}C_1 \,&C_2\end{pmatrix}\) and \(A\,P_c = \begin{pmatrix}A_1\,&A_2\end{pmatrix}\) with \(C_1 \in {{\mathbb {R}}}^{p \times p}\) nonsingular and \(A_1 \in {{\mathbb {R}}}^{m \times p}\). Define the index set

$$\begin{aligned} Occupied =\{i \; | \; (A_1)_{i,k} \ne 0 \text{ for } \text{ some } \text{ k, } 1 \le k \le p \}. \end{aligned}$$

Then the number of dense rows in \(A_T\) given by (3.3) is at most the number of entries in Occupied.

Proof

The result follows directly from the transformation. Assuming the rows of \(C_1^{-1}C_2\) are dense, the substitution step (3.1) of the direct elimination implies a dense row k in \(A_T\) only if there is a nonzero in the k-th row of \(A_1\). \(\square \)

A simple example is given in Fig. 2. Here we ignore cancellation of nonzeros during arithmetic operations. We see that the pattern of \(A_T\) satisfies Lemma 3.1. Note that, although in this example \(C_1^{-1}C_2\) is shown as dense, it need not be fully dense and the number of entries in Occupied represents an upper bound on the number of dense rows in \(A_T\).

Fig. 2 Example of the transformation in the direct elimination approach. Here \(m = 9\), \(p=3\), \(n=7\). The depicted matrices (from the left) represent the transformation \(A_T = A_2 - A_1\,C_1^{-1}C_2\). The matrix \(C_1^{-1}C_2 \in {{\mathbb {R}}}^{p \times (n-p)} \) is depicted as fully dense
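A dense Python sketch of the direct elimination transformation and back-substitution is given below. For simplicity, the column permutation here comes from a standard pivoted QR factorization of C alone; Algorithm 2 below additionally takes the sparsity pattern of A into account when choosing the pivot columns.

import numpy as np
from scipy.linalg import qr, lstsq

def lse_direct_elimination(A, C, b, d):
    p, n = C.shape
    _, _, perm = qr(C, pivoting=True)            # choose p well-conditioned pivot columns of C
    C1, C2 = C[:, perm[:p]], C[:, perm[p:]]
    A1, A2 = A[:, perm[:p]], A[:, perm[p:]]
    C1_inv_C2 = np.linalg.solve(C1, C2)
    C1_inv_d = np.linalg.solve(C1, d)
    A_T = A2 - A1 @ C1_inv_C2                    # transformed matrix (3.3)
    y2, *_ = lstsq(A_T, b - A1 @ C1_inv_d)       # transformed LS problem (3.2)
    y1 = C1_inv_d - C1_inv_C2 @ y2               # back-substitution (3.1)
    x = np.empty(n)
    x[perm[:p]], x[perm[p:]] = y1, y2            # undo the permutation: x = P_c y
    return x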

Lemma 3.1 implies that the LSE problem is transformed to a LS problem (3.2) that has some dense rows, which we refer to as a sparse-dense LS problem. Consequently, existing methods for sparse-dense LS problems can be used, including those recently proposed in [40, 41, 43] (see also the recent direct LS solver HSL_MA85 from the HSL library). A straightforward algorithmic implication of the lemma is that the permuting and splitting of C cannot be separated from considering the sparsity pattern of A because the splitting also determines \(A_1\) and \(A_2\). Thus we want to permute the columns of C to allow a sufficiently well-conditioned factorization of \(C_1\) while limiting the number of entries in Occupied and hence the number of dense rows in \(A_T\). The approach outlined in Algorithm 2 is one way to achieve this. There is an important difference between the pivoting used in Algorithm 3 of [44] (which we used in the previous section) and that of Algorithm 2 below. The former modifies the column pivoting that is considered as standard for QR factorizations by employing a threshold parameter \(\theta \) that ensures Z is banded and the transformed normal matrix \(Z^TH\,Z\) retains sparsity. The choice of \(\theta \) aims to balance the stability of the factorization with the sparsity of Z. The threshold parameter \(\tau \in (0,1]\) used in Algorithm 2 also guarantees the pivots in the QR factorization of C are sufficiently large but the selection of the candidate pivots is balanced with limiting the fill-in in the transformed matrix \(A_T\). A crucial role is played by the set of rows held in Occupied that potentially cause fill-in in \(A_T\). The use of different notation for the threshold parameters emphasises the difference between the two QR-based approaches and the distinct roles of the two thresholds.

Algorithm 2 (Selection of the pivot columns of C for the direct elimination approach). A QR factorization of C with threshold column pivoting (threshold parameter \(\tau \)) is computed; the squared column norms \(w_i\) determine the candidate pivots at each step, and the pivot is chosen from among the candidates so as to limit the growth of the set Occupied, and hence the number of dense rows in \(A_T\).

Observe that the pivoting strategy in Algorithm 2 considers C and A simultaneously and will not select a column as the pivot column if this column in A is dense (as it would lead to \(A_T\) being dense). While we do not discuss the implementation details, we remark that care is needed to ensure efficiency. For example, the QR factorization with pivoting of a wide matrix is relatively cheap but it may be necessary to store the squares of the column norms using a heap, which is why we emphasize their role in the algorithm by using the explicit notation \(w_i\) for these norms.
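The following Python sketch illustrates the flavour of this pivot selection. It is our own simplified, dense re-creation of the idea, not Algorithm 2 itself: the remaining columns are orthogonalised explicitly rather than via Householder updates, and no heap is used for the norms \(w_i\).

import numpy as np

def choose_pivot_columns(A, C, tau=0.5):
    # At each of the p steps, among the columns whose updated squared norm w_i is at
    # least tau times the largest, pick the one adding fewest new rows to Occupied
    # (the rows of A touched by the pivot columns chosen so far).
    p, n = C.shape
    W = C.astype(float).copy()
    occupied = np.zeros(A.shape[0], dtype=bool)
    free = list(range(n))
    pivots = []
    for _ in range(p):
        w = np.linalg.norm(W[:, free], axis=0) ** 2
        candidates = [j for j, wj in zip(free, w) if wj >= tau * w.max()]
        new_rows = [np.count_nonzero((A[:, j] != 0) & ~occupied) for j in candidates]
        k = candidates[int(np.argmin(new_rows))]
        pivots.append(k)
        occupied |= (A[:, k] != 0)
        free.remove(k)
        q = W[:, k] / np.linalg.norm(W[:, k])    # eliminate the chosen pivot direction
        W[:, free] -= np.outer(q, q @ W[:, free])
    return pivots, occupied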

The effects of increasing the pivoting parameter \(\tau \) on the number of dense rows in \(A_T\) are illustrated in Fig. 3 for problems deter3 and gemat1; results for the full test set are given in Table 3. The dense rows of the transformed matrix \(A_T\) are determined using Algorithm 1 of [42] and to solve the transformed LS problem (3.2) we use the sparse-dense preconditioned iterative approach of [40].

Fig. 3 The number of dense rows in the transformed matrix \(A_T\) as the parameter \(\tau \) increases from 0.05 to 1 for problems deter3 (left) and gemat1 (right). The four curves correspond to \(p=2\) (black dotted line), 5 (blue full line), 10 (red dashed line) and 20 (green dash-dotted line) (colour figure online)

Table 3 The number (ndense) of dense rows in \(A_T\) and norm of the constraints residual \(\left\| r_c \right\| \) for two values of the pivoting parameter \(\tau \)

This computes a Cholesky factorization of the normal matrix corresponding to the sparse part of \(A_T\) and uses it as a preconditioner within a conjugate gradient (CG) method; the CG convergence tolerance that measures the relative decrease of the transformed residual \(\Vert A_T^T\,r\Vert /\Vert r\Vert \) is set to \(10^{-11}\). For the problems in the top half of the table for which the rows of C are much denser than those of A (recall Table 1), reducing \(\tau \) leads to only a small reduction in the number ndense of dense rows in \(A_T\). However, when the constraints are not dense (the problems in the lower half of the table), ndense can be significantly decreased by choosing \(\tau < 1\), although if \(\tau \) is too small, the matrix \(C_1\) computed by Algorithm 2 can become highly ill-conditioned and \(A_T\) close to being singular. In our experiments we occasionally observed this for \(\tau < 10^{-5}\).

By comparing the pairs of problems in the lower half of the table (such as deter3_20 and deter3_5) and considering the plots in Fig. 3, we see that increasing the number p of constraints can lead to a sharp increase in ndense (even if these constraints are relatively sparse), which can result in the transformed problem being hard to solve. The constraints are very well satisfied in all the test cases, making this an attractive approach if a good sparse-dense LS solver is available and the number of dense rows in the transformed problem is not too large. Furthermore, it can be used, without modification, if the matrix A contains a (small) number of dense rows. However, for a sequence of problems, if A and/or C changes then, because direct elimination couples the two matrices, the computation must be completely restarted.

4 Approaches described via augmented systems

We now focus on complementary approaches that are based on substitution from the unconstrained least squares problem into the constraints. A useful way to describe this is via the augmented (or saddle-point) system

$$\begin{aligned} \begin{pmatrix} H\, &{} C^T \\ C &{} 0 \end{pmatrix} \begin{pmatrix} x \\ \lambda \end{pmatrix} = \begin{pmatrix} A^T \,b \\ d \end{pmatrix}, \qquad H = A^T\,A. \end{aligned}$$
(4.1)

Here \(\lambda \in {{\mathbb {R}}}^{p}\) is a vector of additional variables that are often called Lagrange multipliers [18, 22]. The solution x of (4.1) solves the LSE problem. Using (4.1) can be particularly useful if C is dense and p is small. As we see in the following discussions, this is because the work involved in the proposed algorithms that depends upon p is effectively independent of the density of C. Observe that because (4.1) has a zero (2, 2) block, the augmented system can be also used to give an alternative derivation of the null-space approach of Algorithm 1. For if Z is such that \(C\,Z = 0\) and \(x_1\) is a particular solution of the second equation of (4.1) so that \(Cx_1 = d\) (steps 1 and 2 of Algorithm 1), then if \(x = x_1 + {\hat{x}}\), (4.1) becomes

$$\begin{aligned} \begin{pmatrix} H\, &{} C^T \\ C &{} 0 \end{pmatrix} \begin{pmatrix} {\hat{x}} \\ \lambda \end{pmatrix} = \begin{pmatrix} A^T (b- A\,x_1) \\ 0 \end{pmatrix}. \end{aligned}$$

The second equation in this system is equivalent to finding \(x_2\) such that \({\hat{x}} = Z\,x_2\). Substituting this into the first equation gives \(H\,Z\,x_2 + C^T \lambda = A^T (b- A\,x_1)\), and premultiplying by \(Z^T\) (using \(C\,Z = 0\)) yields \(Z^TH\,Z\,x_2 = (A\,Z)^T (b- A\,x_1)\), as in Algorithm 1.

4.1 Direct use of Lagrange multipliers

Algorithm 3 presents a straightforward updating scheme for solving the LSE problem using Lagrange multipliers and (4.1). Any appropriate direct or iterative method can be used for Step 1, which is usually the most expensive part of the computation.

Algorithm 3 (Solution of the LSE problem via Lagrange multipliers and updating). Step 1: solve the unconstrained LS problem \(\min _y \Vert A\,y - b\Vert ^2\). Step 2: solve \(H\,W = C^T\) (a block of p right-hand sides). Step 3: form the \(p \times p\) matrix \(C\,W\). Step 4: solve \(C\,W\,\lambda = C\,y - d\). Step 5: set \(x = y - W\,\lambda \).

Step 1 has no dependence on C, so the solution y does not need to be recomputed when C changes. The method used to solve the system with a block of p right-hand sides in Step 2 can be chosen to exploit Step 1. For example, a sparse Cholesky factorization of H may be computed in Step 1 and the factors reused in Step 2. Using existing sparse LS solvers (and a dense linear solver for the \(p \times p\) system at Step 4), Algorithm 3 is straightforward to implement, and the solution y of the unconstrained LS problem obtained from Step 1 can be compared with the LSE solution computed in Step 5.
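As an illustration, a dense Python sketch of this updating scheme using a Cholesky factorization of H is given below. It is our own prototype (not the Fortran implementations discussed later) and assumes A has full column rank so that H is positive definite.

import numpy as np
from scipy.linalg import cho_factor, cho_solve

def lse_lagrange_update(A, C, b, d):
    H = A.T @ A
    factor = cho_factor(H)                      # Step 1: factorize H once ...
    y = cho_solve(factor, A.T @ b)              # ... and solve the unconstrained LS problem
    W = cho_solve(factor, C.T)                  # Step 2: H W = C^T (p right-hand sides, factors reused)
    lam = np.linalg.solve(C @ W, C @ y - d)     # Steps 3-4: p x p system for the multipliers
    return y - W @ lam                          # Step 5: constrained solution x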

As discussed by Golub [17] and Heath [22], a numerically superior direct method that avoids both forming the potentially ill-conditioned normal matrix H and computing the multipliers \(\lambda \) can be derived using a QR factorization of A. Following [43], we obtain Algorithm 4. Here P is a permutation matrix chosen to ensure sparsity of the R factor. Note that, unless b (and hence f) changes, the Q factor need not be retained and the R factor can be reused if the constraints change but A is fixed.

Algorithm 4 (Solution of the LSE problem via a sparse QR factorization of A with updating; neither the normal matrix H nor the Lagrange multipliers \(\lambda \) are formed explicitly).
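A dense Python sketch of this kind of QR-based updating follows. The identification \(K = C\,P\,R^{-1}\) (which we believe is the quantity denoted K in the discussion of Algorithm 5 below) and the precise ordering of the steps are our assumptions, included only to illustrate the idea; a sparse QR factorization would be used in practice.

import numpy as np
from scipy.linalg import qr, solve_triangular

def lse_qr_update(A, C, b, d):
    # Assumes A has full column rank so that R is nonsingular.
    m, n = A.shape
    Q, R, perm = qr(A, mode='economic', pivoting=True)   # A P = Q R (column pivoting for sparsity of R)
    f = Q.T @ b
    K = solve_triangular(R, C[:, perm].T, trans='T').T   # K = C P R^{-1}; the constraint reads K u = d
    u = f + K.T @ np.linalg.solve(K @ K.T, d - K @ f)    # closest point to f satisfying K u = d
    x = np.empty(n)
    x[perm] = solve_triangular(R, u)                     # recover x from R P^T x = u
    return x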

Results for Algorithm 4 presented in Table 4 confirm that the computed solution is such that the norm of the constraints residual \(\Vert r_c\Vert = \Vert d - C\,x\Vert \) is small. We omit results for problems such as deter_5 that have \(p=5\) constraints because they are similar (with \(\Vert r_c\Vert \) typically smaller than for the corresponding problems with \(p=20\)).

Table 4 Norm of the constraint residuals \(\Vert r_c\Vert \) for QR with updating (Algorithm 4)

4.2 An extended augmented system approach

An equivalent formulation of (4.1) is given by the 3-block saddle-point system (the first order optimality conditions)

$$\begin{aligned} {{{\mathscr {A}}}}_{aug}\, y= b_{aug}, \end{aligned}$$

where

$$\begin{aligned} {{{\mathscr {A}}}}_{aug} = \begin{pmatrix} I &{} 0 &{} A \\ 0 &{} 0 &{} C \\ A^T &{} C^T &{} 0 \end{pmatrix}, \qquad y = \begin{pmatrix} r \\ -\lambda \\ x \end{pmatrix}, \qquad b_{aug}= \begin{pmatrix} b \\ d \\ 0 \end{pmatrix}. \end{aligned}$$
(4.2)

Applying the analysis of Section 5 of [43] to this problem yields Algorithm 5. In exact arithmetic, the main difference between the work required by Algorithms 4 and 5 is that the former involves an additional solve with \(RP^T\). For both algorithms, K is independent of b and d.

Algorithm 5 (Solution of the LSE problem via the 3-block augmented system (4.2), following the analysis of Section 5 of [43]).

4.3 Augmented regularized normal equations

The next approach weights the constraints and uses a regularization parameter within an augmented system formulation and then aims to balance these two modifications. Consider the weighted least squares problem (WLS)

$$\begin{aligned} \min _{x_{\gamma }}\left\| A_{\gamma } \,x_{\gamma } - b_{\gamma } \right\| ^2 \;\; \text{ with } \;\; A_{\gamma } = \begin{pmatrix} A \\ \gamma \, C \end{pmatrix}, \;\; b_{\gamma } = \begin{pmatrix} b \\ \gamma \, d \end{pmatrix}, \end{aligned}$$
(4.3)

for some large \(\gamma \) (\(\gamma \gg 1\)). Let \(x_{LSE}\) be the solution of the LSE problem (1.1)–(1.2). Then because

$$\begin{aligned} \lim _{\gamma \rightarrow \infty } x_{\gamma } = x_{LSE}, \end{aligned}$$

the WLS problem can be used to solve the LSE problem approximately [28]. An obvious solution method is to solve the normal equations for (4.3):

$$\begin{aligned} H_\gamma \,x = A_\gamma ^T \,A_\gamma \,x = (A^T\,A + \gamma ^2\, C^TC)\, x = A^T\, b + \gamma ^2 \,C^T d= A_\gamma ^T \,b_\gamma . \end{aligned}$$

The appeal is that no special methods are required: software for solving standard normal equations can be used. However, for very large values of \(\gamma \), the normal matrix \(H_\gamma \) becomes extremely ill-conditioned; this is discussed in Section 4 of [9], where it is shown that the method of normal equations can break down if \(\gamma > \epsilon ^{-1/2}\) (\(\epsilon \) is the machine precision). Furthermore, if C contains dense rows then \(H_\gamma \) will be dense.
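For reference, the weighting approach amounts to nothing more than stacking the scaled constraints on top of A, as in the following Python sketch. Note that the breakdown caveat above concerns the normal equations; the SVD-based lstsq call is used here only to illustrate the weighted formulation (4.3).

import numpy as np

def lse_weighting(A, C, b, d, gamma=1.0e8):
    A_g = np.vstack([A, gamma * C])                # A_gamma in (4.3)
    b_g = np.concatenate([b, gamma * d])           # b_gamma in (4.3)
    x, *_ = np.linalg.lstsq(A_g, b_g, rcond=None)  # x_gamma -> x_LSE as gamma -> infinity
    return x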

Another possibility is to use the regularized normal equations

$$\begin{aligned} (H_\gamma + \omega ^2 I) \,x = A_\gamma ^T \,b_\gamma , \end{aligned}$$
(4.4)

where \(\omega > 0\) is a regularization parameter [49]. Solving (4.4) is equivalent to solving the \((m + p +n) \times (m + p +n)\) augmented regularized normal equations

$$\begin{aligned} {{{\mathscr {A}}}}(\omega ,\gamma ) \begin{pmatrix} y \\ x \end{pmatrix} = \begin{pmatrix} b_\gamma \\ 0 \end{pmatrix}, \qquad {{{\mathscr {A}}}}(\omega ,\gamma ) = \begin{pmatrix} \omega I &{} A_\gamma \\ A_\gamma ^T \;&{} -\omega I \end{pmatrix}, \end{aligned}$$
(4.5)

where \(y = \omega ^{-1}(b_\gamma - A_\gamma \,x)\in {{\mathbb {R}}}^{m+p}\). The spectral condition number of (4.5) is

$$\begin{aligned} \text {cond}({{{\mathscr {A}}}}(\omega ,\gamma )) = \sqrt{\text {cond}(H_\gamma + \omega ^2 \,I)} \end{aligned}$$

and Saunders [38] shows that \(\text {cond}({{{\mathscr {A}}}}(\omega ,\gamma )) \approx \Vert A_{\gamma }\Vert /\omega \) regardless of the condition of \(A_{\gamma }\). Thus using (4.5) potentially gives a significantly more accurate approximation to the pseudo solution \(x = A_\gamma ^+ \,b_\gamma \) (where \((.)^+\) denotes the Moore-Penrose pseudoinverse of a matrix) compared to the approximation provided by solving (4.4). In [48], the parameters are set to \(\omega = 10^{-q}\) and \(\gamma = 10^q\), where

$$\begin{aligned}q = \min \{ k: 10^{-2k} \le \nu ^{-t} \}. \end{aligned}$$

Here t-digit floating-point arithmetic with base \(\nu \) is used.

Rewriting (4.5) using (4.3) and a conformal partitioning of y gives

$$\begin{aligned} \begin{pmatrix} \omega I &{} 0 &{} A \\ 0 &{} \omega I &{} \gamma C \\ A^T \,&{} \gamma C^T \,&{} -\omega I \end{pmatrix} \begin{pmatrix} y_s \\ y_c \\ x \end{pmatrix} = \begin{pmatrix} b \\ \gamma d \\ 0 \end{pmatrix}. \end{aligned}$$
(4.6)

This system can be solved as in [43] using a modified version of Algorithm 5. Alternatively, eliminating \(y_s\) and setting \(\omega \gamma = 1\) yields

$$\begin{aligned} \begin{pmatrix} -H(\omega ) &{} C^T \\ C \;&{} \omega ^2 I \end{pmatrix} \begin{pmatrix} x \\ y_c \end{pmatrix} = \begin{pmatrix} -A^T\,b \\ d \end{pmatrix}, \qquad H(\omega ) = A^T\,A + \omega ^2 \,I. \end{aligned}$$
(4.7)

We can solve this system using a QR factorization of \(\begin{pmatrix}A \\ \omega I \end{pmatrix}\) and modifying Algorithm 4. Or, ignoring the block structure, we can treat it as a sparse symmetric indefinite linear system and compute an \(LDL^T\) factorization (with L unit lower triangular and D block diagonal with blocks of size 1 and 2) using a sparse direct solver such as HSL_MA97 [24] that incorporates pivoting for stability with a sparsity-preserving ordering. This factorization would have to be recomputed for each new set of constraints. Alternatively, a block signed Cholesky factorization of (4.7) can be used, that is,

$$\begin{aligned} \begin{pmatrix} -H(\omega ) &{} C^T \\ C \;&{} \omega ^2 I \end{pmatrix} = \begin{pmatrix} L &{} \\ B\; &{} \;L_{\omega } \end{pmatrix} \begin{pmatrix} -I &{} \\ &{} \;I \end{pmatrix} \begin{pmatrix} L^T \;&{} B^T \\ &{} L_{\omega }^T \end{pmatrix}, \end{aligned}$$

where

$$\begin{aligned} H(\omega ) = L\,L^T, \quad L\,B^T = -C^T \quad \text{ and } \quad S = \omega ^2 I + B\,B^T = L_{\omega }\,L_{\omega }^T. \end{aligned}$$

We then obtain Algorithm 6. Note that B need not be computed explicitly. Rather, the Schur complement S may be computed as \(\omega ^2 I + C\,L^{-T}L^{-1}C^T\), a product \(w=B\,z\) may be computed by solving \(L^T\,v = z\) and then setting \(w=-C\,v\), and \(w= -B^T\,y_c\) may be obtained by solving \(L\,w = C^T\,y_c\).

Algorithm 6 (Solution of the regularized augmented system (4.7) via the block signed Cholesky factorization above).
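A dense Python sketch of the resulting solution process is given below (our own illustration; in practice L is a sparse Cholesky factor and B is applied implicitly as described above).

import numpy as np
from scipy.linalg import cholesky, solve_triangular

def lse_regularized(A, C, b, d, omega=1.0e-8):
    n = A.shape[1]
    p = C.shape[0]
    H_omega = A.T @ A + omega**2 * np.eye(n)
    L = cholesky(H_omega, lower=True)                      # H(omega) = L L^T
    Bt = -solve_triangular(L, C.T, lower=True)             # B^T from L B^T = -C^T
    S = omega**2 * np.eye(p) + Bt.T @ Bt                   # Schur complement omega^2 I + B B^T
    u = solve_triangular(L, A.T @ b, lower=True)
    y_c = np.linalg.solve(S, d + Bt.T @ u)                 # S y_c = d + B u
    x = solve_triangular(L.T, u - Bt @ y_c)                # L^T x = u - B^T y_c
    return x, y_c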

Results for Algorithm 6 for three of our test problems using a range of values of \(\omega \) are given in Table 5. Note that here \(\Vert r_c\Vert \) is computed using \(r_c = d - C \,x\) (rather than recovering it from \(y_c\) via \(r_c = \omega ^2\, y_c\)). We see that, provided \(\omega \) is sufficiently small, the values of \(\Vert x\Vert \) and \(\Vert r\Vert \) are consistent with those given in Table 1.

By replacing the Cholesky factorization of \(H(\omega )\) by an incomplete factorization \(H(\omega ) \approx {\tilde{L}}\,{\tilde{L}}^T\), we can obtain a preconditioner for solving (4.7). In particular, the right-preconditioned system is

$$\begin{aligned} \begin{pmatrix} -H(\omega ) &{} C^T \\ C\; &{} \omega ^2 I \end{pmatrix} M^{-1}\begin{pmatrix} w \\ w_c \end{pmatrix} = \begin{pmatrix} -A^T \,b \\ d \end{pmatrix}, \qquad M\begin{pmatrix} x \\ y_c \end{pmatrix} = \begin{pmatrix} w \\ w_c \end{pmatrix}, \end{aligned}$$
(4.8)

and we can take the preconditioner in factored form to be

$$\begin{aligned} M = \begin{pmatrix} {\tilde{L}} \\ {\tilde{B}} \;&{} \;\;\;I \end{pmatrix} \begin{pmatrix} -I &{} \\ &{} \; {\tilde{S}}_d \end{pmatrix} \begin{pmatrix} {\tilde{L}}^T \;&{} {\tilde{B}}^T \\ &{} I \end{pmatrix}, \end{aligned}$$
(4.9)

with

$$\begin{aligned} {\tilde{L}}\,{\tilde{B}}^T = -C^T \quad \text{ and } \quad {\tilde{S}} = \omega ^2 I + {\tilde{B}} \,{\tilde{B}}^T. \end{aligned}$$

As the preconditioner (4.9) is indefinite, it needs to be used with a general nonsymmetric iterative method such as GMRES [37]. A positive definite preconditioner for use with MINRES [33] can be obtained by replacing \(-I\) in (4.9) by I. MINRES has the important advantage of only requiring three vectors of length equal to the size of the linear system. GMRES results are included in Table 5. The GMRES convergence tolerance is taken to be \(10^{-11}\). We see that the GMRES iteration count is essentially independent of \(\omega \). We also ran MINRES with the same settings. For problems sctap1-2r, south31 and deter3_20 with \(\omega = 10^{-5}\) the counts were 17, 772 and 56 (approximately twice the GMRES counts). This would be of more interest if all counts were higher.

Table 5 Results for the augmented regularized normal equations approach (Algorithm 6) for problems sctap1-2r, south31, and deter3_20 using a range of values of \(\omega \). iters is the number of preconditioned GMRES iterations. The computed \(\Vert x\Vert \) and \(\Vert r\Vert \) are consistent for both approaches
Table 6 Convergence results for problems sctap1-2r with \(\omega =1.0\times 10^{-8}\) and stormg2-8_20 with \(\omega =1.0\times 10^{-6}\). tol and iters are the convergence tolerance and the iteration count for GMRES

Our findings in Sect. 4 suggest that, if we require the constraints to be satisfied with a small residual, then an augmented system based approach combined with a QR factorization performs better (in terms of \(\Vert r_c\Vert \)) than combining it with regularization and a Cholesky factorization. Unfortunately, QR factorizations are more expensive and, while strategies for computing incomplete orthogonal factorizations for use in building preconditioners have been proposed (see, for instance, [2, 3, 4, 27, 30, 34, 47]), the only available software is the MIQR package of Li and Saad [30] (probably because developing high quality implementations is non-trivial). In their study of preconditioners for LS problems, Gould and Scott [19, 20] found that MIQR generally performed less well than incomplete Cholesky factorization preconditioners, and so it is not considered here.

We have made the implicit assumption that A is sparse. However, it is straightforward to extend the augmented system-based approaches to the more general case that A contains rows that are dense. For example, if A is permuted and partitioned as

$$\begin{aligned} A = \begin{pmatrix} A_1 \\ A_2 \end{pmatrix}, \end{aligned}$$

where \(A_1\) is sparse and \(A_2\) is dense, then using a conformal partitioning of \(y_s\) and of b, (4.7) can be replaced by the augmented system

$$\begin{aligned} \begin{pmatrix} -H_{1}(\omega ) &{} C_d^T \\ C_d \;&{} \omega ^2 I \end{pmatrix} \begin{pmatrix} x \\ y_d \end{pmatrix} = \begin{pmatrix} -A_1^T\,b_1 \\ d \end{pmatrix} \end{aligned}$$

with

$$\begin{aligned} \quad H_{1}(\omega ) = A_1^T\,A_1 + \omega ^2 I, \quad y_d = \begin{pmatrix} y_c \\ y_2 \end{pmatrix}, \quad C_d = \begin{pmatrix} C \\ \omega \,A_2 \end{pmatrix}, \quad d = \begin{pmatrix} d \\ \omega \,b_2 \end{pmatrix}. \end{aligned}$$

Finally, we remark that, if we use the 3-block form (4.6) then we can follow [43], which in turn generalises the work of Carson, Higham and Pranesh [11], and obtain an augmented system approach with multi-precision refinement. This has the potential to reduce the computational cost in terms of time and/or memory, thus allowing larger problems to be solved.

5 Conclusions

We have considered a number of approaches for solving large-scale LSE problems in which the constraints may be dense. Our main findings can be summarized as follows:

  • The classical null-space method relies on computing a null-space basis matrix Z for the “wide” constraint matrix C such that \(Z^T A^T \,A \,Z\) is sparse. In recent work [44], we proposed how this can be achieved using a method based on a QR factorization of C with threshold pivoting. This is not straightforward to implement. Furthermore, our numerical experiments show that, in some cases, the norm \(\Vert r_c\Vert \) of the constraints residual can be larger than for other approaches considered in this study. Thus, although in some contexts null-space approaches are popular, we do not recommend the strategy of [44] for LSE problems.

  • The direct elimination approach couples the constraint matrix and the LS matrix, leading to a sparse-dense transformed least squares problem. Existing direct or iterative methods can be used to solve the transformed problem, and our experiments found that the computed constraint residuals are small. The approach can be used for problems for which A (as well as C) contains a small number of dense rows. A weakness is that, when solving a sequence of problems in which only one of A and C changes, the coupling of the two blocks in the solution process means that the computation must be restarted from scratch. Furthermore, the number of dense rows in the transformed problem can be relatively large, making it expensive to solve.

  • There are several options for using an augmented system formulation. This can be solved using standard building blocks, such as a sparse QR factorization, a sparse symmetric indefinite linear solver, or a block sparse Cholesky factorization. An attraction of each of these is that existing “black box” solvers can be exploited, thereby greatly reducing the effort required in developing robust and efficient implementations. The augmented system formulation can be generalised to handle dense rows in A and offers the potential for mixed-precision computation. Moreover, an incomplete Cholesky factorization can be used as a preconditioner with a Krylov subspace solver.

  • In the case of a series of LSE problems in which only the constraints change, both the null-space and direct elimination approaches have the disadvantage that the computation must be redone for each new set of constraints. For the augmented system approaches, a significant amount of work can be reused from the first problem in the sequence when solving subsequent problems.

Finally, we observe that there is a lack of iterative methods and preconditioners that can be used to extend the size of LSE problems that can be solved. We have shown that using an incomplete factorization within a block factorization of an augmented system can be effective, but most current incomplete factorizations that result in efficient preconditioners are serial in nature and not able to tackle extremely large problems (but see [1, 25] for novel approaches that are designed to exploit parallelism). Addressing the lack of iterative approaches is a challenging subject for future work.