On a conjugate directions method for solving strictly convex QP problem

The problem of solving a strictly convex quadratic programming (QP) problem is studied, using the idea of conjugate directions. First we assume that the set of directions conjugate with respect to the hessian of the goal function is known. We apply n simultaneous directional minimizations along these conjugate directions, all starting from the same point, followed by the addition of the directional corrections. A theorem justifying that the algorithm finds the global minimum of the quadratic goal function is proved. An effective construction of the required set of conjugate directions is presented. We start with a vector whose entries are all zero except the first one. At each step a new vector, conjugate to the previously generated ones, is constructed; its number of nonzero entries is larger by one than in its predecessor. The conjugate directions obtained by this construction with appropriately selected parameters form an upper triangular matrix which, in exact arithmetic, is the Cholesky factor of the inverse of the hessian matrix. The computational cost of calculating this inverse factorization is comparable with the cost of the Cholesky factorization of the original second derivative matrix. Calculation of the vectors involves exclusively matrix/vector multiplication and the inversion of a diagonal matrix. Preliminary computational results on some test problems are reported.
In the test problems all symmetric, positive definite matrices with dimensions from 14×14 to 2000×2000 from the repository of the University of Florida were used as the hessians.


Introduction
In the paper we consider the following unconstrained optimization problem

min f(x) = (1/2) xᵀGx − bᵀx, x ∈ ℝⁿ, (1)

where matrix G is symmetric and positive definite. The problem is equivalent to the problem of finding the solution of the set of linear equations

Gx = b. (3)

Equation (3) corresponds to the first order necessary optimality conditions for problem (1), and the equivalence of (3) and (1) is easily verified. There are two main approaches to solving the quadratic problem (1) or the set of linear equations (3). In the first, a sequence of consecutive directional minimizations along conjugate directions (first of all conjugate gradients) is carried out. In the second approach, one first finds the triangular Cholesky decomposition of matrix G and next two sets of linear equations with triangular matrices are solved.
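The equivalence of (1) and (3) can be checked numerically. The snippet below is a minimal sketch; the test matrix, the right-hand side and all variable names are illustrative assumptions of this note, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
G = A @ A.T + 5 * np.eye(5)             # symmetric, positive definite hessian
b = rng.standard_normal(5)

f = lambda x: 0.5 * x @ G @ x - b @ x   # goal function of problem (1)
x_star = np.linalg.solve(G, b)          # solution of the linear system (3)

# The solution of (3) is a stationary point of (1): the gradient vanishes.
assert np.allclose(G @ x_star - b, 0.0)
# By strict convexity it is the global minimizer: perturbations only increase f.
assert all(f(x_star) <= f(x_star + rng.standard_normal(5)) for _ in range(10))
```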
The notion of conjugate directions plays a very important role in solution methods for the QP problem (1) and for general unconstrained nonlinear minimization problems. Powell (1964) proposed a conjugate directions method that does not calculate derivatives; each consecutive direction was generated as the difference between the minima found on two parallel straight lines. The most popular are the conjugate gradient methods. The first one was developed by Hestenes and Stiefel (1952), who proposed it for solving symmetric, positive definite linear algebraic systems. In 1964 the method was extended to nonlinear problems by Fletcher and Reeves (1964). The procedure is recursive: d_0 = −∇f(x_0) and every consecutive search direction is a linear combination of the previous one and the Cauchy steepest descent direction,

d_{k+1} = −∇f(x_{k+1}) + β_k d_k.

It means that starting from an initial guess x_0 (with corresponding residual r_0 = ∇f(x_0) = Gx_0 − b), this method generates a sequence of iterates x_k that minimize f over the Krylov subspaces x_0 + span(r_0, Gr_0, ..., G^{k−1} r_0). Since the publication of Hestenes and Stiefel (1952), conjugate gradient methods have become an area of intensive research and a large number of variants of conjugate gradient algorithms have been designed. Some well-known choices for β_k are those of Fletcher and Reeves (1964), Polak and Ribiére (1969), Polyak (1969), Dai and Yuan (1999) and Hestenes and Stiefel (1952):

β_k^{FR} = ‖g_{k+1}‖² / ‖g_k‖², β_k^{PRP} = g_{k+1}ᵀ y_k / ‖g_k‖², β_k^{DY} = ‖g_{k+1}‖² / (d_kᵀ y_k), β_k^{HS} = g_{k+1}ᵀ y_k / (d_kᵀ y_k),

respectively, where g_k = ∇f(x_k), y_k = g_{k+1} − g_k and ‖·‖ means the Euclidean norm. Surveys of the developments in that area may be found, for instance, in Hager and Zhang (2005), Stachurski (2012) and Dai (2011). The standard algebraic approach to solving either the set of linear equations (3) or the quadratic problem (1) is two-phase. In phase one, the triangular Cholesky decomposition (published posthumously by Benoit (1924)) of the hessian matrix is found, i.e., the lower triangular matrix L such that G = LLᵀ.
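For a quadratic goal function the recursion above takes only a few lines. The following sketch uses the Fletcher–Reeves choice of β_k (which coincides with the other variants on exact quadratics with exact line searches); the function name and test data are assumptions of this note:

```python
import numpy as np

def conjugate_gradient(G, b, x0, tol=1e-10):
    """Linear conjugate gradient method for min 0.5 x'Gx - b'x,
    with the Fletcher-Reeves choice of beta."""
    x = x0.copy()
    g = G @ x - b                          # gradient = residual
    d = -g                                 # d_0 = -grad f(x_0)
    for _ in range(len(b)):
        if np.linalg.norm(g) < tol:
            break
        Gd = G @ d
        alpha = (g @ g) / (d @ Gd)         # exact directional minimization
        x = x + alpha * d
        g_new = g + alpha * Gd
        beta = (g_new @ g_new) / (g @ g)   # Fletcher-Reeves beta_k
        d = -g_new + beta * d              # next search direction
        g = g_new
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((10, 10))
G = A @ A.T + 10 * np.eye(10)              # symmetric, positive definite
b = rng.standard_normal(10)
x = conjugate_gradient(G, b, np.zeros(10))
assert np.allclose(G @ x, b)               # converges in at most n steps
```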
In phase two, two sets of linear equations with triangular matrices are consecutively solved, using the forward substitution followed by the backward substitution, in order to find the solution of (1):

Ly = b, Lᵀx = y. (4)

A closely related approach is to express the hessian as G = LDLᵀ, where L is lower triangular with ones on the main diagonal and D is a diagonal matrix. Similarly as in formula (4), after finding the decomposition, two sets with triangular matrices and one with a diagonal matrix are solved. A survey of direct methods for sparse linear systems may be found in the technical report. Problems of type (1), (3) are a kind of inverse problem and have many applications. An inverse problem in science is the process of calculating the unknown input values from a set of observations: for instance, calculating an image in computer tomography, deblurring a digital picture, reconstructing the source in acoustics or calculating the density of the Earth from measurements of its gravity field (for more information see Wikipedia (2017)). Many of them involve large-scale problems with symmetric, positive definite matrices and multiple right-hand sides.
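The two-phase solve of formula (4) can be sketched with both substitution phases written out explicitly; the helper name and test data are assumptions of this note:

```python
import numpy as np

def cholesky_solve(G, b):
    """Two-phase solve of G x = b via G = L L^T:
    forward substitution for L y = b, then backward substitution for L^T x = y."""
    L = np.linalg.cholesky(G)
    n = len(b)
    y = np.zeros(n)
    for i in range(n):                       # forward substitution
        y[i] = (b[i] - L[i, :i] @ y[:i]) / L[i, i]
    x = np.zeros(n)
    for i in reversed(range(n)):             # backward substitution
        x[i] = (y[i] - L[i+1:, i] @ x[i+1:]) / L[i, i]
    return x

rng = np.random.default_rng(5)
A = rng.standard_normal((8, 8))
G = A @ A.T + 8 * np.eye(8)                  # symmetric, positive definite
b = rng.standard_normal(8)
x = cholesky_solve(G, b)
assert np.allclose(G @ x, b)
```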
Recently we observe a renewed interest in large, sparse, unconstrained quadratic optimization problems in that context. Currently the research goes in many directions. There are many papers devoted to the search for the best preconditioner for the problem, see for instance Gratton et al. (2011), Andrea et al. (2017), Fasano and Roma (2016) and many others. Feng et al. (1995) consider block conjugate gradient methods for solving linear systems with multiple right-hand sides. A block algorithm is also applied to problems with multiple right-hand sides in El Guennouni et al. (2003) (this time the considered matrices are not symmetric). Some researchers concentrate on finding effectively the solution of the unconstrained quadratic optimization problem or the associated system of linear equations. For instance, in Alléon et al. (2003) and Simoncini and Szyld (2007) one may find information on algorithms combining direct methods with the iterative approach. Tasks of deblurring noisy pictures are frequently formulated as quadratic programming problems with nonnegativity conditions on variables and regularization, see for instance Chen and Zhou (2010) and references therein. The problem in Chen and Zhou (2010) is essentially quadratic and nonsmooth due to the regularity term formulated in the l_1-norm. Another trend is to find directly a factorization of the inverse matrix. Let us mention in that context the papers of Yunlan et al. (1997) and Zhu et al. (2011), where the structure of the hessian matrix is deeply exploited. Let us mention also SelInv, the selected inversion algorithm for a sparse symmetric matrix developed by Lin et al. (2011). In the first stage of SelInv the LLᵀ decomposition of G is calculated and in the second stage selected subsets of entries of the inverse are found. SelInv requires a special structure of the matrix to do this effectively.
Electronic structure calculations are among the possible application areas. In our result we start from the notion of conjugate vectors and do not use the structure. It is worth mentioning here also the iterative refinement technique introduced by Ogita and Oishi (2012).
The content of the paper may be treated as a result at the junction of two disciplines: optimization and linear algebra. Currently a close relationship between optimization and numerical linear algebra algorithms is observed. As O'Leary (2000) stated: "The efficiency and effectiveness of most optimization algorithms hinges on the numerical linear algebra algorithms that they utilize. Effective linear algebra is crucial to their success, and because of this, optimization applications have motivated fundamental advances in numerical linear algebra." Our paper follows that direction. In our opinion it may be of interest to researchers working on optimization algorithms as well as in linear algebra. In our paper we use the notion of conjugacy, frequently exploited in optimization, and propose an unconventional way of using a given set of conjugate directions. In that approach n independent directional minimizations starting from the same point x_0 are executed simultaneously, and finally the solution is obtained by adding to x_0 the linear combination of the search directions with combination coefficients equal to the calculated directional step sizes. Section 2 contains some basic definitions and properties of conjugate directions, and in the consecutive Sect. 3 the main theorem justifying the correctness of the proposed procedure using n independent, simultaneous directional searches is formulated and proved. In Sect. 4 the algorithm for the recursive generation of the conjugate directions is presented. In Sect. 5 it is shown that the generated conjugate directions form the RD⁻¹Rᵀ decomposition of the inverse of the second order derivative matrix (hessian) G⁻¹. The particular values on the main diagonal of matrix R and, what is strongly related, the main diagonal entries of the D matrix depend on the choice of the parameters controlling the length of the generated vectors.
Let us stress that the recursive presentation of the Cholesky factorization in the book by Stoer (1976) (see also Stoer and Bulirsch (1993)) was to some extent an inspiration for us (the same representation was used by Wilkinson (1963) to prove the numerical stability of the Cholesky decomposition). Our method also works recursively, moving from a subspace of a given dimension to a subspace of dimension larger by one.
The resulting algorithm is two-phase. Its details are discussed in Sect. 6. In the first phase we generate the set of conjugate directions. The second phase consists of performing n independent directional minimizations; in our opinion this may be useful in distributed calculations and cloud computing. The second phase may also be executed by simply applying three consecutive matrix/vector multiplications. Both approaches are compared numerically on a set of 128 different strictly convex quadratic programming problems of different sizes (the space dimension varies from 14 to 2000). They were created with the aid of 64 symmetric, positive definite matrices taken from the University of Florida repository and two different versions of the b vector (see Sect. 7). All those matrices come from real life applications and are stored in the sparse matrix format.
The computational effort of the second phase in our approach should be much lower than in the standard way of decomposing the matrix G followed by the solution of two sets of linear equations with triangular matrices, plus possibly a set with a diagonal matrix. The computational cost of the first phase is identical with the cost of calculating the standard LDLᵀ factorization of matrix G. In the final Sect. 8 some conclusions and corollaries are presented.

Some definitions and properties of the conjugate directions
To obtain a factorization of the inverse G⁻¹ of the considered symmetric, positive definite matrix G we shall construct directions which are conjugate with respect to G. The formal definition of G-conjugacy is recalled below; the definition of conjugate vectors and the proof of their linear independence are taken from Bertsekas (1995).

Definition 1 Nonzero vectors d_1, ..., d_k are mutually conjugate with respect to a symmetric, positive definite matrix G if d_iᵀ G d_j = 0 for all i ≠ j.

Lemma 1 Mutually G-conjugate nonzero vectors d_1, ..., d_k are linearly independent.

Proof Suppose, to the contrary, that the vectors are linearly dependent, i.e., Σ_i α_i d_i = 0 with some coefficient α_j ≠ 0. Multiplying this equality from the left by d_jᵀG annihilates all terms except the j-th one due to the G-conjugacy property, which gives α_j d_jᵀ G d_j = 0. From the last equality we deduce that α_j = 0, because d_jᵀ G d_j > 0 by the positive definiteness assumption. We obtained a contradiction. That concludes the proof.

Solving QP problem by means of simultaneous directional minimizations
The key property used in this section is formulated as the following theorem.

Theorem 1 Let d_1, ..., d_n be mutually conjugate with respect to G and let x_0 be an arbitrary starting point. Denote the n independent directional minimization problems by

min_{α∈ℝ} f(x_0 + α d_i), i = 1, ..., n,

and their solutions by ᾱ_i, respectively. Then the point x̄ defined as

x̄ = x_0 + Σ_{i=1}^n ᾱ_i d_i

is the optimal solution of the quadratic programming problem (1).
Proof We shall show that Gx̄ − b = 0. To prove the thesis it is sufficient to show that the gradient ḡ = ∇f(x̄) (the first derivative of function f calculated at the point x̄) is orthogonal to all vectors d_i, i.e., d_iᵀ ḡ = 0 for i = 1, ..., n. Let us select any of the conjugate vectors d_i (let it be d_k). Then

d_kᵀ ḡ = d_kᵀ (G(x_0 + Σ_i ᾱ_i d_i) − b) = d_kᵀ (Gx_0 − b) + Σ_i ᾱ_i d_kᵀ G d_i.

The conjugacy property ensures that inside the sum there is only one term with nonzero value, namely that for i = k. Therefore after simple recalculations we obtain

d_kᵀ ḡ = d_kᵀ (Gx_0 − b) + ᾱ_k d_kᵀ G d_k.

Let us notice that the function f̃_k(α) = f(x_0 + α d_k) satisfies f̃_k′(α) = d_kᵀ ∇f(x_0 + α d_k) by the rule of differentiating composite functions. Hence, due to the exact directional minimization on the straight line defined by the set of points {x_0 + α d_k : α ∈ ℝ}, the right-hand side above equals f̃_k′(ᾱ_k) = 0. Thus ḡ is orthogonal to every vector d_k. Since, by Lemma 1, the conjugate directions form a basis of ℝⁿ, the gradient ḡ is orthogonal to all vectors from that basis. Therefore it should be equal to 0.
A similar theorem may be found in Fletcher (1987). However, he considered it in an essentially sequential way: a directional search along one direction, followed by the move to the new point, then another directional minimization from the second point followed by the move to the third point, and so on. Fletcher's theorem is the basis for the conjugate directions methods in optimization. Here we consider n independent directional minimizations, all of them starting from the same point x_0.
Theorem 1 suggests an obvious numerical procedure for solving a quadratic programming problem (1) with a symmetric, positive definite matrix G in a computational cloud or on a machine with many parallel processors. We may simply run simultaneously n independent directional minimizations starting from the same point x_0. The crucial problem is the effective generation of the conjugate directions. A recursive procedure for their generation is proposed in the next section.

Recursive generation of conjugate directions
We shall start with a vector having only one nonzero entry, namely the first one (for instance equal to 1), and every consecutive vector will have one nonzero entry more than the previous one, i.e., the k-th vector will have its first k entries possibly nonzero, while the rest assume zero value. Furthermore, every (k+1)-st vector should be conjugate to all previous vectors with respect to matrix G.
So, let us assume now that the starting vector is d_1 = [c, 0, ..., 0]ᵀ, where c > 0 is a positive constant, and try to find a consecutive vector d_2 verifying the following requirements: d_2 is conjugate to d_1 with respect to G, and only the two first entries of d_2 may have nonzero values, all others being equal to 0, i.e., d_2 = [d_{2,1}, d_{2,2}, 0, ..., 0]ᵀ. Vector d_2 should be G-conjugate with d_1, i.e., d_1ᵀ G d_2 = 0. Due to the choice of d_1 and d_2, simple matrix calculations lead to

d_1ᵀ G d_2 = c (G_{11} d_{2,1} + G_{12} d_{2,2}).

In order to preserve the conjugacy property this quantity should be equal to 0, and therefore the coefficients of vector d_2 should verify the equality

G_{11} d_{2,1} + G_{12} d_{2,2} = 0. (8)

We have one degree of freedom in Eq. (8). We may take for instance d_{2,2} = 1 and solve Eq. (8) with respect to d_{2,1}. In what follows we shall derive a similar representation for every k. In step number k > 1 we permit only the first k elements of the new direction d_k to assume nonzero values, all others being set to 0. Furthermore, we preserve the conjugacy with respect to G to all previous directions d_1, ..., d_{k−1}, where in every vector d_i (i < k) only the first i coefficients may assume nonzero values.
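The construction of d_2 can be checked on a small example; the matrix below and the normalization d_{2,2} = 1 are illustrative assumptions of this note:

```python
import numpy as np

G = np.array([[4.0, 2.0, 1.0],
              [2.0, 5.0, 3.0],
              [1.0, 3.0, 6.0]])         # a small symmetric, positive definite matrix
c = 1.0
d1 = np.array([c, 0.0, 0.0])            # starting vector d_1 = [c, 0, ..., 0]^T
# Use the degree of freedom in Eq. (8): fix d_{2,2} = 1 and solve
# G11 * d_{2,1} + G12 * d_{2,2} = 0 for the first entry d_{2,1}.
d2 = np.array([-G[0, 1] / G[0, 0], 1.0, 0.0])
assert abs(d1 @ G @ d2) < 1e-12         # d_1 and d_2 are G-conjugate
```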
Let us assume the block representation (9) of matrix G, in which G_k denotes its k × k principal submatrix, and let d_k have the structure (10), d_k = [dᵀ, c, 0, ..., 0]ᵀ, with d being a column vector of size k − 1 and c a scalar. Similarly, every d_i (i < k) may be represented as in (11), where d̄_i belongs to the space ℝ^{i−1}. The conjugacy requirement, due to the splitting formulated in Eqs. (9)-(11), may be written as d_iᵀ G d_k = 0 for i = 1, ..., k − 1 (12). Hence the considerations may be restricted to the k-dimensional space and to conjugacy with respect to the k × k principal submatrix G_k. Applying (9) again we arrive at the equivalent form (13) of (12). Vector d belongs to the space spanned by the vectors d_1, ..., d_{k−1}. Therefore it may be expressed as their linear combination, d = Σ_{j=1}^{k−1} τ_j d_j. Let us substitute the above expression for d into (13) and calculate the coefficients τ_j which ensure that vector d_k verifies the conjugacy condition. Equation (14) results in the values (15) of the coefficients τ_j, and finally we obtain formula (16) specifying the new vector. Hence we may formulate the following algorithm for the generation of vectors mutually conjugate with respect to matrix G.

Algorithm I
1. [Initialization] Set k = 1. Select c, for instance c = 1. Build the vector d_1 = [c, 0, ..., 0]ᵀ.
2. [Main step] Calculate the new vector d_{k+1} using formula (16).
3. Increase the dimension, set k = k + 1. If k < n, go to step 2.
The induction analysis presented above proves the following lemma.

Lemma 2 Vectors d_1, d_2, ..., d_n generated by Algorithm I are mutually conjugate with respect to matrix G.
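A compact sketch of Algorithm I follows. It is an assumption of this note that the recursion (16) amounts to G-conjugate Gram-Schmidt over the unit vectors, which reproduces the stated sparsity pattern (the k-th direction has nonzero entries only in its first k positions); all names and the test matrix are illustrative:

```python
import numpy as np

def conjugate_directions(G, c=1.0):
    """Sketch of Algorithm I: generate n mutually G-conjugate vectors,
    the k-th of which has nonzero entries only in its first k positions."""
    n = G.shape[0]
    R = np.zeros((n, n))                  # columns are the directions d_1, ..., d_n
    R[0, 0] = c                           # d_1 = [c, 0, ..., 0]^T
    for k in range(1, n):
        e = np.zeros(n); e[k] = 1.0       # unit vector opening the (k+1)-st coordinate
        d = e.copy()
        for j in range(k):                # enforce conjugacy with all previous d_j
            dj = R[:, j]
            d -= (dj @ (G @ e)) / (dj @ (G @ dj)) * dj
        R[:, k] = c * d
    return R

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
G = A @ A.T + 5 * np.eye(5)               # symmetric, positive definite
R = conjugate_directions(G)
M = R.T @ G @ R                           # diagonal iff the columns are conjugate
assert np.allclose(M, np.diag(np.diag(M)))
assert np.allclose(R, np.triu(R))         # the directions form an upper triangular matrix
```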

Matrix form of the algorithm for generating conjugate directions
The algorithm presented above may be written in matrix form. Let us denote by L_k the k × k matrix whose rows are the transposed nonzero parts of the generated vectors d_1, ..., d_k. Due to their construction, L_k is lower triangular. Taking this relation into account, we may express formula (16) in the matrix form (17). Let us further denote by D_k the diagonal matrix with the conjugacy products d_iᵀ G d_i on its main diagonal. Then formula (17) may be expressed as (18). Below is the formal formulation of the pseudo code of the matrix form of Algorithm I.

Algorithm II
1. [Initialization] Set k = 1. Select c, for instance c = 1, and build the 1 × 1 matrix L_1 = [c].
2. Increase the dimension by 1; set k = k + 1. Calculate vector d using formula (18).
3. Build the new matrix L_k by appending to L_{k−1} the new row [dᵀ, c] and a column of zeros above the main diagonal. If k < n, go to step 2.

To calculate vector d we should realize a matrix/vector multiplication, followed by the possible multiplication of each component by the scalar c and the division by the corresponding main-diagonal entry of the diagonal matrix D. The last two operations can be realized simultaneously.
The new lower triangular matrix L_k is augmented at each step by the addition of a new row, representing the transposed vector d_k (restricted to its first k entries), and a new column containing exclusively entries equal to 0 above the main diagonal.

Generated directions and their relation with the factorization of the inverse matrix
The last matrix L_n contains n rows, being transpositions of the n mutually G-conjugate vectors, and verifies the following matrix relation:

L_n G L_nᵀ = D_n, (19)

where L_n is lower triangular and D_n is the diagonal matrix.
Below we shall show that Algorithm II produces the LDL factorization of the inverse of G.

Lemma 3 The following equality holds:

G⁻¹ = L_nᵀ D_n⁻¹ L_n. (20)
Proof Taking the inverses in Eq. (19) we obtain

L_n⁻ᵀ G⁻¹ L_n⁻¹ = D_n⁻¹. (21)

Multiplying (21) from the left-hand side by L_nᵀ and from the right-hand side by L_n, we obtain the desired expression (20) for the inverse of G.
It means that the proposed algorithm produces directly an RD⁻¹Rᵀ factorization of the inverse matrix G⁻¹, with R = L_nᵀ upper triangular. It is easily observed that the same statement is valid for every intermediate set of generated vectors.
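The identity G⁻¹ = RD⁻¹Rᵀ holds for any matrix R whose columns are G-conjugate. For a short self-contained check we take the conjugate set from the Cholesky factor, R = L⁻ᵀ (an assumption of this note made for brevity; the paper generates R by Algorithm II, and with this particular choice D is the identity):

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((6, 6))
G = A @ A.T + 6 * np.eye(6)              # symmetric, positive definite

# Columns of R = L^{-T} (L the Cholesky factor) are G-conjugate and
# upper triangular, like the directions produced by Algorithm II.
R = np.linalg.inv(np.linalg.cholesky(G)).T
D = R.T @ G @ R                           # diagonal by conjugacy (here the identity)
G_inv = R @ np.diag(1.0 / np.diag(D)) @ R.T   # the R D^{-1} R^T factorization
assert np.allclose(G_inv, np.linalg.inv(G))
```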

Lemma 4
The following equality holds for each 1 ≤ k ≤ n:

G_k⁻¹ = L_kᵀ D_k⁻¹ L_k,

where G_k denotes the k × k principal submatrix of G. A deeper analysis and the stability issues of this approach to obtaining directly a factorization of the inverse matrix will be pursued elsewhere.

Two-phase algorithm
A two-phase algorithm is proposed on the basis of the above theoretical considerations. In the first phase we calculate the conjugate directions. For their calculation we may apply either Algorithm I or Algorithm II formulated above. As follows from the above discussion, the proposed algorithm is recursive: we start with a space of dimension 1 and increase the dimension by one at each step. It is also important in this procedure that the only algebraic operations involved are matrix/vector multiplication and the division or multiplication of matrix entries by an appropriate number.
It is possible to further reduce the computational effort if we restrict the calculation of the entries of the diagonal matrix D in step 1 to the last entry on the main diagonal, i.e., D_kk. All other entries on the main diagonal are the same as in the previous iteration.
In the second phase the optimal solution of the QP (quadratic programming) problem (1) is found. After the calculation of the conjugate directions is complete, it is clear whether the QP problem has an optimal solution or is unbounded. So let us now assume that the QP problem is bounded from below.
The algorithm executing n independent simultaneous directional minimizations is presented in Sect. 6.1. It is based on Theorem 1.
The second possibility, presented in Sect. 6.2, is to solve the stationarity condition. By Lemma 4 in Sect. 5 we know that, as a byproduct of the conjugate direction generation, we obtain the factorization of the inverse hessian. Therefore we can find the solution of the stationarity condition (3) by simple matrix/vector multiplications.

Finding the solution of the QP problem
The key property used in this subsection is the result of Theorem 1.

Algorithm III
A0. Select the starting point x_0. x_0 = 0 may be a suitable choice, because then ∇f(x_0) = −b.
A1. Execute n independent directional minimizations, i.e., find the solutions ᾱ_i of all problems

min_{α∈ℝ} f(x_0 + α d_i), i = 1, ..., n. (23)

A2. Calculate the solution of the problem using the formula

x̄ = x_0 + Σ_{i=1}^n ᾱ_i d_i. (24)

Function f in formula (23) is quadratic (see formula (1)). Therefore each directional step size ᾱ_i is the stationary point of the derivative of the function f̃_i(α) = f(x_0 + α d_i), which is minimized in step A1. Hence the step size verifies the equation

d_iᵀ ∇f(x_0) + α d_iᵀ G d_i = 0. (25)

Simple arithmetic operations on Eq. (25) lead to the following expression for ᾱ_i:

ᾱ_i = − d_iᵀ ∇f(x_0) / (d_iᵀ G d_i). (26)

What is important, the derivative ∇f(x_0) may be calculated once; the same value is used for every step i. We need the following data to calculate ᾱ_i: ∇f(x_0), d_i and G.
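The steps above can be sketched in a few lines. For a self-contained example the conjugate set is taken, for brevity, from the Cholesky factor (an assumption of this note; in the paper it comes from Algorithm I or II), and all names are illustrative:

```python
import numpy as np

def solve_by_directional_minimizations(G, b, R, x0):
    """Sketch of Algorithm III: n independent exact line searches from the
    common point x0 along the G-conjugate columns of R, then one combined
    correction x_bar = x0 + sum_i alpha_i d_i."""
    g0 = G @ x0 - b                                  # gradient at x0, computed once
    denom = np.einsum('ij,jk,ki->i', R.T, G, R)      # d_i' G d_i for each column
    alpha = -(R.T @ g0) / denom                      # step sizes, one per direction
    return x0 + R @ alpha                            # sum of the directional corrections

rng = np.random.default_rng(2)
A = rng.standard_normal((7, 7))
G = A @ A.T + 7 * np.eye(7)                          # symmetric, positive definite
b = rng.standard_normal(7)
R = np.linalg.inv(np.linalg.cholesky(G)).T           # columns are G-conjugate
x_bar = solve_by_directional_minimizations(G, b, R, np.zeros(7))
assert np.allclose(G @ x_bar, b)                     # Theorem 1: x_bar solves (1)
```

In a distributed setting each α_i could be computed on a separate worker, since every line search uses only the shared ∇f(x_0), its own d_i and G.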

Finding the solution of the set of equations
The conjugacy condition may be equivalently written as

Rᵀ G R = D, (27)

where the columns of R are the conjugate directions. Multiplying Eq. (27) from the left-hand side by the inverse of Rᵀ and from the right-hand side by R⁻¹, we obtain

G = R⁻ᵀ D R⁻¹. (28)

Now the application of the commonly known rule for calculating the inverse of a product of square nonsingular matrices yields

G⁻¹ = R D⁻¹ Rᵀ. (29)

Let us apply formula (29) to equation (3). This results in

x = R D⁻¹ Rᵀ b. (30)

In view of equality (30) it is easy to distribute the calculation of x verifying Eq. (3). It involves exclusively matrix/vector multiplications and the calculation of the image of matrix D⁻¹ on an appropriate vector. Matrix D is diagonal, hence multiplication by D⁻¹ requires only n arithmetic divisions. There are efficient methods to realize matrix/vector multiplication effectively, see for instance Bertsekas and Tsitsiklis (1989).
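Formula (30) reduces the solve to three matrix/vector operations plus n divisions; the sketch below again takes the conjugate set from the Cholesky factor for brevity (an illustrative assumption, as are all names):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((8, 8))
G = A @ A.T + 8 * np.eye(8)                 # symmetric, positive definite
b = rng.standard_normal(8)

# Columns of R = L^{-T} satisfy R' G R = D with D diagonal, as in (27).
R = np.linalg.inv(np.linalg.cholesky(G)).T
D = np.einsum('ij,jk,ki->i', R.T, G, R)     # diagonal of R' G R
x = R @ ((R.T @ b) / D)                     # x = R D^{-1} R' b: matvec, n divisions, matvec
assert np.allclose(G @ x, b)
```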

Numerical experiments
The sample matrix test problems from the repository of the University of Florida were used for the numerical verification of the proposed algorithm (a description may be found in Davis and Hu (2011a)). The repository is available on-line. We restricted the calculations to the subset of symmetric, positive definite matrices from the repository. They are stored in three different formats: the Rutherford-Boeing format (see Duff et al. (1989a, b, 1997)), the Matrix Market format (see Boisvert (1997a) and Boisvert et al. (1997b)) and the MATLAB format. In the MATLAB format used by us, the problem set is a single MATLAB variable of struct type, stored in a single MAT-file.
Tables 1 and 2 contain information about the test problems. We selected from the repository all problems with symmetric, positive definite matrices of size between 0 and 2000. Many of them are structural problems, where matrix G is the so-called stiffness matrix of the structure and is usually symmetric and positive definite; it arises as the result of the finite element discretization of the problem. The details may be found for instance in Felippa (2001) or McGuire et al. (2000). The computational results are presented in Tables 3-6. The quantities appearing in particular columns are defined below: AccSolD — accuracy of the solution (max_i |(Gx − b)_i|) obtained with the aid of n consecutive directional minimizations.
Tables 3 and 4 contain the results for problems with the right-hand side vector b = [1, 1, ..., 1]ᵀ. Unfortunately, almost all problems in the repository do not contain vector b.
Tables 5 and 6 contain results for problems with the same matrices G as above. The only difference is that the solution vector of the linear system is assumed to have all components equal to one (x_i = 1, ∀i) and vector b = Gx, respectively. We omitted the factorization accuracy and the decomposition times; they are the same as in Tables 3 and 4 presented above. In Tables 7 and 8 we present the results obtained by means of the standard Cholesky decomposition of matrix G = LLᵀ. Having this factorization available, two sets of linear equations with triangular matrices are solved, the first one by means of the forward substitution and the second by the backward substitution. Our observations are as follows:

- The presented results of the sequential calculations seem to be encouraging.
- They failed only on the problems bcsstk20, bcsstk19, bcsstk11, bcsstk12, ex3, ex33 and plat1919. Let us stress that we have not used any scaling. Further numerical experimentation is necessary.
- The variant with the matrix/vector operations was usually faster than the n simultaneous directional minimizations. The accuracy was exactly the same (all digits of the solution and of the forward and backward errors were equal).
- The standard approach (Cholesky decomposition of matrix G followed by solving two sets of linear equations with triangular matrices) was usually more robust. In every case where the method proposed in the paper succeeded, the results were similar. The accuracy was sometimes slightly better in our approach, sometimes in the standard one. In the cases where our approach failed, the standard approach generated solutions with better quality indicators; however, they were not fully satisfactory. This comparison is not fully fair, because we compare the results of the factorization RD⁻¹Rᵀ of G⁻¹ with the LLᵀ decomposition of G.

Conclusions
In the paper we investigated a two-stage approach to solving strictly convex quadratic programming problems. In the first stage we generate the set of vectors mutually conjugate with respect to the hessian of the QP problem. It is followed in the second stage by n independent directional minimizations executed concurrently from the same starting point x_0. We think that this may be useful for distributed calculations in cloud computing.
Additionally, in that way we find a factorization of the inverse matrix G⁻¹ without calculating a factorization of G itself. It has the form RD⁻¹Rᵀ, where the columns of matrix R are the conjugate vectors found by the first stage algorithm. Knowledge of such an inverse factorization may be very useful for problems both from optimization (when we have a family of QP problems with different linear terms) and from linear algebra (with multiple right-hand sides). Our approach may also be used in conjunction with the one-sweep procedure of Bientinesi et al. (2008) for calculating the inverse of a symmetric, positive definite matrix. It may be applied to find the factorization of the particular Schur block components.
The first stage is essentially sequential. But our second idea, applying n independent directional minimizations of the associated quadratic programming problem, permits the distribution of calculations in the second stage. It should be stressed that the solution accuracy of the two algorithms tested in the second stage was exactly the same; all digits in the reported errors were equal.
There are still some open questions which require further investigation. The first is: what are the classes of sparse problems for which the approach presented in the paper will be better than the standard factorization of G followed by the solution of two sets of linear equations with triangular matrices? The second is how to apply these ideas to problems with block structure. Maybe the fact that the columns of the factor L in the decompositions G = LDLᵀ or G = LLᵀ are mutually conjugate with respect to G⁻¹ could serve as a hint. Another possibility is to combine the block approach of Bientinesi et al. (2008) with our approach applied to particular blocks. The factorization generated by means of our approach may also be used as the starting one in the refinement technique studied by Ogita and Oishi (2012) and Yanagisawa et al. (2014).
The reported computational results should be treated as preliminary. There are of course some issues which should be clarified. The most important are the identification of the suitable sparsity structure and the selection of appropriate values of the parameter c. We are not restricted to the choice c = 1, which was used in our test calculations. This may be especially important in establishing the numerical stability of the method. To solve that task, the most promising idea seems to be the normalization of the consecutive conjugate vectors.