1 Introduction

Computing eigenvalues and eigenvectors of matrices is probably the most prominent linear-algebra operation in quantum chemistry. Eigenvalues and eigenvectors are computed in the self-consistent field (SCF) algorithm [1,2,3], in restricted-step second-order optimizations [4], response theory calculations [5, 6], algebraic diagrammatic construction (ADC) [7, 8], and unitary coupled cluster (UCC) [9,10,11]. Moreover, they are calculated in ground and excited state configuration interaction (CI) calculations [12, 13], including in state-average complete active space self-consistent field (CASSCF) [14,15,16]. Quantum chemistry calculations are typically performed using localized basis sets, most commonly made of Gaussian-type orbitals (GTOs). Thanks to the compactness of such basis sets, the generalized eigenvalue problem in SCF can be solved using dense linear-algebra techniques. Despite the cubic scaling, diagonalizing the Fock (or Kohn-Sham) matrix is usually not a significant bottleneck for standard calculations, which are instead dominated by the cost of assembling the Fock matrix. Iterative diagonalization techniques are therefore mostly used for post-Hartree-Fock calculations, such as CASSCF and CI, where the combinatorially scaling size of the CI Hamiltonian makes it impossible to build it in memory, let alone diagonalize it, except in the smallest cases. They are also used to compute the direction and step length in second-order SCF and CASSCF strategies and to compute excited states in CI singles and time-dependent SCF. Most of these applications involve computing one or a small number (up to a few hundred) of eigenvalues and eigenvectors, and involve symmetric, diagonally dominant matrices.

Since its introduction in 1975, the iterative method proposed by Ernest R. Davidson has been the method of choice [17] and the default strategy used by the majority of quantum chemistry codes. Davidson’s diagonalization performs particularly well for diagonally dominant symmetric eigenvalue problems, for which it exhibits fast and robust convergence, and has been generalized to non-symmetric problems and to deal with multiple eigenvalues and eigenvectors [18, 19]. However, the method suffers from a few drawbacks. First, its performance degrades if the matrix is not diagonally dominant. Even though most algorithms in quantum chemistry involve diagonally dominant matrices, this is not always the case, with the parameter Hessian in second-order CASSCF being a prominent example. Second, as the vector subspace used in Davidson’s method to expand the sought eigenvectors can become quite large, especially if many eigenvalues are needed, the cost of diagonalizing the matrix projected in the subspace and of orthogonalizing the new trial vectors can become non-negligible, and memory requirements problematic.

The locally optimal block preconditioned conjugate gradient (LOBPCG) method offers a way out by more aggressively truncating the approximation space, while attempting to preserve the convergence properties of Davidson [20, 21]. Since its introduction, it has been successfully used in several applications, and in particular in approaches to the quantum many-body problem in condensed-matter physics [22, 23], but to our knowledge it has not permeated quantum chemistry to the same extent.

One important caveat with LOBPCG is its potential for numerical instabilities, which can degrade convergence significantly if care is not taken. This point was emphasized in [24], which describes an appropriately stable procedure to build an orthogonal basis of expansion vectors. This relies crucially on an ortho(X,Y) primitive, whose goal is to orthogonalize the vectors in X with respect to those in Y and among themselves. This turns out to be surprisingly hard to perform reliably in the presence of roundoff errors. The strategy suggested in [24] relies on the ortho(X,Y) function in [25], which is based on a modified singular value decomposition (SVD). This, however, can sometimes fail [26], and the use of the SVD can become expensive. For this reason, [26] suggests dropping ill-conditioned directions.

In this paper, we clarify the origins of numerical instabilities, in particular due to the ill-conditioned orthogonalizations and reuse of matrix–vector products. We then present an implementation that uses exclusively Cholesky decompositions for orthogonalizations. These decompositions are known to be efficient but unstable, and for this reason the Cholesky orthogonalization has often been regarded as impractical; however, it was recently realized that appropriate stabilizations based on repeated, potentially shifted factorizations, turn it into a completely reliable orthogonalization algorithm [27]. The resulting LOBPCG algorithm is extremely stable (converging, if desired, to close to machine precision, with no degradation in convergence rate), simple (with no dropping of directions, and with a single parameter controlling the accuracy of the orthogonalizations), and efficient.

We present two open-source implementations of the LOBPCG algorithm. One is diaglib, an open-source library written in Fortran 95 that also includes an implementation of Davidson’s method and that can be obtained at the following address: https://github.com/Molecolab-Pisa/diaglib. The library has been interfaced with the CFOUR suite of quantum chemistry programs [28, 29], which we have used to test its performance on a variety of test cases. The other is included in the DFTK Julia software package [30], supports generalized eigenvalue problems and GPUs, and is available at https://github.com/JuliaMolSim/DFTK.jl/. These implementations are more focused on numerical stability than on sheer speed, but are overall efficient, thanks to the extensive use of highly optimized BLAS and LAPACK routines.

We then test the LOBPCG implementation on a selection of quantum chemistry problems that include full CI, second-order SCF and CASSCF calculations. For such methods, we compare the performance of LOBPCG to that of Davidson’s method. Our preliminary results show that while Davidson diagonalization is more efficient for strongly diagonally dominant problems, such as those encountered in full CI, LOBPCG is a viable alternative, as it exhibits similar performance and, for difficult cases such as CASSCF, can even slightly outperform Davidson.

This paper is organized as follows. In Sect. 2, we present the LOBPCG algorithm. In Sect. 3, we discuss in detail the numerical implementation of LOBPCG, analyzing the possible numerical issues and proposing cost-effective, yet robust, solutions. In Sect. 4, we test our implementation on a few quantum chemistry applications. Concluding remarks are given in Sect. 5.

2 Theory

In this section, we describe the LOBPCG algorithm in infinite precision arithmetic, without consideration for numerical stability. In this paper, we will focus on real symmetric matrices, but everything of course extends to Hermitian complex matrices. Let \(\textbf{A} \in \mathbb {R}^{n\times n}\) be a symmetric matrix, for which we seek \(m \ll n\) eigenvalues. Given a set of vectors \(Y = (y_{1}, \dots , y_{p}) \in \mathbb R^{n \times p}\) with \(p \ge m\), the Rayleigh-Ritz variational procedure obtains an approximation \(\textrm{RR}(Y) \in \mathbb R^{n \times m}\) to the first m eigenpairs by building an orthonormal basis of Y, computing the \(p \times p\) matrix representation of A on that set of vectors, diagonalizing it and taking the first m eigenvectors.
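The Rayleigh-Ritz step described above can be sketched in a few lines of NumPy. This is only an illustration for a dense, explicitly stored matrix; the function name `rayleigh_ritz` is a hypothetical choice, and a plain QR factorization is assumed for building the orthonormal basis (numerical stability is deliberately ignored here, as in the rest of this section).

```python
import numpy as np

def rayleigh_ritz(A, Y, m):
    """Return the m lowest Ritz values and Ritz vectors of A on span(Y).

    A is a dense symmetric (n, n) array, Y an (n, p) array with p >= m.
    """
    Q, _ = np.linalg.qr(Y)            # orthonormal basis of span(Y)
    a = Q.T @ (A @ Q)                 # p x p representation of A on that basis
    a = 0.5 * (a + a.T)               # remove rounding asymmetry before eigh
    theta, u = np.linalg.eigh(a)      # eigenvalues returned in ascending order
    return theta[:m], Q @ u[:, :m]    # first m Ritz pairs
```

When Y spans the whole space, the Ritz values coincide with the exact eigenvalues, which gives a simple sanity check.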

To use this procedure to obtain an iterative algorithm, a way of constructing Y must be specified. A standard method for doing so is to use the residuals, defined for a vector x by \(r(x) = \textbf{A} x - \rho (x) x\), where \(\rho (x) = \tfrac{x^{T} \textbf{A} x}{x^{T} x}\) is the Rayleigh quotient of x. In ill-conditioned problems, however, these residuals might not be a good search direction, and it is therefore useful to precondition them according to

$$\begin{aligned} w = \textbf{T}r(x) \end{aligned}$$

where T is a given preconditioner (for instance, when \(\textbf{A}\) is diagonally dominant, \(\textbf{T} = (\textrm{diag}(\textbf{A}) - \rho (x) \textbf{I})^{-1}\)).

This choice of search direction results in the block Davidson algorithm, starting from an initial set of vectors \(X^{[0]} \in \mathbb R^{n \times m}\):

$$\begin{aligned} X^{[k+1]} = \textrm{RR}(X^{[0]}, \textbf{T}^{[0]}R^{[0]}, \dots , \textbf{T}^{[k]} R^{[k]}) \end{aligned}$$

where \(R^{[k]} = \textbf{A} X^{[k]} - X^{[k]}\Lambda ^{[k]}\) is the residual at the k-th iteration, and \(\Lambda ^{[k]}\) is the diagonal matrix composed of the Ritz values. Since the expansion subspace is only ever enlarged, the implementation is standard: each block of vectors added to the subspace is orthogonalized against the previous vectors and against itself (although see subtleties of this operation in the next section).
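As an illustration of the preconditioned residuals used as search directions, the following sketch computes the Rayleigh quotients, the residual block and the diagonally preconditioned block for a dense matrix. The function name and the `floor` safeguard against vanishing denominators are assumptions of this sketch, not part of the method as described.

```python
import numpy as np

def preconditioned_residuals(A, X, floor=1e-8):
    """Residuals R = A X - X diag(rho) and W = T R with T = (diag(A) - rho I)^{-1}.

    The preconditioner is applied column by column, each column using its own
    Rayleigh quotient rho. Denominators smaller than `floor` in absolute value
    are replaced by `floor` (a hypothetical safeguard).
    """
    AX = A @ X
    rho = np.sum(X * AX, axis=0) / np.sum(X * X, axis=0)   # Rayleigh quotients
    R = AX - X * rho                                       # residual block
    denom = np.diag(A)[:, None] - rho[None, :]
    denom = np.where(np.abs(denom) < floor, floor, denom)
    return rho, R, R / denom
```

For Ritz vectors produced by a Rayleigh-Ritz step, the Rayleigh quotients computed here coincide with the Ritz values.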

Since the Davidson method performs the Rayleigh-Ritz procedure on the full convergence history, its computational requirements can increase quickly. The LOBPCG algorithm instead only keeps the last two iterates:

$$\begin{aligned} X^{[k+1]} = \textrm{RR}(X^{[k-1]}, X^{[k]}, T^{[k]}(A X^{[k]} - X^{[k]} \Lambda ^{[k]})). \end{aligned}$$

The method is locally optimal (LO) because the Rayleigh-Ritz procedure optimizes the Rayleigh quotient in the local expansion subspace. It is a block algorithm (B) and uses a preconditioner (P). Finally, the intuition of keeping only the previous iterate comes from the conjugate gradient (CG) algorithm for solving linear systems. This algorithm can seem like a drastic truncation of the Davidson method. There is, however, reason to believe that it can converge asymptotically as quickly as the full Davidson algorithm, by analogy with the optimality in the Krylov space of the three-term conjugate gradient algorithm [20].

The convergence properties of this algorithm are sensitive to the gap between eigenvalues m and \(m+1\), which might be small. This is particularly clear in the case of a simplified version of the LOBPCG algorithm, the block gradient descent with fixed step (which might be termed “BG,” since it is obtained by removing the locally optimal, preconditioning and conjugate features of LOBPCG), where explicit convergence rates can be obtained easily [31]. Accordingly, as is standard, in practice one uses a block size m which is larger than the number of eigenvalues \(m_\textrm{sought}\) actually sought, and stops the algorithm as soon as the first \(m_\textrm{sought}\) eigenvalues are converged. The convergence rate is then dependent on the gap between eigenvalues \(m_\textrm{sought}\) and \(m+1\).

3 Implementation

3.1 The LOBPCG algorithm

When implementing the above algorithm on a computer, we face the difficulty that the basis \((X^{[k-1]}, X^{[k]}, T^{[k]}(A X^{[k]} - X^{[k]} \Lambda ^{[k]}))\) is extremely badly conditioned. This is because, as the iterations progress, \(X^{[k]}\) becomes close to \(X^{[k-1]}\), and the residual becomes small. Therefore, if we try to solve the Rayleigh-Ritz problem as a generalized eigenvalue problem, the results will be inaccurate. Instead, following [24], we systematically construct an orthogonal basis \((X^{[k]}, W^{[k]}, P^{[k]})\) of the expansion subspace spanned by \((X^{[k-1]}, X^{[k]}, T^{[k]}(A X^{[k]} - X^{[k]} \Lambda ^{[k]}))\). The block \(P^{[k]}\) is implicitly constructed as the orthogonalization of \(X^{[k]}\) against \(X^{[k-1]}\); the block \(W^{[k]}\) is constructed as the orthogonalization of \(T^{[k]}(A X^{[k]} - X^{[k]} \Lambda ^{[k]})\) against \(X^{[k]}\) and \(P^{[k]}\).

To obtain these orthogonal bases, we introduce the primitive ortho(X,Y) which, given a set of orthogonal vectors Y, returns an orthogonal basis of the projection of the vectors in X onto the space orthogonal to Y. In infinite precision arithmetic, this would be given by an orthogonalization of \(X - Y Y^{T} X\); in finite precision arithmetic, care has to be taken, as we will see in the next section. Given this primitive, the LOBPCG algorithm is given in Algorithm 1.

Algorithm 1 (figure a): the LOBPCG algorithm

3.1.1 Basis selection

In this algorithm, \(X^{[k]}, W^{[k]}, P^{[k]}\) are the \(n\times m\) matrices that contain the m desired eigenvectors (X) and the corresponding preconditioned residuals (W) and increments (P), and we denote with a \(\sim\) symbol the vectors before orthogonalization. The matrices \(AX^{[k]}, AW^{[k]}, AP^{[k]}\) contain the results of the application of \(\textbf{A}\) to such vectors. LOBPCG is a matrix-free algorithm, i.e., it does not require assembling and storing the matrix \(\textbf{A}\) in memory, but only the ability to perform the relevant matrix–vector multiplications. More details on the orthogonalization procedure are given in Sect. 3.2. The reduced matrix \(a^{[k]} \in \mathbb {R}^{3m\times 3m}\) is diagonalized using standard dense linear algebra routines (in our implementation, we use LAPACK’s dsyev).

In the block implementation of LOBPCG, computing \(\widetilde{P}^{[k]}\) as \(X^{[k]}-X^{[k-1]}\) can become problematic, as these vectors become smaller and smaller when approaching convergence, which can create numerical instabilities. As a more robust alternative, the \(\widetilde{P}^{[k]}\) vectors are computed in a different way. Let \(u_x^{[k]} \in \mathbb {R}^{3m\times m}\) be the first m eigenvectors of the reduced matrix. The eigenvectors have a block structure, that is

$$\begin{aligned} u_x^{[k]} = \left( \begin{array}{c} u_{xx}^{[k]} \\ u_{wx}^{[k]} \\ u_{px}^{[k]} \end{array} \right) \end{aligned}$$

where each block is an \(m\times m\) square matrix. The new eigenvectors are computed as (line 19)

$$\begin{aligned} X^{[k+1]} = X^{[k]}u_{xx}^{[k]} + W^{[k]}u_{wx}^{[k]} + P^{[k]}u_{px}^{[k]} \end{aligned}$$

To compute the \(P^{[k+1]}\) vectors, we first get the expansion coefficients \(\widetilde{u}_{p}^{[k]}\) of \(X^{[k+1]}-X^{[k]}\) in \(V^{[k]}\), which are obtained by subtracting the identity matrix from the unconverged components of \(u_{xx}^{[k]}\). Then, we orthogonalize them against \(u_x^{[k]}\) and use them to compute the new \(P^{[k+1]}\) vectors. Note that we assemble only the P vectors corresponding to the active eigenvectors, i.e., the ones that have not yet converged; for this reason, it is important to select the active columns before orthogonalizing \(\widetilde{u}_{p}^{[k]}\).
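To make the basis selection concrete, here is a minimal dense-matrix sketch of a LOBPCG loop built around the orthogonal basis (X, W, P) and the coefficient-based construction of P. It is not the diaglib or DFTK implementation: the names `orthonormalize` and `lobpcg_sketch` are hypothetical, a two-pass QR-based routine stands in for the ortho primitive, the products with A are recomputed rather than reused, there is no locking, and the clamp on the preconditioner denominators is an assumption of the sketch.

```python
import numpy as np

def orthonormalize(X, Y=None, passes=2):
    """QR-based stand-in for ortho(X) / ortho(X, Y); Y must be orthonormal."""
    for _ in range(passes):
        if Y is not None and Y.shape[1] > 0:
            X = X - Y @ (Y.T @ X)          # project out Y
        X, _ = np.linalg.qr(X)             # orthonormalize the remainder
    return X

def lobpcg_sketch(A, X0, tol=1e-8, max_iter=500):
    """Lowest m eigenpairs of a dense symmetric A; returns (values, vectors, iterations)."""
    n, m = X0.shape
    d = np.diag(A)
    X = orthonormalize(X0)
    lam, u = np.linalg.eigh(X.T @ (A @ X))  # initial Rayleigh-Ritz on the guess
    X = X @ u
    P = np.zeros((n, 0))                    # no P block at the first iteration
    for it in range(max_iter):
        R = A @ X - X * lam                 # residuals
        if np.linalg.norm(R, axis=0).max() < tol:
            return lam, X, it
        denom = d[:, None] - lam[None, :]   # diagonal preconditioner, clamped
        denom = np.where(np.abs(denom) < 1e-2, 1e-2, denom)
        W = orthonormalize(R / denom, np.hstack([X, P]))
        V = np.hstack([X, W, P])            # orthonormal expansion basis
        a = V.T @ (A @ V)
        theta, u = np.linalg.eigh(0.5 * (a + a.T))
        u_x = u[:, :m]                      # coefficients of the new X in V
        u_p = u_x.copy()
        u_p[:m, :] -= np.eye(m)             # coefficients of X_new - X_old in V
        u_p = orthonormalize(u_p, u_x)      # orthonormal and orthogonal to u_x
        X, P, lam = V @ u_x, V @ u_p, theta[:m]
    return lam, X, max_iter
```

Because the coefficient blocks `u_x` and `u_p` have orthonormal columns, the new X and P blocks inherit the orthonormality of V without forming the small difference between successive iterates explicitly.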

3.1.2 Reuse of applications

Apart from the choice of a basis and its orthogonalization, a numerically sensitive point is the reuse of the applications of \(\textbf{A}\). Since this is a potentially costly operation, it is not feasible to recompute, for instance, \(\textbf{A} P^{[k]}\) before the Rayleigh-Ritz procedure; instead, we use the fact that \(P^{[k]}\) is built as a linear combination of other vectors, for which the application of \(\textbf{A}\) is known. If this is done naively, however, it can result in a large error. This is because, in general, if \(\textbf{A} V\) is known to some precision \(\varepsilon\), then \((\textbf{A} V) u\) will be an approximation of \(\textbf{A}(Vu)\) with a precision of the order of \(\Vert u\Vert \varepsilon\).
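The bound can be made explicit with a one-line estimate. Assume the stored product is \(\widehat{\textbf{A}V} = \textbf{A}V + E\) with \(\Vert E\Vert \le \varepsilon\) (E is a symbol introduced here for the accumulated error), and ignore the rounding error of the multiplication by u itself. Then

$$\begin{aligned} \Vert \widehat{\textbf{A}V}\, u - \textbf{A}(Vu)\Vert = \Vert E u\Vert \le \Vert E\Vert \, \Vert u\Vert \le \varepsilon \Vert u\Vert . \end{aligned}$$

If u has orthonormal columns, \(\Vert u\Vert = 1\) in the spectral norm and the error is not amplified; if u is a product of ill-conditioned factors, \(\Vert u\Vert\) can be large and the error grows accordingly.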

Consider the problem of computing \(\textbf{A} P^{[k+1]}\), line 24 of the algorithm. In exact arithmetic, we could compute \(\widetilde{P}^{[k+1]} = V^{[k]} \widetilde{u}_{p}^{[k]}\), compute \(\textbf{A}\widetilde{P}^{[k+1]} = (\textbf{A} V^{[k]}) \widetilde{u}_{p}^{[k]}\), orthogonalize \(\widetilde{P}^{[k+1]}\) against \(X^{[k+1]}\) and update \(\textbf{A} \widetilde{P}^{[k+1]}\) accordingly, etc. This, however, amounts to obtaining \(\textbf{A} P^{[k+1]}\) by right-multiplying \(\textbf{A}V^{[k]}\) with a sequence of potentially ill-conditioned (and therefore of large norm) matrices, which incurs a large error on \(\textbf{A} P^{[k+1]}\). Instead, we directly obtain the expansion coefficients \(u_{p}^{[k]}\) of \(P^{[k+1]}\) on \(V^{[k]}\) and obtain \(\textbf{A} P^{[k+1]}\) as \((\textbf{A} V^{[k]}) u_{p}^{[k]}\). Since both \(P^{[k+1]}\) and \(V^{[k]}\) are orthogonal, so is \(u_{p}^{[k]}\), and therefore no precision is lost in the update \(AP^{[k+1]} = AV^{[k]}u_p^{[k]}\). The same is true for the update \(AX^{[k+1]} = AV^{[k]}u_x^{[k]}\) (line 19).

3.1.3 Locking

Another crucial aspect of an efficient and stable implementation concerns the treatment of converged eigenvectors. In our implementation, we freeze the first \(m_\textrm{conv}\) consecutive eigenvectors, which means that we only compute \(m_\textrm{act} = m - m_\textrm{conv}\) new residuals, W and P vectors. The converged eigenvectors are kept in X, to enforce the orthogonality of the active search subspace. This means that the reduced matrix and the V subspace dimensions are \(m + 2m_\textrm{act}\), and that only \(m_\textrm{act}\) matrix–vector multiplications are performed at each iteration, thus combining stability and efficiency.

3.2 A robust and stable ortho(X,Y) procedure

One of the most crucial steps in LOBPCG is the orthogonalization of a set of vectors against a given set and its subsequent orthonormalization. We first tackle the ortho(X) routine, which orthogonalizes a set of vectors.

3.2.1 The ortho(X) procedure

The gold standard for orthogonalizing a set of vectors is to compute the (thin) singular value decomposition of X and then take the left singular vectors. A slightly less expensive, yet very stable alternative, is to use the QR decomposition of X, which in our tests performs equivalently well. Another good option is the modified Gram-Schmidt algorithm. However, these algorithms can all become expensive, especially if a large number of eigenvalues are sought.

An alternative and cheaper strategy is to compute the Cholesky decomposition of the overlap matrix

$$\begin{aligned} X^{T} X = M = LL^T \end{aligned}$$

The orthogonal vectors can then be obtained by solving the triangular linear system

$$\begin{aligned} {\texttt {ortho}}(X) L^T = X. \end{aligned}$$

This is often more efficient, as it allows for greater parallelization and full use of BLAS3 routines.

This procedure works in infinite precision, but has two issues in finite precision. First, even after a first Cholesky orthogonalization, the vectors can fail to be orthogonal. Second, the Cholesky decomposition can fail, because M may not be positive-definite to machine precision; this happens when the condition number of X is larger than the square root of the inverse machine epsilon, about \(10^{8}\) in double-precision arithmetic.
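The \(10^{8}\) threshold follows from a short estimate. Writing \(\kappa\) for the 2-norm condition number and taking the double-precision unit roundoff \(\varepsilon \approx 2.2 \times 10^{-16}\), the overlap matrix squares the conditioning of X, and the factorization is expected to break down once the smallest eigenvalue of M is comparable to the rounding error on M:

$$\begin{aligned} \kappa (M) = \kappa (X^{T} X) = \kappa (X)^{2}, \qquad \kappa (X)^{2}\, \varepsilon \gtrsim 1 \;\Longleftrightarrow \; \kappa (X) \gtrsim \varepsilon ^{-1/2} \approx 6.7 \times 10^{7}. \end{aligned}$$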

Fortunately, there is a simple fix to the first problem: orthogonalize twice. This has been established to produce vectors orthogonal to machine precision [32]. For the second problem, following [27], we level shift the metric before its Cholesky decomposition by adding a small constant to its diagonal. Such a constant can be chosen very small (in our implementation, we start from 100 times the norm of X times the machine precision) and, if the decomposition still fails, increased until the Cholesky decomposition is successful. In our tests, the shifted decomposition never failed, and therefore, with at most 4 Cholesky orthogonalizations (one failed unshifted, one shifted, then two unshifted) we are guaranteed to obtain vectors that are orthogonal to machine precision. In practice, far fewer are often needed: failures of the first Cholesky orthogonalization have been observed only exceptionally. A pseudo-code for the ortho procedure is given in Algorithm 2.

Algorithm 2 (figure b): the ortho(X) procedure

The algorithm as given is somewhat wasteful, as in the common case where only one or two successful Cholesky factorizations are needed it recomputes the overlap to check for termination. This can potentially be alleviated by computing a cheap estimate of the norm of \(L^{-T}\): if this is moderate, then the new vectors are orthogonal to a good accuracy, and a new round is unnecessary.
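The repeated, possibly shifted Cholesky orthogonalization described in this subsection can be sketched as follows. This is an illustrative NumPy version, not the diaglib routine: the function name, the tenfold shift increase, the cap on the number of rounds and the use of a general linear solve in place of a dedicated triangular solve are all assumptions of the sketch. It uses the simple termination test that recomputes the overlap at every round.

```python
import numpy as np

def ortho(X, tol=1e-14, max_rounds=10):
    """Orthonormalize the columns of X by repeated (shifted) Cholesky factorizations."""
    X = np.array(X, dtype=float, copy=True)
    m = X.shape[1]
    eps = np.finfo(float).eps
    for _ in range(max_rounds):
        M = X.T @ X                                  # overlap matrix
        if np.max(np.abs(M - np.eye(m))) < tol:      # already orthonormal
            break
        shift = 0.0                                  # try unshifted first
        while True:
            try:
                L = np.linalg.cholesky(M + shift * np.eye(m))
                break
            except np.linalg.LinAlgError:
                # start from 100 * eps * ||X||, then increase (assumed factor 10)
                if shift > 0.0:
                    shift *= 10.0
                else:
                    shift = max(100.0 * eps * np.linalg.norm(X), eps)
        # Solve ortho(X) L^T = X, i.e. L ortho(X)^T = X^T; a triangular solver
        # would be used in practice, np.linalg.solve is used here for brevity.
        X = np.linalg.solve(L, X.T).T
    return X
```

A shifted round does not produce orthonormal vectors by itself; it only improves the conditioning enough for the following unshifted rounds to succeed.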

3.2.2 The ortho(X,Y) procedure

Using the previous orthogonalization algorithm, we could implement the \(\texttt{ortho}(X,Y)\) as \(\texttt{ortho}(X - Y Y^{T} X)\). This is however numerically unstable: if \(X - Y Y^{T} X\) is of order \(\delta\) (because X was almost in the range of Y), then the orthogonalization above will multiply it by a factor of order \(1/\delta\), meaning that \(Y^{T} \textrm{ortho}(X - Y Y^{T} X)\) will be of order \(\varepsilon /\delta\), where \(\varepsilon\) is the machine epsilon, and the vectors will not be sufficiently orthogonal to Y. To avoid this, we use a loop: first project out Y, then orthogonalize, iteratively until convergence. In practice, two steps are usually enough to achieve convergence.

The algorithm for the ortho(X,Y) procedure is given in Algorithm 3.

Algorithm 3 (figure c): the ortho(X,Y) procedure

Similar to before, this algorithm is relatively wasteful in the common case where one or two passes are enough, because it recomputes \(Y^{T}X\) to check for termination. This can be remedied by monitoring the growth factor of ortho(X) (the maximum amplification of errors in X caused by the ortho routine, and therefore a measure of the lack of Y-orthogonality after one iteration), and exiting the loop when that number is moderate.
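The project-then-orthogonalize loop can be sketched as below. To keep the example self-contained, a plain QR factorization is used as a stand-in for the ortho(X) routine; the function name, the cap on the number of passes and the explicit recomputation of \(Y^{T}X\) as termination test are assumptions of this sketch.

```python
import numpy as np

def ortho_against(X, Y, tol=1e-14, max_rounds=10):
    """Orthonormal basis of the part of span(X) orthogonal to Y (Y orthonormal)."""
    for _ in range(max_rounds):
        X = X - Y @ (Y.T @ X)          # project out Y
        X, _ = np.linalg.qr(X)         # stand-in for ortho(X)
        if np.max(np.abs(Y.T @ X)) < tol:
            break                      # orthogonal to Y to the requested accuracy
    return X
```

When X lies almost in the range of Y, the first pass leaves a residual overlap of order \(\varepsilon /\delta\) with Y, and the second pass removes it.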

3.3 Generalized eigenvalue problems

Missing from this algorithm is a discussion of generalized eigenvalue problems, simply because they are not often encountered in quantum chemistry. In the generalized eigenvalue problem, one solves \(\textbf{A} x = \lambda \textbf{B} x\), where \(\textbf{B}\) is a symmetric positive definite matrix. Eigenvectors are orthogonal with respect to the modified inner product \(\langle x, y \rangle _{\textbf{B}} = x^{T} \textbf{B} y\). The theoretical LOBPCG algorithm is unchanged except for the fact that residuals are now \({\textbf{A}} x - \lambda {\textbf{B}} x\), and that all orthogonalizations are with respect to the modified inner product.

Our practical algorithm has to be modified by keeping a \(\textbf{B}\)-orthogonal basis V. This could be done by using the \({\textbf {B}}\) inner product in the orthogonalization of W, line 26 of the algorithm above, and maintaining the values of \(\textbf{B} V\) along the iterations. However, a naive implementation of this step requires either multiple applications of \(\textbf{B}\) (twice per iteration, on \(\widetilde{W}^{[k+1]}\) and on \(W^{[k+1]}\)) or potentially unsafe reuses of applications of \(\textbf{B}\). As a compromise, a good option is to use the intermediate quantity

$$\begin{aligned} \widehat{W}^{[k+1]} = \texttt{ortho}(\widetilde{W}^{[k+1]}, (\textbf{B} X^{[k+1]}, \textbf{B} P^{[k+1]})), \end{aligned}$$

which is \(\textbf{B}\)-orthogonal to \(X^{[k+1]}\) and \(P^{[k+1]}\), but whose vectors are only orthogonal (and not \(\textbf{B}\)-orthogonal) to each other. This set of vectors \(\widehat{W}^{[k+1]}\) is, however, well conditioned (with respect to the \(\textbf{B}\)-inner product). We can therefore \(\textbf{B}\)-orthogonalize it to compute \(W^{[k+1]} = \widehat{W}^{[k+1]} L^{-T}\) with L a well-conditioned matrix; it is then safe to reuse the \(\textbf{B}\) application as \(\textbf{B} W^{[k+1]} = (\textbf{B} \widehat{W}^{[k+1]}) L^{-T}\). This appeared to perform very well in our tests, even if \(\textbf{B}\) itself was not well-conditioned. If more stability is needed, additional applications of \(\textbf{B}\), rather than reuses, appear necessary.
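The compromise described above can be illustrated with a short dense-matrix sketch. The function names are hypothetical, a two-pass QR-based routine stands in for ortho(X,Y), and the block (BX, BP) is first orthonormalized by QR so that the plain projection is valid; that last step is an assumption of this sketch rather than something stated in the text.

```python
import numpy as np

def project_and_orthonormalize(X, Y, passes=2):
    """Stand-in for ortho(X, Y); Y must have orthonormal columns."""
    for _ in range(passes):
        X = X - Y @ (Y.T @ X)
        X, _ = np.linalg.qr(X)
    return X

def b_orthonormal_w(W_tilde, BX, BP, B):
    """Return W (B-orthonormal, B-orthogonal to X and P) and B W, applying B once."""
    Y, _ = np.linalg.qr(np.hstack([BX, BP]))          # orthonormal basis of span(BX, BP)
    W_hat = project_and_orthonormalize(W_tilde, Y)    # orthogonal to BX and BP
    BW_hat = B @ W_hat                                # single application of B
    G = W_hat.T @ BW_hat                              # B-overlap of W_hat
    L = np.linalg.cholesky(0.5 * (G + G.T))
    W = np.linalg.solve(L, W_hat.T).T                 # W = W_hat L^{-T}
    BW = np.linalg.solve(L, BW_hat.T).T               # reuse: B W = (B W_hat) L^{-T}
    return W, BW
```

Since the columns of `W_hat` are orthonormal in the standard inner product, the conditioning of their B-overlap is bounded by that of B restricted to their span, which is what keeps the reuse in the last line benign in this sketch.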

4 Numerical experiments

To test our implementation of LOBPCG, and to compare its performance with that of the block-Davidson method, we interfaced the diaglib library with the CFOUR quantum chemistry package [28, 29]. As typical problems where an iterative procedure to compute one or a few eigenvectors is required, we selected three different test cases, arising from the full CI Hamiltonian, quadratically convergent self-consistent field (SCF) and quadratically convergent complete active space self-consistent field (CASSCF). In all the calculations, we use a threshold of \(10^{-14}\) for the ortho(X,Y) and ortho(X) procedures. Together with the convergence threshold for the eigenvectors, these are the only parameters that control the LOBPCG calculation. For Davidson, we use a subspace dimension of 25, i.e., we keep in memory up to 25 vectors per eigenvector in the history. For both algorithms, we exploit a locking procedure for the converged eigenvectors.

4.1 Full CI calculations

We compute the first few totally symmetric electronic states of water at the full CI level of theory, using a direct, determinant-based CI implementation. The full CI Hamiltonian is sparse and diagonally dominant, but extremely large, and thus provides a good test case for well-behaved, large, sparse systems. Furthermore, the iterative solution of (full) CI problems is quite a common task in quantum chemistry, as it is encountered in CASCI/CASSCF and truncated CI (including for excited states at the CI singles level of theory). We use Pople’s 6-31G\(^*\) basis set [33] and perform both all-electron and frozen-core calculations, thus correlating 10 electrons in 18 orbitals (18 360 640 determinants) or 8 electrons in 17 orbitals (1 416 732 determinants). We seek 10, 20 or 50 eigenpairs. Convergence is achieved when the root-mean-square norm of the residual is smaller than \(10^{-9}\), and its maximum absolute value is smaller than \(10^{-8}\). For LOBPCG, we seek 5 additional eigenpairs, as numerical tests showed that this improves convergence and, despite the additional matrix–vector products required, improves overall performance. Note that we do not check that the additional eigenvalues are converged, as they are only used to increase the expansion subspace. No additional eigenpairs are sought for Davidson, as this choice showed the overall best performance. A brief description of the process that led us to these choices is reported in the Supporting Information.

Fig. 1 All-electron full CI calculations for water using Davidson or LOBPCG. Total timings (left panel) and cumulative time for the Rayleigh-Ritz and orthogonalization procedures (right panel)

Fig. 2 Frozen-core full CI calculations for water using Davidson or LOBPCG. Total timings (left panel) and cumulative time for the Rayleigh-Ritz and orthogonalization procedures (right panel)

It comes as no surprise that Davidson outperforms LOBPCG for full CI calculations. As the full CI Hamiltonian is strongly diagonally dominant for closed-shell systems, it is an ideal scenario for Davidson, an algorithm originally conceived for exactly this problem. Nevertheless, the performance of LOBPCG is comparable, the latter algorithm being about 20–30% slower than the former. It is interesting to note that while for LOBPCG the cost of the Rayleigh-Ritz and orthogonalization procedures is overall negligible with respect to the cost of computing matrix–vector multiplications, this is not the case for Davidson. Keeping up to 25 vectors in the history comes with a cost that can be clearly seen in the right panels of Figs. 1 and 2. On the other hand, the larger subspace used in Davidson’s method allows for faster convergence, which is achieved in 28, 37 and 24 iterations for the calculation seeking 10, 20 and 50 eigenpairs, respectively, both for the all-electron and the frozen-core calculations. This has to be compared with 26, 41 and 45 iterations for LOBPCG, again for both sets of calculations. The long history is, however, also a limitation for Davidson, as the amount of memory required to perform a calculation can become very high. As an example, the largest calculation performed (all-electron, 50 states) required about 356 GB of memory for Davidson, to be compared with 55 GB for LOBPCG. Using a smaller subspace dimension in Davidson is of course possible, but such a size must be chosen with some care. To better illustrate this point, we repeated the Davidson calculations using a maximum of 10 points in the history: no calculation fully converged within 100 iterations, with 1, 3 and 2 non-converged roots. Therefore, while Davidson is optimal for full CI calculations if enough memory is available to use a large expansion subspace, LOBPCG can be seen as a competitive alternative when this is not the case.

4.2 Quadratically convergent SCF calculations

Fig. 3 Orbital rotation Hessian diagonalization at SCF convergence (left panel) and with extended Hückel guess orbitals (right panel). The number of iterations required to achieve convergence is reported as a function of the number of sought eigenpairs

In quadratically convergent implementations of the self-consistent field, the Hartree-Fock wavefunction is optimized by using a second-order method. In CFOUR, this is done using an efficient numerical realization of the Levenberg-Marquardt method [34], known as norm-extended optimization, where the step is computed from the lowest eigenpair of the (augmented) orbital-rotation Hessian [35]. The same matrix is used in response calculations and for the analysis of the stability of the Hartree-Fock wavefunction, which again requires computing a few eigenvalues and eigenvectors of the orbital-rotation Hessian. For closed-shell systems, such a matrix is dense, but typically diagonally dominant. While computing and storing in memory the full Hessian is possible, such a task is expensive and exhibits a steep scaling in computational cost with respect to the system’s size, as assembling it requires a costly partial integral transformation. Direct implementations are therefore usually preferred. To compare Davidson to LOBPCG, we compute the first few eigenpairs (up to 10) of the Hessian after convergence of the SCF procedure for a transition metal complex, FeC(CO)\(_3\), using Dunning’s cc-pVDZ basis set. To provide an example of a somewhat more challenging case, we further repeat the calculations at the beginning of the SCF procedure, that is, using orbitals computed with an extended Hückel guess. Far from convergence, the Hessian has many negative eigenvalues and is not guaranteed to be as diagonally dominant as with fully converged orbitals. In the following calculations, convergence is achieved when the root-mean-square norm of the residual is smaller than \(10^{-7}\), and its maximum absolute value is smaller than \(10^{-6}\). Such a threshold is adequate for stability analysis, and also for the first steps of a second-order optimization procedure.
For both Davidson and LOBPCG, to improve convergence, we seek twice the number of eigenpairs required and stop the calculation when just the required ones are converged. The size of the subspace for Davidson is 25. For the calculations at SCF convergence, where the Hessian is strongly diagonally dominant (Fig. 3, left panel), Davidson again outperforms LOBPCG. On the other hand, the less well-behaved case (Fig. 3, right panel) shows a different picture. LOBPCG and Davidson exhibit very similar behavior, with LOBPCG even outperforming Davidson in a few cases. As orbital rotation Hessians are rarely very large matrices (for the case reported here, the size is 3849), using large expansion subspaces in Davidson can probably further improve convergence, but for difficult cases, LOBPCG can be a valid alternative.

We report in Fig. 4 (left panel) the root-mean-square residual as a function of the iterations for the computation of the first eigenpair of the orbital rotation Hessian. As expected, the behavior of the two algorithms is analogous.

We would like to underline here the fact that LOBPCG manages to behave similarly to Davidson despite the very small size of the subspace used for the Rayleigh-Ritz procedure. To show how remarkable this is, we report in Fig. 4 (right panel) a comparison between LOBPCG and Davidson, where for the latter we use a three-dimensional subspace, that is, the same dimension used in LOBPCG. Davidson eventually manages to converge; however, it requires a much larger number of iterations. While this is purely an academic example, as such small expansion subspaces are never used in practice, it testifies to the effectiveness of the LOBPCG three-term sequence.

Fig. 4 Root-mean-square (RMS) of the residual along the iterations for the diagonalization of the orbital rotation Hessian at SCF convergence when seeking one eigenpair. The subspace for the Davidson algorithm is of size 25 (left panel) and 3 (right panel)

4.3 CASSCF calculations

Fig. 5 Root-mean-square (RMS) of the residual along the iterations when seeking one eigenpair (left panel) and number of iterations to achieve convergence as a function of the number of sought eigenpairs (right panel) for the diagonalization of the CASSCF augmented Hessian at the first macroiteration

CASSCF calculations can be very challenging from a numerical point of view, which makes second-order methods particularly attractive [4, 36,37,38]. In CFOUR, the same technique used for Hartree-Fock, namely the norm-extended optimization (NEO) algorithm, is used. The (augmented) Hessian in CASSCF is made of a dense, typically quite ill-conditioned, medium-sized block for the orbital optimization and a large, sparse, usually diagonally dominant block that corresponds to the Hamiltonian in the CAS space. Even for well-behaved systems, computing the NEO step, which in turn requires computing the first eigenpair of the augmented Hessian, can be challenging. To illustrate this, we report calculations on niacin (vitamin B3), a small conjugated organic molecule. We correlate all the \(\pi\) electrons, resulting in a CAS(6,6) calculation, and we employ Pople’s 6-31G* basis set [33]. Symmetry-broken unrestricted natural orbitals are used as a guess [39, 40]. The system is very well behaved, and convergence of the wavefunction is achieved in just 3 second-order iterations. Nevertheless, the iterative calculation of the step, i.e., the lowest eigenvector of the augmented Hessian, can still be challenging. We report in the left panel of Fig. 5 the convergence pattern for Davidson and LOBPCG at the first second-order iteration. LOBPCG outperforms Davidson at every second-order iteration, requiring 27, 25 and 23 iterations at the first, second and third second-order step, compared with 32, 28 and 31 iterations for Davidson. When more than one eigenpair is required, as is the case for state-specific excited state calculations, LOBPCG remains the best performing algorithm, as reported in the right panel of Fig. 5.
This behavior is not surprising, as the CASSCF augmented Hessian is not diagonally dominant, and is consistent with what was observed for second-order SCF calculations starting with extended Hückel orbitals. Davidson’s performance may be improved by increasing the size of the expansion subspace. This is, however, not the best option, as for large active spaces the size of the Hessian can become very large, comparable with the sizes reported for full CI calculations. Using very large expansion subspaces is therefore very demanding in terms of memory, and can make the orthogonalization expensive. LOBPCG therefore seems a more robust choice for this specific problem.

4.4 Preconditioning

To improve the convergence of both LOBPCG and Davidson, one can devise different types of preconditioners. In most quantum chemistry applications, computing and storing in memory the matrix of which one seeks one or a few eigenpairs is prohibitively expensive, which forces the choice of a Jacobi (diagonal) preconditioner in most cases. This is definitely the case for the full CI and CASSCF examples shown in the previous sections. However, for both second-order SCF and CASSCF, we have implemented the explicit construction of the orbital rotation Hessian, mainly as a debug option, which allows us to perform a few numerical experiments. In particular, we compare three possible choices and focus on the CASSCF orbital-rotation Hessian as a test case, as it is notoriously ill-conditioned and we therefore expect a more advanced preconditioning strategy to be particularly beneficial. For these examples, we only seek to compute one eigenpair. The first preconditioner that we test, as in the previous sections, approximates the matrix by its diagonal. The second one, referred to here as tridiagonal, improves upon the diagonal approximation by also including the first upper and lower off-diagonal elements. These two options should perform particularly well for diagonally dominant matrices. As a third choice, we propose a sparse approximation M to the matrix A, that is

$$\begin{aligned} M_{ij} = {\left\{ \begin{array}{ll} A_{ij},&{} \text {if } |A_{ij}|>\text {tol or}\quad i=j\\ 0, &{} \text {otherwise} \end{array}\right. } \end{aligned}$$

where we set tol equal to 0.5 and decrease it to 0.1 as soon as the root-mean-square norm of the residual is close to convergence. For both the second and third options it is necessary to solve a linear system. For the tridiagonal case we simply exploit a LAPACK routine which performs Gaussian elimination with partial pivoting. In the case of the sparse M matrix, instead, the linear system is solved using the incomplete LU (iLU) decomposition [41] as implemented by Saunders et al. [42]. In Table 1 we compare the behavior of LOBPCG and Davidson when changing the preconditioner. As expected, the preconditioner based on the sparse approximation of A is the one that performs best. However, such an option may become expensive, as it requires storing the full matrix in core or at least having a heuristic procedure to estimate its elements. From this simple-minded experiment, we note that both methods benefit from better preconditioners, with LOBPCG exhibiting slightly more marked improvements. However, assembling such preconditioners is expensive, as it requires building the matrix, or at least some approximation to it, in memory, which in many practical cases is far too demanding. Given the relatively small beneficial effect of going beyond a diagonal preconditioner, we believe that the latter is the optimal compromise choice.
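The three approximations compared here can be sketched for a small dense matrix as follows. This is an illustration only and makes several assumptions: the function names are hypothetical, the optional `shift` argument (for example, the current eigenvalue estimate) is not specified in the text, and a dense `numpy.linalg.solve` is used as a stand-in for both the LAPACK tridiagonal solver and the incomplete LU factorization used in the actual experiments.

```python
import numpy as np

def diagonal_approx(a):
    """Jacobi choice: keep only the diagonal of A."""
    return np.diag(np.diag(a))

def tridiagonal_approx(a):
    """Keep the diagonal and the first upper and lower off-diagonals."""
    i, j = np.indices(a.shape)
    return np.where(np.abs(i - j) <= 1, a, 0.0)

def sparse_approx(a, tol):
    """M_ij = A_ij if |A_ij| > tol or i == j, and 0 otherwise."""
    keep = np.abs(a) > tol
    np.fill_diagonal(keep, True)
    return np.where(keep, a, 0.0)

def apply_preconditioner(m, r, shift=0.0):
    """Solve (M - shift*I) z = r; the dense solve stands in for the
    banded and iLU solvers, and the shift is an assumption."""
    return np.linalg.solve(m - shift * np.eye(m.shape[0]), r)
```

With tol set to 0.5 and later to 0.1, as in the text, `sparse_approx` retains more off-diagonal elements as the iteration approaches convergence, at the price of a denser linear system.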

Table 1 Total number of iterations required to converge the lowest eigenvalue of the CASSCF orbital-rotation Hessian at the first macroiteration using the LOBPCG and Davidson solvers with three different preconditioners

5 Conclusions and perspectives

In this contribution, we have described an efficient and numerically robust implementation of LOBPCG, available both in the DFTK plane-wave density functional program and in the open-source library diaglib. We have discussed in detail how to avoid numerical problems and error propagation and presented a cost-effective, yet stable strategy to orthogonalize a set of vectors using the Cholesky decomposition of the overlap matrix. We have then compared the resulting implementation to Davidson’s method for a selection of test cases in quantum chemistry. Davidson’s method is the de facto standard for solving large eigenvalue problems in quantum chemistry, and for good reasons. As many such problems are characterized by strongly diagonally dominant matrices, Davidson’s method exhibits reliable, fast convergence in these cases. This comes, however, at a price. To be efficient, Davidson’s method requires a rather large expansion subspace, which can become cumbersome for large-scale calculations, both in terms of memory requirements and computational effort in the orthogonalization step. Furthermore, the method has some difficulties dealing with non-diagonally dominant matrices, such as those encountered in CASSCF calculations. For all these reasons, LOBPCG represents a valid alternative. Due to its low memory requirements, it can be used to treat systems for which deploying Davidson’s method would be too costly. It can also serve as a backup method in cases where Davidson fails or, for particularly hard cases such as CASSCF, as the default method. The implementation in diaglib is free and accessible, and can be used under the terms of the LGPL v2.1 license, while the one in DFTK is available under the MIT license. It is our hope that it will provide a useful tool to the community of quantum chemistry developers.
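As a reminder of the basic idea behind Cholesky-based orthogonalization, the following minimal NumPy sketch orthonormalizes the columns of a block of vectors through the Cholesky factor of their overlap matrix. It is not the diaglib or DFTK routine: the function name is hypothetical, and repeating the procedure a second time (the `passes` argument) is a common safeguard against loss of orthogonality that is assumed here rather than taken from the implementations discussed above.

```python
import numpy as np

def cholesky_orthonormalize(x, passes=2):
    """Return Q with orthonormal columns spanning the columns of x."""
    q = np.array(x, dtype=float)
    for _ in range(passes):
        s = q.T @ q                        # overlap matrix S = Q^T Q
        chol = np.linalg.cholesky(s)       # S = L L^T, L lower triangular
        q = np.linalg.solve(chol, q.T).T   # Q <- Q L^{-T}, so Q^T Q = I
    return q
```

The cost is dominated by matrix-matrix products, which is what makes the approach attractive compared with Gram-Schmidt for blocks of vectors; it fails if the overlap matrix is numerically singular, a case that a robust implementation has to handle separately.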

Our numerical experiments highlight the degradation of the convergence rate of the Davidson method as the history size is truncated. On the other hand, LOBPCG is able to maintain a good convergence rate (although slightly inferior to untruncated Davidson) with a subspace of size 3N. It would be an interesting topic of further research to devise a method that is able to interpolate between the two, being able to use a large history size if available, but preserving the good behavior of LOBPCG when used with a smaller history size. This could pave the way toward a fully adaptive method that truncates the history size dynamically based on available information.