1 Introduction

Since the discovery of high-\(T_c\) superconductors, many physicists have tried to understand the mechanism behind the superconductivity. It is believed that strong electron correlations underlie the phenomenon; however, the exact mechanism is not yet fully understood. One of the numerical approaches to the problem is the exact diagonalization method. In this method the eigenvalue problem is solved for the Hamiltonian derived exactly from the Hubbard model [1, 2], which is a model of a strongly-correlated electron system. When we solve for the ground state (the smallest eigenvalue and its corresponding eigenvector) of the Hamiltonian, we can understand its properties at absolute zero (−273.15 \(^\circ \mathrm {C}\)). Many computational methods for the problem have been proposed. Since the Hamiltonian from the Hubbard model is a huge sparse symmetric matrix, an iterative method, such as the Lanczos method [3] or the LOBPCG method (see Fig. 1) [4, 5], is usually utilized for solving the eigenvalue problem.

Fig. 1. Algorithm of the LOBPCG method for matrix A. Here the matrix T is a preconditioner.

The convergence of the LOBPCG method depends strongly on the use of a preconditioner. We previously confirmed that the zero-shift point Jacobi preconditioner, which is a shift-and-invert preconditioner [6] using an approximate eigenvalue, gives excellent convergence properties for the Hubbard model with the trapping potential [7]. However, we also reported that the benefit of the preconditioner strongly depends on the characteristics of the non-zero elements of the Hamiltonian and that the preconditioner does not always improve the convergence [8]. Therefore, we proposed a novel preconditioner using the Neumann expansion for solving the ground state of the Hamiltonian and demonstrated that this preconditioner improves convergence for a Hamiltonian that is difficult to solve with the zero-shift point Jacobi preconditioner [8]. Moreover, we applied to the preconditioner a communication avoiding strategy that was developed considering the properties of the Hubbard model.

In order to understand strongly correlated electron systems in more detail, in particular their properties at temperatures near absolute zero, we must solve for the several smallest eigenvalues and corresponding eigenvectors of the Hamiltonian. The LOBPCG method can solve for multiple eigenvalues by using a block of vectors.

In this paper, we extend the Neumann expansion preconditioner to the LOBPCG method for solving multiple eigenvalues and corresponding eigenvectors. Moreover, we demonstrate that the preconditioner improves the convergence properties and can achieve excellent parallel performance.

The paper is structured as follows. In Sect. 2 we briefly introduce related work for solving the ground state of the Hubbard model using the LOBPCG method. Section 3 describes the use of the Neumann expansion preconditioner with the communication avoiding strategy for solving for multiple eigenvalues and their corresponding eigenvectors. Section 4 demonstrates the parallel performance of the algorithm on the SGI ICE X and K supercomputers. A summary and conclusions are given in Sect. 5.

2 Related Work

2.1 Hamiltonian-Vector Multiplication

When solving the ground state of a symmetric matrix using the LOBPCG method, the most time-consuming operation is the matrix-vector multiplication. The Hamiltonian derived from the Hubbard model (see Fig. 2) is

$$\begin{aligned} H =- t\sum _{i,j,\sigma } c_{j \sigma }^{\dagger } c_{i \sigma } + \sum _i U_i n_{i\uparrow } n_{i \downarrow }, \end{aligned}$$
(1)

where t is the hopping parameter from one site to another, and \(U_i\) is the repulsive energy for double occupation of the i-th site by two electrons [1, 2, 7]. Quantities \(c_{i \sigma }\), \(c_{i \sigma }^{\dagger }\) and \(n_{i \sigma }\) are the annihilation, creation, and number operators of an electron with pseudo-spin \(\sigma \) on the i-th site, respectively. The indices in formula (1) for the Hamiltonian denote the possible states for electrons in the model. The dimension of the Hamiltonian for the \(n_s\)-site Hubbard model is

$$\left( \begin{array}{c}{n_s}\\ {n_\uparrow } \end{array}\right) \times \left( \begin{array}{c}{n_s}\\ {n_\downarrow } \end{array}\right) ,$$

where \(n_\uparrow \) and \(n_\downarrow \) are the number of the up-spin and down-spin electrons, respectively.
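As a quick check of the dimension formula, the following minimal sketch evaluates it for the 4 \(\times \) 5-site model with 5 up-spin and 5 down-spin electrons that is used in Sect. 4.1. The function name `hubbard_dimension` is a hypothetical helper introduced only for this illustration.

```python
from math import comb


def hubbard_dimension(n_sites: int, n_up: int, n_down: int) -> int:
    """Dimension of the Hamiltonian: C(n_s, n_up) * C(n_s, n_down)."""
    return comb(n_sites, n_up) * comb(n_sites, n_down)


# 4 x 5 = 20 sites, 5 up-spin and 5 down-spin electrons (Sect. 4.1).
dim_20_5_5 = hubbard_dimension(20, 5, 5)
print(dim_20_5_5)  # 240374016, i.e. about 240 million
```

The result agrees with the "about 240 million" quoted for this model in Sect. 4.1.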

Fig. 2. A schematic figure of the 2-dimensional Hubbard model, where t is the hopping parameter and U is the repulsive energy for double occupation of a site. Up arrows and down arrows correspond to up-spin and down-spin electrons, respectively.

The diagonal element in formula (1) is derived from the repulsive energy \(U_i\) in the corresponding state. The hopping parameter t affects non-zero elements with column-index corresponding to the original state and row-index corresponding to the state after hopping. Since the ratio U / t greatly affects the properties of the model, we have to execute many simulations varying this ratio to reveal the properties of the model.

When considering the physical properties of the model, we can split the Hamiltonian-vector multiplication as

$$\begin{aligned} Hv = Dv+(I_{\downarrow } \otimes A_{\uparrow })v +(A_{\downarrow } \otimes I_{\uparrow })v, \end{aligned}$$
(2)

where \(I_{\uparrow (\downarrow )}\), \(A_{\uparrow (\downarrow )}\) and D are the identity matrix, a sparse symmetric matrix derived from the hopping of an up-spin electron (a down-spin electron), and a diagonal matrix from the repulsive energy, respectively [7]. Since there is no regularity in the state change by electron hopping, the distribution of non-zero elements in matrix \(A_{\uparrow (\downarrow )}\) is irregular.

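Next, a matrix V is constructed from a vector v by the following procedure. First, decompose the vector v into \(m_{\downarrow }\) blocks of length \(m_{\uparrow }\), and arrange the blocks in a two-dimensional manner as the columns of V,

$$ v=(v_{1,1},\ldots ,v_{m_{\uparrow },1},\;v_{1,2},\ldots ,v_{m_{\uparrow },2},\;\ldots ,\;v_{1,m_{\downarrow }},\ldots ,v_{m_{\uparrow },m_{\downarrow }})^T, \qquad V=\left( \begin{array}{cccc} v_{1,1} &amp; v_{1,2} &amp; \cdots &amp; v_{1,m_{\downarrow }}\\ v_{2,1} &amp; v_{2,2} &amp; \cdots &amp; v_{2,m_{\downarrow }}\\ \vdots &amp; \vdots &amp; &amp; \vdots \\ v_{m_{\uparrow },1} &amp; v_{m_{\uparrow },2} &amp; \cdots &amp; v_{m_{\uparrow },m_{\downarrow }} \end{array}\right) , $$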

where \(m_{\uparrow }\) and \(m_{\downarrow }\) are the dimensions of the Hamiltonian for up-spin and down-spin electrons, i.e.

$$m_{\uparrow }= \left( \begin{array}{c}{n_s}\\ {n_\uparrow }\end{array}\right) , m_{\downarrow }=\left( \begin{array}{c}{n_s}\\ {n_\downarrow } \end{array}\right) .$$

The subscripts on each element of v formally indicate the row and column within the matrix V. Therefore V is a dense matrix. The diagonal elements \(d_k\) of the matrix D are arranged in the same manner to define a new matrix \(\bar{D}\). The multiplication in Eq. (2) can then be written as

$$\begin{aligned} V_{i,j}^{new}=\bar{D}_{i,j}V_{i,j}+\sum _{k} A_{\uparrow i,k}V_{k,j}+ \sum _{k} V_{i,k}A_{\downarrow j,k} \end{aligned}$$
(3)

where the subscript i, j denotes the (i, j)-th element of the matrices V, \(\bar{D}\), \(A_{\uparrow }\) and \(A_{\downarrow }\). Accordingly, we can parallelize the multiplication \(Y=HV(\equiv Hv)\) as follows:

CAL 1: \(Y^{c}=\bar{D}^{c}\odot V^{c}+A_{\uparrow } V^{c}\),

COM 1: all-to-all communication from \(V^{c}\) to \(V^{r}\),

CAL 2: \(W^{r}= V^{r} A^T_{\downarrow }\),

COM 2: all-to-all communication from \(W^{r}\) to \(W^{c}\),

CAL 3: \(Y^{c}=Y^{c}+W^{c}\).

Here, superscripts c and r denote column-wise and row-wise partitioning of the matrix data for the parallel calculation. The operator \(\odot \) denotes element-wise multiplication. This parallelization strategy requires two all-to-all communication operations per multiplication.
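The matrix form of the multiplication can be checked on a small example. The following is a minimal serial sketch, not the parallel implementation: it assumes NumPy, uses small dense random symmetric matrices as stand-ins for \(A_{\uparrow }\) and \(A_{\downarrow }\), and omits the all-to-all communication steps (COM 1 and COM 2), which only change the data layout. The names `hamiltonian_apply` and `random_symmetric` are hypothetical.

```python
import numpy as np


def hamiltonian_apply(d_bar, a_up, a_down, v_mat):
    """Serial analogue of CAL 1-3: Y = D_bar (.) V + A_up V + V A_down^T."""
    y = d_bar * v_mat + a_up @ v_mat   # CAL 1 (column-wise layout)
    w = v_mat @ a_down.T               # CAL 2 (row-wise layout in parallel)
    return y + w                       # CAL 3


def random_symmetric(n, rng):
    """Dense random symmetric matrix used as a stand-in (assumption)."""
    a = rng.standard_normal((n, n))
    return (a + a.T) / 2


rng = np.random.default_rng(0)
m_up, m_down = 4, 3
a_up = random_symmetric(m_up, rng)
a_down = random_symmetric(m_down, rng)
d = rng.standard_normal(m_up * m_down)

# Explicit H = D + I_down (x) A_up + A_down (x) I_up, built for checking only.
h = np.diag(d) + np.kron(np.eye(m_down), a_up) + np.kron(a_down, np.eye(m_up))

v = rng.standard_normal(m_up * m_down)
# Column-major reshape: the columns of V are the consecutive blocks of v.
v_mat = v.reshape((m_up, m_down), order="F")
d_bar = d.reshape((m_up, m_down), order="F")

y_mat = hamiltonian_apply(d_bar, a_up, a_down, v_mat)
hv_matches = np.allclose(y_mat.reshape(-1, order="F"), h @ v)
print(hv_matches)
```

The check relies on the standard identities \((I\otimes A)\,\mathrm{vec}(V)=\mathrm{vec}(AV)\) and \((B\otimes I)\,\mathrm{vec}(V)=\mathrm{vec}(VB^T)\) for column-major vectorization.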

2.2 Preconditioner of LOBPCG Method for Solving the Ground State of Hubbard Model

Zero-Shift Point Jacobi Preconditioner. A suitable preconditioner improves the convergence properties of the LOBPCG method. As a consequence, many preconditioners have been proposed, including preconditioners for the Hamiltonian derived from the Hubbard model. For the Hubbard model, the zero-shift point Jacobi (ZSPJ) preconditioner, which is a shift-and-invert preconditioner using an approximate eigenvalue obtained during LOBPCG iteration, has excellent convergence properties for Hamiltonians where the diagonal elements predominate over the off-diagonal elements, i.e. cases where the repulsive energy U is large [7, 8].

Neumann Expansion Preconditioner. For the Hubbard model with a small repulsive energy, a preconditioner using the Neumann expansion was previously proposed [8]. The expansion is

$$\begin{aligned} (I-M)^{-1}=I+M+M^2+M^3+\cdots . \end{aligned}$$
(4)

The expansion converges when the operator norm of the matrix M is less than 1 (\(||M||_{op}<1\)) [9]. Here the matrix M is

$$ M=I-\frac{2}{\lambda _{\max }-\lambda _{\min }}(H-\lambda _{\min }I), $$

where \(\lambda _{\min }\) and \(\lambda _{\max }\) are the smallest and largest eigenvalues of H, respectively. When the exact eigenvalues are utilized for \(\lambda _{\min }\) and \(\lambda _{\max }\), \(||M||_{op}\) is equal to 1. Since the LOBPCG method calculates an approximation of the smallest eigenvalue, we consider the residual error of this approximation and make a low estimate of \(\lambda _{\min }\). The Gershgorin circle theorem is used to assign a \(\lambda _{\max }\) that is an overestimate of the true value. The inequality \(||M||_{op}<1\) is hence obeyed and the expansion (4) converges to \((I-M)^{-1}\), i.e. the inverse matrix of \(\frac{2}{\lambda _{\max }-\lambda _{\min }}(H-\lambda _{\min }I)\). The expansion is therefore an effective preconditioner for the smallest eigenvalue \(\lambda _{\min }\). In practice the Gershgorin circle theorem may give estimates for \(\lambda _{\max }\) that are much too large. Multiplying by a damping factor \(\alpha \) can help alleviate this inefficiency. We found 0.9 to be an appropriate \(\alpha \) in numerical tests.
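The construction above can be illustrated with a minimal sketch, assuming NumPy and a small dense random symmetric matrix in place of the Hamiltonian. The underestimate of \(\lambda _{\min }\) is formed here by subtracting an arbitrary margin of 0.1 from the exact value, which is an assumption made only for the demonstration; the damping factor \(\alpha \) is not included. The function names are hypothetical.

```python
import numpy as np


def gershgorin_upper_bound(h):
    """Upper bound on the largest eigenvalue (Gershgorin circle theorem)."""
    radii = np.sum(np.abs(h), axis=1) - np.abs(np.diag(h))
    return np.max(np.diag(h) + radii)


def neumann_apply(h, w, lam_min, lam_max, s):
    """Return (I + M + ... + M^s) w, M = I - 2/(lam_max - lam_min) (H - lam_min I)."""
    scale = 2.0 / (lam_max - lam_min)
    term = w.copy()
    total = w.copy()
    for _ in range(s):
        term = term - scale * (h @ term - lam_min * term)  # term <- M term
        total += term
    return total


rng = np.random.default_rng(1)
n = 30
a = rng.standard_normal((n, n))
h = (a + a.T) / 2                       # stand-in for the Hamiltonian (assumption)

lam_min = np.linalg.eigvalsh(h)[0] - 0.1   # hypothetical low estimate
lam_max = gershgorin_upper_bound(h)        # overestimate of the largest eigenvalue

m = np.eye(n) - 2.0 / (lam_max - lam_min) * (h - lam_min * np.eye(n))
m_norm = np.linalg.norm(m, 2)              # operator norm of M

w = rng.standard_normal(n)
exact = np.linalg.solve(np.eye(n) - m, w)  # (I - M)^{-1} w
errors = [np.linalg.norm(neumann_apply(h, w, lam_min, lam_max, s) - exact)
          for s in (1, 3, 200)]
print(m_norm, errors)
```

With these estimates the operator norm of M is below 1, and the truncation error shrinks as the order s grows. Each additional order costs one more multiplication by H.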

2.3 Communication Avoiding Neumann Expansion Preconditioner for Hubbard Model

When we execute the LOBPCG method with the s-th order Neumann expansion preconditioner, we calculate \(s+1\) matrix-vector multiplications, Hv, \(H^2v\), \(\ldots \), and \(H^{s+1}v\), per iteration. Since the multiplications \((I_{\downarrow } \otimes A_{\uparrow })\) and \((A_{\downarrow } \otimes I_{\uparrow })\) commute, Yamada et al. proposed a communication avoiding strategy for the Hamiltonian-vector multiplication [8]. Using this commutativity, \(H^2\) is given as

$$\begin{aligned} H^2= & {} (I_{\downarrow } \otimes A_{\uparrow })(D+(I_{\downarrow } \otimes A_{\uparrow }))+(A_{\downarrow } \otimes I_{\uparrow })(D+(A_{\downarrow } \otimes I_{\uparrow }))\\&+\,D(D+(I_{\downarrow } \otimes A_{\uparrow })+(A_{\downarrow } \otimes I_{\uparrow })) +(I_{\downarrow } \otimes A_{\uparrow }) (A_{\downarrow } \otimes I_{\uparrow }) +(A_{\downarrow } \otimes I_{\uparrow }) (I_{\downarrow } \otimes A_{\uparrow })\\= & {} (I_{\downarrow } \otimes A_{\uparrow })(D+(I_{\downarrow } \otimes A_{\uparrow })+2(A_{\downarrow } \otimes I_{\uparrow }))\\&+\,(A_{\downarrow } \otimes I_{\uparrow })(D+(A_{\downarrow } \otimes I_{\uparrow })) +D(D+(I_{\downarrow } \otimes A_{\uparrow })+(A_{\downarrow } \otimes I_{\uparrow })). \end{aligned}$$

As a result, \(Y_1=Hv\) and \(Y_2=H^2v\) can be calculated by the following algorithm:  

CAL 1: \(Y^{c}=\bar{D}^{c}\odot V^{c}+A_{\uparrow } V^{c}\),

COM 1: all-to-all communication from \(V^{c}\) to \(V^{r}\),

CAL 2: \(W^{r}=V^{r}A^{T}_\downarrow \),

COM 2: all-to-all communication from \(W^{r}\) to \(W^{c}\),

CAL 3: \(Y_1^{c}=Y^{c}+W^{c}\),

CAL 4: \(Y^{c}=Y_1^{c}+W^{c}\),

CAL 5: \(Y^{c}=\bar{D}^{c}\odot Y_1^{c}+A_{\uparrow }Y^{c}\),

CAL 6: \(W^{r}=\bar{D}^{r}\odot V^{r}+W^{r}\),

CAL 7: \(W^{r}=W^{r} A^{T}_\downarrow \),

COM 3: all-to-all communication from \(W^{r}\) to \(W^{c}\),

CAL 8: \(Y_2^{c}=Y^{c}+W^{c}\).

The algorithm requires three all-to-all communication operations. On the other hand, when using the original algorithm described in Sect. 2.1, four all-to-all communication operations are required to calculate the same multiplications. However, the new algorithm has extra calculations, CAL 4 and CAL 6, as compared to the original one. Therefore, when the cost of one all-to-all communication operation is larger than that of the extra calculations, we expect to achieve speedup with the communication avoiding strategy. The algorithm cannot be directly applied to the multiplication \(H^{s+1}v\) for \(s\ge 2\). In this case, \(H^{s+1}v\) is calculated by appropriately combining Hv and \(H^2v\) operations.
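The arithmetic of steps CAL 1 to CAL 8 can be verified on a small example. The following minimal serial sketch assumes NumPy and small dense random symmetric stand-ins for \(A_{\uparrow }\) and \(A_{\downarrow }\); the communication steps are indicated only in comments, because in a serial setting the row-wise and column-wise layouts coincide. The function name `hv_and_h2v` is hypothetical.

```python
import numpy as np


def hv_and_h2v(d_bar, a_up, a_down, v_mat):
    """Serial analogue of CAL 1-8: returns (H V, H^2 V) in matrix form."""
    y = d_bar * v_mat + a_up @ v_mat   # CAL 1
    w = v_mat @ a_down.T               # CAL 2 (after COM 1)
    y1 = y + w                         # CAL 3 (after COM 2): Y1 = H V
    y = y1 + w                         # CAL 4: (D + A_up + 2 A_down) V
    y = d_bar * y1 + a_up @ y          # CAL 5
    w = d_bar * v_mat + w              # CAL 6: (D + A_down) V
    w = w @ a_down.T                   # CAL 7
    y2 = y + w                         # CAL 8 (after COM 3): Y2 = H^2 V
    return y1, y2


rng = np.random.default_rng(2)
m_up, m_down = 5, 4
a_up = rng.standard_normal((m_up, m_up))
a_up = (a_up + a_up.T) / 2
a_down = rng.standard_normal((m_down, m_down))
a_down = (a_down + a_down.T) / 2
d = rng.standard_normal(m_up * m_down)

# Explicit H, built for checking only.
h = np.diag(d) + np.kron(np.eye(m_down), a_up) + np.kron(a_down, np.eye(m_up))

v = rng.standard_normal(m_up * m_down)
v_mat = v.reshape((m_up, m_down), order="F")
d_bar = d.reshape((m_up, m_down), order="F")

y1, y2 = hv_and_h2v(d_bar, a_up, a_down, v_mat)
y1_ok = np.allclose(y1.reshape(-1, order="F"), h @ v)
y2_ok = np.allclose(y2.reshape(-1, order="F"), h @ (h @ v))
print(y1_ok, y2_ok)
```

Only the three products with \(A^{T}_\downarrow \) and the sums that follow them need the row-wise layout, which is why three communications suffice for both results.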

3 Neumann Expansion Preconditioner for Multiple Eigenvalues of Hubbard Model

3.1 How to Calculate Multiple Eigenvalues Using LOBPCG Method

The LOBPCG method for solving the m smallest eigenvalues and corresponding eigenvectors carries out the recurrence with m vectors simultaneously (see Fig. 3). In this algorithm, a generalized eigenvalue problem has to be solved. We can solve the problem using the LAPACK function dsygv, if the matrix \(S_B\) is a positive definite matrix. Although theoretically \(S_B\) is always a positive definite matrix, numerically this is not always the case. The reason is that the norms of the vectors \(w_k^{(i)}\) and \(p_k^{(i)}\) (\(i=1,2,\ldots ,m\)) become small as the LOBPCG iteration converges, and it is possible that trailing digits are lost in the calculation of \(S_B\). Therefore, we set the matrix \(S_B\) to the identity matrix by orthogonalizing the vectors in every iteration, which reduces the problem to a standard symmetric eigenvalue problem. In the following numerical tests, we utilize the TSQR method for the orthogonalization [10, 11].
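The text does not give the details of the TSQR implementation used. As an illustration only, the following minimal sketch shows a one-level TSQR in NumPy: each row block (standing in for the data held by one process) is factorized locally, and the small stacked R factors are factorized once more. It assumes every block has at least as many rows as columns. The function name `tsqr` and the block count are hypothetical.

```python
import numpy as np


def tsqr(blocks):
    """One-level TSQR of a tall-skinny matrix given as a list of row blocks."""
    local = [np.linalg.qr(b) for b in blocks]          # local reduced QR
    r_stack = np.vstack([r for _, r in local])         # stack the small R factors
    q2, r = np.linalg.qr(r_stack)                      # QR of the stacked factors
    k = r.shape[0]                                     # number of columns
    # Combine each local Q with its slice of the second-level Q.
    q_blocks = [q @ q2[i * k:(i + 1) * k] for i, (q, _) in enumerate(local)]
    return q_blocks, r


rng = np.random.default_rng(3)
x = rng.standard_normal((1000, 6))      # e.g. 6 basis vectors of length 1000
blocks = np.array_split(x, 4, axis=0)   # 4 row blocks, one per "process"

q_blocks, r = tsqr(blocks)
q = np.vstack(q_blocks)
s_b = q.T @ q                           # Gram matrix of the orthogonalized basis
print(np.allclose(s_b, np.eye(6)), np.allclose(q @ r, x))
```

After this step the Gram matrix of the basis is the identity to rounding error, which is the property used to replace \(S_B\) by the identity matrix.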

Fig. 3. LOBPCG method for solving the m smallest eigenvalues and corresponding eigenvectors. \(T^{(i)}\) is a preconditioner for the i-th smallest eigenvalue. This algorithm requires m matrix-vector multiplication operations and m preconditioned ones per iteration.

3.2 Neumann Expansion Preconditioner of LOBPCG Method for Solving Multiple Eigenvalues

When we calculate multiple eigenvalues (and corresponding eigenvectors) using the LOBPCG method, we can individually apply a preconditioning operation to each vector corresponding to the eigenvectors. We set the matrix \(M_i\) using the Neumann expansion preconditioner for the i-th smallest eigenvalue \(\lambda _i\) of the Hamiltonian as

$$ M_i=I-\frac{2}{\lambda _{\max }-\lambda _{i}}(H-\lambda _{i}I). $$

Since we obtain approximate eigenvalues after each iteration of the LOBPCG method, we consider the residual errors of these approximations to define an appropriate \(\lambda _i\). The matrix \(M_i\) has \((i-1)\) eigenvalues whose absolute values are greater than or equal to 1, so the Neumann expansion using \(M_i\) cannot converge. The eigenvectors corresponding to these eigenvalues agree with those corresponding to the eigenvalues \(\lambda _1\), \(\lambda _2\), \(\ldots \), \(\lambda _{i-1}\) of the Hamiltonian, which are calculated simultaneously during the LOBPCG iteration. Accordingly, we orthogonalize the vectors \(x_k^{(i)}\), \(w_k^{(i)}\), and \(p_k^{(i)}\) (\(i=1,2,\ldots ,m\)) in an order that removes the components of the vectors \(x_k^{(1)}\), \(x_k^{(2)}\), \(\ldots \), \(x_k^{(i-1)}\) from \(w_k^{(i)}\) given by the Neumann expansion preconditioner using \(M_i\). That is, we orthogonalize the vectors utilizing an algorithm that includes the following operation

$$\begin{aligned} w_k^{(j)}:=w_k^{(j)}-\sum _{i=1}^{j-1}(w_k^{(j)},x_k^{(i)})x_k^{(i)} . \end{aligned}$$
(5)

Formula (5) can approximately remove, from the preconditioned vectors, the components of the eigenvectors corresponding to the eigenvalues of \(M_j\) whose absolute values are greater than or equal to 1. Therefore, we expect that the Neumann expansion using \(M_i\) becomes an appropriate preconditioner for solving for multiple eigenvalues.
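The projection in formula (5) can be sketched as follows. This is a minimal illustration, assuming NumPy and assuming that the vectors \(x_k^{(i)}\) are already orthonormal; the random vectors are placeholders for the preconditioned vectors, and the function name `remove_lower_components` is hypothetical. List indices start at 0, so index j corresponds to \(j+1\) in the formula.

```python
import numpy as np


def remove_lower_components(w_list, x_list):
    """Apply Eq. (5): w^(j) -= sum_{i<j} (w^(j), x^(i)) x^(i)."""
    out = []
    for j, w in enumerate(w_list):
        w_new = w.copy()
        for i in range(j):
            # Inner products use the original w^(j), as in Eq. (5).
            w_new -= np.dot(w, x_list[i]) * x_list[i]
        out.append(w_new)
    return out


rng = np.random.default_rng(4)
n, m = 50, 4
q, _ = np.linalg.qr(rng.standard_normal((n, m)))
x_list = [q[:, i] for i in range(m)]                  # orthonormal x^(1..m)
w_list = [rng.standard_normal(n) for _ in range(m)]   # placeholder w^(1..m)

w_new = remove_lower_components(w_list, x_list)
max_overlap = max(abs(np.dot(w_new[j], x_list[i]))
                  for j in range(m) for i in range(j))
print(max_overlap)
```

The first vector is left unchanged, and each later vector loses its components along the approximate eigenvectors of lower index.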

4 Performance Result

4.1 Computational Performance and Convergence Property

We examined the computational performance and convergence properties of the LOBPCG method. We solved the 2-D 4 \(\times \) 5-site Hubbard model with 5 up-spin electrons and 5 down-spin ones. The dimension of the Hamiltonian derived from the model is about 240 million. The number of non-zero off-diagonal elements is about 1.6 billion. We solved for one, five and 10 eigenvalues (and corresponding eigenvectors) of the Hamiltonian on 768 cores (64 MPI processes \(\times \) 12 OpenMP threads) of the SGI ICE X supercomputer (see Table 1) at the Japan Atomic Energy Agency (JAEA). Table 2 shows the results for a weak interaction case (\(U/t=1\)) and a strong one (\(U/t=10\)). Table 3 shows the elapsed times of some representative operations.

Table 1. Details of SGI ICE X
Table 2. Elapsed time and number of iterations for convergence of LOBPCG method using zero-shift point Jacobi (ZSPJ), Neumann expansion (NE), or communication avoiding Neumann expansion (CANE) preconditioner. Here, s is the number of the Neumann expansion series.

The results for \(U/t=1\) indicate that the point Jacobi (PJ) and zero-shift point Jacobi (ZSPJ) preconditioners hardly improve the convergence compared to using no preconditioner at all. When we solve for many eigenvalues, the PJ and ZSPJ preconditioners have little effect on the speed of the calculation. On the other hand, the Neumann expansion preconditioner can decrease the number of iterations required for convergence. Moreover, the larger the Neumann expansion series s, the fewer iterations are required. When we solve for only the smallest eigenvalue, the total elapsed time increases as s increases. The reason is that the elapsed time of the Hamiltonian-vector multiplication operation dominates the whole calculation when solving for only the smallest eigenvalue (see Table 3). When we solve for multiple eigenvalues, the TSQR operation becomes dominant. Therefore, when the series number s becomes large, it is possible to achieve speedup of the computation.

Table 3. Elapsed time for operations per iteration. This table shows the results using the zero-shift point Jacobi (ZSPJ), Neumann expansion (NE), and communication avoiding Neumann expansion (CANE) preconditioners. Here, the Neumann expansion series s is equal to 1. For \(m=1\), we calculate \(S_B\) instead of executing TSQR; moreover, the ZSPJ preconditioner is calculated together with x, p, X, P.
Table 4. Speedup ratio for the elapsed time per iteration using the Neumann expansion preconditioner and communication avoiding strategy.

Next, we discuss the results for \(U/t=10\). The results indicate that the PJ preconditioner improves the convergence properties. ZSPJ further improves convergence for small m; however, its convergence properties when solving for multiple eigenvalues are almost the same as those of the PJ preconditioner. When we solve for multiple eigenvalues using the Neumann expansion preconditioner, the solution is obtained faster than with the PJ or ZSPJ preconditioners. Moreover, as the Neumann expansion series s increases, the Neumann expansion preconditioner improves the convergence properties and the total elapsed time decreases, especially when m is large.

Finally, we discuss the effect of the communication avoiding strategy. Table 4 shows the speedup ratio of the elapsed time per iteration obtained by applying the communication avoiding strategy to the Neumann expansion preconditioner. In all cases the communication avoiding strategy realizes speedup. When we solve for only the smallest eigenvalue (and its corresponding eigenvector), the speedup ratio is almost the same as that for the matrix-vector multiplication, because the multiplication cost is dominant. On the other hand, when we solve for multiple eigenvalues, the cost of the calculations other than the multiplication becomes dominant. Therefore, the speedup ratio is a little smaller than that for the multiplication alone. Furthermore, when the Neumann expansion series s is equal to 3, we confirm that the ratio improves. In this case, since four multiplications (Hw, \(H^2w\), \(H^3w\) and \(H^4w\)) are executed per iteration, the share of the multiplication cost increases. Moreover, we can execute the four multiplication operations with two communication avoiding multiplications. Therefore, the ratio for \(s=3\) is better than that for \(s=1\).

Table 5. Details of K computer

4.2 Parallel Performance

In order to examine the parallel performance of the LOBPCG method using the Neumann expansion preconditioner, we solved for the 10 smallest eigenvalues and corresponding eigenvectors of the Hamiltonian derived from the 4 \(\times \) 5-site Hubbard model for \(U/t=1\) with 6 up-spin and 6 down-spin electrons. We used the LOBPCG method with ZSPJ, NE, and CANE preconditioners using hybrid parallelization on SGI ICE X at JAEA and the K computer at RIKEN (see Table 5). The results are shown in Table 6. The results indicate that all preconditioners achieve excellent parallel efficiency. The communication avoiding strategy on SGI ICE X decreases the elapsed time per iteration by about 15%. On the other hand, the communication avoiding strategy on the K computer did not realize speedup when using a small number of cores. The ratio of the network bandwidth to FLOPS per node of the K computer is larger than that of SGI ICE X, so it is possible that the cost of the extra calculations (CAL 4 and CAL 6) is larger than that of the all-to-all communication operation. However, since the cost of the all-to-all communication operation increases as the number of cores increases, the strategy realizes speedup on 4096 cores. Therefore, the strategy has the potential for speedup in parallel computing using a sufficiently large number of cores, even if the ratio of the network bandwidth to FLOPS is large.

Although the LOBPCG method using NE executes four times as many Hamiltonian-vector multiplications per iteration as the method with ZSPJ, the former takes only about twice the elapsed time of the latter. The reason is that the calculations other than the multiplication are dominant in this case. Therefore, we conclude that in order to solve for multiple eigenvalues of the Hamiltonian derived from the Hubbard model using the LOBPCG method in a short computation time, it is crucial to reduce the number of iterations required for convergence, even if the calculation cost of the preconditioner is large.

Table 6. Parallel performance of LOBPCG method on SGI ICE X and K computer. This table shows the number of iterations, the total elapsed time, and the elapsed time per iteration of LOBPCG method using zero-shift point Jacobi (ZSPJ), Neumann expansion (NE), or communication avoiding Neumann expansion (CANE) preconditioner. Here, the Neumann expansion series s is 3.

5 Conclusions

In this paper we applied the Neumann expansion preconditioner to the LOBPCG method to solve for multiple eigenvalues and corresponding eigenvectors of the Hamiltonian derived from the Hubbard model. We examined the convergence properties and parallel performance of the algorithms. Since the norm of the matrix used in the Neumann expansion should be less than 1, we transform it using approximate eigenvalues calculated by the LOBPCG iteration and the upper bounds of the eigenvalues given by the Gershgorin circle theorem. Moreover, we orthogonalize the iteration vectors in an order that removes, from the preconditioned vectors, the components of the eigenvectors corresponding to the eigenvalues whose absolute values are greater than or equal to 1.

The Neumann expansion preconditioner with the communication avoiding strategy can achieve speedup even for problems which are hardly improved by the conventional preconditioners. Furthermore, a numerical experiment indicated that the LOBPCG method using this preconditioner has excellent parallel efficiency on thousands of cores, and that the communication avoiding strategy based on the property of the Hubbard model realizes speedup on parallel computers if a sufficiently large number of cores is used. Therefore, we confirm that the preconditioner based on the Neumann expansion is suitable for solving the eigenvalue problem derived from the Hubbard model using the LOBPCG method.