
1 Introduction

Eigenvalue problems appear in a variety of fields such as quantum dynamics, structural analysis, and economics. Therefore, many solvers for them have been developed, and strategies to improve their performance have also been proposed. In quantum dynamics, by solving the eigenvalue problems derived from quantum models, we can identify the quantum states that characterize the properties of the models. In this research, we focus on the eigenvalue problem for the Hamiltonian derived from the Hubbard model and propose strategies to realize a high-performance solver on multi-GPU systems. The model captures many interesting phenomena such as high-temperature superconductivity [9, 11]; therefore, it has attracted great interest from physicists. The Hamiltonian, which represents the energy of the Hubbard model, is given as

$$\begin{aligned} H =- t\sum _{i,j,\sigma } c_{j \sigma }^{\dagger } c_{i \sigma } + \sum _i U_i n_{i\uparrow } n_{i \downarrow }, \end{aligned}$$
(1)

where t is the hopping parameter from one site to another and \(U_i\) is the repulsive energy for the double occupation of two fermions on the i-th site. The quantities \(c_{i,\sigma }\), \(c_{i,\sigma }^{\dagger }\) and \(n_{i,\sigma }\) are the annihilation, creation, and number operators of a fermion with pseudo-spin \(\sigma \) on the i-th site, respectively. By solving for the ground state (the smallest eigenvalue and the corresponding eigenvector) of the Hamiltonian, we can understand the basic properties of the model, and by solving for multiple eigenpairs (eigenvalues and the corresponding eigenvectors), we can reveal more detailed properties. Therefore, many methods to solve the model have been proposed. One of the most accurate solvers is the exact diagonalization method, which directly computes some eigenpairs of the Hamiltonian derived exactly from the model. In this case, the Hamiltonian becomes a huge sparse symmetric matrix. Accordingly, an iterative method, such as the Lanczos method or the LOBPCG (Locally Optimal Block Preconditioned Conjugate Gradient) method [7, 8], is usually utilized, and parallelization strategies for multi-CPU systems [14] and tuning strategies for single-GPU systems [1, 12, 15, 17] have been proposed. In this research, we propose parallelization and tuning strategies to realize the LOBPCG method for solving several eigenpairs on multi-GPU systems. The parallelization not only realizes a speedup by distributing the calculations, but also enables simulations of larger models by distributing the data. Since the memory of a GPU is generally smaller than that of a CPU, we can treat larger models by storing data in CPU memory (host memory) rather than only in GPU memory (device memory). Accordingly, in order to simulate a larger model, we transfer the data required for an operation from host to device memory, temporarily store them in device memory, and then execute the operation using them. Moreover, we have to transfer the calculation results from device to host memory, if necessary. Since data transfer between host and device memory is much slower than data transfer within a GPU, decreasing its cost is important for a high-performance parallel LOBPCG method on multi-GPU systems. Therefore, we focus on the data transfer between host and device memory in this paper. Accordingly, although it might be possible to achieve higher performance by using CPUs in addition to GPUs, we target an LOBPCG method in which all time-consuming operations are performed only on GPUs.

The rest of the paper is structured as follows. Section 2 presents the implementation schemes on multi-GPU systems. We propose tuning strategies from the viewpoint of data transfer between host and device memory in Sect. 3. Section 4 shows the results of numerical experiments on the HPE SGI 8600 system at the Japan Atomic Energy Agency. A summary and conclusions are given in Sect. 5.

Fig. 1.

LOBPCG method for solving the m smallest eigenvalues and the corresponding eigenvectors of a symmetric matrix H. \(T^{(i)}\) is a preconditioner for the i-th smallest eigenvalue. And \(\boldsymbol{\mathcal {X}}_k^{(i)}\), \(\boldsymbol{\mathcal {P}}_k^{(i)}\), and \(\boldsymbol{\mathcal {W}}_k^{(i)}\) are \(H\boldsymbol{x}_k^{(i)}\), \(H\boldsymbol{p}_k^{(i)}\), and \(H\boldsymbol{w}_k^{(i)}\), respectively. As convergence progresses, the set of iteration vectors \(\{\boldsymbol{x}_k^{(1)},\ldots ,\boldsymbol{x}_k^{(m)},\boldsymbol{w}_k^{(1)},\ldots ,\boldsymbol{w}_k^{(m)},\boldsymbol{p}_k^{(1)},\ldots ,\boldsymbol{p}_k^{(m)}\}\) becomes linearly dependent, and the generalized eigenvalue problem cannot be solved. Therefore, in practice, we orthonormalize the set of vectors so that the algorithm can be executed stably.

2 LOBPCG Method for Solving Multiple Eigenpairs on Multi-GPU Systems

We can solve m eigenpairs of the matrix H using the LOBPCG method shown in Fig. 1Footnote 1. The method requires m matrix-vector multiplications per iteration and some linear operations using the iteration vectors \(\boldsymbol{x}_k^{(i)}\), \(\boldsymbol{w}_k^{(i)}\), \(\boldsymbol{p}_k^{(i)}\), \(\boldsymbol{\mathcal {X}}_k^{(i)}\), \(\boldsymbol{\mathcal {W}}_k^{(i)}\), and \(\boldsymbol{\mathcal {P}}_k^{(i)}\) \((i=1,2,\ldots ,m)\). Moreover, in order to execute the LOBPCG method stably, we have to orthonormalize the iteration vectors \(\boldsymbol{x}_k^{(i)}\), \(\boldsymbol{w}_k^{(i)}\) and \(\boldsymbol{p}_k^{(i)}\); then \(S_B\) becomes the identity matrix. Parallelization schemes of the multiplication for the Hubbard model on multi-CPU systems have been proposed in [14]. In addition, tuning strategies for single-GPU systems have also been proposed [12, 15, 17]. We propose a parallelization of the multiplication on multi-GPU systems by combining these two strategies appropriately. Since the size of device memory is typically a fraction of that of host memory, it is assumed in this paper that we store only the information of the matrix H and 2m vectors (\(\boldsymbol{w}_k^{(i)}\) and \(\boldsymbol{\mathcal {W}}_k^{(i)}\), \(i=1,\ldots ,m\)) in device memory and the other vectors (\(\boldsymbol{x}_k^{(i)}\), \(\boldsymbol{\mathcal {X}}_k^{(i)}\), \(\boldsymbol{p}_k^{(i)}\) and \(\boldsymbol{\mathcal {P}}_k^{(i)}\)) in host memory. The time-consuming operations of the LOBPCG method are the Hamiltonian matrix-vector multiplications and the vector operations (dot product, vector update, and orthonormalization). In the following, we introduce parallelization strategies on multi-GPU systems for each operation.
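As a reference for the discussion below, the following CUDA Fortran fragment is a minimal sketch (not the authors' code) of this memory placement: the matrix data and the W-block of iteration vectors are allocated in device memory, while the X- and P-blocks are kept in page-locked (pinned) host memory so that they can be streamed to the GPU tile by tile. All names and the sparse-storage details are assumptions.

module lobpcg_storage
  ! A minimal sketch of the memory placement described above; all names and the
  ! sparse-storage details are assumptions, not the authors' data structures.
  use cudafor
  implicit none
  ! device-resident data: matrix information and the W-block of iteration vectors
  real(8), allocatable, device :: d_diag(:)                   ! diagonal matrix D of Eq. (2)
  real(8), allocatable, device :: d_aup_val(:), d_adn_val(:)  ! non-zero elements of A_up and A_dn
  integer, allocatable, device :: d_aup_idx(:), d_adn_idx(:)  ! index data of the assumed sparse format
  real(8), allocatable, device :: d_w(:,:), d_hw(:,:)         ! w_k^(i) and H w_k^(i), i = 1, ..., m
  ! host-resident data, kept in pinned (page-locked) memory and streamed tile by tile
  real(8), allocatable, pinned :: h_x(:,:), h_hx(:,:)         ! x_k^(i) and H x_k^(i)
  real(8), allocatable, pinned :: h_p(:,:), h_hp(:,:)         ! p_k^(i) and H p_k^(i)
end module lobpcg_storage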

2.1 Hamiltonian Matrix-Vector Multiplications

When we solve m eigenpairs of the Hamiltonian, we have to perform m Hamiltonian-vector multiplications per iteration. Since each of these multiplications can be executed independently, we focus on the parallelization and tuning strategies for one multiplication. The Hamiltonian is represented as

$$\begin{aligned} H=D+I_{\downarrow }\otimes A_{\uparrow }+A_{\downarrow }\otimes I_{\uparrow }, \end{aligned}$$
(2)

where D is a diagonal matrix derived from the repulsive energy, and \(A_{\uparrow }\) (\(A_{\downarrow }\)) is a sparse symmetric matrix derived from the hopping of the up-spins (down-spins). Here, \(I_{\uparrow }\) (\(I_{\downarrow }\)) is the identity matrix whose dimension is the same as that of \(A_{\uparrow }\) (\(A_{\downarrow }\)). When the dimension of \(A_{\uparrow }\) (\(A_{\downarrow }\)) is \(n_\uparrow \) (\(n_\downarrow \)), the dimension of the Hamiltonian H is \(n_\uparrow \times n_{\downarrow }\). For example, for the 5 \(\times \) 4-site model with 5 up-spins and 5 down-spins used in Sect. 2.2, \(n_\uparrow =n_\downarrow =\binom{20}{5}=15504\), so the dimension of H is \(15504^2=240{,}374{,}016\). Since the dimensions of \(A_{\uparrow }\) and \(A_{\downarrow }\) are much smaller than that of H, we store the non-zero elements of \(A_{\uparrow }\), \(A_{\downarrow }\) and D in device memory instead of the matrix H. Here, the multiplication of the Hamiltonian and a vector is

$$\begin{aligned} Hv=Dv+(I_{\downarrow }\otimes A_{\uparrow })v+(A_{\downarrow }\otimes I_{\uparrow })v. \end{aligned}$$

We transform the vector v into the matrix V in the following manner:

$$\begin{aligned} v=\left( \begin{array}{c} v_{1}\\ v_{2}\\ \vdots \\ v_{n_{\uparrow } \times n_{\downarrow }} \end{array} \right) \rightarrow V=\left( \begin{array}{cccc} v_{1}&{}v_{n_\uparrow +1}&{}\cdots &{}v_{n_\uparrow \times (n_{\downarrow }-1)+1} \\ v_{2}&{}v_{n_\uparrow +2}&{}\cdots &{}v_{n_\uparrow \times (n_{\downarrow }-1)+2} \\ \vdots &{}\vdots &{}\vdots &{}\vdots \\ v_{n_\uparrow }&{}v_{2n_\uparrow }&{}\cdots &{}v_{n_\uparrow \times n_{\downarrow }} \end{array} \right) , \end{aligned}$$
(3)

and the diagonal elements of the matrix D into the matrix \(\bar{D}\) in the same manner. Then, the matrix-vector multiplication is changed into the following matrix-matrix multiplication:

$$\begin{aligned} (HV)_{i,j}=\bar{D}_{i,j}V_{i,j}+\sum _{k}{{A_{\uparrow }}_{i,k} V_{k,j} }+\sum _{k}{{A_{\downarrow }}_{j,k} V_{i,k}}, \end{aligned}$$

where the subscript i, j of a matrix denotes the (i, j)-th element of the matrix. Since the matrix V is a dense matrix, we can execute the multiplications in parallel as follows:

  • CAL 1: \(Y^{c}=\bar{D}^{c}\odot V^{c}+A_{\uparrow } V^{c}\),

  • COM 1: all-to-all communication from \(V^{c}\) to \(V^{r}\),

  • CAL 2: \(Z^{r}= V^{r} A^T_{\downarrow }\),

  • COM 2: all-to-all communication from \(Z^{r}\) to \(Z^{c}\),

  • CAL 3: \(Y^{c}=Y^{c}+Z^{c}\).

Here, the superscripts c and r denote the columnwise and rowwise partitioning of the matrix data for the parallel calculation, and \(\odot \) denotes elementwise multiplication. This parallelization strategy requires two all-to-all communication operations per multiplication.
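The following CUDA Fortran sketch illustrates these five steps for one multiplication. It is not the authors' code: for readability, \(A_{\uparrow }\) and \(A_{\downarrow }\) are treated as dense matrices multiplied with the cuBLAS dgemm overload of the cublas module, whereas the actual implementation uses sparse kernels [12, 15, 17]; a CUDA-aware MPI is assumed so that MPI_Alltoall can operate on device buffers directly; and all variable names and the block sizes (nloc_up = \(n_\uparrow \)/nprocs, nloc_dn = \(n_\downarrow \)/nprocs) are assumptions.

subroutine hamiltonian_matvec(dbar_c, a_up, a_dn, v_c, y_c, nprocs, n_up, n_dn, &
                              nloc_up, nloc_dn, comm)
  use cudafor
  use cublas   ! overloaded dgemm that accepts device arrays
  use mpi
  implicit none
  integer, intent(in) :: nprocs, n_up, n_dn, nloc_up, nloc_dn, comm
  real(8), device, intent(in)  :: dbar_c(n_up, nloc_dn)      ! local block of the reshaped diagonal
  real(8), device, intent(in)  :: a_up(n_up, n_up), a_dn(n_dn, n_dn)  ! dense here for clarity
  real(8), device, intent(in)  :: v_c(n_up, nloc_dn)         ! columnwise block of V
  real(8), device, intent(out) :: y_c(n_up, nloc_dn)         ! columnwise block of HV
  real(8), device :: v_r(nloc_up, n_dn), z_r(nloc_up, n_dn)  ! rowwise blocks
  real(8), device :: sendbuf(nloc_up, nloc_dn, nprocs), recvbuf(nloc_up, nloc_dn, nprocs)
  integer :: i, j, p, nblk, ierr

  ! CAL 1: Y^c = Dbar^c (elementwise) V^c + A_up V^c
  call dgemm('N', 'N', n_up, nloc_dn, n_up, 1.0d0, a_up, n_up, v_c, n_up, 0.0d0, y_c, n_up)
  !$cuf kernel do(2) <<<*,*>>>
  do j = 1, nloc_dn
     do i = 1, n_up
        y_c(i, j) = y_c(i, j) + dbar_c(i, j) * v_c(i, j)
     end do
  end do

  ! COM 1: pack the rows destined to each process, then all-to-all from V^c to V^r
  do p = 1, nprocs
     !$cuf kernel do(2) <<<*,*>>>
     do j = 1, nloc_dn
        do i = 1, nloc_up
           sendbuf(i, j, p) = v_c((p - 1) * nloc_up + i, j)
        end do
     end do
  end do
  nblk = nloc_up * nloc_dn
  call MPI_Alltoall(sendbuf, nblk, MPI_DOUBLE_PRECISION, v_r, nblk, MPI_DOUBLE_PRECISION, comm, ierr)

  ! CAL 2: Z^r = V^r A_dn^T
  call dgemm('N', 'T', nloc_up, n_dn, n_dn, 1.0d0, v_r, nloc_up, a_dn, n_dn, 0.0d0, z_r, nloc_up)

  ! COM 2: all-to-all from Z^r back to the columnwise partitioning
  call MPI_Alltoall(z_r, nblk, MPI_DOUBLE_PRECISION, recvbuf, nblk, MPI_DOUBLE_PRECISION, comm, ierr)

  ! CAL 3: Y^c = Y^c + Z^c (unpacking the received blocks)
  do p = 1, nprocs
     !$cuf kernel do(2) <<<*,*>>>
     do j = 1, nloc_dn
        do i = 1, nloc_up
           y_c((p - 1) * nloc_up + i, j) = y_c((p - 1) * nloc_up + i, j) + recvbuf(i, j, p)
        end do
     end do
  end do
end subroutine hamiltonian_matvec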

When all non-zero elements of the matrix H and the data of the decomposed matrices \(V^c\) and \(V^r\) are stored in device memory, we can execute CAL 1, CAL 2 and CAL 3 by using the algorithms proposed for single-GPU systems [12, 15, 17]. Here, the storage layout of \(V^c\) and \(V^r\) should be row-major and column-major order, respectively, so that \(A_{\uparrow }V^{c}\) and \( V^{r} A^T_{\downarrow }\) are performed with contiguous memory access. The method therefore requires changing the storage layout of the matrix between column-major and row-major order; however, this operation can be executed with high performance by using the shared memory on GPUs.
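The single-GPU kernels of [12, 15, 17] are not reproduced here, but the layout change itself can be sketched with the standard tiled transpose through shared memory shown below (a generic kernel, not the authors' code). Each 32 \(\times \) 32 tile is staged in shared memory with one padding row so that both the read and the transposed write access global memory contiguously; the kernel and all names are assumptions.

module transpose_mod
  use cudafor
  implicit none
  integer, parameter :: T = 32   ! tile width
contains
  ! Tiled transpose through shared memory: odata = transpose(idata).
  ! Launch example:
  !   call transpose_tile<<<dim3((nrow+T-1)/T, (ncol+T-1)/T, 1), dim3(T, T, 1)>>>(d_vr, d_vc, nrow, ncol)
  attributes(global) subroutine transpose_tile(odata, idata, nrow, ncol)
    integer, value :: nrow, ncol
    real(8), intent(out) :: odata(ncol, nrow)
    real(8), intent(in)  :: idata(nrow, ncol)
    real(8), shared :: tile(T + 1, T)   ! one padding row avoids shared-memory bank conflicts
    integer :: i, j
    ! coalesced read of a T x T tile of idata
    i = (blockIdx%x - 1) * T + threadIdx%x
    j = (blockIdx%y - 1) * T + threadIdx%y
    if (i <= nrow .and. j <= ncol) tile(threadIdx%x, threadIdx%y) = idata(i, j)
    call syncthreads()
    ! coalesced write of the transposed tile into odata
    i = (blockIdx%y - 1) * T + threadIdx%x
    j = (blockIdx%x - 1) * T + threadIdx%y
    if (i <= ncol .and. j <= nrow) odata(i, j) = tile(threadIdx%y, threadIdx%x)
  end subroutine transpose_tile
end module transpose_mod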

Fig. 2.

Schematic CUDA Fortran codes of the vector operations of the LOBPCG method for solving m eigenpairs on multi-GPU systems. Here, the number of tiles of the vectors is l and the dimension of each tiled vector is n. Therefore, the dimension of the decomposed vector on each process is \(l\times n\). Here, h is the cuBLAS library handle. And, x, p, dot and dot0 are stored in host memory, and dx, dp, w, \(\alpha \), \(\beta \), \(\gamma \), dd, dd0 and dtmp in device memory. The codes are simplified to indicate the relationship between the data transfer and the execution on a GPU; therefore, they should be extended appropriately to actually execute the LOBPCG method. Moreover, the data are packed and transferred in one operation.

2.2 Vector Operations

The vector operations in the LOBPCG method can be categorized into the following three groups:

  • Dot product for constructing \(3m \times 3m\)-dimensional symmetric matrix (\(S_A\) in Fig. 1),

  • Updating all column vectors of \( X_{k+1}\), \(P_{k+1}\), \(W_{k+1}\), \(\mathcal {X}_{k+1}\) and \(\mathcal {P}_{k+1}\), where \(X_{k+1}=( \boldsymbol{x}_{k+1}^{(1)},\ldots ,\boldsymbol{x}_{k+1}^{(m)})\), \(W_{k+1}=( \boldsymbol{w}_{k+1}^{(1)},\ldots ,\boldsymbol{w}_{k+1}^{(m)})\), \(P_{k+1}=( \boldsymbol{p}_{k+1}^{(1)},\ldots ,\boldsymbol{p}_{k+1}^{(m)})\), \(\mathcal {X}_{k+1}=(\boldsymbol{\mathcal {X}}_{k+1}^{(1)},\ldots ,\boldsymbol{\mathcal {X}}_{k+1}^{(m)})\) \((=HX_{k+1})\) and \(\mathcal {P}_{k+1}=(\boldsymbol{\mathcal {P}}_{k+1}^{(1)},\ldots ,\boldsymbol{\mathcal {P}}_{k+1}^{(m)})\) \((=HP_{k+1})\).

  • Orthonormalization for all column vectors of \(X_{k+1}\), \(P_{k+1}\) and \(W_{k+1}\).

We discuss the dot product and update for vectors first, then the orthonormalization of vectors.

Dot Product and Vector Update. The dot product is parallelized by calculating the partial sum of the dot product on each process and performing a sum-reduction of the partial sums over all processes with MPI_ALLREDUCE. The vector update operation is parallelized by performing the ‘axpy’ operation on the decomposed vectors on each process without any data communication between processes. When all vectors are stored in device memory, these vector operations can be parallelized straightforwardly. However, in this research, four matrices (\(X_k\), \(\mathcal {X}_k\), \(P_k\) and \(\mathcal {P}_k\)) are stored in host memory, and their data have to be transferred to device memory. Therefore, we partition the vectors into tiles, transfer each tile from host to device memory, and execute the partial dot product operation on each tile (Fig. 2(a)) [10]. The vector update operation can also be executed using almost the same strategy (Fig. 2(b)); however, it additionally requires transferring the updated vectors from device to host memory.
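A minimal CUDA Fortran sketch of this tiled dot product is given below (it corresponds to Fig. 2(a) only in spirit, not line by line). One tile of one host-resident vector is copied to a device buffer and multiplied with the device-resident vectors by the ddot overload of the cublas module; the partial sums are finally reduced over the processes. The tiling layout (n, m, l) and all names are assumptions.

subroutine tiled_dot(x, dw, dot, l, n, m, comm)
  use cudafor
  use cublas   ! overloaded ddot that accepts device arrays
  use mpi
  implicit none
  integer, intent(in) :: l, n, m, comm
  real(8), intent(in)         :: x(n, m, l)   ! tiled vectors kept in host memory
  real(8), device, intent(in) :: dw(n, m, l)  ! vectors resident in device memory
  real(8), intent(out) :: dot(m, m)
  real(8), device :: dx(n)                    ! device buffer for one tile of one vector
  integer :: i, j, t, ierr

  dot = 0.0d0
  do t = 1, l
     do i = 1, m
        dx = x(:, i, t)                       ! HtoD transfer of one tile of x^(i)
        do j = 1, m
           dot(i, j) = dot(i, j) + ddot(n, dx, 1, dw(:, j, t), 1)
        end do
     end do
  end do
  ! sum the per-process partial sums to obtain the global dot products
  call MPI_Allreduce(MPI_IN_PLACE, dot, m * m, MPI_DOUBLE_PRECISION, MPI_SUM, comm, ierr)
end subroutine tiled_dot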

Orthonormalization of Vectors. Here, we discuss the orthonormalization of the iteration vectors \(\boldsymbol{x}_k^{(i)}\), \(\boldsymbol{p}_k^{(i)}\), and \(\boldsymbol{w}_k^{(i)}\). In order to execute the LOBPCG method for solving multiple eigenpairs stably, the set of basis vectors of the subspace spanned by all iteration vectors, that is, the set of all column vectors of \(X_k\), \(P_k\), and \(W_k\), should be linearly independent. Therefore, we should orthogonalize the basis in every iteration. The orthogonalization can be realized by many methods, such as the modified Gram-Schmidt (MGS) orthonormalization, TSQR, CholeskyQR, CholeskyQR2 and so on [2, 4, 13]. When we apply these methods to the vectors directly, we have to transfer the vectors stored in host memory. Therefore, we adopt the orthogonalization strategy for the LOBPCG method proposed by Hetmaniuk and Lehoucq (HL) [3, 5]. The HL strategy is as follows:

  • Here, we denote the eigenvector corresponding to the i-th smallest eigenvalue of \(S_A\) by \(\boldsymbol{v}^{(i)}\). It is assumed that the set of vectors \(\{\boldsymbol{v}^{(1)},\boldsymbol{v}^{(2)},\ldots ,\boldsymbol{v}^{(m)}\}\) is orthonormal and, moreover, that the set of all column vectors of the matrices \(X_k\), \(P_k\) and \(W_k\) in the k-th iteration is also orthonormal.

  • When the (i, j)-th elements of the matrices \(C^1\), \(C^2\) and \(C^3\) are defined as \(C^1_{(i,j)}=\alpha ^{(i)}_j\), \(C^2_{(i,j)}=\beta ^{(i)}_j\) and \(C^3_{(i,j)}=\gamma ^{(i)}_j\), that is,

    $$\left( \begin{array}{c} C^1\\ C^2\\ C^3 \end{array} \right) =(\boldsymbol{v}^{(1)},\boldsymbol{v}^{(2)},\ldots ,\boldsymbol{v}^{(m)}),$$

    \(X_{k+1}\) and \(P_{k+1}\) are calculated by

    $$(X_{k+1},P_{k+1})= (X_{k},W_{k},P_{k})C , ~~~~C=\left( \begin{array}{cc} C^1&{}0\\ C^2&{}C^2\\ C^3&{}C^3 \end{array} \right) .$$

    Here, the set of all column vectors of the matrix \(X_{k+1}\) becomes orthonormal.

  • We decompose the matrix C as \(C=QR\) using the QR decomposition. When we calculate \(X_{k+1}\) and \(P_{k+1}\) using the following formula

    $$(X_{k+1},P_{k+1})= (X_{k},W_{k},P_{k})Q, $$

    all columns of \(X_{k+1}\) and \(P_{k+1}\) are orthonormalFootnote 2.

  • Next, we orthonormalize the column vectors of \(W_{k+1}\) against those of \(X_{k+1}\) and \(P_{k+1}\) using the classical Gram-Schmidt (CGS) method, that is, \(W_{k+1}\) is updated by the following formula [6]

    $$\begin{aligned} W_{k+1}=(I-X_{k+1}X_{k+1}^T-P_{k+1}P_{k+1}^T)W_{k+1}. \end{aligned}$$
    (4)

    The method requires far fewer allreduce communication operations than the MGS method.

  • Finally, we orthonormalize the set of column vectors of \(W_{k+1}\) using the MGS method. The MGS method requires a lot of allreduce communication operations; however, the number of operations in this case is reduced to about one-ninth of that for orthonormalizing all column vectors of the three matrices \(X_{k+1}\), \(P_{k+1}\) and \(W_{k+1}\). Moreover, since we orthonormalize only the column vectors of \(W_{k+1}\), which are stored in device memory, we do not need to transfer any vectors between host and device memory.

In this operation, when we find that a vector is a linear combination of the other vectors, we eliminate it in that iteration. We show the schematic CUDA Fortran code for orthogonalizing the column vectors of \(W_{k+1}\) against those of \(X_{k+1}\) and \(P_{k+1}\) in Fig. 3. The operation requires transferring \(X_{k+1}\) and \(P_{k+1}\) from host to device memory twice.
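A minimal sketch of this CGS step for the X-part of (4) is shown below (the P-part is handled in exactly the same way, and Fig. 3 shows the authors' actual code). The first pass accumulates \(C=X_{k+1}^T W_{k+1}\) tile by tile, the partial sums are reduced over the processes, and the second pass applies \(W_{k+1} \leftarrow W_{k+1}-X_{k+1}C\); the tiles of \(X_{k+1}\) are therefore transferred twice, as stated above. dgemm is the cuBLAS overload of the cublas module, and the tiling layout and names are assumptions.

subroutine cgs_against_x(x, dw, l, n, m, comm)
  use cudafor
  use cublas   ! overloaded dgemm that accepts device arrays
  use mpi
  implicit none
  integer, intent(in) :: l, n, m, comm
  real(8), intent(in)            :: x(n, m, l)   ! tiles of X_{k+1} kept in host memory
  real(8), device, intent(inout) :: dw(n, m, l)  ! W_{k+1}, resident in device memory
  real(8), device :: dxt(n, m)                   ! device buffer for one tile of X_{k+1}
  real(8), device :: dc(m, m)
  real(8)         :: c(m, m)
  integer :: t, ierr

  ! first pass: C = X^T W, accumulated over the tiles and then over the processes
  dc = 0.0d0
  do t = 1, l
     dxt = x(:, :, t)                            ! HtoD transfer of one tile
     call dgemm('T', 'N', m, m, n, 1.0d0, dxt, n, dw(:, :, t), n, 1.0d0, dc, m)
  end do
  c = dc                                         ! DtoH transfer of the small m x m matrix only
  call MPI_Allreduce(MPI_IN_PLACE, c, m * m, MPI_DOUBLE_PRECISION, MPI_SUM, comm, ierr)
  dc = c

  ! second pass: W = W - X C; the tiles of X_{k+1} are transferred a second time
  do t = 1, l
     dxt = x(:, :, t)                            ! HtoD transfer of one tile
     call dgemm('N', 'N', n, m, m, -1.0d0, dxt, n, dc, m, 1.0d0, dw(:, :, t), n)
  end do
end subroutine cgs_against_x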

Fig. 3.

Schematic CUDA Fortran code of orthogonalizing the column vectors of W against those of X and P using the classical Gram-Schmidt method.

Performance. Here, we examine the performance of these three operations using the above methods by solving the eigenvalue problem of the Hamiltonian derived from the 5 \(\times \) 4-site Hubbard model with 5 up-spins and 5 down-spins, using 16 GPUs on 4 nodes of the HPE SGI 8600 system at the Japan Atomic Energy Agency. The details of the system are shown in Table 1. Here, the dimension of the Hamiltonian is 240,374,016; therefore, the dimension of the partitioned vector on each GPU is 15,023,376.

Figure 4 shows the elapsed time of each operation for solving 5 or 10 eigenpairs. The result indicates that the elapsed times tend to increase as the number of tiles becomes larger. The reason is that, as the number of tiles increases, the number of data transfer operations increases and the data size of each transfer decreases; that is, the total latency of the data transfers increases and the throughput declines. Moreover, it is noted that the elapsed time of the vector update operation is unstable. The reason is as follows. When we execute operations on the four GPUs of a node whose logic diagram is shown in Fig. 5, we generally run two processes on each processor. When the two processes on a processor simultaneously transfer data between host and device memory in the same direction, the bus connecting the CPU and its two GPUs is shared by the two processes, and the throughput per process is limited to about half of the throughput achieved by a single process. At the beginning of the iteration, the data transfer operations on the two processes of one processor are synchronized. However, when the number of tiles is large, that is, when the number of iterations needed to complete the operation is large, the same-directional data transfer operations on the two processes gradually get out of synchronization, and transfer operations in opposite directions may sometimes coincide. Opposite-directional transfers can be performed without conflict, and each achieves almost the peak throughput when the size of the transferred data is large. Figure 6 shows the elapsed times of the vector update operation with these data transfer patterns for solving 10 eigenpairs. The result demonstrates that the elapsed time of the conventional method (synchronous data transfer) lies in the interval between that of the same-directional data transfer with synchronizationFootnote 3 and that of the opposite-directional one (Fig. 7). It is also confirmed that the gap between the two data transfer schemes becomes narrower as the number of tiles increases. The reason is that, when transferring large data, the throughput of the opposite-directional data transfer is larger than that of the same-directional one; however, as the size of the data transferred per operation becomes small, that is, as the number of tiles becomes larger, the former declines and ultimately becomes almost the same as the latter.
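A possible realization of the scheme of Fig. 7 is sketched below (an interpretation, not the authors' code). The two processes that share one processor, which have contiguous MPI ranks, order the HtoD and DtoH copies of each tile oppositely and keep in step through a barrier over a communicator pair_comm that is assumed to contain only these two ranks, so that at any moment the shared bus carries transfers in opposite directions. The update itself is replaced by a placeholder axpy-type kernel, and all names are assumptions.

subroutine update_opposite(x, dw, alpha, l, n, m, myrank, pair_comm)
  use cudafor
  use mpi
  implicit none
  integer, intent(in) :: l, n, m, myrank, pair_comm
  real(8), intent(inout)      :: x(n, m, l)   ! tiled vectors kept in host memory
  real(8), device, intent(in) :: dw(n, m, l)  ! vectors resident in device memory
  real(8), intent(in) :: alpha
  real(8), device :: dx(n, m, 2)              ! two device buffers: current tile and previous tile
  integer :: t, i, j, cur, prv, tmp, ierr

  cur = 1; prv = 2
  do t = 1, l + 1
     ! keep the two processes that share one processor in step
     call MPI_Barrier(pair_comm, ierr)
     if (mod(myrank, 2) == 0) then
        if (t <= l) dx(:, :, cur) = x(:, :, t)      ! HtoD copy of tile t first ...
        if (t >= 2) x(:, :, t - 1) = dx(:, :, prv)  ! ... then DtoH copy of the updated tile t-1
     else
        if (t >= 2) x(:, :, t - 1) = dx(:, :, prv)  ! DtoH copy first on the partner rank ...
        if (t <= l) dx(:, :, cur) = x(:, :, t)      ! ... then HtoD copy
     end if
     if (t <= l) then
        ! placeholder axpy-type update standing in for the real update of Fig. 2(b)
        !$cuf kernel do(2) <<<*,*>>>
        do j = 1, m
           do i = 1, n
              dx(i, j, cur) = dx(i, j, cur) + alpha * dw(i, j, t)
           end do
        end do
     end if
     tmp = cur; cur = prv; prv = tmp                ! swap the two device buffers
  end do
end subroutine update_opposite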

Fig. 4.

Elapsed time per operation of the orthonormalization, the dot product, and the vector update.

Table 1. Details of GPU-system of HPE SGI 8600 in Japan Atomic Energy Agency.
Fig. 5.

Node logic diagram of GPU system on HPE SGI8600 in Japan Atomic Energy Agency.

Fig. 6.

Comparison of elapsed time of vector update operation using three types of data transfer.

Fig. 7.

Schematic code of the vector update operation in which an HtoD data transfer on one process of a CPU and a DtoH transfer on the other process are performed at the same time. Here, two processes run on each processor, and the MPI ranks of these processes are set to be contiguous.

3 Optimization of CPU-GPU Data Transfer

3.1 Asynchronous Data Transfer

When we execute a calculation that consumes a large amount of memory on a GPU, we have to transfer the necessary data from host to device memory (HtoD). The data transfer can be overlapped with the calculation on the GPU by using asynchronous data transfer. In fact, the dot product, the vector update, and the orthonormalization operations shown in Sect. 2.2 can be overlapped with the data transfer using a multi-buffering strategy. Since the data transfer of the dot product and the orthonormalization operations is HtoD only, the overlap can be realized using double buffering. The vector update operation additionally requires transferring the updated vectors from device to host memory (DtoH); therefore, the overlap of data transfer and calculation in this operation is realized by using triple buffering.
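The following sketch shows the double-buffering idea for an HtoD-only operation such as the dot product (an illustration under assumptions, not the authors' code). While the GPU processes tile t in one device buffer, the copy of tile t+1 into the other buffer is issued with cudaMemcpyAsync on a second stream; the host array is assumed to be allocated in pinned memory, since otherwise the copies are not truly asynchronous. Only a single pair of vectors is shown, and all names are assumptions.

subroutine dot_double_buffer(x, dw, partial, l, n)
  use cudafor
  implicit none
  integer, intent(in) :: l, n
  real(8), intent(in)         :: x(n, l)   ! tiled host vector; should be allocated with the pinned attribute
  real(8), device, intent(in) :: dw(n, l)  ! vector resident in device memory
  real(8), intent(out) :: partial          ! per-process partial sum (an MPI_Allreduce follows elsewhere)
  real(8), device :: dx(n, 2)              ! double buffer in device memory
  integer(kind=cuda_stream_kind) :: strm(2), cs
  real(8) :: s
  integer :: t, b, i, istat

  istat = cudaStreamCreate(strm(1))
  istat = cudaStreamCreate(strm(2))
  partial = 0.0d0
  ! prefetch the first tile
  istat = cudaMemcpyAsync(dx(1, 1), x(1, 1), n, cudaMemcpyHostToDevice, strm(1))
  do t = 1, l
     b = mod(t - 1, 2) + 1
     if (t < l) then
        ! start copying tile t+1 into the other buffer on the other stream
        istat = cudaMemcpyAsync(dx(1, 3 - b), x(1, t + 1), n, cudaMemcpyHostToDevice, strm(3 - b))
     end if
     istat = cudaStreamSynchronize(strm(b))   ! tile t has arrived
     cs = strm(b)
     s = 0.0d0
     !$cuf kernel do(1) <<<*, *, 0, cs>>>
     do i = 1, n
        s = s + dx(i, b) * dw(i, t)
     end do
     istat = cudaStreamSynchronize(cs)        ! make sure the partial result is available
     partial = partial + s
  end do
  istat = cudaStreamDestroy(strm(1))
  istat = cudaStreamDestroy(strm(2))
end subroutine dot_double_buffer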

Here, we compare the performance of the three operations using the synchronous data transfer with that using the asynchronous one. Figure 8 shows the relationship between the elapsed time of each operation and the number of tiles. The results indicate that the method using the asynchronous data transfer is faster than the method using the synchronous one for the dot product and the orthonormalization operations. On the other hand, for the vector update operation, the asynchronous data transfer realizes a speedup when the number of tiles is small; however, the performance becomes unstable as the number of tiles increases. As a result, when the number of tiles is large, there are cases where no speedup is obtained.

Fig. 8.

Comparison of elapsed time using synchronous and asynchronous data transfer operations. Here, m means the number of eigenpairs to be solved.

3.2 Reduction of Data Transfers

After updating the vectors, we orthonormalize the column vectors of \(W_{k+1}\), and we calculate \(S_A\) using the dot product operations. The vector update operation requires HtoD transfers for the four matrices \({X}_{k}\), \(\mathcal {X}_{k}\), \({P}_{k}\) and \(\mathcal {P}_{k}\), and DtoH transfers for the four updated matrices \({X}_{k+1}\), \(\mathcal {X}_{k+1}\), \({P}_{k+1}\) and \(\mathcal {P}_{k+1}\). The column vectors of these four updated matrices are used as they are in the following calculations. On the other hand, the column vectors of \({W}_{k+1}\) are the residual vectors, and they are modified by preconditioning. Therefore, the dot product operations and the orthonormalization are performed after the preconditioner is applied. Here, when we apply a preconditioner that works elementwise, like the point Jacobi preconditioner, the updated residual vectors can be modified elementwise. Accordingly, when we apply such a preconditioner or do not perform any preconditioning, we can calculate the partial sums of the dot products among the five matrices \({X}_{k+1}\), \(\mathcal {X}_{k+1}\), \(W_{k+1}\), \(P_{k+1}\) and \(\mathcal {P}_{k+1}\) before the data are transferred to host memoryFootnote 4. Therefore, we can calculate \(X_{k+1} W_{k+1}^T\) and \(P_{k+1} W_{k+1}^T\), which are used for the orthogonalization of the column vectors of \(W_{k+1}\) against those of \(X_{k+1}\) and \(P_{k+1}\), without any HtoD data transfer. However, when we execute the orthogonalization using the results of the dot products, we have to transfer \(X_{k+1}\) and \(P_{k+1}\) from host to device memory. After the orthogonalization, we orthonormalize the column vectors of \(W_{k+1}\) against each other without any HtoD data transfer, and then we obtain \(\mathcal {W}_{k+1}\) by m matrix-vector multiplication operations. In order to calculate the remaining dot products using \(\mathcal {W}_{k+1}\), we have to transfer \(X_{k+1}\) and \(P_{k+1}\) from host to device memory. Accordingly, we reduce the number of matrices transferred from host to device memory by four per iteration. In the following, this method is called ‘Red 1’.

Next, we focus on the operation that orthogonalizes the column vectors of \(W_{k+1}\) against those of \(X_{k+1}\) and \(P_{k+1}\). When we orthogonalize the i-th column vector \(\boldsymbol{w}_{k+1}^{(i)}\), we remove the projection onto each column vector of \(X_{k+1}\) and \(P_{k+1}\) from \(\boldsymbol{w}_{k+1}^{(i)}\) using the results of the dot products. When we operate on \(\boldsymbol{w}_{k+1}^{(i)}\) directly, we have to transfer \(X_{k+1}\) and \(P_{k+1}\) from host to device memory. In order to reduce this data transfer, we represent the operated vector \(\boldsymbol{w}_{k+1}^{(i),*}\) as

$$\begin{aligned} \boldsymbol{w}_{k+1}^{(i),*}=\sum _l f_x(l,i)\boldsymbol{x}_{k+1}^{(l)}+\sum _l f_p(l,i)\boldsymbol{p}_{k+1}^{(l)}+\sum _l f_w(l,i)\boldsymbol{w}_{k+1}^{(l)}, \end{aligned}$$
(5)

and we operate on the coefficients instead of calculating the vector \(\boldsymbol{w}_{k+1}^{(i)}\) directly. After the orthogonalization of the column vectors of \(W_{k+1}\) against those of \(X_{k+1}\) and \(P_{k+1}\), the coefficients are set as \(f_x(l,i)=-(\boldsymbol{x}_{k+1}^{(l)}, \boldsymbol{w}_{k+1}^{(i)})\), \(f_p(l,i)=-(\boldsymbol{p}_{k+1}^{(l)}, \boldsymbol{w}_{k+1}^{(i)})\) and \(f_w(l,i)=0 (l\ne i), 1 (l=i)\). In this operation, when we modify a vector by the operation \(\boldsymbol{t}^{*}=\boldsymbol{t}-(\boldsymbol{t},\boldsymbol{s})\boldsymbol{s}\) for \(||\boldsymbol{s}||=1\), the norm of \(\boldsymbol{t}^{*}\) is given by \(||\boldsymbol{t}^{*}||=\sqrt{||\boldsymbol{t}||^2-(\boldsymbol{t},\boldsymbol{s})^2}\). When the norm of a vector becomes smaller than the tolerance in the process of the orthogonalization, we consider the vector to be a linear combination of the other vectors and eliminate it. Afterwards, we orthonormalize the set of column vectors of \(W_{k+1}\) against each other using the MGS method. Here, we can calculate the dot product \(z_{i,j}=(\boldsymbol{w}_{k+1}^{(i),*},\boldsymbol{w}_{k+1}^{(j),*})\) by

$$\begin{aligned}&z_{i,j}=\left( \sum _l f_x(l,i)\boldsymbol{x}_{k+1}^{(l)}+\sum _l f_p(l,i)\boldsymbol{p}_{k+1}^{(l)}+\sum _l f_w(l,i)\boldsymbol{w}_{k+1}^{(l)},\right. \\&\qquad \qquad \qquad \quad \;\; \left. \sum _l f_x(l,j)\boldsymbol{x}_{k+1}^{(l)}+\sum _l f_p(l,j)\boldsymbol{p}_{k+1}^{(l)}+\sum _l f_w(l,j)\boldsymbol{w}_{k+1}^{(l)}\right) \\&=\sum _l f_x(l,i) f_x(l,j)+\sum _l f_p(l,i) f_p(l,j)\\&\;\; +\sum _l \sum _s f_x(l,i)f_w(s,j) (\boldsymbol{x}_{k+1}^{(l)},\boldsymbol{w}_{k+1}^{(s)}) +\sum _l \sum _s f_p(l,i)f_w(s,j) (\boldsymbol{p}_{k+1}^{(l)},\boldsymbol{w}_{k+1}^{(s)})\\&\;\;+\sum _l \sum _s f_w(l,i)f_x(s,j) (\boldsymbol{w}_{k+1}^{(l)},\boldsymbol{x}_{k+1}^{(s)}) +\sum _l \sum _s f_w(l,i)f_p(s,j) (\boldsymbol{w}_{k+1}^{(l)},\boldsymbol{p}_{k+1}^{(s)})\\&\;\; +\sum _l \sum _s f_w(l,i)f_w(s,j) (\boldsymbol{w}_{k+1}^{(l)},\boldsymbol{w}_{k+1}^{(s)}). \end{aligned}$$

When \(||\boldsymbol{w}_{k+1}^{(j),*}||=1\), the vectors \(\boldsymbol{w}_{k+1}^{(i),*}\) and \(\boldsymbol{w}_{k+1}^{(j),*}\) become orthogonal by calculating \(f_x(l,i)=f_x(l,i)-z_{i,j}f_x(l,j)\), \(f_p(l,i)=f_p(l,i)-z_{i,j}f_p(l,j)\) and \(f_w(l,i)=f_w(l,i)-z_{i,j}f_w(l,j)\) \((l=1,2,\ldots ,m)\). Accordingly, we can orthonormalize all column vectors of \(W_{k+1}\) without performing additional dot product operationsFootnote 5.
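In matrix form, the computation of all \(z_{i,j}\) involves only \(m\times m\) coefficient and Gram matrices, so it can be carried out entirely on small host-side data. The sketch below assumes that \(F_x(l,i)=f_x(l,i)\), \(F_p(l,i)=f_p(l,i)\), \(F_w(l,i)=f_w(l,i)\), and that the Gram matrices \(G_{xw}(l,s)=(\boldsymbol{x}_{k+1}^{(l)},\boldsymbol{w}_{k+1}^{(s)})\), \(G_{pw}(l,s)=(\boldsymbol{p}_{k+1}^{(l)},\boldsymbol{w}_{k+1}^{(s)})\) and \(G_{ww}(l,s)=(\boldsymbol{w}_{k+1}^{(l)},\boldsymbol{w}_{k+1}^{(s)})\) are already available from the dot products computed in the preceding steps; matmul is used for clarity, and the function name is an assumption.

function gram_of_wstar(fx, fp, fw, gxw, gpw, gww, m) result(z)
  implicit none
  integer, intent(in) :: m
  real(8), intent(in) :: fx(m, m), fp(m, m), fw(m, m)      ! coefficient matrices of Eq. (5)
  real(8), intent(in) :: gxw(m, m), gpw(m, m), gww(m, m)   ! Gram matrices of the stored dot products
  real(8) :: z(m, m)
  ! z(i,j) = (w*^(i), w*^(j)) expressed with m x m matrix products only
  z = matmul(transpose(fx), fx) + matmul(transpose(fp), fp)            &
    + matmul(transpose(fx), matmul(gxw, fw))                           &
    + matmul(transpose(fp), matmul(gpw, fw))                           &
    + matmul(transpose(fw), matmul(transpose(gxw), fx))                &
    + matmul(transpose(fw), matmul(transpose(gpw), fp))                &
    + matmul(transpose(fw), matmul(gww, fw))
end function gram_of_wstar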

Therefore, the multiplication of the Hamiltonian and an orthonormalized vector \(\boldsymbol{w}_{k+1}^{(i),*}\) can be represented as

$$\begin{aligned} H\boldsymbol{w}_{k+1}^{(i),*}=\sum _l f_x(l,i)H\boldsymbol{x}_{k+1}^{(l)}+\sum _l f_p(l,i)H\boldsymbol{p}_{k+1}^{(l)}+\sum _l f_w(l,i)H\boldsymbol{w}_{k+1}^{(l)}. \end{aligned}$$
(6)

Here, we calculate the remaining dot products \((H\boldsymbol{x}_{k+1}^{(j)}, \boldsymbol{w}_{k+1}^{(i),*})\), \((H\boldsymbol{p}_{k+1}^{(j)}, \boldsymbol{w}_{k+1}^{(i),*})\) and \((H\boldsymbol{w}_{k+1}^{(i),*}, \boldsymbol{w}_{k+1}^{(j),*})\) for constructing the matrix \(S_A\) from the coefficients and the results of the stored dot products, in consideration of (5) and (6). Since all dot products except \((H\boldsymbol{w}_{k+1}^{(i)},\boldsymbol{w}_{k+1}^{(j)})\) have already been calculated during the vector update operation, we calculate \((H\boldsymbol{w}_{k+1}^{(i)},\boldsymbol{w}_{k+1}^{(j)})\) using \(H\boldsymbol{w}_{k+1}^{(i)}(=\boldsymbol{\mathcal {W}}_{k+1}^{(i)})\) obtained by executing the matrix-vector multiplications. Therefore, we do not need to transfer extra matrix data from host memory for constructing \(S_A\). After we solve the eigenvalue problem for \(S_A\), we transfer the four matrices \(X_{k+1}\), \(\mathcal {X}_{k+1}\), \(P_{k+1}\) and \(\mathcal {P}_{k+1}\) to the device to update \(X_{k+2}\), \(\mathcal {X}_{k+2}\), \(P_{k+2}\) and \(\mathcal {P}_{k+2}\). Before the update, we construct \(W_{k+1}(=(\boldsymbol{w}_{k+1}^{(1),*},\ldots ,\boldsymbol{w}_{k+1}^{(m),*}))\) and \(\mathcal {W}_{k+1}(=(H\boldsymbol{w}_{k+1}^{(1),*},\ldots ,H\boldsymbol{w}_{k+1}^{(m),*}))\) based on (5) and (6) using the transferred matrices and the coefficients \(f_x\), \(f_p\) and \(f_w\). As a result, this strategy reduces the HtoD data transfer by four matrices compared to ‘Red 1’. In the following, this method is called ‘Red 2’.
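Based on (5), the reconstruction of \(W_{k+1}\) can be sketched per tile as below; applying the same coefficients to the tiles of \(\mathcal {X}_{k+1}\), \(\mathcal {P}_{k+1}\) and the multiplied vectors \(H\boldsymbol{w}_{k+1}^{(l)}\) gives \(\mathcal {W}_{k+1}\) by (6) in the same way. Only \(m\times m\) coefficient matrices enter in addition to the tiles that are transferred anyway for the next update. dgemm is the cuBLAS overload of the cublas module; the routine and all names are assumptions, not the authors' code.

subroutine rebuild_w_tile(dxt, dpt, dwt, fx, fp, fw, n, m)
  use cudafor
  use cublas   ! overloaded dgemm that accepts device arrays
  implicit none
  integer, intent(in) :: n, m
  real(8), device, intent(in)    :: dxt(n, m), dpt(n, m)   ! tiles of X_{k+1} and P_{k+1}, just transferred
  real(8), device, intent(inout) :: dwt(n, m)              ! tile of W_{k+1}, overwritten in place
  real(8), device, intent(in)    :: fx(m, m), fp(m, m), fw(m, m)  ! coefficient matrices copied to the device
  real(8), device :: tmp(n, m)
  ! tmp = X Fx + P Fp + W Fw, i.e. Eq. (5) applied to one tile; then copy it back into the W tile
  call dgemm('N', 'N', n, m, m, 1.0d0, dxt, n, fx, m, 0.0d0, tmp, n)
  call dgemm('N', 'N', n, m, m, 1.0d0, dpt, n, fp, m, 1.0d0, tmp, n)
  call dgemm('N', 'N', n, m, m, 1.0d0, dwt, n, fw, m, 1.0d0, tmp, n)
  dwt = tmp
end subroutine rebuild_w_tile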

In order to evaluate the effect of the reduction of data transfers, we execute the two methods (‘Red 1’ and ‘Red 2’) and the conventional method (‘Conv’) described in Sect. 2 with the synchronous, the opposite-directional, and the asynchronous data transfers under the same conditions as in Sect. 2. Here, we set the number of tiles to 20, and we execute the LOBPCG method without a preconditioner. Table 2 shows the elapsed times of the orthonormalization, the vector update, and the dot product operations. In the ‘Red1’ and ‘Red2’ methods, some dot product operations are fused with other operations; therefore, their elapsed times include the elapsed time of the fused dot product operations. The result demonstrates that reducing the number of data transfers between host and device memory considerably improves the performance. Moreover, the ‘Red2’ method realizes the orthonormalization by operating on the coefficients in (5) instead of calculating the vectors directly; therefore, it greatly reduces the elapsed time of the orthonormalization.

Table 2. Effect of the reduction of data transfers. Here, ‘Sync.’, ‘Opposite’ and ‘Async.’ denote the synchronous data transfer, the opposite-directional one, and the asynchronous one, respectively.

4 Numerical Experiments

In this section, we examine the performance of the LOBPCG method for the Hubbard model on the multi-GPU system of the HPE SGI 8600 at the Japan Atomic Energy Agency. We solve the eigenvalue problem of the Hamiltonian derived from the 5 \(\times \) 4-site Hubbard model with 6 up-spins and 6 down-spins. The details of the system are shown in Table 1. Here, the dimension of the Hamiltonian is 1,502,337,600. We attempt to find the eight smallest eigenvalues and the corresponding eigenvectors using a block size of 10 columns. Accordingly, we use the convergence criterion

$$||H\boldsymbol{x}_k^{(i)}-\lambda (i)\boldsymbol{x}_k^{(i)}||\le 10^{-6}, ~i=1,2,\ldots ,8,$$

where \(\lambda (i)\) is an approximate value of the i-th smallest eigenvalue. Here, we set the number of tiles to 20 and use MPIDirect for the communication between GPUs in the matrix-vector multiplication operation. Moreover, we use the zero-shift preconditioner [14, 16]. Since this preconditioner works elementwise, we utilize ‘Red 2’ as the method for reducing the data transfers. Table 3 shows the elapsed time of the ‘Red2’ method using the three types of data transfer. The result indicates that the synchronous data transfer method has the lowest performance of the three. The reason is that a conflict on the bus connecting the CPU and its two GPUs occurs because the two processes transfer data between host and device memory in the same direction simultaneously. The method using the opposite-directional data transfer performs much faster than the synchronous one, because it avoids the conflict by transferring data in opposite directions. However, this method does not overlap the data transfer with the calculation, since its data transfer is a synchronous operation. On the other hand, the asynchronous method can overlap the data transfer with the calculation; therefore, it achieves better performance than the opposite-directional one.

Table 3. Parallel performance of LOBPCG method on SGI HPE8600 system. This table shows the total elapsed time, the number of iterations, and the elapsed time per iteration of ‘Red2’ method using the synchronous, the opposite-direction and the asynchronous data transfer operations.
Table 4. Parallel performance of LOBPCG method on SGI HPE8600 system. This table shows the total elapsed time, the number of iterations, and the elapsed time per iteration of ‘Conv’, ‘Red1’ and ‘Red2’ methods using the asynchronous data transfer operation.

Next, we show the elapsed times of the ‘Conv’, ‘Red1’ and ‘Red2’ methods using the asynchronous data transfer operation in Table 4. The result indicates that ‘Red2’ is the fastest. Although Table 2 indicates that ‘Red1’ is more than 10% slower than ‘Red2’, ‘Red1’ is only a few percent slower in this experiment. The reason is that ‘Red2’ always requires m matrix-vector multiplication operations for calculating \(H\boldsymbol{w}\) by (6), whereas ‘Red1’ executes the multiplications only for the linearly independent vectors, so no multiplication is needed for the vectors eliminated by the orthonormalization operation.

5 Conclusions

We have proposed parallelization and tuning strategies for the LOBPCG method, in which almost all operations are performed on GPUs, in order to solve the eigenvalue problem for a large Hamiltonian derived from the Hubbard model on multi-GPU systems. In this research, the dimension of the Hamiltonian is very large, and some of the vectors are stored in host memory. In order to perform the calculations on GPUs in this situation, we have to transfer data between host and device memory as needed, and the cost of this data transfer is very large. Therefore, we reduced the number of transfer operations by exploiting the structure of the LOBPCG algorithm and thereby improved the performance. Moreover, when the conventional method is executed with two processes on each processor of the system shown in Fig. 5, the two processes transfer data in the same direction at the same time. Accordingly, the bus connecting host and device memory is shared by the two processes, and the throughput per process is limited to about half of the peak throughput. In order to avoid sharing the bus, we have proposed a strategy in which the two processes on each processor transfer data in opposite directions. This method has much better performance than the conventional one.

We proposed these strategies by exploiting the property that only a small amount of device memory is consumed to store the Hamiltonian data. Therefore, the proposed strategies can be applied not only to eigenvalue problems for the Hamiltonian derived from the Hubbard model, but also to problems in which the matrix data consume only a small amount of device memory, or in which the matrix data need not be stored at all because they can be recalculated in every iteration.

In this research, since we mainly focused on the data transfer between host and device memory, all time-consuming operations, that is, the matrix-vector multiplications and the vector operations, have been executed on GPUs. Recently, the performance of CPUs has been improving considerably. Therefore, it may be possible to achieve better performance by performing some of the calculations on CPUs, especially for problems with a lot of data transfer between host and device memory like the problem in this research. In future work, we plan to investigate strategies to appropriately distribute the calculations between CPUs and GPUs.