
1 Introduction

Eigenvalue problems appear in a variety of fields such as quantum dynamics, structural analysis, and economics. Therefore, many solvers for them have been developed, and strategies to improve their performance have also been proposed. In quantum dynamics, by solving the eigenvalue problems derived from quantum models, we can identify the quantum states that characterize the properties of the models. In this research, we focus on the eigenvalue problem for the Hamiltonian derived from the Hubbard model and propose strategies to realize a high-performance solver on multi-GPU systems. The model captures many interesting phenomena such as high-temperature superconductivity [9, 11]; therefore, it has attracted great interest from physicists. The Hamiltonian, which represents the energy of the Hubbard model, is given as

$$\begin{aligned} H =- t\sum _{i,j,\sigma } c_{j \sigma }^{\dagger } c_{i \sigma } + \sum _i U_i n_{i\uparrow } n_{i \downarrow }, \end{aligned}$$
(1)

where t is the hopping parameter from one site to another and \(U_i\) is the repulsive energy for the double occupation of two fermions on the i-th site. The quantities \(c_{i,\sigma }\), \(c_{i,\sigma }^{\dagger }\) and \(n_{i,\sigma }\) are the annihilation, creation, and number operators of a fermion with pseudo-spin \(\sigma \) on the i-th site, respectively. By solving for the ground state (the smallest eigenvalue and the corresponding eigenvector) of the Hamiltonian, we can understand the basic properties of the model, and by solving for multiple eigenpairs (eigenvalues and the corresponding eigenvectors), we can reveal more detailed properties. Therefore, many methods to solve the model have been proposed. One of the most accurate solvers is the exact diagonalization method, which directly computes some eigenpairs of the Hamiltonian derived exactly from the model. In this case, the Hamiltonian becomes a huge sparse symmetric matrix. Accordingly, an iterative method, such as the Lanczos method or the LOBPCG (Locally Optimal Block Preconditioned Conjugate Gradient) method [7, 8], is usually utilized, and parallelization strategies for multi-CPU systems [14] and tuning strategies for single-GPU systems [1, 12, 15, 17] have been proposed. In this research, we propose parallelization and tuning strategies to realize the LOBPCG method for solving several eigenpairs on multi-GPU systems. The parallelization not only realizes a speedup by distributing the calculations, but also enables simulations of larger models by distributing the data. Since the memory of a GPU is generally smaller than that of a CPU, we can treat larger models by storing data in CPU memory (host memory) rather than only in GPU memory (device memory). Accordingly, in order to simulate a larger model, we transfer the data required for an operation from host to device memory, temporarily store them in device memory, and then execute the operation using them. Moreover, we have to transfer the calculation results from device to host memory, if necessary. Since data transfer between host and device memory is much slower than data transfer within a GPU, decreasing its cost is important for a high-performance parallel LOBPCG method on multi-GPU systems. Therefore, we focus on the data transfer between host and device memory in this paper. Accordingly, although it might be possible to achieve higher performance by using CPUs in addition to GPUs, we target an LOBPCG method in which all time-consuming operations are performed only on GPUs.

The rest of the paper is structured as follows. Section 2 presents the implementation schemes on multi-GPU systems. We propose tuning strategies from the viewpoint of data transfer between host and device memory in Sect. 3. Section 4 shows the results of numerical experiments on the HPE SGI 8600 system at the Japan Atomic Energy Agency. A summary and conclusions are given in Sect. 5.

Fig. 1.

LOBPCG method for solving the m smallest eigenvalues and the corresponding eigenvectors of a symmetric matrix H. \(T^{(i)}\) is a preconditioner for the i-th smallest eigenvalue. And \(\boldsymbol{\mathcal {X}}_k^{(i)}\), \(\boldsymbol{\mathcal {P}}_k^{(i)}\), and \(\boldsymbol{\mathcal {W}}_k^{(i)}\) are \(H\boldsymbol{x}_k^{(i)}\), \(H\boldsymbol{p}_k^{(i)}\), and \(H\boldsymbol{w}_k^{(i)}\), respectively. As convergence progresses, the set of iteration vectors \(\{\boldsymbol{x}_k^{(1)},\ldots ,\boldsymbol{x}_k^{(m)},\boldsymbol{w}_k^{(1)},\ldots ,\boldsymbol{w}_k^{(m)},\boldsymbol{p}_k^{(1)},\ldots ,\boldsymbol{p}_k^{(m)}\}\) becomes linearly dependent, and the generalized eigenvalue problem cannot be solved. Therefore, in practice, we orthonormalize the set of vectors so that the algorithm can be executed stably.

2 LOBPCG Method for Solving Multiple Eigenpairs on Multi-GPU Systems

We can solve m eigenpairs of the matrix H using the LOBPCG method shown in Fig. 1Footnote 1. The method requires m matrix-vector multiplications per iteration and some linear operations using the iteration vectors \(\boldsymbol{x}_k^{(i)}\), \(\boldsymbol{w}_k^{(i)}\), \(\boldsymbol{p}_k^{(i)}\), \(\boldsymbol{\mathcal {X}}_k^{(i)}\), \(\boldsymbol{\mathcal {W}}_k^{(i)}\), and \(\boldsymbol{\mathcal {P}}_k^{(i)}\) \((i=1,2,\ldots ,m)\). Moreover, in order to execute the LOBPCG method stably, we have to orthonormalize the iteration vectors \(\boldsymbol{x}_k^{(i)}\), \(\boldsymbol{w}_k^{(i)}\) and \(\boldsymbol{p}_k^{(i)}\); then \(S_B\) becomes the identity matrix. Parallelization schemes of the multiplication for the Hubbard model on multi-CPU systems have been proposed in [14]. In addition, tuning strategies for single-GPU systems have also been proposed [12, 15, 17]. We propose a parallelization of the multiplication on multi-GPU systems by combining these two strategies appropriately. Since the size of device memory is typically a fraction of that of host memory, it is assumed in this paper that we store only the information of the matrix H and 2m vectors (\(\boldsymbol{w}_k^{(i)}\) and \(\boldsymbol{\mathcal {W}}_k^{(i)}\), \(i=1,\ldots ,m\)) in device memory and the other vectors (\(\boldsymbol{x}_k^{(i)}\), \(\boldsymbol{\mathcal {X}}_k^{(i)}\), \(\boldsymbol{p}_k^{(i)}\) and \(\boldsymbol{\mathcal {P}}_k^{(i)}\)) in host memory. The time-consuming operations of the LOBPCG method are the Hamiltonian matrix-vector multiplications and the vector operations (dot product, vector update, and orthonormalization). In the following, we introduce parallelization strategies on multi-GPU systems for each operation.
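As a reference for the discussion below, the following CUDA Fortran fragment is a minimal sketch (not the authors' code) of this memory placement: the matrix data and the W-block of iteration vectors are allocated in device memory, while the X- and P-blocks are kept in page-locked (pinned) host memory so that they can be streamed to the GPU tile by tile. All names and the sparse-storage details are assumptions.

module lobpcg_storage
  ! A minimal sketch of the memory placement described above; all names and the
  ! sparse-storage details are assumptions, not the authors' data structures.
  use cudafor
  implicit none
  ! device-resident data: matrix information and the W-block of iteration vectors
  real(8), allocatable, device :: d_diag(:)                   ! diagonal matrix D of Eq. (2)
  real(8), allocatable, device :: d_aup_val(:), d_adn_val(:)  ! non-zero elements of A_up and A_dn
  integer, allocatable, device :: d_aup_idx(:), d_adn_idx(:)  ! index data of the assumed sparse format
  real(8), allocatable, device :: d_w(:,:), d_hw(:,:)         ! w_k^(i) and H w_k^(i), i = 1, ..., m
  ! host-resident data, kept in pinned (page-locked) memory and streamed tile by tile
  real(8), allocatable, pinned :: h_x(:,:), h_hx(:,:)         ! x_k^(i) and H x_k^(i)
  real(8), allocatable, pinned :: h_p(:,:), h_hp(:,:)         ! p_k^(i) and H p_k^(i)
end module lobpcg_storage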

2.1 Hamiltonian Matrix-Vector Multiplications

When we solve m eigenpairs of the Hamiltonian, we have to perform m Hamiltonian-vector multiplications per iteration. Since each of these multiplications can be executed independently, we focus on the parallelization and tuning strategies for one multiplication. The Hamiltonian is represented as

$$\begin{aligned} H=D+I_{\downarrow }\otimes A_{\uparrow }+A_{\downarrow }\otimes I_{\uparrow }, \end{aligned}$$
(2)

where D is a diagonal matrix derived from the repulsive energy, and \(A_{\uparrow }\) (\(A_{\downarrow }\)) is a sparse symmetric matrix derived from the hopping of the up-spins (down-spins). Here, \(I_{\uparrow }\) (\(I_{\downarrow }\)) is the identity matrix whose dimension is the same as that of \(A_{\uparrow }\) (\(A_{\downarrow }\)). When the dimension of \(A_{\uparrow }\) (\(A_{\downarrow }\)) is \(n_\uparrow \) (\(n_\downarrow \)), the dimension of the Hamiltonian H is \(n_\uparrow \times n_{\downarrow }\). For example, for the 5 \(\times \) 4-site model with 5 up-spins and 5 down-spins used in Sect. 2.2, \(n_\uparrow =n_\downarrow =\binom{20}{5}=15504\), so the dimension of H is \(15504^2=240{,}374{,}016\). Since the dimensions of \(A_{\uparrow }\) and \(A_{\downarrow }\) are much smaller than that of H, we store the non-zero elements of \(A_{\uparrow }\), \(A_{\downarrow }\) and D in device memory instead of the matrix H. Here, the multiplication of the Hamiltonian and a vector is

$$\begin{aligned} Hv=Dv+(I_{\downarrow }\otimes A_{\uparrow })v+(A_{\downarrow }\otimes I_{\uparrow })v. \end{aligned}$$

We transform the vector v into the matrix V in the following manner:

$$\begin{aligned} v=\left( \begin{array}{c} v_{1}\\ v_{2}\\ \vdots \\ v_{n_{\uparrow } \times n_{\downarrow }} \end{array} \right) \rightarrow V=\left( \begin{array}{cccc} v_{1}&{}v_{n_\uparrow +1}&{}\cdots &{}v_{n_\uparrow \times (n_{\downarrow }-1)+1} \\ v_{2}&{}v_{n_\uparrow +2}&{}\cdots &{}v_{n_\uparrow \times (n_{\downarrow }-1)+2} \\ \vdots &{}\vdots &{}\vdots &{}\vdots \\ v_{n_\uparrow }&{}v_{2n_\uparrow }&{}\cdots &{}v_{n_\uparrow \times n_{\downarrow }} \end{array} \right) , \end{aligned}$$
(3)

and the diagonal elements of the matrix D into the matrix \(\bar{D}\) in the same manner. Then, the matrix-vector multiplication is changed into the following matrix-matrix multiplication:

$$\begin{aligned} (HV)_{i,j}=\bar{D}_{i,j}V_{i,j}+\sum _{k}{{A_{\uparrow }}_{i,k} V_{k,j} }+\sum _{k}{{A_{\downarrow }}_{j,k} V_{i,k}}, \end{aligned}$$

where the subscript i, j of a matrix denotes the (i, j)-th element of the matrix. Since the matrix V is a dense matrix, we can execute the multiplications in parallel as follows:

  • CAL 1: \(Y^{c}=\bar{D}^{c}\odot V^{c}+A_{\uparrow } V^{c}\),

  • COM 1: all-to-all communication from \(V^{c}\) to \(V^{r}\),

  • CAL 2: \(Z^{r}= V^{r} A^T_{\downarrow }\),

  • COM 2: all-to-all communication from \(Z^{r}\) to \(Z^{c}\),

  • CAL 3: \(Y^{c}=Y^{c}+Z^{c}\).

Here, the superscripts c and r denote the columnwise and rowwise partitioning of the matrix data for the parallel calculation, and \(\odot \) denotes elementwise multiplication. This parallelization strategy requires two all-to-all communication operations per multiplication.
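The following CUDA Fortran sketch illustrates these five steps for one multiplication. It is not the authors' code: for readability, \(A_{\uparrow }\) and \(A_{\downarrow }\) are treated as dense matrices multiplied with the cuBLAS dgemm overload of the cublas module, whereas the actual implementation uses sparse kernels [12, 15, 17]; a CUDA-aware MPI is assumed so that MPI_Alltoall can operate on device buffers directly; and all variable names and the block sizes (nloc_up = \(n_\uparrow \)/nprocs, nloc_dn = \(n_\downarrow \)/nprocs) are assumptions.

subroutine hamiltonian_matvec(dbar_c, a_up, a_dn, v_c, y_c, nprocs, n_up, n_dn, &
                              nloc_up, nloc_dn, comm)
  use cudafor
  use cublas   ! overloaded dgemm that accepts device arrays
  use mpi
  implicit none
  integer, intent(in) :: nprocs, n_up, n_dn, nloc_up, nloc_dn, comm
  real(8), device, intent(in)  :: dbar_c(n_up, nloc_dn)      ! local block of the reshaped diagonal
  real(8), device, intent(in)  :: a_up(n_up, n_up), a_dn(n_dn, n_dn)  ! dense here for clarity
  real(8), device, intent(in)  :: v_c(n_up, nloc_dn)         ! columnwise block of V
  real(8), device, intent(out) :: y_c(n_up, nloc_dn)         ! columnwise block of HV
  real(8), device :: v_r(nloc_up, n_dn), z_r(nloc_up, n_dn)  ! rowwise blocks
  real(8), device :: sendbuf(nloc_up, nloc_dn, nprocs), recvbuf(nloc_up, nloc_dn, nprocs)
  integer :: i, j, p, nblk, ierr

  ! CAL 1: Y^c = Dbar^c (elementwise) V^c + A_up V^c
  call dgemm('N', 'N', n_up, nloc_dn, n_up, 1.0d0, a_up, n_up, v_c, n_up, 0.0d0, y_c, n_up)
  !$cuf kernel do(2) <<<*,*>>>
  do j = 1, nloc_dn
     do i = 1, n_up
        y_c(i, j) = y_c(i, j) + dbar_c(i, j) * v_c(i, j)
     end do
  end do

  ! COM 1: pack the rows destined to each process, then all-to-all from V^c to V^r
  do p = 1, nprocs
     !$cuf kernel do(2) <<<*,*>>>
     do j = 1, nloc_dn
        do i = 1, nloc_up
           sendbuf(i, j, p) = v_c((p - 1) * nloc_up + i, j)
        end do
     end do
  end do
  nblk = nloc_up * nloc_dn
  call MPI_Alltoall(sendbuf, nblk, MPI_DOUBLE_PRECISION, v_r, nblk, MPI_DOUBLE_PRECISION, comm, ierr)

  ! CAL 2: Z^r = V^r A_dn^T
  call dgemm('N', 'T', nloc_up, n_dn, n_dn, 1.0d0, v_r, nloc_up, a_dn, n_dn, 0.0d0, z_r, nloc_up)

  ! COM 2: all-to-all from Z^r back to the columnwise partitioning
  call MPI_Alltoall(z_r, nblk, MPI_DOUBLE_PRECISION, recvbuf, nblk, MPI_DOUBLE_PRECISION, comm, ierr)

  ! CAL 3: Y^c = Y^c + Z^c (unpacking the received blocks)
  do p = 1, nprocs
     !$cuf kernel do(2) <<<*,*>>>
     do j = 1, nloc_dn
        do i = 1, nloc_up
           y_c((p - 1) * nloc_up + i, j) = y_c((p - 1) * nloc_up + i, j) + recvbuf(i, j, p)
        end do
     end do
  end do
end subroutine hamiltonian_matvec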

When all non-zero elements of the matrix H and the data of the decomposed matrices \(V^c\) and \(V^r\) are stored in device memory, we can execute CAL 1, CAL 2 and CAL 3 by using the algorithms proposed for single-GPU systems [12, 15, 17]. Here, the storage layout of \(V^c\) and \(V^r\) should be row-major and column-major order, respectively, so that \(A_{\uparrow }V^{c}\) and \( V^{r} A^T_{\downarrow }\) are performed with contiguous memory access. The method therefore requires changing the storage layout of the matrix between column-major and row-major order; however, this operation can be executed with high performance by using the shared memory on GPUs.
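The single-GPU kernels of [12, 15, 17] are not reproduced here, but the layout change itself can be sketched with the standard tiled transpose through shared memory shown below (a generic kernel, not the authors' code). Each 32 \(\times \) 32 tile is staged in shared memory with one padding row so that both the read and the transposed write access global memory contiguously; the kernel and all names are assumptions.

module transpose_mod
  use cudafor
  implicit none
  integer, parameter :: T = 32   ! tile width
contains
  ! Tiled transpose through shared memory: odata = transpose(idata).
  ! Launch example:
  !   call transpose_tile<<<dim3((nrow+T-1)/T, (ncol+T-1)/T, 1), dim3(T, T, 1)>>>(d_vr, d_vc, nrow, ncol)
  attributes(global) subroutine transpose_tile(odata, idata, nrow, ncol)
    integer, value :: nrow, ncol
    real(8), intent(out) :: odata(ncol, nrow)
    real(8), intent(in)  :: idata(nrow, ncol)
    real(8), shared :: tile(T + 1, T)   ! one padding row avoids shared-memory bank conflicts
    integer :: i, j
    ! coalesced read of a T x T tile of idata
    i = (blockIdx%x - 1) * T + threadIdx%x
    j = (blockIdx%y - 1) * T + threadIdx%y
    if (i <= nrow .and. j <= ncol) tile(threadIdx%x, threadIdx%y) = idata(i, j)
    call syncthreads()
    ! coalesced write of the transposed tile into odata
    i = (blockIdx%y - 1) * T + threadIdx%x
    j = (blockIdx%x - 1) * T + threadIdx%y
    if (i <= ncol .and. j <= nrow) odata(i, j) = tile(threadIdx%y, threadIdx%x)
  end subroutine transpose_tile
end module transpose_mod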

Fig. 2.

Schematic CUDA Fortran codes of the vector operations of the LOBPCG method for solving m eigenpairs on multi-GPU systems. Here, the number of tiles of the vectors is l and the dimension of each tiled vector is n. Therefore, the dimension of the decomposed vector on each process is \(l\times n\). Here, h is the cuBLAS library handle. And, x, p, dot and dot0 are stored in host memory, and dx, dp, w, \(\alpha \), \(\beta \), \(\gamma \), dd, dd0 and dtmp in device memory. The codes are simplified to indicate the relationship between the data transfer and the execution on a GPU; therefore, they should be extended appropriately to actually execute the LOBPCG method. Moreover, the data are packed and transferred in one operation.

2.2 Vector Operations

The vector operations in the LOBPCG method can be categorized into the following three groups:

  • Dot product for constructing \(3m \times 3m\)-dimensional symmetric matrix (\(S_A\) in Fig. 1),

  • Updating all column vectors of \( X_{k+1}\), \(P_{k+1}\), \(W_{k+1}\), \(\mathcal {X}_{k+1}\) and \(\mathcal {P}_{k+1}\), where \(X_{k+1}=( \boldsymbol{x}_{k+1}^{(1)},\ldots ,\boldsymbol{x}_{k+1}^{(m)})\), \(W_{k+1}=( \boldsymbol{w}_{k+1}^{(1)},\ldots ,\boldsymbol{w}_{k+1}^{(m)})\), \(P_{k+1}=( \boldsymbol{p}_{k+1}^{(1)},\ldots ,\boldsymbol{p}_{k+1}^{(m)})\), \(\mathcal {X}_{k+1}=(\boldsymbol{\mathcal {X}}_{k+1}^{(1)},\ldots ,\boldsymbol{\mathcal {X}}_{k+1}^{(m)})\) \((=HX_{k+1})\) and \(\mathcal {P}_{k+1}=(\boldsymbol{\mathcal {P}}_{k+1}^{(1)},\ldots ,\boldsymbol{\mathcal {P}}_{k+1}^{(m)})\) \((=HP_{k+1})\).

  • Orthonormalization for all column vectors of \(X_{k+1}\), \(P_{k+1}\) and \(W_{k+1}\).

We discuss the dot product and update for vectors first, then the orthonormalization of vectors.

Dot Product and Vector Update. The dot product is parallelized by calculating the partial sum of the dot product on each process and performing a sum-reduction of the partial sums over all processes with MPI_ALLREDUCE. The vector update operation is parallelized by performing the ‘axpy’ operation on the decomposed vectors on each process without any data communication between processes. When all vectors are stored in device memory, these vector operations can be parallelized straightforwardly. However, in this research, four matrices (\(X_k\), \(\mathcal {X}_k\), \(P_k\) and \(\mathcal {P}_k\)) are stored in host memory, and their data have to be transferred to device memory. Therefore, we partition the vectors into tiles, transfer each tile from host to device memory, and execute the partial dot product operation on each tile (Fig. 2(a)) [10]. The vector update operation can also be executed using almost the same strategy (Fig. 2(b)); however, it additionally requires transferring the updated vectors from device to host memory.
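A minimal CUDA Fortran sketch of this tiled dot product is given below (it corresponds to Fig. 2(a) only in spirit, not line by line). One tile of one host-resident vector is copied to a device buffer and multiplied with the device-resident vectors by the ddot overload of the cublas module; the partial sums are finally reduced over the processes. The tiling layout (n, m, l) and all names are assumptions.

subroutine tiled_dot(x, dw, dot, l, n, m, comm)
  use cudafor
  use cublas   ! overloaded ddot that accepts device arrays
  use mpi
  implicit none
  integer, intent(in) :: l, n, m, comm
  real(8), intent(in)         :: x(n, m, l)   ! tiled vectors kept in host memory
  real(8), device, intent(in) :: dw(n, m, l)  ! vectors resident in device memory
  real(8), intent(out) :: dot(m, m)
  real(8), device :: dx(n)                    ! device buffer for one tile of one vector
  integer :: i, j, t, ierr

  dot = 0.0d0
  do t = 1, l
     do i = 1, m
        dx = x(:, i, t)                       ! HtoD transfer of one tile of x^(i)
        do j = 1, m
           dot(i, j) = dot(i, j) + ddot(n, dx, 1, dw(:, j, t), 1)
        end do
     end do
  end do
  ! sum the per-process partial sums to obtain the global dot products
  call MPI_Allreduce(MPI_IN_PLACE, dot, m * m, MPI_DOUBLE_PRECISION, MPI_SUM, comm, ierr)
end subroutine tiled_dot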

Orthonormalization of Vectors. Here, we discuss the orthonormalization of the iteration vectors \(\boldsymbol{x}_k^{(i)}\), \(\boldsymbol{p}_k^{(i)}\), and \(\boldsymbol{w}_k^{(i)}\). In order to execute the LOBPCG method for solving multiple eigenpairs stably, the set of basis vectors of the subspace spanned by all iteration vectors, that is, the set of all column vectors of \(X_k\), \(P_k\), and \(W_k\), should be linearly independent. Therefore, we should orthogonalize the basis in every iteration. The orthogonalization can be realized by many methods, such as the modified Gram-Schmidt (MGS) orthonormalization, TSQR, CholeskyQR, CholeskyQR2 and so on [2, 4, 13]. When we apply these methods to the vectors directly, we have to transfer the vectors stored in host memory. Therefore, we adopt the orthogonalization strategy for the LOBPCG method proposed by Hetmaniuk and Lehoucq (HL) [3, 5]. The HL strategy is as follows:

  • Here, we denote the eigenvector corresponding to the i-th smallest eigenvalue of \(S_A\) by \(\boldsymbol{v}^{(i)}\). It is assumed that the set of vectors \(\{\boldsymbol{v}^{(1)},\boldsymbol{v}^{(2)},\ldots ,\boldsymbol{v}^{(m)}\}\) is orthonormal and, moreover, that the set of all column vectors of the matrices \(X_k\), \(P_k\) and \(W_k\) in the k-th iteration is also orthonormal.

  • When the (i, j)-th elements of the matrices \(C^1\), \(C^2\) and \(C^3\) are defined as \(C^1_{(i,j)}=\alpha ^{(i)}_j\), \(C^2_{(i,j)}=\beta ^{(i)}_j\) and \(C^3_{(i,j)}=\gamma ^{(i)}_j\), that is,

    $$\left( \begin{array}{c} C^1\\ C^2\\ C^3 \end{array} \right) =(\boldsymbol{v}^{(1)},\boldsymbol{v}^{(2)},\ldots ,\boldsymbol{v}^{(m)}),$$

    \(X_{k+1}\) and \(P_{k+1}\) are calculated by

    $$(X_{k+1},P_{k+1})= (X_{k},W_{k},P_{k})C , ~~~~C=\left( \begin{array}{cc} C^1&{}0\\ C^2&{}C^2\\ C^3&{}C^3 \end{array} \right) .$$

    Here, the set of all column vectors of the matrix \(X_{k+1}\) becomes orthonormal.

  • We decompose the matrix C as \(C=QR\) using the QR decomposition. When we calculate \(X_{k+1}\) and \(P_{k+1}\) using the following formula

    $$(X_{k+1},P_{k+1})= (X_{k},W_{k},P_{k})Q, $$

    all columns of \(X_{k+1}\) and \(P_{k+1}\) are orthonormalFootnote 2.

  • Next, we orthonormalize the column vectors of \(W_{k+1}\) against those of \(X_{k+1}\) and \(P_{k+1}\) using the classical Gram-Schmidt (CGS) method, that is, \(W_{k+1}\) is updated by the following formula [6]

    $$\begin{aligned} W_{k+1}=(I-X_{k+1}X_{k+1}^T-P_{k+1}P_{k+1}^T)W_{k+1}. \end{aligned}$$
    (4)

    The method requires far fewer allreduce communication operations than the MGS method.

  • Finally, we orthonormalize the set of column vectors of \(W_{k+1}\) using the MGS method. The MGS method requires a lot of allreduce communication operations; however, the number of operations in this case is reduced to about one-ninth of that for orthonormalizing all column vectors of the three matrices \(X_{k+1}\), \(P_{k+1}\) and \(W_{k+1}\). Moreover, since we orthonormalize only the column vectors of \(W_{k+1}\), which are stored in device memory, we do not need to transfer any vectors between host and device memory.

In this operation, when we find that a vector is a linear combination of the other vectors, we eliminate it in that iteration. We show the schematic CUDA Fortran code for orthogonalizing the column vectors of \(W_{k+1}\) against those of \(X_{k+1}\) and \(P_{k+1}\) in Fig. 3. The operation requires transferring \(X_{k+1}\) and \(P_{k+1}\) from host to device memory twice.
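A minimal sketch of this CGS step for the X-part of (4) is shown below (the P-part is handled in exactly the same way, and Fig. 3 shows the authors' actual code). The first pass accumulates \(C=X_{k+1}^T W_{k+1}\) tile by tile, the partial sums are reduced over the processes, and the second pass applies \(W_{k+1} \leftarrow W_{k+1}-X_{k+1}C\); the tiles of \(X_{k+1}\) are therefore transferred twice, as stated above. dgemm is the cuBLAS overload of the cublas module, and the tiling layout and names are assumptions.

subroutine cgs_against_x(x, dw, l, n, m, comm)
  use cudafor
  use cublas   ! overloaded dgemm that accepts device arrays
  use mpi
  implicit none
  integer, intent(in) :: l, n, m, comm
  real(8), intent(in)            :: x(n, m, l)   ! tiles of X_{k+1} kept in host memory
  real(8), device, intent(inout) :: dw(n, m, l)  ! W_{k+1}, resident in device memory
  real(8), device :: dxt(n, m)                   ! device buffer for one tile of X_{k+1}
  real(8), device :: dc(m, m)
  real(8)         :: c(m, m)
  integer :: t, ierr

  ! first pass: C = X^T W, accumulated over the tiles and then over the processes
  dc = 0.0d0
  do t = 1, l
     dxt = x(:, :, t)                            ! HtoD transfer of one tile
     call dgemm('T', 'N', m, m, n, 1.0d0, dxt, n, dw(:, :, t), n, 1.0d0, dc, m)
  end do
  c = dc                                         ! DtoH transfer of the small m x m matrix only
  call MPI_Allreduce(MPI_IN_PLACE, c, m * m, MPI_DOUBLE_PRECISION, MPI_SUM, comm, ierr)
  dc = c

  ! second pass: W = W - X C; the tiles of X_{k+1} are transferred a second time
  do t = 1, l
     dxt = x(:, :, t)                            ! HtoD transfer of one tile
     call dgemm('N', 'N', n, m, m, -1.0d0, dxt, n, dc, m, 1.0d0, dw(:, :, t), n)
  end do
end subroutine cgs_against_x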

Fig. 3.

Schematic CUDA Fortran code of orthogonalizing the column vectors of W against those of X and P using the classical Gram-Schmidt method.

Performance. Here, we examine the performance of these three operations using the above methods by solving the eigenvalue problem of the Hamiltonian derived from the 5 \(\times \) 4-site Hubbard model with 5 up-spins and 5 down-spins, using 16 GPUs on 4 nodes of the HPE SGI 8600 system at the Japan Atomic Energy Agency. The details of the system are shown in Table 1. Here, the dimension of the Hamiltonian is 240,374,016; therefore, the dimension of the partitioned vector on each GPU is 15,023,376.

Figure 4 shows the elapsed time of each operation for solving 5 or 10 eigenpairs. The result indicates that the elapsed times tend to increase as the number of tiles becomes larger. The reason is that, as the number of tiles increases, the number of data transfer operations increases and the data size of each transfer decreases; that is, the total latency of the data transfers increases and the throughput declines. Moreover, it is noted that the elapsed time of the vector update operation is unstable. The reason is as follows. When we execute operations on the four GPUs of a node whose logic diagram is shown in Fig. 5, we generally run two processes on each processor. When the two processes on a processor simultaneously transfer data between host and device memory in the same direction, the bus connecting the CPU and its two GPUs is shared by the two processes, and the throughput per process is limited to about half of the throughput achieved by a single process. At the beginning of the iteration, the data transfer operations on the two processes of one processor are synchronized. However, when the number of tiles is large, that is, when the number of iterations needed to complete the operation is large, the same-directional data transfer operations on the two processes gradually get out of synchronization, and transfer operations in opposite directions may sometimes coincide. Opposite-directional transfers can be performed without conflict, and each achieves almost the peak throughput when the size of the transferred data is large. Figure 6 shows the elapsed times of the vector update operation with these data transfer patterns for solving 10 eigenpairs. The result demonstrates that the elapsed time of the conventional method (synchronous data transfer) lies in the interval between that of the same-directional data transfer with synchronizationFootnote 3 and that of the opposite-directional one (Fig. 7). It is also confirmed that the gap between the two data transfer schemes becomes narrower as the number of tiles increases. The reason is that, when transferring large data, the throughput of the opposite-directional data transfer is larger than that of the same-directional one; however, as the size of the data transferred per operation becomes small, that is, as the number of tiles becomes larger, the former declines and ultimately becomes almost the same as the latter.
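A possible realization of the scheme of Fig. 7 is sketched below (an interpretation, not the authors' code). The two processes that share one processor, which have contiguous MPI ranks, order the HtoD and DtoH copies of each tile oppositely and keep in step through a barrier over a communicator pair_comm that is assumed to contain only these two ranks, so that at any moment the shared bus carries transfers in opposite directions. The update itself is replaced by a placeholder axpy-type kernel, and all names are assumptions.

subroutine update_opposite(x, dw, alpha, l, n, m, myrank, pair_comm)
  use cudafor
  use mpi
  implicit none
  integer, intent(in) :: l, n, m, myrank, pair_comm
  real(8), intent(inout)      :: x(n, m, l)   ! tiled vectors kept in host memory
  real(8), device, intent(in) :: dw(n, m, l)  ! vectors resident in device memory
  real(8), intent(in) :: alpha
  real(8), device :: dx(n, m, 2)              ! two device buffers: current tile and previous tile
  integer :: t, i, j, cur, prv, tmp, ierr

  cur = 1; prv = 2
  do t = 1, l + 1
     ! keep the two processes that share one processor in step
     call MPI_Barrier(pair_comm, ierr)
     if (mod(myrank, 2) == 0) then
        if (t <= l) dx(:, :, cur) = x(:, :, t)      ! HtoD copy of tile t first ...
        if (t >= 2) x(:, :, t - 1) = dx(:, :, prv)  ! ... then DtoH copy of the updated tile t-1
     else
        if (t >= 2) x(:, :, t - 1) = dx(:, :, prv)  ! DtoH copy first on the partner rank ...
        if (t <= l) dx(:, :, cur) = x(:, :, t)      ! ... then HtoD copy
     end if
     if (t <= l) then
        ! placeholder axpy-type update standing in for the real update of Fig. 2(b)
        !$cuf kernel do(2) <<<*,*>>>
        do j = 1, m
           do i = 1, n
              dx(i, j, cur) = dx(i, j, cur) + alpha * dw(i, j, t)
           end do
        end do
     end if
     tmp = cur; cur = prv; prv = tmp                ! swap the two device buffers
  end do
end subroutine update_opposite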

Fig. 4.

Elapsed time per operation of the orthonormalization, the dot product, and the vector update.

Table 1. Details of GPU-system of HPE SGI 8600 in Japan Atomic Energy Agency.
Fig. 5.

Node logic diagram of GPU system on HPE SGI8600 in Japan Atomic Energy Agency.

Fig. 6.

Comparison of elapsed time of vector update operation using three types of data transfer.

Fig. 7.

Schematic code of the vector update operation in which an HtoD data transfer on one process of a CPU and a DtoH transfer on the other process are performed at the same time. Here, two processes run on each processor, and the MPI ranks of these processes are set to be contiguous.

3 Optimization of CPU-GPU Data Transfer

3.1 Asynchronous Data Transfer

When we execute a calculation that consumes a large amount of memory on a GPU, we have to transfer the necessary data from host to device memory (HtoD). The data transfer can be overlapped with the calculation on the GPU by using asynchronous data transfer. In fact, the dot product, the vector update, and the orthonormalization operations shown in Sect. 2.2 can be overlapped with the data transfer using a multi-buffering strategy. Since the data transfer of the dot product and the orthonormalization operations is HtoD only, the overlap can be realized using double buffering. The vector update operation additionally requires transferring the updated vectors from device to host memory (DtoH); therefore, the overlap of data transfer and calculation in this operation is realized by using triple buffering.
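The following sketch shows the double-buffering idea for an HtoD-only operation such as the dot product (an illustration under assumptions, not the authors' code). While the GPU processes tile t in one device buffer, the copy of tile t+1 into the other buffer is issued with cudaMemcpyAsync on a second stream; the host array is assumed to be allocated in pinned memory, since otherwise the copies are not truly asynchronous. Only a single pair of vectors is shown, and all names are assumptions.

subroutine dot_double_buffer(x, dw, partial, l, n)
  use cudafor
  implicit none
  integer, intent(in) :: l, n
  real(8), intent(in)         :: x(n, l)   ! tiled host vector; should be allocated with the pinned attribute
  real(8), device, intent(in) :: dw(n, l)  ! vector resident in device memory
  real(8), intent(out) :: partial          ! per-process partial sum (an MPI_Allreduce follows elsewhere)
  real(8), device :: dx(n, 2)              ! double buffer in device memory
  integer(kind=cuda_stream_kind) :: strm(2), cs
  real(8) :: s
  integer :: t, b, i, istat

  istat = cudaStreamCreate(strm(1))
  istat = cudaStreamCreate(strm(2))
  partial = 0.0d0
  ! prefetch the first tile
  istat = cudaMemcpyAsync(dx(1, 1), x(1, 1), n, cudaMemcpyHostToDevice, strm(1))
  do t = 1, l
     b = mod(t - 1, 2) + 1
     if (t < l) then
        ! start copying tile t+1 into the other buffer on the other stream
        istat = cudaMemcpyAsync(dx(1, 3 - b), x(1, t + 1), n, cudaMemcpyHostToDevice, strm(3 - b))
     end if
     istat = cudaStreamSynchronize(strm(b))   ! tile t has arrived
     cs = strm(b)
     s = 0.0d0
     !$cuf kernel do(1) <<<*, *, 0, cs>>>
     do i = 1, n
        s = s + dx(i, b) * dw(i, t)
     end do
     istat = cudaStreamSynchronize(cs)        ! make sure the partial result is available
     partial = partial + s
  end do
  istat = cudaStreamDestroy(strm(1))
  istat = cudaStreamDestroy(strm(2))
end subroutine dot_double_buffer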

Here, we compare the performance of the three operations using the synchronous data transfer with that using the asynchronous one. Figure 8 shows the relationship between the elapsed time of each operation and the number of tiles. The results indicate that the method using the asynchronous data transfer is faster than the method using the synchronous one for the dot product and the orthonormalization operations. On the other hand, for the vector update operation, the asynchronous data transfer realizes a speedup when the number of tiles is small; however, the performance becomes unstable as the number of tiles increases. As a result, when the number of tiles is large, there are cases where no speedup is obtained.

Fig. 8.

Comparison of elapsed time using synchronous and asynchronous data transfer operations. Here, m means the number of eigenpairs to be solved.

3.2 Reduction of Data Transfers

After updating the vectors, we orthonormalize the column vectors of \(W_{k+1}\), and we calculate \(S_A\) using the dot product operations. The vector update operation requires HtoD transfers for the four matrices \({X}_{k}\), \(\mathcal {X}_{k}\), \({P}_{k}\) and \(\mathcal {P}_{k}\), and DtoH transfers for the four updated matrices \({X}_{k+1}\), \(\mathcal {X}_{k+1}\), \({P}_{k+1}\) and \(\mathcal {P}_{k+1}\). The column vectors of these four updated matrices are used as they are in the following calculations. On the other hand, the column vectors of \({W}_{k+1}\) are the residual vectors, and they are modified by preconditioning. Therefore, the dot product operations and the orthonormalization are performed after the preconditioner is applied. Here, when we apply a preconditioner that works elementwise, like the point Jacobi preconditioner, the updated residual vectors can be modified elementwise. Accordingly, when we apply such a preconditioner or do not perform any preconditioning, we can calculate the partial sums of the dot products among the five matrices \({X}_{k+1}\), \(\mathcal {X}_{k+1}\), \(W_{k+1}\), \(P_{k+1}\) and \(\mathcal {P}_{k+1}\) before the data are transferred to host memoryFootnote 4. Therefore, we can calculate \(X_{k+1} W_{k+1}^T\) and \(P_{k+1} W_{k+1}^T\), which are used for the orthogonalization of the column vectors of \(W_{k+1}\) against those of \(X_{k+1}\) and \(P_{k+1}\), without any HtoD data transfer. However, when we execute the orthogonalization using the results of the dot products, we have to transfer \(X_{k+1}\) and \(P_{k+1}\) from host to device memory. After the orthogonalization, we orthonormalize the column vectors of \(W_{k+1}\) against each other without any HtoD data transfer, and then we obtain \(\mathcal {W}_{k+1}\) by m matrix-vector multiplication operations. In order to calculate the remaining dot products using \(\mathcal {W}_{k+1}\), we have to transfer \(X_{k+1}\) and \(P_{k+1}\) from host to device memory. Accordingly, we reduce the number of matrices transferred from host to device memory by four per iteration. In the following, this method is called ‘Red 1’.

Next, we focus on the operation that orthogonalizes the column vectors of \(W_{k+1}\) against those of \(X_{k+1}\) and \(P_{k+1}\). When we orthogonalize the i-th column vector \(\boldsymbol{w}_{k+1}^{(i)}\), we remove the projection onto each column vector of \(X_{k+1}\) and \(P_{k+1}\) from \(\boldsymbol{w}_{k+1}^{(i)}\) using the results of the dot products. When we operate on \(\boldsymbol{w}_{k+1}^{(i)}\) directly, we have to transfer \(X_{k+1}\) and \(P_{k+1}\) from host to device memory. In order to reduce this data transfer, we represent the operated vector \(\boldsymbol{w}_{k+1}^{(i),*}\) as

$$\begin{aligned} \boldsymbol{w}_{k+1}^{(i),*}=\sum _l f_x(l,i)\boldsymbol{x}_{k+1}^{(l)}+\sum _l f_p(l,i)\boldsymbol{p}_{k+1}^{(l)}+\sum _l f_w(l,i)\boldsymbol{w}_{k+1}^{(l)}, \end{aligned}$$
(5)

and we operate on the coefficients instead of calculating the vector \(\boldsymbol{w}_{k+1}^{(i)}\) directly. After the orthogonalization of the column vectors of \(W_{k+1}\) against those of \(X_{k+1}\) and \(P_{k+1}\), the coefficients are set as \(f_x(l,i)=-(\boldsymbol{x}_{k+1}^{(l)}, \boldsymbol{w}_{k+1}^{(i)})\), \(f_p(l,i)=-(\boldsymbol{p}_{k+1}^{(l)}, \boldsymbol{w}_{k+1}^{(i)})\) and \(f_w(l,i)=0 (l\ne i), 1 (l=i)\). In this operation, when we modify a vector by the operation \(\boldsymbol{t}^{*}=\boldsymbol{t}-(\boldsymbol{t},\boldsymbol{s})\boldsymbol{s}\) for \(||\boldsymbol{s}||=1\), the norm of \(\boldsymbol{t}^{*}\) is given by \(||\boldsymbol{t}^{*}||=\sqrt{||\boldsymbol{t}||^2-(\boldsymbol{t},\boldsymbol{s})^2}\). When the norm of a vector becomes smaller than the tolerance in the process of the orthogonalization, we consider the vector to be a linear combination of the other vectors and eliminate it. Afterwards, we orthonormalize the set of column vectors of \(W_{k+1}\) against each other using the MGS method. Here, we can calculate the dot product \(z_{i,j}=(\boldsymbol{w}_{k+1}^{(i),*},\boldsymbol{w}_{k+1}^{(j),*})\) by

$$\begin{aligned}&z_{i,j}=\left( \sum _l f_x(l,i)\boldsymbol{x}_{k+1}^{(l)}+\sum _l f_p(l,i)\boldsymbol{p}_{k+1}^{(l)}+\sum _l f_w(l,i)\boldsymbol{w}_{k+1}^{(l)},\right. \\&\qquad \qquad \qquad \quad \;\; \left. \sum _l f_x(l,j)\boldsymbol{x}_{k+1}^{(l)}+\sum _l f_p(l,j)\boldsymbol{p}_{k+1}^{(l)}+\sum _l f_w(l,j)\boldsymbol{w}_{k+1}^{(l)}\right) \\&=\sum _l f_x(l,i) f_x(l,j)+\sum _l f_p(l,i) f_p(l,j)\\&\;\; +\sum _l \sum _s f_x(l,i)f_w(s,j) (\boldsymbol{x}_{k+1}^{(l)},\boldsymbol{w}_{k+1}^{(s)}) +\sum _l \sum _s f_p(l,i)f_w(s,j) (\boldsymbol{p}_{k+1}^{(l)},\boldsymbol{w}_{k+1}^{(s)})\\&\;\;+\sum _l \sum _s f_w(l,i)f_x(s,j) (\boldsymbol{w}_{k+1}^{(l)},\boldsymbol{x}_{k+1}^{(s)}) +\sum _l \sum _s f_w(l,i)f_p(s,j) (\boldsymbol{w}_{k+1}^{(l)},\boldsymbol{p}_{k+1}^{(s)})\\&\;\; +\sum _l \sum _s f_w(l,i)f_w(s,j) (\boldsymbol{w}_{k+1}^{(l)},\boldsymbol{w}_{k+1}^{(s)}). \end{aligned}$$

When \(||\boldsymbol{w}_{k+1}^{(j),*}||=1\), the vectors \(\boldsymbol{w}_{k+1}^{(i),*}\) and \(\boldsymbol{w}_{k+1}^{(j),*}\) become orthogonal by calculating \(f_x(l,i)=f_x(l,i)-z_{i,j}f_x(l,j)\), \(f_p(l,i)=f_p(l,i)-z_{i,j}f_p(l,j)\) and \(f_w(l,i)=f_w(l,i)-z_{i,j}f_w(l,j)\) \((l=1,2,\ldots ,m)\). Accordingly, we can orthonormalize all column vectors of \(W_{k+1}\) without performing additional dot product operationsFootnote 5.
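In matrix form, the computation of all \(z_{i,j}\) involves only \(m\times m\) coefficient and Gram matrices, so it can be carried out entirely on small host-side data. The sketch below assumes that \(F_x(l,i)=f_x(l,i)\), \(F_p(l,i)=f_p(l,i)\), \(F_w(l,i)=f_w(l,i)\), and that the Gram matrices \(G_{xw}(l,s)=(\boldsymbol{x}_{k+1}^{(l)},\boldsymbol{w}_{k+1}^{(s)})\), \(G_{pw}(l,s)=(\boldsymbol{p}_{k+1}^{(l)},\boldsymbol{w}_{k+1}^{(s)})\) and \(G_{ww}(l,s)=(\boldsymbol{w}_{k+1}^{(l)},\boldsymbol{w}_{k+1}^{(s)})\) are already available from the dot products computed in the preceding steps; matmul is used for clarity, and the function name is an assumption.

function gram_of_wstar(fx, fp, fw, gxw, gpw, gww, m) result(z)
  implicit none
  integer, intent(in) :: m
  real(8), intent(in) :: fx(m, m), fp(m, m), fw(m, m)      ! coefficient matrices of Eq. (5)
  real(8), intent(in) :: gxw(m, m), gpw(m, m), gww(m, m)   ! Gram matrices of the stored dot products
  real(8) :: z(m, m)
  ! z(i,j) = (w*^(i), w*^(j)) expressed with m x m matrix products only
  z = matmul(transpose(fx), fx) + matmul(transpose(fp), fp)            &
    + matmul(transpose(fx), matmul(gxw, fw))                           &
    + matmul(transpose(fp), matmul(gpw, fw))                           &
    + matmul(transpose(fw), matmul(transpose(gxw), fx))                &
    + matmul(transpose(fw), matmul(transpose(gpw), fp))                &
    + matmul(transpose(fw), matmul(gww, fw))
end function gram_of_wstar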

Therefore, the multiplication of the Hamiltonian and an orthonormalized vector \(\boldsymbol{w}_{k+1}^{(i),*}\) can be represented as

$$\begin{aligned} H\boldsymbol{w}_{k+1}^{(i),*}=\sum _l f_x(l,i)H\boldsymbol{x}_{k+1}^{(l)}+\sum _l f_p(l,i)H\boldsymbol{p}_{k+1}^{(l)}+\sum _l f_w(l,i)H\boldsymbol{w}_{k+1}^{(l)}. \end{aligned}$$
(6)

Here, we calculate the remaining dot products \((H\boldsymbol{x}_{k+1}^{(j)}, \boldsymbol{w}_{k+1}^{(i),*})\), \((H\boldsymbol{p}_{k+1}^{(j)}, \boldsymbol{w}_{k+1}^{(i),*})\) and \((H\boldsymbol{w}_{k+1}^{(i),*}, \boldsymbol{w}_{k+1}^{(j),*})\) for constructing the matrix \(S_A\) from the coefficients and the results of the stored dot products, in consideration of (5) and (6). Since all dot products except \((H\boldsymbol{w}_{k+1}^{(i)},\boldsymbol{w}_{k+1}^{(j)})\) have already been calculated during the vector update operation, we calculate \((H\boldsymbol{w}_{k+1}^{(i)},\boldsymbol{w}_{k+1}^{(j)})\) using \(H\boldsymbol{w}_{k+1}^{(i)}(=\boldsymbol{\mathcal {W}}_{k+1}^{(i)})\) obtained by executing the matrix-vector multiplications. Therefore, we do not need to transfer extra matrix data from host memory for constructing \(S_A\). After we solve the eigenvalue problem for \(S_A\), we transfer the four matrices \(X_{k+1}\), \(\mathcal {X}_{k+1}\), \(P_{k+1}\) and \(\mathcal {P}_{k+1}\) to the device to update \(X_{k+2}\), \(\mathcal {X}_{k+2}\), \(P_{k+2}\) and \(\mathcal {P}_{k+2}\). Before the update, we construct \(W_{k+1}(=(\boldsymbol{w}_{k+1}^{(1),*},\ldots ,\boldsymbol{w}_{k+1}^{(m),*}))\) and \(\mathcal {W}_{k+1}(=(H\boldsymbol{w}_{k+1}^{(1),*},\ldots ,H\boldsymbol{w}_{k+1}^{(m),*}))\) based on (5) and (6) using the transferred matrices and the coefficients \(f_x\), \(f_p\) and \(f_w\). As a result, this strategy reduces the HtoD data transfer by four matrices compared to ‘Red 1’. In the following, this method is called ‘Red 2’.
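Based on (5), the reconstruction of \(W_{k+1}\) can be sketched per tile as below; applying the same coefficients to the tiles of \(\mathcal {X}_{k+1}\), \(\mathcal {P}_{k+1}\) and the multiplied vectors \(H\boldsymbol{w}_{k+1}^{(l)}\) gives \(\mathcal {W}_{k+1}\) by (6) in the same way. Only \(m\times m\) coefficient matrices enter in addition to the tiles that are transferred anyway for the next update. dgemm is the cuBLAS overload of the cublas module; the routine and all names are assumptions, not the authors' code.

subroutine rebuild_w_tile(dxt, dpt, dwt, fx, fp, fw, n, m)
  use cudafor
  use cublas   ! overloaded dgemm that accepts device arrays
  implicit none
  integer, intent(in) :: n, m
  real(8), device, intent(in)    :: dxt(n, m), dpt(n, m)   ! tiles of X_{k+1} and P_{k+1}, just transferred
  real(8), device, intent(inout) :: dwt(n, m)              ! tile of W_{k+1}, overwritten in place
  real(8), device, intent(in)    :: fx(m, m), fp(m, m), fw(m, m)  ! coefficient matrices copied to the device
  real(8), device :: tmp(n, m)
  ! tmp = X Fx + P Fp + W Fw, i.e. Eq. (5) applied to one tile; then copy it back into the W tile
  call dgemm('N', 'N', n, m, m, 1.0d0, dxt, n, fx, m, 0.0d0, tmp, n)
  call dgemm('N', 'N', n, m, m, 1.0d0, dpt, n, fp, m, 1.0d0, tmp, n)
  call dgemm('N', 'N', n, m, m, 1.0d0, dwt, n, fw, m, 1.0d0, tmp, n)
  dwt = tmp
end subroutine rebuild_w_tile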

In order to evaluate the effect of the reduction of data transfers, we execute the two methods (‘Red 1’ and ‘Red 2’) and the conventional method (‘Conv’) described in Sect. 2 with the synchronous, the opposite-directional, and the asynchronous data transfers under the same conditions as in Sect. 2. Here, we set the number of tiles to 20, and we execute the LOBPCG method without a preconditioner. Table 2 shows the elapsed times of the orthonormalization, the vector update, and the dot product operations. In the ‘Red1’ and ‘Red2’ methods, some dot product operations are fused with other operations; therefore, their elapsed times include the elapsed time of the fused dot product operations. The result demonstrates that reducing the number of data transfers between host and device memory considerably improves the performance. Moreover, the ‘Red2’ method realizes the orthonormalization by operating on the coefficients in (5) instead of calculating the vectors directly; therefore, it greatly reduces the elapsed time of the orthonormalization.

Table 2. Effect of the reduction of data transfers. Here, ‘Sync.’, ‘Opposite’ and ‘Async.’ denote the synchronous data transfer, the opposite-directional one, and the asynchronous one, respectively.

4 Numerical Experiments

In this section, we examine the performance of the LOBPCG method for the Hubbard model on the multi-GPU system of the HPE SGI 8600 at the Japan Atomic Energy Agency. We solve the eigenvalue problem of the Hamiltonian derived from the 5 \(\times \) 4-site Hubbard model with 6 up-spins and 6 down-spins. The details of the system are shown in Table 1. Here, the dimension of the Hamiltonian is 1,502,337,600. We attempt to find the eight smallest eigenvalues and the corresponding eigenvectors using a block size of 10 columns. Accordingly, we use the convergence criterion

$$||H\boldsymbol{x}_k^{(i)}-\lambda (i)\boldsymbol{x}_k^{(i)}||\le 10^{-6}, ~i=1,2,\ldots ,8,$$

where \(\lambda (i)\) is an approximate value of the i-th smallest eigenvalue. Here, we set the number of tiles to 20 and use MPIDirect for the communication between GPUs in the matrix-vector multiplication operation. Moreover, we use the zero-shift preconditioner [14, 16]. Since this preconditioner works elementwise, we utilize ‘Red 2’ as the method for reducing the data transfers. Table 3 shows the elapsed time of the ‘Red2’ method using the three types of data transfer. The result indicates that the synchronous data transfer method has the lowest performance of the three. The reason is that a conflict on the bus connecting the CPU and its two GPUs occurs because the two processes transfer data between host and device memory in the same direction simultaneously. The method using the opposite-directional data transfer performs much faster than the synchronous one, because it avoids the conflict by transferring data in opposite directions. However, this method does not overlap the data transfer with the calculation, since its data transfer is a synchronous operation. On the other hand, the asynchronous method can overlap the data transfer with the calculation; therefore, it achieves better performance than the opposite-directional one.

Table 3. Parallel performance of LOBPCG method on SGI HPE8600 system. This table shows the total elapsed time, the number of iterations, and the elapsed time per iteration of ‘Red2’ method using the synchronous, the opposite-direction and the asynchronous data transfer operations.
Table 4. Parallel performance of LOBPCG method on SGI HPE8600 system. This table shows the total elapsed time, the number of iterations, and the elapsed time per iteration of ‘Conv’, ‘Red1’ and ‘Red2’ methods using the asynchronous data transfer operation.

Next, we show the elapsed times of the ‘Conv’, ‘Red1’ and ‘Red2’ methods using the asynchronous data transfer operation in Table 4. The result indicates that ‘Red2’ is the fastest. Although Table 2 indicates that ‘Red1’ is more than 10% slower than ‘Red2’, ‘Red1’ is only a few percent slower in this experiment. The reason is that ‘Red2’ always requires m matrix-vector multiplication operations for calculating \(H\boldsymbol{w}\) by (6), whereas ‘Red1’ executes the multiplications only for the linearly independent vectors, so no multiplication is needed for the vectors eliminated by the orthonormalization operation.

5 Conclusions

We have proposed parallelization and tuning strategies for the LOBPCG method, in which almost all operations are performed on GPUs, in order to solve the eigenvalue problem for a large Hamiltonian derived from the Hubbard model on multi-GPU systems. In this research, the dimension of the Hamiltonian is very large, and some of the vectors are stored in host memory. In order to perform the calculations on GPUs in this situation, we have to transfer data between host and device memory as needed, and the cost of this data transfer is very large. Therefore, we reduced the number of transfer operations by exploiting the structure of the LOBPCG algorithm and thereby improved the performance. Moreover, when the conventional method is executed with two processes on each processor of the system shown in Fig. 5, the two processes transfer data in the same direction at the same time. Accordingly, the bus connecting host and device memory is shared by the two processes, and the throughput per process is limited to about half of the peak throughput. In order to avoid sharing the bus, we have proposed a strategy in which the two processes on each processor transfer data in opposite directions. This method has much better performance than the conventional one.

We proposed these strategies by exploiting the property that only a small amount of device memory is consumed to store the Hamiltonian data. Therefore, the proposed strategies can be applied not only to eigenvalue problems for the Hamiltonian derived from the Hubbard model, but also to problems in which the matrix data consume only a small amount of device memory, or in which the matrix data need not be stored at all because they can be recalculated in every iteration.

In this research, since we mainly focused on the data transfer between host and device memory, all time-consuming operations, that is, the matrix-vector multiplications and the vector operations, have been executed on GPUs. Recently, the performance of CPUs has been improving considerably. Therefore, it may be possible to achieve better performance by performing some of the calculations on CPUs, especially for problems with a lot of data transfer between host and device memory like the problem in this research. In future work, we plan to investigate strategies to appropriately distribute the calculations between CPUs and GPUs.