We target elastic and viscoelastic crustal deformation in response to a given fault slip. Following [8], the governing equation is
$$\begin{aligned} \sigma _{ij,j}+f_i=0, \end{aligned}$$
(1)
with
$$\begin{aligned} \dot{\sigma }_{ij}=\lambda \dot{\epsilon }_{kk}\delta _{ij}+2\mu \dot{\epsilon }_{ij}-\frac{\mu }{\eta }\left( \sigma _{ij}-\frac{1}{3}\sigma _{kk}\delta _{ij}\right) , \end{aligned}$$
(2)
$$\begin{aligned} \epsilon _{ij}=\frac{1}{2}(u_{i,j}+u_{j,i}), \end{aligned}$$
(3)
where \(\sigma _{ij}\) and \(f_{i}\) are the stress tensor and outer force. \(\dot{(\ \ )}\), \((\ \ )_{,i}\), \(\delta _{ij}\), \(\eta \), \(\epsilon _{ij}\), and \(u_{i}\) are the first derivative in time, the spatial derivative in the i-th direction, the Kronecker delta, the viscosity coefficient, the strain tensor, and the displacement, respectively. \(\lambda \) and \(\mu \) are Lamé's constants. Discretization of this equation by the finite-element method leads to solving a large system of linear equations. To reduce the time-to-solution, the solver basically requires (i) good convergence and (ii) small computational cost in each kernel. The proposed method, which takes these requirements into account, is based on the viscoelastic analysis of [10] and can be described as follows (Algorithms 1 and 2).
An adaptive preconditioned conjugate gradient solver with the Element-by-Element method [13], a multi-grid method, and mixed-precision arithmetic is used in Algorithm 2. Most of the computational cost lies in the inner loop of Algorithm 2. This loop can be computed in single precision, which reduces the computational cost and data transfer size; thus it is expected to be suitable for GPU systems. In addition, we introduce the multi-grid method and use a coarse model to estimate the initial solution for the preconditioning part. This reduces the overall cost of the preconditioner, as the coarse model has fewer degrees of freedom than the target model. Below, we call line 7 of Algorithm 2(a) the inner coarse loop and line 9 of Algorithm 2(a) the inner fine loop. First-order tetrahedral elements are used in the inner coarse loop, and second-order tetrahedral elements are used in the inner fine loop. The most computationally costly kernel is the Element-by-Element kernel, which computes sparse matrix-vector products. It computes the product of each element stiffness matrix and the corresponding part of the vector element by element, and adds up the results over all elements to obtain the global matrix-vector product. As the element matrices are computed on the fly, the data transfer size from memory can be reduced significantly. This circumvents the memory bandwidth bottleneck, and is thus suitable for recent architectures, including GPUs, whose memory bandwidth is low compared with their arithmetic capability. In summary, our base solver [1] performs much of the computation in single precision, reduces the amount of data transfer and computation, and avoids memory-bound computation in the sparse matrix-vector multiplication. These are desirable conditions for GPUs to exhibit high performance. On the other hand, the key kernel in the solver, the Element-by-Element kernel, requires many random data accesses when adding up the element-wise results. This data access becomes the bottleneck of the solver. In this paper, we aim to improve the performance of the Element-by-Element kernel by adding the two techniques described in the following subsections to our baseline solver.
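As a point of reference for the discussion below, a minimal CUDA sketch of such an Element-by-Element matrix-vector product is shown; all identifiers are illustrative assumptions, and `compute_element_matrix` is only a placeholder for the on-the-fly element stiffness computation, not the actual routine.

```cuda
#include <cuda_runtime.h>

// Placeholder for the on-the-fly element stiffness computation; in practice it
// is derived from the element geometry and the material constants (lambda, mu).
__device__ void compute_element_matrix(int e, const int *connect,
                                        const float *coords, float Ke[12][12]) {
  for (int i = 0; i < 12; ++i)
    for (int j = 0; j < 12; ++j)
      Ke[i][j] = (i == j) ? 1.0f : 0.0f;  // dummy values for illustration
}

// Illustrative baseline Element-by-Element matrix-vector product
// (one thread per first-order tetrahedral element, single precision).
__global__ void ebe_spmv(const int n_elem,
                         const int *__restrict__ connect,  // [n_elem*4] node ids
                         const float *__restrict__ coords, // nodal coordinates
                         const float *__restrict__ u,      // input vector (3 dof/node)
                         float *__restrict__ r)            // output vector (3 dof/node)
{
  int e = blockIdx.x * blockDim.x + threadIdx.x;
  if (e >= n_elem) return;

  float Ke[12][12], ue[12], re[12];
  compute_element_matrix(e, connect, coords, Ke);  // element matrix generated on the fly

  // gather nodal values of the input vector
  for (int a = 0; a < 4; ++a) {
    int n = connect[4 * e + a];
    for (int d = 0; d < 3; ++d) ue[3 * a + d] = u[3 * n + d];
  }

  // element-wise matrix-vector product
  for (int i = 0; i < 12; ++i) {
    re[i] = 0.0f;
    for (int j = 0; j < 12; ++j) re[i] += Ke[i][j] * ue[j];
  }

  // scatter-add to the global vector; nodes shared by elements require atomics
  for (int a = 0; a < 4; ++a) {
    int n = connect[4 * e + a];
    for (int d = 0; d < 3; ++d) atomicAdd(&r[3 * n + d], re[3 * a + d]);
  }
}
```

The random atomic additions in the final loop correspond to the data access bottleneck discussed above.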
2.1 Parallel Computation of Multiple Time Steps
In the developed method, we solve four time steps of the analysis in parallel. [6] describes an approach for obtaining an accurate predictor using multiple time steps for linear wave propagation simulation; this paper extends that algorithm to viscoelastic analyses. As the stress of the previous step needs to be obtained before solving the next step, only one time step can be solved exactly. In Algorithm 1, we focus on solving the equation of the i-th time step. Here we iterate until the error of the i-th time step (displacement) becomes smaller than a prescribed threshold \(\epsilon \), as described in lines 13 to 17 of Algorithm 1. The next three time steps, the \(i+1, i+2\), and \(i+3\)-th, are solved using the solutions of the preceding steps to estimate their solutions. The estimated solution of the preceding step is used to update the stress state and the outer force vector, which corresponds to lines 18 and 19 of Algorithm 1. By using this method, we obtain estimated solutions that improve the convergence of the solver. In this method, the four vectors for the \(i, i+1, i+2\), and \(i+3\)-th time steps can be computed simultaneously. In the Element-by-Element kernel, the matrix is read only once for the four vectors; thus the computational efficiency is improved. In addition, the four values corresponding to the four time steps are consecutive in memory address space. Therefore, we can reduce random memory accesses and computation time compared with running the Element-by-Element kernel four times with one vector each. That is, the arithmetic count per iteration increases by approximately four times, but the decrease in the number of iterations and the improved computational efficiency of the Element-by-Element kernel are expected to reduce the time-to-solution.
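To illustrate how one read of the element matrix can serve the four time-step vectors, a hedged sketch follows (reusing the placeholder `compute_element_matrix` from the earlier sketch); the interleaved layout, in which the four time-step values of each degree of freedom are stored consecutively, and all names are assumptions for illustration.

```cuda
// Sketch of the element-wise product for four time-step vectors at once.
// The four values of each degree of freedom are assumed to be stored
// consecutively (index 4*dof + s), so that one read of Ke serves four vectors.
__global__ void ebe_spmv_x4(const int n_elem, const int *__restrict__ connect,
                            const float *__restrict__ coords,
                            const float *__restrict__ u,  // [12*n_node] interleaved
                            float *__restrict__ r)
{
  int e = blockIdx.x * blockDim.x + threadIdx.x;
  if (e >= n_elem) return;

  float Ke[12][12], ue[12][4], re[12][4];
  compute_element_matrix(e, connect, coords, Ke);  // generated once for 4 vectors

  for (int a = 0; a < 4; ++a) {
    int n = connect[4 * e + a];
    for (int d = 0; d < 3; ++d)
      for (int s = 0; s < 4; ++s)
        ue[3 * a + d][s] = u[4 * (3 * n + d) + s];  // 4 time steps are contiguous
  }

  for (int i = 0; i < 12; ++i) {
    for (int s = 0; s < 4; ++s) re[i][s] = 0.0f;
    for (int j = 0; j < 12; ++j) {
      float k = Ke[i][j];                 // one matrix entry reused for all 4 vectors
      for (int s = 0; s < 4; ++s) re[i][s] += k * ue[j][s];
    }
  }

  for (int a = 0; a < 4; ++a) {
    int n = connect[4 * e + a];
    for (int d = 0; d < 3; ++d)
      for (int s = 0; s < 4; ++s)
        atomicAdd(&r[4 * (3 * n + d) + s], re[3 * a + d][s]);
  }
}
```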
In order to improve convergence, it is important to estimate the initial solution of the fourth time step accurately. We could use a typical predictor such as the Adams-Bashforth method; however, we developed a more accurate predictor that exploits the fact that the solutions of viscoelastic analysis change smoothly between time steps, as described in lines 7 to 12 of Algorithm 1. For the 9th step and onward, we use a linear predictor. In this linear predictor, a linear regression based on the 7 accurately computed preceding time steps is used to predict the future time step. As regressions based on higher-order polynomials or exponential basis functions may lead to jumps in the prediction, we do not use them in this study.
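A minimal host-side sketch of such a linear predictor is given below, assuming uniform time-step spacing and fitting each degree of freedom by least squares over the last seven computed steps; the function and variable names are illustrative, not the actual implementation.

```cuda
// Sketch of the linear predictor: fit u_i(t) = a_i + b_i * t by least squares
// over the last n_hist accurately computed time steps (here n_hist = 7) and
// extrapolate to the next step. Uniform time-step spacing is assumed.
#include <vector>

void predict_linear(const std::vector<std::vector<float>> &hist, // [n_hist][n_dof]
                    std::vector<float> &pred)                    // [n_dof]
{
  const int n = static_cast<int>(hist.size());        // e.g. 7 previous steps
  const int n_dof = static_cast<int>(hist[0].size());
  // with t = 0, 1, ..., n-1, the least-squares slope/intercept have closed forms
  float st = 0.0f, stt = 0.0f;
  for (int t = 0; t < n; ++t) { st += t; stt += float(t) * t; }
  const float denom = n * stt - st * st;
  for (int i = 0; i < n_dof; ++i) {
    float su = 0.0f, stu = 0.0f;
    for (int t = 0; t < n; ++t) { su += hist[t][i]; stu += t * hist[t][i]; }
    const float b = (n * stu - st * su) / denom;  // slope
    const float a = (su - b * st) / n;            // intercept
    pred[i] = a + b * n;                          // extrapolate to t = n (next step)
  }
}
```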
2.2 Reduction of Atomic Access
The algorithm introduced in the previous subsection is expected to circumvent the performance bottleneck of the Element-by-Element kernel. On the other hand, the implementation in the previous study [7] requires adding the element-wise results directly to the global vector using atomic functions, as shown in Fig. 1a. Considering that each node can be shared by multiple elements, performance may decrease due to race conditions; we therefore modify the algorithm to improve the efficiency of the Element-by-Element kernel. We use a buffering method to reduce the number of accesses to the global vector. On NVIDIA GPUs, we can utilize shared memory, in which values can be referenced among threads in the same Block. The computation procedure is as follows and is also described in Fig. 1b.
1. Group elements into Blocks, and store the element-wise results in shared memory.
2. Add up nodal values in shared memory using a precomputed table.
3. Add the summed nodal values to the global vector.
We can expect a performance improvement because the number of atomic operations on the global vector is reduced, and the summation of temporary results is mainly performed as a preliminary reduction in shared memory, which has wider bandwidth. In this scheme, the Block size is assumed to have some impact on performance. By allocating more elements to a Block, we can increase the number of nodal-value reductions performed in shared memory. However, the total number of threads is constrained by the shared memory size. In addition, we need to synchronize the threads in a Block when switching from the element-wise matrix-vector multiplication to the addition part, so using a large number of threads per Block increases the synchronization cost. Under these circumstances, we allocate 128 threads (32 elements \(\times \) four time steps) per Block.
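A hedged sketch of this buffering scheme follows. For brevity, one thread handles one element of a single time step (the actual setting uses 32 elements \(\times \) four time steps = 128 threads per Block), and the precomputed table (`blk_node_ptr`, `blk_node`, `blk_entry_ptr`, `blk_entry`) that maps each node of a Block to the element-wise results to be summed is an assumed layout; the placeholder `compute_element_matrix` from the first sketch is reused.

```cuda
// Illustrative sketch of the shared-memory buffering scheme. The element count
// is assumed to be padded to a multiple of ELEM_PER_BLOCK; names are assumptions.
#define ELEM_PER_BLOCK 32

__global__ void ebe_spmv_buffered(const int *__restrict__ connect,
                                  const float *__restrict__ coords,
                                  const float *__restrict__ u,
                                  float *__restrict__ r,
                                  const int *__restrict__ blk_node_ptr, // node range per Block
                                  const int *__restrict__ blk_node,     // global node ids
                                  const int *__restrict__ blk_entry_ptr,// entry range per node
                                  const int *__restrict__ blk_entry)    // encoded (element, corner)
{
  __shared__ float buf[ELEM_PER_BLOCK][12];   // element-wise results of this Block

  const int e_local = threadIdx.x;
  const int e = blockIdx.x * ELEM_PER_BLOCK + e_local;

  // 1. element-wise matrix-vector product, stored in shared memory
  float Ke[12][12], ue[12];
  compute_element_matrix(e, connect, coords, Ke);
  for (int a = 0; a < 4; ++a) {
    const int n = connect[4 * e + a];
    for (int d = 0; d < 3; ++d) ue[3 * a + d] = u[3 * n + d];
  }
  for (int i = 0; i < 12; ++i) {
    float s = 0.0f;
    for (int j = 0; j < 12; ++j) s += Ke[i][j] * ue[j];
    buf[e_local][i] = s;
  }
  __syncthreads();  // all element-wise results must be visible before reduction

  // 2.-3. sum the contributions of each node of this Block in shared memory
  //       using the precomputed table, then add once to the global vector
  const int node_begin = blk_node_ptr[blockIdx.x];
  const int node_end   = blk_node_ptr[blockIdx.x + 1];
  for (int k = node_begin + threadIdx.x; k < node_end; k += blockDim.x) {
    const int n = blk_node[k];              // global node id
    float s[3] = {0.0f, 0.0f, 0.0f};
    for (int p = blk_entry_ptr[k]; p < blk_entry_ptr[k + 1]; ++p) {
      const int el = blk_entry[p] / 4;      // element-local index in this Block
      const int a  = blk_entry[p] % 4;      // node position within the element
      for (int d = 0; d < 3; ++d) s[d] += buf[el][3 * a + d];
    }
    for (int d = 0; d < 3; ++d) atomicAdd(&r[3 * n + d], s[d]);
  }
}
```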
In GPU computation, SIMT execution with 32 threads (a warp) is used [12]. When the amount of computation differs among the 32 threads, performance is expected to decrease. In the reduction phase, we need to assign one thread per node. However, since the number of connected elements differs significantly between nodes, a large load imbalance among the 32 threads can be expected. Thus, we sort the nodes according to the number of elements to be added up, as described in Fig. 2. This gives a good load balance among the 32 threads, leading to higher computational efficiency.
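A short host-side sketch of such an ordering step is shown below, assuming the reduction table stores one entry per node together with its contribution count; container and field names are illustrative.

```cuda
// Sketch of the node ordering used for the reduction phase: within each Block,
// the nodes are sorted by the number of element-wise results to be added up,
// so that the 32 threads of a warp process nodes with similar amounts of work.
#include <algorithm>
#include <vector>

struct NodeEntry {
  int node;        // global node id
  int n_contrib;   // number of element-wise results to be added up
};

void sort_block_nodes(std::vector<NodeEntry> &entries) {
  // descending count: consecutive threads (same warp) receive similar work
  std::sort(entries.begin(), entries.end(),
            [](const NodeEntry &a, const NodeEntry &b) {
              return a.n_contrib > b.n_contrib;
            });
}
```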
This shared-memory method requires implementation in CUDA. We also use CUDA for the inner-product computation to improve the memory access pattern and thus the efficiency. On the other hand, other computations such as vector addition and subtraction are very simple; each thread uses almost the same number of registers whether we use CUDA or OpenACC, and NVIDIA-specific features such as shared memory or warp functions are not necessary. For these reasons, these computations are memory bandwidth bound, and there is little difference between implementations in CUDA and in OpenACC. Thus, we use CUDA for these performance-sensitive kernels and OpenACC for the other parts. The CUDA part is called via a wrapper function.
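A minimal sketch of such a wrapper is shown below, assuming the arrays already reside on the GPU under OpenACC data management and reusing the `ebe_spmv` kernel from the first sketch; function names and the launch configuration are illustrative.

```cuda
// Sketch of calling a CUDA kernel from OpenACC-managed code via a wrapper.
extern "C" void ebe_spmv_wrapper(const int n_elem, const int *connect_dev,
                                 const float *coords_dev, const float *u_dev,
                                 float *r_dev)
{
  const int threads = 128;
  const int blocks = (n_elem + threads - 1) / threads;
  ebe_spmv<<<blocks, threads>>>(n_elem, connect_dev, coords_dev, u_dev, r_dev);
}

// On the OpenACC side, device pointers of arrays already present on the GPU
// are passed to the wrapper:
//   #pragma acc host_data use_device(connect, coords, u, r)
//   ebe_spmv_wrapper(n_elem, connect, coords, u, r);
```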