Abstract
The computation of crustal deformation following a given fault slip is important for understanding earthquake generation processes and for reducing damage. Crustal deformation analysis must reflect the complex geometry and material heterogeneity of the crust, for which the large-scale unstructured finite-element method is suitable. However, since the computational domain is large, the computational cost has been a bottleneck. In this study, we develop a fast unstructured finite-element solver for GPU-based large-scale computers. By computing several time steps together, we reduce random memory access, and we use predictors suited to viscoelastic analysis to reduce the total computational cost. The developed solver achieved a 2.79-fold speedup over the conventional solver. We show an application example of the developed method through a viscoelastic deformation analysis of the Eastern Mediterranean crust and mantle following a hypothetical M 9 earthquake in Greece, using a finite-element model with 2,403,562,056 degrees-of-freedom.
Keywords
 CUDA
 Finite element analysis
 Conjugate gradient method
1 Introduction
One of the targets of solid earth science is the prediction of the place, magnitude, and time of earthquakes. One approach to this target is to estimate earthquake occurrence probability by comparing the current plate conditions with plate conditions when past earthquakes have occurred [9]. In this process, inverse analysis is required to estimate the current interplate displacement distribution using the crustal deformation data observed at the surface. In order to realize this inverse analysis, forward analysis methods computing elastic and viscoelastic crustal deformation for a given interplate slip distribution are under development.
In previous crustal deformation analyses, simplified models such as horizontally stratified layers were used [8]. However, recent studies point out that simplifying the crustal geometry has significant effects on the response [11]. In recent years, 3D crust property data as well as crustal deformation data measured at observation stations have been accumulating. Thus, 3D crustal deformation analyses that reflect these data at full resolution are anticipated.
The 3D finite-element method is capable of modeling the 3D geometry and material heterogeneity of the crust. However, fully reflecting the available 1 km resolution crust property data in a 3D finite-element crustal deformation analysis leads to large computational problems with more than \(10^9\) degrees-of-freedom. Thus, this analysis must be accelerated using high-performance computers. Targeting the elastic crustal deformation analysis problem, we have been developing unstructured finite-element solvers suitable for GPU-based high-performance computers by designing algorithms that take the underlying hardware into account [7]. Compared with elastic analysis, viscoelastic analysis requires solving many time steps, and thus its computational cost is even larger; we therefore target further acceleration of this solver in this paper.
GPUs have relatively low memory bandwidth compared with their high floating-point performance. Furthermore, data transfer performance decreases further when memory accesses are not coalesced. Finite-element analysis consists mainly of memory-bandwidth-bound kernels, and the most computationally expensive kernel, the sparse matrix-vector product, involves many random memory accesses. Thus, it is not straightforward to utilize the high arithmetic capability of GPUs in finite-element solvers, and reducing data transfer and random access is important for improving computational efficiency. In this study, we accelerate the previous GPU solver by introducing algorithms that reduce data transfer by reducing the number of solver iterations and that reduce random access in the major computational kernels. We use a multi-time-step method together with a predictor to obtain the initial solution of the iterative solver, and we improve the convergence of the iterative solver by adapting the predictor to the characteristics of the solutions of the viscoelastic problem. In addition, by using several vectors in the computation, we reduce random memory access in the major sparse matrix-vector kernel and improve performance.
Section 2 explains the developed method. Section 3 shows the performance of the developed method on Piz Daint [4], a P100 GPU-based supercomputer system. Section 4 shows an application example using the developed method. Section 5 summarizes the paper and discusses future prospects.
2 Methodology
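We target elastic and viscoelastic crustal deformation due to a given fault slip. Following [8], the governing equation is
\[\sigma _{ij,j} + f_{i} = 0,\]
with
\[\dot{\sigma }_{ij} = \lambda \dot{\epsilon }_{kk}\delta _{ij} + 2\mu \dot{\epsilon }_{ij} - \frac{\mu }{\eta }\left( \sigma _{ij} - \frac{1}{3}\sigma _{kk}\delta _{ij}\right) ,\]
\[\epsilon _{ij} = \frac{1}{2}\left( u_{i,j} + u_{j,i}\right) ,\]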
where \(\sigma _{ij}\) and \(f_{i}\) are the stress tensor and external force. \(\dot{(\ \ )}\), \((\ \ )_{,i}\), \(\delta _{ij}\), \(\eta \), \(\epsilon _{ij}\), and \(u_{i}\) are the first derivative in time, spatial derivative in the \(i\)th direction, Kronecker delta, viscosity coefficient, strain tensor, and displacement, respectively. \(\lambda \) and \(\mu \) are Lamé's constants. Discretization of this equation by the finite-element method leads to solving a large system of linear equations. To reduce the time-to-solution, a solver needs (i) good convergence and (ii) low computational cost in each kernel. The proposed method, designed with these requirements in mind, is based on the viscoelastic analysis of [10] and can be described as follows (Algorithms 1 and 2).
An adaptive preconditioned conjugate gradient solver with the Element-by-Element method [13], the multigrid method, and mixed-precision arithmetic is used in Algorithm 2. Most of the computational cost lies in the inner loop of Algorithm 2. Because this loop can be computed in single precision, both the computational cost and the data transfer size are reduced, which makes it well suited to GPU systems. In addition, we introduce the multigrid method and use a coarse model to estimate the initial solution for the preconditioning part. This reduces the overall computational cost of the preconditioner, since the coarse model has fewer degrees-of-freedom than the target model. Below, we refer to line 7 of Algorithm 2(a) as the inner coarse loop and line 9 of Algorithm 2(a) as the inner fine loop. First-order tetrahedral elements are used in the inner coarse loop and second-order tetrahedral elements in the inner fine loop. The most computationally costly kernel is the Element-by-Element kernel, which computes sparse matrix-vector products. It computes the product of the element stiffness matrix and the vector element-wise, and adds the results over all elements to obtain the global matrix-vector product. As element matrices are computed on the fly, the amount of data transferred from memory is reduced significantly. This circumvents the memory bandwidth bottleneck and is thus suitable for recent architectures, including GPUs, whose memory bandwidth is low compared with their arithmetic capability. In summary, our base solver [1] performs much of its computation in single precision, reduces the amount of data transfer and computation, and avoids memory-bound computation in sparse matrix-vector multiplication. These are desirable conditions for high performance on GPUs. On the other hand, the key kernel of the solver, the Element-by-Element kernel, requires many random data accesses when adding up the element-wise results. This data access becomes the bottleneck of the solver. In this paper, we aim to improve the performance of the Element-by-Element kernel by adding the two techniques described in the following subsections to our baseline solver.
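As a rough illustration of the Element-by-Element idea, the following Python sketch computes a global matrix-vector product without ever assembling the global matrix, and checks it against an assembled reference. The 1D chain of two-node spring elements is only a stand-in for the tetrahedral elements of the paper, and all function names and the toy mesh are assumptions made for this example.

```python
def element_matrix(k):
    """Stiffness matrix of a two-node spring element with stiffness k."""
    return [[k, -k], [-k, k]]


def ebe_matvec(connectivity, stiffness, x):
    """Matrix-free y = K x: element matrices are built on the fly."""
    y = [0.0] * len(x)
    for (a, b), k in zip(connectivity, stiffness):
        ke = element_matrix(k)          # computed on the fly, never stored
        xe = [x[a], x[b]]               # gather nodal values (random access)
        ye = [ke[0][0] * xe[0] + ke[0][1] * xe[1],
              ke[1][0] * xe[0] + ke[1][1] * xe[1]]
        y[a] += ye[0]                   # scatter-add (an atomic add on a GPU)
        y[b] += ye[1]
    return y


def assembled_matvec(connectivity, stiffness, x):
    """Reference: assemble the dense global matrix, then multiply."""
    n = len(x)
    K = [[0.0] * n for _ in range(n)]
    for (a, b), k in zip(connectivity, stiffness):
        ke = element_matrix(k)
        for i, p in enumerate((a, b)):
            for j, q in enumerate((a, b)):
                K[p][q] += ke[i][j]
    return [sum(K[i][j] * x[j] for j in range(n)) for i in range(n)]


connectivity = [(0, 1), (1, 2), (2, 3)]   # toy mesh: three springs in a row
stiffness = [2.0, 3.0, 5.0]
x = [1.0, 0.0, 2.0, -1.0]
y_ebe = ebe_matvec(connectivity, stiffness, x)
y_ref = assembled_matvec(connectivity, stiffness, x)
print(y_ebe, y_ref)
```

The scatter-add at the end of each element is the step that turns into random, possibly conflicting memory access on a GPU, which is the access pattern the two techniques in this section aim to reduce.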
2.1 Parallel Computation of Multiple Time Steps
In the developed method, we solve four time steps of the analysis in parallel. Reference [6] describes an approach for obtaining an accurate predictor using multiple time steps in linear wave propagation simulation; this paper extends that algorithm to viscoelastic analysis. As the stress of the preceding step must be obtained before the next step is solved, only one time step can be solved exactly. In Algorithm 1, we focus on solving the equation for the \(i\)th time step. Here we iterate until the error of the \(i\)th time step (displacement) becomes smaller than the prescribed threshold \(\epsilon \), as described in lines 13 to 17 of Algorithm 1. The next three time steps, \(i+1\), \(i+2\), and \(i+3\), are solved approximately, using the solutions of the preceding steps. The estimated solution of the preceding step is used to update the stress state and the external force vector, which corresponds to lines 18 and 19 of Algorithm 1. In this way, we obtain estimated solutions that improve the convergence of the solver. In this method, the four vectors for time steps \(i\), \(i+1\), \(i+2\), and \(i+3\) can be computed simultaneously. In the Element-by-Element kernel, the matrix is read only once for the four vectors, which improves computational efficiency. In addition, the four values corresponding to the four time steps are consecutive in the memory address space. We can therefore reduce random memory access and computation time compared with running the single-vector Element-by-Element kernel four times. That is, the arithmetic count per iteration increases approximately fourfold, but the decrease in the number of iterations and the improved computational efficiency of the Element-by-Element kernel are expected to reduce the time-to-solution.
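The memory layout described above can be sketched in Python as follows. This is a minimal illustration under assumed names and a toy 1D spring mesh, not the kernel of the paper: the values of the four time steps are stored consecutively for each node (index `node * N_VEC + v`), and each element matrix is built once and applied to all four vectors.

```python
N_VEC = 4  # number of time steps solved together, as in the text


def element_matrix(k):
    """Stiffness matrix of a two-node spring element (toy stand-in)."""
    return [[k, -k], [-k, k]]


def ebe_matvec_multi(connectivity, stiffness, x, n_nodes):
    """y = K x for N_VEC vectors stored interleaved: x[node * N_VEC + v]."""
    y = [0.0] * (n_nodes * N_VEC)
    matrix_builds = 0
    for (a, b), k in zip(connectivity, stiffness):
        ke = element_matrix(k)      # built once, reused for all N_VEC vectors
        matrix_builds += 1
        for v in range(N_VEC):      # the N_VEC values of a node are contiguous
            xa = x[a * N_VEC + v]
            xb = x[b * N_VEC + v]
            y[a * N_VEC + v] += ke[0][0] * xa + ke[0][1] * xb
            y[b * N_VEC + v] += ke[1][0] * xa + ke[1][1] * xb
    return y, matrix_builds


connectivity = [(0, 1), (1, 2), (2, 3)]
stiffness = [2.0, 3.0, 5.0]
n_nodes = 4
x_single = [1.0, 0.0, 2.0, -1.0]
# Hypothetical input: vector v is (v + 1) times x_single, interleaved per node.
x = [(v + 1) * x_single[n] for n in range(n_nodes) for v in range(N_VEC)]
y, matrix_builds = ebe_matvec_multi(connectivity, stiffness, x, n_nodes)
y_per_vector = [[y[n * N_VEC + v] for n in range(n_nodes)] for v in range(N_VEC)]
print(matrix_builds, y_per_vector)
```

With this layout, one gather of a node brings in the values of all four time steps from adjacent addresses, whereas four separate single-vector passes would rebuild every element matrix four times and touch four distant locations per node.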
To improve convergence, it is important to estimate the initial solution of the fourth time step accurately. A typical predictor such as the Adams-Bashforth method could be used; however, we developed a more accurate predictor that exploits the fact that solutions of viscoelastic analysis change smoothly from one time step to the next, as described in lines 7 to 12 of Algorithm 1. From the 9th step onward, we use a linear predictor, in which a linear regression based on 7 accurately computed time steps is used to predict the future time step. As regressions based on higher-order polynomials or exponential basis functions may lead to jumps in the prediction, we do not use them in this study.
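A linear-regression predictor of this kind can be sketched as below. This is only an illustration of the general technique: the function name, the `steps_ahead` parameter, and the equal weighting of the history are assumptions, and the exact predictor is the one given in Algorithm 1.

```python
def linear_predictor(history, steps_ahead=1):
    """Least-squares line through the stored steps, extrapolated forward.

    history: list of per-step solution vectors, oldest first (for example,
    the 7 most recent accurately computed steps).
    Returns the predicted vector `steps_ahead` steps after the last entry.
    """
    m = len(history)
    t = list(range(m))
    t_mean = sum(t) / m
    denom = sum((ti - t_mean) ** 2 for ti in t)
    t_new = (m - 1) + steps_ahead
    prediction = []
    for dof in range(len(history[0])):
        u = [history[s][dof] for s in range(m)]
        u_mean = sum(u) / m
        slope = sum((ti - t_mean) * (ui - u_mean)
                    for ti, ui in zip(t, u)) / denom
        prediction.append(u_mean + slope * (t_new - t_mean))
    return prediction


# Hypothetical history: two degrees of freedom changing linearly in time.
history = [[1.0 + 0.5 * s, 2.0 - 0.25 * s] for s in range(7)]
next_step = linear_predictor(history, steps_ahead=1)
fourth_step = linear_predictor(history, steps_ahead=4)
print(next_step, fourth_step)
```

Because a straight line has only two parameters per degree of freedom, the fit averages over the stored steps instead of following each one exactly, which is one way to see why it is less prone to jumps than a higher-order fit.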
2.2 Reduction of Atomic Access
The algorithm introduced in the previous subsection is expected to circumvent the performance bottleneck of the Element-by-Element kernel. However, the implementation in the previous study [7] adds element-wise results directly to the global vector using atomic functions, as shown in Fig. 1a. Since each node can be shared by multiple elements, performance may decrease due to conflicting atomic updates of the same entry; we therefore modify the algorithm to improve the efficiency of the Element-by-Element kernel. We use a buffering method to reduce the number of accesses to the global vector. On NVIDIA GPUs, we can utilize shared memory, whose values can be referenced by all threads in the same Block. The computation procedure is as follows and is also shown in Fig. 1b.

1. Group elements into blocks, and store the element-wise results in shared memory.

2. Add up nodal values in shared memory using a precomputed table.

3. Add the summed nodal values to the global vector.
We can expect a performance improvement because the number of atomic operations on the global vector is reduced and the summation of temporary results is mainly performed as a preliminary reduction in shared memory, which has higher bandwidth. In this scheme, the block size is expected to affect performance. By allocating more elements to a Block, we can increase the number of nodal values that are reduced in shared memory. However, the total number of threads is constrained by the shared memory size. In addition, threads in a Block must be synchronized when switching from the element-wise matrix-vector multiplication to the data addition part, so using a large number of threads per Block increases the synchronization cost. Under these circumstances, we allocate 128 threads (32 elements \(\times \) four time steps) per Block.
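The three-step buffering procedure can be mimicked on the CPU to see how it lowers the number of additions to the global vector. The Python sketch below is a simplified model under assumed names: a per-block dictionary plays the role of the precomputed table and the shared-memory buffer, and the counters stand in for atomic operations on the global vector.

```python
def direct_scatter(elements, contributions, n_nodes):
    """Baseline: every element-wise value is added straight to the global vector."""
    y = [0.0] * n_nodes
    global_adds = 0
    for nodes, vals in zip(elements, contributions):
        for n, v in zip(nodes, vals):
            y[n] += v               # one atomic add to global memory on a GPU
            global_adds += 1
    return y, global_adds


def buffered_scatter(elements, contributions, n_nodes, block_size):
    """Sum within a block first, then add once per node and block."""
    y = [0.0] * n_nodes
    global_adds = 0
    for start in range(0, len(elements), block_size):
        block = range(start, min(start + block_size, len(elements)))
        # Stand-in for the precomputed table: node -> buffer slots in this block.
        table = {}
        for e in block:
            for slot, n in enumerate(elements[e]):
                table.setdefault(n, []).append((e, slot))
        for n, slots in table.items():
            partial = sum(contributions[e][slot] for e, slot in slots)
            y[n] += partial         # single global add per node per block
            global_adds += 1
    return y, global_adds


# Hypothetical strip of four triangles sharing nodes with their neighbours.
elements = [(0, 1, 2), (1, 2, 3), (2, 3, 4), (3, 4, 5)]
contributions = [[1.0, 1.0, 1.0] for _ in elements]
y_direct, adds_direct = direct_scatter(elements, contributions, 6)
y_buffered, adds_buffered = buffered_scatter(elements, contributions, 6, 2)
print(adds_direct, adds_buffered)
```

In this toy case the saving comes only from nodes shared inside a block, so larger blocks reduce the count further, which is the trade-off against shared memory size and synchronization cost discussed above.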
On GPUs, threads are executed in SIMT groups of 32 threads [12]. When the amount of computation differs among the 32 threads, performance is expected to decrease. In the reduction phase, we need to assign threads per node. However, since the number of connected elements differs significantly between nodes, we can expect a large load imbalance among the 32 threads. We therefore sort the nodes according to the number of elements to be added up, as shown in Fig. 2. This leads to good load balance among the 32 threads and thus to higher computational efficiency.
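The effect of sorting can be illustrated with a simple cost model, sketched below in Python. The model is an assumption made for illustration: each group of 32 threads is taken to need as many steps as its busiest thread, and the node counts are hypothetical.

```python
WARP_SIZE = 32  # threads that execute in lockstep


def warp_cost(counts):
    """Lockstep cost model: each group of 32 takes as long as its busiest thread."""
    total = 0
    for start in range(0, len(counts), WARP_SIZE):
        total += max(counts[start:start + WARP_SIZE])
    return total


def sort_nodes_by_count(counts):
    """Node order that places nodes with similar workloads in the same group."""
    return sorted(range(len(counts)), key=lambda n: counts[n], reverse=True)


# Hypothetical workload: 64 nodes alternating between 1 and 8 values to add up.
counts = [1, 8] * 32
order = sort_nodes_by_count(counts)
sorted_counts = [counts[n] for n in order]
cost_unsorted = warp_cost(counts)
cost_sorted = warp_cost(sorted_counts)
print(cost_unsorted, cost_sorted)
```

Unsorted, both groups contain a node with 8 values and each costs 8 steps; after sorting, the heavy nodes share one group and the light nodes the other, so the second group finishes after a single step.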
This shared-memory method requires implementation in CUDA. We also use CUDA for the inner product computation to improve the memory access pattern and thus efficiency. On the other hand, other computations such as vector addition and subtraction are very simple, so each thread uses almost the same number of registers whether we use CUDA or OpenACC. Nor do they need features specific to NVIDIA GPUs, such as shared memory or warp functions. For these reasons, these computations are memory-bandwidth bound, and there is little difference between the CUDA and OpenACC implementations. Thus we use CUDA for the performance-sensitive kernels and OpenACC for the other parts. The CUDA part is called via a wrapper function.
3 Performance Measurement
We measure the performance of the developed method on hybrid nodes of Piz Daint (see Note 1).
3.1 Performance Measurement of the Element-by-Element Kernel
We use one P100 GPU on Piz Daint to measure the performance of the Element-by-Element kernels. The target finite-element problem consists of 959,128 tetrahedral elements, with 4,004,319 degrees-of-freedom in the second-order tetrahedral mesh and 522,639 degrees-of-freedom in the first-order tetrahedral mesh. Here we compare the four versions of the kernels summarized in Table 1. Case A corresponds to the conventional Element-by-Element kernel, and Case D corresponds to the proposed kernel.
Figure 3 shows the normalized elapsed time per vector of the kernels in the inner fine and coarse loops. The use of four vectors, reduction, and reordering significantly improves performance. To assess the time spent on data access, we also show the time measured for the Element-by-Element kernel without computing the element-wise matrix-vector products. Data access is dominant in the Element-by-Element kernel on P100 GPUs, and the elapsed time of the kernel decreases as memory access is decreased by the reduction. Compared with the second-order tetrahedral mesh, performance in the first-order tetrahedral mesh improved further through reduction using shared memory. This effect can be confirmed from the number of atomic add calls to the global vector. In the second-order tetrahedral mesh, atomic addition is performed 115,095,360 times in Case B and 43,189,848 times in Case D; the number of calls is thus reduced to about 37.5% of that in Case B. For the first-order tetrahedral mesh, atomic addition is performed 46,038,144 times in Case B and 10,786,920 times in Case D; the number of calls is thus reduced to about 23.4%. In total, the developed kernel (Case D) is 3.3 times faster in the first-order tetrahedral mesh and 2.2 times faster in the second-order tetrahedral mesh than the conventional kernel (Case A).
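The atomic-add counts quoted above can be checked with a few lines of arithmetic. In the Python sketch below, the decomposition of the Case B counts into elements, nodes per element, displacement components, and vectors is an inference that happens to reproduce the quoted numbers exactly; it is not stated in the text. The Case D counts are taken as reported.

```python
n_elements = 959_128
n_components = 3        # x, y, z displacement per node
n_vectors = 4           # four time steps computed together

# Case B (inferred): every element-wise nodal value is added atomically.
adds_second_order_b = n_elements * 10 * n_components * n_vectors  # 10 nodes per tet
adds_first_order_b = n_elements * 4 * n_components * n_vectors    # 4 nodes per tet

# Case D counts as reported in the text.
adds_second_order_d = 43_189_848
adds_first_order_d = 10_786_920

ratio_second = adds_second_order_d / adds_second_order_b
ratio_first = adds_first_order_d / adds_first_order_b
print(f"second order: {adds_second_order_b:,} adds, {ratio_second:.1%} remain in Case D")
print(f"first order:  {adds_first_order_b:,} adds, {ratio_first:.1%} remain in Case D")
```

The remaining fraction is smaller for the first-order mesh, which is consistent with the larger speedup reported for that mesh.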
3.2 Comparison of Solver Performance
We compare the developed solver with the previous viscoelastic solver of [10] using GPUs on Piz Daint. The previous solver was originally designed for CPU-based supercomputers, and we ported it to the GPU environment for this performance measurement. It originally uses CRS-based matrix-vector products; however, we replaced these with the Element-by-Element method so that the effects of our proposed method can be confirmed more clearly. The same solver tolerances are used for both methods: \(\epsilon =10^{-8}\) for the outer loop, \((\bar{\epsilon }_{c}^{in},N_c)=(0.1,300)\) for the inner coarse loop, and \((\bar{\epsilon }^{in},N)=(0.2,30)\) for the inner fine loop. These values were selected to minimize the elapsed time of both solvers. We use a time step increment of \(dt=2592000\) s with \(N_t=300\) time steps, and measure the performance of the viscoelastic computation part (time steps 2 to 300).
A model with 41,725,739 degrees-of-freedom and 30,720,000 second-order tetrahedral elements is computed using 32 Piz Daint nodes. Figure 4 shows the number of iterations and the elapsed time of the solvers. With the multi-step predictor, the number of iterations of the most computationally costly inner coarse loop decreased by a factor of 2.3. In addition, the performance of the Element-by-Element kernel improved as measured in the previous subsection. Together, these two modifications reduced the total elapsed time by a factor of 2.79.
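Why a predicted initial solution lowers the iteration count can be seen on a toy problem. The Python sketch below is not the solver of this paper: it uses a plain conjugate gradient method without preconditioning, a small tridiagonal test matrix, a smoothly varying right-hand side, and a two-point linear extrapolation as the predictor, all of which are assumptions chosen for brevity.

```python
import math


def matvec(x):
    """Tridiagonal SPD test matrix: 4 on the diagonal, -1 off the diagonal."""
    n = len(x)
    y = [4.0 * xi for xi in x]
    for i in range(n - 1):
        y[i] -= x[i + 1]
        y[i + 1] -= x[i]
    return y


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def cg(b, x0, tol=1e-8, max_iter=1000):
    """Plain conjugate gradient; returns (solution, iterations used)."""
    x = list(x0)
    r = [bi - yi for bi, yi in zip(b, matvec(x))]
    p = list(r)
    rr = dot(r, r)
    b_norm = math.sqrt(dot(b, b))
    iters = 0
    while math.sqrt(rr) > tol * b_norm and iters < max_iter:
        ap = matvec(p)
        alpha = rr / dot(p, ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        rr_new = dot(r, r)
        p = [ri + (rr_new / rr) * pi for ri, pi in zip(r, p)]
        rr = rr_new
        iters += 1
    return x, iters


n = 100
base = [math.sin(0.1 * i) + 1.5 for i in range(n)]


def rhs(step):
    """Hypothetical load that changes smoothly from step to step."""
    scale = 2.0 - math.exp(-0.05 * step)
    return [scale * v for v in base]


x_prev2, _ = cg(rhs(1), [0.0] * n)
x_prev1, _ = cg(rhs(2), [0.0] * n)
b = rhs(3)
_, iters_zero = cg(b, [0.0] * n)
guess = [2.0 * a - c for a, c in zip(x_prev1, x_prev2)]  # linear extrapolation
_, iters_pred = cg(b, guess)
print(iters_zero, iters_pred)
```

Starting from the extrapolated guess, the initial residual is already a small fraction of the right-hand side, so fewer iterations are needed to reach the same tolerance than when starting from zero.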
4 Application Example
We apply the developed solver to a viscoelastic deformation problem following a hypothetical earthquake on the Hellenic arc subduction interface, which affects deformation measured in Greece and across the Eastern Mediterranean. We selected the Hellenic region because a recent analysis of timescale-bridging numerical models suggests that the large amount of subducting sediment could allow a larger than anticipated M 9 earthquake to occur in this highly populated region [3]. To model the complete viscoelastic response of the system, we simulate a large depth range, including the Earth’s crust, lithosphere, and complete mantle down to the core boundary. The target domain is of size 3,686 km \(\times \) 3,686 km \(\times \) 2,857 km. Geometry data of the layered structure are given at a spatial resolution of 1 km [2].
To fully reflect the geometry data in the analysis model, we set the resolution of the finite-element model to 0.9 km (the second-order tetrahedral element size is 1.8 km). As this becomes a large-scale problem, we use a parallel mesh generator capable of robustly meshing large, complex-shaped, multiple-material problems [5, 6]. This leads to a finite-element model with 589,422,093 second-order tetrahedral elements, 801,187,352 nodes, and 2,403,562,056 degrees-of-freedom, shown in Fig. 5a–d. We can see that the layered structure geometry is reflected in the model. We input a hypothetical fault slip in the direction of the subduction, that is, a slip of (dx, dy, dz) = (25, 25, −10) m, over a region 250 km in diameter at the subduction interface separating the continental crust of Africa and Europe in the center of the model. Following this hypothetical M 9 earthquake, we compute the elastic coseismic surface deformation and the postseismic deformation due to viscoelastic relaxation of the crust, lithosphere, and mantle. Following [10], a split node method is used to input the fault dislocation, and the time step increment dt is set to 30 days (2,592,000 s). The analysis of 2,000 time steps took 4,587 s using 512 P100 GPUs on Piz Daint.
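The figures quoted for this model are related by simple arithmetic, which the short Python check below reproduces. The derived quantities (average wall-clock time per step and total simulated time) are computed here for orientation and are not stated in the text; the 365.25-day year is an assumption.

```python
n_nodes = 801_187_352
dof = 3 * n_nodes                    # three displacement components per node

dt_seconds = 30 * 24 * 3600          # 30 days expressed in seconds
n_steps = 2_000
elapsed_seconds = 4_587

seconds_per_step = elapsed_seconds / n_steps
simulated_years = n_steps * 30 / 365.25

print(f"degrees-of-freedom: {dof:,}")
print(f"dt: {dt_seconds:,} s")
print(f"average wall-clock time per step: {seconds_per_step:.4f} s")
print(f"simulated duration: about {simulated_years:.0f} years")
```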
Figures 5e and f show snapshots of the surface deformation. We can see that the elastic coseismic response as well as the viscoelastic response is computed reflecting the 3D geometry and heterogeneity of the crust. We can expect a more realistic response distribution by inputting fault slip distributions that follow current knowledge in solid earth science.
5 Conclusion
We developed a fast unstructured finite-element solver for viscoelastic crustal deformation analysis targeting GPU-based computers. The target problem is computationally very costly, since it requires solving a problem with more than \(10^9\) degrees-of-freedom. In this analysis, the random data access of the Element-by-Element method in matrix-vector products was the bottleneck. To reduce this bottleneck, we proposed two methods: a reduction method that uses the shared memory of GPUs, and a combination of a multi-step predictor and a linear predictor that improves the convergence of the solver. Performance measurement on Piz Daint showed a 2.79-fold speedup over the previous solver. With viscoelastic analysis accelerated by the developed solver, we expect applications to inverse analysis of crust properties and to many-case analyses.
Notes
1. Piz Daint comprises 1,431 multicore compute nodes (two Intel Xeon E5-2695 v4) and 5,320 hybrid compute nodes (Intel Xeon E5-2690 v3 + NVIDIA Tesla P100), connected by the Cray Aries routing and communications ASIC in a Dragonfly network topology.
References
1. Agata, R., Ichimura, T., Hirahara, K., Hyodo, M., Hori, T., Hori, M.: Robust and portable capacity computing method for many finite element analyses of a high-fidelity crustal structure model aimed for coseismic slip estimation. Comput. Geosci. 94, 121–130 (2016)
2. Bird, P.: An updated digital model of plate boundaries. Geochem. Geophys. Geosyst. 4(3), 1027 (2003)
3. Brizzi, S., van Zelst, I., van Dinther, Y., Funiciello, F., Corbi, F.: How long-term dynamics of sediment subduction controls short-term dynamics of seismicity. In: American Geophysical Union (2017)
4. Piz Daint. https://www.cscs.ch/computers/pizdaint/
5. Fujita, K., Katsushima, K., Ichimura, T., Hori, M., Maddegedara, L.: Octree-based multiple-material parallel unstructured mesh generation method for seismic response analysis of soil-structure systems. Procedia Comput. Sci. 80, 1624–1634 (2016). 2016 International Conference on Computational Science, ICCS 2016, 6–8 June 2016, San Diego, California, USA
6. Fujita, K., Katsushima, K., Ichimura, T., Horikoshi, M., Nakajima, K., Hori, M., Maddegedara, L.: Wave propagation simulation of complex multi-material problems with fast low-order unstructured finite-element meshing and analysis. In: Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region, HPC Asia 2018, pp. 24–35. ACM, New York (2018)
7. Fujita, K., Yamaguchi, T., Ichimura, T., Hori, M., Maddegedara, L.: Acceleration of element-by-element kernel in unstructured implicit low-order finite-element earthquake simulation using OpenACC on Pascal GPUs. In: Proceedings of the Third International Workshop on Accelerator Programming Using Directives, pp. 1–12. IEEE Press (2016)
8. Fukahata, Y., Matsu’ura, M.: Quasi-static internal deformation due to a dislocation source in a multilayered elastic/viscoelastic half-space and an equivalence theorem. Geophys. J. Int. 166(1), 418–434 (2006)
9. Hori, T., Hyodo, M., Miyazaki, S., Kaneda, Y.: Numerical forecasting of the time interval between successive M8 earthquakes along the Nankai Trough, Southwest Japan, using ocean bottom cable network data. Mar. Geophys. Res. 35(3), 285–294 (2014)
10. Ichimura, T., Agata, R., Hori, T., Hirahara, K., Hashimoto, C., Hori, M., Fukahata, Y.: An elastic/viscoelastic finite element analysis method for crustal deformation using a 3-D island-scale high-fidelity model. Geophys. J. Int. 206(1), 114–129 (2016)
11. Masterlark, T.: Finite element model predictions of static deformation from dislocation sources in a subduction zone: sensitivities to homogeneous, isotropic, Poisson-solid, and half-space assumptions. J. Geophys. Res. Solid Earth 108(B11) (2003)
12. Nickolls, J., Buck, I., Garland, M., Skadron, K.: Scalable parallel programming with CUDA. Queue 6(2), 40–53 (2008)
13. Winget, J.M., Hughes, T.J.R.: Solution algorithms for nonlinear transient heat conduction analysis employing element-by-element iterative strategies. Comput. Methods Appl. Mech. Eng. 52(1–3), 711–815 (1985)
© 2018 Springer International Publishing AG, part of Springer Nature
About this paper
Cite this paper
Yamaguchi, T. et al. (2018). Viscoelastic Crustal Deformation Computation Method with Reduced Random Memory Accesses for GPU-Based Computers. In: , et al. Computational Science – ICCS 2018. ICCS 2018. Lecture Notes in Computer Science, vol 10861. Springer, Cham. https://doi.org/10.1007/9783319937014_3
DOI: https://doi.org/10.1007/9783319937014_3
Publisher Name: Springer, Cham
Print ISBN: 9783319937007
Online ISBN: 9783319937014
eBook Packages: Computer Science, Computer Science (R0)