Hybrid-View Programming of Nuclear Fusion Simulation Code in XcalableMP

,


Introduction
In XMP, the global-view model allows programmers to define global arrays, which are distributed to processors by adding the directives. Some typical communication patterns are supported by directives, such as data exchange between neighbor processors in stencil computations. In contrast to the global-view model, the localview model describes remote memory access using the node (processor) index. This operation is implemented as one-sided communication. XMP employs the coarray concept from Coarray Fortran as a local-view programming model. A coarray is a distributed data object, which is indexed by the coarray dimension that maps indices to processors. In XMP, the coarray is defined in C as well as Fortran. In the local-view model, a thread on each processor executes its own local computations independently with remote memory access to data located in different processors by coarray access. The local-view model requires that programmers define their algorithms by explicitly decomposing the data structures and controlling the flow in each processor. The data view is similar to that in MPI, but coarray remote access provides a more intuitive view of accessing the data in different processors, thereby increasing productivity.
In this chapter, 1 we consider a hybrid-view programming approach, which combines the global-view and local-view models in XMP according to the characteristics of the distributed data structure of the target application. The global-view model allows programmers to express regular parallel computations such as domain decomposition with stencil computation in a highly intuitive manner simply by adding directives to a serial version of code. However, it is difficult to describe parallel programs in the global-view model when more irregular communication patterns and complex load balancing are required on the processing. Thus, localview programming is necessary in these situations.
We apply this hybrid-view programming for Gyrokinetic Toroidal Code -Princeton (GTC-P) [4], which is a large-scale plasma turbulence code that can be applied at the International Thermonuclear Experimental Reactor (ITER [13]) scale and beyond for next-generation nuclear fusion simulation. The GTC-P is an improved version of the original GTC [2] and it is a type of gyrokinetic Particlein-Cell (PIC) code with two basic data arrays: global grid data that corresponds to the physical problem space and particle data that corresponds to particles moving around the grid space. The original GTC-P was written in C as a form of hybrid programming with OpenMP and MPI. In this code, the grid data and particle data are mapped onto MPI processes and exchanged. As found with most codes of this type, it is difficult to manage complex data distributions and communication for both grid data and particle data during code development. Furthermore, to simulate the microturbulence phenomenon in plasmas for magnetically confined fusion devices, non-flat domain decomposition is necessary in one dimension, as well as parallelizing multiple dimensions, to obtain accurate large-scale simulations. Therefore, the number of computations becomes extremely large for next-generation and largescale reactors such as ITER.
We consider both types of data models in XMP, i.e., global-view and localview models, which are suitable for representing grid space data and particle data, respectively, because of their data distribution and communication pattern. In this study, we implement the GTC-P code in two ways: using XMP with a localview only model, and with a combination of local-view and global-view models, where we evaluate the performance and productivity of these approaches. As the preliminary result, we implemented and evaluated the GTC-P in XMP hybrid-view model [15]. Moreover, we indicate the causes of performance degradation for GTC-P in XMP and evaluate the GTC-P of hybrid versions written in XMP+OpenMP and MPI+OpenMP in this study.
The remainder of this chapter is organized as follows. Next, we briefly describe the GTC-P nuclear fusion simulation code in Sect. 2. Section 3 describes the implementation of GTC-P using Hybrid programing model of XMP. We report the performance and productivity evaluation in Sect. 4, and related works in Sect. 5. Finally, we conclude our study in Sect. 6.

Nuclear Fusion Simulation Code
Typical methods used to simulate the microturbulence phenomenon in magnetically confined fusion plasmas include the Monte Carlo method and the PIC method. In this study, we only consider the gyrokinetic PIC method among them as a target application to explain the GTC and GTC-P code briefly.

Gyrokinetic PIC Simulation
The simulation of the gyrokinetic PIC method uses a space grid to calculate the field and for the particle trajectory calculation, which does not depend on the grid when moving in the free space. Figure 1 shows an image of a gyrokinetic PIC simulation with a two-dimensional block distribution. The typical behavior of the gyrokinetic PIC code is as follows.
1. Add the charge of the particle to the nearby grid points. 2. Solve the electric field affected by the electrostatic potential by calculating the charge density of the nearby grid points using Poisson's equation. 3. Interpolate the electric field in the current position based on each particle in the nearby grid points and move the position of the particle in the space. During each step, a process must communicate with another if that process holds the data for the space in the grid that is affected, as shown in Fig. 1 (A), or if the particle data move from or to that process, as shown in Fig. 1 (B). Based on the above, if the size of the distributed domain, e.g., the grid in Fig. 1, is not changed, the data distribution employed in the global-view programming model is suitable and the communication between nearby grid points can be described by the reflect directive in XMP coding. In contrast, if the number of particles on each distributed domain changes dynamically during each time step of the simulation, such as particle motion, coarray communication is required using local-view programming.

GTC
GTC is a three-dimensional (3D) gyrokinetic PIC code, which was developed by DOE SciDAC, UC Irvine, etc. [2] for studying the microturbulence phenomenon in plasmas for magnetically confined fusion devices. Figure 2 shows a conceptual image of a 3D torus physical space. GTC treats the physical space and the movement of particles in three directions: the toroidal direction around the major axis, the poloidal direction around the magnetic axis, and the radial direction of the minor radius from the magnetic axis. The cross-section of the toroidal direction is known Fig. 2 Conceptual image of the three-dimensional torus space in GTC-P [10] as the poloidal plane. GTC-P is a modified version of GTC, where there are several differences in the parallelization scheme. Moreover, GTC-P is implemented as two versions in the C and Fortran languages, whereas the original GTC is coded in Fortran. In this study, we focus on the C implementation of GTC-P.
GTC parallelizes the problem according to three levels. Processing on the space grid domain in the toroidal direction and the processing of particles in each domain are mapped onto MPI processes. Also, the grid-related calculation and particles in each distributed domain are further subdivided using OpenMP for each process. GTC-P has four levels of parallelism with additional parallelism in the radial direction. The total number of MPI processes that need to be executed is N t × N r × N rp , where N t is the number of domains decomposed in the toroidal direction, N r is the number of domains decomposed in the radial direction, and N rp is the number of particles decomposed in each of the distributed domains.
There is a difference in the number of grid points on the poloidal plane, as demonstrated in Fig. 3 (left). The toroidal domain can be distributed with equally sized intervals, but the radial domain cannot be distributed with equally sized intervals due to the large difference in the domain size depending on its position in space. Therefore, in order to align as much as possible the number of grid points to be mapped on each process, the outer area of the radial domain is distributed as short radial interval and its inner area is distributed as long radial interval, such as Fig. 3 (right).
GTC-P has mainly six computational kernels. The charge kernel deposits the charge from particles onto the grid using the four-point approximation of nearby grid points. The poisson, field, and smooth kernels solve the gyrokinetic Poisson's equation, compute an electric field, and smooth the charge and potential with a filter on the grid, respectively. The push kernel interpolates the electric field onto particles using the field. The charge and push kernels account for large percentage of the elapsed time in this simulation [4,16].

Implementation of GTC-P by Hybrid-view Programming
In this section, we describe how to implement GTC-P using hybrid programming model of XMP.

Hybrid-View Programming Model
XMP allows the use of hybrid-view programming, which combines the global-view and local-view models. The global-view model allows programmers to express regular parallel computations, such as domain decomposition with stencil computation, in a highly intuitive manner simply by adding directives to a serial version of the code. On the other hand, when the data distribution cannot be simply described in domain decomposition manner or the communication pattern is complicated, the global-view model is not suitable, and more dynamism is required to express the code naturally. Thus, the coarray notation provided by the local-view model is required in this case, and it is possible to program in a flexible manner using these models. Figure 4 shows a skeleton code of the implementation of a gyrokinetic PIC simulation with XMP. In this example, the grid uses a two-dimensional block #pragma xmp reflect(f) 10 /* Calculate the particle−related work */ 11 /* Pack the communication elements from array "p" to array "send" */ 12 /* Calculate the destination process "pe" and communication size "icount" */ distribution and each block has a sleeve area, which is used to calculate the field with the nearby grid points based on the shadow directive. The particle movement is represented by the coarray notation where the communication elements are packed in the array send. Based on the above, we describe the two implementations of GTC-P using XMP. First, we implement the XMP-localview version using coarray communication, which is equivalent to using MPI point-to-point communication with the exception of MPI collective communication (as shown below). Next, the XMP-hybridview version is implemented by describing the fields using a distributed array with the reflect directive for overlapped sleeve area communication and the distributed data in the global-view programming model, as well as using the coarray notation to move the particle data. In addition, we use the bcast and reduction directives instead of MPI collective communication (MPI_Bcast and MPI_Allreduce) in both versions.

Implementation Based on the XMP-Localview Model: XMP-localview
In GTC-P, the communication processes required to move particles between grids and to exchange grid points are represented by MPI_Sendrecv or MPI_Isend/Irecv, where most of the communication is performed between adjacent processes in one dimension. GTC-P has the steady state exchange of particles between neighboring subdomains. Because the number of particles changes dynamically, this implementation uses the coarray notation in the localview programming model.

Figures 5 and 6
show the particle data movement using MPI and the corresponding the coarray notation in GTC-P, respectively. In the exchange of coarray notation for particle data movement, it communicates the number of particles and the particle data, i.e., nsendright and sendright, with the adjacent process on the neighbor to the right. In addition, Figs. 7 and 8 show the exchange of grid points using MPI and the corresponding coarray notation in GTC-P, respectively. In the example of the coarray notation for the exchange of grid points, after copying a value to a one-dimensional array, i.e., sendr or Xsendr, it communicates with the adjacent process on the neighbor to the right. Because the coarray notation is non-blocking communication, xmp_sync_image on the sixth line of Fig. 6 and the seventh line of Fig. 8 are required to guarantee that communication has been completed between two processes, in this case, the neighboring (right_pe) and current processes.  8 Exchange of grid points using the coarray notation in GTC-P

Implementation Based on the XMP-Hybridview Model: XMP-Hybridview
In the XMP-hybridview implementation, all of the space grid data are denoted by a global-view model with compile-time mapping and the sleeve data are exchanged by XMP directives, whereas the particle data movements are denoted by a localview model with the coarray notation, as shown Fig. 6. It is necessary to represent an unequal block size for domain decomposition in the radial dimension. Because this dimension's space grid is denoted in the global-view model, we apply the gblock notation to represent it correctly in the same manner as the original MPI implementation. The gblock notation can control the variable block size of each domain on the mapped space position. This feature is especially important for porting GTC-P onto XMP with a global-view model. Figure 10 shows an example of the GTC-P implementation with the XMP global-view programming model using gblock. The 11th line of this example denotes the block size distribution in the radial dimension. Because of describing the data distribution by global-view model, we can describe the loop distribution only to insert loop directive onto the serial code that is from the 28th to 30th lines of this example. In addition, OpenMP directives can be combined with XMP such as the 27th line. The calculation of the grid-related works, such as the deposit of the charge from particles onto the grid using a four-point approximation of grid points, the computation of an electronic field, and the interpolation of the electronic field onto particles, are similar to four-point stencil calculation on the poloidal plain. In these codes, we can describe the loop parallelization by inserting loop directive onto the serial version. Appropriate directives are used for each dimension of the distributed array in XMP, and we further synchronize the sleeve data that overlap at each end of the distributed domain, which we can describe simply using the reflect directive. Figure 9 shows an example of the reflect directive, which is the same as the communication described in Figs. 7 and 8. Thus, we can describe it using a directive on one line, which is much simpler compared with the MPI notation in Figs. 7 and 8. When the width clause is specified, it can be designated as part of the sleeve elements and the periodic is used to update the sleeve area of the global lower (upper) bound based on the global upper (lower) bound (Fig. 10).

Experimental Setting
We evaluated the performance of our two implementations using a massively parallel GPU cluster: HA-PACS[1] at the Center for Computational Sciences, University of Tsukuba. Table 1 shows the computing environment employed for one node. HA-PACS is a GPU cluster, but we only utilized CPUs in this study. We have a plan to extend this research using a GPU-enabled version of XcalableMP, XcalableACC [9] to make use of GPU of HA-PACS. We apply the optimization option for NUMA with 'numactl -localalloc', and disable the CPU affinity setting of MVAPICH2 with MV2_ENABLE_AFFINITY=0.
As preliminary evaluations, we investigate the amount of the memory usage and the performance of communication using XMP and MPI. First, we indicate the comparison of the memory usage when one array is allocated in the local-view model, global-view model, and MPI. They are evaluated with 'getpid()' and 'grep VmHWM /proc/[pid]/status' from C program during execution. An array size is 1 MB. We show the minimum size in the each amount of the memory usage when four node execution. The tests showed that the amount of memory usage of all programming models is almost same according to Table 2. Then, we evaluate the performance of XMP and MPI communication with Ping-Pong program, which is defined by a power of two communication size, because XMP coarray is implemented by GASNet [6] which is a communication library optimized for some interconnections specifies, e.g., InfiniBand and Gemini. Figure 11 shows the performance of XMP coarray and MPI_Send/Recv communication. XMP is a good performance if the transfer size is about 65,536 Bytes or less, whereas MPI is a good performance if it is more than about 65,536 Bytes. We used a parameter of GASNet GASNET_IBV_PORTS="mlx4_0:1+mlx4_0:2" which specifies  to use two ports of Infiniband, but we could get the performance of only single port of Infiniband. It may be an issue with GASNet library. The GTC-P simulation size is determined by several important numerical parameters. Table 3 shows the default parameters for problem size A provided by GTC-P, where we modified the parameters to evaluate weak scaling based on problem size A. Strong scaling was evaluated using the minimum parameters in the decomposition of each domain shown in Table 3, where mstep is the number of calculation steps, mzetamax is the number of grid points in the toroidal dimension, and mpsi is the number of grid points in the radial domain. Because the number of grid points in the poloidal plane and in the toroidal domain must be the same during decomposition, this was also changed in the parameter set for problem size A.
First, we used up to 32 nodes of HA-PACS where 16 processes ran on each node and the total number of processes ranged from 16 to 512. The processes mapped to evaluate the decomposition on each domain are shown in Table 4. As described above, three problem dimensions were considered: toroidal, radial, and particle. When we decomposed these dimensions into parallel processes, we always fixed the decomposition number on two dimensions (e.g., toroidal and radial) as 2 × 2 and we varied the decomposition size in the other dimension (e.g., particle) from 4 to 128, thereby scaling the total number of processes from 16 to 512. However, during decomposition on the toroidal dimension, we fixed the decomposition number on the radial and particle dimensions as 2 × 4. This was due to variations in the number of calculations because increasing the toroidal dimension also changes the poloidal planes, as described above. We used this scheme to change the scaling dimension. Second, we used 16 nodes where one process ran on each node and the number of threads ranged from 1 to 16 in each process. The processes mapped on each domain to evaluate the decomposition are 2 × 4 × 2 and 2 × 2 × 4.

Results
With weak scaling, Figs. 12, 13, and 14 shows the elapsed time for both calculation and communication of MPI, XMP-localview, and XMP-hybridview required to scale the number of processes from 16 to 512, where decomposition on the toroidal Fig. 12 Elapsed time of the decomposition on toroidal dimension from 16 to 512 processes in weak scaling Fig. 13 Elapsed time of the decomposition on radial dimension from 16 to 512 processes in weak scaling Fig. 14 Elapsed time of the decomposition on particle dimension from 16 to 512 processes in weak scaling and particle dimensions exhibited good scalability, whereas scaling on the radial dimension was poor. Figure 13 shows that the performance of decomposition on the radial dimension decreased as the number of nodes increased compared with the other two types of domain decomposition, as shown in Figs. 12 and 14. Most of the communications are performed at the neighboring surface during decomposition on any dimension and the total amount of communication data does not vary greatly; thus, we focused on the calculation load balance between processes. Table 5 shows the difference between the maximum and minimum calculation times required for each type of decomposition, where the calculation time was defined as the computational time required for each process except the communication time. This table shows that the calculation time for processes differed greatly with radial dimension decomposition as the number of processes increased. This phenomenon occurred with all three implementations, including MPI. This may be explained by the method used to decompose the domain in the radial dimension. For other dimensions, it is easy to decompose the domain completely and equally for all processes. However, decomposition is complicated in the radial dimension because the domain volume varies in the inner part and outer part due to the torus form of the problem space. The volume and the corresponding grid size  . 15 Breakdown of the minimum and maximum calculation times on the radial domain decomposition using 512 processes are calculated based on the formula used to describe the torus shape, which implies that an error is incurred during integer rounding to determine the number of grids. Table 5 shows that the number of total grid points assigned to the processes with the maximum and minimum calculation times differed greatly. Also, Fig. 15 shows the breakdown of the minimum and maximum calculation times on the radial domain decomposition 512 processes. The difference between the calculation times of the grid-related works, such as charge, push, poisson, field, and smooth, increased on radial domain decomposition. During each time step, the computation of all processes must be bounded as a barrier operation and the increase in the integer rounding error according to the problem size (i.e., weak scaling) causes a greater load imbalance, which degrades the overall performance. On the other hand, the communication time of XMP-localview and XMPhybridview on the radial dimension increases as the number of nodes increased compared with the MPI, as shown in Fig. 13. We explored the number of send calls and each communication size because the performance of communication on XMP and MPI are reversed at about 65,536 Bytes according to Fig. 11. Figure 16 shows the number of send calls in process number 0 on each domain decomposition classified as the communication size of more than 65,536 Bytes and 65,536 Bytes or less. In the radial domain decomposition, the number of send calls at more than 65,536 Bytes increases compared with the toroidal and particle decomposition. Therefore, the performance of XMP-localview and XMP-hybridview is degraded compared with MPI. The results were the XMP-localview implementation obtains approximately the same performance as the MPI implementation while the performance degradation using XMP-hybridview is increased by up to 20% compared with the MPI implementation.
With strong scaling, Figs. 17 and 18 show the elapsed time for both calculation and communication of MPI, XMP-localview, and XMP-hybridview, where the decomposition on the radial and particle dimensions, respectively. The performances of XMP-localview and XMP-hybridview on the particle dimension are    We explored the number of send calls and each communication size same as weak scaling. Figure 19 shows the number of send calls in process number 0 on radial and particle domain decomposition classified as the communication size of more than 65,536 Bytes and 65,536 Bytes or less. The number of send calls on the Fig. 20 Elapsed time of the decomposition on radial and particle dimension from 1 to 16 threads radial domain decomposition at 65,536 Bytes or less increases compared with the particle decomposition. Therefore, the performance of XMP-localview and XMPhybridview are increased compared with MPI on radial domain decomposition from 128 to 512 processes in strong scaling. Figure 20 shows the elapsed time of the decomposition on radial and particle dimension, i.e., 2 × 4 × 2 and 2 × 2 × 4, ranged from 1 to 16 threads per process using 16 nodes where one process ran on each node. The results were the performance of XMP implementation with thread parallelization is scaled the same as MPI.

Productivity and Performance
A good programming environment should facilitate high performance and high productivity, but high performance is sometimes obtained by low-level programming such as MPI, which unfortunately yields low productivity.
The XMP-localview implementation is simple and intuitive compared with MPI because the coarray communication is expressed in the form of an array assignment statement, as shown Figs. 6 and 8. In coarray notation, the communication size and data are intuitively represented by array section and the data type is checked automatically. The performance of XMP-localview is comparable to that of the MPI version.
In XMP-hybridview, the global data structure required for the field data is described in the global-view model, which is almost the same as that in the serial code without particle calculation to communicate to the other process, and its data distribution is annotated by the directives, as shown in Fig. 10. This improves the readability of the code because it is unnecessary for users to describe the many arguments on a line such as MPI APIs, thereby facilitating the easy maintenance of the program and simple parallelization from the original sequential code. For the global data structure, the communication with the overlapped sleeve area in the distributed calculation domain can be described in only one line of the reflect directive, as shown in Fig. 9. Table 6 shows the delta Source Lines of Code (SLOC) [14] for several implementations. This metric indicates how many lines of code changed from the serial implementation of GTC-P, which shows modified, added, and deleted lines. Due to the reasons described above, the amount of code added and modified from the serial implementation is smaller with the XMP-hybridview implementation than the MPI implementation. In both of XMP implementations, the deleted lines are larger than MPI implementation because the explicit memory free is unnecessary for the distributed array and communication buffer for coarray in the global scope. In summary, XMP-hybridview implementation increases productivity.
The difference in performance between XMP-hybridview and MPI is attributable to the increase in the communication size of the reflect directive. The reflect directive is responsible for the communication designated as the sleeve area by the width clause, but it cannot update partially the sleeve area. Figure 21 shows that the two-dimensional array is distributed to two nodes and exchanged the sleeve area in MPI and XMP implementations. In GTC-P, there is a communication pattern updated only in inner sleeve area which is represented as the hatching areas in Fig. 21 (left).

Related Research
GTC or GTC-P have been executed and optimized on some platforms. X. Liao et al. [7] optimized GTC to use offload-programming model for the Intel Xeon Phi accelerator, and evaluated the performance on MilkyWay-2 supercomputer. K. Madduri et al. [8] described the optimization for multi-and many-core systems, and evaluated on some systems including Graphic Processing Unit (GPU) based on NVIDIA Fermi architectures. In our study, we focus on the evaluation of not only the performance but also the productivity for GTC-P. PIC method is often implemented some PGAS parallel programming languages. H. Sakagami and T. Mizuno [12] implemented 2D particle code, ESPAC2: 2D electrostatic plasma, based on PIC method using High Performance Fortran (HPF) [5] which is directive-based language similar to OpenMP and supports the globalview model. The particle data is distributed into the block, while the electrostatic field is replicated onto each process. After the distributed particle data is calculated in each process, the reduction operation is executed to update the particle data of electrostatic field on each time step. The data distribution is an easy expression which is annotated by directives in HPF. R. Preissl et al. [11] introduced hybrid PGAS+OpenMP approach for 3D PIC code, Gyrokinetic Tokamak Simulation (GTS) which is implemented in MPI+OpenMP. As PGAS parallel programming language, they used Coarray Fortran. The one-sided communication in Coarray Fortran is simple and more intuitive notation compared with MPI programming because it is expressed in the form of array assignment statement. However, the description of data distribution is same as MPI. To use simple coarray communication and easy data distribution by directives, we consider a hybrid-view approach, which combines the global-view and local-view models in XMP.

Conclusion
In this study, we implemented two versions of GTC-P, a large-scale nuclear fusion simulation code, using the global-view and local-view programming models in XMP for parallel programming languages, and we evaluated their performance and productivity. The first version, XMP-localview, only uses coarray communication in the local-view programming model, which simply replaces MPI point-to-point communication, except for collective communication such as MPI_Allreduce. The second version, XMP-hybridview, uses the distribution of the calculation domain and the reflect directive in the global-view programming model, as well as coarray communication for particle motion in the local-view programming model. Experimental evaluations showed that the XMP-localview implementation obtained approximately the same performance as MPI, whereas the XMP-hybridview implementation degraded the performance by 20%. In addition, we obtained high productivity with the XMP implementation. In XMP-localview, the coarray notation is simpler and more intuitive compared with MPI programming, and the XMPhybridview allows more natural data expression for both static grid space data (in the global-view model) and dynamic particle data (in the local-view model), thereby increasing the readability of the code. 16 Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence and indicate if changes were made. The images or other third party material in this chapter are included in the chapter's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.