Implementation and Performance Evaluation of Omni Compiler

This chapter describes the implementation and performance evaluation of Omni compiler, which is a reference implementation of a compiler for XcalableMP. For the performance evaluation, this chapter also presents how to implement the HPC Challenge benchmarks, a benchmark suite used to evaluate HPC systems and parallel languages. The results show that the performance of XMP is comparable to that of MPI in many cases.


Omni compiler
Omni compiler is a source-to-source compiler. First, it translates the directives in a user code into calls to its runtime library; if necessary, code outside the directives is also modified. Second, a native compiler (e.g., gcc or the Intel compiler) compiles the translated code and links it against the Omni compiler runtime library to create an execution binary. The runtime library uses MPI for XMP, CUDA for OpenACC, and both MPI and CUDA for XACC. For XMP, Omni compiler may create better runtime libraries by adding a one-sided communication library to MPI, which is described in Chap. 3.

Example of Code Translation
This section describes how Omni compiler translates a user code for the global-view memory model. A code translation for the local-view memory model is described in Chap. 3. In the translated code, a distributed array a is represented by a descriptor and two added variables:

  void *_XMP_DESC_a;
  double *_XMP_ADDR_a;
  unsigned long long _XMP_ACC_a_0;

  _XMP_init_array_desc(&_XMP_DESC_a, .., sizeof(double), 10, 10);
   :
  _XMP_alloc_array(&_XMP_ADDR_a, &_XMP_ACC_a_0, ..., _XMP_DESC_a);

Omni compiler also adds a function _XMP_alloc_array() to allocate memory for the distributed array; it sets the array's address in _XMP_ADDR_a and its leading dimension in _XMP_ACC_a_0. Note that a multidimensional distributed array is expressed as a one-dimensional array in the translated code, since the size of each dimension of the array may be determined dynamically. Figure 3 shows an XMP example code that uses a loop directive to parallelize a nested loop statement depending on the template t. Each dimension of t is distributed onto two nodes (the distribution directives are omitted here).

Loop Statement
In the translated code above, a pointer _XMP_MULTI_ADDR_a, which carries the size of each dimension, is used as the head pointer of the distributed array a[][]. To improve performance, operations in the loop statement are performed through this pointer [4]. Note that this pointer can be used only when the number of elements in each dimension of a distributed array is divisible by the number of nodes. If this condition is not met, the one-dimensional pointer _XMP_ADDR_a and the offset _XMP_ACC_a_0 are used, as shown in the translated code below.
Moreover, because the values in the ending conditions of the loop statement (i < 10, j < 10) are constants in the pre-translated code and are divisible by the number of nodes, they are translated to constants (i < 5, j < 5) automatically. If the values are variables in the pre-translated code, or are not divisible by the number of nodes, a runtime function is inserted just before the loop statement to calculate the values of the ending conditions. The calculated values are set in newly created variables. Figure 4 shows an XMP example code using a bcast directive to broadcast a local array b. Translations of communication directives are basically simple: the runtime functions call MPI functions directly.

Installation
This section describes how to install the latest Omni compiler, version 1.3.2. Omni compiler is installed by the usual UNIX procedure ( ./configure; make; make install ). When ./configure is executed without options, only XMP is installed. To install OpenACC and/or XACC support, some options must be passed to ./configure, as described in Sect. 3.5.

Overview
We provide two versions of Omni compiler: a "stable version" and a "nightly build version." While the stable version is the official version that has a version number, the nightly build version is a trial version that is released nightly on our website (https://omni-compiler.org). Omni compiler is developed in a GitHub repository (https://github.com/omni-compiler/omni-compiler). Our web server gets the source code from the GitHub repository and generates the nightly build version every day.

From GitHub
Please visit the GitHub repository (https://github.com/omni-compiler/omni-compiler), which provides only the nightly build version, or execute the following git command.

From Our Website
Please visit our website (https://omni-compiler.org), which provides packages of both the stable version and the nightly build version. The package of the nightly build version is generated around midnight (12:00 a.m. JST) if the GitHub repository was updated during the previous day. These packages contain XcodeML.

Software Dependency
Before installation of Omni compiler, the following software must be installed:
• yacc and lex
• C compiler (C99 or later)
• Fortran compiler (Fortran 2008 or later)
• Java compiler
• MPI (version 2 or later)
• libxml2
• make

General Installation
This section explains how to install Omni compiler in a general Unix environment.

One-Sided Library
Omni compiler may generate a better runtime library for XMP by using a one-sided communication library. Omni compiler supports the following one-sided libraries.

Creation of Execution Binary
This section describes how to create an execution binary from a code with XMP, XACC, and OpenACC directives, and how to execute it. Note that Omni compiler supports only C language for OpenACC.

Compile
• XMP in C language: $ xmpcc a.c
• XMP in Fortran: $ xmpf90 a.f90
• XACC in C language: $ xmpcc -xacc a.c
• XACC in Fortran: $ xmpf90 -xacc a.f90
• OpenACC in C language: $ ompcc -acc a.c

A native compiler finally compiles the code translated by Omni compiler. Thus, all compile options of XMP are passed to the native compiler. For example, when the optimization option "-O2" is used, it is passed to the native compiler.

XcalableMP and XcalableACC
Because the runtime libraries of XMP and XACC use MPI, a program is executed via an MPI execution command (e.g., "mpiexec"). However, when using GASNet, a program is executed via a GASNet execution command (e.g., "gasnetrun_ibv").

Cooperation with Profiler
In order to improve the performance of an application, it is useful to collect a profile. Omni compiler can cooperate with the profiling tools Scalasca (https://www.scalasca.org) and tlog. This function can profile the execution of XMP directives. Note that it currently supports only XMP in C language.

Scalasca
Scalasca is open-source software that measures and analyzes the runtime behavior of parallel programs.
When profiling all XMP directives that exist in code, please add the "--profile scalasca" option to a compile command.
$ xmpcc --profile scalasca a.c

When profiling only selected XMP directives, please add the "profile" clause to those directives and the "--selective-profile scalasca" option to the compile command.

tlog
Omni compiler package contains tlog that measures executing time of the XMP directives.
When profiling all XMP directives that exist in code, please add the "--profile tlog" option to a compile command.
$ xmpcc --profile tlog a.c

When profiling only selected XMP directives, please add the "profile" clause to those directives, as in Sect. 4.3.1, and the "--selective-profile tlog" option to the compile command.
After executing a program, tlog generates a file "trace.log" which stores the profiling results. To open the results, please use the "tlogview" command:

$ tlogview trace.log

Figure 6 shows an example of profiling by tlog.

Performance Evaluation
In order to evaluate the performance of XMP, we implemented the HPC Challenge (HPCC) benchmark (https://icl.utk.edu/hpcc/), namely, EP STREAM Triad (STREAM), High-Performance Linpack (HPL), Global fast Fourier transform (FFT), and RandomAccess [5]. While the HPCC benchmark is used to evaluate multiple attributes of HPC systems, the benchmark is also useful to evaluate the properties of a parallel language. The HPCC benchmark was used at the HPCC Award Competition (https://www.hpcchallenge.org). The HPCC Award Competition consists of two classes. While the purpose of class 1 is to evaluate the performance of a machine, the purpose of class 2 is to evaluate both the productivity and performance of a parallel programming language. XMP won the class 2 prizes in 2013 and 2014.

Experimental Environment
For performance evaluation, this section uses 16,384 compute nodes on the K computer and 128 compute nodes on a Cray CS300 system named "the COMA system." Tables 1 and 2 show the hardware specifications and software environments.
For comparison purposes, this section also evaluates the HPCC benchmark implemented in C with the MPI library. We execute STREAM, HPL, and FFT with eight threads per process on each CPU of the K computer, and with ten threads per process on each CPU of the COMA system. Since RandomAccess is not parallelized with threads and can be executed only with a power-of-two number of processes, we execute it with eight processes on each CPU of both systems.
The specification of HPCC Award Competition class 2 defines the minimum problem size for each benchmark. While the main array of HPL should occupy at least half of the system memory, the main arrays of STREAM, FFT, and RandomAccess should occupy at least a quarter of the system memory. We set each problem size to be equal to the minimum size. As for coarray syntax, Omni compiler uses FJRDMA on the K computer and uses GASNet on the COMA system.

Design
STREAM measures the memory bandwidth using a simple vector kernel (a ← b + αc). STREAM is so straightforward that its kernel does not require communication. Figure 7 shows part of the STREAM code [5], which calculates the performance on each node locally. In line 18, the reduction directive performs a reduction operation among nodes to calculate the total performance.

Evaluation
First of all, in order to assess the effectiveness of #pragma loop xfill and #pragma loop noalias, we evaluate STREAM with and without these directives on a single node of the K computer. We also insert these directives into the MPI implementation for this evaluation. Figure 8 shows that the performance results with these directives are about 1.46 times better than those without them. Therefore, we use the directives in the following evaluations. Figure 9 shows the performance results for STREAM on the K computer and the COMA system [5], together with a comparative performance evaluation of both implementations, called the "performance ratio." When the performance ratio is greater than 1, the performance result of the XMP implementation is better than that of the MPI implementation. XMP's best performance result is 11.55 TB/s for 128 compute nodes on the COMA system. The values of the performance ratio are between 0.99 and 1.00 on both systems.

Design
HPL evaluates the floating-point rate of execution for solving a linear system of equations. The performance result has been used in the TOP500 list (https://www.top500.org). To achieve a good load balance in HPL, we distribute the main array in a block-cyclic manner. Moreover, in order to achieve high performance with portability, our implementation calls BLAS [6] to perform the matrix operations. These techniques are inherited from the MPI implementation. HPL has an operation in which a part of the coefficient matrix is broadcast to the other process columns asynchronously. This operation, called "panel broadcast," is one of the most important operations for overlapping panel factorizations and data transfer. Figure 11 shows the implementation, which uses the gmove directive with the async clause.

The leading dimensions L_ld and A_ld, obtained with xmp_array_lda(), are passed to cblas_dgemm(). Note that L_ld and A_ld remain unchanged from the beginning of the program, and so each xmp_array_lda() is called only once. Figure 13 shows the performance results and performance ratios for HPL [5]. XMP's best performance results are 402.01 TFlops (76.68% of the peak performance) for 4096 compute nodes on the K computer, and 47.32 TFlops (70.02% of the peak performance) for 128 compute nodes on the COMA system. The values of the performance ratio are between 0.95 and 1.09 on the K computer, and between 0.99 and 1.06 on the COMA system.

Design
FFT evaluates the performance of a double-precision complex one-dimensional discrete Fourier transform. We implement the six-step FFT algorithm [7, 8] using the FFTE library [9]. The six-step FFT algorithm is also used in the MPI implementation. In the six-step FFT algorithm, both the computing performance and the all-to-all communication performance for a matrix transpose are important. The six-step FFT algorithm reduces the cache-miss ratio by expressing the data as a two-dimensional array. We develop the XMP implementation in XMP in Fortran, because the FFTE library is written in Fortran and is therefore easy to call. In addition, we use the XMP intrinsic subroutine xmp_transpose() to transpose a distributed array in the global-view memory model. Figure 14 shows an example of xmp_transpose() [5]. The first argument is the output array, and the second argument is the input array. The third argument is an option to save memory, and is "0" or "1." If it is "0," the input array must not be changed. If it is "1," the input array may be changed but less memory may be used. Thus, we use "1" in the XMP implementation. In Fig. 14, the second dimensions of arrays a() and b() are distributed in a block manner, and array b() is transposed into array a(). For example, elements b(1:3,2) on p(2) are transferred to elements a(2,1:3) on p(1). Figure 15 shows a part of the XMP implementation. In lines 1-9, arrays a(), b(), and w() are distributed in a block manner. The a() array is aligned with template ty, and the b() and w() arrays are aligned with template tx. In lines 16-20, each thread on all nodes calls the FFTE subroutine zfft1d(), which is applied to the distributed array b(). Note that the subroutine zfft1d() executes with a single thread locally. In lines 22-23, the XMP loop directive and the OpenMP parallel directive parallelize the loop statement.
In line 30, xmp_transpose() is used to transpose the distributed two-dimensional array. Figure 16 shows the performance results and performance ratios. XMP's best performance results are 39.01 TFlops for 16,384 compute nodes on the K computer, and 0.94 TFlops for 128 compute nodes on the COMA system. The values of the performance ratio are between 0.94 and 1.13 on the K computer, and between 0.94 and 1.12 on the COMA system.

Design
RandomAccess evaluates the performance of random updates of a single table of 64-bit integers, which may be distributed among processes. The random update of a distributed table requires all-to-all communication. We implement a recursive exchange algorithm [10], as in the MPI implementation. The recursive exchange algorithm consists of multiple steps; in each step, a process sends a data chunk to another process. Because RandomAccess requires a random communication pattern, as its name suggests, the pattern is not supported by the global-view memory model. Thus, we use the local-view memory model to implement RandomAccess. Note that the MPI implementation uses the functions MPI_Isend() and MPI_Irecv().

Implementation
A source node transfers a data chunk to a destination node, and then the destination node updates its own table using the received data. The MPI implementation repeatedly executes the recursive exchange algorithm in units of 1024 table elements; the constant value 1024 is defined by the HPCC Award Competition class 2 specification. The recursive exchange algorithm sends about half of the 1024 elements in each step. Therefore, the chunk size is about 4096 Bytes (= 1024/2 elements × 64 bits / 8). Note that the destination node cannot know how many elements are sent by the source node. Thus, the MPI implementation obtains the number of elements using the function MPI_Get_count(). We implement the recursive exchange algorithm using a coarray and the post/wait directives, and the number of elements is stored in the first element of the coarray.
Figure 17 shows a part of the XMP implementation. In line 2, the coarrays recv[][][] and send[][] are declared. In line 6, the data chunk size is set in the first element of the coarray, which is put in line 7. In line 8, the node sends notification of the completion of the coarray operation of line 7 to the node p[ipartner]. In line 10, the node receives the notification from the node p[jpartner], which ensures that the node p[jpartner] has received the data. In line 11, the node gets the number of elements in the received data. In line 12, the node updates its own table using the received data.
Figure 18 shows the performance results and performance ratios. The Giga-updates per second (GUPS) on the vertical axis is the measurement value, which is the number of table updates per second divided by 10^9. XMP's best performance results are 259.73 GUPS for 16,384 compute nodes on the K computer, and 6.23 GUPS for 128 compute nodes on the COMA system.
The values of the performance ratio are between 1.01 and 1.11 on the K computer, and between 0.57 and 1.03 on the COMA system. On the K computer, the performance results for the XMP implementation are always slightly better than those for the MPI implementation. However, on the COMA system, the performance results for the XMP implementation are worse than those for the MPI implementation when multiple CPUs are used.

Discussion
We implement STREAM, HPL, and FFT using the global-view memory model, which enables programmers to easily develop parallel codes from sequential codes using the XMP directives and functions. Specifically, in order to implement the parallel STREAM code, a programmer only adds XMP directives to the sequential STREAM code. The XMP directives can coexist with existing directives, such as OpenMP directives and Fujitsu directives. Moreover, existing high-performance libraries, such as BLAS and FFTE, can be used with an XMP distributed array. These features improve the portability and performance of XMP applications. We also implement RandomAccess using the local-view memory model, where the coarray syntax enables a programmer to transfer data intuitively. In the evaluation, the performance of the XMP implementation is better than that of the MPI implementation on the K computer, but worse on the COMA system.
To clarify why the XMP performance drops on the COMA system, we develop a simple ping-pong benchmark using the local-view memory model. The benchmark measures the latency of repeatedly transferring data between two nodes. For comparison purposes, we also implement a version using MPI_Isend() and MPI_Irecv(), which are used in the MPI version of RandomAccess. Figure 19 shows parts of the XMP and MPI codes [5]. Figure 20 shows the latency for transferring data. The results on the K computer show that the latency for the XMP implementation is better than that for the MPI implementation for transfer sizes of 2048 Bytes or greater.
In contrast, the results on the COMA system show that the latency of the XMP implementation is always worse than that of the MPI implementation. At 4096 Bytes, which is the average data chunk size, the latency of XMP with FJRDMA is 5.83 μs and the latency of MPI is 6.89 μs on the K computer, while the latency of XMP with GASNet is 5.05 μs and that of MPI is 3.37 μs on the COMA system. Thus, we consider that the communication performance is the reason for the performance difference in RandomAccess. The performance difference is also due to the differences in the synchronization mechanisms of the one-sided XMP coarray and the two-sided MPI functions. Note that a real application would not synchronize after every one-sided communication; to achieve higher performance, a single synchronization should occur after multiple one-sided communications. In addition, the performance results for HPL and FFT differ slightly from those for the MPI implementations. We consider that these differences are caused by small differences in the implementations. In HPL, for the panel-broadcast operation, the XMP implementation uses the gmove directive with the async clause, which calls MPI_Ibcast() internally. In contrast, the MPI implementation uses MPI_Send() and MPI_Recv() to perform the operation by the "modified increasing ring" [11]. In FFT, the XMP implementation uses XMP in Fortran, whereas the MPI implementation uses C. Both implementations call the same FFTE library. In addition, the MPI implementation uses MPI_Alltoall() to transpose a matrix. Since xmp_transpose() calls MPI_Alltoall() internally, the performance levels for both xmp_transpose() and MPI_Alltoall() must be the same. Therefore, the language differences and refactoring may have caused the performance difference.

Conclusion
This chapter describes the implementation and performance evaluation of Omni compiler. We evaluate the performance of the HPCC benchmark in XMP on the K computer with up to 16,384 compute nodes and on a generic cluster system with up to 128 compute nodes. The performance results for the XMP implementations are almost the same as those for the MPI implementations in many cases. Moreover, the evaluation demonstrates that the global-view and local-view memory models are useful for developing the HPCC benchmark.
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.