Optimizing Three-Dimensional Stencil-Operations on Heterogeneous Computing Environments

Complex algorithms and enormous data sets require parallel execution of programs to attain results in a reasonable amount of time. Both aspects are combined in the domain of three-dimensional stencil operations, for example, in computational fluid dynamics. This work contributes to the research on high-level parallel programming by discussing the generalizable implementation of a three-dimensional stencil skeleton that works in heterogeneous computing environments. Two exemplary programs, a gas simulation with the Lattice Boltzmann method and a mean blur, are executed in a multi-node, multi-GPU environment, demonstrating the runtime improvements of heterogeneous computing environments compared to a sequential program.


Introduction
The field of High Performance Computing (HPC) is growing as intricate algorithms require more processing power and the demand for handling larger datasets rises. Efficient parallel programs are necessary for evaluating massive datasets. Most HPC environments incorporate numerous nodes equipped with multiple central processing units (CPUs) and graphics processing units (GPUs). Creating programs that combine multiple nodes and accelerators necessitates an understanding of low-level frameworks such as implementations of the Message Passing Interface (MPI) [1], OpenMP [2], and CUDA [3]. Moreover, writing a parallel program is error-prone and tedious; e.g., out-of-memory errors and invalid memory accesses are troublesome to identify even for skilled programmers. Furthermore, even faultless programs need to be optimized, as the selection of memory spaces, the distribution of data, and the allocation of tasks to threads are design choices that significantly impact performance but require experience, which scientists usually lack.
Since experts in this field are hard to find, high-level frameworks are a convenient alternative to bypass the tedious creation of parallel programs. High-level frameworks commonly abstract from the distribution of data, provide portable code for different hardware architectures, are adjustable to distinct accelerators, and require less maintenance for the end user. In 1989, Cole introduced algorithmic skeletons, which enclose reoccurring parallel and distributed computing patterns, as one of the most common approaches to abstract from low-level details [4]. Multiple libraries [5,6], general frameworks [7,8], and domain-specific languages [9] use the concept.
The paper contributes to the ongoing work by focusing on a particularly arduous operation, namely the three-dimensional stencil operation. Stencil operations calculate elements depending on neighboring elements within the data structure and therefore require communication between the computational units used. Those operations are essential for, e.g., computational fluid dynamics of gas or airflow. Current high-level approaches are not capable of efficiently updating data in a generalized way while dealing with three-dimensional data structures.
This paper initially presents related work, concentrating on high-level approaches that abstract from problem-specific details (Section 2). Section 3 outlines the Muesli library used, while Section 4 explains the additional implementation of the three-dimensional skeleton and the examples used to measure the runtime. Section 5 evaluates the work, discussing our runtime experiments on various hardware setups. Finally, a summary of our research is given in Section 6.

Related Work
Ongoing work discussing three-dimensional stencil operations is twofold. On the one hand, generic high-level frameworks targeting three-dimensional stencils have the advantage of parallelizing pre- and postprocessing steps, as they offer a variety of operations/skeletons. On the other hand, specialized frameworks already contain implementations for processing stencil calculations but are often inefficient, as they focus on the algorithm and not on parallelization. Most closely related in the area of high-level skeleton programming frameworks, SkePU 3 targets multi-node and multi-GPU environments for most skeletons in combination with StarPU. However, for stencil operations (called MapOverlap), the data exchange between the programs is missing for multi-node programs [7].
FastFlow added GPU support but focuses on communication skeletons and misses a comparable stencil operation [8,10]. Lift handles n-dimensional stencils on single GPUs, missing the combination of multiple nodes and accelerators [11]. SkelCL implements a MapOverlap skeleton for multiple GPUs, missing multiple nodes [12]. Specialized libraries such as Palabos for the Lattice Boltzmann methods (LBMs) [13], or publications discussing a single method, e.g., the Helmholtz equation [14], focus on the algorithm and do not include accelerators like GPUs.
This work extends the mentioned work, as the presented stencil skeleton is generalizable for multiple applications, allows pre- and postprocessing steps, and runs on multiple nodes and accelerators. The implementation of the skeleton is tested with two examples: 1) a version of the LBM and 2) a three-dimensional mean blur.
The Muesli Library

The library used is the Muenster Skeleton Library (Muesli) [5]. Muesli provides an object-oriented approach that offers one-, two-, and three-dimensional data structures (DA, DM, DC) with skeletons as member functions. The supported skeletons are, for example, multiple versions of Map and Zip (index and in-place variants), Fold, Gather, and, as discussed in this work, MapStencil. Internally, MPI, OpenMP, and CUDA are used, which enables simultaneous parallelism on multiple nodes, CPUs, and GPUs. The library can be included with a simple include statement #include<muesli.h>. For writing a parallel program, Muesli provides methods to state, among others, the number of processes and GPUs used. Apart from that, Muesli abstracts from parallel programming details by internally distributing the data structures on the available computational units, choosing the number of threads started on the corresponding low-level framework, and copying data to the correct memory spaces. This abstraction also reduces errors commonly made by inexperienced programmers, such as race conditions and inefficient data distribution.
Listing 1 shows a simple program calculating the scalar product of the distributed arrays a and b in Muesli. In line 6, a distributed array of size three with a default value of 2 is created. The skeleton calls in lines 7-9 show that skeletons take a user function as an argument, which can either be a C++ function or a C++ functor. For the index variant of map, Muesli applies the argument function of map to each element of the data structure (here a distributed array, DA) and its index (line 7). For the zip skeleton, the second required data structure is passed as an argument. Lastly, lines 7 and 9 show that the same function can be used in different contexts, firstly for calculating the sum of the index and the value, and secondly as a reduction operator. Stencil operations are map operations that additionally require reading the surrounding elements of each considered element. Figure 1 displays a two-dimensional stencil with radius one. The peculiarity of stencil operations on multiple nodes and accelerators is that each execution of the stencil operation requires updating elements that are shared between computational units. As communication of updated elements requires synchronization between the computational nodes, it decreases the opportunity for executing tasks in parallel within the program. Muesli abstracts from all communication between the computational nodes with a MapStencil skeleton.
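For readers without Muesli at hand, the semantics of the three skeleton calls can be sketched sequentially in plain C++. The names mapIndex, zip, and fold are illustrative stand-ins; Muesli's actual member-function signatures differ.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sequential sketch of the skeleton semantics used in Listing 1:
// a map over (index, value), an element-wise zip, and a fold (reduction).
// The functions sum/product play the role of the user functions passed
// to the skeletons.
int sum(int a, int b) { return a + b; }
int product(int a, int b) { return a * b; }

std::vector<int> mapIndex(std::vector<int> v, int (*f)(int, int)) {
    for (std::size_t i = 0; i < v.size(); ++i)
        v[i] = f(static_cast<int>(i), v[i]); // f applied to index and value
    return v;
}

std::vector<int> zip(std::vector<int> a, const std::vector<int>& b,
                     int (*f)(int, int)) {
    for (std::size_t i = 0; i < a.size(); ++i)
        a[i] = f(a[i], b[i]); // element-wise combination
    return a;
}

int fold(const std::vector<int>& v, int (*f)(int, int), int init) {
    int acc = init;
    for (int x : v) acc = f(acc, x); // reduction with user function
    return acc;
}
```

With a = b = {2, 2, 2}, mapIndex(a, sum) yields {2, 3, 4}; zipping with b via product gives {4, 6, 8}; folding with sum produces the scalar product 18. Note how sum serves both as map function and as reduction operator, mirroring lines 7 and 9 of Listing 1.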

Using the MapStencil Skeleton in Muesli
The usage of the three-dimensional mapStencil skeleton is shown in Listing 2. Firstly, a function to be executed on each element is defined (l. 1-10). Merely functors of type DCMapStencilFunctor can be used with the mapStencil skeleton. Therefore, the first argument of the functor has to be of type PLCube (PaddedLocalCube), and the subsequent arguments must be integers for indexing the data structure. Most importantly, the class PLCube offers a getter-function taking three index arguments, relieving the end-user from index calculations (l. 8). The presented functor calculates the sum of all elements within a radius of two and divides the sum by the total number of elements, therefore implementing a mean filter. This functor can be applied to a distributed cube by calling the mapStencil skeleton as a member function (l. 14). The skeleton takes the functor as a template argument and requires a distributed cube of the same dimensions holding the current data, the radius of the stencil, and the neutral value for border elements.

Implementation of the MapStencil Skeleton
Supplementing the previous chapter, implementation details of the mapStencil skeleton are described to facilitate the reverse engineering of stencil operations for other projects. The description is twofold: firstly, it discusses the PLCube class, which allows access to neighbour elements in the user function; secondly, the kernel invocation.
Adding the MapStencil skeleton to the existing distributed cube (DC) class requires adding two additional member fields: a vector of PLCubes and a (maximal) supported stencil radius. As previously mentioned, the PLCube class allows the end-user to abstract from the indexing of the data structure. To make access to the different memory spaces efficient, each computational unit has a separate PLCube storing merely the elements needed to calculate the assigned elements. This design choice makes the class flexible enough to be used for CPUs and for GPUs. To provide a light, minimal design, it contains the following attributes:
• int width, height, depth - the three dimensions of the data structure,
• int stencilRadius - the radius of the stencil, required to calculate the overlapping elements,
• int neutralValue - used when an index is outside of the data structure,
• T* data, T* topPadding, T* bottomPadding - CPU or GPU pointers for the current data,
• four integers storing the global indexes for the start and end of the main and padding data areas.
Most importantly, the getter-function is implemented, taking three integers as arguments and returning the suitable value. This is either the neutral value or the corresponding element from CPU or GPU memory. Assuming GPUs are used as accelerators, the skeleton updates the current data structure in case the data is not up to date (Listing 3, l. 6). Afterwards, it synchronizes the PLCubes inside one node and the data between multiple nodes (l. 7, l. 9). For each GPU used, the mapStencilKernelDC kernel is called, executing the functor on the appropriate part of the overall data structure. In any other case (multiple nodes, CPU only), it is only necessary to synchronize the nodes (l. 21) and call the functor with the corresponding arguments (l. 29).
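The index handling of the getter can be illustrated with a simplified, CPU-only sketch. The real Muesli class additionally manages GPU pointers and the padding areas; here a single contiguous buffer with a neutral value suffices, and the struct name and layout are assumptions for illustration.

```cpp
#include <cassert>
#include <vector>

// Simplified, CPU-only sketch of the PLCube getter described above.
template <typename T>
struct PLCubeSketch {
    int width, height, depth;   // dimensions of the data structure
    T neutralValue;             // returned for out-of-range indexes
    std::vector<T> data;        // row-major storage: x fastest, z slowest

    T get(int x, int y, int z) const {
        if (x < 0 || x >= width || y < 0 || y >= height || z < 0 || z >= depth)
            return neutralValue;                   // border handling
        return data[(z * height + y) * width + x]; // index calculation hidden from the user
    }
};
```

This mirrors the two cases named in the text: out-of-range accesses yield the neutral value, and in-range accesses resolve to the linearized element without the user performing any index arithmetic.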

Example Applications for Three-Dimensional Stencil Operations
Two examples are used to evaluate our implementation: an implementation of the Lattice Boltzmann Method (LBM) and a mean blur. The LBM is used for fluid simulations, e.g., the distribution of gas. It distinguishes between the collision and the streaming step, which alternate in continuous simulations [18, p. 61ff.]. In the streaming step, gas particles move from one cell to another. The fluid flow caused by the colliding particles is calculated in the collision step. The distribution function f_i(x, t) states, for a cell x and a timestamp t, how many particles move in the next step to neighbor i; index 0 corresponds to the cell itself. f_i^* defines the distribution after the collision of the particles (see formula (2)), so the streaming step moves the post-collision distributions to the neighboring cells:

    f_i(x + c_i ∆t, t + ∆t) = f_i^*(x, t),

where ∆t is the period to be simulated.
For the collision step, the Bhatnagar-Gross-Krook (BGK) operator is used:

    f_i^*(x, t) = f_i(x, t) + (∆t/τ) (f_i^eq(x, t) - f_i(x, t))    (2)

τ is a constant defining the convergence of the simulation and thus influences the viscosity of the gas.
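The BGK relaxation and the equilibrium distribution can be sketched in a few self-contained lines. The function names are illustrative, and the equilibrium follows the standard second-order LBM form consistent with the definitions given in the text; the actual Muesli user function combines these with the streaming step.

```cpp
#include <array>
#include <cassert>
#include <cmath>

// Equilibrium distribution f_i^eq for one direction i, given the lattice
// weight w_i, the mass density rho, the lattice velocity c_i, the local
// velocity u, and the model's speed of sound cs.
double equilibrium(double w_i, double rho, const std::array<double, 3>& c_i,
                   const std::array<double, 3>& u, double cs) {
    double cu  = c_i[0] * u[0] + c_i[1] * u[1] + c_i[2] * u[2]; // c_i . u
    double uu  = u[0] * u[0] + u[1] * u[1] + u[2] * u[2];       // u . u
    double cs2 = cs * cs;
    return w_i * rho * (1.0 + cu / cs2 + (cu * cu) / (2.0 * cs2 * cs2)
                        - uu / (2.0 * cs2));
}

// BGK collision: relax f_i towards the equilibrium with rate dt/tau.
double bgk_collide(double f_i, double f_eq, double dt, double tau) {
    return f_i + (dt / tau) * (f_eq - f_i);
}
```

For a fluid at rest (u = 0), the equilibrium reduces to w_i * rho, which is a convenient sanity check when initializing a simulation.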
The equilibrium state is calculated by

    f_i^eq = w_i ρ (1 + (c_i · u)/c_s^2 + (c_i · u)^2/(2 c_s^4) - (u · u)/(2 c_s^2)),

where w_i are the weights of the chosen grid and c_i is the position of the neighbor cell relative to the main cell. The constant c_s is the sound velocity of the model. The mass density ρ and the velocity u are defined by

    ρ = Σ_i f_i,    u = (1/ρ) Σ_i c_i f_i.

For the implementation of the LBM, a D3Q19 grid was used, D being the number of dimensions and Q the number of neighbors. Both steps (collision and streaming) are combined in one mapStencil call. Noteworthy, the implementation has to consider that single cells can be marked as blocked, simulating objects that are barriers to the flow of gas, or as constantly distributing gas. Therefore, special cells are marked with Not a Number (NaN) values. To simulate this behavior without requiring additional storage, the handling of the floating point numbers is extended. According to the IEEE-754 standard, each floating point number that has a maximal exponent and a mantissa that is not equal to zero is considered Not a Number. The most significant bit of the mantissa of f_0 is set so that the number is definitely understood as NaN; the remaining bits of the mantissa can then be used freely to store other data. Bit masks and a struct with bit-fields are defined in the code to access this information as easily as possible (Listing 4). The data stored for each cell is an array<float, Q>, Q being a constant for the neighbor cells and the cell itself; this type is abbreviated in the following listing as cell_t. Moreover, the listing abstracts from the three-dimensional vector operations (l. 28, 29, 31, 33). The user function starts by transforming the current value of the cell into its single float parts (l. 4). In case it is a cell that distributes gas (FLAG_KEEP_VELOCITY), the cell remains without changes (l. 5-7). In any other case, for all neighbor cells, the current amount of particles is read (l. 10-12). In the collision step, all cells which are obstacles reverse the airflow (l. 16-22). All other cells calculate the particles streaming from the next cells (l. 27-34).
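The NaN tagging scheme can be demonstrated with a few self-contained lines. The mask values and helper names below are illustrative, not the actual Muesli definitions, and assume IEEE-754 binary32 floats: the exponent bits are all ones and the mantissa MSB is set (a quiet NaN), leaving the remaining 22 mantissa bits free for flags.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Illustrative masks: quiet-NaN pattern plus a 22-bit payload.
constexpr std::uint32_t QNAN_BITS    = 0x7FC00000u;
constexpr std::uint32_t PAYLOAD_MASK = 0x003FFFFFu;

// Encode cell flags into a float that is guaranteed to be NaN.
float encodeFlags(std::uint32_t flags) {
    std::uint32_t bits = QNAN_BITS | (flags & PAYLOAD_MASK);
    float f;
    std::memcpy(&f, &bits, sizeof f); // type-pun safely via memcpy
    return f;
}

// Recover the payload from a tagged float.
std::uint32_t decodeFlags(float f) {
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);
    return bits & PAYLOAD_MASK;
}
```

Because the stored f_0 value of a regular fluid cell is an ordinary float, a simple NaN check distinguishes tagged cells from untagged ones without any extra storage per cell.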

Listing 4: User function update of the LBM (excerpt): MSL_USERFUNC cell_t update(const ...)
The second example is a mean filter, commonly used for smoothing images (user function in Listing 2). Aside from images, such filters are commonly used to preprocess data to reduce noise, e.g., in signal processing and other application contexts. This example application has the advantage that the stencil size (i.e., the radius) can be varied, as different stencil sizes are reasonable depending on the context. Moreover, this program does not require if-clauses, which potentially slow down the program.
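A sequential reference version of the three-dimensional mean filter with a configurable radius can be sketched as follows. The function name is illustrative; out-of-range neighbors contribute a neutral value, matching the border handling of the skeleton described above, and a cubic domain of side length n is assumed for brevity.

```cpp
#include <cassert>
#include <vector>

// Sequential 3D mean filter sketch with a configurable stencil radius.
std::vector<double> meanFilter3D(const std::vector<double>& in,
                                 int n, int radius, double neutral) {
    auto at = [&](int x, int y, int z) {
        if (x < 0 || x >= n || y < 0 || y >= n || z < 0 || z >= n)
            return neutral;               // border handling via neutral value
        return in[(z * n + y) * n + x];
    };
    int side = 2 * radius + 1;
    double count = static_cast<double>(side) * side * side; // e.g. 125 for radius 2
    std::vector<double> out(in.size());
    for (int z = 0; z < n; ++z)
        for (int y = 0; y < n; ++y)
            for (int x = 0; x < n; ++x) {
                double sum = 0.0;
                for (int dz = -radius; dz <= radius; ++dz)
                    for (int dy = -radius; dy <= radius; ++dy)
                        for (int dx = -radius; dx <= radius; ++dx)
                            sum += at(x + dx, y + dy, z + dz);
                out[(z * n + y) * n + x] = sum / count; // mean over the stencil
            }
    return out;
}
```

Note the absence of data-dependent branches in the inner loops; this is exactly the property that later makes the mean filter a favorable GPU workload compared to the LBM.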

Evaluation
As the purpose of high-level parallel frameworks is to facilitate writing efficient parallel programs, the speedup needs to be measured. For this purpose, the presented exemplary programs, a mean filter and an LBM implementation, are executed on the HPC machine Palma II. For testing CPU parallelization, the zen2 partition is used, which is equipped with 12 nodes, each with one Zen2 (EPYC 7742) CPU with 64 cores. For running the sequential version, a single Skylake (Gold 6140) CPU starting a single thread is used, since all sequential programs have to run on the normal partition and cannot be run on the zen2 partition.
For the GPU programs, the gpu2080 and gpuhgx partitions are used. The gpu2080 partition has five nodes, each with 8 GeForce RTX 2080 Ti GPUs. The gpuhgx partition is equipped with two nodes, each with 8 A100 SXM GPUs. These partitions, most importantly, vary in the maximum memory and the number of cores. The gpu2080 partition has more nodes; however, each GPU has 11 GB VRAM and 4352 cores, allowing less parallelization than more powerful GPUs (such as the A100) can provide. In contrast, the gpuhgx partition has 80 GB VRAM per GPU, allowing bigger data structures to be processed, and more cores per GPU (6912) to speed up the program. Using different GPUs also contributes to proving the universal applicability of the MapStencil skeleton in Muesli.
Next to a sequential program, the Muesli programs solving the LBM are compared to a native implementation. The native program is written to the best of our knowledge, which can be categorized as 2-3 years of experience in C/C++ programming with a focus on optimization.

LBM
The LBM is used to simulate fluid flow in three-dimensional space. Consequently, it is reasonable to run an experiment that does not execute the mapStencil skeleton only once but runs multiple iterations simulating multiple dispersion steps. For every experiment, 200 iterations were chosen to make run times comparable between different data sizes.
Data sizes were chosen to completely utilize the available storage. For the LBM, each cell requires 76 bytes, as each cell stores 19 32-bit floating point numbers. For the calculation, one data structure to read and one to write is necessary. The largest theoretically possible side length for a given amount of memory can thus be calculated as

    sideLength = ⌊(memory / (2 · 76 B))^(1/3)⌋.

This results in a maximum side length of 426 for the RTX 2080 Ti GPU and 826 for the A100 SXM. Although the CPU partition would support bigger data structures, the data size was not increased to the maximum, as the speedup converged and the runtime of the sequential program became unreasonably high (approximately 10 hours for calculating the LBM simulation for a data size of 960^3).
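The reported side lengths can be reproduced with a few lines, interpreting the memory sizes of 11 GB and 76 bytes per cell as stated above (GB read as GiB, which matches the published numbers):

```cpp
#include <cassert>
#include <cmath>

// Largest cubic side length such that two copies of the data structure
// (one to read, one to write) with 76 bytes per cell fit into memory.
long long maxSideLength(long long memoryBytes, long long bytesPerCell = 76) {
    double cells = static_cast<double>(memoryBytes) / (2.0 * bytesPerCell);
    return static_cast<long long>(std::cbrt(cells)); // floor of the cube root
}
```

For 11 GiB this yields 426 (RTX 2080 Ti) and for 80 GiB it yields 826 (A100 SXM), matching the values in the text.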
Table 2 shows that for the CPU zen2 partition, a speedup of 116 can be reached with a single CPU. As the CPU has 64 cores, the additional speedup is caused by multithreading: in the optimal case, 128 threads in total are started on the 64 available cores. In this scenario, it should be considered that the calculations are easy to execute in parallel, as all data resides in the memory of the single CPU, accessible to all threads. Using four nodes requires communicating the border values, thus requiring operations that cannot be executed in parallel. Although more elements need to be communicated with increasing data sizes, the share of operations that require communication decreases, thus allowing more parallelism. Accordingly, the best speedup with four nodes is achieved for the biggest tested data size (960^3). Eight nodes achieve a speedup of 452. Table 2: Runtimes (seconds) and speedups for the parallel implementation of the LBM gas simulation for CPU programs on the zen2 partition.
In contrast, the GeForce RTX 2080 Ti achieves a speedup of 88 with a single GPU. The maximum speedup which can be achieved depends on the GPU used.
The GeForce RTX 2080 Ti has 68 streaming multiprocessors (SMs), each capable of executing 64 threads in parallel (4352 in total). Although additional threads may be scheduled, the hardware does not have physical cores to run them in parallel. Multiple factors limit the possible speedup; most importantly, threads with diverging execution branches cannot be executed in parallel, limiting the 64 threads executed in parallel per SM.
The A100 has 108 SMs, allowing more threads to be executed in parallel; the maximum speedup achieved is roughly 353. In order to check whether a low-level program is significantly faster, a native implementation was programmed to be compared against the Muesli program. As can be seen in Figure 2, the implementations are close to each other. The Muesli program has a slight overhead compared to the native implementation; however, as it is very small, the differences are expected and negligible. Runtimes for bigger data structures are not included for one GPU and two GPUs to increase the readability of the graph. Also worth mentioning is that the native implementation has 544 lines of code without implementing MPI inter-node communication. In contrast, the Muesli program has 246 lines of code and can be run on multiple nodes. Moreover, it should be considered that writing native code is arduous, as each line requires fine-tuning. Table 3: Speedup for the parallel implementation of the LBM gas simulation on the gpu2080 partition.
Regarding the scalability on multiple GPUs, an extract of the achieved speedups can be seen in Table 3. Compared to a single-GPU version, a speedup of 1.7 can be reached for two GPUs, and for four GPUs, a speedup of 2.75 is achieved. To verify that the communication causes the overhead, the time for the update function was measured separately. Without the communication, speedups of 1.94 and 3.88 were achieved; the remaining difference can be attributed to the synchronization of streams. Although the speedup is limited by the communication operations, using multiple GPUs also has the advantage of being able to process bigger data structures, since there is more memory available. Table 4: Speedup for the parallel implementation of the LBM gas simulation for multiple nodes on the gpu2080 partition.
1 Runtimes without communication between nodes and GPUs.
Considering multiple levels of parallelism, the program can also run on multiple nodes equipped with multiple GPUs. Runtimes are depicted in Figure 3 and Table 4. The columns in the table distinguish between runtimes including all communication between devices and nodes and runtimes only measuring the calculation on the accelerators. Most strikingly, the runtimes for four nodes and four GPUs do not grow linearly but show a switching pattern. This is caused by splitting the data structure not into complete but into incomplete slices (e.g., a side length of 640 yields 40 slices per GPU, while 680 yields 42.5). The program can handle this; however, the communication effort rises.

Mean Filter
Using the LBM implementation as an exemplary program has multiple downsides. Firstly, one factor that has a major influence on the generalizability of the skeleton stays constant, namely the stencil radius. This factor influences the number of elements that need to be communicated between the computational nodes. Therefore, it is essential to vary this factor to analyze the performance of the skeleton. Moreover, the implementation of the LBM has multiple conditional statements, whose branch divergence slows down the possible parallelism. In contrast, the mean blur does not contain any if-branches.
Firstly, the CPU parallelism is discussed. Table 5 lists the speedup for one, four, and eight nodes compared to a sequential program. Noteworthy, the optimal time is not achieved by using 128 threads but by using 64 threads: as the instructions are easily executed in parallel, scheduling threads that are executed while other threads are idle is no longer beneficial. For four nodes, a speedup of 184 can be reached. As can be seen, for rising data sizes, the speedup improves, as fewer communication operations are required. The same applies to programs using eight nodes, reaching a speedup of 348.
Secondly, the speedup for the GeForce RTX 2080 Ti GPUs is measured. The program has fewer conditional statements, avoiding branch divergence. The speedup for a single GeForce RTX 2080 Ti GPU for a data size of 400^3 is 747, significantly better than for the LBM implementation. Tables 6 and 7 list the runtimes and speedups for a small stencil radius of two (reading 125 elements per calculation) for one and two nodes with one or four GPUs each. Including communication operations, a speedup of 3.6 is reached for four GPUs. To ensure that this is caused by the communication operations, the runtime spent on calculation is measured separately. The speedup depicted in the last column reaches 3.99, which is close to the optimum of 4. As communication operations require synchronization, the optimal speedup is hardly achievable. In contrast, scaling across nodes improves the speedup for a single GPU per node from 747 to 1253; comparing the runtimes without communication, the 43.65 seconds for one node are nearly halved to 24.88 seconds for two nodes. Scaling from one node to two nodes is efficient, as merely one overlapping data region needs to be communicated. The speedup for four GPUs behaves similarly to the above-explained behavior.
Regarding bigger stencil radii, the runtimes for a stencil radius of 10 are listed as well. This requires reading 9261 elements per calculation. This extreme example is chosen to observe the runtime and speedup when not all elements can be loaded into caches.
Besides changing the hardware, the stencil radius was adjusted to discuss the impact on the performance. With an increasing stencil radius, the calculation of one element requires more read operations: for a stencil radius of two, the sum of 125 elements is calculated, and this number grows cubically. Figure 4 depicts the influence on the runtime. Although the number of elements processed grows cubically, the runtime does not. Moreover, the speedup improves with a growing share of calculation operations; this is listed in detail in Table 8. Similar to the CPU program, the speedup with and without communication is measured. For bigger stencil radii, the speedup comes closer to the optimum: although more elements need to be communicated between the nodes, the increased intensity of the calculation outweighs the communication. As can be seen, the two-node program is faster. This is caused by parallelizing communication, as some operations are executed by the CPU instead of communicating only between GPUs. In contrast, the second figure shows that when a single-node program with two GPUs is compared to a two-node program with one GPU each, the single-node program is slightly faster, as a communication operation between GPUs is faster than an MPI communication between two nodes.
The program was also tested on the A100 partition, allowing even bigger data sizes to be processed by a single GPU. In contrast to a single GeForce RTX 2080 Ti GPU (speedup 747), the A100 reaches a speedup of 1122.34 for a data size of 400^3 elements, and 3680.54 for four GPUs (2309.08). Following the previous approach, the runtime was measured with and without communication (Table 9). Even with the communication, the speedup is close to the optimum. For four GPUs, speedups of 3.65 and 3.93 can be reached for 400^3 and 800^3 elements. For eight GPUs, speedups of 5.18 and 7.55 are reached.

Conclusion
We have presented the implementation and experimental evaluation of a three-dimensional MapStencil skeleton. The implementation was tested with two example programs: 1) an LBM implementation and 2) a mean filter. The two examples complement each other for different application contexts: the LBM has a rather complex user function with a constant stencil radius, while the mean blur has a simple user function and a variable stencil radius. Both examples were tested in complex hardware environments equipped with multiple nodes, CPU cores, and accelerators.
For the LBM, a speedup of 116 for one node and 452 for eight nodes can be reached, confirming that for complex functions, the communication between nodes has only a minor influence on the speedup. In contrast, using multiple GPUs for complex functions does not provide the expected speedup. For a single GPU, merely a speedup of 88 can be reached. This speedup scales for two GPUs (161), but as the communication rises, the speedup does not scale according to the number of accelerators. Abstracting from the communication, the program scales across multiple nodes.
In contrast, running the mean filter example shows that non-diverging programs do not benefit from simultaneous multithreading on CPUs, having a speedup of 180 for four nodes. A GPU, in turn, can reach a speedup of 747, benefiting from warps of threads that run the same instruction in parallel. The communication between GPUs decreases the speedup from 7.7 to 5.4, taking approximately 30% of the runtime. When increasing the stencil radius, the speedup improves, as the initialization of the communication and the required synchronization are more time-consuming than the additional border data that has to be communicated. This finding is confirmed for two different types of GPUs, an A100 partition and a GeForce RTX 2080 Ti partition.
Overall, it is shown that stencil operations are especially relevant to include in high-level frameworks, as the communication between nodes and accelerators is highly complex and the implementation is not feasible for inexperienced programmers. The speedup achieved proves that high-level frameworks can provide the means to solve this problem.

Fig. 2 :
Fig. 2: Runtime comparison of a multi-GPU Muesli program and native (CUDA) implementation of the LBM on a single-node of the gpu2080 partition.

Fig. 3 :
Fig. 3: Runtimes of the LBM Muesli program on multiple nodes and multiple GPUs on the gpu2080 partition.

Fig. 5 :
Fig. 5: Runtimes of the Blur Muesli program on the gpu2080 partition with similar hardware.

Table 1 :
Table 1 lists the hardware specification of the partitions used: two GPU partitions and two CPU partitions. To provide meaningful results, all parallel programs are executed ten times. Runtimes are measured using MPI_Wtime. Overview of used hardware.

Table 5 :
Runtimes in seconds and speedups of CPU program for the mean filter with stencil radius 2 on the gpu2080 partition.

Table 6 :
Speedup for the parallel implementation of the mean blur for a single node of the gpu2080 partition.

Table 7 :
Speedup for the parallel implementation of the mean blur for two nodes of the gpu2080 partition.
1 Runtimes without communication between nodes and GPUs.

Table 8 :
Speedup for the parallel implementation of the mean blur for multiple GPUs on the gpu2080 partition.

Table 9 :
Runtimes in seconds and speedups for a mean filter on the gpuhgx partition. 1 Runtime without communication between nodes and GPUs.