1 Introduction

The field of high performance computing (HPC) is growing as intricate algorithms require more processing power and the demand for handling larger datasets rises. Efficient parallel programs are necessary for evaluating massive datasets. Most HPC environments incorporate numerous nodes equipped with multiple central processing units (CPUs) and graphics processing units (GPUs). Creating programs that combine multiple nodes and accelerators requires an understanding of low-level frameworks such as implementations of the message passing interface (MPI) [1], OpenMP [2], and CUDA [3]. Moreover, writing a parallel program is error-prone and tedious; e.g., out-of-memory errors and invalid memory accesses are troublesome to identify even for skilled programmers. Furthermore, even correct programs need to be optimized, as the selection of memory spaces, the distribution of data, and the allocation of tasks to threads are design choices that significantly impact performance but require experience, which scientists usually lack.

Since experts in this field are hard to find, high-level frameworks are a convenient alternative that bypasses the tedious creation of parallel programs. High-level frameworks commonly abstract from the distribution of data, provide portable code for different hardware architectures, are adjustable to distinct accelerators, and require less maintenance from the end user. In 1989, Cole introduced algorithmic skeletons, which encapsulate recurring parallel and distributed computing patterns, as one of the most common approaches to abstract from low-level details [4]. Multiple libraries [5, 6], general frameworks [7, 8], and domain-specific languages [9] use the concept.

The paper contributes to the ongoing work by focusing on a particularly arduous operation, namely the three-dimensional stencil operation. Stencil operations calculate elements depending on neighboring elements within the data structure and therefore require communication between the computational units used. Those operations are essential for, e.g., computational fluid dynamics of gas or airflow. Current high-level approaches are not capable of efficiently updating data in a generalized way and dealing with three-dimensional data structures.

This paper initially presents related work, concentrating on high-level approaches that abstract from problem-specific details (Sect. 2). Section 3 outlines the Muesli library used, while Sect. 4 explains the additional implementation of the three-dimensional skeleton and the examples used to measure the runtime. Section 5 evaluates the work, discussing our runtime experiments on various hardware set-ups, and compares our results to other frameworks. Finally, a summary of our research is given in Sect. 6.

2 Related Work

Ongoing work discussing high-level approaches to three-dimensional stencil operations is diverse. The work can be distinguished by examining the types of accelerators used, the operations available, and the maintenance of the frameworks. Table 1 lists research regarding high-level approaches to parallelize stencil operations to the best of the authors’ knowledge.

Table 1 Overview of frameworks that generate parallel code for stencil operations

Most closely related in the area of high-level skeleton programming frameworks, SkePU3 targets multi-node and multi-GPU environments for most skeletons in combination with StarPU. However, for stencil operations (called MapOverlap), the required data exchange is missing for multi-node programs [7]. Celerity is a high-level approach focusing on parallel for-loops, including stencil operations [10]. FastFlow added GPU support but focuses on communication skeletons and misses a comparable stencil operation [8, 11]. Lift handles n-dimensional stencils on single GPUs but misses the combination of multiple nodes and accelerators [12]. SkelCL implements a MapOverlap skeleton for multiple GPUs but not for multiple nodes [13]. Moreover, it has not been adapted to hardware changes since 2016. The high-level block-structured framework WaLBerla discusses options for optimization in detail; unfortunately, its experiment section does not discuss the use of multiple GPUs [14]. EPSILOD tests prototypes for stencil operations but misses a generic stencil implementation [15]. ExaStencil provides a domain-specific language for stencil operations; therefore, it misses the option to parallelize pre- and post-processing steps [16]. DUNE provides the means to write parallel programs in C++ but merely for CPUs [17]. Partans and PSkel are frameworks for stencil operations but have not been adopted [18, 19]. Pereira et al. [20] have expanded OpenACC for stencil operations on single GPUs; however, their project is not publicly available. Palabos proved that frameworks generating GPU code (npFEM) can be included in the framework to solve the Lattice Boltzmann methods (LBMs) [21]; however, it is unclear whether this feature is still supported. Publications discussing a single method, e.g., the Helmholtz equation [22], focus on the algorithm and do not include accelerators like GPUs.

This work extends the mentioned work, as the presented stencil skeleton is generalizable to multiple applications, allows pre- and post-processing steps, and runs on multiple nodes and accelerators. The implementation of the skeleton is tested with two examples: (1) a version of an LBM and (2) a three-dimensional mean blur.

3 The Muenster Skeleton Library Muesli

Nowadays, most skeleton frameworks are implemented in C/C++ [6,7,8, 36,37,38], as it offers interoperability with multiple parallel frameworks such as OpenMP, MPI, CUDA, and OpenCL and is exceptionally performant. Notably, Python has recently gained attention for natural science applications as it provides an easy interface for writing packages in C/C++ [14, 27].

The library used is the Muenster Skeleton Library (Muesli) [5]. Muesli provides an object-oriented approach that offers one-, two-, and three-dimensional data structures (DA, DM, DC) with skeletons as member functions. The supported skeletons are, for example, multiple versions of Map and Zip (index and in-place variants), Fold, Gather, and, as discussed in this work, MapStencil. Internally, MPI, OpenMP, and CUDA are used, which enables simultaneous parallelism on multiple nodes, CPUs, and GPUs. The library can be included with a simple include statement #include<muesli.h>. For writing a parallel program, Muesli provides methods to state, among others, the number of processes and GPUs used. Apart from that, Muesli abstracts from parallel programming details by internally distributing the data structures on the available computational units, choosing the number of threads started in the corresponding low-level framework, and copying data to the correct memory spaces. This abstraction also reduces errors commonly made by inexperienced programmers, such as race conditions and inefficient data distribution.

Listing 1 shows a simple program calculating the scalar product of the distributed arrays a and b in Muesli. In line 8, a distributed array of size three with a default value of 2 is created. The skeleton calls in lines 9–11 show that skeletons take a user function as an argument, which can either be a C++ function or a C++ functor. For the index variant of map, Muesli applies the argument function to each element of the data structure (here a distributed array (DA)) and its index (line 9). For the zip skeleton, the second required data structure is passed as an argument. Lastly, lines 9 and 11 show that the same function can be used in different contexts, firstly for calculating the sum of the index and the value and secondly as a reduction operator.

Listing 1 A simple Muesli program computing the scalar product of two distributed arrays
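Since Listing 1 is only rendered as an image in this version, the following minimal sketch approximates it. The general structure (initialization, a distributed array of size three with default value 2, mapIndex, zip, and fold calls) follows the description above, but the exact Muesli function and method names used here (initSkeletons, mapIndexInPlace, zipInPlace, fold) are assumptions and may differ from the actual API.

#include <muesli.h>

// Hypothetical user functions; per the text, plain C++ functions are accepted.
int sum(int a, int b) { return a + b; }   // adds value and index in the map,
                                          // and two partial results in the fold
int product(int a, int b) { return a * b; }

int main(int argc, char** argv) {
  msl::initSkeletons(argc, argv);         // assumed initialization call
  msl::DA<int> a(3, 2);                   // distributed array, size 3, default value 2
  msl::DA<int> b(3, 2);
  a.mapIndexInPlace(sum);                 // index variant of map: element + its index
  a.zipInPlace(b, product);               // element-wise combination with b
  int scalarProduct = a.fold(sum);        // reduction with the same sum function
  msl::terminateSkeletons();
  return 0;
}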

4 Three-Dimensional Stencil Operations

Fig. 1 Exemplary two-dimensional stencil operation

Stencil operations are map operations that additionally require reading the surrounding elements of each considered element. Figure 1 displays a two-dimensional stencil with radius one. The peculiarity regarding stencil operations on multiple nodes and accelerators is that each execution of the stencil operation requires updating elements that are shared between computational units. As communication of updated elements requires synchronization between the computational nodes, it decreases the opportunity for executing tasks in parallel within the program. Muesli abstracts from all communication between the computational nodes with a MapStencil skeleton.

4.1 Using the MapStencil Skeleton in Muesli

The usage of the three-dimensional mapStencil skeleton from the end user's perspective is shown in Listing 2. Firstly, a function to be executed on each element is defined (l. 1–10). The first argument of the function has to be of type PLCube (PaddedLocalCube), and the subsequent arguments must be integers for indexing the data structure. Most importantly, the class PLCube offers a getter function taking three index arguments, relieving the end user from index calculations (l. 8). The presented functor calculates the sum of all elements within a radius of two and divides the sum by the total number of elements, therefore implementing a mean filter. This functor can be applied to a distributed cube by calling the mapStencil skeleton as a member function (l. 13). The skeleton takes the functor as a template argument and requires a distributed cube of the same dimension holding the current data, the radius of the stencil, and the neutral value for border elements.

Listing 2 Applying the three-dimensional mapStencil skeleton with a mean-filter functor
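As Listing 2 is likewise rendered as an image, a minimal sketch of the described mean-filter functor and its application is given below. The accessor syntax cube(x, y, z), the omission of host/device qualifiers, and the exact mapStencil signature are assumptions based on the description above.

struct Mean {
  float operator()(const msl::PLCube<float>& cube, int x, int y, int z) const {
    const int r = 2;                             // stencil radius
    float sum = 0.0f;
    for (int dx = -r; dx <= r; ++dx)
      for (int dy = -r; dy <= r; ++dy)
        for (int dz = -r; dz <= r; ++dz)
          sum += cube(x + dx, y + dy, z + dz);   // getter hides all index calculations
    return sum / 125.0f;                         // (2r+1)^3 = 125 elements for r = 2
  }
};

// Applying the functor to a distributed cube (cf. l. 13 of Listing 2); the parameter
// order (current data, radius, neutral value) is taken from the text above:
// result.mapStencil<Mean>(current, 2, 0.0f);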

4.2 Implementation of the MapStencil Skeleton

Supplementing the previous section, implementation details of the mapStencil skeleton are described to facilitate the reverse engineering of stencil operations for other projects. The description is twofold: firstly, it discusses the PLCube class, which allows access to neighbor elements in the user function; secondly, it covers the kernel invocation.

Adding the MapStencil skeleton to the existing distributed cube (DC) class requires two additional member fields: a vector of PLCubes and a (maximal) supported stencil radius. As previously mentioned, the PLCube class allows the end user to abstract from the indexing of the data structure. To make access to the different memory spaces efficient, each computational unit has a separate PLCube storing merely the elements needed to calculate the assigned elements. This design choice makes the class flexible enough to be used for both CPUs and GPUs. To provide a light, minimal design, it contains only the following attributes (a condensed sketch follows the list):

  • int width, height, depth - the three dimensions of the data structure,

  • int stencilRadius - the radius of the stencil, required to calculate the overlapping elements,

  • int neutralValue - used when the index is outside of the data structure,

  • T* data, T* topPadding, T* bottomPadding - CPU or GPU pointers for the current data,

  • four integers to save the global indexes for the start and end of the main and padding data areas.
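The following condensed sketch illustrates how such a class can look. The field names, the partitioning along the depth axis, and the index arithmetic are assumptions; only the overall design (main partition plus top and bottom padding, neutral value outside the cube) follows the description above and below.

// Simplified, hypothetical sketch of a PLCube-like class; Muesli's actual field
// names, qualifiers, and partitioning scheme may differ.
template <typename T>
struct PLCubeSketch {
  int width, height, depth;               // global dimensions of the data structure
  int stencilRadius;                      // radius needed for the overlapping elements
  T neutralValue;                         // returned for indices outside the cube
  T *data, *topPadding, *bottomPadding;   // CPU or GPU pointers to the local partition
  int zStart, zEnd;                       // assumed global start/end of the main area
                                          // (stand-ins for the four boundary indexes)

  // Getter: neutral value outside the cube, otherwise the element from the
  // main partition or one of the padding areas.
  T operator()(int x, int y, int z) const {
    if (x < 0 || y < 0 || z < 0 || x >= width || y >= height || z >= depth)
      return neutralValue;
    const int plane = width * height;
    if (z < zStart)                       // plane lies in the top padding area
      return topPadding[(z - (zStart - stencilRadius)) * plane + y * width + x];
    if (z > zEnd)                         // plane lies in the bottom padding area
      return bottomPadding[(z - zEnd - 1) * plane + y * width + x];
    return data[(z - zStart) * plane + y * width + x];
  }
};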

Most importantly, a getter function is implemented, taking three integers as arguments and returning the suitable value. This is either the neutral value or the corresponding element from CPU or GPU memory. Assuming GPUs are used as accelerators, the skeleton updates the current data structure in case the data is not up to date (Listing 3, l. 6). Afterwards, it synchronizes the PLCubes inside one node and the data between multiple nodes (l. 7, l. 9). For each GPU used, the mapStencilKernelDC kernel is called, executing the functor on the appropriate part of the overall data structure. In any other case (multiple nodes and CPUs), it is only necessary to synchronize the nodes (l. 21) and call the functor with the corresponding arguments (l. 29).

Listing 3 Implementation of the mapStencil skeleton

4.3 Example Applications for Three-Dimensional Stencil Operations

Two examples are used to evaluate our implementation: an implementation of the LBM and a mean blur. The LBM is used for fluid simulations, e.g., the distribution of gas. It distinguishes between the collision and the streaming step, which alternate in continuous simulations [39, p. 61ff.]. In the streaming step, gas particles move from one cell to another. The fluid flow caused by the colliding particles is calculated in the collision step. The distribution function \(f_i(x, t)\) specifies for a cell x and a timestamp t how many particles move in the next step to neighbor i. Index 0 corresponds to the cell itself. \(f_i^*\) defines the distribution after the collision of the particles (see formula (2)). \(\Delta t\) is the period to be simulated.

$$\begin{aligned} f_i(x + c_i \Delta t, t + \Delta t):= f_i^*(x, t) \end{aligned}$$
(1)

For the collision step, the Bhatnagar–Gross–Krook operator is used. \(\tau \) is a constant defining the convergence of the simulation; thus, \(\tau \) influences the viscosity of the gas.

$$\begin{aligned} f_i^*(x, t):= f_i(x, t) - \frac{\Delta t}{\tau } \left( f_i(x, t) - f_i^{\text {eq}}(x, t) \right) . \end{aligned}$$
(2)

The equilibrium state is calculated by

$$\begin{aligned} f_i^{\text {eq}}(x, t):= w_i \rho \left( 1 + \frac{u \cdot c_i}{c_s^2} + \frac{(u \cdot c_i)^2}{2 c_s^4} - \frac{u \cdot u}{2 c_s^2} \right) , \end{aligned}$$
(3)

where \(w_i\) are the weights of the chosen grid and \(c_i\) is the position of the neighbor cell relative to the main cell. The constant \(c_s\) is the speed of sound of the model. The mass density \(\rho \) and the momentum density u are defined by

$$\begin{aligned} \rho (x, t) = \sum _{i} f_i(x, t), \hspace{1cm} \rho u (x, t):= \sum _{i} c_i f_i(x, t). \end{aligned}$$
(4)
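For illustration, the equilibrium distribution (3) translates into a few lines of C++. The sketch assumes lattice units, where \(c_s^2 = 1/3\) holds for the D3Q19 grid used below; the weight \(w_i\) and direction \(c_i\) are passed in and assumed to be defined elsewhere.

// Sketch of the equilibrium distribution (3) for one direction i.
constexpr float CS2 = 1.0f / 3.0f;   // squared lattice speed of sound for D3Q19

float equilibrium(float rho, const float u[3], const float c_i[3], float w_i) {
  float uc = u[0]*c_i[0] + u[1]*c_i[1] + u[2]*c_i[2];   // u · c_i
  float uu = u[0]*u[0] + u[1]*u[1] + u[2]*u[2];         // u · u
  return w_i * rho * (1.0f + uc / CS2
                      + (uc * uc) / (2.0f * CS2 * CS2)
                      - uu / (2.0f * CS2));
}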

For the implementation of the LBM, a D3Q19 grid was used, D being the number of dimensions and Q the number of neighbors. Both steps (collision and streaming) are combined in one mapStencil call. Noteworthy, the implementation has to consider that single cells can be marked as blocked, simulating objects that are barriers to the flow of gas, or as constantly distributing gas. Therefore, special cells are marked with Not a Number (NaN) values. To simulate this behavior without requiring additional storage, the handling of the floating point numbers is extended. According to the IEEE 754 standard, each floating point number that has a maximal exponent and a mantissa that is not equal to zero is considered Not a Number. The most significant bit of the mantissa of \(f_0\) is set so that the number is definitely understood as NaN. The remaining bits of the mantissa can then be used freely to store other data. In the code, bit masks and a struct with bit-fields are defined to access this information as easily as possible (Listing 4).

Listing 4 Bit masks and bit-field struct for accessing the flag bits stored inside a NaN
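As Listing 4 is rendered as an image, the following self-contained sketch captures the NaN-tagging idea. The paper's code uses bit masks and a bit-field struct; this sketch uses masks and memcpy only, and the flag values as well as the obstacle flag name are assumptions (FLAG_KEEP_VELOCITY appears in the user function described below).

#include <cstdint>
#include <cstring>

constexpr uint32_t EXPONENT_MASK  = 0x7F800000u;  // all exponent bits of a 32-bit float
constexpr uint32_t MANTISSA_MASK  = 0x007FFFFFu;
constexpr uint32_t QUIET_NAN_BIT  = 0x00400000u;  // most significant mantissa bit
constexpr uint32_t FLAG_OBSTACLE      = 0x1u;     // hypothetical flag values
constexpr uint32_t FLAG_KEEP_VELOCITY = 0x2u;

// Pack flag bits into a float that is guaranteed to be interpreted as NaN.
inline float packFlags(uint32_t flags) {
  uint32_t bits = EXPONENT_MASK | QUIET_NAN_BIT | (flags & (QUIET_NAN_BIT - 1u));
  float f;
  std::memcpy(&f, &bits, sizeof f);
  return f;
}

// Extract the flag bits again; returns 0 for ordinary (non-NaN) values.
inline uint32_t unpackFlags(float f) {
  uint32_t bits;
  std::memcpy(&bits, &f, sizeof bits);
  const bool isNaN = (bits & EXPONENT_MASK) == EXPONENT_MASK && (bits & MANTISSA_MASK) != 0u;
  return isNaN ? (bits & (QUIET_NAN_BIT - 1u)) : 0u;
}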

The data stored for each cell is an array<float, Q>, Q being a constant for the neighbor cells and the cell itself. This type is abbreviated in the following listing as cell_t. Moreover, the three-dimensional vector operations are abstracted away (l. 28, 29, 31, 33). The user function starts by transforming the current value of the cell into its single float parts (l. 4). In case it is a cell that distributes gas (FLAG_KEEP_VELOCITY), the cell remains unchanged (l. 5–7). In any other case, the current amount of particles is read for all neighbor cells (l. 10–12). In the collision step, all cells which are obstacles reverse the airflow (l. 16–22). All other cells calculate the particles streaming in from the neighboring cells (l. 27–34). Noteworthy, the function contains multiple conditional statements impeding the parallelism on GPUs as the threads' execution diverges.

Listing 5 User function of the LBM combining the collision and streaming steps

The second example used is a mean filter commonly used for smoothing images (user function in Listing 2). Aside from images, filters are commonly used to pre-process data to reduce noise. This might also be applied to signal processing and other application contexts. This example application has the advantage that the stencil size (i.e. radius) can be varied, as depending on the context different stencil sizes are reasonable. Moreover, this program does not require if clauses which potentially slow down the program.

Table 2 Overview of used hardware

5 Evaluation

As the purpose of high-level parallel frameworks is to facilitate writing efficient parallel programs, the achievable speedup needs to be measured. For this purpose, the presented example programs, a mean filter and an LBM implementation, are executed on the HPC machine Palma II. Table 2 lists the hardware specifications of the partitions used: two GPU partitions and two CPU partitions. To provide meaningful results, all parallel programs are executed ten times. Runtimes are measured using

For testing the CPU parallelization, the zen2 partition is used, which is equipped with 12 nodes, each with one Zen2 (EPYC 7742) CPU with 64 cores. For running the sequential version, a single Skylake (Gold 6140) CPU starting a single thread is used (normal partition), as sequential programs cannot be run on the zen2 partition and have to run on the normal partition instead.

For the GPU programs, the gpu2080 and gpuhgx partitions are used. The gpu2080 partition has five nodes, each with 8 GeForce RTX 2080 Ti GPUs. The gpuhgx partition is equipped with two nodes, each with 8 A100 SXM GPUs. These partitions most importantly vary in the maximum memory and the number of cores. The gpu2080 partition has more nodes; however, each GPU has 11 GB of VRAM and 4352 cores, allowing less parallelization than more powerful GPUs (such as the A100) can provide. In contrast, the gpuhgx partition has 80 GB of VRAM per GPU, allowing bigger data structures to be processed, and more cores per GPU (6912) to speed up the program. Using different GPUs also contributes to proving the universal applicability of the MapStencil skeleton in Muesli.

Next to a sequential program, the Muesli programs solving the LBM are compared to a native implementation. The native program is written to the best of our abilities, which can be categorized as 2–3 years of experience in C/C++ programming with a focus on optimization.

5.1 LBM

The LBM is used to simulate fluid flow in three-dimensional space. Consequently, it is reasonable to run an experiment that does not execute the mapStencil skeleton only once but performs multiple iterations simulating multiple dispersion steps. 200 iterations were chosen for every experiment to compare runtimes between different data sizes.

Data sizes were chosen to utilize the available memory completely. For the LBM, each cell requires 76 bytes, as each cell stores 19 32-bit floating point numbers. For the calculation, one data structure to read from and one to write to are necessary. The largest theoretically possible side length of the cubic data structure for a given amount of memory (gb, in GB) can be calculated by:

$$\begin{aligned} d(\text {gb}):= \sqrt [3]{\text {gb} \cdot \frac{2^{30}}{2 \cdot 76}} \end{aligned}$$
(5)

This results in a maximum side length of 426 for the RTX 2080 Ti GPU and 826 for the A100 SXM. Although the CPU partition would support bigger data structures, the data size was not increased to the maximum as the speedup converged and the runtime of the sequential program became unreasonably high (approximately 10 h for calculating the LBM simulation for a data size of 960\(^3\)).
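Inserting the per-GPU memory sizes into (5) makes these numbers concrete:

$$\begin{aligned} d(11) = \sqrt [3]{\frac{11 \cdot 2^{30}}{152}} \approx 426.7, \qquad d(80) = \sqrt [3]{\frac{80 \cdot 2^{30}}{152}} \approx 826.8, \end{aligned}$$

which, rounded down, yields the side lengths of 426 and 826 stated above.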

Table 3 shows that for the zen2 CPU partition, a speedup of 116 can be reached with a single CPU. As the CPU has 64 cores, the speedup beyond 64 is caused by multithreading: in the optimal case, 128 threads are started on the 64 cores available. In this scenario, it should be considered that the calculations are easy to execute in parallel, as all data resides in the memory of the single CPU and is accessible to all threads. Using four nodes requires communicating the border values, thus introducing operations that cannot be executed in parallel. Although more elements need to be communicated with increasing data sizes, the share of operations that require communication decreases, thus allowing more parallelism. Accordingly, the best speedup with four nodes is achieved for the biggest tested data size (960\(^3\)). Eight nodes achieve a speedup of 452.

Table 3 Runtimes (seconds) and speedups for the parallel implementation of the LBM gas simulation for CPU programs on the zen2 partition

In contrast, the GeForce RTX 2080 Ti achieves a speedup of 88 with a single GPU. The maximum speedup which can be achieved depends on the GPU used. The GeForce RTX 2080 Ti has 68 streaming multiprocessors (SMs), each capable of executing 64 threads in parallel (4352 in total). Although additional threads may be scheduled, the hardware does not have physical cores to run them in parallel. Multiple factors limit the possible speedup; most importantly, threads with diverging execution branches cannot be executed in parallel, limiting the utilization of the 64 threads per SM.

The A100 has 108 SMs, allowing more threads to be executed in parallel. The maximum speedup achieved is roughly 353.

Fig. 2 Runtime comparison of a multi-GPU Muesli program and a native (CUDA) implementation of the LBM on a single node of the gpu2080 partition

In order to check whether a low-level program is significantly faster, a native implementation was written to be compared against the Muesli program. As can be seen in Fig. 2, the implementations are close to each other. In contrast to the native implementation, Muesli has a slight overhead; however, as it is very small, the differences are expected and negligible. Runtimes for bigger data structures are not included for one GPU and two GPUs to increase the readability of the graph. Also worth mentioning is that the native implementation has 544 lines of code without implementing MPI inter-node communication. In contrast, the Muesli program has 246 lines of code and can be run on multiple nodes. Moreover, it should be considered that writing native code is arduous, as each line requires fine-tuning.

Table 4 Speedup for the parallel implementation of the LBM gas simulation on the gpu2080 partition

Regarding the scalability on multiple GPUs, an extract of the achieved speedups can be seen in Table 4. A speedup of 1.7 compared to a single-GPU version can be reached for two GPUs, and for four GPUs, a speedup of 2.75 is achieved. To verify that the communication causes the overhead, the time for the update function was measured separately. Without the communication, speedups of 1.94 and 3.88 were achieved; the remaining difference to the optimum can be attributed to the synchronization of streams. Although the speedup is limited by the communication operations, using multiple GPUs also has the advantage of being able to process bigger data structures, since there is more memory available.

Table 5 Speedup for the parallel implementation of the LBM gas simulation for multiple nodes on the gpu2080 partition

Considering multiple levels of parallelism, the program can also run on multiple nodes equipped with multiple GPUs. Runtimes are depicted in Fig. 3 and Table 5. The columns in the table distinguish between runtimes including all communication between devices and nodes and runtimes only measuring the calculation on the accelerators. Most eye-catching in the figure, the runtimes for four nodes and four GPUs do not grow smoothly but show a switching pattern. This is caused by the data structure not always being split into complete slices: with 16 GPUs in total, a side length of 640 yields 640/16 = 40 complete slices per GPU, while 680 yields 42.5. The program can handle this; however, the communication effort rises.

Fig. 3 Runtimes of the LBM Muesli program on multiple nodes and multiple GPUs on the gpu2080 partition

Table 6 Runtimes in seconds and speedups of CPU program for the mean filter with stencil radius 2 on the zen2 partition

5.2 Mean Filter

Using the LBM implementation as an example program has multiple downsides. Firstly, one factor that has a major influence on the generalizability of the skeleton stays constant, namely the stencil radius. This factor influences the number of elements that need to be communicated between the computational nodes. Therefore, it is essential to vary this factor to analyze the performance of the skeleton. Moreover, the implementation of the LBM contains multiple conditional statements, i.e., branch operations, which reduce the possible parallelism. In contrast, the mean blur does not contain any if-branches.

Firstly, the CPU parallelism is discussed. Table 6 lists the speedups for one, four, and eight nodes compared to a sequential program. Notably, the optimal time is not achieved by using 128 threads but by using 64 threads: as the instructions are easily executed in parallel, scheduling threads that run while other threads are idle is no longer beneficial. For four nodes, a speedup of 184 can be reached. As can be seen, the speedup improves for rising data sizes, as the share of communication operations decreases. The same applies to programs using eight nodes, which reach a speedup of 348.

Table 7 Speedup for the parallel implementation of the mean blur for a single node of the gpu2080 partition
Table 8 Speedup for the parallel implementation of the mean blur for two nodes of the gpu2080 partition

Secondly, the speedup for the GeForce RTX 2080 Ti GPUs is measured. The program has fewer conditional statements, avoiding branch divergence. The speedup for a single GeForce RTX 2080 Ti GPU for a data size of 400\(^3\) is 747, significantly better than for the LBM implementation. Tables 7 and 8 list the runtimes and speedups for a small stencil radius of two (reading 125 elements per calculation) for one and two nodes with one or four GPUs each. Including communication operations, a speedup of 3.6 is reached. To verify that the gap is caused by the communication operations, the runtime spent on calculation is measured separately. The speedup depicted in the last column reaches 3.99, which is close to the optimum of 4. As communication operations require synchronization, the optimal speedup is hardly achievable. In contrast, scaling across nodes improves the speedup for a single GPU from 747 to 1253; comparing the runtimes without communication, the single-node runtime of 43.65 s nearly halves to 24.88 s on two nodes. Scaling from one node to two nodes is more efficient, as merely one overlapping data region needs to be communicated. The speedup for four GPUs behaves similarly.

Regarding bigger stencil radii, the runtimes for a stencil radius of 10 are also listed, which requires reading 9261 elements per calculation. This extreme example is chosen to observe the runtime and speedup when not all elements can be loaded into caches.
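The number of elements read per update follows directly from the stencil radius r:

$$\begin{aligned} (2r+1)^3: \quad r=2 \Rightarrow 5^3 = 125, \qquad r=10 \Rightarrow 21^3 = 9261, \qquad r=12 \Rightarrow 25^3 = 15625. \end{aligned}$$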

Fig. 4 Runtimes of the Blur Muesli program on the gpu2080 partition

Besides changing the hardware, the stencil radius was varied to discuss its impact on the performance. With an increasing stencil radius, the calculation of one element requires more read and write operations. For a stencil radius of two, the sum of 125 elements is calculated, and this number naturally grows cubically. Figure 4 depicts the influence on the runtime. Although the number of elements processed grows cubically, the runtime does not.

Table 9 Speedup for the parallel implementation of the mean blur for multiple GPUs on the gpu2080 partition

Moreover, it can be seen that the speedup improves with a growing share of calculation operations. This is listed in detail in Table 9. Similar to the CPU program, the speedup with and without communication is measured. For bigger stencil radii, the speedup comes closer to the optimum: although more elements need to be communicated between the nodes, the calculation requires more time and dominates the total runtime. Even for a stencil radius of 12 (processing 15,625 elements per thread), the operations are still very performant, as elements are automatically written to GPU caches, allowing efficient data access. As multiple combinations of hardware settings are tested, it is interesting to compare the same number of GPUs distributed over a different number of nodes. For example, one node with two GPUs, in contrast to two nodes with one GPU each, has the downside of having one less CPU but the advantage of allowing direct GPU-to-GPU communication. Two comparisons are displayed in Fig. 5. The first plot compares a one-node eight-GPU program to a two-node program with four GPUs each for different stencil radii. As can be seen, the two-node program is faster. This is caused by parallelizing the communication, as some operations are executed by the CPUs instead of communicating only between GPUs. In contrast, the second plot compares a single-node program with two GPUs to a two-node program with one GPU each. Here, the single-node program is slightly faster, as a communication operation between GPUs is faster than an MPI communication between two nodes.

Fig. 5 Runtimes of the Blur Muesli program on the gpu2080 partition with similar hardware

Table 10 Runtimes in seconds and speedups for a mean filter on the gpuhgx partition

The program was also tested on the A100 partition, allowing even bigger data sizes to be processed by a single GPU. In contrast to a single GeForce RTX 2080 Ti GPU (speedup 747), the A100 reaches a speedup of 1122.34 for a data size of 400\(^3\) elements, and 3680.54 for four GPUs (2309.08). Following the previous approach, the runtime was measured with and without communication (Table 10). Even with the communication, the speedup is close to the optimum: for four GPUs, speedups of 3.65 and 3.93 can be reached for 400\(^3\) and 800\(^3\) elements; for eight GPUs, speedups of 5.18 and 7.55 are reached.

5.3 Framework Comparison

Ongoing work on stencil calculations claims to exploit the parallelism of the hardware used. Therefore, it is essential to compare our implementation to other frameworks listed in the related work section. Firstly, we chose to compare our work to Palabos. Additionally, lbmpy was included in the comparison on a single GPU. Table 11 shows that Muesli is up to 4.2 times faster than the recent CPU version of Palabos. The GPU programs were compared on a single GPU of the gpu2080 partition. Muesli always outperforms lbmpy; however, the speedup decreases. Noteworthy, it was impossible to test bigger data sizes on the partition, as lbmpy consumes significantly more memory: for the lbmpy library, \(280^3\) was the maximum size possible, while the Muesli program could run cubes of size \(400^3\).

Table 11 Comparison of Palabos, lbmpy, and Muesli on the zen2 partition for CPU programs and on the gpu2080 partition for GPU programs. All GPU programs were run on a single GPU

As those frameworks cannot scale over multiple GPUs, the multi-GPU program was compared to an implementation using the Celerity runtime. This comparison is influenced by the compiler, as Muesli uses the NVCC compiler, which sometimes generates superior code compared to the Clang-based compiler of hipSYCL, which we used as the SYCL implementation for building the Celerity code. During the analysis of the program, it was discovered that the pure Celerity program performed slower than Muesli (Fig. 6), as the SYCL compiler did not unroll some for loops. Manually inserting a loop-unrolling directive inside the user function of the gas simulation results in a better performance of Celerity. As those results are mixed, a fair comparison would require more examples and a variety of hardware compositions to compare the efficiency of the communication operations.

Fig. 6 Runtime comparison of a multi-GPU Muesli program and a multi-GPU Celerity program of the LBM on the gpuhgx partition

6 Conclusion

We have presented the implementation and experimental evaluation of a three-dimensional MapStencil skeleton. The implementation was tested with two example programs: (1) an LBM implementation and (2) a mean filter. Both examples were tested in complex hardware environments equipped with multiple nodes, CPU cores, and accelerators.

For the LBM implementation, a speedup of 116 for one node and 452 for four nodes can be reached, confirming that for complex functions, the communication between nodes has only a minor influence on the speedup. In contrast, using multiple GPUs for complex functions does not provide the expected speedup. For a single GPU, merely a speedup of 88 can be reached. This speedup scales for two GPUs (161), but as the communication rises, the speedup does not scale according to the number of accelerators. This is caused by the complex instruction flow as the conditional statements cause thread divergence. Abstracting from the communication, the program scales across multiple nodes.

In contrast, running the mean filter example shows that non-diverging programs do not benefit from multithreading on CPUs, reaching a speedup of 180 for four nodes, whereas a GPU can reach a speedup of 747, benefiting from warps of threads that run the same instruction in parallel. The communication between GPUs decreases the speedup from 7.7 to 5.4, taking approximately 30% of the runtime. When increasing the stencil radius, the speedup improves, as the larger radius improves the GPU utilization and amortizes the GPU threads' latency. This finding is confirmed for two different types of GPUs, an A100 partition and a GeForce RTX 2080 Ti partition. Moreover, it is confirmed that inter-GPU communication is faster than inter-node communication.

Overall, it is shown that stencil operations are particularly relevant for inclusion in high-level frameworks, as they are commonly used across domains and their implementation is not feasible for inexperienced programmers in a reasonable amount of time, not least due to the complex communication operations. The speedup achieved proves that high-level frameworks can provide the means to produce parallel stencil programs without knowledge of parallel programming.