1 Introduction

Particle-based methods were developed to mitigate the problems linked to element distortion and mesh entanglement that occur in large deformation simulations of structures using finite element methods (FEM) [8, 24, 35, 41, 48, 49]. However, these methods are typically more computationally expensive than FEM. For instance, in Smoothed Particle Hydrodynamics, a neighbour search is performed at each time step [24, 35], and in the material point method (MPM), data are exchanged back and forth with the background grid, also at each time step [48, 49]. Therefore, serial particle-based codes can be slow, albeit easy to develop, which usually limits them to small-scale applications and impedes their use for large-scale simulations.

For the past decade or so, CPU clock speeds have plateaued, and transistors have become so small that the physical limit of how many can fit on a chip is being reached. Overall performance gains are now achieved by increasing the number of threads per CPU. Moreover, many GPUs are now specifically dedicated to heavy computing.

Algorithms for particle-based methods are mostly parallel in nature (although not entirely so, due to potential race conditions, especially in GPU implementations). Therefore, parallelising codes can bring substantial performance gains. This can be done in several ways, such as multithreading on one or more CPUs and utilising GPUs. However, many researchers who write scientific codes do not have formal computer science training. With time and practice, many of them have become adept at writing serial scientific codes.

Writing parallel codes, however, is not straightforward, although it is made relatively easy by libraries such as OpenMP or MPI. Both of these libraries are used in plenty of open-source codes available online, meaning that many examples and resources are available to learners [11, 20]. These techniques are nevertheless not without limitations. OpenMP is a shared memory library; thus, the code cannot be executed across different computers or nodes. Unlike OpenMP, MPI is a distributed memory library. Therefore, the code can be executed on different (connected) computers or nodes, but a substantial amount of code is necessary for the communication between CPUs. This creates a substantial overhead in both development time and performance. As the number of CPUs used increases, the time dedicated to communication increases and the speedup decreases.

Porting codes to GPUs is rewarding as the speedups can reach two orders of magnitude [55]. Moreover, powerful GPUs are often readily available to researchers, and have even become the hardware of choice in all the most powerful supercomputers. However, GPU programming is a very different paradigm from CPU programming. Indeed, due to the architectural differences between CPUs and GPUs, GPU codes need to be written in so-called ‘kernel’ languages such as CUDA and OpenCL.

CUDA and OpenCL are low-level languages that are hardware-specific. They are state-of-the-art in GPU computing but limit code portability. Furthermore, unlike CPU applications, efficient GPU codes require the user to have an intimate understanding of specific hardware, for example requiring programmers to specify precisely how and what memory is used. For CPU applications, this is usually done by the operating system. All this is usually out of reach for researchers who are not computer scientists. Recently, however, high-level libraries that do not require CUDA or OpenCL knowledge have emerged, making porting codes to GPUs easy.

This paper shows how one can easily port particle-based codes from CPU to GPU. Here, our serial C++ MPM code, Karamelo (Sect. 2), is used as a case study to showcase how this is done step by step. In Sect. 3, we discuss the types of libraries available and why we have chosen Kokkos [17]. In Sect. 4, we present the process in depth, sharing code snippets to help you port your code. Naturally, we also discuss the performance of our GPU code and compare it to the original CPU code (Sect. 5).

Note that even though this paper focuses on the porting of a C++ code to GPU, the same techniques apply to other languages since Kokkos supports Python and Fortran. Also note that the aim of this paper is neither to compare CUDA and Kokkos (see Edwards et al. [17] for details) nor to push the state-of-the-art in parallel computing. It is rather to give scientists an understanding of how to easily port codes for particle-based methods to GPUs. For clarity and ease of reading, the definitions of most of the parallel computing terms used are given in Table 1.

2 The material point method

MPM is a family of algorithms for multiphase continuum mechanical simulation first conceptualised by Sulsky et al. [50]. In the MPM, the solids are discretised into Lagrangian particles moving over a fixed Eulerian background grid on which the governing equations are solved. The MPM is, therefore, a hybrid Lagrangian–Eulerian method, which allows it to handle a great breadth of simulation problems. Compared to purely Lagrangian mesh-based methods, which can require prohibitively expensive frequent remeshing, the MPM excels at large deformation problems and facilitates more robust collision treatment [4]. The MPM’s recent rise in popularity is therefore no surprise; it has found applications in many fields, from geoengineering [1, 19, 22] to mechanical engineering [18, 30, 34, 46] and materials science [12, 31, 45]. It has even been adopted by Walt Disney Animation Studios for challenging animation tasks such as snow simulations [47].

Table 1 Definition of specific parallel computing terminology

Recent years have seen increased demands for faster and more efficient MPM codes. Many parallel MPM codes have been developed by a number of researchers, targeting hardware ranging from multi-threaded single CPUs to multi-GPU computer clusters. Unfortunately, these codes were either easy to modify but slow, or fast but difficult to understand and adapt to our research needs. The need for a portable, efficient, and easy-to-modify code led to the development of our own code, Karamelo [11].

2.1 The MPM algorithm

This section briefly presents the explicit dynamics MPM formulation. For more details, we refer to the recent book of Nguyen et al. [39].

The MPM is built on two main concepts, namely Lagrangian material points carrying physical information, and a background Eulerian grid used for the discretisation of continuous fields (e.g. the displacement field). These two concepts are treated in Sect. 2.1.1. A complete algorithm for the MPM is then provided in Sect. 2.1.2. Throughout this paper, subscripts \(_p\) and \(_I\) are used to refer to quantities at the particle and node positions, respectively. Superscripts \(^t\) and \(^{t+\varDelta t}\) refer to quantities at timesteps t and \(t+\varDelta t\), respectively.

2.1.1 Lagrangian particles and Eulerian grid

In the MPM, a continuum body is discretised by a finite set of \(n_p\) Lagrangian material points (or particles) that are tracked throughout the deformation process. The terms particle and material point will be used interchangeably throughout this paper. In the original MPM, the subregions represented by the particles are not explicitly defined. Only their mass and volume are tracked. However, the shape of these subregions is tracked in advanced MPM formulations such as the generalised interpolation material point (GIMP) method [3] and the convected particle domain interpolation (CPDI); see the works of Sadeghirad et al. [43, 44] and Nguyen et al. [37, 38]. Each material point has an associated position \({\textbf{x}}^t_p\; (p=1,2,\ldots ,n_p )\), mass \(m_p\), density \(\rho _p\), velocity \({\textbf{v}}_p\), deformation gradient \({\textbf{F}}_p\), Cauchy stress tensor \(\varvec{\sigma }_p\), temperature \(T_p\), and any other internal state variables necessary for the constitutive model. Collectively, these material points provide a Lagrangian description of the continuum body. Since each material point contains a fixed amount of mass at all times, mass conservation is automatically satisfied.

The original MPM developed by Sulsky is an updated Lagrangian scheme, hereafter called the Updated Lagrangian MPM (ULMPM). For this MPM, the space that the simulated body occupies and will occupy during deformation is discretised by a background grid (see Fig. 1) where the equation of balance of momentum is solved. On the other hand, in the Total Lagrangian MPM (TLMPM) presented in the work by de Vaucorbeil et al. [10], the background grid covers only the space occupied by the body in its reference configuration, as illustrated in Fig. 1. The use of a grid allows the method to be quite scalable by eliminating the need for directly computing particle-particle interactions. Indeed, the particles interact with other particles of the same body, with other solid bodies, or with fluids through a background Eulerian grid. Most often, for efficiency reasons, a fixed regular Cartesian grid is used throughout the simulation.

Fig. 1 The MPM discretisation: the space is discretised by a background grid which can be either a Cartesian grid or an unstructured grid (not shown), while a solid is discretised using particles. a The updated Lagrangian MPM grid covers the entire deformation space, whereas b the total Lagrangian MPM only covers the initial configuration. Thus, if there are two solids, there will be two grids

2.1.2 The basic explicit MPM algorithm

A typical explicit ULMPM computational cycle consists of four steps (see Fig. 2). The first step is to map the information (e.g. mass, momentum, and internal and external forces) from the particles to the grid (P2G), since the grid is reset at every cycle. Next, the discrete equations of momentum are solved on the grid nodes (Grid Updating). Then, the particles’ position, velocity, volume, density, deformation gradient, stresses and all relevant internal variables are updated (G2P). These last two steps are equivalent to the updated Lagrangian FEM [5]. Finally, the grid is reset to its original state. Due to this grid resetting, mesh distortion never occurs, making the MPM a good method for large deformation problems.
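The four steps can be sketched in a few lines of plain C++. This is only an illustrative sketch under strong simplifying assumptions, not Karamelo's implementation: the problem is one-dimensional, linear shape functions are used, only a uniform body force `g` acts (internal stress forces are omitted), and the names `Particle`, `Weights`, `linear_weights` and `mpm_cycle` are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Particle {
  double x;  // position
  double v;  // velocity
  double m;  // mass
};

struct Weights {
  std::size_t i;  // index of the node to the left of the particle
  double w0;      // linear shape function value at node i
  double w1;      // linear shape function value at node i + 1
};

// Linear shape function weights on a uniform 1D grid with node spacing h.
Weights linear_weights(double x, double h) {
  const double cell = std::floor(x / h);
  const double w1 = x / h - cell;
  return {static_cast<std::size_t>(cell), 1.0 - w1, w1};
}

// One explicit cycle. Particles must lie inside the grid [0, (n_nodes - 1) * h).
void mpm_cycle(std::vector<Particle>& particles, std::size_t n_nodes,
               double h, double g, double dt) {
  std::vector<double> mass(n_nodes, 0.0), mom(n_nodes, 0.0), force(n_nodes, 0.0);

  // (a) P2G: map mass, momentum and external force from particles to nodes.
  for (const Particle& p : particles) {
    const Weights w = linear_weights(p.x, h);
    assert(w.i + 1 < n_nodes);
    mass[w.i] += w.w0 * p.m;
    mass[w.i + 1] += w.w1 * p.m;
    mom[w.i] += w.w0 * p.m * p.v;
    mom[w.i + 1] += w.w1 * p.m * p.v;
    force[w.i] += w.w0 * p.m * g;
    force[w.i + 1] += w.w1 * p.m * g;
  }

  // (b) Grid update: solve the momentum equation on every node carrying mass.
  std::vector<double> v_old(n_nodes, 0.0), v_new(n_nodes, 0.0);
  for (std::size_t I = 0; I < n_nodes; ++I) {
    if (mass[I] > 0.0) {
      v_old[I] = mom[I] / mass[I];
      v_new[I] = (mom[I] + dt * force[I]) / mass[I];
    }
  }

  // (c) G2P: interpolate the nodal velocity increment and velocity back.
  for (Particle& p : particles) {
    const Weights w = linear_weights(p.x, h);
    p.v += w.w0 * (v_new[w.i] - v_old[w.i]) + w.w1 * (v_new[w.i + 1] - v_old[w.i + 1]);
    p.x += dt * (w.w0 * v_new[w.i] + w.w1 * v_new[w.i + 1]);
  }

  // (d) Grid reset: the nodal arrays are local here and are simply discarded;
  // a persistent grid would be zeroed at this point.
}
```

Calling `mpm_cycle` repeatedly advances the particles in time; because the nodal arrays are rebuilt from the particles at every call, no grid state is carried from one cycle to the next.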

Fig. 2 Material point method: a computational cycle consists of four steps: a P2G (Particle to Grid) in which information is mapped from particles to nodes, b Grid Updating in which momentum equations are solved for the nodes, c G2P (Grid to Particles) where the updated nodal values are mapped back to the particles to update their positions and velocities and d Grid resetting where the grid is reset. The operations in dashed boxes are not present in the ULFEM

The flowchart of the TLMPM is quite similar to that of the ULMPM except that the first Piola–Kirchhoff stress tensor is used in the internal force vector, and the spatial derivatives are performed with respect to the original configuration [10], not the current (deformed) one.

2.2 Karamelo

Karamelo is an open-source C++ MPM library developed by de Vaucorbeil et al. [11]. Karamelo’s key design philosophy is to be portable and easy to modify while still being competitively fast. To this end, the structure of Karamelo is based on the Large-scale Atomic/Molecular Massively Parallel Simulator (LAMMPS) program [52]; in particular, Karamelo adopts LAMMPS’s ‘styles’, a design pattern that yields easy-to-use, modular classes that can be swapped in and out, facilitating customisation of almost any part of Karamelo. Karamelo previously supported only multiple CPU parallelisation, and the aim of this paper is to show how we have ported this code to GPUs. The full code is available at: https://github.com/adevaucorbeil/karamelo/.

2.3 The particle to grid (P2G) problem

Whatever the type of grid used, the number of nodes that will interact with a given particle p is geometrically limited by the support of the shape functions. For instance, for a 2D problem using linear shape functions and a simple Cartesian background grid, one particle would interact with a maximum of 4 nodes. This allows particles to keep track of neighbouring nodes in fixed-size arrays. This, in turn, renders the G2P step trivially parallelisable using nested loops. However, the opposite is not true since the number of particles interacting with a given node is not known a priori and evolves over time. In particular, nodes in regions of compression tend to have significantly more neighbouring particles than those in tension. This is the crux of parallelising the P2G step: either nodes must maintain particle adjacency information in jagged arrays, incurring a significant bookkeeping and memory management overhead, or the outer loop of P2G must iterate over particles, incurring a synchronisation overhead to avoid data races between threads writing to the same node at the same time.
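The asymmetry between the two adjacency structures can be made concrete with a small sketch. It assumes a 1D grid with linear shape functions (two neighbouring nodes per particle instead of four), and the function names `particle_to_nodes` and `node_to_particles` are hypothetical, chosen for illustration only.

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Particle -> node adjacency: the size is known at compile time
// (2 nodes per particle for 1D linear shape functions).
std::vector<std::array<std::size_t, 2>> particle_to_nodes(
    const std::vector<double>& x, double h) {
  std::vector<std::array<std::size_t, 2>> adjacency;
  adjacency.reserve(x.size());
  for (double xp : x) {
    const std::size_t i = static_cast<std::size_t>(xp / h);  // left node
    adjacency.push_back({i, i + 1});
  }
  return adjacency;
}

// Node -> particle adjacency: a jagged structure whose row lengths depend
// on where the particles currently are, so it must be rebuilt as they move.
std::vector<std::vector<std::size_t>> node_to_particles(
    const std::vector<std::array<std::size_t, 2>>& p2n, std::size_t n_nodes) {
  std::vector<std::vector<std::size_t>> adjacency(n_nodes);
  for (std::size_t p = 0; p < p2n.size(); ++p) {
    for (std::size_t I : p2n[p]) {
      adjacency[I].push_back(p);
    }
  }
  return adjacency;
}
```

With particles at 0.1, 0.2, 0.3 and 2.5 on a grid of unit spacing, every particle has exactly two neighbouring nodes, whereas nodes 0 and 1 each have three neighbouring particles and nodes 2 and 3 only one: the row lengths of the node list follow the local particle density.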

3 GPU APIs

In the first published GPU implementation of MPM, Dong et al. [15] used the CUDA application programming interface (API) to implement a single GPU MPM code. Potential race conditions during the P2G step (see Sect. 2.3) were avoided by adding an extra step in which particle-node associations are calculated on CPU, and then looping over each node’s neighbours sequentially. This method is very expensive for large deformation scenarios, since every time a particle moves between cells, the associations change, in turn necessitating expensive GPU-CPU-GPU deep-copies. To avoid this complexity and obtain a more balanced workload, Gao et al. [21] directly used the single CPU parallel MPM algorithm without significant modification, using atomics to address the P2G race condition. Dong and Grabe [14] implemented a multiple GPU MPM using domain decomposition and MPI for the data exchange between the different GPUs while keeping the particle-node association list. Later, Dong et al. [16] used a single root complex to speed up the communication between GPUs. Finally, the most powerful massively parallel MPM code at the time of writing is Claymore, which uses an advanced algorithm for workload balancing; unlike naive subdomain decomposition, Claymore dynamically distributes particles approximately evenly between devices [56]. In the absence of large deformation, this is essentially equivalent to optimised domain decomposition in the initial reference frame, but Claymore has strategies to handle drastic topological change too.

However, the problem with Claymore and most of the efficient open-source MPM codes is that they are written in such a way that it is difficult for people who are not professional programmers to understand and therefore modify them. This is unlike Karamelo, which has been developed as a flexible and easy-to-modify code.

The platform dependence of current MPM codes raises a number of key issues, the most obvious of which is portability. Although arguably superior in terms of speed, CUDA-MPI implementations are restricted to running on CUDA compatible GPUs. Additionally, portability does not just concern hardware vendors; it also concerns time: features like warp-level CUDA intrinsics are relatively modern, so using them breaks backwards compatibility. By entangling parallelisation hardware logic with algorithmic design, forward compatibility and code maintainability are also hindered: if hardware improves and new parallelisation features become available, updating the code to use the new features can be very tedious. Lastly, although a lesser concern, involving low-level code also impacts code readability and accessibility.

These observations motivate the development of a hardware agnostic, massively parallel MPM library. Parallelisation logic should be abstracted away so that backend changes are completely independent of algorithmic code changes. As performance is still a priority, the code should be able to self-optimise depending on the platform on which it is run. Lastly, the code should also be written in such a way that modifications can easily be made, and MPM variations and extensions can easily be added. Such a library could be invaluable to future users and researchers, who would be able to code and run efficient MPM programmes on any hardware.

3.1 General Comparison

GPU programming is still largely non-standardised. Porting between different APIs is often quite involved and time consuming, since GPU APIs are often intertwined with types, structures and program flow, possibly even requiring different kernel languages. It is therefore important to make a well-informed decision when selecting a GPU API. Fortunately, a fair number of papers compare GPU APIs. Most such literature tends to focus on comparison of API performance and programmer productivity. Hoshino et al. [26] compared the performance of CUDA and OpenACC, finding that OpenACC is typically twice as slow as CUDA. However, in certain scenarios and with careful tuning, up to 98% efficiency can be achieved. Memeti et al. [36] compared the performance and programmer productivity of OpenCL, OpenACC and CUDA, finding that programming with OpenCL takes significantly more effort than CUDA, which in turn takes significantly more effort than OpenACC. However, OpenCL and CUDA yield significantly faster codes than OpenACC. These results all agree with Li et al. [32], who found OpenACC to be 37% faster to program with but in some cases up to nine times slower to run than CUDA.

In our particular application, the criteria for API selection are somewhat unusual, with the aforementioned emphases on portability, extensibility and abstraction, albeit preferably with minimal sacrifice to performance. Most notably, although generally yielding the fastest programmes, CUDA code can only run on NVIDIA hardware, greatly limiting portability, especially considering that AMD is approaching NVIDIA in GPU market share [42]. OpenCL is generally portable but it suffers from poor ease-of-use, requiring significant management of non-abstracted low-level logic. OpenACC is portable, easy to work with, and has the added benefit that it does not require a separate kernel language. Although it is generally known to be slow, the performance of OpenACC can vary with application, so we shortlisted it for further qualitative testing in an MPM context.

Fig. 3 Snippets of codes showing how the P2G step would be coded using the three different GPU APIs compared here. a With OpenACC, one uses native arrays and decorates parallel loops using pragma directives. b With Julia, one uses specific functions and macros and launches kernels using tags. c With Kokkos, one uses managed data structures and the exposed parallel_for function

Another API we shortlisted was JuliaGPU. Julia is a dynamically typed language designed for high-performance technical computing; it has similar syntax to Python but targets runtime performance on par with C and C++ [7]. Besard et al. extended this philosophy to GPU programming by adding CUDA support to Julia, and the JuliaGPU organisation now facilitates NVIDIA, AMD and Intel GPU programming in Julia all under one API [6, 40]. The JuliaGPU API is extremely terse, in most cases only requiring the decoration of existing serial code, similar to OpenACC. However, one drawback for our application is that, since Julia is a different programming language, porting Karamelo to Julia could take longer than with a C++ API.

The last but potentially most promising API that we shortlisted was Kokkos [17]. The Kokkos ecosystem is a set of C++ libraries and tools that wrap other CPU and GPU APIs, notably including OpenMP on CPU, and CUDA and OpenMPTarget on GPU. This abstraction pattern with swappable backends independent of user code means that Kokkos is in theory limitlessly portable (although at the time of writing, Kokkos backend support is still incomplete for a number of hardware and operating system combinations). Kokkos also aims to be highly efficient, dynamically and automatically making low-level optimisations including memory layout and iteration pattern decisions. Lastly, the Kokkos libraries also include numerous useful algorithms and data structures that are both CPU and GPU compatible.

3.2 Test implementations

Since the performance of GPU APIs varies depending on application, for the most accurate comparison it is beneficial to implement and benchmark a context-specific test code. To this end, we ported the 88-line, two-dimensional MPM program written by Hu [27] in the Taichi programming language to Julia, OpenACC and Kokkos.

These three APIs differ significantly in how parallelisation is programmed. To illustrate this, parallelisations of the P2G step are given in Fig. 3. OpenACC is integrated with native arrays and some standard template library (STL) containers, and works using pragma directives preceding standard C++ for loops. Kokkos introduces a number of managed data structures, most notably the View array container, and exposes functions in its namespace, such as parallel_for, which take kernels as lambda arguments. Finally, in Julia the CUDA.jl package exposes functions and macros (note the dynamic typing), and one uses tags to launch kernels, in this case taking the form of predefined functions with additional access to information on the current thread and block.

A naive measure of simplicity is that of code size; line and character counts for the three ports are given in Table 2. Note that one reason why all ports are significantly longer than 88 lines is that Hu’s code relies on Taichi already having a singular value decomposition (SVD) function; this needed to be reimplemented in both Julia and C++ and is included in our counts. It may be counter-intuitive that the Julia code is significantly longer than the OpenACC and Kokkos codes, even with Julia being dynamically typed. This is in part due to GPU programming in Julia requiring boilerplate code to configure and launch GPU kernels; in a larger program, this would be less significant. However, it is also due to the fact that Julia requires parallelised functions to explicitly take all parameters as arguments, quickly leading to code bulk as kernels become increasingly complex; this is in contrast to OpenACC and Kokkos, where variables can simply be captured from outside the loops.

Table 2 A naive measure of code complexity: line and character counts of Hu’s MPM code re-written for OpenACC, Julia and Kokkos

The three APIs differ significantly across various aspects contributing to ease of use. Syntactically, OpenACC is the simplest, with most low-level decisions offloaded to the compiler. However, this was a significant source of frustration as it was often very difficult to determine exactly what the outcomes of those decisions were. Sometimes the compiler decides not to run decorated loops on GPU at all, and root causes are often buried in large quantities of meaningless compiler output. Many errors are not picked up at compile time, only at runtime, and even then often give obfuscated or incorrect error messages. Furthermore, different C++ compilers require different (and in many cases, numerous) compiler flags to have OpenACC offload to GPU properly.

Similarly, Julia was very quick to write but proved very difficult to debug. This was primarily due to four factors. Firstly, because Julia is dynamically typed, type information can be difficult to ascertain. Secondly, Julia uses lazy compilation at runtime, blurring the distinction between compilation errors and execution errors. Thirdly, the Julia compiler error messages are often meaningless with GPU code. Lastly, the available debugging tools are simply fewer and less advanced than those of C++.

Conversely, while Kokkos has a steeper initial learning curve, the libraries are ultimately very intuitive, and we found the experience of developing and debugging Kokkos code to be significantly smoother than both OpenACC and Julia. Kokkos compiler warnings tend to be very informative, and with C++ being strongly typed, many sources of errors can be picked up at compile time through Kokkos’ judicious usage of static assertions and C++’s Substitution Failure is Not an Error (SFINAE) language feature [53]. Kokkos works well with all GPU debugging tools such as CUDA-MEMCHECK and NVIDIA Nsight systems, and by switching the backend to serial CPU, standard C++ debuggers work too. Finally, although low-level decisions are abstracted away by default, almost everything can ultimately be explicitly specified where necessary. This is invaluable for debugging.

To compare the APIs’ performance, we ran and timed 5000 iterations of Hu’s MPM code with varying numbers of particles; the resultant run times are shown in Fig. 4a. The Kokkos and Julia implementations run almost equally fast, with the OpenACC code in most cases taking around twice as long. Notably, we also found OpenACC to have a significant initial overhead, being slower than even serial code for smaller numbers of particles.

Considering all the above factors, we have selected Kokkos as the most suitable GPU API for the conversion of Karamelo. Before converting the full Karamelo codebase, we experimented with optimisation of the minimal Kokkos MPM code to determine what properties are likely to have big impacts on performance. Figure 4b shows the results of these optimisations applied sequentially. Compared to the unoptimised code, minimising writes to shared memory was found to yield a speedup of 25%. Specifically, inside for loops with multiple calculation steps, instead of writing multiple times to the Views which reside in shared memory, it is faster to create a local variable, do all calculations with that local variable, then perform only one write to shared memory at the end. Conversely, no speedup was observed from minimising reads from shared memory. The original code had one View each of particle and node structs; we found that by splitting these into Views of each of the components (such as position, velocity and mass), a further 37% speedup could be obtained. Further splitting Views of two-dimensional vectors (such as the positions and velocities) into Views of doubles, one each for the \(x\) and \(y\) coordinates, gave a 16% speedup, and splitting Views of matrices into four Views of doubles yielded a 27% speedup. Part of this speedup can be attributed to atomics being significantly faster with primitives than with class types; we believe this is a consequence of hardware support for lock-free atomic updates of primitives, in contrast to atomic updates of class types, which require locking. However, splitting Views does appear to increase overhead slightly, making performance slightly worse for smaller problems. Lastly, using floats instead of doubles gave a speedup of 24%.
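The first two optimisations (a single write per loop iteration, and one array of primitives per component) can be illustrated independently of Kokkos. The following is a minimal sketch in plain C++, in which `std::vector` stands in for the device-resident arrays; the names `ParticleAoS`, `ParticlesSoA`, `to_soa` and `kinetic_energy` are hypothetical, and no performance claim is made for this CPU version.

```cpp
#include <cstddef>
#include <vector>

// Array-of-structs layout: one struct per particle.
struct ParticleAoS {
  double x, y, vx, vy, m;
};

// Struct-of-arrays layout: one contiguous array of primitives per component.
struct ParticlesSoA {
  std::vector<double> x, y, vx, vy, m;
};

// Split an array of particle structs into per-component arrays.
ParticlesSoA to_soa(const std::vector<ParticleAoS>& aos) {
  ParticlesSoA soa;
  for (const ParticleAoS& p : aos) {
    soa.x.push_back(p.x);
    soa.y.push_back(p.y);
    soa.vx.push_back(p.vx);
    soa.vy.push_back(p.vy);
    soa.m.push_back(p.m);
  }
  return soa;
}

// Per-particle kinetic energy. All intermediate steps use a local variable,
// and the output array is written exactly once per particle.
void kinetic_energy(const ParticlesSoA& p, std::vector<double>& ke) {
  ke.resize(p.m.size());
  for (std::size_t i = 0; i < p.m.size(); ++i) {
    double e = 0.0;  // local accumulator
    e += 0.5 * p.m[i] * p.vx[i] * p.vx[i];
    e += 0.5 * p.m[i] * p.vy[i] * p.vy[i];
    ke[i] = e;  // single write to the output array
  }
}
```

In a GPU port, each `std::vector<double>` would become a separate array of primitives on the device, which is also the layout for which atomic updates are cheapest according to the measurements above.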

Fig. 4 a Comparison of GPU API runtimes for 5000 iterations and b the effect of optimisations on Kokkos runtime for 750 iterations of Hu’s MPM code

To better understand the distribution of execution cost within the algorithm, we also profiled each of the MPM substeps separately along with the GPU to CPU deep-copies necessitated by file system dumping; distributions for the runtimes are presented in Fig. 5 (note the logarithmic \(x\)-axis). Grid update and grid reset are both \(\mathcal {O}\left( n_{\textrm{nodes}}\right) \) and as expected take roughly the same time. To calculate the complexity of the other two substeps, it is necessary to consider what it means to be a given particle’s neighbouring node. When properties are projected from particles to nodes and vice versa, at some stage they are always scaled by the (symmetric) shape functions; in each dimension, let these shape functions have (integer) supports of \(\varDelta x\). Outside the support, the shape functions go to zero; thus, a particle-node pair that is sufficiently far apart does not contribute and is therefore not considered a neighbouring pair. With a uniform background grid, the number of neighbouring nodes of any given particle is thus

$$\begin{aligned} n_\text {neighbors}\left( p\right) \le \left( \varDelta x\right) ^d \quad \forall p \end{aligned}$$
(3.1)

where \(d\) is the spatial dimension, and the complexities of the G2P and P2G substeps are therefore both \(\mathcal {O}\left( n_\text {particles}\left( \varDelta x\right) ^d\right) \). Note that \(n_\text {nodes}\) and \(n_\text {particles}\) are technically independent, but in practice may be considered proportional. In our case we used a ratio of particles to nodes of 0.59 (48 particles, \(9\times 9\) grid; 192 particles, \(18\times 18\) grid; etc). While the G2P and P2G steps have the same computational complexity, P2G is notably more expensive due to needing atomics, which is reflected in its slower runtime (see Fig. 5a). Lastly, deep-copies are extremely expensive, generally being between three and four orders of magnitude slower than all four substeps (see Fig. 5b). It is therefore desirable to keep data on the GPU for as long as possible, keeping deep-copies (and in turn dumping) to a minimum.
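As a quick numerical check of these quantities, take the example of Sect. 2.3, i.e. linear shape functions in two dimensions, for which \(\varDelta x = 2\) and \(d = 2\) (this choice is an assumption made purely for illustration and is not necessarily the shape function used in the benchmark). The neighbour bound and the quoted ratio then evaluate to

$$\begin{aligned} \left( \varDelta x\right) ^d = 2^2 = 4, \qquad \frac{n_\text {particles}}{n_\text {nodes}} = \frac{48}{9\times 9} = \frac{192}{18\times 18} \approx 0.59 \end{aligned}$$

so that, under this assumption, each of the P2G and G2P substeps visits at most four nodes per particle, and doubling the grid resolution in each direction quadruples both the particle and the node counts.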

Fig. 5 Kernel density estimates of the runtime distributions using 200 samples. a Comparison of the MPM steps: grid update and reset take nearly the same time, and P2G is more expensive than G2P due to atomics. b Deep-copies are three to four orders of magnitude slower than any MPM step

4 Method

The Kokkos port of Karamelo was performed incrementally in a number of steps. As many of these changes were significant and structural, it was important to subdivide this process as much as possible, so that intermediate stages of porting could be compiled and regression tested.

In this section, we present the major Kokkos related aspects of porting, along with specific design changes and library re-implementations for GPU compatibility.

4.1 Parallel P2G

Algorithmic changes were necessary to make Karamelo’s P2G code GPU compatible. Karamelo originally used two neighbour lists, one for every particle’s neighbouring nodes, and one for every node’s neighbouring particles. Thus, in the P2G step, the code would loop over all grid nodes in the outer loop, and for each grid node, loop over all the node’s neighbouring particles in the inner loop, adding their contributions to the node under consideration. It is theoretically possible to keep this algorithm and simply parallelise the outer loop; this yields Algorithm 1 and is essentially equivalent to the approach of Dong and Grabe [14] as presented in Sect. 3. However, the particles are not uniformly distributed, so, in theory, there is no upper bound on the number of neighbouring particles that a node can have. This presents two problems. Firstly, the node neighbours adjacency lists do not form a rectangular two-dimensional array; Karamelo previously used a vector of vectors, and while a View of Views is theoretically achievable, the process is very fragile, requires explicit specification of memory space (for example CUDA space), which violates hardware agnosticism, and has the potential to be very slow due to requiring many small allocations on GPU [29]. Secondly, the number of particle neighbours that each node has may change at every timestep, and the population of these variable length neighbour lists is non-trivial to parallelise (recall that Dong and Grabe [14] resorted to performing these calculations on CPU and performing deep-copies to get them back onto GPU).

Instead, we changed Karamelo to use Algorithm 2, which uses the particles’ neighbours adjacency list. This is the method used by most modern parallel codes studied in Sect. 3. Note that due to the outer loop being over particles, atomics are now necessary to prevent multiple particles writing to the same node at the same time, but consequently, the inner loop can additionally be parallelised for free.

Algorithm 1 Node neighbours P2G

Algorithm 2 Particle neighbours P2G
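The particle neighbours approach, with atomics guarding the nodal writes, can be sketched on CPU with the C++ standard library. This is a stand-in under stated assumptions rather than the Kokkos implementation: threads from `std::thread` play the role of GPU threads, only the nodal mass is accumulated, the grid is one-dimensional with linear shape functions, and `atomic_add` and `p2g_mass` are hypothetical helper names.

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Atomic addition for doubles implemented with a compare-and-swap loop
// (std::atomic<double>::fetch_add is only available from C++20).
inline void atomic_add(std::atomic<double>& target, double value) {
  double expected = target.load(std::memory_order_relaxed);
  while (!target.compare_exchange_weak(expected, expected + value,
                                       std::memory_order_relaxed)) {
    // On failure, `expected` now holds the current value; retry with it.
  }
}

// Outer loop over particles, split between n_threads threads. Several
// particles may scatter to the same node at once, hence the atomic adds.
// Particles must lie inside the grid [0, (n_nodes - 1) * h).
std::vector<double> p2g_mass(const std::vector<double>& x,
                             const std::vector<double>& m,
                             std::size_t n_nodes, double h,
                             unsigned n_threads) {
  if (n_threads == 0) n_threads = 1;
  std::vector<std::atomic<double>> mass(n_nodes);
  for (auto& node_mass : mass) node_mass.store(0.0);

  auto worker = [&](std::size_t begin, std::size_t end) {
    for (std::size_t p = begin; p < end; ++p) {
      const std::size_t i = static_cast<std::size_t>(x[p] / h);
      const double w1 = x[p] / h - static_cast<double>(i);
      atomic_add(mass[i], (1.0 - w1) * m[p]);
      atomic_add(mass[i + 1], w1 * m[p]);
    }
  };

  const std::size_t chunk = (x.size() + n_threads - 1) / n_threads;
  std::vector<std::thread> threads;
  for (unsigned t = 0; t < n_threads; ++t) {
    const std::size_t begin = std::min(x.size(), t * chunk);
    const std::size_t end = std::min(x.size(), begin + chunk);
    threads.emplace_back(worker, begin, end);
  }
  for (auto& thread : threads) thread.join();

  std::vector<double> result(n_nodes);
  for (std::size_t I = 0; I < n_nodes; ++I) result[I] = mass[I].load();
  return result;
}
```

Replacing `atomic_add` by a plain `+=` on a `std::vector<double>` would make the result depend on thread timing whenever two particles share a node, which is precisely the race that the atomics remove.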

4.2 Kokkos views and loops

To allow Karamelo to run on GPU, the first step was to convert all GPU-accessible data structures to Kokkos Views. These were primarily members of the classes that store the node and particle information. At this stage, the loops had not yet been offloaded to GPU. Therefore, we initially specified all Views to use the CUDA unified virtual memory (UVM) space, which can be accessed by both CPU and GPU. This is easily achieved using an additional template argument specifying the memory space. However, this was a temporary change since CUDA UVM space is limited in capacity, slower than regular device spaces, and obviously not portable.

Broadly, the next step was to convert loops to Kokkos’s parallel syntax. This was not always trivial, since very often the code inside the loops was not immediately GPU compatible. As shown in Fig. 3, Kokkos exposes functions that run a given lambda expression over a range of indices. If no explicit specification is given, Kokkos automatically determines the execution space and the optimal iteration order (stride pattern). Converting existing loops required a number of small changes. Firstly, pointer indirection is problematic on GPU since the pointed-to memory generally resides on CPU. It is therefore necessary to cache the resulting values and pass them to GPU instead of the pointers. Additionally, in earlier C++ language standards, member variables used in a lambda are accessed implicitly through the ‘this’ pointer, so members must also be cached explicitly (C++17 allows the object itself to be captured by value, but at the time of writing, most GPU compilers are limited to C++14) [28]. To illustrate this, the conversion of a basic serial loop is given in Fig. 6. Note that because Kokkos Views use reference counting, the View copies are not deep-copies but merely new references.

Fig. 6 Example of the conversion of (a) a serial loop to (b) a parallelised loop that runs on GPU using Kokkos
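The member-caching step can be illustrated without a GPU. The sketch below is plain C++ and does not use Kokkos: parallel_for_sketch is a hypothetical stand-in that simply calls the functor for each index, and SharedArray is a hypothetical reference-counted handle standing in for a View. What it shows is the pattern only: members are first copied into locals, and the lambda captures those locals by value, so it never dereferences the ‘this’ pointer.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical stand-in for a parallel loop: calls f(i) for every index.
// A real backend would run these calls concurrently on the device.
template <class Functor>
void parallel_for_sketch(std::size_t n, Functor f) {
    for (std::size_t i = 0; i < n; ++i) f(i);
}

// Hypothetical reference-counted array handle: copying the handle does
// not copy the data, mimicking shallow View copies.
using SharedArray = std::shared_ptr<std::vector<double>>;

class Particles {
public:
    Particles(std::size_t n, double dt)
        : x_(std::make_shared<std::vector<double>>(n, 0.0)),
          v_(std::make_shared<std::vector<double>>(n, 1.0)),
          dt_(dt) {}

    void advance() {
        // Cache members in locals; the lambda then captures the locals
        // by value instead of reaching members through `this`.
        SharedArray x = x_;
        SharedArray v = v_;
        const double dt = dt_;
        assert(x->size() == v->size());
        parallel_for_sketch(x->size(), [=](std::size_t i) {
            (*x)[i] += dt * (*v)[i];
        });
    }

    double position(std::size_t i) const { return (*x_)[i]; }

private:
    SharedArray x_, v_;
    double dt_;
};
```

The local handles x and v refer to the same storage as the members, so the update is visible through the class after the loop, while the scalar dt is a genuine copy.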

Some loops in the Karamelo code also contain reductions, such as the summation of total kinetic energy or the calculation of the minimum value for the stable time step size. To this end, Kokkos exposes parallel_reduce functions that are very similar in syntax to parallel_for. The only loops that could not be offloaded to GPU were those for dumping, as writing to the file system must be done in serial from CPU. This necessitated deep-copying from GPU to CPU before each dump. Since memory allocations and deep-copies are extremely expensive, it is desirable to minimise them where possible, which we did as follows. Host mirrors were allocated once in the dumping class’s constructor, tying the lifetime of the mirrors to the lifetime of the program, and only for properties that were to be dumped. Deep-copying was then performed before each dump, also only for dumped properties. The details of the layout used are shown in Fig. 7. To increase efficiency further, once deep-copying is complete it is possible to dump asynchronously on CPU using the host copy while calculation continues on GPU using the device copy.

Finally, once all code had either been offloaded to GPU or had its necessary data deep-copied to CPU, Views were incrementally taken off the CUDA UVM space, with continuous regression testing.

Fig. 7 The layout adopted for efficient dumping classes: the memory allocation is performed only once, at the time of class construction, and deep-copies are performed only at the time of dumping (if needed)
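The allocate-once, copy-per-dump layout can be sketched in plain C++ as below. The class name DumpSketch and its members are hypothetical, an ordinary host vector stands in for the device data, std::copy stands in for the deep-copy, and a checksum stands in for writing to disk. The sketch also assumes that the size of the dumped array does not change after construction.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

class DumpSketch {
public:
    // The mirror is allocated exactly once, here in the constructor.
    explicit DumpSketch(const std::vector<double>& device_data)
        : device_(device_data), host_mirror_(device_data.size()) {}

    // Refresh the mirror (stand-in for a deep-copy), then "write" it;
    // writing is reduced to returning the sum of the mirrored values.
    double dump() {
        assert(device_.size() == host_mirror_.size());
        std::copy(device_.begin(), device_.end(), host_mirror_.begin());
        double checksum = 0.0;
        for (double value : host_mirror_) checksum += value;
        return checksum;
    }

    // Exposed only so that reuse of the same buffer can be observed.
    const double* mirror_address() const { return host_mirror_.data(); }

private:
    const std::vector<double>& device_;  // not owned
    std::vector<double> host_mirror_;    // reused by every dump
};
```

Each call to dump() reuses the same buffer, so the per-dump cost is the copy alone and no allocation takes place after construction.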

4.3 Linear algebra

The CPU version of Karamelo made use of the Eigen linear algebra library throughout, and Eigen is not natively GPU compatible [25]. Instead of Eigen, one can use Kokkos Kernels, which provides a full ecosystem for linear algebra. However, this library carries many functions that we do not use. Therefore, to limit dependencies and Karamelo’s footprint, we decided to implement our own Kokkos-compatible linear algebra library from scratch.

The bulk of this development involved creating a templated matrix class (an excerpt of the class is presented in Fig. 8). To maximise both usability and utility, this class makes use of a range of template metaprogramming techniques including parameter packs, universal references, static assertions, idiomatic SFINAE, type inspection in unevaluated contexts, and type aliasing. The class provides a range of constructors and assignment operators, accessors, overloaded arithmetic operators, various products, norms, element-wise operations, transposes, and debugging output.

To meet current Karamelo demands, we also implemented QR decomposition using Givens rotations, eigendecomposition using the QR algorithm with Wilkinson shifts, and SVD and matrix inversion using eigendecomposition [33, 57].

Fig. 8 Excerpt of the Matrix class, which is at the core of the GPU implementation. This class provides a range of constructors and assignment operators, accessors, overloaded arithmetic operators, various products, norms, element-wise operations, transposes, and debugging output. It makes use of a range of template metaprogramming techniques to maximise both usability and utility
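As an illustration of the QR decomposition by Givens rotations mentioned above, the following is a minimal sketch for 3×3 matrices in standard C++. It is not Karamelo's Matrix class: the Mat3 alias, the QR struct and the function names are assumptions made for the example. Each rotation zeroes one sub-diagonal entry of R and is accumulated into Q, so that A = QR with Q orthogonal and R upper triangular.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>

using Mat3 = std::array<std::array<double, 3>, 3>;

struct QR {
    Mat3 q;  // orthogonal factor
    Mat3 r;  // upper triangular factor
};

// Plain 3x3 matrix product, used to check that Q * R reproduces A.
Mat3 multiply(const Mat3& a, const Mat3& b) {
    Mat3 c{};
    for (std::size_t i = 0; i < 3; ++i)
        for (std::size_t j = 0; j < 3; ++j)
            for (std::size_t k = 0; k < 3; ++k) c[i][j] += a[i][k] * b[k][j];
    return c;
}

QR givens_qr(const Mat3& a) {
    QR out;
    out.r = a;
    for (std::size_t i = 0; i < 3; ++i)
        for (std::size_t j = 0; j < 3; ++j) out.q[i][j] = (i == j) ? 1.0 : 0.0;

    for (std::size_t col = 0; col < 3; ++col) {
        // Work upwards from the last row, zeroing r[row][col].
        for (std::size_t row = 2; row > col; --row) {
            const double x = out.r[row - 1][col];
            const double y = out.r[row][col];
            const double h = std::hypot(x, y);
            assert(std::isfinite(h));
            if (h == 0.0) continue;  // both entries are already zero
            const double c = x / h;
            const double s = y / h;
            // R <- G R, where G rotates rows (row - 1, row).
            for (std::size_t k = 0; k < 3; ++k) {
                const double upper = out.r[row - 1][k];
                const double lower = out.r[row][k];
                out.r[row - 1][k] = c * upper + s * lower;
                out.r[row][k] = -s * upper + c * lower;
            }
            // Q <- Q G^T, which updates columns (row - 1, row).
            for (std::size_t k = 0; k < 3; ++k) {
                const double left = out.q[k][row - 1];
                const double right = out.q[k][row];
                out.q[k][row - 1] = c * left + s * right;
                out.q[k][row] = -s * left + c * right;
            }
        }
    }
    return out;
}
```

Since every step replaces the pair (Q, R) by (Q Gᵀ, G R), the product QR is unchanged, which is what the check with multiply confirms up to rounding error.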

4.4 Lazy references

As was found in Sect. 3.2, writes to shared memory are expensive, and minimising writes in our Kokkos MPM test code decreased runtimes by 25%. It is not unusual to write to a variable multiple times within one iteration, for example if there are multiple calculation steps or nested loops. As shown in Fig. 9b, it is always possible to bring the number of writes down to at most one per iteration by creating a local copy, reading from and writing to that local copy, and only writing the final result back to shared memory. While this is technically sound, it is not particularly safe, as accidentally removing the final write leads to bugs that are undetectable at compile time and potentially difficult to find. Instead, it is best practice to bind the lifetime of this cache to the lifetime of an object, following the resource acquisition is initialisation (RAII) idiom [51]. To this end, we introduce a lazy reference class, with equivalent usage shown in Fig. 9c. The class itself is very simple: it stores both a local copy and a reference to the wrapped value, with the check and write logic moved into the destructor. Almost all operators are also trivially overloaded so that lazy references may be used like normal references.

Fig. 9 Lazy references can be used to safely minimise writes to shared memory. (a) Multiple writes slow down the GPU kernel as writing to shared memory is expensive. (b) Speed can be gained by using local variables, but one has to remember to write the result back to shared memory. (c) Lazy references can be used to automate this process
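A lazy reference of the kind described above can be sketched in a few lines of standard C++. This is an illustrative reconstruction, not the class from Karamelo: the name LazyReference, the dirty flag used for the check, and the small set of overloaded operators are all assumptions. The destructor performs the single write-back, so forgetting it is no longer possible.

```cpp
#include <cassert>
#include <vector>

template <class T>
class LazyReference {
public:
    explicit LazyReference(T& target) : target_(target), local_(target) {}
    LazyReference(const LazyReference&) = delete;
    LazyReference& operator=(const LazyReference&) = delete;

    // RAII: write back at most once, and only if the value was modified.
    ~LazyReference() {
        if (dirty_) target_ = local_;
    }

    // Reads go to the local copy.
    operator const T&() const { return local_; }

    // Writes only touch the local copy and mark it as modified.
    LazyReference& operator=(const T& value) {
        local_ = value;
        dirty_ = true;
        return *this;
    }
    LazyReference& operator+=(const T& value) {
        local_ += value;
        dirty_ = true;
        return *this;
    }
    LazyReference& operator*=(const T& value) {
        local_ *= value;
        dirty_ = true;
        return *this;
    }

private:
    T& target_;           // the wrapped (expensive to write) location
    T local_;             // cheap local copy
    bool dirty_ = false;  // set by any modifying operator
};

// Usage sketch: many updates, one write to `shared` when `total` dies.
inline double sum_into(double& shared, const std::vector<double>& terms) {
    const double before = shared;
    {
        LazyReference<double> total(shared);
        for (double term : terms) total += term;
        assert(shared == before);  // nothing has been written back yet
    }
    return shared;
}
```

In sum_into the wrapped variable is written exactly once, when total goes out of scope, regardless of how many terms are accumulated.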

4.5 Inheritance

One major obstacle in translating CPU code into GPU code is that of inheritance and polymorphism, specifically around C++’s use of late binding for dynamic dispatch. The problem is that the entries of virtual tables (VTables), which an object reaches through its virtual pointer (VPTR), point to functions that reside in the memory space in which the object was initialised, which is usually the host space. It is technically possible to create GPU-compatible class instances using placement new in a device context, but this process is fragile and brings further complications [54]. The simple solution is to not call virtual functions inside loops at all, and instead to move the loops into the virtual functions themselves, as shown in Fig. 10b. Similarly to Sect. 4.4, this is technically sound but semantically poor, as it requires consistent duplication of the control structure in every derived class; if, for instance, a programmer changes the control structure in one derived class but forgets to change it in another, the resulting bugs may be hard to find. Our solution is to use the curiously recurring template pattern (CRTP) [2]. As shown in Fig. 10c, when using the CRTP, an intermediate class is introduced that is templated with the derived class as an argument; this intermediate class overrides the virtual function with the shared control structure, but injects the derived class’s kernel statically using early binding. We believe we are the first to apply the CRTP to this particular problem.

Fig. 10 Overcoming a major obstacle in translating CPU codes to GPU. (a) Virtual functions cannot be used in GPU kernels. (b) One solution is to move the loops into the virtual functions themselves, but this is semantically poor. (c) Our solution is to use CRTP to emulate polymorphic parallel kernels
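The CRTP arrangement can be sketched in plain C++ as follows. The class names (Material, MaterialKernel, LinearElastic, CappedElastic) and the scalar stress update are hypothetical and chosen only for illustration, and the loop is an ordinary serial loop standing in for a parallel one. The virtual call happens once, outside the loop; inside the loop the derived kernel is called through a static cast and is therefore bound at compile time.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Abstract interface used by host code; one virtual call per update.
class Material {
public:
    virtual ~Material() = default;
    virtual void update(std::vector<double>& stress,
                        const std::vector<double>& strain) const = 0;
};

// Intermediate CRTP layer: holds the shared control structure once and
// injects the derived class's kernel statically.
template <class Derived>
class MaterialKernel : public Material {
public:
    void update(std::vector<double>& stress,
                const std::vector<double>& strain) const final {
        assert(stress.size() == strain.size());
        const Derived& self = static_cast<const Derived&>(*this);
        for (std::size_t i = 0; i < stress.size(); ++i)  // parallelisable
            stress[i] = self.kernel(strain[i]);          // no virtual call
    }
};

class LinearElastic : public MaterialKernel<LinearElastic> {
public:
    explicit LinearElastic(double modulus) : modulus_(modulus) {}
    double kernel(double strain) const { return modulus_ * strain; }

private:
    double modulus_;
};

class CappedElastic : public MaterialKernel<CappedElastic> {
public:
    CappedElastic(double modulus, double cap) : modulus_(modulus), cap_(cap) {}
    double kernel(double strain) const {
        return std::min(modulus_ * strain, cap_);
    }

private:
    double modulus_;
    double cap_;
};
```

Both derived classes share the loop written in MaterialKernel, so a change to the control structure is made in one place only, while callers still work through the polymorphic Material interface.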

Fig. 11 Example of expressions that Karamelo allows users to specify

Fig. 12 CRTP is used for expression operations as they require looping over all particles or nodes. (a) Typical pattern of the structure for expression application. (b) An example operation

4.6 User input expression evaluation

Karamelo allows users to specify expressions in its input files, for example as arguments for a fix. At runtime, these expressions may then be evaluated over particles or nodes. An example is given in Fig. 11, highlighting support for particle/node properties (such as coordinates) as operands, various operators, single and multivariate functions, and even composition of expressions. Previously, Karamelo would store expressions as strings and simultaneously parse and evaluate them as necessary. However, reparsing expressions is very inefficient, and furthermore, parsing on GPU is likely to be both complex and slow because it requires variable-length character arrays. Instead, it is best to parse expressions on CPU into some kind of GPU-compatible abstract syntax tree (AST), and then to evaluate the AST in parallel on GPU. We found the most suitable AST representation to be reverse Polish notation (RPN), using Dijkstra’s shunting-yard algorithm for parsing [13]. To facilitate parallel evaluation, our Expression class holds a two-dimensional buffer acting as registers; operations then write to and combine registers as dictated by the given RPN expression. To illustrate this, consider the expression from Fig. 11, which the shunting-yard algorithm converts into an equivalent RPN token sequence.

The application of all expression operations essentially has the same structure of looping over all particles/nodes and updating certain registers; this makes expressions a perfect candidate for the CRTP presented in Sect. 4.5. The CRTP function is shown in Fig. 12a, and the implementation of one operation is provided as an example in Fig. 12b. We also expose expression functions as styles so that users may easily add their own.
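The parse-once, evaluate-in-bulk idea can be sketched in standard C++ as below. This is a deliberately reduced illustration and not Karamelo's Expression class: operands are limited to the variables x and y and to single digits, only the four binary arithmetic operators and parentheses are supported, the input is assumed to be well formed, and the function names are hypothetical. Each register holds one value per particle, so the register stack is a two-dimensional buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

inline bool is_operator(char c) {
    return c == '+' || c == '-' || c == '*' || c == '/';
}

// All four operators are treated as left-associative.
inline int precedence(char op) { return (op == '+' || op == '-') ? 1 : 2; }

// Shunting-yard: convert an infix string into an RPN string.
std::string to_rpn(const std::string& infix) {
    std::string output;
    std::string ops;  // used as the operator stack
    for (char c : infix) {
        if (c == ' ') continue;
        if (is_operator(c)) {
            while (!ops.empty() && is_operator(ops.back()) &&
                   precedence(ops.back()) >= precedence(c)) {
                output += ops.back();
                ops.pop_back();
            }
            ops += c;
        } else if (c == '(') {
            ops += c;
        } else if (c == ')') {
            while (ops.back() != '(') {
                output += ops.back();
                ops.pop_back();
            }
            ops.pop_back();  // discard the matching '('
        } else {
            output += c;  // operand: 'x', 'y' or a digit
        }
    }
    while (!ops.empty()) {
        output += ops.back();
        ops.pop_back();
    }
    return output;
}

// Evaluate an RPN string for all particles at once.
std::vector<double> evaluate_rpn(const std::string& rpn,
                                 const std::vector<double>& x,
                                 const std::vector<double>& y) {
    const std::size_t n = x.size();
    std::vector<std::vector<double>> registers;  // [register][particle]
    for (char token : rpn) {
        if (is_operator(token)) {
            assert(registers.size() >= 2);
            const std::vector<double> rhs = registers.back();
            registers.pop_back();
            std::vector<double>& lhs = registers.back();
            for (std::size_t p = 0; p < n; ++p) {  // parallelisable loop
                switch (token) {
                    case '+': lhs[p] += rhs[p]; break;
                    case '-': lhs[p] -= rhs[p]; break;
                    case '*': lhs[p] *= rhs[p]; break;
                    default:  lhs[p] /= rhs[p]; break;
                }
            }
        } else if (token == 'x') {
            registers.push_back(x);
        } else if (token == 'y') {
            registers.push_back(y);
        } else {
            registers.emplace_back(n, static_cast<double>(token - '0'));
        }
    }
    assert(registers.size() == 1);
    return registers.back();
}
```

For instance, the made-up input "(x + 2) * y - 3" becomes the RPN string "x2+y*3-"; the string is parsed once, and every operator then reduces to a single loop over particles.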

5 Results

To quantify asymptotic performance improvements from parallelisation, we ran 1000 steps of two simulations on one CPU (Intel Xeon Platinum 8274), multiple CPUs, and one GPU (Tesla V100-SXM2-32GB) while varying the number of particles. Note that Kokkos supports AMD GPUs as well (since version 4.2); however, we did not have any available to run comparison tests. The two simulations considered are:

  • Two bouncing balls (Fig. 13a), which represent the most basic 2D simulation problem and use the MPM built-in contact algorithm.

  • A twisted column (Fig. 13b), a problem inspired by that introduced by Gil et al. [23]. The test consists of a column that is fixed at the bottom while an angular velocity \(\omega _0=2\pi \) rad/ms is applied to the top surface. The column is a \(100\,\textrm{mm} \times 10\,\textrm{mm} \times 10\,\textrm{mm}\) square cuboid. The angular velocity is applied by constraining the velocity of every node on the top surface of the column. Let n be the number of rotations we want to simulate and T the final time; we then define \(\omega =\omega _0 n/T\). The applied velocities as a function of time are therefore given by:

    $$\begin{aligned} v_x(t) = -\omega y(t), \quad v_y(t) = +\omega x(t) \end{aligned}$$
    (5.1)

    where x(t) and y(t) are the positions of the node along the x and y axes. The column is made of a thermo-elasto-plastic material whose flow stress is assumed to obey the Johnson-Cook plasticity model; the material parameters are omitted for brevity. Note that this is a three-dimensional thermo-mechanical simulation in which the diffusion of the heat generated by plastic work is modelled.
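The boundary condition of Eq. (5.1) amounts to two short functions, sketched here in standard C++. The function names are hypothetical, and the sketch only evaluates the formulas as stated in the text; how the resulting velocity is imposed on the grid is not shown.

```cpp
#include <array>
#include <cassert>

// omega = omega_0 * n / T, as defined in the text.
inline double angular_velocity(double omega_0, double n_rotations,
                               double final_time) {
    assert(final_time > 0.0);
    return omega_0 * n_rotations / final_time;
}

// Eq. (5.1): velocity prescribed on a top-surface node at (x, y).
inline std::array<double, 2> top_node_velocity(double x, double y,
                                               double omega) {
    return {-omega * y, omega * x};
}
```

The prescribed velocity is perpendicular to the node's position vector in the x-y plane and has magnitude ω times the distance from the axis, which corresponds to a rigid rotation of the top surface about the column axis.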

Fig. 13 The two test cases on which the GPU implementation has been tested: (a) two bouncing balls and (b) a twisted column

Note that for the initial simulations, we disabled dumping to isolate the cost of the MPM.

For the multiple CPU bouncing balls simulation, we used nine CPUs, subdividing each dimension of the domain in three. The results obtained on multiple CPUs and on one GPU are identical to those obtained on a single CPU and are not reported herein. For this case, running on nine CPUs was 3.34 times faster than on one CPU, and one GPU was 18.18 times faster than one CPU or 8.43 times faster than nine CPUs (Fig. 14). The modest speedup of roughly three with nine times the number of CPUs demonstrates the poor workload balancing of naive domain decomposition, since a number of subdomains (especially the top left and bottom right) essentially always remained idle; this illustrates one scenario where multiple process parallelisation offers very little gain over single process parallelisation. Additionally, for a small number of particles, one CPU was actually fastest, owing to its lower overheads.

Fig. 14 Results of the bouncing balls simulations: (a) runtime and (b) speed up of 1000 iterations versus the number of particles. A maximum speed up of 18 was achieved with a GPU compared to a single CPU

For the multiple CPU twisted column simulation, we used eight CPUs, subdividing each dimension in two. On complex problems, the speedup from GPU parallelisation is so significant that the gradient of the GPU curve is not visible on linear axes, so note that both axes in Fig. 15a are logarithmic. Asymptotically, eight CPUs were 7.03 times faster than one CPU, and one GPU was 86.57 times faster than one CPU and 12.31 times faster than eight CPUs. It is interesting to see that the GPU curve only exhibits linear scaling beyond a fairly large number of particles (around 40,000 in this case). Before that point, the GPU is not yet saturated; since it can launch more threads as necessary, increasing the number of particles has almost no effect on runtime.
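The difference in load balancing between the two test cases can be made explicit through the parallel efficiency. Writing E for the efficiency, S for the speedup over one CPU and p for the number of CPUs (symbols introduced here for illustration only), the speedups reported above give:

$$\begin{aligned} E = \frac{S}{p}, \qquad E_{\text{balls}} = \frac{3.34}{9} \approx 0.37, \qquad E_{\text{column}} = \frac{7.03}{8} \approx 0.88 \end{aligned}$$

The twisted column, whose particles occupy all subdomains, therefore uses its CPUs far more effectively than the bouncing balls. The same arithmetic relates the GPU figures for the twisted column: \(86.57/7.03 \approx 12.31\), which is the reported speedup of one GPU over eight CPUs.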

Finally, we compared the costs of dumping on CPU and GPU by running 1000 iterations of the twisted column simulation with a fixed number of 40,960 particles while varying the frequency of dumping. The results are shown in Fig. 16. We see that although GPU is by far the fastest without dumps, with more than five dumps per 100 steps it is actually slower than using eight CPUs. There are two reasons for the poor scalability of GPU dumping: firstly, deep-copies are necessary and costly, and secondly, the one GPU code dumps using only one CPU. Beyond a certain dumping frequency, it is therefore faster to do all calculations on CPU than to calculate on GPU and deep-copy. That said, dumping at such a high frequency is rather contrived; in most cases, occasional dumping suffices, and, especially with asynchronous dumping, the cost then goes almost to zero.

Fig. 15 Results of the twisted column simulations: (a) runtime and (b) speed up of 1000 iterations versus the number of particles. A maximum speed up of 84 was achieved with a GPU compared to a single CPU

Fig. 16 Cost of dumping: runtime of 1000 iterations of the twisted column simulation versus the frequency of dumping. Dumping scales poorly on the GPU because of the necessary deep-copies and the use of a single CPU to write to disc

6 Conclusion and future work

In this paper, we have shown an example of how one could accelerate a C++ scientific code using GPUs. It is important to emphasise that for someone who knows C++, this process is fairly easy. There is no new language to learn, but rather just some fundamental concepts linked to where data reside in memory and an understanding of how to code for the selected GPU API.

To accelerate our MPM code, we found that Kokkos was the most suitable API to use. Kokkos is developed by Sandia National Laboratories in the USA. It is used for large and popular projects such as LAMMPS (a molecular dynamics package). Therefore, we are confident that it is going to be maintained for many years to come.

We have discussed the general process and suggested best practice solutions for key aspects of CPU to GPU porting using Karamelo as an example. The intention is that readers can use this information to port their own codes. Our code running in parallel on one GPU was found to be up to 85 times faster than on CPU. We expect this to be a ballpark figure that could also be achieved in other codes.

What we have shown here is the first, easy step towards accelerating a serial code, but one can go further. A more major step forward would be to parallelise the code on multiple GPUs. The original Karamelo code already uses MPI for multiple CPU parallelisation. Kokkos is compatible with MPI, but this might not be the best way of doing it owing to the large amount of code required. Instead, another avenue would be to use existing abstracted libraries for partitioning the global address space; one such example is OpenSHMEM [9], which incidentally is also already compatible with Kokkos.

Beyond the APIs themselves, a lot of work could be done to the algorithms used to improve the code’s efficiency. In the case of MPM, current algorithms are still far from achieving optimal workload balancing in general; given the amount of activity in this area, we expect significant advancements in the near future. Going forward, keeping Karamelo up-to-date will continue to make cutting edge MPM technologies accessible and useful for researchers and industry alike.