Portable C++ Code that can Look and Feel Like Fortran Code with Yet Another Kernel Launcher (YAKL)

This paper introduces the Yet Another Kernel Launcher (YAKL) C++ portability library, which strives to enable user-level code with the look and feel of Fortran code. The intended audience includes both C++ developers and Fortran developers unfamiliar with C++. The C++ portability approach is briefly explained, YAKL’s main features are described, and code examples are given that demonstrate YAKL’s usage. YAKL fills a niche capability important particularly to scientific applications seeking to port Fortran code quickly to a portable C++ library. YAKL places heavy emphasis on simplicity, readability, and productivity, with performance efforts aimed mainly at Graphics Processing Units (GPUs). Central to YAKL’s ability to allow Fortran-like user-level code are three features: (1) a multi-dimensional Array class that allows Fortran behavior; (2) a limited library of Fortran intrinsic functions; and (3) an efficient pool allocator that transparently enables cheap, frequent allocations and deallocations of YAKL Arrays. While YAKL allows Fortran-style code, it also allows Arrays that exhibit C-like behavior, including row-major index ordering and lower bounds of “0”. YAKL currently supports CPUs, CPU threading, and Nvidia, AMD, and Intel GPUs.

An approach to portability that is growing in popularity over time is the use of portable C++ libraries such as Kokkos [4], RAJA [5], and SYCL / OneAPI / Data Parallel C++ [6]. These libraries rely on the ability of the C++ language to encapsulate code as an object, most often as a "functor" (a class that overloads the operator() operator). Often, this is achieved using a C++ "lambda" expression, which conveniently creates an anonymous functor in-place for the programmer, automatically capturing the required data. Over the last three years, a C++ portability library called Yet Another Kernel Launcher (YAKL) has been developed (and is still actively developed) largely to support the Energy Exascale Earth System Model's (E3SM's) efforts under the U.S. Department of Energy Exascale Computing Project. YAKL is a relatively simple and small C++ library for portability to CPUs, Nvidia GPUs, AMD GPUs, and Intel GPUs that specializes in allowing user-level code that maintains much of the look and feel of the dominant style of Fortran code currently used by the E3SM.
YAKL was first motivated by the absence of AMD GPU support in other C++ portability frameworks, a gap that was eventually remedied in those frameworks. Its current motivation is the ability to quickly port Fortran code to a portable C++ library and to support development practices that are familiar to Fortran domain science developers, such as multi-dimensional arrays, frequent allocation and deallocation of arrays, and use of simple reduction intrinsics such as sum and minval. YAKL currently supports CPU targets, CPU threading (through the OpenMP specification), Nvidia GPUs (through the CUDA language), AMD GPUs (through the HIP language), and Intel GPUs (through the SYCL specification). Importantly, SYCL can also produce code for Nvidia and AMD GPUs, and HIP can also produce code for Nvidia GPUs. YAKL has already been used by a number of published studies [7][8][9][10]. Particularly with a kernel whose algorithm is well-suited to floating point operation ("flop") capable hardware, YAKL was shown to be capable of achieving up to 80% of peak single precision flops per second (flop/s) on an Nvidia Tesla V100 GPU [7].
Fortran developers less familiar with C++ might find it useful to first read Sect. 3.1. A brief overview of C++ portability is given in Sect. 2. Then an overview of YAKL's main features is given in Sect. 3. The YAKL hardware targets are described in Sect. 4. Finally, concluding remarks and future work are detailed in Sect. 5.

A Brief Description of the C++ Portability Approach
It is important to note that C++ portability is not a language extension but rather just a library fully specified within the C++ language that uses additional hardware-specific languages such as CUDA, HIP, SYCL, and OpenMP to target specific hardware under the hood. Therefore, portable C++ libraries benefit fully from the many available mature C++ compilers.

Fig. 1 An example of encapsulating code as an object using a lambda expression and then passing it to a function to launch it with CPU threading using the OpenMP specification

Code as an Object
The core of C++ portability is the ability to express code as an object that can be passed to functions as parameters. This is most conveniently performed using a lambda expression, which creates an anonymous class object (meaning it has no class name), either copies by value whatever variables are needed or references them, and then wraps a section of code inside an overloaded operator() operator. The class object created by the lambda expression can then be passed as a parameter to a function that can then execute the code inside that class object on any desired hardware target. See Fig. 1 as an example. Instead of running the lambda inside a for loop with OpenMP threading, one could call it inside a CUDA kernel or a SYCL parallel dispatch to run on Nvidia GPUs and Intel GPUs, respectively.
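As a minimal sketch of this pattern (the launcher name launch_cpu_threaded and all variable names are hypothetical):

    #include <cstdio>

    // Launch any callable over n iterations using OpenMP CPU threading
    template <class F>
    void launch_cpu_threaded(int n, F const &code) {
      #pragma omp parallel for
      for (int i=0; i < n; i++) { code(i); }
    }

    int main() {
      int n = 1024;
      double *a = new double[n];
      double scale = 2.0;
      // The lambda captures "scale" and the pointer "a" by value,
      // wrapping the loop body in an anonymous functor
      auto code = [=] (int i) { a[i] = scale * i; };
      launch_cpu_threaded(n, code);   // pass the code object to the launcher
      printf("a[n-1] = %f\n", a[n-1]);
      delete [] a;
      return 0;
    }

Replacing launch_cpu_threaded with a CUDA kernel launch or SYCL parallel dispatch changes the hardware target without changing the encapsulated code.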

Shallow-Copy Data Structures and Copy-by-Value Lambda Expressions
Another common thread among C++ portability libraries is that they seek to allow developers to run on accelerated hardware that has its own distinct memory space. The complication of this is that for machines that do not page data automatically between host and device memory spaces, pointers and references to host data are invalid in device memory (and vice versa). Even when machines do page data, it is typically more efficient to move and manage the data manually.
Lambda expressions are typically written to capture data by reference. However, since host references are invalid in device memory, this is not acceptable for device execution. Therefore, in most C++ portability libraries, lambdas must capture data by value, meaning the lambda syntax is typically: [=] (parameters...) {...}. The exception to this is with SYCL's use of Buffer objects because, in SYCL, one obtains an appropriate accessor that is already valid in the desired memory space. Therefore, lambdas that use these accessors may pass them by reference.
When copying data structures by value into a lambda expression, it is common for those data structures to use what are called shallow-copy semantics, which means that the actual underlying data pointer is already valid on the device on which the kernel is launched, and only a small amount of the data structure's metadata is copied explicitly, while the pointer is used as-is. This way, large amounts of data are not copied every time a kernel is launched. This can, however, lead to confusion for developers used to Fortran assignment semantics, which are deep-copy semantics. In most C++ portability frameworks, to copy the underlying data, one must use an explicit "deep copy" routine rather than assigning to an object directly. This will be explained in further detail later. Therefore, keep in mind that in C++ portability libraries, assigning one array object to another is more akin to pointer assignment: it does not copy or duplicate the underlying data.

Some Subtleties of C++ Portability Libraries
A complication that is more subtle is that C++ lambda expressions only capture by value variables in local scope. Therefore, variables at the class scope, global scope, and namespace scope are still captured by reference inside the resulting class object. To avoid this behavior, one must pull all variables used inside the lambda expression's code that live in class, global, or namespace scope into local scope before creating the lambda expression. This can be conveniently done with auto &varname = ::varname or auto &varname = this->varname, though YAKL has its own syntax for this called YAKL_SCOPE, which is covered later.
Another subtlety in C++ portability libraries is calling member functions from your own class from parallel_for kernels. As a general rule, it is best practice to make all class methods that are called from device kernels static, meaning they belong to the class itself, not any particular instance of that class. The reason is that static methods have no this-> reference (which is generally only valid in host memory). One can capture the class by value in lambda expressions, making the this-> reference valid inside parallel_for kernels (meaning you can use class methods and data inside that parallel_for call). However, member functions called by parallel_for kernels still cannot use class member data or functions because those will still invoke this-> references that are only valid in host memory.
In short: (1) it is a best practice to make all class member functions that are called from device kernels static and to pass any required class data by parameter; and (2) it is a best practice to pull all class, global, and namespace scoped data into local scope before creating the lambda expression.

An Overview of YAKL's Main Features
YAKL is similar in spirit to other C++ portability libraries but fills a particular niche application of allowing Fortran-like behavior in the user-level code. Much, though not all, of YAKL's API is patterned after the Kokkos API. Notable simplifications in YAKL compared to other C++ portability libraries include allowing only two memory spaces (host and device), only allowing basic Array slicing of contiguous chunks and whole dimensions, and only supporting basic scalar reductions (minval, maxval, and sum), though there are other simplifications as well.
While the example below highlights YAKL's ability to enable user-level code that looks like Fortran, please note that YAKL allows C-style behavior as well.

An Example of Fortran Code Compared to Fortran-Style YAKL Code
It is helpful to start with an example of what a given snippet of Fortran code would look like once converted to YAKL. The examples in Figs. 2 and 3 compute the maximum stable time step for a Shallow-Water model on a Cartesian grid. Figure 2 shows the Fortran code example using OpenACC directives as an example of parallelization on GPUs. Figure 3 shows the corresponding Fortran-style YAKL code that does the same calculation. The using and typedef statements at the top of Fig. 3 would typically be placed in a header file somewhere and reused by all of the YAKL code so that the main code is more readable.
The features that allow a developer used to coding in Fortran to feel more comfortable in C++ are: (1) avoiding the need to reverse the indexing order, (2) avoiding the need to change to a zero-based indexing strategy, and (3) being able to allocate and deallocate arrays at any place in the program without worrying about performance penalties (enabled by the YAKL pool allocator). While this small example would not be too arduous to move to a C-style indexing strategy, managing many thousands of lines of code with arbitrary lower indexing bounds in the arrays and potentially complex integer arithmetic based on the chosen indexing strategy becomes a difficult task to manage.
As evident from this example, most of the code looks the same, particularly in the calculations themselves. There are some changes when moving to YAKL, however. First, the developer will need to become familiar with C++ function syntax, which is unavoidable when moving from Fortran to C++. Here, the intent(in) arrays are declared with the typedef'd realConst2d type. The size and epsilon Fortran intrinsic routines remain identical in syntax when using YAKL. Those can be used inside kernels as well. Since YAKL does not support in-kernel reductions, the intermediate values must first be stored in a new array dt2d and then reduced with a call to minval.
The starkest change between Fortran and Fortran-style YAKL, however, is how the looping is expressed. The syntax was made as minimal as possible when designing YAKL, but some level of change of this nature is unavoidable. The corresponding do loops are retained as comments in the YAKL code, making it clear what looping is implied by the parallel_for call. More information on the behavior of parallel_for is given in Sect. 3.2.1.
In brief, though, the Array allocation in Fig. 3 should be thought of as the equivalent of the Fortran code real, allocatable :: dt2d(:,:) followed by allocate(dt2d(nx,ny)). Also note that allocations like this in YAKL are very cheap because they are done using a pool allocator (see Sect. 3.9).
Finally, note in line 20 of Fig. 3 that the parallel_for call includes a string label for the kernel launch. While a label is not required, it is helpful when it comes to automatic timing of all of the kernels in the code as well as labeling kernel launches clearly in GPU profiling tools such as Nvidia's nvprof and Nsight tools. It also enables automated "printf" debugging of the code, dumping to one file per process every action that occurs in YAKL, including labeled calls to parallel_for.

parallel_for, Bounds, YAKL_LAMBDA, and fence
YAKL achieves parallel dispatch using the following possible function definitions:

    template <class F, int N>
    void parallel_for( char const *label , Bounds<N> bounds , F const &code );

    template <class F, int N>
    void parallel_for( Bounds<N> bounds , F const &code );

The string label is optional, though very highly recommended. When there are a lot of parallel_for calls, and it is hard to come up with meaningful names, the YAKL_AUTO_LABEL() macro function is available, which simply inserts the filename (with the path removed) augmented with the line number. This way, it is readily known where the call lives in the code.
The parallel_for functions are defined in a yakl::c:: namespace and a yakl::fortran:: namespace, where each namespace has that language's behavior. In the c namespace, if a scalar N is passed as the loop bounds, it is assumed to iterate over {0,1,...,N-1}; whereas in the fortran namespace, it is assumed to iterate over {1,2,...,N}.

The Bounds class also comes in the yakl::c:: and yakl::fortran:: namespaces, and it describes the looping implied in the parallel kernel launch. The Bounds class accepts an integer template parameter that determines how many tightly nested loops are being dispatched. Bounds in the c namespace default to a lower bound of zero, and bounds in the fortran namespace default to a lower bound of one. Each loop in the Bounds class constructor's parameters is described either by an integer, an initializer list of two values that describes the inclusive lower and upper bounds, or an initializer list of three values that describes the inclusive lower bound, the inclusive upper bound, and the stride. Negative strides are not currently supported, partially to protect the user from attempting to use parallel_for for work that depends on the order in which loop indices are processed (e.g., prefix sums and other general loop-carried dependencies). If a loop cannot be cast with a positive stride, then it contains a loop-carried dependency. Examples of specifying loop bounds are as follows:

    yakl::c::Bounds<2>(ny,nx);              // Outer loop from 0 to ny-1, inner loop from 0 to nx-1
    yakl::fortran::Bounds<1>(nx);           // Loop from 1 to nx
    yakl::c::Bounds<1>({-1,nx+1,2});        // for (int i=-1; i <= nx+1; i+=2)
    yakl::fortran::Bounds<1>({1-hs,nx+hs}); // do i = 1-hs , nx+hs

Finally, it is recommended to pass the code to the parallel_for function using a C++ lambda expression via the YAKL_LAMBDA macro. On the CPU, this maps simply to [=], meaning variables used are passed by value. While the CPU could technically pass data by reference, since it is not possible to pass data by reference on GPUs (as CPU references are not generally valid in GPU memory), it was deemed wise to use copy-by-value behavior even on the CPU. In the CUDA and HIP backends, YAKL_LAMBDA maps to [=] __host__ __device__.
When creating the lambda expression, the parameters passed to the lambda expression are the looping indices assigned by the parallel_for call in the hardware backend. For instance, if Bounds<3>(nz,ny,nx) is passed to the parallel_for function, then the lambda expression must accept exactly three parameters to accept indices for each of these loops:

    // do k = 1 , nz
    //   do j = 1 , ny
    //     do i = 1 , nx
    parallel_for( yakl::fortran::Bounds<3>(nz,ny,nx) ,
                  YAKL_LAMBDA (int k, int j, int i) { ... } );

All calls to parallel_for are asynchronous with respect to host code. However, the order of parallel_for calls on the device is respected, meaning that a subsequent parallel_for call will not start until all previous device work has been completed. To synchronize after any asynchronous call in YAKL, use the yakl::fence() routine, which synchronizes the host code with respect to all existing asynchronous work on the device. All calls to parallel_for are assumed to run on the device for which the YAKL code is targeted. YAKL can only target one device at a time.
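To tie these pieces together, here is a minimal sketch in the spirit of Fig. 3, assuming the real2d typedef (a two-dimensional Fortran-style device Array) from that figure; nx and ny are hypothetical dimension sizes:

    real2d a("a",nx,ny);                     // pool-allocated device Array
    parallel_for( "init_a" ,
                  yakl::fortran::Bounds<2>(ny,nx) ,
                  YAKL_LAMBDA (int j, int i) {
                    a(i,j) = i + j;          // column-major, one-based indexing
                  } );
    yakl::fence();   // block the host until the asynchronous kernel completes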
There is an optional LaunchConfig parameter to the parallel_for function. This object contains two template parameters: an integer vector length that determines the number of threads in a "block" in CUDA and HIP, and a boolean bit-for-bit flag that defaults to false, which determines whether the launched kernel should be run in serial on the CPU instead of on the GPU whenever the C Pre-Processor (CPP) macro YAKL_B4B is defined (there is more on this behavior in Sect. 3.11).

Handling Loops that are not Tightly Nested
"Tightly nested" loops as used here means: (1) all loops appear consecutively with no work in between and (2) inner loop bounds do not depend on outer loop indices. There are some common approaches to handling situations where these are not both true.
If there is work in between the loops, the two common approaches are:
1. Push that work down into inner loops and simply duplicate the processing of that line of code.
2. Perform that work before entering the tightly nested loops and store it to a temporary array that is used in the tightly nested loops.
There are also many cases where one loop's bounds will depend upon another. For instance, in many ocean models, not all vertical levels are active for every horizontal grid point. In these cases, the number of vertical levels depends upon the element index. To alleviate the situation, the typical practice is to have the inner loop over vertical levels iterate to the maximum number of vertical levels and place an if-statement inside the innermost loop.
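A minimal sketch of this masking pattern follows, where active_levels, field, ncol, and nlev_max are hypothetical names:

    parallel_for( "ocean_update" ,
                  yakl::fortran::Bounds<2>(ncol,nlev_max) ,
                  YAKL_LAMBDA (int icol, int k) {
                    if (k <= active_levels(icol)) {      // mask out inactive levels
                      field(icol,k) = 0.5 * field(icol,k);
                    }
                  } );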

Multi-dimensional Dynamically Allocated Array Classes
The next most important feature of the YAKL library is the dynamically allocated multi-dimensional Array class, which takes four template parameters: (1) data type, (2) number of dimensions, (3) memory space, and (4) style. The data type is templated and can be any type (e.g., float, int const, or bool). The memory space can be either yakl::memHost or yakl::memDevice. For all hardware targets with separate device memory spaces (i.e., most if not all GPU targets), host-space Array objects cannot be (portably) used on the device, and vice versa. The exception is when the CPP macro YAKL_MANAGED_MEMORY is specified, in which case yakl::memDevice Array objects can be accessed on the host. Finally, the style parameter can be either yakl::styleC or yakl::styleFortran. C-style Array objects have row-major index ordering (meaning the right-most index varies the fastest) and zero-based indexing. Fortran-style Array objects have column-major index ordering (meaning the left-most index varies the fastest) and by default one-based indexing, though the lowest index of a given dimension can be any integer.
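As a brief sketch of these template parameters (the bounds-pair syntax shown for a custom Fortran-style lower bound reflects our reading of the API; all variable names are hypothetical):

    // C-style: row-major, zero-based indexing, resident in device memory
    yakl::Array<float,2,yakl::memDevice,yakl::styleC> c_arr("c_arr",ny,nx);

    // Fortran-style: column-major, one-based indexing by default
    yakl::Array<double,2,yakl::memDevice,yakl::styleFortran> f_arr("f_arr",nx,ny);

    // Fortran-style with a custom lower bound (0:nx+1), e.g. for halo cells
    yakl::Array<double,1,yakl::memDevice,yakl::styleFortran> halo("halo",{0,nx+1});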
All YAKL Array objects have contiguous indexing. When creating an Array object in device memory, it is not recommended to use a data type that has a constructor because device memory is nearly always allocated with the hardware backend's version of malloc, which does not call the constructor. However, when creating an Array object in host memory, the C++ new operator is used, meaning any data type should be suitable.
All Array objects have debugging capabilities (when turned on) to detect things like out of bounds indexing, indexing with the wrong number of dimensions, and using data in the wrong memory space (device data on the host or host data on the device). There are also YAKL flags that cause all allocations to be performed with Managed or Shared memory to allow device data to be used on the host.

Shallow and Deep Copy
As mentioned earlier, YAKL Array objects use shallow copy semantics for assignments, meaning if a and b are Array objects, then a = b will copy the metadata from b to a, but then they will each share the same data pointer. Thus, changes to one will affect the other, similar to pointer assignment or the equivalence statement in Fortran. In order to maintain separate data pointers and copy the data itself between them, a "deep copy" is needed, which is achieved via the deep_copy_to() member function: e.g., b.deep_copy_to(a).
One can only deep copy between Array objects of the same data type and total number of elements. The developer assumes responsibility for maintaining proper indexing when performing a deep copy between C-style and Fortran-style Array objects or between Array objects with differing numbers of dimensions.
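A short sketch of the distinction, again assuming the real2d typedef from Fig. 3:

    real2d a("a",nx,ny);
    real2d b("b",nx,ny);
    real2d c = a;        // shallow copy: c and a now share the same data pointer
    a.deep_copy_to(b);   // deep copy: b's separate allocation receives a's data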

Memory Space Management
YAKL's Array objects have convenient functions to transfer the data between memory spaces. The createHostCopy() and createDeviceCopy() member functions will create a separate copy of the Array object in host and device memory spaces, respectively, and deep copy the data. Even if the object is already in that memory space, a separate object with a separate data pointer is created with a full deep copy to avoid any semantic confusion when using the routine. If the user expects an object with a separate data pointer and the data pointer is, instead, shared, this could lead to code bugs that are difficult to track down. If the user wishes to simply create a similar object in host or device memory without copying the underlying data, the createHostObject() and createDeviceObject() member functions are also available, respectively.
If the array objects already exist, then one can deep copy the data between different memory spaces with the deep_copy_to member function described in the previous section.
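For instance (a sketch using the real2d typedef from Fig. 3):

    real2d d_arr("d_arr",nx,ny);           // device Array filled by kernels
    auto   h_arr = d_arr.createHostCopy(); // new host Array + deep copy of data
    // ... examine or modify h_arr on the host ...
    h_arr.deep_copy_to(d_arr);             // push the modified data back to device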

Automatic and Manual Array Deallocation
YAKL Array deallocation works similarly to Fortran's automatic deallocation semantics. Whenever an Array object falls out of scope, it is automatically deallocated. YAKL Array objects count the number of references to the same data pointer. As soon as the number of references drops to zero, the data is deallocated. Therefore, if an Array object is created and allocated inside a function, as long as it is not shallow copied to a global object or returned from the function, it will be deallocated as soon as the function ends.
The highest likelihood for memory leaks (in all contexts) is with statically scoped Array objects that technically do not fall out of scope until the program ends. While this should never lead to memory usage that grows unbounded over the executable's runtime, it can nonetheless lead to errors and bad behavior. In these cases, one can deallocate the Array object manually in one of two ways: (1) explicitly call the deallocate() member function; or (2) replace the Array object with an empty Array object via shallow copy assignment, e.g., arr = real2d(); (using the typedef from Fig. 3).

Multi-dimensional Statically-Sized SArray Classes
YAKL also has statically sized multi-dimensional array objects via the SArray (for C-style indexing) and FSArray (for Fortran-style indexing) classes. These are the multi-dimensional equivalent of simple arrays in C with the dimension size known at compile time, e.g., float data[128]. These are placed on the stack of whatever context they are defined in. They can be defined inside parallel_for kernels as small thread-private arrays. With YAKL's debugging turned on, the indices are checked at runtime. While some host architectures allow runtime-sized stack arrays, this is not allowed in YAKL because most accelerator devices do not allow this behavior. The dimension sizes must be known at compile time.
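As a sketch (our reading of the SArray template parameters is (type, rank, dimension sizes...); all variable names are hypothetical):

    parallel_for( "stencil" , yakl::c::Bounds<2>(ny,nx) ,
                  YAKL_LAMBDA (int j, int i) {
                    yakl::SArray<float,2,3,3> s;   // 3x3 thread-private stack array
                    for (int jj=0; jj < 3; jj++) {
                      for (int ii=0; ii < 3; ii++) { s(jj,ii) = 0; }
                    }
                    // ... use s as scratch space for this thread ...
                  } );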

Handling Parallel Data Races
Reductions: YAKL handles kernel-wide reduction operations via intrinsic functions that mimic Fortran intrinsic syntax: sum, minval, maxval, minloc, and maxloc. These are based on vendor-provided libraries for optimal performance.
Atomic Instructions: Whenever multiple parallel threads might read/write to the same data location, one needs to use atomic instructions. YAKL supports these at a low level with the following three functions: atomicAdd(a,b), atomicMin(a,b), and atomicMax(a,b). These correspond to serial equivalents of a=a+b, a=min(a,b), and a=max(a,b), respectively.
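For example, a histogram kernel in which many threads may increment the same bin (bins, data, n, and nbins are hypothetical names; data is assumed to lie in [0,1)):

    parallel_for( "histogram" , yakl::c::Bounds<1>(n) ,
                  YAKL_LAMBDA (int i) {
                    int bin = static_cast<int>( data(i) * nbins );
                    yakl::atomicAdd( bins(bin) , 1 );   // safe concurrent increment
                  } );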

"Scalar Live-Out"
There are cases where a scalar value is written to inside a device parallel_for kernel and it must be read on the host after the kernel has finished. Since all variables inside a parallel_for call are copied by value, this means the scalar data must actually be allocated on the device beforehand if it is to be accessed afterward. To make this process easier, YAKL has a ScalarLiveOut class where room for a single scalar value is allocated on the device in the constructor, the initial value can be assigned in the constructor, the variable can be written to with a simple operator= assignment, and it can be read subsequently on the host with the hostRead() member function. In the rare case that it is necessary, one can get the device reference for modification with the operator() overload.
The most common need for a scalar live-out situation is in device-resident error checking routines where a boolean value is used to determine if, for instance, the data is within physical bounds or not.
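A sketch of such an error check (arr and n are hypothetical):

    yakl::ScalarLiveOut<bool> out_of_bounds(false);   // allocated + initialized on device
    parallel_for( "check_bounds" , yakl::c::Bounds<1>(n) ,
                  YAKL_LAMBDA (int i) {
                    if (arr(i) < 0 || arr(i) > 1) { out_of_bounds = true; }
                  } );
    if (out_of_bounds.hostRead()) {
      // handle the out-of-bounds condition on the host
    }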

Limited Fortran Intrinsics Library
There is a limited (but growing) library of Fortran intrinsic routines in YAKL. These currently include size, shape, lbound, ubound, allocated, associated, epsilon, tiny, huge, sign, mod, merge, minval, minloc*, maxval, maxloc*, sum, product*, any, matmul*, transpose*, count, and pack*. Routines marked with an asterisk are only available for SArray and FSArray objects and do not invoke parallel kernels. The routines minval, maxval, sum, any, and count will invoke parallel_for device kernels whenever operating on dynamically allocated Array objects, but they use simple inline looping for SArray and FSArray objects. The reason for this is that statically sized arrays are intended to be relatively small, while dynamically sized arrays are intended to be larger. This removes ambiguity in terms of what behavior to expect when calling one of these intrinsics.
Therefore, any of these routines may be called on an SArray or FSArray object anywhere in the code, but routines that invoke a parallel_for kernel should not be called inside another parallel_for call.
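For example (a sketch assuming the intrinsics live in a yakl::intrinsics namespace and assuming the real2d typedef and real type from Fig. 3):

    using yakl::intrinsics::sum;
    using yakl::intrinsics::minval;
    real total = sum(arr);     // launches a device reduction kernel for Array objects
    real lo    = minval(arr);  // likewise; do not call these inside parallel_for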

Componentwise Operator Library
YAKL also has a library of componentwise operators that can be performed on YAKL Array objects in the yakl::componentwise namespace. These include unary operators and binary operators between two arrays or between an array and a scalar. Each of these operators launches a parallel_for in the default stream. These are largely to make error checking code more convenient to write (e.g., if (any(arr < 0)) {...}).

YAKL_INLINE and Calling Functions from parallel_for Kernels
When calling a function from a parallel_for kernel, it is recommended to use the YAKL_INLINE macro, which gives it modifiers for the appropriate hardware backend to run on the device. For instance, in the CUDA and HIP backends, YAKL_INLINE maps to __forceinline__ __host__ __device__. It is very highly recommended that any class member function decorated with YAKL_INLINE also be decorated with static to avoid any potential use of the class's this-> pointer. While YAKL does have a YAKL_CLASS_LAMBDA macro that captures *this, the user can still run into erroneous situations when trying to use a class's own data or methods on the GPU from inside YAKL_INLINE functions.
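A sketch of a free function callable from device kernels (real is assumed to be the floating point typedef from Fig. 3; arr and n are hypothetical):

    // Decorated so every backend can inline and execute it on the device
    YAKL_INLINE real saturate(real x) {
      return x < 0 ? 0 : (x > 1 ? 1 : x);
    }

    parallel_for( "apply_saturate" , yakl::c::Bounds<1>(n) ,
                  YAKL_LAMBDA (int i) { arr(i) = saturate(arr(i)); } );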

A Transparent and Efficient Pool Allocator
As mentioned earlier, it is a common practice in many Fortran codes to use automatic arrays in functions and subroutines. To enable efficient, frequent allocations and deallocations on the device in C++, YAKL enables a transparent pool allocator under the hood for all YAKL allocations, including Array objects. Allocations on the host are typically not that expensive, but on GPU devices, they can be prohibitively expensive. A pool allocator allocates an initial large block of memory and then hands out chunks quickly upon request. Even for host allocations, YAKL's pool has often proven to be faster than system malloc calls, likely due to how closely packed arrays are in memory when using the pool compared to using malloc. The pool allocator is intended for use outside parallel_for calls, not inside them.
YAKL's pool allocator, "Gator", is based on a simple linear allocation mechanism, which traverses linearly through existing allocations to find space for a requested allocation. This is certainly not the fastest allocation method, but it is beneficial in terms of reducing fragmentation and improving memory locality. Further, since pool allocations typically overlap with device kernel executions, the relatively small increase in search time is typically unnoticeable compared to faster allocation mechanisms that use the available space less efficiently. Particularly in the presence of allocations from multiple parallel CPU threads in indeterminate order, the projects that use YAKL have found it important to use the available pool space as efficiently as possible.
Whenever a pool runs out of space, an additional pool is allocated and remains in place until YAKL's finalize() routine is called. The user can control the initial size of the pool (in MB) with the GATOR_INITIAL_MB environment variable, can control the size of additional pools with the GATOR_GROW_MB environment variable, and can disable the pool with the GATOR_DISABLE environment variable. YAKL also has an InitConfig class to control this with runtime information as well.
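For example, from the shell:

    export GATOR_INITIAL_MB=4096   # initial pool of 4 GB
    export GATOR_GROW_MB=1024      # each additional pool is 1 GB
    export GATOR_DISABLE=1         # or bypass the pool entirely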
YAKL exposes Fortran hooks to the pool allocator through a Fortran module gator_mod. Fortran codes can pass contiguous pointers to these routines to allocate data from the Fortran side. This is advantageous for porting Fortran codes because one can allocate efficiently through the pool allocator using "Managed" or "Shared" memory, and those arrays are accessible on the host and device in both C++ and Fortran.
The "Gator" class is also available to the user to use on their own for pools the may need to manage for other purposes. Note that YAKL's pool allocator is not intended to quickly manage things like CUDA "Shared Memory" within kernels. It is only meant to manage device resident allocations from the host.

YAKL_SCOPE and Using Non-Local Data Inside parallel_for Kernels
There are cases when the code launched by a parallel_for call uses data that is not immediately in local scope but rather in global, namespace, or class scope. In these cases, as mentioned earlier, C++ lambda expressions will not capture out-of-local-scope data by value. Therefore, references to these data are invalid on the device and will, in the best case, lead to invalid memory address errors or segmentation faults.
The YAKL_SCOPE macro function is intended to help in this case to bring that data into local scope before creating the lambda. For instance, for global scope and class scope data, the call would be YAKL_SCOPE(data,::data); and YAKL_SCOPE(data,this->data), respectively. After these calls, the variable data can safely be used in the code wrapped in a YAKL_LAMBDA expression and launched by a parallel_for call.
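For example (coefs is a hypothetical namespace-scope Array; arr and n are also hypothetical):

    YAKL_SCOPE(coefs, ::coefs);   // local shallow copy of the namespace-scope Array
    parallel_for( "apply_coefs" , yakl::c::Bounds<1>(n) ,
                  YAKL_LAMBDA (int i) { arr(i) *= coefs(i); } );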

Bitwise Floating Point Reproducibility
There are many projects for which bitwise floating point reproducibility is important. For instance, in climate, initially small (even machine precision) differences in floating point values will diverge rapidly into distinct weather states in a matter of only a week or two of simulation time. This chaotic behavior makes determining acceptable and unacceptable answer changes difficult. Therefore, having a bitwise reproducible answer (even if only during testing) is important if only to understand when the answer could have changed.
The issue largely arises from floating point mathematical operations that can occur in a non-deterministic order, since floating point arithmetic is not generally associative. The reduction libraries used by YAKL are deterministic. The atomicAdd instructions, however, are not. Therefore, for any kernel that contains an atomicAdd instruction, the user can place an optional parameter, yakl::DefaultLaunchConfigB4b, at the end of the parallel_for call to ensure that whenever the CPP macro YAKL_B4B is defined during compilation, that kernel is run in serial on the CPU.
When the user defines YAKL_B4B, YAKL automatically turns on "Managed" memory for CUDA and HIP and "Shared" memory for SYCL. This allows the device data in the kernel to be accessible on the host so that kernels with the yakl::DefaultLaunchConfigB4b parameter at the end can be run successfully in serial on the CPU. Since these are run serially when YAKL_B4B is defined, floating point determinism is maintained. Whenever YAKL_B4B is not defined, those kernels still run in parallel efficiently on the GPU with non-deterministic results.
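A sketch of such a launch (assuming yakl::DefaultLaunchConfigB4b is instantiated as a trailing argument; total and a are hypothetical device Arrays):

    parallel_for( "accumulate" , yakl::c::Bounds<1>(n) ,
                  YAKL_LAMBDA (int i) { yakl::atomicAdd( total(0) , a(i) ); } ,
                  yakl::DefaultLaunchConfigB4b() );
    // Runs serially on the CPU when YAKL_B4B is defined; in parallel otherwise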

Hierarchical Parallelism
YAKL supports two levels of parallelism on the GPU: one intended for threading inside a vector unit (e.g., a "Streaming Multiprocessor" on Nvidia GPUs or a "Compute Unit" on AMD GPUs); and one for threading across multiple vector units. The functions to launch these are called parallel_inner and parallel_outer, respectively. When parallel_outer is called, it creates an object called an InnerHandler, which must be accepted after the loop indices in the lambda function passed to it. The parallel_inner routine then accepts this InnerHandler object as a parameter. This data structure holds internal YAKL data to manage the two-level parallelism. Technically, this object is only required because of SYCL; CUDA and HIP would not require it. Since the goal is single-source portability, though, it is required for all backends.
There are two other functions commonly used in this context. A single_inner function ensures only one inner thread performs the work inside, and a fence_inner function synchronizes kernel work within an inner loop until all previous inner-loop threads have completed (this maps to __syncthreads() in CUDA and HIP, for example). Both single_inner and fence_inner require the InnerHandler object to satisfy SYCL requirements. Further, all lambda functions passed to parallel_inner and single_inner should be standard C++ pass-by-reference lambdas rather than using YAKL_LAMBDA: i.e., [&] ( ... ) { ... }.
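A sketch of the two-level dispatch based on the description above (exact signatures may differ; nbatch, nthr, work, and flag are hypothetical):

    parallel_outer( "batched" , yakl::c::Bounds<1>(nbatch) ,
        YAKL_LAMBDA (int ib, yakl::InnerHandler handler) {
          parallel_inner( nthr , [&] (int i) { work(ib,i); } , handler );
          fence_inner( handler );                             // sync inner threads
          single_inner( [&] () { flag(ib) = 1; } , handler ); // one thread only
        } );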

"Streams"
YAKL supports multiple parallel "streams" (using CUDA terminology). In SYCL, these are called "queues." A stream / queue should be thought of as a first-in, first-out queue into which all device operations are enqueued and completed in the order they were enqueued. By default, YAKL uses a single default stream / queue for all operations. The user can, however, use multiple queues by defining the CPP macro YAKL_ENABLE_STREAMS at compile time. YAKL's create_stream() routine returns a yakl::Stream object. Streams can be used to record yakl::Event objects, streams can wait on event objects, and the host can wait on event or stream objects to be completed. parallel_for, parallel_outer, and intrinsics that launch kernels all take optional stream arguments that default to the default stream.
One thing to be aware of, however, is that YAKL's default use of a non-blocking pool allocator for all device allocations creates a potential aliasing problem when using multiple streams simultaneously. If the user deallocates and allocates during runtime, then multiple Array objects will likely be aliasing overlapping memory address ranges at the same time from the host's perspective. This is fine when using a single stream because device work is guaranteed to be performed in-order. It is advantageous, even, because it reduces device memory usage compared to allocating all variables at the same time at program initialization.
When using multiple streams at the same time, however, there is generally no guarantee of the order in which work will be completed. This means Array objects aliasing the same memory range might end up being used at the same time, producing indeterminate and incorrect results. To avoid this, the user has two options. First, the user can disable the pool at initialization using the yakl::InitConfig class or the export GATOR_DISABLE=1 shell environment variable. This will then use expensive device allocation routines every time an Array object is allocated, and the performance hit might be unacceptably large. The alternative is for the user to manage the dependencies of data on streams themselves. This is done via the Array class's add_stream_dependency(yakl::Stream stream) member function. If you know that your Array object, arr, is used by kernels launched in stream1, then you can declare that stream dependence with arr.add_stream_dependency(stream1). Then, whenever that array is deallocated, either with an explicit deallocate call or by falling out of scope, events will be declared in all streams that array is dependent upon, and it will not actually be released back to the pool until those events are completed. This removes any possibility of Array objects aliasing the same memory range being used at the same time on the device. If the user desires to use multiple streams and the pool allocator simultaneously, they take responsibility for declaring stream dependencies for data used during runtime, as sketched below.
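A sketch under these assumptions (real1d is assumed to be a 1-D analogue of the real2d typedef from Fig. 3, and we assume the stream is passed as a trailing parallel_for argument, per the optional stream arguments mentioned above; a, b, and n are hypothetical):

    yakl::Stream s1 = yakl::create_stream();
    real1d tmp("tmp",n);
    tmp.add_stream_dependency(s1);   // pool will not reuse tmp's memory until s1 is done
    parallel_for( "k1" , yakl::c::Bounds<1>(n) ,
                  YAKL_LAMBDA (int i) { tmp(i) = a(i) + b(i); } , s1 );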

Debugging and Profiling Capabilities
YAKL has a number of debugging capabilities, the vast majority of which are turned on when the CPP macro YAKL_DEBUG is defined at compile time. Among other things, this flag turns on checks that ensure the following:
• The Array constructor used has the correct dimensionality
YAKL also has an automated "printf" debugging capability enabled by defining the CPP macro YAKL_VERBOSE_FILE at compile time. This will dump one file per process containing all activity in the YAKL library, flushing after each line printed, to enable the user to determine where each MPI task fails in a failed run.
YAKL has built-in timers as well as hooks to switch to other timing libraries. Using yakl::timer_start(char const *label) and yakl::timer_stop(char const *label), YAKL keeps track of the runtime between those calls for each CPU thread and MPI task (using a fence() operation before each to ensure GPU work has completed). By default, YAKL prints the timers for the main task to stdout in a human-readable, fixed-column-width format at the end of the run, but the user can override this behavior if desired. Timers are enabled by specifying the CPP macro YAKL_PROFILE. If the user wishes to automatically generate timers for all YAKL kernel launches (including internal ones), the CPP macro YAKL_AUTO_PROFILE will accomplish this. All parallel_for calls without labels will be listed as unlabeled, with no distinction between them.
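For example (q, dt, flux, and n are hypothetical):

    yakl::timer_start("advection");
    parallel_for( "advect" , yakl::c::Bounds<1>(n) ,
                  YAKL_LAMBDA (int i) { q(i) = q(i) - dt * flux(i); } );
    yakl::timer_stop("advection");   // fences first, so GPU time is captured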

Nvidia GPUs with CUDA
YAKL's CUDA hardware target is used to target Nvidia GPUs. Memory allocations and frees are performed with cudaMalloc and cudaFree. If the CPP macro YAKL_MANAGED_MEMORY is defined, then cudaMallocManaged is used to allow device allocations to be used on the host. If the CPP macro _OPENACC is defined along with YAKL_MANAGED_MEMORY, then all allocated memory is run through acc_map_data to ensure the OpenACC runtime does not automatically create data statements for those pointer address ranges. Similarly, if the CPP macro _OPENMP45 is defined, then managed allocations are run through omp_target_associate_ptr to ensure the OpenMP runtime leaves data within those address ranges alone. Memory transfers are performed with cudaMemcpyAsync.
CUDA has hardware atomic functions for addition, minimum, and maximum operators, and they are used when possible. When the CUDA compute capability is too low for a given data type and operation, then a compare and swap (CAS) implementation is used instead.
The CUDA backend defines YAKL's macros in terms of CUDA's __host__ and __device__ modifiers, along with macro functions for determining whether a section of code is currently being executed in the host compiler pass or the device compiler pass, in order to hide host-only code from device compilation. This is how YAKL handles managed data structures (which contain host-only reference counting and allocation / free calls) being passed to device kernels without warnings and errors. For kernel launches, the "chevron syntax" is used, and the total amount of threading, nIter, is decomposed into vectorSize threads within a CUDA "block" and ceil(nIter / vectorSize) CUDA blocks, where nIter represents the total number of threads being distributed in parallel. vectorSize defaults to 128 but can be changed in the parallel_for call. All CUDA kernels and cudaMemcpy calls are performed in the CUDA default stream (which is the same as stream "0") unless the user specifies a different stream. One aspect of CUDA not experienced with other hardware backends is that the kernel launch parameter space has a limit of typically 4 KB. Whenever a functor larger than this is launched, it must first be loaded into a temporary buffer in device memory and then launched from device memory with a dereference of the functor.
For reductions, the Nvidia CUB library is used, and for FFTs, the Nvidia cuFFT library is used.

AMD GPUs with HIP
The HIP backend is used to target AMD GPUs, and it is unsurprisingly very similar to the CUDA backend, with its macro definitions differing from CUDA's only in their HIP equivalents.

Intel GPUs with SYCL
The SYCL backend is used to target Intel GPUs. A recent development in SYCL is support for the is_device_copyable C++ type trait. YAKL currently overloads this for all functors small enough that the total parameter space for the functor being launched is less than 2,048 bytes (a current Intel hardware limitation). SYCL then accepts the functor as a device-copyable structure, copies it to the device, and launches it, even though it is not "trivially copyable". Whenever the size of the functor is too large for this, it is manually copied to a device memory buffer, similar to large CUDA functors.
In the SYCL backend, reductions are performed using the SYCL 2020 specification. Atomic instructions are achieved with the fetch_[min|max|add] member functions of the relaxed_atomic_ref SYCL class. SYCL parallel_for launches are dispatched directly from the default queue object, which is created upon first use via a singleton pattern, and the SYCL backend synchronizes with the queue's wait() function. FFTs are computed using the Intel MKL library. SYCL parallel_for calls use the sycl::nd_range approach for specifying the total number of threads as well as the "local size" to use (an analogue of block size in CUDA). YAKL currently defaults to a local size of 128.

CPU Threading with OpenMP
The CPU threading in YAKL is implemented with OpenMP pragma statements placed directly inline with the serial CPU for loops, where the OpenMP collapse clause is used to collapse all loops into a single level of threadable parallelism, as sketched below.
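Illustratively (not YAKL's literal generated code), a three-dimensional c-namespace launch maps to something like:

    #pragma omp parallel for collapse(3)
    for (int k=0; k < nz; k++) {
      for (int j=0; j < ny; j++) {
        for (int i=0; i < nx; i++) {
          code(k,j,i);   // invoke the captured functor
        }
      }
    }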

Concluding Remarks and Future Work
This paper has introduced the Yet Another Kernel Launcher (YAKL) C++ portability library, which seeks to enable user-level code that looks like Fortran code for scientific developers who are comfortable in that context. YAKL's features are explored, and examples of its use are provided. The hardware backends are described in detail.
From the authors' experiences, while Fortran is a helpful language in many ways, it is not always simple to run Fortran code on accelerators. Directives-based runtime implementations have varying levels of difficulty when modern features of the Fortran language are used, such as classes, type-bound procedures, and non-contiguous pointers. Fortran's module-based structure can make inlining code more difficult, which can lead to difficulties in calling routines from device kernels. Further, the feature sets supported by directives-based specifications can vary widely from one compiler to another, making portability across many compilers more difficult. There are many codes for which Fortran with directives works quite well, and there are some for which the implementations are less reliable. This was one of the motivations for moving to C++ for the E3SM-MMF project. The authors' experiences have been more predictable and less prone to compiler bugs when using C++ portability.
However, using a C++ portability library is not a decision to be taken lightly, and it is not necessarily the right decision for all projects. Converting Fortran code to a C++ portability library is a daunting task, though that is one of the main tasks that the YAKL library seeks to make easier by allowing Fortran-like behavior in the resulting user-level C++ code. For a more complete description of YAKL, please see the tutorial-style and API documentation located at https://github.com/mrnorman/YAKL/wiki.
Regarding future work, there is ongoing investigation into the degree to which YAKL can be built on top of the Kokkos library. In particular, the parallel dispatch seems to be the most straightforward aspect of YAKL for which to use Kokkos as a backend. Other than this, the main additional functionality planned for inclusion into YAKL is the implementation of additional vendor-library-provided routines to act on YAKL Array objects, such as scan operations, batched reductions, sorting routines, argmin, and argmax, as well as operations at the "inner" parallelism level for hierarchical parallelism applications.