We have created a new version of SkePU to overcome the limitations of the original design. SkePU 2 builds on the mature runtime system of SkePU 1: highly optimized skeleton algorithms for each supported backend target, smart containers, multi-GPU support, etc. These are preserved and have been updated for the C++11 standard. This is of particular value for the Map and MapReduce skeletons, which in SkePU 1 are implemented thrice for unary, binary and ternary variants; in SkePU 2 a single variadic template variant covers all N-ary type combinations. There are similar improvements to the implementation wherever code clarity can be improved and verbosity reduced with no run-time performance cost.
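The benefit of a single variadic variant can be illustrated with a minimal plain C++11 sketch. This is not SkePU 2 source code: `map_nary` and the three user functions are hypothetical names, the containers are ordinary `std::vector`s, and the result type is assumed to equal the element type of the first container.

```cpp
#include <cstddef>
#include <vector>

// One variadic template covers unary, binary, ternary, ... element-wise maps.
// Simplification: the result type equals the first container's element type,
// and all containers are assumed to have the same size.
template <typename F, typename T, typename... Rest>
std::vector<T> map_nary(F f, const std::vector<T>& first,
                        const std::vector<Rest>&... rest) {
    std::vector<T> out(first.size());
    for (std::size_t i = 0; i < first.size(); ++i)
        out[i] = f(first[i], rest[i]...);  // expands to f(a[i], b[i], c[i], ...)
    return out;
}

// Hypothetical user functions of arity 1, 2 and 3.
int negate_elem(int a) { return -a; }
int add_elems(int a, int b) { return a + b; }
int mul_add_elems(int a, int b, int c) { return a * b + c; }
```

The same `map_nary` template serves all three user functions, where a macro-based design would need one implementation per arity.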
The main changes in SkePU 2 are related to the programming interface and code transformation. SkePU 1 uses preprocessor macros to transform user functions for parallel backends; SkePU 2 instead utilizes a source-to-source translator (precompiler), a separate program based on libraries from the open source Clang project (footnote 2). Source code is passed through this tool before normal compilation. However, a SkePU 2 program is valid C++11 as-is; a sequential binary (with identical semantics to the parallel one) will be built if the code is compiled directly by a standard C++ compiler.
This section introduces the new programming interface, syntax and other features of SkePU 2, first by means of an example. Listing 2 contains a vector sum computation in SkePU 2 syntax, mirroring Listing 1.
A skeleton is invoked with the overloaded operator(), with arguments matching those of the user function. Additionally, the output container is (where applicable) passed as the first argument. Smart containers may be passed either by reference or by iterator, the latter allowing operations on partial vectors or matrices. A particular argument grouping is required by SkePU 2: all element-wise containers must be grouped first, followed by all random-access containers, and uniform arguments last.
There are six skeletons available in SkePU 2: Map, Reduce, MapReduce, Scan, MapOverlap, and Call; this is fewer than in SkePU 1, as the generalized Map now covers the use-cases of MapArray and Generate. See Table 1 for a list of the skeletons.
Skeleton instances are declared with an inferred type (using the auto specifier) and defined by assignment from a factory function, as exemplified in Listing 2. The actual type of a skeleton instance should be regarded as unknown to the programmer.
Map is greatly expanded compared to SkePU 1. A Map skeleton accepts N containers for any integer N including 0. These containers must be of equal size, as must the return container. As one element from each of these containers will be passed as arguments to a call to a user function, we refer to these containers as element-wise arguments. Map additionally takes any number of SkePU containers which are accessible in their entirety inside a user function. These are called random access arguments, and they render MapArray from SkePU 1 redundant. These parameters are declared to be either in (by const qualification), out (with a C++11 attribute), or inout (default) arguments and are only copied (e.g., between main memory and device memory) when necessary. Finally, scalar arguments can also be included; these are passed unaltered to the user function. The Map skeleton is thus three-way variadic, as each group of arguments is handled differently and is of arbitrary size.
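The three argument groups and their required order can be sketched in plain C++11. This is only an illustration with exactly one argument per group, not the SkePU 2 implementation; `map_grouped` and `lookup_and_shift` are hypothetical names, and `std::vector` stands in for smart containers.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical user function: 'x' is the element-wise argument, 'table' is a
// random-access argument (read-only, hence const), 'offset' is a uniform scalar.
float lookup_and_shift(int x, const std::vector<float>& table, float offset) {
    return table[static_cast<std::size_t>(x) % table.size()] + offset;
}

// Output container first, then the element-wise, random-access and uniform
// arguments, in that order. Fixed to one argument per group for brevity.
template <typename F, typename Out, typename In, typename Table, typename U>
void map_grouped(F f, std::vector<Out>& out, const std::vector<In>& elwise,
                 const Table& random_access, U uniform) {
    for (std::size_t i = 0; i < out.size(); ++i)
        out[i] = f(elwise[i], random_access, uniform);
}
```

Only the element-wise argument is indexed by the loop; the random-access container and the scalar are forwarded whole to every call of the user function.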
Another feature of Map is the option to pass the index of the currently processed container element to the user function. This is handled automatically, deduced from the user function signature. The type of an index parameter is one of two structs: Index1D for vectors and Index2D for matrices. This feature replaces the dedicated Generate skeleton of SkePU 1. A commonly seen pattern, in which Generate produces a vector of consecutive indices that is then passed to MapArray, can now be implemented in a single Map call.
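A minimal sketch of an index-aware map with zero element-wise inputs, again in plain C++11 rather than SkePU 2 syntax. The `Index1D` struct below is a local stand-in with an assumed member `i`, and `map_indexed` and `reversed_at` are hypothetical names.

```cpp
#include <cstddef>
#include <vector>

struct Index1D { std::size_t i; };  // local stand-in, not the SkePU 2 definition

// Map with zero element-wise inputs: the user function sees only the index of
// the output element and a random-access container.
template <typename F, typename T>
std::vector<T> map_indexed(F f, std::size_t n,
                           const std::vector<T>& random_access) {
    std::vector<T> out(n);
    for (std::size_t k = 0; k < n; ++k)
        out[k] = f(Index1D{k}, random_access);
    return out;
}

// Hypothetical user function: reverse a vector by indexing from the back.
int reversed_at(Index1D idx, const std::vector<int>& v) {
    return v[v.size() - 1 - idx.i];
}
```

No intermediate vector of indices is materialized; the index is produced by the map loop itself.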
Reduce is a generic reduction operation with an associative operator, available in multiple variants. A vector is reduced in only one way, but for matrices five options exist. A reduction on a matrix may be performed in either one or two dimensions (for two-dimensional reduction the user supplies two user functions), in both cases either row-wise or column-wise. The fifth mode treats the matrix as a vector (in row-major order) and is the only mode available if an iterator into a matrix is supplied.
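The matrix reduction modes can be sketched sequentially as follows. This is an illustration under stated assumptions, not SkePU 2 code: `Matrix` is a hypothetical non-empty row-major struct, all function names are made up for this sketch, and only the row-wise flavor of the two-dimensional reduction is shown.

```cpp
#include <cstddef>
#include <vector>

struct Matrix { std::size_t rows, cols; std::vector<int> data; };  // row-major

// One-dimensional, row-wise: one result per row.
template <typename F>
std::vector<int> reduce_rows(F op, const Matrix& m) {
    std::vector<int> out(m.rows);
    for (std::size_t r = 0; r < m.rows; ++r) {
        int acc = m.data[r * m.cols];
        for (std::size_t c = 1; c < m.cols; ++c) acc = op(acc, m.data[r * m.cols + c]);
        out[r] = acc;
    }
    return out;
}

// One-dimensional, column-wise: one result per column.
template <typename F>
std::vector<int> reduce_cols(F op, const Matrix& m) {
    std::vector<int> out(m.cols);
    for (std::size_t c = 0; c < m.cols; ++c) {
        int acc = m.data[c];
        for (std::size_t r = 1; r < m.rows; ++r) acc = op(acc, m.data[r * m.cols + c]);
        out[c] = acc;
    }
    return out;
}

// Matrix treated as a vector in row-major order.
template <typename F>
int reduce_flat(F op, const Matrix& m) {
    int acc = m.data[0];
    for (std::size_t k = 1; k < m.data.size(); ++k) acc = op(acc, m.data[k]);
    return acc;
}

// Two-dimensional, row-wise: reduce each row with row_op, then combine the
// per-row results with col_op (two user functions).
template <typename F, typename G>
int reduce_2d_rowwise(F row_op, G col_op, const Matrix& m) {
    std::vector<int> per_row = reduce_rows(row_op, m);
    int acc = per_row[0];
    for (std::size_t r = 1; r < per_row.size(); ++r) acc = col_op(acc, per_row[r]);
    return acc;
}

int add(int a, int b) { return a + b; }
int max_of(int a, int b) { return a > b ? a : b; }
```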
MapReduce is a combination of Map and Reduce and offers the features of both, with the limitation that the element-wise arity must be at least 1.
Scan implements two variants of the prefix sum operation generalized to any associative binary operator. The variants are inclusive and exclusive scan, where the latter supports a user-defined starting value.
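The difference between the two variants is easiest to see in a sequential reference version. The following is a plain C++11 sketch with hypothetical function names, not the parallel SkePU 2 algorithm.

```cpp
#include <cstddef>
#include <vector>

// Inclusive scan: out[i] combines in[0..i].
template <typename F, typename T>
std::vector<T> scan_inclusive(F op, const std::vector<T>& in) {
    std::vector<T> out(in.size());
    if (in.empty()) return out;
    out[0] = in[0];
    for (std::size_t i = 1; i < in.size(); ++i)
        out[i] = op(out[i - 1], in[i]);
    return out;
}

// Exclusive scan: out[i] combines the starting value with in[0..i-1].
template <typename F, typename T>
std::vector<T> scan_exclusive(F op, const std::vector<T>& in, T start) {
    std::vector<T> out(in.size());
    T acc = start;
    for (std::size_t i = 0; i < in.size(); ++i) {
        out[i] = acc;
        acc = op(acc, in[i]);
    }
    return out;
}

int add(int a, int b) { return a + b; }
```

For the input 1, 2, 3, 4 with addition, the inclusive scan yields 1, 3, 6, 10, while the exclusive scan with starting value 10 yields 10, 11, 13, 16.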
MapOverlap is a one- or two-dimensional stencil operation. Parameters for specializing the boundary handling are available, and there is specific support for separable 2D stencils.
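As an illustration of a one-dimensional stencil with explicit boundary handling, consider the following plain C++11 sketch. It assumes an overlap radius of 1 and constant padding at the edges; both are choices made for this example rather than a description of the boundary modes offered by SkePU 2, and the names are hypothetical.

```cpp
#include <cstddef>
#include <vector>

// 1D stencil with overlap radius 1. Out-of-range neighbours are replaced by
// the constant 'pad' (one possible boundary policy, assumed for this sketch).
template <typename F, typename T>
std::vector<T> map_overlap_1d(F f, const std::vector<T>& in, T pad) {
    const std::size_t n = in.size();
    std::vector<T> out(n);
    for (std::size_t i = 0; i < n; ++i) {
        T left = (i == 0) ? pad : in[i - 1];
        T right = (i + 1 == n) ? pad : in[i + 1];
        out[i] = f(left, in[i], right);
    }
    return out;
}

// Hypothetical user function: sum of the element and its two neighbours.
int sum3(int left, int center, int right) { return left + center + right; }
```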
Call is a completely new skeleton for SkePU 2. It is not a skeleton in a strict sense, as it does not enforce a specific structure for computations. Call simply invokes its user function. The programmer can provide arbitrary computations as explicit user function backend specializations, which must include at least a sequential general-purpose CPU backend as a default variant. The direction (in, out, inout) of parameter data flow follows the same principles as for the Map skeleton described above. Call provides seamless integration with SkePU features such as smart containers and auto-tuning of back-end selection. Basically, Call extends the traditional skeleton programming model in SkePU with the functionality of user-defined multi-variant components (i.e., "PEPPHER" components) with auto-tunable automated variant selection.
Listing 3 contains an example application of the Call skeleton, integer sorting, which can otherwise be difficult to implement in data-parallel skeleton programming. One of two distinctly different algorithms is selected depending on whether the Call instance is executed on CPU or GPU. (Note that the example is just an illustration; the CPU insertion sort algorithm is inefficient, and the even-odd sorting in the GPU variant works only inside a single work group. Also, the syntax for specializing user functions will be refined in the future.)
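The idea of a multi-variant component with a mandatory CPU default can be sketched in plain C++11 as follows. This is not the SkePU 2 Call interface: `Backend`, `SortCall` and the explicit target argument are hypothetical, and the "accelerator" variant is an ordinary sequential emulation of odd-even transposition sort.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <vector>

enum class Backend { CPU, Accelerator };  // hypothetical backend tags

struct SortCall {
    std::function<void(std::vector<int>&)> cpu_variant;          // mandatory default
    std::function<void(std::vector<int>&)> accelerator_variant;  // optional

    // Pick the accelerator variant if requested and present, else fall back.
    void operator()(Backend target, std::vector<int>& data) const {
        if (target == Backend::Accelerator && accelerator_variant)
            accelerator_variant(data);
        else
            cpu_variant(data);
    }
};

// CPU variant: insertion sort.
void insertion_sort(std::vector<int>& v) {
    for (std::size_t i = 1; i < v.size(); ++i)
        for (std::size_t j = i; j > 0 && v[j - 1] > v[j]; --j)
            std::swap(v[j - 1], v[j]);
}

// Stand-in for a data-parallel variant: odd-even transposition sort, run
// sequentially here. Each phase compares disjoint neighbouring pairs.
void odd_even_sort(std::vector<int>& v) {
    for (std::size_t phase = 0; phase < v.size(); ++phase)
        for (std::size_t j = phase % 2; j + 1 < v.size(); j += 2)
            if (v[j] > v[j + 1]) std::swap(v[j], v[j + 1]);
}
```

In this sketch the caller names the target explicitly; in a setting with auto-tuned backend selection that decision would be taken by the runtime instead.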
In the example in Listing 2, the user function is defined as a free function template. This is one of two ways to define user functions in SkePU 2; the other is with lambda expression syntax as in Listing 4, where the function is written inline with the skeleton instance. Free functions are reminiscent of the macros used in SkePU 1, and still suitable for cases where a user function can be shared across skeleton instances. In most cases, however, the lambda syntax is superior; it increases code locality while eliminating namespace pollution. There are no run-time differences between the two, as identical code is generated by the precompiler.
Naturally, the source-to-source translator is limited in scope when transforming user functions for parallel execution. Operations with side effects, for example memory allocation or I/O operations, have undefined behavior inside user functions unless explicitly allowed by SkePU 2. Also, not all syntactical constructs of C++ are supported, e.g., range-for loops. In general, the body of a user function should be written in C-compatible syntax. SkePU 2 does not enforce these rules with error messages at this time.
User functions can be nested, i.e., called from inside other user functions. This is demonstrated in Listing 5.
User Types and Constants
For many applications, basic types such as int and float may not be sufficient in a high-level programming interface. SkePU 2 therefore includes the possibility of using a custom struct as the element type in smart containers or as an extra argument to a skeleton instance. Even then, there are major restrictions on such types depending on the backends used; the type should not have any features outside those of a C-style struct and the memory layout needs to match across backends.
Listing 5 demonstrates user types in SkePU 2 with the use of a complex number type cplx for Mandelbrot fractal generation. Functions operating on objects of type cplx are defined as free functions and are treated as user functions by the precompiler. The example also uses the related feature user constants, e.g., MAX_ITERS, which are compile-time constant values that can be read in user functions. These objects are annotated with the [[skepu2::userconstant]] attribute.
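The restriction to C-style structs can be approximated at compile time with standard type traits. The sketch below is an assumption about how a programmer might guard a user type; it is not a check performed by SkePU 2, and the member names `re` and `im` as well as the helper `cplx_mul` are hypothetical.

```cpp
#include <type_traits>

// Hypothetical complex number user type: plain data members only.
struct cplx {
    float re;
    float im;
};

// Guard against accidentally adding features beyond a C-style struct
// (virtual functions, non-trivial copy semantics, mixed access control, ...).
static_assert(std::is_standard_layout<cplx>::value,
              "user type must have a C-compatible layout");
static_assert(std::is_trivially_copyable<cplx>::value,
              "user type must be copyable byte-for-byte");

// Free function operating on the user type: complex multiplication.
cplx cplx_mul(cplx a, cplx b) {
    cplx r;
    r.re = a.re * b.re - a.im * b.im;
    r.im = a.re * b.im + a.im * b.re;
    return r;
}
```

These traits do not guarantee that the memory layout matches across backends, but they catch the most common violations early.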
Improved Type Safety
One of the goals with the SkePU 2 design was to increase the level of type safety from SkePU 1. In the following example, a programmer has made the mistake of supplying a unary user function to Reduce. Listing 6 shows the error in SkePU 1 code, and Listing 7 illustrates the same in SkePU 2 syntax.
The SkePU 1 example compiles without problem, and only at run-time terminates with the error message in Listing 8. The message itself is shared between all reduce instances, limiting the information obtained by the user. SkePU 2, on the other hand, halts compilation and prints an error message even before the precompiler has transformed the code. It directs the user to the affected skeleton instance. (The message does not directly describe the issue, an aspect which can be further improved with C++11’s static_assert.)
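How static_assert could produce a more descriptive message for this mistake is sketched below in plain C++11. This is an illustration only, not the SkePU 2 implementation: the `arity` trait handles nothing but plain function pointers, and `checked_reduce` is a hypothetical sequential stand-in for Reduce.

```cpp
#include <cstddef>
#include <type_traits>
#include <vector>

// Arity trait; only function pointers are handled in this sketch.
template <typename F> struct arity;
template <typename R, typename... Args>
struct arity<R (*)(Args...)>
    : std::integral_constant<std::size_t, sizeof...(Args)> {};

template <typename F, typename T>
T checked_reduce(F op, const std::vector<T>& v, T init) {
    static_assert(arity<F>::value == 2,
                  "Reduce requires a binary user function");
    T acc = init;
    for (const T& x : v) acc = op(acc, x);
    return acc;
}

int add(int a, int b) { return a + b; }
int twice(int a) { return 2 * a; }  // unary: not a valid reduction operator

// checked_reduce(twice, v, 0);  // rejected at compile time by the static_assert
```

The diagnostic text now states the actual problem, and the compiler still reports the offending call site.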