We have spent the entire book promoting the art of writing our own code. Now we finally acknowledge that some great programmers have already written code that we can just use. Libraries are the best way to get our work done. This is not a case of being lazy—it is a case of having better things to do than reinvent the work of others.

This chapter covers three different sets of library functionality:

  1. Built-in functions defined by the SYCL specification

  2. The C++ standard library

  3. C++17 parallel algorithms, supported by the oneAPI DPC++ Library (oneDPL)

SYCL defines a rich set of built-in functions that provide common functionality shared by host and device code. All SYCL implementations support these functions, so we can rely on key math functions being available on all SYCL devices.

The C++ standard library is not guaranteed to be supported in device code by all SYCL implementations. However, the DPC++ compiler and other compilers support this as an extension to SYCL, so we briefly discuss the limitations of that extension here.

Finally, the oneAPI DPC++ Library (oneDPL) provides a set of algorithms based on the C++17 algorithms, implemented in SYCL, to provide a high-productivity solution for SYCL programmers. This can minimize programming effort across CPUs, GPUs, and FPGAs. Although oneDPL is not part of SYCL 2020, since it is implemented on top of SYCL, it should be compatible with any SYCL 2020 compiler.

Built-In Functions

SYCL provides a rich set of built-in functions with support for various data types. These built-in functions are available in the sycl namespace on host and device and can be classified as follows:

  • Floating-point math functions: asin, acos, log, sqrt, floor, etc.

  • Integer functions: abs, max, min, etc.

  • Common functions: clamp, smoothstep, etc.

  • Geometric functions: cross, dot, distance, etc.

  • Relational functions: isequal, isless, isfinite, etc.

The documentation for this extensive collection of functions can be found in sections 4.17.5 through 4.17.9 of the SYCL 2020 specification, available online at registry.khronos.org/SYCL/specs/sycl-2020/html/sycl-2020.html.
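To make two of the less familiar "common functions" concrete, the following host-only sketch writes out the formulas that clamp and smoothstep are defined to compute. The names ref_clamp and ref_smoothstep are hypothetical helpers introduced here for illustration; they are not SYCL APIs, and a real implementation may differ in precision and in how it treats edge cases.

```cpp
#include <algorithm>
#include <cassert>

// Host-only reference formulas. The helper names are hypothetical and
// are not part of SYCL.

// clamp: restrict x to the closed interval [lo, hi].
inline float ref_clamp(float x, float lo, float hi) {
  return std::min(std::max(x, lo), hi);
}

// smoothstep: 0 at or below edge0, 1 at or above edge1, and the Hermite
// polynomial t*t*(3 - 2t) in between. Only meaningful when edge0 < edge1.
inline float ref_smoothstep(float edge0, float edge1, float x) {
  assert(edge0 < edge1);
  const float t = ref_clamp((x - edge0) / (edge1 - edge0), 0.0f, 1.0f);
  return t * t * (3.0f - 2.0f * t);
}
```

For example, ref_smoothstep(0.0f, 1.0f, 0.5f) evaluates to 0.5f, because t is 0.5 and 0.25 * (3 - 1) is 0.5.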

Some compilers may provide options to control the precision of these functions. For example, the DPC++ compiler provides several such options, including -mfma, -ffast-math, and -ffp-contract=fast. It is important to check the documentation of a SYCL implementation to understand the availability of similar options (and their default values).

Several of the SYCL built-in functions have equivalents in the C++ standard library (e.g., sycl::log and std::log). SYCL implementations are not required to support calling C++ standard library functions within device code, but some implementations (e.g., DPC++) do.

Figure 18-1 demonstrates the use of both the C++ std::log function and the SYCL built-in sycl::log function in device code. With the DPC++ compiler, both functions produce the same numeric results. In the example, the built-in relational function sycl::isequal compares the results of std::log and sycl::log.

Figure 18-1
[Code listing not reproduced. It declares constexpr int size = 9, arrays a and b, bool pass = true, a queue q, a range sz, buffers A, B, and P with matching accessors, and uses submit and parallel_for.]

Using std::log and sycl::log

Note that the SYCL 2020 specification does not mandate that a SYCL math function implementation must produce the exact same numeric result as its corresponding C and C++ standard math function for a given hardware target. The specification allows for certain variations in the implementation to account for the characteristics and limitations of different hardware platforms. Even so, a SYCL implementation may produce matching results in practice, as demonstrated in the code example shown in Figure 18-1.
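Because bit-identical results are not guaranteed, code that compares values from two math implementations is often more portable if it allows a small tolerance rather than testing for exact equality. The following host-only sketch shows one way to do this. The helper name nearly_equal and the default tolerance of four machine epsilons are assumptions made for illustration, not values taken from the SYCL specification.

```cpp
#include <algorithm>
#include <cmath>
#include <limits>

// Hypothetical helper: true when a and b agree to within a relative
// tolerance. The default tolerance is an arbitrary illustrative choice.
inline bool nearly_equal(
    float a, float b,
    float rel_tol = 4.0f * std::numeric_limits<float>::epsilon()) {
  if (a == b) return true;  // identical values, including equal infinities
  if (!std::isfinite(a) || !std::isfinite(b)) return false;  // NaN or lone infinity
  const float scale = std::max(std::fabs(a), std::fabs(b));
  return std::fabs(a - b) <= rel_tol * scale;
}
```

A purely relative tolerance is not meaningful for values very close to zero; an application with such values would likely need an additional absolute tolerance chosen for its data.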

Use the sycl:: Prefix with Built-In Functions

We strongly recommend invoking the SYCL built-in functions with an explicit sycl:: prepended to the name. Calling just sqrt() is not guaranteed to invoke the SYCL built-in on all implementations even if “using namespace sycl;” has been used.

SYCL built-in functions should always be invoked with an explicit sycl:: in front of the built-in name. Failure to follow this advice may result in strange and non-portable results.

When writing portable code, we recommend avoiding the using namespace sycl; directive completely and instead qualifying names explicitly with std:: and sycl::. By being explicit, we remove the possibility of encountering unresolvable conflicts within certain SYCL implementations. This may also make code easier to debug in the future (e.g., if an implementation provides different precision guarantees for math functions in the std:: and sycl:: namespaces).
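The name-lookup hazard behind this advice can be reproduced without SYCL. In the host-only sketch below, the namespace demo is a hypothetical stand-in for a library namespace with its own sqrt; it is an analogy and does not mirror SYCL's actual overload set. After a using directive, an unqualified call competes with whatever sqrt declarations the standard headers place in the global namespace, so which function runs depends on the argument type and on the implementation.

```cpp
#include <cmath>

namespace demo {     // hypothetical stand-in for a library namespace
int sqrt_calls = 0;  // counts calls that actually reach demo::sqrt
inline float sqrt(float x) {
  ++sqrt_calls;
  return std::sqrt(x);
}
}  // namespace demo

using namespace demo;  // the pattern the text advises against

// Unqualified call. <cmath> commonly (but not necessarily) also declares
// ::sqrt(double) in the global namespace; when it does, that overload is
// the better match for a double argument and demo::sqrt is never called.
inline double unqualified_sqrt(double x) { return sqrt(x); }

// Qualified call. This always reaches demo::sqrt.
inline double qualified_sqrt(double x) {
  return demo::sqrt(static_cast<float>(x));
}
```

Both functions return the same value for simple inputs, which is what makes the difference easy to miss: only the call counter (or, in a real library, a difference in precision or device support) reveals which function ran.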

The C++ Standard Library

As mentioned previously, the SYCL specification does not guarantee that functions from the C++ standard library will be supported in device code. However, there are several compilers that do support these functions: this simplifies the offloading of existing C++ code to SYCL devices and makes it easier to write libraries that use SYCL as an implementation detail (e.g., a user passing a function into a library can write that function without using any SYCL-specific features).

Your Mileage May Vary

Since support in device code for functions from the std:: namespace varies across SYCL implementations, we cannot be sure that kernels employing the C++ standard library will be portable across multiple SYCL compilers and implementations.

The DPC++ compiler is compatible with a set of tested C++ standard APIs—we simply need to include the corresponding C++ header files and use the std namespace. All these APIs can be employed in device kernels the way they are employed in a typical C++ host application. Figure 18-2 shows an example of how to use std::swap in device code.

Figure 18-2
[Code listing not reproduced. The listing ends the scope of host A so that the upcoming kernel can operate on the buffer, then calls swap inside a single_task submitted from main. The sample output reads 8, 9 on the first row and 9, 8 on the second.]

Using std::swap in device code

Figure 18-3 lists C++ standard APIs with “Y” to indicate those that had been tested for use in SYCL kernels on CPU, GPU, and FPGA devices at the time of publication of this book. A blank indicates incomplete coverage (not all three device types).

Figure 18-3
[Table not reproduced. It is split into three side-by-side parts, each with four columns: C++ standard API, libstdc++, libc++, and MSVC. The entries in the second, third, and fourth columns are Y.]

Library support with CPU/GPU/FPGA coverage (at time of book publication)

The tested standard C++ APIs are also supported for the host CPU in libstdc++ (GNU) with gcc 7.5.0+, libc++ (LLVM) with clang 11.0+, and the MSVC Standard C++ Library with Microsoft Visual Studio 2019+.

On Linux, GNU libstdc++ is the default C++ standard library for the DPC++ compiler, so no compilation or linking option is required. If we want to use libc++, we can pass the compile options -stdlib=libc++ -nostdinc++ to select libc++ and to avoid including C++ standard headers from the system. The DPC++ compiler has been verified using libc++ in SYCL kernels on Linux, but the runtime needs to be rebuilt with libc++ instead of libstdc++. Details are available at https://intel.github.io/llvm-docs/GetStartedGuide.html#build-dpc-toolchain-with-libc-library. Because of these extra steps, we do not recommend libc++ as the C++ standard library in general unless there is a specific reason to use it.

To achieve cross-architecture portability, we need to take care with any std:: function that is not marked with “Y” in Figure 18-3: it may produce incorrect results (or build failures) when our application runs on target devices that we have not tested!

oneAPI DPC++ Library (oneDPL)

C++17 introduced parallel versions of the algorithms defined in the C++ standard library. Unlike their serial counterparts, each of the parallel algorithms accepts an execution policy as its first argument—this execution policy denotes how an algorithm may execute.

Loosely speaking, an execution policy communicates to an implementation whether it can parallelize the algorithm using threads, SIMD instructions, or both. We can pass one of the values seq, unseq, par, or par_unseq as the execution policy, with meanings shown in Figure 18-4.

Figure 18-4
[Table not reproduced. Its two columns are Execution policy and Meaning, with one row each for seq, unseq, par, and par_unseq.]

Execution policies

oneDPL extends the standard execution policies to provide support for SYCL devices. These SYCL-aware execution policies specify not only how an algorithm should execute, but also where it should execute. A SYCL-aware policy inherits a standard C++ execution policy, encapsulates a SYCL device or queue, and allows us to set an optional kernel name. SYCL-aware execution policies can be used with all standard C++ algorithms that support execution policies according to the C++17 standard.

oneDPL is not tied to any single SYCL compiler; it is designed to support all SYCL compilers.

Before we can use oneDPL and its SYCL-aware execution policies, we need to add some additional header files. Which headers we include will depend on the algorithms we intend to use; some common examples include:

  • #include <oneapi/dpl/algorithm>

  • #include <oneapi/dpl/numeric>

  • #include <oneapi/dpl/memory>

SYCL Execution Policy

Currently, only algorithms with the parallel unsequenced policy (par_unseq) can be safely offloaded to SYCL devices. This restriction stems from the forward progress guarantees provided by work-items in SYCL, which are incompatible with the requirements of other execution policies (e.g., par).

There are three steps to using a SYCL execution policy:

  1. Add #include <oneapi/dpl/execution> into our code.

  2. Create a policy object by providing a standard policy type, a class type for a unique kernel name as a template argument (optional), and one of the following constructor arguments:

       • A SYCL queue

       • A SYCL device

       • A SYCL device selector

       • An existing policy object with a different kernel name

  3. Pass the created policy object to an algorithm.

A oneapi::dpl::execution::dpcpp_default object is a predefined device_policy created with a default kernel name and default queue. This can be used to create custom policy objects or passed directly when invoking an algorithm if the default choices are sufficient.

Figure 18-5 shows examples that assume use of the using namespace oneapi::dpl::execution; directive when referring to policy classes and functions.

Figure 18-5
[Code listing not reproduced. It creates four policies (b, c, d, and e) with auto, using make_device_policy, from a GPU selector, a default selector, the default policy, and a queue.]

Creating execution policies

Using oneDPL with Buffers

The algorithms in the C++ standard library are all based on iterators. To support passing SYCL buffers into these algorithms, oneDPL defines two special helper functions: oneapi::dpl::begin and oneapi::dpl::end.

These functions accept a SYCL buffer and return an object of an unspecified type that satisfies the following requirements:

  • Is CopyConstructible, CopyAssignable, and comparable with operators == and !=.

  • The following expressions are valid: a + n, a - n, and a - b, where a and b are objects of the type and n is an integer value.

  • Has a get_buffer method with no arguments. The method returns the SYCL buffer passed to oneapi::dpl::begin and oneapi::dpl::end functions.
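To see what these requirements amount to, the following host-only sketch defines a small type that satisfies each of them. Everything in it is hypothetical: fake_buffer stands in for a SYCL buffer so that the code builds without SYCL, and buffer_position, position_begin, and position_end are illustrative names. The type actually returned by oneapi::dpl::begin and oneapi::dpl::end is unspecified and need not look like this.

```cpp
#include <cstddef>
#include <type_traits>
#include <vector>

// Hypothetical stand-in for a SYCL buffer.
using fake_buffer = std::vector<int>;

// Hypothetical position type: a reference to a buffer plus an offset.
class buffer_position {
 public:
  buffer_position(fake_buffer* buf, std::ptrdiff_t offset)
      : buf_(buf), offset_(offset) {}

  // a + n and a - n yield a new position; a - b yields a distance.
  friend buffer_position operator+(buffer_position a, std::ptrdiff_t n) {
    return {a.buf_, a.offset_ + n};
  }
  friend buffer_position operator-(buffer_position a, std::ptrdiff_t n) {
    return {a.buf_, a.offset_ - n};
  }
  friend std::ptrdiff_t operator-(buffer_position a, buffer_position b) {
    return a.offset_ - b.offset_;
  }

  // Comparable with == and !=.
  friend bool operator==(buffer_position a, buffer_position b) {
    return a.buf_ == b.buf_ && a.offset_ == b.offset_;
  }
  friend bool operator!=(buffer_position a, buffer_position b) {
    return !(a == b);
  }

  // Returns the buffer this position refers to.
  fake_buffer& get_buffer() const { return *buf_; }

 private:
  fake_buffer* buf_;
  std::ptrdiff_t offset_;
};

static_assert(std::is_copy_constructible<buffer_position>::value,
              "must be CopyConstructible");
static_assert(std::is_copy_assignable<buffer_position>::value,
              "must be CopyAssignable");

// Hypothetical analogues of the begin/end helpers.
inline buffer_position position_begin(fake_buffer& b) { return {&b, 0}; }
inline buffer_position position_end(fake_buffer& b) {
  return {&b, static_cast<std::ptrdiff_t>(b.size())};
}
```

Note that the requirements do not include dereferencing: an object like this identifies a range within a buffer, and the library is free to decide how the underlying data is accessed on the device.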

Note that using these helper functions requires us to add #include <oneapi/dpl/iterator> to our code. This functionality is not included by default, because these iterators are not required when using USM (which we will revisit shortly).

The code in Figure 18-6 shows how to use the std::fill function in conjunction with the begin/end helpers to fill a SYCL buffer. Note that the algorithm is in the std:: namespace, and only the execution policy is in a nonstandard namespace—this is not a typo! The C++ standard library explicitly permits implementations to define their own execution policies to support coding patterns like this.

Figure 18-6
[Code listing not reproduced. It declares buffer begin and end iterators and a policy with auto, names the kernel with class fill, and uses a SYCL queue and buffer.]

Using std::fill

The code in Figure 18-7 shows an even simpler version of this code, using a default policy and ordinary (host-side) iterators. In this case, a temporary SYCL buffer is created, and the data is copied to this buffer. After processing of the temporary buffer on a device is complete, the data is copied back to the host. Working directly with existing SYCL buffers (where possible) is recommended to reduce data movement between the host and device and any unnecessary overhead of buffer creations and destructions.

Figure 18-7
[Code listing not reproduced. In main, it fills a vector using fill with begin and end, then prints whether the check passed or failed.]

Using std::fill with default policy and host-side iterators

Figure 18-8 shows an example which performs a binary search of the input sequence for each of the values in the search sequence provided. As the result of a search for the ith element of the search sequence, a Boolean value indicating whether the search value was found in the input sequence is assigned to the ith element of the result sequence. The algorithm returns an iterator that points to one past the last element of the result sequence that was assigned a result. The algorithm assumes that the input sequence has been sorted by the comparator provided. If no comparator is provided, then a function object that uses operator< to compare the elements will be used.

The complexity of the preceding description highlights that we should leverage library functions where possible, instead of writing our own implementations of similar algorithms which may take significant debugging and tuning time. Authors of the libraries that we can take advantage of are often experts in the internals of the device architectures we are targeting and may have access to information that we do not, so we should always leverage optimized libraries when they are available.

Figure 18-8
[Code listing not reproduced. It initializes sorted data, creates DPC++ iterators, creates a named policy from an existing one, calls the algorithm, and checks the data. The data are k = {0, 5, 6, 6, 7, 7, 8, 8, 9, 9} and v = {1, 6, 3, 7, 8}.]

Using binary_search
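The semantics described above can be pinned down with a short serial reference written in standard C++. This is a host-only sketch for checking results, not the oneDPL implementation and not the listing from Figure 18-8; the helper name search_each is hypothetical, and it returns the result sequence directly rather than an iterator past its last assigned element.

```cpp
#include <algorithm>
#include <vector>

// Serial reference (hypothetical helper): for each entry of values, record
// whether it occurs in sorted_input. sorted_input must already be sorted
// with respect to operator<.
inline std::vector<bool> search_each(const std::vector<int>& sorted_input,
                                     const std::vector<int>& values) {
  std::vector<bool> found(values.size());
  std::transform(values.begin(), values.end(), found.begin(), [&](int v) {
    return std::binary_search(sorted_input.begin(), sorted_input.end(), v);
  });
  return found;
}
```

Using the data listed in the Figure 18-8 description, and assuming that k is the sorted input and v is the search sequence, search_each(k, v) yields {false, true, false, true, true}: the values 6, 7, and 8 occur in k, while 1 and 3 do not.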

The code example shown in Figure 18-8 demonstrates the three typical steps when using oneDPL in conjunction with SYCL buffers:

  1. Create SYCL iterators from our buffers.

  2. Create a named policy from an existing policy.

  3. Invoke the parallel algorithm.

Using oneDPL with USM

In this section, we explore two ways to use oneDPL in combination with USM:

  • Through USM pointers

  • Through USM allocators

Unlike with buffers, we can directly use USM pointers as the iterators passed to an algorithm. Specifically, we can pass the pointers to the start and (one past the) end of the allocation to a parallel algorithm. It is important to be sure that the execution policy and the allocation itself were created for the same queue or context, to avoid undefined behavior at runtime. (Remember that this is not oneDPL specific, and we must always pay close attention to contexts when using USM!)

If the same USM allocation is to be processed by several algorithms, we can either use an in-order queue or explicitly wait for completion of each algorithm before using the same allocation in the next one (this is typical operation ordering when using USM). We should also be careful to ensure that we wait for completion before accessing the data on the host, as shown in Figure 18-9.

Figure 18-9
[Code listing not reproduced. In main, it creates a queue q and const int n = 10, allocates memory with malloc_host and malloc_device, and uses make_device_policy, wait, memcpy, and free before printing whether the check passed or failed.]

Using oneDPL with a USM pointer

Alternatively, we can use std::vector with a USM allocator as shown in Figure 18-10. With this approach, std::vector manages its own memory (as normal) but allocates any memory it needs via an internal call to sycl::malloc_shared. The begin() and end() member functions then return iterators that step through a USM allocation. This style of programming is very convenient, especially when migrating existing C++ code that already makes use of containers and algorithms.

Figure 18-10
[Code listing not reproduced. In main, it creates a queue q and const int n = 10, a USM allocator named alloc, and a vector, then uses fill with make_device_policy, begin, end, and wait.]

Using oneDPL with a USM allocator
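The mechanism that makes this work is the allocator template parameter of std::vector: every allocation the container performs is routed through the allocator's allocate function. The host-only sketch below illustrates that routing with a hypothetical counting_allocator that calls std::malloc where a USM allocator would request a USM allocation. It is an analogy for how the container and allocator interact, not a substitute for sycl::usm_allocator.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdlib>
#include <new>
#include <vector>

int alloc_calls = 0;  // counts allocations routed through the allocator

// Hypothetical minimal allocator. std::malloc stands in for the USM
// allocation call that a SYCL-aware allocator would make.
template <typename T>
struct counting_allocator {
  using value_type = T;

  counting_allocator() = default;
  template <typename U>
  counting_allocator(const counting_allocator<U>&) noexcept {}

  T* allocate(std::size_t n) {
    ++alloc_calls;
    void* p = std::malloc(n * sizeof(T));
    if (p == nullptr) throw std::bad_alloc();
    return static_cast<T*>(p);
  }
  void deallocate(T* p, std::size_t) noexcept { std::free(p); }
};

template <typename T, typename U>
bool operator==(const counting_allocator<T>&,
                const counting_allocator<U>&) noexcept {
  return true;
}
template <typename T, typename U>
bool operator!=(const counting_allocator<T>&,
                const counting_allocator<U>&) noexcept {
  return false;
}

// The container is used exactly as usual; only the allocator type differs.
inline std::vector<int, counting_allocator<int>> make_filled(std::size_t n,
                                                             int value) {
  std::vector<int, counting_allocator<int>> v(n);  // storage via allocate()
  std::fill(v.begin(), v.end(), value);
  return v;
}
```

Because the calling code only sees an ordinary std::vector with begin() and end(), existing code written against containers and algorithms needs little change beyond the allocator type.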

Error Handling with SYCL Execution Policies

As detailed in Chapter 5, the SYCL error handling model supports two types of errors. With synchronous errors, the runtime throws exceptions, while asynchronous errors are only processed by an asynchronous error handler at specified times during program execution.

For algorithms executed with SYCL-aware execution policies, the handling of all errors (synchronous or asynchronous) is the responsibility of the caller. Specifically,

  • No exceptions are thrown explicitly by algorithms.

  • Exceptions thrown by the runtime on the host CPU, including SYCL synchronous exceptions, are passed through to the caller.

  • SYCL asynchronous errors are not handled by oneDPL, so they must be handled (if any handling is desired) by the caller using the usual SYCL asynchronous exception mechanisms.

Summary

We should use libraries wherever practical in our heterogeneous applications, to avoid wasting time rewriting and testing common functions and parallel patterns. Leveraging the work of others simplifies application development and (often) delivers superior performance.

This chapter has briefly introduced three sets of library functionality that we think every SYCL developer should be familiar with:

  1. The SYCL built-in functions, for common math operations

  2. The standard C++ library, for other common operations

  3. The C++17 parallel algorithms (supported by oneDPL), for complete kernels

With any library, it is important to understand which devices, compilers, and implementations are tested and supported before relying upon them in production. This is not SYCL-specific advice, but worth remembering—the number of potential targets for a portable programming solution like SYCL is huge, and it is our responsibility as programmers to identify which libraries are aligned with our goals.