Parallel programming is not really about driving in the fast lane. It is actually about driving fast in all the lanes. This chapter is all about enabling us to put our code everywhere that we can. We choose to enable all the compute resources in a heterogeneous system whenever it makes sense. Therefore, we need to know where those compute resources are hiding (find them) and put them to work (execute our code on them).

We can control where our code executes—in other words, we can control which devices are used for which kernels. C++ with SYCL provides a framework for heterogeneous programming in which code can execute on a mixture of a host CPU and devices. The mechanisms which determine where code executes are important for us to understand and use.

This chapter describes where code can execute, when it will execute, and the mechanisms used to control the locations of execution. Chapter 3 will describe how to manage data so it arrives where we are executing our code, and then Chapter 4 returns to the code itself and discusses the writing of kernels.

Single-Source

C++ with SYCL programs are single-source, meaning that the same translation unit (typically a source file and its headers) contains both the code that defines the compute kernels to be executed on SYCL devices and also the host code that orchestrates execution of those kernels. Figure 2-1 shows these two code paths graphically, and Figure 2-2 provides an example application with the host and device code regions marked.

Combining both device and host code into a single-source file (or translation unit) can make it easier to understand and maintain a heterogeneous application. The combination also provides improved language type safety and can lead to more compiler optimizations of our code.

Figure 2-1. Single-source code contains both host code (runs on CPU) and device code (runs on SYCL devices)

Figure 2-2. Simple SYCL program
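The original listing is not reproduced here, but a minimal sketch in the spirit of Figure 2-2 follows. The structure is what matters: everything except the lambda passed to parallel_for is host code, and the acc[idx] = idx body is the device code. (Names such as data and size are illustrative, not taken from the original listing.)

#include <sycl/sycl.hpp>
#include <array>
#include <iostream>
using namespace sycl;

int main() {
  constexpr int size = 16;
  std::array<int, size> data;

  // Host code: create a queue on whatever device the implementation chooses
  queue q;

  // Host code: create a buffer using the host-allocated data array
  buffer buf{data};

  // Host code: submit a command group containing device code
  q.submit([&](handler& h) {
    accessor acc{buf, h, write_only};
    // Device code: each work-item writes its own index
    h.parallel_for(size, [=](auto idx) { acc[idx] = idx; });
  });

  // Host code: obtain access to the buffer on the host
  // (construction waits until the kernel result is available)
  host_accessor host_acc{buf, read_only};
  for (int i = 0; i < size; ++i)
    std::cout << host_acc[i] << " ";
  std::cout << "\n";
  return 0;
}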

Host Code

Applications contain C++ host code, which is executed by the CPU(s) on which the operating system has launched the application. Host code is the backbone of an application that defines and controls assignment of work to available devices. It is also the interface through which we define the data and dependences that should be managed by the SYCL runtime.

Host code is standard C++ augmented with SYCL-specific constructs and classes that may be implementable as a C++ library. This makes it easier to reason about what is allowed in host code (anything that is allowed in C++) and can simplify integration with build systems.

The host code in an application orchestrates data movement and compute offload to devices but can also perform compute-intensive work itself and can use libraries like any C++ application.

Device Code

Devices correspond to accelerators or processors that are conceptually independent from the CPU that is executing host code. An implementation may also expose the host processor as a device, as described later in this chapter, but the host processor and devices should be thought of as logically independent from each other. The host processor runs native C++ code, while devices run device code which includes some additional features and restrictions.

Queues are the mechanism through which work is submitted to a device for future execution. There are three important properties of device code to understand:

  1. It executes asynchronously from the host code. The host program submits device code to a device, and the runtime tracks and starts that work only when all dependences for execution are satisfied (more on this in Chapter 3). The host program execution carries on before the submitted work is started on a device, providing the property that execution on devices is asynchronous to host program execution, unless we explicitly tie the two together. As a side effect of this asynchronous execution, work on a device isn’t guaranteed to start until the host program forces execution to begin through various mechanisms that we cover in later chapters, such as host accessors and blocking queue wait operations.

  2. There are restrictions on device code to make it possible to compile and achieve performance on accelerator devices. For example, dynamic memory allocation and runtime type information (RTTI) are not supported within device code, because they would lead to performance degradation on many accelerators. The small set of device code restrictions is covered in detail in Chapter 10.

  3. Some functions and queries defined by SYCL are available only within device code, because they only make sense there, for example, work-item identifier queries that allow an executing instance of device code to query its position in a larger data-parallel range (described in Chapter 4).

In general, we will refer to work that is submitted to queues as actions. Actions include execution of device code on a device, but in Chapter 3 we will learn that actions also include memory movement commands. In this chapter, since we are concerned with the device code aspect of actions, we will be specific in mentioning device code much of the time.

Choosing Devices

To explore the mechanisms that let us control where device code will execute, we’ll look at five use cases:

  • Method#1: Running device code somewhere when we don’t care which device is used. This is often the first step in development because it is the simplest.

  • Method#2: Explicitly running device code on a CPU device, which is often used for debugging because most development systems have an accessible CPU. CPU debuggers are also typically very rich in features.

  • Method#3: Dispatching device code to a GPU or other accelerator.

  • Method#4: Dispatching device code to a heterogeneous set of devices, such as a GPU and an FPGA.

  • Method#5: Selecting specific devices from a more general class of devices, such as a specific type of FPGA from a collection of available FPGA types.

Developers will typically debug their code as much as possible with Method#2 and only move to Methods #3–#5 when code has been tested as much as is practical with Method#2.

Method#1: Run on a Device of Any Type

When we don’t care where our device code will run, it is easy to let the runtime pick for us. This automatic selection is designed to make it easy to start writing and running code, when we don’t yet care about what device is chosen. This device selection does not take into account the code to be run, so it should be considered an arbitrary choice that likely won’t be optimal.

Before talking about choice of a device, even one that the implementation has selected for us, we should first cover the mechanism through which a program interacts with a device: the queue.

Queues

A queue is an abstraction to which actions are submitted for execution on a single device. A simplified definition of the queue class is given in Figures 2-3 and 2-4. Actions are usually the launch of data-parallel compute, although other commands are also available such as manual control of data motion for when we want more control than the automatic movement provided by the SYCL runtime. Work submitted to a queue can execute after prerequisites tracked by the runtime are met, such as availability of input data. These prerequisites are covered in Chapters 3 and 8.

Figure 2-3. Simplified definition of some constructors of the queue class

Figure 2-4. Simplified definition of some key member functions in the queue class
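Rather than reproducing those simplified definitions, here is a hedged usage sketch exercising the constructors and member functions that Figures 2-3 and 2-4 describe (it assumes a GPU and a CPU device are present on the system):

#include <sycl/sycl.hpp>
using namespace sycl;

int main() {
  // Queue associated with a default device
  queue q1;

  // Queue associated with a device chosen by a device selector
  queue q2{gpu_selector_v};

  // Queue associated with an explicit device to which we hold a reference
  device dev{cpu_selector_v};
  queue q3{dev};

  // Queue associated with a device in a specific SYCL context
  context ctx = q3.get_context();
  queue q4{ctx, dev};

  // Submit a command group to the queue...
  q1.submit([&](handler& h) {
    h.single_task([=]() { /* device code */ });
  });

  // ...and wait for all previously submitted actions to finish executing
  q1.wait();
  return 0;
}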

A queue is bound to a single device, and that binding occurs on construction of the queue. It is important to understand that work submitted to a queue is executed on the single device to which that queue is bound. Queues cannot be mapped to collections of devices because that would create ambiguity on which device should perform work. Similarly, a queue cannot spread the work submitted to it across multiple devices. Instead, there is an unambiguous mapping between a queue and the device on which work submitted to that queue will execute, as shown in Figure 2-5.

Figure 2-5. A queue is bound to a single device. Work submitted to the queue executes on that device

Multiple queues may be created in a program, in any way that we desire for application architecture or programming style. For example, multiple queues may be created to each bind with a different device or to be used by different threads in a host program. Multiple different queues can be bound to a single device, such as a GPU, and submissions to those different queues will result in the combined work being performed on the device. An example of this is shown in Figure 2-6. Conversely, as we mentioned previously, a queue cannot be bound to more than one device because there must not be any ambiguity on where an action is being requested to execute. If we want a queue that will load balance work across multiple devices, for example, then we can create that abstraction in our code.

Figure 2-6. Multiple queues can be bound to a single device

Because a queue is bound to a specific device, queue construction is the most common way in code to choose the device on which actions submitted to the queue will execute. Selection of the device when constructing a queue is achieved through a device selector abstraction.

Binding a Queue to a Device When Any Device Will Do

Figure 2-7 is an example where the device that a queue should bind to is not specified. The default queue constructor that does not take any arguments (as in Figure 2-7) simply chooses some available device behind the scenes. SYCL guarantees that at least one device will always be available, so some device will always be selected by this default selection mechanism. In many cases the selected device may happen to be a CPU which is also executing the host program, although this is not guaranteed.

Figure 2-7. Implicit default device selector through default construction of a queue (the example output shows a different selected device across systems and runs, e.g., an NVIDIA GeForce RTX 3060 or an AMD Radeon RX 5700 XT)
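A sketch of the pattern behind Figure 2-7 (the output formatting is illustrative; the device chosen will differ from system to system):

#include <sycl/sycl.hpp>
#include <iostream>
using namespace sycl;

int main() {
  // Implicit use of the default selector: the implementation picks a device
  queue q;

  std::cout << "Selected device: "
            << q.get_device().get_info<info::device::name>() << "\n";
  return 0;
}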

Using the trivial queue constructor is a simple way to begin application development and to get device code up and running. More control over selection of the device bound to a queue can be added as it becomes relevant for our application.

Method#2: Using a CPU Device for Development, Debugging, and Deployment

A CPU device can be thought of as enabling the host CPU to act as if it was an independent device, allowing our device code to execute regardless of the accelerators available in a system. We always have some processor running the host program, so a CPU device is therefore usually available to our application (very occasionally a CPU might not be exposed as a SYCL device by an implementation, for a variety of reasons). Using a CPU device for code development has a few advantages:

  1. Development of device code on less capable systems that don’t have any accelerators: One common use is development and testing of device code on a local system, before deploying to an HPC cluster for performance testing and optimization.

  2. Debugging of device code with non-accelerator tooling: Accelerators are often exposed through lower-level APIs that may not have debug tooling as advanced as is available for host CPUs. With this in mind, a CPU device often supports debugging using standard tools familiar to developers.

  3. Backup if no other devices are available, to guarantee that device code can be executed functionally: A CPU device may not have performance as a primary goal, or may not match the architecture for which kernel code was optimized, but can often be considered as a functional backup to ensure that device code can always execute in any application.

It should not be a surprise if multiple CPU devices are available to a SYCL application, with some aimed at ease of debugging while others are focused on execution performance. Device aspects can be used to differentiate between these different CPU devices, as described later in this chapter.

When considering use of a CPU device for development and debugging of device code, some consideration should be given to differences between the CPU and a target accelerator architecture (e.g., GPU). Especially when optimizing code performance, and particularly when using more advanced features such as sub-groups, there can be some differences in functionality and performance across architectures. For example, the sub-group size may change when moving to a new device. Most development and debugging can typically occur on a CPU device, sometimes followed by final tuning and debugging on the target device architecture.

A CPU device is functionally like a hardware accelerator in that a queue can bind to it and it can execute device code. Figure 2-8 shows how the CPU device is a peer to other accelerators that might be available in a system. It can execute device code, in the same way that a GPU or FPGA is able to, and can have one or more queues constructed that bind to it.

Figure 2-8. A CPU device can execute device code like any accelerator

An application can choose to create a queue that is bound to a CPU device by explicitly passing cpu_selector_v to a queue constructor, as shown in Figure 2-9.

Figure 2-9. Selecting the CPU device using the cpu_selector_v (example output: Selected device: Intel(R) Xeon(R) Gold 6128 CPU @ 3.40GHz; Device vendor: Intel(R) Corporation)
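A sketch of the pattern behind Figure 2-9 (the output formatting is illustrative):

#include <sycl/sycl.hpp>
#include <iostream>
using namespace sycl;

int main() {
  // Explicitly bind the queue to a CPU device
  queue q{cpu_selector_v};

  std::cout << "Selected device: "
            << q.get_device().get_info<info::device::name>() << "\n";
  std::cout << "Device vendor: "
            << q.get_device().get_info<info::device::vendor>() << "\n";
  return 0;
}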

Even when not specifically requested (e.g., using cpu_selector_v), the CPU device might happen to be chosen by the default selector as occurred in the output in Figure 2-7.

A few variants of device selectors are defined to make it easy for us to target a type of device. The cpu_selector_v is one example of these selectors, and we’ll get into others in the coming sections.

Method#3: Using a GPU (or Other Accelerators)

GPUs are showcased in the next example, but any type of accelerator applies equally. To make it easy to target common classes of accelerators, devices are grouped into several broad categories, and SYCL provides built-in selector classes for them. To choose from a broad category of device type such as “any GPU available in the system,” the corresponding code is very brief, as described in this section.

Accelerator Devices

In the terminology of the SYCL specification, there are a few broad groups of accelerator types:

  1. CPU devices.

  2. GPU devices.

  3. Accelerators, which capture devices that don’t identify as either a CPU device or a GPU. This includes FPGA and DSP devices.

A device from any of these categories is easy to bind to a queue using built-in selectors, which can be passed to queue (and some other class) constructors.

Device Selectors

Classes that must be bound to a specific device, such as the queue class, have constructors that can accept a DeviceSelector. A DeviceSelector is a callable that takes a const reference to a device and ranks it numerically, so that the runtime can choose a device with the highest ranking. For example, one queue constructor which accepts a DeviceSelector is

queue(const DeviceSelector &deviceSelector,
      const property_list &propList = {});

There are four built-in selectors for the broad classes of common devices:

  • default_selector_v: Any device of the implementation’s choosing

  • cpu_selector_v: Select a device that identifies itself as a CPU in device queries

  • gpu_selector_v: Select a device that identifies itself as a GPU in device queries

  • accelerator_selector_v: Select a device that identifies itself as an “accelerator,” which includes FPGAs

One additional selector, included in DPC++ (it is not part of the SYCL standard), is available by including the header "sycl/ext/intel/fpga_extensions.hpp":

  • ext::intel::fpga_selector_v: Select a device that identifies itself as an FPGA

A queue can be constructed using one of the built-in selectors, such as

queue myQueue{ gpu_selector_v };

Figure 2-10 shows a complete example using the GPU selector, and Figure 2-11 shows the corresponding binding of a queue with an available GPU device.

Figure 2-12 shows an example using a variety of built-in selectors and demonstrates use of device selectors with another class (device) that accepts a device selector on construction.

Figure 2-10. GPU device selector example (example output: Selected device: AMD Radeon RX 5700 XT; Device vendor: AMD Corporation)

Figure 2-11. Queue bound to a GPU device available to the application

Figure 2-12. Example device identification output from various classes of device selectors and demonstration that device selectors can be used for construction of more than just a queue (in this case, construction of a device class instance)
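A sketch of the idea in Figure 2-12, constructing device objects (not just queues) from the various built-in selectors. The fpga_selector_v line assumes the DPC++ extension header is available and an FPGA (or its emulator) is installed, and the helper name output_dev_info is illustrative:

#include <sycl/sycl.hpp>
#include <sycl/ext/intel/fpga_extensions.hpp>  // DPC++ extension for fpga_selector_v
#include <iostream>
#include <string>
using namespace sycl;

void output_dev_info(const device& dev, const std::string& selector_name) {
  std::cout << selector_name << ": Selected device: "
            << dev.get_info<info::device::name>() << "\n"
            << "    -> Device vendor: "
            << dev.get_info<info::device::vendor>() << "\n";
}

int main() {
  output_dev_info(device{default_selector_v},     "default_selector_v");
  output_dev_info(device{cpu_selector_v},         "cpu_selector_v");
  output_dev_info(device{gpu_selector_v},         "gpu_selector_v");
  output_dev_info(device{accelerator_selector_v}, "accelerator_selector_v");
  output_dev_info(device{ext::intel::fpga_selector_v}, "fpga_selector_v");
  return 0;
}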

When Device Selection Fails

If a GPU selector is used when creating an object such as a queue and there are no GPU devices available to the runtime, then the selector throws a runtime_error exception. The same is true for all of the device selectors: if no device of the required class is available, a runtime_error exception is thrown. It is reasonable for complex applications to catch that error and instead acquire a less desirable (for the application/algorithm) device class as an alternative. Exceptions and error handling are discussed in more detail in Chapter 5.
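As a sketch of that fallback pattern (catching sycl::exception, which covers the error raised when a selector finds no matching device):

#include <sycl/sycl.hpp>
#include <iostream>
using namespace sycl;

int main() {
  queue q = [] {
    try {
      // Prefer a GPU if one is available...
      return queue{gpu_selector_v};
    } catch (const sycl::exception&) {
      // ...otherwise fall back to whatever the default selector finds
      std::cout << "No GPU found, using the default device instead\n";
      return queue{default_selector_v};
    }
  }();

  std::cout << "Selected device: "
            << q.get_device().get_info<info::device::name>() << "\n";
  return 0;
}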

Method#4: Using Multiple Devices

As shown in Figures 2-5 and 2-6, we can construct multiple queues in an application. We can bind these queues to a single device (the sum of work to the queues is funneled into the single device), to multiple devices, or to some combination of these. Figure 2-13 provides an example that creates one queue bound to a GPU and another queue bound to an FPGA. The corresponding mapping is shown graphically in Figure 2-14.

Figure 2-13. Creating queues to both GPU and FPGA devices (example output: Selected device 1: Intel(R) UHD Graphics [0x9a60]; Selected device 2: pac_a10 : Intel PAC Platform (pac_ee00000))
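A sketch of the pattern behind Figure 2-13 (it assumes a DPC++ toolchain for the FPGA selector and that both device types are present; the queue names match those shown in Figure 2-14):

#include <sycl/sycl.hpp>
#include <sycl/ext/intel/fpga_extensions.hpp>  // DPC++ extension for fpga_selector_v
#include <iostream>
using namespace sycl;

int main() {
  // One queue bound to a GPU and another bound to an FPGA
  queue my_gpu_queue{gpu_selector_v};
  queue my_fpga_queue{ext::intel::fpga_selector_v};

  std::cout << "Selected device 1: "
            << my_gpu_queue.get_device().get_info<info::device::name>() << "\n";
  std::cout << "Selected device 2: "
            << my_fpga_queue.get_device().get_info<info::device::name>() << "\n";
  return 0;
}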

Figure 2-14. GPU + FPGA device selector example: One queue is bound to a GPU and another to an FPGA

Method#5: Custom (Very Specific) Device Selection

We will now look at how to write a custom selector. In addition to examples in this chapter, there are a few more examples shown in Chapter 12. The built-in device selectors are intended to let us get code up and running quickly. Real applications usually require specialized selection of a device, such as picking a desired GPU from a set of GPU types available in a system. The device selection mechanism is easily extended to arbitrarily complex logic, so we can write whatever code is required to choose the device that we prefer.

Selection Based on Device Aspects

SYCL defines properties of devices known as aspects. For example, some aspects that a device might exhibit (return true on aspect queries) are gpu, host_debuggable, fp64, and online_compiler. Please refer to the “Device Aspects” section of the SYCL specification for a full list of standard aspects and their definitions.

To select a device using aspects defined in SYCL, the aspect_selector can be used as shown in Figure 2-15. In the form of aspect_selector taking a comma-delimited group of aspects, all aspects must be exhibited by a device for the device to be selected. An alternate form of aspect_selector takes two std::vectors. The first vector contains aspects that must be present in a device, and the second vector contains aspects that must not be present in a device (lists negative aspects). Figure 2-15 shows an example of using both of these forms of aspect_selector.

Figure 2-15. Aspect selector (example output: the first selected device is an Intel(R) UHD Graphics GPU, the second an 11th Gen Intel(R) Core(TM) i9-11900K CPU)
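A sketch of both forms of aspect_selector (the particular aspect combinations here are illustrative, not necessarily those used in the original listing):

#include <sycl/sycl.hpp>
#include <iostream>
#include <vector>
using namespace sycl;

int main() {
  // Comma-delimited form: every listed aspect must be exhibited by the device
  // (here: a GPU that supports double precision)
  queue q1{aspect_selector(aspect::gpu, aspect::fp64)};

  // Two-vector form: the first vector lists required aspects, the second
  // lists aspects that must NOT be present (here: fp64 support, but not a GPU)
  queue q2{aspect_selector(std::vector<aspect>{aspect::fp64},
                           std::vector<aspect>{aspect::gpu})};

  std::cout << "First selected device: "
            << q1.get_device().get_info<info::device::name>() << "\n";
  std::cout << "Second selected device: "
            << q2.get_device().get_info<info::device::name>() << "\n";
  return 0;
}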

Some aspects may be used to infer performance characteristics of a device. For example, any device with the emulated aspect may not perform as well as a device of the same type that is not emulated, but may instead exhibit other aspects related to improved debuggability.

Selection Through a Custom Selector

When existing aspects aren’t sufficient for selection of a specific device, a custom device selector may be defined. Such a selector is simply a C++ callable (e.g., a function or lambda) that takes a const device& as a parameter and that returns an integer score for the specific device. The SYCL runtime invokes the selector on all available root devices that can be found and chooses the device for which the selector returned the highest score (which must be nonnegative for selection to occur).

In cases where there is a tie for the highest score, the SYCL runtime will choose one of the tied devices. No device for which the selector returned a negative number will be chosen by the runtime, so returning a negative number from a selector guarantees that the device will not be selected.

Mechanisms to Score a Device

We have many options to create an integer score corresponding to a specific device, such as the following:

  1. Return a positive value for a specific device class.

  2. String match on a device name and/or device vendor strings.

  3. Compute anything that we can imagine leading to an integer value, based on device or platform queries.

For example, one possible approach to select a specific Intel Arria FPGA accelerator board is shown in Figure 2-16.

Figure 2-16. Custom selector for a specific Intel Arria FPGA accelerator board (example output: Selected device: pac_a10 : Intel PAC Platform (pac_ee00000))
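A sketch of such a selector (the scoring logic and string matches are illustrative; any scheme that produces an integer score works):

#include <sycl/sycl.hpp>
#include <iostream>
#include <string>
using namespace sycl;

int main() {
  // Score each device: prefer an Intel Arria FPGA board, reject everything else.
  // A negative score guarantees that a device will not be selected.
  auto my_selector = [](const device& dev) {
    if (dev.get_info<info::device::name>().find("Arria") != std::string::npos &&
        dev.get_info<info::device::vendor>().find("Intel") != std::string::npos)
      return 1;
    return -1;
  };

  queue q{my_selector};
  std::cout << "Selected device: "
            << q.get_device().get_info<info::device::name>() << "\n";
  return 0;
}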

Chapter 12 has more discussion and examples for device selection and discusses the get_info method in more depth.

Creating Work on a Device

Applications usually contain a combination of both host code and device code. There are a few class members that allow us to submit device code for execution, and because these work dispatch constructs are the only way to submit device code, they allow us to easily distinguish device code from host code.

The remainder of this chapter introduces some of the work dispatch constructs, with the goal to help us understand and identify the division between device code and host code that executes natively on the host processor.

Introducing the Task Graph

A fundamental concept in the SYCL execution model is a graph of nodes. Each node (unit of work) in this graph contains an action to be performed on a device, with the most common action being a data-parallel device kernel invocation. Figure 2-17 shows an example graph with four nodes, where each node can be thought of as a device kernel invocation.

Figure 2-17. The task graph defines actions to perform (asynchronously from the host program) on one or more devices and also dependences that determine when an action is safe to execute

The nodes in Figure 2-17 have dependence edges defining when it is legal for a node’s work to begin execution. The dependence edges are most commonly generated automatically from data dependences, although there are ways for us to manually add additional custom dependences when we want to. Node B in the graph, for example, has a dependence edge from node A. This edge means that node A must complete execution, and most likely (depending on specifics of the dependence) make generated data available on the device where node B will execute before node B’s action is started. The runtime controls resolution of dependences and triggering of node executions completely asynchronously from the host program’s execution. The graph of nodes defining an application will be referred to in this book as the task graph and is covered in more detail in Chapter 3.

Where Is the Device Code?

There are multiple mechanisms that can be used to define code that will be executed on a device, but a simple example shows how to identify such code. Even if the pattern in the example appears complex at first glance, it remains the same across all device code definitions and quickly becomes second nature.

The code passed as the final argument to the parallel_for, defined as a lambda expression in Figure 2-18, is the device code to be executed on a device. The parallel_for in this case is the construct that lets us distinguish device code from host code. The parallel_for is one of a small set of device dispatch mechanisms, all members of the handler class, that define the code to be executed on a device. A simplified definition of the handler class is given in Figure 2-19.

Figure 2-18. Submission of device code

Figure 2-19. Simplified definition of member functions in the handler class

In addition to calling members of the handler class to submit device code, there are also members of the queue class that allow work to be submitted. The queue class members shown in Figure 2-20 are shortcuts that simplify certain patterns, and we will see these shortcuts used in future chapters.

Figure 2-20. Simplified definition of member functions in the queue class that act as shorthand notation for equivalent functions in the handler class
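A sketch using one of these queue shortcuts; it launches a kernel directly from the queue without writing an explicit command group (the example assumes the device supports USM shared allocations):

#include <sycl/sycl.hpp>
#include <iostream>
using namespace sycl;

int main() {
  constexpr size_t size = 16;
  queue q;

  // USM shared allocation, visible to both host and device
  int* data = malloc_shared<int>(size, q);

  // Queue shortcut: parallel_for submitted directly on the queue,
  // equivalent to a submit() with a command group containing only the kernel
  q.parallel_for(size, [=](id<1> idx) {
    data[idx[0]] = static_cast<int>(idx[0]);
  }).wait();

  for (size_t i = 0; i < size; ++i) std::cout << data[i] << " ";
  std::cout << "\n";

  free(data, q);
  return 0;
}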

Actions

The code in Figure 2-18 contains a parallel_for, which defines work to be performed on a device. The parallel_for is within a command group (CG) submitted to a queue, and the queue defines the device on which the work is to be performed. Within the command group, there are two categories of code:

  1. Host code that sets up dependences defining when it is safe for the runtime to start execution of the work defined in (2), such as creation of accessors to buffers (described in Chapter 3)

  2. At most one call to an action that either queues device code for execution or performs a manual memory operation such as copy

The handler class contains a small set of member functions that define the action to be performed when a task graph node is executed. Figure 2-21 summarizes these actions.

Figure 2-21. Actions that invoke device code (handler methods single_task and parallel_for) or perform explicit memory operations (handler methods copy, update_host, and fill)

At most one action from Figure 2-21 may be called within a command group (it is an error to call more than one), and only a single command group can be submitted to a queue per submit call. The result of this is that a single (or potentially no) operation from Figure 2-21 exists per task graph node, to be executed when the node dependences are met and the runtime determines that it is safe to execute.

A command group must have at most one action within it, such as a kernel launch or explicit memory operation.

The idea that code is executed asynchronously in the future is the critical difference between code that runs on the CPU as part of the host program and device code that will run in the future when dependences are satisfied. A command group usually contains code from each category, with the code that defines dependences running as part of the host program (so that the runtime knows what the dependences are) and device code running in the future once the dependences are satisfied.

There are three classes of code in Figure 2-22:

  1. Host code: Drives the application, including creating and managing data buffers and submitting work to queues to form new nodes in the task graph for asynchronous execution.

  2. Host code within a command group: This code is run on the processor that the host code is executing on and executes immediately, before the submit call returns. This code sets up the node dependences by creating accessors, for example. Any arbitrary CPU code can execute here, but best practice is to restrict it to code that configures the node dependences.

  3. An action: Any action listed in Figure 2-21 can be included in a command group, and it defines the work to be performed asynchronously in the future when node requirements are met (set up by (2)).

Figure 2-22. Submission of device code

To understand when code in an application will run, note that anything passed to an action listed in Figure 2-21 that initiates device code execution, or an explicit memory operation listed in Figure 2-21, will execute asynchronously in the future when the SYCL task graph (described later) node dependences have been satisfied. All other code runs as part of the host program immediately, as expected in typical C++ code.

It is important to note that although device code can start running (asynchronously) when task graph node dependences have been met, device code is not guaranteed to start running at that point. The only way to be sure that device code will start executing is to have the host program wait for (block on) results from the device code execution, through mechanisms such as host accessors or queue wait operations, which we cover in later chapters. Without such host blocking operations, the SYCL and lower-level runtimes make decisions on when to start execution of device code, possibly optimizing for objectives other than “run as soon as possible” such as optimizing for power or congestion.

Host Tasks

In general, the code executed by an action submitted to a queue (such as through parallel_for) is device code, following a few language restrictions that allow it to run efficiently on many architectures. There is one important deviation, though, which is accessed through a handler method named host_task. This method allows arbitrary C++ code to be submitted as an action in the task graph, to be executed on the host once any task graph dependences have been satisfied.

Host tasks are important in some programs for two reasons:

  1. Arbitrary C++ can be included, even std::cout or printf. This can be important for easy debugging, interoperability with lower-level APIs such as OpenCL, or for incrementally enabling the use of accelerators in existing code.

  2. Host tasks execute asynchronously as part of the task graph, instead of synchronously with the host program. Although a host program can launch additional threads or use other task parallelism approaches, host tasks integrate with the dependence tracking mechanisms of the SYCL runtime. This can be very convenient and may result in higher performance when device and host code need to be interspersed.

Figure 2-23. A simple host_task (example output: the host_task loop and the loop in main each print the values 0 through 3)

Figure 2-23 demonstrates a simple host task, which outputs text using std::cout when the task graph dependences have been met. Remember that the host task is executed asynchronously from the rest of the host program. This is a powerful part of the task graph mechanism in which the SYCL runtime schedules work when it is safe to do so, without interaction from the host program which may instead continue with other work. Also note that the code body of the host task does not need to follow any restrictions that are imposed on device code (described in Chapter 10).

The example in Figure 2-23 uses events (described in Chapter 3) to create a dependence between the device code submission and a later host task, but host tasks can also be used with accessors (also covered in Chapter 3) through a special accessor template parameterization, target::host_task (Chapter 7).
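A sketch along the lines of Figure 2-23, using a USM shared allocation for brevity (it assumes a GPU that supports USM shared allocations is present; any capable queue would work for this sketch):

#include <sycl/sycl.hpp>
#include <iostream>
using namespace sycl;

int main() {
  constexpr size_t size = 4;
  queue q{gpu_selector_v};

  std::cout << "Selected device: "
            << q.get_device().get_info<info::device::name>() << "\n";

  int* shared = malloc_shared<int>(size, q);

  // Initialize values in the shared allocation on the device
  auto e = q.parallel_for(size, [=](id<1> idx) {
    shared[idx[0]] = static_cast<int>(idx[0]);
  });

  // Host task: runs arbitrary C++ on the host, asynchronously,
  // as part of the task graph once the kernel above has completed
  q.submit([&](handler& h) {
    h.depends_on(e);
    h.host_task([=]() {
      for (size_t i = 0; i < size; ++i)
        std::cout << "host_task: " << shared[i] << "\n";
    });
  });

  q.wait();  // ensure the host task has finished before reading on the host

  for (size_t i = 0; i < size; ++i)
    std::cout << "main: " << shared[i] << "\n";

  free(shared, q);
  return 0;
}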

Summary

In this chapter we provided an overview of queues, selection of the device with which a queue will be associated, and how to create custom device selectors. We also overviewed the code that executes on a device asynchronously when dependences are met vs. the code that executes as part of the C++ application host code. Chapter 3 describes how to control data movement.