In this chapter, we look at the advanced concept of making our program more flexible and therefore more portable. We do this by looking at mechanisms to match the capabilities of whatever system (and accelerators) our application might be executed upon with the kernels and code that we have written. This is an advanced topic because we can always simply “use the default accelerator” and run the kernels we write on it, regardless of what it is. We have learned that this works even on systems that may have no accelerator, because SYCL guarantees there is always a device available to run a kernel, even if it is the CPU that is also running our host application.

When we move beyond “use the default accelerator” and general-purpose kernels, we find mechanisms are available to choose which device(s) to use, and mechanisms to create more specialized kernels. We discuss both capabilities in this chapter. Together, these two capabilities allow us to construct applications that are highly adaptable to the system on which they are executed.

Fortunately, the creators of the SYCL specification thought about these needs and gave us interfaces to let us solve this problem. The SYCL specification defines a device class that encapsulates a device on which kernels may be executed. We first cover the ability to query the device class, so that our program can adapt to the device characteristics and capabilities. We may occasionally choose to write different algorithms for different devices. Later in this chapter, we learn that we can apply aspects to a kernel to specialize a kernel and let a compiler take advantage of that. Such specialization helps make a kernel more tailored to a certain class of devices while likely rendering it unsuitable for other devices. Combining these concepts allows us to adapt our program as much, or as little, as we wish. This ensures we can decide how much investment to make in squeezing out performance while starting with broad portability.

Is There a GPU Present?

Many of us will start with having logic to figure out “Is there a GPU present?” to inform the choices our program will make as it executes. That is the start of what this chapter covers. As we will see, there is much more information available to help us make our programs robust and performant.

Parameterizing a program can help with correctness, functional portability, performance portability, and future proofing.

This chapter dives into the most important queries and how to use them effectively in our programs. Implementations doubtlessly offer more detailed properties that we can query. To learn all possible queries, we would need to review the latest SYCL specification, the documentation for our particular compiler, and documentation for any runtimes/drivers we may encounter.

Device-specific properties are queryable using get_info functions, including access to device-specific kernel and work-group properties.

Refining Kernel Code to Be More Prescriptive

It is useful to consider that our coding, kernel by kernel, will fall broadly into one of these three categories:

  • Generic kernel code: Run anywhere, not tuned to a specific class of device.

  • Device type–specific kernel code: Run on a type of device (e.g., GPU, CPU, FPGA), not tuned to specific models of a device type. This is particularly useful because many device types share common features, so it is safe to make some assumptions that would not apply to fully general code written for all devices.

  • Tuned device-specific kernel code: Run on a type of device, with tuning that reacts to specific parameters of a device—this covers a broad range of possibilities from a small amount of tuning to very detailed optimization work.

    It is our job as programmers to determine when different patterns are needed for different device types. We dedicate Chapters 14, 15, 16, and 17 to illuminating this important thinking.

It is most common to start by focusing on getting things working with a functionally correct implementation of a generic kernel. Chapter 2 specifically talks about what methods are easiest to debug when getting started with a kernel implementation. Once we have a kernel working, we may evolve it to target the capabilities of a specific device type or device model.

Chapter 14 offers a framework of thinking to consider parallelism first, before we dive into device considerations. It is our choice of pattern (a.k.a. algorithm) that dictates our code, and it is our job as programmers to determine when different patterns are needed for different devices. Chapters 15 (GPU), 16 (CPU), and 17 (FPGA) dive more deeply into the qualities that distinguish these device types and motivate a choice in pattern to use. It is these qualities that motivate us to consider writing distinct versions of kernels when the best approach (pattern choice) varies on different device types.

When we have a kernel written for a specific type of device (e.g., a GPU, CPU, or FPGA), it is logical to adapt it to specific vendors or even models of such devices. Good coding style is to parameterize code based on features (e.g., item size support found from a device query).

We should write code to query parameters that describe the actual capabilities of a device instead of its marketing information; it is bad programming practice to query the model number of a device and react to that—such code is less portable because it is not future-proof.
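As a hedged illustration of this advice (the commented-out model string and the work-group size are hypothetical), reacting to a queried capability rather than a model name might look like this:

  #include <sycl/sycl.hpp>
  #include <algorithm>
  #include <iostream>

  int main() {
    sycl::device dev{sycl::default_selector_v};

    // Less portable: reacting to a marketing/model string (avoid this).
    // if (dev.get_info<sycl::info::device::name>() == "Hypothetical GPU 3000") { ... }

    // More portable: react to the queried capability itself.
    constexpr size_t desired_wg = 512;  // hypothetical tuning choice
    size_t wg = std::min(desired_wg,
                         dev.get_info<sycl::info::device::max_work_group_size>());
    std::cout << "chosen work-group size: " << wg << "\n";
  }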

It is common to write a different kernel for each device type that we want to support (a GPU version of a kernel and an FPGA version of a kernel and maybe a generic version of a kernel). When we get more specific, to support a specific device vendor or even device model, we may benefit when we can parameterize a kernel rather than duplicate it. We are free to do either, as we see fit. Code cluttered with too many parameter adjustments may be hard to read or excessively burdened at runtime. It is common, however, that parameters can fit neatly into a single version of a kernel.

Parameterizing makes the most sense when the algorithm is broadly the same but has been tuned for the capabilities of a specific device. Writing a different kernel is much cleaner when using a completely different approach, pattern, or algorithm.

How to Enumerate Devices and Capabilities

Chapter 2 enumerates and explains five methods for choosing a device on which to execute. Essentially, Method#1 was the least prescriptive (“run it somewhere”), and the methods evolve to the most prescriptive Method#5, which considers executing on a fairly precise model of a device from a family of devices. The methods in between offer a mix of flexibility and prescriptiveness. Figure 12-1, Figure 12-2, and Figure 12-4 help to illustrate how we can select a device.

Figure 12-1 shows that even if we allow the implementation to select a default device for us (Method#1 in Chapter 2), we can still query for information about the selected device.

Figure 12-2 shows how we can try to set up a queue using a specific device (in this case, a GPU), but fall back explicitly on the default device if no GPU is available. This gives us some control of our device choice by biasing us to get a GPU whenever one is available. We know that at least one device is always guaranteed to exist so our kernels can always run in a properly configured system. When there is no GPU, many systems will default to a CPU device but there is no guarantee. Likewise, if we ask for a CPU device explicitly, there is no guarantee there is such a device (but we are guaranteed that some device will exist).

It is not recommended that we use the solution shown in Figure 12-2. In addition to appearing a little scary and error prone, Figure 12-2 does not give us control over which GPU is selected if there are choices of GPUs at runtime. Despite being both instructive and functional, there is a better way. It is recommended that we write custom device selectors as shown in the next code example (Figure 12-4).

Figure 12-1. Device we have been assigned by default (example output from five different systems)
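Since the figure is reproduced here only by its caption, the following is a minimal sketch of the kind of query Figure 12-1 performs (not the book's exact code):

  #include <sycl/sycl.hpp>
  #include <iostream>

  int main() {
    sycl::queue q;  // let the implementation pick the default device
    std::cout << "By default, we are running on "
              << q.get_device().get_info<sycl::info::device::name>() << "\n";
  }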

Queries about devices rely on installed software (special user-level drivers) to respond regarding a device. SYCL relies on this, just as an operating system needs drivers to access hardware—it is not sufficient that the hardware simply be installed in a machine.

Figure 12-2. Using try-catch to select a GPU device if possible, use the default device if not (example outputs from four systems with a GPU and one without)
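A hedged sketch of this try-catch pattern (not the exact code of Figure 12-2) might look like:

  #include <sycl/sycl.hpp>
  #include <iostream>

  int main() {
    bool gpu_available = true;
    sycl::queue q;
    try {
      q = sycl::queue{sycl::gpu_selector_v};  // throws if no GPU device is found
    } catch (const sycl::exception &) {
      gpu_available = false;
      q = sycl::queue{sycl::default_selector_v};  // fall back to the default device
    }
    std::cout << "GPU available: " << std::boolalpha << gpu_available << "\n"
              << "Running on: "
              << q.get_device().get_info<sycl::info::device::name>() << "\n";
  }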

Aspects

The SYCL standard has a small list of device aspects that can be used to understand the capabilities of a device, to control which devices we choose to use, and to control which kernels we submit to a device. At the end of this chapter, we will discuss “kernel specialization” and kernel templating. For now, we will enumerate the aspects and how to use them in device queries and selection. Figure 12-3 lists aspects that are defined by the SYCL standard to be available for use in every C++ program using SYCL. Aspects are Boolean—a device either has or does not have an aspect. The first four (cpu/gpu/accelerator/custom) are mutually exclusive since device types are defined as an enum by SYCL 2020. Features including aspect::fp16, aspect::fp64, and aspect::atomic64 are “optional features” so they may not be supported by all devices—testing for these can be especially important for a robust application.

Figure 12-3. Aspects defined by the SYCL standard (implementations can add more)

Custom Device Selector

Figure 12-4 uses a custom device selector. Custom device selectors were first discussed in Chapter 2 as Method#5 for choosing where our code runs (Figure 2-16). The custom device selector evaluates each device available to the application. A particular device is selected based on receiving the highest score; if no device receives a non-negative score, no device is selected and queue construction throws an exception. In this example, we will have a little fun with our selector:

  • Reject non-GPUs (return -1).

  • Favor GPUs with a vendor name including the word “ACME” (return 24 if Martian, 824 otherwise).

  • Any other non-Martian GPU is a good one (return 799).

  • Martian GPUs, which are not ACME, are rejected (return -1).

The next section, “Being Curious: get_info<>,” dives into the rich information that get_devices(), get_platforms(), and get_info<> offer. Those interfaces open up any type of logic we might want to utilize to pick our devices, including the simple vendor name checks shown in Figure 2-16 and Figure 12-4.

Figure 12-4. Custom device selector—our preferred solution (example outputs from four systems with a GPU and one without)
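Because Figure 12-4 is shown only by its caption here, the following is a rough sketch of a similar custom selector; the vendor strings (“ACME”, “Martian”) are the chapter's fictional examples, and the scores mirror the list above:

  #include <sycl/sycl.hpp>
  #include <iostream>
  #include <string>

  // Score devices: reject non-GPUs and non-ACME Martian GPUs, favor ACME GPUs.
  int my_gpu_selector(const sycl::device &dev) {
    if (!dev.is_gpu()) return -1;  // reject non-GPUs
    std::string vendor = dev.get_info<sycl::info::device::vendor>();
    bool is_acme = vendor.find("ACME") != std::string::npos;
    bool is_martian = vendor.find("Martian") != std::string::npos;
    if (is_acme) return is_martian ? 24 : 824;  // favor ACME GPUs
    return is_martian ? -1 : 799;               // other non-Martian GPUs are good
  }

  int main() {
    try {
      sycl::queue q{my_gpu_selector};
      std::cout << "Selected: "
                << q.get_device().get_info<sycl::info::device::name>() << "\n";
    } catch (const sycl::exception &) {
      std::cout << "No acceptable GPU found (all scores were negative).\n";
    }
  }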

Being Curious: get_info<>

In order for our program to “know” what devices are available at runtime, we can have our program query available devices from the device class, and then we can learn more details using get_info<> to inquire about a specific device. We provide a simple program, called curious (see Figure 12-5), that uses these interfaces to dump out information for us to look at directly. This can be especially useful for doing a sanity check when developing or debugging a program that uses these interfaces. Failure of this program to work as expected can often tell us that the software drivers we need are not installed correctly. Figure 12-6 shows a sample output from this program, with the high-level information about the devices that are present.

You may want to see if your system supports a utility such as sycl-ls before you write your own “list all available SYCL devices” program.

Figure 12-5. Simple use of device query mechanisms: curious.cpp
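A minimal sketch in the spirit of curious.cpp (the loop structure is assumed from the figure's description):

  #include <sycl/sycl.hpp>
  #include <iostream>

  int main() {
    // Enumerate every platform the runtime can see, and every device within it.
    for (const auto &platform : sycl::platform::get_platforms()) {
      std::cout << "Found platform: "
                << platform.get_info<sycl::info::platform::name>() << "\n";
      for (const auto &device : platform.get_devices()) {
        std::cout << "  Found device: "
                  << device.get_info<sycl::info::device::name>() << "\n";
      }
    }
  }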

Figure 12-6. Example output from curious.cpp (listing found platforms and their devices on several systems)

Being More Curious: Detailed Enumeration Code

We offer a program, which we have named verycurious.cpp (Figure 12-7), to illustrate some of the detailed information available using get_info. Again, we find ourselves writing code like this to help when developing or debugging a program.

Now that we have shown how to access the information, we will discuss the information fields that prove the most important to query and act upon in applications.

Figure 12-7. More detailed use of device query mechanisms: verycurious.cpp (subset shown)

Very Curious: get_info plus has()

The has() interface allows a program to test directly for a feature using aspects listed in Figure 12-3. Simple usage is shown in Figure 12-7—with more in the full verycurious.cpp source code in the book's GitHub repository. The verycurious.cpp program is helpful for seeing the details about devices on your system.
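As a hedged sketch (the aspects chosen here are just examples), combining get_info<> with has() might look like:

  #include <sycl/sycl.hpp>
  #include <iostream>

  int main() {
    for (const auto &dev : sycl::device::get_devices()) {
      std::cout << dev.get_info<sycl::info::device::name>() << "\n";
      if (dev.has(sycl::aspect::fp16))
        std::cout << "  supports fp16\n";
      if (dev.has(sycl::aspect::fp64))
        std::cout << "  supports fp64\n";
      if (dev.has(sycl::aspect::usm_device_allocations))
        std::cout << "  supports device USM allocations\n";
    }
  }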

Device Information Descriptors

Our “curious” and “verycurious” program examples, used earlier in this chapter, utilize popular SYCL device class member functions (i.e., is_cpu, is_gpu, is_accelerator, get_info, has). These member functions are documented in the SYCL specification in a table titled “Member functions of the SYCL device class.”

The “curious” program examples also queried for information using the get_info member function. There is a set of queries that must be supported by all SYCL devices. The complete list of such items is described in the SYCL specification in a table titled “Device information descriptors.”

Device-Specific Kernel Information Descriptors

Like platforms and devices, we can query information about our kernels using a get_info function. Such information (e.g., supported work-group sizes, preferred work-group size, the amount of private memory required per work-item) may be device-specific, and so the get_info member function of the kernel class accepts a device as an argument.

The Specifics: Those of “Correctness”

We will divide the specifics into information about necessary conditions (correctness) and information useful for tuning but not necessary for correctness.

In this first correctness category, we will enumerate conditions that should be met in order for kernels to launch properly. Failure to abide by these device limitations will lead to program failures. Figure 12-8 shows how we can fetch a few of these parameters in a way that the values are available for use in host code and in kernel code (via lambda capture). We can modify our code to utilize this information; for instance, it could guide our code on buffer sizing or work-group sizing.

Figure 12-8. Fetching parameters that can be used to shape a kernel
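Since the figure is shown only by its caption, here is a hedged sketch of the pattern it illustrates: query the limits on the host, and capture them so kernel code can use the same values (the buffer and the trivial kernel are purely illustrative):

  #include <sycl/sycl.hpp>
  #include <iostream>

  int main() {
    sycl::queue q;
    sycl::device dev = q.get_device();

    // Query limits on the host side.
    size_t max_wg = dev.get_info<sycl::info::device::max_work_group_size>();
    auto global_mem = dev.get_info<sycl::info::device::global_mem_size>();
    auto local_mem = dev.get_info<sycl::info::device::local_mem_size>();
    std::cout << "max work-group size: " << max_wg << "\n"
              << "global memory size:  " << global_mem << " bytes\n"
              << "local memory size:   " << local_mem << " bytes\n";

    // The same values are available inside kernel code via lambda capture.
    sycl::buffer<size_t, 1> out{sycl::range<1>{1}};
    q.submit([&](sycl::handler &h) {
       sycl::accessor a{out, h, sycl::write_only};
       h.single_task([=]() { a[0] = max_wg; });  // max_wg captured by value
     }).wait();
  }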

Submitting a kernel that violates a required condition (e.g., using a sub-group size that is not among the device's sub_group_sizes) will generate a runtime error.

Device Queries

device_type: cpu, gpu, accelerator, custom, automatic, all. These are most often tested by is_cpu(), is_gpu(), and so on (see Figure 12-7).

max_work_item_sizes: The maximum number of work-items that are permitted in each dimension of the work-group of the nd_range. The minimum value is (1, 1, 1).

max_work_group_size: The maximum number of work-items that are permitted in a work-group executing a kernel on a single compute unit. The minimum value is 1.

global_mem_size: The size of global memory in bytes.

local_mem_size: The size of local memory in bytes. The minimum size is 32 KB.

max_compute_units: Indicative of the amount of parallelism available on a device—implementation-defined, interpret with care!

sub_group_sizes: Returns the set of sub-group sizes supported by the device.

Note that many more characteristics are encoded as aspects (see Figure 12-3), such as USM capabilities.
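A brief hedged sketch of checking a couple of these limits before choosing launch parameters (the dimensionality and the printing are illustrative):

  #include <sycl/sycl.hpp>
  #include <iostream>

  int main() {
    sycl::device dev{sycl::default_selector_v};

    // Supported sub-group sizes (returned as a std::vector<size_t>).
    for (size_t s : dev.get_info<sycl::info::device::sub_group_sizes>())
      std::cout << "supported sub-group size: " << s << "\n";

    // Maximum work-items per work-group dimension for a 3D nd_range.
    sycl::id<3> max_items =
        dev.get_info<sycl::info::device::max_work_item_sizes<3>>();
    std::cout << "max work-item sizes: " << max_items[0] << " x "
              << max_items[1] << " x " << max_items[2] << "\n";
  }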

We Strongly Advise Avoiding max_compute_units in program logic

We have found that querying the maximum number of compute units should be avoided, in part because the definition isn’t crisp enough to be useful in code tuning. Instead of using max_compute_units, most programs should express their parallelism and let the runtime map it onto available parallelism. Relying on max_compute_units for correctness only makes sense when augmented with implementation- and device-specific information. Experts might do that, but most developers do not and do not need to do so! Let the runtime do its job in this case!

Kernel Queries

The mechanisms discussed in Chapter 10, under “Kernels in Kernel Bundles,” are needed to perform these kernel queries:

  • work_group_size: Returns the maximum work-group size that can be used to execute a kernel on a specific device

  • compile_work_group_size: Returns the work-group size specified by a kernel if applicable; otherwise returns (0, 0, 0)

  • compile_sub_group_size: Returns the sub-group size specified by a kernel if applicable; otherwise returns 0

  • compile_num_sub_groups: Returns the number of sub-groups specified by a kernel if applicable; otherwise returns 0

  • max_sub_group_size: Returns the maximum sub-group size for a kernel launched with the specified work-group size

  • max_num_sub_groups: Returns the maximum number of sub-groups for a kernel
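As a hedged sketch (the kernel name and the trivial launch are purely illustrative), performing device-specific kernel queries through a kernel bundle (Chapter 10) and the kernel class might look like:

  #include <sycl/sycl.hpp>
  #include <iostream>

  class MyKernel;  // illustrative kernel name

  int main() {
    sycl::queue q;
    sycl::device dev = q.get_device();

    // Get an executable kernel bundle for this context, then our kernel from it.
    auto bundle =
        sycl::get_kernel_bundle<sycl::bundle_state::executable>(q.get_context());
    sycl::kernel k = bundle.get_kernel(sycl::get_kernel_id<MyKernel>());

    // Device-specific kernel queries take the device as an argument.
    std::cout << "work_group_size: "
              << k.get_info<sycl::info::kernel_device_specific::work_group_size>(dev)
              << "\n";
    std::cout << "preferred_work_group_size_multiple: "
              << k.get_info<
                     sycl::info::kernel_device_specific::preferred_work_group_size_multiple>(dev)
              << "\n";

    // Define the kernel somewhere in the program so it exists in the bundle.
    q.parallel_for<MyKernel>(sycl::range<1>{16}, [=](sycl::id<1>) {}).wait();
  }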

The Specifics: Those of “Tuning/Optimization”

There are a few additional parameters that can be considered as fine-tuning parameters for our kernels. These can be ignored without jeopardizing the correctness of a program, but paying attention to them allows our kernels to really utilize the particulars of the hardware for performance.

Paying attention to the results of these queries can help when tuning for a cache (if it exists).

Device Queries

global_mem_cache_line_size: Size of global memory cache line in bytes.

global_mem_cache_size: Size of global memory cache in bytes.

local_mem_type: The type of local memory supported. This can be info::local_mem_type::local, implying dedicated local memory storage such as SRAM, or info::local_mem_type::global. The latter means that local memory is simply implemented as an abstraction on top of global memory, with potentially no performance gain.
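A hedged sketch of these tuning-oriented queries (how the results are used to tune a kernel is up to the kernel author):

  #include <sycl/sycl.hpp>
  #include <iostream>

  int main() {
    sycl::device dev{sycl::default_selector_v};

    std::cout << "global mem cache line: "
              << dev.get_info<sycl::info::device::global_mem_cache_line_size>()
              << " bytes\n";
    std::cout << "global mem cache size: "
              << dev.get_info<sycl::info::device::global_mem_cache_size>()
              << " bytes\n";

    // Dedicated local memory (e.g., SRAM) vs. local memory emulated in global memory.
    bool dedicated_local =
        dev.get_info<sycl::info::device::local_mem_type>() ==
        sycl::info::local_mem_type::local;
    std::cout << "dedicated local memory: " << std::boolalpha << dedicated_local
              << "\n";
  }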

Kernel Queries

preferred_work_group_size: The preferred work-group size for executing a kernel on a specific device.

preferred_work_group_size_multiple: Work-group size should be a multiple of this value (preferred_work_group_size_multiple) for executing a kernel on a particular device for best performance. The value must not be greater than work_group_size.

Runtime vs. Compile-Time Properties

Implementations may offer compile-time constants/macros, or other functionality, but they are not standard and therefore we do not encourage their use nor do we discuss them in this book. The queries described in this chapter are performed through runtime APIs (get_info) so the results are not known until runtime. In the next section, we discuss how attributes may be used to control how the kernel is compiled. Other than attributes, the SYCL standard promotes only the use of runtime information with one fairly esoteric exception. SYCL does offer two traits that the application can use to query aspects at compilation time. These traits are there specifically to help avoid instantiating a templated kernel for device features that are not supported by any device. This is a very advanced, and seldom used, feature we do not elaborate upon in this book. The SYCL standard has an example toward the end of the “Device aspects” section that shows the use of any_device_has_v<aspect> and all_devices_have_v<aspect> for this purpose. The standard also defines “specialization constants,” which we do not discuss in this book because they are typically used in very advanced targeted development, such as in libraries. An experimental compile-time property extension is discussed in the Epilogue under “Compile-Time Properties.”

Kernel Specialization

We can specialize our kernels by having different kernels for different uses and selecting the appropriate kernel based on aspects (see Figure 12-3) of the device we are targeting. Of course, we can write specialized kernels explicitly and use C++ templating to help. We can inform the compiler that we want our kernel to use specific features by using SYCL attributes (Figure 12-9) and aspects (Figure 12-3).

For example, the reqd_work_group_size attribute (Figure 12-9) can be used to require a specific work-group size for a kernel, and the device_has attribute can be used to require specific device aspects for a kernel.

Using attributes helps in two ways:

  1. A kernel will throw an exception if it is submitted to a device that does not have one of the listed aspects.

  2. The compiler will issue a diagnostic if the kernel (or any of the functions it calls) uses an optional feature (e.g., fp16) that is associated with an aspect that is not listed in the attribute.

The first helps prevent an application from proceeding if it will likely fail, and the second helps catch errors at compile time. For these reasons, using attributes can be helpful.

Figure 12-10 provides an illustrative example that uses runtime logic to choose between two code sequences and uses attributes to specialize one of the kernels.

Figure 12-9. Attributes defined by the SYCL standard (and not deprecated)

Figure 12-10. Specialization of kernel explicitly with the help of attributes
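Because Figure 12-10 is shown only by its caption, here is a hedged sketch of the same idea: check an aspect at run time, and use the device_has attribute on the kernel that relies on the corresponding optional feature (the kernel bodies are placeholders):

  #include <sycl/sycl.hpp>
  #include <iostream>

  int main() {
    sycl::queue q;

    if (q.get_device().has(sycl::aspect::fp64)) {
      // Submit only to devices reporting fp64; the attribute documents (and
      // lets the compiler check) that this kernel relies on that aspect.
      q.parallel_for(sycl::range<1>{16},
                     [=](sycl::id<1>) [[sycl::device_has(sycl::aspect::fp64)]] {
                       // double-precision math would go here
                     })
          .wait();
      std::cout << "doubles were used\n";
    } else {
      q.parallel_for(sycl::range<1>{16}, [=](sycl::id<1>) {
         // single-precision fallback would go here
       }).wait();
      std::cout << "no doubles were used\n";
    }
  }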

Summary

The most portable programs will query the devices that are available in a system and adjust their behavior based on runtime information. This chapter opens the door to the rich set of information that is available to allow such tailoring of our code to adjust to the hardware that is present at runtime. We also discussed various ways to specialize kernels so they can be more closely adapted to a particular device type when we decide the investment is worthwhile. These give us the tools to balance portability and performance as necessary to meet our needs, all within the bounds of using C++ with SYCL.

Our programs can be made more functionally portable, more performance portable, and more future-proof by parameterizing our application to adjust to the characteristics of the hardware. We can also test that the hardware present falls within the bounds of any assumptions we have made in the design of our program, and either warn or abort when hardware is found that lies outside the bounds of those assumptions.