Thus far in this book, our code examples have represented kernels using C++ lambda expressions. Lambda expressions are a concise and convenient way to represent a kernel right where it is used, but they are not the only way to represent a kernel in SYCL. In this chapter, we will explore various ways to define kernels in detail, helping us to choose a kernel form that is most natural for our C++ coding needs.

This chapter explains and compares three ways to represent a kernel:

  • Lambda expressions

  • Named function objects (functors)

  • Interoperability with kernels created via other languages or APIs

This chapter closes with a discussion of how to explicitly manipulate kernels in a program object to control when and how kernels are compiled.

Why Three Ways to Represent a Kernel?

Before we dive into the details, let’s start with a summary of why there are three ways to define a kernel and the advantages and disadvantages of each method. A useful summary is given in Figure 10-1.

Bear in mind that a kernel is used to express a unit of computation and that many instances of a kernel will usually execute in parallel on an accelerator. SYCL supports multiple ways to express a kernel to integrate naturally and seamlessly into a variety of codebases while executing efficiently on a wide diversity of accelerator types.

Figure 10-1. Three ways to represent a kernel

Kernels As Lambda Expressions

C++ lambda expressions, also referred to as anonymous function objects, unnamed function objects, closures, or simply lambdas, are a convenient way to express a kernel right where it is used. This section describes how to represent a kernel as a C++ lambda expression, expanding on the introductory refresher on C++ lambda functions in Chapter 1, which included some code samples with output.

C++ lambda expressions are very powerful and have an expressive syntax, but only a specific subset of the full C++ lambda expression syntax is required (and supported) when expressing a kernel.

Figure 10-2. Kernel defined using a lambda expression
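For reference, a minimal sketch of a kernel written in this style is shown below; the data, buffer, and accessor names are illustrative, and the buffer/accessor setup follows the conventions used elsewhere in this book.

#include <CL/sycl.hpp>
#include <array>
using namespace sycl;

int main() {
  constexpr size_t size = 16;
  std::array<int, size> data;
  for (size_t i = 0; i < size; i++) data[i] = static_cast<int>(i);

  {
    buffer data_buf{data};  // results copy back to data when the buffer is destroyed
    queue q;
    q.submit([&](handler& h) {
      accessor data_acc{data_buf, h};
      // The kernel lambda expression: it captures data_acc by value, takes the
      // work-item index as its parameter, and its body runs once per index.
      h.parallel_for(range{size}, [=](id<1> i) {
        data_acc[i] = data_acc[i] + 1;
      });
    });
  }
  return 0;
}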

Elements of a Kernel Lambda Expression

Figure 10-2 shows a kernel written as a typical lambda expression—the code examples so far in this book have used this syntax.

The illustration in Figure 10-3 shows more elements of a lambda expression that may be used with kernels, but many of these elements are not typical. In most cases, the lambda defaults are sufficient, so a typical kernel lambda expression looks more like the lambda expression in Figure 10-2 than the more complicated lambda expression in Figure 10-3.

Figure 10-3. More elements of a kernel lambda expression, including optional elements

  1. The first part of a lambda expression describes the lambda captures. Capturing a variable from a surrounding scope enables it to be used within the lambda expression, without explicitly passing it to the lambda expression as a parameter.

    C++ lambda expressions support capturing a variable by copying it or by creating a reference to it, but for kernel lambda expressions, variables may only be captured by copy. General practice is to simply use the default capture mode [=], which implicitly captures all variables by value, although it is possible to explicitly name each captured variable as well. Any variable used within a kernel that is not captured by value will cause a compile-time error.

  2. The second part of a lambda expression describes parameters that are passed to the lambda expression, just like parameters that are passed to named functions.

    For kernel lambda expressions, the parameters depend on how the kernel was invoked and usually identify the index of the work-item in the parallel execution space. Please refer to Chapter 4 for more details about the various parallel execution spaces and how to identify the index of a work-item in each execution space.

  3. The last part of the lambda expression defines the lambda function body. For a kernel lambda expression, the function body describes the operations that should be performed at each index in the parallel execution space.

    There are other parts of a lambda expression that are supported for kernels, but they are either optional or infrequently used (a sketch combining several of them follows this list):

  4. Some specifiers (such as mutable) may be supported, but their use is not recommended, and support may be removed in future versions of SYCL or DPC++ (the mutable specifier is already gone in the provisional SYCL 2020 specification). No specifiers are shown in the example code.

  5. The exception specification is supported, but must be noexcept if provided, since exceptions are not supported for kernels.

  6. Lambda attributes are supported and may be used to control how the kernel is compiled. For example, the reqd_work_group_size attribute can be used to require a specific work-group size for a kernel.

  7. The return type may be specified, but must be void if provided, since non-void return types are not supported for kernels.

LAMBDA CAPTURES: IMPLICIT OR EXPLICIT?

Some C++ style guides recommend against implicit (or default) captures for lambda expressions due to possible dangling pointer issues, especially when lambda expressions cross scope boundaries. The same issues may occur when lambdas are used to represent kernels, since kernel lambdas execute asynchronously on the device, separately from host code.

Because implicit captures are useful and concise, they are common practice for SYCL kernels and a convention we use in this book, but it is ultimately our decision whether to prefer the brevity of implicit captures or the clarity of explicit captures.
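To make the tradeoff concrete, the sketch below writes the same kernel twice, once with an implicit (default) capture and once with an explicit capture; only the capture clause differs, and both capture the accessor by value.

#include <CL/sycl.hpp>
#include <array>
using namespace sycl;

int main() {
  constexpr size_t size = 16;
  std::array<int, size> data{};

  {
    buffer data_buf{data};
    queue q;

    q.submit([&](handler& h) {
      accessor data_acc{data_buf, h};
      // Implicit capture: [=] captures data_acc by value.
      h.parallel_for(range{size},
                     [=](id<1> i) { data_acc[i] = data_acc[i] + 1; });
    });

    q.submit([&](handler& h) {
      accessor data_acc{data_buf, h};
      // Explicit capture: data_acc is named in the capture list, still by value.
      h.parallel_for(range{size},
                     [data_acc](id<1> i) { data_acc[i] = data_acc[i] + 1; });
    });
  }
  return 0;
}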

Naming Kernel Lambda Expressions

There is one more element that must be provided in some cases when a kernel is written as a lambda expression: because lambda expressions are anonymous, at times SYCL requires an explicit kernel name template parameter to uniquely identify the kernel.

Figure 10-4. Naming kernel lambda expressions

Naming a kernel lambda expression is a way for a host code compiler to identify which kernel to invoke when the kernel was compiled by a separate device code compiler. Naming a kernel lambda also enables runtime introspection of a compiled kernel or building a kernel by name, as shown in Figure 10-9.
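As an illustration, the sketch below names the kernel lambda with the template parameter class Add; the name itself and the surrounding setup are illustrative.

#include <CL/sycl.hpp>
#include <array>
using namespace sycl;

int main() {
  constexpr size_t size = 16;
  std::array<int, size> data{};

  {
    buffer data_buf{data};
    queue q;
    q.submit([&](handler& h) {
      accessor data_acc{data_buf, h};
      // The kernel name template parameter <class Add> uniquely identifies
      // this kernel lambda expression. With the -fsycl-unnamed-lambda option
      // described next, the name could be omitted entirely.
      h.parallel_for<class Add>(range{size},
                                [=](id<1> i) { data_acc[i] = data_acc[i] + 1; });
    });
  }
  return 0;
}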

To support more concise code when the kernel name template parameter is not required, the DPC++ compiler supports omitting the kernel name template parameter for a kernel lambda via the -fsycl-unnamed-lambda compiler option. When using this option, no explicit kernel name template parameter is required, as shown in Figure 10-5.

Figure 10-5. Using unnamed kernel lambda expressions

Because the kernel name template parameter for lambda expressions is not required in most cases, we can usually start with an unnamed lambda and only add a kernel name in specific cases when the kernel name template parameter is required.

When the kernel name template parameter is not required, using unnamed kernel lambdas is preferred to reduce verbosity.

Kernels As Named Function Objects

Named function objects, also known as functors, are an established pattern in C++ that allows operating on an arbitrary collection of data while maintaining a well-defined interface. When used to represent a kernel, the member variables of a named function object define the state that the kernel may operate on, and the overloaded function call operator() is invoked for each work-item in the parallel execution space.

Named function objects require more code than lambda expressions to express a kernel, but the extra verbosity provides more control and additional capabilities. It may be easier to analyze and optimize kernels expressed as named function objects, for example, since any buffers and data values used by the kernel must be explicitly passed to the kernel, rather than captured automatically.

Finally, because named function objects are just like any other C++ class, kernels expressed as named function objects may be templated, unlike kernels expressed as lambda expressions. Kernels expressed as named function objects may also be easier to reuse and may be shipped as part of a separate header file or library.

Elements of a Kernel Named Function Object

The code in Figure 10-6 describes the elements of a kernel represented as a named function object.

Figure 10-6. Kernel as a named function object

When a kernel is expressed as a named function object, the named function object type must be trivially copyable, following the C++11 rules for trivially copyable types. Informally, this means that the named function object may be safely copied byte by byte, enabling the member variables of the named function object to be passed to and accessed by kernel code executing on a device.

The arguments to the overloaded function call operator() depend on how the kernel is launched, just like for kernels expressed as lambda expressions.

Because the function object is named, the host code compiler can use the function object type to associate with the kernel code produced by the device code compiler, even if the function object is templated. As a result, no additional kernel name template parameter is needed to name a kernel function object.
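The sketch below shows one possible kernel expressed as a named function object. Templating the functor on its accessor type is a convenience choice for this sketch, not a requirement; the names AddOne and data_acc are illustrative.

#include <CL/sycl.hpp>
#include <array>
using namespace sycl;

// A named function object: the member accessor is the state the kernel
// operates on, and the overloaded operator() is invoked once per work-item.
template <typename AccT>
class AddOne {
 public:
  AddOne(AccT acc) : data_acc(acc) {}
  void operator()(id<1> i) const { data_acc[i] = data_acc[i] + 1; }

 private:
  AccT data_acc;
};

int main() {
  constexpr size_t size = 16;
  std::array<int, size> data{};

  {
    buffer data_buf{data};
    queue q;
    q.submit([&](handler& h) {
      accessor data_acc{data_buf, h};
      // No kernel name template parameter is needed: the function object type
      // itself identifies the kernel.
      h.parallel_for(range{size}, AddOne(data_acc));
    });
  }
  return 0;
}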

Interoperability with Other APIs

When a SYCL implementation is built on top of another API, the implementation may be able to interoperate with kernels defined using mechanisms of the underlying API. This allows an application to easily and incrementally integrate SYCL into existing codebases.

Because a SYCL implementation may be layered on top of many other APIs, the functionality described in this section is optional and may not be supported by all implementations. The underlying API may even differ depending on the specific device type or device vendor!

Broadly speaking, an implementation may support two interoperability mechanisms: creating a kernel from an API-defined source or intermediate representation (IR), or creating a kernel from an API-specific handle. Of these two mechanisms, the ability to create a kernel from an API-defined source or intermediate representation is more portable, since some source or IR formats are supported by multiple APIs. For example, OpenCL C kernels may be directly consumed by many APIs or may be compiled into some form understood by an API, but it is unlikely that an API-specific kernel handle from one API will be understood by a different API.

Remember that all forms of interoperability are optional!

Different SYCL implementations may support creating kernels from different API-specific handles—or not at all.

Always check the documentation for details!

Interoperability with API-Defined Source Languages

With this form of interoperability, the contents of the kernel are described as source code or using an intermediate representation that is not defined by SYCL, but the kernel objects are still created using SYCL API calls. This form of interoperability allows reuse of kernel libraries written in other source languages or use of domain-specific languages (DSLs) that generate code in an intermediate representation.

An implementation must understand the kernel source code or intermediate representation to utilize this form of interoperability. For example, if the kernel is written using OpenCL C in source form, the implementation must support building SYCL programs from OpenCL C kernel source code.

Figure 10-7 shows how a SYCL kernel may be written as OpenCL C kernel source code.

Figure 10-7. Kernel created from OpenCL C kernel source

In this example, the kernel source string is represented as a C++ raw string literal in the same file as the SYCL host API calls, but there is no requirement that this is the case, and some applications may read the kernel source string from a file or even generate it just-in-time.

Because the SYCL compiler does not have visibility into a SYCL kernel written in an API-defined source language, any kernel arguments must explicitly be passed using the set_arg() or set_args() interface. The SYCL runtime and the API-defined source language must agree on a convention to pass objects as kernel arguments. In this example, the accessor dataAcc is passed as the global pointer kernel argument data.

The build_with_source() interface supports passing optional API-defined build options to precisely control how the kernel is compiled. In this example, the program build options -cl-fast-relaxed-math are used to indicate that the kernel compiler can use a faster math library with relaxed precision. The program build options are optional and may be omitted if no build options are required.
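Based on the interfaces described above, a sketch of this pattern might look like the following. It uses the SYCL 1.2.1-era program class with build_with_source, get_kernel, and set_args, which is the interface this chapter describes but which was reworked in later versions of SYCL; it also assumes a device and implementation that accept OpenCL C kernel source.

#include <CL/sycl.hpp>
#include <array>
using namespace sycl;

int main() {
  constexpr size_t size = 16;
  std::array<int, size> data{};

  {
    buffer data_buf{data};
    queue q;

    // Build a program from OpenCL C kernel source, with optional API-defined
    // build options.
    program p{q.get_context()};
    p.build_with_source(
        R"CLC(
          kernel void add(global int* data) {
            int index = get_global_id(0);
            data[index] = data[index] + 1;
          }
        )CLC",
        "-cl-fast-relaxed-math");

    q.submit([&](handler& h) {
      accessor data_acc{data_buf, h};
      // The SYCL compiler cannot see the kernel source, so arguments are
      // passed explicitly; the accessor maps to the kernel's global pointer.
      h.set_args(data_acc);
      h.parallel_for(range{size}, p.get_kernel("add"));
    });
  }
  return 0;
}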

Interoperability with API-Defined Kernel Objects

With this form of interoperability, the kernel objects themselves are created in another API and then imported into SYCL. This form of interoperability enables one part of an application to directly create and use kernel objects using underlying APIs and another part of the application to reuse the same kernels using SYCL APIs. The code in Figure 10-8 shows how a SYCL kernel may be created from an OpenCL kernel object.

Figure 10-8. Kernel created from an OpenCL kernel object

As with other forms of interoperability, the SYCL compiler does not have visibility into an API-defined kernel object. Therefore, kernel arguments must be explicitly passed using the set_arg() or set_args() interface, and the SYCL runtime and the underlying API must agree on a convention to pass kernel arguments.
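A sketch of this pattern follows. The OpenCL setup is minimal and omits error checking and cleanup, the kernel(cl_kernel, context) constructor is the SYCL 1.2.1-era interoperability interface, and the code assumes the queue targets an OpenCL device so that the underlying cl_context and cl_device_id can be obtained.

#include <CL/cl.h>
#include <CL/sycl.hpp>
#include <array>
using namespace sycl;

int main() {
  constexpr size_t size = 16;
  std::array<int, size> data{};

  {
    buffer data_buf{data};
    queue q;

    // Create an OpenCL kernel object directly with the OpenCL API, using the
    // OpenCL context and device underlying the SYCL queue.
    const char* source =
        "kernel void add(global int* data) {"
        "  data[get_global_id(0)] = data[get_global_id(0)] + 1;"
        "}";
    cl_context cl_ctx = q.get_context().get();
    cl_device_id cl_dev = q.get_device().get();
    cl_program cl_prog =
        clCreateProgramWithSource(cl_ctx, 1, &source, nullptr, nullptr);
    clBuildProgram(cl_prog, 1, &cl_dev, nullptr, nullptr, nullptr);
    cl_kernel cl_kern = clCreateKernel(cl_prog, "add", nullptr);

    // Import the OpenCL kernel object into SYCL.
    kernel k{cl_kern, q.get_context()};

    q.submit([&](handler& h) {
      accessor data_acc{data_buf, h};
      h.set_args(data_acc);  // arguments are passed explicitly
      h.parallel_for(range{size}, k);
    });
  }
  return 0;
}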

Kernels in Program Objects

In prior sections, when kernels were either created from an API-defined representation or from API-specific handles, the kernels were created in two steps: first by creating a program object and then by creating the kernel from the program object. A program object is a collection of kernels and the functions they call that are compiled as a unit.

For kernels represented as lambda expressions or named function objects, the program object containing the kernel is usually implicit and invisible to an application. For applications that require more control, an application can explicitly manage kernels and the program objects that encapsulate them. To describe why this may be beneficial, it is helpful to take a brief look at how many SYCL implementations manage just-in-time (JIT) kernel compilation.

While not required by the specification, many implementations compile kernels “lazily.” This is usually a good policy since it ensures fast application startup and does not unnecessarily compile kernels that are never executed. The disadvantage of this policy is that the first use of a kernel usually takes longer than subsequent uses, since it includes the time needed to compile the kernel, plus the time needed to submit and execute the kernel. For some complex kernels, the time needed to compile the kernel can be significant, making it desirable to shift compilation to a different point during application execution, such as when the application is loading, or in a separate background thread.

Some kernels may also benefit from implementation-defined “build options” to precisely control how the kernel is compiled. For example, for some implementations, it may be possible to instruct the kernel compiler to use a math library with lower precision and better performance.

To provide more control over when and how a kernel is compiled, an application can explicitly request that a kernel be compiled before it is used, using specific build options. The pre-compiled kernel can then be submitted to a queue for execution as usual. Figure 10-9 shows how this works.

Figure 10-9. Compiling kernel lambdas with build options

In this example, a program object is created from a SYCL context, and the kernel identified by the specified template parameter is built using the build_with_kernel_type function. For this example, the program build options -cl-fast-relaxed-math indicate that the kernel compiler can use a faster math library with relaxed precision; the program build options are optional and may be omitted if no special build options are required. The template parameter naming the kernel lambda is required in this case, to identify which kernel to compile.

A program object may also be created from a context and a specific list of devices, rather than all the devices in the context, allowing a program object for one set of devices to be compiled with different build options than those of another program object for a different set of devices.

The previously compiled kernel is retrieved using the get_kernel function and passed to parallel_for in addition to the usual kernel lambda expression. This ensures that the previously compiled kernel that was built using the relaxed math library gets used. If the previously compiled kernel is not passed to the parallel_for, then the kernel will be compiled again, without any special build options. This may be functionally correct, but it is certainly not the intended behavior!
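Putting these steps together, a sketch using the SYCL 1.2.1-era program interfaces described above (later replaced by kernel bundles in SYCL 2020) might look like the following; the exact parallel_for overload that accepts both a kernel object and a lambda expression is an assumption about that interface.

#include <CL/sycl.hpp>
#include <array>
using namespace sycl;

class Add;  // kernel name

int main() {
  constexpr size_t size = 16;
  std::array<int, size> data{};

  {
    buffer data_buf{data};
    queue q;

    // Explicitly build the kernel ahead of time, with optional build options.
    program p{q.get_context()};
    p.build_with_kernel_type<Add>("-cl-fast-relaxed-math");

    q.submit([&](handler& h) {
      accessor data_acc{data_buf, h};
      // Pass the previously compiled kernel along with the kernel lambda so
      // that the pre-built (relaxed-math) kernel is the one that executes.
      h.parallel_for<Add>(
          p.get_kernel<Add>(), range{size},
          [=](id<1> i) { data_acc[i] = data_acc[i] + 1; });
    });
  }
  return 0;
}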

In many cases, such as in the simple example shown earlier, these additional steps are unlikely to produce a noticeable change in application behavior and may be omitted for clarity, but they should be considered when tuning an application for performance.

IMPROVING INTEROPERABILITY AND PROGRAM OBJECT MANAGEMENT

Although the SYCL interfaces for interoperability and program object management described in this chapter are useful and functional, they are likely to be improved and enhanced in future versions of SYCL and DPC++. Please refer to the latest SYCL and DPC++ documentation to find updates that were not available or not stable enough to include in this book!

Summary

In this chapter, we explored different ways to define kernels. We described how to seamlessly integrate into existing C++ codebases by representing kernels as C++ lambda expressions or named function objects. For new codebases, we also discussed the pros and cons of the different kernel representations, to help choose the best way to define kernels based on the needs of the application or library.

We also described how to interoperate with other APIs, either by creating a kernel from an API-defined source language or intermediate representation or by creating a kernel object from a handle to an API representation of the kernel. Interoperability enables an application to migrate from lower-level APIs to SYCL over time or to interface with libraries written for other APIs.

Finally, we described how kernels are typically compiled in a SYCL application and how to directly manipulate kernels in program objects to control the compilation process. Even though this level of control will not be required for most applications, it is a useful technique to be aware of when tuning an application.