Vectors are collections of data. These can be useful because parallelism in our computers comes from collections of compute hardware, and data is often processed in related groupings (e.g., the color channels in an RGB pixel). Sound like a marriage made in heaven? The pairing is important enough that we’ll spend this chapter discussing the merits of vector types and how to use them. We will not dive into vectorization in this chapter, since it varies by device type and implementation; vectorization is covered in Chapters 15 and 16.

This chapter seeks to address the following questions:

  • What are vector types?

  • How much do I really need to know about the vector interface?

  • Should vector types be used to express parallelism?

  • When should I use vector types?

We discuss the strengths and weaknesses of available vector types using working code examples and highlight the most important aspects of exploiting vector types.

How to Think About Vectors

Vectors are a surprisingly controversial topic when we talk with parallel programming experts, and in the authors’ experience, this is because different people define and think about the term in different ways.

There are two broad ways to think about vector data types (a collection of data):

  1. As a convenience type, which groups data that you might want to refer to and operate on as a group, for example, grouping the color channels of a pixel (e.g., RGB, YUV) into a single variable (e.g., float3), which could be a vector. We could define a pixel class or struct and define math operators like + on it, but vector types conveniently do this for us out of the box (see the sketch after this list). Convenience types can be found in many shader languages used to program GPUs, so this way of thinking is already common among many GPU developers.

  2. As a mechanism to describe how code maps to a SIMD instruction set in hardware. For example, in some languages and implementations, operations on a float8 could in theory map to an eight-lane SIMD instruction in hardware. Vector types are used in multiple languages as a convenient high-level alternative to CPU-specific SIMD intrinsics for specific instruction sets, so this way of thinking is already common among many CPU developers.
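To make the first interpretation concrete, here is a minimal sketch of the convenience-type view, assuming the DPC++ header layout and the float4 alias described later in this chapter:

  #include <CL/sycl.hpp>
  #include <iostream>
  using namespace sycl;

  int main() {
    // Two RGBA pixels held in convenience vector types.
    float4 pixel1{1.0f, 0.5f, 0.25f, 1.0f};
    float4 pixel2{0.0f, 0.5f, 0.75f, 1.0f};

    // Element-wise operators come out of the box; no custom pixel
    // struct with a hand-written operator+ is needed.
    float4 blended = (pixel1 + pixel2) * 0.5f;

    std::cout << "red channel: " << blended.x() << "\n";
    return 0;
  }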

Although these two interpretations are very different, they unintentionally became combined and muddled together as SYCL and other languages became applicable to both CPUs and GPUs. A vector in the SYCL 1.2.1 specification is compatible with either interpretation (we will revisit this later), so we need to clarify our recommended thinking in DPC++ before going any further.

Throughout this book, we talk about how work-items can be grouped together to expose powerful communication and synchronization primitives, such as sub-group barriers and shuffles. For these operations to be efficient on vector hardware, there is an assumption that different work-items in a sub-group combine and map to SIMD instructions. Said another way, multiple work-items are grouped together by the compiler, at which point they can map to SIMD instructions in the hardware. Remember from Chapter 4 that this is a basic premise of SPMD programming models that operate on top of vector hardware, where a single work-item constitutes a lane of what might be a SIMD instruction in hardware, instead of a work-item defining the entire operation that will be a SIMD instruction in the hardware. You can think of the compiler as always vectorizing across work-items when mapping to SIMD instructions in hardware, when programming in a SPMD style with the DPC++ compiler.

For the features and hardware described in this book, vectors are useful primarily for the first interpretation in this section—vectors are convenience types that should not be thought of as mapping to SIMD instructions in hardware. Work-items are grouped together to form SIMD instructions in hardware, on the platforms where that applies (CPUs, GPUs). Vectors should be thought of as providing convenient operators such as swizzles and math functions that make common operations on groups of data concise within our code (e.g., adding two RGB pixels).

For developers coming from languages that don’t have vectors or from GPU shading languages, we can think of SYCL vectors as local to a work-item: if two four-element vectors are added, that addition might take four instructions in the hardware (it is scalarized from the perspective of the work-item). Each element of the vector would be added through a different instruction/clock cycle in the hardware. With this interpretation, vectors are a convenience in that we can add two vectors in a single operation in our source code, as opposed to performing four scalar operations in the source.

For developers coming from a CPU background, we should know that implicit vectorization to SIMD hardware occurs by default in the compiler in a few ways independent of the vector types. The compiler performs this implicit vectorization across work-items, extracts the vector operations from well-formed loops, or honors vector types when mapping to vector instructions—see Chapter 16 for more information.

OTHER IMPLEMENTATIONS POSSIBLE!

Different compilers and implementations of SYCL and DPC++ can in theory make different decisions on how vector data types in code map to vector hardware instructions. We should read a vendor’s documentation and optimization guides to understand how to write code that will map to efficient SIMD instructions. This book is written principally against the DPC++ compiler, so it documents the thinking and programming patterns that DPC++ is built around.

CHANGES ARE ON THE HORIZON

We have just said to consider vector types as convenience types and to expect vectorization across work-items when thinking about the mapping to hardware on devices where that makes sense. This is expected to be the default interpretation in the DPC++ compiler and toolchain going forward. However, there are two additional future-looking changes to be aware of.

First, we can expect some future DPC++ features that will allow us to write explicit vector code that maps directly to SIMD instructions in the hardware, particularly for experts who want to tune details of code for a specific architecture and take control from the compiler vectorizers. This is a niche feature that will be used by very few developers, but we can expect programming mechanisms to exist eventually where this is possible. Those programming mechanisms will make it very clear which code is written in an explicit vector style, so that there isn’t confusion between the code we write today and that new more explicit (and less portable) style.

Second, the need for this section of the book (talking about interpretations of vectors) highlights that there is confusion on what a vector means, and that will be solved in SYCL in the future. There is a hint of this in the SYCL 2020 provisional specification where a math array type (marray) has been described, which is explicitly the first interpretation from this section—a convenience type unrelated to vector hardware instructions. We should expect another type to also eventually appear to cover the second interpretation, likely aligned with the C++ std::simd templates. With these two types being clearly associated with specific interpretations of a vector data type, our intent as programmers will be clear from the code that we write. This will be less error prone and less confusing and may even reduce the number of heated discussions between expert developers when the question arises “What is a vector?”
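As a taste of where this is heading, a hypothetical marray usage might look like the following; this is a sketch assuming the provisional SYCL 2020 interface is finalized as described, not a guaranteed API:

  // A convenience grouping with no implied mapping to SIMD
  // hardware instructions (the first interpretation only).
  sycl::marray<float, 3> color{0.2f, 0.4f, 0.6f};
  sycl::marray<float, 3> gain{2.0f, 2.0f, 2.0f};
  auto scaled = color * gain;  // element-wise, local to a work-item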

Vector Types

Vector types in SYCL are cross-platform class templates that work efficiently on devices as well as in host C++ code and allow sharing of vectors between the host and its devices. Vector types include methods that allow construction of a new vector from a swizzled set of component elements, meaning that elements of the new vector can be picked in an arbitrary order from elements of the old vector. vec is a vector type that compiles down to the built-in vector types on target device backends, where possible, and provides compatible support on the host.

The vec class is templated on its number of elements and its element type. The number of elements parameter, numElements, can be one of 1, 2, 3, 4, 8, or 16. Any other value will produce a compilation failure. The element type parameter, dataT, must be one of the basic scalar types supported in device code.

The SYCL vec class template provides interoperability with the underlying vector type defined by vector_t which is available only when compiled for the device. The vec class can be constructed from an instance of vector_t and can implicitly convert to an instance of vector_t in order to support interoperability with native SYCL backends from a kernel function (e.g., OpenCL backends). An instance of the vec class template can also be implicitly converted to an instance of the data type when the number of elements is 1 in order to allow single-element vectors and scalars to be easily interchangeable.

For our programming convenience, SYCL provides a number of type aliases of the form using <type><elems> = vec<<storage-type>, <elems>>, where <elems> is 2, 3, 4, 8, or 16. For the integral types, the <type> and <storage-type> pairings are char (int8_t), uchar (uint8_t), short (int16_t), ushort (uint16_t), int (int32_t), uint (uint32_t), long (int64_t), and ulong (uint64_t); for the floating-point types half, float, and double, the alias and storage type are the same. For example, uint4 is an alias to vec<uint32_t, 4> and float16 is an alias to vec<float, 16>.
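For instance, a quick sketch of what these aliases expand to, assuming they are exposed in the sycl namespace as in DPC++:

  #include <CL/sycl.hpp>
  #include <cstdint>
  #include <type_traits>

  // The aliases are plain shorthands for vec specializations.
  static_assert(std::is_same_v<sycl::uint4, sycl::vec<std::uint32_t, 4>>);
  static_assert(std::is_same_v<sycl::float16, sycl::vec<float, 16>>);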

Vector Interface

The functionality of vector types is exposed through the class vec. The vec class represents a set of data elements that are grouped together. The interfaces of the constructors, member functions, and non-member functions of the vec class template are described in Figures 11-1, 11-4, and 11-5.

The XYZW members listed in Figure 11-2 are available only when numElements <= 4. RGBA members are available only when numElements == 4.

The members lo, hi, odd, and even shown in Figure 11-3 are available only when numElements > 1.
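As a small illustration of these accessors (a sketch, assuming a four-element float vector):

  float4 f{1.f, 2.f, 3.f, 4.f};

  // Named element accessors: XYZW when numElements <= 4,
  // RGBA when numElements == 4.
  float first = f.x();  // 1.0
  float red   = f.r();  // 1.0 (the same element as x())

  // Sub-vector accessors, available when numElements > 1:
  float2 low = f.lo();   // {1, 2}: the lower half
  float2 odd = f.odd();  // {2, 4}: the odd-indexed elements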

Figure 11-1. vec class declaration and member functions

Figure 11-2. swizzled_vec member functions

Figure 11-3. vec operator interface

Figure 11-4. vec member functions

Figure 11-5. vec non-member functions

Load and Store Member Functions

Vector load and store operations are members of the vec class for loading and storing the elements of a vector. These operations can be to or from an array of elements of the same type as the channels of the vector. An example is shown in Figure 11-6.

Figure 11-6. Use of load and store member functions

In the vec class, dataT and numElements are template parameters that reflect the component type and dimensionality of a vec.

The load() member function template will read values of type dataT from the memory at the address of the multi_ptr, offset in elements of dataT by numElements*offset, and write those values to the channels of the vec.

The store() member function template will read channels of the vec and write those values to the memory at the address of the multi_ptr, offset in elements of dataT by numElements*offset.

The parameter is a multi_ptr rather than an accessor so that locally created pointers, as well as pointers passed from the host, can be used.

The data type of the multi_ptr is dataT, the data type of the components of the vec class specialization. This means that the pointer passed to either load() or store() must match the component type of the vec instance itself.
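Along the lines of Figure 11-6, the following sketch loads four floats starting at element 8 of a buffer (offset 2, in units of numElements), doubles them, and stores them back; the buffer setup and names are illustrative:

  #include <CL/sycl.hpp>
  #include <array>
  using namespace sycl;

  int main() {
    std::array<float, 16> fpData;
    for (int i = 0; i < 16; ++i) fpData[i] = i;

    {
      buffer fpBuf{fpData};
      queue q;
      q.submit([&](handler& h) {
        accessor buf{fpBuf, h};
        h.parallel_for(range{1}, [=](id<1>) {
          float4 inVec;
          // Reads elements 8..11: the offset in elements of dataT
          // is numElements * offset = 4 * 2 = 8.
          inVec.load(2, buf.get_pointer());
          inVec *= 2.0f;
          // Writes the doubled values back to elements 8..11.
          inVec.store(2, buf.get_pointer());
        });
      });
    } // buffer destruction copies the results back to fpData
    return 0;
  }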

Swizzle Operations

In graphics applications, swizzling means rearranging the data elements of a vector. For example, if a = {1, 2, 3, 4}, and knowing that the components of a four-element vector can be referred to as {x, y, z, w}, we could write b = a.wxyz(). The result in the variable b would be {4, 1, 2, 3}. This form of code is common in GPU applications where there is efficient hardware for such operations. Swizzles can be performed in two ways:

  • By calling the swizzle member function of a vec, which takes a variadic number of integer template arguments between 0 and numElements-1, specifying swizzle indices

  • By calling one of the simple swizzle member functions such as XYZW_SWIZZLE and RGBA_SWIZZLE

Note that the simple swizzle functions are only available for up to four-element vectors and are only available when the macro SYCL_SIMPLE_SWIZZLES is defined before including sycl.hpp. In both cases, the return type is always an instance of __swizzled_vec__, an implementation-defined temporary class representing a swizzle of the original vec instance. Both the swizzle member function template and the simple swizzle member functions allow swizzle indexes to be repeated. Figure 11-7 shows a simple usage of __swizzled_vec__.

Figure 11-7. Example of using the __swizzled_vec__ class
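A compilable sketch of both swizzle forms, assuming the DPC++ header layout (note that the macro must be defined before the include):

  #define SYCL_SIMPLE_SWIZZLES  // must precede the sycl.hpp include
  #include <CL/sycl.hpp>
  using namespace sycl;

  int main() {
    float4 a{1.f, 2.f, 3.f, 4.f};

    // Simple swizzle member function: pick w, x, y, z in order.
    float4 b = a.wxyz();  // b == {4, 1, 2, 3}

    // Variadic swizzle with explicit indices; repeats are allowed.
    float4 c = a.swizzle<3, 0, 0, 1>();  // c == {4, 1, 1, 2}
    return 0;
  }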

Vector Execution Within a Parallel Kernel

As described in Chapters 4 and 9, a work-item is the leaf node of the parallelism hierarchy and represents an individual instance of a kernel function. Work-items can be executed in any order and cannot communicate or synchronize with each other except through atomic memory operations to local and global memory or through group collective functions (e.g., shuffle, barrier).

As described at the start of this chapter, a vector in DPC++ should be interpreted as a convenience for us when writing code. Each vector is local to a single work-item (instead of relating to vectorization in hardware) and can therefore be thought of as equivalent to a private array of numElements in our work-item. For example, the storage of a “float4 y4” declaration is equivalent to float y4[4]. Consider the example shown in Figure 11-8.

Figure 11-8. Vector execution example

For the scalar variable x, the result of kernel execution with multiple work-items on hardware that has SIMD instructions (e.g., CPUs, GPUs) might use a vector register and SIMD instructions, but the vectorization is across work-items and unrelated to any vector type in our code. Each work-item could operate on a different location in the implicit vec_x, as shown in Figure 11-9. The scalar data in a work-item can be thought of as being implicitly vectorized (combined into SIMD hardware instructions) across work-items that happen to execute at the same time, in some implementations and on some hardware, but the work-item code that we write does not encode this in any way—this is at the core of the SPMD style of programming.

Figure 11-9. Vector expansion from scalar variable x to vec_x[8]

With the implicit vector expansion from scalar variable x to vec_x[8] by the compiler as shown in Figure 11-9, the compiler creates a SIMD operation in hardware from a scalar operation that occurs in multiple work-items.

For the vector variable y4, the result of kernel execution for multiple work-items, for example, eight work-items, does not process the vec4 by using vector operations in hardware. Instead, each work-item independently sees its own vector, and the operations on elements of that vector occur across multiple clock cycles/instructions (the vector is scalarized by the compiler), as shown in Figure 11-10.

Figure 11-10. Vertical expansion to equivalent of vec_y[8][4] of y4 across eight work-items

Each work-item sees the original data layout of y4, which provides an intuitive model to reason about and tune. The performance downside is that the compiler has to generate gather/scatter memory instructions for both CPUs and GPUs, as shown in Figure 11-11 (the vectors are contiguous in memory, and neighboring work-items operate on different vectors in parallel), so scalars are often a more efficient approach than explicit vectors when a compiler will vectorize across work-items (e.g., across a sub-group). See Chapters 15 and 16 for more details.

Figure 11-11. Vector code example with address escaping

When the compiler is able to prove that the address of y4 does not escape from the current kernel work-item, or that all callee functions are to be inlined, then the compiler may perform optimizations that act as if there were a horizontal unit-stride expansion from y4 to vec_y[4][8] using a set of vector registers, as shown in Figure 11-12. In this case, compilers can achieve optimal performance without generating gather/scatter SIMD instructions on both CPUs and GPUs. The compiler optimization reports tell programmers whether this type of transformation occurred and can provide hints on how to tweak our code for increased performance.

Figure 11-12. Horizontal unit-stride expansion to vec_y[4][8] of y4

Vector Parallelism

Although vectors in source code within DPC++ should be interpreted as convenience tools that are local to only a single work-item, this chapter on vectors would not be complete without some mention of how SIMD instructions in hardware operate. This discussion is not coupled to vectors within our source code, but provides orthogonal background that will be useful as we progress to the later chapters of this book that describe specific device types (GPU, CPU, FPGA).

Modern CPUs and GPUs contain SIMD instruction hardware that operates on multiple data values contained in one vector register or a register file. For example, with Intel x86 AVX-512 and other modern CPU SIMD hardware, SIMD instructions can be used to exploit data parallelism. On CPUs and GPUs that provide SIMD hardware, we can consider a vector addition operation on an eight-element vector, as shown in Figure 11-13.

Figure 11-13. SIMD addition with eight-way data parallelism

The vector addition in this example could execute in a single instruction on vector hardware, adding the vector registers vec_x and vec_y in parallel with that SIMD instruction.
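For illustration only (this is not how we write DPC++ kernels), the same eight-way addition expressed directly with x86 AVX intrinsics might look like the sketch below; in SYCL/DPC++ we would instead write scalar work-item code and let the compiler form the SIMD instruction:

  #include <immintrin.h>  // x86 AVX intrinsics

  // One eight-lane SIMD addition: vec_x + vec_y in a single
  // hardware instruction (plus the loads and the store).
  void add8(const float* x, const float* y, float* z) {
    __m256 vec_x = _mm256_loadu_ps(x);           // load 8 floats
    __m256 vec_y = _mm256_loadu_ps(y);           // load 8 floats
    __m256 sum   = _mm256_add_ps(vec_x, vec_y);  // one SIMD add
    _mm256_storeu_ps(z, sum);                    // store 8 results
  }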

Exposing potential parallelism in a hardware-agnostic way ensures that our applications can scale up (or down) to fit the capabilities of different platforms, including those with vector hardware instructions. Striking the right balance between work-item and other forms of parallelism during application development is a challenge that we must all engage with, and that is covered more in Chapters 15, 16, and 17.

Summary

There are multiple interpretations of the term vector within programming languages, and understanding the interpretation that a particular language or compiler has been built around is important when we want to write performant and scalable code. DPC++ and the DPC++ compiler have been built around the idea that vectors in source code are convenience types local to a work-item and that implicit vectorization by the compiler across work-items may map to SIMD instructions in the hardware. When we want to write code that maps directly and explicitly to vector hardware, we should look to vendor documentation and future extensions to SYCL and DPC++. Writing our kernels using multiple work-items (e.g., ND-range) and relying on the compiler to vectorize across work-items should be how most applications are written, because doing so leverages the powerful abstraction of SPMD, which provides an easy-to-reason-about programming model and scalable performance across devices and architectures.

This chapter has described the vec interface, which offers convenience out of the box when we have groupings of similarly typed data that we want to operate on (e.g., a pixel with multiple color channels). It has also touched briefly on SIMD instructions in hardware, to prepare us for more detailed discussions in Chapters 15 and 16.