Supercomputer architects often lament that we need to “feed the beast.” The “beast” is the machine we create when we use lots of parallelism, and feeding it data becomes a key challenge.

Feeding a Data Parallel C++ program on a heterogeneous machine requires some care to ensure data is where it needs to be when it needs to be there. In a large program, that can be a lot of work. In a preexisting C++ program, it can be a nightmare just to sort out how to manage all the data movements needed.

We will carefully explain the two ways to manage data: Unified Shared Memory (USM) and buffers. USM is pointer based, which is familiar to C++ programmers. Buffers offer a higher-level abstraction. Choice is good.

We need to control the movement of data, and this chapter covers options to do exactly that.

In Chapter 2, we studied how to control where code executes. Our code needs data as input and produces data as output. Since our code may run on multiple devices and those devices do not necessarily share memory, we need to manage data movement. Even when data is shared, such as with USM, synchronization and coherency are concepts we need to understand and manage.

A logical question might be “Why doesn’t the compiler just do everything automatically for us?” While a great deal can be handled for us automatically, performance is usually suboptimal if we do not assert ourselves as programmers. In practice, for best performance, we will need to concern ourselves with code placement (Chapter 2) and data movement (this chapter) when writing heterogeneous programs.

This chapter provides an overview of managing data, including controlling the ordering of data usage. It complements the prior chapter, which showed us how to control where code runs. This chapter helps us efficiently make our data appear where we have asked the code to run, which is important not only for correct execution of our application but also to minimize execution time and power consumption.

Introduction

Compute is nothing without data. The whole point of accelerating a computation is to produce an answer more quickly. This means that one of the most important aspects of data-parallel computations is how they access data, and introducing accelerator devices into a machine further complicates the picture. In traditional single-socket CPU-based systems, we have a single memory. Accelerator devices often have their own attached memories that cannot be directly accessed from the host. Consequently, parallel programming models that support discrete devices must provide mechanisms to manage these multiple memories and move data between them.

In this chapter, we present an overview of the various mechanisms for data management. We introduce Unified Shared Memory and the buffer abstractions for data management and describe the relationship between kernel execution and data movement.

The Data Management Problem

Historically, one of the advantages of shared memory models for parallel programming is that they provide a single, shared view of memory. Having this single view of memory simplifies life. We are not required to do anything special to access memory from parallel tasks (aside from proper synchronization to avoid data races). While some types of accelerator devices (e.g., integrated GPUs) share memory with a host CPU, many discrete accelerators have their own local memories separate from that of the CPU as seen in Figure 3-1.

Figure 3-1. Multiple discrete memories

Device Local vs. Device Remote

Programs running on a device perform better when reading and writing data using memory attached directly to the device rather than remote memories. We refer to accesses to a directly attached memory as local accesses. Accesses to another device’s memory are remote accesses. Remote accesses tend to be slower than local accesses because they must travel over data links with lower bandwidth and/or higher latency. This means that it is often advantageous to co-locate both a computation and the data that it will use. To accomplish this, we must somehow ensure that data is copied or migrated between different memories in order to move it closer to where computation occurs.

Figure 3-2. Data movement and kernel execution

Managing Multiple Memories

Managing multiple memories can be accomplished, broadly, in two ways: explicitly through our program or implicitly by the runtime. Each method has its advantages and drawbacks, and we may choose one or the other depending on circumstances or personal preference.

Explicit Data Movement

One option for managing multiple memories is to explicitly copy data between different memories. Figure 3-2 shows a system with a discrete accelerator where we must first copy any data that a kernel will require from the host memory to GPU memory. After the kernel computes results, we must copy these results back to the CPU before the host program can use that data.

The primary advantage of explicit data movement is that we have full control over when data is transferred between different memories. This is important because overlapping computation with data transfer can be essential to obtain the best performance on some hardware.

The drawback of explicit data movement is that specifying all data movements can be tedious and error prone. Transferring an incorrect amount of data or not ensuring that all data has been transferred before a kernel begins computing can lead to incorrect results. Getting all of the data movement correct from the beginning can be a very time-consuming task.

Implicit Data Movement

The alternative to explicit, program-controlled data movement is implicit data movement controlled by a parallel runtime or driver. In this case, instead of requiring explicit copies between different memories, the parallel runtime is responsible for ensuring that data is transferred to the appropriate memory before it is used.

The advantage of implicit data movement is that it requires less effort to get an application to take advantage of faster memory attached directly to the device. All the heavy lifting is done automatically by the runtime. This also reduces the opportunity to introduce errors into the program since the runtime will automatically identify both when data transfers must be performed and how much data must be transferred.

The drawback of implicit data movement is that we have less or no control over the behavior of the runtime’s implicit mechanisms. The runtime will provide functional correctness but may not move data in an optimal fashion that ensures maximal overlap of computation with data transfer, and this could have a negative impact on program performance.

Selecting the Right Strategy

Picking the best strategy for a program can depend on many different factors. Different strategies might be appropriate for different phases of program development. We could even decide that the best solution is to mix and match the explicit and implicit methods for different pieces of the program. We might choose to begin using implicit data movement to simplify porting an application to a new device. As we begin tuning the application for performance, we might start replacing implicit data movement with explicit in performance-critical parts of the code. Future chapters will cover how data transfers can be overlapped with computation in order to optimize performance.

USM, Buffers, and Images

There are three abstractions for managing memory: Unified Shared Memory (USM), buffers, and images. USM is a pointer-based approach that should be familiar to C/C++ programmers. One advantage of USM is easier integration with existing C++ code that operates on pointers. Buffers, as represented by the buffer template class, describe one-, two-, or three-dimensional arrays. They provide an abstract view of memory that can be accessed on either the host or a device. Buffers are not directly accessed by the program and are instead used through accessor objects. Images act as a special type of buffer that provides extra functionality specific to image processing. This functionality includes support for special image formats, reading of images using sampler objects, and more. Buffers and images are powerful abstractions that solve many problems, but rewriting all interfaces in existing code to accept buffers or accessors can be time-consuming. Since the interface for buffers and images is largely the same, the rest of this chapter will only focus on USM and buffers.

Unified Shared Memory

USM is one tool available to us for data management. USM is a pointer-based approach that should be familiar to C and C++ programmers who use malloc or new to allocate data. USM simplifies life when porting existing C/C++ code that makes heavy use of pointers. Devices that support USM support a unified virtual address space. Having a unified virtual address space means that any pointer value returned by a USM allocation routine on the host will be a valid pointer value on the device. We do not have to manually translate a host pointer to obtain the “device version”—we see the same pointer value on both the host and device.

A more detailed description of USM can be found in Chapter 6.

Accessing Memory Through Pointers

Since not all memories are created equal when a system contains both host memory and some number of device-attached local memories, USM defines three different types of allocations: device, host, and shared. All types of allocations are performed on the host. Figure 3-3 summarizes the characteristics of each allocation type.

Figure 3-3. USM allocation types

A device allocation occurs in device-attached memory. Such an allocation can be read from and written to on a device but is not directly accessible from the host. We must use explicit copy operations to move data between regular allocations in host memory and device allocations.

A host allocation occurs in host memory that is accessible both on the host and on a device. This means the same pointer value is valid both in host code and in device kernels. However, when such a pointer is accessed, the data always comes from host memory. If it is accessed on a device, the data does not migrate from the host to device-local memory. Instead, data is typically sent over a bus, such as PCI-Express (PCI-E), that connects the device to the host.

A shared allocation is accessible on both the host and the device. In this regard, it is very similar to a host allocation, but it differs in that data can now migrate between host memory and device-local memory. This means that accesses on a device, after the migration has occurred, happen from much faster device-local memory instead of remotely accessing host memory through a higher-latency connection. Typically, this is accomplished through mechanisms inside the runtime and lower-level drivers that are hidden from us.
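The three allocation types map to three allocation functions. As a minimal sketch (assuming a queue q already exists and a device that supports the corresponding USM capabilities; the header and namespace spelling follow the DPC++ style used in this book), the three types can be created and released like this:

```
#include <CL/sycl.hpp>
using namespace sycl;

constexpr size_t N = 42;

void usm_allocations(queue &q) {
  // Device allocation: readable and writable only on the device;
  // data moves to and from the host via explicit copies.
  int *device_data = malloc_device<int>(N, q);

  // Host allocation: the same pointer value is valid on host and device,
  // but the data always resides in host memory.
  int *host_data = malloc_host<int>(N, q);

  // Shared allocation: accessible on host and device; the data may
  // migrate automatically between the two memories.
  int *shared_data = malloc_shared<int>(N, q);

  // USM allocations are released with sycl::free, which also takes the queue.
  free(device_data, q);
  free(host_data, q);
  free(shared_data, q);
}
```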

USM and Data Movement

USM supports both explicit and implicit data movement strategies, and different allocation types map to different strategies. Device allocations require us to explicitly move data between host and device, while host and shared allocations provide implicit data movement.

Explicit Data Movement in USM

Explicit data movement with USM is accomplished with device allocations and a special memcpy() found in the queue and handler classes. We enqueue memcpy() operations (actions) to transfer data either from the host to the device or from the device to the host.

Figure 3-4 contains one kernel that operates on a device allocation. Data is copied between hostArray and deviceArray before and after the kernel executes using memcpy() operations. Calls to wait() on the queue ensure that the copy to the device has completed before the kernel executes and ensure that the kernel has completed before the data is copied back to the host. We will learn how we can eliminate these calls later in this chapter.

Figure 3-4. USM explicit data movement
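The code listing for Figure 3-4 is not reproduced here. As a hedged sketch of the pattern just described (the array size and initialization value are arbitrary; the names hostArray and deviceArray follow the text above), explicit movement with a device allocation might look like this:

```
#include <CL/sycl.hpp>
#include <array>
using namespace sycl;

constexpr int N = 42;

int main() {
  queue q;
  std::array<int, N> hostArray;
  int *deviceArray = malloc_device<int>(N, q);

  for (int i = 0; i < N; i++) hostArray[i] = N;

  // Explicitly copy input data from the host to the device allocation.
  q.memcpy(deviceArray, hostArray.data(), N * sizeof(int));
  q.wait(); // ensure the copy finishes before the kernel starts

  // The kernel operates directly on the device allocation.
  q.submit([&](handler &h) {
    h.parallel_for(range<1>(N), [=](id<1> i) { deviceArray[i]++; });
  });
  q.wait(); // ensure the kernel finishes before copying results back

  // Explicitly copy the results back to the host.
  q.memcpy(hostArray.data(), deviceArray, N * sizeof(int));
  q.wait();

  free(deviceArray, q);
  return 0;
}
```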

Implicit Data Movement in USM

Implicit data movement with USM is accomplished with host and shared allocations. With these types of allocations, we do not need to explicitly insert copy operations to move data between host and device. Instead, we simply access the pointers inside a kernel, and any required data movement is performed automatically without programmer intervention (as long as the device supports these allocations). This greatly simplifies porting of existing code: simply replace any malloc or new with the appropriate USM allocation functions (as well as the calls to free to deallocate memory), and everything should just work.

Figure 3-5. USM implicit data movement

In Figure 3-5, we create two arrays, hostArray and sharedArray, that are host and shared allocations, respectively. While both host and shared allocations are directly accessible in host code, we only initialize hostArray here. Similarly, hostArray can be directly accessed inside the kernel, which performs remote reads of the data. The runtime ensures that sharedArray is available on the device before the kernel accesses it and that it is moved back when it is later read by the host code, all without programmer intervention.
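The listing for Figure 3-5 is likewise not reproduced; a rough sketch of the same pattern, assuming the device supports host and shared allocations, follows:

```
#include <CL/sycl.hpp>
#include <iostream>
using namespace sycl;

constexpr int N = 42;

int main() {
  queue q;
  int *hostArray = malloc_host<int>(N, q);
  int *sharedArray = malloc_shared<int>(N, q);

  // Only the host allocation is initialized on the host here.
  for (int i = 0; i < N; i++) hostArray[i] = 1234;

  q.submit([&](handler &h) {
    h.parallel_for(range<1>(N), [=](id<1> i) {
      // Reads of hostArray are remote accesses to host memory; sharedArray
      // may be migrated to device-local memory automatically by the runtime.
      sharedArray[i] = hostArray[i] + 1;
    });
  });
  q.wait(); // ensure the kernel finishes before the host reads sharedArray

  // The runtime makes sharedArray available again for these host reads.
  for (int i = 0; i < N; i++) std::cout << sharedArray[i] << "\n";

  free(sharedArray, q);
  free(hostArray, q);
  return 0;
}
```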

Buffers

The other abstraction provided for data management is the buffer object. Buffers are a data abstraction that represents one or more objects of a given C++ type. Elements of a buffer object can be a scalar data type (such as an int, float, or double), a vector data type (Chapter 11), or a user-defined class or structure. Data structures placed in buffers must be C++ trivially copyable, which means that an object can be safely copied byte by byte without invoking a copy constructor.

While a buffer itself is a single object, the C++ type encapsulated by the buffer could be an array that contains multiple objects. Buffers represent data objects rather than specific memory addresses, so they cannot be directly accessed like regular C++ arrays. Indeed, a buffer object might map to multiple different memory locations on several different devices, or even on the same device, for performance reasons. Instead, we use accessor objects to read and write to buffers.

A more detailed description of buffers can be found in Chapter 7.

Creating Buffers

Buffers can be created in a variety of ways. The simplest method is to construct a new buffer with a range that specifies its size. However, creating a buffer in this fashion does not initialize its data, meaning that we must first initialize the buffer through other means before attempting to read useful data from it.

Buffers can also be created from existing data on the host. This is done by invoking one of the several constructors that take either a pointer to an existing host allocation, a set of InputIterators, or a container that has certain properties. Data is copied during buffer construction from the existing host allocation into the buffer object’s host memory. A buffer may also be created from an existing cl_mem object if we are using the SYCL interoperability features with OpenCL.
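A few hedged examples of these constructors follow; the container-based constructor in the last line assumes a SYCL 2020 / DPC++ implementation, and all names here are illustrative:

```
#include <CL/sycl.hpp>
#include <vector>
using namespace sycl;

void create_buffers() {
  // An uninitialized buffer of 42 doubles; it must be written
  // through an accessor before useful data can be read from it.
  buffer<double> b1{range<1>(42)};

  // A buffer initialized by copying from an existing host allocation.
  double myDoubles[1024] = {};
  buffer<double> b2{myDoubles, range<1>(1024)};

  // A buffer initialized from a pair of input iterators.
  std::vector<double> v(1024, 0.0);
  buffer<double> b3{v.begin(), v.end()};

  // A buffer created directly from a container (SYCL 2020 constructor).
  buffer<double> b4{v};
}
```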

Accessing Buffers

Buffers may not be directly accessed by the host and device (except through advanced and infrequently used mechanisms not described here). Instead, we must create accessors in order to read and write to buffers. Accessors provide the runtime with information about how we plan to use the data in buffers, allowing it to correctly schedule data movement.

Figure 3-6. Buffers and accessors
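The listing for Figure 3-6 is not reproduced here. A minimal sketch of the buffer-and-accessor pattern it illustrates (assuming class template argument deduction for buffers and accessors, as in SYCL 2020 / DPC++) might look like the following; the accessor name myAccessor matches the access-mode discussion below:

```
#include <CL/sycl.hpp>
#include <array>
#include <iostream>
using namespace sycl;

constexpr int N = 42;

int main() {
  std::array<int, N> myArray;
  for (int i = 0; i < N; i++) myArray[i] = 0;

  queue q;
  {
    // The buffer manages the data in myArray for the buffer's lifetime.
    buffer myBuffer{myArray};

    q.submit([&](handler &h) {
      // Created with the default access mode, access::mode::read_write.
      accessor myAccessor{myBuffer, h};

      h.parallel_for(range<1>(N), [=](id<1> i) { myAccessor[i]++; });
    });
  } // buffer destruction writes any modified data back to myArray

  for (int i = 0; i < N; i++) std::cout << myArray[i] << " ";
  std::cout << "\n";
  return 0;
}
```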

Figure 3-7. Buffer access modes

Access Modes

When creating an accessor, we can inform the runtime how we are going to use it, providing more information for optimizations. We do this by specifying an access mode. Access modes are defined in the access::mode enum described in Figure 3-7. In the code example shown in Figure 3-6, the accessor myAccessor is created with the default access mode, access::mode::read_write. This lets the runtime know that we intend to both read and write to the buffer through myAccessor. Access modes are how the runtime is able to optimize implicit data movement. For example, access::mode::read tells the runtime that the data needs to be available on the device before this kernel can begin executing. If a kernel only reads data through an accessor, there is no need to copy that data back to the host after the kernel has completed, since we haven’t modified it. Likewise, access::mode::write lets the runtime know that we will modify the contents of a buffer and may need to copy the results back after computation has ended.
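As a brief illustration (the buffers inBuf and outBuf and the helper function are hypothetical), non-default access modes can be requested when creating accessors through the handler:

```
#include <CL/sycl.hpp>
using namespace sycl;

void double_values(queue &q, buffer<int> &inBuf, buffer<int> &outBuf) {
  q.submit([&](handler &h) {
    // access::mode::read: data must reach the device before the kernel
    // runs, but nothing needs to be copied back afterward.
    auto in = inBuf.get_access<access::mode::read>(h);

    // access::mode::write: the kernel produces results that may need to
    // be copied back to the host after it completes.
    auto out = outBuf.get_access<access::mode::write>(h);

    h.parallel_for(inBuf.get_range(), [=](id<1> i) { out[i] = in[i] * 2; });
  });
}
```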

Creating accessors with the proper modes gives the runtime more information about how we use data in our program. The runtime uses accessors to order the uses of data, but it can also use this information to optimize scheduling of kernels and data movement. The access modes and optimization tags are described in greater detail in Chapter 7.

Ordering the Uses of Data

Kernels can be viewed as asynchronous tasks that are submitted for execution. These tasks must be submitted to a queue where they are scheduled for execution on a device. In many cases, kernels must execute in a specific order so that the correct result is computed. If obtaining the correct result requires task A to execute before task B, we say that a dependence exists between tasks A and B.

However, kernels are not the only form of task that must be scheduled. Any data that is accessed by a kernel needs to be available on the device before the kernel can start executing. These data dependences can create additional tasks in the form of data transfers from one device to another. Data transfer tasks may be either explicitly coded copy operations or more commonly implicit data movements performed by the runtime.

If we take all the tasks in a program and the dependences that exist between them, we can visualize this information as a graph. This task graph is specifically a directed acyclic graph (DAG) where the nodes are the tasks and the edges are the dependences. The graph is directed because dependences are one-way: task A must happen before task B. The graph is acyclic because it does not contain any cycles, that is, paths from a node that lead back to itself.

In Figure 3-8, task A must execute before tasks B and C. Likewise, B and C must execute before task D. Since B and C do not have a dependence between each other, the runtime is free to execute them in any order (or even in parallel) as long as task A has already executed. Therefore, the possible legal orderings of this graph are ABCD, ACBD, and even A{B,C}D if B and C can concurrently execute.

Figure 3-8. Simple task graph

Tasks may have dependences on only a subset of all other tasks. In these cases, we only want to specify the dependences that matter for correctness. This flexibility gives the runtime latitude to optimize the execution order of the task graph. In Figure 3-9, we extend the earlier task graph from Figure 3-8 by adding tasks E and F, where E must execute before F. However, tasks E and F have no dependences on nodes A, B, C, and D. This allows the runtime to choose from many possible legal orderings to execute all the tasks.

Figure 3-9. Task graph with disjoint dependences

There are two different ways to model the execution of tasks, such as a launch of a kernel, in a queue: the queue could either execute tasks in the order of submission, or it could execute tasks in any order subject to any dependences that we define. There are several mechanisms for us to define the dependences needed for correct ordering.

In-order Queues

The simplest option for ordering tasks is to submit them to an in-order queue object. An in-order queue executes tasks in the order in which they were submitted, as seen in Figure 3-10. The intuitive ordering of in-order queues provides an advantage in simplicity, but it comes with a disadvantage: execution of tasks serializes even when no dependences exist between them. In-order queues are useful when bringing up applications because they are simple, intuitive, deterministic on execution ordering, and suitable for many codes.

Figure 3-10. In-order queue usage
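A sketch of what submitting dependent work to an in-order queue can look like (the kernels here are placeholders; the in_order queue property is the standard way to request this behavior):

```
#include <CL/sycl.hpp>
using namespace sycl;

constexpr int N = 42;

int main() {
  // Tasks submitted to this queue execute in submission order.
  queue q{property::queue::in_order()};

  int *data = malloc_shared<int>(N, q);

  // Task A: initialize the data.
  q.parallel_for(range<1>(N), [=](id<1> i) { data[i] = 1; });
  // Task B: runs only after task A has completed.
  q.parallel_for(range<1>(N), [=](id<1> i) { data[i] += 2; });
  // Task C: runs only after task B has completed.
  q.parallel_for(range<1>(N), [=](id<1> i) { data[i] *= 3; });

  q.wait(); // wait for all three tasks before reading data on the host

  free(data, q);
  return 0;
}
```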

Out-of-Order (OoO) Queues

Since queue objects are out-of-order queues (unless created with the in-order queue property), they must provide ways to order tasks submitted to them. Queues order tasks by letting us inform the runtime of dependences between them. These dependences can be specified, either explicitly or implicitly, using command groups.

A command group is an object that specifies a task and its dependences. Command groups are typically written as C++ lambdas passed as an argument to the submit() method of a queue object. This lambda’s only parameter is a reference to a handler object. The handler object is used inside the command group to specify actions, create accessors, and specify dependences.
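A skeletal command group, sketched for reference (the buffer and function names are hypothetical):

```
#include <CL/sycl.hpp>
using namespace sycl;

void fill_buffer(queue &q, buffer<int> &someBuffer) {
  // A command group: a lambda taking a handler reference, passed to submit().
  q.submit([&](handler &h) {
    // The handler is used to create accessors and express dependences...
    accessor acc{someBuffer, h};
    // ...and to specify the command group's action, here a parallel_for kernel.
    h.parallel_for(someBuffer.get_range(), [=](id<1> i) {
      acc[i] = static_cast<int>(i[0]);
    });
  });
}
```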

Explicit Dependences with Events

Explicit dependences between tasks look like the examples we have seen (Figure 3-8) where task A must execute before task B. Expressing dependences in this way focuses on explicit ordering based on the computations that occur rather than on the data accessed by the computations. Note that expressing dependences between computations is primarily relevant for codes that use USM since codes that use buffers express most dependences via accessors. In Figures 3-4 and 3-5, we simply tell the queue to wait for all previously submitted tasks to finish before we continue. Instead, we can express task dependences through event objects. When submitting a command group to a queue, the submit() method returns an event object. These events can then be used in two ways.

First, we can synchronize through the host by explicitly calling the wait() method on an event. This forces the runtime to wait for the task that generated the event to finish executing before host program execution may continue. Explicitly waiting on events can be very useful for debugging an application, but wait() can overly constrain the asynchronous execution of tasks since it halts all execution on the host thread. Similarly, one could also call wait() on a queue object, which would block execution on the host until all enqueued tasks have completed. This can be a useful tool if we do not want to keep track of all the events returned by enqueued tasks.

This brings us to the second way that events can be used. The handler class contains a method named depends_on(). This method accepts either a single event or a vector of events and informs the runtime that the command group being submitted requires the specified events to complete before the action within the command group may execute. Figure 3-11 shows an example of how depends_on() may be used to order tasks.

Figure 3-11. Using events and depends_on
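The listing for Figure 3-11 is not reproduced here; a hedged sketch of ordering two USM kernels with events and depends_on() follows:

```
#include <CL/sycl.hpp>
using namespace sycl;

constexpr int N = 42;

int main() {
  queue q;
  int *data = malloc_shared<int>(N, q);

  // Task A: submit() returns an event representing this kernel.
  event eA = q.submit([&](handler &h) {
    h.parallel_for(range<1>(N), [=](id<1> i) { data[i] = 1; });
  });

  // Task B: declare that it must wait for task A to complete.
  event eB = q.submit([&](handler &h) {
    h.depends_on(eA);
    h.parallel_for(range<1>(N), [=](id<1> i) { data[i] += 41; });
  });

  // Synchronize with the host through the second event.
  eB.wait();

  free(data, q);
  return 0;
}
```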

Implicit Dependences with Accessors

Implicit dependences between tasks are created from data dependences. Data dependences between tasks take three forms, shown in Figure 3-12.

Figure 3-12. Three forms of data dependences

Data dependences are expressed to the runtime in two ways: accessors and program order. Both must be used for the runtime to properly compute data dependences. This is illustrated in Figures 3-13 and 3-14.

Figure 3-13. Read-after-Write

Figure 3-14. RAW task graph

In Figures 3-13 and 3-14, we execute three kernels—computeB, readA, and computeC—and then read the final result back on the host. The command group for kernel computeB creates two accessors, accA and accB. These accessors use the access tags read_only and write_only, an optimization that specifies we do not need the default access mode, access::mode::read_write. We will learn more about access tags in Chapter 7. Kernel computeB reads buffer A and writes to buffer B. Buffer A must be copied from the host to the device before the kernel begins execution.

Kernel readA also creates a read-only accessor for buffer A. Since kernel readA is submitted after kernel computeB, this creates a Read-after-Read (RAR) scenario. RARs do not place extra restrictions on the runtime, however, and the kernels are free to execute in any order. Indeed, a runtime might prefer to execute kernel readA before kernel computeB or even execute both at the same time. Both require buffer A to be copied to the device, but kernel computeB also requires buffer B to be copied, in case any existing values that computeB does not overwrite are used by later kernels. This means that the runtime could execute kernel readA while the data transfer for buffer B occurs. It also shows that even if a kernel will only write to a buffer, the buffer’s original content may still be moved to the device, because there is no guarantee that all values in the buffer will be written by the kernel (see Chapter 7 for tags that let us optimize in these cases).

Kernel computeC reads buffer B, which we computed in kernel computeB. Since we submitted kernel computeC after we submitted kernel computeB, kernel computeC has a RAW data dependence on buffer B. RAW dependences are also called true dependences or flow dependences, as data needs to flow from one computation to another in order to compute the correct result. Finally, we also create a RAW dependence on buffer C between kernel computeC and the host, since the host wants to read C after the kernel has finished. This forces the runtime to copy buffer C back to the host. Since there were no writes to buffer A on any device, the runtime does not need to copy that buffer back to the host because the host already has an up-to-date copy.
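The listing for Figure 3-13 is not reproduced here. The following sketch follows the kernels named above (using SYCL 2020 accessor tags read_only and write_only, and a host_accessor for the final read on the host; exact syntax may vary by implementation):

```
#include <CL/sycl.hpp>
#include <array>
#include <iostream>
using namespace sycl;

constexpr int N = 42;

int main() {
  std::array<int, N> a, b, c;
  for (int i = 0; i < N; i++) { a[i] = 1; b[i] = 0; c[i] = 0; }

  queue q;
  buffer A{a}, B{b}, C{c};

  q.submit([&](handler &h) {   // kernel computeB: reads A, writes B
    accessor accA{A, h, read_only};
    accessor accB{B, h, write_only};
    h.parallel_for(range<1>(N), [=](id<1> i) { accB[i] = accA[i] + 1; });
  });

  q.submit([&](handler &h) {   // kernel readA: RAR with computeB on buffer A
    accessor accA{A, h, read_only};
    h.parallel_for(range<1>(N), [=](id<1> i) { int x = accA[i]; (void)x; });
  });

  q.submit([&](handler &h) {   // kernel computeC: RAW on buffer B
    accessor accB{B, h, read_only};
    accessor accC{C, h, write_only};
    h.parallel_for(range<1>(N), [=](id<1> i) { accC[i] = accB[i] * 2; });
  });

  // Reading C on the host creates a RAW dependence on kernel computeC and
  // forces buffer C to be copied back before this accessor is ready to use.
  host_accessor hostC{C, read_only};
  for (int i = 0; i < N; i++) std::cout << hostC[i] << " ";
  std::cout << "\n";
  return 0;
}
```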

Figure 3-15. Write-after-Read and Write-after-Write

Figure 3-16. WAR and WAW task graph

In Figures 3-15 and 3-16, we again execute three kernels: computeB, rewriteA, and rewriteB. Kernel computeB once again reads buffer A and writes to buffer B, kernel rewriteA writes to buffer A, and kernel rewriteB writes to buffer B. Kernel rewriteA could theoretically execute earlier than kernel computeB since less data needs to be transferred before the kernel is ready, but it must wait until after kernel computeB finishes since there is a WAR dependence on buffer A.

In this example, kernel computeB requires the original value of A from the host, and it would read the wrong values if kernel rewriteA executed before kernel computeB. WAR dependences are also called anti-dependences. RAW dependences ensure that data properly flows in the correct direction, while WAR dependences ensure existing values are not overwritten before they are read. The WAW dependence on buffer B found in kernel rewriteB functions similarly. If there were any reads of buffer B submitted in between kernels computeB and rewriteB, they would result in RAW and WAR dependences that would properly order the tasks. However, there is an implicit dependence between kernel rewriteB and the host in this example since the final data must be written back to the host. We will learn more about what causes this writeback in Chapter 7. The WAW dependence, also called an output dependence, ensures that the final output will be correct on the host.
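A corresponding sketch for the WAR and WAW example, under the same assumptions as the previous sketch:

```
#include <CL/sycl.hpp>
#include <array>
using namespace sycl;

constexpr int N = 42;

int main() {
  std::array<int, N> a, b;
  for (int i = 0; i < N; i++) { a[i] = 1; b[i] = 0; }

  queue q;
  {
    buffer A{a}, B{b};

    q.submit([&](handler &h) {   // kernel computeB: reads A, writes B
      accessor accA{A, h, read_only};
      accessor accB{B, h, write_only};
      h.parallel_for(range<1>(N), [=](id<1> i) { accB[i] = accA[i] + 1; });
    });

    q.submit([&](handler &h) {   // kernel rewriteA: WAR dependence on buffer A
      accessor accA{A, h, write_only};
      h.parallel_for(range<1>(N), [=](id<1> i) { accA[i] = 21 + 21; });
    });

    q.submit([&](handler &h) {   // kernel rewriteB: WAW dependence on buffer B
      accessor accB{B, h, write_only};
      h.parallel_for(range<1>(N), [=](id<1> i) { accB[i] = 30 + 12; });
    });
  } // buffer destruction writes the final values of A and B back to a and b
  return 0;
}
```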

Choosing a Data Management Strategy

Selecting the right data management strategy for our applications is largely a matter of personal preference. Indeed, we may begin with one strategy and switch to another as our program matures. However, there are a few useful guidelines to help us to pick a strategy that will serve our needs.

The first decision to make is whether we want to use explicit or implicit data movement since this greatly affects what we need to do to our program. Implicit data movement is generally an easier place to start because all the data movement is handled for us, letting us focus on expression of the computation.

If we decide that we’d rather have full control over all data movement from the beginning, then explicit data movement using USM device allocations is where we want to start. We just need to be sure to add all the necessary copies between host and devices!

When selecting an implicit data movement strategy, we still have a choice of whether to use buffers or USM host or shared pointers. Again, this choice is a matter of personal preference, but there are a few questions that could help guide us to one over the other. If we’re porting an existing C/C++ program that uses pointers, USM might be an easier path since most code won’t need to change. If data representation hasn’t guided us to a preference, another question we can ask is how we would like to express our dependences between kernels. If we prefer to think about data dependences between kernels, choose buffers. If we prefer to think about dependences as performing one computation before another and want to express that using an in-order queue or with explicit events or waiting between kernels, choose USM.

When using USM pointers (with either explicit or implicit data movement), we have a choice of which type of queue we want to use. In-order queues are simple and intuitive, but they constrain the runtime and may limit performance. Out-of-order queues are more complex, but they give the runtime more freedom to re-order and overlap execution. The out-of-order queue class is the right choice if our program will have complex dependences between kernels. If our program simply runs many kernels one after another, then an in-order queue will be a better option for us.

Handler Class: Key Members

We have shown a number of ways to use the handler class. Figures 3-17 and 3-18 provide a more detailed explanation of the key members of this very important class. We have not yet used all these members, but they will be used later in the book. This is as good a place as any to lay them out.

A closely related class, the queue class, is similarly explained at the end of Chapter 2. The online oneAPI DPC++ language reference provides an even more detailed explanation of both classes.

Figure 3-17. Simplified definition of the non-accessor members of the handler class

Figure 3-18. Simplified definition of the accessor members of the handler class

Summary

In this chapter, we have introduced the mechanisms that address the problems of data management and how to order the uses of data. Managing access to different memories is a key challenge when using accelerator devices, and we have different options to suit our needs.

We provided an overview of the different types of dependences that can exist between the uses of data, and we described how to provide information about these dependences to queues so that they properly order tasks.

This chapter provided an overview of Unified Shared Memory and buffers. We will explore all the modes and behaviors of USM in greater detail in Chapter 6. Chapter 7 will explore buffers more deeply, including all the different ways to create buffers and control their behavior. Chapter 8 will revisit the scheduling mechanisms for queues that control the ordering of kernel executions and data movements.