In this section, we discuss the execution model of the three devices (CPU, GPU and FPGA), i.e. how to execute a primitive on these different devices.
For executing primitives on a CPU, we follow the operator-at-a-time execution model proposed by Boncz et al. . This model executes one primitive at a time for all input data, in a tight loop. When one primitive in the query plan finishes execution, the next one starts. Though vectorized execution of these primitives provide performance benefits , this is not suitable for a heterogeneous environment, due to high data transfer delay.
Figure 1 shows an OpenMP implementation of the reduce-add primitive to be run on a multi-core CPU.
For data transfer in OpenMP, we simply forward the input pointer to the target function for execution.
Similar to CPUs, GPUs also follows operator-at-a-time execution. For each primitive (or kernel in the context of OpenCL), the runtime loads data as well as creating result space in the GPU based on the target kernel. The result space is estimated based on the input size, intents (data processed per thread) . Once set, the kernel is executed.
A task sets the number of parallel local items and the work groups required for parallelizing the task on the GPU. These values are computed from the input size and parameters present in the task. Note that consecutive operations in GPU simply forward the data buffer without any expensive data routing.
In Figure 2, we show an example OpenCL implementation for the reduce-add primitive. It is worth mentioning, that the same kernel code can be used to run on a CPU instead of a GPU. Note the kernel’s additional parameter INTENT defining the number of inputs processed by a single work item.
Data transfer mechanism is different based on the target OpenCL device. For OpenCL execution on CPU, we simply forward the host pointer. In case of GPUs, we consider the case of a cold-store, where initially no data is available in GPU and data is transferred explicitly according to the incoming query. However, all intermediate results are stored in the GPU memory and only the final result is transferred back to the CPU.
While HLS solves most of the complexities in programming a FPGA, the software integration problem still remains. To connect FPGAs to a host server, there exist products such as Xillybus over PCIe  which provide both hardware interfaces and OS drivers, but any user program still has to be adapted specifically to every FPGA design. Thus, single-function accelerators are common in the industry [27, 32, 43]. Another option is to implement multiple functions in the same FPGA design, which limits the amount of resources every functional unit can occupy, and therefore potentially reduces performance. We have already analyzed these shortcomings with respect to Intel OpenCL for FPGAs . To combat this, modern FPGAs allow for smaller parts of the fabric to be reconfigured with different circuits during runtime, instead of just loading a complete design into the FPGA at power up. Dynamic partial reconfiguration of reconfigurable partitions (RP) allows for function units (FU) to be exchanged at runtime, and therefore greatly increases flexibility, but also forces higher complexity upon the designer. One possibility to hide this additional complexity is to specify the design in such a way that a simpler representation of it can be devised: An overlay architecture abstracts away the raw image of FPGA hardware resources into a user representation more specific to their application.
User Model of the Overlay Architecture
As shown in layer two of Fig. 4, our overlay architecture consists of an array of locally interconnected FUs, where each FU performs a set of streaming operations based on the primitives introduced in Sect. 3.1. These primitives are implemented using HLS, synthesized for each FU, and then integrated into the overall design transparently to the programmer via DPR. Figure 3 shows the reduction primitive for addition as an example of an HLS primitive: After defining the data types for the hardware interface using vendor libraries, we add a stream of 32-bit-integers. The logic for accessing the stream hardware is generated by the HLS tool chain.
Query processing (partially) on FPGA using the reconfigurable overlay architecture starts with mapping the input data-flow graph to the graph describing the available FUs. While finding perfect matchings is complex, efficient approximative approaches such as simulated annealing  are generally available and effective. Based on the matching, the user space driver constructs the necessary configuration data, loads the required primitives via DPR and configures data-flow routing within the overlay architecture. Finally, after input data is copied to the DDR memories on the FPGA card, the user instructs the overlay architecture to process the required columns and waits for it to finish.
Structure and Infrastructure
We implemented a custom FPGA overlay architecture focused on hardware-pipelined execution of data-flow graphs using DPR to provide diverse functionality. As Fig. 4 shows, the FPGA is divided into a static part (on the right) and a set of reconfigurable partitions. Only the actual function units (FUs) are placed inside the reconfigurable partitions (in green). The generic infrastructure elements of the static partition are in the second layer (in gray). They consist of the logic required for PCIe connectivity, a DDR3 controller and direct memory access (DMA) blocks for data access and transfer from/to the host.
The RPs are arranged in a regular grid pattern across the FPGA fabric and are grouped together with their supporting logic into a grid of tiles. The tiles are the fundamental building block of the overlay architecture and their inner structure is described in Fig. 5. Each tile has high-bandwidth data-flow connections to a 4-neighbourhood. Furthermore, there is a packet-switched configuration and status network for short message exchanges between the host program and any FU without the need for a static data-flow route. The input and output crossbars within each tile enable flexible data transport and are also set up via these messages. In addition, to support random memory accesses for example for the hash table primitives, a few tiles are also attached to the static memory bus directly, not just through DMA engines.
Since data-flow graphs are likely to not map perfectly to a 2D-grid, a special pass-through or bridge FU can be loaded, which instead of an operation just contains two FIFOs. This allows forwarding of up to two data streams through any unused tile. One exemplary use is highlighted in pink in Fig. 6, where two data streams need to cross over.
The whole design is freely scalable and flexible with respect to the targeted application domain. This is especially interesting, since the amount of general and specialized FPGA resources provided is defined one level lower, by the shape and placement of the reconfigurable partitions. This allows not only for different designs to set higher or lower resource limits, but also, the partitions do not necessarily have to have the same size. Again, all of this is transparent to the user, who only perceives an array of functional units, each with its own set of supported operations.
In conclusion, our overlay architecture allows for fast and dynamic composition of query datapaths. Using small configurable local crossbars allows for higher flexibility than a statically wired set of RPs while allowing for many more RPs than a globally interconnected system. This flexibility in data-flow routing can also help reduce the cost of reconfiguration. Also, the structure of the tiles fits perfectly to a column-based system, chosen to reduce bandwidth by accessing only the necessary columns. Finally, the structure of our overlay architecture enables both functional and data parallelism since different parts of the system are always similar.
The connections between the overlay architecture and the memory subsystem are located along the left and bottom sides in our FPGA,as shown in Fig. 6. Again, such placement is due to the physical layout of the generic infrastructure components. The DMA engines are used to scan or store columns located in the FPGA’s DDR3 DRAM. In addition, there is one bidirectional connection directly to the PCIe core, which allows one column to be streamed to or from the host server directly.