In-Depth Analysis of OLAP Query Performance on Heterogeneous Hardware

Classical database systems are now facing the challenge of processing high-volume data feeds at unprecedented rates as efficiently as possible while also minimizing power consumption. Since CPU-only machines hit their limits, co-processors like GPUs and FPGAs are investigated by database system designers for their distinct capabilities. As a result, database systems over heterogeneous processing architectures are on the rise. In order to better understand their potentials and limitations, in-depth performance analyses are vital. This paper provides interesting performance data by benchmarking a portable operator set for column-based systems on CPU, GPU, and FPGA – all available processing devices within the same system. We consider TPC‑H query Q6 and additionally a hash join to profile the execution across the systems. We show that system memory access and/or buffer management remains the main bottleneck for device integration, and that architecture-specific execution engines and operators offer significantly higher performance.

eas for database systems. These applications require a huge amount of processing power from CPUs [33].
Such trends in data-driven applications and modern hardware devices shape the efforts made by the database community and divide it into two main groups: The first focuses on a scaling-out strategy, i.e., on distributing storage and/or processing over clusters of computers that still use CPUs as the main processing element [40,41]. The second group focuses on scaling-up, i.e., utilizing the advancements made in hardware architecture, specifically GPUs and FPGAs, to build heterogeneous machines with the goal to process data more efficiently than CPU-only machines. The motivation for this strategy is that GPUs and FPGAs have inherently parallel execution models. While GPUs are known for their SIMT model, FPGAs offer spatial parallelism, pipelined execution, low power consumption and inherently enable stream processing. Overall, due to increased demands for efficient processing of large amounts of data, the rise of heterogeneous database systems seems inevitable.
Several projects are published where GPUs are integrated as co-processors in addition to CPUs into one system [10,16,30]. Similarly there are other attempts to integrate FPGAs and CPUs [27,29,32]. Since most systems focus on these two individual pairings, few research groups investigated systems consisting of all three.
In this paper, we focus on analyzing the performance of all three devices (CPU-GPU-FPGA). In order to draw fair conclusions, we run the same workload from TPC-H [42] on all of them using common primitive-based operator implementations. Our performance analysis include an end-toend analysis to cover all overheads incurred by the underlying system. We not only compare performance figures, but also provide insights into portability and potential barriers and bottlenecks. These insights provide a better understanding of the scale-up strategy.
This paper is structured as follows: In Sect. 2, we discuss related work. In Sect. 3, we provide background information on device characteristics and used primitives. Then in Sect. 4, we describe how we achieve operator portability across different hardware architectures, with special focus on the FPGA architecture. In Sect. 5, we explain our experimental setup and present the experimental results in Sect. 6. Finally, in Sect. 7, we conclude the paper by pointing out limitations and future work.

Related Work
In this section, we discuss a selective list of Online Analytical Processing (OLAP) systems, which we divide into two groups: CPU-GPU and CPU-FPGA hybrids. CPU-GPU Hybrid: GPUQP [16] is one of the first attempts to introduce GPUs as query co-processors in order to exploit their massive data parallelism. He et al. [22] introduced an in-memory relational query co-processing system called GDB. This system uses an optimized set of primitives such as split and sort for query processing. Similarly, CoGaDB [10] is a main memory DBMS targeting a heterogeneous CPU-GPU system. It uses two sets of operators, one for the CPU, and the other for GPU. Further, it uses a device-aware optimizer -HyPE [9]. HyPE takes advantage of CPUs and GPUs and extends the physical optimization process in a cost-learning-based engine. Alternative to CoGaDB, a hardware-oblivious processing engine -Ocelot [23] is developed as an extension of MonetDB. In contrast to previously mentioned hardware-aware systems, its operators were developed using OpenCL [25]. Similarly, OmniDB [48] exploits hardware characteristics in OpenCL based CPU-GPU DMBSs.
Other than these, few commercial implementation comprises of (but not limited to) Kinetica [26], OmniSci (a.k.a. MapD) [28], SQreamDB [38] and BlazingDB [6]. CPU-FPGA Hybrid: Researchers also investigated heterogeneous DBMSs combining CPU and FPGA. One example of such systems is doppioDB [36], an extension of Mon-etDB that uses Hardware User Defined Functions (HUDFs) for query acceleration on FPGAs. The system was built on Intel's Xeon+FPGA platform showing advantages and limitations of HUDFs. It is built on specialized hardware, while our FPGA system is designed for off-the-shelf FPGA accelerator cards.
Centaur [29] is a framework for hybrid CPU-FPGA databases that allows the dynamic allocation of pipelined FPGA operators as well as hybrid execution on CPU and FPGA. Centaur utilizes UDFs and FThreads to allow seamless integration of FPGA operators in the database engine. While our system is designed with Dynamic Partial Reconfiguration (DPR) and portability as main goals, Centaur builds on the Intel Xeon+FPGA technology, and they aim to distribute and queue commands to a set of static workers.
In [39], the authors use a PCIe-attached FPGA to offload predicate evaluation for heavy analytical query processing, while the CPU handles transaction processing. They stream compressed row-based data through a configurable accelerator over PCIe for executing different queries without reconfiguration. The reported results indicate the system offers an order of magnitude speedup in the offloading computations compared to a software baseline.
AxleDB [34] is an FPGA-based database query processor with SSDs directly attached to the processing units. Static accelerators are arranged in a streaming-based ring bus and are operated by a global state machine with hostgenerated instruction words. In addition to the accelerators, it implements an index structure. The authors achieve a speedup of 2x over standard DBMSs by implementing data parallel accelerators. While being based on standard FPGA hardware, AxleDB is also a static system, which limits its flexibility in both allocation of compute resources and communication compared to a system with DPR and configurable data-flow routing.
The ReProVide project [4] presents a hybrid system consisting of CPU-FPGA, where the FPGA is connected over the network. This system uses the dynamic partial reconfiguration (DPR) capability of modern FPGAs to exchange large accelerators to do pre-processing of data to be sent to the host. The authors also propose a model for operator placement and buffer allocation.
The DPR-based FPGA system presented in [49] cannot just stream rows through processing modules, but can also do sort and join operations using heterogeneous reconfigurable partitions. The authors focus only on the FPGA: The host assigns partial queries to the FPGA and manages the accelerator. In addition to the special function partition for sort and join, the system consists of 4 banks of 16 small daisy-chained reconfigurable modules with data flow being routed only between banks. This work is the closest in design to our system, but we designed a fully column-based system and focus on more flexible routing of data-flow between compute units and a more uniform grid of compute tiles.
K Fig. 1 OpenMP code of the reduce-add primitive Overall, in this work, we evaluate query processing over a heterogeneous processor system containing three different processor architectures: CPU, GPU and FPGA. For a detailed discussion on integrating FPGAs into DBMSs, we refer to our previous work [3]. There we present an extensive literature review, identify the challenges and propose approaches to tackle them. To the best of our knowledge, there is no existing work that compares query processing performance on CPU, GPU and FPGA processing from a single holistic setup.

Background
Current processor architectures are branching out into two broad spectra: (1) homogeneous cores, where multiple CPU cores are tightly coupled for processing. (2) heterogeneous cores, where tailor-made compute architectures are used for specific functions. DBMSs running on top of these architectures have to be aware of the features of the underlying hardware to have improved performance [10,36].
In order to run database operations on different hardware architectures, an abstract representation of the operations is necessary to reduce re-implementation work and increase re-usability [8].

Primitives
We use primitives as basic building blocks of DBMS operations. Primitives in this context are atomic functions that are combined together to form an SQL operator. They can be parallelized and arranged together to execute partial or full SQL queries [7,22]. We surveyed the primitives and detailed their alternative implementations in our paper [19]. Below are the primitives considered in this work: Map: A one-to-one mapping from input to output. Scan: Prefix sum on the given input. Materialize: Synthesizing selected column values. Aggregate: Grouped or single reduction on an array of inputs. Scatter & Gather: Data shuffling operations. Using these primitives, we build a Directed Acyclic Graph (DAG) for a given SQL query.

Devices and Characteristics
CPU: the key characteristics of CPUs are instruction pipelining, hierarchical caches and SIMD [24]. With pipelining, CPUs provide instruction-level parallelism but the overall execution in a system is delayed due to data and control-flow dependencies across instructions. This delay is hidden using out-of-order execution and branch prediction strategies. Another improvement in current CPU generations are the SIMD units i.e., vectorized execution of a single instruction across multiple values [12,37]. These units can execute an instruction on a vector of data instead of just on a single value. GPU: A GPU contains thousands of physical cores to support embarrassingly parallel operations. GPUs use a higher degree of Thread Level Parallelism (TLP) compared to CPUs in order to hide memory access latency. A GPU is capable of executing a large number of simple instructions in parallel, but does not have the capability of branch prediction. Therefore, control-flow-heavy operations cannot exploit the full parallelism in a GPU due to the serialization of diverging branches [21]. Finally, the biggest challenge in a GPU is the transfer time spent to move data from the CPU's main memory to the GPU's. This transfer overhead can over-shadow the performance benefits of GPU-acceleration [11]. FPGA: FPGAs are capable of realizing arbitrary logic circuits by mapping computation, i.e., mathematical (pure, stateless) functions, to hardware lookup tables (LUT). Additionally, modern FPGAs contain specialized hardware for more efficient processing, including large blocks of SRAM (BRAM), carry chains connecting neighboring LUTs and multiply-accumulate units (DSP). Finally, dynamic partial reconfiguration (DPR) allows the designer to exchange parts of the circuit at runtime. Traditionally, FPGAs are programmed using hardware description languages (HDL) that allow for low-level design specification on the register transfer level (RTL). As compiler technology improved, the concept of high-level synthesis (HLS) was developed. Here, the kernel is programmed in a higher-level language such as C++, and is then transformed automatically into a circuit. HLS provides a more software-like development environment, but can sometimes be less flexible and also incur a slight resource overhead compared to an RTL design [44].

Execution Models
In this section, we discuss the execution model of the three devices (CPU, GPU and FPGA), i.e. how to execute a primitive on these different devices.

CPU
For executing primitives on a CPU, we follow the operator-at-a-time execution model proposed by Boncz et al. [7]. This model executes one primitive at a time for all input data, in a tight loop. When one primitive in the query plan finishes execution, the next one starts. Though vectorized execution of these primitives provide performance benefits [50], this is not suitable for a heterogeneous environment, due to high data transfer delay. Figure 1 shows an OpenMP implementation of the reduce-add primitive to be run on a multi-core CPU.
For data transfer in OpenMP, we simply forward the input pointer to the target function for execution.

GPU
Similar to CPUs, GPUs also follows operator-at-a-time execution. For each primitive (or kernel in the context of OpenCL), the runtime loads data as well as creating result space in the GPU based on the target kernel. The result space is estimated based on the input size, intents (data processed per thread) [31]. Once set, the kernel is executed.
A task sets the number of parallel local items and the work groups required for parallelizing the task on the GPU. These values are computed from the input size and parameters present in the task. Note that consecutive operations in GPU simply forward the data buffer without any expensive data routing.
In Figure 2, we show an example OpenCL implementation for the reduce-add primitive. It is worth mentioning, that the same kernel code can be used to run on a CPU instead of a GPU. Note the kernel's additional parameter INTENT defining the number of inputs processed by a single work item.
Data transfer mechanism is different based on the target OpenCL device. For OpenCL execution on CPU, we simply forward the host pointer. In case of GPUs, we consider the case of a cold-store, where initially no data is available in GPU and data is transferred explicitly according to the incoming query. However, all intermediate results are stored in the GPU memory and only the final result is transferred back to the CPU.

FPGA
While HLS solves most of the complexities in programming a FPGA, the software integration problem still re- To connect FPGAs to a host server, there exist products such as Xillybus over PCIe [46] which provide both hardware interfaces and OS drivers, but any user program still has to be adapted specifically to every FPGA design. Thus, single-function accelerators are common in the industry [27,32,43]. Another option is to implement multiple functions in the same FPGA design, which limits the amount of resources every functional unit can occupy, and therefore potentially reduces performance. We have already analyzed these shortcomings with respect to Intel OpenCL for FPGAs [15]. To combat this, modern FPGAs allow for smaller parts of the fabric to be reconfigured with different circuits during runtime, instead of just loading a complete design into the FPGA at power up. Dynamic partial reconfiguration of reconfigurable partitions (RP) allows for function units (FU) to be exchanged at runtime, and therefore greatly increases flexibility, but also forces higher complexity upon the designer. One possibility to hide this additional complexity is to specify the design in such a way that a simpler representation of it can be devised: An overlay architecture abstracts away the raw image of FPGA hardware resources into a user representation more specific to their application.

User Model of the Overlay Architecture
As shown in layer two of Fig. 4, our overlay architecture consists of an array of locally interconnected FUs, where each FU performs a set of streaming operations based on the primitives introduced in Sect. 3.1. These primitives are implemented using HLS, synthesized for each FU, and then integrated into the overall design transparently to the programmer via DPR. Figure 3 shows the reduction primitive for addition as an example of an HLS primitive: After defining the data types for the hardware interface using vendor libraries, we add a stream of 32-bit-integers. The logic for accessing the stream hardware is generated by the HLS tool chain. Query processing (partially) on FPGA using the reconfigurable overlay architecture starts with mapping the input data-flow graph to the graph describing the available FUs. While finding perfect matchings is complex, efficient approximative approaches such as simulated annealing [5] are generally available and effective. Based on the matching, the user space driver constructs the necessary configuration data, loads the required primitives via DPR and configures data-flow routing within the overlay architecture. Finally, after input data is copied to the DDR memories on the FPGA card, the user instructs the overlay architecture to process the required columns and waits for it to finish.

Structure and Infrastructure
We implemented a custom FPGA overlay architecture focused on hardware-pipelined execution of data-flow graphs using DPR to provide diverse functionality. As Fig. 4 shows, the FPGA is divided into a static part (on the right) and The RPs are arranged in a regular grid pattern across the FPGA fabric and are grouped together with their supporting logic into a grid of tiles. The tiles are the fundamental building block of the overlay architecture and their inner structure is described in Fig. 5. Each tile has high-bandwidth data-flow connections to a 4-neighbourhood. Furthermore, there is a packet-switched configuration and status network for short message exchanges between the host program and any FU without the need for a static data-flow route. The input and output crossbars within each tile enable flexible data transport and are also set up via these messages. In addition, to support random memory accesses for example for the hash table primitives, a few tiles are also attached to the static memory bus directly, not just through DMA engines.
Since data-flow graphs are likely to not map perfectly to a 2D-grid, a special pass-through or bridge FU can be loaded, which instead of an operation just contains two FIFOs. This allows forwarding of up to two data streams through any unused tile. One exemplary use is highlighted in pink in Fig. 6, where two data streams need to cross over.
The whole design is freely scalable and flexible with respect to the targeted application domain. This is especially interesting, since the amount of general and specialized FPGA resources provided is defined one level lower, by the shape and placement of the reconfigurable partitions. This allows not only for different designs to set higher or lower resource limits, but also, the partitions do not necessarily In conclusion, our overlay architecture allows for fast and dynamic composition of query datapaths. Using small configurable local crossbars allows for higher flexibility than a statically wired set of RPs while allowing for many more RPs than a globally interconnected system. This flexibility in data-flow routing can also help reduce the cost of reconfiguration. Also, the structure of the tiles fits perfectly to a column-based system, chosen to reduce bandwidth by accessing only the necessary columns. Finally, the structure of our overlay architecture enables both functional and data parallelism since different parts of the system are always similar.

Physical Design
The connections between the overlay architecture and the memory subsystem are located along the left and bottom sides in our FPGA,as shown in Fig. 6. Again, such placement is due to the physical layout of the generic infrastructure components. The DMA engines are used to scan or store columns located in the FPGA's DDR3 DRAM. In addition, there is one bidirectional connection directly to the PCIe core, which allows one column to be streamed to or from the host server directly.

Experimental Setup
This section covers the setup of our experiments and describes the hardware used to gather our results. Since we focus on analytical query processing instead of transactions, we choose the TPC-H benchmark [42] for our comparisons. Due to limited space, we report a detailed analysis of one query, which we carefully selected. One important goal was to have a similar query execution plan for CPU and GPU. Such an operator-oblivious comparison is required to provide a fair comparison of both performance and system integration. Our secondary goal was to select either a complete query or a meaningful partial query pipeline that could be implemented with the finite hardware resources of common FPGAs. Furthermore, our current evaluation excludes algorithmic variants and computationally expensive multidimensional sorting operations. Such evaluation is extensively studied on their own already [13,18,20,47]. Fig. 6 shows the data-flow graph of Q6 with its different types of operations. The selection primitives, consisting of three range comparisons, are connected via logical operations. This is shown in orange in the bottom half of Fig. 6. At the tail end Q6 contains two arithmetic operations, which are shown in green and violet. After the reduction, only a scalar result remains. For execution on FPGA, all primitives are mapped to the tiles of our FPGA overlay architecture. After selection, for CPU and GPU, results are materialized by gathering column values according to the 1-values in the resulting bitmap. In contrast, the dataflow processing using hardware-pipelined function units on the FPGA allows for in-place materialization using the filter primitive. The filter and post-processing primitives are shown in violet in Fig. 6. Since this tree-shaped query execution plan does not neatly fit onto a 2D-grid, non-compute FIFO-only Bridge FUs are used to allow flexible placement of the query's data-flow graph.
Overall Q6 contains enough optimization potential to allow all of the three architectures to apply their own strengths, but the execution plans on different devices are still similar enough to reach comparable results.

Device Setup
All measurements are taken on a server equipped with an AMD Epyc 7351P 16-Core Processor, an AMD Radeon RX Vega 56 with 8GiB HBM and a Xilinx Zynq-7000 ZC706 evaluation card with 2GiB DDR3 DRAM and Debian Linux as the operating system. FPGA -Physical Setup: Our FPGA provides 218600 LUTS, 437200 FFs, 545 BRAMs, 900 DSPs [45]. The SoC comprises of Zynq PS: two ARM9-cores and one DDR memory controller. Based on the SoC and PCIe endpoint (i.e. the fixed-function units) layout, we implement the I/Opins to DDR memory controller. Similarly based on the layout as well as the locations and number of DSP resources, we choose a 4 7 tiles grid for .
Each FU can at most use 1200 LUTs, 2400 FFs and 20 DSP units. Fig. 6 shows an abstract representation of both the logical and physical design, the latter of which is constrained by placement requirements for the PCIe core and DDR3 memory controller. For the same reason, only the two leftmost tiles on the bottom row are equipped with an extended interface, which in addition to the normal data streaming connections also allows random memory access. These tiles are sized slightly larger and occupy 1600 LUTs, 3200 FFs, 10 BRAMs, and 20 DSP units, but they still fit within the regular grid structure. All streaming connections within the overlay architecture are set to 32 bits. The complete design requires 73949 LUTs, 75203 FFs, 37.5 BRAM and no DSP units. In relation to the available resources, this is 34% of LUTs, 17% of FF and 7% of BRAM. Given the requirements of the DDR3 memory controller, the overlay architecture is clocked at 200Mhz.

Evaluation
In this section, we compare execution of the same TPC-H workload on different devices in our system. At first we present performance results for a micro-benchmark followed by a detailed end-to-end comparison of query 6 and hash join. To ensure comparable performance, we run our experiments on CPU, GPU and FPGA, all of which are available in the same environment and share the same system resources.

Micro-benchmark
We test the devices' general processing throughput by launching an increasing number of aggregate operators, with each input column consisting of 10 million integers. CPU and GPU process the aggregates sequentially, but with wide data parallelism within each primitive, while on the FPGA, functional parallelism between the aggregates is employed. The resulting execution times for the four primitive implementations on our three devices are shown in Fig. 7. The data shows that GPU, just as expected, by far outperforms CPU and FPGA due to its high memory bandwidth and optimized atomic operations. With its lower clock rate and limited throughput per DMA channel, the FPGA is much slower for a single aggregate than the other devices. However since all FUs can run completely independently from each other (or in a hardware pipeline), with statically allocated bandwidth per connection, the FPGA handles functional parallelism efficiently: Performance is nearly constant when increasing the number of concurrent reduce operations until memory controller, or rather, system interconnect bandwidth limits are reached. The step down in performance after three concurrent aggregates happens because at that point, the faster memory controller becomes saturated and additional aggregates have to fall-back to the slower memory. Finally, the grey line in Fig. 7 shows the theoretical maximum throughput for the FPGA, which is almost reached.
Due to their operator-at-a-time execution, performance of both CPU and GPU degrades linearly with increasing numbers of concurrent reduce operations. In case of the single aggregation, the runtime is less than 1ms for CPU OpenMP and GPU OpenCL versions. Hence, they are not vidible in the Fig. 7. The GPU shows a near linear growth in execution time, as it executes one reduce after another. Still, the GPU performs well for reduce compared to other implementations. Even with the OpenMP variant of CPU being nearly twice as fast as the OpenCL version, they are slower compared to GPU.

Hash Joins
Unlike the primitives of query 6, a hash join implementation requires large amounts of intermediate data to be stored in order to determine the join pairs. Such random memory accesses are particularly expensive in FPGAs. To analyse the performance of joins, we can limit ourselves to just the primitives required for them, since they will be the most expensive primitives in the overall query pipeline.
We split the hash join into its two primitives: hash-table build and hash-table probe. We use linear probing for collision resolution and multiplicative hashing as hash function. On the FPGA, the join primitives can be loaded into the extended tiles equipped with random memory access. Since these tiles still need to fit within the regular grid of the physical design, their resources are constrained in similar ways as the pure streaming tiles. To circumvent these limitations, we could break these primitives out of the grid or implementing them in the static part. This would allow for higher-performance implementations at the cost of flexibility, for example by implementing large domain-specific caches. Another performance limitation is that the memory interface generated by the HLS compiler is not welloptimized for concurrent random memory access.
The execution time for hash join is given in Fig. 8. Due to the given limitations, we see a larger performance penalty for FPGA in comparison to the other devices.
Based on our results, the performance of OpenCLbased CPU hash-build is comparatively slower than that of OpenMP based implementation (cf. Fig. 11). However the OpenCL hash-build in CPU is faster than that of the OpenMP implementation. Such performance difference comes from threading strategies present in the implementations -OpenMP is coarse grained and OpenCL is fine grained [35]. Such performance from OpenCL and OpenMP are still worse compared to the GPU. Here, the massive parallel nature of the GPU is useful in probing multiple input values thereby having better performance than CPU.

Query 6
We report performance measurements for table sizes from 1000 rows up to 128 million rows which almost fills up the entire memory on the smallest device.
The communication overhead to start computation in non-CPU devices is included in the results. Looking at the data from this angle allows us to analyze end-to-end latency from the database system optimizer's point of view, including performance and quality of the host drivers. The figures detailing query execution time, including data transfer to PCIe devices, are organized as follows: Fig. 9 covers the whole range of input rows and includes all transfer times Finally, Fig. 11 shows the execution time of the single primitives for the platforms that do operator-at-a-time execution. Since the FPGA overlay architecture executes all primitives in parallel in a hardware pipeline, it is not included. This data shows strong evidence for implementing architecture-specific primitives. This is especially clear for the materialization primitive, which underperforms on GPU due to architectural disadvantages. In contrast, for the reduction, the GPU can bring its specialized hardware for atomic operations to bear.

Access Bottleneck
Comparing the results shown in Fig. 9 and Fig. 10 it is clear that devices without direct access to system memory are at a disadvantage. In the case of a cold-start without most required data already present in the device-local memory, transfer times over PCIe are the dominating factor in execution time for both GPU and FPGA, something which can only be improved with newer system interconnects such as NVLink [17]. This effect is exacerbated for the FPGA card used, as both its internal memory bandwidth and PCIe throughput are fairly low compared to GPUs. Improving this with more powerful logic cores is possible, but requires a much higher amount of FPGA resources. However, if device-resident data is used, the situation is improved. For Q6, OpenCL on GPU is faster than the CPU OpenCL driver for more than 20 million rows, see Fig. 10.

Hardware-Sensitive vs. Hardware-Oblivious Primitives
A surprise is that for query execution on CPU, there is a big difference between the OpenCL primitives and OpenMP parallelized versions, despite using zero-copy buffer management for the OpenCL variant. This is due to the fact that OpenCL code written for massively parallel execution on a GPU does not translate to an optimal execution on a multicore CPU. So even with the current state of OpenCL packages, the effort of implementing native primitives for CPU remains a sensible investment. Our evaluation supports existing evidence, that OpenCL is in a disadvantage compared to OpenMP for multi-core CPUs [35]. Due to the limited throughput and low memory bandwidth of the resource-efficient generic infrastructure components on the FPGA, the DMA interfaces into the overlay architecture reach only roughly half of their maximum bandwidth. Still, the FPGA processes query 6 at an input throughput of about 1.75 GB/s. Such execution performance can be improved by fusing multiple operator kernels together for FPGAs [14]. Finally, the stepping observed for small table sizes during data transfer onto the FPGA card are due to kernel-level I/O scheduling.

Power Comparison
While the overlay architecture is flexible and scalable, it must be emphasized that limited performance in system integration components is the major bottleneck for the FPGA. For a fair comparison, the power consumption differences should be minded: While the CPU generates up to 155W of waste heat and the GPU has a heat output of up to 210W, the FPGA evaluation card only draws about 12W. These numbers are based on the original equipment manufacturer's specifications [1,2] and in addition for the FPGA, power estimates from the vendor design tools. For Query 6 with 128 million input rows, these power requirements translate to ca. 12J on the CPU using OpenMP-based primitives, 37J for the CPU OpenCL implementation, 32J for OpenCL on the GPU and finally, 14J for the FPGA. Even without dataparallel processing, the FPGA is competitive with the nearoptimal CPU-based primitive implementation, while both OpenCL-based sets of primitives require a lot more energy.

Data-Parallelism on the FPGA
Setting the processing capabilities on the FPGA into context, the performance measurements reported are for hardware-pipelined tuple-at-a-time processing. On a card with higher memory bandwidth only 2 or 4 SIMD parallelism would be required to reach parity with CPU and GPU.

Conclusion
In this paper, we present an in-depth analysis of running OLAP queries on CPU, GPU and FPGA. We consider not only execution time alone, but also the overhead incurred by I/O and scheduling.
Our experiments consist of running the same microbenchmark using aggregation, hash joins and Q6 from TPC-H on all three devices using portable operators (primitives) to obtain fair and comparable results. With our experiments, we observed that architecture-specific primitive implementations and execution engines are needed to better exploit the underlying architectures of CPUs, GPUs, and FPGAs. While direct access to main memory gives CPUs in general a large advantage over coprocessors, we again confirm the related work that OpenMP primitive implementations still outperform OpenCL on CPUs.
Regarding FPGAs, our overlay architecture offers a high scalability for analytical query processing, but the low memory bandwidth of the FPGA card's memory controller and system interconnect limit real-world performance. We argue that FPGAs are still to be considered for query processing due to its low power consumption compared to CPUs and GPUs.
Finally, with recent advancements in system interconnect technologies (e.g., NVLink), we argue that heterogeneous coprocessor-based database systems can become more competitive with CPU-only systems for general analytical database query processing.