Big data systems are reaching maturity in terms of squeezing the last bits of performance out of CPUs or even GPUs. The next near-term and widely available alternative for higher performance in the data center and cloud may be the FPGA accelerator.
Coming from the embedded systems and prototyping-oriented market, FPGA vendors have broadened their focus towards the data center by releasing accelerator cards with form factors and interfaces similar to those of GPGPUs. Various commercial parties offer cloud infrastructure nodes with FPGA accelerator cards attached. FPGA accelerators have also been successfully deployed at large scale in the commercial clusters of large companies (e.g., [9]).
Whether the FPGA accelerator in the data center will become an implementation platform as common as other accelerators, such as GPGPUs, is still an open question. The answer will depend on the economic advantages that these systems offer: Will they provide a lower cost per query? Will they provide more performance per dollar?
In attempting to answer these questions, we note that there are valid reasons to be skeptical about embracing FPGA accelerators in the data center. We stipulate three disadvantages in this context:
1. Technological disadvantage: FPGAs run at relatively low clock frequencies and require more silicon to implement the same operation than a CPU or GPGPU. The specialized circuits they implement must therefore be orders of magnitude more efficient at whatever computation they perform before they become an economically viable alternative.
2. Hard to program: FPGAs are notoriously hard to program, incurring high nonrecurring engineering costs; this translates into a higher cost per query, or more dollars to achieve decent performance.
3. Vendor-specific: Compared to the software ecosystem in the field of big data analytics, there is a lack of reusable, vendor-agnostic, open-source tooling and standardization. The big data analytics community has shown that it thrives on and relies specifically on open-source frameworks, as these provide more control over their systems and prevent vendor lock-in.
On the other hand, valid reasons to be optimistic exist as well, because of the following advantages.
1. Specialization: FPGAs are able to implement specialized data flow architectures that, contrary to load-store architecture-based machines, do not always require the intermediate results of fine-grained computations to spill to memory, but rather pass them to the next computational stage immediately. This often leads to either increased performance or increased energy efficiency, both of which may provide an economic advantage.
2. Hardware integration: FPGAs have excellent I/O capabilities that help to integrate them in places the GPGPU cannot (yet) go, for example, between the host CPU and network and storage resources. This can help to build solutions with very low latency compared to CPUs and GPGPUs.
Hardware Design Challenges
The two mentioned advantages have the potential to mitigate the first disadvantage in specific cases, which leads us to worry mainly about the problem of productivity. One branch of approaches that the research and industrial community takes to increase productivity is to say: hardware is hard to design while software is easy to program, therefore we should be able to write software that results in a hardware design. While the term has become ambiguous, this approach is called High-Level Synthesis (HLS), which we interpret here as: using a description of a software program to generate a hardware circuit that performs the same function, hopefully with better performance. A thorough overview of HLS tools can be found in [10].
The HLS approach can (arguably) lead to disappointment on the side of the developer, since it is easy to enter a state of cognitive dissonance during programming. A user with a software design background may find many constructs and libraries not applicable or not synthesizable in a language they thought they understood. Hardware-specific knowledge must be acquired, and often vendor-specific pragmas must be applied to end up with a working implementation. A user with a hardware design background may experience a lack of control, resulting in a suboptimal design and hampering the performance that they know could be achieved using an HDL. Software languages are designed with the intent to abstract CPU instructions, memory, and I/O, but not the gates, connections, and registers that hardware-oriented users desire to express more explicitly than most software-oriented languages allow. A recent meta-analysis of academic literature [11] shows that designs created with HLS techniques, at a reduced design effort of about 3×, still achieve only half the performance of HDL designs, although the meta-study includes designs in frameworks that, according to our definition, would classify as an HDL approach (e.g., Chisel) rather than HLS. Since the direct competitors are the server-grade CPU and the GPGPU, losing half the performance is in many cases unlikely to be acceptable.
For the reasons mentioned above, we argue (together with [12, 13]) for a different approach to attack the "hard-to-program" problem: hardware is hard to design, therefore we need to provide hardware developers with abstractions that make it easier to design hardware. Such abstractions are easier to provide when the context of the problem is narrow, leading to domain-specific approaches. We must increasingly take care that these abstractions incur zero overhead, since technologically we are getting close to an era where the added cost of abstraction can no longer be mitigated by more transistors, due to the slowdown of Moore's law.
From the hardware development point of view, we stipulate three FPGA-specific challenges that cause a substantial amount of development effort when designing FPGA-based hardware accelerators for big data systems:
H1. Portability: Highly vendor-specific styles of designing hardware accelerators prevent widespread reuse of existing solutions, often leading hardware developers to 'roll their own' implementations. It also makes it hard to port implementations to FPGA accelerator platforms of other vendors.
H2. Interface design: Developers spend a lot of time designing interfaces appropriate for their data structures, since they are typically provided with just a byte-addressable memory interface. This involves the tedious work of designing state machines to perform all pointer arithmetic and handle all bus requests.
H3. Infrastructure: Hardware developers spend a lot of time on the infrastructure around their kernels, sometimes colloquially called 'plumbing', including buffers, arbiters, etc., while their focus should be on the kernel itself.
Big Data System Integration
Not only can FPGA-based designs themselves be very complex; the big data analytics frameworks in which they need to be integrated are very complex as well. For the sake of the discussion in this article, we assume that a hardware developer wants to alleviate some bottleneck in a big data analytics pipeline implemented in software through the use of an FPGA accelerator. In such a context, it is safe to assume that there is a lot of data to be analyzed. The FPGA accelerator must have access to this data.
Assuming the analytics pipeline is implemented in the C programming language, a programmer can point to their efficiently packed, hand-crafted structs, unions, arrays, pointers to nested dynamically sized data structures, and eventually the primitive types that make up the data structure of interest. Were this data structure laid out in memory somewhat inefficiently for feeding it to the accelerator, the programmer could easily modify its exact byte-level layout, typically placing the data in regions of memory that are as contiguous as possible, such that it can be loaded into the FPGA using large bursts, preventing interface latency from becoming a bottleneck when many pointers need to be traversed. These assumptions are reasonable and describe a common design pattern in hardware acceleration of software written in low-level languages such as C. However, we will show that in the domain of big data analytics, these assumptions usually do not hold.
We have analyzed the code bases of many active and widely used open-source projects related to big data analytics. The goal is to answer the question: which languages are most used in the big data ecosystem? While there are hundreds of candidates in the open-source space alone, we have selected projects that are commonly found in the middleware of the infrastructure, which is where accelerators are most likely to be integrated. We therefore do not include frameworks focused on specific applications or end users (e.g., deep learning or business intelligence), since they are often built on top of the middleware frameworks that we analyzed.
An overview of the analyzed frameworks is as follows:
- 8 query engines: PrestoDB, Cloudera Hue, Dremio, Hive, Drill, Impala, Kylin, Phoenix
- 7 stream processing engines: Heron, Samza, Beam, Storm, Kafka, Druid, Flink
- 15 (in-memory) data storage engines: MongoDB, CouchDB, Cassandra, CockroachDB, CouchDB, OpenTSDB, Accumulo, Riak, HBase, Kudu, Redis, Memcached, Hadoop-HDFS, Sqoop, Arrow
- 9 management and security frameworks: Airflow, ZooKeeper, Helix, Atlas, Prometheus, Knox, Metron, Ranger
- 6 hybrid general-purpose frameworks: Mesos, Hadoop, Tez, CDAP, Spark, Dask
- 4 logging frameworks: Flume, Fluent Bit, Fluentd, Logstash
- 2 search frameworks: ElasticSearch, Lucene-Solr
- 3 messaging/RPC frameworks: RocketMQ, Akka, Thrift
A pie chart of the analysis is shown in Fig. 1. From the figure, we find that the vast majority of the codebase is written in Java, followed by Python, with C/C++ taking up about 15% of the lines of code. These figures indicate the most widely used run-time technologies in big data analytics pipelines.
About 80% of the code found in the ecosystem is written in languages that alleviate the burden of low-level memory management through various mechanisms, and these mechanisms cause several problems. First, garbage collection (GC) is applied to prevent memory leaks; it sometimes moves data around in memory, invalidating any pointers to it, which means the software run-time would have to be halted while an FPGA accelerator operates on the data. Second, extensive standard libraries with containers for many typical data structures (e.g., strings, dynamically sized arrays, hash maps) are commonly used. This decreases development effort and provides a form of standardization within a language, but the language-specific in-memory formats of these containers often do not correspond well to how FPGA accelerators would prefer to access the data. Finally, data is often wrapped into objects (e.g., in Python), even when the underlying architecture supports the data type natively in hardware. While in C an array of a thousand integers has a simple in-memory representation, in Python it is an array of a thousand pointers to boxed integer objects, which is potentially highly fragmented and therefore inefficient to access at high throughput. Furthermore, these objects contain language- and run-time-specific metadata that is of no use to an accelerator, such as the pointer to the class of the object in Java.
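To make the difference in layout concrete, the following minimal sketch contrasts a boxed Python list with a contiguous buffer from the standard library's array module; the exact sizes are indicative and depend on the interpreter version and platform:

```python
import sys
from array import array

boxed = list(range(1000))         # list of 1000 pointers to boxed int objects
packed = array("q", range(1000))  # one contiguous buffer of 1000 8-byte integers

print(sys.getsizeof(boxed))       # ~8 KB for the pointer array alone (64-bit CPython)
print(sys.getsizeof(boxed[-1]))   # ~28 bytes per boxed integer object, on top of that
print(sys.getsizeof(packed))      # ~8 KB in total for the contiguous buffer
```

An accelerator streaming over the boxed representation would have to chase one pointer per element and skip per-object metadata, whereas the contiguous buffer can be fetched in large bursts.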
Discussing the details of all these techniques is outside the scope of this article, but we summarize the discussion in the following challenges for developers who want to integrate an FPGA accelerator solution into a software-oriented big data analytics pipeline:
S1. Complex run-time systems: It is hard to get to the data, because it is hidden under many layers of automated memory management.
S2. Hardware-unfriendly layout: The data is laid out in a way that is most practical for the language run-time system, with many additional bytes containing data that is uninteresting to the FPGA accelerator. A more FPGA-friendly in-memory format of the data structure must be designed to make it accessible to the FPGA accelerator.
S3. (De)serialization: Even if one were to handcraft such a format, the input data would have to be serialized into that format for the accelerator, and the result deserialized back into a format that the language run-time understands. The throughput of (de)serialization is relatively low compared to modern accelerator interfaces and can easily lead to performance bottlenecks [14].
Apache Arrow
Due to the nature of this article, the challenges S1, S2, and S3 from the previous section were described mainly from the point of view of an FPGA acceleration developer. However, such challenges also exist within the software ecosystem of big data analytics pipelines. When heterogeneous processes interact (e.g., when a pure Java program offloads some computation to a very fast C library through inter-process communication), there needs to be one common (in-memory) format that both programs agree on. Several projects provide such a common format for generic types of data, such as Google's Protobuf [15]. The project provides a code generation step that automatically generates serialization and deserialization functions, which help produce and consume data in the common format and turn it back into language-native in-memory objects, such that programmers can continue to work with the data in the fashion of their language.
Later, it was realized that serialization and deserialization can themselves cause bottlenecks, since the data is copied twice: first when serializing it to the common format at the producer side, and again when deserializing it on the consumer side. In many cases, providing specialized functions to access the data in its common format turns out to be faster than serialization and deserialization, since the data can be passed between processes without making copies to restructure it into a language-specific format. This has led to what is called a zero-copy approach to inter-process communication. Through libraries such as Flatbuffers [16], such access functions are provided for several languages. Producing processes immediately use the common format for their data structures and then share only a pointer to the data with the consuming process. No copies are made, because both processes work with the common format, as much as possible from the same location in memory. Programmers are provided with language-specific libraries that make it easy for them to interact with the data structure in the fashion of their language.
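The following is not Flatbuffers itself, but a minimal Python standard-library sketch of the zero-copy idea: the producer writes the data once, in the agreed-upon format, into a shared segment, and the consumer attaches to that segment by name instead of receiving a copy.

```python
from multiprocessing import shared_memory

# Producer: write the data once, in the common format, into a named segment.
producer = shared_memory.SharedMemory(create=True, size=16)
producer.buf[:5] = b"hello"

# Consumer: attach to the same segment by name; no bytes are copied,
# both sides view the same physical memory.
consumer = shared_memory.SharedMemory(name=producer.name)
print(bytes(consumer.buf[:5]))  # b'hello'

consumer.close()
producer.close()
producer.unlink()
```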
An approach similar to Flatbuffers, but specifically tailored to big data analytics, is found in the Apache Arrow project [17]. Apache Arrow targets large tabular data structures that are stored in memory in a column-oriented fashion. When iterating over the entries of a table column, the columnar format uses CPU caches and vector instructions more efficiently than a row-oriented format. Arrow also provides a memory management daemon called Plasma, which allows data structures to be placed outside the heaps of garbage-collected run-time systems and provides interfaces for zero-copy inter-process communication of Arrow data sets.
Thus, Arrow specifically solves the challenges S1, S2, and S3 by, respectively:
1. Allowing data to be stored off-heap, unburdened by GC.
2. Providing a common in-memory format and language-specific libraries to access the data, preventing the need for serialization.
3. Tailoring the format to work well on modern CPUs by being column-oriented (see the sketch below).
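As a minimal pyarrow sketch of what this means in memory: every column of a table is backed by a small, fixed set of contiguous, off-heap buffers rather than by boxed language objects.

```python
import pyarrow as pa

table = pa.table({
    "id":   pa.array([1, 2, 3, 4], type=pa.int32()),
    "name": pa.array(["alice", "bob", None, "dave"]),
})

# A fixed-width column: a validity bitmap plus one buffer of 4-byte values.
print(table.column("id").chunk(0).buffers())

# A variable-length column: validity bitmap, int32 offsets, and UTF-8 values.
print(table.column("name").chunk(0).buffers())
```

Because the format is the same for every language binding, these buffers can be handed to another process, or to an FPGA accelerator, without any transformation.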
Fletcher
Previous studies have shown that inefficiencies in serializing data out of language run-times with automated memory management can cause more than an order of magnitude decrease in throughput compared to the host-memory bandwidth that modern accelerator interfaces and contemporary protocols such as PCIe, CXL, CCIX, or OpenCAPI (intend to) provide [14]. Therefore, the benefits of Apache Arrow can help alleviate bottlenecks in the context of FPGA accelerators as well.
Fletcher is an open-source FPGA accelerator framework specifically built on top of Apache Arrow, with the intent to not only solve challenges S1, S2, and S3 on the big data analytics framework integration side, but also to solve challenges H1, H2, and H3 on the hardware development side. This is illustrated in Fig. 2.
Previous articles have discussed, at a very high level, the idea behind the framework and have shown several use cases [2]. These use cases show that, through the use of Arrow and Fletcher, serialization overhead can be avoided, since the Arrow format is highly suitable for hardware accelerators, allowing them to operate at the bandwidth of the accelerator's interface to host memory. A brief summary and overview of the framework as used by the developer at compile time and at run time is shown in Fig. 3.
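At run time, the intended host-side flow is roughly as sketched below using the project's Python bindings. The module and method names (pyfletcher, Platform, Context, Kernel, queue_record_batch, and so on) are given as an assumption based on typical Fletcher examples and may differ in detail from the current API:

```python
import pyarrow as pa
import pyfletcher as pf  # Fletcher host-side run-time bindings (names assumed)

# The Arrow data to be processed by the accelerator.
batch = pa.RecordBatch.from_arrays(
    [pa.array([1, 2, 3, 4], type=pa.int64())], names=["number"])

platform = pf.Platform()           # discover the FPGA platform (assumed call)
context = pf.Context(platform)     # manages device-side buffers for Arrow data
context.queue_record_batch(batch)  # expose the RecordBatch to the accelerator
context.enable()                   # copy or map the queued data onto the device

kernel = pf.Kernel(context)        # control interface of the generated kernel
kernel.start()
kernel.wait_for_finish()
```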
When accessing tabular data, one would prefer to do so through row indices rather than byte addresses. This has led the Fletcher project to construct specific low-level hardware components with streamable interfaces that accept a range of row indices and return one or multiple streams of data corresponding to the types of the Arrow table. In contrast to a byte-addressable memory interface, this addresses challenge H2. We briefly reiterate the design of these components in Section 3.
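The following is not Fletcher's hardware interface, but a software analogue (using pyarrow and numpy) of the idea behind it: for a variable-length Arrow column, two lookups in the offsets buffer turn a range of row indices into a single contiguous byte range that can be fetched with one large burst.

```python
import numpy as np
import pyarrow as pa

names = pa.array(["alice", "bob", "carol", "dave"])
_validity, offsets_buf, values_buf = names.buffers()
offsets = np.frombuffer(offsets_buf, dtype=np.int32)

def read_rows(first, last):
    """Return rows [first, last): two offset lookups yield one contiguous
    byte range of the values buffer, analogous to a single memory burst."""
    lo, hi = offsets[first], offsets[last]
    blob = values_buf.to_pybytes()[lo:hi]
    return [blob[offsets[i] - lo:offsets[i + 1] - lo].decode()
            for i in range(first, last)]

print(read_rows(1, 3))  # ['bob', 'carol']
```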
In the remainder of this article, we describe how Fletcher deals with challenge H1, the problem of portability, and challenge H3, the problem of infrastructure design effort.
Related Work
While many commercial tools exist that automate infrastructure design, most of them are geared towards the HLS approach and provide little help to users who, for the reasons mentioned above, prefer to describe their solutions in an HDL. HLS tools are also known to have problems dealing with dynamic data structures [18], which Arrow is able to express. In Section 3, we show an Arrow-specific method for traversing these dynamic structures efficiently. Previous research has extensively investigated hardware interfaces for more generic C-style dynamic data structures through specialized DMA engines [19], but does not focus on integration with the modern software frameworks from the big data analytics ecosystem analyzed above. To the best of our knowledge, Fletcher is the only open-source FPGA accelerator framework that deals with challenge H3 in the context of big data analytics on tabular data sets, specifically for those who prefer an HDL design flow.
A number of frameworks do exist that help deal with challenge H1. We first give an overview of related work regarding this challenge, also shown in Table 1. This helps us compare Fletcher to existing frameworks and stipulate the differences. We use the following criteria to include a framework in our comparison:
- The framework is active and publicly available as open source.
- The framework targets datacenter-grade accelerator cards/platforms.
- The framework provides abstractions that offer some form of portability between such cards/platforms.
Table 1 Overview of open-source FPGA accelerator development frameworks

As shown in the table, there are currently only a small number of other frameworks that adhere to these criteria. TaPaSCo [20] allows designers to easily set up systems that perform several hardware-accelerated tasks in parallel. It is in some sense complementary to Fletcher, since (as will be discussed again later) Fletcher provides an AXI4 top level for memory access, alongside an AXI4-lite interface for the control path of the kernel, exactly fitting the integration style of TaPaSCo's processing elements. TaPaSCo furthermore allows design-space exploration to find optimal macroscopic configurations of the parallel kernels, a feature that Fletcher does not have. It can also target a wide variety of (mainly embedded-oriented, but some datacenter-grade) FPGA accelerator cards, although currently only those that contain Xilinx FPGAs.
Spatial [21] is a domain-specific language embedded in Scala, tightly connected to the Chisel hardware description language [23]. The language provides a very high level of abstraction to design accelerators and targets not only various FPGA accelerator platforms (of both Intel and Xilinx), but also CGRA-like and ASIC targets. Aside from not being a language itself, Fletcher differs from Spatial in that it is less generic and focuses only on abstractions to easily and efficiently access tabular data structures described in Arrow.
OC-Accel [22], the successor of CAPI SNAP, adheres to the criteria described above, although it is still somewhat platform-specific, since it targets FPGA accelerator systems attached to an OpenCAPI-enabled [24] host, typically found only in contemporary POWER systems. OC-Accel is a supported target for Fletcher, alongside AWS EC2 F1 and Xilinx Alveo cards. We conclude the comparison by noting that Fletcher is a more domain-specific solution that only works for the tabular data structures of Apache Arrow. This prevents Fletcher from being used in other domains, although the lessons learned are of value when creating similar frameworks for other domains.