The previous evaluation already provides motivation for the use of explicit cross-layer data formats. Explicit data formats make it possible to offload computational work to the on-device ARM cores; without them, all of this computational load would have to be executed by the host CPU. A further advantage of explicit data formats is the possibility of hardware acceleration. Specifically, the COSMOS hardware platform features an SoC comprising an ARM-based Processing System (PS) and an FPGA-based Programmable Logic (PL) portion. In the original design, the PL part of the SoC implements only the flash controllers and the NVMe controller. This under-utilization of the PL leaves additional room for hardware acceleration, examples of which are given in [27, 28]. In both works, the COSMOS hardware design is extended with processing elements that carry out computational tasks. This section discusses the performance of [27] as well as its applicability to typical use cases.
Execution stack
First, we will give an overview of our system stack and discuss how its different parts interact with each other. Additionally, the differences between this work and [27, 28] will be highlighted.
At the highest layer, a benchmark application is run on the host. In this work, the application relies on a MyRocks database to store the data processed by the Image-Processor. MyRocks in turn relies on an underlying RocksDB key-value store for actual persistent storage. Since MyRocks is specifically designed to work with RocksDB, this integration is almost seamless. Instead of using plain RocksDB, our stack relies on an enhanced RocksDB called NoFTL-KV [29]. In contrast to regular RocksDB, NoFTL-KV assumes a specific storage device underneath, which it can access natively from userspace without any intermediary layers (i.e., OS drivers or block device drivers). NoFTL-KV thus allows the DBMS to work directly on the actual flash chips instead of relying on intermediary compatibility layers. Instead of using virtual addresses that have to be resolved, NoFTL-KV can directly access the corresponding physical flash pages on the COSMOS OpenSSD via the NVMe protocol. This direct access requires the COSMOS OpenSSD to run a specialized firmware that permits it; this firmware is responsible for executing incoming NVMe requests. In our case, both NoFTL-KV and the COSMOS firmware have been extended to also allow the execution of user-defined commands via NVMe.
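To make this interface more concrete, the sketch below shows how such a user-defined command could be assembled on the host side. The layout follows the standard 64-byte NVMe submission-queue entry; the opcode value, field assignments, and all names are purely illustrative assumptions and do not reflect the actual NoFTL-KV or COSMOS interface.

```cpp
// Hypothetical host-side view of a user-defined NVMe command. All names and the
// opcode value are illustrative; only the 64-byte submission-queue entry layout
// follows the NVMe specification.
#include <cstdint>

struct NvmeCommand {
    uint8_t  opcode;      // user-defined commands use the vendor-specific opcode range
    uint8_t  flags;
    uint16_t command_id;
    uint32_t nsid;        // namespace identifier
    uint64_t reserved;    // CDW2-3
    uint64_t metadata;    // metadata pointer
    uint64_t prp1;        // host result buffer (physical region page 1)
    uint64_t prp2;        // physical region page 2
    uint32_t cdw10;       // command-specific, e.g., physical flash block address
    uint32_t cdw11;       // command-specific, e.g., page offset
    uint32_t cdw12;       // command-specific, e.g., number of pages or predicate id
    uint32_t cdw13;
    uint32_t cdw14;
    uint32_t cdw15;
};

// Assumed vendor-specific opcode for a scan pushed down to the device.
constexpr uint8_t kOpcodeNdpScan = 0xD0;

// Assemble an NDP scan request addressing physical flash pages directly.
inline NvmeCommand make_ndp_scan(uint32_t block, uint32_t page, uint32_t num_pages) {
    NvmeCommand cmd{};
    cmd.opcode = kOpcodeNdpScan;
    cmd.cdw10  = block;
    cmd.cdw11  = page;
    cmd.cdw12  = num_pages;
    return cmd;
}
```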
While the setup of [27, 28] is very similar, there are two key differences. First, in those works the topmost layer is not MyRocks but NoFTL-KV itself. In both cases the stored data is structured in a specific manner; in this work, however, the structure is defined by a MySQL schema, whereas [27] relies on the benchmark application to define a clear data structure.
The second difference is the configuration of the COSMOS OpenSSD. The COSMOS device can be configured with different hardware designs, each enabling different functionality. In this work, the hardware functionality is limited to the NVMe interface and access to the on-device flash chips. In [27, 28], the functionality is extended with hardware accelerators for certain database-related functions. Both hardware configurations rely on the same baseline hardware design and COSMOS firmware. To use the hardware accelerators, the firmware has to be extended to expose their functionality to higher layers. This is achieved by implementing new user-defined NVMe commands, which tell the firmware to invoke the corresponding on-device functionality. This functionality can be realized either by the on-device ARM cores or by accelerators implemented in the on-device FPGA fabric. In this work, only the on-device ARM cores are used, while [27, 28] also realize additional functionality via the FPGA fabric.
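As a rough illustration of how the extended firmware could route such commands to the two execution targets, consider the dispatch sketch below. It continues the NvmeCommand sketch above; all handler names and the additional opcode are assumptions rather than the actual structure of the COSMOS firmware.

```cpp
// Illustrative firmware-side dispatch of user-defined NVMe opcodes. The handler
// names, the helper opcode_of, and the second opcode are assumptions.
#include <cstdint>

struct NvmeCommand;  // 64-byte submission-queue entry, see the host-side sketch above

constexpr uint8_t kOpcodeNdpScan      = 0xD0;  // assumed: scan executed in software on the ARM cores
constexpr uint8_t kOpcodeNdpScanAccel = 0xD1;  // assumed: scan executed by an FPGA processing element

uint8_t opcode_of(const NvmeCommand& cmd);           // reads the opcode field of the entry
void ndp_scan_on_arm(const NvmeCommand& cmd);        // SW-NDP path on the on-device ARM cores
void ndp_scan_on_pe(const NvmeCommand& cmd);         // HW-NDP path: programs a PE via its registers
void handle_standard_nvme(const NvmeCommand& cmd);   // baseline read/write handling of the firmware

void handle_nvme_command(const NvmeCommand& cmd) {
    switch (opcode_of(cmd)) {
    case kOpcodeNdpScan:      ndp_scan_on_arm(cmd);      break;  // software-based NDP
    case kOpcodeNdpScanAccel: ndp_scan_on_pe(cmd);       break;  // only in the accelerated design
    default:                  handle_standard_nvme(cmd); break;  // regular NVMe I/O
    }
}
```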
For FPGA-based functionality, the corresponding control logic additionally has to be implemented within the COSMOS firmware. Depending on the complexity of the hardware accelerators, this control logic typically amounts to reading and writing a small number of control registers, which are memory-mapped into the address space of the ARM cores. The resulting execution stacks are also shown in Fig. 8.
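A minimal sketch of such a register-based control path is given below; the base address, register offsets, and bit assignments are placeholders that depend on the address map of the concrete hardware design rather than values taken from [27, 28].

```cpp
// Generic memory-mapped control of one processing element. Addresses and bit
// meanings are placeholders for illustration only.
#include <cstdint>

constexpr uintptr_t kPeBaseAddr = 0x43C00000;  // assumed AXI base address of the PE
constexpr uintptr_t kRegSrcAddr = 0x00;        // DRAM address of the input data block
constexpr uintptr_t kRegDstAddr = 0x04;        // DRAM address for the filtered output
constexpr uintptr_t kRegControl = 0x08;        // bit 0: start
constexpr uintptr_t kRegStatus  = 0x0C;        // bit 0: done

inline volatile uint32_t* pe_reg(uintptr_t offset) {
    return reinterpret_cast<volatile uint32_t*>(kPeBaseAddr + offset);
}

// Start the PE on one data block and busy-wait until it signals completion.
inline void run_pe(uint32_t src, uint32_t dst) {
    *pe_reg(kRegSrcAddr) = src;
    *pe_reg(kRegDstAddr) = dst;
    *pe_reg(kRegControl) = 1;                    // raise the start bit
    while ((*pe_reg(kRegStatus) & 1u) == 0) {    // poll the done bit
    }
}
```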
The following subsections discuss in more detail how the FPGA fabric can be used to further increase performance, based on the findings of [27, 28].
Accelerator architecture
The architecture presented in [27] is relatively straightforward. The processing element (PE) can access the on-device DRAM of the COSMOS device, and it can be configured and controlled via control registers accessible from the PS. Within the PE, a pipeline is formed: entire data blocks are first loaded from the DRAM and grouped into a stream of tuples, each representing a single key-value pair. These key-value pairs are passed into a filtering unit, which applies simple logical predicates and discards key-value pairs for which a predicate does not hold. Afterwards, a data transformation is performed, which removes unnecessary fields and restructures each individual key-value pair. The resulting key-value pairs are then automatically written back to the on-device DRAM.
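The following software model mirrors the pipeline stages described above. It is a functional sketch only: the record layout, field names, and the example predicate are assumptions, whereas the real PE processes the fixed layout of its key-value pairs directly in the on-device DRAM.

```cpp
// Functional C++ model of the PE pipeline: block load, tuple grouping,
// predicate filtering, field projection, write-back. Layout and predicate
// are illustrative assumptions.
#include <cstdint>
#include <vector>

struct KvPair {          // one fixed-size key-value pair from the data block
    uint64_t key;
    uint32_t field_a;
    uint32_t field_b;
};

struct ProjectedPair {   // restructured output pair with the unneeded field removed
    uint64_t key;
    uint32_t field_a;
};

// One pass over a data block that has been loaded from the on-device DRAM.
std::vector<ProjectedPair> process_block(const std::vector<KvPair>& block,
                                         uint32_t threshold) {
    std::vector<ProjectedPair> out;
    for (const KvPair& kv : block) {          // stream of tuples
        if (kv.field_b < threshold)           // filtering unit: simple logical predicate
            continue;                         // discard non-matching pairs
        out.push_back({kv.key, kv.field_a});  // transformation: drop field_b, restructure
    }
    return out;                               // result is written back to the on-device DRAM
}
```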
Depending on the underlying structure of the key-value pairs, the corresponding PEs differ in the amount of FPGA resources they require. [27] reports the hardware utilization of two different kinds of PEs, with the number of required slices ranging from 1.84 to 15.14% of the 54650 available slices. Especially for smaller key-value pairs, this enables considerably more parallelism: with just the two available ARM cores, the degree of parallelism is limited to two, whereas [27] uses up to seven concurrent PEs.
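For a sense of scale, the reported percentages correspond to roughly the following absolute slice counts per PE:

$$0.0184 \cdot 54650 \approx 1006 \ \text{slices (small PE)}, \qquad 0.1514 \cdot 54650 \approx 8274 \ \text{slices (large PE)}.$$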
Accelerating database operations
Exploiting the available compute parallelism as well as the heterogeneity of the system, [27] implements simple GET and SCAN operations and evaluates their execution times using the traditional approach on the host CPU (Blk), software-based NDP (SW-NDP), and hardware-accelerated NDP (HW-NDP). The evaluation employs a benchmark composed of key-value pairs built from publications, references, conference venues, and authors. The dataset contains roughly 48 million key-value pairs with a total size of 2.4 GB.
Using the concepts presented in this paper, the GET execution time is reduced from 8 ms (Blk) to 5.68 ms (SW-NDP), and that of the SCAN operation from 6.96 s (Blk) to 4.81 s (SW-NDP). Given the SCAN operation's high potential for concurrency, hardware acceleration further reduces its execution time to 3.35 s (HW-NDP). Additionally, [27] implements the Betweenness Centrality (BC) algorithm and likewise measures the impact of software-based and hardware-accelerated NDP. The BC is implemented as software for the ARM cores, which either perform the corresponding computation themselves (SW-NDP) or are configured to exploit the presented PEs (HW-NDP). For BC, the original execution time of 1027.84 s (Blk) is reduced to 426.62 s (SW-NDP) and 374.81 s (HW-NDP), respectively.
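Expressed as speedups over the block-based baseline, these figures correspond to approximately:

$$\text{GET: } \tfrac{8}{5.68} \approx 1.41\times; \quad \text{SCAN: } \tfrac{6.96}{4.81} \approx 1.45\times \ \text{(SW-NDP)}, \ \tfrac{6.96}{3.35} \approx 2.08\times \ \text{(HW-NDP)}; \quad \text{BC: } \tfrac{1027.84}{426.62} \approx 2.41\times \ \text{(SW-NDP)}, \ \tfrac{1027.84}{374.81} \approx 2.74\times \ \text{(HW-NDP)}.$$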
Integration into existing systems
Independently of the potential for performance increases, another issue for new database paradigms and concepts is the complexity of adoption as well as usability. Databases are central to academic and industrial applications alike and are therefore already widely deployed. Many of these applications rely on a specific DBMS such as MySQL or PostgreSQL, and switching the DBMS is often not an easy task, since the applications running on top of it depend on its specific interfaces and behavior. It is therefore important to achieve a degree of compatibility with these systems that enables a simple and easy switch from the existing stack to a hardware-accelerated one.
To give a perspective on this, we take a closer look at MyRocks, a RocksDB-based storage engine for MySQL. Under the hood, a MyRocks database uses RocksDB for storing the data, while still allowing typical MySQL queries. Therefore, many of the performance increases achieved with a RocksDB database also apply to MyRocks. There are two major restrictions to this statement: (1) it might be necessary to adapt MyRocks to support command pushdown in order to fully exploit Native Storage and NDP, and (2) since our hardware-accelerated NDP can only process selected data types, the MyRocks database would have to be transformed so that variable-size data is removed from as many relations as possible, because relations with variable-size data cannot be processed by our current NDP-PEs.
Assuming that the MyRocks relations meet these restrictions, the accelerated NDP could be integrated at different points in the stack. In our use cases, it was integrated at the level of RocksDB, which provides the most basic operations such as SCAN and GET. Since MySQL queries performed on MyRocks invoke the corresponding functionality in the underlying RocksDB, no further implementation changes would be required. Other functions of RocksDB could also be adapted to exploit the performance increases of NDP; obvious candidates are RANGE_SCANs and MULTI_GETs. To adapt these, their filtering functionality has to be replaced by calls to the PEs. These operations can be wrapped into function calls without significant overhead, making this overall approach very simple. A major downside of this approach is the introduction of additional layers into the overall stack. Since a significant portion of our performance increases is achieved by removing unnecessary compatibility layers, it could make more sense to instead allow MyRocks to access the on-device functionality directly. While this would incur a much higher implementation effort, it might also yield even greater performance increases.
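The sketch below illustrates the RocksDB-level integration point in this spirit: the host-side filtering of a range scan is replaced by a single pushed-down call. NdpDevice, ScanPredicate, and ndp_range_scan are hypothetical wrappers introduced for illustration; they are not part of the actual RocksDB or NoFTL-KV API.

```cpp
// Hypothetical wrapper showing how a range scan's filtering could be delegated
// to the on-device PEs instead of being executed on the host.
#include <cstdint>
#include <string>
#include <vector>

struct ScanPredicate {        // simple logical predicate supported by the PEs
    uint32_t column;
    uint32_t min_value;
};

class NdpDevice {
public:
    // Issues the user-defined NVMe scan command and returns the key-value
    // pairs that were already filtered and projected on the device.
    std::vector<std::string> ndp_range_scan(const std::string& start_key,
                                            const std::string& end_key,
                                            const ScanPredicate& pred);
};

// Entry point as it could be invoked from the MyRocks/RocksDB layer: the
// predicate travels to the device, only matching tuples travel back.
std::vector<std::string> range_scan(NdpDevice& dev,
                                    const std::string& start_key,
                                    const std::string& end_key,
                                    const ScanPredicate& pred) {
    return dev.ndp_range_scan(start_key, end_key, pred);
}
```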
Discussion
From the results presented in [27, 28], it is clear that the potential of hardware acceleration is substantial. Unlocking this potential, however, requires explicit cross-layer data formats. Especially operations with a high degree of concurrency, such as SCAN and BC, benefit from hardware acceleration and may see large performance improvements. In addition, computational load is shifted from the host CPU to the storage device, which in principle frees the CPU to carry out more complex operations, while simple operations are handled by on-device NDP across multiple smart storage devices.
Even though these initial results are promising, the use of hardware-accelerated NDP introduces a new problem: to exploit hardware acceleration, [27, 28] rely on manually designed PEs and the corresponding control logic. Several approaches could address this problem. The most obvious is High-Level Synthesis (HLS), which generates hardware accelerators from high-level code (e.g., C/C++). While HLS is easier to use for software and database engineers, it typically produces sub-optimal results compared to manually written or generated Hardware Description Language (HDL) accelerators. Considering the accelerators used in [27], it is clear that many of the sub-modules depend on the underlying structure of the processed key-value pairs.
Exploiting this structure, it should be possible to automatically generate HDL-based accelerators, together with the software interface required to control them. Such a specialized NDP-PE generator could potentially create better hardware accelerators than general-purpose HLS, while also providing all of the control software, without requiring any hardware-design knowledge from the user.
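As a design sketch only (no such generator exists in [27, 28]), the structures below indicate how a description of the fixed key-value layout could serve as the single input from which both the HDL of a PE and its control software are generated.

```cpp
// Speculative input description for an NDP-PE generator; all types and the
// entry point are assumptions made for illustration.
#include <cstdint>
#include <string>
#include <vector>

enum class FieldType { UInt32, UInt64, FixedChar };

struct FieldDesc {
    std::string name;
    FieldType   type;
    uint32_t    width_bytes;   // fixed width only; variable-size fields are unsupported
};

struct KvLayout {
    std::vector<FieldDesc> key_fields;
    std::vector<FieldDesc> value_fields;
};

// Hypothetical entry point: emit the PE's HDL plus the matching register map
// and firmware driver stubs from a single layout description.
void generate_pe(const KvLayout& layout, const std::string& output_dir);
```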
Apart from this caveat, the work presented in [27, 28] is a very good example of field-based NDP operations, where processing and computation are performed close to storage and the processed data is handled at the finest degree of granularity. In terms of the model depicted in Fig. 2, the relevant cross-layer data formats are implicitly encoded into the accelerator, allowing it to interpret the data down to individual fields.