1 Introduction

In-memory databases aim at low latency and high throughput for queries and updates in order to support real-time data processing. Although (most of) the data can be kept in main memory and expensive IO operations can thus be reduced, access to main memory becomes more and more of a bottleneck and affects the overall performance—a phenomenon that is known as memory wall [21]. However, novel and emerging memory technologies open up new opportunities such as offloading computation to memory. One example is Processing-in-Memory (PIM), a rather new concept where (simpler) operations can be executed directly in memory (on the same die) without moving the data from DRAM. The basic idea of this approach is to equip memory chips with additional processing units. Data can be processed directly on the memory chips without involving the system’s CPU. PIM offers great potential: CPU load could be reduced, and memory bandwidth could be increased by reducing the amount of data to be transferred to the CPU.

Though many PIM architectures have been proposed in the past (see [17] for a classification), the only publicly available commercial product is offered by the UPMEM company. In addition, Samsung has announced a product, but it is not available on the market yet. The UPMEM technology has only recently become available (first presented at HotChips 2019); therefore, only a few experimental studies of UPMEM have been published. In [18], the authors evaluate UPMEM PIM using a few use cases such as data compression, encryption, JSON processing, and text search. [8] presents PrIM (the Processing-In-Memory benchmarks), a benchmark suite covering different application domains, as well as several key observations and programming recommendations. Though PrIM also contains a database selection operator, it is not integrated into a database engine. Both papers also discuss the technical details of UPMEM.

Based on these works, we investigate in this paper the potential of offloading data management processing to memory using PIM. Extending our previous work [2], we present a system capable of compiling simple queries to PIM hardware and report additional experiments in the evaluation. For this purpose, we use the graph database engine Poseidon [11]. The contributions of this work are: (1) a data layout suitable for PIM-enabled hardware, (2) a method for transferring table records between the host and the PIM hardware, (3) an implementation of efficient table scan operators for real PIM-enabled hardware using the UPMEM DIMMs, (4) a query compilation approach for the JIT compilation of graph queries for PIM hardware, and (5) an evaluation of the presented approach using suitable graph workloads. Because we focus in this paper on scan operations, the findings of our experiments are not limited to graph databases but generalize to scans on relational and other non-relational databases.

2 Related Work

PIM has been known for several decades as a technique to overcome the CPU-memory bottleneck. Since the 1990s, a number of concepts and approaches to provide PIM in hardware have been proposed [6, 19]. The high cost and the lack of industrial support, however, prevented the production and sale of real PIM hardware; still, research was conducted on prototypes. PIM follows an approach similar to GPU processing: the work in [22] investigated how the design space of GPU-accelerated architectures transfers to PIM. Further, with LazyPIM, the authors of [5] published a mechanism for reducing data exchange between CPU and PIM cores by means of caching. With UPMEM, hardware providing real PIM-enabled DRAM DIMMs finally became commercially available [20].

There are already a number of works that investigate the characteristics and applicability of this architecture. Gomez-Luna et al. investigate the architecture with respect to its limitations, performance, and energy consumption [8]. Their results show that the UPMEM system achieves suitable performance as long as the individual components (DPUs) do not require DPU-to-DPU communication. There is also work concerning the applicability of PIM hardware to real use cases. [9] investigates the potential of PIM hardware for accelerating ML training; the results show that ML training on PIM hardware can improve the training process compared with GPU-based training. [7] investigates the improvement of sparse matrix-vector multiplication using real PIM hardware. [12] provides an efficient index data structure that leverages PIM. The work of [15] studies the implementation of joins for the UPMEM architecture and provides an approach for efficient joins on the DPUs. Based on the findings of these works, we studied the programming model of the UPMEM PIM architecture for the query engine of a DBMS [3, 4]. The results of our evaluation show that query processing can benefit from PIM-enabled hardware and its highly parallel processing.

However, since real PIM hardware has only been available for a short time, there is, to our knowledge, no DBMS that directly integrates PIM at the time of this work.

3 Preliminaries

In this section, we introduce the architecture for which the approaches presented in this work are mainly developed. However, these approaches are also suitable for similar hardware architectures and other DBMSs.

3.1 PIM Technology

The UPMEM company provides the first commercially available PIM-enabled hardware [20]. In this section, we give an overview of the UPMEM architecture and programming model as our work is built for this PIM-enabled hardware.

3.1.1 UPMEM Architecture

The core of the UPMEM architecture is the UPMEM DIMM, which is based on a regular DDR4-2400 DIMM module but equipped with additional PIM chips. The UPMEM DIMMs are organized into ranks: a UPMEM DIMM consists of up to two ranks, and each rank consists of up to 8 PIM-enabled chips. A PIM chip usually contains 8 DRAM Processing Units (DPUs). Each DPU has exclusive access to 64 MB of Main RAM (MRAM), 24 KB of Instruction RAM (IRAM), and 64 KB of Working RAM (WRAM) for processing. As DPUs only have access to their own MRAM, direct communication between different DPUs is not possible. Further, a DPU contains a general-purpose 32-bit RISC core with a maximum frequency of 400 MHz, which executes a special instruction set in a multithreaded in-order pipeline. For multithreading, 24 hardware threads are available. All threads share the same memory on the DPU, which requires synchronization to guarantee consistency when updating shared data. This architecture allows the parallel execution of the same program on different pieces of data directly in memory without involving the CPU.

3.1.2 Programming Model

Each DPU can use up to 24 tasklets that are executed in parallel. This follows the Single Program Multiple Data (SPMD) programming model: all DPUs and their tasklets execute the same code but on different pieces of data. The number of tasklets used must be defined before compilation and passed as a compiler argument. As the MRAM and WRAM are shared among all tasklets on a DPU, the model provides synchronization primitives like mutexes, semaphores, barriers, and handshakes.
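
The following minimal DPU program is a sketch of this model (illustrative, not Poseidon's actual kernel): every tasklet enters the same main function, distinguishes itself via me(), and synchronizes through a barrier from the UPMEM runtime; the host-set parameter nr_items is an assumption.

```c
// Minimal DPU-side sketch of the SPMD/tasklet model (illustrative, not Poseidon's kernel).
// Built with the UPMEM SDK, e.g.: dpu-upmem-dpurte-clang -DNR_TASKLETS=16 -O2 -o kernel kernel.c
#include <stdint.h>
#include <defs.h>     // me(), NR_TASKLETS
#include <barrier.h>  // BARRIER_INIT, barrier_wait

BARRIER_INIT(sync_barrier, NR_TASKLETS);

__host uint32_t nr_items;   // parameter written by the host before the launch

int main(void) {
    uint32_t tasklet_id = me();                  // 0 .. NR_TASKLETS-1
    uint32_t share = nr_items / NR_TASKLETS;     // every tasklet processes its own share
    uint32_t first = tasklet_id * share;
    (void)first;                                 // ... per-tasklet work on [first, first+share) ...
    barrier_wait(&sync_barrier);                 // wait for all tasklets before a shared step
    return 0;
}
```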

The execution and control of the DPU program are handled by the host application. The host application allocates the set of desired DPUs and selects the appropriate DPU program; it is possible to allocate a specific rank or a specific number of DPUs. Further, the host application manages the execution of the DPU program and the data transfer to and from MRAM. Both execution and data transfer can be handled synchronously or asynchronously. When launching the DPU program or transferring data synchronously, the host application waits until the launch or transfer has completed. When transferring data, it is often desirable to prepare the next batch while the previous one is being transferred. For this purpose, the UPMEM host library can execute data transfers and DPU launches asynchronously: the operations are executed in the background by another thread, and control returns immediately to the host application, which can then prepare the next batch for transfer or launch the same or another program on another set of DPUs.
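
A minimal host-side sketch of this asynchronous pattern (the surrounding program is assumed): the launch returns immediately, the host prepares the next batch, and dpu_sync waits for the outstanding operations.

```c
// Host-side sketch of asynchronous execution (illustrative): the launch returns
// immediately, the host prepares the next batch, and dpu_sync waits for completion.
#include <dpu.h>

void launch_and_overlap(struct dpu_set_t set) {
    DPU_ASSERT(dpu_launch(set, DPU_ASYNCHRONOUS));   // returns immediately
    /* ... prepare the next batch of chunks in host memory ... */
    DPU_ASSERT(dpu_sync(set));                       // wait until the DPUs have finished
}
```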

The workflow of a host program running a DPU program with the UPMEM technology can be summarized in the following steps: (1) DPU resource allocation (DPUs, ranks, DPU program), (2) buffer population from the host's main memory to the MRAM of the DPUs, (3) execution of the DPU program, (4) retrieval of the processed results from the MRAM of the DPUs to the host's main memory. Moreover, it is possible to execute multiple iterations of steps (2)–(4): the data remains in the MRAM and WRAM of the DPUs and does not have to be reinitialized, which is useful for tasks where a solution is computed in several iterations. DPU programs are written in the programming language C and compiled by a dedicated compiler based on LLVM and Clang.
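
The four steps map to the UPMEM host library roughly as in the following sketch; the DPU binary path and the MRAM symbol names ("input_buffer", "result_buffer") are assumptions and must match symbols declared in the DPU program.

```c
// Host-side sketch of the four workflow steps (illustrative symbol names and sizes).
#include <dpu.h>
#include <stddef.h>
#include <stdint.h>

#define NR_DPUS 64

void run_workflow(const uint8_t *input, size_t input_size,
                  uint8_t *output, size_t output_size) {
    struct dpu_set_t set, dpu;

    // (1) allocate DPUs and load the DPU program
    DPU_ASSERT(dpu_alloc(NR_DPUS, NULL, &set));
    DPU_ASSERT(dpu_load(set, "./scan_kernel", NULL));

    // (2) populate MRAM: here, the same buffer is broadcast to every DPU
    DPU_ASSERT(dpu_copy_to(set, "input_buffer", 0, input, input_size));

    // (3) launch the DPU program and wait for its completion
    DPU_ASSERT(dpu_launch(set, DPU_SYNCHRONOUS));

    // (4) retrieve the processed results, DPU by DPU
    // (in practice, each DPU's result would go into its own slot of a larger buffer)
    DPU_FOREACH(set, dpu) {
        DPU_ASSERT(dpu_copy_from(dpu, "result_buffer", 0, output, output_size));
    }
    DPU_ASSERT(dpu_free(set));
}
```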

3.1.3 Memory Management

The memory types of a DPU differ in size and in how they are connected to the host and to the DPU itself. The MRAM of a DPU has the largest capacity with up to 64 MB and serves for exchanging data with the host: the host system can copy data from its main memory to the MRAM and transfer data from the MRAM back to its main memory.

The WRAM of a DPU is the working memory in which the DPU stores its stack and global variables. Access to this memory is restricted to the DPU itself; direct access from the host is not possible. The DPU accesses the WRAM through 8- to 64-bit load/store instructions, whereas the MRAM can only be accessed via DMA transfers between MRAM and WRAM. For these transfers, the UPMEM runtime library provides the methods mram_read for MRAM-to-WRAM and mram_write for WRAM-to-MRAM transfers. Each DMA instruction can copy up to 2 KB of data.
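
As a sketch of these transfers (the buffer name, size, and the surrounding program are assumptions), the following DPU-side routine streams an MRAM region through a WRAM buffer in blocks of at most 2 KB and writes the modified data back.

```c
// DPU-side sketch of MRAM<->WRAM transfers (buffer name and size are assumptions).
// A single DMA instruction moves at most 2,048 bytes, so larger regions are streamed
// block-wise through a WRAM buffer. Intended to be called by a single tasklet.
#include <stdint.h>
#include <defs.h>
#include <mram.h>     // __mram_noinit, mram_read, mram_write
#include <alloc.h>    // mem_alloc: WRAM heap allocator

#define BLOCK 2048

__mram_noinit uint8_t data[1 << 20];   // 1 MiB region in MRAM
__host uint32_t data_bytes;            // set by the host; assumed to be a multiple of 8

void increment_all(void) {
    uint8_t *cache = (uint8_t *)mem_alloc(BLOCK);             // WRAM staging buffer
    for (uint32_t off = 0; off < data_bytes; off += BLOCK) {
        uint32_t len = (data_bytes - off < BLOCK) ? (data_bytes - off) : BLOCK;
        mram_read(&data[off], cache, len);                     // MRAM -> WRAM
        for (uint32_t i = 0; i < len; i++) cache[i]++;         // work on the WRAM copy
        mram_write(cache, &data[off], len);                    // WRAM -> MRAM
    }
}
```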

Communication with the host is done through data transfers between the main memory of the host and the MRAM of the DPUs. The UPMEM runtime library provides different functions for this purpose, such as dpu_copy_to/dpu_copy_from for copying a buffer to and from the MRAM of specific DPUs. For parallel data transfer, the library provides the method dpu_prepare_xfer, which assigns a buffer to the MRAM of a specific DPU. The actual data transfer is then performed in parallel using the dpu_push_xfer method, which, however, requires the same buffer size for all DPUs.
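
The following host-side sketch shows such a parallel transfer in the DPU-to-host direction (the symbol name and buffer size are assumptions): each DPU's result buffer is assigned its own slot in a host array and a single push moves all of them in parallel; the opposite direction uses DPU_XFER_TO_DPU symmetrically.

```c
// Host-side sketch of a parallel gather from the DPUs (illustrative symbol name and size).
#include <dpu.h>
#include <stdint.h>

#define RESULT_BYTES 1024   // per-DPU result size; must match the DPU symbol's size

// `out` must provide one RESULT_BYTES slot per allocated DPU.
void gather_results(struct dpu_set_t set, uint8_t *out) {
    struct dpu_set_t dpu;
    uint32_t dpu_idx;
    DPU_FOREACH(set, dpu, dpu_idx) {
        DPU_ASSERT(dpu_prepare_xfer(dpu, &out[dpu_idx * RESULT_BYTES]));
    }
    DPU_ASSERT(dpu_push_xfer(set, DPU_XFER_FROM_DPU, "result_buffer", 0,
                             RESULT_BYTES, DPU_XFER_DEFAULT));
}
```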

Figure 1 compares the throughput of serial and parallel data transfer. For the experiment, we transferred 10,000 chunks of 817 records each (625 MB in total) from the host to different numbers of DPUs and assigned the chunks evenly to all DPUs. For the throughput, we measured the transfer times over several runs. When transferring data serially, the throughput remains at a constant level, as the data has to be transferred to each DPU one after another. The throughput can be increased with parallel transfer: in general, with an increasing number of involved DPUs (i.e., an increasing number of ranks), the throughput increases. The transfer takes advantage of rank-parallel transfer, which moves data to the DPUs of the different ranks in parallel using multiple threads.

Fig. 1: Throughput comparison between serial and parallel CPU-DPU data transfer

3.2 The Poseidon Graph Database

The presented work is mainly developed for the graph database Poseidon. Although Poseidon was originally designed for the characteristics of persistent memory (PMem), its design can also be exploited for in-memory processing in DRAM, and Poseidon already provides the necessary optimized data structures for this purpose. In the following, we describe the general architecture of Poseidon.

3.2.1 Data Model and Storage

The data layout of the Poseidon Graph Database is based on the labeled property graph model, in which labels and property values can be assigned to nodes and relationships. A complete graph in the Poseidon Graph Database consists of different tables in which the nodes, relationships, and respective property entries are stored. The tables are stored directly in DRAM. Poseidon also supports storing data directly on disk or in PMem using similar data structures optimized for the underlying storage; for the scope of this paper, however, we focus on the implementation of the storage in DRAM. The underlying data structure is a linked list of fixed-size arrays (chunks), referred to as a chunked vector. The nodes, relationships, and properties are stored as fixed-size records within the appropriate chunked vector.

For entries with variable sizes, such as string values, an entry is created in a dictionary, and the corresponding dictionary code is stored in the record. Nodes and relationships are connected via the offsets of the related entries: a node record stores the offsets of its first incoming and first outgoing relationship, which point to entries in the relationship table, while a relationship record contains the offsets of its source and destination nodes. Moreover, the relationships of a node are linked to each other; a relationship record therefore also contains the offsets of the next relationships in the relationship lists of its source and destination nodes. A graph can be traversed by alternately looking up the corresponding records in the node and relationship tables via these offsets, as sketched below.
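
The following C sketch illustrates this offset-based linking; the field names and the sentinel value are assumptions for illustration and do not reproduce Poseidon's exact record layout.

```c
// Illustrative sketch of offset-based traversal (field names and sentinel are assumptions).
#include <stdint.h>

#define UNKNOWN UINT64_MAX   // marks the end of a relationship list

typedef struct {
    uint64_t id;
    uint64_t from_rel_list;   // offset of the first outgoing relationship
    uint64_t to_rel_list;     // offset of the first incoming relationship
} node_t;

typedef struct {
    uint64_t src_node;        // offset of the source node
    uint64_t dest_node;       // offset of the destination node
    uint64_t next_src_rel;    // next relationship in the source node's list
    uint64_t next_dest_rel;   // next relationship in the destination node's list
} relationship_t;

// Visit all outgoing relationships of a node by following the linked offsets.
void foreach_outgoing(const node_t *nodes, const relationship_t *rels,
                      uint64_t node_off, void (*visit)(const relationship_t *)) {
    uint64_t rel_off = nodes[node_off].from_rel_list;
    while (rel_off != UNKNOWN) {
        const relationship_t *r = &rels[rel_off];
        visit(r);
        rel_off = r->next_src_rel;   // continue in the source node's relationship list
    }
}
```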

3.2.2 Graph Queries

Processing graph database queries mainly consists of discovering paths between nodes in a graph. Besides the usual operators known from relational DBMSs, like selections, projections, or joins, Poseidon provides an additional set of operators especially for the processing of graph queries. These operators are based on a graph algebra, which is an extension of relational algebra [10]. For the data flow between operators, we implemented a push-based query processing approach: the operators are organized into a pipeline and push their results toward the consuming operator until a pipeline breaker occurs [16]. For simplicity, we implemented a small and manageable query language oriented to the graph algebra.

The entry point of every query in Poseidon is the NodeScan operator. It scans the underlying table for nodes, optionally compares each node with a given label, and pushes matching nodes to the next operator. Because strings are dictionary-encoded, this kind of filter is a simple integer comparison, which makes it a suitable candidate for offloading to UPMEM. The nodes can then be processed with the ForeachRelationship operator to find the incoming or outgoing relationships of the previous node. A relationship tuple can in turn be processed by the Expand operator to find the corresponding node of a relationship.

3.2.3 Query Processing

For the processing of graph queries, Poseidon's query engine relies on push-based query processing and morsel-driven parallelism [14]. The data flow during query processing is organized in a pipeline, and the resulting tuples are pushed from one operator toward its consuming operator. For parallelism, the engine exploits morsel-driven parallelism based on the underlying chunked vector data structure: the chunks are assigned at the beginning to individual tasks that are executed in parallel.

The query engine provides three different execution modes for the actual processing: executing ahead-of-time (AOT) compiled code, just-in-time (JIT) compilation, and an adaptive approach. The AOT-compiled mode processes the given query using pre-compiled C++ methods, which execute the given operators. The JIT-compilation mode transforms the given graph query into highly optimized machine code and executes it directly. For this, we use the LLVM compilation framework. A graph query is transformed into a single function in LLVM IR. Then, it is optimized using several optimization passes like dead-code elimination or instruction combining. Next, the resulting optimized LLVM IR code is transformed into machine code and executed by the engine. To hide the compilation time, the engine can execute queries in the adaptive mode. Here, the engine starts the query processing using the AOT-compiled mode and compiles the query in the background. As soon as the compilation is complete, it switches to the newly compiled code. Additionally, this mode is useful to hide access latencies of the underlying storage type like disk or persistent memory [1].

4 PIM-Based Table Scans

The starting points of most queries are either index lookups or table scans. If no index is available, there is no other way than to traverse the entire table for tuples that match a given predicate. Especially in the case of predicates with particularly low selectivity, tuples that do not satisfy the predicate must be transferred unnecessarily through the system's CPU.

In today's database systems, this procedure leads to a bottleneck and limits the achievable performance. In this section, we show how a table scan operator can be implemented by exploiting PIM technology.

4.1 Memory Layout

For the execution of table scans on the DPUs, the characteristics of the DPU memories must be taken into account. The tables of nodes, relationships, and properties of the Poseidon Graph Database are based on the chunked vector data structure. This chunking of the tables can also be exploited for the design of the memory layout on the DPUs.

The listing in Fig. 2 shows the structure of the chunks and node records that are stored in MRAM. A node record requires 80 bytes to store the necessary data. We leave the parts that are used for transactional processing out of scope and label them as tx_pad. A similar structural layout is used to represent the relationships and properties in the MRAM of the DPUs. Basically, the representations are equivalent to those used for storing the data in the host's main memory. In addition, care was taken that the record size is a multiple of 8 bytes to allow transfers to MRAM and between MRAM and WRAM without additional transformation. The appropriate alignment of the data allows not only the direct transfer to the DPU but also buffering parts of the data in the DPU's WRAM for faster access.

Fig. 2: Listing: Structure of node and node chunk in MRAM

With large tables, there may be more chunks than available DPUs. To use the memory of a DPU efficiently and to process as much data as possible on it, it is beneficial to transfer several chunks to a DPU. Given the chunk size of 65,536 bytes, we reserve space in each DPU that can hold up to 1000 chunks. The remaining MRAM space can be used for parameters such as the number of chunks passed, filter arguments, and for storing the results.
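
As a quick sanity check of this layout, the following compile-time assertions use the sizes given above: 1000 chunks of 65,536 bytes occupy 62.5 MiB of the 64 MiB MRAM, leaving roughly 1.5 MiB for parameters and results.

```c
// Back-of-the-envelope MRAM budget per DPU; the constants follow the sizes given above.
#define MRAM_SIZE          (64u * 1024 * 1024)   // 64 MiB of MRAM per DPU
#define CHUNK_SIZE         65536u                // one Poseidon chunk
#define RECORD_SIZE        80u                   // one node record
#define RECORDS_PER_CHUNK  817u
#define MAX_CHUNKS_PER_DPU 1000u

_Static_assert(RECORDS_PER_CHUNK * RECORD_SIZE <= CHUNK_SIZE,
               "817 records of 80 bytes fit into one 64 KiB chunk");
_Static_assert(MAX_CHUNKS_PER_DPU * CHUNK_SIZE <= MRAM_SIZE,
               "1000 chunks (62.5 MiB) fit into MRAM; ~1.5 MiB remain for parameters and results");
```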

4.2 Chunk-DPU Assignment

In order to transfer the data efficiently, we make use of asynchronous and parallel host-to-DPU data transfer, i.e., DPU- and rank-parallel transfers. The underlying data structure in Poseidon, which is used for the storage of nodes and relationships, is well suited to achieve this with little implementation effort. Figure 3 shows the rank-parallel chunk-to-DPU assignment. Each chunk is assigned to a DPU of a rank in a round-robin fashion; after the assignment, the data is transferred in parallel per rank. This ensures that the workload on all DPUs is similar. Furthermore, multiple chunks can be assigned to a DPU to make efficient use of the available MRAM. If the table does not fit completely into MRAM, the DPU program must be executed on the already transferred part, and the remaining chunks are transferred to the DPUs afterwards. The assignment algorithm iterates over the available chunks of nodes, relationships, or properties and, within each rank, over its DPUs. This is advantageous for efficient parallel data transfer, as the buffers of one transfer must have the same size and be written to the same offset address in MRAM; otherwise, the transfer would be serialized.

Fig. 3: Rank-parallel chunk-to-DPU assignment
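
A host-side sketch of this assignment follows; the MRAM symbol name chunk_area and the padding strategy for the last pass are assumptions. Pass p writes the p-th chunk of every DPU, so each dpu_push_xfer moves buffers of identical size to the identical MRAM offset on all DPUs, which keeps the transfer parallel.

```c
// Host-side sketch of the round-robin chunk-to-DPU assignment and parallel transfer
// (illustrative; symbol name and padding strategy are assumptions).
#include <dpu.h>
#include <stdint.h>

#define CHUNK_SIZE 65536

void assign_and_transfer(struct dpu_set_t set, uint32_t nr_dpus,
                         uint8_t **chunks, uint32_t nr_chunks) {
    static uint8_t pad[CHUNK_SIZE];   // padding for DPUs left without a chunk in the last pass
    uint32_t passes = (nr_chunks + nr_dpus - 1) / nr_dpus;   // chunks per DPU, rounded up

    for (uint32_t p = 0; p < passes; p++) {
        struct dpu_set_t dpu;
        uint32_t dpu_idx;
        DPU_FOREACH(set, dpu, dpu_idx) {
            uint32_t chunk_idx = p * nr_dpus + dpu_idx;            // round-robin mapping
            uint8_t *src = (chunk_idx < nr_chunks) ? chunks[chunk_idx] : pad;
            DPU_ASSERT(dpu_prepare_xfer(dpu, src));
        }
        // one parallel push per pass, targeting the p-th chunk slot in every DPU's MRAM
        DPU_ASSERT(dpu_push_xfer(set, DPU_XFER_TO_DPU, "chunk_area",
                                 p * CHUNK_SIZE, CHUNK_SIZE, DPU_XFER_DEFAULT));
    }
}
```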

4.3 DPU Scan

To enable an efficient multithreaded scan of the chunks, we divide the workload among all available tasklets: the records of each chunk are distributed across the tasklets, so each tasklet works on its own allocated area in every chunk assigned to the DPU.

Each tasklet then iterates over its allocated area of the chunks. In each iteration, a record is checked against the given filter predicate. As soon as a record matches the predicate, the result is saved by setting the bit at the record's position within the chunk to 1 in a bit vector.

Per DPU, there is a single bit vector for each transferred chunk. This tasklet design also has the advantage that no further synchronization mechanisms are necessary, since each tasklet writes its results to its own memory area.
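
The following DPU-side sketch shows one way to implement such a scan without synchronization; the record layout (with the dictionary-encoded label as the first 4-byte field) and the symbol names are assumptions, and the per-tasklet slices are aligned to 64 records so that every tasklet owns whole 64-bit words of the bit vector.

```c
// DPU-side sketch of the tasklet scan (illustrative record layout and symbol names).
#include <stdint.h>
#include <defs.h>     // me(), NR_TASKLETS
#include <mram.h>     // __mram_noinit, mram_read, mram_write

#define RECORD_SIZE       80
#define RECORDS_PER_CHUNK 817
#define CHUNK_SIZE        65536
#define MAX_CHUNKS        1000
#define WORDS_PER_CHUNK   ((RECORDS_PER_CHUNK + 63) / 64)

typedef struct { uint32_t label_code; uint8_t pad[RECORD_SIZE - 4]; } node_t;

__mram_noinit uint8_t chunk_area[MAX_CHUNKS][CHUNK_SIZE];          // transferred chunks
__mram_noinit uint64_t result_bits[MAX_CHUNKS][WORDS_PER_CHUNK];   // one bit vector per chunk
__host uint32_t nr_chunks;      // set by the host: number of chunks in this DPU
__host uint32_t filter_label;   // dictionary code of the label to filter for

int main(void) {
    // each tasklet owns a disjoint range of bit-vector words (i.e., of 64-record groups)
    uint32_t first_word = me() * WORDS_PER_CHUNK / NR_TASKLETS;
    uint32_t last_word  = (me() + 1) * WORDS_PER_CHUNK / NR_TASKLETS;

    for (uint32_t c = 0; c < nr_chunks; c++) {
        for (uint32_t w = first_word; w < last_word; w++) {
            __dma_aligned uint64_t bits = 0;
            uint32_t r_end = (w + 1) * 64;
            if (r_end > RECORDS_PER_CHUNK) r_end = RECORDS_PER_CHUNK;
            for (uint32_t r = w * 64; r < r_end; r++) {
                __dma_aligned node_t rec;                           // WRAM staging record
                mram_read(&chunk_area[c][r * RECORD_SIZE], &rec, sizeof(node_t));
                if (rec.label_code == filter_label)
                    bits |= 1ULL << (r % 64);
            }
            mram_write(&bits, &result_bits[c][w], sizeof(bits));    // write the owned word only
        }
    }
    return 0;
}
```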

4.4 Code Compilation

In Poseidon, three different tables are used to represent the nodes, relationships, and properties of a graph. Scanning each of these tables requires a different DPU program, as the format of the individual records varies. Therefore, we have developed a query compiler that transforms the PIM-relevant part of a graph query into a DPU program, which enables flexibility in query processing. The query compiler is able to transform scans, filters, and navigational operators into a PIM program. Further, we extend the query engine of Poseidon by two additional operators that are responsible for the data transfer between the host and the DPUs and vice versa [3].

These transfer operators are placed before and after the execution of the PIM-compiled operators. A general overview and an example of the compilation process are given in Fig. 4. The remaining operators that are not compiled into a PIM program are compiled into the usual host code, which also controls the execution of the PIM program. This results in two different compilation processes. As this introduces additional compilation time, we integrate it into the adaptive query compilation engine with its several execution modes. Query processing always starts with the interpreter, which executes pre-compiled operators in the order given by the query.

Fig. 4: Query compilation approach for PIM-enabled hardware

The compilation processes of the host and DPU programs start at the same time in the background. As soon as the compilation is complete, the processing switches to the compiled code as described in Sect. 3.2.3. The query is executed on each chunk individually by calling the function pointed to by a global function pointer; after compilation, the destination of this function pointer is changed to the newly compiled code.
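
The switch itself can be as simple as swapping a global function pointer, as in the following minimal sketch; the use of C11 atomics for the hand-over is an assumption and not necessarily how Poseidon implements it.

```c
// Minimal sketch of the adaptive switch: worker threads call through a global function
// pointer per chunk; once compilation finishes, the pointer is redirected to the newly
// compiled code.
#include <stdatomic.h>

typedef void (*query_fn)(void *chunk);

static void interpreted_scan(void *chunk) { (void)chunk; /* pre-compiled operator path */ }

static _Atomic query_fn current_impl = interpreted_scan;

// called by the background compilation thread when the machine code is ready
void on_compilation_finished(query_fn compiled) {
    atomic_store(&current_impl, compiled);
}

// called by worker threads for every chunk (morsel)
void process_chunk(void *chunk) {
    query_fn fn = atomic_load(&current_impl);
    fn(chunk);
}
```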

5 Evaluation

We use the Social Network Benchmark (SNB) dataset from the Linked Data Benchmark Council (LDBC) for the following benchmarks, as it is well suited to evaluate the performance of our approach in a graph DBMS. The dataset uses scale factor 1. We further restrict ourselves to the nodes of the dataset, which we store in a node table in Poseidon. In total, the table contains 1,180,565 entries, which are stored in 1445 chunks in DRAM with 817 entries per chunk. For the scan query, we scan the node entries for nodes labeled as Post with a selectivity of 10%. Further, the baseline approach using the CPU received the same optimizations (data layout, scan processing, tuple handling) as the PIM-optimized approaches.

The system used for the following benchmarks runs two Intel Xeon Silver 4215R CPUs with a total of 16 cores and 2 threads per core, so that a total of 32 threads can be executed on the system. Furthermore, the system has 512 GB of DRAM, made up of 8 × 64 GB DIMMs. In terms of PIM, the system has 4 UPMEM DIMMs with 8 GB each; each UPMEM DIMM has 2 ranks with up to 64 DPUs each. The total number of available DPUs is 510, since two of the 512 theoretical DPUs are not usable in our system. The clock rates of the DPUs are between 200 and 400 MHz. The system runs Ubuntu 20.04.1 with Linux kernel 5.4.0. The code of the host and DPU programs was compiled with Clang version 12 and full optimization (-O3). The data layouts of the baseline and the DPU implementation are based on the same data structures and the same optimizations to ensure a fair comparison.

5.1 DPU Parallelism

Figure 5 shows the execution of table scans with different numbers of DPUs (16–510) as well as with different numbers of tasklets. The baseline in these experiments is the CPU execution of the table scan with varying numbers of hardware threads (1–32). Each thread performs the scan operation on the same number of chunks. The results of the baseline execution are saved in a result vector similar to the DPU program in order to obtain a workload as similar as possible. The baseline execution becomes faster with an increasing number of threads up to 8 threads; with more than 8 threads, the execution time remains the same. This can be explained by the memory wall phenomenon: memory becomes the bottleneck, and no more data can be supplied.

Fig. 5: Execution time comparison between CPU and DPU-based execution with different numbers of DPUs

With a particularly large number of DPUs (128 or more), the execution runtimes change only slightly, since the per-DPU workload also changes only slightly. The sweet spot for parallelism is around 12–24 tasklets and a DPU count of around 160 for this table size. As the number of tasklets increases, the parallelism of the execution also increases, which can further improve the runtime. Furthermore, with an increasing number of DPUs, the parallelism increases additionally. It can be seen that from around 32 DPUs used for the table scan, the runtime approaches the baseline of the CPU execution more and more. From 128 DPUs and the maximum number of 24 tasklets, the PIM execution is even faster than the CPU execution with all available hardware threads.

Figure 6 shows the correlation between the number of involved DPUs and the number of tasklets when executing a table scan. Further, it shows the execution of the same table scan directly on the CPU. With increasing parallelism, the execution time decreases. For the CPU-based execution, the execution time decreases with the number of involved threads. For the DPU-based execution, the execution time improves slightly with up to 6 DPUs; after this point, the execution time improves drastically. The same can be observed with the number of involved tasklets on the DPUs: with an increasing number of tasklets, the execution time decreases. For example, the execution time with 16 DPUs and all 24 tasklets is similar to the execution with 510 DPUs and 1 tasklet, and similar to a CPU-based execution with 32 threads. In summary, it can be concluded that the full utilization of the tasklet parallelism and a high number of DPUs can improve the runtimes of table scans by a considerable amount.

Fig. 6: DPU and tasklet parallelism compared with CPU-based execution

5.2 Table Size

The execution times of the DPU scan with different table sizes are given in Fig. 7. The baseline of this experiment is again the execution of the table scan on the CPU with all available hardware threads. Each of these hardware threads executes the scan for several chunks. To achieve a similar workload, the baseline execution stores the result in a result vector, similar to the DPU program. The table size is represented by a different number of chunks. One chunk contains up to 817 records. A table that consists of 1000 chunks (1 K) contains up to 817,000 entries. For this benchmark, we created an additional graph using the LDBC SNB dataset containing 50% Post nodes and 50% Person nodes.

Fig. 7: Execution time of table scan on different table sizes

A linear increase in the execution time can be seen for all executions. Furthermore, the table scan on the DPUs with a small number of 32 DPUs is already much faster than the corresponding execution on the CPU with 32 threads; the high task parallelism leads to very fast processing of the scan. However, to enable the highest possible parallelism of the DPUs, it is necessary to have as little inter-DPU communication as possible and as little synchronization as possible at the tasklet level. In our approach, each tasklet works on its own allocated memory area in MRAM, so no synchronization mechanisms are necessary. The result is a significant runtime improvement for the table scan.

5.3 Data Transfer

Figure 8 shows the transfer times of large tables for different numbers of DPUs. For this experiment, we created several graphs with different numbers of chunks, ranging from 1000 (1 K) to 12,000 (12 K). Each chunk contains up to 817 node records. We considered the transfer times for different numbers of DPUs to study the parallel data transfer. Furthermore, we made sure that the total number of chunks is distributed evenly among the DPUs. If the amount of data did not fit into the memory of the DPUs, the data transfer would have to be executed multiple times.

Fig. 8: Table transfer times from host to DPUs with different numbers of DPUs

The results clearly show that the transfer times increase as the amount of data increases. Parallel data transfer can reduce the transfer times by a few milliseconds; for example, the transfer time can be cut in half if all 510 DPUs are used instead of 32. The attained throughput ranges from 1.5 GB/s (32 DPUs) to 3 GB/s (510 DPUs). The existing data transfer implementation could be improved further by increasing the chunk size for transfers. Since the data only has to be loaded into the memory of the DPUs at the startup of the database, the result of the data transfer is acceptable. Moreover, by using interleaved execution, the data transfer times can be hidden [4].

5.4 Code Compilation

For the evaluation of the query compiler, we make use of the LDBC Interactive Short Read queries. We focus here on queries 1–3, as they are short enough to show the behavior when executing queries whose compilation time is higher than the actual processing time. The compilation times of the queries and their execution times using the different execution modes are shown in Fig. 9a, b, respectively. The compilation time for the PIM program makes up only a small part of the complete compilation process: compiling only a few operators into a PIM program results in small code and therefore fast compilation. The resulting LLVM IR code for the DPU programs of queries Q1–Q3 has less than 200 lines. Further, the PIM-adaptive mode executes even faster than the adaptive mode that runs the query only on the host, which is due to the highly parallel processing provided by the UPMEM architecture. With the adaptive compilation mode for PIM, it is not only possible to improve the execution of queries but also to hide the compilation time when compiling query code for multiple hardware targets.

Fig. 9: a Share of host and DPU program compilation time when compiling queries, b comparison of the execution time using different execution modes

5.5 Energy Consumption & Economic Aspects

In this experiment, we investigate the energy consumption of the UPMEM DIMMs when executing a table scan and compare it with the CPU-based execution. To measure the energy consumption, we execute the CPU-based approach and compare it with DPU-based executions involving 2 and 510 DPUs. The energy consumption is measured with performance counters provided by the Linux kernel using the perf tools. Specifically, we measure the energy consumption of the CPU with all involved cores and of the DRAM using perf stat -e power/energy-ram/ -e power/energy-pkg/. We executed the same table scan as in the first experiment with 100 runs. The energy consumption for the different hardware configurations is shown in Fig. 10. In general, the energy consumption of the DRAM remains the same for all configurations, while the CPU consumption decreases when more DPUs are used: using more DPUs reduces the CPU load and therefore the energy consumption of the CPU itself. In all configurations, the data is stored in DRAM, so the energy consumption of the DRAM does not change. Most of the energy can be saved when executing the table scan (or other workloads) on the DPUs. However, it should be considered that the PIM architecture by UPMEM is the first real available hardware of its kind at the time of this work.

Fig. 10: Energy consumption of CPU-based execution compared with DPU-based execution

The UPMEM technology offers the advantage that any system can be PIM-enabled, since only the respective UPMEM DIMMs have to be added to the system. The cost of an 8 GB PIM-enabled DDR4 DIMM is, according to the offer at the time of this work, about 10× the price ($120/DIMM) of a standard 8 GB DDR4 DIMM ($13.5/DIMM). However, this comparison is not entirely fair due to the relatively short availability and the low production margins. When comparing the UPMEM DIMMs with PIM via FPGAs in terms of price, the comparison is in favor of the UPMEM DIMMs: the cost of a similar system using an FPGA (e.g., AxDIMM [13]) is 6× the price of the UPMEM DIMMs.

6 Conclusion

In this work, we have investigated how table scans with filters can be accelerated using PIM technology. As shown in the presented benchmarks, PIM technology can outperform a comparable CPU execution. Even when utilizing the CPU with its maximum possible parallelism, the results cannot come close to the runtimes of a table scan on multiple DPUs; with very large tables, this effect becomes even more pronounced. This has several implications for the execution of table scans. To save as much runtime as possible, as much data as possible, if not all, must be transferred to the DPUs' memory. In general, UPMEM hardware is suitable whenever the workload can be split into small and independent pieces of data. Despite the fairly high achievable bandwidth, the data should be placed in MRAM early and kept there as long as possible, since the transfer times can have a negative effect. Accelerating scans is only the first step in leveraging PIM for database operations. For future work, it is desirable to work on more complex operators like joins or aggregations that exploit the characteristics of PIM. Further, having multiple memory technologies including PIM available in a database server raises the question of data placement and/or efficient data transfer.