Before performing an analysis using the community standard TPC benchmarks, we devise a sensitivity study using the micro-benchmark. The goal of this study is to answer the following questions:
Where do CPU cycles go when running in-memory OLTP? Are they wasted on memory stalls or used to retire instructions?
Where do memory stalls come from? Are they mainly due to instructions or data for in-memory OLTP?
What is the impact of the database size on the above metrics?
Does the amount of work done per transaction affect the results and, if yes, how?
To answer these questions, we break the analysis into two parts. The first part (Sect. 4.1) varies the database size by varying the number of rows in the table while keeping the amount of work done per transaction constant. On the other hand, the second part (Sect. 4.2) varies the amount of work done per transaction by increasing the number of rows read in a transaction while keeping the database size constant.
Sensitivity to data size
To investigate the impact of database size on the micro-architectural behavior, we populate databases of size 1 MB, 10 MB, 10 GB, and 100 GB. DBMS M has a 32 GB data size limitation; hence, we use a maximum database size of 10 GB for DBMS M. The micro-architectural behavior of DBMS M does not change significantly as the data size varies (see Fig. 1 and Table 2). Hence, we rely on the 10 GB database to interpret DBMS M’s micro-architectural behavior for large data sizes. We then collect hardware events as the systems run the micro-benchmark with a single transaction type that just reads/updates one random row after an index probe operation. While the results for the read-only version of the micro-benchmark are in the following subsections, the results for the read-write version of the micro-benchmark are in Section A.1 of the “Appendix”.
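For concreteness, the following is a minimal sketch of the read-only micro-benchmark transaction. The Index and Row types and the lookup() function are illustrative placeholders, not the profiled systems’ actual APIs:

```cpp
#include <cstdint>
#include <random>

// Illustrative sketch of the read-only micro-benchmark transaction:
// probe the index for one uniformly random key and read the row.
// Index, Row, and lookup() are hypothetical placeholders.
struct Row { uint64_t id; char payload[100]; };

template <class Index>
uint64_t run_transaction(const Index& index, std::mt19937_64& rng,
                         uint64_t num_rows) {
    uint64_t key = rng() % num_rows;   // pick a random row id
    const Row* r = index.lookup(key);  // index probe
    return r->id;                      // read the row
}
```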
CPU cycles breakdown
Figure 1 shows the CPU cycles breakdowns as the database size increases. The retiring cycles ratios are similar for the 1 MB and 10 MB databases since the data working set largely fits in the last-level cache (LLC). As we increase the database size to 10 GB and 100 GB, the retiring cycles ratios decrease since the data working set no longer fits in the caches and the long-latency data misses become more significant. The retiring cycles ratios are below 30% for the 10 GB and 100 GB databases for all the systems. Hence, in-memory OLTP systems spend most of their CPU cycles on stalls, similar to disk-based systems.
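For reference, the top-level cycles breakdown can be reproduced from raw hardware events using Intel’s standard Top-Down formulas for a 4-issue-wide core such as Broadwell. The sketch below assumes the raw event counts have already been collected (e.g., via VTune or perf):

```cpp
#include <cstdint>
#include <cstdio>

// Top-level Top-Down breakdown on a 4-wide Intel core (standard formulas).
// The comments name the Intel events each field is assumed to hold.
struct Counters {
    uint64_t slots;             // 4 * CPU_CLK_UNHALTED.THREAD
    uint64_t uops_retired;      // UOPS_RETIRED.RETIRE_SLOTS
    uint64_t uops_issued;       // UOPS_ISSUED.ANY
    uint64_t recovery_bubbles;  // 4 * INT_MISC.RECOVERY_CYCLES
    uint64_t fe_bubbles;        // IDQ_UOPS_NOT_DELIVERED.CORE
};

void print_breakdown(const Counters& c) {
    double retiring = (double)c.uops_retired / c.slots;
    double bad_spec = (double)(c.uops_issued - c.uops_retired
                               + c.recovery_bubbles) / c.slots;
    double frontend = (double)c.fe_bubbles / c.slots;   // Icache/decoding stalls
    double backend  = 1.0 - retiring - bad_spec - frontend; // Dcache, resources
    printf("retiring %.2f, bad speculation %.2f, "
           "front end %.2f, back end %.2f\n",
           retiring, bad_spec, frontend, backend);
}
```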
Shore-MT is bound by Dcache and resource/dependency stalls. Shore-MT is a disk-based system known to have a large and complex instruction footprint, and existing work has long shown it to be Icache-stalls-bound [48, 49, 53, 54]. All the machines used in the existing work are based on Intel’s Ivy Bridge (v2) micro-architecture. We, on the other hand, use the later-generation Broadwell (v4) micro-architecture, a slightly improved version of the Haswell (v3) micro-architecture. Intel announced an important micro-architectural improvement to the instruction fetch unit of the Haswell micro-architecture. As Hammarlund et al. specify, “State-of-the-art advances in branch prediction algorithms enable accurate fetch requests to run ahead of micro-operation supply to hide instruction TLB and cache misses.” Hence, the instruction fetch unit keeps supplying instructions even when there is an instruction cache miss, which allows overlapping the instruction cache miss latency with useful work. As a result, Shore-MT, a long-standing Icache-stalls-bound system, has become Dcache- and resource/dependency-stalls-bound. We compare Ivy Bridge and Broadwell in Sect. 10.
The instruction footprint of disk-based systems is large and complex due to the complex relationships among various individual components such as the buffer manager and the lock manager. The fact that Shore-MT does not suffer from Icache stalls shows that today’s processors are powerful enough to mitigate the Icache stalls caused by the large and complex instruction footprint of disk-based systems. We summarize our findings on Icache stalls in terms of database software and hardware development in Sect. 11.
Shore-MT spends 40% of its time on Dcache and resource/dependency stalls even when the data size is small. Examining Shore-MT’s call stack, we observe that most of these stalls are due to the buffer manager and the centralized locking & latching operations, such as probing the hash table that keeps track of the buffer pool pages and acquiring a lock. As the data size exceeds the LLC size, Shore-MT becomes increasingly Dcache-bound since its working set no longer fits into the LLC and it suffers from expensive LLC misses.
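The access pattern behind these stalls can be illustrated with a simplified buffer-pool fix path. This is a hypothetical stand-in; Shore-MT’s actual buffer manager is considerably more elaborate:

```cpp
#include <cstdint>
#include <mutex>
#include <unordered_map>

// Hypothetical sketch of the stall-prone pattern: every page access
// acquires a centralized latch (a shared, contended memory word) and
// probes a pointer-chasing hash table mapping page ids to frames.
// Both steps tend to miss in the cache even when the data is small.
struct Frame { char page[8192]; };

struct BufferPool {
    std::mutex latch;                            // centralized latching
    std::unordered_map<uint64_t, Frame*> table;  // page id -> frame

    Frame* fix(uint64_t page_id) {
        std::lock_guard<std::mutex> g(latch);    // latch acquisition (stall)
        auto it = table.find(page_id);           // pointer-chasing probe (stall)
        return it == table.end() ? nullptr : it->second;
    }
};
```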
DBMS D and DBMS M suffer from high Icache stalls. DBMS D, despite being a popular commercial disk-based OLTP system, relies on a legacy codebase whose instruction footprint is large and complex. DBMS M, being the in-memory OLTP engine of DBMS D, borrows legacy code modules from DBMS D. As a result, its instruction footprint is also large and complex, and it mainly suffers from Icache stalls. Nevertheless, DBMS M’s throughput is three times that of DBMS D (see Table 2). Hence, the optimizations DBMS M adopts help reduce the instruction and data footprints of DBMS D. Still, both DBMS D and M mainly suffer from Icache stalls, showing the severe effect of using legacy code modules in both disk-based and in-memory OLTP systems.
Unlike DBMS M, DBMS N does not suffer from Icache stalls. This is because DBMS N is a ground-up designed in-memory system: it does not rely on legacy code modules as DBMS M does. As a result, its instruction footprint is smaller and less complex than DBMS M’s, and hence it does not suffer from Icache stalls.
Our previous work [48, 49] shows that VoltDB, a ground-up designed system similar to DBMS N, is instruction-bound rather than data-bound as we observe here. The reason is, once again, the improved micro-architecture of the machine we use (Broadwell) compared to the machine used in [48, 49] (Ivy Bridge). The improved instruction fetch unit of Broadwell eliminates the Icache stalls and makes DBMS N data-bound. Section 10 discusses this issue in more detail, and Sect. 11 summarizes our findings on Icache stalls in terms of database software and hardware development.
DBMS N is an in-memory system, which eliminates the heavyweight disk-based system components such as the buffer pool and the lock manager. Nevertheless, it suffers from a significant amount of Dcache and resource/dependency stalls for small data sizes, similar to Shore-MT. Examining DBMS N’s call stack, we observe that these stalls are largely due to the cost of setting up and instantiating the transactions. As the data size increases, DBMS N becomes increasingly Dcache-bound since its working set no longer fits into the LLC and it suffers from expensive LLC misses.
Silo has high retiring cycles for small data sizes. Silo is an in-memory system that eliminates the costly buffer manager and centralized locking & latching components that disk-based systems use. Moreover, Silo is a kernel OLTP engine with transactions hard-coded in C++. Hence, it does not pay the cost of instantiating and scheduling user requests as DBMS N does. As a result, it does not suffer from Dcache stalls when the data size is small, and it has a high retiring cycles ratio for small data sizes. As the data size exceeds the LLC size, Silo becomes so Dcache-bound that its retiring cycles ratio is the lowest among all the systems.
In addition to the Dcache stalls, Silo also suffers from a significant amount of branch mispredictions. This is due to the in-node searches that Silo performs during the index traversal. While all the systems perform an index traversal during transaction processing, this cost only surfaces in Silo thanks to its lean codebase and the efficient data structure it uses. The other systems suffer from other overheads: a complex instruction footprint in DBMS D and M, a large data footprint in Shore-MT, or an inefficient index structure in DBMS N.
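The in-node search is essentially a binary search over a node’s keys. Since the probed keys are random, the direction of the comparison branch is inherently hard to predict, as the following simplified stand-in for the in-node search illustrates (Masstree’s actual node search differs in its details):

```cpp
#include <cstdint>

// Simplified in-node key search (illustrative, not Masstree's actual code).
// The branch on keys[mid] < key depends on random key values, so the
// branch predictor mispredicts roughly half of the comparisons.
int search_node(const uint64_t* keys, int n, uint64_t key) {
    int lo = 0, hi = n;
    while (lo < hi) {
        int mid = lo + (hi - lo) / 2;
        if (keys[mid] < key)   // data-dependent, hard-to-predict branch
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;                 // index of first key >= search key
}
```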
We observe that all the systems suffer 10–15% decoding stalls. Decoding stalls are stalls due to inefficiencies in the micro-architectural implementation of the instruction decoding unit of the processor. As Intel relies on a Complex Instruction Set Computer (CISC) design, instructions must be decoded into micro-operations. Intel’s legacy instruction decoding unit is known to incur several penalties. To avoid these penalties, Intel introduced the Decoded Stream Buffer (DSB), a micro-operation cache inside the pipeline that side-steps the decoding unit and supplies already decoded instructions to the processor. However, the DSB is small (it holds \(\sim \)1.5K decoded micro-operations). If the workload’s instruction working set does not fit into the DSB, the instructions have to pass through the legacy decoding unit and suffer its penalties. As OLTP workloads’ instruction footprints are large and complex, they pass through the legacy decoding unit and suffer the decoding stalls. Nevertheless, the resulting penalties are relatively small, 10–15%, and hence do not constitute a significant problem.
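Whether an instruction working set fits in the DSB can be estimated from the share of micro-operations supplied by the DSB versus the legacy decode pipeline (MITE) and the microcode sequencer (MS). The sketch below uses Intel IDQ event names available on Broadwell; how the counts are collected (perf, VTune) is left open:

```cpp
// DSB coverage from Intel IDQ events (Broadwell). A low value means the
// instruction working set does not fit in the ~1.5K-uop DSB, so most
// micro-operations come through the legacy decoders and decoding stalls
// are expected.
double dsb_coverage(double dsb_uops,   // IDQ.DSB_UOPS
                    double mite_uops,  // IDQ.MITE_UOPS (legacy decode)
                    double ms_uops) {  // IDQ.MS_UOPS (microcode sequencer)
    return dsb_uops / (dsb_uops + mite_uops + ms_uops);
}
```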
Table 2 shows the normalized throughput of each system as the database size increases. The throughput values are normalized to DBMS D for each database size individually. The relative performance between DBMS D and M remains stable because both are instruction-bound; the increased data size does not significantly affect their performance. Shore-MT, DBMS N, and Silo’s relative performance, on the other hand, decreases as the data size increases because they are data-bound, i.e., Dcache- and resource/dependency-stalls-bound. The increased data size results in a more substantial throughput drop for them than for DBMS D and M. As a result, their throughputs get closer to DBMS D and M’s as the data size increases.
All the in-memory systems are faster than the disk-based systems for all the data sizes. This shows that the optimizations the in-memory systems implement significantly improve throughput. DBMS M, despite its large and complex instruction footprint, is faster than Shore-MT for the 10 GB database. As Shore-MT mainly suffers from Dcache stalls, this shows the severe negative effects of disk-based systems’ data overheads, such as large indexes and buffer pool pages.
DBMS N is faster than DBMS M thanks to its smaller and simpler codebase. DBMS N is a ground-up designed system; it does not borrow any legacy codebase as DBMS M does. As a result, it does not suffer from Icache stalls, and it is significantly faster than DBMS M. Silo is the fastest system we have profiled. One reason is that Silo is a kernel OLTP engine, which does not pay the cost of setting up and instantiating the transactions; DBMS N, on the other hand, is an end-to-end SQL-based OLTP system. Another reason is that Silo uses an efficiently implemented index structure, Masstree. Hence, its data footprint is also small. As a result, it is 4.9\(\times \) faster than DBMS N for the 100 GB database. We examine DBMS N’s inefficient index structure in Sects. 4.1.3 and 4.2.1 in more detail.
DBMS N versus Silo
Silo is \(\sim 5\times \) faster than DBMS N for the 100 GB database, as shown by Table 2. We examine the call stacks of DBMS N and Silo. DBMS N spends \(\sim 33\%\) of its time setting up and instantiating the transactions, a cost from which Silo minimally suffers due to being a kernel OLTP engine. Hence, 33% of the 5\(\times \) difference is due to the transaction setup and instantiation work that DBMS N performs, in functions like processInitiateTask(), xfer() (which dequeues user requests), and coreExecutePlanFragments().
DBMS N spends most of the remaining 67% of its time in index lookup. In particular, DBMS N spends \(\sim \)44% of its time in index lookup and \(\sim \)19% of its time in various small functions, each taking less than 1% of the execution time. Silo spends \(\sim \)75% of its time in index lookup. Therefore, most of the remaining 67% of the 5\(\times \) throughput difference between DBMS N and Silo is due to Silo using a more efficient index structure than DBMS N.
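As a back-of-the-envelope check on this attribution: eliminating the \(\sim \)33% setup cost would shrink DBMS N’s per-transaction time by a factor of \(1 - 0.33\), improving its throughput by about \(1/0.67 \approx 1.5\times \). The remaining gap,
\[ \frac{5\times}{1.5} \approx 3.3\times, \]
would then be attributable to the index lookup and the small functions, which is consistent with the \(\sim 3.5\times \) gap we measure in Sect. 4.2 once the setup cost is amortized.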
We examine the micro-architectural behavior of DBMS N’s transaction setup and index lookup components separately in Fig. 3 for the 100 GB database. The figure shows that the transaction setup component suffers only modestly from Dcache stalls, whereas the lookup component is exclusively Dcache-stalls-dominated. The overall micro-architectural behavior of DBMS N is the composition of these two behaviors. We also examine the micro-architectural behavior of Silo’s index lookup component. Similar to DBMS N, the micro-architectural behavior of Silo’s index lookup operation is exclusively dominated by Dcache stalls. As Silo’s execution time is dominated by the index lookup operation, 67% of the 5\(\times \) throughput difference between DBMS N and Silo is due to the index structure used. Therefore, DBMS N can significantly improve its performance by using a more efficient index structure. We describe DBMS N’s and Silo’s index structures in Sect. 4.2.1 in more detail.
Shore-MT versus DBMS N
Shore-MT and DBMS N spend a significant portion of their time on Dcache stalls for small data sizes (as shown by Fig. 1). We examine the execution time and Dcache stalls breakdowns for Shore-MT and DBMS N for small and large data sizes. We identify five components for Shore-MT at the software level: (i) B-tree, (ii) buffer manager, (iii) locking and latching, (iv) transaction setup, and (v) rest. We identify four components for DBMS N at the software level: (i) transaction setup, (ii) index lookup, (iii) post-lookup, and (iv) rest. Figure 2 shows the results for 1 MB of data.
Shore-MT spends most of its execution time in disk-based system components such as the buffer manager and the centralized locking & latching components. Similarly, most of Shore-MT’s Dcache stalls come from the locking & latching and buffer manager components. DBMS N is an in-memory system: it does not use a buffer manager, and it uses a partitioning-based, lightweight concurrency control mechanism. However, as a real-life OLTP system, it performs the necessary work to set up and instantiate transactions using functions like processInitiateTask(), xfer() (which dequeues user requests), and coreExecutePlanFragments(). DBMS N spends most of its execution time in transaction setup, and most of DBMS N’s Dcache stalls come from the transaction setup component.
We also examine Shore-MT’s and DBMS N’s execution time and Dcache stalls breakdowns for the 100 GB database. Figure 4 shows the results. Shore-MT, being a disk-based system, spends most of its execution time in the buffer manager and locking & latching components. However, unlike the 1 MB case, the buffer manager consumes a significantly larger portion of the execution time than the locking & latching component. As the data size increases, the number of buffer pool pages grows. Hence, the data footprint increases, and so does the amount of time spent inside the buffer manager. Similarly, Shore-MT’s Dcache stalls are mostly due to the buffer manager for the 100 GB database, as shown by Fig. 4b. Our findings for Shore-MT corroborate the findings of Harizopoulos et al.
DBMS N spends significantly less time setting up transactions for the 100 GB database compared to the 1 MB database. As the data size increases, the amount of work that the transactions perform increases. As a result, the transaction setup time is reduced from \(\sim \)60% to \(\sim \)33%, and the index lookup time increases from \(\sim \)10% to \(\sim \)44%.
Shore-MT’s disk-based system overhead persists across data sizes for fundamental architectural reasons. Most of the execution time is spent inside the disk-based system components such as locking & latching and the buffer manager, although the component consuming the largest portion of the execution time shifts from locking & latching to the buffer manager as the data size increases from 1 MB to 100 GB.
DBMS N’s transaction setup overhead depends on the data size. As the data size increases, the amount of work that transactions perform increases. As a result, the transaction setup consumes a smaller and smaller portion of the execution time as the amount of work per transaction increases.
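This amortization can be captured with a simple model (the symbols are ours, for illustration): if a transaction pays a fixed setup cost \(t_s\) and a cost \(t_r\) per accessed row, the setup fraction of a transaction touching \(n\) rows is
\[ f_{\text{setup}}(n) = \frac{t_s}{t_s + n\,t_r}, \]
which shrinks toward zero as \(n\) or the per-row work \(t_r\) grows.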
Therefore, Shore-MT requires a major architectural re-design to reduce its data footprint and deliver throughput as high as in-memory OLTP systems do. DBMS N spends a large portion of its execution time inside the OLTP engine for large data sizes. Hence, it can greatly benefit from optimizing the internals of the OLTP engine, such as using a more efficient index structure, as shown in Sect. 4.1.3.
The relative throughput of OLTP systems varies widely among the different categories of OLTP systems. However, the CPU cycles utilization of all the OLTP systems is low. DBMS D and M, relying on legacy codebases, suffer from Icache stalls. Shore-MT, DBMS N, and Silo, being either OLTP kernels with transactions hard-coded in C++ and/or ground-up designed in-memory systems, eliminate the Icache stalls. The reduced Icache stalls cause Dcache stalls to surface, from which Shore-MT, DBMS N, and Silo suffer. The Dcache stalls are largely due to the random-data-access nature of the workload, in addition to the buffer manager and locking & latching overheads for Shore-MT and the small transaction setup cost for DBMS N. As a result, Shore-MT, DBMS N, and Silo spend only \(\sim \)30% of their CPU cycles retiring instructions, similar to DBMS D and M.
Sensitivity to work per transaction
To investigate the impact of the amount of work per transaction on the micro-architectural behavior, we increase the number of rows that a transaction accesses from 1 to 10 and then to 100. We perform these experiments with the 100 GB dataset for all the systems except DBMS M, for which we use the 10 GB database due to its 32 GB maximum database size limitation. In the following subsections, we present the results for the read-only version of the micro-benchmark. The results for the read-write version of the micro-benchmark can be found in Section A.2 of the “Appendix”.
Figure 5 shows the CPU cycles breakdowns as the amount of work per transaction increases. DBMS D and M suffer less and less from Icache stalls as we increase the amount of work per transaction. The repetitive behavior within a transaction leads to better instruction cache locality. Moreover, as the transactions get longer, the code in the layers surrounding a transaction’s execution (e.g., the code outside the storage manager) is executed less frequently. For example, whereas probing 100 rows per transaction stresses mostly the storage manager component, probing 1 row also stresses the other layers, such as query parsing and the work done while starting/ending a transaction. As a result, the Icache stalls decrease as the amount of work per transaction increases.
DBMS D and M’s Dcache stalls increase as we increase the work done per transaction. As we read more random rows per transaction, we make more frequent random data accesses, which leads to a higher data miss rate and hence higher Dcache stalls.
Shore-MT, DBMS N, and Silo have slightly increased Dcache stalls as the amount of work per transaction increases. Their instruction footprints are small and simple enough even when the number of rows read per transaction is 1. Hence, the increased instruction locality does not make a significant difference in their micro-architectural behavior.
Table 3 shows the normalized throughputs. DBMS D and DBMS M are instruction-stalls-bound, as shown in the previous section. Therefore, their throughput does not drop significantly as the data size increases. DBMS D’s normalized throughput for 1 MB, 10 MB, 10 GB, and 100 GB is 1, 0.93, 0.84, and 0.81, respectively; i.e., DBMS D’s throughput drops by only 19% as the data size increases from 1 MB to 100 GB. Similarly, DBMS M’s normalized throughput for 1 MB, 10 MB, and 10 GB is 1, 1.01, and 0.89, respectively; i.e., DBMS M’s throughput drops by only 11% as the data size increases from 1 MB to 10 GB. Therefore, we assume that DBMS M’s throughput for 100 GB is similar to its throughput for 10 GB, and we use DBMS M’s 10 GB throughput in Table 3.
DBMS M’s relative throughput increases as the number of rows per transaction increases. DBMS M even becomes faster than DBMS N as the number of rows per transaction increases to 10 and 100. This is because of the increased instruction locality that DBMS M benefits from. As the number of rows read per transaction increases, DBMS M spends more and more of its time in the in-memory OLTP engine. As a result, it suffers less and less from the legacy codebase that it borrows from DBMS D. This shows that the core in-memory OLTP engine of DBMS M is efficiently implemented and benefits from the in-memory system optimizations.
Shore-MT, DBMS N, and Silo’s throughput relative to DBMS D decreases as the number of rows per transaction increases. This is because DBMS D benefits from the increased instruction locality as the number of rows increases.
The throughput gap between Silo and DBMS N narrows as the number of rows increases: Silo is 4.9\(\times \), 3.8\(\times \), and 3.5\(\times \) faster than DBMS N for 1, 10, and 100 rows, respectively. This is because DBMS N executes the code to set up and instantiate the transactions less and less as the amount of work per transaction increases. As a result, its throughput gets closer and closer to Silo’s, which minimally suffers from the work required to set up and instantiate the transactions. Nevertheless, Silo is 3.5\(\times \) faster than DBMS N even when DBMS N largely eliminates the transaction setup overhead. We examine this throughput difference in more detail in the following section.
Code modules breakdown
To better understand the impact of legacy code, as well as of components outside the storage manager, we quantify the percentage of the execution time spent in the OLTP engine as the amount of work per transaction increases for the disk-based system DBMS D and the in-memory systems DBMS M and DBMS N. While performing this breakdown, we did a best-effort categorization based on the code modules reported by VTune as part of each system’s worker thread execution. Figure 6 shows the results.
We observe that DBMS D and M spend only 35 to 45% of their time inside the OLTP engine, showing the dominance of the legacy code overhead from which both systems suffer. The amount of time spent inside the OLTP engine increases as the number of rows read increases. This increase is less pronounced for the disk-based system DBMS D, showing the higher overhead of the code outside the OLTP engine. For DBMS M, when we increase the number of rows from 1 to 10, the percentage of time inside the OLTP engine increases significantly, showing the reduced effect of the legacy code that DBMS M borrows from the traditional disk-based OLTP system it belongs to. These results better explain why DBMS D and M’s relative throughput (shown in Table 3) gets higher and higher as the number of rows read per transaction increases.
DBMS N is a ground-up designed in-memory system. As a result, it does not suffer from a legacy codebase as DBMS D and M do. However, being an end-to-end system, it spends a certain amount of time setting up, instantiating, and finalizing the transactions. In Fig. 6, the time that DBMS N does not spend inside the OLTP engine corresponds to this instantiation and finalization. This time is \(\sim \)33% when reading 1 row per transaction since the transaction is short. As the number of rows per transaction increases to 10 and 100, the time spent setting up and instantiating the transactions drops to \(\sim \)10% and \(\sim \)5%, respectively. This shows that, depending on the amount of work per transaction, the work required for setting up and finalizing a transaction can be significant.
On the other hand, despite the large amount of work per transaction and the reduced transaction setup and finalization overhead, DBMS N is still 3.5\(\times \) slower than Silo for 100 rows per transaction (see Table 3). To see this difference better, we break the execution time of DBMS N and Silo down by their function call stacks, using the same software-level execution time breakdown used for DBMS N in Sect. 4.1.3. Figure 7 shows the results for 100 rows.
Both systems spend most of their time performing the index lookup. Therefore, the main reason for the performance difference between DBMS N and Silo is DBMS N’s inefficient index structure. We examined DBMS N’s index data structure and saw that it uses a red-black tree. Red-black trees are self-balancing binary search trees, where the number of elements per node is 1. Hence, the red-black tree performs a single comparison at every level of the tree. As every node of the tree is in a random memory location, the red-black tree incurs a data cache miss at every level of the index traversal.
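The depth difference is easy to quantify: a balanced binary tree over \(N\) keys has depth \(\approx \log_2 N\), while a B-tree with fanout \(f\) has depth \(\approx \log_f N = \log_2 N / \log_2 f\). For instance, for a hypothetical table of \(N = 10^8\) rows,
\[ \log_2 10^8 \approx 27 \quad \text{vs.} \quad \log_{15} 10^8 \approx 6.8, \]
so the red-black tree pays roughly 4\(\times \) as many dependent cache misses per lookup, and potentially more, since red-black trees are only approximately balanced, with a worst-case depth of up to \(2\log_2 N\).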
Silo, on the other hand, uses Masstree, a variant of B-tree. As with all B-trees, Masstree’s nodes have a particular node size and fanout; the fanout is 15 for Silo. Hence, instead of 1, it keeps 15 elements per node. As the tree depth shrinks with the logarithm of the fanout, Silo’s index is much shallower than DBMS N’s red-black tree. Therefore, Masstree incurs significantly fewer data cache misses. Furthermore, Masstree software-prefetches the node’s data blocks by issuing software prefetch instructions during the index traversal. As a result, Silo pays a single data cache miss latency for the entire node it accesses at every level of the tree, making Silo significantly faster than DBMS N. This shows that, despite the overheads of a real-life system, the efficiency of the index structure is still the most crucial factor in determining the performance characteristics of an in-memory OLTP system.
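The prefetching idea can be sketched as follows, assuming a simplified fixed-size node; Masstree’s real layout, with its trie of B-trees and versioned nodes, is considerably more involved:

```cpp
#include <cstddef>
#include <cstdint>

// Simplified fanout-15 node and Masstree-style node prefetch (illustrative).
// Prefetching every cache line of the node before searching it overlaps
// the line fetches, so each level costs roughly one miss latency instead
// of one per touched cache line.
struct Node {
    uint64_t keys[15];      // 15 keys per node, as in Silo's Masstree
    Node*    children[16];
    bool     is_leaf;
};

inline void prefetch_node(const Node* n) {
    const char* p = reinterpret_cast<const char*>(n);
    for (size_t off = 0; off < sizeof(Node); off += 64)  // 64 B cache lines
        __builtin_prefetch(p + off);                     // GCC/Clang builtin
}

const Node* descend(const Node* n, uint64_t key) {
    prefetch_node(n);                          // issue prefetches up front
    int i = 0;
    while (i < 15 && n->keys[i] <= key) ++i;   // in-node search
                                               // (assumes a full, sorted node)
    return n->children[i];
}
```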
Finally, DBMS N’s rest component is significantly larger than Silo’s. This is because DBMS N, being a real-life system, executes more functions to provide the end-to-end response.