A database’s internal architecture has a tremendous impact on the latency it can achieve and the throughput it can handle. Being an extremely complex piece of software, a database doesn’t exist in a vacuum; it interacts with its environment, which includes the operating system and the hardware.

While it’s one thing to get massive terabyte-to-petabyte scale systems up and running, it’s a whole other thing to make sure they are operating at peak efficiency. In fact, it’s usually more than just “one other thing.” Performance optimization of large distributed systems is usually a multivariate problem—combining aspects of the underlying hardware, networking, tuning operating systems, and finagling with layers of virtualization and application architectures.

Such a complex problem warrants exploration from multiple perspectives. This chapter begins the discussion of database internals by looking at ways that databases can optimize performance by taking advantage of modern hardware and operating systems. It covers how the database interacts with the operating system plus CPUs, memory, storage, and networking. Then, the next chapter shifts focus to algorithmic optimizations.Footnote 1

CPU

Programming books tell programmers that they have a CPU that can run processes or threads, and that “run” means simple sequential instruction execution. Then there’s a footnote explaining that with multiple threads you might need to consider doing some synchronization. In fact, the way instructions are actually executed inside CPU cores is completely different and much more complicated. It would be very difficult to program these machines without those abstractions from the books, but they are a lie to some degree. Knowing how to take advantage of CPU capabilities efficiently is still very important.

Share Nothing Across Cores

Individual CPU cores aren’t getting any faster. Their clock speeds reached a performance plateau long ago. Now, the ongoing increase of CPU performance continues horizontally: by increasing the number of processing units. In turn, the increase in the number of cores means that performance now depends on coordination across multiple cores (versus the throughput of a single core).

On modern hardware, the performance of standard workloads depends more on the locking and coordination across cores than on the performance of an individual core. Software architects face two unattractive alternatives:

  • Coarse-grained locking, which will see application threads contend for control of the data and wait instead of producing useful work.

  • Fine-grained locking, which, in addition to being hard to program and debug, sees significant overhead even when no contention occurs due to the locking primitives themselves.

Consider an SSD drive. The typical time needed to communicate with a modern NVMe device is quite lengthy—about 20 μs. That’s enough time for the CPU to execute tens of thousands of instructions. Developers should treat the SSD as a networked device, but they generally do not program that way. Instead, they often use a synchronous API (we’ll return to this later), which blocks the calling thread.

Looking at the image of the logical layout of an Intel Xeon Processor (see Figure 3-1), it’s clear that this is also a networked device.

Figure 3-1
An illustration of the Intel Xeon processor layout. Four Skylake processors are interconnected via Intel UPI, each with its own DDR4 DIMMs.

The logical layout of an Intel Xeon Processor

The cores are all connected by what is essentially a network—a dual-ring interconnect architecture; there are two such rings and they are bidirectional. Why, then, should developers use a synchronous API for that? Since sharing information across cores requires costly locking, a shared-nothing model is well worth considering. In such a model, all requests are sharded onto individual cores, one application thread is run per core, and communication depends on explicit message passing, not shared memory between threads. This design avoids slow, unscalable lock primitives and cache bounces.

Any sharing of resources across cores in modern processors must be handled explicitly. For example, when two requests are part of the same session and two CPUs each get a request that depends on the same session state, one CPU must explicitly forward the request to the other. Either CPU may handle either response. Ideally, your database provides facilities that limit the need for cross-core communication—but when communication is inevitable, it provides high-performance non-blocking communication primitives to ensure performance is not degraded.
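To make the idea concrete, here is a minimal C++ sketch (not any particular database’s implementation) of sharding requests onto cores by key and forwarding cross-core work as messages. The shard count, the Mailbox type, and handle_request() are hypothetical, and a mutex-guarded queue stands in for the lock-free rings a real engine would use:

#include <array>
#include <deque>
#include <functional>
#include <mutex>
#include <string>

constexpr unsigned shard_count = 8;  // assumed: one application thread per core

// Hypothetical per-core mailbox. A real engine would use lock-free
// single-producer/single-consumer rings between core pairs; a mutex-guarded
// queue is used here only to keep the sketch short.
struct Mailbox {
    std::mutex m;
    std::deque<std::function<void()>> tasks;
};
std::array<Mailbox, shard_count> mailboxes;

unsigned owning_shard(const std::string& key) {
    return std::hash<std::string>{}(key) % shard_count;
}

void handle_request(unsigned this_shard, const std::string& key) {
    unsigned target = owning_shard(key);
    if (target == this_shard) {
        // Fast path: the data is owned by this core; no cross-core work needed.
        return;
    }
    // Otherwise, forward the request to the owning core as a message.
    std::lock_guard<std::mutex> lk(mailboxes[target].m);
    mailboxes[target].tasks.push_back([key] { /* process on the owning core */ });
}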

Futures-Promises

There are many solutions for coordinating work across multiple cores. Some are highly programmer-friendly and enable the development of software that works exactly as if it were running on a single core. For example, the classic UNIX process model is designed to keep each process in total isolation and relies on kernel code to maintain a separate virtual memory space per process. Unfortunately, this increases the overhead at the OS level.

There’s a model known as “futures and promises.” A future is a data structure that represents some yet-undetermined result. A promise is the provider of this result. It can be helpful to think of a promise/future pair as a first-in first-out (FIFO) queue with a maximum length of one item, which may be used only once. The promise is the producing end of the queue, while the future is the consuming end. Like FIFOs, futures and promises decouple the data producer and the data consumer.
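To make the queue analogy concrete, here is a minimal example using the general-purpose std::promise and std::future from the C++ standard library. (Note that these are the coarse-grained, potentially blocking variants discussed next, not the optimized ones a database engine would use.)

#include <future>
#include <iostream>
#include <thread>

int main() {
    std::promise<int> p;                   // the producing end
    std::future<int> f = p.get_future();   // the consuming end

    // The producer fulfills the promise exactly once...
    std::thread producer([&p] { p.set_value(42); });

    // ...and the consumer retrieves the single value, blocking if needed.
    std::cout << f.get() << "\n";
    producer.join();
}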

However, the optimized implementations of futures and promises need to take several considerations into account. While the standard implementation targets coarse-grained tasks that may block and take a long time to complete, optimized futures and promises are used to manage fine-grained, non-blocking tasks. In order to meet this requirement efficiently, they should:

  • Require no locking

  • Not allocate memory

  • Support continuations

Future-promise design eliminates the costs associated with maintaining individual threads by the OS and allows close to complete utilization of the CPU. On the other hand, it calls for user-space CPU scheduling and very likely limits the developer with voluntary preemption scheduling. The latter, in turn, is prone to generating phantom jams in popular producer-consumer programming templates.Footnote 2

Applying future-promise design to database internals has obvious benefits. First of all, database workloads can be naturally CPU-bound. For example, that’s typically the case with in-memory database engines, and evaluating aggregates also involves pretty intensive CPU work. Even for huge on-disk datasets, where query time is typically dominated by I/O, the CPU must still be considered. Parsing a query is a CPU-intensive task regardless of whether the workload is CPU-bound or storage-bound, and collecting, converting, and sending the data back to the user also calls for careful CPU utilization. And last but not least: Processing the data always involves a lot of high-level operations and low-level instructions. Maintaining them in an optimal manner requires a good low-level programming paradigm, and futures-promises is one of the best choices. However, large instruction sets need even more care; this leads to “execution stages.”

Execution Stages

Let’s dive deeper into CPU microarchitecture, because (as discussed previously) database engine CPUs typically need to deal with millions and billions of instructions, and it’s essential to help the poor thing with that. In a very simplified way, the microarchitecture of a modern x86 CPU—from the point of view of top-down analysis—consists of four major components: frontend, backend, branch speculation, and retiring.

Frontend

The processor’s frontend is responsible for fetching and decoding instructions that are going to be executed. It may become a bottleneck when there is either a latency problem or insufficient bandwidth. The former can be caused, for example, by instruction cache misses. The latter happens when the instruction decoders cannot keep up. In the latter case, the solution may be to attempt to make the hot path (or at least significant portions of it) fit in the decoded μop cache (DSB) or be recognizable by the loop detector (LSD).

Branch Speculation

Pipeline slots that the top-down analysis classifies as bad speculation are not stalled, but wasted. This happens when a branch is incorrectly predicted and the rest of the CPU executes a μop that eventually cannot be committed. The branch predictor is generally considered to be a part of the frontend. However, its problems can affect the whole pipeline in ways beyond just causing the backend to be undersupplied by the instruction fetch and decode. (Note: Branch mispredictions are covered in more detail a bit later in this chapter.)

Backend

The backend receives decoded μops and executes them. A stall may happen either because an execution port is busy or because of a cache miss. At the lower level, a pipeline slot may be core-bound due to either a data dependency or an insufficient number of available execution units. Memory-bound stalls are caused by cache misses at different levels of the data cache, external memory latency, or memory bandwidth.

Retiring

Finally, there are pipeline slots that get classified as retiring. They are the lucky ones that were able to execute and commit their μop without any problems. When 100 percent of the pipeline slots are able to retire without a stall, the program has achieved the maximum number of instructions per cycle for that model of the CPU. Although this is very desirable, it doesn’t mean that there’s no opportunity for improvement. Rather, it means that the CPU is fully utilized and the only way to improve the performance is to reduce the number of instructions.

Implications for Databases

The way CPUs are architected has direct implications on the database design. It may very well happen that individual requests involve a lot of logic and relatively little data, which is a scenario that stresses the CPU significantly. This kind of workload will be completely dominated by the frontend—instruction cache misses in particular. If you think about this for a moment, it shouldn’t be very surprising. The pipeline that each request goes through is quite long. For example, a write request may need to go through transport protocol logic, query parsing code, a lookup in the caching layer, application to the memtable, and so on.

The most obvious way to solve this is to attempt to reduce the amount of logic in the hot path. Unfortunately, this approach does not offer huge potential for significant performance improvement. Reducing the number of instructions needed to perform a certain activity is a popular optimization practice, but a developer cannot keep making the code shorter indefinitely. At some point, the code “freezes”: there is some minimal number of instructions needed even to compare two strings and return the result. It’s impossible to do that with a single instruction.

A higher-level way of dealing with instruction cache problems is called Staged Event-Driven Architecture (SEDA for short). It’s an architecture that splits the request processing pipeline into a graph of stages—thereby decoupling the logic from the event and thread scheduling. This tends to yield greater performance improvements than the previous approach.
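The following sketch shows only the general shape of such a staged design (it is not the original SEDA implementation): each stage owns a queue of events and drains them in a batch, so the instructions of one stage stay hot in the instruction cache while the whole batch is processed. The Request and Stage types are, of course, hypothetical.

#include <deque>
#include <functional>
#include <string>
#include <utility>

struct Request { std::string payload; };

// A stage = a queue of pending events plus the handler for this step.
struct Stage {
    std::deque<Request> queue;
    std::function<void(Request&&, Stage*)> handle;
};

// Drain one stage in a batch so its code stays in the instruction cache.
void run_stage(Stage& s, Stage* next) {
    while (!s.queue.empty()) {
        Request r = std::move(s.queue.front());
        s.queue.pop_front();
        s.handle(std::move(r), next);   // typically pushes into next->queue
    }
}

// A scheduler would then loop over stages (parse -> cache lookup -> write, ...),
// running each stage's batch back to back instead of interleaving per request.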

Memory

Memory management is the central design point in all aspects of programming. Even comparing programming languages to one another always involves discussions about the way programmers are supposed to handle memory allocation and freeing. No wonder memory management design affects the performance of a database so much.

Applied to database engineering, memory management typically falls into two related but independent subsystems: memory allocation and cache control. The former is in fact a fairly generic software engineering issue, so the considerations involved are not extremely specific to databases (though they are crucial and worth studying). The latter topic, in contrast, is very broad in itself, shaped by usage details and corner cases; in the database world, cache control has its own flavor.

Allocation

The manner in which programs or subsystems allocate and free memory lies at the core of memory management. There are several approaches worth considering.

As illustrated by Figure 3-2, so-called “log-structured allocation” is known from filesystems: it directs sequential writes to a circular log on persistent storage and handles updates the very same way. At some point, the filesystem must reclaim blocks that have become obsolete entries in the log area to make more space available for future writes. In a naive implementation, unused entries are reclaimed by rereading and rewriting the log from scratch; obsolete blocks are simply skipped in the process.

Figure 3-2
An illustration of an allocation structure of data and superblock using logs. Four types of logs used are file, inode, directory, and inode map.

A log-structured allocation puts sequential writes to a circular log on persistent storage and handles updates the same way

A memory allocator for naive code can do something similar. In its simplest form, it would allocate the next block of memory by simply advancing a next-free pointer. Deallocation would just need to mark the allocated area as freed. One advantage of this approach is the speed of allocation. Another is the simplicity and efficiency of deallocation, if it happens in FIFO order or affects the whole allocation space. Stack memory allocations are released in the reverse order of allocation, so the stack is the most prominent and most efficient example of this approach.
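A minimal bump (“log-structured”) allocator looks roughly like the following sketch. The 1MB arena size is an arbitrary assumption, and individual-entry reclamation is deliberately left out because, as discussed next, that is exactly where the difficulty lies.

#include <cstddef>

struct BumpArena {
    static constexpr size_t size = 1 << 20;   // assumed 1 MB arena
    alignas(std::max_align_t) char buf[size];
    size_t next = 0;                          // the "next-free" pointer

    void* allocate(size_t n) {
        // Round up so subsequent allocations stay suitably aligned.
        n = (n + alignof(std::max_align_t) - 1) & ~(alignof(std::max_align_t) - 1);
        if (next + n > size) return nullptr;  // arena exhausted
        void* p = buf + next;
        next += n;                            // allocation = advancing a pointer
        return p;
    }

    // Releasing everything at once is trivial; freeing individual entries
    // out of order is what a linear allocator cannot do well.
    void release_all() { next = 0; }
};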

Using linear allocators as general-purpose allocators can be more problematic because of the difficulty of space reclamation. To reclaim space, it’s not enough to just mark entries as free. Doing only that leads to memory fragmentation, which in turn outweighs the advantages of linear allocation. So, as with the filesystem, the memory must be compacted so that it only contains allocated entries and the free space can be used again. Reclamation requires moving allocated entries around—a process that changes and invalidates their previously known addresses. In naive code, the locations of references to allocated entries (addresses stored as pointers) are unknown to the allocator. Existing references would have to be patched to make the allocator’s action transparent to the caller; that’s not feasible for a general-purpose allocator like malloc. Using a log-structured allocator is therefore tied to the choice of programming language. Some languages, like C++, can greatly facilitate this by providing move constructors. However, passing pointers to libraries that are outside of your control (e.g., glibc) would still be an issue.

Another alternative is to adopt a pool allocator strategy, which provides allocation spaces for entries of a fixed size (see Figure 3-3). By limiting the allocation space that way, fragmentation can be reduced. A number of general-purpose allocators use pool allocators for small allocations. In some cases, those allocation spaces exist on a per-thread basis to eliminate the need for locking and improve CPU cache utilization.

Figure 3-3
An illustration of the allocation of data to arrays of slots. The three types of pool allocation arrays are full, partial, and free. The freed object slots are shaded.

Pool allocators provide allocation spaces for allocation of entries of a fixed size. Fragmentation is reduced by limiting the allocation space

This pool allocation strategy provides two core benefits. First, it saves you from having to search for available memory space. Second, it alleviates memory fragmentation because it pre-allocates in memory a cache for use with a collection of object sizes. Here’s how it works to achieve that (a short code sketch follows the list):

  1. The region for each of the sizes has fixed-size memory chunks that are suitable for the contained objects, and those chunks are all tracked by the allocator.

  2. When it’s time for the allocator to allocate memory for a certain type of data object, it’s typically possible to use a free slot (chunk) in one of the existing memory slabs.Footnote 3

  3. When it’s time for the allocator to free the object’s memory, it can simply move that slot over to the containing slab’s list of unused/free memory slots.

  4. That memory slot (or some other free slot) will be removed from the list of free slots whenever there’s a call to create an object of the same type (or a call to allocate memory of the same size).
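The following stripped-down sketch illustrates the per-size-class idea from the steps above. A real slab allocator tracks full, partial, and free slabs (as in Figure 3-3) and grows on demand; this hypothetical version keeps a single slab and one free list per pool just to stay short.

#include <cstddef>
#include <cstdlib>
#include <vector>

// One pool per object size: a slab of fixed-size chunks plus a free list.
class Pool {
    size_t chunk_size_;
    std::vector<void*> free_slots_;
    std::vector<char*> slabs_;

public:
    explicit Pool(size_t chunk_size, size_t chunks_per_slab = 1024)
        : chunk_size_(chunk_size) {
        char* slab = static_cast<char*>(std::malloc(chunk_size * chunks_per_slab));
        slabs_.push_back(slab);
        for (size_t i = 0; i < chunks_per_slab; ++i)
            free_slots_.push_back(slab + i * chunk_size);   // step 1: track the chunks
    }

    void* allocate() {                   // step 2: reuse a free slot if one exists
        if (free_slots_.empty()) return nullptr;  // a real pool would grow a new slab here
        void* p = free_slots_.back();
        free_slots_.pop_back();          // step 4: the slot leaves the free list
        return p;
    }

    void deallocate(void* p) {           // step 3: return the slot to the free list
        free_slots_.push_back(p);
    }

    ~Pool() { for (char* s : slabs_) std::free(s); }
};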

Which allocation approach to pick heavily depends on the usage scenario. One great benefit of the log-structured approach is that it handles fragmentation of small sub-pools more efficiently. Pool allocators, on the other hand, generate less background load on the CPU because there is no compaction activity.

Cache Control

When it comes to memory management in a software application that stores lots of data on disk, you cannot overlook the topic of cache control. Caching is always a must in data processing, and it’s crucial to decide what and where to cache.

If caching is done at the I/O level, for both read/write and mmap, caching can become the responsibility of the kernel. The majority of the system’s memory is given over to the page cache. The kernel decides which pages should be evicted when memory runs low, decides when pages need to be written back to disk, and controls read-ahead. The application can provide some guidance to the kernel using the madvise(2) and fadvise(2) system calls.
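For example, an application that knows it will scan a file sequentially can hint as much to the kernel, roughly as follows (the file path is hypothetical):

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <cstdio>

int main() {
    int fd = open("/var/lib/mydb/data.sst", O_RDONLY);   // hypothetical path
    if (fd < 0) { perror("open"); return 1; }

    // Tell the kernel this file will be read sequentially, so it can
    // read ahead aggressively and drop pages behind the reader.
    posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);

    // For an mmap'ed region, madvise gives similar guidance.
    void* p = mmap(nullptr, 4096, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p != MAP_FAILED) {
        madvise(p, 4096, MADV_SEQUENTIAL);
        munmap(p, 4096);
    }

    close(fd);
    return 0;
}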

The main advantage of letting the kernel control caching is that great effort has been invested by the kernel developers over many decades into tuning the algorithms used by the cache. Those algorithms are used by thousands of different applications and are generally effective. The disadvantage, however, is that these algorithms are general-purpose and not tuned to the application. The kernel must guess how the application will behave next. Even if the application knows differently, it usually has no way to help the kernel guess correctly. This results in the wrong pages being evicted, I/O scheduled in the wrong order, or read-ahead scheduled for data that will not be consumed in the near future.

Next, caching at the I/O level interacts with the topic often referred to as IMR—in-memory representation. Naturally, the format in which data is stored on disk differs from the form in which the same data is laid out in memory as objects. The simplest reason they differ is byte ordering. With that in mind, if the data is cached once it’s read from disk, it still needs to be converted or parsed into the object used in memory. This can be a waste of CPU cycles, so applications may choose to cache at the object level instead.

Choosing to cache at the object level affects a lot of other design points. With that, cache management is entirely on the application side, including cross-core synchronization, data coherence, invalidation, and so on. Next, since objects can be (and typically are) much smaller than the average I/O size, caching millions and billions of those objects requires selecting a collection that can handle it (you’ll learn about this quite soon). Finally, caching at the object level greatly affects the way I/O is done.

I/O

Unless the database engine is an in-memory one, it will have to keep the data on external storage. There are many options for doing that, including local disks, network-attached storage, distributed file and object storage systems, and so on. The term “I/O” typically refers to accessing data on local storage—disks or filesystems (which, in turn, are located on disks as well). And in general, there are four choices for accessing files on a Linux server: read/write, mmap, Direct I/O (DIO) read/write, and Asynchronous I/O (AIO/DIO, because this I/O is rarely used in cached mode).

Traditional Read/Write

The traditional method is to use the read(2) and write(2) system calls. In a modern implementation, the read system call (or one of its many variants—pread, readv, preadv, etc.) asks the kernel to read a section of a file and copy the data into the calling process address space. If all of the requested data is in the page cache, the kernel will copy it and return immediately; otherwise, it will arrange for the disk to read the requested data into the page cache, block the calling thread, and when the data is available, it will resume the thread and copy the data. A write, on the other hand, will usually1 just copy the data into the page cache; the kernel will write back the page cache to disk some time afterward.
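For instance, reading a 4KB block at a given offset with this traditional API looks roughly like the following (error handling trimmed, file path hypothetical):

#include <fcntl.h>
#include <unistd.h>
#include <cstdio>

int main() {
    char buf[4096];
    int fd = open("/var/lib/mydb/data.sst", O_RDONLY);   // hypothetical path
    if (fd < 0) { perror("open"); return 1; }

    // The kernel copies the data from the page cache into buf; the calling
    // thread blocks if the data has to be fetched from disk first.
    ssize_t n = pread(fd, buf, sizeof(buf), 0 /* offset */);
    if (n < 0) perror("pread");

    close(fd);
    return 0;
}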

mmap

An alternative and more modern method is to memory-map the file into the application address space using the mmap(2) system call. This causes a section of the address space to refer directly to the page cache pages that contain the file’s data. After this preparatory step, the application can access file data using the processor’s memory read and memory write instructions. If the requested data happens to be in cache, the kernel is completely bypassed and the read (or write) is performed at memory speed. If a cache miss occurs, then a page-fault happens and the kernel puts the active thread to sleep while it goes off to read the data for that page. When the data is finally available, the memory-management unit is programmed so the newly read data is accessible to the thread, which is then awoken.
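A minimal read-only mapping of the same hypothetical file might look like this:

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstdio>

int main() {
    int fd = open("/var/lib/mydb/data.sst", O_RDONLY);   // hypothetical path
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    fstat(fd, &st);

    // The file's pages become visible in the address space; touching them
    // either hits the page cache directly or triggers a page fault.
    void* p = mmap(nullptr, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }
    char* data = static_cast<char*>(p);

    char first_byte = data[0];   // may fault and block if not cached
    (void)first_byte;

    munmap(p, st.st_size);
    close(fd);
    return 0;
}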

Direct I/O (DIO)

Both traditional read/write and mmap involve the kernel page cache and defer the scheduling of I/O to the kernel. When the application wants to schedule I/O itself (for reasons that we will explain later), it can use Direct I/O, as shown in Figure 3-4. This involves opening the file with the O_DIRECT flag; further activity will use the normal read and write family of system calls. However, their behavior is now altered: Instead of accessing the cache, the disk is accessed directly, which means that the calling thread will be put to sleep unconditionally. Furthermore, the disk controller will copy the data directly to userspace, bypassing the kernel.

Figure 3-4
An illustration of input and output of app, thread, kernel, and disk. The app reads and writes data to the kernel. The DMA data and the context switch are used between the app and the disk.

Direct I/O involves opening the file with the O_DIRECT flag; further activity will use the normal read and write family of system calls, but their behavior is now altered
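A sketch of a Direct I/O read follows. Note that O_DIRECT requires the buffer, offset, and length to be aligned to the device’s block size (alignment comes up again later in this chapter); the 4096-byte alignment and the file path are assumptions.

#include <fcntl.h>
#include <unistd.h>
#include <cstdio>
#include <cstdlib>

int main() {
    // O_DIRECT bypasses the page cache; data is DMA'ed straight into buf.
    int fd = open("/var/lib/mydb/data.sst", O_RDONLY | O_DIRECT);  // hypothetical path
    if (fd < 0) { perror("open"); return 1; }

    // The buffer must be aligned to the logical block size (assumed 4096 here).
    void* buf;
    if (posix_memalign(&buf, 4096, 4096) != 0) { close(fd); return 1; }

    // The calling thread sleeps until the disk completes the transfer.
    ssize_t n = pread(fd, buf, 4096, 0);
    if (n < 0) perror("pread");

    free(buf);
    close(fd);
    return 0;
}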

Asynchronous I/O (AIO/DIO)

A refinement of Direct I/O, Asynchronous Direct I/O, behaves similarly but prevents the calling thread from blocking (see Figure 3-5). Instead, the application thread schedules Direct I/O operations using the io_submit(2) system call, but the thread is not blocked; the I/O operation runs in parallel with normal thread execution. A separate system call, io_getevents(2), waits for and collects the results of completed I/O operations. Like DIO, the kernel’s page cache is bypassed, and the disk controller is responsible for copying the data directly to userspace.

Figure 3-5
An illustration of input and output of app, thread, kernel, and disk. The app reads and writes data to the kernel. The DMA data is used between the app and the disk.

A refinement of Direct I/O, Asynchronous Direct I/O behaves similarly but prevents the calling thread from blocking
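A rough sketch of the same one-block read using the libaio wrapper around these system calls (linked with -laio; the file path and sizes are assumptions):

#include <libaio.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>
#include <cstdlib>

int main() {
    int fd = open("/var/lib/mydb/data.sst", O_RDONLY | O_DIRECT);  // hypothetical path
    if (fd < 0) { perror("open"); return 1; }

    io_context_t ctx = 0;
    io_setup(128, &ctx);                      // queue depth of 128 (assumed)

    void* buf;
    posix_memalign(&buf, 4096, 4096);

    struct iocb cb;
    struct iocb* cbs[1] = { &cb };
    io_prep_pread(&cb, fd, buf, 4096, 0);     // describe the read
    io_submit(ctx, 1, cbs);                   // submit without blocking

    // ... the thread is free to do other work here ...

    struct io_event events[1];
    io_getevents(ctx, 1, 1, events, nullptr); // collect the completion

    io_destroy(ctx);
    free(buf);
    close(fd);
    return 0;
}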

Note: io_uring

The API to perform asynchronous I/O appeared in Linux long ago, and it was warmly met by the community. However, as it often happens, real-world usage quickly revealed many inefficiencies, such as blocking under some circumstances (despite the name), the need to call the kernel too often, and poor support for canceling the submitted requests. Eventually, it became clear that the updated requirements were not compatible with the existing API and the need for a new one arose.

This is how the io_uring API appeared. It provides the same facilities as AIO does, but in a much more convenient and performant way (it also has notably better documentation). Without diving into implementation details, let’s just say that it exists and is preferred over the legacy AIO.
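For comparison, here is roughly what the same one-block read looks like with the liburing helper library (linked with -luring); as before, the path and sizes are illustrative:

#include <liburing.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>

int main() {
    int fd = open("/var/lib/mydb/data.sst", O_RDONLY);   // hypothetical path
    if (fd < 0) { perror("open"); return 1; }

    struct io_uring ring;
    io_uring_queue_init(8, &ring, 0);             // 8 submission entries (assumed)

    char buf[4096];
    struct io_uring_sqe* sqe = io_uring_get_sqe(&ring);
    io_uring_prep_read(sqe, fd, buf, sizeof(buf), 0);
    io_uring_submit(&ring);                       // hand the request to the kernel

    struct io_uring_cqe* cqe;
    io_uring_wait_cqe(&ring, &cqe);               // wait for the completion
    printf("read returned %d\n", cqe->res);
    io_uring_cqe_seen(&ring, cqe);

    io_uring_queue_exit(&ring);
    close(fd);
    return 0;
}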

Understanding the Tradeoffs

The different access methods share some characteristics and differ in others. Table 3-1 summarizes these characteristics, which are discussed further in this section.

Table 3-1 Comparing Different I/O Access Methods

Copying and MMU Activity

One of the benefits of the mmap method is that if the data is in cache, then the kernel is bypassed completely. The kernel does not need to copy data from the kernel to userspace and back, so fewer processor cycles are spent on that activity. This benefits workloads that are mostly in cache (for example, if the ratio of storage size to RAM size is close to 1:1).

The downside of mmap, however, occurs when data is not in the cache. This usually happens when the ratio of storage size to RAM size is significantly higher than 1:1. Every page that is brought into the cache causes another page to be evicted. Those pages have to be inserted into and removed from the page tables; the kernel has to scan the page tables to isolate inactive pages, making them candidates for eviction, and so forth. In addition, mmap requires memory for the page tables. On x86 processors, this requires 0.2 percent of the size of the mapped files. This seems low, but if the application has a 100:1 ratio of storage to memory, the result is that 20 percent of memory (0.2% * 100) is devoted to page tables.

I/O Scheduling

One of the problems with letting the kernel control caching (with the mmap and read/write access methods) is that the application loses control of I/O scheduling. The kernel picks whichever block of data it deems appropriate and schedules it for write or read. This can result in the following problems:

  • A write storm. When the kernel schedules large amounts of writes, the disk will be busy for a long while, impacting read latency.

  • The kernel cannot distinguish between “important” and “unimportant” I/O. I/O belonging to background tasks can overwhelm foreground tasks, impacting their latency2

By bypassing the kernel page cache, the application takes on the burden of scheduling I/O. This doesn’t mean that the problems are solved, but it does mean that the problems can be solved—with sufficient attention and effort.

When using Direct I/O, each thread controls when to issue I/O. However, the kernel controls when the thread runs, so responsibility for issuing I/O is shared between the kernel and the application. With AIO/DIO, the application is in full control of when I/O is issued.

Thread Scheduling

An I/O intensive application using mmap or read/write cannot guess what its cache hit rate will be. Therefore, it has to run a large number of threads (significantly larger than the core count of the machine it is running on). If it uses too few threads, they may all end up waiting for the disk, leaving the processor underutilized. Since each thread usually has at most one disk I/O outstanding, the number of running threads must be around the concurrency of the storage subsystem multiplied by some small factor in order to keep the disk fully occupied. However, if the cache hit rate is sufficiently high, these large numbers of threads will contend with each other for the limited number of cores.

When using Direct I/O, this problem is somewhat mitigated. The application knows exactly when a thread is blocked on I/O and when it can run, so the application can adjust the number of running threads according to runtime conditions.

With AIO/DIO, the application has full control over both running threads and waiting I/O (the two are completely divorced), so it can easily adjust to in-memory or disk-bound conditions or anything in between.

I/O Alignment

Storage devices have a block size; all I/O must be performed in multiples of this block size, which is typically 512 or 4096 bytes. Using read/write or mmap, the kernel performs the alignment automatically; a small read or write is expanded to the correct block boundary by the kernel before it is issued.

With DIO, it is up to the application to perform block alignment. This incurs some complexity, but also provides an advantage: The kernel will usually over-align to a 4096-byte boundary even when a 512-byte boundary suffices, whereas a user application using DIO can issue 512-byte aligned reads, saving bandwidth on small items.

Application Complexity

While the previous discussions favored AIO/DIO for I/O intensive applications, that method comes with a significant cost: complexity. Placing the responsibility of cache management on the application means it can make better choices than the kernel and make those choices with less overhead. However, those algorithms need to be written and tested. Using asynchronous I/O requires that the application is written using callbacks, coroutines, or a similar method, and often reduces the reusability of many available libraries.

Choosing the Filesystem and/or Disk

Beyond performing the I/O itself, the database design must consider the medium against which this I/O is done. In many cases, the choice is between a filesystem and a raw block device, which in turn can be a traditional spinning disk or an SSD drive. In cloud environments, however, there can be a third option, because local drives are always ephemeral—which imposes strict requirements on replication.

Filesystems vs Raw Disks

This decision can be approached from two angles: management costs and performance.

If you’re accessing the storage as a raw block device, all the difficulties of block allocation and reclamation are on the application side. This topic was touched on earlier in the discussion of memory management; the same set of challenges applies to disks as well as to RAM.

A connected, though very different, challenge is providing data integrity in case of crashes. Unless the database is purely in-memory, the I/O should be done in a way that avoids losing data or reading garbage from disk after a restart. Modern filesystems provide both and are mature enough to be trusted for efficient allocation and data integrity. Raw block device access unfortunately lacks those features (so they would need to be implemented at the same quality on the application side).

From the performance point of view, the difference is not that drastic. On one hand, writing data to a file is always accompanied by associated metadata updates. This consumes both disk space and I/O bandwidth. On the other hand, some modern filesystems provide a very good balance of performance and efficiency, almost eliminating the I/O latency overhead. (One of the most prominent examples is XFS. Another really good and mature piece of software is Ext4.) The great ally in this camp is the fallocate(2) system call, which makes the filesystem preallocate space on disk. When used, filesystems also have a chance to make full use of their extent mechanisms, bringing the QoS of using files to the same performance level as using raw block devices.
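For example, preallocating a 1GB segment file with the Linux-specific fallocate(2) (posix_fallocate is the portable alternative) might look like this; the path and size are hypothetical:

#include <fcntl.h>
#include <unistd.h>
#include <cstdio>

int main() {
    int fd = open("/var/lib/mydb/commitlog-0001.log",     // hypothetical path
                  O_WRONLY | O_CREAT, 0644);
    if (fd < 0) { perror("open"); return 1; }

    // Ask the filesystem to reserve 1 GB of extents up front; subsequent
    // writes into this range won't need on-the-fly block allocation.
    if (fallocate(fd, 0, 0, 1024L * 1024 * 1024) != 0)
        perror("fallocate");

    close(fd);
    return 0;
}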

Appending Writes

The database may have a heavy reliance on appends to files or require in-place updates of individual file blocks. Both approaches need special attention from the system architect because they call for different properties from the underlying system.

On one hand, appending writes requires careful interaction with the filesystem so that metadata updates (file size, in particular) do not dominate the regular I/O. On the other hand, appending writes (being a sort of cache-oblivious algorithm) handle the disk-overwriting difficulties in a natural manner. In contrast, in-place updates cannot simply happen at random offsets and sizes, because disks may not tolerate that kind of workload—even when they’re used as raw block devices (not via a filesystem).

That being said, let’s dive even deeper into the stack and descend into the hardware level.

How Modern SSDs Work

Like other computational resources, disks are limited in the speed they can provide. This speed is typically measured as a two-dimensional value: Input/Output Operations per Second (IOPS) and bytes per second (throughput). Of course, these parameters are not set in stone even for a particular disk; the maximum number of requests or bytes greatly depends on the requests’ distribution, queuing and concurrency, buffering or caching, disk age, and many other factors. So when performing I/O, a system must always balance between two inefficiencies—overwhelming the disk with requests and underutilizing it.

Overwhelming the disk should be avoided because when the disk is full of requests it cannot distinguish the criticality of some requests over others. Of course, all requests are important, but it makes sense to prioritize latency-sensitive ones. For example, ScyllaDB serves real-time queries that need to be completed in single-digit milliseconds or less and, in parallel, processes terabytes of data for compaction, streaming, decommission, and so forth. The former have strong latency sensitivity; the latter, less so. Good I/O maintenance that tries to maximize the I/O bandwidth while keeping latency as low as possible for latency-sensitive tasks is complicated enough to become a standalone component called the I/O Scheduler.

When evaluating a disk, you would most likely be looking at its four parameters—read/write IOPS and read/write throughput (such as in MB/s). Comparing these numbers to one another is a popular way of claiming one disk is better than the other and estimating the aforementioned “bandwidth capacity” of the drive by applying Little’s Law. With that, the I/O Scheduler’s job is to provide a certain level of concurrency inside the disk to get maximum bandwidth from it, but not to make this concurrency too high in order to prevent the disk from queueing requests internally for longer than needed.
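For a rough, hypothetical illustration of that estimation: if a drive is rated at about 500K read IOPS and you want its internal latency to stay around 200 μs, Little’s Law (concurrency = throughput × latency) suggests keeping roughly 500,000 × 0.0002 = 100 requests in flight; pushing concurrency far beyond that only grows the disk’s internal queues and, with them, the latency.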

For instance, Figure 3-6 illustrates how read request latency depends on the intensity of small reads (challenging disk IOPS capacity) vs the intensity of large writes (pursuing the disk bandwidth). The latency value is color-coded, and the “interesting area” is painted in cyan—this is where the latency stays below 1 millisecond. The drive measured is the NVMe disk that comes with the AWS EC2 i3en.3xlarge instance.

Figure 3-6
Two graphs of p50 and p95 latency from 0 to 1 GB per second. The graphs depict a decreasing trend. Two shaded strips are given to the right of the graphs.

Bandwidth/latency graphs showing how read request latency depends on the intensity of small reads (challenging disk IOPS capacity) vs the intensity of large writes (pursuing the disk bandwidth)

This drive demonstrates almost perfect half-duplex behavior—increasing the read intensity several times requires roughly the same reduction in write intensity to keep the disk operating at the same speed.

Tip: How to Measure Your Own Disk Behavior Under Load

The better you understand how your own disks perform under load, the better you can tune them to capitalize on their “sweet spot.” One way to do this is with Diskplorer,Footnote 4 an open-source disk latency/bandwidth exploring toolset. Using Linux fio under the hood, it runs a battery of measurements to discover the performance characteristics of a specific hardware configuration, giving you an at-a-glance view of how server storage I/O will behave under load.

For a walkthrough of how to use this tool, see the Linux Foundation video, “Understanding Storage I/O Under Load.”Footnote 5

Networking

The conventional networking functionality available in Linux is remarkably full-featured, mature, and performant. Since a database rarely imposes severe per-ping latency requirements, the kernel’s networking stack holds very few surprises when properly configured and used. Nonetheless, some considerations still need to be made.

As explained by David Ahern, “Linux will process a fair amount of packets in the context of whatever is running on the CPU at the moment the IRQ is handled. System accounting will attribute those CPU cycles to any process running at that moment even though that process is not doing any work on its behalf. For example, ‘top’ can show a process that appears to be using 99+% CPU, but in reality, 60 percent of that time is spent processing packets—meaning the process is really only getting 40 percent of the CPU to make progress on its workload.”Footnote 6

However, for truly networking-intensive applications, the Linux stack is constrained:

  • Kernel space implementation: Separation of the network stack into kernel space means that costly context switches are needed to perform network operations, and that data copies must be performed to transfer data from kernel buffers to user buffers and vice versa.

  • Time sharing: Linux is a time-sharing system, and so must rely on slow, expensive interrupts to notify the kernel that there are new packets to be processed.

  • Threaded model: The Linux kernel is heavily threaded, so all data structures are protected with locks. While a huge effort has been invested in making Linux very scalable, this is not without limitations, and contention occurs at large core counts. Even without contention, the locking primitives themselves are relatively slow and impact networking performance.

As before, the way to overcome these limitations is to move packet processing to userspace. There are plenty of out-of-kernel implementations of TCP that are worth considering.

DPDK

One of the generic tradeoffs that’s often referred to in the networking area is poll mode vs. interrupt mode. When a packet arrives, the system has two options for how to get informed—set up an interrupt from the hardware (or, in the case of a userspace implementation, from the kernel file descriptor using the poll family of system calls) or keep polling the network card on its own from time to time until a packet is noticed.

The famous userspace network toolkit, called DPDK, is designed specifically for fast packet processing, usually in fewer than 80 CPU cycles per packet.Footnote 7 It integrates seamlessly with Linux in order to take advantage of high-performance hardware.

IRQ Binding

As stated earlier, packet processing may take up to 60 percent of the CPU time, which is way too much; it leaves too few CPU ticks for the database work itself. Even though in this case the backpressure mechanism would most likely keep the external activity off and the system would likely find its balance, the resulting system throughput would likely be unacceptable.

System architects may consider the non-symmetrical CPU approach to mitigate this. If you’re letting the Linux kernel process network packets, there are several ways to localize this processing on separate CPUs.

The simplest way is to bind the IRQ processing from the NIC to specific cores or hyper-threads. Linux uses two-step processing of incoming packets called IRQ and soft-IRQ. If the IRQs are properly bound to cores, the soft-IRQ also happens on those cores—thus completely localizing the processing.
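As a minimal illustration (the IRQ number and CPU list are hypothetical, and production setups typically drive this from scripts or irqbalance policies rather than hardcoding it), binding an IRQ to specific cores is just a write to procfs:

#include <fstream>
#include <iostream>

int main() {
    // Hypothetical: NIC queue interrupt 63 gets pinned to CPUs 0 and 1,
    // so both the hard IRQ and the follow-up soft-IRQ run on those cores.
    std::ofstream affinity("/proc/irq/63/smp_affinity_list");
    if (!affinity) {
        std::cerr << "cannot open smp_affinity_list (are you root?)\n";
        return 1;
    }
    affinity << "0-1\n";
    return 0;
}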

For huge-scale nodes running tens to hundreds of cores, there may be more than one network-only core. In this case, it might make sense to localize processing even further by assigning cores from different NUMA nodes and teaching the NIC to balance the traffic between them using the receive packet steering facility of the Linux kernel.

Summary

This chapter introduced a number of ways that database engineering decisions enable database users to squeeze more power out of modern infrastructure. For CPUs, the chapter talked about taking advantage of multicore servers by limiting resource sharing across cores and using future-promise design to coordinate work across cores. The chapter also provided a specific example of how low-level CPU architecture has direct implications on the database.

Moving on to memory, you read about two related but independent subsystems: memory allocation and cache control. For I/O, the chapter discussed Linux options such as traditional read/write, mmap, Direct I/O (DIO) read/write, and Asynchronous I/O—including the various tradeoffs of each. This was followed by a deep dive into how modern SSDs work and how a database can take advantage of a drive’s unique characteristics. Finally, you looked at constraints associated with the Linux networking stack and explored alternatives such as DPDK and IRQ binding. The next chapter shifts the focus from hardware interactions to algorithmic optimizations: pure software challenges.