A database’s automated admin operations work to keep things tight and tidy behind the scenes, but a level of supervision is required. Databases don’t know your business and could very naively decide to execute resource-intensive admin operations at what’s actually a performance-critical time. This final chapter details how common admin operations tend to impact performance. It covers the nature and severity of representative impacts and offers some tips on how you might mitigate them.

Admin Operations and Performance

You might see promises of “zero impact” admin operations, but remember that the laws of physics mean that’s not possible. Performing any operation consumes resources. And when you’re operating at extreme speed and scale, these operations can have outsized effects. For use cases that need to sustain sub-millisecond or single-digit-millisecond P99 latencies, it doesn’t take much for background tasks to have a noticeable impact; such workloads leave very little headroom for system contention. Even admin operations that will ultimately improve your database’s performance inevitably hurt it to some extent while they are executing. The better you understand the extent of their impact on your specific workload, the more effectively you can strategize to minimize disruption.

Low-level details about what admin operations are required will vary from database to database and also change over time; that’s well beyond the scope of this book. This chapter focuses on how admin operations could end up undermining the other work you’ve done to optimize database performance—and how to avoid that scenario. It starts by presenting a quick rule of thumb to prioritize your focus. Then, examples of backups and compaction will showcase the potentially significant—and also highly variable—impact of admin operations on performance.

Looking at Admin Operations Through the Lens of Performance

Every admin operation, from backups to data migrations to adding and reducing capacity, consumes resources that might otherwise be spent on your workload. The impact of an admin operation will vary across databases and workloads. What’s more, an impact that results in lost revenue for one company might be completely acceptable for another.

What admin operations should you focus on from the performance perspective? As shown in Figure 11-1, work through three key considerations for every admin operation that’s being performed.

  1. What’s the impact on your specific workload at your current or projected scale?

  2. How much does that impact matter to your business?

  3. To what extent can you control it?

Figure 11-1. A quick rule of thumb for where to focus your admin-related performance optimizations: a Venn diagram with three circles denoting performance impact, business impact, and controllability. The place to focus is the intersection of all three.

If there is no discernible performance impact for your scenario, then the second and third questions don’t really matter. If there’s a significant and business-critical impact but you can’t control it, you’re in the tough position of deciding whether to accept it or consider moving to an alternative database. If the stars align and you can control something that’s both impactful and business-critical, that’s a great place to focus.

For example, consider PostgreSQL’s autovacuum function. As of this writing, autovacuum is triggered when a specified scale factor/threshold is exceeded. This is likely to coincide with heavy activity on the table—which is probably not when you want background admin tasks to kick in. Starving some tables while repeatedly vacuuming others is common, and users trying to compel autovacuum to hit starved tables can easily end up pushing the system beyond its limit. What’s the likely impact on the business? Probably fairly high for any performance-sensitive use case. And to what extent can you control it? Quite well. For example, you can tune autovacuum settings at both the global and table level, as well as apply strategies like supplementing autovacuum with additional scheduled vacuum jobs. The bottom line here is that this is a great performance optimization opportunity.
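To make that concrete, here’s a minimal sketch of what such tuning can look like; the table name and threshold values are hypothetical, chosen only for illustration:

```sql
-- PostgreSQL triggers autovacuum on a table roughly when:
--   dead tuples > autovacuum_vacuum_threshold
--                 + autovacuum_vacuum_scale_factor * reltuples
-- Lowering the scale factor on a large, hot table makes vacuum
-- kick in earlier, in smaller and cheaper increments.

-- Per-table override ("orders" is a hypothetical table):
ALTER TABLE orders SET (
  autovacuum_vacuum_scale_factor = 0.02, -- global default: 0.2
  autovacuum_vacuum_threshold    = 1000, -- global default: 50
  autovacuum_vacuum_cost_delay   = 2     -- throttle vacuum I/O (ms)
);

-- Spot tables that autovacuum may be starving:
SELECT relname, n_dead_tup, last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC;
```

Scheduled vacuum jobs during known-quiet hours can then supplement whatever autovacuum misses.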

On the other hand, if you are using a managed DBaaS such as DynamoDB, admin operations such as data cleanup might be largely beyond your scope of visibility and control. It certainly doesn’t hurt to ask your vendor what they’re willing to divulge about what, when, and how admin operations are performed. Even if you discover that an admin operation undermines performance in a way that matters for you, you might not be able to control it—but at least you can better prepare for it and diagnose the performance hit when it occurs.

Among admin operations that could negatively impact performance, some of the most common suspects are:

  • Node recovery: This involves existing replicas streaming data in order to recover the missing replica. Existing replicas need to read through all the data required by the recovered replica and transfer the results over the network.

  • Ramping up/down capacity: This often requires an entire cluster or region to rebalance data. Ramping up capacity means that data will be streamed from other replicas to the new one, while ramping down means that the node being removed will stream data out to existing replicas.

  • Data migration: Migration often affects latency on the source cluster. Since a data migration typically involves no downtime, a balance between speed and service stability is needed in order to avoid impacting existing production workloads.

  • Database upgrades: Although the outcome of an upgrade is likely to improve performance, remember that restarting a database instance results in a cold cache. This may affect read latencies if the use case in question is cache heavy.

  • Logging and tracing: When you’re trying to understand a specific pattern or impact, logging and tracing are important. Databases provide several verbosity levels for many logging components, as well as the ability to enable tracing on individual queries. However, enable logging and tracing with caution because both can be resource-intensive operations.

  • Data synchronization: Eventually consistent databases don’t guarantee that all the data you’re looking for will be immediately available across all replicas. As a result, a background process is often needed to get data in sync. This typically involves each replica reading through its existing data, comparing it with its peers, and applying any relevant changes.

Two of the most common operations that impact performance across a variety of databases are backups and compaction. Let’s take a deeper look at both.

Backups

Backups—a common maintenance procedure for any database—can be surprisingly resource intensive. For example, consider a backup strategy where data deduplication is required. As data in the database is frequently written or overwritten, the backup process may consume considerable CPU cycles and read I/O just to determine whether the data to be backed up has already been saved. Then, as it finds newer data that must be retained, it uploads that data (which involves issuing further underlying read I/O) to a safe location. As the process is repeated across multiple nodes, its parallelism often ends up hurting latencies, especially for use cases that rely heavily on disk I/O to fetch information.

Impacts

Factors that influence a backup’s performance impact include:

  • Dataset size and replication factor: The more data you’re backing up, the more time it takes to run a backup—and a higher replication factor means more copies of that data to process. Depending on the number of files stored on disk, backing up may require a lot of read I/O to scan through all the required database blobs.

  • Scope: Are you backing up all on-disk data files all the time (full backup)? A specific cluster? A system-wide snapshot? An incremental backup? A properly defined backup strategy and scope will help you mitigate the impact.

  • Frequency: Frequent small backups produce a constant, low level of pain; larger, less frequent backups inflict a sharper pain, but more rarely (a rough worked example follows this list).

  • Bandwidth throttling: The option to compress or spread out the backup pain helps teams that want to get backups completed as fast as possible during off-peak periods (if any exist) or to run them as unobtrusively as possible alongside steady workloads.

  • Scheduling options: The ability to control precisely when backups occur allows teams with spiky workloads to avoid backups during likely peak periods.

  • Data compression: Greater compression will save on storage, but it comes at the cost of increased CPU usage as the backup runs.

  • Parallelism: The more nodes you back up in parallel, the faster the backup completes—but at the risk of starving your ongoing workload of disk I/O capacity.

  • Storage medium: Reads from local SSDs are noticeably faster than reads from spinning or network-attached disks. As a result, if your database relies on slow-access storage devices, it is much easier for backups to deplete your available read capacity.
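For a rough sense of how these factors interact, take some illustrative numbers (not drawn from any particular database): a full backup of a 2TB dataset throttled to 100MB/s per node takes about 5.5 hours on a single node (2,000,000MB ÷ 100MB/s ≈ 20,000 seconds), but only about 1.4 hours when four nodes stream in parallel, at four times the aggregate disk and network pressure. If only 5 percent of that data changed since the last backup, an incremental backup moves roughly 100GB instead and finishes in under 20 minutes on the same single-node budget.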

Optimization

Before you start adjusting any options, consider these two critical questions:

  • What’s your business’s tolerance for data loss?

  • What type of backup makes the most sense given your workloads?

For example, if you’re working on a food delivery app, a large backup that kicks off in the middle of the Friday lunch surge could result in lost business. The pain could be alleviated by running regular backups during predictable downtimes (e.g., very early in the morning), when there are resources to spare.

But other businesses don’t have a predictable downtime. For another example, consider an application that provides location tracking services for ambulances—a use case where a catastrophic event could bring a dramatic surge at any time without warning. In that case, many small and frequent backups might be the best strategy. This way, backups are unlikely to significantly impact database performance, no matter when the unpredictable demand happens to rise.

Work with your team to understand the backup coverage that you need and what type of backup pain you’re willing to accept, then adjust your options accordingly.

Note

Repairs are a totally different process, but they have a similar impact. Eventually consistent databases need to ensure that replicas (eventually) all have the appropriate updates. In Cassandra and Cassandra-like databases, this process is referred to as repairs. When repair runs, it could cause latency to spike. The key to minimizing its performance impact varies according to your workload. If there’s a time when your database is predictably idle, run repair then—with high parallelism and intensity. If your use case can withstand minor latency spikes, you can try to limit the repair’s intensity and parallelism. But, if you can’t afford any latency spikes (e.g., a real-time bidding use case that must provide sub-millisecond P9999 latencies around the clock), your best bet is to limit the operation to run as slowly as possible.

Compaction

As mentioned in Chapter 2 and covered more in Appendix A, LSM-based databases use compaction—a process of rewriting tables to remove deleted entries and reorganize data to enable faster, more efficient reads and writes. Compaction operations are expensive in terms of CPU, memory, and disk I/O.

The degree to which you can control compaction varies dramatically from database to database. For example, with Bigtable, it’s all done automatically. However, databases such as Couchbase, HBase, Cassandra, and ScyllaDB let you choose from a variety of compaction strategies, many of which have additional options you can use to fine-tune how compaction is performed, as well as other settings that influence compaction performance (for example, rate-limiting it).
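In Cassandra and ScyllaDB, for example, that choice is a per-table CQL setting. Here’s a minimal sketch (keyspace, table names, and option values are hypothetical):

```sql
-- Size-tiered compaction: a common default, generally friendly
-- to write-heavy workloads.
ALTER TABLE app.events
  WITH compaction = {'class': 'SizeTieredCompactionStrategy'};

-- Leveled compaction: favors read-heavy workloads at the cost
-- of higher write amplification.
ALTER TABLE app.user_profiles
  WITH compaction = {
    'class': 'LeveledCompactionStrategy',
    'sstable_size_in_mb': 160
  };
```

Note that switching strategies forces the database to rewrite the table’s data (more on this at the end of the chapter), so it’s not a change to make casually.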

Impacts

The performance impact of compaction also varies dramatically from database to database. One fundamental factor that influences compaction speed is whether the database performs major compactions on each shard/CPU concurrently or binds compaction to a single thread. As shown in Figure 11-2, benchmarks found a nearly 60x difference in the time required to run a major compaction of 1TB of data at RF=1 on i3.4xlarge machines.

Figure 11-2. The wide range of time required to perform a major compaction on similar databases—from 36 minutes (ScyllaDB 4.4.3) to 37 hours and 56 minutes (Cassandra 4.0.0)

Additional factors that influence the impact of compaction include:

  • Compaction backlog: Since compaction is a process that is always running in the background, the amount of data still awaiting compaction is expressed by its backlog. If compaction falls behind, it will eventually try to catch up in an attempt to keep the database from running out of resources and to minimize read amplification. A growing compaction backlog indicates that the cluster is not sized appropriately, that the use case is using an inappropriate compaction strategy, or that the process is being throttled too aggressively in the database settings.

  • Inefficient compaction strategy selection: Write-mostly workloads are different from read-mostly workloads, which are different from update-heavy and delete-heavy workloads. Understanding the concepts behind every compaction strategy and how it impacts your workload is important to avoid read amplification, write amplification, or space amplification.

  • Compaction throughput: If your use case relies heavily on reading cold data, an overly aggressive compaction throughput will steal IOPS and bandwidth needed by your workload. Experiment with different rate-limiting values (for example, Cassandra’s compaction_throughput_mb_per_sec setting) and keep an eye on your compaction backlog until you find your sweet spot.

An inefficient compaction strategy affects different workloads differently, depending on where the inefficiency lies. For example, a write-heavy workload will typically want to prevent compactions from running too aggressively; otherwise, they may exhaust the available disk bandwidth and eventually throttle down the database write path. On the other hand, a read-heavy workload will likely want compaction to run more aggressively, since compaction can actually improve read latencies by requiring the database to issue fewer underlying storage I/O operations. Time series use cases will typically prefer data to be separated into time buckets so that eventual eviction can be done efficiently. And so on, and so on.

Optimization

When selecting a compaction strategy, keep in mind that the ultimate goal should be low amplification. You want to avoid:

  • Read amplification: read requests needing to consult many files to look up relevant data

  • Space amplification: temporary disk space requirements that force the disk to be larger than a perfectly compacted representation of the data

  • Write amplification: compacting the same data over and over again

  • Overwritten/deleted/expired data remaining on disk, slowing down your read path

Since not everyone is using a database that performs compaction, this chapter doesn’t go deep into the weeds of the pros and cons of specific strategies. Table 11-1 provides an overview of which compaction strategy generally works best for different workloads (your results may vary).

Table 11-1 Comparing Compaction Strategies

Two key takeaways should be that 1) one size never fits all, so it’s nice to have a choice in admin matters, and 2) tradeoffs are inevitable—know which pain you can best tolerate and pick your poison.

To drive the point home, here’s a real-world story. Once upon a time, a new ScyllaDB user reported high read latencies. The use case was TTL’d time series data supporting live media streaming. Time series use cases rely heavily on fetching data within specific timeframes and expect those lookups to be fast. As a result, they often use a time-bucketed compaction strategy, which keeps data from the same time window compacted together so the database can avoid scanning through multiple files across distinct windows just to retrieve the data. However, if configured incorrectly, the strategy can backfire and introduce severe performance headaches.

In this particular situation, we discovered that their time buckets were too small for the amount of data they were frequently retrieving in a single query. For example, if you time-bucket your data every ten minutes but always want to retrieve ten hours’ worth of data, the database has to scan through 60 buckets (6 buckets/hour × 10 hours)! With enough concurrency, every query scanning through these large chunks of data could starve the underlying disk I/O capacity. The resolution was to update the compaction configuration to reflect a more realistic data grouping for the use case, as sketched below.
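In CQL terms, the fix looks something like the following sketch (the table name is hypothetical; the window sizes mirror the scenario above):

```sql
-- Before: 10-minute buckets mean a query spanning 10 hours must
-- touch roughly 60 windows' worth of SSTables.
ALTER TABLE media.stream_events
  WITH compaction = {
    'class': 'TimeWindowCompactionStrategy',
    'compaction_window_unit': 'MINUTES',
    'compaction_window_size': 10
  };

-- After: 1-hour buckets cut that to roughly 10 windows, matching
-- how much data each query actually retrieves.
ALTER TABLE media.stream_events
  WITH compaction = {
    'class': 'TimeWindowCompactionStrategy',
    'compaction_window_unit': 'HOURS',
    'compaction_window_size': 1
  };
```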

One final note on adjusting your compaction strategy for performance: Remember that when you adjust your compaction strategy, your database will need to rewrite all your table data. This will incur a significant performance penalty and should be carefully planned to occur at a time that works best for your business.

Summary

Admin operations like repair, compactions, and backups are an unavoidable part of running a healthy, well-performing database. There’s no such thing as a “zero impact” admin operation; performing any operation consumes resources, and these operations can have exacerbated impacts if you’re operating at extreme scale. This chapter used the examples of backups and compaction to showcase the potentially significant—and also highly variable—impact of admin operations on performance.

This is the final official chapter of this book—the end of these highly opinionated recommendations for improving database performance based on what we’ve seen working with a broad range of database users and databases. It’s hardly the end of options for optimizing database performance though. Some potential next steps:

  • Flag the considerations/recommendations that seem to offer potential for your specific workloads and use case, then discuss with your team.

  • Take another look at your database’s specific options (e.g., for monitoring, drivers, admin, etc.) and see if it’s time to rethink any of your previous decisions.

  • Tap your database vendor and/or community to learn about performance-related engineering decisions and optimizations that might offer untapped opportunities (or be responsible for some of your current constraints).

  • Consider whether your data modeling might need a tune-up or overhaul (monitoring can help you assess this). If so, we recommend NoSQL Distilled by Pramod J. Sadalage and Martin Fowler—assuming you’re using NoSQL. If not, browse the wealth of resources on RDBMS data modeling.

  • Continue learning more about the fundamental database design decisions made when building any distributed database: replication and sharding strategies, consensus algorithms, data structures (B-tree vs LSM tree), and so on. You can get a performance-focused introduction to these topics, as well as recommendations for learning more from the masters, in Appendix A.