What’s more fun than wrestling with database performance? Well, a lot. But that doesn’t mean you can’t have a little fun here. To give you an idea of the complexities you’ll likely face if you’re serious about optimizing database performance, this chapter presents two rather fanciful stories. The technical topics covered here are expanded on throughout the book. But this is the one and only time you’ll hear of poor Joan and Patrick. Let their struggles bring you some valuable lessons, solace in your own performance predicaments… and maybe a few chuckles as well.

Joan Dives Into Drivers and Debugging

Lured in by impressive buzzwords like “hybrid cloud,” “serverless,” and “edge first,” Joan readily joined a new company and started catching up with their technology stack. Her first project had recently started a transition from their in-house implementation of a database system, which turned out not to scale at the same pace as the number of customers, to one of the industry-standard database management solutions. Their new pick was a distributed database that, unlike many NoSQL systems, strives to keep the original ACID (Footnote 1) guarantees known from the SQL world.

Due to a few new data protection acts that seem to appear annually nowadays, the company’s board decided to maintain their own datacenter instead of using one of the popular cloud vendors to store sensitive information.

On a very high level, the company’s main product consisted of only two layers:

  • The frontend, the entry point for users, which actually runs in their own browsers and communicates with the rest of the system to exchange and persist information.

  • The everything-else, customarily known as the backend, which actually includes load balancers, authentication, authorization, multiple cache layers, databases, backups, and so on.

Joan’s first task was to implement a very simple service for gathering and summing up various statistics from the database, and to integrate that service with the whole ecosystem so that it fetched data from the database in real time and allowed the DevOps teams to inspect the statistics live.

To impress management and reassure them that hiring Joan was absolutely their best decision this quarter, Joan decided to deliver a proof-of-concept implementation on her first day! The company’s unspoken policy was to write software in Rust, so she grabbed the first driver for their database that came up in a brief crates.io search and sat down to her self-organized hackathon.

The day went by really smoothly, with Rust’s ergonomics-focused ecosystem providing a superior developer experience. But then Joan ran her first smoke tests on a real system. Disbelief turned to disappointment and helplessness when she realized that every third request (on average) ended up in an error, even though the whole database cluster reported itself to be in a healthy, operable state. That meant a debugging session was in order!

Unfortunately, the driver Joan hastily picked as the foundation of her work, even though open-source itself, was just a thin wrapper over precompiled, legacy C code, with no source to be found. Fueled by a strong desire to solve the mystery and a healthy dose of fury, Joan spent a few hours inspecting the network communication with Wireshark (Footnote 2), and she made an educated guess that the bug must be in the key hashing implementation (Footnote 3). In the database used by the company, keys are hashed and the hashes are used to route requests to the appropriate nodes. If a hash value is computed incorrectly, a request may be forwarded to the wrong node, which can refuse it and return an error instead.
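To make the routing idea concrete, here is a deliberately simplified sketch in Rust. The hash function, the modulo-based node selection, and the function names are illustrative assumptions rather than the actual scheme of any particular database; the point is only that the client and the server must compute the same hash for the same key, or requests end up on the wrong node.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical routing helper: hash the partition key and map the hash
/// onto one of the cluster's nodes. Real databases typically use a token
/// ring and an agreed-upon hash function, but the principle is the same.
fn route_to_node(partition_key: &str, node_count: u64) -> u64 {
    let mut hasher = DefaultHasher::new();
    partition_key.hash(&mut hasher);
    hasher.finish() % node_count
}

fn main() {
    // The client-side driver and the server must agree on the hash function.
    // A driver that computes it differently sends requests to the wrong node,
    // which may refuse them and return an error.
    println!("key 'user:42' routes to node {}", route_to_node("user:42", 8));
}
```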

Unable to verify the claim due to the missing source code, Joan decided on a simpler path—ditching the originally chosen driver and reimplementing the solution on top of one of the officially supported, open-source drivers backed by the database vendor, with a solid user base and a regularly updated release schedule.

Joan’s Diary of Lessons Learned, Part I

The initial lessons include:

  1. Choose a driver carefully. It’s at the core of your code’s performance, robustness, and reliability.

  2. Drivers have bugs too, and it’s impossible to avoid them. Still, there are good practices to follow:

    a. Unless there’s a good reason not to, choose the officially supported driver (if it exists).

    b. Open-source drivers have advantages: they’re not only verified by the community, but they also allow deep inspection of the code, and even modifying the driver code to get more insights for debugging.

    c. It’s better to rely on drivers with a well-established release schedule, since they are more likely to receive bug fixes (including for security vulnerabilities) in a reasonable period of time.

  3. Wireshark is a great open-source tool for interpreting network packets; give it a try if you want to peek under the hood of your program.

The introductory task was eventually completed successfully, which made Joan ready to receive her first real assignment.

The Tuning

Armed with the experience gained working on the introductory task, Joan started planning how to approach her new assignment: a misbehaving app. One of the applications notoriously caused stability issues for the whole system, disrupting other workloads each time it experienced any problems. The rogue app was already based on an officially supported driver, so Joan could cross that one off the list of potential root causes.

This particular service was responsible for injecting data backed up from the legacy system into the new database. Because the company was not in a great hurry, the application was written with low concurrency in mind, so that it would run at low priority and not interfere with user workloads. Unfortunately, once every few days, something triggered an anomaly. The normally peaceful application seemed to be trying to perform a denial-of-service attack on its own database, flooding it with requests until the backend got overloaded enough to cause issues for other parts of the ecosystem.

As Joan watched metrics presented in a Grafana dashboard, clearly suggesting that the rate of requests generated by this application started spiking around the time of the anomaly, she wondered how on Earth this workload could behave like that. It was, after all, explicitly implemented to send new requests only when fewer than 100 of them were currently in progress.

Since collaboration was heavily advertised as one of the company’s “spirit and cultural foundations” during the onboarding sessions with an onsite coach, she decided it was best to discuss the matter with her colleague, Tony.

  • “Look, Tony, I can’t wrap my head around this,” she explained. “This service doesn’t send any new requests when 100 of them are already in flight. And look right here in the logs: 100 requests in progress, one returned a timeout error, and…” She stopped, startled by her own epiphany.

  • “Alright, thanks Tony, you’re a dear—best rubber duck (Footnote 4) ever!” she concluded, and returned to fixing the code.

The observation that led to discovering the root cause was rather simple: The request didn’t actually return a timeout error, because the database server never sent such a response. The request was simply classified as timed out by the driver and discarded. But the mere fact that the driver no longer waits for a response to a particular request does not mean that the database is done processing it! It’s entirely possible that the request was just stalled, taking longer than expected, and the driver gave up waiting for its response.

With that knowledge, it’s easy to imagine that once 100 requests time out on the client side, the app might erroneously think that they are not in progress anymore, and happily submit 100 more requests to the database, increasing the total number of in-flight requests (i.e., concurrency) to 200. Rinse, repeat, and you can achieve extreme levels of concurrency on your database cluster—even though the application was supposed to keep it limited to a small number!
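Here is a toy model of that failure mode, with made-up types and no real driver involved. It only tracks two counters, to show how client-side timeouts let the apparent concurrency stay at 100 while the server-side concurrency doubles.

```rust
/// A toy model of the bug: the app counts a request as "done" as soon as
/// its own client-side timeout fires, even though the server may still be
/// processing it.
struct NaiveLimiter {
    client_in_flight: usize, // what the application believes is in flight
    server_in_flight: usize, // what the database is actually processing
    limit: usize,
}

impl NaiveLimiter {
    fn submit(&mut self) -> bool {
        if self.client_in_flight >= self.limit {
            return false; // the app thinks it is at the limit
        }
        self.client_in_flight += 1;
        self.server_in_flight += 1;
        true
    }

    /// Client-side timeout: the driver gives up waiting and discards the
    /// request, but the server keeps working on it.
    fn client_timeout(&mut self) {
        self.client_in_flight -= 1;
        // server_in_flight is intentionally NOT decremented here.
    }
}

fn main() {
    let mut limiter = NaiveLimiter { client_in_flight: 0, server_in_flight: 0, limit: 100 };
    for _ in 0..100 {
        limiter.submit(); // the first 100 requests go out
    }
    for _ in 0..100 {
        limiter.client_timeout(); // all of them stall and time out client-side
    }
    for _ in 0..100 {
        limiter.submit(); // the app "knows" nothing is in flight and sends 100 more
    }
    println!("server-side concurrency: {}", limiter.server_in_flight); // prints 200
}
```

Running it prints a server-side concurrency of 200, exactly the runaway effect Joan observed.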

Joan’s Diary of Lessons Learned, Part II

The lessons continue:

  1. Client-side timeouts are convenient for programmers, but they can interact badly with server-side timeouts. Rule of thumb: Make the client-side timeouts around twice as long as the server-side ones, unless you have an extremely good reason to do otherwise. Some drivers may be capable of issuing a warning if they detect that the client-side timeout is smaller than the server-side one, or even of amending the server-side timeout to match, but in general it’s best to double-check (see the sketch after this list).

  2. Tasks with seemingly fixed concurrency can actually cause spikes under certain unexpected conditions. Inspecting logs and dashboards is helpful in investigating such cases, so make sure that observability tools are available, both in the database cluster and for all client applications. Bonus points for distributed tracing, such as an OpenTelemetry (Footnote 5) integration.
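As a minimal illustration of the first lesson, the sketch below encodes the “client-side timeout is roughly twice the server-side timeout” rule of thumb. The SessionConfig type and its fields are hypothetical, not any real driver’s API.

```rust
use std::time::Duration;

// Hedged sketch: `SessionConfig` and its fields are hypothetical, not any
// real driver's API. It only encodes the rule of thumb from lesson 1.
struct SessionConfig {
    server_side_timeout: Duration,
    client_side_timeout: Duration,
}

impl SessionConfig {
    fn with_server_timeout(server_side_timeout: Duration) -> Self {
        SessionConfig {
            server_side_timeout,
            // The client waits roughly twice as long as the server, so a
            // request the server is still processing isn't prematurely
            // declared dead on the client side.
            client_side_timeout: server_side_timeout * 2,
        }
    }
}

fn main() {
    let cfg = SessionConfig::with_server_timeout(Duration::from_secs(5));
    assert!(cfg.client_side_timeout >= cfg.server_side_timeout * 2);
    println!(
        "server-side: {:?}, client-side: {:?}",
        cfg.server_side_timeout, cfg.client_side_timeout
    );
}
```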

With the client-side timeouts properly amended, the application choked much less frequently and to a smaller extent, but it still wasn’t a perfect citizen in the distributed system. It occasionally picked a victim database node and kept bothering it with too many requests, while ignoring the fact that seven other nodes were considerably less loaded and could help handle the workload too. At other times, its concurrency was reported to be exactly twice as high as the configuration allowed. Whenever the two anomalies converged in time, the poor node was unable to handle all the requests it was bombarded with, and it had to give up on a fair portion of them. A long study of the driver’s documentation, which was fortunately available in mdBook (Footnote 6) format and kept reasonably up to date, helped Joan alleviate those pains too.

The first issue was simply a misconfiguration of a non-default load balancing policy, which tried too hard to pick “the least loaded” database node out of all the available ones, based on heuristics and statistics occasionally updated by the database itself. Unfortunately, this policy was also “best effort,” and it relied on the statistics arriving from the database always being legitimate. But a stressed database node could become so overloaded that it stopped sending updated statistics in time! That led the driver to falsely believe that this particular server was not busy at all. Joan decided that this setup was a premature optimization that turned out to be a footgun, so she simply restored the original default policy, which worked as expected.
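Here is a hedged sketch of why the “least loaded” heuristic backfired. The types are made up (not the real driver’s API); it only shows that a policy trusting reported statistics keeps favoring a node whose last report is stale, while a default round-robin policy spreads the load regardless.

```rust
/// Hedged sketch with made-up types, not the real driver's API: a
/// "least loaded" policy is only as good as the statistics it sees.
struct NodeStats {
    node_id: usize,
    // Last in-flight count the node managed to report. For an overloaded
    // node this value may be badly out of date.
    reported_in_flight: usize,
}

fn least_loaded(nodes: &[NodeStats]) -> usize {
    nodes
        .iter()
        .min_by_key(|n| n.reported_in_flight) // trusts possibly stale data
        .map(|n| n.node_id)
        .unwrap()
}

fn round_robin(request_no: usize, node_count: usize) -> usize {
    request_no % node_count
}

fn main() {
    let nodes = vec![
        NodeStats { node_id: 0, reported_in_flight: 40 },
        // Node 1 is so overloaded it stopped reporting; its stale report
        // still claims only 5 requests in flight.
        NodeStats { node_id: 1, reported_in_flight: 5 },
        NodeStats { node_id: 2, reported_in_flight: 35 },
    ];
    // The "smart" policy keeps sending traffic to the busiest node.
    println!("least-loaded picks node {}", least_loaded(&nodes));
    // The default policy spreads requests evenly, regardless of statistics.
    println!("round-robin sends request #7 to node {}", round_robin(7, nodes.len()));
}
```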

The second issue (temporary doubling of the concurrency) was caused by another misconfiguration: an overeager speculative retry policy. After waiting for a preconfigured period of time without getting an acknowledgement from the database, the driver would speculatively resend a request to maximize its chances of success. This mechanism is very useful for increasing a request’s success rate. However, if the original request also succeeds, it means that the speculative one was sent in vain. In order to balance the pros and cons, speculative retry should be configured to resend requests only when it’s very likely that the original one failed. Otherwise, as in Joan’s case, the speculative retry may act too soon, doubling the number of requests sent (and thus also doubling concurrency) without improving the success rate.
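The following sketch captures that trade-off with hypothetical names (not the driver’s actual configuration API): a speculative retry delay shorter than the latency that healthy requests normally need will duplicate nearly every request.

```rust
use std::time::Duration;

/// Hedged sketch: the type and function names are hypothetical, not the
/// driver's actual API. The point is the relationship between the retry
/// delay and the latency the database normally needs.
struct SpeculativeRetryPolicy {
    /// How long to wait before sending a duplicate of a pending request.
    delay: Duration,
}

fn is_overeager(policy: &SpeculativeRetryPolicy, typical_p99_latency: Duration) -> bool {
    // If we retry before even slow-but-healthy requests normally finish,
    // almost every request gets duplicated and concurrency roughly doubles.
    policy.delay < typical_p99_latency
}

fn main() {
    let p99 = Duration::from_millis(120);
    let misconfigured = SpeculativeRetryPolicy { delay: Duration::from_millis(50) };
    let sane = SpeculativeRetryPolicy { delay: Duration::from_millis(500) };
    println!("50 ms delay overeager:  {}", is_overeager(&misconfigured, p99));
    println!("500 ms delay overeager: {}", is_overeager(&sane, p99));
}
```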

Whew, nothing gives a simultaneous endorphin rush and dopamine hit like a quality debugging session that ends in an astounding success (except writing a cheesy story in a deeply technical book, naturally). Great job, Joan!

The end.

Patrick’s Unlucky Green Fedoras

After losing his job at a FAANG MAANG (MANGA?) company, Patrick decided to strike out on his own and founded a niche online store dedicated to trading his absolute favorite among headwear: green fedoras. Noticing that a certain NoSQL database had recently been trending on the front page of Hacker News, Patrick picked it for his backend stack.

After some experimentation with the offering’s free tier, Patrick decided to sign a one-year contract with a major cloud provider to get a significant discount on its NoSQL database-as-a-service offering. With provisioned throughput capable of serving up to 1,000 customers every second, the technology stack was ready and the store opened its virtual doors to the customers. To Patrick’s disappointment, fewer than ten customers visited the site daily. At the same time, the shiny new database cluster kept running, fueled by a steady influx of money from his credit card and waiting for its potential to be harnessed.

Patrick’s Diary of Lessons Learned, Part I

The lessons started right away:

  1. Although some databases advertise themselves as universal, most of them perform best for certain kinds of workloads. The analysis before selecting a database for your own needs must include estimating the characteristics of your own workload:

    a. Is it likely to be a predictable, steady flow of requests (e.g., updates being fetched from other systems periodically)?

    b. Is the variance high and hard to predict, with the system being idle for potentially long periods of time, with occasional bumps of activity?

    Database-as-a-service offerings often let you pick between provisioned throughput and on-demand pricing. Although the former is more cost-efficient, it incurs a certain cost regardless of how busy the database actually is. The latter costs more per request, but you only pay for what you use (see the back-of-the-envelope sketch after this list).

  2. Give yourself time to evaluate your choice and avoid committing to long-term contracts (even if lured by a discount) before you see that the setup works for you in a sustainable way.
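The back-of-the-envelope sketch below contrasts the two pricing models. All prices are made up purely for illustration; real cloud pricing differs and changes, so only the shape of the trade-off matters here.

```rust
fn main() {
    // Back-of-the-envelope sketch with made-up prices; real cloud pricing
    // differs and changes. The point is only the shape of the trade-off.
    let provisioned_capacity_rps = 1_000.0;      // reserved requests per second
    let provisioned_price_per_rps_hour = 0.0001; // $ per reserved req/s per hour (illustrative)
    let on_demand_price_per_request = 0.0000005; // $ per request (illustrative)

    let hours_per_month = 730.0;
    let provisioned_monthly =
        provisioned_capacity_rps * provisioned_price_per_rps_hour * hours_per_month;

    for actual_rps in [1.0, 100.0, 1_000.0] {
        let requests_per_month = actual_rps * 3_600.0 * hours_per_month;
        let on_demand_monthly = requests_per_month * on_demand_price_per_request;
        println!(
            "avg {actual_rps:>6} req/s: provisioned ~ ${provisioned_monthly:.0}/mo, on-demand ~ ${on_demand_monthly:.0}/mo"
        );
    }
    // At a handful of requests per day, the provisioned cluster is mostly
    // paid-for idle capacity.
}
```

With these assumed numbers, provisioned capacity only pays off once the average load approaches the reserved throughput; at Patrick’s ten visitors a day, he was paying mostly for idle capacity.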

The First Spike

March 17th seemed like an extremely lucky day. Patrick was pleased to notice lots of new orders coming in from the early morning. But as the number of active customers skyrocketed around noon, Patrick’s mood started to deteriorate. This was strongly correlated with the rate of calls he received from angry customers reporting their inability to proceed with their orders.

After a short brainstorming session with himself and a web search engine, Patrick realized, to his dismay, that he lacked any observability tools on his precious (and quite expensive) database cluster. Shortly after frantically setting up Grafana and browsing the metrics, Patrick saw that although the number of incoming requests kept growing, their success rate was capped at a certain level, way below today’s expected traffic.

“Provisioned throughput strikes again,” Patrick groaned to himself while scrolling through thousands of “throughput exceeded” error messages that had started appearing around 11 a.m.

Patrick’s Diary of Lessons Learned, Part II

This is what Patrick learned:

  1. If your workload is susceptible to spikes, be prepared for them and try to architect your cluster to survive a temporarily elevated load. Database-as-a-service solutions tend to allow configuring the provisioned throughput dynamically, which means that the threshold of accepted requests can occasionally be raised temporarily to a preconfigured level, or, conversely, temporarily decreased to make the solution slightly more cost-efficient.

  2. Always expect spikes. Even if your workload is absolutely steady, a temporary hardware failure or a surprise DDoS attack can cause a sharp increase in incoming requests.

  3. Observability is key in distributed systems. It allows developers to retrospectively investigate a failure. It also provides real-time alerts when a likely failure scenario is detected, allowing people to react quickly and either prevent a larger failure from happening, or at least minimize the negative impact on the cluster.

The First Loss

Patrick hadn’t even managed to recover from the trauma of losing most of his potential income on the only day of the year when green fedoras experienced any kind of demand, when the letter came. It included an angry rant from a would-be customer who had successfully placed an order and paid for it (with a receipt from the payment processing operator as proof), but was now unable to see any details of his order—and was still waiting for the delivery!

Without further ado, Patrick browsed the database. To his astonishment, he didn’t find any trace of the order either. For completeness, Patrick also put his wishful thinking into practice by browsing the backup snapshot directory. It remained empty, as one of Patrick’s initial executive decisions was to save time and money by not scheduling any periodic backup procedures.

How did data loss happen to him, of all people? After studying the consistency model of his database of choice, Patrick realized that there were trade-offs to be made between consistency guarantees, performance, and availability. By configuring the queries, one can either demand linearizability (Footnote 7) at the cost of decreased throughput, or relax the consistency guarantees and increase performance accordingly. Opting for higher throughput had been a no-brainer for Patrick a few days earlier, but as a result, customer data landed on a single server without any replicas distributed across the system. Once that server failed—which happens to hardware surprisingly often, especially at large scale—the data was gone.
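To make the knob Patrick overlooked more tangible, here is a sketch with generic names (not his database’s actual API) showing how the write consistency level and the replication factor together determine how many replicas must acknowledge a write. With a replication factor of 1, no consistency level can protect the data from a single node failure.

```rust
/// Hedged sketch with generic names, not any specific database's API: the
/// write consistency level decides how many replicas must acknowledge a
/// write before it counts as successful.
enum WriteConsistency {
    One,    // fast, but a single node failure can lose the write
    Quorum, // a majority of replicas must acknowledge
    All,    // every replica must acknowledge; slowest, most durable
}

fn required_acks(consistency: WriteConsistency, replication_factor: usize) -> usize {
    match consistency {
        WriteConsistency::One => 1,
        WriteConsistency::Quorum => replication_factor / 2 + 1,
        WriteConsistency::All => replication_factor,
    }
}

fn main() {
    // With a replication factor of 1 there is only one copy of the data,
    // so no consistency level can save it once that node dies.
    for rf in [1, 3] {
        println!(
            "rf={rf}: ONE needs {}, QUORUM needs {}, ALL needs {}",
            required_acks(WriteConsistency::One, rf),
            required_acks(WriteConsistency::Quorum, rf),
            required_acks(WriteConsistency::All, rf),
        );
    }
}
```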

Patrick’s Diary of Lessons Learned, Part III

Further lessons include:

  1. Backups are vital in a distributed environment, and there’s no such thing as setting up backup routines “too soon.” Systems fail, and backups are there to restore as much of the important data as possible.

  2. Every database system has a certain consistency model, and it’s crucial to take that into account when designing your project. There might be compromises to make. In some use cases (think financial systems), consistency is key. In others, eventual consistency is acceptable, as long as it keeps the system highly available and responsive.

The Spike Strikes Again

Months went by and Patrick’s sleeping schedule was even beginning to show signs of stabilization. With regular backups, a redesigned consistency model, and a reminder set in his calendar for March 16th to scale up the cluster to manage elevated traffic, he felt moderately safe.

If only he knew that a ten-second video of a cat dressed as a leprechaun had just gone viral in Malaysia… which, taking time zone into account, happened around 2am Patrick’s time, ruining the aforementioned sleep stabilization efforts.

On the one hand, the observability suite did its job and set off a warning early, allowing for a rapid response. On the other hand, even though Patrick reacted on time, databases are seldom able to scale instantaneously, and his system of choice was no exception in that regard. The spike in concurrency was very high and concentrated, as thousands of Malaysian teenagers rushed to bulk-buy green hats in pursuit of ever-changing Internet trends. Patrick got to observe a real-life instantiation of Little’s Law, which he vaguely remembered from his university days. With a beautifully concise formula, L = λW, the law boils down to the fact that concurrency equals throughput times latency.

Tip

For those having trouble remembering the formula, think in units. Concurrency is just a number, latency can be measured in seconds, and throughput is usually expressed in 1/s. Then it stands to reason that, in order for the units to match, concurrency should be obtained by multiplying latency (seconds) by throughput (1/s). You’re welcome!

Throughput depends on the hardware and naturally has its limits (e.g., you can’t expect an NVMe drive purchased in 2023 to serve data in terabytes per second, although we are crossing our fingers for this assumption to be invalidated in the near future!). Once the limit is hit, you can treat it as a constant in the formula. It’s then clear that as concurrency rises, so does latency. For the end users—Malaysian teenagers in this scenario—it means that latency will eventually cross the magic barrier of average human perception, a few seconds. Once that happens, users get too frustrated and simply give up altogether, assuming that the system is broken beyond repair. It’s easy to find online articles quoting that “Amazon found that 100ms of latency costs them 1 percent in sales”; although that sounds overly simplified, it is also true enough.
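A quick worked example of the law, using illustrative numbers rather than measurements from Patrick’s cluster: at a fixed throughput, concurrency and latency are tied together, so piling on in-flight requests only stretches the latency.

```rust
fn main() {
    // Little's Law, L = λW: concurrency = throughput × latency.
    // Illustrative numbers, not measurements from Patrick's cluster.
    let throughput_rps = 2_000.0; // λ: requests served per second
    let latency_s = 0.05;         // W: average latency in seconds (50 ms)
    let concurrency = throughput_rps * latency_s;
    println!("steady state keeps {concurrency} requests in flight");

    // Rearranged as W = L / λ: once throughput is pinned at its hardware
    // limit, pushing more concurrent requests only inflates latency.
    for in_flight in [100.0, 1_000.0, 10_000.0] {
        let latency = in_flight / throughput_rps;
        println!("{in_flight} in flight -> average latency {latency} s");
    }
}
```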

Patrick’s Diary of Lessons Learned, Part IV

The lessons continue…:

  1. Unexpected spikes are inevitable, and scaling out the cluster might not be swift enough to mitigate the negative effects of excessive concurrency. Expecting the database to handle it properly is not without merit, but not every database is capable of that. If possible, limit the concurrency in your system as early as possible. For instance, if the database is never touched directly by customers (which is a very good idea for multiple reasons) but is instead accessed through a set of microservices under your control, make sure that those microservices are also aware of the concurrency limits and adhere to them (see the sketch after this list).

  2. Keep in mind that Little’s Law exists—it’s fundamental knowledge for anyone interested in distributed systems. Quoting it often also makes you appear exceptionally smart among peers.
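As a minimal sketch of the first lesson (standard library only, no real driver or framework), a microservice could put a simple counting gate in front of its database calls so that at most a fixed number of them are in flight at once:

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;
use std::time::Duration;

/// A simple counting gate: at most `limit` database calls in flight at once.
struct Gate {
    in_flight: Mutex<usize>,
    freed: Condvar,
    limit: usize,
}

impl Gate {
    fn new(limit: usize) -> Self {
        Gate { in_flight: Mutex::new(0), freed: Condvar::new(), limit }
    }

    /// Block until a slot is free, then claim it.
    fn acquire(&self) {
        let mut n = self.in_flight.lock().unwrap();
        while *n >= self.limit {
            n = self.freed.wait(n).unwrap();
        }
        *n += 1;
    }

    /// Release the slot once the database has actually responded.
    fn release(&self) {
        *self.in_flight.lock().unwrap() -= 1;
        self.freed.notify_one();
    }
}

fn main() {
    let gate = Arc::new(Gate::new(8)); // at most 8 concurrent "queries"
    let mut handles = Vec::new();
    for i in 0..32 {
        let gate = Arc::clone(&gate);
        handles.push(thread::spawn(move || {
            gate.acquire();
            // Placeholder for the real database call.
            thread::sleep(Duration::from_millis(10));
            println!("request {i} finished");
            gate.release();
        }));
    }
    for handle in handles {
        handle.join().unwrap();
    }
}
```

In a real async service this would more likely be a semaphore provided by the runtime, but the principle is the same: the permit is released only when the database has actually answered, not when the client merely gives up waiting.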

Backup Strikes Back

After redesigning his project yet again to take expected and unexpected concurrency fluctuations into account, Patrick happily waited for his fedora business to finally become ramen profitable.

Unfortunately, the next March 17th didn’t go as smoothly as expected either. Patrick spent most of the day enjoying steady Grafana dashboards, which kept assuring him that the traffic was under control and the cluster was handling the customer load with a healthy safety margin. But then the dashboards stopped updating, kindly mentioning that the disks had become severely overutilized. This seemed completely out of place given the observed concurrency. While looking for the possible source of this anomaly, Patrick noticed, to his horror, that the scheduled backup procedure coincided with the annual peak load…

Patrick’s Diary of Lessons Learned, Part V

Concluding thoughts:

  1. Database systems are hardly ever idle, even without incoming user requests. Maintenance operations often happen, and you must take them into consideration because they’re an internal source of concurrency and resource consumption.

  2. Whenever possible, schedule maintenance operations for times with expected low pressure on the system.

  3. If your database management system supports any kind of quality-of-service configuration, it’s a good idea to investigate such capabilities. For instance, it might be possible to give user requests strong priority over regular maintenance operations, especially during peak hours. Conversely, periods of low user-induced activity can be utilized to speed up background activities. In the database world, systems that use a variant of LSM trees for the underlying storage need to perform quite a few compactions (a kind of maintenance operation on data) in order to keep read/write performance predictable and steady.

The end.

Summary

Meeting database performance expectations can sometimes seem like a never-ending pain. As soon as you diagnose and address one problem, another is likely lurking right behind it. The next chapter helps you anticipate the challenges and opportunities you are most likely to face given your technical requirements and business expectations.