As mentioned in Chapter 5, database servers are often combined into intricate topologies: some nodes are grouped in a single geographical location, others serve only as a fast cache layer, and still others store seldom-accessed cold data cheaply, for emergency purposes only. That chapter covered how drivers understand and interact with that topology in order to exchange information more efficiently.

This chapter focuses on the topology itself. How is data replicated across geographies and datacenters? What are the risks and alternatives to taking the common NoSQL practice of scaling out to the extreme? And what about intermediaries to your database servers—for example, external caches, load balancers, and abstraction layers? The performance implications of all this and more are covered here.Footnote 1

Replication Strategy

First, let’s look at replication: how your data is copied to replicas across your cluster.

Note

If you want a quick introduction to the concept of replication, see Appendix A.

Having more replicas will slow your writes (since every write must be duplicated to every replica), but it can accelerate your reads (since more replicas are available to serve the same dataset). It also lets you maintain operations and avoid data loss in the event of node failures. Additionally, replicating data closer to your application and your users will reduce latency, especially if your application has a highly geographically distributed user base.

A replication factor (RF) of 1 means there is only one copy of a row in a cluster, and there is no way to recover the data if the node is compromised or goes down (other than restoring from a backup). An RF of 2 means that there are two copies of a row in a cluster. Most systems use an RF of at least 3. This allows you to write and read with strong consistency because a quorum of replicas (two out of three) can still be reached even if one node is down.

Many databases also let you fine-tune replication settings at the regional level. For example, you could have three replicas in a heavily used region, but only two in a less popular region.
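
To make this concrete, here is a minimal sketch of how per-region replication settings might be expressed for a Cassandra-compatible database through the Python driver. The keyspace name, datacenter names, contact point, and replica counts are illustrative assumptions, not a prescription:

    # Sketch: define per-region replication for a keyspace (names are hypothetical).
    from cassandra.cluster import Cluster

    session = Cluster(["10.0.0.1"]).connect()
    session.execute("""
        CREATE KEYSPACE IF NOT EXISTS orders
        WITH replication = {
            'class': 'NetworkTopologyStrategy',
            'us-east': 3,
            'eu-west': 2
        }
    """)  # three replicas in the heavily used region, two in the smaller one

Check your own database’s documentation for the equivalent per-region (or per-datacenter) replication settings.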

Note that replicating data across multiple regions (as Bigtable recommends as a safeguard against both availability zone failure and regional failure) can be expensive. Before you set this up, understand the cost of replicating data between regions.

If you’re working with DynamoDB, you create tables (not clusters), and AWS manages the replication for you as soon as you designate a table as a global table. One notable drawback of DynamoDB global tables is that transactions are not supported across regions, which may be a limiting factor for some use cases.
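
If you’re curious what that looks like in practice, here is a hedged sketch of adding a replica region to an existing DynamoDB table using boto3. The table name and regions are hypothetical, and the table must already meet the global tables prerequisites (for example, streams enabled), so treat this as an illustration rather than a complete recipe:

    # Sketch: add a replica region to an existing DynamoDB table (names are hypothetical).
    import boto3

    client = boto3.client("dynamodb", region_name="us-east-1")
    client.update_table(
        TableName="orders",
        ReplicaUpdates=[{"Create": {"RegionName": "eu-west-1"}}],
    )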

Rack Configuration

If all your nodes are in the same datacenter, how do you configure their placement? The rule of thumb here is to have as many racks as you have replicas. For example, if you have a replication factor of three, spread the replicas across three racks. That way, even if an entire rack goes down, you can still satisfy read and write requests to a majority of your replicas. Performance might degrade a bit since you have lost roughly 33 percent of your infrastructure (assuming a total zone/rack outage), but overall you’ll still be up and running. By contrast, if you have three replicas distributed across two racks, losing a rack may affect two of the three natural endpoints for part of your data. That’s a showstopper if your use case requires strongly consistent reads and writes.
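
For Cassandra-compatible databases such as ScyllaDB, rack and datacenter placement is typically declared per node in a snitch configuration file such as cassandra-rackdc.properties (file location and snitch choice vary by database and version, so verify against your own documentation). A minimal sketch for one node in the first rack might look like this:

    # cassandra-rackdc.properties on one node (values are illustrative)
    dc=us-east
    rack=rack1

The other nodes would declare rack2 and rack3, so that the database can place the three replicas of each row in three different racks.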

Multi-Region or Global Replication

By placing your database servers close to your users, you lower the network latency. You can also improve availability and insulate your business from regional outages.

If you do have multiple datacenters, ensure that—unless otherwise required by the business—reads and writes use a consistency level that is confined to replicas within a specific datacenter. Instructing the database to select only local replicas (in the same region) to achieve your required consistency level avoids a cross-region latency hit. Also, ensure that each application client knows which datacenter is considered its local one; it should prioritize that datacenter for connections and requests, although it may also have a fallback strategy in case that datacenter goes down.
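
As an illustration, here is a sketch of configuring datacenter-local routing and consistency with the Python driver for a Cassandra-compatible database. The contact points, datacenter name, and keyspace are placeholders:

    # Sketch: pin requests to the local datacenter and use a local consistency level.
    from cassandra import ConsistencyLevel
    from cassandra.cluster import Cluster, ExecutionProfile, EXEC_PROFILE_DEFAULT
    from cassandra.policies import DCAwareRoundRobinPolicy, TokenAwarePolicy

    profile = ExecutionProfile(
        load_balancing_policy=TokenAwarePolicy(
            DCAwareRoundRobinPolicy(local_dc="us-east")),   # prefer local replicas
        consistency_level=ConsistencyLevel.LOCAL_QUORUM,    # quorum within the local DC only
    )
    cluster = Cluster(["10.0.0.1", "10.0.0.2"],
                      execution_profiles={EXEC_PROFILE_DEFAULT: profile})
    session = cluster.connect("orders")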

Note that application clients may or may not be aware of the multi-datacenter deployment, and it is up to the application developer to decide whether clients should fall back to another region. Although various database drivers offer settings and load balancing profiles for this, failing over to a different region in the event of a local failure can easily break application semantics. As a result, the application’s reaction to such a failure must be designed and handled directly by the application developer.

Multi-Availability Zones vs. Multi-Region

To mitigate a possible server or rack failure, cloud vendors offer (and recommend) a multi-zone deployment. Think of it as having a datacenter at your fingertips where you can deploy each server instance in its own rack, with its own power, top-of-rack switch, and cooling system. Such a deployment will survive any single-server or zonal failure, since each rack is self-contained. The availability zones are still located in the same region, but a failure in one zone won’t affect instances deployed in another zone.

For example, on Google Compute Engine, the us-east1-b, us-east1-c, and us-east1-d availability zones are located in the us-east1 region (Moncks Corner, South Carolina, USA). But each availability zone is self-contained. Network latency between AZs in the same region is negligible for the purpose of this discussion.

In short, multi-zone and multi-region deployments help with business continuity and disaster recovery, respectively, but multi-region has the additional benefit of minimizing latency for users in those regions. It may come at a cost, though: cross-region data replication costs need to be considered for multi-regional topologies.

Note that multi-zone deployments will similarly charge you for inter-zone replication. Although it is perfectly possible to run your database in a single zone, doing so is generally not recommended because it effectively becomes a single point of failure in your infrastructure. The choice here is quite simple: Do you want to reduce costs as much as possible and risk potential unavailability, or do you want to guarantee high availability in a single region at the expense of network replication costs?

Scaling Up vs. Scaling Out

Is it better to have a larger number of smaller (read, “less powerful”) nodes or a smaller number of larger nodes? We recommend aiming for the most powerful nodes and smallest clusters that meet your high availability and resiliency goals—but only if your database can truly take advantage of the power added by the larger nodes.

Let’s unravel that a bit. For over a decade, NoSQL’s promise has been enabling massive horizontal scalability with relatively inexpensive commodity hardware. This has allowed organizations to deploy architectures that would have been prohibitively expensive and impossible to scale using traditional relational database systems.

Over that same decade, “commodity hardware” has also undergone a transformation. But not all databases take advantage of modern computing resources. Many aren’t architected to exploit the resources offered by large nodes, such as the added CPU, memory, and solid-state drives (SSDs), nor can they store large amounts of data on disk efficiently. Managed runtimes, like Java, are further constrained by heap size. Multi-threaded code, with its locking and context-switching overhead and its inattention to Non-Uniform Memory Access (NUMA), imposes a significant performance penalty on modern hardware architectures.

If your database is in this group, you might find that scaling up quickly brings you to a point of diminishing returns. But even then, it’s best to max out your vertical scaling potential before you shift to horizontal scaling.

A focus on horizontal scaling results in system sprawl, which equates to operational overhead: a far larger footprint to manage and secure. Server sprawl also introduces more network overhead to distributed systems due to the constant replication and health checks performed by every single node in your cluster. Although most vendors claim that scaling out will bring you linear performance, others are more conservative and promise “near-linear” performance. For example, the Cassandra production guidelinesFootnote 2 recommend against clusters larger than 50 nodes using the default of 16 vNodes per instance, because this may result in decreased availability.

Moreover, there are quite a few advantages to using large, powerful nodes.

  • Less noisy neighbors: On cloud platforms, multi-tenancy is the norm. A cloud platform is, by definition, based on shared network bandwidth, I/O, memory, storage, and so on. As a result, a deployment of many small nodes is susceptible to the “noisy neighbor” effect. This effect is experienced when one application or virtual machine consumes more than its fair share of available resources. As nodes increase in size, fewer and fewer resources are shared among tenants. In fact, beyond a certain size, your applications are likely to be the only tenant on the physical machines on which your system is deployed.

  • Fewer failures: Since individual large and small nodes fail at roughly the same rate, a cluster of a few large nodes delivers a higher mean time between failures (MTBF) than a cluster of many small nodes. Failures in the data layer require operator intervention, and restoring a large node requires about the same amount of human effort as restoring a small one. In a cluster of a thousand nodes, you’ll likely see failures every day, and this magnifies administrative costs.

  • Datacenter density: Many organizations with on-premises datacenters are seeking to increase density by consolidating servers into fewer, larger boxes with more computing resources per server. Small clusters of large nodes help this process by efficiently consuming denser resources, in turn decreasing energy and operating costs.

  • Operational simplicity: Big clusters of small instances demand more attention, and generate more alerts, than small clusters of large instances. All of those small nodes multiply the effort of real-time monitoring and periodic maintenance, such as rolling upgrades.

Some architects are concerned that putting more data on fewer nodes increases the risks associated with outages and data loss. You can think of this as the “big basket” problem. It may seem intuitive that storing all of your data on a few large nodes makes them more vulnerable to outages, like putting all of your eggs in one basket. But this doesn’t necessarily hold true. Modern databases use a number of techniques to ensure availability while also accelerating recovery from failures, making big nodes both safer and more economical. For example, consider capabilities that reduce the time required to add and replace nodes and internal load balancing mechanisms to minimize the throughput or latency impact across database restarts.Footnote 3

Workload Isolation

Many teams find themselves in a position where they need to run multiple different workloads against the database. It is often compelling to consolidate different workloads under a single cluster, especially when they need to work on the exact same dataset. Keeping several workloads together under a single cluster can also reduce costs. But it’s essential to avoid resource contention when implementing latency-critical workloads. Failure to do so may introduce hard-to-diagnose performance problems, where one misbehaving workload ends up dragging down the entire cluster’s performance.

There are many ways to accomplish workload isolation to minimize the resource contention that could occur when running multiple workloads on a single cluster. Here are a few that work well. Keep in mind that the best approach depends on your existing database’s available options, as well as your use case’s requirements:

  • Physical isolation: This setup is often used to entirely isolate one workload from another. It involves essentially extending your deployment to an additional region (which may be physically the same as your existing one, but logically different on the database side). As a result, the workloads are split to replicate data to another location, but queries are executed only within a particular location—in such a way that a performance bottleneck in one workload won’t degrade or bottleneck the other. Note that a downside of this solution is that your infrastructure costs double.

  • Logical isolation: Some databases or deployment options allow you to logically isolate workloads without increasing your infrastructure resources. For example, ScyllaDB has a workload prioritization feature where you can assign different weights to specific workloads to help the database understand which workload you want it to prioritize in the event of system contention. If your database does not offer such a feature, you may still be able to run two or more workloads in parallel, but watch out for potential contention in your database.

  • Scheduled isolation: Many times, you might need to simply run batched scheduled jobs at specified intervals in order to support other business-related activities, such as extracting analytics reports. In those cases, consider running the workload in question at low-peak periods (if any exist), and experiment with different concurrency settings in order to avoid impairing the latency of the primary workload that’s running alongside it.

More on Workload Prioritization for Logical Isolation

ScyllaDB users sometimes use workload prioritization to balance OLAP and OLTP workloads. The goal is to ensure that each defined task gets a fair share of system resources, so that no single job monopolizes them and starves other jobs of the minimum they need to keep operating.
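
In ScyllaDB Enterprise, this is typically done by creating service levels with different shares and attaching them to the roles each workload authenticates as. The following is a hedged sketch; the role names and share values are made up, and the exact CQL syntax should be confirmed against the ScyllaDB documentation for your version:

    # Sketch: prioritize OLTP over OLAP via ScyllaDB service levels (illustrative values).
    from cassandra.cluster import Cluster

    session = Cluster(["10.0.0.1"]).connect()
    session.execute("CREATE SERVICE LEVEL IF NOT EXISTS oltp WITH shares = 1000")
    session.execute("CREATE SERVICE LEVEL IF NOT EXISTS olap WITH shares = 200")
    session.execute("ATTACH SERVICE LEVEL oltp TO oltp_role")
    session.execute("ATTACH SERVICE LEVEL olap TO olap_role")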

In Figure 8-1, note that the latency for both workloads nearly converges. OLTP processing began at or below 2ms P99 latency up until the OLAP job began at 12:15. When the OLAP workload kicked in, OLTP P99 latencies shot up to 8ms, then further degraded, plateauing around 11–12ms until the OLAP job terminated after 12:26.

Figure 8-1

Latency between OLTP and OLAP workloads on the same cluster before enabling workload prioritization

These latencies are approximately six times greater than when OLTP ran by itself. (OLAP latencies hover between 12–14ms, but, again, OLAP is not latency-sensitive).

Figure 8-2 shows that the throughput on OLTP sinks from around 60,000 OPS to half that: 30,000 OPS. The reason is clear: OLAP, being throughput-hungry, is maintaining roughly 260,000 OPS.

Figure 8-2

Comparative throughput results for OLTP and OLAP on the same cluster without workload prioritization enabled

Ultimately, OLTP suffers with respect to both latency and throughput, and users experience slower response times. In many real-world conditions, such OLTP responses would violate a customer’s SLA.

Figure 8-3 shows the latencies after workload prioritization is enabled. You can see that the OLTP workload similarly starts out at sub-millisecond to 2ms P99 latencies. Once an OLAP workload is added, OLTP processing performance degrades, but with P99 latencies hovering between 4–7ms (about half of the 11–12ms P99 latencies when workload prioritization was not enabled).

Figure 8-3

OLTP and OLAP latencies with workload prioritization enabled

It is important to note that once system contention kicks in, the OLTP latencies are still somewhat impacted—just not to the same extent they were prior to workload prioritization. If your real-time workload requires consistently low single-digit millisecond (or lower) P99 latencies, then we strongly recommend that you avoid introducing any form of contention.

The OLAP workload, not being as latency-sensitive, has P99 latencies that hover between 25–65ms. These are much higher latencies than before—the tradeoff for keeping the OLTP latencies lower.

Throughput wise, Figure 8-4 shows that the OLTP traffic is a smooth 60,000 OPS until the OLAP load is also enabled.

Figure 8-4

OLTP and OLAP load throughput with workload prioritization enabled

It does dip in performance at that point, but only slightly, hovering between 54,000 and 58,000 OPS. That is only a 3–10 percent drop in throughput. The OLAP workload, for its part, hovers between 215,000 and 250,000 OPS. That is a drop of 4–18 percent, which means an OLAP workload would take longer to complete. Both workloads suffer degradation, as would be expected for an overloaded cluster, but neither to a crippling degree.

Abstraction Layers

It’s becoming fairly common for teams to write an abstraction layer on top of their databases. Instead of calling the database’s APIs directly, the applications connect to this database-agnostic abstraction layer, which then manages the logistics of connecting to the database.

There are usually a few main motives behind this move:

  • Portability: If the team wants to move to another database, they won’t need to modify their applications and queries. However, the team responsible for the abstraction layer will need to modify that code, which could turn out to be more complicated.

  • Developer simplicity: Developers don’t need to worry about the inner details of working with any particular database. This can make it easier for people to move around from team to team.

  • Scalability: An abstraction layer can be easier to containerize. If the API gets overloaded, it’s usually easier to scale out more containers of the abstraction layer in Kubernetes than to spin up more instances of the database itself.

  • Customer-facing APIs: Exposing the database directly to end-users is typically not a good idea, so many companies expose common endpoints that are eventually translated into actual database queries. The abstraction layer can then shed requests, limit concurrency across tenants, and provide auditability and accountability over its calls.

But there’s definitely potential for a performance penalty, and it is highly dependent on how efficiently the layer was implemented. An abstraction layer that was fastidiously implemented by a team of masterful Rust engineers is likely to have a much more negligible impact than a Java or Python one cobbled together as a quick side project. If you decide to take this route, be sure that the layer is developed with performance in mind, and that you carefully measure its impact via both benchmarking and ongoing monitoring. Remember that every application-to-database interaction is going to pass through this layer, so a small inefficiency can quickly snowball into a significant performance problem.
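
If you do build such a layer, instrument it from day one so its overhead is visible. Below is a minimal sketch of a database-agnostic access layer that times each call; the UserStore interface, schema, and cassandra-driver backend are assumptions for illustration, not a recommended design:

    # Sketch: a thin abstraction layer that measures its own per-call latency.
    import time
    from cassandra.cluster import Cluster

    class UserStore:
        def __init__(self, contact_points, keyspace):
            self._session = Cluster(contact_points).connect(keyspace)
            self._get_stmt = self._session.prepare(
                "SELECT name, email FROM users WHERE id = ?")

        def get_user(self, user_id):
            start = time.perf_counter()
            row = self._session.execute(self._get_stmt, [user_id]).one()
            elapsed_ms = (time.perf_counter() - start) * 1000
            # In a real layer, export this measurement to your metrics system so the
            # layer's overhead can be compared against the database's own latencies.
            print(f"get_user took {elapsed_ms:.2f} ms")
            return row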

For example, we once saw a customer report an elevated latency situation coming from their Golang abstraction layer. Once we realized that the latency on the database side was within bounds for its use case, the investigation shifted from the database over to the network and the client side. Long story short, the application latency spikes coincided with the abstraction layer’s garbage collection cycles, which dragged down client-side performance significantly. The problem was resolved by scaling out the number of clients and ensuring that they had enough compute resources to function properly.

And another example: When working with a customer through a PostgreSQL-to-NoSQL migration, we realized that their clients were often opening far too many concurrent connections against the database. Although keeping a good number of open sockets is typically desirable for distributed systems, an extremely high number of them can easily overwhelm the client side (which needs to keep track of all open sockets) as well as the database. After we reported our findings to the customer, they discovered that they were opening a new database session for every request they submitted against the cluster. After the code was corrected to reuse sessions, the overall application throughput increased significantly because the abstraction layer was routing requests over already-open, active sockets.Footnote 4
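
A hedged sketch of that anti-pattern and the fix (cluster addresses, keyspace, and query are hypothetical):

    # Anti-pattern: a new session per request opens a fresh set of connections each time.
    #
    #     def get_order(order_id):
    #         session = Cluster(["10.0.0.1"]).connect("shop")
    #         return session.execute(
    #             "SELECT * FROM orders WHERE id = %s", [order_id]).one()
    #
    # Fix: create the cluster and session once, then reuse the pooled connections.
    from cassandra.cluster import Cluster

    CLUSTER = Cluster(["10.0.0.1", "10.0.0.2"])
    SESSION = CLUSTER.connect("shop")   # shared, long-lived session

    def get_order(order_id):
        return SESSION.execute(
            "SELECT * FROM orders WHERE id = %s", [order_id]).one()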

Load Balancing

Should you put a dedicated load balancer in front of your database? In most cases, no. Databases typically have their own way to balance traffic across the cluster, so layering a load balancer on top of that won’t help—and it could actually hurt. Consider 1) how many requests the load balancer can serve without becoming a bottleneck and 2) what its balancing policy is. Also recognize that it introduces a single point of failure that reduces your database’s resilience. You end up overcomplicating your overall infrastructure topology because you now need to worry about the load balancer’s own high availability.

Of course, there are always exceptions. For example, say you were previously using a database API that’s unaware of the layout of the cluster and its individual nodes (e.g., DynamoDB, where a client is configured with a single “endpoint address” and all requests are sent to it). Now you’re shifting to a distributed leaderless database like ScyllaDB, where clients are node aware and even token aware (aware of which token ranges are natural endpoints for every node in your topology). If you simply configure an application with the IP address of a single ScyllaDB node as its single DynamoDB API endpoint address, the application will work correctly. After all, any node can answer any request by forwarding it to other nodes as necessary. However, this single node will be more loaded than the other nodes because it will be the only node actively serving requests. This node will also become a single point of failure from your application’s perspective.

In this special edge case, load balancing is critical—but load balancers are not. Server-side load balancing is fairly complex from an admin perspective. More importantly with respect to performance, server-side solutions add latency. Solutions that involve a TCP or HTTP load balancer require another hop for each request, increasing not just the cost of each request but also its latency. We recommend client-side load balancing: modifying the application to spread requests across all available nodes rather than sending them to a single one.
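
As a rough illustration of what client-side balancing can look like against a DynamoDB-compatible API (such as ScyllaDB Alternator), the sketch below simply round-robins requests across a list of node endpoints. The addresses, credentials, and region value are placeholders, and purpose-built client libraries exist for this, so treat it only as a sketch of the idea:

    # Sketch: naive client-side round-robin over DynamoDB-API node endpoints.
    import itertools
    import boto3

    NODE_ENDPOINTS = [
        "http://10.0.0.1:8000",
        "http://10.0.0.2:8000",
        "http://10.0.0.3:8000",
    ]

    _clients = itertools.cycle([
        boto3.resource(
            "dynamodb",
            endpoint_url=url,
            region_name="us-east-1",             # placeholder; not meaningful here
            aws_access_key_id="alternator",      # placeholder credentials
            aws_secret_access_key="secret",
        )
        for url in NODE_ENDPOINTS
    ])

    def get_item(table_name, key):
        # Each call goes to the next node, spreading load across the cluster.
        return next(_clients).Table(table_name).get_item(Key=key).get("Item")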

The key takeaway here is that load balancing generally isn’t needed—and when it is, server-side load balancers yield a pretty severe performance penalty. If it’s absolutely necessary, client-side load balancing is likely a better option.Footnote 5

External Caches

Teams often consider external caches when the existing database cluster cannot meet the required SLA. This is a clear performance-oriented decision. An external cache is commonly placed in front of the database to compensate for subpar latency stemming from the various factors discussed throughout this book: inefficient database internals, driver usage, infrastructure choices, traffic spikes, and so on.

Caching may seem like a fast and easy solution because the deployment can be implemented without tremendous hassle and without incurring the significant cost of database scaling, database schema redesign, or even a deeper technology transformation. However, external caches are not as simple as they are often made out to be. In fact, they can be one of the more problematic components of a distributed application architecture.

In some cases, it’s a necessary evil—particularly if you have an ultra-latency-sensitive use case such as real-time ad bidding or streaming media, and you’ve tried all the other means of reducing latency. But in many cases, the performance boost isn’t worth it. The following sections outline some key risks so that you can decide what makes sense for your use case and SLAs.

An External Cache Adds Latency

A separate cache means another hop on the way. When a cache sits in front of the database, the first access occurs at the cache layer. If the data isn’t in the cache, the request is sent to the database. The result is additional latency on an already slow path of uncached data. One may claim that when the entire dataset fits in the cache, the additional latency doesn’t come into play. However, there is usually more than a single workload or access pattern hitting the database, and some of them will carry the extra hop cost.
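
For instance, a typical cache-aside read path looks like the following sketch (Redis, the table, and the key layout are illustrative assumptions). Note that a cache miss pays for the cache lookup and then the database read:

    # Sketch: cache-aside reads; a miss costs an extra network hop.
    import json
    import redis
    from cassandra.cluster import Cluster

    cache = redis.Redis(host="cache.internal", port=6379)
    session = Cluster(["10.0.0.1"]).connect("shop")

    def get_user(user_id):
        cached = cache.get(f"user:{user_id}")                  # hop 1: external cache
        if cached is not None:
            return json.loads(cached)
        row = session.execute(                                 # hop 2: database (on a miss)
            "SELECT name, email FROM users WHERE id = %s", [user_id]).one()
        doc = {"name": row.name, "email": row.email}
        cache.set(f"user:{user_id}", json.dumps(doc), ex=300)  # TTL; may serve stale data
        return doc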

An External Cache Is an Additional Cost

Caching means expensive DRAM, which translates to a higher cost per gigabyte than SSDs. Even when RAM can hold the frequently accessed objects, it is better to use (and, if needed, increase) the database’s existing RAM for internal caching than to provision entirely separate infrastructure on RAM-oriented instances. Provisioning a cache to be the same size as the entire persistent dataset may be prohibitively expensive. In other cases, the working set size can be too big, often reaching petabytes, making an SSD-friendly implementation the preferred, and cheaper, option.

External Caching Decreases Availability

No cache’s high availability solution can match that of the database itself. Modern distributed databases have multiple replicas; they also are topology-aware and speed-aware and can sustain multiple failures without data loss.

For example, a common replication pattern is three local replicas, which generally allows for reads to be balanced across such replicas in order to efficiently use your database’s internal caching mechanism. Consider a nine-node cluster with a replication factor of three: Essentially every node will hold roughly 33 percent of your total dataset size. As requests are balanced among different replicas, this grants you more room for caching your data, which could (potentially) completely eliminate the need for an external cache. Conversely, if an external cache happens to invalidate entries right before a surge of cold requests, availability could be impeded for a while since the database won’t have that data in its internal cache (more on this in the section entitled “External Caching Ruins the Database Caching” later in this chapter).

Caches often lack high-availability properties and can easily fail or invalidate records depending on their heuristics. Partial failures, which are more common, are even worse in terms of consistency. When the cache inevitably fails, the database gets hit by the unmitigated firehose of queries, which will likely wreck your SLAs. In addition, even if a cache itself has some high availability features, it can’t coordinate handling such a failure with the persistent database it sits in front of. The bottom line: Rely on the database, rather than making your latency SLAs dependent on a cache.

Application Complexity: Your Application Needs to Handle More Cases

Application and operational complexity are problems for external caches. Once you have an external cache, you need to keep the cache up to date with the client and the database. For instance, if your database runs repairs, the cache needs to be synced or invalidated. However, invalidating the cache may introduce a long period during which you need to wait for it to warm up again. Your client retry and timeout policies need to match the properties of the cache, but they also need to function when the cache is down. Such scenarios are usually hard to implement and test.

External Caching Ruins the Database Caching

Modern databases have embedded caches and complex policies to manage them. When you place a cache in front of the database, most read requests reach only the external cache and the database won’t keep these objects in its memory. As a result, the database cache is rendered ineffective: when requests eventually reach the database, its cache will be cold and the responses will come primarily from the disk. The round trip from the cache to the database and then back to the application is then likely to incur additional latency.

External Caching Might Increase Security Risks

An external cache adds a whole new attack surface to your infrastructure. Encryption, isolation, and access control on data placed in the cache are likely to be different from the ones at the database layer itself.

External Caching Ignores the Database Knowledge and Database Resources

Databases are quite complex and are built around the specialized I/O workloads running on the system. Many queries access the same data, and some portion of the working set can be cached in memory in order to save disk accesses. A good database has sophisticated logic to decide which objects, indexes, and accesses it should cache.

The database also should have various eviction policies (such as the least recently used [LRU] policy as a straightforward example) that determine when new data should replace existing (older) cached objects.
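
As a toy illustration of the LRU idea (and only that; real database caches are far more sophisticated), an LRU cache keeps entries ordered by recency and evicts the oldest entry when it runs out of room:

    # Sketch: a minimal LRU cache; illustrative only.
    from collections import OrderedDict

    class LRUCache:
        def __init__(self, capacity):
            self.capacity = capacity
            self._items = OrderedDict()

        def get(self, key):
            if key not in self._items:
                return None
            self._items.move_to_end(key)          # mark as most recently used
            return self._items[key]

        def put(self, key, value):
            self._items[key] = value
            self._items.move_to_end(key)
            if len(self._items) > self.capacity:
                self._items.popitem(last=False)   # evict the least recently used entry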

Another example is scan-resistant caching. When scanning a large dataset, say a large range or a full-table scan, a lot of objects are read from the disk. The database can recognize this is a scan (not a regular query) and choose to leave these objects outside its internal cache. However, an external cache would treat the result set just like any other and attempt to cache the results. The database also automatically synchronizes the content of its cache with the disk according to the incoming request rate, so neither the user nor the developer needs to do anything to make sure that lookups of recently written data are performant. Therefore, if, for some reason, your database doesn’t respond fast enough, it likely means one of the following:

  • The cache is misconfigured

  • It doesn’t have enough RAM for caching

  • The working set size and request pattern don’t fit the cache

  • The database cache implementation is poor

Summary

This chapter shared strong opinions on how to navigate topology decisions. For example, we recommended:

  • Using an RF of at least 3 (with geographical fine-tuning if available)

  • Having as many racks as replicas

  • Isolating reads and writes within a specific datacenter

  • Ensuring each client knows and prioritizes the local datacenter

  • Considering the (cross-region replication) costs of multi-region deployments as well as their benefits

  • Scaling up as much as possible before scaling out

  • Considering a few different options to minimize the resource contention that could occur when running multiple workloads on a single cluster

  • Carefully considering the caveats associated with external caches, load balancers, and abstraction layers

The next chapter looks at best practices for testing your topology: Benchmarking it to see what it’s capable of and how it compares to alternative configurations and solutions.