We propose the in-place RSM architecture, which is optimized for replicating small states. As it does not rely on a command log, it avoids the complex state management outlined in Section 3.2.
Although the general structure of the in-place architecture is similar to the log-based approach, the use of consensus is fundamentally different. Instead of agreeing on a sequence of commands for specific slots in a log, an in-place RSM agrees directly on the sequence of state machine states. Thus, it especially shines when the replicated state is small enough to be repeatedly transmitted over the network (i.e., the data management is latency-bound rather than bandwidth-bound) and is combined with a storage medium that allows fast, byte-addressable access, such as NVRAM (see Section 4.3). For example, in-place RSMs can be used for basic primitives such as counters, sets, or locks, or for small system-wide metadata such as current process group memberships or leaderships, without the complexity of managing a log. Furthermore, the small storage and protocol overhead of an in-place RSM makes it easy to deploy many parallel, independent RSM instances that share the same physical machines. As we will discuss in more detail in Section 5, this property is beneficial, for example, for replicating each key-value pair of a key-value store in a separate RSM instance.
Overview
The high-level design of an in-place RSM, depicted in Fig. 2, resembles its log-based counterpart. Initially, a client submits a command to any replica (step 1). The replica’s consensus module then proposes a new state as the next state of the RSM: it first fetches the replica’s current state (step 2), applies the received command to the state, and then proposes the result by sending messages to the other replicas. If no concurrent proposal of another replica exists, this proposal will eventually be accepted as the next state of the RSM (step 3). Every replica that learns the new state can immediately overwrite its current local state (step 4). The local state must be stored on a persistent medium if replicas need the ability to recover from crashes. Finally, the client is notified that its command succeeded (step 5).
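The following minimal sketch condenses this request path from the replica’s point of view. The interfaces are our assumptions (a propose call on the consensus module and a read/write state API on the storage layer), not the paper’s implementation; the retry that is needed when a concurrent proposal wins is omitted.

    # Sketch of the request path in Fig. 2 (steps 1-5); all interfaces
    # are hypothetical placeholders, not a concrete implementation.
    class InPlaceReplica:
        def __init__(self, consensus, storage):
            self.consensus = consensus  # modified consensus module (see 'Modified Consensus' below)
            self.storage = storage      # persistent, ideally byte-addressable medium

        def handle_command(self, command):               # step 1: command received
            current = self.storage.read_state()          # step 2: fetch current state
            proposal = command(current)                   # apply the command to the state
            decided = self.consensus.propose(proposal)    # step 3: agree on the next state
            self.storage.write_state(decided)             # step 4: overwrite local state in place
            return decided                                # step 5: reply to the client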
In contrast to using a log, no replica explicitly stores the sequence of commands. Instead, the sequence is only implied by the current replica state. Thus, replicas can quickly catch up if they have missed past consensus decisions, e.g., due to message loss caused by unreliable communication channels. For example, replica 2 in Fig. 2 may not learn state ‘\(\{B\}\)’. However, it can still safely overwrite its local state if it learns a subsequent proposed state. In contrast, replicas in the log-based approach must learn all previously agreed-upon commands before new commands can be executed, or they have to request a potentially large snapshot of the state. Of course, the consensus protocol must be modified for in-place RSMs to ensure that only up-to-date replicas succeed with their proposals.
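Because the state sequence is implied rather than logged, the learner side reduces to an overwrite guarded by recency, as in this sketch (field names are our assumptions):

    # Sketch: learning a state in an in-place RSM. A replica that missed
    # earlier decisions catches up by overwriting its local state with
    # any newer learned state; no log replay or snapshot is needed.
    def on_learn(replica, round_no, state):
        if round_no > replica.learned_round:    # ignore stale decisions
            replica.learned_round = round_no
            replica.storage.write_state(state)  # overwrite in place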
Modified Consensus
Current consensus algorithms are either built directly around a command log [10] or around generalized dependency graphs that allow multiple equivalent command orderings [14], or they can only be used to agree on a single command [9]. Systems that use the latter kind of algorithm typically chain multiple consensus instances, one for each slot in the log. However, in all cases, the managed state grows with the number of learned commands and eventually needs to be actively truncated. To meet the demands of our in-place RSM architecture, we designed a new consensus algorithm that can agree on an arbitrary number of values in sequence (i.e., states of the RSM) without additional memory requirements.
Our algorithm is based on classical single-decree Paxos [9]. In the following, we describe the high-level approach. We refer to [15] for the full algorithm.
Paxos Consensus
Single-decree Paxos can be used by a set of processes to agree on a single value. Paxos distinguishes the roles of the proposer, acceptor, and learner. Proposers propose values, acceptors vote on the proposals, and learners collect acceptor votes and may learn a proposed value.
Paxos is executed in two phases, as shown in Algorithm 1. In the first phase, a proposer first chooses a (unique) round number. Rounds are used to order concurrent proposals. The proposer then announces to all acceptors its intent to propose a value in this round during the second phase. Acceptors reply with an ack containing their current vote if they have not seen a higher-numbered proposal.

If the proposer has received a quorum of ack replies, it proceeds to phase 2. It proposes the most recent value received in the phase 1 messages. If no acceptor has voted for a value yet, it can choose its own value, e.g., a value received from a client. The proposer sends the proposal to all acceptors, who vote for it if they have not seen a higher-numbered proposal. Learners collect the votes.
Once a learner has received votes for the same proposal from a quorum, the included value has been learned by consensus. The protocol ensures that the learned value never changes: intuitively, as a quorum has voted for value \(v\), every proposer will receive at least one instance of \(v\) in a phase 1 ack message. Thus, no other value will be proposed anymore.
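To make the two phases concrete, the following sketch shows the acceptor logic and the proposer’s value selection. It is a simplified rendering of textbook single-decree Paxos, not Algorithm 1 verbatim; messaging and learner bookkeeping are omitted.

    # Simplified single-decree Paxos acceptor (messaging omitted).
    class Acceptor:
        def __init__(self):
            self.promised = -1        # highest round seen in either phase
            self.voted_round = -1     # round of the most recent vote
            self.voted_value = None   # value of the most recent vote

        def on_prepare(self, rnd):            # phase 1
            if rnd > self.promised:
                self.promised = rnd
                # The ack carries the current vote so that the proposer
                # can re-propose the most recently voted value in phase 2.
                return ("ack", self.voted_round, self.voted_value)
            return ("nack", self.promised)

        def on_propose(self, rnd, value):     # phase 2
            if rnd >= self.promised:
                self.promised = rnd
                self.voted_round, self.voted_value = rnd, value
                return ("vote", rnd, value)   # collected by the learners
            return ("nack", self.promised)

    # A proposer with a quorum of acks picks the most recently voted value,
    # or its own (e.g., client-supplied) value if no acceptor has voted yet.
    def choose_value(acks, own_value):
        voted = [(r, v) for (_tag, r, v) in acks if v is not None]
        return max(voted, key=lambda rv: rv[0])[1] if voted else own_value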
RMWPaxos: Enhancing Paxos with Consistent Quorums
Single-decree Paxos can only be used to agree on a single value. If agreement on a sequence of values is needed, e.g., commands in an RSM, multiple single-decree Paxos instances are necessary.
To avoid the need to clean up old instances, we extended single-decree Paxos so that a single Paxos instance can agree on an arbitrarily long sequence of values. We call this modified protocol RMWPaxos [15], as it supports the consistent application of arbitrary read-modify-write operations on a single value. Informally, the protocol supports the following semantics (see [15] for a formal problem statement):
Given a current agreed-upon state \(s\) and a set of concurrently received commands \(C\), all processes agree on the next state \(s^{\prime}\) such that \(\exists c\in C:s^{\prime}=c(s)\).
Already applied commands must never be ‘lost’: once \(s^{\prime}\) has been agreed on, the next agreed state in the sequence must be \(c(s^{\prime})\) for some command \(c\). Thus, one of the main challenges in developing the protocol is to reliably detect the current agreed-upon state before applying a new command to propose the next value.
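For illustration (our example, not from [15]), consider a set-valued state with two concurrently received commands:

    # Example of the read-modify-write semantics with a set-valued state.
    s = frozenset({"A"})                # current agreed-upon state
    def add_B(st): return st | {"B"}    # command c1
    def add_C(st): return st | {"C"}    # command c2
    C = [add_B, add_C]                  # concurrently received commands

    # Consensus selects exactly one c in C: the next state s' is either
    # {A, B} or {A, C}, never a mix. The losing command is not lost; its
    # proposer re-applies it to s' in a later round (yielding {A, B, C}).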
We achieve this by augmenting Paxos with the notion of consistent quorums. A quorum is consistent if all acceptors in it have voted for the same state. Otherwise, the quorum is inconsistent. We can use this concept to modify phase 2 of Paxos (see Fig. 3):
After a proposer \(p\) has completed phase 1, it checks whether the quorum has responded consistently by comparing all values included in the received ack messages. If they are the same, the quorum is consistent: the state is reliably established, and the command that produced it has finished successfully. \(p\) then knows with certainty that the system has agreed on this value and that no newer agreed-upon value exists. Thus, \(p\) can apply a client command and propose the result as the next value in the sequence. If \(p\) succeeds with its proposal, it can update its local state machine with the proposed value and notify the client that issued the command.
If the quorum is inconsistent, \(p\) does not know which value was agreed upon last (cf. Fig. 3, step two). To prevent the system from ‘losing’ commands, \(p\) cannot immediately propose a new value. Instead, it must propose the most recent value received in phase 1, similar to single-decree Paxos (cf. Algorithm 1, Lines 4–8). Afterward, \(p\) can retry by executing the protocol from the beginning.

An inconsistent quorum may indicate that other proposers executed the protocol concurrently. However, it can also be the result of message loss or other failures. In either case, the proposer cannot decide with certainty from the phase 1 replies alone which of the received values, if any, was agreed upon.
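Condensed into Python, the decision after phase 1 looks as follows (message shapes and names are our assumptions; round handling and the full retry loop are specified in [15]):

    # Sketch of the consistent-quorum check after phase 1 of RMWPaxos.
    # acks: (voted_round, voted_value) pairs from a quorum of acceptors.
    def after_phase1(acks, command, initial_state):
        values = {v for (_r, v) in acks}
        if len(values) == 1:
            # Consistent quorum: the current agreed-upon state is reliably
            # established, so the client command can be applied and its
            # result proposed as the next state in the sequence.
            (state,) = values
            if state is None:               # no state agreed on yet
                state = initial_state
            return ("propose", command(state))
        # Inconsistent quorum: the last agreed-upon state is unknown.
        # Re-propose the most recently voted value, as in single-decree
        # Paxos, and retry the command from phase 1 afterward.
        latest = max(acks, key=lambda rv: rv[0])[1]
        return ("repropose", latest)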
Persisting the State with NVRAM
The in-place RSM architecture promises less state overhead and protocol complexity than log-based designs when fault-tolerant access is needed at a fine-granular scale. Traditional persistent storage media such as HDDs or SSDs are not the best fit for the needs of in-place RSMs, as they can only be accessed at block granularity. Operating systems (OSs) employ a paging mechanism to handle access to these block devices efficiently. Pages often have a size of 4 KiB, but they can also be larger on current systems. To access any data on the persistent storage, the OS fetches the corresponding page into main memory. If the data is modified, the page is marked as dirty. It is eventually written back to the persistent medium, either explicitly by the application or due to periodic synchronizations by the OS.
The use of multiple in-place RSMs, e.g., to replicate individual key-value pairs of a key-value store (Section 5), results in an access pattern dominated by fine-granular random-access writes. However, such an access pattern cannot be handled efficiently by the paging mechanism for several reasons: First, writing a full page upon every modification causes write amplification, as the modified data is often much smaller than a single page. Second, different RSM instances may not share the same page, increasing the risk of page faults.
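A back-of-the-envelope calculation illustrates the first point (the numbers are examples of ours, not measurements from Section 6.2):

    # Illustrative write amplification of page-based persistence for
    # fine-granular in-place RSM updates.
    PAGE_SIZE = 4096      # bytes written back per dirty 4 KiB page
    STATE_SIZE = 64       # e.g., a small counter or set replicated in place

    print(f"write amplification: {PAGE_SIZE // STATE_SIZE}x")  # 64x per update

    # Byte-addressable NVRAM persists only the modified cache lines
    # (typically 64 B), avoiding most of this overhead.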
With the recent availability of 3D XPoint-based memory, such as Intel’s DCPMM, we now have a persistent storage medium that features byte-granular access with sub-microsecond latency. These properties are promising for the needs of in-place RSMs. While several performance studies discuss DCPMM’s applicability in various areas, none has evaluated the persistent access patterns that occur when using in-place RSMs. Therefore, we investigate them in Section 6.2.