
1 Introduction

The CAP theorem [1] (Consistency, Availability, Partition tolerance) states that no distributed system can satisfy all three properties simultaneously; at least one of them must be given up. In a distributed system, partition tolerance generally has to be satisfied by default. Giving up consistency means that the data on different nodes cannot be trusted, which is usually unacceptable. Therefore, a common choice is to give up availability, i.e., the system may temporarily stop serving requests in order to keep the data on all nodes consistent. When building a distributed system, the main design goals are thus consistency and partition tolerance, and the former has drawn more interest in recent research.

The consistency problem mainly concerns how multiple service nodes reach agreement. Distributed services are usually exposed to failures such as server resets and network jitter, which make them unreliable. Consensus algorithms were developed to solve this problem. A consensus algorithm typically uses a replicated state machine to ensure that all nodes hold the same log sequence; after all logs are applied in order, the state machines eventually reach the same state. Consensus algorithms are widely used in distributed databases [2,3,4], blockchain applications [5, 6], high-performance middleware [7], and other fields, and they form the foundation of these systems.

Two well-known consensus algorithms are Paxos [8] and Raft [9]. Paxos has been the benchmark for consensus algorithms over the past decades, but it is notoriously hard to understand, and many implementation details are missing from the original paper, which has led to divergent implementations whose correctness is hard to verify. The Raft protocol fills in the details of the multi-decision stages of Paxos: it improves comprehensibility, decomposes the consistency problem into several consecutive sub-problems, and guarantees the system's correctness through its safety mechanism.

The distributed consensus problem requires participants to reach agreement on a command sequence; a state machine then executes the submitted commands and guarantees eventual consistency. In the Raft algorithm, a leader is elected first, and the leader handles all requests. Raft's safety mechanism ensures that state machine logs are committed and executed in a fixed order according to their logical indexes, i.e., sequential submission and sequential execution. However, systems implemented with this procedure have low throughput: a large portion of requests remain blocked, and this performance penalty worsens in scenarios with high concurrency.

To deal with this problem, an improved Raft consensus algorithm is proposed in this paper. Instead of strictly sequential execution of requests, we introduce a pre-proposal stage in which asynchronous batch processing is performed to improve efficiency while retaining the distributed consensus properties. The improved Raft algorithm is deployed on simulated cluster machines for experiments. Finally, its availability and performance under a large number of concurrent requests are verified.

2 Related Works

2.1 Replicated State Machine

Consensus algorithms usually use the replicated state machine structure to achieve fault tolerance. The local state machine on each server produces copies of the same state transitions and sends them to the other servers over the network, so that the state machine can keep running even when some machines are down. A typical implementation lets the state machine managed by the leader node execute requests and distribute the copies, which ensures that the cluster remains available to external clients even when a node is down. Mature systems such as Zookeeper [10], TiKV [11] and Chubby [12] are all based on this design.

The basic theory of the replicated state machine is as follows: if every node in the cluster runs the same deterministic state machine prototype $S$ starting from the same initial state $s_0$, then given the same input sequence $I = \{i_1, i_2, i_3, \ldots, i_n\}$, every state machine follows the same transition path $s_0 \rightarrow s_1 \rightarrow s_2 \rightarrow \cdots \rightarrow s_n$, so the same final state $s_n$ is reached and the same output sequence $O = \{o_1(s_1), o_2(s_2), o_3(s_3), \ldots, o_n(s_n)\}$ is produced.

Fig. 1. The replicated state machine structure.

As shown in Fig. 1, the replicated state machine is implemented on top of log replication, and the structure usually consists of three parts: a consensus module, a state machine prototype, and a storage engine. The consensus module of each server receives the log sequence initiated by clients, executes and stores it in the order in which it is received, and then distributes the logs through the network so that the state machines of all server nodes stay consistent. Since the behavior of each state machine is deterministic and every operation produces the same state and output sequence, the entire server cluster behaves as a single, highly reliable state machine.
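To illustrate this structure, the following is a minimal Java sketch of a deterministic state machine that applies replicated log entries in order; a simple key-value store is assumed, and all class and method names are illustrative rather than taken from any concrete system.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// A minimal sketch of the deterministic state machine described above.
public class KeyValueStateMachine {
    public record LogEntry(int index, String key, String value) {}

    private final Map<String, String> state = new HashMap<>();
    private final List<LogEntry> log = new ArrayList<>();
    private int lastApplied = -1; // index of the last applied entry

    // Append a replicated log entry; entries are assumed to arrive in index order.
    public void append(LogEntry entry) {
        log.add(entry);
    }

    // Apply all appended entries in order; the same input order on every node
    // yields the same state and the same output sequence.
    public List<String> applyAll() {
        List<String> outputs = new ArrayList<>();
        while (lastApplied + 1 < log.size()) {
            LogEntry e = log.get(++lastApplied);
            state.put(e.key(), e.value());
            outputs.add("applied " + e.index());
        }
        return outputs;
    }
}
```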

2.2 Raft Log Compression

The Raft protocol is implemented on top of a log-replication state machine. In a real system, however, the log cannot be allowed to grow without bound: as time goes on, ever-growing logs consume more transmission overhead and prolong the recovery time after a node goes down. Without a mechanism to address this, the response time of a Raft cluster becomes significantly slower, so log compression is usually implemented in Raft systems.

Raft uses snapshots to implement log compression. In the snapshot scheme, if a state $s_n$ of the state machine has been safely applied on a majority of the nodes at some point, then $s_n$ is considered safe and all states before $s_n$ can be discarded; the effective initial state thus advances from $s_0$ to $s_n$, and other nodes only need to fetch the log sequence starting from $s_n$.

Fig. 2. The Raft log compression implemented by snapshots.

Figure 2 shows the basic idea of Raft snapshots. A snapshot is created independently by each server node and can only cover log entries that have already been safely committed. The snapshot stores the index of the last log entry it replaces. Once a node completes a snapshot, it can delete all log entries and earlier snapshots up to that index.

Although each node manages its snapshots independently, Raft's logs and snapshots are still driven by the leader node. For followers that lag too far behind (including nodes recovering from downtime or suffering large network delays), the leader sends its latest snapshot over the network, and the follower's state is overwritten with it.
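To make the compaction and catch-up mechanism concrete, below is a minimal Java sketch of snapshot-based log compaction under the assumptions above (a key-value state image, with invented names): the log prefix up to the snapshot's last included index is discarded, and a follower that lags behind the compacted prefix receives the snapshot itself.

```java
import java.util.Map;
import java.util.TreeMap;

// A minimal sketch of snapshot-based log compaction; not the paper's actual code.
public class SnapshotStore {
    public record Snapshot(long lastIncludedIndex, long lastIncludedTerm,
                           Map<String, String> stateImage) {}

    private final TreeMap<Long, String> log = new TreeMap<>(); // log index -> command
    private Snapshot latest;

    // Take a snapshot up to a safely committed index and discard every
    // log entry at or below that index.
    public void compact(long lastIncludedIndex, long lastIncludedTerm,
                        Map<String, String> stateImage) {
        latest = new Snapshot(lastIncludedIndex, lastIncludedTerm, Map.copyOf(stateImage));
        log.headMap(lastIncludedIndex, true).clear();
    }

    // A follower whose log is older than the compacted prefix is sent the snapshot.
    public Snapshot snapshotForLaggingFollower() {
        return latest;
    }
}
```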

3 Improved Raft Algorithm

3.1 Premises and Goals of the Improved Algorithm

The premises of the original Raft algorithm are as follows; these are the properties its safety mechanism relies on and essentially guarantees:

  • The cluster maintains a monotonically increasing term number (Term).

  • The network communication within the cluster is not reliable and is susceptible to packet loss, delay, network jitter, etc.

  • No Byzantine error will occur.

  • A leader will always be elected in the cluster, and there is at most one leader for a given term number.

  • The Leader is responsible for handling client requests; requests received by other nodes are redirected to the Leader.

  • Client requests satisfy linear consistency, and the client receives an accurate response after each operation.

In the improved algorithm, all of the above premises are kept except the second one. In real engineering projects, communication between machines tends to be stable most of the time (that is, the delay between nodes is much smaller than the heartbeat interval). In addition, reliable transport protocols such as TCP provide retransmission, so lost packets are resent promptly and the connection can recover quickly even after a failure. We therefore relax the second premise to: the network is not permanently in a faulty state. The communication established between the Leader and the followers can be assumed to be reliable; node downtime and network partitions still occur, but they can be regarded as under control.

3.2 Proposal Process

Each client operation that can be executed by the state machine on the server is called a Proposal. A complete Proposal usually consists of an event request (Invocation, hereinafter Inv) and an event response (Response, hereinafter Res). A request carries an operation of type Write or Read, and only the non-read-only Write operations are ultimately submitted to the state machine.
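For illustration, the following Java sketch models a Proposal's life cycle as just described, with an Inv and a Res event; all names are hypothetical, and only Write operations would be replicated.

```java
// A minimal, illustrative sketch of the Proposal life cycle (Inv followed by Res).
public final class ProposalExample {
    public enum OpType { READ, WRITE }

    // The event request (Inv): an operation submitted by a client.
    public record Invocation(long proposalId, OpType type, String key, String value) {}

    // The event response (Res): the result returned to the client after execution.
    public record Response(long proposalId, boolean committed, String readValue) {}

    // Only non-read-only operations are ultimately submitted to the state machine.
    public static boolean mustBeReplicated(Invocation inv) {
        return inv.type() == OpType.WRITE;
    }
}
```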

Fig. 3. (a) The process of a Proposal. (b) The parallel process of Proposals.

Figure 3(a) shows the process of a Proposal issued by client A, from initiation to response. From the perspective of Raft, a system that satisfies linear consistency needs to meet the following points:

  • Proposals may be submitted concurrently, but they are processed sequentially: the next Proposal can be processed only after the current one has returned its response.

  • The Inv operation is atomic.

  • Other Proposals may occur between a Proposal's Inv and Res events.

  • After any Read operation returns a new value, all subsequent Read operations should return this new value.

Figure 3(b) is an example of parallel client requests with linear consistency in Raft. For the same piece of data V, clients A to E initiate parallel Read/Write requests at a certain moment, and Raft receives the Proposals in real-time order. As shown in the figure, the requests satisfy the following total order relation:

$$\mathrm{P}=\{\mathrm{A},\mathrm{B},\mathrm{C},\mathrm{D},\mathrm{E}\}$$
(1)
$$\mathrm{R}=\{<\mathrm{A},\mathrm{B}>,<\mathrm{B},\mathrm{C}>,<\mathrm{C},\mathrm{E}>,<\mathrm{A},\mathrm{D}>,<\mathrm{D},\mathrm{C}>\}$$
(2)

Client A's write of V = 1 completes successfully during its Inv period. B initiates a read between A's Inv and Res, so it may read V = 1, and the same holds for C and E. D's read occurs after A and before C, so D reads the value written by A's Inv, and V = 1 is returned.

3.3 The Proposed Improved Raft Algorithm

Raft's linear semantics turns client requests into an execution sequence that is received, executed, and committed strictly in order, regardless of the request concurrency level. Under a large number of concurrent requests, two problems arise: 1. under the Raft mechanism the Leader must process every proposal, so the Leader becomes a performance bottleneck; 2. the processing rate is much slower than the request rate, so a large number of requests cause logs to accumulate and occupy bandwidth and memory for a long time.

Problem 1 can be solved with Multi-Raft groups [4]. Multi-Raft treats one Raft cluster as a consensus group, and each consensus group elects its own leader that manages its own log shard. In this way, the load is evenly divided among the consensus groups, preventing a single Leader from becoming the bottleneck of the whole cluster. In this paper, we focus on how to solve problem 2.
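As a rough illustration of the Multi-Raft idea, client commands can be routed to one of several independent consensus groups, each with its own leader and log shard; the key-hashing scheme below is an assumption for the sketch, not the routing used in [4].

```java
// An illustrative sketch of routing commands across Multi-Raft consensus groups.
public class MultiRaftRouter {
    public interface RaftGroup { void propose(String command); }

    private final RaftGroup[] groups;

    public MultiRaftRouter(RaftGroup[] groups) {
        this.groups = groups;
    }

    // Pick the consensus group responsible for this key; each group has its own
    // leader and log shard, so load is spread across groups.
    public RaftGroup groupFor(String key) {
        int idx = Math.floorMod(key.hashCode(), groups.length);
        return groups[idx];
    }
}
```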

Fig. 4. Log entry commit process.

Each Proposal is converted into a log entry that can be executed by the state machine, as shown in Fig. 4. When the leader node's consensus module receives the log entry, the Leader first appends it to its own log collection and then distributes it to the follower nodes through the AppendEntries RPC. Barring failures such as network partitions or downtime, a follower copies the entry into its own log collection after receiving the request and replies to the leader with an ACK indicating a successful append. When the Leader has received ACKs from a majority of the cluster, it commits the entry to its state machine and notifies the other Follower nodes to commit as well, thereby completing one cluster-wide log commit.
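The majority-commit rule in this process can be sketched in Java as follows; this is a simplified illustration with our own names, and term checks and persistence are omitted.

```java
import java.util.HashMap;
import java.util.Map;

// A minimal sketch of the commit rule: the leader counts follower ACKs per log index
// and commits an entry once a majority of the cluster (leader included) has appended it.
public class CommitTracker {
    private final int clusterSize;
    private final Map<Long, Integer> acks = new HashMap<>(); // log index -> follower ACKs
    private long commitIndex = -1;

    public CommitTracker(int clusterSize) {
        this.clusterSize = clusterSize;
    }

    // Invoked when a follower acknowledges AppendEntries for the given index.
    // Returns true when the entry becomes committed.
    public synchronized boolean onFollowerAck(long index) {
        int count = acks.merge(index, 1, Integer::sum) + 1; // +1 for the leader's own copy
        if (count > clusterSize / 2 && index > commitIndex) {
            commitIndex = index; // safe to apply to the state machine and notify followers
            return true;
        }
        return false;
    }

    public synchronized long commitIndex() {
        return commitIndex;
    }
}
```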

In a highly concurrent scenario, the log entries waiting to be processed can be viewed as an ever-growing task queue: the Leader continuously sends AppendEntries RPCs to the Followers and waits for a majority of nodes to respond, while the queue grows much faster than individual log entries are committed. In this synchronous log-replication mode, any network jitter or packet loss affects even more logs, which dramatically hurts system throughput.

Consider the TCP sliding-window mechanism. When multiple consecutive AppendEntries RPCs are initiated, the Leader essentially holds a TCP connection with each Follower and sends multiple TCP segments over it. The sliding window allows the sender to transmit several segments back to back instead of stopping to wait for an acknowledgment after each one; the window size determines how many segments can be in flight, and when the window is full the sender has to wait. On a long fat network (LFN), such waiting makes segments time out and be retransmitted, and useless retransmissions generate considerable network overhead. If the window is large enough, multiple segments can be sent continuously and acknowledged correctly without retransmission. Ignoring other overhead, the network throughput is simply the amount of data transferred per second.
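Ignoring protocol overhead, this relationship can be summarized by the standard window/RTT bound from TCP theory (quoted here as a textbook relation, not a result of this paper), where $W$ is the window size, $\mathrm{RTT}$ the round-trip time, and $B$ the link bandwidth:

$$\mathrm{Throughput} \approx \min\!\left(\frac{W}{\mathrm{RTT}},\; B\right)$$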

Based on this observation, our proposed method changes the synchronous wait for consecutive AppendEntries calls into asynchronous processing, so that subsequent ACKs are not blocked and the network throughput can be improved. However, because of operating-system scheduling during asynchronous callbacks, the responses may be processed out of order, and committing them directly in that order could leave holes in the log. Our solution is as follows: as long as the Leader's consecutive heartbeat confirmations are answered in time, the network is considered healthy; an out-of-order response is then within a controllable range, because the logs preceding the out-of-order log are guaranteed to arrive at some point in the future. For out-of-order responses caused by scheduling, we only need to wait and then commit them in order. If the network fails or is partitioned, the TCP mechanism still ensures that messages are not reordered.
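A minimal Java sketch of this asynchronous, hole-free commit order is shown below (our illustration, with the RPC transport and state machine stubbed out): AppendEntries responses may arrive in any order, but the commit index only advances over a contiguous prefix of acknowledged entries.

```java
import java.util.TreeMap;
import java.util.concurrent.CompletableFuture;

// A sketch of asynchronous replication with in-order commit to avoid log holes.
public class AsyncReplicator {
    private final TreeMap<Long, Boolean> acked = new TreeMap<>(); // index -> majority ACK
    private long commitIndex = -1;

    // Issue the AppendEntries RPC asynchronously; the caller is never blocked.
    public void replicate(long index, byte[] entry) {
        sendAppendEntriesAsync(index, entry)
            .thenAccept(majorityAcked -> onResult(index, majorityAcked));
    }

    private synchronized void onResult(long index, boolean majorityAcked) {
        acked.put(index, majorityAcked);
        // Commit only the contiguous prefix; an out-of-order ACK simply waits here
        // until the missing earlier indexes have been acknowledged.
        while (Boolean.TRUE.equals(acked.get(commitIndex + 1))) {
            acked.remove(++commitIndex);
            applyToStateMachine(commitIndex);
        }
    }

    // Placeholder standing in for the real asynchronous AppendEntries RPC.
    private CompletableFuture<Boolean> sendAppendEntriesAsync(long index, byte[] entry) {
        return CompletableFuture.completedFuture(true);
    }

    // Placeholder: apply the committed entry at this index to the state machine.
    private void applyToStateMachine(long index) { }
}
```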

On top of this asynchronous basis, logs are processed in batches. To this end, we introduce a pre-Proposal stage that pre-processes concurrent Proposals. The pre-Proposal stage sits between the client-initiated Proposal and the Leader's processing of it. During this stage, a highly concurrent synchronization queue stores Proposals in FIFO (first in, first out) order. When the Leader starts processing, it takes Proposals out of the queue sequentially until it encounters the first read-only request in the queue. It then constructs a replica state machine identical to the local one; in this replica, the non-read-only logs are committed in batches, a snapshot is extracted, and asynchronous RPCs are sent so that the other Follower nodes install the snapshot. When ACK responses have been received from more than half of the nodes, the replica state machine replaces the original one. To preserve Raft's consistent reads, a read request must not be executed before the write requests ahead of it; therefore the synchronization queue is blocked and the read-related Proposal is handled separately before processing continues up to the next read request. In scenarios with more writes than reads, the throughput improvement is more significant.
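The pre-Proposal stage can be sketched in Java roughly as follows (an illustrative simplification with invented names; the replica state machine and snapshot installation are omitted): Proposals are buffered FIFO, consecutive writes are drained into one batch, and draining stops at the first read-only Proposal so that a read never overtakes a preceding write.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.LinkedBlockingQueue;

// A sketch of the pre-Proposal stage: FIFO buffering plus write batching.
public class PreProposalStage {
    public record Proposal(boolean readOnly, String command) {}

    private final LinkedBlockingQueue<Proposal> queue = new LinkedBlockingQueue<>();

    // Concurrent clients enqueue Proposals; arrival order is preserved.
    public void submit(Proposal p) throws InterruptedException {
        queue.put(p);
    }

    // Drain consecutive write Proposals into one batch; stop before the first read.
    public List<Proposal> nextWriteBatch() {
        List<Proposal> batch = new ArrayList<>();
        Proposal head;
        while ((head = queue.peek()) != null && !head.readOnly()) {
            batch.add(queue.poll());
        }
        return batch; // submitted to the replica state machine in one round
    }

    // The read at the head of the queue is handled only after the batch is committed.
    public Proposal nextRead() {
        Proposal head = queue.peek();
        return (head != null && head.readOnly()) ? queue.poll() : null;
    }
}
```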

4 Experiments and Analysis

The experimental environment is as follows: the server host has 32 GiB of memory and an Intel Xeon (Cascade Lake) Platinum 8269CY 2.5 GHz CPU with 8 cores. The proposed algorithm runs in virtual containers on this server; 3 nodes are simulated, each allocated 4 GiB of memory and 2 CPU cores. The operating system is CentOS, and the program is written in Java.

To evaluate the efficiency of the improved Raft algorithm, a comparison experiment with the traditional Raft [9] was conducted, evaluating two aspects: 1. the time it takes to process the same number of Proposals before and after the improvement; 2. the impact on system throughput before and after the improvement.

Multithreading was used to send concurrent requests. In total, 17 sets of comparison experiments were carried out with different request concurrency levels, ranging from 1000 up to 13000 log entries. The final results are shown in Fig. 5, Fig. 6 and Table 1.
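The kind of multithreaded load generator used for such an experiment can be sketched as follows; this is a hypothetical driver, and RaftClient as well as the command format are assumptions rather than the paper's benchmark code.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// An illustrative multithreaded load generator for measuring total processing time.
public class LoadGenerator {
    public interface RaftClient { void propose(String command); }

    public static long run(RaftClient client, int totalEntries, int threads)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        long start = System.nanoTime();
        for (int i = 0; i < totalEntries; i++) {
            final int id = i;
            pool.execute(() -> client.propose("set key" + id + "=value" + id));
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.MINUTES);
        // Elapsed wall-clock time for the whole batch of concurrent requests.
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
    }
}
```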

Fig. 5. Performance comparison of processing time with different numbers of log entries.

Fig. 6. Performance comparison of throughput with different data volumes.

As the concurrency level increases, the program inevitably hits a processing bottleneck, i.e., the point at which the processing speed falls far behind the rate at which tasks arrive. Figure 5 shows that this bottleneck lies at a log concurrency of around 12000; beyond that point, the processing capacity of both algorithms degrades exponentially. Before the bottleneck, the proposed algorithm consistently achieves more than a 20% improvement over the traditional algorithm. Even past the bottleneck, the proposed algorithm's processing time stabilizes, because the introduced batch processing helps drain the concurrent task queue. In contrast, due to log backlog and task accumulation, the processing time of the traditional algorithm keeps growing exponentially.

Figure 6 shows that, as the amount of processed data increases, the throughput of the proposed algorithm stays consistently higher than that of the traditional algorithm thanks to batch processing. Because of various hardware and software limits, such as the number of disk arms, the number of CPU cores, and the file system, this improvement foreseeably has an upper bound. Nonetheless, the throughput is reliably more than twice that of the original algorithm.

Table 1. Performance improvement rate of the optimized algorithm

Table 1 records the improvement rate of the improved algorithm in terms of system throughput and log processing time. The proposed algorithm at least doubles the system throughput, and the processing of client requests is also sped up by more than 20%.

5 Conclusion

In this paper, the distributed consensus problem is optimized with an improved Raft algorithm. The traditional Raft algorithm executes client requests with sequential execution and sequential submission to satisfy linear consistency, which has a great impact on performance. We introduce asynchronous and batch processing in a pre-Proposal stage to improve both processing time and system throughput. After the batched logs are committed, a snapshot compressed from them is sent along the sequential queue; since in-memory computation is much faster than a network round trip, the throughput is greatly improved. Experimental results show that this method increases system throughput by a factor of 2 to 3.6, and the efficiency of parallel request processing by more than 1.2 times, improving the efficiency of the algorithm while preserving its correct operation.