
Raft Deep Dive for System Design

Explore the Raft consensus algorithm to understand its leader election, log replication, and safety mechanisms. This lesson helps you grasp how Raft simplifies consensus compared to Paxos, ensuring reliable and consistent distributed systems through modular design and fault tolerance.

Introducing Raft

Raft is a consensus algorithm designed from the ground up for understandability. Paxos, another consensus algorithm, has a reputation for being intricate, and the original Paxos paper omits many implementation-specific details that engineers need. As a result, there are many variants of Paxos, and many of them have never been mathematically proven.

Raft’s algorithm maintains replicated state machines across different nodes. Each node keeps a sequential log whose slots have monotonically increasing numbers, and each slot holds one application-specific command. The purpose of the Raft algorithm is to keep this log consistent across all nodes.
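The log-and-state-machine model above can be sketched in a few lines of Python. This is a toy illustration, not the Raft protocol itself; the `Entry` shape and the key-value state machine are assumptions made for this example:

```python
from dataclasses import dataclass

@dataclass
class Entry:
    term: int       # election term in which the entry was created
    command: tuple  # application-specific command, e.g. ("set", "x", 3)

class Node:
    def __init__(self):
        self.log = []         # sequential log; slot numbers are list indices
        self.state = {}       # the replicated state machine (a toy key-value store)
        self.last_applied = 0

    def apply_committed(self, commit_index):
        # Apply commands in slot order (left to right) up to commit_index.
        while self.last_applied < commit_index:
            op, key, value = self.log[self.last_applied].command
            if op == "set":
                self.state[key] = value
            self.last_applied += 1

node = Node()
node.log = [Entry(1, ("set", "x", 3)), Entry(1, ("set", "y", 7))]
node.apply_committed(commit_index=2)
print(node.state)  # {'x': 3, 'y': 7}
```

Because every node applies the same commands in the same slot order, nodes that hold identical logs end up with identical state machines.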

Raft's algorithm uses a simple majority vote to achieve consensus with high availability. Nodes that miss some updates catch up later when they rejoin the Raft cluster, applying the missed commands from the log to their state machines from left to right (from lower slot numbers to higher slot numbers).
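The majority rule can be made concrete with a hypothetical sketch: an entry counts as committed once a majority of nodes stores it, so the highest committed slot can be computed from the match indices each node reports (the function names here are illustrative, not Raft's API):

```python
def majority(cluster_size):
    # Smallest number of nodes that forms a majority, e.g. 3 out of 5.
    return cluster_size // 2 + 1

def commit_index(match_indices):
    # match_indices[i] = highest log slot known to be stored on node i.
    # A slot is committed once a majority of nodes has reached it, so the
    # committed prefix ends at the majority-th highest match index.
    ranked = sorted(match_indices, reverse=True)
    return ranked[majority(len(match_indices)) - 1]

# Five nodes: three have slot 7, so slots up to 7 are committed even
# though two slower nodes (the minority) are still catching up.
print(commit_index([7, 7, 7, 4, 2]))  # 7
```

This is also why a minority cannot undo a decision: any future majority must overlap the majority that stored the committed slots.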

Like Paxos, Raft is always safe, but it is not live under specific circumstances. Unlike Paxos, leader election is a mandatory part of the Raft algorithm, and the leader services all client requests. This strong leader is one of the reasons for Raft's simplicity; however, it can become a scalability bottleneck when the number of concurrent clients is very high.

Note: As per the FLP impossibility result, consensus cannot always be achieved in an asynchronous network even when only a single node can crash-fail. Raft takes two major steps to work around the FLP impossibility:

  1. Raft tweaks the meaning of consensus and termination by requiring agreement from a majority of nodes rather than immediate agreement from all of them. However, Raft makes sure that once a majority has agreed on something, the minority cannot change that decision.

  2. Raft detects the cases when consensus is not possible. Under such circumstances, the algorithm stops making forward progress. Raft remains safe throughout and resumes normal operation as soon as system conditions improve. This works well for real applications because many systems have long periods of smooth operation between failures.

This lesson will compare Raft with Paxos. We will then learn about the functional and non-functional requirements of Raft.

Problems with Paxos

The Paxos protocol was developed by Leslie Lamport. In the past decade, it has become the standard for consensus in distributed systems. In its basic form, called single-decree Paxos, the protocol agrees on a single decision, such as one replicated log entry. Multiple instances of this protocol are combined to facilitate a series of decisions, such as an entire log; this composition is referred to as multi-Paxos. Paxos ensures safety (though it is not always live), and it also supports changes in cluster membership. Its correctness has been proven, and it is efficient in the normal case.

The design of the multi-Paxos protocol

However, Paxos has two significant drawbacks:

  1. The first drawback is its complexity. The complete explanation of Paxos is notoriously opaque and challenging to understand. Even simplified explanations are still difficult for many people. The difficulty in understanding Paxos stems from its foundation on the single-decree subset, which is quite complicated in its own right. Single-decree Paxos is divided into two stages that do not have simple, intuitive explanations and cannot be understood independently. Because of this, it is not easy to develop intuitions about why the single-decree protocol works. The composition rules for multi-Paxos add significant additional complexity: having a leader initiate consensus is unnecessary there and merely a performance optimization, which means multiple instances of single-decree consensus might be in progress for different log slots at the same time.
  2. The second drawback is its inadequate support for developing practical implementations. One reason for this is the lack of a widely accepted algorithm for multi-Paxos, as Lamport’s descriptions mainly focus on single-decree Paxos, with only a few suggestions for multi-Paxos that lack sufficient detail. Although several attempts have been made to flesh out and optimize Paxos, these efforts differ from each other and from Lamport’s suggestions. Paxos-like algorithms have been implemented in systems such as Chubby, but in most cases the implementation details have not been made public.
Paxos' single-decree decomposition (top) vs. Raft's approach claiming to be a better alternative (bottom)

Another issue with Paxos is its symmetric peer-to-peer approach, which is appropriate for a world where only one decision needs to be made; however, few practical systems use it that way. If a series of decisions needs to be made, it is easier and faster to first elect a leader and then have the leader coordinate the decisions.

Consequently, practical systems end up differing considerably from Paxos: each implementation starts with Paxos and then, because of the challenges of implementing it, evolves into a different architecture.

Given these limitations, the designers of Raft concluded that Paxos is not a suitable foundation for system building or education. Since consensus is essential in large-scale software systems, another attempt was made to design an alternative consensus algorithm with better properties than Paxos, which resulted in the Raft algorithm.

Raft’s innovations

In contrast to Paxos, the Raft consensus algorithm is designed to be more understandable:

  • Problem decomposition: Raft decomposes the consensus problem into separate subproblems, such as leader election and log replication, providing a modular solution to each.
  • Understandability: Raft has three components: leader election, log replication, and safety. Each component is designed to be understandable on its own, and combining these components provides a comprehensive solution to the consensus problem.
  • Automatic cluster management: Raft offers several practical features, such as automatic cluster membership changes, which simplify distributed system implementation. It introduces a novel mechanism for altering cluster membership that uses overlapping majorities to ensure safety.
  • Distinct structure: Its effectiveness is comparable to that of (multi-)Paxos, while its structure is distinct, resulting in greater comprehensibility than Paxos and a more solid foundation for developing practical systems.
  • Coherence: Additionally, Raft enforces a stronger degree of coherence to minimize the number of states that must be considered. Fewer states also reduce the complexity of verifying the correctness of the algorithm.

The following illustration highlights the distinguishing factors of the Raft algorithm as compared to others:

Raft vs. other consensus based algorithms

Raft’s overview

Raft is a consensus algorithm for managing a replicated log containing the clients’ state machine commands. It implements consensus by first electing a leader and then managing all replication from the leader to the other servers. The leader is responsible for accepting log entries from clients, replicating them to other servers, and telling them when it is safe to apply the entries to their state machines. This ensures a unidirectional flow of instructions from the leader to the followers and allows the leader to make decisions unilaterally, without consulting other servers. For example, the leader can decide where to put a new incoming client log entry without waiting for other servers. If the leader fails or gets disconnected from the other servers, Raft elects a new leader.
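The unidirectional flow described above can be sketched as a toy leader loop. The `Follower`/`Leader` classes and the `append_entries` method below are invented for illustration; real Raft uses AppendEntries RPCs carrying terms and a stronger consistency check:

```python
class Follower:
    def __init__(self):
        self.log = []

    def append_entries(self, entries, prev_len):
        # Toy consistency check: accept only if our log matches the
        # leader's expectation of where these entries should go.
        if len(self.log) != prev_len:
            return False
        self.log.extend(entries)
        return True

class Leader:
    def __init__(self, followers):
        self.log = []
        self.followers = followers

    def handle_client_command(self, command):
        # The leader alone decides the slot for the new entry...
        prev_len = len(self.log)
        self.log.append(command)
        # ...then replicates it; the entry is committed once a majority
        # of the cluster (the leader included) stores it.
        acks = 1  # the leader's own copy counts
        for f in self.followers:
            if f.append_entries([command], prev_len):
                acks += 1
        return acks >= (len(self.followers) + 1) // 2 + 1

leader = Leader([Follower(), Follower()])
print(leader.handle_client_command("set x=1"))  # True: committed on all 3 nodes
```

Note how the leader never waits for followers to agree on slot placement; it only waits for a majority to acknowledge storage.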

1.

If Raft needs consensus for leader election among its nodes, isn’t that a chicken-and-egg problem? In other words, Raft is a consensus algorithm itself, and part of this algorithm needs consensus for leader election.

Requirements

Let’s define some functional and non-functional requirements for the Raft consensus algorithm:

Functional requirements

Raft uses a strong leader approach in its core algorithm. It decomposes the consensus problem into three subproblems, fulfilling the following functional requirements:

  1. Leader election: If a leader fails, the algorithm should elect a new leader to continue its operations.
  2. Log replication: The leader must accept log entries from clients and replicate them to the other servers, committing each entry once a majority of servers stores it, before replying to the client; all remaining nodes receive the entry eventually.
  3. Safety: No two servers should apply different entries at the same log index to their state machines.

All three subproblems (requirements) are relatively independent, but collectively provide a holistic approach to a consensus algorithm, covering all its main aspects.
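The leader-election subproblem can be sketched as follows. This is a single-threaded toy under stated assumptions: `request_vote` echoes Raft's RequestVote RPC in spirit, but real Raft adds randomized election timeouts, log up-to-dateness checks, and term comparisons on every message:

```python
class Server:
    def __init__(self, name):
        self.name = name
        self.current_term = 0
        self.voted_for = {}  # term -> candidate voted for (at most one vote per term)

    def request_vote(self, term, candidate):
        # A server grants at most one vote per term, first come first served.
        if term > self.current_term:
            self.current_term = term
        return (term == self.current_term
                and self.voted_for.setdefault(term, candidate) == candidate)

def run_election(candidate, peers):
    # The candidate starts a new term and votes for itself...
    candidate.current_term += 1
    term = candidate.current_term
    candidate.voted_for[term] = candidate.name
    votes = 1
    # ...then solicits votes; a majority of the cluster wins the election.
    for peer in peers:
        if peer.request_vote(term, candidate.name):
            votes += 1
    cluster_size = len(peers) + 1
    return votes >= cluster_size // 2 + 1

servers = [Server("a"), Server("b"), Server("c")]
print(run_election(servers[0], servers[1:]))  # True: "b" and "c" grant their votes
```

The one-vote-per-term rule is what makes at most one leader per term possible: two candidates in the same term cannot both gather a majority, because the majorities would overlap in a server that voted only once.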

Non-functional requirements

In addition to the functional requirements, Raft fulfills some additional non-functional requirements:

  • Timing and availability: Besides ensuring availability through leader-failure handling, Raft also remains available when a follower or candidate crashes (in Raft, a server is either a leader, a candidate, or a follower). Raft must ensure that any nodes left behind in consensus (minority nodes) eventually catch up with the majority, and as quickly as possible.

  • Cluster membership changes: The Raft algorithm should be able to manage cluster membership efficiently, in such a way that normal consensus operations are not impacted during membership changes.

  • Log compaction: Raft should be able to compact the log to keep it from growing without bound, so that a cold-starting node does not have to replay a very long log and can instead start from the last snapshot and then apply only the log entries after it.
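Log compaction can be sketched as snapshotting. In this toy illustration, the field name `last_included_index` follows the terminology of Raft's snapshot description, but the key-value state image and 0-based counting are assumptions made for the example:

```python
class CompactableLog:
    def __init__(self):
        self.entries = []             # live log entries after the snapshot point
        self.snapshot_state = {}      # state machine image at the snapshot point
        self.last_included_index = 0  # number of entries folded into the snapshot

    def append(self, command):
        self.entries.append(command)

    def take_snapshot(self, state, applied_index):
        # Fold everything up to applied_index into the snapshot and drop
        # those entries from the log, so the log stops growing without bound.
        keep_from = applied_index - self.last_included_index
        self.entries = self.entries[keep_from:]
        self.snapshot_state = dict(state)
        self.last_included_index = applied_index

log = CompactableLog()
for i in range(1000):
    log.append(("set", "x", i))
log.take_snapshot({"x": 998}, applied_index=999)
print(len(log.entries))  # 1: only the unapplied tail of the log remains
```

A cold-starting node would install `snapshot_state` directly and then replay only the short tail in `entries`, instead of replaying all 1,000 commands.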

Raft’s properties

Besides the above-mentioned functional and non-functional requirements, Raft ensures some properties are fulfilled at all costs. Let’s see what these properties are:

  • Election safety: At most one leader can be elected in a given election term. (A new term starts when a follower detects the absence of a leader, elevates itself to a candidate, and triggers a new election.)

  • Leader append-only: A leader performs only one action on its log: it appends entries, and never deletes or overwrites them.

  • Log matching: If two logs contain an entry with the same log index and election term, the two logs are identical up to and including that index.

  • Leader completeness: If an entry is committed in a given term, it will be present in the logs of the leaders of all higher-numbered terms.

  • State machine safety: If a server has applied a log entry at a specific index to its state machine, then no other server will ever apply a different log entry at that same index.
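The log matching property can be turned into a small checker. This is an illustrative test harness, not part of Raft; logs are modeled here as lists of `(term, command)` pairs whose list position is the log index:

```python
def logs_match(log_a, log_b):
    # Log matching: if two logs hold an entry with the same index and the
    # same term, then all earlier entries must be identical as well.
    for i in range(min(len(log_a), len(log_b))):
        if log_a[i][0] == log_b[i][0]:          # same index, same term...
            if log_a[:i + 1] != log_b[:i + 1]:  # ...requires identical prefixes
                return False
    return True

a = [(1, "x=1"), (1, "y=2"), (2, "z=3")]
b = [(1, "x=1"), (1, "y=2"), (3, "z=9")]  # diverged only in a later term
print(logs_match(a, b))  # True: the shared (index, term) prefix is identical
```

Logs that violate the property, such as two logs with the same term at index 0 but different commands there, would make the checker return `False`.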

Bird’s eye view

In the following lessons, we will design and evaluate Raft. The following concept map is a quick summary of the design and analysis of the Raft consensus algorithm:

The mind map for Raft's design

In the next lesson, we’ll start with the detailed design of Raft.