Paxos Deep Dive for System Design
Explore how the Paxos algorithm provides consensus for replicated state machines in distributed systems. Learn its core safety properties and liveness goals, and understand its role in maintaining data consistency and fault tolerance through single and multi-decree consensus approaches.
Motivation: Log replication via consensus
Paxos is a consensus algorithm used by multiple entities to agree on something of interest, such as maintaining consistency among multiple copies of data. As we learned in our state machine replication chapter, we can model a broad range of computation as a state machine, and replicating such a state machine at different nodes provides us fault tolerance. Commands transition a state machine from one consistent state to the next. To ensure consistency, it is necessary to establish consensus among the replicas of the state machine regarding the specific command to be executed and the order in which it should be executed. Paxos being a consensus algorithm is a natural choice in this scenario. (We will learn about Paxos in the context of state machine replication because it is generic and applicable for many use cases.)
We will use the Paxos algorithm to achieve consensus on a single jmp 0x12366) or keys and their associated values in a hash table or any other state machine. These logs can be viewed as sequential read-ahead files. Once consensus has been achieved on the slot's commands, these commands are executed on the local state machine. Each execution transitions a state machine from one consistent state to the next.
We use a Paxos instance to achieve consensus on a single slot of the log, as shown in the following slide deck. There can be multiple clients wanting to write their command at a specific slot of the log, and the Paxos algorithm helps them reach a consensus on whose value will be
The basic Paxos algorithm for achieving consensus for a single slot is called single-decree Paxos (or basic Paxos). When we use multiple instances of Paxos concurrently, such a setting is called multi-decree Paxos (or multi-Paxos). In this chapter, we will first explain basic Paxos in different circumstances, including normal operation and in the presence of failures. Then we will explain the working of multi-Paxos.
Many software products use Paxos algorithms. An example is Google's Spanner database which uses multiple Paxos groups. Each group manages a shard of the database and achieves consensus on how the database will change when concurrent commands/transactions are active.
Note: The original Paxos paper primarily focuses on the proofs of correctness and leaves out many implementation details. As a result, many variants of Paxos algorithms are in use in many real software products due to different interpretations of the algorithm and various tweaks made to it. In our discussion, we will adhere to the original algorithm as much as possible, with occasional pointers to enhancements.
Client interactions
When a client has a command to execute, it submits its command to a replica server via a remote procedure call (RPC). The replica gets the command chosen at all the servers after getting the consensus. Once consensus has been achieved and the specific command has been executed by the state machine, the result of execution is returned to the client.
Failure model
We assume a non-Byzantine failure environment for the Paxos algorithm. Network messages can be lost or delayed.
Goals for Paxos
The goal of basic Paxos is to get the consensus on a slot's value at all the replicas by enforcing the safety and liveness properties.
Safety conditions
To choose a value, the following safety properties must hold:
Only a single value is chosen.
The value should be chosen only from the proposed values.
None of the processes learns that a value is chosen until that value is actually chosen.
Liveness
The goal of the Paxos algorithm is to eventually decide on a single value, choosing it from multiple proposed values., This property is known as the liveness of Paxos.
Note: Paxos algorithm is always safe but is not guaranteed to be live always. For example, if majority of replicas are not available, progress can not be made until a majority is available.
In the next lesson, we will explain the basic Paxos algorithm in detail.
Disclaimer: Our presentation of Paxos is partially influenced by the user study conducted by Diego Ongaro and John Ousterhout, who aimed to determine which consensus algorithm (Paxos or Raft) is easier to understand.