Paxos Deep Dive for System Design

Explore how the Paxos algorithm provides consensus for replicated state machines in distributed systems. Learn its core safety properties and liveness goals, and understand its role in maintaining data consistency and fault tolerance through single and multi-decree consensus approaches.

We'll cover the following...

Motivation: Log replication via consensus
- Client interactions
- Failure model
Goals for Paxos
- Safety conditions
- Liveness

Motivation: Log replication via consensus

Paxos is a consensus algorithm used by multiple entities to agree on something of interest, such as maintaining consistency among multiple copies of data. As we learned in our state machine replication chapter, we can model a broad range of computation as a state machine, and replicating such a state machine at different nodes provides us fault tolerance. Commands transition a state machine from one consistent state to the next. To ensure consistency, it is necessary to establish consensus among the replicas of the state machine regarding the specific command to be executed and the order in which it should be executed. Paxos being a consensus algorithm is a natural choice in this scenario. (We will learn about Paxos in the context of state machine replication because it is generic and applicable for many use cases.)

We will use the Paxos algorithm to achieve consensus on a single valueThis value can be arbitrarily complex in nature. across state machine replicas. In the real world, Paxos-based solutions are used to implement replicated logs. A log can be viewed as a series of placeholders (called a slot) to save clients' commands. These logs record the commands of clients, one per slot. Paxos is agnostic about the specifics of these commands because they are application specific. These commands could be machine instructions (for example, jmp 0x12366) or keys and their associated values in a hash table or any other state machine. These logs can be viewed as sequential read-ahead files. Once consensus has been achieved on the slot's commands, these commands are executed on the local state machine. Each execution transitions a state machine from one consistent state to the next.

We use a Paxos instance to achieve consensus on a single slot of the log, as shown in the following slide deck. There can be multiple clients wanting to write their command at a specific slot of the log, and the Paxos algorithm helps them reach a consensus on whose value will be chosenWe say a value is chosen when a majority of replica servers have agreed on a value. in the slot. Multiple instances of Paxos can be active concurrently for different slot numbers.

The basic Paxos algorithm for achieving consensus for a single slot is called single-decree Paxos (or basic Paxos). When we use multiple instances of Paxos concurrently, such a setting is called multi-decree Paxos (or multi-Paxos). In this chapter, we will first explain basic Paxos in different circumstances, including normal operation and in the presence of failures. Then we will explain the working of multi-Paxos.

Many software products use Paxos algorithms. An example is Google's Spanner database which uses multiple Paxos groups. Each group manages a shard of the database and achieves consensus on how the database will change when concurrent commands/transactions are active.

Note: The original Paxos paper primarily focuses on the proofs of correctness and leaves out many implementation details. As a result, many variants of Paxos algorithms are in use in many real software products due to different interpretations of the algorithm and various tweaks made to it. In our discussion, we will adhere to the original algorithm as much as possible, with occasional pointers to enhancements.

Client interactions

When a client has a command to execute, it submits its command to a replica server via a remote procedure call (RPC). The replica gets the command chosen at all the servers after getting the consensus. Once consensus has been achieved and the specific command has been executed by the state machine, the result of execution is returned to the client.

Failure model

We assume a non-Byzantine failure environment for the Paxos algorithm. Network messages can be lost or delayed.

Goals for Paxos

The goal of basic Paxos is to get the consensus on a slot's value at all the replicas by enforcing the safety and liveness properties.

Safety conditions

To choose a value, the following safety properties must hold:

Only a single value is chosen.
The value should be chosen only from the proposed values.
None of the processes learns that a value is chosen until that value is actually chosen.

Liveness

The goal of the Paxos algorithm is to eventually decide on a single value, choosing it from multiple proposed values., This property is known as the liveness of Paxos.

Note: Paxos algorithm is always safe but is not guaranteed to be live always. For example, if majority of replicas are not available, progress can not be made until a majority is available.

In the next lesson, we will explain the basic Paxos algorithm in detail.

Disclaimer: Our presentation of Paxos is partially influenced by the user study conducted by Diego Ongaro and John Ousterhout, who aimed to determine which consensus algorithm (Paxos or Raft) is easier to understand.

1.Prologue

2.File Systems

3.Google File System (GFS)

4.Google Colossus File System

5.Facebook's Tectonic File System

6.Databases

7.Google Bigtable

8.Google Megastore

9.Google Spanner

10.Key-value Stores

11.Many-core Key-value Store

12.Scaling Memcache

13.SILT

14.Amazon DynamoDB

15.Concurrency Management

16.Two-phase Locking (2PL)

17.Google Chubby Locking Service

18.ZooKeeper

19.Big Data Processing: Batch to Stream Processing

20.MapReduce

21.Spark

22.Kafka

23.Consensus

24.Understanding Consensus: Two Generals, FLP, & Byzantine Generals

25.Two-phase Commit

26.State Machine Replication

27.Paxos

28.Raft

29.Epilogue