6.824 2007 Lecture 13: Paxos

From Paxos Made Simple, by Leslie Lamport, 2001

introduction
  2-phase commit is good if different nodes are doing different things
  but in general you have to wait for all sites and the TC to be up
    you have to know if each site voted yes or no
    and the TC must be up to decide
  not very fault-tolerant: has to wait for repair
  can we get work done even if some nodes can't be contacted?
    yes: in the special case of replication

state machine replication
  works for any kind of replicated service: storage or lock server or whatever
  every replica must see the same operations in the same order
  if the operations are deterministic, replicas will end up with the same state
  so the same sequence of operations is the hard part to implement

how to ensure all replicas see operations in the same order?
  primary + backup(s) is the scheme we will look at
  clients send all operations to the current primary
  primary chooses the order, sends to backups, replies to the client
    after getting confirmations from the backups
  if the primary fails, backups take over and are at most one operation behind

what if the primary fails?
  need to worry about that last operation, possibly not complete
  need to pick a new primary
  can't afford to have two primaries!
  suppose the lowest-numbered live server is the primary
    so after a failure, everyone pings everyone
    then everyone knows who the new primary is?
  well, maybe not:
    pings may be lost => two primaries
    pings may be delayed => two primaries
    partition => two primaries
  so if you just let backups assume they are primary once they haven't heard
    from the primary for x seconds, a network problem might leave the old
    primary alive while a backup also thinks it is the primary
  on a lock server, two primaries might then give the same lock to two machines

idea: a majority of nodes must agree on the primary
  at most one network partition can have a majority
  if there are two potential primaries, their majorities must overlap

in practice this works as three components:
  Replicated state machine---keeps system state, such as locks, etc.
    The primary shares its state with the backups.
  Configuration/Views---a sequence of views over all time.
    A view contains the list of servers that are part of that view.
    - Some convention, such as "the lowest-numbered server in the view is
      the primary", decides the primary.
    - The configuration layer periodically pings all servers. If it notices
      a server is alive or dead (different from the current view), it talks
      to the other replicas by way of Paxos to choose a new view.
  Paxos---takes requests for views from the different view-change notices,
    and ensures that each of the members of the group (or at least a
    majority) agrees on the new view (essentially a string representing
    the view).

technique: "view change" algorithm
  system goes through a sequence of views
  view: view# and set of participants
  ensure agreement on a unique successor of each view
  the participant set allows everyone to agree on the new primary

view change requires "fault-tolerant agreement"
  at most a single value is chosen
  agree despite lost messages and crashed nodes
  can't really guarantee to agree---proposers may jump back and forth if
    they keep arguing. The paper suggests a distinguished proposer to get
    liveness, but we can guarantee to *not* "agree" on different values,
    which guarantees correctness.
  suggested distinguished-proposer heuristic: each server asks Paxos for a
    vote at some random time between 0 and (its server number) seconds, so
    lower-numbered servers are more likely to cause agreement, without much
    back-and-forth over the network.
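a minimal Python sketch of the view and majority-overlap ideas above (the
names View, primary_of, and is_majority are made up for illustration, not
from the lab code):

  from dataclasses import dataclass

  @dataclass(frozen=True)
  class View:
      vid: int                # view number
      members: frozenset      # IDs of the servers in this view

  def primary_of(view):
      # convention from the notes: lowest-numbered server is the primary
      return min(view.members)

  def is_majority(view, responders):
      # any two majorities of the same view overlap in at least one node
      return len(set(responders) & view.members) > len(view.members) // 2

  v = View(vid=7, members=frozenset({1, 2, 3}))
  assert primary_of(v) == 1
  assert is_majority(v, {1, 3})       # 2 of 3 is a majority
  assert not is_majority(v, {2})      # 1 of 3 is not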
Paxos
  fault-tolerant agreement protocol
  eventually succeeds if a majority of participants are reachable
  best known algorithm

general Paxos approach
  one (or more) nodes decide to be the leader
  leader chooses a proposed value to agree on (view# and participant set)
  leader contacts participants, tries to assemble a majority
    participants are all the nodes in the old view (including unreachable ones)
    or a fixed set of configuration master nodes
  if a majority respond, we announce the result

why agreement is hard
  what if two nodes decide to be the leader?
  what if a network partition leads to two leaders?
  what if the leader crashes after persuading only some of the nodes?
  what if the leader got a majority, then failed, without announcing the result?
    or announced the result to only a few nodes?
    the new leader might choose a different value, even though we agreed

two components in Paxos:
  Proposer: sends out a value on which to achieve agreement
  Acceptor: votes yes or no on each proposal
  in practice, each server has both a proposer and an acceptor
  any proposer can propose a new view change at any point
  the goal is that acceptors, once they accept a request, will not issue a
    contradictory acceptance to another proposer
  keep in mind that a proposer may start the protocol and then crash

Paxos has three phases
  may have to start over if there are failures/timeouts
  see the handout with code (included below) or the lab 8 assignment
  we run an instance of this protocol for each view
  the "n"s in the protocol are proposal numbers, not view numbers
    per view a node may make many proposals
    each of the proposals is numbered in increasing order

phases:
  the proposer sends prepare(n) to each acceptor
  some acceptors send back prepare_ok(n_a, v_a) to the proposer, i.e. the
    value that acceptor has already accepted, if any
  if the proposer gets prepare_ok from a majority:
    if any reply carried a value, the proposer takes the v_a with the
      highest n_a (call it v'); otherwise it sets v' = v, the value its
      caller asked it to propose
    it then sends accept(n, v') to each acceptor
  an acceptor that sets its value replies with accept_ok(n)
  the proposer responds with decide to each of those ok acceptors

notes:
  an acceptor sends prepare_ok(n_a, v_a) only if the prepare(n) message had
    n > its highest n seen so far. If it sends the prepare_ok, it includes
    the value it has already accepted, if it has accepted one.
  remember when you look at the pseudocode: each time the view
    configuration changes, a new instance of Paxos is instantiated. This
    means that n_a, v_a are 0/null on each acceptor/proposer at the
    beginning of each view change. So the vote we are discussing is
    re-initiated for each view change, and you can't move to a new view
    until a majority agrees.
  the n values suggested by each proposer must be unique. Do this by
    appending the node ID to each n, so that you can monotonically increase
    values but keep them unique (see the sketch below).
  a value is chosen once a majority of acceptors have sent accept_ok(n)
  once the proposer sends decide() to the acceptors, its configuration can
    change to that view. If a node doesn't get the decide, it will poll
    eventually and learn the majority decision.
  if at any time any node gets bored (times out) it asks everyone for their
    v_a, looking for a majority. If one exists, it accepts that view. Else,
    it declares itself a proposer and starts a new Phase 1.
  if nothing goes wrong, Paxos clearly reaches agreement (one proposer will
    push the other acceptors to agree on a value).
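a sketch of the unique, increasing proposal numbers described in the notes
above: represent n as a (counter, node_id) pair compared lexicographically,
so numbers from different nodes can never collide (the helper name
next_proposal is made up):

  def next_proposal(n_h, my_n, node_id):
      # n_h and my_n are (counter, node_id) pairs; "0" is (0, 0)
      counter = max(n_h[0], my_n[0]) + 1
      return (counter, node_id)

  n1 = next_proposal((0, 0), (0, 0), node_id=1)    # -> (1, 1)
  n2 = next_proposal((0, 0), (0, 0), node_id=2)    # -> (1, 2)
  assert n1 != n2 and n2 > n1           # unique; ties broken by node ID
  assert next_proposal(n2, n1, 1) > n2  # a node can always go higher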
how do we ensure a good probability that there is only one proposer?
  every node has to be prepared to be a proposer, to cope w/ failure
  so delay a random amount of time after you realize a new view is required
  or delay your ID times some constant

key danger: nodes w/ different v_a receive a decide
  goal: if a decide *could* have been sent, future decides are guaranteed
    to have the same v_a

what if more than one proposer?
  due to timeout or partition or lost packets
  the two proposers used different n, say 10 and 11
  if 10 didn't get a majority of accepts
    it never will, since no-one will accept 10 after seeing 11's prepare
    or perhaps 10 is in a network partition
  if 10 did get a majority of accepts
    i.e. it might have sent decide
    10's majority saw 10's accept before 11's prepare
      otherwise they would have ignored 10's accept, so no majority
    so 11 will get a prepareres from at least one node that saw 10's accept
    so 11 will be aware of 10's value
    so 11 will use 10's value, rather than making a new one
    so we agreed on a v after all

what if the proposer fails before sending accepts?
  some node will time out and become a proposer
  the old proposer didn't send any decide, so we don't care what it did
  it's good, but not necessary, that the new proposer chooses a higher n
    if it doesn't, timeout and some other proposer will try
    eventually we'll get a proposer that knew the old n and will use a higher n

what if the proposer fails after sending a minority of accepts?
  same as two proposers...

what if the proposer fails after sending a majority of accepts?
  i.e. potentially after reaching agreement!
  same as two proposers...

an example scenario:
  say the last view's v = {1,2,3}
  2 fails
  1 becomes proposer, receives prepareres from 1 and 3, chooses v = {1,3}
  1 sends accept to 1 and 3, and fails
  3 starts running Paxos, but cannot make a majority; keeps trying
  2 reboots, reloads state from disk (nothing related to this view change)
  2 becomes a proposer and proposes accept with an n higher than 1 used
  3 sends accept_ok, which will include v = {1,3}
  2 and 3 will switch to view {1,3}
  2 will run Paxos again to add itself
    no majority of {1,3} is alive, however, to add in 2
    must wait until 1 is back up
  (note that if we changed the protocol, we could continue, because 3 knows
    that 1 cannot have a majority---that would require 1 and 2---but 2 is
    talking to 3 to make a view)

what if a node fails after receiving accept?
  if it doesn't restart, possible timeout in Phase 3, new proposer
  if it does restart, it must remember v_a/n_a! (on disk)
    the proposer might have failed after sending a few decides
    the new proposer must choose the same value
    our node might be the intersecting node of the two majorities

what if a node reboots after sending prepareres?
  does it have to remember n_h on disk?
  it uses n_h to reject prepare/accept with smaller n
  scenario:
    proposer 1 sends prepare(n=10), a bare majority sends prepareres
      so node X's n_h = 10
    proposer 2 sends prepare(n=11), a majority intersecting only at node X
      sends prepareres
      node X's n_h = 11
    proposer 2 got no prepareres with a value, so it chooses v=200
    node X crashes and reboots, loses n_h
    proposer 1 sends accept(n=10, v=100), its bare majority gets it
      including node X (which should have rejected it...)
      so we have agreement w/ v=100
    proposer 2 sends accept(n=11, v=200), its bare majority all accept it
      including node X, since 11 > n_h
      so we have agreement w/ v=200. oops.
  so: each node must remember n_h on disk
    if the disk crashes, remove yourself from the system
    write these values to disk before sending prepareres/accept_ok messages
    make sure that you update (n_a, v_a) atomically, so that you come back
      in a consistent state (see the sketch below)
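a minimal sketch (Python, illustrative file name and layout) of the
acceptor-side persistence rule argued above: n_h, n_a, and v_a go to disk
before the node replies, and the whole record is installed atomically via
write-to-temp-then-rename. The per-view reset of this state is left out:

  import json, os

  class Acceptor:
      def __init__(self, path="acceptor_state.json"):
          self.path = path
          self.n_h, self.n_a, self.v_a = 0, 0, None
          if os.path.exists(path):              # recover after a reboot
              with open(path) as f:
                  s = json.load(f)
              self.n_h, self.n_a, self.v_a = s["n_h"], s["n_a"], s["v_a"]

      def _persist(self):
          tmp = self.path + ".tmp"
          with open(tmp, "w") as f:             # write the whole record...
              json.dump({"n_h": self.n_h, "n_a": self.n_a, "v_a": self.v_a}, f)
              f.flush()
              os.fsync(f.fileno())              # ...force it to disk...
          os.rename(tmp, self.path)             # ...and install it atomically

      def prepare(self, n):
          if n > self.n_h:
              self.n_h = n
              self._persist()                   # before sending prepareres
              return ("prepareres", self.n_a, self.v_a)
          return ("reject",)

      def accept(self, n, v):
          if n >= self.n_h:
              self.n_a, self.v_a = n, v
              self._persist()                   # before sending accept_ok
              return ("accept_ok", n)
          return ("reject",)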
correctness requires unique n's... just do this by appending the node ID
liveness requires increasing n's... do this by incrementing some number when
  you see a larger n, or by using your system clock time, which monotonically
  increases and stays close to the other system clocks

conclusion
  what have we achieved?
  remember the original goal was replicated state machines
    and we want to continue even if some nodes are not available
  after each failure we can perform a view change using Paxos agreement
    that is, we can agree on exactly which nodes are in the new view
    so, for example, everyone can agree on a single new primary
  but we haven't talked at all about how to manage the data

-----------rtm's 2009 Paxos Handout--------------

Proposer(v):
  choose n, unique and higher than any n seen so far
  send prepare(n) to all acceptors (including self)
  if prepare_ok(n_a, v_a) from majority:
    if any v_a != null:
      v' = v_a with max n_a
    else:
      v' = v
    send accept(n, v') to all
    if accept_ok(n) from majority:
      send decided(v') to all

Acceptor:
  n_p: highest prepare seen, initially 0
  n_a: highest accept seen, initially 0
  v_a: value of the highest accept seen, initially null

  prepare(n) handler:
    if n > n_p:
      n_p = n
      reply prepare_ok(n_a, v_a)

  accept(n, v) handler:
    if n >= n_p:
      n_a = n
      v_a = v
      reply accept_ok(n)

commit point: a majority of acceptors record a particular v_a

if an acceptor times out (doesn't receive decide):
  ask all servers for v_a, see if a majority agrees
  otherwise become a proposer
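a runnable Python toy of the handout's single-instance protocol above, with
in-memory acceptors, no failures, and no RPC (purely illustrative; the class
and function names are made up):

  class Acceptor:
      def __init__(self):
          self.n_p, self.n_a, self.v_a = 0, 0, None

      def prepare(self, n):
          if n > self.n_p:
              self.n_p = n
              return ("prepare_ok", self.n_a, self.v_a)
          return None

      def accept(self, n, v):
          if n >= self.n_p:
              self.n_a, self.v_a = n, v
              return ("accept_ok", n)
          return None

  def propose(acceptors, n, v):
      # Phase 1: prepare
      oks = [r for r in (a.prepare(n) for a in acceptors) if r]
      if len(oks) <= len(acceptors) // 2:
          return None                        # no majority; retry with larger n
      # re-propose the previously accepted value with the highest n_a, if any
      accepted = [(n_a, v_a) for (_, n_a, v_a) in oks if v_a is not None]
      if accepted:
          v = max(accepted)[1]
      # Phase 2: accept
      acks = [r for r in (a.accept(n, v) for a in acceptors) if r]
      if len(acks) > len(acceptors) // 2:
          return v                           # chosen; would send decided(v) to all
      return None

  accs = [Acceptor() for _ in range(3)]
  print(propose(accs, 11, "view {1,3}"))     # -> view {1,3}
  print(propose(accs, 12, "view {1,2,3}"))   # -> view {1,3}  (value already chosen)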
------------------ paxos.txt --------------------

state:
  n_a, v_a: highest proposal # this node has accepted, and its value
  n_h: highest proposal # seen in a prepare
  my_n: the last proposal # this node has used in this round of Paxos
  vid_h: highest view number we have accepted
  views: map of past view numbers to values
  done: proposer says agreement was reached, we can start the new view

on each view change, initialize state:
  n_a = 0
  n_h = 0
  my_n = 0
  v_a = () // empty list

Paxos Phase 1
  a node (maybe more than one...) decides to be proposer
      (may not be in the current view):
    my_n = max(n_h, my_n)+1, append node ID   // unique proposal number
    done = false
    sends prepare(vid_h+1, my_n) to all nodes in
      {views[vid_h], initial contact node, itself}
  if a node receives prepare(vid, n):
    if vid <= vid_h:
      return oldview(vid, views[vid])
    else if n > n_h:
      n_h = n
      done = false
      return prepareres(n_a, v_a)
    else:
      return reject()

Paxos Phase 2
  if the proposer gets oldview(vid, v):
    views[vid] = v
    vid_h = vid
    view change
    restart Paxos
  else if the proposer gets reject():
    delay and restart Paxos
  else if the proposer gets prepareres from a majority of nodes in views[vid_h]:
    if any prepareres(n_i, v_i) exists such that v_i is not empty:
      v = the non-empty value v_i corresponding to the highest n_i received
    else the proposer gets to choose a value:
      v = set of pingable nodes (including self)
    send accept(vid_h+1, my_n, v) to all responders
  else:
    delay and restart Paxos
  if a node gets accept(vid, n, v):
    if vid <= vid_h:
      return oldview(vid, views[vid])
    else if n >= n_h:
      n_a = n
      v_a = v
      return accept_ok()
    else:
      return reject()

Paxos Phase 3
  if the proposer gets oldview(vid, v):
    views[vid] = v
    vid_h = vid
    view change
    restart Paxos
  else if the proposer gets accept_ok from a majority of nodes in views[vid_h]:
    send decide(vid_h+1, v_a) to all (including self)
  else:
    delay and restart Paxos
  if a node gets decide(vid, v):
    if vid <= vid_h:
      return oldview(vid, views[vid])
    else:
      done = true
      primary is the lowest-numbered node in v
      views[vid] = v
      vid_h = vid
      view change
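a sketch (Python, made-up names) of the Phase 3 decide handler and the
timeout path above: apply the view change, take the lowest-numbered member
as primary, and if no decide arrives, poll for v_a and adopt a majority
value; as a simplification, the majority here is counted over the polled
replies rather than over views[vid_h]:

  from collections import Counter

  class Node:
      def __init__(self, node_id):
          self.id = node_id
          self.vid_h = 0
          self.views = {}             # vid -> set of members
          self.done = False

      def on_decide(self, vid, v):
          if vid <= self.vid_h:
              return ("oldview", vid, self.views[vid])
          self.done = True
          self.views[vid] = v
          self.vid_h = vid
          primary = min(v)            # lowest-numbered node in the new view
          return ("viewchange", vid, primary)

      def on_timeout(self, peer_v_as):
          # peer_v_as: the v_a reported by each reachable node (None if empty)
          counts = Counter(tuple(sorted(v)) for v in peer_v_as if v)
          if counts:
              view, n = counts.most_common(1)[0]
              if n > len(peer_v_as) // 2:
                  return ("adopt", set(view))   # a majority already agrees
          return ("become_proposer",)           # else start a new Phase 1

  node = Node(3)
  print(node.on_decide(1, {1, 2, 3}))             # ('viewchange', 1, 1)
  print(node.on_timeout([{1, 3}, {1, 3}, None]))  # ('adopt', {1, 3})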