6.824 2007 Lecture 13: Paxos

From Paxos Made Simple, by Leslie Lamport, 2001

introduction
  2-phase commit is good if different nodes are doing different things
  but in general you have to wait for all sites and the TC to be up
    you have to know if each site voted yes or no
    and the TC must be up to decide
  not very fault-tolerant: has to wait for repair
  can we get work done even if some nodes can't be contacted?
    yes: in the special case of replication

state machine replication
  works for any kind of replicated service: storage or lock server or whatever
  every replica must see the same operations in the same order
  if the operations are deterministic, replicas will end up with the same state
  so the same sequence of operations is the hard part to implement

how to ensure all replicas see operations in the same order?
  primary + backup(s) is the scheme we will look at
  clients send all operations to the current primary
  primary chooses the order, sends to backups, replies to the client
    after getting confirmations from the backups
  if the primary fails, backups take over and are at most one operation behind

what if the primary fails?
  need to worry about that last operation, possibly not complete
  need to pick a new primary
  can't afford to have two primaries!
  suppose the lowest-numbered live server is the primary
    so after a failure, everyone pings everyone
    then everyone knows who the new primary is?
  well, maybe not:
    pings may be lost => two primaries
    pings may be delayed => two primaries
    partition => two primaries
  so if you just let backups assume they are primary once they haven't heard
    from the primary for x seconds, a network problem might leave the old
    primary alive while a backup also thinks it is the primary
  on a lock server, two primaries might then give the same lock to two machines

idea: a majority of nodes must agree on the primary
  at most one network partition can have a majority
  if there are two potential primaries, their majorities must overlap

in practice this works as three components:
  Replicated state machine---keeps system state, such as locks, etc.
    The primary shares its state with the backups.
  Configuration/Views---a sequence of views over all time.
    A view contains the list of servers that are part of that view.
    - Some convention, such as "the lowest-numbered server in the view is
      the primary", decides the primary.
    - The configuration layer periodically pings all servers. If it notices
      a server is alive or dead (different from the current view), it talks
      to the other replicas by way of Paxos to choose a new view.
  Paxos---takes requests for views from the different view-change notices,
    and ensures that each of the members of the group (or at least a
    majority) agrees on the new view (essentially a string representing
    the view).

technique: "view change" algorithm
  system goes through a sequence of views
  view: view# and set of participants
  ensure agreement on a unique successor of each view
  the participant set allows everyone to agree on the new primary

view change requires "fault-tolerant agreement"
  at most a single value is chosen
  agree despite lost messages and crashed nodes
  can't really guarantee to agree---proposers may jump back and forth if
    they keep arguing. The paper suggests a distinguished proposer to get
    liveness, but we can guarantee to *not* "agree" on different values,
    which guarantees correctness.
  suggested distinguished-proposer heuristic: each server asks Paxos for a
    vote at some random time between 0 and (its server number) seconds, so
    lower-numbered servers are more likely to cause agreement, without much
    back-and-forth over the network.
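a minimal Python sketch of the view and majority-overlap ideas above (the
names View, primary_of, and is_majority are made up for illustration, not
from the lab code):

  from dataclasses import dataclass

  @dataclass(frozen=True)
  class View:
      vid: int                # view number
      members: frozenset      # IDs of the servers in this view

  def primary_of(view):
      # convention from the notes: lowest-numbered server is the primary
      return min(view.members)

  def is_majority(view, responders):
      # any two majorities of the same view overlap in at least one node
      return len(set(responders) & view.members) > len(view.members) // 2

  v = View(vid=7, members=frozenset({1, 2, 3}))
  assert primary_of(v) == 1
  assert is_majority(v, {1, 3})       # 2 of 3 is a majority
  assert not is_majority(v, {2})      # 1 of 3 is not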
Paxos
  fault-tolerant agreement protocol
  eventually succeeds if a majority of participants are reachable
  best known algorithm

general Paxos approach
  one (or more) nodes decide to be the leader
  leader chooses a proposed value to agree on (view# and participant set)
  leader contacts participants, tries to assemble a majority
    participants are all the nodes in the old view (including unreachable ones)
    or a fixed set of configuration master nodes
  if a majority respond, we announce the result

why agreement is hard
  what if two nodes decide to be the leader?
  what if a network partition leads to two leaders?
  what if the leader crashes after persuading only some of the nodes?
  what if the leader got a majority, then failed, without announcing the result?
    or announced the result to only a few nodes?
    the new leader might choose a different value, even though we agreed

two components in Paxos:
  Proposer: sends out a value on which to achieve agreement
  Acceptor: votes yes or no on each proposal
  in practice, each server has both a proposer and an acceptor
  any proposer can propose a new view change at any point
  the goal is that acceptors, once they accept a request, will not issue a
    contradictory acceptance to another proposer
  keep in mind that a proposer may start the protocol and then crash

Paxos has three phases
  may have to start over if there are failures/timeouts
  see the handout with code (included below) or the lab 8 assignment
  we run an instance of this protocol for each view
  the "n"s in the protocol are proposal numbers, not view numbers
    per view a node may make many proposals
    each of the proposals is numbered in increasing order

phases:
  the proposer sends prepare(n) to each acceptor
  some acceptors send back prepare_ok(n_a, v_a) to the proposer, i.e. the
    value that acceptor has already accepted, if any
  if the proposer gets prepare_ok from a majority:
    if any reply carried a value, the proposer takes the v_a with the
      highest n_a (call it v'); otherwise it sets v' = v, the value its
      caller asked it to propose
    it then sends accept(n, v') to each acceptor
  an acceptor that sets its value replies with accept_ok(n)
  the proposer responds with decide to each of those ok acceptors

notes:
  an acceptor sends prepare_ok(n_a, v_a) only if the prepare(n) message had
    n > its highest n seen so far. If it sends the prepare_ok, it includes
    the value it has already accepted, if it has accepted one.
  remember when you look at the pseudocode: each time the view
    configuration changes, a new instance of Paxos is instantiated. This
    means that n_a, v_a are 0/null on each acceptor/proposer at the
    beginning of each view change. So the vote we are discussing is
    re-initiated for each view change, and you can't move to a new view
    until a majority agrees.
  the n values suggested by each proposer must be unique. Do this by
    appending the node ID to each n, so that you can monotonically increase
    values but keep them unique (see the sketch below).
  a value is chosen once a majority of acceptors have sent accept_ok(n)
  once the proposer sends decide() to the acceptors, its configuration can
    change to that view. If a node doesn't get the decide, it will poll
    eventually and learn the majority decision.
  if at any time any node gets bored (times out) it asks everyone for their
    v_a, looking for a majority. If one exists, it accepts that view. Else,
    it declares itself a proposer and starts a new Phase 1.
  if nothing goes wrong, Paxos clearly reaches agreement (one proposer will
    push the other acceptors to agree on a value).
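a sketch of the unique, increasing proposal numbers described in the notes
above: represent n as a (counter, node_id) pair compared lexicographically,
so numbers from different nodes can never collide (the helper name
next_proposal is made up):

  def next_proposal(n_h, my_n, node_id):
      # n_h and my_n are (counter, node_id) pairs; "0" is (0, 0)
      counter = max(n_h[0], my_n[0]) + 1
      return (counter, node_id)

  n1 = next_proposal((0, 0), (0, 0), node_id=1)    # -> (1, 1)
  n2 = next_proposal((0, 0), (0, 0), node_id=2)    # -> (1, 2)
  assert n1 != n2 and n2 > n1           # unique; ties broken by node ID
  assert next_proposal(n2, n1, 1) > n2  # a node can always go higher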
how do we ensure a good probability that there is only one proposer?
  every node has to be prepared to be a proposer, to cope w/ failure
  so delay a random amount of time after you realize a new view is required
  or delay your ID times some constant

key danger: nodes w/ different v_a receive a decide
  goal: if a decide *could* have been sent, future decides are guaranteed
    to have the same v_a

what if more than one proposer?
  due to timeout or partition or lost packets
  the two proposers used different n, say 10 and 11
  if 10 didn't get a majority of accepts
    it never will, since no-one will accept 10 after seeing 11's prepare
    or perhaps 10 is in a network partition
  if 10 did get a majority of accepts
    i.e. it might have sent decide
    10's majority saw 10's accept before 11's prepare
      otherwise they would have ignored 10's accept, so no majority
    so 11 will get a prepareres from at least one node that saw 10's accept
    so 11 will be aware of 10's value
    so 11 will use 10's value, rather than making a new one
    so we agreed on a v after all

what if the proposer fails before sending accepts?
  some node will time out and become a proposer
  the old proposer didn't send any decide, so we don't care what it did
  it's good, but not necessary, that the new proposer chooses a higher n
    if it doesn't, timeout and some other proposer will try
    eventually we'll get a proposer that knew the old n and will use a higher n

what if the proposer fails after sending a minority of accepts?
  same as two proposers...

what if the proposer fails after sending a majority of accepts?
  i.e. potentially after reaching agreement!
  same as two proposers...

an example scenario:
  say the last view's v = {1,2,3}
  2 fails
  1 becomes proposer, receives prepareres from 1 and 3, chooses v = {1,3}
  1 sends accept to 1 and 3, and fails
  3 starts running Paxos, but cannot make a majority; keeps trying
  2 reboots, reloads state from disk (nothing related to this view change)
  2 becomes a proposer and proposes accept with an n higher than 1 used
  3 sends accept_ok, which will include v = {1,3}
  2 and 3 will switch to view {1,3}
  2 will run Paxos again to add itself
    no majority of {1,3} is alive, however, to add in 2
    must wait until 1 is back up
  (note that if we changed the protocol, we could continue, because 3 knows
    that 1 cannot have a majority---that would require 1 and 2---but 2 is
    talking to 3 to make a view)

what if a node fails after receiving accept?
  if it doesn't restart, possible timeout in Phase 3, new proposer
  if it does restart, it must remember v_a/n_a! (on disk)
    the proposer might have failed after sending a few decides
    the new proposer must choose the same value
    our node might be the intersecting node of the two majorities

what if a node reboots after sending prepareres?
  does it have to remember n_h on disk?
  it uses n_h to reject prepare/accept with smaller n
  scenario:
    proposer 1 sends prepare(n=10), a bare majority sends prepareres
      so node X's n_h = 10
    proposer 2 sends prepare(n=11), a majority intersecting only at node X
      sends prepareres
      node X's n_h = 11
    proposer 2 got no prepareres with a value, so it chooses v=200
    node X crashes and reboots, loses n_h
    proposer 1 sends accept(n=10, v=100), its bare majority gets it
      including node X (which should have rejected it...)
      so we have agreement w/ v=100
    proposer 2 sends accept(n=11, v=200), its bare majority all accept it
      including node X, since 11 > n_h
      so we have agreement w/ v=200. oops.
  so: each node must remember n_h on disk
    if the disk crashes, remove yourself from the system
    write these values to disk before sending prepareres/accept_ok messages
    make sure that you update (n_a, v_a) atomically, so that you come back
      in a consistent state (see the sketch below)
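a minimal sketch (Python, illustrative file name and layout) of the
acceptor-side persistence rule argued above: n_h, n_a, and v_a go to disk
before the node replies, and the whole record is installed atomically via
write-to-temp-then-rename. The per-view reset of this state is left out:

  import json, os

  class Acceptor:
      def __init__(self, path="acceptor_state.json"):
          self.path = path
          self.n_h, self.n_a, self.v_a = 0, 0, None
          if os.path.exists(path):              # recover after a reboot
              with open(path) as f:
                  s = json.load(f)
              self.n_h, self.n_a, self.v_a = s["n_h"], s["n_a"], s["v_a"]

      def _persist(self):
          tmp = self.path + ".tmp"
          with open(tmp, "w") as f:             # write the whole record...
              json.dump({"n_h": self.n_h, "n_a": self.n_a, "v_a": self.v_a}, f)
              f.flush()
              os.fsync(f.fileno())              # ...force it to disk...
          os.rename(tmp, self.path)             # ...and install it atomically

      def prepare(self, n):
          if n > self.n_h:
              self.n_h = n
              self._persist()                   # before sending prepareres
              return ("prepareres", self.n_a, self.v_a)
          return ("reject",)

      def accept(self, n, v):
          if n >= self.n_h:
              self.n_a, self.v_a = n, v
              self._persist()                   # before sending accept_ok
              return ("accept_ok", n)
          return ("reject",)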
correctness requires unique n's... just do this by appending the node ID
liveness requires increasing n's... do this by incrementing some number when
  you see a larger n, or by using your system clock time, which monotonically
  increases and stays close to the other system clocks

conclusion
  what have we achieved?
  remember the original goal was replicated state machines
    and we want to continue even if some nodes are not available
  after each failure we can perform a view change using Paxos agreement
    that is, we can agree on exactly which nodes are in the new view
    so, for example, everyone can agree on a single new primary
  but we haven't talked at all about how to manage the data

-----------rtm's 2009 Paxos Handout--------------

Proposer(v):
  choose n, unique and higher than any n seen so far
  send prepare(n) to all acceptors (including self)
  if prepare_ok(n_a, v_a) from majority:
    if any v_a != null:
      v' = v_a with max n_a
    else:
      v' = v
    send accept(n, v') to all
    if accept_ok(n) from majority:
      send decided(v') to all

Acceptor:
  n_p: highest prepare seen, initially 0
  n_a: highest accept seen, initially 0
  v_a: value of the highest accept seen, initially null

  prepare(n) handler:
    if n > n_p:
      n_p = n
      reply prepare_ok(n_a, v_a)

  accept(n, v) handler:
    if n >= n_p:
      n_a = n
      v_a = v
      reply accept_ok(n)

commit point: a majority of acceptors record a particular v_a

if an acceptor times out (doesn't receive decide):
  ask all servers for v_a, see if a majority agrees
  otherwise become a proposer
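a runnable Python toy of the handout's single-instance protocol above, with
in-memory acceptors, no failures, and no RPC (purely illustrative; the class
and function names are made up):

  class Acceptor:
      def __init__(self):
          self.n_p, self.n_a, self.v_a = 0, 0, None

      def prepare(self, n):
          if n > self.n_p:
              self.n_p = n
              return ("prepare_ok", self.n_a, self.v_a)
          return None

      def accept(self, n, v):
          if n >= self.n_p:
              self.n_a, self.v_a = n, v
              return ("accept_ok", n)
          return None

  def propose(acceptors, n, v):
      # Phase 1: prepare
      oks = [r for r in (a.prepare(n) for a in acceptors) if r]
      if len(oks) <= len(acceptors) // 2:
          return None                        # no majority; retry with larger n
      # re-propose the previously accepted value with the highest n_a, if any
      accepted = [(n_a, v_a) for (_, n_a, v_a) in oks if v_a is not None]
      if accepted:
          v = max(accepted)[1]
      # Phase 2: accept
      acks = [r for r in (a.accept(n, v) for a in acceptors) if r]
      if len(acks) > len(acceptors) // 2:
          return v                           # chosen; would send decided(v) to all
      return None

  accs = [Acceptor() for _ in range(3)]
  print(propose(accs, 11, "view {1,3}"))     # -> view {1,3}
  print(propose(accs, 12, "view {1,2,3}"))   # -> view {1,3}  (value already chosen)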
------------------ paxos.txt --------------------

state:
  n_a, v_a: highest proposal # this node has accepted, and its value
  n_h: highest proposal # seen in a prepare
  my_n: the last proposal # this node has used in this round of Paxos
  vid_h: highest view number we have accepted
  views: map of past view numbers to values
  done: proposer says agreement was reached, we can start the new view

on each view change, initialize state:
  n_a = 0
  n_h = 0
  my_n = 0
  v_a = () // empty list

Paxos Phase 1
  a node (maybe more than one...) decides to be proposer
      (may not be in the current view):
    my_n = max(n_h, my_n)+1, append node ID   // unique proposal number
    done = false
    sends prepare(vid_h+1, my_n) to all nodes in
      {views[vid_h], initial contact node, itself}
  if a node receives prepare(vid, n):
    if vid <= vid_h:
      return oldview(vid, views[vid])
    else if n > n_h:
      n_h = n
      done = false
      return prepareres(n_a, v_a)
    else:
      return reject()

Paxos Phase 2
  if the proposer gets oldview(vid, v):
    views[vid] = v
    vid_h = vid
    view change
    restart Paxos
  else if the proposer gets reject():
    delay and restart Paxos
  else if the proposer gets prepareres from a majority of nodes in views[vid_h]:
    if any prepareres(n_i, v_i) exists such that v_i is not empty:
      v = the non-empty value v_i corresponding to the highest n_i received
    else the proposer gets to choose a value:
      v = set of pingable nodes (including self)
    send accept(vid_h+1, my_n, v) to all responders
  else:
    delay and restart Paxos
  if a node gets accept(vid, n, v):
    if vid <= vid_h:
      return oldview(vid, views[vid])
    else if n >= n_h:
      n_a = n
      v_a = v
      return accept_ok()
    else:
      return reject()

Paxos Phase 3
  if the proposer gets oldview(vid, v):
    views[vid] = v
    vid_h = vid
    view change
    restart Paxos
  else if the proposer gets accept_ok from a majority of nodes in views[vid_h]:
    send decide(vid_h+1, v_a) to all (including self)
  else:
    delay and restart Paxos
  if a node gets decide(vid, v):
    if vid <= vid_h:
      return oldview(vid, views[vid])
    else:
      done = true
      primary is the lowest-numbered node in v
      views[vid] = v
      vid_h = vid
      view change
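a sketch (Python, made-up names) of the Phase 3 decide handler and the
timeout path above: apply the view change, take the lowest-numbered member
as primary, and if no decide arrives, poll for v_a and adopt a majority
value; as a simplification, the majority here is counted over the polled
replies rather than over views[vid_h]:

  from collections import Counter

  class Node:
      def __init__(self, node_id):
          self.id = node_id
          self.vid_h = 0
          self.views = {}             # vid -> set of members
          self.done = False

      def on_decide(self, vid, v):
          if vid <= self.vid_h:
              return ("oldview", vid, self.views[vid])
          self.done = True
          self.views[vid] = v
          self.vid_h = vid
          primary = min(v)            # lowest-numbered node in the new view
          return ("viewchange", vid, primary)

      def on_timeout(self, peer_v_as):
          # peer_v_as: the v_a reported by each reachable node (None if empty)
          counts = Counter(tuple(sorted(v)) for v in peer_v_as if v)
          if counts:
              view, n = counts.most_common(1)[0]
              if n > len(peer_v_as) // 2:
                  return ("adopt", set(view))   # a majority already agrees
          return ("become_proposer",)           # else start a new Phase 1

  node = Node(3)
  print(node.on_decide(1, {1, 2, 3}))             # ('viewchange', 1, 1)
  print(node.on_timeout([{1, 3}, {1, 3}, None]))  # ('adopt', {1, 3})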