6.824 2009 Lecture 12: state machine replication

General topic area: fault tolerance
  last week: how to recover a single machine from a power failure
    durability, but while the machine is down the service is unavailable
  topic of the next few lectures: high availability
    use multiple computers to provide the service,
    so that if one fails another can take over

Today's topic: state machine replication in hypervisors
  this design makes several simplifying assumptions;
  in future lectures we'll get rid of those assumptions

State machine replication works for any kind of replicated service:
  storage, lock server, whatever
  every replica must see the same operations in the same order
  if operations are deterministic, replicas end up with the same state
    so you can't look at the system clock or call rand()
    e.g. machines granting locks on a file server must replicate in lock-step
  how to handle concurrent client requests to different replicas?
    an easy option: have a master order them
    master-slave organization: the slaves do whatever the master does,
      so different replicas never apply concurrent requests in different orders
  how many replicas?
    assume repair: if one fails, it comes back
    enough to survive failure(s) during the time it takes a new one to start

Handling failures is tough
  the master only responds once it gets confirmation that all slaves have
    replicated the state
    that way, if it fails, we know a slave is in the same state
    this is essentially 2-phase commit
  not perfect: if the master goes down and the commit record doesn't reach
    all the machines, the system is frozen and the slaves can't take over
  so we actually need to re-elect a master if the master fails
  we also need to agree on who is down and who is actively participating
  both are election/consensus questions, solved by the Paxos algorithm,
    which will be covered later

So how many replicas do you need? depends on your failure model
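The ordering argument above can be sketched in a few lines. This is my own minimal illustration (not code from the paper): each replica applies the same deterministic operations in the same master-chosen order, so all replicas end in the same state.

```python
# Minimal sketch of state machine replication, assuming a tiny
# key/value service; all names here are invented for illustration.

class Replica:
    def __init__(self):
        self.state = {}            # the replicated service state

    def apply(self, op):
        # ops must be deterministic: no clock reads, no rand()
        kind, key, val = op
        if kind == "put":
            self.state[key] = val

def replicate(log, replicas):
    # the master's single ordered log is applied at every replica
    for op in log:
        for r in replicas:
            r.apply(op)

log = [("put", "x", 1), ("put", "y", 2), ("put", "x", 3)]
a, b = Replica(), Replica()
replicate(log, [a, b])
assert a.state == b.state == {"x": 3, "y": 2}
```

If the replicas saw the two writes to "x" in different orders they would diverge, which is exactly why concurrent requests must be serialized by the master.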
  how long does it take to bring another machine up?
  how frequently do machines fail?
  how correlated are the failures?
  usually people use 3 replicas: if one goes down and you need a majority
    to vote on something (master election/consensus), you still have 2 of
    3 machines up

Case study: Hypervisor-based Fault-tolerance
  Bressoud and Schneider, SOSP 1995

Why are we reading this paper?
  we're about to look at a bunch of systems that replicate for fault tolerance
  in labs 7 and 8 you implement a replicated state machine
  the hypervisor is at one extreme of the spectrum in terms of simplicity
    two machines run the exact same instructions
    if one fails, the other just keeps going; no time/data/work is lost
    totally transparent: runs existing O/S and apps
    only a factor of two slower
  seems like an amazingly general/easy/practical solution to fault tolerance

The hypervisor is a layer that sits between the hardware and the O/S
  it ensures that every instruction executed on the primary is also
    executed on all backups
  but how can it do that without slowing the system down?
    sending a message between machines over ethernet takes milliseconds;
    executing an instruction takes nanoseconds
  it turns out they pay only a 2x slowdown,
    and VMware employs a similar model today

Where they are coming from
  banks, telephone exchanges, NASA need fault-tolerant computers
  the 1980s saw lots of *hardware* fault-tolerant machines
    assumption: the CPU is the part most likely to fail, due to complexity
    so two CPUs connected to a bus, single memory, disk, &c
    all in the same cabinet; not distributed over the Internet
  the hypervisor is all s/w! runs on stock h/w

What is the modern version of this paper?
  goal: a fault-tolerant web application with its durable state on a NAS
  one option: rewrite the web application to use a replicated state machine
    but the web application programmers might not have the expertise
  another option: apply the replicated state machine at the OS level (Linux)
    but the OS is a big program with many interfaces---maybe hard to pull off
    in addition, you may not have the source
  other option: put the web site/Linux on a replicated VM (VMware/Xen, ...)
    advantage: works with any web application (modulo some assumptions)

Challenges:
  - performance
    you don't have to send a message for every instruction: if you have
    10,000 adds, just chunk them and send them together in one group
    (called an epoch)
  - some instructions cannot just be executed on each machine,
    so you have to capture them:
    -> interrupts: come from a device outside the CPU and may randomly hit
       you after any instruction; catch these and handle them at the end
       of each epoch
    -> reads of the system clock
    -> page/TLB replacement policies that are random rather than
       least-recently or most-recently used
    -> memory-mapped devices: a device snoops on writes to memory for
       commands to write to disk or the network by some protocol; you have
       to know how the device watches memory so you can replicate that
       behavior on each machine
       but that's not perfect: disks have physical block failures, so
       different events pop up on different machines
       to solve this, they assumed a shared disk and ignored all disk
       write instructions on backup machines

suppose the primary fails during epoch E+1
  the primary had waited for ACKs for E's msgs (P2),
    so the backup can finish E without trouble
  the backup will be waiting in P5/P7 for [end,E+1]
  P7 says the backup isn't primary until E+2
  so how does the backup get environment results during E+1, i.e. in P6?

can the primary and backup have distinct disks?
  or must they talk to the same disk, as in Figure 1?
  same disk -- otherwise the backup can't retry lost SCSI ops...
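The epoch idea above can be sketched concretely. This is a hypothetical illustration (names and the tiny epoch size are invented): interrupts that arrive mid-epoch are buffered and delivered only at the epoch boundary, so they land at the same instruction count on both primary and backup.

```python
# Sketch of epoch-based interrupt delivery, assuming a toy epoch of
# 4 instructions (real systems use thousands).

EPOCH = 4

def run_epochs(n_instr, interrupts):
    """interrupts: {arrival_instruction: name}; returns delivery points."""
    delivered = []      # (instruction_count, interrupt_name) pairs
    pending = []
    for i in range(1, n_instr + 1):
        if i in interrupts:
            pending.append(interrupts[i])   # buffer, don't deliver yet
        if i % EPOCH == 0:                  # epoch boundary
            for name in pending:
                delivered.append((i, name))
            pending = []
    return delivered

# The primary sees the raw interrupt at instruction 2, but because
# delivery is deferred to the boundary, both machines agree it happens
# "at instruction 4" -- the backup just replays the same list.
primary = run_epochs(8, {2: "disk-done"})
assert primary == [(4, "disk-done")]
```

The cost is latency: an interrupt can wait up to a whole epoch before the guest O/S sees it, which is part of the paper's performance trade-off.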
to handle primary failure
  since the backup isn't issuing certain disk operations &c, they require
    the primary to send the result of running each operation to the slaves,
    so they can guarantee everything worked
  if no response comes, the hypervisor reports an error in the device
    driver, which causes the O/S to recover on the new master
    this works for disk writes or TCP connections, which are idempotent
  it's bad for sending a printer a print instruction, or for an ATM's
    money withdrawal logic, as those operations are not idempotent
  we'll think of ways to get around this later

Implementation details

Straw man: two identical machines
  start with the same memory contents (program and data)
  just start them executing!
  if one fails, the other keeps going
  it's *fast* -- just as fast as an ordinary single computer!
  are we done?

Q: will they perform the same computation?
  theorem: yes, if they execute the same operations, in the same order,
    and each operation has a deterministic effect
  ADD instruction &c: probably deterministic
  instructions that read the time-of-day register, cycle counter, or
    privilege level: not deterministic
  memory system?
  local devices like the hard disk?
    low level: interrupt timing
    high level: whether disk read/write results are deterministic
      errored blocks...
  external input devices like network, keyboard?
    both CPUs need to get each input
  output devices like network, display, printer?
    exactly one CPU should produce each output

Let's figure out how to deal with local devices (e.g. hard disk)
  why not replicate all I/O hardware on each machine?
    e.g. let both machines read/write their own disk
    writes should be identical, thus reads also
  maybe some h/w replicas don't behave the same
    e.g. the o/s knows how to re-map a failed disk sector,
    but the two disks won't fail at the same sectors
  so why did they replicate the main-memory system?
    doesn't fail? would be too slow to share?
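The idempotency point above is worth seeing concretely. This is my own illustration (not the paper's code) of why failover recovery via O/S retry only works for idempotent operations: re-running a disk write leaves the same final state, but re-running a printer or ATM operation duplicates its effect.

```python
# Sketch: after failover the new primary can't tell whether the old
# primary completed an in-flight operation, so the driver re-issues it.

disk = {}
printed = []

def disk_write(block, data):
    disk[block] = data          # idempotent: same state if repeated

def print_page(page):
    printed.append(page)        # NOT idempotent: repeats duplicate output

def failover_retry(op, *args):
    op(*args)                   # maybe the old primary already did this
    op(*args)                   # ...and the new primary retries it

failover_retry(disk_write, 7, "hello")
failover_retry(print_page, "receipt")
assert disk == {7: "hello"}                # retry was harmless
assert printed == ["receipt", "receipt"]   # retry printed twice -- bad
```

This is exactly the keyboard/line printer/ATM caveat in the IO1/IO2 discussion later in the notes.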
So they hook up all devices to both machines: shared I/O bus
  sending commands and data to the device
    only the "primary" writes the device
    the hypervisor ignores the backup's I/O writes
  reading results back from the device
    only the primary reads
    the primary's hypervisor copies the value to the backup
    the backup's hypervisor stalls until the correct value arrives

handling interrupts
  only the primary CPU sees real h/w interrupts
    its hypervisor forwards them to the backup
  how to ensure the two machines see interrupts at the same time?
    they cannot occur at different points in the app instruction stream --
      that might cause different behavior
    could execute instructions one at a time on both primary and backup:
      the backup waits until it knows whether an interrupt happened after
      the previous instruction. slow...
    epochs! CPU h/w can be told to interrupt every N instructions,
      i.e. at exactly the same point on both machines
      the primary's hypervisor delays all interrupts until the end of
        the epoch
      the backup has to wait at the end of its epoch:
        maybe the primary hasn't reached epoch end, hasn't seen all
        interrupts yet
      at each epoch end, primary and backup deliver all interrupts

what about fail-over?
  if the primary fails, the backup must start doing I/O
  how does the backup know the primary has failed?
  what about I/O writes in the last epoch: might or might not have happened
  what if the backup is waiting for an I/O read?
    did the primary even issue the read request?
    did the result come back, but not yet get sent to the backup?
  in the end, the O/S has to re-try the I/O!

do all devices follow the IO1/IO2 rules?
  the hypervisor has to understand that an operation was started but not
    finished, so it can generate an "uncertain" interrupt
  the device has to support repeated operations correctly
  O/S device drivers have to know to re-try operations
    OK for disk read and disk write
      some SCSI drivers notice time-outs and re-try
    maybe OK for network (TCP will sort it out)
    not so OK for keyboard, line printer, ATM bill vendor

how does the hypervisor know about incoming I/O data from devices? e.g.
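The shared-bus read rule can be sketched as follows (all names here are assumed, for illustration): only the primary actually reads the device; its hypervisor forwards the value, and the backup's hypervisor stalls until the forwarded value arrives, so both CPUs see identical inputs.

```python
# Sketch of read-value forwarding from primary to backup hypervisor,
# using a queue to stand in for the inter-machine link.

import queue

channel = queue.Queue()         # primary -> backup hypervisor link

def primary_read(device_value):
    # the primary really touches the device, then forwards the value
    channel.put(device_value)
    return device_value

def backup_read():
    # the backup never touches the device; it blocks on the channel
    return channel.get(timeout=1)

v1 = primary_read(0xBEEF)
v2 = backup_read()
assert v1 == v2 == 0xBEEF
```

If the primary dies between the device read and the forward, the backup's blocking `get` is exactly the "waiting for an I/O read" failover case above: the O/S on the promoted backup has to re-try the I/O.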
  e.g. disk blocks just DMA'd from the disk
    the hypervisor needs to send the block data to the secondary
    no problem for programmed I/O (the hypervisor can trap on LD)
    puzzle: DMA. how does the hypervisor know?

is the machine virtualizable?
  we can always interpret, SLOWLY
  real question: can I run the o/s and apps on the real hardware?
    will the hardware trap to the hypervisor on *every* tricky instruction?
    i.e. any instruction that might be non-deterministic
      OR reveal the existence of the hypervisor?
  time-of-day register
  memory-mapped I/O loads and stores
    need to trap and substitute load values
  HP branch-and-link instruction
  HP TLB replacement
  HP "space registers" (page mapping tables?) writable by users (!)

what if the ethernet cable breaks?
  the primary is still running
  the backup will promote itself
  that's part of what the fail-stop assumption assumes away...

what if we want to re-start a failed+repaired primary?

is this the way we should build all fault-tolerant systems?
  how about the library approach in lab 7?
  what are the "ordinary" instructions?
  what are the "environment" instructions?
  how does lab 7 deal with lost environment instructions?