Previous failure model
- failstop failure: machines crash or become unavailable
- a failure detector sends heartbeat messages and, if it gets no response, removes the unreachable node
- so the detector might be overzealous, but it _will_ eventually catch all failed nodes

Byzantine failure model
- nodes fail in completely arbitrary ways
- adversarial model: someone breaks the code and messes you up in the worst way possible
- bugs and misconfigurations also fall under this model
- depends on non-correlation of failures. Just as the failstop model assumes not all nodes fail at once, here you assume N different copies of the code/operating system/root passwords, so they won't all fail in the same way.

No way to write a failure detector for the byzantine model
- an adversary will just hack the code to respond exactly the way the failure detector expects, so you'd never notice a difference

Initial setup is described below:

Same primary-backup model for replication
- primary sends all requests to backups
- backups execute them
- backups reply to the primary
- primary replies to the client
- require responses from _all_ of the nodes before responding

View change---when a node goes down/fails, run paxos to agree on a new view

Recovery from failure---state transfer from a working copy
- makes rollback easy, since you don't know in advance who will become the new primary when paxos elects one
- since you don't reply to the client until all backups succeed, the client will be none the wiser
- if we only required a majority of backups to acknowledge each state change, then state transfer after a failure would have to consult a majority of them to find the latest state

How many replicas to tolerate f failstop faults? 2f+1---f can fail, and the remaining f+1 still form a majority that gives the correct response.

How many replicas to tolerate f byzantine faults? 3f+1---if f are cut off from the network and another f are faulty, then f+1 correct replicas remain to tell you the truth (enough to outvote the f liars).

So how do we change the protocol a bit to handle all of these failures?
- each node cryptographically signs its responses so that the client can verify the primary isn't lying
- the primary responds to the client after hearing from 2f+1 of the replicas
- since the primary might be malicious, it might send two different state changes to two backups. The backups only talk to the primary, so they wouldn't know that they were inconsistent. This is solved by a pre-prepare message.
- view change must change to handle BFT

Pre-prepare message (see the quorum-counting sketch below)
- client sends the operation to the primary
- primary sends a pre-prepare to each replica
- replicas respond with a signed prepare message saying they are willing to assign that viewstamp to the given operation
- primary responds to all replicas with the 2f+1 prepare messages. Now the replicas can run the operation, after checking the signatures against the primary's public key, that all previous operations have run, and that the viewstamp is correct.
- replicas respond to the primary saying they committed the result. The primary responds to them saying which 2f+1 committers it heard from. Then 2f+1 of the replicas respond to the client, and now everyone knows that no faulty node could have harmed the system.

Note: as an optimization, get rid of the pre-prepare->prepare round trip by having all replicas multicast their prepare message to all other replicas, at which point they know they are all good to commit without an extra round trip through the primary.

The paper mentions optimizations for multiple reads in a row, and for batching a bunch of operations into one, to avoid the round trips.
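To make the 2f+1 counting concrete, here is a minimal Python sketch of the quorum bookkeeping a replica might do during the prepare/commit exchange above. The names (Replica, on_prepare, on_commit, QUORUM) and the message fields are illustrative assumptions, not the paper's interface, and signature verification is elided.

    # Minimal sketch of replica-side quorum counting for the normal-case
    # protocol above.  Tolerating f byzantine faults needs n = 3f+1
    # replicas and quorums of 2f+1.  In the real protocol every message
    # below is signed and verified; that is omitted here.

    F = 1                    # byzantine faults tolerated
    N = 3 * F + 1            # total replicas
    QUORUM = 2 * F + 1       # prepare/commit quorum size

    class Replica:
        def __init__(self, replica_id, view=0):
            self.id = replica_id
            self.view = view
            self.prepares = {}   # (view, seq, op_digest) -> set of replica ids
            self.commits = {}    # (view, seq, op_digest) -> set of replica ids

        def on_prepare(self, view, seq, op_digest, sender):
            # Record a signed prepare; the operation is "prepared" once
            # 2f+1 replicas agree to bind (view, seq) to this operation.
            key = (view, seq, op_digest)
            self.prepares.setdefault(key, set()).add(sender)
            return len(self.prepares[key]) >= QUORUM

        def on_commit(self, view, seq, op_digest, sender):
            # Record a signed commit; once 2f+1 commits are seen the
            # operation can be executed in sequence-number order and the
            # result sent back toward the client.
            key = (view, seq, op_digest)
            self.commits.setdefault(key, set()).add(sender)
            return len(self.commits[key]) >= QUORUM

The quorum size is what defeats the two-different-state-changes attack: any two quorums of 2f+1 out of 3f+1 replicas overlap in at least f+1 replicas, so at least one correct replica would have to sign both conflicting bindings, which it won't.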
How must the view change be different to handle BFT? (a sketch of the new primary's bookkeeping follows at the end of these notes)
- step through primaries in round-robin fashion on each view change
- the new primary issues a view-change request to everyone, and everyone responds with a signed certificate of all the prepared messages they have committed
- once at least 2f+1 nodes respond saying they want the view change, the primary announces the new view to everyone
- the primary also re-sends any changes that might need to be replayed on nodes that missed the commit. Replicas that already committed a change won't re-run it.

Note: unlike paxos, which needs to agree on a value (requiring a third step), here we already know the value---primaries rotate round-robin---so we only have to agree to move to the next view.

Since a crash that loses a replica's state can be treated as just another byzantine failure, you don't have to sync to disk on each commit. You do have to sync on a view change or a checkpoint, but in normal operation you save the disk sync after each write, since your replicas tolerate the failures for you (assuming at least f+1 stay alive in the 3f+1-node system).
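A similarly hedged sketch of the round-robin view change: because the next primary is a pure function of the view number, the only agreement needed is that 2f+1 replicas want to move on. The names (primary_for, NewViewCollector, ops_to_replay) and the shape of the certificates are assumptions for illustration, not the paper's API.

    # Rough sketch of the round-robin view change described above.
    # The next primary is determined by the view number alone, so there
    # is no value to agree on -- only that the view has moved forward.

    N = 4                    # 3f+1 replicas with f = 1
    QUORUM = 3               # 2f+1

    def primary_for(view):
        # Primaries rotate round-robin with the view number.
        return view % N

    class NewViewCollector:
        def __init__(self, new_view):
            self.new_view = new_view
            self.certs = {}          # replica id -> certificate of prepared ops

        def on_view_change(self, replica_id, prepared_cert):
            # prepared_cert: iterable of (view, seq, op_digest) tuples the
            # replica has prepared.  Once 2f+1 replicas have asked for this
            # view, the new primary can announce it to everyone.
            self.certs[replica_id] = list(prepared_cert)
            return len(self.certs) >= QUORUM

        def ops_to_replay(self):
            # Union of everything any certificate says was prepared: the new
            # primary re-issues these so stragglers catch up, while replicas
            # that already committed an operation ignore the replay.
            replay = {}
            for _view, seq, digest in (op for cert in self.certs.values() for op in cert):
                replay[seq] = digest
            return replay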