11/6/2008

Distributed Transaction Processing

Reading:
 * C. Mohan, Bruce Lindsay, and R. Obermarck. Transaction Management in the R* Distributed Database Management System. ACM Transactions on Database Systems 11(4), 1986. In Red Book.

Distributed Transactions: to guarantee ACID, need multi-site work to be ALL committed, or ALL aborted.

Timeline (msg(X, op) means send op to X):

  A:                                              msg(coord, OK)                        msg(coord, done)
  coordinator:  msg(A, decrement)  msg(B, inc)                     msg({A,B}, commit)
  B:                                              msg(coord, OK)                        msg(coord, done)

If A or B crashes after msg({A,B}, commit), the other will commit regardless!

fix:
 1) have coordinator wait around and resend commit when A comes back up
 2) A reads messages before running recovery

problem with fix: if A crashed before writing its log record, but after the coordinator sent commit, then it doesn't know WHAT to commit. So the fix is not a fix.

So another option is to send a message to each node asking it to write its log out and acknowledge the write before you send any commits. This message is a PREPARE message, adding another round trip to each machine. Once A and B are prepared, coordinator commits, and sends COMMIT to A and B as many times as is necessary for them to respond "commit."
(note: coordinator logs PREPARE and COMMIT locally before sending those messages, to handle crashes)

That protocol is called 2-phase commit (2PC). It's too expensive to do the round-trips over wide-area networks. So this is only the answer on a LAN.

So A/B/coord can each be working, prepared, or committed. Failing from working or prepared puts them in the abort state.

Scenarios:
 - coordinator crashes
   -> A + B have an election for a replacement coordinator (distributed elections are HARD)
   -> new coordinator polls everyone else, and tells everyone to commit if anyone committed.
 - A or B crashes
   -> coordinator waits around for them to come back, and catches them up
 - A + coordinator crash, B is prepared
   -> A and coordinator might be committed, or they might be aborted
   -> since you don't know what state A/coord are in, you must block until someone comes up and gives you information.
   -> awful: blocking means you hold on to your locks, blocking other transactions

Crashes need to be short, otherwise system blocks.

=Warehouses=

Updates come in bulk (update all purchases for store X for the day)
Idea: split load stream into single-site (single machine) transactions, which are periodically committed, and recovered in bulk.
So warehouses don't have this problem - you can always resume bulk loading from the last committed point before the failure.

=OLTP=

eBay hashed on item ID to a given machine. So each transaction is single-sited. So recovery is easy - failures only ever involve one machine.
goal: always make your transactions single-sited. Don't use 2PC, just keep transactions to one machine.

=Crashes=

 - Transaction deadlocks/aborts -> standard logic solves our problem.
 - Application fails -> coordinator doesn't know what to do next, so commit committed transactions, abort any working transactions, and do the right thing with prepared transactions (commit if they are all prepared). This is essentially not a DB problem.
 - Operating system fails (this is why admins don't update to the latest system for a couple of years). This means DB is out of business, meaning we have a DB crash. So, let's handle that:
 - DBMS crashes. Different kinds of bugs:
   - bohr-bugs are repeatable. Repeating the command causes the crash to happen.
     -> eliminated by QA, so get rid of vendors w/ this problem.
   - heisen-bugs are unrepeatable. Repeating a sequence of commands does not cause the crash to happen. This happens under high load, due to race conditions, concurrency control problems. Send engineers on airplanes; even if they can't solve the problem, it will make the customer happy.
 - communication failures - your L machines break into some number of partitions:
   -> L partitions: 1-site networks, no one can talk to anyone else (easy case)
   -> 2 partitions of K and L-K machines (hard case)
   Solution: maintain two networks, so that you can switch to communicating on a second medium
 - disaster: flood/earthquake/hurricane/global power loss. Assume NOTHING remains of your datacenter. Assume all data at that location is GONE.

Disaster solutions:

1970's: write a log. Have Iron Mountain move tapes to a safe place. Buy IBM machines, since they could perform heroics to get you new hardware in a disaster, and you're back up in a few days. This is because hardware is too expensive to replicate needlessly.

1980's: hire Comdisco to provide hardware in case of disaster. They sell this service to multiple clients, under the premise that they won't all have a disaster at the same time. Send tapes to Comdisco, and roll forward tapes, up and running in 1 hr. You have Comdisco drills once a year to test disaster policy.

So 1970's and 1980's assumed it was too expensive to have an up-to-the-second log of transactions.

1990's: everyone has their own backup machines. Run a datacenter in two locations, far enough apart that a hurricane won't hit both places, and on different power grids.

Options for logging:
 - Spool log over the internet & roll forward at the backup site. Failover takes on the order of minutes. You don't run 2PC over WAN, so some transactions at the tail of the log might be lost. eBay does this with a replication factor of about 2.5. The backup machines can't be used for anything useful, since the backup is always slightly behind the primary copy.
 - Run each transaction twice, once at each site. Then both systems are up-to-date, so you can send read-only load to either site. Need 2PC over a WAN to run this perfectly, as locking leads to nondeterministic deadlock behavior, so the two sites can disagree on which transactions abort. Same with optimistic concurrency control.
 - Use H-Store, since it uses only main memory, transactions take less than 1ms, and transactions run to completion, so there is no nondeterministic deadlock issue. Now you can run each transaction at two different sites, and spread read-only load across them.
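
A minimal sketch of why run-to-completion execution lets two sites stay in sync without 2PC. This is not H-Store's API: the `Site` class, the `transfer` helper, and the `replicate` function are hypothetical names made up for illustration, and the sketch assumes both sites receive the same transactions in the same order. Because each transaction runs alone and its abort decision depends only on the current state, both sites reach the same outcome for every transaction.

```python
from dataclasses import dataclass, field


@dataclass
class Site:
    """Hypothetical in-memory replica that runs one transaction at a time."""
    name: str
    balances: dict = field(default_factory=dict)

    def run(self, txn):
        # Run to completion against a scratch copy, so an abort leaves
        # the site's state untouched. No locks, so no deadlocks.
        scratch = dict(self.balances)
        if txn(scratch):
            self.balances = scratch
            return True
        return False


def transfer(src, dst, amount):
    """Build a transaction that moves `amount` from src to dst (assumed example)."""
    def txn(balances):
        if balances.get(src, 0) < amount:
            return False  # deterministic abort: depends only on state
        balances[src] -= amount
        balances[dst] = balances.get(dst, 0) + amount
        return True
    return txn


def replicate(txns, sites):
    """Apply the same ordered transaction stream at every site."""
    return [[site.run(txn) for site in sites] for txn in txns]


primary = Site("primary", {"A": 100, "B": 0})
backup = Site("backup", {"A": 100, "B": 0})

stream = [transfer("A", "B", 60), transfer("A", "B", 60), transfer("B", "A", 10)]
outcomes = replicate(stream, [primary, backup])

# Both sites commit and abort the same transactions and end in the same state.
print(outcomes)
print(primary.balances, backup.balances)
```

The second transfer aborts at both sites for the same reason (insufficient funds), which is the property that locking or optimistic concurrency control cannot guarantee when transactions interleave differently at each site.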