10/7/2008 Reading: Michael J. Franklin. Concurrency Control and Recovery. The Computer Science and Engineering Handbook, 1997. http://db.csail.mit.edu/6.830/lectures/franklin97concurrency.pdf. (Adam's Note: this is one of the best explanations of write-ahead logging I have read) =Transactions= Often written Xaction. Great way to: 1) Guarantee to application programmer that work done within a transaction will happen in full or not at all. 2) Allow DB to allow transactions that are independent of each other to run in parallel to enhance concurrency ACID properties of transactions: Atomic - make actions happen as if they are all occurring at one instance in time. "All or nothing." Consistent - invariants always hold. So if you have some set of constraints on the system (managers make more than subordinates), they will not be broken by new transactions. Isolated - transactions you run won't see other transactions' internal state. Kind of falls out of atomicity. Durable - completed transactions persist, even after a crash (guarantee data is written before you confirm that write succeeded). This is difficult to guarantee in distributed systems where data is replicated. ==Syntax== BEGIN TRANSACTION SELECT X UPDATE Y -> not visible to other transactions until commit UPDATE X -> not visible to other transactions until commit -> A crash here would not update Y or X COMMIT/ABORT -> a crash after this point will still recover with Y and X being updated. (remember: this is a software guarantee. The hard drive might still explode, and your data would be lost). ==History== Formalized by Jim Gray, who won a Turing award for the concept. Initially used in banking applications, airline booking. ==Achieving Atomicity/Isolation== Note: we can't just package these things up and make them happen all at once. Results need to be interactive so that a user can take results of one query and perform an action based on it within one transaction. So to get concurrency, need to isolate state in one transaction from another. Simplest form: Run transactions sequentially. Long-running applications will stop others. -> Bad idea - modern computers can do multiple things at once, so a user taking a long time to enter the next query in an uncommitted transaction, or hitting disk will cause other queries to stop, even though they could be running at the same time. Best for throughput: run multiple Xactions in parallel. -> Need Concurrency Control to avoid transactions stepping on each others' toes. ==Definitions== -> Serializability - running multiple transactions and having the result look as if they ran in some sequential order. -> Transaction Notation: Consider A and B to be imaginary "objects" in the DB. You can read from those items, and write to them. Example: T1 --- RA - read A (e.g. v = A) WA - write to A (e.g. A = v*z) RB - read B (e.g. y = B) WB - write to B (e.g. B = v+y) -> Schedule (interleaving) T1 T2 RA WA RA WA RB WB RB WB This schedule is serializable. Had T1.WA happened after T2.WA, then it's possible that A gets its value from T1 and B gets its value form T2, so that's not possible in a sequential execution. -> Conflict Serializability The following table describes what is considered a conflict when T1's operation comes before T2's. T1 T2 conflict? RA RA no RA WA yes WA RA yes WA WA yes A schedule is conflict serializable if for all conflicitng pairs of operations o1 of T1 and o2 of T2, - o1 always precedes o2, OR - o2 always precedes o1 Example: T1 1: RA 2: WA 3: RB 4: WB T2 5: RA 6: WA 7: RB 8: WB conflicts: 1-6, 2-5, 2-6, 3-8, 4-7, 4-8 Can draw an edge from T1 to T2 if T1 does a conflicting operation before T2 and vice versa. If the graph has a cycle, the schedule is NOT conflict serializable: T1 T2 T3 RA WA RB RA WA RB WB WB T1->T2-><-T3 So T1 and T2 don't conflict, and T1 and T3 don't conflict, but T2 and T3 conflict, since they affect each other, so it's not conflict serializable. If there is no cycle, then following the graph's directed edges gives you the "equivalent" serializable order to running them concurrently. ==What You Would Like More than Conflict Serializability== T1 T2 RA WA RA WC commit RB abort This is not good: we have read the result of an aborted transaction! Aborted modifications shouldn't leak through the system. This kind of transaction is called non-recoverable. Another example T1 T2 T3 RA WB WA RA WA RA if T1 aborts at this point, have to kill T2 and T3 as well! That's called a cascading abort. To avoid this, only allow cascadeless aborts! A cascadeless abort is also recoverable. To avoid this: make all read operations to items written by another transaction wait for that transaction to commit. One solution for this would use locking.