10/14/2008 Readings: * "ARIES: A Transaction Recovery Method Supporting Fine-Granularity Locking and Partial Rollbacks Using Write-Ahead Logging". C. Mohan et al. In the Red Book. Read Sections 1-7, and skim Sections 12 and 13. * Note: The ARIES paper is ahard one to read. See Michael Franklin's writeup as referenced in lecture 10 for an easier read-through. =Durability= ==Kinds of Faults== What kind of faults can happen? - Database (program) crash - we'll talk about this. - Disk crash - we will assume this doesn't happen today, but the rule of thumb is redundancy -> keep multiple copies of logs, data, etc. on machines that are ideally geographically distributed - Network faulure - will talk about this for distributed DBs in a few weeks - Power failure - similar to database crash, modulo some weird corruption. If you're big enough to care about losing money, - Backup power - diesel, etc. - Long power outage - general case of this is a "disaster," where data center is no longer accessible. We'll discuss this, but essentially need multiple geographically distributed data centers. - Misc. hardware faults - redundancy is key Another class of error: User Error - DROP TABLE orders; - rm -rf * Solution to user error problem - backup your stuff! Another class of error: bugs in database software that doesn't cause it to crash. - "Byzantine Faults" - arbitrary bugs which don't necessarily crash system - People don't focus on this too much, but if they do, they can run multiple implementations of the DB and compare their results after each step. - Convert lots of these errors in to failstop/crash errors by using asserts in your code. These days, people run an active replica in a different place in the country/world. In the 80's you'd take a backup on tape, ship it elsewhere nightly, and resolve problems by shipping engineers to the backup site if recovery is needed. Since we only ship data to remote servers every couple of minutes these days, we still have some data loss, and in that case, rely on good customer service representatives to help customers. So our assumption: make all software bugs look like operating system crash. Hardware going down is solved by redundant hardware. ==Crash-based Recovery== Has to handle crashes and aborts. 1) Rollback transactions that abort, or partially completed transactions on a crash. 2) Ensure committed transactions are visible, at least in the log. So our view of a DB system is: - Buffer Manager with pages in memory - Tables, indices, and a log on disk Recovery brings memory and disk back to similar state it was in before crash. A filesystem on disk kind of supports this. When editing a text file in emacs/vim, partial state is written to a temporary file, which is atomically moved at the operating system level every time user saves. So either all edits before save are changed, or none of them are. In a DB system, this is called "shadow pages." Atomically installs pages once committed. But still need to log which pages to commit. So they ended up ditching the shadow pages, and just writing page contents themselves to a log. ==Write Ahead Logging== Before we write into a table, write the modification to a log. It's required that we phyically write all of log record before we begin physically writing changes to table. Can still modify table in memory, since if we crash, we'll lose that state anyway. More precisely: write all log records for a transaction as well as its COMMIT record to the log before returning control to the user, and acknowledging commit. What do we log? - updates (umbrella term for update/delete/insert) - transactional information (begin, commit, and abort) So the life of a transaction might look like: BEGIN, UP A, UP B, COMMIT (Let's not worry about delete, write, etc... just call it all UP) Update records - UNDO and REDO information - Simplest thing to write for an update to a page is the copy of the page before and the copy of the page after the update. Hopefully this isn't 4096x2 bytes - can just write which offsets in the page differ. This is called "phyical logging" - Logical logging - just write logical operation that affected page -> "Insert tuple t into page". Far more compact than physical logging. However, It's harder to know how to undo logical operations, etc. ==STEAL/FORCE== If the buffer manager never evicts dirty pages (as in SimpleDB), the policy is called "NO STEAL." Under this policy, you will never have to undo any transactions, since they aren't in effect before you commit. If you force all transactions' dirty pages to disk before writing a commit record (a FORCE policy), then you don't have to redo work, since it will already be on disk. These two policies directly contradict each other, so you can't have both at the same time. FORCE on its own is hard to guarantee, since you can't write everything to disk atomically anyway, and might require us to write random accesses to disk without good reason. NO STEAL on its own is hard to implement, since transactions might touch more pages than available memory. So in practice, all available systems implement a NO FORCE/STEAL policy. - We might have to be able to UNDO stolen pages from transactions in flight. - We might have to be able to REDO transactions that were partially forced to disk.