10/14/2008

Readings:
* "ARIES: A Transaction Recovery Method Supporting Fine-Granularity Locking and Partial Rollbacks Using Write-Ahead Logging". C. Mohan et al. In the Red Book. Read Sections 1-7, and skim Sections 12 and 13. 
* Note: The ARIES paper is ahard one to read.  See Michael Franklin's writeup as referenced in lecture 10 for an easier read-through. 

=Durability=
==Kinds of Faults==
What kind of faults can happen?
 - Database (program) crash - we'll talk about this.
 - Disk crash - we will assume this doesn't happen today, but the rule of thumb
   is redundancy -> keep multiple copies of logs, data, etc. on machines that
   are ideally geographically distributed
 - Network faulure - will talk about this for distributed DBs in a few weeks
 - Power failure - similar to database crash, modulo some weird corruption.
   If you're big enough to care about losing money,
     - Backup power - diesel, etc.
     - Long power outage - general case of this is a "disaster," where
       data center is no longer accessible.  We'll discuss this, but
       essentially need multiple geographically distributed data centers.
 - Misc. hardware faults - redundancy is key

Another class of error: User Error
 - DROP TABLE orders;
 - rm -rf *
Solution to user error problem - backup your stuff!

Another class of error: bugs in database software that doesn't cause it to
crash.
  - "Byzantine Faults" - arbitrary bugs which don't necessarily crash system
  - People don't focus on this too much, but if they do, they can run multiple
    implementations of the DB and compare their results after each step.
  - Convert lots of these errors in to failstop/crash errors by using asserts
    in your code.

These days, people run an active replica in a different place in the
country/world.  In the 80's you'd take a backup on tape, ship it elsewhere
nightly, and resolve problems by shipping engineers to the backup site
if recovery is needed.

Since we only ship data to remote servers every couple of minutes these days,
we still have some data loss, and in that case, rely on good customer service
representatives to help customers.

So our assumption: make all software bugs look like operating system crash.
Hardware going down is solved by redundant hardware.

==Crash-based Recovery==
Has to handle crashes and aborts.
1) Rollback transactions that abort, or partially completed transactions
   on a crash.
2) Ensure committed transactions are visible, at least in the log.

So our view of a DB system is:
  - Buffer Manager with pages in memory
  - Tables, indices, and a log on disk

Recovery brings memory and disk back to similar state it was in before crash.

A filesystem on disk kind of supports this.  When editing a text file in
emacs/vim, partial state is written to a temporary file, which is atomically
moved at the operating system level every time user saves.  So either all
edits before save are changed, or none of them are.

In a DB system, this is called "shadow pages."  Atomically installs pages
once committed.  But still need to log which pages to commit.  So they ended up
ditching the shadow pages, and just writing page contents themselves to a
log.

==Write Ahead Logging==
Before we write into a table, write the modification to a log.
It's required that we phyically write all of log record before we
begin physically writing changes to table.  Can still modify table
in memory, since if we crash, we'll lose that state anyway.
More precisely: write all log records for a transaction as well
as its COMMIT record to the log before returning control to the
user, and acknowledging commit.

What do we log?
 - updates (umbrella term for update/delete/insert)
 - transactional information (begin, commit, and abort)

So the life of a transaction might look like:
  BEGIN, UP A, UP B, COMMIT
(Let's not worry about delete, write, etc... just call it all UP)

Update records - UNDO and REDO information
 - Simplest thing to write for an update to a page is the
   copy of the page before and the copy of the page after the
   update.  Hopefully this isn't 4096x2 bytes - can just write
   which offsets in the page differ.  This is called
   "phyical logging"
 - Logical logging - just write logical operation that affected
   page -> "Insert tuple t into page".  Far more compact than
   physical logging.  However, It's harder to know how to
   undo logical operations, etc.

==STEAL/FORCE==
If the buffer manager never evicts dirty pages (as in SimpleDB),
the policy is called "NO STEAL."  Under this policy, you will
never have to undo any transactions, since they aren't in effect
before you commit.

If you force all transactions' dirty pages to disk before writing a
commit record (a FORCE policy), then you don't have to redo work,
since it will already be on disk.

These two policies directly contradict each other, so you can't have both
at the same time.

FORCE on its own is hard
to guarantee, since you can't write everything to disk atomically
anyway, and might require us to write random accesses to disk
without good reason.

NO STEAL on its own is hard to implement, since transactions might touch
more pages than available memory.

So in practice, all available systems implement a NO FORCE/STEAL policy.
 - We might have to be able to UNDO stolen pages from transactions in flight.
 - We might have to be able to REDO transactions that were partially forced
   to disk.