6.824 2007 Lecture 10: Logging and Recovery Done with consistency segment of the class. Consistency helps multiple clients update the same shared data Haven't yet dealt with the durability of the data itself across failures, even on just a single node So today's lecture will focus on logging/journaling in order to make filesystem operations atomic. ext3, zfs, and other journalling filesystems use these techniques. What's the overall topic? Atomic updates of complex data w.r.t. failures. Today just a single system, we'll be seeing distributed versions later. Similar techniques will be useful in distributed systems like Petal, 2-phase commit, Paxos, etc. assume disk sector writes are atomic, and "work correctly" see FSD paper for better handling of disk failures Why aren't synchronous meta-data updates enough? They're slow (each could require a separate disk seek) Recovery may require scanning the whole disk Force writing data to disk in a particular order Some operations don't have an obvious single committing write Durability of data A single node may fail while writing multi-sector on-disk data structure. Several nodes at once might be updated, and data structures might be too complex to update at once. Updating multiple items at once and getting all or none of them updated after a crash is called atomicity. So logging provides the following rule: don't update the last version in place. Instead, make updates in a different location, and swap to a new version once we're sure the changes are stable. Example: FFS rename editor could use re-name from temp file for careful update echo a > d1/f1 echo b > d2/f2 mv d2/f2 d1/f1 need to update two directories, stored in two blocks on disk. do we want to remove d2/f2 first and then add d1/f1, or the other way around. probably want add then remove what if a crash? if add then remove, might have f2 in two locations, without removing it from d2. what does fsck do? it knows something is wrong, since link count is 1, but two links. can't roll back -- which one to delete? has to just increase the link count. this is *not* a legal result of rename! but at least we haven't lost the file. so FFS is slow *and* it doesn't get semantics right. So that's not perfect, but it's not bad. Alternatively, can make logical copy of directory structure, modify the copy, and atomically switch the updated structure over to the copy. You can push tree update one step farther. Prepare a new copy of the entire affected sub-tree. Replace old subtree in one final write. Very expensive if done in the obvious way. But you can share structure between old and new tree. Only need new storage between change points and sub-tree root. (NetApp WAFL does this and more.) This approach only works for tree data structures. and doesn't support concurrent operations very well What are the reasons to use logging? atomic commit of compound operations. w.r.t. crashes. fast recovery (unlike fsck). well-defined post-recovery state: serial prefix of operations. as if synchronous and crash had occured a bit earlier can be applied to almost any existing data structure, even complex ones e.g. database tables, free lists representation is compact on a disk, so very fast to append useful to coordinate updates to distributed data structures let's all do this operation oops, someone didn't say "yes"---rollback how to back out or complete? so true state of the filesystem is in the log, and the on-disk datastructure does not need to be synchronized to disk, which is expensive (due to seeks for updates all over the hard drive) Transactions The main point of a log is to make complex operations atomic. I.e. operations that involve many individual writes. You want all writes or none, even if a crash in the middle. A "transaction" is a multi-write operation that should be atomic. The logging system needs to know which sets of writes form a transaction. re-organize code to mark start/end of group of atomic operations create() begin_transaction update free list update i-node update directory entry end_transaction app sends writes to the logging system there may be multiple concurrent transactions e.g. if two processes are making system calls each trasaction is represented on disk by a BEGIN and END, and a bunch of operations that define the transaction in between. Terminology in-memory data cache dirty vs clean on-disk data in-memory log on-disk log sync write vs async TID = transaction ID naive re-do log keep a "log" of updates B TID [begin] W TID B# new-data [write] E TID [end == commit] Example: B T1 W T1 B1 25 E T1 B T2 W T2 B1 30 B T3 W T3 B2 99 W T3 B3 50 E T3 for now, log lives on its own infinite disk note we include record from uncommitted xactions in the log records from concurrent xactions may be inter-mingled we can write dirty in-memory data blocks to disk any time we want recovery 1. discard all on-disk data 2. scan whole log and remember all Committed TIDs 3. scan whole log, ignore non-committed TIDs, replay the writes why can't we use any of on-disk data's contents during recovery? don't know if a block is from an uncommitted xaction i.e. was written to disk before commit the *real* data is in the log! the on-disk data structure is just a cache for speed since it's hard to *find* things in a log so what have we achieved? atomic update of complex data structures: gets rename() right recoverable operations are fast (exploit disk locality) problems: we have to store the whole log forever [unbounded log] recovery has to replay from the beginning of time re-do with checkpoint most logs work like this, e.g. FSD checkpoint periodically to have consistent system state, so that you can delete all of log before checkpoint [bounded log] allows much faster recovery: can use on-disk data write-ahead rule delay flushing dirty blocks from in-memory data cache until corresponding commit record is on disk, meaning you could recover from the log if necessary. so keep updates of uncommitted xactions in in-memory data cache (not disk) so no un-committed data on disk. but disk may be missing some committed data recovery needs to replay committed data from the log how can we avoid re-playing the whole log on recovery? recovery needs to know a point in log at which it can start a "checkpoint", pointer into log, stored on disk how to ensure recovery can ignore everything before the checkpoint? checkpoint rule: all data writes before checkpoint must be stable on disk (including END/commit record) checkpoint may not advance beyond first uncommitted Begin so essentially a checkpoint record contains a pointer to the system state which is consistent with ALL log records before it. in background, flush a bunch of early writes, update checkpoint ptr three log regions: data guaranteed on disk (checkpoint) data might be on disk (log write point) data cannot be on disk (end of in-memory log) on recovery, re-play commited updates from checkpoint onward it's ok if we flush but crash before updating checkpoint pointer we will re-write exactly the same data during recovery can free log space before checkpoint! problem: uncommitted transactions use space in in-memory data cache (since you can't log until after transaction is committed) a problem for long-running transactions (not a problem for file systems, which have short transactions) To avoid having to write only after commit, you can just write updates as they come, but add an undo record to each redo record, saying what to do with updates mid-transaction that never commits. This is a big deal in database systems where transactions may take a long time. un-do/re-do with checkpoint [didn't talk about this in class] suppose we want to write uncommitted data to disk? need to be able to un-do them in recovery so include old value in each log record W TID B# old-data new-data now we can write data from in-memory data cache to disk after log entry is on disk no need to wait for the End to be on disk so we can free in-memory data cache blocks of uncommitted transactions recovery: for each block mentioned in the log find the last xaction that wrote that block if committed: re-do if not committed: un-do two pointers stored on disk: checkpoint and tail checkpoint: all in-memory data cache entries flushed up to this point no need to re-do before this point but may need to un-do before this point tail: start of first uncommitted transaction no need to un-do before this point so can free before this point it's ok if we crash just before updating the tail pointer itself we would have advanced it over committed transaction(s) so we will re-do them, no problem what if there's an un-do record for block never written to disk? it's ok: un-do will re-write same value that's already there what if B T1 W T1 B1 old=10 new=20 B T2 W T2 B1 old=20 new=30 crash The right answer is B1 = 10, since neither committed But it looks like we'll un-do to 20 What went wrong? How to fix it? careful disk writing log usually stored in a dedicated known area of the disk so it's easy to find after a reboot where's the start? checkpoint, a pointer in a known disk sector where's the end? hard if crash interrupted log append append records in order include unique ascending sequence # in each record also a checksum for multi-sector records (maybe in End?) recovery must search forward for highest sequential # why is logging fast? group commit -- batched log writes. could delay flushing log -- may lose committed transactions but at least you have a prefix. so rather than write each log synchronously as they are committed, combine them every few fractions of a second to one seek per update and achieve higher throughput on disk. single seek to implement a transaction. maybe less if no intervening disk activity, or group commit write-behind of data allows batched / scheduled. one data block may reflect many transactions. i.e. create many files in a directory. don't have to be so careful since the log is the real information ------- Reimplementing the Cedar File System Using Logging and Group Commit Robert Hagmann SOSP 1987 Why are we reading this paper? Robustness and crash recovery Logging will come up many times What are the main properties the authors want? Recoverable if unexpected crash (e.g. while modifying name table). specifically, these updates should be atomic If single- or double-sector disk error, only that file damaged. Fast crash recovery. Fast ordinary operation. Basic FSD (updated cedar filesystem) data structures VAM---volume allocation map, which is a bitmap of free disk pages to use -> stays in memory filename table---b+tree of file names and block lists -> on disk leaders---checksums of content blocks -> on disk file content blocks -> on disk log---write-ahead log, logs changes to FNT -> on disk No step of the file creation process is brittle---if you crash after writing leader/contents (or some portion of them), you just lose some space, but VAM is dynamically recalculated on recovering from crash, and file name table stays consistent with pre-creation information. Why unexpected crashes might be a problem b-tree update requires multiple disk writes Why does a log help with crash recovery? It makes multi-step operations atomic w.r.t. recovery What does the on-disk log look like? Fixed area of disk, not a file, no free list for log blocks. Proper entry: hdr[header], blank page, hdr copy, data pages..., end, copies of data..., end copy. They were worried about disk block failure (two blocks next to each other might go bad), so they logged each header and data pages twice, so you can always recover from block failure to one block. Remember---the data pages don't contain file data, just filesystem metadata, since they don't protect anything but the filesystem structure. be clear about memory vs disk memory: disk cache. some blocks might be dirty. in-memory log (for group commit), just most recent, not yet flushed, w/ commit points. VAM disk: real homes of data and name table blocks. on-disk log, in thirds. Steps to follow for a file create look for two free pages to write content/leader to (in VAM) update the file name table in memory write the leader+content to free pages log the changes to the file name table (the b+tree) eventually, memory image will merge with structure on disk, but at this point it's logged When is the in-memory log forced to the disk? Group commit... Every few seconds. Why not log immediately, after every operation? Too many disk seeks---kills throughput Note that while the log is on disk, it's kept in memory until a group commit point, at which point all in-memory log updates are flushed. When are the modified disk cache pages written to the disk? Is it OK to write them before the corresponding log records? Is it OK to write them after? What happens during recovery after crash+reboot? How does recovery s/w find the start of the log? Disk has pointer to first log of oldest third, updated on entry to third. How does recovery find the last log record? hdr/end blocks must have time-stamp or something. What if the system crashed while writing to the log on disk? Do we assume disk wrote pages of log entry in order? If end block exists and matches hdr block, we're OK. What does recovery actually do with the log records? Can it just start re-play at the beginning? I.e. what if it crashed during previous replay, is 2nd time OK? Where does recovery stop re-playing? Does recovery work if crash during log write? Yes---you just skip the first log record to be missing an END record. What about recovering the free list (VAM)? What if the free list had incorrect contents? It's dynamically calculated at the start of the system after recovery What can we say about the state of the file system after crash+recovery? Some prefix of the operations have been completed. I.e. every operation up to a certain point. May have lost the last few operations. Why don't they log data writes? When do they write the data? Does that break the prefix rule? May have writes from "the future", after the recovery point. What about delete, create, write, crash? W/ recovery point before delete? What if we run out of log space during normal operation? How can we be sure it's safe to start re-using the log space on disk? What if crash while writing blocks during switch to a new 1/3? Why is FSD faster than CFS? Table 3, why does FSD do 1/6th as many I/Os for small creates? Why 149 I/Os for 100 files? Did they evaluate whether they achieved their robustness goals? No. ******We didn't talk about anything below this in class****** Let's talk about the old system, CFS, first. In some ways it's the more interesting of the two. What were some strengths of the old design? Labels provided robustness. What were the weaknesses of the old design? Non-standard hardware. Slow recovery. Slow meta-data update. CFS file name table was a b-tree. *not* consistent after a crash. thus really required to duplicate names &c in headers. How did the labels work? What was in a label? When written? when checked? What errors can the labels detect? What invariants could a label help maintain? Disk hardware failures: For each, detect bad data? Recover? If disk scribbles onto the file name table? If disk scribbles onto a header? If disk scribbles onto file content? Software errors: If used page on free list, will we re-use it incorrectly? If someone writes junk into header, will we detect? Can we recover? If someone writes junk into file content, will we detect? Why was the CFS VAM just hints? VAM is disk page free list. How could there be no invariants? What if an allocated page is on the free list? Does the FSD VAM have invariants? Why is guarding against single sector errors reasonable? What's the logic behind this design focus? What are the alternatives? What if most disk errors occured in disk electronics? Do they have hierarchical directories? Only one "directory", the file name table...