6.824 2009 Lecture 14: Case studies: Frangipani

Why is the log region (1TB) smaller than the bitmap (3TB)?

Introduction
  YFS is a watered-down version of Frangipani
  Now that we understand all the ingredients (e.g., Paxos, consistency), let's look at it in more detail
  More discussion/question oriented lecture
  Perhaps ideas for final project

Frangipani: A Scalable Distributed File System
Thekkath, Mann, Lee
SOSP 1997

Why not one beefy server with lots of disk?
  Performance: if different files are read frequently, you can split the bottleneck in accessing the server.
  Multiple disks in more machines allow for more concurrency.
  Perhaps if everyone in an organization were writing to a single file, the overhead of Frangipani might be worse, but that's not the common case.

Overall architecture (note that each component can run on the same machine as another component)
  - Multiple Frangipani servers, which take care of naming and metadata.
  - Multiple Petal servers, which handle low-level block storage with no user-facing things like filenames.
  - Distributed lock server that provides multiple-reader/single-writer locks.
  - There's nothing deep in having two layers: they had already written Petal, and didn't want to re-implement it.

Why not primary copy? (that is, primary/copy NFS)
  Actually they do use primary copy inside Petal
  But many Petal primary/backup pairs in one FS

What does Petal do / guarantee?
  - Provides a 64-bit virtual address space to write to
  - Provides replicated block storage to higher layers, and recovers blocks lost on crashed machines from their replicas.

Each client machine runs a Frangipani server, which provides it with a filesystem interface.

What happens if a client e.g. creates a file?
create("/d/f");
  lock log
  lock bitmaps
  lock directory d
  locally in Frangipani server buffer cache:
    read disk node contents for d
    assuming f doesn't already exist
    allocate an inode for f, initialize it in Frangipani's buffer cache
    add entry for f in d
  force log to Petal server (write-ahead logging, so that a crash partway through the multi-step update can be recovered)
  write d's inode content
  write f's inode content
  release locks

What if a client on a different server reads that file?
  The server holding the lock gets the REVOKE
  writes log to Petal, writes meta-data to Petal, then RELEASEs the lock
  Why must it write the log entry to Petal before writing the meta-data?
  Why must it write the meta-data to Petal before releasing the lock?

What if two clients try to create the same file at the same time?

The locks are doing two things:
  Atomic multi-write transactions.
  Serializing updates to meta-data (cache consistency).

What if a server dies and it is not holding any locks?
  Can the other servers totally ignore the failure?
    Yes: it wrote the data out to Petal, and isn't working on anything right now.

What if a server dies while holding locks?
  Can we just ignore it until it comes back up and recovers itself?
    - No: it's got stuff locked that others might want
  Can we just revoke its locks and continue?
    - No: it might leave inconsistent state behind, halfway through an operation.
  What does Frangipani do to recover?
    - re-run the log

Remember that file contents are not locked/synced to disk...only the file metadata.
The contents are synced from the Frangipani buffer cache asynchronously.

Logs are initially kept in memory with no communication to Petal.
Locks are cached and not returned prematurely.
Only when another server asks for a lock you hold do you release it, after first syncing to the disk-based log on Petal.
The log is asynchronously written out every 30 seconds.
This is probably confusing to the user, but makes things faster (if you repeatedly update an inode by creating many files in a directory, etc., and if you're sequentially writing many items to the log).

What's in a log record?
  - sequence number (log is circular, so this tells you when it wrapped around on itself)
  - operation type (create)
  - inum(s) affected
      directory inum + version
      file inum + version
  - checksum

If a server crashes w/o writing its log to disk (no one asked for the lock), then all modifications that were only in memory are gone.

Suppose S1 deletes f1, flushes its block+log, releases lock.
  Then S2 acquires lock and creates a new f1.
  Then S1 crashes.
  Then all lock servers crash and reboot
  Then recovery demon for S1 runs
  Will recovery re-play the delete?
    Not necessarily: each block has a version number that is incremented by each logged update to it.
    If you are replaying a log entry, but the version on the affected block is already as new as or newer than the entry's, then you ignore the log entry.
    Details depend on whether S2 has written the block yet (which would have incremented the version number)

S1 creates f2, crashes while holding lock
  how does replay work?
  if S1 crashed before any flush of anything?
  mid-way through flushing log?
  mid-way through flushing data?
  just after all flushing, before releasing lock?
  just after releasing the lock?

What if server crashes midway while writing log?
  - you'd check the checksum, see nasty data, and not run that atomic operation.

Does the recovery manager have to acquire locks before playing records?
  What if some other server currently holds the lock?
  Might the other server have stale data cached? From before replay?

What if two servers crash at about the same time?
  And they both modified the same file, then released lock.
  How do we know what order to replay their logs in?
  I.e. can we replay one, then the other?
  Or must we interleave in the original order?

What if power failure affects all servers?

Suppose S1 creates f1, creates f2, then crashes.
What combinations of f1 and f2 are allowed after recovery?

What if a server runs out of log space?
  What if it hasn't yet flushed corresponding blocks to Petal?

What happens if the network partitions?
  Could a partitioned file server perform updates?
  Serve stale data out of its cache?
  What if the partition heals just before the lease expires?
  Could file server and lock server disagree about who holds the lock?

Why isn't the lock service a performance bottleneck?

What if a lock server crashes?

What do lock servers use Paxos to agree on?
  - which lock server is responsible for which range of locks (inums).
  - which Frangipani servers participate in the system (so that a lock server that crashes can poll Frangipani servers for who owns which locks).

Why does Frangipani have a disk-like interface to Petal?
  Frangipani was never intended to use a disk, so there is no compatibility reason
  Might some other interface work better?

Table 2: why are creates relatively slow, but deletes fast?

Why is figure 5 flat?
  Why not more load -> longer run times?
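
Sketch: log replay with checksums and version numbers
  The replay rules from the recovery discussion (skip a record whose checksum is bad; skip an update whose block already carries an equal or newer version) can be sketched in a few lines of Python. This is only an illustration under assumptions, not Frangipani's actual code: the names LogRecord and replay, the dict standing in for Petal, the use of CRC32 as the checksum, and the convention that a record stores the version a block will have after the update are all made up for this example.

```python
import zlib
from dataclasses import dataclass


@dataclass
class LogRecord:
    seq: int        # sequence number in the circular log
    op: str         # operation type, e.g. "create" or "delete"
    updates: dict   # assumed layout: block id -> (new version, new contents)
    checksum: int = 0

    def payload(self) -> bytes:
        # Deterministic byte encoding of the record body (illustrative only).
        return repr((self.seq, self.op, sorted(self.updates.items()))).encode()

    def seal(self) -> "LogRecord":
        # Compute the checksum just before the record is written to the log.
        self.checksum = zlib.crc32(self.payload())
        return self

    def intact(self) -> bool:
        return self.checksum == zlib.crc32(self.payload())


def replay(log, blocks):
    """Replay a crashed server's log against `blocks` (block id -> (version, contents)).

    Returns the sequence numbers of the records that changed at least one block.
    """
    applied = []
    for rec in sorted(log, key=lambda r: r.seq):
        if not rec.intact():
            continue  # partially written record: do not run this operation
        changed = False
        for block_id, (version, contents) in rec.updates.items():
            current_version, _ = blocks.get(block_id, (0, None))
            if current_version >= version:
                continue  # block already reflects this update or a later one
            blocks[block_id] = (version, contents)
            changed = True
        if changed:
            applied.append(rec.seq)
    return applied


# The S1/S2 scenario: S1 logged "delete f1" (directory block d -> version 2),
# then S2 created a new f1 and wrote d back at version 3 before S1 crashed.
petal = {"d": (3, "d lists new f1")}
s1_log = [LogRecord(1, "delete", {"d": (2, "d without f1")}).seal()]
applied = replay(s1_log, petal)
```

  In this run the delete is not re-played, because block d is already at version 3. If S2 had not yet written d back, the block would still be at an older version and the delete would be applied, which is the "details depend on whether S2 has written the block yet" case.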