6.824 Lecture 6: Consistency

Outline
  Finish DryadLINQ
  Consistency
    Consistency models
      Strict consistency
      Sequential consistency
    IVY

=DryadLINQ=

- We were wondering when subsequent operations in DryadLINQ are placed on the same machine rather than passed to operators on other machines. According to the authors, this is decided statically. If one operation streams to another, it's kept on the same machine. Once the output of an operation is partitioned, the results are sent over the network to multiple machines.
- The number of partitions for an operation is chosen by dividing the workload size by a constant, namely how large a chunk each server should process at a time.
- Hash buckets are created by binning the keys learned during a deterministic sampling stage.

Performance Evaluation
- Overhead of the higher-level DryadLINQ layer: they show the workload is network-bound, which is good, since it means the system isn't limited by the overhead of DryadLINQ.
- Programming ease: a real-world application takes 1000 lines of Dryad vs. 100 lines of DryadLINQ. The expense of this is 1.3x that of Dryad without optimizations. They even claim it's hard to implement various JOIN-like operations on MapReduce and thus don't make the comparison, which is slightly suspect.
- They compare dynamic optimizations to statically compiled, un-optimized operation, and find a performance improvement.
- One nice feature is the distributed debugger that is built into the .NET debugging framework: immensely useful for programmers.

Consistency = a constraint on the memory model of a storage system
  Interesting when data is replicated and accessed concurrently
  How to guarantee reasonable expectations when users do a load/store of data
    Read(a)
    Write(a,v)
  This is hard even on a single-processor machine because of instruction re-ordering, etc.
  Even more challenges come when you share/distribute the memory across multiple processors/machines and throw in concurrent writers
  - simple example: a web browser caches data, and is told by the server how long to keep a page valid (expiration is determined by the server's response). So that's loose consistency: you are consistent with writes up to the window of time in which the cache is valid.

Replicated data is a huge theme in distributed systems
  For performance and fault tolerance
  Often easier to replicate data than computation

Examples:
  Caching of web pages (Coral)
  Caches in YFS (labs 5 and 6)
  Caches in the Linux kernel
  All these examples involve sophisticated optimizations for performance
  How do we know if an optimization is correct?

We need to know how to think about correct execution of distributed programs.
  Most of these ideas come from multiprocessors and databases 20/30 years ago.
  For now, just correctness and efficiency, not fault-tolerance.

Replicating content that isn't modified (or has a single writer) is "simple"
  See, for example, BitTorrent and Coral
  Coral obeys the HTTP expiration directive and last-modified
  The topic gets much more interesting with concurrent reads and writes.

Let's assume we are implementing a traditional memory (i.e. with LD/ST)
  This matches many read/write abstract interfaces (e.g., file systems)

Let's distribute it naively
  internet cloud, hosts CPU0, CPU1, CPU2
  assume each host has a local copy of all of memory
  reads are local, so they are very fast
  write: send update msg to each other host (but don't wait)

Does this memory work well?
  Not fully: since hosts don't wait on each other, some writes may be lost, which means hosts may write based on different views of the data

Example 1:
  CPU0:
    v0 = f0();
    done0 = true;
  CPU1:
    while(done0 == false)
      ;
    v1 = f1(v0);
    done1 = true;
  CPU2:
    while(done1 == false)
      ;
    v2 = f2(v0, v1);

  Intuitive intent:
    CPU2 should execute f2() with results from CPU0 and CPU1
    waiting for CPU1 implies waiting for CPU0
  On a single machine, this will probably work as expected.
  When memory is shared over the net and writes are not confirmed before moving on, messages may arrive out of order, so the hosts are not synchronized.

Example 1 won't work with naive distributed memory:
  Problem A:
    CPU0's writes of v0 and done0 may be interchanged by the network
    leaving v0 unset but done0=true
  But assume each CPU sees each other's writes in issue order
  Problem B:
    CPU2 sees CPU1's writes before CPU0's writes
    i.e. CPU2 and CPU1 disagree on the order of CPU0's and CPU1's writes
  Lesson: either naive distributed memory isn't "correct"
    or we should not have expected Example 1 to work
  In this case, it was stale cached copies that killed us, so it's not just memory: it can happen on a distributed file system as well

How can we write correct distributed programs w/ shared storage?
  Memory system promises to behave according to certain rules.
  We write programs assuming those rules.
  Rules are a "consistency model"
  Contract between memory system and programmer

What makes a good consistency model?
  There are no "right" or "wrong" models
  A model may make it harder or easier to program, i.e.
  lead to more or less intuitive results
  A model may be harder or easier to implement efficiently
  Also application dependent
    A consistency model for Web pages is different than for memory

How about "strict consistency":
  each instruction is stamped with the wall-clock time at which it started, across all CPUs
  Rule 1: LD gets value of most recent previous ST to same address
  Rule 2: each CPU's instructions have time-stamps in execution order
  Essentially the same as on a uniprocessor

Would strict consistency avoid problems A and B?
  Yes, if you could implement it

How do you implement strict consistency?
  Time:  1   2   3   4
  CPU0:  ST          ST
  CPU1:      LD  LD
  Time between instructions << speed-of-light delay between CPUs!
  How is LD@2 even aware of ST@1?
  How does ST@4 know to pause until LD@3 has finished?
    how does ST@4 know how long to wait?
  Too hard to implement!
  So the issue here is that you can't compare timestamps across machines, so it's hard to synchronize in practice.

One reasonable model: sequential consistency
  Is an execution (a set of operations) correct?
  There must be some total order of operations such that
    1. all CPUs see results consistent with that total order
       i.e. reads see most recent write in the total order
    2. each CPU's instructions appear in-order in the total order
  So we're simulating multiple CPUs running on a single memory, which can have a total order.

Intuitive justification:
  The single total order means it's easy for one CPU to predict what other CPUs will see
  The "consistent with" and lack of real time may make it easy to implement
    The system appears free to interleave instruction streams however it likes to form the total order
  However! When executing in real time, once the system reveals a written value to a read operation, the system has committed to a little bit of partial order. This may have transitive effects.
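  The two rules above can be turned into a brute-force test of a recorded execution. This is only a sketch for intuition, not how any real system checks consistency: the function name, the ("W"/"R", addr, value) tuple format, and the initial memory value of 0 are all assumptions made for illustration. It tries every interleaving that keeps each CPU's operations in program order and asks whether every read returns the most recent write in that order.

```python
def is_sequentially_consistent(histories, initial=0):
    """Return True if some total order explains the recorded execution.

    histories: one list per CPU, in program order, of operations
      ("W", addr, value)  a store of value to addr
      ("R", addr, value)  a load from addr that returned value
    (Hypothetical format, chosen for this sketch.)
    """
    def search(positions, memory):
        # positions[i] = how many of CPU i's ops are already in the total order
        if all(p == len(h) for p, h in zip(positions, histories)):
            return True
        for cpu, hist in enumerate(histories):
            p = positions[cpu]
            if p == len(hist):
                continue
            kind, addr, value = hist[p]
            nxt = positions[:cpu] + (p + 1,) + positions[cpu + 1:]
            if kind == "W":
                # Rule 2: the next op of this CPU may go next in the order
                if search(nxt, {**memory, addr: value}):
                    return True
            elif memory.get(addr, initial) == value:
                # Rule 1: a read must see the most recent write in the order
                if search(nxt, memory):
                    return True
        return False

    return search(tuple(0 for _ in histories), {})


# Problem A from Example 1: CPU1 sees done0 set but still reads the old v0.
problem_a = [
    [("W", "v0", 1), ("W", "done0", 1)],
    [("R", "done0", 1), ("R", "v0", 0)],
]
# The intended outcome: CPU1 sees done0 set and the new v0.
intended = [
    [("W", "v0", 1), ("W", "done0", 1)],
    [("R", "done0", 1), ("R", "v0", 1)],
]
print(is_sequentially_consistent(problem_a))  # False
print(is_sequentially_consistent(intended))   # True
```

  The search is exponential in the number of operations, so it is only usable on tiny examples like the ones in these notes.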
  So in real life the system only has freedom in ordering more or less concurrent operations -- ones that haven't been observed yet

So the question now becomes how to implement sequential consistency correctly
  Let's look at IVY. What consistency does it provide and how?
  IVY takes a bunch of machines on a network, shares their memories, and allows you to treat memory as one large address space.

Why is Ivy cool?
  All the advantages of *very* expensive parallel hardware.
  On a cheap network of workstations.
  No h/w modifications required!

Do we want a single address space?
  Or have the programmer understand remote references?

Shall we make a fixed partition of the address space?
  I.e. first megabyte on host 0, second megabyte on host 1, &c?
  And send all reads/writes to the correct host?
  We can detect reads and writes using VM hardware.
    I.e. read- or write-protect remote pages.

What if we didn't do a good job of placing the pages on hosts?
  Maybe we cannot predict which hosts will use which pages?
  If we guess wrong, page faults will result in a network lookup and an interruption of a remote CPU.
  That makes 1 instruction take as long as 1000 or so instructions, slowing down the system significantly.

Could move the page each time it is used.
  When I read or write, I find the current owner, and I take the page.
  So we need a more dynamic way to find the current location of the page.

What if lots of people read a page?
  Move page for writing, but allow read-only copies.
  When I write a page, I invalidate r/o cached copies.
  When I read a non-cached page, I find the most recent writer.
  Works if pages are r/o and shared, or r/w by one host.
  Only bad case is write sharing.
    When might this arise?
    False sharing...

How does their scheme work?
  Section 3.1
  three CPUs, one MGR, draw table per entity
  the MGR keeps track of which CPU owns write-access to each page, allowing for a total ordering on all store operations, and thus facilitating sequential consistency
  ptable (page table): one per CPU, one entry per page
    lock
    access (read or write or nil)
    i_am_owner (same as write?)
  info: just on MGR, one entry per page
    lock
    copy_set: current readers of the page
    owner: the CPU with the most recent copy of the data on the page
  Message types:
    RQ read query (reader to MGR)
    RF read forward (MGR to owner)
    RD read data (owner to reader)
    RC read confirm (reader to MGR)
    WQ IV IC WF WD WC

scenario 1: owned by CPU0, CPU1 wants to read
  0. page fault on CPU1, since page must have been marked invalid
  1. CPU1 sends RQ to MGR
  2. MGR locks, sends RF to CPU0, MGR adds CPU1 to copy_set
  3. CPU0 sends RD to CPU1, CPU0 marks page as access=read
  4. CPU1 sends RC to MGR
  5. MGR unlocks

scenario 2: owned by CPU0, CPU2 wants to write
  0. page fault on CPU2
  1. CPU2 sends WQ to MGR
  2. MGR locks, sends IV to copy_set (i.e. CPU1)
  3. CPU1 sends IC msg to MGR
     (or does it? does MGR wait? Yes, for sequential consistency)
  4. MGR sends WF to CPU0
  5. CPU0 sends WD to CPU2, clears access
  6. CPU2 sends WC to MGR
  7. MGR unlocks

what if two CPUs want to write the same page at the same time?
  problem: a write has many steps and modifies multiple tables
  the tables have invariants:
    MGR must agree w/ CPUs about the single owner
    MGR must agree w/ CPUs about the copy_set
    copy_set != {} must agree with owner's writeability
  thus the write implementation should be atomic

-------------stopped lecture here

what enforces the atomicity?

what are the RC and WC messages for?

what if RF is overtaken by WF?
  (I'm not sure about this. Perhaps they assume FIFO message order anyway?)
  (per-CPU ptable locks fix many potential races.)

does Ivy provide strict consistency?
  no: ST may take a long time to revoke read access on other CPUs
  so LDs may get old data long after the ST issues

does Ivy provide sequential consistency?
  does it work on our examples?
    v = fn();
    done = true;
  can a CPU see done=true but still see old v?

Ivy does seem to use our two seq consist implementation rules.
  1. Each CPU executes reads/writes in program order, one at a time
  2. Each memory location executes reads/writes in arrival order, one at a time

What's a block odd-even based merge-split algorithm?
  Why is it appropriate to this kind of parallel architecture?
  Partition over N CPUs.
  Local sort at each CPU.
  View CPUs as a logical chain.
  Repeat N times:
    Even CPUs send to (higher) odd CPUs.
    Odd CPUs merge, send low half back to even CPU.
    Odd CPUs send to (higher) even CPUs.
    Even CPUs merge, send low half back to odd CPU.
  Note that "send to" means look at the right place in shared memory.
    Probably everything in a single huge array.

What's the best we could hope for in terms of performance?
  Linear speedup...

When are you likely to get linear speedup?
  When there's not much sharing -- strict partition.

How well do they do?
  linear for PDE and matrix mul, not so good for sorting

In what sense does it subsume the notion of RPC?
  When would DSM be better than RPC?
    More transparent. Easier to program.
  When would RPC be better?
    Isolation. Control over communication. Tolerate latency. Portability.
  Might you still want RPC in your DSM system? For efficient sleep/wakeup?

Has their idea been successful?
  Using workstations for parallel processing: yes!
    Beowulf, MapReduce, Dryad
  Shared memory? Hard to say.
    Lack of control over communication details?
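
The block odd-even merge-split sort described above can be sketched on a single machine, with a list of blocks standing in for the per-CPU partitions of shared memory. This is a minimal sketch under assumptions, not the paper's code: the function names are made up, all blocks are assumed to be the same size (the input length must be a multiple of the CPU count), and the "CPUs" run one after another rather than in parallel.

```python
import heapq


def merge_split(blocks, lo, hi):
    """Merge two neighbouring sorted blocks; low half stays at lo, high half at hi."""
    merged = list(heapq.merge(blocks[lo], blocks[hi]))
    k = len(blocks[lo])
    blocks[lo], blocks[hi] = merged[:k], merged[k:]


def odd_even_merge_split_sort(data, n_cpus):
    """Sort data by simulating n_cpus CPUs arranged in a logical chain."""
    if n_cpus < 1 or len(data) % n_cpus != 0:
        raise ValueError("this sketch assumes equal-sized blocks")
    size = len(data) // n_cpus
    # Partition over N CPUs, then local sort at each CPU.
    blocks = [sorted(data[i * size:(i + 1) * size]) for i in range(n_cpus)]
    # Repeat N times: an even phase followed by an odd phase.
    for _ in range(n_cpus):
        # Even CPUs pair with their higher odd neighbour.
        for i in range(0, n_cpus - 1, 2):
            merge_split(blocks, i, i + 1)
        # Odd CPUs pair with their higher even neighbour.
        for i in range(1, n_cpus - 1, 2):
            merge_split(blocks, i, i + 1)
    return [x for block in blocks for x in block]


print(odd_even_merge_split_sort([5, 3, 8, 1, 9, 2, 7, 4], 4))
# [1, 2, 3, 4, 5, 7, 8, 9]
```

Each phase only touches neighbouring blocks, which is why it suits a chain of workstations, but every merge-split moves a whole block between two CPUs. In a DSM that means a lot of page traffic.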