6.824 Lecture 7: Release consistency

Plan:
  Finish up Ivy (first system to propose distributed shared memory (DSM))
  Consistency
    Strict
    Sequential
    Release (eager and lazy)
  TreadMarks (used release consistency for DSM)
    Big idea: vector timestamps

What makes a good consistency model?
  There are no "right" or "wrong" models
  A model may make it harder or easier to program
    i.e. lead to more or less intuitive results
  A model may be harder or easier to implement efficiently

Lab 6 will implement lazy release consistency on extents.
Lab 5 lays the groundwork by doing lazy releasing of locks.

Ivy summary
  predecessor scheme for TreadMarks
  you have lots of workstations
  you want to use them in parallel for a single big computation
  you write your application using threads and locks
    maybe it's an existing threaded app, no changes required
  start a copy on each workstation
  use "distributed shared memory" so that they share all memory
  layering: app, DSM library, VM h/w, network
  VM h/w can mark page as invalid, read-only, or read-write
  fault to DSM library if you write a r/o page, or access an invalid page
    in any way
  every page has a current owner
  page manager tracks current owner
    current owner required to have a copy!!! cannot toss it.
    management duties spread over workstations, page # mod N
  read fault (on an invalid page)
    ask manager who current owner is
    get a copy
    tell current owner to mark as read-only
    tell manager we have a copy
    mark as read-only on local machine
  write fault
    ask manager who current owner is
    get a copy (if not already present r/o)
    tell current owner to mark invalid, and give up ownership
    tell manager we're the owner
    manager tells all r/o copies to mark as invalid
    mark as r/w on local machine
  invariants:
    1. every page has exactly one current owner
    2. current owner has a copy of the page
    3. if mapped r/w by owner, no other copies
    4. if mapped r/o by owner, maybe identical other r/o copies
    5. manager knows about all copies
  sequentially consistent
    rules cause writes to a page to proceed one at a time

So Ivy introduced DSM, and went from a centralized manager to a distributed
manager to pushing a lot of the invalidation work onto the requesting nodes.
TreadMarks wanted better performance by avoiding communication unless
absolutely necessary, and to do this in user space instead of kernel space.
  - Ivy required invalidating all cached copies on a write
  - Ivy required the entire page to be sent to requesters, rather than diffs

TreadMarks high level goals?
  Better DSM performance.
  Run existing sequentially consistent parallel code.

What specific problems with previous DSM are they trying to fix?
  false sharing of large pages
    only one writer of a page at a time
    write of irrelevant data may invalidate my page
    (if x and y are stored on the same page and cpu1 and cpu2 update them
     independently, they will keep invalidating each other's copies even
     though the writes don't conflict)
  reduce overhead, in particular when a variable is contended
  ask very little of the programmer, other than annotating blocks of code
    with synchronization, which was required for shared memory anyway

What are write diffs? And what is the point?
  Are they just like object/variable granularity rather than page granularity?
  No sub-page invalidate, so you would need to send updates after each store.
  Do diffs make sense by themselves? I.e. w/o RC or LRC?

Vector timestamps
  - A vector with an entry for each CPU in the system.
  - Each entry holds the latest interval of that CPU whose updates the
    holder of the vector has seen.
  - say cpu1 has vector 1212 and cpu2 has vector 1211 for page X.
    Then cpu2 is outdated compared to cpu1 (it lags only in cpu3's entry),
    and can get the update initiated by cpu3 from cpu1.
  - if cpu3 has 1301, then you can't order cpu1, cpu2, cpu3: neither vector
    is at least as large as the other in every entry. This means there was
    a conflict on a page (two writers).

What is release consistency? And what is the point?
  - Any code that has proper acquire/release protecting shared variables
    will execute as if it were running on sequentially consistent memory
  - Synchronization is only guaranteed at release of a lock

When would RC be more efficient than just write diffs?
  example program:
    cpu1: while (1) {acquire; x = 1; ... x = 10; release}
    cpu2: while (1) {acquire; x = 2; release}
  Ivy would share cpu1's 10 updates 10 times and invalidate pages.
  TreadMarks only synchronizes on release, and only sends a diff upon release.
  So it's logical: people who can already write multithreaded code with
  locks can use TreadMarks' DSM without learning new primitives.

What is lazy release consistency? And what is the point?

When would LRC be more efficient than RC?
  Assume each variable on its own page. Or that granularity is a single word...
  Time ----->
  CPU0: al1 x=1 rl1 al2 z=99 rl2
  CPU1:                          al1 y=x rl1
  CPU2:                                      al1 print x, y, z rl1
  cpu1's al1 asks the manager for lock1, and gets forwarded to cpu0.
  cpu1 tells cpu0 it wants the lock, and sends its vector timestamp to cpu0.
  cpu0 releases, and sends cpu1 the write notices that are newer than the
  vector timestamp cpu1 sent; cpu1 can then fetch the diffs it wants based
  on those notices.
  Why is it legal for z to be out-of-date at CPU2?
    -> there's no lock2 acquired on cpu2

Example for why you need the vector timestamps.
  A bracketed triple such as [110] is a vector timestamp
  (entries for CPU0, CPU1, CPU2).
  [000] CPU0: al1 x=1 rl1 [100]
  [000] CPU1:             ^-[100] al1 y=x rl1 [110]
  [000] CPU2:                                 ^-[110] al1 print x, y rl1
  What's the "right" answer? 1,1 cpu0
  How would an eager release consistent system handle this?
  How does lazy release consistency handle this?
  How does TreadMarks know what to do?

What if the VTs for the two values are not ordered? Could this happen?
  [000] CPU0: al1 x=1 rl1 [100]
  [000] CPU1: al2 x=2 rl2 [010]
  [000] CPU2:                   al1 al2 print x rl2 rl1
                              ,------^
  At the point at which cpu2 tries to acquire lock1 and lock2, it will get
  confused.
  What do you do at this point? Scream at the programmer!
  You have conflicting vector clocks with conflicting write notices: protect
  your shared memory better!

So summary is:
  - on acquire, get the last holder of the lock from a central/distributed
    table
  - send your version vector to the last holder
  - last holder sends back its latest version vector
  - it also sends all write notices (page id and cpu that last updated it)
    for updates that came after the acquirer's older version vector
  - acquirer sends diff requests for the write-noticed data that it actually
    needs (triggered by a page fault on a write-noticed page)
  - if neither version vector is newer than the other, then there was a
    conflict, which means the programmer did not lock properly to protect
    shared data

What if you get different values from different sources?
  CPU0: al1 x=1 rl1 al2 y=9 rl2
  CPU1: al1 x=2 rl1
  CPU2: al1 al2 z = x + y rl2 rl1
  CPU2 is going to hear "x=1" from CPU0, and "x=2" from CPU1.
  How does CPU2 know what to do?

What model of consistency does the programmer need to have in mind?
  What rules does the programmer have to follow?
  What can the programmer expect from the memory system?

Example of when LRC might still do too much work.
  CPU0: al2 z=99 rl2 al1 x=1 rl1
  CPU1: al1 y=x rl1
  In this case, CPU1 didn't really need z.
  This suggests that further improvements might be possible, at what expense?
  Compiler or programmer notices dependencies?
  Essentially, you could keep per-lock or per-variable versions, but that's
  a lot of extra state to keep, especially since we're sending write
  notices, not diffs!

What happens in this case?
  CPU0: al1 x=1 rl1 al1 x=2 rl1
  CPU1: al1 al2 y=x rl2 rl1
  CPU2: al2 z=y rl2
  Does CPU2 get the x=2 update? Should it? Does it make any difference?
  In most cases TreadMarks could avoid sending x=2 if it wanted to.
  But there might have been a GC, forcing CPU2 to know about x=2.

Are there programs that work under RC that break under LRC?
  RC is blind to which lock variable was involved.
  CPU0: al1 x=1 rl1 al1 if(y==0) ... rl1
  CPU1: al2 y=1 rl2 al2 if(x==0) ... rl2

TreadMarks keeps complete history of write notices.
  I.e. accurate record of what changed in each interval.
  Write notice and interval record lists in Figure 2.
  It does not just keep the latest data. Why?
  Is the history needed for LRC?

Eager vs. Lazy Release Consistency
  - Rather than pushing updates out on every release, the releaser sends
    them only when an acquirer needs them (one message for many updates,
    only on acquire). This is for the most part good, unless you have to
    acquire a lock just to receive updates, which is expensive and awkward.

What's the point of the diffs?
  Send only what you need to.

Why do they send write notices separately from the diffs?
  They can delay the diffs until you actually use the page: you might not.
  - if you update a variable twice before another cpu needs it, only send
    them the last diff.
  Maybe you already have the write notice? But VT indicates this?

When you ask for diffs, how far back in the past do you ask for them?
  Does this imply unbounded amount of history?

So it seems that TreadMarks scales close to linearly with the number of
processors, with the caveat that if the program takes too many locks and
communicates too much between nodes, it won't scale as well.

It turns out that for scientific computing, people prefer message passing,
since it makes explicit when they want to share data. For data-intensive
applications like MapReduce, people don't like shared memory, and are
willing to adopt other abstractions.

Other complaints: shared memory can't scale to many workstations, since
with 1000 machines several will fail, and DSM is not fault-tolerant. There
is further literature to improve on this.
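
Sketch: vector timestamps and write notices on acquire
  Going back to the acquire summary earlier in these notes, here is a minimal
  Python sketch of the two pieces of bookkeeping it relies on: comparing
  vector timestamps, and picking the write notices an acquirer has not seen.
  This is not TreadMarks' actual code or data layout. The names (covers,
  compare, merge, WriteNotice, notices_for_acquirer) and the page labels are
  hypothetical, and a real system would also track diffs, intervals per lock
  transfer, and garbage collection.

```python
from dataclasses import dataclass


def covers(a, b):
    """True if vector timestamp a has seen everything b has (a >= b entrywise)."""
    return all(x >= y for x, y in zip(a, b))


def compare(a, b):
    """Order a relative to b: 'equal', 'before', 'after', or 'concurrent'."""
    if a == b:
        return "equal"
    if covers(b, a):
        return "before"
    if covers(a, b):
        return "after"
    # Neither dominates: unordered updates, i.e. a likely missing lock.
    return "concurrent"


def merge(a, b):
    """Entrywise max: what the acquirer knows after hearing from the releaser."""
    return tuple(max(x, y) for x, y in zip(a, b))


@dataclass(frozen=True)
class WriteNotice:
    cpu: int       # CPU that created the interval (hypothetical field names)
    interval: int  # that CPU's interval number
    page: str      # page modified during the interval


def notices_for_acquirer(releaser_log, acquirer_vt):
    """Write notices from intervals the acquirer's timestamp does not cover."""
    return [n for n in releaser_log if n.interval > acquirer_vt[n.cpu]]


if __name__ == "__main__":
    # Chain from the notes: CPU0 writes x, CPU1 writes y, CPU2 reads both.
    cpu0_log = [WriteNotice(0, 1, "page_x")]
    cpu1_log = cpu0_log + [WriteNotice(1, 1, "page_y")]
    print(notices_for_acquirer(cpu0_log, (0, 0, 0)))  # CPU1 acquires from CPU0
    print(notices_for_acquirer(cpu1_log, (0, 0, 0)))  # CPU2 acquires from CPU1
    print(compare((1, 0, 0), (1, 1, 0)))  # ordered chain
    print(compare((1, 0, 0), (0, 1, 0)))  # two writers under different locks
```

  The acquirer only learns which pages changed. It would mark those pages
  invalid and request diffs later, on the first fault, which is what lets
  LRC skip diffs for pages that are never touched.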