6.824 Lecture 5: Distributed Programming: MapReduce and DryadLINQ

Outline:
  Lab 2
  MapReduce (Google)
  Dryad (Microsoft)
  DryadLINQ (Microsoft)

Lab 2:
  Files: fuse.cc, fuse_client.cc, fuse_client.h, and yfs_protocol.h
  Trace getattr(), which the open or stat system call will send
  When a user types "$ cat x" at the prompt, the kernel executes various system calls.
  FUSE runs as a kernel module that acts as a stub for various FS implementations, such as YFS.
  When you mount a partition on a directory, you tell the kernel that all requests for that directory should go to YFS via FUSE.
  Benefits of FUSE:
    Only the FUSE stub lives in the kernel; your FS runs at user level, so a bug in your FS doesn't crash the kernel (or the computer).
    You can attach a debugger, etc., which makes development in userland easier.

So far, we have seen RPC and RMI as ways to program distributed systems. What if you want to write applications that are aware of parallelism? Since writing a distributed application has a number of additional challenges over sequential programming, it would be nice if there were ways to simplify it. Today we see two designs for making it easier to write parallel applications on distributed computer systems: MapReduce and Dryad.

The class of parallel applications we'll focus on today is data parallelism. In this model, you run one operation on many different chunks of data.

This lecture is also a case study of:
  Use of distributed computer systems
  Distributed computing challenges: programming, fault tolerance, consistency, concurrency, etc.

Applications
  Scientific applications (large-computation processing---rendering, etc.)
  Large-data processing apps (indexing, search, ...)
Designs:
  Cluster computing using PCs connected by a high-speed network
  Grid computing using a high-speed network of supercomputers
    good for scientific computing on mainframes/grids where computation is heavy
  Volunteer/personal computers (seti@home)
    aggregates PCs on the Internet

Classes of parallelism
  Embarrassingly parallel - no communication between the pieces of the computation (not interesting to think about, since there's no communication)
  Coarse-grained parallelism - after every major chunk of computation, some communication is required (perfect for distributed systems, since communication stays low)
  Fine-grained parallelism - after every for-loop, you might synchronize between the pieces (not good for distributed systems since communication is too high---use multiprocessors)

Challenges:
  Parallelize application --- How to handle shared state?
    Network is typically a bottleneck
    Embarrassingly parallel (run same app for different inputs, users, ..)
    Coarse-grained (communication-to-computation ratio is low)
    Fine-grained (typically requires a parallel computer)
  How to write the application?
    Explicit messages
      RPC---good for client/server, not for general distributed computation
      MPI---message passing between processes
    Shared memory
  Balance computations of an application across computers
    Statically (e.g., doable when designer knows how much work there is)
    Dynamically
  Handle failures of nodes during computation
    With 1,000 machines, is a failure likely in a 1-hour period?
    Often easier than with, say, banking applications (or YFS lock server)
    Many computations have no "harmful" side-effects and clear commit points
  Scheduling several applications that want to share infrastructure
    Time-sharing
    Strict partitioning

MapReduce
  Design
    Partition large data set into M slices
      a slice is typically a 64-MByte part of the input
      M is usually greater than the number of machines you have. This helps balance load across machines and handle machine failures by passing slices around.
      To keep all machines busy in a heterogeneous machine cloud, allow each machine to run multiple partitions.
    Run map on each partition, which produces R local partitions on the mapping machine; the partition function (e.g., hash(key) mod R) assigns each map output key to one of the R partitions.
    Run reduce on the merged result of pulling the same partition from each of the mappers. This produces R output files.
  Programmer interface:
    map(string key, string value)
    reduce(string key, iterator values)
  Example: word count
    split file into big splits
    a map computation
      takes one split as input
      produces a list of words as output
      the output is partitioned into R partitions
    a reduce computation
      takes a partition as input
      outputs the number of occurrences of each word
  Implementation:
    caller invokes mapreduce library
    library creates worker processes
      run map or reduce computations
    library creates one master process
      master assigns map and reduce tasks to workers
      master is comm channel between map and reduce workers
      handles failures of workers
    map workers communicate locations of R partitions to master
    reduce worker asks master for locations
      sorts input keys
      runs reduce operation
    when all workers are finished, master returns result to caller
  Fault tolerance
    when worker fails
      master resets the worker's map and reduce tasks to idle
        maps need to be reset because map's output is local and unavailable
        (completed reduce tasks need not rerun; their output is already in the global file system)
      when map is reset, inform all reduce tasks to read input from new worker
    when master fails
      restart entire mapreduce - failure case.
  Semantics:
    if user map and reduce functions are deterministic, then output is the same as a non-faulty sequential run of the program
    when reduce completes, worker renames tmp output file atomically
      reduce commit point!
  Load balance:
    M + R tasks
    ideally M + R is much larger than number of workers
    challenge: stragglers---slow machines.
    solution: if you have idle machines, run a slow map task on multiple machines, so that the fastest one wins.
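
A minimal single-process sketch of the word count example above, in Python. It only illustrates the data flow (map, partition into R buckets, reduce per partition); it is not the Google library. The names map_fn, reduce_fn, run_map_task, run_reduce_task, and the use of CRC32 as the partition hash are assumptions made for illustration, and there is no master, no worker processes, and no fault tolerance here.

```python
import zlib
from collections import defaultdict

R = 3  # number of reduce partitions (arbitrary value chosen for this sketch)


def map_fn(key, value):
    """key: split name, value: split contents. Emits (word, 1) pairs."""
    for word in value.split():
        yield word, 1


def reduce_fn(key, values):
    """key: a word, values: all counts emitted for it. Returns (word, total)."""
    return key, sum(values)


def partition(key, r=R):
    # Deterministic hash(key) mod R. Python's built-in hash() is salted per
    # process for strings, so CRC32 is used instead (an assumption).
    return zlib.crc32(key.encode("utf-8")) % r


def run_map_task(split_name, text, r=R):
    """Run map on one split; return its R local partitions."""
    buckets = [[] for _ in range(r)]
    for k, v in map_fn(split_name, text):
        buckets[partition(k, r)].append((k, v))
    return buckets


def run_reduce_task(p, all_map_outputs):
    """Pull partition p from every mapper, group by key, run reduce."""
    grouped = defaultdict(list)
    for buckets in all_map_outputs:
        for k, v in buckets[p]:
            grouped[k].append(v)
    # Reduce processes keys in sorted order, as in the notes above.
    return dict(reduce_fn(k, grouped[k]) for k in sorted(grouped))


def mapreduce(splits, r=R):
    """splits: dict of split name -> text. Returns R output 'files' (dicts)."""
    map_outputs = [run_map_task(name, text, r) for name, text in splits.items()]
    return [run_reduce_task(p, map_outputs) for p in range(r)]


def word_count(splits, r=R):
    """Merge the R reduce outputs into one dict for convenience."""
    merged = {}
    for part in mapreduce(splits, r):
        merged.update(part)
    return merged
```

Because each word hashes to exactly one partition, the R outputs never share a key, which is why merging them with a plain dict update is safe.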
  Locality
    master runs mappers close to where one of the 3 replicas of the input is (in GFS, the Google File System)

Dryad
  Similar goals as MapReduce, but different design
  Computations expressed as a graph
    Vertices are computations
    Edges are communication channels
    Each vertex can have several input and output channels
    A vertex won't run until all of its input channels are finished sending
  Nice use of C++ to make it easy to construct graphs
  Unlike MapReduce, where every map vertex points to each reduce vertex, the graph can be more general, and thus richer.
  Runtime system can optimize and rebuild graph on the fly to improve performance.
  Implementation
    Job manager
      execution records for each vertex
      when all inputs are available, vertex becomes runnable
      vertices may express preferences
      dynamic graph refinements
    Daemon
      creates processes to run vertices
    Stage manager
      locality
      replicated stages to avoid straggler problem
    Channels
      files, TCP pipes, or shared memory
  Load balancing
    Greedy scheduling
  Fault tolerance
    Job manager fails, computation fails
    Vertex computation fails
      restart vertex with different version #
      previous instance of vertex may run in parallel with new instances
  Semantics
    Assumption: vertices are deterministic
    Each vertex runs one or more times
    Stop when all vertices have completed their execution at least once
  Locality
    stage manager

MapReduce versus Dryad
  Many similarities
  Dryad has computation graphs, while MapReduce has a series of maps and reduces
  Each vertex can take n inputs, while map takes one input
  Each vertex can produce n outputs, while map generates one output
  How would you express the SQL query (see sec 2.1) using MapReduce?

DryadLINQ
  Builds on Dryad
  adds .NET (Microsoft framework) objects for strict typing
  supports declarative queries instead of procedural building of the program graph, so that you can describe WHAT you want, not HOW to get it.
  Declarative language has notions of filters (like map), aggregates/group-bys (like reduce), and order-by (like a map/reduce) that people are used to from SQL.
  Once you call ToDryadTable on a query in DryadLINQ, the compiler is invoked and generates a distributed query graph.
  Once you have a basic graph, a few static and dynamic optimizations rebuild the graph in reaction to static rules or data properties
    Some optimizations decide which operations to run on which computers to limit machine-to-machine communication, etc.
    Instead of using just hash partitioning like MapReduce, DryadLINQ deterministically samples files to see whether hash or range partitioning is better, etc.
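
To make the hash-versus-range distinction above concrete, here is a small Python sketch of the two partitioning schemes, with range boundaries picked from a deterministic sample of the keys. This is not DryadLINQ's actual algorithm or API: the function names, the every-step-th-key sampling rule, and the use of CRC32 as the hash are all assumptions for illustration.

```python
import zlib
from bisect import bisect_right


def hash_partition(keys, r):
    """Assign each key to partition hash(key) mod r (CRC32 is an assumption)."""
    parts = [[] for _ in range(r)]
    for k in keys:
        parts[zlib.crc32(str(k).encode("utf-8")) % r].append(k)
    return parts


def sample_boundaries(keys, r, step=10):
    """Pick r-1 range boundaries from a deterministic sample (every step-th key)."""
    sample = sorted(keys[::step])
    if not sample:
        return []
    # Boundaries sit at evenly spaced positions in the sorted sample.
    return [sample[(i * len(sample)) // r] for i in range(1, r)]


def range_partition(keys, boundaries):
    """Partition i holds the keys that fall between boundary i-1 and boundary i."""
    parts = [[] for _ in range(len(boundaries) + 1)]
    for k in keys:
        parts[bisect_right(boundaries, k)].append(k)
    return parts
```

With range partitioning, every key in partition i is no larger than any key in partition i+1, so sorting each partition locally yields a globally sorted result (useful for order-by). Hash partitioning only guarantees that equal keys land together, which is enough for group-by.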