11/20/2008
----------

Readings:
* Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. In OSDI, 2004. http://labs.google.com/papers/mapreduce-osdi04.pdf
* Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, and Robert E. Gruber. Bigtable: A Distributed Storage System for Structured Data. In OSDI, 2006. http://labs.google.com/papers/bigtable-osdi06.pdf

=MapReduce=

Simplified parallel data processing framework (language + system). Google found that lots of people wanted to process data that sits at many different sites, and writing such a query by hand in C++ is annoying.
- sample query: what is the frequency of every word on the internet?

Two logical phases:
- map: for each input document d, run map(d) -> {(key, value)}, i.e. output a set of key/value pairs as determined by some map function.
- reduce: for each key k and its list of values, run reduce(k, {value}), i.e. apply some reduce function to the list of values collected for that key.

In SQL:

    SELECT key, aggregate(value)
    FROM (SELECT key, value
          FROM documents
          WHERE condition(documents))
    GROUP BY key

MapReduce is much more general than SQL, since the map/reduce functions are arbitrary user-defined code, but you can pretty much reproduce MapReduce in SQL.

Example: counting all words in all docs on the internet

    map(doc):
        for each word in doc:
            emit(word, 1)

    reduce(key, values):
        emit(key, sum(values))

[ Use sum() instead of length(), so individual mappers can preaggregate data before they ship it to the reducers. ]

Uses:
- Search index construction - takes 13 MapReduce jobs run in series to build the index
- Distributed grep
- Yahoo! made Pig, which implements distributed query operators on top of this model for a more database-oriented experience

==Architecture==

Built on top of the Google File System (GFS), a distributed filesystem that breaks data into chunks replicated (three times) on different machines, so that if one replica goes down the data can be retrieved elsewhere.

General principle: bring the computation to the server with the data, so a map worker should run on a GFS node holding the data it is processing, if possible.

Steps:
- The master tells each map worker which chunks to read.
- Map workers read their chunks from GFS and output key/value pairs.
- A map worker writes its output to its local file system rather than sending it to reduce workers directly, so that if the mapper crashes you don't have to figure out how to clean up a reduce worker's combined state (which came from multiple mappers).
- Mappers partition their output into as many bins as there are reduce workers, and each bin is shipped to the corresponding reduce worker.
- Each reducer merges all the shuffled partitions it receives and outputs the reduced keys.

If the user knows something about the data, they can write a "Combiner," which runs a reduce-like step on the mapper before the data is shipped to the reducers, to compress the intermediate data (the sketch at the end of this section shows where it fits in the dataflow).

Fault tolerance:
- Map worker fails: the master reassigns its work to someone else.
- Reduce worker fails: the master reassigns its work to someone else, who fetches the map partitions again.
- Master fails: give up. The probability that any one machine (e.g. the master) fails is low, whereas the probability that one of hundreds or thousands of workers fails is high.
- Worker is live but slow: run the same task on another machine as well, and take whichever copy finishes first. Slowness happens either because a machine is broken or because the partitioning was not uniform.
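A minimal single-process sketch (in Python) of the word-count job above, walking through the map phase with a mapper-side combiner, the shuffle (hash-partitioning mapper output into one bin per reducer), and the reduce phase. It is only an illustration of the dataflow, assuming in-memory strings stand in for GFS chunks; names like run_mapreduce, map_fn, and reduce_fn are invented here, not the paper's API.

    from collections import defaultdict

    def map_fn(doc):
        # map(doc): for each word in doc, emit (word, 1)
        for word in doc.split():
            yield (word, 1)

    def reduce_fn(key, values):
        # reduce(key, values): emit (key, sum(values)); using sum() rather than
        # len() lets the same function work on pre-aggregated combiner output
        return (key, sum(values))

    def run_mapreduce(docs, map_fn, reduce_fn, num_reducers=2):
        # Map phase: each "mapper" pre-aggregates (combines) its own output
        # locally before anything is shipped, mimicking the Combiner.
        mapper_outputs = []
        for doc in docs:
            local = defaultdict(int)
            for key, value in map_fn(doc):
                local[key] += value
            mapper_outputs.append(local)

        # Shuffle: partition each mapper's output into one bin per reduce
        # worker, here by hashing the key.
        partitions = [defaultdict(list) for _ in range(num_reducers)]
        for local in mapper_outputs:
            for key, value in local.items():
                partitions[hash(key) % num_reducers][key].append(value)

        # Reduce phase: each "reducer" merges the value lists it received.
        results = {}
        for part in partitions:
            for key, values in part.items():
                k, v = reduce_fn(key, values)
                results[k] = v
        return results

    docs = ["the quick brown fox", "the lazy dog", "the quick dog"]
    print(run_mapreduce(docs, map_fn, reduce_fn))
    # e.g. {'the': 3, 'quick': 2, 'brown': 1, 'fox': 1, 'lazy': 1, 'dog': 2}

In the real system each mapper's bins are written to its local disk and fetched by the reducers over the network, which is exactly what lets the master simply re-run a crashed mapper.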
==Performance==

The paper claims they grep 1 TB in 150 seconds, which is about 6.7 GB/sec. However, they are doing it on 1800 machines, which works out to roughly 4 MB/sec per machine. That's WAY slower than the ~50 MB/sec of sequential disk I/O each machine could do. So MapReduce isn't magic - DB2 (a shared-nothing distributed DB) can achieve per-machine speeds that high. The freedom of the programming model is probably what gave it its popularity.

=BigTable=

Google's distributed storage system for structured data.

Maps 4-dimensional keys to values:

    (row, column family, column name, timestamp) -> value

Multiple columns can live in a column family (so you can effectively store several tables per row), and you can look at the value of a row/column AS OF a given timestamp.

The API is simple:
- look up rows given a predicate on the key
- insert/delete a given row/column

Data is partitioned horizontally by row key onto different machines: sequential ranges of keys are stored in tablets. A query for a row key is sent only to the server holding the tablet that contains that key (unlike distributed DBs, which send the query to all machines).

They don't want a master that handles every query, so it has to be easy for clients to find the tablet with the data they want. So how do you find it?

==Distributed Search Algorithm==

Chubby:
- a replicated state machine
- 5 machines (so it's highly available)
- stores small bits of configuration information that must be highly available.

Tablets live in a three-tier hierarchy, which is essentially a B+Tree:
- A single root tablet stores key ranges and pointers to the metadata tablets covering those ranges.
- Metadata tablets map (key range) -> (tablet containing the rows in that range).
- The actual data tablets at the base of the tree are stored on various tablet servers.

So the client library in the application asks Chubby for the root tablet's address, then looks up the key's range in the root tablet, then looks up the tablet containing the key in a metadata tablet. This seems like four lookups, but in practice the client library caches the result of each level it queries, and only goes back up a level of the hierarchy if a cached location turns out to be stale or incorrect.

==Tablets==

There are many more tablets per tablet server than fit in memory. The structure of a single tablet:
- a memtable: the part of the tablet that sits in memory.
- in GFS, the tablet stores:
  - several SSTables (sorted string tables), which are read-only (never updated in place).
  - a REDO log, which is where updates go. This is a write-ahead log: before the memtable is updated, the update is written to the REDO log.

Writes go to the memtable. To delete a row, store a deletion marker for its key in the memtable. (The REDO log gets these updates before the memtable does.)

Periodically, when the memtable fills (much as C-Store moves data from the WOS into the ROS), memtables are written out to GFS as SSTables, and as soon as a memtable is safely on disk its REDO log can be discarded. SSTables are merged in the background; a major compaction merges all SSTables, applying deletes.

A read has to consult the memtable and all of the SSTables, and returns an accurate result as of a given timestamp. (A toy sketch of this write/read path appears at the end of this section.)

==Tablet server failure==

Tablet servers continually renew locks with Chubby. The BigTable master is notified when a tablet server loses its lock, at which point it advertises that a new tablet server is needed, and Chubby ensures that only one tablet server picks up each tablet. The crashed server's data is stored on GFS, so it is never lost (it's replicated). The new tablet server that picks up a tablet replays its REDO log, rebuilds the memtable, and starts serving the SSTables.
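A toy Python sketch of the single-tablet write/read path described under ==Tablets==, plus the recovery step from ==Tablet server failure==: writes go to the REDO log and then the memtable, a full memtable is frozen into an immutable "SSTable", reads consult the memtable and every SSTable and keep the newest timestamp, and a recovering tablet server replays the REDO log to rebuild the memtable. Everything here lives in local memory instead of GFS, and all class and method names are invented for illustration, not taken from BigTable.

    import time

    TOMBSTONE = object()   # deletes are recorded as special "tombstone" values

    class Tablet:
        def __init__(self):
            self.redo_log = []   # write-ahead REDO log (would live in GFS)
            self.memtable = {}   # key -> (timestamp, value or TOMBSTONE)
            self.sstables = []   # immutable flushed memtables (SSTable stand-ins)

        def write(self, key, value):
            ts = time.time()
            self.redo_log.append((key, ts, value))   # log first (write-ahead)...
            self.memtable[key] = (ts, value)         # ...then update the memtable

        def delete(self, key):
            self.write(key, TOMBSTONE)               # a delete is a tombstone write

        def maybe_flush(self, max_memtable_entries=4):
            # When the memtable fills, freeze it as an immutable "SSTable" in
            # sorted-key order and discard the REDO log entries it covers.
            if len(self.memtable) >= max_memtable_entries:
                self.sstables.append(dict(sorted(self.memtable.items())))
                self.memtable = {}
                self.redo_log = []

        def read(self, key):
            # A read consults the memtable and every SSTable and keeps the
            # newest-timestamped entry; background compaction would merge
            # SSTables so fewer places need to be checked.
            hits = [t[key] for t in [self.memtable] + self.sstables if key in t]
            if not hits:
                return None
            ts, value = max(hits, key=lambda h: h[0])
            return None if value is TOMBSTONE else value

        def recover(self, redo_log):
            # A tablet server picking up this tablet after a crash replays the
            # REDO log (stored in GFS) to rebuild the memtable, then serves the
            # existing SSTables as before.
            for key, ts, value in redo_log:
                self.memtable[key] = (ts, value)
            self.redo_log = list(redo_log)

    t = Tablet()
    t.write("com.example.www", "<html>...</html>")
    t.delete("com.example.www")
    print(t.read("com.example.www"))   # None: the tombstone shadows the write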
==Master failure==

The master is responsible for deciding when tablets should be reassigned, and takes care of metadata creation. If it goes down, tablet servers compete to become the new master, and Chubby decides who wins.

==Transactional semantics==

There are no general transactions. The only guarantee is that a write to an individual row is atomic; an update that touches multiple rows gets no guarantees.
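To make that row-level guarantee concrete, here is a small Python sketch that models it with a hypothetical per-row lock: all column updates passed in one update_row() call become visible together, while two calls on different rows get no joint guarantee. This illustrates the semantics only (timestamps are omitted); it is not how BigTable implements it, and the class and method names are made up.

    import threading

    class Row:
        def __init__(self):
            self.lock = threading.Lock()
            self.columns = {}   # (column family, column name) -> value

    class Table:
        def __init__(self):
            self.rows = {}
            self.table_lock = threading.Lock()

        def _row(self, row_key):
            with self.table_lock:
                return self.rows.setdefault(row_key, Row())

        def update_row(self, row_key, updates):
            # All column writes for ONE row are applied under that row's lock,
            # so a concurrent reader sees either none or all of them.
            row = self._row(row_key)
            with row.lock:
                row.columns.update(updates)

        def read_row(self, row_key):
            row = self._row(row_key)
            with row.lock:
                return dict(row.columns)

    table = Table()
    # Atomic: both columns of this one row change together.
    table.update_row("com.cnn.www", {("anchor", "cnnsi.com"): "CNN",
                                     ("contents", ""): "<html>..."})
    # No cross-row guarantee: this is an independent single-row update, so a
    # reader may observe one of the two rows updated and not the other.
    table.update_row("com.cnn.www/sports", {("contents", ""): "<html>..."})
    print(table.read_row("com.cnn.www"))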