Lecture 22: Relaxed Consistency---PNUTS

Recent (2008) system in production use at Yahoo!

Web Infrastructure
- Start off as apache + mysql + framework of choice on a single machine
- As # users increases, the database becomes the bottleneck, but developers are reluctant to get rid of it, since it's a nice abstraction
- Hosting the DB is tough because it's bottlenecked by I/O, so the disk thrashes
- Next step is to partition the DB's tables across multiple machines, so that traffic for different users heads to different machines. This is non-trivial, since users interact w/ each other, and displaying any webpage requires RPCs, etc.
- Next step is to cache data in something like memcached, which distributes an in-memory cache to offload read-intensive workloads, with the added expense of keeping the cache consistent
- Next step is to have users all over the world. To get lower latency, you send users to different datacenters by geographic proximity. So you copy the partitioned infrastructure to different datacenters, which also helps w/ replication/fault tolerance.
- After that, you design your own data store (Google->bigtable, Amazon->dynamo, Microsoft->3 systems, Yahoo!->PNUTS)

PNUTS goals
- Relational-like DB
- Scale with increasing load
- Low latency---Service-Level Agreements (SLAs) oblige storage-level operations to complete within several tens of milliseconds
- High availability

Applications using this tool
- Yahoo user DB---near 1B entries, and millions of active users
- Social applications---have another table in addition to users that links users to one another (userx is friends with usery)
- Content metadata---don't store huge files like videos, but store information about those files, such as the url of the file, the user who uploaded it, the date, etc. Metadata requires low latency, whereas the big files require a high-throughput filesystem
- Listings management---listings for products, etc.
  Must allow sorting and range queries (rather than just lookups), so you can ask for all records with a property between two values (price, size, etc.)
- Session data---follow click streams, user navigation around a site within a given session

Data model---relational: tables of rows, and each row has attributes

Failure models
- at 10k's of machines, 1 will fail per hour
- network partitions might cut groups of machines off from each other

Queries
- set -> update some data at a given field
- insert new data
- read queries
- scan (lookup)
- range

Consistency model
- only row-based consistency -> updates to two different rows carry no consistency guarantee relative to each other
- timeline-based -> each record has a version number, made up of
  - generation number---updated each time the record is recreated
  - sequence number---updated each time the record is updated
- any read of a record at a given sequence number reflects the whole prefix of edits up to that sequence number

Query interface
- read-any -> doesn't matter which version you read -> low-latency query, for data that's ok to read stale
- read-critical(version) -> read a copy at least as new as this version
- read-latest -> reads the globally newest version of the row (from the record-level master) -> most expensive, might have to go to a data center across the world
- test-and-set(version, update) -> only perform the update if version is still the latest version in the DB -> useful for atomic read-modify-write.
  To implement increment under concurrent updates, you can run
      (version, record) = read_latest
      record.value = record.value + 1
      test-and-set(version, record)
  and retry from the read if the test-and-set fails because another writer got in first.

Infrastructure
- Storage servers---store tablets, which are subsets of a table small enough to fit in RAM. Storage servers have several hundred of these tablets each.
- Tablet controllers---store the ground truth of which tablet is on which machine. The controller runs on two machines, one being a backup of the other
- Routers---hold a cached copy of the tablets->machines mapping. This offloads work from the tablet controllers.
  If the router gives a client an incorrect storage server on a lookup, the client lets it know, and the router asks the tablet controller at that point

Tablet Controller deals with:
- tablet migration for load balancing
- splitting tablets as inserts make them grow
- helping routers that go down rebuild their state
- changing the master for a given record

Updates
- master gets updated, and bumps up the version number
- Yahoo! Message Broker (YMB) is a publish/subscribe system that all replicas subscribe to for updates from the master.
- master sends the update to YMB. YMB accepts a message once it has written the message to a replicated stable-storage log
- YMB ensures eventual delivery of each message (in order of acceptance) to all subscribers.
- if the master fails, another node becomes the master (the tablet controller decides this).
- before accepting more updates, the new master (in a different data center) sends a checkpoint message to YMB. Then it reads all messages from YMB. Once it gets its own checkpoint message back, it knows all of the old master's updates have been delivered to it.
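
The test-and-set increment from the Query interface section can be sketched in Python. This is only a toy illustration, not the PNUTS API: `RecordMaster`, `Version`, `increment`, and `run_demo` are hypothetical names, the "master" is a single in-process object guarded by a lock, and the retry limit is an arbitrary assumption.

```python
import threading
from dataclasses import dataclass


@dataclass(frozen=True)
class Version:
    generation: int  # bumped when the record is (re)created
    sequence: int    # bumped on every update


class RecordMaster:
    """Toy stand-in for the master copy of one record (hypothetical)."""

    def __init__(self, value: int = 0) -> None:
        self._lock = threading.Lock()
        self._version = Version(generation=1, sequence=0)
        self._value = value

    def read_latest(self):
        """Return the newest (version, value) pair."""
        with self._lock:
            return self._version, self._value

    def test_and_set(self, expected: Version, new_value: int) -> bool:
        """Apply the write only if `expected` is still the latest version."""
        with self._lock:
            if expected != self._version:
                return False  # someone else updated the record first
            self._value = new_value
            self._version = Version(expected.generation, expected.sequence + 1)
            return True


def increment(master: RecordMaster, max_attempts: int = 1000) -> int:
    """Read, add one, and write back; retry whenever the version has moved."""
    for _ in range(max_attempts):
        version, value = master.read_latest()
        if master.test_and_set(version, value + 1):
            return value + 1
    raise RuntimeError("too many conflicting updates")


def run_demo(threads: int = 4, per_thread: int = 50):
    """Run concurrent increments and return the final (version, value)."""
    master = RecordMaster()

    def worker() -> None:
        for _ in range(per_thread):
            increment(master)

    pool = [threading.Thread(target=worker) for _ in range(threads)]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    return master.read_latest()


if __name__ == "__main__":
    print(run_demo())
```

Every failed test-and-set means some other writer succeeded in the meantime, so no increment is lost: after 4 threads of 50 increments each, the value and the sequence number both end at 200.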
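
The checkpoint step in the Updates section can be sketched the same way. This is a minimal sketch under loud assumptions: `Broker` is an in-memory ordered log standing in for YMB, `Replica.take_over` is a hypothetical name, messages are plain dicts, and polling never blocks. The point is only that, because delivery follows acceptance order, seeing your own checkpoint come back implies every earlier update has already been applied.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class Broker:
    """Toy ordered log with one read cursor per subscriber (hypothetical)."""
    log: list = field(default_factory=list)
    cursors: dict = field(default_factory=dict)

    def publish(self, message: dict) -> None:
        # A message counts as "accepted" once it is appended to the log.
        self.log.append(message)

    def poll(self, subscriber: str):
        """Return the subscriber's next message in acceptance order, or None."""
        pos = self.cursors.get(subscriber, 0)
        if pos >= len(self.log):
            return None
        self.cursors[subscriber] = pos + 1
        return self.log[pos]


class Replica:
    def __init__(self, name: str, broker: Broker) -> None:
        self.name = name
        self.broker = broker
        self.store = {}        # key -> (sequence number, value)
        self.is_master = False

    def apply(self, message: dict) -> None:
        if message["type"] == "update":
            self.store[message["key"]] = (message["seq"], message["value"])

    def take_over(self) -> None:
        """Become master only after draining everything ahead of our checkpoint."""
        token = uuid.uuid4().hex
        self.broker.publish({"type": "checkpoint", "token": token})
        while True:
            message = self.broker.poll(self.name)
            if message is None:
                raise RuntimeError("checkpoint not delivered yet")
            if message["type"] == "checkpoint" and message.get("token") == token:
                break  # all updates accepted before the checkpoint are applied
            self.apply(message)
        self.is_master = True


def run_demo() -> Replica:
    broker = Broker()
    replica = Replica("dc-west", broker)
    # The old master published three updates before failing.
    for seq, value in enumerate(["a", "b", "c"], start=1):
        broker.publish({"type": "update", "key": "user42", "seq": seq, "value": value})
    replica.apply(broker.poll("dc-west"))  # the replica lags: it has seen only one
    replica.take_over()
    return replica


if __name__ == "__main__":
    r = run_demo()
    print(r.store, r.is_master)
```

In the demo the lagging replica catches up on the two updates it had missed before it flips `is_master`, so it never accepts a write on top of a stale copy.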