6.824 Lecture 15: Peer-to-peer systems: Lookup

Prior focus has been on traditional distributed systems
  e.g. client/server style
  Server in machine room: well maintained, centrally located, perhaps replicated.
  Examples: master in MapReduce/Dryad, lock and extent server in YFS

Now: Internet-scale systems
  Huge number of distributed machines: can't know everyone
  Machines owned perhaps by you and me: no obvious central authority.
  Needs to scale to a much larger number of machines---can't have a centralized service with replicated state machines.
  e.g. from e-mail to Napster, Kazaa, BitTorrent, Skype

Problems
  How do you name nodes and objects?
  How do you find other nodes in the system (efficiently)?
  How should data be split up between nodes?
  How to prevent data from being lost? How to keep it available?
    - with millions of machines, at least one will die. How to keep machine loss from turning into data loss?
  How to provide consistency?
    - latency between replicas is higher (machines in different parts of the world). Have to give up a bit of consistency to avoid overhead.
  How to provide security? anonymity?
    - you can't get everyone to submit to a central certificate authority to verify identity, so you reduce your security assumptions: maybe only want to know that data is unchanged by adversary

What structure could be used to organize nodes?
  Central contact point, but provides central point of failure: Napster
    - BitTorrent: trackers---keep a list of seeders and hashes of files so that you can start with a set of remote servers and know data is authentic. So it's a decentralized architecture with a centralized component.
  Hierarchy: DNS for e-mail, WWW
    - need to establish hierarchy---who is the root?
    - consistency between layers of the hierarchy---DNS has a lease time during which lower nameservers may be out-of-date
    - semantics associated with names. Since it's not flat, if you leave MIT, you can no longer name your service in the *.mit.edu namespace.
  Flat?

Let's look at a system with a flat interface: DHT
  flat is cool, because app can pick IDs in a way that is good for the app

Scalable lookup: provide an abstract interface to store and find data

Typical DHT interface:
  put(key, value)
  get(key) -> value
  loose guarantees about keeping data alive
  log(n) hops, even for new nodes
  guarantees about load balance, even for new nodes

Potential DHT applications:
  publishing: DHT keys are like links
  file system: use DHT as a sort of distributed disk drive
    keys are like block numbers
    YFS extent server could be like this---returns list of possible fileservers with blocks.
  location tracking
    keys are e.g. cell phone numbers
    a value is a phone's current location
  tracker for BitTorrent, so it's not centralized
  DNS could be reimplemented, but no longer hierarchical
  caching systems---cache webpages (key is the URL). memcached has such an interface as well.

Basic idea
  Two layers: routing (lookup) and data storage
  Routing layer handles naming and arranging nodes and then finding them.
  Storage layer handles actually putting and maintaining data.

Challenges:
  how to map keys -> IDs of servers w/ data?
  load balancing of keys?
  how to track an intelligent subset of the nodes to cover routing to all of them?
  how to route queries to other nodes?
  how to handle nodes joining?
  how to handle node failures?

What does a complete algorithm have to do?
  1. Define IDs, document ID to node ID assignment
  2. Define per-node routing table contents
  3. Lookup algorithm that uses routing tables
  4. Join procedure to reflect new nodes in tables
  5. Failure recovery
  6. Move data around when nodes join
  7. Make new replicas when nodes fail

Typical approach:
  Give each node a unique ID
  Nodes and keys are mapped to the same ID space
  Have a rule for assigning keys to nodes based on node/key ID
    e.g. key X goes on the node with nearest ID to X
  Now how, given X, do we find that node?
    Arrange nodes in an ID-space, based on ID
      i.e. use ID as coordinates
    Build a global sense of direction
      Examples: 1D line, 2D square, tree based on bits, hypercube, or ID circle
    Build routing tables to allow ID-space navigation
      Each node knows about ID-space neighbors
        i.e. knows neighbors' IDs and IP addresses
      Perhaps each node knows a few farther-away nodes
        To move long distances quickly

The "Chord" peer-to-peer lookup system
  An example system of this type

ID-space topology
  Ring: all IDs are 160-bit numbers, viewed in a ring (largest ID sits right next to 0, so the space loops around).
  Everyone agrees on how the ring is divided between nodes
    Just based on ID bits

Assignment of key IDs to node IDs?
  Key stored on first node whose ID is equal to or greater than key ID (successor).
    Closeness is defined as the "clockwise distance"
  If node and key IDs are uniform, we get reasonable load balance.
  Node IDs can be assigned, chosen randomly, SHA-1 hash of IP address...
    - deterministic way to calculate node ID
    - load balancing for free by way of cryptographic hash
    - adversary can't break the system by making keys that hit one server
    - it's hard to compute an ID that would put you right before another node to steal all of its keys.
  Key IDs can be derived from data, or chosen by user

Routing?
  Query is at some node.
  Node needs to forward the query to a node "closer" to key.
    Simplest system: either you are the "closest" or your neighbor is closer.
    Hand-off queries in a clockwise direction until done
    Only state necessary is "successor".
      n.find_successor (k):
        if k in (n,successor]:
          return successor
        else:
          return successor.find_successor (k)
  Slow but steady; how can we make this faster?
    This looks like a linked list: O(n)
    Can we make it more like a binary search?
      Need to be able to halve the distance at each step.
      So store a skip-list at each node
        - If we use 160 bits, store 160 IPs and node IDs.
        - Pick the right nodes to store by picking IDs 1/2, 1/4, 1/8, ..., 1/2^160 of the way around the ring from yours (1/2 way around the ring, 1/4 way around the ring, etc.)
        - Since each step toward the node w/ the closest ID to the destination cuts the remaining distance at least in half, we can get to the destination in lg N steps. So for 1,000,000 nodes, we can find the right node in lg 1,000,000, or about 20, steps.

Finger table routing:
  Keep track of nodes exponentially further away:
    New state: succ(n + 2^i)
    Many of these entries will be the same in full system: expect O(lg N) distinct entries
      n.find_successor (k):
        if k in (n,successor]:
          return successor
        else:
          n' = closest_preceding_node (k)   [of the lg N we know of]
          return n'.find_successor (k)
  Maybe node 8's looks like this:
     1: 14
     2: 14
     4: 14
     8: 21
    16: 32
    32: 42

There's a complete tree rooted at every node
  Starts at that node's row 0
  Threaded through other nodes' row 1, &c
  Every node acts as a root, so there's no root hotspot
  This is *better* than simply arranging the nodes in one tree (no one node sees all of traffic)

If a node has a lot of bandwidth or a stronger machine?
  - if it is Y times stronger than other machines, run Y virtual instances of Chord servers on it. Their node IDs will be SHA1(IP_1),...,SHA1(IP_Y)

So let's say node 10 joins between node 5 and 20
  - needs to get a routing table. just borrow it from 20, and correct it over time
  - assume predecessor and successor pointers exist
  - 10 sets 20 as its successor, borrows its routing table
  - 10 tells 20 to set its predecessor pointer
  - 5 periodically asks 20 who its predecessor is, and eventually updates its successor to 10
  - replicate data from old successor so that you can have fault tolerance, and eventually delete it from the successor. This assumes values are immutable, or else you'll need a commit protocol to keep consistency.

How does a new node acquire correct tables?
  General approach:
    Assume system starts out w/ correct routing tables.
    Use routing tables to help the new node find information.
    Add new node in a way that maintains correctness.
  Issues a lookup for its own key to any existing node.
    Finds new node's successor.
  Ask that node for its finger table.
  At this point the new node can forward queries correctly:
    Tweak its own finger table as necessary.

Does routing *to* us now work?
  If new node doesn't do anything, query will go to where it would have gone before we joined.
    I.e. to the existing node numerically closest to us.
  So, for correctness, we need to let people know that we are here.
  Each node keeps track of its current predecessor.
  When you join, tell your successor that its predecessor has changed.
  Periodically ask your successor who its predecessor is:
    If that node is closer to you (it sits between you and your current successor), switch your successor to that guy.
  Is that enough?
    Everyone must also continue to update their finger tables:
      Periodically look up your n + 2^i-th key

What about concurrent joins?
  E.g. two new nodes with very close IDs might have the same successor.
  e.g. 44 and 46. Both may find node 48... spiky tree!
  Good news: periodic stabilization takes care of this.

What about node failures?
  Assume nodes fail w/o warning.
    Strictly harder than graceful departure.
  Two issues:
    Other nodes' routing tables refer to dead node.
    Dead node's predecessor has no successor.
  If you try to route via dead node, detect timeout, treat as empty table entry.
    I.e. route to numerically closer entry instead.
  Repair: ask any node on same row for a copy of its corresponding entry.
    Or any node on rows below.
    All these share the right prefix.
  For missing successor
    Failed node might have been closest to key ID!
      Need to know next-closest.
    Maintain a _list_ of successors: r successors.
      If you expect really bad luck, maintain O(log N) successors.
    We can route around failure.
    The system is effectively self-correcting.
  In addition to keeping track of r successors, whenever you get a put() request, also put the key at your r-1 successors.
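
A minimal sketch of the successor-list idea above, assuming a toy in-memory ring rather than real networked nodes. The names Ring, successor_list, and fail are hypothetical helpers made up for illustration, not part of Chord; a skipped dead node stands in for a timeout.

```python
import bisect


class Ring:
    """Toy in-memory stand-in for a Chord ring (hypothetical, for illustration)."""

    def __init__(self, node_ids, r=3):
        self.ids = sorted(node_ids)              # node IDs in clockwise order
        self.r = r                               # copies per key: successor + r-1 more
        self.store = {n: {} for n in self.ids}   # per-node key/value storage
        self.dead = set()                        # nodes that failed w/o warning

    def successor_list(self, key_id):
        """The r nodes clockwise from key_id, starting at its successor."""
        start = bisect.bisect_left(self.ids, key_id) % len(self.ids)
        count = min(self.r, len(self.ids))
        return [self.ids[(start + i) % len(self.ids)] for i in range(count)]

    def put(self, key_id, value):
        """Store the value at the key's successor and its next r-1 successors."""
        for n in self.successor_list(key_id):
            if n not in self.dead:
                self.store[n][key_id] = value

    def get(self, key_id):
        """Try the successor first; on a dead node, fall through to the next one."""
        for n in self.successor_list(key_id):
            if n in self.dead:
                continue                         # models a timeout
            if key_id in self.store[n]:
                return self.store[n][key_id]
        return None

    def fail(self, node_id):
        """Node dies and its stored data is lost."""
        self.dead.add(node_id)
        self.store[node_id] = {}
```

For example, on a ring with nodes 1, 8, 14, 21, 32, 38, 42, 48, 51, 56 and r=3, key 10 lands on nodes 14, 21, and 32, so a get(10) still succeeds after node 14 fails. The sketch does not make new replicas after a failure; a real system would have to, or enough failures in a row eventually lose the key.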
  Finger requests might also lead to down servers while routing. If that happens, ask the next smaller finger that you have in your log(N) list of known nodes. At some point, the repair process will update our finger table to correct it.

Locality/Proximity
  Lookup takes log(n) messages.
    But they are to random nodes on the Internet!
    Will often be very far away (~100ms to go from east coast to west coast over internet).
  Can we route through nodes close to us on underlying network?
  This boils down to whether we have choices:
    If multiple correct next hops, we can try to choose closest.
  Chord doesn't allow much choice.
    Observe: strict successor for finger not necessary.
    Ping nodes in successor list of true finger, pick closest.
  What's the effect?
    Individual hops are lower latency.
    But less and less choice (lower node density) as you get close in ID space.
    So last few hops likely to be very long.
    Thus you don't *end up* close to the initiating node. You just get there quicker.
  How fast could proximity be?
    1 + 1/4 + 1/16 + 1/64
    Not as good as real shortest-path routing!
  Any down side to locality routing?
    Harder to prove independent failure.
      Maybe no big deal, since no locality for successor lists.
    Easier to trick me into using malicious nodes in my tables.

What about security?
  Self-authenticating data, e.g. key = SHA1(value)
    So DHT node can't forge data
    Of course it's annoying to have immutable data...
    It's also hard to look up keys like URLs or blocks, where you don't know the contents of the item you're looking up ahead of time...
  Can a DHT node claim that data doesn't exist?
    Yes, though perhaps you can check other replicas
  Can a host join w/ IDs chosen to sit under every replica?
    Or "join" many times, so it is most of the DHT nodes?
      This is called a Sybil attack
    How are IDs chosen?
    This is an open problem: unclear how to authenticate people in a decentralized fashion.

Why not just keep complete routing tables?
  So you can always route in one hop?
  Danger in large systems: timeouts or cost of keeping tables up to date.
  Accordion (NSDI '05): trades off between keeping complete state and performance, subject to a budget

Are there any reasonable applications for DHTs?
  For example, could you build a DHT-based Gnutella?
  Content Distribution Networks as well---see next week
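
To make the finger-table lookup from the Chord section concrete, here is a minimal sketch. It assumes a 6-bit ID space instead of 160 bits and builds every finger table from a global sorted list of node IDs, which a real node would not have. Node, succ, in_half_open, and in_open are hypothetical names for illustration; the node IDs are chosen so that node 8's table matches the one shown in the notes.

```python
M = 6                 # ID bits for this toy ring (Chord itself uses 160)
SPACE = 2 ** M


def in_half_open(x, a, b):
    """True if x lies in the ring interval (a, b], going clockwise."""
    x, a, b = x % SPACE, a % SPACE, b % SPACE
    if a < b:
        return a < x <= b
    return x > a or x <= b        # interval wraps past 0; a == b is the whole ring


def in_open(x, a, b):
    """True if x lies in the ring interval (a, b)."""
    return in_half_open(x, a, b) and x % SPACE != b % SPACE


def succ(ids, k):
    """First node ID >= k, wrapping around to the smallest ID (ids is sorted)."""
    k %= SPACE
    return next((n for n in ids if n >= k), ids[0])


class Node:
    def __init__(self, node_id, ids):
        self.id = node_id
        # finger i points at succ(n + 2^i); finger 0 is the plain successor
        self.finger_ids = [succ(ids, node_id + 2 ** i) for i in range(M)]
        self.successor_id = self.finger_ids[0]


def find_successor(nodes, start_id, k):
    """Return (ID of the node responsible for k, number of forwarding hops)."""
    n, hops = nodes[start_id], 0
    while not in_half_open(k, n.id, n.successor_id):
        # closest preceding node: farthest finger strictly between n and k
        nxt = n.successor_id
        for f in reversed(n.finger_ids):
            if in_open(f, n.id, k):
                nxt = f
                break
        n, hops = nodes[nxt], hops + 1
    return n.successor_id, hops


ids = [1, 8, 14, 21, 32, 38, 42, 48, 51, 56]
nodes = {n: Node(n, ids) for n in ids}
```

With this ring, nodes[8].finger_ids is [14, 14, 14, 21, 32, 42], and find_successor(nodes, 8, 54) forwards 8 -> 42 -> 51 and returns node 56 after 2 hops. The loop is the iterative form of the recursive n.find_successor (k) pseudocode; in this sketch a lookup never needs more than M forwarding hops.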