11/4/2008

Reading:
* Michael Stonebraker, Samuel Madden, Daniel Abadi, Stavros Harizopoulos, Nabil Hachem, and Pat Helland. The End of an Architectural Era (It's Time for a Complete Rewrite). Proceedings of VLDB, 2007.

=Last time=

- Walmart: a few long-running, ad-hoc analytical queries
- eBay: millions of single-row queries

- Shared Memory Multiprocessor (SMP): scale up on one machine (SQL Server)
- Shared disk (Oracle, Sybase IQ): put large disks on a common bus shared between machines
- Shared nothing: DB2 Parallel Edition, Vertica, Greenplum, Teradata, Netezza, EnterpriseDB

MSFT just bought DATAllegro to move to multi-machine.

=Horizontal Partitioning=

==Data Warehousing==

In the Walmart case, the "bought" table takes 99% of the space and computation. To scale on a shared-nothing collection of L machines, partition the "bought" table across all machines. Horizontal partitioning of a table means each machine stores whole tuples (a subset of the rows).

Methods of assigning tuples to machines:
- Hash some key into L buckets
- Range partition on some key
- Round-robin (random assignment)
- Load-sensitive balancing for "hot" tuples

What about the other tables in the system - the small ones referenced by every row of "bought"?
- Just replicate them to each machine.

To execute a query, run the same query plan on each machine. Each machine does a sequential scan over the portion of the "bought" table that it holds. This scales linearly with the number of machines: since the majority of your time is spent scanning "bought," you divide the scan time by the number of machines doing the scanning. Perfect for a data warehousing scenario.

Doesn't replicating the other tables hurt you?
- Updates to those tables are very infrequent; new stores and products are not added at high throughput.
- The fact table is updated frequently, but those updates are spread across the machines.

Note - you can get even better performance with a column store in the data warehousing market, since in addition to horizontal partitioning, you only read the columns the query requests.

Also - compression is a great idea! Some CPU time lost on decompression saves you many sequential I/Os. Large blocks are important, since you're reading everything anyway. A column store can store a single attribute in sequence (across consecutive rows) in a large block, leading to better block-level compression than a row store, which stores a tuple's various fields in sequence and therefore compresses worse.

Wasted space in row stores (legacy features to support fast updates, etc., not necessary in the data warehousing world):
- A page slot of at least 2 bytes is stored for each tuple
- A tuple header with a tuple ID, null values, etc. adds overhead to each tuple
Removing these would improve row store performance!

Skew
- Skew slows query time down to the time of the slowest machine.
- Say you have 10 machines; one has 4% of "bought" while the other 9 have ~10.7% each - only slightly worse than a perfect 10% split.
- Even worse: one machine has 20% and the other 9 have ~8.9% each. Every query now waits on that one machine, so query time roughly doubles. Skew is bad!
- Hashing on the wrong field can kill your performance - there are many "Smiths" - so you have to hash on the right field.
- Round-robin is good: it guarantees data is well distributed.
- Load-sensitive balancing buys nothing here - all tuples are equally "hot," since every query reads the entire dataset.
- Range partitioning is horrible - at some point you might suddenly insert a lot more data into one range.

You have to pick the right partitioning strategy from day 1, or a year later skew will kill you. Repartitioning on the fly is hard, since you don't want to go down while you rearrange!
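To make the partitioning and skew discussion concrete, here is a toy Python sketch (mine, not from the lecture or paper; the "bought" schema and the store_id/product_id fields are made up). It hash-partitions a fact table across L machines; because every machine scans its own partition in parallel, the query finishes when the most-loaded machine finishes, so hashing on a field with a few very popular values - the "Smith" problem - directly inflates query time.

from collections import defaultdict
import random

L = 10  # number of shared-nothing machines

def hash_partition(tuples, key, n_machines):
    """Assign each tuple to a machine by hashing the chosen field into n_machines buckets."""
    machines = defaultdict(list)
    for t in tuples:
        machines[hash(t[key]) % n_machines].append(t)
    return machines

def parallel_scan_time(machines, per_tuple_cost=1.0):
    """All machines scan their partitions in parallel; the query is done when the slowest one is."""
    return max(per_tuple_cost * len(partition) for partition in machines.values())

# Toy "bought" fact table: store_id 7 is very popular (the "Smith" problem),
# while product_id is spread evenly.
bought = []
for _ in range(100_000):
    store = 7 if random.random() < 0.3 else random.randint(0, 999)
    bought.append({"store_id": store, "product_id": random.randint(0, 9_999)})

good = hash_partition(bought, "product_id", L)  # well-spread key: ~10% of tuples per machine
bad = hash_partition(bought, "store_id", L)     # skewed key: one machine gets ~37%
print("scan time with well-spread key:", parallel_scan_time(good))
print("scan time with skewed key:     ", parallel_scan_time(bad))

The same route-by-hash logic is also what sends a single-key lookup to exactly one machine, which is the property the eBay case below relies on.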
But even if skew is not a problem, scaling is still hard. Success means buying more machines - so how do you redistribute load onto the new ones?
- Even harder: you have to distribute over a heterogeneous hardware configuration, since the new machines will have different specs than the old ones.

==eBay==

How do we partition data to make eBay happy?

item(item_id, 120 other fields)
bid(bid_id, item_id, ...)

SELECT * FROM bid, item WHERE item_id = 3;

You are retrieving a small number of records, so you don't want to run each query on all machines the way you did for data warehousing. Here you should hash the item and bid tables on the same field (item_id). That assigns each query's data to a single location: the 10 or so bids for an item are collocated with the item, so each transaction touches one machine. This is the exact opposite of the Walmart scenario - send each query to one machine.

eBay does roughly this - it runs 25 independent Oracle machines and load-balances between them with its own logic.

==Mixed workload==

Database design when you have a mixed workload of analytical and single-row queries is a black art. The partitioning scheme is hard to pick, and query optimization becomes really hard. To keep L machines equally busy, you have to consider how queries will cause inter-machine communication. Before, all we considered was disk I/Os; now the network matters too, as we do 5-terabyte joins across it. In such an environment, OLTP has to get precedence to satisfy impatient customers with latency requirements.

Current solution: heuristics from the research community, or simply splitting the analytical workload off into a warehouse via a nightly Extract, Transform, and Load (ETL) workflow. This means the warehouse is up to 24 hours out of date - not quite perfect.

=Breakdown of an OLTP Workload=

- Buffer pool: ~25%
- Locking: ~25%
- Logging: ~25%
- Multithreading (concurrent B+Trees requiring latching): ~25%
- Useful work: ~1%

To get more than an order of magnitude faster (by Amdahl's law), you have to ditch every one of the ~25% costs, not just one.

- These days the entire OLTP dataset fits in memory, so ditch the buffer pool.
- OLTP transactions are issued by web applications, not by a person pausing mid-transaction, so transactions take 1 ms or less. So don't pay for locking - run transactions one at a time, each to completion.
- But what about multicore? Running serially would waste all but one core. Answer: run a DB instance on each core, split memory by the number of cores, and treat the cores like different machines.
- Since you want high availability anyway, replicate the data and stop logging!
- Multithreading - don't multithread; run each transaction to completion.

Current commercialization of this idea: H-Store, which runs 11x as fast as traditional RDBMSs and is claimed to be getting faster. Problem: if the datacenter goes down, you lose everything in RAM. Mike says "stay tuned," presumably for cross-geography replication.
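A toy sketch of that execution model - not H-Store's actual code; the Partition class, the account data, and the "transfer" transaction are invented for illustration. It runs one single-threaded engine per (pretend) core, each owning its own slice of in-memory data, and executes transactions serially to completion with no buffer pool, locks, latches, or log.

from multiprocessing import Process, Queue

class Partition:
    """One core's private, memory-resident slice of the database."""
    def __init__(self, accounts):
        self.accounts = dict(accounts)  # account_id -> balance

    def run(self, inbox, outbox):
        # Single thread, no locking or latching: each transaction runs
        # start-to-finish before the next one is even looked at.
        for txn in iter(inbox.get, None):
            kind, src, dst, amount = txn
            if kind == "transfer":  # single-partition transaction
                if self.accounts.get(src, 0) >= amount:
                    self.accounts[src] -= amount
                    self.accounts[dst] = self.accounts.get(dst, 0) + amount
                    outbox.put(("commit", txn))
                else:
                    outbox.put(("abort", txn))

def route(account_id, n_partitions):
    """Hash-route a transaction to the one partition that owns its data."""
    return hash(account_id) % n_partitions

if __name__ == "__main__":
    n = 4  # pretend we have 4 cores
    inboxes = [Queue() for _ in range(n)]
    results = Queue()
    # Accounts 0..99 start with $100 each; each partition owns the accounts that hash to it.
    parts = [Partition({i: 100 for i in range(100) if route(i, n) == p}) for p in range(n)]
    procs = [Process(target=parts[p].run, args=(inboxes[p], results)) for p in range(n)]
    for pr in procs:
        pr.start()

    # Accounts 0 and 4 hash to the same partition, so this transfer needs no coordination.
    src, dst = 0, 4
    inboxes[route(src, n)].put(("transfer", src, dst, 25))
    print(results.get())

    for q in inboxes:  # shut the workers down
        q.put(None)
    for pr in procs:
        pr.join()

A transfer whose two accounts live on different partitions would need cross-partition coordination - which is exactly the problem the next section takes up.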
=Transactions in a Shared Nothing=

Traditional architecture: the application talks to a coordinator, which takes care of parsing, optimization, and supervision of the query. The coordinator talks to each of the L workers required to run the plan; each does its piece of the work and returns results to the coordinator, which returns them to the application.

Sample transaction: move $100 from account A to account B.

BEGIN TRANSACTION
  debit A
  credit B
COMMIT

On one site, you do this with record-level locking and a write-ahead log. How do we keep transactions from colliding across multiple machines? Assume A and B are on different machines.

- The coordinator sends a message to A's machine to debit A, which does local logging and locking, and A's machine responds DONE.
- The coordinator sends a message to B's machine to credit B, which does local logging and locking, and B's machine responds DONE.
- The coordinator sends COMMIT to both machines, and they both commit.

This doesn't work! Deadlock is possible, and now two machines can deadlock on each other, just as transactions deadlock under row-level locking on a single machine - so A's or B's machine never sends back its DONE.

Two ways to handle it:
- The coordinator times out and aborts after X seconds of waiting - this is what everyone does.
- Assemble a global waits-for graph and look for a cycle - too complex to be useful.

This leads us to the next lecture, on how to perform transactions in distributed systems.
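For concreteness, here is a toy sketch (mine, not from the lecture) of the second option: cycle detection over a waits-for graph, where an edge T1 -> T2 means T1 is blocked waiting for a lock T2 holds. The detection itself is easy; what makes the approach impractical is assembling a consistent global graph from the lock tables of L machines, which is why timeouts win in practice.

def find_deadlock(waits_for):
    """Return a cycle of transaction ids if one exists, else None (depth-first search)."""
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited / on current path / finished
    color = {t: WHITE for t in waits_for}
    path = []

    def dfs(t):
        color[t] = GRAY
        path.append(t)
        for u in waits_for.get(t, []):
            if color.get(u, WHITE) == GRAY:  # back edge: we found a cycle
                return path[path.index(u):] + [u]
            if color.get(u, WHITE) == WHITE:
                cycle = dfs(u)
                if cycle:
                    return cycle
        path.pop()
        color[t] = BLACK
        return None

    for t in list(waits_for):
        if color[t] == WHITE:
            cycle = dfs(t)
            if cycle:
                return cycle
    return None

# The scenario from the text: two transfers, each holding the lock the other needs,
# with the locks living on two different machines.
waits_for = {
    "T1": ["T2"],  # T1 holds A's lock on machine 1, waits for B's lock on machine 2
    "T2": ["T1"],  # T2 holds B's lock on machine 2, waits for A's lock on machine 1
}
print(find_deadlock(waits_for))  # -> ['T1', 'T2', 'T1']; abort one transaction to break the cycle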