10/30/2008 One Size Does Not Fit All, with Mike

Reading:
* http://www.cs.brown.edu/~ugur/fits_all.pdf
* David DeWitt and Jim Gray. Parallel Database Systems: The Future of High Performance Database Processing. Communications of the ACM, 1992. (Red Book.)

=Data Warehousing=

=="Data Mining"==

How Walmart operates: every purchase at any store, ever, goes into this schema in one centralized location:

 bought(cid, sid, tid, pid, price, discount, qty)  -> 20 TB (terabytes)
 store(sid, district, address, manager, size)      -> 3,000 rows
 customer(cid, name, address, sex, age)            -> 1 million rows
 product(pid, p_price, ...)                        -> 60,000 rows
 time(tid, yr, month, week, day)                   -> 1,000 rows; the time attributes are pre-aggregated so it's easy to query on any given time granularity

This kind of schema is called a star schema:
- There is one fact table, in this case "bought".
- There are many dimension tables, which are referenced by the fact table.

A simple query they want to run: for pids < 3000 (1/20th of the data), calculate total sales and total profit by store, by week, by product.
Note: total profit = qty * (price - p_price).

 SELECT sum(b.qty), sum(b.qty * (b.price - p.p_price))
 FROM bought b, product p, time t
 WHERE p.pid = b.pid AND t.tid = b.tid AND b.pid < 3000
 GROUP BY b.sid, t.week, b.pid;

We have to scan the entire bought table to answer this query, so we read 20 TB.

A typical computer these days:
- CPU: 8 cores, going on 16 or 32
- Memory: 32 GB (and increasing)
- Disks: 10 per computer (1 TB each), 50 MB/sec sequential reads, 100 random reads/sec

It would take about 6 hours to read that table. That gives you 4 queries every 24 hours, which is laughable. So Walmart has to parallelize this.

==Recovery==

How long does it take to restore a checkpoint if the fact table gets corrupted? Checkpoints of a 20 TB table take forever to write, and about 6 hours to read back under the one-beefy-machine model. Time to recover: O(1 day).

Walmart's recovery scheme: keep a second copy of everything and switch over if necessary.

So that's data warehousing for you.

=OLTP=

OLTP - online transaction processing - is NOT analytical. Instead, the workload is short, random read/write access patterns.

eBay:
- items: 130M, projecting 300M next year
- 120 attributes per item
- new items: 10M/day
- items completed: 10M/day
- bids: 22M/day [250/sec]
- queries: random reads, ~1B/day (40:1 read:write ratio) [10k queries/sec]
- keeps transaction data for 6 months
- 2 PB (petabytes) total - misleading, since most of the data is static text; Mike guesses about 2 TB of actual OLTP/transactional data

To answer 10k queries/sec, you need 100 disks each doing 100 seeks/second. eBay's item table is spread across 25 machines to support concurrent access.

Recovery story: in 1998 eBay was down for 24 hours - front page of the NYT - zero revenue for an entire day. To get to zero downtime, they now have 4 data centers and replicate data 2.5x on average. The data centers are geographically dispersed, more than half the width of a hurricane apart.

=Common Needs=

Both Walmart and eBay have lots of data, and neither can go down - but for different reasons: eBay needs to appear live, whereas Walmart doesn't want to waste analysts' time. The access patterns also differ: Walmart's queries touch the entire DB, whereas eBay's queries hit one row.
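To make the contrast concrete, here is a back-of-the-envelope sketch in Python using only the numbers quoted above (the 20 TB fact table, the "typical computer" with 10 disks at 50 MB/sec sequential and 100 seeks/sec, and eBay's ~10k queries/sec); variable names are invented for the sketch. With these exact figures the full scan works out nearer 11 hours than the ~6 quoted in lecture, but the order-of-magnitude conclusion is the same: the warehouse workload is bandwidth-bound and gets only a few queries per day on one machine, while the OLTP workload is seek-bound, needing disks for their seek rate rather than their capacity.

 # Back-of-the-envelope sizing of the two workloads, using the lecture's numbers.
 # Treat the outputs as order-of-magnitude only.

 TB = 10**12

 # Walmart-style scan workload on one "typical computer"
 fact_table_bytes = 20 * TB           # bought table
 disks_per_node   = 10
 seq_bandwidth    = 50 * 10**6        # 50 MB/sec per disk, sequential
 scan_seconds = fact_table_bytes / (disks_per_node * seq_bandwidth)
 print(f"full scan of bought: {scan_seconds / 3600:.1f} hours")   # ~11 hours here; lecture ballparks ~6
 print(f"queries per day: {24 / (scan_seconds / 3600):.1f}")

 # eBay-style random-read workload
 target_qps     = 10_000              # ~1B random reads/day
 seeks_per_disk = 100                 # random reads/sec per disk
 disks_needed = target_qps / seeks_per_disk
 print(f"disks needed for {target_qps} random reads/sec: {disks_needed:.0f}")
 print(f"nodes at {disks_per_node} disks each: {disks_needed / disks_per_node:.0f}")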
=Scientific Data=

The Large Synoptic Survey Telescope (LSST), to be deployed in Chile in 2012. The telescope is essentially a big digital camera:
- Image size: 3.2 gigapixels!
- Images/minute: 3
- Data/day: 30 TB
- Data/year: 10 PB
- Will run for 6 years!
- Objects monitored in the sky: 50 billion
- Object sightings/day: 8 billion

Major problems LSST faces:
- Uncertainty: the measurements are subject to error, so a probability needs to be attached to each datum.
- No overwrite: they can't fix errors in place. Replacement data has to go elsewhere, and the old data is kept, for the accountability of the scientific papers built on it.

=Shared Disk=

In the 80's the idea was to have one large storage system; today that's called a SAN. A common bus is shared by all machines wanting to read from the shared disk. The application runs on one machine, while the DBMS runs on many machines, all sharing the same data bus.

What to do about the lock table (which parts of the shared data are locked by which machine)?
- Have one machine store all locking data. That's a central point of failure, and every lock requires a messaging round trip to that one machine.
- Store the locks on the shared disk - now taking any lock requires a disk write just to record it! And how do you get mutual exclusion on writing the locks themselves?
- Broadcast a message to everyone saying you're locking. That's what many implementations end up doing. Slow.

What does the buffer pool look like?
- When you dirty a page, broadcast to every other machine's buffer pool to invalidate its copy.

Main proponents of shared disk:
- Oracle (RAC) - solves both the buffer pool and lock problems by sending messages.
- It can scale to approximately 4-6 SMP systems, and then gets no more throughput due to message contention.

Conclusion: shared nothing seems to work better, since you don't need lots of complicated, contended buffer-pool and locking logic.

=Shared Nothing=

Have as many machines as you want, and don't connect the drives. JBOD - just a bunch of disks. Each machine owns its own slice of the data and works on it in parallel (see the sketch after this list).
- Google/Facebook/Goldman Sachs/etc. all do this.
- The elderly DBMS vendors don't.
- Vertica/Teradata/DATAllegro/Greenplum (the new-wave DBMS vendors) all build on this.
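To make the shared-nothing idea concrete, here is a minimal Python sketch of how the Walmart query could run on such a setup. It is illustrative only: the node count, helper names, and in-memory "tables" are invented, and real shared-nothing engines do all of this inside the DBMS. The bought fact table is hash-partitioned by sid across nodes, the small dimension tables are copied to every node, each node runs the GROUP BY locally over its own partition, and a coordinator merges the partial sums. No shared lock table or shared buffer pool is needed.

 # Minimal shared-nothing sketch (illustrative, not any vendor's actual code).
 # Table/column names follow the star schema above; "nodes" are just Python lists.
 from collections import defaultdict

 NUM_NODES = 4   # assumption for the sketch; real deployments use tens or hundreds

 def partition(bought_rows, num_nodes=NUM_NODES):
     """Split bought rows across nodes by hashing the store id (sid)."""
     nodes = [[] for _ in range(num_nodes)]
     for row in bought_rows:                     # row: (cid, sid, tid, pid, price, discount, qty)
         nodes[hash(row[1]) % num_nodes].append(row)
     return nodes

 def local_aggregate(rows, product, time):
     """Each node answers the GROUP BY over only its own partition.
     product maps pid -> p_price; time maps tid -> week (small dimension
     tables are assumed to be replicated on every node)."""
     partial = defaultdict(lambda: [0, 0.0])     # (sid, week, pid) -> [qty, profit]
     for cid, sid, tid, pid, price, discount, qty in rows:
         if pid < 3000:                          # the 1/20th-of-products predicate
             acc = partial[(sid, time[tid], pid)]
             acc[0] += qty
             acc[1] += qty * (price - product[pid])
     return partial

 def merge(partials):
     """Coordinator sums the partial aggregates from all nodes."""
     total = defaultdict(lambda: [0, 0.0])
     for partial in partials:
         for key, (qty, profit) in partial.items():
             total[key][0] += qty
             total[key][1] += profit
     return total

 # usage: results = merge(local_aggregate(p, product, time) for p in partition(bought))

Because sums are decomposable, adding up the per-node partial results gives the same answer as the single-machine SQL query; partitioning on sid (one of the GROUP BY columns) even guarantees that each group lives on exactly one node, so the merge is little more than a concatenation.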