10/30/2008 One Size Does Not Fit All, with Mike

Reading:
* http://www.cs.brown.edu/~ugur/fits_all.pdf
* David DeWitt and Jim Gray. Parallel Database Systems: The Future of High Performance Database Processing. Communications of the ACM, 1992. (Red Book.)

=Data Warehousing=

=="Data Mining"==

How Walmart operates: every purchase at any store, ever, goes into this schema in one centralized location:

 bought(cid, sid, tid, pid, price, discount, qty)  -> 20 TB (terabytes)
 store(sid, district, address, manager, size)      -> 3,000 rows
 customer(cid, name, address, sex, age)            -> 1 million rows
 product(pid, p_price, ...)                        -> 60,000 rows
 time(tid, yr, month, week, day)                   -> 1,000 rows; the time attributes are pre-aggregated so it's easy to query on any given time granularity

This kind of schema is called a star schema:
- There is one fact table, in this case "bought".
- There are many dimension tables, which are referenced by the fact table.

A simple query they want to run: for pids < 3000 (1/20th of the data), calculate total sales and total profit by store, by week, by product.
Note: total profit = qty * (price - p_price).

 SELECT sum(b.qty), sum(b.qty * (b.price - p.p_price))
 FROM bought b, product p, time t
 WHERE p.pid = b.pid AND t.tid = b.tid AND b.pid < 3000
 GROUP BY b.sid, t.week, b.pid;

We have to scan the entire bought table to answer this query, so we read 20 TB.

A typical computer these days:
- CPU: 8 cores, going on 16 or 32
- Memory: 32 GB (and increasing)
- Disks: 10 per computer (1 TB each), 50 MB/sec sequential reads, 100 random reads/sec

It would take about 6 hours to read that table. That gives you 4 queries every 24 hours, which is laughable. So Walmart has to parallelize this.

==Recovery==

How long does it take to restore a checkpoint if the fact table gets corrupted? Checkpoints of a 20 TB table take forever to write, and about 6 hours to read back under the one-beefy-machine model. Time to recover: O(1 day).

Walmart's recovery scheme: keep a second copy of everything and switch over if necessary.

So that's data warehousing for you.

=OLTP=

OLTP - online transaction processing - is NOT analytical. Instead, the workload is short, random read/write access patterns.

eBay:
- items: 130M, projecting 300M next year
- 120 attributes per item
- new items: 10M/day
- items completed: 10M/day
- bids: 22M/day [250/sec]
- queries: random reads, ~1B/day (40:1 read:write ratio) [10k queries/sec]
- keeps transaction data for 6 months
- 2 PB (petabytes) total - misleading, since most of the data is static text; Mike guesses about 2 TB of actual OLTP/transactional data

To answer 10k queries/sec, you need 100 disks each doing 100 seeks/second. eBay's item table is spread across 25 machines to support concurrent access.

Recovery story: in 1998 eBay was down for 24 hours - front page of the NYT - zero revenue for an entire day. To get to zero downtime, they now have 4 data centers and replicate data 2.5x on average. The data centers are geographically dispersed, more than half the width of a hurricane apart.

=Common Needs=

Both Walmart and eBay have lots of data, and neither can go down - but for different reasons: eBay needs to appear live, whereas Walmart doesn't want to waste analysts' time. The access patterns also differ: Walmart's queries touch the entire DB, whereas eBay's queries hit one row.
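To make the contrast concrete, here is a back-of-the-envelope sketch in Python using only the numbers quoted above (the 20 TB fact table, the "typical computer" with 10 disks at 50 MB/sec sequential and 100 seeks/sec, and eBay's ~10k queries/sec); variable names are invented for the sketch. With these exact figures the full scan works out nearer 11 hours than the ~6 quoted in lecture, but the order-of-magnitude conclusion is the same: the warehouse workload is bandwidth-bound and gets only a few queries per day on one machine, while the OLTP workload is seek-bound, needing disks for their seek rate rather than their capacity.

 # Back-of-the-envelope sizing of the two workloads, using the lecture's numbers.
 # Treat the outputs as order-of-magnitude only.

 TB = 10**12

 # Walmart-style scan workload on one "typical computer"
 fact_table_bytes = 20 * TB           # bought table
 disks_per_node   = 10
 seq_bandwidth    = 50 * 10**6        # 50 MB/sec per disk, sequential
 scan_seconds = fact_table_bytes / (disks_per_node * seq_bandwidth)
 print(f"full scan of bought: {scan_seconds / 3600:.1f} hours")   # ~11 hours here; lecture ballparks ~6
 print(f"queries per day: {24 / (scan_seconds / 3600):.1f}")

 # eBay-style random-read workload
 target_qps     = 10_000              # ~1B random reads/day
 seeks_per_disk = 100                 # random reads/sec per disk
 disks_needed = target_qps / seeks_per_disk
 print(f"disks needed for {target_qps} random reads/sec: {disks_needed:.0f}")
 print(f"nodes at {disks_per_node} disks each: {disks_needed / disks_per_node:.0f}")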
=Scientific Data=

The Large Synoptic Survey Telescope (LSST), to be deployed in Chile in 2012. The telescope is essentially a big digital camera:
- Image size: 3.2 gigapixels!
- Images/minute: 3
- Data/day: 30 TB
- Data/year: 10 PB
- Will run for 6 years!
- Objects monitored in the sky: 50 billion
- Object sightings/day: 8 billion

Major problems LSST faces:
- Uncertainty: the measurements are subject to error, so a probability needs to be attached to each datum.
- No overwrite: they can't fix errors in place. Replacement data has to go elsewhere, and the old data is kept, for the accountability of the scientific papers built on it.

=Shared Disk=

In the 80's the idea was to have one large storage system; today that's called a SAN. A common bus is shared by all machines wanting to read from the shared disk. The application runs on one machine, while the DBMS runs on many machines, all sharing the same data bus.

What to do about the lock table (which parts of the shared data are locked by which machine)?
- Have one machine store all locking data. That's a central point of failure, and every lock requires a messaging round trip to that one machine.
- Store the locks on the shared disk - now taking any lock requires a disk write just to record it! And how do you get mutual exclusion on writing the locks themselves?
- Broadcast a message to everyone saying you're locking. That's what many implementations end up doing. Slow.

What does the buffer pool look like?
- When you dirty a page, broadcast to every other machine's buffer pool to invalidate its copy.

Main proponents of shared disk:
- Oracle (RAC) - solves both the buffer pool and lock problems by sending messages.
- It can scale to approximately 4-6 SMP systems, and then gets no more throughput due to message contention.

Conclusion: shared nothing seems to work better, since you don't need lots of complicated, contended buffer-pool and locking logic.

=Shared Nothing=

Have as many machines as you want, and don't connect the drives. JBOD - just a bunch of disks. Each machine owns its own slice of the data and works on it in parallel (see the sketch after this list).
- Google/Facebook/Goldman Sachs/etc. all do this.
- The elderly DBMS vendors don't.
- Vertica/Teradata/DATAllegro/Greenplum (the new-wave DBMS vendors) all build on this.
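To make the shared-nothing idea concrete, here is a minimal Python sketch of how the Walmart query could run on such a setup. It is illustrative only: the node count, helper names, and in-memory "tables" are invented, and real shared-nothing engines do all of this inside the DBMS. The bought fact table is hash-partitioned by sid across nodes, the small dimension tables are copied to every node, each node runs the GROUP BY locally over its own partition, and a coordinator merges the partial sums. No shared lock table or shared buffer pool is needed.

 # Minimal shared-nothing sketch (illustrative, not any vendor's actual code).
 # Table/column names follow the star schema above; "nodes" are just Python lists.
 from collections import defaultdict

 NUM_NODES = 4   # assumption for the sketch; real deployments use tens or hundreds

 def partition(bought_rows, num_nodes=NUM_NODES):
     """Split bought rows across nodes by hashing the store id (sid)."""
     nodes = [[] for _ in range(num_nodes)]
     for row in bought_rows:                     # row: (cid, sid, tid, pid, price, discount, qty)
         nodes[hash(row[1]) % num_nodes].append(row)
     return nodes

 def local_aggregate(rows, product, time):
     """Each node answers the GROUP BY over only its own partition.
     product maps pid -> p_price; time maps tid -> week (small dimension
     tables are assumed to be replicated on every node)."""
     partial = defaultdict(lambda: [0, 0.0])     # (sid, week, pid) -> [qty, profit]
     for cid, sid, tid, pid, price, discount, qty in rows:
         if pid < 3000:                          # the 1/20th-of-products predicate
             acc = partial[(sid, time[tid], pid)]
             acc[0] += qty
             acc[1] += qty * (price - product[pid])
     return partial

 def merge(partials):
     """Coordinator sums the partial aggregates from all nodes."""
     total = defaultdict(lambda: [0, 0.0])
     for partial in partials:
         for key, (qty, profit) in partial.items():
             total[key][0] += qty
             total[key][1] += profit
     return total

 # usage: results = merge(local_aggregate(p, product, time) for p in partition(bought))

Because sums are decomposable, adding up the per-node partial results gives the same answer as the single-machine SQL query; partitioning on sid (one of the GROUP BY columns) even guarantees that each group lives on exactly one node, so the merge is little more than a concatenation.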