11/4/2008

Reading:
* Michael Stonebraker, Samuel Madden, Daniel Abadi, Stavros Harizopoulos, Nabil Hachem, and Pat Helland. The End of an Architectural Era (It's Time for a Complete Rewrite). Proceedings of VLDB, 2007.

=Last time=

- Walmart: a few long-running, ad-hoc analytical queries
- eBay: millions of single-row queries

- Shared Memory Multiprocessor (SMP): scale up on one machine (SQL Server)
- Shared disk (Oracle, Sybase IQ): put large disks on a common bus shared between machines
- Shared nothing: DB2 Parallel Edition, Vertica, Greenplum, Teradata, Netezza, EnterpriseDB

MSFT just bought DATAllegro to move to multi-machine.

=Horizontal Partitioning=

==Data Warehousing==

In the Walmart case, the "bought" table takes 99% of the space and computation. To scale on a shared-nothing collection of L machines, partition the "bought" table across all machines. Horizontal partitioning of a table means each machine stores whole tuples (a subset of the rows).

Methods of assigning tuples to machines:
- Hash some key into L buckets
- Range partition on some key
- Round-robin (random assignment)
- Load-sensitive balancing for "hot" tuples

What about the other tables in the system - the small ones referenced by every row of "bought"?
- Just replicate them to each machine.

To execute a query, run the same query plan on each machine. Each machine does a sequential scan over the portion of the "bought" table that it holds. This scales linearly with the number of machines: since the majority of your time is spent scanning "bought," you divide the scan time by the number of machines doing the scanning. Perfect for a data warehousing scenario.

Doesn't replicating the other tables hurt you?
- Updates to those tables are very infrequent; new stores and products are not added at high throughput.
- The fact table is updated frequently, but those updates are spread across the machines.

Note - you can get even better performance with a column store in the data warehousing market, since in addition to horizontal partitioning, you only read the columns the query requests.

Also - compression is a great idea! Some CPU time lost on decompression saves you many sequential I/Os. Large blocks are important, since you're reading everything anyway. A column store can store a single attribute in sequence (across consecutive rows) in a large block, leading to better block-level compression than a row store, which stores a tuple's various fields in sequence and therefore compresses worse.

Wasted space in row stores (legacy features to support fast updates, etc., not necessary in the data warehousing world):
- A page slot of at least 2 bytes is stored for each tuple
- A tuple header with a tuple ID, null values, etc. adds overhead to each tuple
Removing these would improve row store performance!

Skew
- Skew slows query time down to the time of the slowest machine.
- Say you have 10 machines; one has 4% of "bought" while the other 9 have ~10.7% each - only slightly worse than a perfect 10% split.
- Even worse: one machine has 20% and the other 9 have ~8.9% each. Every query now waits on that one machine, so query time roughly doubles. Skew is bad!
- Hashing on the wrong field can kill your performance - there are many "Smiths" - so you have to hash on the right field.
- Round-robin is good: it guarantees data is well distributed.
- Load-sensitive balancing buys nothing here - all tuples are equally "hot," since every query reads the entire dataset.
- Range partitioning is horrible - at some point you might suddenly insert a lot more data into one range.

You have to pick the right partitioning strategy from day 1, or a year later skew will kill you. Repartitioning on the fly is hard, since you don't want to go down while you rearrange!
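To make the partitioning and skew discussion concrete, here is a toy Python sketch (mine, not from the lecture or paper; the "bought" schema and the store_id/product_id fields are made up). It hash-partitions a fact table across L machines; because every machine scans its own partition in parallel, the query finishes when the most-loaded machine finishes, so hashing on a field with a few very popular values - the "Smith" problem - directly inflates query time.

from collections import defaultdict
import random

L = 10  # number of shared-nothing machines

def hash_partition(tuples, key, n_machines):
    """Assign each tuple to a machine by hashing the chosen field into n_machines buckets."""
    machines = defaultdict(list)
    for t in tuples:
        machines[hash(t[key]) % n_machines].append(t)
    return machines

def parallel_scan_time(machines, per_tuple_cost=1.0):
    """All machines scan their partitions in parallel; the query is done when the slowest one is."""
    return max(per_tuple_cost * len(partition) for partition in machines.values())

# Toy "bought" fact table: store_id 7 is very popular (the "Smith" problem),
# while product_id is spread evenly.
bought = []
for _ in range(100_000):
    store = 7 if random.random() < 0.3 else random.randint(0, 999)
    bought.append({"store_id": store, "product_id": random.randint(0, 9_999)})

good = hash_partition(bought, "product_id", L)  # well-spread key: ~10% of tuples per machine
bad = hash_partition(bought, "store_id", L)     # skewed key: one machine gets ~37%
print("scan time with well-spread key:", parallel_scan_time(good))
print("scan time with skewed key:     ", parallel_scan_time(bad))

The same route-by-hash logic is also what sends a single-key lookup to exactly one machine, which is the property the eBay case below relies on.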
But even if skew is not a problem, scaling is still hard. Success means buying more machines - so how do you redistribute load onto the new ones?
- Even harder: you have to distribute over a heterogeneous hardware configuration, since the new machines will have different specs than the old ones.

==eBay==

How do we partition data to make eBay happy?

item(item_id, 120 other fields)
bid(bid_id, item_id, ...)

SELECT * FROM bid, item WHERE item_id = 3;

You are retrieving a small number of records, so you don't want to run each query on all machines the way you did for data warehousing. Here you should hash the item and bid tables on the same field (item_id). That assigns each query's data to a single location: the 10 or so bids for an item are collocated with the item, so each transaction touches one machine. This is the exact opposite of the Walmart scenario - send each query to one machine.

eBay does roughly this - it runs 25 independent Oracle machines and load-balances between them with its own logic.

==Mixed workload==

Database design when you have a mixed workload of analytical and single-row queries is a black art. The partitioning scheme is hard to pick, and query optimization becomes really hard. To keep L machines equally busy, you have to consider how queries will cause inter-machine communication. Before, all we considered was disk I/Os; now the network matters too, as we do 5-terabyte joins across it. In such an environment, OLTP has to get precedence to satisfy impatient customers with latency requirements.

Current solution: heuristics from the research community, or simply splitting the analytical workload off into a warehouse via a nightly Extract, Transform, and Load (ETL) workflow. This means the warehouse is up to 24 hours out of date - not quite perfect.

=Breakdown of an OLTP Workload=

- Buffer pool: ~25%
- Locking: ~25%
- Logging: ~25%
- Multithreading (concurrent B+Trees requiring latching): ~25%
- Useful work: ~1%

To get more than an order of magnitude faster (by Amdahl's law), you have to ditch every one of the ~25% costs, not just one.

- These days the entire OLTP dataset fits in memory, so ditch the buffer pool.
- OLTP transactions are issued by web applications, not by a person pausing mid-transaction, so transactions take 1 ms or less. So don't pay for locking - run transactions one at a time, each to completion.
- But what about multicore? Running serially would waste all but one core. Answer: run a DB instance on each core, split memory by the number of cores, and treat the cores like different machines.
- Since you want high availability anyway, replicate the data and stop logging!
- Multithreading - don't multithread; run each transaction to completion.

Current commercialization of this idea: H-Store, which runs 11x as fast as traditional RDBMSs and is claimed to be getting faster. Problem: if the datacenter goes down, you lose everything in RAM. Mike says "stay tuned," presumably for cross-geography replication.
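A toy sketch of that execution model - not H-Store's actual code; the Partition class, the account data, and the "transfer" transaction are invented for illustration. It runs one single-threaded engine per (pretend) core, each owning its own slice of in-memory data, and executes transactions serially to completion with no buffer pool, locks, latches, or log.

from multiprocessing import Process, Queue

class Partition:
    """One core's private, memory-resident slice of the database."""
    def __init__(self, accounts):
        self.accounts = dict(accounts)  # account_id -> balance

    def run(self, inbox, outbox):
        # Single thread, no locking or latching: each transaction runs
        # start-to-finish before the next one is even looked at.
        for txn in iter(inbox.get, None):
            kind, src, dst, amount = txn
            if kind == "transfer":  # single-partition transaction
                if self.accounts.get(src, 0) >= amount:
                    self.accounts[src] -= amount
                    self.accounts[dst] = self.accounts.get(dst, 0) + amount
                    outbox.put(("commit", txn))
                else:
                    outbox.put(("abort", txn))

def route(account_id, n_partitions):
    """Hash-route a transaction to the one partition that owns its data."""
    return hash(account_id) % n_partitions

if __name__ == "__main__":
    n = 4  # pretend we have 4 cores
    inboxes = [Queue() for _ in range(n)]
    results = Queue()
    # Accounts 0..99 start with $100 each; each partition owns the accounts that hash to it.
    parts = [Partition({i: 100 for i in range(100) if route(i, n) == p}) for p in range(n)]
    procs = [Process(target=parts[p].run, args=(inboxes[p], results)) for p in range(n)]
    for pr in procs:
        pr.start()

    # Accounts 0 and 4 hash to the same partition, so this transfer needs no coordination.
    src, dst = 0, 4
    inboxes[route(src, n)].put(("transfer", src, dst, 25))
    print(results.get())

    for q in inboxes:  # shut the workers down
        q.put(None)
    for pr in procs:
        pr.join()

A transfer whose two accounts live on different partitions would need cross-partition coordination - which is exactly the problem the next section takes up.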
=Transactions in a Shared Nothing=

Traditional architecture: the application talks to a coordinator, which takes care of parsing, optimization, and supervision of the query. The coordinator talks to each of the L workers required to run the plan; each does its piece of the work and returns results to the coordinator, which returns them to the application.

Sample transaction: move $100 from account A to account B.

BEGIN TRANSACTION
  debit A
  credit B
COMMIT

On one site, you do this with record-level locking and a write-ahead log. How do we keep transactions from colliding across multiple machines? Assume A and B are on different machines.

- The coordinator sends a message to A's machine to debit A, which does local logging and locking, and A's machine responds DONE.
- The coordinator sends a message to B's machine to credit B, which does local logging and locking, and B's machine responds DONE.
- The coordinator sends COMMIT to both machines, and they both commit.

This doesn't work! Deadlock is possible, and now two machines can deadlock on each other, just as transactions deadlock under row-level locking on a single machine - so A's or B's machine never sends back its DONE.

Two ways to handle it:
- The coordinator times out and aborts after X seconds of waiting - this is what everyone does.
- Assemble a global waits-for graph and look for a cycle - too complex to be useful.

This leads us to the next lecture, on how to perform transactions in distributed systems.
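For concreteness, here is a toy sketch (mine, not from the lecture) of the second option: cycle detection over a waits-for graph, where an edge T1 -> T2 means T1 is blocked waiting for a lock T2 holds. The detection itself is easy; what makes the approach impractical is assembling a consistent global graph from the lock tables of L machines, which is why timeouts win in practice.

def find_deadlock(waits_for):
    """Return a cycle of transaction ids if one exists, else None (depth-first search)."""
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited / on current path / finished
    color = {t: WHITE for t in waits_for}
    path = []

    def dfs(t):
        color[t] = GRAY
        path.append(t)
        for u in waits_for.get(t, []):
            if color.get(u, WHITE) == GRAY:  # back edge: we found a cycle
                return path[path.index(u):] + [u]
            if color.get(u, WHITE) == WHITE:
                cycle = dfs(u)
                if cycle:
                    return cycle
        path.pop()
        color[t] = BLACK
        return None

    for t in list(waits_for):
        if color[t] == WHITE:
            cycle = dfs(t)
            if cycle:
                return cycle
    return None

# The scenario from the text: two transfers, each holding the lock the other needs,
# with the locks living on two different machines.
waits_for = {
    "T1": ["T2"],  # T1 holds A's lock on machine 1, waits for B's lock on machine 2
    "T2": ["T1"],  # T2 holds B's lock on machine 2, waits for A's lock on machine 1
}
print(find_deadlock(waits_for))  # -> ['T1', 'T2', 'T1']; abort one transaction to break the cycle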