1/2/2008 Query Optimization

Reading:
* Patricia Selinger, M. Astrahan, D. Chamberlin, Raymond Lorie, and T. Price. Access Path Selection in a Relational Database Management System. Proceedings of ACM SIGMOD, 1979. Pages 22-34. In the Red Book.
* Optionally, you may also wish to look at: Michael Mannino, Paichen Chu, and Thomas Sager. Statistical Profile Estimation in Database Systems. ACM Computing Surveys 20(3), 1988. Pages 191-221. This paper discusses many of the techniques used to make query optimization and cost estimation practical in modern database systems.

How do we determine which query plan to run, given all of our options?

Patricia Selinger -> 1979 paper for System R on cost-based query optimization. Many of the foundations of the Selinger optimizer are still used today!

Cost-based optimization = find the min(cost) plan. Compare this to heuristic optimization, which doesn't assign costs to the various decisions.

=Cost Model=

==Definitions==
- Selectivity (F) - fraction of tuples that pass through a filter.
  NOTE: in this terminology, "highly selective" means F is high and MORE tuples get through (think of it like a productivity), which is the opposite of the colloquial sense.
  For a JOIN, selectivity = (tuples passed through) / (size of the cross product of the two tables)
- NCARD - relation cardinality - the number of tuples in the relation. We used to call this {R}.
- TCARD - the number of pages in the relation. We used to call this |R|.
- ICARD - number of keys in the index
- NINDX - number of pages in the index

==Selectivity Estimation==
There is a lot of research into this, but let's assume uniformity of the key distribution, and assume some defaults for values we don't know.

For a filter:
- col = value: estimate 1/ICARD if an index exists, or 1/10 otherwise
- col > value: assume a uniform distribution of keys and estimate
  ((max - min) - (val - min)) / (max - min) = (max - val) / (max - min),
  or 1/3 if no max/min exist
- col1 = col2 (join): estimate 1/ICARD(primary key column being joined);
  if there is no primary key, then 1/max(ICARD(col1), ICARD(col2))

==Cost of Access Methods==
W - fraction of the cost of accessing a page that the CPU takes for doing work like evaluating a predicate

equality predicate on a B+tree:
  1 [btree lookup] + 1 [heapfile lookup] + W [evaluate predicate on heap page]

range scan w/ predicate:
  clustered index: F(pred) * (NINDX + TCARD) + W * (tuples read)
  unclustered index: F(pred) * (NINDX + NCARD) + W * (tuples read)

seq scan: TCARD + W * NCARD

==Join Costs==
Nested Loops(A,B):
  Cost(accessing outer) [outer scan] + NCARD(outer) [per tuple of outer] * Cost(inner) [scan inner]
INL(A,B) (index nested loops):
  Cost(accessing outer) [outer scan] + NCARD(outer) [per tuple of outer] * Cost(inner index lookup)
Merge:
  Cost(accessing outer) + Cost(accessing inner) + sort cost

As you go up the tree of joins, estimates become worse and worse, due to correlations and other things we don't account for.

=Selecting a Plan=
Why is it hard? Can't we just write out all possible plans and pick the one with the smallest cost? There are too many possible plans to consider!

Consider joining A, B, C: (AB)C, A(BC), (AC)B, A(CB), ...
Number of orderings of n letters (n tables): n!
Number of parenthesizations of n letters: (n-1)!
So n!(n-1)! plans! Too much!

Heuristics - only consider these kinds of plans:
- push down selections - always makes sense if the selection operation is free.
  Reconsider this if your predicate involves an FFT or something like that.
- left-deep plans - bad if you want to parallelize joins, but on a single thread it's probably fine.
  This removes the choice of parenthesization, as all parenthesizations look like ((((AB)C)D)E).
- only consider non-cross-products - don't join tables that have no join predicate between them.

=Dynamic Programming=
Turns the n! search space into complexity 2^n.

Intuition - (ABC)DE - the best way to join ABC is independent of how it joins with DE later.
(Kind of true. Not the case if you might be able to do a sort-merge join and could get ABC in a good sort order.)

for delta in 1...|R|:
  for all S in {all length-delta subsets of R}:
    optjoin(S) = (S - a) join a,
      where a is the relation that minimizes
        cost(optjoin(S - a)) + min cost to join (S - a) with a + min cost to access a

Example: ABCD
delta = 1:
  A - e.g. cost to seq scan
  B - e.g. cost to do an index lookup
  C ...
  D ...
delta = 2:
  AB - best of AB or BA
  AC - best of AC or CA
  BC ...
  AD ...
  BD ...
  CD ...
delta = 3:
  ABC - remove A, compare cost of BC join A vs. A join BC
      - remove B, compare cost of AC join B vs. B join AC
      - remove C, ...
  ACD ...
  ABD ...
  BCD ...
...
Keep going, and pick the smallest-cost plan for the full set.

Cost of the algorithm: the outer loop eventually goes through all subsets, so 2^n.
The inner loop considers all ways to remove one item from a set of size delta - at most n per subset.
So an upper bound on the cost is (2^n) * n. 2^n is way smaller than n!: if n = 20, 2^n is about 10^6, but n! is about 10^18.

Note: this algorithm looks at a mix of left- and right-deep plans, meaning it can flip each subtree across its vertical axis. These are still deep plans - parallel (bushy) joins are not produced. You need this mix of left- and right-deepness in order to consider every choice of which relation is the inner and which is the outer at each join.
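To make the loop structure concrete, here is a minimal Python sketch of the dynamic program above. The cost functions passed in (access_cost, join_cost) are hypothetical placeholders rather than the System R formulas; the point is the shape of the search, which visits each of the 2^n subsets once and tries at most n removals per subset.

from itertools import combinations

def best_plan(relations, access_cost, join_cost):
    # relations: list of relation names
    # access_cost(r): cheapest way to access relation r on its own
    # join_cost(plan, r): estimated cost of joining an existing sub-plan with r
    # optjoin maps a frozenset of relations -> (cost, plan); plans are nested tuples
    optjoin = {frozenset([r]): (access_cost(r), r) for r in relations}

    for delta in range(2, len(relations) + 1):          # subset size
        for subset in combinations(relations, delta):   # all subsets of that size
            S = frozenset(subset)
            best = None
            for a in S:                                 # candidate relation joined last
                rest_cost, rest_plan = optjoin[S - {a}]
                cost = rest_cost + join_cost(rest_plan, a) + access_cost(a)
                if best is None or cost < best[0]:
                    best = (cost, (rest_plan, a))
            optjoin[S] = best

    return optjoin[frozenset(relations)]

# Toy usage with made-up scan costs and a flat per-join cost:
scan = {"A": 100, "B": 400, "C": 250, "D": 50}
cost, plan = best_plan(list(scan), lambda r: scan[r], lambda p, r: 75)
print(cost, plan)   # the nested tuples encode the (deep) join order

Like the pseudocode, this only produces deep plans: each step joins an optimal sub-plan with a single base relation, so bushy trees are never considered.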
=Interesting Orders=
(The caveat above about sort order is the idea of "interesting orders": the optimizer also keeps the best plan for each sort order that could help a later merge join or ORDER BY, since a slightly more expensive sorted sub-plan may win overall.)

Real-world optimizers use:
- bushy plans, to allow for parallelization
- more legitimate selectivity estimation
  - don't assume uniformity in values
  - join selectivities may be correlated

=Selectivity Estimation=
Imagine joining A join B join C join D, where every join has selectivity .1 except A join B, which has selectivity 1. If you put A join B at the bottom of the plan, you are joining way more data than you need to (10 times as much).

==Methods of Sampling==
-> Histogram
The Selinger optimizer does its estimation based on a uniform-distribution assumption about data values. This can often be wrong. For more precision, you can keep a histogram.

Say sal > 25k is a filter, and the histogram is:
  0-10k:  .4
  10-20k: .1
  20-30k: .4
  30-40k: .1
So >25k is about .2 + .1 = .3, if you assume values within histogram bins are uniform (see the sketch at the end of these notes).

-> Random Sampling
Sample some values before a long-running query to see what the values look like.

-> Keep statistics from previous queries for use by future ones.

==Joins==
The Selinger optimizer says the selectivity of a join is 1/max(ICARD(col1), ICARD(col2)). Instead of making this assumption, you can keep join statistics between the two columns being joined - more histogram bins are required, but this still gives you an idea of what will actually join together.

==Postgres Statistics==
Postgres keeps:
-> A fixed-height (equi-depth) histogram - essentially, it keeps the data values that split the column into bins containing equal numbers of rows. Helps measure the selectivity of predicates.
-> The fraction of the table that is unique, or (if there are only a small number of them) how many unique values are in the table.
-> A "correlation" value that tells you how likely one value is to be larger than the value before it in a column (measures how "sorted" the column is).

See the Postgres docs (quite detailed) if you ever want to do this yourself and replicate something decent!
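As a concrete version of the sal > 25k calculation above, here is a small Python sketch of estimating the selectivity of a col > value predicate from a histogram, assuming values are uniform within each bin. The bin boundaries and fractions are just the made-up numbers from the example.

def selectivity_gt(bins, value):
    # bins: list of (low, high, fraction) triples; fractions should sum to 1
    selectivity = 0.0
    for low, high, frac in bins:
        if value <= low:          # whole bin is above the cutoff
            selectivity += frac
        elif value < high:        # bin straddles the cutoff: take the part above it
            selectivity += frac * (high - value) / (high - low)
        # else: whole bin is below the cutoff, contributes nothing
    return selectivity

# The salary example: F(sal > 25k) = 0.4*0.5 + 0.1 = 0.3
sal_hist = [(0, 10_000, 0.4), (10_000, 20_000, 0.1),
            (20_000, 30_000, 0.4), (30_000, 40_000, 0.1)]
print(selectivity_gt(sal_hist, 25_000))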