9/30/2008 Join Algorithms

Reading:
* L.D. Shapiro. Join Processing in Database Systems with Large Main Memories. In the Red Book.

Assume o outer tuples and i inner tuples (used in the big-O notation below).

Some notation for further analysis:
 S.a = R.b is the kind of join we are doing
 {S} = # tuples in S
 |S| = # pages in S
 M = # pages of memory
 |R| < |S| (R is the smaller table)

=Simple Join Algorithms=

Nested loops - O(o*i) in the number of tuples.

Index nested loops - iterate through O(o) tuples, and do an index lookup (at least 1 seek) for each of them.

In-memory hash - hash the smaller table as the inner (so it fits in memory), iterate through O(o) tuples, and do an in-memory lookup for each of them.

In-memory sort-merge - O(o log o) [sort] + O(i log i) [sort] + O(o) [merge scan] + O(i) [merge scan].

=Deeper Analysis=

==Nested Loops Join==

{S} x {R} comparisons.

Example: |S| = 4, |R| = 2, M = 3, LRU buffer, counting one scan of the inner per page of the outer:
- S is inner: 2 + 4 + 4 = 10 (R's 2 pages as outer, plus two full scans of S, which at 4 pages never fits in the 3-page buffer)
- R is inner: 4 + 2 = 6 (S's 4 pages as outer; R's 2 pages fit in the buffer, so LRU keeps them cached after the first scan)

More formally, if R is inner: {S}|R| [one scan of R as the inner per tuple of S] + |S| [scanning through S as outer].

==Index-Nested Loops==

(Given an index on R.b)

 for s in S:
     find matches in R.b's index

I/O: {S} + |S| pages are read, but the {S} index lookups are all random accesses.

==Hash (in memory)==

Only works when M > |R| (plus room for the hash table, etc.)

 build hash table T on R
 for each s in S:
     look up s.a in T

I/O: |S| + |R|

==Sort-Merge==

 read R, sort
 read S, sort
 merge

Deal with duplicate values specially:

 S's joined fields, sorted: 1 2 5 7 7 9
 R's joined fields, sorted: 2 3 7 7 9

We want to output (2,2), (7,7), (7,7), (7,7), (7,7), (9,9), but if we just followed the basic merge algorithm, we'd only output two copies of (7,7). So be careful: you have to look ahead to see whether the next value is repeated, so that you can output the cross product of all the repeated values. (Don't worry about exact pseudocode, just understand that the algorithm must output the full cross product of matching tuples.)

=External Join Algorithms=

External means the tables are larger than memory, so you have to use disk to hold intermediate results.

Assumes that M > sqrt(|S|). We'll talk about why this is, but it's realistic nonetheless: it means 1 meg of memory can support joins of tables of size 1 terabyte (doing the arithmetic in bytes, sqrt(10^12) = 10^6). [NOTE: I think this might have to be M > 2*sqrt(|S|), so you can make sure you can fit parts of S AND R in memory.]

==External Sort-Merge Join==

 phase 1: partition
     repeat until done: read M pages of S, sort, write out (|S|/M runs are created)
     repeat until done: read M pages of R, sort, write out (|R|/M runs are created)
 phase 2: merge
     merge a bunch of runs at a time

Example (M can hold 3 tuples at a time):
 R = 1,4,3,6,9,14,1,7,11
 S = 2,3,7,12,9,8,4,15,6

 R's runs: (1,3,4) (6,9,14) (1,7,11)
 S's runs: (2,3,7) (8,9,12) (4,6,15)

So now keep a cursor at the beginning of each run (that is, store a page of each run of R and S). We'd need 6 cursors: three starting at 1, 6, and 1 for R's runs, and three starting at 2, 8, and 4 for S's runs. Across the |S|/M runs of S and |R|/M runs of R, look within R's runs to find the smallest cursor value, compare it to the smallest cursor value among S's runs, and advance the smallest cursor once it's no longer needed for a join.

Sam notes that maybe we need M to be 6 here (the factor of 2 mentioned earlier), so that we can hold one page from each run of R and S in memory at once. Imagine M = sqrt(|S|). Then there are |S|/sqrt(|S|) = sqrt(|S|) runs of S, whose first pages alone fill memory. If S is any larger, we won't be able to hold one page per run of R and S.
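To make the two phases concrete, here is a minimal runnable sketch (not from the lecture) with some simplifying assumptions: a "tuple" is just its integer join key, M counts tuples rather than pages, and runs live in in-memory lists instead of files on disk.

 import heapq

 def make_runs(table, M):
     # phase 1: cut the table into M-tuple chunks and sort each chunk
     return [sorted(table[i:i + M]) for i in range(0, len(table), M)]

 def sort_merge_join(R, S, M):
     # phase 2: heapq.merge keeps one cursor per run and always advances
     # the smallest, which is exactly the multi-run merge described above
     r = list(heapq.merge(*make_runs(R, M)))
     s = list(heapq.merge(*make_runs(S, M)))
     out, i, j = [], 0, 0
     while i < len(r) and j < len(s):
         if r[i] < s[j]:
             i += 1
         elif r[i] > s[j]:
             j += 1
         else:
             # duplicate handling: find the group of equal keys on each
             # side and emit the full cross product (two 7s in R and two
             # 7s in S must yield four (7,7) outputs)
             v, i2, j2 = r[i], i, j
             while i2 < len(r) and r[i2] == v:
                 i2 += 1
             while j2 < len(s) and s[j2] == v:
                 j2 += 1
             out.extend([(v, v)] * ((i2 - i) * (j2 - j)))
             i, j = i2, j2
     return out

 R = [1, 4, 3, 6, 9, 14, 1, 7, 11]
 S = [2, 3, 7, 12, 9, 8, 4, 15, 6]
 print(sort_merge_join(R, S, 3))  # [(3, 3), (4, 4), (6, 6), (7, 7), (9, 9)]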
Total I/O for external sort-merge:
 |R| + |S| [read R and S to sort them - sequential]
 + |R| + |S| [write the runs out - sequential]
 + |R| + |S| [scan through the runs and merge them - somewhat random access, since we advance cursors onto new pages in somewhat random order]
 = 3(|R| + |S|)

Unix sort uses this algorithm, as do many DBs.

==Hash Joins==

The simple external hash join (a runnable sketch appears at the end of these notes):

 h(x) -> {1,...,n}
 size = n / (|S|/M)   [hash values handled per pass]
 for pass = 0 ... |S|/M:
     min_hv = size * pass
     max_hv = size * (pass+1)
     scan S, hashing each tuple with h(x):
         if the hash is in range [min_hv, max_hv), add the tuple to hash table T
         otherwise, write the tuple back out to deal with later
     scan R, hashing each tuple with h(x):
         if the hash is in range [min_hv, max_hv), look the value up in T
         otherwise, write the tuple back out to deal with later

So you keep reading memory-sized chunks of S (since h(.) distributes values uniformly), probe them with R's tuples when the range matches, and output the matches.

Example with the previous data: h(x) = x mod 3 (size = 1)

 pass 0:
     S hash table: 3 12 9 15 6
     written back out from S: 2 7 8 4
     R tuples in range: 3 6 9
     written back out from R: 1 4 14 1 7 11
     matches: 3, 6, 9
 pass 1:
     S hash table: 7 4
     written back out from S: 2 8
     R tuples in range: 1 4 1 7
     written back out from R: 14 11
     matches: 4, 7
 ...

3 passes: (R+S) [pass 0 reads everything] + (2/3)(R+S)*2 [2/3 of the tuples are written out once and read once more] + (1/3)(R+S)*2 [1/3 are written and read yet again] = 3(R+S). In general, we have to do N(R+S) I/O for N passes. So that sucks.

==GRACE Hash==

 choose sqrt(|R|) partitions, dividing up the hash range
 hash R into these partitions
 hash S into these partitions
 for each partition p:
     build a hash table T on partition p of R
     scan partition p of S, looking each tuple up in T

[It's probably fine to choose sqrt(|S|) partitions, but we used sqrt(|R|) for some reason.]

Note: we read |R| + |S| pages to hash the tables, write |R| + |S| out as partitions, and read |R| + |S| in again to join. So 3(|R| + |S|), but maybe with a bit more randomness than the sort-merge join. That's nice! The only time the simple hash algorithm wins is when there is enough RAM to hold a significant portion of the tables, so that it finishes in fewer than 3 passes over the data. (A sketch of GRACE also appears at the end of these notes.)

==Hybrid Hash Join==

Gets the best of simple and GRACE hash. A simple idea would be to decide which algorithm to use based on the size of memory relative to the join. They actually use a cleverer scheme: partition as in GRACE, but use the extra RAM to keep the hash table for the first partition of R in memory, so that partition is joined on the fly (as in the simple algorithm) and never written to disk.

==Sort-Merge Join vs. GRACE Hash==

While they have the same I/O complexity, sort-merge join was about 5x slower (experimentally, in the 1980s), because sorting takes a nontrivial amount of CPU time. So unless you need the data in sorted order at the end of the join, for use in another operator, stick with hash joins!
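As referenced above, here is a minimal runnable sketch of the multi-pass simple hash join, under the same toy assumptions as the sort-merge sketch (integer join keys, tuple counts instead of pages, plain lists standing in for the "written back out" overflow files). It builds the per-pass table on S and probes with R, as in the pseudocode; the name simple_hash_join and the n_passes parameter are mine, not from the notes.

 def simple_hash_join(R, S, n_passes):
     # h(x) = x mod n_passes; pass p keeps tuples whose hash equals p
     out, r_rest, s_rest = [], R, S
     for p in range(n_passes):
         T = {}                      # in-memory hash table on S's in-range tuples
         s_next = []
         for x in s_rest:
             if x % n_passes == p:   # in this pass's hash range: keep
                 T[x] = T.get(x, 0) + 1
             else:                   # out of range: "write back out"
                 s_next.append(x)
         r_next = []
         for x in r_rest:
             if x % n_passes == p:   # probe the hash table
                 out.extend([(x, x)] * T.get(x, 0))
             else:
                 r_next.append(x)
         s_rest, r_rest = s_next, r_next
     return out

 print(simple_hash_join([1, 4, 3, 6, 9, 14, 1, 7, 11],
                        [2, 3, 7, 12, 9, 8, 4, 15, 6], 3))
 # [(3, 3), (6, 6), (9, 9), (4, 4), (7, 7)] -- pass 0, then pass 1, as in the example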
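A matching sketch of the GRACE hash join, under the same toy assumptions (integer keys; in-memory lists standing in for partition files on disk). The n_parts parameter plays the role of the sqrt(|R|) partition count from the notes.

 def grace_hash_join(R, S, n_parts):
     # partitioning phase: one scan of each table, writing every tuple
     # to the partition its hash value selects
     r_parts = [[] for _ in range(n_parts)]
     s_parts = [[] for _ in range(n_parts)]
     for x in R:
         r_parts[x % n_parts].append(x)
     for x in S:
         s_parts[x % n_parts].append(x)
     # join phase: matching tuples share a join key, hence a hash, hence
     # a partition, so each partition pair can be joined independently
     out = []
     for rp, sp in zip(r_parts, s_parts):
         T = {}                      # hash table on this partition of R
         for x in rp:
             T[x] = T.get(x, 0) + 1
         for x in sp:                # probe with this partition of S
             out.extend([(x, x)] * T.get(x, 0))
     return out

 print(grace_hash_join([1, 4, 3, 6, 9, 14, 1, 7, 11],
                       [2, 3, 7, 12, 9, 8, 4, 15, 6], 3))
 # [(3, 3), (9, 9), (6, 6), (7, 7), (4, 4)]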