9/30/2008 Join Algorithms

Reading:
* L.D. Shapiro. Join Processing in Database Systems with Large Main Memories. In the Red Book.

Assume o outer tuples and i inner tuples (used in the big-O notation below).

Some notation for further analysis:
 S.a = R.b is the kind of join we are doing
 {S} = # tuples in S
 |S| = # pages in S
 M = # pages of memory
 |R| < |S| (R is the smaller table)

=Simple Join Algorithms=

Nested loops - O(o*i) in the number of tuples.

Index nested loops - iterate through O(o) tuples, and do an index lookup (at least 1 seek) for each of them.

In-memory hash - hash the smaller table as the inner (so it fits in memory), iterate through O(o) tuples, and do an in-memory lookup for each of them.

In-memory sort-merge - O(o log o) [sort] + O(i log i) [sort] + O(o) [merge scan] + O(i) [merge scan].

=Deeper Analysis=

==Nested Loops Join==

{S} x {R} comparisons.

Example: |S| = 4, |R| = 2, M = 3, LRU buffer, counting one scan of the inner per page of the outer:
- S is inner: 2 + 4 + 4 = 10 (R's 2 pages as outer, plus two full scans of S, which at 4 pages never fits in the 3-page buffer)
- R is inner: 4 + 2 = 6 (S's 4 pages as outer; R's 2 pages fit in the buffer, so LRU keeps them cached after the first scan)

More formally, if R is inner: {S}|R| [one scan of R as the inner per tuple of S] + |S| [scanning through S as outer].

==Index-Nested Loops==

(Given an index on R.b)

 for s in S:
     find matches in R.b's index

I/O: {S} + |S| pages are read, but the {S} index lookups are all random accesses.

==Hash (in memory)==

Only works when M > |R| (plus room for the hash table, etc.)

 build hash table T on R
 for each s in S:
     look up s.a in T

I/O: |S| + |R|

==Sort-Merge==

 read R, sort
 read S, sort
 merge

Deal with duplicate values specially:

 S's joined fields, sorted: 1 2 5 7 7 9
 R's joined fields, sorted: 2 3 7 7 9

We want to output (2,2), (7,7), (7,7), (7,7), (7,7), (9,9), but if we just followed the basic merge algorithm, we'd only output two copies of (7,7). So be careful: you have to look ahead to see whether the next value is repeated, so that you can output the cross product of all the repeated values. (Don't worry about exact pseudocode, just understand that the algorithm must output the full cross product of matching tuples.)

=External Join Algorithms=

External means the tables are larger than memory, so you have to use disk to hold intermediate results.

Assumes that M > sqrt(|S|). We'll talk about why this is, but it's realistic nonetheless: it means 1 meg of memory can support joins of tables of size 1 terabyte (doing the arithmetic in bytes, sqrt(10^12) = 10^6). [NOTE: I think this might have to be M > 2*sqrt(|S|), so you can make sure you can fit parts of S AND R in memory.]

==External Sort-Merge Join==

 phase 1: partition
     repeat until done: read M pages of S, sort, write out (|S|/M runs are created)
     repeat until done: read M pages of R, sort, write out (|R|/M runs are created)
 phase 2: merge
     merge a bunch of runs at a time

Example (M can hold 3 tuples at a time):
 R = 1,4,3,6,9,14,1,7,11
 S = 2,3,7,12,9,8,4,15,6

 R's runs: (1,3,4) (6,9,14) (1,7,11)
 S's runs: (2,3,7) (8,9,12) (4,6,15)

So now keep a cursor at the beginning of each run (that is, store a page of each run of R and S). We'd need 6 cursors: three starting at 1, 6, and 1 for R's runs, and three starting at 2, 8, and 4 for S's runs. Across the |S|/M runs of S and |R|/M runs of R, look within R's runs to find the smallest cursor value, compare it to the smallest cursor value among S's runs, and advance the smallest cursor once it's no longer needed for a join.

Sam notes that maybe we need M to be 6 here (the factor of 2 mentioned earlier), so that we can hold one page from each run of R and S in memory at once. Imagine M = sqrt(|S|). Then there are |S|/sqrt(|S|) = sqrt(|S|) runs of S, whose first pages alone fill memory. If S is any larger, we won't be able to hold one page per run of R and S.
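To make the two phases concrete, here is a minimal runnable sketch (not from the lecture) with some simplifying assumptions: a "tuple" is just its integer join key, M counts tuples rather than pages, and runs live in in-memory lists instead of files on disk.

 import heapq

 def make_runs(table, M):
     # phase 1: cut the table into M-tuple chunks and sort each chunk
     return [sorted(table[i:i + M]) for i in range(0, len(table), M)]

 def sort_merge_join(R, S, M):
     # phase 2: heapq.merge keeps one cursor per run and always advances
     # the smallest, which is exactly the multi-run merge described above
     r = list(heapq.merge(*make_runs(R, M)))
     s = list(heapq.merge(*make_runs(S, M)))
     out, i, j = [], 0, 0
     while i < len(r) and j < len(s):
         if r[i] < s[j]:
             i += 1
         elif r[i] > s[j]:
             j += 1
         else:
             # duplicate handling: find the group of equal keys on each
             # side and emit the full cross product (two 7s in R and two
             # 7s in S must yield four (7,7) outputs)
             v, i2, j2 = r[i], i, j
             while i2 < len(r) and r[i2] == v:
                 i2 += 1
             while j2 < len(s) and s[j2] == v:
                 j2 += 1
             out.extend([(v, v)] * ((i2 - i) * (j2 - j)))
             i, j = i2, j2
     return out

 R = [1, 4, 3, 6, 9, 14, 1, 7, 11]
 S = [2, 3, 7, 12, 9, 8, 4, 15, 6]
 print(sort_merge_join(R, S, 3))  # [(3, 3), (4, 4), (6, 6), (7, 7), (9, 9)]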
Total I/O for external sort-merge:
 |R| + |S| [read R and S to sort them - sequential]
 + |R| + |S| [write the runs out - sequential]
 + |R| + |S| [scan through the runs and merge them - somewhat random access, since we advance cursors onto new pages in somewhat random order]
 = 3(|R| + |S|)

Unix sort uses this algorithm, as do many DBs.

==Hash Joins==

The simple external hash join (a runnable sketch appears at the end of these notes):

 h(x) -> {1,...,n}
 size = n / (|S|/M)   [hash values handled per pass]
 for pass = 0 ... |S|/M:
     min_hv = size * pass
     max_hv = size * (pass+1)
     scan S, hashing each tuple with h(x):
         if the hash is in range [min_hv, max_hv), add the tuple to hash table T
         otherwise, write the tuple back out to deal with later
     scan R, hashing each tuple with h(x):
         if the hash is in range [min_hv, max_hv), look the value up in T
         otherwise, write the tuple back out to deal with later

So you keep reading memory-sized chunks of S (since h(.) distributes values uniformly), probe them with R's tuples when the range matches, and output the matches.

Example with the previous data: h(x) = x mod 3 (size = 1)

 pass 0:
     S hash table: 3 12 9 15 6
     written back out from S: 2 7 8 4
     R tuples in range: 3 6 9
     written back out from R: 1 4 14 1 7 11
     matches: 3, 6, 9
 pass 1:
     S hash table: 7 4
     written back out from S: 2 8
     R tuples in range: 1 4 1 7
     written back out from R: 14 11
     matches: 4, 7
 ...

3 passes: (R+S) [pass 0 reads everything] + (2/3)(R+S)*2 [2/3 of the tuples are written out once and read once more] + (1/3)(R+S)*2 [1/3 are written and read yet again] = 3(R+S). In general, we have to do N(R+S) I/O for N passes. So that sucks.

==GRACE Hash==

 choose sqrt(|R|) partitions, dividing up the hash range
 hash R into these partitions
 hash S into these partitions
 for each partition p:
     build a hash table T on partition p of R
     scan partition p of S, looking each tuple up in T

[It's probably fine to choose sqrt(|S|) partitions, but we used sqrt(|R|) for some reason.]

Note: we read |R| + |S| pages to hash the tables, write |R| + |S| out as partitions, and read |R| + |S| in again to join. So 3(|R| + |S|), but maybe with a bit more randomness than the sort-merge join. That's nice! The only time the simple hash algorithm wins is when there is enough RAM to hold a significant portion of the tables, so that it finishes in fewer than 3 passes over the data. (A sketch of GRACE also appears at the end of these notes.)

==Hybrid Hash Join==

Gets the best of simple and GRACE hash. A simple idea would be to decide which algorithm to use based on the size of memory relative to the join. They actually use a cleverer scheme: partition as in GRACE, but use the extra RAM to keep the hash table for the first partition of R in memory, so that partition is joined on the fly (as in the simple algorithm) and never written to disk.

==Sort-Merge Join vs. GRACE Hash==

While they have the same I/O complexity, sort-merge join was about 5x slower (experimentally, in the 1980s), because sorting takes a nontrivial amount of CPU time. So unless you need the data in sorted order at the end of the join, for use in another operator, stick with hash joins!
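As referenced above, here is a minimal runnable sketch of the multi-pass simple hash join, under the same toy assumptions as the sort-merge sketch (integer join keys, tuple counts instead of pages, plain lists standing in for the "written back out" overflow files). It builds the per-pass table on S and probes with R, as in the pseudocode; the name simple_hash_join and the n_passes parameter are mine, not from the notes.

 def simple_hash_join(R, S, n_passes):
     # h(x) = x mod n_passes; pass p keeps tuples whose hash equals p
     out, r_rest, s_rest = [], R, S
     for p in range(n_passes):
         T = {}                      # in-memory hash table on S's in-range tuples
         s_next = []
         for x in s_rest:
             if x % n_passes == p:   # in this pass's hash range: keep
                 T[x] = T.get(x, 0) + 1
             else:                   # out of range: "write back out"
                 s_next.append(x)
         r_next = []
         for x in r_rest:
             if x % n_passes == p:   # probe the hash table
                 out.extend([(x, x)] * T.get(x, 0))
             else:
                 r_next.append(x)
         s_rest, r_rest = s_next, r_next
     return out

 print(simple_hash_join([1, 4, 3, 6, 9, 14, 1, 7, 11],
                        [2, 3, 7, 12, 9, 8, 4, 15, 6], 3))
 # [(3, 3), (6, 6), (9, 9), (4, 4), (7, 7)] -- pass 0, then pass 1, as in the example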
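A matching sketch of the GRACE hash join, under the same toy assumptions (integer keys; in-memory lists standing in for partition files on disk). The n_parts parameter plays the role of the sqrt(|R|) partition count from the notes.

 def grace_hash_join(R, S, n_parts):
     # partitioning phase: one scan of each table, writing every tuple
     # to the partition its hash value selects
     r_parts = [[] for _ in range(n_parts)]
     s_parts = [[] for _ in range(n_parts)]
     for x in R:
         r_parts[x % n_parts].append(x)
     for x in S:
         s_parts[x % n_parts].append(x)
     # join phase: matching tuples share a join key, hence a hash, hence
     # a partition, so each partition pair can be joined independently
     out = []
     for rp, sp in zip(r_parts, s_parts):
         T = {}                      # hash table on this partition of R
         for x in rp:
             T[x] = T.get(x, 0) + 1
         for x in sp:                # probe with this partition of S
             out.extend([(x, x)] * T.get(x, 0))
     return out

 print(grace_hash_join([1, 4, 3, 6, 9, 14, 1, 7, 11],
                       [2, 3, 7, 12, 9, 8, 4, 15, 6], 3))
 # [(3, 3), (9, 9), (6, 6), (7, 7), (4, 4)]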