1/2/2008 Query Optimization

Reading:
* Patricia Selinger, M. Astrahan, D. Chamberlin, Raymond Lorie, and T. Price. Access Path Selection in a Relational Database Management System. Proceedings of ACM SIGMOD, 1979. Pages 22-34. In the Red Book.
* Optionally, you may also wish to look at: Michael Mannino, Paichen Chu, and Thomas Sager. Statistical Profile Estimation in Database Systems. ACM Computing Surveys 20(3), 1988. Pages 191-221. This paper discusses many of the techniques used to make query optimization and cost estimation practical in modern database systems.

How do we determine which query plan to run, given all of our options?

Patricia Selinger -> 1979 paper for System R on cost-based query optimization. Many of the foundations of the Selinger optimizer are still used today!

Cost-based optimization = find the min(cost) plan. Compare this to heuristic optimization, which doesn't assign costs to the various decisions.

=Cost Model=

==Definitions==
- Selectivity (F) - fraction of tuples that pass through a filter.
  NOTE: in this terminology, "highly selective" means F is high and MORE tuples get through (think of it like a productivity), which is the opposite of the colloquial sense.
  For a JOIN, selectivity = (tuples passed through) / (size of the cross product of the two tables)
- NCARD - relation cardinality - the number of tuples in the relation. We used to call this {R}.
- TCARD - the number of pages in the relation. We used to call this |R|.
- ICARD - number of keys in the index
- NINDX - number of pages in the index

==Selectivity Estimation==
There is a lot of research into this, but let's assume uniformity of the key distribution, and assume some defaults for values we don't know.

For a filter:
- col = value: estimate 1/ICARD if an index exists, or 1/10 otherwise
- col > value: assume a uniform distribution of keys and estimate
  ((max - min) - (val - min)) / (max - min) = (max - val) / (max - min),
  or 1/3 if no max/min exist
- col1 = col2 (join): estimate 1/ICARD(primary key column being joined);
  if there is no primary key, then 1/max(ICARD(col1), ICARD(col2))

==Cost of Access Methods==
W - fraction of the cost of accessing a page that the CPU takes for doing work like evaluating a predicate

equality predicate on a B+tree:
  1 [btree lookup] + 1 [heapfile lookup] + W [evaluate predicate on heap page]

range scan w/ predicate:
  clustered index: F(pred) * (NINDX + TCARD) + W * (tuples read)
  unclustered index: F(pred) * (NINDX + NCARD) + W * (tuples read)

seq scan: TCARD + W * NCARD

==Join Costs==
Nested Loops(A,B):
  Cost(accessing outer) [outer scan] + NCARD(outer) [per tuple of outer] * Cost(inner) [scan inner]
INL(A,B) (index nested loops):
  Cost(accessing outer) [outer scan] + NCARD(outer) [per tuple of outer] * Cost(inner index lookup)
Merge:
  Cost(accessing outer) + Cost(accessing inner) + sort cost

As you go up the tree of joins, estimates become worse and worse, due to correlations and other things we don't account for.

=Selecting a Plan=
Why is it hard? Can't we just write out all possible plans and pick the one with the smallest cost? There are too many possible plans to consider!

Consider joining A, B, C: (AB)C, A(BC), (AC)B, A(CB), ...
Number of orderings of n letters (n tables): n!
Number of parenthesizations of n letters: (n-1)!
So n!(n-1)! plans! Too much!

Heuristics - only consider these kinds of plans:
- push down selections - always makes sense if the selection operation is free.
  Reconsider this if your predicate involves an FFT or something like that.
- left-deep plans - bad if you want to parallelize joins, but on a single thread it's probably fine.
  This removes the choice of parenthesization, as all parenthesizations look like ((((AB)C)D)E).
- only consider non-cross-products - don't join tables that have no join predicate between them.

=Dynamic Programming=
Turns the n! search space into complexity 2^n.

Intuition - (ABC)DE - the best way to join ABC is independent of how it joins with DE later.
(Kind of true. Not the case if you might be able to do a sort-merge join and could get ABC in a good sort order.)

for delta in 1...|R|:
  for all S in {all length-delta subsets of R}:
    optjoin(S) = (S - a) join a,
      where a is the relation that minimizes
        cost(optjoin(S - a)) + min cost to join (S - a) with a + min cost to access a

Example: ABCD
delta = 1:
  A - e.g. cost to seq scan
  B - e.g. cost to do an index lookup
  C ...
  D ...
delta = 2:
  AB - best of AB or BA
  AC - best of AC or CA
  BC ...
  AD ...
  BD ...
  CD ...
delta = 3:
  ABC - remove A, compare cost of BC join A vs. A join BC
      - remove B, compare cost of AC join B vs. B join AC
      - remove C, ...
  ACD ...
  ABD ...
  BCD ...
...
Keep going, and pick the smallest-cost plan for the full set.

Cost of the algorithm: the outer loop eventually goes through all subsets, so 2^n.
The inner loop considers all ways to remove one item from a set of size delta - at most n per subset.
So an upper bound on the cost is (2^n) * n. 2^n is way smaller than n!: if n = 20, 2^n is about 10^6, but n! is about 10^18.

Note: this algorithm looks at a mix of left- and right-deep plans, meaning it can flip each subtree across its vertical axis. These are still deep plans - parallel (bushy) joins are not produced. You need this mix of left- and right-deepness in order to consider every choice of which relation is the inner and which is the outer at each join.
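To make the loop structure concrete, here is a minimal Python sketch of the dynamic program above. The cost functions passed in (access_cost, join_cost) are hypothetical placeholders rather than the System R formulas; the point is the shape of the search, which visits each of the 2^n subsets once and tries at most n removals per subset.

from itertools import combinations

def best_plan(relations, access_cost, join_cost):
    # relations: list of relation names
    # access_cost(r): cheapest way to access relation r on its own
    # join_cost(plan, r): estimated cost of joining an existing sub-plan with r
    # optjoin maps a frozenset of relations -> (cost, plan); plans are nested tuples
    optjoin = {frozenset([r]): (access_cost(r), r) for r in relations}

    for delta in range(2, len(relations) + 1):          # subset size
        for subset in combinations(relations, delta):   # all subsets of that size
            S = frozenset(subset)
            best = None
            for a in S:                                 # candidate relation joined last
                rest_cost, rest_plan = optjoin[S - {a}]
                cost = rest_cost + join_cost(rest_plan, a) + access_cost(a)
                if best is None or cost < best[0]:
                    best = (cost, (rest_plan, a))
            optjoin[S] = best

    return optjoin[frozenset(relations)]

# Toy usage with made-up scan costs and a flat per-join cost:
scan = {"A": 100, "B": 400, "C": 250, "D": 50}
cost, plan = best_plan(list(scan), lambda r: scan[r], lambda p, r: 75)
print(cost, plan)   # the nested tuples encode the (deep) join order

Like the pseudocode, this only produces deep plans: each step joins an optimal sub-plan with a single base relation, so bushy trees are never considered.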
=Interesting Orders=
(The caveat above about sort order is the idea of "interesting orders": the optimizer also keeps the best plan for each sort order that could help a later merge join or ORDER BY, since a slightly more expensive sorted sub-plan may win overall.)

Real-world optimizers use:
- bushy plans, to allow for parallelization
- more legitimate selectivity estimation
  - don't assume uniformity in values
  - join selectivities may be correlated

=Selectivity Estimation=
Imagine joining A join B join C join D, where every join has selectivity .1 except A join B, which has selectivity 1. If you put A join B at the bottom of the plan, you are joining way more data than you need to (10 times as much).

==Methods of Sampling==
-> Histogram
The Selinger optimizer does its estimation based on a uniform-distribution assumption about data values. This can often be wrong. For more precision, you can keep a histogram.

Say sal > 25k is a filter, and the histogram is:
  0-10k:  .4
  10-20k: .1
  20-30k: .4
  30-40k: .1
So >25k is about .2 + .1 = .3, if you assume values within histogram bins are uniform (see the sketch at the end of these notes).

-> Random Sampling
Sample some values before a long-running query to see what the values look like.

-> Keep statistics from previous queries for use by future ones.

==Joins==
The Selinger optimizer says the selectivity of a join is 1/max(ICARD(col1), ICARD(col2)). Instead of making this assumption, you can keep join statistics between the two columns being joined - more histogram bins are required, but this still gives you an idea of what will actually join together.

==Postgres Statistics==
Postgres keeps:
-> A fixed-height (equi-depth) histogram - essentially, it keeps the data values that split the column into bins containing equal numbers of rows. Helps measure the selectivity of predicates.
-> The fraction of the table that is unique, or (if there are only a small number of them) how many unique values are in the table.
-> A "correlation" value that tells you how likely one value is to be larger than the value before it in a column (measures how "sorted" the column is).

See the Postgres docs (quite detailed) if you ever want to do this yourself and replicate something decent!
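As a concrete version of the sal > 25k calculation above, here is a small Python sketch of estimating the selectivity of a col > value predicate from a histogram, assuming values are uniform within each bin. The bin boundaries and fractions are just the made-up numbers from the example.

def selectivity_gt(bins, value):
    # bins: list of (low, high, fraction) triples; fractions should sum to 1
    selectivity = 0.0
    for low, high, frac in bins:
        if value <= low:          # whole bin is above the cutoff
            selectivity += frac
        elif value < high:        # bin straddles the cutoff: take the part above it
            selectivity += frac * (high - value) / (high - low)
        # else: whole bin is below the cutoff, contributes nothing
    return selectivity

# The salary example: F(sal > 25k) = 0.4*0.5 + 0.1 = 0.3
sal_hist = [(0, 10_000, 0.4), (10_000, 20_000, 0.1),
            (20_000, 30_000, 0.4), (30_000, 40_000, 0.1)]
print(selectivity_gt(sal_hist, 25_000))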