11/25/2008

Readings:
* Shivnath Babu and Pedro Bizarro. Adaptive Query Processing in the Looking Glass. In Proceedings of CIDR, 2005. http://www-db.cs.wisc.edu/cidr/cidr2005/papers/P20.pdf
* Ron Avnur and Joseph M. Hellerstein. Eddies: Continuously Adaptive Query Processing. SIGMOD Conference 2000. In Red Book.

=Adaptive query processing (AQP)=

The idea: for long-running queries, we can change the query plan (its shape or its physical operators) while the query is running.

We might want to do this if the properties of the input change, or turn out to have been mis-estimated, while the query runs:
- e.g. a predicate's selectivity is different than expected because it is correlated with another predicate.
- e.g. the estimate of the number of tuples produced at some level of the plan tree was incorrect.

=Prior work=

1) Kabra + DeWitt - plan-based AQP for conventional queries
2) Eddies - adaptivity for streaming data
   - if the observed statistics differ from the estimates, modify the query plan

=Kabra + DeWitt Paper=

Key idea: between the plan's operators, insert statistics collection operators.

      join                      join
     /    \                    /    \
  filter    B       ->      stats     B
   /                         /
  A                       filter
                           /
                        stats
                         /
                        A

Suppose we realize the estimated selectivity is way off. What can we do?

1) Throw out the entire result set, replan, and re-execute. Wasteful but simple.
2) Change the plan and preserve the results you've already produced. Less waste, but harder to get right.
3) Change the plan only at well-defined points in execution -> Kabra + DeWitt.

Options 2 and 3 guarantee forward progress.

What is a well-defined point at which to re-order? Right before a blocking operator, since it needs all of its input tuples anyway:
- sort
- aggregate
- hash/merge join

=Eddies=

In streaming databases like Aurora, it helps to arrange query operators so that the most selective operators process data first. There has been lots of work on choosing which operator to run first and on rearranging operators as the query runs.

Eddies does this differently: the eddy, or tuple router, receives tuples and decides which operator to send each tuple to. Each operator runs in its own thread; the tuple router pushes tuples onto the operators' queues, receives them back, and sends them out to new operators.

Each tuple carries a header with "ready" bits saying which operators can process it (example: a join requires certain fields to exist, so until those fields exist, the ready bit for the join operator stays false). Once a tuple has been processed by an operator, its "done" bit for that operator is set, so the operator doesn't run on it twice. An operator like a selection can drop a tuple that doesn't pass its filter.

==Lottery Scheduling==

We'd like to send a tuple to a more selective operator before a less selective one, so that we can kill it early if it doesn't pass the selective one. We could use a random routing strategy for a while, learn selectivity estimates, and then pick the best order, but the selectivities might change over time. So instead, make the probability of sending a tuple to an operator proportional to how selective it is (tracked with tickets, below); that way, the most selective operators get the largest share of tuples.

To implement this, they:
- Give an operator a ticket each time it receives a tuple.
- Take a ticket away from the operator each time it emits a tuple.
- Start all operators with 1 ticket.
- Route each tuple to an operator with probability proportional to that operator's fraction of the tickets.

So operators that are highly selective, or that are cheaper than the others and therefore process tuples faster, end up receiving more tuples first, which is what we want. (A rough sketch of this routing loop follows below.)
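A minimal sketch of ticket-based routing, assuming Python. The names (Operator, route, run_eddy) are invented for illustration and are not from the Eddies paper; it ignores threads, queues, backpressure, and ready bits (every operator here is a simple selection, so only done bits are tracked). It only shows how tickets bias which operator a tuple visits first.

import random

class Operator:
    """A toy selection operator: consumes a tuple, maybe returns it."""
    def __init__(self, name, predicate):
        self.name = name
        self.predicate = predicate
        self.tickets = 1                      # all operators start with 1 ticket

    def process(self, tup):
        # Return the tuple if it passes the filter, else None (dropped).
        return tup if self.predicate(tup) else None

def route(done, operators):
    """Lottery: pick the next operator for a tuple, weighted by tickets,
    among operators whose 'done' bit for this tuple is not yet set."""
    eligible = [op for op in operators if op.name not in done]
    if not eligible:
        return None                           # all done bits set -> tuple is output
    return random.choices(eligible, weights=[op.tickets for op in eligible])[0]

def run_eddy(tuples, operators):
    output = []
    for tup in tuples:
        done = set()                          # this tuple's "done" bits
        while True:
            op = route(done, operators)
            if op is None:
                output.append(tup)            # passed every operator
                break
            op.tickets += 1                   # ticket gained on receiving a tuple
            result = op.process(tup)
            done.add(op.name)
            if result is None:
                break                         # dropped early by a selective operator
            op.tickets -= 1                   # ticket lost on emitting it back
    return output

# The highly selective filter drops most tuples, so it keeps the tickets it
# gains and is routed to first more and more often.
ops = [Operator("a_lt_10", lambda t: t["a"] < 10),
       Operator("b_even",  lambda t: t["b"] % 2 == 0)]
data = [{"a": i, "b": 3 * i} for i in range(1000)]
print(len(run_eddy(data, ops)), [(op.name, op.tickets) for op in ops])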
If stream selectivities are constant for a while and then change later, we want to notice this occurring and change the routing probabilities. Eddies handles this by routing with one set of tickets for a while, while collecting a fresh set of tickets to use in the next routing stage.

==Blocking Operators/Symmetric Hash Join==

A blocking join would not be useful in Eddies, since it stops all tuples from flowing. We instead use a symmetric hash join: for R join S, build a hash table on both R and S. Each time a tuple from R comes in, add it to R's hash table and probe S's hash table with its join field. If the probe finds matches in S, output the joined tuples. Otherwise, when the matching S tuple(s) arrive later, they will find this tuple in R's table (tuples arriving from S are handled the same way, with the roles swapped). This takes twice as many hash table builds and probes, but it is non-blocking: if you don't produce a join result now, you'll produce it later. (A rough sketch appears at the end of these notes.)

Now you can have two operators in Eddies for each join of R and S:
- One operator adds a tuple to S's table and probes R's.
- The other operator adds a tuple to R's table and probes S's.

This way, tuples feed into one or the other based on the ready/done bits, and nothing blocks!

=Who Uses This?=

The Kabra + DeWitt approach is starting to rear its head in commercial systems. Eddies is neat, but it is hard to build an eddy whose performance is competitive with other systems.
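As referenced above, here is a rough sketch of a pipelined symmetric hash join, assuming Python. The class name and the insert() interface are invented for illustration; a real eddy would split this into two operators (one per input) and drive them from the tuple router rather than from a simple loop.

from collections import defaultdict

class SymmetricHashJoin:
    """Non-blocking join of streams R and S on a join key.
    Keeps one hash table per input; every arriving tuple is inserted into
    its own side's table and immediately probes the other side's table."""
    def __init__(self, r_key, s_key):
        self.r_key, self.s_key = r_key, s_key
        self.r_table = defaultdict(list)       # hash table on R
        self.s_table = defaultdict(list)       # hash table on S

    def insert(self, side, tup):
        """side is 'R' or 'S'; returns the join results producible right now."""
        if side == "R":
            k = tup[self.r_key]
            self.r_table[k].append(tup)        # build into R's table
            return [{**tup, **s} for s in self.s_table.get(k, [])]   # probe S
        else:
            k = tup[self.s_key]
            self.s_table[k].append(tup)        # build into S's table
            return [{**r, **tup} for r in self.r_table.get(k, [])]   # probe R

# Interleaved arrivals from R and S; results trickle out as matches appear,
# regardless of which side's tuple showed up first.
join = SymmetricHashJoin(r_key="id", s_key="id")
stream = [("R", {"id": 1, "x": "a"}),
          ("S", {"id": 1, "y": "b"}),   # joins with the R tuple seen earlier
          ("S", {"id": 2, "y": "c"}),
          ("R", {"id": 2, "x": "d"})]   # joins with the S tuple seen earlier
for side, tup in stream:
    for out in join.insert(side, tup):
        print(out)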