Database System Internals 9/18/2008 Readings (for this lecture and the last): # Joseph Hellerstein and Michael Stonebraker. The Anatomy of a Database System. In "Readings in Database Systems" (aka "The Red Book"). Focus on Sections 1-4, though you should also read Sections 5.1 and 5.2 and skim Section 6. # M.M. Astrahan et al. System R: Relational Approach to Database Management. ACM TODS 1(2), 1976. Pages 97-137. Read up to page 122; you may also skip the "Optimizer" section, pages 110 - 114. =Review From last Time= System R RDS holds - parser - discussed last time - rewriter - discussed last time - planner - today - optimizer - today - executor - today System R RSS holds - access methods - today - buffer pool =RDS= ==Planner== SQL Query Tree -> Logical Plan Logical plan -> Physical Plan Some notation (sorry - this is a text file - I'll use LaTeX notation, but you should draw it) - \sigma_{condition} T -> Select tuples from table T based on condition. ex. \sigma_{age > 17} Employees - A \bowtie_{condition} B -> Join A and B (get the cross product), and apply the filter in condition to B. ex. Employee \bowtie_{employee.deptid = dept.id} Department - \Pi_{f1,...,fn} T -> Project columns f1 through fn from table T. You can select only certain columns from that table. - \alpha_{aggreagation, grouping} -> Aggregate using aggregation method with GROUP BY specified by grouping. Remember that relational algebra is closed on relations, so any of these operators can operate on the output of other operators. Example ------- emp(eno, ename, sal, dno) dept(dno, dname, bldg) kids(kno, eno, kname, birthday) [lists children of the employees, to give them ice cream on their birthdays] SELECT ename, count(*) FROM emp, dept, kids WHERE emp.dno = dept.dno AND emp.sal > 50k [ treated as a \sigma ] AND kids.eno = emp.eno [ treated as a \bowtie ] AND dept.name = 'eecs' GROUP BY ename HAVING COUNT(*) > 7; In relational algebra, one representation (of many) can be: \sigma_{count(*) > 7} (\alpha_{count(*), ename) (\Pi_{ename} (((\sigma_{sal > 50k} emp \bowtie_{dno = dno} \sigma_{name = 'eecs'} dept) \bowtie_{eno = eno} kids))) Key takeaway - this process is pretty mechanical. However, operators commute with each other, so many equivalent expressions work. Since this is difficult to parse for a human, we often draw these expressions as trees \sigma | \Pi | \bowtie / \ \bowtie kids / \ \sigma \sigma | | emp dept If you weren't doing this in a text file using ascii, you would annotate each of the nodes of the tree with conditions, etc. Remember - output of each operation i sa relation - any operator can consume the output of any other operator - we'll study various efficient methods for permuting the query tree for efficiency reasons (filtering before joins, etc.) The plan above is a LOGICAL plan. A PHYSICAL plan would need implementations: - access methods (file scan, index lookup,...) - join algorithms (sort merge join, nested loops join,...) =RSS= ==Access Methods== 1) Heap Files Simplest way to write data out to disk is a simple Heap File. This file has pages laid out consecutively. In each page, you place tuples on page in whatever order they were stored. There are finer details to how to lay out tuples, which we'll discuss later. 2) B+Trees (Generically called "image" in System R. Generically called "index" by the modern databases community). Like a binary tree has a branching factor of 2, a B+Tree has a branching factor of B. B is selected so that B keys take approximately one page on disk. Everything to the right of an entry on a B+Tree node is smaller than it. ------------ |10| | | | ------------ / |5|7|9| | / \ ------- ---------- |2|3|4|4| | | | | | ------- ---------- The leaves of the B+Tree point into the heapfile, and are in order. So you create an index on some field, which creates an ordered B+Tree that points to an unsorted heapfile. A single lookup for a key is efficient, but a sequential scan based on the order of the index causes you to jump all over the disk to read tuples from the heapfile. To make sequential scans on a single indexed id more efficient, you can CLUSTER the heap file so that the order of that file is the same as a single index. This makes scanning it according to the order of the index more efficient, since you can also sequentially scan the heap file and have the tuples stay in order. We'll get into more access methods later. ==Executor== How a query is executed in code (iterator model): it = access_method.open(predicate) loop: tuple = it.next() ... So operators implement the iterator interface, and can feed into each other. Example (with query tree turned sideways): \sigma -- emp / \bowtie \ dept There are many implementations for the join (\bowtie). A simple (and inefficient one) is a nested loops join (which has a left and right iterator) for each in left: t1 = left.next for each in right: t2 = right.next if (t1 (condition) t2) output t1,t2 So each operator can interface with any other operator through an interface, some (joins) interface with more than one iterators. This is nice, because it makes the executor implementation very easy! Just keep calling next() on the operator at the top of the query tree. Note: System R folks call predicates for joins/selects search arguments (sargs). A predicate that can be pushed down the tree (such as a filter) is called a "sargable predicate." ==Query Optimization== We won't go into it now, but we'll study the "Selinger Optimizer," a classic query optimizer, in detail in a few weeks. Optimizer chooses which plan is most efficient. A join B join C join D join E has 24 (5!) different orderings. Actually, it has more, since A join B is different than B join A (inner vs. outer) join. There are also N parenthesizations for a string of N tables being joined, so there are more than N! options to choose from. Thus, optimizers are important, since N is large (can be on the order of 10) in practice. A brief introduction to join optimization: d / join c \ / join b \ / join \ a (called a "left-deep" join) d / join / \ join b c \ / join \ a (called a "bushy" join) Bushy joins used to be avoided because one a tuple comes to the left side of a top-most join, it has to execute another join for that tuple. That join on the right is executed repeatedly, wasting resources. A common way to get around this is to store the result of the right-most join, so that they are only calculated once, and so bushy plans are now used by some commercial vendors.