9/25/2008 Reading: # Hong-Tai Chou and David DeWitt. An Evaluation of Buffer Management Strategies for Relational Database Systems. VLDB, 1985. From last time: Indexing + Access Methods - heap files - hash files: h(f) -> {1, ... , k} - problem is how to handle overflow chains after several updates - today: extensible hash tables - B+Trees - today - R*Trees - today =Extensible Hashing= Don't start with fixed # of buckets. Create new buckets over time as old ones fill up. if h(f)->{1,...,b}, where b is large (say 2^32). then make a family of hash functions: h_k(f) -> h(f) mod 2^k so: h_1(f) -> {0,1} so: h_2(f) -> {0,1,2,3} Example: Imagine we start with two buckets (two pages), and each page holds 3 keys. Let's say h(f) returns hash keys 0 0 2 3 2. keep table: k=1 h_k(x) | bucket 0 | 0 1 | 1 insert 0 page/bucket 0: 0 page/bucket 1: insert 0 page/bucket 0: 0 0 page/bucket 1: insert 2 page/bucket 0: 0 0 2 page/bucket 1: insert 3 page/bucket 0: 0 0 2 page/bucket 1: 3 insert 2 bucket 0 overflows! Overflow global k value. Double size of table to split large table into two. Add one table. k=1 h_k(x) | bucket 0 | 0 1 | 1 (keeps pointing where it used to) 2 | 2 3 | 1 (keeps pointing where it used to) Only have to rehash page 0 into page 0 and page 2. Don't have to rehash buckets that don't overflow. so after we insert 2, new buckets look like: page/bucket 0: 0 0 page/bucket 1: 3 page/bucket 2: 2 2 If you keep inserting 0, 0, 0, 0, 0, then you'll still need overflow pages, but it signals you might need a better hash function. =B+Trees= Has B items per page. Pointer to left of item is less than it, and to the right of item is greater than or equal to it. ||3||5||7||9|| / \ ||1||2||2||2|| ||3||3||4|| || / ||1||1|| || || -> ... -> Leaf values point to actual tuple in file, or sometimes contain tuple as value. Leaf pages point to next leaf page in order, to support range scans of data efficiently. See cow book for implementation of keeping them balanced, etc... Does it make sense, if I have 30% selectivity, to follow leaf pointers of tree to the resulting tuples? Maybe not - if it's not clustered, you have to seek per key to the appropriate page, and pages will be out of order (slow!). If it's not clustered, you can only afford to do this in practice if the selectivity is less than ~.1% So clustering helps - it makes following the keys similar to a sequential scan. But we can only cluster one index. For the rest, databases do a bitmap index scan: - Scan the leaves of the tree you are interested in. - Keep a bitmap of which pages need to be read because they contain some tuples - Do a sequential scan on disk, but only pick up the pages in the bitmap, and apply the predicate to those At worst, this results in a sequential scan. At best, the query was selective, so we only read a small amount of the pages. How deep do these trees get? Imagine keys take 8 bytes. Then a 4KB page can store 512 keys. In a depth-4 B+Tree, can store 512^4 = 68 Billion keys! So to store MOST data, you can do a lookup in 4 seeks. However, top of tree can be stored in cache. So in practice, only look at the base of the tree, which takes 1-2 seeks. If a B+Tree is full, then every insert will cause a split (which is expensive). So if you plan on doing inserts, keep pages 50%-66% full, but it really depends on workload. Fill factor (the X% value) will try to guarantee that pages will be split when they are X% full, and merge pages when they are less than X% full. This avoids excessive rebalancing. =R*Trees= What are R*Trees for? Rectangles. This supports geospatial queries for geometric objects stored in 2-D (or arbitrary dimensionality). Query type: find multi-dimensional objects near a point. B+Trees don't do this well - they only index efficiently on one dimension. RTrees (we drop the *, and drop the + for the BTree for ease of typing) will store like a BTree does, but will store rectangles. For objects of any shape, represent each one by a bounding rectangle. Rectangles might overlap with each other. Root page contains "universe," which is a bounding box of all rectangles. Each top-level rectangle points to the rectangles fully contained inside of it. To search for an object in an area, recursively look in all rectangles that might contain it. No guarantee of logarithmic performance, since you have to follow multiple paths for each overlapping rectangle. RTree paper discusses algorithms (heuristic) for maximally separating rectangles into as few overlapping pages as possible. It is an "engineering approach," since no method is not provably better, but some are better on some data than others. Note: KD-Trees split universe into non-overlapping rectangles. It reduces multiple lookups at the expense of having empty rectangles for sparse datasets. Performance of various access methods: | Seq Scan | Insert | Delete | Lookup Heap File | All pages | O(1) | Seq. Scan | Seq. Scan Hash File | All Pages | O(1) | O(1) | O(1) (except for range queries) B+Tree | All Pages |log_B(pages)| log_B(pages) | log_B(pages) for both single values + ranges (NOTE: log_B(pages), once cache is considered, is usually 1-2. =Buffer Pool= What is the right cache eviction policy? It's actually better to handle different access methods and algorithms differently! DBMIN - a classic "Wisconsin"-style paper. Heuristics are proposed, and a simulation gives intuition into which is better. Heuristics: access pattern - how many pages to keep in cache for operation -------------------------------------------------------------- sequential scan of a heapfile - 1. won't reuse, so don't flood cache with pages you won't re-read random lookup (single index lookup) - 1. no point in keeping intermediate pages in index repeated sequential (inner of nested loops) - if have enough memory, store entire relation. Since you will be repeating it else, store 1 (don't waste other cached pages) or dedicate all of cache to it, and use MRU. looping hierarchical (BTree index lookups in nested loops) - evict pages that are lower in hierarchy (leaves) before you evict higher ones (root). How to implement these heuristics in a buffer manager? Can keep one buffer cache per file (per query). so if query is nested_loops(A, B), then store 1 page for the outer (A), and store |B| pages for the buffer for the inner (B), so that they don't evict each other. Also, keep a global buffer pool to allow for reuse of pages between queries. If concurrent queries won't fit in memory, then don't admit queries that will push over memory usage until others finish. This is good for improving throughput of queries. This is complicated, and in practice, people use a policy called Love/Hate. Essentially, keep 2 LRU queues, and when an access method requests a page, it knows whether it will need to reuse it (Love), or never touch it again (Hate).