9/23/2008
---------

Readings:

* Ramakrishnan and Gehrke, pages 273-289. If you are using another book, this is the introduction to the section on Storage and Indexing, which discusses different access methods and their relative performance.
* Ramakrishnan and Gehrke, pages 344-358. This is an in-depth discussion of the B+Tree data structure and its implementation. Most database books, as well as any algorithms text (such as CLR or Knuth), provide an equivalent discussion of B+Trees.
* "The R*-Tree: An Efficient and Robust Access Method for Points and Rectangles." Beckmann et al., in The Red Book.

Over the last few lectures, we've talked about the query processing layer (RDS). In the next few lectures, we'll talk about the performance, cost, and implementation of a database, including access methods and indices.

=Cost Model=

Cost considerations: CPU + disk I/O

CPU
- Say the CPU runs at 1 GHz and actually does 1 instruction/cycle - roughly 1 billion instructions/sec.
- Ballpark of 1000 instructions/tuple to implement a selection operator. That cost includes going to the right offset in the page, finding the correct field, testing some predicate on it, etc.
- This means we can process about 1 million tuples/second.

Disk I/O
- A disk is a circular platter with data stored in concentric tracks.
- An arm holding the disk head moves to the appropriate track to read a piece of data.
- Time to get to a piece of data (seek time) = time to move the disk head to the track + time for the platter to spin to the right location. (Approx. 10 ms today; spin rates vary from 3600 RPM to 15000 RPM.)
- Once the head is in place, read bandwidth varies from 20-100 MB/s depending on spin rate.
- Say 100 MB/s. Then if a tuple is 100 bytes, we can read 1 million tuples/second sequentially.

So sequential disk reads are approximately as fast as the CPU. When people say that disk is slow, they mean that seeks are slow: if we seeked to every tuple instead of reading sequentially, we'd only be able to read about 100 tuples/second (at 10 ms per seek).

Reasons DBs are not ALWAYS disk-bound:
- Lots of RAM avoids seeks.
- The fastest TPC-H benchmark machine has ~13,000 disks on it. Put "shards" of data on each disk, and in 10 ms you can potentially seek to 13,000 locations instead of just 1.

Common thing to do: balance the number of disks against the speed of the CPU.

Flash memory has a nice property: it is persistent storage, but it can do up to 10,000 seeks/second instead of ~100.

Since transfer time and CPU time are approximately equivalent, we'll ignore everything but I/O - specifically seek time + transfer time - in our cost models.

So back to our employees/kids example:

  SELECT *
  FROM emp, kids, dept
  WHERE emp.did = dept.id
    AND emp.id = kids.eid
    AND emp.sal > 10k;

One possible query plan:

            join
           /    \
        join    kids
       /    \
    dept    emp

Assumptions:
- 100 tuples/page
- 10 pages of RAM
- 10 KB/page
- 10 ms seek
- 100 MB/s transfer speed
- |dept| = 100 tuples (1 page), |emp| = 10k tuples (100 pages), |kids| = 30k tuples (300 pages)

Look at a nested loops join for join(dept, emp). Suppose 1000 employees match salary > 10k. Then, since each employee is in exactly 1 department, the result of the join is 1000 tuples.

If dept is the outer relation in the nested loop, you do 1 scan of dept and 100 scans of emp (one scan of emp per dept tuple). Most of the time is spent on the 1 MB scans of emp, each of which takes 10 ms of transfer + 10 ms of seek. That's 20 ms/scan, times 100 scans = about 2 seconds. The scan of dept (1 scan of 1 page) takes a negligible amount of time, so forget it for now.

If emp is the outer relation, we have to probe dept 1000 times (once per qualifying employee). But dept is 1 page, so it can fit in RAM the whole time. So it's 10 ms to seek to emp, 10 ms to scan it, and 10 ms to seek to dept (almost no time to scan it - it's 1 page). That takes ~30 ms - way faster! Getting things to stay in RAM is extremely valuable.
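To make the arithmetic concrete, here is a small back-of-the-envelope sketch (in Python) of the two costs just computed. It is only the example's arithmetic in code form, not real optimizer code; the helper name and structure are mine, while the constants are the assumptions stated above (10 ms seek, 100 MB/s transfer, 10 KB pages, dept = 1 page, emp = 100 pages, 100 dept tuples, 1000 qualifying employees).

  SEEK_MS = 10.0
  PAGE_KB = 10
  TRANSFER_MS_PER_PAGE = PAGE_KB / 100.0   # 100 MB/s == 100 KB/ms, so ~0.1 ms per 10 KB page

  def scan_cost_ms(num_pages, num_scans=1):
      # each scan pays one seek plus a sequential transfer of every page
      return num_scans * (SEEK_MS + num_pages * TRANSFER_MS_PER_PAGE)

  DEPT_PAGES, EMP_PAGES = 1, 100

  # dept as the outer: 1 scan of dept, plus one full scan of emp per dept tuple (100 tuples)
  dept_outer = scan_cost_ms(DEPT_PAGES) + scan_cost_ms(EMP_PAGES, num_scans=100)

  # emp as the outer: 1 scan of emp; dept (1 page) is read once and then stays in RAM
  # for the remaining 999 probes
  emp_outer = scan_cost_ms(EMP_PAGES) + scan_cost_ms(DEPT_PAGES)

  print(f"dept outer: ~{dept_outer:.0f} ms")   # ~2010 ms, i.e. about 2 seconds
  print(f"emp outer:  ~{emp_outer:.0f} ms")    # ~30 ms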
=Buffer Pool=

The buffer pool is a cache of pages, plus a mapping from page id to where the page lives in RAM if it has already been loaded.

Table for the BufferPool:

  pgid | ptr
  -----+------
     1 | 0x01
     2 | ...

Since it's a cache, it's important to know how big it is and what kind of eviction policy it has.

First off - why not just depend on the OS disk cache?
- Say the OS uses an LRU eviction policy, and the relation scanned in the inner loop of a join is 1 page larger than the buffer pool. Then in steady state we end up replacing a page on every access. If the policy were MRU, we could have kept all but one page in RAM.
- Later on, we'll talk about how to add intelligence to the DB BufferPool so it can take advantage of what it knows about access patterns to make a better eviction policy.

=Access Methods=

==Heap files==

A heap file is just an array of pages on disk, each of which is an array of tuples. However, we can add some intelligence for fast storage.

Questions:

1) How do we keep track of free space for adding new tuples?
   - Keep a directory of pages with free space.
   - It can be a linked list, or a bitmap marking which pages have free space. Instead of a bit per page, you can store an integer giving how many bytes are free on each page.

2) Page layout (what does the data on a page look like?)
   - If you have fixed-length records, keep an array of fixed-length tuple slots on each page, and put a bitmap in a header at the top of the page saying which slots actually contain a tuple.
   - If you have variable-length records, each header slot should contain an [optional: offset], a length, and a free bit.

3) Tuple layout (what does an encoded tuple look like?)
   - If you have variable-length tuples, put all fixed-length fields at the beginning, put all variable-length fields after them, and keep a header per tuple with the length of each variable-length field. Maybe also keep a bit per field specifying whether it is NULL, since SQL requires NULL to be handled specially.
   - If you store an image, or any other item larger than a page, just store a pointer to it on disk. Such a value is often called a BLOB (binary large object).

==Hash files==

A hash file is a hash table stored on disk. It maps from a value to a record location in a heap file (or to the record directly).

Simplest thing to do:
- h(value) -> {1 ... k} (one of k buckets)
- Make buckets a fixed size (1 page).
- If a bucket grows larger than 1 page, have each page point to an overflow page. (A minimal lookup/insert sketch with overflow chains appears at the end of these notes.)
  - This leads to long overflow chains if you add lots of tuples after building the table.
  - To get rid of the overflows by adding more buckets, you'd have to rehash EVERY record, even those on non-overflowed pages. We'll talk about ways to deal with this later (a data structure called extensible hash tables).

==B+Trees==

Next time.

==R-Trees==

Next time.
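Finally, as referenced in the Hash files section above, here is a minimal in-memory sketch (in Python) of a static hash file with one-page buckets and overflow chains. The Page class, PAGE_CAPACITY, bucket count, and record ids are made-up choices for illustration, not the lecture's or any real system's implementation; in a real hash file each Page would live on disk, so every overflow page walked during a lookup costs an extra I/O - which is why long chains hurt.

  PAGE_CAPACITY = 4          # entries per page; a real page would hold many more
  NUM_BUCKETS = 8            # fixed number of buckets (static hashing)

  class Page:
      def __init__(self):
          self.entries = []          # list of (key, rid) pairs on this page
          self.overflow = None       # pointer to the next overflow page, if any

  buckets = [Page() for _ in range(NUM_BUCKETS)]

  def insert(key, rid):
      # hash to a bucket, then walk the overflow chain until a page has room
      page = buckets[hash(key) % NUM_BUCKETS]
      while len(page.entries) >= PAGE_CAPACITY:
          if page.overflow is None:
              page.overflow = Page()
          page = page.overflow
      page.entries.append((key, rid))

  def lookup(key):
      # return all record ids for key, reading every page in the chain
      page, rids = buckets[hash(key) % NUM_BUCKETS], []
      while page is not None:
          rids += [r for (k, r) in page.entries if k == key]
          page = page.overflow
      return rids

  insert("alice", rid=7)
  insert("bob", rid=12)
  print(lookup("alice"))   # -> [7]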