9/23/2008
---------

Readings:

* Ramakrishnan and Gehrke, pages 273-289. If you are using another book, this is the introduction to the section on Storage and Indexing, which discusses different access methods and their relative performance.
* Ramakrishnan and Gehrke, pages 344-358. This is an in-depth discussion of the B+Tree data structure and its implementation. Most database books, as well as any algorithms text (such as CLR or Knuth), provide an equivalent discussion of B+Trees.
* "The R*-Tree: An Efficient and Robust Access Method for Points and Rectangles." Beckmann et al., in The Red Book.

Over the last few lectures, we've talked about the query processing layer (RDS). In the next few lectures, we'll talk about the performance, cost, and implementation of a database, including access methods and indices.

=Cost Model=

Cost considerations: CPU + disk I/O

CPU
- Say the CPU runs at 1 GHz and actually does 1 instruction/cycle - roughly 1 billion instructions/sec.
- Ballpark of 1000 instructions/tuple to implement a selection operator. That cost includes going to the right offset in the page, finding the correct field, testing some predicate on it, etc.
- This means we can process about 1 million tuples/second.

Disk I/O
- A disk is a circular platter with data stored in concentric tracks.
- An arm holding the disk head moves to the appropriate track to read a piece of data.
- Time to get to a piece of data (seek time) = time to move the disk head to the track + time for the platter to spin to the right location. (Approx. 10 ms today; spin rates vary from 3600 RPM to 15000 RPM.)
- Once the head is in place, read bandwidth varies from 20-100 MB/s depending on spin rate.
- Say 100 MB/s. Then if a tuple is 100 bytes, we can read 1 million tuples/second sequentially.

So sequential disk reads are approximately as fast as the CPU. When people say that disk is slow, they mean that seeks are slow: if we seeked to every tuple instead of reading sequentially, we'd only be able to read about 100 tuples/second (at 10 ms per seek).

Reasons DBs are not ALWAYS disk-bound:
- Lots of RAM avoids seeks.
- The fastest TPC-H benchmark machine has ~13,000 disks on it. Put "shards" of data on each disk, and in 10 ms you can potentially seek to 13,000 locations instead of just 1.

Common thing to do: balance the number of disks against the speed of the CPU.

Flash memory has a nice property: it is persistent storage, but it can do up to 10,000 seeks/second instead of ~100.

Since transfer time and CPU time are approximately equivalent, we'll ignore everything but I/O - specifically seek time + transfer time - in our cost models.

So back to our employees/kids example:

  SELECT *
  FROM emp, kids, dept
  WHERE emp.did = dept.id
    AND emp.id = kids.eid
    AND emp.sal > 10k;

One possible query plan:

            join
           /    \
        join    kids
       /    \
    dept    emp

Assumptions:
- 100 tuples/page
- 10 pages of RAM
- 10 KB/page
- 10 ms seek
- 100 MB/s transfer speed
- |dept| = 100 tuples (1 page), |emp| = 10k tuples (100 pages), |kids| = 30k tuples (300 pages)

Look at a nested loops join for join(dept, emp). Suppose 1000 employees match salary > 10k. Then, since each employee is in exactly 1 department, the result of the join is 1000 tuples.

If dept is the outer relation in the nested loop, you do 1 scan of dept and 100 scans of emp (one scan of emp per dept tuple). Most of the time is spent on the 1 MB scans of emp, each of which takes 10 ms of transfer + 10 ms of seek. That's 20 ms/scan, times 100 scans = about 2 seconds. The scan of dept (1 scan of 1 page) takes a negligible amount of time, so forget it for now.

If emp is the outer relation, we have to probe dept 1000 times (once per qualifying employee). But dept is 1 page, so it can fit in RAM the whole time. So it's 10 ms to seek to emp, 10 ms to scan it, and 10 ms to seek to dept (almost no time to scan it - it's 1 page). That takes ~30 ms - way faster! Getting things to stay in RAM is extremely valuable.
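To make the arithmetic concrete, here is a small back-of-the-envelope sketch (in Python) of the two costs just computed. It is only the example's arithmetic in code form, not real optimizer code; the helper name and structure are mine, while the constants are the assumptions stated above (10 ms seek, 100 MB/s transfer, 10 KB pages, dept = 1 page, emp = 100 pages, 100 dept tuples, 1000 qualifying employees).

  SEEK_MS = 10.0
  PAGE_KB = 10
  TRANSFER_MS_PER_PAGE = PAGE_KB / 100.0   # 100 MB/s == 100 KB/ms, so ~0.1 ms per 10 KB page

  def scan_cost_ms(num_pages, num_scans=1):
      # each scan pays one seek plus a sequential transfer of every page
      return num_scans * (SEEK_MS + num_pages * TRANSFER_MS_PER_PAGE)

  DEPT_PAGES, EMP_PAGES = 1, 100

  # dept as the outer: 1 scan of dept, plus one full scan of emp per dept tuple (100 tuples)
  dept_outer = scan_cost_ms(DEPT_PAGES) + scan_cost_ms(EMP_PAGES, num_scans=100)

  # emp as the outer: 1 scan of emp; dept (1 page) is read once and then stays in RAM
  # for the remaining 999 probes
  emp_outer = scan_cost_ms(EMP_PAGES) + scan_cost_ms(DEPT_PAGES)

  print(f"dept outer: ~{dept_outer:.0f} ms")   # ~2010 ms, i.e. about 2 seconds
  print(f"emp outer:  ~{emp_outer:.0f} ms")    # ~30 ms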
=Buffer Pool=

The buffer pool is a cache of pages, plus a mapping from page id to where the page lives in RAM if it has already been loaded.

Table for the BufferPool:

  pgid | ptr
  -----+------
     1 | 0x01
     2 | ...

Since it's a cache, it's important to know how big it is and what kind of eviction policy it has.

First off - why not just depend on the OS disk cache?
- Say the OS uses an LRU eviction policy, and the relation scanned in the inner loop of a join is 1 page larger than the buffer pool. Then in steady state we end up replacing a page on every access. If the policy were MRU, we could have kept all but one page in RAM.
- Later on, we'll talk about how to add intelligence to the DB BufferPool so it can take advantage of what it knows about access patterns to make a better eviction policy.

=Access Methods=

==Heap files==

A heap file is just an array of pages on disk, each of which is an array of tuples. However, we can add some intelligence for fast storage.

Questions:

1) How do we keep track of free space for adding new tuples?
   - Keep a directory of pages with free space.
   - It can be a linked list, or a bitmap marking which pages have free space. Instead of a bit per page, you can store an integer giving how many bytes are free on each page.

2) Page layout (what does the data on a page look like?)
   - If you have fixed-length records, keep an array of fixed-length tuple slots on each page, and put a bitmap in a header at the top of the page saying which slots actually contain a tuple.
   - If you have variable-length records, each header slot should contain an [optional: offset], a length, and a free bit.

3) Tuple layout (what does an encoded tuple look like?)
   - If you have variable-length tuples, put all fixed-length fields at the beginning, put all variable-length fields after them, and keep a header per tuple with the length of each variable-length field. Maybe also keep a bit per field specifying whether it is NULL, since SQL requires NULL to be handled specially.
   - If you store an image, or any other item larger than a page, just store a pointer to it on disk. Such a value is often called a BLOB (binary large object).

==Hash files==

A hash file is a hash table stored on disk. It maps from a value to a record location in a heap file (or to the record directly).

Simplest thing to do:
- h(value) -> {1 ... k} (one of k buckets)
- Make buckets a fixed size (1 page).
- If a bucket grows larger than 1 page, have each page point to an overflow page. (A minimal lookup/insert sketch with overflow chains appears at the end of these notes.)
  - This leads to long overflow chains if you add lots of tuples after building the table.
  - To get rid of the overflows by adding more buckets, you'd have to rehash EVERY record, even those on non-overflowed pages. We'll talk about ways to deal with this later (a data structure called extensible hash tables).

==B+Trees==

Next time.

==R-Trees==

Next time.
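Finally, as referenced in the Hash files section above, here is a minimal in-memory sketch (in Python) of a static hash file with one-page buckets and overflow chains. The Page class, PAGE_CAPACITY, bucket count, and record ids are made-up choices for illustration, not the lecture's or any real system's implementation; in a real hash file each Page would live on disk, so every overflow page walked during a lookup costs an extra I/O - which is why long chains hurt.

  PAGE_CAPACITY = 4          # entries per page; a real page would hold many more
  NUM_BUCKETS = 8            # fixed number of buckets (static hashing)

  class Page:
      def __init__(self):
          self.entries = []          # list of (key, rid) pairs on this page
          self.overflow = None       # pointer to the next overflow page, if any

  buckets = [Page() for _ in range(NUM_BUCKETS)]

  def insert(key, rid):
      # hash to a bucket, then walk the overflow chain until a page has room
      page = buckets[hash(key) % NUM_BUCKETS]
      while len(page.entries) >= PAGE_CAPACITY:
          if page.overflow is None:
              page.overflow = Page()
          page = page.overflow
      page.entries.append((key, rid))

  def lookup(key):
      # return all record ids for key, reading every page in the chain
      page, rids = buckets[hash(key) % NUM_BUCKETS], []
      while page is not None:
          rids += [r for (k, r) in page.entries if k == key]
          page = page.overflow
      return rids

  insert("alice", rid=7)
  insert("bob", rid=12)
  print(lookup("alice"))   # -> [7]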