9/16/2008

Readings (for this lecture and the next):
# Joseph Hellerstein and Michael Stonebraker. The Anatomy of a Database System. In "Readings in Database Systems" (aka "The Red Book"). Focus on Sections 1-4, though you should also read Sections 5.1 and 5.2 and skim Section 6.
# M.M. Astrahan et al. System R: Relational Approach to Database Management. ACM TODS 1(2), 1976. Pages 97-137. Read up to page 122; you may also skip the "Optimizer" section, pages 110-114.

Two topics:
1) Relational database design
2) Underlying details of relational database systems

=How to structure your tables?=

Two approaches:
* normalization (wrong, according to Mike)
* E-R models (right, according to Mike)

Codd's original example:
 Supply(sno [supplier #], pno [part #], psize, quantity, discount)

Functional dependencies specify which items in a table completely determine other items:
 quantity -> discount   (discount is functionally dependent on quantity)
 pno -> psize
 (sno, pno) -> psize, quantity, discount

Problems with this design:
- redundancy: repeated suppliers lead to a repeated discount for the same quantity
- the application programmer has to remember to update all of the copies on an update
- some discounts might be missing: if there is no supplier of a part, you can't mention its discount rate

Mathematically, there is a transitive dependency: (sno, pno) -> quantity -> discount.

==Normalization==

Better design: put (quantity, discount) in its own table, leaving (sno, pno, psize, quantity).

Still a problem: psize gets repeated when multiple suppliers sell the same part.

Mathematically: every functional dependency should have a key on its left-hand side; an attribute that is not fully functionally dependent on a key is bad.

Solution? Normalization always says the same thing - create a new table! Turn (sno, pno, psize, quantity) into (pno, psize) and (sno, pno, quantity).

Theoreticians had a field day with this - many normal forms were created. Problems:
- People didn't understand how the normal forms worked in practice; something more intuitive was needed.
- Normalization tells you how to improve a bad design, but not how to create a good one from the start.
- Performance issues: joins are expensive, so normalization takes a design with fewer joins and adds joins to the picture.

==Entity-Relationship (E-R) models==

Entities:
 animals(name, age, species, feed time)
 cages(id, size)
 zookeepers(id, name, address)

Relationships:
 cares-for
 lives-in

Diagram:
 -------------          ----------           --------
 | zookeeper |--------< | animal | >---------| cage |
 -------------          ----------           --------

">" means the "many" side: many >--------- one, one ---------< many, and many >----------< many.

Wong & Katz (1979) published an algorithm to turn any E-R diagram into 3NF tables, and all commercial products use this approach. E-R models also support specialization (a zookeeper is a type of person; an underage driver inherits the license of an adult).

Rough algorithm (sketched in SQL below):
- Put every entity in its own table.
- For a many-one relationship, put a reference to the "one" side in the "many" side's table.
- Put every many-many relationship in its own table.
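A minimal SQL sketch of the mapping for the zoo example above. The column names with underscores, the types, and the keys are my own guesses, not from the lecture; lives-in is taken as many-one (many animals per cage), and cares-for is treated as many-many here purely to illustrate the last rule, even though the diagram can be read as one zookeeper to many animals.

 -- Each entity gets its own table.
 CREATE TABLE zookeepers (
     id      INTEGER PRIMARY KEY,
     name    VARCHAR(100),
     address VARCHAR(200)
 );

 CREATE TABLE cages (
     id   INTEGER PRIMARY KEY,
     size INTEGER
 );

 -- Many-one (animal lives-in cage): the "many" side holds a reference to the "one" side.
 CREATE TABLE animals (
     name      VARCHAR(100) PRIMARY KEY,   -- assumes animal names are unique
     age       INTEGER,
     species   VARCHAR(100),
     feed_time TIME,
     cage_id   INTEGER REFERENCES cages(id)
 );

 -- Many-many (zookeeper cares-for animal, assumed): gets its own table.
 CREATE TABLE cares_for (
     zookeeper_id INTEGER REFERENCES zookeepers(id),
     animal_name  VARCHAR(100) REFERENCES animals(name),
     PRIMARY KEY (zookeeper_id, animal_name)
 );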
=Relational Database Systems=

Some history. Two relational databases were built in the 1970s:
- System R: the IBM Research prototype, later released as the product DB2. Jim Gray and quite a few other famous folks worked on it.
- Ingres: Mike's system. Its major contribution was the people who worked on it, who went on to commercialize Tandem, RTI, Britton Lee, Sybase, and MS SQL Server.

Before IBM turned System R into DB2, Larry Ellison read the System R papers and released Oracle ahead of DB2.

Today's market is DB2, Oracle, Sybase, and MS SQL Server - a $15B/year industry, with Oracle at roughly 60% of it.

Non-commercial versions are coming closer to being feature-complete, and more scalable/distributable:
- MySQL
- Postgres: also by Mike, picked up by a few Russians and turned into a strong open-source tool. Netezza, Greenplum, and Vertica were commercialized from parts of it.

Major components of a DBMS:
- Front end: admission control, connection management.
- Relational Data System (RDS): takes in queries, builds query plans, and starts executing them (parser, rewriter, planner, optimizer, executor).
- Relational Storage System (RSS): access methods, data, transactional machinery.
- Other pieces: memory management (buffer pool), disk space management, data replication, administrative utilities.

==Front End==

Authorization: access control is a list of people per table. Views can have different access privileges than their base tables.

==Relational Data System==

0) Parser: takes a SQL query and converts it into a parse tree.

1) Rewriter: takes a query and rewrites it into another query.

a) Used for views (which provide logical data independence, access control, etc.):

 CREATE VIEW sals AS
   (SELECT dept, avg(sal) AS sal FROM emp GROUP BY dept)

 Query: SELECT sal FROM sals WHERE dept='eecs';

The rewriter takes the query used to build sals and places it into the FROM clause (a sketch of this rewrite appears at the end of these notes).

b) Flatten nested queries when possible:

 SELECT avg(sal) FROM emp WHERE dept='eecs' GROUP BY dept;

This is important because we want the greatest freedom to reorder filters, etc. In this example, flattening the query lets us apply the filter on the 'eecs' department before computing the aggregate avg.

Flattening is not always easy:

 view:  sals (SELECT DISTINCT dept, sal FROM emp)
 query: SELECT avg(sal) FROM sals ...;
 bad rewrite: SELECT DISTINCT avg(sal) FROM emp ...

The bad rewrite applies the DISTINCT in the wrong place:

 emp:           sals view:
 dept | sal     dept | sal
 eecs | 10k     eecs | 10k
 eecs | 10k     mech | 50k
 mech | 50k

The average over the view is (10k + 50k)/2 = 30k, but the bad rewrite averages over the raw emp rows, (10k + 10k + 50k)/3 ≈ 23k; a misplaced DISTINCT can also remove duplicate result rows that the original query would have kept. So there is a set of rules that say when you can and can't flatten in the presence of duplicates.

Reference: "Query Rewrite Rules in the IBM DB2 Universal DBMS".

Related notions:
- Temporary tables: not updated with respect to the original tables.
- Views: some vendors will update base tables when a view is updated.
- Materialized views: physically store the view's result for efficiency.

c) Redundant expression elimination: x > 10 AND x > 20 becomes x > 20.

d) Expression introduction: translate filters on one table into filters on another for optimization purposes.

=Process Models=

Aside: process models - why are they important? Consider web server -> application server -> DBMS, with hundreds of concurrent requests. The process model says how many processes/threads to allocate for handling queries.

1) A single thread of execution. This is bad, because short queries are forced to wait behind long ones.
2) One thread per connection. Not much fun, because of the overhead of creating threads.
3) A pool of worker threads shared across connections - pretty nice.

Postgres is process-per-connection. This is because good threading on Unix wasn't around when it was designed. It hasn't been redesigned to use threads because that would require rewriting the entire system (not trivial), and getting rid of the shared memory pool would also hurt. Thus, it's important to pay attention to the process model and get it right early on.
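Returning to the Rewriter's view example above: a rough sketch of the two rewrite steps, assuming the sals view and emp table defined in that section. The exact form a real rewriter produces internally will differ; this only shows the idea.

 -- Original query over the view:
 SELECT sal FROM sals WHERE dept = 'eecs';

 -- Step 1: merge the view definition into the FROM clause:
 SELECT sal
 FROM (SELECT dept, avg(sal) AS sal FROM emp GROUP BY dept) AS sals
 WHERE dept = 'eecs';

 -- Step 2: flatten the nested query, pushing the filter below the aggregate:
 SELECT avg(sal) AS sal
 FROM emp
 WHERE dept = 'eecs'
 GROUP BY dept;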