DB History with Mike  9/9/2008
-------------------

Papers:
# Michael Stonebraker and Joseph Hellerstein. What Goes Around Comes Around. In "Readings in Database Systems" (aka the Red Book). Read Sections 1-4 (if you know something about XML, you may also enjoy reading Sections 10 and 11; they are classic Stonebraker).
# E.F. Codd. A relational model of data for large shared data banks. Communications of the ACM, 1970. Focus on Sections 1.3 and all of Section 2. 

Ancient History - 1970's
Modern History - 1980's, 1990's

70's debate:
Hierarchical - IBM/IMS, still has more data than any DB today (business data processing, like charles schwabb)
Network - Codasyl - Committee on Data Storage ??? - Implemented as IDS
Relational - System-R (language: SQL), or Ingres (language: Quel)

=Ancient History=
==Hierarchical Model==
Data Model:
-Segment types (objects)
-Segment instances (records)
-Segment types must be a tree

Possible schemas for zoo:
a) Keepers->Animals->Cages works, but repeats cage information: multiple animals in same cage, so repeat cage data
b) Keepers->Cages->Animals works, but repeats cage data for keepers in multiple keepers
c) Keepers->Animals<-Cages doesn't work - it's a graph not a tree, but doesn't require repeated data

In DL-1 - hierarchical segment key describes path down tree to base record.
 - HSK keys are: {(parent key)}*(key of segment)
 - GU command starts you at parent segment
 - GH next in HSK order (doesn't stop at end of your subtree!   Just keeps going through keys until end of DB)
 - GHP - " " " (with restrictions)

Sample query: Find cages that Sam entered
 GU Keeper (name = 'Sam')    // this finds Sam
 until no more
   GHP Cage                  // loops through all cages in Sam's subtree.
                             // Don't use GH here - it will keep reading keys after the subtree!

How did IMS store data?  Index parent segments, but not subtrees.  Performance depends on how you organize data on disk.

Problems w/ IMS:
1) Data redundancy in tree
2) No physical data independence
  - If a storage segment is sequential, you can't insert into it.  Depending on how DBA organizes
    storage structures, you may or may not be able to use certain programs on it.
3) Painful low-level language
4) No logical data independence.  What if pandas have two feeding times per day.  Need to change schema to support multiple feeding times per animal, but have no idea if programs already written for it will still work.  Schemas change all the time for many reasons.  In DL/1, it's hard to know scope of change in programs you wrote.  Student points out that this might be impossible!

==CODASYL==
Let's move from hierarchical to graph-structured one.
Keepers->Animals<-Cages is acceptable now!

Data Model
-Record Types
-Record Instances
-Sets (1:N relationsihps between record types)

Issue: low-level record-oriented interface
Sample query: Find cages that Sam entered
  Find Keeper (name = 'Sam')
  until no more
    Find next Animal in cares_for
      Find parent in lives_in
Annoying - have to loop through space of graphs, keep track of where you are, etc.

1973 Turing Award Lecture was by Charles Bachman, who defends this programming model as great!

==Relational Model==
Presented by Codd
Went for simplicity - data model is tables.

High-level language
  Data Language: alpha (1st order predicate calculus - For all x there exists a y such that...)
  Relational Algebra

Problem with Alpha/Algebra - geared toward mathematicians, not programmers.  So Codd has a great idea, but may not be a good language designer.  

So SQL from System R (IBM) and QUEL from Ingres (Mike) were better, and SQL won out because IBM had more weight!

So have Animals, Cages, Keepers table.  Maybe have lives_in, cares_for table for many-to-many relationships.
Sample query: Find cages that Sam entered
  SELECT cid
  FROM animals
  WHERE keeper_name = 'Sam'

Ideas:
-Offers complete physical data independence.  How you store tables doesn't affect anything!
-Logical data independence - kind of - can't fundamentally change schema without having to change program.  Views help you abstract away from schema into logical queries.
-Can represent hierarchical or networked data

Note: result of a query is another table.  Can have views defined to make queries smaller.

Debate between Codd and Bachman - Codd wanted data independence, Bachman said it can't be implemented efficiently.
Turns out SQL can be made efficient.  Won debate in 80's.  CODASYL died out.  IMS picked up enough steam until System R got efficient in 80's, but by that point, so much depends on it for legacy purposes.

==What to learn from Ancient History==
High-level languages will win.  Don't make people use low-level language.
Schema WILL change.  Design something to somewhat handle this.
IMS still faster, since it can cluster joined data together, and DL/1 is a lower-level language with less overhead.  Still, any DBA would move to relational model for data independence and easier code.

=Modern History=
==Object-oriented DBs (late 80's)==
In Java and C++ we use objects - why not persist them that way?  "Persistent C++"
Why did this not take hold?
- Business Data Processing people don't use Java/C++.  They use high-level languages - no objects!
- CAD was a big customer - simulators, many other programs operate on objects.  Objects are stored on disk, and pointers are swizzled/de-swizzled to disk.  Why didn't engineers want to accept this?  Simulations were 1M lines of code.  Swizzling logic was 500 lines.  How much did OODBMS vendors expect to charge for this?
- Mostly died out!

==Object-relational DBs (Postgres by Mike Stonebraker)==
OODBMS was a good idea, but not worth the money.
Relational model gives you int/float/datetime/string.  Not good enough for business logic.

So Postgres says - combine the int/float/datetime/string into user-defined types, with user-defined functions operating on them.