11/13/2008

=Data Integration=

- So far the mantra has been: make a big machine out of jelly beans - you have one DBA and one schema.

When Mike worked at Informix, there were __58__ HR systems, one for each country, with different compensation rules per country (lunch, overtime, etc.). The CEO wants to know how many employees are in the company - you have to integrate all 58 of those systems:
- some were pencil and paper
- Excel
- DBMS
- SAP

Orbitz - ~40 airline systems for different parts of the world, integrated. Note: many countries require that HR systems live inside the country, and Orbitz can't ask the airline systems to host their machines. So integrating separate databases is a MUST.
- lots of schemas
- one DBA per site

This can be thought of as a cross-enterprise endeavour.

=Solution 1=

The application issues SQL against a global schema. A coordinator parses, optimizes, and supervises queries on each of K workers, each speaking with one of K databases. One DB might be SQL Server; another might be at Google, using BigTable.
- The K workers must run wrappers to convert to SQL (or some other central language).

Example: EMP(name, dept, pet, hobbies, salary)

France: ("Francois Bancillon", 123, "je ne sais pas, une chienne", "piste", 4000 euros)
USA:    ("Sam Madden", "shoe", "woof the cockapoodle", "ski", $12000)

How do we map these?
- name is easy
- dept: 123 vs. "shoe"?
- pet requires language translation ("je ne sais pas, une chienne" is "I don't know, a female dog")
- hobbies: mapping "piste" to "ski" needs an ontology!
- salary: the US figure is monthly, before taxes, in dollars; the French figure includes a lunch allowance, is after taxes, and is in euros

The CEO might ask: what is the average company salary?
- Integration must be semantic, not syntactic.

It's _somebody's_ job to take care of the translation, and it's no easier or harder no matter where you perform it (DB, application, etc.). Mike says -> this is almost arbitrarily hard.

Does shredding the data into RDF or XML help us? No - it merges the data into a common syntax, but doesn't solve the semantic problem.

Does a schema description (metadata of the metadata) help? No - you are integrating data from already-built databases, which aren't using a common schema description language.

Lots of suggestions came up to fundamentally change DB systems to make this solution work. Mike responded that you don't always have source code for the systems, so schema translation must live OUTSIDE of the databases. So put a schema translator on each worker, which translates that worker's output - a sketch follows below.
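A minimal sketch of what one such per-worker wrapper might look like, in Python. Every name and constant here (EUR_TO_USD, LUNCH_ALLOWANCE_EUR, the dept and hobby maps) is a made-up assumption for illustration, not something from the lecture. It handles the mechanical conversions (currency, allowance, tax basis) and punts the language/ontology mapping to lookup tables - which is exactly the part that stays hard.

 # Hypothetical wrapper for the French HR system, translating local EMP rows
 # into a global schema (name, dept, pet, hobbies, monthly_salary_usd_pretax).
 # All constants and mappings below are assumed values for illustration.
 
 EUR_TO_USD = 1.3           # assumed exchange rate
 LUNCH_ALLOWANCE_EUR = 200  # assumed monthly lunch allowance folded into French pay
 FR_TAX_GROSSUP = 1.4       # assumed after-tax -> pre-tax factor
 
 # Ontology / translation tables -- the genuinely semantic part.
 HOBBY_ONTOLOGY = {"piste": "ski"}
 DEPT_MAP = {123: "shoe"}   # numeric French dept codes -> global dept names
 
 def translate_fr_to_en(text):
     # Placeholder: a real wrapper would call a translation service here.
     return {"je ne sais pas, une chienne": "don't know, a female dog"}.get(text, text)
 
 def france_to_global(row):
     name, dept, pet, hobbies, salary_eur = row
     monthly_pretax_usd = (salary_eur - LUNCH_ALLOWANCE_EUR) * FR_TAX_GROSSUP * EUR_TO_USD
     return (name,
             DEPT_MAP.get(dept, dept),
             translate_fr_to_en(pet),               # needs a language translator
             HOBBY_ONTOLOGY.get(hobbies, hobbies),  # needs an ontology
             monthly_pretax_usd)
 
 if __name__ == "__main__":
     print(france_to_global(("Francois Bancillon", 123,
                             "je ne sais pas, une chienne", "piste", 4000)))

Even in this toy version, every constant encodes a business rule that somebody has to know and maintain; multiply by 58 schemas and it becomes clear why the semantic work dominates the cost.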
=Major Issues=

1) Semantic heterogeneity - the formal name for the fact that different schemas mean different things semantically. This is the hardest problem.

2) Two-phase commit. You want to book a trip on three airlines for different legs of the flight:

 BEGIN
   United reservation
   American reservation
   Continental reservation
 PREPARE
 COMMIT

None of the companies would be willing to do this for Orbitz. Each reservation locks pages in "my" DB. What if Orbitz (the coordinator) gets to the prepare stage and then goes down? Or what if it is malicious and locks the DBs on purpose? No DBA will participate in another enterprise's distributed transaction.

The solution they came up with is for each airline to provide a feed, updated every N hours, with seats and prices. Orbitz keeps its own (slightly stale) copy of the data, places individual orders with each airline, and if one of them fails, contractually cancels all of them.

=Web-Oriented Data Integration=
- loosey-goosey semantics
- search engines just give you results, but don't interpret them!
- shallow web (stuff search engines can see and index)
- deep web (data sitting behind user interfaces) - 500x the size of the shallow web, more than 1 million deep web sites

The deep web contains things like the address of your child in college (behind a form where you search for their name), or the value of your house (sitting behind a form asking for the address). Search engines don't (yet) know how to fill out all of these forms.

Mike (and friends) made a tool to do this:
- Define a bunch of functions for Postgres which take arguments and return the result of filling out some form on the web.
- Store a hierarchy of concepts, mapping each concept to the functions that query the right websites to answer the desired question.
- Run those functions.

Example query:
 who: person name
 what: address
 know: middlebury student

In the concept hierarchy will be "middlebury student". Some function will map name to address. Run the function, and bring the result back into Postgres!

Google does it differently - every time it finds a form on the web, it tries to fill it in to get data from that site, indexing the deep web by bringing it into the shallow web. See work by Alon Halevy in VLDB 2007, as well as CIDR 2006.

=Streaming DBs=

Totally different topic. Reading:
* Daniel J. Abadi, Don Carney, Ugur Cetintemel, Mitch Cherniack, Christian Convey, Sangdon Lee, Michael Stonebraker, Nesime Tatbul, and Stan Zdonik. Aurora: a new model and architecture for data stream management. VLDB Journal 12(2): 120-139, August 2003. Read Sections 1-6. Note that this IS NOT the Aurora paper in the Red Book. http://www.cs.brown.edu/research/aurora/vldb03_journal.pdf

Our usual mantra is: store the data, then process it later. Here, streaming data comes in and you have to make a decision in 10 ms or less. There is no time to store the data; you process it on the fly within some time window. The scope of an aggregate is a time window, not a table - see the sketch below.

The Aurora project at MIT and the STREAM project at Stanford have developed this, but, like information integration, there is lots of work still to be done.
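To make the window idea concrete, here is a minimal Python sketch of a sliding-window average in the streaming style. The stream of (timestamp, value) tuples and the 10-second window are assumptions made up for illustration; the point is only that the aggregate's scope is the window, that an answer is produced as each tuple arrives, and that nothing is stored beyond the window.

 # Toy sliding-window average in the streaming-DB style: tuples arrive one at a
 # time, only the last WINDOW seconds are retained, and each arrival produces an
 # answer immediately -- nothing is ever written to a base table.
 from collections import deque
 
 WINDOW = 10.0  # window size in seconds (assumed for this example)
 
 class WindowedAvg:
     def __init__(self, window=WINDOW):
         self.window = window
         self.buf = deque()   # (timestamp, value) pairs currently inside the window
         self.total = 0.0
 
     def insert(self, ts, value):
         self.buf.append((ts, value))
         self.total += value
         # Evict tuples that have fallen out of the time window.
         while self.buf and self.buf[0][0] < ts - self.window:
             old_ts, old_val = self.buf.popleft()
             self.total -= old_val
         return self.total / len(self.buf)
 
 if __name__ == "__main__":
     agg = WindowedAvg()
     for ts, v in [(0, 5.0), (3, 7.0), (9, 6.0), (14, 2.0), (25, 4.0)]:
         print(ts, agg.insert(ts, v))

Aurora builds whole dataflow networks of windowed operators like this one; see Sections 1-6 of the paper above for the full model.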