11/13/2008

=Data Integration=

- So far the mantra has been: make a big machine out of jelly beans - you have one DBA and one schema.

When Mike worked at Informix, there were __58__ HR systems, one for each country, with different compensation rules per country (lunch, overtime, etc.). The CEO wants to know how many employees are in the company - you have to integrate all 58 of those systems:
- some were pencil and paper
- Excel
- DBMS
- SAP

Orbitz - ~40 airline systems for different parts of the world, integrated. Note: many countries require that HR systems live inside the country, and Orbitz can't ask the airline systems to host their machines. So integrating separate databases is a MUST.
- lots of schemas
- one DBA per site

This can be thought of as a cross-enterprise endeavour.

=Solution 1=

The application issues SQL against a global schema. A coordinator parses, optimizes, and supervises queries on each of K workers, each speaking with one of K databases. One DB might be SQL Server; another might be at Google, using BigTable.
- The K workers must run wrappers to convert to SQL (or some other central language).

Example: EMP(name, dept, pet, hobbies, salary)

France: ("Francois Bancillon", 123, "je ne sais pas, une chienne", "piste", 4000 euros)
USA:    ("Sam Madden", "shoe", "woof the cockapoodle", "ski", $12000)

How do we map these?
- name is easy
- dept: 123 vs. "shoe"?
- pet requires language translation ("je ne sais pas, une chienne" is "I don't know, a female dog")
- hobbies: mapping "piste" to "ski" needs an ontology!
- salary: the US figure is monthly, before taxes, in dollars; the French figure includes a lunch allowance, is after taxes, and is in euros

The CEO might ask: what is the average company salary?
- Integration must be semantic, not syntactic.

It's _somebody's_ job to take care of the translation, and it's no easier or harder no matter where you perform it (DB, application, etc.). Mike says -> this is almost arbitrarily hard.

Does shredding the data into RDF or XML help us? No - it merges the data into a common syntax, but doesn't solve the semantic problem.

Does a schema description (metadata of the metadata) help? No - you are integrating data from already-built databases, which aren't using a common schema description language.

Lots of suggestions came up to fundamentally change DB systems to make this solution work. Mike responded that you don't always have source code for the systems, so schema translation must live OUTSIDE of the databases. So put a schema translator on each worker, which translates that worker's output - a sketch follows below.
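A minimal sketch of what one such per-worker wrapper might look like, in Python. Every name and constant here (EUR_TO_USD, LUNCH_ALLOWANCE_EUR, the dept and hobby maps) is a made-up assumption for illustration, not something from the lecture. It handles the mechanical conversions (currency, allowance, tax basis) and punts the language/ontology mapping to lookup tables - which is exactly the part that stays hard.

 # Hypothetical wrapper for the French HR system, translating local EMP rows
 # into a global schema (name, dept, pet, hobbies, monthly_salary_usd_pretax).
 # All constants and mappings below are assumed values for illustration.
 
 EUR_TO_USD = 1.3           # assumed exchange rate
 LUNCH_ALLOWANCE_EUR = 200  # assumed monthly lunch allowance folded into French pay
 FR_TAX_GROSSUP = 1.4       # assumed after-tax -> pre-tax factor
 
 # Ontology / translation tables -- the genuinely semantic part.
 HOBBY_ONTOLOGY = {"piste": "ski"}
 DEPT_MAP = {123: "shoe"}   # numeric French dept codes -> global dept names
 
 def translate_fr_to_en(text):
     # Placeholder: a real wrapper would call a translation service here.
     return {"je ne sais pas, une chienne": "don't know, a female dog"}.get(text, text)
 
 def france_to_global(row):
     name, dept, pet, hobbies, salary_eur = row
     monthly_pretax_usd = (salary_eur - LUNCH_ALLOWANCE_EUR) * FR_TAX_GROSSUP * EUR_TO_USD
     return (name,
             DEPT_MAP.get(dept, dept),
             translate_fr_to_en(pet),               # needs a language translator
             HOBBY_ONTOLOGY.get(hobbies, hobbies),  # needs an ontology
             monthly_pretax_usd)
 
 if __name__ == "__main__":
     print(france_to_global(("Francois Bancillon", 123,
                             "je ne sais pas, une chienne", "piste", 4000)))

Even in this toy version, every constant encodes a business rule that somebody has to know and maintain; multiply by 58 schemas and it becomes clear why the semantic work dominates the cost.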
=Major Issues=

1) Semantic heterogeneity - the formal name for the fact that different schemas mean different things semantically. This is the hardest problem.

2) Two-phase commit. You want to book a trip on three airlines for different legs of the flight:

 BEGIN
   United reservation
   American reservation
   Continental reservation
 PREPARE
 COMMIT

None of the companies would be willing to do this for Orbitz. Each reservation locks pages in "my" DB. What if Orbitz (the coordinator) gets to the prepare stage and then goes down? Or what if it is malicious and locks the DBs on purpose? No DBA will participate in another enterprise's distributed transaction.

The solution they came up with is for each airline to provide a feed, updated every N hours, with seats and prices. Orbitz keeps its own (slightly stale) copy of the data, places individual orders with each airline, and if one of them fails, contractually cancels all of them.

=Web-Oriented Data Integration=
- loosey-goosey semantics
- search engines just give you results, but don't interpret them!
- shallow web (stuff search engines can see and index)
- deep web (data sitting behind user interfaces) - 500x the size of the shallow web, more than 1 million deep web sites

The deep web contains things like the address of your child in college (behind a form where you search for their name), or the value of your house (sitting behind a form asking for the address). Search engines don't (yet) know how to fill out all of these forms.

Mike (and friends) made a tool to do this:
- Define a bunch of functions for Postgres which take arguments and return the result of filling out some form on the web.
- Store a hierarchy of concepts, mapping each concept to the functions that query the right websites to answer the desired question.
- Run those functions.

Example query:
 who: person name
 what: address
 know: middlebury student

In the concept hierarchy will be "middlebury student". Some function will map name to address. Run the function, and bring the result back into Postgres!

Google does it differently - every time it finds a form on the web, it tries to fill it in to get data from that site, indexing the deep web by bringing it into the shallow web. See work by Alon Halevy in VLDB 2007, as well as CIDR 2006.

=Streaming DBs=

Totally different topic. Reading:
* Daniel J. Abadi, Don Carney, Ugur Cetintemel, Mitch Cherniack, Christian Convey, Sangdon Lee, Michael Stonebraker, Nesime Tatbul, and Stan Zdonik. Aurora: a new model and architecture for data stream management. VLDB Journal 12(2): 120-139, August 2003. Read Sections 1-6. Note that this IS NOT the Aurora paper in the Red Book. http://www.cs.brown.edu/research/aurora/vldb03_journal.pdf

Our usual mantra is: store the data, then process it later. Here, streaming data comes in and you have to make a decision in 10 ms or less. There is no time to store the data; you process it on the fly within some time window. The scope of an aggregate is a time window, not a table - see the sketch below.

The Aurora project at MIT and the STREAM project at Stanford have developed this, but, like information integration, there is lots of work still to be done.
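To make the window idea concrete, here is a minimal Python sketch of a sliding-window average in the streaming style. The stream of (timestamp, value) tuples and the 10-second window are assumptions made up for illustration; the point is only that the aggregate's scope is the window, that an answer is produced as each tuple arrives, and that nothing is stored beyond the window.

 # Toy sliding-window average in the streaming-DB style: tuples arrive one at a
 # time, only the last WINDOW seconds are retained, and each arrival produces an
 # answer immediately -- nothing is ever written to a base table.
 from collections import deque
 
 WINDOW = 10.0  # window size in seconds (assumed for this example)
 
 class WindowedAvg:
     def __init__(self, window=WINDOW):
         self.window = window
         self.buf = deque()   # (timestamp, value) pairs currently inside the window
         self.total = 0.0
 
     def insert(self, ts, value):
         self.buf.append((ts, value))
         self.total += value
         # Evict tuples that have fallen out of the time window.
         while self.buf and self.buf[0][0] < ts - self.window:
             old_ts, old_val = self.buf.popleft()
             self.total -= old_val
         return self.total / len(self.buf)
 
 if __name__ == "__main__":
     agg = WindowedAvg()
     for ts, v in [(0, 5.0), (3, 7.0), (9, 6.0), (14, 2.0), (25, 4.0)]:
         print(ts, agg.insert(ts, v))

Aurora builds whole dataflow networks of windowed operators like this one; see Sections 1-6 of the paper above for the full model.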