More history from Mike - 9/11/2008 Reading: 1. The start of Chapter 2 to end of Section 2.6 (pages 25-46) of Ramakrishnan and Gehrke 2. Section 3.5 (pages 75-86) of Ramakrishnan and Gehrke 3. The start of Chapter 19 to end of Section 19.6 (pages 605-619) of Ramakrishnan and Gehrke =Semi-Structured Data= Employees name|title|age|dept Sam|assoc|29|EECS Mike|Prof|64|EECS What if we want to add hobbies? Sam... dog(name=milo, breed=golden doodle, age=2) bike(brand=khs, miles/week=120, derailer=shimano) Mike... hike(mountains=11, boots=llbean) bike(manufacturer=Ed Lytton, location=Oakland CA, frame=28") This information is semi-structured - hard to put in a relational model! To put in relations, you'd need a lot of fields and a lot of nulls ==Mike's favorite proposal - RDF (Resource Description Framework)== - {(subject, verb, object)} ex. (Sam has bike) (id666 is bike) (Sam owns id666) - No restriction on verbs, or schema SPARQL - W3C standard query language Find Sam's bike: Select ?id From ("Sam" "owns" ?id) (?id "is" "bike") Mike's take on RDF. It's: - clean - simple - does a small amount of stuff very well - hard to construct complicated queries. - putting together n items in way that a relation does takes a query of length n - you don't know what to query for in a triple store unless you know the exact way triples are constructed ==XML - not quite as good (Mike)== Sam KHS ... ... XML has a lot of popularity, since XML goes through firewalls and nothing else does (well actually, any text does, but XML is what w3c gave us) XML came from SGML, which is a markup language. Meant to move data over the wire. Then some folks got the idea to store it on disk. Well - if we're storing it as a document, we need a schema: What can you describe in an XML schema? Table IMS CODASYL Set-valued elements -> color=red or green [ not so cool in the relational world ] Union Type [ not so cool in the relational world ] It's HUGE. Unlike RDF/Sparql, which are small, and implementable. There are no complete implementations of XML/Schemas/Queries. What does the query language look like? for $x in employee where $x.name = 'Sam' return $x.hobbies.bike ==XML vs. RDF== W3C has been pushing XML and RDF. RDBMS giants (IBM, Oracle, Microsoft, etc...) have been pushing XML, for a variety of reasons. - unfortunately, since XML Schema spec is so large, each major vendor implemented a different fragment of the spec. Mike's concluding thoughts: Semi- or un-structured data is not the end-all-be-all: You can store unstructured data easily with these tools, but the inconsistencies will come back to bite you. When you've allowed yourself to store salaries for employees in different countries with different semantics/laws applying to them, you still haven't allowed someone to run a useful query over that data, which will be incorrect unless you adjust for the semantics). =No structure at all= Can just store blobs of data. Use mapreduce, etc., to process data. We'll touch on this more later. =Why does SQL suck?= ==Nested queries vs. Flattened ones== Made by Chamberlain and Boyce, 1974. Animals older than Freddie? SELECT name FROM animal WHERE age > ( SELECT age FROM animals WHERE name='Freddie' ) Good - we have a block-structured language. Didn't support joins except through nested blocks. What if we want "Animals older than Freddie that share a cage with Freddie?" SELECT name FROM animal A WHERE age > ( SELECT age FROM animals WHERE name='Freddie' AND cid = A.cid ) A.cid has to refer to the outer block's cage id. 1974 spec didn't support this. Added variables, so they added that. However, this now implies that you have to implement the nested query as a for loop through outer block. More complex query Find all pairs of animals cared for by the same keeper? Select name, name2 from animals where kid = (Select kid from animals where name != name2) ACK! - need to use name2 in inner and outer query. Don't allow that! Instead, invented a flat query that supported JOINs. 1976: SELECT a1.name, a2.name FROM Animals A1, Animals A2 WHERE A1.name != A2.name AND A1.kid = A2.kid; 1984 - this design is commercialized. Now we have two representaions for most joins: 1) nested [inside-out evaluation (for loop)] 2) flat [need optimizer to decide which way is faster] so #2 will "always" select fastest option. #1 does not allow for optimization directly. Horrible language design - two options to choose from, might lead to different performance. By 1990, all vendors flatten nested queries. So nested queries are essentially ignored when possible. SQL is larger than it needs to be. Unfortunately, nested is not a subset of flat! here are nested queries that can not be flattened. Horrendous! Nested executor still needs to exist in databases, and language still needs to support it. Proof that some nested queries can not be flattened: SELECT name FROM Animals WHERE kid IN (SELECT kid FROM keepers WHERE name = 'Sam') [above query gives you animals whose keeper is named 'Sam'] SELECT name FROM Animals WHERE kid = (SELECT kid FROM keepers WHERE name = 'Sam') [only works when subquery has cardinality 1. If it's > 1, it returns error, since it can't be replaced in query parent block. This is an example of a query which can not be flattened due to its semantics] ==Calling SQL From a Programming Language== 2 options: - subroutine library (ODBC, JDBC) - need 20 lines of ODBC to handle all of the error handling/logic for 1 line of query - embedding it within a language - like Ruby on Rails, Hibernate, Linq. These find roots in Rigel [1976], PascalR [1975] as programming languages, but PL researchers didn't want to add it to mainstream languages at that point.