11/18/2008
-----------
Reading:
* Eric Brewer. Combining Systems and Databases: A Search Engine Retrospective. In Red Book.

=Search Engines=

Eric Brewer paper - "A Search Engine Retrospective"
Brewer started Inktomi and ran HotBot, which was #2 to AltaVista. Made some number of billions of dollars in the .com bubble.

How is a search engine different from a DBMS?
- relaxed consistency (no consistency guarantee)
  - one of many thousands of machines may go down, so its documents won't be reported
  - documents may not have been retrieved at the same time
  - there's more than one "correct" answer
- essentially read-only -> not updated by individual queries
  - more like a data warehouse
- there's only one kind of query - simplified query workload
- fixed schema
- data updates are pull (crawl the web) vs. push (INSERT)

Basic architecture (Brewer claims this is an order of magnitude more performant than a distributed DB):
- query -> web server -> one of n worker machines, each with some part of the index
- crawler is a background process, which creates the index for workers (1...n)
  - crawlers are very complicated, but we won't say much about them
- crawler feeds HTML to the indexer, which produces an inverted index of word -> docid
  (the words, and which documents contain those words)

Queries -> boolean expressions over words (implicitly combined with AND, etc.)
-> beyond words, can also search over properties (language: chinese, site: mit.edu)

Results: ordered list of pages satisfying a query
- order comes from score: c_1 * document_score + c_2 * score(words, doc)
- document_score: Google uses PageRank
  -> PR(a) = sum over pages p linking to a of PR(p) / (# outgoing links on p)
  -> There's an iterative version of calculating this, but it essentially measures how many popular pages (normalized by how many pages they link to) link to this page.
  -> docs with lots of incoming links, or with highly authoritative incoming links, score well.
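The iterative calculation mentioned above can be sketched in a few lines of Python. This is a minimal sketch of the formula as given in the notes, run on a tiny hypothetical link graph; real implementations add a damping factor and run at web scale.

```python
# Iterative PageRank, following the formula in the notes:
#   PR(a) = sum over pages p linking to a of PR(p) / (# outgoing links on p)
# (No damping factor here; real engines add one. The graph below is a toy.)

def pagerank(links, iterations=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = list(links)
    pr = {p: 1.0 / len(pages) for p in pages}    # start from a uniform score
    for _ in range(iterations):
        new_pr = {p: 0.0 for p in pages}
        for p, outgoing in links.items():
            share = pr[p] / len(outgoing)        # each outlink gets an equal share
            for target in outgoing:
                new_pr[target] += share
        pr = new_pr
    return pr

# Toy web: a receives links from both b and c, while c is linked only by b,
# so a ends up with a higher score than c.
graph = {"a": ["b"], "b": ["a", "c"], "c": ["a"]}
scores = pagerank(graph)   # roughly {"a": 0.4, "b": 0.4, "c": 0.2}
```

Note that the total score mass stays constant across iterations (each page redistributes its entire score to its outlinks), so the result stays a probability distribution over pages.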
-> PageRank is a simplified version of what modern search engines ACTUALLY do.
- score(words, doc) - scores the words the user is looking for based on how often/where they appear in the document, and how important those words are.
  -> simplest form of this is TF/IDF - term frequency / inverse document frequency
  -> tfidf = TF * IDF, where IDF = log(N / DF) (N = total # of documents, DF = # of documents containing the term)
  -> balances popular words by how unique/useful they are
  -> also takes into account word placement, etc.

The order score is NOT calculated on the fly - it's precalculated.

Tables stored in a search engine's DB:
- Documents(docid, URL, date, size, summary, ..., page score)
  NOTE: most search engines probably store the ENTIRE document somewhere
- Words(wordid, docid, score, position)
  -> the words found in documents, otherwise known as the inverted index
- Props(propid, docid)
- Terms(wordid, string, statistics)
  -> stores the string representing a given wordid, so that we can normalize it out of the Words table

Documents + Words are big, Props is pretty big, and Terms is relatively small (around 10M terms, according to Brewer).

Say someone wants to search for "Mike AND Stonebraker". The query plan looks like:

    Top-K sort, generate result set
                  |
         calculate score, URL
                  |
               INL join
              /        \
    select docid,      index lookup on Documents,
    word score         retrieving url, summary
         |
    merge join / intersection
       /                  \
    select(mike, Words)   select(stonebraker, Words)

(a merge join of the "Words" selections is possible, since entries are sorted on docid. Can also join on properties to filter on them.)

Performance optimization:
- Compression
  - wordid need not be repeated for each document it appears in
- Caching results in the query plan
  - cache selections on popular words
  - cache the entire result set for popular queries
- Partition tables across the workers

Good partitioning for distributed execution:
- Partition Documents, Words, and Props on docid, so all facts about a given document are found on a single worker machine.
- Send each query to every worker machine; each returns a sorted list of documents, which the front end merges.
- Store the Terms table on each worker, since it's small, or keep it on the front-end machines which pass queries along.
- Hash partition on URL/docid to distribute data easily.
- Keep multiple smaller partitions on each worker, so that you can update smaller chunks from the crawler.

CAP Theorem (consistency, availability, partition tolerance):
- consistency - results include all of the data
- availability - get results at any time
- partition tolerance - can keep operating when some machines become unreachable

You can have any TWO of the three in CAP. So search engines drop consistency: if a worker goes down, recover it eventually, but keep returning results without that worker's data in the meantime.
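The "drop consistency" choice above can be sketched as a front end that fans a query out to all workers and merges whatever sorted results come back, silently skipping workers that are down. This is a minimal sketch under assumed data: the worker names, their (score, docid) result lists, and the `query_worker` helper are all hypothetical.

```python
import heapq

# Hypothetical per-worker results for some fixed query: each worker holds the
# postings for its docid partition and returns (score, docid) pairs sorted by
# descending score.
WORKERS = {
    "worker1": [(0.9, 17), (0.5, 3)],
    "worker2": [(0.8, 21), (0.2, 8)],
    "worker3": [(0.7, 42)],
}

def query_worker(name, down):
    """Return the worker's sorted result list, or None if it's unavailable."""
    if name in down:
        return None          # machine is down: no error, just no data
    return WORKERS[name]

def search(down=frozenset(), k=3):
    """Fan out to all workers, merge their sorted results, keep the top k.

    Dropping consistency: results from down workers are silently missing,
    but the query still succeeds (availability + partition tolerance)."""
    per_worker = [r for name in WORKERS
                  if (r := query_worker(name, down)) is not None]
    # Each list is sorted descending by score, so a k-way merge keeps order.
    merged = heapq.merge(*per_worker, key=lambda sd: -sd[0])
    return [docid for _, docid in merged][:k]

# Same query with all workers up vs. worker2 down: fewer results come back,
# still sorted by score, and no error is raised.
full = search()                       # [17, 21, 42]
partial = search(down={"worker2"})    # [17, 42, 3]
```

Because every worker returns its results already sorted by score, the front end only needs a cheap k-way merge rather than a full re-sort of all partial results.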