11/18/2008
-----------
Reading:
* Eric Brewer. Combining Systems and Databases: A Search Engine Retrospective. In Red Book.

=Search Engines=

Eric Brewer paper - "A Search Engine Retrospective"
Brewer started Inktomi and ran HotBot, which was #2 to AltaVista. Made some number of billions of dollars in the .com bubble.

How is a search engine different from a DBMS?
- relaxed consistency (no consistency guarantee)
  - one of many thousands of machines may go down, so its documents won't be reported
  - documents may not have been retrieved at the same time
  - there's more than one "correct" answer
- essentially read-only -> not updated by individual queries
  - more like a data warehouse
- there's only one kind of query - simplified query workload
- fixed schema
- data updates are pull (crawl the web) vs. push (INSERT)

Basic architecture (Brewer claims this is an order of magnitude more performant than a distributed DB):
- query -> web server -> one of n worker machines, each with some part of the index
- crawler is a background process, which creates the index for workers (1...n)
  - crawlers are very complicated, but we won't say much about them
- crawler feeds HTML to the indexer, which produces an inverted index of word -> docid
  (the words, and which documents contain those words)

Queries -> boolean expressions over words (implicitly combined with AND, etc.)
-> beyond words, can also search over properties (language: chinese, site: mit.edu)

Results: ordered list of pages satisfying a query
- order comes from score: c_1 * document_score + c_2 * score(words, doc)
- document_score: Google uses PageRank
  -> PR(a) = sum over pages p linking to a of PR(p) / (# outgoing links on p)
  -> There's an iterative version of calculating this, but it essentially measures how many popular pages (normalized by how many pages they link to) link to this page.
  -> docs with lots of incoming links, or with highly authoritative incoming links, score well.
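The iterative calculation mentioned above can be sketched in a few lines of Python. This is a minimal sketch of the formula as given in the notes, run on a tiny hypothetical link graph; real implementations add a damping factor and run at web scale.

```python
# Iterative PageRank, following the formula in the notes:
#   PR(a) = sum over pages p linking to a of PR(p) / (# outgoing links on p)
# (No damping factor here; real engines add one. The graph below is a toy.)

def pagerank(links, iterations=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = list(links)
    pr = {p: 1.0 / len(pages) for p in pages}    # start from a uniform score
    for _ in range(iterations):
        new_pr = {p: 0.0 for p in pages}
        for p, outgoing in links.items():
            share = pr[p] / len(outgoing)        # each outlink gets an equal share
            for target in outgoing:
                new_pr[target] += share
        pr = new_pr
    return pr

# Toy web: a receives links from both b and c, while c is linked only by b,
# so a ends up with a higher score than c.
graph = {"a": ["b"], "b": ["a", "c"], "c": ["a"]}
scores = pagerank(graph)   # roughly {"a": 0.4, "b": 0.4, "c": 0.2}
```

Note that the total score mass stays constant across iterations (each page redistributes its entire score to its outlinks), so the result stays a probability distribution over pages.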
-> PageRank is a simplified version of what modern search engines ACTUALLY do.
- score(words, doc) - scores the words the user is looking for based on how often/where they appear in the document, and how important those words are.
  -> simplest form of this is TF/IDF - term frequency / inverse document frequency
  -> tfidf = TF * IDF, where IDF = log(N / DF) (N = total # of documents, DF = # of documents containing the term)
  -> balances popular words by how unique/useful they are
  -> also takes into account word placement, etc.

The order score is NOT calculated on the fly - it's precalculated.

Tables stored in a search engine's DB:
- Documents(docid, URL, date, size, summary, ..., page score)
  NOTE: most search engines probably store the ENTIRE document somewhere
- Words(wordid, docid, score, position)
  -> the words found in documents, otherwise known as the inverted index
- Props(propid, docid)
- Terms(wordid, string, statistics)
  -> stores the string representing a given wordid, so that we can normalize it out of the Words table

Documents + Words are big, Props is pretty big, and Terms is relatively small (around 10M terms, according to Brewer).

Say someone wants to search for "Mike AND Stonebraker". The query plan looks like:

    Top-K sort, generate result set
                  |
         calculate score, URL
                  |
               INL join
              /        \
    select docid,      index lookup on Documents,
    word score         retrieving url, summary
         |
    merge join / intersection
       /                  \
    select(mike, Words)   select(stonebraker, Words)

(a merge join of the "Words" selections is possible, since entries are sorted on docid. Can also join on properties to filter on them.)

Performance optimization:
- Compression
  - wordid need not be repeated for each document it appears in
- Caching results in the query plan
  - cache selections on popular words
  - cache the entire result set for popular queries
- Partition tables across the workers

Good partitioning for distributed execution:
- Partition Documents, Words, and Props on docid, so all facts about a given document are found on a single worker machine.
- Send each query to every worker machine; each returns a sorted list of documents, which the front end merges.
- Store the Terms table on each worker, since it's small, or keep it on the front-end machines which pass queries along.
- Hash partition on URL/docid to distribute data easily.
- Keep multiple smaller partitions on each worker, so that you can update smaller chunks from the crawler.

CAP Theorem (consistency, availability, partition tolerance):
- consistency - results include all of the data
- availability - get results at any time
- partition tolerance - can keep operating when some machines become unreachable

You can have any TWO of the three in CAP. So search engines drop consistency: if a worker goes down, recover it eventually, but keep returning results without that worker's data in the meantime.
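The "drop consistency" choice above can be sketched as a front end that fans a query out to all workers and merges whatever sorted results come back, silently skipping workers that are down. This is a minimal sketch under assumed data: the worker names, their (score, docid) result lists, and the `query_worker` helper are all hypothetical.

```python
import heapq

# Hypothetical per-worker results for some fixed query: each worker holds the
# postings for its docid partition and returns (score, docid) pairs sorted by
# descending score.
WORKERS = {
    "worker1": [(0.9, 17), (0.5, 3)],
    "worker2": [(0.8, 21), (0.2, 8)],
    "worker3": [(0.7, 42)],
}

def query_worker(name, down):
    """Return the worker's sorted result list, or None if it's unavailable."""
    if name in down:
        return None          # machine is down: no error, just no data
    return WORKERS[name]

def search(down=frozenset(), k=3):
    """Fan out to all workers, merge their sorted results, keep the top k.

    Dropping consistency: results from down workers are silently missing,
    but the query still succeeds (availability + partition tolerance)."""
    per_worker = [r for name in WORKERS
                  if (r := query_worker(name, down)) is not None]
    # Each list is sorted descending by score, so a k-way merge keeps order.
    merged = heapq.merge(*per_worker, key=lambda sd: -sd[0])
    return [docid for _, docid in merged][:k]

# Same query with all workers up vs. worker2 down: fewer results come back,
# still sorted by score, and no error is raised.
full = search()                       # [17, 21, 42]
partial = search(down={"worker2"})    # [17, 42, 3]
```

Because every worker returns its results already sorted by score, the front end only needs a cheap k-way merge rather than a full re-sort of all partial results.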