10/28/2008 Column Stores

Reading:
* Michael Stonebraker, Daniel Abadi, Adam Batkin, Xuedong Chen, Mitch Cherniack, Miguel Ferreira, Edmond Lau, Amerson Lin, Samuel Madden, Elizabeth O'Neil, Pat O'Neil, Alex Rasin, Nga Tran, Stan Zdonik. C-Store: A Column-oriented DBMS. In Proceedings of VLDB, 2005. http://db.lcs.mit.edu/madden/html/vldb_cr5.pdf

Visiting lecturer: Shilpa Lawande, VP of Engineering of Vertica (commercialization of C-Store).

Organizations split analytical processing off into separate databases designed specifically for analytics, so that cheaper transactional queries don't get slowed down and analytical queries can run more efficiently.

=Data Warehouses/Analytics=
An old field, but it used to be limited to data analysts. Now it's growing so that everyone in a company has access to the data and can run ad-hoc analytical queries.

Different uses for data warehousing:
- finance - storing historical tickstores, fraud detection, compliance
- communications - call detail reporting/analysis
- internet - clickstream analysis
- consumer marketing - buying pattern analysis
- life sciences - RDF data management

Most queries are read-only analyses of the ENTIRE dataset. Previous solutions limited how much data could be stored, so not much history could be queried.

=One Size Does Not Fit All=
History:
- In the 70s/80s, OLTP (online transaction processing) systems provided good efficiency for write-mostly applications
-> designed to keep data in memory, avoid using disk
- OLAP (analytical processing) came in the 90s
-> band-aid features like MVs, bitmap indexes, and cubes often take as much or more space than the data itself
-> designed for specific queries, not ad-hoc ones
- Appliances came about in the 2000s
-> throw hardware at the problem, since commodity hardware is cheap, but you're still using old technology
-> proprietary hardware is too expensive
- Vertica's solution:
-> database software should be custom-built for analytics
-> leverage commodity hardware, but have the software tie it together in a better way
-> looking forward - don't even own hardware; put it on the cloud

=Vertica=
Based on C-Store, an MIT CSAIL project.

Design ideals:
- Designed for fast performance on a read-mostly workload.
- Bad for small, high-throughput queries, especially if they require updates/deletes.

Claims:
- Fast
-> Column store architecture
-> Aggressive compression (sometimes 10:1)
-> Designed for multicore CPUs
-> Lightweight transaction model, since there are no concurrent updates/deletes
- Simple
-> Interfaces like other DBs (SQL, ODBC)
-> Designed for concurrent queries and bulk loads
-> Built-in fault tolerance
- Cheap
-> "Scale out" (buy more commodity hardware) rather than "scale up" (buy more expensive hardware)

=Column Store Architecture=
Data is usually simple, in a single table (not normalized), with 300-500 columns. Queries usually touch fewer than 8 columns at a time. Traditional row stores keep all columns of a row in the same place; column stores keep each column in its own contiguous file, so accessing a few columns requires less I/O.

Other column stores already exist (for example, Sybase IQ). The features that follow are what make Vertica different.

=Compression=
Columns compress better when stored on their own (Vertica's claim). Typically see 50-90% compression. NULLs take virtually no space. There are many encoding types, but the most popular are:
- Run-length encoding - if data is not diverse and is sorted, just store each value with how many times it appears in sequence.
- Delta encoding - if you have diverse values and they are sorted, store a start value and the deltas, which are small and take less space.
- LZW - dictionary encoding for repeated values that are not sorted.

By saving disk I/O, you shift work to the CPU, leaving queries CPU-bound, contrary to the popular DB mindset.
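A rough illustration of the two sorted-column encodings above - a toy Python sketch, not Vertica's actual on-disk format:

 # Toy sketches of run-length and delta encoding for a sorted column.
 def rle_encode(sorted_column):
     # Low-cardinality, sorted data collapses into (value, run length) pairs.
     runs = []
     for v in sorted_column:
         if runs and runs[-1][0] == v:
             runs[-1][1] += 1
         else:
             runs.append([v, 1])
     return [tuple(r) for r in runs]

 def delta_encode(sorted_column):
     # Diverse, sorted data becomes a start value plus small deltas.
     start = sorted_column[0]
     deltas = [b - a for a, b in zip(sorted_column, sorted_column[1:])]
     return start, deltas

 print(rle_encode(["CA", "CA", "CA", "MA", "NY", "NY"]))  # [('CA', 3), ('MA', 1), ('NY', 2)]
 print(delta_encode([1000, 1003, 1004, 1010]))            # (1000, [3, 1, 6])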
=Projection=
All data in C-Store is sorted. Multiple columns sorted in the same order are called a projection. If you query for columns within a projection, you can read the columns separately in the same order and reconstruct the rows from them. Multiple projections speed up different classes of queries, and the optimizer's job is to pick which projection to use to answer a query. Vertica found that join indices, which let you stitch together columns from different projections, were not efficient, and ditched them.

=Shared-nothing Architecture=
Users can connect to any node.
Add nodes to scale out.
Distributed query optimizer and execution engine.
Storage is internal to each node, avoiding an expensive centralized storage solution.
Each node runs 1 Vertica process (multithreaded) + 1 spread process
- spread: off-the-shelf product to determine which nodes are up and which are down.

=Hybrid Storage=
The write-optimized store (WOS) handles all write requests. A background process called the "tuple mover" moves tuples to the read-optimized store (ROS). The WOS sits in memory; it is currently a column store, but is moving toward a row store. Users bulk-load into the WOS, and the background process offloads it into the ROS. The ROS is stored in projection sort order; the WOS has insertion-order projections only. The query executor reads from both the WOS and the ROS (hopefully mostly from the ROS).

Since you can't stop the ROS from serving queries while you move tuples into packed/sorted columns (as C-Store suggested), Vertica stores each projection in a small number of containers, which are merged together to answer a query. The tuple mover not only moves tuples into the ROS, but also merges small containers together to avoid fragmentation. Multiple containers help concurrency: strive to have as many containers as you have disk spindles, so multiple cores can read from different spindles concurrently.

=Light-weight Transactions=
No row-level locking. Historical queries ask questions as of a certain time (epoch). The epoch is advanced every 3 minutes; all transactions running at any time are stamped with the current epoch, and new transactions get new epoch values after 3 minutes.

No in-place deletions/updates. All deletes are treated like inserts (insert a "delete" flag for the tuple, but don't delete the tuple itself). Over time, the tuple mover purges deleted data as it merges containers, or when requested explicitly by an administrator. An update = a delete + an insert.

Since users want historical queries, they only touch closed epochs, not the current one. Since updates only touch the open epoch, no locks are necessary for historical queries. Pages are only locked for open-epoch queries, so someone who insists on querying an open epoch hits a lock, but otherwise there is no locking.
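A minimal, hypothetical sketch (the names and structure are made up, not Vertica's) of the epoch/delete-flag idea: a tuple is visible at query epoch E if it was inserted at or before E and not marked deleted at or before E, so historical queries over closed epochs never need locks.

 # Hypothetical illustration of epoch-based visibility with delete-as-insert.
 class Tuple:
     def __init__(self, value, insert_epoch):
         self.value = value
         self.insert_epoch = insert_epoch
         self.delete_epoch = None          # set when a "delete" marker is inserted

 def visible(t, query_epoch):
     # Visible if inserted by query_epoch and not yet deleted at that epoch.
     if t.insert_epoch > query_epoch:
         return False
     return t.delete_epoch is None or t.delete_epoch > query_epoch

 store = [Tuple("a", insert_epoch=1), Tuple("b", insert_epoch=2)]
 store[0].delete_epoch = 3                 # delete of "a" recorded at epoch 3
 store.append(Tuple("a2", insert_epoch=3)) # update = delete + insert

 print([t.value for t in store if visible(t, 2)])  # ['a', 'b']  (state as of closed epoch 2)
 print([t.value for t in store if visible(t, 3)])  # ['b', 'a2'] (state as of epoch 3)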
=Recovery=
No persistent UNDO logs - just throw away the WOS files.
No persistent REDO logs - all data is stored redundantly on multiple nodes, so you can recover data by querying other nodes.

=K-Safe Physical Design=
Non-segmented projections are replicated on all nodes.
Each segmented projection has K buddy projections.
For a database with N nodes, the maximum allowable K must satisfy K < N/2.
The database designer ensures the design is K-safe, and both the ROS and the WOS are K-safe.
On failure:
- the query executor handles down nodes
- recovery works by querying live nodes
- all failures are handled as hardware failures

Essentially, for K-safety, store K+1 copies of the data, not all on the same machine. For a disaster (the whole datacenter goes down), load the newest epoch for which all columns have all of their segments.

=Database Designer=
How do you pick projections? The database designer (software) takes the schema and sample queries, and figures out which projections to create for efficient queries and K-safety.
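The lecture doesn't give the designer's placement algorithm; as a rough, hypothetical sketch of the K-safety constraints it has to satisfy (K < N/2, and K+1 copies of each segment on distinct nodes):

 # Hypothetical sketch of the K-safety constraints described above.
 def max_k(num_nodes):
     # Largest K satisfying K < N/2 (e.g., 4 nodes -> K = 1, 5 nodes -> K = 2).
     return (num_nodes - 1) // 2

 def place_segment(segment_id, k, num_nodes):
     # Put K+1 copies of a segment on K+1 distinct nodes (simple round-robin layout).
     assert k <= max_k(num_nodes), "not K-safe for this cluster size"
     return [(segment_id + i) % num_nodes for i in range(k + 1)]

 # 5-node cluster with K = 2: every segment lives on 3 distinct nodes,
 # so any 2 node failures still leave at least one copy of each segment.
 for seg in range(5):
     print(seg, place_segment(seg, k=2, num_nodes=5))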