10/28/2008 Column Stores

Reading:
* Michael Stonebraker, Daniel Abadi, Adam Batkin, Xuedong Chen, Mitch Cherniack, Miguel Ferreira, Edmond Lau, Amerson Lin, Samuel Madden, Elizabeth O'Neil, Pat O'Neil, Alex Rasin, Nga Tran, Stan Zdonik. C-Store: A Column-oriented DBMS. In Proceedings of VLDB, 2005. http://db.lcs.mit.edu/madden/html/vldb_cr5.pdf

Visiting lecturer: Shilpa Lawande, VP of Engineering of Vertica (commercialization of C-Store).

Organizations split analytical processing off into separate databases designed specifically for analytics, so that cheaper transactional queries don't get slowed down and analytical queries can run more efficiently.

=Data Warehouses/Analytics=
An old field, but it used to be limited to data analysts. Now it's growing so that everyone in a company has access to the data and can run ad-hoc analytical queries.

Different uses for data warehousing:
- finance - storing historical tickstores, fraud detection, compliance
- communications - call detail reporting/analysis
- internet - clickstream analysis
- consumer marketing - buying pattern analysis
- life sciences - RDF data management

Most queries are read-only analyses of the ENTIRE dataset. Previous solutions limited how much data could be stored, so not much history could be queried.

=One Size Does Not Fit All=
History:
- In the 70s/80s, OLTP (online transaction processing) systems provided good efficiency for write-mostly applications
-> designed to keep data in memory, avoid using disk
- OLAP (analytical processing) came in the 90s
-> band-aid features like MVs, bitmap indexes, and cubes often take as much or more space than the data itself
-> designed for specific queries, not ad-hoc ones
- Appliances came about in the 2000s
-> throw hardware at the problem, since commodity hardware is cheap, but you're still using old technology
-> proprietary hardware is too expensive
- Vertica's solution:
-> database software should be custom-built for analytics
-> leverage commodity hardware, but have the software tie it together in a better way
-> looking forward - don't even own hardware; put it on the cloud

=Vertica=
Based on C-Store, an MIT CSAIL project.

Design ideals:
- Designed for fast performance on a read-mostly workload.
- Bad for small, high-throughput queries, especially if they require updates/deletes.

Claims:
- Fast
-> Column store architecture
-> Aggressive compression (sometimes 10:1)
-> Designed for multicore CPUs
-> Lightweight transaction model, since there are no concurrent updates/deletes
- Simple
-> Interfaces like other DBs (SQL, ODBC)
-> Designed for concurrent queries and bulk loads
-> Built-in fault tolerance
- Cheap
-> "Scale out" (buy more commodity hardware) rather than "scale up" (buy more expensive hardware)

=Column Store Architecture=
Data is usually simple, in a single table (not normalized), with 300-500 columns. Queries usually touch fewer than 8 columns at a time. Traditional row stores keep all columns of a row in the same place; column stores keep each column in its own contiguous file, so accessing a few columns requires less I/O.

Other column stores already exist (for example, Sybase IQ). The features that follow are what make Vertica different.

=Compression=
Columns compress better when stored on their own (Vertica's claim). Typically see 50-90% compression. NULLs take virtually no space. There are many encoding types, but the most popular are:
- Run-length encoding - if data is not diverse and is sorted, just store each value with how many times it appears in sequence.
- Delta encoding - if you have diverse values and they are sorted, store a start value and the deltas, which are small and take less space.
- LZW - dictionary encoding for repeated values that are not sorted.

By saving disk I/O, you shift work to the CPU, leaving queries CPU-bound, contrary to the popular DB mindset.
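A rough illustration of the two sorted-column encodings above - a toy Python sketch, not Vertica's actual on-disk format:

 # Toy sketches of run-length and delta encoding for a sorted column.
 def rle_encode(sorted_column):
     # Low-cardinality, sorted data collapses into (value, run length) pairs.
     runs = []
     for v in sorted_column:
         if runs and runs[-1][0] == v:
             runs[-1][1] += 1
         else:
             runs.append([v, 1])
     return [tuple(r) for r in runs]

 def delta_encode(sorted_column):
     # Diverse, sorted data becomes a start value plus small deltas.
     start = sorted_column[0]
     deltas = [b - a for a, b in zip(sorted_column, sorted_column[1:])]
     return start, deltas

 print(rle_encode(["CA", "CA", "CA", "MA", "NY", "NY"]))  # [('CA', 3), ('MA', 1), ('NY', 2)]
 print(delta_encode([1000, 1003, 1004, 1010]))            # (1000, [3, 1, 6])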
=Projection=
All data in C-Store is sorted. Multiple columns sorted in the same order are called a projection. If you query for columns within a projection, you can read the columns separately in the same order and reconstruct the rows from them. Multiple projections speed up different classes of queries, and the optimizer's job is to pick which projection to use to answer a query. Vertica found that join indices, which let you stitch together columns from different projections, were not efficient, and ditched them.

=Shared-nothing Architecture=
Users can connect to any node.
Add nodes to scale out.
Distributed query optimizer and execution engine.
Storage is internal to each node, avoiding an expensive centralized storage solution.
Each node runs 1 Vertica process (multithreaded) + 1 spread process
- spread: off-the-shelf product to determine which nodes are up and which are down.

=Hybrid Storage=
The write-optimized store (WOS) handles all write requests. A background process called the "tuple mover" moves tuples to the read-optimized store (ROS). The WOS sits in memory; it is currently a column store, but is moving toward a row store. Users bulk-load into the WOS, and the background process offloads it into the ROS. The ROS is stored in projection sort order; the WOS has insertion-order projections only. The query executor reads from both the WOS and the ROS (hopefully mostly from the ROS).

Since you can't stop the ROS from serving queries while you move tuples into packed/sorted columns (as C-Store suggested), Vertica stores each projection in a small number of containers, which are merged together to answer a query. The tuple mover not only moves tuples into the ROS, but also merges small containers together to avoid fragmentation. Multiple containers help concurrency: strive to have as many containers as you have disk spindles, so multiple cores can read from different spindles concurrently.

=Light-weight Transactions=
No row-level locking. Historical queries ask questions as of a certain time (epoch). The epoch is advanced every 3 minutes; all transactions running at any time are stamped with the current epoch, and new transactions get new epoch values after 3 minutes.

No in-place deletions/updates. All deletes are treated like inserts (insert a "delete" flag for the tuple, but don't delete the tuple itself). Over time, the tuple mover purges deleted data as it merges containers, or when requested explicitly by an administrator. An update = a delete + an insert.

Since users want historical queries, they only touch closed epochs, not the current one. Since updates only touch the open epoch, no locks are necessary for historical queries. Pages are only locked for open-epoch queries, so someone who insists on querying an open epoch hits a lock, but otherwise there is no locking.
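A minimal, hypothetical sketch (the names and structure are made up, not Vertica's) of the epoch/delete-flag idea: a tuple is visible at query epoch E if it was inserted at or before E and not marked deleted at or before E, so historical queries over closed epochs never need locks.

 # Hypothetical illustration of epoch-based visibility with delete-as-insert.
 class Tuple:
     def __init__(self, value, insert_epoch):
         self.value = value
         self.insert_epoch = insert_epoch
         self.delete_epoch = None          # set when a "delete" marker is inserted

 def visible(t, query_epoch):
     # Visible if inserted by query_epoch and not yet deleted at that epoch.
     if t.insert_epoch > query_epoch:
         return False
     return t.delete_epoch is None or t.delete_epoch > query_epoch

 store = [Tuple("a", insert_epoch=1), Tuple("b", insert_epoch=2)]
 store[0].delete_epoch = 3                 # delete of "a" recorded at epoch 3
 store.append(Tuple("a2", insert_epoch=3)) # update = delete + insert

 print([t.value for t in store if visible(t, 2)])  # ['a', 'b']  (state as of closed epoch 2)
 print([t.value for t in store if visible(t, 3)])  # ['b', 'a2'] (state as of epoch 3)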
=Recovery=
No persistent UNDO logs - just throw away the WOS files.
No persistent REDO logs - all data is stored redundantly on multiple nodes, so you can recover data by querying other nodes.

=K-Safe Physical Design=
Non-segmented projections are replicated on all nodes.
Each segmented projection has K buddy projections.
For a database with N nodes, the maximum allowable K must satisfy K < N/2.
The database designer ensures the design is K-safe, and both the ROS and the WOS are K-safe.
On failure:
- the query executor handles down nodes
- recovery works by querying live nodes
- all failures are handled as hardware failures

Essentially, for K-safety, store K+1 copies of the data, not all on the same machine. For a disaster (the whole datacenter goes down), load the newest epoch for which all columns have all of their segments.

=Database Designer=
How do you pick projections? The database designer (software) takes the schema and sample queries, and figures out which projections to create for efficient queries and K-safety.
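The lecture doesn't give the designer's placement algorithm; as a rough, hypothetical sketch of the K-safety constraints it has to satisfy (K < N/2, and K+1 copies of each segment on distinct nodes):

 # Hypothetical sketch of the K-safety constraints described above.
 def max_k(num_nodes):
     # Largest K satisfying K < N/2 (e.g., 4 nodes -> K = 1, 5 nodes -> K = 2).
     return (num_nodes - 1) // 2

 def place_segment(segment_id, k, num_nodes):
     # Put K+1 copies of a segment on K+1 distinct nodes (simple round-robin layout).
     assert k <= max_k(num_nodes), "not K-safe for this cluster size"
     return [(segment_id + i) % num_nodes for i in range(k + 1)]

 # 5-node cluster with K = 2: every segment lives on 3 distinct nodes,
 # so any 2 node failures still leave at least one copy of each segment.
 for seg in range(5):
     print(seg, place_segment(seg, k=2, num_nodes=5))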