There has been a significant amount of excitement and recent work
on column-oriented database systems (“column-stores”). These
database systems have been shown to perform more than an order
of magnitude better than traditional row-oriented database systems
(“row-stores”) on analytical workloads such as those found in
data warehouses, decision support, and business intelligence
applications. The elevator pitch behind this performance difference is
straightforward: column-stores are more I/O efficient for read-only
queries since they only have to read from disk (or from memory)
ABSTRACT
While it is generally accepted that data warehouses and
OLAP workloads are excellent applications for column-stores,
this paper speculates that column-stores may also be well suited
to additional applications. In particular, we observe that
column-stores do not see a performance degradation when
storing extremely wide tables, and column-stores handle sparse
data very well. These two properties lead us to conjecture
that column-stores may be good storage layers for Semantic
Web data, XML data, and data with GEM-style schemas.
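
To make the sparse-data observation concrete, the following is a minimal sketch of a toy column layout in which each attribute keeps only the values that are actually present. It is an illustration under stated assumptions, not the storage format of any particular column-store: the function names `to_sparse_columns` and `scan_column` and the sample records are hypothetical.

```python
from collections import defaultdict


def to_sparse_columns(rows):
    """Convert row dicts into per-attribute columns holding only present values."""
    columns = defaultdict(dict)  # attribute name -> {row id: value}
    for row_id, row in enumerate(rows):
        for attr, value in row.items():
            if value is not None:
                columns[attr][row_id] = value
    return dict(columns)


def scan_column(columns, attr):
    """Read a single attribute; no other column is touched."""
    return sorted(columns.get(attr, {}).items())


# Hypothetical sparse records: each entity defines only a few of many attributes.
rows = [
    {"title": "Dune", "author": "Herbert"},
    {"title": "Alien", "director": "Scott"},
    {"name": "MIT", "city": "Cambridge"},
]

columns = to_sparse_columns(rows)
attributes = sorted(columns)

# A dense row layout reserves one cell per (row, attribute) pair, NULLs included.
dense_cells = len(rows) * len(attributes)
# The toy column layout stores only the values that exist.
stored_cells = sum(len(col) for col in columns.values())

print(dense_cells, stored_cells)
print(scan_column(columns, "title"))
```

In this toy model, adding a new attribute adds one more column without widening any existing record, and a query over `title` reads only that column, which is the intuition behind the two properties noted above.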
There is currently considerable enthusiasm around the MapReduce
(MR) paradigm for large-scale data analysis [17]. Although the
basic control flow of this framework has existed in parallel SQL
database management systems (DBMS) for over 20 years, some
have called MR a dramatically new computing model [8, 17]. In
this paper, we describe and compare both paradigms. Furthermore,
we evaluate both kinds of systems in terms of performance and
development complexity. To this end, we define a benchmark
consisting of a collection of tasks that we have run on an open source