Large-scale data analysis has become increasingly important
for many enterprises. Recently, a new distributed computing
paradigm called MapReduce, together with its open source
implementation Hadoop, has been widely adopted due to
its impressive scalability and flexibility to handle structured
as well as unstructured data. In this paper, we describe
our data warehouse system, called Cheetah, built on top of
MapReduce. Cheetah is designed specifically for our online
advertising application to allow various simplifications and
custom optimizations. First, we take a fresh look at the data
warehouse schema design. In particular, we define a virtual
view on top of the common star or snowflake data warehouse
schema. This virtual view abstraction not only allows us to
design a SQL-like but much more succinct query language,
but also makes it easier to support many advanced query
processing features. Next, we describe a stack of optimization
techniques ranging from data compression and access methods
to multi-query optimization and the exploitation of materialized
views. With these optimizations, each node with commodity hardware
in our cluster is able to process raw data at 1 GB/s. Lastly,
we show how to seamlessly integrate Cheetah into any ad-hoc
MapReduce job. This allows MapReduce developers
to fully leverage the power of both MapReduce and data
warehouse technologies.
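
The virtual view idea mentioned above can be illustrated with a minimal sketch. It is not Cheetah's implementation or query language: the table names, column names, and the helpers `virtual_view` and `sum_by` are all hypothetical, and the data is made up for illustration. The sketch only shows the general principle that, once fact rows are presented together with the columns of their dimension tables, a query can name columns directly without spelling out any joins.

```python
from collections import defaultdict

# Hypothetical star schema: one fact table plus two dimension tables.
impressions = [
    {"ad_id": 1, "site_id": 10, "clicks": 3},
    {"ad_id": 2, "site_id": 10, "clicks": 1},
    {"ad_id": 1, "site_id": 20, "clicks": 4},
]
ads = {1: {"advertiser": "acme"}, 2: {"advertiser": "globex"}}
sites = {10: {"country": "US"}, 20: {"country": "DE"}}


def virtual_view(facts, dimensions):
    """Yield each fact row widened with the columns of its dimension rows.

    `dimensions` maps a foreign-key column name to the dimension table
    (a dict keyed by that foreign key).
    """
    for fact in facts:
        row = dict(fact)
        for key, table in dimensions.items():
            row.update(table[fact[key]])
        yield row


def sum_by(view, group_col, measure_col):
    """Aggregate one measure per group; the caller never writes a join."""
    totals = defaultdict(int)
    for row in view:
        totals[row[group_col]] += row[measure_col]
    return dict(totals)


dimensions = {"ad_id": ads, "site_id": sites}
clicks_by_advertiser = sum_by(
    virtual_view(impressions, dimensions), "advertiser", "clicks"
)
clicks_by_country = sum_by(
    virtual_view(impressions, dimensions), "country", "clicks"
)
```

In this sketch the join logic lives once, inside the view, so each query reduces to choosing a grouping column and a measure. That is the sense in which a view-based abstraction can make queries shorter than their explicit-join SQL equivalents.
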
Attachment | Size |
---|---|
Chen2010CheetahAHighPerformanceCustomDataWarehouseonTopof.pdf | 520.35 KB |