Hadoop has become an attractive platform for large-scale data analytics. In this paper, we identify a major performance bottleneck of Hadoop: its inability to colocate related data on the same set of nodes. To overcome this bottleneck, we introduce CoHadoop, a lightweight extension of Hadoop that allows applications to control where data are stored. In contrast to previous approaches, CoHadoop retains the flexibility of Hadoop in that it does not require users to convert their data to a certain format (e.g., a relational database or a specific file format). Instead, applications give hints to CoHadoop that a set of files is related and may be processed jointly; CoHadoop then tries to colocate these files for improved efficiency. Our approach is designed such that the strong fault-tolerance properties of Hadoop are retained.
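In CoHadoop, such hints take the form of a file-level locator: files that share a locator value are treated as related, and a modified block placement policy tries to store their blocks on the same set of datanodes. The sketch below illustrates how an application might attach such a hint when creating files; the configuration key `cohadoop.file.locator`, the locator value, and the paths are illustrative assumptions, not the system's actual API.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LocatorHintExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical CoHadoop-specific setting: a locator ID attached to
        // files created by this client; files with the same ID are candidates
        // for colocation on the same set of datanodes.
        conf.setInt("cohadoop.file.locator", 42);
        FileSystem fs = FileSystem.get(conf);

        // Both partitions carry locator 42, so CoHadoop would attempt to
        // place their replicas on the same nodes (paths are illustrative).
        fs.create(new Path("/logs/2011-06-01/part-00000")).close();
        fs.create(new Path("/refdata/users/part-00000")).close();
    }
}
```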
Colocation can be used to improve the efficiency of many operations, including indexing, grouping, aggregation, columnar storage, joins, and sessionization.
We conduct a detailed study of joins and sessionization in the context of log processing, a common use case for Hadoop, and propose efficient map-only algorithms that exploit colocated data partitions. In our experiments, we observed that CoHadoop outperforms both plain Hadoop and previous work. In particular, our approach not only performs better than repartition-based algorithms, but also outperforms map-only algorithms that exploit data partitioning but not colocation.
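To make the map-only join idea concrete, the following is a minimal sketch of a mapper that joins two copartitioned inputs: because matching partitions are colocated, each map task can read the corresponding reference partition from a local replica and join without a shuffle or reduce phase. The partition-naming scheme, the tab-separated record layout, and the hash-join variant shown here are assumptions for illustration, not the paper's exact algorithm.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class ColocatedJoinMapper
        extends Mapper<LongWritable, Text, Text, Text> {

    private final Map<String, String> reference = new HashMap<>();

    @Override
    protected void setup(Context context)
            throws IOException, InterruptedException {
        // Derive the matching reference partition from this split's file name,
        // e.g. /logs/part-00007 joins with /refdata/part-00007 (assumed layout).
        FileSplit split = (FileSplit) context.getInputSplit();
        Path refPath = new Path("/refdata/" + split.getPath().getName());
        FileSystem fs = refPath.getFileSystem(context.getConfiguration());
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(fs.open(refPath)))) {
            String line;
            while ((line = in.readLine()) != null) {
                // Records are assumed to be tab-separated key-value pairs.
                String[] kv = line.split("\t", 2);
                if (kv.length == 2) {
                    reference.put(kv[0], kv[1]);
                }
            }
        }
    }

    @Override
    protected void map(LongWritable offset, Text record, Context context)
            throws IOException, InterruptedException {
        String[] kv = record.toString().split("\t", 2);
        if (kv.length < 2) {
            return;
        }
        String match = reference.get(kv[0]);
        if (match != null) {
            // Emit the joined record directly; no reducer is needed.
            context.write(new Text(kv[0]), new Text(kv[1] + "\t" + match));
        }
    }
}
```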