Hadoop has become an attractive platform for large-scale data analytics. In this paper, we identify a major performance bottleneck of Hadoop: its inability to colocate related data on the same set of nodes. To overcome this bottleneck, we introduce CoHadoop, a lightweight extension of Hadoop that allows applications to control where data are stored. In contrast to previous approaches, CoHadoop retains the flexibility of Hadoop in that it does not require users to convert their data to a certain format (e.g., a relational database or a specific file format). Instead, applications give hints to CoHadoop that a set of files is related and may be processed jointly; CoHadoop then tries to colocate these files for improved efficiency. Our approach is designed such that the strong fault-tolerance properties of Hadoop are retained.
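In CoHadoop, such hints take the form of a file-level locator: files that share a locator value are treated as related, and a modified block placement policy tries to store their blocks on the same set of datanodes. The sketch below illustrates how an application might attach such a hint when creating files; the configuration key `cohadoop.file.locator`, the locator value, and the paths are illustrative assumptions, not the system's actual API.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LocatorHintExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical CoHadoop-specific setting: a locator ID attached to
        // files created by this client; files with the same ID are candidates
        // for colocation on the same set of datanodes.
        conf.setInt("cohadoop.file.locator", 42);
        FileSystem fs = FileSystem.get(conf);

        // Both partitions carry locator 42, so CoHadoop would attempt to
        // place their replicas on the same nodes (paths are illustrative).
        fs.create(new Path("/logs/2011-06-01/part-00000")).close();
        fs.create(new Path("/refdata/users/part-00000")).close();
    }
}
```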
Colocation can be used to improve the efficiency of many operations, including indexing, grouping, aggregation, columnar storage, joins, and sessionization.
We conduct a detailed study of joins and sessionization in the context of log processing, a common use case for Hadoop, and propose efficient map-only algorithms that exploit colocated data partitions. In our experiments, we observed that CoHadoop outperforms both plain Hadoop and previous work. In particular, our approach not only performs better than repartition-based algorithms, but also outperforms map-only algorithms that exploit data partitioning but not colocation.
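To make the map-only join idea concrete, the following is a minimal sketch of a mapper that joins two copartitioned inputs: because matching partitions are colocated, each map task can read the corresponding reference partition from a local replica and join without a shuffle or reduce phase. The partition-naming scheme, the tab-separated record layout, and the hash-join variant shown here are assumptions for illustration, not the paper's exact algorithm.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class ColocatedJoinMapper
        extends Mapper<LongWritable, Text, Text, Text> {

    private final Map<String, String> reference = new HashMap<>();

    @Override
    protected void setup(Context context)
            throws IOException, InterruptedException {
        // Derive the matching reference partition from this split's file name,
        // e.g. /logs/part-00007 joins with /refdata/part-00007 (assumed layout).
        FileSplit split = (FileSplit) context.getInputSplit();
        Path refPath = new Path("/refdata/" + split.getPath().getName());
        FileSystem fs = refPath.getFileSystem(context.getConfiguration());
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(fs.open(refPath)))) {
            String line;
            while ((line = in.readLine()) != null) {
                // Records are assumed to be tab-separated key-value pairs.
                String[] kv = line.split("\t", 2);
                if (kv.length == 2) {
                    reference.put(kv[0], kv[1]);
                }
            }
        }
    }

    @Override
    protected void map(LongWritable offset, Text record, Context context)
            throws IOException, InterruptedException {
        String[] kv = record.toString().split("\t", 2);
        if (kv.length < 2) {
            return;
        }
        String match = reference.get(kv[0]);
        if (match != null) {
            // Emit the joined record directly; no reducer is needed.
            context.write(new Text(kv[0]), new Text(kv[1] + "\t" + match));
        }
    }
}
```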