Implementations of map-reduce are being used to perform
many operations on very large data. We examine strategies
for joining several relations in the map-reduce environment.
Our new approach begins by identifying the “map-key,” the
set of attributes that identify the Reduce process to which a
Map process must send a particular tuple. Each attribute of
the map-key gets a “share,” which is the number of buckets
into which its values are hashed, to form a component of the
identifier of a Reduce process. Relations have their tuples
replicated in limited fashion, the degree of replication de-
pending on the shares for those map-key attributes that are
missing from their schema. We study the problem of opti-
mizing the shares, given a fixed number of Reduce processes.
An algorithm for detecting and fixing problems where a vari-
able is mistakenly included in the map-key is given. Then,
we consider two important special cases: chain joins and
star joins. In each case we are able to determine the map-
key and determine the shares that yield the least replication.
While the method we propose is not always superior to the
conventional way of using map-reduce to implement joins,
there are some important cases involving large-scale data
where our method wins, including: (1) analytic queries in
which a very large fact table is joined with smaller dimen-
sion tables, and (2) queries involving paths through graphs
with high out-degree, such as the Web or a social network.
Attachment | Size |
---|---|
join-mr.pdf | 243.91 KB |