Joins are essential for many data analysis tasks, but are
not supported directly by the MapReduce paradigm. While
there has been progress on equi-joins, implementation of join
algorithms in MapReduce in general is not sufficiently un-
derstood. We study the problem of how to map arbitrary
join conditions to Map and Reduce functions, i.e., a parallel
infrastructure that controls data flow based on key-equality
only. Our proposed join model simplifies creation of and
reasoning about joins in MapReduce. Using this model, we
derive a surprisingly simple randomized algorithm, called 1-
Bucket-Theta, for implementing arbitrary joins (theta-joins)
in a single MapReduce job. This algorithm only requires
minimal statistics (input cardinality) and we provide evi-
dence that for a variety of join problems, it is either close
to optimal or the best possible option. For some of the
problems where 1-Bucket-Theta is not the best choice, we
show how to achieve better performance by exploiting addi-
tional input statistics. All algorithms can be made ’memory-
aware’, and they do not require any modifications to the
MapReduce environment. Experiments show the effective-
ness of our approach.
Attachment | Size |
---|---|
sigmod11-okcan.pdf | 368.88 KB |