The effectiveness and scalability of MapReduce-based implementations
of complex data-intensive tasks depend on an even redistribution of
data between map and reduce tasks. In the presence of skewed data,
sophisticated redistribution approaches thus become necessary to
achieve load balancing among all reduce tasks to be executed in
parallel. For the complex problem of entity resolution with blocking,
we propose BlockSplit, a load balancing approach that supports
blocking techniques to reduce the search space of entity resolution.
The evaluation on a real cloud infrastructure shows the value and
effectiveness of the proposed approach.
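
To make the skew problem concrete, the following is a minimal sketch, not the algorithm from the paper. It assumes that matching a block of n entities costs n(n-1)/2 pairwise comparisons, so a single large block can dominate one reduce task. The function names, the splitting rule (split any block whose comparisons exceed the average load per reducer into `parts` near-equal sub-blocks), and the greedy assignment are illustrative assumptions; the details of BlockSplit itself are in the attached paper.

```python
import heapq


def pair_count(n):
    """Pairwise comparisons needed to match all entities in a block of size n."""
    return n * (n - 1) // 2


def whole_block_tasks(block_sizes):
    """Baseline: every block is one indivisible match task."""
    return [pair_count(n) for n in block_sizes]


def split_block_tasks(block_sizes, num_reducers, parts):
    """Hypothetical splitting rule (an assumption, not the paper's definition):
    blocks whose comparisons exceed the average load per reducer are cut into
    `parts` near-equal sub-blocks. Each sub-block is matched with itself and
    with every other sub-block, so no comparison is lost."""
    total = sum(pair_count(n) for n in block_sizes)
    threshold = total / num_reducers
    tasks = []
    for n in block_sizes:
        if pair_count(n) <= threshold:
            tasks.append(pair_count(n))
            continue
        base, extra = divmod(n, parts)
        subs = [base + (1 if i < extra else 0) for i in range(parts)]
        for i, a in enumerate(subs):
            tasks.append(pair_count(a))      # comparisons inside sub-block i
            for b in subs[i + 1:]:
                tasks.append(a * b)          # comparisons across two sub-blocks
    return tasks


def greedy_assign(tasks, num_reducers):
    """Assign tasks, largest first, to the currently least-loaded reducer.
    Returns the reducer loads in descending order."""
    loads = [0] * num_reducers
    heapq.heapify(loads)
    for cost in sorted(tasks, reverse=True):
        lightest = heapq.heappop(loads)
        heapq.heappush(loads, lightest + cost)
    return sorted(loads, reverse=True)


if __name__ == "__main__":
    sizes = [100, 10, 10, 10]  # one skewed block, made-up numbers
    print(greedy_assign(whole_block_tasks(sizes), 4))
    print(greedy_assign(split_block_tasks(sizes, 4, parts=4), 4))
```

With these made-up block sizes, the baseline leaves one reducer with 4950 of the 5085 comparisons, while the split variant spreads the same total almost evenly across the four reducers.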
Attachment | Size |
---|---|
cikm_poster_paper.pdf | 674.79 KB |