Abstract. Cloud infrastructures enable the efficient parallel
execution of data-intensive tasks such as entity resolution on
large datasets. We investigate challenges and possible solutions
for using the MapReduce programming model for parallel entity
resolution with Sorted Neighborhood (SN) blocking. We propose and
evaluate two efficient MapReduce-based implementations for
single- and multi-pass SN that either use multiple MapReduce jobs
or apply tailored data replication. We also propose an automatic
data partitioning approach for multi-pass SN to achieve load
balancing. Our evaluation on real-world datasets shows the high
efficiency and effectiveness of the proposed approaches.
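
To make the idea behind SN and boundary replication concrete, here is a minimal single-process sketch. It is not the paper's MapReduce implementation. It assumes the standard SN definition (sort entities by a blocking key, then compare each entity with the next w - 1 entities in sort order). The function names `sn_pairs` and `partitioned_sn_pairs`, the sample data, and the equal-size range partitioning are hypothetical choices made for illustration. The second function simulates range partitions processed independently, where each partition also receives the last w - 1 entities of its predecessor, so that no candidate pair across a partition boundary is lost.

```python
def sn_pairs(entities, key, w):
    """Sequential Sorted Neighborhood: candidate pairs inside a sliding window of size w."""
    ordered = sorted(entities, key=key)
    pairs = set()
    for i, a in enumerate(ordered):
        # Compare each entity with its next w - 1 successors in sort order.
        for b in ordered[i + 1 : i + w]:
            pairs.add((a, b))
    return pairs


def partitioned_sn_pairs(entities, key, w, num_partitions):
    """Simulated range-partitioned SN (assumption: equal-size partitions).

    Each partition is extended with the last w - 1 entities of the preceding
    data, which stands in for replicating boundary entities to the next reducer.
    """
    ordered = sorted(entities, key=key)
    size = -(-len(ordered) // num_partitions)  # ceiling division
    chunks = [ordered[i : i + size] for i in range(0, len(ordered), size)]

    pairs = set()
    prev_tail = []  # replicated boundary entities from the preceding partition
    for chunk in chunks:
        block = prev_tail + chunk
        start = len(prev_tail)  # index of the first non-replicated entity
        for i, a in enumerate(block):
            # Only emit pairs whose second entity belongs to this partition;
            # pairs among replicated entities were already emitted earlier.
            for j in range(max(i + 1, start), min(i + w, len(block))):
                pairs.add((a, block[j]))
        prev_tail = block[-(w - 1):] if w > 1 else []
    return pairs


if __name__ == "__main__":
    names = ["carla", "adam", "bert", "dora", "emil", "fred", "gina"]
    sequential = sn_pairs(names, key=lambda s: s, w=3)
    partitioned = partitioned_sn_pairs(names, key=lambda s: s, w=3, num_partitions=3)
    print(sequential == partitioned)
```

In this sketch the partitioned variant produces the same candidate pairs as the sequential one, which is the correctness property that replication at partition boundaries is meant to preserve.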
Attachment | Size |
---|---|
multi_pass_sn_with_mr.pdf | 739.14 KB |