Cloud infrastructures enable the efficient parallel execution of data-intensive
tasks such as entity resolution on large datasets. We investigate challenges and possi-
ble solutions of using the MapReduce programming model for parallel entity resolu-
tion. In particular, we propose and evaluate two MapReduce-based implementations
for Sorted Neighborhood blocking that either use multiple MapReduce jobs or apply
a tailored data replication.
Attachment | Size |
---|---|
Kolb2011ParallelSortedNeighborhoodBlockingwithMapReduce.pdf | 533.25 KB |