Entity resolution is a crucial step for data quality and data
integration. Learning-based approaches show high effectiveness,
but at the cost of poor efficiency. To reduce the typically high
execution times, we investigate how learning-based entity
resolution can be realized in a cloud infrastructure using
MapReduce. We propose and evaluate two efficient MapReduce-based
strategies for pair-wise similarity computation and classifier
application on the Cartesian product of two input sources. Our
evaluation is based on real-world datasets and shows the high
efficiency and effectiveness of the proposed approaches.
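
To make the setting concrete, the following is a minimal, single-process sketch of how pair-wise similarity computation and classifier application over the Cartesian product of two sources might be split into map and reduce steps. It is not one of the two strategies proposed in the paper, which the abstract does not detail. It assumes a generic block-pair partitioning. The record fields (`id`, `name`, `year`), all function names, and the fixed weights and threshold that stand in for a trained classifier are hypothetical.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def map_phase(source_r, source_s, r_blocks, s_blocks):
    """Replicate each record to every reduce key (i, j) that needs it.

    Records of R are split into r_blocks blocks and records of S into
    s_blocks blocks; reduce key (i, j) receives R block i and S block j,
    so every pair of the Cartesian product is compared exactly once.
    """
    for idx, rec in enumerate(source_r):
        i = idx % r_blocks
        for j in range(s_blocks):
            yield (i, j), ("R", rec)
    for idx, rec in enumerate(source_s):
        j = idx % s_blocks
        for i in range(r_blocks):
            yield (i, j), ("S", rec)


def shuffle(pairs):
    """Group map output by key, as the MapReduce framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def similarity_vector(a, b):
    """Compute one similarity value per attribute (hypothetical fields)."""
    name_sim = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
    year_sim = 1.0 if a["year"] == b["year"] else 0.0
    return (name_sim, year_sim)


def classify(features, weights=(0.8, 0.2), threshold=0.75):
    """Stand-in for a trained classifier: a fixed weighted threshold."""
    score = sum(w * f for w, f in zip(weights, features))
    return score >= threshold


def reduce_phase(values):
    """Compare all R records with all S records of one block pair."""
    r_recs = [rec for tag, rec in values if tag == "R"]
    s_recs = [rec for tag, rec in values if tag == "S"]
    for a in r_recs:
        for b in s_recs:
            if classify(similarity_vector(a, b)):
                yield a["id"], b["id"]


def resolve(source_r, source_s, r_blocks=2, s_blocks=2):
    """Run map, shuffle, and reduce in-process and collect all matches."""
    groups = shuffle(map_phase(source_r, source_s, r_blocks, s_blocks))
    matches = set()
    for values in groups.values():
        matches.update(reduce_phase(values))
    return matches
```

Calling `resolve(source_r, source_s)` with two lists of record dictionaries returns the set of `(id, id)` pairs classified as matches. In this sketch, raising `r_blocks` and `s_blocks` increases the number of independent reduce tasks at the price of replicating each record to more of them.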
Attachment | Size |
---|---|
learning_based_er_with_mr.pdf | 702.91 KB |