On evaluation and training-set construction for duplicate detection

Authors: Bilenko, M; Mooney, RJ
Year: 2003
Venue: Proceedings of the KDD-2003 workshop on data cleaning

A variety of experimental methodologies have been used to evaluate
the accuracy of duplicate-detection systems. We advocate presenting
precision-recall curves as the most informative evaluation
methodology. We also discuss a number of issues that arise when
evaluating and assembling training data for adaptive systems that
use machine learning to tune themselves to specific applications.
We consider several different application scenarios and experimentally
examine the effectiveness of alternative methods of collecting
training data under each scenario. We propose two new approaches
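
As a rough illustration of the precision-recall evaluation the abstract advocates, the sketch below ranks candidate record pairs by similarity score and records precision and recall at each rank. The function name, the pair identifiers, and the input layout are assumptions made for this example and are not taken from the paper; tied scores are not given any special treatment.

```python
def precision_recall_points(scored_pairs, true_duplicates):
    """Return (recall, precision) at each rank of a similarity-ranked pair list.

    scored_pairs: list of (pair_id, similarity_score) tuples (hypothetical layout).
    true_duplicates: set of pair_ids labelled as real duplicates.
    """
    # Most similar pairs first, so each prefix is "everything above a threshold".
    ranked = sorted(scored_pairs, key=lambda pair: pair[1], reverse=True)
    total_true = len(true_duplicates)
    points = []
    true_positives = 0
    for rank, (pair_id, _score) in enumerate(ranked, start=1):
        if pair_id in true_duplicates:
            true_positives += 1
        precision = true_positives / rank
        recall = true_positives / total_true if total_true else 0.0
        points.append((recall, precision))
    return points


# Toy data: four candidate pairs, two of which are real duplicates.
example_points = precision_recall_points(
    [("a", 0.9), ("b", 0.8), ("c", 0.4), ("d", 0.1)],
    {"a", "c"},
)
```

Plotting these points traces how precision changes as the similarity threshold is lowered to recover more duplicates, which a single accuracy figure at one threshold does not show.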

Adaptive duplicate detection using learnable string similarity measures

Authors: Bilenko, M; Mooney, RJ
Year: 2003
Venue: Proceedings of the ninth ACM SIGKDD international conference

Identifying approximately duplicate records in databases is an essential step in data cleaning and data integration. Most existing approaches have relied on generic or manually tuned distance metrics for estimating the similarity of potential duplicates. In this paper, we present a framework for improving duplicate detection using trainable measures of textual similarity.
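
To make the idea of a tunable string distance concrete, here is a minimal sketch of an edit distance whose operation costs are parameters. It is a generic weighted Levenshtein distance, not the measures from the paper: the function name and the cost parameters are assumptions for this example, and the costs are set by hand here where a trainable measure would estimate them from labelled duplicate pairs.

```python
def weighted_edit_distance(s, t, sub_cost=1.0, ins_cost=1.0, del_cost=1.0):
    """Edit distance between s and t with adjustable per-operation costs.

    The cost parameters are hypothetical stand-ins for values that a
    trainable similarity measure would learn from data.
    """
    # prev[j] holds the cost of turning the processed prefix of s into t[:j].
    prev = [j * ins_cost for j in range(len(t) + 1)]
    for i, char_s in enumerate(s, start=1):
        curr = [i * del_cost]
        for j, char_t in enumerate(t, start=1):
            substitute = prev[j - 1] + (0.0 if char_s == char_t else sub_cost)
            delete = prev[j] + del_cost
            insert = curr[j - 1] + ins_cost
            curr.append(min(substitute, delete, insert))
        prev = curr
    return prev[-1]


# Unit costs give the classic Levenshtein distance; cheaper substitutions
# pull strings that differ mostly by character swaps closer together.
unit_distance = weighted_edit_distance("kitten", "sitting")
cheap_sub_distance = weighted_edit_distance("kitten", "sitting", sub_cost=0.5)
```

Changing the costs changes which record pairs fall under a given duplicate threshold, which is why fitting them to a specific domain can matter.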