On evaluation and training-set construction for duplicate detection

Authors: 
Bilenko, M; Mooney, RJ
Year: 
2003
Venue: 
Proceedings of the KDD-2003 workshop on data cleaning
URL: 
http://www.cs.utexas.edu/~ml/papers/marlin-kdd-wkshp-03.pdf
Attachment: 
Bilenko2003Onevaluationandtrainingset.pdf (117.11 KB)

A variety of experimental methodologies have been used to evaluate
the accuracy of duplicate-detection systems. We advocate presenting
precision-recall curves as the most informative evaluation
methodology. We also discuss a number of issues that arise when
evaluating and assembling training data for adaptive systems that
use machine learning to tune themselves to specific applications.
We consider several different application scenarios and experimentally
examine the effectiveness of alternative methods of collecting
training data under each scenario. We propose two new approaches
to collecting training data, called static-active learning and weakly-labeled
non-duplicates, and present experimental results on their
effectiveness.
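
As an illustration of the precision-recall evaluation advocated above, the following is a minimal sketch, not the authors' code. It assumes a duplicate detector that assigns a similarity score to each candidate record pair and a gold label saying whether the pair is a true duplicate. The function name `precision_recall_points`, its parameters, and the toy scores are hypothetical and chosen only for this example.

```python
from typing import List, Optional, Tuple


def precision_recall_points(
    scored_pairs: List[Tuple[float, bool]],
    total_true_duplicates: Optional[int] = None,
) -> List[Tuple[float, float, float]]:
    """Return (threshold, precision, recall) points for a ranked pair list.

    scored_pairs: (similarity, is_duplicate) for each candidate record pair.
    total_true_duplicates: number of true duplicate pairs in the data; if
        omitted, it is assumed that every true duplicate appears in
        scored_pairs (an assumption of this sketch).
    """
    # Rank candidate pairs from most to least similar.
    ranked = sorted(scored_pairs, key=lambda pair: pair[0], reverse=True)
    if total_true_duplicates is None:
        total_true_duplicates = sum(1 for _, is_dup in ranked if is_dup)
    if total_true_duplicates == 0:
        raise ValueError("recall is undefined without any true duplicates")

    points = []
    true_positives = 0
    for rank, (score, is_dup) in enumerate(ranked, start=1):
        if is_dup:
            true_positives += 1
        # Emit one point per distinct score, so that tied pairs are
        # accepted or rejected together at a given threshold.
        is_last_of_tie = rank == len(ranked) or ranked[rank][0] != score
        if is_last_of_tie:
            precision = true_positives / rank
            recall = true_positives / total_true_duplicates
            points.append((score, precision, recall))
    return points


if __name__ == "__main__":
    # Toy scores, invented for illustration only.
    toy_pairs = [(0.95, True), (0.90, True), (0.80, False),
                 (0.70, True), (0.40, False)]
    for threshold, precision, recall in precision_recall_points(toy_pairs):
        print(f"threshold={threshold:.2f}  "
              f"precision={precision:.2f}  recall={recall:.2f}")
```

Each point corresponds to declaring as duplicates all pairs whose score is at or above the threshold, so the full list traces the trade-off between precision and recall instead of reporting a single operating point.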