Framework for Evaluating Clustering Algorithms in Duplicate Detection

Hassanzadeh, Oktie; Chiang, Fei; Miller, Renée; Lee, Hyun Chul
Citations range: 10 - 49
vldb09-1025.pdf (473.64 KB)

The presence of duplicate records is a major data quality concern in
large databases. To detect duplicates, entity resolution, also known
as duplicate detection or record linkage, is used as part of the
data cleaning process to identify records that potentially refer to
the same real-world entity. We present the Stringer system that
provides an evaluation framework for understanding what barriers
remain towards the goal of truly scalable and general-purpose duplicate
detection algorithms. In this paper, we use Stringer to
evaluate the quality of the clusters (groups of potential duplicates)
obtained from several unconstrained clustering algorithms used in
concert with approximate join techniques. Our work is motivated
by the recent significant advancements that have made approximate
join algorithms highly scalable. Our extensive evaluation reveals
that some clustering algorithms that have never been considered
for duplicate detection perform extremely well in terms of both
accuracy and scalability.
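
To make the pipeline described above concrete, the following is a minimal sketch of an approximate join followed by an unconstrained clustering step. It is not the Stringer implementation. The function names, the sample records, the Jaccard measure over word tokens, and the 0.6 threshold are all assumptions chosen for illustration; the naive all-pairs comparison stands in for a scalable approximate join, and connected components is used only as one simple example of a clustering that does not need the number of clusters in advance.

```python
from itertools import combinations


def tokens(record):
    """Lowercased word tokens of a record string."""
    return set(record.lower().split())


def jaccard(a, b):
    """Jaccard similarity of two token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def approximate_join(records, threshold):
    """Naive all-pairs similarity join (illustrative stand-in for a
    scalable approximate join). Returns index pairs scoring >= threshold."""
    toks = [tokens(r) for r in records]
    return [
        (i, j)
        for i, j in combinations(range(len(records)), 2)
        if jaccard(toks[i], toks[j]) >= threshold
    ]


def cluster_by_components(n, pairs):
    """Group record indices into connected components of the similarity
    graph using union-find. The number of clusters is not fixed up front."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in pairs:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[max(ri, rj)] = min(ri, rj)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Hypothetical records, invented for this sketch.
records = [
    "John Smith 12 Main Street Toronto",
    "John Smith 12 Main St Toronto",
    "Smith John 12 Main Street Toronto",
    "Mary Jones 7 Oak Avenue Ottawa",
    "Mary Jones 7 Oak Ave Ottawa",
]
pairs = approximate_join(records, threshold=0.6)
clusters = cluster_by_components(len(records), pairs)
print(clusters)
```

Each resulting cluster is a group of potential duplicates. Raising or lowering the threshold changes which pairs the join emits and therefore how the records are grouped.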