cs.cornell.edu

Eliminating fuzzy duplicates in data warehouses

Authors: Ananthakrishna, R; Chaudhuri, S; Ganti, V
Year: 2002
Venue: VLDB 2002

The duplicate elimination problem, that of detecting multiple tuples that describe the same real-world entity, is an important data cleaning problem. Previous domain-independent solutions to this problem relied on standard textual similarity functions (e.g., edit distance, cosine metric) between multi-attribute tuples. However, such approaches produce large numbers of false positives when they are also expected to identify domain-specific abbreviations and conventions.
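
To make the limitation concrete, here is a minimal Python sketch of the two textual similarity functions named in the abstract. It is an illustration only, not the method proposed in the paper. The function names, the normalization of edit distance by the longer string's length, and the example strings are all assumptions chosen for this sketch.

```python
import math
from collections import Counter


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def edit_similarity(a: str, b: str) -> float:
    """Edit distance rescaled to [0, 1]; 1.0 means identical strings.
    Normalizing by the longer length is an assumption of this sketch."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def cosine_similarity(a: str, b: str) -> float:
    """Cosine of the angle between whitespace-token count vectors."""
    va, vb = Counter(a.split()), Counter(b.split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return 0.0 if norm == 0 else dot / norm


# Hypothetical values: a true duplicate written as an abbreviation,
# and an unrelated value that happens to look similar.
same_entity = edit_similarity("usa", "united states")
different_entity = edit_similarity("usa", "uae")
print(same_entity, different_entity)
print(cosine_similarity("usa", "united states"))
```

In this example the abbreviation pair ("usa", "united states") has edit distance 10 and scores lower than the unrelated pair ("usa", "uae"), which has edit distance 2. Any threshold loose enough to accept the abbreviation also accepts the unrelated pair, which is the false-positive problem the abstract describes. The token-based cosine score for the abbreviation pair is 0.0, since the two strings share no tokens.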
