Venue:
Research Issues on Data Mining and Knowledge Discovery, 1997
URL:
http://citeseer.ist.psu.edu/monge97efficient.html
Detecting database records that are approximate duplicates, but not exact duplicates, is an important task. Databases may contain duplicate records concerning the same realworld entity because of data entry errors, because of unstandardized abbreviations, or because of differences in the detailed schemas of records from multiple databases, among other reasons. In this paper, we present an efficient algorithm for recognizing clusters of approximately duplicate records. Three key ideas distinguish the algorithm presented. First, a version of the Smith-Waterman algrotihm for computing minimum edit-distance is used as a domain-independent method to recognize pairs of approcimately duplicate records. Second, the union/find algrotihm is used to keep track of clusters of duplicate records incrementally, as pairwse duplicate relationships are discovered. Third, the algorthm uses a priority queue of cluster subsets to respond adaptively to the size and homogeneity of the clusters discovered as the database is scanned. This typically reduces by over 75% the number of times that the expensive pairwise record matching (Smith-Waterman or other) is applied without impairing accuracy. Comprehensive experiments on synthetic databases and on a real database of bibliographic records confirm the effectiveness of the new algorithms.