A knowledge-based approach for duplicate elimination in data cleaning

Authors: 
Low, WL; Lee, ML; Ling, TW
Year: 
2001
Venue: 
Information Systems
URL: 
http://portal.acm.org/citation.cfm?id=514390.514393
Citations: 
88
Citations range: 
50 - 99

Existing duplicate elimination methods for data cleaning work by computing the degree of similarity between nearby records in a sorted database. High recall can be achieved by accepting records with low degrees of similarity as duplicates, at the cost of lower precision. High precision can be achieved analogously at the cost of lower recall. This is the recall-precision dilemma. We develop a generic knowledge-based framework for effective data cleaning that can implement any existing data cleaning strategy, and more. We propose a new method for computing transitive closure under uncertainty to deal with the merging of groups of inexact duplicate records, and explain why small changes to window sizes have little effect on the results of the sorted neighborhood method. Experiments with two real-world datasets show that this approach can accurately identify duplicates and anomalies with high recall and precision, thus effectively resolving the recall-precision dilemma.
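The baseline the abstract describes can be sketched in a few lines: sort the records on a key, compare each record only with its neighbors inside a fixed-size window, accept pairs above a similarity threshold, and merge matched pairs transitively into duplicate groups. The sketch below is not the paper's knowledge-based framework or its transitive closure under uncertainty; it is a minimal illustration using a plain string-similarity measure (`difflib.SequenceMatcher`) and an ordinary union-find closure, with all names (`sorted_neighborhood`, `merge_duplicates`) hypothetical.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    # Plain string similarity; a stand-in for the knowledge-based
    # matching rules described in the paper.
    return SequenceMatcher(None, a, b).ratio()

def sorted_neighborhood(records, key, window=3, threshold=0.85):
    """Slide a fixed-size window over records sorted on a key and
    emit the pairs whose similarity meets the threshold.
    Raising the threshold trades recall for precision, and vice
    versa -- the recall-precision dilemma in miniature."""
    recs = sorted(records, key=key)
    pairs = []
    for i in range(len(recs)):
        for j in range(i + 1, min(i + window, len(recs))):
            if similarity(key(recs[i]), key(recs[j])) >= threshold:
                pairs.append((recs[i], recs[j]))
    return pairs

def merge_duplicates(records, pairs):
    # Ordinary (certain) transitive closure of the pairwise matches,
    # via union-find: if a~b and b~c, then a, b, c form one group.
    parent = {r: r for r in records}
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for a, b in pairs:
        parent[find(a)] = find(b)
    groups = {}
    for r in records:
        groups.setdefault(find(r), []).append(r)
    return list(groups.values())

records = ["John Smith", "Jane Doe", "Jon Smith"]
pairs = sorted_neighborhood(records, key=lambda r: r)
groups = merge_duplicates(records, pairs)
```

With the example data, "John Smith" and "Jon Smith" fall into the same window after sorting and score above the threshold, so they merge into one group while "Jane Doe" stays alone.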