IntelliClean: a knowledge-based intelligent data cleaner

Authors: 
Lee, M.L.; Ling, T.W.; Low, W.L.
Year: 
2000
Venue: 
Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining, 2000
URL: 
http://portal.acm.org/citation.cfm?id=347154
Citations: 
149
Citations range: 
100 - 499
Attachment: 
Lee2000IntelliCleanaknowledgebased.pdf (203.97 KB)

Existing data cleaning methods work on the basis of computing the degree of similarity between nearby records in a sorted database. High recall is achieved by accepting records with low degrees of similarity as duplicates, at the cost of lower precision. High precision is achieved analogously at the cost of lower recall. This is the recall-precision dilemma.
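The trade-off described above can be made concrete with a small sketch. It is not taken from the paper: the sample records, the token-level Jaccard similarity, the window size, and the helper names (`jaccard`, `candidate_pairs`, `detect`, `precision_recall`) are all assumptions chosen for illustration. Records are sorted, each one is compared only with its neighbours inside a sliding window, and a pair is declared a duplicate when its similarity reaches a threshold.

```python
from itertools import combinations

# Hypothetical sample data; records 0-2 and 3-4 describe the same entities.
RECORDS = [
    "john smith 12 main st",
    "jon smith 12 main st",
    "john smyth 12 main street",
    "mary jones 4 oak ave",
    "mary jones 4 oak avenue",
    "mark jonas 9 oak ave",
]
TRUE_GROUPS = [[0, 1, 2], [3, 4]]
TRUTH = {pair for group in TRUE_GROUPS for pair in combinations(group, 2)}


def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity (an assumed, deliberately simple measure)."""
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / len(ta | tb)


def candidate_pairs(records, window):
    """Sort the records and pair each one with the next window-1 neighbours."""
    order = sorted(range(len(records)), key=lambda i: records[i])
    for pos, i in enumerate(order):
        for j in order[pos + 1 : pos + window]:
            yield (min(i, j), max(i, j))


def detect(records, window, threshold):
    """Return the candidate pairs whose similarity reaches the threshold."""
    return {
        (i, j)
        for i, j in candidate_pairs(records, window)
        if jaccard(records[i], records[j]) >= threshold
    }


def precision_recall(predicted, truth):
    """Precision and recall of the predicted duplicate pairs."""
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 1.0
    recall = true_positives / len(truth) if truth else 1.0
    return precision, recall


if __name__ == "__main__":
    for threshold in (0.6, 0.2):
        found = detect(RECORDS, window=3, threshold=threshold)
        print(threshold, precision_recall(found, TRUTH))
```

On this toy data, the strict threshold (0.6) accepts only true duplicates but misses half of them, while the loose threshold (0.2) finds every true pair and also admits one false match.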
In this paper, we propose a generic knowledge-based framework for effective data cleaning that implements existing cleaning strategies and more. We develop a new method to compute transitive closure under uncertainty, which handles the merging of groups of inexact duplicate records. Experimental results show that this framework can identify duplicates and anomalies with high recall and precision.
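The abstract does not spell out how transitive closure under uncertainty is computed, so the following is only a minimal sketch under stated assumptions: each detected duplicate pair carries a certainty factor between 0 and 1, the certainty of a merged group is the product of the certainties involved, and a merge is rejected when that product falls below a user-chosen threshold. The function name `merge_with_certainty` and the sample pairs are hypothetical.

```python
def merge_with_certainty(pairs, threshold):
    """Group records from (a, b, certainty) pairs.

    Assumption: merging two groups multiplies their certainties by the
    certainty of the linking pair; the merge is skipped if the result
    would drop below the threshold.
    """
    group_of = {}   # record -> id of the group it belongs to
    members = {}    # group id -> set of records in the group
    certainty = {}  # group id -> certainty of the group as a whole

    # Process the most certain pairs first so strong evidence merges early.
    for a, b, cf in sorted(pairs, key=lambda p: -p[2]):
        for record in (a, b):
            if record not in group_of:
                group_of[record] = record
                members[record] = {record}
                certainty[record] = 1.0

        ga, gb = group_of[a], group_of[b]
        if ga == gb:
            continue  # already in the same group

        combined = certainty[ga] * certainty[gb] * cf
        if combined < threshold:
            continue  # chaining would make the group too uncertain

        members[ga] |= members[gb]
        for record in members[gb]:
            group_of[record] = ga
        certainty[ga] = combined
        del members[gb], certainty[gb]

    return [(sorted(group), certainty[gid]) for gid, group in members.items()]


if __name__ == "__main__":
    pairs = [("A", "B", 0.9), ("B", "C", 0.8), ("C", "D", 0.7)]
    for group, cf in merge_with_certainty(pairs, threshold=0.7):
        print(group, round(cf, 2))
```

With these sample pairs, a plain transitive closure would put A, B, C and D into one group. Under the assumed rule, A, B and C merge with a combined certainty of about 0.72, while D stays separate because adding it would lower the certainty to roughly 0.50, below the 0.7 threshold.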