Problems, Methods, and Challenges in Comprehensive Data Cleansing

Authors: 
Müller, Heiko; Freytag, Johann-Christoph
Year: 
2003
Venue: 
HUB-IB-164, Humboldt University Berlin
URL: 
http://www.dbis.informatik.hu-berlin.de/fileadmin/research/papers/techreports/2003-hub_ib_164-mueller.pdf
Citations: 
85
Attachment: 
Mller2003ProblemsMethodsand.pdf (121.75 KB)

Cleansing data of impurities is an integral part of data processing and maintenance. This has led to the development of a broad range of methods intended to enhance the accuracy, and thereby the usability, of existing data. This paper presents a survey of data cleansing problems, approaches, and methods. We classify the various types of anomalies occurring in data that have to be eliminated, and we define a set of quality criteria that comprehensively cleansed data has to meet. Based on this classification, we evaluate and compare existing approaches to data cleansing with respect to the types of anomalies they handle and eliminate. We also describe in general the different steps in data cleansing, specify the methods used within the cleansing process, and give an outlook on research directions that complement the existing systems.