Authors:
Galhardas, H; Florescu, D; Shasha, D; Simon, E; Saita, C.
URL:
http://citeseer.ist.psu.edu/451854.html
The problem of data cleaning, which consists of removing
inconsistencies and errors from original data sets, is well known in
the area of decision support systems and data warehouses. This holds
regardless of the application, whether it involves joining relational
databases, web data, or scientific data. In all cases, existing ETL
(Extraction, Transformation, Loading) and data cleaning tools are
insufficient for writing data cleaning programs. The main challenge is the design
and implementation of a dataflow graph that effectively and
efficiently generates clean data. Needed improvements to the current
state of the art include (i) a clear separation between the logical
specification of data transformations and their physical
implementation, (ii) an explanation of the reasoning behind cleaning
results, and (iii) interactive facilities to tune a data cleaning
program. This paper presents a language, an execution model and
algorithms that enable users to express data cleaning specifications
declaratively and perform the cleaning efficiently. We use as an
example a set of bibliographic references used to construct the
Citeseer Web site. The underlying data integration problem is to
derive structured and clean textual records so that meaningful
queries can be performed. Experimental results are reported that
assess the proposed framework for data cleaning.
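
As a rough illustration of point (i), the separation between a logical specification and its physical implementation, the sketch below states a duplicate-matching rule once and runs it with two interchangeable strategies. It is not the paper's language or algorithms: the names MatchSpec, match_nested_loop and match_blocked, the use of Python's difflib as a similarity measure, and the first-letter blocking key are all assumptions made for illustration. Each returned pair carries its similarity score, which is one simple way a result could be explained to the user, as in point (ii).

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from itertools import combinations


@dataclass(frozen=True)
class MatchSpec:
    """Logical specification (hypothetical): what to match, not how."""
    field: str        # record field to compare
    threshold: float  # minimum similarity for two records to match


def similarity(a: str, b: str) -> float:
    # Stand-in similarity measure; the paper does not prescribe this one.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_nested_loop(records, spec):
    """Physical plan 1: compare every pair of records."""
    matches = []
    for (i, r), (j, s) in combinations(enumerate(records), 2):
        score = similarity(r[spec.field], s[spec.field])
        if score >= spec.threshold:
            matches.append((i, j, round(score, 2)))
    return matches


def match_blocked(records, spec):
    """Physical plan 2: compare only records sharing a blocking key.

    The key (first letter of the field) is an illustrative assumption;
    it reduces comparisons but can miss pairs that differ in that letter.
    """
    blocks = {}
    for i, r in enumerate(records):
        blocks.setdefault(r[spec.field][:1].lower(), []).append(i)
    matches = []
    for ids in blocks.values():
        for i, j in combinations(ids, 2):
            score = similarity(records[i][spec.field], records[j][spec.field])
            if score >= spec.threshold:
                matches.append((i, j, round(score, 2)))
    return sorted(matches)


records = [
    {"title": "Declarative Data Cleaning"},
    {"title": "declarative data cleaning."},
    {"title": "Query Optimization"},
]
spec = MatchSpec(field="title", threshold=0.9)

pairs_full = match_nested_loop(records, spec)
pairs_blocked = match_blocked(records, spec)
```

Both strategies receive the same MatchSpec, so the choice between them can be made on cost grounds without rewriting the rule. On this small input they find the same pair; in general a blocking strategy trades some recall for fewer comparisons.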