Data cleaning

An overview of OntoClean

Authors: 
Guarino, N; Welty, CA
Year: 
2009
Venue: 
Handbook on ontologies

OntoClean is a methodology for validating the ontological adequacy
of taxonomic relationships. It is based on highly general ontological notions drawn from philosophy, like essence, identity, and unity, which are used to characterize relevant aspects of the intended meaning of the properties, classes, and relations that make up an ontology. These aspects are represented by formal metaproperties, which impose several constraints on the taxonomic structure of an ontology. The analysis of these constraints helps in evaluating and validating the choices made.

Reference Fusion and Flexible Querying

Authors: 
Saïs, Fatiha; Thomopoulos, Rallou
Year: 
2008
Venue: 
ODBASE--OTM Conferences 2008: Monterrey, Mexico

This paper deals with the issue of data fusion, which arises once reconciliations between references have been determined. The objective of this task is to fuse the descriptions of references that refer to the same real-world entity so as to obtain a unique representation. In order to deal with the uncertainty in the values associated with the attributes, we have chosen to represent the results of reference fusion in a formalism based on fuzzy sets. We indicate how the confidence degrees are computed.
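
A minimal sketch of the idea (not the paper's actual formulas): candidate values coming from reconciled references are kept as a fuzzy set, each with a confidence degree obtained here by simply normalising source confidences. The attribute values and confidences are invented for illustration.

from collections import defaultdict

def fuse_attribute(observations):
    """observations: (value, source_confidence) pairs for one attribute.
    Returns a fuzzy set: each candidate value with a normalised confidence degree."""
    scores = defaultdict(float)
    for value, conf in observations:
        scores[value] += conf
    total = sum(scores.values()) or 1.0
    return {value: round(s / total, 3) for value, s in scores.items()}

print(fuse_attribute([("Paris", 0.9), ("Paris", 0.7), ("Pariss", 0.2)]))
# {'Paris': 0.889, 'Pariss': 0.111}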

Combining a Logical and a Numerical Method for Data Reconciliation

Authors: 
Saïs, Fatiha; Pernelle, Nathalie; Rousset, Marie-Christine
Year: 
2009
Venue: 
JoDS - Journal of Data Semantics (LNCS subline, Springer)

The reference reconciliation problem consists in deciding whether different identifiers refer to the same data, i.e. correspond to the same real-world entity. In this article we present a reference reconciliation approach which combines a logical method for reference reconciliation called L2R and a numerical one called N2R. This approach exploits the schema and data semantics, which is translated into a set of Horn FOL rules of reconciliation. These rules are used in L2R to infer exact decisions of both reconciliation and non-reconciliation.
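
The flavour of the logical part can be sketched as below, assuming a hypothetical schema in which an isbn attribute is declared functional and discriminating; this single rule is illustrative only and is not the paper's actual rule set.

def reconcile_on_key(ref1, ref2, key="isbn"):
    """Return True (reconcile), False (non-reconcile) or None (no exact decision)."""
    v1, v2 = ref1.get(key), ref2.get(key)
    if v1 is None or v2 is None:
        return None   # the logical method abstains; a numerical method can still score the pair
    return v1 == v2

print(reconcile_on_key({"isbn": "0-111-22222-3", "title": "Databases"},
                       {"isbn": "0-111-22222-3", "title": "Database Systems"}))
# True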

Data Quality: Concepts, Methodologies and Techniques

Authors: 
Batini, C.; Scannapieca, M.
Year: 
2006

Data fusion

Authors: 
Bleiholder, J; Naumann, F
Year: 
2008
Venue: 
ACM Computing Surveys

The development of the Internet in recent years has made it possible and useful to access many different information systems anywhere in the world to obtain information. While there is much research on the integration of heterogeneous information systems, most commercial systems stop short of the actual integration of available data. Data fusion is the process of fusing multiple records representing the same real-world object into a single, consistent, and clean representation.
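
A small sketch of that fusion step, assuming the duplicate records have already been identified: each attribute is resolved by ignoring nulls and then keeping the most frequent surviving value, two simple strategies of the kind the survey catalogues. The records are invented for illustration.

from collections import Counter

def fuse_records(records):
    """Fuse records describing the same entity into one representation, attribute
    by attribute: drop nulls, then keep the most frequent remaining value."""
    attributes = {a for r in records for a in r}
    fused = {}
    for attr in attributes:
        values = [r[attr] for r in records if r.get(attr) is not None]
        fused[attr] = Counter(values).most_common(1)[0][0] if values else None
    return fused

duplicates = [{"name": "J. Smith", "city": "Bonn"},
              {"name": "John Smith", "city": "Bonn"},
              {"name": "John Smith", "city": None}]
print(fuse_records(duplicates))   # e.g. {'name': 'John Smith', 'city': 'Bonn'}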

Conditional Functional Dependencies for Data Cleaning

Authors: 
Bohannon, Philip; Fan, Wenfei; Geerts, Floris; Jia, Xibei; Kementsietsidis, Anastasios
Year: 
2007
Venue: 
ICDE, 2007

We propose a class of constraints, referred to as conditional functional dependencies (CFDs), and study their applications in data cleaning. In contrast to traditional functional dependencies (FDs) that were developed mainly for schema design, CFDs aim at capturing the consistency of data by incorporating bindings of semantically related values. For CFDs we provide an inference system analogous to Armstrong's axioms for FDs, as well as consistency analysis.
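
To make the contrast with plain FDs concrete, the sketch below checks a single hypothetical CFD on (cc, zip) -> city with a two-row pattern tableau: a wildcard row that enforces the embedded FD only where cc = "44", and a constant row that binds zip "EH4" to "Edinburgh". The relation, attribute names and tableau are invented; real CFD validation (such as the SQL-based detection in the paper) is more involved.

cfd = {"lhs": ("cc", "zip"), "rhs": "city",
       "patterns": [({"cc": "44", "zip": "_"}, "_"),            # FD holds where cc = 44
                    ({"cc": "44", "zip": "EH4"}, "Edinburgh")]}  # constant binding

def matches(value, pattern):
    return pattern == "_" or value == pattern

def cfd_violations(rows, cfd):
    """Rows violating the CFD: a matching pattern forces a different rhs constant,
    or two pattern-matching rows agree on the lhs but disagree on the rhs."""
    bad, seen = [], {}
    for row in rows:
        applicable = [p for p in cfd["patterns"]
                      if all(matches(row[a], v) for a, v in p[0].items())]
        if not applicable:
            continue
        key = tuple(row[a] for a in cfd["lhs"])
        constant_ok = all(matches(row[cfd["rhs"]], p[1]) for p in applicable)
        fd_ok = seen.setdefault(key, row[cfd["rhs"]]) == row[cfd["rhs"]]
        if not (constant_ok and fd_ok):
            bad.append(row)
    return bad

rows = [{"cc": "44", "zip": "EH4", "city": "Edinburgh"},
        {"cc": "44", "zip": "EH4", "city": "London"}]
print(cfd_violations(rows, cfd))   # the second row is reported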

A Survey of Data Quality Tools

Authors: 
Barateiro, José; Galhardas, Helena
Year: 
2005
Venue: 
Datenbankspektrum, Vol. 14, 2005

Data quality tools aim at detecting and correcting data problems that affect the accuracy and efficiency of data analysis applications. We propose a classification of the most relevant commercial and research data quality tools that can be used as a framework for comparing tools and understanding their functionalities.

Data Quality at a Glance

Authors: 
Scannapieco, M; Missier, P; Batini, C
Year: 
2005
Venue: 
Datenbankspektrum, Vol. 14, 2005

The paper provides an overview of data quality, in terms of its multidimensional nature. A set of data quality dimensions is defined, including accuracy, completeness, time-related dimensions and consistency. Several practical examples on how such dimensions can be measured and used are also described. The definitions for data quality dimensions are placed in the context of other research proposals for sets of data quality dimensions, showing similarities and differences.
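
As a tiny example of operationalising one such dimension, the sketch below measures completeness as the fraction of non-null cells. This is just one common interpretation of the dimension, and the sample records are invented.

def completeness(rows, attributes):
    """Completeness as the ratio of non-null cells to all cells of the given attributes."""
    cells = [row.get(a) for row in rows for a in attributes]
    return sum(v is not None for v in cells) / len(cells)

people = [{"name": "Ada", "birth_year": 1815},
          {"name": "Alan", "birth_year": None}]
print(completeness(people, ["name", "birth_year"]))   # 0.75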

Improving data cleaning quality using a data lineage facility

Authors: 
Galhardas, H; Florescu, D; Shasha, D; Simon, E; Saita, CA
Year: 
2001
Venue: 
Proc. Conf. on Data Management and Data Warehouses (DMDW)

The problem of data cleaning, which consists of removing inconsistencies and errors from original data sets, is well known in the area of decision support systems and data warehouses. However, for some applications, existing ETL (Extraction Transformation Loading) and data cleaning tools for writing data cleaning programs are insufficient. One important challenge with them is the design of a data flow graph that effectively generates clean data. A generalized difficulty is the lack of explanation of cleaning results and user interaction.

Probabilistic Name and Address Cleaning and Standardization

Authors: 
Christen, P.; Churches, T.; Zhu, J.
Year: 
2002
Venue: 
Proceedings of the Australasian Data Mining Workshop, 2002

In the absence of a shared unique key, an ensemble of non-unique personal attributes such as names and addresses is often used to link data from disparate sources. Such data matching is widely used when assembling data warehouses and business mailing lists, and is a foundation of many longitudinal epidemiological and other health-related studies. Unfortunately, names and addresses are often captured in non-standard and varying formats, usually with some degree of spelling and typographical errors. It is therefore important that such data is transformed into a clean and standardised form.
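
For a sense of what such standardisation involves, here is a deliberately simple rule-based sketch (lower-casing, stripping punctuation, expanding a hypothetical abbreviation table); the paper itself takes a probabilistic approach rather than fixed rules.

import re

ABBREVIATIONS = {"rd": "road", "st": "street", "ave": "avenue"}  # hypothetical table

def standardise_address(raw):
    """Lower-case, strip punctuation, expand known abbreviations token by token."""
    tokens = re.sub(r"[^\w\s]", " ", raw.lower()).split()
    return " ".join(ABBREVIATIONS.get(t, t) for t in tokens)

print(standardise_address("42 Main St., APT 3"))   # '42 main street apt 3'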

A Bayesian decision model for cost optimal record matching

Authors: 
Verykios, V. S.; Moustakides, G. V.; Elfeky, M. G.
Year: 
2003
Venue: 
VLDB Journal

In an error-free system with perfectly clean data, the construction of a global view of the data consists of linking - in relational terms, joining - two or more tables on their key fields. Unfortunately, most of the time, these data are neither carefully controlled for quality nor necessarily defined commonly across different data sources. As a result, the creation of such a global data view resorts to approximate joins.
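
The decision-theoretic flavour can be sketched as follows: for a candidate record pair, combine a likelihood ratio from the comparison of the two records with a prior match probability and pick the action with the lower expected cost. The costs, prior and ratios below are invented, and the sketch omits the clerical-review region and the formal threshold derivation developed in the paper.

def match_decision(likelihood_ratio, cost_false_match=10.0, cost_false_nonmatch=1.0,
                   prior_match=0.01):
    """Cost-sensitive Bayes decision for one record pair: declare a match when the
    expected cost of matching is below the expected cost of not matching."""
    posterior_match = (likelihood_ratio * prior_match) / (
        likelihood_ratio * prior_match + (1 - prior_match))
    expected_cost_match = (1 - posterior_match) * cost_false_match
    expected_cost_nonmatch = posterior_match * cost_false_nonmatch
    return "match" if expected_cost_match < expected_cost_nonmatch else "non-match"

print(match_decision(likelihood_ratio=5000.0))   # strong agreement pattern -> 'match'
print(match_decision(likelihood_ratio=2.0))      # weak evidence -> 'non-match'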

Data Cleaning for Decision Support

Authors: 
Benedikt, M.; Bohannon, P.; Bruns, G.
Year: 
2006
Venue: 
Clean DB, 2006

Data cleaning may involve the acquisition, at some effort or expense, of high-quality data. Such data can serve not only to correct individual errors, but also to improve the reliability model for data sources. However, there has been little research into this latter role for acquired data. In this short paper we define a new data cleaning model that allows a user to estimate the value of further data acquisition in the face of specific business decisions. As data is acquired, the reliability model of sources is updated using Bayesian techniques.
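
One minimal way to picture that update (an illustrative sketch, not the paper's model): treat each source's reliability as a Beta-distributed probability of supplying a correct value and update it as acquired data confirms or contradicts the source.

def update_reliability(alpha, beta, confirmed):
    """Beta(alpha, beta) prior on a source's accuracy; one observation updates it."""
    return (alpha + 1, beta) if confirmed else (alpha, beta + 1)

alpha, beta = 2.0, 2.0                       # weak prior: reliability around 0.5
for confirmed in [True, True, False, True]:  # outcomes of acquired reference data
    alpha, beta = update_reliability(alpha, beta, confirmed)
print(alpha / (alpha + beta))                # posterior mean reliability = 0.625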

Cleaning the spurious links in data

Authors: 
Lee, M.L.; Hsu, W.; Kothari, V.
Year: 
2004
Venue: 
IEEE Intelligent Systems, Vol. 19, 2004

Data quality problems can arise from abbreviations, data entry mistakes, duplicate records, missing fields, and many other sources. These problems proliferate when you integrate multiple data sources in data warehousing, federated databases, and global information systems. A newly discovered class of erroneous data is spurious links, where a real-world entity has multiple links that might not be properly associated with it. The existence of such spurious links often leads to confusion and misrepresentation in the data records representing the entity.

AJAX: an extensible data cleaning tool

Authors: 
Galhardas, H; Florescu, D; Shasha, D; Simon, E
Year: 
2000
Venue: 
ACM SIGMOD Record

... groups together matching pairs with a high similarity value by applying a given grouping criterion (e.g. by transitive closure). Finally, merging collapses each individual cluster into a tuple of the resulting data source. AJAX provides a declarative language for specifying data cleaning programs, which consists of SQL statements enriched with a set of specific primitives to express these transformations. It also allows the user to interact with an executing data cleaning program to handle exceptional cases and to inspect intermediate results.
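
The grouping-by-transitive-closure step can be pictured with a small union-find sketch over invented record identifiers and matching pairs; this only illustrates the clustering idea, not AJAX's declarative language or execution model.

parent = {}

def find(x):
    """Find the cluster representative of x, with path compression."""
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def union(a, b):
    parent[find(a)] = find(b)

matching_pairs = [("r1", "r2"), ("r2", "r3"), ("r4", "r5")]  # high-similarity pairs
for a, b in matching_pairs:
    union(a, b)

clusters = {}
for r in {x for pair in matching_pairs for x in pair}:
    clusters.setdefault(find(r), set()).add(r)
print(list(clusters.values()))   # [{'r1', 'r2', 'r3'}, {'r4', 'r5'}] in some order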

Data Cleaning: Problems and Current Approaches

Authors: 
Rahm, Erhard; Do, Hong Hai
Year: 
2000
Venue: 
IEEE Data Engineering Bulletin

We classify data quality problems that are addressed by data cleaning and provide an overview of the main solution approaches. Data cleaning is especially required when integrating heterogeneous data sources and should be addressed together with schema-related data transformations. In data warehouses, data cleaning is a major part of the so-called ETL process. We also discuss current tool support for data cleaning.
