cs.tu-berlin.de

Data Fusion in Three Steps: Resolving Schema, Tuple, and Value Inconsistencies

Authors: 
Naumann, Felix; Bilke, Alexander; Bleiholder, Jens; Weis, Melanie
Year: 
2006
Venue: 
IEEE Data Engineering Bulletin 29(2):21-31

Heterogeneous and dirty data is abundant. It is stored under different, often opaque schemata, it represents
identical real-world objects multiple times, causing duplicates, and it has missing values and
conflicting values. Without suitable techniques for integrating and fusing such data, the data quality of
an integrated system remains low. We present a suite of methods, combined in a single tool, that allows
ad-hoc, declarative fusion of such data by employing schema matching, duplicate detection and data
fusion.

Identification of Real-World Objects in Multiple Databases

Authors: 
Neiling, M
Year: 
2005
Venue: 
TR, TU Berlin

Object identification is an important issue for the integration of data from
different sources. The identification task is complicated if no global and consistent
identifier is shared by the sources. Then, object identification can only be performed
through the identifying information that the objects' data itself provides. Unfortunately,
real-world data is dirty, hence identification mechanisms like natural keys mostly fail;
we have to take care of the variations and errors in the data. Consequently, object
identification can no longer be guaranteed to be fault-free. Several methods tackle

Quality-driven Integration of Heterogeneous Information Systems

Authors: 
Naumann, F; Leser, U; Freytag, J
Year: 
1999
Venue: 
VLDB Conference

Integrated access to information that is spread over multiple, distributed, and heterogeneous
sources is an important problem in many scientific and commercial domains. While much work has
been done on query processing and choosing plans under cost criteria, very little is known about
the important problem of incorporating the information quality aspect into query planning.
In this paper we describe a framework for multidatabase query processing that fully includes
the quality of information in many facets, such as completeness, timeliness, accuracy,
