Survey

Data Quality: Concepts, Methodologies and Techniques

Authors: 
Batini, C.; Scannapieco, M.
Year: 
2006

Data Fusion - Resolving Data Conflicts for Integration

Authors: 
Dong, X.L.; Naumann, F.
Year: 
2009

Data fusion

Authors: 
Bleiholder, J.; Naumann, F.
Year: 
2008
Venue: 
ACM Computing Surveys

The development of the Internet in recent years has made it possible and useful to access many different information systems anywhere in the world to obtain information. While there is much research on the integration of heterogeneous information systems, most commercial systems stop short of the actual integration of available data. Data fusion is the process of fusing multiple records representing the same real-world object into a single, consistent, and clean representation.
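A minimal sketch of one common fusion strategy covered in this line of work: resolving value conflicts across duplicate records by majority vote, ignoring missing values (function name and sample records are illustrative, not from the paper):

```python
from collections import Counter

def fuse_records(records):
    """Fuse duplicate records into one representation by majority vote
    per attribute, skipping missing (None) values; ties keep the
    first-seen value (Counter preserves insertion order)."""
    fused = {}
    keys = {k for r in records for k in r}
    for key in keys:
        values = [r[key] for r in records if r.get(key) is not None]
        if values:
            fused[key] = Counter(values).most_common(1)[0][0]
    return fused

# Three representations of the same person, with conflicts and a gap.
dupes = [
    {"name": "J. Smith",   "city": "Berlin",  "phone": None},
    {"name": "John Smith", "city": "Berlin",  "phone": "555-0100"},
    {"name": "John Smith", "city": "Potsdam", "phone": "555-0100"},
]
fused = fuse_records(dupes)
```

Majority voting is only one of the conflict-resolution functions surveyed; others prefer the most recent value, the most trusted source, or a computed aggregate.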

A guided tour to approximate string matching

Authors: 
Navarro, G.
Year: 
2001
Venue: 
ACM Computing Surveys

We survey the current techniques to cope with the problem of string matching that allows errors. This is becoming a more and more relevant issue for many fast growing areas such as information retrieval and computational biology. We focus on online searching and mostly on edit distance, explaining the problem and its relevance, its statistical behavior, its history and current developments, and the central ideas of the algorithms and their complexities. We present a number of experiments to compare the performance of the different algorithms and show which are the best choices.
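The edit (Levenshtein) distance at the center of the survey can be computed with the classic dynamic-programming recurrence; a minimal sketch using a rolling row of the DP table (names are ours):

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of insertions, deletions,
    and substitutions turning string a into string b, computed with
    dynamic programming over a single rolling row."""
    prev = list(range(len(b) + 1))        # distances from "" to prefixes of b
    for i, ca in enumerate(a, 1):
        curr = [i]                        # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]
```

This is the O(|a|·|b|) baseline; the algorithms the survey compares (bit-parallel, filtering, automaton-based) improve on it for online search.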

Data Quality at a Glance

Authors: 
Scannapieco, M.; Missier, P.; Batini, C.
Year: 
2005
Venue: 
Datenbankspektrum, Vol. 14

The paper provides an overview of data quality, in terms of its multidimensional nature. A set of data quality dimensions is defined, including accuracy, completeness, time-related dimensions and consistency. Several practical examples on how such dimensions can be measured and used are also described. The definitions for data quality dimensions are placed in the context of other research proposals for sets of data quality dimensions, showing similarities and differences.
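As an illustration of measuring one such dimension, completeness can be quantified as the fraction of non-missing cells over the columns of interest (a deliberately simplified sketch; the paper's own measures are richer):

```python
def completeness(rows, columns):
    """Fraction of non-missing cells over the given columns --
    one simple way to quantify the completeness dimension.
    None and the empty string count as missing."""
    total = len(rows) * len(columns)
    filled = sum(1 for row in rows for c in columns
                 if row.get(c) not in (None, ""))
    return filled / total if total else 1.0
```

Analogous per-cell or per-record scores can be defined for accuracy (agreement with a reference) and currency (age of the value).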

Erkennen und Bereinigen von Datenfehlern in naturwissenschaftlichen Daten (Detecting and Cleaning Data Errors in Scientific Data)

Authors: 
Müller, H.; Weis, M.; Bleiholder, J.; Leser, U.
Year: 
2005
Venue: 
Datenbankspektrum, Vol. 15

Due to the process by which they are produced, scientific data are often subject to a high degree of uncertainty. When data from different sources are integrated, these uncertainties, alongside the manifold syntactic and semantic heterogeneity in how the data are represented, lead to conflicts that result in a reduced quality of the integrated data set. Although conflicts can often only be resolved conclusively by domain experts, the work of these experts can and must be supported by suitable tools.

Duplicate Record Detection: A Survey

Authors: 
Elmagarmid, Ahmed; Ipeirotis, Panagiotis; Verykios, Vassilios
Year: 
2007
Venue: 
IEEE Transactions on Knowledge and Data Engineering (TKDE)

Often, in the real world, entities have two or more representations in databases. Duplicate records do not share a common key and/or they contain errors that make duplicate matching a difficult task. Errors are introduced as the result of transcription errors, incomplete information, lack of standard formats, or any combination of these factors. In this paper, we present a thorough analysis of the literature on duplicate record detection.
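A toy version of the pairwise-comparison core of duplicate detection: normalize the fields, then flag pairs whose similarity exceeds a threshold. Here difflib's `ratio` stands in for the specialized comparators the survey discusses, and the field name and threshold are illustrative:

```python
from difflib import SequenceMatcher
from itertools import combinations

def normalize(s):
    """Crude standardization: lower-case and collapse whitespace."""
    return " ".join(s.lower().split())

def find_duplicates(records, key="name", threshold=0.85):
    """Return index pairs whose normalized key fields are similar
    enough. Pairwise comparison is quadratic in the number of
    records; real systems add blocking or indexing first."""
    pairs = []
    for i, j in combinations(range(len(records)), 2):
        a, b = normalize(records[i][key]), normalize(records[j][key])
        if SequenceMatcher(None, a, b).ratio() >= threshold:
            pairs.append((i, j))
    return pairs

records = [{"name": "Jon  Smith"},
           {"name": "jon smith"},
           {"name": "Alice Jones"}]
dupes = find_duplicates(records)
```

The survey's techniques replace each piece of this sketch: better comparators than `ratio`, learned instead of fixed thresholds, and blocking to avoid the quadratic scan.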

Data Transformation for Warehousing Web Data

Authors: 
Zhu, Yan; Bornhövd, Christof; Buchmann, Alejandro P.
Year: 
2001
Venue: 
Third International Workshop on Advanced Issues of E-Commerce and Web-Based Information Systems (WECWIS '01)

In order to analyze market trends and make reasonable business plans, a company's local data is not sufficient. Decision making must also be based on information from suppliers, partners and competitors. This external data can be obtained from the Web in many cases, but must be integrated with the company's own data, for example, in a data warehouse. To this end, Web data has to be mapped to the star schema of the warehouse. In this paper we propose a semi-automatic approach to support this transformation process.

Advanced methods for record linkage

Authors: 
Winkler, W.E.
Year: 
1994
Venue: 
Proceedings of the Section on Survey Research Methods, American Statistical Association, 1994

Record linkage, or computer matching, is needed for the creation and maintenance of name and address lists that support operations for and evaluations of a Year 2000 Census. This paper describes three advances. The first is an enhanced method of string comparison for dealing with typographical variations and scanning errors. It improves upon string comparators in computer science. The second is a linear assignment algorithm that can use only 0.002 times as much storage as existing algorithms in operations research, requires at most an additional 0.03 increase in time, and …
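The enhanced string comparator referred to here belongs to the Jaro-Winkler family. A sketch of the standard Jaro-Winkler measure (our implementation of the textbook formulation, not Winkler's exact variant with its scanning-error adjustments):

```python
def jaro(a, b):
    """Jaro similarity: count characters matching within a sliding
    window, then penalize transpositions among the matched characters."""
    if a == b:
        return 1.0
    if not a or not b:
        return 0.0
    window = max(len(a), len(b)) // 2 - 1
    match_a, match_b = [False] * len(a), [False] * len(b)
    matches = 0
    for i, ca in enumerate(a):
        lo, hi = max(0, i - window), min(len(b), i + window + 1)
        for j in range(lo, hi):
            if not match_b[j] and ca == b[j]:
                match_a[i] = match_b[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    sa = [c for c, m in zip(a, match_a) if m]
    sb = [c for c, m in zip(b, match_b) if m]
    transpositions = sum(x != y for x, y in zip(sa, sb)) // 2
    return (matches / len(a) + matches / len(b)
            + (matches - transpositions) / matches) / 3

def jaro_winkler(a, b, p=0.1):
    """Winkler's variant: boost the Jaro score for a shared prefix
    of up to four characters, scaled by p (conventionally 0.1)."""
    j = jaro(a, b)
    prefix = 0
    for ca, cb in zip(a[:4], b[:4]):
        if ca != cb:
            break
        prefix += 1
    return j + prefix * p * (1 - j)
```

On the classic example pair MARTHA/MARHTA this yields about 0.961, reflecting the single transposition and the shared three-character prefix.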

The state of record linkage and current research problems

Authors: 
Winkler, W.E.
Year: 
1999
Venue: 
RR99/03, US Bureau of the Census, 1999

This paper provides an overview of methods and systems developed for record linkage. Modern record linkage begins with the pioneering work of Newcombe and is especially based on the formal mathematical model of Fellegi and Sunter. In their seminal work, Fellegi and Sunter introduced many powerful ideas for estimating record linkage parameters and other ideas that still influence record linkage today. Record linkage research is characterized by its synergism of statistics, computer science, and operations research. Many difficult algorithms have been developed and put …
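In the Fellegi-Sunter model, a candidate record pair is scored by summing per-field log-likelihood ratios of agreement. A sketch with hand-picked, purely illustrative m- and u-probabilities (in practice these are estimated from data, e.g. via EM, and agreement is usually approximate rather than exact):

```python
from math import log2

# Illustrative per-field probabilities (NOT estimated from real data):
#   m = P(field agrees | records are a true match)
#   u = P(field agrees | records are a non-match)
FIELDS = {
    "surname":   (0.95, 0.01),
    "zipcode":   (0.90, 0.05),
    "birthyear": (0.98, 0.02),
}

def match_weight(rec_a, rec_b):
    """Fellegi-Sunter composite weight: sum of per-field log
    likelihood ratios. Large positive totals suggest a match,
    large negative totals a non-match; the middle band is sent
    to clerical review."""
    total = 0.0
    for field, (m, u) in FIELDS.items():
        if rec_a[field] == rec_b[field]:
            total += log2(m / u)          # agreement weight
        else:
            total += log2((1 - m) / (1 - u))  # disagreement weight
    return total

rec1 = {"surname": "smith", "zipcode": "12345", "birthyear": 1980}
rec2 = {"surname": "jones", "zipcode": "99999", "birthyear": 1975}
```

Thresholding this weight into link / possible link / non-link regions is exactly the decision rule Fellegi and Sunter proved optimal for fixed error rates.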

Data Cleaning: Problems and Current Approaches

Authors: 
Rahm, Erhard; Do, Hong Hai
Year: 
2000
Venue: 
IEEE Data Engineering Bulletin

We classify data quality problems that are addressed by data cleaning and provide an overview of the main solution approaches. Data cleaning is especially required when integrating heterogeneous data sources and should be addressed together with schema-related data transformations. In data warehouses, data cleaning is a major part of the so-called ETL process. We also discuss current tool support for data cleaning.
