
A Fast Linkage Detection Scheme for Multi-Source Information Integration

Authors: 
Aizawa, A; Oyama, K
Year: 
2005
Venue: 
Web Information Retrieval and Integration

Record linkage refers to techniques for identifying records associated with the same real-world entities. Record linkage is not only crucial in integrating multi-source databases that have been generated independently, but is also considered one of the key issues in integrating heterogeneous Web resources. However, when targeting large-scale data, the cost of enumerating all the possible linkages often becomes impracticably high. Against this background, this paper proposes a fast and efficient method for linkage detection.
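
The usual way to avoid enumerating all possible linkages is blocking: only records that share a cheap key are compared in detail. A minimal Python sketch of that idea (the record layout and the key function are illustrative assumptions, not the scheme proposed in this paper):

    from collections import defaultdict
    from itertools import combinations

    def blocking_key(record):
        # Illustrative choice: first three letters of the surname.
        return record["surname"][:3].lower()

    def candidate_pairs(records):
        # Group records by a cheap blocking key and compare only within
        # blocks, instead of enumerating all O(n^2) pairs.
        blocks = defaultdict(list)
        for r in records:
            blocks[blocking_key(r)].append(r)
        for block in blocks.values():
            yield from combinations(block, 2)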

Adaptive name matching in information integration

Authors: 
Bilenko, M; Mooney, R; Cohen, W; Ravikumar, P; Fienberg, S
Year: 
2003
Venue: 
IEEE Intelligent Systems

Identifying approximately duplicate database records that refer to the same entity is essential for information integration. The authors compare and describe methods for combining and learning textual similarity measures for name matching.
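
As a rough illustration of combining textual similarity measures for name matching, one can blend a token-level and a character-level measure. The fixed weight below is an assumption; the paper's point is that such combinations can be learned:

    def jaccard_tokens(a, b):
        # Token-level Jaccard similarity of two names.
        ta, tb = set(a.lower().split()), set(b.lower().split())
        return len(ta & tb) / len(ta | tb) if ta | tb else 1.0

    def bigram_dice(a, b):
        # Character-bigram Dice coefficient, tolerant of small typos.
        ga = {a[i:i + 2] for i in range(len(a) - 1)}
        gb = {b[i:i + 2] for i in range(len(b) - 1)}
        if not ga or not gb:
            return float(a == b)
        return 2 * len(ga & gb) / (len(ga) + len(gb))

    def combined_similarity(a, b, w=0.5):
        # Fixed-weight blend; the authors instead learn the combination.
        return w * jaccard_tokens(a, b) + (1 - w) * bigram_dice(a, b)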

Efficient topic-based unsupervised name disambiguation

Authors: 
Song, Y; Huang, J; Councill, IG; Li, J; Giles, CL
Year: 
2007
Venue: 
Proc. 2007 ACM/IEEE Joint Conf. on Digital Libraries (JCDL)

Name ambiguity is a special case of identity uncertainty where one person can be referenced by multiple name variations in different situations or even share the same name with other people. In this paper, we focus on the problem of disambiguating person names within web pages and scientific documents. We present an efficient and effective two-stage approach to disambiguate names. In the first stage, two novel topic-based models are proposed by extending two hierarchical Bayesian text models, namely Probabilistic Latent Semantic Analysis (PLSA) and Latent Dirichlet Allocation (LDA).
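
Assuming the first stage has already produced a topic distribution per document (via PLSA or LDA), the second stage can be caricatured as grouping same-name mentions whose documents are topically close. A hypothetical greedy sketch, not the paper's actual algorithm:

    import math

    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        nu = math.sqrt(sum(a * a for a in u))
        nv = math.sqrt(sum(b * b for b in v))
        return dot / (nu * nv) if nu and nv else 0.0

    def group_mentions(topic_vectors, threshold=0.8):
        # Greedy single-pass grouping of same-name mentions by the topical
        # similarity of their documents; topic_vectors are assumed to come
        # from the first (PLSA/LDA) stage.
        clusters = []
        for i, vec in enumerate(topic_vectors):
            for cluster in clusters:
                if cosine(vec, topic_vectors[cluster[0]]) >= threshold:
                    cluster.append(i)
                    break
            else:
                clusters.append([i])
        return clusters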

Self-tuning in graph-based reference disambiguation

Authors: 
Nuray-Turan, R; Kalashnikov, DV; Mehrotra, S
Year: 
2007
Venue: 
Proc. DASFAA 2007

Nowadays many data mining/analysis applications use graph analysis techniques for decision making. Many of these techniques are based on the importance of relationships among the interacting units. A number of models and measures that analyze relationship importance (link structure) have been proposed (e.g., centrality, importance, and PageRank), and they are generally based on intuition: the analyst intuitively decides on a reasonable model that fits the underlying data. In this paper, we address the problem of learning such models directly

Personal Name Matching: New Test Collections and a Social Network based Approach

Authors: 
Reuther, P
Year: 
2006
Venue: 
Tech. Report, Univ. Trier

This paper gives an overview of Personal Name Matching. Personal name matching is of great importance for all applications that deal with personal names. The problem with personal names is that they are not unique, and sometimes many variations exist for one name. As a consequence, databases may on the one hand have several entries for one and the same person, and on the other hand have one entry for many different persons. For the evaluation of Personal Name Matching algorithms, test collections are of great importance.

Managing the Quality of Person Names in DBLP

Authors: 
Reuther, P; Walter, B; Ley, M; Weber, A; Klink, S
Year: 
2006
Venue: 
Proc. ECDL, LNCS

Quality management, not only for digital libraries, is an important task in which many dimensions and different aspects have to be considered. This paper gives a short overview of DBLP, discussing the data acquisition and maintenance process underlying DBLP from a quality point of view. The paper finishes with a new approach to identifying erroneous person names.

D-Swoosh: A Family of Algorithms for Generic, Distributed Entity Resolution

Authors: 
Benjelloun, O.; Garcia-Molina, H.; Gong, H.; Kawai, H.; Larson, T.E.; Menestrina, D.; Thavisomboon, S.
Year: 
2007
Venue: 
Proc. ICDCS, 2007

Entity Resolution (ER) matches and merges records that refer to the same real-world entities, and is typically a compute-intensive process due to complex matching functions and high data volumes. We present a family of algorithms, D-Swoosh, for distributing the ER workload across multiple processors. The algorithms use generic match and merge functions, and ensure that new merged records are distributed to processors that may have matching records. We perform a detailed performance evaluation on a testbed of 15 processors.
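
The generic match-and-merge core that D-Swoosh distributes can be sketched on a single processor in the style of the authors' earlier R-Swoosh; the match and merge functions stay abstract, as in the paper, and the distribution strategy is not shown:

    def swoosh(records, match, merge):
        # Maintain a set of mutually non-matching records; whenever a new
        # record matches one of them, replace both by their merge and retry.
        # Single-processor skeleton only.
        result, work = [], list(records)
        while work:
            r = work.pop()
            for i, s in enumerate(result):
                if match(r, s):
                    work.append(merge(r, s))
                    del result[i]
                    break
            else:
                result.append(r)
        return result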

Biological data cleaning: a case study

Authors: 
Herbert, KG; Wang, JTL
Year: 
2007
Venue: 
International Journal of Information Quality

As databases become more pervasive through the biological sciences, various data quality concerns are emerging. Biological databases tend to develop data quality issues regarding data legacy, data uniformity and data duplication. Due to the nature of this data, each of these problems is non-trivial and can cause many problems for the database. For biological data to be corrected and standardised, methods and frameworks must be developed to handle both structural and traditional data. This paper discusses issues concerning biological data quality with respect to data cleaning.

Approximate string-matching with q-grams and maximal matches

Authors: 
Ukkonen, E
Year: 
1992
Venue: 
Theoretical Computer Science

We study approximate string-matching in connection with two string distance functions that are computable in linear time. The first function is based on the so-called q-grams. An algorithm is given for the associated string-matching problem that finds the locally best approximate occurrences of pattern P, |P| = m, in text T, |T| = n, in time O(n log(m − q)). The occurrences with distance
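
The q-gram distance compares the multisets of length-q substrings of the two strings. A small Python sketch of that distance, using hash maps rather than the paper's automaton machinery:

    from collections import Counter

    def qgram_profile(s, q):
        # Multiset of the length-q substrings (q-grams) of s.
        return Counter(s[i:i + q] for i in range(len(s) - q + 1))

    def qgram_distance(x, y, q=2):
        # L1 distance between the two q-gram profiles. Zero distance does
        # not imply equal strings, but the measure is computable in linear
        # time, which is the point of the paper.
        px, py = qgram_profile(x, q), qgram_profile(y, q)
        return sum(abs(px[g] - py[g]) for g in set(px) | set(py))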

Data unification in personal information management

Authors: 
Karger, DR; Jones, W
Year: 
2006
Venue: 
Communications of the ACM

Users need ways to unify, simplify, and consolidate information too often fragmented by location, device, and software application.

Domain-independent data cleaning via analysis of entity-relationship graph

Authors: 
Kalashnikov, DV; Mehrotra, S
Year: 
2006
Venue: 
ACM Transactions on Database Systems (TODS)

In this article, we address the problem of reference disambiguation. Specifically, we consider a situation where entities in the database are referred to using descriptions (e.g., a set of instantiated attributes). The objective of reference disambiguation is to identify the unique entity to which each description corresponds. The key difference between the approach we propose (called RelDC) and the traditional techniques is that RelDC analyzes not only object features but also inter-object relationships to improve the disambiguation quality.
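
The flavor of exploiting inter-object relationships can be conveyed by a toy connection-strength measure over the entity-relationship graph. RelDC's actual model is far more principled; the path-counting measure below is only an assumption for illustration:

    def connection_strength(graph, src, dst, max_len=3):
        # Toy measure: the number of simple paths of length <= max_len
        # between two nodes. More and shorter paths suggest a stronger
        # relationship. graph maps a node to an iterable of neighbors.
        def walk(node, visited, depth):
            if node == dst:
                return 1
            if depth == max_len:
                return 0
            return sum(walk(nxt, visited | {nxt}, depth + 1)
                       for nxt in graph.get(node, ()) if nxt not in visited)
        return walk(src, {src}, 0)

    def disambiguate(graph, reference, candidates):
        # Resolve a reference to the candidate entity it is most strongly
        # connected to in the graph.
        return max(candidates,
                   key=lambda c: connection_strength(graph, reference, c))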

A knowledge-based approach for duplicate elimination in data cleaning

Authors: 
Low, WL; Lee, ML; Ling, TW
Year: 
2001
Venue: 
Information Systems

Existing duplicate elimination methods for data cleaning work on the basis of computing the degree of similarity between nearby records in a sorted database. High recall can be achieved by accepting records with low degrees of similarity as duplicates, at the cost of lower precision. High precision can be achieved analogously at the cost of lower recall. This is the recall-precision dilemma. We develop a generic knowledge-based framework for effective data cleaning that can implement any existing data cleaning strategies and more.
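
The recall-precision dilemma is easy to make concrete: sweep the acceptance threshold over scored candidate pairs and watch one metric rise as the other falls. A minimal sketch, with illustrative encodings for pairs and the gold standard:

    def precision_recall(scored_pairs, gold_duplicates, threshold):
        # scored_pairs: iterable of (id_a, id_b, similarity)
        # gold_duplicates: set of frozenset({id_a, id_b}) true duplicates
        predicted = {frozenset((a, b))
                     for a, b, s in scored_pairs if s >= threshold}
        true_pos = predicted & gold_duplicates
        precision = len(true_pos) / len(predicted) if predicted else 1.0
        recall = len(true_pos) / len(gold_duplicates) if gold_duplicates else 1.0
        return precision, recall

Lowering the threshold can only grow the predicted set, so recall never drops but precision usually does, and vice versa.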

Efficient similarity-based operations for data integration

Authors: 
Schallehn, E; Sattler, KU; Saake, G
Year: 
2004
Venue: 
Data & Knowledge Engineering

Dealing with discrepancies in data is still a big challenge in data integration systems. The problem occurs both when eliminating duplicates from semantically overlapping sources and when combining complementary data from different sources. Although using SQL operations like grouping and join seems a viable way, these operators fail if the attribute values of the potential duplicates or related tuples are not equal but only similar by certain criteria. As a solution to this problem, we present similarity-based variants of the grouping and join operators.
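
A similarity join pairs tuples whose join attributes are merely similar rather than equal. A naive nested-loop sketch with an edit-distance predicate (the predicate is an illustrative choice; the paper is about supporting such operators efficiently):

    def edit_distance(a, b):
        # Classic dynamic-programming Levenshtein distance.
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                 # deletion
                               cur[-1] + 1,                 # insertion
                               prev[j - 1] + (ca != cb)))   # substitution
            prev = cur
        return prev[-1]

    def similarity_join(r, s, key_r, key_s, max_dist=2):
        # Equality join is the special case max_dist = 0.
        for t1 in r:
            for t2 in s:
                if edit_distance(t1[key_r], t2[key_s]) <= max_dist:
                    yield t1, t2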

Duplicate record identification in bibliographic databases

Authors: 
Goyal, P
Year: 
1987
Venue: 
Information Systems

This study presents the applicability of an automatically generated code for use in duplicate detection in bibliographic databases. It is shown that the methods generate a large percentage of unique codes, and that the code is short enough to be useful. The code would prove to be particularly useful in identifying duplicates when records are added to the database.
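
The idea of a short, automatically generated duplicate-detection code can be sketched as a normalized match key. The exact construction below is hypothetical, not Goyal's scheme:

    import re

    def match_code(title, first_author_surname, year):
        # Hypothetical key: 4 letters of the surname, 3 letters each of the
        # first two title words, last two digits of the year. Records that
        # share a key become duplicate candidates.
        words = re.findall(r"[a-z]+", title.lower())
        stem = "".join(w[:3] for w in words[:2])
        return f"{first_author_surname[:4].lower()}{stem}{str(year)[-2:]}"

    # match_code("Duplicate record identification in bibliographic databases",
    #            "Goyal", 1987) -> "goyaduprec87"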

On The Accuracy and Completeness of The Record Matching Process

Authors: 
Verykios, VS; Elfeky, MG; Elmagarmid, AK
Year: 
2000
Venue: 
Proc. 2000 Conf. on Information Quality

The role of data resources in today's business environment is multi-faceted. Primarily, they support the operational needs of an organization or a company. Secondarily, they can be used for decision support and management. The quality of the data, used to support the operational needs, is usually below the quality required for decision support and management.

Matching Algorithms within a Duplicate Detection System

Authors: 
Monge, AE
Year: 
2000
Venue: 
IEEE Data Engineering Bulletin

Detecting database records that are approximate duplicates, but not exact duplicates, is an important task. Databases may contain duplicate records concerning the same real-world entity because of data entry errors, unstandardized abbreviations, or differences in the detailed schemas of records from multiple databases (such as what happens in data warehousing, where records from multiple data sources are integrated into a single source of information), among other reasons. In this paper we review a system
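
Duplicate detection systems of this kind commonly collect matches with a union-find structure so that transitively connected records collapse into one cluster. A minimal sketch of that bookkeeping (not the paper's specific matching algorithms):

    class UnionFind:
        # Disjoint sets over record ids: if a~b and b~c are both detected,
        # all three records end up in one duplicate cluster.
        def __init__(self):
            self.parent = {}

        def find(self, x):
            self.parent.setdefault(x, x)
            while self.parent[x] != x:
                self.parent[x] = self.parent[self.parent[x]]  # path halving
                x = self.parent[x]
            return x

        def union(self, a, b):
            ra, rb = self.find(a), self.find(b)
            if ra != rb:
                self.parent[rb] = ra

Feeding every detected duplicate pair into union() leaves the duplicate clusters as the connected components.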

Efficient clustering of high-dimensional data sets with application to reference matching

Authors: 
McCallum, A; Nigam, K; Ungar, LH
Year: 
2000
Venue: 
Proc. 6th ACM SIGKDD conf.

Many important problems involve clustering large datasets. Although naive implementations of clustering are computationally expensive, there are established efficient techniques for clustering when the dataset has either (1) a limited number of clusters, (2) a low feature dimensionality, or (3) a small number of data points. However, there has been much less work on methods for efficiently clustering datasets that are large in all three ways at once, for example, having millions of data points that exist in many thousands of dimensions representing many thousands of clusters.
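
The paper's canopies technique uses a cheap distance with two thresholds t1 > t2 to carve the data into overlapping groups and runs the expensive clustering only within groups. A compact sketch, assuming cheap_dist(p, p) == 0 and t2 > 0 so the loop terminates:

    import random

    def canopies(points, cheap_dist, t1, t2):
        # t1 > t2. Loose members (distance in [t2, t1)) may recur in later
        # canopies, so canopies overlap; tight members (< t2) stop being
        # candidates for further canopy centers.
        assert t1 > t2
        remaining = set(range(len(points)))
        result = []
        while remaining:
            center = random.choice(tuple(remaining))
            dists = {i: cheap_dist(points[center], points[i]) for i in remaining}
            result.append({i for i, d in dists.items() if d < t1})
            remaining -= {i for i, d in dists.items() if d < t2}
        return result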

Getty's Synoname and its cousins: A survey of applications of personal name-matching algorithms

Authors: 
Borgman, CL; Siegfried, SL
Year: 
1992
Venue: 
Journal of the American Society for Information Science

The study reported in this article was commissioned by the Getty Art History Information Program (AHIP) as a background investigation of personal name-matching programs in fields other than art history, for purposes of comparing them and their approaches with AHIP's Synoname™ project. We review techniques employed in a variety of applications, including art history, bibliography, genealogy, commerce, and government, providing a framework of personal name characteristics, factors in selecting matching techniques, and types of applications.

Re-identification of Familial Database Records

Authors: 
Malin, B
Year: 
2006
Venue: 
Proc. AMIA Annual Symp.

Many genome-based research projects include familial relationships, such as pedigrees, with genomic data records. To protect anonymity when sharing family information, data holders remove, or encode, explicit identifiers (e.g., personal name). In this paper, however, we introduce IdentiFamily, a software program that can link de-identified family relations to named people. The program extracts genealogical knowledge from publicly available records and ascertains the re-identification risk for specific family relations. We find robust genealogies on current

An interface for mining genealogical nominal data using the concept of linkage and a hybrid name matching algorithm

Authors: 
Snae, C; Diaz, BM
Year: 
2002
Venue: 
Journal of 3D-Forum Society

This paper describes hybrid name matching algorithms developed to provide nominal data linkage within English parish register data. LIG2 has been shown to perform as well as conventional matching algorithms found in the literature, while its probability version LIG3 provides sufficient flexibility to be included in a Nominal Data Linkage Workbench which allows other dimensions e.g. geographical space, and historical time to be included in the linkage/matching process. The paper reports some initial findings on implementing such a Workbench.

Regelbasierte Ausreißersuche zur Datenqualitätsanalyse (Rule-Based Outlier Detection for Data Quality Analysis)

Authors: 
Kübart, J; Grimmer, U; Hipp, J
Year: 
2005
Venue: 
Datenbank-Spektrum, Vol. 14, 2005

The quality of the underlying data is critical for data analysis and data migration. Analyzing data quality, however, is a non-trivial task, especially for large data sets. We present a method for rule-based outlier detection in large databases that can be applied both with validity rules specified by experts ("business rules") and with rules generated automatically from the data.

Exploiting secondary sources for automatic object consolidation

Authors: 
Michalowski, M; Thakkar, S; Knoblock, CA
Year: 
2003
Venue: 
Proc. 2003 ACM SIGKDD Workshop on Data Cleaning, Record Linkage, and Object Consolidation

Information sources on the web are controlled by different organizations or people, utilize different text formats, and have varying inconsistencies. Therefore, any system that integrates information from different data sources must consolidate data from these sources. Data from many data sources on the web may not contain enough information to accurately consolidate the data, even using state-of-the-art object consolidation systems. We present an approach to accurately and automatically consolidate data from various data sources by utilizing a state-of-the-art object consolidation

A Latent Dirichlet Model for Unsupervised Entity Resolution

Authors: 
Bhattacharya, I.; Getoor, L.
Year: 
2006
Venue: 
The SIAM International Conference on Data Mining (SIAM-SDM), 2006

Entity resolution has received considerable attention in recent years. Given many references to underlying entities, the goal is to predict which references correspond to the same entity. We show how to extend the Latent Dirichlet Allocation model for this task and propose a probabilistic model for collective entity resolution for relational domains where references are connected to each other.

Relational Clustering for Entity Resolution Queries

Authors: 
Bhattacharya, I.; Licamele, L.; Getoor, L.
Year: 
2006
Venue: 
ICML 2006 Workshop on Statistical Relational Learning (SRL)

The goal of entity resolution is to reconcile database references corresponding to the same real-world entities. Given the abundance of publicly available databases where entities are not resolved, we motivate the problem of quickly processing queries that require resolved entities from such ‘unclean’ databases. We first propose a cut-based relational clustering formulation for collective entity resolution. We then show how it can be performed on-the-fly by adaptively extracting and resolving those database references that are the most helpful for resolving the query.

Collective Entity Resolution in Relational Data

Authors: 
Bhattacharya, I.; Getoor, L.
Year: 
2007
Venue: 
ACM Transactions on Knowledge Discovery from Data, 2007

Many databases contain uncertain and imprecise references to real-world entities. The absence of identifiers for the underlying entities often results in a database which contains multiple references to the same entity. This can lead not only to data redundancy, but also inaccuracies in query processing and knowledge extraction. These problems can be alleviated through the use of entity resolution. Entity resolution involves discovering the underlying entities and mapping each database reference to these entities.
