Febrl - A freely available record linkage system with a graphical user interface

Christen, Peter
Australasian Workshop on Health Data and Knowledge Management

Record or data linkage is an important enabling technology in the health sector, as linked data is a cost-effective resource that can help to improve research into health policies, detect adverse drug reactions, reduce costs, and uncover fraud within the health system. Significant advances, mostly originating from data mining and machine learning, have been made in recent years in many areas of record linkage techniques. Most of these new methods are not yet implemented in current record linkage systems, or are hidden within 'black box' commercial software. This

Learning Blocking Schemes for Record Linkage

Michelson, Matthew; Knoblock, Craig A.

Record linkage is the process of matching records across data
sets that refer to the same entity. One issue within record
linkage is determining which record pairs to consider, since
a detailed comparison between all of the records is impractical.
Blocking addresses this issue by generating candidate
matches as a preprocessing step for record linkage. For example,
in a person matching problem, blocking might return
all people with the same last name as candidate matches. Two
main problems in blocking are the selection of attributes for
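
The last-name example in the abstract above can be sketched in a few lines. This is only an illustration of fixed-key blocking, not the learned blocking schemes the paper proposes; the function name `block_by_key`, the field names, and the sample records are all hypothetical, and for simplicity the records come from a single data set.

```python
from collections import defaultdict
from itertools import combinations


def block_by_key(records, key):
    """Return candidate pairs of record ids that share the same blocking key.

    `records` maps a record id to a dict of attributes; `key` names the
    attribute used for blocking (an assumption of this sketch).
    """
    blocks = defaultdict(list)
    for rec_id, rec in records.items():
        # Normalise lightly so "Smith" and "smith " land in the same block.
        blocks[rec[key].strip().lower()].append(rec_id)

    pairs = set()
    for ids in blocks.values():
        # Only records inside the same block are compared in detail later.
        for a, b in combinations(sorted(ids), 2):
            pairs.add((a, b))
    return pairs


# Hypothetical sample data.
records = {
    1: {"first": "John", "last": "Smith"},
    2: {"first": "Jon", "last": "Smith"},
    3: {"first": "Mary", "last": "Jones"},
    4: {"first": "J.", "last": "smith "},
}
candidates = block_by_key(records, "last")
```

With four records there are six possible pairs; blocking on last name keeps only the three pairs among records 1, 2, and 4, so the detailed comparison step has less work to do.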

Example-driven Design of Efficient Record Matching Queries

Chaudhuri, Surajit; Chen, Bee-Chung; Ganti, Venkatesh; Kaushik, Raghav

Record matching is the task of identifying records that match the same real world entity. This is a problem of great significance for a variety of business intelligence applications. Implementations of record matching rely on exact as well as approximate string matching (e.g., edit distances) and use of external reference data sources. Record matching can be viewed as a query composed of a small set of primitive operators. However, formulating such record matching queries is difficult and depends on the specific application scenario.

Self-tuning in graph-based reference disambiguation

Nuray-Turan, R; Kalashnikov, DV; Mehrotra, S
Proc. DASFAA 2007

Nowadays many data mining/analysis applications use
graph analysis techniques for decision making. Many of these techniques
are based on the importance of relationships among the interacting units.
A number of models and measures that analyze the relationship importance
(link structure) have been proposed (e.g., centrality, importance
and page rank) and they are generally based on intuition, where the analyst
intuitively decides a reasonable model that fits the underlying data.
In this paper, we address the problem of learning such models directly

Learning string-edit distance

Ristad, ES; Yianilos, PN
IEEE Transactions on Pattern Analysis and Machine Intelligence, 20, 1998

In many applications, it is necessary to determine the similarity of two strings.
A widely-used notion of string similarity is the edit distance: the minimum
number of insertions, deletions, and substitutions required to transform one
string into the other. In this report, we provide a stochastic model for string
edit distance. Our stochastic model allows us to learn a string edit distance
function from a corpus of examples. We illustrate the utility of our approach
by applying it to the difficult problem of learning the pronunciation of words in
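
For reference, the edit distance defined in the abstract above (minimum number of insertions, deletions, and substitutions) can be computed with the classical dynamic program. The sketch below uses fixed unit costs and is not the stochastic, learned model the paper describes; the function name `edit_distance` is chosen here for illustration.

```python
def edit_distance(s, t):
    """Unit-cost edit distance between strings s and t.

    Uses the standard row-by-row dynamic program: prev[j] holds the
    distance between the first i-1 characters of s and the first j of t.
    """
    prev = list(range(len(t) + 1))  # distance from "" to each prefix of t
    for i, cs in enumerate(s, start=1):
        curr = [i]  # distance from the first i characters of s to ""
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(
                prev[j] + 1,         # delete cs from s
                curr[j - 1] + 1,     # insert ct into s
                prev[j - 1] + cost,  # substitute cs with ct (or match)
            ))
        prev = curr
    return prev[-1]
```

For example, `edit_distance("kitten", "sitting")` is 3: two substitutions and one insertion. A learned variant would replace the fixed unit costs with costs estimated from example string pairs.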

Learning to match and cluster large high-dimensional data sets for data integration

Cohen, William; Richman, Jacob

Part of the process of data integration is determining which sets of identifiers refer to the same real-world entities. In integrating databases found on the Web or obtained by using information extraction methods, it is often possible to solve this problem by exploiting similarities in the textual names used for objects in different databases. In this paper we describe techniques for clustering and matching identifier names that are both scalable and adaptive, in the sense that they can be trained to obtain better performance in a particular domain.

Learning Object Identification Rules for Information Integration

Tejada, S
Ph.D. Thesis, University of Southern California's Information Sciences Institute, Los Angeles, 2002

When integrating information from multiple websites, the same data objects can exist in inconsistent text formats across sites, making it difficult to identify matching objects using exact text match. We have developed an object identification system called Active Atlas, which compares the objects' shared attributes in order to identify matching objects. Certain attributes are more important for deciding if a mapping should exist between two objects. Previous methods of object identification have required manual construction of object identification rules or
