Probabilistic Name and Address Cleaning and Standardization

Mon, 10/09/2006 - 10:08 — thor

Authors:

Christen, P.; Churches, T.; Zhu, J.

Year:

2002

Venue:

Proceedings of the Australasian Data Mining Workshop, 2002

URL:

citeseer.ist.psu.edu/christen02probabilistic.html

In the absence of a shared unique key, an ensemble of non-unique
personal attributes such as names and addresses is
often used to link data from disparate sources. Such data
matching is widely used when assembling data warehouses
and business mailing lists, and is a foundation of many longitudinal
epidemiological and other health-related studies.
Unfortunately, names and addresses are often captured in
non-standard and varying formats, usually with some degree
of spelling and typographical errors. It is therefore
important that such data is transformed into a clean and
standardized format before it is further processed.
Traditional approaches for cleaning and standardization of
personal information have been based on domain-specific
rules that need considerable configuration by highly skilled
end users. In this paper we describe an alternative approach
based on probabilistic hidden Markov models. Experiments
on various health-related administrative data sets show that,
compared to a rules-based approach, the probabilistic system
is less cumbersome and more flexible to use and, for
more complex data, produces more accurate results.
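
To make the hidden Markov model idea concrete, the following is a minimal sketch of how an address string could be segmented into labelled fields with Viterbi decoding. It is not the model from the paper: the state names, the observation symbols (NUM, TYPE, WORD), the STREET_TYPES lookup set, and every probability below are hypothetical values chosen only for illustration. In a real system they would be estimated from annotated training records.

```python
import math

# Hypothetical hidden states (output fields) for a very small address model.
STATES = ["number", "street_name", "street_type", "locality"]

# Hypothetical lookup table used to turn raw tokens into observation symbols.
STREET_TYPES = {"st", "street", "rd", "road", "ave", "avenue"}


def tag(token):
    """Map a raw token to a coarse observation symbol."""
    t = token.lower().strip(".,")
    if t.isdigit():
        return "NUM"
    if t in STREET_TYPES:
        return "TYPE"
    return "WORD"


# Illustrative probabilities only; these are assumptions, not trained values.
START = {"number": 0.8, "street_name": 0.2, "street_type": 0.0, "locality": 0.0}
TRANS = {
    "number": {"number": 0.05, "street_name": 0.9, "street_type": 0.0, "locality": 0.05},
    "street_name": {"number": 0.0, "street_name": 0.2, "street_type": 0.7, "locality": 0.1},
    "street_type": {"number": 0.0, "street_name": 0.0, "street_type": 0.0, "locality": 1.0},
    "locality": {"number": 0.0, "street_name": 0.0, "street_type": 0.0, "locality": 1.0},
}
EMIT = {
    "number": {"NUM": 0.95, "TYPE": 0.0, "WORD": 0.05},
    "street_name": {"NUM": 0.02, "TYPE": 0.08, "WORD": 0.9},
    "street_type": {"NUM": 0.0, "TYPE": 0.95, "WORD": 0.05},
    "locality": {"NUM": 0.0, "TYPE": 0.05, "WORD": 0.95},
}


def log(p):
    """Logarithm that maps zero probability to negative infinity."""
    return math.log(p) if p > 0 else float("-inf")


def viterbi(symbols):
    """Return the most probable state sequence for the observation symbols."""
    if not symbols:
        return []
    # Log scores for the first symbol.
    score = [{s: log(START[s]) + log(EMIT[s][symbols[0]]) for s in STATES}]
    back = []  # back[i][s] = best predecessor of state s at position i + 1
    for sym in symbols[1:]:
        prev = score[-1]
        cur, ptr = {}, {}
        for s in STATES:
            best = max(STATES, key=lambda p: prev[p] + log(TRANS[p][s]))
            cur[s] = prev[best] + log(TRANS[best][s]) + log(EMIT[s][sym])
            ptr[s] = best
        score.append(cur)
        back.append(ptr)
    # Trace back from the best final state.
    path = [max(STATES, key=lambda s: score[-1][s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    path.reverse()
    return path


def segment(address):
    """Split an address on whitespace and label each token with a field."""
    tokens = address.split()
    labels = viterbi([tag(t) for t in tokens])
    return list(zip(tokens, labels))


if __name__ == "__main__":
    for token, label in segment("42 Main St Canberra"):
        print(token, label)
```

Working in log space avoids numerical underflow on long inputs. Compared with hand-written rules, adapting such a model to new data means re-estimating the probability tables rather than rewriting the rules.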
