Probabilistic Name and Address Cleaning and Standardization

Christen, P.; Churches, T.; Zhu, J.
Proceedings of the Australasian Data Mining Workshop, 2002

In the absence of a shared unique key, an ensemble of nonunique
personal attributes such as names and addresses is
often used to link data from disparate sources. Such data
matching is widely used when assembling data warehouses
and business mailing lists, and is a foundation of many longitudinal
epidemiological and other health-related studies.
Unfortunately, names and addresses are often captured in
non-standard and varying formats, usually with some degree
of spelling and typographical errors. It is therefore
important that such data is transformed into a clean and
standardized format before it is further processed.
Traditional approaches for cleaning and standardization of
personal information have been based on domain-specific
rules that need considerable configuration by highly skilled
end users. In this paper we describe an alternative approach
based on probabilistic hidden Markov models. Experiments
on various health-related administrative data sets show that,
compared to a rules-based approach, the probabilistic system
is less cumbersome and more flexible to use and, for
more complex data, produces more accurate results.
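
To make the hidden Markov model idea concrete, the following is a minimal
sketch and not the authors' implementation. It assumes a toy model in which
the hidden states are address fields and each input token is first reduced to
a coarse observation class; Viterbi decoding then picks the most likely field
for each token. The state names, observation classes, lookup table, and every
probability below are hypothetical values chosen purely for illustration.

```python
import math

# Hypothetical hidden states: the output fields of a standardized address.
STATES = ["number", "street_name", "street_type", "locality"]

# All probabilities are invented for illustration; a real system would
# estimate them from annotated training records.
START = {"number": 0.8, "street_name": 0.2, "street_type": 0.0, "locality": 0.0}
TRANS = {
    "number":      {"number": 0.1, "street_name": 0.9, "street_type": 0.0, "locality": 0.0},
    "street_name": {"number": 0.0, "street_name": 0.2, "street_type": 0.7, "locality": 0.1},
    "street_type": {"number": 0.0, "street_name": 0.0, "street_type": 0.0, "locality": 1.0},
    "locality":    {"number": 0.0, "street_name": 0.0, "street_type": 0.0, "locality": 1.0},
}
EMIT = {
    "number":      {"NUM": 0.90, "WORD": 0.10, "TYPE": 0.00},
    "street_name": {"NUM": 0.05, "WORD": 0.90, "TYPE": 0.05},
    "street_type": {"NUM": 0.00, "WORD": 0.10, "TYPE": 0.90},
    "locality":    {"NUM": 0.00, "WORD": 0.95, "TYPE": 0.05},
}

# Hypothetical lookup table mapping street-type variants to one standard form.
STREET_TYPES = {"st": "street", "street": "street", "rd": "road",
                "road": "road", "ave": "avenue", "avenue": "avenue"}


def observe(token):
    """Reduce a cleaned token to a coarse observation class."""
    if token.isdigit():
        return "NUM"
    if token in STREET_TYPES:
        return "TYPE"
    return "WORD"


def log(p):
    """Logarithm that maps zero probability to negative infinity."""
    return math.log(p) if p > 0 else float("-inf")


def viterbi(obs):
    """Return the most likely hidden-state sequence for the observations."""
    # best[t][s] = (log probability of the best path ending in s, previous state)
    best = [{s: (log(START[s]) + log(EMIT[s][obs[0]]), None) for s in STATES}]
    for o in obs[1:]:
        row = {}
        for s in STATES:
            score, prev = max((best[-1][p][0] + log(TRANS[p][s]), p) for p in STATES)
            row[s] = (score + log(EMIT[s][o]), prev)
        best.append(row)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: best[-1][s][0])
    path = [state]
    for row in reversed(best[1:]):
        state = row[state][1]
        path.append(state)
    return path[::-1]


def standardize(raw):
    """Clean, tokenize, tag, and segment one raw address string."""
    tokens = raw.lower().replace(",", " ").replace(".", " ").split()
    if not tokens:
        return {}
    fields = {}
    for token, state in zip(tokens, viterbi([observe(t) for t in tokens])):
        if state == "street_type":
            token = STREET_TYPES.get(token, token)
        fields.setdefault(state, []).append(token)
    return {name: " ".join(parts) for name, parts in fields.items()}
```

Under these made-up parameters, `standardize("42 Main St, Canberra")` assigns
the four tokens to the number, street name, street type, and locality fields
and expands "st" to "street". Adapting such a model to new data means
re-estimating the probability tables rather than rewriting rules, which is the
kind of flexibility the abstract attributes to the probabilistic approach.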