Automatic Training Example Selection for Scalable Unsupervised Record Linkage

Christen, Peter
Citations range: 10 - 49

Linking records from two or more databases is becoming increasingly
important in the data preparation step of many data mining projects,
as linked data can enable analysts to conduct studies that are not
feasible otherwise, or that would require expensive and time-consuming
collection of specific data. The aim of such linkages is to match all
records that refer to the same entity. One of the main challenges in
record linkage is the accurate classification of record pairs into
matches and non-matches. With traditional techniques, classification
thresholds have to be set either manually or using an EM-based
approach. Many modern classification techniques, on the other hand,
are based on supervised machine learning and thus require training
data, which is often not available in real-world situations. A novel
two-step approach to unsupervised record pair classification is
presented in this paper. In the first step, training examples are
selected automatically, and in the second step these examples are used
to train a binary classifier. An experimental evaluation shows that
this approach can outperform k-means clustering and can also be much
faster than other classification techniques.
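
The two-step idea can be illustrated with a minimal Python sketch. It is
an illustration under stated assumptions, not the method as implemented
in the paper: the abstract does not say how training examples are chosen
or which binary classifier is trained. Here each record pair is assumed
to be represented as a vector of field similarities in [0, 1]; pairs
whose similarities are all very high or all very low are taken as seed
examples, and a simple nearest-centroid rule stands in for the binary
classifier. The function names and the thresholds 0.9 and 0.1 are
hypothetical.

```python
from math import dist  # Euclidean distance, Python 3.8+


def select_seeds(vectors, match_threshold=0.9, nonmatch_threshold=0.1):
    """Step 1 (assumed heuristic): pick pairs that are almost certainly
    matches (all similarities high) or non-matches (all similarities low)."""
    matches = [v for v in vectors if min(v) >= match_threshold]
    non_matches = [v for v in vectors if max(v) <= nonmatch_threshold]
    return matches, non_matches


def centroid(rows):
    """Column-wise mean of a non-empty list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def train(matches, non_matches):
    """Step 2 (stand-in classifier): one centroid per class."""
    if not matches or not non_matches:
        raise ValueError("need at least one seed example per class")
    return centroid(matches), centroid(non_matches)


def classify(model, vector):
    """Return True if the pair is closer to the match centroid."""
    match_centroid, nonmatch_centroid = model
    return dist(vector, match_centroid) < dist(vector, nonmatch_centroid)


# Toy comparison vectors (made-up values, three compared fields per pair).
vectors = [
    [1.0, 1.0, 0.95],
    [0.9, 1.0, 1.0],
    [0.0, 0.1, 0.0],
    [0.05, 0.0, 0.0],
    [0.8, 0.7, 0.9],   # not a seed: classified in step 2
    [0.2, 0.3, 0.1],   # not a seed: classified in step 2
]

matches, non_matches = select_seeds(vectors)
model = train(matches, non_matches)
labels = [classify(model, v) for v in vectors]
print(labels)
```

On this toy input the first four pairs serve as seeds, and the two
ambiguous pairs are labelled by the trained model. Any binary classifier
could take the place of the centroid rule in the second step.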