Effective automated object matching

Zardetto, D; Scannapieco, M; Catarci, T
Proc. ICDE

Object Matching (OM) is the problem of identifying
pairs of data-objects coming from different sources and representing
the same real-world object. Several methods have been
proposed to solve OM problems, but none of them seems to be
both fully automated and highly effective. In this paper
we present a fundamentally new suite of methods that
possesses both of these properties.
We adopt a statistical approach based on mixture models,
which structures an OM process into two consecutive tasks.
First, mixture parameters are estimated by fitting the model to
observed distance measures between pairs. Then, a probabilistic
clustering of the pairs into Matches and Unmatches is obtained
by exploiting the fitted model.
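
The second task can be illustrated with a minimal sketch. It assumes a two-component mixture with Beta component densities (the family the abstract names) over distances normalized to the open interval (0, 1), and it takes the mixture parameters as already fitted. The numeric values, the function names, and the 0.5 posterior threshold are hypothetical choices for illustration, not values or rules taken from the paper, and the fitting step itself is not shown.

```python
import math

def beta_pdf(x, a, b):
    """Density of Beta(a, b) at x, for 0 < x < 1."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x))

def match_posterior(d, weight_m, params_m, params_u):
    """Posterior probability that a pair at normalized distance d is a Match."""
    pm = weight_m * beta_pdf(d, *params_m)          # Match component
    pu = (1.0 - weight_m) * beta_pdf(d, *params_u)  # Unmatch component
    return pm / (pm + pu)

# Hypothetical "fitted" parameters (assumptions, not from the paper).
WEIGHT_M = 0.1          # prior share of Matches among all pairs
PARAMS_M = (2.0, 8.0)   # Matches concentrate at small distances
PARAMS_U = (8.0, 2.0)   # Unmatches concentrate at large distances

distances = [0.05, 0.2, 0.5, 0.8]
posteriors = [match_posterior(d, WEIGHT_M, PARAMS_M, PARAMS_U) for d in distances]
# Label each pair independently by its more probable component.
labels = ["Match" if p > 0.5 else "Unmatch" for p in posteriors]
```

Labelling each pair independently, as this sketch does, ignores any constraint linking the pairs to one another.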
In particular, we use a mixture model with component densities
belonging to the Beta parametric family and we fit it by means
of an original perturbation-like technique. Moreover, we solve
the clustering problem according to both Maximum Likelihood
and Minimum Cost objectives. To accomplish this task, optimal
decision rules fulfilling one-to-one matching constraints are
searched for by a purposefully designed evolutionary algorithm.
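
To show what a one-to-one constraint changes, the sketch below finds a maximum-likelihood one-to-one matching on a tiny example. It is not the evolutionary algorithm of the paper: it swaps in a plain exhaustive search over permutations, which is only viable for very small inputs. It also assumes that pairs are independent given their posterior Match probabilities and that the comparison is square; the matrix `POST` and the function name are hypothetical.

```python
import math
from itertools import permutations

def best_one_to_one(post):
    """Exhaustively search for the maximum-likelihood one-to-one matching.

    post[i][j] is the posterior Match probability of pair (i, j) in a
    square n x n comparison; probabilities must lie strictly in (0, 1).
    """
    n = len(post)
    # Log-odds: declaring (i, j) a Match raises the log-likelihood iff > 0.
    w = [[math.log(p / (1.0 - p)) for p in row] for row in post]
    best_score, best_pairs = -math.inf, []
    for perm in permutations(range(n)):
        # Every partial one-to-one matching is a subset of some permutation,
        # so keep only the pairs of this permutation that help the likelihood.
        pairs = [(i, j) for i, j in enumerate(perm) if w[i][j] > 0]
        score = sum(w[i][j] for i, j in pairs)
        if score > best_score:
            best_score, best_pairs = score, pairs
    return best_pairs

# Hypothetical posteriors: pair (0, 0) looks best on its own, but choosing
# it would block both (0, 1) and (1, 0), which together are more likely.
POST = [
    [0.90, 0.80, 0.05],
    [0.85, 0.10, 0.05],
    [0.05, 0.05, 0.30],
]
matches = best_one_to_one(POST)
```

In this example a pair-by-pair threshold at 0.5 would declare (0, 0), (0, 1) and (1, 0) all Matches, violating the constraint, which is why the decision rule has to be searched over whole assignments.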
Notably, our suite of methods is distance-independent in the
sense that it does not rely on any restrictive assumption on the
function to be used when comparing data-objects. Even more
interestingly, our approach is not confined to record linkage
applications but can also be applied to match other kinds of data-objects.
We present several experiments on real data that validate
the proposed methods and show their excellent effectiveness.