Authors:
Bilenko, Mikhail; Kamath, Beena; Mooney, Raymond J.
Author:
Bilenko, M
Kamath, B
Mooney, R
URL:
http://research.microsoft.com/~mbilenko/papers/06-icdm.pdf
Many data mining tasks require computing similarity between
pairs of objects. Pairwise similarity computations are
particularly important in record linkage systems, as well as
in clustering and schema mapping algorithms. Because the
number of object pairs grows quadratically with the size of
the dataset, computing similarity between all pairs is impractical
and becomes prohibitive for large datasets and
complex similarity functions. Blocking methods alleviate
this problem by efficiently selecting approximately similar
object pairs for subsequent distance computations, leaving
out the remaining pairs as dissimilar. Previously proposed
blocking methods require manually constructing an indexbased
similarity function or selecting a set of predicates,
followed by hand-tuning of parameters. In this paper, we introduce
an adaptive framework for automatically learning
blocking functions that are efficient and accurate. We describe
two predicate-based formulations of learnable blocking
functions and provide learning algorithms for training
them. The effectiveness of the proposed techniques
is demonstrated on real and simulated datasets, on which
they prove to be more accurate than non-adaptive blocking
methods.