Reasoning about Record Matching Rules

Fan, Wenfei; Jia, Xibei; Li, Jianzhong; Ma, Shuai
Li, J
Fan, W
Ma, S
Jia, X
Citations range: 
10 - 49
vldb09-654.pdf605.61 KB

To accurately match records it is often necessary to utilize the semantics
of the data. Functional dependencies (FDs) have proven
useful in identifying tuples in a clean relation, based on the semantics
of the data. For all the reasons that FDs and their inference
are needed, it is also important to develop dependencies and
their reasoning techniques for matching tuples from unreliable data
sources. This paper investigates dependencies and their reasoning
for record matching. (a) We introduce a class of matching dependencies
(MDs) for specifying the semantics of data in unreliable relations,
defined in terms of similarity metrics and a dynamic semantics.
(b) We identify a special case of MDs, referred to as relative
candidate keys (RCKs), to determine what attributes to compare and
how to compare them when matching records across possibly different
relations. (c) We propose a mechanism for inferring MDs, a
departure from traditional implication analysis, such that when we
cannot match records by comparing attributes that contain errors,
we may still find matches by using other, more reliable attributes.
(d) We provide an O(n2) time algorithm for inferring MDs, and
an effective algorithm for deducing a set of RCKs from MDs. (e)
We experimentally verify that the algorithms help matching tools
efficiently identify keys at compile time for matching, blocking or
windowing, and that the techniques effectively improve both the
quality and efficiency of various record matching methods.