Authors:
Guha, S.; Koudas, N.; Marathe, A.; Srivastava, D.
Author:
Guha, S
Koudas, N
Marathe, A
Srivastava, D
Venue:
Proceedings of the 30th International Conference on Very Large Databases (VLDB 2004), 2004
URL:
http://www.isys.ucl.ac.be/vldb04/eProceedings/contents/pdf/RS17P2.PDF
Data Cleaning is an important process that has been at
the center of research interest in recent years. An important
end goal of effective data cleaning is to identify
the relational tuple or tuples that are “most related” to
a given query tuple. Various techniques have been proposed
in the literature for efficiently identifying approximate
matches to a query string against a single attribute
of a relation. In addition to constructing a ranking (i.e.,
ordering) of these matches, the techniques often associate,
with each match, scores that quantify the extent
of the match. Since multiple attributes could exist in the
query tuple, issuing approximate match operations for
each of them separately will effectively create a number
of ranked lists of the relation tuples. Merging these lists
to identify a final ranking and scoring, and returning the
top-K tuples, is a challenging task.
In this paper, we adapt the well-known footrule distance
(for merging ranked lists) to effectively deal with
scores. We study efficient algorithms to merge rankings,
and produce the top-K tuples, in a declarative way.
Since techniques for approximately matching a query
string against a single attribute in a relation are typically
best deployed in a database, we introduce and describe
two novel algorithms for this problem and we
provide SQL specifications for them. Our experimental
case study, using real application data along with a
realization of our proposed techniques on a commercial
data base system, highlights the benefits of the proposed
algorithms and attests to the overall effectiveness and
practicality of our approach.