
Circumventing Data Quality Problems Using Multiple Join Paths

Kotidis, Y.; Marian, A.; Srivastava, D.
Clean DB, 2006

We propose the Multiple Join Path (MJP) framework for obtaining
high quality information by linking fields across multiple databases,
when the underlying databases have poor quality data, which are
characterized by violations of integrity constraints like keys and
functional dependencies within and across databases. MJP associates
quality scores with candidate answers by first scoring individual
data paths between a pair of field values taking into account
data quality with respect to specified integrity constraints, and then

Joins that generalize: text classification using WHIRL

Cohen, W.W.; Hirsh, H.
Proc. KDD-98

WHIRL is an extension of relational databases that can perform “soft joins” based on the similarity of textual identifiers; these soft joins extend the traditional operation of joining tables based on the equivalence of atomic values. This paper evaluates WHIRL on a number of inductive classification tasks using data from the World Wide Web. We show that although WHIRL is designed for more general similarity-based reasoning tasks, it is competitive with mature inductive classification systems on these classification tasks.
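
To illustrate how a "soft join" differs from an exact-match join, here is a minimal sketch that pairs strings from two lists by textual similarity. It is not WHIRL's implementation: the TF-IDF weighting, the whitespace tokenizer, the `threshold` parameter, and the names `tfidf_vectors`, `cosine`, and `soft_join` are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    # Deliberately simple (assumed) tokenizer: lowercase, split on whitespace.
    return text.lower().split()


def tfidf_vectors(docs):
    # One unit-length TF-IDF vector per document, stored as a term -> weight dict.
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    df = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vec = {t: (1 + math.log(c)) * math.log(1 + n / df[t]) for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()} if norm else {})
    return vectors


def cosine(u, v):
    # Dot product of two unit-length sparse vectors.
    return sum(w * v.get(t, 0.0) for t, w in u.items())


def soft_join(left, right, threshold=0.2):
    # Keep every (left, right) pair whose similarity reaches the threshold,
    # ordered from most to least similar.
    vecs = tfidf_vectors(left + right)
    lv, rv = vecs[:len(left)], vecs[len(left):]
    pairs = []
    for i, a in enumerate(left):
        for j, b in enumerate(right):
            score = cosine(lv[i], rv[j])
            if score >= threshold:
                pairs.append((a, b, score))
    return sorted(pairs, key=lambda p: p[2], reverse=True)


if __name__ == "__main__":
    companies = ["Acme Corp", "Globex Inc"]
    listings = ["acme corporation", "globex incorporated", "initech"]
    for a, b, score in soft_join(companies, listings):
        print(f"{a!r} ~ {b!r} (similarity {score:.2f})")
```

An equality join on these two lists would return no rows, since no string in one list is identical to a string in the other; the similarity-based join still pairs entries that share distinctive tokens.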
