Circumventing Data Quality Problems Using Multiple Join Paths

Authors: 
Kotidis, Y.; Marian, A.; Srivastava, D.
Year: 
2006
Venue: 
CleanDB, 2006
URL: 
http://pike.psu.edu/cleandb06/papers/CameraReady_107.pdf
Citations: 
10
Citations range: 
10 - 49
Attachment: 
Kotidis2006CircumventingDataQuality.pdf
Size: 
243.24 KB

We propose the Multiple Join Path (MJP) framework for obtaining
high-quality information by linking fields across multiple databases
when the underlying databases contain poor-quality data, characterized
by violations of integrity constraints such as keys and
functional dependencies within and across databases. MJP associates
quality scores with candidate answers in two steps. It first scores
each individual data path between a pair of field values, taking into
account data quality with respect to the specified integrity
constraints. It then agglomerates scores across the multiple data
paths that serve as corroborating evidence for a candidate answer.
We address the problem of finding the top-few (highest-quality)
answers in the MJP framework using novel techniques, and demonstrate
the utility of these techniques using real data and our Virtual
Integration Prototype testbed.
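
The two-step scoring described in the abstract can be illustrated with a
minimal sketch. The abstract does not give the actual scoring or
agglomeration functions, so everything concrete here is an assumption made
for illustration: each hop on a path carries a hypothetical quality score
in [0, 1], a path score is the product of its hop scores, and scores of
paths that reach the same candidate answer are combined with a noisy-or.
The sample data and the names path_score, agglomerate, and top_answers are
likewise hypothetical.

```python
import heapq
from collections import defaultdict
from math import prod

# Hypothetical input: each entry is one data path from the query value to a
# candidate answer, with an assumed per-hop quality score in [0, 1]
# (for example, how well that hop conforms to a key or functional dependency).
paths = [
    ("phone-1", [0.9, 0.8]),
    ("phone-1", [0.5]),
    ("phone-2", [0.6, 0.5]),
    ("phone-3", [0.4]),
]

def path_score(hop_scores):
    """Assumed path score: the product of the hop scores along the path."""
    return prod(hop_scores)

def agglomerate(path_scores):
    """Assumed combination rule (noisy-or): more corroborating paths
    raise the score of an answer, but never above 1."""
    return 1.0 - prod(1.0 - s for s in path_scores)

def top_answers(paths, k):
    """Return the k candidate answers with the highest combined score."""
    by_answer = defaultdict(list)
    for answer, hops in paths:
        by_answer[answer].append(path_score(hops))
    combined = {a: agglomerate(s) for a, s in by_answer.items()}
    return heapq.nlargest(k, combined.items(), key=lambda kv: kv[1])

result = top_answers(paths, 2)
for answer, score in result:
    print(f"{answer}: {score:.2f}")
```

In this sketch, phone-1 ranks first because two independent paths
corroborate it, even though one of them is individually weak. This version
scores every path exhaustively; the paper's contribution is finding the
top-few answers efficiently, which this sketch does not attempt.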