Combining schema and instance information for integrating heterogeneous data sources

Authors: 
Zhao, H; Ram, S
Year: 
2007
Venue: 
Data and Knowledge Engineering
URL: 
http://linkinghub.elsevier.com/retrieve/pii/S0169023X06000942
Citations: 
28
Attachment: 
ZhaoRam2007.pdf (300.7 KB)

Determining the correspondences among heterogeneous data sources, which is critical to integration of the data
sources, is a complex and resource-consuming task that demands automated support. We propose an iterative procedure
for detecting both schema-level and instance-level correspondences from heterogeneous data sources. Cluster analysis techniques
are used first to identify similar schema elements (i.e., relations and attributes). Based on the identified schema-level
correspondences, classification techniques are used to identify matching tuples. Statistical analysis techniques are then
applied to a preliminary integrated data set to evaluate the relationships among schema elements more accurately.
Improvement in schema-level correspondences triggers another iteration of the procedure. We have performed
an empirical evaluation using real-world heterogeneous data sources and report in this paper some promising results (i.e.,
incremental improvement in identified correspondences) that demonstrate the utility of the proposed iterative procedure.
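
The loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the clustering, classification, and statistical steps are replaced here by deliberately simple stand-ins (attribute-name string similarity, exact value agreement, and an agreement rate over matched tuples). All function names, thresholds, and sample records are hypothetical and chosen only to show how instance-level matches can feed back into schema-level correspondences.

```python
from difflib import SequenceMatcher


def name_sim(a, b):
    """Similarity of two attribute names in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_attributes_by_name(attrs_a, attrs_b, threshold=0.6):
    """Stand-in for the clustering step: pair attributes with similar names."""
    pairs = set()
    for a in attrs_a:
        best = max(attrs_b, key=lambda b: name_sim(a, b))
        if name_sim(a, best) >= threshold:
            pairs.add((a, best))
    return pairs


def match_tuples(rows_a, rows_b, pairs, min_agree=0.5):
    """Stand-in for the classification step: two rows match when enough
    of the currently paired attributes hold equal values."""
    matches = []
    if not pairs:
        return matches
    for i, ra in enumerate(rows_a):
        for j, rb in enumerate(rows_b):
            agree = sum(ra[a] == rb[b] for a, b in pairs)
            if agree / len(pairs) >= min_agree:
                matches.append((i, j))
    return matches


def refine_attributes(rows_a, rows_b, matches, attrs_a, attrs_b, min_rate=0.8):
    """Stand-in for the statistical step: propose attribute pairs whose
    values agree on most of the matched tuples."""
    pairs = set()
    if not matches:
        return pairs
    for a in attrs_a:
        for b in attrs_b:
            hits = sum(rows_a[i][a] == rows_b[j][b] for i, j in matches)
            if hits / len(matches) >= min_rate:
                pairs.add((a, b))
    return pairs


def integrate(rows_a, rows_b, max_iter=5):
    """Alternate between schema-level and instance-level matching until
    the set of attribute correspondences stops growing."""
    attrs_a, attrs_b = list(rows_a[0]), list(rows_b[0])
    pairs = match_attributes_by_name(attrs_a, attrs_b)
    matches = []
    for _ in range(max_iter):
        matches = match_tuples(rows_a, rows_b, pairs)
        refined = pairs | refine_attributes(
            rows_a, rows_b, matches, attrs_a, attrs_b
        )
        if refined == pairs:  # no schema-level improvement: stop iterating
            break
        pairs = refined
    return pairs, matches


# Hypothetical sample records from two sources with different schemas.
rows_a = [
    {"name": "Ann", "phone": "555-1", "city": "Tucson"},
    {"name": "Bob", "phone": "555-2", "city": "Phoenix"},
]
rows_b = [
    {"full_name": "Ann", "tel": "555-1", "town": "Tucson"},
    {"full_name": "Bob", "tel": "555-2", "town": "Phoenix"},
]
pairs, matches = integrate(rows_a, rows_b)
print(sorted(pairs))
print(matches)
```

In this toy data only `name` and `full_name` are similar enough by name to be paired initially. The tuples matched on that pair then expose the `phone`/`tel` and `city`/`town` correspondences, which is the kind of incremental improvement the abstract refers to.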