Schema Matching and Mapping-based Data Integration

Do, Hong-Hai
Dissertation, Univ. Leipzig, 2006

Schema matching aims at identifying semantic correspondences between elements of two schemas, e.g., database schemas, ontologies, and XML message formats. It is needed in many database applications, such as the integration of web data sources, data warehouse loading, and XML message mapping. In today's systems, schema matching is performed manually, a time-consuming, tedious, and error-prone process that becomes increasingly impractical as the number of schemas and data sources grows. To reduce the manual effort as much as possible, approaches that determine element correspondences semi-automatically are required. We start by surveying the existing approaches and prototypes for schema matching and explain their common features and applicability using a previously proposed taxonomy. We further identify the major criteria that influence the effectiveness of a match approach. We use these criteria to compare the evaluations of various recent prototypes and discuss the issues that need to be addressed in future evaluations. Besides helping us to develop and test our own system, the surveys of match approaches and of evaluations aim at guiding future implementations, so that they can be documented better, their results become more reproducible, and comparisons between different systems and approaches become easier. Based on these insights into the state of the art, we have developed Coma (Combining Matchers) and further extended it to Coma++, both of which are generic and customizable systems for semi-automatic schema matching. In particular, Coma++ offers a platform for the flexible combination of different match algorithms. It provides a large spectrum of individual matchers, including a novel approach that reuses results from previous match operations, and various mechanisms to combine and refine matcher results. Based on this flexible infrastructure, match processing is supported as a workflow, which allows a match task to be divided and solved successively in multiple stages.
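The idea of combining several matchers can be illustrated with a minimal sketch. It is not the Coma++ implementation: the two matchers, the averaging step, the threshold of 0.6, and all function and field names are assumptions chosen for illustration only.

```python
from difflib import SequenceMatcher

def name_matcher(a, b):
    """Hypothetical matcher: string similarity of element names in [0, 1]."""
    return SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()

def type_matcher(a, b):
    """Hypothetical matcher: 1.0 if the declared data types agree, else 0.0."""
    return 1.0 if a["type"] == b["type"] else 0.0

def combine(matchers, source, target, threshold=0.6):
    """Average the matcher scores per element pair and keep pairs that
    reach the threshold (one simple aggregation and selection choice,
    assumed here for illustration)."""
    result = []
    for s in source:
        for t in target:
            scores = [m(s, t) for m in matchers]
            sim = sum(scores) / len(scores)
            if sim >= threshold:
                result.append((s["name"], t["name"], sim))
    return result

# Toy schemas, made up for this example.
source = [{"name": "custName", "type": "string"}]
target = [
    {"name": "customerName", "type": "string"},
    {"name": "orderDate", "type": "date"},
]

correspondences = combine([name_matcher, type_matcher], source, target)
for s_name, t_name, sim in correspondences:
    print(f"{s_name} <-> {t_name} (similarity {sim:.2f})")
```

Averaging is only one of many conceivable ways to aggregate matcher results; other aggregation or selection rules could be substituted in `combine` without changing the individual matchers.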
In particular, we implement specific workflows (i.e., strategies) for context-dependent matching of schemas with shared elements and for fragment-based matching of very large schemas. With the flexibility to customize matchers and match strategies, Coma++ also serves as a platform for the comparative evaluation of match approaches. We performed comprehensive evaluations using real-world schemas found on the web and ontologies from a published ontology alignment contest. Notably, the E-business message standards involved in our evaluations are among the largest and most complex test schemas compared with those used in previous evaluations. Coma++ has shown high quality and fast execution times for both the schemas and the ontologies, demonstrating the practicability of our generic solution for different domains. In particular, the quality achieved by Coma++ in the ontology alignment contest is comparable to that of the best-performing participants. Through this systematic evaluation, we obtain important insights into the performance of different match strategies and the impact of many factors, such as schema size, the choice of matchers and combination strategies, and the reuse of previous match results. We believe that these insights can be of great help for the development and evaluation of further match algorithms. Building on the same idea of reusing previous match results, we have developed Genmapper (Generic Mapper), a new approach for integrating heterogeneous web data sources. It utilizes mappings between sources and correspondences between their objects, i.e., it operates at the instance level. We focus on the bioinformatics domain with hundreds of publicly accessible, highly cross-referenced web data sources managing annotations and correspondences for various types of molecular-biological objects, such as genes and proteins. Genmapper explicitly captures existing relationships between objects to drive data integration and to combine annotation knowledge from different sources.
A generic schema is used to uniformly represent object data and correspondences, making it easy to integrate new data sources and to update existing ones. To serve specific analysis needs, powerful operators are provided to derive tailored views from the generic data representation. Genmapper has been successfully used for large-scale functional profiling of genes and proteins.
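As a rough illustration of a generic representation of correspondences and of an operator that derives a view from it, consider the following sketch. It does not reproduce Genmapper's actual schema or operators: the tuple layout, the source names, the object identifiers, and the `get_mapping` and `compose` functions are all hypothetical.

```python
from collections import defaultdict

# Generic store (assumed layout): every correspondence is one tuple
# (source_a, object_a, source_b, object_b), regardless of the sources involved.
# Source names and object identifiers are made up for this example.
CORRESPONDENCES = [
    ("GeneDB", "g1", "ProteinDB", "p1"),
    ("GeneDB", "g2", "ProteinDB", "p2"),
    ("ProteinDB", "p1", "FunctionDB", "f7"),
    ("ProteinDB", "p1", "FunctionDB", "f9"),
]

def get_mapping(store, src, tgt):
    """Select the object pairs that link source `src` to source `tgt`."""
    return {(a, b) for (s, a, t, b) in store if s == src and t == tgt}

def compose(m1, m2):
    """Derive a new mapping by joining two mappings on their shared
    intermediate objects (a hypothetical view-deriving operator)."""
    by_mid = defaultdict(set)
    for mid, c in m2:
        by_mid[mid].add(c)
    return {(a, c) for a, mid in m1 for c in by_mid.get(mid, ())}

gene_to_protein = get_mapping(CORRESPONDENCES, "GeneDB", "ProteinDB")
protein_to_function = get_mapping(CORRESPONDENCES, "ProteinDB", "FunctionDB")
gene_to_function = compose(gene_to_protein, protein_to_function)
print(sorted(gene_to_function))
```

Because every source uses the same tuple layout in this sketch, adding a new source only means appending tuples; no new tables or code paths are needed, which is the property the generic representation is meant to convey.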