Data Integration for the Relational Web

Authors:

Cafarella, Michael; Halevy, Alon; Khoussainova, Nodira

Year:

2009

Venue:

VLDB 2009

URL:

http://www.vldb.org/pvldb/2/vldb09-576.pdf

The Web contains a vast amount of structured information such as HTML tables, HTML lists and deep-web databases; there is enormous potential in combining and re-purposing this data in creative ways. However, integrating data from this relational web raises several challenges that are not addressed by current data integration systems or mash-up tools. First, the structured data is usually not published cleanly and must be extracted (say, from an HTML list) before it can be used. Second, due to the vastness of the corpus, a user can never know all of the potentially-relevant databases ahead of time (much less write a wrapper or mapping for each one); the source databases must be discovered during the integration process. Third, some of the important information regarding the data is only present in its enclosing web page and needs to be extracted appropriately.
This paper describes Octopus, a system that combines search, extraction, data cleaning and integration, and enables users to create new data sets from those found on the Web. The key idea underlying Octopus is to offer the user a set of best-effort operators that automate the most labor-intensive tasks. For example, the Search operator takes a search-style keyword query and returns a set of relevance-ranked and similarity-clustered structured data sources on the Web; the Context operator helps the user specify the semantics of the sources by inferring attribute values that may not appear in the source itself, and the Extend operator helps the user find related sources that can be joined to add new attributes to a table. Octopus executes some of these operators automatically, but always allows the user to provide feedback and correct errors. We describe the algorithms underlying each of these operators and experiments that demonstrate their efficacy.
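
To make the three operators named in the abstract more concrete, the following is a minimal toy sketch in Python. It is not the Octopus implementation or its algorithms: the `Table` class, the function names `search`, `context` and `extend`, the keyword-count ranking, the exact-match join, and all sample data are assumptions invented for illustration. Clustering of results and user feedback, both mentioned in the abstract, are left out.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Table:
    """Hypothetical stand-in for a table extracted from a web page."""
    name: str
    columns: list[str]
    rows: list[dict[str, str]]
    page_text: str = ""  # text of the enclosing page


def search(corpus: list[Table], query: str) -> list[Table]:
    """Toy Search: rank tables by how many query keywords occur in their text."""
    keywords = query.lower().split()

    def score(table: Table) -> int:
        text = " ".join([table.name, table.page_text, *table.columns]).lower()
        return sum(1 for kw in keywords if kw in text)

    ranked = sorted(corpus, key=score, reverse=True)
    return [t for t in ranked if score(t) > 0]


def context(table: Table, column: str, value: str) -> Table:
    """Toy Context: add a column whose value comes from outside the table.

    Here the caller supplies the value; inferring it from the enclosing
    page is the hard part and is not attempted in this sketch.
    """
    rows = [{**row, column: value} for row in table.rows]
    return Table(table.name, table.columns + [column], rows, table.page_text)


def extend(table: Table, key: str, other: Table, other_key: str,
           new_column: str) -> Table:
    """Toy Extend: left-join `other` onto `table` to add `new_column`.

    Keys are compared after trimming and lowercasing; unmatched rows get ''.
    """
    lookup = {r[other_key].strip().lower(): r[new_column] for r in other.rows}
    rows = [
        {**row, new_column: lookup.get(row[key].strip().lower(), "")}
        for row in table.rows
    ]
    return Table(table.name, table.columns + [new_column], rows, table.page_text)


# Invented sample data, for illustration only.
papers = Table(
    "accepted papers",
    ["title", "author"],
    [{"title": "Paper A", "author": "Alice"},
     {"title": "Paper B", "author": "Bob"}],
    page_text="Accepted papers, Example Conference 2009",
)
affiliations = Table(
    "staff directory",
    ["person", "affiliation"],
    [{"person": "alice ", "affiliation": "Univ X"}],
    page_text="People and their affiliations",
)

hits = search([papers, affiliations], "accepted papers")
with_year = context(hits[0], "year", "2009")
result = extend(with_year, "author", affiliations, "person", "affiliation")
```

In this sketch the year exists only in the enclosing page text, so `context` attaches it as a new column, and `extend` then adds an affiliation by joining on the author name. A best-effort system would need approximate matching and a way for the user to correct wrong joins, which this sketch leaves out.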
