Text joins in an RDBMS for web data integration

Mon, 10/09/2006 - 12:46 — thor

Authors:

Gravano, L.; Ipeirotis, P.G.; Koudas, N.; Srivastava, D.

Year:

2003

Venue:

Proceedings of the twelfth international conference on World Wide Web, 2003

URL:

http://portal.acm.org/citation.cfm?id=775166&dl=

Citations:

129

Citations range:

100 - 499

Attachment	Size
Gravano2003TextjoinsinanRDBMSforweb.pdf	700.64 KB

The integration of data produced and collected across autonomous, heterogeneous web services is an increasingly important and challenging problem. Due to the lack of global identifiers, the same entity (e.g., a product) might have different textual representations across databases. Textual data is also often noisy because of transcription errors, incomplete information, and lack of standard formats. A fundamental task during data integration is matching of strings that refer to the same entity. In this paper, we adopt the widely used and established cosine similarity metric from the information retrieval field in order to identify potential string matches across web sources. We then use this similarity metric to characterize this key aspect of data integration as a join between relations on textual attributes, where the similarity of matches exceeds a specified threshold. Computing an exact answer to the text join can be expensive. For query processing efficiency, we propose a sampling-based join approximation strategy for execution in a standard, unmodified relational database management system (RDBMS), since more and more web sites are powered by RDBMSs with a web-based front end. We implement the join inside an RDBMS, using SQL queries, for scalability and robustness reasons. Finally, we present a detailed performance evaluation of an implementation of our algorithm within a commercial RDBMS, using real-life data sets. Our experimental results demonstrate the efficiency and accuracy of our techniques.
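The text join described in the abstract can be illustrated with a small sketch. This is not the paper's implementation. It computes the exact join rather than the sampling-based approximation the authors propose, and it rests on several assumptions that do not come from the abstract: strings are tokenized into character 3-grams, weights are tf-idf with idf computed over both relations combined, and SQLite (via Python's standard `sqlite3` module) stands in for the commercial RDBMS. The table names `r1_weights` and `r2_weights` and the helper functions are hypothetical.

```python
import math
import sqlite3
from collections import Counter


def qgrams(text, q=3):
    """Split a lowercased string into overlapping character q-grams (assumed tokenizer)."""
    s = text.lower()
    if len(s) < q:
        return [s]
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def weight_rows(strings, q=3):
    """Return (string_id, token, weight) rows; each string's tf-idf vector has unit length."""
    docs = [Counter(qgrams(s, q)) for s in strings]
    df = Counter(tok for d in docs for tok in d)  # document frequency per token
    n = len(docs)
    rows = []
    for sid, d in enumerate(docs):
        raw = {t: tf * math.log(1 + n / df[t]) for t, tf in d.items()}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        rows.extend((sid, t, w / norm) for t, w in raw.items())
    return rows


def text_join(left, right, threshold, q=3):
    """Exact cosine-similarity join of two lists of strings, evaluated in SQL."""
    n_left = len(left)
    # Assumption: idf is computed over the union of both relations.
    rows = weight_rows(list(left) + list(right), q)
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE r1_weights (sid INTEGER, token TEXT, weight REAL)")
    con.execute("CREATE TABLE r2_weights (sid INTEGER, token TEXT, weight REAL)")
    con.executemany(
        "INSERT INTO r1_weights VALUES (?, ?, ?)",
        [r for r in rows if r[0] < n_left],
    )
    con.executemany(
        "INSERT INTO r2_weights VALUES (?, ?, ?)",
        [(sid - n_left, t, w) for sid, t, w in rows if sid >= n_left],
    )
    # With unit-length vectors, the cosine similarity of two strings is the
    # sum over shared tokens of the product of their weights.
    cur = con.execute(
        """
        SELECT a.sid, b.sid, SUM(a.weight * b.weight) AS sim
        FROM r1_weights AS a
        JOIN r2_weights AS b ON a.token = b.token
        GROUP BY a.sid, b.sid
        HAVING SUM(a.weight * b.weight) >= ?
        ORDER BY sim DESC
        """,
        (threshold,),
    )
    result = [(left[i], right[j], sim) for i, j, sim in cur]
    con.close()
    return result


if __name__ == "__main__":
    pairs = text_join(["AT&T Corporation"], ["AT&T Corp.", "Microsoft"], 0.3)
    for l, r, sim in pairs:
        print(f"{l!r} ~ {r!r}: {sim:.3f}")
```

Because the join condition is token equality, pairs that share no token never appear in the result, even at a threshold of zero. The exact query touches every pair of strings that share at least one token, which is the cost the abstract's sampling-based strategy is meant to reduce.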
