Adaptive duplicate detection using learnable string similarity measures

Wed, 09/13/2006 - 15:12 — Anonymous

Authors:

Bilenko, M; Mooney, RJ

Year:

2003

Venue:

Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining

URL:

http://portal.acm.org/citation.cfm?id=956759&dl=ACM&coll=portal&CFID=11111111&CFTOKEN=2222222

Citations:

573

Citations range:

500 - 999

Attachment	Size
Bilenko2003Adaptiveduplicatedetection.pdf	234.29 KB

Identifying approximately duplicate records in databases is an essential step in data cleaning and data integration. Most existing approaches have relied on generic or manually tuned distance metrics for estimating the similarity of potential duplicates. In this paper, we present a framework for improving duplicate detection using trainable measures of textual similarity. We propose to employ learnable text distance functions for each database field, and show that such measures are capable of adapting to the specific notion of similarity that is appropriate for the field's domain. We present two learnable text similarity measures suitable for this task: an extended variant of learnable string edit distance, and a novel vector-space-based measure that employs a Support Vector Machine (SVM) for training. Experimental results on a range of datasets show that our framework can improve duplicate detection accuracy over traditional techniques.
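As a rough illustration of the idea of giving each database field its own string distance with adjustable parameters, here is a minimal Python sketch. It is not the paper's method: the names EditCosts, weighted_edit_distance, and record_distance are hypothetical, the cost values are set by hand rather than trained from labeled duplicate pairs, and neither the extended edit distance nor the SVM-based measure described in the abstract is reproduced.

```python
from dataclasses import dataclass


@dataclass
class EditCosts:
    """Hypothetical per-field edit costs.

    In a trainable setup these values would be fitted from labeled
    duplicate and non-duplicate pairs; here they are set by hand.
    """
    substitute: float = 1.0
    insert: float = 1.0
    delete: float = 1.0


def weighted_edit_distance(a: str, b: str, costs: EditCosts) -> float:
    """Cost of turning string a into string b, using dynamic programming."""
    # prev[j] holds the cost of turning the current prefix of a into b[:j].
    prev = [j * costs.insert for j in range(len(b) + 1)]
    for i, ch_a in enumerate(a, start=1):
        curr = [i * costs.delete]
        for j, ch_b in enumerate(b, start=1):
            match_or_sub = prev[j - 1] + (0.0 if ch_a == ch_b else costs.substitute)
            drop_from_a = prev[j] + costs.delete
            add_from_b = curr[j - 1] + costs.insert
            curr.append(min(match_or_sub, drop_from_a, add_from_b))
        prev = curr
    return prev[-1]


def record_distance(rec_a: dict, rec_b: dict, field_costs: dict) -> float:
    """Sum the per-field distances, each computed with its own costs."""
    return sum(
        weighted_edit_distance(rec_a[field], rec_b[field], costs)
        for field, costs in field_costs.items()
    )


if __name__ == "__main__":
    # Assumed example: insertions and deletions are made cheap for the
    # street field, so abbreviations such as "St" vs "Street" count less.
    field_costs = {
        "name": EditCosts(),
        "street": EditCosts(insert=0.25, delete=0.25),
    }
    rec_a = {"name": "Jon Smith", "street": "12 Main St"}
    rec_b = {"name": "John Smith", "street": "12 Main Street"}
    print(record_distance(rec_a, rec_b, field_costs))
```

Keeping the costs in a separate object per field is one way to let a street-address field tolerate abbreviations while a name field stays strict. A learning procedure would replace the hand-set numbers with values estimated from training pairs.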
