Automatic retrieval of similar content using search engine query interface

Authors: 
Dasdan, Ali; D'Alberto, Paolo; Kolay, Santanu; Drome, Chris
Year: 
2009
Venue: 
CIKM
URL: 
http://portal.acm.org/citation.cfm?id=1646043
Citations: 
0
Citations range: 
n/a

We consider the coverage testing problem: given a document and a corpus with a limited query interface, determine whether the corpus contains a near-duplicate of the document. This problem has applications in search engines for competitive coverage testing. To solve it, we propose approaches that work in three main steps: generate a query signature from the document, query the corpus using the query signature and scrape the returned results, and validate the similarity between the input document and the returned results. We discuss techniques to control and bound the performance of these methods. We perform large-scale experimental validation and show that these methods perform well across different search engine corpora and on documents in multiple languages. They are also robust to variations in the performance parameters.
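
The three steps in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the signature is built from the terms that are rarest in a background frequency table, the corpus is an in-memory list behind a hypothetical search function that stands in for a real query interface, and validation uses Jaccard similarity over word shingles with an arbitrary threshold.

```python
from __future__ import annotations

import re
from collections import Counter
from typing import Callable


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def query_signature(doc: str, background: Counter, k: int = 4) -> list[str]:
    """Step 1 (assumed heuristic): pick the k terms of the document that are
    rarest in a background frequency table, breaking ties alphabetically."""
    terms = set(tokenize(doc))
    return sorted(terms, key=lambda t: (background[t], t))[:k]


def shingles(text: str, n: int = 3) -> set[tuple[str, ...]]:
    """Return the set of n-word shingles of the text."""
    toks = tokenize(text)
    return {tuple(toks[i:i + n]) for i in range(max(len(toks) - n + 1, 1))}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def make_search(corpus: list[str]) -> Callable[[list[str]], list[str]]:
    """Hypothetical stand-in for a search engine query interface:
    returns the documents that contain every query term."""
    def search(terms: list[str]) -> list[str]:
        wanted = set(terms)
        return [d for d in corpus if wanted <= set(tokenize(d))]
    return search


def has_near_duplicate(doc: str,
                       search: Callable[[list[str]], list[str]],
                       background: Counter,
                       threshold: float = 0.8,
                       k: int = 4) -> bool:
    """Steps 2 and 3: query with the signature, then validate each hit."""
    query = query_signature(doc, background, k)
    doc_shingles = shingles(doc)
    return any(jaccard(doc_shingles, shingles(hit)) >= threshold
               for hit in search(query))


corpus = [
    "the quick brown fox jumps over the lazy dog near the river bank",
    "stock markets fell sharply on monday",
]
background = Counter(tok for d in corpus for tok in tokenize(d))
search = make_search(corpus)

found = has_near_duplicate(
    "The quick brown fox jumps over the lazy dog near the river bank.",
    search, background)
missing = has_near_duplicate(
    "completely unrelated text about cooking pasta at home",
    search, background)
```

Here `found` is true because the first corpus entry matches the input after tokenization, and `missing` is false because no corpus entry contains all of the query terms. Against a real search engine, the signature length `k` and the similarity threshold would need tuning, and the search function would have to issue queries and scrape result pages instead of scanning a list.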
