cse.unsw.edu.au

Top-k Set Similarity Joins

Authors: Xiao, Chuan; Wang, Wei; Lin, Xuemin; Shang, Haichuan
Year: 2009
Venue: ICDE

Similarity join is a useful primitive operation underlying many applications, such as near-duplicate Web page detection, data integration, and pattern recognition. Traditional similarity joins require a user to specify a similarity threshold. In this paper, we study a variant of the similarity join, termed top-k set similarity join. It returns the top-k pairs of records ranked by their similarities, thus eliminating the guesswork users have to perform when the similarity threshold is unknown beforehand. An algorithm, topk-join, is proposed to answer top-k similarity joins efficiently.
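
To make the problem statement concrete, the following is a minimal brute-force sketch of what a top-k set similarity join returns. It is not the paper's topk-join algorithm, which is not described in this abstract. The function names `jaccard` and `naive_topk_join`, the choice of Jaccard as the similarity function, and the sample records are all assumptions made for illustration.

```python
import heapq
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b| of two sets (assumed measure)."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def naive_topk_join(records, k):
    """Return the k most similar record pairs as (similarity, i, j) tuples.

    Hypothetical baseline: scores every pair, so it costs O(n^2)
    similarity computations. No threshold is supplied by the caller.
    """
    scored = (
        (jaccard(records[i], records[j]), i, j)
        for i, j in combinations(range(len(records)), 2)
    )
    # Keep only the k highest-scoring pairs, ordered by descending similarity.
    return heapq.nlargest(k, scored, key=lambda t: t[0])


# Illustrative records, each a set of token ids.
records = [{1, 2, 3, 4}, {1, 2, 3, 5}, {1, 2, 3, 4, 6}, {7, 8}]
print(naive_topk_join(records, 2))
```

The caller supplies only k, not a similarity threshold, which is the difference from a traditional similarity join. A brute-force scan like this one scores every pair and therefore scales quadratically with the number of records.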
