Joins that generalize: text classification using WHIRL

Authors: 
Cohen, W.W.; Hirsh, H.
Year: 
1998
Venue: 
Proc. KDD-98
URL: 
http://www.ldv.uni-trier.de/ldvpage/naumann/textklassifikation/Textklassifikation/cohen98joins.pdf
Citations: 
161
Attachment: 
Cohen1998Joinsthatgeneralizetext.pdf (100.68 KB)

WHIRL is an extension of relational databases that can perform “soft joins” based on the similarity of textual identifiers; these soft joins extend the traditional operation of joining tables based on the equivalence of atomic values. This paper evaluates WHIRL on a number of inductive classification tasks using data from the World Wide Web. We show that although WHIRL is designed for more general similarity-based reasoning tasks, it is competitive with mature inductive classification systems on these classification tasks. In particular, WHIRL generally achieves lower generalization error than C4.5, RIPPER, and several nearest-neighbor methods. WHIRL is also fast: up to 500 times faster than C4.5 on some benchmark problems. We also show that WHIRL can be efficiently used to select from a large pool of unlabeled items those that can be classified correctly with high confidence.
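To make the idea of classifying through a soft join concrete, here is a minimal Python sketch. It is not the WHIRL system: the abstract gives no implementation details, so the TF-IDF weighting, the cosine similarity, the top-k cutoff, and the noisy-or combination of scores are all assumptions made for illustration, and every function name (`build_idf`, `vectorize`, `soft_join_classify`) is hypothetical. An unlabeled text is joined against a table of labeled texts by similarity rather than equality, and the labels of the best matches are combined into a per-label score that could also serve as a rough confidence value.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase whitespace tokenizer (a simplifying assumption)."""
    return text.lower().split()


def build_idf(token_lists):
    """Inverse document frequency for every term seen in the labeled table."""
    n = len(token_lists)
    df = Counter(t for toks in token_lists for t in set(toks))
    return {t: math.log((n + 1) / d) for t, d in df.items()}


def vectorize(tokens, idf):
    """Unit-length TF-IDF vector as a dict; terms without an IDF are dropped."""
    tf = Counter(t for t in tokens if t in idf)
    vec = {t: (1.0 + math.log(c)) * idf[t] for t, c in tf.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}


def cosine(u, v):
    """Dot product of two unit-length sparse vectors."""
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(t, 0.0) for t, w in u.items())


def soft_join_classify(query, labeled, k=3):
    """Soft-join `query` against (text, label) rows and score each label.

    The k most similar rows are kept, and their similarities are combined
    per label with a noisy-or: score = 1 - prod(1 - similarity).
    """
    token_lists = [tokenize(text) for text, _ in labeled]
    idf = build_idf(token_lists)
    q = vectorize(tokenize(query), idf)
    matches = [
        (cosine(q, vectorize(toks, idf)), label)
        for toks, (_, label) in zip(token_lists, labeled)
    ]
    top = sorted(matches, key=lambda m: m[0], reverse=True)[:k]
    miss = defaultdict(lambda: 1.0)
    for sim, label in top:
        if sim > 0.0:
            miss[label] *= 1.0 - min(sim, 1.0)  # clamp floating-point overshoot
    return {label: 1.0 - m for label, m in miss.items()}


if __name__ == "__main__":
    table = [
        ("cardinal bird red songbird", "bird"),
        ("blue jay bird", "bird"),
        ("grizzly bear mammal", "mammal"),
        ("polar bear arctic mammal", "mammal"),
    ]
    print(soft_join_classify("red bird cardinal", table))
```

Under these assumptions, items whose best label score falls below a chosen threshold could be left unclassified, which is one simple way to pick out the unlabeled items that can be classified with high confidence.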