Authors:
Chaudhuri, Surajit; Ganti, Venkatesh; Xin, Dong
Many entity extraction techniques leverage large reference
entity tables to identify entities in documents. Often, an
entity is referenced in document collections differently from
its form in the reference entity tables. Therefore, we study the
problem of determining whether or not a substring "approximately"
matches a reference entity. Similarity measures that exploit
the correlation between candidate substrings and reference
entities across a large number of documents are known to be
more robust than traditional stand-alone string-based
similarity functions. However, such an
approach has significant efficiency challenges. In this paper,
we adopt a new architecture and propose new techniques
to address these efficiency challenges. We mine document
collections and expand a given reference entity table with
variations of each of its entities. Thus, the problem of
approximately matching an input string against reference
entities reduces to that of exact match against the expanded
reference table, which can be implemented efficiently. In
an extensive experimental evaluation, we demonstrate the
accuracy and scalability of our techniques.
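
The reduction from approximate matching to exact matching can be illustrated with a minimal sketch. It assumes the variations have already been mined; the mining step itself is not shown, and the function names, the whitespace tokenization, the lowercase normalization, and the sample entity and variant are all hypothetical choices made for illustration, not the paper's implementation.

```python
from typing import Dict, List, Tuple


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace (an assumed normalization)."""
    return " ".join(text.lower().split())


def build_expanded_table(variants_by_entity: Dict[str, List[str]]) -> Dict[str, str]:
    """Map each reference entity and each of its variations to the entity."""
    table: Dict[str, str] = {}
    for entity, variants in variants_by_entity.items():
        for name in [entity, *variants]:
            table[normalize(name)] = entity
    return table


def extract_entities(
    document: str, table: Dict[str, str], max_tokens: int = 5
) -> List[Tuple[str, str]]:
    """Return (matched substring, reference entity) pairs found by exact lookup.

    Every token window of up to max_tokens tokens is a candidate substring;
    a candidate matches only if it appears verbatim in the expanded table.
    """
    tokens = document.split()
    matches: List[Tuple[str, str]] = []
    for start in range(len(tokens)):
        for length in range(1, max_tokens + 1):
            if start + length > len(tokens):
                break
            candidate = normalize(" ".join(tokens[start:start + length]))
            if candidate in table:
                matches.append((candidate, table[candidate]))
    return matches


# Hypothetical reference entity with one hypothetical mined variation.
table = build_expanded_table({"Microsoft Xbox 360 Console": ["Xbox 360"]})
matches = extract_entities("I bought an Xbox 360 last week", table)
```

Because each candidate is resolved with a single hash lookup, the cost at extraction time does not depend on a similarity computation against every reference entity; in this sketch that work is assumed to have been done when the table was expanded.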