Similarity functions

Reference Fusion and Flexible Querying

Authors: 
Saïs, Fatiha; Thomopoulos, Rallou
Year: 
2008
Venue: 
ODBASE--OTM Conferences 2008: Monterrey, Mexico

This paper deals with the issue of data fusion, which arises once reconciliations between references have been determined. The objective of this task is to fuse the descriptions of references that refer to the same real-world entity so as to obtain a unique representation. In order to deal with the problem of uncertainty in the values associated with the attributes, we have chosen to represent the results of the fusion of references in a formalism based on fuzzy sets. We indicate how the confidence degrees are computed.

Approximate string-matching with q-grams and maximal matches

Authors: 
Ukkonen, E
Year: 
1992
Venue: 
Theoretical Computer Science

Ukkonen, E., Approximate string-matching with q-grams and maximal matches. Theoretical Computer Science 92 (1992) 191-211.
We study approximate string-matching in connection with two string distance functions that are
computable in linear time. The first function is based on the so-called q-grams. An algorithm is given
for the associated string-matching problem that finds the locally best approximate occurrences of
pattern P, |P| = m, in text T, |T| = n, in time O(n log(m − q)). The occurrences with distance
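
The q-gram distance mentioned in this abstract can be illustrated with a short sketch. It assumes the usual definition: the sum, over all q-grams, of the absolute difference between the number of occurrences in each string. The function names and the default q = 2 are illustrative choices, not taken from the paper, and this is not the paper's string-matching algorithm.

```python
from collections import Counter


def qgram_profile(s, q):
    """Count every substring of length q in s (empty if len(s) < q)."""
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))


def qgram_distance(x, y, q=2):
    """L1 distance between the q-gram profiles of x and y.

    Hypothetical helper for illustration; q=2 is an assumed default.
    """
    px, py = qgram_profile(x, q), qgram_profile(y, q)
    # A Counter returns 0 for q-grams it has not seen.
    return sum(abs(px[g] - py[g]) for g in px.keys() | py.keys())
```

Two strings can differ and still have distance 0 under this measure, since only the multiset of q-grams is compared, not their order.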

A guided tour to approximate string matching

Authors: 
Navarro, G
Year: 
2001
Venue: 
ACM Computing Surveys

We survey the current techniques to cope with the problem of string matching that allows errors. This is becoming a more and more relevant issue for many fast growing areas such as information retrieval and computational biology. We focus on online searching and mostly on edit distance, explaining the problem and its relevance, its statistical behavior, its history and current developments, and the central ideas of the algorithms and their complexities. We present a number of experiments to compare the performance of the different algorithms and show which are the best choices.
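
As a rough illustration of online searching under edit distance, the sketch below uses the classic dynamic-programming column update (often attributed to Sellers) to report every text position where an occurrence of the pattern with at most k errors ends. The function name and interface are assumptions made for this example; the survey covers considerably faster algorithms.

```python
def approx_find(pattern, text, k):
    """Return end positions in text of matches of pattern with <= k errors.

    Illustrative sketch with unit-cost insertions, deletions and
    substitutions; runs in O(len(pattern) * len(text)) time.
    """
    m = len(pattern)
    # col[i] = best edit distance between pattern[:i] and some suffix
    # of the text read so far. Before reading any text, col[i] = i.
    col = list(range(m + 1))
    ends = []
    for pos, tc in enumerate(text):
        new = [0]  # a match may start at any text position
        for i in range(1, m + 1):
            new.append(min(
                col[i] + 1,                           # extra text character
                new[i - 1] + 1,                       # missing pattern character
                col[i - 1] + (pattern[i - 1] != tc),  # match or substitution
            ))
        col = new
        if col[m] <= k:
            ends.append(pos)
    return ends
```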

Learning string-edit distance

Authors: 
Ristad, ES; Yianilos, PN
Year: 
1998
Venue: 
Pattern Analysis and Machine Intelligence, IEEE Transactions on, 20, 1998

In many applications, it is necessary to determine the similarity of two strings.
A widely-used notion of string similarity is the edit distance: the minimum
number of insertions, deletions, and substitutions required to transform one
string into the other. In this report, we provide a stochastic model for string
edit distance. Our stochastic model allows us to learn a string edit distance
function from a corpus of examples. We illustrate the utility of our approach
by applying it to the difficult problem of learning the pronunciation of words in
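
The edit distance defined in this abstract can be computed with the standard dynamic program. The sketch below assumes fixed unit costs for every operation; it does not implement the stochastic model or the learned costs that the paper proposes, and the function name is illustrative.

```python
def edit_distance(a, b):
    """Minimum number of insertions, deletions and substitutions
    needed to transform string a into string b (unit costs assumed)."""
    # prev[j] = distance between the current prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or free match
            ))
        prev = curr
    return prev[-1]
```

Only two rows of the table are kept at a time, so memory use is proportional to the length of b.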

Two algorithms for approximate string matching in static texts

Authors: 
Jokinen, P.; Ukkonen, E.
Year: 
1991
Venue: 
Proceedings Mathematical Foundations of Computer Science 1991

Incremental distance join algorithms for spatial databases

Authors: 
Hjaltason, G.R.; Samet, H.
Year: 
1998
Venue: 
ACM SIGMOD Record, 27, 1998

Two new spatial join operations, distance join and distance semi-join, are introduced where the join output is ordered by the distance between the spatial attribute values of the joined tuples. Incremental algorithms are presented for computing these operations, which can be used in a pipelined fashion, thereby obviating the need to wait for their completion when only a few tuples are needed. The algorithms can be used with a large class of hierarchical spatial data structures and arbitrary spatial data types in any dimensions. In addition, any distance metric may be employed.

Data integration using similarity joins and a word-based information representation language

Authors: 
Cohen, W.W.
Year: 
2000
Venue: 
ACM Transactions on Information Systems (TOIS), 18, 2000

The integration of distributed, heterogeneous databases, such as those available on the World Wide Web, poses many problems. Here we consider the problem of integrating data from sources that lack common object identifiers. A solution to this problem is proposed for databases that contain informal, natural-language “names” for objects; most Web-based databases satisfy this requirement, since they usually present their information to the end-user through a veneer of text.

SimFusion: measuring similarity using unified relationship matrix

Authors: 
Xi, W; Fox, EA; Fan, W; Zhang, B; Chen, Z; Yan, J; Zhuang, D
Year: 
2005
Venue: 
Proc. of the 28th annual international ACM SIGIR conf.

In this paper we use a Unified Relationship Matrix (URM) to
represent a set of heterogeneous data objects (e.g., web pages,
queries) and their interrelationships (e.g., hyperlinks, user clickthrough
sequences). We claim that iterative computations over the
URM can help overcome the data sparseness problem and detect
latent relationships among heterogeneous data objects, and thus can
improve the quality of information applications that require combination
of information from heterogeneous sources. To support
our claim, we present a unified similarity-calculating algorithm,
