Group Linkage

On, Byung-Won; Koudas, Nick; Lee, Dongwon; Srivastava, Divesh

Poor quality data is prevalent in databases for a variety
of reasons, including transcription errors, lack of standards
for recording database fields, etc. To be able to query
and integrate such data, considerable recent work has focused
on the record linkage problem, i.e., determining whether two
entities represented as relational records are approximately
the same. Often entities are represented as groups of relational
records, rather than individual relational records,
e.g., households in a census survey consist of a group of persons.
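The group-level matching problem described above can be illustrated with a toy similarity measure. This is a minimal sketch, not the paper's actual formulation: it greedily pairs records across two groups using `difflib.SequenceMatcher` as a stand-in record-level metric, and the `threshold` value is an arbitrary assumption.

```python
from difflib import SequenceMatcher

def record_sim(a: str, b: str) -> float:
    # Record-level similarity; SequenceMatcher stands in for any string metric.
    return SequenceMatcher(None, a, b).ratio()

def group_sim(group_a, group_b, threshold=0.8):
    # Score two groups by greedily matching their records and normalizing
    # the count of strong matches (illustrative, not the paper's method).
    remaining = list(group_b)
    matched = 0
    for rec in group_a:
        best = max(remaining, key=lambda r: record_sim(rec, r), default=None)
        if best is not None and record_sim(rec, best) >= threshold:
            matched += 1
            remaining.remove(best)
    return matched / max(len(group_a), len(group_b))

# Two "households" whose members differ by small transcription errors.
household1 = ["John Smith", "Mary Smith", "Ann Smith"]
household2 = ["Jon Smith", "Mary Smyth", "Robert Doe"]
score = group_sim(household1, household2)
```

Identical groups score 1.0; the two households above get partial credit for their two near-duplicate members.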

Flexible string matching against large databases in practice

Koudas, N.; Marathe, A.; Srivastava, D.
Proceedings of VLDB, 2004

Data Cleaning is an important process that has been at
the center of research interest in recent years. Poor data
quality stems from a variety of causes, including
data entry errors and multiple conventions for recording
database fields, and has a significant impact on a variety
of business issues. Hence, there is a pressing need
for technologies that enable flexible (fuzzy) matching
of string information in a database. Cosine similarity
with tf-idf is a well-established metric for comparing
text, and recent proposals have adapted this similarity
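The tf-idf cosine similarity mentioned above can be sketched in a few lines. This is a generic illustration of the metric, not the paper's database-resident implementation; the weighting variant below (`log(1 + n/df)`) and the sample strings are assumptions made for the sketch.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    # Build unit-normalized tf-idf weight vectors over word tokens.
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    df = Counter(t for toks in tokenized for t in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vec = {t: tf[t] * math.log(1 + n / df[t]) for t in tf}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors

def cosine(v1, v2):
    # Dot product of two unit-normalized sparse vectors.
    return sum(w * v2.get(t, 0.0) for t, w in v1.items())

docs = ["ACME Inc New York", "ACME Incorporated NY", "Widget Corp Boston"]
vecs = tfidf_vectors(docs)
```

Here `cosine(vecs[0], vecs[1])` is positive (shared, low-frequency token "acme") while `cosine(vecs[0], vecs[2])` is zero, which is the flexible-matching behavior the abstract refers to.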

Merging the Results of Approximate Match Operations

Guha, S.; Koudas, N.; Marathe, A.; Srivastava, D.
Proceedings of the 30th International Conference on Very Large Databases (VLDB 2004), 2004

Data Cleaning is an important process that has been at
the center of research interest in recent years. An important
end goal of effective data cleaning is to identify
the relational tuple or tuples that are “most related” to
a given query tuple. Various techniques have been proposed
in the literature for efficiently identifying approximate
matches to a query string against a single attribute
of a relation. In addition to constructing a ranking (i.e.,
ordering) of these matches, the techniques often associate,
with each match, scores that quantify the extent
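One simple way to combine per-attribute ranked match lists is a weighted score sum. This is a stand-in for the merging semantics the paper actually studies; the tuple ids, scores, and uniform weights below are illustrative assumptions.

```python
def merge_matches(rankings, weights=None):
    # Combine per-attribute approximate-match scores into one ranked list.
    # Each ranking maps tuple id -> score; higher combined score ranks first.
    weights = weights or [1.0] * len(rankings)
    combined = {}
    for ranking, w in zip(rankings, weights):
        for tid, score in ranking.items():
            combined[tid] = combined.get(tid, 0.0) + w * score
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical per-attribute match results for one query tuple.
name_matches = {"t1": 0.9, "t2": 0.6}
addr_matches = {"t2": 0.8, "t3": 0.7}
merged = merge_matches([name_matches, addr_matches])
```

Tuple `t2` wins despite not topping either individual ranking, which is exactly the situation merging has to handle.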

Text joins in an RDBMS for web data integration

Gravano, L.; Ipeirotis, P.G.; Koudas, N.; Srivastava, D.
Proceedings of the Twelfth International Conference on World Wide Web (WWW), 2003

The integration of data produced and collected across autonomous, heterogeneous web services is an increasingly important and challenging problem. Due to the lack of global identifiers, the same entity (e.g., a product) might have different textual representations across databases. Textual data is also often noisy because of transcription errors, incomplete information, and lack of standard formats. A fundamental task during data integration is matching of strings that refer to the same entity.

Approximate string joins in a database (almost) for free

Gravano, L.; Ipeirotis, P.G.; Jagadish, H.V.; Koudas, N.; Muthukrishnan, S.; Srivastava, D.
Proceedings of the 27th International Conference on Very Large Data Bases (VLDB), 2001

String data is ubiquitous, and its management has
taken on particular importance in the past few
years. Approximate queries are particularly important on
string data, especially for more complex queries
involving joins. This is due, for example, to the
prevalence of typographical errors in data, and
multiple conventions for recording attributes such
as name and address. Commercial databases do
not support approximate string joins directly, and
it is a challenge to implement this functionality efficiently
with user-defined functions (UDFs).
In this paper, we develop a technique for building
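A q-gram count filter is a standard building block for approximate string joins of this kind. The sketch below runs the filter in plain Python rather than inside the database, and the `min_shared` threshold is an illustrative assumption, not a value from the paper.

```python
from collections import Counter, defaultdict

def qgrams(s, q=3):
    # Pad the string and slide a window of size q over it.
    padded = "#" * (q - 1) + s.lower() + "#" * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}

def approx_join(left, right, q=3, min_shared=4):
    # Keep pairs that share at least min_shared q-grams -- a count
    # filter of the kind that can also be expressed in plain SQL.
    index = defaultdict(list)
    for j, s in enumerate(right):
        for g in qgrams(s, q):
            index[g].append(j)
    pairs = []
    for i, s in enumerate(left):
        shared = Counter()
        for g in qgrams(s, q):
            for j in index[g]:
                shared[j] += 1
        pairs.extend((left[i], right[j]) for j, c in shared.items() if c >= min_shared)
    return pairs

names_a = ["John Smith", "Alice Jones"]
names_b = ["Jon Smith", "Bob Brown"]
matches = approx_join(names_a, names_b)
```

"John Smith" and "Jon Smith" share many q-grams and survive the filter; unrelated names share almost none and are pruned before any expensive distance computation.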

Circumventing Data Quality Problems Using Multiple Join Paths

Kotidis, Y.; Marian, A.; Srivastava, D.
Clean DB, 2006

We propose the Multiple Join Path (MJP) framework for obtaining
high quality information by linking fields across multiple databases,
when the underlying databases have poor quality data, characterized
by violations of integrity constraints such as keys and
functional dependencies within and across databases. MJP associates
quality scores with candidate answers by first scoring individual
data paths between a pair of field values taking into account
data quality with respect to specified integrity constraints, and then

Column Heterogeneity as a Measure of Data Quality

Dai, B. T.; Koudas, N.; Ooi, B. C.; Srivastava, D.; Venkatasubramanian, S.
Clean DB, 2006

Data quality is a serious concern in every data management application,
and a variety of quality measures have been proposed, including
accuracy, freshness and completeness, to capture the common
sources of data quality degradation. We identify and focus
attention on a novel measure, column heterogeneity, that seeks to
quantify the data quality problems that can arise when merging data
from different sources. We identify desiderata that a column heterogeneity
measure should intuitively satisfy, and discuss a promising
direction of research to quantify database column heterogeneity

Joins that generalize: text classification using Whirl

Cohen, W.W.; Hirsh, H.
Proc. KDD-98

WHIRL is an extension of relational databases that can perform “soft joins” based on the similarity of textual identifiers; these soft joins extend the traditional operation of joining tables based on the equivalence of atomic values. This paper evaluates WHIRL on a number of inductive classification tasks using data from the World Wide Web. We show that although WHIRL is designed for more general similarity-based reasoning tasks, it is competitive with mature inductive classification systems on these classification tasks.
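A soft join of the kind WHIRL performs can be approximated in miniature: score every cross pair by textual similarity of the join keys and keep the best-scoring pairs. The sketch uses `difflib` ratios rather than WHIRL's tf-idf cosine similarity, and the sample tables are invented.

```python
from difflib import SequenceMatcher

def soft_join(left, right, key_l, key_r, top_k=3):
    # Rank all cross pairs by textual similarity of their join keys
    # instead of requiring exact equality (illustrative scoring only).
    scored = []
    for l in left:
        for r in right:
            score = SequenceMatcher(None, l[key_l].lower(), r[key_r].lower()).ratio()
            scored.append((score, l, r))
    scored.sort(key=lambda t: t[0], reverse=True)
    return scored[:top_k]

# Hypothetical tables whose join keys never match exactly.
movies = [{"title": "Star Wars: A New Hope"}, {"title": "Alien"}]
reviews = [{"film": "Star Wars - A New Hope"}, {"film": "Aliens"}]
ranked = soft_join(movies, reviews, "title", "film", top_k=2)
```

An equality join on these tables returns nothing; the soft join still surfaces the two intended pairings at the top of the ranking.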

Hardening soft information sources

Cohen, WW; Kautz, H; McAllester, D
Proceedings of the sixth ACM SIGKDD international conference

The web contains a large quantity of unstructured information. In many cases, it is possible to heuristically extract structured information, but the resulting databases
are “soft”: they contain inconsistencies and duplication, and
