Adaptive sorted neighborhood methods for efficient record linkage

Yan, S; Lee, D; Kan, MY; Giles, CL
Proc. 2007 Conf. on Digital libraries

Traditionally, record linkage algorithms have played an important role in maintaining digital libraries, i.e., identifying matching citations or authors for consolidation in updating or integrating digital libraries. As such, a variety of record linkage algorithms have been developed and deployed successfully. Often, however, existing solutions have a set of parameters whose values are set by human experts off-line and are fixed during the execution.

Duplicate Detection in Biological Data using Association Rule Mining

Koh, JLY; Lee, ML; Khan, AM; Tan, PTJ; Brusic, V
Proc. ECML/PKDD Workshop on Data Mining and Text Mining for Bioinformatics

Recent advancement in biotechnology has produced a massive
amount of raw biological data which are accumulating at an
exponential rate. Errors, redundancy and discrepancies are
prevalent in the raw data, and there is a serious need for
systematic approaches towards biological data cleaning. This
work examines the extent of redundancy in biological data and
proposes a method for detecting duplicates in biological data.
Duplicate relations in a real-world biological dataset are modeled
into forms of association rules so that these duplicate relations or

IntelliClean: a knowledge-based intelligent data cleaner

Lee, M.L.; Ling, T.W.; Low, W.L.
Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining, 2000

Existing data cleaning methods work on the basis of computing the degree of similarity between nearby records in a sorted database. High recall is achieved by accepting records with low degrees of similarity as duplicates, at the cost of lower precision. High precision is achieved analogously at the cost of lower recall. This is the recall-precision dilemma.
In this paper, we propose a generic knowledge-based framework for effective data cleaning that implements existing cleaning strategies and more. We develop a new method to compute transitive closure under uncertainty which handles

Column Heterogeneity as a Measure of Data Quality

Dai, B. T.; Koudas, N.; Ooi, B. C.; Srivastava, D.; Venkatasubramanian, S.
Clean DB, 2006

Data quality is a serious concern in every data management application,
and a variety of quality measures have been proposed, including
accuracy, freshness and completeness, to capture the common
sources of data quality degradation. We identify and focus
attention on a novel measure, column heterogeneity, that seeks to
quantify the data quality problems that can arise when merging data
from different sources. We identify desiderata that a column heterogeneity
measure should intuitively satisfy, and discuss a promising
direction of research to quantify database column heterogeneity
