Column Heterogeneity as a Measure of Data Quality

Authors: 
Dai, B. T.; Koudas, N.; Ooi, B. C.; Srivastava, D.; Venkatasubramanian, S.
Year: 
2006
Venue: 
CleanDB, 2006
URL: 
http://pike.psu.edu/cleandb06/papers/CameraReady_111.pdf
Citations: 
8
Citations range: 
1 - 9
Attachment: 
Dai2006ColumnHeterogeneityasa.pdf (121.02 KB)

Data quality is a serious concern in every data management application,
and a variety of quality measures have been proposed, including
accuracy, freshness and completeness, to capture the common
sources of data quality degradation. We identify and focus
attention on a novel measure, column heterogeneity, that seeks to
quantify the data quality problems that can arise when merging data
from different sources. We identify desiderata that a column heterogeneity
measure should intuitively satisfy, and discuss a promising
direction of research to quantify database column heterogeneity
using a novel combination of cluster entropy and soft clustering.
Finally, we present a few preliminary experimental results,
using diverse data sets of semantically different types, to demonstrate
that this approach appears to provide a robust mechanism for
identifying and quantifying database column heterogeneity.
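
To make the idea of combining cluster entropy with soft clustering more concrete, the following is a minimal sketch and not the authors' implementation. It assumes that some soft clustering step has already produced, for each value in a column, a row of membership probabilities over the clusters, and it scores the column by the entropy of the average cluster distribution. The function name cluster_entropy and the sample membership matrices are hypothetical and chosen only for illustration.

```python
import math


def cluster_entropy(memberships):
    """Entropy (in bits) of the average soft-cluster distribution.

    memberships: list of rows, one per column value. Each row holds that
    value's membership probabilities over the clusters and should sum to 1.
    This input format is an assumption made for this sketch.
    """
    if not memberships:
        raise ValueError("need at least one value")
    n = len(memberships)
    k = len(memberships[0])
    # Marginal cluster mass: average membership across all values.
    marginal = [sum(row[j] for row in memberships) / n for j in range(k)]
    # Shannon entropy of the marginal; empty clusters contribute nothing.
    return -sum(p * math.log2(p) for p in marginal if p > 0)


# Hypothetical column whose values almost all fall in one cluster.
homogeneous = [[1.0, 0.0], [1.0, 0.0], [0.9, 0.1], [1.0, 0.0]]
# Hypothetical column whose values split evenly across two clusters.
mixed = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]

print(cluster_entropy(homogeneous))
print(cluster_entropy(mixed))
```

Under these assumptions, a column dominated by a single cluster scores close to zero, while a column whose values spread evenly over several clusters scores higher, which matches the intuition that merged data of semantically different types is more heterogeneous.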