Identity uncertainty and citation matching

Pasula, H; Marthi, B; Milch, B; Russell, S; Shpitser, I
Advances in Neural Information Processing Systems (NIPS)

Identity uncertainty is a pervasive problem in real-world data analysis. It
arises whenever objects are not labeled with unique identifiers or when
those identifiers may not be perceived perfectly. In such cases, two observations
may or may not correspond to the same object. In this paper,
we consider the problem in the context of citation matching—the problem
of deciding which citations correspond to the same publication. Our
approach is based on the use of a relational probability model to define
a generative model for the domain, including models of author and title

Potter’s Wheel: An Interactive Framework for Data Cleaning and Transformation

Raman, V; Hellerstein, J
Proc. International Conf. on Very Large Data Bases (VLDB)

Cleaning data of errors in structure and content is important
for data warehousing and integration. Current
solutions for data cleaning involve many iterations of
data “auditing” to find errors, and long-running transformations
to fix them. Users need to endure long
waits, and often write complex transformation scripts.
We present Potter’s Wheel, an interactive data cleaning
system that tightly integrates transformation and
discrepancy detection. Users gradually build transformations
to clean the data by adding or undoing
transforms on a spreadsheet-like interface; the effect
