Tool / product

Source-aware entity matching: A compositional approach

Authors: 
Shen, W.; DeRose, P.; Vu, L.; Doan, A.; Ramakrishnan, R.
Year: 
2007
Venue: 
Proceedings of ICDE 2007

Entity matching (a.k.a. record linkage) plays a crucial role in integrating multiple data sources, and numerous matching solutions have been developed. However, the solutions have largely exploited only information available in the mentions and employed a single matching technique. We show how to exploit information about data sources to significantly improve matching accuracy. In particular, we observe that different sources often vary substantially in their level of semantic ambiguity, thus requiring different matching techniques. In addition, it is often beneficial

A Survey of Data Quality Tools

Authors: 
Barateiro, José; Galhardas, Helena
Year: 
2005
Venue: 
Datenbank-Spektrum, Vol. 14, 2005

Data quality tools aim at detecting and correcting data problems that affect the accuracy and efficiency of data analysis applications. We propose a classification of the most relevant commercial and research data quality tools that can be used as a framework for comparing tools and understanding their functionalities.

XML Duplicate Detection Using Sorted Neighborhoods

Authors: 
Puhlmann, Sven; Weis, Melanie; Naumann, Felix
Year: 
2006
Venue: 
Conference on Extending Database Technology (EDBT) 2006

Detecting duplicates is a problem with a long tradition in many domains, such as customer relationship management and data warehousing. The problem is twofold: First define a suitable similarity measure, and second efficiently apply the measure to all pairs of objects. With the advent and pervasion of the XML data model, it is necessary to find new similarity measures and to develop efficient methods to detect duplicate elements in nested XML data.

SPIDER: flexible matching in databases

Authors: 
Koudas, N.; Marathe, A.; Srivastava, D.
Year: 
2005
Venue: 
Proceedings of the 2005 ACM SIGMOD international conference on Management of data, 2005

We present a prototype system, SPIDER, developed at AT&T Labs-Research, which supports flexible string attribute value matching in large databases. We discuss the design principles on which SPIDER is based, describe the basic techniques encompassed by the tool and provide a description of the demo.

Data Cleaning for Decision Support

Authors: 
Benedikt, M.; Bohannon, P.; Bruns, G.
Year: 
2006
Venue: 
Clean DB, 2006

Data cleaning may involve the acquisition, at some effort or expense, of high-quality data. Such data can serve not only to correct individual errors, but also to improve the reliability model for data sources. However, there has been little research into this latter role for acquired data. In this short paper we define a new data cleaning model that allows a user to estimate the value of further data acquisition in the face of specific business decisions. As data is acquired, the reliability model of sources is updated using Bayesian techniques,

DogmatiX tracks down duplicates in XML

Authors: 
Weis, M; Naumann, F
Year: 
2005
Venue: 
Proceedings of the 2005 ACM SIGMOD international conference on Management of data, 2005

Duplicate detection is the problem of detecting different entries in a data source representing the same real-world entity. While research abounds in the realm of duplicate detection in relational data, there is yet little work for duplicates in other, more complex data models, such as XML.

TAILOR: a record linkage tool box

Authors: 
Elfeky, MG; Verykios, VS; Elmagarmid, AK
Year: 
2002
Venue: 
Proceedings of the 18th International Conference on Data Engineering, 2002

Data cleaning is a vital process that ensures the quality of data stored in real-world databases. Data cleaning problems are frequently encountered in many research areas, such as knowledge discovery in databases, data warehousing, system integration and e-services. The process of identifying the record pairs that represent the same entity (duplicate records), commonly known as record linkage, is one of the essential elements of data cleaning. In this paper, we address the record linkage problem by adopting a machine learning approach. Three models are proposed and are analyzed empirically.
