SIGMOD / VLDB

On active learning of record matching packages

Authors: 
Arasu, A; Götz, M; Kaushik, R.
Year: 
2010
Venue: 
Proc. ACM SIGMOD Conf.

We consider the problem of learning a record matching package (classifier) in an active learning setting. In active learning, the learning algorithm picks the set of examples to be labeled, unlike the more traditional passive learning setting, where a user selects the labeled examples. Active learning is important for record matching since manually identifying a suitable set of labeled examples is difficult.
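
As a rough illustration of what "the learning algorithm picks the set of examples to be labeled" can mean in practice, the sketch below uses uncertainty sampling, a common active learning heuristic. It is not claimed to be the algorithm of the paper above. The function name pick_for_labeling, the pair identifiers, and the match scores are all hypothetical; the scores are assumed to come from some current classifier that outputs a match probability per candidate record pair.

```python
def pick_for_labeling(scores: dict[str, float], k: int) -> list[str]:
    """Return the k candidate pairs whose match score is closest to 0.5.

    scores maps a (hypothetical) record-pair id to the current
    classifier's estimated probability that the pair is a match.
    Pairs near 0.5 are the ones the classifier is least sure about,
    so a label from the user is assumed to be most informative there.
    """
    by_uncertainty = sorted(scores, key=lambda pair_id: abs(scores[pair_id] - 0.5))
    return by_uncertainty[:k]


# Made-up scores for four candidate pairs.
candidate_scores = {"p1": 0.95, "p2": 0.52, "p3": 0.10, "p4": 0.40}
to_label = pick_for_labeling(candidate_scores, k=2)
```

In a passive setting the user would choose which pairs to label; here the selection is driven by the model's own uncertainty, and the chosen pairs would then be shown to the user for labeling.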

Data Fusion - Resolving Data Conflicts for Integration

Authors: 
Dong, XL; Naumann, F
Year: 
2009

Example-driven Design of Efficient Record Matching Queries

Authors: 
Chaudhuri, Surajit; Chen, Bee-Chung; Ganti, Venkatesh; Kaushik, Raghav
Year: 
2007
Venue: 
VLDB

Record matching is the task of identifying records that match the same real world entity. This is a problem of great significance for a variety of business intelligence applications. Implementations of record matching rely on exact as well as approximate string matching (e.g., edit distances) and use of external reference data sources. Record matching can be viewed as a query composed of a small set of primitive operators. However, formulating such record matching queries is difficult and depends on the specific application scenario.
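
The abstract above mentions approximate string matching via edit distances. A minimal sketch of that primitive, assuming the standard Levenshtein definition (unit-cost insertions, deletions, and substitutions), is shown below. The helper approx_match and its max_distance threshold are hypothetical and are not taken from the paper.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, and substitutions needed to turn a into b."""
    # previous[j] is the distance between the prefix of a processed so far
    # and b[:j]; only two rows of the DP table are kept at a time.
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            substitution_cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,                      # delete ch_a
                current[j - 1] + 1,                   # insert ch_b
                previous[j - 1] + substitution_cost,  # substitute or keep
            ))
        previous = current
    return previous[-1]


def approx_match(a: str, b: str, max_distance: int = 2) -> bool:
    """Hypothetical matching rule: case-insensitive comparison that
    accepts two strings if their edit distance is within a threshold."""
    return edit_distance(a.lower(), b.lower()) <= max_distance
```

A record matching query in the sense of the abstract would combine several such primitives (for example, one per attribute) with thresholds chosen for the application; this sketch covers only a single string comparison.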

SPIDER: flexible matching in databases

Authors: 
Koudas, N.; Marathe, A.; Srivastava, D.
Year: 
2005
Venue: 
Proceedings of the 2005 ACM SIGMOD international conference on Management of data, 2005

We present a prototype system, SPIDER, developed at AT&T Labs-Research, which supports flexible string attribute value matching in large databases. We discuss the design principles on which SPIDER is based, describe the basic techniques encompassed by the tool and provide a description of the demo.

DogmatiX tracks down duplicates in XML

Authors: 
Weis, M; Naumann, F
Year: 
2005
Venue: 
Proceedings of the 2005 ACM SIGMOD international conference

Duplicate detection is the problem of detecting different entries in a data source representing the same real-world entity. While research abounds in the realm of duplicate detection in relational data, there is as yet little work on duplicates in other, more complex data models, such as XML.

The merge/purge problem for large databases

Authors: 
Hernandez, M.A.; Stolfo, S.J.
Year: 
1995
Venue: 
Proceedings of the 1995 ACM SIGMOD international conference on Management of data, 1995

Many commercial organizations routinely gather large numbers of databases for various marketing and business analysis functions. The task is to correlate information from different databases by identifying distinct individuals that appear in a number of different databases, typically in an inconsistent and often incorrect fashion. The problem we study here is the task of merging data from multiple sources in as efficient a manner as possible, while maximizing the accuracy of the result. We call this the merge/purge problem.

Integration of heterogeneous databases without common domains using queries based on textual similarity

Authors: 
Cohen, WW
Year: 
1998
Venue: 
Proc. ACM SIGMOD

Most databases contain “name constants” like course numbers, personal names, and place names that correspond to entities in the real world. Previous work in integration of heterogeneous databases has assumed that local name constants can be mapped into an appropriate global domain by normalization. However, in many cases, this assumption does not hold; determining if two name constants should be considered identical can require detailed knowledge of the world, the purpose of the user's query, or both.

Robust and efficient fuzzy match for online data cleaning

Authors: 
Chaudhuri, S.; Ganjam, K.; Ganti, V.; Motwani, R.
Year: 
2003
Venue: 
Proc. ACM SIGMOD 2003

To ensure high data quality, data warehouses must validate and cleanse incoming data tuples from external sources. In many situations, clean tuples must match acceptable tuples in reference tables.
