Exploiting relationships for object consolidation

Authors: 
Chen, Z; Kalashnikov, DV; Mehrotra, S
Year: 
2005
Venue: 
Proceedings of the 2nd international workshop on Information
URL: 
http://portal.acm.org/citation.cfm?id=1077512
Citations: 
68
Citations range: 
50 - 99

Data mining practitioners frequently have to spend a significant portion of their project time on data preprocessing before they can apply their algorithms to real-world datasets. Such preprocessing is required because many real-world datasets are not perfect: they contain missing, erroneous, or duplicate data, along with other data cleaning problems. It is well established that, in general, if such problems are not corrected, applying data mining algorithms can lead to wrong results. This is known as the "garbage in, garbage out" principle. Given the significance of the problem, numerous data cleaning techniques have been designed in the past to address these problems.

In this paper, we address one of the data cleaning challenges, called object consolidation. This important challenge arises because objects in datasets are frequently represented via descriptions (sets of instantiated attributes), which alone might not always uniquely identify the object. The goal of object consolidation is to correctly consolidate (i.e., to group or determine) all the representations of the same object, for each object in the dataset. In contrast to traditional domain-independent data cleaning techniques, our approach analyzes not only object features but also additional semantic information, namely inter-object relationships, for the purpose of object consolidation. The approach views datasets as attributed relational graphs (ARGs) of object representations (nodes) connected via relationships (edges). The approach then applies graph partitioning techniques to accurately cluster object representations. Our empirical study over real datasets shows that analyzing relationships significantly improves the quality of the result.
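
The idea of treating representations as graph nodes and using relationships to group them can be illustrated with a small sketch. This is not the paper's algorithm: the toy dataset, the two similarity measures, the edge weights, and the threshold are all assumptions made for illustration, and a simple thresholded connected-components step stands in for the graph partitioning techniques the abstract mentions.

```python
from itertools import combinations

# Hypothetical toy dataset: each representation has a feature (a name)
# and relationships (the other entities it is linked to, e.g. co-authors).
REPRESENTATIONS = {
    "r1": {"name": "J. Smith", "linked": {"A. Lee", "B. Kumar"}},
    "r2": {"name": "John Smith", "linked": {"A. Lee"}},
    "r3": {"name": "J. Smith", "linked": {"C. Wong"}},
    "r4": {"name": "Jane Smith", "linked": {"C. Wong", "D. Park"}},
}


def feature_similarity(a, b):
    """Assumed toy measure: 1.0 if last name and first initial agree, else 0.0."""
    ta, tb = a["name"].split(), b["name"].split()
    return 1.0 if ta[-1] == tb[-1] and ta[0][0] == tb[0][0] else 0.0


def relationship_similarity(a, b):
    """Jaccard overlap of the entities each representation is linked to."""
    union = a["linked"] | b["linked"]
    return len(a["linked"] & b["linked"]) / len(union) if union else 0.0


def build_graph(reps, w_feat=0.5, w_rel=0.5):
    """Return weighted edges between every pair of representations."""
    edges = {}
    for u, v in combinations(sorted(reps), 2):
        edges[(u, v)] = (w_feat * feature_similarity(reps[u], reps[v])
                         + w_rel * relationship_similarity(reps[u], reps[v]))
    return edges


def consolidate(reps, threshold=0.6, **weights):
    """Group representations joined by edges at or above the threshold.

    Uses union-find connected components as a simple stand-in for
    graph partitioning.
    """
    parent = {r: r for r in reps}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for (u, v), weight in build_graph(reps, **weights).items():
        if weight >= threshold:
            parent[find(u)] = find(v)

    groups = {}
    for r in reps:
        groups.setdefault(find(r), set()).add(r)
    return sorted(sorted(g) for g in groups.values())


if __name__ == "__main__":
    print("features + relationships:", consolidate(REPRESENTATIONS))
    print("features only:", consolidate(REPRESENTATIONS, w_feat=1.0, w_rel=0.0))
```

In this toy setting all four names are compatible on features alone, so a features-only run merges them into a single group, while adding the relationship term separates the representations linked to "A. Lee" from those linked to "C. Wong".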