Structural matching and discovery in document databases

Authors: 
Wang, J. T. L.; Shasha, D.; Chang, G. J. S.; Relihan, L.; Zhang, K.; Patel, G.
Year: 
1997
Venue: 
Proc. of the 1997 ACM SIGMOD Intl Conf. on Management of data
URL: 
http://portal.acm.org/citation.cfm?id=253406&coll=portal&dl=ACM
DOI: 
http://doi.acm.org/10.1145/253260.253406
Citations: 
32
Attachment: 
Wang1997Structuralmatchingand.pdf (632.81 KB)

Structural matching and discovery in documents such as SGML and HTML is important for data warehousing [6], version management [7, 11], hypertext authoring, digital libraries [4], and Internet databases. For example, a user of the World Wide Web may want to know what has changed in an HTML document [2, 5, 10]. Such changes can be detected by comparing the old and new versions of the document (referred to as structural matching of documents). As another example, in hypertext authoring, a user may wish to find the common portions in the history list of a document or in a database of documents (referred to as structural discovery of documents). In the SIGMOD '95 demo sessions, we exhibited a software package called TreeDiff [13] for comparing two LaTeX documents and showing their differences. Given two documents, the tool represents them as ordered labeled trees and finds an optimal sequence of edit operations that transforms one document (tree) into the other. An edit operation can be an insert, delete, or change of a node in the trees. The tool is so named because documents are represented and compared using approximate tree matching techniques [9, 12, 14].
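
To make the idea of an optimal edit sequence between ordered labeled trees concrete, here is a minimal Python sketch. It is not the TreeDiff implementation and is not taken from the paper: it uses a plain memoized recursion over ordered forests, assumes every insert, delete, or change (relabel) costs 1, and returns only the size of the optimal edit sequence, not the sequence itself. The names tree, forest_distance, and tree_edit_distance, and the sample document trees, are hypothetical and chosen for illustration.

```python
from functools import lru_cache


def tree(label, *children):
    """Build an ordered labeled tree as a hashable (label, children) tuple."""
    return (label, tuple(children))


def size(forest):
    """Count the nodes in an ordered forest (a tuple of trees)."""
    return sum(1 + size(children) for _, children in forest)


@lru_cache(maxsize=None)
def forest_distance(f1, f2):
    """Minimum number of unit-cost edits that turn forest f1 into forest f2."""
    if not f1:
        return size(f2)  # every node of f2 must be inserted
    if not f2:
        return size(f1)  # every node of f1 must be deleted

    (label1, kids1), (label2, kids2) = f1[-1], f2[-1]

    # Delete the root of the rightmost tree in f1; its children move up.
    delete = forest_distance(f1[:-1] + kids1, f2) + 1
    # Insert the root of the rightmost tree in f2 (the symmetric case).
    insert = forest_distance(f1, f2[:-1] + kids2) + 1
    # Match the two rightmost roots, changing the label if they differ.
    match = (
        forest_distance(f1[:-1], f2[:-1])
        + forest_distance(kids1, kids2)
        + (label1 != label2)
    )
    return min(delete, insert, match)


def tree_edit_distance(t1, t2):
    """Edit distance between two ordered labeled trees."""
    return forest_distance((t1,), (t2,))


# Hypothetical old and new versions of a small document.
old_doc = tree("doc", tree("title"), tree("section", tree("para")))
new_doc = tree("doc", tree("heading"), tree("para"))

# One change (title -> heading) plus one delete (section).
print(tree_edit_distance(old_doc, new_doc))  # prints 2
```

This recursion revisits many sub-forests and is only practical for small trees. The approximate tree matching techniques cited in the abstract [9, 12, 14] are the place to look for algorithms suited to real documents.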