ist.psu.edu

Adaptive sorted neighborhood methods for efficient record linkage

Authors: 
Yan, S; Lee, D; Kan, MY; Giles, CL
Year: 
2007
Venue: 
Proc. 2007 Conf. on Digital libraries

Traditionally, record linkage algorithms have played an important role in maintaining digital libraries, i.e., identifying matching citations or authors for consolidation when updating or integrating digital libraries. As such, a variety of record linkage algorithms have been developed and deployed successfully. Often, however, existing solutions have a set of parameters whose values are set by human experts off-line and are fixed during execution.

CiteSeerX: an Architecture and Web Service Design for an Academic Document Search Engine

Authors: 
Li, Huajing; Councill, Isaac; Lee, Wang-Chien; Giles, C. Lee
Year: 
2006
Venue: 
15th International World Wide Web Conference (WWW2006):(poster) 2006

CiteSeer is a scientific literature digital library and search engine that automatically crawls and indexes scientific documents in the field of computer and information science. After serving as a public search engine for nearly ten years, CiteSeer is starting to have scaling problems in handling more documents, adding new features, and serving more users. Its monolithic architecture prevents it from effectively making use of new web technologies and providing new services. After analyzing the current system's problems, we propose a new architecture and data model, CiteSeerX.

Learning metadata from the evidence in an on-line citation matching scheme

Authors: 
Councill, Isaac G.; Li, Huajing; Zhuang, Ziming; Debnath, Sandip; Bolelli, Levent; Lee, Wang-Chien; Sivasubramaniam, Anand; Giles, C. Lee
Year: 
2006
Venue: 
Joint Conference on Digital Libraries 2006 (JCDL 2006): 276-285, 2006

Citation matching, or the automatic grouping of bibliographic references that refer to the same document, is a data management problem faced by automatic digital libraries for scientific literature such as CiteSeer and Google Scholar. Although several solutions have been offered for citation matching in large bibliographic databases, these solutions typically require expensive batch clustering operations that must be run offline. Large digital libraries containing citation information can reduce maintenance costs and provide new services through efficient

Clustering Scientific Literature Using Sparse Citation Graph Analysis

Authors: 
Bolelli, Levent; Ertekin, Seyda; Giles, C. Lee
Year: 
2006
Venue: 
10th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD 2006): 30-41, 2006

Towards Next Generation CiteSeer: A Flexible Architecture for Digital Library Deployment

Authors: 
Councill, Isaac G.; Giles, C. Lee; Iorio, Ernesto Di; Gori, Marco; Maggini, Marco; Pucci, Augusto
Year: 
2006
Venue: 
Research and Advanced Technology for Digital Libraries, 10th European Conference, (ECDL 2006): 111-122, 2006

CiteSeer began as the first search engine for scientific literature to incorporate Autonomous Citation Indexing, and has since grown to be a well-used, open archive for computer and information science publications, currently indexing over 730,000 academic documents. However, CiteSeer currently faces significant challenges that must be overcome in order to improve the quality of the service and guarantee that CiteSeer will continue to be a valuable, up-to-date resource well into the foreseeable future. This paper describes a new architectural framework

Establishing value mappings using statistical models and user feedback

Authors: 
Kang, J.; Han, T.S.; Lee, D.; Mitra, P.
Year: 
2005
Venue: 
Proceedings of the 14th ACM international conference on Information and knowledge management, 2005

In this paper, we present a "value mapping" algorithm that does not rely on syntactic similarity or semantic interpretation of the values. The algorithm first constructs a statistical model (e.g., co-occurrence frequency or entropy vector) that captures the unique characteristics of values and their co-occurrence. It then finds the matching values by computing the distances between the models while refining the models using user feedback through iterations.
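
As a rough illustration of the idea in this abstract, the sketch below builds a co-occurrence frequency model for each value and pairs values whose models are closest. It is not the authors' algorithm: it assumes, for simplicity, that both tables share a comparable context column, it uses Euclidean distance, and it omits the user feedback iterations. The function names and the sample data are hypothetical.

```python
from collections import Counter, defaultdict
from math import sqrt


def cooccurrence_models(rows):
    """Map each value to its normalized co-occurrence distribution
    over the context values it appears with (assumed shared column)."""
    counts = defaultdict(Counter)
    for value, context in rows:
        counts[value][context] += 1
    models = {}
    for value, ctx_counts in counts.items():
        total = sum(ctx_counts.values())
        models[value] = {c: n / total for c, n in ctx_counts.items()}
    return models


def distance(p, q):
    """Euclidean distance between two sparse frequency vectors."""
    keys = set(p) | set(q)
    return sqrt(sum((p.get(k, 0.0) - q.get(k, 0.0)) ** 2 for k in keys))


def match_values(source_rows, target_rows):
    """Pair each source value with the target value whose model is nearest."""
    src = cooccurrence_models(source_rows)
    tgt = cooccurrence_models(target_rows)
    return {s: min(tgt, key=lambda t: distance(src[s], tgt[t])) for s in src}


# Hypothetical data: department codes that differ syntactically between
# two tables, each recorded alongside a building name (the shared context).
source = [("CS", "North"), ("CS", "North"), ("CS", "South"),
          ("BIO", "South"), ("BIO", "South"), ("BIO", "North")]
target = [("D01", "North"), ("D01", "North"), ("D01", "North"), ("D01", "South"),
          ("D02", "South"), ("D02", "South"), ("D02", "South"), ("D02", "North")]

mapping = match_values(source, target)
print(mapping)
```

Because the codes are matched purely on how they are distributed over buildings, the sketch never compares the strings "CS" and "D01" themselves, which is the point the abstract makes about avoiding syntactic similarity.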
