
CiteSeerX: an Architecture and Web Service Design for an Academic Document Search Engine

Authors: 
Li, Huajing; Councill, Isaac; Lee, Wang-Chien; Giles, C. Lee
Year: 
2006
Venue: 
15th International World Wide Web Conference (WWW 2006): poster, 2006

CiteSeer is a scientific literature digital library and search engine that automatically crawls and indexes scientific documents in the field of computer and information science. After serving as a public search engine for nearly ten years, CiteSeer is starting to have scaling problems in handling more documents, adding new features, and serving more users. Its monolithic architecture prevents it from making effective use of new web technologies and providing new services. After analyzing the problems of the current system, we propose a new architecture and data model, CiteSeerX.

Learning metadata from the evidence in an on-line citation matching scheme

Authors: 
Councill, Isaac G.; Li, Huajing; Zhuang, Ziming; Debnath, Sandip; Bolelli, Levent; Lee, Wang-Chien; Sivasubramaniam, Anand; Giles, C. Lee
Year: 
2006
Venue: 
Joint Conference on Digital Libraries 2006 (JCDL 2006): 276-285, 2006

Citation matching, or the automatic grouping of bibliographic
references that refer to the same document, is a data management
problem faced by automatic digital libraries for scientific
literature such as CiteSeer and Google Scholar. Although several
solutions have been offered for citation matching in large
bibliographic databases, these solutions typically require
expensive batch clustering operations that must be run offline.
Large digital libraries containing citation information can reduce
maintenance costs and provide new services through efficient

Clustering Scientific Literature Using Sparse Citation Graph Analysis

Authors: 
Bolelli, Levent; Ertekin, Seyda; Giles, C. Lee
Year: 
2006
Venue: 
10th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD 2006): 30-41, 2006

Group Linkage

Authors: 
On, Byung-Won; Koudas, Nick; Lee, Dongwon; Srivastava, Divesh
Year: 
2007
Venue: 
ICDE

Poor-quality data is prevalent in databases for a variety
of reasons, including transcription errors and a lack of standards
for recording database fields. To be able to query
and integrate such data, considerable recent work has focused
on the record linkage problem, i.e., determining whether two
entities represented as relational records are approximately
the same. Often entities are represented as groups of relational
records, rather than individual relational records,
e.g., households in a census survey consist of a group of persons.
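
The distinction between linking individual records and linking groups of records can be illustrated with a minimal Python sketch. This is not the measure proposed in the paper; the function names, the use of `difflib.SequenceMatcher` for approximate string matching, the greedy pairing, and the 0.8 threshold are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def record_similarity(a: str, b: str) -> float:
    """Approximate similarity in [0, 1] between two records given as strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def group_similarity(g1: list[str], g2: list[str], threshold: float = 0.8) -> float:
    """Jaccard-style score between two groups of records.

    Records are paired greedily (an assumption, not the paper's method):
    each record in g1 takes its most similar unmatched record in g2,
    provided their similarity reaches the threshold.
    """
    unmatched = list(g2)
    matches = 0
    for r1 in g1:
        best = max(unmatched, key=lambda r2: record_similarity(r1, r2), default=None)
        if best is not None and record_similarity(r1, best) >= threshold:
            unmatched.remove(best)
            matches += 1
    union = len(g1) + len(g2) - matches
    # Two empty groups are treated as identical.
    return matches / union if union else 1.0


# Hypothetical households from two census snapshots.
household_a = ["John Smith", "Mary Smith"]
household_b = ["Jon Smith", "Mary Smith", "Tom Smith"]
print(group_similarity(household_a, household_b))
```

A pairwise record comparison alone would only say that "John Smith" and "Jon Smith" are probably the same person; the group-level score additionally reflects how many members the two households share.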
