Efficient clustering of high-dimensional data sets with application to reference matching

Authors: 
McCallum, A; Nigam, K; Ungar, LH
Year: 
2000
Venue: 
Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2000)
URL: 
http://portal.acm.org/citation.cfm?id=347090.347123
Citations: 
550
Citations range: 
500 - 999

Abstract: 
Many important problems involve clustering large datasets. Although naive implementations of clustering are computationally expensive, there are established efficient techniques for clustering when the dataset has either (1) a limited number of clusters, (2) a low feature dimensionality, or (3) a small number of data points. However, there has been much less work on methods of efficiently clustering datasets that are large in all three ways at once: for example, having millions of data points that exist in many thousands of dimensions representing many thousands of clusters. We present a new technique for clustering these large, high-dimensional datasets. The key idea involves using a cheap, approximate distance measure to efficiently divide the data into overlapping subsets we call canopies. Then clustering is performed by measuring exact distances only between points that occur in a common canopy. Using canopies, large clustering problems that were formerly impossible become practical. Under reasonable assumptions about the cheap distance metric, this reduction in computational cost comes without any loss in clustering accuracy. Canopies can be applied to many domains and used with a variety of clustering approaches, including Greedy Agglomerative Clustering, K-means, and Expectation-Maximization. We present experimental results on grouping bibliographic citations from the reference sections of research papers. Here the canopy approach reduces computation time over a traditional clustering approach by more than an order of magnitude and decreases error in comparison to a previously used algorithm by 25%.
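
The canopy-construction step described in the abstract can be sketched compactly. The Python below is an illustration rather than the authors' implementation: it assumes the commonly described two-threshold scheme (a loose threshold t1 and a tight threshold t2, with t1 > t2) and a user-supplied cheap_distance function; the names make_canopies, cheap_distance, t1, and t2 are hypothetical.

    import random

    def make_canopies(points, cheap_distance, t1, t2):
        # Build overlapping canopies using only the cheap, approximate
        # distance. t1 is the loose threshold, t2 the tight threshold,
        # with t1 > t2. A point may fall into several canopies; that
        # overlap is what later lets exact distances be computed only
        # for pairs of points sharing a canopy.
        assert t1 > t2
        candidates = set(range(len(points)))  # indices still eligible as canopy centers
        canopies = []
        while candidates:
            center = random.choice(tuple(candidates))
            canopy = {center}
            for i in tuple(candidates):
                d = cheap_distance(points[center], points[i])
                if d < t1:                 # close enough to join this canopy
                    canopy.add(i)
                if d < t2:                 # too close to seed a new canopy later
                    candidates.discard(i)
            candidates.discard(center)     # the chosen center is consumed
            canopies.append(canopy)
        return canopies

    # Purely illustrative usage with 2-D points and an L1 "cheap" distance.
    pts = [(0, 0), (0, 1), (5, 5), (5, 6), (0, 2)]
    cheap = lambda a, b: abs(a[0] - b[0]) + abs(a[1] - b[1])
    print(make_canopies(pts, cheap, t1=3.0, t2=1.5))

In a full pipeline along the lines the abstract describes, the subsequent clustering stage (Greedy Agglomerative Clustering, K-means, or EM) would compute exact distances only for pairs of points that co-occur in at least one canopy, which is where the reported order-of-magnitude reduction in computation comes from.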