Clustering high dimensional massive scientific datasets

Ekow J. Otoo, Arie Shoshani, Seung Won Hwang

Research output: Contribution to journal (Article)

8 Citations (Scopus)

Abstract

Many scientific applications can benefit from efficient clustering algorithms for massively large, high-dimensional datasets. However, most existing algorithms are impractical when the amount of data is very large. Given N objects, each defined by an M-dimensional feature vector, any clustering technique for handling very large datasets in high-dimensional space should run in O(N) time at best and O(N log N) time in the worst case, using no more than O(N M) storage, for it to be practical. A parallelized version of the same algorithm should achieve a linear speed-up in processing time as the number of processors increases. We introduce a hybrid algorithm called HyCeltyc as an approach for clustering massively large, high-dimensional datasets. HyCeltyc, which stands for Hybrid Cell Density Clustering method, combines a cell-density-based algorithm with a hierarchical agglomerative method to identify clusters in linear time. The main steps of the algorithm involve sampling, dimensionality reduction, and selection of significant features on which to cluster the data.
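The cell-density idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' HyCeltyc implementation: the function name `cell_density_clusters` and the parameters `cell_size` and `min_count` are hypothetical, and a simple merge of adjacent dense cells stands in for the hierarchical agglomerative step. It assumes the data have already been reduced to a few dimensions, because the number of neighboring cells grows as 3^M - 1.

```python
from collections import Counter, deque
from itertools import product


def cell_density_clusters(points, cell_size, min_count):
    """Label each point with a cluster id, or -1 if it falls in a sparse cell.

    Illustrative sketch only: bin points into a uniform grid, keep cells
    holding at least `min_count` points, then merge dense cells that touch.
    """
    # One pass over the N points: map each point to integer cell coordinates.
    cell_of = [tuple(int(x // cell_size) for x in p) for p in points]
    counts = Counter(cell_of)
    dense = {cell for cell, n in counts.items() if n >= min_count}

    # All non-zero offsets in {-1, 0, 1}^M, i.e. face and diagonal neighbors.
    dims = len(points[0]) if points else 0
    offsets = [o for o in product((-1, 0, 1), repeat=dims) if any(o)]

    # Merge touching dense cells with a breadth-first search.
    label = {}
    next_label = 0
    for start in dense:
        if start in label:
            continue
        label[start] = next_label
        queue = deque([start])
        while queue:
            cell = queue.popleft()
            for off in offsets:
                neighbor = tuple(c + d for c, d in zip(cell, off))
                if neighbor in dense and neighbor not in label:
                    label[neighbor] = next_label
                    queue.append(neighbor)
        next_label += 1

    # Points in sparse cells are treated as noise.
    return [label.get(cell, -1) for cell in cell_of]
```

With hashing and a small fixed M, the binning step touches each point once and the merge step visits each dense cell once, which is the kind of near-linear behavior in N that the abstract asks for. Cluster ids are arbitrary, so callers should compare labels for equality rather than rely on specific numbers.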

Original language: English
Pages (from-to): 147-157
Number of pages: 11
Journal: Proceedings of the International Conference on Scientific and Statistical Database Management, SSDBM
DOI: 10.1109/SSDM.2001.938547
Publication status: Published - 2001 Jan 1

Fingerprint

High-dimensional
Clustering
Cell
Dimensionality Reduction
Hybrid Algorithm
Feature Vector
Clustering Methods
Large Data Sets
Clustering Algorithm
Linear Time
Speedup
Efficient Algorithms
Clustering algorithms
Sampling
Processing

All Science Journal Classification (ASJC) codes

  • Software
  • Applied Mathematics

Cite this

@article{4ea6a54688c74bd3b3b2efaf3b1b2d35,
title = "Clustering high dimensional massive scientific datasets",
abstract = "Many scientific applications can benefit from efficient clustering algorithms for massively large, high-dimensional datasets. However, most existing algorithms are impractical when the amount of data is very large. Given N objects, each defined by an M-dimensional feature vector, any clustering technique for handling very large datasets in high-dimensional space should run in O(N) time at best and O(N log N) time in the worst case, using no more than O(N M) storage, for it to be practical. A parallelized version of the same algorithm should achieve a linear speed-up in processing time as the number of processors increases. We introduce a hybrid algorithm called HyCeltyc as an approach for clustering massively large, high-dimensional datasets. HyCeltyc, which stands for Hybrid Cell Density Clustering method, combines a cell-density-based algorithm with a hierarchical agglomerative method to identify clusters in linear time. The main steps of the algorithm involve sampling, dimensionality reduction, and selection of significant features on which to cluster the data.",
author = "Otoo, {Ekow J.} and Shoshani, Arie and Hwang, {Seung Won}",
year = "2001",
month = "1",
day = "1",
doi = "10.1109/SSDM.2001.938547",
language = "English",
pages = "147--157",
journal = "Proceedings of the International Conference on Scientific and Statistical Database Management, SSDBM",
issn = "1099-3371",
}

TY - JOUR

T1 - Clustering high dimensional massive scientific datasets

AU - Otoo, Ekow J.

AU - Shoshani, Arie

AU - Hwang, Seung Won

PY - 2001/1/1

Y1 - 2001/1/1

AB - Many scientific applications can benefit from efficient clustering algorithms for massively large, high-dimensional datasets. However, most existing algorithms are impractical when the amount of data is very large. Given N objects, each defined by an M-dimensional feature vector, any clustering technique for handling very large datasets in high-dimensional space should run in O(N) time at best and O(N log N) time in the worst case, using no more than O(N M) storage, for it to be practical. A parallelized version of the same algorithm should achieve a linear speed-up in processing time as the number of processors increases. We introduce a hybrid algorithm called HyCeltyc as an approach for clustering massively large, high-dimensional datasets. HyCeltyc, which stands for Hybrid Cell Density Clustering method, combines a cell-density-based algorithm with a hierarchical agglomerative method to identify clusters in linear time. The main steps of the algorithm involve sampling, dimensionality reduction, and selection of significant features on which to cluster the data.

UR - http://www.scopus.com/inward/record.url?scp=53949100051&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=53949100051&partnerID=8YFLogxK

U2 - 10.1109/SSDM.2001.938547

DO - 10.1109/SSDM.2001.938547

M3 - Article

AN - SCOPUS:53949100051

SP - 147

EP - 157

JO - Proceedings of the International Conference on Scientific and Statistical Database Management, SSDBM

JF - Proceedings of the International Conference on Scientific and Statistical Database Management, SSDBM

SN - 1099-3371

ER -