Clustering high dimensional massive scientific datasets

Ekow J. Otoo, Arie Shoshani, Seungwon Hwang

Research output: Contribution to journal › Article

6 Citations (Scopus)

Abstract

Many scientific applications can benefit from an efficient algorithm for clustering massively large, high-dimensional datasets. However, most existing algorithms are impractical when the amount of data is very large. Given N objects, each defined by an M-dimensional feature vector, any clustering technique for very large datasets in high-dimensional space should, to be practical, run in O(MN) time at best and O(MN log N) time in the worst case, using no more than O(NM) storage. We introduce a hybrid algorithm, called HyCeltyc, for clustering massively large, high-dimensional datasets in O(MN) time, which is linear in the size of the data. HyCeltyc, which stands for Hybrid Cell Density Clustering method, combines a cell-density-based algorithm with a hierarchical agglomerative method to identify clusters in linear time. The main steps of the algorithm are sampling, dimensionality reduction, selection of the significant features on which to cluster the data, and a grid-based clustering algorithm that is linear in the data size.
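
As a rough illustration of the grid-based, cell-density idea mentioned in the abstract, the sketch below bins points into fixed-width cells in a single pass, keeps the cells that hold enough points, and merges adjacent dense cells into clusters. It is a generic sketch and not the HyCeltyc algorithm itself: the function name `grid_density_clusters` and the parameters `cell_width` and `min_count` are assumptions made for this example, and the sampling, dimensionality-reduction, feature-selection and hierarchical agglomerative steps are omitted.

```python
from collections import defaultdict


def grid_density_clusters(points, cell_width, min_count):
    """Label points by connected components of dense grid cells.

    Illustrative only: `cell_width` and `min_count` are hypothetical
    parameters, not values or names taken from the paper.
    """
    # Single pass over the data: map each point to its integer cell
    # coordinates, touching each of its M coordinates once.
    cells = defaultdict(list)
    for idx, p in enumerate(points):
        key = tuple(int(x // cell_width) for x in p)
        cells[key].append(idx)

    # A cell is "dense" when it holds at least `min_count` points.
    dense = {key for key, members in cells.items() if len(members) >= min_count}

    # Union-find over dense cells, used to merge cells that share a face.
    parent = {key: key for key in dense}

    def find(key):
        while parent[key] != key:
            parent[key] = parent[parent[key]]  # path halving
            key = parent[key]
        return key

    dims = len(next(iter(dense))) if dense else 0
    # Looking only at the +1 neighbour along each axis still visits every
    # face-adjacent pair exactly once (from the lower cell of the pair).
    for key in dense:
        for d in range(dims):
            neighbour = key[:d] + (key[d] + 1,) + key[d + 1:]
            if neighbour in dense:
                parent[find(key)] = find(neighbour)

    # Points in dense cells get their component's id; all others get -1.
    roots = {}
    labels = [-1] * len(points)
    for key in dense:
        cluster_id = roots.setdefault(find(key), len(roots))
        for idx in cells[key]:
            labels[idx] = cluster_id
    return labels
```

Calling `grid_density_clusters(points, 1.0, 2)` on a list of coordinate tuples returns one integer label per point, with `-1` marking points that fall in sparse cells. The numeric cluster ids are arbitrary; only which points share a label is meaningful.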

Original language: English
Pages (from-to): 147-168
Number of pages: 22
Journal: Journal of Intelligent Information Systems
Volume: 17
Issue number: 2-3
DOI: 10.1023/A:1012853629322
Publication status: Published - 2001 Dec 1

Fingerprint

Clustering algorithms
Sampling

All Science Journal Classification (ASJC) codes

  • Software
  • Information Systems
  • Hardware and Architecture
  • Computer Networks and Communications
  • Artificial Intelligence

Cite this

Otoo, Ekow J. ; Shoshani, Arie ; Hwang, Seungwon. / Clustering high dimensional massive scientific datasets. In: Journal of Intelligent Information Systems. 2001 ; Vol. 17, No. 2-3. pp. 147-168.
@article{bd5e2fd4e9e44feca238ab0f655e2c57,
title = "Clustering high dimensional massive scientific datasets",
abstract = "Many scientific applications can benefit from an efficient clustering algorithm of massively large high dimensional datasets. However most of the developed algorithms are impractical to use when the amount of data is very large. Given N objects each defined by an M-dimensional feature vector, any clustering technique for handling very large datasets in high dimensional space should run in time O(MN) at best, and O(MN log N) in the worst case, using no more than O(NM) storage, for it to be practical. We introduce a hybrid algorithm, called HyCeltyc, for clustering massively large high dimensional datasets in O(MN) time which is linear in the size of the data. HyCeltyc, which stands for Hybrid Cell Density Clustering method, combines a cell-density based algorithm with a hierarchical agglomerative method to identify clusters in linear time. The main steps of the algorithm involve sampling, dimensionality reduction, selection of significant features on which to cluster the data and a grid-based clustering algorithm that is linear in the data size.",
author = "Otoo, {Ekow J.} and Arie Shoshani and Seungwon Hwang",
year = "2001",
month = "12",
day = "1",
doi = "10.1023/A:1012853629322",
language = "English",
volume = "17",
pages = "147--168",
journal = "Journal of Intelligent Information Systems",
issn = "0925-9902",
publisher = "Springer Netherlands",
number = "2-3",

}

TY - JOUR

T1 - Clustering high dimensional massive scientific datasets

AU - Otoo, Ekow J.

AU - Shoshani, Arie

AU - Hwang, Seungwon

PY - 2001/12/1

Y1 - 2001/12/1

N2 - Many scientific applications can benefit from an efficient clustering algorithm of massively large high dimensional datasets. However most of the developed algorithms are impractical to use when the amount of data is very large. Given N objects each defined by an M-dimensional feature vector, any clustering technique for handling very large datasets in high dimensional space should run in time O(MN) at best, and O(MN log N) in the worst case, using no more than O(NM) storage, for it to be practical. We introduce a hybrid algorithm, called HyCeltyc, for clustering massively large high dimensional datasets in O(MN) time which is linear in the size of the data. HyCeltyc, which stands for Hybrid Cell Density Clustering method, combines a cell-density based algorithm with a hierarchical agglomerative method to identify clusters in linear time. The main steps of the algorithm involve sampling, dimensionality reduction, selection of significant features on which to cluster the data and a grid-based clustering algorithm that is linear in the data size.

UR - http://www.scopus.com/inward/record.url?scp=0035679849&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=0035679849&partnerID=8YFLogxK

U2 - 10.1023/A:1012853629322

DO - 10.1023/A:1012853629322

M3 - Article

VL - 17

SP - 147

EP - 168

JO - Journal of Intelligent Information Systems

JF - Journal of Intelligent Information Systems

SN - 0925-9902

IS - 2-3

ER -