Statistical σ-partition clustering over data streams

Nam Hun Park, Won Suk Lee

Research output: Contribution to journal › Conference article

Abstract

This paper proposes a grid-based clustering method that dynamically partitions the range of a grid cell based on the distribution statistics of the data elements it receives from a data stream. Initially, the multi-dimensional space of the data domain is partitioned into a set of mutually exclusive, equal-size initial cells. As new data elements are continuously generated, each cell monitors the distribution statistics of the data elements within its range. When the support of the data elements in a cell becomes high enough, the cell is dynamically divided into two mutually exclusive smaller cells, called intermediate cells, under the assumption that the data elements are normally distributed. The dense sub-range of an initial cell is recursively partitioned in this way until it reaches the smallest cell size, called a unit cell. To minimize the number of cells, a sparse intermediate or unit cell can be pruned if its support falls well below a minimum support. The performance of the proposed method is analyzed comparatively through a series of experiments.
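
The split-and-prune cycle described in the abstract can be illustrated with a minimal one-dimensional sketch. It is not the authors' algorithm: the abstract does not give the split rule or the thresholds, so cutting a cell at its running mean, expressing the split threshold as an absolute count (split_count), starting child cells with empty statistics, and dropping pruned ranges altogether are all assumptions made for illustration. The names Cell, StreamGrid, split_count, unit_width and min_support are hypothetical. The running standard deviation is tracked as a distribution statistic but is not used to choose the cut in this sketch.

```python
import math
from dataclasses import dataclass


@dataclass
class Cell:
    """A 1-D grid cell over the half-open range [lo, hi)."""
    lo: float
    hi: float
    initial: bool = True   # False for intermediate and unit cells
    count: int = 0         # elements seen in this cell
    mean: float = 0.0      # running mean (Welford's update)
    m2: float = 0.0        # running sum of squared deviations

    def add(self, x: float) -> None:
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    def std(self) -> float:
        return math.sqrt(self.m2 / self.count) if self.count else 0.0


class StreamGrid:
    def __init__(self, lo, hi, n_initial, split_count, unit_width):
        width = (hi - lo) / n_initial
        # Mutually exclusive, equal-size initial cells.
        self.cells = [Cell(lo + i * width, lo + (i + 1) * width)
                      for i in range(n_initial)]
        self.split_count = split_count  # assumed: absolute count threshold
        self.unit_width = unit_width    # cells this narrow are never split
        self.total = 0

    def insert(self, x: float) -> None:
        self.total += 1
        for i, cell in enumerate(self.cells):
            if cell.lo <= x < cell.hi:
                cell.add(x)
                wide = cell.hi - cell.lo > self.unit_width
                if wide and cell.count >= self.split_count:
                    self._split(i)
                return
        # Elements outside the domain or in a pruned range are ignored.

    def _split(self, i: int) -> None:
        cell = self.cells[i]
        cut = cell.mean  # assumed split point: the running mean
        if not (cell.lo < cut < cell.hi):
            cut = (cell.lo + cell.hi) / 2
        # Assumption: children start with empty statistics.
        self.cells[i:i + 1] = [Cell(cell.lo, cut, initial=False),
                               Cell(cut, cell.hi, initial=False)]

    def prune(self, min_support: float) -> None:
        """Drop non-initial cells whose support (count / total) is too low."""
        if self.total == 0:
            return
        self.cells = [c for c in self.cells
                      if c.initial or c.count / self.total >= min_support]


if __name__ == "__main__":
    grid = StreamGrid(0.0, 10.0, n_initial=2, split_count=4, unit_width=1.0)
    for value in [1.0, 2.0, 3.0, 2.0, 2.5]:
        grid.insert(value)
    grid.prune(min_support=0.1)
    for c in grid.cells:
        print(c.lo, c.hi, c.count)
```

A multi-dimensional version would need a rule for choosing the dimension to split on, which the abstract does not specify.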

Original language: English
Pages (from-to): 387-398
Number of pages: 12
Journal: Lecture Notes in Artificial Intelligence (Subseries of Lecture Notes in Computer Science)
Volume: 2838
Publication status: Published - 2003 Dec 1
Event: 7th European Conference on Principles and Practice of Knowledge Discovery in Databases - Cavtat-Dubrovnik, Croatia
Duration: 2003 Sep 22 to 2003 Sep 26

Fingerprint

Data Streams
Partition
Statistics
Clustering
Cell
Normal distribution
Mutually exclusive
Experiments
Range of data
Grid
Unit
Clustering Methods
Gaussian distribution
Monitor
Minimise

All Science Journal Classification (ASJC) codes

  • Theoretical Computer Science
  • Computer Science(all)

Cite this

@article{6cbb21817c934afe9477a91c3d900c31,
title = "Statistical σ-partition clustering over data streams",
author = "Park, {Nam Hun} and Lee, {Won Suk}",
year = "2003",
month = "12",
day = "1",
language = "English",
volume = "2838",
pages = "387--398",
journal = "Lecture Notes in Computer Science",
issn = "0302-9743",
publisher = "Springer Verlag",

}

TY - JOUR

T1 - Statistical σ-partition clustering over data streams

AU - Park, Nam Hun

AU - Lee, Won Suk

PY - 2003/12/1

Y1 - 2003/12/1

UR - http://www.scopus.com/inward/record.url?scp=9444258641&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=9444258641&partnerID=8YFLogxK

M3 - Conference article

AN - SCOPUS:9444258641

VL - 2838

SP - 387

EP - 398

JO - Lecture Notes in Computer Science

JF - Lecture Notes in Computer Science

SN - 0302-9743

ER -