Statistical grid-based clustering over data streams

Nam Hun Park, Won Suk Lee

Research output: Contribution to journal › Article

78 Citations (Scopus)

Abstract

A data stream is a massive, unbounded sequence of data elements generated continuously at a rapid rate. For this reason, most algorithms for data streams sacrifice the correctness of their results for fast processing time, which in turn is greatly influenced by the amount of information that must be maintained. This paper proposes a statistical grid-based approach to clustering the data elements of a data stream. Initially, the multidimensional data space of a data stream is partitioned into a set of mutually exclusive, equal-size initial cells. When the support of a cell becomes high enough, the cell is dynamically divided into two mutually exclusive intermediate cells based on its distribution statistics. Three different ways of partitioning a dense cell are introduced. A dense region of each initial cell is thus recursively partitioned until it reaches the smallest cell size, called a unit cell. A cluster of a data stream is a group of adjacent dense unit cells. To minimize the number of cells, a sparse intermediate or unit cell is pruned when its support falls well below a minimum support. Furthermore, to confine memory usage, the size of a unit cell is dynamically minimized so that the clustering result is as accurate as possible. The proposed algorithm is analyzed through a series of experiments to identify its various characteristics.
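
The abstract's pipeline (initial cells, support-driven splitting, pruning of sparse cells, clusters as runs of adjacent dense unit cells) can be illustrated with a minimal one-dimensional sketch. This is not the authors' algorithm. The abstract does not specify the three partitioning methods, how support is computed, or how unit-cell size is adapted, so everything below is an assumption made for illustration: the names `Cell` and `GridClusterer`, the parameters `min_sup`, `prune_sup` and `warmup`, the rule of cutting at the cell mean, and the choice to measure support from the moment a cell is created.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    lo: float              # inclusive lower bound
    hi: float              # exclusive upper bound
    born: int = 0          # stream position at which the cell was created
    count: int = 0         # elements that fell in the cell since then
    total: float = 0.0     # running sum of those elements (gives the mean)
    initial: bool = False  # True for the equal-size starting cells


class GridClusterer:
    """Hypothetical 1-D illustration; all parameters are assumptions."""

    def __init__(self, lo, hi, n_initial, unit_width, min_sup, prune_sup, warmup=20):
        step = (hi - lo) / n_initial
        self.cells = [Cell(lo + i * step, lo + (i + 1) * step, initial=True)
                      for i in range(n_initial)]
        self.unit_width, self.min_sup = unit_width, min_sup
        self.prune_sup, self.warmup = prune_sup, warmup
        self.n = 0  # elements seen so far

    def support(self, cell):
        # Assumed definition: share of the elements seen since the cell was
        # created that fell in it. The warmup floor keeps young cells from
        # looking dense after only a few hits.
        return cell.count / max(self.n - cell.born, self.warmup)

    def insert(self, x):
        self.n += 1
        for idx, cell in enumerate(self.cells):
            if cell.lo <= x < cell.hi:
                cell.count += 1
                cell.total += x
                wide = cell.hi - cell.lo > self.unit_width
                if wide and self.support(cell) >= self.min_sup:
                    self._split(idx, cell)
                return
        # Elements outside the space or in a pruned region only advance n.

    def _split(self, idx, cell):
        # Assumed rule: cut at the mean, clamped so that neither child is
        # narrower than half a unit cell. Children start with empty statistics.
        half = self.unit_width / 2
        mean = cell.total / cell.count
        cut = min(max(mean, cell.lo + half), cell.hi - half)
        self.cells[idx:idx + 1] = [Cell(cell.lo, cut, born=self.n),
                                   Cell(cut, cell.hi, born=self.n)]

    def prune(self):
        # Drop sparse cells produced by splits once they are past warmup;
        # initial cells are always kept.
        self.cells = [c for c in self.cells
                      if c.initial
                      or self.n - c.born < self.warmup
                      or self.support(c) >= self.prune_sup]

    def clusters(self):
        # Merge runs of adjacent dense unit cells into (lo, hi) intervals.
        out = []
        for c in self.cells:
            if c.hi - c.lo > self.unit_width or self.support(c) < self.min_sup:
                continue
            if out and out[-1][1] == c.lo:
                out[-1] = (out[-1][0], c.hi)
            else:
                out.append((c.lo, c.hi))
        return out


g = GridClusterer(0, 8, n_initial=2, unit_width=2.0,
                  min_sup=0.5, prune_sup=0.1, warmup=4)
for x in [1.0, 3.0, 1.0, 1.0, 1.0, 1.0]:
    g.insert(x)
print(g.clusters())  # [(0.0, 2.0)]
```

In this toy run the initial cell [0, 4) reaches the assumed support threshold after two elements and is cut at its mean, 2.0, and the later elements make [0, 2) the only dense unit cell. Calling `g.prune()` afterwards would drop the empty cell [2, 4). Extending the sketch to several dimensions would also require choosing which dimension to cut, which the abstract leaves to the paper's three partitioning methods.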

Original language: English
Pages (from-to): 32-37
Number of pages: 6
Journal: SIGMOD Record
Volume: 33
Issue number: 1
DOI: 10.1145/974121.974127
Publication status: Published - 2004 Mar 1

Fingerprint

Processing
Statistics
Data storage equipment
Experiments

All Science Journal Classification (ASJC) codes

  • Software
  • Information Systems

Cite this

Park, Nam Hun ; Lee, Won Suk. / Statistical grid-based clustering over data streams. In: SIGMOD Record. 2004 ; Vol. 33, No. 1. pp. 32-37.
@article{c953788f23094a3bb3ecb4df2e3aea6f,
title = "Statistical grid-based clustering over data streams",
abstract = "A data stream is a massive unbounded sequence of data elements continuously generated at a rapid rate. Due to this reason, most algorithms for data streams sacrifice the correctness of their results for fast processing time. The processing time is greatly influenced by the amount of information that should be maintained. This paper proposes a statistical grid-based approach to clustering data elements of a data stream. Initially, the multidimensional data space of a data stream is partitioned into a set of mutually exclusive equal-size initial cells. When the support of a cell becomes high enough, the cell is dynamically divided into two mutually exclusive intermediate cells based on its distribution statistics. Three different ways of partitioning a dense cell are introduced. Eventually, a dense region of each initial cell is recursively partitioned until it becomes the smallest cell called a unit cell. A cluster of a data stream is a group of adjacent dense unit cells. In order to minimize the number of cells, a sparse intermediate or unit cell is pruned if its support becomes much less than a minimum support. Furthermore, in order to confine the usage of memory space, the size of a unit cell is dynamically minimized such that the result of clustering becomes as accurate as possible. The proposed algorithm is analyzed by a series of experiments to identify its various characteristics.",
author = "Park, {Nam Hun} and Lee, {Won Suk}",
year = "2004",
month = "3",
day = "1",
doi = "10.1145/974121.974127",
language = "English",
volume = "33",
pages = "32--37",
journal = "SIGMOD Record",
issn = "0163-5808",
publisher = "Association for Computing Machinery (ACM)",
number = "1",

}

TY - JOUR

T1 - Statistical grid-based clustering over data streams

AU - Park, Nam Hun

AU - Lee, Won Suk

PY - 2004/3/1

Y1 - 2004/3/1

N2 - A data stream is a massive unbounded sequence of data elements continuously generated at a rapid rate. Due to this reason, most algorithms for data streams sacrifice the correctness of their results for fast processing time. The processing time is greatly influenced by the amount of information that should be maintained. This paper proposes a statistical grid-based approach to clustering data elements of a data stream. Initially, the multidimensional data space of a data stream is partitioned into a set of mutually exclusive equal-size initial cells. When the support of a cell becomes high enough, the cell is dynamically divided into two mutually exclusive intermediate cells based on its distribution statistics. Three different ways of partitioning a dense cell are introduced. Eventually, a dense region of each initial cell is recursively partitioned until it becomes the smallest cell called a unit cell. A cluster of a data stream is a group of adjacent dense unit cells. In order to minimize the number of cells, a sparse intermediate or unit cell is pruned if its support becomes much less than a minimum support. Furthermore, in order to confine the usage of memory space, the size of a unit cell is dynamically minimized such that the result of clustering becomes as accurate as possible. The proposed algorithm is analyzed by a series of experiments to identify its various characteristics.

UR - http://www.scopus.com/inward/record.url?scp=14344255219&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=14344255219&partnerID=8YFLogxK

U2 - 10.1145/974121.974127

DO - 10.1145/974121.974127

M3 - Article

AN - SCOPUS:14344255219

VL - 33

SP - 32

EP - 37

JO - SIGMOD Record

JF - SIGMOD Record

SN - 0163-5808

IS - 1

ER -