Efficiently tracing clusters over high-dimensional on-line data streams

Jae Woo Lee, Nam Hun Park, Won Suk Lee

Research output: Contribution to journal (Article)

8 Citations (Scopus)

Abstract

A good clustering method should scale flexibly with both the number of dimensions and the size of a data set. This paper proposes a method for efficiently tracing the clusters of a high-dimensional on-line data stream. While the one-dimensional clusters of each dimension are traced independently, a technique similar to frequent itemset mining is employed to find the set of multi-dimensional clusters. By finding a frequently co-occurring set of one-dimensional clusters, it is possible to trace a multi-dimensional rectangular space whose range is defined collectively by those one-dimensional clusters. To trace such candidates over a multi-dimensional on-line data stream, a cluster-statistics tree (CS-tree) is proposed. A node at depth k (k ≤ d) in the CS-tree corresponds to a k-dimensional rectangular space. Each node keeps track of the density of data elements in its corresponding rectangular space. Only a node corresponding to a dense rectangular space is allowed to have a child node. Scalability in the number of dimensions is greatly enhanced at the cost of a slight loss in the accuracy of the identified clusters.
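The tree structure described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' algorithm: it assumes the one-dimensional clusters are already known and fixed as half-open intervals per dimension (the paper traces them on-line), and it omits any decay or pruning the paper may use. The names `Node`, `CSTreeSketch`, `intervals`, and `min_density` are hypothetical and chosen for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One rectangular space: a count plus children keyed by (dim, interval index)."""
    count: int = 0
    children: dict = field(default_factory=dict)


class CSTreeSketch:
    """Illustrative prefix tree over co-occurring one-dimensional clusters."""

    def __init__(self, intervals, min_density):
        # ASSUMPTION: intervals[d] is a fixed list of (low, high) one-dimensional
        # clusters for dimension d; the paper maintains these dynamically.
        self.intervals = intervals
        self.min_density = min_density
        self.root = Node()
        self.total = 0

    def _items(self, point):
        # Map a point to the one-dimensional clusters it falls into,
        # ordered by dimension so every combination has a unique path.
        items = []
        for dim, value in enumerate(point):
            for idx, (low, high) in enumerate(self.intervals[dim]):
                if low <= value < high:
                    items.append((dim, idx))
                    break
        return items

    def _is_dense(self, node):
        return node.count / self.total >= self.min_density

    def insert(self, point):
        self.total += 1
        self._update(self.root, self._items(point))

    def _update(self, node, items):
        for i, item in enumerate(items):
            child = node.children.setdefault(item, Node())
            child.count += 1
            # Only a dense node is allowed to grow children.
            if self._is_dense(child):
                self._update(child, items[i + 1:])

    def dense_rectangles(self):
        # Return (path, count) for every dense node; a path of length k
        # describes a k-dimensional rectangular space.
        found = []

        def walk(node, path):
            for item, child in node.children.items():
                if self._is_dense(child):
                    found.append((path + (item,), child.count))
                    walk(child, path + (item,))

        walk(self.root, ())
        return found


tree = CSTreeSketch([[(0, 10), (10, 20)], [(0, 10), (10, 20)]], min_density=0.5)
for p in [(1, 1), (2, 2), (3, 3), (15, 15)]:
    tree.insert(p)
print(tree.dense_rectangles())
```

Because a child only starts counting once its parent is dense, counts at deeper nodes in this sketch can lag behind the true co-occurrence counts; that is one simple way to picture how accuracy might be traded for scalability, though the paper's actual mechanism may differ.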

Original language: English
Pages (from-to): 362-379
Number of pages: 18
Journal: Data and Knowledge Engineering
Volume: 68
Issue number: 3
DOIs: 10.1016/j.datak.2008.11.004
Publication status: Published - 2009 Mar 1

Fingerprint

Data streams
Node
Scalability
Statistics
Clustering

All Science Journal Classification (ASJC) codes

  • Information Systems and Management

Cite this

@article{c16c02ef089a444bb47206453f576fc0,
title = "Efficiently tracing clusters over high-dimensional on-line data streams",
abstract = "A good clustering method should provide flexible scalability on the number of dimensions as well as the size of a data set. This paper proposes a method of efficiently tracing the clusters of a high-dimensional on-line data stream. While tracing the one-dimensional clusters of each dimension independently, a technique which is similar to frequent itemset mining is employed to find the set of multi-dimensional clusters. By finding a frequently co-occurred set of one-dimensional clusters, it is possible to trace a multi-dimensional rectangular space whose range is defined by the one-dimensional clusters collectively. In order to trace such candidates over a multi-dimensional online data stream, a cluster-statistics tree (CS-Tree) is proposed in this paper. A k-depth node(k ≤ d) in the CS-tree is corresponding to a k-dimensional rectangular space. Each node keeps track of the density of data elements in its corresponding rectangular space. Only a node corresponding to a dense rectangular space is allowed to have a child node. The scalability on the number of dimensions is greatly enhanced while sacrificing the accuracy of identified clusters slightly.",
author = "Lee, {Jae Woo} and Park, {Nam Hun} and Lee, {Won Suk}",
year = "2009",
month = "3",
day = "1",
doi = "10.1016/j.datak.2008.11.004",
language = "English",
volume = "68",
pages = "362--379",
journal = "Data and Knowledge Engineering",
issn = "0169-023X",
publisher = "Elsevier",
number = "3",
}

Efficiently tracing clusters over high-dimensional on-line data streams. / Lee, Jae Woo; Park, Nam Hun; Lee, Won Suk.

In: Data and Knowledge Engineering, Vol. 68, No. 3, 01.03.2009, p. 362-379.

TY - JOUR

T1 - Efficiently tracing clusters over high-dimensional on-line data streams

AU - Lee, Jae Woo

AU - Park, Nam Hun

AU - Lee, Won Suk

PY - 2009/3/1

Y1 - 2009/3/1

N2 - A good clustering method should provide flexible scalability on the number of dimensions as well as the size of a data set. This paper proposes a method of efficiently tracing the clusters of a high-dimensional on-line data stream. While tracing the one-dimensional clusters of each dimension independently, a technique which is similar to frequent itemset mining is employed to find the set of multi-dimensional clusters. By finding a frequently co-occurred set of one-dimensional clusters, it is possible to trace a multi-dimensional rectangular space whose range is defined by the one-dimensional clusters collectively. In order to trace such candidates over a multi-dimensional online data stream, a cluster-statistics tree (CS-Tree) is proposed in this paper. A k-depth node(k ≤ d) in the CS-tree is corresponding to a k-dimensional rectangular space. Each node keeps track of the density of data elements in its corresponding rectangular space. Only a node corresponding to a dense rectangular space is allowed to have a child node. The scalability on the number of dimensions is greatly enhanced while sacrificing the accuracy of identified clusters slightly.

UR - http://www.scopus.com/inward/record.url?scp=61349152282&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=61349152282&partnerID=8YFLogxK

U2 - 10.1016/j.datak.2008.11.004

DO - 10.1016/j.datak.2008.11.004

M3 - Article

AN - SCOPUS:61349152282

VL - 68

SP - 362

EP - 379

JO - Data and Knowledge Engineering

JF - Data and Knowledge Engineering

SN - 0169-023X

IS - 3

ER -