Finding frequent itemsets over online data streams

Joong Hyuk Chang, Won Suk Lee

Research output: Contribution to journalArticle

7 Citations (Scopus)

Abstract

Conventional data mining methods for finding frequent itemsets require considerable computing time to produce their results from a large data set. Due to this reason, it is almost impossible to apply them to an analysis task in an online data stream where a new transaction is continuously generated at a rapid rate. An algorithm for finding frequent itemsets over an online data stream should support flexible trade-off between processing time and mining accuracy. Furthermore, the most up-to-date resulting set of frequent itemsets should be available quickly at any moment. To satisfy these requirements, this paper proposes a data mining method for finding frequent itemsets over an online data stream. The proposed method examines each transaction one-by-one without any candidate generation process. The count of an itemset that appears in each transaction is monitored by a lexicographic tree resided in main memory. The current set of monitored itemsets in an online data stream is minimized by two major operations: delayed-insertion and pruning. The former is delaying the insertion of a new itemset in recent transactions until the itemset becomes significant enough to be monitored. The latter is pruning a monitored itemset when the itemset turns out to be insignificant. The number of monitored itemsets can be flexibly controlled by the thresholds of these two operations. As the number of monitored itemsets is decreased, frequent itemsets in the online data stream are more rapidly traced while they are less accurate. The performance of the proposed method is analyzed through a series of experiments in order to identify its various characteristics.

Original languageEnglish
Pages (from-to)606-618
Number of pages13
JournalInformation and Software Technology
Volume48
Issue number7
DOIs
Publication statusPublished - 2006 Jul 1

Fingerprint

Data mining
Data storage equipment
Processing
Experiments

All Science Journal Classification (ASJC) codes

  • Software
  • Information Systems
  • Computer Science Applications

Cite this

@article{c8ae34968b574f40b83412d8851a46a2,
title = "Finding frequent itemsets over online data streams",
abstract = "Conventional data mining methods for finding frequent itemsets require considerable computing time to produce their results from a large data set. Due to this reason, it is almost impossible to apply them to an analysis task in an online data stream where a new transaction is continuously generated at a rapid rate. An algorithm for finding frequent itemsets over an online data stream should support flexible trade-off between processing time and mining accuracy. Furthermore, the most up-to-date resulting set of frequent itemsets should be available quickly at any moment. To satisfy these requirements, this paper proposes a data mining method for finding frequent itemsets over an online data stream. The proposed method examines each transaction one-by-one without any candidate generation process. The count of an itemset that appears in each transaction is monitored by a lexicographic tree resided in main memory. The current set of monitored itemsets in an online data stream is minimized by two major operations: delayed-insertion and pruning. The former is delaying the insertion of a new itemset in recent transactions until the itemset becomes significant enough to be monitored. The latter is pruning a monitored itemset when the itemset turns out to be insignificant. The number of monitored itemsets can be flexibly controlled by the thresholds of these two operations. As the number of monitored itemsets is decreased, frequent itemsets in the online data stream are more rapidly traced while they are less accurate. The performance of the proposed method is analyzed through a series of experiments in order to identify its various characteristics.",
author = "Chang, {Joong Hyuk} and Lee, {Won Suk}",
year = "2006",
month = "7",
day = "1",
doi = "10.1016/j.infsof.2005.06.004",
language = "English",
volume = "48",
pages = "606--618",
journal = "Information and Software Technology",
issn = "0950-5849",
publisher = "Elsevier",
number = "7",

}

Finding frequent itemsets over online data streams. / Chang, Joong Hyuk; Lee, Won Suk.

In: Information and Software Technology, Vol. 48, No. 7, 01.07.2006, p. 606-618.

Research output: Contribution to journalArticle

TY - JOUR

T1 - Finding frequent itemsets over online data streams

AU - Chang, Joong Hyuk

AU - Lee, Won Suk

PY - 2006/7/1

Y1 - 2006/7/1

N2 - Conventional data mining methods for finding frequent itemsets require considerable computing time to produce their results from a large data set. Due to this reason, it is almost impossible to apply them to an analysis task in an online data stream where a new transaction is continuously generated at a rapid rate. An algorithm for finding frequent itemsets over an online data stream should support flexible trade-off between processing time and mining accuracy. Furthermore, the most up-to-date resulting set of frequent itemsets should be available quickly at any moment. To satisfy these requirements, this paper proposes a data mining method for finding frequent itemsets over an online data stream. The proposed method examines each transaction one-by-one without any candidate generation process. The count of an itemset that appears in each transaction is monitored by a lexicographic tree resided in main memory. The current set of monitored itemsets in an online data stream is minimized by two major operations: delayed-insertion and pruning. The former is delaying the insertion of a new itemset in recent transactions until the itemset becomes significant enough to be monitored. The latter is pruning a monitored itemset when the itemset turns out to be insignificant. The number of monitored itemsets can be flexibly controlled by the thresholds of these two operations. As the number of monitored itemsets is decreased, frequent itemsets in the online data stream are more rapidly traced while they are less accurate. The performance of the proposed method is analyzed through a series of experiments in order to identify its various characteristics.

AB - Conventional data mining methods for finding frequent itemsets require considerable computing time to produce their results from a large data set. Due to this reason, it is almost impossible to apply them to an analysis task in an online data stream where a new transaction is continuously generated at a rapid rate. An algorithm for finding frequent itemsets over an online data stream should support flexible trade-off between processing time and mining accuracy. Furthermore, the most up-to-date resulting set of frequent itemsets should be available quickly at any moment. To satisfy these requirements, this paper proposes a data mining method for finding frequent itemsets over an online data stream. The proposed method examines each transaction one-by-one without any candidate generation process. The count of an itemset that appears in each transaction is monitored by a lexicographic tree resided in main memory. The current set of monitored itemsets in an online data stream is minimized by two major operations: delayed-insertion and pruning. The former is delaying the insertion of a new itemset in recent transactions until the itemset becomes significant enough to be monitored. The latter is pruning a monitored itemset when the itemset turns out to be insignificant. The number of monitored itemsets can be flexibly controlled by the thresholds of these two operations. As the number of monitored itemsets is decreased, frequent itemsets in the online data stream are more rapidly traced while they are less accurate. The performance of the proposed method is analyzed through a series of experiments in order to identify its various characteristics.

UR - http://www.scopus.com/inward/record.url?scp=33646520347&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=33646520347&partnerID=8YFLogxK

U2 - 10.1016/j.infsof.2005.06.004

DO - 10.1016/j.infsof.2005.06.004

M3 - Article

AN - SCOPUS:33646520347

VL - 48

SP - 606

EP - 618

JO - Information and Software Technology

JF - Information and Software Technology

SN - 0950-5849

IS - 7

ER -