Efficient processing of similarity search under time warping in sequence databases: An index-based approach

Sang Wook Kim, Sanghyun Park, Wesley W. Chu

Research output: Contribution to journal (Article)

32 Citations (Scopus)

Abstract

This paper discusses the effective processing of similarity search that supports time warping in large sequence databases. Time warping enables sequences with similar patterns to be found even when they are of different lengths. Prior methods for processing similarity search that supports time warping failed to employ multi-dimensional indexes without false dismissal since the time warping distance does not satisfy the triangle inequality. They have to scan the entire database, thus suffering from serious performance degradation in large databases. Another method that uses a suffix tree, which does not assume any distance function, also shows poor performance due to the large tree size. In this paper, we propose a novel method for similarity search that supports time warping. Our primary goal is to enhance the search performance in large databases without permitting any false dismissal. To attain this goal, we have devised a new distance function, Dtw-lb, which consistently underestimates the time warping distance and satisfies the triangle inequality. Dtw-lb uses a 4-tuple feature vector that is extracted from each sequence and is invariant to time warping. For the efficient processing of similarity search, we employ a multi-dimensional index that uses the 4-tuple feature vector as indexing attributes, and Dtw-lb as a distance function. We prove that our method does not incur false dismissal. To verify the superiority of our method, we have performed extensive experiments. The results reveal that our method achieves a significant improvement in speed: up to 43 times faster with a data set containing real-world S&P 500 stock data sequences, and up to 720 times faster with data sets containing a very large volume of synthetic data sequences. The performance gain increases (1) as the number of data sequences increases, (2) as the average length of data sequences increases, and (3) as the tolerance in a query decreases. Considering the characteristics of real databases, these tendencies imply that our approach is suitable for practical applications.
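
The filter-and-refine idea in the abstract can be illustrated with a minimal Python sketch. The abstract does not say which four features make up the vector or which base distance is used, so the sketch assumes the feature vector is (first, last, greatest, smallest element) and that the time warping distance takes the maximum element-wise gap along the warping path (L-infinity base). The function names are hypothetical, and a linear scan over precomputed feature vectors stands in for the multi-dimensional index.

```python
from typing import List, Sequence, Tuple

Feature = Tuple[float, float, float, float]


def extract_features(seq: Sequence[float]) -> Feature:
    """Assumed 4-tuple: (first, last, greatest, smallest).
    None of these values changes when elements are repeated (warped)."""
    if not seq:
        raise ValueError("sequence must be non-empty")
    return (seq[0], seq[-1], max(seq), min(seq))


def dtw_lb(f: Feature, g: Feature) -> float:
    """Lower-bound distance: L-infinity distance between feature vectors.
    As a norm-induced distance it satisfies the triangle inequality."""
    return max(abs(a - b) for a, b in zip(f, g))


def dtw_linf(s: Sequence[float], q: Sequence[float]) -> float:
    """Time warping distance with an L-infinity base (assumption):
    the smallest achievable maximum gap over all warping paths."""
    n, m = len(s), len(q)
    inf = float("inf")
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(s[i - 1] - q[j - 1])
            best_prev = min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
            d[i][j] = max(cost, best_prev)
    return d[n][m]


def range_query(db: List[Sequence[float]], query: Sequence[float],
                eps: float) -> List[int]:
    """Filter with dtw_lb, then refine with the exact distance.
    The linear scan over features stands in for an index lookup."""
    qf = extract_features(query)
    features = [extract_features(s) for s in db]  # precomputed in practice
    candidates = [i for i, f in enumerate(features) if dtw_lb(f, qf) <= eps]
    return [i for i in candidates if dtw_linf(db[i], query) <= eps]
```

Under these assumptions the filter cannot cause a false dismissal: every warping path aligns the two first elements and the two last elements, and the greatest (or smallest) element of one sequence must be aligned with some element no greater (or no smaller) than the other sequence's extreme, so dtw_lb never exceeds dtw_linf.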

Original language: English
Pages (from-to): 405-420
Number of pages: 16
Journal: Information Systems
Volume: 29
Issue number: 5
DOIs: 10.1016/S0306-4379(03)00037-1
Publication status: Published - 2004 Jul 1

Fingerprint

Processing
Degradation
Experiments

All Science Journal Classification (ASJC) codes

  • Software
  • Information Systems
  • Hardware and Architecture

Cite this

@article{5c9ad7cbd61b4e07a5557b5305f53531,
title = "Efficient processing of similarity search under time warping in sequence databases: An index-based approach",
abstract = "This paper discusses the effective processing of similarity search that supports time warping in large sequence databases. Time warping enables sequences with similar patterns to be found even when they are of different lengths. Prior methods for processing similarity search that supports time warping failed to employ multi-dimensional indexes without false dismissal since the time warping distance does not satisfy the triangle inequality. They have to scan the entire database, thus suffering from serious performance degradation in large databases. Another method that uses a suffix tree, which does not assume any distance function, also shows poor performance due to the large tree size. In this paper, we propose a novel method for similarity search that supports time warping. Our primary goal is to enhance the search performance in large databases without permitting any false dismissal. To attain this goal, we have devised a new distance function, Dtw-lb, which consistently underestimates the time warping distance and satisfies the triangle inequality. Dtw-lb uses a 4-tuple feature vector that is extracted from each sequence and is invariant to time warping. For the efficient processing of similarity search, we employ a multi-dimensional index that uses the 4-tuple feature vector as indexing attributes, and Dtw-lb as a distance function. We prove that our method does not incur false dismissal. To verify the superiority of our method, we have performed extensive experiments. The results reveal that our method achieves a significant improvement in speed: up to 43 times faster with a data set containing real-world S\&P 500 stock data sequences, and up to 720 times faster with data sets containing a very large volume of synthetic data sequences. The performance gain increases (1) as the number of data sequences increases, (2) as the average length of data sequences increases, and (3) as the tolerance in a query decreases. Considering the characteristics of real databases, these tendencies imply that our approach is suitable for practical applications.",
author = "Kim, {Sang Wook} and Sanghyun Park and Chu, {Wesley W.}",
year = "2004",
month = "7",
day = "1",
doi = "10.1016/S0306-4379(03)00037-1",
language = "English",
volume = "29",
pages = "405--420",
journal = "Information Systems",
issn = "0306-4379",
publisher = "Elsevier Limited",
number = "5",

}

Efficient processing of similarity search under time warping in sequence databases: An index-based approach. / Kim, Sang Wook; Park, Sanghyun; Chu, Wesley W.

In: Information Systems, Vol. 29, No. 5, 01.07.2004, p. 405-420.

TY - JOUR

T1 - Efficient processing of similarity search under time warping in sequence databases

T2 - An index-based approach

AU - Kim, Sang Wook

AU - Park, Sanghyun

AU - Chu, Wesley W.

PY - 2004/7/1

Y1 - 2004/7/1

AB - This paper discusses the effective processing of similarity search that supports time warping in large sequence databases. Time warping enables sequences with similar patterns to be found even when they are of different lengths. Prior methods for processing similarity search that supports time warping failed to employ multi-dimensional indexes without false dismissal since the time warping distance does not satisfy the triangle inequality. They have to scan the entire database, thus suffering from serious performance degradation in large databases. Another method that uses a suffix tree, which does not assume any distance function, also shows poor performance due to the large tree size. In this paper, we propose a novel method for similarity search that supports time warping. Our primary goal is to enhance the search performance in large databases without permitting any false dismissal. To attain this goal, we have devised a new distance function, Dtw-lb, which consistently underestimates the time warping distance and satisfies the triangle inequality. Dtw-lb uses a 4-tuple feature vector that is extracted from each sequence and is invariant to time warping. For the efficient processing of similarity search, we employ a multi-dimensional index that uses the 4-tuple feature vector as indexing attributes, and Dtw-lb as a distance function. We prove that our method does not incur false dismissal. To verify the superiority of our method, we have performed extensive experiments. The results reveal that our method achieves a significant improvement in speed: up to 43 times faster with a data set containing real-world S&P 500 stock data sequences, and up to 720 times faster with data sets containing a very large volume of synthetic data sequences. The performance gain increases (1) as the number of data sequences increases, (2) as the average length of data sequences increases, and (3) as the tolerance in a query decreases. Considering the characteristics of real databases, these tendencies imply that our approach is suitable for practical applications.

UR - http://www.scopus.com/inward/record.url?scp=1342264178&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=1342264178&partnerID=8YFLogxK

U2 - 10.1016/S0306-4379(03)00037-1

DO - 10.1016/S0306-4379(03)00037-1

M3 - Article

AN - SCOPUS:1342264178

VL - 29

SP - 405

EP - 420

JO - Information Systems

JF - Information Systems

SN - 0306-4379

IS - 5

ER -