Fast retrieval of similar subsequences in long sequence databases

Sang Hyun Park, Dongwon Lee, Wesley W. Chu

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

53 Citations (Scopus)

Abstract

Although the Euclidean distance has been the most popular similarity measure in sequence databases, recent techniques prefer high-cost distance functions such as the time warping distance and the editing distance for their wider applicability. However, when these distance functions are applied to the retrieval of similar subsequences, the number of subsequences to be inspected during the search is quadratic in the average length L~ of the data sequences. We propose a novel subsequence matching scheme, called aligned subsequence matching, in which the number of subsequences to be compared with a query sequence is reduced to linear in L~. We also present an indexing technique that speeds up aligned subsequence matching using the modified time warping distance as the similarity measure. Experiments on synthetic data sequences demonstrate the effectiveness of the proposed approach: it consistently outperformed sequential scanning and achieved a speed-up of up to 6.5 times.
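The quadratic cost mentioned in the abstract can be illustrated with a minimal sketch. It assumes the textbook dynamic-programming time warping distance with absolute difference as the base cost, and a naive search that enumerates every contiguous subsequence. The function names are hypothetical, and neither the paper's modified time warping distance nor its aligned subsequence matching is reproduced here, since this record does not define them.

```python
from math import inf


def time_warping_distance(a, b):
    """Textbook time warping distance; base cost |x - y| is an assumption."""
    n, m = len(a), len(b)
    # d[i][j] holds the cheapest warping cost aligning a[:i] with b[:j].
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # Extend the best of: stretch a, stretch b, or advance both.
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]


def all_subsequences(seq):
    """Yield (start, end) for every contiguous subsequence seq[start:end]."""
    n = len(seq)
    for start in range(n):
        for end in range(start + 1, n + 1):
            yield start, end


def naive_subsequence_search(data, query, eps):
    """Compare the query against every subsequence (quadratically many)."""
    return [
        (s, e)
        for s, e in all_subsequences(data)
        if time_warping_distance(data[s:e], query) <= eps
    ]
```

A sequence of length n has n(n + 1)/2 contiguous subsequences, so a sequence of length 10 already yields 55 candidates, each requiring its own distance computation. A scheme that inspects only a linear number of candidates, as the abstract describes, avoids most of that work.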

Original language: English
Title of host publication: Proceedings - 1999 Workshop on Knowledge and Data Engineering Exchange, KDEX 1999
Editors: Peter Scheuermann
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 60-67
Number of pages: 8
ISBN (Electronic): 0769504531, 9780769504537
DOIs: https://doi.org/10.1109/KDEX.1999.836610
Publication status: Published - 1999 Jan 1
Event: 1999 Workshop on Knowledge and Data Engineering Exchange, KDEX 1999 - Chicago, United States
Duration: 1999 Nov 7 → …

Publication series

Name: Proceedings - 1999 Workshop on Knowledge and Data Engineering Exchange, KDEX 1999

Other

Other: 1999 Workshop on Knowledge and Data Engineering Exchange, KDEX 1999
Country: United States
City: Chicago
Period: 99/11/7 → …

Fingerprint

Scanning
Costs
Experiments
Data base
Similarity measure
Warping
Distance function
Query
Editing
Indexing
Euclidean distance

All Science Journal Classification (ASJC) codes

  • Computer Networks and Communications
  • Information Systems
  • Information Systems and Management

Cite this

Park, S. H., Lee, D., & Chu, W. W. (1999). Fast retrieval of similar subsequences in long sequence databases. In P. Scheuermann (Ed.), Proceedings - 1999 Workshop on Knowledge and Data Engineering Exchange, KDEX 1999 (pp. 60-67). [836610] (Proceedings - 1999 Workshop on Knowledge and Data Engineering Exchange, KDEX 1999). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/KDEX.1999.836610
Park, Sang Hyun ; Lee, Dongwon ; Chu, Wesley W. / Fast retrieval of similar subsequences in long sequence databases. Proceedings - 1999 Workshop on Knowledge and Data Engineering Exchange, KDEX 1999. editor / Peter Scheuermann. Institute of Electrical and Electronics Engineers Inc., 1999. pp. 60-67 (Proceedings - 1999 Workshop on Knowledge and Data Engineering Exchange, KDEX 1999).
@inproceedings{d35b36af574f4beeb3a690440a671fa6,
title = "Fast retrieval of similar subsequences in long sequence databases",
author = "Park, {Sang Hyun} and Dongwon Lee and Chu, {Wesley W.}",
year = "1999",
month = "1",
day = "1",
doi = "10.1109/KDEX.1999.836610",
language = "English",
series = "Proceedings - 1999 Workshop on Knowledge and Data Engineering Exchange, KDEX 1999",
publisher = "Institute of Electrical and Electronics Engineers Inc.",
pages = "60--67",
editor = "Peter Scheuermann",
booktitle = "Proceedings - 1999 Workshop on Knowledge and Data Engineering Exchange, KDEX 1999",
address = "United States",

}

Park, SH, Lee, D & Chu, WW 1999, Fast retrieval of similar subsequences in long sequence databases. in P Scheuermann (ed.), Proceedings - 1999 Workshop on Knowledge and Data Engineering Exchange, KDEX 1999., 836610, Proceedings - 1999 Workshop on Knowledge and Data Engineering Exchange, KDEX 1999, Institute of Electrical and Electronics Engineers Inc., pp. 60-67, 1999 Workshop on Knowledge and Data Engineering Exchange, KDEX 1999, Chicago, United States, 99/11/7. https://doi.org/10.1109/KDEX.1999.836610

Fast retrieval of similar subsequences in long sequence databases. / Park, Sang Hyun; Lee, Dongwon; Chu, Wesley W.

Proceedings - 1999 Workshop on Knowledge and Data Engineering Exchange, KDEX 1999. ed. / Peter Scheuermann. Institute of Electrical and Electronics Engineers Inc., 1999. p. 60-67 836610 (Proceedings - 1999 Workshop on Knowledge and Data Engineering Exchange, KDEX 1999).


TY - GEN
T1 - Fast retrieval of similar subsequences in long sequence databases
AU - Park, Sang Hyun
AU - Lee, Dongwon
AU - Chu, Wesley W.
PY - 1999/1/1
Y1 - 1999/1/1
UR - http://www.scopus.com/inward/record.url?scp=85038261782&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85038261782&partnerID=8YFLogxK
U2 - 10.1109/KDEX.1999.836610
DO - 10.1109/KDEX.1999.836610
M3 - Conference contribution
AN - SCOPUS:85038261782
T3 - Proceedings - 1999 Workshop on Knowledge and Data Engineering Exchange, KDEX 1999
SP - 60
EP - 67
BT - Proceedings - 1999 Workshop on Knowledge and Data Engineering Exchange, KDEX 1999
A2 - Scheuermann, Peter
PB - Institute of Electrical and Electronics Engineers Inc.
ER -

Park SH, Lee D, Chu WW. Fast retrieval of similar subsequences in long sequence databases. In Scheuermann P, editor, Proceedings - 1999 Workshop on Knowledge and Data Engineering Exchange, KDEX 1999. Institute of Electrical and Electronics Engineers Inc. 1999. p. 60-67. 836610. (Proceedings - 1999 Workshop on Knowledge and Data Engineering Exchange, KDEX 1999). https://doi.org/10.1109/KDEX.1999.836610