An efficient approach for sequence matching in large DNA databases

Jung Im Won, Sanghyun Park, Jee Hee Yoon, Sang Wook Kim

Research output: Contribution to journal › Article

5 Citations (Scopus)

Abstract

In molecular biology, DNA sequence matching is one of the most crucial operations. Since DNA databases contain a huge volume of sequences, fast indexes are essential for processing DNA sequence matching efficiently. In this paper, we first point out the problems of the suffix tree, an index structure widely used for DNA sequence matching, with respect to storage overhead, search performance, and difficulty of seamless integration with a DBMS. We then propose a new index structure that resolves these problems. The proposed index structure consists of two parts: the primary part realizes the trie as a binary bit-string representation without any pointers, and the secondary part provides fast access to the trie's leaf nodes that need to be accessed for post-processing. We also suggest efficient algorithms based on that index for DNA sequence matching. To verify the superiority of the proposed approach, we conduct a performance evaluation via a series of experiments. The results reveal that the proposed approach, which requires less storage space, can be a few orders of magnitude faster than the suffix tree.
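As a rough illustration of what a pointer-free, bit-string trie can look like, the following Python sketch stores one 4-bit child mask per node, level by level, and finds a child by counting the set bits that precede it (a rank operation). This is not the paper's index: the function names `build_bit_trie` and `contains`, the fixed window length `depth`, the level-order layout, and the restriction to the alphabet A, C, G, T are all assumptions made for the example, and the secondary leaf-access structure is not modeled.

```python
ALPHABET = "ACGT"  # assumed alphabet; its order matches lexicographic sort order


def build_bit_trie(text, depth):
    """Return one bit list per trie level for all length-`depth` windows of `text`.

    levels[d] holds 4 bits per node at depth d (one bit per symbol in ALPHABET),
    with nodes ordered by their prefix. No child pointers are stored.
    """
    windows = {text[i:i + depth] for i in range(len(text) - depth + 1)}
    levels = []
    for d in range(depth):
        prefixes = sorted({w[:d] for w in windows})   # nodes at depth d
        children = {w[:d + 1] for w in windows}       # nodes at depth d + 1
        bits = []
        for p in prefixes:
            for c in ALPHABET:
                bits.append(1 if p + c in children else 0)
        levels.append(bits)
    return levels


def contains(levels, pattern):
    """Check whether `pattern` (length <= trie depth) is a prefix of an indexed window."""
    if len(pattern) > len(levels):
        raise ValueError("pattern is longer than the indexed window length")
    node = 0  # index of the current node within its level
    for d, ch in enumerate(pattern):
        pos = 4 * node + ALPHABET.index(ch)
        if not levels[d][pos]:
            return False
        # Rank: the number of set bits before `pos` is the child's index
        # in the next level, which replaces an explicit pointer.
        node = sum(levels[d][:pos])
    return True
```

For example, after `levels = build_bit_trie("ACGTACGA", 3)`, the call `contains(levels, "CGT")` follows three bit tests and two rank computations. The linear-time `sum` is used here only for brevity; a practical structure would precompute rank counts.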

Original language: English
Pages (from-to): 88-104
Number of pages: 17
Journal: Journal of Information Science
Volume: 32
Issue number: 1
DOI: 10.1177/0165551506059229
Publication status: Published - 2006 Feb 1

Fingerprint

DNA sequences
DNA
Molecular biology
Processing
performance
biology
experiment
evaluation
Experiments

All Science Journal Classification (ASJC) codes

  • Information Systems
  • Library and Information Sciences

Cite this

Won, Jung Im ; Park, Sanghyun ; Yoon, Jee Hee ; Kim, Sang Wook. / An efficient approach for sequence matching in large DNA databases. In: Journal of Information Science. 2006 ; Vol. 32, No. 1. pp. 88-104.
@article{4926d6a0188a4d4bb8c19623d2e30424,
title = "An efficient approach for sequence matching in large DNA databases",
author = "Won, {Jung Im} and Sanghyun Park and Yoon, {Jee Hee} and Kim, {Sang Wook}",
year = "2006",
month = "2",
day = "1",
doi = "10.1177/0165551506059229",
language = "English",
volume = "32",
pages = "88--104",
journal = "Journal of Information Science",
issn = "0165-5515",
publisher = "SAGE Publications Ltd",
number = "1",
}

TY - JOUR

T1 - An efficient approach for sequence matching in large DNA databases

AU - Won, Jung Im

AU - Park, Sanghyun

AU - Yoon, Jee Hee

AU - Kim, Sang Wook

PY - 2006/2/1

Y1 - 2006/2/1

UR - http://www.scopus.com/inward/record.url?scp=32944464978&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=32944464978&partnerID=8YFLogxK

U2 - 10.1177/0165551506059229

DO - 10.1177/0165551506059229

M3 - Article

AN - SCOPUS:32944464978

VL - 32

SP - 88

EP - 104

JO - Journal of Information Science

JF - Journal of Information Science

SN - 0165-5515

IS - 1

ER -