TY - JOUR
T1 - An efficient DNA sequence searching method using position specific weighting scheme
AU - Kim, Woo Cheol
AU - Park, Sanghyun
AU - Won, Jung Im
AU - Kim, Sang Wook
AU - Yoon, Jee Hee
PY - 2006/4
Y1 - 2006/4
N2 - Exact match queries, wildcard match queries, and k-mismatch queries are widely used in various molecular biology applications including the searching of ESTs (Expressed Sequence Tags) and DNA transcription factors. In this paper, we suggest an efficient indexing and processing mechanism for such queries. Our indexing method places a sliding window at every possible location of a DNA sequence and extracts its signature by considering the occurrence frequency of each nucleotide. It then stores a set of signatures using a multi-dimensional index such as the R*-tree. Also, by assigning a weight to each position of a window, it prevents signatures from being concentrated around a few spots in the indexing space. Our query processing method converts a query sequence into a multi-dimensional rectangle and searches the index for the signatures overlapping with the rectangle. Experiments with real biological data sets have revealed that the proposed approach is at least 4.4 times, 2.1 times, and several orders of magnitude faster than the previous one in performing exact match, wildcard match, and k-mismatch queries, respectively.
AB - Exact match queries, wildcard match queries, and k-mismatch queries are widely used in various molecular biology applications including the searching of ESTs (Expressed Sequence Tags) and DNA transcription factors. In this paper, we suggest an efficient indexing and processing mechanism for such queries. Our indexing method places a sliding window at every possible location of a DNA sequence and extracts its signature by considering the occurrence frequency of each nucleotide. It then stores a set of signatures using a multi-dimensional index such as the R*-tree. Also, by assigning a weight to each position of a window, it prevents signatures from being concentrated around a few spots in the indexing space. Our query processing method converts a query sequence into a multi-dimensional rectangle and searches the index for the signatures overlapping with the rectangle. Experiments with real biological data sets have revealed that the proposed approach is at least 4.4 times, 2.1 times, and several orders of magnitude faster than the previous one in performing exact match, wildcard match, and k-mismatch queries, respectively.
UR - http://www.scopus.com/inward/record.url?scp=33644906389&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=33644906389&partnerID=8YFLogxK
U2 - 10.1177/0165551506062329
DO - 10.1177/0165551506062329
M3 - Article
AN - SCOPUS:33644906389
VL - 32
SP - 176
EP - 190
JO - Journal of Information Science
JF - Journal of Information Science
SN - 0165-5515
IS - 2
ER -