An automatic unsupervised querying algorithm for efficient information extraction in biomedical domain

Min Song, Il Yeol Song, Xiaohua Hu, Robert B. Allen

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

In the domain of bioinformatics, extracting a relation such as protein-protein interactions from a large database of text documents is a challenging task. One major issue in biomedical information extraction is how to efficiently digest the sheer size of the unstructured biomedical data corpus. Often, within such a large corpus, only a small fraction of the documents contain information that is relevant to the extraction task. We propose a novel query expansion algorithm that automatically discovers the characteristics of documents that are useful for extracting a target relation. Our technique introduces a hybrid query re-weighting algorithm that combines a modified Robertson Sparck-Jones query ranking algorithm with a keyphrase extraction algorithm. It also adopts a novel query translation technique that incorporates part-of-speech (POS) categories into query translation. We conduct a series of experiments and report the results. They show that our technique retrieves more documents containing protein-protein pairs from MEDLINE as the number of iterations increases. We also compare our technique with SLIPPER, a supervised rule-based query expansion technique. The results show that our technique outperforms SLIPPER by 17.90% to 29.98% over four iterations.
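The abstract does not spell out the modified Robertson Sparck-Jones ranking. As a rough illustration only, the sketch below uses the classic (unmodified) Robertson/Sparck Jones relevance weight with 0.5 smoothing to rank candidate expansion terms. The function names, the toy corpus, and the representation of documents as sets of tokens are assumptions made for this example; the keyphrase extraction and POS-based query translation steps are not modeled.

```python
import math


def rsj_weight(r, R, n, N):
    """Classic Robertson/Sparck Jones relevance weight with 0.5 smoothing.

    r: relevant documents containing the term
    R: relevant documents in total
    n: documents containing the term
    N: documents in the collection
    """
    relevant_odds = (r + 0.5) / (R - r + 0.5)
    non_relevant_odds = (n - r + 0.5) / (N - n - R + r + 0.5)
    return math.log(relevant_odds / non_relevant_odds)


def rank_expansion_terms(relevant_docs, all_docs, top_k=5):
    """Rank terms seen in the relevant documents by their RSJ weight.

    Each document is a set of tokens (an assumption of this sketch), and
    relevant_docs is expected to be a subset of all_docs.
    """
    N = len(all_docs)
    R = len(relevant_docs)
    candidates = set().union(*relevant_docs) if relevant_docs else set()
    scored = []
    for term in candidates:
        n = sum(1 for doc in all_docs if term in doc)
        r = sum(1 for doc in relevant_docs if term in doc)
        scored.append((term, rsj_weight(r, R, n, N)))
    # Highest weight first; ties are broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:top_k]


# Hypothetical toy corpus: the first two documents are judged relevant.
toy_docs = [
    {"protein", "binds", "kinase"},
    {"protein", "interacts", "receptor"},
    {"weather", "rain"},
    {"stock", "market", "rain"},
]
ranked_terms = rank_expansion_terms(toy_docs[:2], toy_docs, top_k=3)
```

In an iterative setting, the top-ranked terms would be added to the query and the relevance judgments refreshed before the next round. How the paper re-weights and translates those terms is not described in this record.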

Original language: English
Title of host publication: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Pages: 173-179
Number of pages: 7
Publication status: Published - 2005 Dec 1
Event: 9th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining, PAKDD 2005 - Hanoi, Viet Nam
Duration: 2005 May 18 to 2005 May 20

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 3518 LNAI
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Other

Other: 9th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining, PAKDD 2005
Country: Viet Nam
City: Hanoi
Period: 05/5/18 to 05/5/20

Fingerprint

Information Extraction
Proteins
Query
Protein
Query Expansion
Bioinformatics
Iteration
Weighting
Ranking
Experiments
Target
Series
Experimental Results
Experiment

All Science Journal Classification (ASJC) codes

  • Theoretical Computer Science
  • Computer Science (all)

Cite this

Song, M., Song, I. Y., Hu, X., & Allen, R. B. (2005). An automatic unsupervised querying algorithm for efficient information extraction in biomedical domain. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (pp. 173-179). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 3518 LNAI).
@inproceedings{3c8e8f36193b4c2b8bddbb0803b036ec,
title = "An automatic unsupervised querying algorithm for efficient information extraction in biomedical domain",
abstract = "In the domain of bioinformatics, extracting a relation such as protein-protein interactions from a large database of text documents is a challenging task. One major issue in biomedical information extraction is how to efficiently digest the sheer size of the unstructured biomedical data corpus. Often, within such a large corpus, only a small fraction of the documents contain information that is relevant to the extraction task. We propose a novel query expansion algorithm that automatically discovers the characteristics of documents that are useful for extracting a target relation. Our technique introduces a hybrid query re-weighting algorithm that combines a modified Robertson Sparck-Jones query ranking algorithm with a keyphrase extraction algorithm. It also adopts a novel query translation technique that incorporates part-of-speech (POS) categories into query translation. We conduct a series of experiments and report the results. They show that our technique retrieves more documents containing protein-protein pairs from MEDLINE as the number of iterations increases. We also compare our technique with SLIPPER, a supervised rule-based query expansion technique. The results show that our technique outperforms SLIPPER by 17.90{\%} to 29.98{\%} over four iterations.",
author = "Min Song and Song, {Il Yeol} and Xiaohua Hu and Allen, {Robert B.}",
year = "2005",
month = "12",
day = "1",
language = "English",
isbn = "3540260765",
series = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",
pages = "173--179",
booktitle = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",

}


TY - GEN

T1 - An automatic unsupervised querying algorithm for efficient information extraction in biomedical domain

AU - Song, Min

AU - Song, Il Yeol

AU - Hu, Xiaohua

AU - Allen, Robert B.

PY - 2005/12/1

Y1 - 2005/12/1

N2 - In the domain of bioinformatics, extracting a relation such as protein-protein interactions from a large database of text documents is a challenging task. One major issue in biomedical information extraction is how to efficiently digest the sheer size of the unstructured biomedical data corpus. Often, within such a large corpus, only a small fraction of the documents contain information that is relevant to the extraction task. We propose a novel query expansion algorithm that automatically discovers the characteristics of documents that are useful for extracting a target relation. Our technique introduces a hybrid query re-weighting algorithm that combines a modified Robertson Sparck-Jones query ranking algorithm with a keyphrase extraction algorithm. It also adopts a novel query translation technique that incorporates part-of-speech (POS) categories into query translation. We conduct a series of experiments and report the results. They show that our technique retrieves more documents containing protein-protein pairs from MEDLINE as the number of iterations increases. We also compare our technique with SLIPPER, a supervised rule-based query expansion technique. The results show that our technique outperforms SLIPPER by 17.90% to 29.98% over four iterations.

UR - http://www.scopus.com/inward/record.url?scp=26944468655&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=26944468655&partnerID=8YFLogxK

M3 - Conference contribution

AN - SCOPUS:26944468655

SN - 3540260765

SN - 9783540260769

T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

SP - 173

EP - 179

BT - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

ER -
