Integrating text chunking with mixture Hidden Markov Models for effective biomedical information extraction

Min Song, Il Yeol Song, Xiaohua Hu, Robert B. Allen

Research output: Contribution to journal › Conference article

5 Citations (Scopus)

Abstract

This paper presents a new information extraction (IE) technique, KXtractor, which integrates a text chunking technique with Mixture Hidden Markov Models (MiHMM). KXtractor differs from other approaches in two ways. (a) It overcomes the limitation of single Part-Of-Speech (POS) HMMs by modeling a rich representation of text in which features overlap among state units such as word, line, sentence, and paragraph. By incorporating sentence structures into the learned models, KXtractor provides better extraction accuracy than single POS HMMs. (b) It resolves a limitation of traditional HMMs for IE, which operate only on semi-structured data such as HTML documents and other text sources in which language grammar does not play a pivotal role. We compared KXtractor with three IE techniques: 1) RAPIER, an inductive learning-based machine learning system, 2) a dictionary-based extraction system, and 3) a single POS HMM. Our experiments showed that KXtractor outperforms these three IE systems in extracting protein-protein interactions. The F-measure for KXtractor was higher than those for RAPIER, the dictionary-based system, and the single POS HMM by 16.89%, 16.28%, and 8.58%, respectively. In addition, both the precision and the recall of KXtractor were higher than those of the other three systems.
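The abstract compares systems by precision, recall, and F-measure. As a reminder of how the three quantities relate, here is a minimal sketch. It assumes the balanced F1 form by default (the abstract does not say which weighting was used), and the function name `precision_recall_f` and the counts are hypothetical illustrations, not code or results from the paper.

```python
def precision_recall_f(true_positives, false_positives, false_negatives, beta=1.0):
    """Return (precision, recall, F_beta) computed from extraction counts.

    beta=1.0 gives the balanced F1 score; this default is an assumption,
    since the abstract does not state the weighting used.
    """
    predicted = true_positives + false_positives  # items the system extracted
    actual = true_positives + false_negatives     # items in the gold standard
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    if precision == 0.0 and recall == 0.0:
        return precision, recall, 0.0
    b2 = beta * beta
    f_beta = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f_beta


# Hypothetical counts for illustration only (not taken from the paper).
p, r, f = precision_recall_f(true_positives=80, false_positives=20, false_negatives=20)
print(f"precision={p:.2f} recall={r:.2f} F1={f:.2f}")
```

Because the F-measure is a harmonic mean, it is pulled toward the lower of precision and recall, so a system cannot score well on it by maximizing only one of the two.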

Original language: English
Pages (from-to): 976-984
Number of pages: 9
Journal: Lecture Notes in Computer Science
Volume: 3515
Issue number: II
Publication status: Published - 2005 Sep 30
Event: 5th International Conference on Computational Science - ICCS 2005 - Atlanta, GA, United States
Duration: 2005 May 22 → 2005 May 25

Fingerprint

Information Extraction
Hidden Markov models
Markov Model
Glossaries
Inductive Learning
Semistructured Data
HTML
Learning systems
Protein-protein Interaction
Learning Systems
Grammar
Proteins
Experiment
Overlap
Resolve
Machine Learning
Integrate
Unit
Speech
Text

All Science Journal Classification (ASJC) codes

  • Theoretical Computer Science
  • Computer Science(all)

Cite this

@article{10ab35f4a7814fe19e5b93b416e1507e,
title = "Integrating text chunking with mixture Hidden Markov Models for effective biomedical information extraction",
author = "Min Song and Song, {Il Yeol} and Xiaohua Hu and Allen, {Robert B.}",
year = "2005",
month = "9",
day = "30",
language = "English",
volume = "3515",
pages = "976--984",
journal = "Lecture Notes in Computer Science",
issn = "0302-9743",
publisher = "Springer Verlag",
number = "II",

}

Integrating text chunking with mixture Hidden Markov Models for effective biomedical information extraction. / Song, Min; Song, Il Yeol; Hu, Xiaohua; Allen, Robert B.

In: Lecture Notes in Computer Science, Vol. 3515, No. II, 30.09.2005, p. 976-984.

TY  - JOUR
T1  - Integrating text chunking with mixture Hidden Markov Models for effective biomedical information extraction
AU  - Song, Min
AU  - Song, Il Yeol
AU  - Hu, Xiaohua
AU  - Allen, Robert B.
PY  - 2005/9/30
Y1  - 2005/9/30
UR  - http://www.scopus.com/inward/record.url?scp=25144480898&partnerID=8YFLogxK
UR  - http://www.scopus.com/inward/citedby.url?scp=25144480898&partnerID=8YFLogxK
M3  - Conference article
VL  - 3515
SP  - 976
EP  - 984
JO  - Lecture Notes in Computer Science
JF  - Lecture Notes in Computer Science
SN  - 0302-9743
IS  - II
ER  -