Establishing a baseline for literature mining human genetic variants and their relationships to disease cohorts

Karin M. Verspoor, Go Eun Heo, Keun Young Kang, Min Song

Research output: Contribution to journalArticle

6 Citations (Scopus)

Abstract

Background: The Variome corpus, a small collection of published articles about inherited colorectal cancer, includes annotations of 11 entity types and 13 relation types related to the curation of the relationship between genetic variation and disease. Due to the richness of these annotations, the corpus provides a good testbed for evaluation of biomedical literature information extraction systems. Methods: In this paper, we focus on assessing performance on extracting the relations in the corpus, using gold standard entities as a starting point, to establish a baseline for extraction of relations important for extraction of genetic variant information from the literature. We test the application of the Public Knowledge Discovery Engine for Java (PKDE4J) system, a natural language processing system designed for information extraction of entities and relations in text, on the relation extraction task using this corpus. Results: For the relations which are attested at least 100 times in the Variome corpus, we realise a performance ranging from 0.78-0.84 Precision-weighted F-score, depending on the relation. We find that the PKDE4J system adapted straightforwardly to the range of relation types represented in the corpus; some extensions to the original methodology were required to adapt to the multi-relational classification context. The results are competitive with state-of-the-art relation extraction performance on more heavily studied corpora, although the analysis shows that the Recall of a co-occurrence baseline outweighs the benefit of improved Precision for many relations, indicating the value of simple semantic constraints on relations. Conclusions: This work represents the first attempt to apply relation extraction methods to the Variome corpus. The results demonstrate that automated methods have good potential to structure the information expressed in the published literature related to genetic variants, connecting mutations to genes, diseases, and patient cohorts. Further development of such approaches will facilitate more efficient biocuration of genetic variant information into structured databases, leveraging the knowledge embedded in the vast publication literature.

Original languageEnglish
Article number68
JournalBMC Medical Informatics and Decision Making
Volume16
DOIs
Publication statusPublished - 2016 Jul 18

Fingerprint

Medical Genetics
Information Storage and Retrieval
Natural Language Processing
Inborn Genetic Diseases
Semantics
Information Systems
Publications
Colorectal Neoplasms
Databases
Mutation
Genes

All Science Journal Classification (ASJC) codes

  • Health Policy
  • Health Informatics

Cite this

@article{b3fe62bee4e94f8b80c7c0c843701d3c,
title = "Establishing a baseline for literature mining human genetic variants and their relationships to disease cohorts",
abstract = "Background: The Variome corpus, a small collection of published articles about inherited colorectal cancer, includes annotations of 11 entity types and 13 relation types related to the curation of the relationship between genetic variation and disease. Due to the richness of these annotations, the corpus provides a good testbed for evaluation of biomedical literature information extraction systems. Methods: In this paper, we focus on assessing performance on extracting the relations in the corpus, using gold standard entities as a starting point, to establish a baseline for extraction of relations important for extraction of genetic variant information from the literature. We test the application of the Public Knowledge Discovery Engine for Java (PKDE4J) system, a natural language processing system designed for information extraction of entities and relations in text, on the relation extraction task using this corpus. Results: For the relations which are attested at least 100 times in the Variome corpus, we realise a performance ranging from 0.78-0.84 Precision-weighted F-score, depending on the relation. We find that the PKDE4J system adapted straightforwardly to the range of relation types represented in the corpus; some extensions to the original methodology were required to adapt to the multi-relational classification context. The results are competitive with state-of-the-art relation extraction performance on more heavily studied corpora, although the analysis shows that the Recall of a co-occurrence baseline outweighs the benefit of improved Precision for many relations, indicating the value of simple semantic constraints on relations. Conclusions: This work represents the first attempt to apply relation extraction methods to the Variome corpus. The results demonstrate that automated methods have good potential to structure the information expressed in the published literature related to genetic variants, connecting mutations to genes, diseases, and patient cohorts. Further development of such approaches will facilitate more efficient biocuration of genetic variant information into structured databases, leveraging the knowledge embedded in the vast publication literature.",
author = "Verspoor, {Karin M.} and Heo, {Go Eun} and Kang, {Keun Young} and Min Song",
year = "2016",
month = "7",
day = "18",
doi = "10.1186/s12911-016-0294-3",
language = "English",
volume = "16",
journal = "BMC Medical Informatics and Decision Making",
issn = "1472-6947",
publisher = "BioMed Central",

}

Establishing a baseline for literature mining human genetic variants and their relationships to disease cohorts. / Verspoor, Karin M.; Heo, Go Eun; Kang, Keun Young; Song, Min.

In: BMC Medical Informatics and Decision Making, Vol. 16, 68, 18.07.2016.

Research output: Contribution to journalArticle

TY - JOUR

T1 - Establishing a baseline for literature mining human genetic variants and their relationships to disease cohorts

AU - Verspoor, Karin M.

AU - Heo, Go Eun

AU - Kang, Keun Young

AU - Song, Min

PY - 2016/7/18

Y1 - 2016/7/18

N2 - Background: The Variome corpus, a small collection of published articles about inherited colorectal cancer, includes annotations of 11 entity types and 13 relation types related to the curation of the relationship between genetic variation and disease. Due to the richness of these annotations, the corpus provides a good testbed for evaluation of biomedical literature information extraction systems. Methods: In this paper, we focus on assessing performance on extracting the relations in the corpus, using gold standard entities as a starting point, to establish a baseline for extraction of relations important for extraction of genetic variant information from the literature. We test the application of the Public Knowledge Discovery Engine for Java (PKDE4J) system, a natural language processing system designed for information extraction of entities and relations in text, on the relation extraction task using this corpus. Results: For the relations which are attested at least 100 times in the Variome corpus, we realise a performance ranging from 0.78-0.84 Precision-weighted F-score, depending on the relation. We find that the PKDE4J system adapted straightforwardly to the range of relation types represented in the corpus; some extensions to the original methodology were required to adapt to the multi-relational classification context. The results are competitive with state-of-the-art relation extraction performance on more heavily studied corpora, although the analysis shows that the Recall of a co-occurrence baseline outweighs the benefit of improved Precision for many relations, indicating the value of simple semantic constraints on relations. Conclusions: This work represents the first attempt to apply relation extraction methods to the Variome corpus. The results demonstrate that automated methods have good potential to structure the information expressed in the published literature related to genetic variants, connecting mutations to genes, diseases, and patient cohorts. Further development of such approaches will facilitate more efficient biocuration of genetic variant information into structured databases, leveraging the knowledge embedded in the vast publication literature.

AB - Background: The Variome corpus, a small collection of published articles about inherited colorectal cancer, includes annotations of 11 entity types and 13 relation types related to the curation of the relationship between genetic variation and disease. Due to the richness of these annotations, the corpus provides a good testbed for evaluation of biomedical literature information extraction systems. Methods: In this paper, we focus on assessing performance on extracting the relations in the corpus, using gold standard entities as a starting point, to establish a baseline for extraction of relations important for extraction of genetic variant information from the literature. We test the application of the Public Knowledge Discovery Engine for Java (PKDE4J) system, a natural language processing system designed for information extraction of entities and relations in text, on the relation extraction task using this corpus. Results: For the relations which are attested at least 100 times in the Variome corpus, we realise a performance ranging from 0.78-0.84 Precision-weighted F-score, depending on the relation. We find that the PKDE4J system adapted straightforwardly to the range of relation types represented in the corpus; some extensions to the original methodology were required to adapt to the multi-relational classification context. The results are competitive with state-of-the-art relation extraction performance on more heavily studied corpora, although the analysis shows that the Recall of a co-occurrence baseline outweighs the benefit of improved Precision for many relations, indicating the value of simple semantic constraints on relations. Conclusions: This work represents the first attempt to apply relation extraction methods to the Variome corpus. The results demonstrate that automated methods have good potential to structure the information expressed in the published literature related to genetic variants, connecting mutations to genes, diseases, and patient cohorts. Further development of such approaches will facilitate more efficient biocuration of genetic variant information into structured databases, leveraging the knowledge embedded in the vast publication literature.

UR - http://www.scopus.com/inward/record.url?scp=84978376839&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84978376839&partnerID=8YFLogxK

U2 - 10.1186/s12911-016-0294-3

DO - 10.1186/s12911-016-0294-3

M3 - Article

C2 - 27454860

AN - SCOPUS:84978376839

VL - 16

JO - BMC Medical Informatics and Decision Making

JF - BMC Medical Informatics and Decision Making

SN - 1472-6947

M1 - 68

ER -