LGscore: A method to identify disease-related genes using biological literature and Google data

Jeongwoo Kim, Hyunjin Kim, Youngmi Yoon, Sang Hyun Park

Research output: Contribution to journal › Article

15 Citations (Scopus)

Abstract

Since the genome project of the 1990s, numerous gene-related studies have been conducted, and researchers have confirmed that genes are involved in disease. Identifying the relationships between diseases and genes is therefore important in biology. We propose a method called LGscore, which identifies disease-related genes using Google data and literature data. To implement this method, we first construct a disease-related gene network from text-mining results. We then extract gene-gene interactions based on co-occurrences in abstracts obtained from PubMed and calculate the weights of the edges in the gene network by means of Z-scoring. Each weight combines two values: a frequency value, extracted from the literature data, and a Google search result value, obtained using Google. We then assign a score to each gene through network analysis, assuming that genes with many links, numerous Google search results, and high frequency values are more likely to be involved in disease. For validation, we checked the top 20 inferred genes for five different diseases against answer sets drawn from six databases that contain information on disease-gene relationships. We identified a significant number of disease-related genes, as well as candidate genes, for Alzheimer's disease, diabetes, colon cancer, lung cancer, and prostate cancer. Our method was up to 40% more accurate than existing methods.
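The pipeline described in the abstract (co-occurrence extraction, Z-scored edge weights, network-based gene scoring) can be illustrated with a minimal Python sketch. The abstract does not give the exact formulas, so several choices here are assumptions rather than the authors' method: the two Z-scores are simply summed into one edge weight, and a gene's score is the sum of the weights of its edges. The function names, the toy abstracts, and the hit counts are all hypothetical.

```python
from collections import Counter
from itertools import combinations
from statistics import mean, pstdev


def cooccurrence_counts(abstract_genes):
    """Count how often each unordered gene pair appears in the same abstract."""
    counts = Counter()
    for genes in abstract_genes:
        # Sort so that a pair always has the same key regardless of order.
        for pair in combinations(sorted(set(genes)), 2):
            counts[pair] += 1
    return counts


def z_scores(values):
    """Standardize a dict of raw values; return zeros if there is no spread."""
    mu = mean(values.values())
    sigma = pstdev(values.values())
    if sigma == 0:
        return {key: 0.0 for key in values}
    return {key: (value - mu) / sigma for key, value in values.items()}


def rank_genes(abstract_genes, hit_counts):
    """Rank genes by the summed weight of their edges (an assumed scoring rule).

    hit_counts maps a sorted gene pair to a search-result count; how those
    counts are obtained is outside this sketch.
    """
    freq = cooccurrence_counts(abstract_genes)
    hits = {pair: hit_counts.get(pair, 0) for pair in freq}
    z_freq, z_hits = z_scores(freq), z_scores(hits)

    scores = Counter()
    for gene_a, gene_b in freq:
        # Assumption: the edge weight is the sum of the two Z-scores.
        weight = z_freq[(gene_a, gene_b)] + z_hits[(gene_a, gene_b)]
        scores[gene_a] += weight
        scores[gene_b] += weight
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical toy input: gene mentions per abstract and made-up hit counts.
abstracts = [["APP", "PSEN1"], ["APP", "PSEN1", "APOE"], ["APP", "PSEN1"]]
hit_counts = {("APP", "PSEN1"): 900, ("APOE", "APP"): 500, ("APOE", "PSEN1"): 100}
ranking = rank_genes(abstracts, hit_counts)
```

Because Z-scores are centered on zero, edges with below-average frequency or hit counts contribute negative weight in this sketch, so a well-connected gene is rewarded only when its links are also comparatively strong.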

Original language: English
Pages (from-to): 270-282
Number of pages: 13
Journal: Journal of Biomedical Informatics
Volume: 54
DOI: 10.1016/j.jbi.2015.01.003
Publication status: Published - 2015 Apr 1

Fingerprint

Genes
Gene Regulatory Networks
Lung Neoplasms
Prostatic Neoplasms
Weights and Measures
Data Mining
PubMed
Colonic Neoplasms
Medical problems
Alzheimer Disease
Electric network analysis
Research Personnel
Genome
Databases

All Science Journal Classification (ASJC) codes

  • Computer Science Applications
  • Health Informatics

Cite this

@article{5ad71c9076ab42b5a38ff3ca972cf38c,
title = "LGscore: A method to identify disease-related genes using biological literature and Google data",
author = "Jeongwoo Kim and Hyunjin Kim and Youngmi Yoon and Park, {Sang Hyun}",
year = "2015",
month = "4",
day = "1",
doi = "10.1016/j.jbi.2015.01.003",
language = "English",
volume = "54",
pages = "270--282",
journal = "Journal of Biomedical Informatics",
issn = "1532-0464",
publisher = "Academic Press Inc.",

}

LGscore: A method to identify disease-related genes using biological literature and Google data. / Kim, Jeongwoo; Kim, Hyunjin; Yoon, Youngmi; Park, Sang Hyun.

In: Journal of Biomedical Informatics, Vol. 54, 01.04.2015, p. 270-282.

TY - JOUR

T1 - LGscore

T2 - A method to identify disease-related genes using biological literature and Google data

AU - Kim, Jeongwoo

AU - Kim, Hyunjin

AU - Yoon, Youngmi

AU - Park, Sang Hyun

PY - 2015/4/1

Y1 - 2015/4/1

UR - http://www.scopus.com/inward/record.url?scp=84927978866&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84927978866&partnerID=8YFLogxK

U2 - 10.1016/j.jbi.2015.01.003

DO - 10.1016/j.jbi.2015.01.003

M3 - Article

C2 - 25617670

AN - SCOPUS:84927978866

VL - 54

SP - 270

EP - 282

JO - Journal of Biomedical Informatics

JF - Journal of Biomedical Informatics

SN - 1532-0464

ER -