GRiD

Gathering rich data from PubMed using one-class SVM

Junbum Cha, Jeongwoo Kim, Sanghyun Park

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

2 Citations (Scopus)

Abstract

The Medical Subject Headings (MeSH) term search is a typical data-gathering method in biomedical text mining. However, it has two problems: a delay in MeSH term allocation and the omission of valuable literature sources. Because MeSH terms are allocated by humans, the allocation process is delayed. In addition, even after MeSH terms have been allocated, valuable literature sources are still missed during data gathering: some sources are not indexed under the MeSH term for a keyword even though they contain valuable information related to that term, so a MeSH term search misses them. To resolve these problems, we propose a novel method to gather rich data using a one-class support vector machine (SVM) and a relevance rule. Term frequency-inverse document frequency (TF-IDF) and paragraph vectors are examined as text vectorization methods with various parameters and relevance factors. We apply our method to lung cancer, prostate cancer, breast cancer, and Alzheimer's disease. As a result, up to 26% of keyword data and 35% of target data are gathered with high quality (a C-score of at least 0.948).
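As an illustration of the TF-IDF vectorization step named in the abstract, the following is a minimal sketch only. The abstract does not state which TF-IDF variant, tokenizer, or parameters the authors used, so the textbook weighting tf * log(N / df), the whitespace tokenizer, the function name `tfidf_vectors`, and the toy documents are all assumptions made for this example.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, vectors) using the textbook tf * log(N / df) weighting.

    Assumption: lowercase whitespace tokenization; the paper's actual
    preprocessing is not described in the abstract.
    """
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        total = len(toks)
        vectors.append([
            (counts[t] / total) * math.log(n_docs / df[t]) if total else 0.0
            for t in vocab
        ])
    return vocab, vectors


# Hypothetical toy "abstracts", not data from the paper.
docs = [
    "lung cancer tumor growth",
    "lung cancer screening trial",
    "protein folding simulation",
]
vocab, vectors = tfidf_vectors(docs)
```

Vectors of this kind could then serve as input to a one-class classifier trained only on documents already indexed under the MeSH term of interest; how the authors combine the classifier output with their relevance rule is not specified in the abstract.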

Original language: English
Title of host publication: 2016 IEEE International Conference on Systems, Man, and Cybernetics, SMC 2016 - Conference Proceedings
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 4325-4331
Number of pages: 7
ISBN (Electronic): 9781509018970
DOI: https://doi.org/10.1109/SMC.2016.7844911
Publication status: Published - 2017 Feb 6
Event: 2016 IEEE International Conference on Systems, Man, and Cybernetics, SMC 2016 - Budapest, Hungary
Duration: 2016 Oct 9 to 2016 Oct 12

Other

Other: 2016 IEEE International Conference on Systems, Man, and Cybernetics, SMC 2016
Country: Hungary
City: Budapest
Period: 16/10/9 to 16/10/12

Fingerprint

Support vector machines
Support Vector Machine
Term
Vectorization
Prostate Cancer
Alzheimer's Disease
Class
Lung Cancer
Text Mining
Breast Cancer
Resolve
Target

All Science Journal Classification (ASJC) codes

  • Computer Vision and Pattern Recognition
  • Artificial Intelligence
  • Control and Optimization
  • Human-Computer Interaction

Cite this

Cha, J., Kim, J., & Park, S. (2017). GRiD: Gathering rich data from PubMed using one-class SVM. In 2016 IEEE International Conference on Systems, Man, and Cybernetics, SMC 2016 - Conference Proceedings (pp. 4325-4331). [7844911] Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/SMC.2016.7844911
@inproceedings{ef6443e9868b44d1814e634a3c13bf58,
title = "GRiD: Gathering rich data from PubMed using one-class SVM",
abstract = "The Medical Subject Headings (MeSH) term search is a typical data-gathering method in biomedical text mining. However, it has two problems: a delay in MeSH term allocation and the omission of valuable literature sources. Because MeSH terms are allocated by humans, the allocation process is delayed. In addition, even after MeSH terms have been allocated, valuable literature sources are still missed during data gathering: some sources are not indexed under the MeSH term for a keyword even though they contain valuable information related to that term, so a MeSH term search misses them. To resolve these problems, we propose a novel method to gather rich data using a one-class support vector machine (SVM) and a relevance rule. Term frequency-inverse document frequency (TF-IDF) and paragraph vectors are examined as text vectorization methods with various parameters and relevance factors. We apply our method to lung cancer, prostate cancer, breast cancer, and Alzheimer's disease. As a result, up to 26{\%} of keyword data and 35{\%} of target data are gathered with high quality (a C-score of at least 0.948).",
author = "Junbum Cha and Jeongwoo Kim and Sanghyun Park",
year = "2017",
month = "2",
day = "6",
doi = "10.1109/SMC.2016.7844911",
language = "English",
pages = "4325--4331",
booktitle = "2016 IEEE International Conference on Systems, Man, and Cybernetics, SMC 2016 - Conference Proceedings",
publisher = "Institute of Electrical and Electronics Engineers Inc.",
address = "United States",

}
