Fast max-margin clustering for unsupervised word sense disambiguation in biomedical texts

Weisi Duan, Min Song, Alexander Yates

Research output: Contribution to journal › Article

14 Citations (Scopus)

Abstract

Background: We aim to solve the problem of determining word senses for ambiguous biomedical terms with minimal human effort.

Methods: We build a fully automated system for Word Sense Disambiguation by designing a system that does not require manually-constructed external resources or manually-labeled training examples except for a single ambiguous word. The system uses a novel and efficient graph-based algorithm to cluster words into groups that have the same meaning. Our algorithm follows the principle of finding a maximum margin between clusters, determining a split of the data that maximizes the minimum distance between pairs of data points belonging to two different clusters.

Results: On a test set of 21 ambiguous keywords from PubMed abstracts, our system has an average accuracy of 78%, outperforming a state-of-the-art unsupervised system by 2% and a baseline technique by 23%. On a standard data set from the National Library of Medicine, our system outperforms the baseline by 6% and comes within 5% of the accuracy of a supervised system.

Conclusion: Our system is a novel, state-of-the-art technique for efficiently finding word sense clusters, and does not require training data or human effort for each new word to be disambiguated.
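The max-margin criterion named in the Methods (a two-way split that maximizes the minimum distance between points in different clusters) can be illustrated with a small sketch. This is not the authors' graph-based algorithm, which the abstract does not specify. It relies on a standard result instead: merging the closest pairs Kruskal-style until two components remain (equivalently, cutting the heaviest edge of the minimum spanning tree) yields such a split. The function name `max_margin_split`, the use of Euclidean distance, and the toy 2-D points are all assumptions made for illustration; in a word sense setting the points would be some vector representation of each occurrence's context.

```python
from itertools import combinations
from math import dist  # Euclidean distance, Python 3.8+


def max_margin_split(points):
    """Split points into two clusters so that the smallest distance between
    any cross-cluster pair is as large as possible.

    Hypothetical helper for illustration only: merges closest pairs with
    union-find (Kruskal's procedure) and stops when two components remain.
    Returns (labels, margin), where margin is the minimum cross-cluster
    distance of the resulting split.
    """
    n = len(points)
    if n < 2:
        raise ValueError("need at least two points")

    parent = list(range(n))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # All pairwise distances, shortest first.
    edges = sorted(
        (dist(points[i], points[j]), i, j)
        for i, j in combinations(range(n), 2)
    )

    components = n
    margin = None
    for d, i, j in edges:
        ri, rj = find(i), find(j)
        if ri == rj:
            continue  # both points are already in the same cluster
        if components == 2:
            # Shortest edge that still crosses the two clusters.
            margin = d
            break
        parent[ri] = rj
        components -= 1

    first_root = find(0)
    labels = [0 if find(i) == first_root else 1 for i in range(n)]
    return labels, margin


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11)]
labels, margin = max_margin_split(points)
print(labels, round(margin, 3))
```

Sorting all pairs costs O(n² log n), so this brute-force version only conveys the objective; it makes no claim about how the paper achieves its efficiency.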

Original language: English
Article number: S4
Journal: BMC Bioinformatics
Volume: 10
Issue number: SUPPL. 3
DOIs: 10.1186/1471-2105-10-S3-S4
Publication status: Published - 2009 Mar 19

Fingerprint

Word Sense Disambiguation
Margin
Cluster Analysis
Clustering
National Library of Medicine (U.S.)
PubMed
Medicine
Ambiguous
Baseline
Text
Test Set
Minimum Distance
Maximise
Resources
Datasets
Term

All Science Journal Classification (ASJC) codes

  • Structural Biology
  • Biochemistry
  • Molecular Biology
  • Computer Science Applications
  • Applied Mathematics

Cite this

@article{588f53e943d24e5bb43976fc2c4ed5c2,
title = "Fast max-margin clustering for unsupervised word sense disambiguation in biomedical texts",
abstract = "Background: We aim to solve the problem of determining word senses for ambiguous biomedical terms with minimal human effort. Methods: We build a fully automated system for Word Sense Disambiguation by designing a system that does not require manually-constructed external resources or manually-labeled training examples except for a single ambiguous word. The system uses a novel and efficient graph-based algorithm to cluster words into groups that have the same meaning. Our algorithm follows the principle of finding a maximum margin between clusters, determining a split of the data that maximizes the minimum distance between pairs of data points belonging to two different clusters. Results: On a test set of 21 ambiguous keywords from PubMed abstracts, our system has an average accuracy of 78{\%}, outperforming a state-of-the-art unsupervised system by 2{\%} and a baseline technique by 23{\%}. On a standard data set from the National Library of Medicine, our system outperforms the baseline by 6{\%} and comes within 5{\%} of the accuracy of a supervised system. Conclusion: Our system is a novel, state-of-the-art technique for efficiently finding word sense clusters, and does not require training data or human effort for each new word to be disambiguated.",
author = "Weisi Duan and Min Song and Alexander Yates",
year = "2009",
month = "3",
day = "19",
doi = "10.1186/1471-2105-10-S3-S4",
language = "English",
volume = "10",
journal = "BMC Bioinformatics",
issn = "1471-2105",
publisher = "BioMed Central",
number = "SUPPL. 3",
}

Fast max-margin clustering for unsupervised word sense disambiguation in biomedical texts. / Duan, Weisi; Song, Min; Yates, Alexander.

In: BMC bioinformatics, Vol. 10, No. SUPPL. 3, S4, 19.03.2009.

TY - JOUR

T1 - Fast max-margin clustering for unsupervised word sense disambiguation in biomedical texts

AU - Duan, Weisi

AU - Song, Min

AU - Yates, Alexander

PY - 2009/3/19

Y1 - 2009/3/19

N2 - Background: We aim to solve the problem of determining word senses for ambiguous biomedical terms with minimal human effort. Methods: We build a fully automated system for Word Sense Disambiguation by designing a system that does not require manually-constructed external resources or manually-labeled training examples except for a single ambiguous word. The system uses a novel and efficient graph-based algorithm to cluster words into groups that have the same meaning. Our algorithm follows the principle of finding a maximum margin between clusters, determining a split of the data that maximizes the minimum distance between pairs of data points belonging to two different clusters. Results: On a test set of 21 ambiguous keywords from PubMed abstracts, our system has an average accuracy of 78%, outperforming a state-of-the-art unsupervised system by 2% and a baseline technique by 23%. On a standard data set from the National Library of Medicine, our system outperforms the baseline by 6% and comes within 5% of the accuracy of a supervised system. Conclusion: Our system is a novel, state-of-the-art technique for efficiently finding word sense clusters, and does not require training data or human effort for each new word to be disambiguated.

UR - http://www.scopus.com/inward/record.url?scp=63449126014&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=63449126014&partnerID=8YFLogxK

U2 - 10.1186/1471-2105-10-S3-S4

DO - 10.1186/1471-2105-10-S3-S4

M3 - Article

C2 - 19344480

AN - SCOPUS:63449126014

VL - 10

JO - BMC Bioinformatics

JF - BMC Bioinformatics

SN - 1471-2105

IS - SUPPL. 3

M1 - S4

ER -