Detecting duplicate biological entities using Markov random field-based edit distance

Min Song, Alex Rudniy

Research output: Contribution to journal › Article

7 Citations (Scopus)

Abstract

Detecting duplicate entities in biological data is an important research task. In this paper, we propose a novel and context-sensitive Markov random field-based edit distance (MRFED) for this task. We apply the Markov random field theory to the Needleman-Wunsch distance and combine MRFED with TFIDF, a token-based distance algorithm, resulting in SoftMRFED. We compare SoftMRFED with other distance algorithms such as Levenshtein, SoftTFIDF, and Monge-Elkan for two matching tasks: biological entity matching and synonym matching. The experimental results show that SoftMRFED significantly outperforms the other edit distance algorithms on several test data collections. In addition, the performance of SoftMRFED is superior to token-based distance algorithms in two matching tasks.
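As a point of reference for the abstract above, the following is a minimal sketch of the plain Needleman-Wunsch distance that MRFED is said to build on. It is not the authors' method: the abstract does not give the Markov random field formulation, so none of it is reproduced here. The function name and the `gap_cost` and `sub_cost` parameters are assumptions chosen for illustration only.

```python
def needleman_wunsch_distance(s: str, t: str,
                              gap_cost: float = 1.0,
                              sub_cost: float = 1.0) -> float:
    """Global-alignment cost between s and t (lower means more similar).

    Illustrative baseline only; the costs are hypothetical defaults,
    not values taken from the paper.
    """
    # prev[j] holds the cost of aligning the processed prefix of s with t[:j].
    prev = [j * gap_cost for j in range(len(t) + 1)]
    for i, a in enumerate(s, start=1):
        curr = [i * gap_cost]  # aligning s[:i] with the empty string
        for j, b in enumerate(t, start=1):
            substitute = prev[j - 1] + (0.0 if a == b else sub_cost)
            delete = prev[j] + gap_cost       # gap in t
            insert = curr[j - 1] + gap_cost   # gap in s
            curr.append(min(substitute, delete, insert))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    # Two spellings of the same entity differ by a single character.
    print(needleman_wunsch_distance("interleukin-2", "interleukin 2"))  # 1.0
```

With `gap_cost` and `sub_cost` both set to 1, this reduces to the Levenshtein distance, one of the baselines named in the abstract. Raising `gap_cost` penalizes insertions and deletions more heavily than substitutions.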

Original language: English
Pages (from-to): 371-387
Number of pages: 17
Journal: Knowledge and Information Systems
Volume: 25
Issue number: 2
DOI: 10.1007/s10115-009-0254-7
Publication status: Published - 2010 Nov 1

All Science Journal Classification (ASJC) codes

  • Software
  • Information Systems
  • Human-Computer Interaction
  • Hardware and Architecture
  • Artificial Intelligence

Cite this

@article{c03c5af9e20d42858ded757650acace1,
title = "Detecting duplicate biological entities using Markov random field-based edit distance",
abstract = "Detecting duplicate entities in biological data is an important research task. In this paper, we propose a novel and context-sensitive Markov random field-based edit distance (MRFED) for this task. We apply the Markov random field theory to the Needleman-Wunsch distance and combine MRFED with TFIDF, a token-based distance algorithm, resulting in SoftMRFED. We compare SoftMRFED with other distance algorithms such as Levenshtein, SoftTFIDF, and Monge-Elkan for two matching tasks: biological entity matching and synonym matching. The experimental results show that SoftMRFED significantly outperforms the other edit distance algorithms on several test data collections. In addition, the performance of SoftMRFED is superior to token-based distance algorithms in two matching tasks.",
author = "Min Song and Alex Rudniy",
year = "2010",
month = "11",
day = "1",
doi = "10.1007/s10115-009-0254-7",
language = "English",
volume = "25",
pages = "371--387",
journal = "Knowledge and Information Systems",
issn = "0219-1377",
publisher = "Springer London",
number = "2"
}

Detecting duplicate biological entities using Markov random field-based edit distance. / Song, Min; Rudniy, Alex.

In: Knowledge and Information Systems, Vol. 25, No. 2, 01.11.2010, p. 371-387.

TY - JOUR

T1 - Detecting duplicate biological entities using Markov random field-based edit distance

AU - Song, Min

AU - Rudniy, Alex

PY - 2010/11/1

Y1 - 2010/11/1

N2 - Detecting duplicate entities in biological data is an important research task. In this paper, we propose a novel and context-sensitive Markov random field-based edit distance (MRFED) for this task. We apply the Markov random field theory to the Needleman-Wunsch distance and combine MRFED with TFIDF, a token-based distance algorithm, resulting in SoftMRFED. We compare SoftMRFED with other distance algorithms such as Levenshtein, SoftTFIDF, and Monge-Elkan for two matching tasks: biological entity matching and synonym matching. The experimental results show that SoftMRFED significantly outperforms the other edit distance algorithms on several test data collections. In addition, the performance of SoftMRFED is superior to token-based distance algorithms in two matching tasks.

UR - http://www.scopus.com/inward/record.url?scp=78049440735&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=78049440735&partnerID=8YFLogxK

U2 - 10.1007/s10115-009-0254-7

DO - 10.1007/s10115-009-0254-7

M3 - Article

AN - SCOPUS:78049440735

VL - 25

SP - 371

EP - 387

JO - Knowledge and Information Systems

JF - Knowledge and Information Systems

SN - 0219-1377

IS - 2

ER -