Term discrimination for text search tasks derived from negative binomial distribution

Lorenz Bernauer, Eun Jin Han, So Young Sohn

Research output: Contribution to journal, Article

Abstract

Accurate term discrimination in information retrieval is essential for identifying important terms in specific documents. In addition to the widely known inverse document frequency (IDF) method, alternative approaches such as the residual inverse document frequency (RIDF) scheme have been introduced for term discrimination. However, existing methods’ performance is not unconditionally convincing. We propose a new collection frequency weighting scheme derived from the negative binomial distribution model of term occurrences. Factorial experiments were performed to examine potential interaction effect between collection frequency weight methods and term frequency weight methods according to the mean average precision and normalized discounted cumulative gain performance assessors. The results indicate that our proposed term discrimination method offers a significant gain in accuracy as compared to the IDF and RIDF scheme. This finding is reinforced by the fact that the results show no interaction effects among factors.
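
For orientation, the two baseline weights named in the abstract can be sketched in a few lines of Python. This is a minimal illustration using the commonly cited definitions: IDF as the negative base-2 log of the document-frequency ratio, and RIDF as the gap between observed IDF and the IDF expected under a Poisson model. The function names and the counts are hypothetical. The paper's proposed negative binomial weight is not reproduced here, because this record does not state its formula.

```python
import math


def idf(df, n_docs):
    """Inverse document frequency: -log2(df / N)."""
    return -math.log2(df / n_docs)


def ridf(df, cf, n_docs):
    """Residual IDF: observed IDF minus the IDF expected under a Poisson model."""
    # With Poisson rate cf / N, a document contains the term at least once
    # with probability 1 - exp(-cf / N).
    expected_idf = -math.log2(1.0 - math.exp(-cf / n_docs))
    return idf(df, n_docs) - expected_idf


# Hypothetical collection of 1000 documents; both terms occur 100 times in total.
N_DOCS = 1000
spread_term = ridf(df=95, cf=100, n_docs=N_DOCS)   # occurrences spread thinly
bursty_term = ridf(df=20, cf=100, n_docs=N_DOCS)   # occurrences concentrated

print(f"spread term RIDF: {spread_term:.4f}")
print(f"bursty term RIDF: {bursty_term:.4f}")
```

A term whose occurrences cluster in few documents departs from the Poisson expectation and so receives a larger RIDF than a term with the same total count spread across many documents.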

Original language: English
Pages (from-to): 370-379
Number of pages: 10
Journal: Information Processing and Management
Volume: 54
Issue number: 3
DOI: 10.1016/j.ipm.2018.01.003
Publication status: Published - 2018 May 1

Fingerprint

Information retrieval
Discrimination
Experiments
Interaction
Weighting
Performance
Negative binomial

All Science Journal Classification (ASJC) codes

  • Information Systems
  • Media Technology
  • Computer Science Applications
  • Management Science and Operations Research
  • Library and Information Sciences

Cite this

@article{9ade4b0dc0154bf9a66a0dda560b53e1,
title = "Term discrimination for text search tasks derived from negative binomial distribution",
abstract = "Accurate term discrimination in information retrieval is essential for identifying important terms in specific documents. In addition to the widely known inverse document frequency (IDF) method, alternative approaches such as the residual inverse document frequency (RIDF) scheme have been introduced for term discrimination. However, existing methods’ performance is not unconditionally convincing. We propose a new collection frequency weighting scheme derived from the negative binomial distribution model of term occurrences. Factorial experiments were performed to examine potential interaction effect between collection frequency weight methods and term frequency weight methods according to the mean average precision and normalized discounted cumulative gain performance assessors. The results indicate that our proposed term discrimination method offers a significant gain in accuracy as compared to the IDF and RIDF scheme. This finding is reinforced by the fact that the results show no interaction effects among factors.",
author = "Bernauer, Lorenz and Han, {Eun Jin} and Sohn, {So Young}",
year = "2018",
month = "5",
day = "1",
doi = "10.1016/j.ipm.2018.01.003",
language = "English",
volume = "54",
pages = "370--379",
journal = "Information Processing and Management",
issn = "0306-4573",
publisher = "Elsevier Limited",
number = "3",
}

Term discrimination for text search tasks derived from negative binomial distribution. / Bernauer, Lorenz; Han, Eun Jin; Sohn, So Young.

In: Information Processing and Management, Vol. 54, No. 3, 01.05.2018, p. 370-379.

TY  - JOUR
T1  - Term discrimination for text search tasks derived from negative binomial distribution
AU  - Bernauer, Lorenz
AU  - Han, Eun Jin
AU  - Sohn, So Young
PY  - 2018/5/1
Y1  - 2018/5/1
N2  - Accurate term discrimination in information retrieval is essential for identifying important terms in specific documents. In addition to the widely known inverse document frequency (IDF) method, alternative approaches such as the residual inverse document frequency (RIDF) scheme have been introduced for term discrimination. However, existing methods’ performance is not unconditionally convincing. We propose a new collection frequency weighting scheme derived from the negative binomial distribution model of term occurrences. Factorial experiments were performed to examine potential interaction effect between collection frequency weight methods and term frequency weight methods according to the mean average precision and normalized discounted cumulative gain performance assessors. The results indicate that our proposed term discrimination method offers a significant gain in accuracy as compared to the IDF and RIDF scheme. This finding is reinforced by the fact that the results show no interaction effects among factors.
AB  - Accurate term discrimination in information retrieval is essential for identifying important terms in specific documents. In addition to the widely known inverse document frequency (IDF) method, alternative approaches such as the residual inverse document frequency (RIDF) scheme have been introduced for term discrimination. However, existing methods’ performance is not unconditionally convincing. We propose a new collection frequency weighting scheme derived from the negative binomial distribution model of term occurrences. Factorial experiments were performed to examine potential interaction effect between collection frequency weight methods and term frequency weight methods according to the mean average precision and normalized discounted cumulative gain performance assessors. The results indicate that our proposed term discrimination method offers a significant gain in accuracy as compared to the IDF and RIDF scheme. This finding is reinforced by the fact that the results show no interaction effects among factors.
UR  - http://www.scopus.com/inward/record.url?scp=85042464843&partnerID=8YFLogxK
UR  - http://www.scopus.com/inward/citedby.url?scp=85042464843&partnerID=8YFLogxK
U2  - 10.1016/j.ipm.2018.01.003
DO  - 10.1016/j.ipm.2018.01.003
M3  - Article
AN  - SCOPUS:85042464843
VL  - 54
SP  - 370
EP  - 379
JO  - Information Processing and Management
JF  - Information Processing and Management
SN  - 0306-4573
IS  - 3
ER  -