Partially collapsed Gibbs sampling for latent Dirichlet allocation

Hongju Park, Taeyoung Park, Yung Seop Lee

Research output: Contribution to journal › Article

1 Citation (Scopus)

Abstract

The latent Dirichlet allocation (LDA) model is a machine learning technique for identifying latent topics in text corpora within a Bayesian hierarchical framework. The most popular inferential methods for fitting the LDA model are based on variational Bayesian inference, collapsed Gibbs sampling, or a combination of the two. Because these methods assume a unimodal distribution over topics, however, they can suffer from large bias when a text corpus consists of several clusters with different topic distributions. This paper proposes an inferential LDA method that efficiently obtains unbiased estimates under flexible modeling of heterogeneous text corpora by combining the method of partial collapse with Dirichlet process mixtures. The method is illustrated with a simulation study and an application to a corpus of 1,300 documents drawn from Neural Information Processing Systems (NIPS) conference articles from 2000–2002 and British Broadcasting Corporation (BBC) news articles from 2004–2005.
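
For context, the sketch below shows the standard collapsed Gibbs sampler for LDA that the abstract refers to as a baseline. It is not the partially collapsed sampler proposed in the paper (which additionally uses partial collapse and Dirichlet process mixtures); the function name, toy corpus, number of topics, and hyperparameters alpha and beta are illustrative assumptions only.

```python
# Minimal sketch of standard collapsed Gibbs sampling for LDA (baseline method
# mentioned in the abstract). All inputs below are made-up illustrations, not
# the paper's data or its proposed partially collapsed sampler.
import numpy as np

def collapsed_gibbs_lda(docs, vocab_size, n_topics, alpha=0.1, beta=0.01,
                        n_iter=200, seed=0):
    """docs: list of documents, each a list of integer word ids."""
    rng = np.random.default_rng(seed)
    D, K, V = len(docs), n_topics, vocab_size
    n_dk = np.zeros((D, K))   # topic counts per document
    n_kw = np.zeros((K, V))   # word counts per topic
    n_k = np.zeros(K)         # total word count per topic
    z = []                    # topic assignment of every token

    # Random initialization of topic assignments and count tables.
    for d, doc in enumerate(docs):
        z_d = rng.integers(K, size=len(doc))
        z.append(z_d)
        for w, k in zip(doc, z_d):
            n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                # Remove the current token from the counts.
                n_dk[d, k] -= 1; n_kw[k, w] -= 1; n_k[k] -= 1
                # Full conditional: p(k) ∝ (n_dk + alpha)(n_kw + beta)/(n_k + V*beta)
                p = (n_dk[d] + alpha) * (n_kw[:, w] + beta) / (n_k + V * beta)
                k = rng.choice(K, p=p / p.sum())
                # Reinsert the token under its newly sampled topic.
                z[d][i] = k
                n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1

    # Posterior mean estimates of topic-word and document-topic distributions.
    phi = (n_kw + beta) / (n_k[:, None] + V * beta)
    theta = (n_dk + alpha) / (n_dk.sum(axis=1, keepdims=True) + K * alpha)
    return phi, theta

# Toy usage with a made-up corpus of word-id lists.
docs = [[0, 1, 2, 1], [3, 4, 3, 5], [0, 2, 4, 5]]
phi, theta = collapsed_gibbs_lda(docs, vocab_size=6, n_topics=2, n_iter=50)
```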

Original language: English
Pages (from-to): 208-218
Number of pages: 11
Journal: Expert Systems with Applications
Volume: 131
DOIs: 10.1016/j.eswa.2019.04.028
Publication status: Published - 2019 Oct 1

Fingerprint

  • Sampling
  • Broadcasting
  • Learning systems
  • Industry

All Science Journal Classification (ASJC) codes

  • Engineering (all)
  • Computer Science Applications
  • Artificial Intelligence

Cite this

@article{6570782010f04f0abfea1b199f4f3da3,
title = "Partially collapsed Gibbs sampling for latent Dirichlet allocation",
abstract = "A latent Dirichlet allocation (LDA) model is a machine learning technique to identify latent topics from text corpora within a Bayesian hierarchical framework. Current popular inferential methods to fit the LDA model are based on variational Bayesian inference, collapsed Gibbs sampling, or a combination of these. Because these methods assume a unimodal distribution over topics, however, they can suffer from large bias when text corpora consist of various clusters with different topic distributions. This paper proposes an inferential LDA method to efficiently obtain unbiased estimates under flexible modeling for heterogeneous text corpora with the method of partial collapse and the Dirichlet process mixtures. The method is illustrated using a simulation study and an application to a corpus of 1300 documents from neural information processing systems (NIPS) conference articles during the period of 2000–2002 and British Broadcasting Corporation (BBC) news articles during the period of 2004–2005.",
author = "Hongju Park and Taeyoung Park and Lee, {Yung Seop}",
year = "2019",
month = "10",
day = "1",
doi = "10.1016/j.eswa.2019.04.028",
language = "English",
volume = "131",
pages = "208--218",
journal = "Expert Systems with Applications",
issn = "0957-4174",
publisher = "Elsevier Limited",

}

