Monolingual and multilingual topic analysis using LDA and BERT embeddings

Qing Xie, Xinyuan Zhang, Ying Ding, Min Song

Research output: Contribution to journal › Article

Abstract

Analyzing research topics offers insight into the direction of scientific development. In particular, analyzing multilingual research topics can help researchers grasp how topics evolve globally and can reveal topic similarity among scientific publications written in different languages. Most topic analysis studies to date have been based on English-language publications and have relied heavily on citation-based topic evolution analysis. However, because it can be difficult for English-language publications to cite non-English sources, and because many non-English publications do not provide English translations of their abstracts, citation-based methodologies are not suitable for analyzing multilingual research topic relations. Because multilingual sentence embeddings can effectively preserve word semantics in multilingual translation tasks, a topic model based on such embeddings could potentially generate topic-word distributions for publications in multilingual analysis. In this paper, which is situated in the field of library and information science, we use multilingual pretrained Bidirectional Encoder Representations from Transformers (BERT) embeddings and the Latent Dirichlet Allocation (LDA) topic model to analyze topic evolution in monolingual and multilingual topic similarity settings. For each topic, we multiply its LDA probability value by the averaged tensor similarity of BERT embeddings to explore the evolution of the topic in scientific publications. Because our proposed method does not rely on a machine translator or on the author's subjective translation, it avoids the confusion and misuse caused by either machine error or the author's subjectively chosen English keywords. Our results show that the proposed approach is well suited to analyzing scientific topic evolution in both monolingual and multilingual topic similarity relations.
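The scoring step mentioned in the abstract (an LDA probability multiplied by an averaged embedding similarity) can be illustrated with a minimal sketch. The abstract does not give the exact formula, so everything here is an assumption made for illustration: the function names `mean_cosine_similarity` and `weighted_topic_similarity` are hypothetical, cosine similarity stands in for the "tensor similarity", the average is taken over all pairs of word vectors, and the small 2-dimensional arrays are toy stand-ins for real multilingual BERT embeddings.

```python
import numpy as np


def mean_cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Average pairwise cosine similarity between the rows of a and the rows of b.

    Assumption: each row is the embedding of one topic word, and the
    "averaged similarity" is the mean over all cross-topic word pairs.
    """
    a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    # (n_a, n_b) matrix of cosine similarities, reduced to a single mean.
    return float((a_unit @ b_unit.T).mean())


def weighted_topic_similarity(topic_prob: float,
                              emb_a: np.ndarray,
                              emb_b: np.ndarray) -> float:
    """Scale the averaged embedding similarity by an LDA topic probability."""
    return topic_prob * mean_cosine_similarity(emb_a, emb_b)


# Toy stand-ins for the embeddings of the top words of one topic in two
# languages (hypothetical values; real BERT vectors have hundreds of dimensions).
emb_lang_a = np.array([[1.0, 0.0], [0.0, 1.0]])
emb_lang_b = np.array([[1.0, 0.0], [1.0, 1.0]])

# Hypothetical LDA probability for the topic being compared.
score = weighted_topic_similarity(0.5, emb_lang_a, emb_lang_b)
print(f"weighted topic similarity: {score:.4f}")
```

In practice the probability would come from a fitted LDA model and the vectors from a multilingual pretrained BERT encoder; the sketch only shows how the two quantities could be combined into one score per topic pair.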

Original language: English
Article number: 101055
Journal: Journal of Informetrics
Volume: 14
Issue number: 3
Publication status: Published - 2020 Aug

All Science Journal Classification (ASJC) codes

  • Computer Science Applications
  • Library and Information Sciences
