ADC: Advanced document clustering using contextualized representations

J. Park, Chanhee Park, Jeongwoo Kim, Minsoo Cho, Sanghyun Park

Research output: Contribution to journal (Article)

2 Citations (Scopus)

Abstract

Document representation is central to modern natural language processing (NLP) systems, including document clustering. Empirical experiments in recent studies provide strong evidence that unsupervised language models can learn context-aware representations of the given documents and advance several NLP benchmark results. However, existing clustering approaches focus on dimensionality reduction and do not exploit these informative representations. In this paper, we propose a conceptually simple but experimentally effective clustering framework called Advanced Document Clustering (ADC). In contrast to previous clustering methods, ADC is designed to leverage syntactically and semantically meaningful features through its feature-extraction and clustering modules. We first extract features from pre-trained language models and initialize the cluster centroids so that they are spread out uniformly. In the clustering module of ADC, semantic similarity is measured using cosine similarity, and the centroids are updated as they are assigned to a mini-batch input. We also apply the cross-entropy loss only partially, because the self-training scheme can be biased when the model parameters are inaccurate. As a result, ADC can take advantage of contextualized representations while mitigating the limitations introduced by high-dimensional vectors. In numerous experiments on four datasets, the proposed ADC outperforms existing approaches. In particular, experiments on categorizing a news corpus containing fake news demonstrate the effectiveness of our method with contextualized representations.
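The clustering step described in the abstract (cosine similarity to centroids, with centroids updated per mini-batch) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the update rule, so the running-mean step below is borrowed from generic mini-batch k-means, the axis-aligned initial centroids are only a stand-in for "spread out uniformly", and the two-dimensional vectors are toy placeholders for language-model features. All function names are hypothetical.

```python
import math


def normalize(v):
    """Scale v to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    if nu == 0 or nv == 0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)


def assign(vec, centroids):
    """Index of the centroid most similar to vec under cosine similarity."""
    return max(range(len(centroids)), key=lambda k: cosine(vec, centroids[k]))


def minibatch_update(batch, centroids, counts):
    """For each vector in the batch: assign it, then move that centroid toward it.

    The step size 1/count (a running mean) is an assumption taken from
    generic mini-batch k-means, not a rule stated in the abstract.
    Centroids and counts are modified in place.
    """
    labels = []
    for vec in batch:
        k = assign(vec, centroids)
        counts[k] += 1
        eta = 1.0 / counts[k]
        unit = normalize(vec)
        moved = [(1 - eta) * c + eta * x for c, x in zip(centroids[k], unit)]
        centroids[k] = normalize(moved)  # keep centroids on the unit sphere
        labels.append(k)
    return labels


# Toy "document features" (placeholders for language-model embeddings).
docs = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.8]]
# Assumed initialization: orthogonal unit vectors, so centroids start far apart.
centroids = [[1.0, 0.0], [0.0, 1.0]]
counts = [0, 0]

labels = minibatch_update(docs, centroids, counts)
print(labels)
```

Because only the direction of each vector matters under cosine similarity, the sketch renormalizes every centroid after each update; other choices, such as learning centroids by gradient descent, would fit the abstract's description equally well.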

Original language: English
Pages (from-to): 157-166
Number of pages: 10
Journal: Expert Systems with Applications
Volume: 137
DOI: 10.1016/j.eswa.2019.06.068
Publication status: Published - 2019 Dec 15

Fingerprint

Natural language processing systems
Experiments
Feature extraction
Entropy
Semantics

All Science Journal Classification (ASJC) codes

  • Engineering (all)
  • Computer Science Applications
  • Artificial Intelligence

Cite this

Park, J. ; Park, Chanhee ; Kim, Jeongwoo ; Cho, Minsoo ; Park, Sanghyun. / ADC : Advanced document clustering using contextualized representations. In: Expert Systems with Applications. 2019 ; Vol. 137. pp. 157-166.
@article{74b0a15b011b4f299a85be37c0afc379,
title = "ADC: Advanced document clustering using contextualized representations",
author = "J. Park and Chanhee Park and Jeongwoo Kim and Minsoo Cho and Sanghyun Park",
year = "2019",
month = "12",
day = "15",
doi = "10.1016/j.eswa.2019.06.068",
language = "English",
volume = "137",
pages = "157--166",
journal = "Expert Systems with Applications",
issn = "0957-4174",
publisher = "Elsevier Limited",

}

TY - JOUR

T1 - ADC

T2 - Advanced document clustering using contextualized representations

AU - Park, J.

AU - Park, Chanhee

AU - Kim, Jeongwoo

AU - Cho, Minsoo

AU - Park, Sanghyun

PY - 2019/12/15

Y1 - 2019/12/15

UR - http://www.scopus.com/inward/record.url?scp=85068482837&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85068482837&partnerID=8YFLogxK

U2 - 10.1016/j.eswa.2019.06.068

DO - 10.1016/j.eswa.2019.06.068

M3 - Article

AN - SCOPUS:85068482837

VL - 137

SP - 157

EP - 166

JO - Expert Systems with Applications

JF - Expert Systems with Applications

SN - 0957-4174

ER -