Document representation is central to modern natural language processing systems including document clustering. Empirical experiments in recent studies provide strong evidence that unsupervised language models can learn context-aware representations in the given documents and advance several NLP benchmark results. However, existing clustering approaches focus on the dimensionality reduction and do not exploit these informative representations. In this paper, we propose a conceptually simple but experimentally effective clustering framework called Advanced Document Clustering (ADC). In contrast to previous clustering methods, ADC is designed to leverage syntactically and semantically meaningful features through feature-extraction and clustering modules in the framework. We first extract features from pre-trained language models and initialize cluster centroids to spread out uniformly. In the clustering module of ADC, the semantic similarity can be measured using the cosine similarity and centroids update while assigning centroids to a mini-batch input. Also, we utilize cross entropy loss partially, as the self-training scheme can be biased when parameters in the model are inaccurate. As a result, ADC can take advantages of contextualized representations while mitigating the limitations introduced by high-dimensional vectors. In numerous experiments with four datasets, the proposed ADC outperforms other existing approaches. In particular, experiments on categorizing news corpus with fake news demonstrated the effectiveness of our method for contextualized representations.
All Science Journal Classification (ASJC) codes
- Computer Science Applications
- Artificial Intelligence