Out-of-Category Document Identification Using Target-Category Names as Weak Supervision

Dongha Lee, Dongmin Hyun, Jiawei Han, Hwanjo Yu

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Identifying outlier documents, whose content differs from that of the majority of documents in a corpus, plays an important role in managing large text collections. However, because explicit information about the inlier (or target) distribution is absent, existing unsupervised outlier detectors are likely to produce unreliable results, depending on the density or diversity of the outliers in the corpus. To address this challenge, we introduce a new task, referred to as out-of-category detection, which aims to distinguish documents according to their semantic relevance to the inlier (or target) categories by using the category names as weak supervision. In practice, this task is widely applicable because the scope of the target categories can be flexibly designated according to users' interests, while only the target-category names are required as minimum guidance. In this paper, we present an out-of-category detection framework that effectively measures how confidently each document belongs to one of the target categories. Our framework adopts a two-step approach to take advantage of both (i) a discriminative text embedding and (ii) a neural text classifier. Experiments on real-world datasets demonstrate that our framework achieves the best detection performance among all baseline methods in various scenarios specifying different target categories.
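To make the notion of "how confidently each document belongs to one of the target categories" concrete, the following is a minimal sketch, not the authors' framework. It assumes a toy bag-of-words vector in place of a learned discriminative embedding, scores each document by its highest cosine similarity to any target-category name, and flags documents below a threshold. The function names (`bow`, `cosine`, `target_confidence`, `out_of_category`), the sample documents, and the threshold value are all hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def bow(text):
    """Toy bag-of-words vector (token -> count); a stand-in for a learned embedding."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def target_confidence(doc, category_names):
    """Highest similarity between the document and any target-category name."""
    doc_vec = bow(doc)
    return max(cosine(doc_vec, bow(name)) for name in category_names)


def out_of_category(docs, category_names, threshold=0.1):
    """Documents whose confidence falls below the (hypothetical) threshold."""
    return [d for d in docs if target_confidence(d, category_names) < threshold]


# Only the category names are given as supervision; no labeled documents.
targets = ["sports", "politics"]
docs = [
    "the sports team won the final",
    "politics dominated the debate",
    "a recipe for chocolate cake",
]
flagged = out_of_category(docs, targets)
print(flagged)
```

In this sketch the category names are the only supervision, which mirrors the task setting described above. A realistic system would replace the bag-of-words vectors with an embedding that captures semantic similarity, since exact word overlap with a category name is far too strict a signal.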

Original language: English
Title of host publication: Proceedings - 21st IEEE International Conference on Data Mining, ICDM 2021
Editors: James Bailey, Pauli Miettinen, Yun Sing Koh, Dacheng Tao, Xindong Wu
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 1186-1191
Number of pages: 6
ISBN (Electronic): 9781665423984
Publication status: Published - 2021
Event: 21st IEEE International Conference on Data Mining, ICDM 2021 - Virtual, Online, New Zealand
Duration: 2021 Dec 7 → 2021 Dec 10

Publication series

Name: Proceedings - IEEE International Conference on Data Mining, ICDM
Volume: 2021-December
ISSN (Print): 1550-4786

Conference

Conference: 21st IEEE International Conference on Data Mining, ICDM 2021
Country/Territory: New Zealand
City: Virtual, Online
Period: 21/12/7 → 21/12/10

Bibliographical note

Funding Information:
V. CONCLUSION

This paper proposes a new task of detecting out-of-category documents in a text corpus by using the given target-category names as weak supervision. The proposed OOCD framework adopts a two-step approach that takes advantage of both the textual similarity encoded in a text embedding space and the discriminative power of a neural text classifier. Our empirical evaluation demonstrates that OOCD successfully identifies out-of-category documents in various target scenarios. In conclusion, OOCD can be used in many real-world applications to filter out documents that are less relevant to the inlier categories or user-interested topics, while requiring only minimum guidance.

Acknowledgement. This work was supported by the NRF grant (No. 2020R1A2B5B03097210), the IITP grant (No. 2018-0-00584, 2019-0-01906), US DARPA KAIROS Program (No. FA8750-19-2-1004), SocialSim Program (No. W911NF-17-C-0099), INCAS Program (No. HR001121C0165), National Science Foundation (IIS-19-56151, IIS-17-41317, IIS 17-04532), and the Molecule Maker Lab Institute: An AI Research Institutes program (No. 2019897).

Publisher Copyright:
© 2021 IEEE.

All Science Journal Classification (ASJC) codes

  • Engineering(all)
