Abstract
Identifying outlier documents, whose content differs from that of the majority of documents in a corpus, plays an important role in managing a large text collection. However, because explicit information about the inlier (or target) distribution is absent, existing unsupervised outlier detectors are likely to produce unreliable results depending on the density or diversity of the outliers in the corpus. To address this challenge, we introduce a new task, referred to as out-of-category detection, which aims to distinguish documents according to their semantic relevance to the inlier (or target) categories, using the category names as weak supervision. In practice, this task is widely applicable because the scope of the target categories can be flexibly designated according to users' interests, while only the target-category names are required as minimal guidance. In this paper, we present an out-of-category detection framework that effectively measures how confidently each document belongs to one of the target categories. Our framework adopts a two-step approach to take advantage of both (i) a discriminative text embedding and (ii) a neural text classifier. Experiments on real-world datasets demonstrate that our framework achieves the best detection performance compared with all baseline methods in various scenarios specifying different target categories.
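The idea of scoring how confidently a document belongs to one of the target categories can be illustrated with a minimal sketch. This is not the authors' framework: it assumes that documents and category names have already been mapped to vectors in a shared embedding space, it uses maximum cosine similarity as a stand-in confidence score, and it omits the neural text classifier step entirely. The function names (`target_confidence`, `rank_out_of_category`) and the toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def target_confidence(doc_vec, category_vecs):
    """Assumed confidence score: similarity to the closest target category."""
    return max(cosine(doc_vec, c) for c in category_vecs.values())


def rank_out_of_category(doc_vecs, category_vecs):
    """Return document ids ordered from least to most confident.

    Documents at the front of the list are the strongest
    out-of-category candidates under this toy score.
    """
    return sorted(doc_vecs, key=lambda d: target_confidence(doc_vecs[d], category_vecs))


# Hypothetical 3-dimensional embeddings, for illustration only.
categories = {"sports": [1.0, 0.0, 0.0], "politics": [0.0, 1.0, 0.0]}
documents = {
    "d1": [0.9, 0.1, 0.0],   # close to "sports"
    "d2": [0.1, 0.8, 0.1],   # close to "politics"
    "d3": [0.0, 0.1, 0.95],  # far from both target categories
}

ranking = rank_out_of_category(documents, categories)
print(ranking)
```

In this toy setup the document that is far from both category-name vectors is ranked first as the most likely out-of-category document. A real system would need learned embeddings and, per the abstract, a classifier trained on top of them.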
Original language | English |
---|---|
Title of host publication | Proceedings - 21st IEEE International Conference on Data Mining, ICDM 2021 |
Editors | James Bailey, Pauli Miettinen, Yun Sing Koh, Dacheng Tao, Xindong Wu |
Publisher | Institute of Electrical and Electronics Engineers Inc. |
Pages | 1186-1191 |
Number of pages | 6 |
ISBN (Electronic) | 9781665423984 |
Publication status | Published - 2021 |
Event | 21st IEEE International Conference on Data Mining, ICDM 2021 - Virtual, Online, New Zealand; Duration: 2021 Dec 7 → 2021 Dec 10 |
Publication series
Name | Proceedings - IEEE International Conference on Data Mining, ICDM |
---|---|
Volume | 2021-December |
ISSN (Print) | 1550-4786 |
Conference
Conference | 21st IEEE International Conference on Data Mining, ICDM 2021 |
---|---|
Country/Territory | New Zealand |
City | Virtual, Online |
Period | 21/12/7 → 21/12/10 |
Bibliographical note
Funding Information: V. CONCLUSION This paper proposes a new task of detecting out-of-category documents in a text corpus, using given target-category names as weak supervision. The proposed OOCD framework adopts a two-step approach that takes advantage of both the textual similarity encoded in a text embedding space and the discriminative power of a neural text classifier. Our empirical evaluation demonstrates that OOCD successfully identifies out-of-category documents in various target scenarios. In conclusion, OOCD can be used in many real-world applications to filter out documents that are less relevant to the inlier categories or to topics of user interest, while requiring only minimal guidance. Acknowledgement. This work was supported by the NRF grant (No. 2020R1A2B5B03097210), the IITP grant (No. 2018-0-00584, 2019-0-01906), US DARPA KAIROS Program (No. FA8750-19-2-1004), SocialSim Program (No. W911NF-17-C-0099), INCAS Program (No. HR001121C0165), National Science Foundation (IIS-19-56151, IIS-17-41317, IIS-17-04532), and the Molecule Maker Lab Institute: An AI Research Institutes program (No. 2019897).
Publisher Copyright:
© 2021 IEEE.
All Science Journal Classification (ASJC) codes
- Engineering(all)