Abstract
The Universal Trigger (UniTrigger) is a recently proposed, powerful adversarial textual attack method. Using a learning-based mechanism, UniTrigger generates a fixed phrase that, when added to any benign input, can drop the prediction accuracy of a textual neural network (NN) model to near zero on a target class. To defend against this potentially harmful attack, we borrow the “honeypot” concept from the cybersecurity community and propose DARCY, a honeypot-based defense framework against UniTrigger. DARCY greedily searches for and injects multiple trapdoors into an NN model to “bait and catch” potential attacks. Through comprehensive experiments across four public datasets, we show that DARCY detects UniTrigger's adversarial attacks with up to 99% true positive rate (TPR) and less than 2% false positive rate (FPR) in most cases, while maintaining the prediction accuracy (in F1) for clean inputs within a 1% margin. We also demonstrate that DARCY with multiple trapdoors is robust to a diverse set of attack scenarios in which attackers have varying levels of knowledge and skills. We release the source code of DARCY at: https://github.com/lethaiq/ACL2021-DARCY-HoneypotDefenseNLP.
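The detection results above are reported as TPR and FPR. As a reading aid only, the following minimal sketch shows how these two rates are conventionally computed for any binary attack detector. It is not taken from the DARCY source code; the function name `detection_rates` and the toy labels are hypothetical and chosen purely for illustration.

```python
from typing import Sequence, Tuple


def detection_rates(is_attack: Sequence[bool], flagged: Sequence[bool]) -> Tuple[float, float]:
    """Return (TPR, FPR) for a binary attack detector.

    is_attack[i] is True when input i is adversarial (ground truth).
    flagged[i] is True when the detector marked input i as an attack.
    Hypothetical helper for illustration; not part of DARCY.
    """
    if len(is_attack) != len(flagged):
        raise ValueError("is_attack and flagged must have the same length")

    tp = sum(1 for a, f in zip(is_attack, flagged) if a and f)          # attacks caught
    fn = sum(1 for a, f in zip(is_attack, flagged) if a and not f)      # attacks missed
    fp = sum(1 for a, f in zip(is_attack, flagged) if not a and f)      # clean inputs wrongly flagged
    tn = sum(1 for a, f in zip(is_attack, flagged) if not a and not f)  # clean inputs passed

    if tp + fn == 0 or fp + tn == 0:
        raise ValueError("need at least one adversarial and one clean input")

    tpr = tp / (tp + fn)  # share of adversarial inputs that were detected
    fpr = fp / (fp + tn)  # share of clean inputs that were wrongly flagged
    return tpr, fpr


# Toy data (made up): 4 adversarial inputs followed by 6 clean inputs.
truth = [True] * 4 + [False] * 6
preds = [True, True, True, False, False, False, True, False, False, False]
tpr, fpr = detection_rates(truth, preds)
print(f"TPR={tpr:.2f} FPR={fpr:.2f}")
```

A high TPR with a low FPR means the detector catches most adversarial inputs while rarely rejecting clean ones, which is the trade-off the abstract summarizes.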
Original language | English |
---|---|
Title of host publication | ACL-IJCNLP 2021 - 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, Proceedings of the Conference |
Publisher | Association for Computational Linguistics (ACL) |
Pages | 3831-3844 |
Number of pages | 14 |
ISBN (Electronic) | 9781954085527 |
Publication status | Published - 2021 |
Event | Joint Conference of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL-IJCNLP 2021 - Virtual, Online. Duration: 2021 Aug 1 → 2021 Aug 6 |
Publication series
Name | ACL-IJCNLP 2021 - 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, Proceedings of the Conference |
---|---|
Conference
Conference | Joint Conference of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL-IJCNLP 2021 |
---|---|
City | Virtual, Online |
Period | 2021 Aug 1 → 2021 Aug 6 |
Bibliographical note
Funding Information: The work of Thai Le and Dongwon Lee was supported in part by NSF awards #1742702, #1820609, #1909702, #1915801, #1940076, #1934782, and #2114824. The work of Noseong Park was supported by the Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korean government (MSIT) (No. 2020-0-01361, Artificial Intelligence Graduate School Program (Yonsei University)).
Publisher Copyright:
© 2021 Association for Computational Linguistics
All Science Journal Classification (ASJC) codes
- Software
- Computational Theory and Mathematics
- Linguistics and Language
- Language and Linguistics