Class imbalance arises frequently in binary classification tasks. If one class outnumbers the other, trained classifiers become heavily biased towards the majority class. In phishing URL detection, collected benign URLs (i.e., the majority class) typically far outnumber collected phishy URLs (i.e., the minority class). Oversampling the minority class can be a powerful way to overcome this imbalance. However, existing methods oversample in the feature space, where the original data format is discarded and URLs are represented compactly as vectors. These methods succeed only if the feature definitions are correct and the dataset is diverse and not too sparse. In this paper, we propose an oversampling technique that operates in the data space. We train text generative adversarial networks (text-GANs) on URLs in the minority class and generate synthetic URLs that can be added to the training set. We crawl a crowd-sourced URL repository to collect recently discovered phishy and benign URLs. Our experiments demonstrate significant performance improvements after applying the proposed oversampling technique. Interestingly, the proposed text generative model regenerates some of the original test URLs exactly.
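The data-space oversampling idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: a character-level Markov chain stands in for the text-GAN so the example stays short and dependency-free. All names (`fit_char_model`, `sample_url`, `oversample`) and the choice to balance the classes exactly are assumptions made for illustration.

```python
import random
from collections import defaultdict

# Sentinel control characters, assumed never to occur inside a URL.
START, END = "\x02", "\x03"


def fit_char_model(urls, order=3):
    """Record which character follows each `order`-length context.

    Stand-in for training a text-GAN on the minority class (assumption).
    """
    model = defaultdict(list)
    for url in urls:
        padded = START * order + url + END
        for i in range(order, len(padded)):
            model[padded[i - order:i]].append(padded[i])
    return model


def sample_url(model, rng, order=3, max_len=200):
    """Generate one synthetic URL string, one character at a time."""
    context, out = START * order, []
    while len(out) < max_len:
        nxt = rng.choice(model[context])
        if nxt == END:
            break
        out.append(nxt)
        context = context[1:] + nxt
    return "".join(out)


def oversample(benign, phishy, order=3, seed=0):
    """Add synthetic minority-class URLs until both classes are equal in size.

    Returns the augmented URL list, its labels (0 = benign, 1 = phishy),
    and the synthetic URLs on their own.
    """
    if not phishy:
        raise ValueError("need at least one minority-class URL to learn from")
    rng = random.Random(seed)
    model = fit_char_model(phishy, order)
    synthetic = []
    while len(phishy) + len(synthetic) < len(benign):
        synthetic.append(sample_url(model, rng, order))
    urls = benign + phishy + synthetic
    labels = [0] * len(benign) + [1] * (len(phishy) + len(synthetic))
    return urls, labels, synthetic
```

Calling `oversample(benign_urls, phishy_urls)` returns raw URL strings, so any feature extraction happens after augmentation rather than before it, which is the distinction the abstract draws between data-space and feature-space oversampling.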
|Title of host publication||Proceedings - 2018 IEEE International Conference on Big Data, Big Data 2018|
|Editors||Yang Song, Bing Liu, Kisung Lee, Naoki Abe, Calton Pu, Mu Qiao, Nesreen Ahmed, Donald Kossmann, Jeffrey Saltz, Jiliang Tang, Jingrui He, Huan Liu, Xiaohua Hu|
|Publisher||Institute of Electrical and Electronics Engineers Inc.|
|Number of pages||10|
|Publication status||Published - 2019 Jan 22|
|Event||2018 IEEE International Conference on Big Data, Big Data 2018 - Seattle, United States (Duration: 2018 Dec 10 → 2018 Dec 13)|
Bibliographical note
Funding Information:
*Equally contributed and listed in alphabetical order; †Corresponding author; This work was partially supported by the Office of Naval Research under the MURI grant N00014-18-1-2670, and the Indo-UK Collaborative Project DST/INT/UKP158/2017.
© 2018 IEEE.
All Science Journal Classification (ASJC) codes
- Computer Science Applications
- Information Systems