Birds of a Feature: Intrafamily Clustering for Version Identification of Packed Malware

Leo Hyun Park, Jungbeen Yu, Hong Koo Kang, Taejin Lee, Taekyoung Kwon

Research output: Contribution to journal › Article › peer-review

Abstract

Identifying versions of collected malware with high clustering accuracy is a challenging step in malware lineage inference. In this article, we tackle this problem and present a novel mechanism that uses behavioral features for version identification of (un)packed malware. Our basic idea is to focus on intrafamily clustering. We extract so-called family feature sets, i.e., hybrid features specific to each family. Our intuition is that family feature sets may achieve higher clustering accuracy than common feature sets, and that unpacked malware found in or relevant to such a cluster can enable lineage inference of family members using traditional inference methods. We conduct experiments with two datasets, 8928 malware samples from VXHeavens and 3293 samples from manual analysis, a large portion of which is packed malware. The results demonstrate that we can accurately classify samples into malware families based on the hybrid features we choose. In addition, we can effectively extract family feature sets from 37 feature categories using forward stepwise selection. For intrafamily clustering, we employ the agglomerative clustering algorithm and observe that using family feature sets is significantly more accurate than using common feature sets, which facilitates higher-accuracy lineage inference of packed malware.
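The combination of forward stepwise selection and agglomerative clustering mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not state the distance metric, linkage, or selection criterion, so the Euclidean distance, average linkage, and Rand-index scoring below are assumptions, and the category names (`api_calls`, `strings`, `sections`) and toy values are hypothetical.

```python
from itertools import combinations


def distance(a, b, categories):
    """Euclidean distance restricted to the chosen feature categories."""
    return sum((x - y) ** 2
               for c in categories
               for x, y in zip(a[c], b[c])) ** 0.5


def agglomerative(samples, categories, n_clusters):
    """Average-linkage agglomerative clustering; returns a cluster id per sample."""
    clusters = [[i] for i in range(len(samples))]
    while len(clusters) > n_clusters:
        def linkage(pair):
            p, q = pair
            total = sum(distance(samples[i], samples[j], categories)
                        for i in clusters[p] for j in clusters[q])
            return total / (len(clusters[p]) * len(clusters[q]))
        # Merge the two closest clusters (p < q, so deleting q keeps p valid).
        p, q = min(combinations(range(len(clusters)), 2), key=linkage)
        clusters[p] += clusters[q]
        del clusters[q]
    assignment = [0] * len(samples)
    for cid, members in enumerate(clusters):
        for i in members:
            assignment[i] = cid
    return assignment


def rand_index(predicted, truth):
    """Fraction of sample pairs on which the two labelings agree."""
    pairs = list(combinations(range(len(truth)), 2))
    agree = sum((predicted[i] == predicted[j]) == (truth[i] == truth[j])
                for i, j in pairs)
    return agree / len(pairs)


def forward_stepwise(samples, versions, all_categories):
    """Greedily add the category that most improves clustering agreement."""
    k = len(set(versions))
    selected, best = [], 0.0
    while True:
        scored = [(rand_index(agglomerative(samples, selected + [c], k), versions), c)
                  for c in all_categories if c not in selected]
        if not scored:
            break
        score, category = max(scored)
        if score <= best:  # stop once no remaining category helps
            break
        selected.append(category)
        best = score
    return selected, best


# Hypothetical toy family: two versions, three feature categories.
samples = [
    {"api_calls": [0.0, 0.1], "strings": [5.0], "sections": [1.0]},
    {"api_calls": [0.1, 0.0], "strings": [0.0], "sections": [1.0]},
    {"api_calls": [1.0, 0.9], "strings": [5.1], "sections": [1.0]},
    {"api_calls": [0.9, 1.0], "strings": [0.1], "sections": [1.0]},
]
versions = ["v1", "v1", "v2", "v2"]

selected, best = forward_stepwise(samples, versions,
                                  ["api_calls", "strings", "sections"])
print(selected, best)
```

On this toy data the selection keeps only `api_calls`, because adding the misleading `strings` category or the constant `sections` category does not improve agreement with the known versions.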

Original language: English
Article number: 8951062
Pages (from-to): 4545-4556
Number of pages: 12
Journal: IEEE Systems Journal
Volume: 14
Issue number: 3
Publication status: Published - 2020 Sep

Bibliographical note

Funding Information:
This work was supported in part by the Institute of Information and Communications Technology Planning and Evaluation under Grant 2017-0-00158 (Development of Cyber Threat Intelligence Analysis and Information Sharing Technology for National Cyber Incident Response) funded by the Korea Government (Ministry of Science, ICT and Future Planning) and in part by the Institute for Information and Communications Technology Promotion under Grant 2018-0-00513 (Machine Learning Based Automation of Vulnerability Detection on Unix-Based Kernel) funded by the Korea Government (Ministry of Science, ICT and Future Planning).

Manuscript received June 25, 2019; revised November 12, 2019; accepted December 6, 2019. Date of publication January 7, 2020; date of current version September 2, 2020. (Corresponding author: Taekyoung Kwon.) L. H. Park, J. Yu, and T. Kwon are with Yonsei University, Seoul 03722, South Korea (e-mail: dofi@yonsei.ac.kr; symnoisy@yonsei.ac.kr; taekyoung@yonsei.ac.kr).

Publisher Copyright:
© 2007-2012 IEEE.

All Science Journal Classification (ASJC) codes

  • Control and Systems Engineering
  • Information Systems
  • Computer Science Applications
  • Computer Networks and Communications
  • Electrical and Electronic Engineering
