Improving cross-platform binary analysis using representation learning via graph alignment

Geunwoo Kim, Sanghyun Hong, Michael Franz, Dokyung Song

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

1 Citation (Scopus)

Abstract

Cross-platform binary analysis requires a common representation of binaries across platforms, on which a specific analysis can be performed. Recent work proposed to learn low-dimensional, numeric vector representations (i.e., embeddings) of disassembled binary code, and to perform binary analysis in the embedding space. However, existing techniques fall short in that they are either (i) specific to a single platform, producing embeddings that are not aligned across platforms, or (ii) not designed to capture the rich contextual information available in a disassembled binary. We present a novel deep learning-based method, XBA, which addresses both problems. To this end, we first abstract binaries as typed graphs, dubbed binary disassembly graphs (BDGs), which encode control-flow and other rich contextual information of different entities found in a disassembled binary, including basic blocks, external functions called, and string literals referenced. We then formulate binary code representation learning as a graph alignment problem, i.e., finding the node correspondences between BDGs extracted from two binaries compiled for different platforms. XBA uses graph convolutional networks to learn the semantics of each node, (i) using its rich contextual information encoded in the BDG, and (ii) aligning its embeddings across platforms. Our formulation allows XBA to learn semantic alignments between two BDGs in a semi-supervised manner, requiring only a limited number of node pairs to be aligned across platforms for training. Our evaluation shows that XBA can learn semantically rich embeddings of binaries aligned across platforms without a priori platform-specific knowledge. When trained with only 50% of the oracle alignments, XBA was able to predict, on average, 75% of the rest. Our case studies further show that the learned embeddings encode knowledge useful for cross-platform binary analysis.
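The graph alignment formulation described in the abstract can be illustrated with a small toy sketch. This is not XBA's implementation: it uses no learned weights and no training, only one unweighted neighbor-averaging step in the style of a graph convolution, followed by nearest-neighbor matching in the resulting embedding space. All node names, node types, edges, and helper functions below are hypothetical and chosen only for illustration.

```python
# Toy sketch of "alignment via node embeddings" (NOT the XBA implementation).
# Assumptions: node types are one-hot encoded, edges are treated as undirected,
# and a single averaging step stands in for a trained graph convolutional network.

NODE_TYPES = {"block": [1.0, 0.0, 0.0],
              "extfunc": [0.0, 1.0, 0.0],
              "string": [0.0, 0.0, 1.0]}

def propagate(features, adjacency):
    """Average each node's feature vector with its neighbors' (self-loop included)."""
    out = {}
    for node, vec in features.items():
        group = [vec] + [features[n] for n in adjacency.get(node, [])]
        out[node] = [sum(col) / len(group) for col in zip(*group)]
    return out

def l1_distance(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def align(emb_src, emb_dst):
    """Map every source node to its nearest target node in embedding space."""
    return {s: min(emb_dst, key=lambda t: l1_distance(vec, emb_dst[t]))
            for s, vec in emb_src.items()}

def hits_at_1(predicted, oracle):
    """Fraction of oracle pairs whose source node was matched to the right target."""
    return sum(predicted.get(s) == t for s, t in oracle.items()) / len(oracle)

def build(prefix):
    """Hypothetical four-node graph: entry block, loop block, external call, string."""
    n = {k: f"{prefix}_{k}" for k in ("entry", "loop", "printf", "str")}
    features = {n["entry"]: NODE_TYPES["block"], n["loop"]: NODE_TYPES["block"],
                n["printf"]: NODE_TYPES["extfunc"], n["str"]: NODE_TYPES["string"]}
    adjacency = {n["entry"]: [n["loop"]],
                 n["loop"]: [n["entry"], n["printf"], n["str"]],
                 n["printf"]: [n["loop"]],
                 n["str"]: [n["loop"]]}
    return features, adjacency

# "a" and "b" stand for the same program compiled for two platforms.
emb_x86 = propagate(*build("a"))
emb_arm = propagate(*build("b"))

predicted = align(emb_x86, emb_arm)
oracle = {f"a_{k}": f"b_{k}" for k in ("entry", "loop", "printf", "str")}
score = hits_at_1(predicted, oracle)
```

In this toy input the two graphs are structurally identical, so the context-aware embeddings coincide and the two block nodes, which share a type, become distinguishable only through their neighbors. Real binaries differ across platforms, which is why a learned, semi-supervised model is needed rather than a fixed averaging step.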

Original language: English
Title of host publication: ISSTA 2022 - Proceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis
Editors: Sukyoung Ryu, Yannis Smaragdakis
Publisher: Association for Computing Machinery, Inc
Pages: 151-163
Number of pages: 13
ISBN (Electronic): 9781450393799
Publication status: Published - 2022 Jul 18
Event: 31st ACM SIGSOFT International Symposium on Software Testing and Analysis, ISSTA 2022 - Virtual, Online, Korea, Republic of
Duration: 2022 Jul 18 to 2022 Jul 22

Publication series

Name: ISSTA 2022 - Proceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis

Conference

Conference: 31st ACM SIGSOFT International Symposium on Software Testing and Analysis, ISSTA 2022
Country/Territory: Korea, Republic of
City: Virtual, Online
Period: 22/7/18 to 22/7/22

Bibliographical note

Funding Information:
We thank the anonymous reviewers for their insightful feedback. This material is based upon work partly supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) under awards 2021R1F1A1057436 and 2022R1C1C1003551, the Office of Naval Research (ONR) under awards N00014-21-1-2409 and N00014-17-1-2232, the Defense Advanced Research Projects Agency (DARPA) under award N66001-20-C-4027, and the Defense Advanced Research Projects Agency (DARPA) Small Business Technology Transfer (STTR) Program Office under contracts W31P4Q-20-C-0052 and W912CG-21-C-0020. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of NRF, MSIT, ONR, DARPA, the DARPA STTR Program Office, or any other South Korea and U.S. government agency. We also gratefully acknowledge an "Endeavor" research award from the Donald Bren School of Information and Computer Sciences at UC Irvine, and an award from the Yonsei University Research Fund of 2021 (2021-22-0039).

Publisher Copyright:
© 2022 ACM.

All Science Journal Classification (ASJC) codes

  • Computational Theory and Mathematics
  • Computer Science Applications
  • Software
