TY - GEN
T1 - A Transformer-based Function Symbol Name Inference Model from an Assembly Language for Binary Reversing
AU - Kim, Hyunjin
AU - Bak, Jinyeong
AU - Cho, Kyunghyun
AU - Koo, Hyungjoon
N1 - Publisher Copyright: © 2023 ACM.
PY - 2023/7/10
Y1 - 2023/7/10
N2 - Reverse engineering of a stripped binary has a wide range of applications, yet it is challenging mainly due to the lack of contextually useful information within. Once debugging symbols (e.g., variable names, types, function names) are discarded, recovering such information is not technically viable with traditional approaches like static or dynamic binary analysis. We focus on a function symbol name recovery, which allows a reverse engineer to gain a quick overview of an unseen binary. The key insight is that a well-developed program labels a meaningful function name that describes its underlying semantics well. In this paper, we present AsmDepictor, the Transformer-based framework that generates a function symbol name from a set of assembly codes (i.e., machine instructions), which consists of three major components: binary code refinement, model training, and inference. To this end, we conduct systematic experiments on the effectiveness of code refinement that can enhance an overall performance. We introduce the per-layer positional embedding and Unique-softmax for AsmDepictor so that both can aid to capture a better relationship between tokens. Lastly, we devise a novel evaluation metric tailored for a short description length, the Jaccard∗ score. Our empirical evaluation shows that the performance of AsmDepictor by far surpasses that of the state-of-the-art models up to around 400%. The best AsmDepictor model achieves an F1 of 71.5 and Jaccard∗ of 75.4.
AB - Reverse engineering of a stripped binary has a wide range of applications, yet it is challenging mainly due to the lack of contextually useful information within. Once debugging symbols (e.g., variable names, types, function names) are discarded, recovering such information is not technically viable with traditional approaches like static or dynamic binary analysis. We focus on a function symbol name recovery, which allows a reverse engineer to gain a quick overview of an unseen binary. The key insight is that a well-developed program labels a meaningful function name that describes its underlying semantics well. In this paper, we present AsmDepictor, the Transformer-based framework that generates a function symbol name from a set of assembly codes (i.e., machine instructions), which consists of three major components: binary code refinement, model training, and inference. To this end, we conduct systematic experiments on the effectiveness of code refinement that can enhance an overall performance. We introduce the per-layer positional embedding and Unique-softmax for AsmDepictor so that both can aid to capture a better relationship between tokens. Lastly, we devise a novel evaluation metric tailored for a short description length, the Jaccard∗ score. Our empirical evaluation shows that the performance of AsmDepictor by far surpasses that of the state-of-the-art models up to around 400%. The best AsmDepictor model achieves an F1 of 71.5 and Jaccard∗ of 75.4.
KW - assembly
KW - function name
KW - neural networks
KW - reversing
KW - Transformer
UR - http://www.scopus.com/inward/record.url?scp=85168148901&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85168148901&partnerID=8YFLogxK
U2 - 10.1145/3579856.3582823
DO - 10.1145/3579856.3582823
M3 - Conference contribution
AN - SCOPUS:85168148901
T3 - Proceedings of the ACM Conference on Computer and Communications Security
SP - 951
EP - 965
BT - ASIA CCS 2023 - Proceedings of the 2023 ACM Asia Conference on Computer and Communications Security
PB - Association for Computing Machinery
T2 - 18th ACM ASIA Conference on Computer and Communications Security, ASIA CCS 2023
Y2 - 10 July 2023 through 14 July 2023
ER -