Binary code is one of the most common forms in which software is distributed. Reverse engineering of binary code is therefore an essential foundation for tasks such as software security protection, intellectual property protection, and software supply chain analysis. Compared with source code, binary code lacks a large amount of information, which poses substantial challenges to reverse engineering. In binary code representation, existing machine learning approaches produce incomplete representations of binary code, leading to poor performance in downstream tasks such as binary code similarity detection. In binary code structure recovery, current approaches recover function call graphs at high cost and with low precision. In binary code type recovery, current approaches recover struct definitions imprecisely and scale poorly.

To address these challenges, this dissertation studies three aspects of binary code reverse engineering: code representation, code structure recovery, and code type recovery. The main research contents and contributions are as follows:

1. In binary code representation, this dissertation proposes a machine language model that integrates domain knowledge and generates high-quality, knowledge-aware representations of binary code. Explicit knowledge is extracted from instruction set definitions and supplied as additional input to the Transformer-based machine language model, while implicit knowledge is integrated through pre-training tasks that model instruction boundaries and dependencies between instructions. The prototype system, kTrans, significantly outperforms existing approaches on three tasks: binary code similarity detection, function signature recovery, and indirect call identification.

2. In binary code structure recovery, this dissertation proposes a function call graph recovery approach that combines transfer learning and contrastive learning to precisely identify indirect calls in binary code. Indirect call identification is modeled as a question-answering problem: contrastive learning matches the contexts of callers and callees, and transfer learning pre-trains the model on direct call data to strengthen indirect call identification. The prototype system, CALLEE, precisely identifies indirect calls in complex binary code and significantly improves the performance of downstream tasks such as binary code similarity detection and hybrid fuzzing.

3. In binary code type recovery, this dissertation proposes an inter-procedural struct definition recovery approach that recovers struct definitions from binary code, including the types and number of struct members. A struct definition dataset is generated automatically by aligning unstripped binaries (with symbols) and stripped binaries (without symbols), a machine language model identifies the access points of structs of the same type, and a large language model is then fine-tuned to recover the struct definitions. The prototype system, StructLM, recovers struct definitions in inter-procedural scenarios and achieves better recovery results than existing large language models.

In summary, this dissertation explores binary code representation, binary code structure recovery, and binary code type recovery, and provides effective solutions for binary code reverse engineering.
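The idea of matching caller and callee contexts with contrastive learning, mentioned above for CALLEE, can be illustrated with a small sketch. This is not the CALLEE implementation: the embedding vectors, the function names (`rank_callees`, `info_nce_loss`), the candidate names, and the temperature value are all hypothetical, and the InfoNCE-style loss is used here only as one common contrastive objective. The sketch assumes that some encoder has already mapped a callsite context and each candidate callee to a fixed-length vector.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_callees(callsite_vec, callee_vecs):
    """Sort candidate callee names by similarity to the callsite, best first."""
    scores = {name: cosine(callsite_vec, vec) for name, vec in callee_vecs.items()}
    return sorted(scores, key=scores.get, reverse=True)

def info_nce_loss(callsite_vec, callee_vecs, true_callee, temperature=0.1):
    """InfoNCE-style loss: low when the true callee scores above the others."""
    logits = {name: cosine(callsite_vec, vec) / temperature
              for name, vec in callee_vecs.items()}
    # Numerically stable log-sum-exp over all candidates.
    max_logit = max(logits.values())
    log_sum = max_logit + math.log(
        sum(math.exp(value - max_logit) for value in logits.values()))
    return log_sum - logits[true_callee]

# Hypothetical embeddings: one indirect callsite and three candidate callees.
callsite = [0.9, 0.1, 0.0]
candidates = {
    "handler_a": [1.0, 0.0, 0.0],
    "handler_b": [0.0, 1.0, 0.0],
    "handler_c": [0.0, 0.0, 1.0],
}

ranking = rank_callees(callsite, candidates)
loss_correct = info_nce_loss(callsite, candidates, "handler_a")
loss_wrong = info_nce_loss(callsite, candidates, "handler_c")
print(ranking, loss_correct, loss_wrong)
```

In such a setup, training would push the loss down for observed caller and callee pairs (for example, pairs taken from direct calls during pre-training), so that at inference time the highest-ranked candidates for an indirect callsite are its likely targets.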