登录 EN

添加临时用户

后量子密码算法Falcon的高性能硬件设计关键技术

Key Technologies on High-Performance Hardware Design of Post-Quantum Cryptography Algorithm Falcon

作者:欧阳屹
  • 学号
    2020******
  • 学位
    硕士
  • 电子邮箱
    YiO******com
  • 答辩日期
    2023.05.19
  • 导师
    魏少军
  • 学科名
    电子信息
  • 页码
    71
  • 保密级别
    公开
  • 培养单位
    026 集成电路学院
  • 中文关键词
    快速傅里叶采样,格基密码,Falcon,递归,数论变换
  • 英文关键词
    Fast Fourier Sampling,Lattice cryptography,Falcon,Recursion,NTT

摘要

随着量子计算机的稳步发展和后量子密码算法的标准逐步确立,学术界和工业界开始关注现有密码系统到后量子密码系统的迁移,但在迁移过程中,如何通过算法-硬件的协同设计使得计算更加复杂、存储开销更高的后量子密码算法适配 各种已有的协议与密码系统成为当前该领域的核心技术挑战。Falcon 算法由于其性能及密钥尺寸方面的优势已经被确定为国际后量子密码算法的标准之一。但由于其相对复杂的计算与控制模式,以及浮点操作的引入,造成目前面向 Falcon 算法的高性能实现研究仍然不足。在此背景下,本文在对 Falcon 算法的计算存储模 式进行充分分析的基础上,提出了面向 Falcon 算法的高性能处理架构,以及基于任务驱动的资源调度机制,并通过 TSMC 28nm CMOS 的标准工艺库对本文提出的关键技术进行全面验证。在高性能 Falcon 处理架构设计过程中,分别针对最为复杂的哈希数据通路、算术计算及采样数据通路开展了深入研究。首先,提出了可扩展哈希计算电路,实现了对不同配置的多参数哈希函数的支持。其次,针对 Falcon 算法中算术操作类型多样、参数规模差异明显的特点,提出了兼容浮点乘法与蒙哥马利模乘的电路 结构,在保持关键路径不变的情况下,将乘法器单元数量需求降低 50%。进一步 设计了支持 FFT/NTT 的统一蝶形单元,通过配置可支持整数、有限域、浮点及复 数域的基本算术操作。通过统一的向量化原位计算方法,最终将算法在频域上分解与合并非原位计算完成适配。最后,针对 Falcon 中的采样计算,提出了一种中止响应机制来均衡处理采样电路中随机数更新与消耗。为了进一步挖掘本文中硬件架构的性能提升潜力,提高硬件资源的利用效率,本文进一步提出了基于任务驱动的调度机制。对于控制通路中最为复杂的递归流程,设计了基于状态转换的递归展开方式,以较低的的开销高效完成递归,并在兼顾任务灵活性的前提下实现了签名与验签功能的任务实现。为了进一步对本文中的关键技术进行验证,在TSMC 28 nm CMOS 工艺下进行硬件实现评估。该芯片工作频率为 500MHz,后端布局布线后的面积为3.2?𝑚2。在保持抗时序攻击的恒定实现下,Falcon算法的签名与验签性能分别达到了5255次/秒和 245459 次/秒,相比于目前的CPU实现,分别提升了1.18倍和9.64倍。相比于目前仅有的验签硬件设计,速度提升1.87倍。

With the steady development of quantum computers and the gradual establishment of standards for post-quantum cryptographic algorithms, academia and industry have begun to pay attention to the migration of existing cryptographic systems to post-quantum cryptographic systems. The adaptation of post-quantum cryptography algorithms, which make calculations more complex and have higher storage overhead, to various existing protocols and cryptosystems has become a core technical challenge in this field. Due to its advantages in performance and key size, the Falcon algorithm has been identified as one of the international standards for post-quantum cryptography algorithms. However, due to its relatively complex calculation and control mode, as well as the introduction of floating-point operations, research on the high-performance implementation of the Falcon algorithm is still insufficient. In this context, on the basis of a full analysis of the calculation and storage mode of the Falcon algorithm, this paper proposes a high-performance processing architecture for the Falcon algorithm and a task-driven resource scheduling mechanism, and through the TSMC 28nm CMOS standard process library The key technologies proposed in this paper are fully verified.During the design process of the high-performance Falcon processing architecture, in-depth research was carried out on the most complex hash data path, arithmetic calculation and sampling data path. First, a scalable hash computing circuit is proposed, which realizes the support for multi-parameter hash functions with different configurations. Secondly, in view of the characteristics of various types of arithmetic operations and obvious differences in parameter scale in the Falcon algorithm, a circuit structure compatible with floating-point multiplication and Montgomery modular multiplication is proposed, and the number of multiplier units is reduced by 50% while keeping the critical path unchanged. A unified butterfly unit supporting FFT/NTT is further designed, which can support basic arithmetic operations in integer, finite field, floating point and complex number fields through configuration. Through the unified vectorized in-situ calculation method, the algorithm is finally decomposed and combined in the frequency domain to complete the adaptation. Finally, for the sampling calculation in Falcon, an abort response mechanism is proposed to balance the update and consumption of random numbers in the sampling circuit.In order to further tap the performance improvement potential of the hardware architecture in this paper and improve the utilization efficiency of hardware resources, this paper further proposes a task-driven scheduling mechanism. For the most complex recursive process in the control path, a recursive expansion method based on state transition is designed to complete the recursion efficiently with low overhead, and realize the task realization of signature and signature verification functions under the premise of taking into account the task flexibility.In order to further verify the key technologies in this paper, the hardware implementation evaluation is carried out under the TSMC 28 nm CMOS process. The operating frequency of the chip is 500MHz, and the area after layout and wiring of the back end is 3.2?𝑚2. Under the constant implementation of anti-timing attack, the signature and verification performance of the Falcon algorithm reached 5255 times/s and 245459 times/s, respectively, compared with the current CPU implementation, they were increased by 1.18 times and 9.64 times, respectively. Compared with the current only signature verification hardware design, the speed is increased by 1.87 times.