Research on Software-Hardware Co-Design and Compiling Techniques for In-Memory Neural Network Accelerators

Author: 韩建辉 (Han Jianhui)
  • Student ID
    2016******
  • Degree
    Doctoral (Ph.D.)
  • Email
    han******.cn
  • Defense Date
    2021.05.20
  • Advisor
    李兆麟 (Li Zhaolin)
  • Discipline
    Electronic Science and Technology
  • Pages
    129
  • Confidentiality Level
    Public
  • Department
    026 School of Integrated Circuits
  • Keywords
    Neural Network, Resistive Random-Access Memory (ReRAM), Long Short-Term Memory, Compilation, Polyhedral Model

Abstract

Conventional processors can no longer deliver the computation power and energy efficiency that neural network applications demand, which has spurred the development of domain-specific neural network accelerators. Such accelerators, however, face the difficulties of increasing parallelism and memory bandwidth. Owing to the high density, low power, and in-situ computation of Resistive Random-Access Memory (ReRAM), processing-in-memory accelerators built on ReRAM crossbars achieve good performance and have become a promising class of neural network accelerators.

Although research on optimizing the performance of such accelerators has been carried out worldwide, ReRAM-based processing-in-memory acceleration of neural networks still faces challenges in architecture, software-hardware co-optimization, and compilation. The first challenge is supporting multiple operator types: recurrent neural networks such as Long Short-Term Memory (LSTM) networks employ a variety of basic operators, and implementing them efficiently requires microarchitectural innovation. The second challenge is design optimization: overcoming the effects of the multiple noise sources in the hardware while balancing inference accuracy against computation efficiency requires software-hardware co-optimization. The third challenge is efficient programming: improving the programmability of the accelerator requires efficient, automatic compilation of neural network workloads onto ReRAM-based processing-in-memory accelerators, which calls for innovation at the compilation level. In response to these challenges, this dissertation makes the following contributions.

(1) At the microarchitectural level, this dissertation proposes a new ReRAM-based LSTM accelerator architecture that uses neural network approximators to implement the vector computations in LSTM, so that the functional units in the system share a consistent architecture. This reduces the hardware overhead of digital-to-analog and analog-to-digital conversion and improves the computation efficiency of the system. Evaluation results show that the proposed architecture improves energy efficiency by more than 6 times over a state-of-the-art ReRAM-based LSTM accelerator.

(2) At the co-optimization level, this dissertation proposes a complete hardware-aware model training and fine-tuning scheme that reduces the impact of hardware constraints and improves inference accuracy. By improving the on-chip data transmission mechanism, the proposed LSTM accelerator architecture can also execute compressed LSTM models, yielding a further gain in hardware execution efficiency. Evaluation results show that, under a reasonable accuracy trade-off, the proposed compressed models achieve a 1.6-2.4 times improvement in computation efficiency over existing dense models.

(3) At the compilation level, this dissertation proposes a source-to-source neural network compilation framework based on the polyhedral model, which automatically detects and maps the neural network operators in the input source code and exploits optimization opportunities at both the operator and network levels to improve system performance. Evaluation results under multiple configurations show that, by supporting on-chip pipelined execution, the proposed compilation framework achieves performance improvements over existing work.
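To make the operator-diversity challenge concrete, the standard textbook LSTM cell equations are reproduced below: the W/U matrix-vector products are the part that maps naturally onto ReRAM crossbars, while the gate activations and elementwise updates are the vector computations that the proposed architecture handles with neural network approximators. The notation is the conventional one, not necessarily that used in the dissertation.

```latex
% Textbook LSTM cell: x_t is the input, h_t the hidden state, c_t the cell state.
\begin{align*}
i_t &= \sigma\!\left(W_i x_t + U_i h_{t-1} + b_i\right)  &\text{(input gate)}\\
f_t &= \sigma\!\left(W_f x_t + U_f h_{t-1} + b_f\right)  &\text{(forget gate)}\\
o_t &= \sigma\!\left(W_o x_t + U_o h_{t-1} + b_o\right)  &\text{(output gate)}\\
g_t &= \tanh\!\left(W_g x_t + U_g h_{t-1} + b_g\right)   &\text{(candidate state)}\\
c_t &= f_t \odot c_{t-1} + i_t \odot g_t                 &\text{(elementwise update)}\\
h_t &= o_t \odot \tanh(c_t)                              &\text{(hidden state)}
\end{align*}
```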
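As an illustration of what hardware-aware training and fine-tuning can look like in practice, the sketch below injects weight quantization and a device-noise perturbation into the forward pass while letting gradients update the full-precision weights (a straight-through estimator). It is a minimal sketch assuming PyTorch; the bit width, Gaussian noise model, and the name quantize_with_noise are illustrative assumptions, not the dissertation's actual algorithm.

```python
import torch

def quantize_with_noise(w: torch.Tensor, bits: int = 4, noise_std: float = 0.02) -> torch.Tensor:
    """Emulate mapping weights onto discrete ReRAM conductance levels plus
    Gaussian programming noise (illustrative model, not the dissertation's
    exact noise characterization)."""
    levels = 2 ** bits - 1
    w_max = w.detach().abs().max().clamp(min=1e-8)
    w_q = torch.round(w / w_max * levels) / levels * w_max   # uniform quantization
    w_q = w_q + noise_std * w_max * torch.randn_like(w)      # device noise
    # Straight-through estimator: forward with the perturbed weights,
    # backward through the original full-precision weights.
    return w + (w_q - w).detach()

# During fine-tuning, each weight matrix is perturbed before use, e.g.:
#   y = x @ quantize_with_noise(layer.weight).t() + layer.bias
# so the model learns to tolerate the accelerator's non-idealities.
```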
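To illustrate what "detecting and mapping operators in the input source code" involves, the sketch below shows the kind of plain affine loop nest a polyhedral front end can analyze: the loop bounds and array subscripts are affine, so the iteration domain and access functions can be extracted and matched against an operator template such as a matrix-vector product. The code and the reram.mvm call in the trailing comment are hypothetical illustrations, not the framework's actual input language or API.

```python
import numpy as np

def gate_preactivation(W: np.ndarray, x: np.ndarray, b: np.ndarray) -> np.ndarray:
    """y = W @ x + b written as an explicit affine loop nest, the form a
    polyhedral source-to-source compiler can recognize as a matrix-vector
    multiplication operator."""
    rows, cols = W.shape
    y = np.zeros(rows)
    for i in range(rows):        # output elements, independent of each other
        acc = b[i]
        for j in range(cols):    # reduction over the input dimension
            acc += W[i, j] * x[j]
        y[i] = acc
    return y

# Once the loop nest is matched, a compiler in this style can replace it with
# a single accelerator primitive and schedule successive operators into an
# on-chip pipeline, e.g. (hypothetical API):
#   y = reram.mvm(W_tile_handle, x)    # one analog crossbar operation
```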