


High Energy Efficiency Neural Network Processor with Combined Digital and Computing-in-Memory Architecture

Author: 岳金山
  • Student ID
    2016******
  • Degree
    Ph.D.
  • Email
    yjs******com
  • Defense Date
    2021.05.24
  • Advisor
    刘勇攀
  • Discipline
    Electronic Science and Technology
  • Pages
    113
  • Confidentiality Level
    Public
  • Department
    023 Department of Electronic Engineering
  • Keywords
    Neural Network Processor, High Energy Efficiency, Application Specific Integrated Circuits, Computing-in-Memory, System on Chip

Abstract

Neural network (NN) algorithms have driven the rapid development of modern artificial intelligence (AI), achieving markedly higher accuracy than conventional algorithms, and have been widely applied in public security, medical care, finance, transportation and navigation, agriculture, and many other fields. However, their high computational and memory-access complexity poses a serious challenge to the performance and power consumption of NN processors. Energy-efficient NN processors have therefore become an urgent requirement for practical NN applications on a wide range of low-power AI devices. To address this challenge, this dissertation investigates both pure-digital and combined digital-and-computing-in-memory (digital-CIM) solutions, and carries out four major studies.

For pure-digital NN processors, this dissertation first analyzes the insufficient data reuse of conventional architectures and proposes KOP3, a convolutional NN processor optimized for specific kernels. With a kernel-optimized computation array and circulant local-buffer access scheduling, KOP3 achieves a 16.7x reduction in on-chip local buffer area or a 4.92x reduction in global SRAM accesses compared with contemporary work. Second, to avoid the significant power and area overhead incurred by irregular sparsity compression, this dissertation adopts the structured frequency-domain compression algorithm CirCNN and proposes a globally parallel, bit-serial FFT circuit, a low-power blocked transpose SRAM (TRAM), and a frequency-domain two-dimensional data-reuse array, compressing both storage and computation in a regular manner. The fabricated processor STICKER-T achieves 8.1x higher area efficiency and 4.2x higher energy efficiency than representative contemporary work.

For digital-CIM NN processors, this dissertation combines the flexibility of digital circuits with the high energy efficiency of CIM IP to further improve system energy efficiency. First, with block-wise structured weight sparsity and dynamic activation sparsity, efficient intra-/inter-core data reuse and network mapping strategies, and a CIM IP whose ADCs can be dynamically powered off, the fabricated CIM system-on-chip STICKER-IM implements sparsity compression on a CIM chip for the first time and achieves a system energy efficiency of 2.9-37.5 TOPS/W. Second, targeting the gap between existing work and practical large-model applications, namely the performance degradation caused by weight updating and the insufficient exploitation of sparsity, this dissertation proposes set-associative block-wise sparsity circuits, ping-pong CIM updating circuits, and an ADC with adjustable sampling precision. The fabricated STICKER-IM2 is the first CIM processor to account for the cost of weight updating; it demonstrates high energy efficiency together with competitive accuracy on the ImageNet dataset, and achieves 6.35x higher energy efficiency than representative contemporary work.

These studies demonstrate that combining digital circuits with CIM is a feasible route toward more energy-efficient NN processors. Fully exploiting the respective strengths of digital and CIM architectures, together with joint optimization across devices, circuits, architectures, and algorithms, can deliver higher energy efficiency and promote the large-scale deployment of low-power AI devices.
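The frequency-domain compression idea behind CirCNN can be illustrated in software: when each weight block is a circulant matrix, only its first column needs to be stored, and the block's matrix-vector product reduces to elementwise multiplication between FFTs. The NumPy sketch below is purely illustrative (all sizes, names, and the random data are assumptions, not the thesis's hardware implementation), but it shows the k-fold storage reduction and the FFT-based compute path the abstract refers to:

```python
import numpy as np

def circulant(c):
    """Build a k x k circulant matrix whose column j is c rolled by j."""
    k = len(c)
    return np.array([np.roll(c, i) for i in range(k)]).T

# Hypothetical sizes: a 4 x 4 grid of 8 x 8 circulant blocks.
k, rows, cols = 8, 4, 4
rng = np.random.default_rng(0)
first_cols = rng.standard_normal((rows, cols, k))  # only k values stored per block
x = rng.standard_normal(cols * k)

# Dense path: materialize the full (rows*k) x (cols*k) weight matrix.
W = np.block([[circulant(first_cols[i, j]) for j in range(cols)]
              for i in range(rows)])
y_dense = W @ x

# Frequency-domain path: a circulant multiply is a circular convolution,
# so per block it becomes IFFT(FFT(first_col) * FFT(x_block)).
xf = np.fft.fft(x.reshape(cols, k), axis=1)          # (cols, k)
cf = np.fft.fft(first_cols, axis=2)                  # (rows, cols, k)
y_freq = np.fft.ifft((cf * xf[None, :, :]).sum(axis=1), axis=1).real.reshape(-1)

assert np.allclose(y_dense, y_freq)  # both paths agree
```

In hardware, the FFT/IFFT stages and the elementwise frequency-domain products are what STICKER-T's bit-serial FFT circuit and two-dimensional data-reuse array accelerate; the regularity of the compression is what avoids the bookkeeping overhead of irregular sparsity.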
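The block-wise weight sparsity and dynamic activation sparsity combined in STICKER-IM can likewise be sketched in software. The toy function below (a software analogy with an assumed block size, not the chip's circuits) stores only the non-pruned weight blocks and skips any block whose input activations are all zero, which is the compute-gating effect the hardware exploits:

```python
import numpy as np

BLK = 4  # hypothetical block size for structured weight sparsity

def sparse_block_matvec(w_blocks, x, n_row_blocks):
    """Matrix-vector product over block-wise sparse weights.

    w_blocks: dict mapping (row_blk, col_blk) -> dense BLK x BLK array;
    pruned blocks are simply absent, so they cost no storage or compute.
    Blocks whose activation slice is all zero are skipped at run time
    (dynamic activation sparsity).
    """
    y = np.zeros(n_row_blocks * BLK)
    for (r, c), blk in w_blocks.items():
        x_blk = x[c * BLK:(c + 1) * BLK]
        if not x_blk.any():          # dynamic activation sparsity: skip zero inputs
            continue
        y[r * BLK:(r + 1) * BLK] += blk @ x_blk
    return y

# Tiny demo: only two of the four blocks survive pruning.
w_blocks = {(0, 0): np.eye(BLK), (1, 1): 2.0 * np.eye(BLK)}
x = np.arange(2.0 * BLK)
y = sparse_block_matvec(w_blocks, x, n_row_blocks=2)
assert np.allclose(y, np.concatenate([x[:BLK], 2.0 * x[BLK:]]))
```

The set-associative variant in STICKER-IM2 further constrains which block positions may be non-zero, so that the block index bookkeeping stays cheap enough to implement next to the CIM arrays.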