


Design of Neural Network Accelerator Based on Sparse Decomposition of Convolution Kernel

Author: Li Yixuan
  • Student ID
    2019******
  • Degree
    Master
  • Email
    945******com
  • Defense date
    2022.05.23
  • Advisor
    Wei Shaojun
  • Discipline
    Integrated Circuit Engineering
  • Pages
    79
  • Confidentiality level
    Public
  • Affiliation
    026 School of Integrated Circuits
  • Keywords (Chinese)
    convolutional neural network, quantized neural network, hardware accelerator, RLC coding
  • Keywords (English)
    CNN, ASIC, RLC

Abstract

With the growth of computing power in recent years, deep neural networks have seen a new wave of popularity, achieving better results than traditional methods in many fields such as image recognition, object detection, and semantic detection. However, the compute- and memory-intensive nature of deep neural networks poses great challenges to their deployment. Quantized neural networks are a widely used way to address this problem: quantizing the weight arrays or feature maps to low-bit-width fixed-point values reduces the storage and computation requirements to some extent. In practice, however, the characteristics of quantized neural networks still leave substantial room for hardware optimization that conventional hardware computation cannot exploit.

Targeting hardware-accelerated computation of quantized convolutional neural networks, this thesis adopts a convolution-kernel sparse decomposition method: the original weight array is decomposed, according to the actual weight values, into multiple sub-weight arrays, and the convolution of each sub-weight array with the input feature map is computed separately and then accumulated. This allows the convolution to be executed in order of actual weight value while greatly reducing the number of multiplication operations. Combining the sparse decomposition with the properties of max-pooling layers, a max-pooling prediction method is designed that predicts the maximum value in advance to cut invalid computation. For the weight-decomposition step of the sparse decomposition method, a coprime splitting scheme exploits the coprime property of the sub-weight arrays to reduce the computation of the decomposition process. Finally, because the sparsity of the multiple sub-weight arrays varies over a wide range, the commonly used RLC (run-length coding) sparse-encoding scheme cannot achieve good compression in this scenario owing to its fixed index bit width; this thesis therefore designs an improved RLC coding method that, under 5-bit quantization, reduces the storage requirement of the sub-weight arrays by 35.5% compared with conventional RLC.

Combining the above optimizations, this thesis designs a multi-bit-width convolution-kernel sparse-decomposition hardware accelerator. It adopts an output-stationary dataflow, achieving 100% PE utilization both before and after max-pooling prediction, and it stores the sub-weight arrays in sparse-encoded form, skipping both the energy and the latency cost of zero weights during computation. According to synthesis results, the overall circuit area is 0.6674 mm²; because the PEs contain no multipliers, the PE array accounts for only 16.54% of the total area.
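To make the kernel sparse decomposition concrete, below is a minimal NumPy sketch of the idea described above. It is an illustrative assumption, not the thesis's implementation: the function name, the valid-correlation layout, and the per-level loop are all my own. Each quantization level yields a binary sub-weight array whose "convolution" is pure accumulation, so the inner loop contains no multiplications; each nonzero level costs exactly one multiply of its partial sum.

```python
import numpy as np

def sparse_decomposed_conv(x, w, levels):
    """Valid cross-correlation of x (H x W) with a quantized kernel w (k x k),
    computed by decomposing w into one binary sub-weight array per level.
    Hypothetical sketch: `levels` should enumerate the nonzero quantization
    levels of w (e.g. range(1, 2 ** bits) for unsigned weights)."""
    k = w.shape[0]
    oh, ow = x.shape[0] - k + 1, x.shape[1] - k + 1
    out = np.zeros((oh, ow))
    for v in levels:                          # iterate in weight-value order
        mask = (w == v)                       # binary sub-weight array
        if v == 0 or not mask.any():
            continue                          # zero weights cost nothing
        partial = np.zeros((oh, ow))
        for i, j in zip(*np.nonzero(mask)):   # accumulate shifted input windows
            partial += x[i:i + oh, j:j + ow]  # addition only, no multiply
        out += v * partial                    # the single multiply per level
    return out
```

The result can be checked against a direct valid cross-correlation, e.g. `scipy.signal.correlate2d(x, w, mode='valid')`, which it matches for any kernel whose nonzero values are covered by `levels`.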

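The RLC limitation mentioned in the abstract is easy to see in code. The following is a sketch of conventional run-length coding with a fixed-width zero-run field; the improved scheme itself is not specified in the abstract, so only the baseline is shown, and the function names and saturation convention are assumptions. When a sub-weight array is much sparser than the index width anticipates, every saturated run spends an entire (run, value) pair on a padding zero.

```python
def rlc_encode(stream, index_bits):
    """Baseline RLC: each entry is (zero_run, value), with zero_run stored
    in a fixed index_bits-wide field.  Runs longer than 2**index_bits - 1
    are broken up by emitting a padding entry whose value is a literal 0."""
    max_run = (1 << index_bits) - 1
    pairs, run = [], 0
    for v in stream:
        if v == 0:
            if run == max_run:              # index field saturated:
                pairs.append((max_run, 0))  # spend a whole pair on zeros
                run = 0
            else:
                run += 1
        else:
            pairs.append((run, v))
            run = 0
    if run:                                 # flush trailing zeros
        pairs.append((run - 1, 0))
    return pairs

def rlc_decode(pairs):
    """Inverse of rlc_encode: expand each (zero_run, value) pair."""
    out = []
    for run, v in pairs:
        out.extend([0] * run)
        out.append(v)
    return out
```

Storage is roughly `len(pairs) * (index_bits + value_bits)`, so no single `index_bits` suits sub-weight arrays of widely varying sparsity: a small field forces padding pairs on very sparse arrays, while a large field wastes index bits on dense ones. This is the trade-off the thesis's improved RLC method targets.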