Deep neural networks have achieved unprecedented results across many areas of production and daily life, providing efficient solutions to practical problems such as computer vision and natural language processing. However, as deep neural networks grow more powerful, their structures become increasingly complex, and training them relies on large-scale GPU clusters with substantial computing power. Meanwhile, the rapid expansion of application domains makes deploying deep neural networks on resource-constrained platforms such as wearable devices, surveillance systems, and portable devices increasingly difficult. It is therefore crucial to study dedicated acceleration hardware for neural network computing. From a hardware-software co-design perspective, this work presents a standardized quantized neural network acceleration system for inference, with good generality and scalability. On the software side, a set of accelerator APIs that interface with mainstream deep learning frameworks is designed, and low-bit quantization is applied to CNNs. On the hardware side, a quantized-CNN accelerator SoC is designed with a "main processor + coprocessor" architecture: a multi-core, scalable accelerator serves as the coprocessor and supports the vast majority of CNN computations, while the main processor handles system control and the remaining computations. To improve convolution efficiency, data tiling and an optimized computation order are used to increase parallelism, and a dedicated quantization processing unit is designed to speed up fixed-point arithmetic. The fully connected layer parameters of VGG-16 and ResNet-50 are quantized to 4 bits and all other parameters to 8 bits, with final Top-1 accuracy losses of 0.7% and 0.6%, respectively. The accelerator SoC is then integrated onto a Xilinx ZCU102 FPGA development board. Running the quantized VGG-16 and ResNet-50 with the accelerator clocked at 250 MHz, the system achieves a peak performance of 512 GOPS and an energy efficiency of 94.99 GOPS/W, 3.3x and 3.2x those of comparable designs, respectively.
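To make the low-bit quantization concrete, the sketch below shows one common form of linear quantization that is consistent with the bit widths reported above (8-bit general parameters, 4-bit fully connected weights). This is an illustrative assumption, not the thesis's exact scheme: it uses symmetric per-tensor scaling derived from the maximum absolute weight, and the function names `quantize_linear` and `dequantize` are hypothetical.

```python
import numpy as np

def quantize_linear(w, bits):
    """Symmetric per-tensor linear quantization (illustrative sketch).

    Assumption: the scale maps the largest-magnitude weight to the top
    of the signed integer range; the thesis's actual scheme may differ.
    """
    qmax = 2 ** (bits - 1) - 1                     # 127 for 8 bits, 7 for 4 bits
    scale = np.max(np.abs(w)) / qmax               # float step size per integer level
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from fixed-point values."""
    return q.astype(np.float32) * scale

# Example: 8-bit quantization of random "weights"; the round-trip error
# is bounded by half a quantization step (scale / 2).
w = np.random.randn(64).astype(np.float32)
q, s = quantize_linear(w, 8)
err = np.max(np.abs(w - dequantize(q, s)))
```

After quantization, the accelerator's processing elements can operate on the narrow integers `q` directly, deferring the multiplication by `scale` until results leave the fixed-point datapath, which is the usual motivation for a dedicated quantization processing unit.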