Research on Neural Network Computation Graph Optimization and Whole-Process Compilation Implementation

Author: 张显觉
  • Student ID
    2019******
  • Degree
    Master
  • Email
    zxj******.cn
  • Defense Date
    2022.05.21
  • Advisor
    李翔宇
  • Discipline
    Integrated Circuit Engineering
  • Pages
    61
  • Confidentiality
    Public
  • Institute
    026 School of Integrated Circuits
  • Keywords
    deep learning compiler, compilation, graph-level optimization, operator scheduling, parallel execution

Abstract

As deep learning has developed, model architectures have evolved rapidly: early neural networks consisted of only a few convolution layers, while modern models can contain thousands of operators with increasingly complex computation-graph topologies. At the same time, a variety of domain-specific processors have been designed around the characteristic computation patterns of deep learning models. How to compile deep learning models so that they run efficiently on such hardware has become a pressing problem in industry. A deep learning compiler takes neural networks trained under different frameworks as input and generates optimized code for various deep learning hardware targets.

To improve the performance of neural networks deployed on hardware, this thesis first applies graph-level optimization to the network, then performs hardware-aware compilation optimization inside each operator, and finally deploys the network to run on a GPU. The graph-level optimization is hardware-independent and consists of two parts. The first is a graph-rewriting strategy comprising operator fusion, algebraic simplification, and loop fusion; by fusing operators and simplifying the network's computation, it eliminates redundant memory accesses and intermediate tensors. The second is a graph-level operator scheduling algorithm based on dynamic programming, which improves the runtime performance of the computation graph by exploiting parallel-execution opportunities between operators. To reduce the complexity of subgraph partitioning while preserving correctness, a linear clustering algorithm for neural network graphs is proposed; the dynamic-programming scheduler then partitions the clustered computation graph into subgraphs and produces an operator schedule that can be executed in parallel.

To improve the spatial locality of computation inside each operator, an auto-tuning algorithm for loop-tiling granularity is proposed: guided by hardware characteristics such as the GPU memory hierarchy, it searches for a near-optimal tile size that maximizes the compute-to-memory-access ratio.

Compared with popular deep learning compilers and frameworks on inference latency, these compilation optimizations deliver real performance gains. Nonlinear networks such as InceptionV3, NasNet, and SqueezeNet show significantly lower inference latency after compilation, with improvements of up to 72.61% over other deep learning compilers; even against a compiler with highly tuned intra-operator optimization, the proposed method achieves up to a 28.98% improvement on networks offering greater operator-level parallelism.
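To make the graph-rewriting strategy concrete, the following is a minimal sketch assuming a toy `Node` IR; the class, the rule set, and the fused operator name `conv2d_bias_relu` are illustrative stand-ins, not the thesis implementation. It fuses a conv → bias-add → relu chain into one operator and applies the algebraic simplification x * 1 → x.

```python
# Minimal graph-rewriting sketch; the Node IR and rules are illustrative,
# not the thesis implementation.
from dataclasses import dataclass, field

@dataclass
class Node:
    op: str                                     # e.g. "conv2d", "relu"
    inputs: list = field(default_factory=list)  # producer Node objects
    attrs: dict = field(default_factory=dict)

def rewrite(node: Node) -> Node:
    """Apply rewrite rules bottom-up and return the (possibly fused) node."""
    node.inputs = [rewrite(i) for i in node.inputs]

    # Operator fusion: relu(bias_add(conv2d(x, w), b)) -> one fused kernel,
    # so the bias add and activation stay in registers instead of costing
    # two extra global-memory round trips for intermediate tensors.
    if (node.op == "relu" and node.inputs
            and node.inputs[0].op == "bias_add"
            and node.inputs[0].inputs
            and node.inputs[0].inputs[0].op == "conv2d"):
        bias_add = node.inputs[0]
        conv = bias_add.inputs[0]
        return Node("conv2d_bias_relu",
                    conv.inputs + bias_add.inputs[1:], conv.attrs)

    # Algebraic simplification: x * 1 -> x removes a whole elementwise pass.
    if node.op == "mul" and len(node.inputs) == 2:
        a, b = node.inputs
        if b.op == "const" and b.attrs.get("value") == 1:
            return a
        if a.op == "const" and a.attrs.get("value") == 1:
            return b

    return node
```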
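The linear-clustering step can be sketched as follows; the adjacency-list input format and helper names are assumptions, and the thesis's dynamic-programming partitioner and its cost model are not reproduced here. Chains of operators connected by single-fan-in/single-fan-out edges are merged into one cluster, shrinking the graph the scheduler must partition; clusters with no path between them are candidates for parallel execution.

```python
# Linear-clustering sketch; input format and names are assumptions.
from collections import defaultdict, deque

def linear_clusters(succ):
    """succ: dict mapping node -> list of successor nodes (a DAG).
    Returns a list of clusters; each cluster is a maximal linear chain."""
    nodes = set(succ) | {v for vs in succ.values() for v in vs}
    indeg, preds = defaultdict(int), defaultdict(list)
    for u in nodes:
        for v in succ.get(u, []):
            indeg[v] += 1
            preds[v].append(u)

    # Kahn's algorithm yields a topological order, so every node is
    # visited after all of its predecessors.
    order = []
    q = deque(n for n in nodes if indeg[n] == 0)
    deg = dict(indeg)
    while q:
        u = q.popleft()
        order.append(u)
        for v in succ.get(u, []):
            deg[v] -= 1
            if deg[v] == 0:
                q.append(v)

    clusters, cluster_of = [], {}
    for u in order:
        p = preds[u]
        # Extend the predecessor's chain only across a 1-to-1 edge;
        # any branch or join starts a new cluster.
        if len(p) == 1 and len(succ.get(p[0], [])) == 1:
            cluster_of[p[0]].append(u)
            cluster_of[u] = cluster_of[p[0]]
        else:
            cluster_of[u] = [u]
            clusters.append(cluster_of[u])
    return clusters

# Inception-style block: a fans out to two branches that rejoin at d.
g = {"a": ["b", "c"], "b": ["b2"], "b2": ["d"], "c": ["d"], "d": []}
print(linear_clusters(g))   # [['a'], ['b', 'b2'], ['c'], ['d']]
```

In the example, the clusters ['b', 'b2'] and ['c'] have no path between them, so a scheduler may run them concurrently.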
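Once independent clusters are identified, they can be issued concurrently. The thesis generates its own GPU code; as a stand-in for illustration only, the sketch below uses PyTorch CUDA streams to overlap two independent branches on one GPU, with `branches` as an assumed interface.

```python
# Parallel-execution sketch using CUDA streams; PyTorch stands in for the
# thesis's generated runtime, and `branches` is an assumed interface.
import torch

def run_parallel(branches, x):
    """Run independent callables on separate CUDA streams so the GPU
    scheduler can overlap them, then gather their outputs."""
    streams = [torch.cuda.Stream() for _ in branches]
    outs = [None] * len(branches)
    torch.cuda.synchronize()              # make sure inputs are ready
    for i, (fn, s) in enumerate(zip(branches, streams)):
        with torch.cuda.stream(s):        # kernels launch on stream s
            outs[i] = fn(x)
    torch.cuda.synchronize()              # join all streams before reading
    return outs

if torch.cuda.is_available():
    x = torch.randn(64, 256, device="cuda")
    w1 = torch.randn(256, 256, device="cuda")
    w2 = torch.randn(256, 256, device="cuda")
    y1, y2 = run_parallel([lambda t: t @ w1, lambda t: t @ w2], x)
```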
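For the intra-operator tiling step, a minimal search sketch follows; the shared-memory budget and candidate tile sizes are illustrative assumptions, and the thesis's tuner and GPU cost model are not reproduced. For a tiled matmul C[M,N] = A[M,K] @ B[K,N], a (tm, tn, tk) tile loads tm*tk + tk*tn words into shared memory and performs tm*tn*tk multiply-adds, so the search keeps only tiles whose footprint fits and maximizes that compute-to-memory-access ratio.

```python
# Loop-tiling granularity search sketch; the shared-memory budget and the
# candidate tile sizes are illustrative assumptions, not measured values.
from itertools import product

SHARED_MEM_WORDS = 48 * 1024 // 4          # e.g. 48 KB shared memory, fp32

def best_tile(M, N, K, cands=(8, 16, 32, 64, 128)):
    """Pick the (tm, tn, tk) tile for C[M,N] = A[M,K] @ B[K,N] that fits in
    shared memory and maximizes multiply-adds per word loaded."""
    best, best_ratio = None, 0.0
    for tm, tn, tk in product(cands, repeat=3):
        if tm > M or tn > N or tk > K:
            continue
        footprint = tm * tk + tk * tn      # A tile + B tile resident at once
        if footprint > SHARED_MEM_WORDS:
            continue
        ratio = (tm * tn * tk) / footprint # compute-to-memory-access ratio
        if ratio > best_ratio:
            best, best_ratio = (tm, tn, tk), ratio
    return best, best_ratio

print(best_tile(1024, 1024, 1024))         # -> ((128, 128, 8), 64.0)
```

In this simplified model the ratio reduces to tm*tn/(tm+tn) and is independent of tk; a real tuner would additionally weigh occupancy, register pressure, and reuse across the K loop, which is why the thesis's search is guided by the full memory hierarchy rather than this single ratio.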