
Efficient Deep Neural Network Models for Visual Understanding

Author: 饶永铭
  • Student ID
    2018******
  • Degree
    Ph.D.
  • Email
    rym******.cn
  • Defense Date
    2023.05.21
  • Supervisor
    鲁继文
  • Discipline
    Control Science and Engineering
  • Pages
    130
  • Confidentiality
    Public
  • Department
    025 Department of Automation
  • Chinese Keywords
    Deep Neural Networks, Visual Content Understanding, Model Architecture Design, Model Acceleration, Dynamic Neural Networks
  • English Keywords
    Architecture Design, Deep Neural Networks, Visual Understanding, Model Acceleration, Dynamic Neural Networks

Abstract (Chinese)

Model architecture research is an important topic of computer-vision research in the deep learning era, and is of great significance for improving the performance of computer-vision methods on various visual understanding tasks. In recent years, with the rapid development of visual architectures represented by convolutional neural networks and vision Transformers, the field of visual understanding has made great progress. However, emerging deep learning models depend on powerful computing resources, and achieving reliable, accurate visual understanding under resource-constrained conditions remains difficult; many problems still limit efficiency. These are mainly: (1) accurate global relation modeling methods have low computational efficiency and are hard to extend to high-resolution images; (2) spatial interaction methods based on the self-attention mechanism have high computational cost and are hard to deploy in resource-constrained environments; (3) mainstream static model acceleration methods suffer large accuracy losses and struggle to achieve satisfactory performance at large acceleration ratios; (4) existing dynamic network architectures have slow inference speed in practical deployment and are hard to deploy efficiently. To address these four key problems, this thesis proposes four new efficient deep neural network architectures from the two perspectives of model architecture design and model inference acceleration: an efficient filtering model, a high-order interaction model, a dynamic pruning model, and a spatially sparse model. The main innovations and research contents are as follows:

1. To address the poor scalability of global modeling, an efficient vision backbone based on global filtering is proposed. The model performs efficient global information interaction with log-linear complexity via the Fast Fourier Transform, and a series of vision backbones are designed around this basic module, achieving better performance than vision Transformers while enabling efficient processing of high-resolution images.
2. To address the high computational cost of spatial interaction mechanisms, an efficient vision backbone based on high-order interactions is proposed. Built on recursive gated convolutions as its basic module, it introduces a high-order spatial interaction mechanism based on gated convolutions into vision backbones, realizing the desirable properties of vision Transformers at lower complexity and achieving better accuracy-complexity trade-offs at various model sizes.
3. To address the large accuracy loss of static pruning methods, a model acceleration method based on channel-adaptive pruning is proposed. The method designs a dynamic model pruning scheme based on deep reinforcement learning and introduces a bottom-up decision module to select effective channels, so that the model can adaptively adjust its computational cost according to the difficulty of each input sample, effectively improving recognition accuracy at large acceleration ratios.
4. To address the slow inference of dynamic neural networks, a model acceleration method based on dynamic spatial sparsification is proposed. The method builds an adaptive, progressive spatial sparsification framework to dynamically select the informative content of visual inputs, and, by exploiting the characteristics of vision Transformers and modern convolutional models, achieves significant inference speedups on a variety of hardware without significant accuracy loss.

Abstract (English)

Architecture design is a fundamental problem in computer vision in the era of deep learning, and an effective way to improve the performance of various visual understanding tasks. In recent years, with the rapid evolution of vision Transformers and convolutional neural networks, the field of visual understanding has made great progress. However, deep learning models largely rely on powerful hardware, and it is still difficult to achieve reliable and accurate visual understanding under resource-constrained conditions. Several problems still limit the efficiency of these models: (1) models based on global relation modeling usually achieve high accuracy but low computational efficiency, which makes them difficult to apply to high-resolution images; (2) Transformer-based models introduce high computational costs to achieve spatial interactions, which makes them difficult to deploy in resource-constrained environments; (3) mainstream static model acceleration methods suffer from significant accuracy drops, which makes it hard for them to achieve satisfactory performance at large acceleration ratios; (4) existing dynamic neural networks have slow inference speed on hardware, which makes them hard to deploy in real-world applications. To address these four key problems, this thesis proposes four new efficient deep neural network architectures from the perspectives of model design and model acceleration: Global Filter Networks, High-Order Spatial-Interaction Networks, a Runtime Neural Pruning model, and a Dynamic Spatial Sparsification model. The main innovations and contributions of this thesis can be summarized as follows:

1. To improve the scaling ability of models based on global relation modeling, we propose Global Filter Networks, which achieve global interactions with log-linear complexity thanks to the high efficiency of the Fast Fourier Transform. Our models achieve better performance than vision Transformers and enable efficient processing of high-resolution images.
2. To reduce the computational cost of spatial interaction operations, we propose HorNet, which is based on recursive gated convolutions. By introducing high-order spatial interactions with gated convolutions, we realize various desirable properties of vision Transformers with lower complexity. Our models achieve better accuracy-complexity trade-offs at various model sizes.
3. To mitigate the accuracy drops of static network pruning methods, we propose a new model acceleration method based on adaptive network routing. By introducing a bottom-up runtime pruning strategy and a deep reinforcement learning-based optimization framework, we can dynamically adjust model complexity based on sample difficulty, which significantly improves performance at large acceleration ratios.
4. To enable fast inference of dynamic neural networks, we propose a new model acceleration method based on dynamic spatial sparsification. By designing an adaptive spatial sparsification framework that progressively selects informative regions in images, our method can largely speed up various vision Transformers and convolutional neural networks on hardware with no significant accuracy drop.
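The first contribution rests on the fact that an elementwise product with a learnable filter in the frequency domain mixes all spatial locations at O(HW log HW) cost, whereas self-attention would need quadratic time. Below is a minimal NumPy sketch of such a global filtering layer; the function name, shapes, and the identity-filter demo are illustrative assumptions, not the thesis implementation:

```python
import numpy as np

def global_filter(x, k):
    """Global spatial mixing via the FFT (log-linear complexity).

    x: (H, W, C) real-valued feature map
    k: (H, W // 2 + 1, C) complex learnable frequency-domain filter
    """
    X = np.fft.rfft2(x, axes=(0, 1))                        # O(HW log HW)
    return np.fft.irfft2(X * k, s=x.shape[:2], axes=(0, 1))  # back to space

# Sanity check: an all-ones filter is the identity mixing.
x = np.random.rand(8, 8, 4)
k = np.ones((8, 8 // 2 + 1, 4), dtype=complex)
y = global_filter(x, k)
```

Because multiplication in the frequency domain corresponds to a circular convolution over the entire feature map, a single elementwise product realizes a truly global interaction, and the filter resolution can simply be interpolated to handle larger inputs.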
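The recursive gated convolution of the second contribution raises the order of spatial interaction by repeatedly gating spatially mixed features against linear projections of the running result. In this toy sketch, a 3-tap circular average stands in for the depthwise convolution, and all names and shapes are illustrative assumptions:

```python
import numpy as np

def local_mix(x):
    # Toy stand-in for a depthwise convolution: 3-tap circular average
    # along the (flattened) spatial axis.
    return (np.roll(x, 1, axis=0) + x + np.roll(x, -1, axis=0)) / 3.0

def gnconv(x, weights):
    """Gated interaction of order len(weights) + 1.

    Each step multiplies spatially mixed features with a projection of the
    running product, so every step adds one multiplicative (higher-order)
    interaction while using only convolutions and elementwise products.

    x: (L, C) tokens; weights: list of (C, C) projection matrices.
    """
    p = x
    for w in weights:
        p = local_mix(p) * (p @ w)  # elementwise gating raises the order
    return p

tokens = np.random.rand(16, 8)
out = gnconv(tokens, [np.eye(8) * 0.5 for _ in range(3)])
```

The key point is that no quadratic attention map is ever formed: the high-order spatial interactions are built entirely from linear-cost operations.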
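For the third contribution, the bottom-up decision module can be pictured as a small policy that, per input sample, chooses how many channel groups of a layer to evaluate. In the thesis this policy is trained with deep reinforcement learning; the linear policy, the group layout, and the toy demo below are all illustrative assumptions:

```python
import numpy as np

def runtime_prune(x, layer_w, policy_w, groups=4):
    """Sample-adaptive channel pruning.

    x: (C_in,) input features; layer_w: (C_out, C_in) layer weights;
    policy_w: (C_in, groups) toy decision module scoring how many
    channel groups to keep for this particular sample.
    """
    k = int(np.argmax(x @ policy_w)) + 1  # keep the first k of `groups` groups
    y = layer_w @ x
    gsize = len(y) // groups
    y[k * gsize:] = 0.0                   # pruned groups are never computed
    return y, k

# Toy policy that always keeps 2 of 4 output-channel groups.
pw = np.zeros((8, 4)); pw[:, 1] = 1.0
y, k = runtime_prune(np.ones(8), np.ones((8, 8)), pw)
```

Because the kept channels always form a prefix (bottom-up selection), the pruned layer reduces to a dense, smaller matrix multiply at deployment time: easy samples get a small k and little compute, hard samples a larger k.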
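The progressive spatial sparsification of the fourth contribution can be reduced to scoring tokens and keeping only the top fraction before each later stage. A minimal sketch follows; the hand-written score vector is a placeholder for the thesis's learned prediction module:

```python
import numpy as np

def sparsify(tokens, scores, keep_ratio):
    """Keep the highest-scoring fraction of tokens, preserving their order."""
    k = max(1, int(len(tokens) * keep_ratio))
    idx = np.sort(np.argsort(scores)[-k:])  # top-k indices, in original order
    return tokens[idx]

# Progressive schedule: each stage drops the least informative tokens,
# so the later (more expensive) blocks run on fewer tokens.
tokens = np.arange(8, dtype=float).reshape(8, 1)
scores = np.array([0.1, 0.9, 0.2, 0.8, 0.3, 0.7, 0.4, 0.6])
kept = sparsify(tokens, scores, 0.5)
```

Selecting a fixed keep ratio per stage (rather than a per-sample mask) is what makes the method hardware-friendly: the surviving tokens form a dense tensor of known size, so the speedup is realized on real devices instead of only in theoretical FLOP counts.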