基于由粗到细策略的实例级视觉识别方法研究

Research on Instance-Level Visual Recognition Methods with Coarse-to-Fine Strategies

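Author: 唐楚峰
  • Student ID
    2018******
  • Degree
    Doctoral
  • Email
    chu******com
  • Defense date
    2023.05.19
  • Advisor
    胡晓林
  • Discipline
    Computer Science and Technology
  • Pages
    134
  • Confidentiality level
    Public
  • Department
    024 Department of Computer Science (计算机系)
  • Chinese keywords
    深度学习,计算机视觉,实例级视觉识别,由粗到细
  • English keywords
    deep learning, computer vision, instance-level visual recognition, coarse-to-fine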

Abstract

In recent years, deep learning-based computer vision techniques have made significant progress across many tasks and real-world applications, especially in image-level visual recognition tasks such as image classification. However, instance-level visual recognition, which aims to recognize every visual concept in a complex scene, still faces many challenges. Most existing algorithms follow a local-then-global recognition strategy, and their results in complex scenes remain unsatisfactory, with problems such as low accuracy and low efficiency. In contrast, some neuroscience findings indicate that one important reason humans can handle complex visual perception tasks is that they adopt a coarse-to-fine recognition strategy, that is, global first and then local. Research on coarse-to-fine recognition algorithms can therefore help achieve efficient and accurate instance-level visual recognition, and further extend the value of computer vision technology to broader and more complex application scenarios. This dissertation presents a systematic and in-depth study of the coarse-to-fine strategy in instance-level visual recognition and proposes effective solutions to different problems arising in practical scenarios. The main contributions are summarized as follows:

1. To address imprecise boundary predictions in instance segmentation and scene understanding, a spatially coarse-to-fine (whole-to-boundary) boundary refinement method is proposed. It first locates objects at a coarse granularity and then calibrates their boundaries at a fine granularity. The method improves the boundary quality of any image segmentation model and, to some extent, breaks the bottleneck of existing methods in segmentation accuracy.

2. To address the under-utilized semantic relations between attributes and local image regions in instance-level human attribute recognition, a spatially adaptive global-to-local pedestrian attribute recognition method is proposed. It adaptively locates attribute-related local regions via weakly supervised learning, substantially improving attribute recognition accuracy while also offering high computational efficiency and strong interpretability.

3. To address limited recognition granularity and heavy dependence on large-scale data annotation in scene understanding, a method of visual recognition by request is proposed that is coarse-to-fine in both spatial and semantic granularity. It builds a hierarchical knowledge base and performs language-driven hierarchical scene understanding in a coarse-to-fine manner, which improves recognition granularity and provides strong open-domain recognition ability.

4. To address the high cost of data annotation in instance segmentation, a label-efficient instance segmentation training method is proposed that arranges supervision in a coarse-to-fine manner. It selectively labels pixels through active learning and weakly supervised learning, improving how efficiently instance segmentation models use data annotations.