
Research on Monocular Depth Estimation Algorithms for Indoor Scenes

Monocular Indoor Depth Estimation Algorithms

Author: 娄志强
  • Student ID
    2020******
  • Degree
    Master
  • Email
    610******com
  • Defense Date
    2023.05.18
  • Advisor
    季向阳
  • Discipline
    Control Science and Engineering
  • Pages
    76
  • Confidentiality Level
    Public
  • Affiliation
    025 Department of Automation
  • Chinese Keywords
    Depth Estimation, Structure from Motion, Depth Filter, Event Camera, Indoor Scene
  • English Keywords
    Depth Estimation, Structure from Motion, Depth Filter, Event Vision, Indoor Scene

Abstract

As one of the fundamental tasks of 3D vision, depth estimation has broad and important applications in augmented/virtual reality, robot navigation, scene reconstruction, and related fields. Monocular depth estimation, a sub-field of depth estimation, infers the distance of every pixel in a visual scene from a single visual sensor; its minimal sensor requirements have made it one of the research hotspots of depth estimation. Current monocular methods achieve fairly satisfactory results in outdoor scenes but still perform poorly in complex and varied indoor scenes, which limits the development of related indoor applications. This thesis analyzes the complexity of indoor scenes, attributes it mainly to diverse geometric structures and lighting conditions, and on this basis studies more robust and accurate indoor depth estimation techniques. The research proceeds progressively through three perspectives: surfel structure priors, global structure constraints, and event-image multimodal fusion. The specific research contents are as follows:

(1) A temporal probabilistic indoor depth estimation algorithm based on surfel structure priors. Local planar structures are ubiquitous in indoor scenes. This thesis designs surfels around such local planes, derives a surfel-based depth filtering process on top of the traditional depth filter, and further builds a complete temporal probabilistic depth estimation algorithm. The algorithm exploits temporal structure information to efficiently refine initial depth maps and improve their accuracy.

(2) An unsupervised indoor depth estimation algorithm based on global structure constraints. Learning-based depth estimation networks can infer a depth map from a single image and outperform traditional algorithms on certain data. Using the abundant planar structures in indoor scenes and their spatial relationships, this thesis designs an unsupervised depth estimation network tailored to indoor scenes; introducing this structural information improves both the generalization and the accuracy of the network.

(3) An indoor depth estimation algorithm that jointly uses event streams and image frames. Variable lighting is a major source of indoor complexity, and conventional cameras inherently struggle with such extreme scenes. This thesis employs a hybrid event-image camera to handle them: the camera outputs event streams and image frames from the same viewpoint, offers a wider dynamic range, and can cope with extreme conditions such as low light and motion blur, thereby enhancing the generality and robustness of the algorithm. This work builds a standard real-scene indoor event depth estimation dataset as an evaluation benchmark for related algorithms, and proposes a depth estimation network based on a joint voxel representation that efficiently exploits the complementary characteristics of the two data types, achieving high-accuracy depth estimation in high-speed indoor scenes or under extreme lighting conditions.

In summary, this thesis proposes several novel indoor depth estimation algorithms: it improves the traditional depth filter with indoor structural information, further uses structural priors to design an unsupervised indoor depth estimation network, and finally handles extreme scenes with event-image multimodal fusion, contributing the corresponding benchmark and algorithm. This research lays a foundation for further in-depth study of monocular indoor depth estimation.

As one of the fundamental tasks of 3D vision, depth estimation has extensive and important applications in AR/VR, robot navigation, scene geometry reconstruction, and other related fields. Monocular depth estimation, a sub-domain of depth estimation, computes per-pixel depth values from the data captured by a single camera, and has become an important research area because of its low hardware requirements. Current monocular depth estimation algorithms can infer satisfactory depth maps in outdoor scenes but exhibit only limited performance in indoor scenes. We explore the difficulty of indoor depth estimation and attribute its complexity mainly to diverse geometric structures and lighting conditions. We successively investigate three topics to improve indoor depth estimation performance: surfel structure priors, global structure constraints, and event-image multimodal fusion. The specific details of the three research topics are as follows:

(1) Temporal depth estimation framework based on surfel structure priors. We utilize the local planar structures common in indoor scenes to generate surfels, and derive a surfel-level depth filter algorithm from the traditional pixel-level depth filter. We further establish a complete temporal depth estimation framework that synthesizes rich temporal information to refine initial depth maps produced by other, simpler methods.

(2) Indoor unsupervised depth estimation network based on global structure constraints. Learning-based depth estimation approaches can estimate depth maps from a single frame and outperform traditional algorithms on certain datasets. We extract global structure constraints from the planar structures of indoor scenes and design an indoor unsupervised depth estimation network around these constraints. Extensive experiments demonstrate the effectiveness of the proposed global constraints.

(3) Indoor depth estimation from events and images. Conventional images carry little meaningful information under extremely low light, which hinders further improvement of indoor depth estimation. An event-image multimodal camera captures synchronized event streams and image data from the same viewpoint; compared with traditional cameras, it possesses a wider dynamic range and a higher temporal resolution, enabling it to handle extreme cases such as low light and motion blur. We therefore combine event and image data to improve the performance and robustness of indoor depth estimation. Specifically, we first establish a standard indoor event-based depth estimation benchmark on real scenes, and correspondingly propose a joint volume representation on which we build a depth network that integrates information from the two data modalities and produces high-quality depth estimates in extreme scenes.

To sum up, we propose three novel indoor depth estimation algorithms. A surfel-level depth filter is derived that introduces local planar priors into indoor depth estimation. Global structure constraints are extracted from large planes and their relations, yielding a more effective unsupervised indoor depth estimation network. To handle extreme indoor scenes, we incorporate the event modality, build a standard real-scene indoor event depth estimation dataset, and propose a joint-volume-based depth network. Our research lays the foundation for further in-depth research in the field of indoor monocular depth estimation.
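To make the surfel idea of research topic (1) concrete, the minimal sketch below shows two ingredients a surfel-level depth filter can build on: the classical Gaussian fusion update used by pixel-level depth filters, and a least-squares fit of an affine inverse-depth model over a local patch (for a pinhole camera, points on a 3D plane have inverse depth affine in pixel coordinates). The function names and the affine parameterization are illustrative assumptions, not the exact formulation derived in the thesis.

```python
import numpy as np

def fuse_gaussian(mu, sigma2, z, tau2):
    """Classical per-pixel depth-filter update: fuse a new measurement z
    (variance tau2) into the running Gaussian estimate N(mu, sigma2)."""
    fused_mu = (tau2 * mu + sigma2 * z) / (sigma2 + tau2)
    fused_sigma2 = (sigma2 * tau2) / (sigma2 + tau2)
    return fused_mu, fused_sigma2

def fit_surfel(us, vs, inv_depths):
    """Fit the affine inverse-depth model 1/d = a*u + b*v + c over a local
    patch; the coefficients (a, b, c) parameterize a planar surfel."""
    A = np.stack([us, vs, np.ones_like(us)], axis=1).astype(float)
    coeffs, *_ = np.linalg.lstsq(A, inv_depths, rcond=None)
    return coeffs
```

Applying `fuse_gaussian` frame after frame shrinks the variance, which is how a depth filter accumulates temporal information; fitting and filtering at the surfel level, rather than per pixel, is what lets the local planar prior regularize the estimate.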
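Research topic (2) constrains predicted depth with planar structure. One simple way to express such a constraint, assuming a plane mask is available, is to backproject the masked pixels and penalize their distance to their own best-fit plane. The hypothetical `planar_residual` below sketches this idea; the thesis's actual global constraints may differ.

```python
import numpy as np

def backproject(depth, K):
    """Lift a depth map (H, W) into a 3D point cloud with pinhole intrinsics K."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(float)
    rays = pix @ np.linalg.inv(K).T     # unit-depth viewing rays
    return rays * depth.reshape(-1, 1)  # scale each ray by its depth

def planar_residual(depth, K, mask):
    """Mean point-to-plane distance of the masked points from their own
    best-fit plane; driving this toward zero flattens the region."""
    pts = backproject(depth, K)[mask.reshape(-1)]
    centered = pts - pts.mean(axis=0, keepdims=True)
    # The plane normal is the right singular vector associated with the
    # smallest singular value of the centered point matrix.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return np.abs(centered @ vt[-1]).mean()
```

In a training pipeline this residual would be computed with a differentiable framework so its gradient can flow back into the depth network; NumPy is used here only to keep the sketch self-contained.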
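For research topic (3), learning-based pipelines typically convert the asynchronous event stream into a dense tensor before fusing it with image features. The sketch below builds a common voxel-grid encoding that splits each event between its two nearest temporal bins; whether the thesis's joint volume representation starts from exactly this encoding is an assumption.

```python
import numpy as np

def event_voxel_grid(xs, ys, ts, ps, H, W, bins=5):
    """Accumulate an event stream (pixel coords xs/ys, timestamps ts,
    polarities ps in {0, 1}) into a (bins, H, W) voxel grid, with
    bilinear interpolation along the time axis."""
    grid = np.zeros((bins, H, W))
    t = (ts - ts.min()) / max(ts.max() - ts.min(), 1e-9) * (bins - 1)
    left = np.floor(t).astype(int)
    right = np.clip(left + 1, 0, bins - 1)
    w_right = t - left                  # fraction assigned to the later bin
    pol = 2.0 * np.asarray(ps) - 1.0    # map polarity {0, 1} -> {-1, +1}
    np.add.at(grid, (left, ys, xs), pol * (1.0 - w_right))
    np.add.at(grid, (right, ys, xs), pol * w_right)
    return grid
```

Because the grid preserves both polarity and coarse timing, a fusion network can correlate it with image features even when the frames themselves are dark or motion-blurred, which is the complementarity the abstract refers to.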