
System Design and Multi-object Tracking of a Flexible Multiview Multimodal Imaging System for Outdoor Scenes

(Original Chinese title: 室外多视角多模态三维成像系统的设计与多目标跟踪)

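Author: Zhang Meng (张猛)
  • Student ID
    2020******
  • Degree
    Master's
  • Email
    zha******.cn
  • Defense date
    2023.05.16
  • Advisor
    Feng Jianjiang (冯建江)
  • Discipline
    Control Science and Engineering
  • Pages
    101
  • Confidentiality level
    Public
  • Department
    025 Department of Automation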
  • Chinese keywords
    三维成像系统, 多视角多模态, 时空标定, 数据集, 长时期跟踪
  • English keywords
    3D imaging system, multiview multimodal, spatio-temporal calibration, dataset, long-term tracking

Abstract

3D vision tasks have great research and practical value and have become one of the most active areas of computer vision. Accurate 3D imaging is the foundation of visual scene perception and analysis. Most existing 3D imaging systems acquire data with multiple RGB cameras and have been developed and tested only in small indoor scenes. In large, complex, dynamic scenes, these multi-camera systems have many drawbacks that are hard to avoid, which limits their wider use. The lack of accurate 3D imaging in large outdoor scenes also degrades 3D object detection and tracking, and with it the ability to perceive 3D objects in such scenes. To address these problems, this thesis introduces LiDAR sensors and designs and builds a flexible multiview multimodal 3D imaging system that can be deployed across different scenes, together with low-cost, automated multi-node time synchronization and spatial calibration. The system is used to collect 3D imaging data in several scenes. The thesis also investigates a long-term multi-object detection and tracking algorithm based on multiview multimodal data and gives it a preliminary validation on real sports scenes. The main contributions and results are the following three points:

1. This thesis designs and builds a multiview multimodal 3D imaging system for large outdoor scenes. To address the limitations of RGB cameras, it introduces LiDAR sensors and builds a multimodal sensor network. LiDAR is less affected by the environment and measures the depth of objects precisely, which complements the rich texture information captured by the cameras. To cope with the complexity of outdoor scenes, the system uses a movable device structure that is easy to disassemble and assemble and that allows nodes to be added freely. To handle the difficulty of spatio-temporal calibration, the thesis designs a GPS-based multi-node time synchronization system and a spatial calibration technique based on point cloud registration.

2. This thesis collects a large amount of multiview multimodal 3D dynamic imaging data in a variety of real outdoor scenes. To verify the functionality and stability of the system, multiview multimodal 3D data are collected in traffic, surveillance and sports scenes. The data are then organized, analyzed and annotated to build a new spatio-temporally synchronized multiview multimodal 3D dataset for large outdoor scenes.

3. This thesis proposes a 3D detection and long-term object tracking algorithm that fuses multiview point clouds and images. Features are extracted separately from the point cloud data and the image data and then projected onto the bird's-eye view for feature fusion, from which accurate 3D object detections are regressed. In the tracking stage, the method uses the 3D spatial information of each object together with its appearance features in the multiview images to match and link detections into 3D tracking trajectories. On the long-term sequence tracking task, the method achieves a HOTA score of 55.48%, improving on the baseline method by 22.62%, which demonstrates its effectiveness for long-term tracking.
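The tracking stage summarized in the abstract matches objects using both 3D spatial information and appearance features. The abstract does not give the cost function, the weights or the matching algorithm, so the following is only a minimal sketch under assumed choices: a weighted sum of normalized 3D distance and cosine distance between appearance vectors, resolved by greedy lowest-cost matching. The names `associate`, `w_pos`, `max_dist` and `max_cost` are hypothetical and do not come from the thesis.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two appearance feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm > 0 else 1.0


def associate(tracks, detections, w_pos=0.5, max_dist=2.0, max_cost=0.7):
    """Greedily match tracks to detections by a combined cost (assumed form).

    tracks, detections: lists of (position_xyz, appearance_vector).
    Returns a sorted list of (track_index, detection_index) pairs.
    """
    candidates = []
    for ti, (t_pos, t_feat) in enumerate(tracks):
        for di, (d_pos, d_feat) in enumerate(detections):
            dist = math.dist(t_pos, d_pos)
            if dist > max_dist:
                continue  # spatial gate: too far apart to be the same object
            cost = (w_pos * (dist / max_dist)
                    + (1.0 - w_pos) * cosine_distance(t_feat, d_feat))
            if cost <= max_cost:
                candidates.append((cost, ti, di))

    matches, used_tracks, used_dets = [], set(), set()
    for cost, ti, di in sorted(candidates):  # cheapest pairs first
        if ti in used_tracks or di in used_dets:
            continue
        matches.append((ti, di))
        used_tracks.add(ti)
        used_dets.add(di)
    return sorted(matches)


# Toy example: two existing tracks, three new detections (one is far away).
tracks = [((0.0, 0.0, 0.0), (1.0, 0.0)), ((5.0, 0.0, 0.0), (0.0, 1.0))]
detections = [((5.2, 0.0, 0.0), (0.0, 1.0)),
              ((0.3, 0.0, 0.0), (1.0, 0.0)),
              ((20.0, 0.0, 0.0), (1.0, 0.0))]
print(associate(tracks, detections))  # [(0, 1), (1, 0)]
```

Greedy matching keeps the sketch dependency-free. An optimal assignment solver could replace the final loop without changing the cost definition, and unmatched detections (such as the third one here) would typically start new tracks.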