
基于多源异构信息融合的3D目标检测算法研究

Research on Multi-view and Multimodal Fusion for 3D Object Detection

Author: 黄迪和
  • Student ID
    2020******
  • Degree
    Master
  • Email
    106******com
  • Defense date
    2023.05.12
  • Advisor
    李志恒
  • Discipline
    Electronic Information
  • Pages
    96
  • Confidentiality
    Public
  • Institution
    599 International Graduate School
  • Keywords
    3D object detection, multimodal fusion, multi-view fusion

Abstract


3D object detection, which aims to locate and classify objects in 3D space, has broad applications in autonomous driving, driver assistance systems, intelligent traffic surveillance, and robotics. The main challenges current 3D object detection technology faces in outdoor scenes are the low accuracy in detecting distant or occluded objects, the low recall for object categories with complex and diverse geometric structures, blind spots in the perception field of view, and the absence of an open-source 3D object detection system that simultaneously processes data of different modalities and fuses information from different viewpoints. Consequently, there is an urgent need to build a 3D dynamic object detection system that fuses multi-view and multimodal information, so as to achieve multimodal and multi-view cooperative perception and enhance the perception capability of autonomous vehicles and intelligent traffic surveillance systems.

First, this thesis investigates 3D object detection based on LiDAR point cloud data. It analyzes two problems of existing point cloud 3D object detection methods: the significant loss of geometric information during feature dimensionality reduction, and the difficulty of performing efficient adaptive feature extraction. To address these problems, a point cloud 3D object detection method is proposed based on spatial-aware dimensionality reduction, a multi-level spatial residual strategy, and a Scatter Transformer. Comparative experiments on multiple autonomous driving datasets verify that the proposed method effectively improves detection accuracy for objects with complex and diverse structures and for distant or partially occluded objects, providing algorithmic support for autonomous driving and intelligent traffic surveillance.
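Spatial-aware dimensionality reduction of the kind described above typically groups LiDAR points into bird's-eye-view pillars while keeping each point's offset from its pillar center, so that local geometry survives the projection to 2D. The following is a minimal toy sketch of that grouping idea only; the function name `pillarize`, the grid resolution, the coordinate ranges, and the key encoding are all illustrative assumptions, not the thesis implementation:

```python
import numpy as np

def pillarize(points, grid=(0.16, 0.16), x_range=(0.0, 69.12), y_range=(-39.68, 39.68)):
    """Group LiDAR points (N, 3) into BEV pillars and augment each point with
    its x/y offset from the pillar centroid (a simplified, hypothetical sketch
    of spatial-aware pillar features, not the thesis code)."""
    xi = ((points[:, 0] - x_range[0]) / grid[0]).astype(int)
    yi = ((points[:, 1] - y_range[0]) / grid[1]).astype(int)
    keys = xi * 100000 + yi  # one integer key per BEV cell
    pillars = {}
    for k in np.unique(keys):
        pts = points[keys == k]
        center = pts[:, :2].mean(axis=0)
        offsets = pts[:, :2] - center        # offsets preserve local geometry
        pillars[k] = np.hstack([pts, offsets])  # (M, 5) per-pillar features
    return pillars

# Three points: two fall in the same pillar, one far away in another.
pts = np.array([[1.00, 0.00, 0.5],
                [1.05, 0.02, 0.7],
                [10.0, 5.00, 1.0]])
result = pillarize(pts)
```

A real pipeline would then reduce each pillar's point set to a fixed-length feature (e.g. by max-pooling) and scatter the features back onto the BEV grid for the detection head.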
Meanwhile, this thesis studies the multimodal 3D object detection task based on the fusion of point cloud and image data, and investigates challenges such as the large modal disparity between point cloud and image features. To address this issue, a multimodal 3D object detection method is proposed based on adaptive point cloud-image feature alignment, image feature back-projection with depth expectation regression, and Scatter Transformer fusion. Comparative analysis with other methods verifies that the proposed method effectively improves detection accuracy for pedestrians, cyclists, and other categories, providing a model basis for subsequent multi-view cooperative perception.

Building on the above point cloud and multimodal 3D object detection methods, a multi-source heterogeneous information fusion perception system is developed for roadside surveillance and vehicle-road multi-view scenarios. The system collects multi-frame perception results from several different viewpoints to construct temporal multi-frame object trajectories, and then estimates objects' 3D bounding boxes through temporal object feature encoding and fusion.

Finally, to further validate the performance of the proposed system, a multi-sensor data collection platform is used to acquire multimodal data for experimental evaluation. The results demonstrate that the proposed perception system delivers high-precision 3D object detection and provides method and system support for autonomous driving, vehicle-infrastructure cooperation, and intelligent traffic surveillance.
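Depth expectation regression lifts an image feature at pixel (u, v) into 3D by taking the expectation over a predicted discrete depth distribution instead of committing to a single hard depth. A hedged sketch under standard pinhole-camera assumptions follows; the function name, the depth-bin layout, and the intrinsics are illustrative, not taken from the thesis:

```python
import numpy as np

def backproject_with_depth_expectation(u, v, depth_logits, depth_bins, K):
    """Back-project pixel (u, v) into 3D camera coordinates using the
    expectation E[d] over a predicted discrete depth distribution.
    depth_logits: (D,) unnormalized scores over depth_bins: (D,).
    K: 3x3 pinhole intrinsics. Illustrative sketch, not the thesis code."""
    p = np.exp(depth_logits - depth_logits.max())
    p /= p.sum()                       # softmax over the depth bins
    z = float(np.dot(p, depth_bins))   # depth expectation E[d]
    x = (u - K[0, 2]) * z / K[0, 0]    # pinhole back-projection
    y = (v - K[1, 2]) * z / K[1, 1]
    return np.array([x, y, z])

K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0,   0.0,   1.0]])
# Uniform logits over bins [1, 2, 3] m give an expected depth of 2 m;
# the principal point (320, 240) back-projects onto the optical axis.
pt = backproject_with_depth_expectation(320, 240, np.zeros(3), np.array([1.0, 2.0, 3.0]), K)
```

Keeping the full distribution differentiable in this way lets the depth head be trained end-to-end with the downstream 3D detection loss, which is the usual motivation for expectation-based rather than argmax-based depth.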