
Design and Implementation of UAV-based Video Object Detection System

Author: 张佩仪 (Zhang Peiyi)
  • Student ID
    2020******
  • Degree
    Master's
  • Email
    zha******com
  • Defense date
    2023.05.17
  • Advisor
    李亚利 (Li Yali)
  • Discipline
    Electronic Information
  • Pages
    78
  • Classification level
    Public
  • Training unit
    023 Department of Electronic Engineering
  • Chinese keywords
    视频目标检测,无人机,特征融合,深度学习,注意力机制
  • English keywords
    Video object detection, Unmanned Aerial Vehicle, Feature aggregation, Deep learning, Attention mechanism

Abstract

With the rapid advancement of drone technology, UAVs equipped with high-resolution cameras are increasingly used in fields such as commerce and national defense. UAV-based video object detection is fundamental to the intelligent analysis and application of the video data they capture, and has become a popular research topic in computer vision in recent years. However, UAV videos pose several challenges: wide fields of view and high viewing angles, objects at many scales, varying shooting environments and lighting conditions, and motion-induced blur, defocus, and occlusion, all of which make it difficult for existing detectors to handle such complex scenes. To address these challenges, this paper designs a UAV-based video object detection system. Targeting the problem of insufficient visual information about objects in UAV videos, this paper proposes a language-guided UAV video object detection method. By transferring knowledge from a pre-trained vision-language model to the downstream UAV-based video object detection task, the model can draw complementary information from text for objects whose visual information is insufficient. The enhanced visual features are then further aggregated across frames, making full use of the temporal information in the video to strengthen single-frame features and thereby improving the detector's robustness to changes in object appearance over time. To improve small-object detection accuracy while also enhancing the understanding of the internal structure of objects, this paper proposes a video object detection method based on structure-level semantic aggregation. In the feature-extraction stage, high-level feature maps rich in scene semantics are fused into shallow feature maps that retain more small-object information, strengthening the environmental awareness of small objects. To further improve cross-frame feature aggregation, the method models the internal structure of objects with shifted-window attention in the detection stage, which effectively improves the stability of the detection results. Based on the proposed methods, extensive experiments are conducted on the VisDrone-VID2019 dataset. The results verify that the system significantly improves object detection accuracy on UAV-captured videos while also improving the stability of the detection results.
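The cross-frame feature aggregation described above can be illustrated with a minimal sketch: each frame's feature is enhanced by a similarity-weighted average of features from the other frames in the clip. This is a generic NumPy illustration, not the thesis's actual implementation; the cosine-similarity weighting and softmax normalization are assumptions chosen for simplicity.

```python
import numpy as np

def aggregate_frames(features):
    """Enhance each frame's feature with a cosine-similarity-weighted
    average over all frames in the clip (generic illustration).

    features: (T, D) array, one D-dimensional feature per frame.
    Returns an array of the same shape.
    """
    # L2-normalize rows so dot products become cosine similarities.
    norm = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = norm @ norm.T                                   # (T, T) similarities
    # Softmax over each row turns similarities into aggregation weights.
    weights = np.exp(sim) / np.exp(sim).sum(axis=1, keepdims=True)
    return weights @ features                             # weighted aggregation

rng = np.random.default_rng(0)
feats = rng.standard_normal((5, 16))                      # 5 frames, 16-dim features
out = aggregate_frames(feats)
print(out.shape)  # (5, 16)
```

Frames that look alike contribute more to each other's enhanced feature, which is the intuition behind using temporal context to stabilize per-frame detections.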
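The structure-level semantic aggregation fuses semantically rich high-level feature maps into shallow, high-resolution ones. A common pattern for this (shown here purely as an FPN-style illustration under assumed shapes, not necessarily the thesis's design) is to upsample the deep map, project its channels, and add it to the shallow map:

```python
import numpy as np

def fuse_topdown(shallow, deep, proj):
    """FPN-style top-down fusion of a low-resolution, high-level feature
    map into a high-resolution, shallow one (illustrative sketch).

    shallow: (C_s, H, W)       shallow features retaining small objects
    deep:    (C_d, H//2, W//2) semantically rich deep features
    proj:    (C_s, C_d)        1x1-conv-like channel projection
    """
    # Nearest-neighbor 2x upsampling of the deep map.
    up = deep.repeat(2, axis=1).repeat(2, axis=2)         # (C_d, H, W)
    # A 1x1 convolution is a per-pixel channel projection.
    projected = np.einsum('sc,chw->shw', proj, up)        # (C_s, H, W)
    return shallow + projected

rng = np.random.default_rng(1)
shallow = rng.standard_normal((8, 16, 16))
deep = rng.standard_normal((32, 8, 8))
proj = rng.standard_normal((8, 32)) * 0.1
fused = fuse_topdown(shallow, deep, proj)
print(fused.shape)  # (8, 16, 16)
```

The fused map keeps the shallow map's spatial resolution (good for small objects) while injecting the deep map's scene-level semantics.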
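Modeling object-internal structure with shifted-window attention means computing self-attention only among positions inside each local window, cyclically shifting the window grid so that information can cross window borders between layers. A schematic single-head NumPy version, with no learned projections (both simplifications are assumptions for illustration):

```python
import numpy as np

def window_attention(fmap, win, shift=0):
    """Self-attention within non-overlapping windows of a feature map.

    fmap:  (H, W, D) features; H and W divisible by win.
    win:   window side length.
    shift: cyclic shift applied before partitioning, as in
           shifted-window attention; reversed afterwards.
    """
    H, W, D = fmap.shape
    x = np.roll(fmap, (-shift, -shift), axis=(0, 1)) if shift else fmap
    out = np.empty_like(x)
    for i in range(0, H, win):
        for j in range(0, W, win):
            w = x[i:i+win, j:j+win].reshape(-1, D)        # (win*win, D) tokens
            scores = w @ w.T / np.sqrt(D)                 # scaled dot-product
            attn = np.exp(scores - scores.max(axis=1, keepdims=True))
            attn /= attn.sum(axis=1, keepdims=True)       # row-wise softmax
            out[i:i+win, j:j+win] = (attn @ w).reshape(win, win, D)
    return np.roll(out, (shift, shift), axis=(0, 1)) if shift else out

rng = np.random.default_rng(2)
fmap = rng.standard_normal((8, 8, 4))
y = window_attention(fmap, win=4, shift=2)
print(y.shape)  # (8, 8, 4)
```

Restricting attention to windows keeps the cost linear in the number of positions instead of quadratic, which matters for the high-resolution feature maps typical of UAV imagery; the shift lets adjacent windows exchange information across layers.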