Research on Video Captioning Algorithm Based on Attention Mechanism

Author: 唐铭康
  • Student ID
    2020******
  • Degree
    Master's
  • Email
    tmk******.cn
  • Defense Date
    2023.05.14
  • Advisor
    李秀
  • Discipline
    Electronic Information
  • Pages
    87
  • Confidentiality Level
    Public
  • Training Unit
    599 International Graduate School
  • Keywords
    Video Captioning, Attention Mechanism, Encoder-Decoder, Feature Extraction

Abstract

Video captioning is a challenging visual-language cross-modal understanding task that aims to automatically generate natural language descriptions for given video content. Encoder-decoder captioning algorithms extract visual features with deep neural networks and design encoder and decoder models that generate the description word by word, and they have achieved broad success in video captioning. However, current research still has several limitations. Existing work extracts visual features with a combination of feature extractors from the computer vision field, which makes feature extraction computationally expensive and slow, and these extractors cannot establish a correlation with text. Existing work also emphasizes aggregating spatial features on the encoder side through attention, yet the visual representation fed into the decoder consists of frame-level features, so fine-grained spatial detail is lost and the decoder cannot dynamically determine the spatial regions associated with each generated word. In addition, videos are usually accompanied by related text such as subtitles, which is rarely exploited in current video captioning research, and existing work focuses mainly on English, leaving Chinese video captioning largely unexplored. To address these limitations, this thesis studies video captioning algorithms based on the attention mechanism and makes the following contributions.

(1) For visual feature extraction in video captioning, this thesis compares the performance of multiple existing visual feature extractors and proposes a Transformer-based feature extractor trained with contrastive learning specifically for video captioning. The extractor uses text as the supervisory signal and aligns semantics through video-text contrastive learning, which enriches its multi-modal knowledge and facilitates decoding text from visual content. Experiments show that the proposed extractor significantly outperforms the traditional combination of feature extractors in both inference speed and caption quality; it took second place in the ACM MM 2021 Pre-training for Video Understanding Challenge and first place in the ICCV 2021 VALUE Challenge.

(2) For modeling spatial-temporal correlations, a video captioning encoder-decoder model based on sequential spatial-temporal attention is proposed. The model builds spatial-temporal correlations on grid-level features, which are cheaper to compute and carry richer information; the encoder retains complete spatial-temporal information, and the decoder aggregates the video representation with spatial-temporal attention, so the model can decide which frames and spatial regions to focus on according to the current context. Experiments show that, under the same conditions, the proposed model surpasses several state-of-the-art video captioning models in inference speed and caption quality.

(3) To address the inability of existing video captioning models to exploit related text and to predict Chinese captions, a multi-modal, multi-language video captioning framework is designed and integrated into a video caption generation platform. Built on the feature extractor and captioning model proposed in this thesis, the framework provides extension methods for related-text input and for Chinese and English prediction.
Experiments demonstrate that the framework can attend to valuable information in the related text to improve caption quality, and that Chinese and English prediction complement each other, outperforming models trained on either language alone. The video caption generation platform integrates the above models and functions to provide users with a simple and seamless video caption generation experience.
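
To make the video-text contrastive alignment in contribution (1) concrete, below is a minimal sketch of a CLIP-style symmetric InfoNCE objective over a batch of matched video-text pairs. The function name, embedding dimensions, and temperature value are illustrative assumptions; the thesis's actual extractor architecture and training recipe are not reproduced here.

```python
# Minimal sketch: symmetric video-text contrastive (InfoNCE) loss.
# Assumes pooled embeddings from hypothetical video/text encoders.
import torch
import torch.nn.functional as F

def video_text_contrastive_loss(video_emb: torch.Tensor,
                                text_emb: torch.Tensor,
                                temperature: float = 0.07) -> torch.Tensor:
    """video_emb: (B, D) pooled video features; text_emb: (B, D) pooled text features."""
    v = F.normalize(video_emb, dim=-1)               # unit-norm video embeddings
    t = F.normalize(text_emb, dim=-1)                # unit-norm text embeddings
    logits = v @ t.t() / temperature                 # (B, B) cosine-similarity logits
    targets = torch.arange(v.size(0), device=v.device)  # matched pairs lie on the diagonal
    loss_v2t = F.cross_entropy(logits, targets)      # video -> text direction
    loss_t2v = F.cross_entropy(logits.t(), targets)  # text -> video direction
    return (loss_v2t + loss_t2v) / 2

if __name__ == "__main__":
    # Toy usage with random tensors standing in for encoder outputs.
    print(video_text_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512)).item())
```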
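Contribution (2) keeps grid-level features intact in the encoder and lets the decoder attend over frames and spatial regions at every decoding step. The module below is a rough sketch of one way to implement such sequential spatial-temporal attention in PyTorch; the class name, scoring functions, and shapes are assumptions rather than the thesis's exact formulation.

```python
import torch
import torch.nn as nn

class SequentialSpatioTemporalAttention(nn.Module):
    """Sketch: at each decoding step, weight spatial grid cells within frames,
    then weight frames over time, both conditioned on the decoder state."""

    def __init__(self, dim: int):
        super().__init__()
        self.temporal_score = nn.Linear(dim, 1)
        self.spatial_score = nn.Linear(dim, 1)
        self.query_proj = nn.Linear(dim, dim)

    def forward(self, grid_feats: torch.Tensor, query: torch.Tensor) -> torch.Tensor:
        # grid_feats: (B, T, N, D) -- T frames, N grid cells per frame; query: (B, D).
        B, T, N, D = grid_feats.shape
        q = self.query_proj(query)[:, None, None, :]                     # (B, 1, 1, D)
        # Temporal attention: score each frame from its mean-pooled grid cells.
        frame_feats = grid_feats.mean(dim=2)                             # (B, T, D)
        t_weights = torch.softmax(
            self.temporal_score(torch.tanh(frame_feats + q[:, :, 0])), dim=1)  # (B, T, 1)
        # Spatial attention: score every grid cell against the same query.
        s_weights = torch.softmax(
            self.spatial_score(torch.tanh(grid_feats + q)), dim=2)       # (B, T, N, 1)
        # Aggregate cells per frame, then frames over time.
        frame_context = (s_weights * grid_feats).sum(dim=2)              # (B, T, D)
        return (t_weights * frame_context).sum(dim=1)                    # (B, D)

if __name__ == "__main__":
    # Toy usage: 4 videos, 8 frames, 7x7=49 grid cells, 512-dim features.
    attn = SequentialSpatioTemporalAttention(512)
    print(attn(torch.randn(4, 8, 49, 512), torch.randn(4, 512)).shape)  # torch.Size([4, 512])
```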
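The abstract does not spell out how the multi-modal, multi-language framework in contribution (3) consumes related text or switches between Chinese and English, so the snippet below only illustrates one common pattern: concatenating embedded subtitle tokens with the visual sequence on the encoder side and conditioning the decoder on a target-language token. All names and shapes here are hypothetical.

```python
import torch
import torch.nn as nn

class MultimodalMultilingualInputs(nn.Module):
    """Hypothetical input preparation for an encoder-decoder captioner that
    accepts related text (e.g. subtitles) and a target language (<en>/<zh>)."""

    def __init__(self, vocab_size: int, dim: int, languages=("<en>", "<zh>")):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, dim)     # shared text embedding
        self.lang_emb = nn.Embedding(len(languages), dim)  # one vector per language
        self.lang_ids = {lang: i for i, lang in enumerate(languages)}

    def encoder_inputs(self, visual_feats: torch.Tensor,
                       subtitle_ids: torch.Tensor) -> torch.Tensor:
        # visual_feats: (B, Tv, D); subtitle_ids: (B, Ts) token ids of related text.
        # Concatenate embedded subtitles after the visual features so the encoder
        # can attend across both modalities.
        return torch.cat([visual_feats, self.token_emb(subtitle_ids)], dim=1)

    def decoder_prefix(self, language: str, batch_size: int) -> torch.Tensor:
        # A (B, 1, D) language embedding prepended to the decoder input,
        # steering generation toward Chinese or English captions.
        idx = torch.full((batch_size, 1), self.lang_ids[language], dtype=torch.long)
        return self.lang_emb(idx)

if __name__ == "__main__":
    # Toy usage: 2 videos, 8 visual tokens, 5 subtitle tokens, 512-dim features.
    prep = MultimodalMultilingualInputs(vocab_size=10000, dim=512)
    enc_in = prep.encoder_inputs(torch.randn(2, 8, 512), torch.randint(0, 10000, (2, 5)))
    print(enc_in.shape, prep.decoder_prefix("<zh>", 2).shape)  # (2, 13, 512) (2, 1, 512)
```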