
融合跨模态知识迁移的预训练大模型研究

Research on a Large Pre-training Model Incorporating Cross-modal Knowledge Transfer

Author: Sun Rongyi (孙容一)
  • Student ID
    2020******
  • Degree
    Master's
  • Email
    sry******.cn
  • Defense Date
    2023.05.19
  • Advisor
    Zheng Haitao (郑海涛)
  • Discipline
    Computer Technology
  • Pages
    64
  • Confidentiality Level
    Public
  • Training Unit
    599 International Graduate School
  • Chinese Keywords
    多模态预训练, 单塔, 双塔, 并行式交互, 跨模态知识迁移
  • English Keywords
    Vision-Language Pre-Training, Fusion-based Encoders, Fusion-free Encoders, Parallel Attention Interaction, Cross-modal Knowledge Transfer

Abstract

Pre-trained models have achieved remarkable success in both Natural Language Processing (NLP) and Computer Vision (CV), and the trend toward unification demonstrated by the Transformer has drawn researchers from different fields to this area. Building a Vision-and-Language Pre-training (VLP) model that supports both visual and textual inputs, in order to improve performance on image-text multimodal tasks with rich application scenarios, has therefore become a research hotspot. However, existing VLP paradigms rely mainly on either single-tower, fusion-based encoders or dual-tower, fusion-free encoders, and neither can simultaneously meet the requirements of the different downstream multimodal tasks. To enable a single model to support both image-text understanding and image-text retrieval, and thus to improve its usefulness in practical applications, this thesis builds on the dual-tower architecture and designs a model that can switch freely between a fusion-based (interactive) mode and a fusion-free (non-interactive) mode. To this end, the thesis proposes a parallel image-text attention interaction mechanism, so that switching from the fusion-based mode to the fusion-free mode does not disturb the model's computation. In addition, to retain more image-text interaction knowledge after the switch and to improve retrieval performance, the thesis proposes a cross-modal knowledge transfer task that helps single-modality features learn interaction knowledge from the multimodal features produced by the image-text interaction module. Because pre-training is expensive, the thesis first designs a Transformer-based multimodal dialogue model to verify the effectiveness of the parallel attention interaction mechanism; the experiments show that the mechanism carries out image-text interaction computation flexibly and effectively. On this basis, the thesis further designs a multimodal pre-training model that adopts the parallel attention interaction mechanism and incorporates cross-modal knowledge transfer, pre-trains it on the widely used image-caption datasets GCC, VG, SBU, and COCO, and evaluates it extensively on visual question answering (VQAv2), visual reasoning (NLVR2), and image-text retrieval. The results show that the proposed model performs strongly on question answering and reasoning in the fusion-based mode, switches flexibly to the fusion-free mode to perform image-text retrieval efficiently, and outperforms models of comparable parameter scale pre-trained on the same data.
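To make the two ideas in the abstract more concrete, the sketch below shows, in plain PyTorch, how a parallel image-text attention interaction layer can be switched between fusion-based and fusion-free modes without altering the self-attention path, and how a cross-modal knowledge transfer loss can pull single-modality features toward fused multimodal features. This is a minimal illustrative sketch, not the thesis implementation: the class and function names (ParallelInteractionLayer, knowledge_transfer_loss), the 768-dimensional features, and the cosine-distance distillation objective are all assumptions made for exposition.

```python
# Illustrative sketch only (not the thesis code): parallel attention interaction
# plus a simple cross-modal knowledge transfer loss.
from typing import Optional

import torch
import torch.nn as nn
import torch.nn.functional as F


class ParallelInteractionLayer(nn.Module):
    """Transformer layer whose cross-attention branch runs in parallel with
    self-attention, so skipping it (fusion-free mode) leaves the
    self-attention computation unchanged."""

    def __init__(self, dim: int = 768, heads: int = 12):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_attn = nn.LayerNorm(dim)
        self.norm_ffn = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x: torch.Tensor, other: Optional[torch.Tensor] = None) -> torch.Tensor:
        h = self.norm_attn(x)
        # Self-attention branch: always active.
        out = self.self_attn(h, h, h, need_weights=False)[0]
        # Cross-attention branch: added in parallel, only in fusion-based mode.
        if other is not None:
            out = out + self.cross_attn(h, other, other, need_weights=False)[0]
        x = x + out
        return x + self.ffn(self.norm_ffn(x))


def knowledge_transfer_loss(uni_feat: torch.Tensor, fused_feat: torch.Tensor) -> torch.Tensor:
    """Pull a single-modality [CLS] feature toward the fused multimodal
    feature obtained in fusion-based mode (the fused feature is the teacher)."""
    uni = F.normalize(uni_feat, dim=-1)
    fused = F.normalize(fused_feat.detach(), dim=-1)
    return 1.0 - (uni * fused).sum(dim=-1).mean()


if __name__ == "__main__":
    layer = ParallelInteractionLayer()
    img_tokens = torch.randn(2, 49, 768)  # e.g. patch features from an image encoder
    txt_tokens = torch.randn(2, 32, 768)  # e.g. token features from a text encoder
    img_only = layer(img_tokens)                     # fusion-free mode (retrieval)
    img_fused = layer(img_tokens, other=txt_tokens)  # fusion-based mode (VQA/NLVR2)
    print(knowledge_transfer_loss(img_only[:, 0], img_fused[:, 0]))
```

Because the cross-attention term is added alongside the self-attention output rather than inserted as an extra sublayer between self-attention and the feed-forward block, disabling it in fusion-free mode leaves every remaining computation identical, which is the property the abstract emphasizes when describing the switch between modes.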