Research on Multi-modal Question Answering System Based on Pre-training

Original title: 基于预训练的多模态问答系统研究

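Author: 韩宏炜 (Han Hongwei)
  • Student ID
    2020******
  • Degree
    Master's
  • Email
    han******com
  • Defense date
    2023.05.15
  • Advisor
    李秀 (Li Xiu)
  • Discipline
    Electronic Information
  • Pages
    90
  • Confidentiality level
    Public
  • Training unit
    599 International Graduate School
  • Chinese keywords
    多模态预训练,网页问答,半结构化数据理解,复杂序列生成
  • English keywords
    Multimodal Pre-training, Web-based Question-Answering, Semi-structured Data Understanding, Complex Sequence Generation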

Abstract

With technological innovation and the spread of the internet, users browsing the web increasingly expect multimodal content such as images, video, and audio. At the same time, information on web pages has complex structure, for example tables and questionnaires. In web question answering, processing this multimodal data to resolve the problems users encounter has therefore become an important challenge for artificial intelligence, and for natural language processing in particular. This thesis studies a multimodal web question-answering system based on pre-training that integrates text, semi-structured data, and visual information in order to understand user needs precisely and provide accurate, practical, high-quality answers. To address this challenge, the thesis divides the multimodal web question-answering task into three levels by difficulty. The proposed methods are validated experimentally on a total of 11 tasks and datasets and show good performance.

First, for semi-structured data understanding, this thesis proposes LUNA, a table augmentation algorithm based on number plugins (numeral tokens) and pre-training. The number plugin represents each number as a whole, which improves the language model's numerical reasoning and calculation ability. Number pre-training, which comprises classification, regression, and model distillation, aligns the distributions of numbers and vocabulary and enables LUNA to understand table structure.

Second, beyond using the encoder to "understand" the input, this thesis also uses the decoder to "generate" answer sequences in freer form. To this end it proposes CASR, a system that generates Complex sequences with Auto-regressive Self-boost Refinement, to decouple the dependencies in complex answer-sequence generation. Through multi-stage refinement, the model can draw on the positions in earlier refinement results that the current position depends on, and so predict the current position more accurately. The thesis also explores the possibility of connecting CASR to generative large language models such as ChatGPT.

Finally, to further improve the question-answering system, this thesis introduces visual information. Taking online pneumonia diagnosis as an example, it proposes a medical report generation framework based on multi-task learning. The framework includes vision-text alignment, multi-label diagnosis classification, and word importance weighting, and can learn fine-grained dependencies between image and text features.
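
The idea of modeling each number as a whole, mentioned above for LUNA, can be illustrated with a minimal preprocessing sketch. This is not the thesis's implementation: the placeholder name `NUM_TOKEN`, the function `split_numbers`, and the regular expression are all assumptions made for illustration. The sketch only shows how a number can be kept as one unit, with its value carried alongside, instead of being split into digit pieces.

```python
import re
from typing import List, Tuple

# Hypothetical placeholder token; the abstract does not name one.
NUM_TOKEN = "[NUM]"

# Assumption: only unsigned integers and decimals (e.g. 140, 12.5) are handled.
_NUMBER = re.compile(r"\d+(?:\.\d+)?")


def split_numbers(text: str) -> Tuple[List[str], List[float]]:
    """Tokenize on whitespace, replacing each number with NUM_TOKEN.

    Returns the token list and the numeric values in order of appearance,
    so that a downstream model could embed each value as a single unit.
    """
    tokens: List[str] = []
    values: List[float] = []
    pos = 0
    for match in _NUMBER.finditer(text):
        # Words between the previous number and this one.
        tokens.extend(text[pos:match.start()].split())
        tokens.append(NUM_TOKEN)
        values.append(float(match.group()))
        pos = match.end()
    # Words after the last number.
    tokens.extend(text[pos:].split())
    return tokens, values


tokens, values = split_numbers("Revenue rose from 12.5 to 140 million")
print(tokens)
print(values)
```

In a real system, each entry in `values` would presumably be mapped to a learned embedding that takes the place of the corresponding `[NUM]` position. How LUNA does this is described in the thesis body, not in this abstract.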