Research on Scientific Citation Intent Based on Language Model Fine-Tuning

Author: 刘淏天
  • Student ID: 2021******
  • Degree: Master's
  • Email: 947******com
  • Defense date: 2024.05.13
  • Supervisor: 李秀
  • Discipline: Electronic Information
  • Pages: 91
  • Confidentiality level: Public
  • Training unit: 599 International Graduate School
  • Keywords: citation intent prediction; prompt learning; fine-tuning; large language models

Abstract

In scientific research, citations not only help readers trace the sources of a work and clarify its research lineage, but also serve as an evaluation metric for gauging the reference value of the literature. Citations carry a variety of intents and purposes. Current research measures the importance of a citation mainly quantitatively, through citation counts, yet this approach cannot qualitatively express how, and in which respect, a citing paper draws on the original work. Predicting and classifying citation intent therefore has substantial research value.

Existing citation intent prediction algorithms fall into rule- and feature-based techniques, machine learning techniques, and deep learning techniques. Deep learning is favored for its strong performance, and within it the traditional pre-train-then-fine-tune paradigm is the most widely used. However, this paradigm suffers from a mismatch between the pre-training task and the downstream scenario, which prevents the strengths of pre-trained models from being fully exploited. Model performance also depends heavily on the quantity and quality of training data, so it degrades on class-imbalanced or sample-scarce datasets. In addition, owing to their limited parameter counts, traditional language models face a persistent bottleneck in understanding and processing long texts. The emergence of large language models offers new solutions to these problems: their vast parameter space and deep language understanding open up rich possibilities for natural language understanding tasks.

Against this background, this thesis makes the following contributions:

1. It proposes PromptCiter, a citation intent prediction framework that fine-tunes pre-trained models with prompt learning, aiming to close the gap between pre-training tasks and downstream application scenarios. The framework compares manually designed templates, adaptive generative templates produced by a T5 model, and logical templates augmented with chain-of-thought reasoning; ablation experiments identify the best combination of backbone model and prompting strategy, and the framework achieves strong results on public datasets.

2. It proposes a knowledge enhancement module that improves the label mapping (verbalizer) of prompt-tuned models. To address the limited robustness of fine-tuned models on few-shot and class-imbalanced datasets, the mapping label set is expanded in two ways: semantic-similarity-based vocabulary expansion and RoBERTa-based label word generation. Comparative experiments verify the effectiveness of the module in zero-shot and few-shot settings.

3. It introduces large language models into the citation intent prediction task, exploiting their rich prior knowledge to push past the capability limits of traditional language models and obtain more complete modeling representations. Experiments analyze the popular ChatGLM series, Baichuan series, and LLaMA 2 models, compare their performance under different fine-tuning methods, and, based on the results, offer reflections and an outlook on applying large language models in specialized domains.
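To make the prompt-learning idea behind contribution 1 concrete, the minimal Python sketch below shows how a manually written cloze template turns citation intent classification into masked-word prediction, so the task matches the masked-language-modelling pre-training objective. It is illustrative only, not the thesis's PromptCiter implementation: the roberta-base backbone, the template wording, the three intent classes, and the one-word verbalizer are all assumptions.

```python
# Illustrative cloze-style prompt learning for citation intent (not the
# thesis code). Backbone, template, classes, and label words are assumptions.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

MODEL_NAME = "roberta-base"  # assumed backbone; the thesis compares several
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForMaskedLM.from_pretrained(MODEL_NAME).eval()

# Hypothetical verbalizer: one seed label word per citation-intent class.
VERBALIZER = {"background": "background", "method": "method", "result": "result"}

def label_word_id(word: str) -> int:
    # id of the first sub-token of the label word (leading space for BPE vocab)
    return tokenizer(" " + word, add_special_tokens=False).input_ids[0]

def predict_intent(citation_sentence: str) -> str:
    # Manual template: the mask slot is filled by a label word.
    text = f"{citation_sentence} This citation is used for {tokenizer.mask_token}."
    inputs = tokenizer(text, return_tensors="pt")
    mask_pos = (inputs.input_ids[0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
    with torch.no_grad():
        logits = model(**inputs).logits[0, mask_pos].squeeze(0)
    # Score each class by the logit of its label word at the mask position.
    scores = {label: logits[label_word_id(w)].item() for label, w in VERBALIZER.items()}
    return max(scores, key=scores.get)

print(predict_intent("We adopt the segmentation pipeline of Smith et al. (2020)."))
```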
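The knowledge enhancement module of contribution 2 broadens the set of label words that the mask position maps to. One plausible reading of the semantic-similarity branch is sketched below: each seed label word is expanded to its nearest neighbours in the masked language model's input-embedding space, and a class is then scored by averaging the mask logits over the expanded set. The neighbour count and the averaging rule are assumptions, not the thesis's exact procedure.

```python
# Hedged sketch of semantic-similarity verbalizer expansion; the thesis may
# use a different similarity source or aggregation rule.
import torch
import torch.nn.functional as F

def expand_label_words(model, tokenizer, seed_word: str, k: int = 5) -> list[int]:
    # Nearest neighbours of the seed label word in the MLM's input-embedding
    # space; these extra label words broaden the class's verbalizer.
    emb = model.get_input_embeddings().weight.detach()      # [vocab_size, dim]
    seed_id = tokenizer(" " + seed_word, add_special_tokens=False).input_ids[0]
    sims = F.cosine_similarity(emb[seed_id].unsqueeze(0), emb, dim=-1)
    return sims.topk(k).indices.tolist()

def class_score(mask_logits: torch.Tensor, label_word_ids: list[int]) -> float:
    # Score a class by averaging the mask-position logits over its expanded set.
    return mask_logits[label_word_ids].mean().item()
```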
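For the large-model experiments of contribution 3, one common way to compare fine-tuning methods is parameter-efficient tuning such as LoRA. The sketch below, using the Hugging Face peft library, illustrates that setup; the Llama-2-7b checkpoint, the target modules, and the hyper-parameters are assumptions rather than the configurations reported in the thesis.

```python
# Hedged LoRA setup for a causal LLM; checkpoint and hyper-parameters are
# illustrative assumptions, not the thesis's reported configuration.
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

BASE = "meta-llama/Llama-2-7b-hf"             # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(BASE)
model = AutoModelForCausalLM.from_pretrained(BASE)

lora_cfg = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,    # illustrative hyper-parameters
    target_modules=["q_proj", "v_proj"],      # adapt only the attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()            # only a small fraction is trainable

# The task can then be framed as instruction-style generation, e.g.:
prompt = ("Classify the citation intent (background / method / result) of: "
          "'We adopt the attention mechanism of Vaswani et al. (2017).'\nIntent:")
```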