

Structured Knowledge Acquisition from Open-Domain Text

Author: Xu Han
  • Student ID
    2017******
  • Degree
    Doctoral (Ph.D.)
  • Email
    cst******com
  • Defense date
    2022.05.17
  • Advisor
    Zhiyuan Liu
  • Discipline
    Computer Science and Technology
  • Pages
    196
  • Confidentiality
    Public
  • Department
    024 Department of Computer Science
  • Keywords (Chinese)
    知识图谱,知识获取,开放域文本
  • Keywords (English)
    knowledge graph, knowledge acquisition, open-domain text

Abstract


Knowledge graphs organize human knowledge in a structured, symbolic way and are a foundational technology for advancing artificial intelligence and supporting intelligent services. Compared with the massive knowledge in the real world, existing knowledge graphs are still far from complete. Open-domain text is large in scale, diverse in form, and rich in content, so automatically acquiring structured knowledge from it is an effective way to expand knowledge graphs. This acquisition process, however, faces four challenges: annotation scarcity, long-tail data, incremental data, and multi-source heterogeneous data. To handle these challenges, our work focuses on the following directions.

(1) Denoising learning for distantly supervised data, including: denoising with endogenous information, which uses adversarial training to mine endogenous features of the automatically labeled data and filter out noisy samples; denoising with exogenous information, which uses the hierarchy of relations between entities as an external signal to select high-quality samples from the noisy training data; and an analysis of the applicable conditions of denoising mechanisms, which systematically evaluates various denoising methods for distant supervision and characterizes the strengths and weaknesses of each. These denoising methods allow a knowledge acquisition model to be trained on more data while reducing the side effects of noise.

(2) Few-shot learning for long-tail relations, including: building a unified framework for few-shot knowledge acquisition, which applies meta learning and metric learning to construct a few-shot learning framework for acquiring knowledge.
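The metric-learning idea just mentioned can be illustrated with a prototypical-style classifier: average the support embeddings of each relation into a prototype, then assign a query to the nearest prototype. The following is a minimal sketch with toy 2-D vectors standing in for encoded sentences; all names (`classify`, `prototype`, the relation labels) are illustrative, not the thesis's actual model.

```python
import math

def prototype(support):
    """Mean of the support embeddings for one relation (its class prototype)."""
    dim = len(support[0])
    return [sum(vec[i] for vec in support) / len(support) for i in range(dim)]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def classify(query, support_sets):
    """Assign the query to the relation whose prototype is nearest."""
    protos = {rel: prototype(vecs) for rel, vecs in support_sets.items()}
    return min(protos, key=lambda rel: euclidean(query, protos[rel]))

# Toy 2-way 2-shot episode: each vector stands in for an encoded sentence.
support = {
    "born_in":   [[0.9, 0.1], [1.1, -0.1]],
    "works_for": [[-1.0, 0.2], [-0.8, 0.0]],
}
print(classify([1.0, 0.0], support))  # born_in
```

Because classification reduces to distances against a handful of prototypes, the same model can handle a long-tail relation from only a few labeled examples, which is the point of the few-shot framework.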
Informative sample selection for few-shot knowledge acquisition uses a hybrid attention mechanism to select informative training samples and strengthen the few-shot learning process; knowledge transfer for few-shot knowledge acquisition transfers knowledge learned by pre-trained language models from large amounts of unlabeled data to alleviate data scarcity. These few-shot methods require only a small number of samples to obtain a knowledge acquisition model.

(3) Continual learning for new relations, including: continual sample detection for knowledge acquisition, based on a neural snowball system that continually detects training samples in open-domain text for building a more general knowledge acquisition model; and continual model learning for knowledge acquisition, based on a memory reconsolidation mechanism that continually learns new relations between entities in open-domain text while avoiding catastrophic forgetting. These continual learning methods can automatically discover new relations in open-domain text, detect training samples for them, and let knowledge acquisition models learn them continually.

(4) Joint learning for multi-source heterogeneous data, including: jointly learning cross-structure information for knowledge acquisition, which considers both unstructured text and structured knowledge graphs through a mutual attention mechanism; and jointly learning multilingual information for knowledge acquisition, which fuses multilingual semantic features in a unified space through adversarial training.
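The mutual attention of the cross-structure direction can be sketched in a highly simplified form: a knowledge-graph relation embedding attends over the word vectors of a sentence, while a text summary attends over the relation embeddings, so each structure re-weights the other. The vectors and names below are toy assumptions, not the thesis's actual architecture.

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys):
    """Dot-product attention: average the keys, weighted by similarity to the query."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(keys[0])
    return [sum(w * key[i] for w, key in zip(weights, keys)) for i in range(dim)]

# Toy setting: word vectors from one sentence, relation embeddings from a KG.
word_vecs = [[1.0, 0.0], [0.0, 1.0], [0.8, 0.2]]
kg_rels   = [[1.0, 0.1], [-0.5, 0.9]]
sentence_summary = [sum(v[i] for v in word_vecs) / len(word_vecs) for i in range(2)]

# Mutual attention: the KG side attends over words, the text side over relations.
kg_aware_text = attend(kg_rels[0], word_vecs)
text_aware_kg = attend(sentence_summary, kg_rels)
print(kg_aware_text[0] > kg_aware_text[1])  # True: relation [1.0, 0.1] favors x-aligned words
```

The resulting representations let textual evidence and structured facts inform each other before any relation is predicted, which is the intuition behind joining the two structures.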
These joint learning methods integrate the rich contextual information of multi-source heterogeneous data for knowledge acquisition. Building on the four directions above, this thesis proposes a unified algorithmic framework for structured knowledge acquisition from open-domain text. Beyond the algorithms, the thesis also describes, from an engineering perspective, how to build an efficient and effective knowledge application system. With the proposed algorithms and systems, we can acquire large-scale structured knowledge for knowledge graphs and combine the strengths of data-driven deep learning, which excels at extracting semantic features, with symbolic knowledge, which is well suited to cognitive inference. This combination is important for revealing the mechanisms of natural language processing and achieving intelligent language understanding.
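Returning to the continual-learning direction (3): one widely used device against catastrophic forgetting, and a rough stand-in for the memory machinery described above, is an episodic memory that keeps a few exemplars per learned relation and replays them alongside each new relation's samples. The sketch below shows only that replay intuition; the class name, the cap `k`, and the sample strings are hypothetical, and the neural snowball and reconsolidation components are omitted entirely.

```python
import random

class EpisodicMemory:
    """Keep at most k exemplar samples per relation for replay."""

    def __init__(self, k=2, seed=0):
        self.k = k
        self.store = {}               # relation -> list of kept samples
        self.rng = random.Random(seed)

    def add(self, relation, samples):
        # Truncate to k exemplars per relation (a stand-in for smarter selection).
        self.store[relation] = samples[: self.k]

    def replay_batch(self, new_samples):
        # Mix stored exemplars into the batch so earlier relations are
        # rehearsed alongside the new one, mitigating forgetting.
        old = [s for kept in self.store.values() for s in kept]
        batch = new_samples + old
        self.rng.shuffle(batch)
        return batch

mem = EpisodicMemory(k=2)
mem.add("born_in", ["s1", "s2", "s3"])   # only s1, s2 are kept
mem.add("works_for", ["s4"])
batch = mem.replay_batch(["s5", "s6"])   # a new relation's samples plus replay
print(sorted(batch))  # ['s1', 's2', 's4', 's5', 's6']
```

Training on such mixed batches lets the model acquire a new relation while its parameters are still constrained by rehearsed examples of the old ones.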