基于贝叶斯推断的开放域中文分词和词语发现方法及应用

Open-Domain Chinese Text Segmentation and Word Discovery Methods and Applications based on Bayesian Inference

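Author: 潘长在
  • Student ID
    2018******
  • Degree
    Doctoral
  • Email
    pan******com
  • Defense date
    2024.05.21
  • Advisor
    邓柯
  • Discipline
    Statistics
  • Pages
    131
  • Confidentiality level
    Public
  • Department
    016 Department of Industrial Engineering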
  • Chinese keywords
    中文分词;词语发现;贝叶斯推断;开放域;命名实体识别
  • English keywords
    Chinese text segmentation; word discovery; Bayesian inference; open domain; named entity recognition

Abstract

Natural language processing is an important branch of artificial intelligence. With the development of deep learning, deep learning-based methods have achieved state-of-the-art performance on a wide range of Chinese natural language processing tasks, but applying them to open-domain text remains difficult. Open-domain corpora typically exhibit widely varying writing styles, rich technical vocabulary, and scarce annotated data, and these characteristics cause the performance of existing methods to degrade sharply.

This dissertation combines deep learning-based methods with traditional statistical language models to propose a series of methods for analyzing open-domain Chinese text. Weakly supervised text analysis methods based on statistical language models are highly interpretable, depend little on annotated data, and are good at mining technical terms from text, but their performance on natural language processing tasks still lags behind that of deep learning methods. This dissertation draws on the strengths of deep learning by using it to construct prior information within a Bayesian framework, which then guides the initialization, training, and inference of the statistical model. The proposed methods substantially outperform the original statistical models while retaining their interpretability, weak dependence on annotated data, and strength in discovering technical terms.

First, for open-domain text segmentation and word discovery, this dissertation proposes TopWORDS-Seg, a method that performs Chinese text segmentation and word discovery simultaneously within a Bayesian framework. It builds a word dictionary model based on word boundary vectors, converts the segmentation output of a deep learning method into prior information that is incorporated into the model, and carries out both tasks by Bayesian inference. Experiments on real data show that the method accurately identifies technical terms in open-domain text and produces high-quality segmentation.

Second, for classical Chinese poetry, this dissertation proposes TopWORDS-Poetry, an unsupervised method that exploits the template information of metrical poetry to perform segmentation and word discovery simultaneously. Experiments on Complete Tang Poetry (《全唐诗》) show that it effectively discovers technical terms such as person names, place names, and literary allusions, and that it outperforms existing methods on both tasks.

Third, for the more complex task of open-domain named entity recognition, this dissertation strengthens the existing TopWORDS-MEPA method with pre-trained language models and proposes TopWORDS-MEPA-PLM, which performs open-domain Chinese text segmentation, word discovery, and named entity recognition simultaneously. The method preserves the interpretability of the original model and substantially improves on its performance in evaluations on real electronic medical record text.

Finally, this dissertation applies existing Chinese text analysis methods to texts on intangible cultural heritage, constructing a dictionary of technical terms for the domain and a network of associations among those terms.
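The abstract describes combining a word dictionary model with segmentation prior information under Bayesian inference. The toy sketch below illustrates only that general idea and is not the TopWORDS-Seg algorithm: it assumes a fixed unigram word dictionary and a per-gap boundary probability supplied by some external segmenter, and it finds the highest-scoring segmentation by dynamic programming. The names `map_segment`, `theta`, and `boundary_prior`, and all numbers, are hypothetical. Word discovery, that is, estimating the dictionary itself, is omitted.

```python
import math


def map_segment(text, theta, boundary_prior, max_len=4):
    """Return the highest-scoring segmentation of `text` as a list of words.

    theta          : dict mapping word -> probability (toy, fixed dictionary).
    boundary_prior : boundary_prior[k] is the assumed prior probability of a
                     word boundary between text[k] and text[k + 1]; every
                     value must lie strictly between 0 and 1.
    The score of a segmentation is the product of its word probabilities
    times the prior probability of its boundary pattern.
    """
    n = len(text)
    best = [-math.inf] * (n + 1)  # best[j]: best log score for text[:j]
    back = [0] * (n + 1)          # back[j]: start index of the last word
    best[0] = 0.0
    for j in range(1, n + 1):
        for i in range(max(0, j - max_len), j):
            word = text[i:j]
            if word not in theta or best[i] == -math.inf:
                continue
            score = best[i] + math.log(theta[word])
            # Gaps inside the word carry "no boundary".
            for k in range(i, j - 1):
                score += math.log(1.0 - boundary_prior[k])
            # The gap right after the word carries "boundary" (none at the end).
            if j < n:
                score += math.log(boundary_prior[j - 1])
            if score > best[j]:
                best[j], back[j] = score, i
    if best[n] == -math.inf:
        raise ValueError("text cannot be segmented with this dictionary")
    words, j = [], n
    while j > 0:
        words.append(text[back[j]:j])
        j = back[j]
    return words[::-1]


# Hypothetical dictionary and priors for an ambiguous six-character string.
text = "研究生命起源"
theta = {"研究": 0.15, "研究生": 0.3, "生命": 0.15, "命": 0.1, "起源": 0.3}
flat_prior = [0.5] * 5                        # uninformative prior
informed_prior = [0.1, 0.9, 0.1, 0.9, 0.1]    # assumed external segmenter output

print(map_segment(text, theta, flat_prior))
print(map_segment(text, theta, informed_prior))
```

With the uninformative prior the dictionary alone decides the segmentation; with the informed prior the boundary probabilities shift the result toward the segmentation the external segmenter favors, which is the role the abstract assigns to the deep learning output.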