容错加权有限状态机语音识别解码器算法研究和实现

Research and Implementation of Fault-Tolerant Algorithms in WFST-Based Speech Recognition Decoder

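Author: 潘广谋
  • Student ID
    2013******
  • Degree
    Master's
  • Email
    gua******com
  • Defense date
    2016.05.26
  • Advisor
    刘加
  • Discipline
    Information and Communication Engineering
  • Pages
    70
  • Confidentiality level
    Public
  • Department
    023 Department of Electronic Engineering
  • Chinese keywords
    语音识别, 词图生成, 加权有限状态机, 冗余去除, 容错
  • English keywords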
    speech recognition, lattice generation, WFST, redundancy removal, fault tolerance

Abstract

In recent years, advances in deep learning methods and GPU computing power have significantly improved the performance of Large Vocabulary Continuous Speech Recognition (LVCSR) systems, which are now widely applied in many fields. The decoder is one of the core modules of a speech recognition system. It integrates multiple knowledge sources, including the acoustic model, the pronunciation lexicon, and the language model, and uses a series of complex optimized search and pruning algorithms to convert the input speech signal into the corresponding text. Its output can take the form of a 1-best result, an N-best list, or a lattice. This thesis takes the computationally efficient static decoder based on Weighted Finite-State Transducers (WFST) as its foundation and studies how to implement decoding algorithms for speech recognition that are both efficient and fault-tolerant.

The thesis first analyzes WFST theory and algorithms and discusses how the WFST decoding network is constructed and how basic decoding works. On this basis, it builds a baseline decoding platform at an internationally advanced level. Building on the baseline, the thesis proposes a fast decoding method that extracts raw word-level lattices directly from the token lattice, and implements an efficient speech recognition decoder suitable for practical use.

To further improve the fault tolerance of the decoding engine, the thesis studies the problem of determining word boundaries during lattice generation. Exploiting the phenomenon of phone misalignment, it designs a method that inserts disambiguation symbols throughout the decoding network, so that the network carries word-boundary information directly. This reduces timing errors in the decoder output. The thesis also proposes a complementary word-label back-pushing decoding algorithm that keeps the word labels in the lattice accurate.

Removing the complex redundancy in lattices is one of the most important problems in fault-tolerant decoding. The thesis analyzes in detail how redundancy arises and divides it into two types: word-boundary redundancy and language-model back-off redundancy. The analysis shows that word-boundary redundancy severely degrades lattice quality and causes errors in keyword spotting and decoding, which makes it one of the key problems that lattice generation in WFST decoders must solve. To remove lattice redundancy completely, the thesis proposes a word-level determinization algorithm and proves mathematically that the algorithm preserves fault tolerance, in the sense that lattice coverage does not decrease, and keeps the timing information accurate. The thesis further proposes a novel word-pair filtering algorithm and proves theoretically that the lattices it produces are equivalent to the original ones. As the main innovation of the thesis, the word-pair filtering algorithm quickly removes most of the redundant nodes and arcs in a lattice. Combined with the word-level determinization algorithm, it yields lattices that are broadly fault-tolerant and free of redundancy.

Experiments show that the proposed decoding algorithms and the implemented decoding system produce high-quality lattices whose Maximum Term-Weighted Value (MTWV) is on par with leading international systems. Decoding is also highly efficient: the system's time consumption is only 1/3 of that reported in the latest research in this field. The fault-tolerant WFST decoding engine developed in this thesis has been put into practical use in relevant national departments, where it performs well.
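
To make the notion of word-boundary redundancy concrete, the following minimal Python sketch shows what such redundancy can look like and how a naive filter might collapse it. It is an illustration under stated assumptions, not the thesis's word-level determinization or word-pair filtering algorithm: the `Arc` record, the `collapse_boundary_variants` function, and the `tolerance` parameter are all hypothetical names, and the sketch ignores lattice topology and gives no coverage or equivalence guarantee of the kind the thesis proves.

```python
from collections import defaultdict
from typing import NamedTuple


class Arc(NamedTuple):
    """Hypothetical word arc in a lattice (not the thesis's data structure)."""
    word: str
    start: int    # start frame
    end: int      # end frame
    cost: float   # combined acoustic + language-model cost; lower is better


def collapse_boundary_variants(arcs, tolerance=3):
    """Keep the cheapest arc among same-word arcs with nearly identical boundaries.

    Two arcs count as boundary variants when they carry the same word and both
    their start and end frames differ by at most `tolerance` frames.
    """
    by_word = defaultdict(list)
    for arc in arcs:
        by_word[arc.word].append(arc)

    kept = []
    for word_arcs in by_word.values():
        representatives = []
        # Visit cheap arcs first so each cluster is represented by its best arc.
        for arc in sorted(word_arcs, key=lambda a: a.cost):
            is_variant = any(
                abs(arc.start - rep.start) <= tolerance
                and abs(arc.end - rep.end) <= tolerance
                for rep in representatives
            )
            if not is_variant:
                representatives.append(arc)
        kept.extend(representatives)
    return sorted(kept, key=lambda a: (a.start, a.end, a.word))


if __name__ == "__main__":
    raw = [
        Arc("speech", 10, 42, 5.1),
        Arc("speech", 11, 42, 5.4),   # same word, boundary shifted by one frame
        Arc("speech", 10, 43, 5.0),   # same word, boundary shifted by one frame
        Arc("peach", 12, 42, 7.3),    # different word: a genuine alternative
        Arc("speech", 60, 95, 4.2),   # same word, but a separate occurrence
    ]
    for arc in collapse_boundary_variants(raw):
        print(arc)
```

In this toy input, three of the five arcs are boundary variants of one occurrence of "speech", so only three arcs survive. A keyword search over the raw arcs would report that occurrence three times, which is the kind of error the abstract attributes to word-boundary redundancy.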