With the development of artificial intelligence technology and the popularization of intelligent voice products, expressive speech synthesis is being applied in an increasingly wide range of scenarios. Prosody and style modeling is an important research direction in expressive speech synthesis: by modeling the prosody and style information in speech, synthesized speech can be made more natural, fluent, and lively. This thesis comprises four parts, which improve the expressiveness of synthesized speech from four angles: front-end prosodic structure prediction, multi-scale style modeling in the acoustic model, style pre-training of the acoustic model, and style selection by target users. The main work and contributions of this thesis are as follows:

Firstly, a prosodic structure prediction model named SpanPSP, based on span representations, is proposed. To address the insufficient prosodic performance of synthesized speech, this thesis starts from the correlation between front-end text semantics and prosodic realization. Spans are used to represent the prosodic structure tree, and the tree structure unifies the prediction tasks for the three prosodic levels of prosodic word, prosodic phrase, and intonational phrase; a dynamic programming algorithm is then used to search for the optimal prosodic structure tree. Experimental results show that the proposed method predicts prosodic structure more accurately and effectively improves the prosodic performance of synthesized speech.

Secondly, an expressive audiobook speech synthesis model named MSHCE, based on unsupervised multi-scale style representations, is proposed. To address the lack of local expressiveness in synthesized speech, this thesis approaches the problem from the perspective of multi-scale style modeling in the acoustic model.
A multi-scale hierarchical context encoder is employed to model and predict style representations at the global scale (sentence level) and the local scale (phoneme level) from the contextual text surrounding the sentence to be synthesized. Experimental results show that the proposed model effectively improves both the global and local style expressiveness of synthesized speech.

Thirdly, an expressive audiobook speech synthesis model named StyleSpeech, which strengthens style modeling via self-supervised pre-training, is proposed. To address the poor out-of-domain expressiveness of synthesized speech, this thesis starts from the perspective of style pre-training of the acoustic model. A text style encoder and an audio style extractor are designed and pre-trained on a large amount of unlabeled data, improving the generalization ability of the acoustic model's style representations on both the text side and the audio side. Experimental results show that the proposed model effectively improves the out-of-domain style expressiveness of synthesized speech.

Finally, a human-in-the-loop style selection speech synthesis model for target users, named HILvoice, is proposed. To address the problem that different user groups perceive and demand speech styles differently, this thesis starts from the perspective of target-user style selection. Target users are brought into the speech synthesis loop to select styles, and the synthesis model is optimized and updated according to their preference feedback, so as to synthesize speech better suited to them. Experimental results show that this method synthesizes speech that target users prefer.
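The dynamic-programming search over span scores mentioned in the first contribution (SpanPSP) follows the general pattern of CKY-style span-based tree decoding. The sketch below is only an illustration of that pattern, not the thesis implementation: the scoring function, the label set ("PW", "PPH", "IPH" for prosodic word / prosodic phrase / intonational phrase), and all names are assumptions.

```python
# Illustrative CKY-style decoding for span-based prosodic structure
# prediction: given a score s(i, j, label) for each text span, find the
# highest-scoring tree. Hypothetical sketch, not the SpanPSP code.

def best_tree(span_scores, n, labels):
    """Dynamic-programming search for the optimal prosodic structure tree.

    span_scores: dict mapping (i, j) -> {label: float score} for span [i, j)
    n: sentence length in tokens
    labels: candidate prosodic levels, e.g. ("PW", "PPH", "IPH")
    Returns (best score, tree) where tree = (label, i, j, children).
    """
    chart = {}  # (i, j) -> (best score, best subtree)
    for length in range(1, n + 1):
        for i in range(0, n - length + 1):
            j = i + length
            # choose the best label for this span (missing scores count as 0)
            label, label_score = max(
                ((l, span_scores.get((i, j), {}).get(l, 0.0)) for l in labels),
                key=lambda x: x[1],
            )
            if length == 1:
                chart[(i, j)] = (label_score, (label, i, j, []))
                continue
            # choose the best split point k, combining the best subtrees
            k, split_score = max(
                ((k, chart[(i, k)][0] + chart[(k, j)][0]) for k in range(i + 1, j)),
                key=lambda x: x[1],
            )
            children = [chart[(i, k)][1], chart[(k, j)][1]]
            chart[(i, j)] = (label_score + split_score, (label, i, j, children))
    return chart[(0, n)]
```

Because every span's best subtree is computed once and reused, the search is exact in O(n^3 · |labels|) time rather than enumerating all trees.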
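The human-in-the-loop idea of the fourth contribution (HILvoice) can be sketched as a simple preference-feedback loop: synthesize candidates under perturbed style vectors, ask the target user which one they prefer, and move the style estimate toward the preferred candidate. Everything below (function names, the Gaussian perturbation, the update rule, the step size) is an assumed, minimal stand-in for the thesis's actual optimization procedure.

```python
# Minimal sketch of a human-in-the-loop style selection loop: pairwise
# user preferences steer a style embedding. Names and the update rule
# are illustrative assumptions, not the HILvoice implementation.
import random

def hitl_style_selection(synthesize, ask_user, style_prior,
                         rounds=5, step=0.3, noise=0.5):
    """Iteratively refine a style vector from pairwise user feedback.

    synthesize: style vector -> audio (any object the user can compare)
    ask_user: (audio_a, audio_b) -> "a" or "b", the user's preference
    style_prior: initial style vector (list of floats)
    """
    style = list(style_prior)
    for _ in range(rounds):
        # propose two perturbed style candidates around the current estimate
        cand_a = [s + random.gauss(0, noise) for s in style]
        cand_b = [s + random.gauss(0, noise) for s in style]
        # the user listens to both synthesized samples and picks one
        choice = ask_user(synthesize(cand_a), synthesize(cand_b))
        preferred = cand_a if choice == "a" else cand_b
        # move the style estimate toward the preferred candidate
        style = [s + step * (p - s) for s, p in zip(style, preferred)]
    return style
```

With a consistent user, the style vector drifts toward the region of style space the user prefers, which is the intuition behind optimizing the synthesis model from target-user preference feedback.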