

Investigating and Improving the Expressiveness and Robustness of Dubbing Audio Generation

Author: 李思磐
  • Student ID
    2020******
  • Degree
    Master's
  • Email
    lsp******.cn
  • Defense date
    2023.05.16
  • Supervisor
    吴志勇
  • Discipline
    Electronic Information
  • Pages
    78
  • Confidentiality
    Public
  • Affiliation
    599 International Graduate School
  • Keywords
    automatic dubbing, sound effect generation, speech synthesis, vocoder, digital signal processing

Abstract


Dubbing is the process of adding sound to films or other multimedia. A sound effect is a sound added to the soundtrack to enhance the realism, atmosphere, or dramatic impact of a scene, mainly comprising music and effect sounds. Adding sound effects and making the dialogue more expressive in films, video games, music, and other media greatly enhances the audience's sense of immersion. At present, creating sound effects by hand (Foley) is time-consuming and labor-intensive, Foley studios are gradually disappearing, and the craft is difficult to pass on. In professional film and television production, lines, dialogue, and sound effects are usually recorded separately, so a complete dubbing workflow must produce not only the sound effects but also the lines and dialogue.

This thesis investigates three key issues in state-of-the-art methods for automatic dubbing generation: the expressiveness of generated dubbing sound effects, the prosodic expressiveness of dubbing speech synthesis, and the robustness of universal vocoders for sound generation. The main contributions are as follows.

Firstly, to address the slow training and the limited expressiveness and robustness of existing dubbing sound effect generation models, this thesis proposes an acoustic model based on a Transformer encoder-decoder structure and builds a non-autoregressive sound effect generation scheme, which speeds up training and improves the expressiveness and robustness of the generated sound effects. Existing sound effect generation methods train slowly, and the sound effects they generate often mismatch the video content and fall out of temporal alignment with it. The proposed Transformer-based encoder-decoder acoustic model incorporates positional encoding and category embeddings. Experimental results demonstrate the effectiveness of the approach in improving the expressiveness and robustness of sound effect generation and in speeding up model training.

Secondly, to address the lack of prosodic correspondence between the source and target languages in automatic dubbing, this thesis proposes a cross-lingual word-level duration prediction and control scheme that improves dubbing expressiveness on game voice data by transferring word-level prosodic style from the source language to the target language. Mainstream speech synthesis models do not consider the cross-lingual prosodic correspondence that dubbing should preserve between source and target languages. By predicting and controlling word-level duration and rhythm across languages, the proposed method realizes cross-lingual word-level prosodic style prediction and transfer, enhancing the expressiveness of dubbing speech synthesis.

Finally, to address the weak generalization and robustness of existing vocoders, this thesis proposes a universal vocoder based on differentiable digital signal processing (DDSP). Mainstream neural vocoders degrade in quality when confronted with unseen data distributions. The proposed approach incorporates signal processing as prior knowledge into the neural vocoder and introduces a periodicity bias that matches the periodic nature of audio.
Once trained, the universal vocoder can be applied to a wide range of audio generation scenarios, including expressive speech and singing for unseen speakers, instrument sounds such as violin, and non-speech vocalizations such as laughter, demonstrating strong generalization and versatility beyond the training domain and achieving superior performance.
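To illustrate the first contribution, the following is a minimal, hypothetical PyTorch sketch of a non-autoregressive Transformer encoder-decoder acoustic model that maps video features to mel-spectrogram frames, conditioned on a sound-effect category embedding and sinusoidal positional encoding. The layer counts, feature dimensions, class names, and interface are illustrative assumptions, not the architecture actually used in the thesis.

```python
# A minimal sketch (assumptions: video features -> mel-spectrogram, PyTorch nn.Transformer,
# illustrative dimensions). Not the thesis's exact model.
import math
import torch
import torch.nn as nn


def sinusoidal_positional_encoding(length: int, dim: int) -> torch.Tensor:
    """Standard sinusoidal positional encoding of shape (length, dim)."""
    position = torch.arange(length).unsqueeze(1).float()
    div_term = torch.exp(torch.arange(0, dim, 2).float() * (-math.log(10000.0) / dim))
    pe = torch.zeros(length, dim)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe


class SoundEffectTransformer(nn.Module):
    def __init__(self, video_dim=512, mel_dim=80, d_model=256, n_categories=16):
        super().__init__()
        self.video_proj = nn.Linear(video_dim, d_model)          # project video features
        self.category_emb = nn.Embedding(n_categories, d_model)  # sound-effect category embedding
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=4,
            num_encoder_layers=3, num_decoder_layers=3,
            batch_first=True,
        )
        self.mel_head = nn.Linear(d_model, mel_dim)               # predict mel frames

    def forward(self, video_feats, category_id, n_mel_frames):
        # video_feats: (batch, T_video, video_dim); category_id: (batch,)
        src = self.video_proj(video_feats)
        src = src + sinusoidal_positional_encoding(src.size(1), src.size(2)).to(src.device)
        src = src + self.category_emb(category_id).unsqueeze(1)   # condition on category

        # Non-autoregressive decoding: the decoder input is only positional encoding,
        # so all mel frames are produced in a single parallel pass.
        tgt = sinusoidal_positional_encoding(n_mel_frames, src.size(2)).to(src.device)
        tgt = tgt.unsqueeze(0).expand(src.size(0), -1, -1)
        out = self.transformer(src, tgt)
        return self.mel_head(out)  # (batch, n_mel_frames, mel_dim)


# Example forward pass on random data.
model = SoundEffectTransformer()
mel = model(torch.randn(2, 100, 512), torch.tensor([3, 7]), n_mel_frames=200)
print(mel.shape)  # torch.Size([2, 200, 80])
```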
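For the second contribution, the sketch below illustrates one simple reading of cross-lingual word-level duration control: rescaling the predicted phoneme durations of each target-language word so that the word's total duration matches that of the aligned source-language word. The function name, data layout, and example numbers are invented for illustration; the thesis's actual duration predictor and control mechanism may differ.

```python
# A hypothetical illustration of word-level duration transfer across languages.
# All names and values are assumptions for the sake of the example.

def transfer_word_durations(src_word_durations, tgt_phoneme_durations, alignment):
    """
    src_word_durations:    {src_word_index: total duration in frames}
    tgt_phoneme_durations: {tgt_word_index: [per-phoneme durations in frames]}
    alignment:             {tgt_word_index: src_word_index} word-level alignment
    Returns rescaled per-phoneme durations so each target word matches its source word.
    """
    controlled = {}
    for tgt_idx, phon_durs in tgt_phoneme_durations.items():
        src_total = src_word_durations[alignment[tgt_idx]]
        scale = src_total / sum(phon_durs)          # word-level tempo ratio
        controlled[tgt_idx] = [d * scale for d in phon_durs]
    return controlled


# Example: a source word lasting 30 frames aligned to a target word whose
# predicted phoneme durations sum to 24 frames.
src = {0: 30}
tgt = {0: [8, 10, 6]}
align = {0: 0}
print(transfer_word_durations(src, tgt, align))  # {0: [10.0, 12.5, 7.5]}
```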
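For the third contribution, the following NumPy sketch shows the signal-processing prior behind DDSP-style synthesis: audio is modeled as a sum of harmonic sinusoids driven by f0 (an explicitly periodic component) plus shaped noise. In a real DDSP vocoder the amplitudes, f0, and noise filter would be predicted by a neural network; the sample rate, frame rate, and harmonic count here are illustrative assumptions, and the thesis's periodicity bias and vocoder design may differ.

```python
# A minimal sketch of harmonic-plus-noise synthesis, the DDSP building block
# referred to above. Illustrative parameters only.
import numpy as np

SAMPLE_RATE = 16000


def harmonic_plus_noise(f0, harmonic_amps, noise_gain, hop=200):
    """
    f0:            (n_frames,) fundamental frequency in Hz
    harmonic_amps: (n_frames, n_harmonics) per-harmonic amplitudes
    noise_gain:    (n_frames,) gain of the stochastic component
    Returns a waveform sampled at SAMPLE_RATE.
    """
    n_frames, n_harmonics = harmonic_amps.shape
    n_samples = n_frames * hop

    # Upsample frame-level controls to sample level by linear interpolation.
    t_frames = np.arange(n_frames) * hop
    t_samples = np.arange(n_samples)
    f0_s = np.interp(t_samples, t_frames, f0)
    gain_s = np.interp(t_samples, t_frames, noise_gain)

    # Harmonic part: integrate instantaneous frequency to get phase (the periodic prior).
    phase = 2 * np.pi * np.cumsum(f0_s) / SAMPLE_RATE
    harmonic = np.zeros(n_samples)
    for k in range(1, n_harmonics + 1):
        amp_k = np.interp(t_samples, t_frames, harmonic_amps[:, k - 1])
        harmonic += amp_k * np.sin(k * phase)

    # Stochastic part: white noise shaped by a frame-level gain (a crude stand-in
    # for the learned noise filter of a full DDSP model).
    noise = gain_s * np.random.randn(n_samples)
    return harmonic + noise


# Example: 100 frames (~1.25 s) of a 220 Hz tone with 8 decaying harmonics.
f0 = np.full(100, 220.0)
amps = np.outer(np.ones(100), 1.0 / np.arange(1, 9))
wave = harmonic_plus_noise(f0, amps, noise_gain=np.full(100, 0.003))
print(wave.shape)  # (20000,)
```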