
Research and Application of Voice Conversion Technology for Complex Environments

Author: Xiao Long
  • Student ID
    2021******
  • Degree
    Master's
  • Email
    xql******com
  • Defense Date
    2024.05.14
  • Advisor
    Wu Zhiyong
  • Discipline
    Computer Technology
  • Pages
    50
  • Confidentiality
    Public
  • Affiliation
    599 International Graduate School
  • Keywords
    voice conversion; diffusion model; decoupled representation; environmental sound

Abstract

In recent years, with the rapid development of deep learning technology, significant progress has been made in the field of voice conversion, which has been widely applied. The main goal of voice conversion technology is to transform a voice from its original identity to another specific voice identity while ensuring consistency in linguistic content. This technology demonstrates broad application prospects in personalized speech synthesis, gender or age adjustment in speech, entertainment, and voice restoration, among many other areas.

However, challenges and areas for improvement still exist when conducting voice conversion in complex scenarios. These include limited acoustic feature separation capability and difficulty in effectively modeling the current acoustic scene in noisy environments, among other issues. This study aims to further enhance the performance of existing voice conversion models through optimization of content representation, addressing speech distortion in noisy environments, and improving acoustic feature modeling capability.
The main contributions of this work are as follows:

1) Addressing the robustness issues in voice conversion, we propose a voice conversion model in which phoneme probability maps generate the prior distribution of a diffusion model. We attempt to model content with a more robust method rather than relying on mean calculations over the dataset, allowing the model to retain more content and prosodic information instead of leaning on the duration information in MFA labels. This enhances the network's modeling capability and robustness, resulting in better and more robust conversion outcomes.

2) In response to the issue of speech distortion in noisy environments, we employ a novel training process that allows the proposed encoder to produce, for different waveforms, different codes representing environmental and speech information. The network maps input audio into an embedding space divided into multiple partitions, each capturing a different attribute of the input audio. By leveraging targeted data augmentation and custom loss functions, we introduce strong inductive biases into the model so that each partition is dedicated to a particular attribute. Finally, with additional speaker information, we reconstruct waveforms directly through the decoder, achieving better performance.

3) To enrich the timbres the model can generate, we propose a VAE-based algorithm that re-models the speaker representation, thereby enabling speaker generation through methods such as interpolation, and we employ adversarial training between generated and reconstructed samples to significantly enhance the disentanglement of acoustic representations. Our approach is grounded in the principle that the distribution of a VAE's reconstructed samples should be similar to that of its generated samples. During training, this property is exploited by incorporating an adversarial loss into the conventional VAE objective: a discriminator network is introduced, with the VAE's decoder serving as the generator. Experimental results validate the effectiveness of our method in disentangling content and speaker representations, as well as its ability to perform speaker generation tasks.
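The first contribution replaces a dataset-level mean with a content-aware prior for the diffusion process. The abstract gives no implementation details, so the following is only a minimal NumPy sketch under Grad-TTS-style assumptions: `ppg_prior_mean` (a hypothetical helper) turns a phoneme posteriorgram into a per-frame prior mean, and the forward process drifts the clean mel features toward that mean so that at t = 1 the sample is approximately N(mu, I).

```python
import numpy as np

def ppg_prior_mean(ppg, phoneme_mels):
    """Content-aware prior (sketch): expected mel under the soft phoneme
    probabilities, instead of a single dataset-wide mean vector."""
    # (T, n_phones) @ (n_phones, n_mels) -> (T, n_mels)
    return ppg @ phoneme_mels

def forward_diffusion(x0, mu, t, beta0=0.05, beta1=20.0):
    """Grad-TTS-style forward process: x_t drifts from the clean sample x0
    toward the prior mean mu as t goes from 0 to 1."""
    # cumulative noise schedule: integral of beta(s) ds from 0 to t
    cum = beta0 * t + 0.5 * (beta1 - beta0) * t ** 2
    mean = x0 * np.exp(-0.5 * cum) + mu * (1.0 - np.exp(-0.5 * cum))
    var = 1.0 - np.exp(-cum)
    return mean + np.sqrt(var) * np.random.randn(*x0.shape)
```

At t = 1 the residual `x_t - mu` is close to unit Gaussian noise, which is the terminal distribution the reverse process starts from.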
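The second contribution relies on targeted augmentation plus a custom loss to bind each embedding partition to one attribute. The thesis's actual encoder is a neural network; the sketch below uses a toy linear stand-in and hypothetical names purely to illustrate the inductive bias: adding environmental noise to a waveform must not move the speech partition (invariance term) but should move the environment partition (separation term).

```python
import numpy as np

def encode(x, W):
    """Toy linear encoder standing in for the real network. The embedding
    is split into two partitions: speech attributes and environment
    attributes."""
    z = np.tanh(W @ x)
    half = z.shape[0] // 2
    return z[:half], z[half:]  # (z_speech, z_env)

def partition_loss(x_clean, x_noisy, W):
    """Inductive bias via targeted augmentation: the speech partition
    should be invariant to added environmental noise, while the
    environment partition should change with it."""
    s_c, e_c = encode(x_clean, W)
    s_n, e_n = encode(x_noisy, W)
    invariance = np.mean((s_c - s_n) ** 2)   # speech code should not move
    separation = -np.mean((e_c - e_n) ** 2)  # environment code should move
    return invariance + separation
```

In the full model this loss would be combined with the waveform reconstruction objective, with the decoder conditioned on additional speaker information as described above.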
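The third contribution combines a standard VAE objective with an adversarial term in which the decoder doubles as the generator, and supports speaker generation by interpolating speaker latents. The sketch below shows only the loss bookkeeping under assumed shapes; `decode` and `discriminate` are hypothetical callables standing in for the thesis's networks.

```python
import numpy as np

def vae_losses(x, enc_mu, enc_logvar, decode, discriminate):
    """One training step's losses for a VAE with an adversarial term: the
    discriminator compares reconstructions (decoded from encoded latents)
    against generations (decoded from prior samples), and the decoder is
    trained so the two are indistinguishable."""
    eps = np.random.randn(*enc_mu.shape)
    z = enc_mu + np.exp(0.5 * enc_logvar) * eps       # reparameterisation
    x_rec = decode(z)                                  # reconstruction
    x_gen = decode(np.random.randn(*enc_mu.shape))     # sample from prior
    recon = np.mean((x - x_rec) ** 2)
    kl = -0.5 * np.mean(1 + enc_logvar - enc_mu ** 2 - np.exp(enc_logvar))
    # adversarial term: push reconstructed and generated scores together
    adv = np.mean((discriminate(x_rec) - discriminate(x_gen)) ** 2)
    return recon, kl, adv

def interpolate_speaker(z_a, z_b, alpha):
    """New speaker identities via linear interpolation in the speaker
    latent space (alpha in [0, 1])."""
    return (1.0 - alpha) * z_a + alpha * z_b
```

Interpolating between two encoded speakers and decoding the result is one way the re-modeled speaker space supports speaker generation.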