Voice conversion is the process of changing the characteristics of a voice so that it sounds like a different speaker while preserving the original linguistic content of the speech. It has a wide range of applications across industries: in entertainment it can recreate the voice of a specific historical figure, and in gaming it can generate voices for non-player characters. It is also an enabling technology for related applications such as speech synthesis, voice imitation, and speech recognition. Voice conversion has practical value in speech therapy, where it can help people with speech disorders or disabilities modify their voices to become clearer and more intelligible, and in forensic investigations, where it can disguise a speaker's voice to protect their identity or assist in identifying a suspect. As speech-based technologies continue to advance, voice conversion is becoming increasingly important. To address data scarcity in voice conversion and to generate highly expressive speech, this thesis adopts deep learning-based methods for data augmentation and for modelling, disentangling, controlling, and transferring the content, timbre, rhythm, and pitch representations of speech. The main research contents and contributions of this thesis are as follows:

(1) A few-shot voice conversion method based on dual learning, which requires relatively little data, is proposed. Data scarcity remains a key open problem in speech research. Traditional voice conversion methods require both speech and corresponding text representations to train a model; this thesis instead performs conversion directly on mel spectrograms extracted from raw audio, without any additional text information. First, the acoustic module of the sequence-to-sequence speech synthesis model Tacotron-2 is adapted to the voice conversion task, yielding the BaseVC model. Given sufficient training data, BaseVC converts the source speaker's mel spectrogram into the target speaker's mel spectrogram with high quality. In few-shot scenarios, however, such as when the training data is reduced to 30%, the model suffers from alignment problems and conversion quality degrades. To address this, a dual-learning mechanism named DualVC is proposed. Exploiting the duality of the voice conversion task, a dual training cycle is designed that helps BaseVC generate pseudo samples to enlarge the training data, and two objective loss functions are designed for this cycle to achieve better alignment. To further probe the upper bound of the method's performance, a curriculum learning strategy pre-trains both BaseVC and DualVC on another dataset with an adequate corpus. Experiments show that the proposed DualVC and curriculum learning strategies improve performance while reducing dependence on the dataset.
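As an illustration of the text-free front end described above, the sketch below shows how a log-mel spectrogram can be extracted directly from raw audio. The specific parameters (22.05 kHz sampling, 80 mel bins, 1024-point FFT, 256-sample hop) are typical Tacotron-2-style settings assumed here for illustration; the abstract does not fix them.

```python
import librosa
import numpy as np

def extract_mel(path: str, sr: int = 22050, n_fft: int = 1024,
                hop_length: int = 256, n_mels: int = 80) -> np.ndarray:
    """Load raw audio and return a log-mel spectrogram of shape (n_mels, frames)."""
    y, _ = librosa.load(path, sr=sr)          # decode and resample the waveform
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels)
    return librosa.power_to_db(mel, ref=np.max)  # compress to a log scale

# Both source and target utterances are reduced to this representation;
# no transcript or phoneme sequence is required at any stage.
```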
(2) A one-shot controllable voice conversion method based on speech representation disentanglement with adversarial mutual information learning is proposed. Disentangling speech representations is one of the foundations of high-quality voice conversion. However, most existing studies do not consider disentangling rhythm, so pitch and rhythm remain entangled, which degrades conversion quality. This thesis proposes an adversarial mutual information learning approach that disentangles several distinct speech representations. First, a common classifier for timbre representation learning is constructed to identify the timbre information that is closely correlated with speaker identity. A gradient reversal layer (GRL) is then used to keep speaker-irrelevant information as separate from the speaker as possible.
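The gradient reversal layer just mentioned has a standard, compact realization: it is the identity in the forward pass and negates (and optionally scales) the gradient in the backward pass, so the encoder it sits on is trained adversarially against the speaker classifier. The sketch below is a minimal PyTorch version; the classifier and embedding names in the usage comment are hypothetical.

```python
import torch
from torch.autograd import Function

class GradReverse(Function):
    """Identity on the forward pass; multiplies the gradient by -lambd backward."""

    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Flip the gradient flowing back into the encoder so it learns to
        # discard whatever the downstream speaker classifier can exploit.
        return -ctx.lambd * grad_output, None

def grad_reverse(x: torch.Tensor, lambd: float = 1.0) -> torch.Tensor:
    return GradReverse.apply(x, lambd)

# Hypothetical usage: the classifier is trained to predict the speaker while
# the reversed gradient pushes the encoder to remove speaker information.
# speaker_logits = speaker_classifier(grad_reverse(content_embedding))
```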
Next, random resampling removes pitch and content information from the rhythm representation, and a pitch decoder ensures that the pitch encoder captures accurate pitch information. Finally, the variational contrastive log-ratio upper bound (vCLUB) of mutual information pushes the speaker-irrelevant representations to be as independent of one another as possible. Experiments show that the proposed method achieves proper disentanglement of the rhythm, content, timbre, and pitch representations and can transfer each representation style separately in one-shot voice conversion. Adversarial mutual information learning improves the performance and robustness of the speech representation disentanglement, and the disentanglement in turn improves the naturalness and intelligibility of one-shot voice conversion.
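For reference, vCLUB estimates a variational upper bound on the mutual information I(X; Y) between two representations by fitting an auxiliary network q_theta(y|x) and contrasting paired samples against shuffled ones, following the CLUB formulation of Cheng et al. The sketch below assumes a diagonal-Gaussian q_theta with illustrative layer sizes; it is not the thesis's exact architecture.

```python
import torch
import torch.nn as nn

class VCLUB(nn.Module):
    """Sampled variational CLUB upper bound on I(X; Y) (after Cheng et al., 2020).

    An auxiliary network models q_theta(y|x) as a diagonal Gaussian; the bound
    contrasts the log-likelihood of paired samples against shuffled ones.
    """

    def __init__(self, x_dim: int, y_dim: int, hidden: int = 256):
        super().__init__()
        self.mu = nn.Sequential(nn.Linear(x_dim, hidden), nn.ReLU(),
                                nn.Linear(hidden, y_dim))
        self.logvar = nn.Sequential(nn.Linear(x_dim, hidden), nn.ReLU(),
                                    nn.Linear(hidden, y_dim), nn.Tanh())

    def log_likelihood(self, x, y):
        mu, logvar = self.mu(x), self.logvar(x)
        return (-(mu - y) ** 2 / logvar.exp() - logvar).sum(dim=1)

    def forward(self, x, y):
        # MI upper-bound estimate, minimized with respect to the encoders:
        # E[log q(y|x)] over true pairs minus the same term over shuffled pairs.
        positive = self.log_likelihood(x, y)
        negative = self.log_likelihood(x, y[torch.randperm(y.size(0))])
        return (positive - negative).mean() / 2.0

    def learning_loss(self, x, y):
        # q_theta itself is trained by maximum likelihood on true pairs.
        return -self.log_likelihood(x, y).mean()
```

In training, the forward estimate would be minimized between, for example, the pitch and content embeddings, while learning_loss updates q_theta in alternation; the pairing of representations shown here is illustrative.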