Talking face video synthesis drives a face image or a virtual character with given speaking content, such as text or speech, to produce a natural and fluent talking video. It is widely used in social and media scenarios such as virtual meetings and online education, enriching the information with both audio and visual modalities while greatly improving efficiency and reducing operating costs. The technology must model the complex and dynamic mapping between facial animation and speech audio, and it suffers from poor flexibility when only speech input is accepted, insufficient generalization to arbitrary faces or arbitrary speech styles, and low detail fidelity in the synthesized faces, making it a highly challenging cross-modal task.

First, to address the inability to flexibly synthesize videos of arbitrary speaking content, this paper designs a system front end that accepts both text and speech as input. For arbitrary input text, only a short reference speech recording from the user is required: the user's speech style features are extracted from the reference audio and applied to a multi-speaker speech synthesis system, which synthesizes speech with the desired content and style and in turn drives the video synthesis.

Second, to better imitate the style of the reference speech, this paper extracts a speaker representation at the coarse-grained sentence level and clones prosody at the fine-grained phoneme level, and applies both the speaker representation and the prosodic features to the speech synthesis model to achieve speech synthesis with accurate prosody cloning. To obtain an accurate phoneme segmentation of the reference speech, supervised CTC-based and unsupervised nearest-neighbor-clustering-based segmentation methods are proposed for the cases with and without text annotation, respectively, enabling attributes such as fundamental frequency and energy to be extracted at the phoneme level as prosodic features.

Finally, this paper employs a neural radiance field to model facial animation with high detail fidelity. To overcome the fact that a neural radiance field does not generalize and cannot be extended to arbitrary faces, a variational dynamic mapper is introduced to learn, at large scale, a generic and consistent alignment between facial keypoints and speech, and a target face domain transfer network is further designed to adapt this generic mapping to the user's face domain. Since a talking face is non-rigid and deformable, the proposed differentiable generalization network constrains the offsets of the predicted keypoints, which serve as conditions to guide the rendering of the final neural radiance field. Experimental results show that the proposed method can synthesize talking face videos with good 3D realism and audio-visual consistency using only a limited amount of fine-tuning data and iterations.
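To make the phoneme-level prosody features described above concrete, the following is a minimal illustrative sketch rather than the thesis implementation: assuming phoneme boundaries have already been obtained (e.g., by the supervised CTC or unsupervised nearest-neighbor segmentation), it averages frame-level fundamental frequency (F0) and RMS energy within each phoneme segment using librosa. The function name `phoneme_level_prosody` and the frame parameters are assumptions made for illustration only.

```python
import numpy as np
import librosa

def phoneme_level_prosody(wav_path, phoneme_segments, sr=16000, hop_length=200):
    """phoneme_segments: list of (start_sec, end_sec) tuples, one per phoneme."""
    y, _ = librosa.load(wav_path, sr=sr)

    # Frame-level fundamental frequency (F0); unvoiced frames are returned as NaN.
    f0, _, _ = librosa.pyin(
        y,
        fmin=librosa.note_to_hz("C2"),
        fmax=librosa.note_to_hz("C7"),
        sr=sr,
        hop_length=hop_length,
    )
    # Frame-level energy (root-mean-square amplitude).
    energy = librosa.feature.rms(y=y, hop_length=hop_length)[0]

    features = []
    for start, end in phoneme_segments:
        s = int(start * sr / hop_length)
        e = max(s + 1, int(end * sr / hop_length))
        f0_seg = f0[s:e]
        en_seg = energy[s:e]
        # Average within the phoneme; fall back to 0.0 for fully unvoiced segments.
        f0_mean = float(np.nanmean(f0_seg)) if np.any(np.isfinite(f0_seg)) else 0.0
        en_mean = float(np.mean(en_seg)) if en_seg.size else 0.0
        features.append((f0_mean, en_mean))
    return np.array(features)  # shape (num_phonemes, 2): (F0, energy) per phoneme
```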
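Likewise, the PyTorch sketch below only illustrates the conditioning idea in the last paragraph: predicted facial-keypoint offsets are fed into a NeRF-style MLP together with positionally encoded sample points, so that the non-rigid deformation of the talking face can guide rendering. The class name `KeypointConditionedNeRF`, the layer sizes, and the 68-keypoint layout are illustrative assumptions, not the thesis network, which additionally contains the variational dynamic mapper and the target face domain transfer modules.

```python
import math
import torch
import torch.nn as nn

def positional_encoding(x, num_freqs=10):
    """Standard NeRF-style sinusoidal encoding of 3-D sample points."""
    freqs = 2.0 ** torch.arange(num_freqs, device=x.device, dtype=x.dtype) * math.pi
    angles = x[..., None] * freqs                          # (..., 3, num_freqs)
    enc = torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)
    return enc.flatten(start_dim=-2)                       # (..., 3 * 2 * num_freqs)

class KeypointConditionedNeRF(nn.Module):
    """Toy NeRF MLP whose output is conditioned on facial-keypoint offsets."""

    def __init__(self, num_keypoints=68, num_freqs=10, hidden=256):
        super().__init__()
        in_dim = 3 * 2 * num_freqs + num_keypoints * 3     # encoded point + offsets
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),                          # RGB (3) + density (1)
        )

    def forward(self, points, keypoint_offsets):
        # points: (N, 3) sample locations; keypoint_offsets: (N, num_keypoints * 3)
        h = torch.cat([positional_encoding(points), keypoint_offsets], dim=-1)
        out = self.mlp(h)
        rgb = torch.sigmoid(out[..., :3])                  # colour in [0, 1]
        sigma = torch.relu(out[..., 3:])                   # non-negative density
        return rgb, sigma

# Usage: query 1024 sample points conditioned on one frame's (zero) keypoint offsets.
pts = torch.randn(1024, 3)
offsets = torch.zeros(1024, 68 * 3)
rgb, sigma = KeypointConditionedNeRF()(pts, offsets)
```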