
3D Head Animation Generation Method Based on Deep Learning

Author: Chen Benwang
  • Student ID
    2020******
  • Degree
    Master
  • Email
    cbw******com
  • Defense Date
    2023.05.16
  • Supervisor
    Wang Haoqian
  • Discipline
    Electronic Information
  • Pages
    64
  • Confidentiality
    Public
  • Affiliation
    599 International Graduate School
  • Keywords
    3D Computer Vision, Facial Animation, Neural Radiance Fields, Multi-modal

Abstract


With the development of virtual networks and the rise of the metaverse, interactive avatars have attracted growing attention in fields such as video conferencing and film and television production, and drivable 3D head animation is an important research direction for realizing avatar interaction. Current audio-driven 3D head animation methods have achieved impressive results; however, they are constrained by the audio distribution and are difficult to edit. Text, in contrast, is easy to edit, but text-driven methods lack corresponding 3D datasets and have not been well developed. This thesis first collects and produces a text-based multi-view dynamic head dataset, with one training script and six test scripts. Using a multi-view acquisition platform, 1.5 million images and the corresponding audio were captured, with an image resolution of 4096×3000 across 12 viewpoints. Secondly, a text-driven 3D head animation method is proposed that can edit and generate high-fidelity 3D head animations from text. The method models and encodes the multi-view head videos with a neural radiance field and, following the idea of controlled variables, designs a Decoupled Facial Neural Radiance Field (DFNeRF) that disentangles the different attributes of the head and enables separate control over its different regions. Time-aligned phoneme sequences are used to establish a mapping between phonemes and expression latent codes; given input text, the method looks up and combines expression latent codes to generate the corresponding head animation. To make the mouth shapes of adjacent phonemes transition smoothly, a phoneme search algorithm with a discriminator is introduced into the lookup, finding the phonemes that best match the text. Extensive experiments on text-based editing and generation on the proposed dataset demonstrate the effectiveness of the proposed method.
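The phoneme-to-latent-code lookup described above can be sketched as follows. This is a minimal illustration, not the thesis implementation: it assumes a table mapping each phoneme to the expression latent codes observed for it, and uses a toy distance-based score as a stand-in for the discriminator that enforces smooth transitions between adjacent mouth shapes. All function and variable names are hypothetical.

```python
import numpy as np

def build_phoneme_table(phonemes, latent_codes):
    """Map each phoneme to the list of expression latent codes
    observed for it in the time-aligned training sequence."""
    table = {}
    for p, z in zip(phonemes, latent_codes):
        table.setdefault(p, []).append(z)
    return table

def select_codes(text_phonemes, table, score_fn):
    """For each phoneme of the input text, pick the stored latent code
    that scores best against the previously chosen code, so adjacent
    mouth shapes transition smoothly."""
    out, prev = [], None
    for p in text_phonemes:
        best = max(table[p], key=lambda z: score_fn(prev, z))
        out.append(best)
        prev = best
    return np.stack(out)

def smoothness_score(prev, z):
    # Toy stand-in for the discriminator: prefer codes close to the
    # previous one (the first phoneme accepts any candidate).
    return 0.0 if prev is None else -float(np.linalg.norm(z - prev))
```

The selected latent codes would then condition the decoupled facial NeRF frame by frame to render the animation.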
Finally, for 3D head geometry animation, a multi-modal driving framework based on multi-head attention is proposed to address the over-smoothing of lip movements caused by relying on audio as a single modality. The framework learns multi-modal features from audio, text, and phonemes to generate facial geometry animation: the intrinsic relationship between phonemes and mouth movements is exploited to limit over-smoothing in the audio-to-lip mapping, and a powerful pre-trained language model extracts contextual semantic information from the text to eliminate the facial animation bias caused by different speaker identities in the audio. Extensive qualitative and quantitative experiments demonstrate the effectiveness of the proposed method.
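The multi-head attention fusion of the three modalities can be illustrated with a minimal sketch. This assumes pre-extracted per-frame feature sequences for audio, text, and phonemes of a shared dimension, and uses plain scaled dot-product attention without learned projections; the thesis framework is more elaborate, and all names below are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(q, k, v, num_heads):
    """Scaled dot-product attention split across heads along the
    feature dimension (no learned projections, for illustration)."""
    d = q.shape[1]
    assert d % num_heads == 0
    dh = d // num_heads
    outs = []
    for h in range(num_heads):
        s = slice(h * dh, (h + 1) * dh)
        attn = softmax(q[:, s] @ k[:, s].T / np.sqrt(dh))  # (T_q, T_k)
        outs.append(attn @ v[:, s])                        # (T_q, dh)
    return np.concatenate(outs, axis=1)                    # (T_q, d)

def fuse_modalities(audio, text, phoneme, num_heads=4):
    """Let audio frames attend to text and phoneme features, then
    average the two attended streams as a crude cross-modal fusion."""
    a2t = multi_head_attention(audio, text, text, num_heads)
    a2p = multi_head_attention(audio, phoneme, phoneme, num_heads)
    return (a2t + a2p) / 2
```

The fused features would then be decoded into per-frame facial geometry; in practice, learned query/key/value projections (e.g. a standard transformer attention layer) would replace the raw slicing used here.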