With the development and integration of artificial intelligence and virtual reality, digital humans have become increasingly anthropomorphic, gradually approaching the level of real humans in expression, posture, movement, voice, and semantics. Current digital human motion generation methods are real-human-driven body motion generation and loop playback of pre-recorded motions. This study aims to generate natural, semantically matched body movements directly from speech. The main work and contributions of this thesis are as follows.

First, to address the weak correlation between modalities in gesture generation, this thesis proposes a representation learning framework based on a modality-invariant space and modality-specific spaces to generate more realistic, speech-matched gestures. The audio, text, and gesture modalities are projected into these two kinds of subspaces. To learn the commonalities shared across modalities and to capture the features of modality-specific representations, an adversarial classifier based on a gradient reversal layer and modality reconstruction decoders are used during training. The gesture decoder then generates appropriate gestures from these representations together with features related to the audio rhythm. The generated gestures achieved an average human-likeness score of 44.2 in the GENEA gesture generation challenge, ranking fourth.

Second, to address the generation of gestures with locally controllable motion, this thesis proposes a gesture-based VQ-VAE model that captures meaningful gesture units. Each code represents a distinct gesture, effectively mitigating the random jitter problem. The Levenshtein distance over quantized audio is then used as a similarity metric between the audio segments corresponding to gestures, which helps match more appropriate gestures to the speech. Phase, derived from semantics or audio rhythm, is further introduced to select the best gesture match: it guides when gestures should be generated based on text and when based on speech. The generated gestures achieve state-of-the-art performance in human likeness (4.07 ± 0.15) and speech appropriateness (3.77 ± 0.21).

Third, to address the generation of gestures with globally controllable style, this thesis introduces cross-local attention and self-attention into a diffusion model to generate more human-like, style-controllable gestures. The model is trained with classifier-free guidance, and gesture style is controlled through interpolation or extrapolation. In addition, different initial gestures and noise are used to improve the diversity of the generated gestures. This method improves gesture naturalness (4.11 ± 0.08) and speech appropriateness (4.11 ± 0.10), outperforming state-of-the-art gesture generation methods while providing effective control over gesture style.

Fourth, to address the generation of unified cross-skeleton gestures, this thesis proposes a retargeting network that learns latent homeomorphic graphs of different skeletons to integrate heterogeneous multi-source motion data. The method unifies the various skeletal representations while extending the dataset scale; to further refine the generated gestures, inverse kinematics is applied to the lower body as physics-based guidance. The method improves generation quality in both human likeness (3.80 ± 0.11) and Fréchet Gesture Distance (3.850). In addition, this thesis establishes the first framework that can simultaneously generate a speaker's spontaneous motions (e.g., co-speech gestures) and non-spontaneous motions (e.g., moving around the podium). A DoubleTake-based strategy ensures seamless transitions between motions, and experiments show that model performance improves as the motion database grows.
With the development and integration of artificial intelligence, virtual reality, and other technologies, the degree of anthropomorphism in digital humans has significantly increased: their expressions, postures, movements, voices, and semantics are gradually approaching the level of real humans. Currently, digital human gesture generation methods include real-human-driven body motion generation and loop playback of pre-recorded motions. This study aims to generate natural, semantically matched body movements directly from speech. The main contributions of this thesis are as follows.

First, to address the weak multimodal correlation problem in gesture generation systems, this thesis proposes a representation learning framework based on a modality-invariant space and modality-specific spaces to generate more realistic, speech-matched gestures. The audio, text, and gesture modalities are projected into these two kinds of subspaces. To learn the invariant commonalities shared across modalities and to capture the features of modality-specific representations, an adversarial classifier based on a gradient reversal layer and a modality reconstruction decoder are employed during training. The gesture decoder then generates appropriate gestures using all the representations together with features related to the audio rhythm. The generated gestures achieved an average human-likeness score of 44.2 in the GENEA gesture generation challenge, ranking fourth.

Second, to address the problem of generating locally controllable gestures, this thesis proposes a gesture VQ-VAE module for summarizing meaningful gesture units. Each code represents a unique gesture, effectively mitigating the random jitter problem. The Levenshtein distance, computed over quantized audio, is used as a similarity metric to align gestures with the corresponding speech segments.
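The gradient reversal trick mentioned above can be illustrated with a minimal, framework-agnostic sketch: the layer is the identity in the forward pass but negates (and scales) gradients in the backward pass, so the shared encoder learns features the modality classifier cannot distinguish. The class and values below are illustrative, not the thesis implementation; in a deep learning framework this would be a custom autograd function.

```python
class GradientReversal:
    """Identity in the forward pass; scales gradients by -lambd in the
    backward pass, so the encoder feeding this layer is trained to *fool*
    the downstream modality classifier (adversarial objective)."""

    def __init__(self, lambd=1.0):
        self.lambd = lambd  # trade-off weight for the adversarial signal

    def forward(self, x):
        # Features pass through unchanged.
        return x

    def backward(self, grad_output):
        # Reverse (and scale) the gradient flowing back to the encoder.
        return [-self.lambd * g for g in grad_output]


grl = GradientReversal(lambd=0.5)
features = [0.2, -1.3, 0.7]            # toy modality-invariant features
assert grl.forward(features) == features
assert grl.backward([1.0, -2.0, 0.5]) == [-0.5, 1.0, -0.25]
```

During training, the classifier still minimizes its own modality-prediction loss, but the reversed gradient pushes the encoder in the opposite direction, which is what encourages a truly modality-invariant subspace.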
Additionally, this thesis introduces phase to select the optimal gesture match based on contextual semantics or audio rhythm, guiding when gestures should be matched based on text and when based on speech. The generated gestures achieved state-of-the-art performance in human likeness (4.07 ± 0.15) and speech appropriateness (3.77 ± 0.21).

Third, to generate gestures with globally controllable style, this thesis introduces cross-local attention and self-attention into a diffusion model. It employs classifier-free guidance to train the model and controls gesture style through interpolation or extrapolation. Different initial gestures and noise are utilized to improve the diversity of the generated gestures. Extensive experiments show that this method improves the naturalness (4.11 ± 0.08) and speech appropriateness (4.11 ± 0.10) of the gestures, surpassing recent gesture generation methods while effectively controlling gesture style.

Finally, to address the problem of generating unified cross-skeleton gestures, this thesis proposes a retargeting network that learns latent homeomorphic graphs for different motion-capture standards, unifying the representations of various skeletons while extending the dataset. To further refine the generated gestures, inverse kinematics is applied to the lower body as physics-based guidance. Experiments show that this method improves generation quality in both human likeness (3.80 ± 0.11) and Fréchet Gesture Distance (3.850). Furthermore, this thesis establishes the first framework for generating both spontaneous (e.g., co-speech gestures) and non-spontaneous (e.g., moving around the podium) speaker motions. The DoubleTake strategy ensures seamless transitions between motions, and experimental results indicate that model performance improves as the motion database is expanded.
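The audio-similarity metric from the second contribution can be sketched as a standard Levenshtein (edit) distance over sequences of VQ codebook indices. This is a minimal illustration assuming audio frames have already been quantized to discrete codes; the clip values below are made up for demonstration.

```python
def levenshtein(a, b):
    """Edit distance between two sequences of quantized audio codes:
    the minimum number of insertions, deletions, and substitutions
    needed to turn sequence a into sequence b (smaller = more similar)."""
    m, n = len(a), len(b)
    # dp[j] holds the distance between a[:i] and b[:j] for the current row i.
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                          # deletion
                        dp[j - 1] + 1,                      # insertion
                        prev + (a[i - 1] != b[j - 1]))      # substitution
            prev = cur
    return dp[n]


# Toy codebook indices for two audio clips; one deletion plus one
# insertion aligns them, so the distance is 2.
clip_a = [3, 3, 7, 1, 4]
clip_b = [3, 7, 1, 4, 4]
assert levenshtein(clip_a, clip_b) == 2
```

Because the distance operates on discrete codes rather than raw waveforms, candidate gesture units can be ranked by how closely their associated audio codes match the input speech.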