Accurately recovering 3D human pose and shape from monocular video without any auxiliary equipment is a challenging and practically valuable research task, and it is crucial for understanding human behavior in multimedia content. Although 3D human pose and shape estimation has advanced considerably in recent years, real-world monocular video remains difficult. Existing methods often estimate pose and shape jointly within a unified framework but do not effectively address the uncertainty inherent in 3D estimation, which arises from depth ambiguity, visual blur, and occlusion. As a result, their accuracy degrades in complex or unusual scenarios.
To tackle these challenges, this thesis studies video-based 3D human pose and shape estimation and proposes two algorithms, Diff-HMR and DR-Net. Both introduce changes to network architecture and loss function design, with the aim of producing more accurate, robust, and smooth estimates.
Existing algorithms neglect uncertainty in 3D estimation and do not effectively model multiple plausible solutions. To address this, we propose Diff-HMR, which is based on the diffusion model. It introduces a diffusion-based regressor (DPR) that reformulates deterministic pose estimation as a reverse diffusion process and reduces uncertainty through progressive denoising. It also includes an attention-based Cross-Temporal Attention Module (CTAM) that fuses features across time; the aggregated contextual features serve as the conditional input for reverse diffusion and guide the model toward more accurate results. In addition, to constrain mesh vertices more tightly, we design a vertex constraint loss function that improves the accuracy of the estimated mesh vertices.
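The idea of treating pose regression as a conditioned reverse diffusion process can be illustrated with a toy sketch. This is not the DPR architecture: the abstract does not specify the noise schedule, the sampler, or the network, so the linear schedule, the deterministic DDIM-style update, the 72-dimensional pose vector, and the `toy_denoiser` function (a hypothetical stand-in for a learned noise predictor) are all assumptions made for illustration.

```python
import numpy as np

def make_schedule(num_steps=50, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule (assumed); returns cumulative alphas per step."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def toy_denoiser(x_t, t, cond, alpha_bar, weight):
    """Hypothetical stand-in for a learned, condition-aware noise predictor.

    It pretends the clean pose is a linear function of the condition and
    returns the noise that would explain x_t under that assumption.
    """
    x0_guess = cond @ weight
    return (x_t - np.sqrt(alpha_bar[t]) * x0_guess) / np.sqrt(1.0 - alpha_bar[t])

def reverse_diffusion(cond, weight, pose_dim, alpha_bar, rng):
    """Deterministic (DDIM-style, eta = 0) reverse process from random noise."""
    x = rng.standard_normal((cond.shape[0], pose_dim))
    for t in range(len(alpha_bar) - 1, -1, -1):
        eps = toy_denoiser(x, t, cond, alpha_bar, weight)
        # Clean pose implied by the current sample and the predicted noise.
        x0_pred = (x - np.sqrt(1.0 - alpha_bar[t]) * eps) / np.sqrt(alpha_bar[t])
        # Cumulative alpha of the previous step; 1.0 means noise-free.
        ab_prev = alpha_bar[t - 1] if t > 0 else 1.0
        x = np.sqrt(ab_prev) * x0_pred + np.sqrt(1.0 - ab_prev) * eps
    return x

rng = np.random.default_rng(0)
cond = rng.standard_normal((4, 16))     # 4 frames, 16-dim context features (assumed)
weight = rng.standard_normal((16, 72))  # hypothetical map to a 72-dim pose vector
alpha_bar = make_schedule()
poses = reverse_diffusion(cond, weight, 72, alpha_bar, rng)
```

In this sketch the condition alone determines where the denoising converges, which mirrors the role the aggregated temporal features play as the conditional input. A learned denoiser would replace the closed-form `toy_denoiser`, and a stochastic sampler could be used to draw several plausible poses for the same input.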
To address the lack of distinction between pose and shape parameters in existing algorithms, we propose DR-Net, a coarse-to-fine estimation network. It is trained in two stages. First, an initial regressor is pretrained on large-scale datasets to capture the basic patterns of human motion. Then, because pose parameters are more complex, a diffusion-based refinement regressor corrects the initial estimates to obtain more accurate results. To better capture the motion features between human joints and the relationships between parameters and features, we also propose the GCN-ATT module, which builds a graph convolutional network on the topology of the skeleton and further improves the accuracy of pose estimation.
To validate the proposed algorithms, we compare them with state-of-the-art methods on several mainstream datasets. The results show that they perform strongly across all evaluation metrics, with high estimation accuracy and robustness. Comparison and ablation experiments designed for special scenarios further confirm the algorithms' ability to handle uncertainty and the effectiveness of each network module. Beyond this verification, we explore the practical use of the algorithms for driving digital humans, and the results indicate broad application potential. In summary, this research combines methodological innovation with practical application, aiming to integrate theory and practice.
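Building a graph convolution on the skeleton topology, as the GCN-ATT module is described as doing, can be sketched as follows. The abstract does not give the layer definition, so this example substitutes the standard symmetric-normalized graph convolution and omits the attention part entirely. The 5-joint skeleton in `BONES`, the feature sizes, and the function names are hypothetical.

```python
import numpy as np

# Hypothetical 5-joint skeleton given as (parent, child) bone pairs.
BONES = [(0, 1), (1, 2), (1, 3), (1, 4)]

def normalized_adjacency(num_joints, bones):
    """Return D^{-1/2} (A + I) D^{-1/2} for the undirected skeleton graph."""
    adj = np.eye(num_joints)  # self-loops keep each joint's own feature
    for i, j in bones:
        adj[i, j] = adj[j, i] = 1.0
    deg_inv_sqrt = 1.0 / np.sqrt(adj.sum(axis=1))
    return adj * deg_inv_sqrt[:, None] * deg_inv_sqrt[None, :]

def gcn_layer(features, adj_norm, weight):
    """One graph convolution: mix features along bones, project, apply ReLU."""
    return np.maximum(adj_norm @ features @ weight, 0.0)

rng = np.random.default_rng(0)
adj_norm = normalized_adjacency(5, BONES)
joint_features = rng.standard_normal((5, 8))  # 8-dim feature per joint (assumed)
weight = rng.standard_normal((8, 4))          # stand-in for a learned projection
out = gcn_layer(joint_features, adj_norm, weight)
```

Each joint's output depends only on itself and the joints it shares a bone with, so stacking such layers spreads information along the kinematic chain. An attention mechanism could then reweight these neighbor contributions, but how GCN-ATT does so is not stated in the abstract.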