Research on the Differentiable Pose Augmentation Framework for 3D Human Pose Estimation

Author: 陈欢
  • Student ID
    2020******
  • Degree
    Master
  • Email
    226******com
  • Defense date
    2023.05.15
  • Supervisor
    张盛
  • Discipline
    Electronic Information
  • Pages
    67
  • Confidentiality
    Public
  • Affiliation
    599 国际研究生院
  • Keywords
    3D Human Pose Estimation, Pose Transfer, Differentiable Augmentation

Abstract

3D human pose estimation refers to recovering the 3D coordinates of body keypoints from images or videos. It is a fundamental task with broad applications in scenarios such as action recognition, robot interaction, and human tracking. 3D human pose estimation algorithms fall into two mainstream categories: one estimates the 3D pose from the image end to end; the other first estimates a 2D pose from the image and then lifts it to 3D.
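As a concrete illustration of the lifting paradigm, the following is a minimal PyTorch-style sketch of a 2D-to-3D lifting network. The joint count, layer sizes, and the name `LiftingNet` are illustrative assumptions, not the estimators benchmarked in this thesis.

```python
import torch
import torch.nn as nn

NUM_JOINTS = 17  # assumption: a Human3.6M-style 17-joint skeleton

class LiftingNet(nn.Module):
    """Maps flattened 2D keypoints (x, y) to 3D joint coordinates (x, y, z)."""
    def __init__(self, hidden=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(NUM_JOINTS * 2, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, NUM_JOINTS * 3),
        )

    def forward(self, pose_2d):                   # pose_2d: (B, J, 2)
        b = pose_2d.size(0)
        out = self.net(pose_2d.reshape(b, -1))
        return out.reshape(b, NUM_JOINTS, 3)      # (B, J, 3)

# Usage: the 2D keypoints typically come from an off-the-shelf 2D detector.
pose_3d = LiftingNet()(torch.randn(8, NUM_JOINTS, 2))
```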
Because collecting 3D pose datasets is expensive, only datasets with ground-truth labels captured in laboratory environments are available for training. Although these methods succeed in indoor scenarios, the limited diversity of the training data makes both categories of estimators hard to generalize to cross-scene datasets (e.g., in-the-wild datasets). The research community currently addresses this problem through data augmentation, but most work targets only the limited diversity of 2D-3D pose pairs, so the proposed methods cannot serve both categories of pose estimators. A data augmentation framework that increases data diversity and suits various pose estimators is therefore of great significance.

To address the limited applicability of existing augmentation frameworks, this thesis proposes a data augmentation method that generates 3D pose-image pairs. It implements a neural-network-based differentiable augmentor and controls the diversity and estimation difficulty of the augmented pairs through a discriminator and an error-feedback loop from the estimator, forming an end-to-end online training mechanism that produces rich new poses and new images with realistic detail. Because the augmented data touch both ends of the estimation pipeline (3D poses and images), the method applies to both mainstream families of pose estimation algorithms. Moreover, since matched 3D pose-2D pose-human mesh-image tuples are generated as a byproduct, the method extends to multiple tasks.
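One training step of this online mechanism can be pictured with the hedged sketch below. The modules `augmentor`, `discriminator`, and `estimator` are stand-ins for the thesis's networks, and the weighting `beta` with a plain "reward estimator error" term is a simplified assumption about the feedback design.

```python
import torch
import torch.nn.functional as F

def train_step(augmentor, discriminator, estimator,
               opt_aug, opt_disc, opt_est,
               real_pose3d, real_img, beta=0.5):
    # 1) The differentiable augmentor proposes a new 3D pose-image pair.
    aug_pose3d, aug_img = augmentor(real_pose3d, real_img)

    # 2) Discriminator keeps augmented pairs near the real data manifold.
    d_real = discriminator(real_pose3d, real_img)
    d_fake = discriminator(aug_pose3d.detach(), aug_img.detach())
    loss_d = (F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real))
              + F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake)))
    opt_disc.zero_grad(); loss_d.backward(); opt_disc.step()

    # 3) Estimator trains on real and augmented pairs alike.
    loss_est = (F.mse_loss(estimator(real_img), real_pose3d)
                + F.mse_loss(estimator(aug_img.detach()), aug_pose3d.detach()))
    opt_est.zero_grad(); loss_est.backward(); opt_est.step()

    # 4) Error feedback: gradients flow back into the augmentor so that its
    #    outputs look real to the discriminator while being somewhat harder
    #    for the estimator (a simplified stand-in for the actual objective).
    feedback = F.mse_loss(estimator(aug_img), aug_pose3d)
    d_aug = discriminator(aug_pose3d, aug_img)
    loss_aug = (F.binary_cross_entropy_with_logits(d_aug, torch.ones_like(d_aug))
                - beta * feedback)
    opt_aug.zero_grad(); loss_aug.backward(); opt_aug.step()
```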
Because adjusting poses is more controllable than directly editing images, the augmentor is decoupled into a pose generator and a pose transfer module (see the structural sketch at the end of this section). Since previous work on pose transfer either fails to produce photorealistic images or offers no end-to-end implementation, this thesis proposes an end-to-end pose transfer network, PoseTransfer. By fully exploiting pose, clothing, texture, and environment information, it generates new images that preserve facial identity, body shape, clothing detail, and environment information even under challenging poses.

To broadly improve the performance of existing pose estimators, this thesis proposes ImgAug, a differentiable augmentation framework based on 3D pose-image pair generation. Cross-scenario tests on the Human3.6M, MPII, and 3DPW datasets, with baselines covering both types of 3D pose estimation algorithms (end-to-end 3D pose estimators and 2D-3D lifting estimators), show that ImgAug brings clear improvements on both intra-scenario and cross-scenario data, verifying that the framework generalizes well across scenes and is highly compatible with different 3D pose estimators.
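The decoupling mentioned above can be summarized by the following sketch. Here `pose_generator` and `pose_transfer` are hypothetical placeholders wired together only to show the intended data flow, not the actual PoseTransfer architecture.

```python
import torch.nn as nn

class DecoupledAugmentor(nn.Module):
    """Augmentor = pose generator + pose transfer, differentiable end to end."""
    def __init__(self, pose_generator: nn.Module, pose_transfer: nn.Module):
        super().__init__()
        self.pose_generator = pose_generator  # 3D pose -> new, more diverse 3D pose
        self.pose_transfer = pose_transfer    # (src image, src pose, new pose) -> new image

    def forward(self, src_img, src_pose3d):
        new_pose3d = self.pose_generator(src_pose3d)
        # Re-render the same subject (identity, clothing, background) in the
        # generated pose, yielding a matched 3D pose-image pair.
        new_img = self.pose_transfer(src_img, src_pose3d, new_pose3d)
        return new_pose3d, new_img
```

Generating the pose first and then transferring it keeps the hard-to-control image edit conditioned on an easy-to-control pose edit, which is the design rationale stated above.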