Multi-person pose estimation aims to localize the body keypoints of every person in an image, which supports behavior analysis and understanding. The technology has broad application prospects in fields such as human-computer interaction, augmented reality, and sports analytics. Real-world applications, however, usually require pose estimation algorithms to run efficiently in real time. Although current deep learning-based methods for multi-person pose estimation achieve high accuracy on the major public benchmarks, two challenges remain: (i) mainstream two-stage top-down methods struggle to balance accuracy and computational speed, and cannot deliver high accuracy and real-time inference at the same time; (ii) in scenes with many people, top-down methods are slow, while one-stage methods can run in real time but are less accurate.
To address the difficulty two-stage models have in balancing accuracy and speed, this thesis proposes RTMPose, an algorithm based on coordinate classification. RTMPose first uses an efficient real-time object detector to obtain the bounding box of each person in the image, and then applies coordinate classification to each human instance, treating keypoint localization as a classification problem in order to estimate the pose precisely. Building on the SimCC algorithm, RTMPose optimizes the network structure, loss function, optimization strategy, and inference pipeline together, improving inference speed and accuracy at the same time and achieving real-time inference with accuracy comparable to mainstream non-lightweight models.
To overcome the limited accuracy of one-stage models in multi-person scenes, this thesis combines the coordinate classification idea with the one-stage detection framework YOLO and proposes the RTMO model. RTMO predicts all human poses directly from the input image in a single stage, with the goal of improving accuracy while keeping inference fast. It inherits the dense prediction strategy of YOLO and introduces a Dynamic Coordinate Classifier (DCC) module, which resolves the difficulty of applying conventional coordinate classification within a dense prediction framework and significantly improves keypoint localization accuracy. RTMO also introduces a new coordinate classification loss function based on maximum likelihood estimation; by incorporating a learnable variance, the loss automatically balances the learning difficulty of different samples and further improves coordinate classification performance. With these contributions, RTMO reaches keypoint localization accuracy close to that of two-stage models while retaining high inference speed, and performs well in complex multi-person scenes.
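The coordinate classification idea that both RTMPose and RTMO build on can be illustrated with a minimal sketch. It is not the implementation used in this thesis. It assumes a common SimCC-style formulation in which each axis is divided into sub-pixel bins, the training target is a Gaussian-smoothed label over those bins, and decoding takes the highest-scoring bin on each axis. The function names (`encode_axis`, `decode_axis`, `decode_keypoint`), the parameters `split_ratio` and `sigma`, and the choice of the smaller per-axis score as the confidence are all hypothetical and chosen only for illustration.

```python
import math


def encode_axis(coord, length, split_ratio=2.0, sigma=1.0):
    """Turn one continuous coordinate (in pixels) into a soft label over bins.

    Assumption: the axis of `length` pixels is split into
    length * split_ratio bins and smoothed with a Gaussian of width `sigma`.
    """
    num_bins = int(length * split_ratio)
    center = coord * split_ratio  # position of the keypoint in bin units
    weights = [
        math.exp(-((i - center) ** 2) / (2.0 * sigma ** 2))
        for i in range(num_bins)
    ]
    total = sum(weights)
    return [w / total for w in weights]  # normalized to sum to 1


def decode_axis(scores, split_ratio=2.0):
    """Pick the highest-scoring bin and map it back to pixel units."""
    best = max(range(len(scores)), key=scores.__getitem__)
    return best / split_ratio, scores[best]


def decode_keypoint(x_scores, y_scores, split_ratio=2.0):
    """Decode one keypoint from two independent 1-D classifications."""
    x, conf_x = decode_axis(x_scores, split_ratio)
    y, conf_y = decode_axis(y_scores, split_ratio)
    # Illustrative choice: report the weaker of the two axis scores.
    return (x, y), min(conf_x, conf_y)


# Round trip for a keypoint at (12.5, 30.0) inside a 48 x 64 person crop.
x_scores = encode_axis(12.5, length=48)
y_scores = encode_axis(30.0, length=64)
(x, y), conf = decode_keypoint(x_scores, y_scores)
print(x, y)  # prints: 12.5 30.0
```

With `split_ratio=2.0` each bin covers half a pixel, so the sketch resolves coordinates to 0.5 px without producing a 2-D heatmap; this is the sense in which keypoint localization is treated as two 1-D classification problems.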