6D object pose estimation is a crucial task in computer vision and robotics: it involves estimating an object's six degrees of freedom in three-dimensional space, namely three translational and three rotational degrees of freedom. In recent years, breakthroughs in deep learning have driven rapid progress in this field. Datasets play a critical role in deep learning, as models learn the patterns and features behind the data by training on large collections and can then be applied to different scenarios, so ensuring the quality, diversity, and reliability of the data is of utmost importance. However, compared with other computer vision tasks such as image classification, the nature of pose parameters makes accurate ground-truth annotation difficult and time-consuming. Moreover, currently available public object pose datasets suffer from inaccurate annotations, are prepared in very similar ways, and lack coverage of specific application scenarios, in particular scenes involving interactions between objects and humans. This paper therefore focuses on generation methods for object pose estimation datasets, aiming to reduce the labor cost of annotation, and proposes a novel object pose estimation dataset that differs in character from existing public datasets, promoting the development and application of the field. The main contributions of this paper are as follows:

1. We build a data collection platform for object pose estimation. On this platform, to address the difficulty of annotating object poses and the limitations, monotonicity, and shortcomings of existing publicly available datasets, we propose a preparation method based on an optical motion capture system and further construct an object pose estimation dataset involving object-human interactions. The dataset contains ten common household objects and approximately 110,000 images with ground-truth pose annotations. Because post-processing is time-consuming and laborious, we modularize the platform's dataset generation code: once the data collected from the various sensors is assembled into a prescribed layout, it can be converted immediately, without any preparation-specific expertise, into the standard BOP format commonly used for training and evaluation in object pose estimation (see the first sketch at the end of this section), greatly reducing the time and workload of dataset preparation.

2. We provide baselines on the proposed dataset for the object detection models YOLOv4 and YOLOX and the object pose estimation models CDPN and GDR-Net, analyze the characteristics of this type of dataset by adjusting selected network parameters, and identify the best-performing model. In addition, we propose a mixed training strategy that combines real and synthetic data, which increases the amount of training data at low cost while preserving realism and improves the generalization performance of the model (a sampling sketch is given below).

3. To verify the effectiveness of the models in practical application scenarios, we build a vision-guided robotic arm grasping and placing system and validate the performance of the trained models in real grasping scenes (the final sketch below shows the underlying pose transform).
The experimental results demonstrate that training with both real and synthetic data enhances the model's generalization ability, and confirm that the robotic arm grasping system exhibits excellent stability and robustness. The system has high practical value and provides a feasible solution for related real-world application scenarios.
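For contribution 1, the following is a minimal sketch of how per-frame poses recovered from the motion capture system could be serialized into the BOP scene_gt.json layout. The field names cam_R_m2c, cam_t_m2c, and obj_id follow the public BOP format specification; the function name, directory layout, and the usage values are illustrative assumptions, not the platform's actual code.

```python
import json
from pathlib import Path

import numpy as np

def write_bop_scene_gt(out_dir, poses_per_frame):
    """Write per-frame ground-truth poses in the BOP scene_gt.json layout.

    poses_per_frame: dict mapping frame index -> list of (obj_id, R, t),
    where R is a 3x3 model-to-camera rotation matrix and t is a translation
    vector in millimetres, as required by the BOP format.
    """
    scene_gt = {
        str(frame): [
            {
                "obj_id": int(obj_id),
                # BOP stores the rotation row-major as 9 floats.
                "cam_R_m2c": np.asarray(R, dtype=float).reshape(9).tolist(),
                "cam_t_m2c": np.asarray(t, dtype=float).reshape(3).tolist(),
            }
            for obj_id, R, t in annos
        ]
        for frame, annos in poses_per_frame.items()
    }
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    with open(out / "scene_gt.json", "w") as f:
        json.dump(scene_gt, f, indent=2)

# Hypothetical usage: one frame containing object 1 at identity rotation,
# 500 mm in front of the camera.
write_bop_scene_gt("scene_000001", {0: [(1, np.eye(3), [0.0, 0.0, 500.0])]})
```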
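For the mixed training strategy in contribution 2, the exact mixing scheme is not spelled out in this section; the snippet below is one plausible PyTorch realization that draws each batch from real and synthetic pools in a fixed expected ratio. The real_fraction value and both dataset objects are assumptions for illustration.

```python
import torch
from torch.utils.data import ConcatDataset, DataLoader, WeightedRandomSampler

def make_mixed_loader(real_ds, syn_ds, real_fraction=0.5, batch_size=32):
    """Sample batches from a pool of real and synthetic images.

    real_fraction controls the expected share of real samples per batch;
    0.5 is an assumed hyperparameter, not a value from the paper.
    """
    mixed = ConcatDataset([real_ds, syn_ds])
    # Per-sample weights so the two sources contribute in the requested
    # proportion regardless of their relative sizes.
    weights = torch.cat([
        torch.full((len(real_ds),), real_fraction / len(real_ds)),
        torch.full((len(syn_ds),), (1.0 - real_fraction) / len(syn_ds)),
    ])
    sampler = WeightedRandomSampler(weights, num_samples=len(mixed),
                                    replacement=True)
    return DataLoader(mixed, batch_size=batch_size, sampler=sampler)
```

Oversampling the smaller (real) set with replacement keeps realism in every batch while letting cheap synthetic data dominate the raw sample count.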
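For the grasping system in contribution 3, a vision-guided pipeline typically maps the pose estimated in the camera frame into the robot base frame before a grasp is planned. The sketch below shows this standard transform chain; it assumes the camera-to-base extrinsics T_base_cam come from a separate hand-eye calibration, and all names are illustrative.

```python
import numpy as np

def to_homogeneous(R, t):
    """Pack a 3x3 rotation matrix and translation vector into a 4x4 transform."""
    T = np.eye(4)
    T[:3, :3] = np.asarray(R)
    T[:3, 3] = np.asarray(t).ravel()
    return T

def object_pose_in_base(T_base_cam, R_cam_obj, t_cam_obj):
    """Express the estimated object pose in the robot base frame:

        T_base_obj = T_base_cam @ T_cam_obj

    T_base_cam comes from hand-eye calibration; (R_cam_obj, t_cam_obj) is
    the pose predicted by the estimation network. The grasp planner can
    then target T_base_obj directly.
    """
    return T_base_cam @ to_homogeneous(R_cam_obj, t_cam_obj)
```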