
Architectural Design and High-precision Training of Trajectory Prediction Network for Traffic Participants in Urban Scenarios

Author: 兰志前
  • Student ID
    2021******
  • Degree
    Master's
  • Email
    lan******.cn
  • Defense Date
    2024.05.29
  • Supervisor
    李升波
  • Discipline
    Mechanical Engineering
  • Pages
    91
  • Confidentiality Level
    Public
  • Affiliation
    015 School of Vehicle and Mobility
  • Keywords
    autonomous driving; trajectory prediction; neural network; scene understanding; deep learning

Abstract


Trajectory prediction of traffic participants is a critical module in autonomous driving systems. Real-time, accurate trajectory prediction is a prerequisite for intelligent vehicles to make sound decisions and maintain safe control. Currently, data-driven methods based on deep neural networks, leveraging their strong capacity to model complex traffic scenarios, are gradually replacing traditional rule-based approaches and becoming the mainstream solution for trajectory prediction. However, existing prediction models commonly suffer from complex structures and insufficient utilization of training data, leading to low computational efficiency and poor prediction accuracy, which makes it difficult to meet the requirements of high-level autonomous driving. To this end, this thesis addresses trajectory prediction of surrounding agents in urban road traffic scenarios by designing an efficient encoder-decoder neural network with a spatiotemporally separated structure, proposing a training method that enhances the mining of traffic scene data, and validating the effectiveness of the proposed methods on large-scale trajectory prediction datasets and through real-vehicle deployment.

Firstly, to address the complex structures and low computational efficiency of existing prediction networks, a Scene Encoding Predictive Transformer (SEPT) is proposed, comprising two core components: a scene encoder and a trajectory decoder. In the scene encoder, a spatiotemporally separated encoding structure first encodes the temporal information of each historical trajectory and then processes the spatial interactions among traffic participants and between participants and the road structure, significantly simplifying the information processing pipeline and improving computational efficiency.
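The temporal-then-spatial encoding order described above can be illustrated with a minimal PyTorch sketch. This is not the thesis implementation: the module name, tensor shapes, layer counts, and the mean-pooling over the time axis are all illustrative assumptions.

```python
import torch
import torch.nn as nn


class TimeThenSpaceEncoder(nn.Module):
    """Illustrative sketch of a spatiotemporally separated scene encoder.

    Stage 1 attends along each agent's own history (temporal axis only);
    stage 2 attends across agent and map tokens (spatial axis only).
    All shapes and hyperparameters are assumptions for demonstration.
    """

    def __init__(self, d_model: int = 128, n_heads: int = 8):
        super().__init__()
        make_layer = lambda: nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True)
        self.temporal = nn.TransformerEncoder(make_layer(), num_layers=2)
        self.spatial = nn.TransformerEncoder(make_layer(), num_layers=2)

    def forward(self, hist: torch.Tensor, map_tokens: torch.Tensor):
        # hist: (B, A, T, D) agent history features; map_tokens: (B, M, D)
        B, A, T, D = hist.shape
        # 1) temporal self-attention within each agent's own track
        t = self.temporal(hist.reshape(B * A, T, D))
        agent_tokens = t.mean(dim=1).reshape(B, A, D)  # pool the time axis
        # 2) spatial self-attention across agents and map elements
        scene = torch.cat([agent_tokens, map_tokens], dim=1)  # (B, A+M, D)
        return self.spatial(scene)


enc = TimeThenSpaceEncoder()
out = enc(torch.randn(2, 6, 20, 128), torch.randn(2, 50, 128))
print(out.shape)  # torch.Size([2, 56, 128])
```

Because attention in stage 1 runs over T steps per agent and in stage 2 over A+M tokens per scene, the two small attention problems replace one large joint spatiotemporal one, which is the source of the efficiency gain claimed above.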
The trajectory decoder generates multimodal trajectories directly through cross-attention between learned modal vectors and the scene encoding, avoiding intermediate variables such as anchors, which saves memory and further simplifies the overall network.

Subsequently, to address the low data utilization of existing training methods, a Dual Cascading training technique (DuCa) is proposed, consisting of a scene understanding stage and a trajectory prediction stage. In the scene understanding stage, three "mask-reconstruction" self-supervised learning tasks over map and trajectory inputs are constructed from large-scale unlabeled traffic scene data, training the scene encoder to understand the complex spatiotemporal relationships in traffic scenes and to fully exploit valuable scene structural information. In the trajectory prediction stage, the knowledge learned in the previous stage is transferred to the trajectory prediction task, and supervised training on labeled datasets yields multimodal trajectory prediction. On the single-agent prediction task of the large-scale Argoverse dataset, the proposed method outperforms all other algorithms on the leaderboard and achieves the best scores on six performance metrics, demonstrating the effectiveness of the training method.

Finally, the trained network is deployed and tested on a real vehicle platform. Compared with a mainstream rule-based prediction algorithm, our method achieves significant improvements of 19.6% and 23.2% in minADE1 and minFDE1, respectively, with an average inference time per scene of less than 10 milliseconds, demonstrating its feasibility in practical applications.
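The two-stage "mask-reconstruction then supervised fine-tuning" recipe described above can be sketched schematically in PyTorch. Everything here is an assumption for illustration: the toy encoder, the single reconstruction head (the abstract mentions three pretraining tasks), the mask ratio, and the trajectory head are placeholders, not the DuCa implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def masked_reconstruction_step(encoder, head, tokens, mask_ratio=0.5):
    """One self-supervised step: hide a random subset of scene tokens and
    train the encoder to reconstruct them (stage-1 pretraining). The
    modules and shapes here are illustrative assumptions."""
    B, N, D = tokens.shape
    mask = torch.rand(B, N) < mask_ratio                 # True = masked out
    corrupted = tokens.masked_fill(mask.unsqueeze(-1), 0.0)
    recon = head(encoder(corrupted))                     # (B, N, D)
    # supervise only the masked positions
    return F.mse_loss(recon[mask], tokens[mask])


# --- sketch of the two cascaded stages ---
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(64, 4, batch_first=True), num_layers=2)
recon_head = nn.Linear(64, 64)

# Stage 1: self-supervised scene understanding on unlabeled scene tokens
opt = torch.optim.Adam(
    list(encoder.parameters()) + list(recon_head.parameters()), lr=1e-4)
for _ in range(3):  # a few toy steps on random "scenes"
    loss = masked_reconstruction_step(encoder, recon_head,
                                      torch.randn(8, 32, 64))
    opt.zero_grad(); loss.backward(); opt.step()

# Stage 2: reuse the pretrained encoder, supervise trajectory prediction
traj_head = nn.Linear(64, 30 * 2)  # toy head: 30 future (x, y) points
pred = traj_head(encoder(torch.randn(8, 32, 64)).mean(dim=1))
print(pred.shape)  # torch.Size([8, 60])
```

The key design point mirrored here is that only the loss and head change between stages: the encoder weights learned on unlabeled data carry over unchanged into the supervised stage.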