

Algorithm and Application Research of Safe Reinforcement Learning for Autonomous Driving

Author: 张麟睿
  • Student ID
    2020******
  • Degree
    Master's
  • Email
    zha******com
  • Defense date
    2023.05.15
  • Supervisor
    王学谦
  • Discipline
    Electronic Information
  • Pages
    74
  • Confidentiality level
    Public
  • Training institution
    599 International Graduate School
  • Keywords
    Safe reinforcement learning, Autonomous driving, Penalized Proximal Policy Optimization, Safety Correction from Baseline, Safety-critical policy evaluation

Abstract

Reinforcement learning (RL) can solve complex sequential decision-making problems and has great application potential in autonomous driving. However, a learning paradigm that simply pursues high returns tends to produce high-risk driving behavior and to violate the requirements of real traffic scenarios regarding speed, fuel consumption, or safety. This thesis studies safe RL for autonomous driving, which aims to maximize the cumulative reward while satisfying traffic rules or user-defined safety constraints. It provides the safety guarantees that RL-based driving policies need in practice and is therefore of significant research value.

Safe RL for autonomous driving faces three main challenges. First, driving tasks involve diverse and coexisting types of safety constraints, which leads to high computational cost, unstable optimization, and poor constraint satisfaction in existing methods. Second, violating constraints can have catastrophic consequences in real traffic, yet existing safe RL methods commonly suffer from dangerous exploration and slow learning, and their sample efficiency falls short of practical requirements. Third, risk-related scenarios are sparse in real-world driving data, so safe RL for autonomous driving has long lacked unified test scenarios and comparative evaluation.

This thesis addresses these three problems with the following contributions. (1) To tackle the difficulty of solving safe RL problems in high-dimensional, multi-constraint spaces and the resulting poor constraint satisfaction, we propose Penalized Proximal Policy Optimization (P3O). We theoretically analyze the exactness of the method and give a worst-case estimation error bound. The effectiveness and robustness of P3O are verified on multiple safety benchmark tasks, including previously rarely considered multi-constraint and multi-agent scenarios. (2) To address slow learning of the cost value function and low sample efficiency, we propose Safety Correction from Baseline (SCfB). SCfB decouples the acquisition of the baseline policy from that of the safe policy and allows the agent to perform safety correction from baseline policies of different forms. Experiments on multiple tasks show that SCfB achieves more than 10 times higher sample efficiency than conventional methods while obtaining competitive returns. (3) To address the lack of safety-critical scenarios and fair comparative evaluation, we first generate safety-critical scenarios with deep adversarial generative models, then build SafeRL-Kit, a toolbox of safe RL algorithms, and finally deploy and test multiple RL-based safe driving policies, including the algorithms proposed in this thesis, under a unified pipeline. The quantitative and qualitative evaluation results can serve as a reference for the practical application of these algorithms.
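
For readers less familiar with the setting, the sketch below states the constrained objective that safe RL methods such as P3O target, written in generic constrained-MDP notation; the symbols ($\pi_\theta$, $J_R$, $J_{C_i}$, $d_i$, $\kappa$) and the ReLU penalty form are illustrative assumptions, not formulas quoted from the thesis.

\[
\max_{\theta} \; J_R(\pi_\theta)
\quad \text{s.t.} \quad
J_{C_i}(\pi_\theta) \le d_i, \qquad i = 1, \dots, m,
\]
with the exact-penalty reformulation that penalty-based methods optimize instead:
\[
\min_{\theta} \; -J_R(\pi_\theta) + \kappa \sum_{i=1}^{m} \max\!\big(0,\; J_{C_i}(\pi_\theta) - d_i\big),
\]
where $J_R$ is the expected discounted return, $J_{C_i}$ the expected discounted cost of the $i$-th constraint with budget $d_i$, and $\kappa > 0$ a penalty factor. For a sufficiently large $\kappa$ the penalized problem shares its solutions with the constrained one, which is the kind of exactness property the abstract's theoretical analysis and error bound appear to concern.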
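As a concrete illustration of how such a penalty can be attached to a PPO-style update, here is a minimal PyTorch sketch. It assumes a PPO importance ratio, separate reward and cost advantage estimates, and a scalar estimate of the current constraint excess; the function name and arguments are hypothetical, and this is not the implementation used in the thesis or in SafeRL-Kit.

    import torch

    def penalized_ppo_loss(ratio, adv_r, adv_c, cost_excess, kappa=20.0, clip_eps=0.2):
        # ratio: pi_theta(a|s) / pi_old(a|s) for a sampled batch (tensor)
        # adv_r / adv_c: advantage estimates for the reward and for the constraint cost
        # cost_excess: estimate of J_C(pi_old) - d (scalar tensor); positive means violation
        # Clipped PPO surrogate for the reward (to be maximized).
        reward_surrogate = torch.min(
            ratio * adv_r,
            torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv_r,
        ).mean()
        # First-order estimate of the constraint excess under the new policy.
        predicted_excess = cost_excess + (ratio * adv_c).mean()
        # ReLU penalty: active only when the constraint is predicted to be violated.
        penalty = torch.relu(predicted_excess)
        # Total loss to minimize.
        return -reward_surrogate + kappa * penalty

In a multi-constraint setting, the penalty term would simply be summed over constraints, mirroring the objective sketched above.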