Autonomous decision-making capability is at the core of the intelligence of self-driving vehicles. Existing rule-based decision-making methods cannot enumerate all possible driving situations and therefore fail to meet the requirements of high-level autonomous driving. Self-evolution of intelligent-vehicle decision-making based on reinforcement learning (RL) is an important way to address this problem, since it depends only weakly on hand-crafted rules and naturalistic driving data. However, existing RL-based decision-making still suffers from value-function overestimation and from the dynamically changing number and ordering of surrounding vehicles, which limits the achievable driving-policy performance. To this end, this dissertation studies an RL algorithm capable of suppressing overestimation and a state-encoding method for a varying number of surrounding vehicles, builds an RL-based driving-policy training system, and verifies policy performance through simulation and real-vehicle experiments, laying a foundation for the development of next-generation learning-based decision-making systems for intelligent vehicles.

First, to address value-function overestimation, a distributional RL algorithm is proposed. By introducing the maximum-entropy policy optimization objective and the return distribution function, a distributional soft policy iteration framework that provably converges to the optimal policy is established. The mechanism by which the return variance suppresses overestimation bias is revealed, and on this basis a distributional soft actor-critic algorithm that suppresses overestimation bias is derived. MuJoCo benchmark tasks show that, compared with mainstream algorithms such as SAC, TD3, and DDPG, the proposed algorithm suppresses overestimation bias markedly and improves policy performance by 20.0%.

Second, to handle the dynamically changing number and input order of surrounding vehicles, a permutation-invariant state-encoding method is proposed. This method uses an encoding neural network to obtain a representation vector for each surrounding vehicle and maps the information of a varying number of vehicles into a fixed-dimensional, permutation-invariant vector by summing these representations. It is further proved that the method yields an injective representation of the set of surrounding vehicles. Compared with fixed-sorting state representations, the proposed method improves the representation of dynamic vehicle sets and reduces the corresponding function-approximation error by 62.2%.

Third, an RL-based autonomous-driving policy training system is built, consisting of a training module and a simulation module. For the training module, the distributional RL algorithm is deeply integrated with the permutation-invariant state encoding of surrounding vehicles, and the update gradients of the policy, value, and encoding networks are derived within the distributional soft policy iteration framework. For the simulation module, the driving observations, actions, and a multi-objective reward function are selected and designed for autonomous driving in multi-lane traffic, and the high-fidelity autonomous-driving simulation software LasVSim is developed. Using this system, a highway driving policy is trained and validated; compared with existing RL-based decision-making methods, the average driving speed increases by 11.2%, and no collision occurs over 3000 km of simulated driving.

Finally, to verify the learned policy, autonomous-driving decision-making experiments are conducted on a Chang'an CS55 test vehicle on a two-lane campus road. The experiments show that the policy smoothly performs lane keeping, lane changing, and overtaking, achieving autonomous driving under different surrounding-vehicle configurations, and that it is robust to perception noise and manual steering-wheel interference. During steady straight driving, the lateral deviation from the lane centre line is less than 8 cm, and the average single-step decision time is less than 4 ms.
In the architecture of a self-driving car, decision making is a key component for realizing autonomy. Existing rule-based methods rely on a large number of manually encoded rules to cover all possible driving scenarios, which makes it difficult to meet the functional requirements of high-level autonomous vehicles. Reinforcement learning (RL) is a promising remedy that enables self-evolution of self-driving decision-making ability with little reliance on hand-crafted rules and labelled driving data. However, existing RL-based decision-making methods suffer from value overestimation and cannot handle dynamic changes in the number and ordering of surrounding vehicles. To overcome these problems, this dissertation studies an RL algorithm that reduces overestimation and a state-encoding method for surrounding vehicles. In addition, an RL-based self-driving policy learning system is constructed and verified through simulation and real-vehicle experiments. This work lays a foundation for the development of next-generation self-driving decision-making systems.

Firstly, a distributional RL method is proposed to mitigate value overestimation. By embedding the return distribution function into maximum-entropy RL, a distributional soft policy iteration (DSPI) framework is developed and proved to converge to the optimal policy. Theoretical analysis shows that the overestimation bias is inversely proportional to the variance of the return distribution, which mitigates the overestimation caused by the randomness of cumulative returns in some states. On this basis, a distributional RL algorithm, called distributional soft actor-critic (DSAC), is derived. Test results on MuJoCo benchmark tasks show that, compared with existing algorithms such as SAC, TD3, and DDPG, the overestimation bias is significantly reduced and the policy performance is improved by more than 20.0%.

Secondly, a dynamic permutation-invariant encoding (DPIE) method is proposed for representing a varying number of surrounding vehicles. The method introduces an encoding network that maps each surrounding vehicle to an encoding vector and obtains a fixed-dimensional, permutation-invariant state representation by summing these vectors. It is further proved that the proposed encoding is an injective representation of the set of surrounding vehicles. Compared with fixed-sorting state representations, the proposed method improves the representation of dynamic vehicle sets and reduces the corresponding function-approximation error by 62.2%.

Thirdly, an RL-based driving policy learning system is built, consisting of a training module and a simulation module. For the training module, the proposed DSAC algorithm is deeply integrated with the DPIE-based state representation, and the update gradients of the policy network, value network, and encoding network are derived within the DSPI framework. For the simulation module, the driving observations, decision actions, and a multi-objective reward function are designed according to the requirements of autonomous driving on multi-lane roads, and high-fidelity autonomous-driving simulation software, LasVSim, is developed.
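To make the first contribution more concrete, the following PyTorch sketch illustrates the general idea of a distributional soft actor-critic style critic update: the critic outputs the mean and standard deviation of a Gaussian return distribution and is trained by maximizing the likelihood of an entropy-augmented TD target. This is only a minimal illustration under assumed interfaces (a policy object returning a sampled action and its log-probability, and illustrative values for the network sizes, gamma, and alpha); it is not the dissertation's implementation.

```python
# Minimal sketch (not the dissertation's exact implementation) of a distributional
# soft actor-critic style critic update: the critic predicts a Gaussian over the
# return instead of a point estimate and is trained on an entropy-augmented
# TD target. Network sizes, gamma and alpha are illustrative assumptions.
import torch
import torch.nn as nn

class DistributionalCritic(nn.Module):
    """Maps (state, action) to the mean and log-std of the return distribution."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 2),            # -> [mean, log_std] of the return
        )

    def forward(self, s, a):
        mean, log_std = self.net(torch.cat([s, a], dim=-1)).chunk(2, dim=-1)
        return mean, log_std.clamp(-5.0, 2.0)    # keep the variance bounded

def critic_loss(critic, target_critic, policy, batch, gamma=0.99, alpha=0.2):
    # batch tensors are assumed to have shape [batch_size, 1] where appropriate
    s, a, r, s_next, done = batch
    with torch.no_grad():
        a_next, logp_next = policy(s_next)        # assumed: sampled action + log-prob
        mean_t, log_std_t = target_critic(s_next, a_next)
        # Sample a return from the target distribution and add the entropy bonus,
        # as in maximum-entropy (soft) value iteration.
        z_next = torch.distributions.Normal(mean_t, log_std_t.exp()).sample()
        target = r + gamma * (1.0 - done) * (z_next - alpha * logp_next)
    mean, log_std = critic(s, a)
    dist = torch.distributions.Normal(mean, log_std.exp())
    # Maximizing the likelihood of the TD target under the predicted return
    # distribution lets the learned variance absorb return randomness, which is
    # the mechanism the abstract credits for suppressing overestimation bias.
    return -dist.log_prob(target).mean()
```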
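The permutation-invariant encoding can likewise be sketched in a few lines. In the sketch below, each surrounding vehicle is assumed to be described by a small feature vector (e.g., relative position, speed, and heading; the feature layout and layer sizes are illustrative assumptions). A shared encoding network embeds every vehicle, and the embeddings are summed into a fixed-dimensional vector, so the result does not depend on the number or ordering of vehicles; the final assertion demonstrates the permutation invariance.

```python
# Illustrative sketch of the permutation-invariant encoding (DPIE) idea described
# above. The per-vehicle feature layout and layer sizes are assumptions; the key
# property is that summing per-vehicle embeddings yields a fixed-dimensional
# state that is independent of vehicle count and ordering.
import torch
import torch.nn as nn

class PermutationInvariantEncoder(nn.Module):
    def __init__(self, veh_feat_dim=4, embed_dim=64):
        super().__init__()
        self.phi = nn.Sequential(                 # shared per-vehicle encoding network
            nn.Linear(veh_feat_dim, 64), nn.ReLU(),
            nn.Linear(64, embed_dim),
        )

    def forward(self, vehicles, mask):
        # vehicles: [batch, max_vehicles, veh_feat_dim], zero-padded
        # mask:     [batch, max_vehicles], 1 for real vehicles, 0 for padding
        per_vehicle = self.phi(vehicles)                  # embed each vehicle
        per_vehicle = per_vehicle * mask.unsqueeze(-1)    # drop padded slots
        return per_vehicle.sum(dim=1)                     # order-free, fixed size

# Shuffling the vehicle order leaves the encoding unchanged:
enc = PermutationInvariantEncoder()
x = torch.randn(1, 3, 4)
m = torch.ones(1, 3)
perm = torch.tensor([2, 0, 1])
assert torch.allclose(enc(x, m), enc(x[:, perm], m[:, perm]), atol=1e-6)
```

Because this fixed-dimensional vector is what the policy and value networks consume, the encoding network can be trained end to end with the gradient updates described for the training module.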
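The abstract states that observations, actions, and a multi-objective reward function are designed for multi-lane driving but does not list the individual terms. The sketch below is therefore a purely hypothetical weighted-sum reward combining speed tracking, lane centering, heading alignment, steering smoothness, and a collision penalty; every term and weight is an assumption for illustration, not the dissertation's actual reward function.

```python
# Hypothetical sketch of a multi-objective reward for multi-lane driving.
# All terms and weights are illustrative assumptions.
def driving_reward(speed, target_speed, lateral_offset, heading_error,
                   steer_rate, collided,
                   w_speed=1.0, w_lat=0.5, w_head=0.2, w_comfort=0.1,
                   collision_penalty=200.0):
    r = 0.0
    r -= w_speed * abs(speed - target_speed)      # track the desired speed
    r -= w_lat * abs(lateral_offset)              # stay near the lane centre line
    r -= w_head * abs(heading_error)              # keep heading aligned with the lane
    r -= w_comfort * abs(steer_rate)              # discourage jerky steering
    if collided:
        r -= collision_penalty                    # dominant safety term
    return r

# Example: on-speed, slightly off-centre, no collision.
print(driving_reward(speed=20.0, target_speed=20.0, lateral_offset=0.3,
                     heading_error=0.02, steer_rate=0.1, collided=False))
```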
Simulation of autonomous highway driving with this learning system shows that, compared with existing RL-based decision-making methods, the average driving speed is increased by more than 11.2%, and no collision occurs over 3000 km of autonomous driving.

Finally, to verify the learned policy, autonomous-driving decision-making experiments are carried out on a Chang'an CS55 test vehicle on a two-lane campus road. The experiments show that the policy can smoothly perform maneuvers such as lane keeping, lane changing, and overtaking, thereby achieving autonomous driving under different surrounding-vehicle configurations. The learned policy is also robust to perception noise and manual steering-wheel interference. During steady straight driving, the lateral deviation from the lane centre line is less than 8 cm, and the average single-step decision time is less than 4 ms.