The robustness of quadruped robots in traversing complex environments is crucial for their broader practical application. In recent years, methods based on Reinforcement Learning (RL) have achieved remarkable results in quadruped locomotion control. Techniques such as domain randomization, privileged distillation, and the incorporation of expert data have effectively improved the generalization and robustness of locomotion policies, raising the terrain-traversal capability of quadruped robots to a new level. In real-world deployment, however, quadruped robots may encounter unforeseen environmental risks, such as abrupt terrain changes or unexpected external disturbances, which remain challenging for current methods. Against this background, this thesis adopts a novel risk-sensitive perspective: it models the aleatoric uncertainty of the environment with a distributional value function and optimizes a Conditional Value at Risk (CVaR) objective to obtain a risk-averse locomotion policy, thereby enhancing the robustness of quadruped locomotion control. Moreover, the distributional value function can estimate the risk level of the robot's current environment, allowing the robot to switch in real time between policies with different risk preferences. The main contributions of this thesis are as follows:

(1) A risk-sensitive reinforcement learning framework is designed for quadruped locomotion control. Risk-related reward terms are introduced based on the robot's proprioception in hazardous states, a distributional value function is trained via quantile regression, and the CVaR of the cumulative reward serves as the optimization objective of the policy network. The resulting risk-averse locomotion policy enables the robot to withstand severe external disturbances. (A sketch of the standard quantile-regression and CVaR formulation assumed here is given after this list.)

(2) The risk-averse policy optimizes the robot's performance in high-risk environments, but in low-risk environments it suffers from degraded gaits and imprecise velocity tracking. To address this, this thesis proposes a risk-aware meta-controller that uses the distributional value function as a high-level controller and switches between policies with different risk preferences according to the assessed risk level of the environment (see the sketch after this list). The risk-neutral policy is trained with a privileged-distillation (teacher-student) algorithm, in which the teacher policy receives ground-truth terrain information from the simulator and the student policy imitates the teacher while inferring the surrounding terrain from a depth camera.

(3) Extensive experiments are designed around the risks a quadruped robot may encounter in real-world scenarios. Comparisons with mainstream quadruped locomotion control algorithms in the Isaac Gym simulation environment show that the proposed risk-averse locomotion policy is more robust to external disturbances such as external forces, terrain variations, and payloads. Experiments in mixed-risk environments further show that the policy governed by the risk-aware meta-controller achieves the highest reward across all risk levels. Finally, real-world experiments on an Aliengo quadruped robot confirm that the proposed approach effectively improves the robustness of quadruped locomotion control.
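As a point of reference for contribution (1), the quantile-regression critic and CVaR objective are commonly formulated as below; this is a sketch of the standard distributional-RL formulation (in the style of QR-DQN), not necessarily the exact losses used in this thesis. The value distribution $Z^\pi(s)$ is approximated by $N$ quantile estimates $\{\theta_i(s)\}_{i=1}^N$ at fractions $\hat{\tau}_i = \tfrac{2i-1}{2N}$, trained with the quantile Huber loss:

$$\mathcal{L} = \frac{1}{N}\sum_{i=1}^{N}\mathbb{E}_{j}\!\left[\rho^{\kappa}_{\hat{\tau}_i}(\delta_{ij})\right],\qquad \rho^{\kappa}_{\hat{\tau}}(\delta) = \left|\hat{\tau} - \mathbb{1}\{\delta < 0\}\right|\,\frac{L_{\kappa}(\delta)}{\kappa},$$

where $\delta_{ij} = r + \gamma\,\theta_j(s') - \theta_i(s)$ is the distributional TD error and $L_{\kappa}$ is the Huber loss. The risk-averse policy then maximizes the CVaR at level $\alpha$, i.e., the expectation over the worst $\alpha$-fraction of the return distribution, which the quantile critic approximates by averaging its lowest quantiles:

$$\mathrm{CVaR}_{\alpha}\!\left[Z^\pi(s)\right] = \mathbb{E}\!\left[Z^\pi(s)\,\middle|\,Z^\pi(s) \le F^{-1}_{Z^\pi(s)}(\alpha)\right] \approx \frac{1}{\lceil \alpha N \rceil}\sum_{i=1}^{\lceil \alpha N \rceil}\theta_i(s).$$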
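The policy-switching logic of the risk-aware meta-controller in contribution (2) can be illustrated with a minimal sketch. Everything here is a hypothetical rendering for clarity: the class and method names (`RiskAwareMetaController`, `risk_level`), the dispersion-based risk measure, and the fixed threshold are assumptions, since the abstract does not specify the implementation.

```python
import torch


class RiskAwareMetaController:
    """Minimal sketch: switch between a risk-averse and a risk-neutral policy
    based on the risk level estimated from a distributional value function.
    All names and the thresholding rule are illustrative assumptions."""

    def __init__(self, value_net, risk_averse_policy, risk_neutral_policy,
                 alpha=0.1, risk_threshold=0.5):
        self.value_net = value_net            # maps obs -> N return quantiles, sorted ascending
        self.policies = {"averse": risk_averse_policy,
                         "neutral": risk_neutral_policy}
        self.alpha = alpha                    # CVaR level for the tail estimate
        self.risk_threshold = risk_threshold  # switching threshold (illustrative)

    @torch.no_grad()
    def risk_level(self, obs):
        quantiles = self.value_net(obs)                   # shape: (N,), ascending
        n_tail = max(1, int(self.alpha * quantiles.numel()))
        cvar = quantiles[:n_tail].mean()                  # mean of worst alpha-fraction
        mean = quantiles.mean()
        # A large gap between the mean return and the tail return
        # indicates a risky environment.
        return (mean - cvar) / (mean.abs() + 1e-6)

    def act(self, obs):
        key = "averse" if self.risk_level(obs) > self.risk_threshold else "neutral"
        return self.policies[key](obs)
```

The design choice sketched here is that the same critic that trains the risk-averse policy doubles as the high-level risk estimator at deployment time, so no separate risk classifier needs to be learned.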