Research on Self-Imitation Learning Based on Behavior Cloning
(基于行为克隆的自模仿学习算法研究)

Author: 战昱竹
  • Student ID
    2016******
  • Degree
    Master
  • Email
    zha******net
  • Defense Date
    2022.09.09
  • Advisor
    张旭东
  • Discipline
    Information and Communication Engineering
  • Pages
    75
  • Confidentiality Level
    Public
  • Department
    023 Department of Electronic Engineering
  • Keywords
    Reinforcement Learning, Behavior Cloning, Self-Imitation Learning, Exploitation of Experiences

Abstract

Reinforcement learning (RL) has attracted growing attention in recent years. Unlike traditional machine learning (ML), RL is built on trial and error and reward signals: an agent interacts with the environment, receives feedback, and continually improves its policy. Exploration and exploitation are two key processes in RL, and balancing them is a central concern for many researchers. To better exploit the experience an agent has already collected, existing approaches include experience replay, prioritized experience replay, imitation learning from demonstration data, and conventional self-imitation learning.

We propose Self-Imitation Learning with Behavior Cloning (SILBC), combine it with traditional RL methods to obtain better performance, and validate the resulting algorithms in simulation experiments. SILBC allows the policy network to learn directly from experience data: it first selects from the replay buffer the transitions with higher returns, and then imitates these high-quality behaviors by minimizing the distance between the actions output by the policy network and the actions stored in the selected experiences. In addition, the behavior-cloning-based self-imitation term is added to the objective of the traditional RL algorithm, which mitigates the mismatch between data distributions and further improves performance.

We first implement SILBC on top of the single-agent RL algorithm DDPG, yielding SILBC-DDPG. To evaluate it, we design a single-agent navigation task in a 2D plane and focus on four metrics: task success rate, average reward, collision rate, and the time taken to complete the task. Compared with DDPG and SIL-DDPG, SILBC-DDPG performs best on all four metrics. We then extend behavior-cloning-based self-imitation learning to the multi-agent setting, propose SILBC-MADDPG for multi-agent RL, and evaluate it on a 2D multi-agent navigation task. SILBC-MADDPG likewise outperforms MADDPG and SIL-MADDPG on all four metrics.

In summary, the proposed behavior-cloning-based self-imitation learning method performs well in both single-agent and multi-agent problems, which demonstrates the effectiveness of the proposed algorithms.
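The abstract describes SILBC as selecting higher-return transitions from the replay buffer and adding a behavior-cloning (self-imitation) term to the standard RL objective. The sketch below illustrates how such a combined actor update could look in the DDPG case, written with PyTorch; the helper names, the simple thresholding rule, and the weighting coefficient `lambda_bc` are illustrative assumptions, not details taken from the thesis.

```python
import torch.nn.functional as F


def filter_good_experiences(states, actions, returns, return_threshold):
    """Select transitions whose return exceeds a threshold.

    The thesis filters higher-return experiences out of the replay buffer;
    the concrete thresholding rule used here is an illustrative assumption.
    """
    mask = returns > return_threshold
    return states[mask], actions[mask]


def silbc_actor_loss(actor, critic, states, good_states, good_actions,
                     lambda_bc=0.1):
    """Combined actor objective: DDPG term plus a behavior-cloning
    (self-imitation) term on the selected high-return experiences.

    `critic(s, a)` is assumed to return Q(s, a); `lambda_bc` weights the
    self-imitation term and is a placeholder hyperparameter.
    """
    # Standard DDPG actor objective: maximize Q(s, pi(s)).
    ddpg_loss = -critic(states, actor(states)).mean()

    # Self-imitation via behavior cloning: minimize the distance between
    # the policy's actions and the high-return actions from the buffer.
    bc_loss = F.mse_loss(actor(good_states), good_actions)

    return ddpg_loss + lambda_bc * bc_loss
```

In a training loop, `filter_good_experiences` would be applied to a sampled batch before each actor update, and `lambda_bc` would trade off how strongly self-imitation pulls the policy toward past high-return behavior versus the critic's current value estimate.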