Investigating Anti-Delay Methods for Reinforcement Learning Based on Multi-Step Relabeling and Coordinator

Author: 杨宇
  • Student ID
    2020******
  • Degree
    Master's
  • Email
    yuy******com
  • Defense Date
    2023.05.15
  • Advisor
    李秀
  • Discipline
    Electronic Information
  • Pages
    61
  • Confidentiality Level
    Public
  • Training Unit
    599 International Graduate School
  • Keywords
    deep reinforcement learning, reward delay, multi-agent reinforcement learning, communication delay

Abstract

Artificial intelligence is developing rapidly today, and reinforcement learning is seeing more and more applications in virtual simulation environments. In real-world scenarios, however, deploying reinforcement learning algorithms remains difficult. Delays are ubiquitous in real environments and can sharply degrade both the learning speed and the convergence results of reinforcement learning algorithms. This thesis aims to improve the performance of reinforcement learning algorithms in delayed environments and to provide solutions to the deployment difficulties that delays cause. According to where the delay appears in the interaction between the agent and the environment, the delay problem can be classified into reward delay, action delay, and state delay. After a detailed survey of the research background and the state of the art, this thesis studies the delay problem from two angles: single-agent reward delay and multi-agent communication delay.

(1) Building on the hindsight experience replay (HER) algorithm, this thesis proposes a multi-step HER($\lambda$) algorithm based on $n$-step relabeling and $n$-step returns to overcome reward delay. The algorithm combines multi-step relabeling to improve sample efficiency and uses a hyperparameter $\lambda$ to control the estimation bias of the multi-step value function. Experiments in several robot control environments show that the algorithm effectively improves performance under delayed rewards, while incurring a lower computational cost and converging its value function faster than other HER-based methods.

(2) For multi-agent reinforcement learning, where agents routinely communicate and communication delay in turn delays their actions and observations, this thesis first proposes a scheme for modeling the communication delay between agents. Based on this scheme, various types of delay can be introduced into simulation experiments. On top of this communication channel, the thesis further proposes a coordinator-agent communication framework that recasts the multi-agent reinforcement learning problem into a form with explicit communication delay, and then analyzes how the framework helps alleviate delayed communication.

(3) To further mitigate the impact of delay on multi-agent communication, this thesis builds on the coordinator-agent framework and proposes a delay-resistant prediction model based on a denoising autoencoder, the DAAE model. The model uses the messages already present in the message buffer to predict the missing ones, alleviating the information loss caused by delay. Combining this framework with a decentralized multi-agent reinforcement learning algorithm, we propose a delay-resistant, model-based reinforcement learning framework, the DACA framework. Extensive experiments in the multi-agent communication benchmark ma-gym demonstrate the effectiveness of the method. Further ablation studies and stress tests show that both the buffer mechanism and the DAAE model help mitigate the negative effects of delay: the framework still maintains an 80% win rate in the meeting-in-a-grid environment under large delays, and with a smaller buffer it outperforms the baseline algorithm while reducing the use of computing resources.
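
To make the multi-step return in contribution (1) concrete, the following is a minimal sketch of a $\lambda$-weighted mixture of $n$-step returns of the kind HER($\lambda$) uses to trade off estimation bias and sample efficiency. It assumes a finished episode stored as flat Python lists and a bootstrapped value estimate per state; the function names are illustrative, and the goal-relabeling step that HER performs before these targets are computed is omitted, so this is not the thesis implementation.

    import numpy as np

    def n_step_return(rewards, values, t, n, gamma):
        """G_t^(n): n discounted rewards followed by a bootstrapped value."""
        n = min(n, len(rewards) - t)                 # truncate at episode end
        g = sum(gamma ** k * rewards[t + k] for k in range(n))
        if t + n < len(values):                      # bootstrap on the state n steps ahead
            g += gamma ** n * values[t + n]
        return g

    def lambda_return(rewards, values, t, gamma=0.98, lam=0.7, n_max=4):
        """Mix the 1..n_max step returns with weights (1-lam)*lam**(n-1),
        giving the remaining mass lam**(n_max-1) to the longest return."""
        weights = np.array([(1 - lam) * lam ** (n - 1) for n in range(1, n_max)]
                           + [lam ** (n_max - 1)])   # weights sum to 1
        returns = np.array([n_step_return(rewards, values, t, n, gamma)
                            for n in range(1, n_max + 1)])
        return float(weights @ returns)

Setting lam close to 0 recovers the one-step target (low variance but more bias when the reward arrives late), while lam close to 1 leans on the longest multi-step return, which is why a single hyperparameter can control the bias of the multi-step value estimate.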
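
For the communication-delay modeling in contribution (2), the sketch below shows one simple way to simulate a channel in which a message sent to the coordinator or to another agent only arrives after some number of environment steps. The class name, the default delay distribution, and the discrete-step interface are assumptions made for illustration; the thesis's modeling scheme may parameterize delay differently.

    import random
    from collections import defaultdict

    class DelayedChannel:
        """Toy channel: a message sent at step t is delivered at step t + d,
        where d is drawn from a caller-supplied delay distribution."""

        def __init__(self, delay_fn=lambda: random.randint(0, 3)):
            self.delay_fn = delay_fn
            self.in_flight = defaultdict(list)   # delivery step -> [(receiver, message)]
            self.step = 0

        def send(self, receiver, message):
            self.in_flight[self.step + self.delay_fn()].append((receiver, message))

        def tick(self):
            """Advance one environment step; return messages arriving now, by receiver."""
            due = [t for t in self.in_flight if t <= self.step]
            delivered = defaultdict(list)
            for t in due:
                for receiver, message in self.in_flight.pop(t):
                    delivered[receiver].append(message)
            self.step += 1
            return delivered

With delay_fn=lambda: 0 the channel degenerates to instantaneous communication, so the same training loop can be run with and without delay for comparison, which matches the abstract's point that different types of delay can be plugged into the simulation.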
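
Contribution (3) predicts the messages that have not yet arrived from those already in the buffer. Below is a minimal PyTorch sketch of that denoising-autoencoder idea: complete message buffers are corrupted by zero-masking random slots during training, and the network learns to reconstruct the full buffer, so at execution time the delayed messages can be filled in with its predictions before the agents' policies are queried. The layer sizes, the masking scheme, and the class name (which only echoes the abstract's DAAE acronym) are assumptions for illustration rather than the thesis's actual architecture.

    import torch
    import torch.nn as nn

    class DAAE(nn.Module):
        """Denoising autoencoder over a flattened message buffer:
        input  = buffer with delayed messages zero-masked,
        output = reconstruction of the complete buffer."""

        def __init__(self, n_agents=4, msg_dim=8, hidden=64):
            super().__init__()
            d = n_agents * msg_dim
            self.encoder = nn.Sequential(nn.Linear(d, hidden), nn.ReLU(),
                                         nn.Linear(hidden, hidden // 2), nn.ReLU())
            self.decoder = nn.Sequential(nn.Linear(hidden // 2, hidden), nn.ReLU(),
                                         nn.Linear(hidden, d))

        def forward(self, masked_buffer):
            return self.decoder(self.encoder(masked_buffer))

    def train_step(model, optimizer, full_buffers, drop_prob=0.3):
        """One denoising step: drop each agent's message with prob drop_prob,
        then reconstruct the complete buffer from the corrupted one."""
        b, n, d = full_buffers.shape                       # (batch, agents, msg_dim)
        keep = (torch.rand(b, n, 1) > drop_prob).float()   # 0 = message still delayed
        masked = (full_buffers * keep).reshape(b, -1)
        loss = nn.functional.mse_loss(model(masked), full_buffers.reshape(b, -1))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()

In a DACA-style execution loop, each agent would take the messages delivered so far (for example from a channel like the one sketched above), zero-fill the slots that are still in flight, pass the buffer through the trained model, and act on the reconstruction, which is how the buffer mechanism and the predictor together soften the effect of delay.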