Research on Multi-Agent Flocking Control-Oriented Reinforcement Learning Algorithms

Author: 邱云波 (Qiu Yunbo)
  • Student ID
    2018******
  • Degree
    Doctoral
  • Email
    qiu******com
  • Defense date
    2024.05.26
  • Advisor
    张旭东 (Zhang Xudong)
  • Discipline
    Information and Communication Engineering
  • Pages
    110
  • Confidentiality level
    Public
  • Affiliation
    023 Department of Electronic Engineering
  • Keywords
    multi-agent system; deep reinforcement learning; flocking control; collaborative task; Markov games

Abstract

In recent years, applications of multi-agent systems across the national economy and people's livelihood have attracted increasing attention. Compared with traditional control methods, multi-agent flocking control policies obtained through deep reinforcement learning show higher adaptability to different environments, greater flexibility in complex environments, and stronger robustness to environmental uncertainty. However, current multi-agent reinforcement learning algorithms exhibit insufficient coordination at the system level and lack an explicit treatment of the pairwise influence between agents; their training requires excessive interaction with the environment, so sampling efficiency is low and obtaining a policy is costly; and at deployment they optimize mainly for expected return and provide little guarantee of motion safety. This thesis therefore studies multi-agent reinforcement learning methods and key techniques that make the system more coordinated, the training more sample-efficient, and the deployment safer, in order to solve multi-agent flocking control problems.

First, to address the insufficient coordination of multi-agent systems, a mutual-help-based multi-agent reinforcement learning method is proposed. On top of the conventional actor-critic framework for multi-agent reinforcement learning, an expected-action module is added: during training, an auxiliary network lets each agent propose the actions it expects the other agents to take, and each agent then adopts the expected actions it receives, provided this does not significantly degrade its own performance, so as to help the other agents as much as possible. Experimental results show that the proposed method improves the coordination of the multi-agent system in flocking control tasks.
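To make the mechanism concrete, the following is a minimal sketch of an expected-action module in a PyTorch-style actor-critic setting; ExpectedActionNet, adopt_expected_action, the blending coefficient alpha, and the tolerance tol are hypothetical illustrative choices, not the thesis implementation.

```python
# Illustrative sketch only: an "expected action" module on top of an
# actor-critic agent. All names and constants are hypothetical.
import torch
import torch.nn as nn

class ExpectedActionNet(nn.Module):
    """Auxiliary network: from agent i's observation, propose the action
    that agent i would like another agent j to take."""
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim), nn.Tanh(),
        )

    def forward(self, obs_i):
        return self.net(obs_i)

def adopt_expected_action(own_action, expected_action, critic, obs,
                          alpha=0.5, tol=0.05):
    """Blend a received expected action into the agent's own action, but
    only if the critic value does not drop by more than `tol`."""
    blended = (1 - alpha) * own_action + alpha * expected_action
    q_own = critic(torch.cat([obs, own_action], dim=-1))
    q_blend = critic(torch.cat([obs, blended], dim=-1))
    return blended if (q_own - q_blend).mean() <= tol else own_action
```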
Second, to address the low sampling efficiency of multi-agent training, a multi-agent reinforcement learning method assisted by a non-expert prior policy is proposed, which injects the task knowledge contained in a non-expert prior policy to improve sampling efficiency. In a pre-training phase, a small number of experience samples are generated with the prior policy so that the agents acquire preliminary knowledge of the task, yielding a warm start; in the online reinforcement learning phase, whenever the prior policy produces a better action than an agent's own policy, the agent's policy imitates the prior policy. Experimental results show that the proposed method achieves better performance with the same number of training samples, i.e., higher sampling efficiency.
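The prior-policy assistance can be sketched as follows, assuming a gym-style environment, a replay buffer exposing an add method, and PyTorch-style actor/critic modules; warm_start and imitation_loss are hypothetical names, and judging "better" via the current critic is one plausible reading, not the thesis implementation.

```python
# Illustrative sketch only: warm-start the replay buffer from a non-expert
# prior policy, then imitate it only on states where its action scores
# higher under the current critic.
import torch

def warm_start(buffer, env, prior_policy, n_steps=5_000):
    """Pre-training: fill the buffer with a small amount of experience
    generated by the (non-expert) prior policy."""
    obs = env.reset()
    for _ in range(n_steps):
        act = prior_policy(obs)
        next_obs, rew, done, _ = env.step(act)
        buffer.add(obs, act, rew, next_obs, done)
        obs = env.reset() if done else next_obs

def imitation_loss(actor, critic, prior_policy, obs):
    """Online phase: on states where the prior action is judged better by
    the critic, pull the actor's action toward the prior action."""
    own_act = actor(obs)                                   # differentiable w.r.t. the actor
    with torch.no_grad():
        prior_act = prior_policy(obs)
        better = (critic(obs, prior_act) > critic(obs, own_act)).float().reshape(-1)
    per_state = ((own_act - prior_act) ** 2).mean(dim=-1)  # behaviour-cloning term
    return (better * per_state).mean()
```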
Finally, to address the lack of safety guarantees when multi-agent systems are deployed, a multi-agent reinforcement learning method based on dynamic safety shielding is proposed. It does not rely on strong prior knowledge; instead, it derives a safety shield over actions from the agents' own motion model. During training, the range of the safety shield is adjusted dynamically according to the agents' real-time unsafe rate, and the shield is gradually relaxed as the agents become safer. A two-stage training procedure with a backup policy further improves the safety of the multi-agent system. Experimental results demonstrate that the proposed method effectively reduces the unsafe rate during both training and execution.
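A rough sketch of such a dynamic shield under an assumed double-integrator motion model is given below; DynamicShield, the distance-based safety condition, the adaptation rule, and the braking fallback are illustrative assumptions rather than the thesis design.

```python
# Illustrative sketch only: a dynamic safety shield driven by the agents'
# own (assumed double-integrator) motion model. All names are hypothetical.
import numpy as np

class DynamicShield:
    def __init__(self, d_min=0.5, d_max=2.0, adapt_rate=0.1):
        self.safe_dist = d_max          # start with a conservative shield
        self.d_min, self.d_max = d_min, d_max
        self.adapt_rate = adapt_rate

    def update(self, unsafe_rate, target_rate=0.05):
        """Tighten the shield when the measured unsafe rate exceeds the
        target, relax it gradually as the agents become safer."""
        self.safe_dist += self.adapt_rate * (unsafe_rate - target_rate)
        self.safe_dist = float(np.clip(self.safe_dist, self.d_min, self.d_max))

    def filter(self, pos, vel, action, others, dt=0.1, backup_policy=None):
        """Check the action against the motion model; fall back to a backup
        policy (or brake) if the predicted position violates the shield."""
        next_pos = pos + (vel + action * dt) * dt           # one assumed integration step
        dists = np.linalg.norm(others - next_pos, axis=-1)  # distances to other agents
        if dists.size == 0 or dists.min() >= self.safe_dist:
            return action
        return backup_policy(pos, vel, others) if backup_policy else -action  # naive braking fallback
```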