
Multi-Agent Reinforcement Learning for Group Cooperation and its Application in Traffic Light Control

Author: Li Chenghao
  • Student ID
    2018******
  • Degree
    Doctoral
  • Email
    lic******.cn
  • Defense date
    2023.05.22
  • Advisor
    Zhao Qianchuan
  • Discipline
    Control Science and Engineering
  • Pages
    108
  • Confidentiality level
    Public
  • Department
    025 Department of Automation
  • Keywords
    group cooperation, multi-agent system, deep reinforcement learning, centralized training with decentralized execution, intelligent traffic light control

Abstract


Enabling intelligence for multi-agent systems was identified as an important research direction in the Notice on the Development Plan for the New Generation of Artificial Intelligence released by China's State Council in 2017. Within the centralized training with decentralized execution (CTDE) framework, multi-agent reinforcement learning can achieve basic cooperation under the restrictions of the Dec-POMDP model. However, the exponential explosion of the state-action space as the number of agents grows still hinders exploration and optimization, and more challenging cooperative tasks require the group to divide labor and coordinate efficiently at the same time, which makes learning group cooperation even harder. To address these problems, the thesis separates group cooperation into two levels, division of labor and coordination, designs for each level specifically, and balances the two to further improve the learning efficiency of group cooperation. Building on this balance, the thesis studies how lightweight sub-task decomposition can improve optimization efficiency, and how information-theoretic factors can be adjusted dynamically to encourage continuous exploration. Finally, the thesis takes traffic light control in a traffic simulation environment as an example to examine the feasibility of applying multi-agent reinforcement learning. The main research results are:

1. To address the lack of an explicit representation of, and incentive for, coordination between agent policies in the CTDE framework, a policy covariance optimization algorithm is proposed, which improves training efficiency by more than 20% on several classic tasks.

2. To address insufficient attention to the division of labor in the CTDE framework, a balance between group consistency and individual diversity is proposed to stimulate the necessary division of labor while preserving coordination. The proposed approach achieves leading performance on mainstream international group-cooperation benchmarks.

3. To address the overly complex high-level policies that current multi-agent hierarchical reinforcement learning introduces into the CTDE framework, a lightweight decomposition of the overall task at both the temporal level and the agent-group level is proposed, building on the above balance, by modeling and encouraging the diversity of observation representations. The proposed approach surpasses advanced multi-agent hierarchical reinforcement learning algorithms on mainstream international group-cooperation benchmarks.

4. Experiments reveal that information-theoretic factors estimated through neural-network variational inference can cause repeated exploration in the CTDE framework. To address this problem, a dynamic revisitation-monitoring method and an adaptive intrinsic-reward adjustment algorithm are proposed, which improve the exploration efficiency of group cooperation and achieve further gains on mainstream international group-cooperation benchmarks.

5. To address the low efficiency of directly training traffic light control for large-scale traffic networks, the characteristics of the traffic simulation environment are analyzed through single-intersection traffic light control. Based on transfer learning and the thesis's results on multi-agent sub-task decomposition, the feasibility of using policies trained on small-scale road networks to directly control traffic lights in large-scale networks is demonstrated.
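To make the CTDE setting referenced throughout the abstract concrete, the following is a minimal, hypothetical sketch of an additively decomposed joint value (in the style of value-decomposition networks, not the thesis's own algorithms): training uses the joint reward centrally, while each agent keeps its own value table and acts on it alone at execution time. The toy two-agent game and all names here are illustrative assumptions, not from the thesis.

```python
import numpy as np

# Toy two-agent, single-state cooperative game with an additive reward
# structure: reward[a0][a1] = r0(a0) + r1(a1), so action 1 is best for both.
N_ACTIONS = 2
reward = np.array([[0.0, 0.4],
                   [0.6, 1.0]])  # indexed as reward[a0, a1]

rng = np.random.default_rng(0)
# One value table per agent (decentralized policies).
q = [np.zeros(N_ACTIONS), np.zeros(N_ACTIONS)]

alpha, eps = 0.1, 0.2
for _ in range(2000):
    # Epsilon-greedy decentralized action selection.
    acts = [int(rng.integers(N_ACTIONS)) if rng.random() < eps
            else int(np.argmax(q[i])) for i in range(2)]
    r = reward[acts[0], acts[1]]
    # Centralized training: the joint value is the SUM of per-agent
    # utilities, and the shared TD error is split equally between agents.
    td = r - (q[0][acts[0]] + q[1][acts[1]])
    for i in range(2):
        q[i][acts[i]] += alpha * td / 2

# Decentralized execution: each agent acts greedily on its own table,
# without seeing the other agent's action or value.
greedy = [int(np.argmax(q[i])) for i in range(2)]
print(greedy)
```

The additive decomposition is what allows the greedy joint action to be recovered from purely local argmax operations; the thesis's contributions address what this baseline lacks, such as explicit coordination incentives and division of labor.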
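Result 4 describes monitoring revisitation and adapting intrinsic rewards; the abstract does not give the mechanism, so the sketch below substitutes a simple count-based novelty bonus whose scale is cut whenever a recent window of steps is dominated by revisits. The class name, window scheme, and decay rule are all hypothetical stand-ins for the thesis's variational-inference-based estimator.

```python
import math
from collections import Counter

class AdaptiveIntrinsicReward:
    """Count-based novelty bonus that shrinks when the agent keeps
    revisiting known states -- a simplified, hypothetical stand-in for the
    revisitation-monitoring idea described in the abstract."""

    def __init__(self, beta=1.0, window=100, revisit_threshold=0.8, decay=0.5):
        self.counts = Counter()       # visit counts per (hashable) state
        self.beta = beta              # current intrinsic-reward scale
        self.window = window          # how many recent steps to monitor
        self.revisit_threshold = revisit_threshold
        self.decay = decay
        self.recent_revisits = []     # 1 if a step revisited a known state

    def bonus(self, state):
        revisit = state in self.counts
        self.counts[state] += 1
        self.recent_revisits.append(1 if revisit else 0)
        if len(self.recent_revisits) >= self.window:
            # If the recent window is mostly revisits, the bonus is no longer
            # driving new exploration: shrink it so extrinsic reward (and
            # fresh exploration directions) dominate again.
            if sum(self.recent_revisits) / len(self.recent_revisits) > self.revisit_threshold:
                self.beta *= self.decay
            self.recent_revisits.clear()
        return self.beta / math.sqrt(self.counts[state])

rw = AdaptiveIntrinsicReward(beta=1.0, window=10)
first = rw.bonus((0, 0))                        # novel state: full bonus
later = [rw.bonus((0, 0)) for _ in range(20)]   # repeated revisits shrink beta
print(first, later[-1])
```

The key design point mirrored here is that the bonus decays on two fronts: per-state through the visit count, and globally through the detected revisitation rate, so a group stuck re-exploring the same region stops being rewarded for it.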
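Result 5 transfers a policy trained on a small network to a large one; one common way to realize this, sketched below under assumed names, is parameter sharing: the same per-intersection policy is applied independently at every intersection of the large network. The greedy longest-queue rule stands in for a learned policy, and the observation layout is invented for illustration (a real setup would use a simulator such as SUMO).

```python
import numpy as np

def trained_single_intersection_policy(obs):
    """Stand-in for a policy trained on one intersection: choose the phase
    serving the approach with the longest queue. `obs` holds queue lengths
    per incoming approach; in practice this would be a learned network."""
    return int(np.argmax(obs))

def control_large_network(observations):
    """Transfer by parameter sharing: the identical per-intersection policy
    is evaluated locally at every intersection, so the controller scales to
    arbitrarily large networks without retraining."""
    return {iid: trained_single_intersection_policy(obs)
            for iid, obs in observations.items()}

# Hypothetical 2x2 grid: queue lengths on the four approaches of each node.
obs = {
    "n00": np.array([3, 0, 1, 0]),
    "n01": np.array([0, 5, 0, 2]),
    "n10": np.array([1, 1, 4, 0]),
    "n11": np.array([0, 0, 0, 6]),
}
phases = control_large_network(obs)
print(phases)
```

Because each intersection is controlled from its local observation only, this is also consistent with the decentralized-execution side of CTDE; the thesis's sub-task decomposition results address how to keep such shared policies effective when intersections play different roles in the network.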