In recent years, the perceptual intelligence of artificial intelligence (AI) in seeing, listening, and speaking has reached or even surpassed human levels, and the field now needs to evolve further toward decision-making intelligence. Decision-making intelligence can be divided into two categories: individual decision-making intelligence and group decision-making intelligence. Compared with individual decision-making, group decision-making offers a larger perceptual range and a larger policy execution space, providing a new path for further advancing AI. Multi-agent reinforcement learning (MARL) has the potential to model complex policies and is regarded as a key means of achieving group decision-making intelligence. However, many challenging problems remain in this field. This thesis focuses on two of them: fundamental algorithms and multi-agent generalization, with the fundamental algorithms providing the algorithmic foundation for the study of generalization.

In terms of fundamental algorithms, to address the low training efficiency, poor robustness, and lack of generality of existing MARL algorithms, this thesis first proposes Multi-Agent Proximal Policy Optimization (MAPPO), an algorithm with high robustness, efficiency, and performance. Without any specialized network architecture or algorithmic design, MAPPO achieves the best performance on four mainstream multi-agent benchmarks with only minimal hyperparameter search. Furthermore, five key factors affecting MAPPO's performance are analyzed and practical suggestions are given. Second, to address the performance degradation that arises when agents execute asynchronously in real environments, MAPPO is extended to the asynchronous setting, yielding Asynchronous MAPPO (Async-MAPPO). Taking multi-agent cooperative exploration as an example task, asynchronous explorers trained with Async-MAPPO reduce actual exploration time by more than 10% compared with traditional methods, demonstrating the effectiveness of the algorithm.
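For concreteness, the sketch below illustrates the kind of clipped-surrogate policy update with a centralized value function that MAPPO builds on. It is a minimal, hypothetical example (parameter sharing, discrete actions, made-up names and dimensions), not the implementation evaluated in this thesis.

```python
# Minimal MAPPO-style update sketch (illustrative only, not the thesis code).
# Assumptions: a fully cooperative task, a shared actor over local observations,
# and a centralized critic over the global state; all names/sizes are hypothetical.
import torch
import torch.nn as nn

OBS_DIM, STATE_DIM, N_ACTIONS, N_AGENTS = 8, 16, 4, 3
CLIP_EPS = 0.2  # PPO clipping range, one of the key hyperparameters studied in the thesis

actor = nn.Sequential(nn.Linear(OBS_DIM, 64), nn.Tanh(), nn.Linear(64, N_ACTIONS))
critic = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.Tanh(), nn.Linear(64, 1))
optim = torch.optim.Adam(list(actor.parameters()) + list(critic.parameters()), lr=5e-4)

def mappo_update(obs, state, actions, old_log_probs, returns, advantages):
    """One clipped-surrogate update shared by all agents (parameter sharing)."""
    dist = torch.distributions.Categorical(logits=actor(obs))
    log_probs = dist.log_prob(actions)
    ratio = torch.exp(log_probs - old_log_probs)            # importance ratio per sample
    surr1 = ratio * advantages
    surr2 = torch.clamp(ratio, 1 - CLIP_EPS, 1 + CLIP_EPS) * advantages
    policy_loss = -torch.min(surr1, surr2).mean()            # PPO clipped objective
    value_loss = (critic(state).squeeze(-1) - returns).pow(2).mean()  # centralized critic
    loss = policy_loss + 0.5 * value_loss - 0.01 * dist.entropy().mean()
    optim.zero_grad()
    loss.backward()
    optim.step()
    return loss.item()

# Dummy batch purely to show the expected shapes.
B = 32 * N_AGENTS
print(mappo_update(torch.randn(B, OBS_DIM), torch.randn(B, STATE_DIM),
                   torch.randint(0, N_ACTIONS, (B,)), torch.full((B,), -1.4),
                   torch.randn(B), torch.randn(B)))
```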
In terms of multi-agent generalization, this thesis addresses the poor generalization of the above fundamental algorithms to unknown numbers and types of agents. First, for the agent-number problem, a cooperative MARL policy representation that is invariant to the number of agents is designed. Based on this representation, an RL-based multi-agent cooperative visual exploration scheme is built. In unknown scenes, the RL-based scheme achieves 7.99% higher exploration efficiency than planning-based algorithms, generalizes to unknown numbers of agents, and handles tasks in which the number of agents varies. Second, for the agent-type problem, with the goal of cooperating with humans of different preferences, an adaptive policy for human-AI collaboration is trained from a diverse policy pool. A reward randomization policy gradient method that perturbs a low-dimensional reward space is proposed; it discovers multiple human-interpretable diverse policies and can explore high-risk cooperative strategies. Furthermore, building on reward randomization, a hidden-utility self-play algorithm for human-AI collaboration is designed that requires no human data and models humans' different preference-driven behaviors with hidden utilities over the diverse policy pool. Experiments with real humans show that the algorithm generalizes to humans with different preferences.
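Similarly, the following minimal sketch illustrates one common way to make a policy representation invariant to the number of agents: pooling per-agent features into a fixed-size summary. The network, pooling choice, and dimensions here are hypothetical and are not taken from the thesis design.

```python
# Minimal agent-number-invariant policy sketch (illustrative only).
# Assumption: each agent encodes the other agents' features and aggregates them with a
# count- and order-invariant pooling (mean pooling here; the thesis uses its own design).
import torch
import torch.nn as nn

class NumberInvariantPolicy(nn.Module):
    def __init__(self, self_dim=6, other_dim=6, hidden=64, n_actions=4):
        super().__init__()
        self.self_enc = nn.Sequential(nn.Linear(self_dim, hidden), nn.ReLU())
        self.other_enc = nn.Sequential(nn.Linear(other_dim, hidden), nn.ReLU())
        self.head = nn.Linear(2 * hidden, n_actions)

    def forward(self, self_obs, others_obs):
        # self_obs: (batch, self_dim); others_obs: (batch, n_others, other_dim)
        # Mean pooling keeps the output shape fixed for any number of other agents.
        pooled = self.other_enc(others_obs).mean(dim=1)
        return self.head(torch.cat([self.self_enc(self_obs), pooled], dim=-1))

policy = NumberInvariantPolicy()
# The same network handles 2 or 5 teammates without changing its input layer.
for n_others in (2, 5):
    logits = policy(torch.randn(1, 6), torch.randn(1, n_others, 6))
    print(n_others, logits.shape)  # -> torch.Size([1, 4]) in both cases
```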