Deep Reinforcement Learning (DRL) is a promising and general framework for solving sequential decision-making problems. In recent years, DRL has achieved tremendous success, surpassing human-level performance in domains such as video games and robotics. However, these achievements are mostly confined to settings with a single agent, a single task, and a large number of online interactions. Practical demands in the real world often require DRL to work in three more challenging scenarios: multi-agent systems, offline learning, and multi-task learning. In these settings, agents can communicate and cooperate with other agents; they can leverage large pre-collected offline datasets to train policies, avoiding extensive and costly online interaction with the environment; and they can learn meta-policies that adapt rapidly to unseen testing tasks. Combining these three fundamental building blocks of reinforcement learning applications has the potential to solve real-world challenges such as autonomous driving. This thesis aims to develop theoretical models and design practical algorithms for DRL within these three fundamental application modules.

In cooperative multi-agent reinforcement learning, each agent must collaborate with the other agents to solve a task, and different tasks impose different bandwidth limits on inter-agent communication. For the optimization algorithms of multi-agent reinforcement learning, we formalize a general theoretical framework, analyze the convergence and optimality of existing algorithms, and find that a complete action-value function space can achieve both properties (an illustrative statement of this completeness condition is sketched after the abstract). Motivated by this theory, we propose the first efficient practical algorithm that satisfies the completeness property. For the communication strategy of multi-agent reinforcement learning, we propose two information-theoretic metrics that measure the effectiveness and succinctness of agent communication, enabling agents to maximize communication efficiency under limited bandwidth (one plausible formulation is also sketched after the abstract).

Offline reinforcement learning is a new learning paradigm that trains agents using only pre-collected offline datasets, thereby avoiding costly online exploration. Given a fixed dataset, balancing the conservatism and the generalization of the learned policy is the core problem of offline reinforcement learning. In model-based offline reinforcement learning, we propose reverse offline model imagination, which achieves effective conservative generalization. This novel bidirectional learning paradigm connects backward imaginary trajectories with the forward trajectories pre-collected in the offline dataset, sharing a similar motivation with human bidirectional reasoning (a minimal sketch of the idea follows the abstract).

In offline meta-reinforcement learning, agents are trained on pre-collected offline multi-task datasets and must adapt rapidly to new tasks through online interaction at test time. For this practical scenario, we provide the first theoretical characterization of a unique challenge: the transition-reward distribution shift between the offline datasets and online adaptation. This discrepancy can make policy evaluation during offline meta-training unreliable. To address this issue, we propose a novel online adaptation framework that uses uncertainty quantification to select, during online adaptation, contexts that remain in-distribution with respect to the offline datasets, and then performs effective task-belief inference to solve unseen testing tasks (a sketch of the selection step also follows the abstract).
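The following block is only an illustrative sketch, not the thesis's own formulation: it states the standard Individual-Global-Max (IGM) principle from cooperative value factorization and the sense in which a factorized value-function class can be called complete. The symbols $Q_{tot}$, $Q_i$, $\boldsymbol{\tau}$, and $\mathcal{Q}$ are our notation and only an assumption about how the completeness condition referenced above might be written.

\[
\arg\max_{\mathbf{a}} Q_{tot}(\boldsymbol{\tau}, \mathbf{a})
= \Big( \arg\max_{a_1} Q_1(\tau_1, a_1), \ \dots, \ \arg\max_{a_n} Q_n(\tau_n, a_n) \Big).
\]

A factorized architecture with representable class $\mathcal{Q}$ is complete with respect to this principle if $\mathcal{Q}$ coincides exactly with the set of all joint action-value functions satisfying the condition; under such completeness, the factorized learner can in principle retain the convergence and optimality behavior of unfactorized joint-action value learning.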
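Likewise, the two information-theoretic communication metrics mentioned above can be pictured with a generic mutual-information objective. This is only one plausible instantiation, written in our own notation ($M_i$ for agent $i$'s message, $A_j$ and $T_j$ for a teammate's action and local trajectory, $\beta$ a trade-off weight, $\theta$ the communication parameters); it is not necessarily the exact objective used in the thesis.

\[
\max_{\theta}\; I\big(A_j;\, M_i \mid T_j\big) \;-\; \beta\, H\big(M_i\big),
\]

where the first term rewards messages that actually change the receiving teammate's decision (effectiveness), and the second term penalizes message entropy so that communication stays compact under a limited bandwidth budget (succinctness).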
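To make the reverse-imagination idea concrete, here is a minimal, self-contained Python sketch under our own assumptions. The reverse model and the dataset sampler below are random placeholders (reverse_model and sample_offline_batch are hypothetical names, not the thesis's implementation); the point is only the data flow: predecessor steps are imagined backwards from states that appear in the offline dataset, then reversed into forward order so they can be concatenated with the logged forward trajectories.

import numpy as np

rng = np.random.default_rng(0)
STATE_DIM, ACTION_DIM = 4, 2

def sample_offline_batch(batch_size):
    """Placeholder for sampling anchor states from a pre-collected offline dataset."""
    return rng.normal(size=(batch_size, STATE_DIM))

def reverse_model(next_states):
    """Placeholder reverse dynamics model: proposes (s, a, r) that could precede s'."""
    prev_states = next_states + 0.1 * rng.normal(size=next_states.shape)
    actions = rng.normal(size=(next_states.shape[0], ACTION_DIM))
    rewards = rng.normal(size=(next_states.shape[0], 1))
    return prev_states, actions, rewards

def backward_rollout(anchor_states, horizon):
    """Imagine `horizon` steps backwards from dataset states, stored in forward order."""
    steps = []
    s_next = anchor_states
    for _ in range(horizon):
        s_prev, a, r = reverse_model(s_next)
        steps.append((s_prev, a, r, s_next))  # (s, a, r, s') tuple, forward convention
        s_next = s_prev
    return list(reversed(steps))  # earliest imagined step first

anchors = sample_offline_batch(batch_size=8)     # states observed in the offline data
imagined = backward_rollout(anchors, horizon=5)  # backward trajectories ending in real states
print(f"imagined {len(imagined)} backward steps per batch of {len(anchors)} anchor states")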
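Finally, the uncertainty-based selection of in-distribution contexts can be sketched in the same spirit. Again, everything here is a hypothetical stand-in (ensemble_predict is a placeholder for predictors trained on the offline data, and the threshold is arbitrary): an ensemble scores each online transition, and only transitions with low ensemble disagreement, i.e., those that look like the offline data, are kept as context for task-belief inference.

import numpy as np

rng = np.random.default_rng(0)

def ensemble_predict(transition, num_members=5):
    """Placeholder ensemble trained on offline data: each member scores a transition."""
    return rng.normal(loc=float(np.sum(transition)), scale=0.5, size=num_members)

def select_in_distribution_context(online_transitions, disagreement_threshold):
    """Keep transitions whose ensemble disagreement (std of predictions) is low."""
    context = []
    for transition in online_transitions:
        predictions = ensemble_predict(transition)
        if predictions.std() < disagreement_threshold:  # low disagreement ~ in-distribution
            context.append(transition)
    return context

online_rollout = [rng.normal(size=6) for _ in range(20)]  # fake transitions from online adaptation
context = select_in_distribution_context(online_rollout, disagreement_threshold=0.6)
print(f"kept {len(context)} of {len(online_rollout)} transitions as in-distribution context")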