Reinforcement learning (RL) is a general paradigm for solving sequential decision-making problems: an agent interacts with an unknown environment in order to maximize its expected long-term reward. Recent years have witnessed great success of RL with deep feature representations, i.e., deep reinforcement learning (DRL), in many challenging tasks, including computer games and robotic control. However, RL still faces several major challenges, especially when combined with deep neural networks. The agent typically relies on a value function to maximize the expected return, which requires a good estimate of that function. Since the environment is unknown, the agent has to learn by interacting with the environment and collecting experience, which usually requires many interactions and a large number of samples. Real-world environments are often complex and high-dimensional, so applying DRL efficiently and practically across different applications is also challenging. In addition, many practical applications involve multiple agents, and the complicated relationships between agents make the problem even more challenging. This thesis focuses on developing robust, efficient, and practical DRL algorithms, and targets three core questions. Firstly, how can we ensure convergent and robust learning behavior of an RL agent? Secondly, how can we tackle the high complexity of DRL algorithms? Thirdly, how can we successfully apply DRL algorithms in important practical applications (e.g., computational sustainability problems)? The novel contributions of the thesis are summarized as follows:

(1) Previous work showed that the Boltzmann softmax operator cannot guarantee robust learning behavior because it violates the non-expansion property. We prove that the induced error is bounded and can be controlled, and we further demonstrate that the operator has several advantages: it helps to smooth the optimization landscape and can reduce both overestimation and underestimation errors. To fully resolve the non-convergence issue and ensure convergence and robustness, we propose the dynamic Boltzmann softmax operator, which enjoys good convergence guarantees in both the planning and learning settings. We also show that the overestimation problem is a more severe practical challenge in multi-agent systems than previously acknowledged, and propose a regularization-based update together with an efficiently approximated softmax operator to tackle it in multi-agent RL.

(2) To improve the learning efficiency of DRL, we propose an effective scheme that leverages a population of diverse policies to improve exploration and thereby sample efficiency. We also study the challenging offline multi-agent setting, in which agents are not allowed to interact with the environment, and propose an actor rectification method that enables safe policy improvement, achieves state-of-the-art performance, and improves efficiency.

(3) To tackle the high-dimensional state and action spaces that arise in practical applications, we propose a DRL algorithm with a divide-and-conquer structure that handles complex spatio-temporal dependencies, and successfully apply it to rebalancing dockless bike-sharing systems.
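For concreteness, the following is a brief sketch of the operator discussed in contribution (1), using the standard definition of the Boltzmann softmax over action values; the notation and the value-iteration update shown here are an illustrative assumption rather than the thesis's exact formulation. The dynamic variant replaces the max operator with a Boltzmann softmax whose inverse temperature grows over iterations, recovering the max operator in the limit:

\[
\mathrm{boltz}_{\beta}\bigl(Q(s,\cdot)\bigr)
  = \frac{\sum_{a} e^{\beta Q(s,a)}\, Q(s,a)}{\sum_{a'} e^{\beta Q(s,a')}},
\qquad
V_{t+1}(s) = \mathrm{boltz}_{\beta_t}\bigl(Q_t(s,\cdot)\bigr),
\qquad \beta_t \to \infty,
\]

where \(Q_t(s,a) = \sum_{s'} p(s' \mid s,a)\bigl[r(s,a) + \gamma V_t(s')\bigr]\). For \(\beta \ge 0\) the operator lies between the mean and the max of the action values, which is the intuition behind the reduced overestimation error.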
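As a toy illustration of the overestimation claim (a minimal, self-contained sketch with assumed parameters, not code from the thesis), the snippet below compares the max operator with a Boltzmann softmax backup on noisy estimates of action values that are all truly zero; the max-based target is biased upward, while the softmax target with a moderate inverse temperature is noticeably less so.

import numpy as np

rng = np.random.default_rng(0)
true_q = np.zeros(10)            # all actions have true value 0
beta, n_trials = 5.0, 100_000    # illustrative inverse temperature and trial count

def boltzmann_softmax(q, beta):
    """Boltzmann softmax operator: softmax-weighted average of action values."""
    w = np.exp(beta * (q - q.max()))   # subtract max for numerical stability
    return float((w * q).sum() / w.sum())

max_targets, boltz_targets = [], []
for _ in range(n_trials):
    noisy_q = true_q + rng.normal(scale=1.0, size=true_q.shape)  # estimation noise
    max_targets.append(noisy_q.max())
    boltz_targets.append(boltzmann_softmax(noisy_q, beta))

# Every action's true value is 0, so any positive mean is overestimation bias.
print(f"max operator bias:      {np.mean(max_targets):.3f}")
print(f"Boltzmann softmax bias: {np.mean(boltz_targets):.3f}")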