
Optimality-Preserving Decomposition in Reinforcement Learning

Author: 任津生
  • Student ID
    2017******
  • Degree
    Doctoral
  • Email
    rjs******.cn
  • Defense date
    2023.08.22
  • Advisor
    陈峰
  • Discipline
    Control Science and Engineering
  • Pages
    150
  • Confidentiality level
    Public
  • Affiliation
    025 Department of Automation
  • Keywords
    reinforcement learning, value function decomposition, reward function decomposition, task decomposition, optimality-preserving decomposition

Abstract

In recent years, reinforcement learning has achieved breakthrough progress in many fields, including games, autonomous driving, robot control, and medical assistance. However, current reinforcement learning algorithms still face challenges such as high sample demand, sparse rewards, low exploration efficiency, difficulty in designing reward functions, and poor training stability. The root cause of these challenges lies in the large scale of tasks and the high complexity of decision making. An effective way to address this scale and complexity is divide and conquer, i.e. decomposing a large, complex Markov decision process into several small, simple Markov decision processes. Nonetheless, decomposition research in reinforcement learning still faces key open problems, including the imbalance between main and auxiliary returns in value function decomposition, the destruction of policy optimality in reward function decomposition, and the conflict between optimality and efficiency in task decomposition. To this end, this thesis systematically studies the algorithms and theory for decomposing different elements of reinforcement learning, from the perspectives of value function decomposition, reward function decomposition, and task decomposition. The main contributions are as follows.

1. To address the imbalance between main and auxiliary returns in value function decomposition, this thesis uses multi-objective optimization theory to model the balancing problem as the search for a Pareto-optimal solution with optimal main return. A Pareto metric with a preference for the main return is proposed, a self-balancing learning algorithm based on this metric is designed, the balance between main and auxiliary returns is proved to converge to a Pareto-optimal solution with probability one, and upper and lower bounds on the algorithm's time complexity are given. Experiments show that the proposed method outperforms existing reinforcement learning algorithms based on expert priors.

2. To address the destruction of policy optimality in reward decomposition, this thesis establishes a theory of optimality-preserving reward decomposition based on the HES metric, so that decomposing the reward function still guarantees an approximately optimal solution in the policy space. To make the decomposition as efficient as possible while preserving optimality, a decomposition criterion that is both optimality-preserving and efficient is proposed. An optimization algorithm based on this criterion is further designed and successfully solves complex decision-making tasks in the Doom game with first-person 3D visual input.

3. To address the conflict between optimality and efficiency in task decomposition, this thesis proposes, from an information-theoretic perspective, a constrained optimization objective that trades off decomposition optimality against decomposition efficiency. To quantify the efficiency of a decomposition, a theoretically grounded measure of sample complexity based on the optimal prior policy is derived. Building on these theoretical results, a gradient-optimizable hierarchical reinforcement learning algorithm is designed to realize the optimality-efficiency tradeoff in task decomposition. Experiments demonstrate the advantages of this algorithm in task decomposition and policy generalization.
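The first contribution frames main/auxiliary balancing as finding the Pareto-optimal solution that is best on the main return. The thesis' own Pareto metric and self-balancing learning algorithm are not specified in this abstract, so the sketch below is only a minimal illustration of the selection principle, assuming we already hold estimated (main, auxiliary) return pairs for a set of candidate policies; all function names and numbers are hypothetical.

```python
# Minimal sketch: among Pareto-optimal (main, auxiliary) return pairs,
# prefer the candidate with the highest main return. Illustrative only;
# not the thesis' Pareto metric or learning algorithm.
from typing import List, Tuple

def pareto_optimal(points: List[Tuple[float, float]]) -> List[int]:
    """Indices of points not dominated by any other point (maximization)."""
    front = []
    for i, (m_i, a_i) in enumerate(points):
        dominated = any(
            m_j >= m_i and a_j >= a_i and (m_j > m_i or a_j > a_i)
            for j, (m_j, a_j) in enumerate(points) if j != i
        )
        if not dominated:
            front.append(i)
    return front

def select_main_preferred(points: List[Tuple[float, float]]) -> int:
    """Among Pareto-optimal candidates, pick the one with the largest main return."""
    return max(pareto_optimal(points), key=lambda i: points[i][0])

if __name__ == "__main__":
    # (main return, auxiliary return) for four hypothetical candidate policies
    returns = [(10.0, 2.0), (9.0, 5.0), (10.0, 4.0), (8.0, 8.0)]
    print(select_main_preferred(returns))  # -> 2: main-optimal point on the Pareto front
```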
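The second contribution concerns reward decompositions that keep policy optimality intact. The HES metric and the optimality-preserving criterion are not defined in this abstract; as a small, standard reference point, the sketch below only verifies the linearity property that makes an additive reward decomposition exact under a fixed evaluation policy, the property whose failure under independent greedy optimization of each component is what such methods must guard against. All quantities are synthetic.

```python
# Illustrative check (not the thesis' HES-based criterion): under a *fixed*
# policy, value functions are linear in the reward, so V^pi for r1 + r2 equals
# V^pi for r1 plus V^pi for r2. The analogous identity does not hold for the
# optimal value functions of independently optimized components, which is why
# naive reward decomposition can destroy policy optimality.
import numpy as np

n_states, n_actions, gamma = 4, 2, 0.9
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a] -> dist over s'
r1 = rng.normal(size=(n_states, n_actions))                        # reward component 1
r2 = rng.normal(size=(n_states, n_actions))                        # reward component 2
pi = rng.dirichlet(np.ones(n_actions), size=n_states)               # fixed stochastic policy

def policy_value(r):
    """Exact V^pi for reward r via the Bellman linear system."""
    r_pi = (pi * r).sum(axis=1)                    # expected one-step reward under pi
    P_pi = np.einsum("sa,sat->st", pi, P)          # state-to-state transitions under pi
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)

v_sum = policy_value(r1 + r2)
v_parts = policy_value(r1) + policy_value(r2)
print(np.allclose(v_sum, v_parts))  # True: the decomposition is exact for a fixed policy
```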
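The third contribution poses task decomposition as a constrained optimization that trades decomposition optimality against an information-theoretic efficiency cost tied to sample complexity under an optimal prior policy. The concrete objective is not given in this abstract; the sketch below only shows the generic shape of such a constrained, gradient-optimizable objective, using a KL divergence to a prior as a stand-in efficiency cost and dual ascent on the multiplier. Every name and number here is hypothetical.

```python
# Schematic only: a generic Lagrangian tradeoff between a return term and an
# information cost, optimized by gradient descent on the policy and dual
# ascent on the multiplier. Not the thesis' objective or algorithm.
import torch

def kl_to_prior(policy_logits, prior_logits):
    """Average KL(pi || prior) over a batch of states."""
    log_pi = torch.log_softmax(policy_logits, dim=-1)
    log_prior = torch.log_softmax(prior_logits, dim=-1)
    return (log_pi.exp() * (log_pi - log_prior)).sum(-1).mean()

# Toy stand-ins for quantities an agent would estimate from rollouts.
policy_logits = torch.randn(32, 4, requires_grad=True)  # 32 states, 4 sub-policies
prior_logits = torch.zeros(32, 4)                        # uniform prior over sub-policies
lam = torch.tensor(1.0)                                   # Lagrange multiplier
budget, lam_lr, lr = 0.1, 0.05, 0.1

for _ in range(100):
    info_cost = kl_to_prior(policy_logits, prior_logits)
    expected_return = policy_logits.softmax(-1)[:, 0].mean()  # placeholder "return"
    # maximize return subject to info_cost <= budget -> minimize the Lagrangian
    loss = -expected_return + lam * (info_cost - budget)
    loss.backward()
    with torch.no_grad():
        policy_logits -= lr * policy_logits.grad              # primal descent step
        policy_logits.grad.zero_()
        lam = torch.clamp(lam + lam_lr * (info_cost - budget), min=0.0)  # dual ascent

print(float(kl_to_prior(policy_logits, prior_logits)))  # the KL cost is pushed toward the budget
```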