State and Action Abstraction in Reinforcement Learning

Author: 郭尚岐 (Guo Shangqi)
  • Student ID
    2015******
  • Degree
    Doctoral
  • Email
    198******com
  • Defense Date
    2021.05.19
  • Advisor
    陈峰 (Chen Feng)
  • Discipline
    Control Science and Engineering
  • Pages
    141
  • Confidentiality Level
    Public
  • Department
    025 Department of Automation
  • Keywords
    Reinforcement Learning, State Abstraction, Temporally Abstract Action, State-Action Joint Abstraction

Abstract

In recent years, reinforcement learning (RL) has made breakthrough progress in fields such as games, autonomous driving, robot control, and drug design. However, current RL algorithms still suffer from high sample complexity, unstable training, low learning efficiency, insufficient exploration, poor robustness, and high training power consumption. One cause of these difficulties is the large scale of RL tasks, i.e., large state spaces and long decision horizons. An effective way to reduce this scale is to abstract a large-scale, fine-grained decision process into a small-scale, coarse-grained one, thereby compressing the RL task and easing policy optimization. However, current abstraction research in RL faces several key problems: the temporally abstract action (TAA) space is very large, learned state representations lack robustness, and state aggregation abstraction is hard to combine with action abstraction. To this end, this thesis systematically studies state and action abstraction in RL from the perspectives of action abstraction, state abstraction, and state-action joint abstraction. The main contributions are as follows.

(1) To address the large TAA space, this thesis proves that the global subgoal space can be restricted to a local subgoal adjacency region without losing optimality of the policy over subgoals. The adjacency constraint therefore not only shrinks the subgoal space but also preserves the optimality of the global subgoal space, which substantially improves the efficiency of optimizing the policy over subgoals. To implement the adjacency constraint efficiently, the thesis introduces an adjacency network that learns the adjacency distance between states. Experimental results show that the proposed method outperforms existing subgoal-based hierarchical RL algorithms.

(2) To address the poor robustness of state representations, this thesis proposes a spiking-based variational expectation-maximization policy gradient (SVEM-PG) algorithm, which implements policy learning in a multi-layer spiking neural network (SNN) and exploits the characteristics of SNNs to improve the robustness of state representation abstraction. SVEM-PG establishes the relationship between spike-firing equations and variational (mean-field) inference, as well as the relationship between reward-modulated synaptic plasticity and the variational policy gradient. Experimental results show that SVEM-PG remains robust under various kinds of noise interference.

(3) To address the difficulty of combining state aggregation abstraction with action abstraction, this thesis develops a state-action joint abstraction theory based on approximately optimality-preserving metrics, under which state aggregation abstraction approximately preserves the optimality of the policy over TAAs, thereby realizing state-action joint abstraction. To make the joint abstraction computable, the thesis proposes a metric-learning network that estimates the approximately optimality-preserving metric. Building on this, the thesis presents an RL algorithm based on state-action joint abstraction, which successfully solves a complex decision-making task in Doom, a first-person 3D video game.
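To make the adjacency-constraint idea in contribution (1) concrete, the following is a minimal PyTorch sketch, not the thesis implementation: an adjacency network embeds states so that embedding distance roughly tracks few-step reachability, trained with a contrastive loss on adjacent versus non-adjacent state pairs, and a penalty term keeps high-level subgoals inside the local adjacency region. All module and function names, thresholds, and the training data are illustrative assumptions.

```python
# Hypothetical sketch only, not the thesis code: an "adjacency network" that embeds
# states so that Euclidean distance approximates few-step reachability, plus a
# penalty keeping proposed subgoals inside the local adjacency region of a state.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdjacencyNet(nn.Module):
    """Maps a state to an embedding; small embedding distance ~ few-step reachability."""
    def __init__(self, state_dim: int, embed_dim: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 128), nn.ReLU(),
            nn.Linear(128, embed_dim),
        )

    def forward(self, s: torch.Tensor) -> torch.Tensor:
        return self.net(s)

def adjacency_loss(phi, s, s_adj, s_far, margin=1.0, eps=0.1):
    """Contrastive loss: pull k-step-adjacent pairs together, push non-adjacent pairs apart."""
    d_pos = (phi(s) - phi(s_adj)).norm(dim=-1)   # pairs labeled as within k steps
    d_neg = (phi(s) - phi(s_far)).norm(dim=-1)   # pairs labeled as farther than k steps
    return (F.relu(d_pos - eps) + F.relu(margin - d_neg)).mean()

def subgoal_penalty(phi, s, g, radius=1.0):
    """Penalty added to the high-level objective when a subgoal g leaves the adjacency region of s."""
    d = (phi(s) - phi(g)).norm(dim=-1)
    return F.relu(d - radius).pow(2).mean()

if __name__ == "__main__":
    phi = AdjacencyNet(state_dim=4)
    opt = torch.optim.Adam(phi.parameters(), lr=1e-3)
    # Toy batch of (state, adjacent state, far state) triples.
    s, s_adj, s_far = torch.randn(64, 4), torch.randn(64, 4), torch.randn(64, 4)
    loss = adjacency_loss(phi, s, s_adj, s_far)
    opt.zero_grad(); loss.backward(); opt.step()
    print("adjacency loss:", loss.item())
    print("subgoal penalty:", subgoal_penalty(phi, s, torch.randn(64, 4)).item())
```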
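Contribution (2) links reward-modulated synaptic plasticity to a variational policy gradient. The toy NumPy sketch below is not SVEM-PG; it only illustrates the well-known special case that, for a single layer of stochastic Bernoulli ("spiking") neurons, a reward-modulated Hebbian update coincides with the REINFORCE policy gradient. The task, reward, and all names are assumptions made for illustration; the thesis's algorithm operates on multi-layer SNNs with variational inference.

```python
# Illustrative toy only, not SVEM-PG: one layer of stochastic spiking (Bernoulli)
# neurons whose weight update is a reward-modulated Hebbian rule, which equals the
# REINFORCE policy gradient for Bernoulli-logistic units.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def episode_update(W, states, lr=0.05):
    """Accumulate (spike - p) * input eligibility over an episode, then modulate by reward."""
    eligibility = np.zeros_like(W)
    reward = 0.0
    for x in states:
        p = sigmoid(W @ x)                          # firing probability of each output neuron
        spikes = (rng.random(p.shape) < p) * 1.0    # stochastic spikes ~ Bernoulli(p)
        eligibility += np.outer(spikes - p, x)      # gradient of log Pr(spikes | x) w.r.t. W
        reward += float(spikes[0] == (x.sum() > 0)) # toy reward: neuron 0 should detect x.sum() > 0
    return W + lr * reward * eligibility            # reward-modulated plasticity = policy gradient step

W = rng.normal(scale=0.1, size=(2, 4))              # 2 spiking output neurons, 4 inputs
for _ in range(200):
    W = episode_update(W, rng.normal(size=(10, 4)))
print("trained weights:\n", W)
```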
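For contribution (3), the sketch below shows one simple way state aggregation under a metric can be carried out: states whose distance to a cluster representative is below a threshold ε are merged into one abstract state, over which a policy over TAAs would then be defined. The metric here is a plain Euclidean placeholder, not the thesis's approximately optimality-preserving metric, and all names are illustrative assumptions.

```python
# Hypothetical sketch, not the thesis algorithm: greedy epsilon-aggregation of states
# under a given metric. States within epsilon of a cluster representative share one
# abstract state, so a policy over temporally abstract actions is defined per cluster.
import numpy as np

def epsilon_aggregate(states: np.ndarray, metric, eps: float):
    """Assign each state to the first representative within eps, else open a new cluster."""
    reps, assignment = [], []
    for s in states:
        for i, r in enumerate(reps):
            if metric(s, r) <= eps:
                assignment.append(i)
                break
        else:
            reps.append(s)
            assignment.append(len(reps) - 1)
    return np.array(reps), np.array(assignment)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    # Placeholder metric: Euclidean distance in a (pretend) learned embedding space.
    metric = lambda a, b: float(np.linalg.norm(a - b))
    states = rng.normal(size=(200, 8))
    reps, assign = epsilon_aggregate(states, metric, eps=2.5)
    print(f"{len(states)} states aggregated into {len(reps)} abstract states")
```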