In recent years, with the continuous development of artificial intelligence, deep reinforcement learning has achieved impressive results in a wide range of automated decision-making and control tasks. Reinforcement learning algorithms require no human assistance: by interacting with the environment and receiving feedback, they can obtain efficient decision-making policies through self-exploration and learning. Although reinforcement learning can learn such policies with a simple and generic paradigm, applying it in real scenarios requires a large amount of data interaction with the environment and with physical devices, so samples are expensive; sample efficiency has therefore become one of the key bottlenecks for deploying reinforcement learning in the real world. In addition, reinforcement learning algorithms usually take the average return as their optimization objective. Although this yields the maximum overall return in expectation, it may incur large losses in individual extreme cases, which is the main source of the instability and unsafety of the learned policies. This lack of safety is also why reinforcement learning is still not accepted in safety-critical applications such as autonomous driving.

Existing work has begun to address these two key bottlenecks of sample efficiency and safety. The prioritized experience replay (PER) algorithm improves sample efficiency by changing the distribution of replayed samples and learning preferentially from a subset of them. Model-based reinforcement learning algorithms speed up learning by modeling the environment and generating simulated data to assist training. However, while improving sample efficiency, these methods further aggravate the instability of the learned policy, making it difficult to trade off sample efficiency against policy stability. To address these problems, this thesis studies the sample efficiency and safety of reinforcement learning algorithms, with the following main innovations and contributions:

1. The Double-layer Prioritized Experience Replay (DPER) algorithm is proposed. By designing a double-layer replay buffer in which prioritized experience replay is performed at different rates, DPER achieves a larger replay coverage of the sample space under the same sample consumption, effectively improving the sample efficiency of reinforcement learning algorithms. A minimal illustrative sketch of this replay scheme is given after the contribution list.

2. The Model-Based Dropout Planning (MBDP) algorithm is proposed. By integrating a model-dropout module and a rollout-dropout module in an adversarial manner, MBDP provides a framework that can dynamically trade off sample efficiency and robustness, and effectively improves the safety of the learned policy while preserving sample efficiency. An illustrative sketch of the two dropout modules also follows the list.

3. Based on the proposed algorithms, an automated control policy learning framework is designed for a real greenhouse environment. The sample efficiency and safety of this framework are thoroughly validated under demanding conditions, which is of practical significance for applying reinforcement learning to real, complex scenarios.
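As a concrete illustration of the double-layer replay idea in contribution 1, the following is a minimal Python sketch: two prioritized buffers of different capacities are maintained, and each training batch mixes samples drawn from them at different rates. All class names, parameters, and the particular fast/slow mixing rule are assumptions made for illustration only; the abstract does not specify the actual DPER implementation.

    import random
    from collections import deque

    class PrioritizedBuffer:
        # Minimal proportional prioritized replay buffer: transitions are sampled
        # with probability proportional to their stored priority.
        def __init__(self, capacity):
            self.buffer = deque(maxlen=capacity)  # holds (priority, transition) pairs

        def add(self, transition, priority=1.0):
            self.buffer.append((priority, transition))

        def sample(self, batch_size):
            if not self.buffer:
                return []
            weights = [p for p, _ in self.buffer]
            items = [t for _, t in self.buffer]
            return random.choices(items, weights=weights, k=batch_size)

    class DoubleLayerReplay:
        # Hypothetical double-layer scheme: a small "fast" layer holding recent
        # experience and a large "slow" layer holding long-term experience are
        # replayed at different rates within each batch.
        def __init__(self, fast_capacity=10_000, slow_capacity=100_000, fast_ratio=0.5):
            self.fast = PrioritizedBuffer(fast_capacity)
            self.slow = PrioritizedBuffer(slow_capacity)
            self.fast_ratio = fast_ratio  # fraction of each batch drawn from the fast layer

        def add(self, transition, priority=1.0):
            # New transitions enter both layers; the smaller fast layer evicts sooner,
            # so the two layers cover the sample space at different time scales.
            self.fast.add(transition, priority)
            self.slow.add(transition, priority)

        def sample(self, batch_size):
            n_fast = int(batch_size * self.fast_ratio)
            return self.fast.sample(n_fast) + self.slow.sample(batch_size - n_fast)

In this sketch a learner would call add() after each environment step and sample() before each gradient update, exactly as with a single prioritized buffer.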
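Contribution 2 combines a model-dropout module and a rollout-dropout module; the sketch below shows one plausible way such a pipeline could be wired together. The selection rules used here (dropping the least accurate ensemble members, keeping only the lowest-return rollouts to emphasize worst cases) and all function names are assumptions for illustration, not the thesis's actual MBDP design, and the adversarial integration of the two modules is not captured here.

    import numpy as np

    def model_dropout(ensemble, validation_errors, keep_fraction=0.8):
        # Assumed rule: keep only the most accurate dynamics models of the ensemble.
        n_keep = max(1, int(len(ensemble) * keep_fraction))
        order = np.argsort(validation_errors)  # ascending validation error
        return [ensemble[i] for i in order[:n_keep]]

    def rollout_dropout(rollouts, returns, keep_fraction=0.8):
        # Assumed rule: keep only the lowest-return simulated rollouts, so the
        # policy is trained against the harder (worst-case) imagined trajectories.
        n_keep = max(1, int(len(rollouts) * keep_fraction))
        order = np.argsort(returns)  # ascending return
        return [rollouts[i] for i in order[:n_keep]]

    def generate_training_data(ensemble, validation_errors, simulate, policy,
                               n_rollouts=100, keep_models=0.8, keep_rollouts=0.8):
        # Hypothetical pipeline: apply model dropout, roll out the policy in the
        # surviving models, then apply rollout dropout to the simulated data.
        models = model_dropout(ensemble, validation_errors, keep_models)
        rollouts, returns = [], []
        for _ in range(n_rollouts):
            model = models[np.random.randint(len(models))]  # sample one surviving model
            trajectory, total_return = simulate(model, policy)  # caller-supplied rollout
            rollouts.append(trajectory)
            returns.append(total_return)
        return rollout_dropout(rollouts, np.asarray(returns), keep_rollouts)

In such a sketch, loosening or tightening keep_models and keep_rollouts would correspond to the dynamic trade-off between sample efficiency (more simulated data) and robustness (stricter filtering) that the framework is designed to expose.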