深度强化学习是人工智能领域的关键技术分支之一，为各种复杂无人系统的智能决策与控制问题提供了极具前景的解决思路。其中，水下机器人是一类用于海洋探索的重要无人系统，其机理模型通常具有高度非线性、强耦合和时变等特点。本文面向水下机器人的复杂应用场景，研究基于无模型深度强化学习的自适应跟踪控制方法，并针对其在应用过程中面临的样本效率较低和需要与环境交互的问题，研究样本效率提升方法和离线强化学习方法。本文的主要研究成果如下：

第一，针对水下机器人的轨迹跟踪问题和目标跟随问题，本文提出了一种基于多重Q学习的确定性策略梯度算法。结合两种控制问题各自的特点，首先分别建立了完全可观的马尔可夫决策过程模型。在此基础上，本文提出了一种多重Q学习方法，既可以高效求解具有连续动作空间的任务，又能有效缓解动作值函数的过估计问题。此外，本文提出使用多个策略网络的均值策略作为水下机器人的最终控制器，有效保证了策略学习的稳定性。基于真实水下机器人模型的实验验证了所提出算法的各项性能。

第二，针对深度强化学习算法应用于水下机器人无人系统时面临的样本效率较低的问题，本文提出了一种基于正则化安德森加速的样本效率提升方法。考虑基于函数近似的一般情形，本文提出了一种正则化安德森加速方法，通过引入正则项可以有效控制各类扰动对迭代过程造成的影响。之后，本文将正则化安德森加速方法与已有深度强化学习算法相结合，并进一步提出了渐进更新和自适应重启等策略用于改善算法的学习性能。基于基准环境和水下机器人环境的实验证明了所提出方法在提升样本效率方面的有效性。

第三，针对在线深度强化学习算法应用于水下机器人无人系统时需要与环境交互的问题，本文提出了一种基于自适应策略约束的离线深度强化学习方法。该方法可以从历史收集的非最优离线数据中进行策略学习，而无需与环境进行交互。考虑到已有策略约束方法的性能受限于离线数据质量的缺点，本文提出了一种自适应策略约束方法，通过对学习策略施加自适应约束可以实现对离线数据分布外动作空间的有效安全探索。之后，考虑基于函数近似的一般情形，本文提出了基于自适应策略约束的行动者-评论家算法。基于基准环境和水下机器人环境的实验表明，所提出算法可以从非最优离线数据中学习到具有较好性能的策略。
Deep reinforcement learning (RL) is one of the key technical branches of artificial intelligence and offers a promising approach to intelligent decision-making and control for complex unmanned systems. Among such systems, the underwater vehicle is an important unmanned platform for ocean exploration, and its dynamic model is typically highly nonlinear, strongly coupled, and time-varying. To address the adaptive tracking control problem of underwater vehicles in complex application scenarios, this dissertation studies model-free deep RL methods and investigates solutions to two issues that arise in their application: low sample efficiency and the need for online interaction with the environment. The main contributions of this dissertation are summarized as follows.

First, a deterministic policy gradient algorithm based on multiple Q-learning is proposed for the trajectory tracking and target following problems of underwater vehicles. Taking into account the respective characteristics of the two control problems, both are first formulated as fully observable Markov decision processes. On this basis, a multiple Q-learning method is proposed, which not only solves tasks with continuous action spaces efficiently but also effectively alleviates the overestimation of the action-value function. In addition, the mean policy of multiple actor networks is used as the final controller of the underwater vehicle, which effectively stabilizes policy learning. Experiments based on the model of a real AUV verify the performance of the proposed algorithm.

Second, to improve the sample efficiency of online deep RL algorithms when they are applied to underwater vehicle control, this dissertation proposes a sample-efficiency improvement method based on regularized Anderson acceleration. Considering the general setting in which the action value is approximated by a learned function approximator, a regularized Anderson acceleration method is proposed whose regularization term effectively controls the impact of various perturbations on the iterative process. The method is then combined with an existing off-policy deep RL algorithm, and two additional strategies, progressive update and adaptive restart, are proposed to further improve learning performance. Experiments on widely used benchmark environments and an underwater vehicle environment demonstrate the effectiveness of the proposed method in improving sample efficiency.
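To make the idea behind the second contribution concrete, the following is a minimal sketch of one regularized Anderson acceleration step, assuming a NumPy implementation in which the mixing coefficients are obtained from a Tikhonov-regularized, equality-constrained least-squares problem; the function name, the default regularization weight, and the window of retained iterates are illustrative assumptions rather than the dissertation's exact formulation.

```python
import numpy as np

def regularized_anderson_step(iterates, residuals, lam=1e-4):
    """One regularized Anderson acceleration step (illustrative sketch).

    iterates  : list of the last m fixed-point iterates x_i (1-D arrays)
    residuals : list of the corresponding residuals r_i = g(x_i) - x_i
    lam       : Tikhonov regularization weight; a larger value damps the
                effect of noisy residuals on the mixing coefficients.
    """
    X = np.stack(iterates, axis=1)        # (dim, m) matrix of iterates
    R = np.stack(residuals, axis=1)       # (dim, m) matrix of residuals
    m = R.shape[1]
    # Solve  min_alpha ||R @ alpha||^2 + lam * ||alpha||^2  s.t.  sum(alpha) = 1.
    # The solution is proportional to (R^T R + lam * I)^{-1} 1, rescaled to sum to one.
    G = R.T @ R + lam * np.eye(m)
    alpha = np.linalg.solve(G, np.ones(m))
    alpha /= alpha.sum()
    # Accelerated iterate: the alpha-weighted combination of the mapped iterates g(x_i).
    return (X + R) @ alpha                # equals sum_i alpha_i * g(x_i)
```

In an off-policy deep RL setting, the iterates would correspond to successive estimates of the action-value parameters or targets, and the regularization term keeps the extrapolation stable when the residuals are corrupted by sampling noise and approximation error, which is the disturbance-control effect described above.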
Third, to remove the need for online interaction when deep RL algorithms are applied to the underwater vehicle environment, this dissertation proposes an offline deep RL method based on an adaptive policy constraint. The proposed method learns policies from previously collected, non-optimal offline data without any interaction with the environment. Although several policy constraint methods have been proposed, their performance is generally limited by the quality of the offline data. To address this limitation, an adaptive policy constraint method is proposed that enables effective and safe exploration of out-of-distribution actions by imposing an adaptive constraint on the learned policy. Then, again considering the general setting in which the action value is approximated by a learned function approximator, an actor-critic algorithm based on the adaptive policy constraint is developed. Experiments on widely used benchmark environments and an underwater vehicle environment demonstrate that the proposed algorithm can learn well-performing policies from non-optimal offline data.
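As a similarly hedged illustration of the third contribution, the following PyTorch-style sketch shows one way an actor update can trade off value maximization against an adaptive constraint toward the offline data; the names log_beta and eps, the squared-error divergence, and the dual update of the constraint weight are assumptions made for illustration, not the dissertation's exact algorithm.

```python
import torch

def constrained_actor_losses(critic, actor, batch, log_beta, eps=0.2):
    """Sketch of a policy-constrained actor update with an adaptive weight.

    critic(s, a) returns Q estimates and actor(s) returns actions; `batch`
    holds offline transitions.  `log_beta` is a learnable scalar (the
    constraint weight in log space) and `eps` an assumed divergence budget.
    """
    s, a_data = batch["obs"], batch["act"]
    a_pi = actor(s)
    # Divergence between the learned policy and the offline behaviour data.
    divergence = ((a_pi - a_data) ** 2).sum(dim=-1).mean()
    beta = log_beta.exp().detach()
    # Maximize the estimated value while staying close to the data support.
    policy_loss = -critic(s, a_pi).mean() + beta * divergence
    # Dual update: increase beta when the divergence exceeds the budget and
    # decrease it otherwise, so the constraint strength adapts to the data.
    beta_loss = -log_beta * (divergence.detach() - eps)
    return policy_loss, beta_loss
```

Minimizing policy_loss and beta_loss with separate optimizers loosens the constraint when the learned policy stays near the data and tightens it when it drifts too far, which loosely mirrors the intent described above of not tying the constraint to a fixed assumption about offline data quality.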