Research on Autonomous UAV Navigation-Oriented Deep Reinforcement Learning Algorithms

Author: Wang Chao
  • Student ID
    2015******
  • Degree
    Ph.D.
  • Email
    w-c******.cn
  • Defense date
    2020.05.18
  • Supervisor
    Zhang Xudong
  • Discipline
    Information and Communication Engineering
  • Pages
    128
  • Confidentiality level
    Public
  • Department
    023 Department of Electronic Engineering
  • Keywords
    autonomous unmanned aerial vehicle navigation, deep reinforcement learning, partially observable Markov decision process, sparse reward, adversarial attack and defense

Abstract

Over the past few years, small unmanned aerial vehicles (UAVs) have been deployed on a large scale in civil and national-defense applications. Enabling UAVs to navigate autonomously in large-scale complex environments (e.g., cities crowded with skyscrapers) is a key technology for intelligent UAVs. Deep reinforcement learning (DRL) studies how an intelligent agent should act on its perception of the environment so as to maximize its long-term return, and it has become an important approach to intelligent control. This thesis applies DRL to autonomous UAV navigation in large-scale complex environments, studying how to model the navigation problem and how to solve it in different application scenarios. The main contributions are as follows.

(1) A UAV's ability to sense its surroundings is limited, and decisions made from incomplete information lead to sub-optimal results. This thesis therefore models UAV navigation in large-scale complex environments as a partially observable Markov decision process (POMDP): historical observations are used to complete the current observation of the environment, so the UAV can base its decisions on more comprehensive information. Furthermore, the thesis proposes a more efficient DRL algorithm for solving POMDPs, one that learns better control policies from fewer samples.

(2) UAV navigation is a goal-driven task and is therefore more naturally modeled with sparse rewards. With sparse rewards, however, the UAV receives no environment feedback at most time steps and cannot optimize its policy effectively. This thesis proposes to use a non-expert guide policy, one that can complete the task on its own to some extent, to help the UAV explore the environment efficiently and optimize its policy. With the guide policy, the UAV can learn a high-performance control policy directly in the sparse-reward environment, unconstrained by how well or poorly a dense reward could be designed.

(3) A direct extension of single-UAV navigation is the flocking navigation of multiple UAVs. This thesis solves this multi-agent decision-making problem by combining flocking control with DRL. Following a specifically designed interaction protocol, each UAV observes part of the states of a limited number of neighboring UAVs, and the influence of UAVs farther away is ignored. The interaction protocol simplifies each UAV's observation of the flocking system and, at the same time, allows a policy trained on a flock with few UAVs to be applied directly to flocks with more UAVs, significantly reducing the complexity of training control policies for larger flocks.

(4) Deploying a UAV navigation system in real environments requires the control policy to be safe, yet policies trained with DRL are vulnerable to adversarial perturbations. Within the robust optimization framework, this thesis designs adversarial attack and defense methods for deep policies: the attack optimizes adversarial perturbations that drive the policy to its worst-case performance, while the defense improves the policy's performance in that worst case. Experiments show that the proposed attack yields stronger adversarial perturbations, and the proposed defense produces policies that are robust to stronger perturbations.

Illustrative sketches of these four formulations are given below.
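For contribution (1), the POMDP formulation can be made concrete with the standard tuple definition; the notation below is the textbook formulation, not notation quoted from the thesis:

\[
\mathcal{M} \;=\; \langle \mathcal{S}, \mathcal{A}, T, R, \Omega, O, \gamma \rangle
\]

At each step the UAV, in hidden state $s_t \in \mathcal{S}$, takes action $a_t \in \mathcal{A}$, moves to $s_{t+1} \sim T(\cdot \mid s_t, a_t)$, receives reward $r_t = R(s_t, a_t)$, and perceives only an observation $o_{t+1} \sim O(\cdot \mid s_{t+1})$ drawn from $\Omega$. Since $o_t$ alone does not determine $s_t$, the policy is conditioned on the observation history $h_t = (o_1, a_1, \ldots, o_t)$ (in practice, the hidden state of a recurrent network), and the objective is $\max_{\pi} \mathbb{E}\big[\sum_{t} \gamma^{t} r_t\big]$. This is precisely the sense in which historical observations "complete" the current observation.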
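For contribution (2), here is a minimal sketch of the sparse-reward scheme and guide-assisted exploration, assuming a Gym-style env.step() interface; the names sparse_reward, collect_episode, guide_policy, and the mixing rate beta are illustrative assumptions, not the thesis's actual implementation:

    import random

    def sparse_reward(reached_goal: bool, crashed: bool) -> float:
        # Feedback only at terminal events; zero everywhere else.
        # (Illustrative values; the thesis's exact reward is not given here.)
        if reached_goal:
            return 1.0
        if crashed:
            return -1.0
        return 0.0

    def collect_episode(env, learner_policy, guide_policy, beta=0.3):
        # Roll out one episode, occasionally following a non-expert guide
        # policy that can partially complete the task on its own, so that
        # the rare goal reward is actually encountered during exploration.
        trajectory, obs, done = [], env.reset(), False
        while not done:
            # With probability beta act with the guide, otherwise the learner.
            policy = guide_policy if random.random() < beta else learner_policy
            action = policy(obs)
            obs_next, reward, done, info = env.step(action)
            trajectory.append((obs, action, reward, obs_next, done))
            obs = obs_next
        return trajectory

Transitions gathered this way can be fed to any off-policy DRL learner; annealing beta toward zero over training would keep the final policy from being limited by the guide's non-expert performance.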
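For contribution (3), one common way to realize a fixed-size interaction protocol is to observe only the k nearest neighbors; the sketch below assumes that nearest-neighbor protocol (the thesis's exact protocol is not specified here), which keeps the observation dimension constant as the flock grows:

    import numpy as np

    def local_observation(positions, velocities, i, k=3):
        # Observation of UAV i under a k-nearest-neighbor interaction
        # protocol: relative positions and velocities of its k closest
        # flock-mates. UAVs farther away are ignored, so the observation
        # size is fixed no matter how many UAVs the flock contains.
        rel_pos = positions - positions[i]       # (n, dim) relative positions
        dist = np.linalg.norm(rel_pos, axis=1)
        dist[i] = np.inf                         # exclude UAV i itself
        neighbors = np.argsort(dist)[:k]         # indices of the k nearest
        rel_vel = velocities[neighbors] - velocities[i]
        return np.concatenate([rel_pos[neighbors].ravel(), rel_vel.ravel()])

Because the input depends only on the k neighbors, a policy trained in a flock of, say, 5 UAVs accepts exactly the same observation vector when deployed in a flock of 50, which is what allows direct transfer to larger flocks.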
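For contribution (4), the robust optimization view is naturally written as a min-max problem over observation perturbations; the formulation below is the generic one, with $\delta_t$ the bounded perturbation and $\epsilon$ the attack budget:

\[
\max_{\theta} \; \min_{\{\delta_t : \|\delta_t\| \le \epsilon\}} \; \mathbb{E}\!\left[ \sum_{t} \gamma^{t} R\big(s_t, \pi_{\theta}(o_t + \delta_t)\big) \right]
\]

The inner minimization is the attack: it searches, within the $\epsilon$-ball, for the perturbations under which the policy $\pi_{\theta}$ performs worst. The outer maximization is the defense: it trains the parameters $\theta$ so that performance in that worst case is as high as possible.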