Quadrotors offer high maneuverability and fly without a human operator on board, which greatly reduces the risk of executing dangerous tasks. Flying in complex, dynamic, and unknown environments, however, demands stronger perception, planning, and control, and thus raises the bar for autonomous flight capability. The central difficulty of autonomous navigation is the lack of prior map information: without a map, conventional path planners cannot produce a collision-free trajectory, and the quadrotor must rely on partial observations from its onboard sensors. This thesis formulates autonomous navigation as a motion planning problem and trains the agent's policy with deep reinforcement learning (DRL). To train and evaluate the algorithms, a quadrotor simulation and training platform is built, and several training and testing environments are constructed on top of it. The main contributions are summarized as follows:

(1) To address slow convergence and poor early training performance, a curriculum-based reward function and a penalty-based reward function are proposed. The curriculum-based reward decomposes the task into guided sub-tasks, so the agent needs far less random exploration in the early stage and learns a correct policy by progressing from simple to difficult tasks. The penalty-based reward converges faster than the baseline and lets the agent reach the goal in fewer iterations (a sketch of such a reward appears below).

(2) To improve obstacle avoidance from depth images, a memory layer and transfer learning are introduced. Adding a memory (recurrent) layer to the network compensates for insufficient instantaneous observations and yields a more accurate estimate of the Q-function (see the network sketch below). Transfer learning that combines models pre-trained on lidar and on depth images mitigates the camera's limited field of view (FOV), under which some obstacles go undetected, and accumulates expected reward stably from the start of training.

(3) A quadrotor DRL development platform conforming to the OpenAI Gym interface is built on the AirSim simulator (see the environment sketch below). Its modular design supports different types of flight tasks as well as the development and testing of different DRL algorithms. Connected to a ground control station, the trained network is embedded as an auxiliary module of the navigation system, and a hardware-in-the-loop (HITL) test demonstrates the feasibility of real-time DRL-based motion control in unknown environments.
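As an illustration of the penalty-based design in contribution (1), the following is a minimal sketch of a shaped reward that combines progress toward the goal with obstacle-proximity and per-step penalties. The term names and weights (w_progress, w_obstacle, w_step, d_safe) are illustrative assumptions, not the exact formulation used in the thesis.

```python
import numpy as np

def penalty_based_reward(pos, goal, prev_pos, min_obstacle_dist,
                         d_safe=2.0, w_progress=1.0, w_obstacle=0.5, w_step=0.01):
    """Illustrative penalty-shaped reward: reward progress toward the goal,
    penalize flying close to obstacles, and penalize every elapsed step."""
    # Positive when the quadrotor moved closer to the goal this step.
    progress = np.linalg.norm(prev_pos - goal) - np.linalg.norm(pos - goal)
    # Grows linearly once the nearest obstacle is inside the safety margin.
    obstacle_penalty = max(0.0, d_safe - min_obstacle_dist)
    return w_progress * progress - w_obstacle * obstacle_penalty - w_step
```

The constant per-step penalty is what pushes the agent toward shorter trajectories, which is consistent with the claim that the goal is reached in fewer iterations.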
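For the memory-layer idea in contribution (2), the sketch below shows one plausible shape of a recurrent Q-network in PyTorch: a convolutional encoder over a sequence of depth frames followed by an LSTM, so that past observations inform the current Q-value estimate. The layer sizes, action count, and encoder architecture are placeholders, not the thesis's actual network.

```python
import torch
import torch.nn as nn

class RecurrentQNetwork(nn.Module):
    """Illustrative Q-network with a memory (LSTM) layer that aggregates past
    depth observations to compensate for partial observability."""

    def __init__(self, feat_dim=256, hidden_dim=128, n_actions=9):
        super().__init__()
        self.encoder = nn.Sequential(  # depth frame -> feature vector
            nn.Conv2d(1, 16, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)), nn.Flatten(),
            nn.Linear(32 * 16, feat_dim), nn.ReLU(),
        )
        self.memory = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, n_actions)

    def forward(self, depth_seq, hidden=None):
        # depth_seq: (batch, time, 1, H, W) sequence of depth frames
        b, t = depth_seq.shape[:2]
        feats = self.encoder(depth_seq.flatten(0, 1)).view(b, t, -1)
        out, hidden = self.memory(feats, hidden)
        # Q-values are read from the last time step of the sequence.
        return self.head(out[:, -1]), hidden
```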
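For the development platform in contribution (3), the sketch below illustrates how an AirSim-backed environment might expose the OpenAI Gym reset/step interface. The velocity action space, the six-dimensional state observation, the goal handling, and the placeholder reward are simplifying assumptions for illustration, not the platform's actual design.

```python
import gym
import numpy as np
import airsim  # AirSim Python client

class QuadrotorNavEnv(gym.Env):
    """Hypothetical Gym-style wrapper around AirSim, sketching how the
    platform's reset/step interface could be organized."""

    def __init__(self, goal=(40.0, 0.0, -5.0)):
        self.client = airsim.MultirotorClient()
        self.client.confirmConnection()
        self.goal = np.asarray(goal, dtype=np.float32)
        # Velocity commands (vx, vy, vz); the observation layout is task-specific.
        self.action_space = gym.spaces.Box(low=-5.0, high=5.0, shape=(3,), dtype=np.float32)
        self.observation_space = gym.spaces.Box(low=-np.inf, high=np.inf, shape=(6,), dtype=np.float32)

    def _observe(self):
        k = self.client.getMultirotorState().kinematics_estimated
        p, v = k.position, k.linear_velocity
        return np.array([p.x_val, p.y_val, p.z_val,
                         v.x_val, v.y_val, v.z_val], dtype=np.float32)

    def reset(self):
        self.client.reset()
        self.client.enableApiControl(True)
        self.client.armDisarm(True)
        self.client.takeoffAsync().join()
        return self._observe()

    def step(self, action):
        vx, vy, vz = map(float, action)
        self.client.moveByVelocityAsync(vx, vy, vz, duration=0.5).join()
        obs = self._observe()
        dist = np.linalg.norm(obs[:3] - self.goal)
        done = dist < 1.0 or self.client.simGetCollisionInfo().has_collided
        reward = -dist  # placeholder; the thesis's reward designs plug in here
        return obs, reward, done, {}
```

Because the wrapper only touches the generic Gym surface (spaces, reset, step), any DRL library that targets Gym can train against it, which is the modularity the abstract describes.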