Deep reinforcement learning (DRL) combines the decision-making ability of reinforcement learning with the representational power of deep learning, enabling it to tackle complex decision-making and control tasks. DRL has already demonstrated strong potential in many domains, such as video games, autonomous driving, robot control, and natural language processing. However, DRL still faces many challenges, among which overfitting is one of the most important. Although overfitting is a central topic in machine learning research, it has received insufficient attention in reinforcement learning. Overfitting degrades the performance of DRL algorithms, specifically by reducing sample efficiency and limiting generalization ability. This work analyzes the overfitting problem in DRL from a module-level perspective and proposes several methods to mitigate it:

[1] Focusing on overfitting in the encoder, this work adapts existing deep-learning remedies for overfitting to DRL. It designs a general-purpose module that combines the CrossNorm and SelfNorm normalization techniques. The module is plug-and-play and effectively mitigates overfitting without prior knowledge or additional data, thereby significantly improving the zero-shot visual generalization ability of DRL algorithms.

[2] Focusing on overfitting in the policy and value networks, this work investigates priority bias and plasticity loss in DRL. Experimental results reveal that different modules and training phases of a DRL agent exhibit distinct characteristics: the value network is affected by overfitting most severely, so preserving its plasticity is crucial; and the early stage of training is the critical phase, because plasticity lost early cannot be recovered.

[3] Building on these findings, this work proposes a forgetting method based on editing neural-network model parameters, which effectively alleviates overfitting during DRL training and significantly improves the sample efficiency of DRL algorithms.
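The CrossNorm and SelfNorm techniques mentioned in point [1] both operate on per-channel feature statistics. A minimal pure-Python sketch of the core idea, assuming 1-D channel activations (in practice both operate on convolutional feature maps, and SelfNorm's recalibration functions `f` and `g` are small learned networks; the fixed lambdas below are hypothetical placeholders):

```python
import statistics

def crossnorm(x, y, eps=1e-5):
    """CrossNorm core idea: renormalize channel x with the mean/std
    of another channel y, exchanging feature statistics as a
    style-augmentation during training."""
    mx, sx = statistics.fmean(x), statistics.pstdev(x) + eps
    my, sy = statistics.fmean(y), statistics.pstdev(y) + eps
    return [(v - mx) / sx * sy + my for v in x]

def selfnorm(x, f=lambda m, s: 1.0, g=lambda m, s: 0.5, eps=1e-5):
    """SelfNorm core idea: recalibrate a channel's own mean/std with
    attention functions f (for std) and g (for mean). Here f and g
    are fixed placeholders; in the original technique they are
    small learned networks."""
    m, s = statistics.fmean(x), statistics.pstdev(x) + eps
    return [(v - m) / s * (f(m, s) * s) + g(m, s) * m for v in x]

x = [1.0, 2.0, 3.0]
y = [10.0, 20.0, 30.0]
z = crossnorm(x, y)  # x re-styled with y's statistics
```

After `crossnorm`, `z` carries approximately the mean and standard deviation of `y`, which is what makes the swap act as a data-free augmentation against encoder overfitting.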
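The forgetting-by-parameter-editing idea in point [3] can be illustrated under one plausible reading: interpolating trained weights back toward fresh initialization noise (a shrink-and-perturb style partial reset). This is a hedged sketch only; the thesis's actual editing rule may differ, and `alpha` and `init_scale` are hypothetical hyperparameters:

```python
import random

def forget(params, alpha=0.8, init_scale=0.1, rng=None):
    """Hypothetical forgetting step: shrink each trained weight toward
    zero and perturb it with fresh initialization noise, partially
    resetting the network to restore plasticity."""
    rng = rng or random.Random(0)
    return [alpha * w + (1 - alpha) * rng.gauss(0.0, init_scale)
            for w in params]

trained = [2.0, -3.0, 0.5]
edited = forget(trained)  # weights shrunk toward a fresh init
```

Applied periodically during training, such an edit discards part of what the value network has memorized, which is one way to counteract the early, unrecoverable plasticity loss described in point [2].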