Deep neural networks have achieved remarkable success in various machine learning applications, such as computer vision, natural language processing, and recommendation systems. However, one major challenge in training deep neural networks is the scarcity of labeled data. In recent years, Multi-Task Learning (MTL) has emerged as a promising technique to address this challenge. MTL improves the generalization of target tasks by leveraging the useful signals provided by related auxiliary tasks. For instance, larger-scale tasks, such as user click prediction, can be utilized as auxiliary tasks to improve the performance of smaller-scale target tasks, such as user conversion prediction in recommendation. Likewise, self-supervised tasks on unlabeled data can serve as auxiliary tasks in computer vision and natural language processing, improving the target task without requiring additional labeled data. In practice, however, learning multiple tasks simultaneously sometimes degrades performance compared to learning only the target task, a phenomenon known as negative transfer. Negative transfer may persist even in large language models. For example, RLHF, a key component of ChatGPT that aims to improve the accuracy and fluency of dialogue generation, has negative effects on nearly half of the multiple-choice question tasks when used to post-train GPT-4.

A large number of methods have been proposed to mitigate negative transfer in MTL. Notable previous studies attribute negative transfer to optimization difficulty, especially gradient conflicts between different tasks, and propose to overcome it by reducing the interference between task gradients or by balancing the gradient magnitudes of different tasks. Other works focus on selecting the most relevant auxiliary tasks and reduce negative transfer by avoiding task groups with severe conflicts. However, despite these significant efforts, the underlying causes of negative transfer are still not fully understood.

In this regard, we experimentally analyze the potential causes of negative transfer in MTL from the perspectives of optimization and generalization. From the optimization view, our experiments suggest that gradient conflicts do not necessarily lead to negative transfer: for example, weight decay, a special auxiliary task, can conflict with the target task in gradients but still be beneficial to the target performance. From the generalization view, we observe that negative transfer is more likely to occur when the distribution shift between the multi-task training data and the target test data is enlarged.
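To make the notion of gradient conflict concrete, the sketch below measures the cosine similarity between two task gradients on the shared parameters, treating weight decay as the auxiliary loss; a negative value is what is commonly called a conflict. This is a minimal illustration, not our experimental code: `model`, `target_loss_fn`, `x`, and `y` are hypothetical placeholders.

```python
import torch
import torch.nn.functional as F

def gradient_cosine(model, loss_a, loss_b):
    # Cosine similarity between the gradients of two losses w.r.t. the
    # shared parameters; a negative value indicates a gradient conflict.
    # Assumes every trainable parameter contributes to both losses.
    params = [p for p in model.parameters() if p.requires_grad]
    grads_a = torch.autograd.grad(loss_a, params, retain_graph=True)
    grads_b = torch.autograd.grad(loss_b, params, retain_graph=True)
    flat_a = torch.cat([g.reshape(-1) for g in grads_a])
    flat_b = torch.cat([g.reshape(-1) for g in grads_b])
    return F.cosine_similarity(flat_a, flat_b, dim=0)

def conflict_with_weight_decay(model, target_loss_fn, x, y, decay=1e-4):
    # Treat weight decay as a special "auxiliary task" whose loss is the
    # squared L2 norm of the parameters, and compare its gradient direction
    # with the target-task gradient (placeholder model, loss, and data).
    target_loss = target_loss_fn(model(x), y)
    wd_loss = decay * sum((p ** 2).sum() for p in model.parameters())
    return gradient_cosine(model, target_loss, wd_loss)
```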
Based on the above findings, we present a new approach named ForkMerge. Since we cannot know in advance which combination of task distributions leads to better generalization, and training a separate model for each possible distribution is prohibitively expensive, we transform the problem of combining task distributions into that of combining model hypotheses. Specifically, we fork the model into multiple branches and optimize the parameters of different branches on diverse data distributions, which are constructed by varying the task weights across branches. Then, at regular intervals, we merge and synchronize the parameters of the branches to approach the optimal model hypothesis. In this way, harmful parameter updates are filtered out to mitigate negative transfer, while desirable parameter updates are kept to promote positive transfer.
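A minimal sketch of one fork-train-merge round for the two-branch case is given below. It is meant to convey the idea rather than reproduce the exact implementation: the `train_steps` and `evaluate` callbacks, the example task weightings, and the grid of interpolation weights are illustrative assumptions.

```python
import copy

def forkmerge_round(model, task_weightings, train_steps, evaluate,
                    merge_grid=(0.0, 0.25, 0.5, 0.75, 1.0)):
    # One fork-train-merge round for the two-branch case.
    # `task_weightings` holds one task-weight vector per branch, e.g.
    # [{"target": 1.0, "aux": 0.0}, {"target": 1.0, "aux": 1.0}].
    # `train_steps(branch, weights)` trains a branch for a fixed number of
    # steps under the given task weights; `evaluate(model)` returns the
    # target-task validation metric. Both are hypothetical callbacks.
    assert len(task_weightings) == 2, "this sketch covers the two-branch case"

    # Fork: every branch starts from the same parameters.
    branches = [copy.deepcopy(model) for _ in task_weightings]

    # Optimize each branch on its own task-weighted data distribution.
    for branch, weights in zip(branches, task_weightings):
        train_steps(branch, weights)

    # Merge: search for the interpolation of branch parameters that
    # maximizes target validation performance.
    state_0, state_1 = branches[0].state_dict(), branches[1].state_dict()
    best_metric, best_state = float("-inf"), None
    for w in merge_grid:
        merged = {
            k: (w * state_0[k] + (1.0 - w) * state_1[k])
               if state_0[k].is_floating_point() else state_0[k]
            for k in state_0
        }
        model.load_state_dict(merged)
        metric = evaluate(model)
        if metric > best_metric:
            best_metric, best_state = metric, merged

    # Synchronize: the next round forks again from the merged parameters.
    model.load_state_dict(best_state)
    return best_metric
```

Searching over the merging weights on validation data is what allows updates from an ill-suited task weighting to receive a small or zero weight, while useful updates from other branches are retained.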
The contributions of this work are summarized as follows: (1) We systematically identify the problem and analyze the causes of negative transfer in MTL. (2) We propose ForkMerge, a novel approach to mitigate negative transfer and boost the performance of MTL. (3) We conduct extensive experiments and validate that ForkMerge outperforms previous methods on a series of MTL benchmarks.