Data Center Network Topology Optimization and Traffic Scheduling for Distributed Deep Learning

Author: Wang Shuai (王帅)
  • Student ID
    2017******
  • Degree
    Doctoral
  • Email
    s-w******.cn
  • Defense Date
    2022.05.19
  • Advisor
    Li Dan (李丹)
  • Discipline
    Computer Science and Technology
  • Pages
    130
  • Confidentiality
    Public
  • Department
    024 Department of Computer Science and Technology
  • Keywords
    Distributed Deep Learning, Data Center Network, Topology Optimization, Traffic Scheduling, Data Parallelism

Abstract

In recent years, deep learning has achieved great success in many application areas. As deep learning models keep growing, a single computing device falls far short of the computing power required to train them. To provide sufficient computing power, it has become common practice to train deep learning models in a distributed fashion on the massive number of servers in data centers. However, to keep the distributed training results consistent with single-machine training, the model parameters must be synchronized frequently between the nodes of the distributed training system. Many research efforts, as well as this dissertation, have found that the communication overhead of parameter synchronization has become a major factor limiting the performance of distributed deep learning training systems. By analyzing the communication behavior of parameter synchronization, this dissertation identifies three main challenges faced by distributed deep learning training: (1) parameter synchronization in large-scale distributed training takes a long time; (2) there are dependencies between model computation and parameter transmission; (3) distributed training performance is limited by the training speed of stragglers. To address these challenges, this dissertation optimizes the communication performance of data center networks for distributed deep learning training from the perspectives of network topology optimization and traffic scheduling. The main contributions are summarized as follows.

(1) We propose HiPS, a hierarchical parameter synchronization algorithm, and study how different combinations of parameter synchronization algorithm and network topology affect synchronization speed. Traditional flat parameter synchronization algorithms suffer from bandwidth contention or accumulated communication latency. By synchronizing hierarchically, HiPS avoids these problems while reducing parameter synchronization traffic. The network topology is further tailored to the communication characteristics of the synchronization algorithm. Both theoretical analysis and simulation show that, thanks to higher server bandwidth, better load balancing, and efficient support for the RoCE protocol, the HiPS+BCube combination significantly reduces parameter synchronization time.
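
As a rough illustration of the hierarchical idea (a minimal sketch, not the dissertation's HiPS implementation; the grouping scheme and function name are assumptions): gradients are first reduced inside each group, only the per-group partial sums cross group boundaries, and the global average is then shared back within each group.

```python
# Minimal sketch of hierarchical parameter synchronization (illustrative
# only, not the actual HiPS code). Gradients are aggregated inside each
# group first, then only the per-group partial sums are exchanged between
# groups, and the global average is finally shared back within each group.
import numpy as np

def hierarchical_sync(grads_per_node, group_size):
    """grads_per_node: one gradient vector (1-D numpy array) per node."""
    n = len(grads_per_node)
    groups = [grads_per_node[i:i + group_size] for i in range(0, n, group_size)]

    # Stage 1: intra-group reduction over the high-bandwidth local fabric.
    partial_sums = [np.sum(np.stack(g), axis=0) for g in groups]

    # Stage 2: inter-group reduction among group leaders only, so far less
    # traffic crosses the (potentially oversubscribed) core network.
    global_sum = np.sum(np.stack(partial_sums), axis=0)

    # Stage 3: every node receives the same averaged gradient.
    return global_sum / n

# Usage: 8 nodes in groups of 4; all nodes obtain the identical average.
grads = [np.random.rand(10).astype(np.float32) for _ in range(8)]
avg = hierarchical_sync(grads, group_size=4)
```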

(2) We propose Geryon, a network traffic scheduling scheme aware of the deep learning model. Existing deep learning frameworks do not consider the order in which the parameters of different layers will be consumed when transferring them, so model computation can hardly overlap with parameter synchronization. To schedule parameter transmissions at network-wide scale, Geryon assigns priorities to parameter synchronization flows according to the model's computation order; with a strict priority scheduling policy configured across the entire network, parameters that are consumed earlier reach the receiver sooner. Geryon achieves significant end-to-end training performance improvements for a variety of deep learning models.
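
A minimal sketch of the priority-assignment idea (illustrative only; the function name, queue count, and layer names are assumptions, not Geryon's actual interface): layers that are consumed earlier in the next forward pass are mapped to higher-priority queues, which strict-priority switches drain first.

```python
# Rough sketch of model-aware priority assignment (illustrative only, not
# Geryon's implementation). Layers consumed earlier in the forward pass are
# mapped to higher-priority queues, so with strict priority scheduling in
# the switches their parameters arrive at the receiver first.
def assign_priorities(layer_names, num_queues=4):
    """layer_names are listed in forward-computation order.
    Returns {layer: queue}, where queue 0 is the highest priority."""
    per_queue = max(1, -(-len(layer_names) // num_queues))  # ceiling division
    return {name: min(rank // per_queue, num_queues - 1)
            for rank, name in enumerate(layer_names)}

# Example: the earliest layers land in queue 0 and are transmitted ahead of
# the parameters of the later layers.
layers = ["conv1", "conv2", "conv3", "fc1", "fc2", "softmax"]
print(assign_priorities(layers))
# {'conv1': 0, 'conv2': 0, 'conv3': 1, 'fc1': 1, 'fc2': 2, 'softmax': 2}
```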

(3) We propose CEFS, a compute-efficient traffic scheduling scheme for heterogeneous distributed training. Existing deep learning frameworks do not take each node's compute capability into account when sending parameters to multiple nodes, so stragglers have to start computation at the same time as the other nodes. On top of parameter transmission scheduling, CEFS further prioritizes the parameter synchronization traffic of stragglers so that their forward computation is triggered earlier, alleviating the blocking effect of stragglers on the distributed training system. Experimental results show that CEFS greatly increases the computation speed of stragglers and significantly improves end-to-end training performance.
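
A minimal sketch of straggler-aware prioritization (illustrative only; the function and parameter names are assumptions, not CEFS's interface): on top of the layer-order priority, transfers destined for slower workers are boosted so that stragglers receive their early layers, and can start forward computation, sooner.

```python
# Illustrative sketch of straggler-aware transfer ordering (not CEFS's
# implementation). On top of layer-order scheduling, transfers destined
# for slower workers get an extra boost so stragglers can start their
# forward computation earlier.
def schedule_transfers(transfers, node_speed):
    """transfers: (worker, layer_rank) pairs, layer_rank in forward order.
    node_speed: relative compute speed per worker (1.0 = fastest).
    Returns transfers sorted by effective priority (smallest value first)."""
    def effective_rank(t):
        worker, layer_rank = t
        boost = 1.0 / node_speed[worker]     # slower worker => larger boost
        return layer_rank / boost            # boost shrinks the effective rank
    return sorted(transfers, key=effective_rank)

# Example: "w2" runs at half speed, so beyond the first layer its transfers
# are ordered ahead of the same layers destined for the faster workers.
speeds = {"w0": 1.0, "w1": 1.0, "w2": 0.5}
transfers = [(w, r) for w in speeds for r in range(3)]
print(schedule_transfers(transfers, speeds))
```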