登录 EN

添加临时用户

基于多层次自适应方法的全规约算法优化

Optimization of the AllReduce Algorithm Based on a Hierarchical Adaptive Method

作者:张博文
  • 学号
    2021******
  • 学位
    硕士
  • 电子邮箱
    zha******com
  • 答辩日期
    2024.05.24
  • 导师
    向东
  • 学科名
    软件工程
  • 页码
    84
  • 保密级别
    公开
  • 培养单位
    410 软件学院
  • 中文关键词
    全局规约;集合通信;自适应方法;分层方法
  • 英文关键词
    AllReduce;Collective Communication;Adaptive Method;Hierarchical Method

摘要

单个计算节点的算力增长,已经无法满足各研究领域和工业应用场景对于算力的需求,因此高性能计算集群带来的算力成为了一个必要的需求。高性能计算集群在众多传统领域中已经有了广泛的应用,随着近年来人工智能技术的飞速发展和对算力需求的快速膨胀,并行计算系统性能的提升被赋予了更加重要的意义。在高性能计算集群中,多节点协同工作的模式必然地带来网络通信和数据交换的开销,硬件加速比随节点数量增加而不断下降,通信开销成为制约性能的瓶颈。全规约算法是进程间数据交换最为常见也是最为复杂的集合通信算法,如何优化全规约算法成为了一个重要的挑战。 目前,全规约算法的优化工作大多受到具体的网络拓扑或硬件资源的限制,虽然在一部分场景中表现出性能的提升,但是很难得到广泛的应用。本研究首先了解了当前在实际应用场景中常见的递归倍增、环规约和减半倍增三种全规约算法,总结了一个集合通信算法的性能分析框架,通过计算传输和进行规约运算的数据量,对不同算法的性能、适用范围和可扩展性进行了简单的分析。接下来本文探究了全规约算法的不同影响因素,充分了解了不同算法的优势场景,从分层次和自适应两个角度对全规约算法进行了优化,下面是本文的主要贡献:在论文的算法设计部分,本研究着重于从节点数量和参与规约的数据量两个维度分析了不同算法的性能变化趋势。研究结果表明,算法的实际性能变化与预先通过分析框架预测的变化趋势大体一致。然而,算法性能也受到长尾效应和网络资源争用等因素的影响,这导致实际结果与预期存在一定出入。基于对这些性能变化趋势及其适用区间的深入理解,本研究引入了算法分支的自适应选择机制,从而在多样化的应用场景中实现了性能的最优化。进一步地,本研究采用了通信局部化策略。通过实施节点的自适应分层方法,有效减少了执行过程中的步骤数和相关开销,同时最大化了局部通信带宽的利用率。另一方面,我们针对包含2的幂次及非2的幂次节点数量的多样化场景进行了深入测试,以评估在参与全规约算法过程中不同数据量时的性能表现。此外,测试范围还扩展到了每个节点上运行多个进程的情况。多个测试框架的结果一致证明,我们的算法几乎始终保持了最优的性能,并在数据量逐渐增加时,将优化幅度逐步增加到30%以上,具有广泛适用性和有效性。

The growth in computational power of individual computing nodes has been unable to meet the demands for computational resources across various research fields and industrial application scenarios, making the computational power brought by high-performance computing (HPC) clusters a necessary requirement. HPC clusters have found extensive applications in numerous traditional domains. With the rapid development of artificial intelligence technologies in recent years and the swift expansion in demand for computational resources, enhancing the performance of parallel computing systems has taken on even greater significance. Within HPC clusters, the collaborative working model of multiple nodes inevitably incurs overheads related to network communication and data exchange. As the number of nodes increases, the hardware acceleration ratio continuously declines, and communication overhead becomes a bottleneck constraining performance. The AllReduce algorithm represents the most critical and complex collective communication algorithm for inter-process data exchange. Optimizing the AllReduce algorithm has thus become a significant challenge. Currently, the optimization efforts for AllReduce algorithms are largely constrained by specific network topologies or hardware resources. Although performance improvements are observed in some scenarios, widespread application is challenging. Consequently, this study is dedicated to devising a universal optimization scheme for AllReduce.This study first introduces three common AllReduce algorithms applied in real-world scenarios—Recursive Doubling, Ring AllReduce, and Halving Doubling. It summarizes a performance analysis framework for collective communication algorithms and provides a preliminary analysis of the performance, applicability, and scalability of different algorithms based on the volume of data transmitted and reduced. Subsequently, this paper investigates various factors influencing AllReduce algorithms, fully understanding the advantageous scenarios of different algorithms. It optimizes the AllReduce algorithm from two perspectives: hierarchical and adaptive.In the algorithm design section, this study focuses on analyzing the performance trends of different algorithms from the dimensions of node quantity and data volume involved in the reduction process. The findings indicate that the actual performance changes of algorithms are generally consistent with the expected trends predicted by the analytical framework. However, actual performance is also affected by factors such as the long-tail effect and contention for network resources, leading to some deviations. With a deep understanding of these performance trends and their applicability, this research introduces an adaptive selection mechanism for algorithm branches, achieving optimal performance across diverse application scenarios. Further, this study employs a communication localization strategy. By implementing an adaptive hierarchical method for nodes, it effectively reduces the number of steps and related overhead in the execution process while maximizing the utilization of local communication bandwidth.On the other hand, we conducted thorough testing across diverse scenarios involving both power-of-two and non-power-of-two node counts to assess the performance of the AllReduce algorithm with varying data volumes. The testing scope was also extended to scenarios where multiple processes run on each node. The consistent results from multiple testing frameworks conclusively demonstrate that our algorithm almost always maintains optimal performance. As the data volume gradually increases, the degree of optimization progressively rises to over 30%, evidencing broad applicability and effectiveness.