The performance of high-performance computers continues to grow. However, due to performance problems such as load imbalance and resource contention, many parallel programs fail to utilize the underlying hardware efficiently, wasting substantial computing resources. Diverse workload characteristics, complex software and hardware structures, and interwoven performance problems together make the performance bottlenecks of parallel programs deeply hidden and hard to detect. Traditional analysis techniques, because of their high analysis overhead and expensive manual effort, cannot be widely applied on current high-performance computers. To address these challenges, this dissertation conducts in-depth research on the performance analysis and optimization of large-scale parallel programs. The main contributions are as follows:

(1) To address the difficulty of locating performance bottlenecks in parallel programs, we propose an automatic scalability-bottleneck detection method based on graph analysis and implement a lightweight bottleneck detection system, SCALANA. The system extracts program structure and control-flow information at compile time to assist dynamic analysis, which effectively reduces runtime analysis overhead; graph-analysis algorithms then locate the scalability bottlenecks of complex parallel programs automatically. Experiments show that at a scale of 2,048 processes the system introduces only 1.73% average runtime overhead and limited storage overhead, and automatically locates the scalability bottlenecks of real-world applications.

(2) To address the high complexity of developing performance-analysis systems for large-scale parallel programs, we propose PERFLOW, a domain-specific programming framework for performance analysis. The framework abstracts a complex analysis process as a dataflow graph whose nodes are basic analysis modules, and provides a domain-specific language for conveniently describing analysis processes. It also offers binary-based analysis interfaces that do not depend on program source code, supporting the analysis of real parallel applications in production environments. Experiments show that a scalability-analysis task needs only 34 lines of code, whereas existing approaches require thousands of lines, which effectively reduces the complexity of developing performance-analysis systems.

(3) To address the difficulty of selecting optimization strategies for parallel programs, we propose ASMOD, an asynchronous strategy-aware performance modeling method. Through key techniques including performance decoupling, hierarchical modeling, and hardware-aware simulation, it achieves efficient and accurate performance prediction. Taking the representative scientific computing program HPL as an example, we validate the accuracy and efficiency of the method on multiple high-performance computing platforms. Experiments show that for HPL at a scale of more than four million cores on the Sunway TaihuLight supercomputer, the prediction error is only 1.09% and the prediction cost is at the millisecond level.

(4) Based on the above performance analysis and modeling techniques, we design and implement PUZZLE, a domain-oriented multi-level performance optimization system. It improves the expressiveness of intermediate code through multi-level intermediate representations and integrates the performance analysis and modeling techniques to achieve comprehensive, multi-level-aware optimization. Experiments show that the system runs and optimizes programs on multiple hardware platforms and improves the end-to-end performance of typical computational-fluid-dynamics programs by one to two orders of magnitude.
Modern supercomputers deliver unprecedented growth in computing power. However, parallel programs often cannot utilize computing resources efficiently due to load imbalance, inter-process communication, and resource contention. Because of diverse load characteristics, complex hardware architectures, and interactions between performance bugs, performance bottlenecks are deeply hidden and hard to detect. Existing approaches incur significant overhead and human effort, so they cannot be widely used on HPC systems. In this dissertation, we conduct in-depth research on performance analysis and optimization of large-scale parallel programs. Specifically, the main contributions are as follows:

(1) To detect performance bottlenecks in complex scenarios, we propose SCALANA, a graph-based approach for automatic scaling-loss detection. SCALANA leverages hybrid static-dynamic analysis to reduce overhead and uses graph analysis to automatically identify the root causes of scaling issues (a minimal sketch of this idea follows the abstract). We evaluate the efficacy and efficiency of SCALANA with several real-world applications: it incurs only 1.73% average runtime overhead and very low storage cost on up to 2,048 processes, and it detects root causes more automatically and efficiently than state-of-the-art tools.

(2) To ease the burden of implementing performance-analysis tasks, we propose PERFLOW, a domain-specific framework for performance analysis. PERFLOW represents the analysis process as a dataflow graph, abstracts the performance behavior of a parallel program as a graph, and provides a rich set of APIs for implementing analysis tasks (see the pipeline sketch below). Moreover, PERFLOW works directly on binaries without requiring source code, which makes it practical for production environments. Experimental results show that PERFLOW needs only 34 lines of code to implement a scalability-analysis task, significantly reducing the complexity of developing analysis tasks.

(3) To guide the design of better optimization strategies, we propose ASMOD, an asynchronous strategy-aware performance model. ASMOD achieves efficient and accurate performance prediction through several key techniques, including performance decoupling, hierarchical modeling, and hardware-aware simulation (an illustrative decomposition is sketched below). We take HPL as a case study to verify ASMOD and evaluate its accuracy and efficiency on multiple HPC systems. Experimental results show that the prediction error for HPL with over 4 million cores on the Sunway TaihuLight supercomputer is only 1.09%, and the prediction overhead is only a few milliseconds.

(4) Based on the above performance analysis and modeling techniques, we design and implement PUZZLE, a domain-specific multi-level optimization framework for automatic and comprehensive performance optimization. PUZZLE adopts a multi-level intermediate representation to improve expressiveness and integrates performance models to guide the selection of optimization strategies for in-depth optimization (a toy lowering pipeline is sketched below). We take the domain of computational fluid dynamics as an example to validate the design of PUZZLE. Experimental results show that PUZZLE brings 11.55× and 240.69× performance improvement on average for CPU and GPU platforms, respectively.
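To make contribution (1) concrete, the following is a minimal sketch of graph-based scaling-loss detection in the spirit of SCALANA. The graph construction, metrics, threshold, and region names here are illustrative assumptions, not SCALANA's actual implementation: program regions become graph nodes annotated with measured times at two process counts, regions whose time grows with scale are flagged, and the control-flow context is reported.

```python
# Illustrative sketch only: assumed graph layout, timings, and threshold.
import networkx as nx

# Toy "program abstraction graph": nodes are program regions annotated
# with measured time (seconds) at two process counts; edges follow
# control flow as extracted by static analysis.
g = nx.DiGraph()
g.add_node("main",      time={256: 10.0, 2048: 10.2})
g.add_node("solve",     time={256: 8.0,  2048: 9.5})
g.add_node("compute",   time={256: 7.0,  2048: 6.9})
g.add_node("allreduce", time={256: 1.0,  2048: 2.6})  # grows with scale
g.add_edges_from([("main", "solve"), ("solve", "compute"),
                  ("compute", "allreduce")])

def scaling_loss(node, small=256, large=2048):
    """Relative growth of a region's time as the process count rises.
    Under ideal strong scaling this should not be positive."""
    t = g.nodes[node]["time"]
    return (t[large] - t[small]) / t[small]

# Flag regions whose time grows beyond a threshold, then report the
# control-flow path from the entry point as context for the root cause.
for n in (n for n in g if scaling_loss(n) > 0.5):
    path = nx.shortest_path(g, "main", n)
    print(f"scaling bottleneck: {n} (+{scaling_loss(n):.0%}), "
          f"context: {' -> '.join(path)}")
```

On the toy data this reports only `allreduce`, whose time more than doubles from 256 to 2,048 processes, together with the path `main -> solve -> compute -> allreduce`.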
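For contribution (2), the sketch below illustrates PERFLOW's core idea, an analysis expressed as a dataflow graph of basic analysis passes, as a self-contained toy rather than PERFLOW's documented API; the pass names, record layout, and pipeline composition are all assumptions made for illustration.

```python
# Toy dataflow-style analysis. All pass names and data layouts are
# illustrative assumptions, not PERFLOW's actual interface.

# Toy profile records such as a binary-level profiler might emit:
# one (region, process count, time) measurement per record.
profile = [
    {"region": "compute",   "nprocs": 256,  "time": 7.0},
    {"region": "compute",   "nprocs": 2048, "time": 6.9},
    {"region": "allreduce", "nprocs": 256,  "time": 1.0},
    {"region": "allreduce", "nprocs": 2048, "time": 2.6},
]

# Basic analysis passes: each consumes and produces a list of records,
# so they compose like nodes of a dataflow graph.
def group_by_region(records):
    grouped = {}
    for r in records:
        grouped.setdefault(r["region"], {})[r["nprocs"]] = r["time"]
    return [{"region": k, "times": v} for k, v in grouped.items()]

def scaling_filter(records, threshold=0.5):
    def loss(times):
        lo, hi = min(times), max(times)
        return (times[hi] - times[lo]) / times[lo]
    return [r for r in records if loss(r["times"]) > threshold]

def report(records):
    for r in records:
        print("poorly scaling region:", r["region"], r["times"])
    return records

# The "dataflow graph" degenerates here to a linear pipeline: the output
# of each stage feeds the next, mirroring a node-by-node analysis graph.
data = profile
for stage in (group_by_region, scaling_filter, report):
    data = stage(data)
```

A real scalability analysis would add more node types (aggregation across processes, differencing between runs, attribution to call paths), but the composition model stays the same, which is what keeps such scripts down to tens of lines.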
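For contribution (3), the following is a minimal worked sketch of a hierarchical, decoupled performance model in the spirit of ASMOD; the concrete terms ($W_k$, $F_{\text{eff}}$, the latency-bandwidth communication model, and the overlap bound) are illustrative assumptions, not the dissertation's actual model.

```latex
% Illustrative hierarchical decomposition; terms are assumptions.
\begin{align}
T_{\text{total}} &= \sum_{k=1}^{K}
  \Bigl( T^{(k)}_{\text{comp}} + T^{(k)}_{\text{comm}}
         - T^{(k)}_{\text{overlap}} \Bigr) \\
T^{(k)}_{\text{comp}} &= \frac{W_k}{P \, F_{\text{eff}}}
  && \text{$W_k$: flops of step $k$; $F_{\text{eff}}$: measured rate} \\
T^{(k)}_{\text{comm}} &= \alpha + \frac{M_k}{\beta}
  && \text{latency--bandwidth model for $M_k$ bytes} \\
T^{(k)}_{\text{overlap}} &\le
  \min\bigl( T^{(k)}_{\text{comp}},\, T^{(k)}_{\text{comm}} \bigr)
  && \text{overlap from asynchronous (look-ahead) execution}
\end{align}
```

The overlap term is where an asynchronous strategy such as HPL's look-ahead enters the model, and because each level (step, component, hardware parameter) is evaluated in closed form, a full prediction reduces to summing $K$ cheap terms, consistent with millisecond-level prediction cost.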
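Finally, for contribution (4), the toy pipeline below sketches the idea of multi-level lowering with model-guided strategy selection; every name, IR shape, and the stub cost model are illustrative assumptions about the design, not PUZZLE's code.

```python
# Toy multi-level lowering pipeline: a high-level stencil op is rewritten
# through progressively lower-level IRs, and a stub performance model
# selects among candidate strategies. All names are hypothetical.

# Level 1: domain IR -- a named stencil over a 2-D field.
high_ir = {"op": "stencil.jacobi2d", "shape": (4096, 4096)}

# Level 2: loop IR -- candidate loop nests from different tiling strategies.
def lower_to_loops(op, tile):
    return {"op": "loops.tiled", "shape": op["shape"], "tile": tile}

candidates = [lower_to_loops(high_ir, t)
              for t in [(32, 32), (64, 64), (128, 128)]]

# Stub cost model standing in for the integrated performance model:
# penalize tiles whose working set overflows a hypothetical 256 KiB cache.
def predicted_time(loops, cache_bytes=256 * 1024, elem=8):
    tx, ty = loops["tile"]
    working_set = 3 * tx * ty * elem       # assume three fields per tile
    penalty = 8.0 if working_set > cache_bytes else 1.0
    return penalty / (tx * ty)             # coarse proxy, fine for a toy

best = min(candidates, key=predicted_time)
print("selected strategy:", best)

# Level 3: target IR -- the chosen nest would then be lowered to
# platform-specific code (e.g., vectorized CPU loops or GPU kernels).
```

The point of the multi-level structure is that each decision (tiling here, parallelization or memory placement in general) is made at the level of abstraction where the performance model can still see it, before the information is lost in low-level code.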