Research of GPU Acceleration Based on the SU2 Solver (SU2的图形处理器加速研究)

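Author: 薛建恒 (Xue Jianheng)
  • Student ID
    2017******
  • Degree
    Master's
  • Email
    188******com
  • Defense date
    2020.05.20
  • Advisor
    肖志祥 (Xiao Zhixiang)
  • Discipline
    Aerospace Engineering
  • Pages
    88
  • Confidentiality level
    Public
  • Department
    031 School of Aerospace Engineering
  • Chinese keywords
    SU2, GPU, 块矩阵 (block matrix), ILU预处理 (ILU preconditioning), GMRES方法 (GMRES method)
  • English keywords
    SU2, Graphics Processing Unit, block matrix, ILU preconditioner, GMRES method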

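Abstract

To help computational fluid dynamics (CFD) codes adapt to the trend toward heterogeneous computing in high-performance computing (HPC), this thesis studies GPU acceleration of SU2 (Stanford University Unstructured), the open-source unstructured CFD solver from Stanford University. Targeting the most computationally intensive performance bottleneck in SU2, it implements multi-GPU parallel acceleration of the linear system solve, measures its computational efficiency, and verifies the convergence and parallel behavior of the algorithm. The specific work is as follows:
1. On top of the open-source algebraic multigrid acceleration library AmgX, a block incomplete LU (ILU) preconditioner and the generalized minimal residual (GMRES) method are implemented to solve the Reynolds-averaged Navier-Stokes (RANS) equations arising from SU2's implicit schemes. A graph coloring algorithm is used to expose the parallelism in the ILU algorithm that GPU acceleration requires. Because of current GPU hardware limitations, different CUDA kernels are designed for the sub-block matrices of different dimensions that arise in 2D and 3D problems, so as to maintain computational efficiency.
2. The convergence behavior of the GPU-accelerated algorithm is analyzed. When solving 2D and 3D linear systems with the same number of iterations, the GPU generally converges to a lower accuracy than the CPU, mainly because the color-based layering of the matrix degrades the preconditioner. This shortcoming can be offset by increasing the number of iterations or by introducing an algebraic multigrid method, with very little effect on computational efficiency. Based on the test results, the thesis proposes that designing a GPU-accelerated iterative algorithm requires balancing three factors: computational efficiency, convergence accuracy, and GPU memory.
3. Multi-GPU parallelism is implemented on top of SU2's existing MPI parallel mode, and its parallel behavior is verified with several test cases: a 2D turbulent flat plate, transonic flow over the ONERA M6 wing, subsonic flow over the DLR-F4 wing-body configuration, and flow over the F6 with a leading-edge fairing. The results confirm the correctness of the GPU computation and show that the multi-GPU linear solver achieves a speedup of roughly 5 to 6 times, while the overall efficiency of SU2 improves by 60%.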
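The graph coloring step mentioned in item 1 can be illustrated with a minimal sketch. This is not the thesis's implementation or the AmgX API; it is a plain greedy coloring in Python, and the function names `greedy_coloring` and `color_groups` as well as the small 2x3 grid graph are assumptions made only for illustration. The idea it shows is that rows of the matrix that share a color have no direct coupling, so they can be processed at the same time.

```python
def greedy_coloring(adjacency):
    """Give each vertex the smallest color not already used by a neighbor."""
    colors = {}
    for v in sorted(adjacency):
        used = {colors[u] for u in adjacency[v] if u in colors}
        c = 0
        while c in used:
            c += 1
        colors[v] = c
    return colors


def color_groups(colors):
    """Collect vertices by color; each group is a set of mutually independent rows."""
    groups = {}
    for v, c in colors.items():
        groups.setdefault(c, []).append(v)
    return [sorted(groups[c]) for c in sorted(groups)]


# Hypothetical sparsity graph of a 2x3 grid (vertex = matrix row):
# 0 - 1 - 2
# |   |   |
# 3 - 4 - 5
adjacency = {
    0: [1, 3], 1: [0, 2, 4], 2: [1, 5],
    3: [0, 4], 4: [1, 3, 5], 5: [2, 4],
}

colors = greedy_coloring(adjacency)
groups = color_groups(colors)
print(groups)
```

On this grid the greedy pass produces two groups, so the six rows can be handled in two parallel sweeps instead of six sequential steps. Fewer colors mean more parallelism per sweep, but the reordering also changes the incomplete factorization, which is the trade-off item 2 describes.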

To help computational fluid dynamics (CFD) codes adapt to the trend toward heterogeneous computing in high-performance computing (HPC), this thesis presents GPU acceleration of the open-source SU2 (Stanford University Unstructured) solver. Focusing on the most computationally intensive performance bottleneck of SU2, we implemented a multi-GPU linear solver, measured its computational efficiency, and studied the convergence and parallel properties of the GPU version of the solver. The main contents are summarized below:
1. A block ILU (incomplete lower-upper) preconditioner and the GMRES (generalized minimal residual) iterative method are implemented on top of the open-source AmgX (algebraic multigrid) library to solve the coupled RANS (Reynolds-averaged Navier-Stokes) linear system produced by SU2's implicit schemes. We also implemented a graph coloring algorithm that exposes more parallelism in the ILU factorization for GPU acceleration. Because of GPU hardware limitations, separate CUDA kernels had to be written for 2D and 3D problems.
2. The convergence properties of the GPU-accelerated algorithm were analyzed. We found that, for the same number of iterations, convergence accuracy on the GPU is generally worse than on the CPU. Convergence is usually degraded by the matrix coloring and reordering, but this can be remedied by increasing the number of iterations or by using the algebraic multigrid algorithm, with little effect on efficiency. Based on the test results, we point out that an algorithm executed on the GPU must balance computational efficiency, convergence accuracy, and GPU memory.
3. Multi-GPU parallelism was implemented on top of the existing MPI system of SU2. Its parallel characteristics were investigated with several cases, including a turbulent flat plate, ONERA M6 transonic flow, DLR-F4 wing/body subsonic flow, and the F6 with a leading-edge fairing.
The results validate the correctness of the GPU computation and show that the multi-GPU linear solver achieves a speedup of 5 to 6 times, and that the SU2 solver gains a 40% increase in efficiency.
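As a rough illustration of why coloring and reordering help when the preconditioner is applied, the sketch below performs a lower-triangular solve color by color. It is an assumption-laden toy, not the thesis's CUDA kernels: it uses scalar entries rather than the block sub-matrices of the thesis, a hypothetical 4-node chain matrix with red-black ordering, and the illustrative function names `forward_substitution` and `colorwise_forward_substitution`. After reordering, rows of the same color do not couple, so every row in a color depends only on values finished by earlier colors.

```python
def forward_substitution(L, b):
    """Reference sequential solve of L y = b for lower-triangular L."""
    y = [0.0] * len(b)
    for i in range(len(b)):
        s = sum(L[i][j] * y[j] for j in range(i))
        y[i] = (b[i] - s) / L[i][i]
    return y


def colorwise_forward_substitution(L, b, color_ranges):
    """Solve L y = b one color at a time.

    Rows in [start, stop) read only y[0:start], so on a GPU each row of a
    color could be assigned to its own thread.
    """
    y = [0.0] * len(b)
    for start, stop in color_ranges:
        new_values = [
            (b[i] - sum(L[i][j] * y[j] for j in range(start))) / L[i][i]
            for i in range(start, stop)
        ]
        y[start:stop] = new_values
    return y


# Hypothetical 4-node chain: tridiagonal matrix with 4 on the diagonal, -1 off it.
A = [[4.0, -1.0, 0.0, 0.0],
     [-1.0, 4.0, -1.0, 0.0],
     [0.0, -1.0, 4.0, -1.0],
     [0.0, 0.0, -1.0, 4.0]]

# Red-black ordering: color 0 = nodes {0, 2}, color 1 = nodes {1, 3}.
order = [0, 2, 1, 3]
permuted = [[A[r][c] for c in order] for r in order]

# Lower-triangular part of the reordered matrix (stand-in for an ILU factor).
L = [[permuted[i][j] if j <= i else 0.0 for j in range(4)] for i in range(4)]

b = [4.0, 8.0, 2.0, 6.0]
y_seq = forward_substitution(L, b)
y_col = colorwise_forward_substitution(L, b, [(0, 2), (2, 4)])
print(y_seq, y_col)
```

Both solves give the same result here because the reordered factor has no coupling inside a color. The sequential version needs four dependent steps, while the color-wise version needs only two, one per color.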