在稀疏线性代数算法中,最具代表性的稀疏矩阵向量乘(SpMV)算法一直是工程和科学应用中最重要的计算内核之一,广泛应用于医学影像重构、云计算、经济和电路建模、神经网络等多个领域。然而,由于稀疏矩阵自身不规则的数据结构,使得在以CPU和GPU为代表的传统冯·诺依曼体系结构上,不能很好地挖掘出SpMV的内在并行性。随着半导体制造工艺的进步,基于数据流模式执行的可重构器件FPGA集成了越来越多的硬件资源,在可重构计算领域中展示出巨大的作用与潜力。凭借其丰富的流水线资源和大量的并行计算资源,FPGA虽然能很好地加速SpMV运算,但是当矩阵规模增大时,性能也会下降。这是因为FPGA的片上存储资源有限,使得在执行大规模SpMV时,不规则的控制流导致中间结果需要反复地在片上与片下存储器之间进行调度,增大了数据的内存访问,从而影响了整体性能。针对SpMV在数据内存访问和控制流两方面不规则的特点,我们探索了基于FPGA的可重构架构上SpMV加速器实现的关键技术。总体来说,我们的设计思想是在CPU-FPGA异构可重构平台上,将SpMV中不规则的内存访问部分和不规则的控制流部分分别加载到CPU和FPGA上执行。本文的主要研究工作与创新内容如下:首先,我们提出了一种新型稀疏矩阵存储格式——CCC格式来很好地适配异构系统。该格式充分考虑了CPU端取数据时的数据局部性以及FPGA端硬件计算资源有限等因素。接着,为了实现整体设计的无阻塞执行以提高带宽利用率,我们在CPU端提出了一种新颖的取数据算法,该算法能很好地保证取数据时同一个cacheline的数据来自于不同行,从而避免了FPGA端数据流冲突。最后在FPGA端,我们设计了硬件实现部分,主要包括乘法运算、数据流通以及乘积累加三个方面,来执SpMV中最主要的乘加操作。我们将设计的完整架构,通过Intel的HARP-2测试平台,采用Florida大学的稀疏矩阵测试集进行验证。测试结果显示,我们提出的架构在执行SpMV时,平均GFLOPs达到了1.13,平均带宽利用率达到了0.094 GFLOP/GB,是现有其他设计的1.6到4.3倍,从而很好地实现了SpMV算法的加速。
Among the sparse linear algebra algorithms, the most representative Sparse Matrix-Vector Multiplication (SpMV) algorithm is one of the most important computing kernels in both engineering and scientific applications. It is widely used in medical image reconstruction, cloud computing, economic and circuit modeling, neural networks and many other fields. However, due to the irregular data structure of sparse matrices, the inherent parallelism of SpMV cannot be well exploited on traditional Von Neumann architectures like CPUs and GPUs. With the rapid development of semiconductor technologies, reconfigurable devices like FPGAs which are based on dataflow execution are integrated more and more hardware resources, and play a promisingly important role in reconfigurable computing fields. FPGA can speed up SpMV algorithm well with its rich pipeline resources and a large number of parallel computing resources. However, when the scale of matrices increases, the performance will also decrease. This is because the on-chip memory resources of FPGAs are limited. So when large-scale SpMV is performed, the irregular control flow leads to intermediate results repeatedly scheduled between on-chip and off-chip memories, which increases the memory footprint, thus affecting the overall performance.Aiming at the irregular characteristics of SpMV both in data memory access and control flow, we explored the key technologies for SpMV accelerators on FPGA-based reconfigurable architectures. In general, the idea of our design is to load the irregular memory access part and the irregular control flow part of SpMV to the CPU side and FPGA side respectively on the CPU-FPGA heterogeneous reconfigurable platform. The main research work and contributions of the paper are as follows: First, we propose a new sparse matrix storage format, the CCC format, to adapt well to the heterogeneous platform. This format fully considers the data locality when fetching data on the CPU side and the limited hardware computing resources on the FPGA side. Next, in order to achieve non-blocking execution of the whole design to improve bandwidth utilization, we propose a novel data fetching algorithm on the CPU side. This algorithm ensures that the data of the same cacheline come from different rows when data are fetching, thus avoiding data conflicts on the FPGA side. Finally, on the FPGA side, we design the hardware implementation, including the three parts of multiplication, data streaming, and accumulation to perform the most important multiplication-accumulation operations of SpMV.The complete architecture implementation of this design is verified by Intel's HARP-2 experimental platform using the University of Florida sparse matrix collection as the benchmarks. The result shows that the proposed architecture achieves a competitive performance of 1.13 GFLOPs and a bandwidth utilization of 0.094 GFLOP/GB averagely, which outperforms existing designs by 1.6-4.3x, thus achieving a good acceleration of SpMV.