In recent years, artificial intelligence applications based on deep learning have become an important part of data center workloads. Future data centers will be built around heterogeneous computing, in which Field Programmable Gate Array (FPGA) devices will play an important role. The hardware programmability of FPGAs gives them broad algorithm adaptability and a high performance-to-power ratio. However, FPGA deep learning accelerators also face particular challenges, such as relatively weak performance, an immature ecosystem, and a high barrier to use. To address these issues, this dissertation studies efficient FPGA accelerator designs and toolchain solutions for deep learning services in data centers. Centered on FPGA deep learning accelerators, this dissertation carries out research and engineering work at the following three levels: 1) Infrastructure level: to address the identification and optimization of the bandwidth bottlenecks commonly encountered in data center deep learning inference workloads, this dissertation proposes an analytical model and analysis method for deep learning accelerators that supports the modeling of complex data scheduling strategies, effectively improving accelerator performance under the same hardware constraints. 2) Accelerator hardware level: to address the low performance, poor scalability, and lack of diversity of data center FPGA deep learning accelerators, this dissertation proposes a multi-precision convolutional neural network (CNN) inference accelerator design and a Bidirectional Encoder Representations from Transformers (BERT) inference accelerator design, which significantly improve performance and flexibility over traditional solutions. 3) Toolchain and ecosystem level: to address the high barrier to using FPGA deep learning accelerators in data centers, this dissertation proposes a layered design of an FPGA deep learning acceleration toolchain for data centers, realizing an agile deployment flow that seamlessly connects algorithm frameworks to the underlying hardware accelerators. Combining data center application scenarios, this dissertation designs efficient data center FPGA deep learning accelerator architectures that balance throughput, latency, power, and flexibility; through software-hardware co-design and optimization techniques, it improves accelerator performance and efficiency and lowers the barrier to use, thereby promoting the deployment of data center FPGA deep learning accelerators in more application areas.
In recent years, artificial intelligence applications based on deep learning have become an important part of data center workloads. Future data centers will be centered on heterogeneous computing, and field programmable gate arrays (FPGAs) will play an important role in them. The hardware programmability of FPGAs gives them broad algorithm adaptability and a high performance-to-power ratio. However, FPGA-based deep learning accelerators also face particular challenges, such as relatively weak performance, an immature ecosystem, and a high barrier to use. To address these issues, this dissertation is dedicated to efficient FPGA-based accelerator designs and toolchain solutions for deep learning services in data centers. Focusing on FPGA-based deep learning accelerators, this dissertation carries out work at the following three levels: 1) Infrastructure level: to address the bandwidth bottlenecks commonly encountered in deep learning inference workloads in data centers, an architecture modeling and analysis approach that supports complex data scheduling strategies is proposed for FPGA-based deep learning accelerators, which effectively improves performance under the same hardware constraints. 2) Accelerator hardware level: to address the low performance, poor scalability, and lack of diversity of current data center FPGA solutions, a multi-precision convolutional neural network (CNN) inference accelerator and a Bidirectional Encoder Representations from Transformers (BERT) inference accelerator are proposed, significantly improving performance and flexibility over traditional solutions. 3) Toolchain and ecosystem level: to address the high barrier to using FPGA solutions, a layered deep learning toolchain design for FPGAs in data centers is proposed, realizing a seamless and agile deployment flow from the algorithm framework to the underlying hardware accelerator. Combining data center scenarios, this dissertation designs efficient FPGA-based deep learning accelerators that balance throughput, latency, power, and flexibility. Through a software-hardware co-design approach and optimization techniques, this dissertation improves the performance and efficiency of the accelerators and lowers the barrier to use, thus promoting the adoption of FPGA-based deep learning solutions in more areas of the data center.