
Construction and Implementation of a RISC-V-Based Reconfigurable Array System-on-Chip

Design and Implementation of CGRA Based on RISC-V Processor

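Author: 谢思敏
  • Student ID
    2019******
  • Degree
    Master's
  • Email
    209******com
  • Defense date
    2022.05.21
  • Advisor
    方华军
  • Discipline
    Integrated Circuit Engineering
  • Pages
    83
  • Confidentiality level
    Public
  • Institution
    026 School of Integrated Circuits
  • Chinese keywords
    粗粒度可重构处理器,配置控制器,玄铁C910,片上系统
  • English keywords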
    Coarse-grained reconfigurable architectures, configuration controller, Xuantie-910, system-on-chip

Abstract

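A coarse-grained dynamically reconfigurable processor is a hardware computing architecture driven jointly by a configuration flow and a data flow. It combines the high time-domain flexibility of a general-purpose processor's instruction flow with the high spatial-domain energy efficiency of an application-specific integrated circuit's data flow, which is in line with the development trend of integrated circuits. Its hardware architecture consists of a control path and a data path. The data path is made up of multiple processing elements and the interconnect among them; the control path is generally managed by a host controller, a configuration controller, and a data controller. Because multiple processing elements in the data path execute spatially in parallel, the data path processes quickly, which leaves a large gap between the processing speeds of the control path and the data path. The main goal of this thesis is to raise the processing speed of the control path and so balance the two. The main contributions are as follows:

1. For the configuration controller, a scheme for loading, distributing, and switching configuration information is proposed. The prefetching mechanism in this scheme pipelines the loading process with the computation in the data path, so that the configuration controller can process configuration information without waiting for computation results, which raises the processing speed of the control path. Because the configuration information consists of 64-bit machine code, a highly flexible and extensible representation of configuration information is also proposed, which simplifies the generation of the machine code.

2. For the host controller, the performance of the Picorv32 host controller in the previously implemented project caused a very large delay in the communication between the host controller and the configuration controller. Xuantie C910, an open-source high-performance general-purpose processor that is likewise based on the RISC-V instruction set architecture, is therefore adopted. In addition to the open-source core code, Xuantie C910 provides a reference integration design and the Smart platform, a simulation environment for development and testing. The performance of Xuantie C910 is tested on the Smart platform. Based on the reference integration design, the existing IP is debugged and verified and then integrated with the reconfigurable processing element array, including the configuration controller, to implement the system-on-chip.

3. The computational characteristics of the fast Fourier transform are analyzed and, taking into account the structure and computation style of the reconfigurable processor, the 8 butterfly stages of a 256-point fast Fourier transform are used as the application study for the reconfigurable processor. For a single butterfly stage, a software-defined zero-buffer pipelined implementation is proposed. For updating the data and weights across the 8 stages, an extensible ping-pong mode implementation is proposed, which reduces the number of communication interactions between the host controller and the configuration and data controllers and raises the processing speed of the control path.

Experiments show that, when processing the 256-point fast Fourier transform, the reconfigurable processor with Xuantie C910 as the host core improves the time of a single communication interaction between the host controller and the configuration controller by a factor of 16 compared with the reconfigurable processor with Picorv32 as the host core.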
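The benefit of overlapping configuration loading with data-path computation can be illustrated with a toy timing model. This is only a sketch under stated assumptions, not the scheme from the thesis: the function names and cycle counts are hypothetical, and the model assumes a single prefetch slot, so the load of the next configuration starts at the same moment as the computation of the current one.

```python
def serial_cycles(load, compute):
    """Total cycles when every configuration is loaded, then executed, in turn."""
    assert len(load) == len(compute)
    return sum(load) + sum(compute)


def prefetched_cycles(load, compute):
    """Total cycles when the load of configuration i+1 overlaps compute of i.

    Assumption (hypothetical model): one prefetch slot, so each step costs
    the longer of the current computation and the next load.
    """
    assert len(load) == len(compute) and len(load) > 0
    total = load[0]  # the first load cannot be hidden behind any computation
    for i in range(len(compute)):
        next_load = load[i + 1] if i + 1 < len(load) else 0
        total += max(compute[i], next_load)
    return total


# Hypothetical cycle counts for three configurations.
loads = [4, 4, 4]
computes = [10, 10, 10]
print(serial_cycles(loads, computes), prefetched_cycles(loads, computes))
```

In this model the saving equals the load time that can be hidden behind computation; when a load takes longer than the computation it overlaps, the control path remains the bottleneck for that step.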

Coarse-grained reconfigurable architectures (CGRAs) are hardware computing architectures driven jointly by a configuration flow and a data flow. They combine the time-domain flexibility of the instruction flow in general-purpose processors (GPPs) with the spatial-domain energy efficiency of the data flow in application-specific integrated circuits (ASICs), which is in line with the development trend of integrated circuits. The hardware architecture of a CGRA consists of a data path and a control path. The data path is composed of many processing elements and the interconnect among them. The control path is generally managed by the host controller, the configuration controller, and the data controller. Since multiple processing elements in the data path execute spatially in parallel, the data path processes quickly, which results in a large difference between the processing speeds of the control path and the data path. The main goal of this thesis is to improve the processing speed of the control path and so balance the two. The main work of this thesis is as follows:

1. For the configuration controller, a scheme for loading, distributing, and switching configuration information is proposed. The prefetching mechanism in the scheme pipelines the loading process with the computation in the data path, so that the configuration controller can process configuration information without waiting for computation results, thereby improving the processing speed of the control path. Because the configuration information is composed of 64-bit machine code, a highly flexible and scalable representation of configuration information is also proposed to simplify the generation of the machine code.

2. For the host controller, the performance of the Picorv32 host controller in the previously implemented project caused a very large delay in the communication between the host controller and the configuration controller. The open-source high-performance general-purpose processor Xuantie-910, which is also based on the RISC-V instruction set architecture, is therefore used. In addition to the open-source core code, Xuantie-910 provides a reference integration design and the Smart platform, a simulation environment for development and testing. The performance of Xuantie-910 is tested on the Smart platform. Based on the reference integration design, the existing IP is debugged and verified, and the reconfigurable processing element array, including the configuration controller, is integrated to implement the system-on-chip.

3. The computational characteristics of the fast Fourier transform are analyzed and, taking into account the structure and computation style of the reconfigurable processor, the 8 butterfly stages of a 256-point fast Fourier transform are used as the application study for the reconfigurable processor. For a single butterfly stage, a software-defined zero-buffer pipelined implementation is proposed. For updating the data and weights across the 8 stages, an extensible ping-pong mode implementation is proposed, which reduces the number of communication interactions between the host controller and the configuration and data controllers and improves the processing speed of the control path.
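The 8 butterfly stages of a 256-point transform and the ping-pong idea can be sketched in software as follows. This is a minimal illustration and not the thesis's hardware mapping: it uses a textbook radix-2 decimation-in-time FFT, two Python lists stand in for the ping-pong buffers, the twiddle factors are assumed to play the role of the "weights", and all function names are hypothetical.

```python
import cmath


def bit_reverse_permute(x):
    """Reorder the input so that in-order butterflies yield in-order output."""
    n = len(x)
    bits = n.bit_length() - 1
    out = [0j] * n
    for i in range(n):
        r = int(format(i, f"0{bits}b")[::-1], 2)  # index with its bits reversed
        out[r] = x[i]
    return out


def fft_pingpong(x):
    """Radix-2 DIT FFT; each stage reads one buffer and writes the other."""
    n = len(x)
    stages = n.bit_length() - 1
    assert n >= 2 and 1 << stages == n, "length must be a power of two"
    src = bit_reverse_permute(x)  # "ping" buffer
    dst = [0j] * n                # "pong" buffer
    half = 1
    for _ in range(stages):       # 8 iterations when n == 256
        span = 2 * half
        for start in range(0, n, span):
            for k in range(half):
                w = cmath.exp(-2j * cmath.pi * k / span)  # twiddle factor
                a = src[start + k]
                b = src[start + k + half] * w
                dst[start + k] = a + b         # butterfly upper output
                dst[start + k + half] = a - b  # butterfly lower output
        src, dst = dst, src       # swap roles instead of copying data
        half = span
    return src


def naive_dft(x):
    """O(n^2) reference transform, used only to check the sketch."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


signal = [complex(i % 7, -(i % 3)) for i in range(256)]
error = max(abs(a - b) for a, b in zip(fft_pingpong(signal), naive_dft(signal)))
print(f"max deviation from reference DFT: {error:.2e}")
```

Swapping the two buffer references after each stage means that no stage waits for data to be copied back, which is the property a ping-pong arrangement is meant to provide between consecutive butterfly stages.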