Research on the Performance of the Tianjic X1 Brain-Inspired Computing Chip and Its Optimization Method (天机X1类脑计算芯片性能优化方法研究)

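Author: Wang Song (王松)
  • Student ID
    2018******
  • Degree
    Master's
  • Email
    wan******.cn
  • Defense date
    2021.05.20
  • Advisor
    Pei Jing (裴京)
  • Discipline
    Instrument Science and Technology
  • Pages
    116
  • Confidentiality level
    Public
  • Department
    013 Department of Precision Instrument
  • Chinese keywords
    人工智能,类脑芯片,树突单元,神经网络,映射
  • English keywords
    Artificial intelligence, brain-inspired chip, dendritic unit, neural network, mapping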

Abstract

Artificial intelligence is developing rapidly and has achieved major breakthroughs in many fields. Driven by massive data, powerful computing, and efficient algorithms, it has entered its third wave. Proposals such as heterogeneous fusion algorithms provide a new research paradigm and foundation for moving from narrow AI toward artificial general intelligence (AGI). There are two main approaches to AGI: one based on computer science and one based on neuroscience. Because their coding schemes and paradigms differ, the computing platforms they rely on are incompatible.

The Tianjic X1 brain-inspired computing chip targets AGI and supports both the computer-science and the neuroscience approach. Modeled on the spatiotemporal correlation characteristics of the human brain, it adopts a many-core, distributed, reconfigurable, parallel, asynchronous architecture with integrated storage and computation. The chip uses hybrid coding and streaming mapping and supports the computation of multiple networks. Because the storage and computing capacity of a single core is limited, many cores are needed to compute a large neural network in parallel, so the network must be split and mapped onto the individual cores. This thesis studies network mapping in light of the chip's internal hardware structure, physical constraints, information storage characteristics, and computing modes. The main work and results are as follows:

1. A dendrite unit is designed for the Tianjic chip to raise the computing power of the brain-inspired chip. A flexibly configurable 2D multiply-accumulate (MAC) array increases the data read speed, and hence the computing power, under a fixed read/write bandwidth. A 32-bit accumulator that supports multi-precision computation avoids the loss of data precision in segmented calculation. A multi-mode MAC unit can be flexibly configured for data of different precisions and reduces computing power consumption.

2. The Tianjic X1 functional cores, which integrate storage and computation, have limited storage and computing resources and a fixed storage-to-compute ratio. Given these constraints and the complexity of artificial neural networks, the thesis studies a method for splitting and mapping a network along four dimensions, which effectively resolves the conflict between limited core resources and unbounded network scale. To improve mapping efficiency, it addresses the problems that arise during four-dimensional splitting and mapping: partial-sum data processing, dimension adjustment, overlapping-data processing, vector splitting, memory management, routing, and inter-chip data communication.

3. Taking layers 5 and 6 of the ResNet50 network as an example, manual mapping is carried out according to the mapping principles studied in the thesis. The mapping efficiency of different combinations of the four dimensions is analyzed, the Tianjic X1 primitive parameters are configured, and extensive experimental and theoretical evaluation is performed. With 56 computing cores allocated, the mapping study raises the chip's Axon computing efficiency from 62.9% to 93.7% and reduces the running clock count from 61248 to 26064, which increases the computing speed by 1.35 times (to about 2.35 times the original speed). After optimization with 28 cores allocated, the best clock count is 50880. Mapping the whole of ResNet50 requires about 600 computing cores, with the computing clock count within 10K. Finally, a mathematical optimization model of Tianjic X1 chip mapping is established.

Keywords: artificial intelligence; brain-inspired chip; dendritic unit; neural network; mapping
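The clock counts reported in the abstract can be cross-checked with a few lines of arithmetic. The sketch below is not from the thesis: the function names are hypothetical, and the "core-cycles" product (cores × clocks) is only an assumed proxy for total resource usage, used here to compare the 56-core and 28-core configurations.

```python
# Figures quoted in the abstract for layers 5 and 6 of ResNet50 on Tianjic X1.
BASELINE_CLOCKS_56 = 61248   # 56 cores, before the mapping study
OPTIMIZED_CLOCKS_56 = 26064  # 56 cores, after the mapping study
OPTIMIZED_CLOCKS_28 = 50880  # 28 cores, best clock count after optimization


def speed_ratio(clocks_before: int, clocks_after: int) -> float:
    """Return how many times faster the run is after optimization."""
    return clocks_before / clocks_after


def core_cycles(cores: int, clocks: int) -> int:
    """Assumed proxy for resource usage (not a metric defined in the thesis)."""
    return cores * clocks


ratio = speed_ratio(BASELINE_CLOCKS_56, OPTIMIZED_CLOCKS_56)
increase = ratio - 1.0  # "increased by N times" means ratio minus one

print(f"speed ratio: {ratio:.2f}x, increase: {increase:.2f} times")
print("core-cycles, 56 cores:", core_cycles(56, OPTIMIZED_CLOCKS_56))
print("core-cycles, 28 cores:", core_cycles(28, OPTIMIZED_CLOCKS_28))
```

Under this proxy, the 28-core configuration takes roughly twice as many clocks as the optimized 56-core configuration while using slightly fewer core-cycles in total.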