Artificial intelligence (AI) applications such as speech recognition, computer vision, and recommendation systems have achieved significant progress. These rapid developments have prompted innovations in AI chip architectures: customized hardware architectures, in particular for deep learning, and CPUs, GPUs, and ASICs optimized for AI workloads all aim at better real-time response and lower power consumption. However, these CMOS-based architectures require constant data exchange between memory and computing units, which has become a fundamental bottleneck to further gains in performance and energy efficiency. The computation-in-memory (CIM) architecture based on resistive random-access memory (RRAM, also called the memristor) integrates storage and computing in a single module, reducing data movement during computation and offering a path to highly energy-efficient information processing.

However, practical RRAM devices suffer from reliability issues such as limited data retention, endurance degradation, and the relaxation effect. The reliability behavior of analog RRAM used for computing differs from that of binary RRAM used for data storage, and a comprehensive, accurate reliability model to guide the design of large-scale CIM chips is still lacking. Moreover, the impact of RRAM reliability degradation on the computational accuracy of CIM systems cannot be evaluated at any single abstraction level. In addition, the analog-to-digital converters (ADCs) in RRAM macros severely limit the energy efficiency of CIM systems. To address these problems, this thesis studies cross-level simulation and architecture optimization for RRAM-based CIM systems and makes the following contributions:

1. To address the lack of a comprehensive, accurate reliability degradation model for guiding large-scale CIM chip design, this thesis proposes a reliability degradation model for analog RRAM. Physics-based models are established for retention and endurance, and the compact reliability model further covers bit yield and the relaxation effect through empirical models extracted by statistical analysis of experimental data. Simulation results match the experimental results well, validating the accuracy and effectiveness of the model (a minimal illustrative sketch follows this abstract).

2. To enable evaluation beyond a single abstraction level, this thesis proposes a device-circuit-system cross-level simulation method and builds a cross-level simulator spanning from the device end to the algorithm end, calibrated against hardware CIM chips. Cross-level simulation is used to analyze how reliability degradation affects CIM systems, and a system-level optimization scheme is proposed that recovers the computational accuracy. The simulator is further used to explore the design space of CIM chips, providing system design guidelines and identifying architectural bottlenecks (see the second sketch below).

3. To overcome the energy-efficiency limit imposed by macro-level ADCs, this thesis proposes a straightforward link module and an analog transfer architecture. The architecture eliminates macro-level ADCs and uses the straightforward link module to process inter-array analog data, store it in the local analog domain, and transfer it directly to the next array. For convolutional neural networks (CNNs), a blockwise dataflow is further proposed to accelerate computation-intensive layers and balance the pipeline (see the third sketch below). Compared with an ADC-based CIM system, the analog transfer architecture improves energy efficiency by 2.45×~3.17×, and the blockwise dataflow provides a further 2.21×~2.54× improvement.
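A minimal sketch of the kind of device-level model described in contribution 1, written in Python. The log-time drift law, the Gaussian relaxation error, and every numeric parameter below are illustrative assumptions; the thesis itself builds physics-based retention and endurance models and fits empirical bit-yield and relaxation statistics to measured data.

import numpy as np

def program_with_relaxation(g_target, sigma_relax=0.02, rng=None):
    """Relaxation: after program-and-verify, the cell settles away from
    its target conductance. Modeled here as additive Gaussian error
    (an assumption; the thesis fits empirical statistics instead)."""
    if rng is None:
        rng = np.random.default_rng()
    return g_target + rng.normal(0.0, sigma_relax, size=np.shape(g_target))

def retention_drift(g, t_seconds, nu=0.005, t0=1.0):
    """Retention: conductance drifts over time. A log-time drift law
    g(t) = g * (1 - nu * log(t / t0)) is assumed for illustration."""
    return g * (1.0 - nu * np.log(max(t_seconds, t0) / t0))

# Example: program a 128x128 weight tile, then age it for one day.
g_target = np.random.default_rng(0).uniform(0.1, 1.0, (128, 128))
g_fresh = program_with_relaxation(g_target)
g_aged = retention_drift(g_fresh, t_seconds=86400)
print("relaxation std:", np.std(g_fresh - g_target))
print("mean drift after 1 day:", np.mean(g_target - g_aged))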
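A sketch of the cross-level idea from contribution 2: a device-level degradation model is injected into a circuit-level crossbar matrix-vector multiply, and the algorithm-level accuracy is measured end to end. The differential-pair mapping, the idealized uniform ADC, and the toy classifier are assumptions for illustration, not the simulator's actual implementation.

import numpy as np

def crossbar_mvm(x, g_pos, g_neg, adc_levels=256):
    """Differential-pair crossbar: y = x @ (G+ - G-), followed by an
    idealized uniform ADC (a simplification of a real macro)."""
    y = x @ (g_pos - g_neg)
    y_max = float(np.max(np.abs(y))) or 1.0
    step = 2 * y_max / (adc_levels - 1)
    return np.round(y / step) * step  # quantized partial sums

def accuracy_under_degradation(weights, inputs, labels, device_model):
    """Run inference with degraded conductances and report accuracy."""
    g_pos = device_model(np.clip(weights, 0, None))   # positive weights
    g_neg = device_model(np.clip(-weights, 0, None))  # negative weights
    logits = crossbar_mvm(inputs, g_pos, g_neg)
    return np.mean(np.argmax(logits, axis=1) == labels)

# Example: a toy linear classifier under a stand-in relaxation model.
rng = np.random.default_rng(1)
W = rng.normal(0, 0.3, (64, 10))
X = rng.normal(0, 1.0, (1000, 64))
y = np.argmax(X @ W, axis=1)  # labels from the ideal (noise-free) model
noisy = lambda g: g + rng.normal(0, 0.02, g.shape)  # hypothetical device model
print("accuracy with degraded devices:",
      accuracy_under_degradation(W, X, y, noisy))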
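A sketch of the pipeline-balancing arithmetic that a blockwise dataflow (contribution 3) relies on: compute-intensive early CNN layers need far more crossbar operations per image than later layers, so spare crossbar blocks are allocated to the slowest stages to shorten the pipeline interval. The layer shapes, the block budget, and the greedy allocation below are hypothetical; the thesis's actual dataflow may differ.

# MVMs per image for each pipeline stage: each output_H x output_W
# sliding window needs one crossbar MVM (illustrative layer shapes).
layer_outputs = [(32, 32), (32, 32), (16, 16), (8, 8), (1, 1)]
mvms = [h * w for h, w in layer_outputs]

def stage_times(mvms, blocks_per_layer):
    # With k duplicated blocks, a layer serves k windows in parallel.
    return [m / k for m, k in zip(mvms, blocks_per_layer)]

baseline = stage_times(mvms, [1] * len(mvms))

# Greedily give a fixed budget of extra blocks to the slowest stage.
blocks = [1] * len(mvms)
for _ in range(8):  # 8 spare crossbar blocks (assumed budget)
    worst = max(range(len(mvms)), key=lambda i: mvms[i] / blocks[i])
    blocks[worst] += 1

balanced = stage_times(mvms, blocks)
print("pipeline interval: baseline", max(baseline),
      "-> blockwise", max(balanced))
print("block allocation:", blocks)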