
On Accelerating Long-Context Inference via Sparse Self-Attention

Author: Yuxiang Huang
  • Student ID
    2021******
  • Degree
    Bachelor's
  • Defense Date
    2025-06-09
  • Advisor
    Zhiyuan Liu
  • Graduation Year
    2025
  • Department
    024 Department of Computer Science and Technology
  • Keywords
    large language models, long-context, inference acceleration, parallel acceleration, sparse attention mechanism

Abstract

In recent years, breakthroughs in Large Language Models (LLMs) have transformed natural language understanding and generation, making LLMs a cornerstone of generative AI. LLMs now support increasingly long contexts, reflecting the real-world demand for long-range data processing. This long-context capability underpins advanced AI applications such as LLM-based agents, AI-driven operating systems, and long video understanding. However, the efficiency of long-context inference remains a significant bottleneck that limits these applications. Long-context inference faces three major challenges: (1) high resource cost, (2) slow inference speed, and (3) the dual bottleneck of visual and text encoding in multimodal models. This thesis systematically proposes three frameworks, Locret, APB, and APB-V, to address these challenges. Targeting high resource cost in low-GPU-memory scenarios, Locret trains additional retaining heads to perform accurate KV cache eviction during chunked prefill, greatly reducing the memory required for long-context inference: it achieves up to a 20× KV cache compression ratio with less than a 10% drop in task performance. Targeting slow inference speed, APB optimizes the prefill stage in multi-host distributed settings. APB introduces a sequence-parallelism-aware approximate attention mechanism that combines the two mainstream optimization routes of increasing parallelism and reducing computation, delivering up to a 9.2× prefill speedup over conventional methods without sacrificing task performance. Targeting the dual bottleneck of visual and text encoding in multimodal models, APB-V applies APB to both the visual encoder and the LLM backbone, substantially accelerating long-video prefill without degrading long-video understanding; the speedup is even larger on longer, higher-resolution videos. Overall, through post-training-based, KV-cache-centric optimizations, this thesis is the first to systematically address all three efficiency challenges of long-context inference. Experimental results demonstrate the effectiveness and efficiency of this design. The framework implementations and experimental code have been open-sourced.
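
To make the Locret mechanism concrete, the following is a minimal sketch of KV cache eviction during chunked prefill under stated assumptions: the `proj_k`, `proj_v`, and `retaining_head` modules are hypothetical stand-ins for one attention head's projections and its learned importance scorer, not the thesis implementation.

```python
# Minimal sketch of Locret-style KV cache eviction during chunked prefill.
# All names, shapes, and modules are illustrative assumptions, not the thesis code.
import torch
import torch.nn as nn

hidden, head_dim, budget = 64, 16, 128
proj_k = nn.Linear(hidden, head_dim)      # stand-ins for one attention head's
proj_v = nn.Linear(hidden, head_dim)      # key/value projections
retaining_head = nn.Linear(hidden, 1)     # learned per-token importance scorer

def chunked_prefill_with_eviction(chunks):
    """Prefill chunk by chunk, keeping only the `budget` highest-scoring KV pairs."""
    k_cache = torch.empty(0, head_dim)
    v_cache = torch.empty(0, head_dim)
    scores = torch.empty(0)
    for chunk in chunks:                              # chunk: (chunk_len, hidden)
        k_cache = torch.cat([k_cache, proj_k(chunk)])
        v_cache = torch.cat([v_cache, proj_v(chunk)])
        scores = torch.cat([scores, retaining_head(chunk).squeeze(-1)])
        if scores.numel() > budget:                   # evict lowest-scoring entries,
            keep = scores.topk(budget).indices.sort().values  # keep temporal order
            k_cache, v_cache, scores = k_cache[keep], v_cache[keep], scores[keep]
    return k_cache, v_cache

k, v = chunked_prefill_with_eviction(torch.randn(10, 100, hidden))  # 10 chunks
print(k.shape)  # torch.Size([128, 16]) -- the cache never exceeds the budget
```

The key property is that the cache never grows beyond the fixed budget, so peak memory stays bounded regardless of context length.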
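
Similarly, the following single-process sketch illustrates the idea behind APB's sequence-parallel approximate attention, assuming a simple top-k rule for choosing which KV pairs each host passes onward; the real system runs the loop iterations on different hosts and uses APB's own anchor-and-passing design, which differs in detail.

```python
# Simplified single-process sketch of APB-style prefill: the sequence is split
# across "hosts"; each host attends to its local shard plus a small compressed
# set of KV pairs passed on from earlier hosts. The top-k passing rule below is
# an illustrative assumption.
import torch
import torch.nn.functional as F

def apb_style_prefill(q, k, v, num_hosts=4, passed_per_host=64):
    """q, k, v: (seq_len, dim). Returns an approximate causal attention output."""
    shard_ids = torch.chunk(torch.arange(q.size(0)), num_hosts)
    passed_k, passed_v, outputs = [], [], []
    for ids in shard_ids:
        lq, lk, lv = q[ids], k[ids], v[ids]           # this host's local shard
        ctx_k = torch.cat(passed_k + [lk])            # passed KV + local KV
        ctx_v = torch.cat(passed_v + [lv])
        n_passed = ctx_k.size(0) - lk.size(0)
        # passed KV is fully visible; the local shard keeps its causal mask
        mask = torch.ones(lq.size(0), ctx_k.size(0), dtype=torch.bool)
        mask[:, n_passed:] = torch.tril(mask[:, n_passed:])
        scores = (lq @ ctx_k.T) / lq.size(-1) ** 0.5
        probs = F.softmax(scores.masked_fill(~mask, float("-inf")), dim=-1)
        outputs.append(probs @ ctx_v)
        # pass only the locally most-attended KV pairs on to later hosts
        importance = probs[:, n_passed:].sum(dim=0)
        keep = importance.topk(min(passed_per_host, lk.size(0))).indices.sort().values
        passed_k.append(lk[keep]); passed_v.append(lv[keep])
    return torch.cat(outputs)

out = apb_style_prefill(*torch.randn(3, 1024, 64))    # unpack into q, k, v
print(out.shape)  # torch.Size([1024, 64])
```

In an actual deployment, each loop iteration runs on a different host and the passed KV pairs are communicated between ranks, so per-host attention cost stays near-local while distant context remains approximately visible.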