精读-07：StreamingLLM — 流式语言模型与注意力汇

论文全名：Efficient Streaming Language Models with Attention Sinks
arXiv 编号：2309.17453
发表会议：ICLR 2024
机构：MIT · Meta AI · CMU
代码：https://github.com/mit-han-lab/streaming-llm
精读日期：2026-05-27

阅读地图

本文解决一个非常实际的问题：如何让大模型在不停地聊天、不停地生成文字时，内存不爆炸、性能不崩溃？

阅读顺序建议：

先看"背景问题"（本文第一节）——理解 KV 缓存为什么是瓶颈，以及两种简单方法的缺陷
再看"核心发现"（第 3.1 节）——注意力汇（attention sink）是什么，为什么滑动窗口会崩
再看"解法"（第 3.2 节）——StreamingLLM 如何用"保留开头 + 滑动窗口"解决问题
再看"进阶方案"（第 3.3 节）——预训练时加一个专用 sink token
最后看"实验"：400 万 token 的稳定生成，最高 22.2 倍加速

关键术语导览（首次出现时详解）：
- Token：文字被切分成的最小单元，可理解为"字符块"
- KV 缓存（KV Cache）：模型在生成每个新词时，需要参考所有历史词的"中间表示"，这些表示被缓存下来，称为 Key-Value 缓存
- Attention（注意力）：模型决定"生成当前词时，应该关注哪些历史词"的机制
- Softmax：一个把任意一组数转化成"加起来等于 1 的概率分布"的数学函数
- 困惑度（Perplexity，PPL）：衡量语言模型好坏的指标，数字越低越好；崩溃时会飙升到几千甚至几万
- 流式应用（Streaming）：输入流源源不断，模型需要持续处理而不截断，如长期对话、代码补全

一、背景问题：为什么聊天机器人处理超长对话这么难？

在正式翻译论文前，先用大白话把问题讲清楚。

问题 1：KV 缓存会无限增长

每次大模型生成一个新词，它要"回头看"所有之前的词，才能决定下一个词是什么。为了不重复计算，模型会把每个历史词的"摘要信息"（称为 Key 和 Value，合称 KV）存起来，这就是 KV 缓存。

问题在于：对话越长，缓存越大。聊了 10 万个词，就要存 10 万条 KV，显存直接爆炸。

类比：像一个秘书，把开会以来每个人说过的每句话都用便利贴贴在墙上，会议越开越长，墙上的便利贴越贴越多，最终贴不下了。
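下面用一小段 Python 粗略估算 KV 缓存到底有多大，帮助建立量级直觉。这只是一个假设性的示意：层数 32、隐藏维度 4096、每个数 2 字节（fp16）是按 Llama-2-7B 的公开配置取的示例值，函数名 kv_cache_bytes 是为说明而起的名字，并非论文或其代码库中的接口。

```python
def kv_cache_bytes(num_tokens, num_layers=32, hidden_size=4096, bytes_per_value=2):
    """粗略估算 KV 缓存占用的字节数（假设标准多头注意力，batch 大小为 1）。"""
    # 每个 token 在每一层都要存一个 Key 向量和一个 Value 向量，所以乘以 2
    per_token = 2 * num_layers * hidden_size * bytes_per_value
    return num_tokens * per_token


if __name__ == "__main__":
    for n in (4_096, 100_000, 4_000_000):
        gib = kv_cache_bytes(n) / 2**30
        print(f"{n:>9} tokens -> {gib:8.1f} GiB")
```

在这些假设下，每个 token 约占 0.5 MiB，10 万个 token 就需要约 49 GiB，超过了许多单张 GPU 的显存容量；实际数值会随模型结构和精度而变化。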

问题 2：模型只能处理训练时见过的长度

大模型在训练时，每次只处理固定长度的文本（如 4096 个 token）。超过这个长度，模型的性能就会下降，因为它没见过更长的序列。

两种失败的简单方案

方案 A：密集注意力（Dense Attention） — 保留所有 token 的 KV。
- 缺点：显存无限增长，超过训练长度后性能下降。

方案 B：滑动窗口注意力（Window Attention） — 只保留最近 N 个 token 的 KV，更旧的丢掉。
- 表面上看很合理：只记最近的对话内容，节省显存。
- 但结果是灾难性的：只要丢掉了最开头几个 token 的 KV，模型性能就会突然崩溃（困惑度从 5 飙升到 5158）。

为什么滑动窗口会崩？ 这正是本文的核心发现——注意力汇（Attention Sink）现象。

二、Abstract（摘要）逐段精译

原文

Deploying Large Language Models (LLMs) in streaming applications such as multi-round dialogue, where long interactions are expected, is urgently needed but poses two major challenges. Firstly, during the decoding stage, caching previous tokens' Key and Value states (KV) consumes extensive memory. Secondly, popular LLMs cannot generalize to longer texts than the training sequence length.

翻译

在多轮对话等预期会出现长时间交互的流式应用中部署大型语言模型（LLM）是迫切的需求，但面临两大挑战。首先，在解码阶段，缓存此前 token 的键值状态（KV）会消耗大量内存。其次，目前流行的 LLM 无法泛化到比训练序列长度更长的文本。

新手讲解

"流式应用"就是输入没有尽头的场景，比如你和 AI 一直聊天，聊了 3 小时，AI 要一直保持"记忆"和"理解力"。两个核心难题：（1）内存爆炸；（2）模型超出了自己训练时的"视野范围"就失灵。

原文

Window attention, where only the most recent KVs are cached, is a natural approach — but we show that it fails when the text length surpasses the cache size. We observe an interesting phenomenon, namely attention sink, that keeping the KV of initial tokens will largely recover the performance of window attention.

翻译

滑动窗口注意力——只缓存最近几个 token 的 KV——是一个自然的解决思路，但我们证明了，当文本长度超过缓存大小时，它会失败。我们发现了一个有趣的现象，称为注意力汇（attention sink）：只要保留开头几个 token 的 KV，就能在很大程度上恢复窗口注意力的性能。

新手讲解

"注意力汇"是本文最重要的发现。直觉上，我们以为只需保留最近的内容就够了，但实验发现开头几个 token 有特殊的"镇场"作用——丢掉它们，模型就崩了。

原文

In this paper, we first demonstrate that the emergence of attention sink is due to the strong attention scores towards initial tokens as a "sink" even if they are not semantically important. Based on the above analysis, we introduce StreamingLLM, an efficient framework that enables LLMs trained with a finite length attention window to generalize to infinite sequence length without any fine-tuning.

翻译

在本文中，我们首先证明注意力汇的出现，是因为模型对初始 token 产生了强烈的注意力分数，将其作为"汇"（sink），即使这些 token 在语义上并不重要。基于上述分析，我们提出 StreamingLLM，一个高效框架，使在有限注意力窗口上训练的 LLM 能够泛化到无限序列长度，而无需任何微调。

新手讲解

"语义上不重要"这一点很关键——注意力汇跟内容无关，跟位置有关。就算把开头换成换行符"\n"，模型还是会把大量注意力倾倒给它。StreamingLLM 的核心优势：即插即用，不需要重新训练模型。

原文

We show that StreamingLLM can enable Llama-2, MPT, Falcon, and Pythia to perform stable and efficient language modeling with up to 4 million tokens and more. In addition, we discover that adding a placeholder token as a dedicated attention sink during pre-training can further improve streaming deployment. In streaming settings, StreamingLLM outperforms the sliding window recomputation baseline by up to 22.2× speedup.

翻译

我们展示了 StreamingLLM 能让 Llama-2、MPT、Falcon 和 Pythia 在多达 400 万 token 甚至更长的文本上进行稳定、高效的语言建模。此外，我们发现在预训练时加入一个占位符 token 作为专用的注意力汇，可以进一步提升流式部署的效果。在流式场景中，StreamingLLM 比滑动窗口重计算基线最高快 22.2 倍。

新手讲解

400 万 token 按"1 个 token 约合 0.75 个英文词"的粗略换算，大约是 300 万个英文词，相当于数十部长篇小说的篇幅。之前的方案要么内存爆炸，要么性能崩溃；StreamingLLM 可以稳定处理这么长的序列，而且单 token 解码速度最高可达带重计算的滑动窗口基线的 22.2 倍。

三、Introduction（引言）核心段落精译

原文（第一段）

Large Language Models (LLMs) are becoming ubiquitous, powering many natural language processing applications such as dialog systems, document summarization, code completion and question answering. To unleash the full potential of pretrained LLMs, they should be able to efficiently and accurately perform long sequence generation. For example, an ideal ChatBot assistant can stably work over the content of recent day-long conversations. However, it is very challenging for LLM to generalize to longer sequence lengths than they have been pretrained on, e.g., 4K for Llama-2.

翻译

大型语言模型正变得无处不在，驱动着对话系统、文档摘要、代码补全和问答等众多自然语言处理应用。要充分发挥预训练 LLM 的潜力，它们需要能够高效、准确地完成长序列生成。例如，理想的聊天机器人助手应能稳定处理整天对话的内容。然而，让 LLM 泛化到比预训练时更长的序列是非常困难的，例如 Llama-2 的预训练长度仅为 4K。

新手讲解

4K token 大约是 3000 个英文词，或约 2000 汉字。这对于"整天的对话"来说远远不够——一个下午的对话可能就有几万字。现实需求和模型能力之间有巨大的鸿沟。

原文（核心挑战段）

When applying LLMs for infinite input streams, two primary challenges arise:
1. During the decoding stage, Transformer-based LLMs cache the Key and Value states (KV) of all previous tokens, which can lead to excessive memory usage and increasing decoding latency.
2. Existing models have limited length extrapolation abilities, i.e., their performance degrades when the sequence length goes beyond the attention window size set during pre-training.

翻译

将 LLM 用于无限输入流时，会出现两个主要挑战：
1. 在解码阶段，基于 Transformer 的 LLM 会缓存所有之前 token 的键值状态（KV），这会导致过度的内存占用和不断增加的解码延迟。
2. 现有模型的长度外推能力有限，即当序列长度超过预训练时设定的注意力窗口大小时，性能会下降。

新手讲解

KV 缓存详解：Transformer 模型的注意力机制需要把每个历史 token 编码成两个向量——Key（K，"关键词索引"）和 Value（V，"内容信息"）。生成第 1000 个词时，需要读取前 999 个词的 KV；生成第 100 万个词时，就需要读取前 99 万 9999 个词的 KV，显存需求是线性增长的。

原文（两种方案及其局限）

An intuitive approach, known as window attention, maintains only a fixed-size sliding window on the KV states of most recent tokens. Although it ensures constant memory usage and decoding speed after the cache is initially filled, the model collapses once the sequence length exceeds the cache size, i.e., even just evicting the KV of the first token, as illustrated in Figure 3.

Another strategy is the sliding window with re-computation, which rebuilds the KV states of recent tokens for each generated token. While it offers strong performance, this approach is significantly slower due to the computation of quadratic attention within its window, making this method impractical for real-world streaming applications.

翻译

一种直觉上合理的方法是窗口注意力，即只维护最近几个 token 的 KV 状态的固定大小滑动窗口。虽然这确保了缓存填满后内存使用量和解码速度恒定，但一旦序列长度超过缓存大小，模型就会崩溃——甚至只是驱逐第一个 token 的 KV 就足以让它崩溃，如图 3 所示。

另一种策略是带重计算的滑动窗口，每生成一个新 token 时都重建最近 token 的 KV 状态。虽然性能强劲，但由于其窗口内需要进行二次复杂度的注意力计算，速度非常慢，这使该方法在实际流式应用中不切实际。

新手讲解

三种方法的对比：

方法	内存	速度	长文本性能
密集注意力（保留全部KV）	线性增长，会爆	越来越慢	超过训练长度就崩
滑动窗口（只保留最近N个）	恒定，好	快，好	丢开头token就崩
滑动窗口+重计算	恒定，好	非常慢（O(TL²)）	好，但太慢
StreamingLLM（本文）	恒定，好	快，好	稳定！

"重计算"相当于每生成一个词，都要把最近 N 个词从头重新"理解"一遍，复杂度是 O(T·L²)，T 是总生成量，L 是窗口大小，这在实际应用中太慢了。
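上面的复杂度对比可以用一个极简的计数模型来体会。这只是示意：它只数"query 与 key 配对"的次数，忽略常数、前馈层和访存开销，因此算出的比值并不等于论文实测的加速比；函数名均为假设。

```python
def recompute_cost(total_tokens, window):
    """带重计算的滑动窗口：每生成 1 个 token，都要在窗口内重算一次 L×L 的注意力。"""
    return total_tokens * window * window


def streaming_cost(total_tokens, window):
    """使用 KV 缓存：每生成 1 个 token，只需让新 query 与缓存中的 L 个 key 配对一次。"""
    return total_tokens * window


if __name__ == "__main__":
    T = 10_000
    for L in (256, 1024, 4096):
        ratio = recompute_cost(T, L) // streaming_cost(T, L)
        print(f"L={L:5d}  重计算/缓存 的配对次数之比 = {ratio}")
```

在这个计数模型里，两者的比值恰好等于窗口大小 L，这也解释了为什么窗口越大，重计算方案越吃亏。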

原文（本文发现与方案）

To understand the failure of window attention, we find an interesting phenomenon of autoregressive LLMs: a surprisingly large amount of attention score is allocated to the initial tokens, irrespective of their relevance to the language modeling task. We term these tokens "attention sinks". Despite their lack of semantic significance, they collect significant attention scores. We attribute the reason to the Softmax operation, which requires attention scores to sum up to one for all contextual tokens. Thus, even when the current query does not have a strong match in many previous tokens, the model still needs to allocate these unneeded attention values somewhere so it sums up to one. The reason behind initial tokens as sink tokens is intuitive: initial tokens are visible to almost all subsequent tokens because of the autoregressive language modeling nature, making them more readily trained to serve as attention sinks.

翻译

为了理解窗口注意力的失败，我们在自回归 LLM 中发现了一个有趣的现象：出乎意料地，大量注意力分数被分配给了初始 token，而不论它们与语言建模任务是否相关。我们将这些 token 称为"注意力汇"。尽管它们在语义上并不重要，却汇集了大量注意力分数。我们将原因归结于 Softmax 运算，该运算要求所有上下文 token 的注意力分数加起来等于 1。因此，即使当前查询在许多前面的 token 中没有强匹配，模型仍然需要把这些多余的注意力值分配到某个地方，使它们加起来等于 1。初始 token 成为注意力汇的原因很直观：由于自回归语言建模的特性，初始 token 对几乎所有后续 token 都是可见的，因此它们更容易被训练为充当注意力汇。

新手讲解

这是全文最核心的一段，必须讲透。

Softmax 的"强迫症"：Softmax 函数有个硬性规定——所有注意力权重加起来必须等于 1。就像强制要求你把 100 分的信任感分配给房间里所有的人，即使你觉得大多数人的话都不重要，你也必须给他们分一些分数，否则数学上无法成立。

注意力汇的形成机制：模型在训练时发现，初始 token 在所有时刻都是"可见的"（因为自回归模型是从左到右处理的，第 1 个 token 对之后所有 token 都可见，第 2 个 token 对第 2 个之后所有 token 都可见……），因此模型"习惯"把多余的、没有明确去处的注意力权重倾倒给这些始终在场的初始 token。就像开会时大家习惯看主持人一样——不管议题是什么，主持人始终是视线的默认落点。

类比"泄压阀"：想象一个水压系统，水必须往某个地方流（Softmax 总和为 1）。初始 token 就像系统里的泄压阀，多余的"水压"（注意力权重）都倾泻到这里。一旦把泄压阀封堵（丢掉初始 token 的 KV），整个系统压力分布就失衡了，模型崩溃。

原文（StreamingLLM 方案摘要）

Based on the above insights, we propose StreamingLLM, a simple and efficient framework that enables LLMs trained with a finite attention window to work on text of infinite length without fine-tuning. StreamingLLM exploits the fact that attention sinks have high attention values, and preserving them can maintain the attention score distribution close to normal. Therefore, StreamingLLM simply keeps the attention sink tokens' KV (with just 4 initial tokens sufficing) together with the sliding window's KV to anchor the attention computation and stabilize the model's performance.

翻译

基于上述洞察，我们提出 StreamingLLM，一个简单高效的框架，使在有限注意力窗口上训练的 LLM 能够在无限长度的文本上工作，无需任何微调。StreamingLLM 利用了注意力汇具有高注意力值这一事实，保留它们可以使注意力分数分布接近正常状态。因此，StreamingLLM 只需将注意力汇 token 的 KV（只需 4 个初始 token 就足够）与滑动窗口的 KV 一起保留，以锚定注意力计算并稳定模型性能。

新手讲解

核心解法一句话：始终保留开头 4 个 token 的 KV（注意力汇锚点）+ 最近 N 个 token 的 KV（滑动窗口），中间所有 token 的 KV 全部丢掉。

开会类比：
- 开头 4 个 token = 主持人（始终保留，作为注意力的默认落点）
- 最近 N 个 token = 最近几轮发言（保留，用于实际内容理解）
- 中间所有 token = 几小时前说过的话（丢掉，不影响当前讨论）

无需微调这一点极其重要：你现在用的 Llama-2、Falcon 等模型，直接套上 StreamingLLM 就能用，不需要重新训练，不需要修改模型权重。

四、Section 3：StreamingLLM 方法（完整精读）

3.1 窗口注意力的失败与注意力汇

原文（3.1 开篇）

While the window attention technique offers efficiency during inference, it results in an exceedingly high language modeling perplexity. Consequently, the model's performance is unsuitable for deployment in streaming applications. In this section, we use the concept of attention sink to explain the failure of window attention, serving as the inspiration behind StreamingLLM.

翻译

虽然窗口注意力技术在推理时效率较高，但它导致语言建模困惑度极高。因此，模型的性能不适合在流式应用中部署。在本节中，我们用注意力汇的概念来解释窗口注意力的失败，这也是 StreamingLLM 的灵感来源。

新手讲解

"困惑度极高"是灾难性的——正常语言模型的困惑度在 5-20 左右，窗口注意力在丢弃初始 token 后困惑度飙升到 5000 以上，意味着模型几乎在随机输出，完全无法使用。
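困惑度的数值含义可以用定义直接算一遍：它是"每个 token 的平均负对数似然"取指数。下面是一个最小示意，输入的概率是随手编的示例值，函数名 perplexity 也只是为说明而起。

```python
import math


def perplexity(token_probs):
    """token_probs：模型给每个"真实的下一个 token"分配的预测概率。"""
    # 平均负对数似然（自然对数），再取指数
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))


if __name__ == "__main__":
    print(perplexity([0.25] * 8))      # 每步都像在 4 个候选里均匀猜
    print(perplexity([1 / 5000] * 8))  # 每步都像在 5000 个候选里均匀猜
```

按这个定义，困惑度为 k 大致相当于模型每一步都在 k 个候选词里均匀地猜；困惑度 5000 以上意味着预测几乎不含有效信息。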

原文（识别困惑度飙升的时刻）

Identifying the Point of Perplexity Surge. Figure 3 shows the perplexity of language modeling on a 20K token text. It is evident that perplexity spikes when the text length surpasses the cache size, led by the exclusion of initial tokens. This suggests that the initial tokens, regardless of their distance from the tokens being predicted, are crucial for maintaining the stability of LLMs.

翻译

识别困惑度飙升的时刻。 图 3 显示了对 20K token 文本进行语言建模时的困惑度。很明显，当文本长度超过缓存大小时——由于初始 token 被排除——困惑度会急剧飙升。这表明初始 token，无论它们与被预测 token 的距离多远，对于维持 LLM 的稳定性都至关重要。

新手讲解

实验发现非常精确：只要生成的 token 数量一超过缓存大小，迫使第一个 token 被驱逐，困惑度就立刻飙升。这不是逐渐退化，而是性能的断崖式崩溃，说明初始 token 在系统中扮演着某种"基础设施"角色。

原文（为什么移除初始 KV 会让模型崩溃）

Why do LLMs break when removing initial tokens' KV? We visualize attention maps from all layers and heads of the Llama-2-7B model. We find that, beyond the bottom two layers, the model consistently focuses on the initial tokens across all layers and heads. The implication is clear: removing these initial tokens' KV will remove a considerable portion of the denominator in the SoftMax function in attention computation. This alteration leads to a significant shift in the distribution of attention scores away from what would be expected in normal inference settings.

翻译

为什么移除初始 token 的 KV 会让 LLM 崩溃？ 我们对 Llama-2-7B 模型所有层和所有注意力头的注意力图进行了可视化。我们发现，除底部两层外，该模型在所有层和头中都一致地聚焦于初始 token。这一含义很明确：移除这些初始 token 的 KV，将去除注意力计算中 SoftMax 函数分母的相当大一部分。这一改变导致注意力分数分布发生重大偏移，偏离了正常推理设置下应有的分布。

新手讲解

数学直觉：Softmax 的公式是：

注意力权重(token i) = exp(score_i) / (exp(score_1) + exp(score_2) + ... + exp(score_N))

其中 score_1（初始 token 的得分）通常非常大（注意力汇的特性）。如果你把 score_1 从分母里拿掉，整个分母突然变小，所有其他 token 的注意力权重就会被"虚假地放大"，注意力分布完全失真，模型的每一层都在用错误的权重聚合信息，最终输出的是乱码。

类比：想象一个班级投票选班长，规则是所有票数相加必须等于 100 分（Softmax）。A 同学平时很受关注，经常得到 40 分的"默认关注票"。现在突然宣布 A 同学离开了（删掉初始 token），剩下的同学要重新分配这 100 分——但他们从来没练习过怎么在 A 不在的情况下分配，结果乱成一锅粥。
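"分母突然变小"的效果可以用几个数字直接验证。下面的得分是为演示而假设的（第 0 个代表注意力汇，得分明显偏高），并非从真实模型中测得。

```python
import math


def softmax(scores):
    """数值稳定的 Softmax：先减去最大值再取指数。"""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# 假设的注意力得分：第 0 个是 sink，其余 4 个是普通 token
scores = [5.0, 1.0, 1.0, 1.0, 1.0]

with_sink = softmax(scores)          # 正常推理：sink 在分母里
without_sink = softmax(scores[1:])   # 窗口注意力：sink 的 KV 已被驱逐

if __name__ == "__main__":
    print("保留 sink：", [round(w, 3) for w in with_sink])
    print("丢掉 sink：", [round(w, 3) for w in without_sink])
```

在这组假设的得分下，保留 sink 时它拿走约 93% 的权重，其余 token 各约 1.7%；丢掉 sink 后，其余 token 各变成 25%，被放大了十几倍。模型各层从未在这种分布下训练过，输出自然失真。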

原文（区分语义与位置的关键实验）

There are two possible explanations for the importance of the initial tokens in language modeling: (1) Either their semantics are crucial, or (2) the model learns a bias towards their absolute position. To distinguish between these possibilities, we conduct experiments (Table 1), wherein the first four tokens are substituted with the linebreak token "\n". The observations indicate that the model still significantly emphasizes these initial linebreak tokens. Furthermore, reintroducing them restores the language modeling perplexity to levels comparable to having the original initial tokens. This suggests that the absolute position of the starting tokens, rather than their semantic value, holds greater significance.

翻译

初始 token 在语言建模中重要性的可能解释有两种：（1）它们的语义至关重要，或（2）模型对其绝对位置学习了一种偏好。为了区分这两种可能性，我们进行了实验（表 1），将前四个 token 替换为换行符"\n"。观察表明，模型仍然显著地强调这些初始换行符。此外，重新引入它们可以将语言建模困惑度恢复到接近原始初始 token 的水平。这表明，起始 token 的绝对位置，而非其语义价值，具有更重要的意义。

关键实验数据（表 1，Llama-2-13B）：

缓存配置	困惑度（PPL，越低越好）
0+1024（纯窗口注意力，无初始token）	5158.07
4+1020（4个原始初始token + 最近1020个）	5.40
4"\n"+1020（4个换行符 + 最近1020个）	5.60

新手讲解

这个实验非常精彩，一刀切断了"语义重要性"的解释：把开头 4 个 token 换成毫无语义的换行符"\n"，困惑度从 5158 恢复到 5.60（仅比原版 5.40 高一点点）。内容不重要，位置才重要。 模型需要的不是"开头说了什么"，而是"有东西站在开头"这个事实本身。

原文（注意力汇的正式定义）

LLMs attend to Initial Tokens as Attention Sinks. To explain why the model disproportionately focuses on initial tokens—regardless of their semantic relevance to language modeling, we introduce the concept of "attention sink". The nature of the SoftMax function prevents all attended tokens from having zero values. This requires aggregating some information from other tokens across all heads in all layers, even if the current embedding has sufficient self-contained information for its prediction. Consequently, the model tends to dump unnecessary attention values to specific tokens.

翻译

LLM 将初始 Token 视为注意力汇。 为了解释模型为何不成比例地关注初始 token——无论它们与语言建模任务的语义相关性如何——我们引入了"注意力汇（attention sink）"的概念。SoftMax 函数的本质决定了不能让所有被关注的 token 的值都为零。这要求在所有头和所有层中，即使当前嵌入表示已有足够的自包含信息来完成预测，仍需从其他 token 聚合一些信息。因此，模型倾向于将多余的注意力值倾倒到特定的 token 上。

新手讲解

"注意力汇"的概念现在可以完整理解了：

Softmax 的强制求和为 1 → 即使"不想关注任何人"，也必须把 100% 的注意力分配出去
多余的注意力要有个出口 → 模型找到了一个"垃圾桶"——初始 token
初始 token 是最佳"垃圾桶" → 因为它们在训练时对所有后续 token 都可见，被反复"练习"为接收废弃注意力的容器

"sink"这个词的含义：在英文里，sink 有"水槽"的意思——废水最终流进水槽。多余的注意力权重最终"流入"初始 token，初始 token 就是注意力的"水槽"。

原文（为什么是初始 token 而不是其他 token？）

Why do various autoregressive LLMs, such as Llama-2, MPT, Falcon, and Pythia, consistently focus on initial tokens as their attention sinks, rather than other tokens? Our explanation is straightforward: Due to the sequential nature of autoregressive language modeling, initial tokens are visible to all subsequent tokens, while later tokens are only visible to a limited set of subsequent tokens. As a result, initial tokens are more easily trained to serve as attention sinks, capturing unnecessary attention.

翻译

为什么 Llama-2、MPT、Falcon 和 Pythia 等各种自回归 LLM 都一致地将初始 token 作为注意力汇，而不是其他 token？我们的解释很直接：由于自回归语言建模的序列特性，初始 token 对所有后续 token 都是可见的，而后面的 token 只对有限的后续 token 可见。 因此，初始 token 更容易被训练为充当注意力汇，汇集不必要的注意力。

新手讲解

自回归模型的特性：生成第 N 个词时，可以看到第 1 到 N-1 个词。所以：
- 第 1 个词：被所有 token（1 到 N）"看见"，曝光度最高
- 第 2 个词：被第 2 到 N 个 token 看见
- 第 N-1 个词：只被第 N-1、N 这 2 个 token 看见

因为初始 token 曝光度最高，在训练的无数次更新中，它们逐渐被"优化"为吸收废弃注意力的容器——这是模型自发学到的一种"隐性技巧"，不是人为设计的。
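"曝光度"可以通过因果掩码直接数出来。下面的示意按"每个 token 也能看见自己"来计数，函数名为假设。

```python
def causal_mask(n):
    """mask[q][k] == 1 表示位置 q 的 query 可以看见位置 k 的 key（k <= q）。"""
    return [[1 if k <= q else 0 for k in range(n)] for q in range(n)]


def visibility_counts(n):
    """统计每个 key 位置被多少个 query 位置看见（对掩码按列求和）。"""
    mask = causal_mask(n)
    return [sum(mask[q][k] for q in range(n)) for k in range(n)]


if __name__ == "__main__":
    print(visibility_counts(8))
```

长度为 8 时结果是 [8, 7, 6, 5, 4, 3, 2, 1]：越靠前的 token，在一次前向计算中被用到的次数越多，得到的训练信号也越多。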

原文（需要多少个 sink token？）

We've noted that LLMs are typically trained to utilize multiple initial tokens as attention sinks rather than just one. As illustrated in Figure 2, the introduction of four initial tokens, as attention sinks, suffices to restore the LLM's performance. In contrast, adding just one or two doesn't achieve full recovery. We believe this pattern emerges because these models didn't include a consistent starting token across all input samples during pre-training.

翻译

我们注意到，LLM 通常被训练为使用多个初始 token 作为注意力汇，而不仅仅是一个。如图 2 所示，引入四个初始 token 作为注意力汇就足以恢复 LLM 的性能。相比之下，只添加一两个并不能完全恢复。我们认为这种模式的出现，是因为这些模型在预训练期间没有在所有输入样本中使用一致的起始 token。

关键实验数据（表 2，困惑度越低越好；缓存配置 x+y 表示 x 个初始 token 加 y 个最近 token；Llama-2-7B 一行使用总大小为 4096 的缓存）：

模型	0+2048（纯窗口）	1+2047	2+2046	4+2044	8+2040
Falcon-7B	17.90	12.12	12.12	12.12	12.12
MPT-7B	460.29	14.99	15.00	14.99	14.98
Pythia-12B	21.62	11.95	12.09	12.09	12.02

模型	0+4096（纯窗口）	1+4095	2+4094	4+4092	8+4088
Llama-2-7B	3359.95	11.88	10.51	9.59	9.54

新手讲解

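数据说明了两件事：
1. 需要几个 sink token 因模型而异：Falcon-7B、MPT-7B、Pythia-12B 加回 1 个初始 token 就基本恢复；Llama-2-7B 加 1 到 2 个仍未完全恢复（11.88、10.51），加到 4 个才降到 9.59
2. 4 个之后收益递减：从 4 增加到 8 几乎没有改善（Llama-2-7B 为 9.59→9.54），所以取 4 个是稳妥的通用设置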

为什么需要 4 个？因为现有模型在预训练时没有强制指定一个"专用 sink token"，模型学会了用开头几个（通常 4 个）token 分摊这个职责。如果预训练时就指定了专用 sink token（第 3.3 节），1 个就够了。

3.2 带注意力汇的滚动 KV 缓存

原文（StreamingLLM 的设计）

To enable LLM streaming in already trained LLMs, we propose a straightforward method that can recover window attention's perplexity without any model finetuning. Alongside the current sliding window tokens, we reintroduce a few starting tokens' KV in the attention computation. The KV cache in StreamingLLM can be conceptually divided into two parts: (1) Attention sinks (four initial tokens) stabilize the attention computation; (2) Rolling KV Cache retains the most recent tokens, crucial for language modeling.

翻译

为了在已经训练好的 LLM 上实现流式处理，我们提出了一种简单方法，可以在无需任何模型微调的情况下恢复窗口注意力的困惑度。在当前滑动窗口 token 的基础上，我们在注意力计算中重新引入少量起始 token 的 KV。StreamingLLM 的 KV 缓存概念上可以分为两部分：（1）注意力汇（四个初始 token）稳定注意力计算；（2）滚动 KV 缓存保留最近的 token，这对语言建模至关重要。

新手讲解

StreamingLLM 的缓存结构可视化：

完整文本：[Token 0, 1, 2, 3] ... [Token 980, 981, ... 1023] [Token 1024 正在生成]
                ↑ 注意力汇           ↑ 被丢弃的中间部分         ↑ 滚动窗口
             （始终保留）              （节省显存）               （保留最近内容）

StreamingLLM KV缓存：[T0 T1 T2 T3] + [T980 T981 ... T1023]
                      注意力汇 4个      最近 N 个 token

两个关键词：
- "重新引入"（reintroduce）：这 4 个初始 token 的 KV 始终保存着，不会被驱逐
- "无需微调"：整个方案只修改了 KV 缓存的管理策略，模型参数一点不变
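"保留开头若干个 + 最近 N 个"的缓存管理策略，可以用一个只记录 token 编号的玩具缓存来示意。这是在假设下写的草图，不是官方实现：真实系统缓存的是每一层的 Key/Value 张量，类名 SinkKVCache 和各参数名都是为说明而起的。

```python
from collections import deque


class SinkKVCache:
    """玩具版缓存：只记录"哪些 token 的 KV 还留着"，不存真实张量。"""

    def __init__(self, num_sinks=4, window=1020):
        self.num_sinks = num_sinks
        self.sinks = []                     # 开头的注意力汇，永不驱逐
        self.recent = deque(maxlen=window)  # 滚动窗口，满了自动丢弃最旧的一项

    def append(self, token_id):
        if len(self.sinks) < self.num_sinks:
            self.sinks.append(token_id)
        else:
            self.recent.append(token_id)

    def tokens(self):
        return self.sinks + list(self.recent)


if __name__ == "__main__":
    cache = SinkKVCache(num_sinks=4, window=3)
    for t in range(9):   # 依次处理 token 0..8
        cache.append(t)
    print(cache.tokens())
```

用 4 个 sink、窗口为 3 处理 token 0 到 8 后，缓存里剩下 [0, 1, 2, 3, 6, 7, 8]：token 4、5 被驱逐，缓存大小始终不超过 num_sinks + window。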

原文（位置编码的处理——关键技术细节）

StreamingLLM's design is versatile and can be seamlessly incorporated into any autoregressive language model that employs relative positional encoding, such as RoPE and ALiBi. When determining the relative distance and adding positional information to tokens, StreamingLLM focuses on positions within the cache rather than those in the original text. This distinction is crucial for StreamingLLM's performance. For instance, if the current cache has tokens [0, 1, 2, 3, 6, 7, 8] and is in the process of decoding the 9th token, the positions assigned are [0, 1, 2, 3, 4, 5, 6, 7], rather than the positions in the original text, which would be [0, 1, 2, 3, 6, 7, 8, 9].

翻译

StreamingLLM 的设计具有通用性，可以无缝集成到任何使用相对位置编码（如 RoPE 和 ALiBi）的自回归语言模型中。在确定相对距离并为 token 添加位置信息时，StreamingLLM 关注的是缓存中的位置，而非原始文本中的位置。 这一区分对 StreamingLLM 的性能至关重要。例如，如果当前缓存包含 token [0, 1, 2, 3, 6, 7, 8]，正在解码第 9 个 token，则分配的位置为 [0, 1, 2, 3, 4, 5, 6, 7]，而非原始文本中的位置 [0, 1, 2, 3, 6, 7, 8, 9]。

新手讲解

这是一个非常重要的工程细节，直接影响方法能否工作。

问题：相对位置编码（RoPE/ALiBi）让模型知道两个 token 之间相隔多远。如果 token 4 和 5 被丢弃了，缓存里的 token 6 该用哪个位置？

错误方式：沿用原文位置，token 6 仍是位置 6 → 缓存内的位置出现空洞；更严重的是，随着生成继续，当前 token 与开头 sink token 的相对距离会不断变大，最终超出预训练时见过的范围
正确方式（StreamingLLM）：按缓存内的顺序重新编号，token 6 得到缓存内位置 4（从 0 计数，即缓存里的第 5 个）→ 相对位置连续，最大距离不超过缓存大小，模型工作正常

类比：开会时有些人中途离场了。如果你用"原始座位号"来称呼大家，座位号会有空缺（1, 2, 3, 6, 7, 8）；如果你重新用"现在在场的第几位"来称呼，就是连续的（1, 2, 3, 4, 5, 6），模型处理起来更自然。
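位置重新编号的效果可以用论文中的例子直接验证。下面的函数名是为说明而假设的；真实实现中，这些位置会被送入 RoPE 或 ALiBi 的计算，这里只展示编号本身。

```python
def cache_positions(cached_token_ids):
    """StreamingLLM 的做法：位置由"在缓存中的顺序"决定，与原文编号无关。"""
    return list(range(len(cached_token_ids)))


def max_relative_distance(positions):
    """当前（最后一个）token 与最前面的 sink token 之间的距离。"""
    return positions[-1] - positions[0]


if __name__ == "__main__":
    # 论文中的例子：缓存里有 token [0, 1, 2, 3, 6, 7, 8]，正在解码原文编号为 9 的 token
    in_cache = [0, 1, 2, 3, 6, 7, 8, 9]
    print(cache_positions(in_cache))

    # 流了 100 万个 token 之后：原文位置的距离已远超训练窗口，缓存内位置仍然很小
    far = [0, 1, 2, 3, 999_998, 999_999, 1_000_000]
    print(max_relative_distance(far), max_relative_distance(cache_positions(far)))
```

第一行输出 [0, 1, 2, 3, 4, 5, 6, 7]，与论文给出的位置一致；第二行显示，沿用原文位置时最大距离是 1000000，而缓存内位置的最大距离只有 6。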

3.3 预训练时加入注意力汇 Token

原文（问题的根源分析）

As elaborated in Section 3.1, a significant reason for the model's excessive attention to multiple initial tokens is the absence of a designated sink token to offload excessive attention scores. Due to this, the model inadvertently designates globally visible tokens, primarily the initial ones, as attention sinks. A potential remedy can be the intentional inclusion of a global trainable attention sink token, denoted as a "Sink Token", which would serve as a repository for unnecessary attention scores.

翻译

如第 3.1 节所述，模型过度关注多个初始 token 的重要原因是缺少一个指定的 sink token 来卸载多余的注意力分数。因此，模型无意间将全局可见的 token（主要是初始 token）指定为注意力汇。一个潜在的解决方案是有意加入一个全局可训练的注意力汇 token，记为"Sink Token"，它将充当不必要注意力分数的存储库。

新手讲解

这一节是"更优雅的未来方案"。现有模型（如 Llama-2）没有专门设计 sink token，所以需要 4 个自然形成的初始 token 分担职责。如果在预训练时就加入一个专用 sink token，1 个就够了，而且模型还可以把注意力"汇"管理得更干净。

原文（验证实验）

For validation, we pre-train three language models with 160 million parameters from scratch under identical settings. The first model utilizes the standard SoftMax attention (Vanilla), the second replaced the regular attention mechanism with SoftMax₁ (Zero Sink), and one prepending a learnable placeholder token (Sink Token) in all training samples. As shown in Table 3, while the zero sink alleviates the attention sink problem to some extent, the model still relies on other initial tokens as attention sinks. Introducing a sink token is highly effective in stabilizing the attention mechanism. Simply pairing this sink token with recent tokens sufficiently anchors the model's performance.

翻译

为了验证，我们在相同条件下从头开始预训练三个 1.6 亿参数的语言模型。第一个模型使用标准 SoftMax 注意力（原版），第二个用 SoftMax₁ 替换了常规注意力机制（零汇），另一个在所有训练样本中前置一个可学习的占位符 token（Sink Token）。如表 3 所示，虽然零汇在一定程度上缓解了注意力汇问题，但模型仍然依赖其他初始 token 作为注意力汇。引入 sink token 在稳定注意力机制方面非常有效。只需将此 sink token 与最近的 token 配对，就足以锚定模型的性能。

关键实验数据（表 3，160M 参数模型）：

缓存配置	原版（Vanilla）PPL	零汇（Zero Sink）PPL	可学习 Sink Token PPL
0+1024（无初始token）	27.87	29214	1235
1+1023	18.49	19.90	18.01
2+1022	18.05	18.27	18.01
4+1020	18.05	18.01	18.02

新手讲解

表 3 的关键发现：
- 原版模型：需要 2 个以上初始 token 才能稳定（1+1023 时 PPL=18.49，不够好）
- 零汇（SoftMax₁）：在 Softmax 的分母中额外加 1，使注意力权重之和可以小于 1，相当于在序列前添加一个 Key 和 Value 全为零的"虚拟 token"。它只在一定程度上缓解问题：不保留初始 token 时 PPL 高达 29214，模型仍然依赖其他初始 token 作为注意力汇
- Sink Token（可学习占位符）：只需 1 个 sink token，PPL=18.01，与需要 4 个初始 token 的原版相当甚至更好

结论：论文建议未来的模型在预训练时就加入一个专用 sink token，这样流式部署时只需保留这 1 个 sink token 加上最近的 token，就能稳定困惑度。注意 sink token 本身仍然不能丢：在 0+1024 配置下，该模型的 PPL 仍高达 1235。
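SoftMax₁ 与"前置一个全零 token"的等价关系可以用数值验证。下面是一个最小示意，得分是假设的示例值；这里假设全零 Key 与任何 query 的点积为 0，所以虚拟 token 的得分是 0，在分母中贡献 exp(0) = 1。

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_one(scores):
    """SoftMax₁：exp(x_i) / (1 + sum_j exp(x_j))，数值稳定的写法。"""
    m = max(max(scores), 0.0)
    exps = [math.exp(s - m) for s in scores]
    denom = math.exp(-m) + sum(exps)  # exp(-m) 对应分母里的那个"1"
    return [e / denom for e in exps]


if __name__ == "__main__":
    scores = [-3.0, -2.0, -4.0]          # 假设当前 query 与所有 token 都不太匹配
    print(sum(softmax(scores)))           # 标准 Softmax：权重之和被迫为 1
    print(sum(softmax_one(scores)))       # SoftMax₁：权重之和可以远小于 1
    print(softmax([0.0] + scores)[1:])    # 前置一个得分为 0 的虚拟 token 后的其余权重
    print(softmax_one(scores))            # 与上一行在浮点误差内一致
```

由于虚拟 token 的 Value 也是零，它分走的那部分权重不会向输出贡献任何内容，这正是"零汇"名称的由来。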

五、实验结果（精读核心段落）

4.1 长文本语言建模实验

原文（主要结果）

Figure 3 illustrates that StreamingLLM can match the oracle baseline (sliding window with re-computation) in terms of perplexity on texts spanning 20K tokens. Meanwhile, the dense attention technique fails when the input length exceeds its pre-training window, and the window attention technique struggles when the input length surpasses the cache size, leading to the eviction of the initial tokens. [Further results show] StreamingLLM can reliably handle exceptionally extended texts, encompassing more than 4 million tokens, across a spectrum of model families and scales.

翻译

图 3 表明，StreamingLLM 在跨越 20K token 的文本上，困惑度可以媲美 oracle 基线（带重计算的滑动窗口）。同时，当输入长度超过预训练窗口时密集注意力失败，当输入长度超过缓存大小时窗口注意力因驱逐初始 token 而崩溃。[进一步结果显示] StreamingLLM 可以在各种模型家族和规模上可靠地处理超过 400 万 token 的超长文本。

新手讲解

三条线的对比（图 3 的直觉理解）：
1. 密集注意力：在预训练长度（如 4096）处性能开始下降，然后缓慢恶化
2. 窗口注意力：在缓存大小（如 2048）处性能断崖式崩溃（PPL 从 5 飙升到 5000+）
3. StreamingLLM：全程稳定，与"重计算+滑动窗口"（理想性能上限）几乎完全重合

在论文的实验中，StreamingLLM 在超过 400 万 token 的文本上仍能保持稳定的困惑度。

4.2 预训练 Sink Token 的结果

原文

Including a sink token during pre-training has no negative impact on model convergence and subsequent performance on a range of NLP benchmarks. As depicted in Figure 6, models trained with a sink token exhibit similar convergence dynamics compared to their vanilla counterparts. The model pre-trained with a sink token performs similarly to that trained using the vanilla approach [on 7 NLP benchmarks]. Streaming Performance: As illustrated in Table 3, the vanilla model requires the addition of multiple tokens as attention sinks to maintain stable streaming perplexity. In contrast, the model trained with a sink token achieves satisfactory streaming performance using just the sink token.

翻译

在预训练期间加入 sink token 对模型收敛和随后在一系列 NLP 基准上的性能没有负面影响。如图 6 所示，使用 sink token 训练的模型与其普通版本表现出相似的收敛动态。在 7 个 NLP 基准上，预训练了 sink token 的模型与普通方式训练的模型性能相近。流式性能方面：如表 3 所示，普通模型需要添加多个 token 作为注意力汇才能维持稳定的流式困惑度；相比之下，使用 sink token 训练的模型只需使用该 sink token 即可达到令人满意的流式性能。

新手讲解

这个实验消除了最后的顾虑：会不会加了 sink token 之后，正常的 NLP 能力反而下降了？答案是否定的——7 个基准测试全部保持正常甚至略有提升（如 ARC-C：18.6→19.6，ARC-E：45.2→45.6）。既稳定了流式性能，又没有牺牲正常能力，一举两得。

4.3 流式问答实验

原文

To show StreamingLLM's real-world applicability, we emulate multi-round question-answering using instruction-tuned LLMs. We first concatenate all question-answer pairs from the ARC-[Challenge, Easy] datasets, feed the continuous stream to Llama-2-[7,13,70]B-Chat models. As table 5 indicates, dense attention results in Out-of-Memory (OOM) errors, showing it unsuitable for this setting. While the window attention method works efficiently, it exhibits low accuracy due to random outputs when the input length exceeds the cache size. Conversely, StreamingLLM excels by efficiently handling the streaming format, aligning with the one-shot, sample-by-sample baseline accuracy.

翻译

为了展示 StreamingLLM 的实际应用价值，我们使用指令调优的 LLM 模拟多轮问答。我们将 ARC-[Challenge, Easy] 数据集的所有问答对拼接起来，以连续流的形式送入 Llama-2-[7,13,70]B-Chat 模型。如表 5 所示，密集注意力因内存溢出（OOM）而失败，表明其不适合这种场景。窗口注意力虽然运行效率高，但当输入长度超过缓存大小时，由于输出随机，准确率极低。相比之下，StreamingLLM 在高效处理流式格式的同时，准确率与 one-shot、逐样本推理的基线相当。

关键数据（表 5，流式问答准确率 %）：

方法	Llama-2-7B-Chat ARC-E	Llama-2-13B-Chat ARC-E	Llama-2-70B-Chat ARC-E
逐样本基线（参照）	71.25	78.16	91.29
密集注意力	OOM（内存溢出）	OOM	OOM
窗口注意力	3.58	0.25	0.12
StreamingLLM	71.34	80.89	91.37

新手讲解

这组数据最有说服力：
- 密集注意力：直接显存溢出，连运行都运行不了
- 窗口注意力：准确率 3.58%，远低于四选一随机猜测的约 25%，说明输入超过缓存大小后，模型输出的已是无意义内容，而不是在"猜"选项
- StreamingLLM：71.34%，与逐样本基线（71.25%）基本持平

StreamingLLM 不只是"能用"：在这组实验中，它相对逐样本基线几乎没有精度损失。

4.5 效率实验

原文

As the cache size increases, StreamingLLM's decoding speed demonstrates a linear growth. The sliding window with re-computation baseline has a quadratic rise in decoding latency. Thus, StreamingLLM achieves an impressive speedup, reaching up to 22.2× per token. Despite its reduced latency, StreamingLLM sustains a memory footprint consistent with the re-computation baseline.

翻译

随着缓存大小的增加，StreamingLLM 的解码延迟呈线性增长（原文写作 decoding speed，结合下一句的对比，指的是延迟）。而带重计算的滑动窗口基线的解码延迟则呈二次方增长。因此，StreamingLLM 实现了令人印象深刻的加速，每个 token 最高达 22.2 倍。尽管延迟更低，StreamingLLM 的内存占用与重计算基线保持一致。

新手讲解

效率对比：
- 重计算基线：每生成一个词都要重新计算最近 L 个词的注意力，复杂度 O(L²)，缓存越大越慢
- StreamingLLM：只需用缓存里的 KV 做一次注意力计算，复杂度 O(L)，线性的

22.2 倍加速的含义：在论文测得的最有利配置下，如果重计算基线生成一段文本需要 22.2 秒，StreamingLLM 只需约 1 秒。这对实时对话应用来说是决定性优势。

六、Conclusion（结论）精译

原文

Deploying LLMs in streaming applications is urgently needed but comes with challenges due to efficiency limitations and reduced performance with longer texts. Window attention provides a partial solution, but its performance plummets when initial tokens are excluded. Recognizing the role of these tokens as "attention sinks", we introduced StreamingLLM —a simple and efficient framework that enables LLMs to handle unlimited texts without fine-tuning. By adding attention sinks with recent tokens, StreamingLLM can efficiently model texts of up to 4 million tokens. We further show that pre-training models with a dedicated sink token can improve the streaming performance. StreamingLLM firstly decouples the LLM's pre-training window size and its actual text generation length, paving the way for the streaming deployment of LLMs.

翻译

在流式应用中部署 LLM 是迫切需要的，但由于效率限制和在较长文本上性能下降而面临挑战。窗口注意力提供了部分解决方案，但当初始 token 被排除时，其性能会急剧下降。认识到这些 token 作为"注意力汇"的作用，我们引入了 StreamingLLM——一个简单高效的框架，使 LLM 能够处理无限长度的文本而无需微调。通过将注意力汇与最近的 token 结合，StreamingLLM 可以高效地对多达 400 万 token 的文本进行建模。我们进一步表明，使用专用 sink token 预训练模型可以改善流式性能。StreamingLLM 首次将 LLM 的预训练窗口大小与其实际文本生成长度解耦，为 LLM 的流式部署铺平了道路。

新手讲解

"解耦预训练长度和实际生成长度"——这句话是全文的终极贡献。之前的逻辑是：你训练时用了多长的窗口，你就只能处理多长的文本。StreamingLLM 打破了这个束缚：模型训练时的窗口大小不再是部署时的天花板。 不管预训练窗口是 4K 还是 8K，你都可以用 StreamingLLM 处理无限长的输入流。

七、相关工作与附录（一句话概述）

相关工作：论文将相关工作分为三类——（1）长度外推（RoPE、ALiBi 等位置编码改进）；（2）上下文窗口扩展（FlashAttention、位置插值微调等）；（3）提升长文本利用率。StreamingLLM 主要属于第一类，专注于超越预训练窗口的无限流式生成，与后两类正交、可互补。

附录：包含更多模型的注意力可视化（Llama-2-70B、BERT 等编码器模型同样存在注意力汇）、更长序列（4M token）的困惑度曲线、消融实验（不同缓存大小对性能的影响）以及 StreamEval 基准的详细结果。

八、核心贡献总结

维度	内容
核心发现	注意力汇（Attention Sink）：初始 token 被模型用作 Softmax 的"废弃注意力存放处"，与语义无关，与位置有关
核心方法	始终保留 4 个初始 token（注意力汇）+ 最近 N 个 token（滑动窗口），中间全部丢弃
位置编码技巧	使用缓存内位置而非原文绝对位置，保证 RoPE/ALiBi 的连续性
预训练优化	在每个训练样本开头加专用可学习 Sink Token，使 1 个 sink token 即可替代 4 个
不需要微调	即插即用，直接用于 Llama-2、MPT、Falcon、Pythia 等已有模型
性能	4M token 稳定生成，流式 QA 准确率与逐样本基线相当，比重计算基线最高快 22.2 倍
理论意义	首次解耦 LLM 预训练窗口大小与实际文本生成长度

九、方法对比速查表

方法	显存	速度	超长文本性能	需要微调
密集注意力	线性增长，会爆	二次方变慢	超过训练长度就崩	否
窗口注意力	恒定	快	丢第一个token就崩	否
滑动窗口+重计算	恒定	慢（单 token 延迟最高约为 StreamingLLM 的 22.2 倍）	稳定	否
StreamingLLM	恒定	快	稳定（4M+）	否

十、延伸思考：局限性与后续工作

无法利用中间历史：StreamingLLM 丢弃了中间 token，意味着它不能"记住"很久以前发生的事情（只有最近 N 个 token 是完整保留的）。这与 RAG、MemGPT 等外部记忆方案互补。
预训练阶段的改进：论文建议未来的 LLM 在预训练时为所有样本加入专用 Sink Token，以便更好地支持流式部署。
与长上下文技术互补：StreamingLLM 与 LongLoRA（精读-06）、YaRN（精读-05）等长上下文扩展技术正交——后者扩大了"最近 N 个 token"的窗口上限，与 StreamingLLM 结合可以进一步提升流式处理能力。
与 RAG 的关系：StreamingLLM 解决的是"模型当前上下文的稳定性"问题，而 RAG 解决的是"从海量历史中检索相关内容"问题。两者可以协同工作：StreamingLLM 保证当前对话不崩，RAG 从数据库中拉回远古记忆。

字数统计：正文约 8500 字
覆盖章节：Abstract（全部）、Introduction（核心段落）、Section 3.1-3.3（全部段落，完整精读）、Section 4.1/4.2/4.3/4.5（核心段落）、Conclusion（全部）