精读笔记：ViT — An Image is Worth 16×16 Words

论文基本信息

项目	内容
论文全名	An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
arXiv	2010.11929
发表会议	ICLR 2021
作者机构	Google Research, Brain Team
通讯作者	Alexey Dosovitskiy, Neil Houlsby（等量技术贡献）
代码	https://github.com/google-research/vision_transformer

阅读地图

The suggested reading path has six stops:

Abstract（摘要）
  ↓
Introduction（引言）—— 理解"为什么要做这件事"
  ↓
Section 3.1 Vision Transformer (ViT) —— 核心模型结构（最重要）
  ↓
Section 3.2 Fine-tuning —— 如何迁移到下游任务
  ↓
Section 4.3 Pre-training Data Requirements —— 归纳偏置 vs 数据量的关键实验
  ↓
Section 4.2 Comparison to State of the Art —— 与 CNN 的终极对比

核心思想一句话：把一张图片切成 16×16 像素的小块（patch），把每个小块当成一个"词"，然后直接用 NLP 里的 Transformer 来处理这个"词序列"，在大规模数据预训练后，效果超越了 ResNet 等 CNN 模型。

一、Abstract（摘要）

原文

While the Transformer architecture has become the de-facto standard for natural language processing tasks, its applications to computer vision remain limited. In vision, attention is either applied in conjunction with convolutional networks, or used to replace certain components of convolutional networks while keeping their overall structure in place. We show that this reliance on CNNs is not necessary and a pure transformer applied directly to sequences of image patches can perform very well on image classification tasks. When pre-trained on large amounts of data and transferred to multiple mid-sized or small image recognition benchmarks (ImageNet, CIFAR-100, VTAB, etc.), Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.

翻译

Transformer 架构已经成为自然语言处理任务的事实标准，但它在计算机视觉中的应用仍然有限。在视觉领域，注意力机制要么与卷积网络结合使用，要么用于替换卷积网络的某些组件，同时保持其整体结构不变。我们证明，这种对 CNN 的依赖并非必要——将纯 Transformer 直接应用于图像块（patch）序列，就可以在图像分类任务上取得非常好的效果。当在大量数据上预训练并迁移到多个中等规模或小规模图像识别基准（ImageNet、CIFAR-100、VTAB 等）时，Vision Transformer（ViT）与最先进的卷积网络相比取得了卓越的结果，同时所需的训练计算资源显著减少。

新手讲解

这段话在说什么？

想象一下，NLP（自然语言处理）领域有一把"神器"叫 Transformer，它靠处理"词序列"来理解文本，效果非常好。但图像不是文字，怎么用这把神器来处理图片？

以前的研究要么把 Transformer 和 CNN（卷积神经网络，专门处理图像的老牌架构）拼在一起用，要么用 Transformer 替换 CNN 里某些零件，但整体还是 CNN 的框架。

这篇论文说：我们不需要 CNN，直接把图片变成"词序列"，送进 Transformer，就能搞定图像分类！ 而且效果还更好、训练更省算力。

关键词解释：
- Transformer：一种基于"注意力机制"的神经网络架构，2017 年由 Vaswani 等人提出，最初用于机器翻译，后来成为 NLP 的主流。
- patch（图像块）：把图片切成若干小方块，每个方块叫一个 patch。这是 ViT 的核心创意。
- image classification（图像分类）：给模型一张图，让它判断图里是猫还是狗。

二、Introduction（引言）

原文段落 1

Self-attention-based architectures, in particular Transformers (Vaswani et al., 2017), have become the model of choice in natural language processing (NLP). The dominant approach is to pre-train on a large text corpus and then fine-tune on a smaller task-specific dataset (Devlin et al., 2019). Thanks to Transformers' computational efficiency and scalability, it has become possible to train models of unprecedented size, with over 100B parameters (Brown et al., 2020; Lepikhin et al., 2020). With the models and datasets growing, there is still no sign of saturating performance.

翻译

基于自注意力的架构，尤其是 Transformer，已经成为 NLP 领域的首选模型。主流做法是先在大型文本语料库上预训练，然后在较小的特定任务数据集上微调。得益于 Transformer 的计算效率和可扩展性，训练超过 1000 亿参数的超大模型已成为可能。随着模型和数据集的不断增大，性能仍没有饱和的迹象。

新手讲解

NLP 领域的成功路线是：先在海量文本上预训练（学通用知识）→ 再在小数据集上微调（学具体任务）。这条路线在文本上屡试不爽，越大的模型效果越好，且还没有天花板。

作者的问题意识：这套方法能直接搬到图像上吗？

原文段落 2

In computer vision, however, convolutional architectures remain dominant (LeCun et al., 1989; Krizhevsky et al., 2012; He et al., 2016). Inspired by NLP successes, multiple works try combining CNN-like architectures with self-attention (Wang et al., 2018; Carion et al., 2020), some replacing the convolutions entirely (Ramachandran et al., 2019; Wang et al., 2020a). The latter models, while theoretically efficient, have not yet been scaled effectively on modern hardware accelerators due to the use of specialized attention patterns. Therefore, in large-scale image recognition, classic ResNet-like architectures are still state of the art.

翻译

然而在计算机视觉中，卷积架构依然占主导地位。受 NLP 成功的启发，多项工作尝试将类 CNN 架构与自注意力结合，有些甚至完全替换卷积。但后者在现代硬件加速器上由于使用了特殊的注意力模式而无法有效扩展。因此，在大规模图像识别中，经典的 ResNet 类架构仍然是最先进的。

新手讲解

为什么之前的人没有成功？

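Earlier attempts to apply attention to images ran into a brute-force cost problem. If every pixel attends to every other pixel, a 224×224 image has 224² = 50,176 pixels, so a single attention map needs 50,176² ≈ 2.5 billion pairwise scores (the cost grows with the square of the pixel count).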

所以人们要么用"局部注意力"（只看周围一小片），要么用复杂的稀疏注意力，但这些"特殊设计"在 GPU/TPU 上跑起来反而效率低，所以还没有打败 ResNet。
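The quadratic cost described above can be checked with a few lines of arithmetic. This is only an illustrative sketch: `attention_pairs` is a hypothetical helper that counts query-key scores in one attention map and ignores heads, layers, and every other cost.

```python
def attention_pairs(num_tokens):
    """Number of query-key scores in one full self-attention map."""
    return num_tokens * num_tokens


pixel_tokens = 224 * 224            # one token per pixel
patch_tokens = (224 // 16) ** 2     # one token per 16x16 patch

pixel_pairs = attention_pairs(pixel_tokens)
patch_pairs = attention_pairs(patch_tokens)

print("pixel-level tokens:", pixel_tokens, "pairs:", pixel_pairs)
print("patch-level tokens:", patch_tokens, "pairs:", patch_pairs)
print("reduction factor:", pixel_pairs // patch_pairs)
```

Grouping pixels into 16×16 patches cuts the token count by 256 and the number of attention scores by 256² = 65,536, which is what makes global attention affordable on ordinary accelerators.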

ResNet 就是深度残差网络（He et al., 2016），是当时图像识别的"王者"。

原文段落 3（核心创意阐述）

Inspired by the Transformer scaling successes in NLP, we experiment with applying a standard Transformer directly to images, with the fewest possible modifications. To do so, we split an image into patches and provide the sequence of linear embeddings of these patches as an input to a Transformer. Image patches are treated the same way as tokens (words) in an NLP application. We train the model on image classification in supervised fashion.

翻译

受 NLP 中 Transformer 扩展成功的启发，我们尝试将标准 Transformer 尽量少地修改直接应用于图像。为此，我们将图像切分成多个小块（patch），并将这些小块的线性嵌入序列作为输入提供给 Transformer。图像小块的处理方式与 NLP 应用中的 token（词）完全相同。我们以监督学习的方式在图像分类任务上训练模型。

新手讲解

ViT 的核心创意——"把图片当句子读"

这就是全文最关键的一句话，用一个类比彻底理解它：

类比：图片 = 文章，patch = 词

NLP 里处理文本	ViT 里处理图像
把句子切成一个个词	把图片切成一个个小块（patch）
每个词变成词向量（embedding）	每个 patch 压平后变成向量（patch embedding）
词的序列送入 Transformer	patch 的序列送入 Transformer
Transformer 学词与词之间的关系	Transformer 学 patch 与 patch 之间的关系

具体操作：一张 224×224 的彩色图片，切成每块 16×16 像素的小块，可以切出 (224/16)×(224/16) = 14×14 = 196 个 patch。这 196 个 patch 就是 196 个"词"，形成一个长度为 196 的序列，送入 Transformer！

这个想法极其简单，但在大数据下威力惊人。

原文段落 4（问题引出）

When trained on mid-sized datasets such as ImageNet without strong regularization, these models yield modest accuracies of a few percentage points below ResNets of comparable size. This seemingly discouraging outcome may be expected: Transformers lack some of the inductive biases inherent to CNNs, such as translation equivariance and locality, and therefore do not generalize well when trained on insufficient amounts of data.

翻译

当在 ImageNet 这样的中等规模数据集上训练而不使用强正则化时，这些模型的准确率比相近规模的 ResNet 低几个百分点。这一看似令人沮丧的结果其实在意料之中：Transformer 缺乏 CNN 固有的某些归纳偏置，例如平移等变性和局部性，因此在数据量不足时泛化效果不好。

新手讲解

归纳偏置（Inductive Bias）是什么？

归纳偏置，就是模型在学习之前，设计者已经"内置"的关于数据结构的假设。这些假设让模型更容易从少量数据中学到有用的规律。

CNN 有两个强大的归纳偏置：
1. 局部性（Locality）：图片中相邻的像素更可能有关联（比如一块皮毛的颜色是连续的）。CNN 的卷积核只看局部小区域，天然利用了这一点。
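2. Translation equivariance: shifting the input shifts the feature map by the same amount, because the same convolution kernel is slid across every position. A cat on the left and a cat on the right therefore trigger the same filters, just at different locations.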
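Both properties can be seen in a toy 1D example. The sketch below assumes a circular (wrap-around) convolution so that the equality is exact; real CNN layers use padding, which makes the property approximate near the borders. The function name `circular_conv1d` is made up for this illustration.

```python
import numpy as np


def circular_conv1d(signal, kernel):
    """Slide one shared kernel over every position (wrap-around borders).

    Each output value depends only on len(kernel) neighbouring inputs
    (locality), and the same weights are reused everywhere (weight sharing).
    """
    n, k = len(signal), len(kernel)
    return np.array([
        sum(kernel[j] * signal[(i + j) % n] for j in range(k))
        for i in range(n)
    ])


signal = np.array([0.0, 0.0, 1.0, 3.0, 1.0, 0.0, 0.0, 0.0])
kernel = np.array([-1.0, 0.0, 1.0])   # a simple edge detector
shift = 3

conv_then_shift = np.roll(circular_conv1d(signal, kernel), shift)
shift_then_conv = circular_conv1d(np.roll(signal, shift), kernel)

# Equivariance: shifting the input and then convolving gives the same result
# as convolving and then shifting the output.
equivariant = np.allclose(conv_then_shift, shift_then_conv)
print("translation equivariant:", equivariant)
```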

ViT 没有这些内置假设——它把每个 patch 当成独立的 token，不知道哪两个 patch 是相邻的，也不知道"相邻比远距离更重要"。所以在数据量少的时候，ViT 需要从头学这些规律，容易过拟合，效果不如 CNN。

打个比方：CNN 就像一个从小就学过"看图说话"的孩子，认识图片的"语法规则"；ViT 就像一个只读过书（文本）的成年人，第一次看图，需要大量图片才能学会怎么"读图"。

原文段落 5（反转——大数据下反超）

However, the picture changes if the models are trained on larger datasets (14M-300M images). We find that large scale training trumps inductive bias. Our Vision Transformer (ViT) attains excellent results when pre-trained at sufficient scale and transferred to tasks with fewer datapoints. When pre-trained on the public ImageNet-21k dataset or the in-house JFT-300M dataset, ViT approaches or beats state of the art on multiple image recognition benchmarks. In particular, the best model reaches the accuracy of 88.55% on ImageNet, 90.72% on ImageNet-ReaL, 94.55% on CIFAR-100, and 77.63% on the VTAB suite of 19 tasks.

翻译

然而，当模型在更大的数据集（1400万至3亿张图像）上训练时，情况发生了变化。我们发现，大规模训练可以压倒归纳偏置。ViT 在足够大规模上预训练后，迁移到数据较少的任务上同样取得了卓越效果。当在公开的 ImageNet-21k 数据集或内部的 JFT-300M 数据集上预训练时，ViT 在多个图像识别基准上接近或超越了最先进水平。其中最佳模型在 ImageNet 上达到 88.55%，在 ImageNet-ReaL 上达到 90.72%，在 CIFAR-100 上达到 94.55%，在 VTAB 的 19 项任务套件上达到 77.63%。

新手讲解

数据量可以弥补归纳偏置的缺失！

这是本文最重要的发现之一：

少数据（ImageNet，130万张）：ViT 输给 ResNet。归纳偏置很重要。
中等数据（ImageNet-21k，1400万张）：ViT 与 ResNet 持平。
大数据（JFT-300M，3亿张）：ViT 赢了！大数据让 ViT 自己"学会"了 CNN 内置的那些图像规律。

关键数据集介绍：
- ImageNet：130万张图，1000个类别，是最常用的图像识别基准。
- ImageNet-21k：ImageNet 的超集，21000个类别，约1400万张图，公开数据集。
- JFT-300M：Google 内部数据集，3亿张图，18000个类别，不公开。

三、Method 章节（第3节）——ViT 模型结构详解

3.1 Vision Transformer (ViT)

原文（结构总览）

An overview of the model is depicted in Figure 1. The standard Transformer receives as input a 1D sequence of token embeddings. To handle 2D images, we reshape the image x ∈ R^{H×W×C} into a sequence of flattened 2D patches x_p ∈ R^{N×(P²·C)}, where (H, W) is the resolution of the original image, C is the number of channels, (P, P) is the resolution of each image patch, and N = HW/P² is the resulting number of patches, which also serves as the effective input sequence length for the Transformer.

翻译

模型概览如图 1 所示。标准 Transformer 接收一维 token 嵌入序列作为输入。为了处理二维图像，我们将图像 x（形状为 H×W×C）重塑为展平的二维 patch 序列 x_p（形状为 N×(P²·C)），其中 (H, W) 是原始图像的分辨率，C 是通道数，(P, P) 是每个图像 patch 的分辨率，N = HW/P² 是生成的 patch 数量，它也是 Transformer 的有效输入序列长度。

新手讲解

第一步：图像切块（Patch Embedding）

用具体数字来理解：

输入图像：224 × 224 × 3（高×宽×RGB三通道）

切块设置：patch 大小 = 16×16 像素

patch 数量：N = (224/16) × (224/16) = 14 × 14 = 196 个 patch

每个 patch 的原始大小：16 × 16 × 3 = 768 个数字

展平后：每个 patch 是一个长度为 768 的向量

所以原来是一张图，现在变成了 196 个向量，就像一段有 196 个词的句子！
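The patch-splitting step above can be sketched with plain array reshapes. This is an illustration in NumPy, not the official implementation; `patchify` is a hypothetical helper name, and the input is a dummy array standing in for a real image.

```python
import numpy as np


def patchify(image, patch_size):
    """Split an (H, W, C) image into N flattened patches of length P*P*C."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    grid_h, grid_w = h // patch_size, w // patch_size
    # (grid_h, P, grid_w, P, C) -> (grid_h, grid_w, P, P, C)
    patches = image.reshape(grid_h, patch_size, grid_w, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    # Row-by-row order: patch index = grid_row * grid_w + grid_col
    return patches.reshape(grid_h * grid_w, patch_size * patch_size * c)


image = np.arange(224 * 224 * 3, dtype=np.float32).reshape(224, 224, 3)
patches = patchify(image, patch_size=16)
print(patches.shape)
```

The result has one row per patch, so the 224×224×3 image becomes a 196×768 matrix, matching the numbers above.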

注意命名规则：论文里把不同大小的 ViT 简写为 ViT-B/16、ViT-L/32 等。斜杠后面的数字就是 patch 大小。比如：
- ViT-B/16：Base 大小，16×16 patch，每张图产生 196 个 patch
- ViT-H/14：Huge 大小，14×14 patch，每张图产生 256 个 patch（更细，计算量更大）

原文（线性投影 + 位置编码 + class token）

The Transformer uses constant latent vector size D through all of its layers, so we flatten the patches and map to D dimensions with a trainable linear projection (Eq. 1). We refer to the output of this projection as the patch embeddings.

Similar to BERT's [class] token, we prepend a learnable embedding to the sequence of embedded patches (z_0^0 = x_class), whose state at the output of the Transformer encoder (z_L^0) serves as the image representation y (Eq. 4). Both during pre-training and fine-tuning, a classification head is attached to z_L^0.

Position embeddings are added to the patch embeddings to retain positional information. We use standard learnable 1D position embeddings, since we have not observed significant performance gains from using more advanced 2D-aware position embeddings (Appendix D.4). The resulting sequence of embedding vectors serves as input to the encoder.

翻译

Transformer 在所有层中使用固定的隐向量维度 D，因此我们将 patch 展平后，用一个可训练的线性投影将其映射到 D 维（公式 1）。我们将这个投影的输出称为 patch 嵌入。

类似于 BERT 的 [class] token，我们在嵌入 patch 序列前面添加一个可学习的嵌入（z₀⁰ = x_class），其在 Transformer 编码器输出端的状态（z₀^L）作为图像表示 y（公式 4）。在预训练和微调阶段，分类头都被附加到 z₀^L 上。

位置嵌入被加到 patch 嵌入上以保留位置信息。我们使用标准的可学习一维位置嵌入，因为我们没有观察到更先进的二维位置嵌入带来的显著性能提升（附录 D.4）。最终的嵌入向量序列作为编码器的输入。

新手讲解

第二步：三个关键组件

（1）线性投影（Linear Projection）

每个 patch 展平后是 768 维的向量（对于 16×16×3），但 Transformer 需要统一的维度 D（比如 ViT-Base 用 D=768，ViT-Large 用 D=1024）。于是用一个全连接层（线性变换）把每个 patch 投影到 D 维。这就像"把每个词变成词向量"。

（2）[class] token（分类标记）

这个设计来自 BERT（NLP 界的著名模型）。

问题：处理完 196 个 patch 后，Transformer 会输出 196 个向量，哪个向量代表整张图片的语义？

解决方案：在序列最前面加一个特殊的可学习向量，叫 [class] token（分类标记，通常记为 [CLS]）。这样序列就变成了 197 个元素（1个CLS + 196个patch）。

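After the Transformer has run, the output vector at the CLS position is treated as the representation of the whole image and passed to the classification head. In the paper this head is an MLP with one hidden layer during pre-training and a single linear layer during fine-tuning.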

类比：就像在一篇文章最前面加一个"总结词"——它在阅读全文后汇总所有信息，最后这个"总结词"的状态就代表了对整篇文章的理解。

（3）Position Embedding（位置嵌入）

问题：Transformer 本身不知道序列元素的顺序。如果把 196 个 patch 打乱顺序，没有位置信息的 Transformer 会给出完全相同的结果——这显然不行，因为"左上角的天空"和"右下角的地面"语义完全不同。

解决方案：给每个 patch 加一个"位置编号"，把位置信息编码成向量，加到 patch 嵌入上。

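Analogy: it is like numbering the pages of a book, so the Transformer knows "this is patch no. 17". With row-by-row numbering on the 14×14 grid that is row 2, column 3, but the model only receives the index and has to learn what it means spatially.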

论文发现：用最简单的一维可学习位置嵌入（直接给196个位置编号1~196）就够了，更复杂的二维位置嵌入（分别编码行号和列号）提升不明显。这说明 Transformer 通过大量数据，自己就能学会位置之间的二维拓扑关系。
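The three components combine into the encoder input as in the following sketch. Random arrays stand in for real patches and trained weights, and the initialization choices (scale 0.02, a zero [class] token) are assumptions made only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N, patch_dim, D = 196, 16 * 16 * 3, 768   # ViT-B/16 on a 224x224 RGB image

patches = rng.normal(size=(N, patch_dim))            # stand-in for flattened patches
E = rng.normal(scale=0.02, size=(patch_dim, D))      # linear projection (trainable)
x_class = np.zeros((1, D))                           # [class] token (trainable)
E_pos = rng.normal(scale=0.02, size=(N + 1, D))      # 1D position embeddings (trainable)

patch_embeddings = patches @ E                       # (196, 768)
z0 = np.concatenate([x_class, patch_embeddings], axis=0) + E_pos
print(z0.shape)
```

Position embeddings are added, not concatenated, so the width stays at D and only the sequence grows from 196 to 197.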

原文（公式汇总 + Transformer Encoder 结构）

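z_0 = [x_{class};\, x_p^1 E;\, x_p^2 E;\, \cdots;\, x_p^N E] + E_{pos}, \quad E \in \mathbb{R}^{(P^2 \cdot C) \times D},\ E_{pos} \in \mathbb{R}^{(N+1) \times D} \quad (1)
z'_\ell = \mathrm{MSA}(\mathrm{LN}(z_{\ell-1})) + z_{\ell-1}, \quad \ell = 1 \ldots L \quad (2)
z_\ell = \mathrm{MLP}(\mathrm{LN}(z'_\ell)) + z'_\ell, \quad \ell = 1 \ldots L \quad (3)
y = \mathrm{LN}(z_L^0) \quad (4)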

The Transformer encoder (Vaswani et al., 2017) consists of alternating layers of multiheaded self-attention (MSA, see Appendix A) and MLP blocks (Eq. 2, 3). Layernorm (LN) is applied before every block, and residual connections after every block.

翻译

（公式 1）输入序列 = [CLS token; patch1 投影; patch2 投影; ...; patchN 投影] + 位置嵌入

（公式 2）每层先做 LayerNorm，再做多头自注意力（MSA），再加残差连接

（公式 3）每层再做 LayerNorm，再过 MLP，再加残差连接

（公式 4）最终输出 = 对 CLS token 的最后一层表示做 LayerNorm

Transformer 编码器由多头自注意力（MSA）和 MLP 块交替叠加组成。每个块之前做 LayerNorm，每个块之后加残差连接。

新手讲解

ViT 完整的数据流（以 ViT-Base/16 处理 224×224 图片为例）

输入图像 [224, 224, 3]
    ↓ 切块
196个 patch，每个 [16, 16, 3] = 768 维向量
    ↓ 线性投影（E 矩阵）
196个 patch embedding，每个 768 维
    ↓ 拼接 CLS token
序列长度 197，每个元素 768 维
    ↓ 加位置嵌入（E_pos）
197 × 768 的输入矩阵
    ↓ 进入 Transformer Encoder（重复 L=12 层）
    每层：LayerNorm → 多头自注意力（12个头）→ 残差 → LayerNorm → MLP → 残差
    ↓
197 × 768 的输出矩阵
    ↓ 取第0位（CLS token）
768 维图像表示向量
    ↓ MLP 分类头
1000维（ImageNet 1000类）→ 预测结果

多头自注意力（MSA）的直觉：每个 patch 可以"看"所有其他 patch，计算"我对你有多感兴趣"的权重，然后加权求和得到新的表示。这样 patch 之间的长距离依赖（比如左上角的天空和右上角的云）可以直接建立联系，而 CNN 需要很多层才能建立远距离联系。

MLP 块：两层全连接网络，中间用 GELU 激活函数，隐层维度是 D 的4倍（ViT-Base 是 3072）。

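LayerNorm (layer normalization): applied before both the self-attention block and the MLP block (pre-norm), which keeps training more stable.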

残差连接：每层的输入直接加到输出上（类似 ResNet），防止梯度消失。
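One encoder layer (Eq. 2 and Eq. 3) can be sketched in NumPy as follows. This is a simplified illustration, not the reference code: biases and the learnable LayerNorm scale and shift are omitted, GELU uses the tanh approximation, and the weights are random placeholders.

```python
import numpy as np


def layer_norm(x, eps=1e-6):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)   # learnable scale/shift omitted


def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))


def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)


def msa(x, w_qkv, w_out, num_heads):
    """Multi-head self-attention over a (tokens, dim) sequence."""
    n, d = x.shape
    head_dim = d // num_heads

    def split_heads(t):                       # (n, d) -> (heads, n, head_dim)
        return t.reshape(n, num_heads, head_dim).transpose(1, 0, 2)

    q, k, v = (split_heads(t) for t in np.split(x @ w_qkv, 3, axis=-1))
    weights = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(head_dim))  # (heads, n, n)
    merged = (weights @ v).transpose(1, 0, 2).reshape(n, d)
    return merged @ w_out


def encoder_block(z, params, num_heads):
    z = msa(layer_norm(z), params["w_qkv"], params["w_out"], num_heads) + z   # Eq. 2
    hidden = gelu(layer_norm(z) @ params["w1"])
    return hidden @ params["w2"] + z                                           # Eq. 3


rng = np.random.default_rng(0)
N, D, heads = 197, 768, 12                    # ViT-B/16 sizes
params = {
    "w_qkv": rng.normal(scale=0.02, size=(D, 3 * D)),
    "w_out": rng.normal(scale=0.02, size=(D, D)),
    "w1": rng.normal(scale=0.02, size=(D, 4 * D)),
    "w2": rng.normal(scale=0.02, size=(4 * D, D)),
}
z = rng.normal(size=(N, D))
out = encoder_block(z, params, heads)
print(out.shape)
```

The block keeps the 197×768 shape, which is why L identical layers can be stacked. Because nothing inside the block depends on token order, shuffling the input rows simply shuffles the output rows; this is the reason position embeddings have to be added before the first layer.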

原文（模型变体规格）

We base ViT configurations on those used for BERT (Devlin et al., 2019), as summarized in Table 1. The "Base" and "Large" models are directly adopted from BERT and we add the larger "Huge" model. In what follows we use brief notation to indicate the model size and the input patch size: for instance, ViT-L/16 means the "Large" variant with 16×16 input patch size. Note that the Transformer's sequence length is inversely proportional to the square of the patch size, thus models with smaller patch size are computationally more expensive.

翻译

我们基于 BERT 使用的配置来设计 ViT，如表1所示。"Base"和"Large"模型直接采用自 BERT，我们额外增加了更大的"Huge"模型。记法说明：ViT-L/16 表示"Large"变体，使用 16×16 输入 patch 大小。注意 Transformer 的序列长度与 patch 大小的平方成反比，因此 patch 越小的模型计算量越大。

新手讲解

三种 ViT 规格对比

模型	层数(L)	隐维度(D)	MLP维度	注意力头数	参数量
ViT-Base（B）	12	768	3072	12	86M
ViT-Large（L）	24	1024	4096	16	307M
ViT-Huge（H）	32	1280	5120	16	632M

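Base and Large are taken directly from BERT-Base and BERT-Large; Huge is the authors' own larger addition. This reflects the idea of carrying the NLP recipe over to images.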
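The parameter counts in the table can be roughly reproduced from the layer sizes. The sketch below is a back-of-the-envelope estimate under stated assumptions: every linear layer has a bias, the classification head is excluded, the input is 224×224 RGB, and the patch sizes (16 for Base and Large, 14 for Huge) are chosen here for illustration since the table does not fix them. `vit_param_count` is a hypothetical helper.

```python
def vit_param_count(layers, dim, mlp_dim, patch, image=224, channels=3):
    """Rough parameter count of a ViT backbone (no classification head)."""
    tokens = (image // patch) ** 2 + 1                   # patches + [class] token
    patch_embed = patch * patch * channels * dim + dim   # projection E plus bias
    class_token = dim
    pos_embed = tokens * dim                             # E_pos
    attention = 4 * (dim * dim + dim)                    # Q, K, V and output projections
    mlp = (dim * mlp_dim + mlp_dim) + (mlp_dim * dim + dim)
    layer_norms = 2 * (2 * dim)                          # two LayerNorms per block
    final_norm = 2 * dim
    return (patch_embed + class_token + pos_embed
            + layers * (attention + mlp + layer_norms) + final_norm)


configs = {
    "ViT-B/16": dict(layers=12, dim=768, mlp_dim=3072, patch=16),
    "ViT-L/16": dict(layers=24, dim=1024, mlp_dim=4096, patch=16),
    "ViT-H/14": dict(layers=32, dim=1280, mlp_dim=5120, patch=14),
}
estimates = {name: vit_param_count(**cfg) for name, cfg in configs.items()}
for name, count in estimates.items():
    print(f"{name}: {count / 1e6:.1f}M parameters")
```

The estimates come out at roughly 86M, 303M and 631M, within about 2% of the 86M / 307M / 632M in the table. The remaining gap probably comes down to what exactly is counted (for example a classification head), which this sketch does not try to settle.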

为什么 patch 越小越贵？
- patch 大小 32×32：每张图 (224/32)² = 49 个 patch → 序列长度 49
- patch 大小 16×16：每张图 (224/16)² = 196 个 patch → 序列长度 196
- patch 大小 14×14：每张图 (224/14)² = 256 个 patch → 序列长度 256

自注意力的计算量是序列长度的平方，所以序列越长，计算量暴增。ViT-H/14 就是用更小的 patch 换来更精细的图像理解，代价是更大的计算量。
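The effect of patch size on sequence length, and on the quadratic attention term, can be tabulated directly. This sketch only models the N² attention-score term relative to the /32 setting; the projections and MLP blocks grow linearly with N and are not included.

```python
def sequence_length(image_size, patch_size):
    """Number of patches for a square image (the [class] token is not counted)."""
    side = image_size // patch_size
    return side * side


lengths = {p: sequence_length(224, p) for p in (32, 16, 14)}
relative_attention_cost = {p: (n / lengths[32]) ** 2 for p, n in lengths.items()}

for p in (32, 16, 14):
    print(f"patch {p}x{p}: N = {lengths[p]}, "
          f"attention cost vs /32 = {relative_attention_cost[p]:.1f}x")
```

Halving the patch side from 32 to 16 quadruples the sequence length and multiplies the attention-score cost by 16.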

原文（归纳偏置部分）

Inductive bias. We note that Vision Transformer has much less image-specific inductive bias than CNNs. In CNNs, locality, two-dimensional neighborhood structure, and translation equivariance are baked into each layer throughout the whole model. In ViT, only MLP layers are local and translationally equivariant, while the self-attention layers are global. The two-dimensional neighborhood structure is used very sparingly: in the beginning of the model by cutting the image into patches and at fine-tuning time for adjusting the position embeddings for images of different resolution (as described below). Other than that, the position embeddings at initialization time carry no information about the 2D positions of the patches and all spatial relations between the patches have to be learned from scratch.

翻译

归纳偏置。 我们注意到，Vision Transformer 相比 CNN 具有少得多的图像特有归纳偏置。在 CNN 中，局部性、二维邻域结构和平移等变性被内置到模型的每一层中贯穿始终。在 ViT 中，只有 MLP 层具有局部性和平移等变性，而自注意力层是全局的。二维邻域结构的使用极为稀少：仅在模型开始时用于将图像切分成 patch，以及在微调时用于为不同分辨率图像调整位置嵌入。除此之外，位置嵌入在初始化时不携带任何关于 patch 二维位置的信息，所有 patch 之间的空间关系都必须从头学习。

新手讲解

归纳偏置 vs 数据量：ViT 的核心权衡

这是理解 ViT 的最关键段落之一，用一张表来对比：

特性	CNN	ViT
局部性	内置（卷积只看局部）	无（注意力看全局）
平移等变性	内置（权重共享）	无（需学习）
二维结构理解	内置	几乎无（只有切patch时用到）
需要数据量	少数据可训练	需要大量数据
数据量充足时	受限于先验假设	可学习更灵活的规律

一个类比：

CNN 就像一个拿着"放大镜"逐块扫描图片的人，天生知道要"看局部细节"，也知道"同样的花纹不管在哪个角落都一样"。这让它在数据少的时候也能学得很好。
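ViT is like an alien with no built-in knowledge, seeing pictures for the first time. It has no notion of "local" or "translation-equivariant" and must discover these regularities from many images. With too little data it overfits and trails a comparable CNN; with enough data it can learn patterns that are more flexible than hand-built priors.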

Hybrid Architecture (a paragraph of Section 3.1 in the paper)

原文

As an alternative to raw image patches, the input sequence can be formed from feature maps of a CNN (LeCun et al., 1989). In this hybrid model, the patch embedding projection E (Eq. 1) is applied to patches extracted from a CNN feature map. As a special case, the patches can have spatial size 1×1, which means that the input sequence is obtained by simply flattening the spatial dimensions of the feature map and projecting to the Transformer dimension.

翻译

作为原始图像 patch 的替代方案，输入序列也可以由 CNN 的特征图构成。在这种混合模型中，patch 嵌入投影 E 被应用于从 CNN 特征图中提取的 patch。特别地，当 patch 的空间大小为 1×1 时，输入序列直接由展平特征图的空间维度投影到 Transformer 维度得到。

新手讲解

混合架构（Hybrid Architecture）：CNN + Transformer 的结合

这是一个"取长补短"的方案：
1. 先用 CNN（比如 ResNet 的前几层）处理图像，提取底层特征图
2. 把这个特征图当成"已经理解了局部结构的 patch"，送入 Transformer

优势：CNN 擅长提取局部特征（边缘、纹理），Transformer 擅长建立全局关系。混合架构让 Transformer 不需要从原始像素学起，节省了学习局部特征的成本。

实验结果（图5）：在计算量较小时，混合架构略优于纯 ViT；但在大模型规模下，二者差距消失。这说明 ViT 在有足够参数和数据时，完全可以自己学会 CNN 所做的事情。

3.2 Fine-tuning and Higher Resolution (Section 3.2 in the paper)

原文

Typically, we pre-train ViT on large datasets, and fine-tune to (smaller) downstream tasks. For this, we remove the pre-trained prediction head and attach a zero-initialized D×K feedforward layer, where K is the number of downstream classes. It is often beneficial to fine-tune at higher resolution than pre-training (Touvron et al., 2019; Kolesnikov et al., 2020). When feeding images of higher resolution, we keep the patch size the same, which results in a larger effective sequence length. The Vision Transformer can handle arbitrary sequence lengths (up to memory constraints), however, the pre-trained position embeddings may no longer be meaningful. We therefore perform 2D interpolation of the pre-trained position embeddings, according to their location in the original image. Note that this resolution adjustment and patch extraction are the only points at which an inductive bias about the 2D structure of the images is manually injected into the Vision Transformer.

翻译

通常我们在大型数据集上预训练 ViT，然后在（较小的）下游任务上微调。为此，我们移除预训练的预测头，换上一个零初始化的 D×K 前馈层，其中 K 是下游任务的类别数。在比预训练更高的分辨率上微调往往有益。当输入更高分辨率的图像时，我们保持 patch 大小不变，这会导致更长的有效序列。Vision Transformer 可以处理任意长度的序列（受内存限制），但预训练的位置嵌入可能不再有意义。因此，我们根据 patch 在原始图像中的位置，对预训练的位置嵌入进行二维插值。注意，这种分辨率调整和 patch 提取是唯二手动向 ViT 注入关于图像二维结构归纳偏置的地方。

新手讲解

微调（Fine-tuning）流程

预训练阶段（大数据集，如 JFT-300M）：
  图片 → ViT → [CLS token 输出] → MLP头（预测18000类）

微调阶段（小数据集，如 ImageNet 1000类）：
  1. 去掉原来的 MLP 分类头
  2. Attach a new zero-initialized D×K linear layer (e.g. 1024→1000 for ViT-L on ImageNet-1k)
  3. 用较小学习率继续训练
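The head swap can be sketched as below. Random vectors stand in for the [class] representations produced by a pre-trained ViT-L backbone; only the zero initialization of the D×K layer comes from the paper, and the rest is illustrative.

```python
import numpy as np

D, K = 1024, 1000                      # ViT-L width, ImageNet-1k classes

# New downstream head: zero-initialized weights and bias.
head_w = np.zeros((D, K))
head_b = np.zeros(K)

rng = np.random.default_rng(0)
cls_repr = rng.normal(size=(4, D))     # stand-in for y = LN(z_L^0) of 4 images

logits = cls_repr @ head_w + head_b    # all zeros at the start of fine-tuning
probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
print(probs[0, :3])
```

With a zero-initialized head every class starts at probability 1/K, so the first fine-tuning steps are not dominated by random head weights.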

高分辨率微调的小技巧

预训练时用 224×224，微调时用 384×384（更大分辨率），可以看到更细节，效果更好。

但有个问题：预训练时 patch 序列是 196 个（14×14），现在是 576 个（24×24）。预训练好的位置嵌入只有196个，现在需要576个，怎么办？

解决方案：二维插值（2D Interpolation）

把原来 14×14 的位置嵌入"拉伸"成 24×24，就像把一张 14×14 的图片放大到 24×24——中间补上插值。这样不需要重新训练位置嵌入，就能适应更高分辨率。

作者特别指出：这里的"知道 patch 的二维位置"是 ViT 中仅有的两个手动加入图像结构知识的地方（另一个是切 patch 本身）。
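The 2D interpolation of position embeddings can be sketched with a hand-written bilinear resize. This is an assumption-laden illustration: it aligns the grid corners and uses random values in place of trained embeddings, and the official code may differ in such details. `resize_pos_embed` is a hypothetical helper.

```python
import numpy as np


def resize_pos_embed(pos_embed, old_grid, new_grid):
    """Bilinearly resize patch position embeddings; the [class] slot is kept as is."""
    class_pos, patch_pos = pos_embed[:1], pos_embed[1:]
    dim = pos_embed.shape[1]
    grid = patch_pos.reshape(old_grid, old_grid, dim)

    # Where each new grid point falls in old-grid coordinates (corners aligned).
    coords = np.linspace(0, old_grid - 1, new_grid)
    lo = np.floor(coords).astype(int)
    hi = np.minimum(lo + 1, old_grid - 1)
    frac = coords - lo

    # Interpolate along rows first, then along columns.
    rows = grid[lo] * (1 - frac)[:, None, None] + grid[hi] * frac[:, None, None]
    out = rows[:, lo] * (1 - frac)[None, :, None] + rows[:, hi] * frac[None, :, None]
    return np.concatenate([class_pos, out.reshape(new_grid * new_grid, dim)], axis=0)


rng = np.random.default_rng(0)
pos_224 = rng.normal(scale=0.02, size=(1 + 14 * 14, 768))   # stand-in for trained E_pos
pos_384 = resize_pos_embed(pos_224, old_grid=14, new_grid=24)
print(pos_224.shape, "->", pos_384.shape)
```

The [class] position embedding has no 2D location, so it is carried over unchanged; only the 14×14 grid is stretched to 24×24, giving 1 + 576 = 577 embeddings.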

四、Experiments（实验章节）

4.1 数据集与模型规格

原文

To explore model scalability, we use the ILSVRC-2012 ImageNet dataset with 1k classes and 1.3M images (we refer to it as ImageNet in what follows), its superset ImageNet-21k with 21k classes and 14M images, and JFT (Sun et al., 2017) with 18k classes and 303M high-resolution images.

翻译

为探索模型可扩展性，我们使用了三个数据集：ILSVRC-2012 ImageNet（1000类，130万张图）、其超集 ImageNet-21k（21000类，1400万张图），以及 JFT（18000类，3亿张高分辨率图像）。

新手讲解

数据规模对比，一目了然

ImageNet        ≈  1.3M 张图  ← 研究社区标准，公开
ImageNet-21k    ≈ 14M  张图  ← 中等规模，公开
JFT-300M        ≈ 300M 张图  ← Google 内部，不公开，约是 ImageNet 的 230 倍

4.2 与最先进方法的对比（Comparison to State of the Art）

原文

Table 2 shows the results. The smaller ViT-L/16 model pre-trained on JFT-300M outperforms BiT-L (which is pre-trained on the same dataset) on all tasks, while requiring substantially less computational resources to train. The larger model, ViT-H/14, further improves the performance, especially on the more challenging datasets – ImageNet, CIFAR-100, and the VTAB suite. Interestingly, this model still took substantially less compute to pre-train than prior state of the art.

翻译

表 2 显示了结果。在 JFT-300M 上预训练的较小模型 ViT-L/16 在所有任务上均优于 BiT-L（同样在 JFT-300M 上预训练），同时所需训练计算资源显著更少。更大的模型 ViT-H/14 进一步提升了性能，尤其是在更具挑战性的数据集上——ImageNet、CIFAR-100 和 VTAB 套件。有趣的是，这个模型的预训练计算量仍然显著少于此前的最先进方法。

新手讲解

关键实验结果（表2）

模型	ImageNet	CIFAR-100	VTAB(19任务)	预训练算力(TPUv3 核·天)
ViT-H/14（JFT）	88.55%	94.55%	77.63%	2500
ViT-L/16（JFT）	87.76%	93.90%	76.28%	680
BiT-L（ResNet152x4，JFT）	87.54%	93.51%	76.29%	9900
Noisy Student（EfficientNet-L2）	88.4%	—	—	12300

最震撼的对比：
- ViT-H/14 的 ImageNet 精度（88.55%）超过了当时最强的 BiT-L（87.54%）和 Noisy Student（88.4%）
- ViT-H/14 的预训练算力（2500 TPU 核·天）仅是 BiT-L（9900）的 1/4，仅是 Noisy Student（12300）的约 1/5
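The compute ratios quoted above follow directly from the TPUv3-core-days column of the table. A quick check, using only the numbers already listed:

```python
core_days = {
    "ViT-H/14": 2500,
    "ViT-L/16": 680,
    "BiT-L": 9900,
    "Noisy Student": 12300,
}

# How many times more pre-training compute each baseline used than ViT-H/14.
ratio_vs_vit_h = {
    name: cost / core_days["ViT-H/14"]
    for name, cost in core_days.items()
    if name != "ViT-H/14"
}
for name, ratio in ratio_vs_vit_h.items():
    print(f"{name}: {ratio:.2f}x the compute of ViT-H/14")
```

BiT-L comes out at about 4 times and Noisy Student at about 5 times the pre-training compute of ViT-H/14, while ViT-L/16 needs only about a quarter of it.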

结论：用更少的算力，达到更好的效果。这是 ViT 的杀手锏。

4.3 预训练数据需求（Pre-training Data Requirements）

原文

The Vision Transformer performs well when pre-trained on a large JFT-300M dataset. With fewer inductive biases for vision than ResNets, how crucial is the dataset size? We perform two series of experiments.

First, we pre-train ViT models on datasets of increasing size: ImageNet, ImageNet-21k, and JFT-300M. ... When pre-trained on the smallest dataset, ImageNet, ViT-Large models underperform compared to ViT-Base models, despite (moderate) regularization. With ImageNet-21k pre-training, their performances are similar. Only with JFT-300M, do we see the full benefit of larger models.

翻译

Vision Transformer 在大型 JFT-300M 数据集上预训练时表现出色。但它比 ResNet 具有更少的视觉归纳偏置，数据集大小有多关键？我们进行了两组实验。

首先，我们在规模递增的数据集上预训练 ViT：ImageNet、ImageNet-21k 和 JFT-300M。在最小的数据集 ImageNet 上预训练时，即使有（适度的）正则化，ViT-Large 的表现也不如 ViT-Base。ImageNet-21k 预训练时两者性能相近。只有在 JFT-300M 上，我们才能看到更大模型的完整优势。

新手讲解

图3的核心结论（数据量与模型大小的交互效应）

           ImageNet(130万)   ImageNet-21k(1400万)   JFT-300M(3亿)
ViT-Large       输给 ViT-Base      ≈ ViT-Base            远超 ViT-Base
ViT-Base        好于 ViT-Large     ≈ ViT-Large            被 ViT-Large 超越
ResNet(BiT)     赢过两者           与 ViT 相当             被 ViT 反超

这个实验用数据说明了：模型越大，越需要更多数据才能发挥优势。 小数据+大模型 = 过拟合 = 表现反而更差。

原文（第二组实验：JFT 子集）

Second, we train our models on random subsets of 9M, 30M, and 90M as well as the full JFT-300M dataset. ... Vision Transformers overfit more than ResNets with comparable computational cost on smaller datasets. For example, ViT-B/32 is slightly faster than ResNet50; it performs much worse on the 9M subset, but better on 90M+ subsets. The same is true for ResNet152x2 and ViT-L/16. This result reinforces the intuition that the convolutional inductive bias is useful for smaller datasets, but for larger ones, learning the relevant patterns directly from data is sufficient, even beneficial.

翻译

其次，我们在 JFT-300M 的随机子集（900万、3000万、9000万和完整的 3亿）上训练模型。结果显示，在较小的数据集上，Vision Transformer 比计算量相近的 ResNet 更容易过拟合。例如，ViT-B/32 略快于 ResNet50，但在 900万子集上表现差得多，在 9000万以上的子集上则表现更好。ViT-L/16 和 ResNet152x2 也有类似规律。这一结果进一步印证了直觉：卷积的归纳偏置对小数据集很有用，但对大数据集来说，直接从数据中学习相关模式就足够了，甚至更有益。

新手讲解

图4的核心结论（"交叉反超"曲线）

这个实验是整篇论文最清晰地展示"归纳偏置 vs 数据量"权衡的部分。

想象两条曲线随数据量增长：
- ResNet 曲线：一开始就高（归纳偏置帮忙），但增长较慢（先验知识限制了上限）
- ViT 曲线：一开始低（没有先验，数据少了就过拟合），但增长更快（更灵活，能从大数据中学到更多）

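The two curves cross somewhere between the 9M and 90M subsets: the paper reports ViT-B/32 well behind ResNet50 on the 9M subset and ahead on subsets of 90M and above. Beyond the crossover, ViT stays ahead.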

这就是"大数据让 ViT 的弱点（没有归纳偏置）变成了强点（更灵活）"。

4.4 缩放研究（Scaling Study）

原文

First, Vision Transformers dominate ResNets on the performance/compute trade-off. ViT uses approximately 2−4× less compute to attain the same performance (average over 5 datasets). Second, hybrids slightly outperform ViT at small computational budgets, but the difference vanishes for larger models. Third, Vision Transformers appear not to saturate within the range tried, motivating future scaling efforts.

翻译

第一，Vision Transformer 在性能/计算量权衡上优于 ResNet。ViT 达到相同性能所需计算量约为 ResNet 的 1/2 到 1/4（在5个数据集上的平均值）。第二，混合架构在小计算量预算时略优于纯 ViT，但在更大模型下差距消失。第三，Vision Transformer 在尝试的范围内似乎没有出现性能饱和，这激励着未来的扩展研究。

新手讲解

效率的惊人优势

图5中，相同精度下：
- ViT 比 ResNet 省 2~4 倍算力
- Larger ViTs show no sign of saturating within the range the paper tried

这解释了为什么 ViT 会引发视觉 Transformer 的革命：不仅精度更高，而且训练成本更低，扩展潜力更强。

4.5 内部分析：ViT 学到了什么？

原文

The first layer of the Vision Transformer linearly projects the flattened patches into a lower-dimensional space. Figure 7 (left) shows the top principal components of the learned embedding filters. The components resemble plausible basis functions for a low-dimensional representation of the fine structure within each patch.

After the projection, a learned position embedding is added to the patch representations. Figure 7 (center) shows that the model learns to encode distance within the image in the similarity of position embeddings, i.e. closer patches tend to have more similar position embeddings. Further, the row-column structure appears; patches in the same row/column have similar embeddings.

翻译

Vision Transformer 第一层将展平的 patch 线性投影到低维空间。图 7（左）展示了学到的嵌入滤波器的主成分，它们类似于对每个 patch 内细粒度结构进行低维表示的合理基函数。

投影后，可学习的位置嵌入被加到 patch 表示上。图 7（中）显示，模型学会了在位置嵌入的相似性中编码图像内的距离，即相邻的 patch 倾向于有更相似的位置嵌入。此外，行列结构也出现了；同行/列的 patch 有相似的嵌入。

新手讲解

ViT 自己发现了什么？

作者对 ViT 内部进行了"解剖"，发现了三个有趣的现象：

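The first layer learns CNN-like feature bases: the principal components of the learned projection filters look like plausible basis functions for the fine structure inside a patch, reminiscent of the edge and color-gradient filters that a CNN learns in its first convolution layer. ViT picks up this kind of low-level feature from data alone.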
位置嵌入学到了二维拓扑结构：虽然用的是一维位置编号（1~196），但模型自己学到了二维的空间关系——相邻 patch 的位置嵌入更相似，同行/列的 patch 嵌入有规律性。这说明一维位置编号加上大量数据，足以让模型学会二维空间的理解，无需人为设计二维位置编码。
注意力距离从浅层到深层逐渐增大：浅层有些注意力头关注局部（小 patch 区域），有些关注全局；深层大多数注意力头关注全局。这说明 ViT 的不同层/头分别学习了不同尺度的特征，与 CNN 的层次化结构有异曲同工之处。
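The "attention distance" in the third finding can be illustrated with a simplified, patch-level computation: for each query patch, average the distance to every other patch, weighted by the attention it pays to that patch. The paper's own measurement averages over many images and heads; `mean_attention_distance` below is a hypothetical helper for a single attention map with the [class] token removed.

```python
import numpy as np


def mean_attention_distance(attn, grid, patch_size):
    """attn: (N, N) attention weights (rows sum to 1) over N = grid*grid patches."""
    rows, cols = np.divmod(np.arange(grid * grid), grid)
    coords = np.stack([rows, cols], axis=1) * patch_size          # pixel position of each patch
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)   # (N, N)
    return float((attn * dist).sum(axis=1).mean())


grid, patch_size = 14, 16
n = grid * grid
local_head = np.eye(n)                      # each patch attends only to itself
global_head = np.full((n, n), 1.0 / n)      # each patch attends evenly to all patches

print("local head :", mean_attention_distance(local_head, grid, patch_size))
print("global head:", mean_attention_distance(global_head, grid, patch_size))
```

A head with a small value behaves like a small convolutional receptive field, while a head with a large value integrates information across the whole image.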

4.6 自监督预训练（Self-Supervision）

原文

We also perform a preliminary exploration on masked patch prediction for self-supervision, mimicking the masked language modeling task used in BERT. With self-supervised pre-training, our smaller ViT-B/16 model achieves 79.9% accuracy on ImageNet, a significant improvement of 2% to training from scratch, but still 4% behind supervised pre-training.

翻译

我们还对自监督进行了初步探索，使用遮盖 patch 预测任务，模仿 BERT 的掩码语言模型任务。经过自监督预训练，我们较小的 ViT-B/16 模型在 ImageNet 上达到 79.9% 精度，比从头训练提升了 2%，但仍比监督预训练低 4%。

新手讲解

自监督 ViT 预训练的先兆

这一小节是 MAE（Masked Autoencoders）等后续工作的先兆。思路是：
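- Corrupt 50% of the patch embeddings, mostly by swapping in a learnable [mask] embedding
- Train the model to predict a property of each corrupted patch; the paper's appendix uses the patch's mean color, quantized to 3 bits per channel
- No human labels are needed; the image itself provides the supervision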
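The masking-and-target step can be sketched as follows. The details here are assumptions made for illustration: pixel values are taken to lie in [0, 1], the target is the patch's mean color quantized to 3 bits per channel (512 classes in total), and the exact quantization rule is a guess. `mean_color_targets` is a hypothetical helper.

```python
import numpy as np


def mean_color_targets(patches, patch_size, channels=3, bits=3):
    """Map each flattened patch to a class id built from its quantized mean color."""
    n = patches.shape[0]
    pixels = patches.reshape(n, patch_size * patch_size, channels)
    mean_rgb = pixels.mean(axis=1)                          # (n, 3), values in [0, 1]
    levels = 2 ** bits
    q = np.minimum((mean_rgb * levels).astype(int), levels - 1)   # each channel in 0..7
    return q[:, 0] * levels * levels + q[:, 1] * levels + q[:, 2]  # id in [0, 512)


rng = np.random.default_rng(0)
num_patches, patch_size = 196, 16
patches = rng.uniform(size=(num_patches, patch_size * patch_size * 3))

# Pick half of the patches to corrupt; the model is trained only on these.
masked_idx = rng.choice(num_patches, size=num_patches // 2, replace=False)
targets = mean_color_targets(patches, patch_size)[masked_idx]
print(len(masked_idx), targets.min(), targets.max())
```

The loss is then an ordinary classification loss over the corrupted positions, which mirrors how masked language modeling treats masked words.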

结果：79.9% 的 ImageNet 精度，比从头训练好，但当时还比不上有监督预训练（83.97%）。这个方向后来被 MAE（2021）等工作大幅改进，最终证明了自监督 ViT 可以媲美甚至超过有监督训练。

五、Conclusion（结论）

原文

We have explored the direct application of Transformers to image recognition. Unlike prior works using self-attention in computer vision, we do not introduce image-specific inductive biases into the architecture apart from the initial patch extraction step. Instead, we interpret an image as a sequence of patches and process it by a standard Transformer encoder as used in NLP. This simple, yet scalable, strategy works surprisingly well when coupled with pre-training on large datasets. Thus, Vision Transformer matches or exceeds the state of the art on many image classification benchmarks, whilst being relatively cheap to pre-train.

翻译

我们探索了将 Transformer 直接应用于图像识别的方法。与此前在计算机视觉中使用自注意力的工作不同，我们除了初始 patch 提取步骤外，没有向架构中引入图像特有的归纳偏置。我们将图像理解为 patch 序列，并使用 NLP 中标准的 Transformer 编码器处理它。这一简单而可扩展的策略，在大规模数据集预训练的加持下，表现出令人惊讶的优异性能。因此，Vision Transformer 在众多图像分类基准上与最先进方法相当或超越，同时预训练成本相对低廉。

新手讲解

一句话总结全文

ViT 的核心贡献：证明了在足够大的数据规模下，不需要 CNN 的专门设计，直接把图片当句子处理的纯 Transformer 就能做到最好的图像识别。

六、总结：核心概念速查表

术语	英文	一句话解释
图像块	Patch	把图片切成 P×P 像素的小方块，每块当一个"词"
补丁嵌入	Patch Embedding	把每个 patch 展平后经线性变换映射到 D 维向量
分类标记	[class] token / [CLS]	序列最前面的特殊向量，汇聚全图信息用于分类
位置嵌入	Position Embedding	给每个 patch 加的位置编号向量，告诉模型 patch 的位置
多头自注意力	Multi-Head Self-Attention (MSA)	让每个 patch 与所有其他 patch 交互，学习长距离依赖
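归纳偏置	Inductive Bias	Assumptions about the structure of the data that are built into the model (CNNs build in locality and translation equivariance)
平移等变性	Translation Equivariance	Shifting the input shifts the feature map by the same amount, so the same filters respond to an object wherever it appears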
微调	Fine-tuning	在大数据上预训练好的模型，在小数据集上继续训练以适应新任务
混合架构	Hybrid Architecture	CNN 提取特征图 + Transformer 处理特征序列的组合模型

七、ViT 的历史意义

ViT 发表于 2020 年 10 月，2021 年发表在 ICLR，是计算机视觉领域的里程碑之作：

打破了 CNN 在视觉的垄断：证明了 Transformer 不仅适用于 NLP，也能统一视觉领域。
开启了"大数据 + 大模型"在视觉的时代：预示着视觉预训练的重要性超过架构设计。
催生了庞大的后续工作：
- DeiT（2021）：不需要 JFT，只用 ImageNet 就能训好 ViT
- Swin Transformer（2021）：加入局部窗口，把 ViT 推广到检测/分割
- MAE（2021）：掩码自编码预训练，ViT 的自监督进化
- CLIP（2021）：图文对比预训练，ViT 成为视觉骨干
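- Modern multimodal models: ViT-style encoders are a common choice for the vision side (the exact architectures of closed models such as GPT-4V and Gemini are not fully public)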
提供了"归纳偏置 vs 数据量"的深刻洞见：这一权衡不仅适用于视觉，也是深度学习中模型设计的普遍规律。

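End of notes. Sections covered: Abstract; Introduction (all 5 paragraphs); Section 3.1 in full, including the line-by-line reading of the equations and the inductive-bias and hybrid-architecture paragraphs; Section 3.2 (fine-tuning and higher resolution); key passages from Sections 4.1 to 4.6; Conclusion. The original PDF has 22 pages: 9 pages of main text (pages 1-9) and 13 pages of appendix (pages 10-22).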