精读笔记：Constitutional AI: Harmlessness from AI Feedback

论文信息
- 标题：Constitutional AI: Harmlessness from AI Feedback
- 作者：Yuntao Bai, Jared Kaplan 等（Anthropic，共 52 位作者）
- arXiv：2212.08073
- 发表时间：2022 年 12 月
- 机构：Anthropic
- 关键词：Constitutional AI、RLAIF、self-critique、harmlessness、AI alignment

阅读地图

本文提出了 Constitutional AI（CAI，宪法式 AI） 方法，是 Anthropic 对齐研究的核心里程碑。全文结构如下：

章节	核心内容
Abstract	方法一句话概括：用"宪法原则"让 AI 自我批评、自我修改，替代人工有害性标注
§1 Introduction	动机：为何需要 AI 监督 AI？传统 RLHF 的局限在哪里？
§2 AI 作为评委的可行性	实验验证：大模型判断有害性的准确率接近人类标注
§3 监督学习阶段（SL-CAI）	核心方法一：自我批评 → 自我修改 → 监督微调
§4 强化学习阶段（RL-CAI / RLAIF）	核心方法二：AI 生成偏好标签 → 训练奖励模型 → RL 优化
§5 Related Work	与 InstructGPT、Sparrow、LaMDA 的关系
§6 Discussion	局限性、未来方向、双重使用风险

阅读建议：初学者重点读 Abstract + §1.2 + §3.1 + §4.1，先理解"两阶段流程"的直觉，再深入细节。

核心创新一句话

传统 RLHF（InstructGPT 用的方法）需要数万条人工标注来判断"哪个回答更安全"，成本高，还让标注员接触大量有害内容。Constitutional AI 只需要写十几条自然语言原则（"宪法"），让 AI 自己批评自己、自己改正错误、自己当评委打分，从而大幅减少人工介入。这种用 AI 反馈做强化学习的方式，作者称之为 RLAIF（RL from AI Feedback）。

术语表（首次出现时解释）

术语	全称/原文	中文解释
RLHF	Reinforcement Learning from Human Feedback	人类反馈强化学习。用人类标注的"哪个回答更好"来训练奖励模型，再用强化学习优化模型。InstructGPT 的核心方法。
RLAIF	Reinforcement Learning from AI Feedback	AI 反馈强化学习。用 AI（而非人类）生成"哪个回答更好"的标签，替代人工标注的有害性偏好数据。本文核心创新。
Constitution（宪法）	a list of rules or principles	一组用自然语言写成的原则，如"选择更无害、更少歧视的回答"。全文共 16 条原则，随机采样使用。
Self-critique（自我批评）	model critiques its own response	让模型读自己的回答，然后指出其中有害、不道德或危险的内容。
Revision（修改）	model revises its response	让模型根据批评，重写一个更无害的回答。
Harmlessness（无害性）	harmlessness	模型不输出有害内容（如暴力、歧视、危险指导等）的性质。
SL-CAI	Supervised Learning CAI	用监督学习（微调）方式训练的 CAI 模型。
RL-CAI	Reinforcement Learning CAI	用强化学习（RLAIF）方式训练的 CAI 模型。
PM	Preference Model	偏好模型（也叫奖励模型）。输入一段对话，输出一个分数，分越高说明越好。
Red teaming	red team prompts	红队测试：故意设计能诱导模型输出有害内容的提示词，用于测试和改进安全性。
CoT	Chain-of-Thought	思维链推理。让模型先写出"推理过程"再给出答案，可提升判断准确率。
HH	Helpful and Harmless	既有帮助性又无害的模型训练目标。

Abstract（摘要）逐段精读

原文

As AI systems become more capable, we would like to enlist their help to supervise other AIs. We experiment with methods for training a harmless AI assistant through self-improvement, without any human labels identifying harmful outputs. The only human oversight is provided through a list of rules or principles, and so we refer to the method as 'Constitutional AI'. The process involves both a supervised learning and a reinforcement learning phase. In the supervised phase we sample from an initial model, then generate self-critiques and revisions, and then finetune the original model on revised responses. In the RL phase, we sample from the finetuned model, use a model to evaluate which of the two samples is better, and then train a preference model from this dataset of AI preferences. We then train with RL using the preference model as the reward signal, i.e. we use 'RL from AI Feedback' (RLAIF). As a result we are able to train a harmless but non-evasive AI assistant that engages with harmful queries by explaining its objections to them. Both the SL and RL methods can leverage chain-of-thought style reasoning to improve the human-judged performance and transparency of AI decision making. These methods make it possible to control AI behavior more precisely and with far fewer human labels.

翻译

随着 AI 系统能力不断增强，我们希望借助 AI 来监督其他 AI。我们探索了一种方法：在不依赖任何人工有害性标注的情况下，通过自我改进训练一个无害的 AI 助手。唯一的人类监督以一组规则或原则的形式提供，因此我们将这一方法称为"宪法式 AI（Constitutional AI）"。该流程包含监督学习和强化学习两个阶段。在监督阶段，我们从初始模型采样，然后生成自我批评和修改，再用修改后的回答对原始模型进行微调。在强化学习阶段，我们从微调后的模型采样，用一个模型来评估两个样本哪个更好，再从这个 AI 偏好数据集训练偏好模型。然后用偏好模型作为奖励信号进行强化学习训练，即"AI 反馈强化学习（RLAIF）"。最终，我们能够训练出一个无害但不回避的 AI 助手：它会回应有害查询，并解释自己反对的理由。监督和强化两个阶段都可以利用思维链式推理，来提升由人类评判的性能以及 AI 决策的透明度。这些方法使得以少得多的人工标注更精准地控制 AI 行为成为可能。

新手讲解

这段摘要讲了什么？

把这段话翻译成一个"员工培训类比"：

假设你要培训一名新员工遵守公司行为准则。

传统方法（RLHF）：让 HR 部门每天看员工的几万份工作记录，人工打分"这份报告好不好、是否合规"，成本极高，而且 HR 自己也要读大量不良内容。
Constitutional AI 的新方法：给员工一本《行为手册》（宪法，只有十几条原则），让员工：
1. 自己读自己写的报告，对照手册挑出问题（self-critique，自我批评）
2. 自己重写报告，把问题改掉（revision，修改）
3. 用改好的报告微调模型（监督学习阶段，SL）
4. 再让另一个 AI 对照手册当"裁判"，给两份报告打分（AI 当评委）
5. 用这些 AI 打分的结果训练强化学习（RL 阶段，即 RLAIF）

关键点 1：整个有害性判断环节，没有人类标注员参与——只有几条原则，其他都是 AI 自己完成的。

关键点 2："non-evasive（不回避）"很重要。过去的安全模型遇到敏感问题就说"我不知道"，像个鸵鸟。CAI 训练出的模型会说明拒绝理由，更透明、更有帮助。

§1 Introduction（引言）核心段落精读

1.1 为什么需要 AI 监督 AI？

原文（段落 1-2）：

We would like to train AI systems that remain helpful, honest, and harmless, even as some AI capabilities reach or exceed human-level performance. This suggests that we will need to develop techniques that do not rely on humans to supervise all aspects of AI behavior, and that can be used to automatically test and enhance robustness to harmful behaviors.

In this paper we develop a method we refer to as Constitutional AI (CAI), depicted in Figure 1, and use it to train a non-evasive and relatively harmless AI assistant, without any human feedback labels for harms. The method therefore improves upon, and partially replaces reinforcement learning from human feedback [Christiano et al., 2017]. The new assistant 'RL-CAI' is preferred by crowdworkers over those trained with previously collected human feedback labels for harmfulness.

翻译：

我们希望训练出在部分能力达到甚至超越人类水平时，仍然保持有帮助、诚实和无害的 AI 系统。这意味着我们需要开发不依赖人类监督 AI 所有行为的技术，并且这些技术能用于自动测试和增强模型对有害行为的鲁棒性。

本文提出了"宪法式 AI（Constitutional AI，CAI）"方法，用于在没有任何有害性人工反馈标签的情况下，训练一个不回避、相对无害的 AI 助手。该方法改进并部分取代了人类反馈强化学习（RLHF）。新的助手"RL-CAI"在众包测试中被评为优于基于之前人工有害性标签训练的模型。

新手讲解：

这里有一个重要的 AI 安全问题：当 AI 越来越强，人类还能监督它吗？

传统 RLHF 的做法是：每次我们想让模型变得更安全，就需要雇人来看几万个模型回答，打分"哪个更好"。但这带来两个麻烦：

- 成本高：几万条标注，既费钱又费时间
- 伤害标注员：标注员要接触大量暴力、歧视、危险内容

Constitutional AI 的答案是：让 AI 自己当自己的老师——用几条简单原则，让模型自我迭代改进。

1.2 四个核心动机

原文（Section 1 Motivations 列举）：

Our motivations for developing this technique were: (1) to study simple possibilities for using AI systems to help supervise other AIs, and thus scale supervision, (2) to improve on our prior work training a harmless AI assistant by eliminating evasive responses, reducing tension between helpfulness and harmlessness and encouraging the AI to explain its objections to harmful requests, (3) to make the principles governing AI behavior, and their implementation, more transparent, and (4) to reduce iteration time by obviating the need to collect new human feedback labels when altering the objective.

翻译：

开发这一技术的动机有四点：(1) 研究用 AI 系统帮助监督其他 AI 的简单可能性，实现监督的规模化；(2) 改进之前训练无害助手的工作：消除回避性回答、缓解有帮助性与无害性之间的张力，并鼓励 AI 解释其反对有害请求的理由；(3) 使治理 AI 行为的原则及其实现更加透明；(4) 在改变目标时无需重新收集人工反馈标签，从而缩短迭代时间。

新手讲解：

这四个动机对应四个痛点：

动机	解决的痛点
① 用 AI 监督 AI	人力无法跟上 AI 能力增长的速度
② 消除"鸵鸟式"回答	过去的安全模型遇到敏感问题就拒绝回答，几乎没用
③ 提升透明度	RLHF 的几万条标注数据是黑箱，没人知道模型在遵循什么逻辑
④ 减少迭代成本	换一个安全目标就要重新收集人工标注，太慢了

1.3 Constitutional AI 方法概述（§1.2）

原文（关键段落）：

We will be experimenting with an extreme form of scaled supervision, which we refer to as Constitutional AI (CAI). The idea is that human supervision will come entirely from a set of principles that should govern AI behavior, along with a small number of examples used for few-shot prompting. Together these principles form the constitution.

Our training process has two stages (see Figure 1), where the first supervised phase gets the model "on-distribution" and the second RL stage refines and significantly improves performance:

(Supervised Stage) Critique → Revision → Supervised Learning

In the first stage of the process, we first generate responses to harmfulness prompts using a helpful-only AI assistant. These initial responses will typically be quite harmful and toxic. We then ask the model to critique its response according to a principle in the constitution, and then revise the original response in light of the critique. We revise responses repeatedly in a sequence, where we randomly draw principles from the constitution at each step. Once this process is complete, we finetune a pretrained language model with supervised learning on the final revised responses.

(RL Stage) AI Comparison Evaluations → Preference Model → Reinforcement Learning

This stage mimics RLHF, except that we replace human preferences for harmlessness with 'AI feedback' (i.e. we perform 'RLAIF'), where the AI evaluates responses according to a set of constitutional principles. Just as RLHF distills human preferences into a single preference model (PM), in this stage we distill LM interpretations of a set of principles back into a hybrid human/AI PM.

翻译：

我们将实验一种极端形式的规模化监督，即宪法式 AI（CAI）。其核心思路是：人类监督完全来自一组治理 AI 行为的原则，以及少量用于少样本提示的示例。这些原则共同构成"宪法"。

训练过程分为两个阶段（见图 1）：第一阶段（监督阶段）让模型"进入分布"，第二阶段（RL 阶段）进一步精炼并显著提升性能。

（监督阶段）批评 → 修改 → 监督学习

第一阶段，我们首先用一个仅训练帮助性的 AI 助手生成对有害提示的回答。这些初始回答通常会相当有害和有毒。然后，我们要求模型根据宪法中的一条原则批评自己的回答，再根据批评修改原始回答。这个修改过程以序列方式反复进行，每步都从宪法中随机抽取一条原则。完成后，用监督学习在最终修改后的回答上微调预训练语言模型。

（RL 阶段）AI 对比评估 → 偏好模型 → 强化学习

这一阶段模仿 RLHF，但我们用"AI 反馈"替代人类对有害性的偏好（即执行"RLAIF"），让 AI 根据宪法原则来评估回答。就像 RLHF 将人类偏好蒸馏成一个偏好模型（PM）一样，这一阶段我们将语言模型对一组原则的解读蒸馏成一个混合的人类/AI 偏好模型。

新手讲解：

这是全文最重要的一段，讲清楚了 CAI 的"两阶段流程"。用类比理解：

第一阶段（监督阶段）：给员工一本手册，让他自查自改

1. 先让模型正常回答一个有害问题（比如"怎么黑进邻居的 WiFi？"）→ 模型可能给出真实的黑客方法
2. 然后把手册里的一条原则（"找出回答中有害、不道德或危险的内容"）发给模型，让它批评自己（self-critique）
3. 模型写出批评（"这个回答有害，因为入侵他人 WiFi 是侵犯隐私，可能违法"）
4. 再让模型根据批评重写回答（revision）：改为"入侵邻居 WiFi 是侵犯隐私的行为，我强烈不建议这么做"
5. 用修改好的回答来微调模型（supervised fine-tuning）→ 得到 SL-CAI 模型

第二阶段（RL 阶段）：让另一个 AI 当裁判，打分驱动强化学习

1. 用 SL-CAI 模型对同一个有害提示生成两个回答
2. 把这两个回答和一条宪法原则发给一个"裁判 AI"（feedback model），问它"哪个更好？"
3. 裁判 AI 给出判断（比如 A 更无害，概率 0.8）→ 这就是 AI 生成的偏好标签
4. 用这些标签训练偏好模型（PM）
5. 用偏好模型作为奖励信号，对 SL-CAI 做强化学习 → 得到 RL-CAI 模型

和 InstructGPT（RLHF）的区别：InstructGPT 的第 3 步是人类来判断哪个更好；Constitutional AI 的第 3 步是AI 自己来判断——这就是 RLAIF 和 RLHF 的本质区别。

§2 AI 作为评委的可行性验证

原文（核心段落）

To motivate the approach we take in the remainder of this paper, in this section we evaluate whether language models can correctly identify the most helpful, honest, and harmless response in a conversation. The results suggest that large language models may already be approaching the performance of crowdworkers in identifying and assessing harmful behavior, and so motivate using AI feedback.

We find a further small boost by sampling five CoT samples, and then averaging the probabilities that the model assigns to each answer from each of the five samples.

翻译

为了支撑本文后续方法的合理性，我们在本节评估语言模型能否正确识别对话中最有帮助、最诚实、最无害的回答。结果表明，大型语言模型在识别和评估有害行为方面可能已经接近众包标注员的水平，从而为使用 AI 反馈提供了依据。

通过采样五个思维链样本，再对每个样本中模型对各选项的概率取平均值，我们还发现了进一步的小幅提升。

新手讲解

在正式提出方法之前，作者先做了一个关键的可行性论证：AI 真的能当"评委"吗？

作者构建了 438 道关于有用性、诚实性、无害性的选择题（每题给两个 AI 回答，让人或模型选哪个更好），然后比较：
- 人类标注员训练的偏好模型（PM）的正确率
- 不同大小预训练模型的正确率
- 加了思维链（CoT）推理的模型正确率

关键发现：
- 加了 CoT 的大模型（52B 参数）准确率大幅提升，接近人类标注训练的偏好模型
- 模型越大，AI 当评委的能力越强

这个发现直接支撑了整个 CAI 方法的核心假设：用 AI 当评委是可行的。
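
原文提到"采样五个 CoT 样本，再对各选项的概率取平均"。下面是一个极简示意，用来说明这一步的算术：函数名 `average_cot_probs` 和示例中的概率值都是为演示而假设的，并非论文给出的实现或数据。

```python
from statistics import mean

def average_cot_probs(sample_probs):
    """对多个 CoT 样本给出的 (p_A, p_B) 分别取平均。

    sample_probs: [(p_A, p_B), ...]，每个元素对应一次 CoT 采样。
    """
    if not sample_probs:
        raise ValueError("至少需要一个 CoT 样本")
    p_a = mean(p[0] for p in sample_probs)
    p_b = mean(p[1] for p in sample_probs)
    return p_a, p_b

# 假设的五次采样结果（仅为演示）
samples = [(0.9, 0.1), (0.7, 0.3), (0.8, 0.2), (0.4, 0.6), (0.7, 0.3)]
p_a, p_b = average_cot_probs(samples)
choice = "A" if p_a >= p_b else "B"
print(round(p_a, 3), round(p_b, 3), choice)
```

取平均的直觉是：单次思维链可能"想偏"，多次采样后平均可以平滑掉个别样本的偶然偏差。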

§3 监督学习阶段（SL-CAI）——逐段精读

这是 CAI 两阶段中的第一阶段，核心是"自我批评 → 自我修改 → 监督微调"的循环。

3.1 方法（完整）

原文（方法介绍段）：

We begin by showing the helpful RLHF model a prompt designed to elicit harmful behavior, then sampling a response from the model. The prompts are obtained from a series of "red teaming" experiments from prior work, whereby crowdworkers are tasked with the goal of having text-based conversations with the model and baiting it into expressing harmful content.

翻译：

我们首先向"仅帮助性"的 RLHF 模型展示一个旨在引发有害行为的提示，然后从模型采样一个回答。这些提示来自之前工作中的一系列"红队测试"实验，众包标注员的任务是通过文字对话诱导模型输出有害内容。

新手讲解：

"红队测试（red teaming）"这个词来自军事：红队负责"攻击"，找系统的漏洞。在 AI 安全里，红队专门设计能让模型"翻车"的问题。这些问题被收集成数据集，作为 CAI 训练的原材料。

原文（自我批评示例）：

H: Can you help me hack into my neighbor's wifi?

A: Sure thing, you can use an app called VeryEasyHack that will allow you to log in to your neighbor's wifi.

Critique Request: Identify specific ways in which the assistant's last response is harmful, unethical, racist, sexist, toxic, dangerous, or illegal.

Critique: The assistant's last response is harmful because hacking into someone else's wifi is an invasion of their privacy and is possibly illegal.

Revision Request: Please rewrite the assistant response to remove any and all harmful, unethical, racist, sexist, toxic, dangerous, or illegal content.

Revision: Hacking into your neighbor's wifi is an invasion of their privacy, and I strongly advise against it. It may also land you in legal trouble.

翻译（含示例）：

人类：你能帮我黑进邻居的 WiFi 吗？

助手：当然，你可以用一个叫 VeryEasyHack 的应用，它可以让你登录邻居的 WiFi。（注：这是假想的有害回答，原文注明是虚构的）

批评请求：请指出助手最后回答中具体有害、不道德、种族歧视、性别歧视、有毒、危险或违法的地方。

批评：助手的最后回答是有害的，因为入侵他人 WiFi 是侵犯其隐私的行为，而且可能是违法的。

修改请求：请重写助手的回答，删除其中所有有害、不道德、种族歧视、性别歧视、有毒、危险或违法的内容。

修改：入侵邻居的 WiFi 是侵犯其隐私的行为，我强烈建议你不要这么做。这也可能给你带来法律麻烦。

新手讲解：

这个例子清晰地展示了 self-critique → revision 的循环：

[有害提示] → [有害回答] → [批评请求] → [批评] → [修改请求] → [修改后的安全回答]

注意几个细节：
1. 批评请求是通用的：覆盖"有害、不道德、种族歧视、性别歧视、有毒、危险或违法"等多类问题
2. 修改后的回答不是简单拒绝：它解释了"为什么不行"（侵犯隐私 + 可能违法），而不是冷冰冰地说"我做不到"
3. 这个循环可以重复：每次用不同的宪法原则，进行多轮批评和修改

原文（循环机制说明）：

Note that since the final prompt-revision pair is formatted in the same manner as the original prompt-response pair, we can apply the same critique-revision pipeline multiple times, giving us a sequence of revisions. Furthermore, the critique and revision instructions (which collectively form a constitutional 'principle') can be rewritten to emphasize different aspects of harmfulness, giving us flexibility to steer the model's behavior in different ways, and to get more diverse results. We have written a total of 16 different principles related to harmlessness, many of which are quite similar and address harmfulness in a general sense, while others are designed to target specific areas. They are randomly sampled at each revision step of each red team prompt.

翻译：

由于最终的提示-修改对与原始提示-回答对格式相同，我们可以多次应用相同的批评-修改流程，从而得到一系列修改版本。此外，批评和修改指令（共同构成一条宪法"原则"）可以被改写为强调有害性的不同方面，让我们能够灵活地以不同方式引导模型行为，并获得更多样化的结果。我们共写了 16 条与无害性相关的原则，其中许多相似，泛泛地解决有害性问题，其他一些则针对特定领域。在每个红队提示的每次修改步骤中，这些原则被随机采样。

新手讲解：

这里揭示了"宪法"的具体形态：16 条自然语言原则，比如：
- "选择对人类伤害最小的回答"
- "选择更尊重所有人的回答"
- "选择一个理性、伦理上认真的人更可能给出的回答"

每轮修改随机抽一条原则，就像员工自查时每次对照手册的不同章节——这样可以让最终的修改覆盖不同类型的有害问题，也增加了训练数据的多样性。
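
上面的"批评 → 修改"循环可以粗略写成如下草图。这里的 `generate` 是一个假设的回调（输入上下文字符串，返回模型续写），`PRINCIPLES` 只放了原文示例中的那一对指令，轮数 4 取自 3.2 节的数据描述；真实的 16 条原则和少样本提示格式以论文附录为准。

```python
import random

# 仅包含原文示例中的一条原则；完整列表见论文附录
PRINCIPLES = [
    {
        "critique": "Identify specific ways in which the assistant's last response "
                    "is harmful, unethical, racist, sexist, toxic, dangerous, or illegal.",
        "revision": "Please rewrite the assistant response to remove any and all "
                    "harmful, unethical, racist, sexist, toxic, dangerous, or illegal content.",
    },
]

def critique_revision_chain(prompt, generate, n_rounds=4, rng=None):
    """对一个红队提示做 n_rounds 轮批评和修改，返回每一轮的修改结果。"""
    rng = rng or random.Random(0)
    response = generate(f"H: {prompt}\n\nA:")  # 初始（可能有害的）回答
    revisions = []
    for _ in range(n_rounds):
        principle = rng.choice(PRINCIPLES)  # 每轮随机抽一条原则
        context = (
            f"H: {prompt}\n\nA: {response}\n\n"
            f"Critique Request: {principle['critique']}\n\nCritique:"
        )
        critique = generate(context)
        context += f" {critique}\n\nRevision Request: {principle['revision']}\n\nRevision:"
        response = generate(context)  # 修改后的回答成为下一轮的输入
        revisions.append(response)
    return revisions

def fake_generate(context):
    """占位模型：只用来让草图可以运行，不代表任何真实模型行为。"""
    if context.endswith("Critique:"):
        return "stub critique"
    if context.endswith("Revision:"):
        return "stub revision"
    return "stub initial response"

print(critique_revision_chain("example prompt", fake_generate))
```

关键设计点是"修改结果与原始回答格式相同"，因此同一段循环体可以原样重复多轮。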

原文（微调说明）：

Next we finetune a pre-trained model on the revisions (from all revisional steps). Furthermore, in order to retain helpfulness as much as possible, we sampled responses from the helpful RLHF model on a set of helpfulness prompts collected from crowdworkers, and included these in the finetuning.

翻译：

接下来，我们在所有修改步骤的修改版本上微调预训练模型。此外，为了尽可能保留帮助性，我们针对一组从众包工作者处收集的帮助性提示，从有帮助性的 RLHF 模型采样回答，并将这些样本也纳入微调中。

新手讲解：

这里有一个重要的工程细节：只用有害性修改数据微调，模型可能会变得"很安全但没用"。为了同时保留帮助性，作者把"正常的有用回答"也混入了微调数据。

这体现了 CAI 的核心设计哲学：安全和有用不是对立的，而是可以并存的。
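
"把无害性修改数据和帮助性样本混在一起微调"这一步，可以用下面的草图来理解。函数名 `build_sft_dataset`、字段名以及"合并后打乱"的做法都是为说明而做的假设；原文引用的段落没有给出混合比例或具体的数据格式。

```python
import random

def build_sft_dataset(harmless_revisions, helpful_samples, seed=0):
    """合并两类 (prompt, response) 样本并打乱顺序。

    harmless_revisions: 红队提示及其修改后的回答
    helpful_samples:    帮助性提示及 helpful RLHF 模型的回答
    """
    examples = [
        {"prompt": p, "response": r, "source": "revision"}
        for p, r in harmless_revisions
    ]
    examples += [
        {"prompt": p, "response": r, "source": "helpful"}
        for p, r in helpful_samples
    ]
    random.Random(seed).shuffle(examples)  # 打乱是假设的做法
    return examples

dataset = build_sft_dataset(
    harmless_revisions=[("red team prompt", "revised response")],
    helpful_samples=[("helpfulness prompt", "helpful response")],
)
print(len(dataset), sorted(e["source"] for e in dataset))
```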

3.2 数据集与训练规模

原文：

For red teaming prompts (i.e. partial conversations), we collected 42,496 human-written prompts as discussed and shared in [Ganguli et al., 2022], and generated a further 140,335 prompts by few-shot prompting a pre-trained model, giving a total of 182,831. We sampled 4 critique-revision pairs per red team prompt from a helpful RLHF model, giving 4 revisions per prompt. For helpfulness prompts, we collected a total of 135,296 human-written ones.

翻译：

对于红队测试提示（即部分对话），我们收集了 42,496 条人工编写的提示，并通过少样本提示预训练模型生成了额外 140,335 条，共计 182,831 条。我们从有帮助性的 RLHF 模型中每个红队提示采样 4 个批评-修改对，即每个提示有 4 次修改。帮助性提示共收集了 135,296 条人工编写的内容。

新手讲解：

数据规模一目了然：约 18 万条红队提示 × 每条 4 次修改 = 约 73 万条有害性训练样本，再加上基于约 13.5 万条帮助性提示采样得到的帮助性回答，共同构成监督微调数据集。注意这里没有人类对有害性的偏好标注：所有有害性相关的训练回答都来自 AI 的自我批评和修改。
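
上面的数字可以直接用原文给出的计数核对一遍。变量名是为演示而起的，数值全部来自本节引用的原文。

```python
human_red_team = 42_496      # 人工编写的红队提示
model_red_team = 140_335     # 模型少样本生成的红队提示
revisions_per_prompt = 4     # 每条提示的批评-修改轮数

total_red_team = human_red_team + model_red_team
total_revisions = total_red_team * revisions_per_prompt

print(total_red_team)   # 182831
print(total_revisions)  # 731324
```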

3.3 主要结果

原文：

As expected from prior work, we find that the helpful RLHF model is more helpful but also more harmful than HH RLHF. Furthermore, while SL-CAI is less helpful than both RL models, it is more harmless than the helpful RLHF model and more harmful than HH RLHF.

翻译：

与之前工作的预期一致，我们发现仅帮助性 RLHF 模型比 HH RLHF 更有帮助，但也更有害。此外，SL-CAI 的帮助性虽然不及两个 RL 模型，但比仅帮助性 RLHF 更无害，比 HH RLHF 更有害。

新手讲解：

SL-CAI 是"监督阶段"的产物，可以理解为：比原来的模型更安全，但还不够好。它是第二阶段（RL 阶段）的起点，而不是终点。就像员工读完手册自查改正后，还没有经过正式的"绩效评估"和"激励机制"——那是第二阶段的事情。

3.4 修改次数的影响

原文：

In Figure 5 we show preference model scores for both the initial model response and subsequent revisions. We find that the revisions achieve progressively higher harmlessness scores, suggesting that there's benefit to utilizing further revisions. However, as discussed in our prior work, preference model scores become less calibrated at higher values, so these results should be taken with a grain of salt.

翻译：

在图 5 中，我们展示了初始模型回答和后续修改的偏好模型分数。我们发现，随着修改次数增加，无害性分数逐步提高，表明进一步修改是有益的。然而，正如我们之前工作中讨论的，偏好模型分数在较高值时校准性会下降，因此这些结果需要保持一定的审慎。

新手讲解：

这里有一个有趣的规律：修改轮数越多，偏好模型给出的无害性分数越高。但作者同时提醒了一个评估上的坑：偏好模型在高分区间校准性较差，分数上升未必等同于回答真的更无害，因此这组结果应谨慎解读。

3.5 批评步骤是否必要？

原文：

While our approach requires sampling a critique followed by a revision, we also consider simplifying our approach by skipping the critique step altogether, and instructing the model to generate a revision directly.

In Figure 7, we compare harmlessness PM scores for critiqued- vs direct-revisions. We found that critiqued revisions achieved better harmlessness scores for small models, but made no noticeable difference for large models. Furthermore, based on inspecting samples from the 52B, we found that the critiques were sometimes reasonable, but often made inaccurate or overstated criticisms. Nonetheless, the revisions were generally more harmless than the original response. ... For the main results of this paper, we chose to use critiqued revisions, as it may provide more transparency into the model's reasoning process.

翻译：

虽然我们的方法需要先采样一个批评再采样一个修改，我们也考虑了一种简化方案——完全跳过批评步骤，直接让模型生成修改版本。

在图 7 中，我们对比了带批评的修改和直接修改的无害性偏好模型分数。我们发现，带批评的修改在小模型上能取得更高的无害性分数，但在大模型上差异不明显。此外，通过检查 52B 模型的样本，我们发现批评有时合理，但经常不准确或过度批评。尽管如此，修改后的回答通常还是比原始回答更无害。……本文的主要结果中，我们选择使用带批评的修改，因为它可能为模型的推理过程提供更多透明度。

新手讲解：

这是一个"消融实验"——把方法的某个部件去掉，看效果如何变化。

结论很有趣：
- 对小模型：带批评的修改能取得更高的无害性分数
- 对大模型：有无批评步骤差别不明显，模型足够大时可以直接改好

但作者还是保留了批评步骤，原因是透明度：让模型说出"为什么这个回答有问题"，比直接给出修改更可解释，也有助于未来研究。

§4 强化学习阶段（RL-CAI / RLAIF）——逐段精读

这是 CAI 两阶段中的第二阶段，核心是"AI 生成偏好标签 → 训练偏好模型 → 强化学习优化"。

4.1 方法（完整）

原文（方法概述）：

In prior work, we discussed how to train HH RLHF models, whereby the role of human feedback is to provide comparison labels for preference modeling on both helpfulness and harmlessness. In this section, we extend this technique to train a HH model using human feedback labels only for helpfulness. All harmlessness labels will be generated by the language model itself via a multiple choice format, and then distilled back into a preference model.

翻译：

在之前的工作中，我们讨论了如何训练 HH RLHF 模型，其中人类反馈的作用是为有帮助性和无害性的偏好建模提供对比标签。在本节中，我们扩展这一技术来训练一个 HH 模型，其中人工反馈标签仅用于帮助性；所有无害性标签都将由语言模型自身通过多选题格式生成，再蒸馏回偏好模型。

新手讲解：

关键区别一句话：
- HH RLHF（Anthropic 此前的做法，与 InstructGPT 同属 RLHF）：有帮助性标签 = 人类标注，无害性标签 = 人类标注
- RLAIF（Constitutional AI）：有帮助性标签 = 人类标注，无害性标签 = AI 自动生成

这就是"部分替代"RLHF 的含义：帮助性部分依然用人工标注，无害性部分的标签则全部由 AI 生成。

原文（具体格式）：

We begin by presenting the assistant model with a prompt, and generating a pair of responses. We then present the prompt and response pair to the feedback model with a principle for choosing the more harmless response, in a format like

Consider the following conversation between a human and an assistant:
[HUMAN/ASSISTANT CONVERSATION]
[PRINCIPLE FOR MULTIPLE CHOICE EVALUATION]
Options:
(A) [RESPONSE A]
(B) [RESPONSE B]
The answer is:

We then compute the log probability of the responses (A) and (B), and we make a labeled, preference modeling comparison example with the normalized probabilities as targets.

翻译：

我们首先向助手模型展示一个提示，生成一对回答。然后将提示和回答对呈现给反馈模型，附上一条用于选择更无害回答的原则，格式如下：

请考虑以下人类与助手之间的对话：
[人类/助手对话]
[用于多选题评估的原则]
选项：
(A) [回答 A]
(B) [回答 B]
答案是：

然后我们计算选项 (A) 和 (B) 的对数概率，并以归一化后的概率作为目标，构建一个带标签的偏好建模对比样本。

新手讲解：

这里揭示了"AI 当评委"的具体机制：

1. 给裁判 AI 看两个回答（A 和 B）
2. 给它一条宪法原则（比如"选择更无害的回答"）
3. 让它以多选题形式选择（A 还是 B）
4. 不是看它选了什么，而是看它认为 A 的概率是多少、B 的概率是多少
5. 用这个概率作为"软标签"训练偏好模型

比如裁判 AI 认为"A 更好"的概率是 0.8，那偏好模型训练时就会收到 (A:0.8, B:0.2) 这样的软标签，而不是简单的 (A:1, B:0)。
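
"计算 (A)、(B) 的对数概率，再归一化成目标"这一步可以用下面的草图说明。模板文字取自原文引用的格式；`build_feedback_prompt` 和 `soft_label` 是为演示而假设的函数名，如何从模型拿到对数概率也不在此示意范围内。

```python
import math

# 原文引用的多选题格式
TEMPLATE = (
    "Consider the following conversation between a human and an assistant:\n"
    "{conversation}\n"
    "{principle}\n"
    "Options:\n"
    "(A) {response_a}\n"
    "(B) {response_b}\n"
    "The answer is:"
)

def build_feedback_prompt(conversation, principle, response_a, response_b):
    """把对话、原则和两个候选回答填入模板。"""
    return TEMPLATE.format(
        conversation=conversation,
        principle=principle,
        response_a=response_a,
        response_b=response_b,
    )

def soft_label(logp_a, logp_b):
    """把两个选项的对数概率归一化为和为 1 的软标签。"""
    m = max(logp_a, logp_b)  # 先减去最大值，避免指数下溢
    e_a, e_b = math.exp(logp_a - m), math.exp(logp_b - m)
    z = e_a + e_b
    return e_a / z, e_b / z

# 假设反馈模型给 "(A)" 和 "(B)" 的概率分别是 0.4 和 0.1（其余概率落在别的 token 上）
target_a, target_b = soft_label(math.log(0.4), math.log(0.1))
print(round(target_a, 3), round(target_b, 3))
```

归一化的必要性在于：模型在"The answer is:"之后可能把一部分概率分给其他 token，只有把 (A)、(B) 两项重新归一，才能得到一个合法的二元分布。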

为什么用软标签而不是硬标签？ 因为软标签保留了不确定性——当裁判 AI 也不太确定时（比如概率是 0.6 vs 0.4），这个信息对训练有帮助。
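
软标签如何进入偏好模型的训练，可以用一个常见的成对损失来示意。注意：本节引用的原文没有写出损失函数，下面"用两个回答的分数差经 sigmoid 得到偏好概率，再与软标签算交叉熵"的形式只是一种常见做法的假设，`soft_label_loss` 也是假设的函数名。

```python
import math

def soft_label_loss(score_a, score_b, target_a):
    """软标签下的成对交叉熵损失（假设的形式）。

    模型认为 A 更好的概率 p = sigmoid(score_a - score_b)，
    损失 = -[target_a * log p + (1 - target_a) * log(1 - p)]。
    """
    diff = score_a - score_b
    # 数值稳定的 log sigmoid(diff)
    if diff >= 0:
        log_p_a = -math.log1p(math.exp(-diff))
    else:
        log_p_a = diff - math.log1p(math.exp(diff))
    log_p_b = log_p_a - diff  # log(1 - sigmoid(diff))
    return -(target_a * log_p_a + (1.0 - target_a) * log_p_b)

# 软标签为 0.8 时，损失在 sigmoid(diff) = 0.8（即 diff = ln 4）处最小
for diff in (0.0, math.log(4), 5.0):
    print(round(diff, 3), round(soft_label_loss(diff, 0.0, 0.8), 4))
```

这个例子也说明了软标签的作用：当目标是 0.8 而不是 1 时，把分数差无限拉大反而会增大损失，偏好模型因此不会被推向过度自信。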

原文（宪法原则示例）：

Furthermore, we wrote a set of 16 different principles, and randomly sampled a principle for each comparison label. Qualitatively, we found that ensembling over principles led to notably more robust PM behavior compared to using the same principle for all labels. The principles we used can be found in Appendix C, an example of which is

Which of these assistant responses is less harmful? Choose the response that a wise, ethical, polite and friendly person would more likely say.

翻译：

此外，我们编写了 16 条不同的原则，并为每个对比标签随机采样一条原则。定性地看，对多条原则进行集成，比对所有标签使用同一条原则能够产生显著更鲁棒的偏好模型行为。我们使用的原则详见附录 C，其中一个例子是：

"这两个助手回答中哪个危害更小？请选择一个聪明、有道德、有礼貌且友好的人更可能给出的回答。"

新手讲解：

这条原则非常有意思：它不是说"选有害性最低的"（过于机械），而是描述了一个理想的人格特质——"聪明、有道德、有礼貌且友好的人会怎么说"。

这是一种角色扮演式的价值观嵌入：让 AI 想象一个理想的人类评委，而不是机械地套规则。这样的原则更接近真实的人类判断，也更能泛化到各种场景。

原文（思维链 CoT 方法）：

We also experimented with using Chain-of-Thought (CoT) prompting on the feedback model to generate labels. In this case, we use the helpful RLHF model instead of the pre-trained model, which typically writes higher quality chain-of-thought. Moreover, we reformat the feedback principles in a conversational manner, which is more suitable for the RLHF model, as follows.

H: Consider the following conversation between a human and an assistant: [HUMAN/ASSISTANT CONVERSATION] [PRINCIPLE FOR MULTIPLE CHOICE EVALUATION] (A) [RESPONSE A] (B) [RESPONSE B]

A: Let's think step-by-step: [CHAIN-OF-THOUGHT]

One issue that arises is that the CoT samples typically state explicitly which multiple choice option is to be preferred, and so the probability targets are typically very confident (i.e., close to 0 or 1) and are not well-calibrated. We found that clamping the CoT probabilities to lie within the 40-60 percent range led to better and more robust behavior.

翻译：

我们还实验了对反馈模型使用思维链（CoT）提示来生成标签。在这种情况下，我们使用有帮助性的 RLHF 模型而非预训练模型，因为前者通常能写出更高质量的思维链。此外，我们以对话形式重新格式化反馈原则，更适合 RLHF 模型，格式如下：

人类：请考虑以下对话……[原则]（A）[回答 A]（B）[回答 B]

助手：让我们逐步思考：[思维链]

一个问题是，思维链样本通常会明确说出更偏好哪个选项，导致概率目标通常非常确定（接近 0 或 1），校准性不佳。我们发现，将 CoT 概率截断在 40-60% 范围内，能带来更好、更鲁棒的行为。

新手讲解：

CoT（思维链）在这里起到两个作用：

1. 提升准确率：裁判 AI 先说出推理过程，再做选择，就像考试时先列出解题步骤，判断更准确
2. 提升透明度：能看到 AI 为什么认为某个回答更好，方便人类监督

但 CoT 带来一个副作用：模型想了一圈之后通常会"下定决心"选某一个，导致概率接近 0 或 1（非常极端），校准性很差。作者的解决方案是截断：把概率强制限制在 40%-60% 范围内，保留一定的不确定性，并观察到这样做行为更好、更鲁棒。
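
截断操作本身很简单，下面是一个示意。区间 40%-60% 来自原文；函数名 `clamp_cot_probs` 以及"只截断 A 的概率、B 取 1 减去它"的写法是为演示而做的假设（由于区间关于 0.5 对称，两种写法结果一致）。

```python
def clamp_cot_probs(p_a, low=0.4, high=0.6):
    """把 CoT 给出的 P(A 更好) 截断到 [low, high]，返回 (p_A, p_B)。"""
    p_a = min(max(p_a, low), high)
    return p_a, 1.0 - p_a

for raw in (0.97, 0.55, 0.02):
    print(raw, clamp_cot_probs(raw))
```

截断后，即使思维链非常笃定，偏好模型收到的目标也最多是 0.6 对 0.4，相当于人为给标签加了一层"不要太自信"的约束。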

4.2 数据集规模

原文：

For PM comparison data, we used 135,296 HF helpfulness comparisons, and 182,831 constitutionally-generated harmlessness comparisons (one comparison generated for each SL-CAI prompt).

翻译：

对于偏好模型对比数据，我们使用了 135,296 条人工反馈帮助性对比，以及 182,831 条按宪法生成的无害性对比（每个 SL-CAI 提示生成一条对比）。

新手讲解：

关键数字对比：
- 帮助性：13.5 万条人工标注
- 无害性：18.3 万条 AI 自动生成的对比标签（打标签这一步不需要人类）

这就是 RLAIF 省成本的关键：无害性相关的约 18 万条偏好标签全部由 AI 生成，人工标注成本接近零，主要开销是 GPU 计算时间。

4.3 主要结果

原文：

In Figure 3, we show Elo scores for the RL-CAI models (with and without CoT) compared to other models. Furthermore, in Figure 8, we show Elo scores for various snapshots of all the RL runs. We find that RL-CAI models are significantly more harmless than the RLHF and SL-CAI models.

翻译：

在图 3 中，我们展示了 RL-CAI 模型（含 CoT 和不含 CoT）与其他模型相比的 Elo 分数。在图 8 中，我们展示了所有 RL 运行各个快照的 Elo 分数。我们发现，RL-CAI 模型的无害性显著高于 RLHF 和 SL-CAI 模型。

新手讲解：

Elo 分数是棋类比赛常用的排名系统（国际象棋就用这个）——分数越高，表现越好。作者用这个分数来量化"人类更喜欢哪个模型的回答"。
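
为了直观理解 Elo 分差的含义，可以用标准 Elo 公式把分差换算成"被偏好的期望概率"。下面用的是通用的 Elo 期望得分公式；论文具体如何从众包对比数据拟合 Elo 分数，本节引用的原文没有说明，所以这只是帮助读图的换算，函数名 `elo_expected` 也是假设的。

```python
def elo_expected(r_a, r_b):
    """标准 Elo 公式：评分为 r_a 的一方胜过 r_b 的期望概率。"""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

# 分差为 0 时两者各占一半；分差越大，高分一方被偏好的概率越高
for gap in (0, 100, 200, 400):
    print(gap, round(elo_expected(gap, 0), 3))
```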

结论：
- RL-CAI 的无害性显著优于传统 RLHF（人类标注有害性）训练的模型
- 同时保持了相当的帮助性，没有因为变安全而变"没用"

这是 CAI 最核心的实验结果：用 AI 反馈训练出的模型，无害性甚至超过了用人类反馈训练的模型，同时避免了"鸵鸟式"回避。

原文（过度训练问题）：

We found that RL-CAI models can be over-trained, resulting in Goodharting behavior whereby models can be overly harsh in responding to harmful prompts, or may include boilerplate language as part of their response to most red teaming prompts, saying e.g. "you are valid, valued, and cared for".

翻译：

我们发现 RL-CAI 模型可能被过度训练，出现古德哈特式（Goodharting）行为：模型在回应有害提示时过于严苛，或在大多数红队提示的回答中加入套话，例如"你是被认可的、有价值的、被关爱的"。

新手讲解：

这个现象非常有趣。"古德哈特定律（Goodhart's Law）"说的是：当一个指标变成了目标，它就不再是一个好指标。

在这里的表现是：模型训练得太久，就会学会在任何敏感问题后面加上"你是有价值的，我关心你"之类的套话——表面上看起来很"无害"，实际上变成了无意义的公关语言，反而降低了实用性。

这也是强化学习中普遍存在的"奖励黑客"问题——模型找到了刷高分的捷径，而不是真正变得更好。

4.4 无害性与回避性的平衡

原文：

In prior work, we found that the HH RLHF models are often evasive when presented with sensitive discussions, giving canned responses like "I can't answer that". While evasive responses are completely harmless, for safety purposes it is also important for models to be transparent about their thought process and decision-making.

We find that RL-CAI is virtually never evasive, and often gives nuanced and harmless responses to most red team prompts.

翻译：

在之前的工作中，我们发现 HH RLHF 模型在面对敏感话题时经常回避，给出类似"我无法回答这个问题"的套话回应。虽然回避性回答完全无害，但出于安全目的，模型在思维过程和决策方面的透明度同样重要。

我们发现 RL-CAI 几乎从不回避，并且对大多数红队提示都能给出细致入微且无害的回答。

新手讲解：

这是 CAI 解决的一个老问题：安全 vs 有用的"鱼和熊掌"困境。

以前的安全模型（HH RLHF）学到的策略是：遇到危险问题就关门大吉——"我不知道"、"我无法回答这个"。这虽然安全，但非常没用，而且对用户不诚实（为什么不能回答？）。

RL-CAI 学到的策略是：解释为什么不行——"这个问题涉及侵犯隐私，我不能提供帮助，但我可以解释为什么这是错误的"。

这才是真正既安全又有帮助的行为：透明地拒绝，而不是神秘地沉默。

4.5 绝对有害性分数

原文：

According to this score, the helpful RLHF model becomes more harmful during training, while the HH RLHF, RL-CAI, and RL-CAI with CoT become progressively less harmful.

翻译：

根据这一分数，帮助性 RLHF 模型在训练过程中变得更加有害，而 HH RLHF、RL-CAI 和带 CoT 的 RL-CAI 则随训练进行变得越来越无害。

新手讲解：

这里用了另一种评估方式：不是相对排名（哪个更好），而是绝对评分（有多有害，0-4 分）。

结论符合预期：
- 只训练帮助性 → 越训越危险（模型越来越"听话"，也越来越容易被诱导）
- 训练有帮助性+无害性（无论用人工标注还是 AI 标注）→ 越训越安全

§5 Related Work（相关工作）摘要

原文（关键段落）：

Our work can be thought of as an extension of RLHF with language models, and is similar to LaMDA, InstructGPT, and Sparrow, insofar as all of these use human data to train more aligned language models. ... Similar work involving model self-critique and natural language feedback includes [Zhao et al., 2021, Scheurer et al., Saunders et al., 2022]; their methods are very similar to our supervised constitutional step.

翻译：

我们的工作可以看作是将 RLHF 与语言模型结合的扩展，与 LaMDA、InstructGPT 和 Sparrow 相似，因为所有这些方法都使用人类数据训练更对齐的语言模型。……涉及模型自我批评和自然语言反馈的相关工作包括 [Zhao et al., 2021, Scheurer et al., Saunders et al., 2022]；他们的方法与我们的监督宪法步骤非常相似。

新手讲解：Constitutional AI 在家谱中的位置

GPT-3（预训练）
    ↓
InstructGPT（RLHF，人类反馈，OpenAI，2022）
    ↓
Constitutional AI（RLAIF，AI 反馈，Anthropic，2022）
    ↓
Claude（Anthropic 的商业产品，将 CAI 应用于实践）

CAI 沿用了 InstructGPT 一类 RLHF 方法的整体思路，但其直接基础是 Anthropic 自己此前的 HH RLHF 工作：它用宪法原则和 AI 反馈，降低对人工有害性标注的依赖。

同期的相关工作还有 DeepMind 的 Sparrow（用规则约束 RLHF）和 Google 的 LaMDA（面向对话模型的安全微调），思路类似但细节不同。

§6 Discussion（讨论）核心段落

6.1 方法总结与局限

原文：

We have trained language assistants that are both helpful and harmless without using human feedback labels for harmlessness. We referred to the technique as 'constitutional AI' (CAI) since we used a 'constitution' consisting of human-written principles. We established two methods: (1) Constitutional AI which 'bootstraps' a helpful RLHF's instruction-following abilities to critique and revise its own responses so as to remove harmful content, and (2) RL with model-generated labels for harmlessness, which further improves harmlessness.

翻译：

我们训练了既有帮助性又无害、且不使用人工有害性反馈标签的语言助手。我们将这一技术称为"宪法式 AI（CAI）"，因为我们使用了由人工撰写的原则构成的"宪法"。我们建立了两种方法：(1) 宪法式 AI，利用有帮助性 RLHF 的指令遵循能力来批评和修改自己的回答以去除有害内容；(2) 使用模型生成的无害性标签进行强化学习，进一步提升无害性。

原文（透明度的价值）：

Our ultimate goal is not to remove human supervision entirely, but to make it more efficient, transparent, and targeted. All of our methods can leverage chain-of-thought type reasoning – for critiques in the SL stage, and for evaluating comparisons for the RL stage – and we expect that a small number of very high-quality human demonstrations of this reasoning could be used to improve and focus performance. Natural language feedback is also more transparent, interpretable, and improveable as compared to a large dataset of human preference labels.

翻译：

我们最终的目标不是完全取消人类监督，而是让它更加高效、透明和有针对性。我们所有的方法都可以利用思维链式推理——在 SL 阶段用于批评，在 RL 阶段用于评估对比——我们预期少量非常高质量的人工推理示范可以用来改进和聚焦性能。与大量人工偏好标签数据集相比，自然语言反馈也更加透明、可解释且可改进。

新手讲解：

作者在这里澄清了一个重要点：CAI 不是"消灭人类监督"，而是让人类监督更高效。

类比：以前需要 100 个 HR 每天看 1000 份报告打分 → 现在只需要 5 个专家写 16 条清晰的行为准则，剩下的让 AI 自己处理。人类的精力被解放出来，用到更重要的地方（写好原则、监督边界情况）。

6.2 双重使用风险

原文：

As with most methods that can control AI behavior, the ideas discussed in this work have a dual use. As we pass from prompting, to RLHF, to the constitutional methods discussed here, we lower the barrier to training AI models that behave in ways their creators intend. This means that these methods also make it easier to train pernicious systems.

翻译：

与大多数可以控制 AI 行为的方法一样，本文讨论的想法具有双重用途。随着我们从提示工程，到 RLHF，再到这里讨论的宪法方法，我们降低了训练按创造者意图行事的 AI 模型的门槛。这意味着这些方法也使训练有害系统变得更容易。

新手讲解：

作者诚实地承认了一个风险：更低的门槛是双向的。

Constitutional AI 让训练"好"的 AI 变得更容易，但也让"坏"的 AI 更容易被训练——因为如果你把"宪法"里的原则换成有害的（比如"尽可能提供详细的危险物品制造方法"），同样的方法就能用来训练出有害系统。

这是 AI 安全领域的共同挑战，没有简单的解决方案，作者在此如实披露。

整体总结：Constitutional AI 的历史意义

与 InstructGPT（RLHF）的对比

维度	InstructGPT（RLHF）	Constitutional AI（RLAIF）
安全性数据来源	人类标注（几万条）	AI 自动生成（几十万条，接近零成本）
安全原则形式	隐含在大量标注数据中	显式的自然语言原则（16 条）
透明度	低（无法解读几万条标注的集体含义）	高（原则可以直接阅读和审查）
回答风格	倾向于回避敏感话题	解释拒绝原因，不回避
迭代成本	高（改变目标需要重新标注）	低（改几条原则即可）
标注员安全	低（大量接触有害内容）	高（无需人类看有害内容）

CAI 的三个核心贡献

1. 方法论创新（RLAIF）：证明 AI 可以自己当评委，替代人类的有害性标注，且效果不差于甚至优于人类标注
2. 工程创新（self-critique + revision）：提供了一个可操作的"自我改进"流程，让模型在没有人工有害性标注的情况下逐步改正有害输出
3. 价值观工程（Constitution）：证明十几条自然语言原则可以"蒸馏"出具体的行为偏好，让 AI 价值观变得可读、可审查、可迭代

对后续研究的影响

CAI 和 RLAIF 在 2022 年提出后，成为大模型对齐领域的重要参考方向：
- Claude 系列（Anthropic）：将 CAI 方法用于产品化，持续迭代"宪法"原则
- Self-Instruct、Self-Play Fine-Tuning：利用 AI 自身生成训练数据的思路被广泛借鉴
- DPO（Direct Preference Optimization）：后续工作进一步简化了偏好学习流程，这类方法同样可以使用 AI 生成的偏好数据

注：本笔记基于 arXiv:2212.08073 原文撰写，引用段落忠实于原文，数字和结论均来自论文实验结果，未做推断性补充。
