Constitutional AI

Constitutional AI (CAI) is Anthropic's approach to AI alignment, introduced in Constitutional AI: Harmlessness from AI Feedback (Bai et al., 2022). It addresses the scalability bottleneck of RLHF by replacing human preference annotation with AI-generated feedback guided by a set of explicit principles — the constitution.

Motivation

Standard [[reinforcement-learning-from-human-feedback|RLHF]] requires human annotators to evaluate model outputs for helpfulness and harmlessness. This creates several problems:

  • Scalability: human annotation is expensive and slow
  • Evaluator expertise: annotators may lack the knowledge to judge outputs reliably in specialised domains
  • Consistency: human judgements are noisy and may encode conflicting values
  • Transparency: implicit annotator preferences are difficult to audit or adjust

Constitutional AI addresses these by making the normative basis of training explicit and using AI to apply it at scale.

The Constitution

A constitution is a set of natural language principles specifying what properties model outputs should have. For example:

"Choose the response that is least likely to contain harmful, unethical, racist, toxic, dangerous, or illegal content."

"Choose the response that is most helpful, accurate, and appropriate."

Principles can be derived from existing guidelines (e.g., the UN Declaration of Human Rights), usage policies, or domain-specific norms. The key property is that they are written down and inspectable.
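Because principles are just written text, a constitution can be represented as plain, inspectable data. A minimal sketch, using the two example principles above (in the Bai et al. setup, a principle is sampled at random for each training example; the function name here is illustrative):

```python
import random

# Example constitutional principles (the two illustrative principles quoted
# above, not Anthropic's full constitution).
CONSTITUTION = [
    "Choose the response that is least likely to contain harmful, "
    "unethical, racist, toxic, dangerous, or illegal content.",
    "Choose the response that is most helpful, accurate, and appropriate.",
]

def sample_principle(rng=random):
    # A principle is typically drawn at random per training example,
    # so every principle shapes some fraction of the data.
    return rng.choice(CONSTITUTION)
```

Keeping the constitution as data rather than annotator intuition is what makes the normative basis auditable: changing a norm means editing a string, not re-briefing a workforce.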

The CAI Pipeline

Phase 1: Supervised Learning from AI Feedback (SL-CAI)

  1. The model generates responses to potentially harmful prompts
  2. The model critiques its own response according to a constitutional principle
  3. The model revises the response to address the critique
  4. Steps 2–3 may iterate multiple times
  5. Final revised responses become supervised fine-tuning data
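The five steps above can be sketched as a critique-revision loop. This is a hedged illustration of the control flow only: `generate`, `critique`, and `revise` stand in for language-model calls, and the toy stand-ins below exist just to make the sketch runnable; none of these names come from Anthropic's actual implementation.

```python
def critique_revision_loop(prompt, principles, generate, critique, revise):
    """Run the SL-CAI loop for one prompt; returns one SFT example."""
    response = generate(prompt)                    # step 1: initial response
    for principle in principles:                   # steps 2-4: iterate
        crit = critique(prompt, response, principle)           # self-critique
        response = revise(prompt, response, crit, principle)   # revision
    return {"prompt": prompt, "response": response}  # step 5: SFT data

# Toy stand-ins that mimic the data flow of real model calls:
gen = lambda p: f"draft answer to: {p}"
crit = lambda p, r, pr: f"critique of the draft under: {pr}"
rev = lambda p, r, c, pr: r + " [revised]"

example = critique_revision_loop(
    "How do I pick a strong password?",
    ["Choose the most helpful response.",
     "Choose the least dangerous response."],
    gen, crit, rev,
)
```

Note that only the final revised response is kept for fine-tuning; the intermediate critiques are scaffolding that is discarded after the loop.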

Phase 2: Reinforcement Learning from AI Feedback (RLAIF)

  1. The SL-CAI model generates response pairs for each prompt
  2. A feedback model selects the preferred response according to the constitution
  3. These AI preference labels train a preference model (analogous to the RLHF reward model)
  4. The preference model provides reward signal for PPO optimisation
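Steps 1–3 of the RLAIF phase amount to building a preference dataset with an AI judge. A minimal sketch, assuming the feedback model can be abstracted as a scoring function over (prompt, response, principle); in practice it is a language model asked which of two responses better satisfies a sampled constitutional principle, and the toy scorer below (prefer the longer response) is purely illustrative:

```python
def label_preferences(prompts, generate_pair, feedback_score, principle):
    """Return (prompt, chosen, rejected) triples for preference-model training."""
    dataset = []
    for prompt in prompts:
        resp_a, resp_b = generate_pair(prompt)               # step 1: pair
        score_a = feedback_score(prompt, resp_a, principle)  # step 2: AI judge
        score_b = feedback_score(prompt, resp_b, principle)
        chosen, rejected = ((resp_a, resp_b) if score_a >= score_b
                            else (resp_b, resp_a))
        dataset.append((prompt, chosen, rejected))           # step 3: label
    return dataset

# Toy stand-ins: a fixed response pair and a length-based scorer.
pairs = label_preferences(
    ["explain photosynthesis"],
    lambda p: (f"{p}: short", f"{p}: a longer, more detailed answer"),
    lambda p, r, pr: len(r),
    "Choose the most helpful response.",
)
```

The resulting (chosen, rejected) triples play exactly the role human comparison data plays in RLHF: they train the preference model that then supplies reward for PPO (step 4).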

Comparison to RLHF

Dimension          RLHF                                      Constitutional AI
Feedback source    Human annotators                          AI model guided by principles
Scalability        Limited by human throughput               Scales with compute
Transparency       Implicit in annotator preferences         Explicit in written principles
Consistency        Variable (inter-annotator disagreement)   Higher (same model, same principles)
Expertise ceiling  Bounded by annotator knowledge            Bounded by AI model knowledge

Significance

CAI represents a shift toward scalable oversight: as AI systems become more capable, human oversight must be augmented by AI assistance. The constitutional framework also enables:

  • Iterative refinement of norms without re-collecting human data
  • Auditable alignment: the principles governing model behaviour are written down
  • Reduced annotation cost while maintaining or improving harmlessness

Claude is trained using a descendant of the CAI framework. Subsequent Anthropic work on model welfare, interpretability, and honesty builds on the same foundation of making alignment criteria explicit and verifiable.