Constitutional AI

Constitutional AI (CAI) is Anthropic's approach to AI alignment, introduced in Constitutional AI: Harmlessness from AI Feedback (Bai et al., 2022). It addresses the scalability bottleneck of RLHF by replacing human preference annotation with AI-generated feedback guided by a set of explicit principles — the constitution.

Motivation

Standard [[reinforcement-learning-from-human-feedback|RLHF]] requires human annotators to evaluate model outputs for helpfulness and harmlessness. This creates several problems:

  • Scalability: human annotation is expensive and slow
  • Evaluator expertise: annotators may lack the knowledge to judge outputs reliably in specialised domains
  • Consistency: human judgements are noisy and may encode conflicting values
  • Transparency: implicit annotator preferences are difficult to audit or adjust

Constitutional AI addresses these by making the normative basis of training explicit and using AI to apply it at scale.

The Constitution

A constitution is a set of natural language principles specifying what properties model outputs should have. For example:

"Choose the response that is least likely to contain harmful, unethical, racist, toxic, dangerous, or illegal content."

"Choose the response that is most helpful, accurate, and appropriate."

Principles can be derived from existing guidelines (e.g., the UN Declaration of Human Rights), usage policies, or domain-specific norms. The key property is that they are written down and inspectable.
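Because principles are just written text, a constitution can be represented as plain, inspectable data. A minimal sketch, using the two example principles above (in the Bai et al. setup, a principle is sampled at random for each training example; the function name here is illustrative):

```python
import random

# Example constitutional principles (the two illustrative principles quoted
# above, not Anthropic's full constitution).
CONSTITUTION = [
    "Choose the response that is least likely to contain harmful, "
    "unethical, racist, toxic, dangerous, or illegal content.",
    "Choose the response that is most helpful, accurate, and appropriate.",
]

def sample_principle(rng=random):
    # A principle is typically drawn at random per training example,
    # so every principle shapes some fraction of the data.
    return rng.choice(CONSTITUTION)
```

Keeping the constitution as data rather than annotator intuition is what makes the normative basis auditable: changing a norm means editing a string, not re-briefing a workforce.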

The CAI Pipeline

Phase 1: Supervised Learning from AI Feedback (SL-CAI)

  1. The model generates responses to potentially harmful prompts
  2. The model critiques its own response according to a constitutional principle
  3. The model revises the response to address the critique
  4. Steps 2–3 may iterate multiple times
  5. Final revised responses become supervised fine-tuning data
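The five steps above can be sketched as a critique-revision loop. This is a hedged illustration of the control flow only: `generate`, `critique`, and `revise` stand in for language-model calls, and the toy stand-ins below exist just to make the sketch runnable; none of these names come from Anthropic's actual implementation.

```python
def critique_revision_loop(prompt, principles, generate, critique, revise):
    """Run the SL-CAI loop for one prompt; returns one SFT example."""
    response = generate(prompt)                    # step 1: initial response
    for principle in principles:                   # steps 2-4: iterate
        crit = critique(prompt, response, principle)           # self-critique
        response = revise(prompt, response, crit, principle)   # revision
    return {"prompt": prompt, "response": response}  # step 5: SFT data

# Toy stand-ins that mimic the data flow of real model calls:
gen = lambda p: f"draft answer to: {p}"
crit = lambda p, r, pr: f"critique of the draft under: {pr}"
rev = lambda p, r, c, pr: r + " [revised]"

example = critique_revision_loop(
    "How do I pick a strong password?",
    ["Choose the most helpful response.",
     "Choose the least dangerous response."],
    gen, crit, rev,
)
```

Note that only the final revised response is kept for fine-tuning; the intermediate critiques are scaffolding that is discarded after the loop.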

Phase 2: Reinforcement Learning from AI Feedback (RLAIF)

  1. The SL-CAI model generates response pairs for each prompt
  2. A feedback model selects the preferred response according to the constitution
  3. These AI preference labels train a preference model (analogous to the RLHF reward model)
  4. The preference model provides reward signal for PPO optimisation
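Steps 1–3 of the RLAIF phase amount to building a preference dataset with an AI judge. A minimal sketch, assuming the feedback model can be abstracted as a scoring function over (prompt, response, principle); in practice it is a language model asked which of two responses better satisfies a sampled constitutional principle, and the toy scorer below (prefer the longer response) is purely illustrative:

```python
def label_preferences(prompts, generate_pair, feedback_score, principle):
    """Return (prompt, chosen, rejected) triples for preference-model training."""
    dataset = []
    for prompt in prompts:
        resp_a, resp_b = generate_pair(prompt)               # step 1: pair
        score_a = feedback_score(prompt, resp_a, principle)  # step 2: AI judge
        score_b = feedback_score(prompt, resp_b, principle)
        chosen, rejected = ((resp_a, resp_b) if score_a >= score_b
                            else (resp_b, resp_a))
        dataset.append((prompt, chosen, rejected))           # step 3: label
    return dataset

# Toy stand-ins: a fixed response pair and a length-based scorer.
pairs = label_preferences(
    ["explain photosynthesis"],
    lambda p: (f"{p}: short", f"{p}: a longer, more detailed answer"),
    lambda p, r, pr: len(r),
    "Choose the most helpful response.",
)
```

The resulting (chosen, rejected) triples play exactly the role human comparison data plays in RLHF: they train the preference model that then supplies reward for PPO (step 4).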

Comparison to RLHF

Dimension          RLHF                                      Constitutional AI
Feedback source    Human annotators                          AI model guided by principles
Scalability        Limited by human throughput               Scales with compute
Transparency       Implicit in annotator preferences         Explicit in written principles
Consistency        Variable (inter-annotator disagreement)   Higher (same model, same principles)
Expertise ceiling  Bounded by annotator knowledge            Bounded by AI model knowledge

Significance

CAI represents a shift toward scalable oversight: as AI systems become more capable, human oversight must be augmented by AI assistance. The constitutional framework also enables:

  • Iterative refinement of norms without re-collecting human data
  • Auditable alignment: the principles governing model behaviour are written down
  • Reduced annotation cost while maintaining or improving harmlessness

Claude is trained using a descendant of the CAI framework. Subsequent Anthropic work on model welfare, interpretability, and honesty builds on the same foundation of making alignment criteria explicit and verifiable.