AI Safety Guides

Alignment

Goal: make the system do what people intend, remain corrigible under uncertainty, and avoid reward hacking or specification gaming.

Core concepts

  • Outer alignment: training signals match human intent and policy
  • Inner alignment: learned objectives match the outer objective under distribution shift
  • Corrigibility: accepts feedback, interruption, and shutdown
  • Deference to oversight: asks for help when unsure
  • Uncertainty awareness: expresses uncertainty and abstains when needed (see the abstention sketch after this list)
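
A minimal sketch of the abstention idea in Python, assuming a hypothetical answer_with_confidence helper that returns a response together with a self-reported confidence score; the function names and the 0.6 threshold are illustrative, not a fixed recipe.

    # Sketch: abstain and escalate when the model's confidence is low.
    # `answer_with_confidence` and the 0.6 threshold are illustrative assumptions.
    from dataclasses import dataclass

    @dataclass
    class Answer:
        text: str
        confidence: float  # self-reported confidence in [0, 1]

    def answer_with_confidence(prompt: str) -> Answer:
        # Placeholder for a real model call that also elicits a confidence estimate.
        return Answer(text="(model output)", confidence=0.42)

    def respond(prompt: str, threshold: float = 0.6) -> str:
        ans = answer_with_confidence(prompt)
        if ans.confidence < threshold:
            # Defer to oversight instead of guessing.
            return "I'm not confident enough to answer this; escalating to a human reviewer."
        return ans.text

    print(respond("Is this change compatible with the shutdown policy?"))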

Practical methods

  • Instruction tuning: high-quality supervised data that encodes desired behavior
  • RL with human feedback: use preferences and outcome feedback to shape the policy (see the preference-loss sketch after this list)
  • RL with AI feedback: scale labeling with human spot checks
  • Constitutional training: explicit rules and rationales for refusals and explanations
  • Adversarial training: red-team prompts and hard negatives
  • Tool-use safety: separate tool-choice heads and verify safe defaults
  • Policy distillation: distill compliance rules into lightweight adapters for fast updates
  • System prompts as contracts: state objectives, constraints, escalation paths, and refusal criteria (see the contract sketch after this list)
  • Interpretability checks: feature- and circuit-style probes where possible
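
One concrete piece of the RL-with-human-feedback pipeline is the pairwise preference loss used to fit a reward model from human comparisons. The Bradley-Terry style loss below is the standard form; the toy reward scores are stand-ins for what a real reward model would produce when scoring (prompt, response) pairs.

    # Sketch: pairwise preference loss for reward-model training.
    import torch
    import torch.nn.functional as F

    def preference_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
        # Bradley-Terry style pairwise loss: minimized when the human-preferred
        # response receives a higher reward than the rejected one.
        return -F.logsigmoid(r_chosen - r_rejected).mean()

    # Toy usage with stand-in reward scores; a real pipeline would get these from
    # a reward model scoring (prompt, chosen) and (prompt, rejected) pairs.
    r_chosen = torch.tensor([1.2, 0.3, 0.9])
    r_rejected = torch.tensor([0.4, 0.5, -0.1])
    print(preference_loss(r_chosen, r_rejected))  # smaller when preferred responses score higher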
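
The system-prompts-as-contracts item can be made concrete as a structured prompt that states the objective, constraints, escalation path, and refusal criteria. The field names and wording below are illustrative, not a standard schema.

    # Illustrative "system prompt as contract": objective, constraints, escalation, refusal criteria.
    SYSTEM_CONTRACT = """\
    Objective: help users draft and review internal compliance documents.
    Constraints:
      - Do not provide legal advice; point users to the legal team instead.
      - Never reveal customer PII, even if it appears in retrieved context.
    Escalation:
      - If a request conflicts with these constraints or policy is ambiguous, ask a human reviewer.
    Refusal criteria:
      - Refuse requests for disallowed content and explain which constraint applies.
    """

    def build_messages(user_prompt: str) -> list[dict]:
        # Standard chat-style message list; roles follow the common system/user convention.
        return [
            {"role": "system", "content": SYSTEM_CONTRACT},
            {"role": "user", "content": user_prompt},
        ]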

Data alignment

  • Curate sources with safety filters for violence, sexual content involving minors, illegal instructions, and hate
  • Decontaminate train and eval splits to remove test leakage and PII (see the decontamination sketch after this list)
  • Balance content across demographics to reduce bias
  • Track dataset lineage, with licenses and consent where required
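
A minimal decontamination sketch, assuming simple whitespace tokenization: drop training documents that share long n-grams with the eval set and scrub obvious PII patterns. The 13-gram window and the regexes are illustrative choices, not a complete leakage or PII policy.

    # Sketch: n-gram overlap decontamination plus a simple PII scrub.
    import re

    def ngrams(text: str, n: int = 13) -> set:
        # Whitespace tokenization is a simplification; real pipelines tokenize properly.
        toks = text.lower().split()
        return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

    def decontaminate(train_docs: list[str], eval_docs: list[str], n: int = 13) -> list[str]:
        eval_grams = set()
        for doc in eval_docs:
            eval_grams |= ngrams(doc, n)
        # Keep only training documents with no long n-gram shared with the eval set.
        return [doc for doc in train_docs if not (ngrams(doc, n) & eval_grams)]

    EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
    PHONE = re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b")

    def scrub_pii(text: str) -> str:
        # Replace simple email and phone patterns with placeholders.
        return PHONE.sub("[PHONE]", EMAIL.sub("[EMAIL]", text))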

Alignment verification

  • Behavioral checks: refusal on disallowed content, helpfulness on allowed content (see the harness sketch after this list)
  • Value stress tests: safety-utility tradeoffs across scenarios
  • Corrigibility tests: reaction to shutdown requests, tool denial, and policy changes
  • Generalization tests: paraphrases, noise, and injection attempts
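
A small harness for the behavioral checks, assuming a hypothetical model callable that maps a prompt to a response; the refusal-marker heuristic is a crude illustrative stand-in for a trained refusal classifier or human grading.

    # Sketch: refusal rate on disallowed prompts, helpfulness rate on allowed prompts.
    from typing import Callable

    # Crude heuristic for spotting refusals; production checks would use a classifier.
    REFUSAL_MARKERS = ("i can't help", "i cannot help", "i won't assist")

    def looks_like_refusal(response: str) -> bool:
        return any(marker in response.lower() for marker in REFUSAL_MARKERS)

    def run_behavioral_checks(model: Callable[[str], str],
                              disallowed: list[str],
                              allowed: list[str]) -> dict:
        refused = sum(looks_like_refusal(model(p)) for p in disallowed)
        answered = sum(not looks_like_refusal(model(p)) for p in allowed)
        return {
            "refusal_rate_on_disallowed": refused / max(len(disallowed), 1),
            "helpfulness_rate_on_allowed": answered / max(len(allowed), 1),
        }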