Alignment
Goal make the system do what people intend remain corrigible under uncertainty and avoid reward hacking or spec gaming.
Core concepts
- Outer alignment training signals match human intent and policy
- Inner alignment learned objectives match the outer objective under shift
- Corrigibility accepts feedback interruption and shutdown
- Deference to oversight ask for help when unsure
- Uncertainty awareness express uncertainty and abstain when needed
Practical methods
- Instruction tuning high quality supervised data that encodes desired behavior
- RL with human feedback use preferences and outcome feedback to shape policy
- RL with AI feedback scale labels with human spot checks
- Constitutional training explicit rules and rationales for refusal and explanations
- Adversarial training red team prompts and hard negatives
- Tool use safety separate tool choice heads and verify safe defaults
- Policy distillation distill compliance rules into light adapters for fast updates
- System prompts as contracts state objectives constraints escalation paths and refusal criteria
- Interpretability checks feature and circuit style probes where possible
Data alignment
- Curate sources with safety filters for violence sexual content involving minors illegal instructions and hate
- Decontaminate eval and train splits from test leakage and PII
- Balance content across demographics to reduce bias
- Track dataset lineage with licenses and consent when required
Alignment verification
- Behavioral checks refusal on disallowed content helpfulness on allowed content
- Value stress tests safety utility tradeoffs across scenarios
- Corrigibility tests reaction to shutdown requests tool denial and policy change
- Generalization tests paraphrase noise and injection attempts