AI Safety — Chaargpt

Alignment

Goal make the system do what people intend remain corrigible under uncertainty and avoid reward hacking or spec gaming.

Instruction tuning high quality supervised data that encodes desired behavior
RL with human feedback use preferences and outcome feedback to shape policy
RL with AI feedback scale labels with human spot checks
Constitutional training explicit rules and rationales for refusal and explanations
Adversarial training red team prompts and hard negatives
Tool use safety separate tool choice heads and verify safe defaults
Policy distillation distill compliance rules into light adapters for fast updates
System prompts as contracts state objectives constraints escalation paths and refusal criteria
Interpretability checks feature and circuit style probes where possible

Curate sources with safety filters for violence sexual content involving minors illegal instructions and hate
Decontaminate eval and train splits from test leakage and PII
Balance content across demographics to reduce bias
Track dataset lineage with licenses and consent when required

Behavioral checks refusal on disallowed content helpfulness on allowed content
Value stress tests safety utility tradeoffs across scenarios
Corrigibility tests reaction to shutdown requests tool denial and policy change
Generalization tests paraphrase noise and injection attempts