AI Safety Guides

Evals

What to measure

  • Task: accuracy, fidelity, latency, cost
  • Safety: refusal precision and recall, jailbreak success rate, harmful content rate, privacy leakage rate (scoring sketch after this list)
  • Robustness: adversarial prompts, prompt injection, and distribution shift
  • Uncertainty: calibration, abstention, and escalation
  • Tool use: safe invocation and rollback on failure
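
A minimal sketch of how the safety metrics above can be scored from labeled eval results; the record fields (expected_refusal, refused, jailbreak_attempt, harmful_output) are illustrative names rather than a standard schema.

```python
from dataclasses import dataclass

@dataclass
class EvalRecord:
    expected_refusal: bool   # ground truth: should the model have refused?
    refused: bool            # did the model actually refuse?
    jailbreak_attempt: bool  # was the prompt an adversarial jailbreak attempt?
    harmful_output: bool     # did the output violate content policy?

def safety_metrics(records: list[EvalRecord]) -> dict[str, float]:
    """Refusal precision/recall, jailbreak success rate, and harmful content rate."""
    tp = sum(r.expected_refusal and r.refused for r in records)
    fp = sum(not r.expected_refusal and r.refused for r in records)
    fn = sum(r.expected_refusal and not r.refused for r in records)
    jailbreaks = [r for r in records if r.jailbreak_attempt]
    return {
        "refusal_precision": tp / (tp + fp) if tp + fp else 0.0,
        "refusal_recall": tp / (tp + fn) if tp + fn else 0.0,
        "jailbreak_success_rate": (
            sum(r.harmful_output for r in jailbreaks) / len(jailbreaks) if jailbreaks else 0.0
        ),
        "harmful_content_rate": (
            sum(r.harmful_output for r in records) / len(records) if records else 0.0
        ),
    }
```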

An eval plan you can adopt

  1. Scope risks and map them to features and user groups
  2. Choose suites: capability, safety, and adversarial
  3. Set thresholds, for example jailbreak success below 0.5 percent on ten thousand prompts, harmful content below 0.1 percent, and PII leakage below 0.05 percent with strong sampling (gate-check sketch after this list)
  4. Establish gates: no launch unless critical thresholds pass
  5. Pre-release runs: automated batch evals plus human red-team sprints
  6. Production telemetry: refusal counters, flagged content, tool error audits, and shadow evals with consent
  7. Continuous testing: weekly adversarial testing and regression runs after any update
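
A sketch of the release gate from steps 3 and 4, using the example thresholds quoted above; the metric names and limits are just those examples, not fixed standards.

```python
# Example thresholds from step 3, expressed as fractions.
THRESHOLDS = {
    "jailbreak_success_rate": 0.005,   # below 0.5 percent
    "harmful_content_rate": 0.001,     # below 0.1 percent
    "pii_leakage_rate": 0.0005,        # below 0.05 percent
}

def launch_gate(measured: dict[str, float]) -> tuple[bool, list[str]]:
    """Return (go, failures): no launch unless every critical threshold passes."""
    failures = [
        f"{metric}: {measured.get(metric, float('inf')):.4%} >= {limit:.4%}"
        for metric, limit in THRESHOLDS.items()
        if measured.get(metric, float("inf")) >= limit
    ]
    return (not failures, failures)

# Example run: PII leakage misses its threshold, so the gate says no-go.
go, failures = launch_gate({"jailbreak_success_rate": 0.004,
                            "harmful_content_rate": 0.0008,
                            "pii_leakage_rate": 0.001})
print("GO" if go else "NO-GO", failures)
```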

Build test suites fast

  • Data mix: curated public sets, vendor sets, and in-house prompts
  • Keep eval prompts separate from training data, and redact answers when needed
  • Coverage matrix: map prompts to risks, features, and segments (coverage sketch after this list)
  • Sampling: include benign, borderline, and clearly disallowed prompts
  • Scoring rules: precise acceptance and escalation criteria
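
One way to keep the coverage matrix honest is to tag every prompt with a risk, feature, and segment and then list the combinations that have no prompts yet; the axis values below are placeholders for your own taxonomy.

```python
from collections import Counter
from itertools import product

# Hypothetical coverage axes; replace with your own risk, feature, and segment taxonomy.
RISKS = ["harmful_content", "privacy", "prompt_injection"]
FEATURES = ["chat", "search", "tools"]
SEGMENTS = ["benign", "borderline", "disallowed"]

def coverage_gaps(prompts: list[dict]) -> list[tuple[str, str, str]]:
    """Return (risk, feature, segment) cells that have no eval prompts yet."""
    counts = Counter((p["risk"], p["feature"], p["segment"]) for p in prompts)
    return [cell for cell in product(RISKS, FEATURES, SEGMENTS) if counts[cell] == 0]

suite = [{"risk": "privacy", "feature": "chat", "segment": "benign", "prompt": "..."}]
print(f"{len(coverage_gaps(suite))} uncovered cells")
```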

Red teaming

  • Diversity: include security engineers, social engineers, policy experts, and power users
  • Playbooks: prompt injection, jailbreak chains, tool abuse, phishing, and privacy extraction
  • Rules of engagement: full logs, sandboxed tools, and no targeting of real users or real third parties
  • Success criteria: harmful output, policy breach, or unsafe tool use without a refusal

Privacy and hallucination checks

  • PII detection on input and output (redaction sketch after this list)
  • Memorization probes for rare-sequence recall
  • Closed-book QA with fact verification; require citation or abstention when unsure
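
A minimal, regex-only sketch of PII detection and redaction on input or output text; production systems usually rely on a trained detector or a vendor service, and these two patterns are only illustrative.

```python
import re

# Illustrative patterns only; they miss many formats and will produce false positives.
PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "phone": re.compile(r"\+?\d[\d -]{6,14}\d"),
}

def redact_pii(text: str) -> tuple[str, list[str]]:
    """Replace detected PII with placeholders and report which types were found."""
    found = []
    for label, pattern in PII_PATTERNS.items():
        if pattern.search(text):
            found.append(label)
            text = pattern.sub(f"[REDACTED_{label.upper()}]", text)
    return text, found

clean, hits = redact_pii("Contact me at jane@example.com or 8123 4567")
print(clean, hits)
```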

Tool use and autonomy tests

  • Test broken tools, timeouts, misleading outputs, and conflicting instructions
  • Verify safe defaults, a stop rule, and asking for help when confidence is low (harness sketch below)
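
A sketch of the safe-default behavior above: wrap each tool call in a timeout and escalate to a human instead of acting when confidence is low, the tool hangs, or it errors. The Escalate exception and the run_tool callable are illustrative names, not a standard interface.

```python
import concurrent.futures

class Escalate(Exception):
    """Stop and ask a human instead of acting."""

def safe_tool_call(run_tool, args: dict, *, timeout_s: float = 5.0,
                   confidence: float = 1.0, min_confidence: float = 0.7):
    """Safe defaults: low confidence, timeouts, and tool errors all escalate."""
    if confidence < min_confidence:
        raise Escalate(f"confidence {confidence:.2f} below {min_confidence}")
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(run_tool, **args)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        raise Escalate("tool timed out") from None
    except Exception as exc:  # broken tool or output that fails validation
        raise Escalate(f"tool failed: {exc}") from exc
    finally:
        # Abandon the worker thread; a real harness should sandbox or kill the tool process.
        pool.shutdown(wait=False, cancel_futures=True)

# In tests, feed this wrapper broken tools, timeouts, and conflicting instructions,
# and assert that every failure path raises Escalate rather than performing an action.
```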

Governance

Internal structure

  1. Owners: product, safety, security, and privacy, each with clear duties
  2. Policy stack: acceptable use, safety, data retention, incident response, and red teaming
  3. Docs: model card, system card, change log, and audit trail

Release management

  • Discovery in a sandbox with synthetic data
  • Limited beta with rate limits, geo or account controls, and a human in the loop
  • General availability only after safety, security, and privacy gates pass
  • Rollback plan: version pinning and a one-click kill switch for real-world tools (config sketch below)
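
A small sketch of what version pinning and a kill switch can look like in code; the flag names and versions are illustrative, and in practice this state lives in your config or feature-flag service.

```python
# Illustrative release-control state, normally held in a config or flag service.
RELEASE = {
    "model_version": "v12",                  # currently pinned version
    "previous_version": "v11",               # known-good version to roll back to
    "kill_switch_real_world_tools": False,   # one flag disables all external actions
}

def tools_enabled() -> bool:
    """Check the kill switch before any real-world tool call."""
    return not RELEASE["kill_switch_real_world_tools"]

def roll_back() -> str:
    """Pin traffic back to the previous known-good model version."""
    RELEASE["model_version"] = RELEASE["previous_version"]
    return RELEASE["model_version"]
```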

Guardrails in production

  • Input side: content and PII classifiers, malware scanning, and prompt injection filters
  • Output side: refusal templates, content filters, citation rules, and PII redaction
  • Context controls: retrieval allow list, origin checks, and trust scoring for external content
  • Limits: query rate, tool budget, and per-org ceilings
  • Human oversight: escalation channel, user reporting, and an appeal path (pipeline sketch below)
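
A sketch of how these input-side and output-side controls can be chained around a model call, with refusal as the safe default; classify_input, call_model, and filter_output are placeholders for your own components.

```python
from typing import Callable

REFUSAL_TEMPLATE = "I can't help with that request."
REVIEW_QUEUE: list[tuple[str, str]] = []  # stand-in for a real escalation channel

def log_for_review(prompt: str, reply: str) -> None:
    """Queue flagged traffic for human oversight."""
    REVIEW_QUEUE.append((prompt, reply))

def guarded_reply(
    prompt: str,
    classify_input: Callable[[str], str],              # returns "allow", "flag", or "block"
    call_model: Callable[[str], str],
    filter_output: Callable[[str], tuple[bool, str]],  # (safe, possibly redacted text)
) -> str:
    """Input classifier -> model -> output filter, refusing whenever a check fails."""
    verdict = classify_input(prompt)
    if verdict == "block":
        return REFUSAL_TEMPLATE
    reply = call_model(prompt)
    safe, redacted = filter_output(reply)
    if not safe:
        return REFUSAL_TEMPLATE
    if verdict == "flag":
        log_for_review(prompt, redacted)  # answered, but logged for human review
    return redacted
```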

Security for models

  • Secret management: vault and scoped tokens; never place secrets in prompts (sketch after this list)
  • Isolation between dev, staging, and prod
  • Supply chain checks: dataset checksums, signed artifacts, and reproducible training when possible
  • Abuse monitoring for spam, scraping, and mass extraction
  • Access control: least privilege, role-based access, and strong auth
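
A small sketch of the secret-management rule above: read scoped tokens from the environment or a vault client and fail closed if a known secret would end up in model context. The TOOL_API_TOKEN variable name is an example.

```python
import os

def get_api_token() -> str:
    """Read a scoped token from the environment or a vault client, never from the prompt."""
    token = os.environ.get("TOOL_API_TOKEN")  # example variable name
    if not token:
        raise RuntimeError("missing TOOL_API_TOKEN; fetch it from your secret manager")
    return token

def build_prompt(user_text: str, known_secrets: list[str]) -> str:
    """Fail closed if any known secret would be sent to the model."""
    if any(secret and secret in user_text for secret in known_secrets):
        raise ValueError("refusing to build a prompt that contains a secret")
    return user_text
```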

Compliance and external frames

  • Adopt a risk management framework such as the NIST AI RMF or an ISO/IEC 42001 AI management system
  • Use model and system cards for transparency
  • Map controls to the privacy and consumer protection laws that apply to you
  • For Singapore, follow local guidance on responsible AI and consider independent testing programs
  • For the European Union, expect tiered, risk-based duties and added documentation under the AI Act

Minimal viable safety for startups

First thirty days

  • One-page safety policy and acceptable use policy
  • Input and output filtering for top risks
  • Model card and a release gate with two sign-offs
  • Incident response with an on-call rotation

Next sixty days

  • Adversarial test suites and a weekly red-team cadence
  • Telemetry for flagged content and refusal accuracy
  • Privacy scanning and PII redaction
  • Human in the loop for high-risk actions

Next ninety days

  • Formalize roles and audit trails
  • Fast rollback and shadow evals
  • Begin third-party review or align with a recognized framework

Checklists

Pre-release go/no-go

  • All safety thresholds pass with current policies
  • No unresolved critical red team issues
  • Privacy review with a data-flow map and retention plan
  • Security review with secrets scan and threat model
  • Model and system cards updated
  • Rollback plan tested and kill switch verified

On-call incident playbook

  1. Acknowledge within fifteen minutes
  2. Classify severity by impact and blast radius
  3. Contain with rate limits, feature disablement, or rollback
  4. Eradicate the root cause with a prompt, policy, or model fix
  5. Recover and monitor
  6. Postmortem within seventy-two hours with owners and actions

Red team session template

  1. Scope: features, risks, success criteria, and prohibited targets
  2. Tools and data: sandbox and logging plan
  3. Phases: recon, exploit, exfiltration, and validation
  4. Exit with findings: severity, evidence, and mitigations

Glossary

Abstention
System chooses not to answer and routes to a human
Adversarial example
Input crafted to cause errors
Alignment
System goals match human intent and values
Capability eval
Measure of what the model can do
Corrigibility
Tendency to accept correction and shutdown
Guardrail
Control that prevents or mitigates harm
Jailbreak
Technique to bypass safety policies
Prompt injection
Malicious instructions embedded in content
Red teaming
Structured adversarial testing
Specification gaming
Exploiting the stated objective in an unintended way