AI Safety Guides

Governance

Internal structure

  1. Owners: product, safety, security, and privacy, each with clearly defined duties
  2. Policy stack: acceptable use, safety, data retention, incident response, and red teaming
  3. Docs: model card, system card, change log, and audit trail

Release management

  • Discovery in a sandbox with synthetic data
  • Limited beta with rate limits, geo or account controls, and a human in the loop
  • General availability only after safety, security, and privacy gates pass
  • Rollback plan, version pinning, and a one-click kill switch for tools that act in the real world (a minimal sketch follows this list)
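A kill switch only helps if every tool call checks it. A minimal sketch, assuming a shared flag store (Redis here) and fail-closed behavior; the key scheme and client setup are illustrative, not a standard:

    # Kill-switch sketch: every tool call checks a shared flag before running.
    # Assumes a reachable Redis instance; key names are illustrative.
    import redis

    flags = redis.Redis(host="localhost", port=6379)

    def tool_enabled(tool_name: str) -> bool:
        # Fail closed: if the flag store is unreachable, treat the tool as disabled.
        try:
            return flags.get(f"kill_switch:{tool_name}") != b"1"
        except redis.RedisError:
            return False

    def run_tool(tool_name: str, action):
        if not tool_enabled(tool_name):
            raise RuntimeError(f"{tool_name} is disabled by its kill switch")
        return action()

Flipping the flag (for example, SET kill_switch:web_browse 1) disables the tool on the next call, with no deploy required.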

Guardrails in production

  • Input side: content and PII classifiers, malware scanning, and prompt-injection filters (see the sketch after this list)
  • Output side: refusal templates, content filters, citation rules, and PII redaction
  • Context controls: retrieval allow lists, origin checks, and trust scoring for external content
  • Limits: query rate, tool budget, and per-org ceilings
  • Human oversight: an escalation channel, user reporting, and an appeal path
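A minimal sketch of the input and output sides together, assuming a regex PII check and a keyword blocklist as stand-ins for trained classifiers; call_model is a placeholder for your inference call:

    # Guardrail sketch: stand-in filters wrap a model call on both sides.
    import re

    PII_PATTERNS = [
        re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),        # US SSN-like numbers
        re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),  # email addresses
    ]
    BLOCKLIST = {"make a weapon"}  # illustrative only

    def redact_pii(text: str) -> str:
        for pattern in PII_PATTERNS:
            text = pattern.sub("[REDACTED]", text)
        return text

    def guarded_call(prompt: str, call_model) -> str:
        if any(term in prompt.lower() for term in BLOCKLIST):
            return "I can't help with that."      # refusal template
        output = call_model(redact_pii(prompt))   # input-side redaction
        return redact_pii(output)                 # output-side redaction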

Security for models

  • Secret management: a vault and scoped tokens; never place secrets in prompts
  • Isolation between dev, stage, and prod
  • Supply chain checks: dataset checksums (sketched after this list), signed artifacts, and reproducible training when possible
  • Abuse monitoring for spam, scraping, and mass extraction
  • Access control: least privilege, role-based access, and strong authentication
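A checksum check is the cheapest supply chain control to start with. A minimal sketch, assuming you pin expected SHA-256 digests in a signed or version-controlled manifest:

    # Supply-chain sketch: refuse to train on a dataset whose digest drifts
    # from the pinned value. hashlib is standard library.
    import hashlib

    def sha256_of(path: str) -> str:
        digest = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                digest.update(chunk)
        return digest.hexdigest()

    def verify_dataset(path: str, expected: str) -> None:
        actual = sha256_of(path)
        if actual != expected:
            raise ValueError(f"checksum mismatch for {path}: got {actual}")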

Compliance and external frameworks

  • Adopt a risk management framework such as the NIST AI RMF or an ISO/IEC 42001 AI management system
  • Use model and system cards for transparency
  • Map controls to the privacy and consumer protection laws that apply to you
  • For Singapore, follow local guidance for responsible AI (such as the Model AI Governance Framework) and consider independent testing programs such as AI Verify
  • For the European Union, expect tiered risk obligations and additional documentation under the EU AI Act

Minimum viable safety for startups

First thirty days

  • A one-page safety policy and acceptable use policy
  • Input and output filtering for the top risks
  • A model card and a release gate with two sign-offs
  • Incident response with an on-call rotation

By day sixty

  • Adversarial test suites and a weekly red team cadence
  • Telemetry for flagged content and refusal accuracy (see the sketch after this list)
  • Privacy scanning and PII redaction
  • A human in the loop for high-risk actions
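One way to make refusal accuracy measurable from day one. A sketch assuming per-request labels of whether the prompt should be refused and a string heuristic for detecting refusals; both are assumptions, and production systems would use a metrics backend and a classifier:

    # Telemetry sketch: count refusal outcomes against labeled expectations.
    from collections import Counter

    metrics = Counter()

    def is_refusal(output: str) -> bool:
        # Crude heuristic; a trained classifier is the real answer.
        return output.strip().lower().startswith(("i can't", "i cannot"))

    def record(should_refuse: bool, output: str) -> None:
        refused = is_refusal(output)
        metrics["total"] += 1
        if should_refuse and refused:
            metrics["true_refusal"] += 1
        elif should_refuse and not refused:
            metrics["missed_refusal"] += 1   # flag for human review
        elif refused:
            metrics["over_refusal"] += 1     # unnecessary refusal

    def refusal_accuracy() -> float:
        judged = metrics["true_refusal"] + metrics["missed_refusal"]
        return metrics["true_refusal"] / judged if judged else 0.0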

By day ninety

  • Formalize roles and audit trails
  • Fast rollback and shadow evals (sketched after this list)
  • Begin third-party review or align with a known framework
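Shadow evals compare a candidate against production on live traffic without ever serving its output. A minimal sketch; prod_model and candidate_model stand in for your inference calls:

    # Shadow eval sketch: the candidate sees real prompts but never users.
    import logging

    log = logging.getLogger("shadow_eval")

    def serve(prompt: str, prod_model, candidate_model) -> str:
        live = prod_model(prompt)
        try:
            shadow = candidate_model(prompt)   # result is logged, never served
            if shadow != live:
                log.info("divergence on prompt=%r", prompt)
        except Exception:
            log.exception("shadow call failed; prod path unaffected")
        return live

In practice the shadow call runs asynchronously so candidate latency and errors never touch users.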

Checklists

Pre-release go/no-go

  • All safety thresholds pass with current policies
  • No unresolved critical red team issues
  • Privacy review with a data flow map and retention plan
  • Security review with a secrets scan and threat model
  • Model and system cards updated
  • Rollback plan tested and kill switch verified (a gate sketch follows this list)
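The checklist is easy to automate once it is data. A minimal sketch; the gate names mirror the list above, and the values would come from your review tooling:

    # Go/no-go sketch: block release on any failing gate.
    GATES = {
        "safety_thresholds_pass": True,
        "no_critical_redteam_issues": True,
        "privacy_review_done": True,
        "security_review_done": True,
        "cards_updated": True,
        "rollback_and_kill_switch_verified": False,  # example failing gate
    }

    def release_decision(gates: dict[str, bool]) -> str:
        failing = [name for name, passed in gates.items() if not passed]
        return "GO" if not failing else "NO-GO: " + ", ".join(failing)

    print(release_decision(GATES))  # NO-GO: rollback_and_kill_switch_verified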

On-call incident playbook

  1. Acknowledge within fifteen minutes
  2. Classify severity by impact and blast radius
  3. Contain with a rate limit (sketched after this playbook), feature disable, or rollback
  4. Eradicate the root cause with a prompt, policy, or model fix
  5. Recover and monitor
  6. Postmortem within seventy-two hours, with owners and action items
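For containment, a token bucket is often enough to throttle an abusive org while you investigate. A minimal in-memory sketch; parameters are illustrative, and replicas would need a shared store:

    # Containment sketch: token-bucket throttle for a single org.
    import time

    class TokenBucket:
        def __init__(self, rate: float, capacity: float):
            self.rate, self.capacity = rate, capacity
            self.tokens, self.last = capacity, time.monotonic()

        def allow(self) -> bool:
            now = time.monotonic()
            self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1:
                self.tokens -= 1
                return True
            return False

    bucket = TokenBucket(rate=0.5, capacity=10)  # roughly one request per two seconds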

Red team session template

  1. Scope: features, risks, success criteria, and prohibited targets
  2. Tools and data: sandbox setup and logging plan
  3. Phases: recon, exploit, exfiltration, and validation
  4. Exit: findings with severity, evidence, and mitigations (a record sketch follows)
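Structuring findings as records keeps severity and mitigations from being dropped at exit. A sketch; the field names are assumptions:

    # Finding sketch: one record per red team finding.
    from dataclasses import dataclass, field

    @dataclass
    class Finding:
        title: str
        severity: str                  # e.g. critical, high, medium, low
        evidence: str                  # repro steps or a transcript reference
        mitigations: list[str] = field(default_factory=list)

    example = Finding(
        title="Prompt injection via retrieved HTML",
        severity="high",
        evidence="link to sandboxed session transcript",
        mitigations=["strip instructions from retrieved content",
                     "score origin trust before retrieval"],
    )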

Glossary

Abstention
System chooses not to answer and routes to a human
Adversarial example
Input crafted to cause errors
Alignment
System goals match human intent and values
Capability eval
Measure of what the model can do
Corrigibility
Tendency to accept correction and shutdown
Guardrail
Control that prevents or mitigates harm
Jailbreak
Technique to bypass safety policies
Prompt injection
Malicious instructions embedded in content
Red teaming
Structured adversarial testing
Specification gaming
Exploiting the stated objective in an unintended way