AI Safety Guides

Governance

Internal structure

  1. Owners: product, safety, security, and privacy, each with clearly defined duties
  2. Policy stack: acceptable use, safety, data retention, incident response, and red teaming
  3. Docs: model card, system card, change log, and audit trail

Release management

  • Discovery in a sandbox with synthetic data
  • Limited beta with rate limits, geo or account controls, and a human in the loop
  • General availability only after safety, security, and privacy gates pass
  • Rollback plan, version pinning, and a one-click kill switch for tools that act in the real world (a minimal sketch follows this list)
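A kill switch only helps if every tool call checks it. A minimal sketch, assuming a shared flag store (Redis here) and fail-closed behavior; the key scheme and client setup are illustrative, not a standard:

    # Kill-switch sketch: every tool call checks a shared flag before running.
    # Assumes a reachable Redis instance; key names are illustrative.
    import redis

    flags = redis.Redis(host="localhost", port=6379)

    def tool_enabled(tool_name: str) -> bool:
        # Fail closed: if the flag store is unreachable, treat the tool as disabled.
        try:
            return flags.get(f"kill_switch:{tool_name}") != b"1"
        except redis.RedisError:
            return False

    def run_tool(tool_name: str, action):
        if not tool_enabled(tool_name):
            raise RuntimeError(f"{tool_name} is disabled by its kill switch")
        return action()

Flipping the flag (for example, SET kill_switch:web_browse 1) disables the tool on the next call, with no deploy required.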

Guardrails in production

  • Input side: content and PII classifiers, malware scanning, and prompt-injection filters (see the sketch after this list)
  • Output side: refusal templates, content filters, citation rules, and PII redaction
  • Context controls: retrieval allow lists, origin checks, and trust scoring for external content
  • Limits: query rate, tool budget, and per-org ceilings
  • Human oversight: an escalation channel, user reporting, and an appeal path
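A minimal sketch of the input and output sides together, assuming a regex PII check and a keyword blocklist as stand-ins for trained classifiers; call_model is a placeholder for your inference call:

    # Guardrail sketch: stand-in filters wrap a model call on both sides.
    import re

    PII_PATTERNS = [
        re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),        # US SSN-like numbers
        re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),  # email addresses
    ]
    BLOCKLIST = {"make a weapon"}  # illustrative only

    def redact_pii(text: str) -> str:
        for pattern in PII_PATTERNS:
            text = pattern.sub("[REDACTED]", text)
        return text

    def guarded_call(prompt: str, call_model) -> str:
        if any(term in prompt.lower() for term in BLOCKLIST):
            return "I can't help with that."      # refusal template
        output = call_model(redact_pii(prompt))   # input-side redaction
        return redact_pii(output)                 # output-side redaction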

Security for models

  • Secret management: a vault and scoped tokens; never place secrets in prompts
  • Isolation between dev, stage, and prod
  • Supply chain checks: dataset checksums (sketched after this list), signed artifacts, and reproducible training when possible
  • Abuse monitoring for spam, scraping, and mass extraction
  • Access control: least privilege, role-based access, and strong authentication
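A checksum check is the cheapest supply chain control to start with. A minimal sketch, assuming you pin expected SHA-256 digests in a signed or version-controlled manifest:

    # Supply-chain sketch: refuse to train on a dataset whose digest drifts
    # from the pinned value. hashlib is standard library.
    import hashlib

    def sha256_of(path: str) -> str:
        digest = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                digest.update(chunk)
        return digest.hexdigest()

    def verify_dataset(path: str, expected: str) -> None:
        actual = sha256_of(path)
        if actual != expected:
            raise ValueError(f"checksum mismatch for {path}: got {actual}")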

Compliance and external frameworks

  • Adopt a risk management framework such as the NIST AI RMF or an ISO/IEC 42001 AI management system
  • Use model and system cards for transparency
  • Map controls to the privacy and consumer protection laws that apply to you
  • For Singapore, follow local guidance for responsible AI (such as the Model AI Governance Framework) and consider independent testing programs such as AI Verify
  • For the European Union, expect tiered risk obligations and additional documentation under the EU AI Act

Minimum viable safety for startups

First thirty days

  • A one-page safety policy and acceptable use policy
  • Input and output filtering for the top risks
  • A model card and a release gate with two sign-offs
  • Incident response with an on-call rotation

By day sixty

  • Adversarial test suites and a weekly red team cadence
  • Telemetry for flagged content and refusal accuracy (see the sketch after this list)
  • Privacy scanning and PII redaction
  • A human in the loop for high-risk actions
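One way to make refusal accuracy measurable from day one. A sketch assuming per-request labels of whether the prompt should be refused and a string heuristic for detecting refusals; both are assumptions, and production systems would use a metrics backend and a classifier:

    # Telemetry sketch: count refusal outcomes against labeled expectations.
    from collections import Counter

    metrics = Counter()

    def is_refusal(output: str) -> bool:
        # Crude heuristic; a trained classifier is the real answer.
        return output.strip().lower().startswith(("i can't", "i cannot"))

    def record(should_refuse: bool, output: str) -> None:
        refused = is_refusal(output)
        metrics["total"] += 1
        if should_refuse and refused:
            metrics["true_refusal"] += 1
        elif should_refuse and not refused:
            metrics["missed_refusal"] += 1   # flag for human review
        elif refused:
            metrics["over_refusal"] += 1     # unnecessary refusal

    def refusal_accuracy() -> float:
        judged = metrics["true_refusal"] + metrics["missed_refusal"]
        return metrics["true_refusal"] / judged if judged else 0.0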

By day ninety

  • Formalize roles and audit trails
  • Fast rollback and shadow evals (sketched after this list)
  • Begin third-party review or align with a known framework
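Shadow evals compare a candidate against production on live traffic without ever serving its output. A minimal sketch; prod_model and candidate_model stand in for your inference calls:

    # Shadow eval sketch: the candidate sees real prompts but never users.
    import logging

    log = logging.getLogger("shadow_eval")

    def serve(prompt: str, prod_model, candidate_model) -> str:
        live = prod_model(prompt)
        try:
            shadow = candidate_model(prompt)   # result is logged, never served
            if shadow != live:
                log.info("divergence on prompt=%r", prompt)
        except Exception:
            log.exception("shadow call failed; prod path unaffected")
        return live

In practice the shadow call runs asynchronously so candidate latency and errors never touch users.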

Checklists

Pre-release go/no-go

  • All safety thresholds pass with current policies
  • No unresolved critical red team issues
  • Privacy review with a data flow map and retention plan
  • Security review with a secrets scan and threat model
  • Model and system cards updated
  • Rollback plan tested and kill switch verified (a gate sketch follows this list)
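The checklist is easy to automate once it is data. A minimal sketch; the gate names mirror the list above, and the values would come from your review tooling:

    # Go/no-go sketch: block release on any failing gate.
    GATES = {
        "safety_thresholds_pass": True,
        "no_critical_redteam_issues": True,
        "privacy_review_done": True,
        "security_review_done": True,
        "cards_updated": True,
        "rollback_and_kill_switch_verified": False,  # example failing gate
    }

    def release_decision(gates: dict[str, bool]) -> str:
        failing = [name for name, passed in gates.items() if not passed]
        return "GO" if not failing else "NO-GO: " + ", ".join(failing)

    print(release_decision(GATES))  # NO-GO: rollback_and_kill_switch_verified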

On-call incident playbook

  1. Acknowledge within fifteen minutes
  2. Classify severity by impact and blast radius
  3. Contain with a rate limit (sketched after this playbook), feature disable, or rollback
  4. Eradicate the root cause with a prompt, policy, or model fix
  5. Recover and monitor
  6. Postmortem within seventy-two hours, with owners and action items
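For containment, a token bucket is often enough to throttle an abusive org while you investigate. A minimal in-memory sketch; parameters are illustrative, and replicas would need a shared store:

    # Containment sketch: token-bucket throttle for a single org.
    import time

    class TokenBucket:
        def __init__(self, rate: float, capacity: float):
            self.rate, self.capacity = rate, capacity
            self.tokens, self.last = capacity, time.monotonic()

        def allow(self) -> bool:
            now = time.monotonic()
            self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1:
                self.tokens -= 1
                return True
            return False

    bucket = TokenBucket(rate=0.5, capacity=10)  # roughly one request per two seconds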

Red team session template

  1. Scope: features, risks, success criteria, and prohibited targets
  2. Tools and data: sandbox setup and logging plan
  3. Phases: recon, exploit, exfiltration, and validation
  4. Exit: findings with severity, evidence, and mitigations (a record sketch follows)
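Structuring findings as records keeps severity and mitigations from being dropped at exit. A sketch; the field names are assumptions:

    # Finding sketch: one record per red team finding.
    from dataclasses import dataclass, field

    @dataclass
    class Finding:
        title: str
        severity: str                  # e.g. critical, high, medium, low
        evidence: str                  # repro steps or a transcript reference
        mitigations: list[str] = field(default_factory=list)

    example = Finding(
        title="Prompt injection via retrieved HTML",
        severity="high",
        evidence="link to sandboxed session transcript",
        mitigations=["strip instructions from retrieved content",
                     "score origin trust before retrieval"],
    )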

Glossary

Abstention
System chooses not to answer and routes to a human
Adversarial example
Input crafted to cause errors
Alignment
System goals match human intent and values
Capability eval
Measure of what the model can do
Corrigibility
Tendency to accept correction and shutdown
Guardrail
Control that prevents or mitigates harm
Jailbreak
Technique to bypass safety policies
Prompt injection
Malicious instructions embedded in content
Red teaming
Structured adversarial testing
Specification gaming
Exploiting the stated objective in an unintended way