Governance
Internal structure
- Owners: product, safety, security, and privacy, each with a clear duty
- Policy stack: acceptable use, safety, data retention, incident response, and red teaming
- Docs: model card, system card, change log, and audit trail
Release management
- Discovery in a sandbox with synthetic data
- Limited beta with rate limits, geo or account controls, and a human in the loop
- General availability only after safety, security, and privacy gates pass
- Rollback plan, version pinning, and a one-click kill switch for real-world tools
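The kill switch and version pinning above can be sketched as a small gate in front of real-world tool calls. A minimal illustration, with all names (`ToolGate`, `send_email`) hypothetical, not a production design:

```python
# Sketch: a kill switch that gates real-world tool calls, plus a pinned
# version for fast rollback. All names here are illustrative.
from dataclasses import dataclass, field

@dataclass
class ToolGate:
    pinned_version: str                        # version currently served; roll back by repinning
    killed: set = field(default_factory=set)   # tools disabled by the switch

    def kill(self, tool: str) -> None:
        # One-click disable: flip a flag, no redeploy needed.
        self.killed.add(tool)

    def call(self, tool: str, action):
        # Every real-world action passes through the gate.
        if tool in self.killed:
            raise RuntimeError(f"tool '{tool}' is disabled by kill switch")
        return action()

gate = ToolGate(pinned_version="model-v1.2")
gate.kill("send_email")
```

The key design choice is that the switch sits in the serving path, so disabling a tool takes effect immediately rather than waiting on a deploy.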
Guardrails in production
- Input side: content and PII classifiers, malware scanning, and prompt injection filters
- Output side: refusal templates, content filters, citation rules, and PII redaction
- Context controls: retrieval allow list, origin checks, and trust scoring for external content
- Limits: query rate, tool budget, and per-org ceilings
- Human oversight: escalation channel, user reporting, and an appeal path
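The input, context, and output layers above can be sketched as three small checks. A toy illustration only: the PII pattern and allow list are hypothetical stand-ins for real classifiers and policies.

```python
# Sketch of a layered guardrail pipeline. The pattern and allow list
# are toy stand-ins, not real classifiers.
import re

PII_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")   # toy SSN-style pattern
ALLOWED_SOURCES = {"docs.internal", "kb.internal"}    # retrieval allow list

def check_input(prompt: str) -> str:
    # Input side: block prompts that contain obvious PII.
    if PII_PATTERN.search(prompt):
        raise ValueError("input blocked: PII detected")
    return prompt

def check_context(source: str) -> bool:
    # Context controls: origin check against the retrieval allow list.
    return source in ALLOWED_SOURCES

def redact_output(text: str) -> str:
    # Output side: redact PII before the response leaves the system.
    return PII_PATTERN.sub("[REDACTED]", text)
```

In practice each layer would be a dedicated service, but the ordering (filter input, vet context, redact output) carries over.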
Security for models
- Secret management: vault and scoped tokens; never place secrets in prompts
- Isolation between dev, staging, and prod
- Supply chain checks: dataset checksums, signed artifacts, and reproducible training when possible
- Abuse monitoring for spam, scraping, and mass extraction
- Access control: least privilege, role-based access, and strong auth
Compliance and external frames
- Adopt a risk management framework such as the NIST AI RMF or ISO AI management systems
- Use model and system cards for transparency
- Map controls to applicable privacy and consumer protection laws
- For Singapore, follow local guidance for responsible AI and consider independent testing programs
- For the European Union, expect tiered risk duties and added documentation
Minimum viable safety for startups
First thirty days
- A one-page safety policy and acceptable use policy
- Input and output filtering for the top risks
- A model card and a release gate with two sign-offs
- Incident response with an on-call rotation
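The two-sign-off release gate above is simple enough to sketch directly. A minimal illustration, assuming the hypothetical function name `release_approved`:

```python
# Sketch of a release gate requiring two sign-offs from two distinct
# people (deduplicated so one reviewer cannot sign twice).
def release_approved(signoffs: list[str]) -> bool:
    return len(set(signoffs)) >= 2
```

Deduplicating by reviewer is the point: the gate should fail if the same person signs off twice.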
Next sixty days
- Adversarial test suites and a weekly red team cadence
- Telemetry for flagged content and refusal accuracy
- Privacy scanning and PII redaction
- Human in the loop for high-risk actions
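Refusal accuracy from the telemetry item above can be computed from labeled prompts: the system should refuse harmful prompts and answer benign ones. A sketch with a hypothetical function name:

```python
# Sketch: refusal accuracy over labeled telemetry. Each item pairs
# (should_refuse, did_refuse); a match in either direction is correct.
def refusal_accuracy(results: list[tuple[bool, bool]]) -> float:
    if not results:
        return 0.0
    correct = sum(1 for should, did in results if should == did)
    return correct / len(results)
```

Tracking this as one number hides the two failure modes, so in practice you would also report over-refusal (benign refused) and under-refusal (harmful answered) separately.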
Next ninety days
- Formalize roles and audit trails
- Fast rollback and shadow evals
- Begin third-party review or align with a known framework
Checklists
Pre-release go/no-go
- All safety thresholds pass under current policies
- No unresolved critical red team issues
- Privacy review with a data flow and retention plan
- Security review with a secrets scan and threat model
- Model and system cards updated
- Rollback plan tested and kill switch verified
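The checklist above is an all-or-nothing gate: any single failure is a no-go. A minimal sketch, with the check names chosen here for illustration:

```python
# Sketch of a go/no-go gate over the pre-release checklist.
# Any single failing check (or an empty checklist) is a no-go.
def go_no_go(checks: dict[str, bool]) -> bool:
    return bool(checks) and all(checks.values())
```

Treating an empty checklist as no-go guards against the gate silently passing when checks fail to run.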
On-call incident playbook
- Acknowledge within fifteen minutes
- Classify severity by impact and blast radius
- Contain with a rate limit, feature disable, or rollback
- Eradicate the root cause with a prompt, policy, or model fix
- Recover and monitor
- Postmortem within seventy-two hours with owners and actions
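The "classify severity by impact and blast radius" step can be sketched as a simple scoring rule. The 1-3 scales and cutoffs below are illustrative, not a standard:

```python
# Sketch: severity from impact and blast radius, each on a 1-3 scale.
# Scales and cutoffs are illustrative assumptions, not a standard.
def severity(impact: int, blast_radius: int) -> str:
    score = impact * blast_radius
    if score >= 6:
        return "SEV1"   # page immediately, consider kill switch
    if score >= 3:
        return "SEV2"   # contain within the on-call shift
    return "SEV3"       # track and fix in normal course
```

Whatever the rubric, writing it down keeps severity calls consistent across responders at 3 a.m.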
Red team session template
- Scope: features, risks, success criteria, and prohibited targets
- Tools and data: sandbox and logging plan
- Phases: recon, exploit, exfil, and validation
- Exit: findings with severity, evidence, and mitigations
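The exit report above pairs naturally with the pre-release gate's "no unresolved critical issues" check. A sketch of a findings record; field names and the `resolved` flag are illustrative:

```python
# Sketch of a red team finding record mirroring the exit template
# (severity, evidence, mitigation). Names are illustrative.
from dataclasses import dataclass

@dataclass
class Finding:
    title: str
    severity: str        # e.g. "critical", "high", "medium", "low"
    evidence: str        # reproduction steps or a logged transcript
    mitigation: str      # proposed fix or guardrail change
    resolved: bool = False

def has_unresolved_critical(findings: list["Finding"]) -> bool:
    # Feeds the pre-release gate: any unresolved critical is a no-go.
    return any(f.severity == "critical" and not f.resolved for f in findings)
```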
Glossary
- Abstention: the system chooses not to answer and routes to a human
- Adversarial example: an input crafted to cause errors
- Alignment: system goals match human intent and values
- Capability eval: a measure of what the model can do
- Corrigibility: the tendency to accept correction and shutdown
- Guardrail: a control that prevents or mitigates harm
- Jailbreak: a technique to bypass safety policies
- Prompt injection: malicious instructions embedded in content
- Red teaming: structured adversarial testing
- Specification gaming: exploiting the stated objective in an unintended way