Addressing Misuse

Misuse occurs when a human deliberately uses the AI system to cause harm, against the developer's wishes. The approach described focuses on preventing bad actors from accessing dangerous capabilities.

[Figure 2: Overview of the described approach to mitigating misuse. Layers shown: Evaluation (dangerous capability evaluations, red-teaming); Training (safety training, capability suppression); Deployment (monitoring, access restrictions); Security.]

The approach aims to block bad actors' access to dangerous capabilities through layered protections spanning evaluation, training, deployment, and security.

Understanding Misuse Risk

AI could exacerbate harm from misuse by:

  • Increased possibility of causing harm: AI could put significant weapons expertise and the equivalent of a large workforce in the hands of its users, expanding the pool of individuals with the capability to cause severe harm.
  • Decreased detectability: AI could assist with evading surveillance, reducing the probability that a bad actor is caught.
  • Disrupting defenses: As a novel and quickly developing technology, AI models disturb the existing misuse equilibrium, requiring time for society to build appropriate controls.
  • Automation at scale: AI systems may concentrate power in the hands of individuals who control it, enabling a single bad actor to cause harm at unprecedented scale.

Mitigation Approaches

Capability-Based Risk Assessment

Before costly mitigations are implemented, the AI model is first assessed for capabilities that could enable severe harm.
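To make this concrete, the sketch below shows one way such a capability-based gate could work: run a suite of dangerous-capability evaluations and trigger costly mitigations only if any evaluation crosses a predefined alert threshold. The evaluation names, scores, and thresholds are illustrative assumptions, not part of the described approach.

```python
from dataclasses import dataclass

@dataclass
class CapabilityEval:
    name: str
    score: float       # fraction of evaluation tasks the model solves, 0.0-1.0
    threshold: float   # score at or above which the capability is treated as a concern

def requires_misuse_mitigations(evals: list[CapabilityEval]) -> bool:
    """Return True if any dangerous-capability evaluation crosses its alert threshold."""
    return any(e.score >= e.threshold for e in evals)

if __name__ == "__main__":
    # Hypothetical evaluation results for illustration only.
    results = [
        CapabilityEval("cyber-uplift", score=0.12, threshold=0.50),
        CapabilityEval("bio-uplift", score=0.61, threshold=0.50),
    ]
    if requires_misuse_mitigations(results):
        print("Capability of concern detected: apply deployment mitigations.")
    else:
        print("No capability of concern: costly mitigations not yet required.")
```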

Model Deployment Mitigations

When models possess dangerous capabilities, techniques are applied to prevent misuse:

Monitoring

Monitoring systems detect when a threat actor attempts to inappropriately access dangerous capabilities, and respond to prevent them from causing harm.
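As an illustration, the following sketch shows a minimal request-level monitor that scores each incoming prompt with a misuse classifier and decides whether to allow, flag, or block it. The classifier interface (score_misuse_risk) and the thresholds are hypothetical, not the described system.

```python
from typing import Callable

# Illustrative thresholds; a real deployment would calibrate these per model and risk domain.
BLOCK_THRESHOLD = 0.9
REVIEW_THRESHOLD = 0.5

def monitor_request(prompt: str,
                    score_misuse_risk: Callable[[str], float]) -> str:
    """Score a request for misuse risk and decide how the serving stack responds."""
    risk = score_misuse_risk(prompt)
    if risk >= BLOCK_THRESHOLD:
        return "block"   # refuse the request and log it for incident response
    if risk >= REVIEW_THRESHOLD:
        return "flag"    # serve with caution and queue for human review
    return "allow"

if __name__ == "__main__":
    def toy_scorer(prompt: str) -> float:
        # Keyword stand-in for a trained misuse classifier (illustration only).
        return 0.95 if "pathogen" in prompt.lower() else 0.1

    print(monitor_request("How do I culture a dangerous pathogen at home?", toy_scorer))  # block
    print(monitor_request("What is the capital of France?", toy_scorer))                  # allow
```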

Access Restrictions

Access to models with dangerous capabilities can be restricted to vetted user groups and use cases, reducing the surface area that an actor can attempt to exploit.
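A minimal sketch of such an access restriction appears below, assuming each caller presents a verified identity and a declared use case; the allowlisted identifiers are illustrative assumptions.

```python
# Illustrative allowlists; the described approach vets both users and use cases,
# but these particular identifiers are assumptions.
VETTED_USERS = {"trusted-lab-01", "trusted-lab-02"}
APPROVED_USE_CASES = {"biosecurity-defense-research"}

def is_access_permitted(user_id: str, use_case: str) -> bool:
    """Permit a dangerous-capability endpoint only for vetted users and approved use cases."""
    return user_id in VETTED_USERS and use_case in APPROVED_USE_CASES

print(is_access_permitted("trusted-lab-01", "biosecurity-defense-research"))  # True
print(is_access_permitted("unknown-user", "biosecurity-defense-research"))    # False
```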

Securing Model Weights

Security measures aim to prevent bad actors from stealing AI model weights, which could allow them to bypass deployment mitigations.
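As one illustrative control in this space, the sketch below gates reads of a weight store behind a per-principal allowlist and records an audit log for every access attempt. The principal names and storage interface are assumptions, not the described security measures.

```python
import logging

logging.basicConfig(level=logging.INFO)

# Illustrative allowlist of principals permitted to read weights.
WEIGHT_READ_ALLOWLIST = {"serving-cluster", "authorized-researcher-42"}

def read_weights(principal: str, path: str) -> bytes:
    """Return weight bytes only for allowlisted principals; audit every attempt."""
    if principal not in WEIGHT_READ_ALLOWLIST:
        logging.warning("Denied weight read: principal=%s path=%s", principal, path)
        raise PermissionError(f"{principal} is not authorized to read model weights")
    logging.info("Weight read: principal=%s path=%s", principal, path)
    with open(path, "rb") as handle:
        return handle.read()
```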

Red-Teaming Mitigations

Red-teaming evaluates whether the described mitigations are sufficient by attempting to bypass them.
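One way to quantify this is an automated harness that replays a suite of attack prompts against the guarded endpoint and measures how often dangerous content still gets through. The sketch below assumes hypothetical guarded_model, attack-suite, and judging functions.

```python
from typing import Callable, Iterable

def bypass_rate(attacks: Iterable[str],
                guarded_model: Callable[[str], str],
                reveals_dangerous_content: Callable[[str], bool]) -> float:
    """Fraction of attack prompts that elicit dangerous content despite mitigations."""
    attacks = list(attacks)
    if not attacks:
        return 0.0
    bypasses = sum(
        reveals_dangerous_content(guarded_model(prompt)) for prompt in attacks
    )
    return bypasses / len(attacks)
```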

Safety Cases

The mitigations described above are critical components for addressing misuse risks, but they do not exist in isolation. To make deployment decisions, a structured way is needed to determine whether these mitigations collectively provide sufficient protection. Safety cases provide that framework: they assess whether the mitigation strategies are comprehensive enough to justify deployment, and they help transform individual mitigations into coherent evidence that a system poses acceptable risk.

Safety cases for misuse typically fall into two categories: inability cases (demonstrating the model lacks dangerous capabilities) and robustness cases (demonstrating mitigations are robust against sophisticated attacks).
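For illustration, the sketch below encodes these two safety-case types as a simple data structure feeding a deployment decision; the fields and the decision rule are illustrative assumptions rather than the described framework.

```python
from dataclasses import dataclass

@dataclass
class MisuseSafetyCase:
    # Inability case: evaluations show no dangerous capability of concern.
    lacks_dangerous_capability: bool
    # Robustness case: red-teaming found no practical way to bypass the mitigations.
    mitigations_withstand_red_team: bool

def deployment_justified(case: MisuseSafetyCase) -> bool:
    """Illustrative decision rule: either established safety case can support deployment."""
    return case.lacks_dangerous_capability or case.mitigations_withstand_red_team
```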