Safety Cases

A safety case is a structured argument, supported by a body of evidence, that a system is safe for a given application in a given environment. For AGI systems, safety cases are crucial to justify deployment decisions and ensure appropriate risk mitigation.

What Are Safety Cases?

Safety cases provide a systematic way to argue that a system's risks have been reduced to an acceptable level. They include both the argument structure (how evidence supports claims about safety) and the evidence itself (test results, design specifications, analysis outcomes, etc.).
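
As a purely illustrative sketch (not a notation used in the report), the structure of a safety case can be modeled as a tree of claims, each supported either by sub-claims or by concrete evidence such as evaluation reports. The claim statements and evidence names below are invented for the example.

    from dataclasses import dataclass, field

    @dataclass
    class Evidence:
        """A concrete artifact backing a claim, e.g. an evaluation report."""
        description: str

    @dataclass
    class Claim:
        """A safety claim supported by sub-claims and/or direct evidence."""
        statement: str
        subclaims: list["Claim"] = field(default_factory=list)
        evidence: list[Evidence] = field(default_factory=list)

        def is_supported(self) -> bool:
            # A leaf claim needs at least one piece of evidence;
            # a compound claim needs all of its sub-claims supported.
            if not self.subclaims:
                return bool(self.evidence)
            return all(c.is_supported() for c in self.subclaims)

    # Toy example: a two-branch argument that misuse risk is acceptable.
    case = Claim(
        "Deployment risk from misuse is acceptably low",
        subclaims=[
            Claim("The model lacks offensive cyber capabilities",
                  evidence=[Evidence("dangerous capability eval report")]),
            Claim("The model lacks bio-uplift capabilities",
                  evidence=[Evidence("dual-use bio eval report")]),
        ],
    )
    print(case.is_supported())  # True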

For AGI systems, specialized safety cases are needed that address the unique risks of advanced AI, particularly for misuse and misalignment risks.

Misuse Safety Cases

For mitigating misuse risks, the report identifies two main types of safety cases:

Inability Safety Case

The argument is that the system does not possess the capabilities needed to cause severe harm. This is justified by the results of dangerous capability evaluations, together with an analysis of how those capabilities would enable the risks in question.

Red-Teamed Safety Case

For models with dangerous capabilities, stress tests are run to understand how robust the described mitigations are against potential attacks. This evidence forms the basis of an argument that those mitigations are sufficient to reduce risk to an acceptable level.

Dangerous Capability Evaluations

Dangerous capability evaluations measure the extent to which an AI system possesses capabilities that could enable severe harm if misused. These evaluations test specific tasks that proxy for real-world harmful capabilities.

For example, evaluations might test an AI system's ability to:

  • Provide detailed instructions for harmful activities
  • Identify exploitable vulnerabilities in critical systems
  • Generate persuasive content for manipulation
  • Assist with dual-use R&D in sensitive domains

If these evaluations show that the model cannot perform these tasks at a level that would enable severe harm, an inability safety case can be made.
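
As a hedged illustration, the decision logic behind an inability safety case might look like the following sketch; the task names, scores, and thresholds are invented for the example and are not drawn from the report.

    # Hypothetical aggregation of dangerous capability evaluation results
    # into a pass/fail signal for an inability safety case.
    CAPABILITY_THRESHOLDS = {
        "cyber_vuln_discovery": 0.20,  # max tolerable success rate on proxy tasks
        "bio_protocol_uplift": 0.10,
        "persuasion_uplift": 0.30,
    }

    def inability_case_holds(eval_results: dict[str, float]) -> bool:
        """True only if every measured capability stays below its threshold.
        Missing results default to 1.0, i.e. they block the safety case."""
        return all(
            eval_results.get(task, 1.0) < threshold
            for task, threshold in CAPABILITY_THRESHOLDS.items()
        )

    results = {"cyber_vuln_discovery": 0.05,
               "bio_protocol_uplift": 0.02,
               "persuasion_uplift": 0.12}
    print(inability_case_holds(results))  # True: an inability argument is plausible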

Red-Teaming and Stress Testing

For models that possess dangerous capabilities, it is necessary to assess whether the described mitigations are sufficient. Red-teaming involves attempting to bypass safety measures to access dangerous capabilities.

This can take various forms:

  • Manual attempts to jailbreak models or craft adversarial prompts
  • Automated searches for vulnerabilities in safety mechanisms
  • Targeted attacks against specific mitigations
  • Simulating sophisticated threat actors with additional advantages

The results of these red-teaming efforts provide evidence for a red-teamed safety case, demonstrating that the mitigations are robust against a range of attack vectors.
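
A minimal sketch of how automated red-teaming results could be summarized, assuming hypothetical attack prompts, a mitigated model, and an independent harm grader; the stand-ins below exist only so the example runs.

    from typing import Callable

    def estimate_bypass_rate(
        attack_prompts: list[str],
        query_mitigated_model: Callable[[str], str],
        judge_harmful: Callable[[str], bool],
    ) -> float:
        """Fraction of red-team attacks that elicit a harmful response
        despite the deployed mitigations."""
        bypasses = sum(
            judge_harmful(query_mitigated_model(p)) for p in attack_prompts
        )
        return bypasses / max(len(attack_prompts), 1)

    # Toy stand-ins; real red-teaming would use actual jailbreak attempts,
    # the mitigated production model, and an independent grader.
    prompts = ["attack prompt A", "attack prompt B"]
    rate = estimate_bypass_rate(prompts, lambda p: "refused", lambda r: r != "refused")
    print(f"bypass rate: {rate:.0%}")  # 0%

A red-teamed safety case would then argue that the measured bypass rate, under sufficiently strong simulated attackers, keeps overall risk at an acceptable level.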

Misalignment Safety Cases

For mitigating misalignment risks, the report considers several approaches to safety cases:

Inability Safety Case

For current models, it can be argued that the AI system could not cause severe harm even if it were trying to do so. This can be demonstrated through capability evaluations, just as in the misuse setting.

Control Safety Case

This type of argument holds that if the AI system were to attempt harmful actions, those actions would be noticed and corrective action taken before harm occurs. Monitoring and control evaluations provide evidence for this type of safety case.
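
One way to picture the mechanism, as a sketch rather than any particular system's design: the AI's proposed action is executed only if a monitor does not flag it, and flagged actions are blocked and escalated. The helper functions here are toy placeholders.

    from typing import Callable, Optional

    def monitored_action(
        propose_action: Callable[[str], str],
        monitor_flags: Callable[[str], bool],
        task: str,
    ) -> Optional[str]:
        """Execute the proposed action only if the monitor does not flag it;
        flagged actions are blocked and escalated for review instead."""
        action = propose_action(task)
        if monitor_flags(action):
            return None  # corrective action: block and escalate
        return action

    # Toy stand-ins; a real setup would pair an untrusted model with a
    # trusted monitor and a human audit process.
    result = monitored_action(
        propose_action=lambda t: f"run: {t}",
        monitor_flags=lambda a: "rm -rf" in a,
        task="rm -rf /",
    )
    print(result)  # None: the suspicious action was blocked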

Understanding Safety Case

This class of safety cases aims to analyze the final model well enough to certify that it only produces safe outputs. Progress in interpretability research could eventually enable such safety cases.

Incentives Safety Case

An incentives-based argument appeals to the training process to argue that the AI system is pursuing the intended goal. This requires addressing challenges like goal misgeneralization or distribution shift.

Control Evaluations

A control evaluation extends misuse red-teaming to the misalignment setting. In the misuse setting, the red team acts as a malicious external user. In a control evaluation, the red team can finetune or prompt the model to try to cause harm, and the success of the resulting model is observed.

Control evaluations test the robustness of the described supervision techniques: if the red team can create a model that behaves in ways that would cause harm but that the supervision fails to detect, then that supervision is insufficient.
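
As a hedged sketch of the measurement itself, a control evaluation can be summarized by the rate at which red-team actions both cause harm and evade the monitor; the actions and checks below are invented placeholders.

    from typing import Callable

    def control_evaluation(
        red_team_actions: list[str],
        monitor_flags: Callable[[str], bool],
        causes_harm: Callable[[str], bool],
    ) -> float:
        """Fraction of red-team actions that would cause harm AND evade the
        monitor. A nonzero rate is evidence against the control safety case."""
        undetected_harm = sum(
            causes_harm(a) and not monitor_flags(a) for a in red_team_actions
        )
        return undetected_harm / max(len(red_team_actions), 1)

    # Toy stand-ins; real control evaluations draw actions from a model the
    # red team has finetuned or prompted to pursue harmful strategies.
    actions = ["benign edit", "exfiltrate weights", "subtle sabotage"]
    rate = control_evaluation(
        actions,
        monitor_flags=lambda a: "exfiltrate" in a,
        causes_harm=lambda a: a != "benign edit",
    )
    print(f"undetected-harm rate: {rate:.0%}")  # 33%: supervision is insufficient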

Alignment Stress Tests

Alignment stress tests involve a red team trying to show that key assumptions in alignment plans are false. This helps identify vulnerabilities in the described approach and provides evidence about the likelihood of misalignment.

These tests typically proceed by:

  1. Starting with a specific safety argument or assumption
  2. Designing experiments to test this assumption's validity
  3. Drawing conclusions based on the experiment results
  4. Using these insights to improve the described alignment approaches
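
The loop above can be expressed as a simple harness, sketched here with invented assumptions and toy experiment functions; real stress tests would, for example, train model organisms of misalignment to probe whether oversight catches deceptive behavior.

    from dataclasses import dataclass
    from typing import Callable

    @dataclass
    class StressTestResult:
        assumption: str
        held: bool
        notes: str

    def run_stress_tests(
        experiments: dict[str, Callable[[], bool]],
    ) -> list[StressTestResult]:
        """Run one experiment per safety assumption and record whether the
        assumption survived the attempt to falsify it."""
        results = []
        for assumption, experiment in experiments.items():
            held = experiment()
            notes = "no counterexample found" if held else "assumption falsified"
            results.append(StressTestResult(assumption, held, notes))
        return results

    # Toy experiments standing in for real falsification attempts.
    report = run_stress_tests({
        "oversight detects reward hacking": lambda: True,
        "finetuning removes backdoored behavior": lambda: False,
    })
    for r in report:
        print(f"{r.assumption}: {r.notes}")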

Building Robust Safety Cases

As AI systems become more powerful, the safety cases described must evolve to address new risks. Initially, inability safety cases may be sufficient, but as capabilities advance, it will be necessary to rely more on control safety cases and eventually on understanding or incentives-based arguments.

A critical aspect of robust safety cases is acknowledging their limitations and continuously seeking to strengthen the evidence they rely on. This requires ongoing research into evaluation methods, interpretability techniques, and alignment approaches.