When addressing safety and security, it is helpful to identify broad groups of pathways to harm that can be addressed through similar mitigation strategies. The report defines areas based on abstract structural features rather than concrete risk domains.
The user instructs the AI system to cause harm
Key driver of risk: The user is an adversary
The AI system takes actions that it knows the developer didn't intend
Key driver of risk: The AI is an adversary
The AI system causes harm without realizing it
Key driver of risk: Real world is complex
Harms from multi-agent dynamics, where no single agent is at fault
Key driver of risk: Incentives, culture, etc.
Figure 1: Overview of risk areas
Risks grouped based on factors that drive differences in mitigation approaches.
The Four Risk Categories
Misuse
occurs when a user intentionally instructs the AI system to take actions that cause harm, against the intent of the developer. For example, an AI system might help a hacker conduct cyberattacks against critical infrastructure.
AI could exacerbate harm from misuse in several ways:
- Increased possibility of causing harm by placing significant expertise and capabilities in the hands of users
- Decreased detectability of harmful actions by helping evade surveillance
- Disrupting existing defensive institutions, which take time to adapt to new threats
- Enabling automation at scale, allowing individual actors to cause harm more broadly
Misalignment
occurs when the AI system knowingly causes harm against the intent of the developer. This can happen through or .
For example, an AI system may provide confident answers that stand up to scrutiny from human overseers, but the AI knows the answers are actually incorrect. In more extreme cases, a misaligned AI could actively work against the interests of developers or users to pursue its own goals.
Alignment is particularly challenging as AI systems become more capable than their human overseers, as it becomes harder to determine whether the AI is actually pursuing the goals the developers intend or merely appearing to do so.
Mistakes
Mistakes occur when the AI system produces a short sequence of outputs that directly cause harm, but the AI system did not know that the outputs would lead to harmful consequences that the developer did not intend.
For example, an AI agent running a power grid may not be aware that a transmission line requires maintenance, and so might overload it and burn it out, causing a power outage.
When there is no adversary, as with mistakes, standard safety engineering practices (e.g., testing, verification, and redundancy) can drastically reduce risks, and should be similarly effective for averting AI mistakes as for human mistakes.
These risks often emerge from complex interactions and may include:
- AI-generated entertainment and companions distorting genuine relationships
- AI systems undermining human sense of achievement by doing more and more of human work
- Degradation of information quality as AI-generated content becomes widespread
- Gradual loss of human control over political and economic decision-making
- Challenges to democratic processes through misinformation or surveillance
Structural Risks
Structural risks are harms arising from multi-agent dynamics – involving multiple people, organizations, or AI systems – which would not have been prevented simply by changing one person's behaviour, one system's alignment, or one system's safety controls.
These risks often emerge from complex interactions and may include:
- AI-generated entertainment and companions distorting genuine relationships
- AI systems undermining human sense of achievement by doing more and more of human work
- Degradation of information quality as AI-generated content becomes widespread
- Gradual loss of human control over political and economic decision-making
- Challenges to democratic processes through misinformation or surveillance
These problems are often complex and multifaceted, requiring broader societal responses rather than purely technical solutions from AI developers.
Focus of the Report
The report's strategy focuses primarily on and , as these present the most significant and addressable risks of severe harm identified by the authors. Mistakes are considered better addressed through standard safety practices, while structural risks require broader societal responses.
Approach to Misuse
For misuse, the strategy outlined in the report aims to prevent bad actors from accessing dangerous capabilities through robust security, access restrictions, monitoring, and model safety mitigations. The report focuses on identifying when models have dangerous capabilities and implementing appropriate safeguards.
Approach to Misalignment
For misalignment, the report outlines two lines of defense: model-level mitigations such as amplified oversight and robust training, and system-level security measures such as monitoring and access control. The report also leverages techniques from interpretability and uncertainty estimation to enhance these mitigations.