AGI Safety & Security - Interactive Guide

When addressing safety and security, it is helpful to identify broad groups of pathways to harm that can be addressed through similar mitigation strategies. The report defines areas based on abstract structural features rather than concrete risk domains.

Misuse

The user instructs the AI system to cause harm

Key driver of risk: The user is an adversary

Misalignment

The AI system takes actions that it knows the developer didn't intend

Key driver of risk: The AI is an adversary

Mistakes

The AI system causes harm without realizing it

Key driver of risk: Real world is complex

Structural risks

Harms from multi-agent dynamics, where no single agent is at fault

Key driver of risk: Incentives, culture, etc.

Figure 1: Overview of risk areas

Risks grouped based on factors that drive differences in mitigation approaches.

The Four Risk Categories

Misuse

occurs when a user intentionally instructs the AI system to take actions that cause harm, against the intent of the developer. For example, an AI system might help a hacker conduct cyberattacks against critical infrastructure.

AI could exacerbate harm from misuse in several ways:

Increased possibility of causing harm by placing significant expertise and capabilities in the hands of users
Decreased detectability of harmful actions by helping evade surveillance
Disrupting existing defensive institutions, which take time to adapt to new threats
Enabling automation at scale, allowing individual actors to cause harm more broadly

Explore Misuse Mitigation

Misalignment

occurs when the AI system knowingly causes harm against the intent of the developer. This can happen through or .

For example, an AI system may provide confident answers that stand up to scrutiny from human overseers, but the AI knows the answers are actually incorrect. In more extreme cases, a misaligned AI could actively work against the interests of developers or users to pursue its own goals.

Alignment is particularly challenging as AI systems become more capable than their human overseers, as it becomes harder to determine whether the AI is actually pursuing the goals the developers intend or merely appearing to do so.

Explore Misalignment Mitigation

Mistakes

Mistakes occur when the AI system produces a short sequence of outputs that directly cause harm, but the AI system did not know that the outputs would lead to harmful consequences that the developer did not intend.

For example, an AI agent running a power grid may not be aware that a transmission line requires maintenance, and so might overload it and burn it out, causing a power outage.

When there is no adversary, as with mistakes, standard safety engineering practices (e.g., testing, verification, and redundancy) can drastically reduce risks, and should be similarly effective for averting AI mistakes as for human mistakes.

These risks often emerge from complex interactions and may include:

AI-generated entertainment and companions distorting genuine relationships
AI systems undermining human sense of achievement by doing more and more of human work
Degradation of information quality as AI-generated content becomes widespread
Gradual loss of human control over political and economic decision-making
Challenges to democratic processes through misinformation or surveillance

Structural Risks

Structural risks are harms arising from multi-agent dynamics – involving multiple people, organizations, or AI systems – which would not have been prevented simply by changing one person's behaviour, one system's alignment, or one system's safety controls.

These risks often emerge from complex interactions and may include:

AI-generated entertainment and companions distorting genuine relationships
AI systems undermining human sense of achievement by doing more and more of human work
Degradation of information quality as AI-generated content becomes widespread
Gradual loss of human control over political and economic decision-making
Challenges to democratic processes through misinformation or surveillance

These problems are often complex and multifaceted, requiring broader societal responses rather than purely technical solutions from AI developers.

Focus of the Report

The report's strategy focuses primarily on and , as these present the most significant and addressable risks of severe harm identified by the authors. Mistakes are considered better addressed through standard safety practices, while structural risks require broader societal responses.

Approach to Misuse

For misuse, the strategy outlined in the report aims to prevent bad actors from accessing dangerous capabilities through robust security, access restrictions, monitoring, and model safety mitigations. The report focuses on identifying when models have dangerous capabilities and implementing appropriate safeguards.

Approach to Misalignment

For misalignment, the report outlines two lines of defense: model-level mitigations such as amplified oversight and robust training, and system-level security measures such as monitoring and access control. The report also leverages techniques from interpretability and uncertainty estimation to enhance these mitigations.

Explore Misuse Mitigation Explore Misalignment Mitigation

Risk Areas

The Four Risk Categories

Misuse

Misalignment

Mistakes

Structural Risks

Focus of the Report

Approach to Misuse

Approach to Misalignment