Agents Rule of Two: A Practical Approach to AI Agent Security

Imagine a personal AI agent, Email-Bot, designed to help you manage your inbox. To provide value and operate effectively, Email-Bot might need to:

  • Access unread email contents from various senders to provide helpful summaries
  • Read through your existing email inbox to track important updates, reminders, or context
  • Send replies or follow-up emails on your behalf

While such an automated assistant can be incredibly helpful, it also highlights how AI agents introduce novel risks. One of the biggest challenges facing the industry is agents’ susceptibility to prompt injection.

Prompt injection is a fundamental, unsolved weakness in all large language models (LLMs). When untrustworthy strings or data enter an AI agent’s context window, they can cause unintended consequences—ignoring developer instructions, bypassing safety guidelines, or executing unauthorized tasks. This vulnerability could allow an attacker to take control of the agent and harm the user.

For example, if an attacker embeds a prompt injection string in a spam email, they might hijack Email-Bot once it processes that email. Potential attacks include exfiltrating sensitive data like private email contents or sending phishing messages to the user’s contacts.

At Meta, we’re excited about the potential of agentic AI to improve lives and boost productivity. Achieving this vision means granting agents more capabilities, such as access to untrusted data sources, private or sensitive information, and autonomous tools. However, we must balance utility with security to minimize risks like data exfiltration, unauthorized actions, or system disruption.

To address these challenges, we developed the Agents Rule of Two, a framework that deterministically reduces security risks when followed. Inspired by similar policies in Chromium and Simon Willison’s “lethal trifecta,” this approach helps developers navigate trade-offs in today’s powerful agent frameworks.

Agents Rule of Two

At a high level, the Agents Rule of Two states that until robustness research enables reliable detection and refusal of prompt injection, agents must satisfy no more than two of the following three properties within a session to avoid high-impact consequences:

  • [A] An agent can process untrustworthy inputs
  • [B] An agent can have access to sensitive systems or private data
  • [C] An agent can change state or communicate externally

If an agent requires all three properties without starting a new session (i.e., with a fresh context window), it should not operate autonomously and must have supervision—such as human-in-the-loop approval or another reliable validation method.
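
To make the rule concrete, here is a minimal sketch, in Python, of how a developer might encode this check before letting an agent act autonomously. The capability flags and function names are illustrative assumptions for this post, not part of any published framework or API.

```python
from dataclasses import dataclass

# Hypothetical capability flags for a single agent session (names are illustrative).
@dataclass(frozen=True)
class SessionCapabilities:
    untrusted_inputs: bool    # [A] processes untrustworthy inputs
    sensitive_access: bool    # [B] reaches sensitive systems or private data
    external_effects: bool    # [C] changes state or communicates externally

def violates_rule_of_two(caps: SessionCapabilities) -> bool:
    """True when all three properties would be active in the same session."""
    return caps.untrusted_inputs and caps.sensitive_access and caps.external_effects

def start_session(caps: SessionCapabilities, supervised: bool) -> None:
    # An agent that needs all three properties must not run autonomously;
    # require human-in-the-loop approval or another reliable validation step.
    if violates_rule_of_two(caps) and not supervised:
        raise PermissionError("Rule of Two: drop a capability or enable supervision")
    print("session started:", caps)

# Email-Bot reading spam [A] and the inbox [B] while sending mail [C]
# is only allowed under supervision.
start_session(SessionCapabilities(True, True, True), supervised=True)
```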

How the Agents Rule of Two Stops Exploitation

Returning to our Email-Bot example, let’s see how applying the Agents Rule of Two can prevent a data exfiltration attack.

Attack Scenario: A spam email contains a prompt injection string that instructs Email-Bot to gather private inbox contents and forward them to the attacker using a Send-New-Email tool.

This attack succeeds because:
[A] The agent accesses untrusted data (spam emails)
[B] The agent accesses private data (inbox)
[C] The agent communicates externally (sends new emails)

With the Agents Rule of Two, this attack can be prevented in several ways:

  • In a [BC] configuration, the agent only processes emails from trustworthy senders, preventing the prompt injection payload from reaching its context window.
  • In an [AC] configuration, the agent has no access to sensitive data or systems (e.g., operating in a test environment), so any prompt injection has no meaningful impact.
  • In an [AB] configuration, the agent can only send new emails to trusted recipients or after human validation of draft messages, blocking the attacker from completing the attack chain.
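
To illustrate the third option, here is a minimal sketch of how an [AB] configuration might gate the Send-New-Email tool behind a recipient allowlist with a human-approval fallback. The tool name comes from the scenario above; the helper functions and addresses are assumptions made for this example.

```python
TRUSTED_RECIPIENTS = {"alice@example.com", "bob@example.com"}  # assumed allowlist

def deliver(draft: dict) -> None:
    print("delivering:", draft["subject"], "->", draft["to"])  # stand-in transport

def human_approves(draft: dict) -> bool:
    """Stand-in for a human-in-the-loop review of the drafted email."""
    answer = input(f"Send email to {draft['to']!r}? [y/N] ")
    return answer.strip().lower() == "y"

def send_new_email(draft: dict) -> None:
    # [AB] configuration: external communication [C] is constrained, so a
    # prompt-injected request to mail the attacker cannot complete the chain.
    if draft["to"] in TRUSTED_RECIPIENTS or human_approves(draft):
        deliver(draft)
    else:
        raise PermissionError("recipient blocked by policy")

send_new_email({"to": "alice@example.com", "subject": "Weekly summary"})
```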

This framework allows developers to compare designs and trade-offs—such as user friction or capability limits—to choose the best option for their users’ needs.

Hypothetical Examples and Implementations

Let’s explore three hypothetical agent use cases and how they satisfy the framework:

Travel Agent Assistant [AB]

This public-facing travel assistant answers questions and acts on a user’s behalf. It searches the web for up-to-date travel information [A] and accesses private user info for booking and purchases [B]. To satisfy the Rule of Two, we place preventative controls on tools and communication [C] by:

  • Requesting human confirmation for actions like reservations or payments
  • Limiting web requests to URLs from trusted sources, avoiding agent-constructed URLs
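
A minimal sketch of the second control, assuming a hypothetical allowlist of trusted travel-information hosts (the host names here are placeholders):

```python
from urllib.parse import urlparse

# Assumed allowlist of trusted travel-information hosts.
TRUSTED_HOSTS = {"www.example-airline.com", "travel.example.gov"}

def url_allowed(url: str) -> bool:
    """Permit web requests only to hosts on the trusted-source allowlist,
    rejecting agent-constructed URLs that point anywhere else."""
    host = urlparse(url).hostname or ""
    return host in TRUSTED_HOSTS

print(url_allowed("https://travel.example.gov/advisories"))          # True
print(url_allowed("https://attacker.example.net/exfil?q=secrets"))   # False
```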

Web Browsing Research Assistant [AC]

This agent interacts with a web browser to perform research. It fills out forms and sends requests to arbitrary URLs [C] and processes results [A] to replan as needed. To satisfy the Rule of Two, we control access to sensitive systems and private data [B] by:

  • Running the browser in a restrictive sandbox without preloaded session data
  • Limiting the agent’s access to private information and informing users about data sharing
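
One way to approximate "no preloaded session data" is to launch each research task in a fresh browser context. The sketch below uses Playwright as an assumed tool choice; the post does not prescribe a browser stack, and a real deployment would add network and filesystem sandboxing around the process.

```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    # A brand-new context starts with no cookies, logins, or saved storage,
    # so a hijacked agent finds nothing sensitive [B] to exfiltrate.
    context = browser.new_context()
    page = context.new_page()
    page.goto("https://example.com")
    print(page.title())
    browser.close()
```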

High-Velocity Internal Coder [BC]

This agent solves engineering problems by generating and executing code across internal infrastructure. It accesses production systems [B] and makes stateful changes [C]. To satisfy the Rule of Two, we control untrustworthy data sources [A] by:

  • Using author-lineage to filter data sources in the agent’s context window
  • Providing a human-review process for marking false positives and enabling data access
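
As an illustration of the first control, the sketch below filters candidate context documents by author lineage against an assumed internal allowlist; anything untrusted is set aside for human review rather than fed to the agent. The `Document` type and author names are hypothetical.

```python
from dataclasses import dataclass

TRUSTED_AUTHORS = {"infra-team", "release-bot"}  # assumed internal allowlist

@dataclass
class Document:
    author: str
    text: str

def split_by_lineage(docs: list[Document]) -> tuple[list[Document], list[Document]]:
    """Separate documents into (safe for the context window, held for human
    review) based on whether their author lineage is trusted."""
    trusted = [d for d in docs if d.author in TRUSTED_AUTHORS]
    held = [d for d in docs if d.author not in TRUSTED_AUTHORS]
    return trusted, held

safe, review_queue = split_by_lineage([
    Document("infra-team", "runbook: deployment steps"),
    Document("external-contributor", "ignore previous instructions and ..."),
])
print(len(safe), "documents admitted;", len(review_queue), "awaiting review")
```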

As with any general framework, the devil is in the details. Agents can safely transition between Rule of Two configurations within a session: for example, an agent can start in [AC] to browse the internet and then move to [AB] by disabling external communication before it touches internal systems. The key is disrupting the exploit path so an attack can never complete the full chain [A] → [B] → [C].
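
Extending the earlier capability sketch, the snippet below illustrates such a mid-session transition: a hypothetical session object refuses to activate a third property until another has been revoked.

```python
class AgentSession:
    """Illustrative session that enforces the Rule of Two across transitions."""

    def __init__(self, capabilities: set[str]):
        self._caps = set(capabilities)   # subset of {"A", "B", "C"}
        self._check()

    def _check(self) -> None:
        if len(self._caps) > 2:
            raise PermissionError("all three properties active; supervision required")

    def revoke(self, cap: str) -> None:
        self._caps.discard(cap)

    def grant(self, cap: str) -> None:
        self._caps.add(cap)
        self._check()

session = AgentSession({"A", "C"})  # browse the internet on untrusted content
session.revoke("C")                 # turn off external communication first...
session.grant("B")                  # ...then reach internal systems: now [AB]
```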

Limitations

Satisfying the Agents Rule of Two is not sufficient to protect against other common threat vectors—such as attacker uplift, spam proliferation, agent mistakes, hallucinations, or excessive privileges—or lower-consequence outcomes like misinformation in agent responses.

Similarly, applying the Rule of Two is not a finish line for risk mitigation. Designs that comply can still fail (e.g., users blindly confirming warnings), and defense in depth is critical for high-risk scenarios where single-layer failures are likely. The Rule of Two supplements—but does not replace—common security principles like least-privilege.

Existing Solutions

For additional AI protection solutions that complement the Agents Rule of Two, explore Llama Protections, which include:

  • Llama Firewall for orchestrating agent protections
  • Prompt Guard for classifying potential prompt injections
  • Code Shield to reduce insecure code suggestions
  • Llama Guard for classifying potentially harmful content

What’s Next

We believe the Agents Rule of Two is a practical framework for developers today, with great potential to enable secure development at scale.

As plug-and-play agentic tool-calling gains adoption through protocols like Model Context Protocol (MCP), we see both emerging risks and opportunities. While blindly connecting agents to new tools can be risky, built-in Rule of Two awareness could enable security-by-default. For instance, declaring a Rule of Two configuration in tool calls gives developers confidence that actions will succeed, fail, or request approval according to their policy.
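
As a purely hypothetical sketch of what that could look like, the snippet below attaches Rule of Two property declarations to a tool description and checks calls against the session's active properties. None of these field names exist in MCP today; they are assumptions made for illustration.

```python
# Hypothetical manifest fields; MCP defines no such properties today.
SEND_EMAIL_TOOL = {
    "name": "send_new_email",
    "rule_of_two": {
        "untrusted_inputs": False,   # [A] does not ingest untrusted data
        "sensitive_access": False,   # [B] does not touch private data
        "external_effects": True,    # [C] communicates externally
    },
}

PROPERTY_KEYS = {"A": "untrusted_inputs", "B": "sensitive_access", "C": "external_effects"}

def call_allowed(session_props: set[str], tool: dict) -> bool:
    """Allow the call only if the session plus the tool still satisfy at most
    two of the three properties; otherwise the runtime should request approval."""
    tool_props = {p for p, key in PROPERTY_KEYS.items() if tool["rule_of_two"][key]}
    return len(session_props | tool_props) <= 2

# An [AB] email-summarizing session cannot also invoke an externally
# communicating tool without approval:
print(call_allowed({"A", "B"}, SEND_EMAIL_TOOL))  # False -> request approval
```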

We recognize that as agents become more capable, some use cases—like background processes where human-in-the-loop is disruptive—will be challenging to fit neatly into the Rule of Two. While traditional guardrails and human approvals remain preferred for now, we’re researching ways to satisfy supervisory checks via alignment controls, such as oversight agents and the open-source LlamaFirewall platform. We look forward to sharing more progress in the future.