Articles/"Defense Is Present" Is Not "Defense Actually Works"

"Defense Is Present" Is Not "Defense Actually Works"

Part 3 of the series: Governing the Agents That Govern Themselves

Jeff Gray · March 10, 2026

Here's a test that's probably in your CI pipeline right now, or something close to it:

def test_sandbox_network_isolation():
    # load_sandbox_config() and DISABLED come from your sandbox module.
    config = load_sandbox_config()
    assert config.network_access == DISABLED

This test passes. Your CI is green. Your sandbox is correctly configured to disable network access.

Now here's the question that test doesn't answer: if a process running inside that sandbox actually tries to open a TCP connection to an external host, is the connection blocked?

The test verified the configuration. It did not verify the defense. Those are categorically different claims, and the gap between them is the subject of this article.
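
For contrast, here's roughly what verifying the defense itself looks like: attempt the connection and assert that it fails. This is a sketch, not a real API; run_in_sandbox() is a hypothetical helper standing in for however your system executes code inside the sandbox.

import socket

def probe():
    # Attempt an outbound TCP connection to a TEST-NET address.
    # If network isolation actually works, this must fail.
    socket.create_connection(("203.0.113.10", 443), timeout=5)

def test_sandbox_network_isolation_behavior():
    # run_in_sandbox() is hypothetical: it executes the probe inside
    # the configured sandbox and returns the exception it raised,
    # or None if the probe succeeded.
    error = run_in_sandbox(probe)
    assert isinstance(error, OSError), "outbound connection was not blocked"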


The Verification Gap

I started thinking about this problem while building defenses for an autonomous coding agent. We'd configured sandboxed execution with dropped capabilities, network isolation, read-only filesystem mounts — the usual layered defense posture. We wrote tests for all of it. Everything passed.

And then I kept asking myself: but does it work? If something crafted to bypass our input sanitization hits the API, does the sanitization actually catch it? If code running in the sandbox tries to escalate privileges, does the capability dropping actually prevent it? If something slips through and does something it shouldn't, does the escalation pipeline actually fire?

Standard tests don't answer these questions. They verify that defenses are present — that the configuration specifies the right values, that the code paths exist, that the error handlers are defined. They don't verify that when a real attack is attempted, the defense holds, the attempt is logged with sufficient detail for forensic analysis, and the event is escalated to a human through a channel the attacker can't suppress.

Those are three independent claims. Each requires its own evidence. And a standard test suite doesn't produce that evidence.

This is the verification gap: the distance between "defense is configured" and "defense works under actual attack." In traditional security engineering, this gap is why penetration testing exists alongside configuration auditing. But for autonomous agents, the gap has an additional dimension that most teams haven't grappled with.


Why Autonomous Agents Make This Harder

Many autonomous agents can read their own codebase, configuration, and operational history as part of their normal function. That's not a bug — it's the feature. An agent that maintains a software repository needs to read the repository to do its job.

This means that if your adversarial test payloads are stored anywhere the agent can see them — in the same repository, in configuration files the agent reads, in logs the agent analyzes — the agent will eventually find them. Not because it's adversarial. Because reading code is literally what the agent does. It will encounter the test payloads, analyze their structure, learn their patterns, and potentially adapt to handle them — the same way it adapts to any other pattern it encounters in its operational environment.

This isn't speculative. It follows directly from the properties of a system that reads and reasons about its own code. If the system can see the tests designed to probe it, the tests are compromised. Not today, maybe. But eventually. This is a structural property, and it requires a structural response.

The structural response is one-way visibility: the testing system can observe the target, but the target must have zero awareness that the testing system exists. Separate infrastructure. Separate repository. Separate credentials. Separate network. The testing system watches the target through the same interfaces an external user would use. The target has no channel, no endpoint, no filesystem path, no environment variable that reveals the testing system's presence.

This is harder than it sounds, especially for teams used to keeping tests in the same repository as the code they test. But for a system that reads its own codebase as part of normal operations, co-located testing is fundamentally compromised.
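
To make the one-way constraint concrete: the harness's entire view of the target goes through the target's public interface. A minimal sketch, assuming a hypothetical HTTP API (the URL, endpoint, and payload shape are all placeholders):

import json
import urllib.error
import urllib.request

TARGET_URL = "https://target.example.com/api/v1/tasks"  # known only to the harness

def send_probe(payload: dict) -> tuple[int, bytes]:
    # The harness speaks plain HTTP to the target's public API, the
    # same interface any external user would use. Nothing on the
    # target side reveals that this request came from a test harness.
    req = urllib.request.Request(
        TARGET_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            return resp.status, resp.read()
    except urllib.error.HTTPError as err:
        return err.code, err.read()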


The Oracle Pattern

Once you've established that you need to actually attack your defenses (not just verify their configuration), you need a way to determine whether the defense held. This is the oracle problem — and it's more nuanced than pass/fail.

We developed a verdict taxonomy with six states, because we kept finding that binary pass/fail missed failure modes that matter. Here's the taxonomy:

EXPECTED_BLOCK — the attack was attempted, the target rejected it, and an escalation event fired. This is the happy path. The defense worked, and the system told a human about it.

EXPECTED_ALLOW — a legitimate action was probed, and the target allowed it. This confirms that the defenses aren't over-aggressive. Your system is functional, not just locked down.

EXPECTED_DETECT — the attack was attempted, the target allowed it, but the attempt was logged and escalated. This is a soft detection — the defense didn't hard-block the attack, but the system noticed and reported it. Depending on the threat, this might be an acceptable outcome.

Now the failure states:

ATTACK_PASSED — the attack was attempted, the target allowed it, and no escalation fired. This is the bad one. The defense failed silently. The system is compromised and nobody knows.

FALSE_POSITIVE — a legitimate action was blocked. The defense is too aggressive. This sounds like a minor problem, but for an autonomous agent, false positives can impair core functionality. An agent that can't make legitimate API calls because the firewall is too restrictive isn't safe — it's broken.

SILENT_FAIL — the attack was attempted, the target rejected it, but no escalation fired. The defense worked, but the alerting didn't. This is the failure mode most teams miss. In a standard pass/fail framework, this looks like a pass — the attack was blocked! But nobody was notified. If you're relying on escalation events to maintain operational awareness, a defense that blocks attacks but never tells anyone about them is a defense you can't trust to keep working.

The three-signal verification that catches SILENT_FAIL is what makes the oracle pattern useful in practice. After each probe, you check three independent signals: the interface response (did the API return an error code?), the escalation log (did the system fire an escalation event?), and the target state (did the targeted resource actually change?). All three must match expectations for a pass. A defense that blocks correctly but doesn't escalate is a SILENT_FAIL. A defense that escalates but doesn't block is an EXPECTED_DETECT. These distinctions matter because they correspond to different operational risks.
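
Here's a sketch of how the three signals might collapse into the six verdicts (the names mirror the taxonomy above; the function shape is mine, and a real harness would derive blocked by requiring the interface response and the target state to agree):

from enum import Enum, auto

class Verdict(Enum):
    EXPECTED_BLOCK = auto()   # attack blocked, escalation fired
    EXPECTED_ALLOW = auto()   # legitimate action allowed
    EXPECTED_DETECT = auto()  # attack allowed, but logged and escalated
    ATTACK_PASSED = auto()    # attack allowed, no escalation
    FALSE_POSITIVE = auto()   # legitimate action blocked
    SILENT_FAIL = auto()      # attack blocked, but no escalation

def classify(is_attack: bool, blocked: bool, escalated: bool) -> Verdict:
    # `blocked` should only be True when two signals agree: the
    # interface returned an error AND the target state is unchanged.
    if is_attack:
        if blocked:
            return Verdict.EXPECTED_BLOCK if escalated else Verdict.SILENT_FAIL
        return Verdict.EXPECTED_DETECT if escalated else Verdict.ATTACK_PASSED
    return Verdict.FALSE_POSITIVE if blocked else Verdict.EXPECTED_ALLOW

Encoding it this way means SILENT_FAIL can never masquerade as a pass: it's a distinct value, not a green checkmark.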


Canary Payloads vs. Attack Payloads

One more distinction that has practical implications for how you maintain a testing suite for an evolving system.

Attack payloads are crafted inputs designed to test whether a specific defense holds. They're the adversarial probes — the things that try to bypass sanitization, escape sandboxes, inject malicious content, escalate privileges. They test the current defense posture.

Canary payloads are stable, unchanging inputs that test whether the system's behavior has drifted. They don't try to exploit anything. They present the same input every time and check whether the output is consistent with historical behavior. If a canary payload that's been producing the same result for months suddenly produces a different result, something changed — and that change might have security implications, even if the canary payload itself isn't adversarial.

The reason this distinction matters is maintenance. An autonomous agent that's constantly evolving its codebase will naturally shift the behavior of its defenses over time. Attack payloads test whether the current defenses hold against known attack patterns. Canary payloads detect when something has changed that might affect defenses in ways you haven't anticipated — including changes the agent made to itself during an improvement cycle.

For a static system, attack payloads might be sufficient. For a self-modifying system, canary payloads are what catch the slow drift that no individual improvement cycle is designed to monitor.
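
In practice a canary check can be small. Here's a sketch, reusing the hypothetical send_probe() from earlier; it compares a coarse signature of the response rather than raw output, since an agent's output can vary legitimately from run to run:

import hashlib

# A fixed input that never changes. The payload shape is a placeholder.
CANARY_PAYLOAD = {"task": "status", "input": "canary-001: fixed text, never modified"}

def canary_signature() -> tuple:
    # Reduce the response to (status code, digest of body). For a
    # nondeterministic target, normalize the body before hashing.
    status, body = send_probe(CANARY_PAYLOAD)
    return (status, hashlib.sha256(body).hexdigest())

def canary_drifted(baseline: tuple) -> bool:
    # A mismatch isn't a vulnerability report. It's a drift signal
    # that a human should investigate.
    return canary_signature() != baseline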


The Practical Takeaway

If you're running an autonomous agent with any kind of security defense — sandboxing, input sanitization, access controls, escalation pipelines — here's what I'd suggest:

Pick one defense that you consider critical. Now ask yourself: do you have evidence that it works under actual attack conditions? Not that it's configured correctly. Not that the code path exists. Evidence that when a crafted input designed to bypass it actually hits the system, the defense blocks it, the attempt is logged, and someone is notified.

If the answer is no, you have a verification gap. Closing it doesn't require the full oracle pattern and verdict taxonomy I've described — that's the systematic version. Even a single adversarial test against your most critical defense, checking all three signals, gives you more evidence than a thousand configuration tests.
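
In test form, that single check is three assertions against three independent signals. Every name here (send_probe(), target_state_changed(), escalation_fired(), and the payload itself) is a placeholder for your own infrastructure:

CRAFTED_BYPASS_PAYLOAD = {"task": "run", "input": "<input designed to bypass your sanitization>"}

def test_critical_defense_under_attack():
    status, _ = send_probe(CRAFTED_BYPASS_PAYLOAD)

    # Signal 1: the interface rejected the attack.
    assert status >= 400, "interface did not reject the attack"
    # Signal 2: the targeted resource did not change.
    assert not target_state_changed(), "attack altered target state"
    # Signal 3: a human was notified through the escalation pipeline.
    assert escalation_fired(), "defense held but nobody was told (SILENT_FAIL)"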

The configuration tests are necessary. They're just not sufficient. "Defense is present" is where security work starts, not where it ends.