
What Your Agent Does When Something Goes Wrong Is an Ethical Decision

Part 4 of the series: Governing the Agents That Govern Themselves

Jeff Gray · March 10, 2026

I want to make a claim that most engineering teams haven't considered: what your autonomous agent does when something goes wrong is not just a reliability decision. It's an ethical commitment. And like the other commitments we've been talking about in this series, it needs structural enforcement.

This is the philosophical article in the series. I'm writing it for the engineer who doesn't usually read philosophy — because the argument has direct engineering consequences, and I think it deserves a non-academic treatment.


Three Ways to Fail

When an autonomous agent encounters something it wasn't designed for — an input it can't parse, a state it doesn't recognize, a conflict between two objectives — it has to do something. The "something" falls into three broad categories, and the choice between them reveals what you actually value.

Silent absorption. The system keeps running. Nobody is notified. The unhandled condition gets swallowed, and whatever downstream effects it produces — corrupted results, degraded performance, incorrect outputs — propagate silently. The operator discovers the problem hours or days later, if they discover it at all, and by then the corrupted outputs have been consumed by other systems and humans who relied on them.

System halt. The agent stops. Everything stops. This is safe in a narrow sense — the agent isn't doing any more damage. But the human operator gets no evidence about what happened. No structured explanation. No preserved state. Just silence. They know something went wrong, but they don't know what, and the operational context that would have helped them understand is gone.

Structured escalation. The specific triggering action is halted. Unrelated work continues unaffected. A structured evidence package — what the agent was doing, what it encountered, what it recommends, why it stopped this particular action — is routed to a human operator through a dedicated channel. The operator has what they need to make an informed decision. The agent has preserved its operational context. The scope of the disruption is proportional to the scope of the problem.
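To make that third option concrete, here is a minimal sketch of what a structured evidence package might look like. Every name in it (the fields, the `notify_operator` channel) is an illustrative assumption, not a prescribed API; the point is the shape of the evidence, not a specific schema.

```python
from dataclasses import dataclass, field
import time

@dataclass
class EvidencePackage:
    """What a human operator needs to make an informed decision."""
    action_id: str        # which specific action was halted
    intent: str           # what the agent was trying to accomplish
    observation: str      # what it actually encountered
    recommendation: str   # what the agent suggests doing next
    halted_at: float = field(default_factory=time.time)

def escalate(package: EvidencePackage, notify_operator) -> None:
    """Route evidence through a dedicated operator channel.

    Only the triggering action is halted; the caller keeps
    processing unrelated work.
    """
    notify_operator(package)
```

The design choice worth noticing: the package is built for the operator's decision, not for the agent's debugging.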

If you've been building software for any length of time, you can probably rank those options intuitively. Silent absorption is clearly worst — it's the one that leads to invisible corruption. System halt is better than silent absorption but worse than you'd like — it's the "turn it off and back on" approach that sacrifices operational context. Structured escalation is what you'd want from a competent colleague who ran into something they didn't understand.

Here's the part that's easy to miss: that ranking isn't a technical judgment. It's an ethical one. You're choosing between failure modes based on values — transparency, accountability, proportionality — not based on which option is technically simplest to implement. (Silent absorption is technically simplest. It's also clearly worst.)


Why This Is an Ethics Problem, Not Just an Engineering One

The values encoded in structured escalation aren't arbitrary preferences. Each one maps to a recognized ethical principle.

Transparency means the system's operational history is visible and accurate. An immutable, append-only audit log that the agent cannot modify or delete. When something goes wrong, there's a complete record of what happened, in what order, and what the agent's state was at each step. This isn't just useful for debugging. It's the foundation of trust. A system whose operational history can be rewritten by the system itself is a system whose history can't be trusted — and a system whose history can't be trusted can't be trusted at all.
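As a sketch of how that can be enforced by the operating system rather than by the agent's own code (the log path and setup here are assumptions): the agent's process holds a descriptor that can only append. Making the file genuinely unmodifiable, for example via file ownership or Linux's `chattr +a`, is done by a process the agent cannot control.

```python
import json
import os
import time

# Hypothetical path. The file is made append-only at the OS level by
# something outside the agent's process, so the agent can add entries
# but has no mechanism to rewrite or delete them.
AUDIT_LOG = "/var/log/agent/audit.jsonl"

def audit(event: str, **details) -> None:
    """Append one immutable record to the audit log."""
    record = {"ts": time.time(), "event": event, "details": details}
    # O_APPEND guarantees each write lands at the current end of the
    # file, even with concurrent writers. There is no API here for
    # editing history, only for extending it.
    fd = os.open(AUDIT_LOG, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o640)
    try:
        os.write(fd, (json.dumps(record) + "\n").encode("utf-8"))
    finally:
        os.close(fd)
```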

Accountability means that when the agent encounters a situation that exceeds its authorization or competence, a human is notified with enough context to take informed action. Not just "an error occurred." A structured evidence package that explains what the agent was trying to do, what it encountered, what it thinks went wrong, and what it recommends. The human makes the decision. The agent provides the evidence. The responsibility for consequential choices stays with a person.

Separation of powers means the entity making decisions is not also the entity enforcing constraints on those decisions. In practice, this means the agent cannot modify the rules that govern its own operation. The governance framework — the constraints, the escalation triggers, the audit log configuration — is protected by integrity mechanisms the agent's process cannot reach. The agent proposes. The architecture constrains. The human adjudicates.
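Here is a minimal sketch of that integrity check, with hypothetical paths and a hypothetical digest manifest. The essential property is that the agent can read the expected digests but has no write access to them.

```python
import hashlib
import json

# Hypothetical locations. The manifest of expected digests lives on a
# path the agent's process can read but not write, enforced by file
# ownership and permissions rather than by the agent's good behavior.
GOVERNANCE_DOCS = ["/etc/agent/constraints.yaml", "/etc/agent/escalation.yaml"]
MANIFEST = "/etc/agent/governance.sha256.json"

def governance_intact() -> bool:
    """Verify every governance document matches its recorded digest."""
    with open(MANIFEST, "r", encoding="utf-8") as f:
        expected = json.load(f)  # {path: hex digest}
    for path in GOVERNANCE_DOCS:
        with open(path, "rb") as f:
            digest = hashlib.sha256(f.read()).hexdigest()
        if expected.get(path) != digest:
            return False  # refuse to proceed; escalate to a human
    return True
```

If the check fails, the right response is the escalation path described above, not a silent repair.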

Proportionality means the system's response to a problem is scaled to the problem's scope. If one action in a batch triggers an anomaly, the agent halts that action — not all actions. Resource consumption is bounded. Autonomy is bounded. The agent does what it's authorized to do and doesn't exceed that authorization, even if exceeding it would produce a "better" result by whatever metric the agent is optimizing.
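Continuing the earlier sketch (and reusing the hypothetical `EvidencePackage` and `escalate` from above), proportionality in a batch might look like this:

```python
class AnomalousCondition(Exception):
    """Raised when an action hits a state it wasn't designed for."""

def run_batch(actions, notify_operator):
    """Halt the anomalous action, not the whole batch."""
    completed, halted = [], []
    for action in actions:  # `action` objects here are hypothetical
        try:
            completed.append(action.execute())
        except AnomalousCondition as exc:
            # Scope of disruption matches scope of problem: this one
            # action stops; the rest of the batch proceeds.
            halted.append(action)
            escalate(EvidencePackage(
                action_id=action.id,
                intent=action.describe(),
                observation=str(exc),
                recommendation="operator review before retry",
            ), notify_operator)
    return completed, halted
```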

Intellectual honesty means the agent's outputs actually represent what the agent did. Quality gates that resist gaming — mandatory test execution with real coverage thresholds, static analysis that catches real issues, lint checks that enforce real standards. An agent that achieves coverage targets by modifying test configurations to exclude difficult modules has optimized a metric while undermining the property the metric was meant to proxy. The commitment is to the property, not the metric.
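A sketch of a gate that commits to the property rather than the metric, with an assumed threshold and config path: the coverage number is trusted only if the configuration that produced it is the one that was reviewed.

```python
import hashlib

def coverage_gate(coverage_pct: float,
                  reviewed_config_sha256: str,
                  config_path: str = "pyproject.toml",  # assumed location
                  threshold: float = 85.0) -> bool:
    """Trust the coverage number only if its measurement wasn't gamed."""
    with open(config_path, "rb") as f:
        actual = hashlib.sha256(f.read()).hexdigest()
    if actual != reviewed_config_sha256:
        # The agent changed what "coverage" measures, e.g. by excluding
        # difficult modules. The metric no longer proxies the property.
        return False
    return coverage_pct >= threshold
```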


Why These Commitments Need Structural Enforcement

This is where the argument connects to the rest of the series. For a static system — one that doesn't modify its own code or configuration over time — advisory constraints might be adequate. You can write these commitments into a design document, implement them in code, review the implementation, and trust that what you reviewed is what's running.

For a self-modifying system, that's not sufficient. Here's why.

A system that iterates on its own code and context over multiple cycles can, through ordinary optimization, arrive at a state where the advisory commitments no longer hold. Not through malice. Through the same incremental drift I described in Part 1. An agent that modifies its own test configuration to improve coverage metrics has undermined intellectual honesty. An agent that restructures its logging to be more efficient has potentially weakened transparency. An agent that optimizes its error handling might inadvertently suppress escalation events that accountability requires.

No single improvement cycle does anything wrong. Each change passes tests, improves a metric, looks reasonable in isolation. But the cumulative effect — over dozens, hundreds, thousands of cycles — can erode the ethical commitments that were present in the original design. Advisory constraints depend on the system's reasoning to respect them. A system that modifies the code responsible for that reasoning cannot guarantee that the respect will persist.

The only properties you can guarantee will hold across the entire operational trajectory of a self-modifying system are the ones enforced by architecture, not by reasoning. Filesystem permissions that prevent audit log modification enforce transparency regardless of what the agent's code does. Process-level capability dropping that restricts sandbox privileges enforces isolation regardless of what code runs inside. Cryptographic checksums on governance documents enforce separation of powers regardless of how the agent's modification proposals are constructed.

This is the structural argument from Part 1, extended to its ethical implications: for systems that modify their own rules through ordinary operation, the only stable guarantees are structural ones.


The Escalation Test

If you take one thing from this article, I'd suggest this: the most revealing test of a system's ethical design is not what it does when everything goes right. It's what it does when it encounters something it wasn't designed to handle.

Does it silently absorb the problem and keep going? Does it halt without explanation? Or does it stop the specific problematic action, preserve context, and route structured evidence to a human who can make an informed decision?

The answer to that question tells you what the system's builders actually valued — not what they said they valued, not what they wrote in a design document, but what they encoded into the architecture. Because when something goes wrong and the system has to choose between being transparent and being convenient, between being accountable and being autonomous, between being proportional and being aggressive — the architecture is what determines the choice. Not the system prompt. Not the alignment tuning. The architecture.

If you're building an autonomous agent, your ethical commitments live in the architectural decisions you make about failure modes. Make them deliberately. And make them structural.