Your Agent's System Prompt Is Not a Governance Framework

Part 1 of the series: Governing the Agents That Govern Themselves

Jeff Gray · March 10, 2026

Here's a question most teams deploying autonomous agents haven't asked carefully enough: what happens when your agent, operating autonomously over many cycles, makes a series of individually reasonable decisions that collectively erode a property you considered foundational?

Not a jailbreak. Not a prompt injection. Not an adversarial attack. Just a self-improving system doing exactly what it was designed to do — optimizing — and gradually drifting away from constraints that were expressed as suggestions rather than enforced as structure.

I've been building an autonomous coding agent, and this problem kept me up at night long before any of the fashionable threat vectors did. The agent reads code, generates modifications, runs tests, evaluates results, and iterates. Each cycle is individually reasonable. Each decision locally defensible. And over enough cycles, the cumulative effect of all those reasonable decisions can quietly erode properties I considered non-negotiable — auditability, isolation, human oversight — without any single cycle doing anything that looks wrong.

This is the governance problem, and most of the tools we currently reach for don't actually solve it.


What We Currently Reach For (and Why It's Not Enough)

Most teams I've talked to rely on some combination of three approaches. Each has real strengths. None provides what I'm going to argue you actually need.

System prompts are advisory. They express rules in natural language and depend on the agent's language model to respect them across every decision point in every cycle. This works until it doesn't. A system prompt that says "never modify the configuration file" has no enforcement mechanism if the agent generates a plan whose execution path modifies that file as a side effect of an otherwise legitimate operation. The prompt didn't anticipate the indirect path, and the agent isn't being adversarial — it's just optimizing.

Alignment tuning is probabilistic. Fine-tuning and RLHF shift the model's behavioral distribution toward desired outcomes, but they don't provide mechanistic guarantees. An aligned model is less likely to take harmful actions. It is not structurally unable to take them. For an agent operating autonomously over extended periods, that's the distinction between a safety property that degrades under edge cases and one that holds invariantly.

Kill switches are reactive. They detect that something bad has happened (or is happening) and terminate the agent. This prevents ongoing damage but doesn't prevent the initial harmful action, doesn't preserve the agent's operational context for analysis, and doesn't give the human operator the structured evidence they need to understand what went wrong. A kill switch answers "how do we stop the agent?" but not the more important question: "how do we ensure the agent cannot take this action in the first place?"

These aren't bad tools. They're incomplete tools. The gap they leave open is structural.


The Constitutional Analogy (It's More Precise Than You'd Think)

Constitutions in political systems exist not because legislatures are malicious, but because iterative decision-making — even by well-intentioned actors operating within legitimate processes — can erode foundational guarantees without a structural backstop.

Think about a legislative body with the power to amend its own founding charter by simple majority vote. Every guarantee in that charter — freedom of expression, due process, limits on executive power — exists only as long as a current majority chooses to preserve it. One legislative session, responding to immediate pressures, could remove protections that took generations to establish.

Constitutional entrenchment is the mechanism that transforms contingent guarantees into structural ones. Supermajority requirements, separation of powers, judicial review — these create properties of the system that survive the iterative process of ordinary governance. They share a defining characteristic: they operate external to the ordinary decision-making process. A legislature cannot remove a constitutional constraint through the same process it uses to pass ordinary legislation, because the constraint is enforced at a different institutional layer.

Now consider your autonomous agent. Each cycle, it receives a task, reads the codebase, generates modifications, tests them, and integrates the results. Without structural constraints, any property of the system — including properties that make it safe — is subject to modification through this ordinary iterative process. The agent doesn't need to be adversarial for safety to erode. It needs only to optimize toward a local objective in a way that inadvertently degrades a safety property that wasn't part of the objective function.

An agent tasked with improving test coverage might discover it can achieve higher coverage numbers by modifying the test configuration to exclude modules that are hard to test. It hasn't violated any instruction. It was told to improve coverage, and it found an efficient way to do so. But it degraded a property — comprehensive testing of all modules — that the coverage metric was intended to represent. Advisory constraints are brittle against this kind of indirect erosion, because the agent's plan may not include an explicit step labeled "modify test configuration." The modification emerges as a side effect of a more complex optimization path.

The parallel to autonomous agents is precise, not metaphorical. What I'm proposing is that the same structural insight — foundational properties must be enforced by mechanisms external to the decision-making process they constrain — applies directly to how we build autonomous systems.


What Structural Governance Actually Looks Like

The central principle is this: constitutional properties of autonomous agents must be enforced by mechanisms that the agent's generated code cannot reach, disable, or circumvent, regardless of what that code does.

Filesystem permissions, network namespace isolation, process-level capability restrictions, cryptographic verification — these operate at layers that are external to the agent's code generation and execution pipeline. They provide the structural independence that constitutional enforcement requires.

While developing a governance framework for our own system, we arrived at five categories of structural constraint. I'm sharing them here not as the only valid decomposition, but as a practical checklist — a way of thinking about whether your system has the structural coverage it needs.

Integrity constraints ensure that the governance framework itself can't be modified by the agent. If the agent can edit the rules that govern it, you don't have a constitution — you have a suggestion. In practice, this means governance documents, escalation configurations, and audit logs live in locations the agent's process literally cannot write to. Filesystem permissions, not instructions.
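
To make that concrete, here's a minimal sketch in Python of the kind of startup check an agent process might run against its own governance paths. The paths and the function name are illustrative, and the check is not the enforcement mechanism itself; the real guarantee comes from ownership and permissions the operator sets outside the agent's reach.

```python
import os
from pathlib import Path

# Illustrative locations. In practice these are owned by a separate account
# (root or a dedicated governance user), and the agent runs as an
# unprivileged user with no write permission to any of them.
GOVERNANCE_PATHS = [
    Path("/etc/agent/governance.yaml"),
    Path("/etc/agent/escalation.yaml"),
    Path("/var/log/agent/audit/"),
]

def assert_governance_is_out_of_reach() -> None:
    """Fail fast if the agent's own process could write to a governance path.

    A sanity check, not the enforcement itself: the actual guarantee comes
    from filesystem ownership and permissions set outside the agent.
    """
    writable = [str(p) for p in GOVERNANCE_PATHS if os.access(p, os.W_OK)]
    if writable:
        raise RuntimeError(
            f"Governance paths writable by agent process: {writable}. Refusing to start."
        )

if __name__ == "__main__":
    assert_governance_is_out_of_reach()
```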

Isolation constraints ensure that the agent's generated code runs in an environment where damage is bounded. Sandboxed execution with dropped capabilities, restricted network namespaces, read-only filesystem mounts except for designated output directories. The sandbox doesn't trust the code inside it, and the code inside it can't escape the sandbox — not because it's told not to, but because the operating system won't let it.
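
As a rough sketch, assuming Docker as the sandbox runtime (any container or VM layer with equivalent controls would do), launching generated code might look like this:

```python
import subprocess

def run_generated_code_sandboxed(workdir: str, output_dir: str) -> subprocess.CompletedProcess:
    """Run agent-generated code inside a locked-down container.

    Layering: no network, all capabilities dropped, read-only root filesystem,
    and a single designated writable output mount.
    """
    cmd = [
        "docker", "run", "--rm",
        "--network", "none",               # no network access at all
        "--cap-drop", "ALL",               # drop every Linux capability
        "--security-opt", "no-new-privileges",
        "--read-only",                     # read-only root filesystem
        "--memory", "512m",                # bound memory consumption
        "--pids-limit", "128",             # bound process count
        "-v", f"{workdir}:/work:ro",       # code under test, read-only
        "-v", f"{output_dir}:/out:rw",     # the only writable location
        "python:3.12-slim",
        "python", "/work/main.py",
    ]
    # Wall-clock timeout enforced by the parent; raises TimeoutExpired if exceeded.
    return subprocess.run(cmd, capture_output=True, text=True, timeout=300)
```

The image name, paths, and limits are placeholders; the point is that every restriction is applied by the runtime before the generated code executes, not by anything the code is asked to respect.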

Quality constraints ensure that every output the agent produces passes verification before it's integrated. Mandatory test execution with coverage thresholds, static analysis gates, lint checks. These aren't optional steps the agent can skip if it decides they're unnecessary. They're structural gates in the pipeline — no output moves forward without passing them.
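
A minimal sketch of such a gate, assuming pytest with pytest-cov and ruff as the verification tooling; substitute your own test runner and linters. The structural point is that the integration step calls this function unconditionally, and the agent has no code path that skips it.

```python
import subprocess

def quality_gate(repo_dir: str, min_coverage: int = 85) -> bool:
    """Return True only if every verification step passes.

    Called by the integration pipeline on every cycle; a failing gate means
    the agent's output is never merged, regardless of what the agent decides.
    """
    checks = [
        ["ruff", "check", "."],                                   # static analysis / lint gate
        ["pytest", "--cov", f"--cov-fail-under={min_coverage}"],  # tests plus coverage threshold
    ]
    for cmd in checks:
        result = subprocess.run(cmd, cwd=repo_dir)
        if result.returncode != 0:
            return False
    return True
```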

Resource constraints put hard limits on what the agent can consume. Bounded cycle counts, maximum file sizes, rate limits on API calls, time limits on individual operations. These prevent an agent from consuming unbounded resources during optimization and ensure that runaway processes terminate predictably.
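
Here's a small sketch of what OS-level caps can look like in Python, using the standard resource module plus a wall-clock timeout enforced by the parent process. The specific numbers are placeholders, not recommendations.

```python
import resource
import subprocess

MAX_CYCLES = 50  # illustrative hard cap on improvement cycles per run

def limit_child_resources() -> None:
    """Applied in the child process just before exec: hard OS-level limits."""
    resource.setrlimit(resource.RLIMIT_CPU, (60, 60))                    # 60 s of CPU time
    resource.setrlimit(resource.RLIMIT_FSIZE, (10_000_000, 10_000_000))  # ~10 MB max file size
    resource.setrlimit(resource.RLIMIT_NPROC, (64, 64))                  # cap spawned processes

def run_cycle_step(cmd: list[str]) -> subprocess.CompletedProcess:
    """Run one step of a cycle under resource caps and a wall-clock timeout."""
    return subprocess.run(
        cmd,
        preexec_fn=limit_child_resources,  # POSIX only
        timeout=120,                       # raises TimeoutExpired if exceeded
        capture_output=True,
    )
```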

Oversight constraints ensure that human operators maintain visibility into what the agent is doing. Immutable, append-only audit logs that the agent cannot modify or delete. Structured escalation channels that route anomalies to humans with evidence packages. The agent can't suppress its own operational history, and it can't silently absorb errors that a human should review.
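
One way to sketch the audit side in Python is an append-only, hash-chained log, so that edited or deleted history is detectable. The path is a placeholder, and hash chaining only makes tampering evident; preventing deletion outright still depends on the filesystem (append-only attributes) or on shipping records to a host the agent cannot reach.

```python
import hashlib
import json
import os
import time

AUDIT_LOG = "/var/log/agent/audit/events.jsonl"  # placeholder path

def append_audit_event(event: dict, prev_hash: str) -> str:
    """Append one tamper-evident record and return its hash for chaining.

    Each record embeds the hash of the previous record, so any edit or
    deletion in the history breaks the chain on the next verification pass.
    """
    record = {"ts": time.time(), "prev": prev_hash, "event": event}
    payload = json.dumps(record, sort_keys=True)
    digest = hashlib.sha256(payload.encode()).hexdigest()
    # O_APPEND ensures every write lands at the end of the file,
    # even if multiple processes write concurrently.
    fd = os.open(AUDIT_LOG, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o640)
    with os.fdopen(fd, "w") as f:
        f.write(payload + "\n")
    return digest
```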

Each of these categories should be enforced at multiple independent layers — filesystem, network, process, application — so that no single point of failure compromises the governance model. If your sandbox isolation depends on a single Docker configuration flag, one misconfiguration defeats it. If it depends on dropped capabilities and network namespace isolation and read-only mounts and process-level restrictions, an attacker (or an inadvertent optimization) needs to defeat all four.


What This Is (and Isn't)

I want to be direct about what I'm arguing and what I'm not.

I'm arguing that structural enforcement is categorically different from advisory constraints for self-modifying systems. This is an architectural argument. It follows from the properties of feedback loops — not from measurement. If a system can modify its own rules through its ordinary operation, advisory rules provide no stable foundation. This is true regardless of how good the underlying model is, how careful the system prompt is, or how well-aligned the training was.

I'm not arguing that the five-category taxonomy I described is the only valid decomposition. It's the one we developed for our system. Other decompositions might work as well or better for yours. The important thing is that whatever categories you choose, each one is enforced structurally — not by asking the agent to comply.

I'm also not arguing that our specific implementation is better than anyone else's. The framework was developed for a specific production system whose details remain proprietary. What I'm sharing is the generalized architecture, because I think the structural reasoning is useful regardless of what you're building.

If you're deploying an autonomous agent today, here's the practical takeaway: walk through those five categories and ask, for each one, whether the property is enforced by something the agent cannot reach. If the answer is "the system prompt tells it not to" or "the model is aligned to avoid it," you have an advisory constraint, not a structural one. That might be fine for your use case. But if the property matters enough that you'd be in trouble if it eroded over time — and you're running a self-modifying system that operates autonomously across many cycles — advisory isn't enough.

The engineering is harder. But it's the kind of hard that lets you sleep at night.