Two Prompts That Bootstrap Your Agent's Security Review

Part 2 of the series: Governing the Agents That Govern Themselves

Jeff Gray · March 10, 2026

You've built an autonomous agent. It reads code, modifies files, calls APIs, maybe runs tests, maybe even commits changes. You know it needs a security review. The question is how.

You could hire a penetration testing firm. That's expensive, slow, and gives you a point-in-time snapshot that starts decaying the moment the agent modifies anything — which, for an autonomous agent, is the next cycle. You could run a SAST scanner. That's fast and cheap and will find the easy stuff — known vulnerability patterns, dependency issues, obvious misconfigurations. It won't reason about your threat model.

Or you could try something that didn't exist two years ago: hand the codebase to a large language model and ask it to reason about what can go wrong.

That's what we did. And the part I want to share isn't just that it worked (with caveats I'll be honest about), but the specific prompts we built — because the design decisions baked into those prompts are what made the difference between generic boilerplate and output we could actually use.

The prompts are free. You can use them today. Here's what they do and why they're built the way they are.


The Two-Prompt Pipeline

The methodology is a two-step chain. Each prompt does a distinct job, and the second one consumes the output of the first.

Prompt 1 — Security Analysis. You give it access to your target system's codebase and ask it to produce a structured security analysis. It generates a threat model with trust boundaries, data flow analysis, STRIDE classification, attack surface mapping, concrete threat vectors with specific attack paths, a mitigation stack with tool recommendations, incident response runbooks for high-priority threats, and a data loss prevention analysis. The key word in all of that is "structured" — every section has a defined format, and every finding must reference specific files, classes, endpoints, or configuration values from the actual repository.

Prompt 2 — Adversarial Testing Specification. You give it the same codebase plus the security analysis from Prompt 1, and it generates a complete blueprint for an adversarial testing system. Not just "here are some test cases." A full specification: isolation architecture, an oracle pattern for determining whether a defense held, 2–5 scenarios per threat vector with specific probes and expected outcomes, mock and fixture architectures, execution models, report security constraints, and operational principles. It designs the system that would verify your defenses actually work — not just that they're configured.

The second prompt depends on the first because the security analysis identifies what can go wrong, and the adversarial spec designs the system that verifies your defenses against those specific threats. Running the second prompt without the first produces vague, generic output. Running them in sequence produces output that's grounded in your actual architecture.
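
To make the chaining concrete, here's a minimal sketch in Python. Everything in it is an assumption on my part: `call_llm()` stands in for whatever model API you use, the prompt filenames are placeholders, and a real codebase will need chunking or a repo-aware harness rather than naive concatenation.

```python
from pathlib import Path

def call_llm(system_prompt: str, user_content: str) -> str:
    """Stand-in for your model API of choice."""
    raise NotImplementedError

def load_codebase(root: str) -> str:
    """Naive context packing: concatenate source files.
    A real run needs chunking once the repo exceeds the context window."""
    exts = {".py", ".ts", ".go", ".yaml", ".toml"}
    parts = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix in exts:
            parts.append(f"--- {path} ---\n{path.read_text(errors='ignore')}")
    return "\n".join(parts)

codebase = load_codebase("path/to/agent/repo")

# Step 1: security analysis, grounded in the actual repository.
analysis = call_llm(
    system_prompt=Path("prompt_1_security_analysis.md").read_text(),
    user_content=codebase,
)

# Step 2: adversarial testing spec. It consumes the codebase AND the
# Step 1 output; run alone, it produces vague, generic results.
adversarial_spec = call_llm(
    system_prompt=Path("prompt_2_adversarial_spec.md").read_text(),
    user_content=f"{codebase}\n\n=== SECURITY ANALYSIS ===\n{analysis}",
)
```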


The Design Decision That Matters Most

If I had to pick the single design choice that most affects output quality, it's this: the structured grounding requirement.

Both prompts contain explicit instructions that every finding, every threat, every scenario must reference specific code artifacts — actual file paths, class names, function signatures, configuration values, API endpoints — from the repository the LLM is analyzing. Not generic descriptions. Not "the system may be vulnerable to SQL injection." Instead: "the /api/tasks endpoint in routes/tasks.py accepts user-provided filter parameters that are passed to build_query() at line 47 without parameterized query construction, creating an injection path."

This sounds like a small thing. It's not. Without this constraint, LLMs produce security analyses that read like textbook chapters — technically correct, generically applicable, practically useless. They'll tell you about OWASP Top 10 categories and suggest you implement input validation. You already knew that. What you needed was someone to look at your code and tell you where your specific vulnerabilities are.

The grounding requirement forces the LLM to anchor its reasoning in the actual codebase. It can still produce false positives (it will flag things that aren't real vulnerabilities in context), and it can still miss things (especially emergent interactions between components). But every finding it produces is at least pointing at a real piece of code you can go look at. That's the difference between a report that sits in a drawer and one that generates actionable work items.
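
For contrast, here's roughly the code shape that hypothetical finding points at, and the parameterized fix a grounded recommendation would suggest. The endpoint and function names mirror the invented example above, not any real repository.

```python
# Vulnerable shape: user-controlled filter values interpolated into SQL.
def build_query(filters: dict) -> str:
    clauses = [f"{col} = '{val}'" for col, val in filters.items()]
    return "SELECT * FROM tasks WHERE " + " AND ".join(clauses)
    # filters={"status": "x' OR '1'='1"} turns the WHERE clause into a tautology

# Grounded fix: parameterized construction plus an allowlist of columns.
ALLOWED_COLUMNS = {"status", "owner", "priority"}

def build_query_safe(filters: dict) -> tuple[str, list]:
    cols = [c for c in filters if c in ALLOWED_COLUMNS]
    where = " AND ".join(f"{c} = ?" for c in cols)
    return "SELECT * FROM tasks WHERE " + where, [filters[c] for c in cols]
```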


What the Security Analysis Prompt Produces

The prompt is organized into nine sections, each with a specific purpose. I'll walk through the structure rather than reproduce the full text here — the complete prompt is published as an appendix to the companion research paper and will be available on GitHub.

The executive summary produces a ranked table of the top risk areas with impact and likelihood ratings. This is the part your engineering lead reads to decide how worried to be.

The system threat model includes a trust boundary map (which components are fully trusted, conditionally trusted, untrusted, or isolated), a data flow diagram marking every trust boundary crossing, and a STRIDE classification table that fills in concrete threats for each component — not generic descriptions but specific attack scenarios referencing actual code.

The attack surface map catalogs every ingress and egress point — API endpoints, file uploads, webhooks, polled sources, database connections, message queues — with the protocol, authentication mechanism, and risk level for each.

The threat vector enumeration is the core of the report. For each significant threat, the prompt requires a concrete attack path ("attacker sends X to endpoint Y, which passes through function Z without validation"), the potential impact, existing defenses, and recommended mitigations with specific tool recommendations.
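
As an illustration of what "structured" means in practice, a single entry might carry fields like these. The schema is my own sketch, not the format the prompt actually specifies.

```python
from dataclasses import dataclass

@dataclass
class ThreatVector:
    # Every field must reference real artifacts in the analyzed repo.
    threat_id: str
    stride_category: str           # e.g. "Tampering"
    attack_path: str               # "attacker sends X to endpoint Y, which..."
    affected_artifacts: list[str]  # file paths, classes, endpoints
    impact: str
    existing_defenses: list[str]
    mitigations: list[str]         # with specific tool recommendations

example = ThreatVector(
    threat_id="TV-03",
    stride_category="Tampering",
    attack_path="Unvalidated filter params flow from /api/tasks into build_query()",
    affected_artifacts=["routes/tasks.py", "build_query()"],
    impact="SQL injection against the task store",
    existing_defenses=[],
    mitigations=["Parameterized query construction", "Semgrep rule in pre-commit"],
)
```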

The mitigation stack consolidates tool-specific recommendations: pre-commit hooks, CI pipeline additions, runtime protections, monitoring and alerting, DLP measures. Each recommendation references the specific threat vector it addresses.

Finally, incident response runbooks for the highest-priority threat vectors lay out step-by-step procedures: detection signals, containment steps, evidence preservation, and recovery.

The design principles embedded in this structure — STRIDE as an organizing framework, multi-layer mitigation, structured data flow analysis — are drawn from established security analysis methodology. The contribution isn't inventing new security concepts; it's encoding them into a prompt structure that consistently produces grounded, project-specific output from an LLM.


What the Adversarial Testing Specification Produces

The second prompt generates the blueprint for a system that would test whether your defenses actually hold — not just that they're configured. (I'll go deeper on why that distinction matters in Part 3 of this series.)

The specification it produces includes an isolation architecture with requirements for physical separation (separate infrastructure), logical separation (separate repositories, credentials, networks), and institutional separation (the testing system must be invisible to the target). It defines an oracle pattern where every probe declares an expected outcome and verification checks multiple independent signals — not just "did the attack get blocked?" but also "did the system log the attempt?" and "did it escalate to a human?"
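
Here's a minimal sketch of that oracle pattern, assuming nothing beyond what's described in this section; the names and structure are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Literal

Outcome = Literal["block", "allow", "detect"]

@dataclass
class Probe:
    """One adversarial probe: a declared expectation plus independent checks."""
    name: str
    attack: Callable[[], str]                # sends the probe, returns the interface response
    expected: Outcome
    signal_checks: list[Callable[[], bool]]  # e.g. log entry written, human escalation fired,
                                             # target state unchanged

def run_probe(probe: Probe) -> dict:
    response = probe.attack()
    # Multi-signal verification: every independent signal must agree
    # before the defense is credited with a pass.
    signals = [check() for check in probe.signal_checks]
    verdict = "pass" if all(signals) else "fail"
    # Never auto-remediate: report the finding; fixing is a human decision.
    return {"probe": probe.name, "expected": probe.expected,
            "response": response, "signals": signals, "verdict": verdict}
```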

For each threat vector from the security analysis, it generates 2–5 adversarial scenarios with specific probes, expected outcomes (block, allow, or detect), and the verification signals to check. It designs the mock and fixture architecture that lets the testing system simulate attacks without risking real damage. And it specifies operational principles that prevent the testing system from doing more harm than good — most importantly, the "never auto-remediate" constraint, which ensures the testing system reports vulnerabilities but never attempts to fix them.

Five architectural principles are baked into the prompt as structural requirements of the generated output, not optional guidelines:

- One-way visibility: the testing system can observe the target, but the target must have no awareness of the testing system's existence.
- Oracle-based multi-signal verification: every probe checks at least three independent signals.
- Neutral naming: adversarial infrastructure uses names that don't reveal its purpose (because the target may read its own environment).
- Never auto-remediate: the testing system reports, it never fixes.
- Three-signal verification: interface response, escalation event, and state change must all agree for a passing verdict.


What We Don't Know

I need to be direct about this, because it's the thing I'd want to know if I were reading someone else's article about their methodology.

We have not completed a systematic comparison of LLM-generated security analyses against expert manual review. The prompts were developed iteratively against a production system and refined based on what worked. But "what worked" is a qualitative judgment — our team looking at the output and deciding it was useful, specific, and catching real issues — not a measured result with controlled comparisons.

We're designing that empirical evaluation now. The plan is to run both prompts against multiple open-source autonomous agent frameworks, have independent security experts review the same codebases, and compare the outputs along three dimensions: specificity (are findings grounded in actual code?), actionability (could a team implement the recommendations without further research?), and coverage (what proportion of expert-identified threats does the LLM catch?).

Until those results are in, here's the honest framing: these are bootstrapping tools that give you structured coverage where you'd otherwise have nothing — or where you'd have expensive, slow, point-in-time coverage from a pen test engagement. At minimum, the prompts produce a starting point that's more systematic than staring at your codebase and wondering where the vulnerabilities are. At best, they catch real issues that would have gone unnoticed. Where they fall between those two outcomes is what we're measuring.

The prompts are freely available. If you try them on your own system, I'd genuinely like to hear what you find.