
Building in Public: What We're Measuring and What We Don't Know Yet

Part 5 of the series: Governing the Agents That Govern Themselves

Jeff Gray · March 10, 2026

Throughout this series, I've shared frameworks, prompts, and architectural patterns for governing autonomous agents. I've been careful, in each article, to distinguish between architectural arguments (which follow from the structural properties of feedback loops) and empirical claims (which require measurement). The architectural arguments stand on their own logic. The empirical questions are open.

This article is about the empirical questions. Specifically: does the two-prompt methodology from Part 2 actually produce useful security analyses? How useful? Compared to what? And how would we know?

The honest answer right now is: we think it works, and we're designing the experiments to find out. Rather than dodge that, I want to make the experimental design itself the subject — because I think the methodology challenges are interesting on their own, and because sharing the design before the results is the kind of intellectual honesty I'd want to see from anyone asking me to trust their tools.


What We're Measuring

The evaluation we're designing runs both prompts — security analysis and adversarial testing specification — against multiple open-source autonomous agent frameworks. Not our own system (which remains proprietary) but publicly available codebases that other researchers and practitioners can independently examine.

For each target framework, we compare the LLM-generated output against an independent manual security review by human experts, along three dimensions.

Specificity. Are the findings grounded in actual code? Does the generated security analysis point to real files, real functions, real configuration values — or does it produce generic boilerplate that could apply to any project? This is the dimension where the structured grounding requirement in the prompt should make the biggest difference, and it's the easiest to measure: you can count how many findings reference specific code artifacts versus how many are generic statements.
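As a rough sketch of how that count could be automated (the regex patterns here are illustrative assumptions, not our actual scoring rubric):

import re

# Illustrative heuristic, not our actual rubric: treat a finding as
# "grounded" if it names a concrete code artifact -- a file path,
# a function reference, or a line number.
ARTIFACT_PATTERNS = [
    re.compile(r"\b[\w./-]+\.(?:py|js|ts|go|rs|java|yaml|toml)\b"),  # file paths
    re.compile(r"\b\w+\(\)"),                                        # function refs
    re.compile(r"\bline \d+\b", re.IGNORECASE),                      # line numbers
]

def specificity_ratio(findings: list[str]) -> float:
    """Fraction of findings that reference at least one concrete artifact."""
    if not findings:
        return 0.0
    grounded = sum(
        1 for f in findings if any(p.search(f) for p in ARTIFACT_PATTERNS)
    )
    return grounded / len(findings)

findings = [
    "Validate all user input before processing.",                # generic
    "build_query() in routes/tasks.py concatenates raw input.",  # grounded
]
print(specificity_ratio(findings))  # 0.5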

Actionability. Could a development team implement the recommendations without further research? If the analysis says "implement input validation on the task API endpoint," that's moderately actionable. If it says "add parameterized query construction to the build_query() function in routes/tasks.py at line 47, replacing the current string concatenation pattern," that's highly actionable. The distinction matters because a security analysis that requires a second round of investigation to figure out what it's actually recommending is a security analysis that often doesn't get acted on.
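To make the contrast concrete, here's what acting on that highly actionable finding might look like. The build_query() function and its schema are hypothetical, taken from the example above, and the sketch assumes a SQLite-style driver:

import sqlite3

# Hypothetical vulnerable pattern from the example finding: user input
# concatenated directly into SQL, so a crafted task_id can inject SQL.
def build_query_before(task_id: str) -> str:
    return "SELECT * FROM tasks WHERE id = '" + task_id + "'"

# The remediation the actionable finding spells out: a parameterized
# query, where the driver binds the value instead of splicing text.
def build_query_after(conn: sqlite3.Connection, task_id: str):
    return conn.execute("SELECT * FROM tasks WHERE id = ?", (task_id,))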

Coverage. What proportion of the threats identified by the human expert were also identified by the LLM? This is the hardest dimension to measure, and the one where the methodological challenges are most interesting.


Why Coverage Is Hard to Measure

The obvious approach is: have a security expert review the codebase, have the LLM review the codebase, and compare the two lists of findings. Count how many expert findings the LLM also caught (true positives), how many expert findings the LLM missed (false negatives), and how many LLM findings the expert didn't flag (either false positives or things the expert missed).
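In set terms, the bookkeeping is simple; a sketch, assuming findings have already been matched across the two lists, which is the genuinely hard part:

def coverage_counts(expert: set[str], llm: set[str]) -> dict[str, float]:
    """Compare matched finding IDs from an expert review and an LLM review.

    Assumes the matching step -- deciding that two differently worded
    findings describe the same threat -- has already been done by hand.
    """
    true_positives = expert & llm    # expert findings the LLM also caught
    false_negatives = expert - llm   # expert findings the LLM missed
    llm_only = llm - expert          # false positives, or expert misses
    return {
        "coverage": len(true_positives) / len(expert) if expert else 0.0,
        "missed": len(false_negatives),
        "llm_only": len(llm_only),
    }

print(coverage_counts({"T1", "T2", "T3", "T4"}, {"T2", "T3", "T5"}))
# {'coverage': 0.5, 'missed': 2, 'llm_only': 1}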

The problem is that "expert security review" is not a fixed benchmark. Two experienced security engineers reviewing the same codebase will produce overlapping but not identical findings. They'll prioritize differently based on their experience, their threat models, and their assumptions about the deployment context. One might flag a dependency vulnerability the other considered low-priority. One might identify a subtle privilege escalation path the other missed entirely.

This means "ground truth" is fuzzy. We're not comparing against an answer key; we're comparing against another subjective assessment — a more informed and experienced one, but still a judgment. Our evaluation design accounts for this by using multiple independent reviewers and measuring inter-rater agreement as a baseline for how much variation is inherent in the task itself. If two human experts agree on 70% of findings, then expecting the LLM to match either expert beyond 70% isn't a reasonable bar.
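One way to operationalize that baseline, as a sketch: measure pairwise agreement among the experts, then judge the LLM relative to that ceiling. Jaccard overlap is an assumption here, not a settled metric choice:

from itertools import combinations

def jaccard(a: set[str], b: set[str]) -> float:
    """Overlap between two reviewers' matched finding sets."""
    return len(a & b) / len(a | b) if (a | b) else 1.0

def human_baseline(expert_reviews: list[set[str]]) -> float:
    """Mean pairwise agreement among independent expert reviewers."""
    pairs = list(combinations(expert_reviews, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

# Toy data: three experts who each agree on half of each other's findings.
experts = [{"T1", "T2", "T3"}, {"T2", "T3", "T4"}, {"T1", "T2", "T4"}]
baseline = human_baseline(experts)           # 0.5 in this toy case
llm = {"T2", "T3"}
relative = jaccard(llm, experts[0]) / baseline  # score against the human ceiling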


What Else Complicates the Picture

The quality of LLM output depends on variables that are hard to control across experiments.

Model. Different language models produce different quality output from the same prompt. A prompt that produces excellent results with one model may produce mediocre results with another. Our evaluation runs the prompts against the same targets with multiple models to characterize this variability, but this also means the results will be specific to the models we test — not universal claims about "LLM-assisted security analysis" as a category.

Context window. The prompts work by giving the LLM access to the target codebase and asking it to reason across the entire system. For a 5,000-line agent framework, this fits comfortably in current context windows. For a 500,000-line production system, it doesn't. The methodology may work well at one scale and fail at another. We're testing against codebases of varying sizes to characterize where the quality degrades.
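A back-of-envelope for where a codebase stops fitting, as a sketch; the roughly-four-characters-per-token ratio and the file suffixes are assumptions:

from pathlib import Path

CHARS_PER_TOKEN = 4  # rough average for source code; an assumption

def estimated_tokens(repo_root: str, suffixes=(".py", ".md", ".toml")) -> int:
    """Very rough token estimate for feeding a whole repo into a prompt."""
    total_chars = sum(
        len(p.read_text(errors="ignore"))
        for p in Path(repo_root).rglob("*")
        if p.suffix in suffixes and p.is_file()
    )
    return total_chars // CHARS_PER_TOKEN

# A 5,000-line framework at ~40 chars/line is ~50k tokens: comfortable.
# A 500,000-line system at the same density is ~5M tokens: far past
# current windows, so the methodology must chunk or summarize instead.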

Target complexity. Some autonomous agent frameworks have straightforward architectures with clear trust boundaries. Others have complex, poorly documented interactions between components. The prompts may perform well against well-structured codebases (where the LLM can reason clearly about component boundaries) and poorly against tangled ones (where the security-relevant interactions are hidden in implementation details). This is a hypothesis, not a finding — but it's the kind of hypothesis the evaluation is designed to test.


What We Expect to Find (Honestly)

Based on our qualitative experience developing these prompts against our own system, here's what we expect — stated as hypotheses, not claims.

We expect the LLM to excel at systematic coverage — producing analyses that consistently touch every threat category in the STRIDE framework, every ingress and egress point in the attack surface, every trust boundary crossing in the architecture. Human reviewers tend to prioritize based on intuition and experience, which means they go deep on the threats they consider most likely but sometimes skip categories they consider less relevant. The LLM doesn't deprioritize; it's methodical. This should produce more consistent, if shallower, coverage.
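That systematic-coverage hypothesis is testable with simple bookkeeping: a matrix of STRIDE categories against entry points, scored for whether the analysis touched each cell. The entry points below are placeholders, not drawn from any real target:

STRIDE = [
    "Spoofing", "Tampering", "Repudiation",
    "Information Disclosure", "Denial of Service", "Elevation of Privilege",
]

# Placeholder entry points for illustration; in the evaluation these come
# from the target framework's documented attack surface.
ENTRY_POINTS = ["task API", "agent message bus", "config loader"]

def coverage_matrix(findings: list[dict]) -> dict[tuple[str, str], bool]:
    """Mark each (category, entry point) cell touched by any finding.

    Assumes each finding was hand-labeled with a STRIDE category and an
    entry point during review.
    """
    matrix = {(c, e): False for c in STRIDE for e in ENTRY_POINTS}
    for f in findings:
        matrix[(f["category"], f["entry_point"])] = True
    return matrix

# The hypothesis: the LLM fills more cells than a human reviewer does,
# even if its per-cell findings are shallower.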

We expect the LLM to be weaker at novel attack paths — threats that emerge from unexpected interactions between components, or from reasoning about how a specific deployment context creates risks that wouldn't exist in a different context. These require the kind of creative, experience-informed reasoning that current models can approximate but don't reliably produce. A human security engineer who's seen a privilege escalation chain in a similar architecture before will recognize the pattern. The LLM may or may not.

We expect the LLM to struggle with risk calibration — assessing which threats are actually severe in a specific deployment versus which are theoretically possible but practically unlikely. A finding that rates a local privilege escalation as "Critical" is technically defensible but practically misleading if the system runs in a container with no network access. Context-dependent risk calibration requires understanding the deployment environment in ways that a codebase review alone doesn't provide.
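In caricature, the context-sensitive adjustment a human reviewer applies implicitly might look like the sketch below; the deployment flags and the downgrade rule are invented for illustration:

from dataclasses import dataclass

@dataclass
class Deployment:
    # Invented context flags; a real model of deployment context would be
    # much richer than two booleans.
    containerized: bool
    network_access: bool

SEVERITIES = ["Low", "Medium", "High", "Critical"]

def calibrate(finding_severity: str, threat: str, ctx: Deployment) -> str:
    """Downgrade a local privilege escalation when the blast radius is
    contained -- the judgment call a codebase-only review can't make."""
    level = SEVERITIES.index(finding_severity)
    if (threat == "local_privilege_escalation"
            and ctx.containerized and not ctx.network_access):
        level = max(level - 2, 0)  # arbitrary downgrade, for illustration
    return SEVERITIES[level]

print(calibrate("Critical", "local_privilege_escalation",
                Deployment(containerized=True, network_access=False)))
# "Medium": defensible on paper, calibrated to the actual deployment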

These are hypotheses. The evaluation is designed to test them. If we're wrong, we'll say so.


Why We're Sharing This Before the Results

There's a norm in some parts of the AI community of waiting until you have clean numbers before publishing anything. I understand the impulse: premature claims are everywhere, and the field doesn't need more hype.

But there's also a cost to waiting. The prompts are available now. The governance framework is available now. Teams are deploying autonomous agents now, and some of those teams could use these tools today, even without the empirical validation. We'd rather share the tools with an honest "we don't know the limits yet" than sit on them until we have a tidy paper.

Sharing the experimental design before the results also invites scrutiny. If there are flaws in how we're measuring — if our coverage metric is biased, if our expert reviewers aren't independent enough, if our codebase sample isn't representative — we'd rather hear about it before we run the experiments than after we've published the results. Methodology that's been stress-tested by the community is more credible than methodology that's been reviewed only by its authors.

So here's where things stand. The frameworks are published. The prompts are published. The evaluation is designed. The results are coming. In the meantime, the tools are free, and if you find them useful — or find their limits — I'd like to hear about it.