Research Companion Guide

A Plain-Language Guide to Methods and Terminology in Our Engineering Studies

February 2026, v2.0

This guide explains the statistical methods, terminology, and conventions used across EngramForge engineering studies. If you encounter an unfamiliar term while reading one of our papers, you'll find a plain-language explanation here.


Statistical Foundations

Mean

The arithmetic mean is the average of a set of values — add them all up, divide by how many there are.

Example: If a model scores 4, 5, and 5 across three runs, the mean is (4 + 5 + 5) ÷ 3 = 4.67.

We report mean scores because a single run can be misleading. Running the same test multiple times and averaging gives a more reliable picture of typical performance.


Standard Deviation

Standard deviation (often abbreviated std or σ) measures how spread out the results are from the mean. A small standard deviation means the results are consistent; a large one means they vary a lot between runs.

Example: Scores of [4.5, 4.6, 4.7] have a small std (~0.1) — very consistent. Scores of [3.0, 5.0, 4.5] have a larger std (~1.0) — much more variable.

In our studies, standard deviation tells you how much to trust a mean score. A model scoring 4.5 ± 0.1 is more reliably "4.5" than one scoring 4.5 ± 1.5.
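
The two examples above can be reproduced with Python's standard library (a minimal sketch; `statistics.stdev` computes the sample standard deviation, which is one common convention):

```python
import statistics

consistent = [4.5, 4.6, 4.7]
variable = [3.0, 5.0, 4.5]

# Sample standard deviation (n - 1 in the denominator).
std_consistent = statistics.stdev(consistent)  # ≈ 0.1
std_variable = statistics.stdev(variable)      # ≈ 1.04
```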


Confidence Interval

A confidence interval (CI) is a range that likely contains the true average performance. A 95% confidence interval means: if we repeated this experiment many times, 95% of the intervals we compute would contain the true value.

Example: "4.67 ± 0.58 (95% CI: [4.01, 5.00])" means we're fairly confident the model's true average score falls between 4.01 and 5.00.

Wider intervals mean more uncertainty (usually from small sample sizes or high variance). Narrower intervals mean more precise estimates.
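
The example interval can be reconstructed with the normal approximation described later in this guide (a sketch; clipping the upper bound at the 5-point scale maximum is our assumption about how the reported interval was produced):

```python
import math

scores = [4, 5, 5]
n = len(scores)
mean = sum(scores) / n
s = math.sqrt(sum((x - mean) ** 2 for x in scores) / (n - 1))  # sample std

half_width = 1.96 * s / math.sqrt(n)  # normal-approximation 95% CI
lower = mean - half_width             # ≈ 4.01
upper = min(5.0, mean + half_width)   # clipped at the scale max: 5.00
```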


Sample Size (n)

The sample size, written as n, is simply how many times we ran a test. Larger sample sizes produce more reliable statistics.

Example: "n=3" means we ran the test three times. "n=10" would give us more confidence in the results, but costs more in API calls.

Our studies typically use n=2 to n=5 per condition. We always disclose n so readers can judge reliability for themselves. Results with n=2 are directional; results with n=5+ are more trustworthy.


Statistical Significance

A result is statistically significant when it's unlikely to have occurred by chance. In formal research, this is determined by statistical tests that produce a p-value.

Most of our pilot studies have small sample sizes (n=2–5), so we generally cannot claim statistical significance. We use phrases like "directional," "preliminary," or "observed trend" instead. When we do have enough data, we report formal tests.


Significance Level (α)

The significance level (alpha, written α) is the threshold you choose before running a test. If the p-value falls below α, you call the result "statistically significant."

Example: With α = 0.05 (the most common choice), a p-value of 0.003 is significant (0.003 < 0.05) but a p-value of 0.069 is not (0.069 > 0.05).

The α = 0.05 convention means you accept a 5% chance of falsely claiming significance. Some fields use stricter thresholds (α = 0.01) for higher confidence.


Variance

Variance is the square of the standard deviation. It measures how spread out data points are from the mean. In everyday language, "high variance" means "inconsistent results."

We often use "variance" informally to mean "the results were all over the place." When used technically, it's a specific number (σ²). When we say a model "shows high variance," we mean its scores differ substantially between runs.


High-Variance Result

A result where the standard deviation exceeds 25% of the mean. We flag these in our studies because they're less reliable.

Example: A score of 4.0 ± 1.5 is high-variance (1.5 is 37.5% of 4.0). A score of 4.0 ± 0.5 is not (0.5 is 12.5% of 4.0).

High-variance results warrant caution. The true performance could be substantially higher or lower than the reported mean.
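
The 25% threshold is easy to express in code (a sketch; the function name is ours):

```python
def is_high_variance(mean, std, threshold=0.25):
    # Flag results where the std exceeds 25% of the mean.
    return std > threshold * mean

flag_a = is_high_variance(4.0, 1.5)  # True: 1.5 / 4.0 = 0.375
flag_b = is_high_variance(4.0, 0.5)  # False: 0.5 / 4.0 = 0.125
```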


p-value

A p-value is the probability of seeing results at least as extreme as what was observed, assuming there's actually no real effect. A small p-value (typically < 0.05) suggests the result is unlikely to be due to chance alone.

We rarely report p-values in our pilot studies because our sample sizes are too small for meaningful hypothesis testing. When we do, we note the test used and the exact p-value rather than just "significant" or "not significant."


Non-Parametric Test

A non-parametric test is a statistical test that makes no assumptions about the shape of the data distribution (it doesn't assume your data follows a bell curve). These tests work well with small samples and ordinal data like gate scores.

All three main statistical tests in our codegen benchmark study (Kruskal-Wallis, Mann-Whitney U, Friedman) are non-parametric. We use them because our score data is ordinal (1–5 gates) and not normally distributed.


Kruskal-Wallis H-Test

The Kruskal-Wallis test checks whether three or more groups have different distributions. It's the non-parametric equivalent of a one-way ANOVA.

Example: "Kruskal-Wallis H=56.65, p<0.001" means the 11 models in our benchmark show statistically significant differences in gate scores — at least one model performs differently from the others.

The test tells you that groups differ, not which groups differ. Follow-up pairwise tests (like Mann-Whitney U) identify the specific pairs.
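
For illustration, here is a from-scratch sketch of the H statistic (helper names are ours; it omits the tie correction that production implementations such as `scipy.stats.kruskal` apply, and it computes only the statistic, not the p-value):

```python
def average_ranks(values):
    # Rank all values 1..N, assigning tied values the average of their ranks.
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(values):
        j = i
        while j + 1 < len(values) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank for the tie run
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks

def kruskal_h(groups):
    # H = 12 / (N(N+1)) * sum(R_i^2 / n_i) - 3(N+1), no tie correction.
    pooled = [x for g in groups for x in g]
    ranks = average_ranks(pooled)
    n_total = len(pooled)
    h_sum, start = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        h_sum += rank_sum ** 2 / len(g)
        start += len(g)
    return 12.0 / (n_total * (n_total + 1)) * h_sum - 3 * (n_total + 1)

h_identical = kruskal_h([[1, 2, 3], [1, 2, 3]])  # ≈ 0.0, identical groups
h_separated = kruskal_h([[5, 5, 5], [1, 1, 1]])  # ≈ 3.86, fully separated
```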


Mann-Whitney U Test

The Mann-Whitney U test compares two groups to see if one tends to produce higher values than the other. It's the non-parametric equivalent of an independent t-test.

Example: "Mann-Whitney U, p=0.008" between Claude 3.5 Sonnet and GPT-4o means their score distributions differ significantly.

We use this test for pairwise model comparisons and for A/B testing (e.g., does the adapted prompt score differently than the baseline prompt?).
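
The U statistic itself is just a pairwise count (a sketch; turning U into a p-value additionally requires a normal approximation or exact tables, which we omit here):

```python
def mann_whitney_u(x, y):
    # Count pairs where x beats y; ties contribute 0.5.
    u = 0.0
    for xi in x:
        for yj in y:
            if xi > yj:
                u += 1.0
            elif xi == yj:
                u += 0.5
    return u

u = mann_whitney_u([5, 5, 4], [2, 3, 3])  # 9.0: complete separation (3 x 3 pairs)
```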


Friedman Test

The Friedman test checks whether repeated measures across conditions differ. It's used when the same subjects are measured under multiple conditions — in our case, the same models tested across different frameworks.

Example: "Friedman χ²=5.35, p=0.069" tests whether framework choice (FastAPI vs ASP.NET vs Spring Boot) systematically affects gate scores. A p-value of 0.069 (above 0.05) means the framework effect is suggestive but not statistically significant at conventional thresholds.
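
A from-scratch sketch of the Friedman statistic (no tie correction; rows are subjects such as models, columns are conditions such as frameworks; function names are ours):

```python
def friedman_chi2(scores):
    # scores[subject][condition]; rank conditions within each subject, then
    # chi2 = 12 / (n * k * (k + 1)) * sum(R_j^2) - 3 * n * (k + 1)
    n, k = len(scores), len(scores[0])
    col_rank_sums = [0.0] * k
    for row in scores:
        for j, value in enumerate(row):
            # 1-based average rank of this value within its row (ties averaged)
            less = sum(1 for v in row if v < value)
            equal = sum(1 for v in row if v == value)
            col_rank_sums[j] += less + (equal + 1) / 2
    sum_sq = sum(r ** 2 for r in col_rank_sums)
    return 12.0 / (n * k * (k + 1)) * sum_sq - 3.0 * n * (k + 1)

chi2_ordered = friedman_chi2([[1, 2, 3], [1, 2, 3], [1, 2, 3]])  # ≈ 6.0 (max for n=3, k=3)
chi2_flat = friedman_chi2([[2, 2, 2], [2, 2, 2], [2, 2, 2]])     # ≈ 0.0 (no condition effect)
```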


Sign Test

The sign test is the simplest test for paired data — it just counts whether improvements outnumber regressions (ignoring ties and magnitude).

Example: In our meta-prompting A/B test, we checked whether more models improved with adapted prompts than worsened. A sign test p=0.754 means the split was roughly even — no directional trend.

It's less powerful than other tests but makes almost no assumptions about the data, making it useful as a sanity check.
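
The exact two-sided sign test fits in a few lines (a sketch; ties are dropped, as described above):

```python
from math import comb

def sign_test_p(before, after):
    # Count improvements and regressions; ties are dropped.
    pos = sum(1 for b, a in zip(before, after) if a > b)
    neg = sum(1 for b, a in zip(before, after) if a < b)
    n = pos + neg
    if n == 0:
        return 1.0
    # Two-sided exact binomial p-value under a fair-coin null.
    k = min(pos, neg)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

p_all_improved = sign_test_p([1, 1, 1, 1], [2, 2, 2, 2])  # 0.125 (4 of 4 improved)
p_even_split = sign_test_p([1, 2], [2, 1])                # 1.0 (no directional trend)
```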


Chi-Squared (χ²) Statistic

The chi-squared statistic (χ²) is the test value produced by certain statistical tests, including the Friedman test. Larger χ² values indicate bigger differences between groups.

Example: "χ²=5.35" in our framework comparison. Whether this is "big enough" depends on the number of groups and the significance level — the p-value translates it into a probability.


H Statistic

The H statistic is the test value produced by the Kruskal-Wallis test. Like χ², larger values indicate more difference between groups.

Example: "H=56.65" in our model comparison is a very large value, confirming that models differ substantially in code generation quality.


Effect Size

Effect size measures how large a difference is, not just whether it's statistically significant. A tiny difference can be "significant" with enough data; effect size tells you if it actually matters.

Example: Two models may differ with p<0.01, but if the effect size is 0.1 standard deviations, the practical difference is negligible. Our meta-prompting analysis reports effect sizes to distinguish meaningful improvements from noise.


Cohen's d

Cohen's d is a specific effect size measure — the difference between two group means divided by the pooled standard deviation. It tells you how many "standard deviations apart" two groups are.

Cohen's d    Interpretation
0.2          Small effect
0.5          Medium effect
0.8          Large effect
> 1.0        Very large effect

Example: "Cohen's d = +3.79" for Claude 3.5 Haiku with adapted prompts means the improvement was nearly 4 standard deviations — an enormous effect.
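
Cohen's d can be computed directly from the pooled-standard-deviation form described above (a sketch; the sample numbers are invented):

```python
import math

def cohens_d(group_a, group_b):
    # (mean_a - mean_b) / pooled standard deviation
    na, nb = len(group_a), len(group_b)
    ma, mb = sum(group_a) / na, sum(group_b) / nb
    var_a = sum((x - ma) ** 2 for x in group_a) / (na - 1)
    var_b = sum((x - mb) ** 2 for x in group_b) / (nb - 1)
    pooled = math.sqrt(((na - 1) * var_a + (nb - 1) * var_b) / (na + nb - 2))
    return (ma - mb) / pooled

d = cohens_d([2, 4], [1, 3])  # ≈ 0.71: means 3 vs 2, pooled std √2
```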


Coefficient of Variation (CV)

The coefficient of variation is the ratio of standard deviation to mean (σ/μ), expressed as a proportion or percentage. It measures relative consistency — how variable results are relative to their average.

Example: A LOC CV of 0.395 means the standard deviation is about 40% of the mean line count. A CV of 0.05 would mean very consistent output length. We use CV in our code quality meta-analysis to compare consistency across models.


Normal Approximation

A normal approximation uses the bell-curve (Gaussian) distribution to estimate a statistic. Our confidence interval formula x̄ ± 1.96 · s/√n assumes the sampling distribution of the mean is approximately normal.

This works reasonably well even for non-normal data when sample sizes aren't too small (thanks to the Central Limit Theorem), though with n=3–5 it's an approximation, not exact.


Bimodal Distribution

A bimodal distribution has two distinct peaks rather than one. In plain terms: the results cluster around two different values with a gap in between.

Example: GPT-5.2 on FastAPI scores either 5/5 or 0/5 — nothing in between. It either nails the task or fails completely. This "all or nothing" pattern is a textbook bimodal distribution and is much more informative than the mean alone (which would be around 2.5 and tell you nothing useful).


Pearson Correlation (r)

Pearson's r measures the linear relationship between two variables, ranging from −1 (perfect negative correlation) to +1 (perfect positive correlation). Zero means no linear relationship.

Example: In our evaluator calibration, Pearson r=0.89 between the LLM judge and human scores means strong agreement — the judge's ratings track closely with human assessments.
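
A standard-library sketch of Pearson's r:

```python
import math

def pearson_r(xs, ys):
    # Covariance of the pairs divided by the product of their spreads.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

r_pos = pearson_r([1, 2, 3], [2, 4, 6])  # 1.0 (perfect positive linear relationship)
r_neg = pearson_r([1, 2, 3], [3, 2, 1])  # -1.0 (perfect negative)
```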


Mean Absolute Deviation (MAD)

Mean Absolute Deviation is the average of the absolute differences between paired observations. It measures how far apart two sets of scores typically are.

Example: "MAD=0.43" between our LLM judge and human scores means the judge typically differs from the human rating by about 0.43 points on a 5-point scale — reasonably close agreement.
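
MAD as used here is just the average absolute gap between paired scores (a sketch; the example numbers are invented, not from the calibration data):

```python
def mean_absolute_deviation(judge, human):
    # Average absolute difference between paired scores.
    return sum(abs(j - h) for j, h in zip(judge, human)) / len(judge)

mad = mean_absolute_deviation([4.5, 3.0, 5.0], [4.0, 3.5, 5.0])  # ≈ 0.33
```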


Jaccard Similarity

Jaccard similarity (or Jaccard index) measures how much two sets overlap. It's calculated as the size of the intersection divided by the size of the union: |A ∩ B| / |A ∪ B|. Values range from 0 (no overlap) to 1 (identical sets).

Example: A "naming Jaccard" of 0.82 means 82% of function/variable names are shared between two code outputs. We use Jaccard similarity in our code quality meta-analysis to compare structural choices, function names, and import sets across runs.
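
In Python this is a small set operation (a sketch; the function names in the example are hypothetical, not taken from our outputs):

```python
def jaccard(a, b):
    # |A ∩ B| / |A ∪ B|; we define two empty sets as identical (1.0).
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

names_run1 = {"create_order", "get_order", "validate_payload"}
names_run2 = {"create_order", "get_order", "check_payload"}
similarity = jaccard(names_run1, names_run2)  # 2 shared / 4 total = 0.5
```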


μ (Mu)

The Greek letter μ (mu) represents the population mean — the true average you'd get if you could measure an infinite number of runs. In practice, we estimate μ using our sample mean x̄.

You'll see μ in formulas like our confidence formula: confidence = max(0, 1 − σ/μ), where σ is the standard deviation and μ is the mean gate score.
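
The confidence formula translates directly to code (a sketch; clipping at zero handles the case where σ exceeds μ):

```python
def confidence(mean, std):
    # confidence = max(0, 1 - sigma / mu); higher means more consistent runs.
    return max(0.0, 1.0 - std / mean)

c_tight = confidence(4.5, 0.1)  # ≈ 0.978: very consistent runs
c_noisy = confidence(4.5, 5.0)  # 0.0: std larger than the mean, clipped
```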


Study Design

Controlled Experiment

An experiment where we change one thing at a time and keep everything else constant, so we can attribute any difference in results to the thing we changed.

Example: To test whether a "concise" prompt works better than a "verbose" one, we keep the model, temperature, task, and scoring rubric identical — only the prompt changes.


Baseline

The baseline is the starting point we compare everything else to. It's usually the simplest or default configuration.

Example: In our persona study, the "baseline" was the original, unmodified prompt. We measured all variations against it to see what improved, what didn't, and what made things worse.


Condition

A condition is one specific configuration being tested. Each condition differs from the baseline (and other conditions) in a specific, documented way.

Example: "hybrid," "concise," and "more_production" are each conditions — different prompt variations tested under the same circumstances.


Run

A single execution of a test. Running the same condition multiple times (multiple runs) lets us measure consistency and calculate statistics.

Example: "n=3 runs per condition" means each prompt variation was tested 3 separate times, producing 3 independent scores.


Ablation Study

An experiment where you remove or disable one component at a time to understand each component's contribution. If performance drops when you remove something, that component was important.

Example: If a prompt contains both "production scenario" language and "metrics," an ablation study might test: prompt with both, prompt with only scenarios, prompt with only metrics, and prompt with neither.


Entropy Control

A technique we use to ensure fair comparisons between model runs. By controlling the randomness (entropy) in model outputs, we reduce the chance that differences between runs are just random noise rather than meaningful differences.

In practice, this means using fixed random seeds or temperature=0 settings to make outputs as deterministic as possible.


Temperature

A parameter that controls how random a model's output is. Temperature=0 means the model always picks the most likely next word (deterministic). Higher temperatures (0.7, 1.0) introduce more randomness and creativity.

In our benchmarks, we typically use temperature=0 for consistency. This is our "entropy control" — it ensures differences between models come from capability, not randomness.


Reproducibility

The ability for someone else (or your future self) to run the same experiment and get the same results. We publish our code, prompts, and data so others can verify our findings.

Perfect reproducibility with LLMs is challenging because model providers update their models over time. We record model versions and dates to help with this.


Pilot Study

A pilot study is a small-scale preliminary investigation to test feasibility and methods before committing to a larger effort. It answers "does this approach work?" before answering "what are the definitive results?"

Our codegen benchmark and persona optimization papers are both pilot studies — they establish methodology, identify pitfalls, and produce preliminary findings that guide future work.


A/B Test

An A/B test compares exactly two variants — A (the control/baseline) and B (the treatment/adapted version) — under identical conditions to see which performs better.

Example: In our meta-prompting evaluation, A = the standard prompt and B = the model-adapted prompt. Each model was tested with both, and we compared scores to see if adaptation helped.


Brownfield Task

A brownfield task involves modifying or extending an existing codebase — adding features to code that already exists, with established patterns, imports, and conventions to follow.

This is the opposite of a greenfield task and is more representative of real-world development. All tasks in our codegen benchmark are brownfield: models must add a new API endpoint to an existing application with existing routes, models, and middleware.


Greenfield

A greenfield project starts from scratch — writing new code with no existing codebase to integrate with. Greenfield tasks are generally easier for LLMs because there are no pre-existing patterns to follow.

Our benchmark focuses on brownfield tasks specifically because greenfield is less representative of how developers actually use AI coding assistants day-to-day.


Smoke Test

A smoke test is a quick, shallow test to verify that basic functionality works at all. It's a "does it even run?" check before investing in thorough testing.

Example: In our benchmark, some models were initially tested with only 1–2 runs as a smoke test. We later excluded "smoke-test-only" data and re-ran those models with full n≥5 coverage for the entropy-controlled results.


Ceiling Effect

A ceiling effect occurs when scores are already at or near the maximum, making it impossible to measure further improvement. If a model scores 5.00/5, an adapted prompt can't score higher — even if it's genuinely better.

Example: Claude 3.5 Sonnet scored 5.00/5.00 on Spring Boot at baseline. Any adaptation that maintains 5.00 shows "no change," but that doesn't mean adaptation is ineffective — there's simply no room to improve.


Floor Effect

A floor effect is the opposite — scores are so low that there's limited room for further degradation. It can also mask differences between poor-performing approaches.

Example: A model scoring 1.0/5 at baseline can only drop by 1 point maximum. Small improvements might not register as meaningful on the scale.


Calibration (Evaluator)

Calibration is the process of verifying that an evaluator (human or LLM judge) scores consistently and accurately. You check calibration by having the evaluator score the same outputs multiple times and comparing against a reference.

Example: We ran 5 calibration runs where a human and the LLM judge independently scored the same code outputs. MAD=0.43 and Pearson r=0.89 confirmed the judge was well-calibrated.


Evaluation Concepts

Gate

In our code generation benchmark, a gate is a specific quality check that generated code must pass. Gates test concrete, binary criteria: does the code compile? Does it include error handling? Does it follow the framework's conventions?

Example: "4.2 / 5 gates" means the code passed an average of 4.2 of the 5 quality checks across runs — each gate itself is pass/fail, so fractional scores come from averaging.


Pass Rate

The percentage of runs where a condition met a specific threshold. Unlike the mean score, pass rate tells you how often something works, not just how well it works on average.

Example: "100% pass rate at 4+ gates" means every run scored 4 or higher. A model with a high mean but low pass rate is unreliable — it sometimes works great but sometimes fails.


Perfect Rate

The percentage of runs that achieved the maximum possible score (e.g., 5/5 gates). This is stricter than pass rate — it tells you how often the model gets everything right.

Example: "67% perfect rate" means two out of three runs scored 5/5. A model with 100% pass rate (≥4 gates) but only 33% perfect rate is good but rarely flawless.
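
Both rates reduce to simple counting (a sketch using the 4+ threshold and 5-point maximum from the examples above):

```python
def pass_rate(scores, threshold=4):
    # Fraction of runs meeting the threshold.
    return sum(1 for s in scores if s >= threshold) / len(scores)

def perfect_rate(scores, maximum=5):
    # Fraction of runs achieving the maximum score.
    return sum(1 for s in scores if s == maximum) / len(scores)

runs = [5, 4, 5]
p = pass_rate(runs)        # 1.0: every run scored 4+
perf = perfect_rate(runs)  # ≈ 0.67: two of three runs hit 5/5
```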


Rubric

A scoring rubric is the specific criteria used to evaluate outputs. It defines exactly what "good" looks like, making scoring consistent and reproducible.

Our rubrics break evaluation into numbered dimensions (e.g., correctness, completeness, style) with specific criteria for each score level.


Composite Score

A composite score combines multiple sub-scores into a single number using a weighted formula. Each component contributes a defined percentage to the final score.

Example: Our code quality composite is: 35% clean code + 25% framework patterns + 25% language idioms + 15% code organization. A model scoring 4.5 on clean code and 3.0 on patterns would have a different composite than one scoring 3.0 and 4.5 on those same dimensions.
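
A sketch of the weighted composite (the dimension keys are our shorthand for the four components listed above, not names from the study code):

```python
# Weights from the composite described above: 35/25/25/15.
WEIGHTS = {"clean_code": 0.35, "patterns": 0.25, "idioms": 0.25, "organization": 0.15}

def composite(scores):
    # Weighted sum of sub-scores; weights must cover every dimension.
    return sum(WEIGHTS[dim] * s for dim, s in scores.items())

a = composite({"clean_code": 4.5, "patterns": 3.0, "idioms": 4.0, "organization": 4.0})
b = composite({"clean_code": 3.0, "patterns": 4.5, "idioms": 4.0, "organization": 4.0})
# a ≈ 3.93 vs b ≈ 3.78: same sub-scores, different placement, different composite.
```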


Inter-Rater Reliability

A measure of how consistently different evaluators (human or AI) score the same output. High inter-rater reliability means the rubric is clear and scoring is objective.

When we use LLM-as-judge systems, we test inter-rater reliability by having multiple evaluation runs score the same output and checking if they agree.


LLM-as-Judge

Using a language model to evaluate the output of another language model. Instead of having humans score every output (expensive and slow), we use a capable LLM with a detailed rubric.

This introduces a potential bias (models may prefer certain styles), which is why we validate LLM-as-judge scores against human evaluation where possible.


Cross-Family Evaluation

Using an LLM judge from a different model family than the model being evaluated. This prevents a model from rating its own output favorably (self-evaluation bias).

Example: We use a Gemini model to judge code generated by Claude, GPT, and other models. This doesn't eliminate all bias, but removes the most obvious conflict of interest.


Ground Truth

The known correct answer against which we evaluate model outputs. In code generation, ground truth might be a reference implementation that passes all tests.

Not all evaluations have clear ground truth. When they don't, we rely on rubric-based evaluation instead.


Pareto Frontier

The Pareto frontier (also called the Pareto optimal set) is the set of options where no alternative is better on all dimensions. Any improvement in one dimension requires a tradeoff in another.

Example: On a cost vs quality chart, models on the Pareto frontier represent the best possible quality at each price point. A model below the frontier is dominated — you can find another model that's both cheaper and better.
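
Finding the frontier is a dominance check over all pairs (a sketch; model names and numbers are made up for illustration):

```python
def pareto_frontier(models):
    # models: list of (name, cost, quality); lower cost and higher quality are better.
    frontier = []
    for name, cost, quality in models:
        dominated = any(
            (c <= cost and q >= quality) and (c < cost or q > quality)
            for _, c, q in models
        )
        if not dominated:
            frontier.append(name)
    return frontier

models = [("A", 0.01, 4.0), ("B", 0.30, 5.0), ("C", 0.05, 4.5), ("D", 0.10, 4.2)]
frontier = pareto_frontier(models)  # A, B, C survive; D is dominated by C
```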


Model Generation Fingerprint

A model generation fingerprint is the characteristic set of code patterns that distinguish one model's output from another's — like a coding "signature." Different models consistently make different choices about decomposition, naming, structure, and style.

Example: Claude models tend to produce more helper functions and longer outputs, while GPT models favor inline logic and shorter files. These patterns are consistent enough across runs to identify the generating model.


Code Quality Meta-Analysis

A code quality meta-analysis goes beyond binary pass/fail to examine the qualitative characteristics of generated code — structure, naming conventions, idiomatic patterns, consistency, and architectural choices.

This analysis uses static analysis tools and LLM judges to evaluate dimensions that gate-based testing misses, such as whether the code is maintainable and follows best practices.


Prompt Engineering

System Prompt

The system prompt (or system message) is the initial instruction that sets up how the AI should behave. It's not part of the conversation — it's the behind-the-scenes instruction that shapes every response.

Example: "You are a senior software engineer who writes production-quality Python code with comprehensive error handling."


Persona Prompt

A system prompt that gives the AI a specific professional identity with experience, preferences, and working style. More detailed than a basic system prompt.

Example: Instead of "You are a developer," a persona prompt might say: "I've deployed FastAPI services in production environments handling millions of requests. I prioritize type safety and comprehensive error handling."


Meta-Prompting

Using an AI to generate or optimize prompts for another AI (or itself). Instead of manually writing prompt variations, you ask a model to suggest improvements based on evaluation results.

Example: "Given that the 'production scenario' pattern correlated with +5.8 points, generate a prompt variation that emphasizes production experience."


Few-Shot Prompting

Including examples of desired input-output pairs in the prompt. The model learns the pattern from the examples and applies it to new inputs.

Example: Showing 2–3 examples of well-scored code before asking the model to generate code for a new task.


Zero-Shot Prompting

Asking a model to perform a task without any examples — relying solely on the instruction and the model's training. Most of our studies use zero-shot prompting to test the model's baseline capability.


Prompt Variation

A systematic modification of a prompt to test a specific hypothesis. Each variation changes one or more identified patterns while keeping the overall structure consistent.

Example: The "concise" variation takes the baseline prompt and shortens it, testing whether brevity improves scores.


Chain of Thought

A prompting technique that asks the model to show its reasoning step by step before giving a final answer. This often improves accuracy on complex tasks.

Example: "First, analyze the requirements. Then, identify the needed imports. Finally, write the implementation."


Bias and Validity

Survivorship Bias

The error of focusing only on successes and ignoring failures. In prompt engineering, this happens when you run many experiments, discard the ones that "don't count," and report only the best results.

Our guidelines require reporting ALL runs, including failures. If runs are excluded, we must state how many and why.


Cherry-Picking

Selectively reporting only the results that support your conclusion. Related to survivorship bias, but more deliberate — choosing specific runs, metrics, or configurations that look good.

We guard against this by pre-registering our test conditions and reporting all results, including ones that contradict our expectations.


Confounding Variable

An uncontrolled factor that might explain the results instead of the variable you intended to test. If two things change at once, you can't tell which one caused the difference.

Example: If you change both the prompt and the model version between tests, you can't attribute score differences to either change alone.


Generalizability

Whether findings from one context apply to other contexts. Our studies are specific to our evaluation pipeline, task types, and model versions. Results may not generalize to different tasks, domains, or model providers.

We always state the scope of our findings and avoid claiming they apply universally.


Overfitting

Optimizing so specifically for a particular test that performance doesn't transfer to real-world use. In prompt engineering, this means creating a prompt that scores well on your benchmark but doesn't actually produce better outputs in practice.


Systematic Bias

A consistent directional error in measurement — always scoring too high or too low, in the same direction. Unlike random error (which averages out), systematic bias skews all results.

Example: In our evaluator calibration, we found the Gemini judge scored 0.43 points lower than the human on average across all dimensions. This consistent offset is systematic bias. We report it so readers can account for it.


Cost and Performance

Cost per Run

The API cost of a single test execution, typically measured in dollars. This includes input tokens (the prompt) and output tokens (the model's response).

We report costs so readers can assess whether reproducing our experiments is feasible and to enable cost–benefit analysis of different approaches.


Token

The basic unit that language models process. A token is roughly ¾ of a word in English. Both input (prompt) and output (response) are measured in tokens, and API pricing is based on token counts.

Example: The sentence "The quick brown fox" is 4 tokens. A 1,000-word document is roughly 1,300 tokens.


Input Tokens vs Output Tokens

API providers charge different rates for input tokens (what you send — the prompt, context, instructions) and output tokens (what the model generates — the code, explanation, response).

Example: GPT-4o charges $2.50 per million input tokens but $10.00 per million output tokens. A long prompt with a short response costs less than a short prompt with a long response. This is why our cost tables break down pricing by input/output.
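
Per-run cost is a straightforward rate calculation (a sketch; the token counts are invented, and the rates are the GPT-4o figures from the example above):

```python
def run_cost(input_tokens, output_tokens, in_rate, out_rate):
    # Rates are dollars per million tokens.
    return input_tokens / 1e6 * in_rate + output_tokens / 1e6 * out_rate

# A 3,000-token prompt with a 1,500-token response at GPT-4o rates.
cost = run_cost(input_tokens=3_000, output_tokens=1_500, in_rate=2.50, out_rate=10.00)
# ≈ $0.0075 input + $0.0150 output = $0.0225 per run
```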


Inference

The process of generating an output from a trained model. Each time you send a prompt and get a response, that's one inference. Inference costs time and money.


Latency

The time delay between sending a request and receiving the complete response. Measured in seconds. Affected by model size, output length, and server load.

Example: "60.8s latency" means it took about a minute from sending the prompt to receiving the complete generated code.


Cost–Quality Frontier

A visualization showing the tradeoff between cost and quality across different models or configurations. Points closer to the top-left corner (high quality, low cost) represent better value.

Example: If Model A costs $0.01/run and scores 5.0, while Model B costs $0.30/run and also scores 5.0, Model A is clearly on the efficient frontier.


Open-Weight Model

A model whose trained weights (parameters) are publicly available for download and self-hosting, as opposed to proprietary API-only models where you can only access them through a vendor's API.

Example: DeepSeek R1 and Qwen 2.5 Coder are open-weight models — you can run them on your own infrastructure. GPT-4o and Claude 3.5 are proprietary — you must use OpenAI's or Anthropic's API.

Open-weight models offer cost advantages (no per-token API fees when self-hosted) but require your own GPU infrastructure.


Reporting Conventions

Mean ± Std Notation

The format "X.XX ± Y.YY" reports the average score followed by the standard deviation. This is our standard way of presenting results.

Example: "4.67 ± 0.58" means an average score of 4.67 with a standard deviation of 0.58. The ± tells you how much individual runs typically deviate from the average.


95% CI Notation

The format "[lower, upper]" gives the range of the 95% confidence interval.

Example: "95% CI: [4.01, 5.00]" means we're 95% confident the true mean falls between 4.01 and 5.00.


Delta (Δ)

The Greek letter delta (Δ) means "change" or "difference." In our results tables, "Δ vs Baseline" shows how much a condition's score differs from the baseline.

Example: "Δ = +4.6" means this condition scored 4.6 points higher than the baseline. "Δ = −1.4" means 1.4 points lower.


n= Notation

The notation "n=X" simply states the sample size — how many runs produced the reported result.

Example: "(n=3)" means three runs were conducted. When you see this, you know how much data supports the reported statistic.


σ (Sigma) Column Notation

In results tables, a column headed "σ" contains the standard deviation for that row. It's shorthand — the same value you'd see after the ± in "mean ± std" notation, just in its own column for readability.

Example: A table with columns "Mean" and "σ" showing 4.67 and 0.58 is equivalent to reporting "4.67 ± 0.58."


† (Dagger) Subscription Notation

A dagger symbol (†) next to a cost figure indicates a subscription-priced model where the per-token cost doesn't fully reflect what you pay. The actual cost depends on your subscription tier rather than pure usage.

Example: "$0.000†" means the model is available through a flat-rate subscription (like ChatGPT Pro), so the marginal per-run cost approaches zero — but you're paying a monthly fee regardless of usage.


Technical Infrastructure

REST Endpoint / API Endpoint

A REST endpoint is a specific URL in a web application that accepts HTTP requests and returns responses. "Adding an endpoint" means creating a new URL (like /api/v1/orders) that handles requests.

Example: Our benchmark tasks require models to add a new REST endpoint to an existing web application — a common real-world development task.


FastAPI

FastAPI is a modern Python web framework for building REST APIs. It uses type hints and Pydantic for automatic request validation. It's one of three frameworks tested in our codegen benchmark.

FastAPI tasks test whether models can write async Python code, use Pydantic v2 validation syntax, and follow FastAPI-specific patterns like dependency injection and router organization.


ASP.NET Core

ASP.NET Core is Microsoft's C# web framework for building REST APIs. It uses controller classes, model binding, and built-in dependency injection. It's one of three frameworks tested in our benchmark.

ASP.NET Core tasks test whether models can write idiomatic C# with data annotations, proper service layer patterns, and controller-based routing.


Spring Boot

Spring Boot is a Java web framework that simplifies building REST APIs with annotation-based configuration. It's one of three frameworks tested in our benchmark.

Spring Boot tasks test whether models can use Java annotations (@RestController, @PostMapping), Bean Validation, and the service-repository pattern.


Pydantic

Pydantic is a Python library for data validation using type annotations. In FastAPI, Pydantic models define the shape of request/response data and automatically validate inputs.

Why it matters in our study: Pydantic v2 changed its API significantly from v1. Models that learned v1 syntax produce code that fails validation. This is one of the most common failure modes in our FastAPI tasks.
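One concrete instance of the v1-to-v2 change (one of several; the validator decorator and serialization method renames are among the most visible):

```python
from pydantic import BaseModel, field_validator

class Order(BaseModel):
    quantity: int

    # v2 syntax: @field_validator replaces v1's @validator
    @field_validator("quantity")
    @classmethod
    def quantity_positive(cls, v: int) -> int:
        if v <= 0:
            raise ValueError("quantity must be positive")
        return v

order = Order(quantity=3)
print(order.model_dump())  # v2: .model_dump() replaces v1's .dict()
```

A model that emits `@validator` or `.dict()` here is producing v1 code, which is exactly the failure mode described above.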


Dependency Injection

Dependency injection (DI) is a design pattern where a component receives its dependencies from the outside rather than creating them itself. This makes code more testable and modular.

Example: Instead of an endpoint creating its own database connection, it receives one through a parameter: def create_order(db: Session = Depends(get_db)). Models that skip DI produce code that compiles but fails integration tests.
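The same idea in framework-free Python (hypothetical names; FastAPI's `Depends` automates this wiring for you):

```python
class FakeDatabase:
    """Stand-in dependency; tests inject this, production injects the real one."""
    def save(self, record: dict) -> dict:
        return {**record, "id": 1}

def create_order(order: dict, db) -> dict:
    # The function receives `db` from outside instead of constructing it,
    # which is what makes it testable in isolation
    return db.save(order)

result = create_order({"product_id": 7}, db=FakeDatabase())
print(result)  # → {'product_id': 7, 'id': 1}
```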


CRUD

CRUD stands for Create, Read, Update, Delete — the four basic operations for persistent data. A "CRUD endpoint" handles one or more of these operations.

Example: Our baseline codebases include a Users CRUD endpoint. The benchmark task asks models to add an Orders CRUD endpoint following the same patterns.


Unified Diff

A unified diff is the standard format (used by git diff) for showing differences between two versions of a file. Lines starting with + were added; lines starting with - were removed.

Our meta-prompting questionnaire asks models whether they prefer to output code changes as unified diffs or complete file replacements.
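A short made-up example of the format, changing one line of a hypothetical `orders.py`:

```diff
--- a/orders.py
+++ b/orders.py
@@ -1,2 +1,2 @@
 def total(items):
-    return sum(items)
+    return sum(i.price for i in items)
```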


Linting

Linting is automated checking of code for style violations, suspicious patterns, and potential bugs. A linter doesn't run the code — it analyzes the source text.

Example: Tools like ruff (Python), Roslyn analyzers (C#), and Checkstyle (Java) are linters. Passing the lint gate means the generated code follows the language's style conventions.


Type Checking

Type checking verifies that code uses the correct data types — that you don't pass a string where a number is expected, for example. Static type checkers analyze code without running it.

Example: mypy (Python), Roslyn (C#), and the Java compiler perform type checking. This is one of our five quality gates.
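In Python, type checking is driven by annotations like these (an illustrative function, not from the benchmark):

```python
def add_tax(price: float, rate: float) -> float:
    # The hints declare that both arguments and the result are numbers
    return price * (1 + rate)

print(add_tax(100.0, 0.25))  # → 125.0 — correct usage

# A static checker like mypy would flag the following before the code ever runs:
#   add_tax("100", 0.25)   # error: argument 1 has incompatible type "str"
```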


Static Analysis

Static analysis means analyzing code without executing it — examining the source text for properties like structure, complexity, naming patterns, and potential issues.

Our code quality meta-analysis uses static analysis to measure line counts, function decomposition, naming consistency, and import patterns across all generated code samples.


Cyclomatic Complexity

Cyclomatic complexity counts the number of independent paths through a piece of code. More if/else branches, loops, and conditions mean higher complexity. Higher complexity = harder to test and maintain.

Example: A function with no branches has complexity 1. A function with one if/else has complexity 2. A function with a nested if inside a loop might have complexity 5+. We report this in our code quality analysis.
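Counting by hand on some illustrative functions (complexity = 1 plus the number of decision points):

```python
def label(score):
    # Complexity 1: a single straight-line path, no branches
    return "scored"

def pass_fail(score, threshold):
    # Complexity 2: the if/else adds one independent path
    if score >= threshold:
        return "pass"
    return "fail"

def count_passes(scores, threshold):
    # Complexity 4: the loop, the if, and the elif each add a path
    passes = 0
    for s in scores:
        if s >= threshold:
            passes += 1
        elif s < 0:
            raise ValueError("negative score")
    return passes
```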


LOC (Lines of Code)

LOC is simply the number of lines in a code file or function. While not a quality metric on its own, comparing LOC across runs reveals how consistently a model produces code of similar size.

Example: "LOC CV of 0.395" means the standard deviation of line count is about 40% of the mean — the model produces quite different amounts of code on each run.
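CV (the coefficient of variation) is simply the standard deviation divided by the mean. A sketch with made-up per-run line counts:

```python
import statistics

loc_per_run = [120, 80, 200, 95, 150]  # hypothetical LOC across five runs

cv = statistics.stdev(loc_per_run) / statistics.mean(loc_per_run)
print(f"LOC CV: {cv:.3f}")  # ~0.37 here: output size varies ~37% run to run
```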


Async/Await

Async/await is a programming pattern for handling operations that take time (like database queries or API calls) without blocking the entire application. Code marked async can pause (await) while waiting for results.

FastAPI uses async/await extensively. Models must use async def for endpoint handlers and await for database operations — using synchronous code is a common failure mode.
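A self-contained sketch of the pattern (the simulated "query" is a stand-in for a real database call):

```python
import asyncio

async def fetch_order(order_id: int) -> dict:
    # Simulates a slow database query; `await` pauses this coroutine
    # without blocking the rest of the application
    await asyncio.sleep(0.01)
    return {"id": order_id, "status": "created"}

async def main():
    # Both "queries" run concurrently instead of one after the other
    orders = await asyncio.gather(fetch_order(1), fetch_order(2))
    print(orders)

asyncio.run(main())
```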


YAML

YAML (YAML Ain't Markup Language) is a human-readable data format used for configuration files. It uses indentation instead of brackets.

Example: Our task definitions are YAML files that specify the framework, requirements, existing code paths, and expected gate checks.
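A small illustration of what such a file looks like (hypothetical fields chosen to show the format, not our actual schema):

```yaml
# Hypothetical task definition — illustrates YAML's indentation-based structure
framework: fastapi
task: add-orders-endpoint
requirements:
  - POST /api/v1/orders creates an order
  - GET /api/v1/orders/{id} returns one order
gates:
  - lint
  - type-check
```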


Code Quality Patterns

DTO (Data Transfer Object)

A DTO is an object used to pass data between layers of an application. It defines the shape of data at a boundary (like an API request or response) without including business logic.

Example: A CreateOrderRequest class with fields for product_id and quantity is a DTO — it defines what the API accepts but doesn't contain ordering logic.
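That example as a plain Python dataclass (in FastAPI the same role is played by a Pydantic model):

```python
from dataclasses import dataclass

@dataclass
class CreateOrderRequest:
    # A DTO: just the shape of data at the API boundary — no business logic
    product_id: int
    quantity: int

req = CreateOrderRequest(product_id=42, quantity=3)
```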


Repository Pattern

The repository pattern creates an abstraction layer between business logic and data access. Instead of writing database queries directly in your endpoint, you call methods like repository.save(order).

This is a key pattern in ASP.NET Core and Spring Boot. Models that skip the repository layer and put database logic directly in controllers score lower on code organization.
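A minimal Python sketch of the pattern (an in-memory dict stands in for the real database; all names are illustrative):

```python
class OrderRepository:
    """Abstraction over data access; callers never see SQL or connections."""
    def __init__(self):
        self._orders = {}   # in-memory stand-in for a real database
        self._next_id = 1

    def save(self, order: dict) -> dict:
        order = {**order, "id": self._next_id}
        self._orders[order["id"]] = order
        self._next_id += 1
        return order

    def find(self, order_id: int):
        return self._orders.get(order_id)

def place_order(product_id: int, repo: OrderRepository) -> dict:
    # Business logic calls repository methods, never the database directly
    return repo.save({"product_id": product_id})

repo = OrderRepository()
saved = place_order(7, repo)
```

Swapping the in-memory repository for a SQL-backed one requires no change to `place_order` — that substitutability is the point of the pattern.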


SRP (Single Responsibility Principle)

SRP states that each module, class, or function should have exactly one reason to change. If a function handles validation, database access, and error formatting, it has three responsibilities and violates SRP.

Our LLM-judged code quality rubric evaluates whether generated code follows SRP under the "Clean Code" dimension.


DRY (Don't Repeat Yourself)

DRY means avoiding duplicated logic. If the same code appears in multiple places, it should be extracted into a shared function or class.

Example: If three endpoints all validate authentication the same way, DRY says to extract that into a middleware or dependency — not copy-paste it three times.
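The "after" state of that example, sketched in Python (hypothetical endpoints; the shared check is written once and called everywhere):

```python
def require_token(headers: dict) -> str:
    # Shared authentication check, extracted once instead of copy-pasted
    token = headers.get("Authorization")
    if not token:
        raise PermissionError("missing token")
    return token

def list_orders(headers: dict) -> list:
    require_token(headers)
    return []

def create_order(headers: dict, order: dict) -> dict:
    require_token(headers)
    return order
```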


Layered Architecture

Layered architecture organizes code into distinct layers with clear responsibilities: controller (handles HTTP) → service (business logic) → repository (data access). Each layer only talks to the one below it.

Models are scored on whether they maintain proper layer separation. Putting database queries directly in a controller or business logic in a DTO violates layered architecture.


Visualization Types

Heatmap

A heatmap is a grid where color intensity represents values. Darker or warmer colors typically mean higher values. Heatmaps are excellent for spotting patterns across two dimensions.

Example: Our gate pass rate heatmap shows models (rows) vs frameworks (columns), with cell colors indicating pass rates. You can instantly spot which model×framework combinations perform well.


Radar Chart

A radar chart (also called a spider chart) has multiple axes radiating from a center point, one per dimension. The data forms a polygon — the shape reveals a model's profile across all dimensions at a glance.

Example: Our model fingerprint radar charts show each model's code quality profile (decomposition, naming, structure, LOC consistency). Different polygon shapes reveal distinct "coding personalities."


Scatter Plot

A scatter plot places individual data points on two axes. Each point represents one observation. Scatter plots reveal relationships, clusters, and outliers.

Example: Our cost–quality frontier chart plots each model as a point where x = cost per run and y = mean gate score. Points in the upper-left corner (high quality, low cost) are the best value.


Dumbbell Chart

A dumbbell chart shows paired values connected by a line — typically a "before" and "after" for each item. The dot positions show the values; the line length shows the magnitude of change.

Example: Our meta-prompting A/B results use a dumbbell chart where each model has two dots (baseline score and adapted score) connected by a line. You can instantly see which models improved, which worsened, and by how much.


Version History

1.0 (February 2026): Initial publication covering terminology from the llm-codebench and persona-prompt-optimization studies.

2.0 (February 2026): Expanded from 43 to 105 terms. Added statistical tests (Kruskal-Wallis, Mann-Whitney U, Friedman, sign test, Cohen's d), study design terms (brownfield, A/B test, ceiling/floor effects), evaluation concepts (Pareto frontier, model fingerprints, meta-analysis), and three new categories: Technical Infrastructure, Code Quality Patterns, Visualization Types.

This guide evolves as our research practice grows. When new studies introduce new methods or terminology, definitions are added here. If something is unclear, that's a bug — let us know.