EngramForge Engineering Research Guidelines
Purpose
This document establishes standards for how EngramForge engineers design, execute, document, and present empirical investigations conducted during platform development. These guidelines exist to ensure:
- Consistency across publications from different teams and projects
- Statistical rigor appropriate to our claims
- Honest positioning of our work relative to the broader field
- Protection of proprietary information while sharing useful methodology
- Reduced cognitive load for readers moving between our publications
All published engineering studies must follow these guidelines before release.
1. Research Lifecycle
Every investigation follows a five-phase cycle rooted in the scientific method, adapted for systems engineering:
Phase 1 — Observe: An inconsistency, variance, or engineering problem is identified during normal development. The observation is documented as a problem statement.
Phase 2 — Hypothesize: The team defines testable engineering questions, identifies the variables to measure, and establishes success criteria before running any experiments.
Phase 3 — Test: Controlled experiments are executed using automated tooling. All results are captured—not just successful ones.
Phase 4 — Analyze: Results are aggregated with appropriate statistical methods (mean, standard deviation, confidence intervals). Findings are scoped to what the data supports.
Phase 5 — Implement: Findings inform platform architecture, code, or configuration decisions. Implementation often surfaces new observations, restarting the cycle.
2. Tone and Positioning
2.1 What We Are
We are a systems engineering organization that uses empirical methods to make data-driven platform decisions. Our publications document this process and share reusable methodology.
2.2 What We Are Not
We are not an academic research lab publishing novel findings. We do not claim to be at the cutting edge of any field. We recognize that much of what we investigate has been studied by others, often more rigorously and at larger scale.
2.3 Mandatory Language Rules
Never use:
- "First to" / "novel" / "groundbreaking" / "revolutionary"
- "No one has" / "little research exists" / "unexplored"
- "We discovered" (when describing known phenomena)
- "State of the art" (referring to our own work)
- "Industry-changing" / "paradigm shift"
Always use:
- "During development of [generic subsystem], we encountered..."
- "We applied [standard method] to our specific context..."
- "In our testing environment, we observed..."
- "This informed our platform architecture by..."
- "These results are specific to our configuration and may vary..."
Example transformations:
| ❌ Don't write | ✅ Do write |
|---|---|
| We discovered a fundamental flaw in how the industry benchmarks LLMs | During benchmark development, we observed significant run-to-run variance in code generation quality |
| No prior work has examined this combination of variables | We needed to understand how these variables interacted in our specific system |
| Our novel entropy control system | We implemented automatic variance detection to improve measurement reliability |
| This changes everything about LLM evaluation | This informed how we structure our evaluation pipeline |
| GPT-4o-mini is unreliable | In our testing configuration, GPT-4o-mini showed higher variance on Python tasks (3.00 ± 1.50 / 5 gates) |
2.4 Intellectual Property Protection
Published studies must describe subsystems generically. Do not disclose:
- Proprietary algorithm details or trade secrets
- Internal system architecture beyond what's needed for context
- Customer data, usage patterns, or business metrics
- Competitive analysis or strategic positioning
- Unreleased product features or roadmap details
Use phrases like "our evaluation pipeline," "the platform's scoring subsystem," or "our code generation benchmark" rather than naming internal tools or describing proprietary implementations.
3. Statistical Standards
3.1 Minimum Requirements
| Requirement | Standard | Rationale |
|---|---|---|
| Minimum runs per condition | 2 | Cannot measure variance with n=1 |
| Preferred runs per condition | 3–5 | Enables confidence interval calculation |
| Central tendency | Report mean | Comparable across studies |
| Dispersion | Report standard deviation | Captures run-to-run variance |
| Confidence intervals | Report 95% CI when n ≥ 3 | Quantifies estimate uncertainty |
| Individual run data | Include in appendix | Enables reader verification |
| Sample size | Always disclose | Readers must assess significance themselves |
3.2 How to Report Results
Always:
Model X achieved 4.67 ± 0.58 / 5 gates (n=3, 95% CI: [4.01, 5.00])
Never:
Model X achieved 5/5 gates
(Unless all runs produced that result, in which case report: "5.00 ± 0.00 / 5 gates (n=3)")
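This reporting format can be produced mechanically. A minimal Python sketch (the `report` helper is illustrative, not a platform tool; it uses a normal-approximation 95% CI clamped to the score range, which reproduces the example above):

```python
import statistics

def report(runs, max_score, label="gates"):
    """Format a result per Section 3.2: mean ± std, sample size,
    and a 95% CI when n >= 3 (normal approximation, clamped to range)."""
    n = len(runs)
    mean = statistics.mean(runs)
    std = statistics.stdev(runs) if n > 1 else 0.0
    line = f"{mean:.2f} ± {std:.2f} / {max_score} {label} (n={n}"
    if n >= 3:
        half_width = 1.96 * std / n ** 0.5
        lo = max(0.0, mean - half_width)
        hi = min(float(max_score), mean + half_width)
        line += f", 95% CI: [{lo:.2f}, {hi:.2f}]"
    return line + ")"

print(report([5, 4, 5], 5))  # -> 4.67 ± 0.58 / 5 gates (n=3, 95% CI: [4.01, 5.00])
```

For n = 2 the CI is omitted automatically, matching the n ≥ 3 requirement in Section 3.1.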
3.3 Handling Variance
When standard deviation exceeds 25% of the mean, the result is considered high-variance and must be flagged:
- Report the individual run scores in the results section (not just the appendix)
- Discuss possible explanations for the variance
- Avoid strong conclusions drawn from high-variance data
- If possible, run additional iterations
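The 25% threshold can be checked automatically before drafting the results section; a sketch (the function name is illustrative):

```python
def is_high_variance(mean, std, threshold=0.25):
    """Flag a result whose standard deviation exceeds 25% of its mean
    (Section 3.3); such results need individual run scores reported."""
    return mean > 0 and std > threshold * mean

# Using example figures from Sections 2.3 and 3.2:
print(is_high_variance(3.00, 1.50))  # True  -- std is 50% of the mean
print(is_high_variance(4.67, 0.58))  # False -- std is ~12% of the mean
```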
3.4 Avoiding Survivorship Bias
During development, engineers naturally run experiments multiple times while debugging tooling and prompts. This creates a history of results where early failures are "explained away" and later successes are reported.
Required practice: When reporting results, include ALL runs from the measurement period, including early failures. If you exclude runs, state how many were excluded and why (e.g., "3 runs were excluded due to API timeout errors unrelated to model quality").
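Tracking exclusions alongside scores makes the required disclosure mechanical; a sketch (the run schema with an `excluded` field is an illustrative assumption, not a platform format):

```python
def exclusion_note(runs):
    """Summarize excluded runs for the results section. Each run is a dict
    whose 'excluded' field holds None or a reason string."""
    excluded = [r["excluded"] for r in runs if r["excluded"]]
    if not excluded:
        return "No runs were excluded."
    reasons = "; ".join(sorted(set(excluded)))
    return f"{len(excluded)} run(s) were excluded ({reasons})."

runs = [
    {"score": 4, "excluded": None},
    {"score": 0, "excluded": "API timeout unrelated to model quality"},
    {"score": 5, "excluded": None},
]
print(exclusion_note(runs))
```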
3.5 Cost Reporting
All studies involving API calls must report:
- Cost per run (or per condition)
- Total cost of the study
- Token usage where available
This enables readers to assess reproducibility economics and helps our team estimate costs for future studies.
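Per-run and total costs can be aggregated from whatever per-call figures the tooling records; a minimal sketch (function name and rounding are illustrative):

```python
def cost_summary(run_costs_usd):
    """Aggregate per-run API costs for Section 3.5 reporting."""
    total = sum(run_costs_usd)
    return {
        "n_runs": len(run_costs_usd),
        "per_run_usd": round(total / len(run_costs_usd), 4),
        "total_usd": round(total, 4),
    }

print(cost_summary([0.02, 0.02, 0.03]))
```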
4. Document Structure
All published engineering studies must use the following structure. Sections may be combined for shorter studies, but the information must be present.
Table of Contents
Every study must include a Table of Contents immediately after the metadata header and context block. The TOC must:
- Use Markdown anchor links (e.g., `[3.1 Task Design](#31-task-design)`)
- List all `##` sections as top-level entries
- List all `###` subsections as nested entries (indented with two spaces)
- Match heading text exactly (anchors are auto-generated: lowercase, spaces → hyphens, punctuation stripped)
This allows readers to navigate long studies and gives reviewers a structural overview at a glance.
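The anchor rule (lowercase, spaces → hyphens, punctuation stripped) can be approximated when generating a TOC; a sketch (the `github_anchor` helper is illustrative and may miss edge cases of GitHub's actual slug algorithm):

```python
import re

def github_anchor(heading):
    """Approximate a GitHub auto-generated heading anchor:
    lowercase, punctuation stripped, whitespace -> hyphens."""
    text = heading.lower()
    text = re.sub(r"[^\w\s-]", "", text)   # strip punctuation
    return re.sub(r"\s+", "-", text.strip())

print(github_anchor("3.1 Task Design"))  # -> 31-task-design
```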
Required Sections
| # | Section | Contents |
|---|---|---|
| 1 | Executive Summary | Key findings (3–5 bullets), summary table of results. No more than half a page. |
| 2 | Introduction | Context within platform development, engineering objectives (not "research questions"), evaluation dimensions. |
| 3 | Methodology | How tests were designed and executed. Include code samples for measurement tools. Enough detail to reproduce. |
| 4 | Results | All experimental data. Tables with mean ± std, sample sizes, costs. No interpretation in this section. |
| 5 | Analysis | Interpretation of results. What patterns emerged, what failed, what was surprising. Tied to specific data. |
| 6 | Templates | Reusable prompts, configurations, or patterns from the study. Readers should be able to adapt these. |
| 7 | Implementation Guide | Step-by-step instructions to reproduce the study. Commands, dependencies, expected output. |
| 8 | Limitations | Honest assessment: sample sizes, scope, generalizability, known confounds. |
| 9 | Conclusions | Actionable recommendations tied to findings. No claims beyond what the data supports. |
| 10 | Appendices | Raw data, individual run details, tool versions, complete prompt text. |
Required Metadata
Every study must include this header:
# [Title]
## [Subtitle describing the scope or approach]
**Project:** [project name]
**Date:** [Month Year]
**Version:** [X.Y]
**Code & Data:** https://github.com/engramforge/research/[study-slug]
The subtitle should describe the study's scope naturally. Avoid repeating the same phrasing across studies. The Code & Data link must point to the study's directory in the research repository (see Section 5).
Terminology
Use "engineering objectives" not "research questions." Use "engineering study" or "pilot study" not "paper" or "publication." Use "findings" not "discoveries." Use "observations" not "results" when the sample size is small.
Timestamps
All timestamps in published documents, generated reports, and data files must either:
- Use ISO-8601 with timezone offset: `2026-02-07T11:11:20-07:00`
- Include a timezone annotation: `2026-02-07 11:11:20 MST`
Bare timestamps without timezone (e.g., 2026-02-07 11:11:20) are not acceptable — readers cannot determine when an experiment actually ran.
- Document metadata dates may use `YYYY-MM-DD` (timezone is implicit from the author)
- Machine-generated timestamps must include an offset or timezone abbreviation
- Run identifiers (e.g., `20260207-085932`) are opaque IDs, not display timestamps — they do not require timezone annotation but should not be presented to readers as times
5. Code & Data Repository
All supporting code, data, and reproducibility artifacts for published studies are hosted in a public GitHub repository, `engramforge/research`.
5.1 Repository Structure
Each study gets a directory named with a short, URL-friendly slug:
engramforge/research/
├── LICENSE # MIT (repository-wide)
├── README.md # Index of all studies
├── llm-codegen-benchmark/ # ← one directory per study
│ ├── README.md # Study overview + link to published doc
│ ├── data/ # Raw results, CSVs, JSON
│ │ ├── entropy_results.json
│ │ └── run_scores.csv
│ ├── scripts/ # Reproduction scripts
│ │ ├── run_benchmark.sh
│ │ └── analyze_results.py
│ ├── prompts/ # Prompt templates used
│ │ ├── fastapi.txt
│ │ └── aspnetcore.txt
│ └── diagrams/ # SVG diagrams from the study
│ └── pipeline.svg
├── persona-prompt-optimization/ # ← another study
│ ├── README.md
│ ├── data/
│ ├── scripts/
│ └── prompts/
└── ...
5.2 Naming Conventions
| Element | Convention | Example |
|---|---|---|
| Study directory | lowercase, hyphens, ≤ 40 chars | llm-codegen-benchmark |
| Data files | descriptive, lowercase, underscores | entropy_results.json |
| Scripts | action verb prefix | run_benchmark.sh, analyze_results.py |
| Prompts | framework or role name | fastapi.txt, frontend-developer.yaml |
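The directory-slug convention can be enforced with a simple check; a sketch (the regex and function name are illustrative):

```python
import re

SLUG_RE = re.compile(r"^[a-z0-9]+(-[a-z0-9]+)*$")

def valid_study_slug(name):
    """Check a study directory name against Section 5.2:
    lowercase, hyphen-separated, at most 40 characters."""
    return len(name) <= 40 and bool(SLUG_RE.match(name))

print(valid_study_slug("llm-codegen-benchmark"))  # True
print(valid_study_slug("LLM_Benchmark"))          # False
```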
5.3 What to Include
Must include:
- Raw result data (JSON, CSV, or YAML) for every run reported in the study
- Scripts or commands sufficient to reproduce the experiments
- Prompt templates exactly as used (not paraphrased)
- A `README.md` linking back to the published study
Must NOT include:
- API keys, tokens, or credentials (even expired ones)
- Proprietary source code from internal repositories
- Customer data or internal business metrics
- Model output containing generated proprietary code
- Large binary files (use Git LFS or link externally if > 10 MB)
5.4 Pre-Publication Scrub for Repository Artifacts
Before pushing to the public research repo, verify:
- `grep -r 'sk-' .` — no API keys in any file
- `grep -r 'password\|secret\|token' .` — no credentials
- `grep -ri 'internal\|confidential\|proprietary' .` — no IP markers
- All scripts use environment variables for API keys, not hardcoded values
- No `.env` files or credential files are included
- `.gitignore` excludes `*.env`, `*.key`, `*.pem`, `__pycache__/`, `.venv/`
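The grep checks can be wrapped in a single gate script that fails the push on any hit; a POSIX sh sketch (the `scrub` function name is illustrative):

```shell
# Pre-publication scrub gate (Section 5.4): returns non-zero on any hit.
scrub() {
  dir="${1:-.}"
  status=0
  grep -rq 'sk-' "$dir" && { echo "FAIL: possible API key ('sk-')"; status=1; }
  grep -rq 'password\|secret\|token' "$dir" && { echo "FAIL: credential marker"; status=1; }
  grep -riq 'internal\|confidential\|proprietary' "$dir" && { echo "FAIL: IP marker"; status=1; }
  [ "$status" -eq 0 ] && echo "scrub passed"
  return "$status"
}

# Usage: scrub path/to/study-dir
```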
5.5 License
The research repository uses the MIT License, consistent with standard practice for published research artifacts from universities and companies. All contributions to the repository are released under this license.
MIT License
Copyright (c) 2026 EngramForge
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
5.6 Linking Studies to Repository
Every published study must:
- Include a `Code & Data` link in its metadata header pointing to its repo directory
- Reference the repo in the Implementation Guide section for reproduction steps
- Ensure the repo directory's `README.md` links back to the published study
6. Diagrams and Visual Standards
6.1 Format
All diagrams must be SVG. Raster images (PNG, JPG) are not acceptable for architectural or process diagrams.
6.2 Style Rules
| Property | Value |
|---|---|
| Background | #FFFFFF (white) |
| Primary text | #1A1A2E |
| Secondary text | #555555 |
| Muted text | #8B8FA3 |
| Primary accent | #4A6FA5 (blue) |
| Secondary accent | #2A8F82 (teal) |
| Alert / negative | #C0392B (red) |
| Success / positive | #E8F5E9 fill, #2A8F82 stroke |
| Light fill | #F0F4FF (blue tint), #FAFAFA (neutral) |
| Font family | -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, Helvetica, Arial, sans-serif |
| Border radius | rx="6" for major boxes, rx="4" for annotations |
| Stroke width | 1.5 for major elements, 1 for annotations |
| Fills | Solid colors only — no gradients, shadows, or ombré effects |
6.3 Arrowhead Specification
<defs>
<marker id="arrow" viewBox="0 0 10 8" refX="9" refY="4"
markerWidth="8" markerHeight="6" orient="auto">
<path d="M0,0 L10,4 L0,8 Z" fill="#4A6FA5"/>
</marker>
</defs>
- Always use `orient="auto"` (not `auto-start-reverse`)
- Always place the `<defs>` block before any element that references markers
- Match the marker fill color to the stroke color of the connecting line
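For reference, a connector that uses this marker might look like the following (the stroke color and width follow the style table; the geometry is illustrative):

```xml
<line x1="20" y1="40" x2="180" y2="40"
      stroke="#4A6FA5" stroke-width="1.5"
      marker-end="url(#arrow)"/>
```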
6.4 Minimum Diagram Count
Each published study must include at least one SVG diagram. Recommended placements:
- System architecture or process flow in the Methodology section
- Results comparison chart in the Results section (optional)
7. Pre-Publication Checklist
Before any study is published, it must pass all four review sections:
Tone & Claims
- No "first to" / "novel" / "groundbreaking" claims
- No "little research exists" claims
- Framed as systems engineering, not academic research
- Positioned as problem-solving, not discovery
- Humble about our positioning relative to the field
- No proprietary IP disclosed
- Subsystems described generically
- Every claim supported by data presented in the document
- Conclusions scoped to what data supports
- Limitations clearly stated
Statistical Rigor
- Multi-run results (n ≥ 2 per condition)
- Mean ± standard deviation reported for all metrics
- Confidence intervals reported where n ≥ 3
- All runs shown (no cherry-picking)
- Sample sizes disclosed for every result
- High-variance results flagged and discussed
- Cost per run reported
- Failure modes analyzed
- Raw data included in appendix
- Reproducible scripts or commands included
Document Structure
- Table of Contents with nested subsection links
- Executive Summary present (≤ half page)
- Introduction with engineering objectives
- Methodology with code samples
- Results with data tables (mean ± std)
- Analysis tied to specific data points
- Reusable templates included
- Implementation guide with commands
- Limitations section present
- Appendices with raw data
- At least one SVG diagram
Code & Data Repository
- Study directory created in `engramforge/research`
- Raw data files included (JSON/CSV for all runs)
- Reproduction scripts included and tested
- Prompt templates included verbatim
- `README.md` in study directory links to published doc
- Metadata header includes `Code & Data` link
- No API keys, credentials, or secrets in any file
- No proprietary source code or internal IP
- `.gitignore` excludes `.env`, credentials, caches
- Licensed under MIT (inherited from repo root)
8. Prompt Template for New Studies
When starting a new engineering study, use this prompt template to generate the initial document skeleton. Adapt the placeholders to your specific investigation.
# [Title]
## [Subtitle — scope, method, or platform area]
**Project:** [repository name]
**Date:** [Month Year]
**Version:** 1.0
**Code & Data:** https://github.com/engramforge/research/[study-slug]
---
## Executive Summary
During the development of [generic platform component], we encountered
[specific engineering problem]. We applied [standard methodology] to
empirically determine [what we needed to learn].
This document describes our engineering approach, test methodology, and
findings. We share these methods to demonstrate our data-driven approach
to platform development decisions.
**Key Findings:**
- [Finding 1 with metric: X.XX ± Y.YY (n=Z)]
- [Finding 2 with metric]
- [Finding 3 with metric]
---
## 1. Introduction
### 1.1 Context
As part of [generic platform description], we implemented [subsystem]
that [what it does, described generically]. During integration testing,
we observed [the problem]. This led us to design a systematic test to
understand [what we needed to measure].
### 1.2 Engineering Objectives
Our platform development required answers to specific technical questions:
1. [Question about measurable variable]
2. [Question about configuration choice]
3. [Question about comparative performance]
This work represents a practical application of scientific method to
systems engineering — measuring variables, testing under controlled
conditions, and making data-driven architectural decisions.
### 1.3 System Overview
[Generic description of the subsystem. No proprietary details.
Enough context for a reader to understand what was tested and why.]
---
## 2. Methodology
### 2.1 Test Design
<!-- Describe: what was held constant, what was varied, how
outcomes were measured. -->
### 2.2 Measurement Tools
```python
# Replace with actual code used for measurement
def measure(...):
...
```
### 2.3 Test Infrastructure
<!-- Describe: automated testing setup, how variations were
generated, executed, and scored. -->
---
## 3. Results
### 3.1 Summary Table
| Condition | Metric | n | Mean ± Std | 95% CI | Cost |
|-----------|--------|---|------------|--------|------|
| _example_ | _gates passed_ | _3_ | _4.67 ± 0.58_ | _[4.01, 5.00]_ | _$0.02_ |
### 3.2 Detailed Results
<!-- Per-condition breakdown. Include individual run data
for any high-variance results. -->
---
## 4. Analysis
<!-- Interpretation tied to specific data from Section 3.
Every claim must reference a table or figure above. -->
---
## 5. Templates
<!-- Include reusable prompts, configurations, or patterns
that readers can adapt for their own systems. -->
---
## 6. Implementation Guide
```bash
# Replace with step-by-step reproduction instructions
git clone ...
pip install ...
python run_experiment.py --flag value
```
---
## 7. Limitations
1. <!-- Sample size limitation -->
2. <!-- Scope / generalizability limitation -->
3. <!-- Configuration specificity -->
---
## 8. Conclusions
<!-- Actionable recommendations. Each must tie back to a
specific finding in the Results or Analysis sections. -->
---
## Appendices
### Appendix A: Raw Data
<!-- Complete run-by-run data for all conditions. -->
### Appendix B: Tool Versions
| Tool | Version |
|------|---------|
9. Conformance Validation Prompts
Use these prompts to validate a draft study against these guidelines before publication. The main prompt performs a full conformance review; the sub-prompt focuses specifically on the pre-publication checklist.
9.1 Main Conformance Review Prompt
Copy the full text of your draft study and paste it after this prompt. The reviewer will assess the document against all sections of RESEARCH_GUIDELINES.md and return a structured report.
You are a technical editor reviewing an engineering study for publication
on the EngramForge site. Your job is to check the document against our
internal research guidelines and flag any issues.
Review the document below against ALL of the following criteria.
## Tone & Positioning (Section 2)
Check for prohibited language. Flag ANY instance of:
- "first to", "novel", "groundbreaking", "revolutionary", "paradigm shift"
- "no one has", "little research exists", "unexplored", "unique"
- "we discovered" (when describing known phenomena)
- "state of the art" (referring to our own work)
- "industry-changing"
Check that the document:
- Frames work as systems engineering, not academic research
- Positions findings as problem-solving, not discovery
- Uses "engineering objectives" not "research questions"
- Uses "findings" not "discoveries"
- Uses "observations" not "results" when sample size is small
- Describes subsystems generically (no proprietary names/details)
- Scopes every claim to the data that supports it
## Statistical Rigor (Section 3)
Check that:
- All reported metrics use multi-run data (n ≥ 2)
- Format is "mean ± std" for every numeric result
- Confidence intervals are included where n ≥ 3
- ALL runs are reported (no cherry-picking)
- Sample sizes (n) are disclosed for every result
- High-variance results (std > 25% of mean) are flagged and discussed
- Cost per run is reported for any API-based experiment
- Failure modes are analyzed, not just success rates
- Raw data appears in an appendix
- Reproduction steps are included (scripts or commands)
## Document Structure (Section 4)
Check that a Table of Contents exists with nested anchor links for all ## and ### headings.
Check that these sections exist and contain appropriate content:
1. Executive Summary (≤ half page, key findings with metrics)
2. Introduction (context, engineering objectives, system overview)
3. Methodology (test design, code samples, infrastructure)
4. Results (data tables with mean ± std, no interpretation)
5. Analysis (interpretation tied to specific data from Results)
6. Templates (reusable prompts, configs, or patterns)
7. Implementation Guide (step-by-step commands)
8. Limitations (honest scope, sample size, generalizability)
9. Conclusions (recommendations tied to findings)
10. Appendices (raw data, tool versions)
## Visual Standards (Section 6)
Check that:
- At least one SVG diagram is included
- Diagrams are referenced from the Methodology or Results section
## IP Protection (Section 2.4)
Flag any disclosure of:
- Proprietary algorithm details or trade secrets
- Internal architecture beyond what's needed for context
- Customer data, usage patterns, or business metrics
- Competitive analysis or strategic positioning
- Unreleased product features or roadmap
## Code & Data Repository (Section 5)
Check that:
- Metadata header includes a Code & Data link to engramforge/research
- Implementation Guide references the research repo for reproduction
- No API keys, secrets, or credentials appear anywhere in the document
- Prompt templates shown in the study match what will be published to the repo
## Output Format
Return your review as:
### PASS / FAIL
### Issues Found
For each issue:
- **Section:** [which guideline section is violated]
- **Location:** [where in the document]
- **Issue:** [what's wrong]
- **Suggested fix:** [specific replacement language or action]
### Summary
- Total issues: [count]
- Blocking issues (must fix): [count]
- Advisory issues (should fix): [count]
If no issues are found, return PASS with a brief confirmation.
---
DOCUMENT TO REVIEW:
[paste your draft study here]
9.2 Pre-Publication Checklist Sub-Prompt
This is a focused, faster check that only evaluates the 41 checklist items from Section 7. Use it as a final gate after the main review has already been passed and edits have been applied.
You are running a final pre-publication checklist on an engineering study.
Evaluate the document below against each item. Mark each ✅ PASS or ❌ FAIL
with a brief note. Do not suggest rewrites — just flag pass/fail.
## Tone & Claims
1. No "first to" / "novel" / "groundbreaking" claims
2. No "little research exists" claims
3. Framed as systems engineering, not academic research
4. Positioned as problem-solving, not discovery
5. Humble about positioning relative to the field
6. No proprietary IP disclosed
7. Subsystems described generically
8. Every claim supported by data in the document
9. Conclusions scoped to what data supports
10. Limitations clearly stated
## Statistical Rigor
11. Multi-run results (n ≥ 2 per condition)
12. Mean ± standard deviation reported for all metrics
13. Confidence intervals reported where n ≥ 3
14. All runs shown (no cherry-picking)
15. Sample sizes disclosed for every result
16. High-variance results flagged and discussed
17. Cost per run reported
18. Failure modes analyzed
19. Raw data included in appendix
20. Reproducible scripts or commands included
## Document Structure
21. Table of Contents with nested subsection links
22. Executive Summary present (≤ half page)
23. Introduction with engineering objectives
24. Methodology with code samples
25. Results with data tables (mean ± std)
26. Analysis tied to specific data points
27. Reusable templates included
28. Implementation guide with commands
29. Limitations section present
30. Appendices with raw data
31. At least one SVG diagram
## Code & Data Repository
32. Study directory exists in engramforge/research
33. Raw data files included (JSON/CSV for all runs)
34. Reproduction scripts included and tested
35. Prompt templates included verbatim
36. README.md in repo directory links to published study
37. Metadata header includes Code & Data link
38. No API keys, credentials, or secrets in any file
39. No proprietary source code or internal IP
40. .gitignore excludes .env, credentials, caches
41. Licensed under MIT
## Output Format
| # | Item | Status | Note |
|----|------|--------|------|
| 1 | No novelty claims | ✅ or ❌ | [brief note] |
| 2 | ... | ... | ... |
...
**Result: PASS (41/41) or FAIL (N/41) — list failing items**
---
DOCUMENT TO REVIEW:
[paste your draft study here]
9.3 Usage
Full review (first draft): Use the main conformance prompt (9.1). It checks everything and provides rewrite suggestions.
Final gate (after edits): Use the checklist sub-prompt (9.2). It's a quick pass/fail with no rewrite suggestions — just confirms readiness.
Recommended workflow:
- Write draft using the skeleton template (Section 8)
- Prepare study directory in the `engramforge/research` repo
- Run main conformance review (9.1)
- Apply fixes to both the study document and repo artifacts
- Run checklist sub-prompt (9.2) as final gate
- Publish when 41/41 items pass
10. Existing Studies
The following published studies follow (or have been updated to follow) these guidelines:
| Study | Project | Code & Data | Status |
|---|---|---|---|
| LLM Code Generation Quality Across Enterprise Frameworks | llm-codebench | engramforge/research/llm-codegen-benchmark | ✅ Conforms |
| Persona Prompt Optimization for LLM Evaluation Systems | llm-evaluator | engramforge/research/persona-prompt-optimization | ✅ Conforms |
New studies should reference this governance document, use the template in Section 8 as a starting point, and create their study directory in the research repository before publication.
Version History
| Version | Date | Changes |
|---|---|---|
| 1.0 | February 2026 | Initial publication. Extracted from llm-codebench and llm-evaluator pilot studies. |
| 1.1 | February 2026 | Added Section 5 (Code & Data Repository). Research artifacts published to engramforge/research under MIT license. Updated checklist to 41 items. Added TOC requirement. |
Note: These guidelines are living standards that evolve as our engineering practice matures. Proposed changes should be reviewed by engineering leadership before adoption.