EngramForge Engineering Research Guidelines
Purpose
This document establishes standards for how EngramForge engineers design, execute, document, and present empirical investigations conducted during platform development. These guidelines exist to ensure:
- Consistency across publications from different teams and projects
- Statistical rigor appropriate to our claims
- Honest positioning of our work relative to the broader field
- Protection of proprietary information while sharing useful methodology
- Reduced cognitive load for readers moving between our publications
All published engineering studies must follow these guidelines before release.
1. Research Lifecycle
Every investigation follows a five-phase cycle rooted in the scientific method, adapted for systems engineering:
Phase 1 — Observe: An inconsistency, variance, or engineering problem is identified during normal development. The observation is documented as a problem statement.
Phase 2 — Hypothesize: The team defines testable engineering questions, identifies the variables to measure, and establishes success criteria before running any experiments.
Phase 3 — Test: Controlled experiments are executed using automated tooling. All results are captured—not just successful ones.
Phase 4 — Analyze: Results are aggregated with appropriate statistical methods (mean, standard deviation, confidence intervals). Findings are scoped to what the data supports.
Phase 5 — Implement: Findings inform platform architecture, code, or configuration decisions. Implementation often surfaces new observations, restarting the cycle.
2. Tone and Positioning
2.1 What We Are
We are a systems engineering organization that uses empirical methods to make data-driven platform decisions. Our publications document this process and share reusable methodology.
2.2 What We Are Not
We are not an academic research lab publishing novel findings. We do not claim to be at the cutting edge of any field. We recognize that much of what we investigate has been studied by others, often more rigorously and at larger scale.
2.3 Mandatory Language Rules
Never use:
- "First to" / "novel" / "groundbreaking" / "revolutionary"
- "No one has" / "little research exists" / "unexplored"
- "We discovered" (when describing known phenomena)
- "State of the art" (referring to our own work)
- "Industry-changing" / "paradigm shift"
Always use:
- "During development of [generic subsystem], we encountered..."
- "We applied [standard method] to our specific context..."
- "In our testing environment, we observed..."
- "This informed our platform architecture by..."
- "These results are specific to our configuration and may vary..."
Example transformations:
| ❌ Don't write | ✅ Do write |
|---|---|
| We discovered a fundamental flaw in how the industry benchmarks LLMs | During benchmark development, we observed significant run-to-run variance in code generation quality |
| No prior work has examined this combination of variables | We needed to understand how these variables interacted in our specific system |
| Our novel entropy control system | We implemented automatic variance detection to improve measurement reliability |
| This changes everything about LLM evaluation | This informed how we structure our evaluation pipeline |
| GPT-4o-mini is unreliable | In our testing configuration, GPT-4o-mini showed higher variance on Python tasks (3.00 ± 1.50 / 5 gates) |
2.4 Intellectual Property Protection
Published studies must describe subsystems generically. Do not disclose:
- Proprietary algorithm details or trade secrets
- Internal system architecture beyond what's needed for context
- Customer data, usage patterns, or business metrics
- Competitive analysis or strategic positioning
- Unreleased product features or roadmap details
Use phrases like "our evaluation pipeline," "the platform's scoring subsystem," or "our code generation benchmark" rather than naming internal tools or describing proprietary implementations.
3. Statistical Standards
3.1 Minimum Requirements
| Requirement | Standard | Rationale |
|---|---|---|
| Minimum runs per condition | 2 | Cannot measure variance with n=1 |
| Preferred runs per condition | 3–5 | Enables confidence interval calculation |
| Central tendency | Report mean | Comparable across studies |
| Dispersion | Report standard deviation | Captures run-to-run variance |
| Confidence intervals | Report 95% CI when n ≥ 3 | Quantifies estimate uncertainty |
| Individual run data | Include in appendix | Enables reader verification |
| Sample size | Always disclose | Readers must assess significance themselves |
3.2 How to Report Results
Always:
Model X achieved 4.67 ± 0.58 / 5 gates (n=3, 95% CI: [4.01, 5.00])
Never:
Model X achieved 5/5 gates
(Unless all runs produced that result, in which case report: "5.00 ± 0.00 / 5 gates (n=3)")
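This reporting format can be produced mechanically. A minimal Python sketch (the `report` helper is illustrative, not a platform tool; it uses a normal-approximation 95% CI clamped to the score range, which reproduces the example above):

```python
import statistics

def report(runs, max_score, label="gates"):
    """Format a result per Section 3.2: mean ± std, sample size,
    and a 95% CI when n >= 3 (normal approximation, clamped to range)."""
    n = len(runs)
    mean = statistics.mean(runs)
    std = statistics.stdev(runs) if n > 1 else 0.0
    line = f"{mean:.2f} ± {std:.2f} / {max_score} {label} (n={n}"
    if n >= 3:
        half_width = 1.96 * std / n ** 0.5
        lo = max(0.0, mean - half_width)
        hi = min(float(max_score), mean + half_width)
        line += f", 95% CI: [{lo:.2f}, {hi:.2f}]"
    return line + ")"

print(report([5, 4, 5], 5))  # -> 4.67 ± 0.58 / 5 gates (n=3, 95% CI: [4.01, 5.00])
```

For n = 2 the CI is omitted automatically, matching the n ≥ 3 requirement in Section 3.1.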
3.3 Handling Variance
When standard deviation exceeds 25% of the mean, the result is considered high-variance and must be flagged:
- Report the individual run scores in the results section (not just the appendix)
- Discuss possible explanations for the variance
- Avoid strong conclusions drawn from high-variance data
- If possible, run additional iterations
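The 25% threshold can be checked automatically before drafting the results section; a sketch (the function name is illustrative):

```python
def is_high_variance(mean, std, threshold=0.25):
    """Flag a result whose standard deviation exceeds 25% of its mean
    (Section 3.3); such results need individual run scores reported."""
    return mean > 0 and std > threshold * mean

# Using example figures from Sections 2.3 and 3.2:
print(is_high_variance(3.00, 1.50))  # True  -- std is 50% of the mean
print(is_high_variance(4.67, 0.58))  # False -- std is ~12% of the mean
```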
3.4 Avoiding Survivorship Bias
During development, engineers naturally run experiments multiple times while debugging tooling and prompts. This creates a history of results where early failures are "explained away" and later successes are reported.
Required practice: When reporting results, include ALL runs from the measurement period, including early failures. If you exclude runs, state how many were excluded and why (e.g., "3 runs were excluded due to API timeout errors unrelated to model quality").
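Tracking exclusions alongside scores makes the required disclosure mechanical; a sketch (the run schema with an `excluded` field is an illustrative assumption, not a platform format):

```python
def exclusion_note(runs):
    """Summarize excluded runs for the results section. Each run is a dict
    whose 'excluded' field holds None or a reason string."""
    excluded = [r["excluded"] for r in runs if r["excluded"]]
    if not excluded:
        return "No runs were excluded."
    reasons = "; ".join(sorted(set(excluded)))
    return f"{len(excluded)} run(s) were excluded ({reasons})."

runs = [
    {"score": 4, "excluded": None},
    {"score": 0, "excluded": "API timeout unrelated to model quality"},
    {"score": 5, "excluded": None},
]
print(exclusion_note(runs))
```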
3.5 Cost Reporting
All studies involving API calls must report:
- Cost per run (or per condition)
- Total cost of the study
- Token usage where available
This enables readers to assess reproducibility economics and helps our team estimate costs for future studies.
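Per-run and total costs can be aggregated from whatever per-call figures the tooling records; a minimal sketch (function name and rounding are illustrative):

```python
def cost_summary(run_costs_usd):
    """Aggregate per-run API costs for Section 3.5 reporting."""
    total = sum(run_costs_usd)
    return {
        "n_runs": len(run_costs_usd),
        "per_run_usd": round(total / len(run_costs_usd), 4),
        "total_usd": round(total, 4),
    }

print(cost_summary([0.02, 0.02, 0.03]))
```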
4. Document Structure
All published engineering studies must use the following structure. Sections may be combined for shorter studies, but the information must be present.
Table of Contents
Every study must include a Table of Contents immediately after the metadata header and context block. The TOC must:
- Use Markdown anchor links (e.g., `[3.1 Task Design](#31-task-design)`)
- List all `##` sections as top-level entries
- List all `###` subsections as nested entries (indented with two spaces)
- Match heading text exactly (anchors are auto-generated: lowercase, spaces → hyphens, punctuation stripped)
This allows readers to navigate long studies and gives reviewers a structural overview at a glance.
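The anchor rule (lowercase, spaces → hyphens, punctuation stripped) can be approximated when generating a TOC; a sketch (the `github_anchor` helper is illustrative and may miss edge cases of GitHub's actual slug algorithm):

```python
import re

def github_anchor(heading):
    """Approximate a GitHub auto-generated heading anchor:
    lowercase, punctuation stripped, whitespace -> hyphens."""
    text = heading.lower()
    text = re.sub(r"[^\w\s-]", "", text)   # strip punctuation
    return re.sub(r"\s+", "-", text.strip())

print(github_anchor("3.1 Task Design"))  # -> 31-task-design
```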
Required Sections
| # | Section | Contents |
|---|---|---|
| 1 | Executive Summary | Key findings (3–5 bullets), summary table of results. No more than half a page. |
| 2 | Introduction | Context within platform development, engineering objectives (not "research questions"), evaluation dimensions. |
| 3 | Methodology | How tests were designed and executed. Include code samples for measurement tools. Enough detail to reproduce. |
| 4 | Results | All experimental data. Tables with mean ± std, sample sizes, costs. No interpretation in this section. |
| 5 | Analysis | Interpretation of results. What patterns emerged, what failed, what was surprising. Tied to specific data. |
| 6 | Templates | Reusable prompts, configurations, or patterns from the study. Readers should be able to adapt these. |
| 7 | Implementation Guide | Step-by-step instructions to reproduce the study. Commands, dependencies, expected output. |
| 8 | Limitations | Honest assessment: sample sizes, scope, generalizability, known confounds. |
| 9 | Conclusions | Actionable recommendations tied to findings. No claims beyond what the data supports. |
| 10 | Appendices | Raw data, individual run details, tool versions, complete prompt text. |
Required Metadata
Every study must include this header:
# [Title]
## [Subtitle describing the scope or approach]
**Project:** [project name]
**Date:** [Month Year]
**Version:** [X.Y]
**Code & Data:** https://github.com/engramforge/research/[study-slug]
The subtitle should describe the study's scope naturally. Avoid repeating the same phrasing across studies. The Code & Data link must point to the study's directory in the research repository (see Section 5).
Terminology
Use "engineering objectives" not "research questions." Use "engineering study" or "pilot study" not "paper" or "publication." Use "findings" not "discoveries." Use "observations" not "results" when the sample size is small.
Timestamps
All timestamps in published documents, generated reports, and data files must either:
- Use ISO-8601 with timezone offset: `2026-02-07T11:11:20-07:00`
- Include a timezone annotation: `2026-02-07 11:11:20 MST`
Bare timestamps without timezone (e.g., 2026-02-07 11:11:20) are not acceptable — readers cannot determine when an experiment actually ran.
- Document metadata dates may use `YYYY-MM-DD` (timezone is implicit from the author)
- Machine-generated timestamps must include an offset or timezone abbreviation
- Run identifiers (e.g., `20260207-085932`) are opaque IDs, not display timestamps — they do not require timezone annotation but should not be presented to readers as times
5. Code & Data Repository
All supporting code, data, and reproducibility artifacts for published studies are hosted in a public GitHub repository, `engramforge/research`.
5.1 Repository Structure
Each study gets a directory named with a short, URL-friendly slug:
engramforge/research/
├── LICENSE # MIT (repository-wide)
├── README.md # Index of all studies
├── llm-codegen-benchmark/ # ← one directory per study
│ ├── README.md # Study overview + link to published doc
│ ├── data/ # Raw results, CSVs, JSON
│ │ ├── entropy_results.json
│ │ └── run_scores.csv
│ ├── scripts/ # Reproduction scripts
│ │ ├── run_benchmark.sh
│ │ └── analyze_results.py
│ ├── prompts/ # Prompt templates used
│ │ ├── fastapi.txt
│ │ └── aspnetcore.txt
│ └── diagrams/ # SVG diagrams from the study
│ └── pipeline.svg
├── persona-prompt-optimization/ # ← another study
│ ├── README.md
│ ├── data/
│ ├── scripts/
│ └── prompts/
└── ...
5.2 Naming Conventions
| Element | Convention | Example |
|---|---|---|
| Study directory | lowercase, hyphens, ≤ 40 chars | llm-codegen-benchmark |
| Data files | descriptive, lowercase, underscores | entropy_results.json |
| Scripts | action verb prefix | run_benchmark.sh, analyze_results.py |
| Prompts | framework or role name | fastapi.txt, frontend-developer.yaml |
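The directory-slug convention can be enforced with a simple check; a sketch (the regex and function name are illustrative):

```python
import re

SLUG_RE = re.compile(r"^[a-z0-9]+(-[a-z0-9]+)*$")

def valid_study_slug(name):
    """Check a study directory name against Section 5.2:
    lowercase, hyphen-separated, at most 40 characters."""
    return len(name) <= 40 and bool(SLUG_RE.match(name))

print(valid_study_slug("llm-codegen-benchmark"))  # True
print(valid_study_slug("LLM_Benchmark"))          # False
```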
5.3 What to Include
Must include:
- Raw result data (JSON, CSV, or YAML) for every run reported in the study
- Scripts or commands sufficient to reproduce the experiments
- Prompt templates exactly as used (not paraphrased)
- A `README.md` linking back to the published study
Must NOT include:
- API keys, tokens, or credentials (even expired ones)
- Proprietary source code from internal repositories
- Customer data or internal business metrics
- Model output containing generated proprietary code
- Large binary files (use Git LFS or link externally if > 10 MB)
5.4 Pre-Publication Scrub for Repository Artifacts
Before pushing to the public research repo, verify:
- `grep -r 'sk-' .` — no API keys in any file
- `grep -r 'password\|secret\|token' .` — no credentials
- `grep -ri 'internal\|confidential\|proprietary' .` — no IP markers
- All scripts use environment variables for API keys, not hardcoded values
- No `.env` files or credential files are included
- `.gitignore` excludes `*.env`, `*.key`, `*.pem`, `__pycache__/`, `.venv/`
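The grep checks can be wrapped in a single gate script that fails the push on any hit; a POSIX sh sketch (the `scrub` function name is illustrative):

```shell
# Pre-publication scrub gate (Section 5.4): returns non-zero on any hit.
scrub() {
  dir="${1:-.}"
  status=0
  grep -rq 'sk-' "$dir" && { echo "FAIL: possible API key ('sk-')"; status=1; }
  grep -rq 'password\|secret\|token' "$dir" && { echo "FAIL: credential marker"; status=1; }
  grep -riq 'internal\|confidential\|proprietary' "$dir" && { echo "FAIL: IP marker"; status=1; }
  [ "$status" -eq 0 ] && echo "scrub passed"
  return "$status"
}

# Usage: scrub path/to/study-dir
```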
5.5 License
The research repository uses the MIT License, consistent with standard practice for published research artifacts from universities and companies. All contributions to the repository are released under this license.
MIT License
Copyright (c) 2026 EngramForge
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
5.6 Linking Studies to Repository
Every published study must:
- Include a `Code & Data` link in its metadata header pointing to its repo directory
- Reference the repo in the Implementation Guide section for reproduction steps
- Ensure the repo directory's `README.md` links back to the published study
6. Diagrams and Visual Standards
6.1 Format
All diagrams must be SVG. Raster images (PNG, JPG) are not acceptable for architectural or process diagrams.
6.2 Style Rules
| Property | Value |
|---|---|
| Background | #FFFFFF (white) |
| Primary text | #1A1A2E |
| Secondary text | #555555 |
| Muted text | #8B8FA3 |
| Primary accent | #4A6FA5 (blue) |
| Secondary accent | #2A8F82 (teal) |
| Alert / negative | #C0392B (red) |
| Success / positive | #E8F5E9 fill, #2A8F82 stroke |
| Light fill | #F0F4FF (blue tint), #FAFAFA (neutral) |
| Font family | -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, Helvetica, Arial, sans-serif |
| Border radius | rx="6" for major boxes, rx="4" for annotations |
| Stroke width | 1.5 for major elements, 1 for annotations |
| Fills | Solid colors only — no gradients, shadows, or ombré effects |
6.3 Arrowhead Specification
<defs>
<marker id="arrow" viewBox="0 0 10 8" refX="9" refY="4"
markerWidth="8" markerHeight="6" orient="auto">
<path d="M0,0 L10,4 L0,8 Z" fill="#4A6FA5"/>
</marker>
</defs>
- Always use `orient="auto"` (not `auto-start-reverse`)
- Always place the `<defs>` block before any element that references markers
- Match the marker fill color to the stroke color of the connecting line
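For reference, a connector that uses this marker might look like the following (the stroke color and width follow the style table; the geometry is illustrative):

```xml
<line x1="20" y1="40" x2="180" y2="40"
      stroke="#4A6FA5" stroke-width="1.5"
      marker-end="url(#arrow)"/>
```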
6.4 Minimum Diagram Count
Each published study must include at least one SVG diagram. Recommended placements:
- System architecture or process flow in the Methodology section
- Results comparison chart in the Results section (optional)
7. Pre-Publication Checklist
Before any study is published, it must pass all four review sections:
Tone & Claims
- No "first to" / "novel" / "groundbreaking" claims
- No "little research exists" claims
- Framed as systems engineering, not academic research
- Positioned as problem-solving, not discovery
- Humble about our positioning relative to the field
- No proprietary IP disclosed
- Subsystems described generically
- Every claim supported by data presented in the document
- Conclusions scoped to what data supports
- Limitations clearly stated
Statistical Rigor
- Multi-run results (n ≥ 2 per condition)
- Mean ± standard deviation reported for all metrics
- Confidence intervals reported where n ≥ 3
- All runs shown (no cherry-picking)
- Sample sizes disclosed for every result
- High-variance results flagged and discussed
- Cost per run reported
- Failure modes analyzed
- Raw data included in appendix
- Reproducible scripts or commands included
Document Structure
- Table of Contents with nested subsection links
- Executive Summary present (≤ half page)
- Introduction with engineering objectives
- Methodology with code samples
- Results with data tables (mean ± std)
- Analysis tied to specific data points
- Reusable templates included
- Implementation guide with commands
- Limitations section present
- Appendices with raw data
- At least one SVG diagram
Code & Data Repository
- Study directory created in `engramforge/research`
- Raw data files included (JSON/CSV for all runs)
- Reproduction scripts included and tested
- Prompt templates included verbatim
- `README.md` in study directory links to published doc
- Metadata header includes `Code & Data` link
- No API keys, credentials, or secrets in any file
- No proprietary source code or internal IP
- `.gitignore` excludes `.env`, credentials, caches
- Licensed under MIT (inherited from repo root)
8. Prompt Template for New Studies
When starting a new engineering study, use this prompt template to generate the initial document skeleton. Adapt the placeholders to your specific investigation.
# [Title]
## [Subtitle — scope, method, or platform area]
**Project:** [repository name]
**Date:** [Month Year]
**Version:** 1.0
**Code & Data:** https://github.com/engramforge/research/[study-slug]
---
## Executive Summary
During the development of [generic platform component], we encountered
[specific engineering problem]. We applied [standard methodology] to
empirically determine [what we needed to learn].
This document describes our engineering approach, test methodology, and
findings. We share these methods to demonstrate our data-driven approach
to platform development decisions.
**Key Findings:**
- [Finding 1 with metric: X.XX ± Y.YY (n=Z)]
- [Finding 2 with metric]
- [Finding 3 with metric]
---
## 1. Introduction
### 1.1 Context
As part of [generic platform description], we implemented [subsystem]
that [what it does, described generically]. During integration testing,
we observed [the problem]. This led us to design a systematic test to
understand [what we needed to measure].
### 1.2 Engineering Objectives
Our platform development required answers to specific technical questions:
1. [Question about measurable variable]
2. [Question about configuration choice]
3. [Question about comparative performance]
This work represents a practical application of scientific method to
systems engineering — measuring variables, testing under controlled
conditions, and making data-driven architectural decisions.
### 1.3 System Overview
[Generic description of the subsystem. No proprietary details.
Enough context for a reader to understand what was tested and why.]
---
## 2. Methodology
### 2.1 Test Design
<!-- Describe: what was held constant, what was varied, how
outcomes were measured. -->
### 2.2 Measurement Tools
```python
# Replace with actual code used for measurement
def measure(...):
...
```
### 2.3 Test Infrastructure
<!-- Describe: automated testing setup, how variations were
generated, executed, and scored. -->
---
## 3. Results
### 3.1 Summary Table
| Condition | Metric | n | Mean ± Std | 95% CI | Cost |
|-----------|--------|---|------------|--------|------|
| _example_ | _gates passed_ | _3_ | _4.67 ± 0.58_ | _[4.01, 5.00]_ | _$0.02_ |
### 3.2 Detailed Results
<!-- Per-condition breakdown. Include individual run data
for any high-variance results. -->
---
## 4. Analysis
<!-- Interpretation tied to specific data from Section 3.
Every claim must reference a table or figure above. -->
---
## 5. Templates
<!-- Include reusable prompts, configurations, or patterns
that readers can adapt for their own systems. -->
---
## 6. Implementation Guide
```bash
# Replace with step-by-step reproduction instructions
git clone ...
pip install ...
python run_experiment.py --flag value
```
---
## 7. Limitations
1. <!-- Sample size limitation -->
2. <!-- Scope / generalizability limitation -->
3. <!-- Configuration specificity -->
---
## 8. Conclusions
<!-- Actionable recommendations. Each must tie back to a
specific finding in the Results or Analysis sections. -->
---
## Appendices
### Appendix A: Raw Data
<!-- Complete run-by-run data for all conditions. -->
### Appendix B: Tool Versions
| Tool | Version |
|------|---------|
9. Conformance Validation Prompts
Use these prompts to validate a draft study against these guidelines before publication. The main prompt performs a full conformance review; the sub-prompt focuses specifically on the pre-publication checklist.
9.1 Main Conformance Review Prompt
Copy the full text of your draft study and paste it after this prompt. The reviewer will assess the document against all sections of RESEARCH_GUIDELINES.md and return a structured report.
You are a technical editor reviewing an engineering study for publication
on the EngramForge site. Your job is to check the document against our
internal research guidelines and flag any issues.
Review the document below against ALL of the following criteria.
## Tone & Positioning (Section 2)
Check for prohibited language. Flag ANY instance of:
- "first to", "novel", "groundbreaking", "revolutionary", "paradigm shift"
- "no one has", "little research exists", "unexplored", "unique"
- "we discovered" (when describing known phenomena)
- "state of the art" (referring to our own work)
- "industry-changing"
Check that the document:
- Frames work as systems engineering, not academic research
- Positions findings as problem-solving, not discovery
- Uses "engineering objectives" not "research questions"
- Uses "findings" not "discoveries"
- Uses "observations" not "results" when sample size is small
- Describes subsystems generically (no proprietary names/details)
- Scopes every claim to the data that supports it
## Statistical Rigor (Section 3)
Check that:
- All reported metrics use multi-run data (n ≥ 2)
- Format is "mean ± std" for every numeric result
- Confidence intervals are included where n ≥ 3
- ALL runs are reported (no cherry-picking)
- Sample sizes (n) are disclosed for every result
- High-variance results (std > 25% of mean) are flagged and discussed
- Cost per run is reported for any API-based experiment
- Failure modes are analyzed, not just success rates
- Raw data appears in an appendix
- Reproduction steps are included (scripts or commands)
## Document Structure (Section 4)
Check that a Table of Contents exists with nested anchor links for all ## and ### headings.
Check that these sections exist and contain appropriate content:
1. Executive Summary (≤ half page, key findings with metrics)
2. Introduction (context, engineering objectives, system overview)
3. Methodology (test design, code samples, infrastructure)
4. Results (data tables with mean ± std, no interpretation)
5. Analysis (interpretation tied to specific data from Results)
6. Templates (reusable prompts, configs, or patterns)
7. Implementation Guide (step-by-step commands)
8. Limitations (honest scope, sample size, generalizability)
9. Conclusions (recommendations tied to findings)
10. Appendices (raw data, tool versions)
## Visual Standards (Section 6)
Check that:
- At least one SVG diagram is included
- Diagrams are referenced from the Methodology or Results section
## IP Protection (Section 2.4)
Flag any disclosure of:
- Proprietary algorithm details or trade secrets
- Internal architecture beyond what's needed for context
- Customer data, usage patterns, or business metrics
- Competitive analysis or strategic positioning
- Unreleased product features or roadmap
## Code & Data Repository (Section 5)
Check that:
- Metadata header includes a Code & Data link to engramforge/research
- Implementation Guide references the research repo for reproduction
- No API keys, secrets, or credentials appear anywhere in the document
- Prompt templates shown in the study match what will be published to the repo
## Output Format
Return your review as:
### PASS / FAIL
### Issues Found
For each issue:
- **Section:** [which guideline section is violated]
- **Location:** [where in the document]
- **Issue:** [what's wrong]
- **Suggested fix:** [specific replacement language or action]
### Summary
- Total issues: [count]
- Blocking issues (must fix): [count]
- Advisory issues (should fix): [count]
If no issues are found, return PASS with a brief confirmation.
---
DOCUMENT TO REVIEW:
[paste your draft study here]
9.2 Pre-Publication Checklist Sub-Prompt
This is a focused, faster check that only evaluates the 41 checklist items from Section 7. Use it as a final gate after the main review has already been passed and edits have been applied.
You are running a final pre-publication checklist on an engineering study.
Evaluate the document below against each item. Mark each ✅ PASS or ❌ FAIL
with a brief note. Do not suggest rewrites — just flag pass/fail.
## Tone & Claims
1. No "first to" / "novel" / "groundbreaking" claims
2. No "little research exists" claims
3. Framed as systems engineering, not academic research
4. Positioned as problem-solving, not discovery
5. Humble about positioning relative to the field
6. No proprietary IP disclosed
7. Subsystems described generically
8. Every claim supported by data in the document
9. Conclusions scoped to what data supports
10. Limitations clearly stated
## Statistical Rigor
11. Multi-run results (n ≥ 2 per condition)
12. Mean ± standard deviation reported for all metrics
13. Confidence intervals reported where n ≥ 3
14. All runs shown (no cherry-picking)
15. Sample sizes disclosed for every result
16. High-variance results flagged and discussed
17. Cost per run reported
18. Failure modes analyzed
19. Raw data included in appendix
20. Reproducible scripts or commands included
## Document Structure
21. Table of Contents with nested subsection links
22. Executive Summary present (≤ half page)
23. Introduction with engineering objectives
24. Methodology with code samples
25. Results with data tables (mean ± std)
26. Analysis tied to specific data points
27. Reusable templates included
28. Implementation guide with commands
29. Limitations section present
30. Appendices with raw data
31. At least one SVG diagram
## Code & Data Repository
32. Study directory exists in engramforge/research
33. Raw data files included (JSON/CSV for all runs)
34. Reproduction scripts included and tested
35. Prompt templates included verbatim
36. README.md in repo directory links to published study
37. Metadata header includes Code & Data link
38. No API keys, credentials, or secrets in any file
39. No proprietary source code or internal IP
40. .gitignore excludes .env, credentials, caches
41. Licensed under MIT
## Output Format
| # | Item | Status | Note |
|----|------|--------|------|
| 1 | No novelty claims | ✅ or ❌ | [brief note] |
| 2 | ... | ... | ... |
...
**Result: PASS (41/41) or FAIL (N/41) — list failing items**
---
DOCUMENT TO REVIEW:
[paste your draft study here]
9.3 Usage
Full review (first draft): Use the main conformance prompt (9.1). It checks everything and provides rewrite suggestions.
Final gate (after edits): Use the checklist sub-prompt (9.2). It's a quick pass/fail with no rewrite suggestions — just confirms readiness.
Recommended workflow:
- Write draft using the skeleton template (Section 8)
- Prepare study directory in the `engramforge/research` repo
- Run main conformance review (9.1)
- Apply fixes to both the study document and repo artifacts
- Run checklist sub-prompt (9.2) as final gate
- Publish when 41/41 items pass
10. Existing Studies
The following published studies follow (or have been updated to follow) these guidelines:
| Study | Project | Code & Data | Status |
|---|---|---|---|
| LLM Code Generation Quality Across Enterprise Frameworks | llm-codebench | engramforge/research/llm-codegen-benchmark | ✅ Conforms |
| Persona Prompt Optimization for LLM Evaluation Systems | llm-evaluator | engramforge/research/persona-prompt-optimization | ✅ Conforms |
New studies should reference this governance document, use the template in Section 8 as a starting point, and create their study directory in the research repository before publication.
Version History
| Version | Date | Changes |
|---|---|---|
| 1.0 | February 2026 | Initial publication. Extracted from llm-codebench and llm-evaluator pilot studies. |
| 1.1 | February 2026 | Added Section 5 (Code & Data Repository). Research artifacts published to engramforge/research under MIT license. Updated checklist to 41 items. Added TOC requirement. |
Note: These guidelines are living standards that evolve as our engineering practice matures. Proposed changes should be reviewed by engineering leadership before adoption.