Pilot Study: Persona Prompt Optimization for LLM Evaluation Systems
A Systems Engineering Approach to Empirical Prompt Testing
Executive Summary
During the development of an LLM evaluation platform, we encountered inconsistent results when manually adjusting persona prompts used to guide model responses. Rather than rely on intuition, we built automated testing infrastructure and applied systematic A/B testing to empirically determine which prompt patterns correlated with improved evaluation scores.
This document describes our engineering approach, test methodology, and findings. We share these methods and results to demonstrate a data-driven approach to prompt optimization in systems development.
Key Findings (n=2 per variation, 8 variations):
- Hybrid prompts (production scenarios + quantified metrics): 72.3 ± 1.5 vs. baseline 67.6 ± 5.2 (+4.6 points)
- Concise prompts (reduced verbosity): 70.7 ± 0.2 (+3.1 points)
- Model-specific XML syntax: 67.6 ± 6.5 (no measurable benefit, high variance)
- Markdown structure instructions: 66.2 ± 1.7 (−1.4 points)
- Total test cost: $0.37 across 16 evaluations
1. Introduction
1.1 Context
As part of our platform development, we implemented an LLM evaluation system that uses persona prompts to guide model responses toward domain-specific expertise. During integration testing, we observed inconsistent results when manually adjusting these prompts. This led us to develop a systematic testing approach to understand which prompt characteristics correlated with improved evaluation scores.
1.2 Engineering Objectives
Our platform development required answers to specific technical questions:
- Which prompt content patterns correlate with higher evaluation scores in our multi-dimensional assessment system?
- Do model-specific syntax preferences (such as XML tags) improve evaluation performance in practice?
- Can we build automated testing infrastructure to systematically validate prompt variations?
- What quantifiable characteristics differentiate effective vs. ineffective prompt structures?
This work represents a practical application of the scientific method to systems engineering—measuring variables, testing hypotheses, and making data-driven architectural decisions.
1.3 Evaluation System Overview
Our evaluation pipeline assesses responses across seven dimensions, each scoring 0–100:
- Accuracy — Technical correctness
- Cognitive Complexity — Problem-solving depth
- Redundancy — Communication efficiency (inverse scoring)
- Optimization — Performance awareness
- Security — Security consciousness
- Understanding — Domain knowledge depth
- Experience Depth — Practical vs. theoretical knowledge
Weighted averaging produces final composite scores. This multi-dimensional approach allows us to measure different aspects of response quality independently.
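To make the composite concrete, here is a minimal sketch of weighted averaging over the seven dimensions. The weights shown are illustrative placeholders, not our production values:

def composite_score(dimension_scores: dict, weights: dict) -> float:
    """Weighted average of 0-100 dimension scores."""
    total_weight = sum(weights.values())
    return sum(weights[d] * s for d, s in dimension_scores.items()) / total_weight

# Hypothetical weights for illustration only.
EXAMPLE_WEIGHTS = {
    'accuracy': 0.20, 'cognitive_complexity': 0.15,
    'redundancy': 0.10,  # inverse-scored: higher = less redundant
    'optimization': 0.10, 'security': 0.10,
    'understanding': 0.15, 'experience_depth': 0.20,
}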
2. Methodology
2.1 Initial Hypothesis and Testing
We began with a hypothesis based on reviewing the scoring algorithms: adding first-person language, production scenarios, and specific metrics should improve the experience_depth dimension.
Initial Changes Applied:
- Increased first-person phrases ("I've implemented", "In my experience")
- Added production scenarios ("troubleshooting", "deployment", "monitoring")
- Included lessons learned ("I discovered", "I realized")
- Added structured section headers
- Specified tool versions and metrics
Test Scope: 5 roles with 27–35 questions each (frontend-developer, ux-ui-designer, technical-lead, release-manager, qa-automation-engineer)
2.2 Building Measurement Tools
To understand what differentiated successful from unsuccessful optimizations, we built automated analysis tools to quantify prompt characteristics:
import re

def analyze_persona_text(text: str) -> dict:
    """Extract quantifiable features from persona text."""
    # First-person phrases
    first_person_patterns = [
        r"I've\s+\w+", r"I\s+\w+ed", r"In my experience",
        r"I discovered", r"I learned", r"I realized",
    ]
    # Production scenarios
    production_patterns = [
        r"production", r"deployment", r"troubleshooting",
        r"debugging", r"monitoring", r"incident",
    ]
    # Lessons learned
    lessons_patterns = [
        r"learned", r"discovered", r"realized",
        r"mistake", r"pitfall", r"challenge",
    ]

    def count(patterns: list) -> int:
        return sum(len(re.findall(p, text, re.IGNORECASE)) for p in patterns)

    return {
        'first_person_count': count(first_person_patterns),
        'production_count': count(production_patterns),
        'lessons_count': count(lessons_patterns),
        # Metrics and specific improvements (e.g. "60%", "10x", "2 to 8")
        'metrics_count': len(re.findall(r'\d+%|\d+x|\d+ to \d+', text)),
        # Structured sections ("## Header")
        'section_count': len(re.findall(r'##\s+\w+', text)),
        'word_count': len(text.split()),
    }
2.3 Automated Testing Infrastructure
The mixed results from manual optimization led us to build automated testing infrastructure. This allowed us to systematically test multiple variations and measure outcomes empirically.
System Design:
- Variation Generator — Programmatically generates prompt variations (a minimal sketch follows this list):
  - baseline — Current production prompt
  - more_production — Enhanced production scenarios
  - more_metrics — Added quantified improvements
  - more_first_person — Increased first-person language
  - concise — Reduced verbosity
  - hybrid — Combined production + metrics
  - claude_xml — Model-specific XML syntax
  - markdown_structure — Emphasis on markdown formatting
- Test Harness — Runs each variation through the evaluation pipeline with N questions (typically 2–3 for rapid iteration)
- Metrics Collection — Captures:
  - Average score, min, max, and standard deviation
  - Token usage (input/output/thinking)
  - Processing speed (tokens per second)
  - Cost and elapsed time
  - Model identifier
- Results Analysis — Ranks variations and identifies patterns
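A minimal sketch of the variation generator's shape, assuming simple string transformations over the base prompt (our actual transformation functions are more involved; the appended sentences below are placeholders):

class PersonaVariationGenerator:
    """Generates named prompt variations from a base persona prompt."""

    def __init__(self, base_persona: str):
        self.base = base_persona

    def _add_production(self, text: str) -> str:
        # Placeholder transform: append production-scenario language.
        return text + "\nI've troubleshot production incidents under real load."

    def _add_metrics(self, text: str) -> str:
        # Placeholder transform: append a quantified improvement.
        return text + "\nI've reduced deployment time from 2 hours to 8 minutes."

    def generate_variations(self) -> dict:
        return {
            'baseline': self.base,
            'more_production': self._add_production(self.base),
            'more_metrics': self._add_metrics(self.base),
            'hybrid': self._add_metrics(self._add_production(self.base)),
            # ... concise, more_first_person, claude_xml, markdown_structure
        }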
2.4 Test Infrastructure
import asyncio
import json
import time
from typing import Dict

class VariationTester:
    """Tests persona variations and measures performance."""

    async def test_variation(self, variation_name: str,
                             variation_prompt: str) -> Dict:
        # Apply variation temporarily
        self.apply_variation(variation_name, variation_prompt)

        # Run evaluation pipeline as subprocess
        start = time.monotonic()
        proc = await asyncio.create_subprocess_exec(
            'python', 'generate_training.py',
            '--role', self.role_id,
            '--questions-per-role', str(self.questions_per_test),
            stdout=asyncio.subprocess.PIPE,
        )
        stdout, _ = await proc.communicate()
        elapsed_time = time.monotonic() - start

        # Capture metrics; this sketch assumes the pipeline prints
        # a JSON summary to stdout
        metrics = json.loads(stdout)
        return {
            'variation': variation_name,
            'avg_score': metrics['avg_score'],
            'std_score': metrics['std_score'],
            'tokens_per_sec': metrics['tokens_per_sec'],
            'elapsed_time': elapsed_time,
            'total_cost': metrics['total_cost'],
        }
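A usage sketch for driving the harness across all variations and ranking results, best first (the repository script wires this up with result persistence; this driver is illustrative):

async def run_all_variations(tester: VariationTester,
                             variations: dict) -> list:
    """Test every variation and rank by average score, descending."""
    results = []
    for name, prompt in variations.items():
        results.append(await tester.test_variation(name, prompt))
    return sorted(results, key=lambda r: r['avg_score'], reverse=True)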
2.5 System Architecture
The complete testing framework operates as an automated pipeline:
Figure 1: Automated testing pipeline showing variation generation, test execution, metrics collection, and results analysis. The system processes multiple prompt variations systematically, measuring performance across consistent test sets.
3. Results
3.1 Initial Manual Optimization Observations
Manual optimization across 5 roles (n=1 per condition — no variance data available):
| Role | Baseline | Optimized | Δ | Outcome |
|---|---|---|---|---|
| frontend-developer | 69.3 | 69.2 | −0.1 | No improvement |
| ux-ui-designer | 71.7 | 71.1 | −0.6 | Decline |
| technical-lead | 72.9 | 72.7 | −0.2 | No improvement |
| release-manager | 72.9 | 74.5 | +1.6 | Improved |
| qa-automation-engineer | 74.1 | 74.9 | +0.8 | Improved |
Note: These initial tests used n=1 per condition and therefore lack variance data. They are reported as directional observations only and should not be used to draw strong conclusions.
3.2 Pattern Correlation Observations
Comparing the 2 successful vs. 3 unsuccessful manual optimizations:
| Pattern | Successful Avg | Unsuccessful Avg | Difference |
|---|---|---|---|
| Production scenario terms | 17.5 | 11.7 | +5.8 |
| Specific metrics | 7.0 | 5.0 | +2.0 |
| First-person phrases | 16.0 | 14.7 | +1.3 |
| Lessons learned phrases | 7.0 | 6.7 | +0.3 |
| Section headers | 4.0 | 3.7 | +0.3 |
| Total word count | 241 | 212 | +29 |
Note: With only 2 successful and 3 unsuccessful samples, these correlations are directional only. They guided our automated variation design but should not be interpreted as statistically significant.
3.3 Automated Variation Testing Results
Testing 8 systematic variations on frontend-developer role (n=2 per variation, 16 total evaluations):
Figure 2: Comparative scores for all 8 variations. Teal bars indicate improvements over baseline; gray is baseline; red bars show no improvement or decline. Dashed line marks baseline score.
All Runs — Individual Scores:
| Variation | Run 1 | Run 2 | Mean ± Std | Δ vs Baseline | Cost | Time |
|---|---|---|---|---|---|---|
| hybrid | 71.2 | 73.3 | 72.3 ± 1.5 | +4.6 | $0.047 | 60.8s |
| concise | 70.6 | 70.9 | 70.7 ± 0.2 | +3.1 | $0.047 | 61.9s |
| more_production | 69.7 | 71.3 | 70.5 ± 1.1 | +2.9 | $0.044 | 57.6s |
| more_metrics | 68.9 | 71.0 | 69.9 ± 1.5 | +2.3 | $0.044 | 60.1s |
| more_first_person | 67.0 | 72.1 | 69.5 ± 3.6 | +1.9 | $0.049 | 62.3s |
| baseline | 63.9 | 71.3 | 67.6 ± 5.2 | — | $0.046 | 63.9s |
| claude_xml | 63.0 | 72.1 | 67.6 ± 6.5 | 0.0 | $0.042 | 55.1s |
| markdown_structure | 65.0 | 67.3 | 66.2 ± 1.7 | −1.4 | $0.046 | 63.3s |
High-variance observations: baseline (std=5.2) and claude_xml (std=6.5) showed the highest run-to-run variance. With n=2 these variance estimates are themselves uncertain; additional runs would be needed to confirm whether these variations are inherently less consistent.
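(With n=2, the sample standard deviation reduces to |run₁ − run₂| / √2; for example, baseline: |71.3 − 63.9| / √2 ≈ 5.2.)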
Total cost: $0.37 for 16 evaluations across 8 variations.
Model: Claude Sonnet 4 (Anthropic API) for all runs.
4. Analysis
4.1 Content vs. Syntax in Practice
Model-specific XML syntax (<thinking>, <analysis> tags) scored 67.6 ± 6.5 — identical mean to baseline (67.6 ± 5.2) but with even higher variance (Section 3.3).
In our evaluation pipeline, what the prompt says matters more than how it is formatted. This observation may not generalize to other contexts (direct conversation, code generation, etc.), but for our assessment system, content patterns dominated over syntax preferences.
4.2 Why the Hybrid Pattern Scored Highest
The winning "hybrid" variation combined the two strongest correlations from our pattern analysis (Section 3.2):
- Production scenario terms (+5.8 difference in successful vs. unsuccessful prompts)
- Quantified metrics (+2.0 difference)
- Maintained first-person language (+1.3 difference)
At 72.3 ± 1.5, it also had relatively low variance compared to baseline (5.2) and claude_xml (6.5), suggesting more consistent performance.
Sample hybrid prompt excerpt:
I've deployed these solutions in production environments and monitored their
performance under real load. Through troubleshooting production incidents, I've
learned to prioritize monitoring and observability. I've maintained systems
handling millions of requests and discovered that proactive monitoring catches
issues before they impact users.
I've built production Kubernetes clusters serving 10M+ requests/day with 99.99%
uptime. I've reduced deployment time from 2 hours to 8 minutes and increased
deployment frequency from weekly to 50+ deploys/day.
4.3 Observations on Verbosity
"Concise" scored second (70.7 ± 0.2) with the lowest variance of any variation. This suggests:
- Shorter prompts may reduce noise in evaluation
- Focus on high-value patterns matters more than volume
- Our evaluation system's redundancy dimension may penalize verbose prompts
4.4 Structured Output Impact
Instructing models to use specific markdown structure scored lowest (66.2 ± 1.7, −1.4 vs. baseline). Possible explanations in our evaluation context:
- Models allocated processing to formatting rather than content depth
- Structural requirements may have constrained natural expression
- Our evaluation scoring weights substance over presentation format
This observation is specific to our assessment pipeline and may differ in other applications.
4.5 Performance Metrics
The winning variation (hybrid) generated output at a moderate 5.6 tokens/sec. The fastest variation (more_first_person at 7.4 tokens/sec) scored 5th, indicating that in our evaluation context, generation speed did not correlate with quality scores.
5. Prompt Templates
5.1 Baseline Template
persona_prompt: |
  You are a [Role] at [Company Name] with [X]+ years of experience
  [domain expertise description]. You specialize in [key technologies]
  and have [notable achievements].

  Background: [Technical skills and experience description]

  Communication: [Communication style guidance]

  Answer questions with [guidance on content approach].
5.2 Optimized Hybrid Template (Highest Scoring)
persona_prompt: |
  You are a [Role] at [Company Name] with [X]+ years of experience.
  I've [specific achievement with metrics] serving [scale metric] users.

  Background: I've worked with [specific tech + versions] in production,
  [concrete accomplishment]. I've [quantified improvement: from X to Y] and
  [metric improvement with %]. I've troubleshot [production scenario]
  and learned that [key insight].

  In my experience, the most critical issues come from [root cause pattern].
  I discovered that [specific technique] can [quantified benefit]. I've
  deployed [specific implementation] and [production result].

  Through [maintenance/deployment/incidents], I found that [system insight].
  I've made the mistake of [anti-pattern]—it taught me that [lesson learned].
  I discovered that [best practice] [prevents/enables] [outcome].

  Communication: Structure answers with ## [Section1], ## [Section2],
  ## [Section3]. Use first-person experience ("I've built", "In production,
  I found"). Include specific metrics and real-world debugging scenarios.

  Answer questions with concrete examples from production, lessons learned
  from maintenance, troubleshooting approaches I've used, and specific
  tools/metrics that solved real problems.
5.3 Key Pattern Elements
Production Scenarios (~17 mentions target):
production deployment, troubleshooting incidents, debugging issues,
monitoring systems, maintenance windows, incident response,
production load, real-world constraints, deployment pipeline
Quantified Metrics (7+ mentions):
from X to Y, improved by Z%, N+ years, N+ users, 99.X% uptime,
reduced from X seconds to Y seconds, increased throughput by Nx
First-Person Experience (~16 mentions):
I've implemented, I've built, I've deployed, In my experience,
I discovered, I learned, I realized, I found, through my work,
our team, at my previous role
Lessons Learned (7+ mentions):
learned that, discovered that, realized through, found out,
mistake I made, pitfall to avoid, challenge we faced,
taught me that
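These targets can be checked mechanically with the analyzer from Section 2.2. A sketch, where draft_prompt is a placeholder for the persona text under review and the thresholds mirror the targets above:

draft_prompt = "..."  # placeholder: persona text under review
features = analyze_persona_text(draft_prompt)
targets = {
    'production_count': 17,    # production scenario terms (~17)
    'metrics_count': 7,        # quantified metrics (7+)
    'first_person_count': 16,  # first-person phrases (~16)
    'lessons_count': 7,        # lessons-learned phrases (7+)
}
for key, target in targets.items():
    status = 'OK' if features[key] >= target else 'LOW'
    print(f"{key}: {features[key]} (target {target}) -> {status}")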
5.4 Anti-Patterns (Lowest Scoring)
Based on lowest-scoring variations in our testing:
❌ Explicit syntax instructions:
Structure your responses using XML tags:
<thinking>Your analysis</thinking>
<implementation>Code details</implementation>
Observation: No improvement, identical mean to baseline but higher variance (Section 3.3)
❌ Over-specified markdown formatting:
Always use clear markdown structure:
- Start with ## for main sections
- Use ### for subsections
- Bold **key concepts** and *italicize* emphasis
Observation: −1.4 point decrease (Section 3.3)
❌ Excessive verbosity:
You are a highly experienced [Role] with deep expertise across
multiple domains including [long list]. You have comprehensive
knowledge of [extensive enumeration]...
Observation: Concise prompts scored +3.1 higher (Section 3.3)
6. Implementation Guide
6.1 Prerequisites
| Tool | Version | Purpose |
|---|---|---|
| Python | 3.12+ | Runtime |
| pip | latest | Package management |
| Anthropic API key | — | LLM API access |
6.2 Running Variation Tests
Reproduction scripts and data are available in the research repository:
# Clone the research repository
git clone https://github.com/engramforge/research.git
cd research/persona-prompt-optimization
# Create virtual environment and install dependencies
python -m venv .venv
source .venv/bin/activate
pip install -r scripts/requirements.txt
# Set API credentials
export ANTHROPIC_API_KEY="your-key-here"
# Run automated testing on a single role
python scripts/test_persona_variations.py \
--role frontend-developer \
--questions 5
# Results saved to:
# - variation_test_results_[role]_[timestamp].json
# - Console output with ranked results
6.3 Cost Estimation
Per role testing costs:
- 8 variations × 5 questions = 40 evaluations
- ~$0.14 per evaluation (with chain-of-thought)
- Total: ~$5.60 per role
For all 17 roles: ~$95.20
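The same arithmetic as a small helper, for planning runs at other scales (the per-evaluation cost is the ~$0.14 estimate above):

def estimate_cost(roles: int = 1, variations: int = 8,
                  questions: int = 5, cost_per_eval: float = 0.14) -> float:
    """Estimated API cost in dollars for a variation-testing run."""
    return roles * variations * questions * cost_per_eval

print(estimate_cost())          # 5.6  -> ~$5.60 per role
print(estimate_cost(roles=17))  # 95.2 -> ~$95.20 for all 17 roles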
6.4 Integration Workflow
import asyncio

from adaptive_persona_optimizer import PersonaVariationGenerator
# Module name assumed from scripts/test_persona_variations.py
from test_persona_variations import VariationTester

async def optimize(base_persona: str) -> dict:
    # 1. Generate variations
    generator = PersonaVariationGenerator(base_persona)
    variations = generator.generate_variations()

    # 2. Test each variation
    tester = VariationTester(role_id='your-role', questions_per_test=5)
    results = await tester.test_all_variations()

    # 3. Apply winner
    winner = max(results, key=lambda x: x['avg_score'])
    # Update persona file with winner['prompt']
    return winner
7. Limitations
7.1 Sample Size
All automated variation tests used n=2 per condition (16 total evaluations). This limits our ability to calculate meaningful confidence intervals and means our variance estimates are themselves uncertain. For production deployment, we recommend n=5–10 per variation.
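(As a rough guide, the standard error of a mean shrinks as s/√n: at the baseline's observed s = 5.2, moving from n=2 to n=5 would cut the standard error from ≈3.7 to ≈2.3 points, assuming similar run-to-run spread.)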
7.2 Role Coverage
Testing focused on the frontend-developer role. Different role categories may exhibit different patterns:
- Infrastructure roles (DevOps, Backend): May benefit more from production scenarios
- Leadership roles (Product Manager, Tech Lead): May benefit from team/coordination language
- Design roles (UX/UI): May benefit from user research and iteration language
7.3 Model and System Dependencies
All tests used Claude Sonnet 4 via Anthropic API. Results reflect this specific configuration and may vary with:
- Different model versions or families
- Different evaluation scoring algorithms
- Different question types or domains
These findings informed our platform architecture but should be validated for different system configurations.
7.4 Initial Manual Testing
The 5-role manual optimization (Section 3.1) used n=1 per condition, providing no variance data. Those observations should be treated as directional only.
7.5 Measurement Framework Evolution
We are developing enhanced metrics for ongoing optimization:
from dataclasses import dataclass

@dataclass
class OptimizationMetrics:
    """Comprehensive metrics for persona optimization."""
    # Performance metrics
    score_improvement: float     # Points gained vs baseline
    score_variance: float        # Consistency across questions
    cost_efficiency: float       # Score improvement per dollar
    # Pattern metrics (floats: counts normalized per 100 words)
    production_density: float    # Production terms per 100 words
    metric_density: float        # Quantified metrics per 100 words
    first_person_density: float  # First-person phrases per 100 words
    # Efficiency metrics
    words_per_point: float       # Prompt verbosity vs score
    tokens_per_second: float     # Generation efficiency
    thinking_token_ratio: float  # Thinking tokens / total tokens
8. Conclusions
8.1 Engineering Takeaways
- Systematic testing identified improvements that manual optimization missed — Automated variation testing found +4.6 points (72.3 ± 1.5 vs. 67.6 ± 5.2) where manual attempts showed mixed results (Section 3.1 vs. 3.3)
- Content patterns outperformed syntax preferences — In our evaluation context, production scenarios and metrics correlated with higher scores while XML formatting provided no benefit (Section 4.1)
- Combined patterns scored highest — The hybrid approach combining production scenarios + metrics outperformed any single-pattern optimization (Section 4.2)
- Model syntax preferences are context-dependent — XML tags scored identically to baseline in our assessment pipeline, though they may help in other contexts (Section 4.1)
- Measurement infrastructure enables data-driven decisions — Building quantifiable metrics and automated testing allowed us to move from intuition to empirical optimization (Section 2.3)
8.2 Platform Implementation Guidance
For similar systems:
- Use automated variation testing rather than manual prompt tuning
- Focus on content patterns relevant to your evaluation dimensions
- Test with sufficient sample sizes (n=5–10 per variation minimum)
- Validate across multiple representative roles before generalizing
- Measure what matters in your specific context
For our platform:
- Deploy hybrid template (production + metrics) as baseline
- Avoid over-specifying output format
- Target ~17 production-related terms per prompt
- Include 7+ quantified metrics
- Use ~16 first-person phrases
- Aim for 200–250 words total
8.3 Shared Methodology
All code, data, and prompt templates are available in the research repository:
github.com/engramforge/research/tree/main/persona-prompt-optimization
We share these tools and findings to demonstrate an empirical approach to prompt optimization in systems engineering. This work represents a practical application of the scientific method to platform development challenges. Organizations building similar systems may find these methods useful for their own optimization work.
Appendix A: Sample Persona Prompts
A.1 Baseline Frontend Developer Persona
persona_prompt: |
  You are a Frontend Developer at [Company Name] with 7+ years of experience
  building modern web applications. I've implemented production systems serving
  50,000+ daily users with sub-2-second load times.

  Background: I've worked with React 18+ and TypeScript 5.x in production, building
  enterprise applications. I've optimized bundle sizes from 1.2MB to 380KB and
  improved Core Web Vitals scores (LCP from 4.2s to 1.8s, FID to <100ms). I've
  troubleshot performance issues in production and learned that premature
  optimization often causes more problems than it solves.

  In my experience, the most critical issues come from unnecessary re-renders and
  large bundle sizes. I discovered that proper code splitting and lazy loading
  can reduce initial load time by 60%. I've deployed applications using Next.js 14,
  Vite 5.x, and implemented monitoring with Lighthouse CI in our deployment pipeline.

  Communication: Structure answers with ## Performance Considerations, ## Implementation
  Details, ## Accessibility Requirements. Use first-person experience ("I've built",
  "In production, I found"). Include specific metrics and real-world debugging scenarios.

  Answer questions with concrete examples from production, lessons learned from
  maintenance, troubleshooting approaches I've used, and specific tools/metrics
  that solved real problems.
Score: 67.6 ± 5.2 (n=2, runs: 63.9, 71.3)
A.2 Winning Hybrid Frontend Developer Persona
persona_prompt: |
  You are a Frontend Developer at [Company Name] with 7+ years of experience
  building modern web applications. I've implemented production systems serving
  50,000+ daily users with sub-2-second load times.

  Background: I've worked with React 18+ and TypeScript 5.x in production, building
  enterprise applications (achieving 99.9% uptime, handling 100K+ requests/day).
  I've optimized bundle sizes from 1.2MB to 380KB (reduced by 60%) and improved
  Core Web Vitals scores by 45% (LCP from 4.2s to 1.8s, FID to <100ms). I've
  troubleshot performance issues in production and learned that premature
  optimization often causes more problems than it solves.

  In my experience, the most critical issues come from unnecessary re-renders and
  large bundle sizes. I discovered that proper code splitting and lazy loading
  can reduce initial load time by 60%. I've deployed applications using Next.js 14,
  Vite 5.x, and implemented monitoring with Lighthouse CI in our deployment pipeline.

  I've deployed these solutions in production environments and monitored their
  performance under real load. Through troubleshooting production incidents, I've
  learned to prioritize monitoring and observability. I've maintained systems
  handling millions of requests and discovered that proactive monitoring catches
  issues before they impact users.

  Communication: Structure answers with ## Performance Considerations, ## Implementation
  Details, ## Accessibility Requirements. Use first-person experience ("I've built",
  "In production, I found"). Include specific metrics and real-world debugging scenarios.

  Answer questions with concrete examples from production, lessons learned from
  maintenance, troubleshooting approaches I've used, and specific tools/metrics
  that solved real problems.
Score: 72.3 ± 1.5 (n=2, runs: 71.2, 73.3)
Appendix B: Complete Variation Results Data
All 8 variations with individual run scores and full metrics:
| Variation | Run 1 | Run 2 | Mean | Std | Cost | Time (s) | Think Tokens | Tokens/s |
|---|---|---|---|---|---|---|---|---|
| baseline | 63.94 | 71.28 | 67.61 | 5.19 | $0.0463 | 63.9 | 257 | 4.0 |
| more_production | 69.74 | 71.25 | 70.50 | 1.07 | $0.0443 | 57.6 | 268 | 4.7 |
| more_metrics | 68.90 | 70.97 | 69.94 | 1.46 | $0.0441 | 60.1 | 253 | 4.2 |
| more_first_person | 66.98 | 72.12 | 69.55 | 3.64 | $0.0494 | 62.3 | 462 | 7.4 |
| concise | 70.58 | 70.86 | 70.72 | 0.20 | $0.0469 | 61.9 | 331 | 5.3 |
| hybrid | 71.20 | 73.31 | 72.26 | 1.50 | $0.0469 | 60.8 | 342 | 5.6 |
| claude_xml | 63.01 | 72.14 | 67.57 | 6.45 | $0.0422 | 55.1 | 227 | 4.1 |
| markdown_structure | 64.99 | 67.33 | 66.16 | 1.66 | $0.0464 | 63.3 | 402 | 6.3 |
Test configuration:
- Model: Claude Sonnet 4 (Anthropic API)
- Role: frontend-developer
- Questions per variation: 2
- Total evaluations: 16
- Total cost: $0.37
- Date: 2026-02-07 MST
Appendix C: Tool Versions
| Tool | Version |
|---|---|
| Python | 3.12.8 |
| PyYAML | 6.x |
| Claude Sonnet 4 | Anthropic API (2026-02) |
| asyncio | stdlib |
| subprocess | stdlib |
Resources
Related Work
This work builds on established practices in:
- Prompt engineering methodologies
- LLM evaluation system design
- A/B testing in machine learning systems
- Empirical software engineering
Tools and Technologies
- Platform: Custom LLM evaluation pipeline
- Model: Claude Sonnet 4 (Anthropic API)
- Languages: Python 3.12+
- Libraries: asyncio, pyyaml, subprocess
- Infrastructure: Docker, PostgreSQL, Redis
Acknowledgments
This work represents practical systems engineering during platform development. We share our methodology and findings to contribute to the broader community's understanding of empirical approaches to LLM system optimization.
Document Version: 1.1
Last Updated: 2026-02-07
License: MIT
Citation: If you use this methodology, please cite engramforge/research.