Pilot Study: Persona Prompt Optimization for LLM Evaluation Systems
A Systems Engineering Approach to Empirical Prompt Testing
Executive Summary
During the development of an LLM evaluation platform, we encountered inconsistent results when manually adjusting persona prompts used to guide model responses. Rather than rely on intuition, we built automated testing infrastructure and applied systematic A/B testing to empirically determine which prompt patterns correlated with improved evaluation scores.
This document describes our engineering approach, test methodology, and findings. We share these methods and results to demonstrate a data-driven approach to prompt optimization in systems development.
Key Findings (n=2 per variation, 8 variations):
- Hybrid prompts (production scenarios + quantified metrics): 72.3 ± 1.5 vs. baseline 67.6 ± 5.2 (+4.6 points)
- Concise prompts (reduced verbosity): 70.7 ± 0.2 (+3.1 points)
- Model-specific XML syntax: 67.6 ± 6.5 (no measurable benefit, high variance)
- Markdown structure instructions: 66.2 ± 1.7 (−1.4 points)
- Total test cost: $0.37 across 16 evaluations
1. Introduction
1.1 Context
As part of our platform development, we implemented an LLM evaluation system that uses persona prompts to guide model responses toward domain-specific expertise. During integration testing, we observed inconsistent results when manually adjusting these prompts. This led us to develop a systematic testing approach to understand which prompt characteristics correlated with improved evaluation scores.
1.2 Engineering Objectives
Our platform development required answers to specific technical questions:
- Which prompt content patterns correlate with higher evaluation scores in our multi-dimensional assessment system?
- Do model-specific syntax preferences (such as XML tags) improve evaluation performance in practice?
- Can we build automated testing infrastructure to systematically validate prompt variations?
- What quantifiable characteristics differentiate effective vs. ineffective prompt structures?
This work represents a practical application of the scientific method to systems engineering—measuring variables, testing hypotheses, and making data-driven architectural decisions.
1.3 Evaluation System Overview
Our evaluation pipeline assesses responses across seven dimensions, each scoring 0–100:
- Accuracy — Technical correctness
- Cognitive Complexity — Problem-solving depth
- Redundancy — Communication efficiency (inverse scoring)
- Optimization — Performance awareness
- Security — Security consciousness
- Understanding — Domain knowledge depth
- Experience Depth — Practical vs. theoretical knowledge
Weighted averaging produces final composite scores. This multi-dimensional approach allows us to measure different aspects of response quality independently.
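To make the composite concrete, here is a minimal sketch of weighted averaging over the seven dimensions. The weights shown are illustrative placeholders, not our production values:

def composite_score(dimension_scores: dict, weights: dict) -> float:
    """Weighted average of 0-100 dimension scores."""
    total_weight = sum(weights.values())
    return sum(weights[d] * s for d, s in dimension_scores.items()) / total_weight

# Hypothetical weights for illustration only.
EXAMPLE_WEIGHTS = {
    'accuracy': 0.20, 'cognitive_complexity': 0.15,
    'redundancy': 0.10,  # inverse-scored: higher = less redundant
    'optimization': 0.10, 'security': 0.10,
    'understanding': 0.15, 'experience_depth': 0.20,
}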
2. Methodology
2.1 Initial Hypothesis and Testing
We began with a hypothesis based on reviewing the scoring algorithms: adding first-person language, production scenarios, and specific metrics should improve the experience_depth dimension.
Initial Changes Applied:
- Increased first-person phrases ("I've implemented", "In my experience")
- Added production scenarios ("troubleshooting", "deployment", "monitoring")
- Included lessons learned ("I discovered", "I realized")
- Added structured section headers
- Specified tool versions and metrics
Test Scope: 5 roles with 27–35 questions each (frontend-developer, ux-ui-designer, technical-lead, release-manager, qa-automation-engineer)
2.2 Building Measurement Tools
To understand what differentiated successful from unsuccessful optimizations, we built automated analysis tools to quantify prompt characteristics:
import re

def analyze_persona_text(text: str) -> dict:
    """Extract quantifiable features from persona text."""
    # First-person phrases
    first_person_patterns = [
        r"I've\s+\w+", r"I\s+\w+ed", r"In my experience",
        r"I discovered", r"I learned", r"I realized",
    ]
    # Production scenarios
    production_patterns = [
        r"production", r"deployment", r"troubleshooting",
        r"debugging", r"monitoring", r"incident",
    ]
    # Lessons learned
    lessons_patterns = [
        r"learned", r"discovered", r"realized",
        r"mistake", r"pitfall", r"challenge",
    ]

    def count(patterns: list) -> int:
        return sum(len(re.findall(p, text, re.IGNORECASE)) for p in patterns)

    return {
        'first_person_count': count(first_person_patterns),
        'production_count': count(production_patterns),
        'lessons_count': count(lessons_patterns),
        # Metrics and specific improvements (e.g. "60%", "10x", "2 to 8")
        'metrics_count': len(re.findall(r'\d+%|\d+x|\d+ to \d+', text)),
        # Structured sections ("## Header")
        'section_count': len(re.findall(r'##\s+\w+', text)),
        'word_count': len(text.split()),
    }
2.3 Automated Testing Infrastructure
The mixed results from manual optimization led us to build automated testing infrastructure. This allowed us to systematically test multiple variations and measure outcomes empirically.
System Design:
- Variation Generator — Programmatically generates prompt variations (a minimal sketch follows this list):
  - baseline — Current production prompt
  - more_production — Enhanced production scenarios
  - more_metrics — Added quantified improvements
  - more_first_person — Increased first-person language
  - concise — Reduced verbosity
  - hybrid — Combined production + metrics
  - claude_xml — Model-specific XML syntax
  - markdown_structure — Emphasis on markdown formatting
- Test Harness — Runs each variation through the evaluation pipeline with N questions (typically 2–3 for rapid iteration)
- Metrics Collection — Captures:
  - Average score, min, max, and standard deviation
  - Token usage (input/output/thinking)
  - Processing speed (tokens per second)
  - Cost and elapsed time
  - Model identifier
- Results Analysis — Ranks variations and identifies patterns
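A minimal sketch of the variation generator's shape, assuming simple string transformations over the base prompt (our actual transformation functions are more involved; the appended sentences below are placeholders):

class PersonaVariationGenerator:
    """Generates named prompt variations from a base persona prompt."""

    def __init__(self, base_persona: str):
        self.base = base_persona

    def _add_production(self, text: str) -> str:
        # Placeholder transform: append production-scenario language.
        return text + "\nI've troubleshot production incidents under real load."

    def _add_metrics(self, text: str) -> str:
        # Placeholder transform: append a quantified improvement.
        return text + "\nI've reduced deployment time from 2 hours to 8 minutes."

    def generate_variations(self) -> dict:
        return {
            'baseline': self.base,
            'more_production': self._add_production(self.base),
            'more_metrics': self._add_metrics(self.base),
            'hybrid': self._add_metrics(self._add_production(self.base)),
            # ... concise, more_first_person, claude_xml, markdown_structure
        }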
2.4 Test Infrastructure
import asyncio
import json
import time
from typing import Dict

class VariationTester:
    """Tests persona variations and measures performance."""

    async def test_variation(self, variation_name: str,
                             variation_prompt: str) -> Dict:
        # Apply variation temporarily
        self.apply_variation(variation_name, variation_prompt)

        # Run evaluation pipeline as subprocess
        start = time.monotonic()
        proc = await asyncio.create_subprocess_exec(
            'python', 'generate_training.py',
            '--role', self.role_id,
            '--questions-per-role', str(self.questions_per_test),
            stdout=asyncio.subprocess.PIPE,
        )
        stdout, _ = await proc.communicate()
        elapsed_time = time.monotonic() - start

        # Capture metrics; this sketch assumes the pipeline prints
        # a JSON summary to stdout
        metrics = json.loads(stdout)
        return {
            'variation': variation_name,
            'avg_score': metrics['avg_score'],
            'std_score': metrics['std_score'],
            'tokens_per_sec': metrics['tokens_per_sec'],
            'elapsed_time': elapsed_time,
            'total_cost': metrics['total_cost'],
        }
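A usage sketch for driving the harness across all variations and ranking results, best first (the repository script wires this up with result persistence; this driver is illustrative):

async def run_all_variations(tester: VariationTester,
                             variations: dict) -> list:
    """Test every variation and rank by average score, descending."""
    results = []
    for name, prompt in variations.items():
        results.append(await tester.test_variation(name, prompt))
    return sorted(results, key=lambda r: r['avg_score'], reverse=True)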
2.5 System Architecture
The complete testing framework operates as an automated pipeline:
Figure 1: Automated testing pipeline showing variation generation, test execution, metrics collection, and results analysis. The system processes multiple prompt variations systematically, measuring performance across consistent test sets.
3. Results
3.1 Initial Manual Optimization Observations
Manual optimization across 5 roles (n=1 per condition — no variance data available):
| Role | Baseline | Optimized | Δ | Outcome |
|---|---|---|---|---|
| frontend-developer | 69.3 | 69.2 | −0.1 | No improvement |
| ux-ui-designer | 71.7 | 71.1 | −0.6 | Decline |
| technical-lead | 72.9 | 72.7 | −0.2 | No improvement |
| release-manager | 72.9 | 74.5 | +1.6 | Improved |
| qa-automation-engineer | 74.1 | 74.9 | +0.8 | Improved |
Note: These initial tests used n=1 per condition and therefore lack variance data. They are reported as directional observations only and should not be used to draw strong conclusions.
3.2 Pattern Correlation Observations
Comparing the 2 successful vs. 3 unsuccessful manual optimizations:
| Pattern | Successful Avg | Unsuccessful Avg | Difference |
|---|---|---|---|
| Production scenario terms | 17.5 | 11.7 | +5.8 |
| Specific metrics | 7.0 | 5.0 | +2.0 |
| First-person phrases | 16.0 | 14.7 | +1.3 |
| Lessons learned phrases | 7.0 | 6.7 | +0.3 |
| Section headers | 4.0 | 3.7 | +0.3 |
| Total word count | 241 | 212 | +29 |
Note: With only 2 successful and 3 unsuccessful samples, these correlations are directional only. They guided our automated variation design but should not be interpreted as statistically significant.
3.3 Automated Variation Testing Results
Testing 8 systematic variations on frontend-developer role (n=2 per variation, 16 total evaluations):
Figure 2: Comparative scores for all 8 variations. Teal bars indicate improvements over baseline; gray is baseline; red bars show no improvement or decline. Dashed line marks baseline score.
All Runs — Individual Scores:
| Variation | Run 1 | Run 2 | Mean ± Std | Δ vs Baseline | Cost | Time |
|---|---|---|---|---|---|---|
| hybrid | 71.2 | 73.3 | 72.3 ± 1.5 | +4.6 | $0.047 | 60.8s |
| concise | 70.6 | 70.9 | 70.7 ± 0.2 | +3.1 | $0.047 | 61.9s |
| more_production | 69.7 | 71.3 | 70.5 ± 1.1 | +2.9 | $0.044 | 57.6s |
| more_metrics | 68.9 | 71.0 | 69.9 ± 1.5 | +2.3 | $0.044 | 60.1s |
| more_first_person | 67.0 | 72.1 | 69.5 ± 3.6 | +1.9 | $0.049 | 62.3s |
| baseline | 63.9 | 71.3 | 67.6 ± 5.2 | — | $0.046 | 63.9s |
| claude_xml | 63.0 | 72.1 | 67.6 ± 6.5 | 0.0 | $0.042 | 55.1s |
| markdown_structure | 65.0 | 67.3 | 66.2 ± 1.7 | −1.4 | $0.046 | 63.3s |
High-variance observations: baseline (std=5.2) and claude_xml (std=6.5) showed the highest run-to-run variance. With n=2 these variance estimates are themselves uncertain; additional runs would be needed to confirm whether these variations are inherently less consistent.
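(With n=2, the sample standard deviation reduces to |run₁ − run₂| / √2; for example, baseline: |71.3 − 63.9| / √2 ≈ 5.2.)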
Total cost: $0.37 for 16 evaluations across 8 variations.
Model: Claude Sonnet 4 (Anthropic API) for all runs.
4. Analysis
4.1 Content vs. Syntax in Practice
Model-specific XML syntax (<thinking>, <analysis> tags) scored 67.6 ± 6.5 — identical mean to baseline (67.6 ± 5.2) but with even higher variance (Section 3.3).
In our evaluation pipeline, what the prompt says matters more than how it is formatted. This observation may not generalize to other contexts (direct conversation, code generation, etc.), but for our assessment system, content patterns dominated over syntax preferences.
4.2 Why the Hybrid Pattern Scored Highest
The winning "hybrid" variation combined the two strongest correlations from our pattern analysis (Section 3.2):
- Production scenario terms (+5.8 difference in successful vs. unsuccessful prompts)
- Quantified metrics (+2.0 difference)
- Maintained first-person language (+1.3 difference)
At 72.3 ± 1.5, it also had relatively low variance compared to baseline (5.2) and claude_xml (6.5), suggesting more consistent performance.
Sample hybrid prompt excerpt:
I've deployed these solutions in production environments and monitored their
performance under real load. Through troubleshooting production incidents, I've
learned to prioritize monitoring and observability. I've maintained systems
handling millions of requests and discovered that proactive monitoring catches
issues before they impact users.
I've built production Kubernetes clusters serving 10M+ requests/day with 99.99%
uptime. I've reduced deployment time from 2 hours to 8 minutes and increased
deployment frequency from weekly to 50+ deploys/day.
4.3 Observations on Verbosity
"Concise" scored second (70.7 ± 0.2) with the lowest variance of any variation. This suggests:
- Shorter prompts may reduce noise in evaluation
- Focus on high-value patterns matters more than volume
- Our evaluation system's redundancy dimension may penalize verbose prompts
4.4 Structured Output Impact
Instructing models to use specific markdown structure scored lowest (66.2 ± 1.7, −1.4 vs. baseline). Possible explanations in our evaluation context:
- Models allocated processing to formatting rather than content depth
- Structural requirements may have constrained natural expression
- Our evaluation scoring weights substance over presentation format
This observation is specific to our assessment pipeline and may differ in other applications.
4.5 Performance Metrics
The winning variation (hybrid) generated output at a moderate 5.6 tokens/sec. The fastest variation (more_first_person at 7.4 tokens/sec) scored 5th, indicating that in our evaluation context, generation speed did not correlate with quality scores.
5. Prompt Templates
5.1 Baseline Template
persona_prompt: |
  You are a [Role] at [Company Name] with [X]+ years of experience
  [domain expertise description]. You specialize in [key technologies]
  and have [notable achievements].

  Background: [Technical skills and experience description]

  Communication: [Communication style guidance]

  Answer questions with [guidance on content approach].
5.2 Optimized Hybrid Template (Highest Scoring)
persona_prompt: |
  You are a [Role] at [Company Name] with [X]+ years of experience.
  I've [specific achievement with metrics] serving [scale metric] users.

  Background: I've worked with [specific tech + versions] in production,
  [concrete accomplishment]. I've [quantified improvement: from X to Y] and
  [metric improvement with %]. I've troubleshot [production scenario]
  and learned that [key insight].

  In my experience, the most critical issues come from [root cause pattern].
  I discovered that [specific technique] can [quantified benefit]. I've
  deployed [specific implementation] and [production result].

  Through [maintenance/deployment/incidents], I found that [system insight].
  I've made the mistake of [anti-pattern]—it taught me that [lesson learned].
  I discovered that [best practice] [prevents/enables] [outcome].

  Communication: Structure answers with ## [Section1], ## [Section2],
  ## [Section3]. Use first-person experience ("I've built", "In production,
  I found"). Include specific metrics and real-world debugging scenarios.

  Answer questions with concrete examples from production, lessons learned
  from maintenance, troubleshooting approaches I've used, and specific
  tools/metrics that solved real problems.
5.3 Key Pattern Elements
Production Scenarios (~17 mentions target):
production deployment, troubleshooting incidents, debugging issues,
monitoring systems, maintenance windows, incident response,
production load, real-world constraints, deployment pipeline
Quantified Metrics (7+ mentions):
from X to Y, improved by Z%, N+ years, N+ users, 99.X% uptime,
reduced from X seconds to Y seconds, increased throughput by Nx
First-Person Experience (~16 mentions):
I've implemented, I've built, I've deployed, In my experience,
I discovered, I learned, I realized, I found, through my work,
our team, at my previous role
Lessons Learned (7+ mentions):
learned that, discovered that, realized through, found out,
mistake I made, pitfall to avoid, challenge we faced,
taught me that
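These targets can be checked mechanically with the analyzer from Section 2.2. A sketch, where draft_prompt is a placeholder for the persona text under review and the thresholds mirror the targets above:

draft_prompt = "..."  # placeholder: persona text under review
features = analyze_persona_text(draft_prompt)
targets = {
    'production_count': 17,    # production scenario terms (~17)
    'metrics_count': 7,        # quantified metrics (7+)
    'first_person_count': 16,  # first-person phrases (~16)
    'lessons_count': 7,        # lessons-learned phrases (7+)
}
for key, target in targets.items():
    status = 'OK' if features[key] >= target else 'LOW'
    print(f"{key}: {features[key]} (target {target}) -> {status}")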
5.4 Anti-Patterns (Lowest Scoring)
Based on lowest-scoring variations in our testing:
❌ Explicit syntax instructions:
Structure your responses using XML tags:
<thinking>Your analysis</thinking>
<implementation>Code details</implementation>
Observation: No improvement, identical mean to baseline but higher variance (Section 3.3)
❌ Over-specified markdown formatting:
Always use clear markdown structure:
- Start with ## for main sections
- Use ### for subsections
- Bold **key concepts** and *italicize* emphasis
Observation: −1.4 point decrease (Section 3.3)
❌ Excessive verbosity:
You are a highly experienced [Role] with deep expertise across
multiple domains including [long list]. You have comprehensive
knowledge of [extensive enumeration]...
Observation: Concise prompts scored +3.1 higher (Section 3.3)
6. Implementation Guide
6.1 Prerequisites
| Tool | Version | Purpose |
|---|---|---|
| Python | 3.12+ | Runtime |
| pip | latest | Package management |
| Anthropic API key | — | LLM API access |
6.2 Running Variation Tests
Reproduction scripts and data are available in the research repository:
# Clone the research repository
git clone https://github.com/engramforge/research.git
cd research/persona-prompt-optimization
# Create virtual environment and install dependencies
python -m venv .venv
source .venv/bin/activate
pip install -r scripts/requirements.txt
# Set API credentials
export ANTHROPIC_API_KEY="your-key-here"
# Run automated testing on a single role
python scripts/test_persona_variations.py \
--role frontend-developer \
--questions 5
# Results saved to:
# - variation_test_results_[role]_[timestamp].json
# - Console output with ranked results
6.3 Cost Estimation
Per role testing costs:
- 8 variations × 5 questions = 40 evaluations
- ~$0.14 per evaluation (with chain-of-thought)
- Total: ~$5.60 per role
For all 17 roles: ~$95.20
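The same arithmetic as a small helper, for planning runs at other scales (the per-evaluation cost is the ~$0.14 estimate above):

def estimate_cost(roles: int = 1, variations: int = 8,
                  questions: int = 5, cost_per_eval: float = 0.14) -> float:
    """Estimated API cost in dollars for a variation-testing run."""
    return roles * variations * questions * cost_per_eval

print(estimate_cost())          # 5.6  -> ~$5.60 per role
print(estimate_cost(roles=17))  # 95.2 -> ~$95.20 for all 17 roles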
6.4 Integration Workflow
import asyncio

from adaptive_persona_optimizer import PersonaVariationGenerator
# Module name assumed from scripts/test_persona_variations.py
from test_persona_variations import VariationTester

async def optimize(base_persona: str) -> dict:
    # 1. Generate variations
    generator = PersonaVariationGenerator(base_persona)
    variations = generator.generate_variations()

    # 2. Test each variation
    tester = VariationTester(role_id='your-role', questions_per_test=5)
    results = await tester.test_all_variations()

    # 3. Apply winner
    winner = max(results, key=lambda x: x['avg_score'])
    # Update persona file with winner['prompt']
    return winner
7. Limitations
7.1 Sample Size
All automated variation tests used n=2 per condition (16 total evaluations). This limits our ability to calculate meaningful confidence intervals and means our variance estimates are themselves uncertain. For production deployment, we recommend n=5–10 per variation.
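(As a rough guide, the standard error of a mean shrinks as s/√n: at the baseline's observed s = 5.2, moving from n=2 to n=5 would cut the standard error from ≈3.7 to ≈2.3 points, assuming similar run-to-run spread.)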
7.2 Role Coverage
Testing focused on the frontend-developer role. Different role categories may exhibit different patterns:
- Infrastructure roles (DevOps, Backend): May benefit more from production scenarios
- Leadership roles (Product Manager, Tech Lead): May benefit from team/coordination language
- Design roles (UX/UI): May benefit from user research and iteration language
7.3 Model and System Dependencies
All tests used Claude Sonnet 4 via Anthropic API. Results reflect this specific configuration and may vary with:
- Different model versions or families
- Different evaluation scoring algorithms
- Different question types or domains
These findings informed our platform architecture but should be validated for different system configurations.
7.4 Initial Manual Testing
The 5-role manual optimization (Section 3.1) used n=1 per condition, providing no variance data. Those observations should be treated as directional only.
7.5 Measurement Framework Evolution
We are developing enhanced metrics for ongoing optimization:
from dataclasses import dataclass

@dataclass
class OptimizationMetrics:
    """Comprehensive metrics for persona optimization."""
    # Performance metrics
    score_improvement: float     # Points gained vs baseline
    score_variance: float        # Consistency across questions
    cost_efficiency: float       # Score improvement per dollar
    # Pattern metrics (floats: counts normalized per 100 words)
    production_density: float    # Production terms per 100 words
    metric_density: float        # Quantified metrics per 100 words
    first_person_density: float  # First-person phrases per 100 words
    # Efficiency metrics
    words_per_point: float       # Prompt verbosity vs score
    tokens_per_second: float     # Generation efficiency
    thinking_token_ratio: float  # Thinking tokens / total tokens
8. Conclusions
8.1 Engineering Takeaways
- Systematic testing identified improvements that manual optimization missed — Automated variation testing found +4.6 points (72.3 ± 1.5 vs. 67.6 ± 5.2) where manual attempts showed mixed results (Section 3.1 vs. 3.3)
- Content patterns outperformed syntax preferences — In our evaluation context, production scenarios and metrics correlated with higher scores while XML formatting provided no benefit (Section 4.1)
- Combined patterns scored highest — The hybrid approach combining production scenarios + metrics outperformed any single-pattern optimization (Section 4.2)
- Model syntax preferences are context-dependent — XML tags scored identically to baseline in our assessment pipeline, though they may help in other contexts (Section 4.1)
- Measurement infrastructure enables data-driven decisions — Building quantifiable metrics and automated testing allowed us to move from intuition to empirical optimization (Section 2.3)
8.2 Platform Implementation Guidance
For similar systems:
- Use automated variation testing rather than manual prompt tuning
- Focus on content patterns relevant to your evaluation dimensions
- Test with sufficient sample sizes (n=5–10 per variation minimum)
- Validate across multiple representative roles before generalizing
- Measure what matters in your specific context
For our platform:
- Deploy hybrid template (production + metrics) as baseline
- Avoid over-specifying output format
- Target ~17 production-related terms per prompt
- Include 7+ quantified metrics
- Use ~16 first-person phrases
- Aim for 200–250 words total
8.3 Shared Methodology
All code, data, and prompt templates are available in the research repository:
github.com/engramforge/research/tree/main/persona-prompt-optimization
We share these tools and findings to demonstrate an empirical approach to prompt optimization in systems engineering. This work represents a practical application of the scientific method to platform development challenges. Organizations building similar systems may find these methods useful for their own optimization work.
Appendix A: Sample Persona Prompts
A.1 Baseline Frontend Developer Persona
persona_prompt: |
  You are a Frontend Developer at [Company Name] with 7+ years of experience
  building modern web applications. I've implemented production systems serving
  50,000+ daily users with sub-2-second load times.

  Background: I've worked with React 18+ and TypeScript 5.x in production, building
  enterprise applications. I've optimized bundle sizes from 1.2MB to 380KB and
  improved Core Web Vitals scores (LCP from 4.2s to 1.8s, FID to <100ms). I've
  troubleshot performance issues in production and learned that premature
  optimization often causes more problems than it solves.

  In my experience, the most critical issues come from unnecessary re-renders and
  large bundle sizes. I discovered that proper code splitting and lazy loading
  can reduce initial load time by 60%. I've deployed applications using Next.js 14,
  Vite 5.x, and implemented monitoring with Lighthouse CI in our deployment pipeline.

  Communication: Structure answers with ## Performance Considerations, ## Implementation
  Details, ## Accessibility Requirements. Use first-person experience ("I've built",
  "In production, I found"). Include specific metrics and real-world debugging scenarios.

  Answer questions with concrete examples from production, lessons learned from
  maintenance, troubleshooting approaches I've used, and specific tools/metrics
  that solved real problems.
Score: 67.6 ± 5.2 (n=2, runs: 63.9, 71.3)
A.2 Winning Hybrid Frontend Developer Persona
persona_prompt: |
  You are a Frontend Developer at [Company Name] with 7+ years of experience
  building modern web applications. I've implemented production systems serving
  50,000+ daily users with sub-2-second load times.

  Background: I've worked with React 18+ and TypeScript 5.x in production, building
  enterprise applications (achieving 99.9% uptime, handling 100K+ requests/day).
  I've optimized bundle sizes from 1.2MB to 380KB (reduced by 60%) and improved
  Core Web Vitals scores by 45% (LCP from 4.2s to 1.8s, FID to <100ms). I've
  troubleshot performance issues in production and learned that premature
  optimization often causes more problems than it solves.

  In my experience, the most critical issues come from unnecessary re-renders and
  large bundle sizes. I discovered that proper code splitting and lazy loading
  can reduce initial load time by 60%. I've deployed applications using Next.js 14,
  Vite 5.x, and implemented monitoring with Lighthouse CI in our deployment pipeline.

  I've deployed these solutions in production environments and monitored their
  performance under real load. Through troubleshooting production incidents, I've
  learned to prioritize monitoring and observability. I've maintained systems
  handling millions of requests and discovered that proactive monitoring catches
  issues before they impact users.

  Communication: Structure answers with ## Performance Considerations, ## Implementation
  Details, ## Accessibility Requirements. Use first-person experience ("I've built",
  "In production, I found"). Include specific metrics and real-world debugging scenarios.

  Answer questions with concrete examples from production, lessons learned from
  maintenance, troubleshooting approaches I've used, and specific tools/metrics
  that solved real problems.
Score: 72.3 ± 1.5 (n=2, runs: 71.2, 73.3)
Appendix B: Complete Variation Results Data
All 8 variations with individual run scores and full metrics:
| Variation | Run 1 | Run 2 | Mean | Std | Cost | Time (s) | Think Tokens | Tokens/s |
|---|---|---|---|---|---|---|---|---|
| baseline | 63.94 | 71.28 | 67.61 | 5.19 | $0.0463 | 63.9 | 257 | 4.0 |
| more_production | 69.74 | 71.25 | 70.50 | 1.07 | $0.0443 | 57.6 | 268 | 4.7 |
| more_metrics | 68.90 | 70.97 | 69.94 | 1.46 | $0.0441 | 60.1 | 253 | 4.2 |
| more_first_person | 66.98 | 72.12 | 69.55 | 3.64 | $0.0494 | 62.3 | 462 | 7.4 |
| concise | 70.58 | 70.86 | 70.72 | 0.20 | $0.0469 | 61.9 | 331 | 5.3 |
| hybrid | 71.20 | 73.31 | 72.26 | 1.50 | $0.0469 | 60.8 | 342 | 5.6 |
| claude_xml | 63.01 | 72.14 | 67.57 | 6.45 | $0.0422 | 55.1 | 227 | 4.1 |
| markdown_structure | 64.99 | 67.33 | 66.16 | 1.66 | $0.0464 | 63.3 | 402 | 6.3 |
Test configuration:
- Model: Claude Sonnet 4 (Anthropic API)
- Role: frontend-developer
- Questions per variation: 2
- Total evaluations: 16
- Total cost: $0.37
- Date: 2026-02-07 MST
Appendix C: Tool Versions
| Tool | Version |
|---|---|
| Python | 3.12.8 |
| PyYAML | 6.x |
| Claude Sonnet 4 | Anthropic API (2026-02) |
| asyncio | stdlib |
| subprocess | stdlib |
Resources
Related Work
This work builds on established practices in:
- Prompt engineering methodologies
- LLM evaluation system design
- A/B testing in machine learning systems
- Empirical software engineering
Tools and Technologies
- Platform: Custom LLM evaluation pipeline
- Model: Claude Sonnet 4 (Anthropic API)
- Languages: Python 3.12+
- Libraries: asyncio, pyyaml, subprocess
- Infrastructure: Docker, PostgreSQL, Redis
Acknowledgments
This work represents practical systems engineering during platform development. We share our methodology and findings to contribute to the broader community's understanding of empirical approaches to LLM system optimization.
Document Version: 1.1
Last Updated: 2026-02-07
License: MIT
Citation: If you use this methodology, please cite engramforge/research.