Maintain ground-truth examples of expected prompt results

An eight-factor AI application maintains a comprehensive set of example inputs and outputs that serve as ground truth for system behavior. These examples act as test cases, documentation, and few-shot learning data. They should be explicitly maintained and version controlled alongside code and prompts.

Examples in an eight-factor app serve multiple purposes:

  • Validation of prompt effectiveness
  • Documentation of expected behavior
  • Training data for model fine-tuning
  • Regression testing of model outputs
  • Few-shot learning inputs for inference

A proper example collection contains:

  • Input/output pairs
  • Edge cases and corner conditions
  • Negative examples (what the system should not do)
  • Metadata about expected behavior
  • Performance benchmarks

Bad practice - implicit or missing examples:

def generate_summary(text):
    prompt = load_prompt("summarization")
    return model.generate(prompt.format(text=text))
    # No way to verify if summary is correct

Good practice - explicit example management:

from typing import List

class SummarizationExamples:
    def get_examples(self) -> List["Example"]:
        # Example is defined below; the forward reference keeps the
        # annotation valid when both classes live in the same module.
        return [
            Example(
                input="The quick brown fox jumps over the lazy dog.",
                expected_output="A fox jumps over a dog.",
                metadata={
                    "type": "basic",
                    "focuses_on": "core_message",
                    "max_length": 30
                }
            ),
            Example(
                input="Due to heavy rainfall...",
                expected_output="Weather caused delays.",
                metadata={
                    "type": "edge_case",
                    "focuses_on": "causality"
                }
            )
        ]

class InvalidExampleError(ValueError):
    """Raised when an example fails basic quality checks."""

class Example:
    def __init__(self, input, expected_output, metadata=None):
        self.input = input
        self.expected_output = expected_output
        self.metadata = metadata or {}
        self.validate()

    def validate(self):
        """Ensure example meets quality standards"""
        if not self.input or not self.expected_output:
            raise InvalidExampleError("Empty input or output")

    def matches(self, actual_output) -> "SimilarityScore":
        """Compare actual output to expected output"""
        # calculate_similarity and SimilarityScore are assumed helpers;
        # a minimal sketch of both follows this snippet.
        return calculate_similarity(
            self.expected_output,
            actual_output,
            self.metadata.get("comparison_type", "semantic")
        )
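
The Example class assumes a calculate_similarity helper and a SimilarityScore type that the snippet does not define. A minimal sketch of what they might look like, treating the score as a float and using token overlap as a crude stand-in for semantic comparison (a real system would more likely use embeddings or a judge model):

# Hypothetical helpers assumed by Example.matches(); not prescribed by the factor.
SimilarityScore = float

def calculate_similarity(expected: str, actual: str,
                         comparison_type: str = "semantic") -> SimilarityScore:
    if comparison_type == "exact":
        return 1.0 if expected.strip() == actual.strip() else 0.0
    # Token-overlap (Jaccard) as a placeholder for real semantic similarity.
    expected_tokens = set(expected.lower().split())
    actual_tokens = set(actual.lower().split())
    if not expected_tokens or not actual_tokens:
        return 0.0
    return len(expected_tokens & actual_tokens) / len(expected_tokens | actual_tokens)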

Examples should be organized by capability:

examples/
  ├── summarization/
  │   ├── basic.yaml
  │   ├── edge_cases.yaml
  │   └── negative.yaml
  ├── classification/
  │   ├── basic.yaml
  │   └── multi_label.yaml
  └── generation/
      ├── creative.yaml
      └── structured.yaml
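
The factor does not prescribe what these files contain; one plausible schema is a list of input/expected_output/metadata entries that a small loader turns into Example objects. The entry layout and loader below are illustrative assumptions, not a fixed format:

from typing import List

import yaml  # PyYAML, assumed available

# One plausible entry in examples/summarization/basic.yaml:
#
#   - input: "The quick brown fox jumps over the lazy dog."
#     expected_output: "A fox jumps over a dog."
#     metadata:
#       type: basic
#       focuses_on: core_message
#       max_length: 30

def load_examples(path: str) -> List[Example]:
    """Load one capability's example file into Example objects (defined above)."""
    with open(path) as f:
        raw = yaml.safe_load(f) or []
    return [
        Example(
            input=entry["input"],
            expected_output=entry["expected_output"],
            metadata=entry.get("metadata", {}),
        )
        for entry in raw
    ]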

Examples should be used systematically:

class PromptTester:
    def __init__(self, examples: List[Example], threshold=0.8):
        self.examples = examples
        self.threshold = threshold  # minimum similarity counted as a pass

    def test_prompt(self, prompt_template: str) -> "TestResult":
        # `model` is the application's shared model client, as in the
        # earlier snippets; TestResult is sketched below.
        results = []
        for example in self.examples:
            prompt = prompt_template.format(input=example.input)
            output = model.generate(prompt)
            similarity = example.matches(output)
            results.append({
                "example": example,
                "output": output,
                "similarity": similarity,
                "passed": similarity >= self.threshold
            })
        return TestResult(results)
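
TestResult is left undefined above; a minimal version only needs to aggregate the per-example results. An illustrative sketch, together with one way the tester might be invoked:

class TestResult:
    """Aggregates the per-example dicts produced by PromptTester.test_prompt()."""

    def __init__(self, results):
        self.results = results

    @property
    def pass_rate(self) -> float:
        if not self.results:
            return 0.0
        return sum(r["passed"] for r in self.results) / len(self.results)

    def failures(self):
        return [r for r in self.results if not r["passed"]]

# Possible usage, assuming the classes defined earlier:
# tester = PromptTester(SummarizationExamples().get_examples(), threshold=0.8)
# result = tester.test_prompt("Summarize the following text:\n{input}")
# print(f"pass rate: {result.pass_rate:.0%}, failures: {len(result.failures())}")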

This approach enables:

  • Systematic prompt testing
  • Clear behavior documentation
  • Performance benchmarking
  • Quality regression detection
  • Continuous improvement

Examples should be treated as critical assets:

  • Version controlled with the codebase
  • Updated when behavior specifications change
  • Used in automated testing
  • Referenced in documentation
  • Used for model evaluation
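
The automated-testing point, for example, can be as simple as a parametrized pytest module that replays every stored example against the current prompt. The sketch below assumes a loader and a generate_summary entry point like the ones shown earlier; the module paths are hypothetical:

import pytest

from app.examples import load_examples          # hypothetical module path
from app.summarization import generate_summary  # hypothetical module path

EXAMPLES = load_examples("examples/summarization/basic.yaml")

@pytest.mark.parametrize(
    "example", EXAMPLES, ids=lambda e: e.metadata.get("type", "example")
)
def test_summarization_matches_ground_truth(example):
    output = generate_summary(example.input)
    score = example.matches(output)
    assert score >= 0.8, f"Output drifted from ground truth: {output!r}"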

Some applications may use examples dynamically:

class DynamicExampleSelector:
    def select_examples(self, context, k=3) -> List[Example]:
        """Select most relevant examples for current context"""
        relevant = self.find_similar_examples(context)
        return self.rank_and_select(relevant, k)
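
find_similar_examples and rank_and_select are left open above. One common way to fill them in is to embed the incoming context and every example input, then take the k nearest neighbors by cosine similarity; the selector below is a sketch under that assumption, with embed_text standing in for whatever embedding model the application uses:

import math
from typing import Callable, List

class EmbeddingExampleSelector:
    """Illustrative selector: nearest-neighbor search over example embeddings."""

    def __init__(self, examples: List[Example],
                 embed_text: Callable[[str], List[float]]):
        self.examples = examples
        self.embed_text = embed_text
        self.vectors = [embed_text(e.input) for e in examples]

    def select_examples(self, context: str, k: int = 3) -> List[Example]:
        query = self.embed_text(context)
        ranked = sorted(
            zip(self.examples, self.vectors),
            key=lambda pair: self._cosine(query, pair[1]),
            reverse=True,
        )
        return [example for example, _ in ranked[:k]]

    @staticmethod
    def _cosine(a: List[float], b: List[float]) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0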

This pattern enables:

  • Dynamic few-shot learning
  • Context-aware behavior
  • Adaptive system responses
  • Clear behavior specifications
  • Systematic quality improvement

The example suite should optimize for:

  • Coverage - testing all important cases
  • Clarity - clear expected behaviors
  • Maintainability - easy to update
  • Reusability - useful across systems
  • Measurability - clear success criteria