5. Examples
Maintain ground-truth examples of expected prompt results
An eight-factor AI application maintains a comprehensive set of example inputs and outputs that serve as ground truth for system behavior. These examples act as test cases, documentation, and few-shot learning data. They should be explicitly maintained and version controlled alongside code and prompts.
Examples in an eight-factor app serve multiple purposes:
- Validation of prompt effectiveness
- Documentation of expected behavior
- Training data for model fine-tuning
- Regression testing of model outputs
- Few-shot learning inputs for inference
A proper example collection contains:
- Input/output pairs
- Edge cases and boundary conditions
- Negative examples (what the system should not do; see the sketch after this list)
- Metadata about expected behavior
- Performance benchmarks
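Negative examples deserve the same explicit treatment as positive ones but rarely get it. A minimal sketch of one way to record them follows; the field names and the violates helper are hypothetical, not a fixed schema:

import re

# A hypothetical negative example: summaries must not invent figures that
# are absent from the source text. Field names here are illustrative.
negative_example = {
    "input": "The company reported strong quarterly growth.",
    "forbidden_output_patterns": [r"\d+(\.\d+)?%"],
    "metadata": {"type": "negative", "reason": "no hallucinated percentages"},
}

def violates(output: str, example: dict) -> bool:
    """True if the output matches any pattern the example forbids."""
    return any(re.search(p, output) for p in example["forbidden_output_patterns"])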
Bad practice - implicit or missing examples:
def generate_summary(text):
    prompt = load_prompt("summarization")
    return model.generate(prompt.format(text=text))

# No way to verify if summary is correct
Good practice - explicit example management:
from __future__ import annotations  # defer annotation evaluation so forward references resolve

from typing import List

class SummarizationExamples:
    def get_examples(self) -> List[Example]:
        return [
            Example(
                input="The quick brown fox jumps over the lazy dog.",
                expected_output="A fox jumps over a dog.",
                metadata={
                    "type": "basic",
                    "focuses_on": "core_message",
                    "max_length": 30
                }
            ),
            Example(
                input="Due to heavy rainfall...",
                expected_output="Weather caused delays.",
                metadata={
                    "type": "edge_case",
                    "focuses_on": "causality"
                }
            )
        ]

class Example:
    def __init__(self, input, expected_output, metadata=None):
        self.input = input
        self.expected_output = expected_output
        self.metadata = metadata or {}
        self.validate()

    def validate(self):
        """Ensure the example meets quality standards"""
        if not self.input or not self.expected_output:
            raise InvalidExampleError("Empty input or output")

    def matches(self, actual_output) -> SimilarityScore:
        """Compare the actual output to the expected output"""
        # calculate_similarity, SimilarityScore, and InvalidExampleError are
        # project-level helpers; a minimal sketch of them follows this listing.
        return calculate_similarity(
            self.expected_output,
            actual_output,
            self.metadata.get("comparison_type", "semantic")
        )
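The matches method leans on calculate_similarity, SimilarityScore, and InvalidExampleError, which are not defined above. A minimal standard-library sketch is below; a production "semantic" comparison would typically embed both strings and compare vectors, which this sketch approximates with lexical overlap:

from difflib import SequenceMatcher

SimilarityScore = float

class InvalidExampleError(ValueError):
    """Raised when an example fails validation."""

def calculate_similarity(expected: str, actual: str,
                         comparison_type: str = "semantic") -> SimilarityScore:
    """Score how closely the actual output matches the expected output."""
    if comparison_type == "exact":
        return 1.0 if expected.strip() == actual.strip() else 0.0
    # "semantic" falls back to lexical overlap in this sketch; swap in an
    # embedding-based comparison for real semantic matching.
    return SequenceMatcher(None, expected.lower(), actual.lower()).ratio()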
Examples should be organized by capability:
examples/
├── summarization/
│   ├── basic.yaml
│   ├── edge_cases.yaml
│   └── negative.yaml
├── classification/
│   ├── basic.yaml
│   └── multi_label.yaml
└── generation/
    ├── creative.yaml
    └── structured.yaml
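A small loader can turn this layout into Example objects at test time. The sketch below assumes PyYAML is available and that each file holds a list of entries with input, expected_output, and optional metadata keys; the file format is an assumption, not a standard:

from pathlib import Path
from typing import Dict, List

import yaml  # assumes PyYAML is installed

def load_examples(root: str = "examples") -> Dict[str, List[Example]]:
    """Load every capability's YAML files into Example objects, keyed by capability."""
    suites: Dict[str, List[Example]] = {}
    for path in sorted(Path(root).glob("*/*.yaml")):
        capability = path.parent.name  # e.g. "summarization"
        entries = yaml.safe_load(path.read_text()) or []
        suites.setdefault(capability, []).extend(
            Example(
                input=entry["input"],
                expected_output=entry["expected_output"],
                metadata=entry.get("metadata", {}),
            )
            for entry in entries
        )
    return suites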
Examples should be used systematically:
class PromptTester:
    def __init__(self, examples: List[Example], threshold=0.8):
        self.examples = examples
        self.threshold = threshold

    def test_prompt(self, prompt_template: str) -> TestResult:
        results = []
        for example in self.examples:
            prompt = prompt_template.format(input=example.input)
            output = model.generate(prompt)
            similarity = example.matches(output)
            results.append({
                "example": example,
                "output": output,
                "similarity": similarity,
                "passed": similarity >= self.threshold
            })
        return TestResult(results)
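TestResult is referenced but never defined; a minimal version only needs to aggregate the per-example records. The sketch below also shows one way the tester might be wired to the hypothetical load_examples helper from earlier; the prompt string and threshold are illustrative. (If these listings are combined into one module, either define TestResult before PromptTester or rely on the deferred-annotations import shown earlier.)

class TestResult:
    """Aggregate of the per-example records built by PromptTester.test_prompt."""
    def __init__(self, results):
        self.results = results

    @property
    def pass_rate(self) -> float:
        if not self.results:
            return 0.0
        return sum(r["passed"] for r in self.results) / len(self.results)

    def failures(self):
        return [r for r in self.results if not r["passed"]]

# Hypothetical wiring: test a candidate prompt against the summarization suite.
tester = PromptTester(load_examples()["summarization"], threshold=0.8)
result = tester.test_prompt("Summarize the following text:\n{input}")
print(f"pass rate: {result.pass_rate:.0%}, failures: {len(result.failures())}")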
This approach enables:
- Systematic prompt testing
- Clear behavior documentation
- Performance benchmarking
- Quality regression detection
- Continuous improvement
Examples should be treated as critical assets:
- Version controlled with the codebase
- Updated when behavior specifications change
- Used in automated testing (see the pytest sketch below)
- Referenced in documentation
- Used for model evaluation
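In automated testing, the examples can be parametrized directly into a test runner. A sketch using pytest, reusing the hypothetical load_examples helper and the model client assumed in earlier snippets, with an illustrative 0.8 pass threshold:

import pytest

SUMMARIZATION_EXAMPLES = load_examples()["summarization"]

@pytest.mark.parametrize(
    "example",
    SUMMARIZATION_EXAMPLES,
    ids=lambda e: e.metadata.get("type", "case"),
)
def test_summarization_examples(example):
    prompt = f"Summarize the following text:\n{example.input}"
    output = model.generate(prompt)  # same model client as in earlier snippets
    assert example.matches(output) >= 0.8, f"output drifted from expected behavior: {output!r}"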
Some applications may use examples dynamically:
class DynamicExampleSelector:
    def select_examples(self, context, k=3) -> List[Example]:
        """Select most relevant examples for current context"""
        relevant = self.find_similar_examples(context)
        return self.rank_and_select(relevant, k)
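find_similar_examples and rank_and_select are left abstract above. One simple concrete choice, sketched below, scores candidates by word overlap with the incoming context and splices the top k into a few-shot prompt; the subclass and build_few_shot_prompt helper are illustrative, and an embedding-based ranking is the natural upgrade:

class SimpleExampleSelector(DynamicExampleSelector):
    """Illustrative concrete selector: ranks examples by word overlap with the context."""
    def __init__(self, examples: List[Example]):
        self.examples = examples

    def find_similar_examples(self, context: str):
        """Return (score, example) pairs; the score is the shared-word count."""
        words = set(context.lower().split())
        return [(len(words & set(e.input.lower().split())), e) for e in self.examples]

    def rank_and_select(self, scored, k: int) -> List[Example]:
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [example for _, example in scored[:k]]

def build_few_shot_prompt(context: str, selector: DynamicExampleSelector) -> str:
    """Splice the selected examples ahead of the new input as demonstrations."""
    shots = "\n\n".join(
        f"Input: {e.input}\nOutput: {e.expected_output}"
        for e in selector.select_examples(context)
    )
    return f"{shots}\n\nInput: {context}\nOutput:"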
This pattern enables:
- Dynamic few-shot learning
- Context-aware behavior
- Adaptive system responses
- Clear behavior specifications
- Systematic quality improvement
The example suite should optimize for:
- Coverage - testing all important cases
- Clarity - clear expected behaviors
- Maintainability - easy to update
- Reusability - useful across systems
- Measurability - clear success criteria
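Several of these properties can themselves be checked mechanically. As one illustration, reusing the hypothetical load_examples layout, a coverage check could flag capabilities that lack a required example type:

from typing import Dict, List, Set

REQUIRED_TYPES = {"basic", "edge_case", "negative"}

def coverage_gaps(suites: Dict[str, List[Example]]) -> Dict[str, Set[str]]:
    """Map each capability to the required example types it is missing."""
    gaps: Dict[str, Set[str]] = {}
    for capability, examples in suites.items():
        present = {e.metadata.get("type") for e in examples}
        missing = REQUIRED_TYPES - present
        if missing:
            gaps[capability] = missing
    return gaps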