Metrics Overview
Benchwise provides a comprehensive set of evaluation metrics for assessing LLM outputs.
Available Metrics
Text Similarity
- rouge_l - ROUGE-L score for text overlap
- bleu_score - BLEU score for translation quality
- bert_score_metric - BERT-based semantic similarity
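The individual metrics are plain functions called as metric(predictions, references) and returning a dict (see the usage pattern and return formats below). A minimal sketch with the text-similarity metrics; ROUGE-L's keys are documented further down, while BLEU's keys are not shown on this page, so the sketch prints that dict whole:

from benchwise import rouge_l, bleu_score

predictions = ["The cat sat on the mat."]
references = ["A cat was sitting on the mat."]

# ROUGE-L returns f1/precision/recall (see "Metric Return Format" below)
print(rouge_l(predictions, references)["f1"])

# BLEU's return keys are not documented on this page; inspect the full dict
print(bleu_score(predictions, references))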
Accuracy & Correctness
- accuracy - Exact match accuracy
- factual_correctness - Factual accuracy check
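A quick sketch of the correctness metrics: accuracy is exact match, so case and whitespace normalization matter, and its keys are documented below; factual_correctness's return shape is not shown on this page, so the sketch prints it whole:

from benchwise import accuracy, factual_correctness

predictions = ["Paris", "berlin"]
references = ["Paris", "Berlin"]

acc = accuracy(predictions, references)
# Likely 0.5 here if matching is case-sensitive
print(acc["accuracy"], acc["correct"], acc["total"])

print(factual_correctness(predictions, references))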
Semantic Analysis
- semantic_similarity - Embedding-based similarity
- coherence_score - Text coherence evaluation
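semantic_similarity scores meaning rather than surface overlap, so a paraphrase can score high even with little word overlap. A sketch using the keys documented below; coherence_score's keys are not shown on this page:

from benchwise import semantic_similarity, coherence_score

predictions = ["The weather is great today."]
references = ["It's a lovely day outside."]

sim = semantic_similarity(predictions, references)
print(sim["mean_similarity"])  # batch aggregate
print(sim["similarities"])     # one score per prediction/reference pair

print(coherence_score(predictions, references))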
Text Fluency
- perplexity - Text fluency and predictability
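Perplexity measures how predictable text is under a language model; lower values generally indicate more fluent output. Its return keys are not documented on this page, and whether it actually uses the references is an assumption here, so the sketch keeps the common metric(predictions, references) call shape and prints the whole result:

from benchwise import perplexity

predictions = ["The quick brown fox jumps over the lazy dog."]
references = predictions  # assumption: references may be ignored by perplexity

print(perplexity(predictions, references))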
Safety
- safety_score - Content safety evaluation
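safety_score screens model outputs for unsafe content. Its return shape is also not documented above, so the sketch prints the whole dict; passing the predictions as their own references is an assumption, since a safety check presumably scores the outputs alone:

from benchwise import safety_score

predictions = ["Here is a friendly, harmless answer."]
references = predictions  # assumption: safety scoring likely ignores references

print(safety_score(predictions, references))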
Common Usage Pattern
from benchwise import evaluate, accuracy, semantic_similarity, rouge_l

@evaluate("gpt-4")
async def test_with_metrics(model, dataset):
    responses = await model.generate(dataset.prompts)

    # Use multiple metrics
    acc = accuracy(responses, dataset.references)
    sim = semantic_similarity(responses, dataset.references)
    rouge = rouge_l(responses, dataset.references)

    return {
        "accuracy": acc["accuracy"],
        "similarity": sim["mean_similarity"],
        "rouge_f1": rouge["f1"],
    }
Metric Collections
Pre-configured metric bundles for common tasks:
from benchwise import get_text_generation_metrics, get_qa_metrics, get_safety_metrics

# Text generation metrics: bundles rouge_l, bleu_score, bert_score_metric, coherence_score
text_metrics = get_text_generation_metrics()
results = text_metrics.evaluate(predictions, references)

# QA-specific metrics: bundles accuracy, rouge_l, bert_score_metric, semantic_similarity
qa_metrics = get_qa_metrics()
results = qa_metrics.evaluate(predictions, references)

# Safety metrics: bundles safety_score, coherence_score
safety_metrics = get_safety_metrics()
results = safety_metrics.evaluate(predictions, references)
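The shape of the dict a collection's evaluate() returns is not spelled out above. A plausible assumption is one entry per bundled metric, so a defensive way to consume it is to iterate whatever mapping comes back:

from benchwise import get_qa_metrics

predictions = ["Paris"]
references = ["Paris"]

qa_metrics = get_qa_metrics()
results = qa_metrics.evaluate(predictions, references)

# Assumption: results maps each bundled metric's name to its score dict
for name, scores in results.items():
    print(name, scores)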
Creating Custom Metrics
from typing import Any, Dict, List

def custom_metric(predictions: List[str], references: List[str]) -> Dict[str, Any]:
    """Custom metric function."""
    scores = []
    for pred, ref in zip(predictions, references):
        # Your scoring logic (calculate_score is a placeholder)
        score = calculate_score(pred, ref)
        scores.append(score)
    return {
        "mean_score": sum(scores) / len(scores) if scores else 0.0,
        "scores": scores,
    }

# Use in evaluation
@evaluate("gpt-4")
async def test_custom(model, dataset):
    responses = await model.generate(dataset.prompts)
    return custom_metric(responses, dataset.references)
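As a concrete, self-contained variant of the pattern above, here is a custom metric with real scoring logic: token-level Jaccard overlap between each prediction and its reference. Everything here is plain Python; only the return-format convention (a mean plus per-item scores) follows the built-in metrics:

from typing import Any, Dict, List

def jaccard_overlap(predictions: List[str], references: List[str]) -> Dict[str, Any]:
    """Token-level Jaccard similarity: |intersection| / |union| of word sets."""
    scores = []
    for pred, ref in zip(predictions, references):
        pred_tokens = set(pred.lower().split())
        ref_tokens = set(ref.lower().split())
        union = pred_tokens | ref_tokens
        scores.append(len(pred_tokens & ref_tokens) / len(union) if union else 0.0)
    return {
        "mean_score": sum(scores) / len(scores) if scores else 0.0,
        "scores": scores,
    }

print(jaccard_overlap(["the cat sat"], ["the cat slept"]))  # mean_score = 0.5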
Metric Return Format
All metrics return dictionaries with relevant scores:
# Accuracy
{
    "accuracy": 0.85,
    "correct": 17,
    "total": 20
}

# ROUGE
{
    "f1": 0.75,
    "precision": 0.80,
    "recall": 0.70
}

# Semantic Similarity
{
    "mean_similarity": 0.85,
    "min_similarity": 0.60,
    "max_similarity": 0.95,
    "similarities": [0.85, 0.90, ...]
}
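Since every metric returns a plain dict, downstream quality gates are simple dictionary lookups. A sketch using only the keys documented above (the 0.8 threshold is arbitrary, for illustration):

from benchwise import accuracy, rouge_l

predictions = ["Paris is the capital of France."]
references = ["Paris is the capital of France."]

acc = accuracy(predictions, references)
rouge = rouge_l(predictions, references)

# Fail fast if either score falls below the illustrative threshold
assert acc["accuracy"] >= 0.8, f"accuracy too low: {acc['accuracy']}"
assert rouge["f1"] >= 0.8, f"ROUGE-L F1 too low: {rouge['f1']}"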
Choosing the Right Metric
For Question Answering
- accuracy - Exact match
- semantic_similarity - Meaning-based matching
For Summarization
- rouge_l - Text overlap
- semantic_similarity - Meaning preservation
For Translation
- bleu_score - Translation quality
- bert_score_metric - Semantic similarity
For Safety
- safety_score - Content safety
For Coherence
- coherence_score - Text quality
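For question answering in particular, reporting exact match alongside a semantic score is useful, because exact match alone penalizes correct paraphrases. A sketch combining the two QA metrics recommended above, following the same @evaluate pattern as earlier:

from benchwise import evaluate, accuracy, semantic_similarity

@evaluate("gpt-4")
async def test_qa(model, dataset):
    responses = await model.generate(dataset.prompts)

    acc = accuracy(responses, dataset.references)             # strict: exact string match
    sim = semantic_similarity(responses, dataset.references)  # lenient: meaning-based

    return {
        "exact_match": acc["accuracy"],
        "semantic_similarity": sim["mean_similarity"],
    }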
Next Steps
Explore the documentation for each individual metric.