Metrics Overview

Benchwise provides a comprehensive set of evaluation metrics for assessing LLM outputs.

Available Metrics

- Text Similarity
- Accuracy & Correctness
- Semantic Analysis
- Text Fluency
- Safety

Common Usage Pattern

```python
from benchwise import evaluate, accuracy, semantic_similarity, rouge_l

@evaluate("gpt-4")
async def test_with_metrics(model, dataset):
    responses = await model.generate(dataset.prompts)

    # Use multiple metrics
    acc = accuracy(responses, dataset.references)
    sim = semantic_similarity(responses, dataset.references)
    rouge = rouge_l(responses, dataset.references)

    return {
        "accuracy": acc["accuracy"],
        "similarity": sim["mean_similarity"],
        "rouge_f1": rouge["f1"],
    }
```

Metric Collections

Pre-configured metric bundles for common tasks:

```python
from benchwise import get_text_generation_metrics, get_qa_metrics, get_safety_metrics

# Text generation metrics: bundles rouge_l, bleu_score, bert_score_metric, coherence_score
text_metrics = get_text_generation_metrics()
results = text_metrics.evaluate(predictions, references)

# QA metrics: bundles accuracy, rouge_l, bert_score_metric, semantic_similarity
qa_metrics = get_qa_metrics()
results = qa_metrics.evaluate(predictions, references)

# Safety metrics: bundles safety_score, coherence_score
safety_metrics = get_safety_metrics()
results = safety_metrics.evaluate(predictions, references)
```
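If you run several bundles over the same predictions, it can be convenient to flatten the nested results into one flat report. A minimal sketch, assuming each bundle's `evaluate` returns a dict mapping metric names to score dicts (the `flatten_results` helper and the sample dicts below are illustrative, not part of Benchwise):

```python
# Sketch: flatten nested bundle results into a single flat report.
# Assumes each bundle yields {metric_name: {score_name: value, ...}};
# the input dicts here are illustrative, not real Benchwise output.

def flatten_results(bundles: dict) -> dict:
    """Flatten {bundle: {metric: {score: value}}} into {"bundle.metric.score": value}."""
    flat = {}
    for bundle_name, metrics in bundles.items():
        for metric_name, scores in metrics.items():
            for score_name, value in scores.items():
                flat[f"{bundle_name}.{metric_name}.{score_name}"] = value
    return flat

report = flatten_results({
    "qa": {"accuracy": {"accuracy": 0.85}, "rouge_l": {"f1": 0.75}},
    "safety": {"safety_score": {"mean_score": 0.98}},
})
# report["qa.accuracy.accuracy"] == 0.85
```

Dotted keys keep every score addressable in a flat namespace, which is handy for logging or dashboards.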

Creating Custom Metrics

```python
from typing import Any, Dict, List

from benchwise import evaluate

def custom_metric(predictions: List[str], references: List[str]) -> Dict[str, Any]:
    """Custom metric function."""
    scores = []

    for pred, ref in zip(predictions, references):
        # Your scoring logic
        score = calculate_score(pred, ref)
        scores.append(score)

    return {
        "mean_score": sum(scores) / len(scores) if scores else 0.0,
        "scores": scores,
    }

# Use in evaluation
@evaluate("gpt-4")
async def test_custom(model, dataset):
    responses = await model.generate(dataset.prompts)
    return custom_metric(responses, dataset.references)
```
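As a concrete, self-contained example of the contract above, here is a hypothetical metric that scores how closely each prediction's length matches its reference's (`length_ratio_metric` is illustrative and not part of Benchwise):

```python
from typing import Any, Dict, List

def length_ratio_metric(predictions: List[str], references: List[str]) -> Dict[str, Any]:
    """Score each pair by length similarity: 1.0 means equal lengths."""
    scores = []
    for pred, ref in zip(predictions, references):
        longer = max(len(pred), len(ref))
        # Ratio of shorter to longer length; two empty strings count as a perfect match.
        scores.append(min(len(pred), len(ref)) / longer if longer else 1.0)
    return {
        "mean_score": sum(scores) / len(scores) if scores else 0.0,
        "scores": scores,
    }

result = length_ratio_metric(["abcd", "xy"], ["ab", "xy"])
# result["scores"] == [0.5, 1.0]; result["mean_score"] == 0.75
```

Because it returns the same `{"mean_score": ..., "scores": [...]}` shape, it drops into an `@evaluate` test exactly like `custom_metric` above.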

Metric Return Format

All metrics return dictionaries with relevant scores:

```python
# Accuracy
{
    "accuracy": 0.85,
    "correct": 17,
    "total": 20
}

# ROUGE
{
    "f1": 0.75,
    "precision": 0.80,
    "recall": 0.70
}

# Semantic Similarity
{
    "mean_similarity": 0.85,
    "min_similarity": 0.60,
    "max_similarity": 0.95,
    "similarities": [0.85, 0.90, ...]
}
```
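To make the accuracy shape concrete, here is a minimal sketch of a metric producing exactly that dictionary; it uses whitespace-trimmed exact matching as an assumption, which is not necessarily how Benchwise's `accuracy` scores internally:

```python
from typing import Dict, List

def exact_match_accuracy(predictions: List[str], references: List[str]) -> Dict[str, float]:
    """Return the documented accuracy shape using simple exact-match scoring."""
    # Compare each prediction to its reference after trimming whitespace.
    correct = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    total = len(references)
    return {
        "accuracy": correct / total if total else 0.0,
        "correct": correct,
        "total": total,
    }

exact_match_accuracy(["Paris", "Berlin "], ["Paris", "Rome"])
# returns {"accuracy": 0.5, "correct": 1, "total": 2}
```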

Choosing the Right Metric

- For Question Answering: accuracy, semantic_similarity, and rouge_l (or use get_qa_metrics())
- For Summarization: rouge_l and bert_score_metric (or use get_text_generation_metrics())
- For Translation: bleu_score and bert_score_metric
- For Safety: safety_score (or use get_safety_metrics())
- For Coherence: coherence_score
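One way to encode these recommendations in a benchmark harness is a simple task-to-metrics lookup. A sketch using the metric names from this page (the mapping and `metrics_for` helper are illustrative, not a Benchwise API):

```python
# Illustrative lookup from task type to recommended metric names (names from this page).
RECOMMENDED_METRICS = {
    "question_answering": ["accuracy", "semantic_similarity", "rouge_l"],
    "summarization": ["rouge_l", "bert_score_metric"],
    "translation": ["bleu_score", "bert_score_metric"],
    "safety": ["safety_score"],
    "coherence": ["coherence_score"],
}

def metrics_for(task: str) -> list:
    """Look up the suggested metric names for a task, raising on unknown tasks."""
    try:
        return RECOMMENDED_METRICS[task]
    except KeyError:
        raise ValueError(f"Unknown task: {task!r}") from None
```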

Next Steps

Explore the documentation for each individual metric to see its parameters and return values in detail.