Welcome to Benchwise
Benchwise is an open-source Python SDK for LLM evaluation with PyTest-like syntax. It allows you to create custom evaluations, run benchmarks across multiple models, and share results with the community.
Why Benchwise?
- PyTest-like Syntax - Familiar decorator-based API (@evaluate, @benchmark)
- Multi-Provider Support - OpenAI, Anthropic, Google, HuggingFace
- Built-in Metrics - ROUGE, BLEU, BERT-score, semantic similarity, and more
- Async-First - Built for performance with async/await throughout
- Results Management - Result caching and offline mode
- Dataset Tools - Load standard benchmarks (MMLU, HellaSwag, GSM8K)
- Community Sharing - Share and discover benchmarks (Coming Soon)
Quick Example
import asyncio
from benchwise import evaluate, create_qa_dataset, accuracy

# Create a simple dataset
dataset = create_qa_dataset(
    questions=["What is the capital of France?", "What is 2+2?"],
    answers=["Paris", "4"]
)

# Evaluate multiple models
@evaluate("gpt-3.5-turbo", "gemini-2.5-flash")
async def test_qa(model, dataset):
    responses = await model.generate(dataset.prompts)
    scores = accuracy(responses, dataset.references)
    return {"accuracy": scores["accuracy"]}

# Run it
results = asyncio.run(test_qa(dataset))
for result in results:
    print(f"{result.model_name}: {result.result['accuracy']:.2%}")

# Note: Model names shown are examples and may change. Verify available models in provider documentation.
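To see why the async-first design matters, here is an illustrative, self-contained sketch of the fan-out pattern the example above relies on: the same test runs concurrently against every model via asyncio.gather. The StubModel class and run_across_models helper are hypothetical stand-ins, not part of the Benchwise API.

```python
import asyncio

# Illustrative sketch only: a stub "model" and a manual fan-out showing the
# concurrency pattern behind multi-model evaluation. The real @evaluate
# decorator also handles client setup, retries, and result objects.
class StubModel:
    def __init__(self, name, answers):
        self.name = name
        self._answers = answers

    async def generate(self, prompts):
        await asyncio.sleep(0)  # yield control, as a real API call would
        return [self._answers.get(p, "") for p in prompts]

async def run_across_models(models, prompts, references):
    async def score(model):
        responses = await model.generate(prompts)
        correct = sum(r == ref for r, ref in zip(responses, references))
        return model.name, correct / len(references)

    # Evaluate all models concurrently rather than one at a time
    return await asyncio.gather(*(score(m) for m in models))

prompts = ["What is the capital of France?", "What is 2+2?"]
references = ["Paris", "4"]
models = [
    StubModel("model-a", {"What is the capital of France?": "Paris", "What is 2+2?": "4"}),
    StubModel("model-b", {"What is the capital of France?": "Paris"}),
]
results = asyncio.run(run_across_models(models, prompts, references))
```

Because the per-model calls overlap, total wall-clock time scales with the slowest model rather than the sum of all of them.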
Key Features
Decorator-Based Evaluation
@benchmark("medical_qa", "Medical question answering")
@evaluate("gpt-3.5-turbo", "gemini-2.5-flash")
async def test_medical_qa(model, dataset):
    responses = await model.generate(dataset.prompts)
    return accuracy(responses, dataset.references)
Multi-Provider Support
# OpenAI models
@evaluate("gpt-5", "gpt-5-nano")
async def test_openai(model, dataset):
    ...

# Anthropic models
@evaluate("claude-4.5-opus", "claude-4.5-sonnet")
async def test_anthropic(model, dataset):
    ...

# Google models
@evaluate("gemini-2.5-pro", "gemini-2.5-flash")
async def test_google(model, dataset):
    ...

# HuggingFace models
@evaluate("microsoft/DialoGPT-medium")
async def test_huggingface(model, dataset):
    ...
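One way a multi-provider SDK can work with plain string model names, as shown above, is to infer the provider from the name's shape. The infer_provider function below is a hypothetical sketch of that idea; Benchwise's actual routing logic may differ.

```python
# Hypothetical sketch: route a model name string to a provider by its shape.
# This mirrors the naming patterns in the examples above but is NOT
# Benchwise's actual implementation.
def infer_provider(model_name: str) -> str:
    if "/" in model_name:
        return "huggingface"  # e.g. "microsoft/DialoGPT-medium"
    if model_name.startswith("gpt-"):
        return "openai"
    if model_name.startswith("claude-"):
        return "anthropic"
    if model_name.startswith("gemini-"):
        return "google"
    raise ValueError(f"Cannot infer provider for model: {model_name}")
```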
Built-in Metrics
from benchwise.metrics import (
rouge_l, # Text overlap
bleu_score, # Translation quality
bert_score_metric, # Semantic similarity
accuracy, # Exact match
semantic_similarity, # Embedding similarity
safety_score, # Content safety
)
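For intuition, here is a minimal pure-Python sketch of what an exact-match accuracy metric computes, assuming case and whitespace normalization before comparison. Benchwise's accuracy() may normalize differently; this stand-alone version is for illustration only.

```python
# Minimal sketch of an exact-match accuracy metric (assumed normalization:
# lowercase, trimmed, collapsed whitespace). Not Benchwise's implementation.
def exact_match_accuracy(responses, references):
    def norm(s: str) -> str:
        return " ".join(s.strip().lower().split())

    matches = [norm(r) == norm(ref) for r, ref in zip(responses, references)]
    return {"accuracy": sum(matches) / len(matches)}

result = exact_match_accuracy(["Paris", " 4 ", "London"], ["paris", "4", "Berlin"])
```

Overlap metrics like ROUGE-L and BLEU, and embedding-based metrics like BERTScore, relax this all-or-nothing comparison to give partial credit for near matches.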
Community Features (Coming Soon)
We're building a platform to share and discover LLM evaluation benchmarks with the community:
- Share Your Benchmarks - Publish your evaluation results and benchmarks
- Discover Benchmarks - Browse community-contributed evaluations
- Compare Results - See how different models perform across various tasks
- Leaderboards - Track model performance across popular benchmarks
Stay tuned for updates on the community platform launch!
Get Involved
- GitHub Repository - Star us and contribute!
- Issue Tracker - Report bugs or request features
- PyPI Package - Install via pip
License
Benchwise is open-source software licensed under the MIT license.