Welcome to Benchwise

Benchwise is an open-source Python SDK for LLM evaluation with PyTest-like syntax. It allows you to create custom evaluations, run benchmarks across multiple models, and share results with the community.

Why Benchwise?

  • PyTest-like Syntax - Familiar decorator-based API (@evaluate, @benchmark)
  • Multi-Provider Support - OpenAI, Anthropic, Google, HuggingFace
  • Built-in Metrics - ROUGE, BLEU, BERT-score, semantic similarity, and more
  • Async-First - Built for performance with async/await throughout
  • Results Management - Caching, offline mode
  • Dataset Tools - Load standard benchmarks (MMLU, HellaSwag, GSM8K)
  • Community Sharing - Share and discover benchmarks (Coming Soon)

Quick Example

import asyncio
from benchwise import evaluate, create_qa_dataset, accuracy

# Create a simple dataset
dataset = create_qa_dataset(
    questions=["What is the capital of France?", "What is 2+2?"],
    answers=["Paris", "4"]
)

# Evaluate multiple models
@evaluate("gpt-3.5-turbo", "gemini-2.5-flash")
async def test_qa(model, dataset):
    responses = await model.generate(dataset.prompts)
    scores = accuracy(responses, dataset.references)
    return {"accuracy": scores["accuracy"]}

# Run it
results = asyncio.run(test_qa(dataset))
for result in results:
    print(f"{result.model_name}: {result.result['accuracy']:.2%}")

Note: the model names shown are examples and may change; check your provider's documentation for currently available models.
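The `accuracy` metric above is exact-match scoring. As a minimal plain-Python sketch of the idea (illustrative only, not Benchwise's implementation; `exact_match_accuracy` is a hypothetical name):

```python
def exact_match_accuracy(responses, references):
    """Fraction of responses that exactly match their reference (case-insensitive)."""
    if not references:
        return {"accuracy": 0.0}
    hits = sum(
        r.strip().lower() == ref.strip().lower()
        for r, ref in zip(responses, references)
    )
    return {"accuracy": hits / len(references)}

print(exact_match_accuracy(["Paris", "5"], ["Paris", "4"]))  # {'accuracy': 0.5}
```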

Key Features

Decorator-Based Evaluation

@benchmark("medical_qa", "Medical question answering")
@evaluate("gpt-3.5-turbo", "gemini-2.5-flash")
async def test_medical_qa(model, dataset):
    responses = await model.generate(dataset.prompts)
    return accuracy(responses, dataset.references)

Multi-Provider Support

# OpenAI models
@evaluate("gpt-5", "gpt-5-nano")
async def test_openai(model, dataset):
    ...

# Anthropic models
@evaluate("claude-4.5-opus", "claude-4.5-sonnet")
async def test_anthropic(model, dataset):
    ...

# Google models
@evaluate("gemini-2.5-pro", "gemini-2.5-flash")
async def test_google(model, dataset):
    ...

# HuggingFace models
@evaluate("microsoft/DialoGPT-medium")
async def test_huggingface(model, dataset):
    ...

Built-in Metrics

from benchwise.metrics import (
    rouge_l,              # Text overlap
    bleu_score,           # Translation quality
    bert_score_metric,    # Semantic similarity
    accuracy,             # Exact match
    semantic_similarity,  # Embedding similarity
    safety_score,         # Content safety
)
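ROUGE-L, for example, scores the longest common subsequence (LCS) of tokens shared between a candidate and its reference. A self-contained sketch of that core computation (illustrative only; `rouge_l_f1` and `lcs_length` are hypothetical names, not the `rouge_l` implementation):

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l_f1(candidate, reference):
    """ROUGE-L F1 over whitespace-separated tokens."""
    c, r = candidate.split(), reference.split()
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)

print(rouge_l_f1("the cat sat", "the cat sat"))  # 1.0
```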

Community Features (Coming Soon)

We're building a platform to share and discover LLM evaluation benchmarks with the community:

  • Share Your Benchmarks - Publish your evaluation results and benchmarks
  • Discover Benchmarks - Browse community-contributed evaluations
  • Compare Results - See how different models perform across various tasks
  • Leaderboards - Track model performance across popular benchmarks

Stay tuned for updates on the community platform launch!

License

Benchwise is open-source software licensed under the MIT license.