Welcome to Benchwise
Benchwise is an open-source Python SDK for LLM evaluation with PyTest-like syntax. It allows you to create custom evaluations, run benchmarks across multiple models, and share results with the community.
Why Benchwise?
- PyTest-like Syntax - Familiar decorator-based API (@evaluate, @benchmark)
- Multi-Provider Support - OpenAI, Anthropic, Google, HuggingFace
- Built-in Metrics - ROUGE, BLEU, BERT-score, semantic similarity, and more
- Async-First - Built for performance with async/await throughout
- Results Management - Result caching and offline mode
- Dataset Tools - Load standard benchmarks (MMLU, HellaSwag, GSM8K)
- Community Sharing - Share and discover benchmarks (Coming Soon)
Quick Example
import asyncio
from benchwise import evaluate, create_qa_dataset, accuracy

# Create a simple dataset
dataset = create_qa_dataset(
    questions=["What is the capital of France?", "What is 2+2?"],
    answers=["Paris", "4"]
)

# Evaluate multiple models
@evaluate("gpt-3.5-turbo", "gemini-2.5-flash")
async def test_qa(model, dataset):
    responses = await model.generate(dataset.prompts)
    scores = accuracy(responses, dataset.references)
    return {"accuracy": scores["accuracy"]}

# Run it
results = asyncio.run(test_qa(dataset))
for result in results:
    print(f"{result.model_name}: {result.result['accuracy']:.2%}")

# Note: Model names shown are examples and may change. Verify available models in provider documentation.
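To see why the async-first design matters, here is an illustrative, self-contained sketch of the fan-out pattern the example above relies on: the same test runs concurrently against every model via asyncio.gather. The StubModel class and run_across_models helper are hypothetical stand-ins, not part of the Benchwise API.

```python
import asyncio

# Illustrative sketch only: a stub "model" and a manual fan-out showing the
# concurrency pattern behind multi-model evaluation. The real @evaluate
# decorator also handles client setup, retries, and result objects.
class StubModel:
    def __init__(self, name, answers):
        self.name = name
        self._answers = answers

    async def generate(self, prompts):
        await asyncio.sleep(0)  # yield control, as a real API call would
        return [self._answers.get(p, "") for p in prompts]

async def run_across_models(models, prompts, references):
    async def score(model):
        responses = await model.generate(prompts)
        correct = sum(r == ref for r, ref in zip(responses, references))
        return model.name, correct / len(references)

    # Evaluate all models concurrently rather than one at a time
    return await asyncio.gather(*(score(m) for m in models))

prompts = ["What is the capital of France?", "What is 2+2?"]
references = ["Paris", "4"]
models = [
    StubModel("model-a", {"What is the capital of France?": "Paris", "What is 2+2?": "4"}),
    StubModel("model-b", {"What is the capital of France?": "Paris"}),
]
results = asyncio.run(run_across_models(models, prompts, references))
```

Because the per-model calls overlap, total wall-clock time scales with the slowest model rather than the sum of all of them.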
Key Features
Decorator-Based Evaluation
@benchmark("medical_qa", "Medical question answering")
@evaluate("gpt-3.5-turbo", "gemini-2.5-flash")
async def test_medical_qa(model, dataset):
    responses = await model.generate(dataset.prompts)
    return accuracy(responses, dataset.references)
Multi-Provider Support
# OpenAI models
@evaluate("gpt-5", "gpt-5-nano")
async def test_openai(model, dataset):
    ...

# Anthropic models
@evaluate("claude-4.5-opus", "claude-4.5-sonnet")
async def test_anthropic(model, dataset):
    ...

# Google models
@evaluate("gemini-2.5-pro", "gemini-2.5-flash")
async def test_google(model, dataset):
    ...

# HuggingFace models
@evaluate("microsoft/DialoGPT-medium")
async def test_huggingface(model, dataset):
    ...
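One way a multi-provider SDK can work with plain string model names, as shown above, is to infer the provider from the name's shape. The infer_provider function below is a hypothetical sketch of that idea; Benchwise's actual routing logic may differ.

```python
# Hypothetical sketch: route a model name string to a provider by its shape.
# This mirrors the naming patterns in the examples above but is NOT
# Benchwise's actual implementation.
def infer_provider(model_name: str) -> str:
    if "/" in model_name:
        return "huggingface"  # e.g. "microsoft/DialoGPT-medium"
    if model_name.startswith("gpt-"):
        return "openai"
    if model_name.startswith("claude-"):
        return "anthropic"
    if model_name.startswith("gemini-"):
        return "google"
    raise ValueError(f"Cannot infer provider for model: {model_name}")
```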
Built-in Metrics
from benchwise.metrics import (
rouge_l, # Text overlap
bleu_score, # Translation quality
bert_score_metric, # Semantic similarity
accuracy, # Exact match
semantic_similarity, # Embedding similarity
safety_score, # Content safety
)
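For intuition, here is a minimal pure-Python sketch of what an exact-match accuracy metric computes, assuming case and whitespace normalization before comparison. Benchwise's accuracy() may normalize differently; this stand-alone version is for illustration only.

```python
# Minimal sketch of an exact-match accuracy metric (assumed normalization:
# lowercase, trimmed, collapsed whitespace). Not Benchwise's implementation.
def exact_match_accuracy(responses, references):
    def norm(s: str) -> str:
        return " ".join(s.strip().lower().split())

    matches = [norm(r) == norm(ref) for r, ref in zip(responses, references)]
    return {"accuracy": sum(matches) / len(matches)}

result = exact_match_accuracy(["Paris", " 4 ", "London"], ["paris", "4", "Berlin"])
```

Overlap metrics like ROUGE-L and BLEU, and embedding-based metrics like BERTScore, relax this all-or-nothing comparison to give partial credit for near matches.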
Community Features (Coming Soon)
We're building a platform to share and discover LLM evaluation benchmarks with the community:
- Share Your Benchmarks - Publish your evaluation results and benchmarks
- Discover Benchmarks - Browse community-contributed evaluations
- Compare Results - See how different models perform across various tasks
- Leaderboards - Track model performance across popular benchmarks
Stay tuned for updates on the community platform launch!
Get Involved
- GitHub Repository - Star us and contribute!
- Issue Tracker - Report bugs or request features
- PyPI Package - Install via pip
License
Benchwise is open-source software licensed under the MIT license.