Evaluate
The main decorator for running evaluations across multiple models.
Signature
@evaluate(*models, upload=None, **kwargs)
async def evaluate_function(model, dataset):
    ...
Parameters
- *models (str): One or more model identifiers to evaluate
- upload (bool, optional): Whether to upload results to the Benchwise API (None = use the config default)
- **kwargs: Optional parameters passed to model generation:
  - temperature (float): Sampling temperature (0.0 to 1.0)
  - max_tokens (int): Maximum tokens to generate
  - top_p (float): Nucleus sampling parameter
Returns
A decorator that wraps async functions to run evaluations across specified models.
Basic Usage
from benchwise import evaluate
@evaluate("gpt-3.5-turbo")
async def test_single_model(model, dataset):
    responses = await model.generate(dataset.prompts)
    return {"responses": responses}
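To run it, call the decorated function with a dataset (a sketch; it assumes a dataset built elsewhere, e.g. with create_qa_dataset, and see Execution below for the result objects):
import asyncio
results = asyncio.run(test_single_model(dataset))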
Multiple Models
@evaluate("gpt-3.5-turbo", "gemini-2.5-flash")
async def test_multiple_models(model, dataset):
    ...
With Parameters
@evaluate("gpt-3.5-turbo", temperature=0.7, max_tokens=500)
async def test_with_params(model, dataset):
    ...
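Decorator-level kwargs are forwarded to model generation, so the example above behaves roughly like passing the same values explicitly inside the function (a sketch; the exact precedence between decorator kwargs and explicit generate() arguments is an assumption):
@evaluate("gpt-3.5-turbo")
async def test_with_explicit_params(model, dataset):
    # same sampling settings as temperature=0.7, max_tokens=500 on the decorator
    responses = await model.generate(dataset.prompts, temperature=0.7, max_tokens=500)
    return {"responses": responses}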
Function Signature
The decorated function must have this signature:
async def evaluation_function(model: ModelAdapter, dataset: Dataset) -> Dict[str, Any]:
    # Your evaluation logic
    pass
Parameters
- model (ModelAdapter): The model adapter instance for the current model
- dataset (Dataset): The dataset to evaluate on
Returns
- Dict[str, Any]: Dictionary of metrics and results
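With full type annotations, a minimal evaluation function looks like this (a sketch; importing ModelAdapter and Dataset from the top-level benchwise package is an assumption about the package layout):
from typing import Any, Dict
from benchwise import Dataset, ModelAdapter, evaluate  # import path for the types is assumed

@evaluate("gpt-3.5-turbo")
async def typed_eval(model: ModelAdapter, dataset: Dataset) -> Dict[str, Any]:
    responses = await model.generate(dataset.prompts)
    return {"responses": responses, "total": len(responses)}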
Model Interface
Inside the decorated function, the model parameter provides:
# Generate text
responses = await model.generate(prompts, temperature=0.7, max_tokens=100)
# Get token count
# Note: Token count is currently an estimate and may not be reliable.
tokens = model.get_token_count(text)
# Estimate cost
cost = model.get_cost_estimate(input_tokens, output_tokens)
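A sketch that combines these calls into a rough per-run cost metric (token counts are estimates, so the cost figure is approximate):
@evaluate("gpt-3.5-turbo")
async def test_cost(model, dataset):
    responses = await model.generate(dataset.prompts, max_tokens=100)
    # both counts rely on get_token_count, which is an estimate
    input_tokens = sum(model.get_token_count(p) for p in dataset.prompts)
    output_tokens = sum(model.get_token_count(r) for r in responses)
    return {
        "responses": responses,
        "estimated_cost": model.get_cost_estimate(input_tokens, output_tokens)
    }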
Execution
Calling the decorated function returns a list of EvaluationResult objects:
results = asyncio.run(test_multiple_models(dataset))
for result in results:
    print(f"Model: {result.model_name}")
    print(f"Success: {result.success}")
    print(f"Result: {result.result}")
    print(f"Duration: {result.duration}")
    if not result.success:
        print(f"Error: {result.error}")
Complete Example
import asyncio
from benchwise import evaluate, create_qa_dataset, accuracy
dataset = create_qa_dataset(
    questions=["What is AI?"],
    answers=["Artificial Intelligence"]
)
@evaluate("gpt-3.5-turbo", "gemini-2.5-flash", temperature=0)
async def test_qa(model, dataset):
    responses = await model.generate(dataset.prompts)
    scores = accuracy(responses, dataset.references)
    return {
        "accuracy": scores["accuracy"],
        "total": len(responses)
    }
# Run evaluation
results = asyncio.run(test_qa(dataset))
# Process results
for result in results:
    if result.success:
        print(f"{result.model_name}: {result.result['accuracy']:.2%}")
Combining with @benchmark
from benchwise import accuracy, benchmark, evaluate
@benchmark("QA Benchmark", "Question answering evaluation")
@evaluate("gpt-3.5-turbo", "gemini-2.5-flash")
async def test_qa_benchmark(model, dataset):
    responses = await model.generate(dataset.prompts)
    scores = accuracy(responses, dataset.references)
    return {"accuracy": scores["accuracy"]}
Error Handling
Errors raised for one model are captured in that model's EvaluationResult rather than aborting the whole run:
@evaluate("gpt-3.5-turbo", "invalid-model")
async def test_with_error(model, dataset):
    ...
results = asyncio.run(test_with_error(dataset))
# Check for failures
for result in results:
    if not result.success:
        print(f"Error in {result.model_name}: {result.error}")
Upload Results
Enable automatic upload of results to the Benchwise API:
@evaluate("gpt-3.5-turbo", upload=True)
async def test_with_upload(model, dataset):
    ...
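Since upload=None falls back to the config default, passing upload=False disables uploading for this evaluation explicitly:
@evaluate("gpt-3.5-turbo", upload=False)
async def test_without_upload(model, dataset):
    ...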
See Also
- @benchmark - Create named benchmarks
- @stress_test - Performance testing
- Evaluation Guide - Evaluation patterns