Evaluate

The main decorator for running evaluations across multiple models.

Signature

@evaluate(*models, upload=None, **kwargs)
async def evaluate_function(model, dataset):
    ...

Parameters

  • *models (str): One or more model identifiers to evaluate.
  • upload (bool, optional): Whether to upload results to the Benchwise API (None = use the config default).
  • **kwargs: Optional parameters passed to model generation (combined in the sketch after this list):
    • temperature (float): Sampling temperature (0.0 to 1.0).
    • max_tokens (int): Maximum number of tokens to generate.
    • top_p (float): Nucleus sampling parameter.
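
As a sketch of how these options combine on one decorator (the parameter values below are illustrative, not recommendations):

from benchwise import evaluate

@evaluate("gpt-3.5-turbo", "gemini-2.5-flash", upload=False, temperature=0.2, max_tokens=256, top_p=0.9)
async def test_combined_options(model, dataset):
    responses = await model.generate(dataset.prompts)
    return {"responses": responses}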

Returns

A decorator that wraps an async evaluation function and runs it against each of the specified models.

Basic Usage

from benchwise import evaluate

@evaluate("gpt-3.5-turbo")
async def test_single_model(model, dataset):
    responses = await model.generate(dataset.prompts)
    return {"responses": responses}

Multiple Models

@evaluate("gpt-3.5-turbo", "gemini-2.5-flash")
async def test_multiple_models(model, dataset):
    ...

With Parameters

@evaluate("gpt-3.5-turbo", temperature=0.7, max_tokens=500)
async def test_with_params(model, dataset):
    ...

Function Signature

The decorated function must have this signature:

async def evaluation_function(model: ModelAdapter, dataset: Dataset) -> Dict[str, Any]:
    # Your evaluation logic
    pass

Parameters

  • model (ModelAdapter): The model adapter instance for the current model
  • dataset (Dataset): The dataset to evaluate on

Returns

  • Dict[str, Any]: Dictionary of metrics and results
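
Putting the signature and return value together, a minimal evaluation function might look like the sketch below (the import path for ModelAdapter and Dataset is an assumption; the metric names are illustrative):

from typing import Any, Dict

from benchwise import ModelAdapter, Dataset  # assumed import path

async def response_stats(model: ModelAdapter, dataset: Dataset) -> Dict[str, Any]:
    # One response per prompt in the dataset.
    responses = await model.generate(dataset.prompts)
    # Return a flat dictionary of metrics; the keys are up to you.
    return {
        "num_responses": len(responses),
        "mean_chars": sum(len(r) for r in responses) / max(len(responses), 1),
    }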

Model Interface

Inside the decorated function, the model parameter provides:

# Generate text
responses = await model.generate(prompts, temperature=0.7, max_tokens=100)

# Get token count
# Note: Token count is currently an estimate and may not be reliable.
tokens = model.get_token_count(text)

# Estimate cost
cost = model.get_cost_estimate(input_tokens, output_tokens)
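
As a sketch, these methods can be combined to report an estimated cost alongside the responses (the metric names are illustrative, and the token counts are estimates as noted above):

from benchwise import evaluate

@evaluate("gpt-3.5-turbo")
async def test_with_cost(model, dataset):
    responses = await model.generate(dataset.prompts, temperature=0.0, max_tokens=100)
    # Sum estimated token counts over prompts (input) and responses (output).
    input_tokens = sum(model.get_token_count(p) for p in dataset.prompts)
    output_tokens = sum(model.get_token_count(r) for r in responses)
    return {
        "responses": responses,
        "estimated_cost": model.get_cost_estimate(input_tokens, output_tokens),
    }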

Execution

Calling the decorated function returns a list of EvaluationResult objects, one per model:

results = asyncio.run(test_multiple_models(dataset))

for result in results:
print(f"Model: {result.model_name}")
print(f"Success: {result.success}")
print(f"Result: {result.result}")
print(f"Duration: {result.duration}")
if not result.success:
print(f"Error: {result.error}")

Complete Example

import asyncio
from benchwise import evaluate, create_qa_dataset, accuracy

dataset = create_qa_dataset(
    questions=["What is AI?"],
    answers=["Artificial Intelligence"]
)

@evaluate("gpt-3.5-turbo", "gemini-2.5-flash", temperature=0)
async def test_qa(model, dataset):
    responses = await model.generate(dataset.prompts)
    scores = accuracy(responses, dataset.references)
    return {
        "accuracy": scores["accuracy"],
        "total": len(responses)
    }

# Run evaluation
results = asyncio.run(test_qa(dataset))

# Process results
for result in results:
    if result.success:
        print(f"{result.model_name}: {result.result['accuracy']:.2%}")

Combining with @benchmark

from benchwise import benchmark, evaluate, accuracy

@benchmark("QA Benchmark", "Question answering evaluation")
@evaluate("gpt-3.5-turbo", "gemini-2.5-flash")
async def test_qa_benchmark(model, dataset):
    responses = await model.generate(dataset.prompts)
    return {"accuracy": accuracy(responses, dataset.references)["accuracy"]}

Error Handling

The decorator catches errors per model and records them on the corresponding EvaluationResult, so one failing model does not stop the others:

@evaluate("gpt-3.5-turbo", "invalid-model")
async def test_with_error(model, dataset):
    ...

results = asyncio.run(test_with_error(dataset))

# Check for failures
for result in results:
    if not result.success:
        print(f"Error in {result.model_name}: {result.error}")

Upload Results

Enable automatic upload of results to the Benchwise API:

@evaluate("gpt-3.5-turbo", upload=True)
async def test_with_upload(model, dataset):
    ...
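
Conversely, passing upload=False should keep a run local regardless of the configured default; leaving upload unset (or None) falls back to the config default described in Parameters:

@evaluate("gpt-3.5-turbo", upload=False)
async def test_local_only(model, dataset):
    ...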

See Also