FAQ

Frequently asked questions about Benchwise.

General

Common questions about Benchwise's purpose and availability.

What is Benchwise?

Benchwise is an open-source Python SDK for LLM evaluation with pytest-like syntax. It lets you create custom evaluations, run benchmarks across multiple models, and share results with the community.

Is Benchwise free?

Yes. Benchwise is open-source and free to use under the MIT license. You pay only for the LLM API calls you make to providers such as OpenAI or Anthropic.

Which LLM providers are supported?

OpenAI (GPT-3.5, GPT-4)
Anthropic (Claude 3 models)
Google (Gemini)
HuggingFace (any model)

Usage

How to use Benchwise for evaluation and benchmarking.

How do I get started?

Install: pip install benchwise
Set API keys
Create your first evaluation

See the Quickstart Guide.

Can I use custom metrics?

Yes, you can define your own metrics. See the Custom Metrics Guide.

How do I compare multiple models?

Use the @evaluate decorator with multiple model names:

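from benchwise import evaluate  # assumed import path; see the Quickstart Guide

# The same test is declared once and listed against several model names.
@evaluate("gpt-4", "claude-3-opus", "gemini-pro")
async def my_test(model, dataset):
    # Test logic
    pass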

Technical

Answers to technical questions about Benchwise's architecture and features.

Why async/await?

Async enables efficient concurrent API calls, reducing evaluation time when testing multiple models or large datasets.
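To illustrate why concurrency shortens evaluation time, here is a minimal sketch that uses only the standard library. `fake_model_call` is a hypothetical stand-in for a network-bound provider request and is not part of Benchwise; the point is only that awaiting calls together takes roughly as long as the slowest one rather than the sum of all of them.

```python
import asyncio
import time


async def fake_model_call(model: str, delay: float = 0.1) -> str:
    """Hypothetical stand-in for a network-bound LLM API call."""
    await asyncio.sleep(delay)  # simulates waiting on the provider
    return f"{model}: ok"


async def run_sequential(models: list[str]) -> list[str]:
    # Each call waits for the previous one to finish.
    return [await fake_model_call(m) for m in models]


async def run_concurrent(models: list[str]) -> list[str]:
    # All calls are in flight at once; results keep the input order.
    return await asyncio.gather(*(fake_model_call(m) for m in models))


def timed(coro_fn, models):
    """Run coro_fn(models) and return (results, elapsed_seconds)."""
    start = time.perf_counter()
    results = asyncio.run(coro_fn(models))
    return results, time.perf_counter() - start
```

Calling `timed(run_sequential, models)` and `timed(run_concurrent, models)` with the same list returns identical results, with the concurrent version finishing sooner.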

How are costs calculated?

Model adapters estimate costs based on token usage and provider pricing. Use model.get_cost_estimate() to check costs before running evaluations.
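The arithmetic behind a token-based estimate can be sketched as follows. This is not Benchwise's implementation: the `Pricing` class and `estimate_cost` function are hypothetical names, and any prices you pass in are placeholders that should be replaced with your provider's published rates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Per-1,000-token prices in USD (placeholder values, not real rates)."""
    input_per_1k: float
    output_per_1k: float


def estimate_cost(input_tokens: int, output_tokens: int, pricing: Pricing) -> float:
    """Return the estimated USD cost of one request."""
    return (
        input_tokens / 1000 * pricing.input_per_1k
        + output_tokens / 1000 * pricing.output_per_1k
    )
```

Input and output tokens are priced separately because providers commonly charge different rates for each.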

Can I cache results?

Yes, results are automatically cached. Use cache.clear_cache() to clear the cache when needed.
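As a generic illustration of why cached results avoid repeated API calls, the sketch below keys an in-memory store on a hash of the model name and prompt. It does not show how Benchwise's own cache works; `ResultCache` and its methods are hypothetical names used only for this example.

```python
import hashlib
import json


class ResultCache:
    """In-memory cache keyed by a hash of (model, prompt). Illustrative only."""

    def __init__(self) -> None:
        self._store: dict[str, str] = {}

    @staticmethod
    def _key(model: str, prompt: str) -> str:
        payload = json.dumps({"model": model, "prompt": prompt}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get_or_compute(self, model: str, prompt: str, compute) -> str:
        key = self._key(model, prompt)
        if key not in self._store:
            # compute() only runs on a cache miss.
            self._store[key] = compute(model, prompt)
        return self._store[key]

    def clear(self) -> None:
        self._store.clear()
```

Clearing matters when something outside the key changes, for example a model version updated behind the same name, because stale entries would otherwise be returned.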

Community

Information on contributing to and getting support for Benchwise.

How do I share benchmarks?

The community sharing platform is coming soon! You'll be able to upload and discover benchmarks.

How can I contribute?

See the Contributing Guide.

Where can I get help?

Troubleshooting

Solutions to common issues encountered while using Benchwise.

API key errors?

Ensure API keys are set as environment variables:

export OPENAI_API_KEY="your-key"
export ANTHROPIC_API_KEY="your-key"
export GOOGLE_API_KEY="your-key"
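A quick way to confirm the variables are visible to Python is to check them before running an evaluation. The helper below is a hypothetical sketch, not a Benchwise function; it assumes the three variable names listed above, and you may only need the ones for the providers you actually use.

```python
import os

REQUIRED_KEYS = ("OPENAI_API_KEY", "ANTHROPIC_API_KEY", "GOOGLE_API_KEY")


def missing_api_keys(required=REQUIRED_KEYS, env=None) -> list[str]:
    """Return the names of variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


if __name__ == "__main__":
    missing = missing_api_keys()
    if missing:
        print("Missing API keys:", ", ".join(missing))
```

If a key you exported is reported as missing, the export most likely happened in a different shell session from the one running Python.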

Import errors?

Install optional dependencies:

pip install benchwise[all]

Rate limiting?

Benchwise handles rate limits automatically. For heavy workloads, consider reducing concurrency or using smaller models.
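One common way to reduce concurrency in your own async code is a semaphore that caps the number of requests in flight. The sketch below uses only `asyncio`; `run_limited` and its `call` argument are hypothetical and do not describe a Benchwise setting.

```python
import asyncio


async def run_limited(prompts, call, max_concurrency: int = 2):
    """Run call(prompt) for each prompt with at most max_concurrency in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(prompt):
        async with semaphore:  # waits here while the cap is reached
            return await call(prompt)

    # Results are returned in the same order as the prompts.
    return await asyncio.gather(*(guarded(p) for p in prompts))
```

Lowering `max_concurrency` trades total run time for fewer simultaneous requests, which helps when a provider's rate limit is tight.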
