Factual Correctness
Evaluate the factual correctness of model predictions against reference texts, using named-entity, keyword, and semantic overlap measures.
Signature
```python
from typing import List, Dict, Any, Optional

def factual_correctness(
    predictions: List[str],
    references: List[str],
    fact_checker_endpoint: Optional[str] = None,
    use_named_entities: bool = True,
    return_confidence: bool = True,
    detailed_analysis: bool = True,
) -> Dict[str, Any]: ...
```
Parameters
- predictions (List[str]): Model-generated predictions
- references (List[str]): Ground truth references
- fact_checker_endpoint (Optional[str]): API endpoint for an external fact-checking service (default: None)
- use_named_entities (bool): Whether to use named entity recognition for better fact extraction (default: True)
- return_confidence (bool): Whether to return confidence intervals (default: True)
- detailed_analysis (bool): Whether to return detailed factual analysis (default: True)
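For reference, a call that sets every parameter explicitly. The endpoint URL below is a hypothetical placeholder, not a real service:

```python
from benchwise import factual_correctness

scores = factual_correctness(
    predictions=["Paris is the capital of France"],
    references=["Paris is the capital city of France"],
    fact_checker_endpoint="https://factcheck.example.com/v1",  # hypothetical placeholder
    use_named_entities=True,
    return_confidence=True,
    detailed_analysis=True,
)
```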
Returns
Dictionary containing:
- mean_correctness (float): Average factual correctness score (0.0 to 1.0)
- median_correctness (float): Median correctness score
- std_correctness (float): Standard deviation of scores
- min_correctness (float): Minimum correctness score
- max_correctness (float): Maximum correctness score
- scores (List[float]): Individual correctness scores
- components (dict, optional): Component-level analysis (if detailed_analysis=True)
  - entity_overlap: Named-entity overlap scores
  - keyword_overlap: Keyword overlap scores
  - semantic_overlap: Semantic overlap scores
- detailed_results (List[dict], optional): Per-sample detailed analysis (if detailed_analysis=True)
- correctness_confidence_interval (tuple, optional): 95% confidence interval (if return_confidence=True)
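For orientation, the returned dictionary for a two-sample run might be shaped like this. The values are illustrative placeholders, not real output, and the exact sub-keys of components and detailed_results are inferred from the examples later on this page:

```python
result = {
    "mean_correctness": 0.85,
    "median_correctness": 0.85,
    "std_correctness": 0.05,
    "min_correctness": 0.80,
    "max_correctness": 0.90,
    "scores": [0.80, 0.90],
    # Present only when detailed_analysis=True:
    "components": {
        "entity_overlap": {"mean": 0.90},
        "keyword_overlap": {"mean": 0.85},
        "semantic_overlap": {"mean": 0.80},
    },
    "detailed_results": [
        {"entity_overlap": 0.90, "keyword_overlap": 0.85, "semantic_overlap": 0.80},
        {"entity_overlap": 0.90, "keyword_overlap": 0.85, "semantic_overlap": 0.80},
    ],
    # Present only when return_confidence=True:
    "correctness_confidence_interval": (0.78, 0.92),
}
```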
Usage
```python
from benchwise import factual_correctness

predictions = [
    "Paris is the capital of France",
    "The Earth orbits the Sun",
]
references = [
    "Paris is the capital city of France",
    "Earth revolves around the Sun",
]

result = factual_correctness(predictions, references)
print(f"Mean Correctness: {result['mean_correctness']:.3f}")
print(f"Entity Overlap: {result['components']['entity_overlap']['mean']:.3f}")
```
Basic Usage
```python
from benchwise import factual_correctness

# Simple factual correctness check
predictions = ["Tokyo is the capital of Japan"]
references = ["Tokyo is Japan's capital city"]

scores = factual_correctness(predictions, references)
print(f"Correctness: {scores['mean_correctness']:.2%}")
```
In Evaluations
```python
from benchwise import evaluate, create_qa_dataset, factual_correctness

dataset = create_qa_dataset(
    questions=["What is the capital of Germany?", "Who invented the telephone?"],
    answers=["Berlin", "Alexander Graham Bell"],
)

@evaluate("gpt-4")
async def test_factual_qa(model, dataset):
    responses = await model.generate(dataset.prompts)
    scores = factual_correctness(responses, dataset.references)
    return {
        "factual_correctness": scores["mean_correctness"],
        "entity_overlap": scores["components"]["entity_overlap"]["mean"],
    }
```
Named Entity Recognition
The metric can use spaCy for enhanced named entity recognition:
```python
# With NER enabled (default)
scores = factual_correctness(
    predictions,
    references,
    use_named_entities=True,
)

# Without NER (keyword-based only)
scores = factual_correctness(
    predictions,
    references,
    use_named_entities=False,
)
```
Note: Named entity recognition requires spaCy and the English model:
```bash
pip install spacy
python -m spacy download en_core_web_sm
```
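If you are unsure whether spaCy and its English model are available at runtime, you can probe for them and fall back to keyword-based extraction. A minimal sketch; the try/except guard is this example's, not part of benchwise:

```python
# Enable NER only when spaCy and en_core_web_sm are actually installed.
try:
    import spacy
    spacy.load("en_core_web_sm")  # raises OSError if the model is missing
    ner_available = True
except (ImportError, OSError):
    ner_available = False

scores = factual_correctness(predictions, references, use_named_entities=ner_available)
```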
Detailed Analysis
Get component-level breakdown of factual correctness:
```python
scores = factual_correctness(
    predictions,
    references,
    detailed_analysis=True,
)

# Access component scores
print("Entity Overlap:", scores["components"]["entity_overlap"]["mean"])
print("Keyword Overlap:", scores["components"]["keyword_overlap"]["mean"])
print("Semantic Overlap:", scores["components"]["semantic_overlap"]["mean"])

# Access per-sample details
for i, detail in enumerate(scores["detailed_results"]):
    print(f"\nSample {i+1}:")
    print(f"  Entity: {detail['entity_overlap']:.3f}")
    print(f"  Keyword: {detail['keyword_overlap']:.3f}")
    print(f"  Semantic: {detail['semantic_overlap']:.3f}")
```
Confidence Intervals
Get statistical confidence intervals for factual correctness:
```python
scores = factual_correctness(
    predictions,
    references,
    return_confidence=True,
)

if "correctness_confidence_interval" in scores:
    ci_lower, ci_upper = scores["correctness_confidence_interval"]
    print(f"95% CI: [{ci_lower:.3f}, {ci_upper:.3f}]")
```
Minimal Output
For lightweight usage, disable detailed analysis:
```python
scores = factual_correctness(
    predictions,
    references,
    detailed_analysis=False,
    return_confidence=False,
)
# Returns only: mean_correctness, median_correctness, std_correctness,
# min_correctness, max_correctness, scores
```
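The minimal form returns only the aggregate statistics, which keeps comparison loops tidy. A sketch comparing two prediction sets against the same references; predictions_a and predictions_b are placeholder names for your own model outputs:

```python
systems = {
    "model_a": predictions_a,  # placeholder: your first model's outputs
    "model_b": predictions_b,  # placeholder: your second model's outputs
}

for name, preds in systems.items():
    s = factual_correctness(preds, references, detailed_analysis=False, return_confidence=False)
    print(f"{name}: {s['mean_correctness']:.3f} (std {s['std_correctness']:.3f})")
```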
See Also
- Accuracy - Exact match accuracy
- Semantic Similarity - Meaning-based matching
- Metrics Overview - All available metrics