ROUGE-L
Calculate ROUGE-L score for text overlap evaluation.
Signature
def rouge_l(
    predictions: List[str],
    references: List[str],
    use_stemmer: bool = True,
    alpha: float = 0.5,
    return_confidence: bool = True,
) -> Dict[str, float]:
    ...
Parameters
- predictions (List[str]): Model-generated texts
- references (List[str]): Reference texts
- use_stemmer (bool, optional): Whether to use stemming for better matching. Defaults to True.
- alpha (float, optional): Weighting parameter for the F-score calculation; 0.5 is balanced, values below 0.5 favor precision, and values above 0.5 favor recall (see the sketch after this list). Defaults to 0.5.
- return_confidence (bool, optional): Whether to return confidence intervals. Defaults to True.
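The calls below sketch how these parameters combine in practice. The argument names match the signature above; the effect of alpha on the reported f1 follows the parameter description, but the exact values depend on benchwise's internal F-score calculation, so treat the snippet as illustrative.
from benchwise import rouge_l
predictions = ["The cat sat on the mat"]
references = ["A cat was sitting on the mat"]
# Default: stemming on, balanced F-score
balanced = rouge_l(predictions, references)
# alpha below 0.5 weights the F-score toward precision,
# alpha above 0.5 weights it toward recall
precision_leaning = rouge_l(predictions, references, alpha=0.3)
recall_leaning = rouge_l(predictions, references, alpha=0.7)
# Disable stemming to require exact surface-form matches
exact = rouge_l(predictions, references, use_stemmer=False)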
Returns
Dictionary containing:
- f1 (float): Mean F1 score (custom calculation with alpha).
- precision (float): Mean Precision score.
- recall (float): Mean Recall score.
- rouge1_f1 (float): Mean ROUGE-1 F1 score.
- rouge2_f1 (float): Mean ROUGE-2 F1 score.
- std_f1 (float): Standard deviation of F1 scores.
- std_precision (float): Standard deviation of Precision scores.
- std_recall (float): Standard deviation of Recall scores.
- scores (Dict[str, List[float]]): Dictionary containing lists of individual precision, recall, and f1 scores.
- f1_confidence_interval (Tuple[float, float], optional): 95% confidence interval for the F1 score (if return_confidence is True).
- precision_confidence_interval (Tuple[float, float], optional): 95% confidence interval for the Precision score (if return_confidence is True).
- recall_confidence_interval (Tuple[float, float], optional): 95% confidence interval for the Recall score (if return_confidence is True).
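The snippet below shows how the aggregate statistics and confidence intervals listed above can be read from the result. It assumes return_confidence is left at its default of True; the example strings are illustrative.
from benchwise import rouge_l
predictions = ["The cat sat on the mat"]
references = ["A cat was sitting on the mat"]
result = rouge_l(predictions, references)
# Mean scores with their standard deviations
print(f"F1: {result['f1']:.3f} +/- {result['std_f1']:.3f}")
print(f"Precision: {result['precision']:.3f} +/- {result['std_precision']:.3f}")
# 95% confidence interval for F1 (present because return_confidence=True)
low, high = result["f1_confidence_interval"]
print(f"F1 95% CI: [{low:.3f}, {high:.3f}]")
# Unigram and bigram variants computed alongside ROUGE-L
print(f"ROUGE-1 F1: {result['rouge1_f1']:.3f}")
print(f"ROUGE-2 F1: {result['rouge2_f1']:.3f}")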
Usage
from benchwise import rouge_l
predictions = ["The cat sat on the mat"]
references = ["A cat was sitting on the mat"]
result = rouge_l(predictions, references)
print(f"F1: {result['f1']:.3f}")
print(f"Precision: {result['precision']:.3f}")
print(f"Recall: {result['recall']:.3f}")
For Summarization
from benchwise import evaluate, create_summarization_dataset, rouge_l
dataset = create_summarization_dataset(
    documents=["Long article..."],
    summaries=["Summary..."]
)
@evaluate("gpt-4")
async def test_summarization(model, dataset):
    prompts = [f"Summarize: {doc}" for doc in dataset.prompts]
    summaries = await model.generate(prompts)
    scores = rouge_l(summaries, dataset.references)
    return {
        "rouge_f1": scores["f1"],
        "rouge_precision": scores["precision"],
        "rouge_recall": scores["recall"],
    }
Understanding ROUGE-L
ROUGE-L measures overlap via the longest common subsequence (LCS) between each prediction and its reference:
- Precision: How much of the generated text matches the reference (LCS length relative to the prediction length)
- Recall: How much of the reference is covered by the generated text (LCS length relative to the reference length)
- F1: Harmonic mean of precision and recall (with alpha controlling the precision/recall trade-off)
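The standalone sketch below reproduces the basic LCS computation for the usage example above. It is not benchwise's implementation (which also applies stemming, the alpha weighting, and averaging over examples), but it shows where the numbers come from.
def lcs_length(a, b):
    # Classic dynamic-programming longest-common-subsequence length
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            if x == y:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[len(a)][len(b)]
pred = "The cat sat on the mat".lower().split()
ref = "A cat was sitting on the mat".lower().split()
lcs = lcs_length(pred, ref)   # LCS is "cat ... on the mat" -> length 4
precision = lcs / len(pred)   # 4 / 6 ~ 0.667: how much of the prediction is matched
recall = lcs / len(ref)       # 4 / 7 ~ 0.571: how much of the reference is covered
f1 = 2 * precision * recall / (precision + recall)  # ~ 0.615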
See Also
- BLEU Score - Alternative text similarity metric
- Semantic Similarity - Meaning-based similarity