BleurtScore (Weighted Semantic Similarity)
An advanced metric based on BLEURT (a learned, BERT-based evaluation model) that scores semantic similarity between a candidate and a reference, producing a weighted score that captures meaning rather than exact token overlap.
- Inputs: candidate (generated) text and reference text (or the query, when used as a retriever metric).
- Returns: a single weighted BLEURT score.
Use Cases
- As a generation metric → highlights which chunks contribute most to the generated output.
- As a retriever metric → measures semantic relatedness even when exact lexical matches are missing (see the sketch below).
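As a rough illustration of the retriever use case, the sketch below scores each retrieved chunk against the query. It reuses the `BleurtScore` context-manager API from the Example section; the `query` and `retrieved_chunks` names are placeholders, and the `evaluate` argument order (mirroring the Example section) is an assumption.
```python
from vero.metrics import BleurtScore

# Placeholder retriever-evaluation inputs
query = "Why is the sky blue?"
retrieved_chunks = [
    "Rayleigh scattering makes shorter (blue) wavelengths scatter more.",
    "The mailman delivers letters every morning.",
]

with BleurtScore() as bls:
    # Score each chunk against the query; the argument order is assumed
    # to mirror the Example section
    chunk_scores = [bls.evaluate(query, chunk) for chunk in retrieved_chunks]

# On-topic chunks should score noticeably higher than off-topic ones,
# even without exact word overlap with the query
for chunk, score in zip(retrieved_chunks, chunk_scores):
    print(f"{score:.2f}  {chunk}")
```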
Note:
This metric can be very useful for debugging:
- If Context Recall is low but the Weighted Semantic Similarity score is high, it tells the developer: "Your retriever is finding documents that are about the right topic, but it is failing to find the specific sentence or fact needed for the answer."
- If both scores are low, the retriever is failing at a more fundamental level (a sketch of this check follows the list).
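A minimal sketch of that decision logic, assuming `context_recall` and `bleurt_score` have already been computed by your evaluation pipeline (both variable names and the 0.5 threshold are hypothetical placeholders):
```python
# Hypothetical scores from an evaluation run; the threshold is a placeholder
context_recall = 0.2
bleurt_score = 0.85
THRESHOLD = 0.5

if context_recall < THRESHOLD and bleurt_score >= THRESHOLD:
    # Right topic, wrong granularity: docs are on-topic but miss the fact
    print("Retriever finds on-topic documents but misses the needed fact.")
elif context_recall < THRESHOLD and bleurt_score < THRESHOLD:
    # Both signals low: retrieval is failing at a more fundamental level
    print("Retriever is failing fundamentally; revisit chunking/embeddings.")
else:
    print("Retrieval looks healthy on these two signals.")
```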
Insights
| BleurtScore | Inference |
|---|---|
| closer to 1 | high semantic similarity |
| closer to 0 | low semantic similarity |
Example
```python
from vero.metrics import BleurtScore

# Example inputs: retrieved chunks and the generated answer
chunks_list = ["The cat sat on the mat.", "The dog barked at the mailman."]
answers_list = ["A cat is sitting on a mat and a dog is barking at the mailman."]

with BleurtScore() as bls:
    # Score each (chunk, answer) pair; zip stops at the shorter list
    bleurt_results = [bls.evaluate(chunk, ans) for chunk, ans in zip(chunks_list, answers_list)]

print(bleurt_results)
```
Output
```
[0.89]
```
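Only one score is printed because `zip` pairs the two chunks with the single answer and truncates to one pair; with one answer per chunk, `bleurt_results` would hold one score per (chunk, answer) pair.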