BleurtScore (Weighted Semantic Similarity)

An advanced metric built on BLEURT that produces a weighted semantic-similarity score, capturing meaning-level overlap rather than exact token matches.

  • Inputs: candidate (generated) text and reference text (or the query, when used as a retriever metric).
  • Returns: a single weighted BLEURT score.

Use Cases

  • As a generation metric → highlights which retrieved chunks contribute most to the generated output.
  • As a retriever metric → measures semantic relatedness between the query and the retrieved chunks, even when exact matches are missing (see the sketch after this list).
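
As a minimal retriever-metric sketch, assuming the same BleurtScore.evaluate(candidate, reference) call shown in the Example below, with a hypothetical query and retrieved chunks:

from vero.metrics import BleurtScore

# hypothetical query and retrieved chunks, for illustration only
query = "What did the cat do?"
retrieved_chunks = [
    "The cat sat on the mat.",
    "The stock market closed higher today.",
]

with BleurtScore() as bls:
    # score each retrieved chunk against the query
    scores = [bls.evaluate(chunk, query) for chunk in retrieved_chunks]
print(scores)  # a topically related chunk should score higher than an unrelated one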

Note:

This metric can be very useful for debugging retrieval issues:

  • If Context Recall is low but the Weighted Semantic Similarity score is high, it tells the developer: "Your retriever is finding documents that are about the right topic, but it is failing to find the specific sentence or fact needed for the answer."
  • If both scores are low, the retriever is failing at a more fundamental level, as sketched below.
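
A minimal sketch of this debugging heuristic, assuming you already have a Context Recall value from elsewhere in your evaluation pipeline; the 0.5 threshold is illustrative, not a library default:

def diagnose_retriever(context_recall: float,
                       weighted_similarity: float,
                       threshold: float = 0.5) -> str:
    # threshold is an arbitrary illustrative cutoff, not a vero default
    if context_recall < threshold and weighted_similarity >= threshold:
        return "Retriever finds on-topic documents but misses the specific fact needed."
    if context_recall < threshold and weighted_similarity < threshold:
        return "Retriever is failing at a more fundamental level: results are off-topic."
    return "Retriever looks healthy on both metrics."

print(diagnose_retriever(context_recall=0.2, weighted_similarity=0.85))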

Insights

BleurtScore     Inference
closer to 1     high semantic similarity
closer to 0     low semantic similarity
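
As a rough illustration of reading the table (the 0.5 cutoff below is an arbitrary assumption, not a library convention):

def interpret_bleurt(score: float) -> str:
    # arbitrary cutoff for illustration only
    return "high semantic similarity" if score >= 0.5 else "low semantic similarity"

print(interpret_bleurt(0.89))  # high semantic similarity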

Example

from vero.metrics import BleurtScore

# example inputs: retrieved chunks and the generated answer they support
chunks_list = ["The cat sat on the mat.", "The dog barked at the mailman."]
answers_list = ["A cat is sitting on a mat and a dog is barking at the mailman."]

with BleurtScore() as bls:
    # score each chunk against its paired answer
    # note: zip stops at the shorter list, so only the first chunk is scored here
    bleurt_results = [bls.evaluate(chunk, ans) for chunk, ans in zip(chunks_list, answers_list)]
print(bleurt_results)

Output

[0.89]