BERTScore

BERTScore is an automatic evaluation metric for text generation tasks that measures the similarity between candidate and reference texts using contextual embeddings from pre-trained BERT models.

  • Inputs: candidate (generated) text and reference text.
  • Returns: precision, recall, and F1-score based on token-level similarity.
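
Conceptually, BERTScore matches tokens greedily by cosine similarity: precision averages each candidate token's best match among the reference tokens, recall averages each reference token's best match among the candidate tokens, and F1 is their harmonic mean. The sketch below illustrates that computation with NumPy, using random stand-in vectors in place of real BERT embeddings; greedy_bertscore is an illustrative helper, not part of the vero API.

import numpy as np

def greedy_bertscore(cand_emb, ref_emb):
    # Pairwise cosine similarities (rows: candidate tokens, cols: reference tokens),
    # assuming each row is an L2-normalised token embedding.
    sim = cand_emb @ ref_emb.T
    # Precision: average best match for each candidate token.
    precision = sim.max(axis=1).mean()
    # Recall: average best match for each reference token.
    recall = sim.max(axis=0).mean()
    # F1: harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Random stand-in embeddings: 5 candidate tokens, 7 reference tokens, dim 8.
rng = np.random.default_rng(0)
cand = rng.normal(size=(5, 8))
cand /= np.linalg.norm(cand, axis=1, keepdims=True)
ref = rng.normal(size=(7, 8))
ref /= np.linalg.norm(ref, axis=1, keepdims=True)
print(greedy_bertscore(cand, ref))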

Example

from tqdm import tqdm

from vero.metrics import BertScore

# Example inputs: candidate chunks and the reference answer they are scored against.
chunks_list = ["The cat sat on the mat.", "The dog barked at the mailman."]
answers_list = ["A cat is sitting on a mat and a dog is barking at the mailman."]

with BertScore() as bs:
    # evaluate() returns a (precision, recall, f1) tuple for each pair.
    bert_results = [bs.evaluate(chunk, ans)
                    for chunk, ans in tqdm(zip(chunks_list, answers_list),
                                           total=len(answers_list))]
bert_dicts = [{'Precision': p, 'Recall': r, 'F1score': f} for p, r, f in bert_results]
print(bert_dicts)
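
Running the loop inside a with block keeps a single BertScore instance alive for all pairs; presumably the context manager loads the underlying BERT model on entry and releases it on exit, so the weights are not reloaded on every evaluate() call.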

Output

[{'Precision': 0.85, 'Recall': 0.80, 'F1score': 0.825}]