Evaluator
Overview
- The Evaluator is a convenience wrapper to run multiple metrics over model outputs and retrieval results.
- It orchestrates generation evaluation (text-generation metrics), retrieval evaluation (precision/recall/sufficiency), and reranker evaluation (NDCG, MAP, MRR). It produces CSV summaries by default.
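The whole pipeline can be chained in one script. Below is a minimal end-to-end sketch, assuming testing.csv and test_dataset_generator.csv already exist with the columns described in the steps that follow; see those steps for the details of each call.
from vero.evaluator.evaluator import Evaluator

evaluator = Evaluator()

# Step 1: score generated answers against their retrieved context
gen_scores = evaluator.evaluate_generation(data_path='testing.csv')

# Step 2: build ranked_chunks_data.csv from ground truth + retriever output
evaluator.parse_retriever_data(
    ground_truth_path='test_dataset_generator.csv',
    data_path='testing.csv'
)

# Step 3: retrieval precision / recall / sufficiency
retrieval_scores = evaluator.evaluate_retrieval(
    data_path='testing.csv',
    retriever_data_path='ranked_chunks_data.csv'
)

# Step 4: reranker MAP / MRR / NDCG
reranker_scores = evaluator.evaluate_reranker(
    ground_truth_path='test_dataset_generator.csv',
    retriever_data_path='ranked_chunks_data.csv'
)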
Quick notes
- The Evaluator uses the project's metric classes (e.g., BartScore, BertScore, RougeScore, SemScore, PrecisionScore, RecallScore, MeanAP, MeanRR, RerankerNDCG, CumulativeNDCG). These metrics live in vero.metrics and are invoked internally.
- Many methods expect specific CSV column names (see "Expected CSV schemas"); a quick sanity check is sketched below.
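For example, a quick check with pandas can catch missing columns before running evaluation. This is only a sketch: 'Context Retrieved', 'Answer', and 'Question' are the column names referenced in the steps below, and testing.csv is the example file used there.
import pandas as pd

# Column names used by evaluate_generation / evaluate_retrieval (see the steps below)
required_columns = {'Context Retrieved', 'Answer', 'Question'}

df = pd.read_csv('testing.csv')
missing = required_columns - set(df.columns)
if missing:
    raise ValueError(f"testing.csv is missing required columns: {missing}")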
Steps to evaluate your pipeline
Step 1 - Generation evaluation
- Input: a CSV with "Context Retrieved" and "Answer" columns.
- Result: Generation_Scores.csv with columns such as SemScore, BertScore, RougeLScore, BARTScore, BLUERTScore, G-Eval (Faithfulness).
Example:
from vero.evaluator.evaluator import Evaluator
evaluator = Evaluator()
# data_path must point to a CSV with columns "Context Retrieved" and "Answer"
df_scores = evaluator.evaluate_generation(data_path='testing.csv')
print(df_scores.head())
Step 2 - Preparing reranker inputs (parse ground truth + retriever output)
- Use parse_retriever_data to convert ground-truth chunk IDs and retriever outputs into a ranked_chunks_data.csv suitable for retrieval and reranker evaluation.
Example:
from vero.evaluator.evaluator import Evaluator
evaluator = Evaluator()
# ground_truth_path: dataset with 'Chunk IDs' and 'Less Relevant Chunk IDs' columns
# data_path: retriever output with 'Context Retrieved' containing "id='...'"
evaluator.parse_retriever_data(
    ground_truth_path='test_dataset_generator.csv',
    data_path='testing.csv'
)
# This will produce 'ranked_chunks_data.csv'
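To confirm the parsed output matches what Step 3 expects, you can inspect the generated file with pandas. A sketch; 'Retrieved Chunk IDs' and 'True Chunk IDs' are the columns listed under Step 3.
import pandas as pd

ranked = pd.read_csv('ranked_chunks_data.csv')
# Step 3 expects these two columns in the retriever data
print(ranked[['Retrieved Chunk IDs', 'True Chunk IDs']].head())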
Step 3 - Retrieval evaluation (precision, recall, sufficiency)
- Inputs:
- retriever_data_path: a CSV that contains 'Retrieved Chunk IDs' and 'True Chunk IDs' columns (lists or strings).
- data_path: the generation CSV with 'Context Retrieved' and 'Question' (for sufficiency).
- Result: Retrieval_Scores.csv
Example:
from vero.evaluator.evaluator import Evaluator
evaluator = Evaluator()
df_retrieval_scores = evaluator.evaluate_retrieval(
    data_path='testing.csv',
    retriever_data_path='ranked_chunks_data.csv'
)
print(df_retrieval_scores.head())
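For an aggregate view, a plain pandas summary is enough. A sketch; the exact score columns in Retrieval_Scores.csv are not listed here, so this simply summarizes every numeric column.
# Mean / spread of every numeric score column in the result
print(df_retrieval_scores.describe())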
Step 4 - Reranker evaluation (MAP, MRR, NDCG)
Example:
from vero.evaluator.evaluator import Evaluator
evaluator = Evaluator()
df_reranker_scores = evaluator.evaluate_reranker(
    ground_truth_path='test_dataset_generator.csv',
    retriever_data_path='ranked_chunks_data.csv'
)
print(df_reranker_scores)
Lower-level metric usage
To run a single metric directly you can instantiate the metric class. For example, to compute BARTScore or BertScore per pair:
from vero.metrics import BartScore, BertScore

# contexts and answers are parallel lists of strings, e.g. taken from the
# "Context Retrieved" and "Answer" columns of the generation CSV
with BartScore() as bs:
    bart_results = [bs.evaluate(context, answer) for context, answer in zip(contexts, answers)]
with BertScore() as bert:
    bert_results = [bert.evaluate(context, answer) for context, answer in zip(contexts, answers)]