
Evaluating the Retriever

To ensure optimal performance, we continuously refine every part of the Cognee pipeline. To assess the quality of the retrieval step, we evaluate it using two key metrics:

1. Context Coverage Score

Parameters:

  • golden_context
  • retrieval_context

The coverage score compares a retrieved context against the golden context (e.g., from datasets like HotPotQA). This is done using a Question-Answer Generation (QAG) framework: an LLM generates a set of close-ended (yes/no) questions from the golden context, and the system then checks whether the retrieved context can answer these questions correctly.

The coverage score is calculated as the percentage of assessment questions where both the golden context and the retrieved context provide the same answer. A higher score indicates that the retrieved context contains sufficient detail to match the key information from the golden context.
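The procedure can be sketched in a few lines of Python. The snippet below is an illustrative sketch, not Cognee's actual implementation: the `call_llm` helper, the OpenAI model name, and the number of generated questions are assumptions made for the example.

```python
from typing import List
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set; any LLM client would do


def call_llm(prompt: str) -> str:
    # Hypothetical helper: send a single prompt to an LLM and return its text reply.
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # example model, not prescribed by Cognee
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content


def generate_yes_no_questions(golden_context: str, n: int = 10) -> List[str]:
    # QAG step 1: derive close-ended (yes/no) assessment questions from the golden context.
    prompt = (
        f"Generate {n} yes/no questions, one per line, that can be answered "
        f"solely from the following text:\n\n{golden_context}"
    )
    return [q for q in call_llm(prompt).splitlines() if q.strip()]


def answer_yes_no(context: str, question: str) -> str:
    # QAG step 2: answer a question using only the given context.
    prompt = (
        f"Answer strictly 'yes' or 'no' using only this context:\n\n{context}\n\n"
        f"Question: {question}"
    )
    return call_llm(prompt).strip().lower()


def coverage_score(golden_context: str, retrieval_context: str) -> float:
    # Coverage = fraction of questions where both contexts yield the same answer.
    questions = generate_yes_no_questions(golden_context)
    matches = sum(
        answer_yes_no(golden_context, q) == answer_yes_no(retrieval_context, q)
        for q in questions
    )
    return matches / len(questions)
```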

Our approach to measuring coverage is inspired by DeepEval's Summarization Score, in which coverage is one of the key components. Please refer to their blog post for a step-by-step explanation.

2. DeepEval’s Contextual Relevancy Metric

Parameters:

  • input
  • retrieval_context

Contextual relevancy measures how well the retrieved documents align with the given input (query). DeepEval uses an LLM-as-a-judge approach, where an LLM extracts all statements from the retrieved content and classifies each one as relevant or not. The Contextual Relevancy score is then calculated as the ratio of relevant statements to the total number of statements.
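For reference, DeepEval exposes this metric as `ContextualRelevancyMetric`. The snippet below is a minimal usage sketch: the query, answer, retrieved passages, judge model, and threshold are placeholder values, and constructor arguments may vary slightly between DeepEval versions.

```python
from deepeval.metrics import ContextualRelevancyMetric
from deepeval.test_case import LLMTestCase

# Placeholder query, answer, and retrieved passages for illustration.
test_case = LLMTestCase(
    input="Who directed the film that won Best Picture in 1998?",
    actual_output="James Cameron directed Titanic, which won Best Picture in 1998.",
    retrieval_context=[
        "Titanic won the Academy Award for Best Picture at the 1998 ceremony.",
        "Titanic was directed by James Cameron.",
        "The Eiffel Tower is located in Paris.",  # irrelevant statement lowers the score
    ],
)

# Judge model and threshold are configurable; the values here are examples only.
metric = ContextualRelevancyMetric(model="gpt-4o-mini", threshold=0.7, include_reason=True)
metric.measure(test_case)

print(metric.score)   # ratio of relevant statements to total statements
print(metric.reason)  # LLM-generated explanation of the score
```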
