The evaluation framework is driven by the `run_eval.py` entry point. By capturing specific metrics, you can iterate on your pipeline with confidence. Below, you will find how to:

- Configure defaults in `eval_config.py`, which can be overridden by your `.env` file.
- Call `run_corpus_builder()` to either create a new corpus or use an existing one.
- Call `run_question_answering()` to generate answers for each question in the corpus.
- Call `run_evaluation()` to calculate metrics comparing generated answers with reference answers.
- Call `create_dashboard()` to help you analyze the results.
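These four steps run in sequence. As a rough illustration of how they fit together, here is a minimal sketch; the function bodies are stand-ins, and the `config` dictionary and async signatures are assumptions rather than the actual interfaces in `run_eval.py`:

```python
# Minimal sketch of chaining the four pipeline steps. The function bodies below
# are stand-ins; in the real framework these steps live behind run_eval.py and
# are configured via eval_config.py and your .env file. The config dict and the
# async signatures are assumptions made for illustration.
import asyncio

async def run_corpus_builder(config: dict) -> None:
    print("building or reusing the corpus ...")

async def run_question_answering(config: dict) -> None:
    print(f"answering questions with qa_engine={config['qa_engine']} ...")

async def run_evaluation(config: dict) -> None:
    print("comparing generated answers with reference answers ...")

async def create_dashboard(config: dict) -> None:
    print("writing dashboard.html ...")

async def run_pipeline() -> None:
    config = {
        "qa_engine": "cognee_completion",  # which retriever to use
        "evaluating_contexts": True,       # also save/evaluate retrieved contexts
    }
    await run_corpus_builder(config)
    await run_question_answering(config)
    await run_evaluation(config)
    await create_dashboard(config)

if __name__ == "__main__":
    asyncio.run(run_pipeline())
```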
The `CorpusBuilderExecutor` is responsible for loading questions and contexts from one of the supported benchmarks, then “cognifying” them (processing and storing them in cognee’s system). The `task_getter` parameter allows flexibility in how cognee processes the corpus. Setting the corresponding configuration flag to `True` builds a fresh corpus and removes existing data, while `False` indicates that it will not rebuild the corpus (and will not delete any existing data).

This step is triggered from `run_eval.py`. It:

- Initializes the `CorpusBuilderExecutor` with the specified benchmark and task getter
- Uses the `evaluating_contexts` flag to determine whether to load golden contexts

Key components:

- The `CorpusBuilderExecutor` class that orchestrates the corpus building process
- The `load_corpus` method that all adapters must implement
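To make the adapter contract concrete, here is a hypothetical benchmark adapter; the method signature and return shape are assumptions, since the docs only require that every adapter implement `load_corpus`:

```python
# Hypothetical benchmark adapter. The method signature and return shape are
# assumptions; the docs only state that every adapter must implement a
# load_corpus method that provides questions and contexts for cognification.
from typing import Any


class MyBenchmarkAdapter:
    """Loads question/context pairs for a custom benchmark."""

    def load_corpus(self, limit: int | None = None) -> tuple[list[str], list[dict[str, Any]]]:
        # A real adapter would read the benchmark files from disk or download
        # them; this returns a tiny in-memory example instead.
        corpus = ["Paris is the capital of France."]
        qa_pairs = [
            {
                "question": "What is the capital of France?",
                "answer": "Paris",     # reference (golden) answer
                "context": corpus[0],  # golden context, used when evaluating_contexts is enabled
            }
        ]
        if limit is not None:
            qa_pairs = qa_pairs[:limit]
        return corpus, qa_pairs
```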
The `AnswerGeneratorExecutor` processes each question from the corpus built in the previous step. Here’s what happens step-by-step:

- Questions are loaded from the `questions_output.json` file created in the corpus building step.
- Context is retrieved and an answer is generated for each question.
- The results are saved to `answers_output.json`.

The `qa_engine` parameter selects the retriever:

- `cognee_completion`: The standard retriever that uses semantic search to find relevant context.
- `cognee_graph_completion`: Uses graph-based retrieval to find connected information across documents.
- `graph_summary_completion`: Combines graph retrieval with summarization for more concise context.

Setting the answering flag to `True` means the `AnswerGeneratorExecutor` will retrieve context and generate answers; `False` skips the answer generation step (e.g., if you just want to rebuild a corpus). When `evaluating_contexts` is `True`, the system will save both the retrieved contexts and the golden contexts (if available) for evaluation. This allows you to assess not just answer quality but also retrieval quality.

This step is triggered from `run_eval.py`. It:

- Selects the retriever according to the `qa_engine` parameter
- Uses the `AnswerGeneratorExecutor` to process each question

Key components:

- The `AnswerGeneratorExecutor` class with the `question_answering_non_parallel` method
- The `retriever_options` dictionary
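The sketch below shows the overall shape of this step: map a `qa_engine` value to a retriever and answer questions one by one. The retriever callables, the JSON record keys, and the stand-in for the LLM call are hypothetical; only the `qa_engine` values and the file names come from the docs.

```python
# Sketch of mapping a qa_engine value to a retriever and answering questions in
# a simple non-parallel loop. The retriever callables, the JSON record keys, and
# the stand-in for the LLM call are hypothetical.
import json


def make_retriever(qa_engine: str):
    retriever_options = {
        "cognee_completion": lambda q: f"semantic-search context for: {q}",
        "cognee_graph_completion": lambda q: f"graph-based context for: {q}",
        "graph_summary_completion": lambda q: f"summarized graph context for: {q}",
    }
    return retriever_options[qa_engine]


def answer_questions(qa_engine: str = "cognee_completion") -> None:
    retrieve = make_retriever(qa_engine)

    with open("questions_output.json") as f:
        questions = json.load(f)

    answers = []
    for item in questions:                         # one question at a time (non-parallel)
        context = retrieve(item["question"])       # retrieval step
        answer = f"answer grounded in: {context}"  # placeholder for the real LLM call
        # keep the original fields (e.g., the reference answer) and add the results
        answers.append({**item, "answer": answer, "retrieval_context": context})

    with open("answers_output.json", "w") as f:
        json.dump(answers, f, indent=2)
```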
The evaluation step scores the generated answers against the references:

- Answers are loaded from the `answers_output.json` file created in the question answering step.
- Metrics are saved to `metrics_output.json` and the relational database.

Setting the evaluation flag to `True` triggers the `EvaluationExecutor` to evaluate the answers. When context evaluation is enabled (`True`), additional context-related metrics are included in the evaluation. The model used for LLM-based evaluation is configurable (e.g., `gpt-4o-mini`).

This step is triggered from `run_eval.py`. It:

- Creates the evaluator based on the `evaluation_engine` parameter
- Uses the `EvaluationExecutor` to process each answer

Key components:

- The `EvaluationExecutor` class that orchestrates the evaluation process
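As a self-contained illustration of the read–score–write flow, here is a sketch that computes only a simple exact-match score; the JSON keys are assumptions, and the real evaluators also support LLM-based judging:

```python
# Sketch of the evaluation flow under assumed file formats: load the generated
# answers, score each against its reference, and write per-question metrics.
# Only a simple exact-match score is computed here so the example stays
# self-contained; the real EvaluationExecutor also supports LLM-based metrics.
import json


def exact_match(prediction: str, reference: str) -> float:
    return float(prediction.strip().lower() == reference.strip().lower())


def evaluate_answers() -> None:
    with open("answers_output.json") as f:
        answers = json.load(f)

    metrics = []
    for item in answers:
        metrics.append(
            {
                "question": item["question"],  # assumed key
                "exact_match": exact_match(item["answer"], item["golden_answer"]),  # assumed keys
            }
        )

    with open("metrics_output.json", "w") as f:
        json.dump(metrics, f, indent=2)
```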
The dashboard step turns the metrics into a visual report:

- Metrics are read from the `metrics_output.json` and `aggregate_metrics.json` files.
- The dashboard is saved as `dashboard.html` for easy viewing.

When the dashboard flag is set to `True`, the system will generate the dashboard visualization. The dashboard generation is handled by the `metrics_dashboard.py` file, which contains several key functions for generating the dashboard.
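For orientation, here is a minimal read–aggregate–render sketch; the metric keys and the HTML layout are assumptions, and the real `metrics_dashboard.py` produces richer charts:

```python
# Sketch of a dashboard generator under assumed file formats: read the
# per-question metrics, aggregate them, and write a minimal HTML page.
import json
from statistics import mean


def create_simple_dashboard() -> None:
    with open("metrics_output.json") as f:
        metrics = json.load(f)

    avg_em = mean(m["exact_match"] for m in metrics) if metrics else 0.0

    html = (
        "<html><body>"
        "<h1>Evaluation results</h1>"
        f"<p>Questions evaluated: {len(metrics)}</p>"
        f"<p>Average exact match: {avg_em:.2f}</p>"
        "</body></html>"
    )
    with open("dashboard.html", "w") as f:
        f.write(html)
```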
A run produces the following outputs:

- `questions_output.json` for generated or fetched questions.
- `answers_output.json` for final answers and reference answers.
- `metrics_output.json` for the calculated metrics.
- `dashboard.html` to visually explore the evaluation results.
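If you want a quick sanity check after a run, something like the following can summarize the JSON outputs; it assumes each file holds a list of per-question records, which is an assumption about the exact schema:

```python
# Quick sanity check over the output files. Assumes each JSON file contains a
# list of per-question records, which is an assumption about the exact schema.
import json

for name in ("questions_output.json", "answers_output.json", "metrics_output.json"):
    with open(name) as f:
        records = json.load(f)
    print(f"{name}: {len(records)} records")
```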