Cognee Evaluation Framework

Without evaluation, you have no objective way to know whether your system’s answers are good or bad. Cognee’s evaluation framework lets you configure and run evaluations through the run_eval.py entry point. By capturing concrete metrics, you can iterate on your pipeline with confidence. Below, you will learn how to:

  1. Build or reuse an existing corpus.
  2. Generate and retrieve answers.
  3. Evaluate the answers using different metrics.
  4. Visualize the metrics in a dashboard.

To run evaluations quickly with cognee:

Step 1: Clone cognee repo

git clone https://github.com/topoteretes/cognee.git

Step 2: Install with poetry

Navigate to cognee repo

cd cognee

Install with poetry

poetry install

Step 3: Run evaluation for a dummy question

Run

python evals/eval_framework/run_eval.py

Step 4: Navigate to the output files to see the results

  • “questions_output.json”
  • “answers_output.json”
  • “metrics_output.json”
  • “dashboard.html”

Explore the details and further explanation below.

Here is a high-level diagram of the evaluation process, illustrating how various executors interact:

[Diagram: cognee evaluation process, showing how the executors interact]

Let’s explore each executor and learn how to configure them.

Preliminaries: Configuration with eval_config.py

All the parameters mentioned throughout this documentation are defined in the eval_config.py file. This configuration file uses Pydantic settings to manage all evaluation parameters in one place.

# Example from eval_config.py
class EvalConfig(BaseSettings):
    # Corpus builder params
    building_corpus_from_scratch: bool = True
    number_of_samples_in_corpus: int = 1
    benchmark: str = "Dummy"
    # ... more parameters

You can override any of these parameters by setting them in your .env file. For example:

# .env file example
BUILDING_CORPUS_FROM_SCRATCH=False
NUMBER_OF_SAMPLES_IN_CORPUS=10
BENCHMARK=HotPotQA
DEEPEVAL_MODEL=gpt-4o

This makes it easy to customize your evaluation runs without modifying the code. All the parameters described in the following sections can be configured this way.
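
To see how this override mechanism works in isolation, here is a minimal, self-contained sketch of a Pydantic settings class reading values from .env. It is not the real EvalConfig (which defines many more fields); it only illustrates the pattern.

# Minimal sketch of the .env override pattern used by eval_config.py.
# This is NOT the real EvalConfig; it only shows how pydantic-settings
# lets environment variables win over the defaults defined in code.
from pydantic_settings import BaseSettings, SettingsConfigDict


class EvalConfigSketch(BaseSettings):
    model_config = SettingsConfigDict(env_file=".env", extra="ignore")

    building_corpus_from_scratch: bool = True
    number_of_samples_in_corpus: int = 1
    benchmark: str = "Dummy"


config = EvalConfigSketch()
print(config.benchmark)  # "HotPotQA" if BENCHMARK=HotPotQA is set in .env, otherwise "Dummy"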

1. Main (run_eval) & cognee Framework

  • Main (run_eval) orchestrates the entire flow.
  • cognee Framework provides the underlying infrastructure to run cognee’s cognify pipeline.

Here’s what happens step-by-step when you run the evaluation:

  • Configuration Loading: The script first loads all parameters from eval_config.py, which can be overridden by your .env file.
  • Corpus Building: Calls run_corpus_builder() to either create a new corpus or use an existing one.
  • Question Answering: Executes run_question_answering() to generate answers for each question in the corpus.
  • Evaluation: Runs run_evaluation() to calculate metrics comparing generated answers with reference answers.
  • Dashboard Generation: If enabled, creates a visual dashboard using create_dashboard() to help you analyze the results.

Each step produces output files that feed into the next step, creating a complete evaluation pipeline.
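
The sketch below shows this orchestration shape with stand-in stubs, so the handoff of output files between steps is visible end to end; the real run_eval.py imports these helpers from the eval_framework modules and also handles database setup and error reporting.

# Outline of the run_eval orchestration. The stubs stand in for the real
# helpers described in this guide (run_corpus_builder, run_question_answering,
# run_evaluation, create_dashboard); only the shape of the flow is shown.
import asyncio


async def run_corpus_builder(cfg): print("built corpus -> questions_output.json")          # stand-in stub
async def run_question_answering(cfg): print("answered questions -> answers_output.json")  # stand-in stub
async def run_evaluation(cfg): print("scored answers -> metrics_output.json")              # stand-in stub
def create_dashboard(cfg): print("rendered dashboard -> dashboard.html")                   # stand-in stub


async def main(cfg: dict):
    await run_corpus_builder(cfg)       # 2. Corpus Building
    await run_question_answering(cfg)   # 3. Question Answering
    await run_evaluation(cfg)           # 4. Evaluating the Answers
    if cfg.get("dashboard", True):      # 5. Creating Dashboard (optional)
        create_dashboard(cfg)


asyncio.run(main({"dashboard": True}))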

2. Corpus Building

A corpus is a collection of questions and contexts that the system will cognify and use to generate answers, which are later evaluated. The CorpusBuilderExecutor is responsible for loading questions and contexts from one of the supported benchmarks, then “cognifying” them (processing and storing them in cognee’s system).

Benchmark Adapters

Cognee supports multiple benchmark datasets through adapter classes:

  • BaseBenchmarkAdapter: An abstract base class that defines the interface for all benchmark adapters.
  • Implementations:
    • HotpotQAAdapter: Loads multi-hop questions from the HotpotQA dataset.
    • TwoWikiMultihopAdapter: Extends HotpotQA with additional evidence from the 2WikiMultihop dataset.
    • MusiqueAdapter: Handles the Musique dataset with question decomposition.
    • DummyAdapter: Provides a simple test question for quick demonstrations.

Each adapter loads questions, their answers, and context documents. The golden answers are saved for later evaluation but aren’t cognified. Optionally, adapters can also extract “golden contexts” (the specific parts of documents that answer the question) for context evaluation.

Task Getters

The task_getter parameter allows flexibility in how cognee processes the corpus:

  • It determines which pipeline will be used to process the documents.
  • By default, it uses the standard cognee pipeline, but you can configure alternative pipelines.
  • This is useful for comparing different processing strategies in your evaluations.

Configuring Corpus Building

building_corpus_from_scratch: bool = True
number_of_samples_in_corpus: int = 1
benchmark: str = "Dummy"  # Options: 'HotPotQA', 'TwoWikiMultiHop', 'Musique', 'Dummy'
task_getter_type: str = "Default"  # Options: 'Default', 'CascadeGraph'
  • building_corpus_from_scratch: True builds a fresh corpus and removes existing data; False reuses the existing corpus without rebuilding it (and without deleting any existing data).
  • number_of_samples_in_corpus: How many questions you want to include from the selected benchmark.
  • benchmark: The dataset or benchmark from which questions are sampled.
  • task_getter_type: Determines which pipeline configuration to use when processing documents.

Implementation Details

The corpus building process is implemented in several key files:

  • run_corpus_builder.py: This is the entry point called by run_eval.py. It:

    • Creates a CorpusBuilderExecutor with the specified benchmark and task getter
    • Calls the executor to build the corpus with the configured parameters
    • Saves the questions to the output file and the relational database
    • Handles the evaluating_contexts flag to determine whether to load golden contexts
  • corpus_builder_executor.py: Contains the core logic for building the corpus:

    • Defines the CorpusBuilderExecutor class that orchestrates the corpus building process
    • Loads the appropriate benchmark adapter based on the configuration
    • Calls cognee’s core functions to process and store the documents
    • Manages the execution of the configured task getter pipeline
  • base_benchmark_adapter.py: Defines the abstract interface that all benchmark adapters must implement:

    • Provides a consistent load_corpus method that all adapters must implement
    • Ensures all adapters can handle parameters like sample limits and golden context loading

These files work together to load questions and contexts from the selected benchmark, process them through cognee’s pipeline, and prepare them for the question answering phase.
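
For orientation, a new benchmark adapter would subclass BaseBenchmarkAdapter and implement load_corpus. The sketch below is hypothetical: the argument names and return shape are assumptions for illustration, so check base_benchmark_adapter.py and the existing adapters for the exact interface.

# Hypothetical adapter sketch in the spirit of DummyAdapter. The load_corpus
# signature and return shape below are assumptions; the real interface is
# defined in base_benchmark_adapter.py.
from typing import Optional


class TinyBenchmarkAdapter:
    """Serves one hard-coded question plus the document that answers it."""

    def load_corpus(self, limit: Optional[int] = None, load_golden_context: bool = False):
        corpus = ["Cognee supports Neo4j as one of its graph database backends."]
        questions = [
            {
                "question": "Is Neo4j supported by cognee?",
                "answer": "Yes",
                "type": "dummy",
            }
        ]
        if load_golden_context:
            # Golden context: the specific passage that answers the question.
            questions[0]["golden_context"] = corpus[0]
        return corpus[:limit], questions[:limit]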

3. Question Answering

This is where cognee retrieves relevant context and generates answers to the questions. The AnswerGeneratorExecutor processes each question from the corpus built in the previous step.

Here’s what happens step-by-step:

  1. Load Questions: The executor reads questions from the questions_output.json file created in the corpus building step.
  2. Context Retrieval: For each question, the system retrieves relevant context using the configured retriever.
  3. Answer Generation: Using the retrieved context, the system generates an answer to the question.
  4. Save Results: The questions, generated answers, golden answers, and optionally the retrieval contexts are saved to answers_output.json.
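
The sketch below mirrors these four steps with a stand-in retrieval function, so the handoff from questions_output.json to answers_output.json is visible. The real AnswerGeneratorExecutor calls the configured retriever instead, and the "retrieval_context" key is an assumption here; question, answer, and golden_answer match the output format shown later in this guide.

# Illustrative sketch of the question-answering loop. A stand-in function
# replaces the configured retriever, and the "retrieval_context" key is an
# assumption made for illustration.
import json


def fake_retrieve_and_answer(question: str) -> tuple[str, str]:
    context = "stand-in retrieved context"
    answer = "stand-in generated answer"
    return context, answer


with open("questions_output.json") as f:
    questions = json.load(f)

answers = []
for item in questions:
    context, answer = fake_retrieve_and_answer(item["question"])
    answers.append(
        {
            "question": item["question"],
            "answer": answer,
            "golden_answer": item["answer"],
            "retrieval_context": context,
        }
    )

with open("answers_output.json", "w") as f:
    json.dump(answers, f, indent=2)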

Available Retrievers

Cognee supports multiple retrieval strategies through different retriever classes:

  • CompletionRetriever (cognee_completion): The standard retriever that uses semantic search to find relevant context.
  • GraphCompletionRetriever (cognee_graph_completion): Uses graph-based retrieval to find connected information across documents.
  • GraphSummaryCompletionRetriever (graph_summary_completion): Combines graph retrieval with summarization for more concise context.

Each retriever has different strengths depending on the types of questions and data you’re working with.

Configuring Question Answering

answering_questions: bool = True
qa_engine: str = "cognee_completion"  # Options: 'cognee_completion', 'cognee_graph_completion', 'graph_summary_completion'
evaluating_contexts: bool = True  # Controls whether contexts are saved for evaluation
  • answering_questions: True means the AnswerGeneratorExecutor retrieves context and generates answers; False skips the answer generation step (e.g., if you just want to rebuild a corpus).
  • qa_engine: Specifies the retriever used for finding context and generating answers.
  • evaluating_contexts: When True, the system will save both the retrieved contexts and the golden contexts (if available) for evaluation. This allows you to assess not just answer quality but also retrieval quality.

The retrieved contexts are saved alongside the answers and can be evaluated in the next phase of the pipeline to assess the quality of the retrieval process.

Implementation Details

The question answering process is implemented in two main files:

  • run_question_answering_module.py: This is the entry point called by run_eval.py. It:

    • Loads questions from the output file of the corpus building step
    • Instantiates the appropriate retriever based on the qa_engine parameter
    • Calls the AnswerGeneratorExecutor to process each question
    • Saves the results to the answers output file and the relational database
  • answer_generation_executor.py: Contains the core logic for answering questions:

    • Defines the AnswerGeneratorExecutor class with the question_answering_non_parallel method
    • Maps retriever names to their implementation classes via the retriever_options dictionary
    • For each question, retrieves context and generates an answer using the configured retriever
    • Packages the question, answer, golden answer, and retrieval context into a structured format

These files work together to transform the questions from the corpus into answered questions with their associated contexts, ready for evaluation.
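
The dispatch itself is a plain name-to-class mapping keyed by qa_engine. The stubs below only illustrate that pattern; the real retriever_options dictionary in answer_generation_executor.py maps these keys to cognee’s CompletionRetriever, GraphCompletionRetriever, and GraphSummaryCompletionRetriever classes, whose actual API may differ from the stub.

# Sketch of the retriever_options dispatch pattern; StubRetriever stands in
# for cognee's retriever classes, and its method is illustrative only.
class StubRetriever:
    async def answer(self, question: str) -> str:
        return f"stub answer to: {question}"


retriever_options = {
    "cognee_completion": StubRetriever,         # CompletionRetriever in cognee
    "cognee_graph_completion": StubRetriever,   # GraphCompletionRetriever
    "graph_summary_completion": StubRetriever,  # GraphSummaryCompletionRetriever
}

retriever = retriever_options["cognee_graph_completion"]()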

4. Evaluating the Answers

After the questions are answered, cognee evaluates each answer against the benchmark’s reference (“golden”) answer using the specified metrics. This lets you see how reliable or accurate your system is.

Here’s what happens step-by-step:

  1. Load Answers: The executor reads answers from the answers_output.json file created in the question answering step.
  2. Initialize Evaluator: The system creates an evaluator based on the configured evaluation engine.
  3. Apply Metrics: For each answer, the system calculates scores using the specified metrics.
  4. Save Results: The evaluation results are saved to metrics_output.json and the relational database.

Available Evaluators

Cognee supports multiple evaluation approaches through different adapter classes:

  • DeepEvalAdapter: Uses the DeepEval library to calculate metrics, supporting both traditional metrics and LLM-based evaluations.
  • DirectLLMEvalAdapter: Uses a direct call to an LLM with custom prompts to evaluate answer correctness.

Evaluation Metrics

Depending on the evaluator, different metrics are available:

  • correctness: Uses an LLM-based approach to see if the meaning of the answer aligns with the golden answer.
  • EM (Exact Match): Checks if the generated answer exactly matches the golden answer.
  • f1: Uses token-level matching to measure precision and recall.
  • contextual_relevancy: Evaluates how relevant the retrieved context is to the question (when context evaluation is enabled).
  • context_coverage: Assesses how well the retrieved context covers the information needed to answer the question (when context evaluation is enabled).
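
EM and f1 follow the standard QA definitions. The sketch below reimplements them with simple normalization so you can see exactly what the scores measure; the framework’s own implementation (via DeepEval) may normalize differently.

# Minimal EM and token-level F1, in the spirit of the metrics above. The
# framework's own implementation may differ in normalization details.
import re
from collections import Counter


def _normalize(text: str) -> list[str]:
    # Lowercase and strip punctuation before tokenizing, as standard QA metrics do.
    return re.sub(r"[^\w\s]", "", text.lower()).split()


def exact_match(prediction: str, golden: str) -> float:
    return float(_normalize(prediction) == _normalize(golden))


def f1_score(prediction: str, golden: str) -> float:
    pred, gold = _normalize(prediction), _normalize(golden)
    common = Counter(pred) & Counter(gold)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision, recall = num_same / len(pred), num_same / len(gold)
    return 2 * precision * recall / (precision + recall)


# Reproduces the scores in the metrics_output.json example further below:
print(exact_match("Yes, Neo4j is supported by cognee.", "Yes"))  # 0.0
print(f1_score("Yes, Neo4j is supported by cognee.", "Yes"))     # ~0.2857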

Configuring Evaluation

evaluating_answers: bool = True
evaluating_contexts: bool = True
evaluation_engine: str = "DeepEval"  # Options: 'DeepEval', 'DirectLLM'
evaluation_metrics: List[str] = ["correctness", "EM", "f1"]
deepeval_model: str = "gpt-4o-mini"
  • evaluating_answers: True triggers the EvaluationExecutor to evaluate the answers.
  • evaluating_contexts: When True, additional context-related metrics are included in the evaluation.
  • evaluation_engine: The evaluation adapter to use.
  • evaluation_metrics: The list of metrics to calculate for each answer.
  • deepeval_model: The LLM used for computing the LLM-based metrics (e.g., gpt-4o-mini).

Implementation Details

The evaluation process is implemented in several key files:

  • run_evaluation_module.py: This is the entry point called by run_eval.py. It:

    • Loads answers from the output file of the question answering step
    • Initializes the appropriate evaluator based on the evaluation_engine parameter
    • Calls the EvaluationExecutor to process each answer
    • Saves the results to the metrics output file and the relational database
  • evaluation_executor.py: Contains the core logic for evaluating answers:

    • Defines the EvaluationExecutor class that orchestrates the evaluation process
    • Selects the appropriate evaluator adapter based on configuration
    • Adds context evaluation metrics if context evaluation is enabled
  • evaluator_adapters.py: Defines the available evaluator adapters:

    • Maps evaluator names to their implementation classes
    • Provides a consistent interface for different evaluation approaches
  • deep_eval_adapter.py and direct_llm_eval_adapter.py: Implement specific evaluation strategies:

    • DeepEvalAdapter uses the DeepEval library with multiple metrics
    • DirectLLMEvalAdapter uses direct LLM calls with custom prompts

These files work together to evaluate the quality of the generated answers and optionally the retrieved contexts.
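
As a rough illustration of what an LLM-judged correctness check looks like, here is a direct use of the DeepEval library. cognee’s DeepEvalAdapter wires this up for you and may use different prompts, metric classes, and parameters, so treat this only as a sketch (it also needs an LLM API key to run).

# Hedged sketch of an LLM-judged correctness metric using DeepEval directly.
# cognee's DeepEvalAdapter may configure this differently; requires an
# OpenAI API key in the environment to actually run.
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

correctness = GEval(
    name="Correctness",
    criteria="Does the actual output convey the same meaning as the expected output?",
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.EXPECTED_OUTPUT],
    model="gpt-4o-mini",
)

test_case = LLMTestCase(
    input="Is Neo4j supported by cognee?",
    actual_output="Yes, Neo4j is supported by cognee.",
    expected_output="Yes",
)

correctness.measure(test_case)
print(correctness.score, correctness.reason)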

5. Creating Dashboard

Cognee generates a visual dashboard to help you analyze evaluation results. The dashboard presents metrics in an interactive HTML file that you can open in any web browser, making it easy to spot patterns and issues in your model’s outputs.

Here’s what happens step-by-step:

  1. Load Metrics: The dashboard generator reads metrics from the metrics_output.json and aggregate_metrics.json files.
  2. Generate Visualizations: The system creates various charts and tables to visualize the evaluation results.
  3. Compile HTML: All visualizations are combined into a single HTML file with CSS styling and JavaScript for interactivity.
  4. Save Dashboard: The complete dashboard is saved to dashboard.html for easy viewing.

Dashboard Features

The generated dashboard includes several key visualizations:

  • Distribution Histograms: Shows the distribution of scores for each metric, helping you understand the overall performance patterns.
  • Confidence Interval Plot: Displays the mean score and 95% confidence interval for each metric, giving you statistical insight into the reliability of the results.
  • Detailed Tables: For each metric, a table shows the individual scores, reasons, and relevant data (questions, answers, contexts) for in-depth analysis.
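
For reference, this is one way to compute the mean and a 95% confidence interval for a list of metric scores; the dashboard’s own calculation may use a different method.

# One way to compute a mean and 95% confidence interval for metric scores
# (normal approximation); the dashboard's own method may differ.
import statistics


def mean_and_ci95(scores: list[float]) -> tuple[float, float, float]:
    mean = statistics.mean(scores)
    if len(scores) < 2:
        return mean, mean, mean
    sem = statistics.stdev(scores) / len(scores) ** 0.5  # standard error of the mean
    margin = 1.96 * sem
    return mean, mean - margin, mean + margin


print(mean_and_ci95([0.78, 0.0, 0.29, 0.91, 0.66]))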

Configuring Dashboard Generation

dashboard: bool = True
aggregate_metrics_path: str = "aggregate_metrics.json"
dashboard_path: str = "dashboard.html"
  • dashboard: When True, the system will generate the dashboard visualization.
  • aggregate_metrics_path: The path where aggregate metrics (means, confidence intervals) are stored.
  • dashboard_path: The output path for the generated HTML dashboard.

Implementation Details

The dashboard generation is implemented in the metrics_dashboard.py file, which contains several key functions:

  • create_dashboard(): The main entry point that orchestrates the dashboard creation process:

    • Reads metrics data from the output files
    • Calls visualization functions to generate charts
    • Assembles the complete HTML dashboard
    • Saves the result to the specified output file
  • create_distribution_plots(): Generates histograms showing the distribution of scores for each metric.

  • create_ci_plot(): Creates a bar chart with error bars showing the mean and confidence interval for each metric.

  • generate_details_html(): Produces HTML tables with detailed information about each evaluation result.

  • get_dashboard_html_template(): Assembles all visualizations into a complete HTML document with styling.

This dashboard provides a comprehensive view of your evaluation results, making it easy to understand how well your system is performing and where improvements might be needed.
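
If you want to prototype a similar visualization outside the framework, a score distribution can be rendered to a standalone HTML file in a few lines with Plotly. Whether metrics_dashboard.py uses Plotly, and how it assembles the full template, is not shown here; treat this as a generic sketch.

# Generic sketch: render one score distribution to a standalone HTML file
# with Plotly. Not taken from metrics_dashboard.py.
import plotly.graph_objects as go

scores = [0.78, 0.0, 0.29, 0.91, 0.66]  # e.g. correctness scores per question

fig = go.Figure(go.Histogram(x=scores, nbinsx=10))
fig.update_layout(
    title="Correctness score distribution",
    xaxis_title="score",
    yaxis_title="count",
)
fig.write_html("correctness_histogram.html")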

Output Files: Where to Look

By default, the evaluation flow generates the following files (and a relational database) that capture the entire workflow:

questions_path: str = "questions_output.json"
answers_path: str = "answers_output.json"
metrics_path: str = "metrics_output.json"
dashboard_path: str = "dashboard.html"

questions_output.json

Contains the question objects produced by the corpus builder. Example:
[ { "answer": "Yes", "question": "Is Neo4j supported by cognee?", "type": "dummy" } ]

answers_output.json

Contains the generated answers, alongside their original questions and reference (“golden”) answers. Example:
[ { "question": "Is Neo4j supported by cognee?", "answer": "Yes, Neo4j is supported by cognee.", "golden_answer": "Yes" } ]

metrics_output.json

Contains evaluation results for each question, including scores and rationales. Note that cognee’s default prompts are tuned to maximize user experience rather than benchmark scores, so the base settings produce broader answers than the terse golden ones. Example:
[ { "question": "Is Neo4j supported by cognee?", "answer": "Yes, Neo4j is supported by cognee.", "golden_answer": "Yes", "metrics": { "correctness": { "score": 0.7815554704711645, "reason": "The actual output confirms that Neo4j is supported by cognee, which aligns with the expected output's affirmative response, but it contains unnecessary detail..." }, "EM": { "score": 0.0, "reason": "Not an exact match" }, "f1": { "score": 0.2857142857142857, "reason": "F1: 0.29 (Precision: 0.17, Recall: 1.00)" } } } ]

dashboard.html

An HTML dashboard that summarizes all the metrics visually. You can find an example here.

Are you ready to test it out with your parameters?

Configure the above parameters as you wish in your .env file and simply run:

python evals/eval_framework/run_eval.py

Using cognee’s evaluation framework, you can:

  1. Build (or reuse) a corpus from the specified benchmark.
  2. Generate answers to the collected questions.
  3. Evaluate those answers using the metrics you specified.
  4. Produce a dashboard summarizing the results.
  5. Inspect the outputs.
  • questions_output.json for generated or fetched questions.
  • answers_output.json for final answers and reference answers.
  • metrics_output.json for the calculated metrics.
  • dashboard.html to visually explore the evaluation results.

Feel free to reach out with any questions or suggestions on how to improve the evaluation framework.

