
Cognee Evaluation Framework

Without evaluation, you have no objective way to know whether your system’s answers are good or bad. With cognee’s evaluation framework, you can easily configure and run evaluations using the run_eval.py entry point. By capturing specific metrics, you can iterate on your pipeline with confidence. Below, you will learn how to:

  1. Build or reuse an existing corpus.
  2. Generate and retrieve answers.
  3. Evaluate the answers using different metrics.
  4. Visualize the metrics in a dashboard.

To run evaluations quickly with cognee:

Step 1: Clone the cognee repo
  git clone https://github.com/topoteretes/cognee.git
Step 2: Install with poetry
  cd cognee
  poetry install
Step 3: Run the evaluation for a dummy question
  python evals/eval_framework/run_eval.py
Step 4: Navigate to the output files to see the results
  • questions_output.json
  • answers_output.json
  • metrics_output.json
  • dashboard.html

Explore the details and further explanation below.

Here is a high-level diagram of the evaluation process, illustrating how various executors interact:

[Diagram: cognee evaluation process]

Let’s explore each executor and learn how to configure them.

1. Main (run_eval) & cognee Framework

  • Main (run_eval) orchestrates the entire flow.
  • cognee Framework provides the underlying infrastructure to run cognee’s cognify pipeline.

2. CorpusBuilderExecutor

A corpus is a collection of questions that your system will attempt to answer. Before answering any questions, you need to decide whether to build a new corpus or reuse an existing one. For example:

  • If you want to test your system on a new dataset or remove old data, you build the corpus from scratch.
  • If you already have a dataset set up, you can reuse an existing corpus without rebuilding.

Configuring Corpus Building

building_corpus_from_scratch: bool = True
number_of_samples_in_corpus: int = 1
benchmark: str = "Dummy"  # Other Options: 'HotPotQA', 'TwoWikiMultiHop'
  • building_corpus_from_scratch: True builds a fresh corpus and removes existing data; False reuses the existing corpus without rebuilding it (and without deleting any existing data). See the sketch after this list for how the settings interact.
  • number_of_samples_in_corpus: How many questions you want to include from the selected benchmark.
  • benchmark: The dataset or benchmark from which questions are sampled. Dummy simply produces a single test question for demonstration.
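
The sketch below is illustrative pseudo-logic only, not cognee’s actual CorpusBuilderExecutor; the two loader functions are hypothetical stand-ins, included just to make the interaction of these settings concrete.

from typing import Dict, List

def load_benchmark_questions(benchmark: str) -> List[Dict]:
    # Hypothetical stub: a real run samples questions from the chosen
    # benchmark ("Dummy", "HotPotQA", or "TwoWikiMultiHop").
    return [{"question": "Is Neo4j supported by cognee?", "answer": "Yes", "type": "dummy"}]

def load_existing_questions() -> List[Dict]:
    # Hypothetical stub: reuse questions stored by a previous run.
    return []

def build_or_reuse_corpus(building_corpus_from_scratch: bool,
                          number_of_samples_in_corpus: int,
                          benchmark: str) -> List[Dict]:
    if building_corpus_from_scratch:
        # Fresh build: existing data is removed, then new questions are sampled.
        return load_benchmark_questions(benchmark)[:number_of_samples_in_corpus]
    # Reuse: keep the existing data and the questions already in place.
    return load_existing_questions()

print(build_or_reuse_corpus(True, 1, "Dummy"))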

3. AnswerGeneratorExecutor

This is where cognee retrieves relevant context and generates answers to the questions. Ultimately, the evaluation metrics measure how well your system answers these questions.

Configuring Question Answering

answering_questions: bool = True
qa_engine: str = "cognee_completion"  # Options: 'cognee_completion' or 'cognee_graph_completion'
  • answering_questions: True means the AnswerGeneratorExecutor will retrieve context and generate answers; False skips the answer generation step (e.g., if you just want to rebuild a corpus).
  • qa_engine: Specifies the engine used for retrieval and generation. You can choose one of the two search types from cognee: cognee_completion or cognee_graph_completion.

4. EvaluationExecutor

After the questions are answered, cognee can automatically evaluate each answer against a reference (“golden”) answer using specified metrics. This lets you see how reliable or accurate your system is.

Configuring Evaluation

evaluating_answers: bool = True
evaluation_engine: str = "DeepEval"
evaluation_metrics: List[str] = ["correctness", "EM", "f1"]
deepeval_model: str = "gpt-4o-mini"
  • evaluating_answers: True triggers the EvaluationExecutor to evaluate the answers.
  • evaluation_engine: The evaluation executor to use. Currently, DeepEval is supported.
  • evaluation_metrics:
    1. correctness – Uses an LLM-based approach (via deepeval_model) to see if the meaning of the answer aligns with the golden answer.
    2. EM (Exact Match) – Checks if the generated answer exactly matches the golden (reference) answer.
    3. f1 – Uses token-level matching to measure precision and recall (see the sketch after this list).
  • deepeval_model: The LLM used for computing the correctness score (e.g., gpt-4o-mini).
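
To make the token-based metrics concrete, here is a minimal, self-contained sketch of exact match and token-level F1 in the SQuAD style; it is only an approximation of what the framework computes, not cognee’s actual scoring code. On the sample from metrics_output.json shown below (answer “Yes, Neo4j is supported by cognee.” against the golden answer “Yes”), it gives EM 0.0 and F1 ≈ 0.29 (precision ≈ 0.17, recall 1.0).

import re
from collections import Counter

def tokens(text: str):
    # Lowercase and keep word characters only, dropping punctuation.
    return re.findall(r"\w+", text.lower())

def exact_match(prediction: str, golden: str) -> float:
    return float(tokens(prediction) == tokens(golden))

def token_f1(prediction: str, golden: str) -> float:
    pred, gold = tokens(prediction), tokens(golden)
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

print(exact_match("Yes, Neo4j is supported by cognee.", "Yes"))  # 0.0
print(token_f1("Yes, Neo4j is supported by cognee.", "Yes"))     # ~0.286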

5. Dashboard Generator

Cognee generates a dashboard in a dashboard.html file for visualizing the evaluation results. You can open this file in a web browser to see charts and tables on each metric. It’s much easier to spot patterns or issues in your model’s outputs when you can visually inspect the data.

Configuring Visualization (Dashboard)

dashboard: bool = True

Output Files: Where to Look

By default, the evaluation flow generates the following files (and a relational database) that capture the entire workflow:

questions_path: str = "questions_output.json"
answers_path: str = "answers_output.json"
metrics_path: str = "metrics_output.json"
dashboard_path: str = "dashboard.html"
questions_output.json
Contains the question objects produced by the corpus builder. Example:
    [
      {
        "answer": "Yes",
        "question": "Is Neo4j supported by cognee?",
        "type": "dummy"
      }
    ]
answers_output.json
Contains the generated answers, alongside their original questions and reference (golden) answers. Example:
[
  {
    "question": "Is Neo4j supported by cognee?",
    "answer": "Yes, Neo4j is supported by cognee.",
    "golden_answer": "Yes"
  }
]
metrics_output.json
Contains evaluation results for each question, including scores and rationales. Note that cognee’s default prompts are designed to maximize user experience rather than evaluation scores, so the base setting produces broader answers than the terse golden answers. Example:
[
  {
    "question": "Is Neo4j supported by cognee?",
    "answer": "Yes, Neo4j is supported by cognee.",
    "golden_answer": "Yes",
    "metrics": {
      "correctness": {
        "score": 0.7815554704711645,
        "reason": "The actual output confirms that Neo4j is supported by cognee, which aligns with the expected output's affirmative response, but it contains unnecessary detail..."
      },
      "EM": {
        "score": 0.0,
        "reason": "Not an exact match"
      },
      "f1": {
        "score": 0.2857142857142857,
        "reason": "F1: 0.29 (Precision: 0.17, Recall: 1.00)"
      }
    }
  }
]
dashboard.html
An HTML dashboard that summarizes all the metrics visually. You can find an example here.
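
If you prefer to inspect the scores programmatically rather than through the dashboard, a small helper like the sketch below (not part of cognee; it only assumes the metrics_output.json structure shown above) can aggregate them:

import json
from collections import defaultdict

with open("metrics_output.json") as f:
    results = json.load(f)

scores = defaultdict(list)
for entry in results:
    for name, metric in entry["metrics"].items():
        scores[name].append(metric["score"])

for name, values in scores.items():
    print(f"{name}: mean={sum(values) / len(values):.3f} over {len(values)} question(s)")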

Are you ready to test it out with your parameters?

Configure the above parameters as you wish in your .env file and simply run:

python evals/eval_framework/run_eval.py
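
For reference, a populated .env might look something like the sketch below. The key names simply mirror the parameter names documented above and are an assumption; check the eval framework’s configuration classes in the repository for the exact variable names your cognee version reads.

building_corpus_from_scratch=True
number_of_samples_in_corpus=5
benchmark="HotPotQA"
answering_questions=True
qa_engine="cognee_graph_completion"
evaluating_answers=True
evaluation_engine="DeepEval"
evaluation_metrics=["correctness", "EM", "f1"]
deepeval_model="gpt-4o-mini"
dashboard=True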

Using cognee’s evaluation framework, you can:

  1. Build (or reuse) a corpus from the specified benchmark.
  2. Generate answers to the collected questions.
  3. Evaluate those answers using the metrics you specified.
  4. Produce a dashboard summarizing the results.
  5. Inspect the outputs:
  • questions_output.json for generated or fetched questions.
  • answers_output.json for final answers and reference answers.
  • metrics_output.json for the calculated metrics.
  • dashboard.html to visually explore the evaluation results.

Feel free to reach out with any questions or suggestions on how to improve the evaluation framework.

