Cognee Evaluation Framework
Without evaluation, you have no objective way to know whether your system’s answers are good or bad. With cognee’s evaluation framework, you can easily configure and run evaluations using the `run_eval.py` entry point. By capturing specific metrics, you can iterate on your pipeline with confidence. Below, you will find how to:
- Build or reuse an existing corpus.
- Generate and retrieve answers.
- Evaluate the answers using different metrics.
- Visualize the metrics in a dashboard.
To run evaluations quickly with cognee:
Step 1: Clone cognee repo

```bash
git clone https://github.com/topoteretes/cognee.git
```

Step 2: Install with poetry

```bash
# Navigate to the cognee repo
cd cognee
# Install with poetry
poetry install
```

Step 3: Run evaluation for a dummy question

```bash
python evals/eval_framework/run_eval.py
```
Step 4: Navigate to the output files to see the results
- `questions_output.json`
- `answers_output.json`
- `metrics_output.json`
- `dashboard.html`
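If you prefer to verify the run from a script, here is a minimal sanity check. It assumes the default output files end up in the directory you ran the evaluation from:

```python
# Minimal sanity check after the dummy run. Assumes the four default output
# files are written to the directory the evaluation was run from.
from pathlib import Path

for name in [
    "questions_output.json",
    "answers_output.json",
    "metrics_output.json",
    "dashboard.html",
]:
    print(f"{name}: {'found' if Path(name).exists() else 'missing'}")
```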
Explore the details and further explanation below.
Here is a high-level diagram of the evaluation process, illustrating how various executors interact:
![cognee_evaluation_diagram](/images/cognee_evaluation_diagram.png)
Let’s explore each executor and learn how to configure them.
1. Main (run_eval) & cognee Framework
- Main (run_eval) orchestrates the entire flow.
- cognee Framework provides the underlying infrastructure to run cognee’s cognify pipeline.
2. CorpusBuilderExecutor
A corpus is a collection of questions that your system will attempt to answer. Before answering any questions, you need to decide whether to build a new corpus or reuse an existing one. For example:
- If you want to test your system on a new dataset or remove old data, you build the corpus from scratch.
- If you already have a dataset set up, you can reuse an existing corpus without rebuilding.
Configuring Corpus Building
```python
building_corpus_from_scratch: bool = True
number_of_samples_in_corpus: int = 1
benchmark: str = "Dummy"  # Other Options: 'HotPotQA', 'TwoWikiMultiHop'
```
- `building_corpus_from_scratch`: `True` builds a fresh corpus and removes existing data, while `False` means the corpus will not be rebuilt (and no existing data will be deleted).
- `number_of_samples_in_corpus`: How many questions you want to include from the selected benchmark.
- `benchmark`: The dataset or benchmark from which questions are sampled. `Dummy` simply produces a single test question for demonstration.
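As a purely illustrative sketch, the snippet below mirrors these three options as a small config object populated from environment variables (for example, loaded from your `.env` file). The variable names and the `CorpusBuilderConfig` class are hypothetical, not cognee’s actual configuration API:

```python
# Hypothetical illustration: mirrors the corpus-building options above, reading
# them from environment variables. The variable names and this class are
# assumptions for demonstration, not cognee's actual configuration API.
import os
from dataclasses import dataclass


@dataclass
class CorpusBuilderConfig:
    building_corpus_from_scratch: bool = os.getenv("BUILDING_CORPUS_FROM_SCRATCH", "true").lower() == "true"
    number_of_samples_in_corpus: int = int(os.getenv("NUMBER_OF_SAMPLES_IN_CORPUS", "1"))
    benchmark: str = os.getenv("BENCHMARK", "Dummy")  # or 'HotPotQA', 'TwoWikiMultiHop'


print(CorpusBuilderConfig())
```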
3. AnswerGeneratorExecutor
This is where cognee retrieves relevant context and generates answers to the questions. Ultimately, the evaluation metrics measure how well your system answers these questions.
Configuring Question Answering
```python
answering_questions: bool = True
qa_engine: str = "cognee_completion"  # Options: 'cognee_completion' or 'cognee_graph_completion'
```
- `answering_questions`: `True` means the `AnswerGeneratorExecutor` will retrieve context and generate answers; `False` skips the answer generation step (e.g., if you just want to rebuild a corpus).
- `qa_engine`: Specifies the engine used for retrieval and generation. You can choose one of the two search types from cognee: `cognee_completion` or `cognee_graph_completion`.
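Conceptually, this step loops over the corpus, retrieves context for each question, and generates an answer, producing records shaped like those in `answers_output.json`. Here is a rough sketch with a stand-in `qa_engine` callable rather than cognee’s actual search API:

```python
# Conceptual sketch of the answer-generation loop (not cognee's actual API).
# `qa_engine` is a stand-in for the configured engine, e.g. cognee_completion.
import json


def generate_answers(corpus, qa_engine):
    results = []
    for item in corpus:
        generated = qa_engine(item["question"])  # retrieval + completion
        results.append(
            {
                "question": item["question"],
                "answer": generated,
                "golden_answer": item["answer"],
            }
        )
    return results


corpus = [{"question": "Is Neo4j supported by cognee?", "answer": "Yes", "type": "dummy"}]
answers = generate_answers(corpus, lambda q: "Yes, Neo4j is supported by cognee.")
print(json.dumps(answers, indent=2))
```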
4. EvaluationExecutor
After the questions are answered, cognee can automatically evaluate each answer against a reference (“golden”) answer using specified metrics. This lets you see how reliable or accurate your system is.
Configuring Evaluation
```python
evaluating_answers: bool = True
evaluation_engine: str = "DeepEval"
evaluation_metrics: List[str] = ["correctness", "EM", "f1"]
deepeval_model: str = "gpt-4o-mini"
```
- `evaluating_answers`: `True` triggers the `EvaluationExecutor` to evaluate the answers.
- `evaluation_engine`: The evaluation executor to use. Currently, `DeepEval` is supported.
- `evaluation_metrics`:
  - correctness – Uses an LLM-based approach (via `deepeval_model`) to see if the meaning of the answer aligns with the golden answer.
  - EM (Exact Match) – Checks if the generated answer exactly matches the golden (reference) answer.
  - f1 – Uses token-level matching to measure precision and recall.
- `deepeval_model`: The LLM used for computing the correctness score (e.g., `gpt-4o-mini`).
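To make the EM and f1 scores concrete, here is a standard token-level implementation (cognee’s exact normalization rules may differ). Run on the dummy example, it reproduces the f1 value shown later in `metrics_output.json`:

```python
# Standard token-level EM and F1, shown for illustration; cognee's exact
# normalization may differ slightly.
import re
from collections import Counter


def normalize(text: str) -> list:
    # Lowercase, strip punctuation, split on whitespace.
    return re.sub(r"[^\w\s]", "", text.lower()).split()


def exact_match(prediction: str, golden: str) -> float:
    return float(normalize(prediction) == normalize(golden))


def f1(prediction: str, golden: str) -> float:
    pred_tokens, gold_tokens = normalize(prediction), normalize(golden)
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("Yes, Neo4j is supported by cognee.", "Yes"))  # 0.0
print(f1("Yes, Neo4j is supported by cognee.", "Yes"))           # ~0.286
```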
5. Dashboard Generator
Cognee generates a dashboard in a `dashboard.html` file for visualizing the evaluation results. You can open this file in a web browser to see charts and tables on each metric. It’s much easier to spot patterns or issues in your model’s outputs when you can visually inspect the data.
Configuring Visualization (Dashboard)
```python
dashboard: bool = True
```
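As a small convenience, you can open the generated dashboard straight from Python once the run finishes (assuming the default `dashboard.html` output path in the current directory):

```python
# Open the generated dashboard in the default browser.
# Assumes the default dashboard_path ("dashboard.html") in the current directory.
import webbrowser
from pathlib import Path

webbrowser.open(Path("dashboard.html").resolve().as_uri())
```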
Output Files: Where to Look
By default, the evaluation flow generates the following files (and a relational database) that capture the entire workflow:
```python
questions_path: str = "questions_output.json"
answers_path: str = "answers_output.json"
metrics_path: str = "metrics_output.json"
dashboard_path: str = "dashboard.html"
```
`questions_output.json`

```json
[
  {
    "answer": "Yes",
    "question": "Is Neo4j supported by cognee?",
    "type": "dummy"
  }
]
```
`answers_output.json`

```json
[
  {
    "question": "Is Neo4j supported by cognee?",
    "answer": "Yes, Neo4j is supported by cognee.",
    "golden_answer": "Yes"
  }
]
```
`metrics_output.json`

```json
[
  {
    "question": "Is Neo4j supported by cognee?",
    "answer": "Yes, Neo4j is supported by cognee.",
    "golden_answer": "Yes",
    "metrics": {
      "correctness": {
        "score": 0.7815554704711645,
        "reason": "The actual output confirms that Neo4j is supported by cognee, which aligns with the expected output's affirmative response, but it contains unnecessary detail..."
      },
      "EM": {
        "score": 0.0,
        "reason": "Not an exact match"
      },
      "f1": {
        "score": 0.2857142857142857,
        "reason": "F1: 0.29 (Precision: 0.17, Recall: 1.00)"
      }
    }
  }
]
```
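If you want to post-process the scores instead of (or in addition to) viewing the dashboard, the metrics file is plain JSON and straightforward to load:

```python
# Load the metrics file and print a per-question summary of the scores,
# using the default metrics_path shown above.
import json

with open("metrics_output.json") as f:
    results = json.load(f)

for entry in results:
    scores = {name: metric["score"] for name, metric in entry["metrics"].items()}
    print(entry["question"], scores)
```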
`dashboard.html`: open this file in your browser to explore the results visually.
Are you ready to test it out with your parameters?
Configure the above parameters as you wish in your `.env` file and simply run:

```bash
python evals/eval_framework/run_eval.py
```
Using cognee’s evaluation framework, you can:
- Build (or reuse) a corpus from the specified benchmark.
- Generate answers to the collected questions.
- Evaluate those answers using the metrics you specified.
- Produce a dashboard summarizing the results.
- Inspect the outputs:
  - `questions_output.json` for generated or fetched questions.
  - `answers_output.json` for final answers and reference answers.
  - `metrics_output.json` for the calculated metrics.
  - `dashboard.html` to visually explore the evaluation results.
Feel free to reach out with any questions or suggestions on how to improve the evaluation framework.