Cognee Evaluation Framework
Without evaluation, you have no objective way to know whether your system’s answers are good or bad. With cognee’s evaluation framework, you can easily configure and run evaluations using the run_eval.py entry point.
By capturing specific metrics, you can iterate on your pipeline with confidence. Below, you will find how to:
- Build or reuse an existing corpus.
- Generate and retrieve answers.
- Evaluate the answers using different metrics.
- Visualize the metrics in a dashboard.
To run evaluations quickly with cognee:
Step 1: Clone cognee repo
git clone https://github.com/topoteretes/cognee.git
Step 2: Install with poetry
Navigate to cognee repo
cd cognee
Install with poetry
poetry install
Step 3: Run evaluation for a dummy question
Run
python evals/eval_framework/run_eval.py
Step 4: Navigate to the output files to see the results
- “questions_output.json”
- “answers_output.json”
- “metrics_output.json”
- “dashboard.html”
Explore the details and further explanation below.
Here is a high-level diagram of the evaluation process, illustrating how various executors interact:

Let’s explore each executor and learn how to configure them.
Preliminaries: Configuration with eval_config.py
All the parameters mentioned throughout this documentation are defined in the eval_config.py
file. This configuration file uses Pydantic settings to manage all evaluation parameters in one place.
# Example from eval_config.py
class EvalConfig(BaseSettings):
    # Corpus builder params
    building_corpus_from_scratch: bool = True
    number_of_samples_in_corpus: int = 1
    benchmark: str = "Dummy"
    # ... more parameters
You can override any of these parameters by setting them in your .env
file. For example:
# .env file example
BUILDING_CORPUS_FROM_SCRATCH=False
NUMBER_OF_SAMPLES_IN_CORPUS=10
BENCHMARK=HotPotQA
DEEPEVAL_MODEL=gpt-4o
This makes it easy to customize your evaluation runs without modifying the code. All the parameters described in the following sections can be configured this way.
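If you want to verify which values are actually in effect, you can instantiate the config directly. This is a minimal sketch; it assumes eval_config.py sits next to run_eval.py in evals/eval_framework/, so adjust the import path if your checkout differs.
from evals.eval_framework.eval_config import EvalConfig

config = EvalConfig()  # values from your .env file override the defaults
print(config.benchmark, config.number_of_samples_in_corpus, config.deepeval_model)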
1. Main (run_eval) & cognee Framework
- Main (run_eval) orchestrates the entire flow.
- cognee Framework provides the underlying infrastructure to run cognee’s cognify pipeline.
Here’s what happens step-by-step when you run the evaluation:
- Configuration Loading: The script first loads all parameters from eval_config.py, which can be overridden by your .env file.
- Corpus Building: Calls run_corpus_builder() to either create a new corpus or use an existing one.
- Question Answering: Executes run_question_answering() to generate answers for each question in the corpus.
- Evaluation: Runs run_evaluation() to calculate metrics comparing generated answers with reference answers.
- Dashboard Generation: If enabled, creates a visual dashboard using create_dashboard() to help you analyze the results.
Each step produces output files that feed into the next step, creating a complete evaluation pipeline.
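To make the flow concrete, here is a self-contained mock of those steps. The stub functions below only print what each stage produces; the real run_eval.py calls cognee’s actual modules, and the exact signatures may differ.
import asyncio

async def run_corpus_builder():
    print("corpus built -> questions_output.json")

async def run_question_answering():
    print("answers generated -> answers_output.json")

async def run_evaluation():
    print("metrics computed -> metrics_output.json")

def create_dashboard():
    print("dashboard rendered -> dashboard.html")

async def main(dashboard: bool = True):
    await run_corpus_builder()        # build or reuse the corpus
    await run_question_answering()    # retrieve context and answer each question
    await run_evaluation()            # score answers against golden answers
    if dashboard:
        create_dashboard()            # optional visualization step

asyncio.run(main())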
2. Corpus Building
A corpus is a collection of questions and contexts that the system will cognify and use to generate answers, which will later be evaluated. The CorpusBuilderExecutor is responsible for loading questions and contexts from one of the supported benchmarks, then “cognifying” them (processing and storing them in cognee’s system).
Benchmark Adapters
Cognee supports multiple benchmark datasets through adapter classes:
- BaseBenchmarkAdapter: An abstract base class that defines the interface for all benchmark adapters.
- Implementations:
- HotpotQAAdapter: Loads multi-hop questions from the HotpotQA dataset.
- TwoWikiMultihopAdapter: Extends HotpotQA with additional evidence from the 2WikiMultihop dataset.
- MusiqueAdapter: Handles the Musique dataset with question decomposition.
- DummyAdapter: Provides a simple test question for quick demonstrations.
Each adapter loads questions, their answers, and context documents. The golden answers are saved for later evaluation but aren’t cognified. Optionally, adapters can also extract “golden contexts” (the specific parts of documents that answer the question) for context evaluation.
Task Getters
The task_getter
parameter allows flexibility in how cognee processes the corpus:
- It determines which pipeline will be used to process the documents.
- By default, it uses the standard cognee pipeline, but you can configure alternative pipelines.
- This is useful for comparing different processing strategies in your evaluations.
Configuring Corpus Building
building_corpus_from_scratch: bool = True
number_of_samples_in_corpus: int = 1
benchmark: str = "Dummy" # Options: 'HotPotQA', 'TwoWikiMultiHop', 'Musique', 'Dummy'
task_getter_type: str = "Default" # Options: 'Default', 'CascadeGraph'
- building_corpus_from_scratch: True builds a fresh corpus and removes existing data; False means the corpus will not be rebuilt (and no existing data will be deleted).
- number_of_samples_in_corpus: How many questions you want to include from the selected benchmark.
- benchmark: The dataset or benchmark from which questions are sampled.
- task_getter_type: Determines which pipeline configuration to use when processing documents.
Implementation Details
The corpus building process is implemented in several key files:
- run_corpus_builder.py: This is the entry point called by run_eval.py. It:
  - Creates a CorpusBuilderExecutor with the specified benchmark and task getter
  - Calls the executor to build the corpus with the configured parameters
  - Saves the questions to the output file and the relational database
  - Handles the evaluating_contexts flag to determine whether to load golden contexts
- corpus_builder_executor.py: Contains the core logic for building the corpus:
  - Defines the CorpusBuilderExecutor class that orchestrates the corpus building process
  - Loads the appropriate benchmark adapter based on the configuration
  - Calls cognee’s core functions to process and store the documents
  - Manages the execution of the configured task getter pipeline
- base_benchmark_adapter.py: Defines the abstract interface that all benchmark adapters must implement:
  - Provides a consistent load_corpus method that all adapters must implement
  - Ensures all adapters can handle parameters like sample limits and golden context loading
These files work together to load questions and contexts from the selected benchmark, process them through cognee’s pipeline, and prepare them for the question answering phase.
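For orientation, here is a sketch of what the adapter interface looks like. Only the class name and the load_corpus method come from the description above; the parameter names and return type are illustrative assumptions, not cognee’s exact definitions.
from abc import ABC, abstractmethod
from typing import Any, Dict, List, Optional, Tuple

class BaseBenchmarkAdapter(ABC):
    @abstractmethod
    def load_corpus(
        self,
        limit: Optional[int] = None,        # e.g. number_of_samples_in_corpus
        load_golden_context: bool = False,  # driven by evaluating_contexts
    ) -> Tuple[List[str], List[Dict[str, Any]]]:
        """Return (context documents to cognify, question/golden-answer records)."""
        ...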
3. Question Answering
This is where cognee retrieves relevant context and generates answers to the questions. The AnswerGeneratorExecutor
processes each question from the corpus built in the previous step.
Here’s what happens step-by-step:
- Load Questions: The executor reads questions from the questions_output.json file created in the corpus building step.
- Context Retrieval: For each question, the system retrieves relevant context using the configured retriever.
- Answer Generation: Using the retrieved context, the system generates an answer to the question.
- Save Results: The questions, generated answers, golden answers, and optionally the retrieval contexts are saved to answers_output.json.
Available Retrievers
Cognee supports multiple retrieval strategies through different retriever classes:
- CompletionRetriever (cognee_completion): The standard retriever that uses semantic search to find relevant context.
- GraphCompletionRetriever (cognee_graph_completion): Uses graph-based retrieval to find connected information across documents.
- GraphSummaryCompletionRetriever (graph_summary_completion): Combines graph retrieval with summarization for more concise context.
Each retriever has different strengths depending on the types of questions and data you’re working with.
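The qa_engine setting (next section) selects one of these retrievers by name. The mapping below mirrors the retriever_options dictionary described in the implementation details further down, using class names as plain strings for illustration; the real dictionary maps names to the retriever classes themselves.
# qa_engine value -> retriever class (names shown as strings for illustration only)
retriever_options = {
    "cognee_completion": "CompletionRetriever",
    "cognee_graph_completion": "GraphCompletionRetriever",
    "graph_summary_completion": "GraphSummaryCompletionRetriever",
}
print(retriever_options["cognee_graph_completion"])  # -> GraphCompletionRetriever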
Configuring Question Answering
answering_questions: bool = True
qa_engine: str = "cognee_completion" # Options: 'cognee_completion', 'cognee_graph_completion', 'graph_summary_completion'
evaluating_contexts: bool = True # Controls whether contexts are saved for evaluation
- answering_questions: True means the AnswerGeneratorExecutor will retrieve context and generate answers; False skips the answer generation step (e.g., if you just want to rebuild a corpus).
- qa_engine: Specifies the retriever used for finding context and generating answers.
- evaluating_contexts: When True, the system will save both the retrieved contexts and the golden contexts (if available) for evaluation. This allows you to assess not just answer quality but also retrieval quality.
The retrieved contexts are saved alongside the answers and can be evaluated in the next phase of the pipeline to assess the quality of the retrieval process.
Implementation Details
The question answering process is implemented in two main files:
- run_question_answering_module.py: This is the entry point called by run_eval.py. It:
  - Loads questions from the output file of the corpus building step
  - Instantiates the appropriate retriever based on the qa_engine parameter
  - Calls the AnswerGeneratorExecutor to process each question
  - Saves the results to the answers output file and the relational database
- answer_generation_executor.py: Contains the core logic for answering questions:
  - Defines the AnswerGeneratorExecutor class with the question_answering_non_parallel method
  - Maps retriever names to their implementation classes via the retriever_options dictionary
  - For each question, retrieves context and generates an answer using the configured retriever
  - Packages the question, answer, golden answer, and retrieval context into a structured format
These files work together to transform the questions from the corpus into answered questions with their associated contexts, ready for evaluation.
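As a rough sketch of that loop (not cognee’s actual code): the retriever method names get_context and get_completion below are assumptions for illustration; see answer_generation_executor.py for the real interface.
from typing import Any, Dict, List

async def answer_questions(
    questions: List[Dict[str, Any]],
    retriever: Any,
    evaluating_contexts: bool = False,
) -> List[Dict[str, Any]]:
    answers = []
    for item in questions:
        # Assumed retriever interface: fetch context, then generate a completion.
        context = await retriever.get_context(item["question"])
        answer = await retriever.get_completion(item["question"], context)
        record = {
            "question": item["question"],
            "answer": answer,
            "golden_answer": item["answer"],  # the benchmark's reference answer
        }
        if evaluating_contexts:
            record["retrieval_context"] = context  # kept for context metrics
        answers.append(record)
    return answers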
4. Evaluating the Answers
After the questions are answered, cognee evaluates each answer against the benchmark’s reference (“golden”) answer using the specified metrics. This lets you see how reliable or accurate your system is.
Here’s what happens step-by-step:
- Load Answers: The executor reads answers from the answers_output.json file created in the question answering step.
- Initialize Evaluator: The system creates an evaluator based on the configured evaluation engine.
- Apply Metrics: For each answer, the system calculates scores using the specified metrics.
- Save Results: The evaluation results are saved to metrics_output.json and the relational database.
Available Evaluators
Cognee supports multiple evaluation approaches through different adapter classes:
- DeepEvalAdapter: Uses the DeepEval library to calculate metrics, supporting both traditional metrics and LLM-based evaluations.
- DirectLLMEvalAdapter: Uses a direct call to an LLM with custom prompts to evaluate answer correctness.
Evaluation Metrics
Depending on the evaluator, different metrics are available:
- correctness: Uses an LLM-based approach to see if the meaning of the answer aligns with the golden answer.
- EM (Exact Match): Checks if the generated answer exactly matches the golden answer.
- f1: Uses token-level matching to measure precision and recall (a worked sketch follows this list).
- contextual_relevancy: Evaluates how relevant the retrieved context is to the question (when context evaluation is enabled).
- context_coverage: Assesses how well the retrieved context covers the information needed to answer the question (when context evaluation is enabled).
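For instance, the f1 metric can be reproduced with a simple token-overlap computation. The sketch below follows the common SQuAD-style approach (lowercase, keep word tokens, compare overlap); cognee’s exact normalization may differ slightly.
import re
from collections import Counter

def token_f1(prediction: str, golden: str) -> float:
    def tokenize(text: str) -> list:
        return re.findall(r"\w+", text.lower())

    pred, gold = tokenize(prediction), tokenize(golden)
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

# "Yes, Neo4j is supported by cognee." vs. golden "Yes":
# precision 1/6, recall 1/1 -> F1 ≈ 0.29 (cf. the sample metrics_output.json below)
print(token_f1("Yes, Neo4j is supported by cognee.", "Yes"))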
Configuring Evaluation
evaluating_answers: bool = True
evaluating_contexts: bool = True
evaluation_engine: str = "DeepEval" # Options: 'DeepEval', 'DirectLLM'
evaluation_metrics: List[str] = ["correctness", "EM", "f1"]
deepeval_model: str = "gpt-4o-mini"
- evaluating_answers: True triggers the EvaluationExecutor to evaluate the answers.
- evaluating_contexts: When True, additional context-related metrics are included in the evaluation.
- evaluation_engine: The evaluation adapter to use.
- evaluation_metrics: The list of metrics to calculate for each answer.
- deepeval_model: The LLM used for computing the LLM-based metrics (e.g., gpt-4o-mini).
Implementation Details
The evaluation process is implemented in several key files:
- run_evaluation_module.py: This is the entry point called by run_eval.py. It:
  - Loads answers from the output file of the question answering step
  - Initializes the appropriate evaluator based on the evaluation_engine parameter
  - Calls the EvaluationExecutor to process each answer
  - Saves the results to the metrics output file and the relational database
- evaluation_executor.py: Contains the core logic for evaluating answers:
  - Defines the EvaluationExecutor class that orchestrates the evaluation process
  - Selects the appropriate evaluator adapter based on configuration
  - Adds context evaluation metrics if context evaluation is enabled
- evaluator_adapters.py: Defines the available evaluator adapters:
  - Maps evaluator names to their implementation classes
  - Provides a consistent interface for different evaluation approaches
- deep_eval_adapter.py and direct_llm_eval_adapter.py: Implement specific evaluation strategies:
  - DeepEvalAdapter uses the DeepEval library with multiple metrics
  - DirectLLMEvalAdapter uses direct LLM calls with custom prompts
These files work together to evaluate the quality of the generated answers and optionally the retrieved contexts.
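Conceptually, each adapter exposes the same evaluation entry point. The class and method names in the sketch below are illustrative assumptions rather than cognee’s actual API; they only show the shape of the contract the adapters share.
from abc import ABC, abstractmethod
from typing import Any, Dict, List

class EvalAdapter(ABC):  # hypothetical name for the shared interface
    @abstractmethod
    async def evaluate_answers(
        self,
        answers: List[Dict[str, Any]],  # records shaped like answers_output.json
        evaluator_metrics: List[str],   # e.g. ["correctness", "EM", "f1"]
    ) -> List[Dict[str, Any]]:
        """Return one record per answer with a metrics dict (see metrics_output.json)."""
        ...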
5. Creating Dashboard
Cognee generates a visual dashboard to help you analyze evaluation results. The dashboard presents metrics in an interactive HTML file that you can open in any web browser, making it easy to spot patterns and issues in your model’s outputs.
Here’s what happens step-by-step:
- Load Metrics: The dashboard generator reads metrics from the metrics_output.json and aggregate_metrics.json files.
- Generate Visualizations: The system creates various charts and tables to visualize the evaluation results.
- Compile HTML: All visualizations are combined into a single HTML file with CSS styling and JavaScript for interactivity.
- Save Dashboard: The complete dashboard is saved to dashboard.html for easy viewing.
Dashboard Features
The generated dashboard includes several key visualizations:
- Distribution Histograms: Show the distribution of scores for each metric, helping you understand the overall performance patterns.
- Confidence Interval Plot: Displays the mean score and the 95% confidence interval for each metric, giving you statistical insight into the reliability of the results (see the sketch after this list).
- Detailed Tables: For each metric, a table shows the individual scores, reasons, and relevant data (questions, answers, contexts) for in-depth analysis.
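If you want to reproduce the confidence interval figure outside the dashboard, a normal-approximation 95% interval over a metric’s scores looks roughly like this; it is illustrative, not cognee’s exact implementation.
import statistics

def mean_and_ci95(scores: list) -> tuple:
    mean = statistics.mean(scores)
    if len(scores) < 2:
        return mean, mean, mean
    # Standard error of the mean; 1.96 is the z-value for a 95% interval.
    se = statistics.stdev(scores) / len(scores) ** 0.5
    return mean, mean - 1.96 * se, mean + 1.96 * se

print(mean_and_ci95([0.78, 0.0, 0.29]))  # (mean, lower bound, upper bound)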
Configuring Dashboard Generation
dashboard: bool = True
aggregate_metrics_path: str = "aggregate_metrics.json"
dashboard_path: str = "dashboard.html"
- dashboard: When True, the system will generate the dashboard visualization.
- aggregate_metrics_path: The path where aggregate metrics (means, confidence intervals) are stored.
- dashboard_path: The output path for the generated HTML dashboard.
Implementation Details
The dashboard generation is implemented in the metrics_dashboard.py
file, which contains several key functions:
- create_dashboard(): The main entry point that orchestrates the dashboard creation process:
  - Reads metrics data from the output files
  - Calls visualization functions to generate charts
  - Assembles the complete HTML dashboard
  - Saves the result to the specified output file
- create_distribution_plots(): Generates histograms showing the distribution of scores for each metric.
- create_ci_plot(): Creates a bar chart with error bars showing the mean and confidence interval for each metric.
- generate_details_html(): Produces HTML tables with detailed information about each evaluation result.
- get_dashboard_html_template(): Assembles all visualizations into a complete HTML document with styling.
This dashboard provides a comprehensive view of your evaluation results, making it easy to understand how well your system is performing and where improvements might be needed.
Output Files: Where to Look
By default, the evaluation flow generates the following files (and a relational database) that capture the entire workflow:
questions_path: str = "questions_output.json"
answers_path: str = "answers_output.json"
metrics_path: str = "metrics_output.json"
dashboard_path: str = "dashboard.html"
questions_output.json
[
  {
    "answer": "Yes",
    "question": "Is Neo4j supported by cognee?",
    "type": "dummy"
  }
]
answers_output.json
[
  {
    "question": "Is Neo4j supported by cognee?",
    "answer": "Yes, Neo4j is supported by cognee.",
    "golden_answer": "Yes"
  }
]
metrics_output.json
[
  {
    "question": "Is Neo4j supported by cognee?",
    "answer": "Yes, Neo4j is supported by cognee.",
    "golden_answer": "Yes",
    "metrics": {
      "correctness": {
        "score": 0.7815554704711645,
        "reason": "The actual output confirms that Neo4j is supported by cognee, which aligns with the expected output's affirmative response, but it contains unnecessary detail..."
      },
      "EM": {
        "score": 0.0,
        "reason": "Not an exact match"
      },
      "f1": {
        "score": 0.2857142857142857,
        "reason": "F1: 0.29 (Precision: 0.17, Recall: 1.00)"
      }
    }
  }
]
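If you prefer to inspect the metrics programmatically rather than through the dashboard, the file can be loaded like any other JSON document:
import json

with open("metrics_output.json") as f:
    results = json.load(f)

for item in results:
    scores = {name: metric["score"] for name, metric in item["metrics"].items()}
    print(item["question"], scores)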
dashboard.html
Are you ready to test it out with your parameters?
Configure the above parameters as you wish in your .env file and simply run:
python evals/eval_framework/run_eval.py
Using cognee’s evaluation framework, you can:
- Build (or reuse) a corpus from the specified benchmark.
- Generate answers to the collected questions.
- Evaluate those answers using the metrics you specified.
- Produce a dashboard summarizing the results.
- Inspect the outputs:
  - questions_output.json for generated or fetched questions.
  - answers_output.json for final answers and reference answers.
  - metrics_output.json for the calculated metrics.
  - dashboard.html to visually explore the evaluation results.
Feel free to reach out with any questions or suggestions on how to improve the evaluation framework.