The evaluation framework is driven by the `run_eval.py` entry point. By capturing specific metrics, you can iterate on your pipeline with confidence. Below, you will find how to:

- Configure defaults in `eval_config.py`, which can be overridden by your `.env` file.
- Call `run_corpus_builder()` to either create a new corpus or use an existing one.
- Call `run_question_answering()` to generate answers for each question in the corpus.
- Call `run_evaluation()` to calculate metrics comparing generated answers with reference answers.
- Call `create_dashboard()` to help you analyze the results.
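These four steps run in sequence. As a rough illustration of how they fit together, here is a minimal sketch; the function bodies are stand-ins, and the `config` dictionary and async signatures are assumptions rather than the actual interfaces in `run_eval.py`:

```python
# Minimal sketch of chaining the four pipeline steps. The function bodies below
# are stand-ins; in the real framework these steps live behind run_eval.py and
# are configured via eval_config.py and your .env file. The config dict and the
# async signatures are assumptions made for illustration.
import asyncio

async def run_corpus_builder(config: dict) -> None:
    print("building or reusing the corpus ...")

async def run_question_answering(config: dict) -> None:
    print(f"answering questions with qa_engine={config['qa_engine']} ...")

async def run_evaluation(config: dict) -> None:
    print("comparing generated answers with reference answers ...")

async def create_dashboard(config: dict) -> None:
    print("writing dashboard.html ...")

async def run_pipeline() -> None:
    config = {
        "qa_engine": "cognee_completion",  # which retriever to use
        "evaluating_contexts": True,       # also save/evaluate retrieved contexts
    }
    await run_corpus_builder(config)
    await run_question_answering(config)
    await run_evaluation(config)
    await create_dashboard(config)

if __name__ == "__main__":
    asyncio.run(run_pipeline())
```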
The `CorpusBuilderExecutor` is responsible for loading questions and contexts from one of the supported benchmarks, then “cognifying” them (processing and storing them in cognee’s system). The `task_getter` parameter allows flexibility in how cognee processes the corpus. Setting the corresponding configuration flag to `True` builds a fresh corpus and removes existing data, while `False` indicates that it will not rebuild the corpus (and will not delete any existing data).

This step is triggered from `run_eval.py`. It:

- Initializes the `CorpusBuilderExecutor` with the specified benchmark and task getter
- Uses the `evaluating_contexts` flag to determine whether to load golden contexts

Key components:

- The `CorpusBuilderExecutor` class that orchestrates the corpus building process
- The `load_corpus` method that all adapters must implement
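To make the adapter contract concrete, here is a hypothetical benchmark adapter; the method signature and return shape are assumptions, since the docs only require that every adapter implement `load_corpus`:

```python
# Hypothetical benchmark adapter. The method signature and return shape are
# assumptions; the docs only state that every adapter must implement a
# load_corpus method that provides questions and contexts for cognification.
from typing import Any


class MyBenchmarkAdapter:
    """Loads question/context pairs for a custom benchmark."""

    def load_corpus(self, limit: int | None = None) -> tuple[list[str], list[dict[str, Any]]]:
        # A real adapter would read the benchmark files from disk or download
        # them; this returns a tiny in-memory example instead.
        corpus = ["Paris is the capital of France."]
        qa_pairs = [
            {
                "question": "What is the capital of France?",
                "answer": "Paris",     # reference (golden) answer
                "context": corpus[0],  # golden context, used when evaluating_contexts is enabled
            }
        ]
        if limit is not None:
            qa_pairs = qa_pairs[:limit]
        return corpus, qa_pairs
```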
The `AnswerGeneratorExecutor` processes each question from the corpus built in the previous step. Here’s what happens step-by-step:

- Questions are loaded from the `questions_output.json` file created in the corpus building step.
- Context is retrieved and an answer is generated for each question.
- The results are saved to `answers_output.json`.

The `qa_engine` parameter selects the retriever:

- `cognee_completion`: The standard retriever that uses semantic search to find relevant context.
- `cognee_graph_completion`: Uses graph-based retrieval to find connected information across documents.
- `graph_summary_completion`: Combines graph retrieval with summarization for more concise context.

Setting the answering flag to `True` means the `AnswerGeneratorExecutor` will retrieve context and generate answers; `False` skips the answer generation step (e.g., if you just want to rebuild a corpus). When `evaluating_contexts` is `True`, the system will save both the retrieved contexts and the golden contexts (if available) for evaluation. This allows you to assess not just answer quality but also retrieval quality.

This step is triggered from `run_eval.py`. It:

- Selects the retriever according to the `qa_engine` parameter
- Uses the `AnswerGeneratorExecutor` to process each question

Key components:

- The `AnswerGeneratorExecutor` class with the `question_answering_non_parallel` method
- The `retriever_options` dictionary
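The sketch below shows the overall shape of this step: map a `qa_engine` value to a retriever and answer questions one by one. The retriever callables, the JSON record keys, and the stand-in for the LLM call are hypothetical; only the `qa_engine` values and the file names come from the docs.

```python
# Sketch of mapping a qa_engine value to a retriever and answering questions in
# a simple non-parallel loop. The retriever callables, the JSON record keys, and
# the stand-in for the LLM call are hypothetical.
import json


def make_retriever(qa_engine: str):
    retriever_options = {
        "cognee_completion": lambda q: f"semantic-search context for: {q}",
        "cognee_graph_completion": lambda q: f"graph-based context for: {q}",
        "graph_summary_completion": lambda q: f"summarized graph context for: {q}",
    }
    return retriever_options[qa_engine]


def answer_questions(qa_engine: str = "cognee_completion") -> None:
    retrieve = make_retriever(qa_engine)

    with open("questions_output.json") as f:
        questions = json.load(f)

    answers = []
    for item in questions:                         # one question at a time (non-parallel)
        context = retrieve(item["question"])       # retrieval step
        answer = f"answer grounded in: {context}"  # placeholder for the real LLM call
        # keep the original fields (e.g., the reference answer) and add the results
        answers.append({**item, "answer": answer, "retrieval_context": context})

    with open("answers_output.json", "w") as f:
        json.dump(answers, f, indent=2)
```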
The evaluation step scores the generated answers against the references:

- Answers are loaded from the `answers_output.json` file created in the question answering step.
- Metrics are saved to `metrics_output.json` and the relational database.

Setting the evaluation flag to `True` triggers the `EvaluationExecutor` to evaluate the answers. When context evaluation is enabled (`True`), additional context-related metrics are included in the evaluation. The model used for LLM-based evaluation is configurable (e.g., `gpt-4o-mini`).

This step is triggered from `run_eval.py`. It:

- Creates the evaluator based on the `evaluation_engine` parameter
- Uses the `EvaluationExecutor` to process each answer

Key components:

- The `EvaluationExecutor` class that orchestrates the evaluation process
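As a self-contained illustration of the read–score–write flow, here is a sketch that computes only a simple exact-match score; the JSON keys are assumptions, and the real evaluators also support LLM-based judging:

```python
# Sketch of the evaluation flow under assumed file formats: load the generated
# answers, score each against its reference, and write per-question metrics.
# Only a simple exact-match score is computed here so the example stays
# self-contained; the real EvaluationExecutor also supports LLM-based metrics.
import json


def exact_match(prediction: str, reference: str) -> float:
    return float(prediction.strip().lower() == reference.strip().lower())


def evaluate_answers() -> None:
    with open("answers_output.json") as f:
        answers = json.load(f)

    metrics = []
    for item in answers:
        metrics.append(
            {
                "question": item["question"],  # assumed key
                "exact_match": exact_match(item["answer"], item["golden_answer"]),  # assumed keys
            }
        )

    with open("metrics_output.json", "w") as f:
        json.dump(metrics, f, indent=2)
```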
The dashboard step turns the metrics into a visual report:

- Metrics are read from the `metrics_output.json` and `aggregate_metrics.json` files.
- The dashboard is saved as `dashboard.html` for easy viewing.

When the dashboard flag is set to `True`, the system will generate the dashboard visualization. The dashboard generation is handled by the `metrics_dashboard.py` file, which contains several key functions for generating the dashboard.
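For orientation, here is a minimal read–aggregate–render sketch; the metric keys and the HTML layout are assumptions, and the real `metrics_dashboard.py` produces richer charts:

```python
# Sketch of a dashboard generator under assumed file formats: read the
# per-question metrics, aggregate them, and write a minimal HTML page.
import json
from statistics import mean


def create_simple_dashboard() -> None:
    with open("metrics_output.json") as f:
        metrics = json.load(f)

    avg_em = mean(m["exact_match"] for m in metrics) if metrics else 0.0

    html = (
        "<html><body>"
        "<h1>Evaluation results</h1>"
        f"<p>Questions evaluated: {len(metrics)}</p>"
        f"<p>Average exact match: {avg_em:.2f}</p>"
        "</body></html>"
    )
    with open("dashboard.html", "w") as f:
        f.write(html)
```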
A run produces the following outputs:

- `questions_output.json` for generated or fetched questions.
- `answers_output.json` for final answers and reference answers.
- `metrics_output.json` for the calculated metrics.
- `dashboard.html` to visually explore the evaluation results.
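If you want a quick sanity check after a run, something like the following can summarize the JSON outputs; it assumes each file holds a list of per-question records, which is an assumption about the exact schema:

```python
# Quick sanity check over the output files. Assumes each JSON file contains a
# list of per-question records, which is an assumption about the exact schema.
import json

for name in ("questions_output.json", "answers_output.json", "metrics_output.json"):
    with open(name) as f:
        records = json.load(f)
    print(f"{name}: {len(records)} records")
```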