Cognee ships a self-contained evaluation framework at cognee/eval_framework/ that lets you benchmark retrieval quality on standard multi-hop QA datasets, compare different search strategies, and inspect results in an interactive HTML dashboard — all without any third-party evaluation service.
This page covers the built-in evaluation framework. For the DeepEval integration (LLM-as-a-judge scoring), see Evaluation with DeepEval.

Overview

The pipeline has four sequential stages:
Corpus Builder → Answer Generation → Evaluation → Dashboard
| Stage | What it does |
| --- | --- |
| Corpus Builder | Downloads a benchmark dataset, ingests the corpus into Cognee (add → cognify), and persists Q&A pairs to disk. |
| Answer Generation | Runs each question through your chosen retriever and records the generated answer alongside the golden answer. |
| Evaluation | Scores every answer with the metrics you select (EM, F1, correctness, contextual relevancy, context coverage). |
| Dashboard | Produces a standalone dashboard.html with per-metric histograms, 95% confidence-interval bar charts, and a details table. |

Quick Start

Before you start, complete Quickstart, make sure you already have data processed with cognify, and set LLM_API_KEY.

Basic Usage

This minimal example runs the built-in evaluation pipeline and writes an HTML dashboard:
```python
import asyncio
from cognee.eval_framework.run_eval import main
from cognee.eval_framework.eval_config import EvalConfig

config = EvalConfig(
    benchmark="HotPotQA",
    number_of_samples_in_corpus=50,
    qa_engine="cognee_graph_completion",
    evaluation_engine="DeepEval",
    evaluation_metrics=["EM", "f1", "correctness"],
    dashboard=True,
    dashboard_path="my_dashboard.html",
)

asyncio.run(main(params=config.to_dict()))
```
This runs all four stages in sequence and writes my_dashboard.html to the current directory.

What just happened

  • Benchmark setup — EvalConfig defines which benchmark to use, how many samples to ingest, and which retrieval and evaluation engines to run.
  • Full evaluation run — main(params=config.to_dict()) executes corpus building, answer generation, evaluation, and dashboard generation in one flow.
  • Output artifacts — the run produces metrics files and an HTML dashboard you can open locally to inspect results.
Open my_dashboard.html in any browser. You will see:
  • Distribution histograms — score distributions per metric (10 bins).
  • Confidence-interval bar chart — mean score ± 95% CI for each metric.
  • Details table — per-question breakdown with generated answer, golden answer, retrieved context, score, and LLM rationale.

Distributed Execution with Modal

For large-scale evaluations, the framework can run inside Modal containers with persistent volume storage:
```shell
modal run cognee/eval_framework/modal_run_eval.py
```
A Streamlit dashboard server is available for viewing aggregated results across multiple Modal runs:
```shell
modal serve cognee/eval_framework/modal_eval_dashboard.py
```
A Dockerfile is included at cognee/eval_framework/Dockerfile for containerized deployments.

Code

Each stage exposes a standalone async function you can call from your own scripts:
```python
import asyncio
from cognee.eval_framework.corpus_builder.run_corpus_builder import run_corpus_builder
from cognee.eval_framework.answer_generation.run_question_answering_module import run_question_answering
from cognee.eval_framework.evaluation.run_evaluation_module import run_evaluation
from cognee.eval_framework.analysis.dashboard_generator import create_dashboard
from cognee.eval_framework.eval_config import EvalConfig

async def custom_eval():
    config = EvalConfig().to_dict()

    # Stage 1 – ingest corpus
    await run_corpus_builder(config)

    # Stage 2 – generate answers
    await run_question_answering(config)

    # Stage 3 – score answers
    await run_evaluation(config)

    # Stage 4 – build dashboard
    create_dashboard(
        metrics_path=config["metrics_path"],
        aggregate_metrics_path=config["aggregate_metrics_path"],
        output_file=config["dashboard_path"],
        benchmark=config["benchmark"],
    )

asyncio.run(custom_eval())
```
You can restrict which instances are evaluated using INSTANCE_FILTER:
```python
from cognee.eval_framework.corpus_builder.run_corpus_builder import run_corpus_builder

# By integer indices (0-based)
await run_corpus_builder(config, instance_filter=[0, 1, 2, 10, 42])

# By string IDs (HotPotQA uses "_id" keys)
await run_corpus_builder(config, instance_filter=["5a7a06935542990198eaf050", ...])

# By path to a JSON file containing a list of IDs or indices
await run_corpus_builder(config, instance_filter="my_instance_ids.json")
```
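The file referenced by the last form is simply a JSON list on disk. A minimal sketch of producing one (reusing the indices from the first example; the file name is arbitrary as long as it matches what you pass in):

```python
import json

# The filter file is a plain JSON list of string IDs or 0-based
# integer indices -- the same values accepted by the list forms above.
with open("my_instance_ids.json", "w") as f:
    json.dump([0, 1, 2, 10, 42], f)
```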
A typical workflow for comparing two retrieval strategies:
```python
import asyncio
import os

from cognee.eval_framework.run_eval import main

for engine in ["cognee_graph_completion", "cognee_completion"]:
    os.environ["QA_ENGINE"] = engine
    os.environ["DASHBOARD_PATH"] = f"dashboard_{engine}.html"
    os.environ["METRICS_PATH"] = f"metrics_{engine}.json"
    os.environ["ANSWERS_PATH"] = f"answers_{engine}.json"
    # Give each run its own aggregate file so the second run
    # does not overwrite the first run's aggregate statistics.
    os.environ["AGGREGATE_METRICS_PATH"] = f"aggregate_metrics_{engine}.json"
    asyncio.run(main())
```
Open both HTML files side-by-side to compare F1 and exact-match scores across retrieval strategies.
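If you prefer to compare scores programmatically, you can read the per-engine metrics files directly. The helper below is a sketch: it assumes each metrics file maps metric names to lists of {"score": float, "reason": str} entries (the shape each metric returns), which may differ from the actual file layout.

```python
import json
from statistics import mean

def mean_scores(metrics_path):
    """Average the per-sample scores in one metrics file.

    ASSUMPTION: the file maps each metric name to a list of
    {"score": float, "reason": str} entries; adjust the parsing
    if the real layout differs.
    """
    with open(metrics_path) as f:
        data = json.load(f)
    return {name: mean(e["score"] for e in entries) for name, entries in data.items()}

# After the comparison loop has run, e.g.:
#   for engine in ["cognee_graph_completion", "cognee_completion"]:
#       print(engine, mean_scores(f"metrics_{engine}.json"))
```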

Further details

All options are read from a .env file (or environment variables) via a Pydantic BaseSettings class (EvalConfig).
| Variable | Default | Description |
| --- | --- | --- |
| BENCHMARK | Dummy | Which dataset to load (Dummy, HotPotQA, Musique, TwoWikiMultiHop). |
| NUMBER_OF_SAMPLES_IN_CORPUS | 1 | How many corpus paragraphs to ingest. |
| BUILDING_CORPUS_FROM_SCRATCH | True | Re-ingest the corpus on every run, or reuse an existing Cognee index. |
| TASK_GETTER_TYPE | Default | Cognify pipeline variant described in the Pipeline Strategies section below. |

| Variable | Default | Description |
| --- | --- | --- |
| ANSWERING_QUESTIONS | True | Run the QA stage. |
| QA_ENGINE | cognee_graph_completion | Which retriever to use, as described in the QA Engines section below. |
| QUESTIONS_PATH | questions_output.json | Where to save generated answers. |

| Variable | Default | Description |
| --- | --- | --- |
| EVALUATING_ANSWERS | True | Run the metrics stage. |
| EVALUATING_CONTEXTS | True | Also compute contextual_relevancy and context_coverage. |
| EVALUATION_ENGINE | DeepEval | DeepEval or DirectLLM. |
| EVALUATION_METRICS | ["correctness", "EM", "f1"] | Any combination of the five metrics. |
| DEEPEVAL_MODEL | gpt-4o-mini | LLM used by DeepEval for correctness and contextual_relevancy. |
| ANSWERS_PATH | answers_output.json | Path to the answers file produced by the QA stage. |
| METRICS_PATH | metrics_output.json | Where per-sample metric results are written. |

| Variable | Default | Description |
| --- | --- | --- |
| CALCULATE_METRICS | True | Compute aggregate statistics (mean, 95% CI). |
| DASHBOARD | True | Generate the HTML report. |
| AGGREGATE_METRICS_PATH | aggregate_metrics.json | Where aggregate stats are written. |
| DASHBOARD_PATH | dashboard.html | Output path for the HTML dashboard. |
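Putting these together, a typical .env for a 50-sample HotPotQA run might look like the sketch below. All values come from the tables above; note that pydantic-settings parses list-valued variables such as EVALUATION_METRICS as JSON.

```shell
# Corpus builder
BENCHMARK=HotPotQA
NUMBER_OF_SAMPLES_IN_CORPUS=50
BUILDING_CORPUS_FROM_SCRATCH=True

# Answer generation
QA_ENGINE=cognee_graph_completion

# Evaluation (list values are parsed as JSON)
EVALUATION_ENGINE=DeepEval
EVALUATION_METRICS=["EM", "f1", "correctness"]
DEEPEVAL_MODEL=gpt-4o-mini

# Dashboard
DASHBOARD=True
DASHBOARD_PATH=dashboard.html
```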
| Benchmark | Adapter key | Description |
| --- | --- | --- |
| HotPotQA | HotPotQA | ~90 K multi-hop Q&A pairs from CMU; includes supporting-fact indices. |
| MuSiQue | Musique | Multi-step reasoning with question decompositions (Google Drive, JSONL). |
| 2WikiMultiHop | TwoWikiMultiHop | Fact-triplet-style multi-hop QA from HuggingFace. |
| Dummy | Dummy | One hard-coded Q&A pair — useful for smoke-testing the pipeline. |
| Metric key | Type | Description |
| --- | --- | --- |
| EM | String | Exact Match — 1 if the generated answer exactly equals the golden answer (case-insensitive, whitespace-normalized). |
| f1 | String | Token-level F1 — precision/recall over word tokens between generated and golden answer. |
| correctness | LLM | GEval correctness via DeepEval (uses DEEPEVAL_MODEL). |
| contextual_relevancy | LLM | DeepEval’s ContextualRelevancyMetric — how relevant the retrieved context is to the question. |
| context_coverage | LLM | Custom metric — fraction of the golden context covered by the retrieved context. |
All metrics return a {"score": float, "reason": str} dict. Aggregate statistics include the mean and a 95% confidence interval computed with 10,000 bootstrap samples.
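The percentile bootstrap behind those confidence intervals can be sketched in a few lines. This is a simplified stand-in for the framework's implementation, not its actual code:

```python
import random
from statistics import mean

def bootstrap_ci(scores, n_resamples=10_000, alpha=0.05, seed=42):
    """Percentile-bootstrap confidence interval for the mean.

    Resamples the scores with replacement n_resamples times and
    takes the alpha/2 and 1 - alpha/2 quantiles of the resampled means.
    """
    rng = random.Random(seed)
    means = sorted(
        mean(rng.choices(scores, k=len(scores))) for _ in range(n_resamples)
    )
    lo = means[int(alpha / 2 * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return mean(scores), (lo, hi)

scores = [0.2, 0.5, 0.8, 1.0, 0.4, 0.6]
m, (lo, hi) = bootstrap_ci(scores)
```

With only six scores the interval is wide; in a real run each metric contributes one score per evaluated question.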
The TASK_GETTER_TYPE variable controls how each corpus document is processed during the cognify stage:
| Strategy | Description |
| --- | --- |
| Default | Full pipeline: classify → chunk → extract graph → summarize → add data points. |
| CascadeGraph | Same as Default but processes documents in batches of 10. |
| NoSummaries | Skips the summary step; applies an ontology during graph extraction. |
| JustChunks | Minimal pipeline: classify → chunk → add data points (no graph). |
| Engine key | Description |
| --- | --- |
| cognee_graph_completion | Graph traversal followed by LLM completion. |
| cognee_graph_completion_cot | Chain-of-thought reasoning over the graph. |
| cognee_graph_completion_context_extension | Graph traversal with extended context retrieval. |
| cognee_completion | Direct LLM completion without graph traversal. |
| graph_summary_completion | Uses pre-computed graph summaries for retrieval. |