Cognee includes a built-in evaluation framework in `cognee/eval_framework/` that lets you benchmark retrieval quality on standard multi-hop QA datasets, compare different search strategies, and inspect results in an interactive HTML dashboard — all without any third-party evaluation service.
This page covers the built-in evaluation framework. For the DeepEval integration (LLM-as-a-judge scoring), see Evaluation with DeepEval.
Overview
The pipeline has four sequential stages:

| Stage | What it does |
|---|---|
| Corpus Builder | Downloads a benchmark dataset, ingests the corpus into Cognee (add → cognify), and persists Q&A pairs to disk. |
| Answer Generation | Runs each question through your chosen retriever and records the generated answer alongside the golden answer. |
| Evaluation | Scores every answer with the metrics you select (EM, F1, correctness, contextual relevancy, context coverage). |
| Dashboard | Produces a standalone dashboard.html with per-metric histograms, 95% confidence-interval bar charts, and a details table. |
Quick Start
Before you start, complete the Quickstart, make sure you already have data processed with `cognify`, and set `LLM_API_KEY`.
Basic Usage
This minimal example runs the built-in evaluation pipeline and writes an HTML dashboard, `my_dashboard.html`, to the current directory.
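A minimal sketch of such a run, using the `EvalConfig` and `main` names described below (the import paths and `EvalConfig` field names are assumptions — check your installed cognee version for the exact module layout):

```python
import asyncio

# NOTE: import paths are assumptions; adjust to your cognee version.
from cognee.eval_framework.eval_config import EvalConfig
from cognee.eval_framework.run_eval import main

config = EvalConfig(
    benchmark="Dummy",                  # smoke-test dataset, one Q&A pair
    number_of_samples_in_corpus=1,
    qa_engine="cognee_graph_completion",
    evaluation_engine="DeepEval",
    evaluation_metrics=["correctness", "EM", "f1"],
    dashboard_path="my_dashboard.html",
)

# Runs corpus building, answer generation, evaluation, and dashboard
# generation in one flow.
asyncio.run(main(params=config.to_dict()))
```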
What just happened
- Benchmark setup — `EvalConfig` defines which benchmark to use, how many samples to ingest, and which retrieval and evaluation engines to run.
- Full evaluation run — `main(params=config.to_dict())` executes corpus building, answer generation, evaluation, and dashboard generation in one flow.
- Output artifacts — the run produces metrics files and an HTML dashboard you can open locally to inspect results.

Open `my_dashboard.html` in any browser. You will see:
- Distribution histograms — score distributions per metric (10 bins).
- Confidence-interval bar chart — mean score ± 95% CI for each metric.
- Details table — per-question breakdown with generated answer, golden answer, retrieved context, score, and LLM rationale.
Distributed Execution with Modal
For large-scale evaluations, the framework can run inside Modal containers with persistent volume storage. A Dockerfile is included at `cognee/eval_framework/Dockerfile` for containerized deployments.
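A minimal sketch of what a Modal entrypoint could look like — the app and volume names and the function body are illustrative assumptions, not the framework's actual Modal script:

```python
import modal

# Illustrative sketch: names and body are assumptions.
app = modal.App("cognee-eval")
results = modal.Volume.from_name("cognee-eval-results", create_if_missing=True)
image = modal.Image.from_dockerfile("cognee/eval_framework/Dockerfile")

@app.function(image=image, volumes={"/results": results}, timeout=3600)
def run_eval(params: dict):
    # Run the evaluation pipeline inside the container and write all
    # artifacts (answers, metrics, dashboard) under /results so they
    # persist across container runs.
    ...
```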
Running Individual Stages
Each stage exposes a standalone async function you can call from your own scripts:
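For example, a script that runs the first three stages in sequence — the module and function names below are assumptions based on the stage names above; check `cognee/eval_framework/` for the exact entrypoints in your installed version:

```python
import asyncio

# Hypothetical stage entrypoints; adjust imports to your cognee version.
from cognee.eval_framework.corpus_builder.run_corpus_builder import run_corpus_builder
from cognee.eval_framework.answer_generation.run_question_answering_module import run_question_answering
from cognee.eval_framework.evaluation.run_evaluation_module import run_evaluation

async def run_stages(params: dict) -> None:
    await run_corpus_builder(params)      # add -> cognify, persist Q&A pairs
    await run_question_answering(params)  # answer each question with the chosen retriever
    await run_evaluation(params)          # score answers, write metrics files

asyncio.run(run_stages({"benchmark": "Dummy", "number_of_samples_in_corpus": 1}))
```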
Filtering Benchmark Instances
You can restrict which instances are evaluated using `INSTANCE_FILTER`:
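For example, in your `.env` file — note that the accepted value format (instance IDs vs. numeric indices) is an assumption here, and the IDs shown are placeholders; check `EvalConfig` for the exact field type:

```
# Evaluate only these benchmark instances (placeholder IDs;
# the expected format is an assumption)
INSTANCE_FILTER=["5a8b57f25542995d1e6f1371", "5a8c7595554299585d9e36b6"]
```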
Comparing Search Strategies
A typical workflow for comparing two retrieval strategies is to run the full pipeline twice with a different `QA_ENGINE` each time, then open both HTML files side by side to compare F1 and exact-match scores across retrieval strategies.
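The comparison loop might look like this — import paths and `EvalConfig` field names are assumptions; adjust to your installed version:

```python
import asyncio

# Assumed import paths; check your cognee version.
from cognee.eval_framework.eval_config import EvalConfig
from cognee.eval_framework.run_eval import main

# Run the same benchmark through two retrievers, one dashboard each.
for engine in ("cognee_completion", "cognee_graph_completion"):
    config = EvalConfig(
        benchmark="HotPotQA",
        number_of_samples_in_corpus=20,
        qa_engine=engine,
        dashboard_path=f"dashboard_{engine}.html",
    )
    asyncio.run(main(params=config.to_dict()))
```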
Further details
All options are read from an `.env` file (or environment variables) via a Pydantic `BaseSettings` class (`EvalConfig`).
Corpus Builder
| Variable | Default | Description |
|---|---|---|
| `BENCHMARK` | `Dummy` | Which dataset to load (`Dummy`, `HotPotQA`, `Musique`, `TwoWikiMultiHop`). |
| `NUMBER_OF_SAMPLES_IN_CORPUS` | `1` | How many corpus paragraphs to ingest. |
| `BUILDING_CORPUS_FROM_SCRATCH` | `True` | Re-ingest the corpus on every run, or reuse an existing Cognee index. |
| `TASK_GETTER_TYPE` | `Default` | Cognify pipeline variant described in the Pipeline Strategies section below. |
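For example, a `.env` fragment that ingests 20 HotPotQA samples from scratch with the full cognify pipeline:

```
BENCHMARK=HotPotQA
NUMBER_OF_SAMPLES_IN_CORPUS=20
BUILDING_CORPUS_FROM_SCRATCH=True
TASK_GETTER_TYPE=Default
```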
Answer Generation
| Variable | Default | Description |
|---|---|---|
| `ANSWERING_QUESTIONS` | `True` | Run the QA stage. |
| `QA_ENGINE` | `cognee_graph_completion` | Which retriever to use, as described in the QA Engines section below. |
| `QUESTIONS_PATH` | `questions_output.json` | Where to save generated answers. |
Evaluation
| Variable | Default | Description |
|---|---|---|
| `EVALUATING_ANSWERS` | `True` | Run the metrics stage. |
| `EVALUATING_CONTEXTS` | `True` | Also compute `contextual_relevancy` and `context_coverage`. |
| `EVALUATION_ENGINE` | `DeepEval` | `DeepEval` or `DirectLLM`. |
| `EVALUATION_METRICS` | `["correctness", "EM", "f1"]` | Any combination of the five metrics. |
| `DEEPEVAL_MODEL` | `gpt-4o-mini` | LLM used by DeepEval for `correctness` and `contextual_relevancy`. |
| `ANSWERS_PATH` | `answers_output.json` | Path to the answers file produced by the QA stage. |
| `METRICS_PATH` | `metrics_output.json` | Where per-sample metric results are written. |
Dashboard
| Variable | Default | Description |
|---|---|---|
| `CALCULATE_METRICS` | `True` | Compute aggregate statistics (mean, 95% CI). |
| `DASHBOARD` | `True` | Generate the HTML report. |
| `AGGREGATE_METRICS_PATH` | `aggregate_metrics.json` | Where aggregate stats are written. |
| `DASHBOARD_PATH` | `dashboard.html` | Output path for the HTML dashboard. |
Supported Benchmarks
| Benchmark | Adapter key | Description |
|---|---|---|
| HotPotQA | HotPotQA | ~90 K multi-hop Q&A pairs from CMU; includes supporting-fact indices. |
| MuSiQue | Musique | Multi-step reasoning with question decompositions (Google Drive, JSONL). |
| 2WikiMultiHop | TwoWikiMultiHop | Fact-triplet-style multi-hop QA from HuggingFace. |
| Dummy | Dummy | One hard-coded Q&A pair — useful for smoke-testing the pipeline. |
Available Metrics
| Metric key | Type | Description |
|---|---|---|
| `EM` | String | Exact Match — 1 if the generated answer exactly equals the golden answer (case-insensitive, whitespace-normalized). |
| `f1` | String | Token-level F1 — precision/recall over word tokens between generated and golden answer. |
| `correctness` | LLM | GEval correctness via DeepEval (uses `DEEPEVAL_MODEL`). |
| `contextual_relevancy` | LLM | DeepEval's `ContextualRelevancyMetric` — how relevant the retrieved context is to the question. |
| `context_coverage` | LLM | Custom metric — fraction of the golden context covered by the retrieved context. |
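The two string metrics can be sketched in plain Python. This is a simplified illustration, not the framework's exact implementation (SQuAD-style normalization usually also strips punctuation and articles):

```python
import re
from collections import Counter

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace (simplified normalization)."""
    return re.sub(r"\s+", " ", text.lower()).strip()

def exact_match(generated: str, golden: str) -> int:
    """1 if the normalized strings are identical, else 0."""
    return int(normalize(generated) == normalize(golden))

def token_f1(generated: str, golden: str) -> float:
    """Token-level F1: harmonic mean of precision and recall over words."""
    gen_tokens = normalize(generated).split()
    gold_tokens = normalize(golden).split()
    common = Counter(gen_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(gen_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

For example, `token_f1("the capital is Paris", "Paris")` yields 0.4 (precision 1/4, recall 1/1).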
Each per-sample result is written as a `{"score": float, "reason": str}` dict. Aggregate statistics include mean and a 95% confidence interval computed with 10,000 bootstrap samples.
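A percentile-bootstrap confidence interval for the mean can be sketched as follows; this illustrates the aggregation idea, and the framework's exact implementation may differ:

```python
import random
import statistics

def bootstrap_ci(scores, n_resamples=10_000, alpha=0.05, seed=0):
    """Mean plus a percentile-bootstrap (1 - alpha) CI for the mean."""
    rng = random.Random(seed)
    n = len(scores)
    # Resample with replacement, recompute the mean each time, then
    # take the alpha/2 and 1 - alpha/2 percentiles of those means.
    means = sorted(
        statistics.fmean(rng.choices(scores, k=n)) for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return statistics.fmean(scores), (lo, hi)
```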
Pipeline Strategies
The `TASK_GETTER_TYPE` variable controls how each corpus document is processed during the cognify stage:

| Strategy | Description |
|---|---|
| `Default` | Full pipeline: classify → chunk → extract graph → summarize → add data points. |
| `CascadeGraph` | Same as `Default` but processes documents in batches of 10. |
| `NoSummaries` | Skips the summary step; applies an ontology during graph extraction. |
| `JustChunks` | Minimal pipeline: classify → chunk → add data points (no graph). |
QA Engines
| Engine key | Description |
|---|---|
| `cognee_graph_completion` | Graph traversal followed by LLM completion. |
| `cognee_graph_completion_cot` | Chain-of-thought reasoning over the graph. |
| `cognee_graph_completion_context_extension` | Graph traversal with extended context retrieval. |
| `cognee_completion` | Direct LLM completion without graph traversal. |
| `graph_summary_completion` | Uses pre-computed graph summaries for retrieval. |