Cognee includes a built-in evaluation framework in `cognee/eval_framework/` that lets you benchmark retrieval quality on standard multi-hop QA datasets, compare different search strategies, and inspect results in an interactive HTML dashboard — all without any third-party evaluation service.
This page covers the built-in evaluation framework. For the DeepEval integration (LLM-as-a-judge scoring), see Evaluation with DeepEval.
Overview
The pipeline has four sequential stages:

| Stage | What it does |
|---|---|
| Corpus Builder | Downloads a benchmark dataset, ingests the corpus into Cognee (add → cognify), and persists Q&A pairs to disk. |
| Answer Generation | Runs each question through your chosen retriever and records the generated answer alongside the golden answer. |
| Evaluation | Scores every answer with the metrics you select (EM, F1, correctness, contextual relevancy, context coverage). |
| Dashboard | Produces a standalone dashboard.html with per-metric histograms, 95% confidence-interval bar charts, and a details table. |
Quick Start
Before you start, complete the Quickstart, make sure you already have data processed with `cognify`, and set `LLM_API_KEY`.
Basic Usage
This minimal example runs the built-in evaluation pipeline and writes an HTML dashboard, `my_dashboard.html`, to the current directory.
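A minimal sketch of such a run, using the `EvalConfig` and `main` names described below (the import paths and `EvalConfig` field names are assumptions — check your installed cognee version for the exact module layout):

```python
import asyncio

# NOTE: import paths are assumptions; adjust to your cognee version.
from cognee.eval_framework.eval_config import EvalConfig
from cognee.eval_framework.run_eval import main

config = EvalConfig(
    benchmark="Dummy",                  # smoke-test dataset, one Q&A pair
    number_of_samples_in_corpus=1,
    qa_engine="cognee_graph_completion",
    evaluation_engine="DeepEval",
    evaluation_metrics=["correctness", "EM", "f1"],
    dashboard_path="my_dashboard.html",
)

# Runs corpus building, answer generation, evaluation, and dashboard
# generation in one flow.
asyncio.run(main(params=config.to_dict()))
```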
What just happened
- Benchmark setup — `EvalConfig` defines which benchmark to use, how many samples to ingest, and which retrieval and evaluation engines to run.
- Full evaluation run — `main(params=config.to_dict())` executes corpus building, answer generation, evaluation, and dashboard generation in one flow.
- Output artifacts — the run produces metrics files and an HTML dashboard you can open locally to inspect results.

Open `my_dashboard.html` in any browser. You will see:
- Distribution histograms — score distributions per metric (10 bins).
- Confidence-interval bar chart — mean score ± 95% CI for each metric.
- Details table — per-question breakdown with generated answer, golden answer, retrieved context, score, and LLM rationale.
Distributed Execution with Modal
For large-scale evaluations, the framework can run inside Modal containers with persistent volume storage. A Dockerfile is included at `cognee/eval_framework/Dockerfile` for containerized deployments.
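A minimal sketch of what a Modal entrypoint could look like — the app and volume names and the function body are illustrative assumptions, not the framework's actual Modal script:

```python
import modal

# Illustrative sketch: names and body are assumptions.
app = modal.App("cognee-eval")
results = modal.Volume.from_name("cognee-eval-results", create_if_missing=True)
image = modal.Image.from_dockerfile("cognee/eval_framework/Dockerfile")

@app.function(image=image, volumes={"/results": results}, timeout=3600)
def run_eval(params: dict):
    # Run the evaluation pipeline inside the container and write all
    # artifacts (answers, metrics, dashboard) under /results so they
    # persist across container runs.
    ...
```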
Running Individual Stages
Each stage exposes a standalone async function you can call from your own scripts:
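For example, a script that runs the first three stages in sequence — the module and function names below are assumptions based on the stage names above; check `cognee/eval_framework/` for the exact entrypoints in your installed version:

```python
import asyncio

# Hypothetical stage entrypoints; adjust imports to your cognee version.
from cognee.eval_framework.corpus_builder.run_corpus_builder import run_corpus_builder
from cognee.eval_framework.answer_generation.run_question_answering_module import run_question_answering
from cognee.eval_framework.evaluation.run_evaluation_module import run_evaluation

async def run_stages(params: dict) -> None:
    await run_corpus_builder(params)      # add -> cognify, persist Q&A pairs
    await run_question_answering(params)  # answer each question with the chosen retriever
    await run_evaluation(params)          # score answers, write metrics files

asyncio.run(run_stages({"benchmark": "Dummy", "number_of_samples_in_corpus": 1}))
```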
Filtering Benchmark Instances
You can restrict which instances are evaluated using `INSTANCE_FILTER`:
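For example, in your `.env` file — note that the accepted value format (instance IDs vs. numeric indices) is an assumption here, and the IDs shown are placeholders; check `EvalConfig` for the exact field type:

```
# Evaluate only these benchmark instances (placeholder IDs;
# the expected format is an assumption)
INSTANCE_FILTER=["5a8b57f25542995d1e6f1371", "5a8c7595554299585d9e36b6"]
```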
Comparing Search Strategies
A typical workflow for comparing two retrieval strategies is to run the full pipeline twice with a different `QA_ENGINE` each time, then open both HTML files side by side to compare F1 and exact-match scores across retrieval strategies.
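The comparison loop might look like this — import paths and `EvalConfig` field names are assumptions; adjust to your installed version:

```python
import asyncio

# Assumed import paths; check your cognee version.
from cognee.eval_framework.eval_config import EvalConfig
from cognee.eval_framework.run_eval import main

# Run the same benchmark through two retrievers, one dashboard each.
for engine in ("cognee_completion", "cognee_graph_completion"):
    config = EvalConfig(
        benchmark="HotPotQA",
        number_of_samples_in_corpus=20,
        qa_engine=engine,
        dashboard_path=f"dashboard_{engine}.html",
    )
    asyncio.run(main(params=config.to_dict()))
```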
Further details
All options are read from an `.env` file (or environment variables) via a Pydantic `BaseSettings` class (`EvalConfig`).
Corpus Builder
| Variable | Default | Description |
|---|---|---|
| `BENCHMARK` | `Dummy` | Which dataset to load (`Dummy`, `HotPotQA`, `Musique`, `TwoWikiMultiHop`). |
| `NUMBER_OF_SAMPLES_IN_CORPUS` | `1` | How many corpus paragraphs to ingest. |
| `BUILDING_CORPUS_FROM_SCRATCH` | `True` | Re-ingest the corpus on every run, or reuse an existing Cognee index. |
| `TASK_GETTER_TYPE` | `Default` | Cognify pipeline variant described in the Pipeline Strategies section below. |
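For example, a `.env` fragment that ingests 20 HotPotQA samples from scratch with the full cognify pipeline:

```
BENCHMARK=HotPotQA
NUMBER_OF_SAMPLES_IN_CORPUS=20
BUILDING_CORPUS_FROM_SCRATCH=True
TASK_GETTER_TYPE=Default
```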
Answer Generation
| Variable | Default | Description |
|---|---|---|
| `ANSWERING_QUESTIONS` | `True` | Run the QA stage. |
| `QA_ENGINE` | `cognee_graph_completion` | Which retriever to use, as described in the QA Engines section below. |
| `QUESTIONS_PATH` | `questions_output.json` | Where to save generated answers. |
Evaluation
| Variable | Default | Description |
|---|---|---|
| `EVALUATING_ANSWERS` | `True` | Run the metrics stage. |
| `EVALUATING_CONTEXTS` | `True` | Also compute `contextual_relevancy` and `context_coverage`. |
| `EVALUATION_ENGINE` | `DeepEval` | `DeepEval` or `DirectLLM`. |
| `EVALUATION_METRICS` | `["correctness", "EM", "f1"]` | Any combination of the five metrics. |
| `DEEPEVAL_MODEL` | `gpt-4o-mini` | LLM used by DeepEval for `correctness` and `contextual_relevancy`. |
| `ANSWERS_PATH` | `answers_output.json` | Path to the answers file produced by the QA stage. |
| `METRICS_PATH` | `metrics_output.json` | Where per-sample metric results are written. |
Dashboard
| Variable | Default | Description |
|---|---|---|
| `CALCULATE_METRICS` | `True` | Compute aggregate statistics (mean, 95% CI). |
| `DASHBOARD` | `True` | Generate the HTML report. |
| `AGGREGATE_METRICS_PATH` | `aggregate_metrics.json` | Where aggregate stats are written. |
| `DASHBOARD_PATH` | `dashboard.html` | Output path for the HTML dashboard. |
Supported Benchmarks
| Benchmark | Adapter key | Description |
|---|---|---|
| HotPotQA | HotPotQA | ~90 K multi-hop Q&A pairs from CMU; includes supporting-fact indices. |
| MuSiQue | Musique | Multi-step reasoning with question decompositions (Google Drive, JSONL). |
| 2WikiMultiHop | TwoWikiMultiHop | Fact-triplet-style multi-hop QA from HuggingFace. |
| Dummy | Dummy | One hard-coded Q&A pair — useful for smoke-testing the pipeline. |
Available Metrics
| Metric key | Type | Description |
|---|---|---|
| `EM` | String | Exact Match — 1 if the generated answer exactly equals the golden answer (case-insensitive, whitespace-normalized). |
| `f1` | String | Token-level F1 — precision/recall over word tokens between generated and golden answer. |
| `correctness` | LLM | GEval correctness via DeepEval (uses `DEEPEVAL_MODEL`). |
| `contextual_relevancy` | LLM | DeepEval's `ContextualRelevancyMetric` — how relevant the retrieved context is to the question. |
| `context_coverage` | LLM | Custom metric — fraction of the golden context covered by the retrieved context. |
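The two string metrics can be sketched in plain Python. This is a simplified illustration, not the framework's exact implementation (SQuAD-style normalization usually also strips punctuation and articles):

```python
import re
from collections import Counter

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace (simplified normalization)."""
    return re.sub(r"\s+", " ", text.lower()).strip()

def exact_match(generated: str, golden: str) -> int:
    """1 if the normalized strings are identical, else 0."""
    return int(normalize(generated) == normalize(golden))

def token_f1(generated: str, golden: str) -> float:
    """Token-level F1: harmonic mean of precision and recall over words."""
    gen_tokens = normalize(generated).split()
    gold_tokens = normalize(golden).split()
    common = Counter(gen_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(gen_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

For example, `token_f1("the capital is Paris", "Paris")` yields 0.4 (precision 1/4, recall 1/1).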
Each per-sample result is written as a `{"score": float, "reason": str}` dict. Aggregate statistics include mean and a 95% confidence interval computed with 10,000 bootstrap samples.
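A percentile-bootstrap confidence interval for the mean can be sketched as follows; this illustrates the aggregation idea, and the framework's exact implementation may differ:

```python
import random
import statistics

def bootstrap_ci(scores, n_resamples=10_000, alpha=0.05, seed=0):
    """Mean plus a percentile-bootstrap (1 - alpha) CI for the mean."""
    rng = random.Random(seed)
    n = len(scores)
    # Resample with replacement, recompute the mean each time, then
    # take the alpha/2 and 1 - alpha/2 percentiles of those means.
    means = sorted(
        statistics.fmean(rng.choices(scores, k=n)) for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return statistics.fmean(scores), (lo, hi)
```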
Pipeline Strategies
The `TASK_GETTER_TYPE` variable controls how each corpus document is processed during the cognify stage:

| Strategy | Description |
|---|---|
| `Default` | Full pipeline: classify → chunk → extract graph → summarize → add data points. |
| `CascadeGraph` | Same as `Default` but processes documents in batches of 10. |
| `NoSummaries` | Skips the summary step; applies an ontology during graph extraction. |
| `JustChunks` | Minimal pipeline: classify → chunk → add data points (no graph). |
QA Engines
| Engine key | Description |
|---|---|
| `cognee_graph_completion` | Graph traversal followed by LLM completion. |
| `cognee_graph_completion_cot` | Chain-of-thought reasoning over the graph. |
| `cognee_graph_completion_context_extension` | Graph traversal with extended context retrieval. |
| `cognee_completion` | Direct LLM completion without graph traversal. |
| `graph_summary_completion` | Uses pre-computed graph summaries for retrieval. |