> ## Documentation Index
> Fetch the complete documentation index at: https://docs.cognee.ai/llms.txt
> Use this file to discover all available pages before exploring further.

# Built-in Evaluation Framework

> Benchmark Cognee retrieval with the built-in evaluation framework.

Cognee ships a self-contained evaluation framework at `cognee/eval_framework/` that lets you benchmark retrieval quality on standard multi-hop QA datasets, compare different search strategies, and inspect results in an interactive HTML dashboard — all without any third-party evaluation service.

<Note>
  This page covers the **built-in** evaluation framework. For the DeepEval integration (LLM-as-a-judge scoring), see [Evaluation with DeepEval](/integrations/deepeval-integration).
</Note>

## Overview

The pipeline has four sequential stages:

```
Corpus Builder → Answer Generation → Evaluation → Dashboard
```

| Stage                 | What it does                                                                                                                |
| --------------------- | --------------------------------------------------------------------------------------------------------------------------- |
| **Corpus Builder**    | Downloads a benchmark dataset, ingests the corpus into Cognee's processing pipeline, and persists Q\&A pairs to disk.       |
| **Answer Generation** | Runs each question through your chosen retriever and records the generated answer alongside the golden answer.              |
| **Evaluation**        | Scores every answer with the metrics you select (EM, F1, correctness, contextual relevancy, context coverage).              |
| **Dashboard**         | Produces a standalone `dashboard.html` with per-metric histograms, 95% confidence-interval bar charts, and a details table. |

## Quick Start

Before you start, complete [Quickstart](/getting-started/quickstart), make sure your corpus is already processed and indexed or let the corpus builder create it from scratch, and set `LLM_API_KEY`.

### Basic Usage

This minimal example runs the built-in evaluation pipeline and writes an HTML dashboard:

```python theme={null}
import asyncio
from cognee.eval_framework.run_eval import main
from cognee.eval_framework.eval_config import EvalConfig

config = EvalConfig(
    benchmark="HotPotQA",
    number_of_samples_in_corpus=50,
    qa_engine="cognee_graph_completion",
    evaluation_engine="DeepEval",
    evaluation_metrics=["EM", "f1", "correctness"],
    dashboard=True,
    dashboard_path="my_dashboard.html",
)

asyncio.run(main(params=config.to_dict()))
```

This runs all four stages in sequence and writes `my_dashboard.html` to the current directory.

### What just happened

* **Benchmark setup** — `EvalConfig` defines which benchmark to use, how many samples to ingest, and which retrieval and evaluation engines to run.
* **Full evaluation run** — `main(params=config.to_dict())` executes corpus building, answer generation, evaluation, and dashboard generation in one flow.
* **Output artifacts** — the run produces metrics files and an HTML dashboard you can open locally to inspect results.

Open `my_dashboard.html` in any browser. You will see:

* **Distribution histograms** — score distributions per metric (10 bins).
* **Confidence-interval bar chart** — mean score ± 95% CI for each metric.
* **Details table** — per-question breakdown with generated answer, golden answer, retrieved context, score, and LLM rationale.

***

## Distributed Execution with Modal

For large-scale evaluations, the framework can run inside [Modal](https://modal.com) containers with persistent volume storage:

```bash theme={null}
modal run cognee/eval_framework/modal_run_eval.py
```

A Streamlit dashboard server is available for viewing aggregated results across multiple Modal runs:

```bash theme={null}
modal serve cognee/eval_framework/modal_eval_dashboard.py
```

A `Dockerfile` is included at `cognee/eval_framework/Dockerfile` for containerized deployments.

***

## Token Usage Analysis

The framework ships a standalone CLI utility at `cognee/eval_framework/token_usage_analysis/` that estimates the token cost of **Cognee persistent memory** versus **full-context prompting**. It chunks an input text, measures the real ingestion token usage of running a few representative chunks through Cognee, and reports the break-even query count — after how many repeated queries full-context prompting has spent more tokens than Cognee's one-time ingestion plus per-query recall. It can optionally write cumulative-cost plots.

Install the eval dependencies and run the tool from its own directory:

```bash theme={null}
uv sync --dev --all-extras
cd cognee/eval_framework/token_usage_analysis

# one representative file (the input is treated as the corpus)
uv run python analyze.py --file data/wikipedia_article.txt --plot

# a folder of .txt files, pooled then sampled
uv run python analyze.py --dir some_corpus/ --out report.json

# a single representative chunk (corpus size must be given explicitly)
uv run python analyze.py --text "$(cat one_chunk.txt)" --corpus-tokens 200000
```

The script loads the repo-root `.env`, so it must contain a working `LLM_PROVIDER`, `LLM_MODEL`, and API key. If `--llm-models` is omitted, the configured `LLM_MODEL` is used. It always writes a JSON report; `--plot` additionally writes the cumulative-cost figure.

You must supply exactly one input form (`--file`, `--dir`, or `--text`). Key options:

| Flag                          | Default                   | Meaning                                                 |
| ----------------------------- | ------------------------- | ------------------------------------------------------- |
| `--file` / `--dir` / `--text` | —                         | input form (exactly one, required)                      |
| `--samples`                   | `3`                       | chunks to measure                                       |
| `--max-chunk-size`            | `4095`                    | chunk size (pass `8191` for Cognee's default)           |
| `--llm-models`                | the `.env` model          | comma list; runs each, switching Cognee's config        |
| `--corpus-tokens`             | token count of the input  | corpus size for the comparison (required with `--text`) |
| `--retrieved-context`         | `1118`                    | recall context per query                                |
| `--query-overhead`            | `32`                      | instruction + question tokens per query                 |
| `--reduction-factors`         | `1,2,7,10`                | milestones to report (`1` = parity/break-even)          |
| `--out`                       | `token_usage_report.json` | JSON report output path                                 |
| `--plot` / `--plot-dir`       | off / `.`                 | also write the cross-over figure                        |

<Note>
  `--plot` requires `matplotlib`, which is included in the `evals` extra (installed by `uv sync --dev --all-extras`).
</Note>

For the full cost model, sample corpora, and precomputed results, see the in-repo `cognee/eval_framework/token_usage_analysis/README.md` and the `results/chunk_4095/` and `results/chunk_8191/` folders.

***

## Code

<AccordionGroup>
  <Accordion title="Running Individual Stages">
    Each stage exposes a standalone async function you can call from your own scripts:

    ```python theme={null}
    import asyncio
    from cognee.eval_framework.corpus_builder.run_corpus_builder import run_corpus_builder
    from cognee.eval_framework.answer_generation.run_question_answering_module import run_question_answering
    from cognee.eval_framework.evaluation.run_evaluation_module import run_evaluation
    from cognee.eval_framework.analysis.dashboard_generator import create_dashboard
    from cognee.eval_framework.eval_config import EvalConfig

    async def custom_eval():
        config = EvalConfig().to_dict()

        # Stage 1 – ingest corpus
        await run_corpus_builder(config)

        # Stage 2 – generate answers
        await run_question_answering(config)

        # Stage 3 – score answers
        await run_evaluation(config)

        # Stage 4 – build dashboard
        create_dashboard(
            metrics_path=config["metrics_path"],
            aggregate_metrics_path=config["aggregate_metrics_path"],
            output_file=config["dashboard_path"],
            benchmark=config["benchmark"],
        )

    asyncio.run(custom_eval())
    ```
  </Accordion>

  <Accordion title="Filtering Benchmark Instances">
    You can restrict which instances are evaluated using `INSTANCE_FILTER`:

    ```python theme={null}
    from cognee.eval_framework.corpus_builder.run_corpus_builder import run_corpus_builder

    # By integer indices (0-based)
    await run_corpus_builder(config, instance_filter=[0, 1, 2, 10, 42])

    # By string IDs (HotPotQA uses "_id" keys)
    await run_corpus_builder(config, instance_filter=["5a7a06935542990198eaf050", ...])

    # By path to a JSON file containing a list of IDs or indices
    await run_corpus_builder(config, instance_filter="my_instance_ids.json")
    ```
  </Accordion>

  <Accordion title="Comparing Search Strategies">
    A typical workflow for comparing two retrieval strategies:

    ```python theme={null}
    import asyncio
    from cognee.eval_framework.run_eval import main
    from cognee.eval_framework.eval_config import EvalConfig
    import os

    for engine in ["cognee_graph_completion", "cognee_completion"]:
        os.environ["QA_ENGINE"] = engine
        os.environ["DASHBOARD_PATH"] = f"dashboard_{engine}.html"
        os.environ["METRICS_PATH"] = f"metrics_{engine}.json"
        os.environ["ANSWERS_PATH"] = f"answers_{engine}.json"
        asyncio.run(main())
    ```

    Open both HTML files side-by-side to compare F1 and exact-match scores across retrieval strategies.
  </Accordion>
</AccordionGroup>

***

## Further details

All options are read from an `.env` file (or environment variables) via a Pydantic `BaseSettings` class (`EvalConfig`).

<AccordionGroup>
  <Accordion title="Corpus Builder">
    | Variable                       | Default   | Description                                                                  |
    | ------------------------------ | --------- | ---------------------------------------------------------------------------- |
    | `BENCHMARK`                    | `Dummy`   | Which dataset to load (`Dummy`, `HotPotQA`, `Musique`, `TwoWikiMultiHop`).   |
    | `NUMBER_OF_SAMPLES_IN_CORPUS`  | `1`       | How many corpus paragraphs to ingest.                                        |
    | `BUILDING_CORPUS_FROM_SCRATCH` | `True`    | Re-ingest the corpus on every run, or reuse an existing Cognee index.        |
    | `TASK_GETTER_TYPE`             | `Default` | Cognify pipeline variant described in the Pipeline Strategies section below. |
  </Accordion>

  <Accordion title="Answer Generation">
    | Variable              | Default                   | Description                                                           |
    | --------------------- | ------------------------- | --------------------------------------------------------------------- |
    | `ANSWERING_QUESTIONS` | `True`                    | Run the QA stage.                                                     |
    | `QA_ENGINE`           | `cognee_graph_completion` | Which retriever to use, as described in the QA Engines section below. |
    | `QUESTIONS_PATH`      | `questions_output.json`   | Where to save generated answers.                                      |
  </Accordion>

  <Accordion title="Evaluation">
    | Variable              | Default                       | Description                                                        |
    | --------------------- | ----------------------------- | ------------------------------------------------------------------ |
    | `EVALUATING_ANSWERS`  | `True`                        | Run the metrics stage.                                             |
    | `EVALUATING_CONTEXTS` | `True`                        | Also compute `contextual_relevancy` and `context_coverage`.        |
    | `EVALUATION_ENGINE`   | `DeepEval`                    | `DeepEval` or `DirectLLM`.                                         |
    | `EVALUATION_METRICS`  | `["correctness", "EM", "f1"]` | Any combination of the five metrics.                               |
    | `DEEPEVAL_MODEL`      | `gpt-4o-mini`                 | LLM used by DeepEval for `correctness` and `contextual_relevancy`. |
    | `ANSWERS_PATH`        | `answers_output.json`         | Path to the answers file produced by the QA stage.                 |
    | `METRICS_PATH`        | `metrics_output.json`         | Where per-sample metric results are written.                       |
  </Accordion>

  <Accordion title="Dashboard">
    | Variable                 | Default                  | Description                                  |
    | ------------------------ | ------------------------ | -------------------------------------------- |
    | `CALCULATE_METRICS`      | `True`                   | Compute aggregate statistics (mean, 95% CI). |
    | `DASHBOARD`              | `True`                   | Generate the HTML report.                    |
    | `AGGREGATE_METRICS_PATH` | `aggregate_metrics.json` | Where aggregate stats are written.           |
    | `DASHBOARD_PATH`         | `dashboard.html`         | Output path for the HTML dashboard.          |
  </Accordion>

  <Accordion title="Supported Benchmarks">
    | Benchmark         | Adapter key       | Description                                                              |
    | ----------------- | ----------------- | ------------------------------------------------------------------------ |
    | **HotPotQA**      | `HotPotQA`        | \~90 K multi-hop Q\&A pairs from CMU; includes supporting-fact indices.  |
    | **MuSiQue**       | `Musique`         | Multi-step reasoning with question decompositions (Google Drive, JSONL). |
    | **2WikiMultiHop** | `TwoWikiMultiHop` | Fact-triplet-style multi-hop QA from HuggingFace.                        |
    | **Dummy**         | `Dummy`           | One hard-coded Q\&A pair — useful for smoke-testing the pipeline.        |
  </Accordion>

  <Accordion title="Available Metrics">
    | Metric key             | Type   | Description                                                                                                             |
    | ---------------------- | ------ | ----------------------------------------------------------------------------------------------------------------------- |
    | `EM`                   | String | **Exact Match** — 1 if the generated answer exactly equals the golden answer (case-insensitive, whitespace-normalized). |
    | `f1`                   | String | **Token-level F1** — precision/recall over word tokens between generated and golden answer.                             |
    | `correctness`          | LLM    | **GEval correctness** via DeepEval (uses `DEEPEVAL_MODEL`).                                                             |
    | `contextual_relevancy` | LLM    | DeepEval's `ContextualRelevancyMetric` — how relevant the retrieved context is to the question.                         |
    | `context_coverage`     | LLM    | Custom metric — fraction of the golden context covered by the retrieved context.                                        |

    All metrics return a `{"score": float, "reason": str}` dict. Aggregate statistics include mean and a 95% confidence interval computed with 10 000 bootstrap samples.
  </Accordion>

  <Accordion title="Pipeline Strategies">
    The `TASK_GETTER_TYPE` variable controls how each corpus document is processed during the corpus-building stage:

    | Strategy       | Description                                                                                                        |
    | -------------- | ------------------------------------------------------------------------------------------------------------------ |
    | `Default`      | Full pipeline: classify → chunk → extract graph → summarize → add data points.                                     |
    | `CascadeGraph` | Same as Default but processes documents in batches of 10.                                                          |
    | `NoSummaries`  | Skips the summary step; applies ontology grounding during graph extraction when an ontology file path is provided. |
    | `JustChunks`   | Minimal pipeline: classify → chunk → add data points (no graph).                                                   |
  </Accordion>

  <Accordion title="QA Engines">
    | Engine key                                  | Description                                      |
    | ------------------------------------------- | ------------------------------------------------ |
    | `cognee_graph_completion`                   | Graph traversal followed by LLM completion.      |
    | `cognee_graph_completion_cot`               | Chain-of-thought reasoning over the graph.       |
    | `cognee_graph_completion_context_extension` | Graph traversal with extended context retrieval. |
    | `cognee_completion`                         | Direct LLM completion without graph traversal.   |
    | `graph_summary_completion`                  | Uses pre-computed graph summaries for retrieval. |
  </Accordion>
</AccordionGroup>