
# Cognify

> Transform ingested data into a knowledge graph.

<Note>
  `cognify()` is a legacy operation. In Cognee v1.0, most users should use [remember()](/core-concepts/main-operations/remember) instead, which replaces the `add()` + `cognify()` + `memify()` workflow with a single call.
</Note>

## What is the cognify operation

The `.cognify` operation takes the data ingested with [Add](/core-concepts/main-operations/legacy-operations/add) and turns plain text into structured knowledge: chunks, embeddings, summaries, nodes, and edges that live in Cognee's vector and graph stores. It prepares your data for downstream operations like [Search](/core-concepts/main-operations/legacy-operations/search).

* **Transforms ingested data**: builds chunks, embeddings, and summaries
* **Graph creation**: extracts entities and relationships to form a knowledge graph
* **Vector indexing**: makes everything searchable via embeddings
* **Dataset-scoped**: runs per dataset, respecting ownership and permissions

<Note>
  `.cognify` can be run multiple times as the dataset grows, and Cognee will skip what's already processed. Read more under **Incremental loading and deduplication** in **[Examples and details](#examples-and-details)**.
</Note>

## What happens under the hood

The `.cognify` pipeline is made of six ordered [Tasks](/core-concepts/building-blocks/tasks). Each task takes the output of the previous one and moves your data closer to becoming a searchable knowledge graph.

1. **Classify documents** — wrap each ingested file as a `Document` object with metadata and optional node sets
2. **Check permissions** — enforce that you have write access to the target dataset
3. **Extract chunks** — split documents into smaller pieces (paragraphs, sections)
4. **Extract graph** — use LLMs to identify entities and relationships, inserting them into the graph DB
5. **Summarize text** — generate summaries for each chunk, stored as `TextSummary` [DataPoints](/core-concepts/building-blocks/datapoints)
6. **Add data points** — embed nodes and summaries, write them into the vector store, and update graph edges

The result is a fully searchable, structured knowledge graph connected to your data.
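
The chaining idea behind the pipeline can be sketched in plain Python. This is a toy illustration, not Cognee's task runner: `run_pipeline` and the three stand-in tasks are hypothetical, and the blank-line split only approximates what a real chunker does.

```python
from typing import Any, Callable, Sequence

Task = Callable[[Any], Any]


def run_pipeline(tasks: Sequence[Task], data: Any) -> Any:
    """Feed the output of each task into the next one, in order."""
    for task in tasks:
        data = task(data)
    return data


# Toy stand-ins (hypothetical, not Cognee's real tasks).
def classify_documents(texts: list[str]) -> list[dict]:
    # Wrap each raw text as a document-like dict with an id.
    return [{"id": i, "text": t} for i, t in enumerate(texts)]


def extract_chunks(docs: list[dict]) -> list[dict]:
    # Split on blank lines as a crude replacement for a real chunker.
    return [
        {"doc_id": d["id"], "text": part.strip()}
        for d in docs
        for part in d["text"].split("\n\n")
        if part.strip()
    ]


def count_chunks_per_doc(chunks: list[dict]) -> dict[int, int]:
    # Final toy step: report how many chunks each document produced.
    counts: dict[int, int] = {}
    for c in chunks:
        counts[c["doc_id"]] = counts.get(c["doc_id"], 0) + 1
    return counts


result = run_pipeline(
    [classify_documents, extract_chunks, count_chunks_per_doc],
    ["Alice works at Acme.\n\nAcme makes rockets.", "Bob lives in Paris."],
)
print(result)
```

Because every task only sees the previous task's output, the order of the list is the order of the pipeline, which is why the six steps above are described as ordered.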

## After cognify finishes

When `.cognify` completes for a dataset:

* **DocumentChunks** exist in memory as the granular breakdown of your files
* **Summaries** are stored and indexed in the vector database for semantic search
* **Knowledge graph nodes and edges** are committed to the graph database
* **Dataset metadata** is updated with token counts and pipeline status
* Your dataset is now **query-ready**: you can run [Search](/core-concepts/main-operations/legacy-operations/search) or graph queries immediately

<Note>
  Because `cognify()` calls the LLM for entity extraction and summarization, it can fail when the configured LLM provider (or LiteLLM proxy) reports that its token budget is exhausted. In that case it raises `LLMPaymentRequiredError`, which the API surfaces as **HTTP 402 (Payment Required)** with body `{"error": "Token budget exhausted", "detail": "..."}`. This error is **terminal** — Cognee does not retry budget-exhaustion failures — so treat a `402` as final for the request and prompt the user to top up their token budget rather than reissuing the call.
</Note>
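
A client calling the HTTP API could encode that rule roughly as follows. This is a minimal sketch: `classify_response` and `CognifyOutcome` are hypothetical helpers, not part of Cognee, and the choice to retry only on 5xx responses is an assumption made for the example.

```python
from dataclasses import dataclass


@dataclass
class CognifyOutcome:
    ok: bool
    retryable: bool
    message: str


def classify_response(status_code: int, body: dict) -> CognifyOutcome:
    """Decide what a caller should do with a cognify HTTP response."""
    if 200 <= status_code < 300:
        return CognifyOutcome(True, False, "cognify accepted")
    if status_code == 402:
        # Budget exhaustion is terminal: surface it to the user, never retry.
        error = body.get("error", "Payment required")
        detail = body.get("detail", "")
        return CognifyOutcome(False, False, f"{error}: {detail}")
    # Assumed policy for every other failure: retry only on server errors.
    return CognifyOutcome(False, status_code >= 500, f"HTTP {status_code}")


outcome = classify_response(
    402, {"error": "Token budget exhausted", "detail": "..."}
)
print(outcome)
```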

## Examples and details

<Accordion title="Pipeline tasks (detailed)">
  1. **Classify documents**
     * Turns raw `Data` rows into `Document` objects
     * Chooses the right document type (PDF, text, image, audio, etc.)
     * Attaches metadata and optional node sets

  2. **Check permissions**
     * Verifies that the user has write access to the dataset

  3. **Extract chunks**
     * Splits documents into `DocumentChunk`s using a chunker
     * You can customize the chunk size and strategy — see [Chunkers](/core-concepts/further-concepts/chunkers) for details
     * Updates token counts in the relational DB

  4. **Extract graph**
     * Calls the LLM to extract entities and relationships
     * Deduplicates nodes and edges, commits to the graph DB

  5. **Summarize text**
     * Generates concise summaries per chunk
     * Stores them as `TextSummary` [DataPoints](/core-concepts/building-blocks/datapoints) for vector search

  6. **Add data points**
     * Converts summaries and other [DataPoints](/core-concepts/building-blocks/datapoints) into graph + vector nodes
     * Embeds them in the vector store, persists in the graph DB
</Accordion>

<Accordion title="Default extraction prompts">
  Cognee ships with several built-in system prompts for entity and relationship extraction, stored in `cognee/infrastructure/llm/prompts/`. The active prompt is controlled by the `GRAPH_PROMPT_PATH` environment variable (default: `generate_graph_prompt.txt`).

  | Prompt file                        | Use case                    | What it does                                                                                                                                                                          |
  | ---------------------------------- | --------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
  | `generate_graph_prompt.txt`        | Default balanced extraction | Extracts entities and relationships using the standard Cognee rules: basic node types, human-readable IDs, normalized dates, `snake_case` relationships, and coreference consistency. |
  | `generate_graph_prompt_simple.txt` | Lightweight extraction      | Uses a shorter, more compact rule set for straightforward graph extraction while keeping the same core conventions around node types, IDs, dates, and relationship naming.            |
  | `generate_graph_prompt_strict.txt` | Tighter schema control      | Applies a more explicit prompt with named node categories, stronger relationship constraints, examples, and a strict instruction not to infer facts that are not present in the text. |
  | `generate_graph_prompt_guided.txt` | More directed graph shaping | Adds guidance for edge direction, allows multi-word entity labels, and encourages logically implied facts when they improve graph clarity without repeating the same fact.            |

  To switch to a different built-in prompt, set the environment variable:

  ```bash theme={null}
  GRAPH_PROMPT_PATH=generate_graph_prompt_strict.txt
  ```

  Or configure it at runtime via `cognee.config`:

  ```python theme={null}
  import cognee

  cognee.config.llm_config.graph_prompt_path = "generate_graph_prompt_strict.txt"
  ```

  <Note>
    If you need to use a custom prompt, refer to our [Custom Prompts guide](/guides/custom-prompts)
  </Note>
</Accordion>

<Accordion title="Datasets and permissions">
  * Cognify always runs on a dataset
  * You must have **write access** to the target dataset
  * Permissions are enforced at pipeline start
  * Each dataset maintains its own cognify status and token counts
</Accordion>

<Accordion title="Incremental loading and deduplication">
  `incremental_loading=True` is the default for both `cognee.add()` and `cognee.cognify()`. Together they give you two layers of deduplication:

  **Layer 1 — content-hash deduplication in `add()`**

  Before `cognify()` runs, `add()` already deduplicates by content hash. Re-adding unchanged content is skipped at ingestion time, while changed content updates the record and marks it for reprocessing.

  For the full behavior and scenario table, see [Hash-based deduplication](/core-concepts/main-operations/add#hash-based-file-storage-deduplication-and-filename-collisions) on the Add page.

  **Layer 2 — pipeline-status tracking in `cognify()`**

  Before processing each data item, `cognify()` checks a `pipeline_status` field on the record. If the status for the `cognify_pipeline` in the current dataset is already `COMPLETED`, the item is skipped entirely — no LLM calls, no re-embedding, no graph writes.

  | Scenario                                                | What happens                                                   |
  | ------------------------------------------------------- | -------------------------------------------------------------- |
  | Same file added and cognified again (unchanged)         | Skipped — `pipeline_status` is already `COMPLETED`             |
  | File content changes, then `add()` + `cognify()`        | `pipeline_status` is reset by `add()`; `cognify()` reprocesses |
  | New file added to an existing dataset, then `cognify()` | Only the new file is processed; existing ones are skipped      |
  | `incremental_loading=False` passed to `cognify()`       | All items are reprocessed regardless of previous status        |
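
The skip logic in the table can be modeled with a few lines of plain Python. This is an illustrative sketch only: `DataItem`, `items_to_process`, and `mark_completed` are hypothetical names, and the nested-dict layout of `pipeline_status` is an assumption rather than Cognee's actual storage schema.

```python
from dataclasses import dataclass, field

COMPLETED = "COMPLETED"
PIPELINE = "cognify_pipeline"


@dataclass
class DataItem:
    name: str
    # Assumed layout: pipeline name -> dataset id -> status.
    pipeline_status: dict[str, dict[str, str]] = field(default_factory=dict)


def mark_completed(item: DataItem, dataset_id: str) -> None:
    item.pipeline_status.setdefault(PIPELINE, {})[dataset_id] = COMPLETED


def items_to_process(items, dataset_id, incremental_loading=True):
    """Return the items a run would actually process."""
    if not incremental_loading:
        return list(items)  # reprocess everything regardless of status
    return [
        item
        for item in items
        if item.pipeline_status.get(PIPELINE, {}).get(dataset_id) != COMPLETED
    ]


old_item = DataItem("first.txt")
mark_completed(old_item, "my_dataset")  # processed in an earlier run
new_item = DataItem("second.txt")       # added later

incremental = [i.name for i in items_to_process([old_item, new_item], "my_dataset")]
forced = [
    i.name
    for i in items_to_process([old_item, new_item], "my_dataset", incremental_loading=False)
]
print(incremental, forced)
```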

  Common usage patterns:

  <AccordionGroup>
    <Accordion title="Appending new data to an existing dataset">
      You can grow a dataset over time without reprocessing what's already there:

      ```python theme={null}
      import cognee

      # Initial load
      await cognee.add("First document content", dataset_name="my_dataset")
      await cognee.cognify(datasets=["my_dataset"])

      # Later: add more data — existing items are skipped automatically
      await cognee.add("Second document content", dataset_name="my_dataset")
      await cognee.cognify(datasets=["my_dataset"])  # only processes the new document
      ```
    </Accordion>

    <Accordion title="Forcing a full reprocess">
      To reprocess everything regardless of status, pass `incremental_loading=False`:

      ```python theme={null}
      await cognee.cognify(datasets=["my_dataset"], incremental_loading=False)
      ```

      This bypasses the pipeline-status check but does not re-ingest files — use `cognee.datasets.empty_dataset()` first if you also need to clear the stored data.
    </Accordion>
  </AccordionGroup>
</Accordion>

<Accordion title="Batching for faster processing">
  For new workflows, prefer [remember()](/core-concepts/main-operations/remember). It is the current API and accepts these batching controls for permanent-memory ingestion. Use `add()` and `cognify()` directly only when you need lower-level control over ingestion and graph building as separate legacy steps.

  Two batching parameters control how much work Cognee runs at once during ingestion and graph building:

  | Parameter          | Applies to                                                  | What it controls                                                                                                                     | Default                                                                                     |
  | ------------------ | ----------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------ | ------------------------------------------------------------------------------------------- |
  | `data_per_batch`   | `remember()` permanent memory; legacy `add()` / `cognify()` | Maximum number of data items processed concurrently within one dataset pipeline run                                                  | `20`                                                                                        |
  | `chunks_per_batch` | `remember()` permanent memory; legacy `cognify()`           | Number of chunks emitted to each chunk-level Cognify task batch, including graph extraction/summarization and data-point persistence | Default Cognify: `100`; temporal Cognify: `10`; `remember` HTTP endpoint form default: `36` |

  `data_per_batch` is the outer concurrency limit. Cognee schedules the data items in a dataset and uses a semaphore so at most this many items are processed at the same time.

  `chunks_per_batch` is the inner chunk-task batch size. In the default Cognify pipeline, Cognee passes it as `batch_size` to the graph extraction/summarization task and to `add_data_points`. If you do not pass it directly, Cognee checks the `chunks_per_batch` value from `CognifyConfig`; when that is unset, the default Cognify pipeline uses `100`.
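
The interaction between the two limits can be sketched with `asyncio` alone. This is not Cognee's scheduler: `process_item`, `run`, and the `stats` dictionary are hypothetical, and `asyncio.sleep(0)` stands in for the real LLM and database work.

```python
import asyncio


def batched(items, size):
    """Split a list into consecutive batches of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]


async def process_item(chunks, chunks_per_batch, limiter, stats):
    # Outer limit: at most `data_per_batch` items hold the semaphore at once.
    async with limiter:
        stats["active"] += 1
        stats["peak"] = max(stats["peak"], stats["active"])
        try:
            # Inner limit: chunk-level work is handed over in fixed-size batches.
            for batch in batched(chunks, chunks_per_batch):
                await asyncio.sleep(0)  # stand-in for one batch of LLM/DB work
                stats["batch_sizes"].append(len(batch))
        finally:
            stats["active"] -= 1


async def run(items, data_per_batch, chunks_per_batch):
    limiter = asyncio.Semaphore(data_per_batch)
    stats = {"active": 0, "peak": 0, "batch_sizes": []}
    await asyncio.gather(
        *(process_item(chunks, chunks_per_batch, limiter, stats) for chunks in items)
    )
    return stats


# Six toy data items with 25 chunks each.
items = [list(range(25)) for _ in range(6)]
stats = asyncio.run(run(items, data_per_batch=2, chunks_per_batch=10))
print(stats["peak"], sorted(set(stats["batch_sizes"])))
```

Raising `data_per_batch` in this model lets more items overlap, while raising `chunks_per_batch` produces fewer, larger batches per item.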

  **Where to configure them**

  * Pass `data_per_batch` and `chunks_per_batch` to permanent-memory `remember()` for the current API path.
  * Use legacy `add()` / `cognify()` only when you intentionally split ingestion and graph building; `data_per_batch` applies to both, while `chunks_per_batch` applies to `cognify()`.
  * For the legacy `/api/v1/cognify` endpoint, both values are accepted in the JSON request body.
  * For `/api/v1/remember`, `chunks_per_batch` is exposed as a multipart form field. The current remember endpoint does not expose `data_per_batch` as a form field, so tune `data_per_batch` through the Python SDK or the lower-level Cognify API when you need that control.

  **Tuning guidance**

  | Scenario                                     | Suggested starting point                                                                                                                                          |
  | -------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------- |
  | Small local datasets                         | Keep the defaults. Lower values rarely help unless you are debugging provider limits.                                                                             |
  | Large datasets on a larger machine           | Increase gradually, for example `data_per_batch=30` or `50` and `chunks_per_batch=100` or higher, while watching memory, database load, and provider rate limits. |
  | Memory-constrained environments              | Lower both values, for example `data_per_batch=2` to `5` and `chunks_per_batch=10` to `25`, to reduce concurrent model calls and in-memory intermediate results.  |
  | Faster ingestion with many independent files | Increase `data_per_batch` first, because it controls how many data items can move through the pipeline concurrently.                                              |
  | Faster graph extraction for long documents   | Tune `chunks_per_batch`, because it controls chunk-level batches after documents are split.                                                                       |

  Larger batches can improve throughput by keeping the pipeline, model provider, and databases busier, but they also increase memory pressure and can hit LLM, embedding, or database rate limits sooner. The best values depend on document size, chunk count, model/provider limits, embedding batch behavior, graph/vector database capacity, and the CPU/RAM available to the Cognee process.
</Accordion>

<Accordion title="How entity and relationship names are determined">
  During the **Extract graph** step, Cognee asks the LLM to turn each chunk into graph nodes and edges. The names and types in that graph are inferred from your content rather than fixed in advance.

  | Element                 | Fields                                                                                                                                                                             |
  | ----------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
  | **Node (vertex)**       | `id` - unique identifier derived from the entity name; `name` - human-readable label; `type` - semantic category such as `Person` or `Organization`; `description` - short summary |
  | **Edge (relationship)** | `source_node_id`, `target_node_id`, `relationship_name` - a free-text verb phrase such as `works_at` or `produces`                                                                 |

  The extraction prompt instructs the model to:

  * Capture entities, names, nouns, and implied mentions exhaustively
  * Form relationships as `(start_node, relationship_name, end_node)` triplets using explicit and inferred connections
  * Avoid duplicates and overly generic terms

  That means the resulting graph schema emerges from the data you ingest. Different datasets, prompts, or LLMs can produce slightly different node types and relationship names for similar content.

  <Note>
    If you need tighter control over naming, use an OWL ontology or a custom graph model. See [Ontologies](/core-concepts/further-concepts/ontologies) and [Custom Graph Model](/guides/custom-graph-model).
  </Note>
</Accordion>

<Accordion title="Inspect extracted graph schema">
  Once `.cognify` finishes, the extracted node types and relationship names exist in the graph store, so you can inspect the resulting schema.

  ### Python SDK

  Use the graph engine directly to inspect the stored nodes and edges:

  ```python theme={null}
  from cognee.infrastructure.databases.graph import get_graph_engine

  graph_engine = await get_graph_engine()

  # Returns all nodes and edges
  nodes, edges = await graph_engine.get_graph_data()

  # Inspect unique node types
  node_types = {props.get("type") for _, props in nodes if props.get("type")}
  print("Node types:", node_types)

  # Inspect unique relationship names
  relationship_names = {rel_name for _, _, rel_name, _ in edges}
  print("Relationship names:", relationship_names)
  ```

  `get_graph_data()` returns:

  * **Nodes** as `(node_id: str, properties: dict)`
  * **Edges** as `(source_id: str, target_id: str, relationship_name: str, properties: dict)`
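
Given those tuple shapes, a small helper can summarize how often each node type and relationship name occurs. The sketch below runs on hand-written sample data instead of a live graph engine; `summarize_schema` and the sample nodes and edges are hypothetical.

```python
from collections import Counter

# Sample data in the documented shapes (invented for illustration).
nodes = [
    ("alice", {"name": "Alice", "type": "Person"}),
    ("bob", {"name": "Bob", "type": "Person"}),
    ("acme", {"name": "Acme", "type": "Organization"}),
    ("rocket", {"name": "Rocket", "type": "Product"}),
]
edges = [
    ("alice", "acme", "works_at", {}),
    ("bob", "acme", "works_at", {}),
    ("acme", "rocket", "produces", {}),
]


def summarize_schema(nodes, edges):
    """Count nodes per type and edges per relationship name."""
    type_counts = Counter(props.get("type", "unknown") for _, props in nodes)
    relationship_counts = Counter(rel_name for _, _, rel_name, _ in edges)
    return type_counts, relationship_counts


type_counts, relationship_counts = summarize_schema(nodes, edges)
print(type_counts.most_common())
print(relationship_counts.most_common())
```

The same function should work on the `nodes` and `edges` returned by `get_graph_data()`, assuming they follow the shapes listed above.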

  If you only need aggregate information, inspect graph metrics instead:

  ```python theme={null}
  # Core metrics: num_nodes, num_edges, mean_degree, edge_density,
  #               num_connected_components, sizes_of_connected_components
  metrics = await graph_engine.get_graph_metrics()
  print(metrics)

  # Pass include_optional=True to request the optional metrics as well
  detailed_metrics = await graph_engine.get_graph_metrics(include_optional=True)
  print(detailed_metrics)
  ```

  ### HTTP server mode

  When you run the Cognee HTTP server, you can inspect graph data through the dataset graph endpoint:

  ```text theme={null}
  GET /api/v1/datasets/{dataset_id}/graph
  ```

  To explore the same graph visually, use the [Graph Visualization guide](/guides/graph-visualization).
</Accordion>

<Accordion title="Re-cognify after schema changes">
  If you update your data model (e.g., add new entity fields or relationships) and want to reprocess existing data:

  1. **Clear the dataset** first, then re-add and re-cognify:
     ```python theme={null}
     # Clear existing processed data (`my_dataset` is the dataset object you are reprocessing)
     await cognee.datasets.empty_dataset(dataset_id=my_dataset.id)

     # Re-add source files (`source_files` is your original input)
     await cognee.add(source_files, dataset_name="my_dataset")

     # Re-cognify that dataset with the updated schema
     await cognee.cognify(datasets=["my_dataset"])
     ```

  2. **Alternatively, use [Memify](/core-concepts/main-operations/legacy-operations/memify)** for additive enrichment — it runs extraction and enrichment tasks over the existing graph without re-ingesting data. This is useful when you want to add new derived facts without reprocessing from scratch.

  <Warning>
    `.cognify` skips already-processed data by default. Simply re-running `.cognify` on unchanged files will not pick up schema changes. You must delete and re-add the data, or use memify for enrichment.
  </Warning>
</Accordion>

<Accordion title="Final outcome">
  * Vector database contains embeddings for summaries and nodes
  * Graph database contains entities and relationships
  * Relational database tracks token counts and pipeline run status
  * Your dataset is now ready for [Search](/core-concepts/main-operations/legacy-operations/search) (semantic or graph-based)
</Accordion>

<Accordion title="Checking indexing status">
  If you are using the current v1.0 API, see [Remember](/core-concepts/main-operations/remember#checking-indexing-status) for the recommended indexing-status workflow built around `remember()` and `recall()`.

  If you are working directly with legacy `.cognify()` or the MCP `cognify_status` tool, the same dataset status primitives still apply:

  * `cognee.datasets.get_status([dataset_id])`
  * `GET /api/v1/datasets/status?dataset=<dataset-uuid>`
  * `GET /api/v1/activity/pipeline-runs?dataset_id=<dataset-uuid>`

  In MCP mode, `cognify_status(dataset_name="main_dataset")` provides a text summary of recent cognify runs.
</Accordion>

<Accordion title="LLM call count and cost estimation">
  The default cognify pipeline makes **2 LLM calls per chunk**:

  1. **Graph extraction** — identifies entities and relationships from the chunk text
  2. **Summarization** — generates a concise summary of the chunk

  **Estimating total calls**

  The number of chunks depends on your document size and the configured `chunk_size`:

  ```
  chunks          = ceil(document_tokens / chunk_size)
  total_llm_calls = chunks × 2
  ```

  When `chunk_size` is not set explicitly, Cognee auto-calculates it as:

  ```
  chunk_size = min(embedding_model_max_tokens, llm_max_tokens ÷ 2)
  ```

  With typical defaults (e.g., `gpt-4o-mini` + `text-embedding-3-small`) this usually falls in the **1 024 – 8 192 token** range. See [Chunkers](/core-concepts/further-concepts/chunkers) for details.

  **Example estimates** at `chunk_size = 1024`:

  | Document size | Chunks | LLM calls |
  | ------------- | ------ | --------- |
  | 100 tokens    | 1      | 2         |
  | 1 000 tokens  | 1      | 2         |
  | 10 000 tokens | 10     | 20        |
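
The two formulas above translate directly into a small estimator. This is a sketch for back-of-the-envelope planning: `auto_chunk_size` and `estimate_llm_calls` are hypothetical helper names, integer division is assumed for the `÷ 2` step, and the token limits passed to `auto_chunk_size` are example values, not the limits of any specific model.

```python
import math


def auto_chunk_size(embedding_model_max_tokens: int, llm_max_tokens: int) -> int:
    """Mirror the documented default: the smaller of the two limits."""
    return min(embedding_model_max_tokens, llm_max_tokens // 2)


def estimate_llm_calls(document_tokens: int, chunk_size: int, calls_per_chunk: int = 2):
    """Return (chunks, total_llm_calls) for one document."""
    chunks = math.ceil(document_tokens / chunk_size)
    return chunks, chunks * calls_per_chunk


# Reproduce the rows of the table above at chunk_size = 1024.
estimates = {size: estimate_llm_calls(size, 1024) for size in (100, 1_000, 10_000)}
for size, (chunks, calls) in estimates.items():
    print(f"{size} tokens: {chunks} chunks, {calls} LLM calls")

# Example limits only (assumed values for illustration).
example_chunk_size = auto_chunk_size(
    embedding_model_max_tokens=8_191, llm_max_tokens=16_384
)
print(example_chunk_size)
```

Setting `calls_per_chunk=1` models a custom pipeline that skips summarization.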

  **Tips for reducing API usage**

  * **Increase `chunk_size`** — fewer, larger chunks mean fewer calls:
    ```python theme={null}
    await cognee.cognify(chunk_size=4096)
    ```
  * **Skip summarization** — use a [custom pipeline](/guides/custom-tasks-pipelines) that omits the `summarize_text` task, reducing calls to 1 per chunk.
  * **Enable rate limiting** — set `LLM_RATE_LIMIT_ENABLED=true` to avoid bursting your provider quota when processing many chunks in parallel.
</Accordion>

<Accordion title="Concurrent search while cognify is running">
  You can run [`search`](/core-concepts/main-operations/legacy-operations/search) while a `cognify` pipeline is active — there is no global lock that blocks one from the other.

  Cognee's locks are **session-level**: they serialize short read-modify-write operations (such as `update_qa` or `add_feedback`) within the same `(session_id, operation)` pair. They do not apply across `cognify` and `search`.

  **Within a single process**, the default LanceDB vector store uses an asyncio lock per adapter instance to serialize concurrent write coroutines, so interleaved searches and writes within the same worker are safe.

  **Across multiple processes** (e.g., two workers sharing the same LanceDB data directory), concurrent writes can produce errors like:

  ```
  CommitConflict: Too many concurrent writers
  ```

  This is LanceDB's own multi-process commit conflict detection. To reduce these conflicts, enable the dataset queue:

  ```bash theme={null}
  DATASET_QUEUE_ENABLED=true
  # Optional: tune the concurrency limit (defaults to DATABASE_MAX_LRU_CACHE_SIZE=128)
  DATASET_QUEUE_MAX_CONCURRENT=4
  ```

  When enabled, the queue limits how many dataset-level operations run at the same time, serializing writes to the same dataset and reducing commit conflicts in shared LanceDB setups.
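
The general pattern (a global concurrency cap combined with one lock per dataset) can be sketched as follows. This is a conceptual model, not Cognee's dataset queue: the `DatasetQueue` class here is hypothetical and runs inside a single process, so it illustrates the serialization idea and is not a cross-process fix.

```python
import asyncio
from collections import defaultdict


class DatasetQueue:
    """Toy queue: global concurrency cap plus one lock per dataset."""

    def __init__(self, max_concurrent: int):
        self._slots = asyncio.Semaphore(max_concurrent)
        self._locks = defaultdict(asyncio.Lock)

    async def run(self, dataset_id, operation):
        # Take the dataset lock first so waiting writers do not occupy a slot.
        async with self._locks[dataset_id]:
            async with self._slots:
                return await operation()


async def main():
    queue = DatasetQueue(max_concurrent=2)
    active = defaultdict(int)
    observed = []

    async def write(dataset_id):
        active[dataset_id] += 1
        observed.append(active[dataset_id])  # writers currently in this dataset
        await asyncio.sleep(0)               # stand-in for a vector-store write
        active[dataset_id] -= 1

    jobs = [
        queue.run(ds, lambda ds=ds: write(ds))
        for ds in ["a", "a", "b", "a", "b"]
    ]
    await asyncio.gather(*jobs)
    return max(observed)


max_writers_per_dataset = asyncio.run(main())
print(max_writers_per_dataset)
```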

  <Note>
    For single-process deployments (the default), concurrent search during cognify works without any special configuration.
  </Note>
</Accordion>

<Columns cols={3}>
  <Card title="Add" icon="plus" href="/core-concepts/main-operations/legacy-operations/add">
    First bring data into Cognee
  </Card>

  <Card title="Search" icon="search" href="/core-concepts/main-operations/legacy-operations/search">
    Query embeddings or graph structures built by Cognify
  </Card>

  <Card title="Memify" icon="sparkles" href="/core-concepts/main-operations/legacy-operations/memify">
    Enrich your graph with derived facts after cognify
  </Card>
</Columns>