
# cognify()

> Transform raw data into a structured knowledge graph

# cognee.cognify()

```python theme={null}
async def cognify(
    datasets: Union[str, list[str], list[UUID]] = None,
    user: User = None,
    graph_model: BaseModel = KnowledgeGraph,
    chunker = TextChunker,
    chunk_size: int = None,
    chunks_per_batch: int = None,
    config: Config = None,
    vector_db_config: dict = None,
    graph_db_config: dict = None,
    run_in_background: bool = False,
    incremental_loading: bool = True,
    custom_prompt: Optional[str] = None,
    temporal_cognify: bool = False,
    data_per_batch: int = 20,
    llm_config: Optional[LLMConfig] = None,
    embedding_config: Optional[EmbeddingConfig] = None,
    dry_run: bool = False,
)
```

## Description

Transform ingested data into a structured knowledge graph.

This is the core processing step in Cognee that converts raw text and documents
into an intelligent knowledge graph. It analyzes content, extracts entities and
relationships, and creates semantic connections for enhanced search and reasoning.

Prerequisites:

* **LLM\_API\_KEY**: Must be configured (required for entity extraction and graph generation)
* **Data Added**: Must have data previously added via `cognee.add()`
* **Vector Database**: Must be accessible for embeddings storage
* **Graph Database**: Must be accessible for relationship storage

Input Requirements:

* **Datasets**: Must contain data previously added via `cognee.add()`
* **Content Types**: Works with any text-extractable content including:
  * Natural language documents
  * Structured data (CSV, JSON)
  * Code repositories
  * Academic papers and technical documentation
  * Mixed multimedia content (with text extraction)

Processing Pipeline:

1. **Document Classification**: Identifies document types and structures
2. **Text Chunking**: Breaks content into semantically meaningful segments
3. **Entity Extraction**: Identifies key concepts, people, places, organizations
4. **Relationship Detection**: Discovers connections between entities
5. **Graph Construction**: Builds semantic knowledge graph with embeddings
6. **Content Summarization**: Creates hierarchical summaries for navigation

Graph Model Customization:
The `graph_model` parameter allows custom knowledge structures:

* **Default**: General-purpose KnowledgeGraph for any domain
* **Custom Models**: Domain-specific schemas (e.g., scientific papers, code analysis)
* **Ontology Integration**: Pass an ontology resolver via `config` (or set the `ONTOLOGY_FILE_PATH` environment variable) for predefined vocabularies

Args:

* **datasets**: Dataset name(s) or dataset UUID(s) to process. Processes all available data if None.
  * Single dataset: `"my_dataset"`
  * Multiple datasets: `["docs", "research", "reports"]`
  * None: Process all datasets for the user
* **user**: User context for authentication and data access. Uses default if None.
* **graph\_model**: Pydantic model defining the knowledge graph structure. Defaults to KnowledgeGraph for general-purpose processing.
* **chunker**: Text chunking strategy (TextChunker, LangchainChunker). Determines how documents are segmented for processing.
  * TextChunker: Paragraph-based chunking (default, most reliable)
  * LangchainChunker: Recursive character splitting with overlap
* **chunk\_size**: Maximum tokens per chunk. Auto-calculated based on LLM if None.
  * Formula: `min(embedding_max_completion_tokens, llm_max_completion_tokens // 2)`
  * Default limits: \~512-8192 tokens depending on models.
  * Smaller chunks = more granular but potentially fragmented knowledge.
* **chunks\_per\_batch**: Number of chunks to be processed in a single batch in Cognify tasks.
* **vector\_db\_config**: Custom vector database configuration for embeddings storage.
* **graph\_db\_config**: Custom graph database configuration for relationship storage.
* **run\_in\_background**: If True, starts processing asynchronously and returns immediately. If False, waits for completion before returning.
  * Background mode recommended for large datasets (>100MB).
  * Use pipeline\_run\_id from return value to monitor progress.
* **custom\_prompt**: Optional custom prompt string to use for entity extraction and graph generation. If provided, this prompt will be used instead of the default prompts for knowledge graph extraction. The prompt should guide the LLM on how to extract entities and relationships from the text content.
* **dry\_run**: If True, return a stage-level estimate of LLM token usage and rough cost without making LLM calls or writing graph results. The estimate covers all data in the selected dataset(s); an incremental run may process fewer items.
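
The auto-calculated `chunk_size` can be reproduced by hand to sanity-check a configuration. The sketch below only restates the documented formula; the function name and the example limits are illustrative assumptions, not values read from Cognee.

```python
def default_chunk_size(embedding_max_completion_tokens: int,
                       llm_max_completion_tokens: int) -> int:
    """Restate the documented formula for the auto-calculated chunk size.

    Hypothetical helper for illustration; not part of the Cognee API.
    """
    return min(embedding_max_completion_tokens, llm_max_completion_tokens // 2)


# Example limits are made up for illustration.
print(default_chunk_size(8191, 16384))  # 8191: the embedding limit is the smaller bound
print(default_chunk_size(8191, 4096))   # 2048: half of the LLM limit is the smaller bound
```

Passing an explicit `chunk_size` bypasses this calculation, which may be useful when the automatic value yields chunks that are too coarse for a given corpus.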

Returns:
Union\[dict, list\[PipelineRunInfo], DryRunEstimate]:

* **Blocking mode**: Dictionary mapping dataset\_id -> PipelineRunInfo with:
  * Processing status (completed/failed/in\_progress)
  * Extracted entity and relationship counts
  * Processing duration and resource usage
  * Error details if any failures occurred
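* **Background mode**: Returns immediately with `PipelineRunInfo` entries for tracking progress. The Background Execution section documents the same dataset\_id -> PipelineRunInfo mapping as blocking mode, with each entry in status `PipelineRunStarted`
  * Use pipeline\_run\_id to monitor status
  * Check completion via pipeline monitoring APIs
* **Dry-run mode** (`dry_run=True`): A `DryRunEstimate` of LLM token usage and rough cost; no pipeline run is started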

Next Steps:
After successful cognify processing, use search functions to query the knowledge:

```python theme={null}
import cognee
from cognee import SearchType

# Process your data into knowledge graph
await cognee.cognify()

# Query for insights using different search types:

# 1. Natural language completion with graph context
insights = await cognee.search(
    "What are the main themes?",
    query_type=SearchType.GRAPH_COMPLETION
)

# 2. Get entity relationships and connections
relationships = await cognee.search(
    "connections between concepts",
    query_type=SearchType.GRAPH_COMPLETION
)

# 3. Find relevant document chunks
chunks = await cognee.search(
    "specific topic",
    query_type=SearchType.CHUNKS
)
```

Advanced Usage:

```python theme={null}
from typing import List

import cognee
from cognee.infrastructure.engine import DataPoint

# Custom domain model for scientific papers
class ScientificPaper(DataPoint):
    title: str
    authors: List[str]
    methodology: str
    findings: List[str]

await cognee.cognify(
    datasets=["research_papers"],
    graph_model=ScientificPaper,
)

# Ground extraction in an ontology (there is no `ontology_file_path` argument;
# pass a resolver through `config` instead).
from cognee.modules.ontology.rdf_xml.RDFLibOntologyResolver import RDFLibOntologyResolver
from cognee.modules.ontology.ontology_config import Config

config: Config = {
    "ontology_config": {
        "ontology_resolver": RDFLibOntologyResolver(ontology_file="scientific_ontology.owl")
    }
}
await cognee.cognify(datasets=["research_papers"], config=config)

# Background processing for large datasets
run_info = await cognee.cognify(
    datasets=["large_corpus"],
    run_in_background=True
)
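# run_info maps dataset_id -> PipelineRunInfo; keep each pipeline_run_id to check status later
run_ids = [info.pipeline_run_id for info in run_info.values()]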
```

Environment Variables:

Required:

* LLM\_API\_KEY: API key for your LLM provider

Optional (same as add function):

* LLM\_PROVIDER, LLM\_MODEL, VECTOR\_DB\_PROVIDER, GRAPH\_DATABASE\_PROVIDER
* LLM\_RATE\_LIMIT\_ENABLED: Enable rate limiting (default: False)
* LLM\_RATE\_LIMIT\_REQUESTS: Max requests per interval (default: 60)
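
A quick pre-flight check can catch a missing required variable before any processing starts. This is a minimal sketch, assuming the key is supplied through the process environment; the helper name is hypothetical, and settings loaded from other sources (such as a `.env` file read later) would not be visible to it.

```python
import os
from typing import Mapping

# Only LLM_API_KEY is listed as required above.
REQUIRED_ENV_VARS = ("LLM_API_KEY",)


def missing_required_env(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the names of required variables that are unset or blank."""
    return [name for name in REQUIRED_ENV_VARS if not env.get(name, "").strip()]


if __name__ == "__main__":
    missing = missing_required_env()
    if missing:
        print("Set these before calling cognify():", ", ".join(missing))
    else:
        print("Required environment variables are set.")
```

Accepting the environment as a parameter keeps the check easy to exercise with a plain dict instead of the real process environment.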

## Parameters

<ParamField path="datasets" type="Union[str, list[str], list[UUID]]" default="None">Dataset name(s) or UUID(s) to process. Processes all datasets if not specified.</ParamField>
<ParamField path="user" type="User" default="None">User performing the operation.</ParamField>
<ParamField path="graph_model" type="BaseModel" default="KnowledgeGraph">Pydantic model defining the knowledge graph schema. Defaults to KnowledgeGraph.</ParamField>
<ParamField path="chunker" type="Any" default="TextChunker">Text chunking strategy class.</ParamField>
<ParamField path="chunk_size" type="int" default="None">Maximum size of text chunks in tokens.</ParamField>
<ParamField path="chunks_per_batch" type="int" default="None">Number of chunks to process per LLM batch.</ParamField>
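<ParamField path="config" type="Config" default="None">Run-level configuration dictionary. The documented use is passing an ontology resolver under the `ontology_config` key; import `Config` from `cognee.modules.ontology.ontology_config`.</ParamField>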
<ParamField path="vector_db_config" type="dict" default="None">Override vector database configuration.</ParamField>
<ParamField path="graph_db_config" type="dict" default="None">Override graph database configuration.</ParamField>
<ParamField path="run_in_background" type="bool" default="False">If true, return immediately and process in background.</ParamField>
<ParamField path="incremental_loading" type="bool" default="True">If true, skip already-processed data.</ParamField>
<ParamField path="custom_prompt" type="Optional[str]" default="None">Custom system prompt for entity/relationship extraction.</ParamField>
<ParamField path="temporal_cognify" type="bool" default="False">Enable temporal-aware processing.</ParamField>
<ParamField path="data_per_batch" type="int" default="20">Number of data items per processing batch.</ParamField>
<ParamField path="llm_config" type="Optional[LLMConfig]" default="None">LLM settings to install into the current async context for this graph-building operation. When omitted, Cognee uses the active context config or global LLM config. Import `LLMConfig` from `cognee.infrastructure.llm.config`.</ParamField>
<ParamField path="embedding_config" type="Optional[EmbeddingConfig]" default="None">Embedding settings to install into the current async context for this graph-building operation. When omitted, Cognee uses the active context config or global embedding config. Import `EmbeddingConfig` from `cognee.infrastructure.databases.vector.embeddings.config`.</ParamField>
<ParamField path="dry_run" type="bool" default="False">If true, return a `DryRunEstimate` of LLM token usage and rough cost instead of running the pipeline. No LLM calls are made and no graph results are written. See [Dry-run cost estimation](#dry-run-cost-estimation).</ParamField>

## Dry-run cost estimation

Pass `dry_run=True` to preview the LLM token usage and rough USD cost of a `cognify()` run **without making any LLM calls or writing graph results**. This is useful for budgeting a large dataset before committing to the run.

```python theme={null}
import cognee

estimate = await cognee.cognify(datasets=["my_dataset"], dry_run=True)
print(estimate)                    # human-readable summary table
print(estimate.estimated_cost_usd) # e.g. 0.012345
print(estimate.total_tokens)       # input + output tokens across stages
print(estimate.to_dict())          # JSON-serializable dict
```

The estimate covers the two LLM-heavy stages of the default pipeline — `structured_graph_extraction` and `chunk_summarization` — reusing the real document classifier, chunker, and prompt templates so chunk and call counts track an actual run.

The returned `DryRunEstimate` exposes:

| Field                                             | Type        | Description                                                                                                   |
| ------------------------------------------------- | ----------- | ------------------------------------------------------------------------------------------------------------- |
| `operation`                                       | `str`       | `"cognify"` for this call.                                                                                    |
| `model`                                           | `str`       | The configured LLM model the estimate is priced against.                                                      |
| `chunks`                                          | `int`       | Number of chunks that would trigger LLM calls.                                                                |
| `chunk_tokens`                                    | `int`       | Total input tokens across those chunks.                                                                       |
| `input_tokens` / `output_tokens` / `total_tokens` | `int`       | Aggregate token counts across all stages.                                                                     |
| `estimated_cost_usd`                              | `float`     | Rough total cost across all stages.                                                                           |
| `skipped_items`                                   | `int`       | Items excluded from estimation (e.g. audio/image items, DLT row chunks).                                      |
| `warnings`                                        | `list[str]` | Notes about approximations (reasoning-model output allowance, skipped items, or missing pricing entries).     |
| `stages`                                          | `list`      | Per-stage breakdown (`name`, `calls`, `input_tokens`, `output_tokens`, `total_tokens`, `estimated_cost_usd`). |

Behavior notes:

* **Datasets are resolved read-only.** Unlike a normal run, a dry run never creates a missing dataset, so estimating a typo'd dataset name fails loudly instead of silently creating one.
* **Estimates are upper bounds for re-runs.** With `incremental_loading=True`, a real run skips already-processed documents, so a dry run may over-estimate a re-run.
* **Not supported with `temporal_cognify=True`** (only the default pipeline is estimated) or while connected to a remote Cognee instance via `serve()` — both raise a `ValueError`.
* **Unknown models emit a warning** rather than reporting a `$0` cost when no pricing entry exists for the configured model.
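
One way to act on an estimate is a small budget gate that runs before the real `cognify()` call. The sketch below assumes the dict returned by `to_dict()` uses keys named after the fields in the table above; `check_budget` and the sample data are hypothetical, and treating skipped items as a reason to pause is a design choice rather than Cognee behavior.

```python
def check_budget(estimate: dict, budget_usd: float) -> list[str]:
    """Return reasons to review the run; an empty list means it fits the budget.

    `estimate` is assumed to look like `DryRunEstimate.to_dict()`.
    """
    problems = []
    cost = estimate["estimated_cost_usd"]
    if cost > budget_usd:
        problems.append(f"estimated cost ${cost:.4f} exceeds budget ${budget_usd:.4f}")
    if estimate.get("skipped_items", 0):
        problems.append(f"{estimate['skipped_items']} item(s) were not estimated")
    # Warnings flag approximations such as missing pricing entries, so surface them.
    problems.extend(f"warning: {w}" for w in estimate.get("warnings", []))
    return problems


# Hand-written sample data, not output from a real run.
sample = {
    "operation": "cognify",
    "estimated_cost_usd": 0.75,
    "skipped_items": 2,
    "warnings": [],
}
for problem in check_budget(sample, budget_usd=0.50):
    print(problem)
```

Because a dry run may over-estimate a re-run with `incremental_loading=True`, a gate like this errs on the side of caution for datasets that were already partly processed.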

## Processing Pipeline

When you call `cognify()`, data goes through these stages:

1. **Document classification** — identify content type
2. **Text chunking** — split into manageable segments
3. **Entity extraction** — identify entities using the LLM
4. **Relationship detection** — find connections between entities
5. **Graph construction** — build the knowledge graph
6. **Summarization** — generate summaries of content

## Examples

```python theme={null}
import cognee

# Process all datasets
await cognee.cognify()

# Process a specific dataset
await cognee.cognify(datasets=["my_dataset"])

# Process in background
await cognee.cognify(datasets=["large_dataset"], run_in_background=True)

# Use a custom graph model
from pydantic import BaseModel

class MyGraph(BaseModel):
    nodes: list
    edges: list

await cognee.cognify(graph_model=MyGraph)

# Custom extraction prompt
await cognee.cognify(
    custom_prompt="Extract all technical concepts and their relationships."
)

# Pin specific entity and relationship types
custom_prompt = """
Extract only people and cities as entities.
Connect people to cities with the relationship "lives_in".
Ignore all other entities.
"""

await cognee.cognify(custom_prompt=custom_prompt)
```

When `custom_prompt` is set, it fully **replaces** the default graph extraction prompt (see [`GRAPH_PROMPT_PATH`](/core-concepts/main-operations/legacy-operations/cognify#default-extraction-prompts)) for the entity/relationship extraction step, so you can constrain exactly which entity types and relationship labels the LLM produces. For a step-by-step walkthrough, see the [Custom Prompts guide](/guides/custom-prompts).

<Note>
  `custom_prompt` is ignored when `temporal_cognify=True`.
</Note>
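
When the allowed entity types and relationship labels come from configuration rather than a hand-written string, the prompt can be assembled programmatically. This is a sketch only: `build_extraction_prompt` is a hypothetical helper, and the wording it produces mirrors the example prompt above rather than any template shipped with Cognee.

```python
def build_extraction_prompt(entity_types: list[str],
                            relationships: dict[str, tuple[str, str]]) -> str:
    """Render a restrictive extraction prompt.

    `relationships` maps a relationship label to a (source type, target type) pair.
    """
    lines = [f"Extract only these entity types: {', '.join(entity_types)}."]
    for label, (source, target) in relationships.items():
        lines.append(f'Connect {source} to {target} with the relationship "{label}".')
    lines.append("Ignore all other entities and relationships.")
    return "\n".join(lines)


custom_prompt = build_extraction_prompt(
    entity_types=["Person", "City"],
    relationships={"lives_in": ("Person", "City")},
)
print(custom_prompt)
# The result would then be passed as cognify(custom_prompt=custom_prompt).
```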

## Further details

<AccordionGroup>
  <Accordion title="Background Execution">
    When `run_in_background=True`, `cognify()` starts the processing pipeline as an async background task and **returns immediately**. The return shape is the same as blocking mode — a dict mapping `dataset_id` → `PipelineRunInfo` — but each entry has status `PipelineRunStarted` instead of `PipelineRunCompleted`, and the knowledge graph construction continues in the background.

    ```python theme={null}
    import cognee

    # Start processing without waiting for completion
    run_info = await cognee.cognify(
        datasets=["large_corpus"],
        run_in_background=True
    )

    # run_info is a dict of {dataset_id: PipelineRunInfo}
    for dataset_id, info in run_info.items():
        print(info.pipeline_run_id)  # UUID to track this run
        print(info.dataset_id)       # Dataset being processed
        print(info.status)           # Initial status (e.g. "PipelineRunStarted")
    ```

    The returned `PipelineRunInfo` fields relevant for monitoring:

    | Field             | Type   | Description                             |
    | ----------------- | ------ | --------------------------------------- |
    | `pipeline_run_id` | `UUID` | Unique identifier for this pipeline run |
    | `dataset_id`      | `UUID` | The dataset being processed             |
    | `dataset_name`    | `str`  | Name of the dataset                     |
    | `status`          | `str`  | Current status of the run               |

    Possible status values: `PipelineRunStarted`, `PipelineRunYield`, `PipelineRunCompleted`, `PipelineRunAlreadyCompleted`, `PipelineRunErrored`.
  </Accordion>

  <Accordion title="Monitoring progress via WebSocket (REST API)">
    When using the REST API, subscribe to real-time pipeline updates with the WebSocket endpoint:

    ```
    WebSocket: /cognify/subscribe/{pipeline_run_id}
    ```

    **Authentication**: Use the same authentication context as your REST API session. In cookie-based setups, this is typically the `auth_token` cookie.

    **Usage example (JavaScript)**:

    ```javascript theme={null}
    const pipelineRunId = "your-pipeline-run-id-uuid";
    const ws = new WebSocket(`ws://your-server/cognify/subscribe/${pipelineRunId}`);

    ws.onmessage = (event) => {
        const data = JSON.parse(event.data);
        console.log("Status:", data.status);
        console.log("Run ID:", data.pipeline_run_id);
        // data.payload contains the current graph data for the dataset
    };

    ws.onclose = (event) => {
        // Code 1000: the run reached PipelineRunCompleted. Code 1008: authentication failed.
        console.log(event.code === 1000 ? "Pipeline run finished" : `Connection closed (code ${event.code})`);
    };
    ```

    Each WebSocket message has this shape:

    ```json theme={null}
    {
        "pipeline_run_id": "uuid-string",
        "status": "PipelineRunYield",
        "payload": { /* current graph data for the dataset */ }
    }
    ```

    The server closes the WebSocket with code `1000` (normal closure) once the run reaches `PipelineRunCompleted`. If authentication fails, the connection is closed with code `1008` (policy violation).
  </Accordion>

  <Accordion title="When to use background mode">
    * **Large datasets** (>100 MB) where blocking would time out HTTP connections
    * **API integrations** where you want to return a job ID to the caller immediately
    * **Parallel processing** of multiple datasets without waiting for each

    For small datasets or scripts, the default blocking mode (`run_in_background=False`) is simpler and returns the final result directly.
  </Accordion>
</AccordionGroup>
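
The WebSocket message shape and the status values listed under Background Execution can also be handled from Python. The sketch below parses one message and decides whether the run has finished. `PipelineUpdate`, `parse_update`, and the choice of which statuses count as terminal are assumptions made for illustration; the source only states that the server closes the connection on `PipelineRunCompleted`.

```python
import json
from dataclasses import dataclass
from typing import Any

# Assumption: these statuses mean the run will produce no further updates.
TERMINAL_STATUSES = {
    "PipelineRunCompleted",
    "PipelineRunAlreadyCompleted",
    "PipelineRunErrored",
}


@dataclass
class PipelineUpdate:
    pipeline_run_id: str
    status: str
    payload: Any = None

    @property
    def is_terminal(self) -> bool:
        return self.status in TERMINAL_STATUSES


def parse_update(raw: str) -> PipelineUpdate:
    """Decode one WebSocket text frame into a PipelineUpdate."""
    data = json.loads(raw)
    return PipelineUpdate(
        pipeline_run_id=data["pipeline_run_id"],
        status=data["status"],
        payload=data.get("payload"),
    )


# Hand-written frame for illustration, not captured from a server.
frame = '{"pipeline_run_id": "uuid-string", "status": "PipelineRunYield", "payload": {}}'
update = parse_update(frame)
print(update.status, update.is_terminal)  # PipelineRunYield False
```

Keeping the parsing separate from the transport means the same function works with whichever WebSocket client library is in use.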