# cognify()

> Transform raw data into a structured knowledge graph

# cognee.cognify()

```python theme={null}
async def cognify(
    datasets: Union[str, list[str], list[UUID]] = None,
    user: User = None,
    graph_model: BaseModel = KnowledgeGraph,
    chunker = TextChunker,
    chunk_size: int = None,
    chunks_per_batch: int = None,
    config: Config = None,
    vector_db_config: dict = None,
    graph_db_config: dict = None,
    run_in_background: bool = False,
    incremental_loading: bool = True,
    custom_prompt: Optional[str] = None,
    temporal_cognify: bool = False,
    data_per_batch: int = 20,
)
```

## Description

Transform ingested data into a structured knowledge graph.

This is the core processing step in Cognee that converts raw text and documents
into an intelligent knowledge graph. It analyzes content, extracts entities and
relationships, and creates semantic connections for enhanced search and reasoning.

Prerequisites:

* **LLM\_API\_KEY**: Must be configured (required for entity extraction and graph generation)
* **Data Added**: Must have data previously added via `cognee.add()`
* **Vector Database**: Must be accessible for embeddings storage
* **Graph Database**: Must be accessible for relationship storage

Input Requirements:

* **Datasets**: Must contain data previously added via `cognee.add()`
* **Content Types**: Works with any text-extractable content including:
  * Natural language documents
  * Structured data (CSV, JSON)
  * Code repositories
  * Academic papers and technical documentation
  * Mixed multimedia content (with text extraction)

Processing Pipeline:

1. **Document Classification**: Identifies document types and structures
2. **Text Chunking**: Breaks content into semantically meaningful segments
3. **Entity Extraction**: Identifies key concepts, people, places, organizations
4. **Relationship Detection**: Discovers connections between entities
5. **Graph Construction**: Builds semantic knowledge graph with embeddings
6. **Content Summarization**: Creates hierarchical summaries for navigation

Graph Model Customization:
The `graph_model` parameter allows custom knowledge structures:

* **Default**: General-purpose KnowledgeGraph for any domain
* **Custom Models**: Domain-specific schemas (e.g., scientific papers, code analysis)
* **Ontology Integration**: Use `ontology_file_path` for predefined vocabularies

Args:

* **datasets**: Dataset name(s) or dataset UUID(s) to process. Processes all available data if None.
  * Single dataset: `"my_dataset"`
  * Multiple datasets: `["docs", "research", "reports"]`
  * None: Process all datasets for the user
* **user**: User context for authentication and data access. Uses default if None.
* **graph\_model**: Pydantic model defining the knowledge graph structure. Defaults to KnowledgeGraph for general-purpose processing.
* **chunker**: Text chunking strategy (TextChunker, LangchainChunker). Determines how documents are segmented for processing.
  * TextChunker: Paragraph-based chunking (default, most reliable)
  * LangchainChunker: Recursive character splitting with overlap
* **chunk\_size**: Maximum tokens per chunk. Auto-calculated based on LLM if None.
  * Formula: `min(embedding_max_completion_tokens, llm_max_completion_tokens // 2)`
  * Default limits: \~512-8192 tokens depending on models.
  * Smaller chunks = more granular but potentially fragmented knowledge.
* **chunks\_per\_batch**: Number of chunks to be processed in a single batch in Cognify tasks.
* **vector\_db\_config**: Custom vector database configuration for embeddings storage.
* **graph\_db\_config**: Custom graph database configuration for relationship storage.
* **run\_in\_background**: If True, starts processing asynchronously and returns immediately. If False, waits for completion before returning.
  * Background mode recommended for large datasets (>100 MB).
  * Use pipeline\_run\_id from return value to monitor progress.
* **custom\_prompt**: Optional custom prompt string to use for entity extraction and graph generation. If provided, this prompt will be used instead of the default prompts for knowledge graph extraction. The prompt should guide the LLM on how to extract entities and relationships from the text content.
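The `chunk_size` formula given in the Args can be worked through with a couple of numbers. This is a minimal sketch of the arithmetic only: `default_chunk_size` is a hypothetical helper, not a Cognee function, and the token limits are made-up values chosen to show both branches of `min()`.

```python
def default_chunk_size(embedding_max_completion_tokens: int,
                       llm_max_completion_tokens: int) -> int:
    """Apply the documented formula for the auto-calculated chunk size."""
    return min(embedding_max_completion_tokens, llm_max_completion_tokens // 2)


# Hypothetical limits, for illustration only.
print(default_chunk_size(8191, 16384))  # 8191: the embedding limit is smaller
print(default_chunk_size(8191, 4096))   # 2048: half of the LLM limit is smaller
```

Passing an explicit `chunk_size` bypasses this calculation.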

Returns:
Union\[dict, list\[PipelineRunInfo]]:

* **Blocking mode**: Dictionary mapping dataset\_id -> PipelineRunInfo with:
  * Processing status (completed/failed/in\_progress)
  * Extracted entity and relationship counts
  * Processing duration and resource usage
  * Error details if any failures occurred
* **Background mode**: List of PipelineRunInfo objects for tracking progress
  * Use pipeline\_run\_id to monitor status
  * Check completion via pipeline monitoring APIs
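Because the return shape differs between the two modes, calling code may want to normalize it before inspecting runs. The sketch below assumes only what is stated in the Returns section (a dict keyed by dataset\_id in blocking mode, a list in background mode). `run_infos_as_list` is a hypothetical helper, and `SimpleNamespace` stands in for a real `PipelineRunInfo`.

```python
from types import SimpleNamespace
from typing import Any, Union


def run_infos_as_list(result: Union[dict, list]) -> list[Any]:
    """Return the PipelineRunInfo-like objects from either return shape."""
    if isinstance(result, dict):
        # Blocking mode: dataset_id -> PipelineRunInfo
        return list(result.values())
    # Background mode: already a list of PipelineRunInfo
    return list(result)


# Stand-in object; a real call would pass the value returned by cognee.cognify().
fake_info = SimpleNamespace(pipeline_run_id="run-1", status="PipelineRunStarted")

for info in run_infos_as_list({"dataset-1": fake_info}):
    print(info.pipeline_run_id, info.status)
```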

Next Steps:
After successful cognify processing, use search functions to query the knowledge:

```python theme={null}
import cognee
from cognee import SearchType

# Process your data into knowledge graph
await cognee.cognify()

# Query for insights using different search types:

# 1. Natural language completion with graph context
insights = await cognee.search(
    "What are the main themes?",
    query_type=SearchType.GRAPH_COMPLETION
)

# 2. Get entity relationships and connections
relationships = await cognee.search(
    "connections between concepts",
    query_type=SearchType.GRAPH_COMPLETION
)

# 3. Find relevant document chunks
chunks = await cognee.search(
    "specific topic",
    query_type=SearchType.CHUNKS
)
```

Advanced Usage:

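```python theme={null}
from typing import List

import cognee
from cognee.infrastructure.engine import DataPoint

# Custom domain model for scientific papers
class ScientificPaper(DataPoint):
    title: str
    authors: List[str]
    methodology: str
    findings: List[str]

# Note: ontology_file_path does not appear in the signature shown at the top of
# this page; confirm that your installed version accepts it before relying on it.
await cognee.cognify(
    datasets=["research_papers"],
    graph_model=ScientificPaper,
    ontology_file_path="scientific_ontology.owl"
)

# Background processing for large datasets
run_infos = await cognee.cognify(
    datasets=["large_corpus"],
    run_in_background=True
)
# Background mode returns a list of PipelineRunInfo objects;
# check status later with each item's pipeline_run_id
```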

Environment Variables:

Required:

* LLM\_API\_KEY: API key for your LLM provider

Optional (same as for `cognee.add()`):

* LLM\_PROVIDER, LLM\_MODEL, VECTOR\_DB\_PROVIDER, GRAPH\_DATABASE\_PROVIDER
* LLM\_RATE\_LIMIT\_ENABLED: Enable rate limiting (default: False)
* LLM\_RATE\_LIMIT\_REQUESTS: Max requests per interval (default: 60)
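A pre-flight check can catch a missing key before a long run starts. The sketch below only reads the variables listed above and applies their documented defaults. `read_llm_settings` is a hypothetical helper, and the way it interprets `LLM_RATE_LIMIT_ENABLED` (accepting `true` or `1`) is an assumption, not Cognee's own parsing logic.

```python
import os
from typing import Mapping


def read_llm_settings(env: Mapping[str, str] = os.environ) -> dict:
    """Collect the LLM-related variables, failing early if the key is missing."""
    api_key = env.get("LLM_API_KEY")
    if not api_key:
        raise RuntimeError("LLM_API_KEY must be set before calling cognify()")
    return {
        "api_key": api_key,
        # Assumption: "true" or "1" (case-insensitive) enables rate limiting.
        "rate_limit_enabled": env.get("LLM_RATE_LIMIT_ENABLED", "false").lower() in ("true", "1"),
        # Documented default: 60 requests per interval.
        "rate_limit_requests": int(env.get("LLM_RATE_LIMIT_REQUESTS", "60")),
    }


# Example with an explicit mapping instead of the real environment.
print(read_llm_settings({"LLM_API_KEY": "example-key"}))
```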

## Parameters

<ParamField path="datasets" type="Union[str, list[str], list[UUID]]" default="None">Dataset name(s) or UUID(s) to process. Processes all datasets if not specified.</ParamField>
<ParamField path="user" type="User" default="None">User performing the operation.</ParamField>
<ParamField path="graph_model" type="BaseModel" default="KnowledgeGraph">Pydantic model defining the knowledge graph schema. Defaults to KnowledgeGraph.</ParamField>
<ParamField path="chunker" type="Any" default="TextChunker">Text chunking strategy class.</ParamField>
<ParamField path="chunk_size" type="int" default="None">Maximum size of text chunks in tokens.</ParamField>
<ParamField path="chunks_per_batch" type="int" default="None">Number of chunks to process per LLM batch.</ParamField>
<ParamField path="config" type="Config" default="None">Override the full Cognee config for this run.</ParamField>
<ParamField path="vector_db_config" type="dict" default="None">Override vector database configuration.</ParamField>
<ParamField path="graph_db_config" type="dict" default="None">Override graph database configuration.</ParamField>
<ParamField path="run_in_background" type="bool" default="False">If true, return immediately and process in background.</ParamField>
<ParamField path="incremental_loading" type="bool" default="True">If true, skip already-processed data.</ParamField>
<ParamField path="custom_prompt" type="Optional[str]" default="None">Custom system prompt for entity/relationship extraction.</ParamField>
<ParamField path="temporal_cognify" type="bool" default="False">Enable temporal-aware processing.</ParamField>
<ParamField path="data_per_batch" type="int" default="20">Number of data items per processing batch.</ParamField>

## Processing Pipeline

When you call `cognify()`, data goes through these stages:

1. **Document classification** — identify content type
2. **Text chunking** — split into manageable segments
3. **Entity extraction** — identify entities using the LLM
4. **Relationship detection** — find connections between entities
5. **Graph construction** — build the knowledge graph
6. **Summarization** — generate summaries of content

## Examples

```python theme={null}
import cognee

# Process all datasets
await cognee.cognify()

# Process a specific dataset
await cognee.cognify(datasets=["my_dataset"])

# Process in background
await cognee.cognify(datasets=["large_dataset"], run_in_background=True)

# Use a custom graph model
from pydantic import BaseModel

class MyGraph(BaseModel):
    nodes: list
    edges: list

await cognee.cognify(graph_model=MyGraph)

# Custom extraction prompt
await cognee.cognify(
    custom_prompt="Extract all technical concepts and their relationships."
)
```

## Further details

<AccordionGroup>
  <Accordion title="Background Execution">
    When `run_in_background=True`, `cognify()` starts the processing pipeline as an async background task and **returns immediately** with a list of `PipelineRunInfo` objects — one per dataset. The knowledge graph construction continues in the background.

    ```python theme={null}
    import cognee

    # Start processing without waiting for completion
    run_infos = await cognee.cognify(
        datasets=["large_corpus"],
        run_in_background=True
    )

    # run_infos is a list of PipelineRunInfo objects
    for info in run_infos:
        print(info.pipeline_run_id)  # UUID to track this run
        print(info.dataset_id)       # Dataset being processed
        print(info.status)           # Initial status (e.g. "PipelineRunStarted")
    ```

    The returned `PipelineRunInfo` fields relevant for monitoring:

    | Field             | Type   | Description                             |
    | ----------------- | ------ | --------------------------------------- |
    | `pipeline_run_id` | `UUID` | Unique identifier for this pipeline run |
    | `dataset_id`      | `UUID` | The dataset being processed             |
    | `dataset_name`    | `str`  | Name of the dataset                     |
    | `status`          | `str`  | Current status of the run               |

    Possible status values: `PipelineRunStarted`, `PipelineRunYield`, `PipelineRunCompleted`, `PipelineRunAlreadyCompleted`, `PipelineRunErrored`.
  </Accordion>

  <Accordion title="Monitoring progress via WebSocket (REST API)">
    When using the REST API, subscribe to real-time pipeline updates with the WebSocket endpoint:

    ```
    WebSocket: /cognify/subscribe/{pipeline_run_id}
    ```

    **Authentication**: Use the same authentication context as your REST API session. In cookie-based setups, this is typically the `auth_token` cookie.

    **Usage example (JavaScript)**:

    ```javascript theme={null}
    const pipelineRunId = "your-pipeline-run-id-uuid";
    const ws = new WebSocket(`ws://your-server/cognify/subscribe/${pipelineRunId}`);

    ws.onmessage = (event) => {
        const data = JSON.parse(event.data);
        console.log("Status:", data.status);
        console.log("Run ID:", data.pipeline_run_id);
        // data.payload contains the current graph data for the dataset
    };

    ws.onclose = () => {
        // Server closes the connection when processing completes (status: PipelineRunCompleted)
        console.log("Pipeline run finished");
    };
    ```

    Each WebSocket message has this shape:

    ```json theme={null}
    {
        "pipeline_run_id": "uuid-string",
        "status": "PipelineRunYield",
        "payload": { /* current graph data for the dataset */ }
    }
    ```

    The server closes the WebSocket with code `1000` (normal closure) once the run reaches `PipelineRunCompleted`. If authentication fails, the connection is closed with code `1008` (policy violation).
  </Accordion>

  <Accordion title="When to use background mode">
    * **Large datasets** (>100 MB) where blocking would time out HTTP connections
    * **API integrations** where you want to return a job ID to the caller immediately
    * **Parallel processing** of multiple datasets without waiting for each

    For small datasets or scripts, the default blocking mode (`run_in_background=False`) is simpler and returns the final result directly.
  </Accordion>
</AccordionGroup>
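For Python clients, the WebSocket message shape and the status values listed under Background Execution can be handled with a small parser. This is a sketch under stated assumptions: `parse_update` is a hypothetical helper, and treating `PipelineRunCompleted`, `PipelineRunAlreadyCompleted`, and `PipelineRunErrored` as terminal is an interpretation of the status names, not something this page confirms.

```python
import json

# Status names come from this page; which of them are terminal is an assumption.
TERMINAL_STATUSES = {
    "PipelineRunCompleted",
    "PipelineRunAlreadyCompleted",
    "PipelineRunErrored",
}


def parse_update(raw: str) -> tuple[str, str, bool]:
    """Return (pipeline_run_id, status, is_terminal) for one WebSocket message."""
    message = json.loads(raw)
    status = message["status"]
    return message["pipeline_run_id"], status, status in TERMINAL_STATUSES


raw = '{"pipeline_run_id": "uuid-string", "status": "PipelineRunYield", "payload": {}}'
print(parse_update(raw))  # ('uuid-string', 'PipelineRunYield', False)
```

The same check works for the `status` field of a `PipelineRunInfo` returned by `cognify()`.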
