cognee.cognify()

async def cognify(
    datasets: Union[str, list[str], list[UUID]] = None,
    user: User = None,
    graph_model: BaseModel = KnowledgeGraph,
    chunker = TextChunker,
    chunk_size: int = None,
    chunks_per_batch: int = None,
    config: Config = None,
    vector_db_config: dict = None,
    graph_db_config: dict = None,
    run_in_background: bool = False,
    incremental_loading: bool = True,
    custom_prompt: Optional[str] = None,
    temporal_cognify: bool = False,
    data_per_batch: int = 20,
)

Description

Transform ingested data into a structured knowledge graph. This is the core processing step in Cognee: it analyzes raw text and documents, extracts entities and relationships, and creates semantic connections that enable enhanced search and reasoning.

Prerequisites:
  • LLM_API_KEY: Must be configured (required for entity extraction and graph generation)
  • Data Added: Must have data previously added via cognee.add()
  • Vector Database: Must be accessible for embeddings storage
  • Graph Database: Must be accessible for relationship storage
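A minimal pre-flight check for the first prerequisite might look like this (the variable name LLM_API_KEY comes from this page; the helper function and its error message are illustrative, not part of cognee's API):

```python
import os

def check_cognify_prerequisites() -> None:
    # LLM_API_KEY is required for entity extraction and graph generation
    if not os.environ.get("LLM_API_KEY"):
        raise RuntimeError("LLM_API_KEY must be set before calling cognee.cognify()")

os.environ["LLM_API_KEY"] = "sk-example"  # placeholder value for illustration
check_cognify_prerequisites()  # passes once the key is set
```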
Input Requirements:
  • Datasets: Must contain data previously added via cognee.add()
  • Content Types: Works with any text-extractable content including:
    • Natural language documents
    • Structured data (CSV, JSON)
    • Code repositories
    • Academic papers and technical documentation
    • Mixed multimedia content (with text extraction)
Processing Pipeline:
  1. Document Classification: Identifies document types and structures
  2. Text Chunking: Breaks content into semantically meaningful segments
  3. Entity Extraction: Identifies key concepts, people, places, organizations
  4. Relationship Detection: Discovers connections between entities
  5. Graph Construction: Builds semantic knowledge graph with embeddings
  6. Content Summarization: Creates hierarchical summaries for navigation
Graph Model Customization: The graph_model parameter allows custom knowledge structures:
  • Default: General-purpose KnowledgeGraph for any domain
  • Custom Models: Domain-specific schemas (e.g., scientific papers, code analysis)
  • Ontology Integration: Use ontology_file_path for predefined vocabularies
Args:
  datasets: Dataset name(s) or dataset UUID(s) to process. Processes all available data if None.
    • Single dataset: "my_dataset"
    • Multiple datasets: ["docs", "research", "reports"]
    • None: Process all datasets for the user
  user: User context for authentication and data access. Uses the default user if None.
  graph_model: Pydantic model defining the knowledge graph structure. Defaults to KnowledgeGraph for general-purpose processing.
  chunker: Text chunking strategy that determines how documents are segmented for processing.
    • TextChunker: Paragraph-based chunking (default, most reliable)
    • LangchainChunker: Recursive character splitting with overlap
  chunk_size: Maximum tokens per chunk. Auto-calculated from the configured models if None, using min(embedding_max_completion_tokens, llm_max_completion_tokens // 2). Default limits are roughly 512-8192 tokens depending on the models. Smaller chunks give more granular but potentially fragmented knowledge.
  chunks_per_batch: Number of chunks to be processed in a single batch by Cognify tasks.
  vector_db_config: Custom vector database configuration for embeddings storage.
  graph_db_config: Custom graph database configuration for relationship storage.
  run_in_background: If True, starts processing asynchronously and returns immediately; if False, waits for completion before returning. Background mode is recommended for large datasets (>100 MB). Use the pipeline_run_id from the return value to monitor progress.
  custom_prompt: Optional custom prompt string for entity extraction and graph generation. If provided, it is used instead of the default knowledge-graph extraction prompts, and should guide the LLM on how to extract entities and relationships from the text content.
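The chunk_size auto-calculation described above can be sketched in plain Python (the token limits passed in below are example values, not real model limits):

```python
def default_chunk_size(embedding_max_completion_tokens: int,
                       llm_max_completion_tokens: int) -> int:
    # Documented default: the embedding limit, capped at half the LLM limit
    return min(embedding_max_completion_tokens, llm_max_completion_tokens // 2)

# e.g. an 8192-token embedding model paired with a 16384-token LLM
print(default_chunk_size(8192, 16384))  # → 8192

# a smaller LLM context halves the effective chunk size
print(default_chunk_size(8192, 1024))   # → 512
```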
Returns: Union[dict, list[PipelineRunInfo]]:
  • Blocking mode: Dictionary mapping dataset_id -> PipelineRunInfo with:
    • Processing status (completed/failed/in_progress)
    • Extracted entity and relationship counts
    • Processing duration and resource usage
    • Error details if any failures occurred
  • Background mode: List of PipelineRunInfo objects for tracking progress
    • Use pipeline_run_id to monitor status
    • Check completion via pipeline monitoring APIs
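The exact shape of PipelineRunInfo varies by cognee version; the sketch below uses a hypothetical stand-in with just the fields described above to show how a blocking-mode result dictionary might be inspected:

```python
from dataclasses import dataclass

# Hypothetical stand-in for cognee's PipelineRunInfo, reduced to the
# attributes described on this page; the real class carries more fields.
@dataclass
class PipelineRunInfo:
    pipeline_run_id: str
    status: str  # "completed" / "failed" / "in_progress"

def summarize_runs(results: dict) -> list:
    # Blocking mode returns a mapping of dataset_id -> PipelineRunInfo
    return [f"{dataset_id}: {info.status}" for dataset_id, info in results.items()]

runs = {
    "docs": PipelineRunInfo("run-1", "completed"),
    "research": PipelineRunInfo("run-2", "failed"),
}
print(summarize_runs(runs))  # → ['docs: completed', 'research: failed']
```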
Next Steps: After successful cognify processing, use search functions to query the knowledge:
import cognee
from cognee import SearchType

# Process your data into knowledge graph
await cognee.cognify()

# Query for insights using different search types:

# 1. Natural language completion with graph context
insights = await cognee.search(
    "What are the main themes?",
    query_type=SearchType.GRAPH_COMPLETION
)

# 2. Get entity relationships and connections
relationships = await cognee.search(
    "connections between concepts",
    query_type=SearchType.INSIGHTS  # relationship-focused search, if available in your cognee version
)

# 3. Find relevant document chunks
chunks = await cognee.search(
    "specific topic",
    query_type=SearchType.CHUNKS
)
Advanced Usage:
# Custom domain model for scientific papers
from typing import List
from cognee.low_level import DataPoint  # import path may vary across cognee versions

class ScientificPaper(DataPoint):
    title: str
    authors: List[str]
    methodology: str
    findings: List[str]

await cognee.cognify(
    datasets=["research_papers"],
    graph_model=ScientificPaper,
    ontology_file_path="scientific_ontology.owl"
)

# Background processing for large datasets
run_info = await cognee.cognify(
    datasets=["large_corpus"],
    run_in_background=True
)
# Check status later with run_info.pipeline_run_id
Environment Variables:
Required:
  • LLM_API_KEY: API key for your LLM provider
Optional (same as add function):
  • LLM_PROVIDER, LLM_MODEL, VECTOR_DB_PROVIDER, GRAPH_DATABASE_PROVIDER
  • LLM_RATE_LIMIT_ENABLED: Enable rate limiting (default: False)
  • LLM_RATE_LIMIT_REQUESTS: Max requests per interval (default: 60)
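These variables can be set from the shell or from Python before importing cognee; a sketch with placeholder values (the variable names come from this page, the values are examples only):

```python
import os

# Required: API key for your LLM provider (placeholder value shown)
os.environ["LLM_API_KEY"] = "sk-example"

# Optional: enable and tune rate limiting
os.environ["LLM_RATE_LIMIT_ENABLED"] = "true"
os.environ["LLM_RATE_LIMIT_REQUESTS"] = "60"
```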

Parameters

datasets
Union[str, list[str], list[UUID]]
default:"None"
Dataset name(s) or UUID(s) to process. Processes all datasets if not specified.
user
User
default:"None"
User performing the operation.
graph_model
BaseModel
default:"KnowledgeGraph"
Pydantic model defining the knowledge graph schema. Defaults to KnowledgeGraph.
chunker
Any
default:"TextChunker"
Text chunking strategy class.
chunk_size
int
default:"None"
Maximum size of text chunks in tokens.
chunks_per_batch
int
default:"None"
Number of chunks to process per LLM batch.
config
Config
default:"None"
Override the full Cognee config for this run.
vector_db_config
dict
default:"None"
Override vector database configuration.
graph_db_config
dict
default:"None"
Override graph database configuration.
run_in_background
bool
default:"False"
If true, return immediately and process in background.
incremental_loading
bool
default:"True"
If true, skip already-processed data.
custom_prompt
Optional[str]
default:"None"
Custom system prompt for entity/relationship extraction.
temporal_cognify
bool
default:"False"
Enable temporal-aware processing.
data_per_batch
int
default:"20"
Number of data items per processing batch.

Processing Pipeline

When you call cognify(), data goes through these stages:
  1. Document classification — identify content type
  2. Text chunking — split into manageable segments
  3. Entity extraction — identify entities using the LLM
  4. Relationship detection — find connections between entities
  5. Graph construction — build the knowledge graph
  6. Summarization — generate summaries of content
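As a conceptual toy only (cognee performs these stages with an LLM and embeddings, not regexes), stages 2 and 3 can be illustrated like this:

```python
import re

def chunk_text(text: str, max_words: int = 50) -> list:
    # Stage 2 (toy): split on paragraph breaks, then cap each chunk by word count
    chunks = []
    for para in text.split("\n\n"):
        words = para.split()
        for i in range(0, len(words), max_words):
            chunks.append(" ".join(words[i:i + max_words]))
    return chunks

def extract_entities(chunk: str) -> set:
    # Stage 3 (toy): treat capitalized tokens as candidate entities;
    # cognee delegates this step to the LLM instead
    return set(re.findall(r"\b[A-Z][a-z]+\b", chunk))

text = "Ada Lovelace worked with Charles Babbage.\n\nBabbage designed the Analytical Engine."
for chunk in chunk_text(text):
    print(sorted(extract_entities(chunk)))
# → ['Ada', 'Babbage', 'Charles', 'Lovelace']
# → ['Analytical', 'Babbage', 'Engine']
```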

Examples

import cognee

# Process all datasets
await cognee.cognify()

# Process a specific dataset
await cognee.cognify(datasets=["my_dataset"])

# Process in background
await cognee.cognify(datasets=["large_dataset"], run_in_background=True)

# Use a custom graph model
from pydantic import BaseModel

class MyGraph(BaseModel):
    nodes: list
    edges: list

await cognee.cognify(graph_model=MyGraph)

# Custom extraction prompt
await cognee.cognify(
    custom_prompt="Extract all technical concepts and their relationships."
)