cognee.cognify()

async def cognify(
    datasets: Union[str, list[str], list[UUID]] = None,
    user: User = None,
    graph_model: BaseModel = KnowledgeGraph,
    chunker = TextChunker,
    chunk_size: int = None,
    chunks_per_batch: int = None,
    config: Config = None,
    vector_db_config: dict = None,
    graph_db_config: dict = None,
    run_in_background: bool = False,
    incremental_loading: bool = True,
    custom_prompt: Optional[str] = None,
    temporal_cognify: bool = False,
    data_per_batch: int = 20,
)

Description

Transform ingested data into a structured knowledge graph. This is the core processing step in Cognee: it analyzes raw text and documents, extracts entities and relationships, and creates semantic connections that enable enhanced search and reasoning.

Prerequisites:
  • LLM_API_KEY: Must be configured (required for entity extraction and graph generation)
  • Data Added: Must have data previously added via cognee.add()
  • Vector Database: Must be accessible for embeddings storage
  • Graph Database: Must be accessible for relationship storage
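A minimal pre-flight check for the first prerequisite might look like this (the variable name LLM_API_KEY comes from this page; the helper function and its error message are illustrative, not part of cognee's API):

```python
import os

def check_cognify_prerequisites() -> None:
    # LLM_API_KEY is required for entity extraction and graph generation
    if not os.environ.get("LLM_API_KEY"):
        raise RuntimeError("LLM_API_KEY must be set before calling cognee.cognify()")

os.environ["LLM_API_KEY"] = "sk-example"  # placeholder value for illustration
check_cognify_prerequisites()  # passes once the key is set
```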
Input Requirements:
  • Datasets: Must contain data previously added via cognee.add()
  • Content Types: Works with any text-extractable content including:
    • Natural language documents
    • Structured data (CSV, JSON)
    • Code repositories
    • Academic papers and technical documentation
    • Mixed multimedia content (with text extraction)
Processing Pipeline:
  1. Document Classification: Identifies document types and structures
  2. Text Chunking: Breaks content into semantically meaningful segments
  3. Entity Extraction: Identifies key concepts, people, places, organizations
  4. Relationship Detection: Discovers connections between entities
  5. Graph Construction: Builds semantic knowledge graph with embeddings
  6. Content Summarization: Creates hierarchical summaries for navigation
Graph Model Customization: The graph_model parameter allows custom knowledge structures:
  • Default: General-purpose KnowledgeGraph for any domain
  • Custom Models: Domain-specific schemas (e.g., scientific papers, code analysis)
  • Ontology Integration: Use ontology_file_path for predefined vocabularies
Args:
  datasets: Dataset name(s) or dataset UUID(s) to process. Processes all available data if None.
    • Single dataset: "my_dataset"
    • Multiple datasets: ["docs", "research", "reports"]
    • None: Process all datasets for the user
  user: User context for authentication and data access. Uses the default user if None.
  graph_model: Pydantic model defining the knowledge graph structure. Defaults to KnowledgeGraph for general-purpose processing.
  chunker: Text chunking strategy that determines how documents are segmented for processing.
    • TextChunker: Paragraph-based chunking (default, most reliable)
    • LangchainChunker: Recursive character splitting with overlap
  chunk_size: Maximum tokens per chunk. Auto-calculated from the configured models if None, using min(embedding_max_completion_tokens, llm_max_completion_tokens // 2). Default limits are roughly 512-8192 tokens depending on the models. Smaller chunks give more granular but potentially fragmented knowledge.
  chunks_per_batch: Number of chunks to be processed in a single batch by Cognify tasks.
  vector_db_config: Custom vector database configuration for embeddings storage.
  graph_db_config: Custom graph database configuration for relationship storage.
  run_in_background: If True, starts processing asynchronously and returns immediately; if False, waits for completion before returning. Background mode is recommended for large datasets (>100 MB). Use the pipeline_run_id from the return value to monitor progress.
  custom_prompt: Optional custom prompt string for entity extraction and graph generation. If provided, it is used instead of the default knowledge-graph extraction prompts, and should guide the LLM on how to extract entities and relationships from the text content.
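The chunk_size auto-calculation described above can be sketched in plain Python (the token limits passed in below are example values, not real model limits):

```python
def default_chunk_size(embedding_max_completion_tokens: int,
                       llm_max_completion_tokens: int) -> int:
    # Documented default: the embedding limit, capped at half the LLM limit
    return min(embedding_max_completion_tokens, llm_max_completion_tokens // 2)

# e.g. an 8192-token embedding model paired with a 16384-token LLM
print(default_chunk_size(8192, 16384))  # → 8192

# a smaller LLM context halves the effective chunk size
print(default_chunk_size(8192, 1024))   # → 512
```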
Returns: Union[dict, list[PipelineRunInfo]]:
  • Blocking mode: Dictionary mapping dataset_id -> PipelineRunInfo with:
    • Processing status (completed/failed/in_progress)
    • Extracted entity and relationship counts
    • Processing duration and resource usage
    • Error details if any failures occurred
  • Background mode: List of PipelineRunInfo objects for tracking progress
    • Use pipeline_run_id to monitor status
    • Check completion via pipeline monitoring APIs
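The exact shape of PipelineRunInfo varies by cognee version; the sketch below uses a hypothetical stand-in with just the fields described above to show how a blocking-mode result dictionary might be inspected:

```python
from dataclasses import dataclass

# Hypothetical stand-in for cognee's PipelineRunInfo, reduced to the
# attributes described on this page; the real class carries more fields.
@dataclass
class PipelineRunInfo:
    pipeline_run_id: str
    status: str  # "completed" / "failed" / "in_progress"

def summarize_runs(results: dict) -> list:
    # Blocking mode returns a mapping of dataset_id -> PipelineRunInfo
    return [f"{dataset_id}: {info.status}" for dataset_id, info in results.items()]

runs = {
    "docs": PipelineRunInfo("run-1", "completed"),
    "research": PipelineRunInfo("run-2", "failed"),
}
print(summarize_runs(runs))  # → ['docs: completed', 'research: failed']
```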
Next Steps: After successful cognify processing, use search functions to query the knowledge:
import cognee
from cognee import SearchType

# Process your data into knowledge graph
await cognee.cognify()

# Query for insights using different search types:

# 1. Natural language completion with graph context
insights = await cognee.search(
    "What are the main themes?",
    query_type=SearchType.GRAPH_COMPLETION
)

# 2. Get entity relationships and connections
relationships = await cognee.search(
    "connections between concepts",
    query_type=SearchType.INSIGHTS  # relationship-focused search, if available in your cognee version
)

# 3. Find relevant document chunks
chunks = await cognee.search(
    "specific topic",
    query_type=SearchType.CHUNKS
)
Advanced Usage:
# Custom domain model for scientific papers
from typing import List
from cognee.low_level import DataPoint  # import path may vary across cognee versions

class ScientificPaper(DataPoint):
    title: str
    authors: List[str]
    methodology: str
    findings: List[str]

await cognee.cognify(
    datasets=["research_papers"],
    graph_model=ScientificPaper,
    ontology_file_path="scientific_ontology.owl"
)

# Background processing for large datasets
run_info = await cognee.cognify(
    datasets=["large_corpus"],
    run_in_background=True
)
# Check status later with run_info.pipeline_run_id
Environment Variables:
Required:
  • LLM_API_KEY: API key for your LLM provider
Optional (same as add function):
  • LLM_PROVIDER, LLM_MODEL, VECTOR_DB_PROVIDER, GRAPH_DATABASE_PROVIDER
  • LLM_RATE_LIMIT_ENABLED: Enable rate limiting (default: False)
  • LLM_RATE_LIMIT_REQUESTS: Max requests per interval (default: 60)
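These variables can be set from the shell or from Python before importing cognee; a sketch with placeholder values (the variable names come from this page, the values are examples only):

```python
import os

# Required: API key for your LLM provider (placeholder value shown)
os.environ["LLM_API_KEY"] = "sk-example"

# Optional: enable and tune rate limiting
os.environ["LLM_RATE_LIMIT_ENABLED"] = "true"
os.environ["LLM_RATE_LIMIT_REQUESTS"] = "60"
```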

Parameters

datasets
Union[str, list[str], list[UUID]]
default:"None"
Dataset name(s) or UUID(s) to process. Processes all datasets if not specified.
user
User
default:"None"
User performing the operation.
graph_model
BaseModel
default:"KnowledgeGraph"
Pydantic model defining the knowledge graph schema. Defaults to KnowledgeGraph.
chunker
Any
default:"TextChunker"
Text chunking strategy class.
chunk_size
int
default:"None"
Maximum size of text chunks in tokens.
chunks_per_batch
int
default:"None"
Number of chunks to process per LLM batch.
config
Config
default:"None"
Override the full Cognee config for this run.
vector_db_config
dict
default:"None"
Override vector database configuration.
graph_db_config
dict
default:"None"
Override graph database configuration.
run_in_background
bool
default:"False"
If true, return immediately and process in background.
incremental_loading
bool
default:"True"
If true, skip already-processed data.
custom_prompt
Optional[str]
default:"None"
Custom system prompt for entity/relationship extraction.
temporal_cognify
bool
default:"False"
Enable temporal-aware processing.
data_per_batch
int
default:"20"
Number of data items per processing batch.

Processing Pipeline

When you call cognify(), data goes through these stages:
  1. Document classification — identify content type
  2. Text chunking — split into manageable segments
  3. Entity extraction — identify entities using the LLM
  4. Relationship detection — find connections between entities
  5. Graph construction — build the knowledge graph
  6. Summarization — generate summaries of content
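As a conceptual toy only (cognee performs these stages with an LLM and embeddings, not regexes), stages 2 and 3 can be illustrated like this:

```python
import re

def chunk_text(text: str, max_words: int = 50) -> list:
    # Stage 2 (toy): split on paragraph breaks, then cap each chunk by word count
    chunks = []
    for para in text.split("\n\n"):
        words = para.split()
        for i in range(0, len(words), max_words):
            chunks.append(" ".join(words[i:i + max_words]))
    return chunks

def extract_entities(chunk: str) -> set:
    # Stage 3 (toy): treat capitalized tokens as candidate entities;
    # cognee delegates this step to the LLM instead
    return set(re.findall(r"\b[A-Z][a-z]+\b", chunk))

text = "Ada Lovelace worked with Charles Babbage.\n\nBabbage designed the Analytical Engine."
for chunk in chunk_text(text):
    print(sorted(extract_entities(chunk)))
# → ['Ada', 'Babbage', 'Charles', 'Lovelace']
# → ['Analytical', 'Babbage', 'Engine']
```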

Examples

import cognee

# Process all datasets
await cognee.cognify()

# Process a specific dataset
await cognee.cognify(datasets=["my_dataset"])

# Process in background
await cognee.cognify(datasets=["large_dataset"], run_in_background=True)

# Use a custom graph model
from pydantic import BaseModel

class MyGraph(BaseModel):
    nodes: list
    edges: list

await cognee.cognify(graph_model=MyGraph)

# Custom extraction prompt
await cognee.cognify(
    custom_prompt="Extract all technical concepts and their relationships."
)