cognee.cognify()
Description
Transform ingested data into a structured knowledge graph. This is the core processing step in Cognee: it converts raw text and documents into an intelligent knowledge graph by analyzing content, extracting entities and relationships, and creating semantic connections for enhanced search and reasoning.

Prerequisites:
- LLM_API_KEY: Must be configured (required for entity extraction and graph generation)
- Datasets: Must contain data previously added via cognee.add()
- Vector Database: Must be accessible for embeddings storage
- Graph Database: Must be accessible for relationship storage

Content Types: Works with any text-extractable content, including:
- Natural language documents
- Structured data (CSV, JSON)
- Code repositories
- Academic papers and technical documentation
- Mixed multimedia content (with text extraction)
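A minimal end-to-end sketch of the add-then-cognify flow, assuming cognee is installed and LLM_API_KEY is configured (the sample text is illustrative; this is not run here because it calls external services):

```python
import asyncio
import cognee

async def main():
    # Ingest raw text, then transform it into a knowledge graph.
    await cognee.add("Albert Einstein developed the theory of relativity.")
    await cognee.cognify()

if __name__ == "__main__":
    asyncio.run(main())
```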
Processing Stages:
- Document Classification: Identifies document types and structures
- Text Chunking: Breaks content into semantically meaningful segments
- Entity Extraction: Identifies key concepts, people, places, organizations
- Relationship Detection: Discovers connections between entities
- Graph Construction: Builds semantic knowledge graph with embeddings
- Content Summarization: Creates hierarchical summaries for navigation
graph_model parameter allows custom knowledge structures:
- Default: General-purpose KnowledgeGraph for any domain
- Custom Models: Domain-specific schemas (e.g., scientific papers, code analysis)
- Ontology Integration: Use ontology_file_path for predefined vocabularies
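As a sketch of what a domain-specific graph_model could look like, assuming pydantic is installed (the class and field names below are illustrative, not part of the Cognee API):

```python
from typing import List
from pydantic import BaseModel

# Illustrative schema for scientific papers; a model like this could be
# passed as graph_model=PaperGraph instead of the default KnowledgeGraph.
class Author(BaseModel):
    name: str
    affiliation: str

class Paper(BaseModel):
    title: str
    authors: List[Author]
    cites: List[str]  # titles of cited papers

class PaperGraph(BaseModel):
    papers: List[Paper]

g = PaperGraph(papers=[
    Paper(title="Attention Is All You Need",
          authors=[Author(name="A. Vaswani", affiliation="Google")],
          cites=[]),
])
```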
The datasets parameter accepts:
- Single dataset: "my_dataset"
- Multiple datasets: ["docs", "research", "reports"]
- None: Process all datasets for the user
user: User context for authentication and data access. Uses default if None.
graph_model: Pydantic model defining the knowledge graph structure.
Defaults to KnowledgeGraph for general-purpose processing.
chunker: Text chunking strategy (TextChunker, LangchainChunker).
- TextChunker: Paragraph-based chunking (default, most reliable)
- LangchainChunker: Recursive character splitting with overlap
The chunker determines how documents are segmented for processing.
chunk_size: Maximum tokens per chunk. Auto-calculated based on the LLM if None, using:
min(embedding_max_completion_tokens, llm_max_completion_tokens // 2)
Default limits range from roughly 512 to 8192 tokens depending on the models. Smaller chunks give more granular but potentially fragmented knowledge.
chunks_per_batch: Number of chunks processed in a single batch in Cognify tasks.
vector_db_config: Custom vector database configuration for embeddings storage.
graph_db_config: Custom graph database configuration for relationship storage.
run_in_background: If True, starts processing asynchronously and returns immediately; if False, waits for completion before returning. Background mode is recommended for large datasets (>100MB). Use pipeline_run_id from the return value to monitor progress.
custom_prompt: Optional custom prompt string for entity extraction and graph generation. If provided, it is used instead of the default knowledge-graph extraction prompts and should guide the LLM on how to extract entities and relationships from the text content.
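The auto-calculated chunk size can be expressed directly; a small sketch of the stated formula (the token limits passed in are illustrative values):

```python
def default_chunk_size(embedding_max_completion_tokens: int,
                       llm_max_completion_tokens: int) -> int:
    # chunk_size when None: bounded by the embedding model's limit
    # and half of the LLM's completion window.
    return min(embedding_max_completion_tokens, llm_max_completion_tokens // 2)

print(default_chunk_size(8192, 16384))  # 8192 (embedding model is the bound)
print(default_chunk_size(512, 4096))    # 512
```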
Returns
- Blocking mode: Dictionary mapping dataset_id -> PipelineRunInfo with:
- Processing status (completed/failed/in_progress)
- Extracted entity and relationship counts
- Processing duration and resource usage
- Error details if any failures occurred
- Background mode: List of PipelineRunInfo objects for tracking progress
- Use pipeline_run_id to monitor status
- Check completion via pipeline monitoring APIs
Environment Variables
- LLM_API_KEY: API key for your LLM provider
- LLM_PROVIDER, LLM_MODEL, VECTOR_DB_PROVIDER, GRAPH_DATABASE_PROVIDER
- LLM_RATE_LIMIT_ENABLED: Enable rate limiting (default: False)
- LLM_RATE_LIMIT_REQUESTS: Max requests per interval (default: 60)
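A minimal environment setup might look like the following (the provider and model values are illustrative, not defaults confirmed by this page):

```shell
export LLM_API_KEY="your-api-key"      # required
export LLM_PROVIDER="openai"           # illustrative provider/model values
export LLM_MODEL="gpt-4o-mini"
export VECTOR_DB_PROVIDER="lancedb"
export GRAPH_DATABASE_PROVIDER="kuzu"
export LLM_RATE_LIMIT_ENABLED="true"   # default: false
export LLM_RATE_LIMIT_REQUESTS="60"    # max requests per interval
```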
Parameters
- datasets: Dataset name(s) or UUID(s) to process. Processes all datasets if not specified.
- user: User performing the operation.
- graph_model: Pydantic model defining the knowledge graph schema. Defaults to KnowledgeGraph.
- chunker: Text chunking strategy class.
- chunk_size: Maximum size of text chunks in tokens.
- chunks_per_batch: Number of chunks to process per LLM batch.
- Override the full Cognee config for this run.
- vector_db_config: Override vector database configuration.
- graph_db_config: Override graph database configuration.
- run_in_background: If true, return immediately and process in background.
- If true, skip already-processed data.
- custom_prompt: Custom system prompt for entity/relationship extraction.
- Enable temporal-aware processing.
- Number of data items per processing batch.
Processing Pipeline
When you call cognify(), data goes through these stages:
- Document classification — identify content type
- Text chunking — split into manageable segments
- Entity extraction — identify entities using the LLM
- Relationship detection — find connections between entities
- Graph construction — build the knowledge graph
- Summarization — generate summaries of content
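To make the stages concrete, here is a deliberately naive, dependency-free sketch of the same flow. The real pipeline uses an LLM for extraction and vector/graph stores for persistence; the heuristics below (capitalized words as entities, co-occurrence as relationships) are purely illustrative:

```python
import re

def toy_cognify(text: str) -> dict:
    """Toy walk-through of the cognify stages on plain text."""
    # Stages 1-2: classify/chunk -- split into paragraph chunks.
    chunks = [c.strip() for c in text.split("\n\n") if c.strip()]
    # Stage 3: entity extraction -- naively treat capitalized words as entities.
    entities = sorted({w for c in chunks
                       for w in re.findall(r"\b[A-Z][a-z]+\b", c)})
    # Stage 4: relationship detection -- entities sharing a chunk are related.
    edges = set()
    for c in chunks:
        found = [e for e in entities if e in c]
        edges.update((a, b) for i, a in enumerate(found) for b in found[i + 1:])
    # Stage 5: graph construction.
    graph = {"nodes": entities, "edges": sorted(edges)}
    # Stage 6: summarization -- first sentence of each chunk as a naive summary.
    graph["summaries"] = [c.split(". ")[0] for c in chunks]
    return graph

g = toy_cognify("Ada met Charles in London.\n\nGrace wrote a compiler.")
```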