What is the cognify operation
The `.cognify` operation takes the data ingested with Add and turns plain text into structured knowledge: chunks, embeddings, summaries, nodes, and edges that live in Cognee’s vector and graph stores. It prepares your data for downstream operations like Search.
- Transforms ingested data: builds chunks, embeddings, and summaries
- Graph creation: extracts entities and relationships to form a knowledge graph
- Vector indexing: makes everything searchable via embeddings
- Dataset-scoped: runs per dataset, respecting ownership and permissions
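A minimal end-to-end sketch of the flow above, assuming cognee is installed (`pip install cognee`) and an LLM provider key is configured in your environment; the dataset name `demo` is illustrative:

```python
import asyncio


async def main():
    # Assumes `pip install cognee` and an LLM API key in the environment.
    import cognee

    # Ingest raw text into a dataset, then turn it into a knowledge graph.
    await cognee.add("Cognee builds knowledge graphs from documents.", dataset_name="demo")
    await cognee.cognify(datasets=["demo"])


if __name__ == "__main__":
    asyncio.run(main())
```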
`.cognify` can be run multiple times as the dataset grows, and Cognee will skip what’s already processed. Read more about incremental loading in Examples and details.
What happens under the hood
The `.cognify` pipeline is made of six ordered Tasks. Each task takes the output of the previous one and moves your data closer to becoming a searchable knowledge graph.
1. Classify documents — wrap each ingested file as a `Document` object with metadata and optional node sets
2. Check permissions — enforce that you have write access to the target dataset
3. Extract chunks — split documents into smaller pieces (paragraphs, sections)
4. Extract graph — use LLMs to identify entities and relationships, inserting them into the graph DB
5. Summarize text — generate summaries for each chunk, stored as `TextSummary` DataPoints
6. Add data points — embed nodes and summaries, write them into the vector store, and update graph edges
After cognify finishes
When `.cognify` completes for a dataset:
- DocumentChunks exist in memory as the granular breakdown of your files
- Summaries are stored and indexed in the vector database for semantic search
- Knowledge graph nodes and edges are committed to the graph database
- Dataset metadata is updated with token counts and pipeline status
- Your dataset is now query-ready: you can run Search or graph queries immediately
Examples and details
Pipeline tasks (detailed)
1. Classify documents
   - Turns raw `Data` rows into `Document` objects
   - Chooses the right document type (PDF, text, image, audio, etc.)
   - Attaches metadata and optional node sets
2. Check permissions
   - Verifies that the user has write access to the dataset
3. Extract chunks
   - Splits documents into `DocumentChunk`s using a chunker
   - You can customize the chunk size and strategy — see Chunkers for details
   - Updates token counts in the relational DB
4. Extract graph
   - Calls the LLM to extract entities and relationships
   - Deduplicates nodes and edges, commits to the graph DB
5. Summarize text
   - Generates concise summaries per chunk
   - Stores them as `TextSummary` DataPoints for vector search
6. Add data points
   - Converts summaries and other DataPoints into graph + vector nodes
   - Embeds them in the vector store, persists them in the graph DB
Default extraction prompts
Cognee ships with several built-in system prompts for entity and relationship extraction, stored in `cognee/infrastructure/llm/prompts/`. The active prompt is controlled by the `GRAPH_PROMPT_PATH` environment variable (default: `generate_graph_prompt.txt`).

| Prompt file | Use case | What it does |
|---|---|---|
| `generate_graph_prompt.txt` | Default balanced extraction | Extracts entities and relationships using the standard Cognee rules: basic node types, human-readable IDs, normalized dates, snake_case relationships, and coreference consistency. |
| `generate_graph_prompt_simple.txt` | Lightweight extraction | Uses a shorter, more compact rule set for straightforward graph extraction while keeping the same core conventions around node types, IDs, dates, and relationship naming. |
| `generate_graph_prompt_strict.txt` | Tighter schema control | Applies a more explicit prompt with named node categories, stronger relationship constraints, examples, and a strict instruction not to infer facts that are not present in the text. |
| `generate_graph_prompt_guided.txt` | More directed graph shaping | Adds guidance for edge direction, allows multi-word entity labels, and encourages logically implied facts when they improve graph clarity without repeating the same fact. |

To switch to a different built-in prompt, set the `GRAPH_PROMPT_PATH` environment variable, or configure it at runtime via `cognee.config`. If you need to use a custom prompt, refer to our Custom Prompts guide.
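For instance, selecting the stricter built-in prompt before a run might look like this (a sketch; the variable name and prompt filenames come from the table above):

```shell
# Select a built-in extraction prompt before running cognify.
export GRAPH_PROMPT_PATH=generate_graph_prompt_strict.txt
```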
Datasets and permissions
- Cognify always runs on a dataset
- You must have write access to the target dataset
- Permissions are enforced at pipeline start
- Each dataset maintains its own cognify status and token counts
Incremental loading
- By default, `.cognify` processes all data in a dataset
- With `incremental_loading=True`, only new or updated files are processed
- This saves time and compute for large, evolving datasets
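As a sketch (assuming the `incremental_loading` flag is passed directly to `.cognify` and the `demo` dataset already exists), re-processing only new files might look like:

```python
import asyncio


async def refresh_dataset():
    # Assumes cognee is installed and the "demo" dataset was cognified before.
    import cognee

    # Only new or updated files are re-processed on this run.
    await cognee.cognify(datasets=["demo"], incremental_loading=True)


if __name__ == "__main__":
    asyncio.run(refresh_dataset())
```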
How entity and relationship names are determined
During the Extract graph step, Cognee asks the LLM to turn each chunk into graph nodes and edges. The names and types in that graph are inferred from your content rather than fixed in advance.
The extraction prompt instructs the model to:
| Element | Fields |
|---|---|
| Node (vertex) | `id` — unique identifier derived from the entity name; `name` — human-readable label; `type` — semantic category such as Person or Organization; `description` — short summary |
| Edge (relationship) | `source_node_id`, `target_node_id`, `relationship_name` — a free-text verb phrase such as `works_at` or `produces` |
- Capture entities, names, nouns, and implied mentions exhaustively
- Form relationships as `(start_node, relationship_name, end_node)` triplets using explicit and inferred connections
- Avoid duplicates and overly generic terms
If you need tighter control over naming, use an OWL ontology or a custom graph model. See Ontologies and Custom Graph Model.
Inspect extracted graph schema
Once `.cognify` finishes, the graph schema is inspectable because the extracted node types and relationship names now exist in the graph store. To explore the same graph visually, use the Graph Visualization guide.

Python SDK
Use the graph engine directly to inspect the stored nodes and edges. `get_graph_data()` returns:
- Nodes as `(node_id: str, properties: dict)`
- Edges as `(source_id: str, target_id: str, relationship_name: str, properties: dict)`
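A sketch of that inspection; the `get_graph_engine` import path and the `type` property key are assumptions based on the current cognee layout, so adjust them to your installed version:

```python
import asyncio


async def inspect_schema():
    # Assumed import path; adjust to your cognee version.
    from cognee.infrastructure.databases.graph import get_graph_engine

    graph_engine = await get_graph_engine()
    nodes, edges = await graph_engine.get_graph_data()

    # Collect the distinct node types and relationship names in the stored graph.
    node_types = {props.get("type") for _, props in nodes}
    relationship_names = {rel for _, _, rel, _ in edges}
    print(node_types, relationship_names)


if __name__ == "__main__":
    asyncio.run(inspect_schema())
```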
HTTP server mode
HTTP server mode
When you run the Cognee HTTP server, you can inspect graph data through the dataset graph endpoint.
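A hypothetical request shape (the exact path, port, and auth scheme depend on your deployment; check the server's API docs):

```shell
# Hypothetical endpoint shape; substitute your dataset id and auth token.
curl -H "Authorization: Bearer $TOKEN" \
  "http://localhost:8000/api/v1/datasets/<dataset_id>/graph"
```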
Re-cognify after schema changes
If you update your data model (e.g., add new entity fields or relationships) and want to reprocess existing data:
1. Delete the dataset first, then re-add and re-cognify.
2. Alternatively, use Memify for additive enrichment — it runs extraction and enrichment tasks over the existing graph without re-ingesting data. This is useful when you want to add new derived facts without reprocessing from scratch.
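One coarse-grained sketch of the first option uses cognee's prune helpers; note the assumption that pruning resets all stored data and system metadata, not just one dataset:

```python
import asyncio


async def rebuild():
    import cognee

    # Wipe stored data and system metadata (affects every dataset).
    await cognee.prune.prune_data()
    await cognee.prune.prune_system(metadata=True)

    # Re-add and re-cognify with the updated data model in place.
    await cognee.add("Updated source text.", dataset_name="demo")
    await cognee.cognify(datasets=["demo"])


if __name__ == "__main__":
    asyncio.run(rebuild())
```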
Final outcome
- Vector database contains embeddings for summaries and nodes
- Graph database contains entities and relationships
- Relational database tracks token counts and pipeline run status
- Your dataset is now ready for Search (semantic or graph-based)
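Once those stores are populated, querying is a single call. A sketch, assuming `SearchType.GRAPH_COMPLETION` is one of the available search modes and is importable from the top-level package:

```python
import asyncio


async def ask():
    import cognee
    from cognee import SearchType

    # Query the graph and vector stores built by cognify.
    results = await cognee.search(
        query_type=SearchType.GRAPH_COMPLETION,
        query_text="What entities are in the demo dataset?",
    )
    print(results)


if __name__ == "__main__":
    asyncio.run(ask())
```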
LLM call count and cost estimation
The default cognify pipeline makes 2 LLM calls per chunk:
1. Graph extraction — identifies entities and relationships from the chunk text
2. Summarization — generates a concise summary of the chunk

When `chunk_size` is not set explicitly, Cognee auto-calculates it from your model configuration. With typical defaults (e.g., gpt-4o-mini + text-embedding-3-small) this usually falls in the 1 024 – 8 192 token range. See Chunkers for details.

Example estimates at `chunk_size = 1024`:

| Document size | Chunks | LLM calls |
|---|---|---|
| 100 tokens | 1 | 2 |
| 1 000 tokens | 1 | 2 |
| 10 000 tokens | 10 | 20 |

Tips for reducing API usage
- Increase `chunk_size` — fewer, larger chunks mean fewer calls
- Skip summarization — use a custom pipeline that omits the `summarize_text` task, reducing calls to 1 per chunk
- Enable rate limiting — set `LLM_RATE_LIMIT_ENABLED=true` to avoid bursting your provider quota when processing many chunks in parallel
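The estimates in the table above follow from simple arithmetic, sketched here; `calls_per_chunk=2` reflects the default extraction + summarization pair, and drops to 1 if you skip summarization:

```python
import math


def estimate_llm_calls(doc_tokens: int, chunk_size: int = 1024, calls_per_chunk: int = 2) -> int:
    """Rough cost model: every chunk triggers graph extraction and summarization."""
    chunks = max(1, math.ceil(doc_tokens / chunk_size))
    return chunks * calls_per_chunk


print(estimate_llm_calls(10_000))  # 10 chunks at chunk_size=1024 -> 20 calls
```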
Related
- Add — First bring data into Cognee
- Search — Query embeddings or graph structures built by Cognify
- Memify — Enrich your graph with derived facts after cognify