
What is the cognify operation

The .cognify operation takes data ingested with Add and turns plain text into structured knowledge: chunks, embeddings, summaries, nodes, and edges that live in Cognee’s vector and graph stores. It prepares your data for downstream operations like Search.
  • Transforms ingested data: builds chunks, embeddings, and summaries
  • Graph creation: extracts entities and relationships to form a knowledge graph
  • Vector indexing: makes everything searchable via embeddings
  • Dataset-scoped: runs per dataset, respecting ownership and permissions
.cognify can be run multiple times as the dataset grows, and Cognee will skip what’s already processed. Read more about incremental loading under Examples and details.
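Conceptually, this skip behavior works like a content-hash check: unchanged content is recognized and not reprocessed. The sketch below is illustrative only; Cognee tracks processed data in its relational store, and this helper is not part of its API:

```python
import hashlib

processed: set[str] = set()  # hashes of content already cognified

def needs_processing(content: str) -> bool:
    """Return True the first time a piece of content is seen."""
    digest = hashlib.sha256(content.encode()).hexdigest()
    if digest in processed:
        return False  # unchanged content is skipped on re-runs
    processed.add(digest)
    return True
```

On a second run over the same dataset, only files whose content produces a new hash would be pushed through the pipeline.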

What happens under the hood

The .cognify pipeline is made of six ordered Tasks. Each task takes the output of the previous one and moves your data closer to becoming a searchable knowledge graph.
  1. Classify documents — wrap each ingested file as a Document object with metadata and optional node sets
  2. Check permissions — enforce that you have write access to the target dataset
  3. Extract chunks — split documents into smaller pieces (paragraphs, sections)
  4. Extract graph — use LLMs to identify entities and relationships, inserting them into the graph DB
  5. Summarize text — generate summaries for each chunk, stored as TextSummary DataPoints
  6. Add data points — embed nodes and summaries, write them into the vector store, and update graph edges
The result is a fully searchable, structured knowledge graph connected to your data.
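As a mental model, the pipeline is a simple fold: each task consumes the previous task's output. The task bodies below are illustrative stand-ins (the real tasks are async and operate on Cognee's internal Document and Chunk types):

```python
from typing import Any, Callable

# Illustrative stand-ins for cognify tasks; real tasks also handle
# permissions, summaries, and vector writes.
def classify(files):
    return [{"doc": f} for f in files]

def chunk(docs):
    return [{"chunk": d["doc"] + ":part1"} for d in docs]

def extract_graph(chunks):
    return [{"node": c["chunk"]} for c in chunks]

def run_pipeline(data: Any, tasks: list[Callable]) -> Any:
    # Each task receives the previous task's output
    for task in tasks:
        data = task(data)
    return data

result = run_pipeline(["a.txt"], [classify, chunk, extract_graph])
```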

After cognify finishes

When .cognify completes for a dataset:
  • DocumentChunks exist in memory as the granular breakdown of your files
  • Summaries are stored and indexed in the vector database for semantic search
  • Knowledge graph nodes and edges are committed to the graph database
  • Dataset metadata is updated with token counts and pipeline status
  • Your dataset is now query-ready: you can run Search or graph queries immediately

Examples and details

  1. Classify documents
    • Turns raw Data rows into Document objects
    • Chooses the right document type (PDF, text, image, audio, etc.)
    • Attaches metadata and optional node sets
  2. Check permissions
    • Verifies that the user has write access to the dataset
  3. Extract chunks
    • Splits documents into DocumentChunks using a chunker
    • You can customize the chunk size and strategy — see Chunkers for details
    • Updates token counts in the relational DB
  4. Extract graph
    • Calls the LLM to extract entities and relationships
    • Deduplicates nodes and edges, commits to the graph DB
  5. Summarize text
    • Generates concise summaries per chunk
    • Stores them as TextSummary DataPoints for vector search
  6. Add data points
    • Converts summaries and other DataPoints into graph + vector nodes
    • Embeds them in the vector store, persists in the graph DB
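Step 4's deduplication can be sketched as keying nodes by their id and edges by their (source, target, relationship) triple, so repeated extractions collapse to one entry. The dict shapes here are illustrative, not Cognee's internal types:

```python
def dedupe_graph(nodes: list[dict], edges: list[dict]) -> tuple[list, list]:
    """Drop duplicate nodes (by id) and edges (by source, target, name)."""
    unique_nodes = {n["id"]: n for n in nodes}
    unique_edges = {(e["source"], e["target"], e["name"]): e for e in edges}
    return list(unique_nodes.values()), list(unique_edges.values())
```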
Cognee ships with several built-in system prompts for entity and relationship extraction, stored in cognee/infrastructure/llm/prompts/. The active prompt is controlled by the GRAPH_PROMPT_PATH environment variable (default: generate_graph_prompt.txt).
  • generate_graph_prompt.txt (default balanced extraction): extracts entities and relationships using the standard Cognee rules: basic node types, human-readable IDs, normalized dates, snake_case relationships, and coreference consistency.
  • generate_graph_prompt_simple.txt (lightweight extraction): uses a shorter, more compact rule set for straightforward graph extraction while keeping the same core conventions around node types, IDs, dates, and relationship naming.
  • generate_graph_prompt_strict.txt (tighter schema control): applies a more explicit prompt with named node categories, stronger relationship constraints, examples, and a strict instruction not to infer facts that are not present in the text.
  • generate_graph_prompt_guided.txt (more directed graph shaping): adds guidance for edge direction, allows multi-word entity labels, and encourages logically implied facts when they improve graph clarity without repeating the same fact.
To switch to a different built-in prompt, set the environment variable:
GRAPH_PROMPT_PATH=generate_graph_prompt_strict.txt
Or configure it at runtime via cognee.config:
import cognee

cognee.config.llm_config.graph_prompt_path = "generate_graph_prompt_strict.txt"
If you need to use a custom prompt, refer to our Custom Prompts guide.
  • Cognify always runs on a dataset
  • You must have write access to the target dataset
  • Permissions are enforced at pipeline start
  • Each dataset maintains its own cognify status and token counts
  • By default, .cognify processes all data in a dataset
  • With incremental_loading=True, only new or updated files are processed
  • Saves time and compute for large, evolving datasets
During the Extract graph step, Cognee asks the LLM to turn each chunk into graph nodes and edges. The names and types in that graph are inferred from your content rather than fixed in advance.
  • Node (vertex): id, a unique identifier derived from the entity name; name, a human-readable label; type, a semantic category such as Person or Organization; description, a short summary
  • Edge (relationship): source_node_id, target_node_id, and relationship_name, a free-text verb phrase such as works_at or produces
The extraction prompt instructs the model to:
  • Capture entities, names, nouns, and implied mentions exhaustively
  • Form relationships as (start_node, relationship_name, end_node) triplets using explicit and inferred connections
  • Avoid duplicates and overly generic terms
That means the resulting graph schema emerges from the data you ingest. Different datasets, prompts, or LLMs can produce slightly different node types and relationship names for similar content.
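Using the fields listed above, the extracted elements can be modeled as plain dataclasses. The example entities below are invented for illustration:

```python
from dataclasses import dataclass

@dataclass
class Node:
    id: str           # derived from the entity name, e.g. "ada_lovelace"
    name: str         # human-readable label
    type: str         # semantic category such as "Person"
    description: str  # short summary

@dataclass
class Edge:
    source_node_id: str
    target_node_id: str
    relationship_name: str  # snake_case verb phrase, e.g. "works_at"

# A (start_node, relationship_name, end_node) triplet as two nodes + an edge
ada = Node("ada_lovelace", "Ada Lovelace", "Person", "19th-century mathematician")
babbage = Node("charles_babbage", "Charles Babbage", "Person", "inventor")
link = Edge(ada.id, babbage.id, "collaborated_with")
```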
If you need tighter control over naming, use an OWL ontology or a custom graph model. See Ontologies and Custom Graph Model.
Once .cognify finishes, the graph schema is inspectable because the extracted node types and relationship names now exist in the graph store.

Python SDK

Use the graph engine directly to inspect the stored nodes and edges:
from cognee.infrastructure.databases.graph import get_graph_engine

graph_engine = await get_graph_engine()

# Returns all nodes and edges
nodes, edges = await graph_engine.get_graph_data()

# Inspect unique node types
node_types = {props.get("type") for _, props in nodes if props.get("type")}
print("Node types:", node_types)

# Inspect unique relationship names
relationship_names = {rel_name for _, _, rel_name, _ in edges}
print("Relationship names:", relationship_names)
get_graph_data() returns:
  • Nodes as (node_id: str, properties: dict)
  • Edges as (source_id: str, target_id: str, relationship_name: str, properties: dict)
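Given those tuple shapes, you can compute simple per-node statistics directly. The sample nodes and edges below are made up for illustration:

```python
from collections import Counter

# Sample data in the (id, properties) / (src, dst, name, properties) shapes
nodes = [("n1", {"type": "Person"}), ("n2", {"type": "Organization"})]
edges = [("n1", "n2", "works_at", {})]

degree: Counter = Counter()
for src, dst, _, _ in edges:
    degree[src] += 1  # out-degree contribution
    degree[dst] += 1  # in-degree contribution
```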
If you only need aggregate information, inspect graph metrics instead:
metrics = await graph_engine.get_graph_metrics()
# Returns: num_nodes, num_edges, mean_degree, edge_density,
#          num_connected_components, sizes_of_connected_components

metrics = await graph_engine.get_graph_metrics(include_optional=True)
print(metrics)
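For reference, mean_degree and edge_density follow the standard graph-theory definitions. A sketch from raw counts (Cognee's exact conventions, e.g. directed vs. undirected counting, may differ):

```python
def graph_metrics(num_nodes: int, num_edges: int) -> dict:
    """Basic metrics from node/edge counts; density uses the directed formula."""
    mean_degree = 2 * num_edges / num_nodes if num_nodes else 0.0
    edge_density = (
        num_edges / (num_nodes * (num_nodes - 1)) if num_nodes > 1 else 0.0
    )
    return {"mean_degree": mean_degree, "edge_density": edge_density}
```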

HTTP server mode

When you run the Cognee HTTP server, you can inspect graph data through the dataset graph endpoint:
GET /api/v1/datasets/{dataset_id}/graph
To explore the same graph visually, use the Graph Visualization guide.
If you update your data model (e.g., add new entity fields or relationships) and want to reprocess existing data:
  1. Delete the dataset first, then re-add and re-cognify:
    # Clear existing processed data
    await cognee.datasets.empty_dataset(dataset_id=my_dataset.id)
    
    # Re-add source files
    await cognee.add(source_files, dataset_name="my_dataset")
    
    # Re-cognify with the updated schema
    await cognee.cognify()
    
  2. Alternatively, use Memify for additive enrichment — it runs extraction and enrichment tasks over the existing graph without re-ingesting data. This is useful when you want to add new derived facts without reprocessing from scratch.
.cognify skips already-processed data by default. Simply re-running .cognify on unchanged files will not pick up schema changes. You must delete and re-add the data, or use memify for enrichment.
  • Vector database contains embeddings for summaries and nodes
  • Graph database contains entities and relationships
  • Relational database tracks token counts and pipeline run status
  • Your dataset is now ready for Search (semantic or graph-based)
The default cognify pipeline makes 2 LLM calls per chunk:
  1. Graph extraction — identifies entities and relationships from the chunk text
  2. Summarization — generates a concise summary of the chunk
Estimating total calls
The number of chunks depends on your document size and the configured chunk_size:
chunks          = ceil(document_tokens / chunk_size)
total_llm_calls = chunks × 2
When chunk_size is not set explicitly, Cognee auto-calculates it as:
chunk_size = min(embedding_model_max_tokens, llm_max_tokens ÷ 2)
With typical defaults (e.g., gpt-4o-mini + text-embedding-3-small) this usually falls in the 1,024–8,192 token range. See Chunkers for details.
Example estimates at chunk_size = 1024:
  • 100 tokens → 1 chunk → 2 LLM calls
  • 1,000 tokens → 1 chunk → 2 LLM calls
  • 10,000 tokens → 10 chunks → 20 LLM calls
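The arithmetic above is easy to wrap in a small estimator. These helpers are not part of the Cognee SDK, just a sketch of the formulas:

```python
import math

def auto_chunk_size(embedding_max_tokens: int, llm_max_tokens: int) -> int:
    # Cognee's default when chunk_size is not set explicitly
    return min(embedding_max_tokens, llm_max_tokens // 2)

def estimate_llm_calls(document_tokens: int, chunk_size: int = 1024,
                       summarize: bool = True) -> tuple[int, int]:
    """Return (chunks, total LLM calls) for the default cognify pipeline."""
    chunks = max(1, math.ceil(document_tokens / chunk_size))
    calls_per_chunk = 2 if summarize else 1  # graph extraction (+ summarization)
    return chunks, chunks * calls_per_chunk
```

For example, a 10,000-token document at chunk_size=1024 yields 10 chunks and 20 calls; dropping summarization halves that to 10.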
Tips for reducing API usage
  • Increase chunk_size — fewer, larger chunks mean fewer calls:
    await cognee.cognify(chunk_size=4096)
    
  • Skip summarization — use a custom pipeline that omits the summarize_text task, reducing calls to 1 per chunk.
  • Enable rate limiting — set LLM_RATE_LIMIT_ENABLED=true to avoid bursting your provider quota when processing many chunks in parallel.

Related

  • Add: first bring data into Cognee
  • Search: query embeddings or graph structures built by Cognify
  • Memify: enrich your graph with derived facts after cognify