
Cognee Architecture: Dual Storage for Intelligent Memory

Why Two Storage Systems?

Cognee uses both graph databases and vector databases because they solve different, complementary problems in knowledge management:

Vector Database: Semantic Similarity

  • Purpose: Stores embeddings of text chunks and datapoints for similarity search
  • Strengths: Fast semantic retrieval, “fuzzy” matching, conceptual similarity
  • Use cases: “Find content similar to this query”, content recommendations, semantic search
  • Technology: LanceDB (default), PGVector, Qdrant

Graph Database: Relationships and Structure

  • Purpose: Stores entities and their explicit relationships as nodes and edges
  • Strengths: Complex traversals, relationship queries, structured knowledge navigation
  • Use cases: “Find all authors who wrote about AI and worked at companies founded after 2020”
  • Technology: NetworkX (default), Neo4j, KuzuDB

How They Work Together

The magic happens when both systems work in concert:

Raw Data → Tasks → DataPoints → Graph Nodes + Vector Embeddings → Hybrid Search

1. During Data Ingestion

```python
# Example: Processing a document about "Machine Learning"
text = "Machine learning is a subset of artificial intelligence..."

# Tasks create DataPoints
entity_ml = Entity(name="Machine Learning", type="Technology")
entity_ai = Entity(name="Artificial Intelligence", type="Field")
relationship = Relationship(source=entity_ml, target=entity_ai, type="subset_of")

# Storage in both systems:
# Graph:  ML --[subset_of]--> AI
# Vector: [0.1, 0.4, 0.8, ...] (embedding of "Machine Learning")
```

2. During Querying

Cognee combines both approaches for superior results:

Vector Search: Finds semantically similar content

```python
query = "What is deep learning?"
# Vector DB finds: ML, Neural Networks, AI (semantically similar)
```

Graph Traversal: Explores relationships

```python
# Graph DB explores:
#   Deep Learning --[subset_of]--> ML --[subset_of]--> AI
# Discovers related concepts through relationship chains
```
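To make the traversal concrete, here is a toy breadth-first walk over a plain adjacency list standing in for the graph store. NetworkX, the default backend, answers the same reachability questions; a dict keeps this sketch dependency-free.

```python
from collections import deque

# Toy knowledge graph as an adjacency list: node -> [(relation, target), ...]
# Mirrors the chain Deep Learning --[subset_of]--> ML --[subset_of]--> AI.
GRAPH = {
    "Deep Learning": [("subset_of", "Machine Learning")],
    "Machine Learning": [("subset_of", "Artificial Intelligence")],
    "Artificial Intelligence": [],
}

def traverse(graph, start):
    """Breadth-first walk collecting every relationship reachable from `start`."""
    seen, queue, chains = {start}, deque([start]), []
    while queue:
        node = queue.popleft()
        for relation, target in graph.get(node, []):
            chains.append((node, relation, target))
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return chains

print(traverse(GRAPH, "Deep Learning"))
# [('Deep Learning', 'subset_of', 'Machine Learning'),
#  ('Machine Learning', 'subset_of', 'Artificial Intelligence')]
```

Starting from "Deep Learning", the walk follows the relationship chain and surfaces "Artificial Intelligence" even though the query never mentioned it — exactly the kind of multi-hop discovery a vector search alone cannot do.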

Hybrid Results: Best of both worlds

  • Vector similarity provides broad semantic matches
  • Graph traversal provides precise relational context
  • Combined scoring ranks results by both similarity and relationship relevance
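One way such combined scoring could work is sketched below. The 0.7 weighting and the 1/(1 + hops) decay are illustrative assumptions, not cognee's actual ranking formula.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def hybrid_score(query_vec, doc_vec, graph_hops, alpha=0.7):
    """Blend semantic similarity with a relationship-distance decay.

    `graph_hops` is the shortest-path length from the query's anchor node
    to the candidate; closer nodes score higher. Both `alpha` and the
    decay are illustrative choices.
    """
    semantic = cosine(query_vec, doc_vec)
    relational = 1.0 / (1 + graph_hops)
    return alpha * semantic + (1 - alpha) * relational

# A direct neighbour with moderate similarity can outrank a distant
# node with slightly higher similarity:
near = hybrid_score([1, 0], [0.8, 0.6], graph_hops=1)
far = hybrid_score([1, 0], [0.9, 0.4], graph_hops=4)
print(near > far)  # True
```

The design point is that neither signal dominates: a candidate must be both semantically plausible and relationally close to rank highly.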

System Components

Core Processing Layer

```
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│  Data Sources   │  →  │      Tasks      │  →  │   DataPoints    │
│  (Files, APIs)  │     │  (Processing)   │     │  (Structured)   │
└─────────────────┘     └─────────────────┘     └─────────────────┘
                                 ↓
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│   Graph Store   │  ←  │    Pipelines    │  →  │  Vector Store   │
│ (Relationships) │     │ (Orchestration) │     │  (Embeddings)   │
└─────────────────┘     └─────────────────┘     └─────────────────┘
```

Data Flow Architecture

  1. Ingestion Layer: Raw data from various sources
  2. Processing Layer: Tasks and pipelines transform data
  3. Storage Layer: Dual storage in graph and vector databases
  4. Query Layer: Hybrid search and retrieval
  5. Application Layer: SDK, API, UI interfaces

Storage Configurations

Default Setup (No External Dependencies)

```shell
# Minimal local setup
GRAPH_DATABASE_PROVIDER="networkx"   # In-memory graph
VECTOR_DB_PROVIDER="lancedb"         # Local vector storage
DB_PROVIDER="sqlite"                 # Metadata storage
```

Production Setup (Scalable)

```shell
# High-performance setup
GRAPH_DATABASE_PROVIDER="neo4j"      # Dedicated graph server
VECTOR_DB_PROVIDER="pgvector"        # PostgreSQL with vector extensions
DB_PROVIDER="postgres"               # Production metadata storage
```

Cloud Setup (Managed Services)

```shell
# Cloud-native setup
GRAPH_DATABASE_PROVIDER="neo4j"      # Neo4j AuraDB
VECTOR_DB_PROVIDER="pinecone"        # Managed vector service
DB_PROVIDER="postgres"               # Cloud PostgreSQL
```
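Reading these variables at startup might look like the sketch below. `load_storage_config` is a hypothetical helper, not part of cognee's API; its defaults match the zero-dependency minimal setup.

```python
import os

def load_storage_config(env=None):
    """Hypothetical helper: pick storage providers from the environment,
    falling back to the dependency-free defaults (networkx / lancedb / sqlite)."""
    if env is None:
        env = os.environ
    return {
        "graph": env.get("GRAPH_DATABASE_PROVIDER", "networkx"),
        "vector": env.get("VECTOR_DB_PROVIDER", "lancedb"),
        "relational": env.get("DB_PROVIDER", "sqlite"),
    }

print(load_storage_config({}))
# {'graph': 'networkx', 'vector': 'lancedb', 'relational': 'sqlite'}

print(load_storage_config({"GRAPH_DATABASE_PROVIDER": "neo4j",
                           "VECTOR_DB_PROVIDER": "pgvector",
                           "DB_PROVIDER": "postgres"}))
# {'graph': 'neo4j', 'vector': 'pgvector', 'relational': 'postgres'}
```

Because each provider is chosen independently, any of the three setups above — or a mix of them — reduces to setting three environment variables.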

Data Processing Workflow

Step 1: Data Ingestion

```python
await cognee.add(["document.pdf", "data.csv", "website.html"])
```

  • Files are parsed and content extracted
  • Metadata is stored in relational database
  • Content is prepared for processing

Step 2: Knowledge Graph Creation

```python
await cognee.cognify()
```

  • Tasks process the data step by step
  • DataPoints are created with structured information
  • Graph nodes and edges are generated
  • Vector embeddings are created for semantic search
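As a heavily simplified stand-in for this stage, the hypothetical `toy_cognify` below matches known entities with regular expressions and builds a hashed bag-of-words vector, where real cognee uses LLM-backed extraction and a learned embedding model.

```python
import re

def toy_cognify(text, known_entities):
    """Toy cognify: find entity mentions in a chunk, link consecutive
    mentions, and produce a crude embedding. Real cognee extracts
    entities with an LLM and embeds text with a trained model."""
    mentions = [e for e in known_entities if re.search(re.escape(e), text, re.I)]
    nodes = [{"name": m, "type": "Entity"} for m in mentions]
    edges = [{"source": a, "target": b, "type": "mentioned_with"}
             for a, b in zip(mentions, mentions[1:])]
    # Hashed bag-of-words: a stand-in for a real embedding model.
    vector = [0.0] * 8
    for token in re.findall(r"\w+", text.lower()):
        vector[hash(token) % 8] += 1.0
    return {"nodes": nodes, "edges": edges, "embedding": vector}

result = toy_cognify(
    "Machine learning is a subset of artificial intelligence.",
    ["Machine Learning", "Artificial Intelligence"],
)
print([n["name"] for n in result["nodes"]])
# ['Machine Learning', 'Artificial Intelligence']
```

The key structural point survives the simplification: a single pass over a chunk yields both graph material (nodes, edges) and vector material (an embedding), feeding the two stores from the same DataPoints.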

Step 3: Hybrid Querying

```python
results = await cognee.search("Tell me about machine learning")
```

  • Vector search finds semantically relevant content
  • Graph traversal explores related concepts
  • Results are ranked by combined relevance

Performance and Scalability

Graph Database Benefits

  • Complex Queries: Multi-hop relationship traversals
  • Real-time Updates: Dynamic graph modifications
  • Semantic Consistency: Maintains knowledge structure

Vector Database Benefits

  • Fast Similarity: Sub-second semantic search
  • Scalable Embeddings: Millions of vectors
  • Approximate Matching: Handles query variations
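The similarity search itself can be illustrated with a brute-force scan over hand-written toy vectors; a production vector store replaces the scan with an approximate nearest-neighbour index and the toy vectors with model-generated embeddings.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

# Toy 3-d "embeddings"; real stores hold model-generated vectors and
# answer this query via an approximate index rather than a full scan.
STORE = {
    "Machine Learning": [0.9, 0.1, 0.2],
    "Neural Networks": [0.8, 0.2, 0.3],
    "Cooking Recipes": [0.1, 0.9, 0.1],
}

def top_k(query_vec, store, k=2):
    """Rank stored items by similarity to the query and keep the best k."""
    ranked = sorted(store, key=lambda name: cosine(query_vec, store[name]),
                    reverse=True)
    return ranked[:k]

print(top_k([0.85, 0.15, 0.25], STORE))
# ['Machine Learning', 'Neural Networks'] -- 'Cooking Recipes' is filtered out
```

Because ranking is by geometric closeness rather than exact keyword overlap, rephrased queries that land near the same region of the embedding space retrieve the same results.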

Combined Advantages

  • Comprehensive Results: Both structured and semantic matches
  • Contextual Ranking: Relationship-aware scoring
  • Flexible Querying: Supports diverse question types

Architecture Benefits

This dual-storage architecture provides:

  1. Semantic Understanding: Vector embeddings capture meaning
  2. Structural Knowledge: Graph relationships provide context
  3. Query Flexibility: Support for both similarity and relationship queries
  4. Scalability: Each storage system optimized for its use case
  5. Technology Choice: Mix and match databases based on requirements

Next Steps

  • Learn about DataPoints that become graph nodes
  • Understand Pipelines that orchestrate the transformation
  • Explore Search Types that leverage both storage systems

The dual storage architecture is what makes cognee uniquely powerful: it combines the structured reasoning of graphs with the semantic intelligence of vector search.
