Cognee Architecture: Dual Storage for Intelligent Memory
Why Two Storage Systems?
Cognee uses both graph databases and vector databases because they solve different, complementary problems in knowledge management:
Vector Database: Semantic Similarity
- Purpose: Stores embeddings of text chunks and datapoints for similarity search
- Strengths: Fast semantic retrieval, "fuzzy" matching, conceptual similarity
- Use cases: "Find content similar to this query", content recommendations, semantic search
- Technology: LanceDB (default), PGVector, Qdrant
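To make "semantic similarity" concrete, here is a minimal sketch of similarity search using cosine similarity over toy 3-dimensional embeddings. The vectors and entry names below are invented for illustration; real embeddings come from cognee's configured embedding model and have hundreds of dimensions:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy embeddings (hypothetical values, not real model output)
embeddings = {
    "machine learning": [0.9, 0.8, 0.1],
    "neural networks":  [0.8, 0.9, 0.2],
    "cooking recipes":  [0.1, 0.2, 0.9],
}

query = [0.85, 0.85, 0.15]  # pretend embedding of "What is deep learning?"

# Rank stored entries by similarity to the query, most similar first
ranked = sorted(
    embeddings,
    key=lambda name: cosine_similarity(query, embeddings[name]),
    reverse=True,
)
print(ranked)  # the ML-related entries rank above "cooking recipes"
```

Note that the unrelated entry is not rejected outright; it simply scores lower, which is what makes vector search a ranking ("fuzzy") mechanism rather than an exact filter.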
Graph Database: Relationships and Structure
- Purpose: Stores entities and their explicit relationships as nodes and edges
- Strengths: Complex traversals, relationship queries, structured knowledge navigation
- Use cases: "Find all authors who wrote about AI and worked at companies founded after 2020"
- Technology: NetworkX (default), Neo4j, KuzuDB
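What a graph store adds is multi-hop traversal: following typed edges transitively. The sketch below uses a plain adjacency dict rather than any particular backend, with illustrative entity and relation names (not cognee's actual schema):

```python
from collections import deque

# A tiny knowledge graph: node -> list of (relation, target) edges.
# Names are illustrative, not cognee's actual schema.
graph = {
    "Deep Learning": [("subset_of", "Machine Learning")],
    "Machine Learning": [("subset_of", "Artificial Intelligence")],
    "Artificial Intelligence": [("studied_in", "Computer Science")],
}

def reachable_via(start: str, relation: str) -> list[str]:
    """Follow edges of one relation type transitively (multi-hop traversal)."""
    found, queue, seen = [], deque([start]), {start}
    while queue:
        node = queue.popleft()
        for rel, target in graph.get(node, []):
            if rel == relation and target not in seen:
                found.append(target)
                seen.add(target)
                queue.append(target)
    return found

print(reachable_via("Deep Learning", "subset_of"))
# ['Machine Learning', 'Artificial Intelligence']
```

A vector store cannot answer this kind of question reliably, because "two subset_of hops away" is a structural fact, not a similarity score.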
How They Work Together
The magic happens when both systems work in concert:
Raw Data → Tasks → DataPoints → Graph Nodes + Vector Embeddings → Hybrid Search
1. During Data Ingestion
# Example: Processing a document about "Machine Learning"
text = "Machine learning is a subset of artificial intelligence..."
# Tasks create DataPoints
entity_ml = Entity(name="Machine Learning", type="Technology")
entity_ai = Entity(name="Artificial Intelligence", type="Field")
relationship = Relationship(source=entity_ml, target=entity_ai, type="subset_of")
# Storage in both systems:
# Graph: ML --[subset_of]--> AI
# Vector: [0.1, 0.4, 0.8, ...] (embedding of "Machine Learning")
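Conceptually, each DataPoint is written to both stores under a shared id, so a graph node can always be linked back to its embedding. A minimal sketch with in-memory stand-ins for the two stores (the field names and store shapes are invented for illustration, not cognee internals):

```python
import uuid

# In-memory stand-ins for the two stores (illustrative, not cognee internals)
graph_store = {"nodes": {}, "edges": []}
vector_store = {}  # id -> embedding

def store_entity(name: str, entity_type: str, embedding: list[float]) -> str:
    """Write one entity to both stores under a shared id."""
    node_id = str(uuid.uuid4())
    graph_store["nodes"][node_id] = {"name": name, "type": entity_type}
    vector_store[node_id] = embedding  # same id links both representations
    return node_id

ml = store_entity("Machine Learning", "Technology", [0.1, 0.4, 0.8])
ai = store_entity("Artificial Intelligence", "Field", [0.2, 0.5, 0.7])

# Graph: ML --[subset_of]--> AI
graph_store["edges"].append((ml, "subset_of", ai))
```

The shared id is the crucial detail: it is what lets a later search hop from a vector match into the graph, and back.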
2. During Search
Cognee combines both approaches for superior results:
Vector Search: Finds semantically similar content
query = "What is deep learning?"
# Vector DB finds: ML, Neural Networks, AI (semantically similar)
Graph Traversal: Explores relationships
# Graph DB explores: Deep Learning --[subset_of]--> ML --[subset_of]--> AI
# Discovers related concepts through relationship chains
Hybrid Results: Best of both worlds
- Vector similarity provides broad semantic matches
- Graph traversal provides precise relational context
- Combined scoring ranks results by both similarity and relationship relevance
System Components
Core Processing Layer
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│  Data Sources   │ →  │      Tasks      │ →  │   DataPoints    │
│  (Files, APIs)  │    │  (Processing)   │    │  (Structured)   │
└─────────────────┘    └─────────────────┘    └─────────────────┘
                                ↓
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   Graph Store   │ ←  │    Pipelines    │ →  │  Vector Store   │
│ (Relationships) │    │ (Orchestration) │    │  (Embeddings)   │
└─────────────────┘    └─────────────────┘    └─────────────────┘
Data Flow Architecture
- Ingestion Layer: Raw data from various sources
- Processing Layer: Tasks and pipelines transform data
- Storage Layer: Dual storage in graph and vector databases
- Query Layer: Hybrid search and retrieval
- Application Layer: SDK, API, UI interfaces
Storage Configurations
Default Setup (No External Dependencies)
# Minimal local setup
GRAPH_DATABASE_PROVIDER="networkx" # In-memory graph
VECTOR_DB_PROVIDER="lancedb" # Local vector storage
DB_PROVIDER="sqlite" # Metadata storage
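The same settings can be exported from Python before cognee is imported. The variable names come from the configuration above; setting them via `os.environ` is one common pattern, shown here as a sketch:

```python
import os

# Select the zero-dependency local backends before importing cognee.
# Variable names match cognee's configuration shown above.
os.environ["GRAPH_DATABASE_PROVIDER"] = "networkx"  # in-memory graph
os.environ["VECTOR_DB_PROVIDER"] = "lancedb"        # local vector storage
os.environ["DB_PROVIDER"] = "sqlite"                # metadata storage

# import cognee  # cognee picks these settings up during configuration
```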
Production Setup (Scalable)
# High-performance setup
GRAPH_DATABASE_PROVIDER="neo4j" # Dedicated graph server
VECTOR_DB_PROVIDER="pgvector" # PostgreSQL with vector extensions
DB_PROVIDER="postgres" # Production metadata storage
Cloud Setup (Managed Services)
# Cloud-native setup
GRAPH_DATABASE_PROVIDER="neo4j" # Neo4j AuraDB
VECTOR_DB_PROVIDER="pinecone" # Managed vector service
DB_PROVIDER="postgres" # Cloud PostgreSQL
Data Processing Workflow
Step 1: Data Ingestion
await cognee.add(["document.pdf", "data.csv", "website.html"])
- Files are parsed and content extracted
- Metadata is stored in relational database
- Content is prepared for processing
Step 2: Knowledge Graph Creation
await cognee.cognify()
- Tasks process the data step by step
- DataPoints are created with structured information
- Graph nodes and edges are generated
- Vector embeddings are created for semantic search
Step 3: Hybrid Querying
results = await cognee.search("Tell me about machine learning")
- Vector search finds semantically relevant content
- Graph traversal explores related concepts
- Results are ranked by combined relevance
Performance and Scalability
Graph Database Benefits
- Complex Queries: Multi-hop relationship traversals
- Real-time Updates: Dynamic graph modifications
- Semantic Consistency: Maintains knowledge structure
Vector Database Benefits
- Fast Similarity: Sub-second semantic search
- Scalable Embeddings: Millions of vectors
- Approximate Matching: Handles query variations
Combined Advantages
- Comprehensive Results: Both structured and semantic matches
- Contextual Ranking: Relationship-aware scoring
- Flexible Querying: Supports diverse question types
Architecture Benefits
This dual-storage architecture provides:
- Semantic Understanding: Vector embeddings capture meaning
- Structural Knowledge: Graph relationships provide context
- Query Flexibility: Support for both similarity and relationship queries
- Scalability: Each storage system optimized for its use case
- Technology Choice: Mix and match databases based on requirements
Next Steps
- Learn about DataPoints that become graph nodes
- Understand Pipelines that orchestrate the transformation
- Explore Search Types that leverage both storage systems
The dual storage architecture is what makes cognee uniquely powerful: it combines the structured reasoning of graphs with the semantic intelligence of vector search.