Cognee Data Processing
Overview
Cognee’s data processing pipeline transforms raw content into a set of interconnected knowledge graphs with vector representations. This document explains how the system processes, stores, and retrieves information.
The DataPoint System
Core Data Structure
At the heart of Cognee is the DataPoint class, the foundation for all data entities:
from datetime import datetime
from typing import Any, Dict, Optional
from uuid import UUID

class DataPoint:
    id: UUID
    created_at: datetime
    updated_at: datetime
    ontology_valid: bool
    version: int
    topological_rank: Optional[int]
    metadata: Dict[str, Any]
Every DataPoint includes:
- Unique identifier
- Timestamps for creation/modification
- Version tracking
- Metadata for configuration
- Serialization capabilities (JSON/pickle)
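As a rough illustration of the properties listed above, the sketch below models a DataPoint-like record with standard-library dataclasses and round-trips it through JSON. It is not Cognee's actual implementation: the class name `SketchDataPoint`, the helpers `to_json` and `from_json`, the default values, and the example metadata key are all assumptions made for this example.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Any, Dict, Optional
from uuid import UUID, uuid4


def _now() -> datetime:
    """Current time as a timezone-aware UTC datetime."""
    return datetime.now(timezone.utc)


@dataclass
class SketchDataPoint:
    """Hypothetical stand-in for a DataPoint; not Cognee's real class."""

    id: UUID = field(default_factory=uuid4)
    created_at: datetime = field(default_factory=_now)
    updated_at: datetime = field(default_factory=_now)
    ontology_valid: bool = False
    version: int = 1
    topological_rank: Optional[int] = None
    metadata: Dict[str, Any] = field(default_factory=dict)

    def to_json(self) -> str:
        # UUIDs and datetimes are not JSON-native, so store them as strings.
        payload = asdict(self)
        payload["id"] = str(self.id)
        payload["created_at"] = self.created_at.isoformat()
        payload["updated_at"] = self.updated_at.isoformat()
        return json.dumps(payload)

    @classmethod
    def from_json(cls, raw: str) -> "SketchDataPoint":
        # Reverse the string conversions applied in to_json.
        data = json.loads(raw)
        data["id"] = UUID(data["id"])
        data["created_at"] = datetime.fromisoformat(data["created_at"])
        data["updated_at"] = datetime.fromisoformat(data["updated_at"])
        return cls(**data)


point = SketchDataPoint(metadata={"source": "example"})
restored = SketchDataPoint.from_json(point.to_json())
```

Storing the identifier and timestamps as strings keeps the payload portable across languages, at the cost of an explicit conversion step on load.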
Data Hierarchy
Cognee uses specialized DataPoint subclasses:
- Entity: Named concepts with descriptions
- DocumentChunk: Text fragments linked to sources
- Document: Source document metadata
- EntityType: Categories for entities
- NodeSet: Collections of related data points
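One way to picture this hierarchy is as a small set of dataclasses that share a common base. The sketch below is illustrative only: the base class name `DataPointSketch` and every field name (`is_a`, `is_part_of`, `contains`, `members`, `source_path`, `chunk_index`) are assumptions, not Cognee's real schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional
from uuid import UUID, uuid4


@dataclass
class DataPointSketch:
    """Hypothetical minimal base; a real DataPoint carries more fields."""
    id: UUID


@dataclass
class EntityType(DataPointSketch):
    name: str  # category label, e.g. "Material"


@dataclass
class Entity(DataPointSketch):
    name: str
    description: str
    is_a: Optional[EntityType] = None  # link to the entity's category


@dataclass
class Document(DataPointSketch):
    name: str
    source_path: str  # where the source document came from


@dataclass
class DocumentChunk(DataPointSketch):
    text: str
    chunk_index: int  # position of the chunk within its document
    is_part_of: Document
    contains: List[Entity] = field(default_factory=list)


@dataclass
class NodeSet(DataPointSketch):
    name: str
    members: List[DataPointSketch] = field(default_factory=list)


material = EntityType(id=uuid4(), name="Material")
graphene = Entity(id=uuid4(), name="Graphene",
                  description="A single layer of carbon atoms", is_a=material)
doc = Document(id=uuid4(), name="graphene.txt", source_path="docs/graphene.txt")
chunk = DocumentChunk(id=uuid4(), text="Graphene is a form of carbon.",
                      chunk_index=0, is_part_of=doc, contains=[graphene])
topic = NodeSet(id=uuid4(), name="carbon-materials", members=[graphene, chunk])
```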
Processing Pipeline
Text Ingestion
When content enters Cognee:
- Deduplication: Content hashing prevents redundant storage
- Chunking: Text is divided based on configurable strategies:
  - Paragraph-based (default)
  - Sentence-based
  - Word-based
  - Custom patterns
- Entity Extraction: Named entities are identified using:
  - Regular expressions
  - NLP models
  - Custom extractors
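The three ingestion steps above can be sketched with the standard library alone. This is a simplified illustration, not Cognee's pipeline: `IngestStore`, `chunk_by_paragraph`, `extract_entities`, the choice of SHA-256, and the capitalized-word pattern are all assumptions chosen to keep the example small.

```python
import hashlib
import re
from typing import Dict, List


class IngestStore:
    """Hypothetical store that skips content it has already seen."""

    def __init__(self) -> None:
        self._by_hash: Dict[str, str] = {}

    def add(self, text: str) -> bool:
        """Return True if the text was new, False if it was a duplicate."""
        digest = hashlib.sha256(text.strip().encode("utf-8")).hexdigest()
        if digest in self._by_hash:
            return False
        self._by_hash[digest] = text
        return True


def chunk_by_paragraph(text: str) -> List[str]:
    """Split on blank lines and drop empty fragments."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


# Naive pattern: runs of capitalized words. It also matches ordinary
# sentence-initial words, which is one reason NLP models are often preferred.
NAME_PATTERN = re.compile(r"\b[A-Z][a-z]+(?:\s[A-Z][a-z]+)*\b")


def extract_entities(chunk: str) -> List[str]:
    """Return candidate entity names found in a chunk."""
    return NAME_PATTERN.findall(chunk)


raw = ("Graphene was isolated by Andre Geim.\n\n"
       "Konstantin Novoselov shared the prize.")
store = IngestStore()
first_time = store.add(raw)        # new content is stored
second_time = store.add(raw)       # identical content is rejected
chunks = chunk_by_paragraph(raw)
entities = [extract_entities(c) for c in chunks]
```

Hashing the stripped text means that copies differing only in leading or trailing whitespace count as duplicates; a stricter or looser normalization is an equally valid design choice.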
Relationship Building
Cognee automatically connects data points:
- Chunks link to source documents
- Entities connect to chunks where they appear
- Entities associate with their types
- Custom relationships based on application needs
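The connections listed above can be represented as labeled edges. The following sketch builds them from toy data; the function `build_edges`, the relation names (`is_part_of`, `mentioned_in`, `is_a`), and the case-insensitive substring match are assumptions for illustration and do not reflect how Cognee names or detects relationships.

```python
from typing import Dict, List, Tuple

# (source node, relation label, target node)
Edge = Tuple[str, str, str]


def build_edges(doc_id: str,
                chunks: Dict[str, str],
                entity_types: Dict[str, str]) -> List[Edge]:
    """Link chunks to their document, entities to chunks, entities to types."""
    edges: List[Edge] = []
    for chunk_id, text in chunks.items():
        edges.append((chunk_id, "is_part_of", doc_id))
        for entity in entity_types:
            # Naive detection: case-insensitive substring match.
            if entity.lower() in text.lower():
                edges.append((entity, "mentioned_in", chunk_id))
    for entity, type_name in entity_types.items():
        edges.append((entity, "is_a", type_name))
    return edges


edges = build_edges(
    doc_id="doc1",
    chunks={
        "c1": "Graphene is a form of carbon.",
        "c2": "Andre Geim studied graphene.",
    },
    entity_types={"Graphene": "Material", "Andre Geim": "Person"},
)
```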
Storage Architecture
Vector Databases
Cognee supports multiple vector databases:
- ChromaDB: Lightweight, easy setup
- Weaviate: Schema-based with complex filtering
- PGVector: PostgreSQL extension for production
- Qdrant: High-performance similarity search
- Milvus: Scalable for large collections
Graph Databases
For relationship-rich applications:
- Neo4j: Industry-standard graph database
- FalkorDB: Redis-based graph processing
- KuzuDB: High-performance graph database
Search Capabilities
Cognee offers several search modes:
Graph Completion Search
results = cognee.search("applications of graphene", search_type=SearchType.GRAPH_COMPLETION)
- Starts with vector search
- Traverses relationships to find related content
- Delivers contextually rich results
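To make the two-stage idea concrete, here is a toy in-memory sketch: rank nodes by cosine similarity, then expand the best hits through their graph neighbors. It does not call Cognee or any database; the functions `cosine`, `vector_search`, and `expand`, the two-dimensional embeddings, and the node names are all invented for the example.

```python
import math
from typing import Dict, List, Sequence, Set


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def vector_search(query: Sequence[float],
                  embeddings: Dict[str, Sequence[float]],
                  top_k: int = 1) -> List[str]:
    """Return the top_k node names most similar to the query vector."""
    ranked = sorted(embeddings,
                    key=lambda name: cosine(query, embeddings[name]),
                    reverse=True)
    return ranked[:top_k]


def expand(seeds: List[str], graph: Dict[str, List[str]], hops: int = 1) -> Set[str]:
    """Breadth-first expansion from the seed nodes for a fixed number of hops."""
    seen: Set[str] = set(seeds)
    frontier = list(seeds)
    for _ in range(hops):
        next_frontier: List[str] = []
        for node in frontier:
            for neighbor in graph.get(node, []):
                if neighbor not in seen:
                    seen.add(neighbor)
                    next_frontier.append(neighbor)
        frontier = next_frontier
    return seen


embeddings = {
    "chunk:applications": [0.9, 0.1],
    "chunk:history": [0.1, 0.9],
}
graph = {
    "chunk:applications": ["entity:graphene", "entity:batteries"],
    "chunk:history": ["entity:graphene", "entity:Andre Geim"],
}

seeds = vector_search([1.0, 0.0], embeddings, top_k=1)
context = expand(seeds, graph, hops=1)
```

Raising `hops` pulls in more distant context at the cost of more noise, which is the basic trade-off in this style of retrieval.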
Data Reconstruction
The DataPoint system ensures:
- Complete text reconstruction from chunks
- Preserved relationships between entities
- Original metadata integrity
- Bidirectional traversal of relationships
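Text reconstruction is straightforward if each chunk records its source document and its position. The sketch below assumes exactly that, plus a known separator; the `Chunk` tuple, its field names, and the `reconstruct` helper are hypothetical and only illustrate the idea.

```python
from typing import Iterable, NamedTuple


class Chunk(NamedTuple):
    """Hypothetical chunk record with a positional index."""
    document_id: str
    chunk_index: int
    text: str


def reconstruct(chunks: Iterable[Chunk], document_id: str,
                separator: str = "\n\n") -> str:
    """Rebuild a document by ordering its chunks and joining them."""
    own = sorted((c for c in chunks if c.document_id == document_id),
                 key=lambda c: c.chunk_index)
    return separator.join(c.text for c in own)


# Chunks may come back from storage in any order and mixed across documents.
stored = [
    Chunk("doc1", 1, "It conducts heat well."),
    Chunk("doc2", 0, "Unrelated document."),
    Chunk("doc1", 0, "Graphene is strong."),
]
text = reconstruct(stored, "doc1")
```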
Performance Considerations
- Vector dimensions affect search quality and speed
- Chunk size impacts context preservation
- Entity extraction depth and the extraction prompt influence graph richness
- Database choice determines scalability characteristics
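As a back-of-the-envelope aid for the first point, raw embedding storage grows linearly with vector dimension. The helper below assumes uncompressed 32-bit floats and ignores index overhead; `embedding_bytes` and the example dimensions are illustrative, not figures from Cognee.

```python
def embedding_bytes(num_vectors: int, dimensions: int,
                    bytes_per_value: int = 4) -> int:
    """Raw storage for dense vectors (float32 by default), without index overhead."""
    return num_vectors * dimensions * bytes_per_value


# Compare the footprint of one million vectors at a few example dimensions.
for dims in (384, 768, 1536):
    gib = embedding_bytes(1_000_000, dims) / 2**30
    print(f"{dims:>5} dims: {gib:.2f} GiB")
```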
Extending the System
Cognee’s modular architecture allows:
- Custom chunking strategies
- New entity extractors
- Additional relationship types
- Domain-specific DataPoint subclasses
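A pluggable design for custom chunking could look like the registry below. This is a sketch of the general pattern rather than Cognee's extension API: the `CHUNKERS` registry, the `register_chunker` decorator, and both strategies are hypothetical names introduced for this example.

```python
import re
from typing import Callable, Dict, List

Chunker = Callable[[str], List[str]]

# Hypothetical registry mapping strategy names to chunking functions.
CHUNKERS: Dict[str, Chunker] = {}


def register_chunker(name: str) -> Callable[[Chunker], Chunker]:
    """Decorator that makes a chunking function selectable by name."""
    def decorator(fn: Chunker) -> Chunker:
        CHUNKERS[name] = fn
        return fn
    return decorator


@register_chunker("paragraph")
def by_paragraph(text: str) -> List[str]:
    """Split on blank lines."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


@register_chunker("sentence_pairs")
def by_sentence_pairs(text: str) -> List[str]:
    """Custom strategy: group consecutive sentences two at a time."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return [" ".join(sentences[i:i + 2]) for i in range(0, len(sentences), 2)]


sample = "Graphene is strong. It conducts heat. It is one atom thick."
pairs = CHUNKERS["sentence_pairs"](sample)
```

Selecting a strategy by name keeps the choice in configuration, so a new chunker can be added without touching the code that calls it.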