Cognee Data Processing

Overview

Cognee’s data processing pipeline transforms raw content into a set of interconnected knowledge graphs with vector representations. This document explains how the system processes, stores, and retrieves information.

The DataPoint System

Core Data Structure

At the heart of Cognee is the DataPoint class, the foundation for all data entities:

class DataPoint:
    id: UUID
    created_at: datetime
    updated_at: datetime
    ontology_valid: bool
    version: int
    topological_rank: Optional[int]
    metadata: Dict[str, Any]

Every DataPoint includes:

  • Unique identifier
  • Timestamps for creation/modification
  • Version tracking
  • Metadata for configuration
  • Serialization capabilities (JSON/pickle)
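
To make the listed properties concrete, here is a minimal sketch of a DataPoint-like record built with standard-library dataclasses. It is not Cognee's actual implementation: the class name DataPointSketch, the default values, and the touch and to_json helpers are assumptions made for illustration.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Any, Dict, Optional
from uuid import UUID, uuid4


def _now() -> datetime:
    return datetime.now(timezone.utc)


@dataclass
class DataPointSketch:
    """Illustrative stand-in for DataPoint; not Cognee's actual class."""

    id: UUID = field(default_factory=uuid4)
    created_at: datetime = field(default_factory=_now)
    updated_at: datetime = field(default_factory=_now)
    ontology_valid: bool = False  # assumed default
    version: int = 1  # assumed default
    topological_rank: Optional[int] = None
    metadata: Dict[str, Any] = field(default_factory=dict)

    def touch(self) -> None:
        """Record a modification: bump the version and refresh updated_at."""
        self.version += 1
        self.updated_at = _now()

    def to_json(self) -> str:
        """Serialize to JSON, encoding UUIDs and datetimes as strings."""

        def encode(value: Any) -> str:
            if isinstance(value, UUID):
                return str(value)
            if isinstance(value, datetime):
                return value.isoformat()
            raise TypeError(f"Cannot serialize {type(value).__name__}")

        return json.dumps(asdict(self), default=encode)


point = DataPointSketch(metadata={"note": "example"})
point.touch()
restored = json.loads(point.to_json())
```

After the call to touch, the version is 2 and updated_at is no earlier than created_at; the restored dictionary carries the id as a string.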

Data Hierarchy

Cognee uses specialized DataPoint subclasses:

  • Entity: Named concepts with descriptions
  • DocumentChunk: Text fragments linked to sources
  • Document: Source document metadata
  • EntityType: Categories for entities
  • NodeSet: Collections of related data points
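
The hierarchy above could be modeled roughly as follows. This is a sketch under assumptions: BasePoint stands in for DataPoint and is reduced to an id, and every field name (name, text, chunk_index, source, is_a, members) is hypothetical rather than taken from Cognee's source.

```python
from dataclasses import dataclass, field
from typing import List, Optional
from uuid import UUID, uuid4


@dataclass
class BasePoint:
    """Minimal stand-in for DataPoint, reduced to an id for brevity."""

    id: UUID = field(default_factory=uuid4)


@dataclass
class Document(BasePoint):
    name: str = ""


@dataclass
class DocumentChunk(BasePoint):
    text: str = ""
    chunk_index: int = 0
    source: Optional[Document] = None  # the document this chunk came from


@dataclass
class EntityType(BasePoint):
    name: str = ""


@dataclass
class Entity(BasePoint):
    name: str = ""
    description: str = ""
    is_a: Optional[EntityType] = None  # the category of this entity


@dataclass
class NodeSet(BasePoint):
    name: str = ""
    members: List[BasePoint] = field(default_factory=list)


paper = Document(name="graphene.txt")
chunk = DocumentChunk(text="Graphene is a single layer of carbon atoms.", source=paper)
material = EntityType(name="Material")
graphene = Entity(name="Graphene", description="2D carbon allotrope", is_a=material)
materials = NodeSet(name="materials", members=[graphene, chunk])
```

Because each subclass references other data points directly, the links between chunks, documents, entities, and types are already present in the objects themselves.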

Processing Pipeline

Text Ingestion

When content enters Cognee:

  1. Deduplication: Content hashing prevents redundant storage
  2. Chunking: Text is divided based on configurable strategies:
    • Paragraph-based (default)
    • Sentence-based
    • Word-based
    • Custom patterns
  3. Entity Extraction: Named entities are identified using:
    • Regular expressions
    • NLP models
    • Custom extractors
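
The three ingestion steps can be sketched with the standard library alone. The function names, the choice of SHA-256, and the capitalized-word regex are assumptions for illustration; they show the technique, not Cognee's own code or defaults.

```python
import hashlib
import re
from typing import List, Set

seen_hashes: Set[str] = set()


def is_duplicate(content: str) -> bool:
    """Return True if identical content was already ingested."""
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
    if digest in seen_hashes:
        return True
    seen_hashes.add(digest)
    return False


def chunk_by_paragraph(text: str) -> List[str]:
    """Split on blank lines and drop empty fragments."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def extract_entities(chunk: str) -> List[str]:
    """Naive regex extractor: runs of capitalized words."""
    return re.findall(r"\b[A-Z][a-z]+(?:\s[A-Z][a-z]+)*\b", chunk)


text = "Graphene was isolated in Manchester.\n\nAndre Geim shared the Nobel Prize."
first_seen = is_duplicate(text)   # False: new content
second_seen = is_duplicate(text)  # True: same hash as before
chunks = chunk_by_paragraph(text)
entities = [extract_entities(c) for c in chunks]
```

A regex extractor like this one also picks up any capitalized sentence opener, which is one reason a pipeline might prefer NLP models or custom extractors for real content.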

Relationship Building

Cognee automatically connects data points:

  • Chunks link to source documents
  • Entities connect to chunks where they appear
  • Entities associate with their types
  • Custom relationships can be added based on application needs
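
As a rough illustration, the automatic links listed above can be expressed as (source, relationship, target) triples. The build_edges helper and the relationship names is_part_of, mentioned_in, and is_a are hypothetical labels chosen for this sketch.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str, str]  # (source node, relationship, target node)


def build_edges(
    document: str,
    chunks: Dict[str, List[str]],
    entity_types: Dict[str, str],
) -> List[Edge]:
    """Derive edges from chunk contents.

    chunks maps a chunk id to the entity names found in that chunk;
    entity_types maps an entity name to its type.
    """
    edges: List[Edge] = []
    for chunk_id, entities in chunks.items():
        edges.append((chunk_id, "is_part_of", document))
        for entity in entities:
            edges.append((entity, "mentioned_in", chunk_id))
    for entity, type_name in entity_types.items():
        edges.append((entity, "is_a", type_name))
    return edges


edges = build_edges(
    document="graphene.txt",
    chunks={"chunk-0": ["Graphene"], "chunk-1": ["Graphene", "Andre Geim"]},
    entity_types={"Graphene": "Material", "Andre Geim": "Person"},
)
```

Custom relationships would add further triples with application-specific relationship names.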

Storage Architecture

Vector Databases

Cognee supports multiple vector databases:

  • ChromaDB: Lightweight, easy setup
  • Weaviate: Schema-based with complex filtering
  • PGVector: PostgreSQL extension for production
  • Qdrant: High-performance similarity search
  • Milvus: Scalable for large collections

Graph Databases

For relationship-rich applications, Cognee supports several graph databases:

  • Neo4j: Industry-standard graph database
  • FalkorDB: Redis-based graph processing
  • KuzuDB: High-performance graph database

Search Capabilities

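Cognee offers several search modes. For example, a graph completion search is requested with SearchType.GRAPH_COMPLETION:

results = cognee.search("applications of graphene", search_type=SearchType.GRAPH_COMPLETION)

A graph completion search:

  • Starts with vector search
  • Traverses relationships to find related content
  • Returns results enriched with context from the connected nodes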
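
The vector-then-graph pattern can be sketched in plain Python. Everything here is a toy assumption: the three-dimensional embeddings, the adjacency list, and the vector_then_graph function are invented for illustration and do not reflect how Cognee implements graph completion internally.

```python
import math
from typing import Dict, List

# Toy embeddings and adjacency list; real values would come from the
# configured vector and graph databases.
embeddings: Dict[str, List[float]] = {
    "chunk-graphene": [0.9, 0.1, 0.0],
    "chunk-batteries": [0.2, 0.9, 0.1],
    "chunk-history": [0.0, 0.1, 0.9],
}
graph: Dict[str, List[str]] = {
    "chunk-graphene": ["Graphene"],
    "Graphene": ["chunk-graphene", "chunk-batteries"],
    "chunk-batteries": ["Graphene"],
    "chunk-history": [],
}


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def vector_then_graph(query_vec: List[float], top_k: int = 1, hops: int = 2) -> List[str]:
    """Pick the top_k most similar nodes, then expand along graph edges."""
    ranked = sorted(
        embeddings, key=lambda n: cosine(query_vec, embeddings[n]), reverse=True
    )
    frontier = ranked[:top_k]
    seen = list(frontier)
    for _ in range(hops):
        next_frontier = []
        for node in frontier:
            for neighbor in graph.get(node, []):
                if neighbor not in seen:
                    seen.append(neighbor)
                    next_frontier.append(neighbor)
        frontier = next_frontier
    return seen


context = vector_then_graph([1.0, 0.0, 0.0])
```

Here the vector step alone finds only chunk-graphene; the traversal then reaches the Graphene entity and, through it, chunk-batteries, which the query vector would not have ranked highly by similarity.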

Data Reconstruction

The DataPoint system ensures:

  • Complete text reconstruction from chunks
  • Preserved relationships between entities
  • Original metadata integrity
  • Bidirectional traversal of relationships
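
Text reconstruction can be pictured as sorting a document's chunks by position and joining them. The Chunk fields and the reconstruct function below are hypothetical; the sketch also assumes paragraph chunking, so a blank line is used as the separator.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Chunk:
    document_id: str
    chunk_index: int  # position of the chunk within its document
    text: str


def reconstruct(document_id: str, chunks: List[Chunk], separator: str = "\n\n") -> str:
    """Rebuild a document's text from its chunks, in original order."""
    own = [c for c in chunks if c.document_id == document_id]
    own.sort(key=lambda c: c.chunk_index)
    return separator.join(c.text for c in own)


stored = [
    Chunk("doc-a", 1, "Second paragraph."),
    Chunk("doc-b", 0, "Unrelated text."),
    Chunk("doc-a", 0, "First paragraph."),
]
rebuilt = reconstruct("doc-a", stored)
```

An exact round trip depends on the chunker recording enough information (position and the separator it split on) to undo the split.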

Performance Considerations

  • Vector dimensions affect search quality and speed
  • Chunk size impacts context preservation
  • Entity extraction depth and the extraction prompt influence graph richness
  • Database choice determines scalability characteristics
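
As a back-of-the-envelope illustration of the first point, the raw size of a vector index grows linearly with the embedding dimension. The helper below assumes 4-byte float32 values and ignores index overhead; the dimensions 384 and 1536 are example values, not Cognee defaults.

```python
def raw_index_bytes(num_vectors: int, dimensions: int, bytes_per_value: int = 4) -> int:
    """Raw storage for the vectors alone, excluding any index structures."""
    return num_vectors * dimensions * bytes_per_value


small = raw_index_bytes(1_000_000, 384)   # 1,536,000,000 bytes
large = raw_index_bytes(1_000_000, 1536)  # 6,144,000,000 bytes
```

Quadrupling the dimension quadruples both the storage and the arithmetic per similarity comparison.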

Extending the System

Cognee’s modular architecture allows:

  • Custom chunking strategies
  • New entity extractors
  • Additional relationship types
  • Domain-specific DataPoint subclasses
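
A custom chunking strategy might look like the following sketch, which groups a fixed number of sentences per chunk. The ChunkingStrategy type and the sentence_window_chunker factory are hypothetical; how such a function would be registered with Cognee is not covered here.

```python
import re
from typing import Callable, List

# Hypothetical interface: a strategy takes text and returns a list of chunks.
ChunkingStrategy = Callable[[str], List[str]]


def sentence_window_chunker(sentences_per_chunk: int) -> ChunkingStrategy:
    """Build a strategy that groups consecutive sentences into chunks."""

    def chunk(text: str) -> List[str]:
        # Split after ., ! or ? followed by whitespace.
        sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
        return [
            " ".join(sentences[i : i + sentences_per_chunk])
            for i in range(0, len(sentences), sentences_per_chunk)
        ]

    return chunk


chunker = sentence_window_chunker(2)
pieces = chunker("One. Two! Three? Four.")
```

Treating a strategy as a plain callable keeps it easy to swap one chunker for another without changing the rest of the pipeline.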