DataPoints: The Building Blocks of Knowledge Graphs
Overview
In cognee, a DataPoint is the core building block of a knowledge graph. Each DataPoint type defines both nodes (entities) and their connections (relationships), providing a structured way to model real-world information.
Think of it as a smart container that represents entities and automatically handles their relationships with other entities. Each DataPoint carries its own metadata, relationships, and embedding capabilities. This makes it easy to store, query, and expand knowledge dynamically, without needing to rely on rigid database schemas.
Core DataPoint Structure
Core Concept
- Nodes (Entities): Each DataPoint instance becomes a node in the knowledge graph
- Edges (Relationships): The fields within DataPoints define connections to other DataPoints
- Automatic Graph Formation: When you create DataPoint instances, they automatically form an interconnected graph
Base DataPoint Structure
Every DataPoint in cognee is built on a common foundation:
class DataPoint(BaseModel):
    id: UUID  # Unique identifier
    created_at: int  # Creation timestamp
    updated_at: int  # Last update timestamp
    ontology_valid: bool = False  # Validation status via ontology
    version: int = 1  # Version number for tracking changes
    topological_rank: Optional[int] = 0  # Graph ordering rank
    metadata: Optional[MetaData] = {"index_fields": []}  # Indexing and embedding configuration
    type: str  # Automatically set to class name
    belongs_to_set: Optional[List["DataPoint"]] = None  # Set membership
Key Properties
- Unique Identity: Every DataPoint has a UUID for precise identification
- Versioning: Built-in version tracking with automatic timestamp updates
- Self-Describing: The type field automatically reflects the DataPoint’s class name
- Metadata-Driven: The metadata field controls how the DataPoint is indexed and embedded
- Graph-Aware: Can be part of larger node sets and maintain topological relationships
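The identity, versioning, and self-describing properties can be tried out without installing cognee. The sketch below is a minimal illustration, not cognee's implementation: `SketchPoint` is a hypothetical dataclass standing in for DataPoint, the millisecond timestamp unit is an assumption, and `touch()` is a hypothetical helper for bumping the version.

```python
import time
import uuid
from dataclasses import dataclass, field


def now_ms() -> int:
    """Current time as integer milliseconds (the unit is an assumption)."""
    return int(time.time() * 1000)


@dataclass
class SketchPoint:
    """Hypothetical stand-in for DataPoint; not cognee's implementation."""
    id: uuid.UUID = field(default_factory=uuid.uuid4)
    created_at: int = field(default_factory=now_ms)
    updated_at: int = field(default_factory=now_ms)
    version: int = 1
    type: str = field(init=False, default="")

    def __post_init__(self) -> None:
        # Self-describing: record the concrete class name
        self.type = type(self).__name__

    def touch(self) -> None:
        """Hypothetical helper: bump the version and refresh updated_at."""
        self.version += 1
        self.updated_at = now_ms()


@dataclass
class Person(SketchPoint):
    name: str = ""


alice = Person(name="Alice")
alice.touch()
print(alice.type, alice.version)  # prints "Person 2"
```

Setting `type` in `__post_init__` means subclasses never have to pass it explicitly, which mirrors the "automatically set to class name" behavior described above.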
How DataPoints Work
1. Creating Custom DataPoints
DataPoints are designed to be extended for specific use cases:
from cognee.low_level import DataPoint


class Person(DataPoint):
    name: str
    age: int
    metadata: dict = {"index_fields": ["name"]}  # Name will be embedded/indexed


class Company(DataPoint):
    name: str
    employees: list[Person]  # Relationships to other DataPoints
    metadata: dict = {"index_fields": ["name"]}
2. Embedding and Indexing
The metadata["index_fields"] configuration determines which fields get embedded for semantic search:
- Embeddable Data: Fields listed in index_fields are processed for vector embeddings
- Automatic Indexing: When DataPoints are stored, specified fields are automatically indexed
- Search Optimization: Indexed fields enable fast semantic and vector-based retrieval
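To make the role of index_fields concrete, here is a minimal sketch of selecting the embeddable fields from an object. It assumes nothing about cognee's internals: `Product` is a plain dataclass standing in for a DataPoint subclass, and `embeddable_fields` is a hypothetical helper, not a cognee function.

```python
from dataclasses import dataclass, field


@dataclass
class Product:
    """Hypothetical stand-in for a DataPoint subclass."""
    name: str
    description: str
    price: float
    metadata: dict = field(
        default_factory=lambda: {"index_fields": ["name", "description"]}
    )


def embeddable_fields(point) -> dict[str, str]:
    """Return only the fields named in metadata["index_fields"]."""
    names = point.metadata.get("index_fields", [])
    return {name: str(getattr(point, name)) for name in names}


lamp = Product(name="Desk lamp", description="Adjustable LED lamp", price=39.0)
to_embed = embeddable_fields(lamp)
print(to_embed)
```

Only `name` and `description` are selected; `price` stays a regular node property because it is not listed in index_fields.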
3. Relationships and Graph Structure
DataPoints can reference other DataPoints, creating a rich knowledge graph:
class Department(DataPoint):
    name: str
    employees: list[Person]  # List of Person DataPoints


class Company(DataPoint):
    name: str
    departments: list[Department]  # Nested DataPoint relationships
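One way to see why typed fields are enough to define edges is to inspect the annotations directly. The sketch below is an illustration under stated assumptions: `Node` is a hypothetical dataclass standing in for DataPoint, and `relationship_fields` is a hypothetical helper that reports which fields point at other node types. It does not show how cognee itself discovers relationships.

```python
from dataclasses import dataclass, field
from typing import get_args, get_origin, get_type_hints


@dataclass
class Node:
    """Hypothetical stand-in for DataPoint."""
    name: str


@dataclass
class Person(Node):
    age: int = 0


@dataclass
class Department(Node):
    employees: list[Person] = field(default_factory=list)


@dataclass
class Company(Node):
    departments: list[Department] = field(default_factory=list)


def relationship_fields(cls) -> dict[str, str]:
    """Map each field that points at another Node type to that type's name."""
    found = {}
    for name, hint in get_type_hints(cls).items():
        # Unwrap list[X] to X; leave plain annotations untouched
        target = get_args(hint)[0] if get_origin(hint) is list else hint
        if isinstance(target, type) and issubclass(target, Node):
            found[name] = target.__name__
    return found


print(relationship_fields(Company))     # {'departments': 'Department'}
print(relationship_fields(Department))  # {'employees': 'Person'}
print(relationship_fields(Person))      # {}
```

Scalar fields such as `name` and `age` are skipped, so only the fields that reference other node types would become edges.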
Role in the cognee System
1. Data Ingestion Pipeline
Raw Data → DataPoint Creation → Graph Conversion → Storage → Indexing
- Input Processing: Raw data (documents, code, JSON) is converted into specific DataPoint types
- Graph Generation: DataPoints and their relationships are converted to nodes and edges
- Storage: DataPoints are stored in the graph database
- Vector Indexing: Embeddable fields are indexed in the vector database for search
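The first stage of this pipeline, input processing, can be sketched with the standard library alone. The example assumes a made-up JSON record and uses plain dataclasses as stand-ins for DataPoint subclasses; `company_from_json` is a hypothetical helper. Graph conversion, storage, and indexing are handled by cognee and are not reproduced here.

```python
import json
from dataclasses import dataclass, field


@dataclass
class Person:
    """Hypothetical stand-in for a DataPoint subclass."""
    name: str
    age: int


@dataclass
class Company:
    """Hypothetical stand-in for a DataPoint subclass."""
    name: str
    employees: list[Person] = field(default_factory=list)


# Made-up raw input for illustration
RAW = """
{"name": "Acme", "employees": [
    {"name": "Alice", "age": 34},
    {"name": "Bob", "age": 41}
]}
"""


def company_from_json(text: str) -> Company:
    """Input processing: turn a raw JSON record into typed objects."""
    record = json.loads(text)
    people = [Person(**item) for item in record.get("employees", [])]
    return Company(name=record["name"], employees=people)


acme = company_from_json(RAW)
print(acme.name, [p.name for p in acme.employees])
```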
2. Knowledge Graph Foundation
DataPoints serve as the nodes in cognee’s knowledge graph:
- Each DataPoint becomes a node with its properties
- Relationships between DataPoints become edges
- The graph structure enables complex queries and reasoning
3. Search and Retrieval
DataPoints enable multiple search strategies:
- Semantic Search: Through embedded index_fields
- Graph Traversal: Following relationships between DataPoints
- Hybrid Queries: Combining vector similarity with graph structure
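A hybrid query can be pictured as a vector lookup followed by a graph hop. The toy sketch below assumes hand-written 2-D vectors in place of real embeddings and a dictionary in place of the graph store; the node names and the `hybrid_search` function are hypothetical and do not reflect cognee's search API.

```python
import math

# Toy 2-D vectors standing in for real embeddings of index_fields
EMBEDDINGS = {
    "Alice": (0.9, 0.1),
    "Dune": (0.1, 0.9),
    "City Park": (0.5, 0.5),
}

# Toy adjacency list standing in for edges in the graph store
EDGES = {
    "Alice": ["Dune", "City Park"],
    "Dune": [],
    "City Park": [],
}


def cosine(a, b) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def hybrid_search(query_vec):
    """Pick the most similar node, then expand to its direct neighbors."""
    seed = max(EMBEDDINGS, key=lambda n: cosine(EMBEDDINGS[n], query_vec))
    return seed, EDGES[seed]


seed, neighbors = hybrid_search((1.0, 0.0))
print(seed, neighbors)  # Alice ['Dune', 'City Park']
```

The vector step finds the entry point and the graph step supplies related context that similarity alone would rank low.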
Adding DataPoints to the Graph
The add_data_points function transforms structured data into a fully indexed, queryable knowledge graph:
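import asyncio
import cognee

async def main():
    # alice, book, park, purchase_event, and reading_event are DataPoint
    # instances assumed to have been created earlier
    data_points = [alice, book, park, purchase_event, reading_event]
    # Add to the knowledge graph
    await cognee.add_data_points(data_points)

asyncio.run(main())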
What Happens Under the Hood
When you call add_data_points, cognee:
- Recursive Extraction: Traverses all connected DataPoints, extracting nodes and edges using the ontology-like structure
- Deduplication: Removes duplicate nodes and edges to keep the graph clean
- Graph Storage: Adds cleaned nodes and edges to cognee’s graph engine
- Indexing: Creates indexes for fast lookups and efficient graph traversal
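The extraction and deduplication steps above can be sketched as a recursive walk over connected objects. This is a simplified illustration, not cognee's algorithm: `Node` is a hypothetical dataclass standing in for DataPoint, ids are short strings rather than UUIDs, and `extract_graph` is a hypothetical helper.

```python
from dataclasses import dataclass, field, fields


@dataclass
class Node:
    """Hypothetical stand-in for DataPoint, identified by a string id."""
    id: str


@dataclass
class Person(Node):
    name: str = ""


@dataclass
class Department(Node):
    name: str = ""
    employees: list[Person] = field(default_factory=list)


def extract_graph(roots):
    """Collect unique nodes and (source, field, target) edges."""
    nodes, edges = {}, set()

    def visit(node):
        if node.id in nodes:  # deduplication: each id is stored once
            return
        nodes[node.id] = node
        for f in fields(node):
            value = getattr(node, f.name)
            children = value if isinstance(value, list) else [value]
            for child in children:
                if isinstance(child, Node):
                    edges.add((node.id, f.name, child.id))
                    visit(child)

    for root in roots:
        visit(root)
    return nodes, edges


alice = Person(id="p1", name="Alice")
eng = Department(id="d1", name="Engineering", employees=[alice])
ops = Department(id="d2", name="Operations", employees=[alice])

nodes, edges = extract_graph([eng, ops, alice])
print(sorted(nodes))  # ['d1', 'd2', 'p1']
print(sorted(edges))
```

Alice is reachable three times (from both departments and as a root) but is stored once, while both `employees` edges that point to her are kept.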
Built-in DataPoint Types
Cognee includes several specialized DataPoint subclasses:
- Entity: Named concepts with descriptions and relationships
- DocumentChunk: Text fragments linked to their source documents
- Document: Source document metadata and content
- NodeSet: Collections of related data points for organization
Best Practices
Design Guidelines
- Single Responsibility: Each DataPoint should represent one clear concept
- Clear Relationships: Use descriptive field names for relationships (by, of, at, etc.)
- Consistent Metadata: Always include proper indexing metadata
- Optional Fields: Use Optional types for fields that may not always be present
Performance Considerations
- Index Strategy: Only index fields you’ll frequently query
- Relationship Depth: Consider the complexity of nested relationships
- Batch Operations: Process multiple DataPoints together when possible
- Deduplication: Design DataPoints to avoid unnecessary duplicates
Data Modeling
- Start Simple: Begin with basic entity types and add complexity gradually
- Think in Graphs: Consider how your DataPoints will connect to each other
- Version Control: Use the built-in versioning for data evolution
- Validation: Leverage Pydantic’s validation capabilities
Examples
1. Define Clear Index Fields
# Good: Specify which fields should be searchable
class Product(DataPoint):
    name: str
    description: str
    price: float
    metadata: dict = {"index_fields": ["name", "description"]}
2. Model Relationships Explicitly
# Good: Use typed relationships
class Author(DataPoint):
    name: str


class Book(DataPoint):
    title: str
    author: Author  # Clear relationship
3. Use Descriptive Names
# Good: Clear, descriptive class names (bodies elided)
class GitRepository(DataPoint): ...
class PythonFunction(DataPoint): ...
class CustomerOrder(DataPoint): ...
Next Steps
- Learn more about DataPoints from our blogs. You can start with The Building Blocks of Knowledge Graphs, and you can read about an example where we built DataPoints here.
- Understand how to build complex knowledge graphs with multiple DataPoint types
- Discover tasks and pipelines
DataPoints are the foundation that makes cognee’s knowledge graphs intelligent and interconnected. By understanding this concept, you’re ready to build powerful, relationship-aware knowledge systems.