DataPoints: Atomic Units of Knowledge

DataPoints are the smallest building blocks in Cognee.
They represent atomic units of knowledge — carrying both your actual content and the context needed to process, index, and connect it.
They’re the reason Cognee can turn raw documents into something that’s both searchable (via vectors) and connected (via graphs).

What are DataPoints

  • Atomic — each DataPoint represents one concept or unit of information.
  • Structured — implemented as Pydantic models for validation and serialization.
  • Contextual — carry provenance, versioning, and indexing hints so every step downstream knows where data came from and how to use it.

Core Structure

A DataPoint is just a Pydantic model with a set of standard fields.
from typing import List, Optional
from uuid import UUID, uuid4

from pydantic import BaseModel, Field

class DataPoint(BaseModel):
    id: UUID = Field(default_factory=uuid4)
    created_at: int = ...
    updated_at: int = ...
    version: int = 1
    topological_rank: Optional[int] = 0
    metadata: Optional[dict] = {"index_fields": []}
    type: str = "DataPoint"
    belongs_to_set: Optional[List["DataPoint"]] = None
Key fields:
  • id — unique identifier (shared across all three stores, linking vector, graph, and relational records for the same DataPoint)
  • created_at, updated_at — timestamps (ms since epoch)
  • version — for tracking changes and schema evolution
  • topological_rank — an integer indicating the DataPoint’s position in a dependency hierarchy. Lower ranks mean fewer dependencies. For example, an Entity that other DataPoints reference would have a lower rank than a TextSummary that depends on it. Defaults to 0.
  • metadata.index_fields — critical: determines which fields are embedded for vector search
  • type — the Python class name of the DataPoint subclass (e.g., "Person", "Book")
  • belongs_to_set — groups related DataPoints

Indexing & Embeddings

The metadata.index_fields tells Cognee which fields to embed into the vector store. This is the mechanism behind semantic search.
  • Fields in index_fields → converted into embeddings
  • Each indexed field → its own vector collection named Class_field (e.g., a Person DataPoint with index_fields=["name"] creates a Person_name vector collection). The Class part comes from the Python class name of your DataPoint subclass.
  • Non-indexed fields → stay as regular properties in the graph and relational stores
  • Choosing what to index controls search granularity
Cross-store retrieval: When a vector search finds a match, Cognee uses the shared id to retrieve the full DataPoint from the graph store, which holds all properties (not just the indexed field). This is how Cognee returns complete results from a semantic search.
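The naming rule and the shared-id lookup described above can be sketched with plain dictionaries. This is only an illustration: collection_names, lookup_full_record, and the two dict "stores" are hypothetical stand-ins, not Cognee APIs.

```python
from uuid import UUID, uuid4


def collection_names(class_name: str, index_fields: list[str]) -> list[str]:
    """Return one vector collection name per indexed field, as Class_field."""
    return [f"{class_name}_{field}" for field in index_fields]


# Toy stand-ins for the vector and graph stores (plain dicts, not Cognee objects).
person_id: UUID = uuid4()
vector_store = {"Person_name": {person_id: "Ada Lovelace"}}  # indexed field only
graph_store = {person_id: {"name": "Ada Lovelace", "age": 36, "type": "Person"}}


def lookup_full_record(collection: str, matched_id: UUID) -> dict:
    """Resolve a vector hit to the full property set through the shared id."""
    assert matched_id in vector_store[collection]
    return graph_store[matched_id]


names = collection_names("Person", ["name"])
record = lookup_full_record(names[0], person_id)
```

The vector side holds only the embedded field, while the record returned through the shared id also carries non-indexed properties such as age.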

From DataPoints to the Graph

When you call add_data_points(), Cognee automatically:
  • Embeds the indexed fields into vectors
  • Converts the object into nodes and edges in the knowledge graph
  • Stores provenance in the relational store
This is how Cognee creates both semantic similarity (vector) and structural reasoning (graph) from the same unit.

Examples and details

class Person(DataPoint):
    name: str
    age: int
    metadata: dict = {"index_fields": ["name"]}
Only "name" is semantically searchable; age stays a regular property.
class Book(DataPoint):
    title: str
    author: Author
    metadata: dict = {"index_fields": ["title"]}

# Produces:
# Node(Book) with {title, type, ...}
# Node(Author) with {name, type, ...}
# Edge(Book → Author, type="author")

# Simple relationship
author: Author

# With edge metadata
has_items: (Edge(weight=0.8), list[Item])

# List relationship
chapters: list[Chapter]
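To make the object-to-graph mapping concrete, here is a minimal sketch using standard-library dataclasses rather than real DataPoint subclasses. The to_graph helper is hypothetical and only mimics the idea that scalar fields become node properties while fields holding other objects become edges named after the field; it does not cover edge metadata.

```python
from dataclasses import dataclass, fields, is_dataclass


@dataclass
class Author:
    name: str


@dataclass
class Chapter:
    title: str


@dataclass
class Book:
    title: str
    author: Author
    chapters: list


def to_graph(obj, nodes=None, edges=None):
    """Recursively split an object into (nodes, edges).

    Scalar fields become node properties; fields holding other dataclass
    instances (or non-empty lists of them) become edges named after the field.
    """
    nodes = [] if nodes is None else nodes
    edges = [] if edges is None else edges
    props = {"type": type(obj).__name__}
    for f in fields(obj):
        value = getattr(obj, f.name)
        targets = value if isinstance(value, list) else [value]
        if targets and all(is_dataclass(t) for t in targets):
            for target in targets:
                # Edge is (source type, field name, target type).
                edges.append((type(obj).__name__, f.name, type(target).__name__))
                to_graph(target, nodes, edges)
        else:
            props[f.name] = value
    nodes.append(props)
    return nodes, edges


book = Book(
    title="Dune",
    author=Author(name="Frank Herbert"),
    chapters=[Chapter(title="Book One")],
)
nodes, edges = to_graph(book)
```

Here the author field yields a Book → Author edge of type "author", and each element of chapters yields its own Book → Chapter edge.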
Cognee ships with several built-in DataPoint types:
  • Documents — wrappers for source files (Text, PDF, Audio, Image)
    • Document (metadata.index_fields=["name"])
  • Chunks — segmented portions of documents
    • DocumentChunk (metadata.index_fields=["text"])
  • Summaries — generated text or code summaries
    • TextSummary / CodeSummary (metadata.index_fields=["text"])
  • Entities — named objects (people, places, concepts)
    • Entity, EntityType (metadata.index_fields=["name"])
  • Edges — relationships between DataPoints
    • Edge — links between DataPoints
class Product(DataPoint):
    name: str
    description: str
    price: float
    category: Category
    
    # Index name + description for search
    metadata: dict = {"index_fields": ["name", "description"]}
Best Practices:
  • Keep it small — one concept per DataPoint
  • Index carefully — only fields that matter for semantic search
  • Use built-in types first — extend with custom subclasses when needed
  • Version deliberately — track changes with version
  • Group related points — with belongs_to_set
To update a custom DataPoint, mutate its fields, call update_version() to record the change, then re-add it with add_data_points(). The upsert replaces the existing node in all three stores.
from cognee.infrastructure.engine import DataPoint
from cognee.tasks.storage import add_data_points

product = Product(name="Widget", price=9.99, description="Original")
await add_data_points([product])

# Later — update the description
product.description = "Improved description"
product.update_version()   # version → 2, updated_at refreshed
await add_data_points([product])
For documents and files remembered via cognee.remember(), use cognee.update() instead. It replaces the existing item in the target dataset by deleting the old data_id, re-adding the new content, and re-running graph processing for that dataset.

How versioning works

Changing a field on a DataPoint does not automatically create a new revision or persist anything by itself. Versioning is manual:
  • Edit the DataPoint fields
  • Call update_version() if this should count as a new revision
  • Re-add the DataPoint with add_data_points() to persist the updated state
By calling update_version(), you mark your in-memory object as a new revision before writing it back with add_data_points(). It does two things:
  • Increments version by 1. New DataPoints start at version=1.
  • Sets updated_at to the current UTC timestamp in milliseconds.
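The two effects of update_version() can be imitated with a small stand-in class. VersionedRecord and now_ms are hypothetical names used only for this sketch; the real method lives on DataPoint and may differ in detail.

```python
import time
from dataclasses import dataclass, field


def now_ms() -> int:
    """Current time as milliseconds since the Unix epoch (UTC)."""
    return int(time.time() * 1000)


@dataclass
class VersionedRecord:
    description: str
    version: int = 1  # new records start at version 1
    created_at: int = field(default_factory=now_ms)
    updated_at: int = field(default_factory=now_ms)

    def update_version(self) -> None:
        """Mark the in-memory object as a new revision."""
        self.version += 1
        self.updated_at = now_ms()


record = VersionedRecord(description="Original")
record.description = "Improved description"  # editing alone leaves version at 1
version_after_edit = record.version
record.update_version()  # version becomes 2, updated_at is refreshed
```

Nothing in this sketch persists the change, which mirrors the point above: the updated object still has to be written back with add_data_points().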
To delete data, use the cognee.datasets API:
import cognee

# Soft-delete one item by ID (default — marks as deleted)
await cognee.datasets.delete_data(dataset_id=ds.id, data_id=item.id)

# Hard-delete to remove from all stores permanently
await cognee.datasets.delete_data(dataset_id=ds.id, data_id=item.id, mode="hard")

# Remove all items from a dataset (keeps the dataset itself)
await cognee.datasets.empty_dataset(dataset_id=ds.id)

# Delete all datasets for the current user
await cognee.datasets.delete_all()
add_data_points() does not accept a dataset parameter directly. Dataset assignment is carried by the PipelineContext (ctx) that Cognee injects automatically when your task runs inside run_pipeline.
async for _ in run_pipeline(
    tasks=[my_task],          # task calls add_data_points internally
    data=my_data,
    datasets=["my_dataset"],  # <-- this sets ctx.dataset
):
    pass
Inside the task, ctx.dataset holds the resolved dataset object. add_data_points uses it to write provenance records (user, dataset, data item) to the relational store.

If you call add_data_points outside a pipeline (without a ctx), nodes and edges are still written to the graph and vector stores, but no dataset-level provenance is recorded, so the data is not associated with any named dataset.
add_data_points accepts an embed_triplets: bool = False parameter. When set to True, Cognee derives (subject → predicate → object) triplets from the graph edges and indexes each one as a Triplet DataPoint embedding.
await add_data_points([product], embed_triplets=True)
Each triplet is embedded as a single text string in the form:
<subject text> -› <predicate> -› <object text>
These triplets are derived from the graph you just wrote, but they are not added back as extra graph nodes or edges.

This allows vector search to match not just individual nodes, but relationship patterns across the graph.

Use embed_triplets=True when:
  • Your queries describe relationships (e.g., “products made by company X”)
  • You want to retrieve graph edges via semantic similarity, not just individual nodes
Leave it False (the default) for standard node-level retrieval.

For the same idea applied after graph creation, see Triplet Embeddings.
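As a rough illustration of the triplet text format shown above, the following sketch builds the strings that would be embedded. The triplet_text helper and the sample edges are hypothetical; how Cognee picks the subject and object text for each node is not specified here.

```python
def triplet_text(subject: str, predicate: str, obj: str) -> str:
    """Join one graph edge into the single string that gets embedded."""
    return f"{subject} -› {predicate} -› {obj}"


# Hypothetical edges as (subject text, predicate, object text).
edges = [
    ("Widget", "made_by", "Acme"),
    ("Widget", "category", "Tools"),
]
triplet_texts = [triplet_text(*edge) for edge in edges]
```

Because each string names both endpoints and the relationship, a query phrased as a relationship has something closer to match against than a single node's text.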
By default, each DataPoint receives a random UUID4 on instantiation. To make identical entities share the same node — and avoid duplicates — mark one or more fields as the deduplication key.
from typing import Annotated
from cognee.infrastructure.engine import DataPoint, Embeddable, Dedup

class Product(DataPoint):
    sku: Annotated[str, Dedup()]
    name: Annotated[str, Embeddable()]
    price: float
Cognee automatically populates identity_fields from the Dedup() annotations. No explicit metadata declaration is needed.
How it works

When one or more identity fields are defined and all of them resolve on the instance, Cognee generates a deterministic UUID5 from the class name and the field values instead of a random UUID4. Two Product instances with the same sku produce the same id, so add_data_points upserts them onto the same graph node rather than creating a duplicate. If an identity field is missing and has no default, Cognee falls back to a random UUID4.

Checking whether a DataPoint already exists

Because the ID is deterministic, you can check existence by constructing the instance (which generates the identity ID) and querying the graph engine:
from cognee.infrastructure.databases.graph import get_graph_engine

candidate = Product(sku="ABC-123", name="Widget", price=9.99)

graph = await get_graph_engine()
existing = await graph.get_node(str(candidate.id))
if existing:
    print("Already in the graph:", existing)
else:
    print("New entity — id:", candidate.id)
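The deterministic-ID behaviour described above can be sketched with the standard uuid module. This is an assumption-laden illustration: identity_id is a hypothetical helper, and the namespace (NAMESPACE_OID) and key format are chosen for the example, not taken from Cognee's implementation.

```python
from uuid import NAMESPACE_OID, UUID, uuid4, uuid5


def identity_id(class_name: str, identity_values: dict) -> UUID:
    """Derive an id from the class name and identity field values.

    Falls back to a random UUID4 when any identity value is unresolved (None).
    The key layout and namespace are illustrative assumptions.
    """
    if any(value is None for value in identity_values.values()):
        return uuid4()
    key = class_name + "|" + "|".join(
        f"{name}={identity_values[name]}" for name in sorted(identity_values)
    )
    return uuid5(NAMESPACE_OID, key)


first = identity_id("Product", {"sku": "ABC-123"})
second = identity_id("Product", {"sku": "ABC-123"})
other = identity_id("Product", {"sku": "XYZ-999"})
fallback = identity_id("Product", {"sku": None})
```

Equal identity values give equal ids, so a second write lands on the same node, while a different sku or an unresolved identity field yields a distinct id.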

Tasks

Learn how DataPoints are created and processed

Pipelines

See how DataPoints flow through processing workflows

Main Operations

Understand how DataPoints are used in Remember, Improve, and Recall