# DataPoints

> Atomic units of knowledge in Cognee

DataPoints are the smallest building blocks in Cognee.\
They represent **atomic units of knowledge** — carrying both your actual content and the context needed to process, index, and connect it.

They're the reason Cognee can turn raw documents into something that's both **searchable** (via vectors) and **connected** (via graphs).

## What are DataPoints

* **Atomic** — each DataPoint represents one concept or unit of information.
* **Structured** — implemented as [Pydantic](https://docs.pydantic.dev/) models for validation and serialization.
* **Contextual** — carry provenance, versioning, and indexing hints so every step downstream knows where data came from and how to use it.

## Core Structure

A DataPoint is just a Pydantic model with a set of standard fields.

<Accordion title="See example class definition">
  ```python theme={null}
  class DataPoint(BaseModel):
      id: UUID = Field(default_factory=uuid4)
      created_at: int = ...
      updated_at: int = ...
      version: int = 1
      topological_rank: Optional[int] = 0
      metadata: Optional[dict] = {"index_fields": []}
      type: str = "DataPoint"
      belongs_to_set: Optional[List["DataPoint"]] = None
  ```

  Key fields:

  * `id` — unique identifier (shared across all three stores, linking vector, graph, and relational records for the same DataPoint)
  * `created_at`, `updated_at` — timestamps (ms since epoch)
  * `version` — for tracking changes and schema evolution
  * `topological_rank` — an integer indicating the DataPoint's position in a dependency hierarchy. Lower ranks mean fewer dependencies. For example, an `Entity` that other DataPoints reference would have a lower rank than a `TextSummary` that depends on it. Defaults to `0`.
  * `metadata.index_fields` — critical: determines which fields are embedded for vector search
  * `type` — the Python class name of the DataPoint subclass (e.g., `"Person"`, `"Book"`)
  * `belongs_to_set` — groups related DataPoints
</Accordion>

## Indexing & Embeddings

The `metadata.index_fields` list tells Cognee which fields to embed into the vector store.
This is the mechanism behind semantic search.

* Fields in `index_fields` → converted into embeddings
* Each indexed field → its own vector collection named `Class_field` (e.g., a `Person` DataPoint with `index_fields=["name"]` creates a `Person_name` vector collection). The `Class` part comes from the Python class name of your DataPoint subclass.
* Non-indexed fields → stay as regular properties in the graph and relational stores
* Choosing what to index controls search granularity
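
The naming rule above can be pictured with a small, stdlib-only sketch. It does not call Cognee: `Person` is a plain dataclass standing in for a DataPoint subclass, and `vector_collections` and `regular_properties` are hypothetical helper names used only for illustration.

```python
from dataclasses import dataclass, field, fields


@dataclass
class Person:
    """Plain dataclass standing in for a DataPoint subclass (not Cognee's class)."""
    name: str
    age: int
    metadata: dict = field(default_factory=lambda: {"index_fields": ["name"]})


def vector_collections(point) -> list[str]:
    """Return one 'Class_field' collection name per indexed field."""
    class_name = type(point).__name__
    return [f"{class_name}_{name}" for name in point.metadata.get("index_fields", [])]


def regular_properties(point) -> list[str]:
    """Return the fields that are stored as properties but not embedded."""
    indexed = set(point.metadata.get("index_fields", []))
    return [f.name for f in fields(point) if f.name != "metadata" and f.name not in indexed]


person = Person(name="Ada", age=36)
print(vector_collections(person))
print(regular_properties(person))
```

Adding `"age"` to `index_fields` in this sketch would yield a second collection name, which mirrors how each indexed field gets its own collection.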

<Info>
  **Cross-store retrieval:** When a vector search finds a match, Cognee uses the shared `id` to retrieve the full DataPoint from the graph store, which holds all properties (not just the indexed field). This is how Cognee returns complete results from a semantic search.
</Info>
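
A rough picture of that lookup, using in-memory dictionaries as hypothetical stand-ins for the vector and graph stores. The substring match is only a placeholder for embedding similarity, and `search` is an illustrative name, not a Cognee function.

```python
from uuid import uuid4

person_id = uuid4()

# Hypothetical stand-in for the vector store: collection -> id -> indexed text.
vector_store = {"Person_name": {person_id: "Ada Lovelace"}}

# Hypothetical stand-in for the graph store: id -> all properties of the node.
graph_store = {person_id: {"name": "Ada Lovelace", "age": 36, "type": "Person"}}


def search(collection: str, query: str) -> list[dict]:
    """Match on the indexed text, then fetch full records by the shared id."""
    hits = [
        point_id
        for point_id, text in vector_store[collection].items()
        if query.lower() in text.lower()  # placeholder for vector similarity
    ]
    return [graph_store[point_id] for point_id in hits]


results = search("Person_name", "ada")
print(results)
```

The vector side only knows the indexed text; the non-indexed `age` comes back because the shared `id` points at the full record.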

## From DataPoints to the Graph

When you call `add_data_points()`, Cognee automatically:

* Embeds the indexed fields into vectors
* Converts the object into **nodes** and **edges** in the knowledge graph
* Stores provenance in the relational store

This is how Cognee creates both **semantic similarity** (vector) and **structural reasoning** (graph) from the same unit.
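
To make the node-and-edge conversion concrete, here is a minimal stdlib-only sketch. It is an illustration of the idea, not Cognee's implementation: `Author` and `Book` are plain dataclasses, `to_graph` is a hypothetical helper, and list-valued relationships are left out for brevity.

```python
from dataclasses import dataclass, field, fields, is_dataclass
from uuid import UUID, uuid4


@dataclass
class Author:
    name: str
    id: UUID = field(default_factory=uuid4)


@dataclass
class Book:
    title: str
    author: Author
    id: UUID = field(default_factory=uuid4)


def to_graph(point, nodes=None, edges=None):
    """Flatten a nested object into (nodes, edges).

    Scalar fields become node properties. A field holding another
    dataclass instance becomes an edge named after that field.
    """
    nodes = {} if nodes is None else nodes
    edges = [] if edges is None else edges
    props = {"type": type(point).__name__}
    for f in fields(point):
        value = getattr(point, f.name)
        if is_dataclass(value):
            edges.append((point.id, f.name, value.id))
            to_graph(value, nodes, edges)
        elif f.name != "id":
            props[f.name] = value
    nodes[point.id] = props
    return nodes, edges


book = Book(title="Dune", author=Author(name="Frank Herbert"))
nodes, edges = to_graph(book)
print(nodes[book.id], edges)
```

The edge takes its name from the field (`author`), which matches the `Edge(Book → Author, type="author")` shape shown in the examples on this page.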

## Examples and details

<Accordion title="Example: indexing only one field">
  ```python theme={null}
  class Person(DataPoint):
      name: str
      age: int
      metadata: dict = {"index_fields": ["name"]}
  ```

  Only `name` is semantically searchable; `age` is stored as a regular property.
</Accordion>

<Accordion title="Example: Book → Author transformation">
  ```python theme={null}
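  from cognee.infrastructure.engine import DataPoint

  class Author(DataPoint):
      name: str

  class Book(DataPoint):
      title: str
      author: Author  # nested DataPoint becomes a separate node plus an edge
      metadata: dict = {"index_fields": ["title"]}

  # Produces:
  # Node(Book) with {title, type, ...}
  # Node(Author) with {name, type, ...}
  # Edge(Book → Author, type="author")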
  ```
</Accordion>

<Accordion title="Relationship syntax options">
  ```python theme={null}
  # Field declarations inside a DataPoint subclass

  # Simple relationship
  author: Author

  # With edge metadata
  has_items: (Edge(weight=0.8), list[Item])

  # List relationship
  chapters: list[Chapter]
  ```
</Accordion>

<Accordion title="Built-in DataPoint types">
  Cognee ships with several built-in DataPoint types:

  * **Documents** — wrappers for source files (Text, PDF, Audio, Image)
    * `Document` (`metadata.index_fields=["name"]`)
  * **Chunks** — segmented portions of documents
    * `DocumentChunk` (`metadata.index_fields=["text"]`)
  * **Summaries** — generated text or code summaries
    * `TextSummary` / `CodeSummary` (`metadata.index_fields=["text"]`)
  * **Entities** — named objects (people, places, concepts)
    * `Entity`, `EntityType` (`metadata.index_fields=["name"]`)
  * **Edges** — relationships between DataPoints
    * `Edge` — links between DataPoints
</Accordion>

<Accordion title="Example: custom DataPoint with best practices">
  ```python theme={null}
  class Product(DataPoint):
      name: str
      description: str
      price: float
      category: Category
      
      # Index name + description for search
      metadata: dict = {"index_fields": ["name", "description"]}
  ```

  **Best Practices:**

  * **Keep it small** — one concept per DataPoint
  * **Index carefully** — only fields that matter for semantic search
  * **Use built-in types first** — extend with custom subclasses when needed
  * **Version deliberately** — track changes with `version`
  * **Group related points** — with `belongs_to_set`
</Accordion>

<Accordion title="Updating DataPoints">
  To update a custom DataPoint, mutate its fields, call `update_version()` to record the change, then re-add it with `add_data_points()`. The upsert replaces the existing node in all three stores.

  ```python theme={null}
  from cognee.infrastructure.engine import DataPoint
  from cognee.tasks.storage import add_data_points

  # Assumes a Product subclass whose only required fields are name, price, and description
  product = Product(name="Widget", price=9.99, description="Original")
  await add_data_points([product])

  # Later — update the description
  product.description = "Improved description"
  product.update_version()   # version → 2, updated_at refreshed
  await add_data_points([product])
  ```

  For documents and files remembered via `cognee.remember()`, use [`cognee.update()`](/python-api/update) instead — it replaces the existing item in the target dataset by deleting the old `data_id`, re-adding the new content, and re-running graph processing for that dataset.

  **How versioning works**

  Changing a field on a DataPoint does **not** automatically create a new revision or persist anything by itself.
  In other words, versioning is manual:

  * Edit the DataPoint fields
  * Call `update_version()` if this should count as a new revision
  * Re-add the DataPoint with `add_data_points()` to persist the updated state

  Calling `update_version()` marks your in-memory object as a new revision before you write it back with `add_data_points()`. The call does two things:

  * Increments `version` by 1. New DataPoints start at `version=1`.
  * Sets `updated_at` to the current UTC timestamp in milliseconds.
</Accordion>
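
The versioning behavior described under "Updating DataPoints" can be mimicked in a few lines of plain Python. This is a sketch under stated assumptions, not Cognee's code: `VersionedPoint` and `now_ms` are hypothetical names, and nothing here is persisted.

```python
import time
from dataclasses import dataclass, field


def now_ms() -> int:
    """Current time as integer milliseconds since the Unix epoch (UTC)."""
    return int(time.time() * 1000)


@dataclass
class VersionedPoint:
    description: str
    version: int = 1  # new objects start at version 1
    created_at: int = field(default_factory=now_ms)
    updated_at: int = field(default_factory=now_ms)

    def update_version(self) -> None:
        """Mark the in-memory object as a new revision."""
        self.version += 1
        self.updated_at = now_ms()


point = VersionedPoint(description="Original")

# Editing a field does not touch version or updated_at by itself.
point.description = "Improved description"
version_after_edit = point.version

# The revision is only recorded when update_version() is called explicitly.
point.update_version()
print(version_after_edit, point.version)
```

In Cognee the final step would still be re-adding the object with `add_data_points()`; the sketch stops before persistence.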

<Accordion title="Deleting DataPoints">
  Use the [`cognee.datasets`](/python-api/datasets) API:

  ```python theme={null}
  import cognee

  # Soft-delete one item by ID (default — marks as deleted)
  await cognee.datasets.delete_data(dataset_id=ds.id, data_id=item.id)

  # Hard-delete to remove from all stores permanently
  await cognee.datasets.delete_data(dataset_id=ds.id, data_id=item.id, mode="hard")

  # Remove all items from a dataset (keeps the dataset itself)
  await cognee.datasets.empty_dataset(dataset_id=ds.id)

  # Delete all datasets for the current user
  await cognee.datasets.delete_all()
  ```
</Accordion>

<Accordion title="Dataset routing: which dataset receives add_data_points output?">
  `add_data_points()` does not accept a `dataset` parameter directly. Dataset assignment is carried by the [`PipelineContext`](/core-concepts/building-blocks/pipelines) (`ctx`) that Cognee injects automatically when your task runs inside `run_pipeline`.

  ```python theme={null}
  async for _ in run_pipeline(
      tasks=[my_task],          # task calls add_data_points internally
      data=my_data,
      datasets=["my_dataset"],  # <-- this sets ctx.dataset
  ):
      pass
  ```

  Inside the task, `ctx.dataset` holds the resolved dataset object. `add_data_points` uses it to write provenance records (user, dataset, data item) to the relational store.

  If you call `add_data_points` **outside** a pipeline (without a `ctx`), nodes and edges are still written to the graph and vector stores, but no dataset-level provenance is recorded — the data is not associated with any named dataset.
</Accordion>

<Accordion title="embed_triplets: graph-structure embeddings">
  `add_data_points` accepts an `embed_triplets: bool = False` parameter. When set to `True`, Cognee derives `(subject → predicate → object)` triplets from the graph edges and indexes each one as a `Triplet` DataPoint embedding.

  ```python theme={null}
  await add_data_points([product], embed_triplets=True)
  ```

  Each triplet is embedded as a single text string in the form:

  ```
  <subject text> -› <predicate> -› <object text>
  ```

  These triplets are derived from the graph you just wrote, but they are not added back as extra graph nodes or edges.

  This allows vector search to match not just individual nodes, but **relationship patterns** across the graph.

  Use `embed_triplets=True` when:

  * Your queries describe relationships (e.g., "products made by company X")
  * You want to retrieve graph edges via semantic similarity, not just individual nodes

  Leave it `False` (the default) for standard node-level retrieval.

  For the same idea applied after graph creation, see [Triplet Embeddings](/guides/memify-triplet-embeddings).
</Accordion>
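
As a rough illustration of the triplet text format described under `embed_triplets`, the sketch below builds one string per edge. It is stdlib-only and does not call Cognee; `triplet_text` is a hypothetical helper, and the sample edges are made up for the example.

```python
def triplet_text(subject: str, predicate: str, obj: str) -> str:
    """Join the three parts using the separator shown in the format above."""
    return f"{subject} -› {predicate} -› {obj}"


# Made-up edges as (subject text, predicate, object text) tuples.
edges = [
    ("Widget", "made_by", "Acme"),
    ("Widget", "belongs_to", "Hardware"),
]

# One string per edge; each string would be embedded as a single vector.
texts = [triplet_text(*edge) for edge in edges]
for text in texts:
    print(text)
```

Because each string carries both endpoints and the relationship name, a query phrased as a relationship has something closer to match against than a single node's text.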

<Accordion title="Deduplication: preventing duplicate entities">
  By default, each DataPoint receives a random UUID4 on instantiation. To make identical entities share the same node — and avoid duplicates — mark one or more fields as the **deduplication key**.

  <Tabs>
    <Tab title="Option 1: `Dedup()`">
      ```python theme={null}
      from typing import Annotated
      from cognee.infrastructure.engine import DataPoint, Embeddable, Dedup

      class Product(DataPoint):
          sku: Annotated[str, Dedup()]
          name: Annotated[str, Embeddable()]
          price: float
      ```

      Cognee automatically populates `identity_fields` from the `Dedup()` annotations. No explicit `metadata` declaration is needed.
    </Tab>

    <Tab title="Option 2: `identity_fields`">
      ```python theme={null}
      class Product(DataPoint):
          sku: str
          name: str
          price: float
          metadata: dict = {"index_fields": ["name"], "identity_fields": ["sku"]}
      ```
    </Tab>
  </Tabs>

  **How it works**

  When one or more identity fields are defined and all of them resolve on the instance, Cognee generates a deterministic UUID5 from the class name and the field values instead of a random UUID4. Two `Product` instances with the same `sku` produce the same `id`, so `add_data_points` upserts them onto the same graph node rather than creating a duplicate. If an identity field is missing and has no default, Cognee falls back to a random UUID4.

  **Checking whether a DataPoint already exists**

  Because the ID is deterministic, you can check existence by constructing the instance (which generates the identity ID) and querying the graph engine:

  ```python theme={null}
  from cognee.infrastructure.databases.graph import get_graph_engine

  candidate = Product(sku="ABC-123", name="Widget", price=9.99)

  graph = await get_graph_engine()
  existing = await graph.get_node(str(candidate.id))
  if existing:
      print("Already in the graph:", existing)
  else:
      print("New entity — id:", candidate.id)
  ```
</Accordion>
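
The deterministic-ID idea from the deduplication notes can be sketched with Python's `uuid` module. The namespace (`NAMESPACE_OID`) and the way the key string is assembled are assumptions made for this example, and `identity_id` is a hypothetical helper; Cognee's actual derivation may differ.

```python
from uuid import NAMESPACE_OID, UUID, uuid4, uuid5


def identity_id(class_name: str, values: dict, identity_fields: list[str]) -> UUID:
    """Return a UUID5 when every identity field has a value, else a random UUID4."""
    if not identity_fields or any(values.get(name) is None for name in identity_fields):
        return uuid4()
    # Assumed key layout: class name followed by field=value pairs.
    key = "|".join([class_name] + [f"{name}={values[name]}" for name in identity_fields])
    return uuid5(NAMESPACE_OID, key)


first = identity_id("Product", {"sku": "ABC-123", "name": "Widget"}, ["sku"])
second = identity_id("Product", {"sku": "ABC-123", "name": "Gadget"}, ["sku"])
other = identity_id("Product", {"sku": "XYZ-999", "name": "Widget"}, ["sku"])
fallback = identity_id("Product", {"name": "Widget"}, ["sku"])

print(first == second, first == other, fallback.version)
```

Including the class name in the key keeps two different DataPoint types with the same field value from colliding on one ID.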

<Columns cols={3}>
  <Card title="Tasks" icon="square-check" href="/core-concepts/building-blocks/tasks">
    Learn how DataPoints are created and processed
  </Card>

  <Card title="Pipelines" icon="git-merge" href="/core-concepts/building-blocks/pipelines">
    See how DataPoints flow through processing workflows
  </Card>

  <Card title="Main Operations" icon="play" href="/core-concepts/main-operations/remember">
    Understand how DataPoints are used in Remember, Improve, and Recall
  </Card>
</Columns>
