> ## Documentation Index
> Fetch the complete documentation index at: https://docs.cognee.ai/llms.txt
> Use this file to discover all available pages before exploring further.

# DataPoints

> Atomic units of knowledge in Cognee.

# DataPoints: Atomic Units of Knowledge

DataPoints are the smallest building blocks in Cognee.\
They represent **atomic units of knowledge**, carrying both your actual content and the context needed to process, index, and connect it.

They're the reason Cognee can turn raw documents into something that's both **searchable** (via vectors) and **connected** (via graphs).

## What are DataPoints

* **Atomic**: each DataPoint represents one concept or unit of information.
* **Structured**: implemented as [Pydantic](https://docs.pydantic.dev/) models for validation and serialization.
* **Contextual**: carry provenance, versioning, and indexing hints so every step downstream knows where data came from and how to use it.

## Core Structure

A DataPoint is just a Pydantic model with a set of standard fields.

```python
from typing import List, Optional
from uuid import UUID, uuid4

from pydantic import BaseModel, Field


class DataPoint(BaseModel):
    id: UUID = Field(default_factory=uuid4)
    created_at: int = ...  # ms since epoch; default elided in this excerpt
    updated_at: int = ...  # ms since epoch; default elided in this excerpt
    version: int = 1
    topological_rank: Optional[int] = 0
    metadata: MetaData = {"index_fields": []}  # MetaData: Cognee's metadata type (not shown)
    type: str = "DataPoint"
    belongs_to_set: Optional[List["DataPoint"]] = None
```

Key fields:

* `id`: unique identifier (shared across all three stores, linking vector, graph, and relational records for the same DataPoint)
* `created_at`, `updated_at`: timestamps (ms since epoch)
* `version`: for tracking changes and schema evolution
* `topological_rank`: an integer indicating the DataPoint's position in a dependency hierarchy. Lower ranks mean fewer dependencies. For example, an `Entity` that other DataPoints reference would have a lower rank than a `TextSummary` that depends on it. Defaults to `0`.
* `metadata.index_fields`: critical, because it determines which fields are embedded for vector search
* `type`: the Python class name of the DataPoint subclass (e.g., `"Person"`, `"Book"`)
* `belongs_to_set`: groups related DataPoints

## Indexing & Embeddings

The `metadata.index_fields` list tells Cognee which fields to embed into the vector store. This is the mechanism behind semantic search.

* Fields in `index_fields` → converted into embeddings
* Each indexed field → its own vector collection named `Class_field` (e.g., a `Person` DataPoint with `index_fields=["name"]` creates a `Person_name` vector collection). The `Class` part comes from the Python class name of your DataPoint subclass.
* Non-indexed fields → stay as regular properties in the graph and relational stores
* Choosing what to index controls search granularity

**Cross-store retrieval:** When a vector search finds a match, Cognee uses the shared `id` to retrieve the full DataPoint from the graph store, which holds all properties (not just the indexed field). This is how Cognee returns complete results from a semantic search.

For custom scalar properties such as external IDs, labels, and tags, see [Custom Data Models](/guides/custom-data-models#custom-fields-and-read-back).

## From DataPoints to the Graph

When you call `add_data_points()`, Cognee automatically:

* Embeds the indexed fields into vectors
* Converts the object into **nodes** and **edges** in the knowledge graph
* Stores provenance in the relational store

This is how Cognee creates both **semantic similarity** (vector) and **structural reasoning** (graph) from the same unit.
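The `Class_field` collection naming and the shared-`id` lookup described above can be pictured with a toy model. The sketch below uses plain dictionaries as stand-ins for the vector and graph stores. The names `collection_name`, `store_data_point`, and the dictionary layout are hypothetical and are not Cognee's API, and no real embeddings are computed.

```python
from uuid import uuid4

# Toy stand-ins for the real stores (assumption: plain dicts, no embeddings).
vector_store = {}  # collection name -> {id: indexed text}
graph_store = {}   # id -> full property dict


def collection_name(class_name, field):
    """Build the `Class_field` collection name, e.g. Person_name."""
    return f"{class_name}_{field}"


def store_data_point(class_name, properties, index_fields):
    """Write one record to both toy stores under a single shared id."""
    point_id = str(uuid4())
    graph_store[point_id] = {"type": class_name, **properties}
    for field in index_fields:
        collection = vector_store.setdefault(collection_name(class_name, field), {})
        collection[point_id] = properties[field]
    return point_id


person_id = store_data_point("Person", {"name": "Ada", "age": 36}, ["name"])

# A hit in the Person_name collection carries only the id and the indexed text.
hit_id = next(iter(vector_store["Person_name"]))

# The shared id recovers every property from the graph side.
full_record = graph_store[hit_id]
```

Only `name` lands in a vector collection here, while `age` is reachable solely through the graph-side record, which mirrors the split between indexed and non-indexed fields.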
## Examples and details

A DataPoint with a single indexed field:

```python
from typing import Annotated
from cognee.infrastructure.engine import DataPoint, Embeddable

class Person(DataPoint):
    name: Annotated[str, Embeddable()]
    age: int
```

Only `"name"` is semantically searchable.

A field typed as another DataPoint becomes an edge in the graph:

```python
from typing import Annotated
from cognee.infrastructure.engine import DataPoint, Embeddable

class Author(DataPoint):  # minimal Author model, assumed for this example
    name: str

class Book(DataPoint):
    title: Annotated[str, Embeddable()]
    author: Author

# Produces:
# Node(Book) with {title, type, ...}
# Node(Author) with {name, type, ...}
# Edge(Book → Author, type="author")
```

Relationship fields can be declared in several ways (these are field declarations that belong inside a DataPoint class body):

```python
# Simple relationship
author: Author

# With edge metadata
has_items: (Edge(weight=0.8), list[Item])

# List relationship
chapters: list[Chapter]
```

Cognee ships with several built-in DataPoint types:

* **Documents**: wrappers for source files (Text, PDF, Audio, Image)
  * `Document` (`metadata.index_fields=["name"]`)
* **Chunks**: segmented portions of documents
  * `DocumentChunk` (`metadata.index_fields=["text"]`)
* **Summaries**: generated text or code summaries
  * `TextSummary` / `CodeSummary` / `GlobalContextSummary` (`metadata.index_fields=["text"]`)
  * `GlobalContextSummary` powers the [Global Context Index](/core-concepts/further-concepts/global-context-index)
* **Entities**: named objects (people, places, concepts)
  * `Entity`, `EntityType` (`metadata.index_fields=["name"]`)
* **Edges**: relationships between DataPoints
  * `Edge`: links between DataPoints

A custom DataPoint with two indexed fields and one relationship:

```python
from typing import Annotated
from cognee.infrastructure.engine import DataPoint, Embeddable

class Product(DataPoint):
    name: Annotated[str, Embeddable()]
    description: Annotated[str, Embeddable()]
    price: float
    category: Category  # assumed to be another DataPoint subclass defined elsewhere
```

**Best Practices:**

* **Keep it small**: one concept per DataPoint
* **Index carefully**: only fields that matter for semantic search
* **Use built-in types first**: extend with custom subclasses when needed
* **Version deliberately**: track changes with
`version`
* **Group related points**: with `belongs_to_set`

To update a custom DataPoint, mutate its fields, call `update_version()` to record the change, then re-add it with `add_data_points()`. The upsert replaces the existing node in all three stores.

```python
from cognee.infrastructure.engine import DataPoint
from cognee.tasks.storage import add_data_points

product = Product(name="Widget", price=9.99, description="Original")
await add_data_points([product])

# Later: update the description
product.description = "Improved description"
product.update_version()  # version → 2, updated_at refreshed
await add_data_points([product])
```

For documents and files remembered via `cognee.remember()`, use [`cognee.update()`](/python-api/update) instead. It replaces the existing item in the target dataset by deleting the old `data_id`, re-adding the new content, and re-running graph processing for that dataset.

**How versioning works**

Changing a field on a DataPoint does **not** automatically create a new revision or persist anything by itself. In other words, versioning is manual:

* Edit the DataPoint fields
* Call `update_version()` if this should count as a new revision
* Re-add the DataPoint with `add_data_points()` to persist the updated state

By calling `update_version()`, you mark your in-memory object as a new revision before writing it back with `add_data_points()`. It does two things:

* Increments `version` by 1. New DataPoints start at `version=1`.
* Sets `updated_at` to the current UTC timestamp in milliseconds.
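The manual versioning flow can be mimicked without Cognee installed. The class below is a hypothetical stand-in, not Cognee's `DataPoint`: `VersionedRecord` and `now_ms` are names invented for this sketch, and the class only reproduces the two effects described for `update_version()` (increment `version`, refresh `updated_at` in milliseconds).

```python
import time
from dataclasses import dataclass, field


def now_ms():
    """Current time as milliseconds since the Unix epoch."""
    return int(time.time() * 1000)


@dataclass
class VersionedRecord:
    """Hypothetical stand-in that mirrors the documented versioning fields."""

    description: str
    version: int = 1
    created_at: int = field(default_factory=now_ms)
    updated_at: int = field(default_factory=now_ms)

    def update_version(self):
        # Mirror the two documented effects: bump version, refresh timestamp.
        self.version += 1
        self.updated_at = now_ms()


record = VersionedRecord(description="Original")

# Editing a field alone leaves the version untouched.
record.description = "Improved description"
version_after_edit = record.version

# Marking the revision is an explicit, separate step.
record.update_version()
version_after_update = record.version
```

Nothing in this sketch persists data. In Cognee, the write happens only when the object is passed back to `add_data_points()`.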
To delete data items or whole datasets, use the [`cognee.datasets`](/python-api/datasets) API:

```python
import cognee

# Soft-delete one item by ID (default: marks as deleted)
await cognee.datasets.delete_data(dataset_id=ds.id, data_id=item.id)

# Hard-delete to remove from all stores permanently
await cognee.datasets.delete_data(dataset_id=ds.id, data_id=item.id, mode="hard")

# Remove all items from a dataset (keeps the dataset itself)
await cognee.datasets.empty_dataset(dataset_id=ds.id)

# Delete all datasets for the current user
await cognee.datasets.delete_all()
```

`add_data_points()` does not accept a `dataset` parameter directly. Dataset assignment is carried by the [`PipelineContext`](/core-concepts/building-blocks/pipelines) (`ctx`) that Cognee injects automatically when your task runs inside `run_pipeline`.

```python
async for _ in run_pipeline(
    tasks=[my_task],           # task calls add_data_points internally
    data=my_data,
    datasets=["my_dataset"],   # <-- this sets ctx.dataset
):
    pass
```

Inside the task, `ctx.dataset` holds the resolved dataset object. `add_data_points` uses it to write provenance records (user, dataset, data item) to the relational store.

If you call `add_data_points` **outside** a pipeline (without a `ctx`), nodes and edges are still written to the graph and vector stores, but no dataset-level provenance is recorded, so the data is not associated with any named dataset.

`add_data_points` accepts an `embed_triplets: bool = False` parameter. When set to `True`, Cognee derives `(subject → predicate → object)` triplets from the graph edges and indexes each one as a `Triplet` DataPoint embedding.

```python
await add_data_points([product], embed_triplets=True)
```

Each triplet is embedded as a single text string in the form:

```
-› -›