What is a dataset in Cognee?

A dataset is a named container that groups documents and their metadata. It is the main boundary for:
  • Organizing content
  • Running pipelines
  • Applying permissions

Note: Dataset isolation requires specific configuration. See the permissions system documentation for details on access control requirements and supported database setups.

Each core operation works in dataset scope:
  • Remember:
    • Directs new content into a specific dataset (by name or ID)
    • Creates the dataset if it doesn’t exist and associates your permissions with it
    • Links ingested items to that dataset and deduplicates them within it
  • Improve:
    • Runs enrichment against a chosen dataset
    • Loads the dataset’s existing graph, checks your permissions, and runs the improvement pipeline in dataset scope
    • Lets you deepen or bridge memory without re-ingesting the source data
  • Recall:
    • Queries can be scoped by dataset
    • Results and metrics remain separated by dataset
  • Forget:
    • Removes memory at item, dataset, or full-user scope
    • Uses dataset permissions to decide what the current user can remove
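
The dataset scoping described above can be illustrated with a small in-memory model. This is a sketch only: `ToyMemory` and its methods are hypothetical names chosen to mirror the operations, not Cognee's API, and the content-hash deduplication is an assumption about how duplicates might be detected. It shows a dataset being created on first use, deduplication within a dataset, dataset-scoped queries, and dataset-level removal. Item-level and full-user removal, and the improve step, are left out.

```python
import hashlib
from collections import defaultdict


class ToyMemory:
    """In-memory stand-in for dataset-scoped storage (hypothetical, not Cognee's API)."""

    def __init__(self):
        # dataset name -> {content hash -> text}
        self._datasets = defaultdict(dict)

    def remember(self, text, dataset):
        """Store text in a dataset, creating the dataset on first use.

        Returns True if the item was new, False if it duplicated an item
        already stored in that dataset.
        """
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        items = self._datasets[dataset]  # creates the dataset if missing
        if key in items:
            return False
        items[key] = text
        return True

    def recall(self, term, datasets=None):
        """Return (dataset, text) pairs containing term, optionally scoped."""
        scope = sorted(self._datasets) if datasets is None else datasets
        return [
            (name, text)
            for name in scope
            for text in self._datasets.get(name, {}).values()
            if term.lower() in text.lower()
        ]

    def forget(self, dataset):
        """Drop a whole dataset and return how many items were removed."""
        return len(self._datasets.pop(dataset, {}))


memory = ToyMemory()
note = "Cognee builds knowledge graphs."
added_first = memory.remember(note, dataset="research")
added_again = memory.remember(note, dataset="research")  # duplicate in same dataset
added_other = memory.remember(note, dataset="sales")     # other dataset, kept separately

scoped_hits = memory.recall("graphs", datasets=["research"])
removed = memory.forget("sales")
```

Because deduplication is keyed per dataset in this sketch, the same text can live in two datasets without either copy affecting the other.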

Access control

  • Permissions (read, write, share, delete) are enforced at the dataset level
  • You can share one dataset with a team and keep another private
  • Who can modify content (write) and who can distribute it (share) are managed independently
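
As a rough illustration of dataset-level enforcement, the sketch below keeps one grant table per dataset. `DatasetACL`, `grant`, and `allows` are hypothetical names for this example and do not come from Cognee. Only the four permission names are taken from the text above.

```python
from dataclasses import dataclass, field

PERMISSIONS = {"read", "write", "share", "delete"}


@dataclass
class DatasetACL:
    """Toy access-control list for one dataset (hypothetical, not Cognee's API)."""

    name: str
    # user -> set of granted permission names
    grants: dict = field(default_factory=dict)

    def grant(self, user, *permissions):
        """Grant one or more permissions on this dataset to a user."""
        unknown = set(permissions) - PERMISSIONS
        if unknown:
            raise ValueError(f"unknown permissions: {sorted(unknown)}")
        self.grants.setdefault(user, set()).update(permissions)

    def allows(self, user, permission):
        """Check a single permission for a user on this dataset only."""
        return permission in self.grants.get(user, set())


team_docs = DatasetACL("team_docs")
private_notes = DatasetACL("private_notes")

# The owner holds every permission on both datasets.
for acl in (team_docs, private_notes):
    acl.grant("alice", "read", "write", "share", "delete")

# Share one dataset with a teammate as read-only; the other stays private.
team_docs.grant("bob", "read")
```

Since each dataset carries its own grants, giving `bob` read access to `team_docs` has no effect on `private_notes`, and read access does not imply write or share.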

Incremental processing

  • Processing status is tracked per dataset
  • After you remember more data, the underlying cognify step processes only new or changed items and skips what is already completed for that dataset
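
One way to picture per-dataset status tracking is a table keyed by dataset and item that stores a hash of the last processed version. The sketch below assumes change detection by content hash, which the text above does not specify. `IncrementalTracker` and its methods are hypothetical names, not part of Cognee.

```python
import hashlib


class IncrementalTracker:
    """Tracks which item versions were processed, per dataset (toy model)."""

    def __init__(self):
        # (dataset, item_id) -> content hash of the last processed version
        self._done = {}

    @staticmethod
    def _digest(content):
        return hashlib.sha256(content.encode("utf-8")).hexdigest()

    def pending(self, dataset, items):
        """Return ids of items that are new or changed for this dataset."""
        return [
            item_id
            for item_id, content in items.items()
            if self._done.get((dataset, item_id)) != self._digest(content)
        ]

    def process(self, dataset, items):
        """Process only pending items and record them as completed."""
        todo = self.pending(dataset, items)
        for item_id in todo:
            # A real pipeline would do its per-item work here.
            self._done[(dataset, item_id)] = self._digest(items[item_id])
        return todo


tracker = IncrementalTracker()
docs = {"a.txt": "alpha", "b.txt": "beta"}
first_run = tracker.process("research", docs)    # everything is new

docs["b.txt"] = "beta, revised"                  # changed item
docs["c.txt"] = "gamma"                          # new item
second_run = tracker.process("research", docs)   # only b.txt and c.txt

other_dataset = tracker.process("sales", docs)   # status is per dataset
```

The last call processes all three items again because completion is recorded per dataset, so work done for `research` does not count for `sales`.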

Datasets vs NodeSets

Datasets scope storage, permissions, and pipeline execution; NodeSets are semantic tags within a dataset.
  • During remember(), you can label items with one or more NodeSet names (e.g., “AI”, “FinTech”)
  • The underlying graph-building step propagates those labels into the graph by creating NodeSet nodes and linking derived chunks and entities via belongs_to_set relationships
  • This lets you slice a single dataset’s graph by topic or team without creating new datasets, while dataset-level permissions still control overall access
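
The labeling and slicing described above can be sketched with plain edge triples. Only the `belongs_to_set` relationship name and the NodeSet concept come from the text. The function names, the `NodeSet:<name>` id format, and the sample node ids are assumptions made for illustration.

```python
def tag_with_node_sets(items):
    """Build belongs_to_set edges from (node_id, node_set_names) pairs.

    Returns (source, "belongs_to_set", target) triples, where target is
    a NodeSet node id of the assumed form "NodeSet:<name>".
    """
    edges = []
    for node_id, node_sets in items:
        for name in node_sets:
            edges.append((node_id, "belongs_to_set", f"NodeSet:{name}"))
    return edges


def slice_by_node_set(edges, name):
    """Return the sorted ids of nodes linked to the given NodeSet."""
    target = f"NodeSet:{name}"
    return sorted(
        source
        for source, relation, destination in edges
        if relation == "belongs_to_set" and destination == target
    )


# Chunks and entities from one dataset, labeled with NodeSet names.
edges = tag_with_node_sets([
    ("chunk-1", ["AI"]),
    ("chunk-2", ["AI", "FinTech"]),
    ("entity-payments", ["FinTech"]),
])

ai_nodes = slice_by_node_set(edges, "AI")
fintech_nodes = slice_by_node_set(edges, "FinTech")
```

A node can belong to several NodeSets at once, so `chunk-2` appears in both slices while still living in a single dataset.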

Related operations:
  • Remember: direct content into a dataset
  • Improve: enrich memory within a dataset
  • Recall: scope queries by dataset
  • Forget: remove datasets and data