What is a dataset in Cognee?

A dataset is a named container that groups documents and their metadata. It is the main boundary for:
  • Organizing content
  • Running pipelines
  • Applying permissions

Datasets also shape each core operation (a code sketch follows this list):
  • Add:
    • Directs new content into a specific dataset (by name or ID)
    • If the dataset doesn’t exist, Cognee creates it and assigns you permissions on it
    • Ingested items are linked to that dataset and deduplicated within it
  • Cognify:
    • Choose which dataset(s) to transform into a knowledge graph
    • Cognee loads each dataset’s content, checks your access rights, and runs the pipeline per dataset
    • If no datasets are specified, it processes every dataset you’re authorized to use
    • Progress is tracked per dataset, so re-runs are reliable
  • Search:
    • Queries can be scoped to one or more datasets
    • Results and metrics remain separated by dataset
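
A minimal sketch of this flow in Python, assuming Cognee’s async API with a dataset_name argument on add, a datasets argument on cognify, and a query_type/query_text search signature; the datasets filter on search and the SearchType import path are assumptions and may differ between versions:

  import asyncio

  import cognee
  from cognee import SearchType  # import path assumed; may differ by version


  async def main():
      # Add: direct new content into a named dataset; Cognee creates the
      # dataset if it does not exist and links the ingested items to it.
      await cognee.add(
          "Cognee turns documents into a queryable knowledge graph.",
          dataset_name="product_docs",
      )

      # Cognify: build the knowledge graph for the chosen dataset(s).
      # Omitting `datasets` would process every dataset you are authorized to use.
      await cognee.cognify(datasets=["product_docs"])

      # Search: scope the query to the same dataset so results stay
      # separated from other datasets (the `datasets` filter is assumed).
      results = await cognee.search(
          query_type=SearchType.GRAPH_COMPLETION,
          query_text="What does Cognee do with documents?",
          datasets=["product_docs"],
      )
      print(results)


  asyncio.run(main())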

Access control

  • Permissions (read, write, share, delete) are enforced at the dataset level
  • Share one dataset with a team, keep another private
  • Independently manage who can modify or distribute content
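
The helper below is purely illustrative, not Cognee’s actual permissions API; it only sketches the model described above, where each grant ties a user, a dataset, and a set of allowed actions together, and every action is checked against the dataset it targets.

  from dataclasses import dataclass


  @dataclass
  class DatasetGrant:
      # Hypothetical structure for illustration only; Cognee's real
      # permission storage and API may look different.
      dataset: str
      user: str
      actions: set[str]  # subset of {"read", "write", "share", "delete"}


  def can(user: str, action: str, dataset: str, grants: list[DatasetGrant]) -> bool:
      """Return True if some grant gives `user` the `action` on `dataset`."""
      return any(
          g.user == user and g.dataset == dataset and action in g.actions
          for g in grants
      )


  grants = [
      DatasetGrant("team_research", "alice", {"read", "write", "share", "delete"}),
      DatasetGrant("team_research", "bob", {"read"}),  # shared read-only
  ]

  assert can("bob", "read", "team_research", grants)
  assert not can("bob", "write", "team_research", grants)  # private to alice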

Incremental processing

  • Processing status is tracked per dataset
  • After you add more data, Cognify focuses on new or changed items
  • Skips what’s already completed for that dataset
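
Continuing the earlier sketch (same assumed dataset_name and datasets arguments), an incremental update is just another add followed by another cognify; per-dataset status tracking means the second run only touches new or changed items:

  import asyncio

  import cognee


  async def update_dataset():
      # Add more content to the existing dataset; items already ingested
      # into "product_docs" are deduplicated rather than stored twice.
      await cognee.add(
          "A newly published design document.",
          dataset_name="product_docs",
      )

      # Re-running cognify on the same dataset skips work that is already
      # marked complete and processes only the new or changed items.
      await cognee.cognify(datasets=["product_docs"])


  asyncio.run(update_dataset())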

Datasets vs NodeSets

Datasets scope storage, permissions, and pipeline execution; NodeSets are semantic tags within a dataset.
  • During Add, you can label items with one or more NodeSet names (e.g., “AI”, “FinTech”)
  • Cognify propagates those labels into the graph by creating NodeSet nodes and linking derived chunks and entities via belongs_to_set relationships
  • This lets you slice a single dataset’s graph by topic or team without creating new datasets, while dataset-level permissions still control overall access
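
A minimal sketch of NodeSet tagging, assuming add accepts a node_set list of labels and that search can later be filtered to a NodeSet; the node_set and node_name arguments are assumptions and may differ by version:

  import asyncio

  import cognee
  from cognee import SearchType  # import path assumed; may differ by version


  async def tag_and_slice():
      # Tag items with NodeSet labels at Add time; both items live in the
      # same "research" dataset but carry different semantic tags.
      await cognee.add(
          "Notes on transformer architectures.",
          dataset_name="research",
          node_set=["AI"],  # `node_set` argument assumed
      )
      await cognee.add(
          "Notes on payment settlement systems.",
          dataset_name="research",
          node_set=["FinTech"],
      )

      # Cognify creates NodeSet nodes and links derived chunks and entities
      # to them through belongs_to_set relationships.
      await cognee.cognify(datasets=["research"])

      # Slice the single dataset's graph by tag; this NodeSet filter
      # argument is an assumption and may differ by version.
      results = await cognee.search(
          query_type=SearchType.GRAPH_COMPLETION,
          query_text="What do the AI notes cover?",
          node_name=["AI"],
      )
      print(results)


  asyncio.run(tag_and_slice())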