What is a dataset in Cognee?
A dataset is a named container that groups documents and their metadata. It is the main boundary for:
- Organizing content
- Running pipelines
- Applying permissions
Dataset isolation requires specific configuration. See the permissions system documentation for access control requirements and supported database setups.
Add:
- Direct new content into a specific dataset (by name or ID)
- If the dataset doesn’t exist, Cognee creates it and associates your permissions with it
- Items ingested are linked to that dataset and deduplicated within it
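The Add behavior described above (create the dataset on first use, then deduplicate within it) can be illustrated with a small in-memory model. This is only a sketch: `DatasetStore` and its `add` method are hypothetical names, not Cognee's API, and hashing the content to detect duplicates is an assumption made for the example.

```python
import hashlib


class DatasetStore:
    """Toy in-memory model of datasets (hypothetical, not Cognee's storage layer)."""

    def __init__(self):
        # dataset name -> {content hash -> original text}
        self.datasets = {}

    def add(self, text, dataset_name):
        """Add text to a dataset, creating the dataset if it is missing.

        Returns True if the item was stored, False if the same content
        was already present in that dataset.
        """
        items = self.datasets.setdefault(dataset_name, {})
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in items:
            return False  # duplicate within this dataset, skip it
        items[digest] = text
        return True


store = DatasetStore()
first = store.add("Quarterly report", "finance")    # creates "finance" and stores the item
repeat = store.add("Quarterly report", "finance")   # same content, same dataset: skipped
other = store.add("Quarterly report", "research")   # different dataset: stored again
```

Note that deduplication in this model is scoped to one dataset, so identical content may exist in two datasets at once.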
Cognify:
- Choose which dataset(s) to transform into a knowledge graph
- Loads the dataset’s content, checks your permissions, and runs the pipeline per dataset
- If none are specified, processes all datasets you’re authorized to use
- Progress is tracked per dataset for reliable re-runs
Search:
- Queries can be scoped by dataset
- Results and metrics remain separated by dataset
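Dataset-scoped search can be pictured as a filter applied before matching, with results grouped by dataset. The sketch below uses a hypothetical `search` function over a plain dictionary and a naive substring match; it does not reflect how Cognee retrieves or ranks results.

```python
def search(index, query, datasets=None):
    """Return {dataset_name: [matching items]} restricted to `datasets`.

    `index` maps dataset names to lists of text items. When `datasets`
    is None, every dataset in the index is searched.
    """
    scope = list(index) if datasets is None else datasets
    results = {}
    for name in scope:
        hits = [item for item in index.get(name, []) if query.lower() in item.lower()]
        if hits:
            results[name] = hits  # results stay separated by dataset
    return results


index = {
    "finance": ["Graph of quarterly revenue", "Budget memo"],
    "research": ["Knowledge graph survey"],
}
scoped = search(index, "graph", datasets=["research"])
everything = search(index, "graph")
```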
Access control
- Permissions (read, write, share, delete) are enforced at the dataset level
- Share one dataset with a team, keep another private
- Independently manage who can modify or distribute content
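As a rough illustration of dataset-level permissions, the following sketch keeps a set of grants per (user, dataset) pair. The four permission names come from the list above; the `DatasetACL` class, its methods, and the user and dataset names are hypothetical and are not part of Cognee.

```python
from enum import Enum


class Permission(Enum):
    READ = "read"
    WRITE = "write"
    SHARE = "share"
    DELETE = "delete"


class DatasetACL:
    """Toy access-control list keyed by (user, dataset); hypothetical helper."""

    def __init__(self):
        # (user, dataset) -> set of Permission values
        self._grants = {}

    def grant(self, user, dataset, *permissions):
        self._grants.setdefault((user, dataset), set()).update(permissions)

    def allowed(self, user, dataset, permission):
        return permission in self._grants.get((user, dataset), set())


acl = DatasetACL()
# Shared dataset: alice can modify and share it, bob can only read it.
acl.grant("alice", "team_docs", Permission.READ, Permission.WRITE, Permission.SHARE)
acl.grant("bob", "team_docs", Permission.READ)
# Private dataset: only alice holds any permission on it.
acl.grant("alice", "private_notes", *Permission)
```

Because every check takes the dataset as an argument, granting bob access to `team_docs` says nothing about `private_notes`.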
Incremental processing
- Processing status is tracked per dataset
- After you add more data, Cognify focuses on new or changed items
- Skips what’s already completed for that dataset
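One way to picture per-dataset incremental processing is a record of which items were already processed, keyed by dataset. In this sketch, `IncrementalPipeline` is a hypothetical class, and comparing content fingerprints to detect changed items is an assumption; Cognee's actual status tracking may work differently.

```python
import hashlib


class IncrementalPipeline:
    """Toy pipeline that remembers processed items per dataset (hypothetical)."""

    def __init__(self):
        # dataset name -> {item id -> fingerprint at the last completed run}
        self.processed = {}

    def run(self, dataset_name, items):
        """Process only new or changed items; return the ids that were processed."""
        done = self.processed.setdefault(dataset_name, {})
        handled = []
        for item_id, content in items.items():
            fingerprint = hashlib.sha256(content.encode("utf-8")).hexdigest()
            if done.get(item_id) == fingerprint:
                continue  # already completed for this dataset
            # The real transformation work would happen here.
            done[item_id] = fingerprint
            handled.append(item_id)
        return handled


pipeline = IncrementalPipeline()
docs = {"a": "alpha", "b": "beta"}
first_run = pipeline.run("finance", docs)       # everything is new
docs["b"] = "beta v2"                           # changed item
docs["c"] = "gamma"                             # new item
second_run = pipeline.run("finance", docs)      # only "b" and "c" are processed
other_dataset = pipeline.run("research", {"a": "alpha"})  # tracked separately
```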
Datasets vs NodeSets
Datasets scope storage, permissions, and pipeline execution; NodeSets are semantic tags within a dataset.
- During Add, you can label items with one or more NodeSet names (e.g., “AI”, “FinTech”)
- Cognify propagates those labels into the graph by creating NodeSet nodes and linking derived chunks and entities via belongs_to_set relationships
- This lets you slice a single dataset’s graph by topic or team without creating new datasets, while dataset-level permissions still control overall access
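The NodeSet mechanism can be sketched as a tiny graph builder: each label becomes a NodeSet node, and each tagged chunk gets a belongs_to_set edge to it. The node type and relationship name follow the text above; the functions `build_graph` and `members`, the id format, and the sample data are hypothetical.

```python
def build_graph(items):
    """Build a toy graph from (chunk_id, text, node_set_names) tuples.

    Returns (nodes, edges), where edges are (source, relation, target) triples.
    """
    nodes = {}
    edges = []
    for chunk_id, text, node_sets in items:
        nodes[chunk_id] = {"type": "Chunk", "text": text}
        for name in node_sets:
            set_id = f"nodeset:{name}"
            # Create the NodeSet node once, then link every tagged chunk to it.
            nodes.setdefault(set_id, {"type": "NodeSet", "name": name})
            edges.append((chunk_id, "belongs_to_set", set_id))
    return nodes, edges


def members(edges, node_set_name):
    """Return the ids of all nodes linked to the given NodeSet."""
    target = f"nodeset:{node_set_name}"
    return [src for src, rel, dst in edges if rel == "belongs_to_set" and dst == target]


nodes, edges = build_graph([
    ("chunk-1", "Transformers for credit scoring", ["AI", "FinTech"]),
    ("chunk-2", "Payment rails overview", ["FinTech"]),
])
ai_members = members(edges, "AI")
fintech_members = members(edges, "FinTech")
```

Both chunks live in the same dataset here; the NodeSet edges only provide a way to slice that dataset's graph by topic.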