What is a dataset in Cognee?
A dataset is a named container that groups documents and their metadata. It is the main boundary for:
- Organizing content
- Running pipelines
- Applying permissions
Remember:
- Direct new content into a specific dataset (by name or ID)
- If the dataset doesn’t exist, Cognee creates it and associates your permissions with it
- Items ingested are linked to that dataset and deduplicated within it
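The create-on-first-use and per-dataset deduplication behavior above can be pictured with a small in-memory model. This is only a sketch and does not call Cognee: `MemoryStore`, `Dataset`, and this `remember` method are hypothetical names, and deduplicating by a SHA-256 hash of the content is an assumption made for illustration.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Dataset:
    name: str
    owner: str
    # Maps a content hash to the stored text, so identical items are kept once.
    items: dict = field(default_factory=dict)


class MemoryStore:
    """Toy in-memory stand-in for dataset-scoped ingestion (not Cognee's API)."""

    def __init__(self) -> None:
        self.datasets = {}

    def remember(self, text: str, dataset_name: str, user: str) -> bool:
        """Store text in the named dataset; return True if it was newly added."""
        # Create the dataset on first use and record the creating user as owner.
        if dataset_name not in self.datasets:
            self.datasets[dataset_name] = Dataset(dataset_name, user)
        dataset = self.datasets[dataset_name]
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in dataset.items:
            return False  # duplicate within this dataset
        dataset.items[digest] = text
        return True


store = MemoryStore()
first = store.remember("Q3 revenue grew 12%.", "finance", user="alice")
again = store.remember("Q3 revenue grew 12%.", "finance", user="alice")
other = store.remember("Q3 revenue grew 12%.", "research", user="alice")
```

In this model the second call is treated as a duplicate because it targets the same dataset, while the third call stores the same text again because deduplication is scoped to each dataset.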
Improve:
- Runs enrichment against a chosen dataset
- Loads the dataset’s existing graph, checks rights, and runs the improvement pipeline in dataset scope
- Lets you deepen or bridge memory without re-ingesting the source data
Recall:
- Queries can be scoped by dataset
- Results and metrics remain separated by dataset
Forget:
- Removes memory at item, dataset, or full-user scope
- Uses dataset permissions to decide what the current user can remove
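As a rough illustration of how removal scope and dataset permissions might interact, the sketch below models the three scopes in plain Python. It does not use Cognee; the `store` and `can_delete` tables and this `forget` function are hypothetical, and "full-user scope" is modeled here as "every dataset the user is allowed to delete from", which is an assumption.

```python
from typing import Optional

# Hypothetical store: dataset name -> {item id: text}.
store = {
    "finance": {"f1": "Q3 report", "f2": "Q4 forecast"},
    "research": {"r1": "Paper notes"},
}
# Hypothetical delete grants: dataset name -> users allowed to remove from it.
can_delete = {"finance": {"alice"}, "research": {"alice", "bob"}}


def forget(user: str, dataset: Optional[str] = None, item_id: Optional[str] = None) -> int:
    """Remove memory at item, dataset, or full-user scope; return the number of items removed."""
    if item_id is not None and dataset is None:
        raise ValueError("item scope needs a dataset in this sketch")
    targets = [dataset] if dataset is not None else list(store)
    removed = 0
    for name in targets:
        # Skip datasets that are missing or that the user may not delete from.
        if name not in store or user not in can_delete.get(name, set()):
            continue
        if item_id is not None:
            if store[name].pop(item_id, None) is not None:
                removed += 1
        else:
            removed += len(store.pop(name))
    return removed


bob_item = forget("bob", dataset="finance", item_id="f1")      # bob lacks delete rights here
alice_item = forget("alice", dataset="finance", item_id="f1")  # item scope
bob_all = forget("bob")                                         # full-user scope
```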
Access control
- Permissions (read, write, share, delete) are enforced at the dataset level
- Share one dataset with a team, keep another private
- Independently manage who can modify or distribute content
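A dataset-level permission check can be sketched as a lookup keyed by user and dataset. The four permission names come from the list above; everything else (the `Permission` flag type, the `grants` table, and `is_allowed`) is a hypothetical model and not Cognee's implementation.

```python
from enum import Flag, auto


class Permission(Flag):
    READ = auto()
    WRITE = auto()
    SHARE = auto()
    DELETE = auto()


# Hypothetical grants table: (user, dataset) -> permissions held on that dataset.
grants = {
    ("alice", "team_docs"): Permission.READ | Permission.WRITE | Permission.SHARE | Permission.DELETE,
    ("bob", "team_docs"): Permission.READ,
    ("alice", "private_notes"): Permission.READ | Permission.WRITE,
}


def is_allowed(user: str, dataset: str, needed: Permission) -> bool:
    """Return True only if the user holds every requested permission on the dataset."""
    held = grants.get((user, dataset), Permission(0))
    return (held & needed) == needed


bob_can_read = is_allowed("bob", "team_docs", Permission.READ)
bob_can_write = is_allowed("bob", "team_docs", Permission.WRITE)
bob_sees_private = is_allowed("bob", "private_notes", Permission.READ)
```

Here `team_docs` is shared read-only with bob while `private_notes` stays visible only to alice, which mirrors the "share one, keep another private" pattern.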
Incremental processing
- Processing status is tracked per dataset
- After you remember more data, the underlying cognify step focuses on new or changed items
- It skips what’s already completed for that dataset
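One way to picture per-dataset incremental processing is a status table that records what was last processed for each dataset. The sketch below is a simplified model under stated assumptions: the `processed` table, `content_hash`, and `items_to_process` are hypothetical names, and detecting changes by content hash is an assumption rather than a description of Cognee's internals.

```python
import hashlib

# Hypothetical status table: dataset name -> {item id: content hash at last processing}.
processed = {}


def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def items_to_process(dataset: str, items: dict) -> list:
    """Return ids of items that are new or changed since the last run for this dataset."""
    seen = processed.setdefault(dataset, {})
    pending = [
        item_id for item_id, text in items.items()
        if seen.get(item_id) != content_hash(text)
    ]
    # Mark the pending items as completed so the next run skips them.
    for item_id in pending:
        seen[item_id] = content_hash(items[item_id])
    return pending


docs = {"a": "first note", "b": "second note"}
run1 = items_to_process("notes", docs)   # everything is new
docs["b"] = "second note, revised"
docs["c"] = "third note"
run2 = items_to_process("notes", docs)   # only the changed and the new item
run3 = items_to_process("other", docs)   # a different dataset tracks status separately
```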
Datasets vs NodeSets
Datasets scope storage, permissions, and pipeline execution; NodeSets are semantic tags within a dataset.
- During remember(), you can label items with one or more NodeSet names (e.g., “AI”, “FinTech”)
- The underlying graph-building step propagates those labels into the graph by creating NodeSet nodes and linking derived chunks and entities via belongs_to_set relationships
- This lets you slice a single dataset’s graph by topic or team without creating new datasets, while dataset-level permissions still control overall access
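The NodeSet tagging described above can be sketched as a tiny edge list within one dataset's graph. Only the belongs_to_set relationship name and the example tags come from the text; the `edges` list, the `NodeSet:` id prefix, and the `tag` and `members` helpers are hypothetical and exist only for this illustration.

```python
# Edges stored as (source, relationship, target) triples within one dataset's graph.
edges = []


def tag(node: str, node_sets: list) -> None:
    """Link a chunk or entity to each NodeSet node via a belongs_to_set edge."""
    for name in node_sets:
        edges.append((node, "belongs_to_set", f"NodeSet:{name}"))


def members(node_set: str) -> list:
    """Return all nodes linked to the given NodeSet, in insertion order."""
    target = f"NodeSet:{node_set}"
    return [src for src, rel, dst in edges if rel == "belongs_to_set" and dst == target]


tag("chunk:1", ["AI"])
tag("entity:Stripe", ["FinTech"])
tag("chunk:2", ["AI", "FinTech"])

ai_slice = members("AI")
fintech_slice = members("FinTech")
```

Both slices come from the same graph, so a node such as `chunk:2` can appear under several NodeSets without being copied into a separate dataset.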