Datasets: The Core Unit of Data
A dataset is a logical container for related documents and their processed knowledge graphs. All data in Cognee belongs to a dataset. When you add documents to Cognee using cognee.add(), they are processed and stored within a specific dataset.
Dataset-scoped permissions — All permissions in Cognee are defined at the dataset level, never for individual documents.
Ownership
When a principal creates a dataset, they become its owner. A principal is any entity that can have permissions - usually a user, but it can also be a tenant or role (we’ll explain these later). Once a dataset is created, its ownership cannot be changed. The owner can do anything with the dataset and can give permissions to others.
Permission Types
Permissions are always defined at the dataset level, never for individual documents. There are four permission types:
- Read: View documents and query the knowledge graph
- Write: Add, modify, or remove documents and data
- Delete: Remove the entire dataset
- Share: Grant permissions to other principals
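The ownership and permission rules above can be illustrated with a small in-memory model. This is only a sketch: the Permission enum, the Dataset class, and its allows() and grant() methods are hypothetical names chosen for illustration and are not Cognee's API. It assumes the owner implicitly holds all four permissions and that granting requires the share permission, as described in the text.

```python
from dataclasses import dataclass, field
from enum import Enum


class Permission(Enum):
    READ = "read"      # view documents and query the knowledge graph
    WRITE = "write"    # add, modify, or remove documents and data
    DELETE = "delete"  # remove the entire dataset
    SHARE = "share"    # grant permissions to other principals


@dataclass
class Dataset:
    name: str
    owner: str
    # principal -> permissions held on this dataset (hypothetical in-memory ACL)
    acl: dict[str, set[Permission]] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # The owner can do anything with the dataset.
        self.acl[self.owner] = set(Permission)

    def allows(self, principal: str, permission: Permission) -> bool:
        return permission in self.acl.get(principal, set())

    def grant(self, grantor: str, grantee: str, permission: Permission) -> None:
        # Only principals holding "share" may hand out permissions.
        if not self.allows(grantor, Permission.SHARE):
            raise PermissionError(f"{grantor} lacks share permission on {self.name}")
        self.acl.setdefault(grantee, set()).add(permission)


reports = Dataset(name="reports", owner="alice")
reports.grant("alice", "bob", Permission.READ)
```

Here bob can read the reports dataset but cannot write to it or pass access on, because he was never given write or share.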
Default Behavior
When no specific dataset is provided, Cognee uses a default dataset called main_dataset. This dataset is created automatically if it doesn’t exist. Users can create additional datasets as needed for organizing their data. You can specify a different dataset by passing the dataset_name parameter to cognee.add().
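The default-dataset behavior can be pictured with a stand-in for cognee.add(). The add() function and the datasets dictionary below are hypothetical and only mimic the rule described above (fall back to main_dataset, create the dataset on first use). They do not reproduce Cognee's real signature or storage.

```python
DEFAULT_DATASET = "main_dataset"  # default name stated in the text above

# dataset name -> list of added items (hypothetical in-memory store)
datasets: dict[str, list[str]] = {}


def add(data: str, dataset_name: str = DEFAULT_DATASET) -> str:
    """Store data in the named dataset, creating the dataset on first use."""
    datasets.setdefault(dataset_name, []).append(data)
    return dataset_name


add("Quarterly summary")                         # lands in main_dataset
add("Design notes", dataset_name="engineering")  # lands in a named dataset
```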
Dataset Creation
Cognee keeps core dataset metadata in the relational (SQL) database: each dataset row records its UUID, name, owner, and audit timestamps, and the Access Control List tables map principals (users, tenants, roles) to the permissions they hold on that dataset.
Dataset Model Fields
The Dataset model defines what gets stored in the SQL database. The datasets table contains:
- id: Unique identifier (UUID primary key)
- name: Human-readable name
- owner_id: ID of the principal who created the dataset
- created_at: Timestamp when created
- updated_at: Timestamp when last modified
The data field is a relationship to the data table through a many-to-many association table (dataset_data), not stored directly in the datasets table.
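To make the table layout concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table and column names for datasets and dataset_data follow the text above. The column types, the columns of the data table, and the columns of the association table are assumptions made for illustration, not Cognee's actual schema.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE datasets (
        id         TEXT PRIMARY KEY,   -- UUID stored as text in this sketch
        name       TEXT NOT NULL,
        owner_id   TEXT NOT NULL,
        created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
        updated_at TEXT
    );
    CREATE TABLE data (
        id   TEXT PRIMARY KEY,
        name TEXT NOT NULL             -- placeholder column, not from the source
    );
    -- Association table: one row per (dataset, data item) pair.
    CREATE TABLE dataset_data (
        dataset_id TEXT NOT NULL REFERENCES datasets(id),
        data_id    TEXT NOT NULL REFERENCES data(id),
        PRIMARY KEY (dataset_id, data_id)
    );
    """
)

dataset_id, data_id = str(uuid.uuid4()), str(uuid.uuid4())
conn.execute(
    "INSERT INTO datasets (id, name, owner_id) VALUES (?, ?, ?)",
    (dataset_id, "main_dataset", str(uuid.uuid4())),
)
conn.execute("INSERT INTO data (id, name) VALUES (?, ?)", (data_id, "report.pdf"))
conn.execute("INSERT INTO dataset_data VALUES (?, ?)", (dataset_id, data_id))

# Resolve the relationship by joining through the association table.
rows = conn.execute(
    """
    SELECT d.name
    FROM data AS d
    JOIN dataset_data AS dd ON dd.data_id = d.id
    WHERE dd.dataset_id = ?
    """,
    (dataset_id,),
).fetchall()
```

Because the link lives in dataset_data rather than in either main table, the same data item can belong to several datasets without being duplicated.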
Dataset Creation Methods
Two helpers exist for creating datasets:
- create_dataset(): Inserts the dataset row only and expects the caller to manage ACL entries. It’s used inside lower-level routines that already control the database session.
- create_authorized_dataset(): Wraps create_dataset(), then immediately grants the creator read/write/delete/share permissions via ACL entries. This is what user-facing flows call, especially when ENABLE_BACKEND_ACCESS_CONTROL=true, because those ACL checks are enforced before any pipeline can touch the dataset.
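The difference between the two helpers can be sketched with in-memory stand-ins. The function names match the helpers described above, but the signatures, the DatasetRow class, and the datasets and acl containers are assumptions for illustration. Cognee's real helpers work against a database session and are not reproduced here.

```python
import uuid
from dataclasses import dataclass

ALL_PERMISSIONS = ("read", "write", "delete", "share")


@dataclass(frozen=True)
class DatasetRow:
    id: uuid.UUID
    name: str
    owner_id: str


datasets: dict[uuid.UUID, DatasetRow] = {}
acl: set[tuple[str, uuid.UUID, str]] = set()  # (principal, dataset id, permission)


def create_dataset(name: str, owner_id: str) -> DatasetRow:
    """Insert the dataset row only; ACL entries are left to the caller."""
    dataset = DatasetRow(id=uuid.uuid4(), name=name, owner_id=owner_id)
    datasets[dataset.id] = dataset
    return dataset


def create_authorized_dataset(name: str, owner_id: str) -> DatasetRow:
    """Create the row, then grant the creator every permission."""
    dataset = create_dataset(name, owner_id)
    for permission in ALL_PERMISSIONS:
        acl.add((owner_id, dataset.id, permission))
    return dataset


bare = create_dataset("scratch", owner_id="alice")
authorized = create_authorized_dataset("reports", owner_id="alice")
```

In this model, the bare dataset has no ACL entries at all, so any permission check against it would fail until the caller adds them. That is why the wrapper is the safer choice for user-facing flows.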
Operation Requirements
Different operations require different permissions:
- add/cognify operations → require write permission
- search operations → require read permission
- delete operations → require delete permission
- Permission management → requires share permission
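The mapping above amounts to a lookup followed by a membership check. The sketch below assumes permissions are represented as plain strings. The REQUIRED_PERMISSION table, the authorize() function, and the "manage_permissions" label are hypothetical and only illustrate the rule, not how Cognee enforces it internally.

```python
REQUIRED_PERMISSION = {
    "add": "write",
    "cognify": "write",
    "search": "read",
    "delete": "delete",
    "manage_permissions": "share",  # hypothetical label for permission management
}


def authorize(operation: str, granted: set[str]) -> None:
    """Raise PermissionError unless `granted` covers the operation."""
    required = REQUIRED_PERMISSION[operation]
    if required not in granted:
        raise PermissionError(f"{operation!r} requires {required!r} permission")


reader = {"read"}
authorize("search", reader)  # passes silently

try:
    authorize("add", reader)
    denied = False
except PermissionError:
    denied = True  # a read-only principal cannot add data
```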
Integration with Main Operations
Datasets work seamlessly with Cognee’s main operations:
- Add: Direct new content into a specific dataset (by name or ID)
- Cognify: Choose which dataset(s) to transform into a knowledge graph
- Search: Queries can be scoped by dataset
- Memify: Optional semantic enrichment per dataset
Access Control
Permissions (read, write, share, delete) are enforced at the dataset level. This allows you to:
- Share one dataset with a team, keep another private
- Independently manage who can modify or distribute content
- Control access granularly across different data collections
Incremental Processing
Processing status is tracked per dataset. After you add more data, Cognify focuses on new or changed items, skipping what’s already completed for that dataset.
Limitations
- Dataset ownership cannot be transferred
- Graph and vector stores are enforced as Kùzu and LanceDB in access control mode
- Cross-dataset search: Queries are dataset-scoped; cross-dataset searches run per authorized dataset context
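The cross-dataset limitation can be read as "one query per dataset the caller may read". The sketch below illustrates that reading with hypothetical helpers: search_dataset() is a substring match standing in for a real dataset-scoped query, and search_authorized() is an assumed wrapper, not a Cognee function.

```python
def search_dataset(documents: list[str], query: str) -> list[str]:
    """Stand-in for a dataset-scoped query: a case-insensitive substring match."""
    return [doc for doc in documents if query.lower() in doc.lower()]


def search_authorized(
    datasets: dict[str, list[str]],
    readable: set[str],
    query: str,
) -> dict[str, list[str]]:
    """Run the query once per dataset the caller can read; skip the rest."""
    return {
        name: search_dataset(documents, query)
        for name, documents in datasets.items()
        if name in readable
    }


datasets = {
    "engineering": ["Graph storage design", "Release checklist"],
    "finance": ["Graph of quarterly spend"],
}
results = search_authorized(datasets, readable={"engineering"}, query="graph")
```

Results stay keyed by dataset, so a match in finance never appears for a caller who can only read engineering.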