cognee.datasets

Static class for managing datasets and their data.

Methods

datasets.list_datasets()

await cognee.datasets.list_datasets(user=None)

Returns all datasets accessible to the resolved user.

Parameter	Type	Default	Notes
`user`	`Optional[User]`	`None`	If omitted, Cognee resolves the default user.

datasets.discover_datasets()

cognee.datasets.discover_datasets(directory_path: str)

Discover dataset names from a local directory layout.

Parameter	Type	Default	Notes
`directory_path`	`str`	required	Local directory to scan for dataset-style subdirectories.

datasets.list_data()

await cognee.datasets.list_data(dataset_id, user=None)

Returns all Data records in a dataset. This is the API to use when you want to read back DataItem fields stored during cognee.add(), such as label and external_metadata.

Parameter	Type	Default	Notes
`dataset_id`	`UUID`	required	Dataset UUID to inspect.
`user`	`Optional[User]`	`None`	If omitted, Cognee resolves the default user before permission checks.

datasets.has_data()

await cognee.datasets.has_data(dataset_id, user=None) -> bool

Check whether a dataset contains any data.

Parameter	Type	Default	Notes
`dataset_id`	`str`	required	Dataset identifier to check.
`user`	`Optional[User]`	`None`	If omitted, Cognee resolves the default user before permission checks.
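One use for this check is filtering out empty datasets before doing further work on them. The sketch below relies only on the has_data(dataset_id, user=None) signature shown above. The `datasets_api` parameter is a stand-in for `cognee.datasets`, and `non_empty_datasets` is a hypothetical helper name, not part of Cognee.

```python
from typing import Any, Iterable


async def non_empty_datasets(datasets_api: Any, dataset_ids: Iterable[Any]) -> list[Any]:
    """Return the ids whose dataset reports at least one data item.

    Assumption: `datasets_api` exposes an async has_data(dataset_id) method,
    as cognee.datasets is documented to do. This helper is illustrative only.
    """
    kept = []
    for ds_id in dataset_ids:
        # has_data() documents its identifier as a string, so convert UUIDs first.
        if await datasets_api.has_data(str(ds_id)):
            kept.append(ds_id)
    return kept
```

In application code this would be called as `await non_empty_datasets(cognee.datasets, [ds.id for ds in datasets])`.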

datasets.get_status()

await cognee.datasets.get_status(
    dataset_ids: list[UUID],
    pipeline_names: list[str] | None = None,
) -> dict

Get pipeline status for one or more datasets. When pipeline_names is omitted, this method keeps the legacy flat shape and returns the status of cognify_pipeline only.

Parameter	Type	Default	Notes
`dataset_ids`	`list[UUID]`	required	Dataset UUIDs to check.
`pipeline_names`	`Optional[list[str]]`	`None`	Pipeline names to query. If omitted, defaults to `cognify_pipeline`. Duplicate names are deduplicated while preserving order.

With no pipeline_names or a single pipeline name, the method returns {str(dataset_id): PipelineRunStatus}. With multiple pipeline names, it returns {str(dataset_id): {pipeline_name: PipelineRunStatus}}. Each PipelineRunStatus is one of the following values:

Value	Meaning
`DATASET_PROCESSING_INITIATED`	Pipeline queued but not yet started
`DATASET_PROCESSING_STARTED`	Pipeline is running
`DATASET_PROCESSING_COMPLETED`	Indexing finished successfully
`DATASET_PROCESSING_ERRORED`	Processing failed

Datasets with no recorded run for the requested pipeline are absent from the result.

status = await cognee.datasets.get_status([dataset.id])
# {"<dataset-uuid>": "DATASET_PROCESSING_COMPLETED"}
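Because the return shape depends on how many pipeline names were requested, calling code may find it easier to work with a single shape. The following is a minimal sketch of one way to do that. `normalize_status` and `DEFAULT_PIPELINE` are hypothetical names introduced here, and the helper distinguishes the two shapes by checking whether each value is a mapping.

```python
from collections.abc import Mapping
from typing import Any, Optional, Sequence

DEFAULT_PIPELINE = "cognify_pipeline"  # documented default when pipeline_names is omitted


def normalize_status(
    result: Mapping[str, Any],
    pipeline_names: Optional[Sequence[str]] = None,
) -> dict[str, dict[str, Any]]:
    """Return {dataset_id: {pipeline_name: status}} for either result shape."""
    # Deduplicate while preserving order, as the parameter notes describe.
    names = list(dict.fromkeys(pipeline_names or [DEFAULT_PIPELINE]))
    normalized: dict[str, dict[str, Any]] = {}
    for dataset_id, value in result.items():
        if isinstance(value, Mapping):
            # Nested shape: already keyed by pipeline name.
            normalized[dataset_id] = dict(value)
        else:
            # Flat shape: the single status belongs to the only requested pipeline.
            normalized[dataset_id] = {names[0]: value}
    return normalized
```

Pass the same pipeline_names to the helper that were passed to get_status(), so the flat shape is labeled with the correct pipeline.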

Troubleshooting UUID errors

get_status() expects dataset_ids to be a list of dataset UUIDs, not dataset names or string ids. Internally the values are bound against the pipeline_runs.dataset_id UUID column, so passing a plain string raises a SQLAlchemy StatementError wrapping one of:

AttributeError: 'str' object has no attribute 'hex'
ValueError: badly formed hexadecimal UUID string

# ❌ Wrong — passing a dataset name (or string id)
await cognee.datasets.get_status(["my_dataset"])

# ✅ Right — resolve the name to its UUID first
datasets = await cognee.datasets.list_datasets()
dataset_id = next(ds.id for ds in datasets if ds.name == "my_dataset")
status = await cognee.datasets.get_status([dataset_id])

If you already hold a string id (for example one read back from the HTTP API), wrap it in UUID before calling:

from uuid import UUID

status = await cognee.datasets.get_status([UUID(dataset_id_str)])
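When ids arrive from mixed sources, a small validation step can turn the opaque StatementError into a clearer message before the call is made. This is a sketch using only the standard library. `coerce_dataset_ids` is a hypothetical helper, not a Cognee function.

```python
from typing import Iterable, Union
from uuid import UUID


def coerce_dataset_ids(values: Iterable[Union[str, UUID]]) -> list[UUID]:
    """Convert UUID objects and UUID-formatted strings into a list of UUIDs."""
    ids: list[UUID] = []
    for value in values:
        if isinstance(value, UUID):
            ids.append(value)
            continue
        try:
            ids.append(UUID(value))
        except (ValueError, AttributeError, TypeError) as exc:
            # Dataset names such as "my_dataset" land here.
            raise ValueError(
                f"{value!r} is not a dataset UUID; resolve dataset names "
                "with list_datasets() first"
            ) from exc
    return ids
```

The result can then be passed straight to get_status(), for example `await cognee.datasets.get_status(coerce_dataset_ids(raw_ids))`, where `raw_ids` is whatever list the caller holds.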

datasets.empty_dataset()

await cognee.datasets.empty_dataset(dataset_id, user=None)

Delete all data in a dataset and remove the dataset itself.

Parameter	Type	Default	Notes
`dataset_id`	`UUID`	required	Dataset UUID to empty.
`user`	`Optional[User]`	`None`	If omitted, Cognee resolves the default user and checks `delete` permission.

Notes

Despite the name, empty_dataset() does not leave an empty dataset record behind. It deletes graph content, data records, and the dataset entity itself.

datasets.delete_data()

await cognee.datasets.delete_data(
    dataset_id,
    data_id,
    user=None,
    mode="soft",
    delete_dataset_if_empty=False,
)

Delete a specific data item from a dataset.

Parameter	Type	Default	Notes
`dataset_id`	`UUID`	required	Dataset UUID containing the target data item.
`data_id`	`UUID`	required	Data item UUID to delete.
`user`	`Optional[User]`	`None`	If omitted, Cognee resolves the default user and checks `delete` permission.
`mode`	`str`	`soft`	Kept for backward compatibility. The implementation warns against using `"hard"`.
`delete_dataset_if_empty`	`bool`	`False`	If `True`, deletes the dataset when the removed item was its last remaining data item.

Notes

mode="hard" is preserved for backward compatibility, but the implementation explicitly warns not to use it.
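To remove several items from one dataset, the call can be wrapped in a loop that leaves mode at its default. The sketch below assumes only the delete_data() keyword arguments listed above. `datasets_api` stands in for `cognee.datasets`, and `delete_items` is a hypothetical helper name.

```python
from typing import Any, Iterable
from uuid import UUID


async def delete_items(
    datasets_api: Any,
    dataset_id: UUID,
    data_ids: Iterable[UUID],
    *,
    drop_dataset_if_empty: bool = False,
) -> int:
    """Delete each data item in turn and return how many calls were made.

    Assumption: `datasets_api` exposes the async delete_data() documented above.
    `mode` is deliberately not passed, so the default "soft" applies.
    """
    deleted = 0
    for data_id in data_ids:
        await datasets_api.delete_data(
            dataset_id=dataset_id,
            data_id=data_id,
            delete_dataset_if_empty=drop_dataset_if_empty,
        )
        deleted += 1
    return deleted
```

With drop_dataset_if_empty=True, the dataset is expected to be removed once the call that deletes its last remaining item completes, per the parameter notes above.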

datasets.delete_all()

await cognee.datasets.delete_all(user=None)

Delete all datasets the user has permission to delete.

Parameter	Type	Default	Notes
`user`	`Optional[User]`	`None`	If omitted, Cognee resolves the default user.

Examples

Basic dataset operations

import cognee

# List all datasets
datasets = await cognee.datasets.list_datasets()
for ds in datasets:
    print(ds.name, ds.id)

# Check the contents of one dataset (here, the first one listed)
dataset = datasets[0]
data = await cognee.datasets.list_data(dataset_id=dataset.id)

# Delete a specific item (here, the first record returned)
item = data[0]
await cognee.datasets.delete_data(
    dataset_id=dataset.id,
    data_id=item.id,
)

# Delete every dataset the user has permission to delete
await cognee.datasets.delete_all()

Poll for indexing completion across parallel datasets

Use get_status() in a wait loop to confirm all datasets in a parallel batch have finished indexing before querying.

import asyncio
import cognee
from cognee.modules.pipelines.models import PipelineRunStatus

TERMINAL = {
    PipelineRunStatus.DATASET_PROCESSING_COMPLETED,
    PipelineRunStatus.DATASET_PROCESSING_ERRORED,
}

async def wait_for_indexing(dataset_ids, poll_interval=3, timeout=120):
    for _ in range(timeout // poll_interval):
        statuses = await cognee.datasets.get_status(dataset_ids)
        # Datasets with no recorded run are absent from the result, so look up
        # every requested id instead of iterating over the returned values.
        if all(statuses.get(str(ds_id)) in TERMINAL for ds_id in dataset_ids):
            return statuses
        await asyncio.sleep(poll_interval)
    raise TimeoutError("Indexing did not complete in time")

async def main():
    batches = {
        "batch_a": ["doc1.pdf", "doc2.pdf"],
        "batch_b": ["doc3.pdf", "doc4.pdf"],
    }

    # Add and index multiple datasets in parallel
    await asyncio.gather(*[
        cognee.add(files, dataset_name=name) for name, files in batches.items()
    ])
    await asyncio.gather(*[
        cognee.cognify(datasets=[name]) for name in batches
    ])

    # Confirm all datasets reached a terminal status
    all_datasets = await cognee.datasets.list_datasets()
    dataset_ids = [ds.id for ds in all_datasets if ds.name in batches]
    statuses = await wait_for_indexing(dataset_ids)

    for ds_id, status in statuses.items():
        if status == PipelineRunStatus.DATASET_PROCESSING_COMPLETED:
            print(f"{ds_id}: indexed successfully")
        else:
            print(f"{ds_id}: error — {status.value}")

asyncio.run(main())

The same pattern works when indexing is triggered via the HTTP API: poll get_status() from a separate process until every dataset reaches DATASET_PROCESSING_COMPLETED or DATASET_PROCESSING_ERRORED.
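Once polling ends, it can help to group datasets by outcome, including any that never appeared in the result because no run was recorded for them. The following sketch works on the flat result shape and accepts either enum members or plain strings. `summarize_statuses` is a hypothetical helper, and matching on the member name is an assumption based on the status names listed earlier.

```python
from typing import Any, Iterable, Mapping

COMPLETED = "DATASET_PROCESSING_COMPLETED"
ERRORED = "DATASET_PROCESSING_ERRORED"


def summarize_statuses(
    dataset_ids: Iterable[Any],
    statuses: Mapping[str, Any],
) -> dict[str, list[str]]:
    """Group the requested dataset ids by outcome."""
    report: dict[str, list[str]] = {
        "completed": [],
        "errored": [],
        "pending": [],
        "missing": [],
    }
    for ds_id in dataset_ids:
        key = str(ds_id)  # result keys are str(dataset_id)
        status = statuses.get(key)
        if status is None:
            # No recorded run for this dataset, so it is absent from the result.
            report["missing"].append(key)
            continue
        # Use the enum member name when present, else the plain string.
        name = getattr(status, "name", status)
        if name == COMPLETED:
            report["completed"].append(key)
        elif name == ERRORED:
            report["errored"].append(key)
        else:
            report["pending"].append(key)
    return report
```

A non-empty "missing" list usually means the pipeline was never started for those datasets, which a wait loop alone cannot distinguish from slow indexing.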

Read back DataItem metadata

import cognee
from cognee.tasks.ingestion.data_item import DataItem

await cognee.add(
    DataItem(
        "/path/to/report.pdf",
        label="q4-report",
        external_metadata={"author": "Jane Smith", "quarter": "Q4-2024"},
    ),
    dataset_name="finance",
)

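datasets = await cognee.datasets.list_datasets()
# Select the dataset by name rather than relying on list order
finance = next(ds for ds in datasets if ds.name == "finance")
data_items = await cognee.datasets.list_data(dataset_id=finance.id)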

for item in data_items:
    print(item.label, item.external_metadata)
    # q4-report  {"author": "Jane Smith", "quarter": "Q4-2024"}

external_metadata is stored on the relational Data record only. It is not placed into the vector store or knowledge graph and is not returned by cognee.search(). If you need metadata to be vector-searchable, define a custom DataPoint subclass and list the fields to embed in metadata.index_fields. See DataPoints.
