
cognee.datasets

Static class for managing datasets and their data.

Methods

datasets.list_datasets()

await cognee.datasets.list_datasets(user=None)
Returns all datasets accessible to the resolved user.
| Parameter | Type | Default | Notes |
| --- | --- | --- | --- |
| user | Optional[User] | None | If omitted, Cognee resolves the default user. |

datasets.discover_datasets()

cognee.datasets.discover_datasets(directory_path: str)
Discover dataset names from a local directory layout.
| Parameter | Type | Default | Notes |
| --- | --- | --- | --- |
| directory_path | str | required | Local directory to scan for dataset-style subdirectories. |
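To make "dataset-style subdirectories" concrete, here is a rough stdlib-only illustration that treats each immediate subdirectory as a candidate dataset name. This is an assumption about the layout, not Cognee's implementation: candidate_dataset_names is a hypothetical helper, and discover_datasets() itself may handle nesting and naming differently.

```python
from pathlib import Path


def candidate_dataset_names(directory_path: str) -> list[str]:
    """Return the sorted names of immediate subdirectories (hypothetical helper).

    Assumption: one subdirectory corresponds to one dataset. Cognee's own
    discover_datasets() may apply different rules.
    """
    root = Path(directory_path)
    # Plain files at the top level are ignored; only directories count.
    return sorted(p.name for p in root.iterdir() if p.is_dir())
```

For a directory containing the folders finance and legal plus a loose file, this sketch returns only the two folder names.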

datasets.list_data()

await cognee.datasets.list_data(dataset_id, user=None)
Returns all Data records in a dataset. This is the API to use when you want to read back DataItem fields stored during cognee.add(), such as label and external_metadata.
| Parameter | Type | Default | Notes |
| --- | --- | --- | --- |
| dataset_id | UUID | required | Dataset UUID to inspect. |
| user | Optional[User] | None | If omitted, Cognee resolves the default user before permission checks. |

datasets.has_data()

await cognee.datasets.has_data(dataset_id, user=None) -> bool
Check whether a dataset contains any data.
| Parameter | Type | Default | Notes |
| --- | --- | --- | --- |
| dataset_id | str | required | Dataset identifier to check. |
| user | Optional[User] | None | If omitted, Cognee resolves the default user before permission checks. |

datasets.get_status()

await cognee.datasets.get_status(
    dataset_ids: list[UUID],
    pipeline_names: list[str] | None = None,
) -> dict
Get pipeline status for one or more datasets. When pipeline_names is omitted, this method keeps the legacy flat shape and returns the status of cognify_pipeline only.
| Parameter | Type | Default | Notes |
| --- | --- | --- | --- |
| dataset_ids | list[UUID] | required | Dataset UUIDs to check. |
| pipeline_names | Optional[list[str]] | None | Pipeline names to query. If omitted, defaults to cognify_pipeline. Duplicate names are deduplicated while preserving order. |
With no pipeline_names or a single pipeline name, the method returns {str(dataset_id): PipelineRunStatus}. With multiple pipeline names, it returns {str(dataset_id): {pipeline_name: PipelineRunStatus}}.

Possible values:

| Value | Meaning |
| --- | --- |
| DATASET_PROCESSING_INITIATED | Pipeline queued but not yet started |
| DATASET_PROCESSING_STARTED | Pipeline is running |
| DATASET_PROCESSING_COMPLETED | Indexing finished successfully |
| DATASET_PROCESSING_ERRORED | Processing failed |

Datasets with no recorded run for the requested pipeline are absent from the result.
status = await cognee.datasets.get_status([dataset.id])
# {"<dataset-uuid>": "DATASET_PROCESSING_COMPLETED"}
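
Because the return shape depends on how many pipeline names are passed, calling code may prefer one consistent shape. The helpers below are a hypothetical sketch, not part of Cognee: normalize_statuses converts the flat shape into the nested one, and missing_datasets reports which requested datasets have no entry in the result. Both work on plain dictionaries, so the values can be PipelineRunStatus members or strings. The sketch assumes the shape is decided after duplicate names are removed.

```python
from typing import Any, Optional


def normalize_statuses(
    result: dict[str, Any],
    pipeline_names: Optional[list[str]] = None,
) -> dict[str, dict[str, Any]]:
    """Return {dataset_id: {pipeline_name: status}} for either return shape."""
    # Deduplicate while preserving order, mirroring the documented behavior.
    names = list(dict.fromkeys(pipeline_names or ["cognify_pipeline"]))
    if len(names) > 1:
        # Multiple pipelines: the result is already nested, so just copy it.
        return {ds_id: dict(per_pipeline) for ds_id, per_pipeline in result.items()}
    # Zero or one pipeline: wrap each flat status under the pipeline name.
    return {ds_id: {names[0]: status} for ds_id, status in result.items()}


def missing_datasets(dataset_ids: list[Any], result: dict[str, Any]) -> list[str]:
    """Return requested IDs that have no entry in the result."""
    return [str(ds_id) for ds_id in dataset_ids if str(ds_id) not in result]
```

A non-empty missing_datasets() result means at least one dataset has no recorded run yet for the requested pipeline, which is different from a run that failed.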

datasets.empty_dataset()

await cognee.datasets.empty_dataset(dataset_id, user=None)
Delete all data in a dataset and remove the dataset itself.
| Parameter | Type | Default | Notes |
| --- | --- | --- | --- |
| dataset_id | UUID | required | Dataset UUID to empty. |
| user | Optional[User] | None | If omitted, Cognee resolves the default user and checks delete permission. |
Despite the name, empty_dataset() does not leave an empty dataset record behind. It deletes graph content, data records, and the dataset entity itself.

datasets.delete_data()

await cognee.datasets.delete_data(
    dataset_id,
    data_id,
    user=None,
    mode="soft",
    delete_dataset_if_empty=False,
)
Delete a specific data item from a dataset.
| Parameter | Type | Default | Notes |
| --- | --- | --- | --- |
| dataset_id | UUID | required | Dataset UUID containing the target data item. |
| data_id | UUID | required | Data item UUID to delete. |
| user | Optional[User] | None | If omitted, Cognee resolves the default user and checks delete permission. |
| mode | str | "soft" | Kept for backward compatibility. The implementation warns against using "hard". |
| delete_dataset_if_empty | bool | False | If True, deletes the dataset when the removed item was its last remaining data item. |
mode="hard" is preserved for backward compatibility, but the implementation explicitly warns not to use it.
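
As a usage sketch for delete_dataset_if_empty, the hypothetical helper below removes a list of items one by one and asks for the dataset to be dropped if a deletion leaves it empty. The datasets_api parameter is an assumption made so the helper can be exercised without a running Cognee instance; in real code you would pass cognee.datasets. mode is left at its default.

```python
from typing import Any, Iterable


async def delete_items(datasets_api: Any, dataset_id: Any, data_ids: Iterable[Any]) -> int:
    """Delete each item in turn and return the number of delete calls made.

    Hypothetical helper: datasets_api is expected to expose an async
    delete_data() with the keyword arguments documented above.
    """
    count = 0
    for data_id in data_ids:
        await datasets_api.delete_data(
            dataset_id=dataset_id,
            data_id=data_id,
            # Drop the dataset itself once its last item has been removed.
            delete_dataset_if_empty=True,
        )
        count += 1
    return count
```

A call might look like await delete_items(cognee.datasets, dataset.id, stale_ids), where stale_ids is whatever list of data item UUIDs your application has collected.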

datasets.delete_all()

await cognee.datasets.delete_all(user=None)
Delete all datasets the user has permission to delete.
| Parameter | Type | Default | Notes |
| --- | --- | --- | --- |
| user | Optional[User] | None | If omitted, Cognee resolves the default user. |

Examples

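import asyncio
import cognee

async def main():
    # List all datasets
    datasets = await cognee.datasets.list_datasets()
    for ds in datasets:
        print(ds.name, ds.id)
    if not datasets:
        return

    # Check the contents of one dataset (here, the first one)
    ds = datasets[0]
    data = await cognee.datasets.list_data(dataset_id=ds.id)

    # Delete a specific item (here, the first record, if any)
    if data:
        item = data[0]
        await cognee.datasets.delete_data(
            dataset_id=ds.id,
            data_id=item.id,
        )

    # Wipe everything
    await cognee.datasets.delete_all()

asyncio.run(main())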
Use get_status() in a wait loop to confirm all datasets in a parallel batch have finished indexing before querying.
import asyncio
import cognee
from cognee.modules.pipelines.models import PipelineRunStatus

TERMINAL = {
    PipelineRunStatus.DATASET_PROCESSING_COMPLETED,
    PipelineRunStatus.DATASET_PROCESSING_ERRORED,
}

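async def wait_for_indexing(dataset_ids, poll_interval=3, timeout=120):
    # Datasets with no recorded run are absent from the result, so require
    # every requested ID to be present before treating the batch as done.
    expected = {str(ds_id) for ds_id in dataset_ids}
    for _ in range(timeout // poll_interval):
        statuses = await cognee.datasets.get_status(dataset_ids)
        if expected.issubset(statuses) and all(s in TERMINAL for s in statuses.values()):
            return statuses
        await asyncio.sleep(poll_interval)
    raise TimeoutError("Indexing did not complete in time")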

async def main():
    batches = {
        "batch_a": ["doc1.pdf", "doc2.pdf"],
        "batch_b": ["doc3.pdf", "doc4.pdf"],
    }

    # Add and index multiple datasets in parallel
    await asyncio.gather(*[
        cognee.add(files, dataset_name=name) for name, files in batches.items()
    ])
    await asyncio.gather(*[
        cognee.cognify(datasets=[name]) for name in batches
    ])

    # Confirm all datasets reached a terminal status
    all_datasets = await cognee.datasets.list_datasets()
    dataset_ids = [ds.id for ds in all_datasets if ds.name in batches]
    statuses = await wait_for_indexing(dataset_ids)

    for ds_id, status in statuses.items():
        if status == PipelineRunStatus.DATASET_PROCESSING_COMPLETED:
            print(f"{ds_id}: indexed successfully")
        else:
            print(f"{ds_id}: error — {status.value}")

asyncio.run(main())
The same pattern works when indexing is triggered via the HTTP API — poll get_status() from a separate process until all datasets reach DATASET_PROCESSING_COMPLETED or DATASET_PROCESSING_ERRORED.
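
For a separate polling process, a deadline-based loop can be easier to reason about than an iteration count. The sketch below is hypothetical and not part of Cognee: fetch_statuses stands in for whatever coroutine returns the flat {dataset_id: status} mapping (for example, a thin wrapper around get_status()), and statuses are compared by name so that enum members and plain strings both work.

```python
import asyncio
import time
from typing import Any, Awaitable, Callable, Iterable

TERMINAL_NAMES = {"DATASET_PROCESSING_COMPLETED", "DATASET_PROCESSING_ERRORED"}


def _status_name(status: Any) -> str:
    # Enum members expose .name; plain strings are used as-is.
    return getattr(status, "name", str(status))


async def poll_until_terminal(
    fetch_statuses: Callable[[], Awaitable[dict[str, Any]]],
    expected_ids: Iterable[Any],
    poll_interval: float = 3.0,
    timeout: float = 120.0,
) -> dict[str, Any]:
    """Poll until every expected dataset reports a terminal status."""
    expected = {str(ds_id) for ds_id in expected_ids}
    deadline = time.monotonic() + timeout
    while True:
        statuses = await fetch_statuses()
        # Datasets without a recorded run are absent, so wait for them too.
        done = expected.issubset(statuses) and all(
            _status_name(statuses[ds_id]) in TERMINAL_NAMES for ds_id in expected
        )
        if done:
            return statuses
        if time.monotonic() >= deadline:
            raise TimeoutError("Indexing did not reach a terminal status in time")
        await asyncio.sleep(poll_interval)
```

Passing the fetcher in as a callable keeps the loop independent of how statuses are obtained, so the same function can be exercised with a stub in tests.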
import cognee
from cognee.tasks.ingestion.data_item import DataItem

await cognee.add(
    DataItem(
        "/path/to/report.pdf",
        label="q4-report",
        external_metadata={"author": "Jane Smith", "quarter": "Q4-2024"},
    ),
    dataset_name="finance",
)

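datasets = await cognee.datasets.list_datasets()
# Select the dataset by name; datasets[0] may be a different dataset.
finance = next(ds for ds in datasets if ds.name == "finance")
data_items = await cognee.datasets.list_data(dataset_id=finance.id)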

for item in data_items:
    print(item.label, item.external_metadata)
    # q4-report  {"author": "Jane Smith", "quarter": "Q4-2024"}
external_metadata is stored on the relational Data record only. It is not placed into the vector store or knowledge graph and is not returned by cognee.search(). If you need metadata to be vector-searchable, define a custom DataPoint subclass and list the fields to embed in metadata.index_fields. See DataPoints.