# datasets

> Dataset management: list, inspect, and delete datasets

# cognee.datasets

Static class for managing datasets and their data.

## Methods

### datasets.list\_datasets()

```python theme={null}
await cognee.datasets.list_datasets(user=None)
```

Returns all datasets accessible to the resolved user.

| Parameter | Type             | Default | Notes                                         |
| --------- | ---------------- | ------- | --------------------------------------------- |
| `user`    | `Optional[User]` | `None`  | If omitted, Cognee resolves the default user. |

### datasets.discover\_datasets()

```python theme={null}
cognee.datasets.discover_datasets(directory_path: str)
```

Discover dataset names from a local directory layout.

| Parameter        | Type  | Default  | Notes                                                     |
| ---------------- | ----- | -------- | --------------------------------------------------------- |
| `directory_path` | `str` | required | Local directory to scan for dataset-style subdirectories. |

### datasets.list\_data()

```python theme={null}
await cognee.datasets.list_data(dataset_id, user=None)
```

Returns all `Data` records in a dataset.

This is the API to use when you want to read back `DataItem` fields stored during `cognee.add()`, such as `label` and `external_metadata`.

| Parameter    | Type             | Default  | Notes                                                                  |
| ------------ | ---------------- | -------- | ---------------------------------------------------------------------- |
| `dataset_id` | `UUID`           | required | Dataset UUID to inspect.                                               |
| `user`       | `Optional[User]` | `None`   | If omitted, Cognee resolves the default user before permission checks. |
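
Once the records are loaded, the stored fields can be indexed for quick lookup. The following is a sketch only, assuming each record exposes `label` and `external_metadata` attributes; `metadata_by_label` is a hypothetical helper, and the `SimpleNamespace` objects stand in for real `Data` records.

```python
from types import SimpleNamespace


def metadata_by_label(records):
    """Map each record's label to its external_metadata.

    Records without a label are skipped. Hypothetical helper, not part of Cognee.
    """
    return {
        record.label: record.external_metadata
        for record in records
        if getattr(record, "label", None) is not None
    }


# Stand-in records; real ones would come from
# `await cognee.datasets.list_data(dataset_id)`.
records = [
    SimpleNamespace(label="q4-report", external_metadata={"quarter": "Q4-2024"}),
    SimpleNamespace(label=None, external_metadata=None),
]
index = metadata_by_label(records)
print(index)
```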

### datasets.has\_data()

```python theme={null}
await cognee.datasets.has_data(dataset_id, user=None) -> bool
```

Check whether a dataset contains any data.

| Parameter    | Type             | Default  | Notes                                                                  |
| ------------ | ---------------- | -------- | ---------------------------------------------------------------------- |
| `dataset_id` | `str`            | required | Dataset identifier to check.                                           |
| `user`       | `Optional[User]` | `None`   | If omitted, Cognee resolves the default user before permission checks. |
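
One possible use is filtering out empty datasets before running further processing. The block below is a minimal sketch: `non_empty_datasets` and `_FakeDatasets` are hypothetical names, the datasets are assumed to expose an `id` attribute, and the ID is converted with `str()` because the table above types `dataset_id` as `str`.

```python
import asyncio
from types import SimpleNamespace


async def non_empty_datasets(datasets_api, user=None):
    """Return the accessible datasets that contain at least one data item.

    `datasets_api` is expected to behave like `cognee.datasets`.
    Hypothetical helper, not part of Cognee.
    """
    datasets = await datasets_api.list_datasets(user=user)
    kept = []
    for ds in datasets:
        # `has_data` is documented with a `str` identifier, so convert the ID.
        if await datasets_api.has_data(str(ds.id), user=user):
            kept.append(ds)
    return kept


class _FakeDatasets:
    """Stand-in for `cognee.datasets`, used only to demonstrate the helper."""

    async def list_datasets(self, user=None):
        return [SimpleNamespace(name="a", id="id-1"), SimpleNamespace(name="b", id="id-2")]

    async def has_data(self, dataset_id, user=None):
        # Pretend only the first dataset holds data.
        return dataset_id == "id-1"


kept = asyncio.run(non_empty_datasets(_FakeDatasets()))
print([ds.name for ds in kept])
```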

### datasets.get\_status()

```python theme={null}
await cognee.datasets.get_status(
    dataset_ids: list[UUID],
    pipeline_names: list[str] | None = None,
) -> dict
```

Get pipeline status for one or more datasets.

When `pipeline_names` is omitted, the method returns the status of `cognify_pipeline` only, using the legacy flat shape `{str(dataset_id): PipelineRunStatus}`.

| Parameter        | Type                  | Default  | Notes                                                                                                                         |
| ---------------- | --------------------- | -------- | ----------------------------------------------------------------------------------------------------------------------------- |
| `dataset_ids`    | `list[UUID]`          | required | Dataset UUIDs to check.                                                                                                       |
| `pipeline_names` | `Optional[list[str]]` | `None`   | Pipeline names to query. If omitted, defaults to `cognify_pipeline`. Duplicate names are deduplicated while preserving order. |

With no `pipeline_names` or a single pipeline name, the method returns `{str(dataset_id): PipelineRunStatus}`.
With multiple pipeline names, it returns `{str(dataset_id): {pipeline_name: PipelineRunStatus}}`.

Possible `PipelineRunStatus` values:

| Value                          | Meaning                             |
| ------------------------------ | ----------------------------------- |
| `DATASET_PROCESSING_INITIATED` | Pipeline queued but not yet started |
| `DATASET_PROCESSING_STARTED`   | Pipeline is running                 |
| `DATASET_PROCESSING_COMPLETED` | Indexing finished successfully      |
| `DATASET_PROCESSING_ERRORED`   | Processing failed                   |

Datasets with no recorded run for the requested pipeline are absent from the result.

```python theme={null}
status = await cognee.datasets.get_status([dataset.id])
# {"<dataset-uuid>": "DATASET_PROCESSING_COMPLETED"}
```
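
Because the return shape depends on how many pipeline names were requested, calling code may want to normalize it and to spot datasets that are absent from the result. The block below is a sketch under stated assumptions: `normalize_status` and `missing_runs` are hypothetical helpers (not part of Cognee), the shape is detected by checking whether each value is a `dict`, and plain strings stand in for `PipelineRunStatus` members.

```python
DEFAULT_PIPELINE = "cognify_pipeline"


def normalize_status(result, pipeline_names=None):
    """Return {dataset_id: {pipeline_name: status}} for either result shape."""
    # De-duplicate while preserving order, mirroring the documented behavior.
    names = list(dict.fromkeys(pipeline_names or [DEFAULT_PIPELINE]))
    normalized = {}
    for dataset_id, value in result.items():
        if isinstance(value, dict):
            # Nested shape: already keyed by pipeline name.
            normalized[dataset_id] = dict(value)
        else:
            # Flat shape: a bare status for the single requested pipeline.
            normalized[dataset_id] = {names[0]: value}
    return normalized


def missing_runs(dataset_ids, result):
    """List requested dataset IDs (as strings) with no entry in the result."""
    return [str(ds_id) for ds_id in dataset_ids if str(ds_id) not in result]


flat = {"dataset-a": "DATASET_PROCESSING_COMPLETED"}
print(normalize_status(flat))
print(missing_runs(["dataset-a", "dataset-b"], flat))
```

Treating an absent dataset as "not started yet" rather than "finished" is usually the safer default when polling.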

### datasets.empty\_dataset()

```python theme={null}
await cognee.datasets.empty_dataset(dataset_id, user=None)
```

Delete all data in a dataset and remove the dataset itself.

| Parameter    | Type             | Default  | Notes                                                                        |
| ------------ | ---------------- | -------- | ---------------------------------------------------------------------------- |
| `dataset_id` | `UUID`           | required | Dataset UUID to empty.                                                       |
| `user`       | `Optional[User]` | `None`   | If omitted, Cognee resolves the default user and checks `delete` permission. |

<AccordionGroup>
  <Accordion title="Notes">
    <Note>
      Despite the name, `empty_dataset()` does not leave an empty dataset record behind. It deletes graph content, data records, and the dataset entity itself.
    </Note>
  </Accordion>
</AccordionGroup>

### datasets.delete\_data()

```python theme={null}
await cognee.datasets.delete_data(
    dataset_id,
    data_id,
    user=None,
    mode="soft",
    delete_dataset_if_empty=False,
)
```

Delete a specific data item from a dataset.

| Parameter                 | Type             | Default  | Notes                                                                                  |
| ------------------------- | ---------------- | -------- | -------------------------------------------------------------------------------------- |
| `dataset_id`              | `UUID`           | required | Dataset UUID containing the target data item.                                          |
| `data_id`                 | `UUID`           | required | Data item UUID to delete.                                                              |
| `user`                    | `Optional[User]` | `None`   | If omitted, Cognee resolves the default user and checks `delete` permission.           |
| `mode`                    | `str`            | `soft`   | Kept for backward compatibility. The implementation warns against using `"hard"`.      |
| `delete_dataset_if_empty` | `bool`           | `False`  | If `True`, deletes the dataset when the removed item was its last remaining data item. |

<AccordionGroup>
  <Accordion title="Notes">
    <Warning>
      `mode="hard"` is preserved for backward compatibility, but the implementation explicitly warns not to use it.
    </Warning>
  </Accordion>
</AccordionGroup>
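
To remove several items and let the dataset disappear once it is empty, `delete_dataset_if_empty=True` can be passed on each call. This is a minimal sketch: `delete_items` and `_RecordingDatasets` are hypothetical names, `mode` is left at its default, and the fake object only records the calls it receives.

```python
import asyncio


async def delete_items(datasets_api, dataset_id, data_ids, user=None):
    """Delete each data item; the dataset is removed when its last item goes.

    `datasets_api` is expected to behave like `cognee.datasets`.
    Hypothetical helper, not part of Cognee.
    """
    for data_id in data_ids:
        await datasets_api.delete_data(
            dataset_id,
            data_id,
            user=user,
            delete_dataset_if_empty=True,
        )


class _RecordingDatasets:
    """Stand-in for `cognee.datasets` that records each delete call."""

    def __init__(self):
        self.calls = []

    async def delete_data(
        self, dataset_id, data_id, user=None, mode="soft", delete_dataset_if_empty=False
    ):
        self.calls.append((dataset_id, data_id, mode, delete_dataset_if_empty))


fake = _RecordingDatasets()
asyncio.run(delete_items(fake, "dataset-1", ["item-1", "item-2"]))
print(fake.calls)
```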

### datasets.delete\_all()

```python theme={null}
await cognee.datasets.delete_all(user=None)
```

Delete all datasets the user has permission to delete.

| Parameter | Type             | Default | Notes                                         |
| --------- | ---------------- | ------- | --------------------------------------------- |
| `user`    | `Optional[User]` | `None`  | If omitted, Cognee resolves the default user. |

## Examples

<AccordionGroup>
  <Accordion title="Basic dataset operations">
    ```python theme={null}
    import cognee

    # List all datasets
    datasets = await cognee.datasets.list_datasets()
    for ds in datasets:
        print(ds.name, ds.id)

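    # Check the contents of one dataset (here, the first one returned)
    ds = datasets[0]
    data = await cognee.datasets.list_data(dataset_id=ds.id)

    # Delete a specific item (here, the first record in that dataset)
    item = data[0]
    await cognee.datasets.delete_data(
        dataset_id=ds.id,
        data_id=item.id,
    )

    # Delete every dataset the current user is allowed to delete
    await cognee.datasets.delete_all()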
    ```
  </Accordion>

  <Accordion title="Poll for indexing completion across parallel datasets">
    Use `get_status()` in a wait loop to confirm all datasets in a parallel batch have finished indexing before querying.

    ```python theme={null}
    import asyncio
    import cognee
    from cognee.modules.pipelines.models import PipelineRunStatus

    TERMINAL = {
        PipelineRunStatus.DATASET_PROCESSING_COMPLETED,
        PipelineRunStatus.DATASET_PROCESSING_ERRORED,
    }

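    async def wait_for_indexing(dataset_ids, poll_interval=3, timeout=120):
        expected = {str(ds_id) for ds_id in dataset_ids}
        for _ in range(timeout // poll_interval):
            statuses = await cognee.datasets.get_status(dataset_ids)
            # Datasets with no recorded run are absent from the result, so
            # require an entry for every dataset before checking the statuses.
            if expected <= statuses.keys() and all(
                s in TERMINAL for s in statuses.values()
            ):
                return statuses
            await asyncio.sleep(poll_interval)
        raise TimeoutError("Indexing did not complete in time")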

    async def main():
        batches = {
            "batch_a": ["doc1.pdf", "doc2.pdf"],
            "batch_b": ["doc3.pdf", "doc4.pdf"],
        }

        # Add and index multiple datasets in parallel
        await asyncio.gather(*[
            cognee.add(files, dataset_name=name) for name, files in batches.items()
        ])
        await asyncio.gather(*[
            cognee.cognify(datasets=[name]) for name in batches
        ])

        # Confirm all datasets reached a terminal status
        all_datasets = await cognee.datasets.list_datasets()
        dataset_ids = [ds.id for ds in all_datasets if ds.name in batches]
        statuses = await wait_for_indexing(dataset_ids)

        for ds_id, status in statuses.items():
            if status == PipelineRunStatus.DATASET_PROCESSING_COMPLETED:
                print(f"{ds_id}: indexed successfully")
            else:
                print(f"{ds_id}: error — {status.value}")

    asyncio.run(main())
    ```

    The same pattern works when indexing is triggered via the [HTTP API](/api-reference/introduction) — poll `get_status()` from a separate process until all datasets reach `DATASET_PROCESSING_COMPLETED` or `DATASET_PROCESSING_ERRORED`.
  </Accordion>

  <Accordion title="Read back DataItem metadata">
    ```python theme={null}
    import cognee
    from cognee.tasks.ingestion.data_item import DataItem

    await cognee.add(
        DataItem(
            "/path/to/report.pdf",
            label="q4-report",
            external_metadata={"author": "Jane Smith", "quarter": "Q4-2024"},
        ),
        dataset_name="finance",
    )

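    # Select the dataset by name rather than relying on list order
    datasets = await cognee.datasets.list_datasets()
    finance = next(ds for ds in datasets if ds.name == "finance")
    data_items = await cognee.datasets.list_data(dataset_id=finance.id)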

    for item in data_items:
        print(item.label, item.external_metadata)
        # q4-report  {"author": "Jane Smith", "quarter": "Q4-2024"}
    ```

    `external_metadata` is stored on the relational `Data` record only. It is not placed into the vector store or knowledge graph and is not returned by `cognee.search()`. If you need metadata to be vector-searchable, define a custom `DataPoint` subclass and list the fields to embed in `metadata.index_fields`. See [DataPoints](/core-concepts/building-blocks/datapoints#indexing--embeddings).
  </Accordion>
</AccordionGroup>
