# Add

> Ingesting and preparing data for processing in Cognee

<Note>
  `add()` is a legacy operation. In Cognee v1.0, most users should use [remember()](/core-concepts/main-operations/remember) instead, which replaces the `add()` + `cognify()` + `memify()` workflow with a single call.
</Note>

## What is the add operation

The `.add` operation is how you bring content into Cognee. It takes your files, directories, or raw text, normalizes them into plain text, and records them into a dataset that Cognee can later expand into vectors and graphs with [Cognify](/core-concepts/main-operations/legacy-operations/cognify).

* **Ingestion-only**: no embeddings, no graph yet
* **Flexible input**: raw text, local files, directories, S3 URIs, HTTP/HTTPS URLs, or any format supported by [Docling](https://github.com/docling-project/docling)
* **Normalized storage**: everything is turned into text and stored consistently
* **Deduplicated**: Cognee uses content hashes to avoid duplicates
* **Dataset-first**: everything you add goes into a dataset
  * Datasets are how Cognee keeps different collections organized (e.g. "research-papers", "customer-reports")
  * Each dataset has its own ID, owner, and permissions for access control
  * See **Datasets** under Further details for more

## Where add fits

* First step before you run [Cognify](/core-concepts/main-operations/legacy-operations/cognify)
* Use it to **create a dataset** from scratch, or **append new data** over time
* Ideal for both local experiments and programmatic ingestion from storage (e.g. S3)

## What happens under the hood

1. **Expand your input**
   * Directories are walked, S3 paths are expanded, raw text is passed through
   * Result: a flat list of items (files, text, handles)

2. **Ingest and register**
   * Files are saved into Cognee's storage and converted to text
   * Cognee computes a stable content hash to prevent duplicates
   * Each item becomes a record in the database and is attached to your dataset
   * **Text extraction**: supported file formats are converted into plain text
   * **Metadata preservation**: file-system metadata is kept (name, extension, MIME type, file size, and content hash), but arbitrary user-defined fields are not
   * **Content normalization**: text encoding and formatting are made consistent

3. **Return a summary**
   * You get a pipeline run info object that tells you where everything went and which dataset is ready for the next step
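
The expansion step can be pictured with a small standard-library sketch. This is not Cognee's implementation: `expand_inputs` is a hypothetical helper that only understands local directories and local files, and passes everything else (raw text, file handles, S3 or HTTP strings) through unchanged.

```python
import os
from pathlib import Path


def expand_inputs(items):
    """Flatten a mixed list of inputs into individual items.

    Hypothetical helper for illustration only: directories are walked
    recursively, existing files become Path objects, and anything else
    is passed through as-is.
    """
    flat = []
    for item in items:
        is_pathlike = isinstance(item, (str, os.PathLike))
        if is_pathlike and os.path.isdir(item):
            # Walk the directory tree and keep regular files only.
            flat.extend(sorted(p for p in Path(item).rglob("*") if p.is_file()))
        elif is_pathlike and os.path.isfile(item):
            flat.append(Path(item))
        else:
            # Raw text and other objects pass through unchanged.
            flat.append(item)
    return flat


print(expand_inputs(["just some raw text"]))
```

The result is the "flat list of items" that the ingestion step would then hash, store, and register.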

## After add finishes

After `.add` completes, your data is ready for the next stage:

* **Files are safely stored** in Cognee's storage system with metadata preserved
* **Database records** track each ingested item and link it to your dataset
* **Dataset is prepared** for transformation with [Cognify](/core-concepts/main-operations/legacy-operations/cognify) — which will chunk, embed, and connect everything

## Further details

<Accordion title="Input sources">
  * Mix and match: `["some text", "/path/to/file.pdf", "s3://bucket/data.csv", "https://example.com/page"]`
  * Works with directories (recursively), S3 prefixes, file handles, and HTTP/HTTPS URLs
  * Local and cloud sources are normalized into the same format
  * HTTP/HTTPS URLs are scraped as web pages — see [URL ingestion](#url-ingestion-httphttps) below for the distinction between web pages and direct file downloads
</Accordion>

<Accordion title="URL ingestion (HTTP/HTTPS)">
  Passing an `http://` or `https://` URL to `cognee.add()` triggers **web page scraping** — the URL is fetched and its response is saved as HTML for processing. Cognee does **not** inspect the server's `Content-Type` or the URL's file extension to detect direct binary file downloads.

  | URL type                         | What Cognee does                                                                      |
  | -------------------------------- | ------------------------------------------------------------------------------------- |
  | `https://example.com/article`    | Fetches the page HTML and extracts text                                               |
  | `https://example.com/report.pdf` | Fetches the HTTP response as HTML — the PDF binary is **not** downloaded or extracted |

  By default Cognee uses its built-in `DefaultUrlCrawler` (BeautifulSoup). Set `TAVILY_API_KEY` in your environment to use the Tavily API instead for richer extraction from complex or JavaScript-heavy pages. See [Python API: add()](/python-api/add) for more URL ingestion examples, the [web URL content ingestion demo](https://github.com/topoteretes/cognee/blob/main/examples/demos/web_url_content_ingestion_example.py) for a complete example, and [Loaders](/core-concepts/further-concepts/loaders) for `preferred_loaders` examples.
</Accordion>
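
Because URLs are always treated as web pages, one possible workaround for direct file links is to download the file yourself and hand the local path to `cognee.add()`. The sketch below is an assumption, not part of Cognee: `BINARY_EXTENSIONS`, `looks_like_direct_file`, `download_to_temp`, and `prepare_source` are hypothetical names, and the extension list is only an example.

```python
import os
import tempfile
import urllib.request
from urllib.parse import urlparse

# Assumption: the extensions you want to treat as direct file downloads.
BINARY_EXTENSIONS = {".pdf", ".docx", ".xlsx", ".pptx", ".png", ".jpg", ".mp3"}


def looks_like_direct_file(url):
    """Return True when the URL path ends in one of BINARY_EXTENSIONS."""
    path = urlparse(url).path
    return os.path.splitext(path)[1].lower() in BINARY_EXTENSIONS


def download_to_temp(url):
    """Download the URL to a temporary file and return the local path."""
    suffix = os.path.splitext(urlparse(url).path)[1]
    fd, local_path = tempfile.mkstemp(suffix=suffix)
    os.close(fd)
    urllib.request.urlretrieve(url, local_path)
    return local_path


def prepare_source(url):
    """Return a local file path for direct file links, else the URL itself."""
    return download_to_temp(url) if looks_like_direct_file(url) else url


print(looks_like_direct_file("https://example.com/report.pdf"))  # True
print(looks_like_direct_file("https://example.com/article"))     # False
```

The value returned by `prepare_source` can then be passed to `cognee.add()` like any other local path or URL.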

<Accordion title="Structured data (dlt)">
  Cognee integrates with [dlt](https://dlthub.com/) to ingest structured relational data directly into the knowledge graph:

  * **dlt resources**: Pass `@dlt.resource()` decorated generators directly to `cognee.add()`
  * **CSV files**: `.csv` files are auto-detected and ingested via dlt
  * **Database connections**: Pass a connection string (`postgresql://...`, `sqlite:///...`) to ingest tables directly
  * Foreign key relationships become graph edges automatically
  * Structured data bypasses LLM extraction — the graph is built deterministically from the schema
  * See the full [dlt integration guide](/integrations/dlt-integration) for details
</Accordion>
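
To make "the graph is built deterministically from the schema" concrete, the sketch below derives edges from foreign keys using only Python's built-in `sqlite3` module. It illustrates the idea and is not how Cognee or dlt implement it: the `authors` and `books` tables, the `foreign_key_edges` helper, and the `table:id` node naming are all hypothetical, and every table is assumed to have an `id` primary key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE books (
        id INTEGER PRIMARY KEY,
        title TEXT,
        author_id INTEGER REFERENCES authors(id)
    );
    INSERT INTO authors VALUES (1, 'Ada');
    INSERT INTO books VALUES (10, 'Notes', 1);
    """
)


def foreign_key_edges(conn, table):
    """Return (source_node, relation, target_node) for each foreign key value.

    Only use with trusted table names, since they are interpolated into SQL.
    """
    edges = []
    # PRAGMA foreign_key_list columns: id, seq, table, from, to, ...
    for _id, _seq, parent, from_col, *_rest in conn.execute(
        f"PRAGMA foreign_key_list({table})"
    ).fetchall():
        for row_id, fk_value in conn.execute(f"SELECT id, {from_col} FROM {table}"):
            if fk_value is not None:
                edges.append((f"{table}:{row_id}", from_col, f"{parent}:{fk_value}"))
    return edges


print(foreign_key_edges(conn, "books"))  # [('books:10', 'author_id', 'authors:1')]
```

No LLM is involved at any point: the same schema and rows always yield the same edges.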

<Accordion title="LlamaIndex documents">
  `cognee.add()` can also accept LlamaIndex `Document` and `ImageDocument` objects when the `llama-index` extra is installed. See [Installation](/getting-started/installation) for the package extra list.

  * Works with LlamaIndex loaders and connectors without a manual conversion step
  * If a `Document` includes `metadata["file_path"]`, Cognee ingests that original file directly
  * If an `ImageDocument` includes `image_path`, Cognee ingests the image file directly
  * Otherwise, Cognee saves the document text to a temporary file and continues through the normal add pipeline
</Accordion>

<Accordion title="Supported formats">
  Cognee automatically selects the best loader based on file extension. The table below lists all supported extensions and whether optional extras are needed:

  | Loader                  | Extensions                                                                                                                          | Install extra                  |
  | ----------------------- | ----------------------------------------------------------------------------------------------------------------------------------- | ------------------------------ |
  | **TextLoader**          | `.txt` `.md` `.json` `.xml` `.yaml` `.yml` `.log`                                                                                   | — (built-in)                   |
  | **CsvLoader**           | `.csv`                                                                                                                              | — (built-in)                   |
  | **PyPdfLoader**         | `.pdf`                                                                                                                              | — (built-in)                   |
  | **ImageLoader**         | `.png` `.jpg` `.jpeg` `.gif` `.webp` `.bmp` `.tif` `.tiff` `.heic` `.avif` `.ico` `.psd` `.apng` `.cr2` `.dwg` `.xcf` `.jxr` `.jpx` | — (built-in)                   |
  | **AudioLoader**         | `.mp3` `.wav` `.aac` `.flac` `.ogg` `.m4a` `.mid` `.amr` `.aiff`                                                                    | — (built-in)                   |
  | **UnstructuredLoader**  | `.docx` `.doc` `.odt` `.xlsx` `.xls` `.ods` `.pptx` `.ppt` `.odp` `.rtf` `.html` `.htm` `.eml` `.msg` `.epub`                       | `pip install cognee[docs]`     |
  | **AdvancedPdfLoader**   | `.pdf` (layout-aware, preserves tables)                                                                                             | `pip install cognee[docs]`     |
  | **BeautifulSoupLoader** | `.html`                                                                                                                             | `pip install cognee[scraping]` |
  | **DoclingDocument**     | Pre-converted `DoclingDocument` objects                                                                                             | `pip install cognee[docling]`  |

  * **ImageLoader** uses a vision-capable LLM to describe image content.
  * **AudioLoader** transcribes audio using a Whisper-compatible model.
  * **AdvancedPdfLoader** preserves page layout and table structure; falls back to PyPdfLoader automatically on error.

  You can learn more about how loaders work, override defaults, or register custom loaders in the [Loaders](/core-concepts/further-concepts/loaders) section.
</Accordion>
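
The supported-formats table can be read as a lookup from file extension to loader. The sketch below copies some rows from that table into a dictionary and lists every loader whose row mentions a given extension. It is not Cognee's loader registry, and `LOADER_EXTENSIONS` and `candidate_loaders` are hypothetical names. The image and audio rows are left out for brevity.

```python
from pathlib import Path

# Extension lists copied from the supported-formats table (image and audio
# rows omitted). The lookup function itself is a hypothetical illustration.
LOADER_EXTENSIONS = {
    "TextLoader": {".txt", ".md", ".json", ".xml", ".yaml", ".yml", ".log"},
    "CsvLoader": {".csv"},
    "PyPdfLoader": {".pdf"},
    "UnstructuredLoader": {
        ".docx", ".doc", ".odt", ".xlsx", ".xls", ".ods", ".pptx", ".ppt",
        ".odp", ".rtf", ".html", ".htm", ".eml", ".msg", ".epub",
    },
    "AdvancedPdfLoader": {".pdf"},
    "BeautifulSoupLoader": {".html"},
}


def candidate_loaders(filename):
    """Return every loader whose table row lists the file's extension."""
    extension = Path(filename).suffix.lower()
    return [name for name, exts in LOADER_EXTENSIONS.items() if extension in exts]


print(candidate_loaders("notes.md"))    # ['TextLoader']
print(candidate_loaders("report.PDF"))  # ['PyPdfLoader', 'AdvancedPdfLoader']
```

Extensions such as `.pdf` and `.html` appear in more than one row, which is where the installed extras and loader preferences decide which loader actually runs.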

<Accordion title="Datasets">
  * A dataset is your "knowledge base" — a grouping of related data that makes sense together
  * Datasets are **first-class objects in Cognee's database** with their own ID, name, owner, and permissions
  * They provide **scope**: `.add` writes into a dataset, [Cognify](/core-concepts/main-operations/legacy-operations/cognify) processes per-dataset
  * Think of them as separate shelves in your library — e.g., a "research-papers" dataset and a "customer-reports" dataset
  * If you name a dataset that doesn't exist, Cognee creates it for you; if you don't specify, a default one is used
  * More detail: [Datasets](/core-concepts/further-concepts/datasets)
</Accordion>

<Accordion title="Users and ownership">
  * Every dataset and data item belongs to a user
  * If you don't pass a user, Cognee creates or reuses a default one
  * Ownership controls who can later read, write, or share that dataset
</Accordion>

<Accordion title="Node sets">
  * Optional labels to group or tag data on ingestion
  * Example: `node_set=["AI", "FinTech"]`
  * Useful later when you want to focus on subgraphs
  * More detail: [NodeSets](/core-concepts/further-concepts/node-sets)
</Accordion>

<Accordion title="Hash-based file storage, deduplication, and filename collisions">
  When Cognee stores an ingested file, it names the stored copy `text_<md5_hash>.txt`, where the hash is computed from the original file's byte content. For example, adding `report.pdf` produces a stored file like `text_a3f1c8b2....txt` rather than `report.txt`.

  This naming scheme is intentional and powers deduplication:

  * The MD5 hash is derived from the **file content**, not the filename
  * Re-adding the same file, even with a different name, produces the same hash, so Cognee detects the existing record and skips re-ingestion (`incremental_loading=True` by default)
  * All loaders, including text, PDF, image, and audio, follow the same convention, so your storage directory will contain hash-named `.txt` files regardless of the original format

  This is why stored files look unfamiliar when you inspect your `DATA_ROOT_DIRECTORY`. The original filename is preserved in the relational database as metadata on the `Data` record, but the on-disk representation uses the content hash.

  Cognee deduplicates by **file content**, not by filename. The content hash is combined with the owner's user ID and tenant ID to produce a stable UUID for the record.

  | Scenario                                                      | Result                                                                     |
  | ------------------------------------------------------------- | -------------------------------------------------------------------------- |
  | Same file added twice to the same dataset                     | Second call is a no-op because the record is already linked to the dataset |
  | Same file added to a different dataset                        | No new storage copy; the existing record is linked to the new dataset      |
  | Two files with the **same name** but **different contents**   | Two separate records because different content produces different hashes   |
  | Two files with **different names** but **identical contents** | One record because the same content produces the same hash                 |

  If two files arrive simultaneously with the same filename but different contents, Cognee computes separate hashes and stores them as distinct records. Filename is metadata only and plays no role in deduplication.

  Deduplication never crosses user or tenant boundaries. Cognee derives the record UUID from `content_hash + user_id + tenant_id`, so the same file uploaded by a different user or in a different tenant becomes a separate record rather than being collapsed into a shared one.
</Accordion>
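
The deduplication rules described in the accordion above can be reproduced with `hashlib` and `uuid`. This is a minimal sketch under stated assumptions: the source only says the record UUID is derived from `content_hash + user_id + tenant_id`, so the use of `uuid.uuid5` with `NAMESPACE_OID` and plain string concatenation is an assumption, and the function names are hypothetical.

```python
import hashlib
import uuid


def content_hash(data: bytes) -> str:
    """MD5 hex digest of the raw bytes; the filename plays no role."""
    return hashlib.md5(data).hexdigest()


def stored_filename(data: bytes) -> str:
    """On-disk name following the text_<md5_hash>.txt pattern."""
    return f"text_{content_hash(data)}.txt"


def record_id(data: bytes, user_id: str, tenant_id: str) -> uuid.UUID:
    """Stable record ID scoped to one user and tenant.

    Assumption: uuid5 over the concatenated string. The exact derivation
    Cognee uses may differ.
    """
    return uuid.uuid5(uuid.NAMESPACE_OID, content_hash(data) + user_id + tenant_id)


same_a = record_id(b"identical bytes", "user-1", "tenant-1")
same_b = record_id(b"identical bytes", "user-1", "tenant-1")
other_user = record_id(b"identical bytes", "user-2", "tenant-1")

print(same_a == same_b)      # True: same content, same owner, one record
print(same_a == other_user)  # False: dedup never crosses user boundaries
```

Because the ID depends only on content and ownership, renaming a file changes nothing, while changing a single byte or the uploading user yields a new record.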

<Accordion title="Custom metadata, labels, and grouping">
  `cognee.add()` automatically preserves only **file-system metadata**, such as name, extension, MIME type, file size, and content hash.

  If you need to associate extra information with ingested data, three mechanisms are available:

  <AccordionGroup>
    <Accordion title="node_set tags">
      Pass a list of string tags to mark every item in that `add()` call:

      ```python theme={null}
      import cognee

      # Top-level await: run inside an async function or a notebook cell
      await cognee.add(
          "Quarterly earnings report Q4 2024.",
          node_set=["finance", "Q4-2024"],
      )
      ```

      Tags flow into the knowledge graph as `NodeSet` nodes connected with `belongs_to_set` edges, and can be used to scope searches later — see [NodeSets](/core-concepts/further-concepts/node-sets).
    </Accordion>

    <Accordion title="DataItem metadata and labels">
      Wrap individual data items in `DataItem` when you need to attach metadata or control per-item identifiers:

      ```python theme={null}
      import cognee
      from cognee.tasks.ingestion.data_item import DataItem

      item = DataItem(
          data="/path/to/report.pdf",
          label="q4-earnings-report",
          external_metadata={
              "title": "Q4 Financial Report",
              "author": "Jane Smith",
              "date": "2024-12-31",
              "department": "Finance",
          },
      )

      await cognee.add(item, dataset_name="reports")
      ```

      You can also pass a list of `DataItem` objects to ingest multiple files with different metadata in one call:

      ```python theme={null}
      items = [
          DataItem(
              data="/path/to/paper1.pdf",
              label="paper-one",
              external_metadata={"title": "Paper One", "author": "Alice"},
          ),
          DataItem(
              data="/path/to/paper2.pdf",
              label="paper-two",
              external_metadata={"title": "Paper Two", "author": "Bob"},
          ),
      ]

      await cognee.add(items, dataset_name="research")
      ```

      **`DataItem` fields**

      | Field               | Type              | Description                                                                                          |
      | ------------------- | ----------------- | ---------------------------------------------------------------------------------------------------- |
      | `data`              | any               | The content to ingest — same types accepted as plain `cognee.add()` (text, file path, binary stream) |
      | `external_metadata` | `dict` (optional) | Arbitrary key-value metadata stored alongside the ingested record                                    |
      | `label`             | `str` (optional)  | A short label attached to the data record                                                            |
      | `data_id`           | `UUID` (optional) | A stable ID to use instead of the auto-generated content-hash ID                                     |

      The `external_metadata` dictionary and `label` are stored on the `Data` record in Cognee's relational database. `external_metadata` does not automatically become graph structure; use `node_set` when you need tags that flow into the knowledge graph.

      Arbitrary key-value metadata must be passed via `DataItem(external_metadata=...)`; it is not inferred automatically from plain strings or file paths passed directly to `add()`.
    </Accordion>

    <Accordion title="dataset_name grouping">
      Separate collections of data into [named datasets](/core-concepts/further-concepts/datasets) to keep different knowledge domains apart:

      ```python theme={null}
      import cognee

      await cognee.add("Legal contract text.", dataset_name="legal-docs")
      await cognee.add("Product spec text.", dataset_name="product-specs")
      ```
    </Accordion>
  </AccordionGroup>
</Accordion>

<Accordion title="Troubleshooting 409 Conflict errors">
  `POST /api/v1/add` returns **409 Conflict** whenever an unhandled exception occurs during the add operation. It is a catch-all — the actual problem is always in the `error` field of the response body:

  ```json theme={null}
  {
    "error": "<description of what went wrong>"
  }
  ```

  Read that message first. The table below maps the most common error patterns to their fixes.

  | Symptom (error field contains…)                                                  | Cause                                                           | Fix                                                                                                                                                                                                                                 |
  | -------------------------------------------------------------------------------- | --------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
  | `"API key"`, `"authentication"`, `"invalid_api_key"`, `"401"`                    | Missing or invalid LLM API key                                  | Set `LLM_API_KEY` in your environment (`.env` file or shell). Even though `add` itself does not call the LLM, the database setup that runs on every request uses the configured provider.                                           |
  | `"connection refused"`, `"could not connect"`, `"timeout"`, `"OperationalError"` | Database unreachable                                            | Verify your database service is running. For Docker setups, use `DB_HOST=host.docker.internal` instead of `localhost`. Check `DB_HOST`, `DB_PORT`, `DB_USERNAME`, `DB_PASSWORD`, and `DB_NAME`.                                     |
  | `"permission denied"`, `"not authorized"`, `"forbidden"`, `"403"`                | The authenticated user lacks write access to the target dataset | Either use `datasetId` of a dataset you own, or disable access control with `ENABLE_BACKEND_ACCESS_CONTROL=False` in development.                                                                                                   |
  | `"decode"`, `"encoding"`, `"UnicodeDecodeError"`, `"failed to process"`          | Corrupted or non-text file content                              | Confirm the file is readable and in a [supported format](/core-concepts/main-operations/legacy-operations/add#supported-formats). As a workaround, read the file yourself and pass the text string directly instead of a file path. |
  | `"No such file or directory"`, `"FileNotFoundError"`                             | The file path does not exist on the server                      | Use an absolute path. If calling the HTTP API, upload the file as a multipart form attachment instead of passing a path string.                                                                                                     |
  | `"SSL"`, `"certificate"`                                                         | TLS/certificate issue connecting to an external database or S3  | Check SSL settings for your database or storage backend. Set `DB_SSL=false` in development if certificates are self-signed.                                                                                                         |

  ### Enabling debug logs

  For errors not covered above, enable verbose logging to see the full stack trace:

  ```bash theme={null}
  # Add these to your .env file, or prefix each line with `export` in a shell
  LITELLM_LOG="DEBUG"
  ENV="development"
  ```

  Then re-run the request. The server logs will show exactly where the failure occurred.

  ### Still stuck?

  * Check [Setup Configuration](/setup-configuration/overview) to verify your environment variables.
  * Ask in the [Discord community](https://discord.gg/m63hxKsp4p) with the full `error` field from the 409 response.
  * Open an issue on [GitHub](https://github.com/topoteretes/cognee/issues).
</Accordion>
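
If you call the HTTP API from a script, the troubleshooting table for 409 responses can be turned into a small lookup that suggests where to look first. The substrings below are copied from that table. The `ERROR_HINTS` list, the `hint_for` helper, and the shortened hint texts are hypothetical and not part of Cognee.

```python
# Substrings copied from the troubleshooting table; hints are abbreviated.
ERROR_HINTS = [
    (("api key", "authentication", "invalid_api_key", "401"),
     "Check LLM_API_KEY"),
    (("connection refused", "could not connect", "timeout", "operationalerror"),
     "Check that the database is reachable"),
    (("permission denied", "not authorized", "forbidden", "403"),
     "Check write access to the target dataset"),
    (("decode", "encoding", "unicodedecodeerror", "failed to process"),
     "Check the file format and encoding"),
    (("no such file or directory", "filenotfounderror"),
     "Check the file path or upload the file instead"),
    (("ssl", "certificate"),
     "Check SSL settings"),
]

FALLBACK_HINT = "Enable debug logs and inspect the server output"


def hint_for(response_body: dict) -> str:
    """Map the `error` field of a 409 response body to a first suggestion."""
    message = str(response_body.get("error", "")).lower()
    for patterns, hint in ERROR_HINTS:
        if any(pattern in message for pattern in patterns):
            return hint
    return FALLBACK_HINT


body = {"error": "FileNotFoundError: [Errno 2] No such file or directory: '/tmp/x.pdf'"}
print(hint_for(body))  # Check the file path or upload the file instead
```

Matching is a plain case-insensitive substring test, so treat the result as a starting point and still read the full `error` message.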

<Columns cols={3}>
  <Card title="Cognify" icon="brain-cog" href="/core-concepts/main-operations/legacy-operations/cognify">
    Expand data into chunks, embeddings, and graphs
  </Card>

  <Card title="DataPoints" icon="circle" href="/core-concepts/building-blocks/datapoints">
    The units you'll see after Cognify
  </Card>

  <Card title="Building Blocks" icon="puzzle" href="/core-concepts/building-blocks/tasks">
    Learn about Tasks and Pipelines behind Add
  </Card>
</Columns>
