# add()

> Ingest data into the Cognee knowledge base

# cognee.add()

```python theme={null}
async def add(
    data: Union[BinaryIO, list[BinaryIO], str, list[str], DataItem, list[DataItem]],
    dataset_name: str = 'main_dataset',
    user: User = None,
    node_set: Optional[List[str]] = None,
    vector_db_config: dict = None,
    graph_db_config: dict = None,
    dataset_id: Optional[UUID] = None,
    preferred_loaders: Optional[List[Union[str, dict[str, dict[str, Any]]]]] = None,
    incremental_loading: bool = True,
    data_per_batch: Optional[int] = 20,
    llm_config: Optional[LLMConfig] = None,
    embedding_config: Optional[EmbeddingConfig] = None,
)
```

## Description

Add data to Cognee for knowledge graph processing.

This is the first step in the Cognee workflow: it ingests raw data and prepares it
for processing. The function accepts various data formats, including text, files, URLs, and
binary streams, then stores them in a specified dataset for further processing.

Prerequisites:

* **LLM\_API\_KEY**: Must be set in environment variables for content processing
* **Database Setup**: Relational and vector databases must be configured
* **User Authentication**: Uses default user if none provided (created automatically)

Supported Input Types:

* **Text strings**: Direct text content (str), meaning any string that does not start with "/", "file://", "s3://", or "http(s)://"
* **File paths**: File paths as strings in these formats:
  * Absolute paths: "/path/to/document.pdf"
  * File URLs: "file:///path/to/document.pdf" or "file://relative/path.txt"
  * S3 paths: "s3://bucket-name/path/to/file.pdf"
* **Binary file objects**: File handles/streams (BinaryIO)
* **Lists**: Multiple files or text strings in a single call
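
The prefix rules above can be sketched as a small classifier. This is a minimal illustration of the documented rules, not Cognee's actual resolution logic: `classify_input` and its return labels are hypothetical names, and other inputs such as database connection strings are deliberately not covered.

```python
def classify_input(item: str) -> str:
    """Guess how a string passed to add() would be interpreted (illustrative only)."""
    if item.startswith("s3://"):
        return "s3_path"
    if item.startswith(("http://", "https://")):
        return "url"
    if item.startswith("file://"):
        return "file_url"
    if item.startswith("/"):
        return "absolute_path"
    # Anything without a recognized prefix is treated as raw text content.
    return "text"


examples = [
    "Natural language processing is a field of AI...",
    "/path/to/document.pdf",
    "file:///path/to/document.pdf",
    "s3://bucket-name/path/to/file.pdf",
    "https://example.com",
]
for example in examples:
    print(f"{classify_input(example):>14}  {example}")
```

One practical consequence of prefix-based detection is that a relative path such as `docs/report.pdf` has no recognized prefix, so under these rules it would be ingested as literal text rather than as a file.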

Supported File Formats:

* Text files (.txt, .md, .csv)
* PDFs (.pdf)
* Images (.png, .jpg, .jpeg) - extracted via OCR/vision models
* Audio files (.mp3, .wav) - transcribed to text
* Code files (.py, .js, .ts, etc.) - parsed for structure and content
* Office documents (.docx, .pptx)

See the [Supported File Formats](#supported-file-formats) table below for the full list grouped by loader,
including which formats require optional extras.

Workflow:

1. **Data Resolution**: Resolves file paths and validates accessibility
2. **Content Extraction**: Extracts text content from various file formats
3. **Dataset Storage**: Stores processed content in the specified dataset
4. **Metadata Tracking**: Records file metadata, timestamps, and user permissions
5. **Permission Assignment**: Grants user read/write/delete/share permissions on dataset

Args:

* **data**: The data to ingest. Can be:
  * Single text string: "Your text content here"
  * Absolute file path: "/path/to/document.pdf"
  * File URL: "file:///absolute/path/to/document.pdf" or "file://relative/path.txt"
  * S3 path: "s3://my-bucket/documents/file.pdf"
  * List of mixed types: \["text content", "/path/file.pdf", "file://doc.txt", file\_handle]
  * Binary file object: open("file.txt", "rb")
  * URL: A web link (http or https)
* **dataset\_name**: Name of the dataset to store data in. Defaults to "main\_dataset". Create separate datasets to organize different knowledge domains.
* **user**: User object for authentication and permissions. Uses the default user if None. Default user: "default\_user@example.com" (created automatically on first use). Users can only access datasets they have permissions for.
* **node\_set**: Optional list of node identifiers for graph organization and access control. Used for grouping related data points in the knowledge graph.
* **vector\_db\_config**: Optional configuration for the vector database (for custom setups).
* **graph\_db\_config**: Optional configuration for the graph database (for custom setups).
* **dataset\_id**: Optional specific dataset UUID to use instead of dataset\_name.
* **extraction\_rules**: Optional dictionary of rules (e.g., CSS selectors, XPath) for extracting specific content from web pages using BeautifulSoup.
* **tavily\_config**: Optional configuration for the Tavily API, including API key and extraction settings.
* **soup\_crawler\_config**: Optional configuration for the BeautifulSoup crawler, specifying concurrency, crawl delay, and extraction rules.

Returns:
PipelineRunInfo: Information about the ingestion pipeline execution including:

* Pipeline run ID for tracking
* Dataset ID where data was stored
* Processing status and any errors
* Execution timestamps and metadata

Next Steps:
After successfully adding data, call `cognify()` to process the ingested content:

```python theme={null}
import cognee

# Step 1: Add your data (text content or file path)
await cognee.add("Your document content")  # Raw text
# OR
await cognee.add("/path/to/your/file.pdf")  # File path

# Step 2: Process into knowledge graph
await cognee.cognify()

# Step 3: Search and query
results = await cognee.search("What insights can you find?")
```
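
The snippets on this page use top-level `await`, which works in a notebook or an async REPL. In a plain script, the same three steps can be wrapped in a coroutine. This is a minimal sketch, assuming `cognee` is installed and `LLM_API_KEY` is set; the function name `main` is arbitrary.

```python
import asyncio


async def main() -> None:
    # Imported here so that defining main() does not require cognee to be configured.
    import cognee

    # Step 1: ingest raw text
    await cognee.add("Your document content")
    # Step 2: process into a knowledge graph
    await cognee.cognify()
    # Step 3: search and query
    results = await cognee.search("What insights can you find?")
    print(results)


# In a script, run the coroutine with:
#     asyncio.run(main())
```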

Example Usage:

```python theme={null}
# Add a single text document
await cognee.add("Natural language processing is a field of AI...")

# Add multiple files with different path formats
await cognee.add([
    "/absolute/path/to/research_paper.pdf",        # Absolute path
    "file://relative/path/to/dataset.csv",         # Relative file URL
    "file:///absolute/path/to/report.docx",        # Absolute file URL
    "s3://my-bucket/documents/data.json",           # S3 path
    "Additional context text"                       # Raw text content
])

# Add to a specific dataset
await cognee.add(
    data="Project documentation content",
    dataset_name="project_docs"
)

# Add a single file
await cognee.add("/home/user/documents/analysis.pdf")

# Add a single URL using BeautifulSoup (bs4) extraction
extraction_rules = {
    "title": "h1",
    "description": "p",
    "more_info": "a[href*='more-info']"
}
await cognee.add("https://example.com", extraction_rules=extraction_rules)

# Add a single URL using Tavily extraction
# Make sure TAVILY_API_KEY is set as an environment variable
await cognee.add("https://example.com")

# Add multiple URLs
await cognee.add(["https://example.com", "https://books.toscrape.com"])
```

Environment Variables:
Required:

* LLM\_API\_KEY: API key for your LLM provider (OpenAI, Anthropic, etc.)

Optional:

* LLM\_PROVIDER: "openai" (default), "anthropic", "gemini", "ollama", "mistral", "bedrock"
* LLM\_MODEL: Model name (default: "gpt-5-mini")
* DEFAULT\_USER\_EMAIL: Custom default user email
* DEFAULT\_USER\_PASSWORD: Custom default user password
* VECTOR\_DB\_PROVIDER: "lancedb" (default), "chromadb", "pgvector"
* GRAPH\_DATABASE\_PROVIDER: "kuzu" (default), "neo4j"
* TAVILY\_API\_KEY: API key for Tavily, needed when ingesting URLs with the Tavily extraction method
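
One way to apply these settings from Python is to populate `os.environ` before calling into Cognee. The sketch below is an assumption-laden illustration: `missing_required` and `REQUIRED_VARS` are hypothetical helpers, and when exactly Cognee reads these variables is not stated here, so setting them before `import cognee` is the cautious choice.

```python
import os
from collections.abc import Mapping

REQUIRED_VARS = ("LLM_API_KEY",)


def missing_required(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the required variables that are unset or empty in env."""
    return [name for name in REQUIRED_VARS if not env.get(name)]


# setdefault() keeps any value that is already exported in the shell.
os.environ.setdefault("LLM_PROVIDER", "openai")
os.environ.setdefault("VECTOR_DB_PROVIDER", "lancedb")
os.environ.setdefault("GRAPH_DATABASE_PROVIDER", "kuzu")

missing = missing_required()
if missing:
    print(f"Set these before calling cognee.add(): {', '.join(missing)}")
```

Keeping the API key out of source code and exporting it in the shell (or a local `.env` file loaded by your own tooling) avoids committing secrets.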

## Parameters

<ParamField path="data" type="Union[BinaryIO, list[BinaryIO], str, list[str], DataItem, list[DataItem]]" required>
  Data to ingest. Accepts text strings, file paths (local, S3, or URLs), binary file objects, `DataItem` objects, or lists of any of these.

  **`DataItem`** is a lightweight wrapper that lets you attach per-item metadata, a human-readable label, and an optional stable `data_id`. Import it from `cognee.tasks.ingestion.data_item`:

  ```python theme={null}
  from cognee.tasks.ingestion.data_item import DataItem

  item = DataItem(
      data="/path/to/report.pdf",
      label="q4-earnings-report",
      external_metadata={"title": "Q4 Financial Report", "author": "Jane Smith"},
  )
  await cognee.add(item)

  # Mix items with different labels / metadata in a list
  await cognee.add([
      DataItem("Contract text …", label="contract-2024"),
      DataItem("Meeting notes …", external_metadata={"source": "CRM"}),
  ])
  ```

  `label` and `external_metadata` are stored on the relational `Data` record. They are not propagated into the knowledge graph automatically and are not searchable via `cognee.search()`. Use `node_set` when you need tags that flow into the graph and can be used for scoped queries.
</ParamField>

<ParamField path="dataset_name" type="str" default="'main_dataset'">Name of the dataset to add data to.</ParamField>
<ParamField path="user" type="User" default="None">User performing the operation. Uses default user if not provided.</ParamField>
<ParamField path="node_set" type="Optional[List[str]]" default="None">List of node set names to associate with the data.</ParamField>
<ParamField path="vector_db_config" type="dict" default="None">Override vector database configuration for this operation.</ParamField>
<ParamField path="graph_db_config" type="dict" default="None">Override graph database configuration for this operation.</ParamField>
<ParamField path="dataset_id" type="Optional[UUID]" default="None">UUID of an existing dataset to add data to. Alternative to dataset\_name.</ParamField>
<ParamField path="preferred_loaders" type="Optional[List[Union[str, dict[str, dict[str, Any]]]]]" default="None">Custom loader configuration for specific file types.</ParamField>
<ParamField path="incremental_loading" type="bool" default="True">If true, skip data that has already been ingested.</ParamField>
<ParamField path="data_per_batch" type="Optional[int]" default="20">Number of data items to process per batch.</ParamField>
<ParamField path="llm_config" type="Optional[LLMConfig]" default="None">LLM settings to install into the current async context for this ingestion operation. When omitted, Cognee uses the active context config or global LLM config. Import `LLMConfig` from `cognee.infrastructure.llm.config`.</ParamField>
<ParamField path="embedding_config" type="Optional[EmbeddingConfig]" default="None">Embedding settings to install into the current async context for this ingestion operation. When omitted, Cognee uses the active context config or global embedding config. Import `EmbeddingConfig` from `cognee.infrastructure.databases.vector.embeddings.config`.</ParamField>
<ParamField path="importance_weight" type="Optional[float]" default="0.5">Floating-point score stored on the `Data` record for retrieval ranking. Applied uniformly to all items in the batch. Use a higher value to make items more likely to surface in ranked results.</ParamField>
<ParamField path="primary_key" type="str" default="None">Column name for primary key when ingesting structured data via dlt. Auto-detected if not specified.</ParamField>
<ParamField path="write_disposition" type="str" default="'merge'">How to handle existing data for dlt sources: "merge" (upsert), "append" (always insert), or "replace" (drop and recreate).</ParamField>
<ParamField path="query" type="str" default="None">SQL WHERE clause for filtering when using a database connection string as input.</ParamField>
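
The three `write_disposition` values can be illustrated with plain in-memory rows. This sketch only models the semantics described above (upsert, always insert, drop and recreate); it is not how dlt or Cognee implements them, and `apply_write` is a hypothetical helper.

```python
def apply_write(existing, incoming, primary_key, write_disposition="merge"):
    """Model how each write disposition combines existing and incoming rows."""
    if write_disposition == "append":
        # Always insert: duplicates of the primary key are kept.
        return existing + incoming
    if write_disposition == "replace":
        # Drop and recreate: only the incoming rows survive.
        return list(incoming)
    if write_disposition == "merge":
        # Upsert: incoming rows overwrite existing rows with the same key.
        merged = {row[primary_key]: row for row in existing}
        for row in incoming:
            merged[row[primary_key]] = row
        return list(merged.values())
    raise ValueError(f"Unknown write_disposition: {write_disposition!r}")


existing = [{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}]
incoming = [{"id": 2, "name": "Robert"}, {"id": 3, "name": "Carol"}]

for mode in ("merge", "append", "replace"):
    print(mode, apply_write(existing, incoming, "id", mode))
```

With `"merge"`, the row with `id` 2 is updated in place and `id` 3 is added; with `"append"`, both versions of `id` 2 are kept, which is why a correct `primary_key` matters for repeated ingestion.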

## Supported Input Types

| Type          | Example                                             |
| ------------- | --------------------------------------------------- |
| Text string   | `"Cognee is a knowledge graph platform."`           |
| File path     | `"/path/to/document.pdf"`                           |
| S3 path       | `"s3://bucket/file.txt"`                            |
| URL           | `"https://example.com/article"`                     |
| Binary file   | `open("file.pdf", "rb")`                            |
| Mixed list    | `["text", "/path/file.pdf", open("f.txt", "rb")]`   |
| dlt resource  | `@dlt.resource()` decorated generator               |
| CSV file      | `"/path/to/data.csv"` (auto-detected as dlt source) |
| DB connection | `"postgresql://user:pass@host/db"`                  |

## Supported File Formats

| Loader                  | Extensions                                                                                                                          | Install extra                  |
| ----------------------- | ----------------------------------------------------------------------------------------------------------------------------------- | ------------------------------ |
| **TextLoader**          | `.txt` `.md` `.json` `.xml` `.yaml` `.yml` `.log`                                                                                   | — (built-in)                   |
| **CsvLoader**           | `.csv`                                                                                                                              | — (built-in)                   |
| **PyPdfLoader**         | `.pdf`                                                                                                                              | — (built-in)                   |
| **ImageLoader**         | `.png` `.jpg` `.jpeg` `.gif` `.webp` `.bmp` `.tif` `.tiff` `.heic` `.avif` `.ico` `.psd` `.apng` `.cr2` `.dwg` `.xcf` `.jxr` `.jpx` | — (built-in)                   |
| **AudioLoader**         | `.mp3` `.wav` `.aac` `.flac` `.ogg` `.m4a` `.mid` `.amr` `.aiff`                                                                    | — (built-in)                   |
| **UnstructuredLoader**  | `.docx` `.doc` `.odt` `.xlsx` `.xls` `.ods` `.pptx` `.ppt` `.odp` `.rtf` `.html` `.htm` `.eml` `.msg` `.epub`                       | `pip install cognee[docs]`     |
| **AdvancedPdfLoader**   | `.pdf` (layout-aware, preserves tables)                                                                                             | `pip install cognee[docs]`     |
| **BeautifulSoupLoader** | `.html`                                                                                                                             | `pip install cognee[scraping]` |

See [Loaders](/core-concepts/further-concepts/loaders) for how to override the default loader selection or register custom loaders.
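
As a quick way to check which loaders in the table claim a given extension, the lookup below copies the table into a dictionary. It is a reading aid under stated assumptions, not part of Cognee: `LOADER_EXTENSIONS` and `candidate_loaders` are hypothetical names, and which loader Cognee actually selects when several match is not specified on this page.

```python
from pathlib import Path

# Extensions copied from the table above, in table order.
LOADER_EXTENSIONS = {
    "TextLoader": ".txt .md .json .xml .yaml .yml .log",
    "CsvLoader": ".csv",
    "PyPdfLoader": ".pdf",
    "ImageLoader": (
        ".png .jpg .jpeg .gif .webp .bmp .tif .tiff .heic .avif "
        ".ico .psd .apng .cr2 .dwg .xcf .jxr .jpx"
    ),
    "AudioLoader": ".mp3 .wav .aac .flac .ogg .m4a .mid .amr .aiff",
    "UnstructuredLoader": (
        ".docx .doc .odt .xlsx .xls .ods .pptx .ppt .odp "
        ".rtf .html .htm .eml .msg .epub"
    ),
    "AdvancedPdfLoader": ".pdf",
    "BeautifulSoupLoader": ".html",
}


def candidate_loaders(path: str) -> list[str]:
    """Return every loader whose extension list includes the file's suffix."""
    suffix = Path(path).suffix.lower()
    return [
        name
        for name, extensions in LOADER_EXTENSIONS.items()
        if suffix in extensions.split()
    ]


print(candidate_loaders("/path/to/report.pdf"))
print(candidate_loaders("notes.md"))
```

Extensions such as `.pdf` and `.html` appear under more than one loader, so the function returns a list rather than a single name.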

## Examples

```python theme={null}
import cognee

# Add text
await cognee.add("Cognee builds knowledge graphs from your data.")

# Add a file
await cognee.add("/path/to/report.pdf", dataset_name="reports")

# Add multiple items to a named dataset
await cognee.add(
    ["First document text", "/path/to/second.pdf"],
    dataset_name="my_project",
)

# Add with custom node set
await cognee.add("Technical spec content", node_set=["engineering"])

# Add structured data via dlt resource
import dlt

@dlt.resource()
def my_data():
    yield [{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}]

await cognee.add(my_data, dataset_name="people", primary_key="id")

# Add a CSV file (auto-detected as dlt source)
await cognee.add("/path/to/data.csv", dataset_name="csv_data", primary_key="id")

# Add from a database with filtering
await cognee.add(
    "postgresql://user:pass@host/db",
    dataset_name="db_data",
    primary_key="id",
    query="SELECT * FROM users WHERE active = true",
)
```

<Note>
  For a complete guide on structured data ingestion with dlt, see the [dlt integration page](/integrations/dlt-integration).
</Note>