What is the add operation
The `.add` operation is how you bring content into Cognee. It takes your files, directories, or raw text, normalizes them into plain text, and records them into a dataset that Cognee can later expand into vectors and graphs with Cognify.
- Ingestion-only: no embeddings, no graph yet
- Flexible input: raw text, local files, directories, any Docling-supported format, or S3 URIs
- Normalized storage: everything is turned into text and stored consistently
- Deduplicated: Cognee uses content hashes to avoid duplicates
- Dataset-first: everything you add goes into a dataset
- Datasets are how Cognee keeps different collections organized (e.g. “research-papers”, “customer-reports”)
- Each dataset has its own ID, owner, and permissions for access control
- You can read more about them below
Where add fits
- First step before you run Cognify
- Use it to create a dataset from scratch, or append new data over time
- Ideal for both local experiments and programmatic ingestion from storage (e.g. S3)
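A minimal usage sketch, assuming `cognee` is installed (`pip install cognee`) and configured. The dataset name and file path are illustrative; `cognee` is imported lazily inside the coroutine so the sketch reads standalone:

```python
import asyncio

async def ingest_documents() -> None:
    import cognee  # lazy import: requires `pip install cognee` to actually run

    # Raw text, local files, and S3 URIs can be mixed in a single call.
    await cognee.add(
        ["Knowledge graphs connect facts.", "/path/to/report.pdf"],
        dataset_name="research-papers",
    )

# To run: asyncio.run(ingest_documents())
```

After this call, the `research-papers` dataset exists and holds the ingested items, ready for Cognify.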
What happens under the hood
1. Expand your input
   - Directories are walked, S3 paths are expanded, raw text is passed through
   - Result: a flat list of items (files, text, handles)
2. Ingest and register
   - Files are saved into Cognee’s storage and converted to text
   - Cognee computes a stable content hash to prevent duplicates
   - Each item becomes a record in the database and is attached to your dataset
   - Text extraction: Converts various file formats into plain text
   - Metadata preservation: Keeps file-system metadata like name, extension, MIME type, file size, and content hash — not arbitrary user-defined fields
   - Content normalization: Ensures consistent text encoding and formatting
3. Return a summary
   - You get a pipeline run info object that tells you where everything went and which dataset is ready for the next step
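The three stages can be pictured in plain Python. Everything below (function names, the record shape, the summary fields) is an illustrative sketch, not Cognee's internal API:

```python
import hashlib
from pathlib import Path

def expand(items):
    """Stage 1: flatten directories into files; pass raw text through."""
    flat = []
    for item in items:
        path = Path(item)
        if path.is_dir():
            flat.extend(str(f) for f in sorted(path.rglob("*")) if f.is_file())
        else:
            flat.append(item)  # raw text or a single file path
    return flat

def ingest_and_register(items, dataset, registry):
    """Stages 2 and 3: hash content to deduplicate, register records, summarize."""
    added, skipped = 0, 0
    for item in items:
        path = Path(item)
        text = path.read_text() if path.is_file() else item
        content_hash = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if content_hash in registry:
            skipped += 1  # identical content was already registered
        else:
            registry[content_hash] = {"dataset": dataset, "text": text}
            added += 1
    return {"dataset": dataset, "added": added, "skipped_duplicates": skipped}
```

Note how deduplication falls out of keying the registry by content hash: adding the same text twice registers it once.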
After add finishes
After `.add` completes, your data is ready for the next stage:
- Files are safely stored in Cognee’s storage system with metadata preserved
- Database records track each ingested item and link it to your dataset
- Dataset is prepared for transformation with Cognify — which will chunk, embed, and connect everything
Further details
Input sources
- Mix and match: `["some text", "/path/to/file.pdf", "s3://bucket/data.csv"]`
- Works with directories (recursively), S3 prefixes, and file handles
- Local and cloud sources are normalized into the same format
Structured data (dlt)
Cognee integrates with dlt to ingest structured relational data directly into the knowledge graph:
- dlt resources: Pass `@dlt.resource()`-decorated generators directly to `cognee.add()`
- CSV files: `.csv` files are auto-detected and ingested via dlt
- Database connections: Pass a connection string (`postgresql://...`, `sqlite:///...`) to ingest tables directly
- Foreign key relationships become graph edges automatically
- Structured data bypasses LLM extraction — the graph is built deterministically from the schema
- See the full dlt integration guide for details
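The deterministic schema-to-graph step can be illustrated with a toy sketch. The schema shape and the edge-naming convention here are invented for illustration; Cognee's actual dlt pipeline differs:

```python
def schema_to_edges(schema):
    """Turn foreign-key declarations into graph edges deterministically.

    `schema` maps table name -> {"foreign_keys": {column: referenced_table}}.
    No LLM is involved: each edge follows directly from a declared key.
    """
    edges = []
    for table, spec in schema.items():
        for column, referenced in spec.get("foreign_keys", {}).items():
            edges.append((table, f"references_via_{column}", referenced))
    return edges

# Toy CRM schema: orders point at customers through a foreign key.
crm = {
    "orders": {"foreign_keys": {"customer_id": "customers"}},
    "customers": {"foreign_keys": {}},
}
```

Running `schema_to_edges(crm)` yields a single `orders -> customers` edge, derived purely from the schema.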
Supported formats
Cognee automatically selects the best loader based on file extension. The table below lists all supported extensions and whether optional extras are needed:
| Loader | Extensions | Install extra |
|---|---|---|
| TextLoader | .txt .md .json .xml .yaml .yml .log | — (built-in) |
| CsvLoader | .csv | — (built-in) |
| PyPdfLoader | .pdf | — (built-in) |
| ImageLoader | .png .jpg .jpeg .gif .webp .bmp .tif .tiff .heic .avif .ico .psd .apng .cr2 .dwg .xcf .jxr .jpx | — (built-in) |
| AudioLoader | .mp3 .wav .aac .flac .ogg .m4a .mid .amr .aiff | — (built-in) |
| UnstructuredLoader | .docx .doc .odt .xlsx .xls .ods .pptx .ppt .odp .rtf .html .htm .eml .msg .epub | pip install cognee[docs] |
| AdvancedPdfLoader | .pdf (layout-aware, preserves tables) | pip install cognee[docs] |
| BeautifulSoupLoader | .html | pip install cognee[scraping] |
- ImageLoader uses a vision-capable LLM to describe image content.
- AudioLoader transcribes audio using a Whisper-compatible model.
- AdvancedPdfLoader preserves page layout and table structure; falls back to PyPdfLoader automatically on error.
- Cognee can also ingest the `DoclingDocument` format directly — any format Docling supports can be pre-converted and passed to `cognee.add()`.
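Loader selection can be pictured as an extension lookup with a fallback. The mapping below covers only a few rows of the table, and the resolver function is illustrative, not Cognee's real dispatch logic:

```python
# Abbreviated extension map; see the table above for the full list.
EXTENSION_TO_LOADER = {
    ".txt": "TextLoader", ".md": "TextLoader", ".json": "TextLoader",
    ".csv": "CsvLoader",
    ".pdf": "PyPdfLoader",
    ".png": "ImageLoader", ".jpg": "ImageLoader",
    ".docx": "UnstructuredLoader", ".pptx": "UnstructuredLoader",
}

def pick_loader(filename: str, prefer_advanced_pdf: bool = False) -> str:
    """Illustrative resolver: pick a loader name by file extension."""
    ext = "." + filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
    if ext == ".pdf" and prefer_advanced_pdf:
        # Layout-aware path; falls back to PyPdfLoader on error per the notes.
        return "AdvancedPdfLoader"
    return EXTENSION_TO_LOADER.get(ext, "TextLoader")
```

The point is that dispatch is by extension, with optional extras unlocking richer loaders for the same extension.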
Datasets
- A dataset is your “knowledge base” — a grouping of related data that makes sense together
- Datasets are first-class objects in Cognee’s database with their own ID, name, owner, and permissions
- They provide scope: `.add` writes into a dataset, Cognify processes per-dataset
- Think of them as separate shelves in your library — e.g., a “research-papers” dataset and a “customer-reports” dataset
- If you name a dataset that doesn’t exist, Cognee creates it for you; if you don’t specify, a default one is used
- More detail: Datasets
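The get-or-create behavior can be sketched in a few lines. The default dataset name and the record shape here are assumptions made for illustration:

```python
import uuid

def get_or_create_dataset(name, datasets, owner="default_user"):
    """Illustrative get-or-create: unknown names are created on the fly,
    and a missing name falls back to a default dataset (name assumed)."""
    name = name or "main_dataset"
    if name not in datasets:
        datasets[name] = {"id": str(uuid.uuid4()), "owner": owner}
    return name
```

This mirrors the rules above: naming a nonexistent dataset creates it, and omitting the name uses a default.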
Users and ownership
- Every dataset and data item belongs to a user
- If you don’t pass a user, Cognee creates/uses a default one
- Ownership controls who can later read, write, or share that dataset
Node sets
- Optional labels to group or tag data on ingestion
- Example: `node_set=["AI", "FinTech"]`
- Useful later when you want to focus on subgraphs
- More detail: NodeSets
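Tagging at ingestion time looks like this in a sketch (the text and dataset name are illustrative; `cognee` is imported lazily so the snippet reads standalone):

```python
import asyncio

async def ingest_tagged() -> None:
    import cognee  # lazy import: requires `pip install cognee` to actually run

    # Every item in this call is tagged with both node-set labels.
    await cognee.add(
        "Q3 results show fintech adoption of AI assistants.",
        dataset_name="customer-reports",
        node_set=["AI", "FinTech"],
    )

# To run: asyncio.run(ingest_tagged())
```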
Custom metadata and labeling
`cognee.add()` automatically preserves only file-system metadata like name, MIME type, extension, and content hash. If you need to associate extra information with ingested data, three mechanisms are available:

1. `node_set` — categorical tags applied to a whole batch. Pass a list of string tags to mark every item in that `add()` call. Tags become `NodeSet` nodes connected with `belongs_to_set` edges, and can be used to scope searches later — see NodeSets.
2. `DataItem` — per-item string label. Wrap individual data items in `DataItem` to attach a single string label to each one. The label is stored in the relational database alongside the ingested record.
3. `dataset_name` — logical grouping. Separate collections of data into named datasets to keep different knowledge domains apart.

Arbitrary key-value metadata (e.g. `{"source": "CRM", "author": "Alice"}`) cannot currently be attached via `add()`. If rich metadata is important for your use case, consider encoding it as part of the text content itself, or combine datasets (via `dataset_name`) and NodeSets (via `node_set`) to represent the dimensions you care about.
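The encode-into-text workaround can be as simple as a front-matter-style prefix. The helper and key names below are examples, not a Cognee convention:

```python
def with_inline_metadata(text: str, **metadata: str) -> str:
    """Prepend metadata as labeled lines so it survives ingestion as plain text."""
    header = "\n".join(f"{key}: {value}" for key, value in metadata.items())
    return f"{header}\n---\n{text}"

doc = with_inline_metadata("Q3 revenue grew 12%.", source="CRM", author="Alice")
```

The resulting string carries its metadata into the dataset, where chunking and search can still pick it up as text.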
Troubleshooting 409 Conflict errors
`POST /api/v1/add` returns `409 Conflict` whenever an unhandled exception occurs during the add operation. It is a catch-all — the actual problem is always in the `error` field of the response body:

| Symptom (error field contains…) | Cause | Fix |
|---|---|---|
"API key", "authentication", "invalid_api_key", "401" | Missing or invalid LLM API key | Set LLM_API_KEY in your environment (.env file or shell). Even though add itself does not call the LLM, the database setup that runs on every request uses the configured provider. |
"connection refused", "could not connect", "timeout", "OperationalError" | Database unreachable | Verify your database service is running. For Docker setups, use DB_HOST=host.docker.internal instead of localhost. Check DB_HOST, DB_PORT, DB_USERNAME, DB_PASSWORD, and DB_NAME. |
"permission denied", "not authorized", "forbidden", "403" | The authenticated user lacks write access to the target dataset | Either use datasetId of a dataset you own, or disable access control with ENABLE_BACKEND_ACCESS_CONTROL=False in development. |
"decode", "encoding", "UnicodeDecodeError", "failed to process" | Corrupted or non-text file content | Confirm the file is readable and in a supported format. As a workaround, read the file yourself and pass the text string directly instead of a file path. |
"No such file or directory", "FileNotFoundError" | The file path does not exist on the server | Use an absolute path. If calling the HTTP API, upload the file as a multipart form attachment instead of passing a path string. |
"SSL", "certificate" | TLS/certificate issue connecting to an external database or S3 | Check SSL settings for your database or storage backend. Set DB_SSL=false in development if certificates are self-signed. |
Enabling debug logs
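One generic way to surface verbose logs is raising the root log level. This sketch uses Python's standard `logging` module; whether Cognee routes its logs through it, and the `"cognee"` logger name, are assumptions — check Setup Configuration for the supported mechanism:

```python
import logging

# Raise the root logger to DEBUG so library log records are not filtered out.
# force=True replaces any handlers configured earlier in the process.
logging.basicConfig(
    level=logging.DEBUG,
    format="%(asctime)s %(name)s %(levelname)s %(message)s",
    force=True,
)

# "cognee" as a logger name is an assumption; adjust to what your logs show.
logging.getLogger("cognee").debug("debug logging enabled")
```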
For errors not covered above, enable verbose logging to see the full stack trace.

Still stuck?
- Check Setup Configuration to verify your environment variables.
- Ask in the Discord community with the full `error` field from the 409 response.
- Open an issue on GitHub.
- Cognify — expand data into chunks, embeddings, and graphs
- DataPoints — the units you’ll see after Cognify
- Building Blocks — learn about Tasks and Pipelines behind Add