cognee.add()

async def add(
    data: Union[BinaryIO, list[BinaryIO], str, list[str], DataItem, list[DataItem]],
    dataset_name: str = 'main_dataset',
    user: User = None,
    node_set: Optional[List[str]] = None,
    vector_db_config: dict = None,
    graph_db_config: dict = None,
    dataset_id: Optional[UUID] = None,
    preferred_loaders: Optional[List[Union[str, dict[str, dict[str, Any]]]]] = None,
    incremental_loading: bool = True,
    data_per_batch: Optional[int] = 20,
)

Description

Add data to Cognee for knowledge graph processing. This is the first step in the Cognee workflow: it ingests raw data and prepares it for processing. The function accepts various data formats, including text, files, URLs, and binary streams, then stores them in a specified dataset for further processing.
Prerequisites:
  • LLM_API_KEY: must be set in environment variables for content processing
  • Database setup: relational and vector databases must be configured
  • User authentication: uses the default user if none is provided (created automatically)
Supported Input Types:
  • Text strings: Direct text content (str) - any string not starting with "/" or "file://"
  • File paths: Local file paths as strings in these formats:
    • Absolute paths: "/path/to/document.pdf"
    • File URLs: "file:///path/to/document.pdf" or "file://relative/path.txt"
    • S3 paths: "s3://bucket-name/path/to/file.pdf"
  • Binary file objects: File handles/streams (BinaryIO)
  • Lists: Multiple files or text strings in a single call
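The string-dispatch rules above can be sketched as a small classifier. This is illustrative only, not cognee's actual implementation; the category names are assumptions:

```python
def classify_input(item: str) -> str:
    # Illustrative only: mirrors the string-dispatch rules described
    # above; not cognee's internal logic.
    if item.startswith(("http://", "https://")):
        return "url"
    if item.startswith("s3://"):
        return "s3_path"
    if item.startswith("file://"):
        return "file_url"
    if item.startswith("/"):
        return "absolute_path"
    return "text"  # anything else is treated as raw text content
```

For example, `classify_input("s3://bucket/doc.pdf")` returns `"s3_path"`, while any string without a recognized prefix is treated as raw text.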
Supported File Formats:
  • Text files (.txt, .md, .csv)
  • PDFs (.pdf)
  • Images (.png, .jpg, .jpeg) - extracted via OCR/vision models
  • Audio files (.mp3, .wav) - transcribed to text
  • Code files (.py, .js, .ts, etc.) - parsed for structure and content
  • Office documents (.docx, .pptx)
Workflow:
  1. Data Resolution: Resolves file paths and validates accessibility
  2. Content Extraction: Extracts text content from various file formats
  3. Dataset Storage: Stores processed content in the specified dataset
  4. Metadata Tracking: Records file metadata, timestamps, and user permissions
  5. Permission Assignment: Grants user read/write/delete/share permissions on dataset
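Step 1 (Data Resolution) involves mapping file URLs to local paths. A minimal sketch using the standard library, handling only the absolute "file:///..." form (relative file URLs and accessibility checks are out of scope here):

```python
from pathlib import Path
from urllib.parse import unquote, urlparse

def resolve_file_url(url: str) -> Path:
    # Sketch of the Data Resolution step for absolute file:// URLs;
    # not cognee's actual resolver.
    parsed = urlparse(url)
    return Path(unquote(parsed.path))
```

For example, `resolve_file_url("file:///path/to/document.pdf")` yields the local path `/path/to/document.pdf`, with percent-encoded characters decoded.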
Args:
  data: The data to ingest. Can be:
  • Single text string: "Your text content here"
  • Absolute file path: "/path/to/document.pdf"
  • File URL: "file:///absolute/path/to/document.pdf" or "file://relative/path.txt"
  • S3 path: "s3://my-bucket/documents/file.pdf"
  • List of mixed types: ["text content", "/path/file.pdf", "file://doc.txt", file_handle]
  • Binary file object: open("file.txt", "rb")
  • URL: a web link (http or https)
  dataset_name: Name of the dataset to store data in. Defaults to "main_dataset". Create separate datasets to organize different knowledge domains.
  user: User object for authentication and permissions. Uses the default user if None. Default user: "default_user@example.com" (created automatically on first use). Users can only access datasets they have permissions for.
  node_set: Optional list of node identifiers for graph organization and access control. Used for grouping related data points in the knowledge graph.
  vector_db_config: Optional configuration for the vector database (for custom setups).
  graph_db_config: Optional configuration for the graph database (for custom setups).
  dataset_id: Optional specific dataset UUID to use instead of dataset_name.
  extraction_rules: Optional dictionary of rules (e.g., CSS selectors, XPath) for extracting specific content from web pages using BeautifulSoup.
  tavily_config: Optional configuration for the Tavily API, including API key and extraction settings.
  soup_crawler_config: Optional configuration for the BeautifulSoup crawler, specifying concurrency, crawl delay, and extraction rules.
Returns: PipelineRunInfo: Information about the ingestion pipeline execution including:
  • Pipeline run ID for tracking
  • Dataset ID where data was stored
  • Processing status and any errors
  • Execution timestamps and metadata
Next Steps: After successfully adding data, call cognify() to process the ingested content:
import cognee

# Step 1: Add your data (text content or file path)
await cognee.add("Your document content")  # Raw text
# OR
await cognee.add("/path/to/your/file.pdf")  # File path

# Step 2: Process into knowledge graph
await cognee.cognify()

# Step 3: Search and query
results = await cognee.search("What insights can you find?")
Example Usage:
# Add a single text document
await cognee.add("Natural language processing is a field of AI...")

# Add multiple files with different path formats
await cognee.add([
    "/absolute/path/to/research_paper.pdf",        # Absolute path
    "file://relative/path/to/dataset.csv",         # Relative file URL
    "file:///absolute/path/to/report.docx",        # Absolute file URL
    "s3://my-bucket/documents/data.json",           # S3 path
    "Additional context text"                       # Raw text content
])

# Add to a specific dataset
await cognee.add(
    data="Project documentation content",
    dataset_name="project_docs"
)

# Add a single file
await cognee.add("/home/user/documents/analysis.pdf")

# Add a single url and bs4 extract ingestion method
extraction_rules = {
    "title": "h1",
    "description": "p",
    "more_info": "a[href*='more-info']"
}
await cognee.add("https://example.com", extraction_rules=extraction_rules)

# Add a single url and tavily extract ingestion method
# Make sure TAVILY_API_KEY is set as an environment variable
await cognee.add("https://example.com")

# Add multiple urls
await cognee.add(["https://example.com", "https://books.toscrape.com"])
Environment Variables:
Required:
  • LLM_API_KEY: API key for your LLM provider (OpenAI, Anthropic, etc.)
Optional:
  • LLM_PROVIDER: “openai” (default), “anthropic”, “gemini”, “ollama”, “mistral”, “bedrock”
  • LLM_MODEL: Model name (default: “gpt-5-mini”)
  • DEFAULT_USER_EMAIL: Custom default user email
  • DEFAULT_USER_PASSWORD: Custom default user password
  • VECTOR_DB_PROVIDER: “lancedb” (default), “chromadb”, “pgvector”
  • GRAPH_DATABASE_PROVIDER: “kuzu” (default), “neo4j”
  • TAVILY_API_KEY: API key for Tavily-based web page extraction

Parameters

data
Union[BinaryIO, list[BinaryIO], str, list[str], DataItem, list[DataItem]]
required
Data to ingest. Accepts text strings, file paths (local, S3, or URLs), binary file objects, DataItem objects, or lists of any of these.
dataset_name
str
default:"'main_dataset'"
Name of the dataset to add data to.
user
User
default:"None"
User performing the operation. Uses default user if not provided.
node_set
Optional[List[str]]
default:"None"
List of node set names to associate with the data.
vector_db_config
dict
default:"None"
Override vector database configuration for this operation.
graph_db_config
dict
default:"None"
Override graph database configuration for this operation.
dataset_id
Optional[UUID]
default:"None"
UUID of an existing dataset to add data to. Alternative to dataset_name.
preferred_loaders
Optional[List[Union[str, dict[str, dict[str, Any]]]]]
default:"None"
Custom loader configuration for specific file types.
incremental_loading
bool
default:"True"
If true, skip data that has already been ingested.
data_per_batch
Optional[int]
default:"20"
Number of data items to process per batch.
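The `data_per_batch` behavior can be pictured as simple chunking. This is an illustrative sketch assuming the parameter splits the input into fixed-size batches; it is not cognee's internal scheduler:

```python
from typing import Iterator, List, Sequence

def batched(items: Sequence, batch_size: int = 20) -> Iterator[List]:
    # Illustrative chunking: yields successive fixed-size batches,
    # with the final batch holding any remainder.
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])
```

With the default batch size of 20, a list of 45 items would be processed as three batches of 20, 20, and 5.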

Supported Input Types

Type            Example
Text string     "Cognee is a knowledge graph platform."
File path       "/path/to/document.pdf"
S3 path         "s3://bucket/file.txt"
URL             "https://example.com/article"
Binary file     open("file.pdf", "rb")
Mixed list      ["text", "/path/file.pdf", open("f.txt", "rb")]

Supported File Formats

.txt, .md, .csv, .pdf, .docx, .pptx, .png, .jpg, .jpeg, .mp3, .wav, .py, .js, .ts, and more.
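The formats above map to the processing described earlier (OCR for images, transcription for audio, parsing for code). A sketch of such an extension-based dispatch; the category names are assumptions, not cognee's API:

```python
from pathlib import Path

# Illustrative extension-to-processing map; category names are
# assumptions based on the format list above.
PROCESSING = {
    ".txt": "text", ".md": "text", ".csv": "text",
    ".pdf": "pdf_extraction",
    ".docx": "office_extraction", ".pptx": "office_extraction",
    ".png": "ocr_vision", ".jpg": "ocr_vision", ".jpeg": "ocr_vision",
    ".mp3": "transcription", ".wav": "transcription",
    ".py": "code_parsing", ".js": "code_parsing", ".ts": "code_parsing",
}

def processing_for(path: str) -> str:
    # Extensions are matched case-insensitively.
    return PROCESSING.get(Path(path).suffix.lower(), "unsupported")
```

For example, `processing_for("photo.JPG")` maps to OCR/vision processing regardless of extension case.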

Examples

import cognee

# Add text
await cognee.add("Cognee builds knowledge graphs from your data.")

# Add a file
await cognee.add("/path/to/report.pdf", dataset_name="reports")

# Add multiple items to a named dataset
await cognee.add(
    ["First document text", "/path/to/second.pdf"],
    dataset_name="my_project",
)

# Add with custom node set
await cognee.add("Technical spec content", node_set=["engineering"])