cognee.add()

async def add(
    data: Union[BinaryIO, list[BinaryIO], str, list[str], DataItem, list[DataItem]],
    dataset_name: str = 'main_dataset',
    user: User = None,
    node_set: Optional[List[str]] = None,
    vector_db_config: dict = None,
    graph_db_config: dict = None,
    dataset_id: Optional[UUID] = None,
    preferred_loaders: Optional[List[Union[str, dict[str, dict[str, Any]]]]] = None,
    incremental_loading: bool = True,
    data_per_batch: Optional[int] = 20,
)

Description

Add data to Cognee for knowledge graph processing. This is the first step in the Cognee workflow: it ingests raw data and prepares it for processing. The function accepts various data formats, including text, files, URLs, and binary streams, then stores them in a specified dataset for further processing.
Prerequisites:
  • LLM_API_KEY: must be set in environment variables for content processing
  • Database setup: relational and vector databases must be configured
  • User authentication: uses the default user if none is provided (created automatically)
Supported Input Types:
  • Text strings: Direct text content (str) - any string not starting with "/" or "file://"
  • File paths: Local file paths as strings in these formats:
    • Absolute paths: "/path/to/document.pdf"
    • File URLs: "file:///path/to/document.pdf" or "file://relative/path.txt"
    • S3 paths: "s3://bucket-name/path/to/file.pdf"
  • Binary file objects: File handles/streams (BinaryIO)
  • Lists: Multiple files or text strings in a single call
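The string-dispatch rules above can be sketched as a small classifier. This is illustrative only, not cognee's actual implementation; the category names are assumptions:

```python
def classify_input(item: str) -> str:
    # Illustrative only: mirrors the string-dispatch rules described
    # above; not cognee's internal logic.
    if item.startswith(("http://", "https://")):
        return "url"
    if item.startswith("s3://"):
        return "s3_path"
    if item.startswith("file://"):
        return "file_url"
    if item.startswith("/"):
        return "absolute_path"
    return "text"  # anything else is treated as raw text content
```

For example, `classify_input("s3://bucket/doc.pdf")` returns `"s3_path"`, while any string without a recognized prefix is treated as raw text.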
Supported File Formats:
  • Text files (.txt, .md, .csv)
  • PDFs (.pdf)
  • Images (.png, .jpg, .jpeg) - extracted via OCR/vision models
  • Audio files (.mp3, .wav) - transcribed to text
  • Code files (.py, .js, .ts, etc.) - parsed for structure and content
  • Office documents (.docx, .pptx)
Workflow:
  1. Data Resolution: Resolves file paths and validates accessibility
  2. Content Extraction: Extracts text content from various file formats
  3. Dataset Storage: Stores processed content in the specified dataset
  4. Metadata Tracking: Records file metadata, timestamps, and user permissions
  5. Permission Assignment: Grants user read/write/delete/share permissions on dataset
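Step 1 (Data Resolution) involves mapping file URLs to local paths. A minimal sketch using the standard library, handling only the absolute "file:///..." form (relative file URLs and accessibility checks are out of scope here):

```python
from pathlib import Path
from urllib.parse import unquote, urlparse

def resolve_file_url(url: str) -> Path:
    # Sketch of the Data Resolution step for absolute file:// URLs;
    # not cognee's actual resolver.
    parsed = urlparse(url)
    return Path(unquote(parsed.path))
```

For example, `resolve_file_url("file:///path/to/document.pdf")` yields the local path `/path/to/document.pdf`, with percent-encoded characters decoded.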
Args:
  data: The data to ingest. Can be:
  • Single text string: "Your text content here"
  • Absolute file path: "/path/to/document.pdf"
  • File URL: "file:///absolute/path/to/document.pdf" or "file://relative/path.txt"
  • S3 path: "s3://my-bucket/documents/file.pdf"
  • List of mixed types: ["text content", "/path/file.pdf", "file://doc.txt", file_handle]
  • Binary file object: open("file.txt", "rb")
  • URL: a web link (http or https)
  dataset_name: Name of the dataset to store data in. Defaults to "main_dataset". Create separate datasets to organize different knowledge domains.
  user: User object for authentication and permissions. Uses the default user if None. Default user: "default_user@example.com" (created automatically on first use). Users can only access datasets they have permissions for.
  node_set: Optional list of node identifiers for graph organization and access control. Used for grouping related data points in the knowledge graph.
  vector_db_config: Optional configuration for the vector database (for custom setups).
  graph_db_config: Optional configuration for the graph database (for custom setups).
  dataset_id: Optional specific dataset UUID to use instead of dataset_name.
  extraction_rules: Optional dictionary of rules (e.g., CSS selectors, XPath) for extracting specific content from web pages using BeautifulSoup.
  tavily_config: Optional configuration for the Tavily API, including API key and extraction settings.
  soup_crawler_config: Optional configuration for the BeautifulSoup crawler, specifying concurrency, crawl delay, and extraction rules.
Returns: PipelineRunInfo: Information about the ingestion pipeline execution including:
  • Pipeline run ID for tracking
  • Dataset ID where data was stored
  • Processing status and any errors
  • Execution timestamps and metadata
Next Steps: After successfully adding data, call cognify() to process the ingested content:
import cognee

# Step 1: Add your data (text content or file path)
await cognee.add("Your document content")  # Raw text
# OR
await cognee.add("/path/to/your/file.pdf")  # File path

# Step 2: Process into knowledge graph
await cognee.cognify()

# Step 3: Search and query
results = await cognee.search("What insights can you find?")
Example Usage:
# Add a single text document
await cognee.add("Natural language processing is a field of AI...")

# Add multiple files with different path formats
await cognee.add([
    "/absolute/path/to/research_paper.pdf",        # Absolute path
    "file://relative/path/to/dataset.csv",         # Relative file URL
    "file:///absolute/path/to/report.docx",        # Absolute file URL
    "s3://my-bucket/documents/data.json",           # S3 path
    "Additional context text"                       # Raw text content
])

# Add to a specific dataset
await cognee.add(
    data="Project documentation content",
    dataset_name="project_docs"
)

# Add a single file
await cognee.add("/home/user/documents/analysis.pdf")

# Add a single url and bs4 extract ingestion method
extraction_rules = {
    "title": "h1",
    "description": "p",
    "more_info": "a[href*='more-info']"
}
await cognee.add("https://example.com", extraction_rules=extraction_rules)

# Add a single url and tavily extract ingestion method
# Make sure TAVILY_API_KEY is set as an environment variable
await cognee.add("https://example.com")

# Add multiple urls
await cognee.add(["https://example.com", "https://books.toscrape.com"])
Environment Variables:
Required:
  • LLM_API_KEY: API key for your LLM provider (OpenAI, Anthropic, etc.)
Optional:
  • LLM_PROVIDER: “openai” (default), “anthropic”, “gemini”, “ollama”, “mistral”, “bedrock”
  • LLM_MODEL: Model name (default: “gpt-5-mini”)
  • DEFAULT_USER_EMAIL: Custom default user email
  • DEFAULT_USER_PASSWORD: Custom default user password
  • VECTOR_DB_PROVIDER: “lancedb” (default), “chromadb”, “pgvector”
  • GRAPH_DATABASE_PROVIDER: “kuzu” (default), “neo4j”
  • TAVILY_API_KEY: API key for Tavily-based web page extraction

Parameters

data
Union[BinaryIO, list[BinaryIO], str, list[str], DataItem, list[DataItem]]
required
Data to ingest. Accepts text strings, file paths (local, S3, or URLs), binary file objects, DataItem objects, or lists of any of these.
dataset_name
str
default:"'main_dataset'"
Name of the dataset to add data to.
user
User
default:"None"
User performing the operation. Uses default user if not provided.
node_set
Optional[List[str]]
default:"None"
List of node set names to associate with the data.
vector_db_config
dict
default:"None"
Override vector database configuration for this operation.
graph_db_config
dict
default:"None"
Override graph database configuration for this operation.
dataset_id
Optional[UUID]
default:"None"
UUID of an existing dataset to add data to. Alternative to dataset_name.
preferred_loaders
Optional[List[Union[str, dict[str, dict[str, Any]]]]]
default:"None"
Custom loader configuration for specific file types.
incremental_loading
bool
default:"True"
If true, skip data that has already been ingested.
data_per_batch
Optional[int]
default:"20"
Number of data items to process per batch.
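The `data_per_batch` behavior can be pictured as simple chunking. This is an illustrative sketch assuming the parameter splits the input into fixed-size batches; it is not cognee's internal scheduler:

```python
from typing import Iterator, List, Sequence

def batched(items: Sequence, batch_size: int = 20) -> Iterator[List]:
    # Illustrative chunking: yields successive fixed-size batches,
    # with the final batch holding any remainder.
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])
```

With the default batch size of 20, a list of 45 items would be processed as three batches of 20, 20, and 5.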

Supported Input Types

Type            Example
Text string     "Cognee is a knowledge graph platform."
File path       "/path/to/document.pdf"
S3 path         "s3://bucket/file.txt"
URL             "https://example.com/article"
Binary file     open("file.pdf", "rb")
Mixed list      ["text", "/path/file.pdf", open("f.txt", "rb")]

Supported File Formats

.txt, .md, .csv, .pdf, .docx, .pptx, .png, .jpg, .jpeg, .mp3, .wav, .py, .js, .ts, and more.
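The formats above map to the processing described earlier (OCR for images, transcription for audio, parsing for code). A sketch of such an extension-based dispatch; the category names are assumptions, not cognee's API:

```python
from pathlib import Path

# Illustrative extension-to-processing map; category names are
# assumptions based on the format list above.
PROCESSING = {
    ".txt": "text", ".md": "text", ".csv": "text",
    ".pdf": "pdf_extraction",
    ".docx": "office_extraction", ".pptx": "office_extraction",
    ".png": "ocr_vision", ".jpg": "ocr_vision", ".jpeg": "ocr_vision",
    ".mp3": "transcription", ".wav": "transcription",
    ".py": "code_parsing", ".js": "code_parsing", ".ts": "code_parsing",
}

def processing_for(path: str) -> str:
    # Extensions are matched case-insensitively.
    return PROCESSING.get(Path(path).suffix.lower(), "unsupported")
```

For example, `processing_for("photo.JPG")` maps to OCR/vision processing regardless of extension case.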

Examples

import cognee

# Add text
await cognee.add("Cognee builds knowledge graphs from your data.")

# Add a file
await cognee.add("/path/to/report.pdf", dataset_name="reports")

# Add multiple items to a named dataset
await cognee.add(
    ["First document text", "/path/to/second.pdf"],
    dataset_name="my_project",
)

# Add with custom node set
await cognee.add("Technical spec content", node_set=["engineering"])