cognee.add()
Description
Add data to Cognee for knowledge graph processing. This is the first step in the Cognee workflow: it ingests raw data and prepares it for processing. The function accepts various data formats, including text, files, URLs, and binary streams, and stores them in a specified dataset for further processing.
Prerequisites:
- LLM_API_KEY: Must be set in environment variables for content processing
- Database Setup: Relational and vector databases must be configured
- User Authentication: Uses default user if none provided (created automatically)
Accepted data types:
- Text strings: Direct text content (str) - any string not starting with "/" or "file://"
- File paths: Local file paths as strings in these formats:
  - Absolute paths: "/path/to/document.pdf"
  - File URLs: "file:///path/to/document.pdf" or "file://relative/path.txt"
  - S3 paths: "s3://bucket-name/path/to/file.pdf"
- Binary file objects: File handles/streams (BinaryIO)
- Lists: Multiple files or text strings in a single call
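As a rough illustration of how these input kinds can be told apart, here is a minimal, hypothetical dispatch function. It is not Cognee's actual resolver; the function name and category labels are made up for this sketch:

```python
from typing import BinaryIO, Union

def classify_input(item: Union[str, BinaryIO]) -> str:
    """Illustrative only: mimic how add() might distinguish input kinds."""
    if not isinstance(item, str):
        return "binary"                      # file handle / stream (BinaryIO)
    if item.startswith("s3://"):
        return "s3_path"
    if item.startswith("file://"):
        return "file_url"
    if item.startswith(("http://", "https://")):
        return "url"
    if item.startswith("/"):
        return "absolute_path"
    return "text"                            # plain text content
```

This mirrors the rule above: any string not starting with "/" or a recognized scheme is treated as raw text.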
Supported file formats:
- Text files (.txt, .md, .csv)
- PDFs (.pdf)
- Images (.png, .jpg, .jpeg) - extracted via OCR/vision models
- Audio files (.mp3, .wav) - transcribed to text
- Code files (.py, .js, .ts, etc.) - parsed for structure and content
- Office documents (.docx, .pptx)
Workflow:
- Data Resolution: Resolves file paths and validates accessibility
- Content Extraction: Extracts text content from various file formats
- Dataset Storage: Stores processed content in the specified dataset
- Metadata Tracking: Records file metadata, timestamps, and user permissions
- Permission Assignment: Grants user read/write/delete/share permissions on dataset
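End to end, the workflow above looks roughly like this. A minimal sketch, assuming cognee is installed and LLM_API_KEY is set; the import is deliberately inside the function so the sketch can be defined without cognee present:

```python
async def ingest_and_process() -> None:
    # Lazy import: the sketch stays importable even before cognee is installed.
    import cognee

    # Step 1: add() resolves the input, extracts text content, and stores
    # it in the default dataset ("main_dataset") with user permissions.
    await cognee.add("Cognee is a knowledge graph platform.")

    # Step 2: cognify() processes the ingested content into a knowledge graph.
    await cognee.cognify()
```

Run it with `asyncio.run(ingest_and_process())`.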
Examples of accepted data values:
- Single text string: "Your text content here"
- Absolute file path: "/path/to/document.pdf"
- File URL: "file:///absolute/path/to/document.pdf" or "file://relative/path.txt"
- S3 path: "s3://my-bucket/documents/file.pdf"
- List of mixed types: ["text content", "/path/file.pdf", "file://doc.txt", file_handle]
- Binary file object: open("file.txt", "rb")
- URL: A web link (https or http)
Additional parameters:
- dataset_name: Name of the dataset to store data in. Defaults to "main_dataset". Create separate datasets to organize different knowledge domains.
- user: User object for authentication and permissions. Uses the default user if None. Default user: "default_user@example.com" (created automatically on first use). Users can only access datasets they have permissions for.
- node_set: Optional list of node identifiers for graph organization and access control. Used for grouping related data points in the knowledge graph.
- vector_db_config: Optional configuration for the vector database (for custom setups).
- graph_db_config: Optional configuration for the graph database (for custom setups).
- dataset_id: Optional specific dataset UUID to use instead of dataset_name.
- extraction_rules: Optional dictionary of rules (e.g., CSS selectors, XPath) for extracting specific content from web pages using BeautifulSoup.
- tavily_config: Optional configuration for the Tavily API, including API key and extraction settings.
- soup_crawler_config: Optional configuration for the BeautifulSoup crawler, specifying concurrency, crawl delay, and extraction rules.
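Using dataset_name and node_set to organize knowledge domains might look like this. A sketch, assuming cognee is installed and configured; the paths, bucket, and names are placeholders:

```python
async def ingest_by_domain() -> None:
    # Lazy import: the sketch stays importable even before cognee is installed.
    import cognee

    # Separate datasets isolate knowledge domains; node sets tag related
    # data points in the graph for organization and access control.
    await cognee.add(
        "/path/to/contracts/q3_agreement.pdf",   # placeholder path
        dataset_name="legal_docs",
        node_set=["contracts", "q3"],
    )
    await cognee.add(
        "s3://my-bucket/research/paper.pdf",     # placeholder bucket
        dataset_name="research",
    )
```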
The returned pipeline run information includes:
- Pipeline run ID for tracking
- Dataset ID where data was stored
- Processing status and any errors
- Execution timestamps and metadata
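Conceptually, the returned run information bundles the fields listed above. A hypothetical shape, with field names chosen for illustration rather than taken from Cognee's actual class:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional
from uuid import UUID, uuid4

@dataclass
class IngestionRunInfo:
    """Illustrative stand-in for the run information returned by add()."""
    pipeline_run_id: UUID            # for tracking the run
    dataset_id: UUID                 # where the data was stored
    status: str                      # e.g. "completed" or "errored"
    error: Optional[str] = None      # populated when the run fails
    started_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

run = IngestionRunInfo(pipeline_run_id=uuid4(), dataset_id=uuid4(), status="completed")
```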
After adding data, call cognify() to process the ingested content.
Environment variables:
- LLM_API_KEY: API key for your LLM provider (OpenAI, Anthropic, etc.)
- LLM_PROVIDER: “openai” (default), “anthropic”, “gemini”, “ollama”, “mistral”, “bedrock”
- LLM_MODEL: Model name (default: “gpt-5-mini”)
- DEFAULT_USER_EMAIL: Custom default user email
- DEFAULT_USER_PASSWORD: Custom default user password
- VECTOR_DB_PROVIDER: “lancedb” (default), “chromadb”, “pgvector”
- GRAPH_DATABASE_PROVIDER: “kuzu” (default), “neo4j”
- TAVILY_API_KEY: API key for the Tavily extraction service
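These variables are read from the environment, so in Python they can be set before importing cognee. The values below are placeholders; substitute your real credentials:

```python
import os

# Placeholder values: replace with your real key and preferred providers.
os.environ["LLM_API_KEY"] = "sk-placeholder"
os.environ["LLM_PROVIDER"] = "openai"           # default
os.environ["LLM_MODEL"] = "gpt-5-mini"          # default
os.environ["VECTOR_DB_PROVIDER"] = "lancedb"    # default
os.environ["GRAPH_DATABASE_PROVIDER"] = "kuzu"  # default
```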
Parameters
- data: Data to ingest. Accepts text strings, file paths (local, S3, or URLs), binary file objects, DataItem objects, or lists of any of these.
- dataset_name: Name of the dataset to add data to.
- user: User performing the operation. Uses the default user if not provided.
- node_set: List of node set names to associate with the data.
- vector_db_config: Override vector database configuration for this operation.
- graph_db_config: Override graph database configuration for this operation.
- dataset_id: UUID of an existing dataset to add data to. Alternative to dataset_name.
- Loader configuration: Custom loader configuration for specific file types.
- Incremental ingestion: If true, skip data that has already been ingested.
- Batch size: Number of data items to process per batch.
Supported Input Types
| Type | Example |
|---|---|
| Text string | "Cognee is a knowledge graph platform." |
| File path | "/path/to/document.pdf" |
| S3 path | "s3://bucket/file.txt" |
| URL | "https://example.com/article" |
| Binary file | open("file.pdf", "rb") |
| Mixed list | ["text", "/path/file.pdf", open("f.txt", "rb")] |
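A mixed list lets one call ingest heterogeneous sources together. A sketch, assuming cognee is installed; file names, paths, and the bucket are placeholders:

```python
async def ingest_mixed_sources() -> None:
    # Lazy import: the sketch stays importable even before cognee is installed.
    import cognee

    with open("notes.txt", "rb") as handle:  # placeholder local file
        # One call can mix raw text, local and S3 paths, URLs, and open files.
        await cognee.add([
            "Raw text ingested directly.",
            "/path/to/report.pdf",
            "s3://my-bucket/data/file.txt",
            "https://example.com/article",
            handle,
        ])
```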
Supported File Formats
.txt, .md, .csv, .pdf, .docx, .pptx, .png, .jpg, .jpeg, .mp3, .wav, .py, .js, .ts, and more.