dlt (Data Load Tool) - Cognee Documentation

Ingest structured relational data — databases, CSV files, and dlt resources — directly into cognee’s knowledge graph. Foreign keys become graph edges, tables become schema nodes, and each row becomes a searchable document, all built deterministically from the schema without LLM extraction.

Why Use This Integration

Schema-Aware Graphs: Foreign key relationships are preserved as first-class edges in the knowledge graph
Deterministic Graph Construction: Structured data bypasses LLM entity extraction — no hallucination risk
Mixed Ingestion: Combine structured (dlt) and unstructured (text, PDF) data in the same dataset
Multiple Input Modes: Pass explicit dlt resources, CSV file paths, or database connection strings
Write Dispositions: Control how data is synced — merge (upsert), append, or replace

Installation

pip install 'cognee[dlt]'

Or with uv:

uv pip install 'cognee[dlt]'

Quick Start

1. Ingest a dlt Resource

Define a dlt resource and pass it to cognee.remember(). The dlt-specific structured-ingestion options primary_key, write_disposition, query, and max_rows_per_table are accepted by cognee.remember() and forwarded to the underlying ingestion step. After ingestion, use cognee.recall(...) to query the graph.

import dlt
import cognee
import asyncio

@dlt.resource()
def users_and_pets():
    yield [
        {
            "id": 1,
            "name": "Alice",
            "pets": [
                {"id": 1, "name": "Fluffy", "type": "cat"},
                {"id": 2, "name": "Spot", "type": "dog"},
            ],
        },
        {
            "id": 2,
            "name": "Bob",
            "pets": [{"id": 3, "name": "Fido", "type": "dog"}],
        },
    ]

async def main():
    await cognee.remember(
        users_and_pets,
        dataset_name="users_and_pets",
        primary_key="id",
    )
    results = await cognee.recall(
        query_text="Which pet does Alice have?",
        datasets=["users_and_pets"],
    )
    print(results)

asyncio.run(main())

dlt automatically detects nested structures (like pets inside each user) and creates separate tables with foreign key relationships.
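
To make the nested-structure behaviour concrete, here is a minimal pure-Python sketch of the idea: nested lists of records are lifted into a child table whose rows point back to their parent. The helper flatten_nested, the `__`-joined child table name, and the parent_id column are illustrative assumptions; dlt generates its own table names and link columns, and its normalizer covers far more cases.

```python
from typing import Any


def flatten_nested(rows: list[dict[str, Any]], parent_table: str) -> dict[str, list[dict[str, Any]]]:
    """Split rows containing nested lists of dicts into parent and child tables.

    Hypothetical helper for illustration only; not dlt's or cognee's code.
    Assumes every parent row has an "id" field.
    """
    tables: dict[str, list[dict[str, Any]]] = {parent_table: []}
    for row in rows:
        parent_row: dict[str, Any] = {}
        for key, value in row.items():
            is_nested = isinstance(value, list) and all(isinstance(v, dict) for v in value)
            if is_nested:
                child_table = f"{parent_table}__{key}"
                for child in value:
                    # Each child row keeps a reference back to its parent row.
                    tables.setdefault(child_table, []).append({**child, "parent_id": row["id"]})
            else:
                parent_row[key] = value
        tables[parent_table].append(parent_row)
    return tables


users = [
    {"id": 1, "name": "Alice", "pets": [
        {"id": 1, "name": "Fluffy", "type": "cat"},
        {"id": 2, "name": "Spot", "type": "dog"},
    ]},
    {"id": 2, "name": "Bob", "pets": [{"id": 3, "name": "Fido", "type": "dog"}]},
]

tables = flatten_nested(users, "users_and_pets")
for table_name, table_rows in tables.items():
    print(table_name, table_rows)
```

The parent_id column plays the role of the foreign key that later becomes a graph edge between a pet row and its owner row.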

The lower-level cognee.add(...) + cognee.cognify(...) pair still accepts the same dlt kwargs and remains useful when you need to run ingestion and graph building as separate steps. For the runnable end-to-end version of this walkthrough, see examples/demos/dlt_ingestion_example.py.

2. Build and Query the Graph

Once remember() finishes ingesting and building the graph, use cognee.recall(...) to query it.

Other Input Modes

CSV Auto-Detection

Pass a .csv file path and cognee creates a dlt source automatically:

await cognee.remember(
    "/path/to/employees.csv",
    dataset_name="employees",
    primary_key="id",
)

Database Connection String

Ingest tables directly from an existing database:

await cognee.remember(
    "postgresql://user:pass@host/db",
    dataset_name="company_db",
    primary_key="id",
)

Supported databases via auto-detection: SQLite, PostgreSQL, MySQL, MSSQL, Oracle. Amazon Redshift is also compatible because it speaks the PostgreSQL wire protocol: use a standard postgresql:// connection string pointing to your Redshift endpoint. For Snowflake and Google BigQuery, construct a dlt source directly and pass it to cognee.remember() (see the Cloud Data Warehouses section below). You can optionally restrict what is ingested by passing a SQL statement in the query parameter:

await cognee.remember(
    "postgresql://user:pass@host/db",
    dataset_name="engineering_team",
    primary_key="id",
    query="SELECT * FROM employees WHERE department = 'Engineering'",
)

Mixed Structured + Unstructured

Combine dlt resources with unstructured text in a single dataset:

text = """Alice has two pets: a cat named Fluffy and a dog named Spot.
Bob has a dog named Fido, who is friendly with both Fluffy and Spot."""

await cognee.remember(
    [text, users_and_pets],
    dataset_name="users_and_pets_with_text",
    primary_key="id",
)

Structured data creates deterministic graph nodes from the schema, while unstructured text goes through LLM-based entity extraction. Both are combined in the same knowledge graph.

Write Dispositions

Control how data is synced on repeated runs using the write_disposition parameter:

replace (default): Drop and recreate tables on each run. Use for full snapshot refreshes.
merge: Upsert by primary key — updates existing rows, inserts new ones. Best for data that changes over time.
append: Always insert without deduplication. Use for time-series data and event logs.

# Append mode — every call adds new rows, no dedup
await cognee.remember(
    event_resource,
    dataset_name="events",
    primary_key="id",
    write_disposition="append",
)
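
The difference between the three dispositions is easiest to see on a small example. The following sketch simulates them on in-memory lists of rows; apply_disposition is a hypothetical helper written for this illustration and is not how dlt or cognee implement loading.

```python
from typing import Any

Row = dict[str, Any]


def apply_disposition(
    existing: list[Row],
    incoming: list[Row],
    disposition: str,
    primary_key: str = "id",
) -> list[Row]:
    """Return the table contents after loading `incoming` on top of `existing`."""
    if disposition == "replace":
        # Drop everything that was there and keep only the new snapshot.
        return list(incoming)
    if disposition == "append":
        # Keep everything, even rows that share a primary key.
        return existing + incoming
    if disposition == "merge":
        # Upsert: an incoming row overwrites an existing row with the same key.
        by_key = {row[primary_key]: row for row in existing}
        for row in incoming:
            by_key[row[primary_key]] = row
        return list(by_key.values())
    raise ValueError(f"unknown write disposition: {disposition!r}")


first_run = [{"id": 1, "status": "new"}, {"id": 2, "status": "new"}]
second_run = [{"id": 2, "status": "shipped"}, {"id": 3, "status": "new"}]

for mode in ("replace", "merge", "append"):
    print(mode, apply_disposition(first_run, second_run, mode))
```

With these inputs, replace keeps two rows, merge keeps three (row 2 is updated in place), and append keeps four, including both versions of row 2.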

How It Works

Source Detection: cognee identifies dlt resources, CSV files, and connection strings in the input
Pipeline Execution: A dlt pipeline loads data into a per-dataset staging database
Schema Extraction: Table schemas, primary keys, and foreign keys are extracted
Graph Construction: Each row becomes a document node; foreign keys become edges between nodes
LLM Bypass: Structured rows skip chunking, entity extraction, and summarization — the graph is built entirely from schema metadata
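
The source-detection step can be pictured as a small classifier over the items passed to remember(). The sketch below is an assumption about the general shape of such a check, not cognee's actual logic: classify_input, the returned labels, the callable() test for resources, and the DB_SCHEMES set (derived from the list of auto-detected databases on this page) are all hypothetical.

```python
from urllib.parse import urlparse

# URL schemes assumed to map to the auto-detected databases listed on this page.
DB_SCHEMES = {"sqlite", "postgresql", "mysql", "mssql", "oracle"}


def classify_input(item: object) -> str:
    """Guess how an input item would be routed (illustrative only)."""
    if callable(item):
        # Stand-in check: a decorated dlt resource is a callable object.
        return "dlt_resource"
    if isinstance(item, str):
        if item.lower().endswith(".csv"):
            return "csv_file"
        # Drop an optional driver suffix such as "postgresql+psycopg2".
        scheme = urlparse(item).scheme.split("+")[0].lower()
        if scheme in DB_SCHEMES:
            return "connection_string"
    # Anything else is treated as unstructured content for the LLM pipeline.
    return "unstructured"


def sample_resource():
    yield [{"id": 1}]


for item in (
    sample_resource,
    "/path/to/employees.csv",
    "postgresql://user:pass@host/db",
    "Alice has two pets: a cat named Fluffy and a dog named Spot.",
):
    print(classify_input(item))
```

Structured items would then go through the dlt pipeline, while anything classified as unstructured would follow the usual chunking and entity-extraction path.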

The primary_key parameter controls upsert behavior when you use write_disposition="merge". If primary_key is not specified, cognee auto-detects it from an id column or falls back to the first column. To change the per-table row cap, pass the max_rows_per_table kwarg to remember() / add() to override it for a single call, or set the DLT_MAX_ROWS_PER_TABLE environment variable (default: 50) to change the process-wide default.
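
The precedence described above (per-call kwarg first, then environment variable, then the built-in default) can be sketched as follows. resolve_row_cap is a hypothetical helper for illustration; only the variable name DLT_MAX_ROWS_PER_TABLE and the default of 50 come from this page.

```python
import os
from typing import Optional

DEFAULT_MAX_ROWS = 50  # default stated in the documentation above


def resolve_row_cap(max_rows_per_table: Optional[int] = None) -> int:
    """Pick the effective per-table row cap for one ingestion call."""
    if max_rows_per_table is not None:
        # An explicit kwarg wins for this call only.
        return max_rows_per_table
    # Otherwise use the process-wide setting, falling back to the default.
    return int(os.environ.get("DLT_MAX_ROWS_PER_TABLE", DEFAULT_MAX_ROWS))


os.environ["DLT_MAX_ROWS_PER_TABLE"] = "200"
print(resolve_row_cap())      # uses the environment variable
print(resolve_row_cap(1000))  # explicit kwarg overrides it
```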

Foreign Key Resolution

A foreign key becomes a graph edge only when both the source row and the target row are loaded in the same ingestion run. Two edge cases are worth knowing about — cognee now logs a warning in each so they are diagnosable rather than silent:

Target row not loaded: if a foreign key points at a row that wasn’t ingested — most commonly because the target table hit the max_rows_per_table cap — the reference is dropped and no edge is created. The warning identifies the dropped references as source_table.column -> ref_table:value. If you see missing edges, raise max_rows_per_table so the referenced rows are included.
Duplicate primary keys within a table: if multiple rows in a table share the same primary key, foreign key edges that target that key resolve to the last such row loaded; earlier rows with the same key are shadowed for FK targeting. The warning names the affected table and primary key.
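
Both edge cases fall out naturally if the target table is indexed by primary key before edges are built. The sketch below illustrates that under stated assumptions: resolve_foreign_keys is a hypothetical helper, and the wording of the duplicate-key warning is invented for the example. Only the source_table.column -> ref_table:value format for dropped references is taken from the text above.

```python
from typing import Any

Row = dict[str, Any]


def resolve_foreign_keys(
    source_table: str,
    source_rows: list[Row],
    fk_column: str,
    ref_table: str,
    ref_rows: list[Row],
    ref_pk: str = "id",
) -> tuple[list[tuple[Row, Row]], list[str]]:
    """Return (edges, warnings) for one foreign key column (illustrative only)."""
    targets: dict[Any, Row] = {}
    duplicates: set[Any] = set()
    for row in ref_rows:
        if row[ref_pk] in targets:
            duplicates.add(row[ref_pk])
        # A later row with the same key shadows the earlier one.
        targets[row[ref_pk]] = row

    warnings = [f"duplicate primary key in {ref_table}: {ref_pk}={key}" for key in sorted(duplicates)]
    edges: list[tuple[Row, Row]] = []
    for row in source_rows:
        value = row[fk_column]
        if value in targets:
            edges.append((row, targets[value]))
        else:
            # Target row was not loaded, so the reference is dropped.
            warnings.append(f"{source_table}.{fk_column} -> {ref_table}:{value}")
    return edges, warnings


customers = [{"id": 1, "name": "Ann"}, {"id": 1, "name": "Ann (updated)"}]
orders = [{"id": 10, "customer_id": 1}, {"id": 11, "customer_id": 7}]

edges, warnings = resolve_foreign_keys("orders", orders, "customer_id", "customers", customers)
print(edges)
print(warnings)
```

Order 10 links to the last-loaded customer row with id 1, while order 11 produces no edge because customer 7 was never loaded.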

Use Cases

CRM and Relational Data

Load customer, order, and product tables from a database. Foreign keys between tables (e.g., order.customer_id → customer.id) become graph edges, enabling cross-table queries like “Which customers ordered product X?”

CSV Analytics Pipeline

Point cognee at CSV exports from analytics tools. Each row becomes a searchable node in the graph, and you can combine them with unstructured reports in the same dataset.

Event Log Ingestion

Use write_disposition="append" to stream event batches into cognee without deduplication. Query across the full event history with natural language.

Database Mirroring

Use write_disposition="merge" to keep cognee’s graph in sync with a live database. Rows that are removed upstream are cleaned up best-effort; any orphaned rows that fail to delete are logged and retried on the next ingest.
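
As a rough picture of what best-effort cleanup with retry can look like, the sketch below computes which rows disappeared upstream, tries to delete each one, and carries failures over to the next run. This page does not describe cognee's internal cleanup mechanism, so sync_deletions, its arguments, and the flaky_delete stand-in are all hypothetical.

```python
from typing import Callable


def sync_deletions(
    graph_keys: set[int],
    upstream_keys: set[int],
    pending_retry: set[int],
    delete: Callable[[int], None],
) -> set[int]:
    """Delete rows missing upstream; return keys that still need a retry."""
    orphaned = (graph_keys - upstream_keys) | pending_retry
    still_pending: set[int] = set()
    for key in sorted(orphaned):
        try:
            delete(key)
        except Exception as exc:
            # Stand-in for logging: record the failure and retry next time.
            print(f"failed to delete row {key}: {exc}")
            still_pending.add(key)
    return still_pending


deleted: list[int] = []
failed_once: set[int] = set()


def flaky_delete(key: int) -> None:
    # Simulate a transient failure the first time key 3 is deleted.
    if key == 3 and key not in failed_once:
        failed_once.add(key)
        raise RuntimeError("temporary backend error")
    deleted.append(key)


pending_after_first = sync_deletions({1, 2, 3, 4}, {1, 2}, set(), flaky_delete)
pending_after_second = sync_deletions({1, 2, 3}, {1, 2}, pending_after_first, flaky_delete)
print(pending_after_first, pending_after_second, deleted)
```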

Cloud Data Warehouses (Snowflake, Redshift, BigQuery)

Amazon Redshift speaks the PostgreSQL wire protocol, so the standard connection string auto-detection works:

await cognee.remember(
    "postgresql://user:pass@my-cluster.us-east-1.redshift.amazonaws.com:5439/mydb",
    dataset_name="redshift_data",
    primary_key="id",
)

Snowflake requires constructing a dlt sql_database source manually (install snowflake-sqlalchemy first):

pip install 'cognee[dlt]' snowflake-sqlalchemy

import asyncio

import cognee
from dlt.sources.sql_database import sql_database

# Build a dlt source over the Snowflake schema, limited to the listed tables.
source = sql_database(
    credentials="snowflake://user:password@account_identifier/database/schema?warehouse=MY_WH",
    table_names=["orders", "customers"],
)

async def main():
    await cognee.remember(source, dataset_name="snowflake_data", primary_key="id")

asyncio.run(main())

The account_identifier is the part before .snowflakecomputing.com in your Snowflake URL (e.g. myorg-myaccount). Omit table_names to ingest all tables in the schema.

Google BigQuery works the same way using dlt’s BigQuery connector: construct the source and pass it directly to cognee.remember(). See the dlt sql_database docs for connector-specific setup.

Remember Operation

Learn more about data ingestion in cognee

dlt Documentation

Official dlt documentation and guides