Incremental Load

What is Incremental Load?

Incremental load is a data-processing strategy where only new or modified items are processed, while everything that has already been processed (and is still up-to-date) is left untouched.
Compared with a full reload, it has two major benefits:

  1. Speed – you are not wasting compute on content that is already in the system.
  2. Cost – embedding, classification and other LLM tasks are invoked only for the delta, so you pay strictly for the work that needs to be done.

In Cognee every piece of data carries a content hash. During each run Cognee compares the current content hashes with the hashes stored in the graph metadata.
If the hash is unchanged the datapoint is skipped; if it is new or changed the datapoint flows through the pipeline just like on the first run.
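The skip-or-process decision can be sketched in plain Python. This is an illustrative stand-in, not Cognee's internals: `content_hash`, `incremental_step`, and the `stored_hashes` set are hypothetical names.

```python
import hashlib

def content_hash(data: bytes) -> str:
    # Deterministic fingerprint of the raw content.
    return hashlib.sha256(data).hexdigest()

def incremental_step(data: bytes, stored_hashes: set[str]) -> bool:
    """Return True if the item is new or modified and needs processing."""
    h = content_hash(data)
    if h in stored_hashes:
        return False          # hash unchanged: skip the datapoint
    stored_hashes.add(h)      # new or changed: record the hash and process
    return True
```

Because the hash is derived from the content itself, an edited file produces a new hash and is reprocessed, while a renamed-but-identical file is still skipped.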

Enabled by Default

There is nothing you have to turn on. Incremental load is the default behaviour for everyone using Cognee, whether through the SDK, CLI, cloud, or server deployments.
As soon as a piece of data is processed, its hash is stored. Next time Cognee sees the same hash it simply moves on.

Example Workflow

Imagine a dataset located at /datasets/reports/ with 3 PDF files:

reports/
├── 2023-Q1.pdf
├── 2023-Q2.pdf
└── 2023-Q3.pdf

1. Initial Cognify

```python
import asyncio

import cognee

async def initial_load():
    await cognee.prune.prune_data()        # start from a clean slate (optional)
    await cognee.add("/datasets/reports")  # add the folder
    await cognee.cognify()                 # processes all 3 files

asyncio.run(initial_load())
```

After this run, all 3 files are chunked, embedded, classified, and stored in the graph.

2. Adding New Files

A month later two new files arrive:

reports/
├── 2023-Q4.pdf   # NEW
└── 2024-Q1.pdf   # NEW

```python
import asyncio

import cognee

async def incremental_load():
    await cognee.add("/datasets/reports")  # add the same folder again
    await cognee.cognify()                 # only the 2 new files get processed

asyncio.run(incremental_load())
```

Even though cognify() is called again on the entire dataset, Cognee detects that 2023-Q1.pdf, 2023-Q2.pdf and 2023-Q3.pdf were already processed and skips them. Only 2023-Q4.pdf and 2024-Q1.pdf flow through the pipeline, saving you both time and compute.
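The end-to-end behaviour of the two runs above can be simulated with a small self-contained sketch. The `run_pipeline` function is a hypothetical stand-in for the real pipeline; it only demonstrates the hash-based skipping, assuming a persistent set of stored hashes between runs.

```python
import hashlib
from pathlib import Path

def run_pipeline(folder: Path, stored_hashes: set[str]) -> list[str]:
    """Process only files whose content hash is not yet stored.

    Returns the names of the files that were actually processed this run.
    """
    processed = []
    for path in sorted(folder.glob("*.pdf")):
        h = hashlib.sha256(path.read_bytes()).hexdigest()
        if h in stored_hashes:
            continue              # already processed and unchanged: skip
        stored_hashes.add(h)      # record the hash for future runs
        processed.append(path.name)
    return processed
```

Running it once over the three Q1-Q3 files processes all three; adding Q4 and the 2024 Q1 file and running it again processes only those two, mirroring the workflow above.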

Join the Conversation!

Have questions? Join our community now to connect with professionals, share insights, and get your questions answered!