Chunkers are responsible for splitting large documents into smaller, manageable pieces called chunks. This is a crucial step before embedding and graph extraction, as most embedding models have a limit on the amount of text they can process at once.

Token-Based Sizing

Cognee uses token-based sizing for chunks, rather than character counts. This means that chunk_size refers to the maximum number of tokens allowed in a chunk, which is directly tied to the tokenizer used by your embedding model. This ensures that chunks are always within the model’s context window.
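The distinction matters because character counts overestimate how much text fits in a model's window. As a rough illustration, here is a stand-in whitespace tokenizer (real token counts depend on your embedding model's tokenizer, e.g. tiktoken for OpenAI models — this is not Cognee's actual tokenizer):

```python
# Illustration only: chunk_size counts tokens, not characters.
# A whitespace split stands in for the embedding model's tokenizer.

def count_tokens(text: str) -> int:
    """Stand-in tokenizer: one token per whitespace-separated word."""
    return len(text.split())

text = "Knowledge graphs connect entities extracted from documents."
print(len(text))           # character count: much larger than...
print(count_tokens(text))  # ...the token count that chunk_size limits
```

With a subword tokenizer the token count sits somewhere between the word count and the character count, which is why chunk sizing must use the same tokenizer as the embedding model.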

Available Chunkers

Cognee provides several built-in chunkers to handle different types of content:
  • TextChunker: The default chunker. It splits text by paragraphs while respecting the token limit. It tries to keep paragraphs together but will split them if they exceed the chunk_size.
  • CsvChunker: Designed specifically for CSV data. It splits by rows, ensuring that each chunk contains complete rows and does not break data in the middle of a record.
  • LangchainChunker: Wraps LangChain’s RecursiveCharacterTextSplitter. It splits text recursively on a hierarchy of separators (e.g. "\n\n", then "\n", then spaces) and supports chunk_overlap (in words; default: 10). Requires pip install cognee[langchain].
  • TextChunkerWithOverlap: A paragraph-based chunker that supports overlap via a chunk_overlap_ratio (a fraction of chunk_size, e.g. 0.2 = 20% overlap). Useful for maintaining context across chunk boundaries.
Overlap only works with LangchainChunker and TextChunkerWithOverlap. The default TextChunker splits strictly at paragraph boundaries and does not use overlap. Calling cognee.config.set_chunk_overlap() has no effect when using TextChunker.
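The TextChunker behavior described above — keep paragraphs together, split them only when they exceed the token budget — can be sketched as follows. This is a simplified stand-in, not Cognee's implementation, and a whitespace split again stands in for the model tokenizer:

```python
def chunk_paragraphs(text: str, chunk_size: int) -> list[str]:
    """Greedy paragraph packing under a token budget (simplified sketch).

    Paragraphs are kept together while they fit; a single paragraph longer
    than chunk_size is hard-split on word boundaries, mirroring how the
    real TextChunker splits oversized paragraphs.
    """
    chunks, current, current_tokens = [], [], 0
    for para in text.split("\n\n"):
        words = para.split()
        if len(words) > chunk_size:
            # Flush what we have, then hard-split the oversized paragraph.
            if current:
                chunks.append("\n\n".join(current))
                current, current_tokens = [], 0
            for i in range(0, len(words), chunk_size):
                chunks.append(" ".join(words[i:i + chunk_size]))
        elif current_tokens + len(words) > chunk_size:
            # Paragraph fits on its own but not in the current chunk.
            chunks.append("\n\n".join(current))
            current, current_tokens = [para], len(words)
        else:
            current.append(para)
            current_tokens += len(words)
    if current:
        chunks.append("\n\n".join(current))
    return chunks

doc = "one two three\n\nfour five\n\nsix seven eight nine ten"
print(chunk_paragraphs(doc, chunk_size=5))
```

Note that no text is repeated between chunks here; that is exactly what the overlap-capable chunkers change.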

Additional Information

import cognee
from cognee.modules.chunking.TextChunker import TextChunker

await cognee.cognify(
    datasets=["my_dataset"],
    chunker=TextChunker, # or CsvChunker, LangchainChunker
    chunk_size=1024,     # Maximum tokens per chunk
)
Chunk overlap causes consecutive chunks to share a portion of text, which helps preserve context at chunk boundaries and can improve entity extraction quality, at the cost of more LLM calls and a slightly larger graph.

TextChunkerWithOverlap takes a chunk_overlap_ratio between 0.0 and 1.0 (a fraction of chunk_size). Because this value currently cannot be passed through cognee.config.set_chunk_overlap() at runtime, configure it by subclassing the chunker:
import asyncio
import cognee
from cognee.modules.chunking.text_chunker_with_overlap import TextChunkerWithOverlap

class OverlappingChunker(TextChunkerWithOverlap):
    def __init__(self, document, get_text, max_chunk_size):
        # 20% of chunk_size tokens will overlap with the next chunk
        super().__init__(document, get_text, max_chunk_size, chunk_overlap_ratio=0.2)

async def main():
    await cognee.add("my_document.txt")
    await cognee.cognify(
        datasets=["my_dataset"],
        chunker=OverlappingChunker,
        chunk_size=1024,  # ~205 tokens will repeat in the next chunk
    )

asyncio.run(main())
LangchainChunker uses the same subclassing pattern and accepts a chunk_overlap parameter measured in words (default: 10). It requires pip install cognee[langchain].
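What a word-measured chunk_overlap does can be illustrated with a small stand-in (this is not LangChain's actual splitter, which also recurses over separators; it only shows how the overlap parameter behaves):

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into word windows where consecutive chunks share
    chunk_overlap words (simplified stand-in for a word-overlap splitter)."""
    words = text.split()
    step = chunk_size - chunk_overlap  # must be positive
    chunks = []
    for i in range(0, len(words), step):
        chunks.append(" ".join(words[i:i + chunk_size]))
        if i + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks

chunks = split_with_overlap("a b c d e f g h", chunk_size=4, chunk_overlap=2)
print(chunks)  # each chunk repeats the last 2 words of the previous one
```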
You can create a custom chunker by inheriting from the Chunker base class and implementing the read method. Your chunker must yield DocumentChunk objects.
from cognee.modules.chunking.Chunker import Chunker
from cognee.modules.chunking.models.DocumentChunk import DocumentChunk

class MyCustomChunker(Chunker):
    async def read(self):
        async for text in self.get_text():
            # Your logic to split text into chunks
            yield DocumentChunk(
                text="chunk content",
                chunk_size=100,
                # ... other required fields
            )
The /cognify REST API endpoint does not accept a chunk_size parameter directly. When you use Cognee through Docker or the HTTP API, set the chunk size via environment variables instead:
# In your .env file (or passed as Docker env vars)
chunk_size=1500       # Max tokens per chunk (default: 1500)
chunk_overlap=10      # Word overlap between chunks (default: 10, only applies to LangchainChunker)
With Docker Compose, add these to your .env before starting:
chunk_size=2048
Then start the server:
docker compose up --build cognee
With docker run, pass them inline:
docker run -e chunk_size=2048 -e LLM_API_KEY=... cognee/cognee:main
The chunk_size environment variable is read once at startup. Restart the container after changing it.
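Conceptually, the server resolves these settings from the environment when it boots. The helper below is hypothetical (it mirrors the .env keys and documented defaults above, not Cognee's internals):

```python
import os

def resolve_chunk_settings() -> tuple[int, int]:
    """Read chunk settings from the environment, falling back to the
    documented defaults. Hypothetical helper for illustration."""
    chunk_size = int(os.environ.get("chunk_size", 1500))
    chunk_overlap = int(os.environ.get("chunk_overlap", 10))
    return chunk_size, chunk_overlap

os.environ["chunk_size"] = "2048"  # as set via docker run -e chunk_size=2048
print(resolve_chunk_settings())    # (2048, 10)
```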
The chunk_size passed to cognify() directly affects how the knowledge graph is built:
  • Entity granularity: smaller chunks yield fine-grained entities (each entity fits in fewer chunks); larger chunks are coarser (entities may span fewer nodes).
  • Context per chunk: smaller chunks give the LLM less surrounding context; larger chunks give it more.
  • LLM calls: smaller chunks mean more calls (higher cost, slower); larger chunks mean fewer calls (lower cost, faster).
  • Best for: smaller chunks suit dense, technical text (code, legal, science); larger chunks suit narrative or long-form prose.
If chunk_size is not set, Cognee auto-calculates it as the minimum of your embedding model’s context window and half of your LLM’s context window.
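That fallback can be written out explicitly (the window sizes below are illustrative examples, not defaults; actual values depend on your configured models):

```python
def default_chunk_size(embedding_ctx: int, llm_ctx: int) -> int:
    """Auto-calculated default: min(embedding context window, half the LLM
    context window), as described above."""
    return min(embedding_ctx, llm_ctx // 2)

# e.g. an 8192-token embedding window paired with a 128k-token LLM
print(default_chunk_size(8192, 128_000))  # 8192, capped by the embedding model
print(default_chunk_size(8192, 8_000))    # 4000, capped by half the LLM window
```

Halving the LLM window leaves room in each extraction prompt for the instructions and the structured output alongside the chunk itself.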