Chunkers are responsible for splitting large documents into smaller, manageable pieces called chunks. This is a crucial step before embedding and graph extraction, as most embedding models have a limit on the amount of text they can process at once.

Token-Based Sizing

Cognee uses token-based sizing for chunks, rather than character counts. This means that chunk_size refers to the maximum number of tokens allowed in a chunk, which is directly tied to the tokenizer used by your embedding model. This ensures that chunks are always within the model’s context window.
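The distinction matters because character counts overestimate how much text fits in a model's window. As a rough illustration, here is a stand-in whitespace tokenizer (real token counts depend on your embedding model's tokenizer, e.g. tiktoken for OpenAI models — this is not Cognee's actual tokenizer):

```python
# Illustration only: chunk_size counts tokens, not characters.
# A whitespace split stands in for the embedding model's tokenizer.

def count_tokens(text: str) -> int:
    """Stand-in tokenizer: one token per whitespace-separated word."""
    return len(text.split())

text = "Knowledge graphs connect entities extracted from documents."
print(len(text))           # character count: much larger than...
print(count_tokens(text))  # ...the token count that chunk_size limits
```

With a subword tokenizer the token count sits somewhere between the word count and the character count, which is why chunk sizing must use the same tokenizer as the embedding model.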

Available Chunkers

Cognee provides several built-in chunkers to handle different types of content:
  • TextChunker: The default chunker. It splits text by paragraphs while respecting the token limit. It tries to keep paragraphs together but will split them if they exceed the chunk_size.
  • CsvChunker: Designed specifically for CSV data. It splits by rows, ensuring that each chunk contains complete rows and does not break data in the middle of a record.
  • LangchainChunker: Wraps LangChain’s RecursiveCharacterTextSplitter. It splits text recursively on a hierarchy of separators (e.g. "\n\n", then "\n", then spaces) and supports chunk_overlap (in words; default: 10). Requires pip install cognee[langchain].
  • TextChunkerWithOverlap: A paragraph-based chunker that supports overlap via a chunk_overlap_ratio (a fraction of chunk_size, e.g. 0.2 = 20% overlap). Useful for maintaining context across chunk boundaries.
Overlap only works with LangchainChunker and TextChunkerWithOverlap. The default TextChunker splits strictly at paragraph boundaries and does not use overlap. Calling cognee.config.set_chunk_overlap() has no effect when using TextChunker.
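The TextChunker behavior described above — keep paragraphs together, split them only when they exceed the token budget — can be sketched as follows. This is a simplified stand-in, not Cognee's implementation, and a whitespace split again stands in for the model tokenizer:

```python
def chunk_paragraphs(text: str, chunk_size: int) -> list[str]:
    """Greedy paragraph packing under a token budget (simplified sketch).

    Paragraphs are kept together while they fit; a single paragraph longer
    than chunk_size is hard-split on word boundaries, mirroring how the
    real TextChunker splits oversized paragraphs.
    """
    chunks, current, current_tokens = [], [], 0
    for para in text.split("\n\n"):
        words = para.split()
        if len(words) > chunk_size:
            # Flush what we have, then hard-split the oversized paragraph.
            if current:
                chunks.append("\n\n".join(current))
                current, current_tokens = [], 0
            for i in range(0, len(words), chunk_size):
                chunks.append(" ".join(words[i:i + chunk_size]))
        elif current_tokens + len(words) > chunk_size:
            # Paragraph fits on its own but not in the current chunk.
            chunks.append("\n\n".join(current))
            current, current_tokens = [para], len(words)
        else:
            current.append(para)
            current_tokens += len(words)
    if current:
        chunks.append("\n\n".join(current))
    return chunks

doc = "one two three\n\nfour five\n\nsix seven eight nine ten"
print(chunk_paragraphs(doc, chunk_size=5))
```

Note that no text is repeated between chunks here; that is exactly what the overlap-capable chunkers change.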

Additional Information

import cognee
from cognee.modules.chunking.TextChunker import TextChunker

await cognee.cognify(
    datasets=["my_dataset"],
    chunker=TextChunker, # or CsvChunker, LangchainChunker
    chunk_size=1024,     # Maximum tokens per chunk
)
Chunk overlap causes consecutive chunks to share a portion of text, which helps preserve context at chunk boundaries and can improve entity extraction quality, at the cost of more LLM calls and a slightly larger graph.

TextChunkerWithOverlap takes a chunk_overlap_ratio between 0.0 and 1.0 (a fraction of chunk_size). Because this value currently cannot be passed through cognee.config.set_chunk_overlap() at runtime, configure it by subclassing the chunker:
import asyncio
import cognee
from cognee.modules.chunking.text_chunker_with_overlap import TextChunkerWithOverlap

class OverlappingChunker(TextChunkerWithOverlap):
    def __init__(self, document, get_text, max_chunk_size):
        # 20% of chunk_size tokens will overlap with the next chunk
        super().__init__(document, get_text, max_chunk_size, chunk_overlap_ratio=0.2)

async def main():
    await cognee.add("my_document.txt")
    await cognee.cognify(
        datasets=["my_dataset"],
        chunker=OverlappingChunker,
        chunk_size=1024,  # ~205 tokens will repeat in the next chunk
    )

asyncio.run(main())
LangchainChunker uses the same subclassing pattern and accepts a chunk_overlap parameter measured in words (default: 10). It requires pip install cognee[langchain].
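What a word-measured chunk_overlap does can be illustrated with a small stand-in (this is not LangChain's actual splitter, which also recurses over separators; it only shows how the overlap parameter behaves):

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into word windows where consecutive chunks share
    chunk_overlap words (simplified stand-in for a word-overlap splitter)."""
    words = text.split()
    step = chunk_size - chunk_overlap  # must be positive
    chunks = []
    for i in range(0, len(words), step):
        chunks.append(" ".join(words[i:i + chunk_size]))
        if i + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks

chunks = split_with_overlap("a b c d e f g h", chunk_size=4, chunk_overlap=2)
print(chunks)  # each chunk repeats the last 2 words of the previous one
```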
You can create a custom chunker by inheriting from the Chunker base class and implementing the read method. Your chunker must yield DocumentChunk objects.
from cognee.modules.chunking.Chunker import Chunker
from cognee.modules.chunking.models.DocumentChunk import DocumentChunk

class MyCustomChunker(Chunker):
    async def read(self):
        async for text in self.get_text():
            # Your logic to split text into chunks
            yield DocumentChunk(
                text="chunk content",
                chunk_size=100,
                # ... other required fields
            )
The /cognify REST API endpoint does not accept a chunk_size parameter directly. When you use Cognee through Docker or the HTTP API, set the chunk size via environment variables instead:
# In your .env file (or passed as Docker env vars)
chunk_size=1500       # Max tokens per chunk (default: 1500)
chunk_overlap=10      # Word overlap between chunks (default: 10, only applies to LangchainChunker)
With Docker Compose, add these to your .env before starting:
chunk_size=2048
Then start the server:
docker compose up --build cognee
With docker run, pass them inline:
docker run -e chunk_size=2048 -e LLM_API_KEY=... cognee/cognee:main
The chunk_size environment variable is read once at startup. Restart the container after changing it.
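Conceptually, the server resolves these settings from the environment when it boots. The helper below is hypothetical (it mirrors the .env keys and documented defaults above, not Cognee's internals):

```python
import os

def resolve_chunk_settings() -> tuple[int, int]:
    """Read chunk settings from the environment, falling back to the
    documented defaults. Hypothetical helper for illustration."""
    chunk_size = int(os.environ.get("chunk_size", 1500))
    chunk_overlap = int(os.environ.get("chunk_overlap", 10))
    return chunk_size, chunk_overlap

os.environ["chunk_size"] = "2048"  # as set via docker run -e chunk_size=2048
print(resolve_chunk_settings())    # (2048, 10)
```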
The chunk_size passed to cognify() directly affects how the knowledge graph is built:
  • Entity granularity: smaller chunks yield fine-grained entities (each entity fits in fewer chunks); larger chunks are coarser (entities may span fewer nodes).
  • Context per chunk: smaller chunks give the LLM less surrounding context; larger chunks give it more.
  • LLM calls: smaller chunks mean more calls (higher cost, slower); larger chunks mean fewer calls (lower cost, faster).
  • Best for: smaller chunks suit dense, technical text (code, legal, science); larger chunks suit narrative or long-form prose.
If chunk_size is not set, Cognee auto-calculates it as the minimum of your embedding model’s context window and half of your LLM’s context window.
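That fallback can be written out explicitly (the window sizes below are illustrative examples, not defaults; actual values depend on your configured models):

```python
def default_chunk_size(embedding_ctx: int, llm_ctx: int) -> int:
    """Auto-calculated default: min(embedding context window, half the LLM
    context window), as described above."""
    return min(embedding_ctx, llm_ctx // 2)

# e.g. an 8192-token embedding window paired with a 128k-token LLM
print(default_chunk_size(8192, 128_000))  # 8192, capped by the embedding model
print(default_chunk_size(8192, 8_000))    # 4000, capped by half the LLM window
```

Halving the LLM window leaves room in each extraction prompt for the instructions and the structured output alongside the chunk itself.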