Chunkers are responsible for splitting large documents into smaller, manageable pieces called chunks. This is a crucial step before embedding and graph extraction, as most embedding models have a limit on the amount of text they can process at once.

Token-Based Sizing

Cognee uses token-based sizing for chunks, rather than character counts. This means that chunk_size refers to the maximum number of tokens allowed in a chunk, which is directly tied to the tokenizer used by your embedding model. This ensures that chunks are always within the model’s context window.
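
As a rough illustration of what token-based sizing means, the sketch below counts tokens with the tiktoken library and the cl100k_base encoding. This is an assumption for demonstration only; the actual tokenizer depends on which embedding model you have configured.

import tiktoken

# Illustrative only: cognee uses the tokenizer that matches your embedding model.
encoding = tiktoken.get_encoding("cl100k_base")

text = "Chunkers split large documents into smaller, manageable pieces."
print(len(encoding.encode(text)))  # the number of tokens this text counts toward chunk_size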

Available Chunkers

Cognee provides several built-in chunkers to handle different types of content:
  • TextChunker: The default chunker. It splits text by paragraphs while respecting the token limit. It tries to keep paragraphs together but will split them if they exceed the chunk_size.
  • CsvChunker: Designed specifically for CSV data. It splits by rows, ensuring that each chunk contains complete rows and does not break data in the middle of a record.
  • LangchainChunker: Wraps LangChain’s RecursiveCharacterTextSplitter. It splits text recursively on separators such as \n\n, \n, and spaces, and supports chunk_overlap.
  • TextChunkerWithOverlap: A programmatic-only chunker that lets you specify an overlap ratio between chunks. This is useful for maintaining context across chunk boundaries but is not exposed in the CLI; see the sketch after this list.
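
A programmatic run with TextChunkerWithOverlap might look like the sketch below. The import path is an assumption modeled on where TextChunker lives, and how the overlap ratio is passed is not shown here since the parameter name may differ between versions; check the class in your installed version of cognee for the exact signature.

import cognee
# Assumed import path, mirroring cognee.modules.chunking.TextChunker.
from cognee.modules.chunking.TextChunkerWithOverlap import TextChunkerWithOverlap

await cognee.cognify(
    datasets=["my_dataset"],
    chunker=TextChunkerWithOverlap,  # programmatic only; not available via the CLI
    chunk_size=1024,                 # maximum tokens per chunk
)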

Usage

You can specify which chunker to use and the maximum chunk size when calling .cognify().

Using a Specific Chunker

import cognee
from cognee.modules.chunking.TextChunker import TextChunker

await cognee.cognify(
    datasets=["my_dataset"],
    chunker=TextChunker, # or CsvChunker, LangchainChunker
    chunk_size=1024,     # Maximum tokens per chunk
)

Custom Chunkers

You can create a custom chunker by inheriting from the Chunker base class and implementing the read method. Your chunker must yield DocumentChunk objects.

from cognee.modules.chunking.Chunker import Chunker
from cognee.modules.chunking.models.DocumentChunk import DocumentChunk

class MyCustomChunker(Chunker):
    async def read(self):
        async for text in self.get_text():
            # Your logic to split text into chunks
            yield DocumentChunk(
                text="chunk content",
                chunk_size=100,  # size of this chunk in tokens
                # ... other required fields
            )
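
Once defined, the custom chunker is passed to .cognify() in the same way as the built-in ones:

await cognee.cognify(
    datasets=["my_dataset"],
    chunker=MyCustomChunker,
    chunk_size=512,  # maximum tokens per chunk
)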