# Chunkers

> How Cognee splits documents into smaller pieces

Chunkers are responsible for splitting large documents into smaller, manageable pieces called chunks. This is a crucial step before embedding and graph extraction, as most embedding models have a limit on the amount of text they can process at once.

## Token-Based Sizing

Cognee sizes chunks in tokens rather than characters. `chunk_size` is the maximum number of tokens allowed in a chunk, counted with the tokenizer used by your embedding model. Sizing chunks in the model's own tokens keeps each chunk within the model's context window, provided `chunk_size` does not exceed that limit.
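
To make the distinction concrete, here is a minimal sketch of why character counts and token counts diverge. The `count_tokens` and `fits_in_chunk` helpers are hypothetical and split on whitespace purely for illustration; Cognee counts tokens with your embedding model's tokenizer, so real numbers will differ.

```python
def count_tokens(text: str) -> int:
    """Hypothetical stand-in tokenizer: one token per whitespace-separated word."""
    return len(text.split())


def fits_in_chunk(text: str, chunk_size: int) -> bool:
    """Return True if the text's token count is within the chunk_size budget."""
    return count_tokens(text) <= chunk_size


short_words = "a b c d e f"    # 11 characters, 6 tokens
one_long_word = "abcdefghijk"  # 11 characters, 1 token

# Same character length, very different token counts.
print(len(short_words), count_tokens(short_words))
print(len(one_long_word), count_tokens(one_long_word))

# Only the token count decides whether the text fits the budget.
print(fits_in_chunk(short_words, chunk_size=4))
print(fits_in_chunk(one_long_word, chunk_size=4))
```

Two strings of equal length can land on opposite sides of the same `chunk_size`, which is why a character-based limit cannot guarantee that a chunk fits the model.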

## Available Chunkers

Cognee provides several built-in chunkers to handle different types of content:

* **TextChunker**: The default chunker. It splits text by paragraphs while respecting the token limit. It tries to keep paragraphs together but will split them if they exceed the `chunk_size`.
* **CsvChunker**: Designed specifically for CSV data. It splits by rows, ensuring that each chunk contains complete rows and does not break data in the middle of a record.
* **LangchainChunker**: Wraps LangChain's `RecursiveCharacterTextSplitter`. It splits text recursively by characters (e.g., `\n\n`, `\n`, ` `) and supports `chunk_overlap` (in words; default: `10`). Requires `pip install "cognee[langchain]"` (the quotes keep shells such as zsh from interpreting the brackets).
* **TextChunkerWithOverlap**: A paragraph-based chunker that supports overlap via a `chunk_overlap_ratio` (a fraction of `chunk_size`, e.g. `0.2` = 20% overlap). Useful for maintaining context across chunk boundaries.
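
The paragraph-first behavior described for `TextChunker` can be illustrated with a simplified sketch. This is not Cognee's implementation: `pack_paragraphs` and `count_tokens` are hypothetical helpers, tokens are approximated by whitespace-separated words, and oversized paragraphs are split naively by word count.

```python
def count_tokens(text: str) -> int:
    """Hypothetical stand-in tokenizer: one token per whitespace-separated word."""
    return len(text.split())


def pack_paragraphs(text: str, chunk_size: int) -> list[str]:
    """Greedily pack whole paragraphs into chunks of at most chunk_size tokens.

    A paragraph that is larger than chunk_size on its own is split by words.
    """
    chunks: list[str] = []
    current: list[str] = []  # paragraphs collected for the chunk being built
    current_tokens = 0

    def flush() -> None:
        """Emit the chunk being built, if any, and start a new one."""
        nonlocal current, current_tokens
        if current:
            chunks.append("\n\n".join(current))
        current, current_tokens = [], 0

    for paragraph in (p.strip() for p in text.split("\n\n")):
        if not paragraph:
            continue
        size = count_tokens(paragraph)
        if size > chunk_size:
            # Oversized paragraph: emit what we have, then split it by words.
            flush()
            words = paragraph.split()
            for start in range(0, len(words), chunk_size):
                chunks.append(" ".join(words[start:start + chunk_size]))
            continue
        if current_tokens + size > chunk_size:
            # The next paragraph would not fit, so close the current chunk.
            flush()
        current.append(paragraph)
        current_tokens += size
    flush()
    return chunks


document = "one two three\n\nfour five\n\nsix seven eight nine ten eleven"
for chunk in pack_paragraphs(document, chunk_size=5):
    print(repr(chunk))
```

With a budget of 5 tokens, the first two paragraphs share a chunk, while the six-word paragraph exceeds the budget on its own and is split in two.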

<Note>
  **Overlap only works with `LangchainChunker` and `TextChunkerWithOverlap`.** The default `TextChunker` splits strictly at paragraph boundaries and does not use overlap. Calling [`cognee.config.set_chunk_overlap()`](/python-api/config#chunking-configuration) has no effect when using `TextChunker`.
</Note>
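
As a rough illustration of what an overlap ratio means in tokens, the sketch below converts a `chunk_overlap_ratio` into a token count and applies it as a sliding window. It assumes the overlap is `chunk_size * chunk_overlap_ratio` rounded to the nearest token; `overlap_tokens` and `sliding_chunks` are hypothetical helpers, and the rounding and boundary handling inside `TextChunkerWithOverlap` may differ.

```python
def overlap_tokens(chunk_size: int, chunk_overlap_ratio: float) -> int:
    """Tokens shared by consecutive chunks (assumption: rounded product)."""
    # A ratio of 1.0 would leave no room to advance in this sketch, so it is rejected.
    if not 0.0 <= chunk_overlap_ratio < 1.0:
        raise ValueError("chunk_overlap_ratio must be in [0.0, 1.0)")
    return round(chunk_size * chunk_overlap_ratio)


def sliding_chunks(
    tokens: list[str], chunk_size: int, chunk_overlap_ratio: float
) -> list[list[str]]:
    """Split tokens into windows of chunk_size that share overlap_tokens() items."""
    overlap = overlap_tokens(chunk_size, chunk_overlap_ratio)
    step = max(1, chunk_size - overlap)  # always advance by at least one token
    chunks: list[list[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the input
    return chunks


print(overlap_tokens(1024, 0.2))

# Ten single-letter "tokens", windows of 5 with 20% (one token) overlap.
for window in sliding_chunks(list("abcdefghij"), chunk_size=5, chunk_overlap_ratio=0.2):
    print("".join(window))
```

Each window starts with the final token of the previous one, which is the shared context that overlap is meant to preserve.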

## Additional Information

<AccordionGroup>
  <Accordion title="Using a Specific Chunker">
    ```python theme={null}
    import asyncio
    import cognee
    from cognee.modules.chunking.TextChunker import TextChunker

    async def main():
        await cognee.remember(
            data="my document text",
            dataset_name="my_dataset",
            chunker=TextChunker,  # or CsvChunker, LangchainChunker
            chunk_size=1024,      # Maximum tokens per chunk
        )

    asyncio.run(main())
    ```
  </Accordion>

  <Accordion title="Using Chunk Overlap">
    Chunk overlap makes consecutive chunks share a portion of text. This helps preserve context at chunk boundaries and can improve entity extraction quality, at the cost of more LLM calls and a slightly larger graph.

    `TextChunkerWithOverlap` takes a `chunk_overlap_ratio` between `0.0` and `1.0` (fraction of `chunk_size`). Because this value currently cannot be passed through `cognee.config.set_chunk_overlap()` at runtime, configure it by subclassing the chunker:

    ```python theme={null}
    import asyncio
    import cognee
    from cognee.modules.chunking.text_chunker_with_overlap import TextChunkerWithOverlap

    class OverlappingChunker(TextChunkerWithOverlap):
        def __init__(self, document, get_text, max_chunk_size):
            # 20% of chunk_size tokens will overlap with the next chunk
            super().__init__(document, get_text, max_chunk_size, chunk_overlap_ratio=0.2)

    async def main():
        await cognee.remember(
            "my_document.txt",
            dataset_name="my_dataset",
            chunker=OverlappingChunker,
            chunk_size=1024,  # ~205 tokens will repeat in the next chunk
        )

    asyncio.run(main())
    ```

    `LangchainChunker` uses the same subclassing pattern and accepts a `chunk_overlap` parameter measured in words (default: `10`). It requires `pip install "cognee[langchain]"`.
  </Accordion>

  <Accordion title="Custom Chunkers">
    You can create a custom chunker by inheriting from the `Chunker` base class and implementing the `read` method. Your chunker must yield `DocumentChunk` objects.

    ```python theme={null}
    from cognee.modules.chunking.Chunker import Chunker
    from cognee.modules.chunking.models.DocumentChunk import DocumentChunk

    class MyCustomChunker(Chunker):
        async def read(self):
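            async for text in self.get_text():
                # Replace this pass-through with your own splitting logic;
                # as written, each text segment is emitted as a single chunk.
                yield DocumentChunk(
                    text=text,
                    chunk_size=100,  # placeholder: set to this chunk's actual size
                    # ... other required fields
                )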
    ```
  </Accordion>

  <Accordion title="Configuring Chunk Size via Docker or REST API">
    The `/cognify` REST API endpoint does not accept a `chunk_size` parameter directly. When you use Cognee through Docker or the HTTP API, set the chunk size via environment variables instead:

    ```bash theme={null}
    # In your .env file (or passed as Docker env vars)
    chunk_size=1500       # Max tokens per chunk (default: 1500)
    chunk_overlap=10      # Word overlap between chunks (default: 10, only applies to LangchainChunker)
    ```

    With Docker Compose, add these to your `.env` before starting:

    ```bash theme={null}
    chunk_size=2048
    ```

    Then start the server:

    ```bash theme={null}
    docker compose up --build cognee
    ```

    With `docker run`, pass them inline:

    ```bash theme={null}
    docker run -e chunk_size=2048 -e LLM_API_KEY=... cognee/cognee:main
    ```

    <Note>
      The `chunk_size` environment variable is read once at startup. Restart the container after changing it.
    </Note>
  </Accordion>

  <Accordion title="Chunk Size and Graph Quality">
    The `chunk_size` passed to `remember()` directly affects how the knowledge graph is built:

    |                        | Smaller chunks                                  | Larger chunks                           |
    | ---------------------- | ----------------------------------------------- | --------------------------------------- |
    | **Entity granularity** | Fine-grained — each entity fits in fewer chunks | Coarser — entities may span fewer nodes |
    | **Context per chunk**  | Less surrounding context for the LLM            | More surrounding context for the LLM    |
    | **LLM calls**          | More calls (higher cost, slower)                | Fewer calls (lower cost, faster)        |
    | **Best for**           | Dense, technical text (code, legal, science)    | Narrative or long-form prose            |

    If `chunk_size` is not set, Cognee auto-calculates it as the minimum of your embedding model's context window and half of your LLM's context window.
  </Accordion>
</AccordionGroup>
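
The auto-calculated default described under "Chunk Size and Graph Quality" can be written out as a short sketch. The function and parameter names here are hypothetical, and the use of integer division for "half" is an assumption; only the min-of-two-limits rule comes from the text above.

```python
from typing import Optional


def default_chunk_size(embedding_max_tokens: int, llm_max_tokens: int) -> int:
    """Minimum of the embedding context window and half of the LLM context window."""
    return min(embedding_max_tokens, llm_max_tokens // 2)


def resolve_chunk_size(
    chunk_size: Optional[int], embedding_max_tokens: int, llm_max_tokens: int
) -> int:
    """Use an explicit chunk_size when given, otherwise fall back to the default."""
    if chunk_size is not None:
        return chunk_size
    return default_chunk_size(embedding_max_tokens, llm_max_tokens)


# Embedding window is the tighter limit: min(8191, 16384 // 2)
print(resolve_chunk_size(None, embedding_max_tokens=8191, llm_max_tokens=16384))

# Half of the LLM window is the tighter limit: min(8191, 4096 // 2)
print(resolve_chunk_size(None, embedding_max_tokens=8191, llm_max_tokens=4096))

# An explicit value bypasses the calculation.
print(resolve_chunk_size(1024, embedding_max_tokens=8191, llm_max_tokens=16384))
```

Whichever limit is smaller wins, so switching to an LLM with a shorter context window can lower the default chunk size even if the embedding model is unchanged.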
