# Chunkers

> Learn how Cognee splits documents into smaller pieces.

Chunkers split large documents into smaller pieces called chunks. This step runs before embedding and graph extraction because most embedding models limit how much text they can process at once.

## Token-Based Sizing

Cognee sizes chunks in tokens, not characters. `chunk_size` is the maximum number of tokens allowed in a chunk, counted with the tokenizer of your embedding model, which keeps each chunk within that model's context window.

## Available Chunkers

Cognee provides several built-in chunkers to handle different types of content:

* **TextChunker**: The default chunker. It splits text by paragraphs while respecting the token limit. It tries to keep paragraphs together but will split them if they exceed the `chunk_size`.
* **CsvChunker**: Designed specifically for CSV data. It splits by rows, so each chunk contains complete rows and no record is broken in the middle. Each row is chunked independently: row-level state (text, size, and chunk index) is reset at each row boundary, so one row's fields are never accumulated onto the next row's chunk.
* **LangchainChunker**: Wraps LangChain's `RecursiveCharacterTextSplitter`. It splits text recursively by characters (e.g., `\n\n`, `\n`, ` `) and supports `chunk_overlap` (in words; default: `10`). Requires `pip install cognee[langchain]`.
* **TextChunkerWithOverlap**: A paragraph-based chunker that supports overlap via a `chunk_overlap_ratio` (a fraction of `chunk_size`, e.g. `0.2` = 20% overlap). Useful for maintaining context across chunk boundaries.

<Note>
  **Overlap only works with `LangchainChunker` and `TextChunkerWithOverlap`.** The default `TextChunker` splits strictly at paragraph boundaries and does not use overlap. Calling [`cognee.config.set_chunk_overlap()`](/python-api/config#chunking-configuration) has no effect when using `TextChunker`.
</Note>
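
As a rough illustration of the paragraph-based strategy described for `TextChunker`, the sketch below greedily packs paragraphs into chunks under a token budget. It is not Cognee's implementation: `pack_paragraphs` and `word_count` are hypothetical names, and counting whitespace-separated words is an assumption standing in for the embedding model's tokenizer.

```python
from typing import Callable, Iterator, List


def word_count(text: str) -> int:
    """Stand-in token counter (assumption): whitespace-separated words."""
    return len(text.split())


def pack_paragraphs(
    text: str,
    chunk_size: int,
    count_tokens: Callable[[str], int] = word_count,
) -> Iterator[str]:
    """Greedily group paragraphs into chunks of at most chunk_size tokens."""
    current: List[str] = []
    current_size = 0
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    for paragraph in paragraphs:
        size = count_tokens(paragraph)
        # Start a new chunk when adding this paragraph would exceed the budget.
        if current and current_size + size > chunk_size:
            yield "\n\n".join(current)
            current, current_size = [], 0
        current.append(paragraph)
        current_size += size
    # Emit whatever is left once the input is exhausted.
    if current:
        yield "\n\n".join(current)


doc = "one two three\n\nfour five\n\nsix seven eight nine"
for chunk in pack_paragraphs(doc, chunk_size=5):
    print(repr(chunk))
```

A paragraph larger than `chunk_size` is emitted whole in this sketch, whereas `TextChunker` splits such paragraphs further.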

## Additional Information

<AccordionGroup>
  <Accordion title="Using a Specific Chunker">
    ```python theme={null}
    import asyncio
    import cognee
    from cognee.modules.chunking.TextChunker import TextChunker

    async def main():
        await cognee.remember(
            data="my document text",
            dataset_name="my_dataset",
            chunker=TextChunker,  # or CsvChunker, LangchainChunker
            chunk_size=1024,      # Maximum tokens per chunk
        )

    asyncio.run(main())
    ```
  </Accordion>

  <Accordion title="Using Chunk Overlap">
    Chunk overlap makes consecutive chunks share a portion of text. This helps preserve context at chunk boundaries and can improve entity extraction quality, at the cost of more LLM calls and a slightly larger graph.

    `TextChunkerWithOverlap` takes a `chunk_overlap_ratio` between `0.0` and `1.0` (fraction of `chunk_size`). Because this value currently cannot be passed through `cognee.config.set_chunk_overlap()` at runtime, configure it by subclassing the chunker:

    ```python theme={null}
    import asyncio
    import cognee
    from cognee.modules.chunking.text_chunker_with_overlap import TextChunkerWithOverlap

    class OverlappingChunker(TextChunkerWithOverlap):
        def __init__(self, document, get_text, max_chunk_size):
            # 20% of chunk_size tokens will overlap with the next chunk
            super().__init__(document, get_text, max_chunk_size, chunk_overlap_ratio=0.2)

    async def main():
        await cognee.remember(
            "my_document.txt",
            dataset_name="my_dataset",
            chunker=OverlappingChunker,
            chunk_size=1024,  # ~205 tokens will repeat in the next chunk
        )

    asyncio.run(main())
    ```

    `LangchainChunker` uses the same subclassing pattern and accepts a `chunk_overlap` parameter measured in words (default: `10`). It requires `pip install cognee[langchain]`.
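
    To make the effect of an overlap ratio concrete, here is a minimal sketch over a plain list of tokens. It only illustrates the idea of shared text between consecutive chunks: `sliding_chunks` is a hypothetical helper, it works on a fixed token window rather than on paragraphs, and truncating the overlap with `int()` is an assumption, not a statement about how Cognee rounds.

    ```python
    from typing import List, Sequence


    def sliding_chunks(
        tokens: Sequence[int], chunk_size: int, chunk_overlap_ratio: float
    ) -> List[List[int]]:
        """Split tokens into windows where each window repeats the tail of the previous one."""
        if not 0.0 <= chunk_overlap_ratio < 1.0:
            raise ValueError("chunk_overlap_ratio must be in [0.0, 1.0)")
        # Assumption: the overlap is truncated to a whole number of tokens.
        overlap = int(chunk_size * chunk_overlap_ratio)
        step = chunk_size - overlap
        chunks: List[List[int]] = []
        for start in range(0, len(tokens), step):
            chunks.append(list(tokens[start:start + chunk_size]))
            # Stop once a window has reached the end of the input.
            if start + chunk_size >= len(tokens):
                break
        return chunks


    tokens = list(range(10))
    for chunk in sliding_chunks(tokens, chunk_size=5, chunk_overlap_ratio=0.2):
        print(chunk)
    ```

    With a ratio of `0.0` the windows do not overlap at all, which matches the behavior described for the default `TextChunker`.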
  </Accordion>

  <Accordion title="Custom Chunkers">
    You can create a custom chunker by inheriting from the `Chunker` base class and implementing the `read` method. Your chunker must yield `DocumentChunk` objects.

    ```python theme={null}
    from cognee.modules.chunking.Chunker import Chunker
    from cognee.modules.chunking.models.DocumentChunk import DocumentChunk

    class MyCustomChunker(Chunker):
        async def read(self):
            async for text in self.get_text():
                # Your logic to split text into chunks
                yield DocumentChunk(
                    text="chunk content",
                    chunk_size=100,
                    # ... other required fields
                )
    ```
  </Accordion>

  <Accordion title="Configuring chunk size via Docker or REST API">
    The `/cognify` REST API endpoint does not accept a `chunk_size` parameter directly. When you use Cognee through Docker or the HTTP API, set the chunk size via environment variables instead:

    ```bash theme={null}
    # In your .env file (or passed as Docker env vars)
    chunk_size=1500       # Max tokens per chunk (default: 1500)
    chunk_overlap=10      # Word overlap between chunks (default: 10, only applies to LangchainChunker)
    ```

    With Docker Compose, add these to your `.env` before starting:

    ```bash theme={null}
    chunk_size=2048
    ```

    Then start the server:

    ```bash theme={null}
    docker compose up --build cognee
    ```

    With `docker run`, pass them inline:

    ```bash theme={null}
    docker run -e chunk_size=2048 -e LLM_API_KEY=... cognee/cognee:main
    ```

    <Note>
      The `chunk_size` environment variable is read once at startup. Restart the container after changing it.
    </Note>
  </Accordion>

  <Accordion title="Chunk Size and Graph Quality">
    The `chunk_size` passed to `remember()` directly affects how the knowledge graph is built:

    |                        | Smaller chunks                                  | Larger chunks                           |
    | ---------------------- | ----------------------------------------------- | --------------------------------------- |
    | **Entity granularity** | Fine-grained — each entity fits in fewer chunks | Coarser — entities may span fewer nodes |
    | **Context per chunk**  | Less surrounding context for the LLM            | More surrounding context for the LLM    |
    | **LLM calls**          | More calls (higher cost, slower)                | Fewer calls (lower cost, faster)        |
    | **Best for**           | Dense, technical text (code, legal, science)    | Narrative or long-form prose            |

    If `chunk_size` is not set, Cognee auto-calculates it as the minimum of your embedding model's context window and half of your LLM's context window.
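
    The rule above amounts to a one-line calculation. The sketch below is only an illustration: `default_chunk_size` is a hypothetical name, the context window sizes are made-up example values, and using integer division for "half" is an assumption.

    ```python
    def default_chunk_size(embedding_max_tokens: int, llm_max_tokens: int) -> int:
        """Return the smaller of the embedding window and half of the LLM window."""
        # Assumption: "half" is computed with integer division.
        return min(embedding_max_tokens, llm_max_tokens // 2)


    # Hypothetical context window sizes, chosen only for illustration.
    print(default_chunk_size(embedding_max_tokens=8191, llm_max_tokens=128000))
    print(default_chunk_size(embedding_max_tokens=8191, llm_max_tokens=4096))
    ```

    In the first call the embedding model is the limiting factor. In the second, half of the LLM's context window is smaller, so it sets the chunk size.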
  </Accordion>

  <Accordion title="Preserving structured or procedural data (rules, XML, JSON)">
    For structured data, the goal is to make chunk boundaries follow the shape of the source. `TextChunker` (and the underlying `chunk_by_paragraph`) preserves sentence and paragraph boundaries where possible, batching complete sentence-sized units up to `chunk_size` before starting a new chunk. It has no notion of a "rule", an XML element, or a JSON record, so treat `chunk_size` as a guardrail rather than a semantic guarantee.

    If you have procedural rules, configuration, or other structured content where each unit must stay intact, decide on the smallest source unit you would be comfortable retrieving on its own, then use one of these approaches:

    1. **Raise `chunk_size` so the whole document fits in one chunk.** Because `chunk_size` is a token budget, any content that fits under it is kept together. Set it large enough to hold your entire ruleset and no splitting occurs:

       ```python theme={null}
       await cognee.remember(
           data=my_rules_text,
           dataset_name="rules",
           chunk_size=8000,  # large enough to hold the full ruleset
       )
       ```

    2. **Add each rule as its own data item** so every rule becomes an independent document and is processed on its own:

       ```python theme={null}
       await cognee.remember(
           data=["Rule 1: ...", "Rule 2: ...", "Rule 3: ..."],
           dataset_name="rules",
       )
       ```

    3. **Write a custom chunker** (see the *Custom Chunkers* section above) that splits on your own boundaries (e.g. one chunk per rule, per XML element, or per JSON record) when the natural unit isn't a paragraph.
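
    The boundary-splitting logic for the third option could look like the sketch below, which a custom chunker might call before building its chunk objects. It assumes a hypothetical input format in which every rule starts on its own line with a `Rule N:` marker, and `split_rules` is an illustrative name, not part of Cognee.

    ```python
    import re
    from typing import List


    def split_rules(text: str) -> List[str]:
        """Split text into one piece per rule, starting at each 'Rule N:' marker."""
        # The zero-width lookahead keeps each marker attached to its own rule body.
        parts = re.split(r"(?m)^(?=Rule \d+:)", text)
        return [part.strip() for part in parts if part.strip()]


    rules_text = (
        "Rule 1: Always validate input.\n"
        "Details may span several lines.\n"
        "Rule 2: Log every failure.\n"
        "Rule 3: Retry twice."
    )
    for rule in split_rules(rules_text):
        print(repr(rule))
    ```

    Each returned string is a complete rule, so a multi-line rule stays in one piece regardless of paragraph or token boundaries.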

    <Note>
      Cognee is strongest when the data has concepts, entities, and relationships to reason over. For workflows that must reproduce deterministic API calls or fixed rules verbatim, keep the canonical procedure in a regular file or source system, then use Cognee for the parts that benefit from semantic search and graph reasoning.
    </Note>
  </Accordion>
</AccordionGroup>
