# Loaders

> Learn how Cognee handles different file formats.

Loaders are responsible for reading files from your disk or cloud storage and converting them into plain text that Cognee can process. When you run `remember()`, Cognee automatically selects the most appropriate loader for each file based on its extension and content type.

## Loader Selection

Cognee uses a priority system to decide which loader to use. It tries to match a loader in the following order:

1. **TextLoader**: For plain text files (`.txt`, `.md`, `.json`, `.xml`, etc.).
2. **PyPdfLoader**: For PDF files (requires `pypdf`).
3. **ImageLoader**: For images (uses vision models to describe content).
4. **AudioLoader**: For audio files (uses transcription models).
5. **CsvLoader**: For CSV files (converts rows to text).
6. **UnstructuredLoader**: For complex formats like `.docx`, `.pptx`, `.epub` (requires `unstructured`).
7. **AdvancedPdfLoader**: For layout-aware PDF extraction (requires `unstructured`).
8. **DoclingLoader**: If no other loader can ingest the file type, [Docling](https://github.com/docling-project/docling) is used for conversion (if the type is supported).

If you want to force a specific loader or provide custom configuration, you can use the `preferred_loaders` parameter in `remember()`.
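
As a rough mental model, the priority order above can be sketched as a first-match lookup. This is not Cognee's actual implementation: it matches on extension only (Cognee also considers content type), lists only a few extensions per loader, and omits the Docling fallback. The loader names `text_loader`, `pypdf_loader`, `image_loader`, `audio_loader`, and `csv_loader` are assumed by analogy with the names used elsewhere on this page, and the way `preferred` interacts with the default order is likewise an assumption.

```python
from pathlib import Path

# Illustrative priority table: (loader name, extensions), highest priority first.
# Names and extension sets are assumptions based on this page, not Cognee's source.
LOADER_PRIORITY = [
    ("text_loader", {".txt", ".md", ".json", ".xml"}),
    ("pypdf_loader", {".pdf"}),
    ("image_loader", {".png", ".jpg", ".jpeg"}),
    ("audio_loader", {".mp3", ".wav"}),
    ("csv_loader", {".csv"}),
    ("unstructured_loader", {".docx", ".pptx", ".epub"}),
    ("advanced_pdf_loader", {".pdf"}),
]


def pick_loader(file_path, preferred=None):
    """Return the name of the first loader that accepts the file's extension."""
    extension = Path(file_path).suffix.lower()
    supported = dict(LOADER_PRIORITY)
    # Assumption: preferred loaders are tried first, in the order given.
    for name in preferred or []:
        if extension in supported.get(name, set()):
            return name
    # Otherwise walk the default priority order and take the first match.
    for name, extensions in LOADER_PRIORITY:
        if extension in extensions:
            return name
    raise ValueError(f"No loader for extension {extension!r}")
```

In this sketch a `.pdf` file resolves to `pypdf_loader` unless `advanced_pdf_loader` is passed in `preferred`, which mirrors why the lower-priority PDF loader has to be requested explicitly.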

## Available Loaders

### Core Loaders

These are available by default in every Cognee installation:

* **TextLoader**: Reads text files with UTF-8 encoding.
* **CsvLoader**: Reads CSV files and converts each row into a structured text format (`Row N: key: value`).
* **ImageLoader**: Uses an LLM vision model to transcribe or describe the image content.
* **AudioLoader**: Uses an audio transcription API to transcribe audio files. This depends on your configured LLM provider supporting transcription endpoints; see [LLM Providers](/setup-configuration/llm-providers) for provider-specific caveats.
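
To make the `Row N: key: value` format concrete, here is a minimal standard-library sketch of that kind of conversion. It is not `CsvLoader`'s code: the 1-based row numbering, the comma separator between pairs, and the one-row-per-line layout are assumptions, and `csv_to_text` is a hypothetical helper name.

```python
import csv
import io


def csv_to_text(csv_content):
    """Render each CSV row as 'Row N: key: value, key: value'."""
    reader = csv.DictReader(io.StringIO(csv_content))
    lines = []
    for number, row in enumerate(reader, start=1):
        # Assumption: pairs are comma-separated and rows are numbered from 1.
        pairs = ", ".join(f"{key}: {value}" for key, value in row.items())
        lines.append(f"Row {number}: {pairs}")
    return "\n".join(lines)


sample = "name,role\nAda,engineer\nGrace,admiral\n"
print(csv_to_text(sample))
# Row 1: name: Ada, role: engineer
# Row 2: name: Grace, role: admiral
```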

### External Loaders

These require additional dependencies to be installed:

* **PyPdfLoader**: Extracts text from PDFs page by page using the `pypdf` library, preserving page boundaries with `Page N:` markers in the extracted text. Requires `pip install cognee[docs]` (or `pip install pypdf`).
* **AdvancedPdfLoader**: Layout-aware PDF extraction using the `unstructured` library. Extracts text, tables (as HTML), and image placeholders per page. Falls back to `PyPdfLoader` automatically if extraction fails. Requires `pip install cognee[docs]`, plus `poppler` and `tesseract` installed on the system.
* **UnstructuredLoader**: Handles many office and document formats (`.docx`, `.xlsx`, `.pptx`, `.odt`, `.rtf`, `.eml`, `.epub`, `.html`, and more) via `unstructured`'s auto-partition. Requires `pip install cognee[docs]`.
* **BeautifulSoupLoader**: Extracts text from HTML files using `BeautifulSoup`. Applies CSS selector rules to pull structured content from specific tags. Requires `pip install cognee[scraping]`.
* **DoclingLoader**: Catch-all fallback that converts a wide range of document formats (PDF, DOCX, XLSX, PPTX, HTML, Markdown, and more) to plain text via [Docling](https://github.com/docling-project/docling). Supported extensions are discovered dynamically from Docling's `FormatToExtensions` map. Requires `pip install cognee[docling]`.
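
The `Page N:` markers that `PyPdfLoader` writes can be pictured with plain strings. The sketch below takes already-extracted page texts, so it does not call `pypdf`. The helper name `join_pages`, the newline after each marker, and the blank line between pages are assumptions for illustration. Pages with no extractable text are skipped and keep no marker, and the remaining pages are assumed to keep their original page numbers.

```python
def join_pages(page_texts):
    """Prefix each non-empty page with 'Page N:' and skip pages with no text."""
    sections = []
    for number, text in enumerate(page_texts, start=1):
        if not text or not text.strip():
            # No text layer on this page: omit the page and its marker.
            continue
        sections.append(f"Page {number}:\n{text.strip()}")
    return "\n\n".join(sections)


print(join_pages(["Intro", "", "Summary"]))
# Page 1:
# Intro
#
# Page 3:
# Summary
```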

## Supported File Extensions

Cognee selects loaders based on file type. The table below shows the supported extensions and their default loaders.

<Accordion title="Supported extensions reference">
  | Extension(s)                                                                                                                                         | Default loader     | Notes                                                                            |
  | ---------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------ | -------------------------------------------------------------------------------- |
  | `.txt`, `.md`, `.json`, `.xml`, `.yaml`, `.yml`, `.log`                                                                                              | TextLoader         | Plain text and structured text-like files                                        |
  | `.csv`                                                                                                                                               | CsvLoader          | Tabular data; rows converted to text                                             |
  | `.pdf`                                                                                                                                               | PyPdfLoader        | Default PDF extraction; `AdvancedPdfLoader` is optional for layout-aware parsing |
  | `.docx`, `.doc`, `.odt`                                                                                                                              | UnstructuredLoader | Word-processor formats; requires `unstructured`                                  |
  | `.xlsx`, `.xls`, `.ods`                                                                                                                              | UnstructuredLoader | Spreadsheet formats; requires `unstructured`                                     |
  | `.pptx`, `.ppt`, `.odp`                                                                                                                              | UnstructuredLoader | Presentation formats; requires `unstructured`                                    |
  | `.rtf`, `.html`, `.htm`, `.eml`, `.msg`, `.epub`                                                                                                     | UnstructuredLoader | Additional document and markup formats; requires `unstructured`                  |
  | `.png`, `.jpg`, `.jpeg`, `.gif`, `.webp`, `.bmp`, `.tif`, `.tiff`, `.heic`, `.avif`, `.ico`, `.psd`, `.apng`, `.cr2`, `.dwg`, `.xcf`, `.jxr`, `.jpx` | ImageLoader        | Raster, design, raw, and CAD-adjacent image formats described by a vision LLM    |
  | `.mp3`, `.wav`, `.aac`, `.flac`, `.ogg`, `.m4a`, `.mid`, `.amr`, `.aiff`                                                                             | AudioLoader        | Audio transcription via a Whisper-compatible model                               |

  > **Note:** Files with extensions not in this table cannot be remembered by default. Use a [custom loader](#registering-custom-loaders) to handle additional formats.
  >
  > When no registered loader can handle a file, `remember()` raises a `ValueError` that names the file's extension and lists the currently supported extensions. For office and document formats that ship only via an optional loader (for example `.pptx`, `.docx`, `.xlsx`, `.html`, `.epub`), install `cognee[docling]` for Docling or `cognee[docs]` for Unstructured-backed loaders, then retry.
</Accordion>
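
If you would rather catch unsupported files before calling `remember()` than handle the `ValueError` afterwards, a small pre-flight check is one option. This is a sketch under stated assumptions: `split_by_support` is a hypothetical helper, and `SUPPORTED_EXTENSIONS` is a hand-copied subset of the table above rather than a value read from Cognee.

```python
from pathlib import Path

# Hand-copied subset of the extensions table; extend it to match your install.
SUPPORTED_EXTENSIONS = {".txt", ".md", ".json", ".csv", ".pdf", ".png", ".jpg", ".mp3", ".wav"}


def split_by_support(paths, supported=SUPPORTED_EXTENSIONS):
    """Split paths into (supported, unsupported) lists by lowercase extension."""
    accepted, rejected = [], []
    for path in paths:
        # Files without an extension have an empty suffix and are rejected.
        target = accepted if Path(path).suffix.lower() in supported else rejected
        target.append(path)
    return accepted, rejected
```

Files in the second list are candidates for an optional loader install or a custom loader.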

## Usage

<AccordionGroup>
  <Accordion title="Using Preferred Loaders">
    You can override the default loader selection by specifying `preferred_loaders`. This is useful when you want to pass specific configuration options to a loader.

    ```python theme={null}
    import cognee

    await cognee.remember(
        data=["example_website.html"],
        preferred_loaders=[
            {
                "beautiful_soup_loader": {
                    "extraction_rules": {
                        "title": "h1",
                        "body": "article.main-content"
                    }
                }
            }
        ]
    )
    ```
  </Accordion>

  <Accordion title="Advanced PDF Loader">
    `AdvancedPdfLoader` uses the `unstructured` library to perform layout-aware extraction, preserving page structure, tables, image metadata, and page numbers. It groups content by page and prepends a `Page N:` marker to each page's extracted text, so source pages stay traceable in downstream chunks. Because `PyPdfLoader` has higher default priority, you must request `AdvancedPdfLoader` explicitly with `preferred_loaders`.

    To inspect those page markers after ingestion, retrieve raw chunks with `SearchType.CHUNKS`. The page number is kept in the chunk `text`, not in a separate metadata field.

    <Note>
      Make sure `poppler` and `tesseract` are installed on your system before using `AdvancedPdfLoader`, in addition to installing the Python dependencies with `pip install cognee[docs]`.
    </Note>

    `AdvancedPdfLoader` accepts a `strategy` parameter that controls the trade-off between speed and accuracy:

    | Strategy           | Description                                                   |
    | ------------------ | ------------------------------------------------------------- |
    | `"auto"` (default) | Automatically selects the best strategy based on the document |
    | `"fast"`           | Fast text extraction without layout analysis                  |
    | `"hi_res"`         | High-resolution extraction with full layout analysis (slower) |
    | `"ocr_only"`       | Uses OCR for text extraction, useful for scanned PDFs         |

    <AccordionGroup>
      <Accordion title="Scanned vs. text PDFs (enabling OCR)">
        Cognee does **not** automatically distinguish scanned (image-only) PDFs from text PDFs. The default `PyPdfLoader` reads the embedded text layer page by page and skips pages with no extractable text. For a scanned PDF — where each page is an image with no text layer — this silently produces empty or near-empty output, and no OCR is performed.

        A text layer is the selectable, machine-readable text embedded in the PDF. If a page has no text layer, `PyPdfLoader` omits that page and no `Page N:` marker is added for it. If `pypdf` errors on a malformed page, Cognee logs a warning, skips that page, and continues loading the rest of the document.

        To process scanned PDFs, explicitly select `AdvancedPdfLoader` with an OCR strategy. Because `PyPdfLoader` has higher default priority, you must request the OCR-capable loader through `preferred_loaders`:

        ```python theme={null}
        import cognee

        await cognee.remember(
            data=["scanned_document.pdf"],
            preferred_loaders=[
                {"advanced_pdf_loader": {"strategy": "ocr_only"}}
            ]
        )
        ```

        Use `strategy="ocr_only"` for fully image-based or scanned PDFs, and `strategy="hi_res"` for documents that mix a text layer with scanned images. OCR requires `pip install cognee[docs]` plus `poppler` and `tesseract` installed on the system. For non-English or CJK scanned documents, also install the matching Tesseract language packs (see the next accordion).
      </Accordion>

      <Accordion title="OCR for non-English and CJK PDFs">
        Standard PDF text extraction via `PyPdfLoader` or `AdvancedPdfLoader` with `strategy="fast"` often fails silently on CJK and other non-Latin documents. These PDFs commonly embed glyphs as images or use non-standard font encodings, which can lead to empty or garbled output.

        Use `AdvancedPdfLoader` with an OCR-based strategy and install Tesseract language packs for your target language.

        **Step 1 — install Tesseract language packs** (Ubuntu/Debian):

        ```bash theme={null}
        # Japanese
        sudo apt-get install tesseract-ocr-jpn tesseract-ocr-jpn-vert

        # Chinese (Simplified / Traditional)
        sudo apt-get install tesseract-ocr-chi-sim tesseract-ocr-chi-tra

        # Korean
        sudo apt-get install tesseract-ocr-kor
        ```

        **Step 2 — use `ocr_only` strategy with the `languages` parameter**:

        ```python theme={null}
        import cognee

        # Japanese PDF
        await cognee.remember(
            data=["japanese_document.pdf"],
            preferred_loaders=[
                {
                    "advanced_pdf_loader": {
                        "strategy": "ocr_only",
                        "languages": ["jpn"]
                    }
                }
            ]
        )
        ```

        The `languages` list accepts [Tesseract language codes](https://tesseract-ocr.github.io/tessdoc/Data-Files-in-different-versions.html), which are mostly three-letter ISO 639-2 codes plus script-specific variants such as `chi_sim`. Common values:

        | Language            | Code        |
        | ------------------- | ----------- |
        | Japanese            | `"jpn"`     |
        | Chinese Simplified  | `"chi_sim"` |
        | Chinese Traditional | `"chi_tra"` |
        | Korean              | `"kor"`     |

        Use `strategy="hi_res"` for better layout accuracy when the document has mixed text and images, and `strategy="ocr_only"` for fully image-based or scanned PDFs.

        <Note>
          If you also need translation after OCR extraction, use the [Multilingual Ingestion](/guides/multilingual-ingestion) pipeline before building the knowledge graph.
        </Note>
      </Accordion>
    </AccordionGroup>
  </Accordion>

  <Accordion title="Unstructured Loader">
    `UnstructuredLoader` handles a wide range of office and document formats using `unstructured`'s auto-partition feature. It supports the same `strategy` options as `AdvancedPdfLoader`.

    Supported file types:

    | Category       | Extensions                                       |
    | -------------- | ------------------------------------------------ |
    | Word documents | `.docx`, `.doc`, `.odt`                          |
    | Spreadsheets   | `.xlsx`, `.xls`, `.ods`                          |
    | Presentations  | `.pptx`, `.ppt`, `.odp`                          |
    | Other          | `.rtf`, `.html`, `.htm`, `.eml`, `.msg`, `.epub` |

    ```python theme={null}
    import cognee

    await cognee.remember(
        data=["presentation.pptx"],
        preferred_loaders=[
            {
                "unstructured_loader": {
                    "strategy": "fast"
                }
            }
        ]
    )
    ```
  </Accordion>

  <Accordion title="Docling Loader">
    `DoclingLoader` is the lowest-priority loader and acts as a catch-all fallback for any file type that no other loader can ingest. It converts the document with [Docling](https://github.com/docling-project/docling) and exports plain text via Docling's `export_to_text()`. Supported extensions are pulled at runtime from Docling's `FormatToExtensions` (PDF, DOCX, XLSX, PPTX, HTML, Markdown, and more), so the available formats track your installed Docling version.

    Install with `pip install 'cognee[docling]'`. Because it sits last in the loader priority, Cognee only uses it when no higher-priority loader matches. To force it on a file another loader would normally handle (for example, to use Docling's layout-aware parsing on a PDF instead of `PyPdfLoader`), pass it through `preferred_loaders`:

    ```python theme={null}
    import cognee

    await cognee.remember(
        data=["report.pdf"],
        preferred_loaders=[{"docling_loader": {}}]
    )
    ```

    **When to use Docling vs. `AdvancedPdfLoader`**: prefer `AdvancedPdfLoader` for PDFs when you need per-page markers, HTML-formatted tables, or OCR strategy control (see the Advanced PDF Loader accordion above). Reach for `DoclingLoader` when you want a single unified converter across many formats, or for non-PDF files Cognee does not otherwise handle.
  </Accordion>

  <Accordion title="BeautifulSoup Loader">
    `BeautifulSoupLoader` parses HTML files using CSS selectors. By default it applies a comprehensive set of extraction rules covering common HTML content areas (headings, paragraphs, articles, tables, code blocks, etc.). You can pass your own `extraction_rules` dict to target specific elements.

    Each rule is a dict with the following optional keys:

    | Key         | Type   | Description                                                    |
    | ----------- | ------ | -------------------------------------------------------------- |
    | `selector`  | `str`  | CSS selector to match elements                                 |
    | `xpath`     | `str`  | XPath expression (requires `lxml`)                             |
    | `attr`      | `str`  | HTML attribute to extract instead of text content              |
    | `all`       | `bool` | If `True`, extract all matches; otherwise only the first       |
    | `join_with` | `str`  | String used to join multiple extracted values (default: `" "`) |

    ```python theme={null}
    import cognee

    await cognee.remember(
        data=["page.html"],
        preferred_loaders=[
            {
                "beautiful_soup_loader": {
                    "extraction_rules": {
                        "title": {"selector": "h1", "all": False},
                        "body": {"selector": "article.main-content", "all": True, "join_with": "\n\n"},
                        "og_image": {"selector": "meta[property='og:image']", "attr": "content"}
                    }
                }
            }
        ]
    )
    ```

    For XPath-based extraction (requires `pip install lxml`):

    ```python theme={null}
    await cognee.remember(
        data=["page.html"],
        preferred_loaders=[
            {
                "beautiful_soup_loader": {
                    "extraction_rules": {
                        "content": {"xpath": "//div[@class='content']//p"}
                    }
                }
            }
        ]
    )
    ```
  </Accordion>

  <Accordion title="Configuring Vision Models for ImageLoader">
    `ImageLoader` uses your configured LLM to describe image content — there is no separate VLM configuration. To process images, set `LLM_MODEL` to a vision-capable model.

    **Vision-capable models by provider:**

    | Provider       | Example model                |
    | -------------- | ---------------------------- |
    | OpenAI         | `gpt-4o`, `gpt-4o-mini`      |
    | Google Gemini  | `gemini/gemini-2.0-flash`    |
    | Anthropic      | `claude-3-5-sonnet-20241022` |
    | Azure OpenAI   | `azure/gpt-4o`               |
    | Ollama (local) | `llava`, `llava-llama3`      |

    ```dotenv theme={null}
    # .env — enable vision by choosing a vision-capable model
    LLM_PROVIDER="openai"
    LLM_MODEL="gpt-4o-mini"
    LLM_API_KEY="sk-..."
    ```

    If your `LLM_MODEL` does not support vision, remembering an image file will fail at the description step. Switch to a vision-capable model and retry.

    <Info>
      **Ollama users**: Pull a vision-capable model (e.g. `ollama pull llava`) and set `LLM_MODEL="llava"`. Text-only models such as `llama3.1` cannot process images. See [LLM Providers](/setup-configuration/llm-providers) for full Ollama setup.
    </Info>
  </Accordion>

  <Accordion title="Registering Custom Loaders">
    If you need to handle a custom file format, you can create your own loader class and register it with Cognee.

    ```python theme={null}
    import cognee
    from cognee.infrastructure.loaders import use_loader
    from cognee.infrastructure.loaders.LoaderInterface import LoaderInterface

    class MyCustomLoader(LoaderInterface):
        loader_name = "my_custom_loader"
        supported_extensions = [".custom"]
        supported_mime_types = ["application/x-custom"]

        async def load(self, file_path, **kwargs):
            # Your custom logic to read the file and return text
            return "Extracted text content"

    # Register the loader so Cognee can use it
    use_loader("my_custom_loader", MyCustomLoader)

    await cognee.remember("data/file.custom")
    ```
  </Accordion>
</AccordionGroup>
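
The `preferred_loaders` values shown in the accordions above are plain lists of dicts, so a small helper can build them and catch a mistyped strategy before any file is read. The sketch below is not part of Cognee: `advanced_pdf_preference` and `VALID_STRATEGIES` are hypothetical names, and the allowed values are copied from the strategy table on this page.

```python
# Strategy names copied from the AdvancedPdfLoader strategy table.
VALID_STRATEGIES = {"auto", "fast", "hi_res", "ocr_only"}


def advanced_pdf_preference(strategy="auto", languages=None):
    """Build a preferred_loaders list that selects advanced_pdf_loader."""
    if strategy not in VALID_STRATEGIES:
        raise ValueError(
            f"Unknown strategy {strategy!r}; expected one of {sorted(VALID_STRATEGIES)}"
        )
    options = {"strategy": strategy}
    if languages:
        # Tesseract language codes, for example ["jpn"] or ["chi_sim"].
        options["languages"] = list(languages)
    return [{"advanced_pdf_loader": options}]


print(advanced_pdf_preference("ocr_only", ["jpn"]))
# [{'advanced_pdf_loader': {'strategy': 'ocr_only', 'languages': ['jpn']}}]
```

The returned list can be passed as the `preferred_loaders` argument of `cognee.remember()`.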