Loaders are responsible for reading files from your disk or cloud storage and converting them into plain text that Cognee can process. When you run remember(), Cognee automatically selects the most appropriate loader for each file based on its extension and content type.
Loader Selection
Cognee uses a priority system to decide which loader to use. It tries to match a loader in the following order:

- TextLoader: For plain text files (.txt, .md, .json, .xml, etc.).
- PyPdfLoader: For PDF files (requires pypdf).
- ImageLoader: For images (uses vision models to describe content).
- AudioLoader: For audio files (uses transcription models).
- CsvLoader: For CSV files (converts rows to text).
- UnstructuredLoader: For complex formats like .docx, .pptx, .epub (requires unstructured).
- AdvancedPdfLoader: For layout-aware PDF extraction (requires unstructured).
- DoclingLoader: If no other loader can ingest the file type, Docling is used for conversion (if the type is supported).

You can override this order with the preferred_loaders parameter in remember().
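
The priority order above can be pictured as a first-match lookup. The following is a minimal sketch, not Cognee's implementation: the `PRIORITY` table and `pick_loader` function are hypothetical, the extension sets are an abbreviated subset, matching uses the extension only (Cognee also considers content type), and the Docling fallback is left out.

```python
from pathlib import Path

# Loader names in the documented priority order. The extension sets are an
# illustrative subset, NOT Cognee's real registry.
PRIORITY = [
    ("TextLoader", {".txt", ".md", ".json", ".xml"}),
    ("PyPdfLoader", {".pdf"}),
    ("ImageLoader", {".png", ".jpg", ".jpeg"}),
    ("AudioLoader", {".mp3", ".wav"}),
    ("CsvLoader", {".csv"}),
    ("UnstructuredLoader", {".docx", ".pptx", ".epub"}),
    ("AdvancedPdfLoader", {".pdf"}),
]


def pick_loader(path, preferred=()):
    """Return the name of the first matching loader, or None if none match."""
    ext = Path(path).suffix.lower()
    by_name = dict(PRIORITY)
    # Preferred loaders are tried first, in the order the caller gave them.
    for name in preferred:
        if ext in by_name.get(name, set()):
            return name
    # Otherwise fall back to the default priority order.
    for name, extensions in PRIORITY:
        if ext in extensions:
            return name
    return None
```

In this sketch a PDF resolves to PyPdfLoader by default, and to AdvancedPdfLoader only when it is listed as preferred, which mirrors why the lower-priority loaders have to be requested explicitly.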
Available Loaders
Core Loaders
These are available by default in every Cognee installation:

- TextLoader: Reads text files with UTF-8 encoding.
- CsvLoader: Reads CSV files and converts each row into a structured text format (Row N: key: value).
- ImageLoader: Uses an LLM vision model to transcribe or describe the image content.
- AudioLoader: Uses an audio transcription API to transcribe audio files. This depends on your configured LLM provider supporting transcription endpoints; see LLM Providers for provider-specific caveats.
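
To make the CsvLoader's "Row N: key: value" conversion concrete, here is a rough sketch using only the standard library. The `rows_to_text` function is hypothetical, and the exact whitespace, line breaks, and 1-based numbering are assumptions; Cognee's real output may be laid out differently.

```python
import csv
import io


def rows_to_text(csv_text: str) -> str:
    """Render each CSV row as a 'Row N:' block of 'key: value' lines."""
    reader = csv.DictReader(io.StringIO(csv_text))
    blocks = []
    for index, row in enumerate(reader, start=1):  # 1-based numbering is assumed
        fields = "\n".join(f"{key}: {value}" for key, value in row.items())
        blocks.append(f"Row {index}:\n{fields}")
    return "\n\n".join(blocks)


if __name__ == "__main__":
    print(rows_to_text("name,role\nAda,engineer\nLin,analyst\n"))
```

Turning rows into labelled text keeps each value next to its column name, so a chunk remains interpretable even when it is separated from the header row.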
External Loaders
These require additional dependencies to be installed:

- PyPdfLoader: Extracts text from PDFs page by page using the pypdf library, preserving page boundaries with Page N: markers in the extracted text. Requires pip install cognee[docs] (or pip install pypdf).
- AdvancedPdfLoader: Layout-aware PDF extraction using the unstructured library. Extracts text, tables (as HTML), and image placeholders per page. Falls back to PyPdfLoader automatically if extraction fails. Requires pip install cognee[docs], plus poppler and tesseract installed on the system.
- UnstructuredLoader: Handles many office and document formats (.docx, .xlsx, .pptx, .odt, .rtf, .eml, .epub, .html, and more) via unstructured’s auto-partition. Requires pip install cognee[docs].
- BeautifulSoupLoader: Extracts text from HTML files using BeautifulSoup. Applies CSS selector rules to pull structured content from specific tags. Requires pip install cognee[scraping].
- DoclingLoader: Catch-all fallback that converts a wide range of document formats (PDF, DOCX, XLSX, PPTX, HTML, Markdown, and more) to plain text via Docling. Supported extensions are discovered dynamically from Docling’s FormatToExtensions map. Requires pip install cognee[docling].
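
Because the PDF loaders keep page boundaries as Page N: markers inside the extracted text, downstream code can recover page numbers with a small parser. The sketch below is an illustration only: `split_by_page` is a hypothetical helper, and it assumes each marker sits at the start of a line in exactly the form "Page N:".

```python
import re

# Assumed marker shape: "Page <number>:" at the start of a line.
PAGE_MARKER = re.compile(r"^Page (\d+):[ \t]*", re.MULTILINE)


def split_by_page(text: str) -> dict:
    """Map each page number to the text that follows its 'Page N:' marker."""
    pages = {}
    matches = list(PAGE_MARKER.finditer(text))
    for i, match in enumerate(matches):
        # A page runs until the next marker, or to the end of the text.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        pages[int(match.group(1))] = text[match.end():end].strip()
    return pages
```

Text that appears before the first marker is ignored here, so a chunk that starts mid-page would need extra handling.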
Supported File Extensions
Cognee selects loaders based on file type. The table below shows the supported extensions and their default loaders.

Supported extensions reference

| Extension(s) | Default loader | Notes |
|---|---|---|
| .txt, .md, .json, .xml, .yaml, .yml, .log | TextLoader | Plain text and structured text-like files |
| .csv | CsvLoader | Tabular data; rows converted to text |
| .pdf | PyPdfLoader | Default PDF extraction; AdvancedPdfLoader is optional for layout-aware parsing |
| .docx, .doc, .odt | UnstructuredLoader | Word-processor formats; requires unstructured |
| .xlsx, .xls, .ods | UnstructuredLoader | Spreadsheet formats; requires unstructured |
| .pptx, .ppt, .odp | UnstructuredLoader | Presentation formats; requires unstructured |
| .rtf, .html, .htm, .eml, .msg, .epub | UnstructuredLoader | Additional document and markup formats; requires unstructured |
| .png, .jpg, .jpeg, .gif, .webp, .bmp, .tif, .tiff, .heic, .avif, .ico, .psd, .apng, .cr2, .dwg, .xcf, .jxr, .jpx | ImageLoader | Raster, design, raw, and CAD-adjacent image formats described by a vision LLM |
| .mp3, .wav, .aac, .flac, .ogg, .m4a, .mid, .amr, .aiff | AudioLoader | Audio transcription via a Whisper-compatible model |

Note: Files with extensions not in this table cannot be remembered by default. Use a custom loader to handle additional formats.
Usage
Using Preferred Loaders
You can override the default loader selection by specifying preferred_loaders. This is useful when you want to pass specific configuration options to a loader.
Advanced PDF Loader
AdvancedPdfLoader uses the unstructured library to perform layout-aware extraction, preserving page structure, tables, image metadata, and page numbers. It groups content by page and prepends Page N: markers to each page’s extracted text, making source pages traceable in downstream chunks. Because PyPdfLoader has higher default priority, you need to request AdvancedPdfLoader explicitly with preferred_loaders.

If you want to inspect those page markers after ingestion, retrieve raw chunks with SearchType.CHUNKS. The page number is kept in the chunk text, not in a separate metadata field.

Make sure poppler and tesseract are installed on your system before using AdvancedPdfLoader, in addition to installing the Python dependencies with pip install cognee[docs].

AdvancedPdfLoader accepts a strategy parameter that controls the trade-off between speed and accuracy:

| Strategy | Description |
|---|---|
| "auto" (default) | Automatically selects the best strategy based on the document |
| "fast" | Fast text extraction without layout analysis |
| "hi_res" | High-resolution extraction with full layout analysis (slower) |
| "ocr_only" | Uses OCR for text extraction, useful for scanned PDFs |
OCR for non-English and CJK PDFs
Standard PDF text extraction via PyPdfLoader or AdvancedPdfLoader with strategy="fast" often fails silently on CJK and other non-Latin documents. These PDFs commonly embed glyphs as images or use non-standard font encodings, which can lead to empty or garbled output.

Use AdvancedPdfLoader with an OCR-based strategy and install Tesseract language packs for your target language.

Step 1: install the Tesseract language packs for your target languages (Ubuntu/Debian).

Step 2: use the ocr_only strategy with the languages parameter.

The languages list accepts Tesseract language codes, most of which are based on ISO 639-2. Common values:

| Language | Code |
|---|---|
| Japanese | "jpn" |
| Chinese Simplified | "chi_sim" |
| Chinese Traditional | "chi_tra" |
| Korean | "kor" |

Use strategy="hi_res" for better layout accuracy when the document has mixed text and images, and strategy="ocr_only" for fully image-based or scanned PDFs.

If you also need translation after OCR extraction, use the Multilingual Ingestion pipeline before building the knowledge graph.
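
As a rough illustration of the OCR settings described above, the helper below assembles a loader configuration from a strategy and a list of Tesseract language codes. Everything about its shape is an assumption made for this sketch: the `ocr_pdf_loader_config` function is hypothetical, and both the "advanced_pdf_loader" key and the name-to-options mapping should be checked against your Cognee version before being passed as preferred_loaders.

```python
# Strategy names taken from the AdvancedPdfLoader strategy table.
VALID_STRATEGIES = {"auto", "fast", "hi_res", "ocr_only"}


def ocr_pdf_loader_config(languages, strategy="ocr_only"):
    """Build an options mapping for OCR-based PDF extraction (assumed shape)."""
    if strategy not in VALID_STRATEGIES:
        raise ValueError(f"unknown strategy: {strategy!r}")
    if not languages:
        raise ValueError("at least one Tesseract language code is required")
    # ASSUMPTION: "advanced_pdf_loader" is the key under which
    # AdvancedPdfLoader is addressed; verify it for your installation.
    return {
        "advanced_pdf_loader": {
            "strategy": strategy,
            "languages": list(languages),
        }
    }


if __name__ == "__main__":
    # Japanese plus Simplified Chinese, using the default OCR-only strategy.
    print(ocr_pdf_loader_config(["jpn", "chi_sim"]))
```

Validating the strategy name up front surfaces typos before a long ingestion run starts.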
Unstructured Loader
UnstructuredLoader handles a wide range of office and document formats using unstructured’s auto-partition feature. It supports the same strategy options as AdvancedPdfLoader.

Supported file types:

| Category | Extensions |
|---|---|
| Word documents | .docx, .doc, .odt |
| Spreadsheets | .xlsx, .xls, .ods |
| Presentations | .pptx, .ppt, .odp |
| Other | .rtf, .html, .htm, .eml, .msg, .epub |
Docling Loader
DoclingLoader is the lowest-priority loader and acts as a catch-all fallback for any file type that no other loader can ingest. It converts the document with Docling and exports plain text via Docling’s export_to_text(). Supported extensions are pulled at runtime from Docling’s FormatToExtensions (PDF, DOCX, XLSX, PPTX, HTML, Markdown, and more), so the available formats track your installed Docling version.

Install with pip install 'cognee[docling]'. Because it sits last in the loader priority, Cognee only reaches for it when no higher-priority loader matches. To force it on a file another loader would normally handle (for example, to use Docling’s layout-aware parsing on a PDF instead of PyPdfLoader), pass it through preferred_loaders.

DoclingLoader vs. AdvancedPdfLoader: prefer AdvancedPdfLoader for PDFs when you need per-page markers, HTML-formatted tables, or OCR strategy control (see the Advanced PDF Loader accordion above). Reach for DoclingLoader when you want a single unified converter across many formats, or for non-PDF files Cognee does not otherwise handle.
BeautifulSoup Loader
BeautifulSoupLoader parses HTML files using CSS selectors. By default it applies a comprehensive set of extraction rules covering common HTML content areas (headings, paragraphs, articles, tables, code blocks, etc.). You can pass your own extraction_rules dict to target specific elements.

Each rule is a dict with the following optional keys:

| Key | Type | Description |
|---|---|---|
| selector | str | CSS selector to match elements |
| xpath | str | XPath expression (requires lxml) |
| attr | str | HTML attribute to extract instead of text content |
| all | bool | If True, extract all matches; otherwise only the first |
| join_with | str | String used to join multiple extracted values (default: " ") |

XPath rules require lxml (pip install lxml).
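
The rule keys in the table can be combined as in the following sketch. The rule names ("title", "paragraphs", "links") and the `validate_rules` helper are hypothetical, and the outer shape, a dict mapping a rule name to a rule dict, is an assumption. The validator only checks key names and value types against the table; it does not run any extraction.

```python
# Allowed keys and their expected types, as listed in the table.
RULE_KEY_TYPES = {
    "selector": str,
    "xpath": str,
    "attr": str,
    "all": bool,
    "join_with": str,
}

# Example rules; the outer name -> rule mapping is an assumed shape.
extraction_rules = {
    "title": {"selector": "h1"},
    "paragraphs": {"selector": "article p", "all": True, "join_with": "\n"},
    "links": {"selector": "a", "attr": "href", "all": True},
}


def validate_rules(rules: dict) -> dict:
    """Raise if a rule uses an unknown key or a value of the wrong type."""
    for name, rule in rules.items():
        for key, value in rule.items():
            expected = RULE_KEY_TYPES.get(key)
            if expected is None:
                raise ValueError(f"rule {name!r}: unknown key {key!r}")
            if not isinstance(value, expected):
                raise TypeError(f"rule {name!r}: {key!r} must be {expected.__name__}")
    return rules
```

Checking rules before ingestion catches a misspelled key early, which may otherwise be easy to miss.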
Configuring Vision Models for ImageLoader
ImageLoader uses your configured LLM to describe image content; there is no separate VLM configuration. To process images, set LLM_MODEL to a vision-capable model.

Vision-capable models by provider:

| Provider | Example model |
|---|---|
| OpenAI | gpt-4o, gpt-4o-mini |
| Google Gemini | gemini/gemini-2.0-flash |
| Anthropic | claude-3-5-sonnet-20241022 |
| Azure OpenAI | azure/gpt-4o |
| Ollama (local) | llava, llava-llama3 |
If LLM_MODEL does not support vision, remembering an image file will fail at the description step. Switch to a vision-capable model and retry.

Ollama users: Pull a vision-capable model (e.g. ollama pull llava) and set LLM_MODEL="llava". Text-only models such as llama3.1 cannot process images. See LLM Providers for full Ollama setup.
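
Since a text-only model only fails at the description step, a small pre-flight check can flag a likely misconfiguration before ingestion starts. This is a sketch under stated assumptions: `is_example_vision_model` is a hypothetical helper, and its list contains only the example models from the table above, so a model missing from it may still support vision.

```python
import os

# Example vision-capable models from the provider table (not exhaustive).
VISION_MODEL_EXAMPLES = {
    "gpt-4o",
    "gpt-4o-mini",
    "gemini/gemini-2.0-flash",
    "claude-3-5-sonnet-20241022",
    "azure/gpt-4o",
    "llava",
    "llava-llama3",
}


def is_example_vision_model(model: str) -> bool:
    """Return True if the model is one of the documented vision examples."""
    return model in VISION_MODEL_EXAMPLES


if __name__ == "__main__":
    model = os.environ.get("LLM_MODEL", "")
    if not is_example_vision_model(model):
        print(f"LLM_MODEL={model!r} is not a listed vision example; "
              "image ingestion may fail at the description step.")
```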
Registering Custom Loaders
If you need to handle a custom file format, you can create your own loader class and register it with Cognee.