Loaders are responsible for reading files from your disk or cloud storage and converting them into plain text that Cognee can process. When you run remember(), Cognee automatically selects the most appropriate loader for each file based on its extension and content type.

Loader Selection

Cognee uses a priority system to decide which loader to use. It tries to match a loader in the following order:
  1. TextLoader: For plain text files (.txt, .md, .json, .xml, etc.).
  2. PyPdfLoader: For PDF files (requires pypdf).
  3. ImageLoader: For images (uses vision models to describe content).
  4. AudioLoader: For audio files (uses transcription models).
  5. CsvLoader: For CSV files (converts rows to text).
  6. UnstructuredLoader: For complex formats like .docx, .pptx, .epub (requires unstructured).
  7. AdvancedPdfLoader: For layout-aware PDF extraction (requires unstructured).
  8. DoclingLoader: If no other loader can ingest the file type, Docling is used for conversion (if the type is supported).
If you want to force a specific loader or provide custom configuration, you can use the preferred_loaders parameter in remember().
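The priority order above can be pictured as a first-match lookup. The following is a minimal sketch, not Cognee's implementation: `LOADER_PRIORITY` and `select_loader` are hypothetical names, the extension sets are a partial subset of the supported formats, and only the file extension is considered (Cognee also looks at content type).

```python
from pathlib import Path

# Priority order copied from the list above. The extension sets are a
# partial, illustrative subset and not Cognee's internal registry.
LOADER_PRIORITY = [
    ("TextLoader", {".txt", ".md", ".json", ".xml"}),
    ("PyPdfLoader", {".pdf"}),
    ("ImageLoader", {".png", ".jpg", ".jpeg"}),
    ("AudioLoader", {".mp3", ".wav"}),
    ("CsvLoader", {".csv"}),
    ("UnstructuredLoader", {".docx", ".pptx", ".epub"}),
    ("AdvancedPdfLoader", {".pdf"}),
]


def select_loader(file_name, preferred=None):
    """Return the first loader that accepts the file, trying preferred ones first."""
    extension = Path(file_name).suffix.lower()
    by_name = dict(LOADER_PRIORITY)
    # Preferred loaders are consulted before the default priority order.
    for name in preferred or []:
        if extension in by_name.get(name, set()):
            return name
    for name, extensions in LOADER_PRIORITY:
        if extension in extensions:
            return name
    return None  # In Cognee, DoclingLoader would be tried as a last resort.


print(select_loader("report.pdf"))
print(select_loader("report.pdf", preferred=["AdvancedPdfLoader"]))
```

The second call shows why a preference is needed for PDFs: without it, the higher-priority PyPdfLoader always wins the match.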

Available Loaders

Core Loaders

These are available by default in every Cognee installation:
  • TextLoader: Reads text files with UTF-8 encoding.
  • CsvLoader: Reads CSV files and converts each row into a structured text format (Row N: key: value).
  • ImageLoader: Uses an LLM vision model to transcribe or describe the image content.
  • AudioLoader: Uses an audio transcription API to transcribe audio files. This depends on your configured LLM provider supporting transcription endpoints; see LLM Providers for provider-specific caveats.
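To make the CsvLoader row format concrete, here is a rough sketch of a "Row N: key: value" conversion using only the standard library. The function name `csv_to_text`, the comma separator between fields, and the 1-based row numbering are assumptions for illustration; Cognee's actual output may differ in detail.

```python
import csv
import io


def csv_to_text(csv_source):
    """Render each CSV row as 'Row N: key: value, key: value'."""
    reader = csv.DictReader(io.StringIO(csv_source))
    lines = []
    for index, row in enumerate(reader, start=1):
        # Pair every column header with the value from this row.
        fields = ", ".join(f"{key}: {value}" for key, value in row.items())
        lines.append(f"Row {index}: {fields}")
    return "\n".join(lines)


sample = "name,role\nAda,engineer\nGrace,admiral\n"
print(csv_to_text(sample))
```

Repeating the column name next to every value keeps each row self-describing once the text is split into chunks.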

External Loaders

These require additional dependencies to be installed:
  • PyPdfLoader: Extracts text from PDFs page by page using the pypdf library, preserving page boundaries with Page N: markers in the extracted text. Requires pip install cognee[docs] (or pip install pypdf).
  • AdvancedPdfLoader: Layout-aware PDF extraction using the unstructured library. Extracts text, tables (as HTML), and image placeholders per page. Falls back to PyPdfLoader automatically if extraction fails. Requires pip install cognee[docs], plus poppler and tesseract installed on the system.
  • UnstructuredLoader: Handles many office and document formats (.docx, .xlsx, .pptx, .odt, .rtf, .eml, .epub, .html, and more) via unstructured’s auto-partition. Requires pip install cognee[docs].
  • BeautifulSoupLoader: Extracts text from HTML files using BeautifulSoup. Applies CSS selector rules to pull structured content from specific tags. Requires pip install cognee[scraping].
  • DoclingLoader: Catch-all fallback that converts a wide range of document formats (PDF, DOCX, XLSX, PPTX, HTML, Markdown, and more) to plain text via Docling. Supported extensions are discovered dynamically from Docling’s FormatToExtensions map. Requires pip install cognee[docling].
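Before relying on an external loader, it can help to confirm that its Python dependency is importable. The sketch below uses the standard library's `importlib.util.find_spec`; the mapping and the helper name `available_external_loaders` are illustrative and not part of Cognee. It does not check system tools such as poppler or tesseract.

```python
from importlib.util import find_spec

# Import names of the optional libraries mentioned above.
OPTIONAL_MODULES = {
    "PyPdfLoader": "pypdf",
    "AdvancedPdfLoader": "unstructured",
    "UnstructuredLoader": "unstructured",
    "BeautifulSoupLoader": "bs4",
    "DoclingLoader": "docling",
}


def available_external_loaders():
    """Map each external loader to True if its library can be imported."""
    return {
        loader: find_spec(module) is not None
        for loader, module in OPTIONAL_MODULES.items()
    }


for loader, installed in available_external_loaders().items():
    print(f"{loader}: {'ok' if installed else 'missing dependency'}")
```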

Supported File Extensions

Cognee selects loaders based on file type. The table below shows the supported extensions and their default loaders.
| Extension(s) | Default loader | Notes |
| --- | --- | --- |
| .txt, .md, .json, .xml, .yaml, .yml, .log | TextLoader | Plain text and structured text-like files |
| .csv | CsvLoader | Tabular data; rows converted to text |
| .pdf | PyPdfLoader | Default PDF extraction; AdvancedPdfLoader is optional for layout-aware parsing |
| .docx, .doc, .odt | UnstructuredLoader | Word-processor formats; requires unstructured |
| .xlsx, .xls, .ods | UnstructuredLoader | Spreadsheet formats; requires unstructured |
| .pptx, .ppt, .odp | UnstructuredLoader | Presentation formats; requires unstructured |
| .rtf, .html, .htm, .eml, .msg, .epub | UnstructuredLoader | Additional document and markup formats; requires unstructured |
| .png, .jpg, .jpeg, .gif, .webp, .bmp, .tif, .tiff, .heic, .avif, .ico, .psd, .apng, .cr2, .dwg, .xcf, .jxr, .jpx | ImageLoader | Raster, design, raw, and CAD-adjacent image formats described by a vision LLM |
| .mp3, .wav, .aac, .flac, .ogg, .m4a, .mid, .amr, .aiff | AudioLoader | Audio transcription via a Whisper-compatible model |
Note: Files with extensions not in this table cannot be remembered by default. Use a custom loader to handle additional formats.

Usage

You can override the default loader selection by specifying preferred_loaders. This is useful when you want to pass specific configuration options to a loader.
import cognee

await cognee.remember(
    data=["example_website.html"],
    preferred_loaders=[
        {
            "beautiful_soup_loader": {
                "extraction_rules": {
                    "title": "h1",
                    "body": "article.main-content"
                }
            }
        }
    ]
)
AdvancedPdfLoader uses the unstructured library to perform layout-aware extraction, preserving page structure, tables, image metadata, and page numbers. It groups content by page and prepends Page N: markers to each page’s extracted text, making source pages traceable in downstream chunks. Because PyPdfLoader has higher default priority, you need to request AdvancedPdfLoader explicitly with preferred_loaders. It accepts a strategy parameter that controls the trade-off between speed and accuracy; the available values are listed in the strategy table below.
Make sure poppler and tesseract are installed on your system before using AdvancedPdfLoader, in addition to installing the Python dependencies with pip install cognee[docs].
If you want to inspect the Page N: markers after ingestion, retrieve raw chunks with SearchType.CHUNKS. The page number is kept in the chunk text, not in a separate metadata field.
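Since the page number lives in the chunk text, the Page N: markers can be recovered with a regular expression. This is a small sketch that assumes each marker sits at the start of a line in the form "Page 3:"; the exact marker layout and the helper name `pages_in_chunk` are assumptions, so adjust the pattern to what your chunks actually contain.

```python
import re

# Assumed marker shape: a line that starts with "Page <number>:".
PAGE_MARKER = re.compile(r"^Page (\d+):", re.MULTILINE)


def pages_in_chunk(chunk_text):
    """Return the page numbers whose markers appear in a chunk, in order."""
    return [int(number) for number in PAGE_MARKER.findall(chunk_text)]


chunk = "Page 3:\nQuarterly revenue rose.\nPage 4:\nCosts were flat."
print(pages_in_chunk(chunk))
```

A chunk that starts in the middle of a page carries no marker of its own, so an empty result means "page unknown from this chunk alone", not "no page".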
| Strategy | Description |
| --- | --- |
| "auto" (default) | Automatically selects the best strategy based on the document |
| "fast" | Fast text extraction without layout analysis |
| "hi_res" | High-resolution extraction with full layout analysis (slower) |
| "ocr_only" | Uses OCR for text extraction, useful for scanned PDFs |
Standard PDF text extraction via PyPdfLoader or AdvancedPdfLoader with strategy="fast" often fails silently on CJK and other non-Latin documents. These PDFs commonly embed glyphs as images or use non-standard font encodings, which can lead to empty or garbled output.
Use AdvancedPdfLoader with an OCR-based strategy and install Tesseract language packs for your target language.
Step 1: install Tesseract language packs (Ubuntu/Debian):
# Japanese
sudo apt-get install tesseract-ocr-jpn tesseract-ocr-jpn-vert

# Chinese (Simplified / Traditional)
sudo apt-get install tesseract-ocr-chi-sim tesseract-ocr-chi-tra

# Korean
sudo apt-get install tesseract-ocr-kor
Step 2: use the ocr_only strategy with the languages parameter:
import cognee

# Japanese PDF
await cognee.remember(
    data=["japanese_document.pdf"],
    preferred_loaders=[
        {
            "advanced_pdf_loader": {
                "strategy": "ocr_only",
                "languages": ["jpn"]
            }
        }
    ]
)
The languages list accepts Tesseract language codes. Most are three-letter ISO 639-2 codes, and some carry a script or variant suffix (for example chi_sim). Common values:
| Language | Code |
| --- | --- |
| Japanese | "jpn" |
| Chinese Simplified | "chi_sim" |
| Chinese Traditional | "chi_tra" |
| Korean | "kor" |
Use strategy="hi_res" for better layout accuracy when the document has mixed text and images, and strategy="ocr_only" for fully image-based or scanned PDFs.
If you also need translation after OCR extraction, use the Multilingual Ingestion pipeline before building the knowledge graph.
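OCR with a language whose pack is not installed tends to fail late, so it can be worth checking the local Tesseract install before ingestion. The sketch below shells out to `tesseract --list-langs`; the helper names are hypothetical, and the handling of the header line and of stdout versus stderr is an assumption that may need adjusting for your Tesseract version.

```python
import shutil
import subprocess


def parse_tesseract_langs(listing):
    """Extract language codes from `tesseract --list-langs` output."""
    lines = [line.strip() for line in listing.splitlines()]
    # Skip blank lines and the "List of available languages ..." header.
    return {line for line in lines if line and not line.startswith("List of")}


def missing_languages(required):
    """Return the required codes that the local Tesseract install lacks."""
    if shutil.which("tesseract") is None:
        return set(required)  # No tesseract binary on PATH at all.
    result = subprocess.run(
        ["tesseract", "--list-langs"], capture_output=True, text=True, check=False
    )
    # Assumption: some versions print the listing to stderr instead of stdout.
    installed = parse_tesseract_langs(result.stdout or result.stderr)
    return set(required) - installed


print(missing_languages(["jpn", "chi_sim"]))
```

An empty set means every requested language pack was found; anything else names the packs still to install from Step 1.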
UnstructuredLoader handles a wide range of office and document formats using unstructured’s auto-partition feature. It supports the same strategy options as AdvancedPdfLoader.
Supported file types:
| Category | Extensions |
| --- | --- |
| Word documents | .docx, .doc, .odt |
| Spreadsheets | .xlsx, .xls, .ods |
| Presentations | .pptx, .ppt, .odp |
| Other | .rtf, .html, .htm, .eml, .msg, .epub |
import cognee

await cognee.remember(
    data=["presentation.pptx"],
    preferred_loaders=[
        {
            "unstructured_loader": {
                "strategy": "fast"
            }
        }
    ]
)
DoclingLoader is the lowest-priority loader and acts as a catch-all fallback for any file type that no other loader can ingest. It converts the document with Docling and exports plain text via Docling’s export_to_text(). Supported extensions are pulled at runtime from Docling’s FormatToExtensions (PDF, DOCX, XLSX, PPTX, HTML, Markdown, and more), so the available formats track your installed Docling version.
Install with pip install 'cognee[docling]'. Because it sits last in the loader priority, Cognee only reaches for it when no higher-priority loader matches. To force it on a file another loader would normally handle (for example, to use Docling’s layout-aware parsing on a PDF instead of PyPdfLoader), pass it through preferred_loaders:
import cognee

await cognee.remember(
    data=["report.pdf"],
    preferred_loaders=[{"docling_loader": {}}]
)
When to use Docling vs. AdvancedPdfLoader: prefer AdvancedPdfLoader for PDFs when you need per-page markers, HTML-formatted tables, or OCR strategy control (see the AdvancedPdfLoader description above). Reach for DoclingLoader when you want a single unified converter across many formats, or for non-PDF files Cognee does not otherwise handle.
BeautifulSoupLoader parses HTML files using CSS selectors. By default it applies a comprehensive set of extraction rules covering common HTML content areas (headings, paragraphs, articles, tables, code blocks, etc.). You can pass your own extraction_rules dict to target specific elements.
Each rule is a dict with the following optional keys:
| Key | Type | Description |
| --- | --- | --- |
| selector | str | CSS selector to match elements |
| xpath | str | XPath expression (requires lxml) |
| attr | str | HTML attribute to extract instead of text content |
| all | bool | If True, extract all matches; otherwise only the first |
| join_with | str | String used to join multiple extracted values (default: " ") |
import cognee

await cognee.remember(
    data=["page.html"],
    preferred_loaders=[
        {
            "beautiful_soup_loader": {
                "extraction_rules": {
                    "title": {"selector": "h1", "all": False},
                    "body": {"selector": "article.main-content", "all": True, "join_with": "\n\n"},
                    "og_image": {"selector": "meta[property='og:image']", "attr": "content"}
                }
            }
        }
    ]
)
For XPath-based extraction (requires pip install lxml):
import cognee

await cognee.remember(
    data=["page.html"],
    preferred_loaders=[
        {
            "beautiful_soup_loader": {
                "extraction_rules": {
                    "content": {"xpath": "//div[@class='content']//p"}
                }
            }
        }
    ]
)
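A misspelled rule key is easy to miss, so a quick check of an extraction_rules dict before ingestion may save a round trip. The sketch below is a hypothetical helper, not part of Cognee: it accepts the rule keys documented for BeautifulSoupLoader, lets bare selector strings through, and assumes that a rule without a selector or an xpath cannot match anything.

```python
ALLOWED_KEYS = {"selector", "xpath", "attr", "all", "join_with"}


def check_extraction_rules(rules):
    """Return a list of human-readable problems found in an extraction_rules dict."""
    problems = []
    for name, rule in rules.items():
        if isinstance(rule, str):
            continue  # Bare selector strings, as in the first usage example.
        unknown = set(rule) - ALLOWED_KEYS
        if unknown:
            problems.append(f"{name}: unknown keys {sorted(unknown)}")
        # Assumption: a rule needs at least one way to locate elements.
        if "selector" not in rule and "xpath" not in rule:
            problems.append(f"{name}: needs a 'selector' or an 'xpath'")
    return problems


rules = {
    "title": {"selector": "h1", "all": False},
    "body": {"selctor": "article.main-content"},  # typo on purpose
}
for problem in check_extraction_rules(rules):
    print(problem)
```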
ImageLoader uses your configured LLM to describe image content; there is no separate VLM configuration. To process images, set LLM_MODEL to a vision-capable model.
Vision-capable models by provider:
| Provider | Example model |
| --- | --- |
| OpenAI | gpt-4o, gpt-4o-mini |
| Google Gemini | gemini/gemini-2.0-flash |
| Anthropic | claude-3-5-sonnet-20241022 |
| Azure OpenAI | azure/gpt-4o |
| Ollama (local) | llava, llava-llama3 |
# .env — enable vision by choosing a vision-capable model
LLM_PROVIDER="openai"
LLM_MODEL="gpt-4o-mini"
LLM_API_KEY="sk-..."
If your LLM_MODEL does not support vision, remembering an image file will fail at the description step. Switch to a vision-capable model and retry.
Ollama users: Pull a vision-capable model (e.g. ollama pull llava) and set LLM_MODEL="llava". Text-only models such as llama3.1 cannot process images. See LLM Providers for full Ollama setup.
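Because a text-only model only fails once an image reaches the description step, a simple pre-flight check on LLM_MODEL can surface the problem earlier. This sketch compares the variable against the example models named for ImageLoader; the set is deliberately incomplete and the helper name `looks_vision_capable` is hypothetical, so treat a miss as a prompt to double-check and not as proof that the model lacks vision support.

```python
import os

# Example models from the ImageLoader provider list; not exhaustive.
KNOWN_VISION_MODELS = {
    "gpt-4o",
    "gpt-4o-mini",
    "gemini/gemini-2.0-flash",
    "claude-3-5-sonnet-20241022",
    "azure/gpt-4o",
    "llava",
    "llava-llama3",
}


def looks_vision_capable(model_name):
    """Return True if the model is one of the known vision-capable examples."""
    return model_name in KNOWN_VISION_MODELS


model = os.environ.get("LLM_MODEL", "")
if not looks_vision_capable(model):
    print(f"LLM_MODEL={model!r} is not in the known vision-capable list")
```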
If you need to handle a custom file format, you can create your own loader class and register it with Cognee.
import cognee
from cognee.infrastructure.loaders import use_loader
from cognee.infrastructure.loaders.LoaderInterface import LoaderInterface

class MyCustomLoader(LoaderInterface):
    loader_name = "my_custom_loader"
    supported_extensions = [".custom"]
    supported_mime_types = ["application/x-custom"]

    async def load(self, file_path, **kwargs):
        # Your custom logic to read the file and return text
        return "Extracted text content"

# Register the loader so Cognee can use it
use_loader("my_custom_loader", MyCustomLoader)

await cognee.remember("data/file.custom")
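To try out the body of a load() method before wiring it into Cognee, it can be exercised on its own. The sketch below is a stand-in class that does not inherit from LoaderInterface, so it runs without Cognee installed; the "key=value" file format, the class name, and the loader name are all made up for illustration. It returns text, following the custom loader example above.

```python
import asyncio
import tempfile
from pathlib import Path


class KeyValueLoader:
    """Stand-in for a LoaderInterface subclass; only load() is sketched."""

    loader_name = "key_value_loader"  # hypothetical name
    supported_extensions = [".custom"]

    async def load(self, file_path, **kwargs):
        # Turn 'key=value' lines into 'key: value' lines, skipping comments.
        lines = Path(file_path).read_text(encoding="utf-8").splitlines()
        pairs = [
            line.split("=", 1)
            for line in lines
            if "=" in line and not line.startswith("#")
        ]
        return "\n".join(f"{key.strip()}: {value.strip()}" for key, value in pairs)


async def demo():
    # Write a small sample file and run the loader against it.
    with tempfile.TemporaryDirectory() as folder:
        path = Path(folder) / "settings.custom"
        path.write_text("# sample\nname = cognee\nmode=fast\n", encoding="utf-8")
        return await KeyValueLoader().load(str(path))


print(asyncio.run(demo()))
```

Once the method behaves as expected, the same body can be moved into a class that inherits from LoaderInterface and registered with use_loader.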