Loaders are responsible for reading files from your disk or cloud storage and converting them into plain text that Cognee can process. When you run the .add() command, Cognee automatically selects the most appropriate loader for each file based on its extension and content type.

Loader Selection

Cognee uses a priority system to decide which loader to use. It tries to match a loader in the following order:
  1. TextLoader: For plain text files (.txt, .md, .json, .xml, etc.).
  2. PyPdfLoader: For PDF files (requires pypdf).
  3. ImageLoader: For images (uses vision models to describe content).
  4. AudioLoader: For audio files (uses transcription models).
  5. CsvLoader: For CSV files (converts rows to text).
  6. UnstructuredLoader: For complex formats like .docx, .pptx, .epub (requires unstructured).
  7. AdvancedPdfLoader: For layout-aware PDF extraction (requires unstructured).
If you want to force a specific loader or provide custom configuration, you can use the preferred_loaders parameter in the .add() command.
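The priority order above can be pictured as a first-match lookup. The following is only a minimal sketch of that idea, not Cognee's actual implementation: the `LOADER_PRIORITY` table, the `pick_loader` helper, the loader name strings, and the image and audio extensions are all hypothetical and chosen for illustration.

```python
from pathlib import Path

# Hypothetical priority table mirroring the order listed above.
# Names and extension sets are illustrative assumptions, not Cognee's registry.
LOADER_PRIORITY = [
    ("text_loader", {".txt", ".md", ".json", ".xml"}),
    ("pypdf_loader", {".pdf"}),
    ("image_loader", {".png", ".jpg", ".jpeg"}),
    ("audio_loader", {".mp3", ".wav"}),
    ("csv_loader", {".csv"}),
    ("unstructured_loader", {".docx", ".pptx", ".epub"}),
    ("advanced_pdf_loader", {".pdf"}),
]


def pick_loader(file_path, preferred=None):
    """Return the name of the first loader that accepts the file's extension."""
    extension = Path(file_path).suffix.lower()
    by_name = dict(LOADER_PRIORITY)

    # Preferred loaders are tried first, in the order they were given.
    for name in preferred or []:
        if extension in by_name.get(name, set()):
            return name

    # Otherwise fall back to the default priority order.
    for name, extensions in LOADER_PRIORITY:
        if extension in extensions:
            return name

    return None  # no loader in the table handles this extension
```

In this sketch a PDF resolves to the first PDF-capable entry unless a preferred loader is named, which is the behaviour that preferred_loaders is described as overriding.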

Available Loaders

Core Loaders

These are available by default in every Cognee installation:
  • TextLoader: Reads text files with UTF-8 encoding.
  • CsvLoader: Reads CSV files and converts each row into a structured text format (Row N: key: value).
  • ImageLoader: Uses an LLM vision model to transcribe or describe the image content.
  • AudioLoader: Uses an audio model (like Whisper) to transcribe audio files.
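To make the CsvLoader output format more concrete, here is a rough sketch of how rows might be turned into "Row N: key: value" text using only the standard library. The `csv_to_text` function is hypothetical, and the comma separator between key/value pairs is an assumption; the exact text Cognee produces may differ.

```python
import csv
import io


def csv_to_text(csv_source):
    """Convert CSV content into one 'Row N: key: value, ...' line per row."""
    reader = csv.DictReader(io.StringIO(csv_source))
    lines = []
    for index, row in enumerate(reader, start=1):
        # Assumption: key/value pairs within a row are joined with ", ".
        fields = ", ".join(f"{key}: {value}" for key, value in row.items())
        lines.append(f"Row {index}: {fields}")
    return "\n".join(lines)


sample = "name,role\nAda,engineer\nLin,analyst\n"
print(csv_to_text(sample))
```

Keeping the column name next to every value means each row stays meaningful on its own once the text is split into chunks.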

External Loaders

These require additional dependencies to be installed:
  • PyPdfLoader: Extracts text from PDFs using the pypdf library.
  • AdvancedPdfLoader: Uses unstructured to extract text from PDFs while preserving some layout information.
  • UnstructuredLoader: A versatile loader that supports many formats (.docx, .pptx, .html, .epub) using the unstructured library.
  • BeautifulSoupLoader: Extracts text from HTML files using BeautifulSoup. You can define extraction rules to target specific tags (e.g., div.content).
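Because these loaders depend on optional packages, it can help to check which ones are importable before adding files. The sketch below uses the standard library's `importlib.util.find_spec`. The `OPTIONAL_MODULES` mapping and the `available_external_loaders` helper are hypothetical and not part of Cognee; the module names are the usual import names for pypdf, unstructured, and BeautifulSoup (`bs4`).

```python
from importlib.util import find_spec

# Assumed mapping from loader to the top-level module it needs.
OPTIONAL_MODULES = {
    "PyPdfLoader": "pypdf",
    "AdvancedPdfLoader": "unstructured",
    "UnstructuredLoader": "unstructured",
    "BeautifulSoupLoader": "bs4",
}


def available_external_loaders():
    """Report, per loader, whether its optional dependency can be imported."""
    return {
        loader: find_spec(module) is not None
        for loader, module in OPTIONAL_MODULES.items()
    }


for loader, is_available in available_external_loaders().items():
    print(f"{loader}: {'available' if is_available else 'missing dependency'}")
```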

Usage

Using Preferred Loaders

You can override the default selection by specifying preferred_loaders. This is useful when you want to pass specific configuration options to a loader.
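import asyncio
import cognee

async def main():
    # Route the HTML file through the BeautifulSoup loader and pass it
    # extraction rules that target specific elements.
    await cognee.add(
        data=["example_website.html"],
        preferred_loaders={
            "beautiful_soup_loader": {
                "extraction_rules": {
                    "title": "h1",
                    "body": "article.main-content"
                }
            }
        }
    )

# cognee.add() is a coroutine, so it has to run inside an event loop
asyncio.run(main())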

Registering Custom Loaders

If you need to handle a custom file format, you can create your own loader class and register it with Cognee.
import asyncio
import cognee
from cognee.infrastructure.loaders import use_loader
from cognee.infrastructure.loaders.LoaderInterface import LoaderInterface

class MyCustomLoader(LoaderInterface):
    # Name used to refer to this loader, plus the file types it accepts
    loader_name = "my_custom_loader"
    supported_extensions = [".custom"]
    supported_mime_types = ["application/x-custom"]

    async def load(self, file_path, **kwargs):
        # Your custom logic to read the file and return text
        return "Extracted text content"

# Register the loader so Cognee can use it
use_loader("my_custom_loader", MyCustomLoader)

async def main():
    # Files ending in .custom can now be handled by MyCustomLoader
    await cognee.add("data/file.custom")

asyncio.run(main())