.add() command, Cognee automatically selects the most appropriate loader for each file based on its extension and content type.
Loader Selection
Cognee uses a priority system to decide which loader to use. It tries to match a loader in the following order:
- TextLoader: For plain text files (.txt, .md, .json, .xml, etc.).
- PyPdfLoader: For PDF files (requires pypdf).
- ImageLoader: For images (uses vision models to describe content).
- AudioLoader: For audio files (uses transcription models).
- CsvLoader: For CSV files (converts rows to text).
- UnstructuredLoader: For complex formats like .docx, .pptx, .epub (requires unstructured).
- AdvancedPdfLoader: For layout-aware PDF extraction (requires unstructured).
preferred_loaders parameter in the .add() command.
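The priority-based matching described in this section can be sketched in a few lines. This is an illustration only, not Cognee's actual implementation: the function `select_loader`, the `DEFAULT_PRIORITY` table, and the image and audio extensions are assumptions made for the example, and the real selection also considers content type, which this sketch ignores.

```python
from pathlib import Path

# Hypothetical priority table. Loader names and their order come from the
# list in this section; the extension sets are illustrative assumptions.
DEFAULT_PRIORITY = [
    ("TextLoader", {".txt", ".md", ".json", ".xml"}),
    ("PyPdfLoader", {".pdf"}),
    ("ImageLoader", {".png", ".jpg", ".jpeg"}),   # assumed extensions
    ("AudioLoader", {".mp3", ".wav"}),            # assumed extensions
    ("CsvLoader", {".csv"}),
    ("UnstructuredLoader", {".docx", ".pptx", ".epub"}),
    ("AdvancedPdfLoader", {".pdf"}),
]


def select_loader(file_path, preferred_loaders=None):
    """Return the name of the first loader that supports the file, or None."""
    ext = Path(file_path).suffix.lower()
    supported = dict(DEFAULT_PRIORITY)

    # Preferred loaders are tried first, in the order the caller gave them.
    for name in preferred_loaders or []:
        if ext in supported.get(name, set()):
            return name

    # Otherwise fall back to the default priority order.
    for name, extensions in DEFAULT_PRIORITY:
        if ext in extensions:
            return name
    return None


print(select_loader("report.pdf"))
print(select_loader("report.pdf", preferred_loaders=["AdvancedPdfLoader"]))
```

In this sketch a PDF goes to PyPdfLoader by default because it appears earlier in the list, and naming AdvancedPdfLoader as preferred moves it ahead of the default order for files it supports.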
Available Loaders
Core Loaders
These are available by default in every Cognee installation:
- TextLoader: Reads text files with UTF-8 encoding.
- CsvLoader: Reads CSV files and converts each row into a structured text format (Row N: key: value).
- ImageLoader: Uses an LLM vision model to transcribe or describe the image content.
- AudioLoader: Uses an audio model (like Whisper) to transcribe audio files.
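To make the CsvLoader's row format concrete, here is a minimal sketch of turning CSV rows into "Row N: key: value" text using only Python's standard library. It is not Cognee's code: the function name `csv_to_text` is hypothetical, and details such as numbering rows from 1 and joining fields with commas are assumptions.

```python
import csv
import io


def csv_to_text(csv_text):
    """Convert CSV content into one 'Row N: key: value, ...' line per row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    lines = []
    # Assumption: rows are numbered from 1 and fields are comma-separated.
    for index, row in enumerate(reader, start=1):
        fields = ", ".join(f"{key}: {value}" for key, value in row.items())
        lines.append(f"Row {index}: {fields}")
    return "\n".join(lines)


sample = "name,role\nAda,engineer\nGrace,admiral\n"
print(csv_to_text(sample))
# Row 1: name: Ada, role: engineer
# Row 2: name: Grace, role: admiral
```

Keeping the column name next to each value means every row stays understandable on its own once the text is split into chunks.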
External Loaders
These require additional dependencies to be installed:
- PyPdfLoader: Extracts text from PDFs using the pypdf library.
- AdvancedPdfLoader: Uses unstructured to extract text from PDFs while preserving some layout information.
- UnstructuredLoader: A versatile loader that supports many formats (.docx, .pptx, .html, .epub) using the unstructured library.
- BeautifulSoupLoader: Extracts text from HTML files using BeautifulSoup. You can define extraction rules to target specific tags (e.g., div.content).
Usage
Using Preferred Loaders
You can override the default selection by specifying preferred_loaders. This is useful when you want to pass specific configuration options to a loader.