New to configuration? See the Setup Configuration Overview for the complete workflow: install extras → create .env → choose providers → handle pruning.

Supported Providers
Cognee supports multiple LLM providers:
- OpenAI — GPT models via OpenAI API (default)
- Azure OpenAI — GPT models via Azure OpenAI Service
- Google Gemini — Gemini models via Google AI
- Anthropic — Claude models via Anthropic API
- AWS Bedrock — Models available via AWS Bedrock
- Groq — Fast inference via Groq API (via LiteLLM)
- Ollama — Local models via Ollama
- LM Studio — Local models via LM Studio
- HuggingFace — Models via HuggingFace Inference API or Inference Endpoints
- llama.cpp — Local models via llama-cpp-python (in-process or server mode)
- Custom — OpenAI-compatible endpoints (like vLLM, OpenRouter, DeepInfra, company-internal)
Configuration
Environment Variables
Set these environment variables in your .env file:

- LLM_PROVIDER — The provider to use (openai, gemini, anthropic, ollama, custom)
- LLM_MODEL — The specific model to use
- LLM_API_KEY — Your API key for the provider
- LLM_ENDPOINT — Custom endpoint URL (for Azure, Ollama, or custom providers)
- LLM_API_VERSION — API version (for Azure OpenAI)
- LLM_TEMPERATURE — Sampling temperature for generation (default: 0.0)
- LLM_MAX_TOKENS — Maximum tokens per request (optional)
- LLM_INSTRUCTOR_MODE — Structured-output mode override for Instructor-backed LLM calls (optional)
Why do model names have a prefix like gemini/ or openrouter/?

Cognee routes all LLM requests through LiteLLM, which uses provider prefixes to identify the correct API endpoint. For example, Google lists their model as gemini-2.0-flash, but in Cognee you must write gemini/gemini-2.0-flash. This prefix tells LiteLLM to use the Gemini API. The same applies to custom providers — openrouter/, hosted_vllm/, lm_studio/, etc. See each provider section below for the correct format.

Provider Setup Guides
OpenAI (Default)
OpenAI is the default provider and works out of the box with minimal configuration.
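A minimal .env for OpenAI might look like the sketch below (the model name and key are placeholders):

```bash
# OpenAI (default provider); values shown are placeholders
LLM_PROVIDER="openai"
LLM_MODEL="gpt-4o-mini"
LLM_API_KEY="sk-..."
```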
Azure OpenAI
Use Azure OpenAI Service with your own deployment.
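A sketch of an Azure configuration, assuming Azure OpenAI is configured under the openai provider with LiteLLM's azure/ deployment prefix; the deployment name, endpoint, and API version are placeholders:

```bash
# Azure OpenAI; assumption: uses the openai provider with the azure/<deployment-name> prefix
LLM_PROVIDER="openai"
LLM_MODEL="azure/my-gpt-4o-deployment"
LLM_ENDPOINT="https://my-resource.openai.azure.com/"
LLM_API_VERSION="2024-02-01"
LLM_API_KEY="..."
```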
Google Gemini
Use Google’s Gemini models for text generation.
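A minimal sketch using the gemini/ prefix described above (the key is a placeholder):

```bash
# Google Gemini; note the gemini/ prefix required by LiteLLM
LLM_PROVIDER="gemini"
LLM_MODEL="gemini/gemini-2.0-flash"
LLM_API_KEY="..."
```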
Anthropic
Use Anthropic’s Claude models for reasoning tasks.
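A minimal sketch; the model name is an example, so check Anthropic's model list for current identifiers:

```bash
# Anthropic Claude; model name is illustrative
LLM_PROVIDER="anthropic"
LLM_MODEL="claude-3-5-sonnet-20241022"
LLM_API_KEY="sk-ant-..."
```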
Groq
Groq provides fast inference for open models. Cognee routes Groq requests through LiteLLM using the groq/ model prefix.

Installation: Install the Groq dependency (see the sketch after the model list).

Popular Groq models (use with the groq/ prefix):
- groq/llama-3.3-70b-versatile
- groq/llama3-8b-8192
- groq/mixtral-8x7b-32768
- groq/gemma2-9b-it
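A sketch of a Groq setup; the exact dependency package and the provider value are assumptions, so check the installation docs if they differ for your Cognee version:

```bash
# Install the Groq dependency (assumption: the groq SDK package)
pip install groq

# .env; assumption: Groq is configured through the custom provider
LLM_PROVIDER="custom"
LLM_MODEL="groq/llama-3.3-70b-versatile"
LLM_API_KEY="gsk_..."
# No LLM_ENDPOINT needed: LiteLLM resolves it from the groq/ prefix
```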
No endpoint needed: The LLM_ENDPOINT variable is not required for Groq — LiteLLM resolves the Groq API endpoint automatically from the groq/ prefix.
AWS Bedrock
Use models available on AWS Bedrock for various tasks. For Bedrock specifically, you will also need to specify some information regarding AWS.

There are multiple ways of connecting to Bedrock models:
- Using an API key and region. Simply generate your key on AWS and put it in the LLM_API_KEY env variable.
- Using AWS credentials. You can specify only AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY, with no need for LLM_API_KEY. In this case, if you are using temporary credentials (e.g. an AWS_ACCESS_KEY_ID starting with ASIA...), you must also specify AWS_SESSION_TOKEN.
- Using AWS profiles. Create a file such as ~/.aws/credentials and store your credentials inside it.
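A hypothetical .env sketch for the API-key approach; the provider value, the region-prefixed model ID, and the AWS variable names are assumptions that depend on your region and Cognee version:

```bash
# AWS Bedrock; values and variable names are illustrative assumptions
LLM_PROVIDER="bedrock"
LLM_MODEL="us.anthropic.claude-3-5-sonnet-20240620-v1:0"   # region-prefixed model ID (us/eu/...)
LLM_API_KEY="..."
# Or use AWS credentials instead of an API key:
# AWS_ACCESS_KEY_ID="AKIA..."
# AWS_SECRET_ACCESS_KEY="..."
# AWS_SESSION_TOKEN="..."   # only needed for temporary (ASIA...) credentials
```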
Model Name
The name of the model might differ based on the region (the name begins with eu for Europe, us for the USA, etc.).
Ollama (Local)
Run models locally with Ollama for privacy and cost control.
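A minimal sketch, assuming Ollama's default local endpoint; adjust the model name to one you have pulled:

```bash
# Ollama (local); endpoint and model name are illustrative
LLM_PROVIDER="ollama"
LLM_MODEL="llama3.1"
LLM_ENDPOINT="http://localhost:11434/v1"   # assumption: default Ollama server with OpenAI-compatible path
LLM_API_KEY="ollama"                       # placeholder; Ollama does not validate it
```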
LLM_API_KEY="ollama" is a placeholder required by the client library — Ollama itself does not validate it.

Installation: Install Ollama from ollama.ai and pull your desired model.

Zero-API-key setup: To avoid falling back to OpenAI for embeddings, you must also configure the embedding provider to use a local backend. See the Local Setup guide for a complete .env example using Ollama or Fastembed for both LLM and embeddings.

Known Issues
- NoDataError with mixed providers: Using Ollama as the LLM and OpenAI as the embedding provider may fail with NoDataError. Workaround: configure both LLM and embeddings to use the same local provider (see the local setup guide above).
- Audio transcription is not supported: AudioLoader relies on a Whisper-compatible transcription endpoint. Cognee’s Ollama adapter does not provide one, so audio ingestion will fail when LLM_PROVIDER="ollama".
HuggingFace
Use models from HuggingFace via the HuggingFace Inference API (serverless) or dedicated Inference Endpoints.

Installation: Install the HuggingFace extra to enable the HuggingFace tokenizer used for chunking.

Two connection modes are available:
- Serverless — the shared HuggingFace Inference API
- Dedicated Endpoint — a dedicated Inference Endpoint you deploy yourself
Model names: Use the full HuggingFace model repo ID after the huggingface/ prefix (e.g., huggingface/mistralai/Mixtral-8x7B-Instruct-v0.1). Not all models on HuggingFace support the text generation inference API — check the model card for compatibility. The model is routed through LiteLLM.
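A serverless sketch; the provider value is an assumption (adjust if your Cognee version exposes a dedicated HuggingFace provider), and the token is a placeholder:

```bash
# HuggingFace Inference API (serverless); provider value is an assumption
LLM_PROVIDER="custom"
LLM_MODEL="huggingface/mistralai/Mixtral-8x7B-Instruct-v0.1"
LLM_API_KEY="hf_..."
# For a dedicated Inference Endpoint, also point LLM_ENDPOINT at your endpoint URL
```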
LM Studio (Local)
Run models locally with LM Studio for privacy and cost control.

Installation: Install LM Studio from lmstudio.ai and download your desired model from LM Studio’s interface.
Load your model, start the LM Studio server, and Cognee will be able to connect to it.
Set up instructor mode: LLM_INSTRUCTOR_MODE controls how Cognee asks the model for structured output. LM Studio models often work best with json_schema_mode. For more detail, see LLM Instructor Modes below and Structured Output Backends.
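A sketch assuming LM Studio's default local server address; the model name is whatever you have loaded in LM Studio:

```bash
# LM Studio (local); model name and port are illustrative
LLM_PROVIDER="custom"
LLM_MODEL="lm_studio/qwen2.5-7b-instruct"   # hypothetical; use the model loaded in LM Studio
LLM_ENDPOINT="http://localhost:1234/v1"     # LM Studio's default local server address
LLM_API_KEY="lm-studio"                     # placeholder; LM Studio does not check it
LLM_INSTRUCTOR_MODE="json_schema_mode"
```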
llama.cpp (Local)
Run models locally using llama-cpp-python for full offline inference.

Cognee supports two setup modes:
- Local mode — Load a .gguf model directly in-process
- Server mode — Connect to a running llama-cpp-python server over HTTP
Choosing a mode: Use local mode for the simplest setup with no separate server process. Use server mode if you want to share one model across multiple processes or run the model on another machine.
Local Mode (In-Process)
Load a GGUF model file directly. No server setup required.
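A hypothetical sketch; the provider value and the use of LLM_MODEL as a file path are assumptions, while LLAMA_CPP_N_GPU_LAYERS is the documented GPU setting:

```bash
# llama.cpp local mode; provider name and model-path convention are assumptions
LLM_PROVIDER="llama_cpp"                                # hypothetical provider name
LLM_MODEL="/models/llama-3.1-8b-instruct-q4_k_m.gguf"   # path to a local .gguf file
LLAMA_CPP_N_GPU_LAYERS="0"                              # 0 = CPU only, -1 = offload all layers to GPU
```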
GPU acceleration: Set LLAMA_CPP_N_GPU_LAYERS=-1 to offload all layers to GPU, or set a positive integer to offload a specific number of layers. Leave it at 0 for CPU-only inference.
Server Mode (OpenAI-Compatible)
Connect to a running llama-cpp-python server. Start the server separately, then configure Cognee to connect to it:
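A sketch under the assumption that the server is treated as a generic OpenAI-compatible endpoint; the model path, model label, and port are placeholders:

```bash
# Start a llama-cpp-python server (path and port are illustrative)
python -m llama_cpp.server --model /models/llama-3.1-8b-instruct-q4_k_m.gguf --port 8000

# .env; assumption: connect through the custom provider
LLM_PROVIDER="custom"
LLM_MODEL="openai/llama-3.1-8b-instruct"   # hypothetical label; the openai/ prefix selects the generic adapter
LLM_ENDPOINT="http://localhost:8000/v1"
LLM_API_KEY="."                            # no auth needed for a local server
```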
Custom Providers
Use OpenAI-compatible endpoints like OpenRouter or other services. See Fallback Provider in Advanced Options for full details.

Custom Provider Prefixes: When using LLM_PROVIDER="custom", you must include the correct provider prefix in your model name. Cognee forwards requests to LiteLLM, which uses these prefixes to route requests correctly.

Common prefixes include:
- hosted_vllm/ — vLLM servers
- openrouter/ — OpenRouter
- lm_studio/ — LM Studio
- openai/ — OpenAI-compatible APIs
DeepSeek
Use DeepSeek’s models for reasoning and chat via their OpenAI-compatible API. Get your API key from platform.deepseek.com. The deepseek/ prefix tells LiteLLM to route to the DeepSeek API.

Popular DeepSeek models (use with the deepseek/ prefix):
- deepseek/deepseek-chat — DeepSeek-V3 (general chat and instruction following)
- deepseek/deepseek-reasoner — DeepSeek-R1 (chain-of-thought reasoning)
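A sketch; the provider value is an assumption, and the endpoint may not be needed since the deepseek/ prefix already routes through LiteLLM:

```bash
# DeepSeek; provider value and endpoint are assumptions
LLM_PROVIDER="custom"
LLM_MODEL="deepseek/deepseek-chat"
LLM_API_KEY="sk-..."
# LLM_ENDPOINT="https://api.deepseek.com"   # possibly unnecessary; LiteLLM can resolve it from the prefix
```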
Structured output: DeepSeek’s API is OpenAI-compatible, so the default json_mode for custom providers works well. If you encounter issues with structured output, try setting LLM_INSTRUCTOR_MODE="tool_call".
Kimi (Moonshot AI)
Use Moonshot AI’s Kimi models via their OpenAI-compatible API. Get your API key from platform.moonshot.cn. The moonshot/ prefix tells LiteLLM to route to the Moonshot AI API.

Available Kimi models (use with the moonshot/ prefix):
- moonshot/moonshot-v1-8k — 8k context window
- moonshot/moonshot-v1-32k — 32k context window
- moonshot/moonshot-v1-128k — 128k context window (for long documents)
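A sketch along the same lines; the provider value and endpoint are assumptions:

```bash
# Kimi (Moonshot AI); provider value and endpoint are assumptions
LLM_PROVIDER="custom"
LLM_MODEL="moonshot/moonshot-v1-32k"
LLM_API_KEY="sk-..."
# LLM_ENDPOINT="https://api.moonshot.cn/v1"   # possibly unnecessary; LiteLLM can resolve it from the prefix
```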
OpenRouter
Use OpenRouter to access hundreds of models from a single API endpoint. Get your API key from openrouter.ai/keys. Browse all available models at openrouter.ai/models — prefix the model slug with openrouter/.

Example models (use with the openrouter/ prefix):
- openrouter/deepseek/deepseek-r1 — DeepSeek R1 via OpenRouter
- openrouter/google/gemini-2.0-flash-lite-preview-02-05:free — Free Gemini tier
- openrouter/openai/gpt-4o-mini — GPT-4o Mini via OpenRouter
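A sketch; the provider value is an assumption and the key is a placeholder:

```bash
# OpenRouter; provider value is an assumption
LLM_PROVIDER="custom"
LLM_MODEL="openrouter/openai/gpt-4o-mini"
LLM_ENDPOINT="https://openrouter.ai/api/v1"   # OpenRouter's OpenAI-compatible base URL
LLM_API_KEY="sk-or-..."
```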
DeepInfra
Use DeepInfra to access open-source models via their OpenAI-compatible API. Find your model name in the DeepInfra model catalog. The deepinfra/ prefix tells LiteLLM to route to DeepInfra.
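A sketch; the model name, provider value, and endpoint are assumptions, so check the DeepInfra catalog and docs:

```bash
# DeepInfra; model name, provider value, and endpoint are assumptions
LLM_PROVIDER="custom"
LLM_MODEL="deepinfra/meta-llama/Meta-Llama-3.1-70B-Instruct"
LLM_API_KEY="..."
# LLM_ENDPOINT="https://api.deepinfra.com/v1/openai"   # possibly unnecessary; LiteLLM can resolve it from the prefix
```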
Company-Internal / Self-Hosted Endpoints
Any internal LLM server that exposes an OpenAI-compatible REST API (e.g., a corporate vLLM deployment, internal TGI server, or private OpenRouter proxy) can be used with the custom provider.

The model prefix you use (openai/, hosted_vllm/, etc.) determines which LiteLLM adapter handles the request. For most OpenAI-compatible servers, openai/ works best. Set LLM_API_KEY to whatever bearer token your server requires (use . if no auth is needed).
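A sketch with a hypothetical internal host and model label:

```bash
# Company-internal OpenAI-compatible server; host and model label are hypothetical
LLM_PROVIDER="custom"
LLM_MODEL="openai/internal-llm"                      # openai/ prefix selects the generic adapter
LLM_ENDPOINT="https://llm.internal.example.com/v1"
LLM_API_KEY="your-bearer-token"                      # use "." if the server requires no auth
```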
vLLM
Use vLLM for high-performance model serving with an OpenAI-compatible API.

Example with Gemma (to find the correct model name, see their documentation):
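A sketch, assuming vLLM is installed locally and serves the model on its default port; the model name and port are illustrative:

```bash
# Start a vLLM server with a Gemma model (name and port are illustrative)
vllm serve google/gemma-2-9b-it --port 8000

# .env; assumption: connect through the custom provider with the hosted_vllm/ prefix
LLM_PROVIDER="custom"
LLM_MODEL="hosted_vllm/google/gemma-2-9b-it"
LLM_ENDPOINT="http://localhost:8000/v1"
LLM_API_KEY="."
```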
Advanced Options
LLM Instructor Modes
When using the Instructor structured-output framework (the default), Cognee instructs the model to return structured data in a specific way. The LLM_INSTRUCTOR_MODE environment variable controls which strategy is used. Each provider has a built-in default that matches its API capabilities. Override it only when the default doesn’t work for your specific model.

Available modes:

| Mode | Description | When to use |
|---|---|---|
| json_schema_mode | Passes the full JSON Schema of the expected output in the request and enforces strict schema compliance. | OpenAI models that support the response_format / structured-output feature (e.g. GPT-4o). Also works well with Bedrock and some local models. |
| json_mode | Instructs the model to return any valid JSON object. Instructor then validates and coerces it to the target schema. | Gemini, Ollama, Generic/Custom endpoints, and any model that supports response_format: json_object but not strict schema enforcement. |
| anthropic_tools | Uses Anthropic’s native tool-calling API to extract structured data. | Anthropic Claude models only. Leverages first-class tool-use support for reliable extraction. |
| mistral_tools | Uses Mistral’s native tool-calling API to extract structured data. | Mistral models only. Mirrors the OpenAI function-calling interface provided by Mistral. |
| tool_call | Uses the generic OpenAI-style function/tool-calling API to define the schema as a callable tool. | OpenAI-compatible APIs that support function calling but not strict JSON schema output. |
| md_json | Asks the model to return JSON wrapped in a Markdown code block. Instructor extracts the block and validates it. | Models that reliably format code blocks but may not support json_mode (e.g. some self-hosted models). |

Per-provider defaults (from source code):

| Provider (LLM_PROVIDER) | Default mode |
|---|---|
| openai (and Azure OpenAI) | json_schema_mode |
| anthropic | anthropic_tools |
| gemini | json_mode |
| bedrock | json_schema_mode |
| mistral | mistral_tools |
| ollama | json_mode |
| custom (generic OpenAI-compatible) | json_mode |

Override the default only when the model you are using requires a different mode. For example, LM Studio models typically need json_schema_mode even though the custom provider defaults to json_mode.

Example — override the mode:
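For example, in your .env:

```bash
# Force a specific Instructor mode, e.g. for an LM Studio model served via the custom provider
LLM_INSTRUCTOR_MODE="json_schema_mode"
```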
Temperature
Control the randomness of LLM responses with the LLM_TEMPERATURE environment variable.

| Variable | Default | Description |
|---|---|---|
| LLM_TEMPERATURE | 0.0 | Sampling temperature. 0.0 = deterministic / focused output. Higher values (e.g. 0.7–1.0) produce more varied, creative responses. |

When to adjust: Cognee’s default of 0.0 is recommended for knowledge-graph extraction because it produces consistent, structured output. Raise the temperature only if you need more variety in generated text (e.g. conversational responses or creative summarisation).
Rate Limiting
Control client-side throttling for LLM calls to manage API usage and costs.

Defaults (when rate limiting is enabled):

| Variable | Default | Meaning |
|---|---|---|
| LLM_RATE_LIMIT_ENABLED | false | Off by default — opt-in |
| LLM_RATE_LIMIT_REQUESTS | 60 | Max requests per interval |
| LLM_RATE_LIMIT_INTERVAL | 60 | Interval in seconds |

The defaults (60 requests / 60 seconds) allow 1 request/second on average. Adjust both values to match your provider’s tier limit.

How it works:
- Client-side limiter: Cognee paces outbound LLM calls before they reach the provider
- Moving window: Spreads allowance across the time window for smoother throughput
- Per-process scope: In-memory limits don’t share across multiple processes/containers
- Auto-applied: Works with all providers (OpenAI, Gemini, Anthropic, Ollama, Custom)
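To opt in with the documented defaults (60 requests per 60-second interval), enabling the flag is enough:

```bash
# Enable client-side rate limiting; LLM_RATE_LIMIT_REQUESTS and LLM_RATE_LIMIT_INTERVAL default to 60 / 60
LLM_RATE_LIMIT_ENABLED="true"
```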
Set LLM_RATE_LIMIT_REQUESTS to your provider’s RPM (requests per minute) limit, and LLM_RATE_LIMIT_INTERVAL to 60. To leave headroom, use ~80–90% of the advertised limit. Check your provider’s dashboard for your current tier limits.

Each cognify() call issues multiple LLM requests (entity extraction, summarization, etc.) per document chunk — plan for several requests per chunk, not one.

Example configurations for common provider tiers

These examples target chat/completions-style LLM endpoints, such as OpenAI models like gpt-4o-mini.
OpenAI - Tier 1
OpenAI - Tier 2
Anthropic - Tier 1
Google Gemini - Free Tier
Conservative Default
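A conservative configuration throttles well below common tier limits; the numbers below are illustrative assumptions, not provider-published limits:

```bash
# Conservative default; values are illustrative, not provider limits
LLM_RATE_LIMIT_ENABLED="true"
LLM_RATE_LIMIT_REQUESTS="30"
LLM_RATE_LIMIT_INTERVAL="60"
```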
Always verify your exact tier limits in your provider’s dashboard — limits vary by model, tier, and region. The examples above are approximations for common tiers and may change.
Fallback Provider
Cognee supports a primary-plus-fallback model configuration that automatically retries a failed request against a secondary provider. This is useful when your primary provider may reject certain content, and you want a fallback to handle those cases gracefully.

When the fallback triggers

The fallback is invoked only on content policy violations from the primary provider:
- ContentFilterFinishReasonError — the provider’s output filter blocked the response
- ContentPolicyViolationError — the request was rejected for policy reasons
- InstructorRetryException containing “content management policy”

The fallback chain is used only when LLM_PROVIDER is set to openai or custom. Other providers (Anthropic, Gemini, Mistral, Bedrock, Ollama) do not currently support the fallback chain.

Configuration

Set these three variables alongside your primary LLM configuration. For LLM_PROVIDER="custom", all three fallback variables (FALLBACK_MODEL, FALLBACK_ENDPOINT, FALLBACK_API_KEY) must be set. If any is missing, Cognee raises a ContentPolicyFilterError instead of falling back. For LLM_PROVIDER="openai", only FALLBACK_MODEL and FALLBACK_API_KEY are required. FALLBACK_ENDPOINT is accepted but currently unused for the OpenAI adapter.

Variable reference

| Variable | Description |
|---|---|
| FALLBACK_MODEL | Model identifier for the fallback provider (use LiteLLM prefix format, e.g. openrouter/openai/gpt-4o-mini) |
| FALLBACK_ENDPOINT | Base URL for the fallback provider’s API (required for custom, optional for openai) |
| FALLBACK_API_KEY | API key for the fallback provider |
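For example, a sketch using OpenRouter as the fallback target; the endpoint and key are placeholders:

```bash
# Fallback provider; illustrative values using OpenRouter as the fallback
FALLBACK_MODEL="openrouter/openai/gpt-4o-mini"
FALLBACK_ENDPOINT="https://openrouter.ai/api/v1"   # required when LLM_PROVIDER="custom"
FALLBACK_API_KEY="sk-or-..."
```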
Notes
- If EMBEDDING_API_KEY is not set, Cognee falls back to LLM_API_KEY for embeddings
- Rate limiting helps manage API usage and costs
- Structured output frameworks ensure consistent data extraction from LLM responses
Related pages:
- Embedding Providers — Configure embedding providers for semantic search
- Overview — Return to setup configuration overview
- Relational Databases — Set up SQLite or Postgres for metadata storage