LLM Providers

LLM (Large Language Model) providers handle text generation, reasoning, and structured output tasks in Cognee. You can choose from cloud providers like OpenAI and Anthropic, or run models locally with Ollama.

New to configuration? See the Setup Configuration Overview for the complete workflow: install extras → create .env → choose providers → handle pruning.

Supported Providers

Cognee supports multiple LLM providers:

OpenAI — GPT models via OpenAI API (default)
Azure OpenAI — GPT models via Azure OpenAI Service
Google Gemini — Gemini models via Google AI
Anthropic — Claude models via Anthropic API
AWS Bedrock — Models available via AWS Bedrock
Groq — Fast inference via Groq API (via LiteLLM)
Ollama — Local models via Ollama
LM Studio — Local models via LM Studio
HuggingFace — Models via HuggingFace Inference API or Inference Endpoints
llama.cpp — Local models via llama-cpp-python (in-process or server mode)
Custom — OpenAI-compatible endpoints (like vLLM, OpenRouter, DeepInfra, company-internal)
MCP Sampling — Reuse the host harness’s LLM via MCP sampling/createMessage (no LLM_API_KEY; only when Cognee runs as an MCP server under a host that grants sampling)

LLM/Embedding Configuration: If you configure only LLM or only embeddings, the other defaults to OpenAI. Ensure you have a working OpenAI API key, or configure both LLM and embeddings to avoid unexpected defaults.

Choosing a Model

Cognee always uses two models together: an LLM for entity/relationship extraction and reasoning, and an embedding model for semantic search. An embedding model is mandatory — every cognify run writes vectors to a vector store, and recall depends on them. If you only set one, the other silently falls back to OpenAI (see the warning above).
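The fallback rule above can be checked before a run. The sketch below is illustrative only and is not Cognee's own validation: the helper name is hypothetical, and it assumes that "configured" means LLM_PROVIDER or EMBEDDING_PROVIDER is set to a non-empty value.

```python
from __future__ import annotations


def silent_openai_defaults(env: dict[str, str]) -> list[str]:
    """List which halves of the config would fall back to OpenAI.

    Assumption: a half counts as configured when its *_PROVIDER variable
    is set to a non-empty value.
    """
    missing = []
    if not env.get("LLM_PROVIDER"):
        missing.append("llm")
    if not env.get("EMBEDDING_PROVIDER"):
        missing.append("embedding")
    return missing


# Only the LLM is configured, so embeddings would default to OpenAI.
print(silent_openai_defaults({"LLM_PROVIDER": "ollama"}))
```

An empty result means both halves are set explicitly. Anything else names the half that would need a working OpenAI API key.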

Light vs. powerful LLM

A small, fast model is the right default. Cognee ships with one (openai/gpt-5-mini), and the examples on this page use comparable light models such as gpt-4o-mini. Knowledge-graph extraction is many short, schema-constrained calls per document rather than a few long ones, so a light model keeps cost and latency low while handling most workloads well.

Reach for a more powerful model when:

Your sources are dense or domain-specific (legal, medical, scientific) and you need higher-fidelity entities and relationships.
You use a custom graph model or ontology with a complex schema the model must populate accurately.
A light model produces noisy or incomplete graphs on your data.

Extraction relies on structured output, so very small or weak models may return malformed JSON or lower-quality graphs. If a small local model struggles, try a stronger one or adjust the instructor mode.

Resource expectations for local models

The embedding model is lightweight: defaults like nomic-embed-text (Ollama) or all-MiniLM-L6-v2 (Fastembed, CPU-only) run comfortably on a CPU or a small GPU and rarely dominate resource use.

The LLM is the constraint for local setups. The Local Setup guide defaults to an 8B model (llama3.1:8b); as a rough guide, an 8B model quantized to 4-bit needs roughly 6 GB of free VRAM, while larger or less-quantized models need proportionally more. If a model does not fit, it spills to system RAM and CPU, which still works but is much slower; for llama.cpp you can tune LLAMA_CPP_N_GPU_LAYERS to offload only as many layers as fit. Limited VRAM does not change graph quality; it mainly affects how fast cognify runs, since extraction issues many sequential LLM calls per document. With low VRAM, prefer a smaller LLM and lower EMBEDDING_BATCH_SIZE (see Embedding Providers) over a large model that does not fit.
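The "roughly 6 GB for an 8B model at 4-bit" figure can be reproduced with back-of-the-envelope arithmetic. The function below is a rule-of-thumb sketch, not a measurement. The 1.5x overhead factor (KV cache, runtime buffers) is an assumption chosen to match the figure above, and real usage depends on context length and the runtime.

```python
def estimate_vram_gb(params_billions: float, bits_per_weight: int, overhead: float = 1.5) -> float:
    """Rough VRAM estimate in GB for a quantized model.

    Weights take params * bits / 8 bytes; `overhead` is an assumed
    multiplier for KV cache and runtime buffers, not a measured value.
    """
    weights_gb = params_billions * bits_per_weight / 8
    return weights_gb * overhead


print(estimate_vram_gb(8, 4))   # 8B parameters at 4-bit
print(estimate_vram_gb(8, 16))  # the same model unquantized (16-bit)
```

The weights alone for an 8B model at 4-bit come to about 4 GB; the remainder is the assumed headroom.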

Configuration

Environment Variables

Set these environment variables in your .env file:

LLM_PROVIDER — The provider to use (openai, gemini, anthropic, ollama, custom, mcp-sampling)
LLM_MODEL — The specific model to use
LLM_API_KEY — Your API key for the provider (not used by mcp-sampling)
LLM_ENDPOINT — Custom endpoint URL (for Azure, Ollama, or custom providers)
LLM_API_VERSION — API version (for Azure OpenAI)
LLM_TEMPERATURE — Sampling temperature for generation (default: 0.0)
LLM_MAX_COMPLETION_TOKENS — Maximum tokens per request (optional)
LLM_INSTRUCTOR_MODE — Structured-output mode override for Instructor-backed LLM calls (optional)
LLM_EXTRACTION_*, LLM_SUMMARIZATION_*, LLM_QUERY_* — Optional per-stage overrides that route individual pipeline stages to different models/providers (see Per-Stage Model Routing)

A preflight LLM connection test can time out at 30s, especially against smaller models. Workaround: add COGNEE_SKIP_CONNECTION_TEST=true to your .env.

Why do model names have a prefix like gemini/ or openrouter/? Cognee routes all LLM requests through LiteLLM, which uses provider prefixes to identify the correct API endpoint. For example, Google lists their model as gemini-2.0-flash, but in Cognee you must write gemini/gemini-2.0-flash. This prefix tells LiteLLM to use the Gemini API. The same applies to custom providers (openrouter/, hosted_vllm/, lm_studio/, and so on). See each provider section below for the correct format.
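To make the prefix convention concrete, the sketch below splits a model string at its first slash. It only illustrates the naming scheme; the helper is hypothetical and is not how LiteLLM resolves providers internally.

```python
from __future__ import annotations


def split_model_prefix(model: str) -> tuple[str | None, str]:
    """Split 'prefix/model-name' at the first slash.

    Returns (None, model) when there is no prefix. Anything after the
    first slash is kept intact, since model names can contain slashes.
    """
    prefix, sep, name = model.partition("/")
    if not sep:
        return None, model
    return prefix, name


print(split_model_prefix("gemini/gemini-2.0-flash"))
print(split_model_prefix("openrouter/deepseek/deepseek-r1"))
```

Note that only the first segment is the routing prefix: in openrouter/deepseek/deepseek-r1 the rest (deepseek/deepseek-r1) is the model slug.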

Provider Setup Guides

OpenAI (Default)

OpenAI is the default provider and works out of the box with minimal configuration.

LLM_PROVIDER="openai"
LLM_MODEL="gpt-4o-mini"
LLM_API_KEY="sk-..."
# Optional overrides
# LLM_ENDPOINT=https://api.openai.com/v1
# LLM_API_VERSION=
# LLM_MAX_COMPLETION_TOKENS=16384

Azure OpenAI

Use Azure OpenAI Service with your own deployment.

LLM_PROVIDER="openai"
LLM_MODEL="azure/gpt-4o-mini"
LLM_ENDPOINT="https://<your-resource>.openai.azure.com/openai/deployments/gpt-4o-mini"
LLM_API_KEY="az-..."
LLM_API_VERSION="2024-12-01-preview"

Google Gemini / Vertex AI

Cognee routes Gemini requests through LiteLLM. There are two ways to reach Gemini models: the Google AI Studio API (a single API key) or Vertex AI (Google Cloud project + service-account credentials).

Google AI Studio (API key)
Vertex AI (Google Cloud)

The simplest setup. Get an API key from Google AI Studio and use the gemini/ model prefix.

LLM_PROVIDER="gemini"
LLM_MODEL="gemini/gemini-2.0-flash"
LLM_API_KEY="AIza..."
# Optional
# LLM_ENDPOINT=https://generativelanguage.googleapis.com/
# LLM_API_VERSION=v1beta

This path talks to the Gemini REST API directly and needs no extra Google packages.

Use Vertex AI when your models are served through a Google Cloud project. Vertex routes through LiteLLM’s vertex_ai/ prefix and authenticates with Google Cloud credentials instead of an API key.

LLM_PROVIDER="gemini"
LLM_MODEL="vertex_ai/gemini-2.0-flash"
LLM_API_KEY="."                          # placeholder; Vertex auth uses credentials below
# Google Cloud project + region (read by LiteLLM)
VERTEXAI_PROJECT="your-gcp-project-id"
VERTEXAI_LOCATION="us-central1"
# Path to your service-account key file (Application Default Credentials)
GOOGLE_APPLICATION_CREDENTIALS="/path/to/service-account.json"

Installation: Vertex AI requires the Google client libraries, which Cognee does not bundle by default. Install them with:

uv pip install google-cloud-aiplatform

GOOGLE_APPLICATION_CREDENTIALS points at a service-account JSON key. If you run inside Google Cloud (or after gcloud auth application-default login), you can omit it and rely on Application Default Credentials.

Anthropic

Use Anthropic’s Claude models for reasoning tasks.

LLM_PROVIDER="anthropic"
LLM_MODEL="claude-sonnet-4-5-20250929"
LLM_API_KEY="sk-ant-..."

Groq

Groq provides fast inference for open models. Cognee routes Groq requests through LiteLLM using the groq/ model prefix.

LLM_PROVIDER="custom"
LLM_MODEL="groq/llama-3.3-70b-versatile"
LLM_API_KEY="gsk_..."

Installation: Install the Groq dependency:

pip install "cognee[groq]"

Popular Groq models (use with the groq/ prefix):

groq/llama-3.3-70b-versatile
groq/llama3-8b-8192
groq/mixtral-8x7b-32768
groq/gemma2-9b-it

See the Groq model list for all available models. Your Groq API key can be created in the Groq Console.

No endpoint needed: The LLM_ENDPOINT variable is not required for Groq — LiteLLM resolves the Groq API endpoint automatically from the groq/ prefix.

AWS Bedrock

Use models available on AWS Bedrock. In addition to the usual LLM_* variables, Bedrock needs AWS-specific settings such as the region and credentials.

LLM_API_KEY="<your_bedrock_api_key>"
LLM_MODEL="eu.amazon.nova-lite-v1:0"
LLM_PROVIDER="bedrock"
LLM_MAX_COMPLETION_TOKENS="16384"
AWS_REGION="<your_aws_region>"
AWS_ACCESS_KEY_ID="<your_aws_access_key_id>"
AWS_SECRET_ACCESS_KEY="<your_aws_secret_access_key>"
AWS_SESSION_TOKEN="<your_aws_session_token>"

# Optional parameters
#AWS_BEDROCK_RUNTIME_ENDPOINT="bedrock-runtime.eu-west-1.amazonaws.com"
#AWS_PROFILE_NAME="<your_aws_profile_name>"

There are multiple ways of connecting to Bedrock models. Cognee picks the first one it finds, in this order:

Using an API key and region. Simply generate your key on AWS, and put it in the LLM_API_KEY env variable. If LLM_API_KEY is set, it takes precedence over the credential and profile methods below, so leave it empty when you want to use those.
Using AWS credentials. Specify only AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY; LLM_API_KEY is not needed. If you are using temporary credentials (e.g. AWS_ACCESS_KEY_ID starting with ASIA..., such as those issued by aws sso login or aws sts assume-role), you must also specify AWS_SESSION_TOKEN. All three values expire and must be refreshed when AWS rotates them.
Using AWS profiles. AWS_PROFILE_NAME is the name of a profile (for example default or my-sso-profile), not a path to a file or to a folder. Cognee hands the name to boto3, which resolves the credentials through the standard AWS chain using the shared config and credentials files at ~/.aws/config and ~/.aws/credentials (override their locations with the AWS_CONFIG_FILE and AWS_SHARED_CREDENTIALS_FILE env variables). This is the recommended path for AWS SSO: run aws sso login --profile <your_aws_profile_name> first, then set AWS_PROFILE_NAME to that profile name and Cognee will use the temporary SSO credentials boto3 caches for it — no need to copy the ASIA... keys into your .env.
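The precedence described above can be sketched as a small decision function. This mirrors the documented order only; it is not Cognee's implementation, and the function name, return values, and the choice to raise on a missing session token are all hypothetical.

```python
from __future__ import annotations


def pick_bedrock_auth(env: dict[str, str]) -> str | None:
    """Return which Bedrock auth method would win, in the documented order."""
    # 1. An API key takes precedence over everything else.
    if env.get("LLM_API_KEY"):
        return "api_key"
    # 2. Static or temporary AWS credentials.
    if env.get("AWS_ACCESS_KEY_ID") and env.get("AWS_SECRET_ACCESS_KEY"):
        is_temporary = env["AWS_ACCESS_KEY_ID"].startswith("ASIA")
        if is_temporary and not env.get("AWS_SESSION_TOKEN"):
            # Assumption: this sketch fails fast instead of trying the profile.
            raise ValueError("temporary credentials also need AWS_SESSION_TOKEN")
        return "credentials"
    # 3. A named profile resolved by boto3.
    if env.get("AWS_PROFILE_NAME"):
        return "profile"
    return None


print(pick_bedrock_auth({"LLM_API_KEY": "key", "AWS_PROFILE_NAME": "default"}))
print(pick_bedrock_auth({"AWS_PROFILE_NAME": "my-sso-profile"}))
```

The first call shows why LLM_API_KEY should be left empty when you intend to use a profile: the key wins whenever it is set.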

Installation: Install the required dependency:

pip install "cognee[aws]"

Model Name: The model name can differ by region (it begins with eu for Europe, us for the USA, etc.).

Ollama (Local)

Run models locally with Ollama for privacy and cost control.

LLM_PROVIDER="ollama"
LLM_MODEL="llama3.1:8b"
LLM_ENDPOINT="http://localhost:11434/v1"
LLM_API_KEY="ollama"

LLM_API_KEY="ollama" is a placeholder required by the client library; Ollama itself does not validate it.

Installation: Install Ollama from ollama.ai and pull your desired model:

ollama pull llama3.1:8b

Zero-API-key setup: To avoid falling back to OpenAI for embeddings, you must also configure the embedding provider to use a local backend. See the Local Setup guide for a complete .env example using Ollama or Fastembed for both LLM and embeddings.

Known Issues

ValidationError on import (Missing: [...]): Cognee validates the core embedding variables as an all-or-nothing group — if you set any of EMBEDDING_PROVIDER, EMBEDDING_MODEL, or EMBEDDING_DIMENSIONS, you must set all three. Setting only some raises Value error, You have set some but not all of the required environment variables for embeddings. Ollama embeddings also need HUGGINGFACE_TOKENIZER for token counting. Provide the full set, for example:
EMBEDDING_PROVIDER="ollama"
EMBEDDING_MODEL="nomic-embed-text:latest"
EMBEDDING_ENDPOINT="http://localhost:11434/api/embed"
EMBEDDING_DIMENSIONS="768"
HUGGINGFACE_TOKENIZER="nomic-ai/nomic-embed-text-v1.5"
See Embedding Providers → Ollama for model-to-tokenizer mappings and how to find the right HUGGINGFACE_TOKENIZER value. The same all-or-nothing rule applies to the LLM group (LLM_MODEL, LLM_ENDPOINT, LLM_API_KEY).
NoDataError with mixed providers: Using Ollama as LLM and OpenAI as embedding provider may fail with NoDataError. Workaround: configure both LLM and embeddings to the same local provider (see the local setup guide above).
Audio transcription is not supported: AudioLoader relies on a Whisper-compatible transcription endpoint. Cognee’s Ollama adapter does not provide one, so audio ingestion will fail when LLM_PROVIDER="ollama".
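The all-or-nothing rule for the embedding group can be reproduced in a few lines, which is handy for linting a .env before importing Cognee. This is a sketch of the rule as described above, not Cognee's validator; the helper name is hypothetical.

```python
from __future__ import annotations

EMBEDDING_GROUP = ("EMBEDDING_PROVIDER", "EMBEDDING_MODEL", "EMBEDDING_DIMENSIONS")


def missing_from_group(env: dict[str, str], group: tuple[str, ...] = EMBEDDING_GROUP) -> list[str]:
    """Return the variables still missing when a group is only partly set.

    An empty list means the group is consistent: either nothing is set
    (defaults apply) or every variable is set.
    """
    present = [name for name in group if env.get(name)]
    if not present or len(present) == len(group):
        return []
    return [name for name in group if not env.get(name)]


partial = {"EMBEDDING_PROVIDER": "ollama", "EMBEDDING_MODEL": "nomic-embed-text:latest"}
print(missing_from_group(partial))
```

The same helper works for the LLM group by passing ("LLM_MODEL", "LLM_ENDPOINT", "LLM_API_KEY") as the second argument.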

Context Window (`num_ctx`) and Custom Modelfiles

If cognify() returns HTTP 500 errors while the same model answers fine when you run ollama run <model> in a terminal, the usual cause is context-window truncation, not a connection problem.

Ollama’s default context window can be much smaller than Cognee’s extraction window. Cognee sizes extraction chunks from LLM_MAX_COMPLETION_TOKENS (default 16384), up to roughly half that per chunk, so the entity-extraction prompts it sends can be far larger than a short terminal prompt. When a prompt exceeds num_ctx, Ollama may truncate it, the model can return malformed or empty structured output, and Instructor’s parse failure can surface as a 500.

LLM_MODEL is just an Ollama model tag, so the fix is to point it at a model whose num_ctx is large enough. Create a custom Modelfile:

FROM llama3.1:8b
PARAMETER num_ctx 8192

Build the tag and reference it in your .env:

ollama create llama3.1:8b-8k -f Modelfile

LLM_PROVIDER="ollama"
LLM_MODEL="llama3.1:8b-8k"
LLM_ENDPOINT="http://localhost:11434/v1"
LLM_API_KEY="ollama"

A larger num_ctx uses more memory. If you can’t raise it, lower LLM_MAX_COMPLETION_TOKENS instead so Cognee builds smaller chunks that fit the model’s existing context window.
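A quick way to reason about the trade-off is to compare the chunk budget with num_ctx. The sketch below uses the "roughly half of LLM_MAX_COMPLETION_TOKENS" rule from this section. The instruction_tokens allowance for the system prompt and schema is a hypothetical parameter, and real prompt sizes are not documented here.

```python
def chunk_token_budget(llm_max_completion_tokens: int = 16384) -> int:
    """Upper bound on tokens per extraction chunk (half the completion limit)."""
    return llm_max_completion_tokens // 2


def prompt_fits(num_ctx: int, llm_max_completion_tokens: int = 16384, instruction_tokens: int = 0) -> bool:
    """Check whether a full-size chunk plus an assumed instruction allowance fits in num_ctx."""
    return chunk_token_budget(llm_max_completion_tokens) + instruction_tokens <= num_ctx


print(prompt_fits(2048))                                 # small context window, default limit
print(prompt_fits(8192))                                 # the Modelfile above, default limit
print(prompt_fits(8192, llm_max_completion_tokens=8192, instruction_tokens=500))
```

Under these assumptions, a num_ctx of 8192 only just holds a full-size chunk at the default limit, so lowering LLM_MAX_COMPLETION_TOKENS leaves more headroom for the instructions around it.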

Connection Troubleshooting

If you see cannot connect to host or connection refused errors, the most common causes are an unreachable endpoint, the wrong protocol, or a Docker networking mismatch.

Default endpoint protocol

Ollama’s local server speaks plain HTTP, not HTTPS. Cognee does not add TLS by default; the protocol is determined entirely by the scheme in LLM_ENDPOINT and EMBEDDING_ENDPOINT. Use https:// only if you have placed Ollama behind a TLS-terminating reverse proxy (Caddy, nginx, Traefik, etc.). For a local Ollama setup, use:

Variable	Value
`LLM_ENDPOINT` (Ollama)	`http://localhost:11434/v1`
`EMBEDDING_ENDPOINT` (Ollama)	`http://localhost:11434/api/embed`

The Ollama embedding engine builds a secure SSL context for outgoing requests, but it is only applied when the endpoint URL uses https://; plain HTTP requests are not upgraded.

localhost vs host.docker.internal

Inside a Docker container, localhost refers to the container itself, not your host machine where Ollama is running. If Cognee runs in Docker and Ollama runs on the host, use host.docker.internal instead:

LLM_ENDPOINT="http://host.docker.internal:11434/v1"
EMBEDDING_ENDPOINT="http://host.docker.internal:11434/api/embed"

host.docker.internal is available on Docker Desktop (macOS/Windows) and on Linux when the host-gateway mapping is configured in docker-compose.yml. On Linux without that mapping, use --network host or the Docker bridge IP.

Other common causes

Ollama not running: verify with curl http://localhost:11434/api/tags from the same machine and network namespace Cognee is running in.
Wrong port: the default Ollama port is 11434. If you started Ollama with OLLAMA_HOST=0.0.0.0:<port>, match that port in LLM_ENDPOINT.
Missing path suffix: the LLM endpoint must end in /v1 (OpenAI-compatible chat completions), and the embedding endpoint must end in /api/embed. Pointing either at the bare host (e.g. http://localhost:11434) will fail.
Bind address: Ollama binds to 127.0.0.1 by default. To accept connections from other machines or Docker containers via a LAN IP, start it with OLLAMA_HOST=0.0.0.0:11434.
Inside Docker Compose: if the LLM or embedding endpoint runs on your host machine, localhost inside the container points back to the container itself. Use host.docker.internal on Docker Desktop (macOS/Windows), or add a host-gateway mapping in docker-compose.yml on Linux.
LLM_ENDPOINT="http://host.docker.internal:11434/v1"
EMBEDDING_ENDPOINT="http://host.docker.internal:11434/api/embed"
If the service runs in the same Compose project, use the Compose service name instead of localhost for any DB_HOST, VECTOR_DB_URL, or GRAPH_DATABASE_URL setting.
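Several of the causes above (missing scheme, missing path suffix, localhost inside a container) can be caught by inspecting the endpoint strings alone. The checker below is a minimal sketch using only the standard library; the function name and the in_docker flag are hypothetical, and it does not contact Ollama.

```python
from __future__ import annotations

from urllib.parse import urlparse


def check_ollama_endpoints(llm_endpoint: str, embedding_endpoint: str, in_docker: bool = False) -> list[str]:
    """Return human-readable problems found in the two Ollama endpoint URLs."""
    problems = []
    expected = (
        ("LLM_ENDPOINT", llm_endpoint, "/v1"),
        ("EMBEDDING_ENDPOINT", embedding_endpoint, "/api/embed"),
    )
    for name, url, suffix in expected:
        parsed = urlparse(url)
        if parsed.scheme not in ("http", "https"):
            problems.append(f"{name}: missing http:// or https:// scheme")
        if not parsed.path.rstrip("/").endswith(suffix):
            problems.append(f"{name}: path should end in {suffix}")
        if in_docker and parsed.hostname in ("localhost", "127.0.0.1"):
            problems.append(f"{name}: localhost points at the container itself")
    return problems


for problem in check_ollama_endpoints("http://localhost:11434", "http://localhost:11434/api/embed"):
    print(problem)
```

A clean result only means the URLs are well-formed; whether Ollama is actually reachable still needs the curl check described above.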

HuggingFace

Use models from HuggingFace via the HuggingFace Inference API (serverless) or dedicated Inference Endpoints.

Serverless
Dedicated Endpoint

LLM_PROVIDER="custom"
LLM_MODEL="huggingface/mistralai/Mistral-7B-Instruct-v0.3"
LLM_API_KEY="hf_..."

LLM_PROVIDER="custom"
LLM_MODEL="huggingface/mistralai/Mistral-7B-Instruct-v0.3"
LLM_ENDPOINT="https://<your-endpoint-id>.<region>.aws.endpoints.huggingface.cloud/v1/"
LLM_API_KEY="hf_..."

Installation: Install the HuggingFace extra to enable the HuggingFace tokenizer used for chunking:

pip install "cognee[huggingface]"

Model names: Use the full HuggingFace model repo ID after the huggingface/ prefix (e.g., huggingface/mistralai/Mixtral-8x7B-Instruct-v0.1). Not all models on HuggingFace support the text generation inference API — check the model card for compatibility. The model is routed through LiteLLM.

LM Studio (Local)

Run models locally with LM Studio for privacy and cost control.

LLM_PROVIDER="custom"
LLM_MODEL="lm_studio/magistral-small-2509"
LLM_ENDPOINT="http://127.0.0.1:1234/v1"
LLM_API_KEY="."
LLM_INSTRUCTOR_MODE="json_schema_mode"

Installation: Install LM Studio from lmstudio.ai and download your desired model from LM Studio’s interface. Load your model, start the LM Studio server, and Cognee will be able to connect to it.

Set up instructor mode: LLM_INSTRUCTOR_MODE controls how Cognee asks the model for structured output. LM Studio models often work best with json_schema_mode. For more detail, see LLM Instructor Modes below and Structured Output Backends.

llama.cpp (Local)

Run models locally using llama-cpp-python for full offline inference.

Cognee supports two setup modes:

Local mode — Load a .gguf model directly in-process
Server mode — Connect to a running llama-cpp-python server over HTTP

Installation: Install the required dependency:

pip install "cognee[llama-cpp]"

Choosing a mode: Use local mode for the simplest setup with no separate server process. Use server mode if you want to share one model across multiple processes or run the model on another machine.

Local Mode (In-Process)

Load a GGUF model file directly. No server setup required.

LLM_PROVIDER="llama_cpp"
LLAMA_CPP_MODEL_PATH="/path/to/your/model.gguf"

# Optional: context window size (default: 2048)
LLAMA_CPP_N_CTX=4096

# Optional: GPU layers to offload (default: 0 = CPU only, -1 = all layers on GPU)
LLAMA_CPP_N_GPU_LAYERS=35

# Optional: chat format (default: chatml)
LLAMA_CPP_CHAT_FORMAT="chatml"

GPU acceleration: Set LLAMA_CPP_N_GPU_LAYERS=-1 to offload all layers to GPU, or set a positive integer to offload a specific number of layers. Leave it at 0 for CPU-only inference.

Concurrency: In local in-process mode the model is loaded once and shared across calls. Because the underlying llama_cpp.Llama instance is not thread-safe, Cognee serializes concurrent structured-output calls (such as the per-chunk extraction that cognify() fans out) on that single instance. This means in-process requests are processed one at a time rather than in parallel; if you need parallel decoding, run a llama-cpp-python server and use Server Mode instead.
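The serialization described above is the standard pattern of guarding a non-thread-safe object with a lock. The sketch below demonstrates the effect with a stand-in model. It is not Cognee's adapter code, and FakeModel and SerializedModel are hypothetical names.

```python
import threading
import time


class FakeModel:
    """Stand-in for a non-thread-safe model; records peak concurrent calls."""

    def __init__(self) -> None:
        self.active = 0
        self.max_active = 0

    def __call__(self, prompt: str) -> str:
        self.active += 1
        self.max_active = max(self.max_active, self.active)
        time.sleep(0.01)  # simulate decoding time
        self.active -= 1
        return prompt.upper()


class SerializedModel:
    """Wraps a model so that only one call runs at a time."""

    def __init__(self, model) -> None:
        self._model = model
        self._lock = threading.Lock()

    def complete(self, prompt: str) -> str:
        with self._lock:
            return self._model(prompt)


def run_demo(workers: int = 4) -> int:
    """Fire several concurrent calls and return the peak concurrency seen."""
    fake = FakeModel()
    shared = SerializedModel(fake)
    threads = [threading.Thread(target=shared.complete, args=(f"chunk {i}",)) for i in range(workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return fake.max_active


print(run_demo())
```

Total time grows linearly with the number of calls under this pattern, which is why Server Mode is the better fit when parallel decoding matters.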

Server Mode (OpenAI-Compatible)

Connect to a running llama-cpp-python server. Start the server separately:

python -m llama_cpp.server --model /path/to/your/model.gguf --port 8000

Then configure Cognee to connect to it:

LLM_PROVIDER="llama_cpp"
LLM_ENDPOINT="http://localhost:8000/v1"
LLM_API_KEY="."
LLM_MODEL="your-model-name"

Custom Providers

Use OpenAI-compatible endpoints like OpenRouter or other services.

LLM_PROVIDER="custom"
LLM_MODEL="openrouter/google/gemini-2.0-flash-lite-preview-02-05:free"
LLM_ENDPOINT="https://openrouter.ai/api/v1"
LLM_API_KEY="or-..."
# Optional: fallback provider for content policy violations
# FALLBACK_MODEL=openrouter/openai/gpt-4o-mini
# FALLBACK_ENDPOINT=https://openrouter.ai/api/v1
# FALLBACK_API_KEY=or-...

See Fallback Provider in Advanced Options for full details.

Custom Provider Prefixes: When using LLM_PROVIDER="custom", you must include the correct provider prefix in your model name. Cognee forwards requests to LiteLLM, which uses these prefixes to route requests correctly.

Common prefixes include:

hosted_vllm/ — vLLM servers
openrouter/ — OpenRouter
lm_studio/ — LM Studio
openai/ — OpenAI-compatible APIs

See the LiteLLM providers documentation for the full list of supported prefixes.

Below are examples for common providers and patterns:

DeepSeek

Use DeepSeek’s models for reasoning and chat via their OpenAI-compatible API.

LLM_PROVIDER="custom"
LLM_MODEL="deepseek/deepseek-chat"
LLM_ENDPOINT="https://api.deepseek.com/v1"
LLM_API_KEY="sk-..."

Get your API key from platform.deepseek.com. The deepseek/ prefix tells LiteLLM to route to the DeepSeek API.

Popular DeepSeek models (use with the deepseek/ prefix):

deepseek/deepseek-chat — DeepSeek-V3 (general chat and instruction following)
deepseek/deepseek-reasoner — DeepSeek-R1 (chain-of-thought reasoning)

Structured output: DeepSeek’s API is OpenAI-compatible, so the default json_mode for custom providers works well. If you encounter issues with structured output, try setting LLM_INSTRUCTOR_MODE="tool_call".

Kimi (Moonshot AI)

Use Moonshot AI’s Kimi models via their OpenAI-compatible API.

LLM_PROVIDER="custom"
LLM_MODEL="moonshot/moonshot-v1-32k"
LLM_ENDPOINT="https://api.moonshot.cn/v1"
LLM_API_KEY="sk-..."

Get your API key from platform.moonshot.cn. The moonshot/ prefix tells LiteLLM to route to the Moonshot AI API.

Available Kimi models (use with the moonshot/ prefix):

moonshot/moonshot-v1-8k — 8k context window
moonshot/moonshot-v1-32k — 32k context window
moonshot/moonshot-v1-128k — 128k context window (for long documents)

OpenRouter

Use OpenRouter to access hundreds of models from a single API endpoint.

LLM_PROVIDER="custom"
LLM_MODEL="openrouter/deepseek/deepseek-r1"
LLM_ENDPOINT="https://openrouter.ai/api/v1"
LLM_API_KEY="sk-or-..."

Get your API key from openrouter.ai/keys. Browse all available models at openrouter.ai/models and prefix the model slug with openrouter/.

Example models (use with the openrouter/ prefix):

openrouter/deepseek/deepseek-r1 — DeepSeek R1 via OpenRouter
openrouter/google/gemini-2.0-flash-lite-preview-02-05:free — Free Gemini tier
openrouter/openai/gpt-4o-mini — GPT-4o Mini via OpenRouter

DeepInfra

Use DeepInfra to access open-source models via their OpenAI-compatible API.

LLM_PROVIDER="custom"
LLM_MODEL="deepinfra/meta-llama/Meta-Llama-3-8B-Instruct"
LLM_ENDPOINT="https://api.deepinfra.com/v1/openai"
LLM_API_KEY="<your-deepinfra-api-key>"

Find your model name in the DeepInfra model catalog. The deepinfra/ prefix tells LiteLLM to route to DeepInfra.

Company-Internal / Self-Hosted Endpoints

Any internal LLM server that exposes an OpenAI-compatible REST API (e.g., a corporate vLLM deployment, internal TGI server, or private OpenRouter proxy) can be used with the custom provider.

LLM_PROVIDER="custom"
LLM_MODEL="openai/<your-internal-model-name>"
LLM_ENDPOINT="https://llm.internal.example.com/v1"
LLM_API_KEY="<internal-api-key-or-bearer-token>"

The model prefix you use (openai/, hosted_vllm/, etc.) determines which LiteLLM adapter handles the request. For most OpenAI-compatible servers, openai/ works best. Set LLM_API_KEY to whatever bearer token your server requires (use . if no auth is needed).

vLLM

Use vLLM for high-performance model serving with OpenAI-compatible API.

LLM_PROVIDER="custom"
LLM_MODEL="hosted_vllm/<your-model-name>"
LLM_ENDPOINT="https://your-vllm-endpoint/v1"
LLM_API_KEY="."

Example with Gemma:

LLM_PROVIDER="custom"
LLM_MODEL="hosted_vllm/gemma-3-12b"
LLM_ENDPOINT="https://your-vllm-endpoint/v1"
LLM_API_KEY="."

Important: The hosted_vllm/ prefix is required for LiteLLM to correctly route requests to your vLLM server. The model name after the prefix should match the model ID returned by your vLLM server’s /v1/models endpoint.

To find the correct model name, see the vLLM documentation.

MCP Sampling (reuse the host's LLM, no API key)

When Cognee runs as an MCP server (cognee-mcp) inside a host that grants the MCP sampling capability, LLM_PROVIDER="mcp-sampling" delegates completions to the host’s own model through sampling/createMessage. No LLM_API_KEY is required.

LLM_PROVIDER="mcp-sampling"
# LLM_MODEL is a preference hint only — the host chooses the actual model
LLM_MODEL="host-default"

Preconditions: This provider only works while Cognee is running as an MCP server inside a host process that granted the sampling capability to the client. If Cognee is not running under such a host — or the host did not grant sampling — the adapter fails closed with MCPSamplingUnavailableError before issuing any request. Treat that error as a configuration/capability issue: set LLM_PROVIDER to a provider with credentials, or run inside a sampling-granting host.

Host support varies. Not every MCP host grants the sampling capability. For example, as of early 2026 Claude Code does not yet grant sampling (anthropics/claude-code#1785). Check your host’s MCP documentation.

Completions only. MCP sampling covers text completions; it does not provide embeddings, audio transcription, or image description. Because vector search needs embeddings, you must still configure an embedding provider (audio transcription returns nothing and image description raises NotImplementedError).

Structured output. The MCP protocol returns free text only, so Cognee produces structured output by embedding the response model’s JSON Schema in the prompt and running a bounded validate/repair loop (up to 5 attempts) before raising an error. Plain-string responses pass through unchanged.

Background tasks (such as the cognify tasks launched from within a request) inherit the host MCP session automatically via the SDK’s per-request context, so no changes to cognee-mcp server code are needed.
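The bounded validate/repair loop can be pictured as follows. This is a simplified sketch and not Cognee's adapter: it checks a list of required keys instead of a full JSON Schema, the ask callable stands in for the host's sampling call, and all names are hypothetical.

```python
from __future__ import annotations

import json
from typing import Callable


def complete_structured(
    ask: Callable[[str], str],
    prompt: str,
    required_keys: list[str],
    max_attempts: int = 5,
) -> dict:
    """Ask for JSON, validate it, and re-prompt with the error until it parses."""
    hint = f"Reply with a JSON object containing the keys {required_keys}."
    message = f"{prompt}\n\n{hint}"
    last_error = ""
    for _ in range(max_attempts):
        reply = ask(message)
        try:
            data = json.loads(reply)  # JSONDecodeError is a ValueError subclass
            if not isinstance(data, dict):
                raise ValueError("reply is not a JSON object")
            missing = [key for key in required_keys if key not in data]
            if missing:
                raise ValueError(f"missing keys: {missing}")
            return data
        except ValueError as exc:
            last_error = str(exc)
            # Repair step: feed the validation error back to the model.
            message = f"{prompt}\n\n{hint}\nYour previous reply was invalid ({last_error}). Return only the corrected JSON."
    raise RuntimeError(f"no valid structured output after {max_attempts} attempts: {last_error}")


# A fake host model that fails once, then returns valid JSON.
replies = iter(["not json", '{"name": "Ada"}'])
print(complete_structured(lambda _message: next(replies), "Extract the person.", ["name"]))
```

Bounding the loop keeps a misbehaving host model from retrying forever; once the attempts are used up, the caller sees a single error.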

Advanced Options

Switching the LLM on an existing dataset

LLM configuration is read at runtime from your environment/.env; it is not stored per dataset. Changing LLM_PROVIDER / LLM_MODEL therefore works fine on top of a dataset you have already processed; nothing about the existing data blocks the switch.

What is not affected. Already-processed data is left untouched: the entities and relationships in your graph store and the embeddings in your vector store are neither re-computed nor invalidated. Cognee does not re-run past extraction, and vectors depend on the embedding model, not the LLM, so recall over existing data keeps working.

What is affected. The new LLM applies only to future work:

Subsequent cognify / memify runs — new data is extracted and summarized with the new model.
Query-time reasoning during search (e.g. GRAPH_COMPLETION) — answers are generated by the new model over the existing graph and vectors.

This means a graph can mix output from different LLMs: nodes written by the old model stay as-is, and only newly cognified data reflects the new one. If you want the whole dataset to reflect the new model’s extraction quality, re-process it: run cognify with incremental_loading=False to force a full reprocess, or empty the dataset, re-add the source data, and run cognify again. Simply re-running cognify is not enough — it skips already-processed data by default.

Changing the embedding model is different: existing vectors were written with the old embeddings and become inconsistent with new ones. To change embeddings on an existing dataset you must re-process it — run cognify with incremental_loading=False, or delete and re-add the data — and if the new model has a different EMBEDDING_DIMENSIONS, remove the existing vector collections first (e.g. with prune, which wipes all datasets). See Embedding Providers.

Per-Stage Model Routing

By default Cognee uses a single model (the base LLM_* settings) for every stage of the pipeline. You can optionally route individual stages to different models or providers by setting stage-specific env var groups. Because extraction runs once per chunk and typically dominates token spend, it is often worth routing a cheaper or local model there while keeping a stronger model for summarization and query-time reasoning.

Stages and their env groups

Env group	Stage it controls
`LLM_EXTRACTION_*`	Entity/relationship extraction during `cognify()` (runs per chunk)
`LLM_SUMMARIZATION_*`	Text summarization during `cognify()`
`LLM_QUERY_*`	Query-time completion during `search()`

Each group accepts the same fields as the base LLM_* config, with the stage name in place of the leading LLM:

LLM_<STAGE>_MODEL
LLM_<STAGE>_PROVIDER
LLM_<STAGE>_ENDPOINT
LLM_<STAGE>_API_KEY
LLM_<STAGE>_API_VERSION

Fallback to base config: any stage field you leave unset (empty or absent) falls back to the corresponding base LLM_* value, so you only set what you want to override. If you set no stage overrides at all, the effective config is identical to a single-model setup; default single-model behavior is unchanged.

Example: route extraction to a local Ollama model while summarization and query keep using OpenAI:

# Base config (used for any stage field left unset)
LLM_PROVIDER="openai"
LLM_MODEL="openai/gpt-5-mini"
LLM_API_KEY="sk-..."

# Extraction → local Ollama (cheap, high-volume)
LLM_EXTRACTION_MODEL="ollama_chat/llama3.1"
LLM_EXTRACTION_PROVIDER="ollama"
LLM_EXTRACTION_ENDPOINT="http://localhost:11434"
LLM_EXTRACTION_API_KEY=""

# Summarization and query keep the base model (set explicitly if you want a different one)
LLM_SUMMARIZATION_MODEL="openai/gpt-5-mini"
LLM_SUMMARIZATION_PROVIDER="openai"
LLM_QUERY_MODEL="openai/gpt-5-mini"
LLM_QUERY_PROVIDER="openai"

No SDK or pipeline call signatures change when you enable per-stage routing. Each stage transparently gets its own cached client derived from its effective config, so concurrent stages can use different models safely.
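The fallback rule for stage overrides can be expressed as a small resolver. This sketch follows the rule as written above (empty or absent stage fields inherit the base value) and is not Cognee's config loader; the function name is hypothetical.

```python
from __future__ import annotations

STAGE_FIELDS = ("MODEL", "PROVIDER", "ENDPOINT", "API_KEY", "API_VERSION")


def effective_stage_config(env: dict[str, str], stage: str) -> dict[str, str]:
    """Resolve LLM_<STAGE>_<FIELD>, falling back to LLM_<FIELD> when unset or empty."""
    config = {}
    for field in STAGE_FIELDS:
        stage_value = env.get(f"LLM_{stage}_{field}")
        config[field.lower()] = stage_value or env.get(f"LLM_{field}", "")
    return config


env = {
    "LLM_PROVIDER": "openai",
    "LLM_MODEL": "openai/gpt-5-mini",
    "LLM_API_KEY": "sk-...",
    "LLM_EXTRACTION_MODEL": "ollama_chat/llama3.1",
    "LLM_EXTRACTION_PROVIDER": "ollama",
    "LLM_EXTRACTION_ENDPOINT": "http://localhost:11434",
    "LLM_EXTRACTION_API_KEY": "",
}
print(effective_stage_config(env, "EXTRACTION"))
print(effective_stage_config(env, "QUERY"))
```

Under this reading of the rule, an empty LLM_EXTRACTION_API_KEY inherits the base LLM_API_KEY instead of sending no key at all.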

LLM Instructor Modes

When using the Instructor structured-output framework (the default), Cognee instructs the model to return structured data in a specific way. The LLM_INSTRUCTOR_MODE environment variable controls which strategy is used.

Each provider has a built-in default that matches its API capabilities. Override it only when the default doesn’t work for your specific model.

Available modes:

Mode	Description	When to use
`json_schema_mode`	Passes the full JSON Schema of the expected output in the request and enforces strict schema compliance.	OpenAI models that support the `response_format` / structured-output feature (e.g. GPT-4o). Also works well with Bedrock and some local models.
`json_mode`	Instructs the model to return any valid JSON object. Instructor then validates and coerces it to the target schema.	Gemini, Ollama, Generic/Custom endpoints, and any model that supports `response_format: json_object` but not strict schema enforcement.
`anthropic_tools`	Uses Anthropic’s native tool-calling API to extract structured data.	Anthropic Claude models only. Leverages first-class tool-use support for reliable extraction.
`mistral_tools`	Uses Mistral’s native tool-calling API to extract structured data.	Mistral models only. Mirrors the OpenAI function-calling interface provided by Mistral.
`tool_call`	Uses the generic OpenAI-style function/tool-calling API to define the schema as a callable tool.	OpenAI-compatible APIs that support function calling but not strict JSON schema output.
`md_json`	Asks the model to return JSON wrapped in a Markdown code block. Instructor extracts the block and validates it.	Models that reliably format code blocks but may not support `json_mode` (e.g. some self-hosted models).

Per-provider defaults (from source code):

Provider (`LLM_PROVIDER`)	Default mode
`openai` (and Azure OpenAI)	`json_schema_mode`
`anthropic`	`anthropic_tools`
`gemini`	`json_mode`
`bedrock`	`json_schema_mode`
`mistral`	`mistral_tools`
`ollama`	`json_mode`
`custom` (generic OpenAI-compatible)	`json_mode`

Example — override the mode:

LLM_INSTRUCTOR_MODE="json_schema_mode"

Override the default only when the model you are using requires a different mode. For example, LM Studio models typically need json_schema_mode even though the custom provider defaults to json_mode.
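The override rule can be pictured with a small Python sketch. It is illustrative only: `resolve_instructor_mode` and `DEFAULT_INSTRUCTOR_MODE` are hypothetical names, not Cognee APIs, and the mapping simply copies the per-provider defaults table above (with `azure` assumed to share the `openai` default).

```python
import os

# Per-provider defaults copied from the table above (not read from Cognee itself).
DEFAULT_INSTRUCTOR_MODE = {
    "openai": "json_schema_mode",
    "azure": "json_schema_mode",  # assumption: Azure OpenAI shares the openai default
    "anthropic": "anthropic_tools",
    "gemini": "json_mode",
    "bedrock": "json_schema_mode",
    "mistral": "mistral_tools",
    "ollama": "json_mode",
    "custom": "json_mode",
}


def resolve_instructor_mode(provider, env=None):
    """Hypothetical helper: an explicit LLM_INSTRUCTOR_MODE wins, else the provider default."""
    env = os.environ if env is None else env
    override = env.get("LLM_INSTRUCTOR_MODE", "").strip()
    if override:
        return override
    return DEFAULT_INSTRUCTOR_MODE[provider]


# No override: the provider default applies.
print(resolve_instructor_mode("anthropic", env={}))
# LM Studio behind the custom provider, overridden as described above.
print(resolve_instructor_mode("custom", env={"LLM_INSTRUCTOR_MODE": "json_schema_mode"}))
```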

Temperature

Control the randomness of LLM responses with the LLM_TEMPERATURE environment variable.

Variable	Default	Description
`LLM_TEMPERATURE`	`0.0`	Sampling temperature. `0.0` = deterministic / focused output. Higher values (e.g. `0.7`–`1.0`) produce more varied, creative responses.

When to adjust: Cognee’s default of 0.0 is recommended for knowledge-graph extraction because it produces consistent, structured output. Raise the temperature only if you need more variety in generated text (e.g. conversational responses or creative summarisation).

Max Completion Tokens

LLM_MAX_COMPLETION_TOKENS sets the maximum number of tokens an LLM call may generate per request (passed to the provider as max_tokens/max_completion_tokens).

Variable	Default	Description
`LLM_MAX_COMPLETION_TOKENS`	`16384`	Per-request output-token ceiling, and an input to automatic chunk sizing

Observable impact:

Truncation. If extraction or summarisation responses are larger than this ceiling, the provider stops generating mid-response. With structured output this surfaces as malformed/incomplete JSON or, with some local models, HTTP 500 errors. Raise the value if you see truncated output.
Effective value is clamped. When the model is in LiteLLM’s model registry, Cognee uses min(model limit from LiteLLM's registry, LLM_MAX_COMPLETION_TOKENS). Setting it far above the registry limit has no effect — “higher” is not automatically “better”.
Chunk size, cost and latency. Extraction chunks are sized as min(EMBEDDING_MAX_COMPLETION_TOKENS, LLM_MAX_COMPLETION_TOKENS // 2) — so this value also caps how much text goes into each cognify chunk. A larger value means fewer, larger chunks (fewer LLM calls but more tokens per call); a smaller value means more, smaller chunks (more calls, finer-grained extraction). See Chunkers for how chunk size shapes the graph.

Tuning guidance: the default 16384 is a good starting point for cloud models. Lower it for local models with a small context window so chunks fit (see the Ollama num_ctx note). Raise it only if your model supports a larger output window and you observe truncated extraction.
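The clamping and chunk-sizing rules above reduce to two `min` expressions. The sketch below works them through in Python; the function names are hypothetical, and the sample numbers (an 8192-token registry limit, an 8191-token embedding limit) are assumptions chosen for illustration rather than values taken from Cognee.

```python
def effective_max_completion_tokens(configured, registry_limit=None):
    """Clamp the configured ceiling to the registry limit when one is known."""
    if registry_limit is None:
        # Model not in the registry: the configured value is used as-is.
        return configured
    return min(registry_limit, configured)


def extraction_chunk_size(embedding_max_tokens, llm_max_completion_tokens):
    """min(EMBEDDING_MAX_COMPLETION_TOKENS, LLM_MAX_COMPLETION_TOKENS // 2)."""
    return min(embedding_max_tokens, llm_max_completion_tokens // 2)


# Setting the value far above the registry limit has no effect.
print(effective_max_completion_tokens(65536, registry_limit=8192))  # 8192

# Default ceiling: the embedding limit is the smaller term.
print(extraction_chunk_size(8191, 16384))  # 8191

# Lowered ceiling for a small local model: the LLM term now caps the chunk.
print(extraction_chunk_size(8191, 4096))  # 2048
```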

Rate Limiting

Control client-side throttling for LLM calls to manage API usage and costs.

Rate limiting is disabled by default. You must explicitly set LLM_RATE_LIMIT_ENABLED="true" to activate it.

Defaults (when rate limiting is enabled):

Variable	Default	Meaning
`LLM_RATE_LIMIT_ENABLED`	`false`	Off by default — opt-in
`LLM_RATE_LIMIT_REQUESTS`	`60`	Max requests per interval
`LLM_RATE_LIMIT_INTERVAL`	`60`	Interval in seconds

The defaults (60 requests / 60 seconds) allow 1 request/second on average. Adjust both values to match your provider’s tier limit.

How it works:

Client-side limiter: Cognee paces outbound LLM calls before they reach the provider
Moving window: Spreads allowance across the time window for smoother throughput
Per-process scope: In-memory limits don’t share across multiple processes/containers
Auto-applied: Works with all providers (OpenAI, Gemini, Anthropic, Ollama, Custom)
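To make the client-side, per-process behavior concrete, here is a minimal sliding-window limiter in Python. It is a sketch of the general technique, not Cognee's implementation: the class name, method name, and injectable clock are all assumptions made for illustration.

```python
import time
from collections import deque


class MovingWindowLimiter:
    """Illustrative in-memory limiter: at most max_requests per interval_seconds."""

    def __init__(self, max_requests, interval_seconds, clock=time.monotonic):
        self.max_requests = max_requests
        self.interval = interval_seconds
        self.clock = clock  # injectable so the behavior can be tested without sleeping
        self._sent = deque()  # timestamps of requests still inside the window

    def try_acquire(self):
        """Return True and record the request if the window has room."""
        now = self.clock()
        # Drop timestamps that have aged out of the window.
        while self._sent and now - self._sent[0] >= self.interval:
            self._sent.popleft()
        if len(self._sent) < self.max_requests:
            self._sent.append(now)
            return True
        return False


limiter = MovingWindowLimiter(max_requests=60, interval_seconds=60)
if limiter.try_acquire():
    pass  # send the LLM request here
```

Because the timestamps live in process memory, two workers each holding their own limiter would together send up to twice the configured rate, which is why the per-process scope matters when sizing limits for multi-container deployments.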

Sizing guidance:

Set LLM_RATE_LIMIT_REQUESTS to your provider’s RPM (requests per minute) limit, and LLM_RATE_LIMIT_INTERVAL to 60. To leave headroom, use ~80–90% of the advertised limit. Check your provider’s dashboard for your current tier limits.

Each cognify() call issues multiple LLM requests (entity extraction, summarization, etc.) per document chunk, so plan for several requests per chunk, not one.

Example configurations for common provider tiers

These examples target chat/completions-style LLM endpoints, such as OpenAI models like gpt-4o-mini.

OpenAI - Tier 1

LLM_RATE_LIMIT_ENABLED="true"
LLM_RATE_LIMIT_REQUESTS="450"
LLM_RATE_LIMIT_INTERVAL="60"

OpenAI - Tier 2

LLM_RATE_LIMIT_ENABLED="true"
LLM_RATE_LIMIT_REQUESTS="4500"
LLM_RATE_LIMIT_INTERVAL="60"

Anthropic - Tier 1

LLM_RATE_LIMIT_ENABLED="true"
LLM_RATE_LIMIT_REQUESTS="45"
LLM_RATE_LIMIT_INTERVAL="60"

Google Gemini - Free Tier

LLM_RATE_LIMIT_ENABLED="true"
LLM_RATE_LIMIT_REQUESTS="13"
LLM_RATE_LIMIT_INTERVAL="60"

Conservative Default

LLM_RATE_LIMIT_ENABLED="true"
LLM_RATE_LIMIT_REQUESTS="60"
LLM_RATE_LIMIT_INTERVAL="60"

Always verify your exact tier limits in your provider’s dashboard — limits vary by model, tier, and region. The examples above are approximations for common tiers and may change.
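The sizing guidance above can be turned into a small calculation. This Python sketch derives the three variables from an advertised RPM limit with 90% headroom; `rate_limit_env` is a hypothetical helper, and the RPM figures passed in are placeholders you would replace with the limits shown in your provider's dashboard.

```python
def rate_limit_env(advertised_rpm, headroom_percent=90):
    """Hypothetical helper: build rate-limit settings from an advertised RPM limit."""
    # Integer arithmetic avoids floating-point surprises; never go below 1 request.
    requests = max(1, advertised_rpm * headroom_percent // 100)
    return {
        "LLM_RATE_LIMIT_ENABLED": "true",
        "LLM_RATE_LIMIT_REQUESTS": str(requests),
        "LLM_RATE_LIMIT_INTERVAL": "60",
    }


# Assumed advertised limit of 500 RPM; substitute your own tier's value.
for name, value in rate_limit_env(500).items():
    print(f'{name}="{value}"')
```

With an assumed limit of 500 RPM this yields 450 requests per 60 seconds; lowering `headroom_percent` to 80 leaves more slack for other clients sharing the same API key.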

Fallback Provider

Cognee supports a primary-plus-fallback model configuration that automatically retries a failed request against a secondary provider. This is useful when your primary provider may reject certain content, and you want a fallback to handle those cases gracefully.

When the fallback triggers

The fallback is invoked only on content policy violations from the primary provider:

ContentFilterFinishReasonError — the provider’s output filter blocked the response
ContentPolicyViolationError — the request was rejected for policy reasons
InstructorRetryException containing “content management policy”

The fallback does not activate for network errors, rate limits, or authentication failures.

Supported providers

Fallback is available when LLM_PROVIDER is set to openai or custom. Other providers (Anthropic, Gemini, Mistral, Bedrock, Ollama) do not currently support the fallback chain.

Configuration

Set these three variables alongside your primary LLM configuration:

# Primary provider
LLM_PROVIDER="openai"
LLM_MODEL="openai/gpt-4o-mini"
LLM_API_KEY="sk-..."

# Fallback provider (used only on content policy violations)
FALLBACK_MODEL="openrouter/openai/gpt-4o-mini"
FALLBACK_ENDPOINT="https://openrouter.ai/api/v1"
FALLBACK_API_KEY="or-..."

For LLM_PROVIDER="custom", all three fallback variables (FALLBACK_MODEL, FALLBACK_ENDPOINT, FALLBACK_API_KEY) must be set. If any is missing, Cognee raises a ContentPolicyFilterError instead of falling back.

For LLM_PROVIDER="openai", only FALLBACK_MODEL and FALLBACK_API_KEY are required. If FALLBACK_ENDPOINT is set, it is forwarded to the OpenAI adapter and the fallback request is routed to that base URL; if it is omitted, the fallback request uses the default OpenAI endpoint.

Variable reference

Variable	Description
`FALLBACK_MODEL`	Model identifier for the fallback provider (use LiteLLM prefix format, e.g. `openrouter/openai/gpt-4o-mini`)
`FALLBACK_ENDPOINT`	Base URL for the fallback provider’s API (required for `custom`, optional for `openai`)
`FALLBACK_API_KEY`	API key for the fallback provider
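A quick pre-flight check can catch an incomplete fallback configuration before a content-policy rejection exposes it at runtime. The Python sketch below encodes the per-provider requirements described above; `missing_fallback_vars` is a hypothetical helper, not part of Cognee.

```python
import os

# Required fallback variables per provider, as described in this section.
REQUIRED_FALLBACK_VARS = {
    "openai": ("FALLBACK_MODEL", "FALLBACK_API_KEY"),
    "custom": ("FALLBACK_MODEL", "FALLBACK_ENDPOINT", "FALLBACK_API_KEY"),
}


def missing_fallback_vars(provider, env=None):
    """Hypothetical helper: list the fallback variables that are unset or empty."""
    env = os.environ if env is None else env
    required = REQUIRED_FALLBACK_VARS.get(provider)
    if required is None:
        raise ValueError(f"fallback is not supported for LLM_PROVIDER={provider!r}")
    return [name for name in required if not env.get(name)]


example_env = {
    "FALLBACK_MODEL": "openrouter/openai/gpt-4o-mini",
    "FALLBACK_API_KEY": "or-...",
}
print(missing_fallback_vars("openai", example_env))  # []
print(missing_fallback_vars("custom", example_env))  # ['FALLBACK_ENDPOINT']
```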

Retry Behavior

Structured-output LLM calls (acreate_structured_output, used internally for entity extraction, summarization, and other graph-building steps) are wrapped in a shared retry policy that retries transient failures with exponential backoff.

How long a failing call persists

A call is allowed to give up only once both of these floors are met:

Floor	Value	Meaning
Minimum attempts	`2`	At least two attempts are made before failing.
Minimum elapsed time	`~240s`	At least ~240 seconds of wall-clock time must pass before failing.

Because both conditions must hold, a call against an unstable or rate-limited provider keeps retrying for at least about four minutes (~240 seconds) before it finally errors out. Backoff between attempts is exponential with jitter (starting around 8 seconds, capped near 128 seconds).

These floors are internal defaults shared across the OpenAI, Azure OpenAI, Anthropic, Gemini, Mistral, Ollama, Llama.cpp, Custom, and BAML structured-output paths. They are not environment-configurable.
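The two-floor stop rule and the backoff schedule can be sketched in Python as follows. This is an approximation for intuition only: the function names are hypothetical, and the exact jitter formula is an assumption (the documentation states only "exponential with jitter", starting around 8 seconds and capped near 128 seconds).

```python
import random

MIN_ATTEMPTS = 2           # floor from the table above
MIN_ELAPSED_SECONDS = 240  # approximate floor from the table above


def may_give_up(attempts, elapsed_seconds):
    """A failing call may stop only when BOTH floors are met."""
    return attempts >= MIN_ATTEMPTS and elapsed_seconds >= MIN_ELAPSED_SECONDS


def backoff_delay(attempt, base=8.0, cap=128.0, max_jitter=1.0, rng=random.random):
    """Delay before retry number `attempt` (1-based). Jitter shape is an assumption."""
    exponential = base * 2 ** (attempt - 1)  # 8, 16, 32, 64, 128, ...
    return min(cap, exponential + rng() * max_jitter)


print(may_give_up(attempts=2, elapsed_seconds=30))   # False: time floor not met
print(may_give_up(attempts=6, elapsed_seconds=250))  # True: both floors met
print([round(backoff_delay(n, rng=lambda: 0.0)) for n in range(1, 7)])
# [8, 16, 32, 64, 128, 128]
```

Under these assumed values the first four delays sum to 120 seconds and the fifth brings the total to roughly 248 seconds, so a persistently failing call would make several attempts before the elapsed-time floor is reached.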

Bedrock uses a separate retry path: its structured-output adapter relies on the Bedrock rate-limit/sleep retry wrapper and Instructor’s Bedrock retry setting instead of the shared ~240s retry floor.

Some errors are treated as non-transient and are not retried; they fail immediately: authentication errors, model-not-found errors, cancellations, payment/budget exhaustion, and quota/billing exhaustion. That includes the shared retry paths for OpenAI, Azure OpenAI, Anthropic, Gemini, Mistral, Ollama, Llama.cpp, Custom, and BAML, so interrupted jobs and worker shutdowns stop promptly rather than waiting out the backoff window. Bedrock uses a separate retry path, but cancellations still unwind immediately there as well.

Quota / billing exhaustion is terminal. When a provider reports that its quota or billing limit is exhausted, retrying cannot help, so the call fails fast instead of spinning through the retry window. The raw provider error is converted at the single acreate_structured_output choke point into an actionable LLMQuotaExceededError (provider- and framework-agnostic). The following provider wordings are classified as terminal:

Pattern	Provider
`insufficient_quota`	OpenAI / Azure OpenAI (billing quota exhausted)
`quota_exceeded`	Generic provider quota-exhaustion code
`billing hard limit`	OpenAI (monthly hard limit reached)
`credit balance is too low`	Anthropic (prepaid credits exhausted)
`out of credits`	Generic

Transient per-minute rate limits stay retryable. The patterns above are deliberately narrow: the bare phrase “exceeded your current quota” is intentionally not matched, because Gemini free tier uses it for recoverable RESOURCE_EXHAUSTED limits (OpenAI’s terminal case is still caught via insufficient_quota). Monitoring and alerting should treat LLMQuotaExceededError as a terminal condition and respond by checking the provider billing/quota dashboard, raising the limit, or switching credentials — not by retrying.
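The classification described above amounts to matching a narrow set of phrases against the provider's error text. The Python sketch below shows the idea; `is_terminal_quota_error` is a hypothetical function, and case-insensitive substring matching is an assumption about how the patterns are applied.

```python
# Patterns copied from the table above.
TERMINAL_QUOTA_PATTERNS = (
    "insufficient_quota",
    "quota_exceeded",
    "billing hard limit",
    "credit balance is too low",
    "out of credits",
)


def is_terminal_quota_error(message):
    """Hypothetical classifier: True when the error text matches a terminal pattern."""
    text = message.lower()  # assumption: matching is case-insensitive
    return any(pattern in text for pattern in TERMINAL_QUOTA_PATTERNS)


# Terminal: an error body carrying the insufficient_quota code.
print(is_terminal_quota_error("Error 429: insufficient_quota"))  # True

# Still retryable: the bare phrase is deliberately not matched.
print(is_terminal_quota_error("You exceeded your current quota"))  # False
```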

Operational note: when a provider is flaky, expect higher tail latency and additional API calls (and therefore cost) while retries play out. This persistent retry improves resilience for transient failures but does not mask genuine misconfiguration such as a bad API key.

Custom Endpoints & Corporate Proxies

LLM_ENDPOINT overrides the base URL Cognee uses to reach the LLM. Use it to point at an Azure deployment, a local server, an OpenAI-compatible proxy, or a company-internal gateway. For routing all outbound traffic through a corporate HTTP proxy without rewriting the endpoint, use the standard HTTPS_PROXY / HTTP_PROXY environment variables.

Per-provider LLM_ENDPOINT semantics

`LLM_PROVIDER`	How `LLM_ENDPOINT` is used	Required?
`openai`	Passed to LiteLLM as `api_base`. Omit to use OpenAI’s default (`https://api.openai.com/v1`). Set to point at a compatible gateway or proxy.	Optional
`azure`	Azure resource endpoint (e.g., `https://<resource>.openai.azure.com`). The deployment is selected by `LLM_MODEL`.	Required
`gemini`	Passed to LiteLLM as `api_base`. Omit to use the provider’s default.	Optional
`mistral`	`LLM_ENDPOINT` is currently not used for generation; Cognee uses the default Mistral provider endpoint.	Not applicable
`ollama`	OpenAI-compatible endpoint of your Ollama server (typically `http://localhost:11434/v1`).	Required
`custom`	Base URL of your OpenAI-compatible server (vLLM, OpenRouter, LM Studio, internal gateway, etc.).	Required
`llama_cpp`	Required only in server mode (URL of the `llama-cpp-python` server). Ignored in local in-process mode.	Server mode only
`anthropic`	Not read. Anthropic’s SDK has its own internal base URL. To route through a proxy, use `HTTPS_PROXY`.	Not applicable
`bedrock`	Not read. Use `AWS_BEDROCK_RUNTIME_ENDPOINT` to override the Bedrock endpoint.	Not applicable

Routing through a corporate HTTP/HTTPS proxy

Cognee’s LLM transport is built on the openai, anthropic, httpx, and litellm Python clients, all of which honor the standard proxy environment variables. Set them in your shell or .env before starting Cognee:

HTTPS_PROXY="http://proxy.corp.example.com:8080"
HTTP_PROXY="http://proxy.corp.example.com:8080"
# Optional: hosts that should bypass the proxy
NO_PROXY="localhost,127.0.0.1,.internal.example.com"

This is the right approach when the LLM provider’s public URL is correct but your network blocks direct egress. No Cognee config change is needed: outbound LLM, embedding, and HTTP loader calls all pick up these variables automatically.

Troubleshooting “not connected / cannot reach LLM”

LLM_ENDPOINT typos — values are stripped of surrounding quotes, but a missing scheme (http:// / https://) or trailing path segment will surface as a connection error. For OpenAI-compatible endpoints, the URL must end in /v1 (or whatever the server exposes).
Preflight timeout — Cognee runs a 30s connection test at startup. If your proxy adds latency or your local model is slow to warm up, set COGNEE_SKIP_CONNECTION_TEST=true to skip it.
Provider mismatch — if LLM_ENDPOINT points at a non-OpenAI server but LLM_PROVIDER="openai", Cognee will hit the wrong route. For OpenAI-compatible third parties, use LLM_PROVIDER="custom" with the correct LiteLLM model prefix (see Custom Providers above).
TLS interception — if your corporate proxy uses its own CA, set SSL_CERT_FILE or REQUESTS_CA_BUNDLE to the CA bundle path so Python’s HTTP clients trust the proxy certificate.
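The first troubleshooting item (endpoint typos) is easy to check mechanically before starting Cognee. The Python sketch below uses only the standard library; `endpoint_problems` is a hypothetical helper, and the `/v1` check applies only to OpenAI-compatible servers that expose that path.

```python
from urllib.parse import urlsplit


def endpoint_problems(url, openai_compatible=True):
    """Hypothetical pre-flight check for an LLM_ENDPOINT value."""
    problems = []
    cleaned = url.strip().strip("\"'")  # mimic the quote stripping described above
    parts = urlsplit(cleaned)
    if parts.scheme not in ("http", "https"):
        problems.append("missing or unsupported scheme (expected http:// or https://)")
    elif not parts.netloc:
        problems.append("missing host")
    if openai_compatible and not parts.path.rstrip("/").endswith("/v1"):
        problems.append("path does not end in /v1 (verify what your server exposes)")
    return problems


print(endpoint_problems("http://localhost:11434/v1"))  # []
print(endpoint_problems("localhost:11434/v1"))         # scheme problem reported
print(endpoint_problems("https://api.example.com"))    # /v1 problem reported
```

An empty list means the value is well-formed, not that the server is reachable; proxy, TLS, and provider-mismatch issues still need the checks listed above.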

Notes

If EMBEDDING_API_KEY is not set, Cognee falls back to LLM_API_KEY for embeddings
Rate limiting helps manage API usage and costs
Structured output frameworks ensure consistent data extraction from LLM responses

