Supported Providers
Cognee supports multiple LLM providers:
- OpenAI — GPT models via OpenAI API (default)
- Azure OpenAI — GPT models via Azure OpenAI Service
- Google Gemini — Gemini models via Google AI
- Anthropic — Claude models via Anthropic API
- AWS Bedrock — Models available via AWS Bedrock
- Groq — Fast inference via Groq API (via LiteLLM)
- Ollama — Local models via Ollama
- LM Studio — Local models via LM Studio
- HuggingFace — Models via HuggingFace Inference API or Inference Endpoints
- llama.cpp — Local models via llama-cpp-python (in-process or server mode)
- Custom — OpenAI-compatible endpoints (like vLLM, OpenRouter, DeepInfra, company-internal)
Choosing a Model
Cognee always uses two models together: an LLM for entity/relationship extraction and reasoning, and an embedding model for semantic search. An embedding model is mandatory: every cognify run writes vectors to a vector store, and recall depends on them. If you set only one, the other silently falls back to OpenAI (see the warning above).
Light vs. powerful LLM
Cognee’s default LLM is a light model (openai/gpt-5-mini), and the examples on this page use comparable light models such as gpt-4o-mini. Knowledge-graph extraction is many short, schema-constrained calls per document rather than a few long ones, so a light model keeps cost and latency low while handling most workloads well.
Reach for a more powerful model when:
- Your sources are dense or domain-specific (legal, medical, scientific) and you need higher-fidelity entities and relationships.
- You use a custom graph model or ontology with a complex schema the model must populate accurately.
- A light model produces noisy or incomplete graphs on your data.
Resource expectations for local models
Embedding models such as nomic-embed-text (Ollama) or all-MiniLM-L6-v2 (Fastembed, CPU-only) run comfortably on a CPU or a small GPU and rarely dominate resource use.
The LLM is the constraint for local setups. The Local Setup guide defaults to an 8B model (llama3.1:8b); as a rough guide, an 8B model quantized to 4-bit needs roughly 6 GB of free VRAM, while larger or less-quantized models need proportionally more. If a model does not fit, it spills to system RAM and CPU, which still works but is much slower. For llama.cpp you can tune LLAMA_CPP_N_GPU_LAYERS to offload only as many layers as fit. Limited VRAM does not change graph quality; it mainly affects how fast cognify runs, since extraction issues many sequential LLM calls per document. With low VRAM, prefer a smaller LLM and a lower EMBEDDING_BATCH_SIZE (see Embedding Providers) over a large model that does not fit.
Configuration
Environment Variables
Configure LLM settings in your .env file:
- LLM_PROVIDER — The provider to use (openai, gemini, anthropic, ollama, custom)
- LLM_MODEL — The specific model to use
- LLM_API_KEY — Your API key for the provider
- LLM_ENDPOINT — Custom endpoint URL (for Azure, Ollama, or custom providers)
- LLM_API_VERSION — API version (for Azure OpenAI)
- LLM_TEMPERATURE — Sampling temperature for generation (default: 0.0)
- LLM_MAX_COMPLETION_TOKENS — Maximum tokens per request (optional)
- LLM_INSTRUCTOR_MODE — Structured-output mode override for Instructor-backed LLM calls (optional)
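The variables above can be collected into a small settings object before Cognee starts, which makes typos and missing values easier to spot. This is a minimal sketch and not Cognee’s own loader: the LLMEnv dataclass and the load_llm_env helper are hypothetical names, and only the 0.0 temperature default is taken from the list above.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class LLMEnv:
    """Hypothetical container for the LLM_* variables listed above."""
    provider: Optional[str]
    model: Optional[str]
    api_key: Optional[str]
    endpoint: Optional[str]
    api_version: Optional[str]
    temperature: float
    max_completion_tokens: Optional[int]
    instructor_mode: Optional[str]


def load_llm_env(env: Mapping[str, str] = os.environ) -> LLMEnv:
    """Read the LLM_* variables from a mapping; unset optional values stay None."""
    max_tokens = env.get("LLM_MAX_COMPLETION_TOKENS")
    return LLMEnv(
        provider=env.get("LLM_PROVIDER"),
        model=env.get("LLM_MODEL"),
        api_key=env.get("LLM_API_KEY"),
        endpoint=env.get("LLM_ENDPOINT"),
        api_version=env.get("LLM_API_VERSION"),
        # 0.0 is the documented default sampling temperature.
        temperature=float(env.get("LLM_TEMPERATURE", "0.0")),
        max_completion_tokens=int(max_tokens) if max_tokens else None,
        instructor_mode=env.get("LLM_INSTRUCTOR_MODE"),
    )


if __name__ == "__main__":
    cfg = load_llm_env()
    # Print only non-secret fields.
    print(cfg.provider, cfg.model, cfg.temperature)
```

Passing a plain dict instead of os.environ keeps the helper easy to test without touching the real environment.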
To skip the startup connection test, add COGNEE_SKIP_CONNECTION_TEST=true to your .env.
Why do model names need a prefix like gemini/ or openrouter/?
Cognee routes all LLM requests through LiteLLM, which uses provider prefixes to identify the correct API endpoint. For example, Google lists its model as gemini-2.0-flash, but in Cognee you must write gemini/gemini-2.0-flash. This prefix tells LiteLLM to use the Gemini API. The same applies to custom providers: openrouter/, hosted_vllm/, lm_studio/, etc. See each provider section below for the correct format.
Provider Setup Guides
OpenAI (Default)
Azure OpenAI
Google Gemini
Anthropic
Groq
Groq is reached through LiteLLM using the groq/ model prefix.
Groq models (use with the groq/ prefix):
- groq/llama-3.3-70b-versatile
- groq/llama3-8b-8192
- groq/mixtral-8x7b-32768
- groq/gemma2-9b-it
The LLM_ENDPOINT variable is not required for Groq: LiteLLM resolves the Groq API endpoint automatically from the groq/ prefix.
AWS Bedrock
- Using an API key and region. Generate your key on AWS and put it in the LLM_API_KEY env variable.
- Using AWS credentials. You only need to specify AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY; LLM_API_KEY is not needed. If you are using temporary credentials (e.g. AWS_ACCESS_KEY_ID starting with ASIA...), you must also specify AWS_SESSION_TOKEN.
- Using AWS profiles. Create a file called something like /.aws/credentials, and store your credentials inside it.
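The three options above can be told apart from the environment alone. The sketch below is illustrative only: bedrock_auth_mode is a hypothetical helper, and the precedence it applies (API key first, then credentials, then profile) is an assumption, since the resolution order Cognee uses is not stated here.

```python
from typing import Mapping


def bedrock_auth_mode(env: Mapping[str, str]) -> str:
    """Return which of the three Bedrock options the environment matches.

    Assumed precedence: API key, then AWS credentials, then AWS profile.
    """
    if env.get("LLM_API_KEY"):
        return "api_key"
    key_id = env.get("AWS_ACCESS_KEY_ID")
    secret = env.get("AWS_SECRET_ACCESS_KEY")
    if key_id and secret:
        # Temporary credentials (key IDs starting with ASIA) need a session token.
        if key_id.startswith("ASIA") and not env.get("AWS_SESSION_TOKEN"):
            raise ValueError("temporary credentials require AWS_SESSION_TOKEN")
        return "credentials"
    # Nothing explicit is set, so the credentials file is expected to apply.
    return "profile"
```

Raising early on a missing AWS_SESSION_TOKEN surfaces the most common temporary-credential mistake before any request is sent.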
Ollama (Local)
LLM_API_KEY="ollama" is a placeholder required by the client library; Ollama itself does not validate it.
Installation: Install Ollama from ollama.ai and pull your desired model.
See the Local Setup guide for a .env example using Ollama or Fastembed for both LLM and embeddings.
Known Issues
- NoDataError with mixed providers: Using Ollama as the LLM and OpenAI as the embedding provider may fail with NoDataError. Workaround: configure both LLM and embeddings to the same local provider (see the local setup guide above).
- Audio transcription is not supported: AudioLoader relies on a Whisper-compatible transcription endpoint. Cognee’s Ollama adapter does not provide one, so audio ingestion will fail when LLM_PROVIDER="ollama".
Context Window (num_ctx) and Custom Modelfiles
If cognify() returns HTTP 500 errors while the same model answers fine when you run ollama run <model> in a terminal, the usual cause is context-window truncation, not a connection problem.
Ollama’s default context window can be much smaller than Cognee’s extraction window. Cognee sizes extraction chunks from LLM_MAX_COMPLETION_TOKENS (default 16384), up to roughly half that per chunk, so the entity-extraction prompts it sends can be far larger than a short terminal prompt. When a prompt exceeds num_ctx, Ollama may truncate it, the model can return malformed or empty structured output, and Instructor’s parse failure can surface as a 500.
LLM_MODEL is just an Ollama model tag, so the fix is to point it at a model whose num_ctx is large enough. Create a custom Modelfile that sets a larger num_ctx, then reference the resulting model tag in your .env.
A larger num_ctx uses more memory. If you can’t raise it, lower LLM_MAX_COMPLETION_TOKENS instead so Cognee builds smaller chunks that fit the model’s existing context window.
Connection Troubleshooting
If you see cannot connect to host or connection refused errors, the most common causes are an unreachable endpoint, the wrong protocol, or a Docker networking mismatch.
Default endpoint protocol
Ollama’s local server speaks plain HTTP, not HTTPS. Cognee does not add TLS by default; the protocol is determined entirely by the scheme in LLM_ENDPOINT and EMBEDDING_ENDPOINT. Use https:// only if you have placed Ollama behind a TLS-terminating reverse proxy (Caddy, nginx, Traefik, etc.). For a local Ollama setup, use:
| Variable | Value |
|---|---|
| LLM_ENDPOINT (Ollama) | http://localhost:11434/v1 |
| EMBEDDING_ENDPOINT (Ollama) | http://localhost:11434/api/embed |
If Ollama is behind TLS, write the https:// scheme explicitly; plain HTTP requests are not upgraded.
localhost vs host.docker.internal
Inside a Docker container, localhost refers to the container itself, not the host machine where Ollama is running. If Cognee runs in Docker and Ollama runs on the host, use host.docker.internal instead of localhost in both endpoints.
host.docker.internal is available on Docker Desktop (macOS/Windows) and on Linux when the host-gateway mapping is configured in docker-compose.yml. On Linux without that mapping, use --network host or the Docker bridge IP.
Other common causes
- Ollama not running: verify with curl http://localhost:11434/api/tags from the same machine and network namespace Cognee is running in.
- Wrong port: the default Ollama port is 11434. If you started Ollama with OLLAMA_HOST=0.0.0.0:<port>, match that port in LLM_ENDPOINT.
- Missing path suffix: the LLM endpoint must end in /v1 (OpenAI-compatible chat completions), and the embedding endpoint must end in /api/embed. Pointing either at the bare host (e.g. http://localhost:11434) will fail.
- Bind address: Ollama binds to 127.0.0.1 by default. To accept connections from other machines or Docker containers via a LAN IP, start it with OLLAMA_HOST=0.0.0.0:11434.
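Several of the causes above (missing scheme, missing path suffix, localhost inside Docker) can be caught by inspecting the endpoint strings before Cognee starts. The following is a rough sketch using only the standard library; check_ollama_endpoints and its in_docker flag are hypothetical and not part of Cognee.

```python
from typing import List
from urllib.parse import urlparse


def check_ollama_endpoints(llm_endpoint: str, embedding_endpoint: str,
                           in_docker: bool = False) -> List[str]:
    """Return human-readable problems found in the two Ollama endpoint URLs."""
    problems: List[str] = []
    for name, url, suffix in (
        ("LLM_ENDPOINT", llm_endpoint, "/v1"),
        ("EMBEDDING_ENDPOINT", embedding_endpoint, "/api/embed"),
    ):
        parsed = urlparse(url)
        if parsed.scheme not in ("http", "https"):
            problems.append(f"{name}: missing http:// or https:// scheme")
            continue
        # A bare host such as http://localhost:11434 has no usable path.
        if not parsed.path.rstrip("/").endswith(suffix):
            problems.append(f"{name}: path should end in {suffix}")
        # Inside a container, localhost is the container, not the host.
        if in_docker and parsed.hostname in ("localhost", "127.0.0.1"):
            problems.append(f"{name}: localhost points at the container itself")
    return problems
```

An empty list means none of these static checks failed; it does not prove that Ollama is running or reachable, which still needs the curl check described above.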
HuggingFace
- Serverless
- Dedicated Endpoint
Model names use the huggingface/ prefix (e.g., huggingface/mistralai/Mixtral-8x7B-Instruct-v0.1). Not all models on HuggingFace support the text generation inference API; check the model card for compatibility. The model is routed through LiteLLM.
LM Studio (Local)
LLM_INSTRUCTOR_MODE controls how Cognee asks the model for structured output. LM Studio models often work best with json_schema_mode. For more detail, see LLM Instructor Modes below and Structured Output Backends.
llama.cpp (Local)
- Local mode — Load a .gguf model directly in-process
- Server mode — Connect to a running llama-cpp-python server over HTTP
Local Mode (In-Process)
Set LLAMA_CPP_N_GPU_LAYERS=-1 to offload all layers to GPU, or set a positive integer to offload a specific number of layers. Leave it at 0 for CPU-only inference.
Because a llama_cpp.Llama instance is not thread-safe, Cognee serializes concurrent structured-output calls (such as the per-chunk extraction that cognify() fans out) on that single instance. In-process requests are therefore processed one at a time rather than in parallel; if you need parallel decoding, run a llama-cpp-python server and use Server Mode instead.
Server Mode (OpenAI-Compatible)
Server mode connects to a running llama-cpp-python server. Start the server separately.
Custom Providers
When using LLM_PROVIDER="custom", you must include the correct provider prefix in your model name. Cognee forwards requests to LiteLLM, which uses these prefixes to route requests correctly.
Common prefixes include:
- hosted_vllm/ — vLLM servers
- openrouter/ — OpenRouter
- lm_studio/ — LM Studio
- openai/ — OpenAI-compatible APIs
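Because a missing prefix is an easy mistake to make, it can help to normalize model names in one place. The helper below is a hypothetical convenience function, not something Cognee or LiteLLM provides; it only prepends a prefix when the name does not already start with it.

```python
def with_prefix(model: str, prefix: str) -> str:
    """Prepend a LiteLLM routing prefix unless the model name already has it."""
    if not prefix.endswith("/"):
        prefix += "/"
    return model if model.startswith(prefix) else prefix + model


if __name__ == "__main__":
    # Model names taken from the examples on this page.
    print(with_prefix("gemini-2.0-flash", "gemini"))
    print(with_prefix("openai/gpt-4o-mini", "openrouter"))
```

Note that a name such as openai/gpt-4o-mini still receives the openrouter/ prefix, because the OpenRouter form of that model keeps the upstream vendor segment.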
DeepSeek
The deepseek/ prefix tells LiteLLM to route to the DeepSeek API.
Popular DeepSeek models (use with the deepseek/ prefix):
- deepseek/deepseek-chat — DeepSeek-V3 (general chat and instruction following)
- deepseek/deepseek-reasoner — DeepSeek-R1 (chain-of-thought reasoning)
The default json_mode for custom providers works well. If you encounter issues with structured output, try setting LLM_INSTRUCTOR_MODE="tool_call".
Kimi (Moonshot AI)
The moonshot/ prefix tells LiteLLM to route to the Moonshot AI API.
Available Kimi models (use with the moonshot/ prefix):
- moonshot/moonshot-v1-8k — 8k context window
- moonshot/moonshot-v1-32k — 32k context window
- moonshot/moonshot-v1-128k — 128k context window (for long documents)
OpenRouter
Prefix model names with openrouter/.
Example models (use with the openrouter/ prefix):
- openrouter/deepseek/deepseek-r1 — DeepSeek R1 via OpenRouter
- openrouter/google/gemini-2.0-flash-lite-preview-02-05:free — Free Gemini tier
- openrouter/openai/gpt-4o-mini — GPT-4o Mini via OpenRouter
DeepInfra
The deepinfra/ prefix tells LiteLLM to route to DeepInfra.
Company-Internal / Self-Hosted Endpoints
Any OpenAI-compatible server can be reached through the custom provider. The model prefix (openai/, hosted_vllm/, etc.) determines which LiteLLM adapter handles the request. For most OpenAI-compatible servers, openai/ works best. Set LLM_API_KEY to whatever bearer token your server requires (use . if no auth is needed).
vLLM
Advanced Options
LLM Instructor Modes
The LLM_INSTRUCTOR_MODE environment variable controls which strategy is used to request structured output.
Each provider has a built-in default that matches its API capabilities. Override it only when the default doesn’t work for your specific model.
Available modes:
| Mode | Description | When to use |
|---|---|---|
| json_schema_mode | Passes the full JSON Schema of the expected output in the request and enforces strict schema compliance. | OpenAI models that support the response_format / structured-output feature (e.g. GPT-4o). Also works well with Bedrock and some local models. |
| json_mode | Instructs the model to return any valid JSON object. Instructor then validates and coerces it to the target schema. | Gemini, Ollama, Generic/Custom endpoints, and any model that supports response_format: json_object but not strict schema enforcement. |
| anthropic_tools | Uses Anthropic’s native tool-calling API to extract structured data. | Anthropic Claude models only. Leverages first-class tool-use support for reliable extraction. |
| mistral_tools | Uses Mistral’s native tool-calling API to extract structured data. | Mistral models only. Mirrors the OpenAI function-calling interface provided by Mistral. |
| tool_call | Uses the generic OpenAI-style function/tool-calling API to define the schema as a callable tool. | OpenAI-compatible APIs that support function calling but not strict JSON schema output. |
| md_json | Asks the model to return JSON wrapped in a Markdown code block. Instructor extracts the block and validates it. | Models that reliably format code blocks but may not support json_mode (e.g. some self-hosted models). |
| Provider (LLM_PROVIDER) | Default mode |
|---|---|
| openai (and Azure OpenAI) | json_schema_mode |
| anthropic | anthropic_tools |
| gemini | json_mode |
| bedrock | json_schema_mode |
| mistral | mistral_tools |
| ollama | json_mode |
| custom (generic OpenAI-compatible) | json_mode |
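The default-mode table can be expressed as a simple lookup with an optional override, which mirrors how LLM_INSTRUCTOR_MODE takes precedence when set. This is a sketch of the documented behavior rather than Cognee’s internal code: resolve_instructor_mode is a hypothetical name, and the azure key is an assumption based on the table grouping Azure OpenAI with openai.

```python
from typing import Optional

# Defaults copied from the table above; "azure" is assumed to share the
# openai default because the table groups the two together.
DEFAULT_INSTRUCTOR_MODES = {
    "openai": "json_schema_mode",
    "azure": "json_schema_mode",
    "anthropic": "anthropic_tools",
    "gemini": "json_mode",
    "bedrock": "json_schema_mode",
    "mistral": "mistral_tools",
    "ollama": "json_mode",
    "custom": "json_mode",
}


def resolve_instructor_mode(provider: str, override: Optional[str] = None) -> str:
    """Return the override if one is set, otherwise the provider default."""
    if override:
        return override
    return DEFAULT_INSTRUCTOR_MODES[provider.lower()]
```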
For example, LM Studio models often work best with json_schema_mode even though the custom provider defaults to json_mode.
Temperature
Set the sampling temperature with the LLM_TEMPERATURE environment variable.
| Variable | Default | Description |
|---|---|---|
| LLM_TEMPERATURE | 0.0 | Sampling temperature. 0.0 = deterministic / focused output. Higher values (e.g. 0.7–1.0) produce more varied, creative responses. |
0.0 is recommended for knowledge-graph extraction because it produces consistent, structured output. Raise the temperature only if you need more variety in generated text (e.g. conversational responses or creative summarisation).
Rate Limiting
| Variable | Default | Meaning |
|---|---|---|
| LLM_RATE_LIMIT_ENABLED | false | Off by default — opt-in |
| LLM_RATE_LIMIT_REQUESTS | 60 | Max requests per interval |
| LLM_RATE_LIMIT_INTERVAL | 60 | Interval in seconds |
- Client-side limiter: Cognee paces outbound LLM calls before they reach the provider
- Moving window: Spreads allowance across the time window for smoother throughput
- Per-process scope: In-memory limits don’t share across multiple processes/containers
- Auto-applied: Works with all providers (OpenAI, Gemini, Anthropic, Ollama, Custom)
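To make the client-side, moving-window idea above concrete, here is a small sliding-window limiter. It is an illustration of the general technique only and is not Cognee’s implementation; the MovingWindowLimiter class and its try_acquire method are hypothetical, and the defaults of 60 requests per 60 seconds simply echo the table above.

```python
import time
from collections import deque
from typing import Callable, Deque


class MovingWindowLimiter:
    """Allow at most max_requests calls in any trailing interval of seconds."""

    def __init__(self, max_requests: int = 60, interval: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_requests = max_requests
        self.interval = interval
        self.clock = clock
        self._stamps: Deque[float] = deque()  # timestamps of admitted requests

    def try_acquire(self) -> bool:
        """Record a request and return True if the window still has room."""
        now = self.clock()
        # Drop timestamps that have left the trailing window.
        while self._stamps and now - self._stamps[0] >= self.interval:
            self._stamps.popleft()
        if len(self._stamps) < self.max_requests:
            self._stamps.append(now)
            return True
        return False
```

The state lives in one Python object, which is why limits of this kind are per process: two containers each holding their own limiter will together send up to twice the configured rate.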
Set LLM_RATE_LIMIT_REQUESTS to your provider’s RPM (requests per minute) limit, and LLM_RATE_LIMIT_INTERVAL to 60. To leave headroom, use ~80–90% of the advertised limit. Check your provider’s dashboard for your current tier limits.
Each cognify() call issues multiple LLM requests (entity extraction, summarization, etc.) per document chunk, so plan for several requests per chunk, not one.
Example configurations for common provider tiers
These examples target chat/completions-style LLM endpoints, such as OpenAI models like gpt-4o-mini.
OpenAI - Tier 1
OpenAI - Tier 2
Anthropic - Tier 1
Google Gemini - Free Tier
Conservative Default
Fallback Provider
The fallback provider is used when the primary provider fails with one of these content-policy errors:
- ContentFilterFinishReasonError — the provider’s output filter blocked the response
- ContentPolicyViolationError — the request was rejected for policy reasons
- InstructorRetryException containing “content management policy”
The fallback chain is available only when LLM_PROVIDER is set to openai or custom. Other providers (Anthropic, Gemini, Mistral, Bedrock, Ollama) do not currently support the fallback chain.
Configuration
Set these three variables alongside your primary LLM configuration: FALLBACK_MODEL, FALLBACK_ENDPOINT, and FALLBACK_API_KEY.
For LLM_PROVIDER="custom", all three fallback variables (FALLBACK_MODEL, FALLBACK_ENDPOINT, FALLBACK_API_KEY) must be set. If any is missing, Cognee raises a ContentPolicyFilterError instead of falling back.
For LLM_PROVIDER="openai", only FALLBACK_MODEL and FALLBACK_API_KEY are required. FALLBACK_ENDPOINT is accepted but currently unused for the OpenAI adapter.
Variable reference
| Variable | Description |
|---|---|
| FALLBACK_MODEL | Model identifier for the fallback provider (use LiteLLM prefix format, e.g. openrouter/openai/gpt-4o-mini) |
| FALLBACK_ENDPOINT | Base URL for the fallback provider’s API (required for custom, optional for openai) |
| FALLBACK_API_KEY | API key for the fallback provider |
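Since an incomplete fallback configuration only shows up when a content-policy error actually occurs, it is worth checking the variables up front. The sketch below encodes the per-provider requirements described above; missing_fallback_vars is a hypothetical helper, and treating an unset LLM_PROVIDER as openai is an assumption based on OpenAI being the default provider.

```python
from typing import List, Mapping

# Required fallback variables per primary provider, as described above.
FALLBACK_REQUIRED = {
    "custom": ("FALLBACK_MODEL", "FALLBACK_ENDPOINT", "FALLBACK_API_KEY"),
    "openai": ("FALLBACK_MODEL", "FALLBACK_API_KEY"),
}


def missing_fallback_vars(env: Mapping[str, str]) -> List[str]:
    """List the fallback variables that are required but unset or empty."""
    provider = env.get("LLM_PROVIDER", "openai")  # assumed default
    if provider not in FALLBACK_REQUIRED:
        raise ValueError(f"fallback chain is not supported for {provider!r}")
    return [name for name in FALLBACK_REQUIRED[provider] if not env.get(name)]
```

An empty result means the fallback chain has everything it needs for that provider.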
Custom Endpoints & Corporate Proxies
LLM_ENDPOINT overrides the base URL Cognee uses to reach the LLM. Use it to point at an Azure deployment, a local server, an OpenAI-compatible proxy, or a company-internal gateway. To route all outbound traffic through a corporate HTTP proxy without rewriting the endpoint, use the standard HTTPS_PROXY / HTTP_PROXY environment variables.
Per-provider LLM_ENDPOINT semantics
| LLM_PROVIDER | How LLM_ENDPOINT is used | Required? |
|---|---|---|
| openai | Passed to LiteLLM as api_base. Omit to use OpenAI’s default (https://api.openai.com/v1). Set to point at a compatible gateway or proxy. | Optional |
| azure | Azure resource endpoint (e.g., https://<resource>.openai.azure.com). The deployment is selected by LLM_MODEL. | Required |
| gemini | Passed to LiteLLM as api_base. Omit to use the provider’s default. | Optional |
| mistral | LLM_ENDPOINT is currently not used for generation; Cognee uses the default Mistral provider endpoint. | Not applicable |
| ollama | OpenAI-compatible endpoint of your Ollama server (typically http://localhost:11434/v1). | Required |
| custom | Base URL of your OpenAI-compatible server (vLLM, OpenRouter, LM Studio, internal gateway, etc.). | Required |
| llama_cpp | Required only in server mode (URL of the llama-cpp-python server). Ignored in local in-process mode. | Server mode only |
| anthropic | Not read. Anthropic’s SDK has its own internal base URL. To route through a proxy, use HTTPS_PROXY. | Not applicable |
| bedrock | Not read. Use AWS_BEDROCK_RUNTIME_ENDPOINT to override the Bedrock endpoint. | Not applicable |
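The Required? column above lends itself to a quick configuration check. The following sketch encodes that column as data; endpoint_problem and the llama_cpp_server_mode flag are hypothetical names introduced for illustration and are not part of Cognee.

```python
from typing import Optional

# "Required?" column from the table above, keyed by LLM_PROVIDER.
ENDPOINT_RULES = {
    "openai": "optional",
    "azure": "required",
    "gemini": "optional",
    "mistral": "not_used",
    "ollama": "required",
    "custom": "required",
    "llama_cpp": "server_mode_only",
    "anthropic": "not_used",
    "bedrock": "not_used",
}


def endpoint_problem(provider: str, endpoint: Optional[str],
                     llama_cpp_server_mode: bool = False) -> Optional[str]:
    """Return a warning string if LLM_ENDPOINT does not fit the provider."""
    rule = ENDPOINT_RULES[provider]
    needed = rule == "required" or (
        rule == "server_mode_only" and llama_cpp_server_mode
    )
    if needed and not endpoint:
        return f"LLM_ENDPOINT is required for {provider}"
    if rule == "not_used" and endpoint:
        return f"LLM_ENDPOINT is set but not read for {provider}"
    return None
```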
Cognee’s outbound requests go through the openai, anthropic, httpx, and litellm Python clients, all of which honor the standard proxy environment variables. Set them in your shell or .env before starting Cognee.
Common issues:
- LLM_ENDPOINT typos — values are stripped of surrounding quotes, but a missing scheme (http:// or https://) or trailing path segment will surface as a connection error. For OpenAI-compatible endpoints, the URL must end in /v1 (or whatever the server exposes).
- Preflight timeout — Cognee runs a 30s connection test at startup. If your proxy adds latency or your local model is slow to warm up, set COGNEE_SKIP_CONNECTION_TEST=true to skip it.
- Provider mismatch — if LLM_ENDPOINT points at a non-OpenAI server but LLM_PROVIDER="openai", Cognee will hit the wrong route. For OpenAI-compatible third parties, use LLM_PROVIDER="custom" with the correct LiteLLM model prefix (see Custom Providers above).
- TLS interception — if your corporate proxy uses its own CA, set SSL_CERT_FILE or REQUESTS_CA_BUNDLE to the CA bundle path so Python’s HTTP clients trust the proxy certificate.
Notes
- If EMBEDDING_API_KEY is not set, Cognee falls back to LLM_API_KEY for embeddings
- Rate limiting helps manage API usage and costs
- Structured output frameworks ensure consistent data extraction from LLM responses
- Cancelled LLM operations stop promptly. For BAML structured output and the Anthropic, Azure OpenAI, Gemini, Mistral, and Ollama adapters, cancellation is propagated immediately so interrupted jobs and worker shutdowns do not wait for retry backoff.