
# LLM Providers

> Configure LLM providers for text generation and reasoning in Cognee

LLM (Large Language Model) providers handle text generation, reasoning, and structured output tasks in Cognee. You can choose from cloud providers like OpenAI and Anthropic, or run models locally with Ollama.

<Info>
  **New to configuration?**

  See the [Setup Configuration Overview](./overview) for the complete workflow:

  install extras → create `.env` → choose providers → handle pruning.
</Info>

## Supported Providers

Cognee supports multiple LLM providers:

* **OpenAI** — GPT models via OpenAI API (default)
* **Azure OpenAI** — GPT models via Azure OpenAI Service
* **Google Gemini** — Gemini models via Google AI
* **Anthropic** — Claude models via Anthropic API
* **AWS Bedrock** — Models available via AWS Bedrock
* **Groq** — Fast inference via the Groq API (routed through LiteLLM)
* **Ollama** — Local models via Ollama
* **LM Studio** — Local models via LM Studio
* **HuggingFace** — Models via HuggingFace Inference API or Inference Endpoints
* **llama.cpp** — Local models via llama-cpp-python (in-process or server mode)
* **Custom** — OpenAI-compatible endpoints (such as vLLM, OpenRouter, DeepInfra, or company-internal servers)

<Warning>
  **LLM/Embedding Configuration**: If you configure only LLM or only embeddings, the other defaults to OpenAI. Ensure you have a working OpenAI API key, or configure both LLM and embeddings to avoid unexpected defaults.
</Warning>

## Choosing a Model

Cognee always uses **two** models together: an **LLM** for entity/relationship extraction and reasoning, and an **embedding model** for semantic search. An embedding model is mandatory — every `cognify` run writes vectors to a [vector store](/setup-configuration/vector-stores), and [recall](/core-concepts/main-operations/recall) depends on them. If you only set one, the other silently falls back to OpenAI (see the warning above).

<AccordionGroup>
  <Accordion title="Light vs. powerful LLM">
    A small, fast model is the right default. Cognee ships with one (`openai/gpt-5-mini`) and the examples on this page use comparable light models such as `gpt-4o-mini`. Knowledge-graph extraction is many short, schema-constrained calls per document rather than a few long ones, so a light model keeps cost and latency low while handling most workloads well.

    Reach for a more powerful model when:

    * Your sources are dense or domain-specific (legal, medical, scientific) and you need higher-fidelity entities and relationships.
    * You use a [custom graph model](/guides/custom-graph-model) or [ontology](/guides/ontology-support) with a complex schema the model must populate accurately.
    * A light model produces noisy or incomplete graphs on your data.

    Extraction relies on [structured output](/setup-configuration/structured-output-backends), so very small or weak models may return malformed JSON or lower-quality graphs. If a small local model struggles, try a stronger one or adjust the [instructor mode](#llm-instructor-modes).
  </Accordion>

  <Accordion title="Resource expectations for local models">
    The embedding model is lightweight — defaults like `nomic-embed-text` (Ollama) or `all-MiniLM-L6-v2` ([Fastembed](/setup-configuration/embedding-providers#fastembed-local), CPU-only) run comfortably on a CPU or a small GPU and rarely dominate resource use.

    The **LLM** is the constraint for local setups. The [Local Setup guide](/guides/local-setup) defaults to an 8B model (`llama3.1:8b`); as a rough guide, an 8B model quantized to 4-bit needs roughly 6 GB of free VRAM, while larger or less-quantized models need proportionally more. If a model does not fit, it spills to system RAM and CPU, which still works but is much slower — for [llama.cpp](#llama-cpp-local) you can tune `LLAMA_CPP_N_GPU_LAYERS` to offload only as many layers as fit. Limited VRAM does not change graph quality; it mainly affects how fast `cognify` runs, since extraction issues many sequential LLM calls per document. With low VRAM, prefer a smaller LLM and lower `EMBEDDING_BATCH_SIZE` (see [Embedding Providers](/setup-configuration/embedding-providers#batch-size)) over a large model that does not fit.
  </Accordion>
</AccordionGroup>
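
The rough VRAM figure in "Resource expectations for local models" can be sanity-checked with simple arithmetic. The sketch below is illustrative only: `estimate_vram_gb` is a hypothetical helper, and the flat `overhead_gb` allowance for KV cache and runtime buffers is an assumption, not a value taken from Cognee or Ollama.

```python
def estimate_vram_gb(params_billion: float, bits_per_weight: int, overhead_gb: float = 2.0) -> float:
    """Rough estimate: weight memory plus a flat, assumed allowance for cache and buffers."""
    # 1e9 parameters * (bits / 8) bytes per parameter = decimal gigabytes of weights.
    weights_gb = params_billion * bits_per_weight / 8
    return weights_gb + overhead_gb


print(estimate_vram_gb(8, 4))   # 8B model at 4-bit: 6.0
print(estimate_vram_gb(8, 16))  # same model at 16-bit: 18.0
```

Actual usage depends on the runtime, the context length, and the quantization format, so treat the result as a starting point rather than a guarantee.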

## Configuration

<Accordion title="Environment Variables">
  Set these environment variables in your `.env` file:

  * `LLM_PROVIDER` — The provider to use (for example `openai`, `gemini`, `anthropic`, `bedrock`, `ollama`, `llama_cpp`, or `custom`)
  * `LLM_MODEL` — The specific model to use
  * `LLM_API_KEY` — Your API key for the provider
  * `LLM_ENDPOINT` — Custom endpoint URL (for Azure, Ollama, or custom providers)
  * `LLM_API_VERSION` — API version (for Azure OpenAI)
  * `LLM_TEMPERATURE` — Sampling temperature for generation (default: `0.0`)
  * `LLM_MAX_COMPLETION_TOKENS` — Maximum tokens per request (optional)
  * `LLM_INSTRUCTOR_MODE` — Structured-output mode override for Instructor-backed LLM calls (optional)
</Accordion>
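
When debugging a setup, it can help to print the values your process actually sees. The following is a minimal sketch, not Cognee's own loader: `read_llm_settings` is a hypothetical helper, and it assumes the variables are already present in the process environment (or passed in as a plain mapping). The defaults mirror the ones listed on this page.

```python
import os
from typing import Mapping, Optional


def read_llm_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Collect the LLM_* variables into a dict, applying the documented defaults."""
    env = os.environ if env is None else env
    max_tokens = env.get("LLM_MAX_COMPLETION_TOKENS")
    return {
        "provider": env.get("LLM_PROVIDER", "openai"),  # OpenAI is the default provider
        "model": env.get("LLM_MODEL"),
        "api_key": env.get("LLM_API_KEY"),
        "endpoint": env.get("LLM_ENDPOINT"),
        "api_version": env.get("LLM_API_VERSION"),
        "temperature": float(env.get("LLM_TEMPERATURE", "0.0")),  # documented default: 0.0
        "max_completion_tokens": int(max_tokens) if max_tokens else None,  # optional
        "instructor_mode": env.get("LLM_INSTRUCTOR_MODE"),  # optional override
    }


settings = read_llm_settings({"LLM_PROVIDER": "ollama", "LLM_MODEL": "llama3.1:8b"})
print(settings["provider"], settings["temperature"])  # ollama 0.0
```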

<Note>
  Cognee runs a preflight LLM connection test that can time out after 30 seconds, especially with smaller models. If this happens, add `COGNEE_SKIP_CONNECTION_TEST=true` to your `.env` to skip the test.
</Note>

<Info>
  **Why do model names have a prefix like `gemini/` or `openrouter/`?**

  Cognee routes all LLM requests through [LiteLLM](https://docs.litellm.ai/docs/providers), which uses provider prefixes to identify the correct API endpoint. For example, Google lists their model as `gemini-2.0-flash`, but in Cognee you must write `gemini/gemini-2.0-flash`. This prefix tells LiteLLM to use the Gemini API. The same applies to custom providers — `openrouter/`, `hosted_vllm/`, `lm_studio/`, etc. See each provider section below for the correct format.
</Info>
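
To make the naming convention concrete, the sketch below splits a model string at its first slash. It only illustrates the `prefix/model` format described above; `split_model_name` is a hypothetical helper and does not reproduce LiteLLM's actual routing logic.

```python
from typing import Optional, Tuple


def split_model_name(model: str) -> Tuple[Optional[str], str]:
    """Split 'prefix/rest' at the first slash; the prefix is None when there is no slash."""
    prefix, sep, rest = model.partition("/")
    return (prefix, rest) if sep else (None, model)


print(split_model_name("gemini/gemini-2.0-flash"))          # ('gemini', 'gemini-2.0-flash')
print(split_model_name("openrouter/deepseek/deepseek-r1"))  # ('openrouter', 'deepseek/deepseek-r1')
print(split_model_name("gemini-2.0-flash"))                 # (None, 'gemini-2.0-flash')
```

Only the first segment is the routing prefix. Anything after it, including further slashes, is the model identifier passed on to the provider.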

## Provider Setup Guides

<AccordionGroup>
  <Accordion title="OpenAI (Default)">
    OpenAI is the default provider and works out of the box with minimal configuration.

    ```dotenv theme={null}
    LLM_PROVIDER="openai"
    LLM_MODEL="gpt-4o-mini"
    LLM_API_KEY="sk-..."
    # Optional overrides
    # LLM_ENDPOINT=https://api.openai.com/v1
    # LLM_API_VERSION=
    # LLM_MAX_COMPLETION_TOKENS=16384
    ```
  </Accordion>

  <Accordion title="Azure OpenAI">
    Use Azure OpenAI Service with your own deployment.

    ```dotenv theme={null}
    LLM_PROVIDER="openai"
    LLM_MODEL="azure/gpt-4o-mini"
    LLM_ENDPOINT="https://<your-resource>.openai.azure.com/openai/deployments/gpt-4o-mini"
    LLM_API_KEY="az-..."
    LLM_API_VERSION="2024-12-01-preview"
    ```
  </Accordion>

  <Accordion title="Google Gemini">
    Use Google's Gemini models for text generation.

    ```dotenv theme={null}
    LLM_PROVIDER="gemini"
    LLM_MODEL="gemini/gemini-2.0-flash"
    LLM_API_KEY="AIza..."
    # Optional
    # LLM_ENDPOINT=https://generativelanguage.googleapis.com/
    # LLM_API_VERSION=v1beta
    ```
  </Accordion>

  <Accordion title="Anthropic">
    Use Anthropic's Claude models for reasoning tasks.

    ```dotenv theme={null}
    LLM_PROVIDER="anthropic"
    LLM_MODEL="claude-sonnet-4-5-20250929"
    LLM_API_KEY="sk-ant-..."
    ```
  </Accordion>

  <Accordion title="Groq">
    Groq provides fast inference for open models. Cognee routes Groq requests through [LiteLLM](https://docs.litellm.ai/docs/providers/groq) using the `groq/` model prefix.

    ```dotenv theme={null}
    LLM_PROVIDER="custom"
    LLM_MODEL="groq/llama-3.3-70b-versatile"
    LLM_API_KEY="gsk_..."
    ```

    **Installation**: Install the Groq dependency:

    ```bash theme={null}
    pip install "cognee[groq]"
    ```

    **Popular Groq models** (use with the `groq/` prefix):

    * `groq/llama-3.3-70b-versatile`
    * `groq/llama3-8b-8192`
    * `groq/mixtral-8x7b-32768`
    * `groq/gemma2-9b-it`

    See the [Groq model list](https://console.groq.com/docs/models) for all available models. Your Groq API key can be created in the [Groq Console](https://console.groq.com/keys).

    <Info>
      **No endpoint needed**: The `LLM_ENDPOINT` variable is not required for Groq — LiteLLM resolves the Groq API endpoint automatically from the `groq/` prefix.
    </Info>
  </Accordion>

  <Accordion title="AWS Bedrock">
    Use models available on AWS Bedrock. In addition to the `LLM_*` variables, Bedrock needs AWS-specific settings such as the region and credentials.

    ```dotenv theme={null}
    LLM_API_KEY="<your_bedrock_api_key>"
    LLM_MODEL="eu.amazon.nova-lite-v1:0"
    LLM_PROVIDER="bedrock"
    LLM_MAX_COMPLETION_TOKENS="16384"
    AWS_REGION="<your_aws_region>"
    AWS_ACCESS_KEY_ID="<your_aws_access_key_id>"
    AWS_SECRET_ACCESS_KEY="<your_aws_secret_access_key>"
    AWS_SESSION_TOKEN="<your_aws_session_token>"

    # Optional parameters
    #AWS_BEDROCK_RUNTIME_ENDPOINT="bedrock-runtime.eu-west-1.amazonaws.com"
    #AWS_PROFILE_NAME="<your_aws_profile_name>"
    ```

    There are **multiple ways of connecting** to Bedrock models. Cognee picks the first one it finds, in this order:

    1. Using an API key and region. Simply generate your key on AWS, and put it in the `LLM_API_KEY` env variable. If `LLM_API_KEY` is set, it takes precedence over the credential and profile methods below, so leave it empty when you want to use those.
    2. Using AWS credentials. Set `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` and leave `LLM_API_KEY` empty.
       If you are using temporary credentials (an `AWS_ACCESS_KEY_ID` starting with `ASIA...`, such as those issued by `aws sso login` or `aws sts assume-role`), you must also set `AWS_SESSION_TOKEN`. Temporary credentials expire, so refresh all three values when AWS rotates them.
    3. Using AWS profiles. `AWS_PROFILE_NAME` is the **name** of a profile (for example `default` or `my-sso-profile`), not a path to a file or to a folder. Cognee hands the name to boto3, which resolves the credentials through the standard AWS chain using the shared config and credentials files at `~/.aws/config` and `~/.aws/credentials` (override their locations with the `AWS_CONFIG_FILE` and `AWS_SHARED_CREDENTIALS_FILE` env variables). This is the recommended path for **AWS SSO**: run `aws sso login --profile <your_aws_profile_name>` first, then set `AWS_PROFILE_NAME` to that profile name and Cognee will use the temporary SSO credentials boto3 caches for it — no need to copy the `ASIA...` keys into your `.env`.
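
    The precedence described above can be sketched as a small function. This is an illustration of the documented order only, not Cognee's adapter code; `pick_bedrock_auth` and its return labels are hypothetical names.

    ```python
    from typing import Mapping


    def pick_bedrock_auth(env: Mapping[str, str]) -> str:
        """Return which connection method applies, checking in the documented order."""
        # 1. An API key wins over every other method.
        if env.get("LLM_API_KEY"):
            return "api_key"
        # 2. Static or temporary AWS credentials.
        access_key = env.get("AWS_ACCESS_KEY_ID")
        if access_key and env.get("AWS_SECRET_ACCESS_KEY"):
            # Temporary credentials (ASIA...) also need a session token.
            if access_key.startswith("ASIA") and not env.get("AWS_SESSION_TOKEN"):
                raise ValueError("temporary credentials require AWS_SESSION_TOKEN")
            return "credentials"
        # 3. A named profile resolved by boto3.
        if env.get("AWS_PROFILE_NAME"):
            return "profile"
        raise ValueError("no Bedrock credentials configured")


    print(pick_bedrock_auth({"LLM_API_KEY": "key", "AWS_PROFILE_NAME": "default"}))  # api_key
    print(pick_bedrock_auth({"AWS_PROFILE_NAME": "my-sso-profile"}))                 # profile
    ```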

    **Installation**: Install the required dependency:

    ```bash theme={null}
    pip install "cognee[aws]"
    ```

    <Info>
      **Model name**: The model name can differ by region. It begins with **eu** for Europe, **us** for the USA, and so on.
    </Info>
  </Accordion>

  <Accordion title="Ollama (Local)">
    Run models locally with Ollama for privacy and cost control.

    ```dotenv theme={null}
    LLM_PROVIDER="ollama"
    LLM_MODEL="llama3.1:8b"
    LLM_ENDPOINT="http://localhost:11434/v1"
    LLM_API_KEY="ollama"
    ```

    `LLM_API_KEY="ollama"` is a placeholder required by the client library — Ollama itself does not validate it.

    **Installation**: Install Ollama from [ollama.ai](https://ollama.ai) and pull your desired model:

    ```bash theme={null}
    ollama pull llama3.1:8b
    ```

    <Info>
      **Zero-API-key setup**: To avoid falling back to OpenAI for embeddings, you must also configure the embedding provider to use a local backend. See the [Local Setup guide](/guides/local-setup) for a complete `.env` example using Ollama or Fastembed for both LLM and embeddings.
    </Info>

    ### Known Issues

    * **`NoDataError` with mixed providers**: Using Ollama as LLM and OpenAI as embedding provider may fail with `NoDataError`. Workaround: configure both LLM and embeddings to the same local provider (see the local setup guide above).
    * **Audio transcription is not supported**: `AudioLoader` relies on a Whisper-compatible transcription endpoint. Cognee's Ollama adapter does not provide one, so audio ingestion will fail when `LLM_PROVIDER="ollama"`.

    ### Context Window (`num_ctx`) and Custom Modelfiles

    If `cognify()` returns HTTP **500 errors** while the same model answers fine when you run `ollama run <model>` in a terminal, the usual cause is **context-window truncation**, not a connection problem.

    Ollama's default context window can be much smaller than Cognee's extraction window. Cognee sizes extraction chunks from `LLM_MAX_COMPLETION_TOKENS` (default `16384`) — up to roughly half that per chunk — so the entity-extraction prompts it sends can be far larger than a short terminal prompt. When a prompt exceeds `num_ctx`, Ollama may truncate it, the model can return malformed or empty structured output, and Instructor's parse failure can surface as a 500.

    `LLM_MODEL` is just an Ollama model tag, so the fix is to point it at a model whose `num_ctx` is large enough. Create a custom [Modelfile](https://github.com/ollama/ollama/blob/main/docs/modelfile.md):

    ```dockerfile theme={null}
    FROM llama3.1:8b
    PARAMETER num_ctx 8192
    ```

    Build the tag and reference it in your `.env`:

    ```bash theme={null}
    ollama create llama3.1:8b-8k -f Modelfile
    ```

    ```dotenv theme={null}
    LLM_PROVIDER="ollama"
    LLM_MODEL="llama3.1:8b-8k"
    LLM_ENDPOINT="http://localhost:11434/v1"
    LLM_API_KEY="ollama"
    ```

    <Info>
      A larger `num_ctx` uses more memory. If you can't raise it, lower `LLM_MAX_COMPLETION_TOKENS` instead so Cognee builds smaller chunks that fit the model's existing context window.
    </Info>
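
    If you take the second route, the "roughly half per chunk" rule above suggests a starting value for `LLM_MAX_COMPLETION_TOKENS`. The helper below is a back-of-the-envelope sketch: `max_completion_tokens_for` is a hypothetical name, and `reserved_tokens` is an assumed allowance for instructions and schema text, not a Cognee setting.

    ```python
    def max_completion_tokens_for(num_ctx: int, reserved_tokens: int = 0) -> int:
        """Largest LLM_MAX_COMPLETION_TOKENS whose half-size chunks still fit in num_ctx."""
        usable = num_ctx - reserved_tokens
        if usable <= 0:
            raise ValueError("reserved_tokens must be smaller than num_ctx")
        # Chunks are up to roughly half of LLM_MAX_COMPLETION_TOKENS, so double the usable window.
        return 2 * usable


    print(max_completion_tokens_for(8192))       # 16384
    print(max_completion_tokens_for(4096))       # 8192
    print(max_completion_tokens_for(4096, 512))  # 7168
    ```

    Because the half-per-chunk rule is approximate, leave some margin and lower the value further if truncation persists.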

    ### Connection Troubleshooting

    If you see `cannot connect to host` or `connection refused` errors, the most common causes are an unreachable endpoint, the wrong protocol, or a Docker networking mismatch.

    **Default endpoint protocol**

    Ollama's local server speaks plain **HTTP**, not HTTPS. Cognee does not add TLS by default — the protocol is determined entirely by the scheme in `LLM_ENDPOINT` and `EMBEDDING_ENDPOINT`. Use `https://` only if you have placed Ollama behind a TLS-terminating reverse proxy (Caddy, nginx, Traefik, etc.). For a local Ollama setup, use:

    | Variable                      | Value                              |
    | ----------------------------- | ---------------------------------- |
    | `LLM_ENDPOINT` (Ollama)       | `http://localhost:11434/v1`        |
    | `EMBEDDING_ENDPOINT` (Ollama) | `http://localhost:11434/api/embed` |

    The Ollama embedding engine builds a secure SSL context for outgoing requests, but it is only applied when the endpoint URL uses `https://` — plain HTTP requests are not upgraded.

    **`localhost` vs `host.docker.internal`**

    Inside a Docker container, `localhost` refers to the container itself, not your host machine where Ollama is running. If Cognee runs in Docker and Ollama runs on the host, use `host.docker.internal` instead:

    ```dotenv theme={null}
    LLM_ENDPOINT="http://host.docker.internal:11434/v1"
    EMBEDDING_ENDPOINT="http://host.docker.internal:11434/api/embed"
    ```

    `host.docker.internal` is available on Docker Desktop (macOS/Windows) and on Linux when the `host-gateway` mapping is configured in `docker-compose.yml`. On Linux without that mapping, use `--network host` or the Docker bridge IP.

    **Other common causes**

    * **Ollama not running**: verify with `curl http://localhost:11434/api/tags` from the same machine and network namespace Cognee is running in.
    * **Wrong port**: the default Ollama port is `11434`. If you started Ollama with `OLLAMA_HOST=0.0.0.0:<port>`, match that port in `LLM_ENDPOINT`.
    * **Missing path suffix**: the LLM endpoint must end in `/v1` (OpenAI-compatible chat completions), and the embedding endpoint must end in `/api/embed`. Pointing either at the bare host (e.g. `http://localhost:11434`) will fail.
    * **Bind address**: Ollama binds to `127.0.0.1` by default. To accept connections from other machines or Docker containers via a LAN IP, start it with `OLLAMA_HOST=0.0.0.0:11434`.
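
    Several of the checks above can be automated before you start Cognee. The sketch below validates only the URL shape (scheme and path suffix) and does not contact Ollama; `check_ollama_endpoints` is a hypothetical helper, not part of Cognee.

    ```python
    from typing import List
    from urllib.parse import urlparse


    def check_ollama_endpoints(llm_endpoint: str, embedding_endpoint: str) -> List[str]:
        """Return a list of problems found in the two URLs (empty when both look right)."""
        problems = []
        for name, url, suffix in (
            ("LLM_ENDPOINT", llm_endpoint, "/v1"),
            ("EMBEDDING_ENDPOINT", embedding_endpoint, "/api/embed"),
        ):
            parsed = urlparse(url)
            if parsed.scheme not in ("http", "https"):
                problems.append(f"{name}: missing http:// or https:// scheme")
            # Ignore a trailing slash when comparing the path suffix.
            if not parsed.path.rstrip("/").endswith(suffix):
                problems.append(f"{name}: path should end in {suffix}")
        return problems


    print(check_ollama_endpoints("http://localhost:11434/v1", "http://localhost:11434/api/embed"))
    # []
    print(check_ollama_endpoints("http://localhost:11434", "http://localhost:11434/api/embed"))
    # ['LLM_ENDPOINT: path should end in /v1']
    ```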
  </Accordion>

  <Accordion title="HuggingFace">
    Use models from HuggingFace via the [HuggingFace Inference API](https://huggingface.co/docs/api-inference/index) (serverless) or dedicated [Inference Endpoints](https://huggingface.co/docs/inference-endpoints/index).

    <Tabs>
      <Tab title="Serverless">
        ```dotenv theme={null}
        LLM_PROVIDER="custom"
        LLM_MODEL="huggingface/mistralai/Mistral-7B-Instruct-v0.3"
        LLM_API_KEY="hf_..."
        ```
      </Tab>

      <Tab title="Dedicated Endpoint">
        ```dotenv theme={null}
        LLM_PROVIDER="custom"
        LLM_MODEL="huggingface/mistralai/Mistral-7B-Instruct-v0.3"
        LLM_ENDPOINT="https://<your-endpoint-id>.<region>.aws.endpoints.huggingface.cloud/v1/"
        LLM_API_KEY="hf_..."
        ```
      </Tab>
    </Tabs>

    **Installation**: Install the HuggingFace extra to enable the HuggingFace tokenizer used for chunking:

    ```bash theme={null}
    pip install "cognee[huggingface]"
    ```

    <Info>
      **Model names**: Use the full HuggingFace model repo ID after the `huggingface/` prefix (e.g., `huggingface/mistralai/Mixtral-8x7B-Instruct-v0.1`). Not all models on HuggingFace support the text generation inference API — check the model card for compatibility. The model is routed through [LiteLLM](https://docs.litellm.ai/docs/providers/huggingface).
    </Info>
  </Accordion>

  <Accordion title="LM Studio (Local)">
    Run models locally with LM Studio for privacy and cost control.

    ```dotenv theme={null}
    LLM_PROVIDER="custom"
    LLM_MODEL="lm_studio/magistral-small-2509"
    LLM_ENDPOINT="http://127.0.0.1:1234/v1"
    LLM_API_KEY="."
    LLM_INSTRUCTOR_MODE="json_schema_mode"
    ```

    **Installation**: Install LM Studio from [lmstudio.ai](https://lmstudio.ai/) and download a model from its interface. Load the model and start the LM Studio server so that Cognee can connect to it.

    <Info>
      **Set up instructor mode**: `LLM_INSTRUCTOR_MODE` controls how Cognee asks the model for structured output. LM Studio models often work best with `json_schema_mode`. For more detail, see [LLM Instructor Modes](#llm-instructor-modes) below and [Structured Output Backends](/setup-configuration/structured-output-backends).
    </Info>
  </Accordion>

  <Accordion title="llama.cpp (Local)">
    Run models locally using [llama-cpp-python](https://github.com/abetlen/llama-cpp-python) for full offline inference.

    Cognee supports two setup modes:

    * **Local mode** — Load a `.gguf` model directly in-process
    * **Server mode** — Connect to a running `llama-cpp-python` server over HTTP

    **Installation**: Install the required dependency:

    ```bash theme={null}
    pip install "cognee[llama-cpp]"
    ```

    <Info>
      **Choosing a mode**: Use local mode for the simplest setup with no separate server process. Use server mode if you want to share one model across multiple processes or run the model on another machine.
    </Info>

    <AccordionGroup>
      <Accordion title="Local Mode (In-Process)">
        Load a GGUF model file directly. No server setup required.

        ```dotenv theme={null}
        LLM_PROVIDER="llama_cpp"
        LLAMA_CPP_MODEL_PATH="/path/to/your/model.gguf"

        # Optional: context window size (default: 2048)
        LLAMA_CPP_N_CTX=4096

        # Optional: GPU layers to offload (default: 0 = CPU only, -1 = all layers on GPU)
        LLAMA_CPP_N_GPU_LAYERS=35

        # Optional: chat format (default: chatml)
        LLAMA_CPP_CHAT_FORMAT="chatml"
        ```

        <Info>
          **GPU acceleration**: Set `LLAMA_CPP_N_GPU_LAYERS=-1` to offload all layers to GPU, or set a positive integer to offload a specific number of layers. Leave it at `0` for CPU-only inference.
        </Info>

        <Info>
          **Concurrency**: In local in-process mode the model is loaded once and shared across calls. Because the underlying `llama_cpp.Llama` instance is not thread-safe, Cognee serializes concurrent structured-output calls (such as the per-chunk extraction that `cognify()` fans out) on that single instance. This means in-process requests are processed one at a time rather than in parallel; if you need parallel decoding, run a `llama-cpp-python` server and use **Server Mode** instead.
        </Info>
      </Accordion>

      <Accordion title="Server Mode (OpenAI-Compatible)">
        Connect to a running `llama-cpp-python` server. Start the server separately:

        ```bash theme={null}
        python -m llama_cpp.server --model /path/to/your/model.gguf --port 8000
        ```

        Then configure Cognee to connect to it:

        ```dotenv theme={null}
        LLM_PROVIDER="llama_cpp"
        LLM_ENDPOINT="http://localhost:8000/v1"
        LLM_API_KEY="."
        LLM_MODEL="your-model-name"
        ```
      </Accordion>
    </AccordionGroup>
  </Accordion>

  <Accordion title="Custom Providers">
    Use OpenAI-compatible endpoints like OpenRouter or other services.

    ```dotenv theme={null}
    LLM_PROVIDER="custom"
    LLM_MODEL="openrouter/google/gemini-2.0-flash-lite-preview-02-05:free"
    LLM_ENDPOINT="https://openrouter.ai/api/v1"
    LLM_API_KEY="sk-or-..."
    # Optional: fallback provider for content policy violations
    # FALLBACK_MODEL=openrouter/openai/gpt-4o-mini
    # FALLBACK_ENDPOINT=https://openrouter.ai/api/v1
    # FALLBACK_API_KEY=sk-or-...
    ```

    See [Fallback Provider](#fallback-provider) in Advanced Options for full details.

    **Custom Provider Prefixes**: When using `LLM_PROVIDER="custom"`, you must include the correct provider prefix in your model name. Cognee forwards requests to [LiteLLM](https://docs.litellm.ai/docs/providers), which uses these prefixes to route requests correctly.

    Common prefixes include:

    * `hosted_vllm/` — vLLM servers
    * `openrouter/` — OpenRouter
    * `lm_studio/` — LM Studio
    * `openai/` — OpenAI-compatible APIs

    See the [LiteLLM providers documentation](https://docs.litellm.ai/docs/providers) for the full list of supported prefixes.

    Below are examples for common providers and patterns:

    <Accordion title="DeepSeek">
      Use DeepSeek's models for reasoning and chat via their OpenAI-compatible API.

      ```dotenv theme={null}
      LLM_PROVIDER="custom"
      LLM_MODEL="deepseek/deepseek-chat"
      LLM_ENDPOINT="https://api.deepseek.com/v1"
      LLM_API_KEY="sk-..."
      ```

      Get your API key from [platform.deepseek.com](https://platform.deepseek.com/api_keys). The `deepseek/` prefix tells LiteLLM to route to the DeepSeek API.

      **Popular DeepSeek models** (use with the `deepseek/` prefix):

      * `deepseek/deepseek-chat` — DeepSeek-V3 (general chat and instruction following)
      * `deepseek/deepseek-reasoner` — DeepSeek-R1 (chain-of-thought reasoning)

      <Info>
        **Structured output**: DeepSeek's API is OpenAI-compatible, so the default `json_mode` for `custom` providers works well. If you encounter issues with structured output, try setting `LLM_INSTRUCTOR_MODE="tool_call"`.
      </Info>
    </Accordion>

    <Accordion title="Kimi (Moonshot AI)">
      Use Moonshot AI's Kimi models via their OpenAI-compatible API.

      ```dotenv theme={null}
      LLM_PROVIDER="custom"
      LLM_MODEL="moonshot/moonshot-v1-32k"
      LLM_ENDPOINT="https://api.moonshot.cn/v1"
      LLM_API_KEY="sk-..."
      ```

      Get your API key from [platform.moonshot.cn](https://platform.moonshot.cn/console/api-keys). The `moonshot/` prefix tells LiteLLM to route to the Moonshot AI API.

      **Available Kimi models** (use with the `moonshot/` prefix):

      * `moonshot/moonshot-v1-8k` — 8k context window
      * `moonshot/moonshot-v1-32k` — 32k context window
      * `moonshot/moonshot-v1-128k` — 128k context window (for long documents)
    </Accordion>

    <Accordion title="OpenRouter">
      Use [OpenRouter](https://openrouter.ai) to access hundreds of models from a single API endpoint.

      ```dotenv theme={null}
      LLM_PROVIDER="custom"
      LLM_MODEL="openrouter/deepseek/deepseek-r1"
      LLM_ENDPOINT="https://openrouter.ai/api/v1"
      LLM_API_KEY="sk-or-..."
      ```

      Get your API key from [openrouter.ai/keys](https://openrouter.ai/keys). Browse all available models at [openrouter.ai/models](https://openrouter.ai/models) — prefix the model slug with `openrouter/`.

      **Example models** (use with the `openrouter/` prefix):

      * `openrouter/deepseek/deepseek-r1` — DeepSeek R1 via OpenRouter
      * `openrouter/google/gemini-2.0-flash-lite-preview-02-05:free` — Free Gemini tier
      * `openrouter/openai/gpt-4o-mini` — GPT-4o Mini via OpenRouter
    </Accordion>

    <Accordion title="DeepInfra">
      Use DeepInfra to access open-source models via their OpenAI-compatible API.

      ```dotenv theme={null}
      LLM_PROVIDER="custom"
      LLM_MODEL="deepinfra/meta-llama/Meta-Llama-3-8B-Instruct"
      LLM_ENDPOINT="https://api.deepinfra.com/v1/openai"
      LLM_API_KEY="<your-deepinfra-api-key>"
      ```

      Find your model name in the [DeepInfra model catalog](https://deepinfra.com/models). The `deepinfra/` prefix tells LiteLLM to route to DeepInfra.
    </Accordion>

    <Accordion title="Company-Internal / Self-Hosted Endpoints">
      Any internal LLM server that exposes an OpenAI-compatible REST API (e.g., a corporate vLLM deployment, internal TGI server, or private OpenRouter proxy) can be used with the `custom` provider.

      ```dotenv theme={null}
      LLM_PROVIDER="custom"
      LLM_MODEL="openai/<your-internal-model-name>"
      LLM_ENDPOINT="https://llm.internal.example.com/v1"
      LLM_API_KEY="<internal-api-key-or-bearer-token>"
      ```

      The model prefix you use (`openai/`, `hosted_vllm/`, etc.) determines which LiteLLM adapter handles the request. For most OpenAI-compatible servers, `openai/` works best. Set `LLM_API_KEY` to whatever bearer token your server requires (use `.` if no auth is needed).
    </Accordion>

    <Accordion title="vLLM">
      Use vLLM for high-performance model serving with an OpenAI-compatible API.

      ```dotenv theme={null}
      LLM_PROVIDER="custom"
      LLM_MODEL="hosted_vllm/<your-model-name>"
      LLM_ENDPOINT="https://your-vllm-endpoint/v1"
      LLM_API_KEY="."
      ```

      **Example with Gemma:**

      ```dotenv theme={null}
      LLM_PROVIDER="custom"
      LLM_MODEL="hosted_vllm/gemma-3-12b"
      LLM_ENDPOINT="https://your-vllm-endpoint/v1"
      LLM_API_KEY="."
      ```

      <Warning>
        **Important**: The `hosted_vllm/` prefix is required for LiteLLM to correctly route requests to your vLLM server. The model name after the prefix should match the model ID returned by your vLLM server's `/v1/models` endpoint.
      </Warning>

      To find the correct model name format, see the [LiteLLM vLLM provider documentation](https://docs.litellm.ai/docs/providers/vllm).
    </Accordion>
  </Accordion>
</AccordionGroup>

## Advanced Options

<Accordion title="LLM Instructor Modes">
  When using the Instructor structured-output framework (the default), Cognee instructs the model to return structured data in a specific way. The `LLM_INSTRUCTOR_MODE` environment variable controls which strategy is used.

  Each provider has a built-in default that matches its API capabilities. Override it only when the default doesn't work for your specific model.

  **Available modes:**

  | Mode               | Description                                                                                                         | When to use                                                                                                                                     |
  | ------------------ | ------------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------- |
  | `json_schema_mode` | Passes the full JSON Schema of the expected output in the request and enforces strict schema compliance.            | OpenAI models that support the `response_format` / structured-output feature (e.g. GPT-4o). Also works well with Bedrock and some local models. |
  | `json_mode`        | Instructs the model to return any valid JSON object. Instructor then validates and coerces it to the target schema. | Gemini, Ollama, Generic/Custom endpoints, and any model that supports `response_format: json_object` but not strict schema enforcement.         |
  | `anthropic_tools`  | Uses Anthropic's native tool-calling API to extract structured data.                                                | Anthropic Claude models only. Leverages first-class tool-use support for reliable extraction.                                                   |
  | `mistral_tools`    | Uses Mistral's native tool-calling API to extract structured data.                                                  | Mistral models only. Mirrors the OpenAI function-calling interface provided by Mistral.                                                         |
  | `tool_call`        | Uses the generic OpenAI-style function/tool-calling API to define the schema as a callable tool.                    | OpenAI-compatible APIs that support function calling but not strict JSON schema output.                                                         |
  | `md_json`          | Asks the model to return JSON wrapped in a Markdown code block. Instructor extracts the block and validates it.     | Models that reliably format code blocks but may not support `json_mode` (e.g. some self-hosted models).                                         |

  **Per-provider defaults (from source code):**

  | Provider (`LLM_PROVIDER`)            | Default mode       |
  | ------------------------------------ | ------------------ |
  | `openai` (and Azure OpenAI)          | `json_schema_mode` |
  | `anthropic`                          | `anthropic_tools`  |
  | `gemini`                             | `json_mode`        |
  | `bedrock`                            | `json_schema_mode` |
  | `mistral`                            | `mistral_tools`    |
  | `ollama`                             | `json_mode`        |
  | `custom` (generic OpenAI-compatible) | `json_mode`        |

  **Example — override the mode:**

  ```dotenv theme={null}
  LLM_INSTRUCTOR_MODE="json_schema_mode"
  ```

  Override the default only when the model you are using requires a different mode. For example, LM Studio models typically need `json_schema_mode` even though the `custom` provider defaults to `json_mode`.
</Accordion>
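
The override behavior described in "LLM Instructor Modes" amounts to a lookup with an optional replacement. The sketch below copies the per-provider defaults from that section's table; `resolve_instructor_mode` is a hypothetical helper for illustration and is not how Cognee is necessarily implemented.

```python
from typing import Optional

# Per-provider defaults as listed in the "LLM Instructor Modes" table.
DEFAULT_INSTRUCTOR_MODES = {
    "openai": "json_schema_mode",
    "anthropic": "anthropic_tools",
    "gemini": "json_mode",
    "bedrock": "json_schema_mode",
    "mistral": "mistral_tools",
    "ollama": "json_mode",
    "custom": "json_mode",
}


def resolve_instructor_mode(provider: str, override: Optional[str] = None) -> str:
    """Return the override when set, otherwise the provider's documented default."""
    if override:
        return override
    # Raises KeyError for providers that the table does not list.
    return DEFAULT_INSTRUCTOR_MODES[provider]


print(resolve_instructor_mode("custom"))                      # json_mode
print(resolve_instructor_mode("custom", "json_schema_mode"))  # json_schema_mode
```

The second call corresponds to the LM Studio case, where `LLM_INSTRUCTOR_MODE="json_schema_mode"` replaces the `custom` default.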

<Accordion title="Temperature">
  Control the randomness of LLM responses with the `LLM_TEMPERATURE` environment variable.

  | Variable          | Default | Description                                                                                                                             |
  | ----------------- | ------- | --------------------------------------------------------------------------------------------------------------------------------------- |
  | `LLM_TEMPERATURE` | `0.0`   | Sampling temperature. `0.0` = deterministic / focused output. Higher values (e.g. `0.7`–`1.0`) produce more varied, creative responses. |

  **When to adjust**: Cognee's default of `0.0` is recommended for knowledge-graph extraction because it produces consistent, structured output. Raise the temperature only if you need more variety in generated text (e.g. conversational responses or creative summarisation).
</Accordion>

<Accordion title="Rate Limiting">
  Control client-side throttling for LLM calls to manage API usage and costs.

  <Warning>
    **Rate limiting is disabled by default.** You must explicitly set `LLM_RATE_LIMIT_ENABLED="true"` to activate it.
  </Warning>

  **Defaults** (the request and interval values take effect only when rate limiting is enabled):

  | Variable                  | Default | Meaning                   |
  | ------------------------- | ------- | ------------------------- |
  | `LLM_RATE_LIMIT_ENABLED`  | `false` | Off by default — opt-in   |
  | `LLM_RATE_LIMIT_REQUESTS` | `60`    | Max requests per interval |
  | `LLM_RATE_LIMIT_INTERVAL` | `60`    | Interval in seconds       |

  The defaults (60 requests / 60 seconds) allow 1 request/second on average. Adjust both values to match your provider's tier limit.

  **How it works:**

  * **Client-side limiter**: Cognee paces outbound LLM calls before they reach the provider
  * **Moving window**: Spreads allowance across the time window for smoother throughput
  * **Per-process scope**: In-memory limits don't share across multiple processes/containers
  * **Auto-applied**: Works with all providers (OpenAI, Gemini, Anthropic, Ollama, Custom)
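
To make the moving-window idea concrete, here is a minimal sketch of a client-side limiter that allows at most N requests in any trailing window. It is a generic sliding-log variant written for illustration; `MovingWindowLimiter` is a hypothetical name and Cognee's internal limiter may work differently.

```python
import time
from collections import deque


class MovingWindowLimiter:
    """Allow at most `max_requests` calls in any trailing `interval` seconds."""

    def __init__(self, max_requests, interval, clock=time.monotonic):
        self.max_requests = max_requests
        self.interval = interval
        self.clock = clock
        self.sent = deque()  # timestamps of requests still inside the window

    def try_acquire(self):
        """Record a request and return True if the window has room, else False."""
        now = self.clock()
        # Drop timestamps that have aged out of the window.
        while self.sent and now - self.sent[0] >= self.interval:
            self.sent.popleft()
        if len(self.sent) < self.max_requests:
            self.sent.append(now)
            return True
        return False


limiter = MovingWindowLimiter(max_requests=60, interval=60)
if limiter.try_acquire():
    print("request may be sent")
```

Because the timestamps live in process memory, two worker processes running this sketch would each get their own allowance, which is the per-process caveat noted in the list.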

  **Sizing guidance:**

  Set `LLM_RATE_LIMIT_REQUESTS` to your provider's RPM (requests per minute) limit, and `LLM_RATE_LIMIT_INTERVAL` to `60`. To leave headroom, use \~80–90% of the advertised limit. Check your provider's dashboard for your current tier limits.

  Each `cognify()` call issues multiple LLM requests (entity extraction, summarization, etc.) per document chunk — plan for several requests per chunk, not one.
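
The sizing arithmetic can be sketched as follows. The function names, the 500 RPM figure, and the three-requests-per-chunk estimate are all assumptions chosen for illustration; substitute the numbers from your own provider dashboard and workload.

```python
def size_rate_limit(advertised_rpm, headroom_pct=90):
    """Return an LLM_RATE_LIMIT_REQUESTS value that stays below the advertised RPM."""
    if not 0 < headroom_pct <= 100:
        raise ValueError("headroom_pct must be in (0, 100]")
    # Integer arithmetic avoids float rounding; the result is rounded down.
    return advertised_rpm * headroom_pct // 100


def chunks_per_interval(limit_requests, requests_per_chunk):
    """Estimate how many chunks fit into one interval at the configured limit."""
    return limit_requests // requests_per_chunk


limit = size_rate_limit(500)  # hypothetical advertised limit of 500 RPM
print(f'LLM_RATE_LIMIT_REQUESTS="{limit}"')
print('LLM_RATE_LIMIT_INTERVAL="60"')
# Assuming 3 LLM requests per chunk (an illustrative figure, not a measured one).
print(chunks_per_interval(limit, requests_per_chunk=3), "chunks per interval")
```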

  **Example configurations for common provider tiers**

  These examples target chat/completions-style LLM endpoints, such as OpenAI models like `gpt-4o-mini`.

  <AccordionGroup>
    <Accordion title="OpenAI - Tier 1">
      ```dotenv theme={null}
      LLM_RATE_LIMIT_ENABLED="true"
      LLM_RATE_LIMIT_REQUESTS="450"
      LLM_RATE_LIMIT_INTERVAL="60"
      ```
    </Accordion>

    <Accordion title="OpenAI - Tier 2">
      ```dotenv theme={null}
      LLM_RATE_LIMIT_ENABLED="true"
      LLM_RATE_LIMIT_REQUESTS="4500"
      LLM_RATE_LIMIT_INTERVAL="60"
      ```
    </Accordion>

    <Accordion title="Anthropic - Tier 1">
      ```dotenv theme={null}
      LLM_RATE_LIMIT_ENABLED="true"
      LLM_RATE_LIMIT_REQUESTS="45"
      LLM_RATE_LIMIT_INTERVAL="60"
      ```
    </Accordion>

    <Accordion title="Google Gemini - Free Tier">
      ```dotenv theme={null}
      LLM_RATE_LIMIT_ENABLED="true"
      LLM_RATE_LIMIT_REQUESTS="13"
      LLM_RATE_LIMIT_INTERVAL="60"
      ```
    </Accordion>

    <Accordion title="Conservative Default">
      ```dotenv theme={null}
      LLM_RATE_LIMIT_ENABLED="true"
      LLM_RATE_LIMIT_REQUESTS="60"
      LLM_RATE_LIMIT_INTERVAL="60"
      ```
    </Accordion>
  </AccordionGroup>

  <Info>
    Always verify your exact tier limits in your provider's dashboard — limits vary by model, tier, and region. The examples above are approximations for common tiers and may change.
  </Info>
</Accordion>

<Accordion title="Fallback Provider">
  Cognee supports a primary-plus-fallback model configuration: when the primary provider rejects a request for content-policy reasons, Cognee automatically retries it against a secondary provider. This is useful when your primary provider may reject certain content and you want those cases handled gracefully.

  **When the fallback triggers**

  The fallback is invoked only on **content policy violations** from the primary provider:

  * `ContentFilterFinishReasonError` — the provider's output filter blocked the response
  * `ContentPolicyViolationError` — the request was rejected for policy reasons
  * `InstructorRetryException` containing "content management policy"

  The fallback does **not** activate for network errors, rate limits, or authentication failures.
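
The trigger rule can be pictured as a small predicate over the raised exception. The sketch below matches on class names so that it runs without the provider SDKs installed; `should_use_fallback` is a hypothetical helper, and the exact string comparison is an assumption rather than Cognee's actual check.

```python
# Exception class names taken from the list of triggers.
POLICY_ERROR_NAMES = {
    "ContentFilterFinishReasonError",
    "ContentPolicyViolationError",
}


def should_use_fallback(exc):
    """Return True only for content-policy failures, never for other errors."""
    name = type(exc).__name__
    if name in POLICY_ERROR_NAMES:
        return True
    if name == "InstructorRetryException":
        # Retry exceptions count only when they mention the policy text.
        return "content management policy" in str(exc)
    return False


# Network, rate-limit, and authentication errors do not qualify.
print(should_use_fallback(TimeoutError("connection timed out")))
```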

  **Supported providers**

  Fallback is available when `LLM_PROVIDER` is set to `openai` or `custom`. Other providers (Anthropic, Gemini, Mistral, Bedrock, Ollama) do not currently support the fallback chain.

  **Configuration**

  Set these three variables alongside your primary LLM configuration:

  ```dotenv theme={null}
  # Primary provider
  LLM_PROVIDER="openai"
  LLM_MODEL="openai/gpt-4o-mini"
  LLM_API_KEY="sk-..."

  # Fallback provider (used only on content policy violations)
  FALLBACK_MODEL="openrouter/openai/gpt-4o-mini"
  FALLBACK_ENDPOINT="https://openrouter.ai/api/v1"
  FALLBACK_API_KEY="or-..."
  ```

  For `LLM_PROVIDER="custom"`, all three fallback variables (`FALLBACK_MODEL`, `FALLBACK_ENDPOINT`, `FALLBACK_API_KEY`) must be set. If any is missing, Cognee raises a `ContentPolicyFilterError` instead of falling back.

  For `LLM_PROVIDER="openai"`, only `FALLBACK_MODEL` and `FALLBACK_API_KEY` are required. If `FALLBACK_ENDPOINT` is set, it is forwarded to the OpenAI adapter and the fallback request is routed to that base URL; if omitted, the fallback request uses the default OpenAI endpoint.
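
A pre-flight check for these requirements might look like the sketch below. `REQUIRED_FALLBACK_VARS` and `missing_fallback_vars` are hypothetical names, and the check only mirrors the rules stated here; it does not call into Cognee.

```python
import os

# Which fallback variables each supported provider needs, per the rules described here.
REQUIRED_FALLBACK_VARS = {
    "openai": ("FALLBACK_MODEL", "FALLBACK_API_KEY"),
    "custom": ("FALLBACK_MODEL", "FALLBACK_ENDPOINT", "FALLBACK_API_KEY"),
}


def missing_fallback_vars(env=None):
    """Return the names of required fallback variables that are unset or empty."""
    env = os.environ if env is None else env
    provider = env.get("LLM_PROVIDER", "openai")  # OpenAI is the default provider
    required = REQUIRED_FALLBACK_VARS.get(provider)
    if required is None:
        raise ValueError(f"fallback is not supported for LLM_PROVIDER={provider}")
    return [name for name in required if not env.get(name)]


example_env = {
    "LLM_PROVIDER": "custom",
    "FALLBACK_MODEL": "openrouter/openai/gpt-4o-mini",
    "FALLBACK_API_KEY": "or-...",
}
print(missing_fallback_vars(example_env))
```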

  **Variable reference**

  | Variable            | Description                                                                                                  |
  | ------------------- | ------------------------------------------------------------------------------------------------------------ |
  | `FALLBACK_MODEL`    | Model identifier for the fallback provider (use LiteLLM prefix format, e.g. `openrouter/openai/gpt-4o-mini`) |
  | `FALLBACK_ENDPOINT` | Base URL for the fallback provider's API (required for `custom`, optional for `openai`)                      |
  | `FALLBACK_API_KEY`  | API key for the fallback provider                                                                            |
</Accordion>

<Accordion title="Custom Endpoints & Corporate Proxies">
  `LLM_ENDPOINT` overrides the base URL Cognee uses to reach the LLM. Use it to point at an Azure deployment, a local server, an OpenAI-compatible proxy, or a company-internal gateway. For routing all outbound traffic through a corporate HTTP proxy without rewriting the endpoint, use the standard `HTTPS_PROXY` / `HTTP_PROXY` environment variables.

  **Per-provider `LLM_ENDPOINT` semantics**

  | `LLM_PROVIDER` | How `LLM_ENDPOINT` is used                                                                                                                  | Required?        |
  | -------------- | ------------------------------------------------------------------------------------------------------------------------------------------- | ---------------- |
  | `openai`       | Passed to LiteLLM as `api_base`. Omit to use OpenAI's default (`https://api.openai.com/v1`). Set to point at a compatible gateway or proxy. | Optional         |
  | `azure`        | Azure resource endpoint (e.g., `https://<resource>.openai.azure.com`). The deployment is selected by `LLM_MODEL`.                           | Required         |
  | `gemini`       | Passed to LiteLLM as `api_base`. Omit to use the provider's default.                                                                        | Optional         |
  | `mistral`      | `LLM_ENDPOINT` is currently not used for generation; Cognee uses the default Mistral provider endpoint.                                     | Not applicable   |
  | `ollama`       | OpenAI-compatible endpoint of your Ollama server (typically `http://localhost:11434/v1`).                                                   | Required         |
  | `custom`       | Base URL of your OpenAI-compatible server (vLLM, OpenRouter, LM Studio, internal gateway, etc.).                                            | Required         |
  | `llama_cpp`    | Required only in server mode (URL of the `llama-cpp-python` server). Ignored in local in-process mode.                                      | Server mode only |
  | `anthropic`    | Not read. Anthropic's SDK has its own internal base URL. To route through a proxy, use `HTTPS_PROXY`.                                       | Not applicable   |
  | `bedrock`      | Not read. Use `AWS_BEDROCK_RUNTIME_ENDPOINT` to override the Bedrock endpoint.                                                              | Not applicable   |
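
The table can also be read as a lookup that flags inconsistent settings before startup. The sketch below is illustrative only: `ENDPOINT_REQUIREMENT` and `check_llm_endpoint` are hypothetical names, and the `llama_cpp` server-mode case is left unchecked because it depends on how the model is loaded.

```python
# Requirement levels transcribed from the table.
ENDPOINT_REQUIREMENT = {
    "openai": "optional",
    "azure": "required",
    "gemini": "optional",
    "mistral": "ignored",
    "ollama": "required",
    "custom": "required",
    "llama_cpp": "server-mode-only",
    "anthropic": "ignored",
    "bedrock": "ignored",
}


def check_llm_endpoint(provider, endpoint):
    """Return a warning string, or None when the combination looks consistent."""
    requirement = ENDPOINT_REQUIREMENT.get(provider)
    if requirement is None:
        return f"unknown provider: {provider}"
    if requirement == "required" and not endpoint:
        return f"LLM_ENDPOINT must be set for LLM_PROVIDER={provider}"
    if requirement == "ignored" and endpoint:
        return f"LLM_ENDPOINT is not used for LLM_PROVIDER={provider}"
    return None


print(check_llm_endpoint("ollama", ""))
print(check_llm_endpoint("anthropic", "https://gateway.example.com"))
```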

  **Routing through a corporate HTTP/HTTPS proxy**

  Cognee's LLM transport is built on the `openai`, `anthropic`, `httpx`, and `litellm` Python clients, all of which honor the standard proxy environment variables. Set them in your shell or `.env` before starting Cognee:

  ```dotenv theme={null}
  HTTPS_PROXY="http://proxy.corp.example.com:8080"
  HTTP_PROXY="http://proxy.corp.example.com:8080"
  # Optional: hosts that should bypass the proxy
  NO_PROXY="localhost,127.0.0.1,.internal.example.com"
  ```

  This is the right approach when the LLM provider's public URL is correct but your network blocks direct egress. No Cognee config change is needed — outbound LLM, embedding, and HTTP loader calls all pick up these variables automatically.
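
One quick way to confirm that the variables are visible to the Python process is the standard library's `urllib.request.getproxies()`, which reads the `*_PROXY` environment variables. The sketch only reports what the standard library detects; treating that as representative of every HTTP client is an assumption, and `visible_proxies` is a hypothetical helper name.

```python
import urllib.request


def visible_proxies():
    """Return the proxy settings that Python's standard library detects."""
    return urllib.request.getproxies()


for scheme, url in sorted(visible_proxies().items()):
    print(f"{scheme}: {url}")
```

On macOS and Windows, `getproxies()` may also report system-level proxy settings when no environment variables are set, so an entry here does not always come from `.env`.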

  **Troubleshooting "not connected / cannot reach LLM"**

  * **`LLM_ENDPOINT` typos** — values are stripped of surrounding quotes, but a missing scheme (`http://` / `https://`) or trailing path segment will surface as a connection error. For OpenAI-compatible endpoints, the URL must end in `/v1` (or whatever the server exposes).
  * **Preflight timeout** — Cognee runs a 30s connection test at startup. If your proxy adds latency or your local model is slow to warm up, set `COGNEE_SKIP_CONNECTION_TEST=true` to skip it.
  * **Provider mismatch** — if `LLM_ENDPOINT` points at a non-OpenAI server but `LLM_PROVIDER="openai"`, Cognee will hit the wrong route. For OpenAI-compatible third parties, use `LLM_PROVIDER="custom"` with the correct LiteLLM model prefix (see [Custom Providers](#custom-providers) above).
  * **TLS interception** — if your corporate proxy uses its own CA, set `SSL_CERT_FILE` or `REQUESTS_CA_BUNDLE` to the CA bundle path so Python's HTTP clients trust the proxy certificate.
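
The first check in the list can be automated with a small validator. This is a sketch under assumptions: `endpoint_problems` is a hypothetical helper, and the `/v1` suffix is only a default, since some servers expose a different path.

```python
from urllib.parse import urlparse


def endpoint_problems(endpoint, expected_suffix="/v1"):
    """Return a list of likely mistakes in an OpenAI-compatible LLM_ENDPOINT value."""
    problems = []
    url = endpoint.strip().strip('"').strip("'")
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        problems.append("missing http:// or https:// scheme")
    elif not parsed.netloc:
        problems.append("missing host")
    # Tolerate a trailing slash when comparing the path suffix.
    if not parsed.path.rstrip("/").endswith(expected_suffix):
        problems.append(f"path does not end in {expected_suffix}")
    return problems


print(endpoint_problems("localhost:11434/v1"))
print(endpoint_problems("http://localhost:11434/v1"))
```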
</Accordion>

## Notes

* If `EMBEDDING_API_KEY` is not set, Cognee falls back to `LLM_API_KEY` for embeddings
* Rate limiting is opt-in (`LLM_RATE_LIMIT_ENABLED="true"`) and helps manage API usage and costs
* Structured output frameworks ensure consistent data extraction from LLM responses
* Cancelled LLM operations stop promptly. For BAML structured output and the Anthropic, Azure OpenAI, Gemini, Mistral, and Ollama adapters, cancellation is propagated immediately so interrupted jobs and worker shutdowns do not wait for retry backoff.
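
The cancellation behavior in the last note can be illustrated with a generic `asyncio` retry loop. This is a sketch of the general pattern, not Cognee's adapter code; `call_with_retry` and its parameters are hypothetical.

```python
import asyncio


async def call_with_retry(make_call, attempts=3, backoff=0.5):
    """Retry a failing coroutine with exponential backoff, but never after cancellation."""
    for attempt in range(attempts):
        try:
            return await make_call()
        except asyncio.CancelledError:
            raise  # propagate immediately instead of retrying or sleeping
        except Exception:
            if attempt == attempts - 1:
                raise
            # asyncio.sleep is itself cancellable, so a cancel during backoff ends the wait.
            await asyncio.sleep(backoff * 2 ** attempt)


async def demo():
    async def always_fails():
        raise RuntimeError("provider error")

    task = asyncio.create_task(call_with_retry(always_fails, attempts=5, backoff=30))
    await asyncio.sleep(0.01)  # let the task enter its first backoff
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        return "cancelled"
    return "finished"


print(asyncio.run(demo()))
```

Even though the first backoff would last 30 seconds, the cancelled task ends as soon as `cancel()` is delivered, which is the behavior interrupted jobs and worker shutdowns rely on.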

<Columns cols={3}>
  <Card title="Embedding Providers" icon="layers" href="/setup-configuration/embedding-providers">
    Configure embedding providers for semantic search
  </Card>

  <Card title="Overview" icon="settings" href="/setup-configuration/overview">
    Return to setup configuration overview
  </Card>

  <Card title="Relational Databases" icon="database" href="/setup-configuration/relational-databases">
    Set up SQLite or Postgres for metadata storage
  </Card>
</Columns>
