# Embedding Providers

> Configure embedding providers for semantic search in Cognee

Embedding providers convert text into vector representations that enable semantic search. These vectors capture the meaning of text, allowing Cognee to find conceptually related content even when the wording is different.

**New to configuration?** See the [Setup Configuration Overview](./overview) for the complete workflow: install extras → create `.env` → choose providers → handle pruning.

## Supported Providers

Cognee supports multiple embedding providers:

* **OpenAI** — Text embedding models via OpenAI API (default)
* **Azure OpenAI** — Text embedding models via Azure OpenAI Service
* **Google Gemini** — Embedding models via Google AI
* **Mistral** — Embedding models via Mistral AI
* **AWS Bedrock** — Embedding models via AWS Bedrock
* **Ollama** — Local embedding models via Ollama
* **LM Studio** — Local embedding models via LM Studio
* **Fastembed** — CPU-friendly local embeddings
* **HuggingFace** — Embedding models via HuggingFace Inference API or Inference Endpoints
* **vLLM** — Self-hosted embedding models via vLLM
* **OpenAI-Compatible** — Direct OpenAI SDK for llama.cpp, vLLM, TEI, and any `/v1/embeddings` server (bypasses LiteLLM)
* **Custom** — OpenAI-compatible embedding endpoints routed through LiteLLM (DeepInfra, company-internal)

**LLM/Embedding Configuration**: If you configure only LLM or only embeddings, the other defaults to OpenAI. Ensure you have a working OpenAI API key, or configure both LLM and embeddings to avoid unexpected defaults.
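The defaulting rule above can be checked before startup. The following is a minimal sketch, not part of Cognee: `provider_pairing_warnings` is a hypothetical helper, and the `LLM_PROVIDER` variable name is an assumption, since this page only names `EMBEDDING_PROVIDER`.

```python
from typing import List, Mapping


def provider_pairing_warnings(env: Mapping[str, str]) -> List[str]:
    """Return warnings when only one of the two providers is configured.

    LLM_PROVIDER is an assumed variable name; adjust it to your setup.
    """
    llm = env.get("LLM_PROVIDER", "").strip()
    embedding = env.get("EMBEDDING_PROVIDER", "").strip()

    warnings = []
    if llm and not embedding:
        warnings.append("EMBEDDING_PROVIDER is unset; embeddings will default to OpenAI.")
    if embedding and not llm:
        warnings.append("LLM_PROVIDER is unset; the LLM will default to OpenAI.")
    return warnings


if __name__ == "__main__":
    import os

    for message in provider_pairing_warnings(os.environ):
        print("warning:", message)
```

Running the check against `os.environ` after loading your `.env` surfaces a half-configured setup before the first OpenAI call fails.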
## Configuration

Set these environment variables in your `.env` file:

* `EMBEDDING_PROVIDER` — The provider to use (openai, gemini, mistral, bedrock, ollama, fastembed, openai\_compatible, custom)
* `EMBEDDING_MODEL` — The specific embedding model to use
* `EMBEDDING_DIMENSIONS` — The vector dimension size (must match your vector store)
* `EMBEDDING_API_KEY` — Your API key (falls back to `LLM_API_KEY` if not set)
* `EMBEDDING_ENDPOINT` — Custom endpoint URL (for Azure, Ollama, or custom providers)
* `EMBEDDING_API_VERSION` — API version (for Azure OpenAI)
* `EMBEDDING_MAX_COMPLETION_TOKENS` — Maximum tokens per embedding request; used for tokenizer-based chunk sizing (optional, default `8191`)
* `HUGGINGFACE_TOKENIZER` — HuggingFace Hub model ID that overrides the tokenizer Cognee uses for token counting when the embedding model is not itself a HuggingFace repo. Commonly used with Ollama embeddings (for example, `nomic-ai/nomic-embed-text-v1.5`).

## Provider Setup Guides

OpenAI provides high-quality embeddings with good performance.

```dotenv theme={null}
EMBEDDING_PROVIDER="openai"
EMBEDDING_MODEL="openai/text-embedding-3-large"
EMBEDDING_DIMENSIONS="3072"
# Optional
# EMBEDDING_API_KEY=sk-... # falls back to LLM_API_KEY if omitted
# EMBEDDING_ENDPOINT=https://api.openai.com/v1
# EMBEDDING_API_VERSION=
# EMBEDDING_MAX_COMPLETION_TOKENS=8191
```

Use Azure OpenAI Service for embeddings with your own deployment.

```dotenv theme={null}
EMBEDDING_PROVIDER="openai"
EMBEDDING_MODEL="azure/text-embedding-3-large"
EMBEDDING_ENDPOINT="https://.cognitiveservices.azure.com/openai/deployments/text-embedding-3-large"
EMBEDDING_API_KEY="az-..."
EMBEDDING_API_VERSION="2023-05-15"
EMBEDDING_DIMENSIONS="3072"
```

If startup fails with `KeyError: 'Could not automatically map text-embedding-3-large to a tokeniser.'`, the installed `tiktoken` is too old to recognise the model.
Cognee strips the `azure/` prefix and asks TikToken for the encoding of `text-embedding-3-large`, which requires `tiktoken>=0.5.2` (Cognee pins `>=0.8.0`). Upgrade in your environment:

```bash theme={null}
pip install --upgrade tiktoken
```

Use Google's embedding models for semantic search.

```dotenv theme={null}
EMBEDDING_PROVIDER="gemini"
EMBEDDING_MODEL="gemini/gemini-embedding-001"
EMBEDDING_API_KEY="AIza..."
EMBEDDING_DIMENSIONS="768"
```

Use Mistral's embedding models for high-quality vector representations.

```dotenv theme={null}
EMBEDDING_PROVIDER="mistral"
EMBEDDING_MODEL="mistral/mistral-embed"
EMBEDDING_API_KEY="sk-mis-..."
EMBEDDING_DIMENSIONS="1024"
```

**Installation**: Install the required dependency (quoted so that shells such as zsh do not expand the brackets):

```bash theme={null}
pip install 'mistral-common[sentencepiece]'
```

Use embedding models provided by the AWS Bedrock service.

```dotenv theme={null}
EMBEDDING_PROVIDER="bedrock"
EMBEDDING_MODEL=""
EMBEDDING_DIMENSIONS=""
EMBEDDING_API_KEY=""
EMBEDDING_MAX_COMPLETION_TOKENS=""
```

Run embedding models locally with Ollama for privacy and cost control.

```dotenv theme={null}
EMBEDDING_PROVIDER="ollama"
EMBEDDING_MODEL="nomic-embed-text:latest"
EMBEDDING_ENDPOINT="http://localhost:11434/api/embed"
EMBEDDING_DIMENSIONS="768"
HUGGINGFACE_TOKENIZER="nomic-ai/nomic-embed-text-v1.5"
```

`HUGGINGFACE_TOKENIZER` is the HuggingFace repo ID of the tokenizer used for token-length counting when sending requests to the Ollama embedding endpoint.

**Installation**: Install Ollama from [ollama.ai](https://ollama.ai) and pull your desired embedding model:

```bash theme={null}
ollama pull nomic-embed-text:latest
```

`HUGGINGFACE_TOKENIZER` is **optional**. It is no longer part of Cognee's required startup validation, so setting `EMBEDDING_PROVIDER`, `EMBEDDING_MODEL`, and `EMBEDDING_DIMENSIONS` without it no longer raises a `ValidationError` on import.
It is still recommended for Ollama: set it to the HuggingFace repo ID for the tokenizer that matches your embedding model so token-length counts stay accurate. See the `HUGGINGFACE_TOKENIZER environment variable` section below for how to find the correct value for your model.

If `HUGGINGFACE_TOKENIZER` is unset or points at a repo that cannot be loaded, tokenizer resolution no longer raises — Cognee logs an advisory warning and falls back to the TikToken tokenizer, so ingestion continues with approximate token counts. Cognee also logs an advisory warning whenever the `HUGGINGFACE_TOKENIZER` value differs from the `EMBEDDING_MODEL` id, so a genuine mismatch is not silent. Because an Ollama tag such as `nomic-embed-text:latest` is never identical to its HuggingFace repo id (`nomic-ai/nomic-embed-text-v1.5`), you will see this advisory even for a correct setup — it is a reminder to confirm the two share a tokenizer, not an error.

If a text input exceeds the model's context window, the Ollama embedding engine automatically falls back by splitting the batch in half and retrying both halves. For a single overlong text, it splits the string into two overlapping segments and averages the resulting embeddings. Cognee no longer pre-truncates text before sending it to Ollama, so this fallback only activates when the server returns a context-length error.

**Zero-API-key setup**: To run fully offline with no OpenAI key, you must configure both the LLM provider **and** the embedding provider to use local backends. See the [Local Setup guide](/guides/local-setup) for a complete combined `.env` example.

**Ollama falling behind?** Ollama processes requests sequentially.
If it becomes unresponsive or returns errors under load, reduce `EMBEDDING_BATCH_SIZE` (default `36`) to send fewer chunks per call — values between `1` and `10` work well for most local hardware:

```dotenv theme={null}
EMBEDDING_BATCH_SIZE="5"
```

Run embedding models locally with LM Studio for privacy and cost control.

```dotenv theme={null}
EMBEDDING_PROVIDER="custom"
EMBEDDING_MODEL="lm_studio/text-embedding-nomic-embed-text-1.5"
EMBEDDING_ENDPOINT="http://127.0.0.1:1234/v1"
EMBEDDING_API_KEY="."
EMBEDDING_DIMENSIONS="768"
```

**Installation**: Install LM Studio from [lmstudio.ai](https://lmstudio.ai/) and download your desired model from LM Studio's interface. Load your model, start the LM Studio server, and Cognee will be able to connect to it.

Use Fastembed for CPU-friendly local embeddings without GPU requirements. Fastembed runs in-process via ONNX Runtime — no separate server, no API key, and no GPU needed.

```dotenv theme={null}
EMBEDDING_PROVIDER="fastembed"
EMBEDDING_MODEL="sentence-transformers/all-MiniLM-L6-v2"
EMBEDDING_DIMENSIONS="384"
```

**Installation**: Fastembed ships as an optional extra — it is **not** included in the base `cognee` install. Add it with:

```bash theme={null}
pip install 'cognee[fastembed]'
```

This pulls in `fastembed` and a compatible `onnxruntime` build. The first call downloads and caches the model weights from Hugging Face, so the initial run needs network access and a few hundred MB of disk for the model cache.

**Supported models**: Any model listed by [`fastembed`'s `TextEmbedding.list_supported_models()`](https://github.com/qdrant/fastembed).
Common choices:

| `EMBEDDING_MODEL`                        | `EMBEDDING_DIMENSIONS` |
| ---------------------------------------- | ---------------------- |
| `sentence-transformers/all-MiniLM-L6-v2` | `384`                  |
| `BAAI/bge-small-en-v1.5`                 | `384`                  |
| `BAAI/bge-base-en-v1.5`                  | `768`                  |
| `BAAI/bge-large-en-v1.5`                 | `1024`                 |
| `nomic-ai/nomic-embed-text-v1.5`         | `768`                  |
| `intfloat/multilingual-e5-large`         | `1024`                 |

If `EMBEDDING_DIMENSIONS` is omitted, Cognee tries to auto-derive it from the fastembed model registry; set it explicitly to avoid a fallback to `3072` if lookup fails.

**Context window handling**: When a text input exceeds the model's context window, Fastembed automatically splits the batch and retries. For a single overlong text, it splits the string into two overlapping segments and averages the resulting embeddings. If a single string is already too short to split further yet still exceeds the context window, Fastembed raises the terminal `EmbeddingContextWindowTooSmallError` instead of retrying (see [Timeout and Retry Behavior](#timeout-and-retry-behavior)). This mirrors the behavior of the OpenAI-compatible engine.

**Token counting**: Fastembed now counts tokens with the embedding model's own HuggingFace tokenizer (BGE, MiniLM, and E5 are wordpiece models), resolved from the fastembed model list. Earlier releases counted every Fastembed model with the OpenAI `gpt-4o` BPE tokenizer, which mis-sized chunks and skewed the `--dry-run` token estimate. You do not need to set `HUGGINGFACE_TOKENIZER` for Fastembed. If a model is not in the known list, Cognee logs an advisory warning and falls back to the TikToken tokenizer.

Use embedding models from HuggingFace via the [HuggingFace Inference API](https://huggingface.co/docs/api-inference/index) (serverless) or dedicated [Inference Endpoints](https://huggingface.co/docs/inference-endpoints/index).

```dotenv theme={null}
EMBEDDING_PROVIDER="custom"
EMBEDDING_MODEL="huggingface/BAAI/bge-large-en-v1.5"
EMBEDDING_API_KEY="hf_..."
EMBEDDING_DIMENSIONS="1024"
```

```dotenv theme={null}
EMBEDDING_PROVIDER="custom"
EMBEDDING_MODEL="huggingface/BAAI/bge-large-en-v1.5"
EMBEDDING_ENDPOINT="https://..aws.endpoints.huggingface.cloud"
EMBEDDING_API_KEY="hf_..."
EMBEDDING_DIMENSIONS="1024"
```

**Installation**: Install the HuggingFace extra for tokenizer support:

```bash theme={null}
pip install 'cognee[huggingface]'
```

**HUGGINGFACE\_TOKENIZER with HuggingFace embeddings**: When using `EMBEDDING_PROVIDER="custom"` with a `huggingface/` model, Cognee automatically attempts to load a HuggingFace tokenizer from the model repo for token counting. If that fails, it falls back to the TikToken tokenizer. You do not need to set `HUGGINGFACE_TOKENIZER` manually for this provider. The variable is only used with `EMBEDDING_PROVIDER="ollama"`, where it is optional but recommended (see the Ollama section above).

Use vLLM to serve local or self-hosted embedding models with an OpenAI-compatible API.

**Example with Qwen3-Embedding-4B on port 8001:**

```dotenv theme={null}
EMBEDDING_PROVIDER="custom"
EMBEDDING_MODEL="hosted_vllm/Qwen/Qwen3-Embedding-4B"
EMBEDDING_ENDPOINT="http://localhost:8001/v1"
EMBEDDING_API_KEY="."
EMBEDDING_DIMENSIONS="2560"
```

**`hosted_vllm/` prefix required**: Include `hosted_vllm/` at the start of the model name so LiteLLM routes requests to your vLLM server. The model name after the prefix should match the model ID returned by your vLLM server's `/v1/models` endpoint.

**Tokenization**: Cognee automatically strips the `hosted_vllm/` prefix when loading the HuggingFace tokenizer, so no separate `HUGGINGFACE_TOKENIZER` setting is needed as long as the model name after the prefix is a valid HuggingFace model ID.

To verify the model name your vLLM server exposes, run:

```bash theme={null}
curl http://localhost:8001/v1/models
```

See the [LiteLLM vLLM documentation](https://docs.litellm.ai/docs/providers/vllm) for more details.
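The naming rules for vLLM can be sketched as three small helpers. This is an illustrative sketch with hypothetical function names, not Cognee code. It assumes the `/v1/models` response uses the OpenAI-style list shape (`{"data": [{"id": ...}]}`); check the output of the `curl` command against your own server.

```python
from typing import Dict, List

PREFIX = "hosted_vllm/"


def served_model_ids(models_response: Dict) -> List[str]:
    """Extract model IDs from an OpenAI-style /v1/models payload (assumed shape)."""
    return [entry["id"] for entry in models_response.get("data", [])]


def embedding_model_setting(model_id: str) -> str:
    """Build the EMBEDDING_MODEL value for the LiteLLM-routed `custom` provider."""
    return model_id if model_id.startswith(PREFIX) else PREFIX + model_id


def tokenizer_repo_id(embedding_model: str) -> str:
    """Strip the routing prefix to get the HuggingFace repo ID used for the tokenizer."""
    if embedding_model.startswith(PREFIX):
        return embedding_model[len(PREFIX):]
    return embedding_model


# Example payload, written by hand to mirror the Qwen3 example above.
sample = {"object": "list", "data": [{"id": "Qwen/Qwen3-Embedding-4B", "object": "model"}]}
model_id = served_model_ids(sample)[0]
print(embedding_model_setting(model_id))
print(tokenizer_repo_id(embedding_model_setting(model_id)))
```

If the stripped name is not a valid HuggingFace repo ID (for example, a local alias), the automatic tokenizer lookup described above cannot succeed.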
Use `EMBEDDING_PROVIDER="openai_compatible"` for any local inference server that exposes the standard `/v1/embeddings` endpoint. This provider talks directly to OpenAI-compatible embedding servers via the OpenAI Python SDK, bypassing LiteLLM.

Use this provider for: llama.cpp (`llama-server --embedding`), vLLM, Hugging Face TEI, LocalAI, Infinity, and similar servers.

Start llama.cpp with embedding support:

```bash theme={null}
llama-server --model your-model.gguf --embedding --port 8080
```

```dotenv theme={null}
EMBEDDING_PROVIDER="openai_compatible"
EMBEDDING_MODEL="default"
EMBEDDING_ENDPOINT="http://localhost:8080/v1"
EMBEDDING_API_KEY="no-key-required"
EMBEDDING_DIMENSIONS="768"
```

```dotenv theme={null}
EMBEDDING_PROVIDER="openai_compatible"
EMBEDDING_MODEL="BAAI/bge-large-en-v1.5"
EMBEDDING_ENDPOINT="http://localhost:8001/v1"
EMBEDDING_API_KEY="."
EMBEDDING_DIMENSIONS="1024"
```

Unlike `EMBEDDING_PROVIDER="custom"` (LiteLLM), you do **not** need a `hosted_vllm/` prefix in the model name — use the model ID directly as reported by your vLLM server's `/v1/models` endpoint.

```dotenv theme={null}
EMBEDDING_PROVIDER="openai_compatible"
EMBEDDING_MODEL="BAAI/bge-large-en-v1.5"
EMBEDDING_ENDPOINT="http://localhost:8080/v1"
EMBEDDING_API_KEY="."
EMBEDDING_DIMENSIONS="1024"
```

**Endpoint normalisation**: The engine automatically appends `/v1` to `EMBEDDING_ENDPOINT` if it is missing, and strips a trailing `/embeddings` suffix. You can pass either `http://localhost:8080` or `http://localhost:8080/v1` — both work.

**Tokenizer and token limits**: The `openai_compatible` engine automatically loads a tokenizer for chunk sizing. It first tries to load a HuggingFace tokenizer matching `EMBEDDING_MODEL`; if that fails (for example, because the model name is a local alias not on the HuggingFace Hub), it falls back to the TikToken tokenizer. The token limit passed to the tokenizer is controlled by `EMBEDDING_MAX_COMPLETION_TOKENS` (default `8191`).
Set this to match your server's context window if it differs from the default:

```dotenv theme={null}
EMBEDDING_MAX_COMPLETION_TOKENS="4096"
```

**HUGGINGFACE\_TOKENIZER is not needed for this provider.** The engine automatically tries to load a HuggingFace tokenizer using the model name (e.g. `BAAI/bge-large-en-v1.5`) for token counting. If that fails — for example when using `EMBEDDING_MODEL="default"` with llama.cpp — Cognee logs an advisory warning and falls back to TikToken, so token counts become approximate but ingestion continues.

Use OpenAI-compatible embedding endpoints from other providers such as DeepInfra, OpenRouter, or a company-internal server. These are routed through LiteLLM and require a provider prefix in the model name.

**Required variables**: `EMBEDDING_PROVIDER="custom"`, `EMBEDDING_MODEL` (with a LiteLLM provider prefix such as `openrouter/`, `deepinfra/`, or `openai/`), `EMBEDDING_API_KEY`, and `EMBEDDING_DIMENSIONS` (set it explicitly — `custom` models are not in the auto-derive registry, so it otherwise falls back to `3072` and causes a vector-store shape mismatch). `EMBEDDING_ENDPOINT` is **optional**: omit it for a named LiteLLM prefix like `openrouter/` (LiteLLM supplies the base URL), and set it only when pointing at a specific `api_base` such as DeepInfra or a self-hosted server.

```dotenv theme={null}
EMBEDDING_PROVIDER="custom"
EMBEDDING_MODEL="deepinfra/BAAI/bge-base-en-v1.5"
EMBEDDING_ENDPOINT="https://api.deepinfra.com/v1/openai"
EMBEDDING_API_KEY=""
EMBEDDING_DIMENSIONS="768"
```

Use the `openrouter/` model prefix. Do not set `EMBEDDING_ENDPOINT`.

```dotenv theme={null}
EMBEDDING_PROVIDER="custom"
EMBEDDING_MODEL="openrouter/openai/text-embedding-3-small"
EMBEDDING_API_KEY="sk-or-..."
EMBEDDING_DIMENSIONS="1536"
```

**Automatic `encoding_format="float"` for OpenRouter**: Cognee detects OpenRouter routes — a model id beginning with `openrouter/`, an explicit `openrouter` provider, or an `openrouter.ai` endpoint host (all matched case-insensitively) — and sets `encoding_format="float"` on the embedding request. Older LiteLLM releases serialize an omitted `encoding_format` as JSON `null`, which OpenRouter rejects with a `400 invalid_value` error (it accepts only `"float"`/`"base64"`); forcing `"float"` avoids that. This guard is scoped to OpenRouter only, so it does not affect other `custom` providers. On current LiteLLM versions it is a no-op for `openrouter/`-prefixed models, but endpoint-based configs (an unprefixed model pointed at `openrouter.ai`) still rely on it, because LiteLLM's OpenAI handler injects the `null` even on current versions.

```dotenv theme={null}
EMBEDDING_PROVIDER="custom"
EMBEDDING_MODEL="openai/"
EMBEDDING_ENDPOINT="https://embeddings.internal.example.com/v1"
EMBEDDING_API_KEY=""
EMBEDDING_DIMENSIONS=""
```

**No endpoint normalisation for `custom`**: Unlike [`openai_compatible`](#openai-compatible-local-servers-llama-cpp-tei-vllm), the `custom` provider passes `EMBEDDING_ENDPOINT` directly to LiteLLM as `api_base` with no automatic `/v1` appending or `/embeddings` stripping. Set the endpoint to exactly the base URL your provider expects (e.g., `https://api.deepinfra.com/v1/openai`), or omit it entirely when using a named LiteLLM prefix such as `openrouter/`.

## Additional Information

Cognee does not enforce a fixed allow-list of embedding models. Supported models depend on the provider configured in `EMBEDDING_PROVIDER`; Cognee forwards the embedding request to that provider and stores the returned vectors.

* **`fastembed`**: any model returned by [`TextEmbedding.list_supported_models()`](https://github.com/qdrant/fastembed), such as `sentence-transformers/all-MiniLM-L6-v2`.
See the [Fastembed section](#fastembed-local) for the common list.
* **`ollama` / LM Studio**: any embedding model loaded locally, such as `bge-m3:latest` or `all-minilm:latest`.
* **`openai_compatible`**: any model exposed by a local `/v1/embeddings` server, including llama.cpp, TEI, vLLM, LocalAI, and Infinity.
* **`openai` / `gemini` / `mistral` / `bedrock` / `custom`**: any model available through the configured provider API, using the corresponding LiteLLM prefix such as `openai/` or `gemini/`.

Common model examples and dimensions are shown below. Set `EMBEDDING_DIMENSIONS` to match the model output size:

| Model               | `EMBEDDING_DIMENSIONS` | How to run it                                                                         |
| ------------------- | ---------------------- | ------------------------------------------------------------------------------------- |
| `all-MiniLM-L6-v2`  | `384`                  | `fastembed` (`sentence-transformers/all-MiniLM-L6-v2`) or Ollama (`all-minilm`)       |
| `all-mpnet-base-v2` | `768`                  | HuggingFace, `openai_compatible`, or vLLM (`sentence-transformers/all-mpnet-base-v2`) |
| `bge-m3`            | `1024`                 | Ollama (`bge-m3`), HuggingFace, or `custom` (`BAAI/bge-m3`)                           |

**Dimensions must match the model's output size.** Cognee auto-derives `EMBEDDING_DIMENSIONS` for `fastembed` models and for models LiteLLM knows via each model's `output_vector_size`. For other models, including local aliases and `custom` or `openai_compatible` endpoints, set `EMBEDDING_DIMENSIONS` explicitly. If Cognee cannot infer the size, it falls back to `3072`, which will cause a vector-store mismatch when the stored dimensions differ from the embedding output. See [Important Notes](#important-notes) for the dimension-consistency rule.

`EMBEDDING_BATCH_SIZE` controls how many text chunks are grouped into a single embedding API call. Cognee splits all chunks into batches of this size and sends them concurrently to the embedding engine.
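The batching behaviour described above can be illustrated with a short sketch. It is not Cognee's implementation: `split_into_batches`, `embed_all`, and the fake embedder are hypothetical stand-ins that show how a batch size turns a list of chunks into concurrent calls.

```python
import asyncio
from typing import Awaitable, Callable, List, Sequence

Vector = List[float]


def split_into_batches(chunks: Sequence[str], batch_size: int) -> List[List[str]]:
    """Group chunks into consecutive batches of at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [list(chunks[i:i + batch_size]) for i in range(0, len(chunks), batch_size)]


async def embed_all(
    chunks: Sequence[str],
    batch_size: int,
    embed_batch: Callable[[List[str]], Awaitable[List[Vector]]],
) -> List[Vector]:
    """Send every batch concurrently and flatten the results in input order."""
    batches = split_into_batches(chunks, batch_size)
    results = await asyncio.gather(*(embed_batch(batch) for batch in batches))
    return [vector for batch_vectors in results for vector in batch_vectors]


async def fake_embed_batch(batch: List[str]) -> List[Vector]:
    """Stand-in for a real embedding call: one 1-dimensional vector per text."""
    return [[float(len(text))] for text in batch]


chunks = [f"chunk {i}" for i in range(360)]
print(len(split_into_batches(chunks, 36)))  # number of embedding calls at the default size
print(len(split_into_batches(chunks, 5)))   # number of calls with a smaller batch
vectors = asyncio.run(embed_all(chunks, 36, fake_embed_batch))
print(len(vectors))
```

A smaller batch size means more, lighter calls, which is the trade-off to weigh for local servers versus cloud APIs.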
| Variable               | Default | Description                   |
| ---------------------- | ------- | ----------------------------- |
| `EMBEDDING_BATCH_SIZE` | `36`    | Chunks per embedding API call |

**Local inference (Ollama, llama.cpp, LM Studio)**: Local servers typically handle one request at a time or with very limited concurrency. The default `36` can overwhelm them. Reduce the batch size if you see errors or slowdowns:

```dotenv theme={null}
EMBEDDING_BATCH_SIZE="5"
```

**Cloud providers**: Larger batches reduce the number of API calls and are efficient with cloud APIs. The default `36` suits most cloud providers.

**Relationship to rate limiting**: Each batch counts as one request toward `EMBEDDING_RATE_LIMIT_REQUESTS`. A single file may produce many chunks — with `EMBEDDING_BATCH_SIZE=36`, a document split into 360 chunks generates 10 requests.

Despite the name, embeddings have no "completion" — `EMBEDDING_MAX_COMPLETION_TOKENS` is the per-text token budget Cognee's tokenizer uses to size chunks for the embedding model. It should reflect your embedding model's context window.

| Variable                          | Default | Description                                                          |
| --------------------------------- | ------- | -------------------------------------------------------------------- |
| `EMBEDDING_MAX_COMPLETION_TOKENS` | `8191`  | Maximum tokens per embedded text; an input to automatic chunk sizing |

**Observable impact:**

* **Chunk size, cost and latency.** Chunks are sized as `min(EMBEDDING_MAX_COMPLETION_TOKENS, LLM_MAX_COMPLETION_TOKENS // 2)`. A lower value forces smaller chunks, so a document produces more chunks → more embedding requests (and more LLM extraction calls) → higher latency and cost. A higher value (within the model's real limit) produces fewer, larger chunks. See [Chunkers](/core-concepts/further-concepts/chunkers) for how chunk size shapes the graph.
* **Over-length requests.** Setting this above the embedding model's real context window can make individual texts exceed the limit.
Cognee no longer fails outright — it splits the batch (or mean-pools an over-length string, see [Timeout and Retry Behavior](#timeout-and-retry-behavior)) — but that recovery adds latency, so it is not free. "Higher" is only better up to the model's actual window.

**Tuning guidance:** match this to your embedding model's context window. The default `8191` fits OpenAI's `text-embedding-3-*` models. For models with a smaller window, lower it so chunks fit without triggering the split-and-pool fallback; the related `LLM_MAX_COMPLETION_TOKENS` ([LLM Providers](/setup-configuration/llm-providers#max-completion-tokens)) caps the other half of the formula.

The `LiteLLMEmbeddingEngine` applies two layers of protection against slow or unreachable endpoints:

| Limit               | Value       | Configurable   |
| ------------------- | ----------- | -------------- |
| Per-attempt timeout | 30 seconds  | No (hardcoded) |
| Total retry window  | 128 seconds | No (hardcoded) |

**How retries work**: Failed attempts are retried with exponential back-off starting at 2 seconds, with random jitter, until the 128-second window is exhausted.

**What is not retried**:

* `404 Not Found` errors are raised immediately — they indicate a configuration problem (wrong model name or endpoint) rather than a transient failure.
* `asyncio.CancelledError` is treated as terminal and re-raised without retrying, so cancelled tasks unwind promptly instead of consuming the full retry window. The same exclusion applies to the `FastembedEmbeddingEngine`, `OllamaEmbeddingEngine`, and `OpenAICompatibleEmbeddingEngine` retry decorators.
* `EmbeddingContextWindowTooSmallError` is treated as terminal and raised immediately. It is thrown when a single embedding text still exceeds the model's context window but can no longer be split (the string is too short to divide further), so retrying would deterministically fail again.
This exclusion applies to the `LiteLLMEmbeddingEngine`, `FastembedEmbeddingEngine`, and `OpenAICompatibleEmbeddingEngine` retry decorators; the failure returns at once instead of consuming the full 128-second retry window. The exception subclasses `EmbeddingException` (default message `Text is too short to split further but exceeds context window.`), so code that already catches `EmbeddingException` continues to catch it — catch `EmbeddingContextWindowTooSmallError` specifically to distinguish this deterministic, non-retryable case and shorten or pre-split the offending input.

**Over-length input recovery**: When the provider rejects an embedding request because the input exceeds the model's context window, `LiteLLMEmbeddingEngine` no longer fails the request. This covers both LiteLLM's `ContextWindowExceededError` and a plain `400 BadRequestError` whose message matches `maximum input length` (for example OpenAI's `maximum input length is 8192 tokens`, which is returned as a plain 400 by the embeddings API). On either of these:

* If the batch contains more than one text, it is split in half and each half is embedded in parallel, then the results are concatenated.
* If a single over-length string is left, it is split into two overlapping segments (the first two-thirds and the last two-thirds of the string), each segment is embedded, and the two vectors are averaged (mean-pooled) into one.

This split-and-pool step recurses until each piece fits, so long documents that previously failed are now embedded automatically. If a single string is already too short to split further (fewer than three characters) yet still exceeds the context window, the engine raises the terminal `EmbeddingContextWindowTooSmallError` instead of retrying — this deterministic failure returns at once rather than consuming the full retry window (see [What is not retried](#timeout-and-retry-behavior)).
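The split-and-pool recovery described above can be sketched as a small recursive function. This is an illustration under stated assumptions, not the engine's code: the two exception classes are hypothetical stand-ins for the provider's over-length error and `EmbeddingContextWindowTooSmallError`, the fake embedder measures length in characters rather than tokens, and the halves are processed sequentially here rather than in parallel.

```python
from typing import Callable, List

Vector = List[float]


class ContextWindowExceeded(Exception):
    """Hypothetical stand-in for the provider's over-length error."""


class ContextWindowTooSmall(Exception):
    """Hypothetical stand-in for EmbeddingContextWindowTooSmallError."""


def embed_with_recovery(texts: List[str], embed: Callable[[List[str]], List[Vector]]) -> List[Vector]:
    """Embed texts, splitting batches and mean-pooling over-length strings on failure."""
    try:
        return embed(texts)
    except ContextWindowExceeded:
        if len(texts) > 1:
            # Split the batch in half and concatenate results in the original order.
            mid = len(texts) // 2
            return embed_with_recovery(texts[:mid], embed) + embed_with_recovery(texts[mid:], embed)

        text = texts[0]
        if len(text) < 3:
            raise ContextWindowTooSmall(
                "Text is too short to split further but exceeds context window."
            )

        # Two overlapping segments: roughly the first and the last two-thirds.
        third = len(text) // 3
        first = embed_with_recovery([text[: 2 * third]], embed)[0]
        last = embed_with_recovery([text[third:]], embed)[0]
        return [[(a + b) / 2 for a, b in zip(first, last)]]


def make_fake_embedder(max_chars: int) -> Callable[[List[str]], List[Vector]]:
    """Build a toy embedder that rejects any text longer than `max_chars`."""
    def embed(texts: List[str]) -> List[Vector]:
        if any(len(t) > max_chars for t in texts):
            raise ContextWindowExceeded()
        return [[float(len(t)), 1.0] for t in texts]
    return embed


vectors = embed_with_recovery(["short", "x" * 25, "mid"], make_fake_embedder(10))
print(len(vectors), len(vectors[1]))
```

Both segments are strictly shorter than the input, so the recursion always terminates, either with a pooled vector or with the terminal error.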
Any other `400 BadRequestError` (one whose message does **not** indicate an over-length input) is re-raised unchanged so genuinely malformed requests still fail fast.

**Common error messages and causes**:

* `EmbeddingException: Embedding request timed out. Check EMBEDDING_ENDPOINT connectivity.` — The endpoint did not respond within 30 seconds. Verify that `EMBEDDING_ENDPOINT` is reachable from your network.
* `EmbeddingException: Cannot connect to embedding endpoint. Check EMBEDDING_ENDPOINT.` — TCP connection was refused or the server closed the connection before responding. Confirm the server is running and the URL is correct.
* `EmbeddingException: Failed to index data points using model ` — The provider returned a `404 Not Found`. Common causes: wrong model name, missing `hosted_vllm/` prefix for vLLM, or an unsupported model at that endpoint. Non-over-length `400 BadRequestError` responses are re-raised unchanged.
* `400 invalid_value` on `encoding_format` from OpenRouter — Older LiteLLM releases serialize an omitted `encoding_format` as JSON `null`, which OpenRouter rejects. Cognee now forces `encoding_format="float"` on detected OpenRouter routes (see the Custom Providers accordion under [Provider Setup Guides](#provider-setup-guides)), so this should no longer occur; if you still see it, upgrade Cognee to a version that includes this guard.

**Diagnosing slow local servers**: If you see repeated timeouts with Ollama, LM Studio, or vLLM, the 30-second per-attempt limit may be tight for large batches. Reduce `EMBEDDING_BATCH_SIZE` to send fewer texts per request:

```dotenv theme={null}
EMBEDDING_BATCH_SIZE="5"
```

The timeout and retry values are hardcoded in `LiteLLMEmbeddingEngine` and cannot be changed via environment variables. To use different limits, subclass `LiteLLMEmbeddingEngine` and override `embed_text` with a custom `@retry` decorator.
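To get a feel for the retry policy, the sketch below models exponential back-off with jitter inside a fixed window, plus a set of terminal exceptions that are never retried. It is a standard-library illustration only and does not reproduce Cognee's decorator: the helper names, the jitter range, and the choice to ignore the time spent inside each attempt are all assumptions.

```python
import random
import time
from typing import Callable, Iterator, Optional, Tuple, Type


def backoff_delays(
    initial: float = 2.0,
    window: float = 128.0,
    max_jitter: float = 1.0,
    rng: Optional[random.Random] = None,
) -> Iterator[float]:
    """Yield doubling delays (plus jitter) until their sum would exceed `window`."""
    rng = rng or random.Random()
    elapsed = 0.0
    base = initial
    while True:
        delay = base + rng.uniform(0.0, max_jitter)
        if elapsed + delay > window:
            return
        elapsed += delay
        yield delay
        base *= 2


def call_with_retries(
    fn: Callable[[], object],
    terminal: Tuple[Type[BaseException], ...] = (),
    sleep: Callable[[float], None] = time.sleep,
    **backoff_kwargs: float,
) -> object:
    """Call `fn`, retrying ordinary failures and re-raising terminal ones at once."""
    delays = backoff_delays(**backoff_kwargs)
    while True:
        try:
            return fn()
        except terminal:
            raise  # e.g. a 404 or a "too small to split" error: retrying cannot help
        except Exception:
            delay = next(delays, None)
            if delay is None:
                raise  # retry window exhausted
            sleep(delay)


print(list(backoff_delays(max_jitter=0.0)))
```

With jitter disabled, the printed schedule shows how few attempts fit into a 128-second window, which is why a slow local server is usually better served by a smaller batch size than by more retries.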
For the LiteLLM-routed providers (`openai`, `gemini`, `mistral`, `bedrock`, and `custom`), Cognee sends the OpenAI-style `dimensions` parameter to `litellm.aembedding()` whenever an embedding dimension is configured. In the `.env` flow documented on this page, `EMBEDDING_DIMENSIONS` is required, so this parameter is normally present.

Only some models accept `dimensions` (notably OpenAI's `text-embedding-3-*`). Other models — and some LiteLLM proxies — reject it with:

```
litellm.exceptions.UnsupportedParamsError: ... does not support parameters: ['dimensions']
```

Cognee does **not** expose an environment variable to suppress `dimensions` or to forward extra LiteLLM params. The intended fix is LiteLLM's own `drop_params` switch, which makes LiteLLM silently drop any parameter the target model does not support (including `dimensions`) instead of raising, so you can keep routing embeddings through your LiteLLM proxy.

Set the flag before you call any Cognee operation:

```python theme={null}
import litellm

# Drop parameters the target model does not support instead of raising.
litellm.drop_params = True

import cognee  # embeddings now drop unsupported params instead of raising
```

If you run your own LiteLLM proxy, enable `drop_params` in the proxy config so unsupported params are stripped before forwarding upstream. You can set it proxy-wide:

```yaml theme={null}
litellm_settings:
  drop_params: true
```

Or per model:

```yaml theme={null}
model_list:
  - model_name: my-embedder
    litellm_params:
      model: openai/text-embedding-3-large
      drop_params: true
```

`drop_params` silences the error but does **not** change the vector size your model returns. Still set `EMBEDDING_DIMENSIONS` to the model's real output size so it matches your vector store — see [Which embedding models are supported?](#which-embedding-models-are-supported) and [Important Notes](#important-notes).
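Because `drop_params` leaves the returned vector size unchanged, it can help to compare one probe embedding against the configured `EMBEDDING_DIMENSIONS` before ingesting data. The helper below is a hypothetical sketch, not a Cognee API; how you obtain the probe vector (for example, by embedding a single short string with your provider) is left to your setup.

```python
from typing import Optional, Sequence


def dimension_mismatch(vector: Sequence[float], configured_dimensions: int) -> Optional[str]:
    """Return an explanatory message if the vector length differs from the configured size."""
    actual = len(vector)
    if actual == configured_dimensions:
        return None
    return (
        f"Model returned {actual}-dimensional vectors but EMBEDDING_DIMENSIONS "
        f"is {configured_dimensions}; set EMBEDDING_DIMENSIONS to {actual}."
    )


# Example with a made-up 1024-dimensional probe vector and the 3072 fallback value.
probe = [0.0] * 1024
print(dimension_mismatch(probe, 3072))
print(dimension_mismatch(probe, 1024))
```

Running such a check once per configuration change is cheaper than discovering a vector-store shape mismatch partway through ingestion.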
If you don't need the LiteLLM proxy, the [`openai_compatible`](#openai-compatible-local-servers-llama-cpp-tei-vllm) provider talks to any `/v1/embeddings` server directly through the OpenAI SDK and never sends the `dimensions` parameter, so it avoids this error entirely — at the cost of bypassing LiteLLM.

Control client-side throttling for embedding calls to manage API usage and costs.

**Rate limiting is disabled by default.** You must explicitly set `EMBEDDING_RATE_LIMIT_ENABLED="true"` to activate it.

**Defaults (when rate limiting is enabled):**

| Variable                        | Default | Meaning                   |
| ------------------------------- | ------- | ------------------------- |
| `EMBEDDING_RATE_LIMIT_ENABLED`  | `false` | Off by default — opt-in   |
| `EMBEDDING_RATE_LIMIT_REQUESTS` | `60`    | Max requests per interval |
| `EMBEDDING_RATE_LIMIT_INTERVAL` | `60`    | Interval in seconds       |

**What counts as one request?** One rate-limit request = one `embed_text()` API call = one batch of chunks (not one chunk). With the default `EMBEDDING_BATCH_SIZE=36`, processing 360 chunks produces 10 requests. See the [Batch Size](#batch-size) section for how to tune batch size.

**Sizing guidance:** Set `EMBEDDING_RATE_LIMIT_REQUESTS` to your provider's RPM limit and `EMBEDDING_RATE_LIMIT_INTERVAL` to `60`. Use \~80–90% of your provider's advertised limit to leave headroom.

**Example configurations for common provider tiers**

These examples target embedding endpoints, such as OpenAI embedding models like `text-embedding-3-large`.
```dotenv theme={null}
EMBEDDING_RATE_LIMIT_ENABLED="true"
EMBEDDING_RATE_LIMIT_REQUESTS="2700"
EMBEDDING_RATE_LIMIT_INTERVAL="60"
```

```dotenv theme={null}
EMBEDDING_RATE_LIMIT_ENABLED="true"
EMBEDDING_RATE_LIMIT_REQUESTS="180"
EMBEDDING_RATE_LIMIT_INTERVAL="60"
```

```dotenv theme={null}
EMBEDDING_RATE_LIMIT_ENABLED="true"
EMBEDDING_RATE_LIMIT_REQUESTS="1350"
EMBEDDING_RATE_LIMIT_INTERVAL="60"
```

```dotenv theme={null}
EMBEDDING_RATE_LIMIT_ENABLED="true"
EMBEDDING_RATE_LIMIT_REQUESTS="60"
EMBEDDING_RATE_LIMIT_INTERVAL="60"
```

Always verify your exact tier limits in your provider's dashboard — limits vary by model, tier, and region. The examples above are approximations for common tiers and may change.

```dotenv theme={null}
# Mock embeddings for testing (returns zero vectors)
MOCK_EMBEDDING="true"
```

The `HUGGINGFACE_TOKENIZER` environment variable specifies which Hugging Face tokenizer to use for counting tokens before sending text to the embedding model. It is **optional** — Cognee no longer requires it at startup, so omitting it does not raise a `ValidationError` on import — but it is recommended when using the **Ollama** provider for accurate token counting.

**Value format**: The value is the Hugging Face model repository ID — the `{organization}/{model-name}` path that appears in the URL on [huggingface.co/models](https://huggingface.co/models). This should match the underlying model used by your Ollama embedding.
For example, if the Ollama model `nomic-embed-text:latest` is built from `nomic-ai/nomic-embed-text-v1.5` on Hugging Face, set:

```dotenv theme={null}
HUGGINGFACE_TOKENIZER="nomic-ai/nomic-embed-text-v1.5"
```

### Common model-to-tokenizer mappings

| Ollama model                       | `HUGGINGFACE_TOKENIZER` value            | Dimensions |
| ---------------------------------- | ---------------------------------------- | ---------- |
| `nomic-embed-text:latest`          | `nomic-ai/nomic-embed-text-v1.5`         | 768        |
| `bge-m3:latest`                    | `BAAI/bge-m3`                            | 1024       |
| `mxbai-embed-large:latest`         | `mixedbread-ai/mxbai-embed-large-v1`     | 1024       |
| `avr/sfr-embedding-mistral:latest` | `Salesforce/SFR-Embedding-Mistral`       | 4096       |
| `all-minilm:latest`                | `sentence-transformers/all-MiniLM-L6-v2` | 384        |

### Finding the tokenizer for any model

1. Look up the model on [huggingface.co/models](https://huggingface.co/models).
2. The repository ID is the `{organization}/{model-name}` part of the URL (e.g., `huggingface.co/BAAI/bge-m3` → `BAAI/bge-m3`).
3. Use the repository ID that corresponds to the model your Ollama tag is built from. The Ollama model page typically links to the original Hugging Face repository.

`HUGGINGFACE_TOKENIZER` is only used by the Ollama embedding engine. It is not needed for OpenAI, Fastembed, `openai_compatible`, or other providers. For `openai_compatible`, the engine automatically tries the model name as a HuggingFace tokenizer ID and falls back to TikToken if unavailable, so no separate `HUGGINGFACE_TOKENIZER` value is required.

### Troubleshooting dependency errors

The HuggingFace tokenizer (`HuggingFaceTokenizer`, which wraps `transformers.AutoTokenizer`) backs token counting for the **Ollama** embedding engine. Cognee also attempts to use it for `custom` and `openai_compatible` chunk sizing; if it cannot load, Cognee falls back to TikToken.
`transformers` is **not** part of the base `cognee` install; it is an optional dependency shipped by the `huggingface` and `ollama` extras (both pin `transformers>=4.46.3,<5`).

**`ModuleNotFoundError: No module named 'transformers'`**

The tokenizer was triggered (most often by `EMBEDDING_PROVIDER="ollama"`) but `transformers` is missing. This typically surfaces as `Connection to Embedding handler could not be established`. Install the extra:

```bash theme={null}
pip install 'cognee[huggingface]'
# or, if you are using Ollama embeddings:
pip install 'cognee[ollama]'
```

**`cannot import name 'is_offline_mode' from 'huggingface_hub'`**

This is a version mismatch: an old `transformers` (older than 4.46) is paired with a newer `huggingface_hub` that no longer exports `is_offline_mode`. Cognee's `transformers>=4.46.3,<5` pin is compatible with current `huggingface_hub` (Cognee resolves `huggingface_hub` 0.36.x). If you previously installed `transformers` manually, upgrade it into Cognee's supported range:

```bash theme={null}
pip install --upgrade "transformers>=4.46.3,<5"
```

Installing or reinstalling the `huggingface` / `ollama` extra pulls in a compatible `huggingface_hub` automatically, so prefer the extra over pinning `huggingface_hub` by hand.
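To tell the two errors above apart before reinstalling anything, you can inspect the installed `transformers` version with Python's standard library. The following is a minimal sketch and not part of Cognee: the helper names (`release_tuple`, `in_supported_range`, `installed_version`, `diagnose`) are hypothetical, and the version parsing is deliberately simplified (it looks only at the leading numeric components and ignores pre-release or dev suffixes).

```python
import re
from importlib import metadata
from typing import Optional, Tuple

# Range taken from the pin quoted above: transformers>=4.46.3,<5
SUPPORTED_MIN = (4, 46, 3)
SUPPORTED_MAX_EXCLUSIVE = (5, 0, 0)


def release_tuple(version: str) -> Tuple[int, ...]:
    """Parse the leading 'X.Y.Z' of a version string (simplified, hypothetical helper)."""
    match = re.match(r"(\d+)(?:\.(\d+))?(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"Unrecognised version string: {version!r}")
    # Missing components (e.g. '4.46') are treated as zero.
    return tuple(int(part or 0) for part in match.groups())


def in_supported_range(version: str) -> bool:
    """True when the version falls inside >=4.46.3,<5."""
    return SUPPORTED_MIN <= release_tuple(version) < SUPPORTED_MAX_EXCLUSIVE


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def diagnose() -> str:
    """Summarise which of the two situations described above applies."""
    version = installed_version("transformers")
    if version is None:
        return "transformers is not installed: install the huggingface or ollama extra"
    if not in_supported_range(version):
        return f"transformers {version} is outside >=4.46.3,<5: upgrade it"
    return f"transformers {version} is inside the supported range"


if __name__ == "__main__":
    print(diagnose())
```

Run it inside the same virtual environment that runs Cognee; a different interpreter may report a different set of installed packages.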
## Important Notes

* **Dimension Consistency**: `EMBEDDING_DIMENSIONS` must match your vector store collection schema
* **API Key Fallback**: If `EMBEDDING_API_KEY` is not set, Cognee uses `LLM_API_KEY` (except for custom providers)
* **Tokenization**: `HUGGINGFACE_TOKENIZER` is optional and no longer enforced by Cognee's startup validation, but it is recommended for the Ollama provider. Set it to the HuggingFace model repo ID that matches your embedding model for accurate token counting
* **Performance**: Local providers (Ollama, Fastembed) are typically slower than hosted APIs but offer privacy and cost benefits

Token counts drive chunk sizing and the `--dry-run` token estimate, so Cognee auto-selects a tokenizer that matches your embedding model:

* **`openai`**: the model's TikToken (BPE) encoding.
* **`gemini`**: the default TikToken encoding (Gemini has no local tokenizer, so counts are approximate).
* **`mistral`**: the Mistral tokenizer.
* **`fastembed`**: the model's own HuggingFace (wordpiece) tokenizer, resolved from the fastembed model list.
* **`ollama`**: the tokenizer named by `HUGGINGFACE_TOKENIZER`.
* **`openai_compatible` / `custom`**: the embedding model id used as a HuggingFace repo (e.g. `BAAI/bge-large-en-v1.5`). When that is not a loadable repo, such as a local alias or a LiteLLM-prefixed id, Cognee falls back to TikToken.

If a matching tokenizer cannot be loaded, Cognee logs an advisory warning and falls back to the TikToken tokenizer. This resolution **never raises**: a mismatched or missing tokenizer only degrades token-count accuracy (mis-sized chunks and a skewed `--dry-run` estimate); it does not stop ingestion.
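The "try the matching tokenizer, warn, then fall back" behaviour described on this page can be pictured with a small, self-contained sketch. This is an illustration of the pattern only, not Cognee's implementation: `resolve_tokenizer`, `load_missing_repo`, and `whitespace_tokenizer` are hypothetical names, and the whitespace splitter merely stands in for the TikToken fallback so the example runs without extra dependencies.

```python
import logging
from typing import Callable, List, Tuple

logger = logging.getLogger("tokenizer_fallback_sketch")

# In this sketch a "tokenizer" is just a function that splits text into tokens.
Tokenizer = Callable[[str], List[str]]
Loader = Callable[[], Tokenizer]


def resolve_tokenizer(
    candidates: List[Tuple[str, Loader]],
    fallback: Tuple[str, Tokenizer],
) -> Tuple[str, Tokenizer]:
    """Return the first tokenizer that loads; otherwise warn and use the fallback."""
    for name, load in candidates:
        try:
            return name, load()
        except Exception as exc:  # a failed load is logged, never re-raised
            logger.warning("Tokenizer %r unavailable (%s); falling back", name, exc)
    return fallback


def load_missing_repo() -> Tokenizer:
    # Stand-in for a model id that is not a loadable repo (e.g. a local alias).
    raise OSError("not a loadable repo")


def whitespace_tokenizer(text: str) -> List[str]:
    # Stand-in for the fallback tokenizer; real token counts would differ.
    return text.split()


name, tokenizer = resolve_tokenizer(
    candidates=[("my-local-alias", load_missing_repo)],
    fallback=("fallback", whitespace_tokenizer),
)
print(name, len(tokenizer("counts are approximate after a fallback")))
```

The point of the pattern is that the failure path only changes which tokenizer is returned, so ingestion continues with less accurate token counts rather than stopping.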