New to configuration? See the Setup Configuration Overview for the complete workflow: install extras → create .env → choose providers → handle pruning.

Supported Providers

Cognee supports multiple embedding providers:

- OpenAI — Text embedding models via OpenAI API (default)
- Azure OpenAI — Text embedding models via Azure OpenAI Service
- Google Gemini — Embedding models via Google AI
- Mistral — Embedding models via Mistral AI
- AWS Bedrock — Embedding models via AWS Bedrock
- Ollama — Local embedding models via Ollama
- LM Studio — Local embedding models via LM Studio
- Fastembed — CPU-friendly local embeddings
- HuggingFace — Embedding models via HuggingFace Inference API or Inference Endpoints
- vLLM — Self-hosted embedding models via vLLM
- OpenAI-Compatible — Direct OpenAI SDK for llama.cpp, vLLM, TEI, and any /v1/embeddings server (bypasses LiteLLM)
- Custom — OpenAI-compatible embedding endpoints routed through LiteLLM (DeepInfra, company-internal)
Configuration
Environment Variables

Set these environment variables in your .env file:

- EMBEDDING_PROVIDER — The provider to use (openai, gemini, mistral, ollama, fastembed, openai_compatible, custom)
- EMBEDDING_MODEL — The specific embedding model to use
- EMBEDDING_DIMENSIONS — The vector dimension size (must match your vector store)
- EMBEDDING_API_KEY — Your API key (falls back to LLM_API_KEY if not set)
- EMBEDDING_ENDPOINT — Custom endpoint URL (for Azure, Ollama, or custom providers)
- EMBEDDING_API_VERSION — API version (for Azure OpenAI)
- EMBEDDING_MAX_TOKENS — Maximum tokens per request (optional)
- HUGGINGFACE_TOKENIZER — HuggingFace Hub model ID used for token counting when EMBEDDING_PROVIDER is ollama
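A minimal .env sketch showing how these variables fit together; every value below is a placeholder rather than a recommendation:

```bash
# .env: embedding configuration skeleton (placeholder values)
EMBEDDING_PROVIDER="openai"          # one of: openai, gemini, mistral, ollama, fastembed, openai_compatible, custom
EMBEDDING_MODEL="<model-name>"       # model identifier understood by the chosen provider
EMBEDDING_DIMENSIONS=1536            # must match your vector store collection schema
EMBEDDING_API_KEY="<key>"            # optional; falls back to LLM_API_KEY if unset
# EMBEDDING_ENDPOINT="<url>"         # only for Azure, Ollama, or custom providers
# EMBEDDING_API_VERSION="<version>"  # only for Azure OpenAI
# EMBEDDING_MAX_TOKENS=8191          # optional cap on tokens per request
```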
Provider Setup Guides
OpenAI (Default)
OpenAI provides high-quality embeddings with good performance.
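A hedged .env sketch for the default OpenAI setup. The model name and dimension size are illustrative (text-embedding-3-large produces 3072-dimensional vectors, text-embedding-3-small 1536), and the openai/ prefix follows LiteLLM naming and is an assumption:

```bash
# .env: OpenAI embeddings (illustrative values)
EMBEDDING_PROVIDER="openai"
EMBEDDING_MODEL="openai/text-embedding-3-large"
EMBEDDING_DIMENSIONS=3072
EMBEDDING_API_KEY="sk-..."   # omit to fall back to LLM_API_KEY
```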
Azure OpenAI Embeddings
Use Azure OpenAI Service for embeddings with your own deployment.
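A sketch of an Azure configuration. The provider value and the azure/<deployment-name> model naming are assumptions (Azure is not in the EMBEDDING_PROVIDER list above, and most providers are routed through LiteLLM); the endpoint and API version are placeholders for your own resource:

```bash
# .env: Azure OpenAI embeddings (assumed values; adjust to your deployment)
EMBEDDING_PROVIDER="openai"                                    # assumption: Azure reached via the OpenAI-style path
EMBEDDING_MODEL="azure/<your-embedding-deployment-name>"       # LiteLLM-style azure/<deployment> naming (assumption)
EMBEDDING_DIMENSIONS=3072                                      # match the model behind your deployment
EMBEDDING_API_KEY="<azure-api-key>"
EMBEDDING_ENDPOINT="https://<resource-name>.openai.azure.com"
EMBEDDING_API_VERSION="2024-02-01"                             # example version; use the one your resource supports
```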
Google Gemini
Use Google’s embedding models for semantic search.
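A hedged sketch using Google's text-embedding-004 model (768-dimensional output); the gemini/ prefix follows LiteLLM naming and is an assumption:

```bash
# .env: Google Gemini embeddings (illustrative values)
EMBEDDING_PROVIDER="gemini"
EMBEDDING_MODEL="gemini/text-embedding-004"
EMBEDDING_DIMENSIONS=768
EMBEDDING_API_KEY="<google-ai-api-key>"
```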
Mistral
Use Mistral’s embedding models for high-quality vector representations.

Installation: Install the required Mistral dependency, then set your .env:
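A hedged sketch for Mistral's mistral-embed model (1024-dimensional output); the mistral/ prefix follows LiteLLM naming and is an assumption:

```bash
# .env: Mistral embeddings (illustrative values)
EMBEDDING_PROVIDER="mistral"
EMBEDDING_MODEL="mistral/mistral-embed"
EMBEDDING_DIMENSIONS=1024
EMBEDDING_API_KEY="<mistral-api-key>"
```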
AWS Bedrock
Use embedding models provided by the AWS Bedrock service.
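A sketch under several assumptions: Bedrock is not in the EMBEDDING_PROVIDER list above, so routing it through the LiteLLM-backed custom provider, the bedrock/ model naming, and the AWS credential variables are all assumptions; adapt them to your region and model access:

```bash
# .env: AWS Bedrock embeddings (assumed values)
EMBEDDING_PROVIDER="custom"                               # assumption: route Bedrock through LiteLLM
EMBEDDING_MODEL="bedrock/amazon.titan-embed-text-v2:0"    # LiteLLM bedrock/<model-id> naming (assumption)
EMBEDDING_DIMENSIONS=1024                                 # Titan Text Embeddings V2 default output size
# Bedrock authenticates with standard AWS credentials rather than EMBEDDING_API_KEY:
AWS_ACCESS_KEY_ID="<access-key>"
AWS_SECRET_ACCESS_KEY="<secret-key>"
AWS_REGION_NAME="us-east-1"
```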
Ollama (Local)
Run embedding models locally with Ollama for privacy and cost control.
HUGGINGFACE_TOKENIZER is the HuggingFace repo ID of the tokenizer used for token-length counting when sending requests to the Ollama embedding endpoint. It is required when using Ollama; see the HUGGINGFACE_TOKENIZER environment variable section below for how to find the correct value for your model.

Zero-API-key setup: To run fully offline with no OpenAI key, you must configure both the LLM provider and the embedding provider to use local backends. See the Local Setup guide for a complete combined .env example.

Installation: Install Ollama from ollama.ai and pull your desired embedding model:
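A sketch of the pull step and a matching .env, using nomic-embed-text from the mapping table later on this page; the endpoint URL assumes Ollama's default local address and the exact path may differ in your setup:

```bash
# Pull a local embedding model with Ollama
ollama pull nomic-embed-text

# .env: Ollama embeddings (illustrative values)
EMBEDDING_PROVIDER="ollama"
EMBEDDING_MODEL="nomic-embed-text:latest"
EMBEDDING_DIMENSIONS=768
EMBEDDING_ENDPOINT="http://localhost:11434/api/embeddings"   # default Ollama address (path is an assumption)
HUGGINGFACE_TOKENIZER="nomic-ai/nomic-embed-text-v1.5"       # required for Ollama token counting
```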
LM Studio (Local)
Run embedding models locally with LM Studio for privacy and cost control.

Installation: Install LM Studio from lmstudio.ai and download your desired model from LM Studio’s interface.
Load your model, start the LM Studio server, and Cognee will be able to connect to it.
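A hedged sketch assuming LM Studio's default OpenAI-compatible server on port 1234 and the direct openai_compatible provider; the model identifier is whatever your loaded model reports in LM Studio's server view:

```bash
# .env: LM Studio embeddings (assumed values)
EMBEDDING_PROVIDER="openai_compatible"          # LM Studio exposes an OpenAI-compatible /v1/embeddings endpoint
EMBEDDING_MODEL="<model-id-from-lm-studio>"
EMBEDDING_DIMENSIONS=768                        # match the embedding model you loaded
EMBEDDING_ENDPOINT="http://localhost:1234/v1"   # LM Studio's default local server address (assumption)
EMBEDDING_API_KEY="lm-studio"                   # placeholder; local servers typically ignore the key
```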
Fastembed (Local)
Use Fastembed for CPU-friendly local embeddings without GPU requirements.

Installation: Fastembed is included by default with Cognee.

Known Issues:
- As of September 2025, Fastembed requires Python < 3.13 (not compatible with Python 3.13+)
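A minimal sketch; the model name and dimension size are illustrative for a small sentence-transformers model, so check which models your Fastembed version actually ships with:

```bash
# .env: Fastembed local embeddings (illustrative values)
EMBEDDING_PROVIDER="fastembed"
EMBEDDING_MODEL="sentence-transformers/all-MiniLM-L6-v2"   # small CPU-friendly model (assumption)
EMBEDDING_DIMENSIONS=384
# No API key or endpoint needed; embeddings are computed locally on CPU
```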
HuggingFace
Use embedding models from HuggingFace via the HuggingFace Inference API (serverless) or dedicated Inference Endpoints.

Installation: Install the HuggingFace extra for tokenizer support.
- Serverless
- Dedicated Endpoint
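A hedged sketch covering both modes, routed through the custom provider with a huggingface/ model prefix as described in the note below; the model, dimensions, and endpoint URL are placeholders:

```bash
# .env: HuggingFace Inference API, serverless (illustrative values)
EMBEDDING_PROVIDER="custom"
EMBEDDING_MODEL="huggingface/BAAI/bge-m3"   # huggingface/<repo-id>
EMBEDDING_DIMENSIONS=1024
EMBEDDING_API_KEY="hf_..."                  # HuggingFace access token

# Dedicated Inference Endpoint variant: also point at your endpoint URL
# EMBEDDING_ENDPOINT="https://<your-endpoint>.endpoints.huggingface.cloud"
```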
HUGGINGFACE_TOKENIZER with HuggingFace embeddings: When using EMBEDDING_PROVIDER="custom" with a huggingface/ model, Cognee automatically attempts to load a HuggingFace tokenizer from the model repo for token counting. If that fails, it falls back to the TikToken tokenizer. You do not need to set HUGGINGFACE_TOKENIZER manually for this provider — it is only required when using EMBEDDING_PROVIDER="ollama" (see the Ollama section above).
vLLM
Use vLLM to serve local or self-hosted embedding models with an OpenAI-compatible API.

Tokenization: Cognee automatically strips the hosted_vllm/ prefix when loading the HuggingFace tokenizer, so no separate HUGGINGFACE_TOKENIZER setting is needed as long as the model name after the prefix is a valid HuggingFace model ID.

See the LiteLLM vLLM documentation for more details.

Example with Qwen3-Embedding-4B on port 8001, together with a command to verify the model name your vLLM server exposes:
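A hedged sketch; the vllm serve flags, the custom provider routing, and the 2560-dimension figure for Qwen3-Embedding-4B are assumptions to confirm against the model card and your vLLM version:

```bash
# Serve the embedding model with vLLM on port 8001 (flags are illustrative)
vllm serve Qwen/Qwen3-Embedding-4B --task embed --port 8001

# .env: point Cognee at the vLLM server via the hosted_vllm/ prefix
EMBEDDING_PROVIDER="custom"                   # assumption: vLLM routed through LiteLLM
EMBEDDING_MODEL="hosted_vllm/Qwen/Qwen3-Embedding-4B"
EMBEDDING_DIMENSIONS=2560                     # Qwen3-Embedding-4B output size (assumption)
EMBEDDING_ENDPOINT="http://localhost:8001/v1"
EMBEDDING_API_KEY="vllm"                      # placeholder unless the server enforces a key

# Verify which model name the server exposes
curl http://localhost:8001/v1/models
```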
OpenAI-Compatible Local Servers (llama.cpp, TEI, vLLM)
Use EMBEDDING_PROVIDER="openai_compatible" for any local inference server that exposes the standard /v1/embeddings endpoint. This provider talks directly to OpenAI-compatible embedding servers via the OpenAI Python SDK, bypassing LiteLLM.

Use this provider for: llama.cpp (llama-server --embedding), vLLM, Hugging Face TEI, LocalAI, Infinity, and similar servers.

- llama.cpp
- vLLM
- Hugging Face TEI
Start llama.cpp with embedding support:
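A sketch assuming a GGUF embedding model on disk and the server's default port 8080; the model file and the reported model id are placeholders:

```bash
# Start llama.cpp's server with embedding support (model path is illustrative)
llama-server --embedding -m ./models/nomic-embed-text-v1.5.Q8_0.gguf --port 8080

# .env: connect Cognee directly to the server, bypassing LiteLLM
EMBEDDING_PROVIDER="openai_compatible"
EMBEDDING_MODEL="nomic-embed-text-v1.5"      # use the name the server reports (placeholder)
EMBEDDING_DIMENSIONS=768
EMBEDDING_ENDPOINT="http://localhost:8080"   # /v1 is appended automatically (see the note below)
EMBEDDING_API_KEY="sk-no-key-required"       # placeholder; local servers typically ignore it
```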
Endpoint normalisation: The engine automatically appends /v1 to EMBEDDING_ENDPOINT if it is missing, and strips a trailing /embeddings suffix. You can pass either http://localhost:8080 or http://localhost:8080/v1 — both work.
Custom Providers
Use OpenAI-compatible embedding endpoints from other providers such as DeepInfra, OpenRouter, or a company-internal server. These are routed through LiteLLM and require a provider prefix in the model name.
- DeepInfra
- OpenRouter
- Self-Hosted
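A hedged DeepInfra-style sketch; the openai/ prefix for a generic OpenAI-compatible LiteLLM route and the model choice are assumptions, and the OpenRouter variant uses a named prefix without an endpoint as described in the note below:

```bash
# .env: custom provider routed through LiteLLM (DeepInfra example; values are illustrative)
EMBEDDING_PROVIDER="custom"
EMBEDDING_MODEL="openai/BAAI/bge-m3"                        # <litellm-prefix>/<model-id> (assumption)
EMBEDDING_DIMENSIONS=1024
EMBEDDING_API_KEY="<deepinfra-api-key>"
EMBEDDING_ENDPOINT="https://api.deepinfra.com/v1/openai"    # exact base URL the provider expects

# OpenRouter variant: named LiteLLM prefix, no endpoint needed
# EMBEDDING_MODEL="openrouter/<model-id>"
# EMBEDDING_API_KEY="<openrouter-api-key>"
```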
No endpoint normalisation for custom: Unlike openai_compatible, the custom provider passes EMBEDDING_ENDPOINT directly to LiteLLM as api_base with no automatic /v1 appending or /embeddings stripping. Set the endpoint to exactly the base URL your provider expects (e.g., https://api.deepinfra.com/v1/openai), or omit it entirely when using a named LiteLLM prefix such as openrouter/.

Additional Information
Batch Size
EMBEDDING_BATCH_SIZE controls how many text chunks are grouped into a single embedding API call. Cognee splits all chunks into batches of this size and sends them concurrently to the embedding engine.

| Variable | Default | Description |
|---|---|---|
| EMBEDDING_BATCH_SIZE | 36 | Chunks per embedding API call |
The default of 36 suits most cloud providers, but slower local or self-hosted embedding servers can be overwhelmed by batches of 36.

Relationship to rate limiting: Each batch counts as one request toward EMBEDDING_RATE_LIMIT_REQUESTS. A single file may produce many chunks — with EMBEDDING_BATCH_SIZE=36, a document split into 360 chunks generates 10 requests.

Reduce the batch size if you see errors or slowdowns:
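The value of 8 below is only an illustration; any smaller number trades throughput for reliability:

```bash
# .env: smaller batches for slow or resource-constrained embedding servers
EMBEDDING_BATCH_SIZE=8
```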
Timeout and Retry Behavior
The LiteLLMEmbeddingEngine applies two layers of protection against slow or unreachable endpoints:

| Limit | Value | Configurable |
|---|---|---|
| Per-attempt timeout | 30 seconds | No (hardcoded) |
| Total retry window | 128 seconds | No (hardcoded) |

How retries work: Failed attempts are retried with exponential back-off starting at 2 seconds, with random jitter, until the 128-second window is exhausted.

What is not retried: 404 Not Found errors are raised immediately — they indicate a configuration problem (wrong model name or endpoint) rather than a transient failure.

Common error messages and causes:

- EmbeddingException: Embedding request timed out. Check EMBEDDING_ENDPOINT connectivity. — The endpoint did not respond within 30 seconds. Verify that EMBEDDING_ENDPOINT is reachable from your network.
- EmbeddingException: Cannot connect to embedding endpoint. Check EMBEDDING_ENDPOINT. — TCP connection was refused or the server closed the connection before responding. Confirm the server is running and the URL is correct.
- EmbeddingException: Failed to index data points using model <model> — The provider returned a 400 or 404 error. Common causes: wrong model name, missing hosted_vllm/ prefix for vLLM, or an unsupported model at that endpoint.

If a slow endpoint keeps timing out, reduce EMBEDDING_BATCH_SIZE to send fewer texts per request (see the Batch Size section above).

The timeout and retry values are hardcoded in LiteLLMEmbeddingEngine and cannot be changed via environment variables. To use different limits, subclass LiteLLMEmbeddingEngine and override embed_text with a custom @retry decorator.
Rate Limiting
Control client-side throttling for embedding calls to manage API usage and costs.

Defaults (when rate limiting is enabled):

| Variable | Default | Meaning |
|---|---|---|
| EMBEDDING_RATE_LIMIT_ENABLED | false | Off by default — opt-in |
| EMBEDDING_RATE_LIMIT_REQUESTS | 60 | Max requests per interval |
| EMBEDDING_RATE_LIMIT_INTERVAL | 60 | Interval in seconds |

What counts as one request? One rate-limit request = one embed_text() API call = one batch of chunks (not one chunk). With the default EMBEDDING_BATCH_SIZE=36, processing 360 chunks produces 10 requests. See the Batch Size section for how to tune batch size.

Sizing guidance: Set EMBEDDING_RATE_LIMIT_REQUESTS to your provider’s RPM limit and EMBEDDING_RATE_LIMIT_INTERVAL to 60. Use ~80–90% of your provider’s advertised limit to leave headroom.

Example configurations for common provider tiers

These examples target embedding endpoints, such as OpenAI embedding models like text-embedding-3-large.
OpenAI - Tier 1
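An illustrative sketch assuming roughly 3,000 embedding requests per minute at Tier 1 and keeping about 80% headroom; verify your actual limit in the OpenAI dashboard:

```bash
# .env: OpenAI Tier 1 (assumes ~3,000 RPM for embedding models)
EMBEDDING_RATE_LIMIT_ENABLED=true
EMBEDDING_RATE_LIMIT_REQUESTS=2400   # ~80% of the assumed limit
EMBEDDING_RATE_LIMIT_INTERVAL=60
```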
OpenAI - Free / Very Low Tier
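A deliberately conservative, illustrative sketch for free or very low tiers; the figure is a placeholder, not a published limit:

```bash
# .env: OpenAI free / very low tier (placeholder values)
EMBEDDING_RATE_LIMIT_ENABLED=true
EMBEDDING_RATE_LIMIT_REQUESTS=20
EMBEDDING_RATE_LIMIT_INTERVAL=60
```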
Google Gemini - Free Tier
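An illustrative sketch; Gemini free-tier embedding limits vary by model, so the figure below is an assumption to replace with your own limit:

```bash
# .env: Google Gemini free tier (placeholder values)
EMBEDDING_RATE_LIMIT_ENABLED=true
EMBEDDING_RATE_LIMIT_REQUESTS=50
EMBEDDING_RATE_LIMIT_INTERVAL=60
```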
Conservative Default
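A safe starting point when you do not know your provider's limits; raise it once you have confirmed the real numbers:

```bash
# .env: conservative default when provider limits are unknown
EMBEDDING_RATE_LIMIT_ENABLED=true
EMBEDDING_RATE_LIMIT_REQUESTS=30
EMBEDDING_RATE_LIMIT_INTERVAL=60
```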
Always verify your exact tier limits in your provider’s dashboard — limits vary by model, tier, and region. The examples above are approximations for common tiers and may change.
Testing and Development
HUGGINGFACE_TOKENIZER environment variable
The HUGGINGFACE_TOKENIZER environment variable specifies which Hugging Face tokenizer to use for counting tokens before sending text to the embedding model. This is required when using the Ollama provider.

Value format: The value is the Hugging Face model repository ID — the {organization}/{model-name} path that appears in the URL on huggingface.co/models. This should match the underlying model used by your Ollama embedding.

For example, if the Ollama model nomic-embed-text:latest is built from nomic-ai/nomic-embed-text-v1.5 on Hugging Face, set the tokenizer accordingly.
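The corresponding .env line for that example:

```bash
HUGGINGFACE_TOKENIZER="nomic-ai/nomic-embed-text-v1.5"
```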
Common model-to-tokenizer mappings

| Ollama model | HUGGINGFACE_TOKENIZER value | Dimensions |
|---|---|---|
| nomic-embed-text:latest | nomic-ai/nomic-embed-text-v1.5 | 768 |
| bge-m3:latest | BAAI/bge-m3 | 1024 |
| mxbai-embed-large:latest | mixedbread-ai/mxbai-embed-large-v1 | 1024 |
| avr/sfr-embedding-mistral:latest | Salesforce/SFR-Embedding-Mistral | 4096 |
| all-minilm:latest | sentence-transformers/all-MiniLM-L6-v2 | 384 |
Finding the tokenizer for any model
- Look up the model on huggingface.co/models.
- The repository ID is the {organization}/{model-name} part of the URL (e.g., huggingface.co/BAAI/bge-m3 → BAAI/bge-m3).
- Use the repository ID that corresponds to the model your Ollama tag is built from. The Ollama model page typically links to the original Hugging Face repository.
HUGGINGFACE_TOKENIZER is only used by the Ollama embedding engine. It is not needed for OpenAI, Fastembed, or other providers.

Important Notes
- Dimension Consistency: EMBEDDING_DIMENSIONS must match your vector store collection schema
- API Key Fallback: If EMBEDDING_API_KEY is not set, Cognee uses LLM_API_KEY (except for custom providers)
- Tokenization: HUGGINGFACE_TOKENIZER is required for the Ollama provider — set it to the HuggingFace model repo ID that matches your embedding model
- Performance: Local providers (Ollama, Fastembed) are slower but offer privacy and cost benefits
- LLM Providers — Configure LLM providers for text generation
- Vector Stores — Set up vector databases for embedding storage
- Overview — Return to setup configuration overview