New to configuration? See the Setup Configuration Overview for the complete workflow: install extras → create `.env` → choose providers → handle pruning.

## Supported Providers
Cognee supports multiple embedding providers:

- OpenAI — Text embedding models via OpenAI API (default)
- Azure OpenAI — Text embedding models via Azure OpenAI Service
- Google Gemini — Embedding models via Google AI
- Mistral — Embedding models via Mistral AI
- AWS Bedrock — Embedding models via AWS Bedrock
- Ollama — Local embedding models via Ollama
- LM Studio — Local embedding models via LM Studio
- Fastembed — CPU-friendly local embeddings
- HuggingFace — Embedding models via HuggingFace Inference API or Inference Endpoints
- vLLM — Self-hosted embedding models via vLLM
- OpenAI-Compatible — Direct OpenAI SDK for llama.cpp, vLLM, TEI, and any /v1/embeddings server (bypasses LiteLLM)
- Custom — OpenAI-compatible embedding endpoints routed through LiteLLM (DeepInfra, company-internal)
## Configuration
### Environment Variables
Set these environment variables in your `.env` file:

- `EMBEDDING_PROVIDER` — The provider to use (`openai`, `gemini`, `mistral`, `ollama`, `fastembed`, `openai_compatible`, `custom`)
- `EMBEDDING_MODEL` — The specific embedding model to use
- `EMBEDDING_DIMENSIONS` — The vector dimension size (must match your vector store)
- `EMBEDDING_API_KEY` — Your API key (falls back to `LLM_API_KEY` if not set)
- `EMBEDDING_ENDPOINT` — Custom endpoint URL (for Azure, Ollama, or custom providers)
- `EMBEDDING_API_VERSION` — API version (for Azure OpenAI)
- `EMBEDDING_MAX_TOKENS` — Maximum tokens per request (optional)
- `HUGGINGFACE_TOKENIZER` — HuggingFace Hub model ID used for token counting when `EMBEDDING_PROVIDER` is `ollama`
## Provider Setup Guides
### OpenAI (Default)
OpenAI provides high-quality embeddings with good performance.
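A minimal `.env` for the default provider might look like the sketch below (the key is a placeholder; `text-embedding-3-large` produces 3072-dimensional vectors):

```shell
EMBEDDING_PROVIDER="openai"
EMBEDDING_MODEL="text-embedding-3-large"
EMBEDDING_DIMENSIONS=3072
EMBEDDING_API_KEY="sk-..."   # placeholder: use your own key
```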
### Azure OpenAI Embeddings
Use Azure OpenAI Service for embeddings with your own deployment.
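One plausible `.env` sketch, assuming Azure deployments are addressed through a LiteLLM-style `azure/<deployment>` model prefix (the deployment name, resource URL, and API version below are placeholders; check Cognee's Azure guide for the exact provider value):

```shell
EMBEDDING_MODEL="azure/my-embedding-deployment"            # placeholder deployment name
EMBEDDING_ENDPOINT="https://my-resource.openai.azure.com"  # placeholder resource URL
EMBEDDING_API_VERSION="2024-02-01"                         # example API version
EMBEDDING_API_KEY="..."
EMBEDDING_DIMENSIONS=3072                                  # must match the deployed model
```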
### Google Gemini
Use Google’s embedding models for semantic search.
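A sketch using the `gemini` provider value from the list above; the model name and prefix are assumptions (Google's `text-embedding-004` returns 768-dimensional vectors per its model card):

```shell
EMBEDDING_PROVIDER="gemini"
EMBEDDING_MODEL="gemini/text-embedding-004"   # model name and prefix are assumptions
EMBEDDING_DIMENSIONS=768
EMBEDDING_API_KEY="..."                       # your Google AI API key
```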
### Mistral
Use Mistral’s embedding models for high-quality vector representations.

Installation: Install the required dependency:
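The dependency and a matching `.env` might look like this sketch (the package name and model prefix are assumptions; `mistral-embed` returns 1024-dimensional vectors):

```shell
pip install mistralai   # assumed client dependency

# .env
EMBEDDING_PROVIDER="mistral"
EMBEDDING_MODEL="mistral/mistral-embed"   # prefix is an assumption
EMBEDDING_DIMENSIONS=1024
EMBEDDING_API_KEY="..."
```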
### AWS Bedrock
Use embedding models provided by the AWS Bedrock service.
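A hedged sketch, assuming Bedrock models are routed through LiteLLM's `bedrock/` prefix and credentials come from the standard AWS environment variables (Titan Embed Text V2 defaults to 1024 dimensions):

```shell
EMBEDDING_PROVIDER="custom"                              # assumption: routed through LiteLLM
EMBEDDING_MODEL="bedrock/amazon.titan-embed-text-v2:0"
EMBEDDING_DIMENSIONS=1024
AWS_ACCESS_KEY_ID="..."
AWS_SECRET_ACCESS_KEY="..."
AWS_REGION_NAME="us-east-1"
```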
### Ollama (Local)
Run embedding models locally with Ollama for privacy and cost control.
`HUGGINGFACE_TOKENIZER` is the HuggingFace repo ID of the tokenizer used for token-length counting when sending requests to the Ollama embedding endpoint. It is required when using Ollama; see the HUGGINGFACE_TOKENIZER environment variable section below for how to find the correct value for your model.

Zero-API-key setup: To run fully offline with no OpenAI key, you must configure both the LLM provider and the embedding provider to use local backends. See the Local Setup guide for a complete combined `.env` example.

Installation: Install Ollama from ollama.ai and pull your desired embedding model:
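For example, pulling `nomic-embed-text` and pointing Cognee at the local server (the endpoint path is an assumption; the tokenizer and dimension values come from the mapping table later in this page):

```shell
ollama pull nomic-embed-text

# .env
EMBEDDING_PROVIDER="ollama"
EMBEDDING_MODEL="nomic-embed-text:latest"
EMBEDDING_DIMENSIONS=768
EMBEDDING_ENDPOINT="http://localhost:11434/api/embeddings"   # endpoint path is an assumption
HUGGINGFACE_TOKENIZER="nomic-ai/nomic-embed-text-v1.5"
```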
### LM Studio (Local)
Run embedding models locally with LM Studio for privacy and cost control.

Installation: Install LM Studio from lmstudio.ai and download your desired model from LM Studio’s interface. Load your model, start the LM Studio server, and Cognee will be able to connect to it.
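LM Studio serves an OpenAI-compatible API on port 1234 by default, so one plausible sketch uses the `openai_compatible` provider (the model ID and dimensions depend on what you load):

```shell
EMBEDDING_PROVIDER="openai_compatible"
EMBEDDING_MODEL="nomic-embed-text-v1.5"        # placeholder: whatever model ID the server reports
EMBEDDING_ENDPOINT="http://localhost:1234/v1"  # LM Studio's default server address
EMBEDDING_DIMENSIONS=768                       # must match the loaded model
```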
### Fastembed (Local)
Use Fastembed for CPU-friendly local embeddings without GPU requirements.

Installation: Fastembed is included by default with Cognee.

Known Issues:
- As of September 2025, Fastembed requires Python < 3.13 (not compatible with Python 3.13+)
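A sketch using a small CPU-friendly model (`all-MiniLM-L6-v2` is 384-dimensional; confirm the model appears in Fastembed's supported list):

```shell
EMBEDDING_PROVIDER="fastembed"
EMBEDDING_MODEL="sentence-transformers/all-MiniLM-L6-v2"
EMBEDDING_DIMENSIONS=384
# No API key needed: the model runs locally on CPU
```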
### HuggingFace
Use embedding models from HuggingFace via the HuggingFace Inference API (serverless) or dedicated Inference Endpoints.

Installation: Install the HuggingFace extra for tokenizer support:
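A plausible command, assuming the extra is named `huggingface` (the exact extra name is an assumption; check Cognee's install docs):

```shell
pip install 'cognee[huggingface]'   # extra name is an assumption
```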
- Serverless
- Dedicated Endpoint
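A sketch for the serverless API, using the `custom` provider with a `huggingface/` model prefix (`BAAI/bge-m3` is 1024-dimensional; for a dedicated endpoint, also set `EMBEDDING_ENDPOINT` to your endpoint URL):

```shell
EMBEDDING_PROVIDER="custom"
EMBEDDING_MODEL="huggingface/BAAI/bge-m3"
EMBEDDING_DIMENSIONS=1024
EMBEDDING_API_KEY="hf_..."   # placeholder: your HuggingFace access token
# Dedicated endpoint only (URL is a placeholder):
# EMBEDDING_ENDPOINT="https://my-endpoint.endpoints.huggingface.cloud"
```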
HUGGINGFACE_TOKENIZER with HuggingFace embeddings: When using `EMBEDDING_PROVIDER="custom"` with a `huggingface/` model, Cognee automatically attempts to load a HuggingFace tokenizer from the model repo for token counting. If that fails, it falls back to the TikToken tokenizer. You do not need to set `HUGGINGFACE_TOKENIZER` manually for this provider — it is only required when using `EMBEDDING_PROVIDER="ollama"` (see the Ollama section above).
### vLLM
Use vLLM to serve local or self-hosted embedding models with an OpenAI-compatible API.

Tokenization: Cognee automatically strips the `hosted_vllm/` prefix when loading the HuggingFace tokenizer, so no separate `HUGGINGFACE_TOKENIZER` setting is needed as long as the model name after the prefix is a valid HuggingFace model ID. See the LiteLLM vLLM documentation for more details.

To verify the model name your vLLM server exposes, query its model list (e.g., `curl http://localhost:8001/v1/models`).

Example with Qwen3-Embedding-4B on port 8001:
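A sketch (serve flags vary by vLLM version; the `custom` provider value is an assumption based on the LiteLLM routing and `hosted_vllm/` prefix described above; Qwen3-Embedding-4B outputs 2560-dimensional vectors per its model card):

```shell
# Serve the model (flags vary by vLLM version)
vllm serve Qwen/Qwen3-Embedding-4B --port 8001

# .env
EMBEDDING_PROVIDER="custom"                            # assumption: routed through LiteLLM
EMBEDDING_MODEL="hosted_vllm/Qwen/Qwen3-Embedding-4B"
EMBEDDING_ENDPOINT="http://localhost:8001/v1"
EMBEDDING_DIMENSIONS=2560
EMBEDDING_API_KEY="dummy"                              # vLLM does not require a real key by default
```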
### OpenAI-Compatible Local Servers (llama.cpp, TEI, vLLM)
Use `EMBEDDING_PROVIDER="openai_compatible"` for any local inference server that exposes the standard `/v1/embeddings` endpoint. This provider talks directly to OpenAI-compatible embedding servers via the OpenAI Python SDK, bypassing LiteLLM.

Use this provider for: llama.cpp (`llama-server --embedding`), vLLM, Hugging Face TEI, LocalAI, Infinity, and similar servers.

- llama.cpp
- vLLM
- Hugging Face TEI
Start llama.cpp with embedding support:
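A sketch (the GGUF path and model ID are placeholders; `--embedding` enables llama.cpp's embeddings endpoint):

```shell
llama-server -m ./models/nomic-embed-text-v1.5.Q8_0.gguf --embedding --port 8080

# .env
EMBEDDING_PROVIDER="openai_compatible"
EMBEDDING_MODEL="nomic-embed-text-v1.5"     # placeholder: the model ID your server reports
EMBEDDING_ENDPOINT="http://localhost:8080"
EMBEDDING_DIMENSIONS=768                    # must match the loaded model
```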
Endpoint normalisation: The engine automatically appends `/v1` to `EMBEDDING_ENDPOINT` if it is missing, and strips a trailing `/embeddings` suffix. You can pass either `http://localhost:8080` or `http://localhost:8080/v1` — both work.

### Custom Providers
Use OpenAI-compatible embedding endpoints from other providers such as DeepInfra or a company-internal server. These are routed through LiteLLM and require a provider prefix in the model name.
- DeepInfra
- Self-Hosted
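A DeepInfra sketch, assuming a LiteLLM-style `deepinfra/` provider prefix (the prefix and model availability are assumptions; for a self-hosted server, set `EMBEDDING_ENDPOINT` to your server's URL instead):

```shell
EMBEDDING_PROVIDER="custom"
EMBEDDING_MODEL="deepinfra/BAAI/bge-m3"   # prefix and model availability are assumptions
EMBEDDING_API_KEY="..."                   # your DeepInfra API key
EMBEDDING_DIMENSIONS=1024
```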
## Advanced Options
### Rate Limiting
Control client-side throttling for embedding calls to manage API usage and costs.

Defaults (when rate limiting is enabled):

| Variable | Default | Meaning |
|---|---|---|
| `EMBEDDING_RATE_LIMIT_ENABLED` | `false` | Off by default — opt-in |
| `EMBEDDING_RATE_LIMIT_REQUESTS` | `60` | Max requests per interval |
| `EMBEDDING_RATE_LIMIT_INTERVAL` | `60` | Interval in seconds |

Sizing guidance: Set `EMBEDDING_RATE_LIMIT_REQUESTS` to your provider’s RPM limit and `EMBEDDING_RATE_LIMIT_INTERVAL` to 60. Embedding calls are typically more frequent than LLM calls — each document chunk generates one embedding request. Use ~80–90% of your provider’s advertised limit to leave headroom.

#### Example configurations for common provider tiers

These examples target embedding endpoints, such as OpenAI embedding models like `text-embedding-3-large`.

#### OpenAI - Tier 1
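A sketch assuming a 3,000 RPM embedding limit (an assumption; verify your actual limit in the dashboard), throttled to ~85% for headroom:

```shell
EMBEDDING_RATE_LIMIT_ENABLED=true
EMBEDDING_RATE_LIMIT_REQUESTS=2500   # ~85% of an assumed 3,000 RPM limit
EMBEDDING_RATE_LIMIT_INTERVAL=60
```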
#### OpenAI - Free / Very Low Tier
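A conservative sketch for very low limits (the 60 RPM figure is illustrative, not an official quota):

```shell
EMBEDDING_RATE_LIMIT_ENABLED=true
EMBEDDING_RATE_LIMIT_REQUESTS=50     # ~80% of an assumed 60 RPM limit
EMBEDDING_RATE_LIMIT_INTERVAL=60
```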
#### Google Gemini - Free Tier
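A sketch assuming a roughly 100 RPM free-tier limit (illustrative; Gemini free-tier quotas change frequently):

```shell
EMBEDDING_RATE_LIMIT_ENABLED=true
EMBEDDING_RATE_LIMIT_REQUESTS=80     # ~80% of an assumed 100 RPM limit
EMBEDDING_RATE_LIMIT_INTERVAL=60
```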
#### Conservative Default
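A safe starting point when you do not know your limits:

```shell
EMBEDDING_RATE_LIMIT_ENABLED=true
EMBEDDING_RATE_LIMIT_REQUESTS=30     # deliberately low; raise once you know your quota
EMBEDDING_RATE_LIMIT_INTERVAL=60
```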
Always verify your exact tier limits in your provider’s dashboard — limits vary by model, tier, and region. The examples above are approximations for common tiers and may change.
### Testing and Development
## HUGGINGFACE_TOKENIZER environment variable
The `HUGGINGFACE_TOKENIZER` environment variable specifies which Hugging Face tokenizer to use for counting tokens before sending text to the embedding model. This is required when using the Ollama provider.

Value format: The value is the Hugging Face model repository ID — the `{organization}/{model-name}` path that appears in the URL on huggingface.co/models. This should match the underlying model used by your Ollama embedding.

For example, if the Ollama model `nomic-embed-text:latest` is built from `nomic-ai/nomic-embed-text-v1.5` on Hugging Face, set `HUGGINGFACE_TOKENIZER="nomic-ai/nomic-embed-text-v1.5"`.

### Common model-to-tokenizer mappings
| Ollama model | HUGGINGFACE_TOKENIZER value | Dimensions |
|---|---|---|
| `nomic-embed-text:latest` | `nomic-ai/nomic-embed-text-v1.5` | 768 |
| `bge-m3:latest` | `BAAI/bge-m3` | 1024 |
| `mxbai-embed-large:latest` | `mixedbread-ai/mxbai-embed-large-v1` | 1024 |
| `avr/sfr-embedding-mistral:latest` | `Salesforce/SFR-Embedding-Mistral` | 4096 |
| `all-minilm:latest` | `sentence-transformers/all-MiniLM-L6-v2` | 384 |
### Finding the tokenizer for any model
- Look up the model on huggingface.co/models.
- The repository ID is the `{organization}/{model-name}` part of the URL (e.g., huggingface.co/BAAI/bge-m3 → `BAAI/bge-m3`).
- Use the repository ID that corresponds to the model your Ollama tag is built from. The Ollama model page typically links to the original Hugging Face repository.
`HUGGINGFACE_TOKENIZER` is only used by the Ollama embedding engine. It is not needed for OpenAI, Fastembed, or other providers.

## Important Notes
- Dimension Consistency: `EMBEDDING_DIMENSIONS` must match your vector store collection schema
- API Key Fallback: If `EMBEDDING_API_KEY` is not set, Cognee uses `LLM_API_KEY` (except for custom providers)
- Tokenization: `HUGGINGFACE_TOKENIZER` is required for the Ollama provider — set it to the HuggingFace model repo ID that matches your embedding model
- Performance: Local providers (Ollama, Fastembed) are slower but offer privacy and cost benefits
## See Also

- LLM Providers — Configure LLM providers for text generation
- Vector Stores — Set up vector databases for embedding storage
- Overview — Return to setup configuration overview