Embedding providers convert text into vector representations that enable semantic search. These vectors capture the meaning of text, allowing Cognee to find conceptually related content even when the wording is different.
New to configuration? See the Setup Configuration Overview for the complete workflow: install extras → create .env → choose providers → handle pruning.

Supported Providers

Cognee supports multiple embedding providers:
  • OpenAI — Text embedding models via OpenAI API (default)
  • Azure OpenAI — Text embedding models via Azure OpenAI Service
  • Google Gemini — Embedding models via Google AI
  • Mistral — Embedding models via Mistral AI
  • AWS Bedrock — Embedding models via AWS Bedrock
  • Ollama — Local embedding models via Ollama
  • LM Studio — Local embedding models via LM Studio
  • Fastembed — CPU-friendly local embeddings
  • HuggingFace — Embedding models via HuggingFace Inference API or Inference Endpoints
  • vLLM — Self-hosted embedding models via vLLM
  • OpenAI-Compatible — Direct OpenAI SDK for llama.cpp, vLLM, TEI, and any /v1/embeddings server (bypasses LiteLLM)
  • Custom — OpenAI-compatible embedding endpoints routed through LiteLLM (DeepInfra, company-internal)
LLM/Embedding Configuration: If you configure only the LLM or only the embedding provider, the other falls back to OpenAI. Make sure a working OpenAI API key is available, or configure both explicitly to avoid unexpected defaults.
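For example, to avoid either side silently falling back, set both groups of variables explicitly. The model names below are illustrative; the LLM-side variables are covered in the LLM Providers guide:
LLM_PROVIDER="openai"
LLM_MODEL="openai/gpt-4o-mini"
LLM_API_KEY="sk-..."
EMBEDDING_PROVIDER="openai"
EMBEDDING_MODEL="openai/text-embedding-3-large"
EMBEDDING_API_KEY="sk-..."
EMBEDDING_DIMENSIONS="3072"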

Configuration

Set these environment variables in your .env file:
  • EMBEDDING_PROVIDER — The provider to use (openai, gemini, mistral, bedrock, ollama, fastembed, openai_compatible, custom)
  • EMBEDDING_MODEL — The specific embedding model to use
  • EMBEDDING_DIMENSIONS — The vector dimension size (must match your vector store)
  • EMBEDDING_API_KEY — Your API key (falls back to LLM_API_KEY if not set)
  • EMBEDDING_ENDPOINT — Custom endpoint URL (for Azure, Ollama, or custom providers)
  • EMBEDDING_API_VERSION — API version (for Azure OpenAI)
  • EMBEDDING_MAX_TOKENS — Maximum tokens per request (optional)
  • HUGGINGFACE_TOKENIZER — HuggingFace Hub model ID used for token counting when EMBEDDING_PROVIDER is ollama

Provider Setup Guides

OpenAI provides high-quality embeddings with good performance.
EMBEDDING_PROVIDER="openai"
EMBEDDING_MODEL="openai/text-embedding-3-large"
EMBEDDING_DIMENSIONS="3072"
# Optional
# EMBEDDING_API_KEY=sk-...   # falls back to LLM_API_KEY if omitted
# EMBEDDING_ENDPOINT=https://api.openai.com/v1
# EMBEDDING_API_VERSION=
# EMBEDDING_MAX_TOKENS=8191
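To sanity-check your key and model before running Cognee, you can call the OpenAI embeddings endpoint directly and confirm the returned vector has 3072 values. Note that the openai/ prefix is LiteLLM routing and is not part of the API model name:
curl https://api.openai.com/v1/embeddings \
  -H "Authorization: Bearer $EMBEDDING_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "text-embedding-3-large", "input": "hello"}'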
Use Azure OpenAI Service for embeddings with your own deployment.
EMBEDDING_PROVIDER="openai"
EMBEDDING_MODEL="azure/text-embedding-3-large"
EMBEDDING_ENDPOINT="https://<your-az>.cognitiveservices.azure.com/openai/deployments/text-embedding-3-large"
EMBEDDING_API_KEY="az-..."
EMBEDDING_API_VERSION="2023-05-15"
EMBEDDING_DIMENSIONS="3072"
Use Google’s embedding models for semantic search.
EMBEDDING_PROVIDER="gemini"
EMBEDDING_MODEL="gemini/gemini-embedding-001"
EMBEDDING_API_KEY="AIza..."
EMBEDDING_DIMENSIONS="768"
Use Mistral’s embedding models for high-quality vector representations.
EMBEDDING_PROVIDER="mistral"
EMBEDDING_MODEL="mistral/mistral-embed"
EMBEDDING_API_KEY="sk-mis-..."
EMBEDDING_DIMENSIONS="1024"
Installation: Install the required dependency:
pip install mistral-common[sentencepiece]
Use embedding models provided by the AWS Bedrock service.
EMBEDDING_PROVIDER="bedrock"
EMBEDDING_MODEL="<your_model_name>"
EMBEDDING_DIMENSIONS="<dimensions_of_the_model>"
EMBEDDING_API_KEY="<your_api_key>"
EMBEDDING_MAX_TOKENS="<max_tokens_of_your_model>"
Run embedding models locally with Ollama for privacy and cost control.
EMBEDDING_PROVIDER="ollama"
EMBEDDING_MODEL="nomic-embed-text:latest"
EMBEDDING_ENDPOINT="http://localhost:11434/api/embed"
EMBEDDING_DIMENSIONS="768"
HUGGINGFACE_TOKENIZER="nomic-ai/nomic-embed-text-v1.5"
HUGGINGFACE_TOKENIZER is the HuggingFace repo ID of the tokenizer used for token-length counting when sending requests to the Ollama embedding endpoint.
Installation: Install Ollama from ollama.ai and pull your desired embedding model:
ollama pull nomic-embed-text:latest
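Once the model is pulled, you can confirm the embedding endpoint responds before pointing Cognee at it (assumes Ollama’s default port):
curl http://localhost:11434/api/embed -d '{"model": "nomic-embed-text:latest", "input": "hello"}'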
HUGGINGFACE_TOKENIZER is required when using Ollama. See the HUGGINGFACE_TOKENIZER environment variable section below for how to find the correct value for your model.
Zero-API-key setup: To run fully offline with no OpenAI key, you must configure both the LLM provider and the embedding provider to use local backends. See the Local Setup guide for a complete combined .env example.
Ollama falling behind? Ollama processes requests sequentially. If it becomes unresponsive or returns errors under load, reduce EMBEDDING_BATCH_SIZE (default 36) to send fewer chunks per call — values between 1 and 10 work well for most local hardware:
EMBEDDING_BATCH_SIZE="5"
Run embedding models locally with LM Studio for privacy and cost control.
EMBEDDING_PROVIDER="custom"
EMBEDDING_MODEL="lm_studio/text-embedding-nomic-embed-text-1.5"
EMBEDDING_ENDPOINT="http://127.0.0.1:1234/v1"
EMBEDDING_API_KEY="."
EMBEDDING_DIMENSIONS="768"
Installation: Install LM Studio from lmstudio.ai and download your desired model from LM Studio’s interface. Load your model, start the LM Studio server, and Cognee will be able to connect to it.
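To confirm the server is running and a model is loaded, you can list the models LM Studio exposes (assumes the default port 1234):
curl http://127.0.0.1:1234/v1/models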
Use Fastembed for CPU-friendly local embeddings without GPU requirements.
EMBEDDING_PROVIDER="fastembed"
EMBEDDING_MODEL="sentence-transformers/all-MiniLM-L6-v2"
EMBEDDING_DIMENSIONS="384"
Installation: Fastembed is included by default with Cognee.
Known Issues:
  • As of September 2025, Fastembed requires Python < 3.13 (not compatible with Python 3.13+)
Use embedding models from HuggingFace via the HuggingFace Inference API (serverless) or dedicated Inference Endpoints.
EMBEDDING_PROVIDER="custom"
EMBEDDING_MODEL="huggingface/BAAI/bge-large-en-v1.5"
EMBEDDING_API_KEY="hf_..."
EMBEDDING_DIMENSIONS="1024"
Installation: Install the HuggingFace extra for tokenizer support:
pip install cognee[huggingface]
HUGGINGFACE_TOKENIZER with HuggingFace embeddings: When using EMBEDDING_PROVIDER="custom" with a huggingface/ model, Cognee automatically attempts to load a HuggingFace tokenizer from the model repo for token counting. If that fails, it falls back to the TikToken tokenizer. You do not need to set HUGGINGFACE_TOKENIZER manually for this provider — it is only required when using EMBEDDING_PROVIDER="ollama" (see the Ollama section above).
Use vLLM to serve local or self-hosted embedding models with an OpenAI-compatible API.
Example with Qwen3-Embedding-4B on port 8001:
EMBEDDING_PROVIDER="custom"
EMBEDDING_MODEL="hosted_vllm/Qwen/Qwen3-Embedding-4B"
EMBEDDING_ENDPOINT="http://localhost:8001/v1"
EMBEDDING_API_KEY="."
EMBEDDING_DIMENSIONS="2560"
hosted_vllm/ prefix required: Include hosted_vllm/ at the start of the model name so LiteLLM routes requests to your vLLM server. The model name after the prefix should match the model ID returned by your vLLM server’s /v1/models endpoint.
Tokenization: Cognee automatically strips the hosted_vllm/ prefix when loading the HuggingFace tokenizer, so no separate HUGGINGFACE_TOKENIZER setting is needed as long as the model name after the prefix is a valid HuggingFace model ID.
To verify the model name your vLLM server exposes, run:
curl http://localhost:8001/v1/models
See the LiteLLM vLLM documentation for more details.
Use EMBEDDING_PROVIDER="openai_compatible" for any local inference server that exposes the standard /v1/embeddings endpoint. This provider talks directly to OpenAI-compatible embedding servers via the OpenAI Python SDK, bypassing LiteLLM.
Use this provider for: llama.cpp (llama-server --embedding), vLLM, Hugging Face TEI, LocalAI, Infinity, and similar servers.
Start llama.cpp with embedding support:
llama-server --model your-model.gguf --embedding --port 8080
EMBEDDING_PROVIDER="openai_compatible"
EMBEDDING_MODEL="default"
EMBEDDING_ENDPOINT="http://localhost:8080/v1"
EMBEDDING_API_KEY="no-key-required"
EMBEDDING_DIMENSIONS="768"
Endpoint normalisation: The engine automatically appends /v1 to EMBEDDING_ENDPOINT if it is missing, and strips a trailing /embeddings suffix. You can pass either http://localhost:8080 or http://localhost:8080/v1 — both work.
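To verify the server accepts embedding requests before starting Cognee, you can call the endpoint directly:
curl http://localhost:8080/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"model": "default", "input": "hello"}'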
Use OpenAI-compatible embedding endpoints from other providers such as DeepInfra, OpenRouter, or a company-internal server. These are routed through LiteLLM and require a provider prefix in the model name.
EMBEDDING_PROVIDER="custom"
EMBEDDING_MODEL="deepinfra/BAAI/bge-base-en-v1.5"
EMBEDDING_ENDPOINT="https://api.deepinfra.com/v1/openai"
EMBEDDING_API_KEY="<your-deepinfra-api-key>"
EMBEDDING_DIMENSIONS="768"
No endpoint normalisation for custom: Unlike openai_compatible, the custom provider passes EMBEDDING_ENDPOINT directly to LiteLLM as api_base with no automatic /v1 appending or /embeddings stripping. Set the endpoint to exactly the base URL your provider expects (e.g., https://api.deepinfra.com/v1/openai), or omit it entirely when using a named LiteLLM prefix such as openrouter/.
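To confirm the base URL and model name are what your provider expects, you can send a test request directly. The example below assumes DeepInfra’s OpenAI-compatible route shown above; other providers use their own base paths:
curl https://api.deepinfra.com/v1/openai/embeddings \
  -H "Authorization: Bearer $EMBEDDING_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "BAAI/bge-base-en-v1.5", "input": "hello"}'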

Additional Information

EMBEDDING_BATCH_SIZE controls how many text chunks are grouped into a single embedding API call. Cognee splits all chunks into batches of this size and sends them concurrently to the embedding engine.
Variable | Default | Description
EMBEDDING_BATCH_SIZE | 36 | Chunks per embedding API call
Local inference (Ollama, llama.cpp, LM Studio): Local servers handle one request at a time with limited concurrency. The default 36 can overwhelm them. Reduce the batch size if you see errors or slowdowns:
EMBEDDING_BATCH_SIZE="5"
Cloud providers: Larger batches reduce the number of API calls and are efficient with cloud APIs. The default 36 suits most cloud providers.
Relationship to rate limiting: Each batch counts as one request toward EMBEDDING_RATE_LIMIT_REQUESTS. A single file may produce many chunks — with EMBEDDING_BATCH_SIZE=36, a document split into 360 chunks generates 10 requests.
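Putting the two settings together: if your provider allows roughly 3,000 embedding requests per minute, a configuration like the following keeps the default batch size while staying safely under the limit (numbers are illustrative; always check your own tier):
EMBEDDING_BATCH_SIZE="36"
EMBEDDING_RATE_LIMIT_ENABLED="true"
EMBEDDING_RATE_LIMIT_REQUESTS="2700"
EMBEDDING_RATE_LIMIT_INTERVAL="60"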
The LiteLLMEmbeddingEngine applies two layers of protection against slow or unreachable endpoints:
Limit | Value | Configurable
Per-attempt timeout | 30 seconds | No (hardcoded)
Total retry window | 128 seconds | No (hardcoded)
How retries work: Failed attempts are retried with exponential back-off starting at 2 seconds, with random jitter, until the 128-second window is exhausted.
What is not retried: 404 Not Found errors are raised immediately — they indicate a configuration problem (wrong model name or endpoint) rather than a transient failure.
Common error messages and causes:
  • EmbeddingException: Embedding request timed out. Check EMBEDDING_ENDPOINT connectivity. — The endpoint did not respond within 30 seconds. Verify that EMBEDDING_ENDPOINT is reachable from your network.
  • EmbeddingException: Cannot connect to embedding endpoint. Check EMBEDDING_ENDPOINT. — TCP connection was refused or the server closed the connection before responding. Confirm the server is running and the URL is correct.
  • EmbeddingException: Failed to index data points using model <model> — The provider returned a 400 or 404 error. Common causes: wrong model name, missing hosted_vllm/ prefix for vLLM, or an unsupported model at that endpoint.
Diagnosing slow local servers: If you see repeated timeouts with Ollama, LM Studio, or vLLM, the 30-second per-attempt limit may be tight for large batches. Reduce EMBEDDING_BATCH_SIZE to send fewer texts per request:
EMBEDDING_BATCH_SIZE="5"
The timeout and retry values are hardcoded in LiteLLMEmbeddingEngine and cannot be changed via environment variables. To use different limits, subclass LiteLLMEmbeddingEngine and override embed_text with a custom @retry decorator.
Control client-side throttling for embedding calls to manage API usage and costs.
Rate limiting is disabled by default. You must explicitly set EMBEDDING_RATE_LIMIT_ENABLED="true" to activate it.
Defaults (when rate limiting is enabled):
Variable | Default | Meaning
EMBEDDING_RATE_LIMIT_ENABLED | false | Off by default — opt-in
EMBEDDING_RATE_LIMIT_REQUESTS | 60 | Max requests per interval
EMBEDDING_RATE_LIMIT_INTERVAL | 60 | Interval in seconds
What counts as one request? One rate-limit request = one embed_text() API call = one batch of chunks (not one chunk). With the default EMBEDDING_BATCH_SIZE=36, processing 360 chunks produces 10 requests. See the Batch Size section for how to tune batch size.
Sizing guidance: Set EMBEDDING_RATE_LIMIT_REQUESTS to your provider’s RPM limit and EMBEDDING_RATE_LIMIT_INTERVAL to 60. Use ~80–90% of your provider’s advertised limit to leave headroom.
Example configurations for common provider tiers
These examples target embedding endpoints, such as OpenAI embedding models like text-embedding-3-large.
EMBEDDING_RATE_LIMIT_ENABLED="true"
EMBEDDING_RATE_LIMIT_REQUESTS="2700"
EMBEDDING_RATE_LIMIT_INTERVAL="60"
EMBEDDING_RATE_LIMIT_ENABLED="true"
EMBEDDING_RATE_LIMIT_REQUESTS="180"
EMBEDDING_RATE_LIMIT_INTERVAL="60"
EMBEDDING_RATE_LIMIT_ENABLED="true"
EMBEDDING_RATE_LIMIT_REQUESTS="1350"
EMBEDDING_RATE_LIMIT_INTERVAL="60"
EMBEDDING_RATE_LIMIT_ENABLED="true"
EMBEDDING_RATE_LIMIT_REQUESTS="60"
EMBEDDING_RATE_LIMIT_INTERVAL="60"
Always verify your exact tier limits in your provider’s dashboard — limits vary by model, tier, and region. The examples above are approximations for common tiers and may change.
# Mock embeddings for testing (returns zero vectors)
MOCK_EMBEDDING="true"
The HUGGINGFACE_TOKENIZER environment variable specifies which Hugging Face tokenizer to use for counting tokens before sending text to the embedding model. This is required when using the Ollama provider.
Value format: The value is the Hugging Face model repository ID — the {organization}/{model-name} path that appears in the URL on huggingface.co/models. This should match the underlying model used by your Ollama embedding.
For example, if the Ollama model nomic-embed-text:latest is built from nomic-ai/nomic-embed-text-v1.5 on Hugging Face, set:
HUGGINGFACE_TOKENIZER="nomic-ai/nomic-embed-text-v1.5"

Common model-to-tokenizer mappings

Ollama model | HUGGINGFACE_TOKENIZER value | Dimensions
nomic-embed-text:latest | nomic-ai/nomic-embed-text-v1.5 | 768
bge-m3:latest | BAAI/bge-m3 | 1024
mxbai-embed-large:latest | mixedbread-ai/mxbai-embed-large-v1 | 1024
avr/sfr-embedding-mistral:latest | Salesforce/SFR-Embedding-Mistral | 4096
all-minilm:latest | sentence-transformers/all-MiniLM-L6-v2 | 384

Finding the tokenizer for any model

  1. Look up the model on huggingface.co/models.
  2. The repository ID is the {organization}/{model-name} part of the URL (e.g., huggingface.co/BAAI/bge-m3 → BAAI/bge-m3).
  3. Use the repository ID that corresponds to the model your Ollama tag is built from. The Ollama model page typically links to the original Hugging Face repository.
HUGGINGFACE_TOKENIZER is only used by the Ollama embedding engine. It is not needed for OpenAI, Fastembed, or other providers.

Important Notes

  • Dimension Consistency: EMBEDDING_DIMENSIONS must match your vector store collection schema
  • API Key Fallback: If EMBEDDING_API_KEY is not set, Cognee uses LLM_API_KEY (except for custom providers)
  • Tokenization: HUGGINGFACE_TOKENIZER is required for the Ollama provider — set it to the HuggingFace model repo ID that matches your embedding model
  • Performance: Local providers (Ollama, Fastembed) are slower but offer privacy and cost benefits
