LiteLLM Integration

Cognee integrates with LiteLLM to provide a unified interface for working with multiple Large Language Model (LLM) providers, so you can switch providers without changing application code. Under the hood, the integration combines the Instructor library with LiteLLM to produce type-safe, validated responses from 100+ LLM providers, with built-in rate limiting and retry mechanisms.

Key Features

  • 100+ Providers: Universal access to OpenAI, Anthropic, Google, Ollama, and 100+ other LLM providers through a single interface.
  • Type-Safe Outputs: Structured responses generated as validated, type-safe Pydantic models via the Instructor integration.
  • Rate Limiting: Built-in protection through automatic rate limiting, exponential backoff, and retries that prevent API quota exhaustion.
  • Fallback Support: High reliability through automatic fallback to backup configurations.

Architecture

The LiteLLM integration is built around a structured output framework that provides robust LLM operations:
1. LLMGateway: Central interface for all LLM operations, with provider abstraction.
2. Provider Adapters: Dedicated adapters for each LLM provider with optimized configurations.
3. Rate Limiting: Built-in rate limiting and retry mechanisms with exponential backoff.
4. Embedding Engine: LiteLLM-powered embedding functionality for semantic search.

Supported Providers

Cognee supports the following LLM providers through LiteLLM (selected via LLM_PROVIDER):

  • openai: OpenAI models (e.g., gpt-4o-mini)
  • anthropic: Anthropic Claude models (e.g., claude-3-5-sonnet-20241022)
  • gemini: Google Gemini models (e.g., gemini-pro)
  • ollama: Locally hosted models served by Ollama
  • custom: Any OpenAI-compatible custom endpoint

Configuration

Environment Variables

# Core LLM Configuration
LLM_PROVIDER=openai          # Provider: openai, anthropic, gemini, ollama, custom
LLM_MODEL=gpt-4o-mini        # Model name
LLM_API_KEY=your_api_key     # API key for the provider
LLM_ENDPOINT=                # Custom endpoint (optional)
LLM_API_VERSION=             # API version (optional)
LLM_TEMPERATURE=0.0          # Temperature setting
LLM_MAX_TOKENS=16384         # Maximum tokens
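
For example, a local Ollama configuration might look like the following; the model name and endpoint are illustrative and depend on your Ollama setup and Cognee version:

LLM_PROVIDER=ollama
LLM_MODEL=llama3.1                       # Any model pulled into your Ollama instance (example value)
LLM_API_KEY=ollama                       # Placeholder; Ollama typically does not require a real key
LLM_ENDPOINT=http://localhost:11434/v1   # Default local Ollama address (adjust as needed)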

Programmatic Configuration

Get Configuration
from cognee.infrastructure.llm import get_llm_config

# Get current configuration
config = get_llm_config()

# Access configuration values
print(f"Provider: {config.llm_provider}")
print(f"Model: {config.llm_model}")
print(f"Max Tokens: {config.llm_max_tokens}")
print(f"Temperature: {config.llm_temperature}")
print(f"Rate Limiting: {config.llm_rate_limit_enabled}")

Core Features

1. Structured Output Generation

Generate type-safe, validated responses using Pydantic models:
from cognee.infrastructure.llm.LLMGateway import LLMGateway
from pydantic import BaseModel

class EntityModel(BaseModel):
    name: str
    type: str
    confidence: float

# Generate structured output
result = await LLMGateway.acreate_structured_output(
    text_input="Extract entities from this text: Apple Inc. was founded by Steve Jobs.",
    system_prompt="You are an entity extraction expert. Extract named entities with confidence scores.",
    response_model=EntityModel
)

print(f"Entity: {result.name}")
print(f"Type: {result.type}")
print(f"Confidence: {result.confidence}")

2. Rate Limiting & Retry Logic

Built-in protection against API limits:
Configure Rate Limits
import os

from cognee.infrastructure.llm.LLMGateway import LLMGateway

# Enable rate limiting
os.environ["LLM_RATE_LIMIT_ENABLED"] = "true"
os.environ["LLM_RATE_LIMIT_REQUESTS"] = "60"
os.environ["LLM_RATE_LIMIT_INTERVAL"] = "60"

# Rate limiting is automatically applied to all LLM calls
result = await LLMGateway.acreate_structured_output(
    text_input="Your input",
    system_prompt="Your prompt",
    response_model=YourModel
)
# Automatically respects rate limits
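
The built-in mechanism already retries with exponential backoff. If you keep rate limiting disabled and want similar protection at the application level, a minimal sketch looks like this; the retry count, delays, and broad exception handling are assumptions, not Cognee defaults:

import asyncio

from cognee.infrastructure.llm.LLMGateway import LLMGateway

async def call_with_backoff(text_input, system_prompt, response_model,
                            retries: int = 5, base_delay: float = 1.0):
    """Application-level retry with exponential backoff (illustrative only)."""
    for attempt in range(retries):
        try:
            return await LLMGateway.acreate_structured_output(
                text_input=text_input,
                system_prompt=system_prompt,
                response_model=response_model,
            )
        except Exception:
            if attempt == retries - 1:
                raise
            # Wait 1s, 2s, 4s, ... before retrying
            await asyncio.sleep(base_delay * (2 ** attempt))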

3. Model Token Management

Automatic token limit detection and management:
from cognee.infrastructure.llm.utils import get_model_max_tokens

# Get max tokens for different models
gpt4_tokens = get_model_max_tokens("gpt-4o-mini")
claude_tokens = get_model_max_tokens("claude-3-5-sonnet-20241022")
gemini_tokens = get_model_max_tokens("gemini-pro")

print(f"GPT-4o-mini max tokens: {gpt4_tokens}")
print(f"Claude-3.5-Sonnet max tokens: {claude_tokens}")
print(f"Gemini Pro max tokens: {gemini_tokens}")

4. Embedding Engine Integration

LiteLLM powers the embedding functionality for semantic search:
from cognee.infrastructure.databases.vector.embeddings.LiteLLMEmbeddingEngine import LiteLLMEmbeddingEngine

# Initialize embedding engine
embedding_engine = LiteLLMEmbeddingEngine(
    model="openai/text-embedding-3-large",
    provider="openai",
    dimensions=3072
)

# Generate embeddings
texts = ["Sample text to embed", "Another text for embedding"]
embeddings = await embedding_engine.embed_text(texts)

print(f"Generated {len(embeddings)} embeddings")
print(f"Embedding dimensions: {len(embeddings[0])}")

Advanced Features

Fallback Configuration

Usage Examples

Structured Output Generation
from cognee.infrastructure.llm.LLMGateway import LLMGateway
from pydantic import BaseModel

class AnalysisModel(BaseModel):
    summary: str
    key_points: list[str]
    sentiment: str
    confidence: float

# Generate structured analysis
response = await LLMGateway.acreate_structured_output(
    text_input="Analyze this document: AI is revolutionizing healthcare...",
    system_prompt="You are a document analyzer. Provide structured analysis.",
    response_model=AnalysisModel
)

print(f"Summary: {response.summary}")
print(f"Key Points: {response.key_points}")
print(f"Sentiment: {response.sentiment}")
print(f"Confidence: {response.confidence}")

Advanced Integration

Audio Transcription

Performance & Monitoring

Performance Optimization

Efficient Processing
import asyncio

from cognee.infrastructure.llm.LLMGateway import LLMGateway
from cognee.infrastructure.llm.utils import get_model_max_tokens

# Token optimization: check the model's context limit before batching
max_tokens = get_model_max_tokens("gpt-4o-mini")

# Concurrent requests
tasks = [
    LLMGateway.acreate_structured_output(
        text_input=content,
        system_prompt="Your prompt",
        response_model=YourModel,
    )
    for content in content_list
]
results = await asyncio.gather(*tasks)
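
When fanning out many concurrent requests, it can also help to cap concurrency on top of rate limiting. A minimal sketch with asyncio.Semaphore; the limit of 10 is an arbitrary example:

import asyncio

from cognee.infrastructure.llm.LLMGateway import LLMGateway

semaphore = asyncio.Semaphore(10)  # At most 10 requests in flight

async def bounded_call(content: str):
    async with semaphore:
        return await LLMGateway.acreate_structured_output(
            text_input=content,
            system_prompt="Your prompt",
            response_model=YourModel,
        )

results = await asyncio.gather(*(bounded_call(c) for c in content_list))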

Observability

Monitoring Integration
# Langfuse integration (automatic)
# Structured logging
# Performance metrics
# Request/response tracking
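
If you use the Langfuse integration, it is typically configured through Langfuse's standard environment variables; the names below follow Langfuse's own conventions, so confirm them against your Cognee version:

LANGFUSE_PUBLIC_KEY=your_public_key
LANGFUSE_SECRET_KEY=your_secret_key
LANGFUSE_HOST=https://cloud.langfuse.com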

Dependencies

The LiteLLM integration requires specific package versions:
litellm = ">=1.71.0, <2.0.0"
instructor = ">=1.9.1, <2.0.0"
limits = ">=4.4.1, <5"
These dependencies are automatically installed when you install Cognee with LiteLLM support.

Error Handling

Automatic Error Detection

Best Practices

Configuration

Setup Guidelines
  • Always use structured outputs for type safety
  • Configure rate limits to match API quotas
  • Set up fallback configurations for reliability
  • Test connections before production deployment

Optimization

Performance Tips
  • Use appropriate models for your use case (cost vs. performance)
  • Monitor token usage to optimize costs
  • Use async operations for concurrent processing
  • Enable rate limiting to prevent quota exhaustion

Troubleshooting

Common Issues

Next Steps

Framework Status: LiteLLM is fully supported for multi-provider access and cost optimization. For new projects requiring advanced structured outputs, consider BAML for better type safety and prompt management.