vLLM is a high-performance inference framework for large language models that you can self-host.
LLM_PROVIDER="openai"
LLM_MODEL="openai/<model name>"      # Must start with openai/
LLM_ENDPOINT="https://vllm-host/v1"  # Must end with /v1
LLM_API_KEY="<key>"
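This configuration assumes a vLLM server already running with its OpenAI-compatible API. As a rough sketch (the model name, host, and key below are placeholders, not values from this guide), you could start the server and check that the /v1 route answers like this:

# Start vLLM's OpenAI-compatible server (placeholder model and key)
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --host 0.0.0.0 \
  --port 8000 \
  --api-key "<key>"

# Verify the endpoint; the served model name is typically what follows
# the openai/ prefix in LLM_MODEL
curl https://vllm-host/v1/models -H "Authorization: Bearer <key>"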
Features:
Performance: PagedAttention for efficient memory usage
Models: Support for Llama, Mistral, CodeLlama, and more
Deployment: Docker, Kubernetes, or standalone server (see the Docker sketch below)
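For the Docker route, a minimal sketch using the official vllm/vllm-openai image might look like the following; the model, GPU flags, and cache mount are assumptions to adapt to your environment:

# Run the OpenAI-compatible server in Docker (placeholder model),
# caching downloaded weights on the host
docker run --gpus all --ipc=host \
  -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.1-8B-Instruct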