vLLM is a high-performance inference engine with an OpenAI-compatible API. Connect your vLLM server to the AI Gateway as a custom provider.

What you’ll need

Overview

vLLM provides an OpenAI-compatible server. To use it with the AI Gateway, expose the server through an ngrok internal endpoint (or use an existing public HTTPS URL), register it as a custom provider, and attach any credentials through an access key configuration.

Getting started

1. Start vLLM

Start the OpenAI-compatible server:
vllm serve meta-llama/Llama-3.2-8B-Instruct
Verify it’s running:
curl http://localhost:8000/v1/models
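
The same check can be scripted. The following is a minimal sketch using only the Python standard library; it assumes the server listens on the default http://localhost:8000 and returns the usual OpenAI-style list body for /v1/models. The function names are illustrative and not part of vLLM or ngrok.

```python
import json
import urllib.request


def extract_model_ids(payload: dict) -> list:
    """Return the model IDs from an OpenAI-style /v1/models response body."""
    return [entry["id"] for entry in payload.get("data", [])]


def list_served_models(base_url: str = "http://localhost:8000", timeout: float = 5.0) -> list:
    """Fetch /v1/models from a running vLLM server and return the model IDs.

    Raises urllib.error.URLError if the server is not reachable.
    """
    with urllib.request.urlopen(f"{base_url}/v1/models", timeout=timeout) as resp:
        return extract_model_ids(json.load(resp))
```

Calling list_served_models() against the server started above should return a list that includes the model you passed to vllm serve. Those are the model IDs to register with the custom provider later.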

2. Expose with ngrok

Create an internal endpoint:
ngrok http 8000 --url https://vllm.internal
If vLLM already has a public HTTPS endpoint, skip this step and use that URL as the provider base URL instead.

3. Create the custom provider

See Create a custom provider. Use provider ID vllm, base URL https://vllm.internal, API format OpenAI Chat Completions, and your model IDs.

4. Store a provider key (if required)

If your vLLM server was started with an API key (for example, vllm serve <model> --api-key your-secret-key), add a provider key so the AI Gateway can authenticate to it.
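
Before storing the key, it can help to confirm that the server actually enforces it. The sketch below assumes vLLM expects the key as a standard Bearer token on its /v1 routes, as OpenAI-compatible servers do; the helper names are hypothetical and your-secret-key is the placeholder from the command above.

```python
from typing import Optional
import urllib.error
import urllib.request


def build_models_request(base_url: str, api_key: Optional[str] = None) -> urllib.request.Request:
    """Build a GET /v1/models request, adding a Bearer token when a key is given."""
    request = urllib.request.Request(f"{base_url}/v1/models")
    if api_key:
        request.add_header("Authorization", f"Bearer {api_key}")
    return request


def status_for(request: urllib.request.Request, timeout: float = 5.0) -> int:
    """Send the request and return the HTTP status code, including error codes."""
    try:
        with urllib.request.urlopen(request, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses arrive as exceptions; the code is what we want here.
        return err.code
```

With --api-key set, status_for(build_models_request("http://localhost:8000")) should come back as 401, and the same call with the key supplied should come back as 200.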

5. Configure access

Create an access key configuration that:
  1. Allows the vllm provider in the access scope
  2. Adds a routing rule with Bring your own API key if the server requires authentication
Assign the configuration to your access key.

6. Send requests

from openai import OpenAI

client = OpenAI(
    base_url="https://gateway.ngrok.ai/v1",
    api_key="ng-xxxxx-g1-xxxxx"
)

response = client.chat.completions.create(
    model="vllm:meta-llama/Llama-3.2-8B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}]
)

print(response.choices[0].message.content)
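
To make the request shape explicit, here is a rough standard-library sketch of the HTTP call the client issues. It assumes the gateway model name is the provider ID and the vLLM model ID joined by a colon, which is inferred from the example above rather than taken from a specification; gateway_model and build_chat_request are illustrative helpers, not part of any SDK.

```python
import json
import urllib.request


def gateway_model(provider_id: str, model_id: str) -> str:
    """Join a provider ID and model ID in the provider:model form (assumed format)."""
    return f"{provider_id}:{model_id}"


def build_chat_request(gateway_url: str, access_key: str, model: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style chat completions request."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{gateway_url}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            # The gateway access key, not the vLLM --api-key, goes here.
            "Authorization": f"Bearer {access_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the result to urllib.request.urlopen sends the request. Note that the client only ever presents the gateway access key; any vLLM key stays on the gateway side.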

Tips

  • Secure vLLM: Run with --api-key and attach the key through an access key configuration. The AI Gateway adds it to upstream requests server-side.
  • Gated models: Set HF_TOKEN before starting vLLM for Hugging Face gated models.
  • Timeouts: Large models can be slow. Increase perRequestTimeout and totalTimeout in account settings.
  • Multiple models: Run separate vLLM instances with different internal endpoints and register each as its own custom provider.
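
For the multiple-models tip, one way to keep instances, ports, and endpoints aligned is to describe them in one place and derive the commands from that. This is only a sketch: the provider IDs, model names, ports, and internal URLs below are hypothetical placeholders, and it assumes each instance is given its own port through the vllm serve --port flag.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VllmInstance:
    provider_id: str   # hypothetical custom provider ID registered in the gateway
    model: str         # placeholder model ID served by this instance
    port: int          # local port for this vLLM process
    internal_url: str  # ngrok internal endpoint for this instance

    def serve_command(self) -> str:
        """Shell command that starts this vLLM instance on its own port."""
        return f"vllm serve {self.model} --port {self.port}"

    def ngrok_command(self) -> str:
        """Shell command that exposes this instance on its internal endpoint."""
        return f"ngrok http {self.port} --url {self.internal_url}"


# Placeholder values; substitute your own models and endpoint names.
INSTANCES = [
    VllmInstance("vllm-a", "org/model-a", 8000, "https://vllm-a.internal"),
    VllmInstance("vllm-b", "org/model-b", 8001, "https://vllm-b.internal"),
]

for instance in INSTANCES:
    print(instance.serve_command())
    print(instance.ngrok_command())
```

Each entry then maps to one custom provider whose base URL is the entry's internal endpoint.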

Troubleshooting

| Symptom | Fix |
| --- | --- |
| Model loading errors | Check GPU memory with nvidia-smi; try --gpu-memory-utilization 0.9 or a smaller model |
| Connection timeouts | Verify the ngrok tunnel and vLLM health (curl http://localhost:8000/health); increase gateway timeouts |
| 401 unauthorized | Confirm the provider key in app.ngrok.ai matches your vLLM --api-key and is attached in the access key configuration |
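
The checks above can be combined into a small probe. The sketch below is an assumption-laden helper, not an official tool: it fetches a URL such as http://localhost:8000/health, and maps the outcome to the fixes listed in the table. The function names are illustrative.

```python
from typing import Optional
import urllib.error
import urllib.request


def http_status(url: str, timeout: float = 5.0) -> Optional[int]:
    """Return the HTTP status for a GET request, or None if nothing answered."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # the server answered with an error status
    except OSError:
        return None  # connection refused, DNS failure, or timeout


def suggest_fix(status: Optional[int]) -> str:
    """Map a probe result to the matching troubleshooting hint."""
    if status is None:
        return "No response: verify the ngrok tunnel and that vLLM is running"
    if status == 200:
        return "Healthy"
    if status == 401:
        return "Unauthorized: confirm the provider key matches your vLLM --api-key"
    return f"Unexpected HTTP status {status}"
```

For example, print(suggest_fix(http_status("http://localhost:8000/health"))) reports whether the local server answers at all, which separates vLLM problems from tunnel or gateway problems.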

Next steps