vLLM - ngrok documentation

vLLM is a high-performance inference engine with an OpenAI-compatible API. Connect your vLLM server to the AI Gateway as a custom provider.

What you’ll need

An ngrok account with AI Gateway access
vLLM installed
The ngrok agent installed
A GPU with sufficient VRAM for your chosen model
An access key from app.ngrok.ai

Overview

vLLM provides an OpenAI-compatible server. To use it through the AI Gateway, expose the server with an ngrok internal endpoint (or use an existing public HTTPS URL), register it as a custom provider, and attach any required credentials through an access key configuration.

Getting started

Start vLLM

Start the OpenAI-compatible server:

vllm serve meta-llama/Llama-3.2-8B-Instruct

Verify it’s running:

curl http://localhost:8000/v1/models
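
The same check can be scripted. The following is a minimal sketch, assuming vLLM is listening on port 8000 as in the curl command above and returns the standard OpenAI-style model list (a JSON object with a data array of entries that each carry an id). The helper names parse_model_ids and list_served_models are illustrative and not part of vLLM.

```python
import json
import urllib.request


def parse_model_ids(payload: dict) -> list[str]:
    """Extract model IDs from an OpenAI-style /v1/models response body."""
    return [entry["id"] for entry in payload.get("data", [])]


def list_served_models(base_url: str = "http://localhost:8000", timeout: float = 5.0) -> list[str]:
    """Fetch /v1/models from a running server and return the model IDs.

    Assumes the server is already up; raises urllib.error.URLError otherwise.
    """
    with urllib.request.urlopen(f"{base_url}/v1/models", timeout=timeout) as resp:
        return parse_model_ids(json.load(resp))
```

Calling list_served_models() against a running server should return the IDs the server exposes, which are presumably the model IDs you will enter when registering the custom provider.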

Expose with ngrok

Create an internal endpoint:

ngrok http 8000 --url https://vllm.internal

If vLLM already has a public HTTPS endpoint, skip this step and use that URL as the provider base URL instead.

Create the custom provider

Follow the steps in Create a custom provider. Use provider ID vllm, base URL https://vllm.internal (or your public HTTPS URL), API format OpenAI Chat Completions, and the IDs of the models your vLLM server hosts.

Store a provider key (if required)

If your vLLM server requires an API key (for example, because it was started with vllm serve model --api-key your-secret-key), add a provider key that stores the same value.
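
Before storing the key, it can help to confirm that the server accepts it when called directly. The following is a minimal sketch, assuming vLLM expects the key as a standard Authorization: Bearer header, the way OpenAI-compatible clients send it. build_models_request and key_is_accepted are hypothetical helpers, and your-secret-key is the placeholder value from the command above.

```python
import urllib.error
import urllib.request


def build_models_request(base_url: str, api_key: str) -> urllib.request.Request:
    """Build a GET /v1/models request carrying the key as a Bearer token."""
    return urllib.request.Request(
        f"{base_url}/v1/models",
        headers={"Authorization": f"Bearer {api_key}"},
    )


def key_is_accepted(base_url: str, api_key: str, timeout: float = 5.0) -> bool:
    """Return True on HTTP 200 and False on HTTP 401; re-raise anything else."""
    request = build_models_request(base_url, api_key)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as resp:
            return resp.status == 200
    except urllib.error.HTTPError as err:
        if err.code == 401:
            return False  # assumption: the server rejects a wrong key with 401
        raise
```

For example, key_is_accepted("http://localhost:8000", "your-secret-key") would be expected to return True only when the value matches the one passed to --api-key.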

Configure access

Create an access key configuration that:

Allows the vllm provider in the access scope
Adds a routing rule with Bring your own API key if the server requires authentication

Assign the configuration to your access key.

Send requests

from openai import OpenAI

client = OpenAI(
    base_url="https://gateway.ngrok.ai/v1",
    api_key="ng-xxxxx-g1-xxxxx"
)

response = client.chat.completions.create(
    model="vllm:meta-llama/Llama-3.2-8B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}]
)

print(response.choices[0].message.content)
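
In the request above, the model field joins the provider ID and the model ID with a colon. For environments without the OpenAI SDK, the sketch below builds what should be the equivalent raw HTTP request using only the standard library. It assumes the gateway serves the usual OpenAI path /chat/completions under the base URL, which is the path the SDK calls. gateway_model, build_chat_request, and send_chat are illustrative names, and the access key shown in the usage note is the same placeholder as above.

```python
import json
import urllib.request

GATEWAY_BASE_URL = "https://gateway.ngrok.ai/v1"


def gateway_model(provider_id: str, model_id: str) -> str:
    """Join provider ID and model ID in the provider:model form used above."""
    return f"{provider_id}:{model_id}"


def build_chat_request(access_key: str, provider_id: str, model_id: str, prompt: str) -> urllib.request.Request:
    """Build a POST request for a single-turn chat completion."""
    body = {
        "model": gateway_model(provider_id, model_id),
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{GATEWAY_BASE_URL}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {access_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send_chat(request: urllib.request.Request, timeout: float = 120.0) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        reply = json.load(resp)
    return reply["choices"][0]["message"]["content"]
```

A call such as send_chat(build_chat_request("ng-xxxxx-g1-xxxxx", "vllm", "meta-llama/Llama-3.2-8B-Instruct", "Hello!")) mirrors the SDK example. The generous default timeout is a guess that reflects how slowly large models can respond.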

Tips

Secure vLLM: Run with --api-key and attach the key through an access key configuration. The AI Gateway adds it to upstream requests server-side.
Gated models: Set HF_TOKEN before starting vLLM for Hugging Face gated models.
Timeouts: Large models can take a long time to respond. If requests time out, increase perRequestTimeout and totalTimeout in account settings.
Multiple models: Run separate vLLM instances with different internal endpoints and register each as its own custom provider.

Troubleshooting

Symptom	Fix
Model loading errors	Check GPU memory with `nvidia-smi`; try `--gpu-memory-utilization 0.9` or a smaller model
Connection timeouts	Verify the ngrok tunnel and vLLM health (`curl http://localhost:8000/health`); increase gateway timeouts
401 unauthorized	Confirm the provider key in app.ngrok.ai matches your vLLM `--api-key` and is attached in the access key configuration
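
The local checks in the table can be combined into a small diagnostic script. This is only a sketch, assuming vLLM runs on port 8000 as in the commands above. The probe and hint_for helpers and their messages are hypothetical and simply restate the fixes listed in the table.

```python
import urllib.error
import urllib.request
from typing import Optional


def probe(url: str, timeout: float = 5.0) -> Optional[int]:
    """Return the HTTP status code for url, or None if nothing answers."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 401.
        return err.code
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, or timeout.
        return None


def hint_for(status: Optional[int]) -> str:
    """Map a probe result to the matching fix from the troubleshooting table."""
    if status is None:
        return "unreachable: check that vLLM is running and the ngrok tunnel is up"
    if status == 401:
        return "unauthorized: confirm the provider key matches the vLLM --api-key"
    if status == 200:
        return "ok"
    return f"unexpected status {status}"
```

For example, print(hint_for(probe("http://localhost:8000/health"))) reports whether the local server is reachable before you look at gateway settings.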

Next steps

Use a model you run yourself: URL requirements and configuration
Provider Keys: Store upstream credentials
Quickstart: Create your first access key