Ollama runs open-source large language models locally. This guide shows you how to connect Ollama to the ngrok AI Gateway as a custom provider.

What you’ll need

Overview

Ollama serves an HTTP API on localhost. To use it through the gateway, expose it with an ngrok internal endpoint, register that endpoint as a custom provider, then route traffic through the gateway.

Getting started

1. Start Ollama

Start the Ollama server:
ollama serve
Pull a model if you haven’t already:
ollama pull llama3.2
Verify Ollama is running:
curl http://localhost:11434/api/tags
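The same check can be scripted. The following is a minimal sketch using only the Python standard library; it assumes Ollama is listening on the default address used in this guide, and the function names are illustrative rather than part of any Ollama client.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # address used in this guide


def parse_model_names(payload: dict) -> list[str]:
    """Return the model names listed in an /api/tags response body."""
    return [model["name"] for model in payload.get("models", [])]


def list_local_models(base_url: str = OLLAMA_URL, timeout: float = 5.0) -> list[str]:
    """Fetch /api/tags and return the names of locally available models.

    Raises urllib.error.URLError if Ollama is not reachable.
    """
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
        return parse_model_names(json.load(resp))
```

Calling list_local_models() returns the names of the models you have pulled, and an exception here usually means the server isn't running.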
2. Expose Ollama with ngrok

Create an internal endpoint with the ngrok agent:
ngrok http 11434 --url https://ollama.internal
Internal endpoints (.internal domains) are private to your ngrok account and are not reachable from the public internet. Run the agent under the same ngrok account that you use for the AI Gateway; otherwise the gateway can’t reach the endpoint.
3. Create the custom provider

See Create a custom provider. Use provider ID ollama, base URL https://ollama.internal, and your model IDs (for example llama3.2).
Ollama doesn’t require upstream authentication, so you can skip provider keys.
4. Configure access

Create an access key configuration that allows your ollama provider, then assign it to your access key. See the Quickstart if you haven’t created an access key yet.
5. Send requests

Point any OpenAI-compatible SDK at the gateway using your access key:
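from openai import OpenAI

# Send requests to the gateway instead of api.openai.com.
client = OpenAI(
    base_url="https://gateway.ngrok.ai/v1",
    api_key="ng-xxxxx-g1-xxxxx",  # your AI Gateway access key
)

# The model is addressed as <provider ID>:<model ID>.
response = client.chat.completions.create(
    model="ollama:llama3.2",
    messages=[{"role": "user", "content": "Hello!"}],
)

print(response.choices[0].message.content)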
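
If you'd rather not install an SDK, the same request can be made with the standard library. This is a sketch of the HTTP call that an OpenAI-compatible client sends to the base URL above; the helper names and the timeout value are illustrative assumptions.

```python
import json
import urllib.request

GATEWAY_URL = "https://gateway.ngrok.ai/v1"  # base URL from the SDK example


def build_chat_request(api_key: str, model: str, prompt: str) -> urllib.request.Request:
    """Build the POST request for an OpenAI-style chat completion."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{GATEWAY_URL}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send_chat(api_key: str, model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(api_key, model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        reply = json.load(resp)
    return reply["choices"][0]["message"]["content"]
```

For example, send_chat("ng-xxxxx-g1-xxxxx", "ollama:llama3.2", "Hello!") should return the same text as the SDK example.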

Tips

  • Slow first response: Ollama loads models into memory on first use. Increase perRequestTimeout in account settings if requests time out during warm-up.
  • Multiple instances: Create separate custom providers (for example ollama-gpu-1, ollama-gpu-2), each with its own internal endpoint. Pin a request to one instance with a model ID such as ollama-gpu-1:llama3.2, or list multiple models for failover.
  • Cloud fallback: Add a built-in provider to your access key configuration and request models: ["ollama:llama3.2", "openai:gpt-4o"] for cross-provider failover. See Multi-provider failover.
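
The failover and warm-up tips above might look like this from the OpenAI Python SDK. Treat it as a sketch under assumptions: the guide says to request a models list, but it does not show how the field is sent, so passing it through the SDK's extra_body parameter (and also setting model to the first entry) is a guess to verify against Multi-provider failover. The helper name is hypothetical.

```python
def failover_kwargs(models: list[str], prompt: str) -> dict:
    """Build keyword arguments for client.chat.completions.create().

    ASSUMPTION: the gateway reads an ordered `models` array from the
    request body; this is inferred from the tip above, not confirmed.
    """
    if not models:
        raise ValueError("at least one model is required")
    return {
        "model": models[0],  # the SDK requires `model`; use the preferred one
        "messages": [{"role": "user", "content": prompt}],
        "extra_body": {"models": list(models)},  # extra JSON merged into the body
    }
```

Usage would be client.chat.completions.create(**failover_kwargs(["ollama:llama3.2", "openai:gpt-4o"], "Hello!")). The SDK client also accepts its own timeout, for example OpenAI(timeout=120.0, ...), which is worth raising alongside perRequestTimeout so the client doesn't give up during model warm-up.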

Troubleshooting

Symptom | Fix
Connection refused | Confirm Ollama is running (curl http://localhost:11434/api/tags) and the ngrok tunnel is active
Model not found | Run ollama list, pull the model, and match the model ID exactly (including tags like :1b)
Out of memory | Use a smaller or quantized model, or set OLLAMA_NUM_PARALLEL=1
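
For the "Model not found" case, a small check like the one below can catch tag mismatches before a request is sent. It is a sketch with hypothetical helper names. It assumes gateway model strings have the provider:model form used in this guide, and it relies on Ollama treating a name without a tag as :latest. How the gateway itself matches model IDs is not covered here.

```python
def split_gateway_model(model: str) -> tuple[str, str]:
    """Split 'provider:model[:tag]' at the first colon."""
    provider, sep, model_id = model.partition(":")
    if not sep or not provider or not model_id:
        raise ValueError(f"expected 'provider:model', got {model!r}")
    return provider, model_id


def with_default_tag(model_id: str) -> str:
    """Append ':latest' when the model ID carries no explicit tag."""
    return model_id if ":" in model_id else f"{model_id}:latest"


def is_installed(gateway_model: str, installed_names: list[str]) -> bool:
    """Check a gateway model string against names reported by `ollama list`."""
    _, model_id = split_gateway_model(gateway_model)
    installed = {with_default_tag(name) for name in installed_names}
    return with_default_tag(model_id) in installed
```

With only llama3.2:latest installed, is_installed("ollama:llama3.2", ["llama3.2:latest"]) is true, while "ollama:llama3.2:1b" is not found until that tag is pulled.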

Next steps