Ollama - ngrok documentation

Ollama runs open-source large language models locally. This guide shows you how to connect Ollama to the ngrok AI Gateway as a custom provider.

What you’ll need

ngrok account with AI Gateway access
Ollama installed locally
ngrok agent installed
An access key from app.ngrok.ai

Overview

Ollama serves its API over plain HTTP on localhost (port 11434 by default). Expose it with an ngrok internal endpoint, register that endpoint as a custom provider, then route requests through the gateway.

Getting started

Start Ollama

Start the Ollama server:

ollama serve

Pull a model if you haven’t already:

ollama pull llama3.2

Verify Ollama is running:

curl http://localhost:11434/api/tags
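
If you would rather script this check, the same endpoint can be queried from Python. This is a minimal sketch using only the standard library. The function names parse_model_names and list_local_models are illustrative and not part of Ollama or ngrok, and the sketch assumes the /api/tags response is a JSON object with a models array whose entries carry a name field.

```python
import json
import urllib.request

# Ollama's default local address, as used in the curl command above.
OLLAMA_URL = "http://localhost:11434"


def parse_model_names(tags_payload: dict) -> list[str]:
    """Extract model names from a decoded /api/tags response body."""
    return [model["name"] for model in tags_payload.get("models", [])]


def list_local_models(base_url: str = OLLAMA_URL, timeout: float = 5.0) -> list[str]:
    """Return the model names Ollama reports.

    Raises urllib.error.URLError if the server is not reachable.
    """
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
        return parse_model_names(json.load(resp))


# Example usage (requires a running Ollama server):
#   print(list_local_models())
```

An empty list means the server is up but no models have been pulled yet. A connection error means Ollama is not running on that address.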

Expose Ollama with ngrok

Create an internal endpoint with the ngrok agent:

ngrok http 11434 --url https://ollama.internal

Internal endpoints (.internal domains) are private to your ngrok account and aren’t reachable from the public internet. Use the same ngrok account here and in the AI Gateway; otherwise the gateway can’t reach the endpoint.

Create the custom provider

See Create a custom provider. Use provider ID ollama, base URL https://ollama.internal, and your model IDs (for example llama3.2).

Ollama doesn’t require upstream authentication, so you can skip provider keys.

Configure access

Create an access key configuration that allows your ollama provider, then assign it to your access key. See the Quickstart if you haven’t created an access key yet.

Send requests

Point any OpenAI-compatible SDK at the gateway using your access key:

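from openai import OpenAI

# Point the client at the gateway, not at Ollama directly.
client = OpenAI(
    base_url="https://gateway.ngrok.ai/v1",
    api_key="ng-xxxxx-g1-xxxxx",  # your AI Gateway access key
)

# Model IDs take the form <provider ID>:<model ID>.
response = client.chat.completions.create(
    model="ollama:llama3.2",
    messages=[{"role": "user", "content": "Hello!"}],
)

print(response.choices[0].message.content)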

Tips

Slow first response: Ollama loads models into memory on first use. Increase perRequestTimeout in account settings if requests time out during warm-up.
Multiple instances: Create separate custom providers (for example ollama-gpu-1, ollama-gpu-2), each with its own internal endpoint. Pin a request to one instance with a model ID such as ollama-gpu-1:llama3.2, or list multiple models for failover.
Cloud fallback: Add a built-in provider to your access key configuration and request models: ["ollama:llama3.2", "openai:gpt-4o"] for cross-provider failover. See Multi-provider failover.
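
One way to soften the slow first response is to load the model before traffic arrives. The sketch below relies on Ollama's preload behavior, where a POST to /api/generate with a model name and no prompt loads the model into memory, and the keep_alive field controls how long it stays loaded. The helper names and the 10m value are illustrative choices, not recommendations from ngrok or Ollama.

```python
import json
import urllib.request


def build_preload_request(
    model: str,
    base_url: str = "http://localhost:11434",
    keep_alive: str = "10m",
) -> urllib.request.Request:
    """Build a POST to /api/generate with no prompt, which asks Ollama to load the model."""
    body = json.dumps({"model": model, "keep_alive": keep_alive}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def preload_model(model: str, timeout: float = 300.0) -> int:
    """Send the preload request directly to Ollama and return the HTTP status code."""
    with urllib.request.urlopen(build_preload_request(model), timeout=timeout) as resp:
        resp.read()  # drain the short JSON reply
        return resp.status


# Example usage (requires a running Ollama server):
#   preload_model("llama3.2")
```

This call goes straight to the local Ollama server rather than through the gateway, so it does not depend on perRequestTimeout.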

Troubleshooting

Symptom	Fix
Connection refused	Confirm Ollama is running (`curl http://localhost:11434/api/tags`) and the ngrok tunnel is active
Model not found	Run `ollama list`, pull the model, and match the model ID exactly (including tags like `:1b`)
Out of memory	Use a smaller or quantized model, or set `OLLAMA_NUM_PARALLEL=1` before starting `ollama serve` to limit concurrent requests per model
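
For the "Model not found" case, a small check like the following can catch tag mismatches before a request is sent. It is a sketch under two assumptions: that Ollama treats a name without a tag as the `:latest` tag, and that the model ID you register on the custom provider is passed to Ollama as written. The function names are hypothetical, and the input is the Ollama-side model ID without the provider prefix.

```python
def normalize_model_id(model_id: str) -> str:
    """Append Ollama's implicit ':latest' tag when the ID has no tag."""
    # Only inspect the part after the last '/', so a registry host:port prefix
    # is not mistaken for a tag.
    name = model_id.rsplit("/", 1)[-1]
    return model_id if ":" in name else f"{model_id}:latest"


def is_installed(model_id: str, installed: list[str]) -> bool:
    """Check a model ID against names as printed by `ollama list`."""
    wanted = normalize_model_id(model_id)
    return wanted in {normalize_model_id(name) for name in installed}


# Example: the names below stand in for the NAME column of `ollama list`.
installed_models = ["llama3.2:latest"]
print(is_installed("llama3.2", installed_models))     # True
print(is_installed("llama3.2:1b", installed_models))  # False
```

In the second call the base model is present but the `:1b` variant is not, which is the mismatch the table row describes.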

Next steps

Use a model you run yourself: URL requirements and local networking
Access Key Configurations: Scope providers per key
Quickstart: Create your first access key
