Self-Host the Ollama + Open WebUI Stack: A Foundational Guide

If you have searched for "how to run AI locally", "self-hosted ChatGPT", or "Ollama Open WebUI", you have probably noticed something: every tutorial jumps straight into install commands without ever explaining the stack itself. What is Ollama actually doing? Why do you also need Open WebUI? Where does Nginx fit? Which model should you pick, and how do you avoid wasting a weekend running a 70B model on a CPU-only VPS that will never finish a single response?

This guide is the foundational primer for the Ollama + Open WebUI self-hosting stack. We will cover what each component does, how the pieces connect, the trade-offs versus hosted APIs (ChatGPT, Claude, Gemini), a model-picking decision matrix, and a minimal install path. When you are ready to put it into production with TLS, zero-downtime deployments, and model-specific tuning, we link out to the deeper guides.

TL;DR

  • Ollama is the model runtime — it pulls, loads, and serves open-source LLMs over a local HTTP API at http://127.0.0.1:11434.
  • Open WebUI is the browser interface — chat history, RAG document upload, multi-user accounts, model gallery, prompt templates.
  • Nginx is the public-facing reverse proxy with TLS — never expose Ollama or Open WebUI directly to the internet.
  • For a small 7B model you need ~6 GB free RAM. A 70B model needs roughly 40 GB at Q4, which means a GPU with 48 GB VRAM, two consumer cards, or a very patient CPU.
  • Hosted APIs win on raw quality at the frontier (GPT-5, Claude Sonnet 4.5). Self-hosted wins on privacy, fixed cost, model variety, and offline capability.

What this stack actually is

Most "self-host ChatGPT" guides hand-wave over the architecture. Here is the precise picture.

flowchart LR
    Browser["Browser<br/>(your laptop, phone)"]
    Nginx["Nginx<br/>TLS + reverse proxy<br/>:443"]
    OW["Open WebUI<br/>chat UI, RAG, auth<br/>:3000 or :8080"]
    Ollama["Ollama<br/>model runtime<br/>:11434"]
    Models["Local models on disk<br/>~/.ollama/models"]

    Browser -->|HTTPS| Nginx
    Nginx -->|HTTP loopback| OW
    OW -->|OLLAMA_BASE_URL| Ollama
    Ollama -->|memory-map GGUF files| Models

Three components, three responsibilities. None of them is optional in a real deployment, and conflating them is the single most common source of confusion in this space.

Ollama — the model runtime

Ollama is the engine. It downloads quantised model weights (GGUF format, roughly 4-bit or 5-bit compressed versions of the original FP16 weights), loads them into RAM or VRAM, and exposes a simple HTTP API:

curl http://127.0.0.1:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Why is the sky blue?"
}'
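
The same call can be made from Python with nothing but the standard library. This is a minimal sketch, assuming Ollama is listening on its default address and that the llama3.2 model has already been pulled. The helper names build_generate_request and generate are ours, and "stream": false asks Ollama for a single JSON object instead of its default stream of chunks.

```python
import json
import urllib.request

OLLAMA_URL = "http://127.0.0.1:11434"  # default address used throughout this guide


def build_generate_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a POST request for Ollama's /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        f"{OLLAMA_URL}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, timeout: float = 600.0) -> str:
    """Send the request and return the generated text (needs a running Ollama)."""
    request = build_generate_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```

With Ollama running, generate("llama3.2", "Why is the sky blue?") should return the answer as a plain string. The generous timeout is an assumption that suits slow CPU-only generation; tune it to your hardware.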

Under the hood, Ollama wraps llama.cpp — the C++ inference library that made it possible to run LLMs on consumer hardware. You will never interact with llama.cpp directly; Ollama handles model management (ollama pull, ollama list, ollama rm), keeps recently used models warm in memory, and lazy-loads new ones on demand.

Two environment variables are worth knowing:

  • OLLAMA_KEEP_ALIVE — how long a model stays loaded after the last request. Default is 5 minutes. Set it to 24h if you have one constant-use model and enough RAM, or to 0 to unload immediately if RAM is tight.
  • OLLAMA_NUM_PARALLEL — how many requests Ollama processes in parallel against the same model. Default 1. Bumping to 2-4 helps if multiple users hit the same model simultaneously, at the cost of more RAM.
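
On a systemd-based install, these variables are usually set through a drop-in override for the ollama service (for example with sudo systemctl edit ollama). The sketch below only renders the text of such an override. The function name render_override and the example values are illustrative assumptions, not recommended settings.

```python
def render_override(env: dict[str, str]) -> str:
    """Render a systemd drop-in that sets environment variables for a service."""
    lines = ["[Service]"]
    for name, value in env.items():
        # One Environment= line per variable; the quotes keep values with spaces intact.
        lines.append(f'Environment="{name}={value}"')
    return "\n".join(lines) + "\n"


override = render_override({
    "OLLAMA_KEEP_ALIVE": "24h",   # keep the model loaded for a day
    "OLLAMA_NUM_PARALLEL": "2",   # allow two concurrent requests per model
})
print(override)
```

Paste the printed text into the override file, then restart the service with sudo systemctl restart ollama so the new values take effect.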

The DeepSeek deep-dive covers these in production-grade detail with real benchmarks, including a Modelfile, GPU vs CPU latency numbers, and a hardware sizing table — link further down.

Open WebUI — the browser experience

Ollama's HTTP API is great for scripts but unusable for daily chat. Open WebUI (50k+ GitHub stars) is the UI layer. What you actually get:

  • Chat with persistent history — SQLite-backed conversation storage by default; PostgreSQL in production
  • Multi-user accounts — admin and user roles, controlled by the WEBUI_AUTH environment variable
  • Model gallery — browse and ollama pull directly from the UI without touching SSH
  • RAG document upload — drop a PDF, Open WebUI chunks it, embeds it, and the model can answer questions about it
  • Prompt templates and system prompts per workspace — useful for keeping a code reviewer prompt separate from a marketing writer prompt
  • Cloud API connectors (optional) — Open WebUI can also talk to OpenAI, Anthropic, or Google Gemini APIs alongside your local Ollama, so you can A/B-test self-hosted Llama 3 against GPT-5 in the same UI

The two environment variables you will hit first:

  • WEBUI_AUTH=true: set this before exposing Open WebUI to the public internet. The default is true in recent versions, but earlier builds shipped with auth disabled by default, and any tutorial older than mid-2024 may quietly skip past this. An unauthenticated Open WebUI on port 8080 facing the internet is a free LLM for whoever finds it on Shodan.
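  • OLLAMA_BASE_URL=http://127.0.0.1:11434 tells Open WebUI where Ollama is (older releases called this variable OLLAMA_API_BASE_URL, which is now deprecated). If you run Open WebUI in Docker and Ollama on the host, set this to http://host.docker.internal:11434 (on Linux this also needs --add-host=host.docker.internal:host-gateway) or use --network=host.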

Reference: Open WebUI documentation and the features page for the current capability list.

Nginx — the only thing the public internet talks to

Neither Ollama nor Open WebUI should be reachable directly from the internet. Both default to plain HTTP, neither was designed to be a public-facing TLS terminator, and exposing port 11434 essentially says "free GPU time, please abuse me".

The pattern is always the same:

Internet (HTTPS :443)
        ↓
   Nginx (TLS via Let's Encrypt)
        ↓ (HTTP loopback)
   Open WebUI (:3000 or :8080)
        ↓ (HTTP loopback)
   Ollama (:11434)

Bind both Open WebUI and Ollama to 127.0.0.1, configure UFW or iptables to drop everything except 22, 80, 443, and let Nginx be the only thing that touches 0.0.0.0. The DeepSeek guide has a battle-tested Nginx vhost with proxy_buffering off (critical for streaming token-by-token responses) and 600-second read timeouts (long generations on CPU can exceed default proxy timeouts). Re-use that config; do not write your own from scratch.
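
One way to check the binding rule is to scan the output of ss -ltn for listeners on all interfaces. The following is a rough sketch, not a replacement for a firewall audit: it assumes the usual ss column layout, the SAMPLE text is made up for illustration, and the name public_listeners is ours.

```python
ALLOWED_PUBLIC_PORTS = {22, 80, 443}
WILDCARD_HOSTS = {"0.0.0.0", "*", "[::]", "::"}


def public_listeners(ss_output: str) -> list[tuple[str, int]]:
    """Return (host, port) pairs listening on all interfaces outside the allowed set."""
    found = []
    for line in ss_output.splitlines():
        fields = line.split()
        # Expected row shape: State Recv-Q Send-Q Local-Address:Port Peer-Address:Port
        if len(fields) < 5 or fields[0] != "LISTEN":
            continue
        host, _, port = fields[3].rpartition(":")
        if host in WILDCARD_HOSTS and port.isdigit() and int(port) not in ALLOWED_PUBLIC_PORTS:
            found.append((host, int(port)))
    return found


# Illustrative sample only; on a real server, feed in the output of `ss -ltn`.
SAMPLE = """\
State  Recv-Q Send-Q Local Address:Port  Peer Address:Port
LISTEN 0      4096       127.0.0.1:11434      0.0.0.0:*
LISTEN 0      4096         0.0.0.0:8080       0.0.0.0:*
LISTEN 0      511          0.0.0.0:443        0.0.0.0:*
LISTEN 0      128             [::]:22            [::]:*
"""

print(public_listeners(SAMPLE))
```

In the sample, only the 0.0.0.0:8080 listener is flagged, which is the signature of an Open WebUI instance that is not yet bound to loopback.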


Self-hosted vs hosted APIs: when does this make sense?

The honest answer is "sometimes". Self-hosting is not unconditionally better; it has real trade-offs.

| | Hosted API (GPT-5, Claude, Gemini) | Self-hosted (Ollama + Open WebUI) |
|---|---|---|
| Frontier capability | State-of-the-art reasoning, multimodal, long context | Trails frontier by 6-12 months on most tasks |
| Privacy | Prompts traverse third-party servers | Data never leaves your VPS |
| Cost shape | Per-token, scales linearly with usage | Fixed monthly VPS cost, unlimited inference |
| Latency | Network round-trip + provider queue | Local; constrained only by your CPU/GPU |
| Model choice | Whatever the provider ships | Any open model (Llama, Mistral, Qwen, DeepSeek, Gemma, Phi) |
| Customisation | API parameters, fine-tuning if offered | System prompts, RAG, custom Modelfiles, fine-tuning |
| Offline mode | None | Works without internet once weights are pulled |
| Compliance | Subject to provider's data handling | You control GDPR/HIPAA boundary |
| Operational burden | None | You manage updates, backups, monitoring |

Realistic picks:

  • "I want maximum quality on hard reasoning tasks once a week" → use the hosted API directly; you will not match GPT-5 on a $20/month VPS.
  • "I burn $200+/month on OpenAI API calls for repetitive workloads" → self-hosting on a beefier VPS pays back fast.
  • "I cannot send customer data to third parties" → self-hosted is the only option that does not require a Business Associate Agreement or DPA. The same logic applies to other tools: see our self-host GitLab on a VPS guide for the source-code equivalent of this argument.
  • "I want to experiment with 30 different open models without paying per token" → self-hosted, easily.
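
The cost argument comes down to simple arithmetic: a fixed VPS bill against a per-token price. Here is a back-of-the-envelope sketch. The $5 per million tokens figure is a placeholder assumption, not a quote from any provider, and the function names are ours.

```python
def breakeven_tokens_per_month(vps_usd_per_month: float, api_usd_per_million_tokens: float) -> float:
    """Tokens per month at which hosted API spend equals the fixed VPS bill."""
    if api_usd_per_million_tokens <= 0:
        raise ValueError("API price must be positive")
    return vps_usd_per_month / api_usd_per_million_tokens * 1_000_000


def monthly_api_cost(tokens_per_month: float, api_usd_per_million_tokens: float) -> float:
    """Hosted API spend for a given monthly token volume."""
    return tokens_per_month / 1_000_000 * api_usd_per_million_tokens


# Hypothetical inputs: a $20/month VPS against an API priced at $5 per million tokens.
print(breakeven_tokens_per_month(20, 5.0))   # 4000000.0 tokens per month
print(monthly_api_cost(40_000_000, 5.0))     # 200.0 USD per month
```

The comparison ignores quality differences and whether your VPS can actually serve that volume at acceptable speed, so treat the break-even point as a floor rather than a verdict.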

For the broader hardware vs cost vs model-quality decision, and how to budget VRAM against your real workload, our self-hosting AI models hardware guide lays out the numbers.


Picking a model: the decision matrix

This is where most people lose a weekend. The Ollama model library lists hundreds of models. You do not need to try them all. Pick by use case and hardware.

| Use case | Model class | Recommended starting point | RAM (Q4) | Notes |
|---|---|---|---|---|
| General chat, low spec | Small instruct | llama3.2:3b or phi3.5 | ~3 GB | Runs on a $10/month VPS without a GPU |
| General chat, mid spec | 7-8B instruct | llama3.1:8b or mistral:7b | ~6 GB | The default choice: quality plateau for daily chat |
| Coding assistant | Code-specialised | qwen2.5-coder:7b or deepseek-coder-v2:16b | ~6-12 GB | Trained on code, far better than a general 7B for autocomplete |
| Reasoning / step-by-step | Reasoning-tuned | deepseek-r1:7b or larger | ~6 GB+ | Slower because it generates a long chain-of-thought before answering |
| RAG over your documents | Strong instruction-following | llama3.1:8b + a good embedding model (nomic-embed-text) | ~6 GB + ~500 MB | Open WebUI handles the chunking and retrieval automatically |
| Multimodal (image input) | Vision model | llava:7b or llama3.2-vision:11b | ~6-9 GB | OCR, image Q&A, screenshot description |
| Frontier-quality, beefy hardware | 70B class | llama3.3:70b or qwen2.5:72b | ~40 GB+ | Needs a GPU with 48 GB VRAM or two consumer cards |

The 3-to-6-to-13 rule of thumb (Q4 quantisation):

  • A 3B model needs roughly 3 GB of RAM to load.
  • A 7-8B model needs roughly 6 GB.
  • A 13B model needs roughly 9-10 GB.
  • A 70B model needs roughly 40 GB.

Add 1-2 GB of headroom for the OS, Open WebUI, and the model's KV cache. If you only have 8 GB total on your VPS, do not try to squeeze in a 13B model — Linux's OOM killer will end your ollama process at the worst possible moment.
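
The rule of thumb is easy to turn into a quick sanity check before you pull anything. This sketch uses only the approximate figures from this section (taking the upper 10 GB bound for 13B). The function name and the 1 GB default headroom are our assumptions; raise the headroom towards 2 GB if you plan long contexts.

```python
# Approximate RAM needed to load a Q4 model, taken from the rule of thumb in this section.
Q4_RAM_GB = {"3B": 3.0, "7-8B": 6.0, "13B": 10.0, "70B": 40.0}


def largest_class_that_fits(total_ram_gb: float, headroom_gb: float = 1.0):
    """Return the biggest model class whose Q4 footprint fits, or None if nothing does."""
    available = total_ram_gb - headroom_gb
    fitting = [(ram, name) for name, ram in Q4_RAM_GB.items() if ram <= available]
    # max() compares the RAM figure first, so it picks the largest class that fits.
    return max(fitting)[1] if fitting else None


for total in (4, 8, 16, 48):
    print(f"{total} GB VPS -> {largest_class_that_fits(total)}")
```

An 8 GB VPS lands on the 7-8B class, which matches the warning above about not squeezing in a 13B model.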

For model-specific deep-dives that include real latency numbers and hardware sizing per parameter count, see the production install and DeepSeek guides linked at the end of this article.


Minimal install path

This is the foundational install. It is intentionally lighter than the production guides linked above — the goal here is to get you talking to a model in the next ten minutes. When you are ready to harden it (TLS, systemd units, automated rollbacks), follow the deeper guides.

Prerequisites

  • Ubuntu 22.04 or 24.04 VPS with at least 8 GB RAM (4 GB if you stick to a 3B model)
  • A sudo user with SSH access
  • 20 GB of free disk for the OS and a couple of models

1. Install Ollama

curl -fsSL https://ollama.com/install.sh | sh
ollama --version

The installer registers a systemd service that listens on 127.0.0.1:11434 by default — exactly what you want. Verify it is running:

systemctl status ollama

2. Pull your first model

ollama pull llama3.2:3b   # ~2 GB download, runs on small VPS
ollama run llama3.2:3b "Explain TCP backpressure in two sentences."

If that prints a response, the runtime layer works. You can stop here if you only want a CLI tool.
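
If you would rather confirm the pull from a script, Ollama lists local models at GET /api/tags. This is a small sketch, assuming the default address. Both function names are ours, and only the "models" list and each entry's "name" field are read from the response.

```python
import json
import urllib.request


def installed_models(tags_response: dict) -> list[str]:
    """Extract model names from the JSON returned by GET /api/tags."""
    return [m["name"] for m in tags_response.get("models", [])]


def fetch_installed_models(base_url: str = "http://127.0.0.1:11434") -> list[str]:
    """Ask a running Ollama instance which models are on disk."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=10) as resp:
        return installed_models(json.load(resp))
```

On the VPS, "llama3.2:3b" in fetch_installed_models() should evaluate to True once the pull has finished.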

3. Run Open WebUI

The simplest path is Docker (install Docker Engine first if it is not already on the VPS):

docker run -d \
  --name open-webui \
  --network=host \
  -e WEBUI_AUTH=true \
  -e OLLAMA_BASE_URL=http://127.0.0.1:11434 \
  -v open-webui:/app/backend/data \
  --restart=always \
  ghcr.io/open-webui/open-webui:main

Open http://your-server-ip:8080 in a browser. The first account you create becomes the admin.

Do not stop here if your VPS is on the public internet. Open WebUI on port 8080 with no TLS is fine for a local network, but exposing it to the internet without Nginx + Let's Encrypt is asking to leak whatever you and your colleagues type into it.

4. Put Nginx in front (production-bound)

For the full Nginx vhost with TLS, proxy_buffering off (critical so streaming token-by-token responses are not held in a buffer), and the 600-second read timeouts you need for slow CPU generations, graduate to the production install — link in the next section.


Open WebUI features worth knowing about

Most tutorials install Open WebUI and then never come back to it. That is a mistake — half the value of self-hosting is in the features the hosted UIs do not give you.

RAG: chat with your own documents

Click the + next to the chat input and upload a PDF, Markdown file, or text dump. Open WebUI:

  1. Chunks the document (default ~1000 tokens per chunk with overlap)
  2. Embeds each chunk using a local embedding model (pull nomic-embed-text for a good free default)
  3. Stores embeddings in a local vector store
  4. Retrieves the top-K chunks for each user question and injects them into the prompt
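
The four steps above can be sketched in a few lines of Python. This is a toy illustration of the technique, not Open WebUI's implementation: the bag-of-words embed function is a deliberately crude stand-in for a real embedding model such as nomic-embed-text, chunks are measured in words rather than tokens, and every function name here is ours.

```python
import math
from collections import Counter


def chunk_words(text: str, size: int = 50, overlap: int = 10) -> list[str]:
    """Split text into word windows of `size`, each sharing `overlap` words with the previous one."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    words = text.split()
    if not words:
        return []
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, max(len(words) - overlap, 1), step)]


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase bag-of-words counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def top_k(question: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the question."""
    q = embed(question)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


def build_prompt(question: str, context: list[str]) -> str:
    """Inject the retrieved chunks ahead of the question."""
    joined = "\n---\n".join(context)
    return f"Use only this context:\n{joined}\n\nQuestion: {question}"


chunks = [
    "nginx terminates tls on port 443",
    "ollama serves models on port 11434",
    "open webui stores chat history in sqlite",
]
print(build_prompt("which port does ollama use", top_k("which port does ollama use", chunks, k=1)))
```

A real pipeline swaps embed for calls to an embedding model and keeps the vectors in a vector store, but the retrieve-then-inject shape stays the same.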

This is real RAG, not a "the model has read it" hack. Model size matters less here than instruction-following quality: a well instruction-tuned llama3.1:8b can outperform larger but less instruction-tuned models on retrieval tasks.

Multi-user mode

Set WEBUI_AUTH=true (the default in recent versions). The first registered account is admin. From the admin panel you can:

  • Approve or block new sign-ups
  • Restrict which models each user can access
  • Set per-user rate limits
  • Audit conversation logs

For a small team, this turns a single $20/month VPS into a private LLM service that several people can share without you having to worry about per-seat hosted-API billing.

Model gallery

The Models tab lets you pull new models from the Ollama library directly through the UI, with no SSH needed. You can also create custom Modelfiles (system prompt, temperature, context window) and save them as named personas. Think of it as Open WebUI's answer to OpenAI's GPT Builder, except you own the runtime.

Cloud API connectors

Open WebUI can talk to OpenAI, Anthropic, and Google Gemini APIs at the same time as your local Ollama. Why bother? Because then you can:

  • Side-by-side compare a local 7B model against GPT-5 in the same conversation
  • Route cheap repetitive requests to local Ollama and hard reasoning requests to a frontier API from the same UI
  • Keep using the frontier APIs without your team scattering into half a dozen separate ChatGPT/Claude subscriptions

Where to go from here

You now understand the stack: Ollama runs the model, Open WebUI is the UI, Nginx terminates TLS in front of both. From here, pick the path that matches your actual goal:

Once your stack is running, push your docker-compose.yml, Nginx vhost, and a model-pull script into a Git repo and connect it to DeployHQ. Every config change — new model, tweaked system prompt, updated Nginx rule — flows through the same automatic deployments from Git that you would use for any other production app, with one-click rollback when a config change accidentally takes down chat at 4 PM on a Friday. You can deploy from GitHub or deploy from GitLab — same pipeline either way.

Start a free DeployHQ trial to run your AI stack through a real CI/CD pipeline, or see pricing for team plans.


Questions, war stories, or a model recommendation we missed? Email us at support@deployhq.com or ping us on @deployhq.