Running your own AI chat interface on a VPS gives you full control over data privacy, model selection, and costs. Instead of paying per-token API fees or sending sensitive prompts to third-party servers, you can self-host an interface that connects to local open-source models or cloud APIs — your choice, your rules.
This guide walks through deploying Open WebUI with Ollama on a Linux VPS. Open WebUI is the most popular self-hosted ChatGPT alternative (50k+ GitHub stars), and Ollama makes running large language models locally as simple as `ollama pull llama3`. We will also set up automated deployments with DeployHQ so configuration changes and customisations flow through a proper CI/CD pipeline.
Self-hosted AI in 2026: what has changed
The self-hosted AI landscape has matured significantly. Here is how it compares to using hosted APIs directly:
| | Cloud API (OpenAI, Anthropic) | Self-hosted (Ollama + Open WebUI) |
|---|---|---|
| Privacy | Prompts sent to third-party servers | Everything stays on your VPS |
| Cost model | Per-token billing, scales with usage | Fixed VPS cost, unlimited local inference |
| Model choice | Locked to provider's models | Run any open model (Llama 3, Mistral, Qwen, DeepSeek, Gemma) |
| Latency | Network round-trip + queue time | Local inference, no network dependency |
| Customisation | Limited to API parameters | Full control over system prompts, RAG pipelines, tools |
| Offline capability | None | Works without internet once models are downloaded |
Important clarification: ChatGPT is OpenAI's proprietary hosted service — you cannot install ChatGPT itself on a VPS. What you can do is run an equivalent chat interface backed by open-source models that run locally, or connect to cloud APIs (OpenAI, Anthropic, Google) through a unified self-hosted interface. That is exactly what Open WebUI provides.
Architecture overview
```mermaid
flowchart LR
    Browser["Browser"]
    Nginx["Nginx\n(TLS + reverse proxy)"]
    OW["Open WebUI\n(:3000)"]
    Ollama["Ollama\n(model runtime)"]
    Models["Local Models\n(Llama 3, Mistral, etc.)"]
    CloudAPI["Cloud APIs\n(OpenAI, Anthropic)\n(optional)"]
    DeployHQ["DeployHQ"]
    Git["Git Repo"]

    Browser -->|HTTPS :443| Nginx
    Nginx -->|HTTP :3000| OW
    OW -->|HTTP :11434| Ollama
    Ollama --> Models
    OW -.->|optional| CloudAPI
    Git -->|push| DeployHQ
    DeployHQ -->|SSH deploy| OW
```
Prerequisites
- A VPS with at least 4 vCPUs and 8 GB RAM (16 GB recommended for larger models)
- Ubuntu 22.04 or 24.04
- A domain name pointed at your VPS (e.g. `chat.example.com`)
- SSH access with a sudo-capable user
- Docker Engine and Docker Compose v2
GPU is optional. Ollama runs on CPU with quantised models (Q4/Q5). A 7B parameter model like Llama 3.2 runs comfortably on 8 GB RAM without a GPU. For faster inference or larger models (70B+), a GPU with 24 GB+ VRAM is recommended.
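Before installing anything, it is worth confirming the VPS actually meets these numbers. A quick check (Linux only; reads core count, total RAM, and free disk):

```shell
# Sanity-check the VPS resources against the prerequisites above
nproc                                                    # vCPU count
awk '/MemTotal/ {printf "%.1f GB RAM\n", $2/1048576}' /proc/meminfo
df -h / | awk 'NR==2 {print $4 " free disk on /"}'
```

Remember that each pulled model also consumes disk space (2–5 GB for the 3B–8B models used in this guide).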
Step 1: Install Docker
```shell
sudo apt update && sudo apt upgrade -y
curl -fsSL https://get.docker.com | sudo sh
sudo usermod -aG docker $USER
```
Log out and back in, then verify:
```shell
docker compose version
```
Step 2: Create the project structure
```shell
mkdir -p ~/ai-chat/{nginx,ollama-data,webui-data}
cd ~/ai-chat
```
Step 3: Write docker-compose.yml
```yaml
services:
  ollama:
    image: ollama/ollama:latest
    restart: unless-stopped
    volumes:
      - ./ollama-data:/root/.ollama
    environment:
      - OLLAMA_HOST=0.0.0.0
    ports:
      - "127.0.0.1:11434:11434"
    healthcheck:
      test: ["CMD", "ollama", "list"]
      interval: 30s
      timeout: 10s
      retries: 3

  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    restart: unless-stopped
    depends_on:
      ollama:
        condition: service_healthy
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
      - WEBUI_SECRET_KEY=${WEBUI_SECRET_KEY}
      - ENABLE_SIGNUP=false
    volumes:
      - ./webui-data:/app/backend/data
    ports:
      - "127.0.0.1:3000:8080"

  nginx:
    image: nginx:alpine
    restart: unless-stopped
    depends_on:
      - open-webui
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - ./nginx/default.conf:/etc/nginx/conf.d/default.conf:ro
      - /etc/letsencrypt:/etc/letsencrypt:ro
```
Key decisions:
- Ollama and Open WebUI bind to `127.0.0.1` only — Nginx handles all external traffic
- `ENABLE_SIGNUP=false` prevents strangers from creating accounts on your instance
- Persistent volumes ensure models and chat history survive container restarts
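The `${WEBUI_SECRET_KEY}` in the Compose file comes from a `.env` file, which Docker Compose reads automatically from the project directory. One way to generate it:

```shell
# Create .env with a random 64-character hex secret;
# umask 077 keeps the file readable only by your user.
umask 077
echo "WEBUI_SECRET_KEY=$(openssl rand -hex 32)" > .env
```

Keep `.env` out of Git — commit a `.env.example` placeholder instead (the repository layout in Step 8 does exactly that).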
Step 4: Configure Nginx with TLS
Obtain a certificate:
```shell
sudo apt install certbot -y
sudo certbot certonly --standalone -d chat.example.com --email you@example.com --agree-tos --no-eff-email
```
Create `nginx/default.conf`:

```nginx
upstream webui {
    server open-webui:8080;
}

server {
    listen 80;
    server_name chat.example.com;
    return 301 https://$host$request_uri;
}

server {
    listen 443 ssl http2;
    server_name chat.example.com;

    ssl_certificate     /etc/letsencrypt/live/chat.example.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/chat.example.com/privkey.pem;
    ssl_protocols TLSv1.2 TLSv1.3;
    ssl_ciphers HIGH:!aNULL:!MD5;

    add_header Strict-Transport-Security "max-age=31536000; includeSubDomains" always;
    client_max_body_size 50m;

    location / {
        proxy_pass http://webui;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;

        # WebSocket support (required for streaming responses)
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_read_timeout 300s;
    }
}
```
The WebSocket configuration is critical — without it, streaming chat responses will not work.
Step 5: Start the stack and pull your first model
```shell
cd ~/ai-chat
docker compose up -d
```
Wait for the containers to start, then pull a model:
```shell
docker compose exec ollama ollama pull llama3.2:3b
```
This downloads the Llama 3.2 3B model (~2 GB). For a more capable model:
```shell
docker compose exec ollama ollama pull llama3.1:8b   # 4.7 GB, good general purpose
docker compose exec ollama ollama pull mistral:7b    # 4.1 GB, strong at code
docker compose exec ollama ollama pull qwen2.5:7b    # 4.7 GB, multilingual
```
Model sizing guide
| Model | RAM needed | Best for |
|---|---|---|
| Llama 3.2 3B | ~4 GB | Quick responses, light tasks, low-resource VPS |
| Llama 3.1 8B | ~8 GB | General purpose, good quality/speed balance |
| Mistral 7B | ~8 GB | Code generation, technical writing |
| Qwen 2.5 14B | ~12 GB | Complex reasoning, multilingual |
| Llama 3.1 70B | ~40 GB | Maximum quality (requires GPU) |
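The RAM column follows a roughly linear pattern, which makes a back-of-envelope check possible before pulling a model. The constants below (0.75 GB per billion parameters for Q4-quantised weights, plus ~2 GB for KV cache and runtime) are rough assumptions fitted to the table, not official figures:

```shell
# Very rough heuristic, not an official formula: estimated RAM in GB
# for a Q4-quantised model of b billion parameters.
estimate_ram_gb() {
  awk -v b="$1" 'BEGIN { printf "%.1f\n", b * 0.75 + 2 }'
}

estimate_ram_gb 8   # an 8B model: roughly 8 GB, in line with the table
```

Actual usage varies with context length and quantisation level, so treat this as a floor, not a guarantee.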
Step 6: Access the interface
Open `https://chat.example.com` in your browser. On first visit:

- Create your admin account (since `ENABLE_SIGNUP=false`, only you can do this on first access)
- Select a model from the dropdown (you should see `llama3.2:3b` or whichever you pulled)
- Start chatting
Open WebUI provides:
- Multiple model switching — swap between models mid-conversation
- Document upload with RAG — upload PDFs or text files and ask questions about them
- Web search integration — augment model responses with live web results
- System prompts — customise model behaviour per conversation
- Chat history and export — full conversation management
- Multi-user support — create accounts for your team with role-based access
Step 7: (Optional) Connect cloud APIs
Open WebUI can also act as a unified interface for cloud APIs. In the admin panel:
- Go to Settings > Connections
- Add an OpenAI-compatible endpoint:
  - URL: `https://api.openai.com/v1`
  - API Key: your OpenAI key
- You can also add Anthropic, Google, or any OpenAI-compatible API
This lets you compare local model responses against cloud models side-by-side, or fall back to cloud APIs for tasks that exceed your local model's capability.
Step 8: Automate deployments with DeployHQ
As you customise Open WebUI (system prompts, model configurations, Nginx rules, Docker Compose changes), you want those changes version-controlled and automatically deployed.
8a: Repository structure
```
ai-chat-config/
  docker-compose.yml
  nginx/
    default.conf
  scripts/
    deploy.sh
    pull-models.sh
  .env.example
```
8b: Connect to DeployHQ
- Sign up or log in to DeployHQ
- Create a new project and connect your GitHub or GitLab repository
- Add an SSH server pointing to your VPS
- Set the deploy path to `/home/deploy/ai-chat/`
- Add a config file for `.env` to keep secrets out of Git
8c: Post-deploy command
In DeployHQ's SSH Commands section:
```shell
cd /home/deploy/ai-chat && bash scripts/deploy.sh
```
Your `scripts/deploy.sh`:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Pull latest images
docker compose pull

# Restart with updated configuration
docker compose up -d --remove-orphans

# Pull any new models defined in the model list
bash scripts/pull-models.sh

echo "AI chat stack deployed successfully"
```
Now every `git push` updates your configuration, restarts services if needed, and ensures new models are pulled.
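The `scripts/pull-models.sh` referenced by `deploy.sh` is yours to write; here is one hedged sketch, assuming a `models.txt` file with one model tag per line (blank lines and `#` comments ignored). `DRY_RUN=1` prints the commands instead of running them, which is handy for testing outside the VPS:

```shell
# Hypothetical scripts/pull-models.sh: pull every model listed in a file
# through the running ollama container.
pull_models() {
  while IFS= read -r model; do
    # Skip blank lines and comments
    case "$model" in ""|"#"*) continue ;; esac
    if [ "${DRY_RUN:-0}" = "1" ]; then
      echo "docker compose exec -T ollama ollama pull $model"
    else
      docker compose exec -T ollama ollama pull "$model"
    fi
  done < "${1:-models.txt}"
}

# Demo run against a sample list
printf 'llama3.2:3b\n# comment line\nmistral:7b\n' > /tmp/models.txt
DRY_RUN=1 pull_models /tmp/models.txt
```

Adding a model to `models.txt` and pushing then makes DeployHQ pull it automatically on the next deploy.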
Performance tuning
CPU inference optimisation
If running on CPU only, these environment variables can improve Ollama's performance:
```yaml
# Add to the ollama service in docker-compose.yml
environment:
  - OLLAMA_HOST=0.0.0.0
  - OLLAMA_NUM_PARALLEL=2        # concurrent requests
  - OLLAMA_MAX_LOADED_MODELS=1   # keep only one model in memory
```
Memory management
Ollama unloads models after 5 minutes of inactivity by default. On a RAM-constrained VPS, this is desirable. To keep models loaded longer:
```yaml
environment:
  - OLLAMA_KEEP_ALIVE=30m  # keep the model loaded for 30 minutes
```
Monitoring
Add a simple health check to your monitoring:
```shell
# Ollama health
curl -sf http://localhost:11434/api/tags > /dev/null && echo "Ollama OK" || echo "Ollama DOWN"

# Open WebUI health
curl -sf http://localhost:3000/health > /dev/null && echo "WebUI OK" || echo "WebUI DOWN"
```
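For cron or an external monitor, these one-liners can be wrapped into a small script. `check_url` below is a hypothetical helper, not part of either project; a cron-driven version would typically also exit non-zero on failure so your alerting can pick it up:

```shell
# Hypothetical scripts/check-health.sh: one status line per service.
# Succeeds only when the URL answers within 5 seconds with a 2xx/3xx status.
check_url() {
  curl -sf --max-time 5 "$1" > /dev/null
}

check_url http://localhost:11434/api/tags && echo "Ollama OK" || echo "Ollama DOWN"
check_url http://localhost:3000/health    && echo "WebUI OK"  || echo "WebUI DOWN"
```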
Security checklist
- Disable public signup (`ENABLE_SIGNUP=false`) — only you should create accounts
- Set a strong `WEBUI_SECRET_KEY` — used for session token signing
- Keep Ollama off the public internet — bind to `127.0.0.1` only (done in our Compose file)
- Enable automatic TLS renewal: `sudo certbot renew --deploy-hook "docker compose -f /home/deploy/ai-chat/docker-compose.yml restart nginx"`
- Update regularly: `docker compose pull && docker compose up -d`
- Back up chat data: the `webui-data/` volume contains all conversations and user data
Troubleshooting
| Symptom | Likely cause | Fix |
|---|---|---|
| `No models available` in Open WebUI | Models not pulled yet | Run `docker compose exec ollama ollama pull llama3.2:3b` |
| Open WebUI cannot connect to Ollama | `OLLAMA_BASE_URL` wrong | Verify it is `http://ollama:11434` (the Docker service name) |
| Streaming responses hang | Missing WebSocket proxy config | Add `proxy_http_version 1.1` and `Upgrade` headers in Nginx |
| Out of memory when loading model | Model too large for available RAM | Use a smaller quantised model (3B or 7B) |
| Slow inference | CPU-only with large model | Switch to a smaller model or add GPU passthrough |
What to do next
- Experiment with models — try different models for different tasks (code, writing, analysis)
- Set up RAG — upload your documentation and create a knowledge-augmented assistant
- Create team accounts — Open WebUI supports multi-user with role-based access
- Explore function calling — Open WebUI supports tool use with compatible models
- Add GPU acceleration — if you need faster inference, look into NVIDIA Container Toolkit for Docker GPU passthrough
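For the GPU route, the Compose change itself is small. A sketch, assuming the NVIDIA Container Toolkit is already installed on the host (this follows Docker Compose's documented `deploy.resources` GPU reservation syntax):

```yaml
# Addition to the ollama service in docker-compose.yml
services:
  ollama:
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
```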
For more on automating your deployment pipelines and managing Docker-based deployments, check out the DeployHQ blog.
If you have questions or need help, reach out at support@deployhq.com or on Twitter/X.