Run DeepSeek on a VPS with Ollama: Complete Self-Host Guide


Running large language models like DeepSeek-R1 on your own VPS or cloud server gives you control over your data, predictable costs, and the freedom to tune the runtime, none of which is guaranteed when you call a hosted API. This guide walks through self-hosting DeepSeek with Ollama on Ubuntu 24.04, putting it behind an Nginx reverse proxy, and wiring up automated deployments so configuration and supporting services ship from Git rather than manual SSH sessions.

If you have already deployed generative AI models with Ollama and Open WebUI, the workflow below will look familiar — DeepSeek slots into the same Ollama pipeline as Llama 3, Mistral, or Phi.

What you will end up with

  • DeepSeek-R1 running locally via Ollama, no third-party API calls
  • An Nginx reverse proxy with TLS so the model is only reachable over HTTPS
  • Open WebUI as a browser-based chat interface
  • A Git-backed configuration repo deployed via DeployHQ — every Nginx vhost, systemd unit, and Modelfile change is versioned, reviewable, and rollback-able

Hardware sizing: don't skip this

DeepSeek-R1 ships in several sizes. Picking the wrong one is the most common failure mode for self-hosted LLMs — the model loads, then crashes mid-generation when the kernel OOM-kills the process.

Model            | Quantized weights | Min RAM (CPU only) | VRAM (GPU) | Realistic tokens/sec
deepseek-r1:1.5b | ~1.1 GB           | 4 GB               | 2 GB       | 20–40 (CPU), 80+ (GPU)
deepseek-r1:7b   | ~4.7 GB           | 16 GB              | 8 GB       | 5–12 (CPU), 40–60 (GPU)
deepseek-r1:14b  | ~9 GB             | 32 GB              | 12 GB      | 2–4 (CPU), 25–35 (GPU)
deepseek-r1:32b  | ~20 GB            | 64 GB              | 24 GB      | <1 (CPU), 15–25 (GPU)
deepseek-r1:70b  | ~43 GB            | 128 GB             | 48 GB+     | unusable on CPU, 10–18 (GPU)

A few rules of thumb from running these in production:

  • CPU-only is fine for 1.5b and 7b if you accept ~10 tok/s. Anything larger needs a GPU to be usable interactively.
  • Reserve at least 2 GB of RAM for the OS and Nginx on top of the model footprint. A 16 GB box running 7b with no headroom will swap and feel broken.
  • NVMe storage matters. The first request after a restart or an eviction waits while Ollama reads the weights from disk into memory; SATA SSDs add 2–5 seconds to that cold start.
  • For a contained experiment, a 4 vCPU / 16 GB / NVMe VPS in the $25–40/mo range will run deepseek-r1:7b fine. Production workloads with multiple concurrent users belong on a GPU instance or scaled-out CPU pool.
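
The sizing rules above can be turned into a quick sanity check before you order a server. The sketch below is illustrative only: the figures are copied from the table, the function names are made up for this example, and the headroom number ignores context (KV cache) memory, so treat it as an upper bound rather than a guarantee.

```python
# Sizing figures copied from the table above:
# (model tag, quantized weights in GB, minimum CPU-only RAM in GB)
SIZES = [
    ("deepseek-r1:1.5b", 1.1, 4),
    ("deepseek-r1:7b", 4.7, 16),
    ("deepseek-r1:14b", 9.0, 32),
    ("deepseek-r1:32b", 20.0, 64),
    ("deepseek-r1:70b", 43.0, 128),
]


def largest_cpu_model(total_ram_gb):
    """Return the largest tag whose minimum-RAM figure fits, or None."""
    fitting = [tag for tag, _weights, min_ram in SIZES if min_ram <= total_ram_gb]
    return fitting[-1] if fitting else None


def headroom_gb(tag, total_ram_gb, reserved_gb=2.0):
    """RAM left after the weights and the OS/Nginx reserve (may be negative)."""
    weights = {t: w for t, w, _min_ram in SIZES}[tag]
    return total_ram_gb - weights - reserved_gb
```

For the 16 GB VPS used in this guide, largest_cpu_model(16) picks deepseek-r1:7b, and headroom_gb("deepseek-r1:7b", 16) shows roughly how much memory remains for context and Open WebUI.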

Prerequisites

  • A VPS or cloud instance running Ubuntu 24.04 (sized per the table above)
  • Root or sudo access
  • A domain name pointing at the server (required for HTTPS)
  • A Git repository for your configuration files
  • A DeployHQ account

Step 1: Initial server hardening

SSH into the server:

ssh root@your-server-ip

Update packages and install essentials:

apt update && apt upgrade -y
apt install -y python3 python3-pip python3-venv git ufw nginx certbot python3-certbot-nginx fail2ban

Configure the firewall — note that Ollama's default port 11434 is intentionally not opened to the internet. We expose Open WebUI on 443 via Nginx and keep Ollama on localhost.

ufw allow OpenSSH
ufw allow 80/tcp
ufw allow 443/tcp
ufw --force enable

Create a non-root deploy user:

adduser --disabled-password --gecos "" deploy
usermod -aG sudo deploy
mkdir -p /home/deploy/.ssh
chmod 700 /home/deploy/.ssh
touch /home/deploy/.ssh/authorized_keys
chmod 600 /home/deploy/.ssh/authorized_keys
chown -R deploy:deploy /home/deploy/.ssh

You will paste DeployHQ's deployment public key into /home/deploy/.ssh/authorized_keys in Step 5. See the Git-based deployment guide for the underlying workflow.

Step 2: Install Ollama

curl -fsSL https://ollama.com/install.sh | sh
ollama --version

The installer creates an ollama systemd service that listens on 127.0.0.1:11434. Verify it is up:

systemctl status ollama
curl http://127.0.0.1:11434/api/tags

The second command should return {"models":[]} — empty, but reachable. If it doesn't, check journalctl -u ollama -n 50.
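
If you would rather script this check than eyeball curl output, the same endpoint is easy to query from Python. This is a minimal sketch using only the standard library; the helper names are invented for the example, and it assumes the /api/tags response keeps the {"models": [...]} shape shown above, with a name field on each entry.

```python
import json
import urllib.request

OLLAMA_URL = "http://127.0.0.1:11434"  # default bind address used in this guide


def model_names(tags_json):
    """Extract model names from the JSON text returned by GET /api/tags."""
    payload = json.loads(tags_json)
    return [m["name"] for m in (payload.get("models") or [])]


def installed_models(base_url=OLLAMA_URL, timeout=5):
    """Query the local Ollama API and return the names of pulled models."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
        return model_names(resp.read().decode("utf-8"))
```

On a fresh install, installed_models() should return an empty list; a connection error means the service is not listening on localhost.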

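Ollama releases can change behavior between minor versions. For production, install a specific release instead of tracking latest (the install script honors an OLLAMA_VERSION environment variable), and keep runtime settings in a systemd drop-in at /etc/systemd/system/ollama.service.d/override.conf: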

[Service]
ExecStart=
ExecStart=/usr/local/bin/ollama serve
Environment="OLLAMA_KEEP_ALIVE=24h"
Environment="OLLAMA_NUM_PARALLEL=2"

OLLAMA_KEEP_ALIVE=24h keeps the model loaded in RAM (avoids re-loading on every request — saves 5–30 seconds per cold call). OLLAMA_NUM_PARALLEL=2 allows two concurrent generations; raise it only if you have RAM headroom.

Reload and restart:

systemctl daemon-reload
systemctl restart ollama

Step 3: Pull DeepSeek-R1

Pick the size that fits your hardware (see sizing table above):

# 7B is the sweet spot for a 16 GB CPU-only VPS
ollama pull deepseek-r1:7b

# Verify
ollama list
ollama run deepseek-r1:7b "Explain Git rebase in two sentences."

The pull downloads about 4.7 GB of weights. After that, everything runs locally.

Optional: tune generation defaults with a Modelfile

Create ~/deepseek.Modelfile:

FROM deepseek-r1:7b
PARAMETER temperature 0.7
PARAMETER num_ctx 8192
PARAMETER repeat_penalty 1.1
SYSTEM "You are a concise technical assistant. Prefer code and bullet points."

Build and use it:

ollama create deepseek-tech -f ~/deepseek.Modelfile
ollama run deepseek-tech "Show me a Python decorator for retries."

This Modelfile lives in your Git repo and ships via DeployHQ along with everything else.
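
The Modelfile sets defaults, but a caller can still override them per request through the options field of Ollama's /api/generate endpoint. The sketch below shows one way to do that from Python with the standard library. The function names are hypothetical, and the model name deepseek-tech assumes you built the Modelfile above.

```python
import json
import urllib.request


def build_generate_request(prompt, model="deepseek-tech", temperature=None, num_ctx=None):
    """Build the JSON body for POST /api/generate.

    Modelfile defaults apply unless an override is passed explicitly.
    """
    body = {"model": model, "prompt": prompt, "stream": False}
    options = {}
    if temperature is not None:
        options["temperature"] = temperature
    if num_ctx is not None:
        options["num_ctx"] = num_ctx
    if options:
        body["options"] = options
    return body


def generate(prompt, base_url="http://127.0.0.1:11434", **overrides):
    """Send the request to a local Ollama and return the generated text."""
    data = json.dumps(build_generate_request(prompt, **overrides)).encode("utf-8")
    req = urllib.request.Request(
        f"{base_url}/api/generate",
        data=data,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=600) as resp:
        return json.loads(resp.read())["response"]
```

Calling generate("Summarise this diff", temperature=0.2) keeps the system prompt and context size from the Modelfile and lowers only the temperature for that one call.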

Step 4: Open WebUI behind Nginx

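Open WebUI is the browser-based chat client. Install it into its own Python virtualenv so its dependencies don't conflict with system packages, then hand the directory to the deploy user the service runs as, because Open WebUI writes its data under the install directory by default.

python3 -m venv /opt/openwebui
/opt/openwebui/bin/pip install --upgrade pip
/opt/openwebui/bin/pip install open-webui
chown -R deploy:deploy /opt/openwebui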

Create a systemd unit at /etc/systemd/system/openwebui.service:

[Unit]
Description=Open WebUI
After=network.target ollama.service

[Service]
Type=simple
User=deploy
Environment="OLLAMA_BASE_URL=http://127.0.0.1:11434"
Environment="WEBUI_AUTH=true"
ExecStart=/opt/openwebui/bin/open-webui serve --host 127.0.0.1 --port 8080
Restart=on-failure

[Install]
WantedBy=multi-user.target

Reload systemd, then enable and start the service:

systemctl daemon-reload
systemctl enable --now openwebui

WEBUI_AUTH=true forces account creation on first visit — do not skip this. Without it, anyone who finds your domain can use your model and rack up your CPU time.

Nginx reverse proxy with TLS

Place a vhost at /etc/nginx/sites-available/deepseek:

limit_req_zone $binary_remote_addr zone=deepseek:10m rate=10r/s;

server {
    listen 80;
    server_name yourdomain.com;
    return 301 https://$server_name$request_uri;
}

server {
    listen 443 ssl http2;
    server_name yourdomain.com;

    # certbot fills in ssl_certificate / ssl_certificate_key

    client_max_body_size 50M;
    proxy_read_timeout 600s;
    proxy_send_timeout 600s;

    location / {
        limit_req zone=deepseek burst=20 nodelay;

        proxy_pass http://127.0.0.1:8080;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;

        # Streaming token responses need buffering off
        proxy_buffering off;
    }
}

Two non-obvious settings worth calling out:

  • proxy_read_timeout 600s — the default 60 seconds will cut off long-form generations on slower hardware mid-token. 10 minutes is generous and harmless.
  • proxy_buffering off — Open WebUI streams tokens via Server-Sent Events. Default Nginx buffering breaks the streaming UX and makes the model feel slow even when it isn't.
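
The limit_req line in the vhost is also worth understanding before you tune it. The sketch below is a simplified Python model of how rate=10r/s with burst=20 nodelay treats a single client, not Nginx's actual implementation: each client carries an excess counter that drains at the configured rate and grows by one per accepted request. The function name and the millisecond arrival list are assumptions made for the example.

```python
def simulate_limit_req(arrivals_ms, rate_per_s=10, burst=20):
    """Return one bool per request: True if it passes, False if rejected.

    arrivals_ms must be sorted ascending and describes one client's requests.
    """
    results = []
    excess = 0.0
    last = None
    for now in arrivals_ms:
        if last is None:
            # The first request from a client starts with an empty counter.
            results.append(True)
            last = now
            continue
        drained = rate_per_s * (now - last) / 1000.0
        candidate = max(0.0, excess - drained + 1.0)
        if candidate > burst:
            # Rejected requests do not update the counter (Nginx answers 503 by default).
            results.append(False)
        else:
            excess = candidate
            last = now
            results.append(True)
    return results
```

In this model a client that fires 30 requests at the same instant gets 21 through (one plus the burst) and the rest rejected, and a second later it has earned back ten more. A chat UI rarely needs more than that, but a page that loads many assets at once may.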

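Request the certificate first, then enable the vhost. Nginx refuses to load a listen ... ssl block that has no certificate, so nginx -t fails on the vhost as written. certbot certonly obtains the certificate without editing your config; add the two ssl_certificate lines where the comment sits in the 443 block before enabling the site:

certbot certonly --nginx -d yourdomain.com
# In the 443 server block of /etc/nginx/sites-available/deepseek:
#   ssl_certificate     /etc/letsencrypt/live/yourdomain.com/fullchain.pem;
#   ssl_certificate_key /etc/letsencrypt/live/yourdomain.com/privkey.pem;
ln -s /etc/nginx/sites-available/deepseek /etc/nginx/sites-enabled/
nginx -t
systemctl reload nginx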

Step 5: Automate with DeployHQ

So far every config file lives on the server. The point of Git-based deployment is that the next change — a tweaked Modelfile, a new Nginx rule, a switch from 7b to 14b — happens through a pull request, not an SSH session.

Repository layout

deepseek-host/
├── nginx/
│   └── deepseek.conf              # vhost from Step 4
├── systemd/
│   ├── openwebui.service
│   └── ollama-override.conf
├── ollama/
│   ├── deepseek.Modelfile
│   └── pull-models.sh             # idempotent: ollama pull deepseek-r1:7b
└── config/
    └── webui.env                  # OLLAMA_BASE_URL, WEBUI_AUTH, etc.

DeployHQ project setup

  1. In DeployHQ, create a new project and connect the repo via GitHub or GitLab.
  2. Add the server: hostname, deploy user (deploy), deploy path (e.g. /var/www/deepseek-config).
  3. Paste the deployment public key (DeployHQ shows it in Servers → SSH Keys) into /home/deploy/.ssh/authorized_keys.
  4. Enable automatic deployments so a push to main triggers a deploy.

SSH commands after deploy

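In the DeployHQ project, add these post-deploy SSH commands. They are idempotent, so they are safe to run on every deploy. Because the deploy user was created without a password, the sudo calls only work non-interactively if a sudoers rule grants that user passwordless access to the specific commands used here: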

# Install/refresh systemd units
sudo cp /var/www/deepseek-config/systemd/openwebui.service /etc/systemd/system/
sudo mkdir -p /etc/systemd/system/ollama.service.d/
sudo cp /var/www/deepseek-config/systemd/ollama-override.conf /etc/systemd/system/ollama.service.d/override.conf

# Install/refresh Nginx vhost
sudo cp /var/www/deepseek-config/nginx/deepseek.conf /etc/nginx/sites-available/deepseek
sudo nginx -t && sudo systemctl reload nginx

# Ensure required models are pulled
bash /var/www/deepseek-config/ollama/pull-models.sh

# Pick up unit changes and restart Open WebUI
sudo systemctl daemon-reload
sudo systemctl restart openwebui

For a zero-downtime deployment flow on the application layer, swap restart for reload where the unit supports it. Nginx does. Open WebUI does not, so expect a brief interruption to the chat UI while it restarts.

Monitoring: catch silent failures

LLM workloads have a specific failure mode that generic monitoring misses: the service stays up, but generations get slower and slower until they time out. Watch three things:

# 1. RAM pressure (the OOM killer is your enemy)
free -h
# Add this to a cron with alerting:
#   awk '/MemAvailable/ {if ($2 < 1000000) print "LOW MEM"}' /proc/meminfo

# 2. Ollama loaded models (should show your model warm)
curl -s http://127.0.0.1:11434/api/ps

# 3. Generation latency (cheap synthetic check every 5 minutes)
time curl -s http://127.0.0.1:11434/api/generate \
  -d '{"model":"deepseek-r1:7b","prompt":"hi","stream":false}' \
  > /dev/null

If the synthetic check exceeds 30 seconds, the model has most likely been evicted from RAM and is reloading from disk, which usually means you need more RAM or a longer OLLAMA_KEEP_ALIVE. Confirm with the /api/ps call above: an evicted model is missing from the list.
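
The checks above can be folded into one small script for cron. This is a sketch under stated assumptions: the thresholds mirror the ones in this section (about 1 GB of available memory, 30 seconds of latency), the function names are invented, and how you deliver the alert strings (mail, webhook, log line) is left to you.

```python
import json
import time
import urllib.request


def mem_available_kb(meminfo_text):
    """Return the MemAvailable value (kB) from /proc/meminfo text, or None if absent."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            return int(line.split()[1])
    return None


def timed_generate(model="deepseek-r1:7b", base_url="http://127.0.0.1:11434", timeout=120):
    """Run the same tiny prompt as the curl check and return elapsed seconds."""
    body = json.dumps({"model": model, "prompt": "hi", "stream": False}).encode("utf-8")
    req = urllib.request.Request(
        f"{base_url}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    start = time.monotonic()
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        resp.read()
    return time.monotonic() - start


def evaluate(mem_kb, latency_s, min_mem_kb=1_000_000, max_latency_s=30.0):
    """Return a list of alert strings; an empty list means both checks passed."""
    alerts = []
    if mem_kb is not None and mem_kb < min_mem_kb:
        alerts.append("LOW MEM")
    if latency_s > max_latency_s:
        alerts.append("SLOW GENERATION")
    return alerts
```

A cron entry would read /proc/meminfo, call timed_generate(), pass both values to evaluate(), and alert whenever the returned list is non-empty.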

Security checklist

  1. Auth on Open WebUI: WEBUI_AUTH=true is non-negotiable for an internet-facing instance.
  2. Rate limiting at Nginx: already in the vhost above. Tune rate=10r/s based on real usage.
  3. fail2ban for SSH: installed in Step 1 with sane defaults.
  4. No exposed Ollama port: port 11434 should never appear in ufw status. If it does, remove the rule.
  5. Update model weights deliberately, not automatically: ollama pull can replace a model mid-request and break in-flight generations. Pull during a maintenance window and bounce Open WebUI afterwards.
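
Item 4 is easy to verify automatically. The sketch below scans the text of ufw status for any ALLOW rule that names port 11434. It is a rough check written for this guide, with hypothetical function names; it assumes the plain (non-verbose) output format and does not catch port ranges that merely include 11434.

```python
import re
import subprocess


def exposed_rules(ufw_status_text, port=11434):
    """Return ufw rule lines whose "To" column mentions the given port."""
    pattern = re.compile(rf"(?<!\d){port}(?!\d)")
    hits = []
    for line in ufw_status_text.splitlines():
        columns = line.split()
        # Rule lines have at least To / Action / From; skip headers and blanks.
        if len(columns) >= 3 and "ALLOW" in columns and pattern.search(columns[0]):
            hits.append(line.strip())
    return hits


def current_ufw_status():
    """Read the live firewall state; needs root, as `ufw status` does."""
    result = subprocess.run(["ufw", "status"], capture_output=True, text=True, check=True)
    return result.stdout
```

Running exposed_rules(current_ufw_status()) as root should return an empty list on a server set up as described in Step 1.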

Questions, or hit a snag? Email support@deployhq.com or reach out on X / Twitter.

Happy deploying!