Running AI models on your own infrastructure instead of calling cloud APIs gives you three things that no hosted service can: complete data privacy, predictable costs, and the freedom to choose any model. The trade-off is that you need the right hardware and a basic understanding of how large language models use memory.

This guide covers the practical side of self-hosting: what hardware you actually need, which models are worth running locally in 2026, and how the costs compare to cloud APIs. If you want a hands-on tutorial with Docker Compose and [DeployHQ](https://www.deployhq.com), see our step-by-step guide to [self-hosting Open WebUI and Ollama on a VPS](https://www.deployhq.com/blog/how-to-install-and-run-chatgpt-on-a-vps).

## Why self-host AI models?

### Data stays on your servers

When you call the OpenAI or Anthropic API, every prompt and response passes through their servers. For most use cases that is fine — but if you work with customer PII, medical records, legal documents, or proprietary code, sending that data to a third party may violate compliance requirements or internal security policies.

Self-hosted models process everything locally. The data never leaves your network.

### Predictable, fixed costs

Cloud API pricing scales with usage. A team of 20 developers using GPT-4o for code review can easily spend $500–2,000/month in API fees. A self-hosted 8B model on a $50/month VPS handles unlimited requests at a fixed cost — the model does not meter tokens.

### Full control over models and behaviour

You choose which model to run, how it is configured, and what system prompts it uses. You can fine-tune models on your own data, swap models without changing application code, and run multiple models side-by-side for different tasks.

## Understanding the hardware: why VRAM matters most

The single most important factor in self-hosting AI is **GPU VRAM** (video memory). A language model must be loaded into memory before it can generate text. If it fits entirely in VRAM, you get fast inference (30–50 tokens/second). If it overflows to system RAM, inference drops to 1–5 tokens/second — unusable for interactive chat.

The rule of thumb: **you need roughly 0.5 GB of VRAM per billion parameters** when using 4-bit quantisation (the standard for self-hosting).

### Hardware tiers

| Tier | RAM | GPU | Models you can run | Monthly VPS cost |
| --- | --- | --- | --- | --- |
| **Starter** | 8 GB | None (CPU only) | 1B–3B models (Llama 3.2 1B, Phi-3 Mini) | $5–15 |
| **Developer** | 16 GB | None or 8 GB VRAM | 7B–8B models (Llama 3.1 8B, Mistral 7B) | $25–50 |
| **Professional** | 32 GB | 16–24 GB VRAM | 13B–30B models (Qwen 2.5 14B, CodeLlama 34B) | $80–200 |
| **Enterprise** | 64 GB+ | 48 GB+ VRAM (or multi-GPU) | 70B+ models (Llama 3.1 70B, DeepSeek V3) | $300+ |

**Not sure if your hardware is enough?** Use [Can I Run AI?](https://www.canirun.ai/) to check whether a specific model will fit on your machine before downloading anything. It estimates VRAM and RAM requirements based on your hardware specs and the model's size.

**CPU-only is viable for small models.** A 7B model quantised to 4-bit runs on 8 GB of system RAM at ~5 tokens/second. That is slow for chat but acceptable for batch processing, summarisation, or code review where latency is less critical.

### Quantisation: the key to fitting models in less memory

Full-precision models use 16 bits per parameter (FP16). A 70B model at FP16 needs ~140 GB — far more than any single consumer GPU. **Quantisation** reduces precision to 4 or 5 bits with minimal quality loss:

| Quantisation | Memory per 1B params | 7B model | 70B model | Quality impact |
| --- | --- | --- | --- | --- |
| FP16 (full) | ~2 GB | ~14 GB | ~140 GB | Baseline |
| Q8 (8-bit) | ~1 GB | ~7 GB | ~70 GB | Negligible |
| Q5\_K\_M (5-bit) | ~0.65 GB | ~4.5 GB | ~45 GB | Very minor |
| Q4\_K\_M (4-bit) | ~0.5 GB | ~3.5 GB | ~35 GB | Minor on most benchmarks |

**Q4\_K\_M** is the sweet spot for self-hosting: it fits models in roughly a quarter of the full-precision memory while retaining 95%+ of benchmark performance. Ollama uses this quantisation level by default.

## Best models for self-hosting in 2026

The open-source model landscape moves fast. Here are the current leaders by use case:

### General purpose

| Model | Parameters | Min VRAM | Strengths |
| --- | --- | --- | --- |
| Llama 3.2 3B | 3B | 4 GB RAM (CPU) | Fast, lightweight, good for simple tasks |
| Llama 3.1 8B | 8B | 8 GB | Best quality/speed ratio for most use cases |
| Qwen 2.5 14B | 14B | 12 GB | Strong reasoning, excellent multilingual support |
| Llama 3.1 70B | 70B | 40 GB | Near-GPT-4 quality, requires serious hardware |

### Code generation

| Model | Parameters | Min VRAM | Strengths |
| --- | --- | --- | --- |
| DeepSeek Coder V2 | 16B | 12 GB | Top coding benchmarks, excellent at refactoring |
| Qwen 2.5 Coder 7B | 7B | 8 GB | Strong code completion, fits on consumer hardware |
| CodeLlama 34B | 34B | 24 GB | Large context window, good at complex codebases |

### Reasoning and analysis

| Model | Parameters | Min VRAM | Strengths |
| --- | --- | --- | --- |
| DeepSeek R1 | 70B | 40 GB | Chain-of-thought reasoning, MIT licensed |
| Qwen 3.5 | 32B | 24 GB | Highest GPQA scores among open models |
| GLM-5 | 40B active | 24 GB | Strong across all benchmarks, MIT licensed |

### Frontier models (enterprise hardware only)

Models like **DeepSeek V3.2** (671B MoE, 37B active), **Kimi K2.5** (1T MoE, 32B active), and **GLM-5** (744B total) compete with GPT-4o and Claude on benchmarks. They require multi-GPU setups (8x H200 or similar) and are realistic only for organisations with dedicated ML infrastructure.

## Runtime tools for self-hosting

You do not interact with model weights directly. A runtime tool loads the model, handles quantisation, and serves an API. Here are the main options:

| Tool | Best for | GPU support | API compatibility |
| --- | --- | --- | --- |
| [Ollama](https://ollama.com/) | Simplicity, single-server | NVIDIA, AMD, Apple Silicon | OpenAI-compatible |
| [vLLM](https://docs.vllm.ai/) | High-throughput production | NVIDIA, AMD | OpenAI-compatible |
| [llama.cpp](https://github.com/ggerganov/llama.cpp) | Maximum hardware flexibility | NVIDIA, AMD, Apple, CPU | Custom + OpenAI-compatible |
| [LocalAI](https://localai.io/) | Drop-in OpenAI replacement | NVIDIA, AMD, CPU | OpenAI-compatible |
| [TGI](https://huggingface.co/docs/text-generation-inference) | HuggingFace ecosystem | NVIDIA | Custom |

**Ollama** is the easiest starting point. It handles model downloading, quantisation, and serving in a single binary. Combined with [Open WebUI](https://docs.openwebui.com/), it provides a full ChatGPT-like interface.

For a complete walkthrough of setting up Ollama + Open WebUI with Docker Compose, Nginx, and TLS, see our guide: [How to Self-Host Your Own AI Chat Interface on a VPS](https://www.deployhq.com/blog/how-to-install-and-run-chatgpt-on-a-vps).

## Cost comparison: self-hosted vs. cloud APIs

Here is a realistic cost comparison for a team of 10 developers using AI for code review and chat, processing roughly 2 million tokens per day:

| | Self-hosted (Llama 3.1 8B) | OpenAI GPT-4o | Anthropic Claude Sonnet |
| --- | --- | --- | --- |
| Monthly compute | $50 (VPS with 16 GB RAM) | ~$600 (at $2.50/1M input + $10/1M output) | ~$540 (at $3/1M input + $15/1M output) |
| Quality | Good for most tasks | Excellent | Excellent |
| Privacy | Full — data stays local | Data processed by OpenAI | Data processed by Anthropic |
| Latency | ~10–20 tokens/sec (CPU) | ~50–80 tokens/sec | ~50–80 tokens/sec |
| Scaling cost | Fixed | Linear with usage | Linear with usage |

**The break-even point** is roughly 500K tokens/day. Below that, cloud APIs are simpler and cheaper. Above that, self-hosting saves money every month — and the savings grow as usage increases.

For teams that need both privacy _and_ quality, a hybrid approach works well: run a local model for routine tasks (code review, summarisation, drafting) and call cloud APIs only for complex reasoning tasks. Open WebUI supports this natively — you can configure both local Ollama models and cloud API keys in the same interface.

## Deploying and managing self-hosted AI with DeployHQ

Once your AI stack is running, you need a way to manage configuration changes, model updates, and Nginx rules without SSH-ing into the server every time.

[DeployHQ](https://www.deployhq.com) automates this by deploying from a Git repository to your VPS via SSH. Push a change to your repo (updated `docker-compose.yml`, new Nginx config, model pull script) and [DeployHQ](https://www.deployhq.com) handles the rest.

Key [DeployHQ](https://www.deployhq.com) features for AI deployments:

- **[SSH commands](https://www.deployhq.com/support/configuration/ssh-commands)** run after each deploy — restart Docker containers, pull new models
- **[Config files](https://www.deployhq.com/support/configuration/config-files)** inject `.env` secrets without committing them to Git
- **[Build pipelines](https://www.deployhq.com/blog/what-is-a-build-pipeline-and-how-can-it-improve-your-workflow)** run build steps before deploying
- **Automatic deploys** on every push to your main branch

For the full setup walkthrough with Docker Compose files and deploy scripts, see our [Open WebUI + Ollama VPS guide](https://www.deployhq.com/blog/how-to-install-and-run-chatgpt-on-a-vps).

## Security best practices

Self-hosting gives you control, but also responsibility:

- **Network isolation** : bind model APIs to `127.0.0.1` — never expose Ollama or vLLM directly to the internet
- **Reverse proxy with TLS** : use Nginx or Caddy to terminate HTTPS in front of your model API
- **Access control** : Open WebUI supports user accounts with role-based access; disable public signup
- **Update regularly** : model runtimes (Ollama, vLLM) receive frequent security patches
- **Monitor resource usage** : a runaway inference request can exhaust RAM; set memory limits in Docker
- **Protect API keys** : if bridging to cloud APIs, use [environment variables](https://www.deployhq.com/blog/protecting-your-api-keys-best-practices-for-secure-deployment), never hardcode keys

## Related guides

- [How to Self-Host Your Own AI Chat Interface on a VPS with Open WebUI and Ollama](https://www.deployhq.com/blog/how-to-install-and-run-chatgpt-on-a-vps) — hands-on Docker Compose tutorial with [DeployHQ](https://www.deployhq.com)
- [How to Install DeepSeek on Your Cloud Server with Ollama LLM](https://www.deployhq.com/blog/how-to-install-deepseek-on-your-cloud-server-with-ollama-llm) — DeepSeek-specific deployment
- [Running Generative AI Models with Ollama and Open WebUI Using DeployHQ](https://www.deployhq.com/blog/running-generative-ai-models-with-ollama-and-open-webui-using-deployhq) — alternative deployment approach
- [What Is Docker? A Beginner's Guide to Containerisation and Deployment](https://www.deployhq.com/blog/what-is-docker-a-beginners-guide-to-containerization-and-deployment) — Docker fundamentals

If you have questions or need help, reach out at [support@deployhq.com](mailto:support@deployhq.com) or on [Twitter/X](https://x.com/deployhq).