Can I Run This Locally? A Practical Guide to Self-Hosted AI Coding Models in 2026

By Alex M · Posted on 29th May 2026

AI, Devops & Infrastructure, and Tips & Tricks

Can I Run This Locally? A Practical Guide to Self-Hosted AI Coding Models in 2026

Hosted AI coding tools have a pricing problem, and a lot of developers are quietly solving it with a 16GB graphics card and a download.

That's the short version of a longer trend we covered in the routing landscape and three pricing camps. If you've already decided you want to self-host at scale, our self-hosting AI models guide covers the deeper architecture, VRAM math, and cost comparison with hosted APIs. The interesting thing isn't that you can run an open-weight model on your laptop — that's been technically possible for two years. The interesting thing is that, in 2026, the gap between local Qwen and hosted Claude on day-to-day refactors has narrowed enough that self-host first, escalate to hosted for hard problems is a reasonable team policy.

This post is the practical entry point. Three questions:

Can your hardware actually run anything useful?
Which runtime and model do you pick first?
How do you standardize that across a team without it becoming a snowflake-per-laptop disaster?

A free tool handles question one in about ten seconds, an open-source database handles question two, and a DeployHQ workflow handles question three. Let's go.

Start with the hardware question, not the model

The mistake most people make when they first try local AI is downloading a 70B-parameter model to see what happens, waiting forever for the first token, and concluding that local is unusable. It isn't — they just skipped the hardware compatibility step.

The cleanest way to do that step is canirun.ai, a browser-based hardware compatibility checker built by midudev. It detects your GPU, VRAM, memory bandwidth, RAM, and CPU cores via browser APIs, then grades every model in its database across six tiers: Runs great, Runs well, Decent, Tight fit, Barely runs, Too heavy. The data comes from the three runtimes most local-AI users care about — llama.cpp, Ollama, and LM Studio — so the grades reflect actual quantization availability, not theoretical math.

Filter the catalog by task type and the code filter surfaces models tuned for programming work. Sort by score, context length, or VRAM. Two notes on using it well:

Browser detection has caveats. The site itself flags that actual specs may vary — browsers don't always report integrated GPU details precisely. Treat the grade as a strong directional signal, especially near borderline tiers.
Quantization matters more than parameter count. A 32B model at 4-bit fits in roughly half the VRAM of the same model at 8-bit. canirun.ai accounts for this by showing which quantization formats are available per model — the grade is for the specific quant, not the full-precision weights.

The output of this step is a shortlist: which models, at which quantization, are realistic on your machine. Don't move on until you have one.

Pick the model — and check what it actually does

Once you have a shortlist, the next question is what each model can do beyond just run. The fact that you can load a model doesn't mean it has the context window, tool-calling support, or knowledge cutoff you need for coding work.

models.dev — an open-source, MIT-licensed database maintained by the SST team — exposes the per-model specs that matter:

Context window (how much code you can hand it before it forgets the top of the file)
Tool calling (does it correctly emit function-call payloads, or will Cline silently fall back to text-only mode?)
Reasoning support (chain-of-thought tokens, mostly relevant for harder refactors)
Knowledge cutoff (last training data — relevant if you're using libraries that shipped this year)
Modalities (text only, or images / PDFs too)
Pricing (hosted equivalents, useful for the what am I replacing? calculation)
Open-weights flag (the one you care about for self-hosting)

The data is also available at models.dev/api.json and as TOML files in the repo, which is going to matter in a few sections when we talk about scripting team configs.

A practical rule of thumb when matching the canirun.ai shortlist to the models.dev catalog: filter for open_weights: true first, then sort by context window descending, then look at tool-calling support. If a model can run on your machine, has at least 32k of usable context, and supports tool calls, it's a candidate for serious coding work. If it's missing any one of those, it's a chat toy.

The names that show up in those filters today — Qwen Coder variants, DeepSeek-Coder, the GPT-OSS family, Llama coding fine-tunes, smaller Gemma releases — change every few months. Don't memorize the leaderboard; bookmark models.dev.

Picking a runtime: Ollama, LM Studio, or llama.cpp

The runtime is the program that actually loads model weights into memory and serves inference. There are three serious options, and they're not really competing for the same user.

Ollama — the default

Ollama is the closest thing local AI has to a default. Install it on macOS, Linux, Windows, or Docker; pull a model with ollama run qwen3-coder and it downloads and runs in one step; it exposes an OpenAI-compatible chat completions API on http://localhost:11434 so every AI coding tool that speaks the OpenAI dialect can talk to it.

The piece most teams underuse is the Modelfile, Ollama's configuration format for building a customized variant of an existing model. A minimal one looks like this:

FROM qwen3-coder
PARAMETER temperature 0.2
PARAMETER num_ctx 32768
SYSTEM You are a coding assistant for a TypeScript + Node.js team. Prefer ESM imports. Never use `any` unless explicitly justified in a comment.

You build it with ollama create my-team-coder -f ./Modelfile, and now my-team-coder is a model entry your tooling can point at. The Modelfile is checked into Git, which means the system prompt, context length, and temperature are versioned alongside your code instead of living in a settings panel that drifts per developer. We'll come back to this in the standardization section.

LM Studio — the GUI-first option

LM Studio is what you reach for if a teammate doesn't want to live in a terminal. It's free for home and work use, runs on Windows / Mac / Linux, ships with a polished GUI for downloading and chatting with models, and also exposes an OpenAI-compatible local server when you flip it into developer mode. It supports MLX natively on Apple Silicon, which can be a meaningful speed advantage over GGUF on the same hardware.

LM Studio also ships a headless runtime called llmster for servers, cloud boxes, and CI environments — so a team can standardize on LM Studio on developer laptops and use the same engine on a shared inference server, without two different toolchains.

llama.cpp — the engine underneath

llama.cpp is the C/C++ inference engine that powers most of the rest of the local-AI world. It runs with no external dependencies, it requires GGUF-format weights, and it supports an absurd range of hardware backends: CUDA, ROCm/HIP, Metal, Vulkan, SYCL, WebGPU, AVX/AVX2/AVX512, ARM NEON, and Apple's Accelerate framework. It also ships llama-server, a lightweight HTTP server with OpenAI-compatible endpoints at /v1/chat/completions and a built-in web UI.

Most teams don't use llama.cpp directly — they use Ollama or LM Studio, both of which wrap llama.cpp underneath. You go to llama.cpp directly when you need a backend Ollama doesn't expose (Vulkan on a non-CUDA GPU is the classic case), or when you're embedding inference into a build process and want zero abstraction layers.

The 80% answer: start with Ollama. Move to LM Studio if your team includes people who want a GUI. Drop down to llama.cpp only if Ollama can't see your hardware.

Wiring a local model into your AI coding tool

A local Ollama endpoint is useless if your editor can't see it. The good news is that the OpenAI-compatible API on port 11434 means basically every modern AI coding tool can target it with a one-line config change. Here's how the four most common ones connect.

Cline (VS Code)

In Cline, open Settings → Provider, pick Ollama, set Base URL to http://localhost:11434, and select your model from the dropdown. The Cline guide covers Plan vs Act modes and the rules system in detail — both matter more with local models, because constraining the agent to one focused task per session keeps you inside the context window. Cline's docs also recommend enabling Use Compact Prompt for local inference, which trims the boilerplate Cline sends on every turn.

Aider (terminal)

Aider points at Ollama with two environment variables and a model flag:

export OLLAMA_API_BASE=http://127.0.0.1:11434
OLLAMA_CONTEXT_LENGTH=32768 ollama serve
aider --model ollama_chat/qwen3-coder

Two important footguns the Aider playbook calls out: use the ollama_chat/ prefix (not just ollama/), and override OLLAMA_CONTEXT_LENGTH when you start the server. Ollama defaults to a 2k context window and silently truncates overflow — meaning Aider may be working with half your file invisibly missing. The Aider docs explicitly flag this as the highest-impact configuration mistake when going local.

Continue.dev (VS Code / JetBrains)

Continue.dev configures models in config.yaml (the older config.json is deprecated). A minimal Ollama entry looks roughly like:

models:
  - name: Qwen Coder (local)
    provider: ollama
    model: qwen3-coder
    roles:
      - chat
      - autocomplete

The roles array lets you slot one local model into autocomplete and a different one (potentially hosted) into chat — which is the right pattern for most teams. Autocomplete fires constantly and benefits from low latency; chat fires occasionally and benefits from a smarter model.

OpenCode (terminal / IDE / desktop)

OpenCode installs with a single curl command and integrates with 75+ providers via Models.dev, including local ones — which means an Ollama endpoint is just another provider entry. It's the open-source option for teams that want a Cursor-like agent UX without a paid plan or a vendor account.

Cursor

Worth a brief mention: Cursor doesn't support BYOK on its standard tiers, so it's not a viable home for local models for most teams. If you're on Cursor and serious about local, you're effectively choosing between switching tools or running local alongside Cursor for the workloads where it makes sense.

Once one of these is wired up, you have a working setup: local model, AI coding agent, OpenAI-compatible plumbing in the middle. The interesting problem moves up a level — from does it run on my laptop? to does it run *consistently* across the team?

→ Already comfortable with the local-model side and want the deployment story? You can deploy AI-generated code straight from your terminal with the DeployHQ CLI — same agent on the laptop, same build pipeline in production.

Team standardization: stop letting laptops drift

A single developer running Ollama is straightforward. Eight developers running Ollama is where the wheels come off.

Within a month, each laptop has a different system prompt, a different num_ctx, a slightly different model version, and a different config inside Cline or Continue or Aider. One developer is on a 32B model at 4-bit, another quietly downgraded to a smaller variant because their GPU runs hot, a third is on the previous version because they never pulled the update. When their AI-generated PRs look wildly inconsistent, nobody knows why.

The fix isn't a wiki page — it's treating local AI configuration as deployable infrastructure. DeployHQ already does this for everything else on a server; the same pattern works for developer workstations and shared inference boxes:

Check the configs into the repo. A local-ai/ directory with the team's Modelfile, the canonical ~/.continue/config.yaml, the canonical Cline rules file, and a setup script.
Use DeployHQ config files to ship those onto a shared inference server (or onto developer machines, if you provision them via DeployHQ). Config files in DeployHQ are tracked, versioned, and updated atomically with every deploy — so a config change goes out the same way a code change does, instead of as a manual Slack message.
Use DeployHQ SSH commands to run ollama create my-team-coder -f ./Modelfile on the inference host after every deploy, so the team's customized model rebuilds automatically when the Modelfile changes. Now the system prompt, context length, and base model are versioned alongside the application they're helping write.
Run the AI-generated code through the same pipeline as everything else. This is the part teams miss most often. Local models, even good ones, hallucinate imports and skip edge cases more than hosted Claude does. A containerized build pipeline that runs every commit through your tests is the catch — it doesn't matter whether the code came from a human, Qwen Coder, or Claude, the build either goes green or it doesn't. The closing safety net is one-click rollback to the last good deploy when something does slip through and ship.
Layer in automated AI code review on PRs before the build pipeline ever sees the change. Local models drift differently than hosted ones, and the patterns CodeRabbit / Greptile catch are exactly the patterns local models reproduce most often.

The deeper point: a local model doesn't change the deployment contract. Code still needs to build, test, and ship through the same gates. What changes is who wrote it — and once you've absorbed the configuration into your DeployHQ workflow, your shared inference server, your Modelfile, and your developers' editors all stay in lockstep without anyone having to think about it.

→ If you're spinning up a team-wide local AI workflow this quarter, you can start a DeployHQ trial and standardize your team's AI-coding setup alongside the rest of your deployment pipeline.

When local-model coding is the wrong answer

Local isn't always the right call.

Greenfield work in an unfamiliar stack. Hosted Claude or GPT still wins on I have no idea what good looks like here, show me. Local models are stronger on patterns they've already seen than on novel problem-shapes.
Long-context reasoning across many files. Hosted frontier models hold multi-hundred-thousand-token contexts with attention quality local models on consumer hardware can't match. If you regularly hand the agent ten files and ask it to refactor, stay hosted.
Small team, no ops capacity. A shared inference server needs monitoring, patching, and someone's pager. If your team is three people, hosted per-seat pricing is probably still cheaper than the operational tax.
You need vendor support. When local Ollama hangs, you're on GitHub issues. When hosted Claude hangs, there's an SLA.

The realistic pattern most teams converge on is local for the routine, hosted for the hard — autocomplete and small refactors on a local model, harder architectural work routed to Claude or Codex. The first-party CLI agents from the lab vendors live happily alongside a local Ollama setup. If you'd rather route the hosted side through one API and one bill, OpenRouter sits in front of Claude, GPT, Gemini, and 400+ other models — useful when you're keeping local for routine work and escalating the hard problems to a hosted frontier model.

FAQ

Do I need a GPU to run a coding model locally? Not strictly. CPU-only inference works for small models and is usable for autocomplete; it gets painful for chat-style coding. Apple Silicon Macs are an underrated middle ground — unified memory lets the GPU access system RAM, which makes mid-sized models tolerable without a discrete card.

How much VRAM do I actually need? Wrong question in the abstract — it depends on which model, which quantization, and how much context you want. Check canirun.ai against your specific hardware; the grades are more honest than any rule-of-thumb table.

Will a local model match Claude or GPT on quality? On routine completion and small refactors in 2026, the gap is small enough that most developers don't notice. On novel architectural problems and multi-file reasoning, hosted frontier models are still ahead. Plan around the strengths.

Does Ollama support tool calling for agents like Cline? Yes, but support varies by model. Check models.dev for the per-model tool-calling flag and test with a small task before committing.

Can I use a local model in CI? Yes — LM Studio's llmster, llama-server, and Ollama all run cleanly in containers. The pattern most teams use is a shared GPU box on the network, not a model per CI job; inference startup is too slow for ephemeral runners.

Wrap-up

The two-step entry point for going local in 2026 is:

Run your machine against canirun.ai to see what model classes are realistic.
Filter models.dev for open_weights: true, the context window you need, and tool-calling support — that's your candidate list.

Pick Ollama as the runtime, wire it into Cline or Aider or Continue or OpenCode, write a Modelfile so the team's system prompt is versioned, and ship the AI-generated code through the same build pipeline and rollback path as everything else. That last part is what the local-AI hype cycle skips: the deployment contract doesn't change just because a different author wrote the code.

If you want the deployment side of that workflow handled — config files versioned across every workstation and inference server, AI-generated commits running through the same containerized pipeline, and a one-click way back when something ships broken — that's what DeployHQ is built for.

Questions about how to wire DeployHQ into a local-AI coding workflow? Email us at support@deployhq.com or reach out on X / Twitter.