Running LLMs locally gives you privacy, zero API costs, and full control over your data. We tested and ranked the best open-weight and open-source models for every hardware tier — from a Raspberry Pi to a multi-GPU workstation. Our top pick: Gemma 4 26B A4B, a MoE model that punches way above its VRAM class.
Every time you paste a prompt into ChatGPT or Claude, your text leaves your machine. For sensitive work (medical notes, proprietary code, personal documents), that's a real risk. Local LLMs solve this: the model runs entirely on your hardware, and no data ever touches a third-party server.[1]
There are two other big reasons to go local. First, cost: once you own the hardware, there is no per-token billing and no subscription, so your only running cost is electricity. Second, latency: with no network round trip, a well-quantized 7B model on a modern GPU can respond faster than many cloud APIs, especially for iterative tasks like code completion.[2]
One important distinction: most "open-source" LLMs are actually open-weight. You can download the weights and use them freely, but the training data and training code remain proprietary. Models under an OSI-approved license such as Apache 2.0 or MIT come closest to true open source: you can use, modify, and redistribute them commercially, though even these do not always ship with their training data. We note the license for each pick below.[1]
License: Apache 2.0 (true open-source) | Architecture: Mixture-of-Experts (MoE) with 26B total, ~4B active parameters
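Gemma 4 26B A4B is our top recommendation because it nails the hardest trade-off in local LLMs: capability vs. hardware requirements. Thanks to its MoE design, only about 4 billion parameters activate per token, so it generates text at roughly the speed of a 4B model while delivering output quality closer to a dense 26B model. All 26B parameters still have to sit in memory, but at 4-bit quantization that is about 13 GB of weights, which is why a single consumer GPU in the 12–16 GB class is enough (on a 12 GB card, expect to use a slightly smaller quantization or offload a few layers to the CPU).[1]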
It scores exceptionally well on reasoning and coding benchmarks, and the Apache 2.0 license means you can use it for commercial projects without worrying about restrictions. If you can only download one model today, this is it.
Best for: Users with a mid-range to high-end GPU who want the best quality-to-VRAM ratio available.
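The memory side of that trade-off can be sketched with back-of-the-envelope arithmetic. This is a rough estimate, assuming weight memory is simply total parameters times bits per weight; it ignores the KV cache, activations, and runtime overhead, so real usage will be somewhat higher. The function names are illustrative and do not come from any library.

```python
def weight_memory_gb(total_params_billions: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    total_bytes = total_params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


def describe(name: str, total_b: float, active_b: float, bits: int = 4) -> str:
    """Memory scales with TOTAL parameters; per-token compute scales with ACTIVE ones."""
    mem = weight_memory_gb(total_b, bits)
    return (f"{name}: ~{mem:.1f} GB of weights at {bits}-bit, "
            f"~{active_b}B parameters used per token")


# An MoE model and a dense model with the same total size need the same
# weight memory; the MoE model just does less work per generated token.
print(describe("MoE, 26B total / 4B active", 26, 4))
print(describe("Dense, 26B", 26, 26))
```

Both lines report the same weight footprint; only the per-token parameter count differs, which is where the speed advantage of an MoE model comes from.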
License: MIT | Architecture: Dense transformer, 3.8B parameters
Phi-3 Mini is tiny at 3.8 billion parameters, but don't let the size fool you. Microsoft trained it on high-quality synthetic data, and it punches well above its weight on reasoning tasks.[2] It runs on a Raspberry Pi with 8 GB RAM, on a CPU with no GPU, or on any GPU with 4+ GB VRAM.
Quantized to 4-bit and stored as a GGUF file, it's about 2 GB on disk. That's small enough to fit on a phone. For its size, it's one of the fastest open-weight models available, making it ideal for real-time local inference on modest hardware.[2]
Best for: Low-spec laptops, Raspberry Pi setups, or anyone who needs a capable model on minimal hardware.
License: Custom (Meta, permissive for most uses) | Architecture: Dense transformer, 8B parameters
Llama 3 8B is the most widely deployed local model for a reason. It's supported out of the box in Ollama, LM Studio, and most local inference tools. It scores strongly across general knowledge, instruction following, and creative writing tasks.[1]
At 8B parameters, the weights take roughly 4 to 5 GB at 4-bit quantization, so an 8 GB card runs it comfortably with room left for context. At 16-bit (FP16) precision it needs about 16 GB. That's well within reach of an RTX 3060 or higher. Community support is unmatched: you'll find pre-quantized GGUF files, fine-tuned variants, and troubleshooting guides everywhere.
Best for: General-purpose local use — chatbots, summarization, drafting — on a mid-range GPU.
License: Apache 2.0 | Architecture: Mixture-of-Experts, 46.7B total / ~12.9B active
Mixtral 8x7B is Mistral's MoE answer to the quality-to-speed problem. With ~13B active parameters per token, it outperforms many dense 30B+ models while generating at roughly the speed of a 13B model. Memory is a different matter: all 46.7B parameters must be loaded, which takes about 24 GB at 4-bit.[1]
It excels at multilingual tasks, reasoning, and long-context work. If you have a 24 GB GPU (RTX 4090, RTX A5000) or two mid-range cards, Mixtral is a strong step up from 7–8B models without jumping to the 70B tier.
Best for: Users with 24 GB VRAM who want near-frontier quality from a single GPU.
License: MIT for the code, DeepSeek Model License for the weights | Architecture: Dense transformer, 6.7B parameters, code-specialized
DeepSeek Coder 6.7B is trained on 2 trillion tokens of code and natural language, with a fill-in-the-middle objective that makes it excellent for inline code completion.[1] It supports 87 programming languages and handles multi-turn debugging conversations well.
At 6.7B parameters, it fits in ~4 GB VRAM at 4-bit quantization. It integrates cleanly with local coding assistants like Continue.dev and Tabby. For developers who want AI code help without sending code to a cloud API, this is the pick.
Best for: Developers running local code completion and debugging on a mid-range GPU.
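To make the fill-in-the-middle idea concrete, here is a minimal sketch of how such a prompt is assembled: the code before the cursor, a "hole" marker, then the code after the cursor. The sentinel strings below are the ones we believe the DeepSeek Coder README publishes; treat them as assumptions and check them against the tokenizer of the exact model you load. `build_fim_prompt` is a hypothetical helper, not part of any library.

```python
# Sentinel tokens as we understand them from the DeepSeek Coder README.
# ASSUMPTION: verify these against your model's tokenizer before relying on them.
FIM_BEGIN = "<｜fim▁begin｜>"
FIM_HOLE = "<｜fim▁hole｜>"
FIM_END = "<｜fim▁end｜>"


def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Wrap the text before and after the cursor so the model fills the gap."""
    return f"{FIM_BEGIN}{prefix}{FIM_HOLE}{suffix}{FIM_END}"


# The editor sends everything before the cursor as the prefix and everything
# after it as the suffix; the model is asked to generate the missing middle.
prefix = "def mean(values):\n    total = "
suffix = "\n    return total / len(values)\n"
prompt = build_fim_prompt(prefix, suffix)
print(prompt)
```

Tools like Continue.dev and Tabby build prompts of this kind for you; the sketch only shows why a model trained with this objective can complete code in the middle of a file instead of only at the end.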
| Model | Size (Params) | VRAM Needed (4-bit) | Primary Strength |
|---|---|---|---|
| Gemma 4 26B A4B | 26B total / ~4B active | 12–16 GB | Best quality-to-VRAM ratio |
| Phi-3 Mini | 3.8B | 2–4 GB | Ultra-lightweight reasoning |
| Llama 3 8B | 8B | 8 GB | General-purpose versatility |
| Mixtral 8x7B | 46.7B total / ~13B active | 24 GB | Mid-range MoE quality |
| DeepSeek Coder 6.7B | 6.7B | 4 GB | Code completion & debugging |
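The table can be read as a simple lookup: given the VRAM you have, which picks fit? The sketch below encodes the "VRAM Needed (4-bit)" column and, as a conservative assumption, uses the upper end of each range (16 GB for Gemma, 4 GB for Phi-3 Mini). The names `MODELS` and `models_that_fit` are illustrative.

```python
# (model, VRAM needed at 4-bit in GB), taken from the table above.
# ASSUMPTION: for ranges such as "12–16 GB" the upper bound is used.
MODELS = [
    ("Gemma 4 26B A4B", 16),
    ("Phi-3 Mini", 4),
    ("Llama 3 8B", 8),
    ("Mixtral 8x7B", 24),
    ("DeepSeek Coder 6.7B", 4),
]


def models_that_fit(vram_gb: float) -> list[str]:
    """Return the picks whose listed 4-bit VRAM requirement fits in vram_gb."""
    return [name for name, needed in MODELS if needed <= vram_gb]


print(models_that_fit(8))   # ['Phi-3 Mini', 'Llama 3 8B', 'DeepSeek Coder 6.7B']
print(models_that_fit(24))  # all five picks
```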
You don't need to be a machine learning engineer. Three tools handle 90% of local deployment:
Ollama: `ollama run llama3` downloads and runs the model with sensible defaults. Supports GPU acceleration on NVIDIA and AMD.
For most people, start with Ollama. If you need to run on a CPU or want fine-grained control over quantization, use llama.cpp with GGUF files.
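Once a model is running under Ollama, you can also call it from your own scripts over its local HTTP API. The sketch below uses only the Python standard library and assumes Ollama is listening on its default address (`http://localhost:11434`) and that the `llama3` model has already been pulled. The helper names `build_request` and `generate` are our own.

```python
import json
import urllib.request

# ASSUMPTION: Ollama is running locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming generate request for the local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the prompt and return the generated text from the JSON reply."""
    request = build_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.loads(response.read().decode("utf-8"))
    return body["response"]
```

With the server up, `generate("llama3", "Explain quantization in one sentence.")` returns the model's reply as a string. Nothing leaves your machine, since the request only goes to localhost.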
We may earn a small commission if you purchase hardware through links on this page. It doesn't affect our recommendations — we pick models based on benchmarks and real-world testing, not commissions.