ModelDen

Local LLMs, ranked by VRAM fit and quality

Each score is the model’s result on its best public benchmark for that skill, and every row names the benchmark (there is no blended index). Scores compare cleanly when the benchmark matches; where it differs, the name tells you why. Pick a VRAM size to see only the models that fit.
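As a rough illustration of the comparability rule, the sketch below refuses to compute a score gap when two rows name different benchmarks. The `Score` type and `compare` helper are hypothetical names introduced for this example, not part of the site; the three rows are taken from the 12 GB ranking.

```python
from typing import NamedTuple, Optional


class Score(NamedTuple):
    model: str
    value: int       # score as shown on the row
    benchmark: str   # benchmark named on the row


def compare(a: Score, b: Score) -> Optional[int]:
    """Return a.value - b.value, or None when the benchmarks differ."""
    if a.benchmark != b.benchmark:
        return None  # different tests, so the gap is not meaningful
    return a.value - b.value


phi4 = Score("Phi-4 14B", 56, "GPQA")
r1_14b = Score("DeepSeek R1 Distill Qwen 14B", 59, "GPQA")
gemma2 = Score("Gemma 2 9B", 71, "MMLU")

print(compare(r1_14b, phi4))  # 3
print(compare(gemma2, phi4))  # None
```

Returning `None` rather than a number keeps a mismatched pair from being read as a real gap.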

Frontier models (Claude, GPT, Gemini, the largest open models) are shown separately below as a ceiling. They’re measured on harder, contamination-resistant tests, so their numbers aren’t directly comparable to the local scores.

VRAM to fit — showing models that run on 12 GB
# · Model · Params · Size (Q4) · Released · Score · Benchmark
1 · Gemma 4 12B · 12.0B · 7.3 GB · May 2026 · 79 · GPQA
2 · Gemma 2 9B · 9.2B · 5.6 GB · Jun 2024 · 71 · MMLU
3 · Mistral NeMo 12B Instruct · 12B · 7.3 GB · Jul 2024 · 68 · MMLU
4 · MiniCPM-SALA · 9.5B · 5.8 GB · Feb 2026 · 67 · MMLU-Pro
5 · DeepSeek R1 Distill Qwen 14B · 14.8B · 9.0 GB · Jan 2025 · 59 · GPQA
6 · Gemma 4 E4B · 8B · 4.8 GB · Apr 2026 · 59 · GPQA
7 · Granite 3.3 8B Base · 8.2B · 5.0 GB · Apr 2025 · 58 · Arena Hard
8 · Granite 3.3 8B Instruct · 8B · 4.8 GB · Apr 2025 · 58 · Arena Hard
9 · Phi-4 14B · 14.7B · 8.9 GB · Dec 2024 · 56 · GPQA
10 · Llama 3.1 Nemotron Nano 8B V1 · 8B · 4.8 GB · Mar 2025 · 54 · GPQA
11 · DeepSeek R1 Distill Qwen 7B · 7.6B · 4.6 GB · Jan 2025 · 49 · GPQA
12 · DeepSeek R1 Distill Llama 8B · 8.0B · 4.9 GB · Jan 2025 · 49 · GPQA
13 · Gemma 4 E2B · 5.1B · 3.1 GB · Apr 2026 · 43 · GPQA
14 · Gemma 3 12B · 12B · 7.3 GB · Mar 2025 · 41 · GPQA
15 · DeepSeek R1 Distill Qwen 1.5B · 1.8B · 1.1 GB · Jan 2025 · 34 · GPQA
16 · Llama 3.2 11B Instruct · 10.6B · 6.4 GB · Sep 2024 · 33 · GPQA
17 · Llama 3.2 3B Instruct · 3.2B · 1.9 GB · Sep 2024 · 33 · GPQA
18 · Gemma 3 4B · 4B · 2.4 GB · Mar 2025 · 31 · GPQA
19 · IBM Granite 4.0 Tiny Preview · 7B · 4.2 GB · May 2025 · 27 · Arena Hard
20 · Gemma 3n E2B Instructed LiteRT (Preview) · 1.9B · 1.2 GB · May 2025 · 25 · GPQA
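The VRAM filter can be approximated in a few lines. This is a minimal sketch, assuming the filter simply keeps models whose Q4 size is at or below the selected budget and then sorts by score. The site's actual rule is not documented here and may also reserve memory for context, so `Row` and `models_that_fit` are hypothetical names, and the data is a hand-copied subset of the ranking.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    model: str
    q4_gb: float     # size of the Q4 build in GB, as listed
    score: int
    benchmark: str


# A subset of the rows from the 12 GB ranking.
ROWS = [
    Row("Gemma 4 12B", 7.3, 79, "GPQA"),
    Row("Gemma 2 9B", 5.6, 71, "MMLU"),
    Row("DeepSeek R1 Distill Qwen 14B", 9.0, 59, "GPQA"),
    Row("Gemma 4 E4B", 4.8, 59, "GPQA"),
    Row("Phi-4 14B", 8.9, 56, "GPQA"),
    Row("Llama 3.2 3B Instruct", 1.9, 33, "GPQA"),
]


def models_that_fit(rows, vram_gb):
    """Keep rows whose Q4 size is within the budget, highest score first."""
    fitting = [r for r in rows if r.q4_gb <= vram_gb]
    return sorted(fitting, key=lambda r: r.score, reverse=True)


# With an 8 GB budget, the two 14B-class models drop out.
for rank, row in enumerate(models_that_fit(ROWS, vram_gb=8.0), start=1):
    print(rank, row.model, f"{row.q4_gb} GB", row.score, row.benchmark)
```

Because the sort uses the raw score, rows on different benchmarks still end up interleaved, which is why the benchmark name is printed next to every score.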
Frontier — the paid ceiling

Cloud-only flagships, for reference. Their general numbers come from harder, contamination-resistant tests (named below), so they’re not directly comparable to the local scores above; they show the ceiling, and what you trade in privacy and cost to reach it.

Each score is a single named benchmark (not a blend) — open a model to see it and its source. Frontier models are tested on harder benchmarks, so their numbers aren’t 1:1 comparable with the local scores. Fit is computed for the Q4 build against the selected VRAM; colour shows whether it sits comfortably or tightly in that memory.
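One way to picture the fit check is sketched below. Two things are assumptions rather than the site's published method. First, the factor of about 0.61 GB per billion parameters is inferred from the listed rows (for example 12.0B at 7.3 GB and 14.8B at 9.0 GB). Second, the 80% cutoff between a comfortable and a tight fit is purely hypothetical. All names are illustrative.

```python
GB_PER_BILLION_PARAMS_Q4 = 0.61  # assumption: inferred from the listed rows
COMFORT_FRACTION = 0.80          # hypothetical: "comfortable" at or below 80% of VRAM


def q4_size_gb(params_billion: float) -> float:
    """Estimate the Q4 build size in GB from the parameter count in billions."""
    return params_billion * GB_PER_BILLION_PARAMS_Q4


def fit_label(q4_gb: float, vram_gb: float) -> str:
    """Classify how a Q4 build sits in the selected VRAM."""
    if q4_gb > vram_gb:
        return "does not fit"
    if q4_gb <= vram_gb * COMFORT_FRACTION:
        return "comfortable"
    return "tight"


print(round(q4_size_gb(12.0), 1))     # 7.3
print(fit_label(7.3, vram_gb=12.0))   # comfortable
print(fit_label(9.0, vram_gb=10.0))   # tight
print(fit_label(9.0, vram_gb=8.0))    # does not fit
```

In practice, the context window and runtime buffers also consume VRAM, so a build labelled tight by a rule like this may not leave room for long prompts.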