How to Choose a GPU for Local LLMs
Use this guide to choose the right GPU planning path for local LLMs by starting from workload fit, model assumptions, quantization, context length, MoE behavior, and validation risk.
The practical answer is to choose a GPU last: estimate the model first, map the result to a VRAM tier, then compare source-aware GPU profiles after the workload has a memory target.
Planning notice: this guide avoids benchmark, tokens-per-second, price, stock, affiliate, and guaranteed-fit claims. Use it to choose a validation path before making any hardware decision.
Start with these 5 questions
Which exact model or model family are you testing first?
Start from a real model page or calculator record. A generic goal such as "run a local LLM" is too broad to point to a useful GPU tier.
Is it a dense LLM, MoE model, or mixed image workflow?
Dense models, MoE models, and image-generation pipelines need different estimate paths and different validation checks.
What quantization format are you actually planning to run?
4-bit, 8-bit, and FP16/BF16 can land in very different memory bands, even when the model name stays the same.
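As a rough illustration of why the same model name can land in different memory bands, the sketch below multiplies parameter count by an idealized bytes-per-parameter width. The `BYTES_PER_PARAM` table and the `weight_gb` helper are assumptions made for this example, not part of any calculator. Real quantized files carry extra scales and mixed-precision layers, so treat the output as a weights-only lower bound rather than a fit claim.

```python
# Idealized bytes per parameter for the weights alone.
# Assumption: real quantization formats add metadata (scales, zero points)
# and may keep some layers at higher precision, so actual files are larger.
BYTES_PER_PARAM = {"fp16": 2.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_gb(params_billion: float, fmt: str) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes), weights only."""
    total_bytes = params_billion * 1e9 * BYTES_PER_PARAM[fmt]
    return total_bytes / 1e9


if __name__ == "__main__":
    # Same hypothetical 7B dense model, three planning formats.
    for fmt in ("int4", "int8", "fp16"):
        print(f"7B @ {fmt}: ~{weight_gb(7, fmt):.1f} GB (weights only)")
```

KV cache, activations, and runtime overhead come on top of these numbers, which is why the quantization target needs to be fixed before any tier comparison.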
How much context does your real workflow need?
Short chat prompts, coding sessions, retrieval context, and long documents create different KV-cache pressure.
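To make the KV-cache pressure concrete, here is a minimal sketch of the standard per-token cache arithmetic: one key and one value vector per layer, per KV head. The layer count, KV-head count, and head dimension below are illustrative assumptions for a hypothetical dense model with grouped-query attention; read the real values from the model's configuration before relying on the result.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_tokens: int, bytes_per_value: int = 2) -> int:
    """Approximate KV-cache size for a single sequence.

    The leading 2 accounts for storing both keys and values.
    bytes_per_value=2 assumes an FP16/BF16 cache (an assumption; some
    runtimes can quantize the cache).
    """
    return 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_value


if __name__ == "__main__":
    # Hypothetical model shape, chosen only for illustration.
    layers, kv_heads, dim = 32, 8, 128
    for ctx in (2048, 8192, 32768):
        gib = kv_cache_bytes(layers, kv_heads, dim, ctx) / 2**30
        print(f"context {ctx:>6} tokens: ~{gib:.2f} GiB KV cache")
```

Under these assumptions the cache grows linearly with context, so a workflow that moves from short chat to long documents can add several GiB without any change to the model or its quantization.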
Is the estimate close enough to the limit that you should validate first?
If the estimate lands near a VRAM boundary, use a runtime test or cloud validation before narrowing local hardware.
Decision flow: from model to GPU profile
Name the model
Pick the exact model family or page first. If a model page exists for it, use that page as the starting context before opening GPU profiles.
Choose dense or MoE path
Use dense LLM assumptions for ordinary dense models. Switch to MoE mode when the model's total parameter count differs from the parameters active per token, or when its packaging differs from a single dense checkpoint.
Lock the quantization target
Write down whether you are planning for 4-bit, 8-bit, or FP16/BF16. Do not compare GPUs from a vague model-size guess.
Stress the context assumption
Move the estimate toward your real prompt shape: short chat, code context, retrieval context, or longer multi-turn work.
Check the memory boundary
If the estimate is close to 8GB, 12GB, 16GB, or 24GB, treat that tier as a validation target instead of a comfortable answer.
Read GPU profiles last
Once the memory target is visible, compare source-aware GPU records. At this point GPU pages become useful research, not guessing.
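The boundary check in the flow above can be sketched as a small classifier. The tier list comes from this guide. The `margin` value of 15% is a hypothetical threshold picked for illustration, and `classify` is a name invented for this example; choose a margin that matches your own tolerance for troubleshooting.

```python
from typing import Optional, Tuple

# VRAM tiers named in this guide, in GB.
TIERS_GB = (8, 12, 16, 24)


def classify(estimate_gb: float, margin: float = 0.15) -> Tuple[Optional[int], str]:
    """Return (smallest tier that holds the estimate, planning status).

    margin is the minimum free fraction of the tier (an assumed value)
    below which the tier is treated as a validation target.
    """
    for tier in TIERS_GB:
        if estimate_gb <= tier:
            headroom = (tier - estimate_gb) / tier
            if headroom < margin:
                return tier, "validate first"
            return tier, "planning baseline"
    return None, "beyond listed tiers"


if __name__ == "__main__":
    for est in (9.0, 11.2, 30.0):
        print(est, classify(est))
```

An estimate of 11.2 GB technically sits under 12 GB, but with under 7% headroom it is a validation target, not a comfortable answer. Only after this step does comparing GPU profiles become useful.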
Start from these model examples
Use the current dense LLM pages as examples of decision types, not as generic proof that every nearby model will fit the same GPU tier.
What matters more than the GPU name
The useful inputs are model family, parameter size, quantization, context length, runtime, offload behavior, and tolerance for troubleshooting. A GPU model name is only useful after those inputs produce a memory target.
This is why the page links into the calculator and model pages before the GPU index. It keeps the reader from comparing cards without knowing the workload boundary.
When to move beyond local-first planning
Move to cloud validation when the estimate is close to the card limit, the runtime is unfamiliar, the model is MoE or larger than your first dense LLM target, or the workload crosses into heavier image generation.
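One way to keep these triggers explicit is to record them as named flags and list which ones apply. The `Plan` fields and the `cloud_validation_reasons` helper are hypothetical names for this sketch; they simply mirror the conditions stated above and carry no provider or performance information.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Plan:
    """Planning flags mirroring the triggers in this section (hypothetical)."""
    near_card_limit: bool
    runtime_unfamiliar: bool
    moe_or_larger_than_first_target: bool
    heavier_image_generation: bool


def cloud_validation_reasons(plan: Plan) -> List[str]:
    """Return the names of every trigger that applies; empty means local-first."""
    return [name for name, active in vars(plan).items() if active]


if __name__ == "__main__":
    plan = Plan(near_card_limit=True, runtime_unfamiliar=False,
                moe_or_larger_than_first_target=True,
                heavier_image_generation=False)
    reasons = cloud_validation_reasons(plan)
    print("validate in cloud first" if reasons else "local-first is reasonable", reasons)
```

Any single trigger is enough to justify a temporary validation run; the list just records why.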
Cloud testing is not a provider ranking here. It is a way to reduce uncertainty before committing to a local hardware path.
When not to buy yet
How to read GPU profiles after estimating VRAM
Mistakes to avoid
These are the patterns that most often turn a useful planning question into a weak GPU recommendation.
- Do not choose a GPU only from parameter count.
- Do not read active MoE parameters as a complete VRAM requirement.
- Do not treat one successful prompt as proof that longer context will fit.
- Do not compare GPU prices or stock from this guide; no live commerce data is used here.
- Do not treat calculator output as tokens-per-second, image-speed, or official hardware support evidence.
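The MoE mistake in the list above is easy to see with arithmetic. In a typical MoE model the router can select any expert for any token, so all expert weights generally have to be resident in VRAM or offloaded memory, while only the active subset is used per token. The parameter counts below are illustrative assumptions for a hypothetical model, and `weights_gb` is a helper invented for this sketch.

```python
def weights_gb(params_billion: float, bytes_per_param: float) -> float:
    """Idealized weight memory in GB (1 GB = 1e9 bytes), weights only."""
    return params_billion * bytes_per_param


if __name__ == "__main__":
    # Hypothetical MoE model: numbers chosen for illustration only.
    total_b, active_b = 47.0, 13.0
    four_bit = 0.5  # idealized bytes per parameter at 4-bit

    resident = weights_gb(total_b, four_bit)    # what must be stored
    per_token = weights_gb(active_b, four_bit)  # what is used per token

    print(f"active-only guess: ~{per_token:.1f} GB")
    print(f"total weights:     ~{resident:.1f} GB")
    print(f"underestimate:     ~{resident - per_token:.1f} GB")
```

Under these assumptions, reading the active parameter count as the memory requirement would understate the weights by roughly 17 GB, before KV cache and runtime overhead are counted.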
Choose your next tool
Pick the next route based on the uncertainty you still need to reduce.
Need a memory estimate?
Use the calculator before any GPU comparison. It separates dense LLM, MoE, and image-generation estimate paths.
Run an estimate →
Need tier tradeoffs?
Use the 12GB vs 16GB guide when the main uncertainty is whether a starter tier or buffer tier makes more sense.
Compare tiers →
Need model-specific facts?
Use the model pages when the question is about a specific dense 7B or 8B planning target.
Browse model pages →
Need to reduce local risk?
Use the cloud guide when a temporary validation run is safer than committing to local hardware first.
Open cloud guide →
FAQ
What is the first thing to check when choosing a GPU for local LLMs?
Start with the model family and workload shape. Dense LLMs, MoE models, long-context coding use, and image-generation crossover should be estimated through separate assumptions before comparing GPUs.
How much VRAM should a first local LLM GPU have?
For compact dense 7B and 8B 4-bit planning, 12GB is a practical first testing tier and 16GB is a more comfortable experimentation buffer. The exact runtime, quantization, and context still need validation.
Should I choose the fastest GPU or the GPU with more VRAM?
This guide does not rank performance. For local LLM planning, first confirm that the workload fits the VRAM tier, then review source-aware GPU profiles and validate the exact runtime.
Can the same GPU choice cover local LLMs and image generation?
Sometimes, but do not assume it. Image generation has separate memory drivers such as resolution, VAE, LoRA, ControlNet, runtime, and offload settings.
Do MoE models need a different GPU planning method?
Yes. MoE models should not be estimated with dense LLM shortcuts or active parameters alone. Use the separate MoE estimate mode and treat the result as a planning baseline.
Is this guide buying advice?
No. It avoids price, stock, affiliate, benchmark, speed, and guaranteed-fit claims. Use it to choose a validation path before making a hardware decision.