Local LLM GPU guide

How to Choose a GPU for Local LLMs

Use this guide to plan a GPU choice for local LLMs. It starts from workload fit, model assumptions, quantization, context length, MoE behavior, and validation risk.

The practical answer is to choose a GPU last: estimate the model's memory needs first, map the result to a VRAM tier, and only then compare source-aware GPU profiles against that memory target.


Planning notice: this guide avoids benchmark, tokens-per-second, price, stock, affiliate, and guaranteed-fit claims. Use it to choose a validation path before making any hardware decision.

Start with these 5 questions

Model

Which exact model or model family are you testing first?

Start from a real model page or calculator record. A generic local LLM goal is too broad to choose a useful GPU tier.

Architecture

Is it a dense LLM, MoE model, or mixed image workflow?

Dense models, MoE models, and image-generation pipelines need different estimate paths and different validation checks.

Format

What quantization format are you actually planning to run?

4-bit, 8-bit, and FP16/BF16 can land in very different memory bands, even when the model name stays the same.

Context

How much context does your real workflow need?

Short chat prompts, coding sessions, retrieval context, and long documents create different KV-cache pressure.

Risk

Is the estimate close enough to the limit that you should validate first?

If the estimate lands near a VRAM boundary, use a runtime test or cloud validation before narrowing local hardware.
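As a rough illustration of the Format question above, weight memory scales with bits per weight. The sketch below is a weights-only approximation. The function name and the flat bits-per-weight table are assumptions made for this example, and real quantized files usually carry extra metadata such as scales, so actual sizes tend to be somewhat higher.

```python
# Weights-only sketch: ignores KV cache, activations, and runtime overhead.
# Bits per weight are nominal values; real quantized formats usually store
# extra scale metadata, so treat these figures as lower bounds.

NOMINAL_BITS = {"4-bit": 4, "8-bit": 8, "FP16/BF16": 16}


def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GB (10^9 bytes)."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


if __name__ == "__main__":
    for label, bits in NOMINAL_BITS.items():
        size = weight_memory_gb(8, bits)
        print(f"8B parameters at {label}: ~{size:.1f} GB of weights")
```

Under these nominal assumptions the same 8B parameter count lands at roughly 4 GB, 8 GB, or 16 GB of weights, which is why the format has to be fixed before any tier is compared.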

Decision flow: from model to GPU profile

Name the model

Pick the exact model family or page first. If the page exists, use it as the starting context before opening GPU profiles.

Browse model pages

Choose dense or MoE path

Use dense LLM assumptions for ordinary dense models. Switch to MoE mode when the model lists separate total and active parameter counts, or when its packaging differs from a standard dense release.

Use calculator modes

Lock the quantization target

Write down whether you are planning for 4-bit, 8-bit, or FP16/BF16. Do not compare GPUs from a vague model-size guess.

Set quantization

Stress the context assumption

Move the estimate toward your real prompt shape: short chat, code context, retrieval context, or longer multi-turn work.

Adjust context

Check the memory boundary

If the estimate is close to 8GB, 12GB, 16GB, or 24GB, treat that tier as a validation target instead of a comfortable answer.

Review tier tradeoffs

Read GPU profiles last

Once the memory target is visible, compare source-aware GPU records. At this point GPU pages become useful research, not guessing.

Review GPUs
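The boundary check in the flow above can be sketched as a small helper. This is an illustration only: the function name and the 15% margin are hypothetical planning choices, not measured thresholds, and the tier list simply repeats the boundaries named in this guide.

```python
from typing import Optional, Tuple

# VRAM boundaries named in this guide, in GB.
VRAM_TIERS_GB = (8, 12, 16, 24)


def classify_estimate(estimate_gb: float, margin: float = 0.15) -> Tuple[Optional[int], str]:
    """Return (smallest tier that holds the estimate, planning status).

    margin is a hypothetical headroom fraction chosen for illustration.
    """
    for tier in VRAM_TIERS_GB:
        if estimate_gb <= tier:
            headroom = (tier - estimate_gb) / tier
            if headroom < margin:
                # Too close to the limit: test the exact runtime first.
                return tier, "validation target"
            return tier, "planning headroom"
    return None, "exceeds listed tiers"


if __name__ == "__main__":
    for estimate in (7.5, 9.0, 11.2, 30.0):
        print(estimate, classify_estimate(estimate))
```

Even a "planning headroom" result is a planning baseline rather than a fit guarantee; the exact runtime, quantization, and context still need validation.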

Start from these model examples

Use the current dense LLM pages as examples of decision types, not as generic proof that every nearby model will fit the same GPU tier.

Example: Long-context caution

Meta Llama 3.1 8B Instruct

Use this page to understand why a model's context metadata should not automatically become the local VRAM target.

Open model page →

Example: Compact 7B planning

Qwen2.5 7B Instruct

Use this page to see how a compact dense 7B model can still need context and runtime validation.

Open model page →

Example: Baseline 7B comparison

Mistral 7B Instruct v0.3

Use this page as a clean 7B baseline before assuming another 7B or 8B model behaves identically.

Open model page →
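To see why advertised context metadata should not become the VRAM target by default, here is a rough KV-cache sketch. It assumes a single sequence, an unquantized 16-bit cache, and grouped-query attention. The architecture values (32 layers, 8 KV heads, head dimension 128) are what I understand the published Llama 3.1 8B configuration to be, so check the model's own config file before relying on them. The function name is hypothetical.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_tokens: int, bytes_per_element: int = 2) -> int:
    """Approximate KV-cache size for one sequence.

    The leading factor of 2 covers the separate key and value tensors
    stored for every layer.
    """
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_element
    return per_token * context_tokens


if __name__ == "__main__":
    # Assumed Llama 3.1 8B shape; verify against the model's config.
    layers, kv_heads, head_dim = 32, 8, 128
    for context in (8_192, 32_768, 131_072):
        gib = kv_cache_bytes(layers, kv_heads, head_dim, context) / 2**30
        print(f"{context:>7} tokens: ~{gib:.1f} GiB of KV cache")
```

Under these assumptions the cache grows from about 1 GiB at 8,192 tokens to about 16 GiB at 131,072 tokens, on top of the weights. Runtimes that quantize or page the cache will allocate differently, which is one more reason to validate with the runtime you actually plan to use.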

What matters more than the GPU name

The useful inputs are model family, parameter size, quantization, context length, runtime, offload behavior, and tolerance for troubleshooting. A GPU model name is only useful after those inputs produce a memory target.

This is why the page links into the calculator and model pages before the GPU index. It keeps the reader from comparing cards without knowing the workload boundary.

When to move beyond local-first planning

Move to cloud validation when the estimate is close to the card limit, the runtime is unfamiliar, the model is MoE or larger than your first dense LLM target, or the same workload also has to cover heavier image generation.

Cloud testing is not a provider ranking here. It is a way to reduce uncertainty before committing to a local hardware path.
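The four triggers above can be written down as an explicit checklist. This is a minimal sketch: the class and field names are hypothetical, and each answer is a judgment you supply, not something the code measures.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PlanningState:
    # Each flag is a manual answer to one trigger from this section.
    near_card_limit: bool
    runtime_unfamiliar: bool
    moe_or_larger_than_first_target: bool
    needs_image_generation: bool


def cloud_validation_reasons(state: PlanningState) -> List[str]:
    """Return the triggers that apply; an empty list means none did."""
    checks = [
        (state.near_card_limit, "estimate is close to the card limit"),
        (state.runtime_unfamiliar, "runtime is unfamiliar"),
        (state.moe_or_larger_than_first_target,
         "model is MoE or larger than the first dense target"),
        (state.needs_image_generation, "workload also covers image generation"),
    ]
    return [reason for applies, reason in checks if applies]


if __name__ == "__main__":
    state = PlanningState(True, False, True, False)
    for reason in cloud_validation_reasons(state):
        print("Validate in the cloud first:", reason)
```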

When not to buy yet

Pause: validate first

The estimate sits near a VRAM boundary

A result close to a tier limit is a signal to test the exact runtime, not a signal to immediately pick the cheapest card that crosses the line.

Pause: validate first

You have not picked a runtime

llama.cpp, Ollama, vLLM, Transformers-style paths, and offload settings can allocate memory differently.

Pause: validate first

The model is MoE or unusually packaged

MoE models need the separate estimate mode. Active parameters alone are not the resident VRAM requirement.

Pause: validate first

You need long-context coding or retrieval

Code files, retrieval chunks, and long chats can make context the dominant uncertainty.

Pause: validate first

The same machine must also do image generation

Image-generation memory drivers are separate. Validate the image workflow before assuming a local LLM GPU choice covers both jobs.

How to read GPU profiles after estimating VRAM

Profile check (after estimate)

Verified VRAM capacity

Use the profile to confirm the card's memory capacity from source-backed fields before comparing a tier against your estimate.

Profile check (after estimate)

Source confidence

Prefer records with reviewed source fields. Treat low-confidence or incomplete records as planning candidates, not facts.

Profile check (after estimate)

Comparison context

Use comparison pages after the estimate so you compare realistic candidates instead of every GPU in the index.

Profile check (after estimate)

Non-memory constraints

After VRAM, review physical fit, power, platform compatibility, drivers, and workflow setup outside this guide's scope.
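The first two profile checks can be sketched as a filter over profile records. Everything here is hypothetical: the record fields, the confidence labels, and the placeholder card names are invented for illustration and do not describe a real profile schema or real hardware.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class GpuProfile:
    name: str
    vram_gb: float
    source_confidence: str  # hypothetical labels: "reviewed", "low", "incomplete"


def shortlist(profiles: List[GpuProfile], target_gb: float) -> Tuple[List[str], List[str]]:
    """Split profiles that meet the memory target into two groups.

    Returns (reviewed records, planning candidates). Cards below the
    target are dropped from both groups.
    """
    reviewed: List[str] = []
    candidates: List[str] = []
    for profile in profiles:
        if profile.vram_gb < target_gb:
            continue
        if profile.source_confidence == "reviewed":
            reviewed.append(profile.name)
        else:
            candidates.append(profile.name)
    return reviewed, candidates


if __name__ == "__main__":
    # Placeholder records, not real cards.
    records = [
        GpuProfile("Card A", 8, "reviewed"),
        GpuProfile("Card B", 16, "reviewed"),
        GpuProfile("Card C", 16, "low"),
    ]
    print(shortlist(records, target_gb=12))
```

Physical fit, power, platform compatibility, and drivers still need a separate review after this memory-only pass.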

Mistakes to avoid

These are the patterns that most often turn a useful planning question into a weak GPU recommendation.

Do not choose a GPU only from parameter count.
Do not read active MoE parameters as a complete VRAM requirement.
Do not treat one successful prompt as proof that longer context will fit.
Do not compare GPU prices or stock from this guide; no live commerce data is used here.
Do not treat calculator output as tokens-per-second, image-speed, or official hardware support evidence.

Choose your next tool

Pick the next route based on the uncertainty you still need to reduce.

Need a memory estimate?

Use the calculator before any GPU comparison. It separates dense LLM, MoE, and image-generation estimate paths.

Run an estimate→

Need tier tradeoffs?

Use the 12GB vs 16GB guide when the main uncertainty is whether a starter tier or buffer tier makes more sense.

Compare tiers→

Need model-specific facts?

Use the model pages when the question is about a specific dense 7B or 8B planning target.

Browse model pages→

Need to reduce local risk?

Use the cloud guide when a temporary validation run is safer than committing to local hardware first.

Open cloud guide→

FAQ

What is the first thing to check when choosing a GPU for local LLMs?

Start with the model family and workload shape. Dense LLMs, MoE models, long-context coding use, and image-generation crossover should be estimated through separate assumptions before comparing GPUs.

How much VRAM should a first local LLM GPU have?

For compact dense 7B and 8B 4-bit planning, 12GB is a practical first testing tier and 16GB is a more comfortable experimentation buffer. The exact runtime, quantization, and context still need validation.

Should I choose the fastest GPU or the GPU with more VRAM?

This guide does not rank performance. For local LLM planning, first confirm that the workload fits the VRAM tier, then review source-aware GPU profiles and validate the exact runtime.

Can the same GPU choice cover local LLMs and image generation?

Sometimes, but do not assume it. Image generation has separate memory drivers such as resolution, VAE, LoRA, ControlNet, runtime, and offload settings.

Do MoE models need a different GPU planning method?

Yes. MoE models should not be estimated with dense LLM shortcuts or active parameters alone. Use the separate MoE estimate mode and treat the result as a planning baseline.
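A short sketch of why active parameters alone understate the requirement. It uses a hypothetical MoE model with 40B total and 10B active parameters, nominal 4-bit weights, and the assumption that every expert stays resident in VRAM. Offload settings and real quantization overhead would change the numbers.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Weights-only estimate in GB (10^9 bytes); no KV cache or overhead."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    # Hypothetical MoE model, not a real release.
    total_b, active_b, bits = 40, 10, 4

    resident = weights_gb(total_b, bits)      # all experts held in VRAM
    active_only = weights_gb(active_b, bits)  # the misleading shortcut

    print(f"Resident weights (total parameters): ~{resident:.0f} GB")
    print(f"Active-parameter shortcut:           ~{active_only:.0f} GB")
```

In this illustration the shortcut suggests about 5 GB while the resident weights need about 20 GB. The active count describes how many parameters are used per token, not how many must stay loaded.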

Is this guide buying advice?

No. It avoids price, stock, affiliate, benchmark, speed, and guaranteed-fit claims. Use it to choose a validation path before making a hardware decision.