Free planning tool

VRAM Calculator for Local AI, MoE, and Image Generation

Estimate how much GPU VRAM you may need for local LLMs, MoE models, quantized models, Stable Diffusion, FLUX, and image-generation workflows.

Mode
AI model

MoE has a separate estimate mode

Dense LLM estimates are based on a single model size. Switch to MoE mode for models such as DeepSeek-R1 and Mixtral so that total parameters and active parameters are handled separately.

Quantization
Context preset
Runtime profile
Safety margin: 20%

Planning estimate
Not benchmark data
Assumption profile 2026-06-calculator-eligibility-v1

8.7 GB

Planning minimum: 9 GB VRAM

Planning tier: 12 GB VRAM
Selected model: DeepSeek-R1-Distill-Llama-8B
Runtime: llama.cpp / GGUF planning
Confidence: LOW

8.7 GB planning estimate with assumption profile 2026-06-calculator-eligibility-v1; validate on your exact runtime.

This is a planning estimate, not a benchmark. Validate your exact model, runtime, context, and driver stack before hardware decisions.

Calculator assumptions are explicit and versioned.
GPU matches are planning candidates only and are not benchmark-based buying advice.
Observed validation samples are tracked separately; until a sample is sourced, results remain estimate-only.

Source-backed GPU matches

RTX 5080: 16 GB VRAM, source-backed
RTX 5070 Ti: 16 GB VRAM, source-backed
RTX 5070: 12 GB VRAM, source-backed
RTX 5060 Ti 16GB: 16 GB VRAM, source-backed
RTX 4070 Super: 12 GB VRAM, source-backed
RTX 3060 12GB: 12 GB VRAM, source-backed

This calculator provides a rough estimate only. Use runtime-specific validation before selecting hardware: actual VRAM usage depends on quantization, context length, KV cache, runtime, batch size, drivers, and model architecture.

Estimate GPU VRAM for LLMs and AI workloads

A VRAM calculator for LLM planning is useful when you are deciding whether a local AI workflow is realistic on a single GPU. Model size alone is not enough to answer that question. A local language model loaded in FP16 can need substantially more memory than a quantized version of the same model, while longer prompts and conversations add memory pressure through the context window and KV cache. This estimator gives you a starting range before you test a specific model, runtime, and GPU configuration.

Choose a model size, a rough quantization level, and the context preset that most closely represents your intended workload. The output is meant for early GPU memory planning: exploring local LLM GPU memory, evaluating whether a build should target a larger VRAM tier, or deciding where further benchmarking is necessary. The same caution applies when researching Stable Diffusion VRAM needs and other AI image workloads, where resolution, batch size, model pipeline, and extensions can materially alter usage.

The LLM estimate is deliberately conservative and transparent. It uses simple memory assumptions for FP16, INT8, and INT4 weights, adds a context allowance, then applies your selected safety margin. It does not report tokens per second, generation speed, or official hardware support. Use it to narrow your initial options, then verify the selected runtime, quantization format, driver stack, and actual model on the hardware you plan to run.

The MoE mode uses a separate formula for models such as DeepSeek-R1 and Mixtral, where total parameters and active parameters have different meanings. The image-generation mode uses separate workflow presets for SDXL, Stable Diffusion 3.5, and FLUX-style planning. Resolution, batch size, runtime, VAE, LoRA, and ControlNet can change memory use, so the output remains a planning tier rather than a benchmark-backed support claim.

Read the image-generation VRAM planning guide

Visual workflow

Turn workload assumptions into a planning tier

Each input changes the rough estimate. The output is a direction for further validation, not a verified performance claim.

Model size: Parameter scale
Quantization: Weight format
Context: Memory overhead
VRAM estimate: Rough output
GPU tier: Research next

How this VRAM estimate works

The current dense LLM estimate assumes approximately 2 GB of weight memory per billion parameters for FP16, 1 GB for INT8, or 0.5 GB for INT4. It adds a short, medium, or long context overhead, then applies a configurable safety margin to the total. These assumptions are a planning heuristic only and require validation against the target runtime.
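The heuristic described above can be sketched in a few lines of Python. This is an illustrative sketch, not the calculator's actual implementation: the per-billion weight figures come from the text, while the function name and the context overhead values (1, 2, and 4 GB) are placeholder assumptions, because the page does not publish the exact allowances.

```python
# Weight memory in GB per billion parameters, as stated in the text.
GB_PER_BILLION_PARAMS = {"FP16": 2.0, "INT8": 1.0, "INT4": 0.5}

# ASSUMPTION: placeholder context allowances in GB. The page does not
# publish the values behind its short/medium/long presets.
CONTEXT_OVERHEAD_GB = {"short": 1.0, "medium": 2.0, "long": 4.0}


def estimate_dense_vram_gb(params_billion, quantization,
                           context="short", safety_margin=0.20):
    """Rough dense-LLM VRAM estimate: (weights + context) * (1 + margin)."""
    weights_gb = params_billion * GB_PER_BILLION_PARAMS[quantization]
    subtotal_gb = weights_gb + CONTEXT_OVERHEAD_GB[context]
    return subtotal_gb * (1.0 + safety_margin)


if __name__ == "__main__":
    for quant in ("FP16", "INT8", "INT4"):
        estimate = estimate_dense_vram_gb(8, quant, context="short")
        print(f"8B {quant}: {estimate:.1f} GB")
```

With these placeholder values, an 8B model at INT4 with a short context and a 20% margin comes to (8 × 0.5 + 1) × 1.2 = 6.0 GB. The calculator's own presets differ, so treat the number as an illustration of the arithmetic only.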

GPU VRAM planning tiers

Results are grouped into planning tiers such as 8 GB, 12 GB, 16 GB, or 24 GB and above. A tier is not a GPU endorsement. GPU records and model requirements remain draft until sourced specifications and controlled workload tests are available.
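As a rough sketch of how an estimate could map onto these tiers, the function below rounds the estimate up to a whole-GB planning minimum and picks the first tier that covers it. The rounding rule and the function name are assumptions; they are merely consistent with the sample result on this page, where 8.7 GB becomes a 9 GB planning minimum and a 12 GB tier.

```python
import math

# Tier boundaries named in the text; anything above 16 GB falls into
# the open-ended "24 GB and above" tier.
BOUNDED_TIERS_GB = (8, 12, 16)
TOP_TIER_LABEL = "24 GB and above"


def planning_tier(estimate_gb):
    """Return (planning minimum in whole GB, tier label).

    ASSUMPTION: the minimum is the estimate rounded up to a whole GB.
    """
    minimum_gb = math.ceil(estimate_gb)
    for tier_gb in BOUNDED_TIERS_GB:
        if minimum_gb <= tier_gb:
            return minimum_gb, f"{tier_gb} GB"
    return minimum_gb, TOP_TIER_LABEL


if __name__ == "__main__":
    print(planning_tier(8.7))
    print(planning_tier(20.4))
```

A tier label produced this way is still only a search direction: it says which VRAM class to research next, not which GPU will run the workload.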

Mixture-of-Experts estimate mode

Mixture-of-Experts models now use a separate planning estimate path instead of the dense LLM formula. An MoE model can have a large total parameter count, a smaller active parameter count, routing behavior, and runtime-specific memory allocation that do not map cleanly to a dense parameter-size estimate.

This means models such as DeepSeek-R1 and Mixtral are not estimated by pretending they are ordinary dense models. The MoE path distinguishes total parameters, active parameters, KV cache, context length, batching, expert routing, and runtime behavior.

The current MoE formula is intentionally conservative: it uses total parameters, or a higher source-backed packaged model size when available, as the resident weight-memory baseline. Active parameters are displayed for architecture context, not as a minimum VRAM claim.
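The baseline rule in the previous paragraph can be sketched as follows. This is a hedged illustration rather than the calculator's code: the function name and parameters are hypothetical, and the example figures (about 47B total and 13B active parameters for a Mixtral-style model, and a 26 GB packaged file) are approximate or invented inputs used only to show the arithmetic.

```python
def moe_resident_weights_gb(total_params_billion, gb_per_billion,
                            packaged_size_gb=None):
    """Conservative resident weight baseline for an MoE model.

    Uses total parameters, or the packaged model size when one is
    available and larger. Active parameters are deliberately not an
    input: they are shown for context, not used as a minimum.
    """
    baseline_gb = total_params_billion * gb_per_billion
    if packaged_size_gb is not None:
        baseline_gb = max(baseline_gb, packaged_size_gb)
    return baseline_gb


if __name__ == "__main__":
    total_b, active_b = 47.0, 13.0  # approximate, illustrative figures
    int4 = 0.5                      # GB per billion parameters at INT4

    resident = moe_resident_weights_gb(total_b, int4, packaged_size_gb=26.0)
    active_only = active_b * int4   # NOT a valid VRAM minimum

    print(f"Resident baseline: {resident:.1f} GB")
    print(f"Active-only figure (context only): {active_only:.1f} GB")
```

The gap between the two printed numbers is the reason active parameters are shown as architecture context only: sizing a GPU from the active-only figure would understate resident weight memory by a wide margin.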

Source-backed model VRAM pages

Model VRAM: 8B
Meta Llama 3.1 8B Instruct
Source-backed LLM profile for long-context local planning.

Model VRAM: 7B
Qwen2.5 7B Instruct
Source-backed Qwen profile for calculator-ready local LLM planning.

Model VRAM: 7B
Mistral 7B Instruct v0.3
Source-backed Mistral profile for calculator planning.

How to read this estimate

Treat the estimate as a planning baseline. First review whether your selected model, runtime, and context are realistic for your workflow, then compare source-backed GPU profiles before testing the exact setup in your own environment.

After estimating VRAM, compare source-backed GPU profiles

Source-backed vs planning-only matches

Source-backed matches prioritize GPUs with verified core fields. Planning-only candidates can still appear when needed, but they are not benchmark claims and require additional verification.
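One way to picture this ordering is a filter followed by a sort. The sketch below is an assumption about how such matching could work, not the site's implementation: the record type, the function name, and the planning-only entry are hypothetical, while the six source-backed GPUs and their VRAM sizes are the ones listed on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GpuRecord:
    name: str
    vram_gb: int
    source_backed: bool


GPUS = [
    GpuRecord("RTX 5080", 16, True),
    GpuRecord("RTX 5070 Ti", 16, True),
    GpuRecord("RTX 5070", 12, True),
    GpuRecord("RTX 5060 Ti 16GB", 16, True),
    GpuRecord("RTX 4070 Super", 12, True),
    GpuRecord("RTX 3060 12GB", 12, True),
    # HYPOTHETICAL planning-only entry, added for illustration.
    GpuRecord("Example GPU (planning-only)", 24, False),
]


def match_gpus(planning_minimum_gb, gpus):
    """Keep GPUs with enough VRAM; list source-backed records first,
    then order by VRAM size (smallest sufficient card first)."""
    candidates = [g for g in gpus if g.vram_gb >= planning_minimum_gb]
    return sorted(candidates, key=lambda g: (not g.source_backed, g.vram_gb))


if __name__ == "__main__":
    for gpu in match_gpus(9, GPUS):
        label = "source-backed" if gpu.source_backed else "planning-only"
        print(f"{gpu.name}: {gpu.vram_gb} GB VRAM, {label}")
```

The sort key is a design choice for the sketch: it never hides planning-only candidates, it only moves them below records with verified fields.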

Frequently asked questions

How much VRAM do I need for a 7B model?

It depends on quantization, context length, runtime, and overhead. A quantized 7B configuration may require far less memory than FP16, so use the calculator as an initial estimate and validate the intended runtime.

Is 8 GB VRAM enough for local AI?

It may be sufficient for some smaller or quantized workloads, but it is not a universal threshold. Image generation, longer contexts, larger batches, and different runtimes can increase memory demand.

How much VRAM do I need for SDXL or FLUX?

Image-generation VRAM depends on the model family, resolution, batch size, runtime, VAE, and adapters such as LoRA or ControlNet. Use the image-generation mode as a planning estimate, then validate the exact workflow.

Does quantization reduce VRAM usage?

Quantization generally reduces weight memory compared with higher precision formats, but actual VRAM use also includes context, KV cache, runtime overhead, and other implementation details.
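To see why context matters independently of weight quantization, the sketch below uses the standard KV cache size formula for a transformer: two tensors (keys and values) per layer, each sized by KV heads, head dimension, and token count. This is not how the calculator computes its context presets; it is a separate illustration. The function name is hypothetical, and the architecture figures in the example (32 layers, 8 KV heads, head dimension 128) are the commonly published values for Llama 3.1 8B, which should be checked against the model's own configuration.

```python
def kv_cache_gib(num_layers, num_kv_heads, head_dim, context_tokens,
                 bytes_per_value=2, batch_size=1):
    """Approximate KV cache size in GiB.

    The factor of 2 accounts for storing both keys and values.
    bytes_per_value=2 corresponds to a 16-bit cache.
    """
    total_bytes = (2 * num_layers * num_kv_heads * head_dim
                   * context_tokens * bytes_per_value * batch_size)
    return total_bytes / 1024**3


if __name__ == "__main__":
    # ASSUMPTION: Llama 3.1 8B style configuration with a 16-bit cache.
    for tokens in (8_192, 32_768, 131_072):
        size = kv_cache_gib(32, 8, 128, tokens)
        print(f"{tokens} tokens: {size:.1f} GiB")
```

Under these assumptions the cache grows linearly with context, from 1 GiB at 8,192 tokens to 16 GiB at 131,072 tokens, regardless of how aggressively the weights are quantized. Runtimes that quantize or page the cache will report different numbers.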

How does the MoE estimate work?

The MoE mode uses source-backed total parameters as the conservative resident weight-memory baseline, while active parameters are shown as architecture context. The result remains a planning estimate, not benchmark data or a hardware guarantee.

Why are active parameters not the VRAM requirement?

Active parameters describe the subset of parameters used for per-token computation. They do not prove that only those weights need to reside in GPU memory, especially when runtime loading, expert routing, offload, and model packaging vary.