VRAM (video RAM on your GPU) is the single most important hardware constraint for local AI workloads. Too little, and your model either won't load or falls back to slow CPU/RAM inference. This guide gives you the exact numbers for every common workload.
The VRAM Formula
For inference (running a model, not training it), the minimum VRAM requirement is approximately:
VRAM (GB) ≈ Model Parameters (B) × Bytes per Parameter + 1–2GB overhead
Bytes per parameter by precision:
- FP16 / BF16: 2 bytes per parameter
- INT8 (Q8): 1 byte per parameter
- INT4 (Q4): 0.5 bytes per parameter
Example: Llama 3.1 70B at Q4 = 70 × 0.5 + 2GB overhead = ~37GB VRAM needed. In practice, Q4_K_M averages closer to 4.8 bits (~0.6 bytes) per weight, so budget ~40–45GB for a 70B model, as the table below reflects.
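If you want to run this arithmetic for other models, here is a minimal sketch of the formula as a Python function. The bytes-per-parameter values and the flat 2GB overhead are the approximations stated above, and the example model names are purely illustrative:

```python
# Rough inference-VRAM estimator implementing the formula above.
# The bytes-per-parameter values and the 2GB overhead are approximations;
# real usage also grows with context length (the KV cache is not modeled here).

BYTES_PER_PARAM = {
    "fp16": 2.0,   # also bf16
    "q8": 1.0,
    "q4": 0.5,     # Q4_K_M averages a bit higher in practice (~0.6)
}

def estimate_vram_gb(params_billion: float, precision: str, overhead_gb: float = 2.0) -> float:
    """Approximate VRAM (GB) needed to load a model for inference."""
    return params_billion * BYTES_PER_PARAM[precision] + overhead_gb

for name, params, prec in [("Mistral 7B", 7, "q8"),
                           ("Llama 3.1 70B", 70, "q4")]:
    print(f"{name} @ {prec}: ~{estimate_vram_gb(params, prec):.1f} GB")
```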
Model-Specific VRAM Requirements
| Model | Params | FP16 | Q8 | Q4_K_M | 12GB GPU | 24GB GPU | 48GB GPU |
|---|---|---|---|---|---|---|---|
| Phi-3.5 Mini | 3.8B | 7.6GB | 4GB | 2.5GB | ✓ FP16 | ✓ FP16 | ✓ FP16 |
| Llama 3.1 8B | 8B | 16GB | 8.5GB | 4.7GB | Q4 only | ✓ Q8 | ✓ FP16 |
| Mistral 7B | 7B | 14GB | 7.5GB | 4.2GB | ✓ Q4 | ✓ Q8 | ✓ FP16 |
| Llama 2 13B | 13B | 26GB | 13.5GB | 7.5GB | Q4 only | ✓ Q4/Q8 | ✓ FP16 |
| DeepSeek Coder 33B | 33B | 66GB | 33GB | 19GB | OOM | Q4 tight | ✓ Q8 |
| Llama 3.1 70B | 70B | 140GB | 70GB | 43GB | OOM | OOM | Q4_K_M |
| Mixtral 8x7B | 47B total (13B active) | 94GB | 47GB | 26GB | OOM | Q4 marginal | ✓ Q4 |
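To sanity-check a card that isn't in the table, one option is to pick the highest precision whose estimate fits. This is a sketch under the same assumptions as the formula above (it repeats the bytes-per-parameter constants and the 2GB overhead), not the exact method used to build the table:

```python
BYTES_PER_PARAM = {"fp16": 2.0, "q8": 1.0, "q4": 0.5}

def best_precision(params_billion: float, gpu_vram_gb: float, overhead_gb: float = 2.0):
    """Return the highest precision whose rough estimate fits the GPU, else None."""
    for precision in ("fp16", "q8", "q4"):  # try highest quality first
        needed = params_billion * BYTES_PER_PARAM[precision] + overhead_gb
        if needed <= gpu_vram_gb:
            return precision
    return None

# A 70B model on 12GB / 24GB / 48GB cards -> [None, None, 'q4'],
# matching the Llama 3.1 70B row above.
print([best_precision(70, vram) for vram in (12, 24, 48)])
```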
Fine-Tuning VRAM Requirements
Fine-tuning requires significantly more VRAM than inference because you need to store gradients and optimizer states in addition to model weights.
- Full fine-tuning: roughly 4× the FP16 inference VRAM, because gradients and optimizer states must sit alongside the weights. A 7B model needs ~56GB, so full fine-tuning is impractical on a single consumer GPU for all but the smallest models.
- LoRA fine-tuning: ~2× inference VRAM. A 13B model at Q4 LoRA fits in 16GB (RTX 4080). Best approach for most teams.
- QLoRA (4-bit + LoRA): Enables fine-tuning large models on consumer hardware. 70B QLoRA runs on 2×4090 (48GB total). Standard choice for teams without A100 access.
The practical recommendation: If you're fine-tuning models under 34B, an RTX 4090 (24GB) handles most workloads with QLoRA. For 70B fine-tuning, you need either two 24GB cards (e.g. 2×4090, sharded over PCIe) or cloud A100 access. Cloud is usually the more economical choice for infrequent fine-tuning runs.
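To make the QLoRA option concrete, here is a minimal sketch using the Hugging Face transformers, peft, and bitsandbytes stack. The model ID and LoRA hyperparameters are illustrative placeholders, not a tuned recipe:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit NF4 quantization of the frozen base weights (the "Q" in QLoRA)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",        # placeholder model ID
    quantization_config=bnb_config,
    device_map="auto",
)

# Small trainable LoRA adapters on top of the frozen 4-bit base
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()    # typically well under 1% of total parameters
```

Only the adapter weights and their optimizer states are trained, which is why the VRAM multiplier stays so much lower than full fine-tuning.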
Multi-GPU Setups
Two options for running models that exceed single-GPU VRAM:
- NVLink (NVIDIA only): High-bandwidth GPU-to-GPU interconnect that lets two cards behave close to one pooled memory space. 2×RTX 3090s become roughly 48GB of effective VRAM, with the best performance for large models. Requires an NVLink bridge and GPUs that support it; consumer RTX 40-series cards (including the 4090) dropped NVLink, so on the consumer side this means RTX 3090s, or professional cards like the RTX A6000.
- Splitting the model across GPUs over PCIe (via llama.cpp or vLLM): llama.cpp distributes whole layers between cards, while vLLM supports true tensor parallelism within each layer. Slower, because PCIe bandwidth becomes the bottleneck, but much cheaper to set up; any multi-GPU machine works.
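As an example of the second option, vLLM exposes tensor parallelism as a single argument. A minimal sketch; the model ID is a placeholder, and it assumes the chosen checkpoint (quantized or not) actually fits in the combined VRAM of the two cards:

```python
from vllm import LLM, SamplingParams

# Shard the model across 2 GPUs (PCIe, or NVLink where supported)
llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",  # placeholder; use a quantized build for 2×24GB
    tensor_parallel_size=2,
)

outputs = llm.generate(
    ["Explain VRAM in one sentence."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```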
Need Help Sizing Your AI Hardware?
Tell us your workload and we'll spec the right hardware — and deploy it properly. Free consultation.
Get a Free AI Audit