AI Hardware

GPU Requirements for AI Workloads: The Complete Guide

Exactly how much VRAM you need for inference, fine-tuning, and training — with model-specific numbers so you can stop guessing.

9 min read · April 2025

VRAM (video RAM on your GPU) is the single most important hardware constraint for local AI workloads. Too little, and your model either won't load or falls back to slow CPU/RAM inference. This guide gives you the exact numbers for every common workload.

VRAM capacity is the primary constraint for local LLM deployment — everything else is secondary.

The VRAM Formula

For inference (running a model, not training it), the minimum VRAM requirement is approximately:

VRAM (GB) ≈ Model Parameters (billions) × Bytes per Parameter + 1–2GB overhead

Bytes per parameter by precision:

- FP16: 2 bytes
- Q8 (8-bit quantization): ~1 byte
- Q4_K_M (4-bit quantization): ~0.5 bytes

Example: Llama 3.1 70B at Q4_K_M = 70 × 0.5 + 2GB overhead = ~37GB VRAM needed.
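The formula above is simple enough to sketch in a few lines of Python. The bytes-per-parameter values are approximations, and real usage also varies with context length and KV-cache size, which this estimate ignores:

```python
# Minimal sketch of the VRAM formula. Bytes-per-parameter values are
# approximations; KV-cache and context-length overhead are not modeled.

BYTES_PER_PARAM = {
    "fp16": 2.0,    # 16-bit floats: 2 bytes per weight
    "q8": 1.0,      # 8-bit quantization: ~1 byte per weight
    "q4_k_m": 0.5,  # 4-bit quantization: ~0.5 bytes per weight
}

def estimate_vram_gb(params_billion: float, precision: str,
                     overhead_gb: float = 2.0) -> float:
    """VRAM (GB) ≈ params (billions) × bytes/param + overhead."""
    return params_billion * BYTES_PER_PARAM[precision] + overhead_gb

print(estimate_vram_gb(70, "q4_k_m"))  # Llama 3.1 70B at Q4_K_M -> 37.0
```

Plugging in the table's models reproduces its numbers: an 8B model at Q8 comes out to roughly 10GB with overhead, which is why it fits a 12GB card but FP16 (18GB) does not.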

Model-Specific VRAM Requirements

| Model | Params | FP16 | Q8 | Q4_K_M | 12GB GPU | 24GB GPU | 48GB GPU |
|---|---|---|---|---|---|---|---|
| Phi-3.5 Mini | 3.8B | 7.6GB | 4GB | 2.5GB | ✓ FP16 | ✓ FP16 | ✓ FP16 |
| Llama 3.1 8B | 8B | 16GB | 8.5GB | 4.7GB | Q4 only | ✓ Q8 | ✓ FP16 |
| Mistral 7B | 7B | 14GB | 7.5GB | 4.2GB | ✓ Q4 | ✓ Q8 | ✓ FP16 |
| Llama 2 13B | 13B | 26GB | 13.5GB | 7.5GB | Q4 only | ✓ Q4/Q8 | ✓ FP16 |
| DeepSeek Coder 33B | 33B | 66GB | 33GB | 19GB | OOM | Q4 tight | ✓ Q8 |
| Llama 3.1 70B | 70B | 140GB | 70GB | 43GB | OOM | OOM | Q4_K_M |
| Mixtral 8x7B | 47B total | 94GB | 47GB | 26GB | OOM | Q4 marginal | ✓ Q4/Q8 |
Understanding VRAM constraints before purchasing hardware saves thousands in poor buying decisions.

Fine-Tuning VRAM Requirements

Fine-tuning requires significantly more VRAM than inference because you need to store gradients and optimizer states in addition to model weights.

The practical recommendation: if you're fine-tuning models under 34B, an RTX 4090 (24GB) handles most workloads with QLoRA. For 70B fine-tuning, you need either two 4090s with the model sharded over PCIe (the RTX 40-series dropped NVLink support) or cloud A100 access. Cloud is usually the more economical choice for infrequent fine-tuning runs.
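A rough sketch makes the gap between inference and training memory concrete. The per-parameter byte counts below are approximations for mixed-precision Adam, and activation memory (which depends on batch size and sequence length) is ignored:

```python
# Rough sketch of fine-tuning memory needs. Byte counts per parameter
# are approximations; activation memory is not modeled.

def full_finetune_gb(params_billion: float) -> float:
    # fp16 weights (2) + fp16 gradients (2) + fp32 Adam state
    # (master weights + two moments = 12) ≈ 16 bytes per parameter
    return params_billion * 16

def qlora_gb(params_billion: float, trainable_fraction: float = 0.01) -> float:
    # 4-bit frozen base model (~0.5 bytes/param) plus a small set of
    # trainable adapter weights carrying full optimizer state (~16 bytes each)
    return params_billion * 0.5 + params_billion * trainable_fraction * 16

print(full_finetune_gb(8))  # full fine-tune of an 8B model: ~128 GB
print(qlora_gb(8))          # QLoRA on the same 8B model: ~5.3 GB
```

The ~25× difference between those two numbers is why QLoRA makes a 24GB consumer card viable for fine-tuning models that would otherwise demand a multi-GPU cluster.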

Multi-GPU Setups

Two options for running models that exceed single-GPU VRAM:

1. Tensor parallelism: each layer's weights are split across GPUs, and every card works on every token. Fast, but it wants a high-bandwidth interconnect between cards.
2. Pipeline (layer) splitting: whole layers are assigned to different GPUs, with activations handed off between them. Simpler and tolerant of slower PCIe links, but GPUs can sit idle waiting on each other.

Multi-GPU configurations use NVLink or PCIe parallelism to handle models that exceed single-card VRAM.
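A quick way to sanity-check a multi-GPU plan is to divide the model's weight footprint by the card count and add per-GPU runtime overhead. This sketch assumes near-even sharding; real frameworks add communication buffers and replicate some layers:

```python
# Sketch of a per-GPU memory check for sharding a model across cards.
# Assumes near-even splits; real frameworks carry extra buffers.

def fits_multi_gpu(model_gb: float, num_gpus: int,
                   vram_per_gpu_gb: float, overhead_gb: float = 2.0) -> bool:
    # Each GPU holds its shard of the weights plus runtime overhead.
    per_gpu = model_gb / num_gpus + overhead_gb
    return per_gpu <= vram_per_gpu_gb

# Llama 3.1 70B at Q4_K_M (~43GB of weights) across two 24GB cards:
print(fits_multi_gpu(43, 2, 24))  # True: 21.5 + 2 = 23.5 GB per GPU
```

This is why a pair of 24GB cards runs a 70B Q4 model that a single card cannot touch, while FP16 70B (140GB of weights) stays out of reach even for two 48GB cards.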

Need Help Sizing Your AI Hardware?

Tell us your workload and we'll spec the right hardware — and deploy it properly. Free consultation.

Get a Free AI Audit
Devin Mallonee

Founder & AI Agent Architect · CodeStaff

Devin has been building software products and remote teams since 2017. He founded CodeStaff to deploy purpose-built AI agents and workstations that replace repetitive work and scale operations for businesses of every size. He writes about AI strategy, agent architecture, and the practical reality of deploying AI in production.