VRAM (video RAM on your GPU) is the single most important hardware constraint for local AI workloads. Too little, and your model either won't load or falls back to slow CPU/RAM inference. This guide gives you the exact numbers for every common workload.
The VRAM Formula
For inference (running a model, not training it), the minimum VRAM requirement is approximately:
VRAM (GB) ≈ Model Parameters (B) × Bytes per Parameter + 1–2GB overhead
Bytes per parameter by precision:
- FP16 / BF16: 2 bytes per parameter
- INT8 (Q8): 1 byte per parameter
- INT4 (Q4): 0.5 bytes per parameter
Example: Llama 3.1 70B at Q4 = 70 × 0.5 + 2GB overhead = ~37GB VRAM needed. In practice, Q4_K_M averages closer to 4.8 bits (~0.6 bytes) per weight, so budget ~40–45GB for a 70B model, as the table below reflects.
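If you want to run this arithmetic for other models, here is a minimal sketch of the formula as a Python function. The bytes-per-parameter values and the flat 2GB overhead are the approximations stated above, and the example model names are purely illustrative:

```python
# Rough inference-VRAM estimator implementing the formula above.
# The bytes-per-parameter values and the 2GB overhead are approximations;
# real usage also grows with context length (the KV cache is not modeled here).

BYTES_PER_PARAM = {
    "fp16": 2.0,   # also bf16
    "q8": 1.0,
    "q4": 0.5,     # Q4_K_M averages a bit higher in practice (~0.6)
}

def estimate_vram_gb(params_billion: float, precision: str, overhead_gb: float = 2.0) -> float:
    """Approximate VRAM (GB) needed to load a model for inference."""
    return params_billion * BYTES_PER_PARAM[precision] + overhead_gb

for name, params, prec in [("Mistral 7B", 7, "q8"),
                           ("Llama 3.1 70B", 70, "q4")]:
    print(f"{name} @ {prec}: ~{estimate_vram_gb(params, prec):.1f} GB")
```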
Model-Specific VRAM Requirements
| Model | Params | FP16 | Q8 | Q4_K_M | 12GB GPU | 24GB GPU | 48GB GPU |
|---|---|---|---|---|---|---|---|
| Phi-3.5 Mini | 3.8B | 7.6GB | 4GB | 2.5GB | ✓ FP16 | ✓ FP16 | ✓ FP16 |
| Llama 3.1 8B | 8B | 16GB | 8.5GB | 4.7GB | Q4 only | ✓ Q8 | ✓ FP16 |
| Mistral 7B | 7B | 14GB | 7.5GB | 4.2GB | ✓ Q4 | ✓ Q8 | ✓ FP16 |
| Llama 2 13B | 13B | 26GB | 13.5GB | 7.5GB | Q4 only | ✓ Q4/Q8 | ✓ FP16 |
| DeepSeek Coder 33B | 33B | 66GB | 33GB | 19GB | OOM | Q4 tight | ✓ Q8 |
| Llama 3.1 70B | 70B | 140GB | 70GB | 43GB | OOM | OOM | Q4_K_M |
| Mixtral 8x7B | 47B total (13B active) | 94GB | 47GB | 26GB | OOM | Q4 marginal | ✓ Q4 |
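To sanity-check a card that isn't in the table, one option is to pick the highest precision whose estimate fits. This is a sketch under the same assumptions as the formula above (it repeats the bytes-per-parameter constants and the 2GB overhead), not the exact method used to build the table:

```python
BYTES_PER_PARAM = {"fp16": 2.0, "q8": 1.0, "q4": 0.5}

def best_precision(params_billion: float, gpu_vram_gb: float, overhead_gb: float = 2.0):
    """Return the highest precision whose rough estimate fits the GPU, else None."""
    for precision in ("fp16", "q8", "q4"):  # try highest quality first
        needed = params_billion * BYTES_PER_PARAM[precision] + overhead_gb
        if needed <= gpu_vram_gb:
            return precision
    return None

# A 70B model on 12GB / 24GB / 48GB cards -> [None, None, 'q4'],
# matching the Llama 3.1 70B row above.
print([best_precision(70, vram) for vram in (12, 24, 48)])
```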
Fine-Tuning VRAM Requirements
Fine-tuning requires significantly more VRAM than inference because you need to store gradients and optimizer states in addition to model weights.
- Full fine-tuning: roughly 4× the FP16 inference VRAM, because gradients and optimizer states must sit alongside the weights. A 7B model needs ~56GB, so full fine-tuning is impractical on a single consumer GPU for all but the smallest models.
- LoRA fine-tuning: ~2× inference VRAM. A 13B model at Q4 LoRA fits in 16GB (RTX 4080). Best approach for most teams.
- QLoRA (4-bit + LoRA): Enables fine-tuning large models on consumer hardware. 70B QLoRA runs on 2×4090 (48GB total). Standard choice for teams without A100 access.
The practical recommendation: If you're fine-tuning models under 34B, an RTX 4090 (24GB) handles most workloads with QLoRA. For 70B fine-tuning, you need either two 24GB cards (e.g. 2×4090, sharded over PCIe) or cloud A100 access. Cloud is usually the more economical choice for infrequent fine-tuning runs.
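To make the QLoRA option concrete, here is a minimal sketch using the Hugging Face transformers, peft, and bitsandbytes stack. The model ID and LoRA hyperparameters are illustrative placeholders, not a tuned recipe:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit NF4 quantization of the frozen base weights (the "Q" in QLoRA)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",        # placeholder model ID
    quantization_config=bnb_config,
    device_map="auto",
)

# Small trainable LoRA adapters on top of the frozen 4-bit base
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()    # typically well under 1% of total parameters
```

Only the adapter weights and their optimizer states are trained, which is why the VRAM multiplier stays so much lower than full fine-tuning.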
Multi-GPU Setups
Two options for running models that exceed single-GPU VRAM:
- NVLink (NVIDIA only): High-bandwidth GPU-to-GPU interconnect that lets two cards behave close to one pooled memory space. 2×RTX 3090s become roughly 48GB of effective VRAM, with the best performance for large models. Requires an NVLink bridge and GPUs that support it; consumer RTX 40-series cards (including the 4090) dropped NVLink, so on the consumer side this means RTX 3090s, or professional cards like the RTX A6000.
- Splitting the model across GPUs over PCIe (via llama.cpp or vLLM): llama.cpp distributes whole layers between cards, while vLLM supports true tensor parallelism within each layer. Slower, because PCIe bandwidth becomes the bottleneck, but much cheaper to set up; any multi-GPU machine works.
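As an example of the second option, vLLM exposes tensor parallelism as a single argument. A minimal sketch; the model ID is a placeholder, and it assumes the chosen checkpoint (quantized or not) actually fits in the combined VRAM of the two cards:

```python
from vllm import LLM, SamplingParams

# Shard the model across 2 GPUs (PCIe, or NVLink where supported)
llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",  # placeholder; use a quantized build for 2×24GB
    tensor_parallel_size=2,
)

outputs = llm.generate(
    ["Explain VRAM in one sentence."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```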
Need Help Sizing Your AI Hardware?
Tell us your workload and we'll spec the right hardware — and deploy it properly. Free consultation.
Get a Free AI Audit