Quick Picks
- Best value (local 8B, fast): RTX 4070 12 GB — runs 8B Q4 comfortably with headroom for bigger context/batch.
- More headroom / future-proof: RTX 4080 16 GB — lets you push 8B at higher quality (Q5/Q8), bigger batch, longer context.
- “Do everything local” beast: RTX 4090 24 GB — can host multiple 8B models or one 8B at high precision + fat context. (70B still wants multi-GPU or remote.)
What Fits in VRAM
(Quantized GGUF for llama.cpp/Ollama, or AWQ/GPTQ for vLLM; ranges include runtime overhead + KV cache. Context length and batch size also matter.)
| Model | Quant | Typical VRAM Needed | Feels Good On |
|---|---|---|---|
| 3B class | Q4 | 3–4 GB | 8 GB cards and up |
| 3B class | Q5 | 4–5 GB | 8–12 GB |
| 8B class | Q4 | 6–8 GB | 12 GB (4070) |
| 8B class | Q5 | 8–10 GB | 12–16 GB (4070/4080) |
| 8B class | Q8 | 12–14 GB | 16 GB (4080) |
| 70B class | Q4/Q5 | ≈40–60 GB | Multi-GPU or remote instance |
Context/KV cache overhead: Add ~1–2 GB per 8B model for long contexts (8–16k) and larger batches. For 70B, KV overhead is much larger; assume remote unless you own datacenter cards.
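As a sanity check on the ~1–2 GB figure, here is a back-of-envelope estimate using the published Llama 3.1 8B architecture (32 layers, 8 KV heads via GQA, head dim 128) with an fp16 KV cache; quantized KV caches shrink this further:
# Rough KV-cache size for a Llama-3.1-8B-class model (fp16 cache)
layers=32; kv_heads=8; head_dim=128; bytes=2
ctx=16384
per_token=$(( 2 * layers * kv_heads * head_dim * bytes ))   # K + V for one token ≈ 128 KiB
echo "KV cache at ${ctx} tokens: $(( per_token * ctx / 1024 / 1024 )) MiB per sequence (multiply by batch size)"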
What This Means for SCRY
- Default (Llama 3.2/3.1 8B + RAG):
  - 4070 12 GB → Q4/Q5, fast, plenty for 8–16k context at modest batch.
  - 4080 16 GB → crank batch/throughput or use Q8 for slightly better quality.
- Occasional “big-gun” answers (70B): run remotely; route only hard questions there (or offline distillation). See the sketch below.
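A minimal sketch of that routing split, assuming both the local 8B (the vLLM container below) and the remote 70B expose OpenAI-compatible chat endpoints; the remote URL, the 70B model name, and the HARD flag are placeholders, not real infrastructure:
# Hypothetical router: default to the local 8B, escalate to the remote 70B only when HARD=1
LOCAL_URL=http://localhost:8000/v1/chat/completions
REMOTE_URL=https://gpu-box.example.com/v1/chat/completions   # placeholder remote endpoint
if [ "${HARD:-0}" = "1" ]; then URL=$REMOTE_URL; MODEL=llama-3.1-70b-instruct; else URL=$LOCAL_URL; MODEL=meta-llama/Llama-3.1-8B-Instruct; fi
curl -s "$URL" -H 'Content-Type: application/json' \
  -d "{\"model\":\"$MODEL\",\"messages\":[{\"role\":\"user\",\"content\":\"your question here\"}]}"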
Throughput Expectations (Very Rough)
- 8B Q4 on 4070: 30–50 tok/s single stream (llama.cpp or vLLM).
- 8B Q5 on 4080: 40–70 tok/s single stream; much better concurrency.
(Actual numbers depend on quantization, context/RoPE settings, the CUDA/cuBLAS build, and the serving stack; a quick way to measure on your own box is shown below.)
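Once the Ollama container in the Docker section below is running and the quantized 8B is pulled, one quick way to get a real number (requires jq; Ollama reports eval_duration in nanoseconds):
# Single-stream tok/s computed from Ollama's own timing fields
curl -s http://localhost:11434/api/generate \
  -d '{"model":"llama3.1:8b-instruct-q4_K_M","prompt":"Explain retrieval-augmented generation in one paragraph.","stream":false}' \
  | jq '{tok_per_s: (.eval_count / (.eval_duration / 1e9))}'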
Docker Sanity Checks (NVIDIA)
# Prove GPU is visible in containers
docker run --rm --gpus all nvidia/cuda:12.4.1-runtime-ubuntu22.04 nvidia-smi
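If that errors out instead of printing the usual nvidia-smi table, the NVIDIA Container Toolkit usually isn't registered with Docker yet; on most Linux hosts this fixes it:
# Wire the nvidia runtime into Docker, then restart the daemon
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker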
vLLM (CUDA) smoke test:
# Llama weights are gated on Hugging Face (pass a token); bf16 8B is ~16 GB of weights, so prefer an AWQ/GPTQ quant on 12–16 GB cards
docker run --rm -it --gpus all --ipc=host -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -e HUGGING_FACE_HUB_TOKEN=$HF_TOKEN \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.1-8B-Instruct --dtype auto
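Once it logs that the server is up, the OpenAI-compatible API should answer (the model field must match --model):
curl -s http://localhost:8000/v1/models   # should list the served model
curl -s http://localhost:8000/v1/chat/completions -H 'Content-Type: application/json' \
  -d '{"model":"meta-llama/Llama-3.1-8B-Instruct","messages":[{"role":"user","content":"Say hello in five words."}]}'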
Ollama tip: the standard image supports NVIDIA GPUs via --gpus all; start it and pull a quantized 8B:
docker run -d --name ollama --gpus all -p 11434:11434 \
-v ollama:/root/.ollama ollama/ollama:latest
curl -s http://localhost:11434/api/pull -d '{"name":"llama3.1:8b-instruct-q4_K_M"}'
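After the pull finishes, confirm the model loads and actually lands on the GPU (ollama ps shows the CPU/GPU split in recent Ollama versions):
docker exec -it ollama ollama run llama3.1:8b-instruct-q4_K_M "Say hi in five words."
docker exec ollama ollama ps   # PROCESSOR column should read 100% GPU, not CPU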