VRAM<->Model Chart

Quick Picks

  • Best value (local 8B, fast): RTX 4070 12 GB — runs 8B Q4 comfortably with headroom for bigger context/batch.
  • More headroom / future-proof: RTX 4080 16 GB — lets you push 8B at higher quality (Q5/Q8), bigger batch, longer context.
  • “Do everything local” beast: RTX 4090 24 GB — can host multiple 8B models or one 8B at high precision + fat context. (70B still wants multi-GPU or remote.)
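
Not sure which bucket your card falls into? A quick check of the name plus total/used VRAM (assumes the NVIDIA driver is installed on the host):

# Report GPU name and total/used VRAM in MiB
nvidia-smi --query-gpu=name,memory.total,memory.used --format=csv,noheader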

What Fits in VRAM

(Estimates are for quantized weights: GGUF via llama.cpp/Ollama, or quantized checkpoints served with vLLM. Ranges include runtime overhead + KV cache; context length and batch size also matter.)

Model      Quant   Typical VRAM Needed   Feels Good On
3B class   Q4      3–4 GB                8 GB cards and up
3B class   Q5      4–5 GB                8–12 GB
8B class   Q4      6–8 GB                12 GB (4070)
8B class   Q5      8–10 GB               12–16 GB (4070/4080)
8B class   Q8      12–14 GB              16 GB (4080)
70B class  Q4/Q5   ≈40–60 GB             Multi-GPU or remote instance

Context/KV cache overhead: Add ~1–2 GB per 8B model for long contexts (8–16k) and larger batches. For 70B, KV overhead is much larger; assume remote unless you own datacenter cards.
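
As a back-of-envelope check of that overhead, here is a rough KV-cache estimate assuming a Llama-3.1-8B-style layout (32 layers, 8 KV heads via GQA, head dim 128, fp16 cache); exact numbers vary by runtime and cache dtype:

# KV cache bytes ≈ 2 (K+V) * layers * kv_heads * head_dim * ctx * batch * bytes_per_elem (fp16 = 2)
CTX=16384 BATCH=1
echo "$(( 2 * 32 * 8 * 128 * CTX * BATCH * 2 / 1024 / 1024 )) MiB"   # ≈2048 MiB at 16k context, batch 1

At 8k context that halves to roughly 1 GiB, which is where the 1–2 GB rule of thumb above comes from.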

What This Means for SCRY

  • Default (Llama 3.1 8B + RAG):
    • 4070 12 GB → Q4/Q5, fast, plenty for 8–16k context at modest batch.
    • 4080 16 GB → crank batch/throughput or use Q8 for slightly better quality.
  • Occasional “big-gun” answers (70B): run remotely; route only the hard questions there (or use the 70B offline for distillation).
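
A minimal sketch of the remote route, assuming the 70B sits behind an OpenAI-compatible endpoint (vLLM or similar); the host, model name, and API key below are placeholders, not real values:

# Send only the hard questions to the remote 70B (OpenAI-compatible chat API)
curl -s https://your-remote-host:8000/v1/chat/completions \
  -H "Authorization: Bearer $REMOTE_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "llama-3.1-70b-instruct", "messages": [{"role": "user", "content": "Hard question here"}]}'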

Throughput Expectations (Very Rough)

  • 8B Q4 on 4070: 30–50 tok/s single stream (llama.cpp or vLLM).
  • 8B Q5 on 4080: 40–70 tok/s single stream; much better concurrency.
    (Actual numbers depend on the quantization, RoPE/context scaling, CUDA/cuBLAS versions, and the serving stack.)

Docker Sanity Checks (NVIDIA)

# Prove GPU is visible in containers
docker run --rm --gpus all nvidia/cuda:12.4.1-runtime-ubuntu22.04 nvidia-smi
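
If that errors out, the usual culprit is a missing or unconfigured NVIDIA Container Toolkit. On most distros the fix (assuming nvidia-container-toolkit is already installed on the host) is:

# Register the NVIDIA runtime with Docker, then restart the daemon
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker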

vLLM (CUDA) smoke test:

# Llama weights on Hugging Face are gated: pass an HF token and reuse the local cache
docker run --rm -it --gpus all -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -e HUGGING_FACE_HUB_TOKEN=$HF_TOKEN \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.1-8B-Instruct --dtype auto
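
Once it's up, the server speaks the OpenAI API on port 8000. A quick way to confirm it is actually serving (the model name must match whatever you passed to --model):

# List loaded models, then request a short completion
curl -s http://localhost:8000/v1/models
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Llama-3.1-8B-Instruct", "messages": [{"role": "user", "content": "Say hi"}], "max_tokens": 16}'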

Ollama tip: run the container with GPU access and pull a quantized 8B:

docker run -d --name ollama --gpus all -p 11434:11434 \
  -v ollama:/root/.ollama ollama/ollama:latest

curl -s http://localhost:11434/api/pull -d '{"name":"llama3.1:8b-instruct-q4_K_M"}'
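
To sanity-check the tok/s figures above, Ollama's generate endpoint reports eval_count and eval_duration (nanoseconds) in its final response; with jq installed you can turn that into tokens/sec. This is a rough single-request measurement, not a benchmark:

# One-shot generation, then compute decode throughput in tok/s
curl -s http://localhost:11434/api/generate \
  -d '{"model": "llama3.1:8b-instruct-q4_K_M", "prompt": "Explain RAG in one paragraph.", "stream": false}' \
  | jq '.eval_count / .eval_duration * 1e9'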