VRAM<->Model Chart

Quick Picks

  • Best value (local 8B, fast): RTX 4070 12 GB — runs 8B Q4 comfortably with headroom for bigger context/batch.
  • More headroom / future-proof: RTX 4080 16 GB — lets you push 8B at higher quality (Q5/Q8), bigger batch, longer context.
  • “Do everything local” beast: RTX 4090 24 GB — can host multiple 8B models or one 8B at high precision + fat context. (70B still wants multi-GPU or remote.)
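
Not sure which bucket your card falls into? A quick check of the name plus total/used VRAM (assumes the NVIDIA driver is installed on the host):

# Report GPU name and total/used VRAM in MiB
nvidia-smi --query-gpu=name,memory.total,memory.used --format=csv,noheader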

What Fits in VRAM

(Estimates are for quantized weights: GGUF via llama.cpp/Ollama, or quantized checkpoints served with vLLM. Ranges include runtime overhead + KV cache; context length and batch size also matter.)

Model      Quant   Typical VRAM Needed   Feels Good On
3B class   Q4      3–4 GB                8 GB cards and up
3B class   Q5      4–5 GB                8–12 GB
8B class   Q4      6–8 GB                12 GB (4070)
8B class   Q5      8–10 GB               12–16 GB (4070/4080)
8B class   Q8      12–14 GB              16 GB (4080)
70B class  Q4/Q5   ≈40–60 GB             Multi-GPU or remote instance

Context/KV cache overhead: Add ~1–2 GB per 8B model for long contexts (8–16k) and larger batches. For 70B, KV overhead is much larger; assume remote unless you own datacenter cards.
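
As a back-of-envelope check of that overhead, here is a rough KV-cache estimate assuming a Llama-3.1-8B-style layout (32 layers, 8 KV heads via GQA, head dim 128, fp16 cache); exact numbers vary by runtime and cache dtype:

# KV cache bytes ≈ 2 (K+V) * layers * kv_heads * head_dim * ctx * batch * bytes_per_elem (fp16 = 2)
CTX=16384 BATCH=1
echo "$(( 2 * 32 * 8 * 128 * CTX * BATCH * 2 / 1024 / 1024 )) MiB"   # ≈2048 MiB at 16k context, batch 1

At 8k context that halves to roughly 1 GiB, which is where the 1–2 GB rule of thumb above comes from.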

What This Means for SCRY

  • Default (Llama 3.1 8B + RAG):
    • 4070 12 GB → Q4/Q5, fast, plenty for 8–16k context at modest batch.
    • 4080 16 GB → crank batch/throughput or use Q8 for slightly better quality.
  • Occasional “big-gun” answers (70B): run remotely; route only the hard questions there (or use the 70B offline for distillation).
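
A minimal sketch of the remote route, assuming the 70B sits behind an OpenAI-compatible endpoint (vLLM or similar); the host, model name, and API key below are placeholders, not real values:

# Send only the hard questions to the remote 70B (OpenAI-compatible chat API)
curl -s https://your-remote-host:8000/v1/chat/completions \
  -H "Authorization: Bearer $REMOTE_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "llama-3.1-70b-instruct", "messages": [{"role": "user", "content": "Hard question here"}]}'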

Throughput Expectations (Very Rough)

  • 8B Q4 on 4070: 30–50 tok/s single stream (llama.cpp or vLLM).
  • 8B Q5 on 4080: 40–70 tok/s single stream; much better concurrency.
    (Actual numbers depend on the quantization, RoPE/context scaling, CUDA/cuBLAS versions, and the serving stack.)

Docker Sanity Checks (NVIDIA)

# Prove GPU is visible in containers
docker run --rm --gpus all nvidia/cuda:12.4.1-runtime-ubuntu22.04 nvidia-smi
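
If that errors out, the usual culprit is a missing or unconfigured NVIDIA Container Toolkit. On most distros the fix (assuming nvidia-container-toolkit is already installed on the host) is:

# Register the NVIDIA runtime with Docker, then restart the daemon
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker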

vLLM (CUDA) smoke test:

# Llama weights on Hugging Face are gated: pass an HF token and reuse the local cache
docker run --rm -it --gpus all -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -e HUGGING_FACE_HUB_TOKEN=$HF_TOKEN \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.1-8B-Instruct --dtype auto
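
Once it's up, the server speaks the OpenAI API on port 8000. A quick way to confirm it is actually serving (the model name must match whatever you passed to --model):

# List loaded models, then request a short completion
curl -s http://localhost:8000/v1/models
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Llama-3.1-8B-Instruct", "messages": [{"role": "user", "content": "Say hi"}], "max_tokens": 16}'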

Ollama tip: run the container with GPU access and pull a quantized 8B:

docker run -d --name ollama --gpus all -p 11434:11434 \
  -v ollama:/root/.ollama ollama/ollama:latest

curl -s http://localhost:11434/api/pull -d '{"name":"llama3.1:8b-instruct-q4_K_M"}'
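
To sanity-check the tok/s figures above, Ollama's generate endpoint reports eval_count and eval_duration (nanoseconds) in its final response; with jq installed you can turn that into tokens/sec. This is a rough single-request measurement, not a benchmark:

# One-shot generation, then compute decode throughput in tok/s
curl -s http://localhost:11434/api/generate \
  -d '{"model": "llama3.1:8b-instruct-q4_K_M", "prompt": "Explain RAG in one paragraph.", "stream": false}' \
  | jq '.eval_count / .eval_duration * 1e9'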