Quick Picks
- Best value (local 8B, fast): RTX 4070 12 GB — runs 8B Q4 comfortably with headroom for bigger context/batch.
- More headroom / future-proof: RTX 4080 16 GB — lets you push 8B at higher quality (Q5/Q8), bigger batch, longer context.
- “Do everything local” beast: RTX 4090 24 GB — can host multiple 8B models or one 8B at high precision + fat context. (70B still wants multi-GPU or remote.)
What Fits in VRAM
(Quantized GGUF for llama.cpp/Ollama, or AWQ/GPTQ for vLLM; ranges include runtime overhead + KV cache. Context length and batch size also matter.)
| Model | Quant | Typical VRAM Needed | Feels Good On |
|---|---|---|---|
| 3B class | Q4 | 3–4 GB | 8 GB cards and up |
| 3B class | Q5 | 4–5 GB | 8–12 GB |
| 8B class | Q4 | 6–8 GB | 12 GB (4070) |
| 8B class | Q5 | 8–10 GB | 12–16 GB (4070/4080) |
| 8B class | Q8 | 12–14 GB | 16 GB (4080) |
| 70B class | Q4/Q5 | ≈40–60 GB | Multi-GPU or remote instance |
Context/KV cache overhead: Add ~1–2 GB per 8B model for long contexts (8–16k) and larger batches. For 70B, KV overhead is much larger; assume remote unless you own datacenter cards.
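As a sanity check on the ~1–2 GB figure, here is a back-of-envelope estimate using the published Llama 3.1 8B architecture (32 layers, 8 KV heads via GQA, head dim 128) with an fp16 KV cache; quantized KV caches shrink this further:
# Rough KV-cache size for a Llama-3.1-8B-class model (fp16 cache)
layers=32; kv_heads=8; head_dim=128; bytes=2
ctx=16384
per_token=$(( 2 * layers * kv_heads * head_dim * bytes ))   # K + V for one token ≈ 128 KiB
echo "KV cache at ${ctx} tokens: $(( per_token * ctx / 1024 / 1024 )) MiB per sequence (multiply by batch size)"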
What This Means for SCRY
- Default (Llama 3.2/3.1 8B + RAG):
  - 4070 12 GB → Q4/Q5, fast, plenty for 8–16k context at modest batch.
  - 4080 16 GB → crank batch/throughput or use Q8 for slightly better quality.
- Occasional “big-gun” answers (70B): run remotely; route only hard questions there (or offline distillation). See the sketch below.
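A minimal sketch of that routing split, assuming both the local 8B (the vLLM container below) and the remote 70B expose OpenAI-compatible chat endpoints; the remote URL, the 70B model name, and the HARD flag are placeholders, not real infrastructure:
# Hypothetical router: default to the local 8B, escalate to the remote 70B only when HARD=1
LOCAL_URL=http://localhost:8000/v1/chat/completions
REMOTE_URL=https://gpu-box.example.com/v1/chat/completions   # placeholder remote endpoint
if [ "${HARD:-0}" = "1" ]; then URL=$REMOTE_URL; MODEL=llama-3.1-70b-instruct; else URL=$LOCAL_URL; MODEL=meta-llama/Llama-3.1-8B-Instruct; fi
curl -s "$URL" -H 'Content-Type: application/json' \
  -d "{\"model\":\"$MODEL\",\"messages\":[{\"role\":\"user\",\"content\":\"your question here\"}]}"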
Throughput Expectations (Very Rough)
- 8B Q4 on 4070: 30–50 tok/s single stream (llama.cpp or vLLM).
- 8B Q5 on 4080: 40–70 tok/s single stream; much better concurrency.
(Actual numbers depend on quantization, context/RoPE settings, the CUDA/cuBLAS build, and the serving stack; a quick way to measure on your own box is shown below.)
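Once the Ollama container in the Docker section below is running and the quantized 8B is pulled, one quick way to get a real number (requires jq; Ollama reports eval_duration in nanoseconds):
# Single-stream tok/s computed from Ollama's own timing fields
curl -s http://localhost:11434/api/generate \
  -d '{"model":"llama3.1:8b-instruct-q4_K_M","prompt":"Explain retrieval-augmented generation in one paragraph.","stream":false}' \
  | jq '{tok_per_s: (.eval_count / (.eval_duration / 1e9))}'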
Docker Sanity Checks (NVIDIA)
# Prove GPU is visible in containers
docker run --rm --gpus all nvidia/cuda:12.4.1-runtime-ubuntu22.04 nvidia-smi
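If that errors out instead of printing the usual nvidia-smi table, the NVIDIA Container Toolkit usually isn't registered with Docker yet; on most Linux hosts this fixes it:
# Wire the nvidia runtime into Docker, then restart the daemon
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker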
vLLM (CUDA) smoke test:
# Llama weights are gated on Hugging Face (pass a token); bf16 8B is ~16 GB of weights, so prefer an AWQ/GPTQ quant on 12–16 GB cards
docker run --rm -it --gpus all --ipc=host -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -e HUGGING_FACE_HUB_TOKEN=$HF_TOKEN \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.1-8B-Instruct --dtype auto
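Once it logs that the server is up, the OpenAI-compatible API should answer (the model field must match --model):
curl -s http://localhost:8000/v1/models   # should list the served model
curl -s http://localhost:8000/v1/chat/completions -H 'Content-Type: application/json' \
  -d '{"model":"meta-llama/Llama-3.1-8B-Instruct","messages":[{"role":"user","content":"Say hello in five words."}]}'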
Ollama tip: the standard image supports NVIDIA GPUs via --gpus all; start it and pull a quantized 8B:
docker run -d --name ollama --gpus all -p 11434:11434 \
-v ollama:/root/.ollama ollama/ollama:latest
curl -s http://localhost:11434/api/pull -d '{"name":"llama3.1:8b-instruct-q4_K_M"}'
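After the pull finishes, confirm the model loads and actually lands on the GPU (ollama ps shows the CPU/GPU split in recent Ollama versions):
docker exec -it ollama ollama run llama3.1:8b-instruct-q4_K_M "Say hi in five words."
docker exec ollama ollama ps   # PROCESSOR column should read 100% GPU, not CPU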