The Math & Assumptions.
This calculator does not use marketing numbers. It models the physical limitations of your hardware based on three core constraints.
1. Allocation Geometry
- Weights: Params × Quantization Byte Size × 1.05 (GGUF metadata overhead).
- KV Cache: Params × 8.5 × Cache Bytes (2.0 for FP16, 0.5 for Q4) × Context length.
- Overhead: Base ~400MB reserved for CUDA/Metal context.
2. The PCIe Bottleneck
When a model exceeds physical VRAM on an NVIDIA/AMD card, the excess layers spill to host RAM. Data must cross the motherboard's PCIe bus. Assuming a PCIe 4.0 x16 connection, bandwidth crashes to a hard limit of ~32 GB/s.
3. Hardware Efficiency Penalty
A GPU never achieves its theoretical max memory bandwidth for single-batch decoding. The TPS estimation applies a realistic 38% efficiency multiplier for discrete GPUs running entirely in VRAM, a 25% multiplier for Apple Unified Memory due to unified CPU/GPU overhead, and a 45% multiplier (~14.4 GB/s) for PCIe offload lanes.