arrow_back_ios Back to Toolbox

Local LLM
Calculator.

The technical math behind running local agents on constrained hardware. Cut the guesswork. Calculate exact VRAM costs, architectural bottlenecks, and format implications before deployment.

calculate

The Math & Assumptions.

This calculator does not use marketing numbers. It models the physical limitations of your hardware based on three core constraints.

1. Allocation Geometry

  • Weights: Params × Quantization Byte Size × 1.05 (GGUF metadata overhead).
  • KV Cache: Params × 8.5 × Cache Bytes (2.0 for FP16, 0.5 for Q4) × Context length.
  • Overhead: Base ~400MB reserved for CUDA/Metal context.

2. The PCIe Bottleneck

When a model exceeds physical VRAM on an NVIDIA/AMD card, the excess layers spill to host RAM. Data must cross the motherboard's PCIe bus. Assuming a PCIe 4.0 x16 connection, bandwidth crashes to a hard limit of ~32 GB/s.

3. Hardware Efficiency Penalty

A GPU never achieves its theoretical max memory bandwidth for single-batch decoding. The TPS estimation applies a realistic 38% efficiency multiplier for discrete GPUs running entirely in VRAM, a 25% multiplier for Apple Unified Memory due to unified CPU/GPU overhead, and a 45% multiplier (~14.4 GB/s) for PCIe offload lanes.

data_object

The VRAM Wall & Formatting Implications

Your VRAM status permanently dictates which format you are allowed to load. Do not fight the math.

1. Optimal State (Model < VRAM)

Implication: Maximum hardware speed.

The entire model and KV cache fit within dedicated VRAM. You achieve peak Tokens-Per-Second (TPS). You have the flexibility to use EXL2 for raw speed or GGUF for modularity.

2. PCIe Tax State (Model > VRAM)

Implication: Heavy TPS degradation. MUST use GGUF.

The model splits. Excess layers are offloaded to system RAM. Data bottlenecks at the PCIe bus. Expected TPS drops significantly. Loading an EXL2 format under this condition will instantly throw an OutOfMemoryError and crash.

3. SSD Swap Death (Model > Total RAM)

Implication: System freeze.

Total allocation exceeds VRAM plus Host RAM. The OS pages to the hard drive. TPS drops below 1. The agent and your PC will freeze. Reduce model parameters immediately.

psychology

Parameter Intelligence.

Model size dictates reasoning depth. Do not expect a 3B router to write your backend. Choose hardware capable of holding the intelligence level your task requires.

< 3B: The Routers

Extremely fast, highly constrained. Capable of basic NLP classification, summarization, and syntax checking. Will reliably fail complex JSON generation or logical leaps.

7B - 14B: The Generalists

The standard local tier. Good for coding assistants, single-task agents, and localized RAG. Can output structured JSON but struggles with multi-step logical verification without extreme prompting grids.

32B - 70B+: The Professionals

Partner-level logic. Reliable zero-shot structured output, complex reasoning, nuance, and code architecture. Requires serious VRAM or high-capacity Unified Memory (Mac).

compress

Precision & Quantization Trade-offs.

The Quantization Cliff

You are compressing nuance, not just file size. Moving below Q4 damages reasoning and JSON schema adherence.

  • FP16/BF16: Uncompressed baseline. Massive VRAM footprint.
  • Q8_0 (8-bit): Near lossless. Indistinguishable from FP16 in benchmarks.
  • Q6_K (6-bit): High precision. Ideal for coding & complex logic if RAM allows.
  • Q5_K_M (5-bit): Excellent balance. Minimal perplexity loss over Q6.
  • Q4_K_M (4-bit): The Industry Sweet Spot. Maintains ~98% reasoning but cuts memory by 70%. Generic baseline for constrained setups.
  • IQ4_XS/IQ3: Next-gen Importance Matrix quants. Packs Q4-level logic into smaller file sizes using calibration datasets. Highly recommended.
  • Q2_K (2-bit): The Lobotomy. Hallucinates fields, breaks JSON. Do not use.

Flash Attention

Standard attention scales quadratically (O(N²)). Without Flash Attention enabled, processing a 32k context window will trigger a massive mathematical explosion, consuming gigabytes of RAM instantly. Flash Attention keeps memory linear (O(N)).

Multi-Token Prediction (MTP)

MTP bypasses the PCIe memory bottleneck by predicting 3-4 tokens per forward pass instead of 1. While it adds a minor VRAM penalty to load the draft heads, it artificially inflates your TPS on constrained hardware.

terminal

Local Execution Engines.

The UI you choose dictates your VRAM overhead. Pick the engine built for your specific outcome, not the one with the prettiest interface. Arranged by operational complexity.

LM Studio

Level 1: GUI

The discovery tool. Excellent for browsing and testing models. The heavy Electron interface consumes precious VRAM.

memory
High VRAM Tax ~800MB Overhead
chat Chat UI travel_explore Model Browser api Local API

Ollama

Level 2: CLI / API

The "Docker" of local LLMs. Ridiculously easy to install and run. Abstracts away advanced tuning dials. Ideal for simple API wrappers.

memory
Medium VRAM Tax ~400MB Overhead
terminal CLI Chat api REST API cable Ecosystem Integrations

llama.cpp

Level 3: Bare Metal Server

The workhorse. Lowest possible VRAM footprint. Exposes every parameter. Mandatory for automated agents on constrained hardware.

memory
Minimal VRAM Tax < 50MB Overhead
data_object GBNF Grammars tune Deep Param Tuning api OpenAI Compatible API

vLLM

Level 4: Data Center

The enterprise serving king. Built for absolute maximum throughput when serving hundreds of concurrent requests. Linux/CUDA specific.

memory
Extreme VRAM Tax Pre-allocates gigabytes
dynamic_form Continuous Batching router High Concurrency API