Appendix B. The Napkin Math: Predicting Token Speed from Memory Bandwidth
Before Post 4’s benchmarks, I needed a way to sanity-check results. If llama-benchy reports 57 t/s decode on a single request, is that good? Is the hardware doing what it should? Or is something misconfigured — wrong driver, fallback kernel path, bad CUDA version?
One formula answers that question.
The Formula
During single-request decoding, the GPU reloads the entire active model weights from memory for every token. The bottleneck isn’t compute — it’s how fast bytes move. So:
tok/s = β / (P_active × B)
| Symbol | Meaning |
|---|---|
| β | Memory bandwidth (GB/s) |
| P_active | Active parameters per token (billions) |
| B | Bytes per parameter (depends on quantization) |
For long contexts, the KV-cache eats bandwidth too, since every decoded token also streams the cached keys and values:
tok/s = β / (P_active × B + KV_per_token)
where KV_per_token is the KV-cache read per token, in GB.
But for short-context sanity checks, the simple version is all you need.
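The formula is simple enough to sketch directly. The function name and signature below are my own illustration, not from any library; units follow the table above (GB/s, billions of parameters, bytes per parameter):

```python
def decode_tok_s(bandwidth_gbps: float, active_params_b: float,
                 bytes_per_param: float, kv_gb_per_token: float = 0.0) -> float:
    """Bandwidth-bound decode ceiling: tok/s = β / (P_active × B + KV)."""
    gb_per_token = active_params_b * bytes_per_param + kv_gb_per_token
    return bandwidth_gbps / gb_per_token

# DGX Spark (273 GB/s), Qwen3-Coder-30B-A3B (~3B active) at BF16
print(round(decode_tok_s(273, 3.0, 2.0), 1))  # → 45.5
```

The optional `kv_gb_per_token` argument covers the long-context variant; leave it at zero for short-context sanity checks.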
Bytes per Parameter
| Format | Bytes/param | Notes |
|---|---|---|
| BF16 | 2.0 | Full precision baseline |
| FP8 / W8A8 | 1.0 | Half the bandwidth of BF16 |
| AWQ / INT4 | 0.5 | Quarter of BF16 |
| NVFP4 | 0.5 | Same size as AWQ/INT4 |
Worked Example: DGX Spark
The DGX Spark has 273 GB/s LPDDR5x bandwidth. Take Qwen3-Coder-30B-A3B — a 30B MoE model with ~3B active parameters per token:
BF16: 273 / (3 × 2.0) = 273 / 6.0 = 45.5 tok/s
FP8: 273 / (3 × 1.0) = 273 / 3.0 = 91.0 tok/s
NVFP4: 273 / (3 × 0.5) = 273 / 1.5 = 182.0 tok/s
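The three ceilings above fall out of one loop over the bytes-per-parameter table; variable names here are illustrative:

```python
BANDWIDTH = 273.0    # DGX Spark LPDDR5x, GB/s
ACTIVE_PARAMS = 3.0  # Qwen3-Coder-30B-A3B active parameters, billions

ceilings = {fmt: BANDWIDTH / (ACTIVE_PARAMS * b)
            for fmt, b in [("BF16", 2.0), ("FP8", 1.0), ("NVFP4", 0.5)]}
for fmt, tok_s in ceilings.items():
    print(f"{fmt}: {tok_s:.1f} tok/s")  # BF16: 45.5, FP8: 91.0, NVFP4: 182.0
```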
These are theoretical ceilings — zero context, single request, perfect kernel efficiency. Real numbers will be lower. The question is how much lower.
Comparative Example: RTX PRO 6000
Same model, same formula, different bandwidth. The RTX PRO 6000 has 1,792 GB/s GDDR7:
BF16: 1792 / 6.0 = 299 tok/s
FP8: 1792 / 3.0 = 597 tok/s
NVFP4: 1792 / 1.5 = 1195 tok/s
6.6x more bandwidth maps directly to 6.6x more throughput. The formula doesn’t care about the GPU architecture — it only cares about the memory bus.
Theory vs Reality
Community benchmarks from the NVIDIA developer forum show consistent gaps between theory and practice on the DGX Spark:
| Format | Theoretical | Best Measured | Efficiency |
|---|---|---|---|
| BF16 | 45.5 tok/s | ~32 tok/s | ~70% |
| FP8 | 91.0 tok/s | ~55 tok/s | ~60% |
| NVFP4 | 182.0 tok/s | ~65 tok/s | ~36% |
BF16 and FP8 land at 60-70% efficiency — normal for real workloads. Kernel overhead, MoE routing, attention backends, and KV-cache management all consume bandwidth the formula doesn’t account for.
NVFP4 is the outlier at 36%. That gap comes from SM121 architectural limitations covered in Post 3 — the GB10 lacks the dedicated Tensor Memory that datacenter Blackwell uses to exploit FP4 natively. The formula tells you the gap exists; the architecture explains why.
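The efficiency column is just measured over theoretical. A quick check that the table's percentages hold up, using the forum numbers quoted above:

```python
# (theoretical ceiling, best measured) in tok/s, DGX Spark, from the table above
RESULTS = {"BF16": (45.5, 32), "FP8": (91.0, 55), "NVFP4": (182.0, 65)}

for fmt, (theoretical, measured) in RESULTS.items():
    print(f"{fmt}: {measured / theoretical:.0%} efficiency")
```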
The Sanity Check
This is where the formula earns its keep. When I got 57 t/s single-user decode on gpt-oss-120b (MXFP4, 5.1B active parameters) in Post 4:
Theoretical: 273 / (5.1 × 0.5) = 273 / 2.55 = 107 tok/s
Measured: 57 tok/s
Efficiency: 53%
53% efficiency at 4-bit (MXFP4) with a MoE model, higher than the ~36% seen with simpler models on NVFP4, likely because gpt-oss-120b's MoE routing is well optimized in the community Docker image. More importantly, the number is in the right ballpark. If I'd measured 15 tok/s, that would signal a problem: wrong CUDA version, fallback kernel path, misconfigured quantization. If I'd measured 200 tok/s, the measurement itself would be wrong.
The formula doesn’t tell you the exact number. It tells you the neighborhood the number should live in.
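That "neighborhood" logic can be captured in a small helper. The function and the 30%/100% thresholds are my own sketch of the rules of thumb in this appendix, not an official tool; tune the bounds for your hardware:

```python
def sanity_check(measured: float, bandwidth_gbps: float,
                 active_params_b: float, bytes_per_param: float) -> str:
    """Flag a measured decode speed against the bandwidth ceiling."""
    ceiling = bandwidth_gbps / (active_params_b * bytes_per_param)
    eff = measured / ceiling
    if eff > 1.0:
        return "suspicious: above theoretical ceiling, check the measurement"
    if eff < 0.30:
        return "suspicious: below 30%, check CUDA version/kernels/quantization"
    return f"healthy ({eff:.0%} of {ceiling:.0f} tok/s ceiling)"

# gpt-oss-120b on DGX Spark: 57 tok/s measured, 5.1B active, MXFP4 (0.5 B/param)
print(sanity_check(57, 273, 5.1, 0.5))  # → healthy (53% of 107 tok/s ceiling)
```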
One note: benchmarks occasionally report bandwidth above 273 GB/s (up to ~285 GB/s). That’s burst — the LPDDR5x controller can exceed the sustained spec for short windows. For the formula, always use the rated 273 GB/s, not burst peaks.
Quick Reference
tok/s = bandwidth_GB/s / (active_params_B × bytes_per_param)
| Hardware | BW (GB/s) | BF16 ceiling | FP8 ceiling | NVFP4 ceiling |
|---|---|---|---|---|
| DGX Spark (GB10) | 273 | 45.5 | 91 | 182 |
| Mac Studio M2 Ultra | 800 | 133 | 267 | 533 |
| RTX 5090 (32 GB) | 1,792 | 299 | 597 | 1,195 |
| RTX PRO 6000 (96 GB) | 1,792 | 299 | 597 | 1,195 |
Ceilings assume 3B active parameters (e.g., Qwen3-Coder-30B-A3B). Scale linearly with active parameter count.
Expect 60-70% of these ceilings for BF16/FP8 workloads. For NVFP4, expect 35-55% depending on architecture (SM100 vs SM121) and kernel optimization.
The Sweet Spot: Reverse the Formula
For interactive use, flip the formula to find the maximum model size at a target speed:
Max data per token = bandwidth / target_tok/s
On the DGX Spark, targeting 10 tok/s minimum:
273 / 10 = 27.3 GB per token pass
Active weights plus KV-cache must stay under 27.3 GB. This is why MoE models with small active parameter counts are the sweet spot for bandwidth-limited hardware — a 120B model with 5.1B active parameters at NVFP4 only needs ~2.55 GB per pass, well within budget.
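The reversed budget is one division; the function name below is illustrative:

```python
def max_gb_per_token(bandwidth_gbps: float, target_tok_s: float) -> float:
    """Budget: max bytes moved per token pass (GB) at a target decode speed."""
    return bandwidth_gbps / target_tok_s

budget = max_gb_per_token(273, 10)
print(f"{budget:.1f} GB per token pass")  # → 27.3 GB per token pass

# gpt-oss-120b at FP4: 5.1B active × 0.5 bytes/param = 2.55 GB, well under budget
print(5.1 * 0.5 <= budget)  # → True
```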
The Takeaway
One formula, one minute of math, and you know whether your hardware is working correctly. If your measured throughput is 50-70% of the theoretical ceiling for BF16/FP8, your setup is healthy. If it’s below 30%, something is wrong — check your CUDA version, kernel paths, and quantization config.
The community used this formula to diagnose the 40 t/s problem from Post 3. Theoretical NVFP4 throughput was 107 tok/s for gpt-oss-120b. Getting 40 tok/s meant 37% efficiency — plausible but suspiciously low. That suspicion led to finding the SM121 fallback paths and the CUDA 12.1a issue. The formula didn’t fix the problem, but it told everyone the problem existed.