Appendix B. The Napkin Math: Predicting Token Speed from Memory Bandwidth
Before Post 4’s benchmarks, I needed a way to sanity-check results. If llama-benchy reports 57 t/s decode on a single request, is that good? Is the hardware doing what it should? Or is something misconfigured — wrong driver, fallback kernel path, bad CUDA version?
One formula answers that question.
The Formula
During single-request decoding, the GPU reloads the entire active model weights from memory for every token. The bottleneck isn’t compute — it’s how fast bytes move. So:
tok/s = β / (P_active × B)
| Symbol | Meaning |
|---|---|
| β | Memory bandwidth (GB/s) |
| P_active | Active parameters per token (billions) |
| B | Bytes per parameter (depends on quantization) |
For long contexts, the KV-cache eats bandwidth too, since every decoded token also streams the cached keys and values:
tok/s = β / (P_active × B + KV_per_token)
where KV_per_token is the KV-cache read per token, in GB.
But for short-context sanity checks, the simple version is all you need.
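The formula is simple enough to sketch directly. The function name and signature below are my own illustration, not from any library; units follow the table above (GB/s, billions of parameters, bytes per parameter):

```python
def decode_tok_s(bandwidth_gbps: float, active_params_b: float,
                 bytes_per_param: float, kv_gb_per_token: float = 0.0) -> float:
    """Bandwidth-bound decode ceiling: tok/s = β / (P_active × B + KV)."""
    gb_per_token = active_params_b * bytes_per_param + kv_gb_per_token
    return bandwidth_gbps / gb_per_token

# DGX Spark (273 GB/s), Qwen3-Coder-30B-A3B (~3B active) at BF16
print(round(decode_tok_s(273, 3.0, 2.0), 1))  # → 45.5
```

The optional `kv_gb_per_token` argument covers the long-context variant; leave it at zero for short-context sanity checks.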
Bytes per Parameter
| Format | Bytes/param | Notes |
|---|---|---|
| BF16 | 2.0 | Full precision baseline |
| FP8 / W8A8 | 1.0 | Half the bandwidth of BF16 |
| AWQ / INT4 | 0.5 | Quarter of BF16 |
| NVFP4 | 0.5 | Same size as AWQ/INT4 |
Worked Example: DGX Spark
The DGX Spark has 273 GB/s LPDDR5x bandwidth. Take Qwen3-Coder-30B-A3B — a 30B MoE model with ~3B active parameters per token:
BF16: 273 / (3 × 2.0) = 273 / 6.0 = 45.5 tok/s
FP8: 273 / (3 × 1.0) = 273 / 3.0 = 91.0 tok/s
NVFP4: 273 / (3 × 0.5) = 273 / 1.5 = 182.0 tok/s
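The three ceilings above fall out of one loop over the bytes-per-parameter table; variable names here are illustrative:

```python
BANDWIDTH = 273.0    # DGX Spark LPDDR5x, GB/s
ACTIVE_PARAMS = 3.0  # Qwen3-Coder-30B-A3B active parameters, billions

ceilings = {fmt: BANDWIDTH / (ACTIVE_PARAMS * b)
            for fmt, b in [("BF16", 2.0), ("FP8", 1.0), ("NVFP4", 0.5)]}
for fmt, tok_s in ceilings.items():
    print(f"{fmt}: {tok_s:.1f} tok/s")  # BF16: 45.5, FP8: 91.0, NVFP4: 182.0
```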
These are theoretical ceilings — zero context, single request, perfect kernel efficiency. Real numbers will be lower. The question is how much lower.
Comparative Example: RTX PRO 6000
Same model, same formula, different bandwidth. The RTX PRO 6000 has 1,792 GB/s GDDR7:
BF16: 1792 / 6.0 = 299 tok/s
FP8: 1792 / 3.0 = 597 tok/s
NVFP4: 1792 / 1.5 = 1195 tok/s
6.6x more bandwidth maps directly to 6.6x more throughput. The formula doesn’t care about the GPU architecture — it only cares about the memory bus.
Theory vs Reality
Community benchmarks from the NVIDIA developer forum show consistent gaps between theory and practice on the DGX Spark:
| Format | Theoretical | Best Measured | Efficiency |
|---|---|---|---|
| BF16 | 45.5 tok/s | ~32 tok/s | ~70% |
| FP8 | 91.0 tok/s | ~55 tok/s | ~60% |
| NVFP4 | 182.0 tok/s | ~65 tok/s | ~36% |
BF16 and FP8 land at 60-70% efficiency — normal for real workloads. Kernel overhead, MoE routing, attention backends, and KV-cache management all consume bandwidth the formula doesn’t account for.
NVFP4 is the outlier at 36%. That gap comes from SM121 architectural limitations covered in Post 3 — the GB10 lacks the dedicated Tensor Memory that datacenter Blackwell uses to exploit FP4 natively. The formula tells you the gap exists; the architecture explains why.
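The efficiency column is just measured over theoretical. A quick check that the table's percentages hold up, using the forum numbers quoted above:

```python
# (theoretical ceiling, best measured) in tok/s, DGX Spark, from the table above
RESULTS = {"BF16": (45.5, 32), "FP8": (91.0, 55), "NVFP4": (182.0, 65)}

for fmt, (theoretical, measured) in RESULTS.items():
    print(f"{fmt}: {measured / theoretical:.0%} efficiency")
```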
The Sanity Check
This is where the formula earns its keep. When I got 57 t/s single-user decode on gpt-oss-120b (MXFP4, 5.1B active parameters) in Post 4:
Theoretical: 273 / (5.1 × 0.5) = 273 / 2.55 = 107 tok/s
Measured: 57 tok/s
Efficiency: 53%
53% efficiency at 4-bit (MXFP4) with a MoE model, higher than the ~36% seen with simpler models on NVFP4, likely because gpt-oss-120b's MoE routing is well optimized in the community Docker image. More importantly, the number is in the right ballpark. If I'd measured 15 tok/s, that would signal a problem: wrong CUDA version, fallback kernel path, misconfigured quantization. If I'd measured 200 tok/s, the measurement itself would be wrong.
The formula doesn’t tell you the exact number. It tells you the neighborhood the number should live in.
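That "neighborhood" logic can be captured in a small helper. The function and the 30%/100% thresholds are my own sketch of the rules of thumb in this appendix, not an official tool; tune the bounds for your hardware:

```python
def sanity_check(measured: float, bandwidth_gbps: float,
                 active_params_b: float, bytes_per_param: float) -> str:
    """Flag a measured decode speed against the bandwidth ceiling."""
    ceiling = bandwidth_gbps / (active_params_b * bytes_per_param)
    eff = measured / ceiling
    if eff > 1.0:
        return "suspicious: above theoretical ceiling, check the measurement"
    if eff < 0.30:
        return "suspicious: below 30%, check CUDA version/kernels/quantization"
    return f"healthy ({eff:.0%} of {ceiling:.0f} tok/s ceiling)"

# gpt-oss-120b on DGX Spark: 57 tok/s measured, 5.1B active, MXFP4 (0.5 B/param)
print(sanity_check(57, 273, 5.1, 0.5))  # → healthy (53% of 107 tok/s ceiling)
```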
One note: benchmarks occasionally report bandwidth above 273 GB/s (up to ~285 GB/s). That’s burst — the LPDDR5x controller can exceed the sustained spec for short windows. For the formula, always use the rated 273 GB/s, not burst peaks.
Quick Reference
tok/s = bandwidth_GB/s / (active_params_B × bytes_per_param)
| Hardware | BW (GB/s) | BF16 ceiling | FP8 ceiling | NVFP4 ceiling |
|---|---|---|---|---|
| DGX Spark (GB10) | 273 | 45.5 | 91 | 182 |
| Mac Studio M2 Ultra | 800 | 133 | 267 | 533 |
| RTX 5090 (32 GB) | 1,792 | 299 | 597 | 1,195 |
| RTX PRO 6000 (96 GB) | 1,792 | 299 | 597 | 1,195 |
Ceilings assume 3B active parameters (e.g., Qwen3-Coder-30B-A3B). Scale linearly with active parameter count.
Expect 60-70% of these ceilings for BF16/FP8 workloads. For NVFP4, expect 35-55% depending on architecture (SM100 vs SM121) and kernel optimization.
The Sweet Spot: Reverse the Formula
For interactive use, flip the formula to find the maximum model size at a target speed:
Max data per token = bandwidth / target_tok/s
On the DGX Spark, targeting 10 tok/s minimum:
273 / 10 = 27.3 GB per token pass
Active weights plus KV-cache must stay under 27.3 GB. This is why MoE models with small active parameter counts are the sweet spot for bandwidth-limited hardware — a 120B model with 5.1B active parameters at NVFP4 only needs ~2.55 GB per pass, well within budget.
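The reversed budget is one division; the function name below is illustrative:

```python
def max_gb_per_token(bandwidth_gbps: float, target_tok_s: float) -> float:
    """Budget: max bytes moved per token pass (GB) at a target decode speed."""
    return bandwidth_gbps / target_tok_s

budget = max_gb_per_token(273, 10)
print(f"{budget:.1f} GB per token pass")  # → 27.3 GB per token pass

# gpt-oss-120b at FP4: 5.1B active × 0.5 bytes/param = 2.55 GB, well under budget
print(5.1 * 0.5 <= budget)  # → True
```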
The Takeaway
One formula, one minute of math, and you know whether your hardware is working correctly. If your measured throughput is 50-70% of the theoretical ceiling for BF16/FP8, your setup is healthy. If it’s below 30%, something is wrong — check your CUDA version, kernel paths, and quantization config.
The community used this formula to diagnose the 40 t/s problem from Post 3. Theoretical NVFP4 throughput was 107 tok/s for gpt-oss-120b. Getting 40 tok/s meant 37% efficiency — plausible but suspiciously low. That suspicion led to finding the SM121 fallback paths and the CUDA 12.1a issue. The formula didn’t fix the problem, but it told everyone the problem existed.