Appendix B. The Napkin Math: Predicting Token Speed from Memory Bandwidth
One formula tells you whether your hardware is working correctly or misconfigured. The community used this to diagnose the 40 t/s problem.
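The formula itself fits in a few lines. Here is a minimal sketch with illustrative numbers; the bandwidth and per-token weight figures below are placeholders, not the series' measured values.

```python
def napkin_tokens_per_sec(bandwidth_gb_s: float, weights_read_gb: float) -> float:
    """Decode is memory-bound: every generated token streams the active
    weights from memory once, so throughput is roughly bandwidth / bytes."""
    return bandwidth_gb_s / weights_read_gb

# Placeholder numbers -- substitute your hardware's measured bandwidth and
# the size of the weights actually read per token (for an MoE model, only
# the active experts count, not the full checkpoint).
bandwidth_gb_s = 273.0   # GB/s, example unified-memory figure
weights_read_gb = 6.0    # GB streamed per decoded token, example figure

ceiling = napkin_tokens_per_sec(bandwidth_gb_s, weights_read_gb)
print(f"napkin ceiling: ~{ceiling:.0f} t/s")
# Measured decode speed far below this ceiling usually points at a
# misconfigured stack (wrong backend, on-the-fly dequantization,
# spilled weights), not slow hardware.
```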
Blackwell's native MX-format support eliminates the dequantization tax — the hardware and gpt-oss-120b were designed for each other.
Everything in the series built toward this: Claude Code running on locally served models. Here's what works, what's rough, and where it's heading.
The recipe system from spark-vllm-docker turns twenty minutes of flag archaeology into one command — and makes everything reproducible.
Neither topology dominates — cluster wins decode at every concurrency level, but 2x Solo wins prefill and TTFT under load. The right choice depends on the workload.
The second DGX Spark arrived. Before writing a single line of config: check firmware. Then cables, SSH, Docker, vLLM, model cache — and Claude Code helping build the skills to manage it all.
Claude Code speaks Anthropic. gpt-oss-120b speaks OpenAI with Harmony-style tool calls. LiteLLM sits in the middle and translates — including a custom callback that patches the tool calls neither side gets right.
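To make the mismatch concrete: this is not the series' actual LiteLLM callback, just a minimal sketch of the shape difference the proxy has to bridge, assuming standard OpenAI-style tool_calls and Anthropic-style tool_use blocks.

```python
import json

def openai_tool_calls_to_anthropic(tool_calls: list[dict]) -> list[dict]:
    """Convert OpenAI-style tool_calls into Anthropic-style tool_use blocks.

    OpenAI encodes arguments as a JSON *string*; Anthropic expects a parsed
    object under "input". Details like this are exactly what a translation
    layer has to patch when neither side emits quite the right thing.
    """
    blocks = []
    for call in tool_calls:
        blocks.append({
            "type": "tool_use",
            "id": call["id"],
            "name": call["function"]["name"],
            "input": json.loads(call["function"]["arguments"] or "{}"),
        })
    return blocks

# Example: one OpenAI-style tool call, as a local model might emit it.
openai_style = [{
    "id": "call_0",
    "type": "function",
    "function": {"name": "read_file", "arguments": '{"path": "README.md"}'},
}]

print(json.dumps(openai_tool_calls_to_anthropic(openai_style), indent=2))
```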
Post 3 hit 4,158 t/s at c64. llama-benchy puts those numbers under the microscope. Same hardware, two tools, 25x difference in TTFT.
Every reviewer tested single-user latency and called the DGX Spark slow. Nobody tested concurrency. The community found the real number: 3,975 tokens per second.
The standard recipe works but wastes the hardware. Scaling from 20B to 120B on Ollama shows the potential — and the ceiling.
NVIDIA's own monitoring can't see their newest hardware. The community had a fix before NVIDIA did.
Specs can lie in both directions. Snake oil oversells. Reviews undersell. The only truth is your own testing.