Nemotron-3-Super-120B: Topology Benchmark on DGX Spark
The model where the cluster never loses. Nemotron-3-Super benefits more from TP=2 than any other model tested -- and the SM12.1 CUTLASS patch doubles performance vs FlashInfer.
Found 19 posts
One formula tells you whether your hardware is working correctly or misconfigured. The community used this to diagnose the 40 t/s problem.
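The post doesn't quote the formula here, but the usual candidate for this kind of diagnosis is the memory-bandwidth roofline for decode: each generated token has to stream the active weights through memory at least once, so bandwidth divided by bytes-read-per-token is a hard ceiling on tokens per second. A minimal sketch, assuming this is the formula meant; the 273 GB/s figure is the DGX Spark's published memory bandwidth, and the bytes-per-token value below is a hypothetical placeholder, not a measurement:

```python
def max_decode_tps(bandwidth_gb_s: float, bytes_per_token_gb: float) -> float:
    """Upper bound on decode tokens/sec from the bandwidth roofline:
    every decoded token must read the active weights from memory once."""
    return bandwidth_gb_s / bytes_per_token_gb

# Illustrative only: 273 GB/s (DGX Spark spec) with a hypothetical
# 2.7 GB of active weights read per token caps decode near 101 t/s.
print(round(max_decode_tps(273, 2.7)))
```

If measured throughput lands far below this ceiling, the bottleneck is configuration rather than hardware; at or near the ceiling, the hardware is doing all it can.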
Blackwell's native MX-format support eliminates the dequantization tax — the hardware and gpt-oss-120b were designed for each other.
Cluster TP=2 dominates decode up to c32, then simple-shuffle takes over. Least-busy routing collapses under load. Full topology comparison with four configurations.
Everything in the series built toward this: Claude Code running on locally served models. Here's what works, what's rough, and where it's heading.
The recipe system from spark-vllm-docker turns twenty minutes of flag archaeology into one command — and makes everything reproducible.
Neither topology dominates — cluster wins decode at every concurrency level, but 2x Solo wins prefill and TTFT under load. The right choice depends on the workload.
The second DGX Spark arrived. Before writing a single line of config: check firmware. Then cables, SSH, Docker, vLLM, model cache — and Claude Code helping build the skills to manage it all.
Skills you build today become components for autonomous agents tomorrow. The progression: skill → command → plugin → autonomous agent. Here's where it's all heading.
Claude Code speaks Anthropic. gpt-oss-120b speaks OpenAI with Harmony-style tool calls. LiteLLM sits in the middle and translates — including a custom callback that patches the tool calls neither side gets right.
10 agents. 3 commands. 3 skills. One plugin that turns a topic into a finished presentation. Here's how agents, commands, skills, and hooks work together inside Marp Magic.
Post 3 hit 4,158 t/s at c64. llama-benchy puts those numbers under the microscope. Same hardware, two tools, 25x difference in TTFT.
Every reviewer tested single-user latency and called the DGX Spark slow. Nobody tested concurrency. The community found the real number: 3,975 tokens per second.
Manual invoice matching: 20% exception rates, hours of CFO time, error-prone. A skill reduces that to under 5% in minutes. Here's how — and it was built by a non-developer.
The standard recipe works but wastes the hardware. Scaling from 20B to 120B on Ollama shows the potential — and the ceiling.
Write a skill file. Run it on messy meeting notes. Get a structured summary. Refine it. Run it again. The whole cycle in 10 minutes — no code, just clear instructions.
NVIDIA's own monitoring can't see its newest hardware. The community had a fix before NVIDIA did.
You don't need to write code to use Claude Code. Skills are instructions in plain Markdown — if you can write a recipe, you can teach AI to do your repetitive work.
Specs can lie in both directions. Snake oil oversells. Reviews undersell. The only truth is your own testing.