# Nemotron-3-Super-120B: Topology Benchmark on DGX Spark
## The Model
nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 — 120B total parameters, 12B active per token. LatentMoE (Mamba-2 + MoE + Attention hybrid) with NVFP4 quantization. At 69.5 GiB, it nearly fills a single DGX Spark’s memory budget.
Nemotron-3-Super is the slowest model in this benchmark series by raw token speed, but it has unique characteristics: a reasoning-heavy architecture that generates thinking tokens before content, and 12B active params that create an ideal ratio for tensor parallelism — enough compute to justify two GPUs, but NCCL traffic small enough to keep synchronization cheap.
This is also the model that exposed the SM12.1 kernel problem. Without the CUTLASS patch, FlashInfer/Marlin backends run at half speed.
## Test Setup
- Hardware: 2x NVIDIA DGX Spark (Grace Blackwell GB10, 128 GB unified each)
- Interconnect: ConnectX-7 dual 200GbE QSFP stacking link
- Image: vllm-experimental (vLLM 0.17.2rc1.dev0 + SM12.1 FP8 CUTLASS patch)
- Patch: saifgithub/vllm-gb10-sm121
- Recipe: nvidia-nemotron-3-super-120b-nvfp4-cutlass.yaml (gpu_memory_utilization=0.70, max_model_len=262144)
- Benchmark: llama-benchy 0.3.5, pp1024 tg128, 50 runs per concurrency level
- Date: 2026-03-30
## Topologies Tested
| Topology | Description |
|---|---|
| Solo | Single DGX Spark (GF2), TP=1 |
| Cluster TP=2 | Both nodes via Ray distributed, single API endpoint |
| 2x Solo + simple-shuffle | Both nodes independent, LiteLLM simple-shuffle |
| 2x Solo + least-busy | Both nodes independent, LiteLLM routes to least loaded |
## Decode Throughput
Generation speed (tg128) — higher is better. Solo c64 omitted (OOM).
| Concurrency | Solo | Cluster TP=2 | Shuffle | Least-busy | Winner |
|---|---|---|---|---|---|
| c1 | 16.29 | 22.97 | 16.43 | 16.43 | Cluster (+40%) |
| c2 | 28.47 | 37.77 | 30.37 | 28.52 | Cluster (+24%) |
| c4 | 43.68 | 52.99 | 50.22 | 43.51 | Cluster (+6%) |
| c8 | 55.79 | 82.77 | 73.81 | 55.94 | Cluster (+12%) |
| c16 | 69.08 | 103.71 | 96.53 | 70.11 | Cluster (+7%) |
| c32 | 83.50 | 128.27 | 124.21 | 83.88 | Cluster (+3%) |
| c64 | — | 156.40 | 156.02 | 99.14 | Tie |
Cluster TP=2 wins or ties decode at every concurrency level. This is unique among the three models tested — GPT-OSS-120B and Gemma4 both lose to shuffle at c64.
The advantage is largest at c1 (+40%) and narrows to +3% at c32 before converging to a tie at c64. Least-busy tracks Solo almost exactly — confirming it’s effectively a single-node topology due to backend starvation.
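The narrowing edge can be recomputed directly from the decode table above — a quick sanity check on the Winner column, not new data (cluster vs. the best alternative topology at each level):

```python
# Decode t/s copied from the table: Cluster TP=2 vs best alternative
# (shuffle at every level here).
cluster = {1: 22.97, 2: 37.77, 4: 52.99, 8: 82.77, 16: 103.71, 32: 128.27}
best_alt = {1: 16.43, 2: 30.37, 4: 50.22, 8: 73.81, 16: 96.53, 32: 124.21}

for c in cluster:
    edge = cluster[c] / best_alt[c] - 1
    print(f"c{c}: +{edge:.0%}")  # c1: +40% ... c32: +3%
```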
## Prefill Throughput
Prompt processing speed (pp1024) — higher is better.
| Concurrency | Solo | Cluster TP=2 | Shuffle | Least-busy | Winner |
|---|---|---|---|---|---|
| c1 | 1,328 | 2,110 | 1,302 | 1,317 | Cluster (+59%) |
| c2 | 1,535 | 2,168 | 1,959 | 1,550 | Cluster (+11%) |
| c4 | 1,512 | 2,146 | 2,340 | 1,522 | Shuffle (+9%) |
| c8 | 1,546 | 2,330 | 2,436 | 1,568 | Shuffle (+5%) |
| c16 | 1,552 | 2,310 | 2,531 | 1,595 | Shuffle (+10%) |
| c32 | 1,549 | 2,338 | 2,706 | 1,560 | Shuffle (+16%) |
| c64 | — | 2,292 | 2,824 | 1,178 | Shuffle (+23%) |
Prefill tells a different story. Cluster wins at c1-c2 where its faster per-request processing matters, but shuffle overtakes at c4 and keeps pulling away. At c64, shuffle leads by 23%.
Note how flat Solo’s prefill curve is — it saturates early at ~1,550 t/s and can’t grow regardless of concurrency. The 69.5 GiB model leaves limited KV cache budget.
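The squeeze is visible in back-of-envelope arithmetic (numbers from this post; activation and runtime overhead are ignored, so the real KV budget is smaller still):

```python
# Rough upper bound on KV-cache space for the Solo topology.
total_gib = 128.0                  # unified memory per DGX Spark
budget_gib = total_gib * 0.70      # gpu_memory_utilization=0.70
weights_gib = 69.5                 # NVFP4 checkpoint size
kv_gib = budget_gib - weights_gib  # what's left for KV cache, at most
print(f"~{kv_gib:.1f} GiB for KV cache")  # → ~20.1 GiB
```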
## Time to First Token (TTFT)
End-to-end latency to first generated token (ms) — lower is better.
| Concurrency | Solo | Cluster TP=2 | Shuffle | Least-busy | Winner |
|---|---|---|---|---|---|
| c1 | 777 | 504 | 814 | 780 | Cluster (-35%) |
| c2 | 1,337 | 867 | 1,129 | 1,323 | Cluster (-23%) |
| c4 | 2,598 | 1,490 | 1,605 | 2,566 | Cluster (-7%) |
| c8 | 4,368 | 2,375 | 2,878 | 4,295 | Cluster (-17%) |
| c16 | 7,333 | 4,238 | 4,623 | 7,124 | Cluster (-8%) |
| c32 | 12,744 | 7,709 | 7,541 | 12,637 | Shuffle (-2%) |
| c64 | — | 14,803 | 12,888 | 22,789 | Shuffle (-13%) |
The TTFT crossover is later than the other models — cluster holds the lead through c16 and only loses at c32. This reflects Nemotron’s higher per-request latency: each request is expensive enough that tensor parallelism helps more than request independence.
Least-busy at c64 is catastrophic: 22.8 seconds TTFT, roughly 1.5x worse than cluster and 1.8x worse than shuffle.
## The CUTLASS Patch: 2x Performance
This benchmark only exists because of the SM12.1 CUTLASS patch. Without it, Nemotron-3-Super runs at half speed on DGX Spark.
| Backend | c1 decode | c2 decode | c4 decode | c1 prefill |
|---|---|---|---|---|
| FlashInfer+Marlin (no patch) | 8.16 t/s | 13.19 t/s | 18.45 t/s | 1,071 t/s |
| Old image v0.17.1 (PTX JIT) | 15.22 t/s | 26.72 t/s | 38.87 t/s | 1,113 t/s |
| CUTLASS (SM12.1 patched) | 16.29 t/s | 28.47 t/s | 43.68 t/s | 1,328 t/s |
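The 2x headline checks out against the table (decode numbers copied from above):

```python
# Decode t/s at c1/c2/c4: unpatched FlashInfer+Marlin vs patched CUTLASS.
unpatched = [8.16, 13.19, 18.45]
patched = [16.29, 28.47, 43.68]

for c, u, p in zip([1, 2, 4], unpatched, patched):
    print(f"c{c}: {p / u:.2f}x")  # 2.00x, 2.16x, 2.37x
```

The gap widens with concurrency, so "half speed" is actually the best case for the unpatched backends.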
### Why FlashInfer/Marlin are slow on SM12.1
vLLM’s kernel priority: Marlin > FlashInfer > CUTLASS > Torch. On DGX Spark (SM12.1):
- `MarlinFP8` is rejected by default (capability gate >= 89), unless `VLLM_TEST_FORCE_FP8_MARLIN=1` forces it — which the original recipe did, actively hurting performance
- `FlashInferFP8` is selected next (capability >= 100) — functional but not optimized for SM12.1
- `CutlassFP8` would crash without the `enable_sm120_family` patch (SM12.1 cubins hit `asm("trap")`)
The fix is to disable FlashInfer/Marlin and let the patched CUTLASS kernels handle everything:
```yaml
env:
  VLLM_NVFP4_GEMM_BACKEND: "cutlass"
  VLLM_MARLIN_USE_ATOMIC_ADD: "1"
  VLLM_USE_FLASHINFER_MOE_FP4: "0"
  VLLM_DISABLED_KERNELS: "MarlinFP8ScaledMMLinearKernel,FlashInferFP8ScaledMMLinearKernel"
```
## Routing: Least-busy Starvation Confirmed
Third model, same result. Least-busy at c4+ matches Solo throughput almost exactly — it’s routing almost all traffic to one backend.
| Concurrency | Shuffle decode | Least-busy decode | Gap |
|---|---|---|---|
| c4 | 50.22 t/s | 43.51 t/s | +15% |
| c16 | 96.53 t/s | 70.11 t/s | +38% |
| c64 | 156.02 t/s | 99.14 t/s | +57% |
Never use least-busy for identical backends. This is now confirmed across GPT-OSS-120B, Gemma4, and Nemotron-3-Super.
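One plausible mechanism for the starvation, sketched below as a toy model: if the router's in-flight counters lag behind dispatch (say, refreshed only on completion), a deterministic `min()` tiebreak sends every request in a burst to the same backend. This is an illustrative assumption, not LiteLLM's actual implementation.

```python
import random

def simple_shuffle(n_backends, loads):
    # Load-agnostic: pick a backend uniformly at random.
    return random.randrange(n_backends)

def least_busy(n_backends, loads):
    # Pick the lowest *tracked* load; min() breaks ties by lowest index.
    return min(range(n_backends), key=lambda i: loads[i])

def route_burst(strategy, n_backends=2, n_requests=64):
    # Stale-counter assumption: loads never update during the burst.
    hits = [0] * n_backends
    loads = [0] * n_backends
    for _ in range(n_requests):
        hits[strategy(n_backends, loads)] += 1
    return hits

random.seed(0)
print(route_burst(simple_shuffle))  # roughly even split
print(route_burst(least_busy))      # [64, 0]: all traffic to one backend
```

Shuffle stays balanced with zero load information; least-busy is only as good as the freshness of its counters.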
## When to Use Which Topology
| Use Case | Best Topology | Why |
|---|---|---|
| Single user / low latency (c1-c4) | Cluster TP=2 | +6-40% decode, -7-35% TTFT |
| Claude Code / moderate load (c8-c16) | Cluster TP=2 | +7-12% decode, best TTFT |
| High throughput (c32+) | Cluster TP=2 | Still wins decode (+3%), ties at c64 |
| Prefill-heavy workloads (c4+) | 2x Solo + simple-shuffle | +5-23% prefill vs cluster |
| Lowest TTFT at scale (c32+) | 2x Solo + simple-shuffle | 2-13% better TTFT than cluster |
Cluster TP=2 is the recommended default for Nemotron-3-Super — it’s the only model where cluster wins or ties decode at every concurrency level. The 12B active params create ideal tensor parallelism conditions.
The CUTLASS recipe is mandatory — FlashInfer/Marlin backends are 2x slower on SM12.1.
## Technical Notes
- `GLOO_SOCKET_IFNAME`: required on vLLM v0.17.2+ for cluster mode (`enp1s0f0np0`). Independent of `NCCL_SOCKET_IFNAME` — Gloo and NCCL use separate transports
- Solo c64: OOM — the 69.5 GiB model at gpu_mem 0.70 doesn’t leave enough KV cache for 64 concurrent 1024-token prefills
- Reasoning tokens: Nemotron generates thinking tokens before content, making decode sequences longer and amplifying the benefit of faster per-request decode (cluster advantage)
- SM12.1 patch source: saifgithub/vllm-gb10-sm121 — enables `enable_sm120_family` for CUTLASS FP8 kernels