Nemotron-3-Super-120B: Topology Benchmark on DGX Spark


The Model

nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 — 120B total parameters, 12B active per token. LatentMoE (Mamba-2 + MoE + Attention hybrid) with NVFP4 quantization. At 69.5 GiB, it nearly fills a single DGX Spark’s memory budget.

Nemotron-3-Super is the slowest model in this benchmark series by raw token speed, but it has two distinctive traits: a reasoning-heavy architecture that generates thinking tokens before content, and 12B active parameters that hit a sweet spot for tensor parallelism (enough compute per token to justify two GPUs, yet little enough NCCL traffic to keep synchronization cheap).

This is also the model that exposed the SM12.1 kernel problem. Without the CUTLASS patch, FlashInfer/Marlin backends run at half speed.

Test Setup

  • Hardware: 2x NVIDIA DGX Spark (Grace Blackwell GB10, 128 GB unified each)
  • Interconnect: ConnectX-7 dual 200GbE QSFP stacking link
  • Image: vllm-experimental (vLLM 0.17.2rc1.dev0 + SM12.1 FP8 CUTLASS patch)
  • Patch: saifgithub/vllm-gb10-sm121
  • Recipe: nvidia-nemotron-3-super-120b-nvfp4-cutlass.yaml (gpu_memory_utilization=0.70, max_model_len=262144; key knobs sketched after this list)
  • Benchmark: llama-benchy 0.3.5, pp1024 tg128, 50 runs per concurrency level
  • Date: 2026-03-30
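
For reference, the recipe's key knobs map onto vLLM's Python API roughly as follows. This is a minimal sketch, not the serving recipe itself; the benchmark ran the YAML recipe through the container image above.

```python
from vllm import LLM

# Minimal sketch of the recipe's key knobs via vLLM's offline API.
# The benchmark used the serving recipe YAML, not this entry point.
llm = LLM(
    model="nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4",
    gpu_memory_utilization=0.70,  # vLLM's share of unified memory
    max_model_len=262144,         # 256K context window
    tensor_parallel_size=1,       # Solo topology; 2 for the Ray cluster
)
```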

Topologies Tested

| Topology | Description |
|---|---|
| Solo | Single DGX Spark (GF2), TP=1 |
| Cluster TP=2 | Both nodes via Ray distributed, single API endpoint |
| 2x Solo + simple-shuffle | Both nodes independent, LiteLLM simple-shuffle |
| 2x Solo + least-busy | Both nodes independent, LiteLLM routes to least loaded |
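
The two LiteLLM topologies differ by a single router setting. Here is a minimal sketch using LiteLLM's Python Router; the spark-1/spark-2 hostnames are placeholders for the two nodes.

```python
from litellm import Router

# Two identical vLLM backends behind one logical model name.
# Hostnames are placeholders; both nodes serve the same NVFP4 checkpoint.
model_list = [
    {
        "model_name": "nemotron-3-super",
        "litellm_params": {
            "model": "openai/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4",
            "api_base": "http://spark-1:8000/v1",
            "api_key": "unused",  # vLLM doesn't check it by default
        },
    },
    {
        "model_name": "nemotron-3-super",
        "litellm_params": {
            "model": "openai/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4",
            "api_base": "http://spark-2:8000/v1",
            "api_key": "unused",
        },
    },
]

router = Router(
    model_list=model_list,
    routing_strategy="simple-shuffle",  # or "least-busy"; see Routing section
)
```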

Decode Throughput

Generation speed (tg128) — higher is better. Solo c64 omitted (OOM).

| Concurrency | Solo | Cluster TP=2 | Shuffle | Least-busy | Winner |
|---|---|---|---|---|---|
| c1 | 16.29 | 22.97 | 16.43 | 16.43 | Cluster (+40%) |
| c2 | 28.47 | 37.77 | 30.37 | 28.52 | Cluster (+24%) |
| c4 | 43.68 | 52.99 | 50.22 | 43.51 | Cluster (+6%) |
| c8 | 55.79 | 82.77 | 73.81 | 55.94 | Cluster (+12%) |
| c16 | 69.08 | 103.71 | 96.53 | 70.11 | Cluster (+7%) |
| c32 | 83.50 | 128.27 | 124.21 | 83.88 | Cluster (+3%) |
| c64 | OOM | 156.40 | 156.02 | 99.14 | Tie |
```mermaid
---
config:
  themeVariables:
    xyChart:
      titleColor: "#333"
      plotColorPalette: "#6b7280, #3b82f6, #22c55e, #ef4444"
---
xychart-beta
  title "Decode Throughput (tokens/s) — Nemotron-3-Super-120B"
  x-axis ["c1", "c2", "c4", "c8", "c16", "c32", "c64"]
  y-axis "tokens/s" 0 --> 170
  line "Solo" [16.29, 28.47, 43.68, 55.79, 69.08, 83.50, 83.50]
  line "Cluster TP=2" [22.97, 37.77, 52.99, 82.77, 103.71, 128.27, 156.40]
  line "Shuffle" [16.43, 30.37, 50.22, 73.81, 96.53, 124.21, 156.02]
  line "Least-busy" [16.43, 28.52, 43.51, 55.94, 70.11, 83.88, 99.14]
```


Cluster TP=2 wins or ties decode at every concurrency level. This is unique among the three models tested — GPT-OSS-120B and Gemma4 both lose to shuffle at c64.

The advantage is largest at c1 (+40%) and narrows to +3% at c32 before converging to a tie at c64. Least-busy tracks Solo almost exactly — confirming it’s effectively a single-node topology due to backend starvation.

Prefill Throughput

Prompt processing speed (pp1024) — higher is better.

| Concurrency | Solo | Cluster TP=2 | Shuffle | Least-busy | Winner |
|---|---|---|---|---|---|
| c1 | 1,328 | 2,110 | 1,302 | 1,317 | Cluster (+62%) |
| c2 | 1,535 | 2,168 | 1,959 | 1,550 | Cluster (+11%) |
| c4 | 1,512 | 2,146 | 2,340 | 1,522 | Shuffle (+9%) |
| c8 | 1,546 | 2,330 | 2,436 | 1,568 | Shuffle (+5%) |
| c16 | 1,552 | 2,310 | 2,531 | 1,595 | Shuffle (+10%) |
| c32 | 1,549 | 2,338 | 2,706 | 1,560 | Shuffle (+16%) |
| c64 | OOM | 2,292 | 2,824 | 1,178 | Shuffle (+23%) |
```mermaid
---
config:
  themeVariables:
    xyChart:
      titleColor: "#333"
      plotColorPalette: "#6b7280, #3b82f6, #22c55e, #ef4444"
---
xychart-beta
  title "Prefill Throughput (tokens/s) — Nemotron-3-Super-120B"
  x-axis ["c1", "c2", "c4", "c8", "c16", "c32", "c64"]
  y-axis "tokens/s" 0 --> 3000
  line "Solo" [1328, 1535, 1512, 1546, 1552, 1549, 1549]
  line "Cluster TP=2" [2110, 2168, 2146, 2330, 2310, 2338, 2292]
  line "Shuffle" [1302, 1959, 2340, 2436, 2531, 2706, 2824]
  line "Least-busy" [1317, 1550, 1522, 1568, 1595, 1560, 1178]
```


Prefill tells a different story. Cluster wins at c1-c2 where its faster per-request processing matters, but shuffle overtakes at c4 and keeps pulling away. At c64, shuffle leads by 23%.

Note how flat Solo's prefill curve is: it saturates around 1,550 t/s by c2 and can't grow regardless of concurrency. The 69.5 GiB model leaves limited KV cache budget: at gpu_memory_utilization=0.70, vLLM claims roughly 0.70 × 128 GB ≈ 90 GB of unified memory, and after the weights only on the order of 15-20 GB remains for KV cache and activations.

Time to First Token (TTFT)

End-to-end latency to first generated token (ms) — lower is better.

| Concurrency | Solo | Cluster TP=2 | Shuffle | Least-busy | Winner |
|---|---|---|---|---|---|
| c1 | 777 | 504 | 814 | 780 | Cluster (-35%) |
| c2 | 1,337 | 867 | 1,129 | 1,323 | Cluster (-23%) |
| c4 | 2,598 | 1,490 | 1,605 | 2,566 | Cluster (-7%) |
| c8 | 4,368 | 2,375 | 2,878 | 4,295 | Cluster (-17%) |
| c16 | 7,333 | 4,238 | 4,623 | 7,124 | Cluster (-8%) |
| c32 | 12,744 | 7,709 | 7,541 | 12,637 | Shuffle (-2%) |
| c64 | OOM | 14,803 | 12,888 | 22,789 | Shuffle (-13%) |
```mermaid
---
config:
  themeVariables:
    xyChart:
      titleColor: "#333"
      plotColorPalette: "#6b7280, #3b82f6, #22c55e, #ef4444"
---
xychart-beta
  title "TTFT (ms, lower is better) — Nemotron-3-Super-120B"
  x-axis ["c1", "c2", "c4", "c8", "c16", "c32", "c64"]
  y-axis "ms" 0 --> 23000
  line "Solo" [777, 1337, 2598, 4368, 7333, 12744, 12744]
  line "Cluster TP=2" [504, 867, 1490, 2375, 4238, 7709, 14803]
  line "Shuffle" [814, 1129, 1605, 2878, 4623, 7541, 12888]
  line "Least-busy" [780, 1323, 2566, 4295, 7124, 12637, 22789]
```


The TTFT crossover comes later than on the other models: cluster holds the lead through c16 and only loses it at c32. This reflects Nemotron's higher per-request latency: each request is expensive enough that tensor parallelism helps more than request independence.

Least-busy at c64 is catastrophic: 22.8 seconds TTFT, roughly 1.5x worse than cluster and nearly 1.8x worse than shuffle.

The CUTLASS Patch: 2x Performance

This benchmark only exists because of the SM12.1 CUTLASS patch. Without it, Nemotron-3-Super runs at half speed on DGX Spark.

| Backend | c1 decode | c2 decode | c4 decode | c1 prefill |
|---|---|---|---|---|
| FlashInfer+Marlin (no patch) | 8.16 t/s | 13.19 t/s | 18.45 t/s | 1,071 t/s |
| Old image v0.17.1 (PTX JIT) | 15.22 t/s | 26.72 t/s | 38.87 t/s | 1,113 t/s |
| CUTLASS (SM12.1 patched) | 16.29 t/s | 28.47 t/s | 43.68 t/s | 1,328 t/s |
```mermaid
---
config:
  themeVariables:
    xyChart:
      titleColor: "#333"
      plotColorPalette: "#ef4444, #f59e0b, #22c55e"
---
xychart-beta
  title "Backend Comparison — Solo Decode (tokens/s)"
  x-axis ["c1", "c2", "c4"]
  y-axis "tokens/s" 0 --> 50
  line "FlashInfer+Marlin" [8.16, 13.19, 18.45]
  line "Old v0.17.1 PTX" [15.22, 26.72, 38.87]
  line "CUTLASS SM12.1" [16.29, 28.47, 43.68]
```


Why FlashInfer/Marlin are slow on SM12.1

vLLM’s kernel priority: Marlin > FlashInfer > CUTLASS > Torch. On DGX Spark (SM12.1):

  • MarlinFP8 is rejected by default on devices with compute capability >= 89 (native FP8), unless VLLM_TEST_FORCE_FP8_MARLIN=1 forces it. The original recipe set that flag, actively hurting performance
  • FlashInferFP8 is selected next (capability gate >= 100); it works on SM12.1 but isn't optimized for it
  • CutlassFP8 would crash without the enable_sm120_family patch (SM12.1 cubins hit asm("trap"))

The fix is to disable FlashInfer/Marlin and let the patched CUTLASS kernels handle everything:

```yaml
env:
  VLLM_NVFP4_GEMM_BACKEND: "cutlass"
  VLLM_MARLIN_USE_ATOMIC_ADD: "1"
  VLLM_USE_FLASHINFER_MOE_FP4: "0"
  VLLM_DISABLED_KERNELS: "MarlinFP8ScaledMMLinearKernel,FlashInferFP8ScaledMMLinearKernel"
```
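
To see why these disables land on the patched CUTLASS path, here is an illustrative sketch of the capability-gated priority described above. This is not vLLM's actual source, just the selection logic expressed as code.

```python
# Illustrative only -- not vLLM's code. Models the FP8 scaled-MM kernel
# priority (Marlin > FlashInfer > CUTLASS > Torch) on SM12.1 (cap 121).
def pick_fp8_kernel(cap: int, disabled: set[str], force_marlin: bool = False) -> str:
    # Marlin targets GPUs without native FP8 (< 89); on SM12.1 it is
    # rejected unless VLLM_TEST_FORCE_FP8_MARLIN=1 (as the old recipe set).
    if "MarlinFP8" not in disabled and (cap < 89 or force_marlin):
        return "MarlinFP8"
    # FlashInfer passes its >= 100 gate on SM12.1 but isn't tuned for it.
    if "FlashInferFP8" not in disabled and cap >= 100:
        return "FlashInferFP8"
    # The patched CUTLASS path: crashes without enable_sm120_family.
    if "CutlassFP8" not in disabled and cap >= 89:
        return "CutlassFP8"
    return "Torch"  # last-resort fallback

print(pick_fp8_kernel(121, disabled=set()))                           # FlashInferFP8
print(pick_fp8_kernel(121, disabled={"MarlinFP8", "FlashInferFP8"}))  # CutlassFP8
```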

Routing: Least-busy Starvation Confirmed

Third model, same result. At c4 and above, least-busy tracks Solo throughput almost exactly: it is routing nearly all traffic to one backend.

| Concurrency | Shuffle decode | Least-busy decode | Gap |
|---|---|---|---|
| c4 | 50.22 t/s | 43.51 t/s | +15% |
| c16 | 96.53 t/s | 70.11 t/s | +38% |
| c64 | 156.02 t/s | 99.14 t/s | +57% |

Never use least-busy for identical backends. This is now confirmed across GPT-OSS-120B, Gemma4, and Nemotron-3-Super.
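
In practice that means pinning the router to simple-shuffle. Continuing the Router sketch from the Topologies section (same placeholder names and model_list):

```python
# Same model_list as the earlier sketch; for identical backends, keep
# routing_strategy at simple-shuffle rather than least-busy.
router = Router(model_list=model_list, routing_strategy="simple-shuffle")

resp = router.completion(
    model="nemotron-3-super",
    messages=[{"role": "user", "content": "ping"}],
)
print(resp.choices[0].message.content)
```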

When to Use Which Topology

| Use Case | Best Topology | Why |
|---|---|---|
| Single user / low latency (c1-c4) | Cluster TP=2 | +6% to +40% decode, 7-35% lower TTFT |
| Claude Code / moderate load (c8-c16) | Cluster TP=2 | +7% to +12% decode, best TTFT |
| High throughput (c32+) | Cluster TP=2 | Still wins decode (+3%), ties at c64 |
| Prefill-heavy workloads (c4+) | 2x Solo + simple-shuffle | +5% to +23% prefill vs cluster |
| Best TTFT at scale (c32+) | 2x Solo + simple-shuffle | 2-13% lower TTFT |

Cluster TP=2 is the recommended default for Nemotron-3-Super — it’s the only model where cluster wins or ties decode at every concurrency level. The 12B active params create ideal tensor parallelism conditions.

The CUTLASS recipe is mandatory — FlashInfer/Marlin backends are 2x slower on SM12.1.

Technical Notes

  • GLOO_SOCKET_IFNAME: Required on vLLM v0.17.2+ for cluster mode (enp1s0f0np0). Independent of NCCL_SOCKET_IFNAME, since Gloo and NCCL use separate transports; see the sketch after this list
  • Solo c64: OOM — the 69.5 GiB model at gpu_mem 0.70 doesn’t leave enough KV cache for 64 concurrent 1024-token prefills
  • Reasoning tokens: Nemotron generates thinking tokens before content, making decode sequences longer and amplifying the benefit of faster per-request decode (cluster advantage)
  • SM12.1 patch source: saifgithub/vllm-gb10-sm121 — enables enable_sm120_family for CUTLASS FP8 kernels
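
A hedged sketch of the interface pinning from the first note above. The interface name comes from this setup; pinning NCCL to the same link is an assumption here rather than something the benchmark prescribes.

```python
import os

# Pin Gloo (control plane) and NCCL (data plane) to the ConnectX-7
# stacking link before launching vLLM in cluster mode. They use separate
# transports, so GLOO_SOCKET_IFNAME alone doesn't steer NCCL traffic.
os.environ["GLOO_SOCKET_IFNAME"] = "enp1s0f0np0"
os.environ["NCCL_SOCKET_IFNAME"] = "enp1s0f0np0"  # assumption: same link
```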