# Nemotron-3-Super-120B: Topology Benchmark on DGX Spark
## The Model
nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 — 120B total parameters, 12B active per token. LatentMoE (Mamba-2 + MoE + Attention hybrid) with NVFP4 quantization. At 69.5 GiB, it nearly fills a single DGX Spark’s memory budget.
Nemotron-3-Super is the slowest model in this benchmark series by raw token speed, but it has unique characteristics: a reasoning-heavy architecture that generates thinking tokens before content, and 12B active params that create an ideal ratio for tensor parallelism — enough compute to justify two GPUs, but NCCL traffic small enough to keep synchronization cheap.
This is also the model that exposed the SM12.1 kernel problem. Without the CUTLASS patch, FlashInfer/Marlin backends run at half speed.
## Test Setup
- Hardware: 2x NVIDIA DGX Spark (Grace Blackwell GB10, 128 GB unified each)
- Interconnect: ConnectX-7 dual 200GbE QSFP stacking link
- Image: vllm-experimental (vLLM 0.17.2rc1.dev0 + SM12.1 FP8 CUTLASS patch)
- Patch: saifgithub/vllm-gb10-sm121
- Recipe: nvidia-nemotron-3-super-120b-nvfp4-cutlass.yaml (gpu_memory_utilization=0.70, max_model_len=262144)
- Benchmark: llama-benchy 0.3.5, pp1024 tg128, 50 runs per concurrency level
- Date: 2026-03-30
## Topologies Tested
| Topology | Description |
|---|---|
| Solo | Single DGX Spark (GF2), TP=1 |
| Cluster TP=2 | Both nodes via Ray distributed, single API endpoint |
| 2x Solo + simple-shuffle | Both nodes independent, LiteLLM simple-shuffle |
| 2x Solo + least-busy | Both nodes independent, LiteLLM routes to least loaded |
## Decode Throughput
Generation speed (tg128) — higher is better. Solo c64 omitted (OOM).
| Concurrency | Solo | Cluster TP=2 | Shuffle | Least-busy | Winner |
|---|---|---|---|---|---|
| c1 | 16.29 | 22.97 | 16.43 | 16.43 | Cluster (+40%) |
| c2 | 28.47 | 37.77 | 30.37 | 28.52 | Cluster (+24%) |
| c4 | 43.68 | 52.99 | 50.22 | 43.51 | Cluster (+6%) |
| c8 | 55.79 | 82.77 | 73.81 | 55.94 | Cluster (+12%) |
| c16 | 69.08 | 103.71 | 96.53 | 70.11 | Cluster (+7%) |
| c32 | 83.50 | 128.27 | 124.21 | 83.88 | Cluster (+3%) |
| c64 | — | 156.40 | 156.02 | 99.14 | Tie |
Cluster TP=2 wins or ties decode at every concurrency level. This is unique among the three models tested — GPT-OSS-120B and Gemma4 both lose to shuffle at c64.
The advantage is largest at c1 (+40%) and narrows to +3% at c32 before converging to a tie at c64. Least-busy tracks Solo almost exactly — confirming it’s effectively a single-node topology due to backend starvation.
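The narrowing edge can be recomputed directly from the decode table above — a quick sanity check on the Winner column, not new data (cluster vs. the best alternative topology at each level):

```python
# Decode t/s copied from the table: Cluster TP=2 vs best alternative
# (shuffle at every level here).
cluster = {1: 22.97, 2: 37.77, 4: 52.99, 8: 82.77, 16: 103.71, 32: 128.27}
best_alt = {1: 16.43, 2: 30.37, 4: 50.22, 8: 73.81, 16: 96.53, 32: 124.21}

for c in cluster:
    edge = cluster[c] / best_alt[c] - 1
    print(f"c{c}: +{edge:.0%}")  # c1: +40% ... c32: +3%
```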
## Prefill Throughput
Prompt processing speed (pp1024) — higher is better.
| Concurrency | Solo | Cluster TP=2 | Shuffle | Least-busy | Winner |
|---|---|---|---|---|---|
| c1 | 1,328 | 2,110 | 1,302 | 1,317 | Cluster (+59%) |
| c2 | 1,535 | 2,168 | 1,959 | 1,550 | Cluster (+11%) |
| c4 | 1,512 | 2,146 | 2,340 | 1,522 | Shuffle (+9%) |
| c8 | 1,546 | 2,330 | 2,436 | 1,568 | Shuffle (+5%) |
| c16 | 1,552 | 2,310 | 2,531 | 1,595 | Shuffle (+10%) |
| c32 | 1,549 | 2,338 | 2,706 | 1,560 | Shuffle (+16%) |
| c64 | — | 2,292 | 2,824 | 1,178 | Shuffle (+23%) |
Prefill tells a different story. Cluster wins at c1-c2 where its faster per-request processing matters, but shuffle overtakes at c4 and keeps pulling away. At c64, shuffle leads by 23%.
Note how flat Solo’s prefill curve is — it saturates early at ~1,550 t/s and can’t grow regardless of concurrency. The 69.5 GiB model leaves limited KV cache budget.
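The squeeze is visible in back-of-envelope arithmetic (numbers from this post; activation and runtime overhead are ignored, so the real KV budget is smaller still):

```python
# Rough upper bound on KV-cache space for the Solo topology.
total_gib = 128.0                  # unified memory per DGX Spark
budget_gib = total_gib * 0.70      # gpu_memory_utilization=0.70
weights_gib = 69.5                 # NVFP4 checkpoint size
kv_gib = budget_gib - weights_gib  # what's left for KV cache, at most
print(f"~{kv_gib:.1f} GiB for KV cache")  # → ~20.1 GiB
```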
## Time to First Token (TTFT)
End-to-end latency to first generated token (ms) — lower is better.
| Concurrency | Solo | Cluster TP=2 | Shuffle | Least-busy | Winner |
|---|---|---|---|---|---|
| c1 | 777 | 504 | 814 | 780 | Cluster (-35%) |
| c2 | 1,337 | 867 | 1,129 | 1,323 | Cluster (-23%) |
| c4 | 2,598 | 1,490 | 1,605 | 2,566 | Cluster (-7%) |
| c8 | 4,368 | 2,375 | 2,878 | 4,295 | Cluster (-17%) |
| c16 | 7,333 | 4,238 | 4,623 | 7,124 | Cluster (-8%) |
| c32 | 12,744 | 7,709 | 7,541 | 12,637 | Shuffle (-2%) |
| c64 | — | 14,803 | 12,888 | 22,789 | Shuffle (-13%) |
The TTFT crossover is later than the other models — cluster holds the lead through c16 and only loses at c32. This reflects Nemotron’s higher per-request latency: each request is expensive enough that tensor parallelism helps more than request independence.
Least-busy at c64 is catastrophic: 22.8 seconds TTFT, roughly 1.5x worse than cluster and 1.8x worse than shuffle.
## The CUTLASS Patch: 2x Performance
This benchmark only exists because of the SM12.1 CUTLASS patch. Without it, Nemotron-3-Super runs at half speed on DGX Spark.
| Backend | c1 decode | c2 decode | c4 decode | c1 prefill |
|---|---|---|---|---|
| FlashInfer+Marlin (no patch) | 8.16 t/s | 13.19 t/s | 18.45 t/s | 1,071 t/s |
| Old image v0.17.1 (PTX JIT) | 15.22 t/s | 26.72 t/s | 38.87 t/s | 1,113 t/s |
| CUTLASS (SM12.1 patched) | 16.29 t/s | 28.47 t/s | 43.68 t/s | 1,328 t/s |
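The 2x headline checks out against the table (decode numbers copied from above):

```python
# Decode t/s at c1/c2/c4: unpatched FlashInfer+Marlin vs patched CUTLASS.
unpatched = [8.16, 13.19, 18.45]
patched = [16.29, 28.47, 43.68]

for c, u, p in zip([1, 2, 4], unpatched, patched):
    print(f"c{c}: {p / u:.2f}x")  # 2.00x, 2.16x, 2.37x
```

The gap widens with concurrency, so "half speed" is actually the best case for the unpatched backends.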
### Why FlashInfer/Marlin are slow on SM12.1
vLLM’s kernel priority: Marlin > FlashInfer > CUTLASS > Torch. On DGX Spark (SM12.1):
- `MarlinFP8` is rejected by default (capability gate >= 89), unless `VLLM_TEST_FORCE_FP8_MARLIN=1` forces it — which the original recipe did, actively hurting performance
- `FlashInferFP8` is selected next (capability >= 100) — functional but not optimized for SM12.1
- `CutlassFP8` would crash without the `enable_sm120_family` patch (SM12.1 cubins hit `asm("trap")`)
The fix is to disable FlashInfer/Marlin and let the patched CUTLASS kernels handle everything:
```yaml
env:
  VLLM_NVFP4_GEMM_BACKEND: "cutlass"
  VLLM_MARLIN_USE_ATOMIC_ADD: "1"
  VLLM_USE_FLASHINFER_MOE_FP4: "0"
  VLLM_DISABLED_KERNELS: "MarlinFP8ScaledMMLinearKernel,FlashInferFP8ScaledMMLinearKernel"
```
## Routing: Least-busy Starvation Confirmed
Third model, same result. Least-busy at c4+ matches Solo throughput almost exactly — it’s routing almost all traffic to one backend.
| Concurrency | Shuffle decode | Least-busy decode | Gap |
|---|---|---|---|
| c4 | 50.22 t/s | 43.51 t/s | +15% |
| c16 | 96.53 t/s | 70.11 t/s | +38% |
| c64 | 156.02 t/s | 99.14 t/s | +57% |
Never use least-busy for identical backends. This is now confirmed across GPT-OSS-120B, Gemma4, and Nemotron-3-Super.
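One plausible mechanism for the starvation, sketched below as a toy model: if the router's in-flight counters lag behind dispatch (say, refreshed only on completion), a deterministic `min()` tiebreak sends every request in a burst to the same backend. This is an illustrative assumption, not LiteLLM's actual implementation.

```python
import random

def simple_shuffle(n_backends, loads):
    # Load-agnostic: pick a backend uniformly at random.
    return random.randrange(n_backends)

def least_busy(n_backends, loads):
    # Pick the lowest *tracked* load; min() breaks ties by lowest index.
    return min(range(n_backends), key=lambda i: loads[i])

def route_burst(strategy, n_backends=2, n_requests=64):
    # Stale-counter assumption: loads never update during the burst.
    hits = [0] * n_backends
    loads = [0] * n_backends
    for _ in range(n_requests):
        hits[strategy(n_backends, loads)] += 1
    return hits

random.seed(0)
print(route_burst(simple_shuffle))  # roughly even split
print(route_burst(least_busy))      # [64, 0]: all traffic to one backend
```

Shuffle stays balanced with zero load information; least-busy is only as good as the freshness of its counters.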
## When to Use Which Topology
| Use Case | Best Topology | Why |
|---|---|---|
| Single user / low latency (c1-c4) | Cluster TP=2 | +6-40% decode, -7-35% TTFT |
| Claude Code / moderate load (c8-c16) | Cluster TP=2 | +7-12% decode, best TTFT |
| High throughput (c32+) | Cluster TP=2 | Still wins decode (+3%), ties at c64 |
| Prefill-heavy workloads (c4+) | 2x Solo + simple-shuffle | +5-23% prefill vs cluster |
| Lowest TTFT at scale (c32+) | 2x Solo + simple-shuffle | 2-13% better TTFT than cluster |
Cluster TP=2 is the recommended default for Nemotron-3-Super — it’s the only model where cluster wins or ties decode at every concurrency level. The 12B active params create ideal tensor parallelism conditions.
The CUTLASS recipe is mandatory — FlashInfer/Marlin backends are 2x slower on SM12.1.
## Technical Notes
- `GLOO_SOCKET_IFNAME`: required on vLLM v0.17.2+ for cluster mode (`enp1s0f0np0`). Independent of `NCCL_SOCKET_IFNAME` — Gloo and NCCL use separate transports
- Solo c64: OOM — the 69.5 GiB model at gpu_mem 0.70 doesn’t leave enough KV cache for 64 concurrent 1024-token prefills
- Reasoning tokens: Nemotron generates thinking tokens before content, making decode sequences longer and amplifying the benefit of faster per-request decode (cluster advantage)
- SM12.1 patch source: saifgithub/vllm-gb10-sm121 — enables `enable_sm120_family` for CUTLASS FP8 kernels