Three Models, Two Sparks: Cross-Model Benchmark Comparison

8 min read

The Lineup

Three models, all running on the same hardware: 2x NVIDIA DGX Spark (Grace Blackwell GB10, 128 GB unified each), connected via ConnectX-7 200GbE stacking link. Each benchmarked across Solo, Cluster TP=2, and 2x Solo + simple-shuffle topologies using llama-benchy (pp1024 tg128, 50 runs per concurrency level).

| Model | Total Params | Active/Token | Quantization | Memory (Solo) | HuggingFace |
|---|---|---|---|---|---|
| GPT-OSS-120B | 117B | 5.1B | MXFP4 (4-bit) | ~61 GiB | openai/gpt-oss-120b |
| Gemma4-26B | 25.2B | 3.8B | FP8 (8-bit) | 25.67 GiB | google/gemma-4-26B-A4B-it |
| Nemotron-3-Super | 120B | 12B | NVFP4 (4-bit) | 69.5 GiB | nvidia/Nemotron-3-Super-120B-A12B |

All three are Mixture-of-Experts variants — sparse MoE, standard MoE, and LatentMoE respectively. The active parameter count determines real-world speed more than total parameter count.

Why Total Parameters Don’t Predict Speed

The headline number on a model card — “120B parameters” — tells you almost nothing about how fast an MoE model decodes on this hardware. What matters is:

  1. Active parameters per token — how much compute per forward pass
  2. Quantization precision — bits per active parameter
  3. Effective bit-work — active params x bits = total work per token

| Model | Active Params | Bits/Param | Effective Bit-Work | Solo c1 Decode |
|---|---|---|---|---|
| GPT-OSS-120B | 5.1B | 4 (MXFP4) | 20.4 Gbit | 57.5 t/s |
| Gemma4-26B | 3.8B | 8 (FP8) | 30.4 Gbit | 38.68 t/s |
| Nemotron-3-Super | 12B | ~5 (NVFP4 mixed) | ~60 Gbit | 16.29 t/s |

GPT-OSS-120B is the fastest despite having 117B total params — MXFP4 at 4-bit gives the lowest effective compute per token. Nemotron has the most active params (12B) and is 3.5x slower at c1.
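
The bit-work numbers can be sanity-checked in a few lines. This is a back-of-envelope sketch using the figures from the tables in this article; it deliberately ignores attention, KV-cache, and kernel effects, so treat it as an approximation, not a performance model.

```python
# Back-of-envelope check: effective bit-work vs measured c1 decode speed.
# Numbers taken from the benchmark tables in this article.

models = {
    # name: (active_params_B, bits_per_param, measured_c1_tps)
    "GPT-OSS-120B":     (5.1,  4, 57.5),
    "Gemma4-26B":       (3.8,  8, 38.68),
    "Nemotron-3-Super": (12.0, 5, 16.29),
}

for name, (active_b, bits, tps) in models.items():
    bit_work = active_b * bits  # Gbit of weight data touched per token
    print(f"{name}: {bit_work:.1f} Gbit/token, "
          f"throughput x bit-work = {tps * bit_work:,.0f}")
```

If decode were purely weight-bandwidth-bound, throughput times bit-work would be roughly constant across models; the printed products show how closely each model tracks that ideal.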

```mermaid
---
config:
  themeVariables:
    xyChart:
      titleColor: "#333"
      plotColorPalette: "#3b82f6, #22c55e, #ef4444"
---
xychart-beta
  title "Solo Decode Throughput (tokens/s) — All Models"
  x-axis ["c1", "c4", "c8", "c16", "c32", "c64"]
  y-axis "tokens/s" 0 --> 400
  line "GPT-OSS-120B" [57.5, 107.9, 153.5, 218.5, 318.7, 315.2]
  line "Gemma4-26B" [38.68, 100.44, 152.36, 210.95, 281.92, 355.73]
  line "Nemotron-3-Super" [16.29, 43.68, 55.79, 69.08, 83.50, 83.50]
```


GPT-OSS leads at low concurrency but Gemma4 catches up by c8 and overtakes at c32+. Nemotron scales the slowest — its 12B active params per token limit how many concurrent requests the hardware can process.

Cluster TP=2: Who Benefits Most?

Tensor parallelism splits each request across both GPUs. The benefit depends on how much NCCL synchronization traffic the model generates relative to the compute saved.
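
To make that sync cost concrete, here is a toy column-parallel linear layer in NumPy. The two weight shards stand in for the two GPUs, and the concatenate stands in for the all-gather over the 200GbE link. This is an illustrative sketch of the technique, not vLLM's actual implementation.

```python
import numpy as np

# Toy column-parallel linear layer: the weight matrix is split across two
# "devices"; each computes its shard, then an all-gather (here: concatenate)
# reassembles the activation. That gather is the NCCL traffic the text
# refers to: the compute halves, but every layer adds a sync point.

rng = np.random.default_rng(0)
x = rng.standard_normal((1, 1024))        # one token's hidden state
W = rng.standard_normal((1024, 4096))     # full weight matrix

W0, W1 = np.split(W, 2, axis=1)           # column shards: "GPU 0" / "GPU 1"
y0 = x @ W0                               # each device does half the FLOPs
y1 = x @ W1
y_tp = np.concatenate([y0, y1], axis=1)   # sync point (all-gather over 200GbE)

assert np.allclose(y_tp, x @ W)           # identical result, half compute/device
```

The smaller the per-device compute (fewer active parameters), the more the fixed sync cost dominates, which is exactly the trade-off the table below quantifies.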

| Model | Solo c1 (t/s) | Cluster c1 (t/s) | TP=2 Advantage | Why |
|---|---|---|---|---|
| GPT-OSS-120B | 57.5 | 69.0 | +20% | Large total params but tiny active set (5.1B) — modest TP gain |
| Gemma4-26B | 38.68 | 57.93 | +50% | Smallest active params (3.8B) = least NCCL traffic = best TP scaling |
| Nemotron-3-Super | 16.29 | 22.97 | +41% | 12B active, but LatentMoE’s structure benefits from parallelism |

```mermaid
---
config:
  themeVariables:
    xyChart:
      titleColor: "#333"
      plotColorPalette: "#3b82f6, #22c55e, #ef4444"
---
xychart-beta
  title "Cluster TP=2 Decode Throughput (tokens/s) — All Models"
  x-axis ["c1", "c4", "c8", "c16", "c32", "c64"]
  y-axis "tokens/s" 0 --> 500
  line "GPT-OSS-120B" [69.0, 155.3, 231.0, 342.1, 471.7, 471.8]
  line "Gemma4-26B" [57.93, 148.26, 222.59, 304.03, 382.80, 477.77]
  line "Nemotron-3-Super" [22.97, 52.99, 82.77, 103.71, 128.27, 156.40]
```


Gemma4 benefits most from TP=2 at c1 (+50%), but GPT-OSS-120B maintains higher absolute throughput at every concurrency level until c64, where Gemma4 edges ahead (478 vs 472 t/s).

The Crossover: When Does Shuffle Beat Cluster?

Every model shows the same pattern: Cluster TP=2 wins decode at low concurrency, but 2x Solo + simple-shuffle catches up as concurrency rises. The crossover point varies dramatically by model:

| Model | Shuffle Overtakes Decode | Shuffle Overtakes Prefill | TTFT Crossover |
|---|---|---|---|
| GPT-OSS-120B | c64 (+20% vs cluster) | c16 | c8 |
| Gemma4-26B | c64 (+6% vs cluster) | c4 | c8 |
| Nemotron-3-Super | Never (tie at c64) | c4 | c32 |

```mermaid
---
config:
  themeVariables:
    xyChart:
      titleColor: "#333"
      plotColorPalette: "#3b82f6, #22c55e, #ef4444"
---
xychart-beta
  title "Shuffle 2x Decode Throughput (tokens/s) — All Models"
  x-axis ["c1", "c4", "c8", "c16", "c32", "c64"]
  y-axis "tokens/s" 0 --> 600
  line "GPT-OSS-120B" [57.7, 127.9, 177.5, 268.6, 382.1, 567.9]
  line "Gemma4-26B" [39.59, 100.65, 160.54, 245.31, 342.53, 508.09]
  line "Nemotron-3-Super" [16.43, 50.22, 73.81, 96.53, 124.21, 156.02]
```


Nemotron is the only model where Cluster TP=2 never loses decode. Its 12B active params create enough per-request compute that tensor parallelism always outperforms independent routing, even at c64.

GPT-OSS shows the biggest shuffle advantage at c64 (+20% vs cluster) — its tiny 5.1B active set means per-request compute is cheap, so the overhead of NCCL synchronization becomes the bottleneck before the independent-routing overhead of shuffle does.
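
The crossover points come straight from comparing the two decode curves. A hypothetical helper makes the rule explicit; the curve data below is copied from the cluster and shuffle charts in this article.

```python
# Hypothetical helper: find the first concurrency level at which
# 2x Solo + shuffle out-throughputs Cluster TP=2 in decode.

LEVELS = ["c1", "c4", "c8", "c16", "c32", "c64"]

def crossover(cluster, shuffle):
    for level, c, s in zip(LEVELS, cluster, shuffle):
        if s > c:
            return level
    return None  # cluster never loses

# Decode throughput (tokens/s) from the charts in this article.
gpt_cluster = [69.0, 155.3, 231.0, 342.1, 471.7, 471.8]
gpt_shuffle = [57.7, 127.9, 177.5, 268.6, 382.1, 567.9]
nem_cluster = [22.97, 52.99, 82.77, 103.71, 128.27, 156.40]
nem_shuffle = [16.43, 50.22, 73.81, 96.53, 124.21, 156.02]

print(crossover(gpt_cluster, gpt_shuffle))  # GPT-OSS: shuffle wins only at c64
print(crossover(nem_cluster, nem_shuffle))  # Nemotron: None, cluster never loses
```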

Prefill: Shuffle Always Wins at Scale

All three models show the same pattern for prefill throughput: Cluster wins at c1-c2, shuffle overtakes at c4, and the gap widens steadily.

All values are prefill throughput in tokens/s.

| Concurrency | GPT-OSS Shuffle | GPT-OSS Cluster | Gemma4 Shuffle | Gemma4 Cluster | Nemotron Shuffle | Nemotron Cluster |
|---|---|---|---|---|---|---|
| c1 | 2,922 | 4,442 | 4,474 | 6,034 | 1,302 | 2,110 |
| c4 | 5,578 | 6,343 | 7,669 | 7,651 | 2,340 | 2,146 |
| c16 | 10,072 | 8,816 | 10,172 | 9,015 | 2,531 | 2,310 |
| c64 | 12,064 | 3,873 | 12,989 | 9,165 | 2,824 | 2,292 |

At c64, GPT-OSS shuffle achieves +211% vs cluster prefill. The cluster’s prefill collapses under high concurrency because NCCL synchronization serializes what could be independent parallel work.
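
The percentage figures follow directly from the c64 prefill numbers; a quick recomputation (values copied from the prefill data above):

```python
# Relative prefill advantage of shuffle over cluster at c64,
# using the measured tokens/s values from this article.
c64_prefill = {
    "GPT-OSS-120B":     (12_064, 3_873),  # (shuffle, cluster)
    "Gemma4-26B":       (12_989, 9_165),
    "Nemotron-3-Super": (2_824,  2_292),
}

for name, (shuf, clus) in c64_prefill.items():
    print(f"{name}: {100 * (shuf - clus) / clus:+.0f}% shuffle vs cluster")
```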

TTFT: The Latency Picture

Time to First Token matters most for interactive use. Lower is better.

All values are TTFT in ms.

| Concurrency | GPT-OSS Cluster | GPT-OSS Shuffle | Gemma4 Cluster | Gemma4 Shuffle | Nemotron Cluster | Nemotron Shuffle |
|---|---|---|---|---|---|---|
| c1 | 251 | 370 | 181 | 249 | 504 | 814 |
| c8 | 1,020 | 1,005 | 858 | 656 | 2,375 | 2,878 |
| c32 | 3,642 | 2,551 | 2,330 | 1,888 | 7,709 | 7,541 |
| c64 | 8,032 | 4,940 | 4,125 | 3,168 | 14,803 | 12,888 |

```mermaid
xychart-beta
  title "TTFT at c1 — Cluster TP=2 (ms, lower is better)"
  x-axis ["GPT-OSS-120B", "Gemma4-26B", "Nemotron-3-Super"]
  y-axis "ms" 0 --> 550
  bar [251, 181, 504]
```

Gemma4 has the best absolute TTFT: 181ms at c1 (cluster). Its small model memory (25.67 GiB) means less data to move before generating the first token. Nemotron’s 504ms reflects its 69.5 GiB footprint and 12B active compute per prefill token.

Routing Strategy: simple-shuffle Everywhere

Tested on GPT-OSS-120B and Nemotron-3-Super, the result is consistent:

simple-shuffle outperforms least-busy at c8+ on both models tested. Least-busy causes backend starvation: one node gets all the traffic while the other idles. At c64, the gap is 57-68% in decode throughput.

This isn’t model-dependent — it’s a fundamental property of least-busy routing with identical backends. Always use simple-shuffle for 2x Solo mode.
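
A minimal sketch of the failure mode, assuming the simplest possible versions of the two policies (these function names and the tie-breaking behavior are illustrative, not llama-benchy's actual router code): with two identical backends, a least-busy policy resolves every tie the same way, so under bursty arrivals it keeps feeding the same node, while a random shuffle spreads load by construction.

```python
import random

# Two identical backends, as in the 2x Solo topology.
backends = ["spark-0", "spark-1"]
in_flight = {"spark-0": 0, "spark-1": 0}

def simple_shuffle(backends):
    # Random pick: both nodes get ~50% of traffic in expectation.
    return random.choice(backends)

def least_busy(backends, in_flight):
    # min() breaks ties deterministically: first backend in list order.
    return min(backends, key=lambda b: in_flight[b])

# With equal load, least-busy resolves the tie to the same node every time;
# if requests arrive faster than they complete, that node starves the other.
picks = {least_busy(backends, in_flight) for _ in range(100)}
assert picks == {"spark-0"}
```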

Choosing the Right Model

| Priority | Best Model | Why |
|---|---|---|
| Raw speed (tokens/s) | GPT-OSS-120B | Lowest effective bit-work (20.4 Gbit); fastest at nearly every concurrency level (Gemma4 edges ahead only at c64) |
| Lowest latency (TTFT) | Gemma4-26B | Smallest memory footprint; 181 ms TTFT at c1 cluster |
| Best TP=2 scaling | Gemma4-26B | +50% at c1, best cluster efficiency |
| Reasoning tasks | Nemotron-3-Super | LatentMoE with thinking tokens; 12B active gives more reasoning capacity |
| Hardware stability (GF1) | Gemma4-26B | 25.67 GiB — well below the crash threshold for the GF1 hardware issue |
| Max prefill throughput | Gemma4-26B | 12,989 t/s at c64 shuffle |
| Max decode throughput | GPT-OSS-120B | 567.9 t/s at c64 shuffle |

Choosing the Right Topology

| Workload | Topology | Works for all 3 models? |
|---|---|---|
| Single user / interactive (c1-c4) | Cluster TP=2 | Yes — +20-50% decode, best TTFT |
| Claude Code / moderate agents (c8-c16) | Cluster TP=2 | Yes — still best decode for all models |
| High-concurrency decode (c32-c64) | Cluster TP=2 for Nemotron, Shuffle for GPT-OSS/Gemma4 | No — depends on model |
| Prefill-heavy (c4+) | 2x Solo + simple-shuffle | Yes — all models benefit from independent prefill |
| Latency-sensitive at scale (c8+) | 2x Solo + simple-shuffle | Yes — TTFT crossover at c8 for GPT-OSS/Gemma4, c32 for Nemotron |
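
For decode-bound workloads, these recommendations collapse into a small rule. The helper below is a hypothetical encoding of the decode rows only (prefill-heavy workloads should prefer shuffle from c4 regardless of model, per the prefill section):

```python
# Hypothetical decision helper for decode-bound workloads, encoding the
# topology recommendations from this article's benchmarks.

def pick_decode_topology(model: str, concurrency: int) -> str:
    if concurrency <= 16:
        return "cluster-tp2"       # best decode and TTFT for all three models
    if model == "Nemotron-3-Super":
        return "cluster-tp2"       # shuffle never wins decode for Nemotron
    return "2x-solo-shuffle"       # GPT-OSS / Gemma4 at c32-c64

assert pick_decode_topology("GPT-OSS-120B", 4) == "cluster-tp2"
assert pick_decode_topology("GPT-OSS-120B", 64) == "2x-solo-shuffle"
assert pick_decode_topology("Nemotron-3-Super", 64) == "cluster-tp2"
```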

Technical Differences That Matter

| Factor | GPT-OSS-120B | Gemma4-26B | Nemotron-3-Super |
|---|---|---|---|
| vLLM image | vllm-node-mxfp4 | vllm-node-20260405 | vllm-experimental |
| Attention backend | FlashInfer | TRITON_ATTN (forced) | CUTLASS (patched) |
| Kernel patch needed | No | No | Yes (SM12.1 CUTLASS) |
| --no-ray tested | No | Yes (+2-8% gain) | No |
| Solo c64 possible | Yes | Yes | No (OOM) |
| GF1 stable | Depends on recipe | Yes (always) | Depends on recipe |
| GLOO_SOCKET_IFNAME | Not needed (v0.17.1) | Required (v0.19.x) | Required (v0.17.2) |

For detailed per-model results, see the individual benchmark articles: