Three Models, Two Sparks: Cross-Model Benchmark Comparison
The Lineup
Three models, all running on the same hardware: 2x NVIDIA DGX Spark (Grace Blackwell GB10, 128 GB unified memory each), connected via a ConnectX-7 200GbE stacking link. Each was benchmarked across Solo, Cluster TP=2, and 2x Solo + simple-shuffle topologies using llama-benchy (pp1024 tg128, 50 runs per concurrency level).
| Model | Total Params | Active/Token | Quantization | Memory (Solo) | HuggingFace |
|---|---|---|---|---|---|
| GPT-OSS-120B | 117B | 5.1B | MXFP4 (4-bit) | ~61 GiB | openai/gpt-oss-120b |
| Gemma4-26B | 25.2B | 3.8B | FP8 (8-bit) | 25.67 GiB | google/gemma-4-26B-A4B-it |
| Nemotron-3-Super | 120B | 12B | NVFP4 (4-bit) | 69.5 GiB | nvidia/Nemotron-3-Super-120B-A12B |
All three are Mixture-of-Experts variants — sparse MoE, standard MoE, and LatentMoE, respectively. Active parameter count predicts real-world speed far better than total parameter count does.
Why Total Parameters Don’t Predict Speed
The headline number on a model card — “120B parameters” — tells you almost nothing about inference speed on MoE hardware. What matters is:
- Active parameters per token — how much compute per forward pass
- Quantization precision — bits per active parameter
- Effective bit-work — active params x bits = total work per token
| Model | Active Params | Bits/Param | Effective Bit-Work | Solo c1 Decode |
|---|---|---|---|---|
| GPT-OSS-120B | 5.1B | 4 (MXFP4) | 20.4 Gbit | 57.5 t/s |
| Gemma4-26B | 3.8B | 8 (FP8) | 30.4 Gbit | 38.68 t/s |
| Nemotron-3-Super | 12B | ~5 (NVFP4 mixed) | ~60 Gbit | 16.29 t/s |
GPT-OSS-120B is the fastest despite having 117B total params — MXFP4 at 4-bit gives the lowest effective compute per token. Nemotron has the most active params (12B) and is 3.5x slower at c1.
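The bit-work heuristic from the table is simple enough to check in a few lines (a sketch; the 5-bit NVFP4 figure is the table's own approximation):

```python
# Effective bit-work = active parameters (Gparams) x bits per parameter.
# Numbers taken from the tables above; decode is Solo c1 tokens/s.
models = {
    "GPT-OSS-120B":     {"active_gparams": 5.1,  "bits": 4, "decode_tps": 57.50},
    "Gemma4-26B":       {"active_gparams": 3.8,  "bits": 8, "decode_tps": 38.68},
    "Nemotron-3-Super": {"active_gparams": 12.0, "bits": 5, "decode_tps": 16.29},
}

bit_work = {name: m["active_gparams"] * m["bits"] for name, m in models.items()}

for name, work in bit_work.items():
    print(f"{name:18s} {work:5.1f} Gbit/token  ->  {models[name]['decode_tps']:.2f} t/s")
```

Ranking by bit-work matches the measured decode ranking exactly, which is why active params times precision is the better speed predictor.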
(Figure: Solo decode throughput vs. concurrency for GPT-OSS-120B, Gemma4-26B, and Nemotron-3-Super.)
GPT-OSS leads at low concurrency but Gemma4 catches up by c8 and overtakes at c32+. Nemotron scales the slowest — its 12B active params per token limit how many concurrent requests the hardware can process.
Cluster TP=2: Who Benefits Most?
Tensor parallelism splits each request across both GPUs. The benefit depends on how much NCCL synchronization traffic the model generates relative to the compute saved.
| Model | Solo c1 | Cluster c1 | TP=2 Advantage | Why |
|---|---|---|---|---|
| GPT-OSS-120B | 57.5 | 69.0 | +20% | Large total params but tiny active set (5.1B) — modest TP gain |
| Gemma4-26B | 38.68 | 57.93 | +50% | Smallest active params (3.8B) = least NCCL traffic = best TP scaling |
| Nemotron-3-Super | 16.29 | 22.97 | +41% | 12B active but LatentMoE’s structure benefits from parallelism |
(Figure: Cluster TP=2 decode throughput vs. concurrency for GPT-OSS-120B, Gemma4-26B, and Nemotron-3-Super.)
Gemma4 benefits most from TP=2 at c1 (+50%), but GPT-OSS-120B maintains higher absolute throughput at every concurrency level until c64, where Gemma4 edges ahead (478 vs 472 t/s).
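The advantage column is just arithmetic on the two decode columns; a quick check using the c1 values from the table above:

```python
# Solo vs Cluster TP=2 decode throughput at c1 (tokens/s), from the table.
c1_decode = {
    "GPT-OSS-120B":     (57.50, 69.00),
    "Gemma4-26B":       (38.68, 57.93),
    "Nemotron-3-Super": (16.29, 22.97),
}

# Percent gain of Cluster TP=2 over Solo at concurrency 1.
tp2_gain_pct = {name: round(100 * (cluster / solo - 1))
                for name, (solo, cluster) in c1_decode.items()}
print(tp2_gain_pct)
```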
The Crossover: When Does Shuffle Beat Cluster?
Every model shows the same pattern: Cluster TP=2 wins decode at low concurrency, but 2x Solo + simple-shuffle catches up as concurrency rises. The crossover point varies dramatically by model:
| Model | Shuffle Overtakes Decode | Shuffle Overtakes Prefill | TTFT Crossover |
|---|---|---|---|
| GPT-OSS-120B | c64 (+20% vs cluster) | c16 | c8 |
| Gemma4-26B | c64 (+6% vs cluster) | c4 | c8 |
| Nemotron-3-Super | Never (tie at c64) | c4 | c32 |
(Figure: Cluster TP=2 vs. 2x Solo + shuffle decode throughput by concurrency for GPT-OSS-120B, Gemma4-26B, and Nemotron-3-Super.)
Nemotron is the only model where Cluster TP=2 never loses decode. Its 12B active params create enough per-request compute that tensor parallelism always outperforms independent routing, even at c64.
GPT-OSS shows the biggest shuffle advantage at c64 (+20% vs cluster) — its tiny 5.1B active set means per-request compute is cheap, so the overhead of NCCL synchronization becomes the bottleneck before the independent-routing overhead of shuffle does.
Prefill: Shuffle Always Wins at Scale
All three models show the same broad pattern for prefill throughput (all figures below in tokens/s): Cluster wins at c1-c2, shuffle overtakes as concurrency rises (by c4 for Gemma4 and Nemotron, c16 for GPT-OSS), and the gap widens steadily from there.
| Concurrency | GPT-OSS Shuffle | GPT-OSS Cluster | Gemma4 Shuffle | Gemma4 Cluster | Nemotron Shuffle | Nemotron Cluster |
|---|---|---|---|---|---|---|
| c1 | 2,922 | 4,442 | 4,474 | 6,034 | 1,302 | 2,110 |
| c4 | 5,578 | 6,343 | 7,669 | 7,651 | 2,340 | 2,146 |
| c16 | 10,072 | 8,816 | 10,172 | 9,015 | 2,531 | 2,310 |
| c64 | 12,064 | 3,873 | 12,989 | 9,165 | 2,824 | 2,292 |
At c64, GPT-OSS shuffle achieves +211% vs cluster prefill. The cluster’s prefill collapses under high concurrency because NCCL synchronization serializes what could be independent parallel work.
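The collapse is easiest to see as a ratio. A quick calculation over the GPT-OSS columns of the table above:

```python
# GPT-OSS-120B prefill throughput (tokens/s): (shuffle, cluster) per the table.
gpt_oss_prefill = {1: (2922, 4442), 4: (5578, 6343),
                   16: (10072, 8816), 64: (12064, 3873)}

ratio = {c: shuffle / cluster for c, (shuffle, cluster) in gpt_oss_prefill.items()}
for c, r in ratio.items():
    print(f"c{c:<3} shuffle/cluster = {r:.2f}x")
```

At c64 the ratio is roughly 3.1x, i.e. the +211% figure quoted above; below c16 it is under 1.0x, so cluster still wins there.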
TTFT: The Latency Picture
Time to First Token (TTFT) matters most for interactive use. All values below are in milliseconds; lower is better.
| Concurrency | GPT-OSS Cluster | GPT-OSS Shuffle | Gemma4 Cluster | Gemma4 Shuffle | Nemotron Cluster | Nemotron Shuffle |
|---|---|---|---|---|---|---|
| c1 | 251 | 370 | 181 | 249 | 504 | 814 |
| c8 | 1,020 | 1,005 | 858 | 656 | 2,375 | 2,878 |
| c32 | 3,642 | 2,551 | 2,330 | 1,888 | 7,709 | 7,541 |
| c64 | 8,032 | 4,940 | 4,125 | 3,168 | 14,803 | 12,888 |
Gemma4 has the best absolute TTFT: 181ms at c1 (cluster). Its small model memory (25.67 GiB) means less data to move before generating the first token. Nemotron’s 504ms reflects its 69.5 GiB footprint and 12B active compute per prefill token.
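The TTFT crossover points can be read off mechanically from these numbers, as the first measured concurrency where shuffle's latency drops below cluster's:

```python
# TTFT in ms, (cluster, shuffle) per concurrency, from the table above.
ttft_ms = {
    "GPT-OSS-120B":     {1: (251, 370), 8: (1020, 1005), 32: (3642, 2551), 64: (8032, 4940)},
    "Gemma4-26B":       {1: (181, 249), 8: (858, 656),   32: (2330, 1888), 64: (4125, 3168)},
    "Nemotron-3-Super": {1: (504, 814), 8: (2375, 2878), 32: (7709, 7541), 64: (14803, 12888)},
}

# First measured concurrency at which shuffle TTFT < cluster TTFT.
crossover = {name: next(c for c, (cluster, shuffle) in sorted(rows.items())
                        if shuffle < cluster)
             for name, rows in ttft_ms.items()}
print(crossover)
```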
Routing Strategy: simple-shuffle Everywhere
Tested on GPT-OSS-120B and Nemotron-3-Super, the result is consistent:
simple-shuffle outperforms least-busy at c8+ on both models tested. Least-busy causes backend starvation: one node gets all traffic while the other idles. At c64, the gap is 57-68% in decode throughput.
This isn’t model-dependent — it’s a fundamental property of least-busy routing with identical backends. Always use simple-shuffle for 2x Solo mode.
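A toy illustration of the starvation mechanism (hypothetical routers, not llama-benchy's actual code): when a burst of requests is dispatched before in-flight counts update, a deterministic `min()` tie-break sends every request to the same node, while a random shuffle spreads load statistically.

```python
import random

backends = ["node-a", "node-b"]

def simple_shuffle(inflight):
    # Stateless: uniform random choice, ignores reported load entirely.
    return random.choice(backends)

def least_busy(inflight):
    # Deterministic: fewest in-flight requests wins, ties broken by list order.
    return min(backends, key=inflight.get)

# A burst arriving before any in-flight counts are updated: least-busy
# sees (0, 0) every time and always resolves the tie the same way.
stale_counts = {"node-a": 0, "node-b": 0}
burst = [least_busy(stale_counts) for _ in range(8)]
print(burst)  # every request lands on node-a; node-b idles
```

With identical backends there is no load signal worth reacting to, so the stateless random policy is strictly safer.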
Choosing the Right Model
| Priority | Best Model | Why |
|---|---|---|
| Raw speed (tokens/s) | GPT-OSS-120B | Lowest effective bit-work (20.4 Gbit); fastest decode at nearly every concurrency level (Gemma4 edges ahead only at c64 cluster) |
| Lowest latency (TTFT) | Gemma4-26B | Smallest memory footprint, 181ms TTFT at c1 cluster |
| Best TP=2 scaling | Gemma4-26B | +50% at c1, best cluster efficiency |
| Reasoning tasks | Nemotron-3-Super | LatentMoE with thinking tokens, 12B active gives more reasoning capacity |
| Hardware stability (GF1) | Gemma4-26B | 25.67 GiB — well below crash threshold for GF1 hardware issue |
| Max prefill throughput | Gemma4-26B | 12,989 t/s at c64 shuffle |
| Max decode throughput | GPT-OSS-120B | 567.9 t/s at c64 shuffle |
Choosing the Right Topology
| Workload | Topology | Works for all 3 models? |
|---|---|---|
| Single user / interactive (c1-c4) | Cluster TP=2 | Yes — +20-50% decode, best TTFT |
| Claude Code / moderate agents (c8-c16) | Cluster TP=2 | Yes — still best decode for all models |
| High concurrency decode (c32-c64) | Cluster TP=2 for Nemotron, Shuffle for GPT-OSS/Gemma4 | No — depends on model |
| Prefill-heavy (c4+) | 2x Solo + simple-shuffle | Yes — all models benefit from independent prefill |
| Latency-sensitive at scale (c8+) | 2x Solo + simple-shuffle | Yes — TTFT crossover at c8 for GPT-OSS/Gemma4, c32 for Nemotron |
Technical Differences That Matter
| Factor | GPT-OSS-120B | Gemma4-26B | Nemotron-3-Super |
|---|---|---|---|
| vLLM image | vllm-node-mxfp4 | vllm-node-20260405 | vllm-experimental |
| Attention backend | FlashInfer | TRITON_ATTN (forced) | CUTLASS (patched) |
| Kernel patch needed | No | No | Yes (SM12.1 CUTLASS) |
| --no-ray tested | No | Yes (+2-8% gain) | No |
| Solo c64 possible | Yes | Yes | No (OOM) |
| GF1 stable | Depends on recipe | Yes (always) | Depends on recipe |
| GLOO_SOCKET_IFNAME | Not needed (v0.17.1) | Required (v0.19.x) | Required (v0.17.2) |
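Where GLOO_SOCKET_IFNAME is required, it pins Gloo's control-plane sockets to the stacking link so distributed init doesn't pick the wrong NIC. A hedged example — the interface name below is a placeholder, not a known Spark default; check `ip addr` for yours:

```shell
# Pin Gloo control traffic to the ConnectX-7 stacking link.
# "enp1s0f0" is an illustrative interface name only.
export GLOO_SOCKET_IFNAME=enp1s0f0
```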
For detailed per-model results, see the individual benchmark articles: