Three Models, Two Sparks: Cross-Model Benchmark Comparison
The Lineup
Three models, all running on the same hardware: 2x NVIDIA DGX Spark (Grace Blackwell GB10, 128 GB unified memory each), connected via a ConnectX-7 200GbE stacking link. Each was benchmarked across Solo, Cluster TP=2, and 2x Solo + simple-shuffle topologies using llama-benchy (pp1024 tg128, 50 runs per concurrency level).
| Model | Total Params | Active/Token | Quantization | Memory (Solo) | HuggingFace |
|---|---|---|---|---|---|
| GPT-OSS-120B | 117B | 5.1B | MXFP4 (4-bit) | ~61 GiB | openai/gpt-oss-120b |
| Gemma4-26B | 25.2B | 3.8B | FP8 (8-bit) | 25.67 GiB | google/gemma-4-26B-A4B-it |
| Nemotron-3-Super | 120B | 12B | NVFP4 (4-bit) | 69.5 GiB | nvidia/Nemotron-3-Super-120B-A12B |
All three are Mixture-of-Experts variants — sparse MoE, standard MoE, and LatentMoE, respectively. Active parameter count predicts real-world speed far better than total parameter count does.
Why Total Parameters Don’t Predict Speed
The headline number on a model card — “120B parameters” — tells you almost nothing about inference speed on MoE hardware. What matters is:
- Active parameters per token — how much compute per forward pass
- Quantization precision — bits per active parameter
- Effective bit-work — active params x bits = total work per token
| Model | Active Params | Bits/Param | Effective Bit-Work | Solo c1 Decode |
|---|---|---|---|---|
| GPT-OSS-120B | 5.1B | 4 (MXFP4) | 20.4 Gbit | 57.5 t/s |
| Gemma4-26B | 3.8B | 8 (FP8) | 30.4 Gbit | 38.68 t/s |
| Nemotron-3-Super | 12B | ~5 (NVFP4 mixed) | ~60 Gbit | 16.29 t/s |
GPT-OSS-120B is the fastest despite having 117B total params — MXFP4 at 4-bit gives the lowest effective compute per token. Nemotron has the most active params (12B) and is 3.5x slower at c1.
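The bit-work heuristic from the table is simple enough to check in a few lines (a sketch; the 5-bit NVFP4 figure is the table's own approximation):

```python
# Effective bit-work = active parameters (Gparams) x bits per parameter.
# Numbers taken from the tables above; decode is Solo c1 tokens/s.
models = {
    "GPT-OSS-120B":     {"active_gparams": 5.1,  "bits": 4, "decode_tps": 57.50},
    "Gemma4-26B":       {"active_gparams": 3.8,  "bits": 8, "decode_tps": 38.68},
    "Nemotron-3-Super": {"active_gparams": 12.0, "bits": 5, "decode_tps": 16.29},
}

bit_work = {name: m["active_gparams"] * m["bits"] for name, m in models.items()}

for name, work in bit_work.items():
    print(f"{name:18s} {work:5.1f} Gbit/token  ->  {models[name]['decode_tps']:.2f} t/s")
```

Ranking by bit-work matches the measured decode ranking exactly, which is why active params times precision is the better speed predictor.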
(Figure: Solo decode throughput vs. concurrency for GPT-OSS-120B, Gemma4-26B, and Nemotron-3-Super.)
GPT-OSS leads at low concurrency but Gemma4 catches up by c8 and overtakes at c32+. Nemotron scales the slowest — its 12B active params per token limit how many concurrent requests the hardware can process.
Cluster TP=2: Who Benefits Most?
Tensor parallelism splits each request across both GPUs. The benefit depends on how much NCCL synchronization traffic the model generates relative to the compute saved.
| Model | Solo c1 | Cluster c1 | TP=2 Advantage | Why |
|---|---|---|---|---|
| GPT-OSS-120B | 57.5 | 69.0 | +20% | Large total params but tiny active set (5.1B) — modest TP gain |
| Gemma4-26B | 38.68 | 57.93 | +50% | Smallest active params (3.8B) = least NCCL traffic = best TP scaling |
| Nemotron-3-Super | 16.29 | 22.97 | +41% | 12B active but LatentMoE’s structure benefits from parallelism |
(Figure: Cluster TP=2 decode throughput vs. concurrency for GPT-OSS-120B, Gemma4-26B, and Nemotron-3-Super.)
Gemma4 benefits most from TP=2 at c1 (+50%), but GPT-OSS-120B maintains higher absolute throughput at every concurrency level until c64, where Gemma4 edges ahead (478 vs 472 t/s).
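The advantage column is just arithmetic on the two decode columns; a quick check using the c1 values from the table above:

```python
# Solo vs Cluster TP=2 decode throughput at c1 (tokens/s), from the table.
c1_decode = {
    "GPT-OSS-120B":     (57.50, 69.00),
    "Gemma4-26B":       (38.68, 57.93),
    "Nemotron-3-Super": (16.29, 22.97),
}

# Percent gain of Cluster TP=2 over Solo at concurrency 1.
tp2_gain_pct = {name: round(100 * (cluster / solo - 1))
                for name, (solo, cluster) in c1_decode.items()}
print(tp2_gain_pct)
```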
The Crossover: When Does Shuffle Beat Cluster?
Every model shows the same pattern: Cluster TP=2 wins decode at low concurrency, but 2x Solo + simple-shuffle catches up as concurrency rises. The crossover point varies dramatically by model:
| Model | Shuffle Overtakes Decode | Shuffle Overtakes Prefill | TTFT Crossover |
|---|---|---|---|
| GPT-OSS-120B | c64 (+20% vs cluster) | c16 | c8 |
| Gemma4-26B | c64 (+6% vs cluster) | c4 | c8 |
| Nemotron-3-Super | Never (tie at c64) | c4 | c32 |
(Figure: Cluster TP=2 vs. 2x Solo + shuffle decode throughput by concurrency for GPT-OSS-120B, Gemma4-26B, and Nemotron-3-Super.)
Nemotron is the only model where Cluster TP=2 never loses decode. Its 12B active params create enough per-request compute that tensor parallelism always outperforms independent routing, even at c64.
GPT-OSS shows the biggest shuffle advantage at c64 (+20% vs cluster) — its tiny 5.1B active set means per-request compute is cheap, so the overhead of NCCL synchronization becomes the bottleneck before the independent-routing overhead of shuffle does.
Prefill: Shuffle Always Wins at Scale
All three models show the same broad pattern for prefill throughput (all figures below in tokens/s): Cluster wins at c1-c2, shuffle overtakes as concurrency rises (by c4 for Gemma4 and Nemotron, c16 for GPT-OSS), and the gap widens steadily from there.
| Concurrency | GPT-OSS Shuffle | GPT-OSS Cluster | Gemma4 Shuffle | Gemma4 Cluster | Nemotron Shuffle | Nemotron Cluster |
|---|---|---|---|---|---|---|
| c1 | 2,922 | 4,442 | 4,474 | 6,034 | 1,302 | 2,110 |
| c4 | 5,578 | 6,343 | 7,669 | 7,651 | 2,340 | 2,146 |
| c16 | 10,072 | 8,816 | 10,172 | 9,015 | 2,531 | 2,310 |
| c64 | 12,064 | 3,873 | 12,989 | 9,165 | 2,824 | 2,292 |
At c64, GPT-OSS shuffle achieves +211% vs cluster prefill. The cluster’s prefill collapses under high concurrency because NCCL synchronization serializes what could be independent parallel work.
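The collapse is easiest to see as a ratio. A quick calculation over the GPT-OSS columns of the table above:

```python
# GPT-OSS-120B prefill throughput (tokens/s): (shuffle, cluster) per the table.
gpt_oss_prefill = {1: (2922, 4442), 4: (5578, 6343),
                   16: (10072, 8816), 64: (12064, 3873)}

ratio = {c: shuffle / cluster for c, (shuffle, cluster) in gpt_oss_prefill.items()}
for c, r in ratio.items():
    print(f"c{c:<3} shuffle/cluster = {r:.2f}x")
```

At c64 the ratio is roughly 3.1x, i.e. the +211% figure quoted above; below c16 it is under 1.0x, so cluster still wins there.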
TTFT: The Latency Picture
Time to First Token (TTFT) matters most for interactive use. All values below are in milliseconds; lower is better.
| Concurrency | GPT-OSS Cluster | GPT-OSS Shuffle | Gemma4 Cluster | Gemma4 Shuffle | Nemotron Cluster | Nemotron Shuffle |
|---|---|---|---|---|---|---|
| c1 | 251 | 370 | 181 | 249 | 504 | 814 |
| c8 | 1,020 | 1,005 | 858 | 656 | 2,375 | 2,878 |
| c32 | 3,642 | 2,551 | 2,330 | 1,888 | 7,709 | 7,541 |
| c64 | 8,032 | 4,940 | 4,125 | 3,168 | 14,803 | 12,888 |
Gemma4 has the best absolute TTFT: 181ms at c1 (cluster). Its small model memory (25.67 GiB) means less data to move before generating the first token. Nemotron’s 504ms reflects its 69.5 GiB footprint and 12B active compute per prefill token.
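The TTFT crossover points can be read off mechanically from these numbers, as the first measured concurrency where shuffle's latency drops below cluster's:

```python
# TTFT in ms, (cluster, shuffle) per concurrency, from the table above.
ttft_ms = {
    "GPT-OSS-120B":     {1: (251, 370), 8: (1020, 1005), 32: (3642, 2551), 64: (8032, 4940)},
    "Gemma4-26B":       {1: (181, 249), 8: (858, 656),   32: (2330, 1888), 64: (4125, 3168)},
    "Nemotron-3-Super": {1: (504, 814), 8: (2375, 2878), 32: (7709, 7541), 64: (14803, 12888)},
}

# First measured concurrency at which shuffle TTFT < cluster TTFT.
crossover = {name: next(c for c, (cluster, shuffle) in sorted(rows.items())
                        if shuffle < cluster)
             for name, rows in ttft_ms.items()}
print(crossover)
```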
Routing Strategy: simple-shuffle Everywhere
Tested on GPT-OSS-120B and Nemotron-3-Super, the result is consistent:
simple-shuffle outperforms least-busy at c8+ on both models tested. Least-busy causes backend starvation: one node gets all traffic while the other idles. At c64, the gap is 57-68% in decode throughput.
This isn’t model-dependent — it’s a fundamental property of least-busy routing with identical backends. Always use simple-shuffle for 2x Solo mode.
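A toy illustration of the starvation mechanism (hypothetical routers, not llama-benchy's actual code): when a burst of requests is dispatched before in-flight counts update, a deterministic `min()` tie-break sends every request to the same node, while a random shuffle spreads load statistically.

```python
import random

backends = ["node-a", "node-b"]

def simple_shuffle(inflight):
    # Stateless: uniform random choice, ignores reported load entirely.
    return random.choice(backends)

def least_busy(inflight):
    # Deterministic: fewest in-flight requests wins, ties broken by list order.
    return min(backends, key=inflight.get)

# A burst arriving before any in-flight counts are updated: least-busy
# sees (0, 0) every time and always resolves the tie the same way.
stale_counts = {"node-a": 0, "node-b": 0}
burst = [least_busy(stale_counts) for _ in range(8)]
print(burst)  # every request lands on node-a; node-b idles
```

With identical backends there is no load signal worth reacting to, so the stateless random policy is strictly safer.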
Choosing the Right Model
| Priority | Best Model | Why |
|---|---|---|
| Raw speed (tokens/s) | GPT-OSS-120B | Lowest effective bit-work (20.4 Gbit); fastest decode at nearly every concurrency level (Gemma4 edges ahead only at c64 cluster) |
| Lowest latency (TTFT) | Gemma4-26B | Smallest memory footprint, 181ms TTFT at c1 cluster |
| Best TP=2 scaling | Gemma4-26B | +50% at c1, best cluster efficiency |
| Reasoning tasks | Nemotron-3-Super | LatentMoE with thinking tokens, 12B active gives more reasoning capacity |
| Hardware stability (GF1) | Gemma4-26B | 25.67 GiB — well below crash threshold for GF1 hardware issue |
| Max prefill throughput | Gemma4-26B | 12,989 t/s at c64 shuffle |
| Max decode throughput | GPT-OSS-120B | 567.9 t/s at c64 shuffle |
Choosing the Right Topology
| Workload | Topology | Works for all 3 models? |
|---|---|---|
| Single user / interactive (c1-c4) | Cluster TP=2 | Yes — +20-50% decode, best TTFT |
| Claude Code / moderate agents (c8-c16) | Cluster TP=2 | Yes — still best decode for all models |
| High concurrency decode (c32-c64) | Cluster TP=2 for Nemotron, Shuffle for GPT-OSS/Gemma4 | No — depends on model |
| Prefill-heavy (c4+) | 2x Solo + simple-shuffle | Yes — all models benefit from independent prefill |
| Latency-sensitive at scale (c8+) | 2x Solo + simple-shuffle | Yes — TTFT crossover at c8 for GPT-OSS/Gemma4, c32 for Nemotron |
Technical Differences That Matter
| Factor | GPT-OSS-120B | Gemma4-26B | Nemotron-3-Super |
|---|---|---|---|
| vLLM image | vllm-node-mxfp4 | vllm-node-20260405 | vllm-experimental |
| Attention backend | FlashInfer | TRITON_ATTN (forced) | CUTLASS (patched) |
| Kernel patch needed | No | No | Yes (SM12.1 CUTLASS) |
| --no-ray tested | No | Yes (+2-8% gain) | No |
| Solo c64 possible | Yes | Yes | No (OOM) |
| GF1 stable | Depends on recipe | Yes (always) | Depends on recipe |
| GLOO_SOCKET_IFNAME | Not needed (v0.17.1) | Required (v0.19.x) | Required (v0.17.2) |
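Where GLOO_SOCKET_IFNAME is required, it pins Gloo's control-plane sockets to the stacking link so distributed init doesn't pick the wrong NIC. A hedged example — the interface name below is a placeholder, not a known Spark default; check `ip addr` for yours:

```shell
# Pin Gloo control traffic to the ConnectX-7 stacking link.
# "enp1s0f0" is an illustrative interface name only.
export GLOO_SOCKET_IFNAME=enp1s0f0
```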
For detailed per-model results, see the individual benchmark articles: