GPT-OSS-120B: Topology Benchmark on DGX Spark


The Model

openai/gpt-oss-120b — 117B total parameters, 5.1B active per token. Sparse Mixture-of-Experts with native MXFP4 quantization, designed for Blackwell’s MX-format tensor cores. At 4-bit precision, the full model fits in a single DGX Spark’s 128 GB unified memory with room for KV cache.

This is the model the DGX Spark community rallied around first. Every topology comparison, every routing strategy, every vLLM optimization — GPT-OSS-120B was the proving ground.

Test Setup

  • Hardware: 2x NVIDIA DGX Spark (Grace Blackwell GB10, 128 GB unified each)
  • Interconnect: ConnectX-7 dual 200GbE QSFP stacking link
  • Image: vllm-node-mxfp4 (0.1.dev12777)
  • Recipe: openai-gpt-oss-120b.yaml (gpu_memory_utilization=0.70, max_model_len=4096)
  • Flags: --enforce-eager --mxfp4-layers moe,qkv,o,lm_head --kv-cache-dtype fp8
  • Benchmark: llama-benchy 0.3.5, pp1024 tg128, 50 runs per concurrency level
  • Date: 2026-03-22
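For reference, a per-node launch assembled from the recipe values and flags above would look roughly like this. This is a sketch: `vllm serve` is assumed as the entry point inside the vllm-node-mxfp4 image, and only the flag values come from the recipe and flags listed above.

```shell
# Hypothetical launch; values taken from openai-gpt-oss-120b.yaml and the flags above.
vllm serve openai/gpt-oss-120b \
  --gpu-memory-utilization 0.70 \
  --max-model-len 4096 \
  --enforce-eager \
  --mxfp4-layers moe,qkv,o,lm_head \
  --kv-cache-dtype fp8
# Cluster TP=2 presumably adds --tensor-parallel-size 2 across the 200GbE
# stacking link (assumption; the multi-node wiring is not shown here).
```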

Topologies Tested

| Topology | Description |
|---|---|
| Solo | Single DGX Spark, TP=1 |
| Cluster TP=2 | Both nodes via NCCL all-reduce, single API endpoint |
| 2x Solo + simple-shuffle | Both nodes independent, LiteLLM simple-shuffle |
| 2x Solo + least-busy | Both nodes independent, LiteLLM routes to least loaded |

Decode Throughput

Generation speed (tg128) — higher is better.

| Concurrency | Solo | Cluster TP=2 | Shuffle | Least-busy | Winner |
|---|---|---|---|---|---|
| c1 | 57.5 | 69.0 | 57.7 | 57.5 | Cluster (+20%) |
| c2 | 78.4 | 104.1 | 99.4 | 87.7 | Cluster (+33%) |
| c4 | 107.9 | 155.3 | 127.9 | 122.3 | Cluster (+44%) |
| c8 | 153.5 | 231.0 | 177.5 | 162.5 | Cluster (+50%) |
| c16 | 218.5 | 342.1 | 268.6 | 238.5 | Cluster (+57%) |
| c32 | 318.7 | 471.7 | 382.1 | 333.2 | Cluster (+48%) |
| c64 | 315.2 | 471.8 | 567.9 | 338.4 | Shuffle (+20%) |
```mermaid
---
config:
  themeVariables:
    xyChart:
      titleColor: "#333"
      plotColorPalette: "#6b7280, #3b82f6, #22c55e, #ef4444"
---
xychart-beta
    title "Decode Throughput (tokens/s) — GPT-OSS-120B"
    x-axis ["c1", "c2", "c4", "c8", "c16", "c32", "c64"]
    y-axis "tokens/s" 0 --> 600
    line "Solo" [57.5, 78.4, 107.9, 153.5, 218.5, 318.7, 315.2]
    line "Cluster TP=2" [69.0, 104.1, 155.3, 231.0, 342.1, 471.7, 471.8]
    line "Shuffle" [57.7, 99.4, 127.9, 177.5, 268.6, 382.1, 567.9]
    line "Least-busy" [57.5, 87.7, 122.3, 162.5, 238.5, 333.2, 338.4]
```

Legend: Solo (gray) · Cluster TP=2 (blue) · Shuffle (green) · Least-busy (red)

Cluster TP=2 dominates from c1 through c32, with both GPUs working on each request via NCCL all-reduce. At c64 the synchronization overhead catches up, and simple-shuffle overtakes it by 20%.

Note that Solo plateaus at c32 (319 t/s) and dips slightly at c64 (315 t/s), while the multi-node topologies keep scaling. Least-busy also flatlines, as backend starvation kicks in.

Prefill Throughput

Prompt processing speed (pp1024) — higher is better.

| Concurrency | Solo | Cluster TP=2 | Shuffle | Least-busy | Winner |
|---|---|---|---|---|---|
| c1 | 2,926 | 4,442 | 2,922 | 3,074 | Cluster (+52%) |
| c2 | 3,684 | 5,428 | 4,669 | 4,074 | Cluster (+47%) |
| c4 | 4,540 | 6,343 | 5,578 | 5,299 | Cluster (+40%) |
| c8 | 5,905 | 7,965 | 7,483 | 6,579 | Cluster (+35%) |
| c16 | 6,858 | 8,816 | 10,072 | 8,886 | Shuffle (+14%) |
| c32 | 6,827 | 8,966 | 11,741 | 9,753 | Shuffle (+31%) |
| c64 | 2,753 | 3,873 | 12,064 | 4,982 | Shuffle (+211%) |
```mermaid
---
config:
  themeVariables:
    xyChart:
      titleColor: "#333"
      plotColorPalette: "#6b7280, #3b82f6, #22c55e, #ef4444"
---
xychart-beta
    title "Prefill Throughput (tokens/s) — GPT-OSS-120B"
    x-axis ["c1", "c2", "c4", "c8", "c16", "c32", "c64"]
    y-axis "tokens/s" 0 --> 13000
    line "Solo" [2926, 3684, 4540, 5905, 6858, 6827, 2753]
    line "Cluster TP=2" [4442, 5428, 6343, 7965, 8816, 8966, 3873]
    line "Shuffle" [2922, 4669, 5578, 7483, 10072, 11741, 12064]
    line "Least-busy" [3074, 4074, 5299, 6579, 8886, 9753, 4982]
```

Legend: Solo (gray) · Cluster TP=2 (blue) · Shuffle (green) · Least-busy (red)

The crossover is dramatic. Cluster wins at low concurrency, but from c16 onward shuffle pulls away because each node independently processes different prefills. At c64, Solo and Cluster both collapse under memory pressure while shuffle keeps climbing: each node only sees ~32 concurrent requests.

Time to First Token (TTFT)

End-to-end latency to first generated token (ms) — lower is better.

| Concurrency | Solo | Cluster TP=2 | Shuffle | Least-busy | Winner |
|---|---|---|---|---|---|
| c1 | 369 | 251 | 370 | 364 | Cluster (-32%) |
| c2 | 570 | 381 | 459 | 526 | Cluster (-33%) |
| c4 | 898 | 640 | 675 | 735 | Cluster (-29%) |
| c8 | 1,385 | 1,020 | 1,005 | 1,232 | Shuffle (-2%) |
| c16 | 2,384 | 1,855 | 1,495 | 1,690 | Shuffle (-19%) |
| c32 | 4,768 | 3,642 | 2,551 | 3,303 | Shuffle (-30%) |
| c64 | 10,874 | 8,032 | 4,940 | 7,360 | Shuffle (-39%) |
```mermaid
---
config:
  themeVariables:
    xyChart:
      titleColor: "#333"
      plotColorPalette: "#6b7280, #3b82f6, #22c55e, #ef4444"
---
xychart-beta
    title "TTFT (ms, lower is better) — GPT-OSS-120B"
    x-axis ["c1", "c2", "c4", "c8", "c16", "c32", "c64"]
    y-axis "ms" 0 --> 11000
    line "Solo" [369, 570, 898, 1385, 2384, 4768, 10874]
    line "Cluster TP=2" [251, 381, 640, 1020, 1855, 3642, 8032]
    line "Shuffle" [370, 459, 675, 1005, 1495, 2551, 4940]
    line "Least-busy" [364, 526, 735, 1232, 1690, 3303, 7360]
```

Legend: Solo (gray) · Cluster TP=2 (blue) · Shuffle (green) · Least-busy (red)

TTFT crossover happens at c8. Below that, Cluster’s faster per-request prefill wins. Above c8, shuffle’s independent processing wins because each node starts its prefill without waiting for the other.

Routing Strategy: simple-shuffle vs least-busy

This benchmark included a controlled A/B test: same model, same hardware, same day; only the LiteLLM routing strategy changed.

simple-shuffle wins across the board at c8+. At c64, the gap is enormous:

| Metric at c64 | simple-shuffle | least-busy | Gap |
|---|---|---|---|
| Decode | 567.9 t/s | 338.4 t/s | +68% |
| Prefill | 12,064 t/s | 4,982 t/s | +142% |
| TTFT | 4,940 ms | 7,360 ms | -33% |

Why least-busy fails

least-busy tracks in-flight requests and routes each new request to the least loaded backend. With identical backends and similar request durations, this creates oscillation: Backend A finishes a batch and drops to zero in-flight, all new requests rush to A, Backend B starves, and the cycle repeats.
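The failure mode is easy to reproduce with a toy model. This is hypothetical code, not LiteLLM's implementation: a router that sees a stale snapshot of in-flight counts during a burst of simultaneous arrivals sends the entire burst to whichever backend just drained.

```python
import random

def route_burst(in_flight, strategy, k, rng):
    """Assign k simultaneous requests given a snapshot of in-flight counts."""
    # The snapshot stays stale for the whole burst: dispatches don't update
    # it until the next cycle. That staleness is the failure mode.
    snapshot = dict(in_flight)
    assigned = {backend: 0 for backend in in_flight}
    for _ in range(k):
        if strategy == "least-busy":
            backend = min(snapshot, key=snapshot.get)  # always the drained one
        else:  # simple-shuffle: uniform random pick
            backend = rng.choice(sorted(snapshot))
        assigned[backend] += 1
    return assigned

rng = random.Random(0)
# Backend A just finished its batch (0 in flight); B is mid-batch with 8.
state = {"A": 0, "B": 8}
print("least-busy:", route_burst(state, "least-busy", 8, rng))
print("shuffle:   ", route_burst(state, "simple-shuffle", 8, rng))
```

Under least-busy the whole burst lands on A (and the next burst on B, once B drains), while simple-shuffle spreads each burst by coin flips. With identical backends that never desynchronize, least-busy never escapes the cycle.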

Run-to-run variance confirms it:

| Concurrency | Shuffle stdev | Least-busy stdev | Ratio |
|---|---|---|---|
| c32 | 33.73 | 57.09 | 1.7x |
| c64 | 48.02 | 64.11 | 1.3x |

Never use least-busy for identical backends.
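In LiteLLM proxy terms, the winning setup is both vLLM endpoints behind one model name with simple-shuffle routing. A sketch (the IPs and port are placeholders; per the Technical Notes, use raw IPs rather than mDNS hostnames):

```yaml
# Hypothetical LiteLLM proxy config for the 2x Solo + simple-shuffle topology.
# 10.0.0.x:8000 are placeholder addresses; api_key is a dummy for local vLLM.
model_list:
  - model_name: gpt-oss-120b
    litellm_params:
      model: openai/gpt-oss-120b
      api_base: http://10.0.0.1:8000/v1
      api_key: none
  - model_name: gpt-oss-120b
    litellm_params:
      model: openai/gpt-oss-120b
      api_base: http://10.0.0.2:8000/v1
      api_key: none

router_settings:
  routing_strategy: simple-shuffle
```

Both entries share the model_name, so the router load-balances between them; simple-shuffle is also LiteLLM's default strategy.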

When to Use Which Topology

| Use Case | Best Topology | Why |
|---|---|---|
| Single user / low latency (c1-c4) | Cluster TP=2 | +20-44% decode, -29-33% TTFT |
| Claude Code / moderate load (c8-c16) | Cluster TP=2 | +30-57% decode, competitive TTFT |
| High throughput / many agents (c32-c64) | 2x Solo + simple-shuffle | +20% decode at c64, +211% prefill |
| Any 2x Solo configuration | simple-shuffle routing | 15-68% better than least-busy at c32+ |

Technical Notes

  • Always use IP addresses — mDNS hostname resolution causes hard crashes at c64
  • LiteLLM Docker DNS: Remote backends need IPs (mDNS doesn’t resolve in containers)
  • Thermal: GPUs reach 84-90C under sustained c64 load. T.Limit throttle at 53C is normal operation
  • Solo recipe is safer for GF1: gpu_mem 0.70 with 4K context stays within thermal margin. The claude recipe (0.75, 131K) has crashed GF1 under load