GPT-OSS-120B: Topology Benchmark on DGX Spark
The Model
openai/gpt-oss-120b — 117B total parameters, 5.1B active per token. Sparse Mixture-of-Experts with native MXFP4 quantization, designed for Blackwell’s MX-format tensor cores. At 4-bit precision, the full model fits in a single DGX Spark’s 128 GB unified memory with room for KV cache.
This is the model the DGX Spark community rallied around first. Every topology comparison, every routing strategy, every vLLM optimization — GPT-OSS-120B was the proving ground.
Test Setup
- Hardware: 2x NVIDIA DGX Spark (Grace Blackwell GB10, 128 GB unified each)
- Interconnect: ConnectX-7 dual 200GbE QSFP stacking link
- Image: vllm-node-mxfp4 (0.1.dev12777)
- Recipe: openai-gpt-oss-120b.yaml (gpu_memory_utilization=0.70, max_model_len=4096)
- Flags: `--enforce-eager --mxfp4-layers moe,qkv,o,lm_head --kv-cache-dtype fp8`
- Benchmark: llama-benchy 0.3.5, pp1024 tg128, 50 runs per concurrency level
- Date: 2026-03-22
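For reference, the settings above roughly correspond to a `vllm serve` invocation like the following. This is a sketch, not the actual recipe contents: the `--mxfp4-layers` flag comes from the custom vllm-node-mxfp4 image, and the exact mapping of recipe keys to CLI flags is assumed.

```shell
# Sketch only -- assumes the recipe maps 1:1 onto vLLM CLI flags.
# --tensor-parallel-size 2 applies to the Cluster TP=2 topology; Solo uses TP=1.
vllm serve openai/gpt-oss-120b \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.70 \
  --max-model-len 4096 \
  --kv-cache-dtype fp8 \
  --enforce-eager \
  --mxfp4-layers moe,qkv,o,lm_head
```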
Topologies Tested
| Topology | Description |
|---|---|
| Solo | Single DGX Spark, TP=1 |
| Cluster TP=2 | Both nodes via NCCL all-reduce, single API endpoint |
| 2x Solo + simple-shuffle | Both nodes independent, LiteLLM simple-shuffle |
| 2x Solo + least-busy | Both nodes independent, LiteLLM routes to least loaded |
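The two 2x Solo topologies differ only in the LiteLLM router configuration. A minimal sketch of such a proxy config (IPs and ports are placeholders; per the technical notes at the end, remote backends need literal IPs because mDNS does not resolve inside containers):

```yaml
# Hypothetical LiteLLM proxy config for the 2x Solo topologies.
# Both backends are identical vLLM instances on separate DGX Sparks.
model_list:
  - model_name: gpt-oss-120b
    litellm_params:
      model: hosted_vllm/openai/gpt-oss-120b
      api_base: http://192.168.100.1:8000/v1   # placeholder IP, not mDNS hostname
  - model_name: gpt-oss-120b
    litellm_params:
      model: hosted_vllm/openai/gpt-oss-120b
      api_base: http://192.168.100.2:8000/v1

router_settings:
  routing_strategy: simple-shuffle   # swap to "least-busy" for the A/B comparison
```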
Decode Throughput
Generation speed (tg128), tokens/s — higher is better.
| Concurrency | Solo | Cluster TP=2 | Shuffle | Least-busy | Winner |
|---|---|---|---|---|---|
| c1 | 57.5 | 69.0 | 57.7 | 57.5 | Cluster (+20%) |
| c2 | 78.4 | 104.1 | 99.4 | 87.7 | Cluster (+33%) |
| c4 | 107.9 | 155.3 | 127.9 | 122.3 | Cluster (+44%) |
| c8 | 153.5 | 231.0 | 177.5 | 162.5 | Cluster (+50%) |
| c16 | 218.5 | 342.1 | 268.6 | 238.5 | Cluster (+57%) |
| c32 | 318.7 | 471.7 | 382.1 | 333.2 | Cluster (+48%) |
| c64 | 315.2 | 471.8 | 567.9 | 338.4 | Shuffle (+20%) |
Cluster TP=2 dominates c1 through c32 — both GPUs working on each request via NCCL all-reduce. At c64, the synchronization overhead catches up and simple-shuffle overtakes by 20%.
Note that Solo plateaus at c32 (315 t/s) while the multi-node topologies keep scaling. Least-busy also flatlines — backend starvation kicks in.
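The scaling pattern is easier to see as a speedup ratio. A quick check against the decode table above (values copied verbatim):

```python
# Decode throughput (t/s) from the table above.
solo    = {1: 57.5, 2: 78.4, 4: 107.9, 8: 153.5, 16: 218.5, 32: 318.7, 64: 315.2}
cluster = {1: 69.0, 2: 104.1, 4: 155.3, 8: 231.0, 16: 342.1, 32: 471.7, 64: 471.8}

# TP=2 speedup over Solo at each concurrency level.
speedup = {c: round(cluster[c] / solo[c], 2) for c in solo}
print(speedup)
# Peaks at c16 (1.57x) and never reaches the ideal 2.0x --
# NCCL all-reduce overhead eats the rest.
```

The speedup tops out at 1.57x, which is why the aggregate-throughput topologies take over once concurrency is high enough to keep both nodes independently saturated.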
Prefill Throughput
Prompt processing speed (pp1024), tokens/s — higher is better.
| Concurrency | Solo | Cluster TP=2 | Shuffle | Least-busy | Winner |
|---|---|---|---|---|---|
| c1 | 2,926 | 4,442 | 2,922 | 3,074 | Cluster (+52%) |
| c2 | 3,684 | 5,428 | 4,669 | 4,074 | Cluster (+47%) |
| c4 | 4,540 | 6,343 | 5,578 | 5,299 | Cluster (+40%) |
| c8 | 5,905 | 7,965 | 7,483 | 6,579 | Cluster (+35%) |
| c16 | 6,858 | 8,816 | 10,072 | 8,886 | Shuffle (+14%) |
| c32 | 6,827 | 8,966 | 11,741 | 9,753 | Shuffle (+31%) |
| c64 | 2,753 | 3,873 | 12,064 | 4,982 | Shuffle (+211%) |
The crossover is dramatic. Cluster wins at low concurrency, but at c16+ shuffle pulls away because each node independently processes different prefills. At c64, Solo and Cluster both collapse (memory pressure) while shuffle keeps climbing — each node only sees ~32 concurrent requests.
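That explanation can be sanity-checked against the tables: if shuffle at c64 behaves like two independent Solo nodes at c32 each, its prefill throughput should approach twice Solo's c32 figure.

```python
# Prefill throughput (t/s) from the tables above.
solo_c32_prefill = 6827      # Solo at c32
shuffle_c64_prefill = 12064  # 2x Solo + simple-shuffle at c64

predicted = 2 * solo_c32_prefill        # ideal: two fully independent nodes
efficiency = shuffle_c64_prefill / predicted
print(f"{efficiency:.0%}")              # -> 88% of the two-node ideal
```

~88% of the ideal, consistent with two independent nodes plus some router overhead.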
Time to First Token (TTFT)
End-to-end latency to first generated token (ms) — lower is better.
| Concurrency | Solo | Cluster TP=2 | Shuffle | Least-busy | Winner |
|---|---|---|---|---|---|
| c1 | 369 | 251 | 370 | 364 | Cluster (-32%) |
| c2 | 570 | 381 | 459 | 526 | Cluster (-33%) |
| c4 | 898 | 640 | 675 | 735 | Cluster (-29%) |
| c8 | 1,385 | 1,020 | 1,005 | 1,232 | Shuffle (-2%) |
| c16 | 2,384 | 1,855 | 1,495 | 1,690 | Shuffle (-19%) |
| c32 | 4,768 | 3,642 | 2,551 | 3,303 | Shuffle (-30%) |
| c64 | 10,874 | 8,032 | 4,940 | 7,360 | Shuffle (-39%) |
TTFT crossover happens at c8. Below that, Cluster’s faster per-request prefill wins. Above c8, shuffle’s independent processing wins because each node starts its prefill without waiting for the other.
Routing Strategy: simple-shuffle vs least-busy
This benchmark included a controlled A/B test — same model, same day, only the LiteLLM routing strategy changed.
simple-shuffle wins across the board at c8+. At c64, the gap is enormous:
| Metric at c64 | simple-shuffle | least-busy | Gap |
|---|---|---|---|
| Decode | 567.9 t/s | 338.4 t/s | +68% |
| Prefill | 12,064 t/s | 4,982 t/s | +142% |
| TTFT | 4,940 ms | 7,360 ms | -33% |
Why least-busy fails
least-busy tracks in-flight requests and routes to the least loaded backend. With identical backends and similar request durations, this creates oscillation — Backend A finishes a batch, drops to zero in-flight, ALL new requests rush to A, Backend B starves, cycle repeats.
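The rush is easy to reproduce with a toy model of the router's view. Below, backend A has just drained while B still has six requests in flight; the next ten arrivals are routed with a least-busy rule (a simplified stand-in for LiteLLM's strategy, with ties broken by lowest index):

```python
def route_least_busy(in_flight):
    # Pick the backend with the fewest in-flight requests (ties -> lowest index).
    return min(range(len(in_flight)), key=lambda i: in_flight[i])

# Snapshot right after backend A (index 0) finishes a batch while
# backend B (index 1) is still mid-batch with 6 requests in flight.
in_flight = [0, 6]
assignments = []
for _ in range(10):  # ten new requests arrive before B's batch completes
    i = route_least_busy(in_flight)
    in_flight[i] += 1
    assignments.append("AB"[i])

print("".join(assignments))  # -> AAAAAAABAB: eight of ten pile onto A, B starves
```

When B's batch finally completes and its count drops to zero, the pattern mirrors, so the pair oscillates instead of balancing.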
Run-to-run variance (across the 50 runs per concurrency level) confirms the oscillation:
| Concurrency | Shuffle stdev | Least-busy stdev | Ratio |
|---|---|---|---|
| c32 | 33.73 | 57.09 | 1.7x |
| c64 | 48.02 | 64.11 | 1.3x |
Never use least-busy for identical backends.
When to Use Which Topology
| Use Case | Best Topology | Why |
|---|---|---|
| Single user / low latency (c1-c4) | Cluster TP=2 | +20-44% decode, -29-33% TTFT |
| Claude Code / moderate load (c8-c16) | Cluster TP=2 | +50-57% decode vs Solo, competitive TTFT |
| High throughput / many agents (c32-c64) | 2x Solo + simple-shuffle | +20% decode at c64, +211% prefill |
| Any 2x Solo configuration | simple-shuffle routing | 15-68% better than least-busy at c32+ |
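The decision logic in the table can be collapsed into a few lines. This helper is purely illustrative: the names and the c16 threshold are ours, taken from the observed crossover between c16 and c32.

```python
def recommend_topology(expected_concurrency: int) -> str:
    """Illustrative helper encoding the table above; the threshold is ours,
    derived from the crossover observed between c16 and c32."""
    if expected_concurrency <= 16:
        return "cluster-tp2"            # best decode and TTFT up to c16
    return "2x-solo-simple-shuffle"     # aggregate throughput wins beyond

print(recommend_topology(4), recommend_topology(64))
```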
Technical Notes
- Always use IP addresses — mDNS hostname resolution causes hard crashes at c64
- LiteLLM Docker DNS: Remote backends need IPs (mDNS doesn’t resolve in containers)
- Thermal: GPUs reach 84-90C under sustained c64 load. T.Limit throttle at 53C is normal operation
- Solo recipe is safer for GF1: gpu_mem 0.70 with 4K context stays within thermal margin. The claude recipe (0.75, 131K) has crashed GF1 under load