GPT-OSS-120B: Topology Benchmark on DGX Spark
The Model
openai/gpt-oss-120b — 117B total parameters, 5.1B active per token. Sparse Mixture-of-Experts with native MXFP4 quantization, designed for Blackwell’s MX-format tensor cores. At 4-bit precision, the full model fits in a single DGX Spark’s 128 GB unified memory with room for KV cache.
This is the model the DGX Spark community rallied around first. Every topology comparison, every routing strategy, every vLLM optimization — GPT-OSS-120B was the proving ground.
Test Setup
- Hardware: 2x NVIDIA DGX Spark (Grace Blackwell GB10, 128 GB unified each)
- Interconnect: ConnectX-7 dual 200GbE QSFP stacking link
- Image: vllm-node-mxfp4 (0.1.dev12777)
- Recipe: openai-gpt-oss-120b.yaml (gpu_memory_utilization=0.70, max_model_len=4096)
- Flags: `--enforce-eager --mxfp4-layers moe,qkv,o,lm_head --kv-cache-dtype fp8`
- Benchmark: llama-benchy 0.3.5, pp1024 tg128, 50 runs per concurrency level
- Date: 2026-03-22
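For reference, the settings above roughly correspond to a `vllm serve` invocation like the following. This is a sketch, not the actual recipe contents: the `--mxfp4-layers` flag comes from the custom vllm-node-mxfp4 image, and the exact mapping of recipe keys to CLI flags is assumed.

```shell
# Sketch only -- assumes the recipe maps 1:1 onto vLLM CLI flags.
# --tensor-parallel-size 2 applies to the Cluster TP=2 topology; Solo uses TP=1.
vllm serve openai/gpt-oss-120b \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.70 \
  --max-model-len 4096 \
  --kv-cache-dtype fp8 \
  --enforce-eager \
  --mxfp4-layers moe,qkv,o,lm_head
```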
Topologies Tested
| Topology | Description |
|---|---|
| Solo | Single DGX Spark, TP=1 |
| Cluster TP=2 | Both nodes via NCCL all-reduce, single API endpoint |
| 2x Solo + simple-shuffle | Both nodes independent, LiteLLM simple-shuffle |
| 2x Solo + least-busy | Both nodes independent, LiteLLM routes to least loaded |
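The two 2x Solo topologies differ only in the LiteLLM router configuration. A minimal sketch of such a proxy config (IPs and ports are placeholders; per the technical notes at the end, remote backends need literal IPs because mDNS does not resolve inside containers):

```yaml
# Hypothetical LiteLLM proxy config for the 2x Solo topologies.
# Both backends are identical vLLM instances on separate DGX Sparks.
model_list:
  - model_name: gpt-oss-120b
    litellm_params:
      model: hosted_vllm/openai/gpt-oss-120b
      api_base: http://192.168.100.1:8000/v1   # placeholder IP, not mDNS hostname
  - model_name: gpt-oss-120b
    litellm_params:
      model: hosted_vllm/openai/gpt-oss-120b
      api_base: http://192.168.100.2:8000/v1

router_settings:
  routing_strategy: simple-shuffle   # swap to "least-busy" for the A/B comparison
```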
Decode Throughput
Generation speed (tg128), tokens/s — higher is better.
| Concurrency | Solo | Cluster TP=2 | Shuffle | Least-busy | Winner |
|---|---|---|---|---|---|
| c1 | 57.5 | 69.0 | 57.7 | 57.5 | Cluster (+20%) |
| c2 | 78.4 | 104.1 | 99.4 | 87.7 | Cluster (+33%) |
| c4 | 107.9 | 155.3 | 127.9 | 122.3 | Cluster (+44%) |
| c8 | 153.5 | 231.0 | 177.5 | 162.5 | Cluster (+50%) |
| c16 | 218.5 | 342.1 | 268.6 | 238.5 | Cluster (+57%) |
| c32 | 318.7 | 471.7 | 382.1 | 333.2 | Cluster (+48%) |
| c64 | 315.2 | 471.8 | 567.9 | 338.4 | Shuffle (+20%) |
Cluster TP=2 dominates c1 through c32 — both GPUs working on each request via NCCL all-reduce. At c64, the synchronization overhead catches up and simple-shuffle overtakes by 20%.
Note that Solo plateaus at c32 (315 t/s) while the multi-node topologies keep scaling. Least-busy also flatlines — backend starvation kicks in.
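The scaling pattern is easier to see as a speedup ratio. A quick check against the decode table above (values copied verbatim):

```python
# Decode throughput (t/s) from the table above.
solo    = {1: 57.5, 2: 78.4, 4: 107.9, 8: 153.5, 16: 218.5, 32: 318.7, 64: 315.2}
cluster = {1: 69.0, 2: 104.1, 4: 155.3, 8: 231.0, 16: 342.1, 32: 471.7, 64: 471.8}

# TP=2 speedup over Solo at each concurrency level.
speedup = {c: round(cluster[c] / solo[c], 2) for c in solo}
print(speedup)
# Peaks at c16 (1.57x) and never reaches the ideal 2.0x --
# NCCL all-reduce overhead eats the rest.
```

The speedup tops out at 1.57x, which is why the aggregate-throughput topologies take over once concurrency is high enough to keep both nodes independently saturated.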
Prefill Throughput
Prompt processing speed (pp1024), tokens/s — higher is better.
| Concurrency | Solo | Cluster TP=2 | Shuffle | Least-busy | Winner |
|---|---|---|---|---|---|
| c1 | 2,926 | 4,442 | 2,922 | 3,074 | Cluster (+52%) |
| c2 | 3,684 | 5,428 | 4,669 | 4,074 | Cluster (+47%) |
| c4 | 4,540 | 6,343 | 5,578 | 5,299 | Cluster (+40%) |
| c8 | 5,905 | 7,965 | 7,483 | 6,579 | Cluster (+35%) |
| c16 | 6,858 | 8,816 | 10,072 | 8,886 | Shuffle (+14%) |
| c32 | 6,827 | 8,966 | 11,741 | 9,753 | Shuffle (+31%) |
| c64 | 2,753 | 3,873 | 12,064 | 4,982 | Shuffle (+211%) |
The crossover is dramatic. Cluster wins at low concurrency, but at c16+ shuffle pulls away because each node independently processes different prefills. At c64, Solo and Cluster both collapse (memory pressure) while shuffle keeps climbing — each node only sees ~32 concurrent requests.
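That explanation can be sanity-checked against the tables: if shuffle at c64 behaves like two independent Solo nodes at c32 each, its prefill throughput should approach twice Solo's c32 figure.

```python
# Prefill throughput (t/s) from the tables above.
solo_c32_prefill = 6827      # Solo at c32
shuffle_c64_prefill = 12064  # 2x Solo + simple-shuffle at c64

predicted = 2 * solo_c32_prefill        # ideal: two fully independent nodes
efficiency = shuffle_c64_prefill / predicted
print(f"{efficiency:.0%}")              # -> 88% of the two-node ideal
```

~88% of the ideal, consistent with two independent nodes plus some router overhead.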
Time to First Token (TTFT)
End-to-end latency to first generated token (ms) — lower is better.
| Concurrency | Solo | Cluster TP=2 | Shuffle | Least-busy | Winner |
|---|---|---|---|---|---|
| c1 | 369 | 251 | 370 | 364 | Cluster (-32%) |
| c2 | 570 | 381 | 459 | 526 | Cluster (-33%) |
| c4 | 898 | 640 | 675 | 735 | Cluster (-29%) |
| c8 | 1,385 | 1,020 | 1,005 | 1,232 | Shuffle (-2%) |
| c16 | 2,384 | 1,855 | 1,495 | 1,690 | Shuffle (-19%) |
| c32 | 4,768 | 3,642 | 2,551 | 3,303 | Shuffle (-30%) |
| c64 | 10,874 | 8,032 | 4,940 | 7,360 | Shuffle (-39%) |
TTFT crossover happens at c8. Below that, Cluster’s faster per-request prefill wins. Above c8, shuffle’s independent processing wins because each node starts its prefill without waiting for the other.
Routing Strategy: simple-shuffle vs least-busy
This benchmark included a controlled A/B test — same model, same day, only the LiteLLM routing strategy changed.
simple-shuffle wins across the board at c8+. At c64, the gap is enormous:
| Metric at c64 | simple-shuffle | least-busy | Gap |
|---|---|---|---|
| Decode | 567.9 t/s | 338.4 t/s | +68% |
| Prefill | 12,064 t/s | 4,982 t/s | +142% |
| TTFT | 4,940 ms | 7,360 ms | -33% |
Why least-busy fails
least-busy tracks in-flight requests and routes to the least loaded backend. With identical backends and similar request durations, this creates oscillation — Backend A finishes a batch, drops to zero in-flight, ALL new requests rush to A, Backend B starves, cycle repeats.
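The rush is easy to reproduce with a toy model of the router's view. Below, backend A has just drained while B still has six requests in flight; the next ten arrivals are routed with a least-busy rule (a simplified stand-in for LiteLLM's strategy, with ties broken by lowest index):

```python
def route_least_busy(in_flight):
    # Pick the backend with the fewest in-flight requests (ties -> lowest index).
    return min(range(len(in_flight)), key=lambda i: in_flight[i])

# Snapshot right after backend A (index 0) finishes a batch while
# backend B (index 1) is still mid-batch with 6 requests in flight.
in_flight = [0, 6]
assignments = []
for _ in range(10):  # ten new requests arrive before B's batch completes
    i = route_least_busy(in_flight)
    in_flight[i] += 1
    assignments.append("AB"[i])

print("".join(assignments))  # -> AAAAAAABAB: eight of ten pile onto A, B starves
```

When B's batch finally completes and its count drops to zero, the pattern mirrors, so the pair oscillates instead of balancing.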
Run-to-run variance (across the 50 runs per concurrency level) confirms the oscillation:
| Concurrency | Shuffle stdev | Least-busy stdev | Ratio |
|---|---|---|---|
| c32 | 33.73 | 57.09 | 1.7x |
| c64 | 48.02 | 64.11 | 1.3x |
Never use least-busy for identical backends.
When to Use Which Topology
| Use Case | Best Topology | Why |
|---|---|---|
| Single user / low latency (c1-c4) | Cluster TP=2 | +20-44% decode, -29-33% TTFT |
| Claude Code / moderate load (c8-c16) | Cluster TP=2 | +50-57% decode vs Solo, competitive TTFT |
| High throughput / many agents (c32-c64) | 2x Solo + simple-shuffle | +20% decode at c64, +211% prefill |
| Any 2x Solo configuration | simple-shuffle routing | 15-68% better than least-busy at c32+ |
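The decision logic in the table can be collapsed into a few lines. This helper is purely illustrative: the names and the c16 threshold are ours, taken from the observed crossover between c16 and c32.

```python
def recommend_topology(expected_concurrency: int) -> str:
    """Illustrative helper encoding the table above; the threshold is ours,
    derived from the crossover observed between c16 and c32."""
    if expected_concurrency <= 16:
        return "cluster-tp2"            # best decode and TTFT up to c16
    return "2x-solo-simple-shuffle"     # aggregate throughput wins beyond

print(recommend_topology(4), recommend_topology(64))
```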
Technical Notes
- Always use IP addresses — mDNS hostname resolution causes hard crashes at c64
- LiteLLM Docker DNS: Remote backends need IPs (mDNS doesn’t resolve in containers)
- Thermal: GPUs reach 84-90C under sustained c64 load. T.Limit throttle at 53C is normal operation
- Solo recipe is safer for GF1: gpu_mem 0.70 with 4K context stays within thermal margin. The claude recipe (0.75, 131K) has crashed GF1 under load