7/9 Solo vs Cluster: Where Two Sparks Beat One (and Where They Don't)
Previous: Two Sparks, One Cluster: Setting Up with Claude Code
Post 6 built the two-node cluster. Now there are four operating modes to test: TP=2 cluster (one model split across both GPUs), 2x Solo with two different LiteLLM routing strategies, and a plain Solo baseline. Which topology wins? Which routing strategy? I benchmarked all four, and the answers aren't what I expected.
The Test Setup
Model: gpt-oss-120b with MXFP4 quantization. Benchmark: llama-benchy with pp1024 for prefill, tg128 for decode, 50 runs per concurrency level. vLLM batching: max_num_batched_tokens=32768. Concurrency sweep from c1 to c64 across four topologies:
- Solo — Single DGX Spark, TP=1
- 2x Solo + simple-shuffle — Both nodes running independently, LiteLLM load-balancing with simple-shuffle routing (the setup from Post 5)
- 2x Solo + least-busy — Same dual-node setup, LiteLLM least-busy routing
- Cluster TP=2 — Ray distributed inference, tensor parallelism across both nodes via CX7
All four runs used the same model, same image, same hardware, same day — the only variable was topology and routing strategy. All benchmarks used IP addresses — mDNS hostname resolution causes hard crashes at c64, a gotcha from Post 6.
Decode: Cluster Dominates… Until c64
For decode (token generation after the first token), the cluster wins at every concurrency level — except the highest:
| Concurrency | Solo | simple-shuffle | least-busy | Cluster TP=2 |
|---|---|---|---|---|
| c1 | 57.5 t/s | 57.7 t/s | 57.5 t/s | 69.0 t/s (+20%) |
| c2 | 78.4 t/s | 99.4 t/s | 87.7 t/s | 104.1 t/s (+33%) |
| c4 | 107.9 t/s | 127.9 t/s | 122.3 t/s | 155.3 t/s (+44%) |
| c8 | 153.5 t/s | 177.5 t/s | 162.5 t/s | 231.0 t/s (+50%) |
| c16 | 218.5 t/s | 268.6 t/s | 238.5 t/s | 342.1 t/s (+57%) |
| c32 | 318.7 t/s | 382.1 t/s | 333.2 t/s | 471.7 t/s (+48%) |
| c64 | 315.2 t/s | 567.9 t/s | 338.4 t/s | 471.8 t/s (shuffle +20%) |
[Chart: decode throughput (t/s) vs. concurrency — Solo, 2× Solo simple-shuffle, 2× Solo least-busy, Cluster TP=2]
The CX7 interconnect pays for itself through c32. Splitting the model across two GPUs means each GPU does half the decode work per token. Even a single request sees +20% — no batching tricks needed. Peak advantage hits +57% at c16.
But at c64, something flips. Two independent nodes with simple-shuffle routing outpace the cluster by 20%. Solo actually decreases from c32 to c64 (318.7 → 315.2 t/s): the single GPU saturates. The cluster holds flat (471.7 → 471.8 t/s) but can't match two independent engines, each processing its own queue without synchronization overhead.
Prefill: The Crossover
Here’s where things get dramatic. Prefill throughput (processing the input prompt) has a clear crossover — and at c64 the cluster nearly collapses:
| Concurrency | Solo | simple-shuffle | least-busy | Cluster TP=2 |
|---|---|---|---|---|
| c1 | 2,926 t/s | 2,922 t/s | 3,074 t/s | 4,442 t/s (+52%) |
| c2 | 3,684 t/s | 4,669 t/s | 4,074 t/s | 5,428 t/s (+47%) |
| c4 | 4,540 t/s | 5,578 t/s | 5,299 t/s | 6,343 t/s (+40%) |
| c8 | 5,905 t/s | 7,483 t/s | 6,579 t/s | 7,965 t/s (+35%) |
| c16 | 6,858 t/s | 10,072 t/s | 8,886 t/s | 8,816 t/s (shuffle +14%) |
| c32 | 6,827 t/s | 11,741 t/s | 9,753 t/s | 8,966 t/s (shuffle +31%) |
| c64 | 2,753 t/s | 12,064 t/s | 4,982 t/s | 3,873 t/s (shuffle +211%) |
The crossover, visualized: [Chart: prefill throughput (t/s) vs. concurrency — Solo, 2× Solo simple-shuffle, 2× Solo least-busy, Cluster TP=2]
Cluster wins at low concurrency — single-request prefill is 52% faster when both GPUs collaborate. But at c16 the lines cross. By c64, simple-shuffle is 211% ahead of the cluster. The cluster doesn’t just lose — it collapses from 8,966 t/s at c32 to 3,873 t/s at c64. Solo collapses too (6,827 to 2,753 t/s). Only the two-node shuffle topology keeps scaling.
The max_num_batched_tokens=32768 setting explains exactly where the collapse happens. With pp1024, each prefill request consumes 1,024 tokens from the batch budget. One vLLM engine can batch 32,768 / 1,024 = 32 concurrent prefills before requests start queuing. At c32, a single engine fills its batch perfectly — Solo and Cluster are near their peak. At c64, a single engine must process two batches sequentially (64 × 1,024 = 65,536 > 32,768), and throughput collapses. But 2x Solo with simple-shuffle splits traffic evenly — each node gets ~32 requests, each fits in one batch. Two engines, double the batch budget, no queuing.
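The batch-budget arithmetic above can be sketched in a few lines. This is a toy model: it assumes requests queue in whole batches once the token budget is exceeded, which glosses over the finer points of vLLM's continuous batching.

```python
import math

MAX_BATCHED_TOKENS = 32_768  # vLLM's max_num_batched_tokens
PROMPT_TOKENS = 1_024        # pp1024: each prefill request costs 1,024 tokens

def sequential_prefill_batches(concurrency: int, engines: int = 1) -> int:
    """How many prefill batches each engine must run back to back."""
    per_engine = math.ceil(concurrency / engines)    # shuffle splits traffic ~evenly
    per_batch = MAX_BATCHED_TOKENS // PROMPT_TOKENS  # 32 prompts fit in one batch
    return math.ceil(per_engine / per_batch)

print(sequential_prefill_batches(32, engines=1))  # 1: a single engine is exactly full
print(sequential_prefill_batches(64, engines=1))  # 2: Solo and Cluster must queue
print(sequential_prefill_batches(64, engines=2))  # 1: 2x Solo stays within budget
```

The c32 sweet spot falls straight out of the division: one engine is exactly full, two engines at c64 are each exactly full, and anything that forces a second sequential batch pays for it in throughput.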
For decode, the batch limit doesn’t matter — each decode step produces one token per request, so c64 is only 64 tokens per step, well under the 32,768 budget. The cluster wins decode through c32 because TP=2 parallelism halves the per-token compute. The c64 decode crossover is about synchronization overhead: each CX7 all-reduce step has fixed latency that scales with request count, and at c64 two independent engines avoid that cost entirely.
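The same reasoning can be written down as a toy latency model. The millisecond figures below are invented for illustration, not measured; the point is only the shape of the trade-off: TP splits the compute, but pays a per-step synchronization cost that grows with request count.

```python
MAX_BATCHED_TOKENS = 32_768

# Each in-flight request contributes exactly one token per decode step,
# so even c64 uses just 64 of the 32,768-token budget.
assert 64 * 1 < MAX_BATCHED_TOKENS

def decode_step_ms(concurrency: int, tp: int = 1,
                   compute_ms: float = 20.0, allreduce_ms: float = 0.2) -> float:
    """Per-step latency: TP divides the compute, but every step pays a
    synchronization cost that scales with the number of in-flight requests.
    (compute_ms and allreduce_ms are illustrative placeholders.)"""
    sync_ms = allreduce_ms * concurrency if tp > 1 else 0.0
    return compute_ms / tp + sync_ms

print(decode_step_ms(1, tp=2) < decode_step_ms(1, tp=1))    # True: TP=2 wins a solo request
print(decode_step_ms(64, tp=2) > decode_step_ms(64, tp=1))  # True: sync cost dominates at c64
```

With any positive per-request sync cost, a crossover concurrency exists where two independent engines beat one TP=2 cluster; the benchmark puts it between c32 and c64 on this hardware.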
TTFT: Same Crossover, Same Reason
Time to first token follows the same pattern as prefill, because TTFT is essentially prefill latency from the user’s perspective:
| Concurrency | Solo | simple-shuffle | least-busy | Cluster TP=2 |
|---|---|---|---|---|
| c1 | 369 ms | 370 ms | 364 ms | 251 ms (-32%) |
| c2 | 570 ms | 459 ms | 526 ms | 381 ms (-33%) |
| c4 | 898 ms | 675 ms | 735 ms | 640 ms (-29%) |
| c8 | 1,385 ms | 1,005 ms | 1,232 ms | 1,020 ms |
| c16 | 2,384 ms | 1,495 ms | 1,690 ms | 1,855 ms (shuffle -19%) |
| c32 | 4,768 ms | 2,551 ms | 3,303 ms | 3,642 ms (shuffle -30%) |
| c64 | 10,874 ms | 4,940 ms | 7,360 ms | 8,032 ms (shuffle -39%) |
[Chart: TTFT (ms) vs. concurrency — Solo, 2× Solo simple-shuffle, 2× Solo least-busy, Cluster TP=2]
Cluster wins TTFT at c1-c4 — 251ms vs 369ms at c1, nearly a third faster. But the crossover hits at c8, and by c64 simple-shuffle delivers first tokens 39% faster than the cluster. If your users care about time-to-first-token under concurrent load, 2x Solo with simple-shuffle is the better topology.
LiteLLM Routing: simple-shuffle vs least-busy
The four-topology test doubled as an A/B test of LiteLLM routing strategies. Same model, same hardware, same day — only the routing strategy changed. The results are unambiguous.
simple-shuffle beats least-busy at every concurrency level for decode throughput. The gap widens with load:
- c1-c4: Both perform similarly. Not enough traffic to stress the router.
- c8-c16: simple-shuffle pulls ahead by 9-13% in decode.
- c32: shuffle +15% decode, +20% prefill, +23% better TTFT.
- c64: shuffle +68% decode (568 vs 338 t/s), +142% prefill (12,064 vs 4,982 t/s).
Variance tells the story. least-busy shows significantly higher standard deviation in decode throughput at high concurrency:
| Concurrency | shuffle stdev (t/s) | least-busy stdev (t/s) | Ratio (least-busy / shuffle) |
|---|---|---|---|
| c1 | 0.43 | 0.18 | 0.4x |
| c8 | 21.50 | 20.12 | 0.9x |
| c16 | 25.84 | 29.69 | 1.1x |
| c32 | 33.73 | 57.09 | 1.7x |
| c64 | 48.02 | 64.11 | 1.3x |
Why least-busy fails: it tracks in-flight requests and routes to the least loaded backend. With identical backends and similar request durations, this creates oscillation. Backend A finishes a batch, drops to zero in-flight. All new requests rush to A. Backend B starves. The cycle repeats. simple-shuffle uses stateless weighted random distribution — each backend gets ~50% of traffic regardless of current load. For identical backends, this is optimal.
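The pile-on can be modeled in a few lines. This is my sketch of the failure mode, not LiteLLM's actual implementation, and it deliberately simplifies: it routes a whole burst against one stale load snapshot, whereas real least-busy updates counts per request — but completion bursts recreate the same effect.

```python
import random

def assign_burst(strategy: str, snapshot: list, burst: int, rng) -> list:
    """Route a burst of requests against a stale in-flight snapshot."""
    counts = [0, 0]
    for _ in range(burst):
        if strategy == "least-busy":
            # Snapshot isn't updated mid-burst: every request picks backend 0.
            target = 0 if snapshot[0] <= snapshot[1] else 1
        else:  # simple-shuffle: stateless random pick, ignores load entirely
            target = rng.randrange(2)
        counts[target] += 1
    return counts

rng = random.Random(0)
# Backend 0 just drained (0 in flight); backend 1 is mid-batch (30 in flight).
print(assign_burst("least-busy", [0, 30], 64, rng))      # [64, 0]: starvation
print(assign_burst("simple-shuffle", [0, 30], 64, rng))  # roughly even split
```

The load-aware strategy sends the entire burst to the momentarily idle backend; the load-ignorant one splits it roughly 50/50, which for identical backends is exactly what you want.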
Never use least-busy for identical backends. It causes backend starvation at high concurrency.
When to Use Which
Neither topology dominates. The right choice depends on the workload:
| Use Case | Best Topology | Why |
|---|---|---|
| Single user / low latency (c1-c4) | Cluster TP=2 | +20-44% decode, 29-33% lower TTFT |
| Moderate concurrency (c8-c16) | Cluster TP=2 | +30-57% decode, competitive TTFT |
| High throughput / many users (c32-c64) | 2x Solo + simple-shuffle | +20% decode and +211% prefill at c64, 39% lower TTFT |
| Any 2x Solo mode | simple-shuffle routing | 15-68% better decode than least-busy at c32-c64 |
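The table collapses into a small helper. A sketch only: the thresholds are this benchmark's crossover points on this hardware, not universal constants, and `prefill_heavy` is my own name for the workload distinction.

```python
def pick_topology(concurrency: int, prefill_heavy: bool = False) -> str:
    """Choose a topology from the crossovers measured above."""
    if concurrency >= 32 or (prefill_heavy and concurrency >= 16):
        return "2x Solo + simple-shuffle"  # the only topology that survives c64
    return "Cluster TP=2"                  # best latency and decode at low concurrency

print(pick_topology(2))                       # Cluster TP=2
print(pick_topology(64))                      # 2x Solo + simple-shuffle
print(pick_topology(16, prefill_heavy=True))  # 2x Solo + simple-shuffle
```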
Cluster TP=2 — Best for decode-heavy workloads: chatbots, code generation, long-form writing. Best single-request speed across the board. Full 131K context window (the model fits comfortably across 256GB). Best choice when you need fast token generation and have moderate concurrency.
2x Solo + simple-shuffle — Best for prefill-heavy, high-concurrency workloads: batch processing, document summarization, RAG pipelines with many parallel requests. Better TTFT under load. And the only topology that doesn’t collapse at c64.
Solo — When the other node is running a different model, or when simplicity matters more than throughput.
Further Reading
- Scaling LLM Inference: DP, PP, and TP Explained — data, pipeline, and tensor parallelism strategies for inference
- LLM Transformer Inference Guide — how token generation works across GPUs
- vLLM Internals — deep dive into vLLM’s scheduling and batching
- Optimizing vLLM for MoE Models — tuning guide for mixture-of-experts inference (GPT-OSS-120B is MoE)
- Tensor Parallelism Explained (video) — visual walkthrough of TP mechanics
What’s Next
Next: The Recipe System: One Command, Zero Flag Archaeology — how run-recipe.sh turns twenty minutes of flag archaeology into one command.