7/9 Solo vs Cluster: Where Two Sparks Beat One (and Where They Don't)

· 10 min read

Previous: Two Sparks, One Cluster: Setting Up with Claude Code

Post 6 built the two-node cluster. Now four operating modes to test: TP=2 cluster (one model split across both GPUs), 2x Solo with two different LiteLLM routing strategies, and plain Solo baseline. Which topology wins? Which routing strategy? I benchmarked all four — and the answers aren’t what I expected.

The Test Setup

Model: gpt-oss-120b with MXFP4 quantization. Benchmark: llama-benchy with pp1024 for prefill, tg128 for decode, 50 runs per concurrency level. vLLM batching: max_num_batched_tokens=32768. Concurrency sweep from c1 to c64 across four topologies:

  1. Solo — Single DGX Spark, TP=1
  2. 2x Solo + simple-shuffle — Both nodes running independently, LiteLLM load-balancing with simple-shuffle routing (the setup from Post 5)
  3. 2x Solo + least-busy — Same dual-node setup, LiteLLM least-busy routing
  4. Cluster TP=2 — Ray distributed inference, tensor parallelism across both nodes via CX7

All four runs used the same model, same image, same hardware, same day — the only variable was topology and routing strategy. All benchmarks used IP addresses — mDNS hostname resolution causes hard crashes at c64, a gotcha from Post 6.

Decode: Cluster Dominates… Until c64

For decode (token generation after the first token), the cluster wins at every concurrency level — except the highest:

| Concurrency | Solo | simple-shuffle | least-busy | Cluster TP=2 |
|---|---|---|---|---|
| c1 | 57.5 t/s | 57.7 t/s | 57.5 t/s | 69.0 t/s (+20%) |
| c2 | 78.4 t/s | 99.4 t/s | 87.7 t/s | 104.1 t/s (+33%) |
| c4 | 107.9 t/s | 127.9 t/s | 122.3 t/s | 155.3 t/s (+44%) |
| c8 | 153.5 t/s | 177.5 t/s | 162.5 t/s | 231.0 t/s (+50%) |
| c16 | 218.5 t/s | 268.6 t/s | 238.5 t/s | 342.1 t/s (+57%) |
| c32 | 318.7 t/s | 382.1 t/s | 333.2 t/s | 471.7 t/s (+48%) |
| c64 | 315.2 t/s | 567.9 t/s | 338.4 t/s | 471.8 t/s (shuffle +20%) |

```mermaid
---
config:
  themeVariables:
    xyChart:
      titleColor: "#333"
      plotColorPalette: "#6b7280, #22c55e, #ef4444, #3b82f6"
---
xychart-beta
  title "Decode Throughput by Concurrency (t/s)"
  x-axis "Concurrency" [c1, c2, c4, c8, c16, c32, c64]
  y-axis "Throughput (t/s)" 0 --> 600
  line "Solo" [57.5, 78.4, 107.9, 153.5, 218.5, 318.7, 315.2]
  line "simple-shuffle" [57.7, 99.4, 127.9, 177.5, 268.6, 382.1, 567.9]
  line "least-busy" [57.5, 87.7, 122.3, 162.5, 238.5, 333.2, 338.4]
  line "Cluster TP=2" [69.0, 104.1, 155.3, 231.0, 342.1, 471.7, 471.8]
```

Legend: ━━ Solo · ━━ 2× Solo simple-shuffle · ━━ 2× Solo least-busy · ━━ Cluster TP=2

The CX7 interconnect pays for itself through c32. Splitting the model across two GPUs means each GPU does half the decode work per token. Even a single request sees +20% — no batching tricks needed. Peak advantage hits +57% at c16.

But at c64, something flips. Two independent nodes with simple-shuffle routing outpace the cluster by 20%. Solo actually decreases from c32 to c64 (318.7 → 315.2 t/s) — the single GPU saturates. The cluster plateaus (471.7 → 471.8 t/s) and can't match two independent engines, each draining its own queue with no synchronization overhead.

Prefill: The Crossover

Here’s where things get dramatic. Prefill throughput (processing the input prompt) has a clear crossover — and at c64 the cluster nearly collapses:

| Concurrency | Solo | simple-shuffle | least-busy | Cluster TP=2 |
|---|---|---|---|---|
| c1 | 2,926 t/s | 2,922 t/s | 3,074 t/s | 4,442 t/s (+52%) |
| c2 | 3,684 t/s | 4,669 t/s | 4,074 t/s | 5,428 t/s (+47%) |
| c4 | 4,540 t/s | 5,578 t/s | 5,299 t/s | 6,343 t/s (+40%) |
| c8 | 5,905 t/s | 7,483 t/s | 6,579 t/s | 7,965 t/s (+35%) |
| c16 | 6,858 t/s | 10,072 t/s | 8,886 t/s | 8,816 t/s (shuffle +14%) |
| c32 | 6,827 t/s | 11,741 t/s | 9,753 t/s | 8,966 t/s (shuffle +31%) |
| c64 | 2,753 t/s | 12,064 t/s | 4,982 t/s | 3,873 t/s (shuffle +211%) |

The crossover, visualized:

```mermaid
---
config:
  themeVariables:
    xyChart:
      titleColor: "#333"
      plotColorPalette: "#6b7280, #22c55e, #ef4444, #3b82f6"
---
xychart-beta
  title "Prefill Throughput by Concurrency (t/s)"
  x-axis "Concurrency" [c1, c2, c4, c8, c16, c32, c64]
  y-axis "Throughput (t/s)" 0 --> 13000
  line "Solo" [2926, 3684, 4540, 5905, 6858, 6827, 2753]
  line "simple-shuffle" [2922, 4669, 5578, 7483, 10072, 11741, 12064]
  line "least-busy" [3074, 4074, 5299, 6579, 8886, 9753, 4982]
  line "Cluster TP=2" [4442, 5428, 6343, 7965, 8816, 8966, 3873]
```

Legend: ━━ Solo · ━━ 2× Solo simple-shuffle · ━━ 2× Solo least-busy · ━━ Cluster TP=2

Cluster wins at low concurrency — single-request prefill is 52% faster when both GPUs collaborate. But at c16 the lines cross. By c64, simple-shuffle is 211% ahead of the cluster. The cluster doesn’t just lose — it collapses from 8,966 t/s at c32 to 3,873 t/s at c64. Solo collapses too (6,827 to 2,753 t/s). Only the two-node shuffle topology keeps scaling.

The max_num_batched_tokens=32768 setting explains exactly where the collapse happens. With pp1024, each prefill request consumes 1,024 tokens of the batch budget. One vLLM engine can batch 32,768 / 1,024 = 32 concurrent prefills before requests start queuing. At c32, a single engine fills its batch perfectly — Solo and Cluster sit near their peak. At c64, a single engine must process two batches sequentially (64 × 1,024 = 65,536 > 32,768), and throughput collapses. But 2x Solo with simple-shuffle splits traffic evenly: each node gets ~32 requests, and each set fits in one batch. Two engines, double the batch budget, no queuing.

For decode, the batch limit doesn’t matter — each decode step produces one token per request, so c64 is only 64 tokens per step, well under the 32,768 budget. The cluster wins decode through c32 because TP=2 parallelism halves the per-token compute. The c64 decode crossover is about synchronization overhead: each CX7 all-reduce step has fixed latency that scales with request count, and at c64 two independent engines avoid that cost entirely.
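The arithmetic above fits in a few lines of Python. This is a sketch, not anything from the benchmark harness — the function names are mine; only the two constants come from the actual config:

```python
MAX_BATCHED_TOKENS = 32_768  # vLLM max_num_batched_tokens
PROMPT_LEN = 1_024           # llama-benchy pp1024 prefill size

def prefill_waves(concurrency: int, engines: int = 1) -> int:
    """Sequential batch 'waves' each engine needs to prefill all
    concurrent prompts. Solo and Cluster TP=2 both schedule through
    one engine; 2x Solo halves the per-engine request count."""
    per_engine = -(-concurrency // engines)  # ceil division
    return -(-(per_engine * PROMPT_LEN) // MAX_BATCHED_TOKENS)

def decode_step_tokens(concurrency: int) -> int:
    """Decode emits one token per request per step, so the batch
    budget is never the bottleneck at these concurrency levels."""
    return concurrency
```

`prefill_waves(32)` is 1 but `prefill_waves(64)` is 2 — the c64 collapse — while `prefill_waves(64, engines=2)` stays at 1, and `decode_step_tokens(64)` uses only 64 of the 32,768-token budget.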

TTFT: Same Crossover, Same Reason

Time to first token follows the same pattern as prefill, because TTFT is essentially prefill latency from the user’s perspective:

| Concurrency | Solo | simple-shuffle | least-busy | Cluster TP=2 |
|---|---|---|---|---|
| c1 | 369 ms | 370 ms | 364 ms | 251 ms (-32%) |
| c2 | 570 ms | 459 ms | 526 ms | 381 ms (-33%) |
| c4 | 898 ms | 675 ms | 735 ms | 640 ms (-29%) |
| c8 | 1,385 ms | 1,005 ms | 1,232 ms | 1,020 ms |
| c16 | 2,384 ms | 1,495 ms | 1,690 ms | 1,855 ms (shuffle -19%) |
| c32 | 4,768 ms | 2,551 ms | 3,303 ms | 3,642 ms (shuffle -30%) |
| c64 | 10,874 ms | 4,940 ms | 7,360 ms | 8,032 ms (shuffle -39%) |

```mermaid
---
config:
  themeVariables:
    xyChart:
      titleColor: "#333"
      plotColorPalette: "#6b7280, #22c55e, #ef4444, #3b82f6"
---
xychart-beta
  title "Time to First Token by Concurrency (ms, lower is better)"
  x-axis "Concurrency" [c1, c2, c4, c8, c16, c32, c64]
  y-axis "TTFT (ms)" 0 --> 11000
  line "Solo" [369, 570, 898, 1385, 2384, 4768, 10874]
  line "simple-shuffle" [370, 459, 675, 1005, 1495, 2551, 4940]
  line "least-busy" [364, 526, 735, 1232, 1690, 3303, 7360]
  line "Cluster TP=2" [251, 381, 640, 1020, 1855, 3642, 8032]
```

Legend: ━━ Solo · ━━ 2× Solo simple-shuffle · ━━ 2× Solo least-busy · ━━ Cluster TP=2

Cluster wins TTFT at c1-c4 — 251ms vs 369ms at c1, nearly a third faster. But the crossover hits at c8, and by c64 simple-shuffle delivers first tokens 39% faster than the cluster. If your users care about time-to-first-token under concurrent load, 2x Solo with simple-shuffle is the better topology.

LiteLLM Routing: simple-shuffle vs least-busy

The four-topology test doubled as an A/B test of LiteLLM routing strategies. Same model, same hardware, same day — only the routing strategy changed. The results are unambiguous.

simple-shuffle beats least-busy at every concurrency level for decode throughput. The gap widens with load:

  • c1-c4: Both perform similarly. Not enough traffic to stress the router.
  • c8-c16: simple-shuffle pulls ahead by 9-13% in decode.
  • c32: shuffle +15% decode, +20% prefill, +23% better TTFT.
  • c64: shuffle +68% decode (568 vs 338 t/s), +142% prefill (12,064 vs 4,982 t/s).

Variance tells the story. least-busy shows significantly higher standard deviation at high concurrency:

| Concurrency | shuffle stdev | least-busy stdev | Ratio |
|---|---|---|---|
| c1 | 0.43 | 0.18 | 0.4x |
| c8 | 21.50 | 20.12 | 0.9x |
| c16 | 25.84 | 29.69 | 1.1x |
| c32 | 33.73 | 57.09 | 1.7x |
| c64 | 48.02 | 64.11 | 1.3x |

Why least-busy fails: it tracks in-flight requests and routes to the least loaded backend. With identical backends and similar request durations, this creates oscillation. Backend A finishes a batch, drops to zero in-flight. All new requests rush to A. Backend B starves. The cycle repeats. simple-shuffle uses stateless weighted random distribution — each backend gets ~50% of traffic regardless of current load. For identical backends, this is optimal.
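The oscillation is easy to reproduce in a toy model. To be clear, this is not LiteLLM's implementation — it's a deliberately crude simulation that assumes the router's in-flight counts lag behind its routing decisions for the length of one arrival burst:

```python
import random

def route_bursts(strategy: str, bursts: int = 200, burst_size: int = 32,
                 service_ticks: int = 4, seed: int = 0) -> float:
    """Fraction of arrival bursts routed entirely to a single backend.

    Two identical backends; the router reads in-flight counts once per
    burst, so least-busy decides on a snapshot that goes stale while
    the burst is being assigned."""
    rng = random.Random(seed)
    done_at = [[], []]                   # completion ticks per backend
    herded = 0
    for tick in range(bursts):
        for b in (0, 1):                 # retire finished requests
            done_at[b] = [t for t in done_at[b] if t > tick]
        snapshot = (len(done_at[0]), len(done_at[1]))
        targets = []
        for _ in range(burst_size):
            if strategy == "least-busy":
                b = 0 if snapshot[0] <= snapshot[1] else 1
            else:                        # simple-shuffle: stateless 50/50
                b = rng.randrange(2)
            targets.append(b)
            done_at[b].append(tick + service_ticks)
        herded += len(set(targets)) == 1  # whole burst hit one node?
    return herded / bursts
```

Under these assumptions, `route_bursts("least-busy")` returns 1.0 — every burst stampedes to whichever node looked idle — while `route_bursts("simple-shuffle")` stays at 0.0.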

Never use least-busy for identical backends. It causes backend starvation at high concurrency.
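In LiteLLM's Python SDK, the strategy is a single `Router` argument. A sketch of the two-endpoint layout — the model name, IPs, and ports here are placeholders, not the actual values from this series:

```python
from litellm import Router

# Two identical vLLM backends behind one logical model name.
router = Router(
    model_list=[
        {
            "model_name": "gpt-oss-120b",
            "litellm_params": {
                "model": "hosted_vllm/gpt-oss-120b",
                "api_base": "http://192.168.100.10:8000/v1",  # placeholder IP
            },
        },
        {
            "model_name": "gpt-oss-120b",
            "litellm_params": {
                "model": "hosted_vllm/gpt-oss-120b",
                "api_base": "http://192.168.100.11:8000/v1",  # placeholder IP
            },
        },
    ],
    routing_strategy="simple-shuffle",  # not "least-busy" for identical twins
)
```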

When to Use Which

Neither topology dominates. The right choice depends on the workload:

| Use Case | Best Topology | Why |
|---|---|---|
| Single user / low latency (c1-c4) | Cluster TP=2 | +20-44% decode, -29-33% TTFT |
| Moderate concurrency (c8-c16) | Cluster TP=2 | +30-57% decode, competitive TTFT |
| High throughput / many users (c32-c64) | 2x Solo + simple-shuffle | +20% decode at c64, +211% prefill at c64, -39% TTFT |
| Any 2x Solo mode | simple-shuffle routing | 15-68% better than least-busy at c32-c64 |

Cluster TP=2 — Best for decode-heavy workloads: chatbots, code generation, long-form writing. Best single-request speed across the board. Full 131K context window (the model fits comfortably across 256GB). Best choice when you need fast token generation and have moderate concurrency.

2x Solo + simple-shuffle — Best for prefill-heavy, high-concurrency workloads: batch processing, document summarization, RAG pipelines with many parallel requests. Better TTFT under load. And the only topology that doesn’t collapse at c64.

Solo — When the other node is running a different model, or when simplicity matters more than throughput.
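Distilled into code, the decision rule is roughly this — a sketch whose thresholds come straight from the table above and only hold for this exact model, quantization, and batch budget:

```python
def pick_topology(concurrency: int, workload: str = "decode") -> str:
    """Mirror of the "When to Use Which" table, not a general law.
    Re-benchmark before trusting these thresholds on other hardware."""
    prefill_heavy = workload == "prefill"
    if concurrency >= 32 or (prefill_heavy and concurrency >= 16):
        return "2x Solo + simple-shuffle"
    return "Cluster TP=2"
```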

What’s Next

Next: The Recipe System: One Command, Zero Flag Archaeology — how run-recipe.sh turns twenty minutes of flag archaeology into one command.