7/9 Solo vs Cluster: Where Two Sparks Beat One (and Where They Don't)
Previous: Two Sparks, One Cluster: Setting Up with Claude Code
Post 6 built the two-node cluster. Now there are four operating modes to test: TP=2 cluster (one model split across both GPUs), 2x Solo with two different LiteLLM routing strategies, and a plain Solo baseline. Which topology wins? Which routing strategy? I benchmarked all four, and the answers aren't what I expected.
The Test Setup
Model: gpt-oss-120b with MXFP4 quantization. Benchmark: llama-benchy with pp1024 for prefill, tg128 for decode, 50 runs per concurrency level. vLLM batching: max_num_batched_tokens=32768. Concurrency sweep from c1 to c64 across four topologies:
- Solo — Single DGX Spark, TP=1
- 2x Solo + simple-shuffle — Both nodes running independently, LiteLLM load-balancing with simple-shuffle routing (the setup from Post 5)
- 2x Solo + least-busy — Same dual-node setup, LiteLLM least-busy routing
- Cluster TP=2 — Ray distributed inference, tensor parallelism across both nodes via CX7
All four runs used the same model, same image, same hardware, same day — the only variable was topology and routing strategy. All benchmarks used IP addresses — mDNS hostname resolution causes hard crashes at c64, a gotcha from Post 6.
Decode: Cluster Dominates… Until c64
For decode (token generation after the first token), the cluster wins at every concurrency level — except the highest:
| Concurrency | Solo | simple-shuffle | least-busy | Cluster TP=2 |
|---|---|---|---|---|
| c1 | 57.5 t/s | 57.7 t/s | 57.5 t/s | 69.0 t/s (+20%) |
| c2 | 78.4 t/s | 99.4 t/s | 87.7 t/s | 104.1 t/s (+33%) |
| c4 | 107.9 t/s | 127.9 t/s | 122.3 t/s | 155.3 t/s (+44%) |
| c8 | 153.5 t/s | 177.5 t/s | 162.5 t/s | 231.0 t/s (+50%) |
| c16 | 218.5 t/s | 268.6 t/s | 238.5 t/s | 342.1 t/s (+57%) |
| c32 | 318.7 t/s | 382.1 t/s | 333.2 t/s | 471.7 t/s (+48%) |
| c64 | 315.2 t/s | 567.9 t/s | 338.4 t/s | 471.8 t/s (shuffle +20%) |
[Chart: decode throughput (t/s) vs. concurrency — Solo, 2× Solo simple-shuffle, 2× Solo least-busy, Cluster TP=2]
The CX7 interconnect pays for itself through c32. Splitting the model across two GPUs means each GPU does half the decode work per token. Even a single request sees +20% — no batching tricks needed. Peak advantage hits +57% at c16.
But at c64, something flips. Two independent nodes with simple-shuffle routing outpace the cluster by 20%. Solo actually decreases from c32 to c64 (318.7 → 315.2 t/s): the single GPU saturates. The cluster holds flat (471.7 → 471.8 t/s) but can't match two independent engines, each processing its own queue without synchronization overhead.
Prefill: The Crossover
Here’s where things get dramatic. Prefill throughput (processing the input prompt) has a clear crossover — and at c64 the cluster nearly collapses:
| Concurrency | Solo | simple-shuffle | least-busy | Cluster TP=2 |
|---|---|---|---|---|
| c1 | 2,926 t/s | 2,922 t/s | 3,074 t/s | 4,442 t/s (+52%) |
| c2 | 3,684 t/s | 4,669 t/s | 4,074 t/s | 5,428 t/s (+47%) |
| c4 | 4,540 t/s | 5,578 t/s | 5,299 t/s | 6,343 t/s (+40%) |
| c8 | 5,905 t/s | 7,483 t/s | 6,579 t/s | 7,965 t/s (+35%) |
| c16 | 6,858 t/s | 10,072 t/s | 8,886 t/s | 8,816 t/s (shuffle +14%) |
| c32 | 6,827 t/s | 11,741 t/s | 9,753 t/s | 8,966 t/s (shuffle +31%) |
| c64 | 2,753 t/s | 12,064 t/s | 4,982 t/s | 3,873 t/s (shuffle +211%) |
The crossover, visualized: [Chart: prefill throughput (t/s) vs. concurrency — Solo, 2× Solo simple-shuffle, 2× Solo least-busy, Cluster TP=2]
Cluster wins at low concurrency — single-request prefill is 52% faster when both GPUs collaborate. But at c16 the lines cross. By c64, simple-shuffle is 211% ahead of the cluster. The cluster doesn’t just lose — it collapses from 8,966 t/s at c32 to 3,873 t/s at c64. Solo collapses too (6,827 to 2,753 t/s). Only the two-node shuffle topology keeps scaling.
The max_num_batched_tokens=32768 setting explains exactly where the collapse happens. With pp1024, each prefill request consumes 1,024 tokens from the batch budget. One vLLM engine can batch 32,768 / 1,024 = 32 concurrent prefills before requests start queuing. At c32, a single engine fills its batch perfectly — Solo and Cluster are near their peak. At c64, a single engine must process two batches sequentially (64 × 1,024 = 65,536 > 32,768), and throughput collapses. But 2x Solo with simple-shuffle splits traffic evenly — each node gets ~32 requests, each fits in one batch. Two engines, double the batch budget, no queuing.
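The batch-budget arithmetic above can be sketched in a few lines. This is a toy model: it assumes requests queue in whole batches once the token budget is exceeded, which glosses over the finer points of vLLM's continuous batching.

```python
import math

MAX_BATCHED_TOKENS = 32_768  # vLLM's max_num_batched_tokens
PROMPT_TOKENS = 1_024        # pp1024: each prefill request costs 1,024 tokens

def sequential_prefill_batches(concurrency: int, engines: int = 1) -> int:
    """How many prefill batches each engine must run back to back."""
    per_engine = math.ceil(concurrency / engines)    # shuffle splits traffic ~evenly
    per_batch = MAX_BATCHED_TOKENS // PROMPT_TOKENS  # 32 prompts fit in one batch
    return math.ceil(per_engine / per_batch)

print(sequential_prefill_batches(32, engines=1))  # 1: a single engine is exactly full
print(sequential_prefill_batches(64, engines=1))  # 2: Solo and Cluster must queue
print(sequential_prefill_batches(64, engines=2))  # 1: 2x Solo stays within budget
```

The c32 sweet spot falls straight out of the division: one engine is exactly full, two engines at c64 are each exactly full, and anything that forces a second sequential batch pays for it in throughput.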
For decode, the batch limit doesn’t matter — each decode step produces one token per request, so c64 is only 64 tokens per step, well under the 32,768 budget. The cluster wins decode through c32 because TP=2 parallelism halves the per-token compute. The c64 decode crossover is about synchronization overhead: each CX7 all-reduce step has fixed latency that scales with request count, and at c64 two independent engines avoid that cost entirely.
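The same reasoning can be written down as a toy latency model. The millisecond figures below are invented for illustration, not measured; the point is only the shape of the trade-off: TP splits the compute, but pays a per-step synchronization cost that grows with request count.

```python
MAX_BATCHED_TOKENS = 32_768

# Each in-flight request contributes exactly one token per decode step,
# so even c64 uses just 64 of the 32,768-token budget.
assert 64 * 1 < MAX_BATCHED_TOKENS

def decode_step_ms(concurrency: int, tp: int = 1,
                   compute_ms: float = 20.0, allreduce_ms: float = 0.2) -> float:
    """Per-step latency: TP divides the compute, but every step pays a
    synchronization cost that scales with the number of in-flight requests.
    (compute_ms and allreduce_ms are illustrative placeholders.)"""
    sync_ms = allreduce_ms * concurrency if tp > 1 else 0.0
    return compute_ms / tp + sync_ms

print(decode_step_ms(1, tp=2) < decode_step_ms(1, tp=1))    # True: TP=2 wins a solo request
print(decode_step_ms(64, tp=2) > decode_step_ms(64, tp=1))  # True: sync cost dominates at c64
```

With any positive per-request sync cost, a crossover concurrency exists where two independent engines beat one TP=2 cluster; the benchmark puts it between c32 and c64 on this hardware.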
TTFT: Same Crossover, Same Reason
Time to first token follows the same pattern as prefill, because TTFT is essentially prefill latency from the user’s perspective:
| Concurrency | Solo | simple-shuffle | least-busy | Cluster TP=2 |
|---|---|---|---|---|
| c1 | 369 ms | 370 ms | 364 ms | 251 ms (-32%) |
| c2 | 570 ms | 459 ms | 526 ms | 381 ms (-33%) |
| c4 | 898 ms | 675 ms | 735 ms | 640 ms (-29%) |
| c8 | 1,385 ms | 1,005 ms | 1,232 ms | 1,020 ms |
| c16 | 2,384 ms | 1,495 ms | 1,690 ms | 1,855 ms (shuffle -19%) |
| c32 | 4,768 ms | 2,551 ms | 3,303 ms | 3,642 ms (shuffle -30%) |
| c64 | 10,874 ms | 4,940 ms | 7,360 ms | 8,032 ms (shuffle -39%) |
[Chart: TTFT (ms) vs. concurrency — Solo, 2× Solo simple-shuffle, 2× Solo least-busy, Cluster TP=2]
Cluster wins TTFT at c1-c4 — 251ms vs 369ms at c1, nearly a third faster. But the crossover hits at c8, and by c64 simple-shuffle delivers first tokens 39% faster than the cluster. If your users care about time-to-first-token under concurrent load, 2x Solo with simple-shuffle is the better topology.
LiteLLM Routing: simple-shuffle vs least-busy
The four-topology test doubled as an A/B test of LiteLLM routing strategies. Same model, same hardware, same day — only the routing strategy changed. The results are unambiguous.
simple-shuffle beats least-busy at every concurrency level for decode throughput. The gap widens with load:
- c1-c4: Both perform similarly. Not enough traffic to stress the router.
- c8-c16: simple-shuffle pulls ahead by 9-13% in decode.
- c32: shuffle +15% decode, +20% prefill, +23% better TTFT.
- c64: shuffle +68% decode (568 vs 338 t/s), +142% prefill (12,064 vs 4,982 t/s).
Variance tells the story. least-busy shows significantly higher standard deviation in decode throughput at high concurrency:
| Concurrency | shuffle stdev (t/s) | least-busy stdev (t/s) | Ratio (least-busy / shuffle) |
|---|---|---|---|
| c1 | 0.43 | 0.18 | 0.4x |
| c8 | 21.50 | 20.12 | 0.9x |
| c16 | 25.84 | 29.69 | 1.1x |
| c32 | 33.73 | 57.09 | 1.7x |
| c64 | 48.02 | 64.11 | 1.3x |
Why least-busy fails: it tracks in-flight requests and routes to the least loaded backend. With identical backends and similar request durations, this creates oscillation. Backend A finishes a batch, drops to zero in-flight. All new requests rush to A. Backend B starves. The cycle repeats. simple-shuffle uses stateless weighted random distribution — each backend gets ~50% of traffic regardless of current load. For identical backends, this is optimal.
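The pile-on can be modeled in a few lines. This is my sketch of the failure mode, not LiteLLM's actual implementation, and it deliberately simplifies: it routes a whole burst against one stale load snapshot, whereas real least-busy updates counts per request — but completion bursts recreate the same effect.

```python
import random

def assign_burst(strategy: str, snapshot: list, burst: int, rng) -> list:
    """Route a burst of requests against a stale in-flight snapshot."""
    counts = [0, 0]
    for _ in range(burst):
        if strategy == "least-busy":
            # Snapshot isn't updated mid-burst: every request picks backend 0.
            target = 0 if snapshot[0] <= snapshot[1] else 1
        else:  # simple-shuffle: stateless random pick, ignores load entirely
            target = rng.randrange(2)
        counts[target] += 1
    return counts

rng = random.Random(0)
# Backend 0 just drained (0 in flight); backend 1 is mid-batch (30 in flight).
print(assign_burst("least-busy", [0, 30], 64, rng))      # [64, 0]: starvation
print(assign_burst("simple-shuffle", [0, 30], 64, rng))  # roughly even split
```

The load-aware strategy sends the entire burst to the momentarily idle backend; the load-ignorant one splits it roughly 50/50, which for identical backends is exactly what you want.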
Never use least-busy for identical backends. It causes backend starvation at high concurrency.
When to Use Which
Neither topology dominates. The right choice depends on the workload:
| Use Case | Best Topology | Why |
|---|---|---|
| Single user / low latency (c1-c4) | Cluster TP=2 | +20-44% decode, 29-33% lower TTFT |
| Moderate concurrency (c8-c16) | Cluster TP=2 | +30-57% decode, competitive TTFT |
| High throughput / many users (c32-c64) | 2x Solo + simple-shuffle | +20% decode and +211% prefill at c64, 39% lower TTFT |
| Any 2x Solo mode | simple-shuffle routing | 15-68% better decode than least-busy at c32-c64 |
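The table collapses into a small helper. A sketch only: the thresholds are this benchmark's crossover points on this hardware, not universal constants, and `prefill_heavy` is my own name for the workload distinction.

```python
def pick_topology(concurrency: int, prefill_heavy: bool = False) -> str:
    """Choose a topology from the crossovers measured above."""
    if concurrency >= 32 or (prefill_heavy and concurrency >= 16):
        return "2x Solo + simple-shuffle"  # the only topology that survives c64
    return "Cluster TP=2"                  # best latency and decode at low concurrency

print(pick_topology(2))                       # Cluster TP=2
print(pick_topology(64))                      # 2x Solo + simple-shuffle
print(pick_topology(16, prefill_heavy=True))  # 2x Solo + simple-shuffle
```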
Cluster TP=2 — Best for decode-heavy workloads: chatbots, code generation, long-form writing. Best single-request speed across the board. Full 131K context window (the model fits comfortably across 256GB). Best choice when you need fast token generation and have moderate concurrency.
2x Solo + simple-shuffle — Best for prefill-heavy, high-concurrency workloads: batch processing, document summarization, RAG pipelines with many parallel requests. Better TTFT under load. And the only topology that doesn’t collapse at c64.
Solo — When the other node is running a different model, or when simplicity matters more than throughput.
Further Reading
- Scaling LLM Inference: DP, PP, and TP Explained — data, pipeline, and tensor parallelism strategies for inference
- LLM Transformer Inference Guide — how token generation works across GPUs
- vLLM Internals — deep dive into vLLM’s scheduling and batching
- Optimizing vLLM for MoE Models — tuning guide for mixture-of-experts inference (GPT-OSS-120B is MoE)
- Tensor Parallelism Explained (video) — visual walkthrough of TP mechanics
What’s Next
Next: The Recipe System: One Command, Zero Flag Archaeology — how run-recipe.sh turns twenty minutes of flag archaeology into one command.