# Gemma4-26B: Topology Benchmark on DGX Spark
## The Model
google/gemma-4-26B-A4B-it — 25.2B total parameters, 3.8B active per token. MoE architecture (8 active / 128 total experts + 1 shared) with FP8 quantization for inference. At 25.67 GiB, this is the smallest model in the DGX Spark benchmark series — and the one that benefits most from tensor parallelism.
Gemma4 has an unusual property: heterogeneous attention head dimensions (256 for local attention, 512 for global). This forces vLLM onto the TRITON_ATTN backend — no FlashAttention, no FlashInfer. Despite this constraint, the small active parameter count makes it fast.
## Test Setup
- Hardware: 2x NVIDIA DGX Spark (Grace Blackwell GB10, 128 GB unified each)
- Interconnect: ConnectX-7 dual 200GbE QSFP stacking link
- Image: vllm-node-20260405 (vLLM 0.19.1rc1.dev36, Transformers 5.5.0)
- Recipe: gemma4-26b-a4b.yaml (gpu_memory_utilization=0.70, quantization=fp8, kv-cache-dtype=fp8)
- Context: 131K (solo), 262K (cluster — with weights at just 13.57 GiB/node, there is ample KV budget)
- Benchmark: llama-benchy 0.3.5, pp1024 tg128, 50 runs per concurrency level
- Date: 2026-04-06
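The context limits follow from a back-of-envelope memory budget. A minimal sketch, assuming KV headroom ≈ gpu_memory_utilization × unified memory − weights (real vLLM accounting also reserves space for activations and CUDA graphs, and this treats GB and GiB loosely):

```python
# Rough KV-cache headroom per node, in GiB. Assumed formula:
# headroom = gpu_memory_utilization * unified_memory - resident weights.
def kv_headroom_gib(unified_gib: float, gpu_mem_util: float, weights_gib: float) -> float:
    return gpu_mem_util * unified_gib - weights_gib

solo = kv_headroom_gib(128, 0.70, 25.67)      # full model on one node
cluster = kv_headroom_gib(128, 0.70, 13.57)   # TP=2 shard per node

print(f"solo KV headroom:    ~{solo:.1f} GiB")
print(f"cluster KV headroom: ~{cluster:.1f} GiB per node")
```

The extra ~12 GiB of headroom per node is what lets the cluster recipe double the context window to 262K.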
## Topologies Tested
| Topology | Description |
|---|---|
| Solo | Single DGX Spark (GF2), TP=1 |
| Cluster TP=2 (no-ray) | Both nodes via PyTorch distributed, single API |
| Cluster TP=2 (ray) | Both nodes via Ray distributed, single API |
| 2x Solo + simple-shuffle | Both nodes independent, LiteLLM simple-shuffle |
## Decode Throughput
Generation throughput (tg128), tokens/s — higher is better.
| Concurrency | Solo | Cluster TP=2 | Shuffle | Winner |
|---|---|---|---|---|
| c1 | 38.68 | 57.93 | 39.59 | Cluster (+50%) |
| c2 | 65.97 | 98.16 | 72.74 | Cluster (+49%) |
| c4 | 100.44 | 148.26 | 100.65 | Cluster (+48%) |
| c8 | 152.36 | 222.59 | 160.54 | Cluster (+46%) |
| c16 | 210.95 | 304.03 | 245.31 | Cluster (+44%) |
| c32 | 281.92 | 382.80 | 342.53 | Cluster (+36%) |
| c64 | 355.73 | 477.77 | 508.09 | Shuffle (+6%) |
Cluster TP=2 dominates c1 through c32 with a consistent +36-50% advantage. The small active parameter count (3.8B) means minimal NCCL traffic between nodes — tensor parallelism scales almost linearly.
Shuffle only overtakes at c64, and even then by just 6%. For Gemma4, cluster is the clear default.
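The winner percentages reduce to simple ratios over the table values. A quick sketch (decode numbers copied from the table; percentages are cluster-vs-solo and shuffle-vs-cluster):

```python
# Decode throughput (tok/s) per concurrency: (solo, cluster TP=2, shuffle)
decode = {
    1:  (38.68,  57.93,  39.59),
    2:  (65.97,  98.16,  72.74),
    4:  (100.44, 148.26, 100.65),
    8:  (152.36, 222.59, 160.54),
    16: (210.95, 304.03, 245.31),
    32: (281.92, 382.80, 342.53),
    64: (355.73, 477.77, 508.09),
}

for c, (solo, cluster, shuffle) in decode.items():
    cluster_gain = 100 * (cluster / solo - 1)     # TP=2 vs single node
    shuffle_gain = 100 * (shuffle / cluster - 1)  # 2x solo vs TP=2
    print(f"c{c:>2}: cluster {cluster_gain:+5.1f}% vs solo, "
          f"shuffle {shuffle_gain:+5.1f}% vs cluster")
```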
## Prefill Throughput
Prompt processing throughput (pp1024), tokens/s — higher is better.
| Concurrency | Solo | Cluster TP=2 | Shuffle | Winner |
|---|---|---|---|---|
| c1 | 4,807 | 6,034 | 4,474 | Cluster (+26%) |
| c2 | 6,204 | 7,312 | 7,272 | Cluster (tie) |
| c4 | 6,478 | 7,651 | 7,669 | Shuffle (tie) |
| c8 | 6,965 | 8,487 | 9,655 | Shuffle (+14%) |
| c16 | 7,191 | 9,015 | 10,172 | Shuffle (+13%) |
| c32 | 7,264 | 9,147 | 11,573 | Shuffle (+27%) |
| c64 | 7,177 | 9,165 | 12,989 | Shuffle (+42%) |
The prefill crossover happens earlier than with GPT-OSS-120B — shuffle takes over at c4. Two independent nodes processing separate prefills outpace a synchronized cluster once there's enough concurrent work.
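The crossover point can be read mechanically from the table. A small sketch (prefill values copied from above):

```python
# Prefill throughput (tok/s) per concurrency: (cluster TP=2, 2x solo shuffle)
prefill = {
    1:  (6034, 4474),
    2:  (7312, 7272),
    4:  (7651, 7669),
    8:  (8487, 9655),
    16: (9015, 10172),
    32: (9147, 11573),
    64: (9165, 12989),
}

# First concurrency level where two independent nodes beat the synced cluster.
crossover = next(c for c in sorted(prefill) if prefill[c][1] > prefill[c][0])
print(f"shuffle overtakes cluster prefill at c{crossover}")
```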
## Time to First Token (TTFT)
End-to-end latency to first generated token (ms) — lower is better.
| Concurrency | Solo | Cluster TP=2 | Shuffle | Winner |
|---|---|---|---|---|
| c1 | 223 | 181 | 249 | Cluster (-19%) |
| c2 | 328 | 269 | 282 | Cluster (-5% vs Shuffle) |
| c4 | 540 | 417 | 536 | Cluster (-22%) |
| c8 | 1,081 | 858 | 656 | Shuffle (-24%) |
| c16 | 1,810 | 1,361 | 1,166 | Shuffle (-14%) |
| c32 | 3,091 | 2,330 | 1,888 | Shuffle (-19%) |
| c64 | 5,431 | 4,125 | 3,168 | Shuffle (-23%) |
Gemma4 has the best absolute TTFT of any model tested: 181 ms at c1 (cluster). The crossover is at c8 — below that, cluster's faster per-request prefill wins; from c8 onward, shuffle's independent processing takes over.
## Ray vs No-Ray: The Free Performance Win
Cluster TP=2 was tested with both Ray and PyTorch distributed (--no-ray). No-ray wins decode throughput (tok/s) at every concurrency level:
| Concurrency | Ray | No-Ray | Gain |
|---|---|---|---|
| c1 | 53.64 | 57.93 | +8.0% |
| c2 | 92.50 | 98.16 | +6.1% |
| c4 | 141.75 | 148.26 | +4.6% |
| c8 | 216.32 | 222.59 | +2.9% |
| c16 | 297.85 | 304.03 | +2.1% |
| c32 | 378.30 | 382.80 | +1.2% |
| c64 | 470.32 | 477.77 | +1.6% |
The biggest gain is at c1 (+8%), where Ray's per-request overhead — object store serialization, GCS lookup, worker scheduling — is a measurable fraction of total request time. With only 3.8B active params, per-token compute is cheap, so the overhead is proportionally larger than for bigger models.
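That overhead can be roughly sized from the table: converting throughput to per-token wall time and differencing gives Ray's apparent cost per generated token. A sketch under that assumption (at c>1 this is amortized across the whole batch, not a per-request figure):

```python
# Decode throughput (tok/s): concurrency -> (Ray, no-Ray)
runs = {
    1:  (53.64,  57.93),
    2:  (92.50,  98.16),
    4:  (141.75, 148.26),
    8:  (216.32, 222.59),
    16: (297.85, 304.03),
    32: (378.30, 382.80),
    64: (470.32, 477.77),
}

for c, (ray, noray) in runs.items():
    # Per-token wall time difference, in milliseconds.
    overhead_ms = 1000 / ray - 1000 / noray
    print(f"c{c:>2}: ~{overhead_ms:.2f} ms/token of apparent Ray overhead")
```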
Always use --no-ray for Gemma4 cluster mode. It’s free performance.
## When to Use Which Topology
| Use Case | Best Topology | Why |
|---|---|---|
| Single user / low latency (c1-c4) | Cluster TP=2 (no-ray) | +48-50% decode, 181ms TTFT |
| Moderate concurrency (c8-c32) | Cluster TP=2 (no-ray) | +36-46% decode, still dominant |
| Maximum throughput (c64) | 2x Solo + simple-shuffle | +6% decode, +42% prefill |
| Prefill-heavy workloads (c4+) | 2x Solo + simple-shuffle | Up to +42% prefill at c64 |
| Latency-sensitive at scale (c8+) | 2x Solo + simple-shuffle | -14% to -24% better TTFT |
Cluster TP=2 with --no-ray is the recommended default for Gemma4. It wins decode by wide margins through c32. The crossover to shuffle only happens at c64, and even then the decode gain is modest (+6%).
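The recommendation table collapses into a small routing helper. This is a hypothetical sketch (the function name is mine; the thresholds come from the crossovers measured above), not part of any shipped tooling:

```python
def pick_topology(concurrency: int,
                  prefill_heavy: bool = False,
                  latency_sensitive: bool = False) -> str:
    """Pick a DGX Spark topology for Gemma4 from the benchmark crossovers."""
    if prefill_heavy and concurrency >= 4:      # shuffle wins prefill from c4
        return "2x solo + simple-shuffle"
    if latency_sensitive and concurrency >= 8:  # shuffle wins TTFT from c8
        return "2x solo + simple-shuffle"
    if concurrency >= 64:                       # shuffle wins decode only at c64
        return "2x solo + simple-shuffle"
    return "cluster TP=2 (--no-ray)"            # default: best decode c1-c32

print(pick_topology(1))
print(pick_topology(16, prefill_heavy=True))
```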
## Hardware Note: GF1 Safe
Gemma4 is the safest model for the GF1 hardware issue. At 25.67 GiB solo (13.57 GiB/node in cluster), memory pressure is well below the threshold that triggers crashes with larger models. All benchmark runs completed without stability issues.
## Technical Notes
- Attention backend: TRITON_ATTN forced by heterogeneous head dimensions (256 local, 512 global). FlashAttention/FlashInfer not available for this architecture
- FP8 KV cache: No calibrated scaling factors in checkpoint — vLLM uses default scale 1.0
- GLOO_SOCKET_IFNAME: Required for cluster mode on vLLM v0.19.x (set to enp1s0f0np0 here)
- FP8 is the bottleneck: 3.8B active at FP8 does more bit-work per token than GPT-OSS's 5.1B at MXFP4 (30.4 Gbit vs 20.4 Gbit). An MXFP4 Gemma4 checkpoint would likely be the fastest model on this hardware
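The bit-work claim is simple arithmetic. A sketch that counts only active weight bits (ignoring attention over the KV cache and MXFP4's block-scale metadata):

```python
# Weight bits streamed per generated token = active params x bits per weight.
gemma4_gbit  = 3.8e9 * 8 / 1e9   # Gemma4: 3.8B active at FP8
gpt_oss_gbit = 5.1e9 * 4 / 1e9   # GPT-OSS: 5.1B active at MXFP4

print(f"Gemma4:  {gemma4_gbit:.1f} Gbit/token")
print(f"GPT-OSS: {gpt_oss_gbit:.1f} Gbit/token")
```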