# Gemma4-26B: Topology Benchmark on DGX Spark
## The Model
google/gemma-4-26B-A4B-it — 25.2B total parameters, 3.8B active per token. MoE architecture (8 active / 128 total experts + 1 shared) with FP8 quantization for inference. At 25.67 GiB, this is the smallest model in the DGX Spark benchmark series — and the one that benefits most from tensor parallelism.
Gemma4 has an unusual property: heterogeneous attention head dimensions (256 for local attention, 512 for global). This forces vLLM onto the TRITON_ATTN backend — no FlashAttention, no FlashInfer. Despite this constraint, the small active parameter count makes it fast.
## Test Setup
- Hardware: 2x NVIDIA DGX Spark (Grace Blackwell GB10, 128 GB unified each)
- Interconnect: ConnectX-7 dual 200GbE QSFP stacking link
- Image: vllm-node-20260405 (vLLM 0.19.1rc1.dev36, Transformers 5.5.0)
- Recipe: gemma4-26b-a4b.yaml (gpu_memory_utilization=0.70, quantization=fp8, kv-cache-dtype=fp8)
- Context: 131K (solo), 262K (cluster — with weights at just 13.57 GiB/node, there is ample KV budget)
- Benchmark: llama-benchy 0.3.5, pp1024 tg128, 50 runs per concurrency level
- Date: 2026-04-06
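The context limits follow from a back-of-envelope memory budget. A minimal sketch, assuming KV headroom ≈ gpu_memory_utilization × unified memory − weights (real vLLM accounting also reserves space for activations and CUDA graphs, and this treats GB and GiB loosely):

```python
# Rough KV-cache headroom per node, in GiB. Assumed formula:
# headroom = gpu_memory_utilization * unified_memory - resident weights.
def kv_headroom_gib(unified_gib: float, gpu_mem_util: float, weights_gib: float) -> float:
    return gpu_mem_util * unified_gib - weights_gib

solo = kv_headroom_gib(128, 0.70, 25.67)      # full model on one node
cluster = kv_headroom_gib(128, 0.70, 13.57)   # TP=2 shard per node

print(f"solo KV headroom:    ~{solo:.1f} GiB")
print(f"cluster KV headroom: ~{cluster:.1f} GiB per node")
```

The extra ~12 GiB of headroom per node is what lets the cluster recipe double the context window to 262K.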
## Topologies Tested
| Topology | Description |
|---|---|
| Solo | Single DGX Spark (GF2), TP=1 |
| Cluster TP=2 (no-ray) | Both nodes via PyTorch distributed, single API |
| Cluster TP=2 (ray) | Both nodes via Ray distributed, single API |
| 2x Solo + simple-shuffle | Both nodes independent, LiteLLM simple-shuffle |
## Decode Throughput
Generation throughput (tg128), tokens/s — higher is better.
| Concurrency | Solo | Cluster TP=2 | Shuffle | Winner |
|---|---|---|---|---|
| c1 | 38.68 | 57.93 | 39.59 | Cluster (+50%) |
| c2 | 65.97 | 98.16 | 72.74 | Cluster (+49%) |
| c4 | 100.44 | 148.26 | 100.65 | Cluster (+48%) |
| c8 | 152.36 | 222.59 | 160.54 | Cluster (+46%) |
| c16 | 210.95 | 304.03 | 245.31 | Cluster (+44%) |
| c32 | 281.92 | 382.80 | 342.53 | Cluster (+36%) |
| c64 | 355.73 | 477.77 | 508.09 | Shuffle (+6%) |
Cluster TP=2 dominates c1 through c32 with a consistent +36-50% advantage. The small active parameter count (3.8B) means minimal NCCL traffic between nodes — tensor parallelism scales almost linearly.
Shuffle only overtakes at c64, and even then by just 6%. For Gemma4, cluster is the clear default.
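The winner percentages reduce to simple ratios over the table values. A quick sketch (decode numbers copied from the table; percentages are cluster-vs-solo and shuffle-vs-cluster):

```python
# Decode throughput (tok/s) per concurrency: (solo, cluster TP=2, shuffle)
decode = {
    1:  (38.68,  57.93,  39.59),
    2:  (65.97,  98.16,  72.74),
    4:  (100.44, 148.26, 100.65),
    8:  (152.36, 222.59, 160.54),
    16: (210.95, 304.03, 245.31),
    32: (281.92, 382.80, 342.53),
    64: (355.73, 477.77, 508.09),
}

for c, (solo, cluster, shuffle) in decode.items():
    cluster_gain = 100 * (cluster / solo - 1)     # TP=2 vs single node
    shuffle_gain = 100 * (shuffle / cluster - 1)  # 2x solo vs TP=2
    print(f"c{c:>2}: cluster {cluster_gain:+5.1f}% vs solo, "
          f"shuffle {shuffle_gain:+5.1f}% vs cluster")
```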
## Prefill Throughput
Prompt processing throughput (pp1024), tokens/s — higher is better.
| Concurrency | Solo | Cluster TP=2 | Shuffle | Winner |
|---|---|---|---|---|
| c1 | 4,807 | 6,034 | 4,474 | Cluster (+26%) |
| c2 | 6,204 | 7,312 | 7,272 | Cluster (tie) |
| c4 | 6,478 | 7,651 | 7,669 | Shuffle (tie) |
| c8 | 6,965 | 8,487 | 9,655 | Shuffle (+14%) |
| c16 | 7,191 | 9,015 | 10,172 | Shuffle (+13%) |
| c32 | 7,264 | 9,147 | 11,573 | Shuffle (+27%) |
| c64 | 7,177 | 9,165 | 12,989 | Shuffle (+42%) |
The prefill crossover happens earlier than with GPT-OSS-120B — shuffle takes over at c4. Two independent nodes processing separate prefills outpace a synchronized cluster once there's enough concurrent work.
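The crossover point can be read mechanically from the table. A small sketch (prefill values copied from above):

```python
# Prefill throughput (tok/s) per concurrency: (cluster TP=2, 2x solo shuffle)
prefill = {
    1:  (6034, 4474),
    2:  (7312, 7272),
    4:  (7651, 7669),
    8:  (8487, 9655),
    16: (9015, 10172),
    32: (9147, 11573),
    64: (9165, 12989),
}

# First concurrency level where two independent nodes beat the synced cluster.
crossover = next(c for c in sorted(prefill) if prefill[c][1] > prefill[c][0])
print(f"shuffle overtakes cluster prefill at c{crossover}")
```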
## Time to First Token (TTFT)
End-to-end latency to first generated token (ms) — lower is better.
| Concurrency | Solo | Cluster TP=2 | Shuffle | Winner |
|---|---|---|---|---|
| c1 | 223 | 181 | 249 | Cluster (-19%) |
| c2 | 328 | 269 | 282 | Cluster (-5% vs Shuffle) |
| c4 | 540 | 417 | 536 | Cluster (-22%) |
| c8 | 1,081 | 858 | 656 | Shuffle (-24%) |
| c16 | 1,810 | 1,361 | 1,166 | Shuffle (-14%) |
| c32 | 3,091 | 2,330 | 1,888 | Shuffle (-19%) |
| c64 | 5,431 | 4,125 | 3,168 | Shuffle (-23%) |
Gemma4 has the best absolute TTFT of any model tested: 181 ms at c1 (cluster). The crossover is at c8 — below that, cluster's faster per-request prefill wins; from c8 onward, shuffle's independent processing takes over.
## Ray vs No-Ray: The Free Performance Win
Cluster TP=2 was tested with both Ray and PyTorch distributed (--no-ray). No-ray wins decode throughput (tok/s) at every concurrency level:
| Concurrency | Ray | No-Ray | Gain |
|---|---|---|---|
| c1 | 53.64 | 57.93 | +8.0% |
| c2 | 92.50 | 98.16 | +6.1% |
| c4 | 141.75 | 148.26 | +4.6% |
| c8 | 216.32 | 222.59 | +2.9% |
| c16 | 297.85 | 304.03 | +2.1% |
| c32 | 378.30 | 382.80 | +1.2% |
| c64 | 470.32 | 477.77 | +1.6% |
The biggest gain is at c1 (+8%), where Ray's per-request overhead — object store serialization, GCS lookup, worker scheduling — is a measurable fraction of total request time. With only 3.8B active params, per-token compute is cheap, so the overhead is proportionally larger than for bigger models.
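That overhead can be roughly sized from the table: converting throughput to per-token wall time and differencing gives Ray's apparent cost per generated token. A sketch under that assumption (at c>1 this is amortized across the whole batch, not a per-request figure):

```python
# Decode throughput (tok/s): concurrency -> (Ray, no-Ray)
runs = {
    1:  (53.64,  57.93),
    2:  (92.50,  98.16),
    4:  (141.75, 148.26),
    8:  (216.32, 222.59),
    16: (297.85, 304.03),
    32: (378.30, 382.80),
    64: (470.32, 477.77),
}

for c, (ray, noray) in runs.items():
    # Per-token wall time difference, in milliseconds.
    overhead_ms = 1000 / ray - 1000 / noray
    print(f"c{c:>2}: ~{overhead_ms:.2f} ms/token of apparent Ray overhead")
```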
Always use --no-ray for Gemma4 cluster mode. It’s free performance.
## When to Use Which Topology
| Use Case | Best Topology | Why |
|---|---|---|
| Single user / low latency (c1-c4) | Cluster TP=2 (no-ray) | +48-50% decode, 181ms TTFT |
| Moderate concurrency (c8-c32) | Cluster TP=2 (no-ray) | +36-46% decode, still dominant |
| Maximum throughput (c64) | 2x Solo + simple-shuffle | +6% decode, +42% prefill |
| Prefill-heavy workloads (c4+) | 2x Solo + simple-shuffle | Up to +42% prefill at c64 |
| Latency-sensitive at scale (c8+) | 2x Solo + simple-shuffle | -14% to -24% better TTFT |
Cluster TP=2 with --no-ray is the recommended default for Gemma4. It wins decode by wide margins through c32. The crossover to shuffle only happens at c64, and even then the decode gain is modest (+6%).
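The recommendation table collapses into a small routing helper. This is a hypothetical sketch (the function name is mine; the thresholds come from the crossovers measured above), not part of any shipped tooling:

```python
def pick_topology(concurrency: int,
                  prefill_heavy: bool = False,
                  latency_sensitive: bool = False) -> str:
    """Pick a DGX Spark topology for Gemma4 from the benchmark crossovers."""
    if prefill_heavy and concurrency >= 4:      # shuffle wins prefill from c4
        return "2x solo + simple-shuffle"
    if latency_sensitive and concurrency >= 8:  # shuffle wins TTFT from c8
        return "2x solo + simple-shuffle"
    if concurrency >= 64:                       # shuffle wins decode only at c64
        return "2x solo + simple-shuffle"
    return "cluster TP=2 (--no-ray)"            # default: best decode c1-c32

print(pick_topology(1))
print(pick_topology(16, prefill_heavy=True))
```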
## Hardware Note: GF1 Safe
Gemma4 is the safest model for the GF1 hardware issue. At 25.67 GiB solo (13.57 GiB/node in cluster), memory pressure is well below the threshold that triggers crashes with larger models. All benchmark runs completed without stability issues.
## Technical Notes
- Attention backend: TRITON_ATTN forced by heterogeneous head dimensions (256 local, 512 global). FlashAttention/FlashInfer not available for this architecture
- FP8 KV cache: No calibrated scaling factors in checkpoint — vLLM uses default scale 1.0
- GLOO_SOCKET_IFNAME: Required for cluster mode on vLLM v0.19.x (set to enp1s0f0np0 here)
- FP8 is the bottleneck: 3.8B active at FP8 does more bit-work per token than GPT-OSS's 5.1B at MXFP4 (30.4 Gbit vs 20.4 Gbit). An MXFP4 Gemma4 checkpoint would likely be the fastest model on this hardware
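The bit-work claim is simple arithmetic. A sketch that counts only active weight bits (ignoring attention over the KV cache and MXFP4's block-scale metadata):

```python
# Weight bits streamed per generated token = active params x bits per weight.
gemma4_gbit  = 3.8e9 * 8 / 1e9   # Gemma4: 3.8B active at FP8
gpt_oss_gbit = 5.1e9 * 4 / 1e9   # GPT-OSS: 5.1B active at MXFP4

print(f"Gemma4:  {gemma4_gbit:.1f} Gbit/token")
print(f"GPT-OSS: {gpt_oss_gbit:.1f} Gbit/token")
```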