Gemma4-26B: Topology Benchmark on DGX Spark
The Model

google/gemma-4-26B-A4B-it — 25.2B total parameters, 3.8B active per token. MoE architecture (8 active / 128 total experts + 1 shared) with FP8 quantization for inference. At 25.67 GiB, this is the smallest model in the DGX Spark benchmark series — and the one that benefits most from tensor parallelism.

Gemma4 has an unusual property: heterogeneous attention head dimensions (256 for local attention, 512 for global). This forces vLLM onto the TRITON_ATTN backend; FlashAttention and FlashInfer are unavailable for this architecture. Despite that constraint, the small active parameter count keeps it fast.

Test Setup

  • Hardware: 2x NVIDIA DGX Spark (Grace Blackwell GB10, 128 GB unified each)
  • Interconnect: ConnectX-7 dual 200GbE QSFP stacking link
  • Image: vllm-node-20260405 (vLLM 0.19.1rc1.dev36, Transformers 5.5.0)
  • Recipe: gemma4-26b-a4b.yaml (gpu_memory_utilization=0.70, quantization=fp8, kv-cache-dtype=fp8)
  • Context: 131K (solo), 262K (cluster — enough KV budget at 13.57 GiB/node)
  • Benchmark: llama-benchy 0.3.5, pp1024 tg128, 50 runs per concurrency level
  • Date: 2026-04-06

Topologies Tested

| Topology | Description |
|---|---|
| Solo | Single DGX Spark (GF2), TP=1 |
| Cluster TP=2 (no-ray) | Both nodes via PyTorch distributed, single API |
| Cluster TP=2 (ray) | Both nodes via Ray distributed, single API |
| 2x Solo + simple-shuffle | Both nodes independent, LiteLLM simple-shuffle |

Decode Throughput

Generation speed (tg128) — higher is better.

| Concurrency | Solo | Cluster TP=2 | Shuffle | Winner |
|---|---|---|---|---|
| c1 | 38.68 | 57.93 | 39.59 | Cluster (+50%) |
| c2 | 65.97 | 98.16 | 72.74 | Cluster (+49%) |
| c4 | 100.44 | 148.26 | 100.65 | Cluster (+48%) |
| c8 | 152.36 | 222.59 | 160.54 | Cluster (+46%) |
| c16 | 210.95 | 304.03 | 245.31 | Cluster (+44%) |
| c32 | 281.92 | 382.80 | 342.53 | Cluster (+36%) |
| c64 | 355.73 | 477.77 | 508.09 | Shuffle (+6%) |
```mermaid
---
config:
  themeVariables:
    xyChart:
      titleColor: "#333"
      plotColorPalette: "#6b7280, #3b82f6, #22c55e"
---
xychart-beta
  title "Decode Throughput (tokens/s) — Gemma4-26B"
  x-axis ["c1", "c2", "c4", "c8", "c16", "c32", "c64"]
  y-axis "tokens/s" 0 --> 550
  line "Solo" [38.68, 65.97, 100.44, 152.36, 210.95, 281.92, 355.73]
  line "Cluster TP=2" [57.93, 98.16, 148.26, 222.59, 304.03, 382.80, 477.77]
  line "Shuffle 2x" [39.59, 72.74, 100.65, 160.54, 245.31, 342.53, 508.09]
```

Legend: ━━ Solo · ━━ Cluster TP=2 · ━━ Shuffle 2x

Cluster TP=2 dominates c1 through c32 with a consistent +36-50% advantage. The small active parameter count (3.8B) means minimal NCCL traffic between nodes, so tensor parallelism scales almost linearly.
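A quick sanity check on the speedup column, recomputed from the decode table above (plain Python, no dependencies):

```python
# Cluster TP=2 decode advantage over solo, per concurrency level.
# Numbers are the tokens/s figures from the decode table above.
levels  = ["c1", "c2", "c4", "c8", "c16", "c32", "c64"]
solo    = [38.68, 65.97, 100.44, 152.36, 210.95, 281.92, 355.73]
cluster = [57.93, 98.16, 148.26, 222.59, 304.03, 382.80, 477.77]

for level, s, c in zip(levels, solo, cluster):
    print(f"{level}: +{(c / s - 1) * 100:.0f}% vs solo")
```

The gain vs solo decays slowly (from +50% at c1 to +34% at c64) because the all-reduce cost per token is small relative to even this model's cheap per-token compute.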

Shuffle only overtakes at c64, and even then by just 6%. For Gemma4, cluster is the clear default.

Prefill Throughput

Prompt processing speed (pp1024) — higher is better.

| Concurrency | Solo | Cluster TP=2 | Shuffle | Winner |
|---|---|---|---|---|
| c1 | 4,807 | 6,034 | 4,474 | Cluster (+26%) |
| c2 | 6,204 | 7,312 | 7,272 | Cluster (tie) |
| c4 | 6,478 | 7,651 | 7,669 | Shuffle (tie) |
| c8 | 6,965 | 8,487 | 9,655 | Shuffle (+14%) |
| c16 | 7,191 | 9,015 | 10,172 | Shuffle (+13%) |
| c32 | 7,264 | 9,147 | 11,573 | Shuffle (+27%) |
| c64 | 7,177 | 9,165 | 12,989 | Shuffle (+42%) |
```mermaid
---
config:
  themeVariables:
    xyChart:
      titleColor: "#333"
      plotColorPalette: "#6b7280, #3b82f6, #22c55e"
---
xychart-beta
  title "Prefill Throughput (tokens/s) — Gemma4-26B"
  x-axis ["c1", "c2", "c4", "c8", "c16", "c32", "c64"]
  y-axis "tokens/s" 0 --> 14000
  line "Solo" [4807, 6204, 6478, 6965, 7191, 7264, 7177]
  line "Cluster TP=2" [6034, 7312, 7651, 8487, 9015, 9147, 9165]
  line "Shuffle 2x" [4474, 7272, 7669, 9655, 10172, 11573, 12989]
```

Legend: ━━ Solo · ━━ Cluster TP=2 · ━━ Shuffle 2x

The prefill crossover happens earlier than with GPT-OSS-120B: shuffle takes over at c4. Two independent nodes processing separate prefills outpace a synchronized cluster once there is enough concurrent work.
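The crossover point can be read mechanically from the prefill table above; a small sketch:

```python
# Find the first concurrency level where shuffle prefill beats cluster prefill.
# Values are the tokens/s figures from the prefill table above.
levels  = ["c1", "c2", "c4", "c8", "c16", "c32", "c64"]
cluster = [6034, 7312, 7651, 8487, 9015, 9147, 9165]
shuffle = [4474, 7272, 7669, 9655, 10172, 11573, 12989]

crossover = next(c for c, cl, sh in zip(levels, cluster, shuffle) if sh > cl)
print(crossover)  # → c4
```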

Time to First Token (TTFT)

End-to-end latency to first generated token (ms) — lower is better.

| Concurrency | Solo | Cluster TP=2 | Shuffle | Winner |
|---|---|---|---|---|
| c1 | 223 | 181 | 249 | Cluster (-19%) |
| c2 | 328 | 269 | 282 | Cluster (-5% vs Shuffle) |
| c4 | 540 | 417 | 536 | Cluster (-22%) |
| c8 | 1,081 | 858 | 656 | Shuffle (-24%) |
| c16 | 1,810 | 1,361 | 1,166 | Shuffle (-14%) |
| c32 | 3,091 | 2,330 | 1,888 | Shuffle (-19%) |
| c64 | 5,431 | 4,125 | 3,168 | Shuffle (-23%) |
```mermaid
---
config:
  themeVariables:
    xyChart:
      titleColor: "#333"
      plotColorPalette: "#6b7280, #3b82f6, #22c55e"
---
xychart-beta
  title "TTFT (ms, lower is better) — Gemma4-26B"
  x-axis ["c1", "c2", "c4", "c8", "c16", "c32", "c64"]
  y-axis "ms" 0 --> 5500
  line "Solo" [223, 328, 540, 1081, 1810, 3091, 5431]
  line "Cluster TP=2" [181, 269, 417, 858, 1361, 2330, 4125]
  line "Shuffle 2x" [249, 282, 536, 656, 1166, 1888, 3168]
```

Legend: ━━ Solo · ━━ Cluster TP=2 · ━━ Shuffle 2x

Gemma4 has the best absolute TTFT of any model tested: 181ms at c1 (cluster). The crossover is at c8 — below that, cluster’s faster per-request prefill wins. Above c8, shuffle’s independent processing takes over.
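The same crossover check works on latency, with the comparison flipped since lower is better (values from the TTFT table above):

```python
# Find the first concurrency level where shuffle TTFT is lower than cluster TTFT.
levels  = ["c1", "c2", "c4", "c8", "c16", "c32", "c64"]
cluster = [181, 269, 417, 858, 1361, 2330, 4125]   # ms
shuffle = [249, 282, 536, 656, 1166, 1888, 3168]   # ms

crossover = next(c for c, cl, sh in zip(levels, cluster, shuffle) if sh < cl)
print(crossover)  # → c8
```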

Ray vs No-Ray: Free Performance

Cluster TP=2 was tested with both Ray and PyTorch distributed (--no-ray). No-ray wins at every concurrency level:

| Concurrency | Ray | No-Ray | Gain |
|---|---|---|---|
| c1 | 53.64 | 57.93 | +8.0% |
| c2 | 92.50 | 98.16 | +6.1% |
| c4 | 141.75 | 148.26 | +4.6% |
| c8 | 216.32 | 222.59 | +2.9% |
| c16 | 297.85 | 304.03 | +2.1% |
| c32 | 378.30 | 382.80 | +1.2% |
| c64 | 470.32 | 477.77 | +1.6% |
```mermaid
---
config:
  themeVariables:
    xyChart:
      titleColor: "#333"
      plotColorPalette: "#ef4444, #22c55e"
---
xychart-beta
  title "Ray vs No-Ray Decode (tokens/s) — Gemma4-26B Cluster TP=2"
  x-axis ["c1", "c2", "c4", "c8", "c16", "c32", "c64"]
  y-axis "tokens/s" 0 --> 500
  line "Ray" [53.64, 92.50, 141.75, 216.32, 297.85, 378.30, 470.32]
  line "No-Ray" [57.93, 98.16, 148.26, 222.59, 304.03, 382.80, 477.77]
```

Legend: ━━ Ray · ━━ No-Ray

The biggest gain is at c1 (+8%) where Ray’s per-request overhead — object store serialization, GCS lookup, worker scheduling — is a measurable fraction of total request time. With only 3.8B active params, per-token compute is cheap, so the overhead is proportionally larger than for bigger models.
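The overhead-amortization story is visible directly in the gain column, recomputed here from the table above:

```python
# No-ray gain over Ray per concurrency level (decode tokens/s from the table above).
# The fixed per-request overhead amortizes as batches grow, so the gain shrinks.
ray    = [53.64, 92.50, 141.75, 216.32, 297.85, 378.30, 470.32]
no_ray = [57.93, 98.16, 148.26, 222.59, 304.03, 382.80, 477.77]

gains = [(n / r - 1) * 100 for r, n in zip(ray, no_ray)]
print([f"+{g:.1f}%" for g in gains])  # +8.0% at c1 down to ~+1-2% at c32/c64
```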

Always use --no-ray for Gemma4 cluster mode. It’s free performance.

When to Use Which Topology

| Use Case | Best Topology | Why |
|---|---|---|
| Single user / low latency (c1-c4) | Cluster TP=2 (no-ray) | +48-50% decode, 181 ms TTFT |
| Moderate concurrency (c8-c32) | Cluster TP=2 (no-ray) | +36-46% decode, still dominant |
| Maximum throughput (c64) | 2x Solo + simple-shuffle | +6% decode, +42% prefill |
| Prefill-heavy workloads (c4+) | 2x Solo + simple-shuffle | Up to +42% prefill at c64 |
| Latency-sensitive at scale (c8+) | 2x Solo + simple-shuffle | 14-24% lower TTFT |

Cluster TP=2 with --no-ray is the recommended default for Gemma4. It wins decode by wide margins through c32. The crossover to shuffle only happens at c64, and even then the decode gain is modest (+6%).

Hardware Note: GF1 Safe

Gemma4 is the safest model for the GF1 hardware issue. At 25.67 GiB solo (13.57 GiB/node in cluster), memory pressure is well below the threshold that triggers crashes with larger models. All benchmark runs completed without stability issues.
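The per-node figure is slightly above a naive even split of the weights; a back-of-the-envelope sketch (the attribution of the gap to unsharded/replicated components is an assumption, not something measured here):

```python
# Rough per-node weight footprint under TP=2, from the figures quoted above.
total_gib   = 25.67            # solo weight footprint (reported)
measured    = 13.57            # per-node footprint in cluster mode (reported)

naive_split = total_gib / 2    # 12.835 GiB if everything sharded evenly
overhead    = measured - naive_split  # ~0.7 GiB, presumably unsharded layers etc.
print(f"naive split: {naive_split:.2f} GiB, extra per node: {overhead:.2f} GiB")
```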

Technical Notes

  • Attention backend: TRITON_ATTN forced by heterogeneous head dimensions (256 local, 512 global). FlashAttention/FlashInfer not available for this architecture
  • FP8 KV cache: No calibrated scaling factors in checkpoint — vLLM uses default scale 1.0
  • GLOO_SOCKET_IFNAME: Required for cluster mode on vLLM v0.19.x (enp1s0f0np0)
  • FP8 is the bottleneck: 3.8B active at FP8 does more bit-work per token than GPT-OSS’s 5.1B at MXFP4 (30.4 Gbit vs 20.4 Gbit). An MXFP4 Gemma4 checkpoint would likely be the fastest model on this hardware
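The bit-work comparison in the last bullet is just active parameters times bits per weight:

```python
# Per-token weight-read budget: active params x bits per weight.
gemma4_fp8   = 3.8e9 * 8 / 1e9   # Gemma4, FP8  -> 30.4 Gbit per token
gptoss_mxfp4 = 5.1e9 * 4 / 1e9   # GPT-OSS, MXFP4 -> 20.4 Gbit per token
print(gemma4_fp8, gptoss_mxfp4)
```

On a bandwidth-bound decode path, that ~1.5x difference in bytes moved per token is why a hypothetical MXFP4 Gemma4 checkpoint (15.2 Gbit per token) could plausibly be faster still.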