4/9 Benchmarking Reality: llama-benchy and the Spark Arena

· 8 min read

Previous: Switching to vLLM: From 40 t/s to 3,975 t/s

Post 3 ended with a promise: we claimed big throughput numbers, and Post 4 would put them under the microscope. Here’s the microscope.

Why llama-benchy

The throughput numbers from Post 3 came from vllm bench serve, vLLM’s built-in benchmarking tool. Good tool. But when I went to compare my results with other Spark owners in the community forum, nobody was using it. Everyone was posting llama-benchy numbers.

llama-benchy is a community benchmark tool built by eugr — the same person who built the dual-Spark vLLM Docker setup from Post 3. He created it because the existing llama-bench only works with llama.cpp and can’t test vLLM, SGLang, or any OpenAI-compatible backend. llama-benchy fills that gap — it works with any backend that exposes an OpenAI-compatible API. The announcement thread explains the motivation.

I switched to it for a simple reason: if the community uses llama-benchy, I need to speak the same language. Posting vllm bench serve numbers when everyone else is posting llama-benchy numbers is like bringing kilometers to a miles conversation. The numbers might be similar, but you can’t compare directly without doing conversion in your head.


What llama-benchy Measures

llama-benchy runs in batch mode: fire N concurrent requests, wait for all to finish, repeat. It separates prefill (pp) and decode (tg) into independent tests, so you see each phase’s peak capability without interference.

A typical run sweeps across concurrency levels in a single command:

uvx llama-benchy \
  --api-url http://goldfinger:8000/v1 \
  --model openai/gpt-oss-120b \
  --concurrency 1 2 4 8 16 32 64 \
  --depth 0 1024 \
  --gen-tokens 128 \
  --runs 50

--depth 0 1024 runs two tests: decode-only (depth 0) and prefill with 1024 input tokens. --gen-tokens 128 sets output length. 50 runs per configuration gives you mean and standard deviation — enough data to know whether the numbers are stable.


The Results

Test config: GoldFinger2, solo TP=1, vLLM 0.1.dev12777, openai/gpt-oss-120b MXFP4, 1024 input / 128 output tokens, 50 runs per configuration. (Side note: the original GoldFinger crashed during llama-benchy at c32 — DNS exhaustion from high-concurrency batch requests over LAN. All benchmarks were re-run on GoldFinger2 using IP addresses to avoid DNS resolution entirely.)

Decode Throughput

This is the number that matters for agentic workloads — how fast tokens come out.

| Concurrency | t/s (total) | +/- stddev | CV% |
|---|---|---|---|
| 1 | 57.22 | 0.15 | 0.3% |
| 2 | 77.81 | 0.90 | 1.2% |
| 4 | 107.40 | 1.99 | 1.9% |
| 8 | 150.62 | 2.44 | 1.6% |
| 16 | 214.47 | 3.14 | 1.5% |
| 32 | 314.36 | 3.76 | 1.2% |
| 64 | 452.54 | 4.13 | 0.9% |

CV% under 2% across every concurrency level. With 50 runs, these numbers are stable — not noise, not lucky outliers.
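The CV% column is just the standard deviation expressed as a percentage of the mean. A quick sanity check on the c1 row:

```python
# Coefficient of variation: stddev as a percentage of the mean,
# using the c1 decode row from the table above.
mean_tps = 57.22
stddev = 0.15

cv_pct = stddev / mean_tps * 100
print(f"CV = {cv_pct:.1f}%")  # CV = 0.3%
```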

57 t/s single-user, scaling to 452 t/s at 64 concurrent requests. The hardware gets more efficient under load, not less. This is the concurrency story from Post 3, now with proper statistical backing.

Prefill Throughput

Prefill is how fast the model processes input tokens — the “reading” phase before generation starts.

| Concurrency | pp t/s (total) | +/- stddev |
|---|---|---|
| 1 | 3,310 | 50.93 |
| 4 | 4,579 | 359.07 |
| 16 | 6,918 | 224.56 |
| 32 | 6,991 | 182.41 |
| 64 | 6,809 | 105.48 |

Prefill throughput plateaus around 7,000 t/s at c16-c32, then drops slightly at c64. The GPU’s prefill capacity saturates before decode does — prefill is compute-bound while decode is memory-bandwidth-bound, and they scale differently.
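These prefill numbers also predict first-token latency. At c1, processing the 1024-token prompt at ~3,310 t/s should take about:

```python
# Time to prefill the 1024-token prompt at the c1 rate from the table above.
prompt_tokens = 1024
prefill_tps = 3310

prefill_ms = prompt_tokens / prefill_tps * 1000
print(f"{prefill_ms:.0f} ms")  # ~309 ms
```

which lines up with the ~311 ms single-user TTFR measured later in this post.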


llama-benchy vs vLLM bench serve

Now the interesting part. I ran both tools on the same hardware, same model, same token counts. The results tell different stories about the same machine — and understanding why they differ matters more than which number is “right.”

Decode Throughput — Side by Side

| Concurrency | llama-benchy t/s | vLLM bench t/s | Ratio |
|---|---|---|---|
| 1 | 57.22 | 48.63 | 0.85x |
| 4 | 107.40 | 101.54 | 0.95x |
| 8 | 150.62 | 143.54 | 0.95x |
| 16 | 214.47 | 207.22 | 0.97x |
| 32 | 314.36 | 304.03 | 0.97x |
| 64 | 452.54 | 443.69 | 0.98x |

From c4 up, they agree within 2-5%; only c1 shows a wider gap, about 15%. Two completely different load models — batch vs sustained Poisson arrivals — and they converge on nearly the same decode throughput. The gap comes from vLLM bench running mixed prefill+decode (some GPU cycles go to processing input tokens), while llama-benchy tests decode in isolation; that overhead weighs heaviest at c1, where a single request's prefill has nothing to overlap with.

Where They Diverge: Time to First Token

This is where the tools tell fundamentally different stories.

| Concurrency | llama-benchy TTFR (ms) | vLLM bench TTFT p50 (ms) |
|---|---|---|
| 1 | 311 | 304 |
| 4 | 682 | 101 |
| 16 | 2,112 | 198 |
| 64 | 7,796 | 307 |

At c1, they nearly match — 311ms vs 304ms. The ~7ms gap is partly network (~3ms round-trip from Windows workstation over LAN, since llama-benchy ran remotely while vLLM bench ran inside the Docker container on localhost) and partly client-side overhead. At c64, llama-benchy reports 7.8 seconds while vLLM bench reports 307ms. A 25x difference.

The reason is structural, not a bug. llama-benchy fires all 64 requests simultaneously. They all queue for prefill at the same instant, and the last request waits behind 63 others. TTFR measures the time from “batch launched” to “first token received” — it includes all the queuing delay.

vLLM bench serve sends requests with staggered arrivals. Each request’s TTFT is measured independently — from when that specific request was sent to when its first token arrived. Requests interleave naturally, so prefill queuing stays bounded even at high concurrency.

Neither number is wrong. llama-benchy’s TTFR answers “how long until someone gets a token when 64 requests arrive at the exact same instant?” vLLM bench’s TTFT answers “how long does a typical request wait for its first token under sustained load?” Different questions, different answers.
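The divergence is easy to reproduce with a toy model (illustrative only, not either tool's actual code): assume prefill is served serially at a fixed ~0.3 s per 1024-token prompt, and compare what each measurement convention reports.

```python
# Toy model of batch TTFR vs staggered TTFT. The serial, fixed-cost
# prefill is an assumption for illustration -- real schedulers batch
# and interleave prefill, so the absolute numbers differ.
PREFILL_S = 0.3  # roughly one 1024-token prompt at c1 rates

def batch_mean_ttft(n):
    # All n requests arrive at t=0; request i waits behind i prefills.
    return sum((i + 1) * PREFILL_S for i in range(n)) / n

def staggered_ttft():
    # Arrivals spaced out so no request queues behind another.
    return PREFILL_S

for n in (1, 16, 64):
    print(f"c{n}: batch {batch_mean_ttft(n):.2f}s, staggered {staggered_ttft():.2f}s")
```

The batch numbers grow linearly with concurrency while the staggered number stays flat; that's the same shape as the table above.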

Per-Token Latency Matches

While TTFT diverges dramatically, per-token decode latency is nearly identical:

| Concurrency | llama-benchy (ms/tok) | vLLM bench TPOT p50 (ms) |
|---|---|---|
| 1 | 17.47 | 17.75 |
| 8 | 52.88 | 53.16 |
| 32 | 100.91 | 98.59 |
| 64 | 138.50 | 129.73 |

Both tools see the same underlying decode behavior. The server doesn’t care how the requests arrived — once they’re in the batch, each token takes the same time to generate.
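The two views are also consistent with the decode table earlier: total throughput is roughly concurrency divided by per-token latency.

```python
# Implied total decode throughput from the llama-benchy per-token
# latencies in the table above.
rows = [(1, 17.47), (8, 52.88), (32, 100.91), (64, 138.50)]  # (concurrency, ms/tok)

for c, ms_per_tok in rows:
    print(f"c{c}: {c / ms_per_tok * 1000:.0f} t/s")
```

That gives roughly 57, 151, 317, and 462 t/s, within a few percent of the measured 57, 151, 314, and 453.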

Peak Throughput at High Concurrency

One more divergence worth noting: vLLM bench reports “peak output throughput” using a 1-second sliding window over all token timestamps. At c64, this peaks at 775 t/s — 1.7x higher than llama-benchy’s 452 t/s. These spikes happen when many requests are simultaneously in decode phase with no prefill interruptions. They’re real but transient — the sustained average is what llama-benchy reports.
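A sliding-window peak like that can be derived from per-token timestamps roughly like this (a sketch of the idea, not vLLM's implementation):

```python
from bisect import bisect_left

def peak_window_tps(timestamps, window=1.0):
    """Max token count over any trailing `window`-second interval."""
    ts = sorted(timestamps)
    best = 0
    for i, t in enumerate(ts):
        lo = bisect_left(ts, t - window)  # first token at or after t - window
        best = max(best, i - lo + 1)
    return best / window

# 10 tokens in the first second, then one straggler two seconds later
print(peak_window_tps([i * 0.1 for i in range(10)] + [3.0]))  # 10.0
```

The peak reflects the densest burst of tokens, not the run-long average, which is why it can sit well above the sustained number.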

Practical Takeaway

| Use case | Better tool |
|---|---|
| Quick A/B comparisons | llama-benchy — one command sweeps all concurrency levels |
| Community benchmarks / Spark Arena | llama-benchy — it’s the standard |
| Isolated prefill/decode measurement | llama-benchy — separate pp/tg tests |
| Production capacity planning | vLLM bench serve — realistic mixed workload |
| Tail latency / SLA analysis | vLLM bench serve — p99 percentiles |
| Load-balanced multi-backend setups | vLLM bench serve — averages out routing noise |

Spark Arena

This is why speaking the same benchmark language matters.

The community didn’t just build the benchmark tool — they built a leaderboard. Spark Arena is a community-run performance leaderboard for DGX Spark owners, and it uses llama-benchy as its standard. Submit your llama-benchy results, see how your setup compares.

When I ran llama-benchy on GoldFinger2 and got 57 t/s single-user decode, I could compare that directly with other Spark owners running the same model, same tool, same methodology. No conversion needed. No “well, they used a different benchmark so the numbers aren’t directly comparable.” Just numbers on the same scale.

This is the same pattern from Post 3: the community didn’t wait for NVIDIA to build the comparison infrastructure. They built the benchmark tool, the leaderboard, and the verification framework themselves.


The Takeaway

I switched to llama-benchy because the community uses it. The numbers matched what I’d measured with vllm bench serve — within a few percent on decode throughput at realistic concurrency, with CV% under 2% across 50 runs. Both tools are good. But only one lets me compare directly with every other Spark owner and submit to the Spark Arena leaderboard.

The community built the SM121 kernel fixes, the optimized Docker images, the benchmark tool, and the leaderboard. At this point, the community infrastructure is the DGX Spark platform.


Next: LiteLLM: The Translation Layer — turns your local vLLM instance into an OpenAI-compatible proxy that any client can talk to.