8/9 The Recipe System: One Command, Zero Flag Archaeology


Previous: Solo vs Cluster: Where Two Sparks Beat One

Post 3 showed the raw vLLM command: ten flags, each model-specific, each a potential source of silent misconfiguration. Post 6 mentioned that “the recipe system handles all the flag combinations” and moved on. Post 7 benchmarked three topologies — but those benchmarks only work if every flag is identical across topologies except the one variable you’re testing. This post opens the hood on recipes: what’s inside them, why they exist, and how they make everything in Post 7 reproducible.

The Flag Problem

Here’s the vLLM command from Post 3, the one the community tuned for gpt-oss-120b:

vllm serve openai/gpt-oss-120b \
  --quantization mxfp4 \
  --mxfp4-backend CUTLASS \
  --mxfp4-layers moe,qkv,o,lm_head \
  --attention-backend FLASHINFER \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.70 \
  --max-model-len 131072 \
  --max-num-seqs 2 \
  --max-num-batched-tokens 8192 \
  --enable-prefix-caching

That’s one model, one topology, one quantization config. Change the model, change everything. Change the topology from solo to cluster, change half the flags. Forget --enforce-eager in cluster mode and CUDA graphs crash. Include lm_head in --mxfp4-layers with TP=2 and the marlin kernel crashes. Set gpu-memory-utilization to 0.95 on the MXFP4 image and it OOMs — the image has higher overhead than standard vLLM.

Every one of those failure modes happened to me before recipes existed. The flags aren’t just config — they’re tribal knowledge compressed into command-line arguments. Recipes capture that knowledge in a file.


Recipe Anatomy — What’s Inside the YAML

A recipe is a YAML file in eugr’s spark-vllm-docker repo. Here’s the solo GPT-OSS-120B recipe:

model: "openai/gpt-oss-120b"          # HuggingFace model ID
container: "vllm-node-mxfp4"          # Docker image name
args:
  tensor-parallel-size: 1              # 1=solo, 2=cluster
  gpu-memory-utilization: 0.90
  quantization: mxfp4
  port: 8000
  max-model-len: 4096
  enforce-eager: true
  mxfp4-layers: "moe,qkv,o,lm_head"
  enable-auto-tool-choice: true
  tool-call-parser: hermes
env:
  TIKTOKEN_RS_CACHE_DIR: "/root/.cache/tiktoken_rs"
  VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8: "1"
benchmark:
  enabled: true
  framework: llama-benchy
  args: ["--num-prompts", "100"]

Three sections do all the work:

model and container — The HuggingFace model ID and which Docker image to use. GPT-OSS-120B needs the MXFP4 image (vllm-node-mxfp4, built from Christopher Owen’s dev12777 branch). Standard models use vllm-node. Wrong image, wrong quantization kernels.

args — Every vLLM serve flag. This is the section that replaces flag archaeology. The recipe pins quantization format, memory utilization, context length, execution mode, tool call parser — everything that affects inference behavior. Boolean flags like enforce-eager set to true get passed as --enforce-eager with no value.

env — Environment variables injected into the container. TIKTOKEN_RS_CACHE_DIR tells the harmony tokenizer where to find its vocab file. VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8 enables the FlashInfer MoE kernel for MXFP4. Miss either one and the model loads but inference fails with cryptic errors.

The optional benchmark section configures llama-benchy to run automatically after deployment — useful for smoke tests.
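The args-to-flags expansion is mechanical enough to sketch. The real run-recipe.sh is a shell script; the `to_cli` helper below is a hypothetical Python illustration of the boolean rule described above, not code from the repo:

```python
# Sketch of how a recipe's `args` map could expand into a vLLM command
# line. Hypothetical helper for illustration -- the actual expansion
# lives in run-recipe.sh.
recipe_args = {
    "tensor-parallel-size": 1,
    "gpu-memory-utilization": 0.90,
    "quantization": "mxfp4",
    "max-model-len": 4096,
    "mxfp4-layers": "moe,qkv,o,lm_head",
    "enforce-eager": True,  # boolean true -> bare flag, no value
}

def to_cli(model: str, args: dict) -> list[str]:
    cmd = ["vllm", "serve", model]
    for key, value in args.items():
        if value is True:
            cmd.append(f"--{key}")            # e.g. --enforce-eager
        elif value is not False:
            cmd += [f"--{key}", str(value)]   # e.g. --quantization mxfp4
    return cmd

print(" ".join(to_cli("openai/gpt-oss-120b", recipe_args)))
```

The point of pinning this in one place: the expansion is deterministic, so two deployments from the same YAML can never diverge by a flag.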


What the Recipe Locks Down

Every benchmark-relevant parameter is captured in the YAML. Here’s what that covers:

| Category | Flags / Settings | Why It Matters for Benchmarks |
| --- | --- | --- |
| Quantization | quantization, mxfp4-layers, kv-cache-dtype | Different quant configs produce different throughput numbers |
| Memory | gpu-memory-utilization | Controls KV cache size, directly affects max concurrency |
| Context | max-model-len, max-num-batched-tokens | Determines how much memory is reserved for sequences |
| Execution | enforce-eager | CUDA graphs vs eager mode changes the latency profile |
| Env vars | VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8, TIKTOKEN_RS_CACHE_DIR | Runtime behavior of MoE kernels and tokenizer |
| Docker mounts | tiktoken_rs, ccache, flashinfer, vllm, huggingface | Missing mounts cause silent fallbacks or crashes |

This is the repeatability argument. When Post 7 compared Solo vs Cluster vs 2x Solo, the recipe guaranteed the only variable was the topology itself — not a stray flag difference. Without recipes, a topology comparison can be silently invalidated by one-flag-off errors: a forgotten --enforce-eager, a different gpu-memory-utilization, or a missing env var.


GPT-OSS-120B — Three Recipes, One Base Config

The central example: three recipes for the same model, sharing identical base flags, differing only where the topology requires it.

| Recipe | TP | Context | gpu_mem | Key Difference |
| --- | --- | --- | --- | --- |
| openai-gpt-oss-120b.yaml (Solo) | 1 | 4,096 | 0.90 | Short context — 131K OOMs on single node |
| openai-gpt-oss-120b-cluster.yaml (Cluster) | 2 | 32,768 | 0.70 | TP=2 via Ray, conservative memory |
| openai-gpt-oss-120b-claude.yaml (Claude) | 2 | 131,072 | 0.75 | Full context, lowered for page-cache headroom |
All three use MXFP4 quantization, the same env vars, the same five cache mounts. The design means when Post 7 benchmarked Solo vs Cluster, the quantization config, MoE kernel settings, and execution mode were identical. The only variables were tensor parallelism and context length.
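That guarantee can be checked mechanically: diff the recipes and confirm that only topology-driven keys changed. A sketch, with the recipe contents abbreviated to dicts (values taken from the recipes above; the `diff` helper is illustrative, not part of the repo):

```python
# Abbreviated args sections of the solo and cluster recipes.
solo = {"tensor-parallel-size": 1, "max-model-len": 4096,
        "gpu-memory-utilization": 0.90, "quantization": "mxfp4",
        "mxfp4-layers": "moe,qkv,o,lm_head", "enforce-eager": True}
cluster = {"tensor-parallel-size": 2, "max-model-len": 32768,
           "gpu-memory-utilization": 0.70, "quantization": "mxfp4",
           "mxfp4-layers": "moe,qkv,o", "enforce-eager": True}

def diff(a: dict, b: dict) -> set[str]:
    """Keys whose values differ (or that exist on only one side)."""
    return {k for k in a.keys() | b.keys() if a.get(k) != b.get(k)}

changed = diff(solo, cluster)
# Quantization format and execution mode must be identical for a
# fair topology comparison; only topology-driven keys may appear.
assert "quantization" not in changed and "enforce-eager" not in changed
print(sorted(changed))
# -> ['gpu-memory-utilization', 'max-model-len', 'mxfp4-layers',
#     'tensor-parallel-size']
```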

Solo Recipe Gotchas

The upstream default context is 131,072 tokens. On a single 128 GB node, that OOMs immediately. The solo recipe sets max-model-len: 4096 — enough for testing and short inference, not for production workloads.

The solo recipe includes lm_head in mxfp4-layers (moe,qkv,o,lm_head). This gives a roughly 20% decode speedup because the language model head gets quantized too. It works on solo because lm_head isn’t sharded.

Memory utilization is set to 0.90. The MXFP4 image has higher overhead than standard vLLM — 0.95 OOMs on this image even though it works fine on the standard one.

Cluster Recipe Gotchas

Two flags change and both are required:

enforce-eager: true — CUDA graph capture fails with TP=2 on Blackwell. Without this flag, the model loads but crashes during the first inference. This is a known issue with the current vLLM build on SM121 when tensor parallelism crosses nodes.

mxfp4-layers: "moe,qkv,o" — Note: no lm_head. With TP=2, the lm_head dimension gets sharded across nodes, and the marlin kernel can’t handle the sharded dimension. The crash is immediate and the error message points to an internal marlin assertion — not obviously related to the layers config. This was hours of debugging before the community identified the root cause.

Memory is conservative at 0.70. Two GPUs sharing a model need coordination overhead, and the CX7 interconnect traffic adds memory pressure. The cluster recipe also adds --distributed-executor-backend ray for multi-node inference.

Claude Recipe Gotchas

The Claude recipe is the production configuration — it’s what runs on the cluster day-to-day for Claude Code usage (Post 9). It shares all cluster flags but pushes the context to the full 131,072 tokens.

The key difference: gpu-memory-utilization: 0.75, lowered from 0.80. The reason is the page-cache OOM issue from Post 6 — Linux page cache gradually steals memory from the GPU’s allocation, and at 0.80 the model crashed after roughly 19 hours of uptime. Lowering to 0.75 adds approximately 6 GiB of headroom, enough to survive between runs of the cron job that drops the page cache.
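The headroom figure is back-of-envelope arithmetic, assuming gpu-memory-utilization is a fraction of each node’s 128 GiB of unified memory:

```python
# Headroom gained by lowering gpu-memory-utilization from 0.80 to
# 0.75, assuming the fraction applies to the full 128 GiB of unified
# memory on a Spark node.
total_gib = 128
headroom_gib = (0.80 - 0.75) * total_gib
print(f"{headroom_gib:.1f} GiB freed")  # 6.4 GiB, i.e. roughly 6 GiB
```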


Deploy and Switch

Deploying any topology is one command:

# Solo (single node)
cd ~/spark-vllm-docker && ./run-recipe.sh \
    ~/custom-recipes/openai-gpt-oss-120b.yaml --solo -d

# Cluster TP=2 (both nodes via Ray)
cd ~/spark-vllm-docker && ./run-recipe.sh \
    ~/custom-recipes/openai-gpt-oss-120b-cluster.yaml -d

# 2x Solo (both nodes independently)
# Node 1:
cd ~/spark-vllm-docker && ./run-recipe.sh \
    ~/custom-recipes/openai-gpt-oss-120b.yaml --solo -d
# Node 2 (same recipe, same flags):
cd ~/spark-vllm-docker && ./run-recipe.sh \
    ~/custom-recipes/openai-gpt-oss-120b.yaml --solo -d

The --solo flag means single-node mode (no Ray worker). The -d flag detaches (runs in background). The --setup flag (from Post 6) downloads the model first if it’s not cached.

Both nodes in the 2x Solo topology deploy from the same recipe file. The flags match because they literally are the same file. No manual flag assembly, no copy-paste divergence.

Stopping:

cd ~/spark-vllm-docker && ./launch-cluster.sh stop

Switching models: stop, run a different recipe, verify. One command each. The recipe system from Post 6’s skill layer makes this a 30-second operation instead of twenty minutes of flag archaeology.


Beyond GPT-OSS — The Recipe Catalog

GPT-OSS-120B is the primary benchmark model, but the repo includes 11 built-in recipes plus custom additions:

| Use Case | Recipe | Why |
| --- | --- | --- |
| Quick smoke test | nemotron-3-nano-nvfp4 | Small, fast to load, validates the stack |
| Code generation | qwen3-coder-next-fp8 | Code-optimized, FP8 quantization |
| GPT-OSS solo | openai-gpt-oss-120b.yaml | Fixed context (upstream OOMs) |
| GPT-OSS cluster | openai-gpt-oss-120b-cluster.yaml | TP=2, 32K context, enforce-eager |
| GPT-OSS for Claude Code | openai-gpt-oss-120b-claude.yaml | TP=2, 131K context, gpu_mem 0.75 |
| MoE / tool calling | qwen3-next-80b-a3b-instruct-fp8.yaml | 80B MoE (3B active), tool calling |
| General chat | minimax-m2.5-awq | Good balance of quality and speed |

Qwen3-Next-80B-A3B — A Different Set of Gotchas

Not every recipe is a variant of GPT-OSS. Qwen3-Next-80B-A3B is a Mixture-of-Experts model — 80B total parameters, 3B activated per token, 512 experts with 10 active. It uses FP8 quantization instead of MXFP4, runs on the standard vllm-node image, and has its own set of required flags:

  • --trust-remote-code — Required for Qwen3’s hybrid attention architecture. Without it, vLLM refuses to load the model.
  • VLLM_USE_DEEP_GEMM=0 — Deep GEMM causes instability on the Spark’s SM121 chip. Must be explicitly disabled.
  • --enable-chunked-prefill — Prevents memory spikes on long prompts. Without it, a single long prompt can OOM even though the model fits in memory.
  • --max-num-seqs 16 — Batching sweet spot for this model’s MoE architecture.
  • Do NOT set VLLM_USE_FLASHINFER_MOE_FP8 or VLLM_FLASHINFER_MOE_BACKEND — leaving both unset lets vLLM auto-select the TRITON backend, which works. Forcing a specific backend crashes.

Every one of those flags is in the recipe. Every one was discovered through trial and error. The recipe captures the result so nobody has to repeat the archaeology.
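Rules like these are easy to encode as a preflight check that runs before launch. A hypothetical sketch (the `preflight` helper and the exact model ID spelling are illustrative, not from the repo):

```python
import os

# Hypothetical preflight check encoding two of the Qwen3-Next rules
# above: the MoE backend env vars must NOT be set, and certain serve
# flags must be present. (It does not cover every rule, e.g. the
# explicit VLLM_USE_DEEP_GEMM=0 requirement.)
FORBIDDEN_ENV = ["VLLM_USE_FLASHINFER_MOE_FP8", "VLLM_FLASHINFER_MOE_BACKEND"]
REQUIRED_FLAGS = ["--trust-remote-code", "--enable-chunked-prefill"]

def preflight(cmd: list[str]) -> list[str]:
    """Return a list of problems; an empty list means safe to launch."""
    problems = [f"unset {v}" for v in FORBIDDEN_ENV if v in os.environ]
    problems += [f"missing {f}" for f in REQUIRED_FLAGS if f not in cmd]
    return problems

cmd = ["vllm", "serve", "Qwen/Qwen3-Next-80B-A3B-Instruct-FP8",
       "--trust-remote-code", "--enable-chunked-prefill",
       "--max-num-seqs", "16"]
print(preflight(cmd))  # prints [] when the recipe is complete
```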


Benchmarking with Recipes

Audiophiles have reference tracks — recordings they know so well that they’re not listening to the music anymore, they’re listening to the equipment. When you put on Jennifer Warnes and Rob Wasserman’s “Ballad of the Runaway Horse” on a new amp, it’s just voice and upright bass — nowhere to hide. You already know exactly where the bass notes should land, whether her vocal has that in-the-room presence or sounds recessed. The track is the control. The equipment is the variable.

Recipes are our reference tracks. Same model, same flags, same quantization, same memory config — the only variable is the topology. That’s how we found the prefill crossover in Post 7. That’s how we know it’s real and not a one-flag-off artifact.

Post 7’s topology comparison followed a specific workflow. Recipes are the first step.

Safety Rules

Two rules learned the hard way:

  1. Always use IP addresses, never hostnames. mDNS floods at high concurrency (c64) cause hard node crashes. Management IPs: GoldFinger 192.168.3.203, GoldFinger2 192.168.3.190.
  2. Drop page cache on both nodes before launching vLLM. sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches' — without this, stale page cache can steal GPU memory and cause OOMs mid-benchmark.

Running Benchmarks

All commands run from the Windows workstation. The --model must match the served model name exactly.

# Solo -- quick smoke test
uvx llama-benchy \
    --url http://192.168.3.203:8000/v1 \
    --model openai/gpt-oss-120b \
    --pp 512 --tg 32 --runs 2

# Solo -- comprehensive sweep
uvx llama-benchy \
    --url http://192.168.3.203:8000/v1 \
    --model openai/gpt-oss-120b \
    --pp 1024 --tg 128 \
    --concurrency 1,2,4,8,16,32,64 \
    --runs 50

# 2x Solo + LiteLLM (through the proxy on port 4000)
uvx llama-benchy \
    --url http://192.168.3.203:4000/v1 \
    --api-key sk-1234 \
    --model openai/gpt-oss-120b \
    --pp 1024 --tg 128 \
    --concurrency 1,2,4,8,16,32,64 \
    --runs 50

# Cluster TP=2 (head node only -- GF2 is a Ray worker, no API)
uvx llama-benchy \
    --url http://192.168.3.203:8000/v1 \
    --model openai/gpt-oss-120b \
    --pp 1024 --tg 128 \
    --concurrency 1,2,4,8,16,32,64 \
    --runs 50

The Solo and Cluster TP=2 commands look identical — same endpoint, same flags. The difference is which recipe was deployed. The benchmark doesn’t need to know the topology; the recipe already configured it.

Key Metrics

| Metric | Unit | Better | What It Measures |
| --- | --- | --- | --- |
| pp (prefill) | tok/s | Higher | Prompt processing speed |
| tg (decode) | tok/s | Higher | Token generation throughput |
| TTFR | ms | Lower | Time to first response token |
| e2e_ttft | ms | Lower | End-to-end time to first token (includes queuing) |

Which Topology for Which Workload

Post 7 showed the full numbers. Here’s the decision table:

| Optimize For | Best Topology | Why |
| --- | --- | --- |
| Decode throughput (any concurrency) | Cluster TP=2 | +21% at c1, +61% at c16, +47% at c64 vs solo |
| Per-request decode latency | Cluster TP=2 | Fastest per-request speed at all concurrency levels |
| Prefill throughput at c16+ | 2x Solo + LiteLLM | Both nodes process different prefills in parallel (+45% at c64) |
| TTFT at c8+ | 2x Solo + LiteLLM | Independent prefill queues (-45% at c64) |
| Single-request TTFT | Cluster TP=2 | Faster single-request prefill (-33% vs solo) |

The recipe system is what makes this table actionable. Switching from “optimize for decode” to “optimize for prefill under load” is: stop the cluster recipe, deploy the solo recipe on both nodes, start LiteLLM. Three commands, same model, different topology, guaranteed-identical flags.


What’s Next

Recipes are the operational layer — they make deployment repeatable, benchmarking reproducible, and topology switching trivial. Post 9 puts them to work: Claude Code connected to these locally served models, where the recipe system becomes the foundation for agentic coding on local hardware.