6/9 Two Sparks, One Cluster: Setting Up with Claude Code

· 7 min read

Previous: LiteLLM: The Translation Layer

The second DGX Spark arrived. Two nodes, 256 GB unified memory total, a QSFP stacking cable waiting to turn them into a cluster.

Before touching any of that: check firmware.


Firmware First

This is the boring step that saves you from mysterious failures three days later.

Both Sparks need matching firmware — GPU firmware, BMC firmware, DGX OS version. The community forums are full of posts where people spent days debugging NCCL hangs and Ray connection failures that turned out to be firmware mismatches. I checked both nodes before plugging in the stacking cable. They matched — but if they hadn’t, I’d have lost days to phantom bugs that look like software problems.

Update anything that’s behind. Do this first.


The Physical Setup

Two DGX Sparks. One QSFP stacking cable. The cable connects the ConnectX-7 NICs — dual 200GbE links, 400 Gbps aggregate between nodes. This is the path NCCL uses for tensor parallelism: RoCE (RDMA over Converged Ethernet), bypassing the kernel network stack entirely, GPU-memory-to-GPU-memory.

The community clustering work — especially eugr’s spark-vllm-docker from Post 3 — meant the second Spark was immediately useful rather than a months-long project. The Docker images, Ray cluster scripts, and recipe system already existed. I needed to wire them up, not invent them.

Two networks serve two purposes: management (10GbE LAN/WiFi) for SSH from the workstation, ConnectX-7 for everything between nodes. Don’t mix them.


Claude Code as the Setup Partner

Configuring a two-node cluster is a long list of steps that each depend on the previous one: SSH keys, network config, Docker images, Ray cluster, vLLM serving. Miss one detail and something downstream fails with an unrelated-looking error.

I used Claude Code throughout, and it helped in two distinct ways:

Pre-built skills — I started with a set of Claude Code skills I’d been developing since the first Spark. SSH setup, Docker management, vLLM deployment. Each skill has status checks, setup scripts, and reference docs. When I asked Claude Code to “check SSH status,” it ran the right scripts and told me what was misconfigured.

Skills built during configuration — Some skills didn’t exist until I needed them. Example: after the third time I manually checked Docker image IDs on both nodes to verify they matched, I asked Claude Code to write a status script. It wrote docker_status.py — SSHes into both nodes, compares image IDs, flags mismatches. I saved it as part of the Docker skill. The model recipe skill, the benchmarking skill, the LiteLLM proxy skill from Post 5 — all emerged the same way: hit a problem, solve it with Claude Code, save it as a repeatable skill.

The result is a layered skill system where each skill builds on the previous:

1. SSH Setup        → Passwordless SSH over ConnectX-7
2. Docker Setup     → Build vLLM image, distribute to both nodes
3. vLLM Deploy      → Ray cluster across both nodes
4. LiteLLM Proxy    → API gateway (Post 5)
5. Benchmark        → Quick performance tests
6. Model Recipes    → Pre-configured model deployments
7. llama-benchy     → Comprehensive benchmarking

Software Stack

Layer  Component                                           Skill
7      llama-benchy (comprehensive benchmark)              Skill 7
6      Benchmark (quick performance tests)                 Skill 5
5      LiteLLM Proxy                                       Skill 4
4      Model Recipes (run-recipe.sh)                       Skill 6
3      vLLM + Ray Cluster (tensor parallelism, -tp 2)      Skill 3
2      Docker (vllm-node image, NVIDIA Container Toolkit)  Skill 2
1      SSH (passwordless, ConnectX IP)                     Skill 1
0      Ubuntu 24.04 (ARM64) / DGX Spark OS                 —

Skill Dependency Graph

flowchart TD
    A["1. SSH Setup"] --> B["2. Docker Setup"]
    B --> C["3. vLLM Deployment"]
    C --> D["4. LiteLLM Proxy"]
    C --> E["5. Benchmark"]
    C --> F["6. Model Recipes"]
    F --> G["7. llama-benchy"]

Skills 1–3 must run in order. Skills 4–6 depend on 3 and can run in any order; Skill 7 builds on 6.


The Setup Chain

Each layer has its own gotchas. Here’s what Claude Code caught and what I learned the hard way.

SSH — The SSH skill automated key exchange and netplan config over the ConnectX-7 IPs (192.168.177.11, 192.168.177.12). What it also caught: sudo breaks SSH to the peer node, because root’s ~/.ssh/ has no keys for the other machine. The fix is to run sudo docker save locally, pipe the stream through ssh as the unprivileged cluster@... user, and feed it into sudo docker load on the far side. That privilege split is easy to get wrong manually; the skill handles it correctly every time.
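The pipeline itself is one line; a sketch, with the image name standing in for whatever the build actually produces:

```shell
# Root reads the image on the head node; the stream crosses the wire
# as the unprivileged cluster user; root loads it on the peer node.
sudo docker save vllm-node:latest \
  | ssh cluster@192.168.177.12 'sudo docker load'
```

This assumes the remote cluster user can run sudo docker load non-interactively; otherwise the stream has to land in a temp file first.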

Docker — The vLLM Docker image must be identical on both nodes. Different builds have different internal layouts — tiktoken file paths, library versions — causing subtle failures during tensor-parallel inference. I learned this after a 30-minute debugging session where TP=2 crashed with a cryptic tiktoken error. The images looked the same but weren’t. Now docker_status.py compares image IDs across nodes before every deployment.
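docker_status.py itself is Python; the core check it performs boils down to something like this (a hypothetical shell rendition, with stand-in IDs where the real script would SSH into each node):

```shell
#!/bin/sh
# same_image succeeds only when both IDs are non-empty and identical —
# an empty string means the image is missing on that node.
same_image() {
  [ -n "$1" ] && [ "$1" = "$2" ]
}

# On the cluster these would be fetched per node, e.g.:
#   id1=$(ssh cluster@192.168.177.11 "docker images --no-trunc -q vllm-node")
#   id2=$(ssh cluster@192.168.177.12 "docker images --no-trunc -q vllm-node")
id1="sha256:aaa"   # stand-in values for illustration
id2="sha256:aaa"

if same_image "$id1" "$id2"; then
  echo "OK: image IDs match"
else
  echo "MISMATCH: $id1 vs $id2" >&2
fi
```

Comparing image IDs rather than tags is the point: two nodes can both show vllm-node:latest while running different builds.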

vLLM + Ray — The Ray cluster spans both nodes with the head node running the API server on port 8000. The worker node has no API — it’s a Ray worker only. Tensor parallelism (-tp 2) splits the model across both GPUs. We’ll see these flags pay off in the benchmarks.
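The shape of the launch, reduced to essentials. The real deployment goes through the recipe scripts inside the containers, so treat these flags as a sketch:

```shell
# On the head node: start Ray and bind it to the ConnectX-7 IP.
ray start --head --node-ip-address=192.168.177.11 --port=6379

# On the worker node: join the cluster. No API server runs here.
ray start --address=192.168.177.11:6379

# Back on the head node: serve with the model split across both GPUs.
vllm serve openai/gpt-oss-120b \
  --tensor-parallel-size 2 \
  --host 0.0.0.0 --port 8000
```

Only the head node answers on port 8000; requests to the worker's IP will find nothing listening, which is expected, not broken.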


Not Downloading 120B Twice

gpt-oss-120b is roughly 120 GB of model weights. Downloading it twice — once per node — is a waste when the nodes are connected by a 200 Gbps link. The HuggingFace cache must be at the same path on both nodes (/home/cluster/.cache/huggingface/). Download once on the head node, then parallel tar | nc over the CX7 link — five streams, no encryption needed on a private cable, 4.2 GB/s, about 30 seconds for 120 GB.
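A sketch of the transfer pattern. The chunk-$i directories are stand-ins for however you partition the cache into five groups, and nc flag syntax varies between netcat implementations:

```shell
# Receiver (second node): one listener per stream, unpacking in parallel.
cd /home/cluster/.cache/huggingface
for i in 0 1 2 3 4; do
  nc -l -p $((9000 + i)) | tar -x &
done
wait

# Sender (head node): five tar streams, one per port, no compression,
# no encryption — it's a point-to-point cable.
cd /home/cluster/.cache/huggingface
for i in 0 1 2 3 4; do
  tar -c chunk-$i | nc 192.168.177.12 $((9000 + i)) &
done
wait
```

Skipping compression is deliberate: quantized weights barely compress, and at 4 GB/s the CPU would become the bottleneck long before the link does.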

But the model weights were the easy part. MXFP4 models need five separate caches mounted into the container:

  1. HuggingFace — model weights
  2. FlashInfer — JIT-compiled GPU kernels
  3. vLLM — torch compile cache
  4. ccache — CUDA compilation cache
  5. tiktoken_rs — harmony encoding vocab (GPT-OSS specific)

Miss one and the model loads fine but fails at inference time. The errors are cryptic — a missing FlashInfer kernel looks like a CUDA error, a missing tiktoken vocab looks like a tokenizer bug. I spent hours on these before realizing the caches needed to be synced across both nodes, not just the model weights.
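The five mounts, sketched as a docker run invocation. Container-side paths and the image name are illustrative; in practice the recipe scripts assemble this:

```shell
# One -v per cache. Host paths must be identical on both nodes,
# or the worker will JIT-compile (or crash) where the head got a hit.
docker run --gpus all --network host \
  -v /home/cluster/.cache/huggingface:/root/.cache/huggingface \
  -v /home/cluster/.cache/flashinfer:/root/.cache/flashinfer \
  -v /home/cluster/.cache/vllm:/root/.cache/vllm \
  -v /home/cluster/.cache/ccache:/root/.cache/ccache \
  -v /home/cluster/.cache/tiktoken_rs:/root/.cache/tiktoken_rs \
  vllm-node:latest
```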


The DGX Manager

By the time the cluster was running, the collection of skills and scripts had grown into its own project: dgx_manager. A Claude Code workspace with seven skills, a shared cluster config (spark-cluster.yaml), and PowerShell scripts for cluster management from my Windows workstation.

The day-to-day workflow:

  • Deploy a model: Pick a recipe, run it. run-recipe.sh openai-gpt-oss-120b --setup downloads the model, launches the container, configures vLLM with the right flags.
  • Check status: Ask Claude Code “what’s running on the cluster” — it SSHes into both nodes, checks Docker containers, Ray status, vLLM API health.
  • Benchmark: Run llama-benchy against the deployment, get throughput curves across concurrency levels.
  • Switch models: Stop the current deployment, run a different recipe. The recipe system handles all the flag combinations — quantization, context length, memory utilization, tool call parsers — so switching from GPT-OSS 120B to Qwen3-32B is one command, not twenty minutes of flag archaeology.
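Behind the status question, the checks Claude Code runs are ordinary ones; roughly:

```shell
# Is the vLLM OpenAI-compatible server answering on the head node?
curl -sf http://192.168.177.11:8000/health && echo "vLLM API: up"

# What containers are running, and is the Ray cluster healthy?
ssh cluster@192.168.177.11 "docker ps --format '{{.Names}}: {{.Status}}'"
ssh cluster@192.168.177.11 "ray status"
```

The value isn't any single command; it's that the skill runs all of them on both nodes and correlates the answers.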

Each skill maintains its own documentation via update_docs.py, so the docs stay current with the actual cluster state.


What You End Up With

Two DGX Sparks. 256 GB unified memory. 400 Gbps between them — on paper. Same principle as the intro: the spec sheet said 200 Gbps per link, but you don’t know what that means until you push real traffic across it. A skill system that automates deployment, benchmarking, and model switching.

The cluster has two operating modes — and they perform very differently:

  • TP=2 cluster: Both GPUs run one model with tensor parallelism. Full 131K context window, lower throughput per token.
  • 2x Solo: Each node runs its own copy independently. For GPT-OSS 120B, that means 4K context per instance (memory-limited on a single 128 GB node), but higher aggregate throughput with load balancing.

Which mode wins? That’s Post 7 — benchmarks on the cluster, where the numbers tell a story the specs don’t predict.