9/9 Claude Code with Local Models: The Series Finale
Previous: The Recipe System: One Command, Zero Flag Archaeology
Post 1 started with nvidia-smi blind to its own hardware. Post 2 showed Ollama’s ceiling — good enough to start, not enough to stay. Post 3 got vLLM running at 3,975 t/s. Post 4 put those numbers under the microscope with llama-benchy. Post 5 wired up LiteLLM as the translation layer. Post 6 clustered two Sparks with Claude Code helping build the infrastructure. Post 7 benchmarked which topology wins for which workload. Post 8 wrapped the flag archaeology into one-command recipes.
This post is the payoff: Claude Code — and Python agents via the Anthropic SDK — running against a locally served GPT-OSS-120B on the DGX Spark.
Datacenter specs in a garage rack, serving local LLMs to Claude Code agent swarms. That’s sci-fi running in my garage.
The Three-Layer Stack
Three components. Three Docker containers (two for LiteLLM, one for vLLM). One DGX Spark.
Claude Code (Windows workstation)
│
▼ port 4000
LiteLLM (Docker bridge network)
│
▼ host.docker.internal:8000
vLLM (Docker host network)
│
▼ GPU
Blackwell GB10 — GPT-OSS-120B MXFP4
vLLM loads GPT-OSS-120B onto the Blackwell GPU with MXFP4 quantization and exposes an OpenAI-compatible endpoint on port 8000. Post 3 covers the community Docker image, Post 8 covers the recipe that configures it.
LiteLLM translates between Anthropic’s Messages API (what Claude Code speaks) and OpenAI’s Chat Completions API (what vLLM speaks). Critically, it runs the litellm_tool_fix.py callback that makes tool calling work. Post 5 covers the config gotchas and the basics.
Claude Code / Agent SDK is the client. Point ANTHROPIC_BASE_URL at LiteLLM and the local model appears as if it were the Anthropic API.
Single-node means no Ray, no CX7 networking, no second node. Everything on one Spark.
How Requests Flow
The request chain has more going on than “client → proxy → server.” The tool fix callback intercepts traffic in both directions — patching requests on the way in, translating responses on the way out.
Without the tool fix callback, Claude Code sees raw Harmony tokens instead of tool calls. It can’t execute tools. It can’t edit files. It can’t do anything agentic.
Layer 1 — vLLM with the Solo Recipe
Post 8 covers the recipe system in depth. But the solo GPT-OSS-120B recipe is worth showing here because it uses the full command template format with {placeholders} — a more complete recipe than the simplified examples in Post 8.
The Solo Recipe
File: ~/custom-recipes/openai-gpt-oss-120b.yaml
recipe_version: "1"
name: OpenAI GPT-OSS 120B
description: vLLM serving openai/gpt-oss-120b with MXFP4 quantization and FlashInfer
model: openai/gpt-oss-120b
container: vllm-node-mxfp4
build_args:
  - --exp-mxfp4
mods: []
defaults:
  port: 8000
  host: 0.0.0.0
  tensor_parallel: 1
  gpu_memory_utilization: 0.70
  max_model_len: 4096
  max_num_batched_tokens: 32768
env:
  VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8: "1"
  TIKTOKEN_RS_CACHE_DIR: /root/.cache/tiktoken_rs
  TIKTOKEN_ENCODINGS_BASE: /workspace/vllm/tiktoken_encodings
extra_docker_args: >-
  -v /home/cluster/.cache/tiktoken_rs:/root/.cache/tiktoken_rs
  -v /home/cluster/.cache/ccache:/root/.cache/ccache
command: |
  vllm serve openai/gpt-oss-120b \
    --tool-call-parser openai \
    --reasoning-parser openai_gptoss \
    --enable-auto-tool-choice \
    --tensor-parallel-size {tensor_parallel} \
    --gpu-memory-utilization {gpu_memory_utilization} \
    --enforce-eager \
    --enable-prefix-caching \
    --load-format fastsafetensors \
    --quantization mxfp4 \
    --mxfp4-backend CUTLASS \
    --mxfp4-layers moe,qkv,o,lm_head \
    --attention-backend FLASHINFER \
    --kv-cache-dtype fp8 \
    --max-model-len {max_model_len} \
    --max-num-batched-tokens {max_num_batched_tokens} \
    --host {host} \
    --port {port}
Compared to the simplified format in Post 8, this recipe adds recipe_version, build_args, env, extra_docker_args, and the full command template where {placeholders} are substituted from defaults at launch time. The command block gives you explicit control over the entire vllm serve invocation — useful when you need flags that don’t map to the simplified args format.
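Conceptually, launch-time substitution is plain string formatting: each {placeholder} in the command template is filled from the defaults block. A minimal Python sketch of the idea — run-recipe.sh's actual mechanism may differ:

```python
# Values from the recipe's defaults block.
defaults = {
    "port": 8000,
    "host": "0.0.0.0",
    "tensor_parallel": 1,
    "gpu_memory_utilization": 0.70,
    "max_model_len": 4096,
    "max_num_batched_tokens": 32768,
}

# The {placeholder} syntax maps directly onto str.format().
# Abbreviated template for illustration; the real recipe has many more flags.
template = (
    "vllm serve openai/gpt-oss-120b "
    "--tensor-parallel-size {tensor_parallel} "
    "--max-model-len {max_model_len} "
    "--host {host} --port {port}"
)
command = template.format(**defaults)
print(command)
```

Overriding a default at launch time is then just a dict update before the `format()` call.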
Key Flags
| Flag | Value | Why |
|---|---|---|
| tensor_parallel | 1 | Single GPU — no distributed inference |
| max_model_len | 4096 | The model supports 131K, but that OOMs on 128 GB. 4096 is safe for solo. |
| gpu_memory_utilization | 0.70 | Conservative. Grace Blackwell unified memory means CPU page cache shares the GPU memory pool. Higher values risk OOM during long sessions. |
| quantization | mxfp4 | Microscaling FP4. Makes the 120B model fit in 128 GB. |
| mxfp4-layers | moe,qkv,o,lm_head | Which layers to quantize. Including lm_head gives ~20% decode speedup. |
| enforce-eager | — | Disables CUDA graphs. Required for stability on Blackwell with this model. |
| kv-cache-dtype | fp8 | Halves KV cache memory vs FP16, allowing more concurrent requests. |
| load-format | fastsafetensors | GPU-direct model loading. ~3 min startup vs ~5 min. |
| attention-backend | FLASHINFER | Optimized attention kernels for Blackwell. |
| tool-call-parser | openai | Enables OpenAI-format tool calling. |
| reasoning-parser | openai_gptoss | Parses GPT-OSS reasoning/chain-of-thought output. |
| enable-auto-tool-choice | — | Lets the model decide when to call tools. |
The Five Cache Mounts
The vLLM container needs five bind-mounts to access cached data. Missing any of them causes anything from slow startups to hard failures.
| Host Path | Container Path | What’s Inside | Why It Matters |
|---|---|---|---|
~/.cache/huggingface | /root/.cache/huggingface | Model weights (~60 GB) | Without this, vLLM re-downloads the entire model on every launch |
~/.cache/flashinfer | /root/.cache/flashinfer | FlashInfer JIT-compiled CUDA kernels | MXFP4 models need pre-compiled kernels; rebuilding takes 10+ min |
~/.cache/vllm | /root/.cache/vllm | Torch compile cache, model metadata | Speeds up repeated launches |
~/.cache/ccache | /root/.cache/ccache | CUDA compilation cache | Avoids recompilation of custom CUDA ops |
~/.cache/tiktoken_rs | /root/.cache/tiktoken_rs | Harmony encoding vocab (o200k_base.tiktoken) | GPT-OSS tokenizer needs this file; without it, model fails to load |
The first three are mounted automatically by run-recipe.sh. The last two come from the extra_docker_args field in the recipe (or the VLLM_SPARK_EXTRA_DOCKER_ARGS env var in ~/.bashrc).
Container Environment Variables
| Variable | Value | Purpose |
|---|---|---|
VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8 | 1 | Enables FlashInfer MXFP4/MXFP8 MoE kernels |
TIKTOKEN_RS_CACHE_DIR | /root/.cache/tiktoken_rs | Points the Harmony tokenizer to the pre-downloaded vocab |
TIKTOKEN_ENCODINGS_BASE | /workspace/vllm/tiktoken_encodings | Must match the path inside the image build |
Launch
# 1. Drop page cache (Grace Blackwell unified memory — page cache shares GPU pool)
ssh cluster@goldfinger 'sudo sh -c "sync; echo 3 > /proc/sys/vm/drop_caches"'
# 2. Launch with the solo recipe
ssh cluster@goldfinger 'cd ~/spark-vllm-docker && ./run-recipe.sh ~/custom-recipes/openai-gpt-oss-120b.yaml --solo -d'
Wait ~3 minutes for model loading. --solo means single-node mode (no Ray). -d runs detached.
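Rather than guessing at the three minutes, you can poll until the endpoint answers. A standard-library helper — the URL is this post's vLLM address; the helper itself is hypothetical:

```python
import time
import urllib.error
import urllib.request

def wait_for_endpoint(url: str, attempts: int = 30, delay: float = 10.0) -> bool:
    """Poll an HTTP endpoint until it answers 200, e.g. vLLM's /v1/models route."""
    for _ in range(attempts):
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not up yet -- retry after the delay
        time.sleep(delay)
    return False

# Usage against the solo stack:
# wait_for_endpoint("http://192.168.3.203:8000/v1/models")
```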
Verify
curl http://192.168.3.203:8000/v1/models
# Should return: {"data":[{"id":"openai/gpt-oss-120b",...}]}
Layer 2 — LiteLLM and the Harmony Protocol
Post 5 covered LiteLLM setup — the config gotchas, the hosted_vllm/ prefix, host.docker.internal, drop_params. This section goes deeper on what the tool fix callback actually does: parsing the Harmony protocol.
LiteLLM Config (Brief)
File: ~/litellm/litellm-config.yaml
model_list:
  - model_name: openai/gpt-oss-120b
    litellm_params:
      model: hosted_vllm/openai/gpt-oss-120b
      api_base: http://host.docker.internal:8000/v1
      api_key: dummy
litellm_settings:
  drop_params: true
  callbacks:
    - litellm_tool_fix.litellm_tool_fix
general_settings:
  master_key: sk-1234
  store_model_in_db: true
  store_prompts_in_spend_logs: true
See Post 5 for the full breakdown of why hosted_vllm/ and not openai/, why host.docker.internal and not localhost, and how drop_params saves you from hard errors.
Docker Compose
File: ~/litellm/docker-compose.yml
services:
  litellm:
    image: ghcr.io/berriai/litellm:main-latest
    container_name: litellm
    ports:
      - 4000:4000
    volumes:
      - ./litellm-config.yaml:/app/config.yaml
      - ./litellm_tool_fix.py:/app/litellm_tool_fix.py
    command:
      - --config
      - /app/config.yaml
      - --port
      - '4000'
    extra_hosts:
      - host.docker.internal:host-gateway
    environment:
      - LITELLM_MASTER_KEY=sk-1234
      - DATABASE_URL=postgresql://llmproxy:dbpassword9090@litellm-db:5432/litellm
    restart: unless-stopped
    depends_on:
      - litellm-db
  litellm-db:
    image: postgres:16-alpine
    container_name: litellm-db
    environment:
      - POSTGRES_USER=llmproxy
      - POSTGRES_PASSWORD=dbpassword9090
      - POSTGRES_DB=litellm
    volumes:
      - litellm-db-data:/var/lib/postgresql/data
    restart: unless-stopped
volumes:
  litellm-db-data: null
Two things to note: the extra_hosts line maps host.docker.internal to the Docker bridge gateway — this is how LiteLLM reaches vLLM on the host network. And ./litellm_tool_fix.py:/app/litellm_tool_fix.py mounts the callback script so LiteLLM can import it.
Deep Dive: The Harmony Protocol
Post 5 mentioned the tool fix callback at a high level: it patches null descriptions on the way in and translates Harmony tokens on the way out. Here’s what actually happens inside.
The Problem
GPT-OSS uses OpenAI’s “Harmony” protocol for structured output. When the model wants to call a tool, it doesn’t return a function_call JSON object like GPT-4. It emits special tokens embedded in its response text:
<|start|>assistant<|channel|>commentary to=functions.Edit
<|constrain|> json<|message|>{"file_path":"/tmp/foo.py","old_string":"x","new_string":"y"}<|call|>
Claude Code expects Anthropic-format tool_use content blocks:
{
  "type": "tool_use",
  "id": "toolu_abc123",
  "name": "Edit",
  "input": {"file_path": "/tmp/foo.py", "old_string": "x", "new_string": "y"}
}
Without translation, Claude Code sees <|start|>assistant<|channel|>commentary to=functions.Edit... as regular text. It doesn’t know a tool call happened. It can’t execute anything.
The Three Channel Types
The Harmony format uses channels to signal what kind of content is in each message segment:
| Channel | Header Example | Meaning | What the Callback Does |
|---|---|---|---|
| Tool call | commentary to=functions.Edit | Model wants to call a tool | Extracts tool name + JSON params, creates tool_use block |
| Final | final | User-facing response text | Keeps as Anthropic text block |
| Analysis | analysis | Chain-of-thought reasoning | Drops — already captured in thinking block |
A single response can contain multiple channels. The model might think (analysis), explain what it’s doing (final), then call two tools (two tool call channels) — all in one response, delimited by Harmony tokens.
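To make the channel handling concrete, here's a simplified sketch of the parse step — illustrative only, not the actual litellm_tool_fix.py, and the regex covers just the token shapes shown above:

```python
import json
import re

# Matches one Harmony segment: channel header, optional tool target, body.
SEGMENT = re.compile(
    r"<\|start\|>assistant<\|channel\|>(?P<channel>\w+)"
    r"(?:\s+to=functions\.(?P<tool>\w+))?"
    r".*?<\|message\|>(?P<body>.*?)(?:<\|call\|>|<\|end\|>|$)",
    re.DOTALL,
)

def parse_harmony(text: str) -> list[dict]:
    """Translate Harmony segments into Anthropic-style content blocks."""
    blocks = []
    for m in SEGMENT.finditer(text):
        channel, tool, body = m.group("channel"), m.group("tool"), m.group("body")
        if channel == "commentary" and tool:      # tool call channel
            blocks.append({"type": "tool_use", "name": tool, "input": json.loads(body)})
        elif channel == "final":                  # user-facing text
            blocks.append({"type": "text", "text": body})
        # "analysis" segments are dropped -- reasoning is captured elsewhere
    return blocks
```

Feeding it a response with an analysis segment, a tool call, and a final segment yields exactly two blocks: one tool_use, one text.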
Tool Name Remapping
GPT-OSS doesn’t always use the same tool names as Claude Code. The callback remaps:
| GPT-OSS Sends | Remapped To | Why |
|---|---|---|
Edit / EditFile | MultiEdit | Claude Code uses MultiEdit for file edits |
CreateFile / create_file | Write | Claude Code uses Write for file creation |
Parameter Format Conversion
When remapping Edit → MultiEdit, the parameters also need restructuring:
// GPT-OSS sends (Edit format):
{"file_path": "foo.py", "old_string": "x", "new_string": "y"}
// Callback converts to (MultiEdit format):
{"file_path": "foo.py", "edits": [{"old_text": "x", "new_text": "y"}]}
Without this conversion, Claude Code receives a MultiEdit tool call with the wrong parameter shape and the edit fails.
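The remap-and-restructure step is mechanical. A hedged sketch — tool names from the table above; the real callback likely handles more cases:

```python
# Tool name remapping from the table above.
TOOL_REMAP = {
    "Edit": "MultiEdit",
    "EditFile": "MultiEdit",
    "CreateFile": "Write",
    "create_file": "Write",
}

def remap_tool_call(name: str, params: dict) -> tuple[str, dict]:
    """Remap the tool name, restructuring Edit-style params into MultiEdit shape."""
    new_name = TOOL_REMAP.get(name, name)
    if new_name == "MultiEdit" and "old_string" in params:
        params = {
            "file_path": params["file_path"],
            "edits": [{"old_text": params["old_string"],
                       "new_text": params["new_string"]}],
        }
    return new_name, params
```

Tools that need no remapping pass through unchanged.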
stop_reason Override
When tool calls are detected, the callback sets stop_reason to "tool_use". Claude Code checks this field to decide whether to execute tools. Without the override, it treats the response as a normal text reply — the tool calls are there in the content blocks, but Claude Code never executes them.
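The override itself is a one-liner over the parsed content blocks. A sketch — hypothetical helper, mirroring the behavior described above:

```python
def finalize_stop_reason(content_blocks: list[dict]) -> str:
    """Signal tool execution to the client whenever a tool_use block is present."""
    if any(b.get("type") == "tool_use" for b in content_blocks):
        return "tool_use"
    return "end_turn"
```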
The Full Flow
Putting it all together:
- Pre-call hook: patches null tool descriptions to empty strings (vLLM rejects null).
- Request goes to vLLM, which returns text with Harmony tokens.
- Post-call hook: scans for <|start|>, <|channel|>, <|message|>, <|call|> tokens.
- Parses each channel — tool calls become tool_use blocks, final text stays as text blocks, analysis gets dropped.
- Remaps tool names (Edit → MultiEdit, CreateFile → Write).
- Converts parameter formats to match the remapped tool’s expected shape.
- Strips any leaked Harmony tokens from thinking blocks.
- Sets stop_reason to "tool_use" when tool calls are present.
- If the primary Harmony regex fails (partially stripped tokens), a fallback regex tries to extract to=functions.ToolName patterns.
This is the glue that makes agentic usage possible. Without it, Claude Code can chat with the local model but can’t use tools — which means it can’t edit files, run commands, or do anything an agentic coding assistant actually does.
Launch and Verify
# Launch
ssh cluster@goldfinger 'cd ~/litellm && sudo docker compose up -d'
# Verify the API
curl -H "Authorization: Bearer sk-1234" http://192.168.3.203:4000/v1/models
# Quick inference test
curl http://192.168.3.203:4000/v1/chat/completions \
-H "Authorization: Bearer sk-1234" \
-H "Content-Type: application/json" \
-d '{"model":"openai/gpt-oss-120b","messages":[{"role":"user","content":"Say hello"}],"max_tokens":50}'
# Check tool fix loaded (look for [OSSToolCallFix] lines)
ssh cluster@goldfinger 'sudo docker logs litellm --tail 20'
Layer 3 — Claude Code and the Agent SDK
This is where it pays off. Two env-var scripts. One for PowerShell, one for Bash.
The Launch Scripts
PowerShell (claude-gptoss.ps1):
# Usage:
# .\claude-gptoss.ps1 # default model
# .\claude-gptoss.ps1 openai/gpt-oss-120b # explicit model
# .\claude-gptoss.ps1 openai/gpt-oss-120b http://192.168.3.203:4000 # explicit URL
param(
[string]$Model = "openai/gpt-oss-120b",
[string]$BaseUrl = "http://192.168.3.203:4000"
)
$env:ANTHROPIC_BASE_URL = $BaseUrl
$env:ANTHROPIC_AUTH_TOKEN = "sk-1234"
$env:ANTHROPIC_MODEL = $Model
$env:ANTHROPIC_SMALL_FAST_MODEL = $Model
$env:ANTHROPIC_DEFAULT_HAIKU_MODEL = $Model
$env:ANTHROPIC_DEFAULT_SONNET_MODEL = $Model
$env:ANTHROPIC_DEFAULT_OPUS_MODEL = $Model
$env:CLAUDE_CODE_ATTRIBUTION_HEADER = "0"
$env:CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC = "1"
$env:LOG_LEVEL = "debug"
claude
Bash (claude-gptoss.sh):
#!/usr/bin/env bash
# Usage:
# ./claude-gptoss.sh # default model
# ./claude-gptoss.sh openai/gpt-oss-120b # explicit model
# ./claude-gptoss.sh openai/gpt-oss-120b http://192.168.3.203:4000 # explicit URL
DEFAULT_MODEL="openai/gpt-oss-120b"
DEFAULT_BASE_URL="http://192.168.3.203:4000"
export ANTHROPIC_BASE_URL="${2:-$DEFAULT_BASE_URL}"
export ANTHROPIC_AUTH_TOKEN="sk-1234"
export ANTHROPIC_MODEL="${1:-$DEFAULT_MODEL}"
export ANTHROPIC_SMALL_FAST_MODEL="${1:-$DEFAULT_MODEL}"
export ANTHROPIC_DEFAULT_HAIKU_MODEL="${1:-$DEFAULT_MODEL}"
export ANTHROPIC_DEFAULT_SONNET_MODEL="${1:-$DEFAULT_MODEL}"
export ANTHROPIC_DEFAULT_OPUS_MODEL="${1:-$DEFAULT_MODEL}"
export CLAUDE_CODE_ATTRIBUTION_HEADER=0
claude
The Environment Variables
| Variable | Value | Why |
|---|---|---|
ANTHROPIC_BASE_URL | http://192.168.3.203:4000 | Points at LiteLLM, not vLLM directly. LiteLLM handles the protocol translation. |
ANTHROPIC_AUTH_TOKEN | sk-1234 | The LiteLLM master key. Claude Code sends this as the Bearer token. |
ANTHROPIC_MODEL | openai/gpt-oss-120b | Must match model_name in litellm-config.yaml. |
ANTHROPIC_SMALL_FAST_MODEL | openai/gpt-oss-120b | Claude Code uses different models for different tasks (fast model for simple queries). Locally, there’s only one model — set them all the same. |
ANTHROPIC_DEFAULT_HAIKU_MODEL | openai/gpt-oss-120b | Same reason — override every model slot. |
ANTHROPIC_DEFAULT_SONNET_MODEL | openai/gpt-oss-120b | Same reason. |
ANTHROPIC_DEFAULT_OPUS_MODEL | openai/gpt-oss-120b | Same reason. |
CLAUDE_CODE_ATTRIBUTION_HEADER | 0 | Disables the attribution header that Claude Code normally sends. LiteLLM doesn’t understand it and logs warnings. |
CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC | 1 | Stops telemetry and update checks that would fail against the local endpoint. |
The key insight: Claude Code normally uses a family of models — Haiku for fast tasks, Sonnet for standard work, Opus for complex reasoning. Locally, you have one model. Every model override points at the same openai/gpt-oss-120b. One model does everything.
The Agent SDK Angle
The same environment variables that launch Claude Code against the local model also work for Python agents. Swap claude for uv run python agent.py.
Anthropic SDK — direct API calls:
from anthropic import Anthropic

client = Anthropic(
    base_url="http://192.168.3.203:4000",  # LiteLLM; the SDK appends /v1/messages
    api_key="sk-1234",
)

response = client.messages.create(
    model="openai/gpt-oss-120b",
    max_tokens=1024,
    messages=[{"role": "user", "content": "What is 2+2?"}],
)
print(response.content[0].text)
Claude Agent SDK — agentic usage with tool execution. The SDK reads ANTHROPIC_BASE_URL and ANTHROPIC_AUTH_TOKEN from the environment, so run it from a shell configured by the launch scripts above:

import asyncio

from claude_agent_sdk import ClaudeAgentOptions, query

async def main():
    options = ClaudeAgentOptions(model="openai/gpt-oss-120b")
    async for message in query(
        prompt="List the files in the current directory", options=options
    ):
        print(message)

asyncio.run(main())
The connection chain is identical for both:
Client (Claude Code / Python SDK)
│
├─► POST :4000/v1/messages (Bearer sk-1234)
│
▼
LiteLLM
│
├─► POST :8000/v1/chat/completions (hosted_vllm/ prefix stripped)
│
▼
vLLM
│
├─► Harmony tokens in response text
│
▼
LiteLLM
│
├─► tool_use blocks, stop_reason=tool_use
│
▼
Client
One stack, two clients, same local inference. The model name flows through as: client sends openai/gpt-oss-120b → LiteLLM matches it to model_name in the config → strips the hosted_vllm/ prefix → sends openai/gpt-oss-120b to vLLM.
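The name-matching and prefix-stripping step can be sketched in a few lines — an illustration of the routing described above, not LiteLLM's actual implementation:

```python
# The model_list entry from litellm-config.yaml, as a Python structure.
MODEL_LIST = [
    {"model_name": "openai/gpt-oss-120b",
     "litellm_params": {"model": "hosted_vllm/openai/gpt-oss-120b"}},
]

def upstream_model(client_model: str) -> str:
    """Match the client's model name, then strip the hosted_vllm/ routing prefix."""
    for entry in MODEL_LIST:
        if entry["model_name"] == client_model:
            routed = entry["litellm_params"]["model"]
            return routed.removeprefix("hosted_vllm/")
    raise KeyError(f"no route for {client_model}")
```

The client-facing name and the name vLLM sees happen to be identical here; the hosted_vllm/ prefix exists only to tell LiteLLM which provider adapter to use.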
The Complete Launch Sequence
Start (3 Steps)
# 1. Drop page cache (unified memory — page cache competes with GPU)
ssh cluster@goldfinger 'sudo sh -c "sync; echo 3 > /proc/sys/vm/drop_caches"'
# 2. Launch vLLM (solo mode, detached)
ssh cluster@goldfinger 'cd ~/spark-vllm-docker && ./run-recipe.sh ~/custom-recipes/openai-gpt-oss-120b.yaml --solo -d'
# Wait ~3 min, then verify:
curl http://192.168.3.203:8000/v1/models
# 3. Launch LiteLLM
ssh cluster@goldfinger 'cd ~/litellm && sudo docker compose up -d'
# Verify:
curl -H "Authorization: Bearer sk-1234" http://192.168.3.203:4000/v1/models
Stop
# Stop vLLM
ssh cluster@goldfinger 'sudo docker stop vllm_node && sudo docker rm vllm_node'
# Stop LiteLLM
ssh cluster@goldfinger 'cd ~/litellm && sudo docker compose down'
Verify Everything
# vLLM health
curl http://192.168.3.203:8000/v1/models
# LiteLLM health
curl -H "Authorization: Bearer sk-1234" http://192.168.3.203:4000/v1/models
# Running containers
ssh cluster@goldfinger 'sudo docker ps --format "table {{.Names}}\t{{.Image}}\t{{.Status}}"'
# Tool fix activity
ssh cluster@goldfinger 'sudo docker logs litellm --tail 20'
# vLLM logs
ssh cluster@goldfinger 'sudo docker logs vllm_node --tail 20 2>&1'
File Locations on Node
~/custom-recipes/openai-gpt-oss-120b.yaml # Solo recipe
~/spark-vllm-docker/ # vLLM docker tooling
~/spark-vllm-docker/.env # Non-interactive launch config
~/litellm/litellm-config.yaml # LiteLLM model routing
~/litellm/docker-compose.yml # LiteLLM + Postgres
~/litellm/litellm_tool_fix.py # Harmony → tool_use callback
~/.bashrc # VLLM_SPARK_EXTRA_DOCKER_ARGS
~/.cache/huggingface/ # Model weights
~/.cache/flashinfer/ # FlashInfer JIT kernels
~/.cache/vllm/ # vLLM compile cache
~/.cache/ccache/ # CUDA compilation cache
~/.cache/tiktoken_rs/ # Harmony encoding vocab
Gotchas and Troubleshooting
Solo-Specific Issues
| Issue | Cause | Fix |
|---|---|---|
| OOM at max_model_len: 131072 | Single-node only has 128 GB; 131K context needs ~114 GB | Use max_model_len: 4096 (the solo recipe already does this) |
| OOM after hours of uptime | Page cache grows, steals from GPU memory pool | Drop cache before launch; a cron job runs every 4h automatically |
| OOM with gpu_memory_utilization above 0.90 | vllm-node-mxfp4 image has higher GPU memory overhead | Keep at 0.70 for solo (recipe default) |
| LiteLLM can’t reach vLLM | Bridge network isolation — localhost = container loopback | Use host.docker.internal in api_base |
| Model name mismatch | openai/ prefix double-strip bug in LiteLLM | Use hosted_vllm/ prefix in litellm-config.yaml |
| Tool calls not working | Missing litellm_tool_fix.py or not mounted in container | Verify the volume mount and callbacks config |
| TIKTOKEN_ENCODINGS_BASE wrong | Path differs between image builds | Check with find / -name "*.tiktoken" inside the container |
| vLLM startup fails silently | Missing cache mounts (especially tiktoken_rs) | Verify all 5 cache directories exist and are mounted |
| --enforce-eager missing | CUDA graph instability on Blackwell with this model | Always include --enforce-eager |
Prerequisites Checklist
Before starting the stack, verify:
- DGX Spark with Docker Engine and NVIDIA Container Toolkit
- spark-vllm-docker cloned at ~/spark-vllm-docker
- vllm-node-mxfp4:latest Docker image built
- openai/gpt-oss-120b weights downloaded to ~/.cache/huggingface/
- All 5 cache directories exist under ~/.cache/
- VLLM_SPARK_EXTRA_DOCKER_ARGS set in ~/.bashrc
- ~/spark-vllm-docker/.env configured for solo mode
- Workstation has SSH access to the DGX Spark
The Reality — What Works and What Doesn’t
What Works Well
I tested with a simple task: build a single-file HTML Tetris game. Not a trivial prompt — it needs game loop logic, collision detection, piece rotation, rendering, keyboard input. A real test of whether the model can produce working code through Claude Code’s tool chain.
It wasn’t one-shot. First attempt had a couple of small bugs — piece rotation clipping through walls, score not updating. But the model iterated on its own output, found the issues, and fixed them. Three rounds of edits and it worked. That’s the pattern: not perfect first try, but capable of self-correction when you let it run.
Tool use works. File edits land correctly. Bash commands execute. Grep finds what it’s looking for. The Harmony-to-tool_use translation from the callback handles the mechanical side — the model just needs to decide what to do, and it decides well enough.
It’s not Opus 4.6. But it’s useful.
What Breaks or Degrades
It’s slower than cloud Claude. Not unusably slow — token generation is fine once it starts — but the overall cycle time for a multi-step task is longer. You feel it.
The bigger gap is precision. Cloud Claude tolerates ambiguous instructions. You can say “clean this up” and it infers what you mean from context. GPT-OSS-120B needs you to be more explicit. “Refactor this function to extract the validation logic into a separate helper” works. “Make this better” doesn’t.
Claude Code skills work — the model can follow structured skill prompts — but they need to be more detailed than what you’d write for cloud models. Less room for the model to fill in gaps. If your skill says “fix the bug,” you’ll get inconsistent results. If it says “the bug is in the collision check on line 42, the boundary condition should be >= not >,” it nails it.
This isn’t a dealbreaker. It’s a different usage pattern. You front-load the thinking into better prompts and let the model handle execution.
Daily Usage Patterns
Honest answer: this is still proof of concept. I’m not daily-driving local Claude Code for production work. Cloud Claude is better and I have access to it.
What I am doing is testing. Running different tasks through the local stack, seeing where the boundaries are, building intuition for what fits and what doesn’t. The Tetris test was one of many. Each test teaches me something about where GPT-OSS-120B sits on the capability curve.
Next step is trying other models through the same stack. The recipe system from Post 8 makes model switching easy — swap the YAML file, relaunch, same LiteLLM config. The infrastructure is model-agnostic. The model is the variable.
Performance Feel
Token speed is good enough — for the right workflow. Interactive back-and-forth where you’re watching every token stream in? The latency adds up and it feels sluggish compared to cloud. You notice.
But agentic work is different. Prepare a detailed skill or a well-scoped prompt, kick it off, let the model work autonomously. Come back to results. In that pattern, token speed barely matters because you’re not sitting there watching. The model grinds through file edits, bash commands, grep searches — all on its own — and the wall-clock time is acceptable because you’re doing something else.
That’s the sweet spot: batch-style agentic tasks with well-prepared prompts. Minimize manual inputs. Maximize autonomous execution. The model works best when you give it a clear target and get out of the way.
Closing — Was It Worth It?
The Arc
I expected some fun and I got a lot of it.
But the real payoff was understanding. Nine posts ago, nvidia-smi couldn’t see its own GPU. The community had the fix — not NVIDIA. Post 2 hit Ollama’s ceiling and showed the hardware deserved a better inference engine. Post 3 went from 40 tokens/second to 3,975. The hardware was never the bottleneck; the software stack was. Post 4 proved it with llama-benchy — proper benchmarking that told a completely different story than the review circuit. Post 5 wired up LiteLLM as the protocol translator that nobody else had documented for this hardware. Post 6 clustered two Sparks with Claude Code helping build the very infrastructure it would later run on. Post 7 showed that neither topology dominates — the right choice depends on the workload. Post 8 turned all that tribal knowledge into one-command recipes.
Now here in Post 9, Claude Code runs against a locally served model on that same infrastructure. From “this GPU doesn’t exist in nvidia-smi” to “agentic AI coding assistant running on it.” That’s the arc.
What I didn’t expect: how much I’d learn about what actually happens when you “just ask” Claude Code to do something. Every log line, every protocol translation, every flag that makes or breaks tool calling. The abstraction is comfortable. Understanding what’s beneath it changes how you use the tool.
The Audiophile Analogy
Bring it home and listen. That was the thesis in article.md, and nine posts later it holds up completely.
Every single post uncovered something that no review, no spec sheet, no benchmark covered. The concurrency discovery in Post 3 — nobody told me the community Docker image was the path to 100x throughput. The Harmony protocol in Post 5 — I had to read the raw token stream to understand why tool calls were failing. The community fixes for nvidia-smi in Post 1 — the official documentation didn’t mention them.
You can’t review your way to this understanding. You have to run it.
Current State
Buy if you’re a builder.
If you enjoy tuning crossovers and measuring room response, this is your hardware. The DGX Spark rewards people who want to understand the stack, tweak the config, write the callbacks, and iterate on the setup. Every post in this series was a tuning session, and each one made the system measurably better.
If you want plug-and-play, wait. The software stack isn’t there yet. vLLM needs community images. LiteLLM needs custom callbacks. Claude Code needs environment variable overrides and detailed skills. It works — this post proves it works — but it’s hands-on work to get there.
The Meta-Moment
In Post 6, Claude Code helped build the cluster infrastructure — the Docker configs, the Ray setup, the networking. Now in Post 9, Claude Code runs on that same infrastructure, served by the stack it helped create. The tool built its own substrate.
That’s not just a fun coincidence. Understanding the full communication chain — from Claude Code’s Anthropic API call, through LiteLLM’s protocol translation, through vLLM’s Harmony tokens, down to the Blackwell GPU — changes how you use the tool. You stop treating it as magic. You start treating it as engineering.
What’s Next
More models. GPT-OSS-120B is the first, not the last. The recipe system makes switching easy — new YAML, same stack. The question shifts from “can it run?” to “what should it run?” Different models for different tasks. Find the ones that fit actual capabilities instead of chasing benchmarks.
More usage patterns. Specific apps and workflows that play to the strengths: agentic batch work, well-scoped autonomous tasks, local inference for sensitive code. Not everything needs to go through the cloud.
Better stack. vLLM is improving fast. LiteLLM keeps adding provider support. The community Docker images get more stable with every release. The rough edges from Post 1 are already smoother than they were three months ago.
Final Reflection
I expected fun. I got understanding.
The spec sheet said datacenter GPU in a small form factor. Hands-on said: yes, but the software stack is where the real work lives. Nine posts of evidence, from a blind nvidia-smi to Claude Code agent swarms running on local hardware. Same conclusion as article.md, earned the hard way: bring it home and listen.