9/9 Claude Code with Local Models: The Series Finale
Previous: The Recipe System: One Command, Zero Flag Archaeology
Post 1 started with nvidia-smi blind to its own hardware. Post 2 showed Ollama’s ceiling — good enough to start, not enough to stay. Post 3 got vLLM running at 3,975 t/s. Post 4 put those numbers under the microscope with llama-benchy. Post 5 wired up LiteLLM as the translation layer. Post 6 clustered two Sparks with Claude Code helping build the infrastructure. Post 7 benchmarked which topology wins for which workload. Post 8 wrapped the flag archaeology into one-command recipes.
This post is the payoff: Claude Code — and Python agents via the Anthropic SDK — running against a locally served GPT-OSS-120B on the DGX Spark.
Datacenter specs in a garage rack, serving local LLMs to Claude Code agent swarms. That’s sci-fi running in my garage.
The Three-Layer Stack
Three components. Three Docker containers (two for LiteLLM, one for vLLM). One DGX Spark.
Claude Code (Windows workstation)
│
▼ port 4000
LiteLLM (Docker bridge network)
│
▼ host.docker.internal:8000
vLLM (Docker host network)
│
▼ GPU
Blackwell GB10 — GPT-OSS-120B MXFP4
vLLM loads GPT-OSS-120B onto the Blackwell GPU with MXFP4 quantization and exposes an OpenAI-compatible endpoint on port 8000. Post 3 covers the community Docker image, Post 8 covers the recipe that configures it.
LiteLLM translates between Anthropic’s Messages API (what Claude Code speaks) and OpenAI’s Chat Completions API (what vLLM speaks). Critically, it runs the litellm_tool_fix.py callback that makes tool calling work. Post 5 covers the config gotchas and the basics.
Claude Code / Agent SDK is the client. Point ANTHROPIC_BASE_URL at LiteLLM and the local model appears as if it were the Anthropic API.
Single-node means no Ray, no CX7 networking, no second node. Everything on one Spark.
How Requests Flow
The request chain has more going on than “client → proxy → server.” The tool fix callback intercepts traffic in both directions — patching requests on the way in, translating responses on the way out.
Without the tool fix callback, Claude Code sees raw Harmony tokens instead of tool calls. It can’t execute tools. It can’t edit files. It can’t do anything agentic.
Layer 1 — vLLM with the Solo Recipe
Post 8 covers the recipe system in depth. But the solo GPT-OSS-120B recipe is worth showing here because it uses the full command template format with {placeholders} — a more complete recipe than the simplified examples in Post 8.
The Solo Recipe
File: ~/custom-recipes/openai-gpt-oss-120b.yaml
recipe_version: "1"
name: OpenAI GPT-OSS 120B
description: vLLM serving openai/gpt-oss-120b with MXFP4 quantization and FlashInfer
model: openai/gpt-oss-120b
container: vllm-node-mxfp4
build_args:
  - --exp-mxfp4
mods: []
defaults:
  port: 8000
  host: 0.0.0.0
  tensor_parallel: 1
  gpu_memory_utilization: 0.70
  max_model_len: 4096
  max_num_batched_tokens: 32768
env:
  VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8: "1"
  TIKTOKEN_RS_CACHE_DIR: /root/.cache/tiktoken_rs
  TIKTOKEN_ENCODINGS_BASE: /workspace/vllm/tiktoken_encodings
extra_docker_args: >-
  -v /home/cluster/.cache/tiktoken_rs:/root/.cache/tiktoken_rs
  -v /home/cluster/.cache/ccache:/root/.cache/ccache
command: |
  vllm serve openai/gpt-oss-120b \
    --tool-call-parser openai \
    --reasoning-parser openai_gptoss \
    --enable-auto-tool-choice \
    --tensor-parallel-size {tensor_parallel} \
    --gpu-memory-utilization {gpu_memory_utilization} \
    --enforce-eager \
    --enable-prefix-caching \
    --load-format fastsafetensors \
    --quantization mxfp4 \
    --mxfp4-backend CUTLASS \
    --mxfp4-layers moe,qkv,o,lm_head \
    --attention-backend FLASHINFER \
    --kv-cache-dtype fp8 \
    --max-model-len {max_model_len} \
    --max-num-batched-tokens {max_num_batched_tokens} \
    --host {host} \
    --port {port}
Compared to the simplified format in Post 8, this recipe adds recipe_version, build_args, env, extra_docker_args, and the full command template where {placeholders} are substituted from defaults at launch time. The command block gives you explicit control over the entire vllm serve invocation — useful when you need flags that don’t map to the simplified args format.
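Conceptually, launch-time substitution is plain string formatting: each {placeholder} in the command template is filled from the defaults block. A minimal Python sketch of the idea — run-recipe.sh's actual mechanism may differ:

```python
# Values from the recipe's defaults block.
defaults = {
    "port": 8000,
    "host": "0.0.0.0",
    "tensor_parallel": 1,
    "gpu_memory_utilization": 0.70,
    "max_model_len": 4096,
    "max_num_batched_tokens": 32768,
}

# The {placeholder} syntax maps directly onto str.format().
# Abbreviated template for illustration; the real recipe has many more flags.
template = (
    "vllm serve openai/gpt-oss-120b "
    "--tensor-parallel-size {tensor_parallel} "
    "--max-model-len {max_model_len} "
    "--host {host} --port {port}"
)
command = template.format(**defaults)
print(command)
```

Overriding a default at launch time is then just a dict update before the `format()` call.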
Key Flags
| Flag | Value | Why |
|---|---|---|
| tensor_parallel | 1 | Single GPU — no distributed inference |
| max_model_len | 4096 | The model supports 131K, but that OOMs on 128 GB. 4096 is safe for solo. |
| gpu_memory_utilization | 0.70 | Conservative. Grace Blackwell unified memory means CPU page cache shares the GPU memory pool. Higher values risk OOM during long sessions. |
| quantization | mxfp4 | Microscaling FP4. Makes the 120B model fit in 128 GB. |
| mxfp4-layers | moe,qkv,o,lm_head | Which layers to quantize. Including lm_head gives ~20% decode speedup. |
| enforce-eager | — | Disables CUDA graphs. Required for stability on Blackwell with this model. |
| kv-cache-dtype | fp8 | Halves KV cache memory vs FP16, allowing more concurrent requests. |
| load-format | fastsafetensors | GPU-direct model loading. ~3 min startup vs ~5 min. |
| attention-backend | FLASHINFER | Optimized attention kernels for Blackwell. |
| tool-call-parser | openai | Enables OpenAI-format tool calling. |
| reasoning-parser | openai_gptoss | Parses GPT-OSS reasoning/chain-of-thought output. |
| enable-auto-tool-choice | — | Lets the model decide when to call tools. |
The Five Cache Mounts
The vLLM container needs five bind-mounts to access cached data. Missing any of them causes anything from slow startups to hard failures.
| Host Path | Container Path | What’s Inside | Why It Matters |
|---|---|---|---|
~/.cache/huggingface | /root/.cache/huggingface | Model weights (~60 GB) | Without this, vLLM re-downloads the entire model on every launch |
~/.cache/flashinfer | /root/.cache/flashinfer | FlashInfer JIT-compiled CUDA kernels | MXFP4 models need pre-compiled kernels; rebuilding takes 10+ min |
~/.cache/vllm | /root/.cache/vllm | Torch compile cache, model metadata | Speeds up repeated launches |
~/.cache/ccache | /root/.cache/ccache | CUDA compilation cache | Avoids recompilation of custom CUDA ops |
~/.cache/tiktoken_rs | /root/.cache/tiktoken_rs | Harmony encoding vocab (o200k_base.tiktoken) | GPT-OSS tokenizer needs this file; without it, model fails to load |
The first three are mounted automatically by run-recipe.sh. The last two come from the extra_docker_args field in the recipe (or the VLLM_SPARK_EXTRA_DOCKER_ARGS env var in ~/.bashrc).
Container Environment Variables
| Variable | Value | Purpose |
|---|---|---|
VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8 | 1 | Enables FlashInfer MXFP4/MXFP8 MoE kernels |
TIKTOKEN_RS_CACHE_DIR | /root/.cache/tiktoken_rs | Points the Harmony tokenizer to the pre-downloaded vocab |
TIKTOKEN_ENCODINGS_BASE | /workspace/vllm/tiktoken_encodings | Must match the path inside the image build |
Launch
# 1. Drop page cache (Grace Blackwell unified memory — page cache shares GPU pool)
ssh cluster@goldfinger 'sudo sh -c "sync; echo 3 > /proc/sys/vm/drop_caches"'
# 2. Launch with the solo recipe
ssh cluster@goldfinger 'cd ~/spark-vllm-docker && ./run-recipe.sh ~/custom-recipes/openai-gpt-oss-120b.yaml --solo -d'
Wait ~3 minutes for model loading. --solo means single-node mode (no Ray). -d runs detached.
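Rather than guessing at the three minutes, you can poll until the endpoint answers. A standard-library helper — the URL is this post's vLLM address; the helper itself is hypothetical:

```python
import time
import urllib.error
import urllib.request

def wait_for_endpoint(url: str, attempts: int = 30, delay: float = 10.0) -> bool:
    """Poll an HTTP endpoint until it answers 200, e.g. vLLM's /v1/models route."""
    for _ in range(attempts):
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not up yet -- retry after the delay
        time.sleep(delay)
    return False

# Usage against the solo stack:
# wait_for_endpoint("http://192.168.3.203:8000/v1/models")
```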
Verify
curl http://192.168.3.203:8000/v1/models
# Should return: {"data":[{"id":"openai/gpt-oss-120b",...}]}
Layer 2 — LiteLLM and the Harmony Protocol
Post 5 covered LiteLLM setup — the config gotchas, the hosted_vllm/ prefix, host.docker.internal, drop_params. This section goes deeper on what the tool fix callback actually does: parsing the Harmony protocol.
LiteLLM Config (Brief)
File: ~/litellm/litellm-config.yaml
model_list:
  - model_name: openai/gpt-oss-120b
    litellm_params:
      model: hosted_vllm/openai/gpt-oss-120b
      api_base: http://host.docker.internal:8000/v1
      api_key: dummy
litellm_settings:
  drop_params: true
  callbacks:
    - litellm_tool_fix.litellm_tool_fix
general_settings:
  master_key: sk-1234
  store_model_in_db: true
  store_prompts_in_spend_logs: true
See Post 5 for the full breakdown of why hosted_vllm/ and not openai/, why host.docker.internal and not localhost, and how drop_params saves you from hard errors.
Docker Compose
File: ~/litellm/docker-compose.yml
services:
  litellm:
    image: ghcr.io/berriai/litellm:main-latest
    container_name: litellm
    ports:
      - 4000:4000
    volumes:
      - ./litellm-config.yaml:/app/config.yaml
      - ./litellm_tool_fix.py:/app/litellm_tool_fix.py
    command:
      - --config
      - /app/config.yaml
      - --port
      - '4000'
    extra_hosts:
      - host.docker.internal:host-gateway
    environment:
      - LITELLM_MASTER_KEY=sk-1234
      - DATABASE_URL=postgresql://llmproxy:dbpassword9090@litellm-db:5432/litellm
    restart: unless-stopped
    depends_on:
      - litellm-db
  litellm-db:
    image: postgres:16-alpine
    container_name: litellm-db
    environment:
      - POSTGRES_USER=llmproxy
      - POSTGRES_PASSWORD=dbpassword9090
      - POSTGRES_DB=litellm
    volumes:
      - litellm-db-data:/var/lib/postgresql/data
    restart: unless-stopped
volumes:
  litellm-db-data: null
Two things to note: the extra_hosts line maps host.docker.internal to the Docker bridge gateway — this is how LiteLLM reaches vLLM on the host network. And ./litellm_tool_fix.py:/app/litellm_tool_fix.py mounts the callback script so LiteLLM can import it.
Deep Dive: The Harmony Protocol
Post 5 mentioned the tool fix callback at a high level: it patches null descriptions on the way in and translates Harmony tokens on the way out. Here’s what actually happens inside.
The Problem
GPT-OSS uses OpenAI’s “Harmony” protocol for structured output. When the model wants to call a tool, it doesn’t return a function_call JSON object like GPT-4. It emits special tokens embedded in its response text:
<|start|>assistant<|channel|>commentary to=functions.Edit
<|constrain|> json<|message|>{"file_path":"/tmp/foo.py","old_string":"x","new_string":"y"}<|call|>
Claude Code expects Anthropic-format tool_use content blocks:
{
  "type": "tool_use",
  "id": "toolu_abc123",
  "name": "Edit",
  "input": {"file_path": "/tmp/foo.py", "old_string": "x", "new_string": "y"}
}
Without translation, Claude Code sees <|start|>assistant<|channel|>commentary to=functions.Edit... as regular text. It doesn’t know a tool call happened. It can’t execute anything.
The Three Channel Types
The Harmony format uses channels to signal what kind of content is in each message segment:
| Channel | Header Example | Meaning | What the Callback Does |
|---|---|---|---|
| Tool call | commentary to=functions.Edit | Model wants to call a tool | Extracts tool name + JSON params, creates tool_use block |
| Final | final | User-facing response text | Keeps as Anthropic text block |
| Analysis | analysis | Chain-of-thought reasoning | Drops — already captured in thinking block |
A single response can contain multiple channels. The model might think (analysis), explain what it’s doing (final), then call two tools (two tool call channels) — all in one response, delimited by Harmony tokens.
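To make the channel handling concrete, here's a simplified sketch of the parse step — illustrative only, not the actual litellm_tool_fix.py, and the regex covers just the token shapes shown above:

```python
import json
import re

# Matches one Harmony segment: channel header, optional tool target, body.
SEGMENT = re.compile(
    r"<\|start\|>assistant<\|channel\|>(?P<channel>\w+)"
    r"(?:\s+to=functions\.(?P<tool>\w+))?"
    r".*?<\|message\|>(?P<body>.*?)(?:<\|call\|>|<\|end\|>|$)",
    re.DOTALL,
)

def parse_harmony(text: str) -> list[dict]:
    """Translate Harmony segments into Anthropic-style content blocks."""
    blocks = []
    for m in SEGMENT.finditer(text):
        channel, tool, body = m.group("channel"), m.group("tool"), m.group("body")
        if channel == "commentary" and tool:      # tool call channel
            blocks.append({"type": "tool_use", "name": tool, "input": json.loads(body)})
        elif channel == "final":                  # user-facing text
            blocks.append({"type": "text", "text": body})
        # "analysis" segments are dropped -- reasoning is captured elsewhere
    return blocks
```

Feeding it a response with an analysis segment, a tool call, and a final segment yields exactly two blocks: one tool_use, one text.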
Tool Name Remapping
GPT-OSS doesn’t always use the same tool names as Claude Code. The callback remaps:
| GPT-OSS Sends | Remapped To | Why |
|---|---|---|
Edit / EditFile | MultiEdit | Claude Code uses MultiEdit for file edits |
CreateFile / create_file | Write | Claude Code uses Write for file creation |
Parameter Format Conversion
When remapping Edit → MultiEdit, the parameters also need restructuring:
// GPT-OSS sends (Edit format):
{"file_path": "foo.py", "old_string": "x", "new_string": "y"}
// Callback converts to (MultiEdit format):
{"file_path": "foo.py", "edits": [{"old_text": "x", "new_text": "y"}]}
Without this conversion, Claude Code receives a MultiEdit tool call with the wrong parameter shape and the edit fails.
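The remap-and-restructure step is mechanical. A hedged sketch — tool names from the table above; the real callback likely handles more cases:

```python
# Tool name remapping from the table above.
TOOL_REMAP = {
    "Edit": "MultiEdit",
    "EditFile": "MultiEdit",
    "CreateFile": "Write",
    "create_file": "Write",
}

def remap_tool_call(name: str, params: dict) -> tuple[str, dict]:
    """Remap the tool name, restructuring Edit-style params into MultiEdit shape."""
    new_name = TOOL_REMAP.get(name, name)
    if new_name == "MultiEdit" and "old_string" in params:
        params = {
            "file_path": params["file_path"],
            "edits": [{"old_text": params["old_string"],
                       "new_text": params["new_string"]}],
        }
    return new_name, params
```

Tools that need no remapping pass through unchanged.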
stop_reason Override
When tool calls are detected, the callback sets stop_reason to "tool_use". Claude Code checks this field to decide whether to execute tools. Without the override, it treats the response as a normal text reply — the tool calls are there in the content blocks, but Claude Code never executes them.
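The override itself is a one-liner over the parsed content blocks. A sketch — hypothetical helper, mirroring the behavior described above:

```python
def finalize_stop_reason(content_blocks: list[dict]) -> str:
    """Signal tool execution to the client whenever a tool_use block is present."""
    if any(b.get("type") == "tool_use" for b in content_blocks):
        return "tool_use"
    return "end_turn"
```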
The Full Flow
Putting it all together:
- Pre-call hook: patches null tool descriptions to empty strings (vLLM rejects null).
- Request goes to vLLM, which returns text with Harmony tokens.
- Post-call hook: scans for <|start|>, <|channel|>, <|message|>, <|call|> tokens.
- Parses each channel — tool calls become tool_use blocks, final text stays as text blocks, analysis gets dropped.
- Remaps tool names (Edit → MultiEdit, CreateFile → Write).
- Converts parameter formats to match the remapped tool’s expected shape.
- Strips any leaked Harmony tokens from thinking blocks.
- Sets stop_reason to "tool_use" when tool calls are present.
- If the primary Harmony regex fails (partially stripped tokens), a fallback regex tries to extract to=functions.ToolName patterns.
This is the glue that makes agentic usage possible. Without it, Claude Code can chat with the local model but can’t use tools — which means it can’t edit files, run commands, or do anything an agentic coding assistant actually does.
Launch and Verify
# Launch
ssh cluster@goldfinger 'cd ~/litellm && sudo docker compose up -d'
# Verify the API
curl -H "Authorization: Bearer sk-1234" http://192.168.3.203:4000/v1/models
# Quick inference test
curl http://192.168.3.203:4000/v1/chat/completions \
-H "Authorization: Bearer sk-1234" \
-H "Content-Type: application/json" \
-d '{"model":"openai/gpt-oss-120b","messages":[{"role":"user","content":"Say hello"}],"max_tokens":50}'
# Check tool fix loaded (look for [OSSToolCallFix] lines)
ssh cluster@goldfinger 'sudo docker logs litellm --tail 20'
Layer 3 — Claude Code and the Agent SDK
This is where it pays off. Two env-var scripts. One for PowerShell, one for Bash.
The Launch Scripts
PowerShell (claude-gptoss.ps1):
# Usage:
# .\claude-gptoss.ps1 # default model
# .\claude-gptoss.ps1 openai/gpt-oss-120b # explicit model
# .\claude-gptoss.ps1 openai/gpt-oss-120b http://192.168.3.203:4000 # explicit URL
param(
[string]$Model = "openai/gpt-oss-120b",
[string]$BaseUrl = "http://192.168.3.203:4000"
)
$env:ANTHROPIC_BASE_URL = $BaseUrl
$env:ANTHROPIC_AUTH_TOKEN = "sk-1234"
$env:ANTHROPIC_MODEL = $Model
$env:ANTHROPIC_SMALL_FAST_MODEL = $Model
$env:ANTHROPIC_DEFAULT_HAIKU_MODEL = $Model
$env:ANTHROPIC_DEFAULT_SONNET_MODEL = $Model
$env:ANTHROPIC_DEFAULT_OPUS_MODEL = $Model
$env:CLAUDE_CODE_ATTRIBUTION_HEADER = "0"
$env:CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC = "1"
$env:LOG_LEVEL = "debug"
claude
Bash (claude-gptoss.sh):
#!/usr/bin/env bash
# Usage:
# ./claude-gptoss.sh # default model
# ./claude-gptoss.sh openai/gpt-oss-120b # explicit model
# ./claude-gptoss.sh openai/gpt-oss-120b http://192.168.3.203:4000 # explicit URL
DEFAULT_MODEL="openai/gpt-oss-120b"
DEFAULT_BASE_URL="http://192.168.3.203:4000"
export ANTHROPIC_BASE_URL="${2:-$DEFAULT_BASE_URL}"
export ANTHROPIC_AUTH_TOKEN="sk-1234"
export ANTHROPIC_MODEL="${1:-$DEFAULT_MODEL}"
export ANTHROPIC_SMALL_FAST_MODEL="${1:-$DEFAULT_MODEL}"
export ANTHROPIC_DEFAULT_HAIKU_MODEL="${1:-$DEFAULT_MODEL}"
export ANTHROPIC_DEFAULT_SONNET_MODEL="${1:-$DEFAULT_MODEL}"
export ANTHROPIC_DEFAULT_OPUS_MODEL="${1:-$DEFAULT_MODEL}"
export CLAUDE_CODE_ATTRIBUTION_HEADER=0
claude
The Environment Variables
| Variable | Value | Why |
|---|---|---|
ANTHROPIC_BASE_URL | http://192.168.3.203:4000 | Points at LiteLLM, not vLLM directly. LiteLLM handles the protocol translation. |
ANTHROPIC_AUTH_TOKEN | sk-1234 | The LiteLLM master key. Claude Code sends this as the Bearer token. |
ANTHROPIC_MODEL | openai/gpt-oss-120b | Must match model_name in litellm-config.yaml. |
ANTHROPIC_SMALL_FAST_MODEL | openai/gpt-oss-120b | Claude Code uses different models for different tasks (fast model for simple queries). Locally, there’s only one model — set them all the same. |
ANTHROPIC_DEFAULT_HAIKU_MODEL | openai/gpt-oss-120b | Same reason — override every model slot. |
ANTHROPIC_DEFAULT_SONNET_MODEL | openai/gpt-oss-120b | Same reason. |
ANTHROPIC_DEFAULT_OPUS_MODEL | openai/gpt-oss-120b | Same reason. |
CLAUDE_CODE_ATTRIBUTION_HEADER | 0 | Disables the attribution header that Claude Code normally sends. LiteLLM doesn’t understand it and logs warnings. |
CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC | 1 | Stops telemetry and update checks that would fail against the local endpoint. |
The key insight: Claude Code normally uses a family of models — Haiku for fast tasks, Sonnet for standard work, Opus for complex reasoning. Locally, you have one model. Every model override points at the same openai/gpt-oss-120b. One model does everything.
The Agent SDK Angle
The same environment variables that launch Claude Code against the local model also work for Python agents. Swap claude for uv run python agent.py.
Anthropic SDK — direct API calls:
from anthropic import Anthropic

client = Anthropic(
    base_url="http://192.168.3.203:4000",  # LiteLLM; the SDK appends /v1/messages
    api_key="sk-1234",
)

response = client.messages.create(
    model="openai/gpt-oss-120b",
    max_tokens=1024,
    messages=[{"role": "user", "content": "What is 2+2?"}],
)
print(response.content[0].text)
Claude Agent SDK — agentic usage with tool execution. The SDK reads ANTHROPIC_BASE_URL and ANTHROPIC_AUTH_TOKEN from the environment, so run it from a shell configured by the launch scripts above:

import asyncio

from claude_agent_sdk import ClaudeAgentOptions, query

async def main():
    options = ClaudeAgentOptions(model="openai/gpt-oss-120b")
    async for message in query(
        prompt="List the files in the current directory", options=options
    ):
        print(message)

asyncio.run(main())
The connection chain is identical for both:
Client (Claude Code / Python SDK)
│
├─► POST :4000/v1/messages (Bearer sk-1234)
│
▼
LiteLLM
│
├─► POST :8000/v1/chat/completions (hosted_vllm/ prefix stripped)
│
▼
vLLM
│
├─► Harmony tokens in response text
│
▼
LiteLLM
│
├─► tool_use blocks, stop_reason=tool_use
│
▼
Client
One stack, two clients, same local inference. The model name flows through as: client sends openai/gpt-oss-120b → LiteLLM matches it to model_name in the config → strips the hosted_vllm/ prefix → sends openai/gpt-oss-120b to vLLM.
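The name-matching and prefix-stripping step can be sketched in a few lines — an illustration of the routing described above, not LiteLLM's actual implementation:

```python
# The model_list entry from litellm-config.yaml, as a Python structure.
MODEL_LIST = [
    {"model_name": "openai/gpt-oss-120b",
     "litellm_params": {"model": "hosted_vllm/openai/gpt-oss-120b"}},
]

def upstream_model(client_model: str) -> str:
    """Match the client's model name, then strip the hosted_vllm/ routing prefix."""
    for entry in MODEL_LIST:
        if entry["model_name"] == client_model:
            routed = entry["litellm_params"]["model"]
            return routed.removeprefix("hosted_vllm/")
    raise KeyError(f"no route for {client_model}")
```

The client-facing name and the name vLLM sees happen to be identical here; the hosted_vllm/ prefix exists only to tell LiteLLM which provider adapter to use.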
The Complete Launch Sequence
Start (3 Steps)
# 1. Drop page cache (unified memory — page cache competes with GPU)
ssh cluster@goldfinger 'sudo sh -c "sync; echo 3 > /proc/sys/vm/drop_caches"'
# 2. Launch vLLM (solo mode, detached)
ssh cluster@goldfinger 'cd ~/spark-vllm-docker && ./run-recipe.sh ~/custom-recipes/openai-gpt-oss-120b.yaml --solo -d'
# Wait ~3 min, then verify:
curl http://192.168.3.203:8000/v1/models
# 3. Launch LiteLLM
ssh cluster@goldfinger 'cd ~/litellm && sudo docker compose up -d'
# Verify:
curl -H "Authorization: Bearer sk-1234" http://192.168.3.203:4000/v1/models
Stop
# Stop vLLM
ssh cluster@goldfinger 'sudo docker stop vllm_node && sudo docker rm vllm_node'
# Stop LiteLLM
ssh cluster@goldfinger 'cd ~/litellm && sudo docker compose down'
Verify Everything
# vLLM health
curl http://192.168.3.203:8000/v1/models
# LiteLLM health
curl -H "Authorization: Bearer sk-1234" http://192.168.3.203:4000/v1/models
# Running containers
ssh cluster@goldfinger 'sudo docker ps --format "table {{.Names}}\t{{.Image}}\t{{.Status}}"'
# Tool fix activity
ssh cluster@goldfinger 'sudo docker logs litellm --tail 20'
# vLLM logs
ssh cluster@goldfinger 'sudo docker logs vllm_node --tail 20 2>&1'
File Locations on Node
~/custom-recipes/openai-gpt-oss-120b.yaml # Solo recipe
~/spark-vllm-docker/ # vLLM docker tooling
~/spark-vllm-docker/.env # Non-interactive launch config
~/litellm/litellm-config.yaml # LiteLLM model routing
~/litellm/docker-compose.yml # LiteLLM + Postgres
~/litellm/litellm_tool_fix.py # Harmony → tool_use callback
~/.bashrc # VLLM_SPARK_EXTRA_DOCKER_ARGS
~/.cache/huggingface/ # Model weights
~/.cache/flashinfer/ # FlashInfer JIT kernels
~/.cache/vllm/ # vLLM compile cache
~/.cache/ccache/ # CUDA compilation cache
~/.cache/tiktoken_rs/ # Harmony encoding vocab
Gotchas and Troubleshooting
Solo-Specific Issues
| Issue | Cause | Fix |
|---|---|---|
| OOM at max_model_len: 131072 | Single-node only has 128 GB; 131K context needs ~114 GB | Use max_model_len: 4096 (the solo recipe already does this) |
| OOM after hours of uptime | Page cache grows, steals from GPU memory pool | Drop cache before launch; a cron job runs every 4h automatically |
| OOM with gpu_memory_utilization above 0.90 | vllm-node-mxfp4 image has higher GPU memory overhead | Keep at 0.70 for solo (recipe default) |
| LiteLLM can’t reach vLLM | Bridge network isolation — localhost = container loopback | Use host.docker.internal in api_base |
| Model name mismatch | openai/ prefix double-strip bug in LiteLLM | Use hosted_vllm/ prefix in litellm-config.yaml |
| Tool calls not working | Missing litellm_tool_fix.py or not mounted in container | Verify the volume mount and callbacks config |
| TIKTOKEN_ENCODINGS_BASE wrong | Path differs between image builds | Check with find / -name "*.tiktoken" inside the container |
| vLLM startup fails silently | Missing cache mounts (especially tiktoken_rs) | Verify all 5 cache directories exist and are mounted |
| --enforce-eager missing | CUDA graph instability on Blackwell with this model | Always include --enforce-eager |
Prerequisites Checklist
Before starting the stack, verify:
- DGX Spark with Docker Engine and NVIDIA Container Toolkit
- spark-vllm-docker cloned at ~/spark-vllm-docker
- vllm-node-mxfp4:latest Docker image built
- openai/gpt-oss-120b weights downloaded to ~/.cache/huggingface/
- All 5 cache directories exist under ~/.cache/
- VLLM_SPARK_EXTRA_DOCKER_ARGS set in ~/.bashrc
- ~/spark-vllm-docker/.env configured for solo mode
- Workstation has SSH access to the DGX Spark
The Reality — What Works and What Doesn’t
What Works Well
I tested with a simple task: build a single-file HTML Tetris game. Not a trivial prompt — it needs game loop logic, collision detection, piece rotation, rendering, keyboard input. A real test of whether the model can produce working code through Claude Code’s tool chain.
It wasn’t one-shot. First attempt had a couple of small bugs — piece rotation clipping through walls, score not updating. But the model iterated on its own output, found the issues, and fixed them. Three rounds of edits and it worked. That’s the pattern: not perfect first try, but capable of self-correction when you let it run.
Tool use works. File edits land correctly. Bash commands execute. Grep finds what it’s looking for. The Harmony-to-tool_use translation from the callback handles the mechanical side — the model just needs to decide what to do, and it decides well enough.
It’s not Opus 4.6. But it’s useful.
What Breaks or Degrades
It’s slower than cloud Claude. Not unusably slow — token generation is fine once it starts — but the overall cycle time for a multi-step task is longer. You feel it.
The bigger gap is precision. Cloud Claude tolerates ambiguous instructions. You can say “clean this up” and it infers what you mean from context. GPT-OSS-120B needs you to be more explicit. “Refactor this function to extract the validation logic into a separate helper” works. “Make this better” doesn’t.
Claude Code skills work — the model can follow structured skill prompts — but they need to be more detailed than what you’d write for cloud models. Less room for the model to fill in gaps. If your skill says “fix the bug,” you’ll get inconsistent results. If it says “the bug is in the collision check on line 42, the boundary condition should be >= not >,” it nails it.
This isn’t a dealbreaker. It’s a different usage pattern. You front-load the thinking into better prompts and let the model handle execution.
Daily Usage Patterns
Honest answer: this is still proof of concept. I’m not daily-driving local Claude Code for production work. Cloud Claude is better and I have access to it.
What I am doing is testing. Running different tasks through the local stack, seeing where the boundaries are, building intuition for what fits and what doesn’t. The Tetris test was one of many. Each test teaches me something about where GPT-OSS-120B sits on the capability curve.
Next step is trying other models through the same stack. The recipe system from Post 8 makes model switching easy — swap the YAML file, relaunch, same LiteLLM config. The infrastructure is model-agnostic. The model is the variable.
Performance Feel
Token speed is good enough — for the right workflow. Interactive back-and-forth where you’re watching every token stream in? The latency adds up and it feels sluggish compared to cloud. You notice.
But agentic work is different. Prepare a detailed skill or a well-scoped prompt, kick it off, let the model work autonomously. Come back to results. In that pattern, token speed barely matters because you’re not sitting there watching. The model grinds through file edits, bash commands, grep searches — all on its own — and the wall-clock time is acceptable because you’re doing something else.
That’s the sweet spot: batch-style agentic tasks with well-prepared prompts. Minimize manual inputs. Maximize autonomous execution. The model works best when you give it a clear target and get out of the way.
Closing — Was It Worth It?
The Arc
I expected some fun and I got a lot of it.
But the real payoff was understanding. Nine posts ago, nvidia-smi couldn’t see its own GPU. The community had the fix — not NVIDIA. Post 2 hit Ollama’s ceiling and showed the hardware deserved a better inference engine. Post 3 went from 40 tokens/second to 3,975. The hardware was never the bottleneck; the software stack was. Post 4 proved it with llama-benchy — proper benchmarking that told a completely different story than the review circuit. Post 5 wired up LiteLLM as the protocol translator that nobody else had documented for this hardware. Post 6 clustered two Sparks with Claude Code helping build the very infrastructure it would later run on. Post 7 showed that neither topology dominates — the right choice depends on the workload. Post 8 turned all that tribal knowledge into one-command recipes.
Now here in Post 9, Claude Code runs against a locally served model on that same infrastructure. From “this GPU doesn’t exist in nvidia-smi” to “agentic AI coding assistant running on it.” That’s the arc.
What I didn’t expect: how much I’d learn about what actually happens when you “just ask” Claude Code to do something. Every log line, every protocol translation, every flag that makes or breaks tool calling. The abstraction is comfortable. Understanding what’s beneath it changes how you use the tool.
The Audiophile Analogy
Bring it home and listen. That was the thesis in article.md, and nine posts later it holds up completely.
Every single post uncovered something that no review, no spec sheet, no benchmark covered. The concurrency discovery in Post 3 — nobody told me the community Docker image was the path to 100x throughput. The Harmony protocol in Post 5 — I had to read the raw token stream to understand why tool calls were failing. The community fixes for nvidia-smi in Post 1 — the official documentation didn’t mention them.
You can’t review your way to this understanding. You have to run it.
Current State
Buy if you’re a builder.
If you enjoy tuning crossovers and measuring room response, this is your hardware. The DGX Spark rewards people who want to understand the stack, tweak the config, write the callbacks, and iterate on the setup. Every post in this series was a tuning session, and each one made the system measurably better.
If you want plug-and-play, wait. The software stack isn’t there yet. vLLM needs community images. LiteLLM needs custom callbacks. Claude Code needs environment variable overrides and detailed skills. It works — this post proves it works — but it’s hands-on work to get there.
The Meta-Moment
In Post 6, Claude Code helped build the cluster infrastructure — the Docker configs, the Ray setup, the networking. Now in Post 9, Claude Code runs on that same infrastructure, served by the stack it helped create. The tool built its own substrate.
That’s not just a fun coincidence. Understanding the full communication chain — from Claude Code’s Anthropic API call, through LiteLLM’s protocol translation, through vLLM’s Harmony tokens, down to the Blackwell GPU — changes how you use the tool. You stop treating it as magic. You start treating it as engineering.
What’s Next
More models. GPT-OSS-120B is the first, not the last. The recipe system makes switching easy — new YAML, same stack. The question shifts from “can it run?” to “what should it run?” Different models for different tasks. Find the ones that fit actual capabilities instead of chasing benchmarks.
More usage patterns. Specific apps and workflows that play to the strengths: agentic batch work, well-scoped autonomous tasks, local inference for sensitive code. Not everything needs to go through the cloud.
Better stack. vLLM is improving fast. LiteLLM keeps adding provider support. The community Docker images get more stable with every release. The rough edges from Post 1 are already smoother than they were three months ago.
Final Reflection
I expected fun. I got understanding.
The spec sheet said datacenter GPU in a small form factor. Hands-on said: yes, but the software stack is where the real work lives. Nine posts of evidence, from a blind nvidia-smi to Claude Code agent swarms running on local hardware. Same conclusion as article.md, earned the hard way: bring it home and listen.