5/9 LiteLLM: The Translation Layer Between Claude Code and Local Models

Previous: Benchmarking Reality: llama-benchy and the Spark Arena

Post 4 proved the hardware works. gpt-oss-120b on the DGX Spark, served by vLLM, hits 4,158 t/s at c64. The next question: how do you actually use it?

Specifically — how do you point Claude Code at a local model that speaks a completely different protocol? Point it directly at vLLM and you get this:

litellm.BadRequestError: Invalid value for 'function.description': expected a string, got null

Claude Code sends Anthropic-format messages. It includes tool calls with null descriptions. It passes Anthropic-specific parameters that vLLM has never heard of. gpt-oss-120b, meanwhile, expects OpenAI-format messages with Harmony-style tool calls. These two don’t speak the same language.

LiteLLM sits in the middle and translates.
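To make the gap concrete, here is a toy sketch of the kind of conversion involved. This is not LiteLLM's actual code; the field names come from the two public APIs (Anthropic's top-level `system` and tool `input_schema` versus OpenAI's system message and `function.parameters`):

```python
def anthropic_to_openai(req: dict) -> dict:
    """Toy converter showing the shape of the translation LiteLLM performs."""
    messages = []
    # Anthropic puts the system prompt in a top-level field;
    # OpenAI expects it as the first message.
    if "system" in req:
        messages.append({"role": "system", "content": req["system"]})
    messages.extend(req["messages"])

    out = {"model": req["model"], "messages": messages,
           "max_tokens": req.get("max_tokens")}

    # Anthropic tools: {name, description, input_schema}
    # OpenAI tools:    {type: "function", function: {name, description, parameters}}
    tools = []
    for t in req.get("tools", []):
        tools.append({
            "type": "function",
            "function": {
                "name": t["name"],
                # Anthropic allows a null description; OpenAI backends expect a string
                "description": t.get("description") or "",
                "parameters": t.get("input_schema", {}),
            },
        })
    if tools:
        out["tools"] = tools
    return out
```

Even this toy version shows where the null-description error comes from: the Anthropic side permits `description: null`, and a faithful passthrough hands vLLM exactly the value it rejects.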


What LiteLLM Does Here

This isn’t a “what is LiteLLM” explainer. LiteLLM supports 140+ providers and 2,600+ models. None of that matters here. What matters is the four specific things it does in this setup:

  1. Message format translation — Converts Anthropic-format requests from Claude Code into OpenAI-format requests that vLLM understands. Converts the responses back.

  2. Parameter dropping — Claude Code and Cursor send Anthropic-specific parameters (top_k, metadata, etc.) that vLLM doesn’t support. With drop_params: true, LiteLLM silently drops them instead of throwing errors.

  3. Tool call patching — A custom callback (litellm_tool_fix.py) fixes two problems: Claude Code sends tool calls with null descriptions that fail vLLM validation, and GPT-OSS models use Harmony-format tool calls that need translation. The callback patches both.

  4. Request/response logging — Every request and response flows through a single point where you can watch it, log it, and debug it.

The architecture is three Docker containers on the head node:

Claude Code  →  LiteLLM :4000  →  vLLM :8000 (gpt-oss-120b)

               PostgreSQL :5432 (logs, keys)
               Prometheus :9090 (metrics)

LiteLLM is the gateway. PostgreSQL stores API keys and usage data — not critical for a single-user setup, but LiteLLM requires it. Prometheus scrapes metrics. All three run as Docker Compose services alongside vLLM on the host.
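A compose file wiring this up might look roughly like the fragment below. This is a hedged sketch, not the deployment's actual file: the image tags, service names, and environment variables are assumptions, though `ghcr.io/berriai/litellm` is the project's published image and `host-gateway` is standard Docker syntax:

```yaml
# Sketch only — service names, tags, and env vars are assumptions.
services:
  litellm:
    image: ghcr.io/berriai/litellm:main-latest
    ports:
      - "4000:4000"
    volumes:
      - ./litellm_config.yaml:/app/config.yaml
      - ./litellm_tool_fix.py:/app/litellm_tool_fix.py  # callback mounted as a volume
    extra_hosts:
      - "host.docker.internal:host-gateway"  # lets the container reach vLLM on the host
    environment:
      DATABASE_URL: "postgresql://llmproxy:dbpassword@postgres:5432/litellm"
    depends_on:
      - postgres
  postgres:
    image: postgres:16
    environment:
      POSTGRES_USER: llmproxy
      POSTGRES_PASSWORD: dbpassword
      POSTGRES_DB: litellm
  prometheus:
    image: prom/prometheus
    ports:
      - "9090:9090"
```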


The Config That Matters

The entire routing configuration is one YAML file. Here’s what it looks like, annotated with the things that will trip you up:

model_list:
  - model_name: "openai/gpt-oss-120b"
    litellm_params:
      model: "hosted_vllm/openai/gpt-oss-120b"   # hosted_vllm/ prefix — NOT openai/
      api_base: "http://host.docker.internal:8000/v1"  # Docker host, not localhost
      api_key: "dummy"

litellm_settings:
  drop_params: true                                # silently drops unsupported params
  callbacks: ["litellm_tool_fix.litellm_tool_fix"] # tool fix callback instance

general_settings:
  master_key: "sk-1234"

Three things here are non-obvious and will cost you hours if you get them wrong:

hosted_vllm/ prefix, not openai/. The model running in vLLM is called openai/gpt-oss-120b. Your instinct is to use the openai/ provider prefix. Don’t. The openai/ provider strips the prefix from the model name before sending it to the backend — so openai/openai/gpt-oss-120b becomes just gpt-oss-120b, and vLLM can’t find it. The hosted_vllm/ provider passes the model name through verbatim.

host.docker.internal, not localhost. LiteLLM runs inside a Docker container. localhost inside the container is the container’s own loopback, not the host machine where vLLM is running. host.docker.internal resolves to the Docker host IP. This is set up via extra_hosts in the docker-compose file.

drop_params: true. Every AI client adds its own parameters. Claude Code sends Anthropic-specific ones. Cursor sends its own. Without drop_params, any parameter vLLM doesn’t recognize causes a hard error. With it, LiteLLM strips unknown parameters and the request goes through.
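The effect is easy to picture as a dict filter. This is an illustration, not LiteLLM's implementation; the real proxy knows each provider's supported parameters from its own mappings rather than a hardcoded set like this one:

```python
# Illustration only: the real supported-parameter list comes from
# LiteLLM's per-provider mappings, not this hardcoded set.
SUPPORTED = {"model", "messages", "temperature", "max_tokens",
             "stream", "tools", "tool_choice"}

def drop_unsupported(request: dict) -> dict:
    """Strip parameters the backend doesn't recognize instead of erroring."""
    return {k: v for k, v in request.items() if k in SUPPORTED}
```

With this behavior, an Anthropic-flavored request carrying `top_k` and `metadata` goes through as a plain OpenAI-compatible request instead of bouncing off vLLM with a 400.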


The Tool Fix Callback

This is the glue that makes tool calls work. It started as a community workaround and evolved into a custom LiteLLM callback we maintain alongside the deployment scripts. No official solution existed — the community built what was missing, same pattern as every post in this series.

Claude Code sends tool definitions with null descriptions — perfectly valid in the Anthropic API, but vLLM’s OpenAI-compatible endpoint rejects them with a validation error. Every time Claude Code tries to use a tool (which is constantly — it’s an agentic coding assistant), the request fails.

GPT-OSS models add a second problem: they use Harmony-format tool calls, which need translation to standard OpenAI format before clients can parse them.

The litellm_tool_fix.py callback intercepts requests and responses to fix both:

  • On the way in: Patches null tool descriptions to empty strings so vLLM accepts them.
  • On the way out: Translates Harmony-format tool call responses so Claude Code can understand them.

The callback is mounted into the LiteLLM container as a volume and referenced in the config as litellm_tool_fix.litellm_tool_fix. That format is module.instance — it must point to a module-level instance of the callback class, not the class itself. Export the class and you get cryptic missing 'self' errors.


Connecting Claude Code

Two environment variables. That’s the entire client-side configuration:

OPENAI_BASE_URL=http://goldfinger:4000/v1
OPENAI_API_KEY=sk-1234

goldfinger is the hostname of the DGX Spark head node. The API key matches the master_key in the LiteLLM config.

That’s it. All the translation complexity — format conversion, null patching, Harmony tool calls, parameter stripping — collapses to two lines on the client side. Claude Code doesn’t know the model is local.
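Before wiring up a client at all, the whole chain can be smoke-tested with a plain OpenAI-style request. The hostname and key below match the config shown earlier; the rest is the standard chat completions payload:

```shell
curl -s http://goldfinger:4000/v1/chat/completions \
  -H "Authorization: Bearer sk-1234" \
  -H "Content-Type: application/json" \
  -d '{"model": "openai/gpt-oss-120b",
       "messages": [{"role": "user", "content": "ping"}]}'
```

If this returns a completion, the LiteLLM-to-vLLM leg works and any remaining failure is on the client side.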


Live Tracking

This is where the proxy pays for itself.

When something breaks, you need to see what’s happening. Not guess. See the actual requests and responses flowing through.

Docker logs show every request in real time:

sudo docker logs litellm --tail 50 -f

You can see what Claude Code sent, what LiteLLM translated it to, what vLLM returned, and where the chain broke. When a tool call fails, you see whether it’s a null description (callback didn’t load), a format mismatch (translation issue), or a model error (vLLM issue). Early on, I spent some time thinking gpt-oss-120b couldn’t handle tool calls — the logs showed the callback wasn’t mounted. Five-second fix once you can see it.

The LiteLLM dashboard at http://goldfinger:4000/ui/ is the other half of the debugging story. It shows every request with full payloads — what Claude Code sent, what the model returned, whether the tool fix callback fired. You also get model health status, spend tracking per key, and callback configuration at a glance. When Docker logs tell you something broke, the dashboard shows you what broke and the exact payload that caused it.

Prometheus at http://goldfinger:9090 scrapes metrics every 15 seconds — request counts, latency, token throughput.
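The scrape side is a few lines of Prometheus config. This fragment is an assumption about the setup, not a copy of its actual file; LiteLLM exposes Prometheus metrics on its own port, here assumed at the default `/metrics` path:

```yaml
# Hypothetical prometheus.yml fragment — job name and target are assumptions.
scrape_configs:
  - job_name: litellm
    scrape_interval: 15s
    metrics_path: /metrics
    static_configs:
      - targets: ["host.docker.internal:4000"]
```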

Verify the callback loaded — worth checking on first deploy:

sudo docker logs litellm 2>&1 | grep -i callback

If the callback didn’t load, tool calls will fail with vLLM validation errors that look like model issues, not callback issues. Check this first.

One more timing note: DB migrations take 30–60s on first start. Health checks fail until they complete. Don’t panic.


Next: Two Sparks, One Cluster — 200 Gbps interconnect, and Claude Code helping build the infrastructure to manage it.