Appendix B. The Napkin Math: Predicting Token Speed from Memory Bandwidth
One formula tells you whether your hardware is working correctly or misconfigured. The community used this to diagnose the 40 t/s problem.
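The formula itself fits in a few lines. Here is a minimal sketch with illustrative numbers; the bandwidth and per-token weight figures below are placeholders, not the series' measured values.

```python
def napkin_tokens_per_sec(bandwidth_gb_s: float, weights_read_gb: float) -> float:
    """Decode is memory-bound: every generated token streams the active
    weights from memory once, so throughput is roughly bandwidth / bytes."""
    return bandwidth_gb_s / weights_read_gb

# Placeholder numbers -- substitute your hardware's measured bandwidth and
# the size of the weights actually read per token (for an MoE model, only
# the active experts count, not the full checkpoint).
bandwidth_gb_s = 273.0   # GB/s, example unified-memory figure
weights_read_gb = 6.0    # GB streamed per decoded token, example figure

ceiling = napkin_tokens_per_sec(bandwidth_gb_s, weights_read_gb)
print(f"napkin ceiling: ~{ceiling:.0f} t/s")
# Measured decode speed far below this ceiling usually points at a
# misconfigured stack (wrong backend, on-the-fly dequantization,
# spilled weights), not slow hardware.
```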
Blackwell's native MX-format support eliminates the dequantization tax — the hardware and gpt-oss-120b were designed for each other.
Everything in the series built toward this: Claude Code running on locally served models. Here's what works, what's rough, and where it's heading.
The recipe system from spark-vllm-docker turns twenty minutes of flag archaeology into one command — and makes everything reproducible.
Neither topology dominates — cluster wins decode at every concurrency level, but 2x Solo wins prefill and TTFT under load. The right choice depends on the workload.
The second DGX Spark arrived. Before writing a single line of config: check firmware. Then cables, SSH, Docker, vLLM, model cache — and Claude Code helping build the skills to manage it all.
Claude Code speaks Anthropic. gpt-oss-120b speaks OpenAI with Harmony-style tool calls. LiteLLM sits in the middle and translates — including a custom callback that patches the tool calls neither side gets right.
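To make the mismatch concrete: this is not the series' actual LiteLLM callback, just a minimal sketch of the shape difference the proxy has to bridge, assuming standard OpenAI-style tool_calls and Anthropic-style tool_use blocks.

```python
import json

def openai_tool_calls_to_anthropic(tool_calls: list[dict]) -> list[dict]:
    """Convert OpenAI-style tool_calls into Anthropic-style tool_use blocks.

    OpenAI encodes arguments as a JSON *string*; Anthropic expects a parsed
    object under "input". Details like this are exactly what a translation
    layer has to patch when neither side emits quite the right thing.
    """
    blocks = []
    for call in tool_calls:
        blocks.append({
            "type": "tool_use",
            "id": call["id"],
            "name": call["function"]["name"],
            "input": json.loads(call["function"]["arguments"] or "{}"),
        })
    return blocks

# Example: one OpenAI-style tool call, as a local model might emit it.
openai_style = [{
    "id": "call_0",
    "type": "function",
    "function": {"name": "read_file", "arguments": '{"path": "README.md"}'},
}]

print(json.dumps(openai_tool_calls_to_anthropic(openai_style), indent=2))
```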
Post 3 hit 4,158 t/s at c64. llama-benchy puts those numbers under the microscope. Same hardware, two tools, 25x difference in TTFT.
Every reviewer tested single-user latency and called the DGX Spark slow. Nobody tested concurrency. The community found the real number: 3,975 tokens per second.
The standard recipe works but wastes the hardware. Scaling from 20B to 120B on Ollama shows the potential — and the ceiling.
NVIDIA's own monitoring can't see their newest hardware. The community had a fix before NVIDIA did.
Specs can lie in both directions. Snake oil oversells. Reviews undersell. The only truth is your own testing.