A chronological journey through all posts.
Same hardware, three models, completely different performance profiles. GPT-OSS-120B is fastest despite 117B params. Gemma4 has the best TTFT. On Nemotron, cluster never loses to shuffle. The right model depends on the workload.
The smallest model benefits most from tensor parallelism: +50% at c1. Cluster TP=2 with --no-ray dominates decode through c32. Also: Ray vs PyTorch distributed -- the 2-8% you're leaving on the table.
The model where cluster never loses. Nemotron-3-Super benefits more from TP=2 than any other model tested -- and the SM12.1 CUTLASS patch doubles performance vs FlashInfer.
One formula tells you whether your hardware is working correctly or misconfigured. The community used this to diagnose the 40 t/s problem.
Blackwell's native MX-format support eliminates the dequantization tax — the hardware and gpt-oss-120b were designed for each other.
Cluster TP=2 dominates decode up to c32, then simple-shuffle takes over. Least-busy routing collapses under load. Full topology comparison with four configurations.
Everything in the series built toward this: Claude Code running on locally served models. Here's what works, what's rough, and where it's heading.
The recipe system from spark-vllm-docker turns twenty minutes of flag archaeology into one command — and makes everything reproducible.
Neither topology dominates — cluster wins decode at every concurrency level, but 2x Solo wins prefill and TTFT under load. The right choice depends on the workload.
The second DGX Spark arrived. Before writing a single line of config: check firmware. Then cables, SSH, Docker, vLLM, model cache — and Claude Code helping build the skills to manage it all.
Skills you build today become components for autonomous agents tomorrow. The progression: skill → command → plugin → autonomous agent. Here's where it's all heading.
Claude Code speaks Anthropic. gpt-oss-120b speaks OpenAI with Harmony-style tool calls. LiteLLM sits in the middle and translates — including a custom callback that patches the tool calls neither side gets right.
10 agents. 3 commands. 3 skills. One plugin that turns a topic into a finished presentation. Here's how agents, commands, skills, and hooks work together inside Marp Magic.
Post 3 hit 4,158 t/s at c64. llama-benchy puts those numbers under the microscope. Same hardware, two tools, 25x difference in TTFT.
Every reviewer tested single-user latency and called the DGX Spark slow. Nobody tested concurrency. The community found the real number: 3,975 tokens per second.
Manual invoice matching: 20% exception rates, hours of CFO time, error-prone. A skill reduces that to under 5% in minutes. Here's how — and it was built by a non-developer.
The standard recipe works but wastes the hardware. Scaling from 20B to 120B on Ollama shows the potential — and the ceiling.
Write a skill file. Run it on messy meeting notes. Get a structured summary. Refine it. Run it again. The whole cycle in 10 minutes — no code, just clear instructions.
NVIDIA's own monitoring can't see their newest hardware. The community had a fix before NVIDIA did.
You don't need to write code to use Claude Code. Skills are instructions in plain Markdown — if you can write a recipe, you can teach AI to do your repetitive work.
Specs can lie in both directions. Snake oil oversells. Reviews undersell. The only truth is your own testing.
The fairy tale is written. The diagrams are drawn. Now turn it all into a presentation -- without opening PowerPoint, without leaving VSCode, without losing your Git history.
README is for humans. AGENTS.md is for AI. 40,000+ projects use it to give AI persistent instructions that survive across sessions. Here's how to write one.
20 million developers use Copilot for code. Almost nobody uses it for content. Here's how to turn a fairy tale outline into a full story with diagrams -- without writing a single line of code.
100 million developers use VSCode. Most of them only use it for code. Here's how to set it up as an AI content creation workspace.
Markdown isn't a formatting tool. It's a communication protocol between you and every AI model on the planet. Here's why that matters.
Markdown is the universal language of structured text. Learn the essentials without the bloat.
Presentations as code. Write Markdown, get slides. No PowerPoint, no Keynote, no drag-and-drop.
Diagrams as code. Version-controlled, LLM-friendly, and no drag-and-drop tools required.
A model is a folder. config.json is the architecture. generation_config.json is the sampling defaults. vocab.json is the tokenizer's vocabulary. Here's how to read them, and where to change the knobs.
The model produces 50,257 scores. Sampling decides which one becomes the next token. Temperature, top-k, and top-p are your mixer sliders — here's what each one does, with numbers.
Attention is a weighted mix. Multi-head is a filter bank. The causal mask means no spoilers. Here's the transformer architecture without the math — then with just enough of it.
Tokenization is at the heart of every weird LLM behavior: why they can't reverse strings, why Japanese costs more, why 'SolidGoldMagikarp' breaks them. Here's what is going on, and what you can do about it.
You use ChatGPT, Gemini, or Claude every day. But what's inside the box? Assistants are not models. Once you see the difference, you unlock controls most people don't know exist.
Software is disposable. Knowledge isn't. Building checkpoints to share ideas without the sidequests.