1/5 Tokens In, Logits Out: What's Actually Inside ChatGPT
You use these every day. ChatGPT. Gemini. Claude. Copilot. Perplexity.
Great UX. Mystery box inside.
What you see: a helpful assistant that can browse, call tools, remember context, and format answers. What you don’t see: which models, prompts, guardrails, and extra tools are wired together behind the scenes.
This series cracks the box. Not to build a model from scratch — but to understand what’s inside well enough to control it.
Assistants Are Not Models
This is the single most important distinction:
| Layer | What it is | Examples |
|---|---|---|
| Assistant layer | Orchestrates tools, memory, safety, formatting | ChatGPT, Gemini, Copilot |
| Model layer | The LLM that turns tokens into logits (next-token predictions) | GPT-2, Llama, Mistral |
When you “talk to ChatGPT,” you’re talking to an assistant that talks to a model. The assistant decides what system prompt to use, which tools to call, how to format the response, and what safety filters to apply. The model just predicts the next token.
Most users never interact with the model layer directly. That means the vendor picks your sampling parameters, your context strategy, your stop rules — everything.
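To make the layering concrete, here's a toy sketch of the two layers. Everything here is illustrative — the names, the stubbed model, and the guardrail are mine, not any vendor's actual code:

```python
# Toy sketch of the assistant/model split. The model layer is stubbed
# with a canned reply so the example runs without downloading anything.

def model_layer(tokens):
    """Model layer: token sequence in, next-token prediction out.
    A real LLM would return logits; here we fake the final text."""
    return "Paris"

def assistant_layer(user_message):
    """Assistant layer: wraps the model with a system prompt,
    a guardrail, and output formatting — the stuff you never see."""
    system_prompt = "You are a helpful assistant."
    prompt = f"{system_prompt}\nUser: {user_message}\nAssistant:"
    raw = model_layer(prompt.split())      # the model just predicts tokens
    if "forbidden" in raw:                 # the assistant applies safety rules
        return "Sorry, I can't help with that."
    return raw.strip().capitalize()        # the assistant formats the answer

print(assistant_layer("What is the capital of France?"))  # → Paris
```

The point: the system prompt, the guardrail, and the formatting all live in `assistant_layer`. Swap the stub for a real model and the model still never sees any of those decisions — it only sees tokens.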
Why Move Closer to Models?
If the assistant works fine, why bother?
- Control: Pick the sampling parameters (temperature, top-p, top-k), stop rules, seeds, penalties
- Cost/latency: Optimize tokens, shrink context, enable streaming
- Reproducibility: Fix random seeds, log logprobs — get the same output twice
- Portability: Run via API or locally (on-prem, laptop, edge)
- Safety/compliance: Your own guardrails and auditability
- Understanding: Know why the model said what it said
If you only use assistants, the vendor picks all of these for you. Sometimes that’s fine. Sometimes it isn’t.
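Here's what those first two knobs — temperature, top-k, top-p — actually do to a model's output scores, as a minimal pure-NumPy sketch. The function name and defaults are my own for illustration, not any library's API:

```python
import numpy as np

def sample(logits, temperature=1.0, top_k=None, top_p=None, seed=0):
    """Sample a token id from raw logits using the standard knobs."""
    rng = np.random.default_rng(seed)                 # fixed seed -> reproducible
    logits = np.asarray(logits, dtype=np.float64) / temperature
    if top_k is not None:                             # keep only the k highest logits
        cutoff = np.sort(logits)[-top_k]
        logits = np.where(logits < cutoff, -np.inf, logits)
    probs = np.exp(logits - logits.max())             # numerically stable softmax
    probs /= probs.sum()
    if top_p is not None:                             # nucleus: smallest set with mass >= top_p
        order = np.argsort(probs)[::-1]
        csum = np.cumsum(probs[order])
        keep = order[: np.searchsorted(csum, top_p) + 1]
        mask = np.zeros_like(probs)
        mask[keep] = probs[keep]
        probs = mask / mask.sum()
    return int(rng.choice(len(probs), p=probs))

# top_k=1 collapses to greedy decoding: always the highest-scoring token
print(sample([1.0, 3.0, 2.0], top_k=1))  # → 1
```

Hosted APIs and local runtimes expose these same parameters by name; assistants set them for you and don't tell you what they chose.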
Three Paths
- Stay assistant-first — ChatGPT, Gemini, Copilot. Great for speed. Limited control.
- Use hosted models via API — OpenAI, Groq, Anthropic. Medium effort. Lots of knobs.
- Run OSS locally — GPT-2, Llama, Mistral. Max control. You own the infrastructure.
This series shows the knobs you unlock in paths 2 and 3.
The Pipeline: One Sentence
Every language model does the same thing:
text → tokens → transformer magic → logits → sampled token → repeat
That’s it. Text goes in. The tokenizer chops it into pieces. The transformer processes those pieces. Out come logits — raw scores for every possible next token. A sampling strategy picks one. That token gets appended, and the whole thing repeats.
For GPT-2 small, that means:
- Input: up to 1024 tokens
- Transformer: 12 layers, 12 attention heads, 768-dimensional embeddings
- Output: 50,257 logits (one score per token in the vocabulary)
- Pick one. Append it. Do it again.
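That loop fits in a few lines with GPT-2 via Hugging Face `transformers`. This sketch uses greedy decoding — the simplest possible "pick one" — and the prompt is just an example (running it will download the ~500 MB model weights on first use):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

# text → tokens
ids = tok("The Eiffel Tower is in", return_tensors="pt").input_ids

for _ in range(5):                        # repeat: one new token per step
    with torch.no_grad():
        logits = model(ids).logits        # shape (1, seq_len, 50257)
    next_id = logits[0, -1].argmax()      # greedy: top-scoring next token
    ids = torch.cat([ids, next_id.view(1, 1)], dim=1)  # append, go again

print(tok.decode(ids[0]))                 # tokens → text
```

Note that the model produces all 50,257 scores at every step — the "intelligence" ends at the logits, and everything after that is the sampling strategy's job.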
What’s in This Series
Each post unpacks one piece of that pipeline:
- This post — the big picture, assistants vs models, the pipeline
- Tokenization — why “raspberry” breaks LLMs, GPT-2 BPE vs SentencePiece, and why it matters more than you think
- The Transformer — attention, embeddings, decoder-only architecture — intuition first, math optional
- Sampling: Be the Language DJ — temperature, top-k, top-p — the knobs that change everything
- Your First Model on Hugging Face — navigating a model repo, reading configs, comparing APIs
The Name
The series title — “Attention Is All You Need” — comes from the 2017 paper by Vaswani et al. at Google Brain that introduced the Transformer architecture. The title itself plays on The Beatles’ “All You Need Is Love.”
As of 2025, it’s among the top 10 most-cited papers of the 21st century. It introduced self-attention, multi-head attention, and positional encoding to the ML lexicon. Everything since — GPT, BERT, Llama, Claude — builds on it.
We’ll use GPT-2 as our reference model throughout. It’s small enough to run on a laptop, old enough to be fully documented, and architecturally close enough to its larger descendants that the concepts transfer directly. If you understand GPT-2, you understand the foundation of every modern LLM.