1/5 Tokens In, Logits Out: What's Actually Inside ChatGPT
You use these every day. ChatGPT. Gemini. Claude. Copilot. Perplexity.
Great UX. Mystery box inside.
What you see: a helpful assistant that can browse, call tools, remember context, and format answers. What you don’t see: which models, prompts, guardrails, and extra tools are wired together behind the scenes.
This series cracks the box. Not to build a model from scratch — but to understand what’s inside well enough to control it.
Assistants Are Not Models
This is the single most important distinction:
| Layer | What it is | Examples |
|---|---|---|
| Assistant layer | Orchestrates tools, memory, safety, formatting | ChatGPT, Gemini, Copilot |
| Model layer | The LLM that turns tokens into logits (next-token predictions) | GPT-2, Llama, Mistral |
When you “talk to ChatGPT,” you’re talking to an assistant that talks to a model. The assistant decides what system prompt to use, which tools to call, how to format the response, and what safety filters to apply. The model just predicts the next token.
Most users never interact with the model layer directly. That means the vendor picks your sampling parameters, your context strategy, your stop rules — everything.
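To make the layering concrete, here's a toy sketch of the two layers. Everything here is illustrative — the names, the stubbed model, and the guardrail are mine, not any vendor's actual code:

```python
# Toy sketch of the assistant/model split. The model layer is stubbed
# with a canned reply so the example runs without downloading anything.

def model_layer(tokens):
    """Model layer: token sequence in, next-token prediction out.
    A real LLM would return logits; here we fake the final text."""
    return "Paris"

def assistant_layer(user_message):
    """Assistant layer: wraps the model with a system prompt,
    a guardrail, and output formatting — the stuff you never see."""
    system_prompt = "You are a helpful assistant."
    prompt = f"{system_prompt}\nUser: {user_message}\nAssistant:"
    raw = model_layer(prompt.split())      # the model just predicts tokens
    if "forbidden" in raw:                 # the assistant applies safety rules
        return "Sorry, I can't help with that."
    return raw.strip().capitalize()        # the assistant formats the answer

print(assistant_layer("What is the capital of France?"))  # → Paris
```

The point: the system prompt, the guardrail, and the formatting all live in `assistant_layer`. Swap the stub for a real model and the model still never sees any of those decisions — it only sees tokens.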
Why Move Closer to Models?
If the assistant works fine, why bother?
- Control: Pick the sampling parameters (temperature, top-p, top-k), stop rules, seeds, penalties
- Cost/latency: Optimize tokens, shrink context, enable streaming
- Reproducibility: Fix random seeds, log logprobs — get the same output twice
- Portability: Run via API or locally (on-prem, laptop, edge)
- Safety/compliance: Your own guardrails and auditability
- Understanding: Know why the model said what it said
If you only use assistants, the vendor picks all of these for you. Sometimes that’s fine. Sometimes it isn’t.
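Here's what those first two knobs — temperature, top-k, top-p — actually do to a model's output scores, as a minimal pure-NumPy sketch. The function name and defaults are my own for illustration, not any library's API:

```python
import numpy as np

def sample(logits, temperature=1.0, top_k=None, top_p=None, seed=0):
    """Sample a token id from raw logits using the standard knobs."""
    rng = np.random.default_rng(seed)                 # fixed seed -> reproducible
    logits = np.asarray(logits, dtype=np.float64) / temperature
    if top_k is not None:                             # keep only the k highest logits
        cutoff = np.sort(logits)[-top_k]
        logits = np.where(logits < cutoff, -np.inf, logits)
    probs = np.exp(logits - logits.max())             # numerically stable softmax
    probs /= probs.sum()
    if top_p is not None:                             # nucleus: smallest set with mass >= top_p
        order = np.argsort(probs)[::-1]
        csum = np.cumsum(probs[order])
        keep = order[: np.searchsorted(csum, top_p) + 1]
        mask = np.zeros_like(probs)
        mask[keep] = probs[keep]
        probs = mask / mask.sum()
    return int(rng.choice(len(probs), p=probs))

# top_k=1 collapses to greedy decoding: always the highest-scoring token
print(sample([1.0, 3.0, 2.0], top_k=1))  # → 1
```

Hosted APIs and local runtimes expose these same parameters by name; assistants set them for you and don't tell you what they chose.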
Three Paths
- Stay assistant-first — ChatGPT, Gemini, Copilot. Great for speed. Limited control.
- Use hosted models via API — OpenAI, Groq, Anthropic. Medium effort. Lots of knobs.
- Run OSS locally — GPT-2, Llama, Mistral. Max control. You own the infrastructure.
This series shows the knobs you unlock in paths 2 and 3.
The Pipeline: One Sentence
Every language model does the same thing:
text → tokens → transformer magic → logits → sampled token → repeat
That’s it. Text goes in. The tokenizer chops it into pieces. The transformer processes those pieces. Out come logits — raw scores for every possible next token. A sampling strategy picks one. That token gets appended, and the whole thing repeats.
For GPT-2 small, that means:
- Input: up to 1024 tokens
- Transformer: 12 layers, 12 attention heads, 768-dimensional embeddings
- Output: 50,257 logits (one score per token in the vocabulary)
- Pick one. Append it. Do it again.
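That loop fits in a few lines with GPT-2 via Hugging Face `transformers`. This sketch uses greedy decoding — the simplest possible "pick one" — and the prompt is just an example (running it will download the ~500 MB model weights on first use):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

# text → tokens
ids = tok("The Eiffel Tower is in", return_tensors="pt").input_ids

for _ in range(5):                        # repeat: one new token per step
    with torch.no_grad():
        logits = model(ids).logits        # shape (1, seq_len, 50257)
    next_id = logits[0, -1].argmax()      # greedy: top-scoring next token
    ids = torch.cat([ids, next_id.view(1, 1)], dim=1)  # append, go again

print(tok.decode(ids[0]))                 # tokens → text
```

Note that the model produces all 50,257 scores at every step — the "intelligence" ends at the logits, and everything after that is the sampling strategy's job.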
What’s in This Series
Each post unpacks one piece of that pipeline:
- This post — the big picture, assistants vs models, the pipeline
- Tokenization — why “raspberry” breaks LLMs, GPT-2 BPE vs SentencePiece, and why it matters more than you think
- The Transformer — attention, embeddings, decoder-only architecture — intuition first, math optional
- Sampling: Be the Language DJ — temperature, top-k, top-p — the knobs that change everything
- Your First Model on Hugging Face — navigating a model repo, reading configs, comparing APIs
The Name
The series title — “Attention Is All You Need” — comes from the 2017 paper by Vaswani et al. at Google Brain that introduced the Transformer architecture. The title itself plays on The Beatles’ “All You Need Is Love.”
As of 2025, it’s among the top 10 most-cited papers of the 21st century. It introduced self-attention, multi-head attention, and positional encoding to the ML lexicon. Everything since — GPT, BERT, Llama, Claude — builds on it.
We’ll use GPT-2 as our reference model throughout. It’s small enough to run on a laptop, old enough to be fully documented, and architecturally close enough to its larger descendants that the concepts transfer directly. If you understand GPT-2, you understand the foundation of every modern LLM.