5/5 Your First Model: Navigating a Hugging Face Repo
Previous: Be the Language DJ: Temperature, Top-k, and Top-p
You know the pipeline (Post 1), the tokenizer (Post 2), the architecture (Post 3), and the sampling controls (Post 4). Now: where do all these things actually live?
On Hugging Face, a model is a folder of files. Each file controls a different part of what we’ve covered. Once you know which file does what, you can inspect, compare, and configure any model.
The Folder
Open GPT-2 on Hugging Face. Here’s what you’ll find:
| File | What it controls | Series post |
|---|---|---|
| config.json | Architecture: layers, heads, dimensions, context length | Post 3 |
| generation_config.json | Sampling defaults: temperature, top-p, top-k, max tokens | Post 4 |
| tokenizer_config.json | Tokenizer logic, normalization, special tokens | Post 2 |
| vocab.json + merges.txt | BPE vocabulary and merge rules | Post 2 |
| special_tokens_map.json | BOS, EOS, PAD token mappings | Post 2 |
| model.safetensors | The actual weights (hundreds of MB to hundreds of GB) | — |
| README.md | Model card: usage, benchmarks, license | — |
For SentencePiece models (Llama family), instead of vocab.json + merges.txt, you’ll see tokenizer.model — a single binary file containing the trained tokenizer.
config.json: The Architecture
This is the blueprint from Post 3 in a file:
```json
{
  "model_type": "gpt2",
  "n_layer": 12,
  "n_head": 12,
  "n_embd": 768,
  "n_positions": 1024,
  "vocab_size": 50257,
  "activation_function": "gelu_new",
  "layer_norm_epsilon": 1e-05,
  "eos_token_id": 50256
}
```
Everything we discussed: 12 layers, 12 heads, 768-dimensional embeddings, 1024 context window, 50,257 vocabulary tokens, GELU activation, and <|endoftext|> as token ID 50256.
Compare with a larger model like GPT-2 XL: n_layer: 48, n_head: 25, n_embd: 1600. Same architecture, different scale.
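These config values are enough to estimate the parameter count yourself. A minimal sketch, using the GPT-2 values above (a rough estimate that ignores biases and LayerNorm parameters, which contribute well under 1% of the total):

```python
# Rough GPT-2 parameter count from config.json values.
# Ignores biases and LayerNorm parameters (well under 1% of the total).
config = {"n_layer": 12, "n_head": 12, "n_embd": 768,
          "n_positions": 1024, "vocab_size": 50257}

d = config["n_embd"]
token_emb = config["vocab_size"] * d        # token embedding matrix
pos_emb = config["n_positions"] * d         # position embedding matrix
per_block = 4 * d * d + 8 * d * d           # attention (Q/K/V/out) + MLP (up/down)
blocks = config["n_layer"] * per_block
# The output projection is tied to the token embeddings in GPT-2, so no extra term.
total = token_emb + pos_emb + blocks
print(f"{total / 1e6:.0f}M parameters")     # ~124M, matching the model card
```

Swap in the GPT-2 XL values and the same arithmetic lands near 1.5B — the "different scale" is entirely visible from config.json.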
generation_config.json: The Sampling Defaults
The DJ sliders from Post 4, as a config file:
```json
{
  "temperature": 1.0,
  "top_k": 50,
  "top_p": 1.0,
  "do_sample": true,
  "max_length": 1024,
  "eos_token_id": 50256,
  "transformers_version": "4.25.1"
}
```
When you load this model without specifying sampling parameters, these are the defaults. temperature: 1.0 means no rescaling. top_k: 50 means only the top 50 tokens are considered. top_p: 1.0 means no nucleus filtering.
You override these at inference time. But knowing the defaults explains why “out of the box” behavior varies between models — they ship with different presets.
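To connect these defaults back to Post 4, here is a minimal sketch (toy logits, plain Python, not the transformers implementation) of what temperature and top_k actually do to the next-token distribution:

```python
import math

def sample_probs(logits, temperature=1.0, top_k=50):
    """Turn raw logits into a top-k-filtered probability distribution."""
    scaled = [l / temperature for l in logits]          # temperature rescaling
    cutoff = sorted(scaled, reverse=True)[:top_k][-1]   # k-th largest value
    filtered = [l if l >= cutoff else float("-inf") for l in scaled]
    exps = [math.exp(l) for l in filtered]              # exp(-inf) == 0.0
    total = sum(exps)
    return [e / total for e in exps]

# With temperature 1.0 and top_k >= vocab size, nothing changes:
probs = sample_probs([2.0, 1.0, 0.1], temperature=1.0, top_k=50)
# Lowering temperature sharpens the distribution toward the top token:
sharp = sample_probs([2.0, 1.0, 0.1], temperature=0.5, top_k=50)
```

With the file's defaults (temperature 1.0, top_p 1.0), the only active filter is top_k: 50 — which is why GPT-2 "out of the box" feels different from a model shipping temperature 0.7.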
Comparing Models
The same file structure lets you compare any two models instantly:
| Parameter | GPT-2 | GPT-OSS-20B |
|---|---|---|
| Parameters | 124M | 21B |
| Layers | 12 | — |
| Heads | 12 | — |
| Embedding dim | 768 | — |
| Context | 1024 | — |
| Vocab size | 50,257 | — |
| Quantization | None | — |
Browse GPT-OSS-20B’s files yourself. The structure is the same — config.json, tokenizer files, weights. The scale is different.
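"Diffing config files" is literally comparing two JSON objects. A minimal sketch, using illustrative GPT-2 and GPT-2 XL values (the keys match real config.json fields; point it at the actual downloaded files for a full diff):

```python
def diff_configs(a, b):
    """Return {key: (a_value, b_value)} for every key that differs."""
    keys = set(a) | set(b)
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}

gpt2 = {"model_type": "gpt2", "n_layer": 12, "n_head": 12, "n_embd": 768}
gpt2_xl = {"model_type": "gpt2", "n_layer": 48, "n_head": 25, "n_embd": 1600}

for key, (small, xl) in sorted(diff_configs(gpt2, gpt2_xl).items()):
    print(f"{key}: {small} -> {xl}")
```

Everything that is the same (model_type, activation, tokenizer) drops out of the diff, leaving only the scale knobs.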
API Comparison: OpenAI vs Groq
Once you understand the model files, you can use these models through APIs. The same knobs from generation_config.json become API parameters:
| Platform | Model | Speed | Link |
|---|---|---|---|
| Groq | openai/gpt-oss-20b | ~1114 t/s | Groq Playground |
| OpenAI | gpt-4.1-nano | — | OpenAI Playground |
Both expose temperature, top-p, max tokens, stop sequences, and seed (top-k support varies by platform and model). The model is different, but the controls are the same — because they all follow the architecture and sampling machinery we covered in Posts 3 and 4.
Try both with the same prompt and same sampling parameters. The outputs will differ (different models, different training data), but the control surface is identical.
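Because both platforms accept OpenAI-style chat-completion requests, the request body is identical apart from the model name. A minimal sketch of building that payload (chat_request is a hypothetical helper; endpoint URLs and auth are omitted, and the exact parameter set each platform accepts may vary):

```python
def chat_request(model, prompt, temperature=0.7, top_p=0.9,
                 max_tokens=256, stop=None, seed=None):
    """Build an OpenAI-style chat-completion request body (hypothetical helper)."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
        "top_p": top_p,
        "max_tokens": max_tokens,
    }
    if stop is not None:
        body["stop"] = stop
    if seed is not None:
        body["seed"] = seed
    return body

# Same prompt, same sampling parameters, two models:
groq_body = chat_request("openai/gpt-oss-20b",
                         "Explain BPE in one sentence.", seed=42)
openai_body = chat_request("gpt-4.1-nano",
                           "Explain BPE in one sentence.", seed=42)
```

The two bodies differ in a single field — the model — which is the whole point: the control surface travels with you.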
The Weights: model.safetensors
The largest file in any repo. For GPT-2 small, it’s ~500 MB. For a 70B model, it’s hundreds of GB (often split across multiple files).
The weights contain:
- Token embedding matrix (50,257 x 768 for GPT-2)
- Position embedding matrix (1024 x 768)
- 12 blocks x {attention Q/K/V projections, MLP weights, LayerNorm parameters}
- Output projection (tied to token embeddings in GPT-2)
You rarely edit these directly. But knowing they exist — and that they’re just numbers in a file — demystifies “the model.” It’s not magic. It’s a large matrix multiplication with learned parameters.
.safetensors is the modern format (safer to load than pickle-based .bin files). Most new models use it.
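You can verify the "just numbers in a file" claim without loading any weights: a .safetensors file begins with an 8-byte little-endian length followed by a JSON header mapping each tensor name to its dtype, shape, and byte offsets. A minimal sketch, stdlib only, that writes a tiny valid file and reads its header back (the real format is what the safetensors library implements; this only parses the header):

```python
import json
import struct

def read_safetensors_header(path):
    """Parse the JSON header of a .safetensors file without loading weights."""
    with open(path, "rb") as f:
        header_len = struct.unpack("<Q", f.read(8))[0]  # 8-byte LE length prefix
        return json.loads(f.read(header_len))

# Build a tiny, valid file: one 2x2 float32 tensor of zeros (16 bytes of data).
header = {"wte.weight": {"dtype": "F32", "shape": [2, 2],
                         "data_offsets": [0, 16]}}
header_bytes = json.dumps(header).encode()
with open("tiny.safetensors", "wb") as f:
    f.write(struct.pack("<Q", len(header_bytes)))
    f.write(header_bytes)
    f.write(b"\x00" * 16)                               # the "weights"

info = read_safetensors_header("tiny.safetensors")
print(info["wte.weight"]["shape"])
```

Run the same reader against GPT-2's model.safetensors and you get the full list of tensor names and shapes — the embedding matrices and per-block weights listed above — in a few milliseconds.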
What You Can Do Now
With the knowledge from this series:
- Read any model card and understand the architecture from config.json
- Check the tokenizer — is it BPE? SentencePiece? How large is the vocabulary?
- Inspect sampling defaults in generation_config.json and decide if you want to override them
- Compare models by diffing their config files
- Choose the right API based on what knobs you need
The Full Series
- Tokens In, Logits Out — the big picture
- Tokenization — why everything is weirder than you think
- The Transformer — attention, embeddings, decoder-only architecture
- Sampling — temperature, top-k, top-p
- This post — where the models live and how to use them
Links
- GPT-2 on Hugging Face — the full model repo
- GPT-2 config.json — architecture
- GPT-2 generation_config.json — sampling defaults
- GPT-2 vocab.json — tokenizer vocabulary
- GPT-OSS-20B on Hugging Face — compare a larger model
- Groq Playground — fast inference
- OpenAI Playground — OpenAI models
- Transformer Explainer — interactive GPT-2
- Tokenizer Comparator — compare tokenizers
- Attention Is All You Need — the paper that started it all