5/5 Your First Model: Navigating a Hugging Face Repo


You know the pipeline (Post 1), the tokenizer (Post 2), the architecture (Post 3), and the sampling controls (Post 4). Now: where do all these things actually live?

On Hugging Face, a model is a folder of files. Each file controls a different part of what we’ve covered. Once you know which file does what, you can inspect, compare, and configure any model.

The Folder

Open GPT-2 on Hugging Face. Here’s what you’ll find:

| File | What it controls | Series post |
| --- | --- | --- |
| `config.json` | Architecture: layers, heads, dimensions, context length | Post 3 |
| `generation_config.json` | Sampling defaults: temperature, top-p, top-k, max tokens | Post 4 |
| `tokenizer_config.json` | Tokenizer logic, normalization, special tokens | Post 2 |
| `vocab.json` + `merges.txt` | BPE vocabulary and merge rules | Post 2 |
| `special_tokens_map.json` | BOS, EOS, PAD token mappings | Post 2 |
| `model.safetensors` | The actual weights (hundreds of MB to hundreds of GB) | — |
| `README.md` | Model card: usage, benchmarks, license | — |

For SentencePiece models (Llama family), instead of vocab.json + merges.txt, you’ll see tokenizer.model — a single binary file containing the trained tokenizer.

config.json: The Architecture

This is the blueprint from Post 3 in a file:

```json
{
  "model_type": "gpt2",
  "n_layer": 12,
  "n_head": 12,
  "n_embd": 768,
  "n_positions": 1024,
  "vocab_size": 50257,
  "activation_function": "gelu_new",
  "layer_norm_epsilon": 1e-05,
  "eos_token_id": 50256
}
```

Everything we discussed: 12 layers, 12 heads, 768-dimensional embeddings, 1024 context window, 50,257 vocabulary tokens, GELU activation, and <|endoftext|> as token ID 50256.

Compare with a larger model like GPT-2 XL: n_layer: 48, n_head: 25, n_embd: 1600. Same architecture, different scale.
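One number worth deriving from these configs is the per-head width. A minimal sketch, hard-coding the values quoted above (a real script would `json.load()` each repo's `config.json` instead):

```python
# Derive the per-head dimension from the config values quoted above.
gpt2_small = {"n_layer": 12, "n_head": 12, "n_embd": 768}
gpt2_xl = {"n_layer": 48, "n_head": 25, "n_embd": 1600}

for name, cfg in [("gpt2", gpt2_small), ("gpt2-xl", gpt2_xl)]:
    head_dim = cfg["n_embd"] // cfg["n_head"]  # width each attention head works in
    print(f"{name}: {cfg['n_layer']} layers, head_dim={head_dim}")
# gpt2:    768 // 12  = 64
# gpt2-xl: 1600 // 25 = 64
```

Both come out to 64: the scale-up comes from more layers, more heads, and a wider residual stream, not from wider individual heads.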

generation_config.json: The Sampling Defaults

The DJ sliders from Post 4, as a config file:

```json
{
  "temperature": 1.0,
  "top_k": 50,
  "top_p": 1.0,
  "do_sample": true,
  "max_length": 1024,
  "eos_token_id": 50256,
  "transformers_version": "4.25.1"
}
```

When you load this model without specifying sampling parameters, these are the defaults. temperature: 1.0 means no rescaling. top_k: 50 means only the top 50 tokens are considered. top_p: 1.0 means no nucleus filtering.

You override these at inference time. But knowing the defaults explains why “out of the box” behavior varies between models — they ship with different presets.
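The precedence is easy to sketch. This is a simplified stand-in for what `transformers`' `GenerationConfig` handling does, not its actual implementation — request-time arguments simply win over the shipped defaults:

```python
# Defaults as shipped in generation_config.json (subset).
defaults = {"temperature": 1.0, "top_k": 50, "top_p": 1.0, "do_sample": True}

def effective_params(defaults, **overrides):
    """Request-time arguments override the shipped defaults."""
    return {**defaults, **overrides}

print(effective_params(defaults, temperature=0.7, top_p=0.9))
# -> {'temperature': 0.7, 'top_k': 50, 'top_p': 0.9, 'do_sample': True}
```

Anything you don't set explicitly falls through to the file — which is exactly why two models can behave differently on the "same" request.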

Comparing Models

The same file structure lets you compare any two models instantly:

| Parameter | GPT-2 | GPT-OSS-20B |
| --- | --- | --- |
| Parameters | 124M | 21B |
| Layers | 12 | 24 |
| Heads | 12 | 64 |
| Embedding dim | 768 | 2,880 |
| Context | 1024 | 131,072 |
| Vocab size | 50,257 | 201,088 |
| Quantization | None | MXFP4 |

Browse GPT-OSS-20B’s files yourself. The structure is the same — config.json, tokenizer files, weights. The scale is different.

API Comparison: OpenAI vs Groq

Once you understand the model files, you can use these models through APIs. The same knobs from generation_config.json become API parameters:

| Platform | Model | Speed | Link |
| --- | --- | --- | --- |
| Groq | openai/gpt-oss-20b | ~1114 t/s | Groq Playground |
| OpenAI | gpt-4.1-nano | — | OpenAI Playground |

Both expose temperature, top-p, max tokens, stop sequences, and seed (top-k is the one knob from Post 4 that the OpenAI API does not expose). The models differ, but the control surface is the same — because both platforms implement the sampling procedure we covered in Post 4.

Try both with the same prompt and same sampling parameters. The outputs will differ (different models, different training data), but the control surface is identical.
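A sketch of that experiment, building the request bodies only (no request is sent here — feed the payload to any OpenAI-compatible client with the provider's base URL and API key; model names are the ones from the table above, and whether a given provider honors every parameter is something to verify against its docs):

```python
# Build identical chat-completions payloads for two OpenAI-compatible
# endpoints. Only the model name differs; the sampling controls match.
def payload(model, prompt, **sampling):
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        **sampling,
    }

sampling = {"temperature": 0.7, "top_p": 0.9, "max_tokens": 128, "seed": 42}
groq_req = payload("openai/gpt-oss-20b", "Explain BPE in one line.", **sampling)
openai_req = payload("gpt-4.1-nano", "Explain BPE in one line.", **sampling)

# Same control surface: everything except the model name is identical.
assert {k: v for k, v in groq_req.items() if k != "model"} == \
       {k: v for k, v in openai_req.items() if k != "model"}
```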

The Weights: model.safetensors

The largest file in any repo. For GPT-2 small, it’s ~500 MB. For a 70B model, it’s hundreds of GB (often split across multiple files).

The weights contain:

  • Token embedding matrix (50,257 x 768 for GPT-2)
  • Position embedding matrix (1024 x 768)
  • 12 blocks x {attention Q/K/V projections, MLP weights, LayerNorm parameters}
  • Output projection (tied to token embeddings in GPT-2)
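Tallying those components recovers the headline parameter count. A back-of-envelope sketch, assuming standard GPT-2 shapes (QKV and MLP layers with biases, two LayerNorms per block, and tied input/output embeddings so the output projection adds no new weights):

```python
# Back-of-envelope parameter count for GPT-2 small.
V, D, P, L, H = 50257, 768, 1024, 12, 3072  # vocab, embd, positions, layers, MLP hidden

embeddings = V * D + P * D            # token + position embedding matrices
attn = D * 3 * D + 3 * D + D * D + D  # fused QKV projection + output projection
mlp = D * H + H + H * D + D           # up-projection + down-projection
norms = 2 * 2 * D                     # two LayerNorms per block (scale + shift)

total = embeddings + L * (attn + mlp + norms) + 2 * D  # + final LayerNorm
print(f"{total:,}")  # -> 124,439,808 — the "124M" in the table above
```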

You rarely edit these directly. But knowing they exist — and that they’re just numbers in a file — demystifies “the model.” It’s not magic. It’s a stack of matrix multiplications and learned parameters.

.safetensors is the modern format (safer to load than pickle-based .bin files). Most new models use it.

What You Can Do Now

With the knowledge from this series:

  1. Read any model card and understand the architecture from config.json
  2. Check the tokenizer — is it BPE? SentencePiece? How large is the vocabulary?
  3. Inspect sampling defaults in generation_config.json and decide if you want to override them
  4. Compare models by diffing their config files
  5. Choose the right API based on what knobs you need
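Item 4 takes only a few lines. A sketch using the config values quoted earlier in this post, hard-coded for brevity — in practice you would `json.load()` each repo's `config.json`:

```python
# Diff two model configs as plain dicts.
small = {"n_layer": 12, "n_head": 12, "n_embd": 768, "vocab_size": 50257}
xl = {"n_layer": 48, "n_head": 25, "n_embd": 1600, "vocab_size": 50257}

# Keep only the keys whose values differ, paired as (small, xl).
diff = {k: (small[k], xl[k]) for k in small if small[k] != xl[k]}
print(diff)  # -> {'n_layer': (12, 48), 'n_head': (12, 25), 'n_embd': (768, 1600)}
```

The shared `vocab_size` drops out of the diff — the two models are the same design at different scales, which is exactly what the comparison tables above showed.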

The Full Series

  1. Tokens In, Logits Out — the big picture
  2. Tokenization — why everything is weirder than you think
  3. The Transformer — attention, embeddings, decoder-only architecture
  4. Sampling — temperature, top-k, top-p
  5. This post — where the models live and how to use them