5/5 Your First Model: Navigating a Hugging Face Repo


You know the pipeline (Post 1), the tokenizer (Post 2), the architecture (Post 3), and the sampling controls (Post 4). Now: where do all these things actually live?

On Hugging Face, a model is a folder of files. Each file controls a different part of what we’ve covered. Once you know which file does what, you can inspect, compare, and configure any model.

The Folder

Open GPT-2 on Hugging Face. Here’s what you’ll find:

| File | What it controls | Series post |
| --- | --- | --- |
| `config.json` | Architecture: layers, heads, dimensions, context length | Post 3 |
| `generation_config.json` | Sampling defaults: temperature, top-p, top-k, max tokens | Post 4 |
| `tokenizer_config.json` | Tokenizer logic, normalization, special tokens | Post 2 |
| `vocab.json` + `merges.txt` | BPE vocabulary and merge rules | Post 2 |
| `special_tokens_map.json` | BOS, EOS, PAD token mappings | Post 2 |
| `model.safetensors` | The actual weights (hundreds of MB to hundreds of GB) | — |
| `README.md` | Model card: usage, benchmarks, license | — |

For SentencePiece models (Llama family), instead of vocab.json + merges.txt, you’ll see tokenizer.model — a single binary file containing the trained tokenizer.

config.json: The Architecture

This is the blueprint from Post 3 in a file:

```json
{
  "model_type": "gpt2",
  "n_layer": 12,
  "n_head": 12,
  "n_embd": 768,
  "n_positions": 1024,
  "vocab_size": 50257,
  "activation_function": "gelu_new",
  "layer_norm_epsilon": 1e-05,
  "eos_token_id": 50256
}
```

Everything we discussed: 12 layers, 12 heads, 768-dimensional embeddings, 1024 context window, 50,257 vocabulary tokens, GELU activation, and <|endoftext|> as token ID 50256.

Compare with a larger model like GPT-2 XL: n_layer: 48, n_head: 25, n_embd: 1600. Same architecture, different scale.
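One number worth deriving from these configs is the per-head width. A minimal sketch, hard-coding the values quoted above (a real script would `json.load()` each repo's `config.json` instead):

```python
# Derive the per-head dimension from the config values quoted above.
gpt2_small = {"n_layer": 12, "n_head": 12, "n_embd": 768}
gpt2_xl = {"n_layer": 48, "n_head": 25, "n_embd": 1600}

for name, cfg in [("gpt2", gpt2_small), ("gpt2-xl", gpt2_xl)]:
    head_dim = cfg["n_embd"] // cfg["n_head"]  # width each attention head works in
    print(f"{name}: {cfg['n_layer']} layers, head_dim={head_dim}")
# gpt2:    768 // 12  = 64
# gpt2-xl: 1600 // 25 = 64
```

Both come out to 64: the scale-up comes from more layers, more heads, and a wider residual stream, not from wider individual heads.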

generation_config.json: The Sampling Defaults

The DJ sliders from Post 4, as a config file:

```json
{
  "temperature": 1.0,
  "top_k": 50,
  "top_p": 1.0,
  "do_sample": true,
  "max_length": 1024,
  "eos_token_id": 50256,
  "transformers_version": "4.25.1"
}
```

When you load this model without specifying sampling parameters, these are the defaults. temperature: 1.0 means no rescaling. top_k: 50 means only the top 50 tokens are considered. top_p: 1.0 means no nucleus filtering.

You override these at inference time. But knowing the defaults explains why “out of the box” behavior varies between models — they ship with different presets.
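The precedence is easy to sketch. This is a simplified stand-in for what `transformers`' `GenerationConfig` handling does, not its actual implementation — request-time arguments simply win over the shipped defaults:

```python
# Defaults as shipped in generation_config.json (subset).
defaults = {"temperature": 1.0, "top_k": 50, "top_p": 1.0, "do_sample": True}

def effective_params(defaults, **overrides):
    """Request-time arguments override the shipped defaults."""
    return {**defaults, **overrides}

print(effective_params(defaults, temperature=0.7, top_p=0.9))
# -> {'temperature': 0.7, 'top_k': 50, 'top_p': 0.9, 'do_sample': True}
```

Anything you don't set explicitly falls through to the file — which is exactly why two models can behave differently on the "same" request.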

Comparing Models

The same file structure lets you compare any two models instantly:

| Parameter | GPT-2 | GPT-OSS-20B |
| --- | --- | --- |
| Parameters | 124M | 21B |
| Layers | 12 | 24 |
| Heads | 12 | 64 |
| Embedding dim | 768 | 2,880 |
| Context | 1024 | 131,072 |
| Vocab size | 50,257 | 201,088 |
| Quantization | None | MXFP4 |

Browse GPT-OSS-20B’s files yourself. The structure is the same — config.json, tokenizer files, weights. The scale is different.

API Comparison: OpenAI vs Groq

Once you understand the model files, you can use these models through APIs. The same knobs from generation_config.json become API parameters:

| Platform | Model | Speed | Link |
| --- | --- | --- | --- |
| Groq | openai/gpt-oss-20b | ~1114 t/s | Groq Playground |
| OpenAI | gpt-4.1-nano | — | OpenAI Playground |

Both expose temperature, top-p, max tokens, stop sequences, and seed (top-k is the one knob from Post 4 that the OpenAI API does not expose). The models differ, but the control surface is the same — because both platforms implement the sampling procedure we covered in Post 4.

Try both with the same prompt and same sampling parameters. The outputs will differ (different models, different training data), but the control surface is identical.
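A sketch of that experiment, building the request bodies only (no request is sent here — feed the payload to any OpenAI-compatible client with the provider's base URL and API key; model names are the ones from the table above, and whether a given provider honors every parameter is something to verify against its docs):

```python
# Build identical chat-completions payloads for two OpenAI-compatible
# endpoints. Only the model name differs; the sampling controls match.
def payload(model, prompt, **sampling):
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        **sampling,
    }

sampling = {"temperature": 0.7, "top_p": 0.9, "max_tokens": 128, "seed": 42}
groq_req = payload("openai/gpt-oss-20b", "Explain BPE in one line.", **sampling)
openai_req = payload("gpt-4.1-nano", "Explain BPE in one line.", **sampling)

# Same control surface: everything except the model name is identical.
assert {k: v for k, v in groq_req.items() if k != "model"} == \
       {k: v for k, v in openai_req.items() if k != "model"}
```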

The Weights: model.safetensors

The largest file in any repo. For GPT-2 small, it’s ~500 MB. For a 70B model, it’s hundreds of GB (often split across multiple files).

The weights contain:

  • Token embedding matrix (50,257 x 768 for GPT-2)
  • Position embedding matrix (1024 x 768)
  • 12 blocks x {attention Q/K/V projections, MLP weights, LayerNorm parameters}
  • Output projection (tied to token embeddings in GPT-2)
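Tallying those components recovers the headline parameter count. A back-of-envelope sketch, assuming standard GPT-2 shapes (QKV and MLP layers with biases, two LayerNorms per block, and tied input/output embeddings so the output projection adds no new weights):

```python
# Back-of-envelope parameter count for GPT-2 small.
V, D, P, L, H = 50257, 768, 1024, 12, 3072  # vocab, embd, positions, layers, MLP hidden

embeddings = V * D + P * D            # token + position embedding matrices
attn = D * 3 * D + 3 * D + D * D + D  # fused QKV projection + output projection
mlp = D * H + H + H * D + D           # up-projection + down-projection
norms = 2 * 2 * D                     # two LayerNorms per block (scale + shift)

total = embeddings + L * (attn + mlp + norms) + 2 * D  # + final LayerNorm
print(f"{total:,}")  # -> 124,439,808 — the "124M" in the table above
```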

You rarely edit these directly. But knowing they exist — and that they’re just numbers in a file — demystifies “the model.” It’s not magic. It’s a stack of matrix multiplications and learned parameters.

.safetensors is the modern format (safer to load than pickle-based .bin files). Most new models use it.

What You Can Do Now

With the knowledge from this series:

  1. Read any model card and understand the architecture from config.json
  2. Check the tokenizer — is it BPE? SentencePiece? How large is the vocabulary?
  3. Inspect sampling defaults in generation_config.json and decide if you want to override them
  4. Compare models by diffing their config files
  5. Choose the right API based on what knobs you need
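Item 4 takes only a few lines. A sketch using the config values quoted earlier in this post, hard-coded for brevity — in practice you would `json.load()` each repo's `config.json`:

```python
# Diff two model configs as plain dicts.
small = {"n_layer": 12, "n_head": 12, "n_embd": 768, "vocab_size": 50257}
xl = {"n_layer": 48, "n_head": 25, "n_embd": 1600, "vocab_size": 50257}

# Keep only the keys whose values differ, paired as (small, xl).
diff = {k: (small[k], xl[k]) for k in small if small[k] != xl[k]}
print(diff)  # -> {'n_layer': (12, 48), 'n_head': (12, 25), 'n_embd': (768, 1600)}
```

The shared `vocab_size` drops out of the diff — the two models are the same design at different scales, which is exactly what the comparison tables above showed.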

The Full Series

  1. Tokens In, Logits Out — the big picture
  2. Tokenization — why everything is weirder than you think
  3. The Transformer — attention, embeddings, decoder-only architecture
  4. Sampling — temperature, top-k, top-p
  5. This post — where the models live and how to use them