2/5 Why Can't LLMs Spell 'Raspberry'? It's Tokenization.

· 5 min read

Previous: Tokens In, Logits Out

Why can’t LLMs spell “raspberry”? Why can’t they reverse a string? Why are they worse at Japanese than English? Why do they stumble on simple arithmetic? Why does <|endoftext|> break them? Why does “SolidGoldMagikarp” cause hallucinations?

The answer to every one of these is tokenization.

Tokenization is how the model sees text. Not character by character. Not word by word. In chunks determined by a compression algorithm trained on the model’s corpus. Those chunks are the atoms of the model’s world — and their boundaries explain most of the “weird” behavior people attribute to AI being dumb.

What Tokenization Actually Does

The model never sees raw text. It sees token IDs — integers. The tokenizer converts text into a sequence of these IDs using a vocabulary learned from the training data.

GPT-2 uses Byte-Pair Encoding (BPE) with a vocabulary of 50,257 tokens. The algorithm:

  1. Start with individual bytes (256 base tokens)
  2. Find the most frequent adjacent pair in the corpus
  3. Merge them into a new token
  4. Repeat until vocabulary size is reached

The result: common words become single tokens (" the" = one token), rare words get split ("raspberry" = "r" + "asp" + "berry"), and the model has no concept of individual letters within a token.
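The merge loop is small enough to sketch directly. This is a character-level toy on a three-word corpus — real BPE starts from the 256 bytes and trains on gigabytes of text — so the specific merges are illustrative, not GPT-2's:

```python
from collections import Counter

def train_bpe(corpus: str, num_merges: int):
    """Learn BPE merges from a toy corpus (character-level simplification)."""
    symbols = list(corpus)  # real BPE starts from bytes, not characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(symbols, symbols[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:
            break  # nothing left worth merging
        merges.append((a, b))
        # Replace every occurrence of the pair with the fused symbol.
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == (a, b):
                merged.append(a + b)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return merges, symbols

merges, symbols = train_bpe("low lower lowest", 3)
print(merges)   # first merge fuses ('l', 'o'), then ('lo', 'w'), ...
print(symbols)
```

Running this on `"low lower lowest"` fuses `l`+`o` first (most frequent pair), then `lo`+`w` — exactly the mechanism that turns " the" into one token in GPT-2's real vocabulary.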

Try it yourself: Tokenizer Comparator — paste any text and see how different models tokenize it.

Why This Breaks Things

Spelling and string reversal

"raspberry" is tokenized as ["r", "asp", "berry"]. The model doesn’t see individual letters — it sees three chunks. Asking it to spell “raspberry” means asking it to decompose tokens into characters, which isn’t how it was trained to think.

Reversing "hello" means reversing ["hello"] (one token) — the model has no internal representation of o-l-l-e-h.
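You can see the mismatch by reversing at the two granularities. The three-chunk split of "raspberry" is the hypothetical segmentation from above; the exact split depends on the tokenizer:

```python
# Hypothetical GPT-2-style segmentation of "raspberry" (an assumption;
# the real split depends on the tokenizer's learned merges).
tokens = ["r", "asp", "berry"]

# Character-level reversal -- what a human does when asked to reverse a string:
char_reversed = "".join(tokens)[::-1]       # "yrrebpsar"

# Token-level reversal -- the granularity the model actually operates at:
token_reversed = "".join(reversed(tokens))  # "berryaspr"

print(char_reversed)
print(token_reversed)
```

The model would have to first decompose each token into characters it never directly saw, then reverse — two steps removed from its native representation.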

Non-English languages

Japanese has no spaces. "寿司が食べたい" ("I want to eat sushi") might become 7+ tokens in GPT-2's English-heavy vocabulary, while a tokenizer trained on Japanese might use 3. More tokens means higher cost, slower inference, and less actual text fitting into the context window.

Arithmetic

"1984" might be one token. "1985" might be two: "19" + "85". The model doesn’t see numbers — it sees token shapes. Addition requires understanding digit positions, which tokenization destroys.

The glitch tokens

"SolidGoldMagikarp" is a Reddit username that appeared in GPT-2’s training data enough to get its own token, but without enough surrounding context. Asking about it causes hallucinations because the model has a token with almost no learned meaning.

<|endoftext|> is a special token (ID 50256 in GPT-2) that signals document boundaries in training. Including it in a prompt tells the model “this document just ended” — confusing the context.

GPT-2 Byte-Level BPE

GPT-2’s tokenizer has specific properties:

  • No out-of-vocabulary tokens — byte-level means every possible byte sequence has a representation
  • Leading-space tokens: " Hello" and "Hello" produce different token sequences. The space is fused into the next token: " Hello" is a single token, and "Hello" is a different single token
  • Emoji/diacritics safe — everything decomposes to bytes
  • 50,257 tokens in the vocabulary (merges.txt defines the merge rules)
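The byte-level property is easy to verify without any tokenizer: every string reduces to UTF-8 bytes, and each of the 256 possible byte values is a base token, so nothing is ever out of vocabulary:

```python
# Any string -- ASCII, diacritics, emoji -- decomposes into UTF-8 bytes,
# and byte-level BPE covers all 256 byte values as base tokens.
for s in ["Hello", "žatva", "🍣"]:
    b = s.encode("utf-8")
    print(f"{s!r}: {len(b)} bytes -> {list(b)}")
```

Note the flip side: "žatva" is five characters but six bytes, and the sushi emoji is four bytes on its own — byte-level coverage costs extra base tokens for non-ASCII text, which is part of why GPT-2 is expensive on Slovak or Japanese.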

The leading-space gotcha

This trips up almost everyone:

"Hello"  → ["Hello"]          (1 token)
" Hello" → [" Hello"]         (1 different token)

A space changes the tokenization. If your prompt processing strips or adds spaces inconsistently, the model sees different inputs. This matters for reproducibility.

SentencePiece: A Different Approach

Not all models use GPT-2’s tokenizer. Meta’s Llama 1 and 2 use SentencePiece BPE, which handles things differently:

  • Word-start markers: ▁ (U+2581, an underscore-like metasymbol) marks where words begin: "Hello world" becomes ▁Hello + ▁world
  • Normalization rules: better handling of diacritics and Unicode
  • Language-aware: can be trained on specific languages for better compression
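The ▁ convention can be sketched in a few lines. This word-level toy skips the subword splitting real SentencePiece does, but it shows why detokenization is trivially lossless:

```python
# SentencePiece replaces spaces with the U+2581 metasymbol on word starts,
# so decoding is just string concatenation plus one character replacement.
WORD_START = "\u2581"  # the "▁" character

def to_pieces(text: str) -> list[str]:
    # Word-level sketch only; real SentencePiece further splits into subwords.
    return [WORD_START + w for w in text.split(" ")]

def from_pieces(pieces: list[str]) -> str:
    return "".join(pieces).replace(WORD_START, " ").lstrip(" ")

pieces = to_pieces("Hello world")
print(pieces)               # ['▁Hello', '▁world']
print(from_pieces(pieces))  # 'Hello world'
```

Because the space lives inside the piece itself, there is no separate "leading-space vs. no-space" token ambiguity to track in post-processing.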

For example, Slovak text with diacritics (ľ, š, č, ž, ť, ď, ň) often produces fewer tokens with a SentencePiece tokenizer trained on Slovak data than with GPT-2’s English-heavy BPE.

Compare the same Hviezdoslav poem excerpt:

Bola žatva. Sotvaže hojná oťaž rosy
obschla: zrazu zmihalo tisíc skvúcich kosí...

(It was harvest time. Scarcely had the abundant burden of dew dried, when suddenly a thousand gleaming scythes flashed...)

  • GPT-2 BPE: ~40 tokens (splits diacritics into multi-byte sequences)
  • Slovak SentencePiece: ~25 tokens (treats ž, ť, č as native characters)

Fewer tokens means cheaper, faster, and more context available for the actual content.

What You Can Do About It

  1. Check your tokenization: Use the Tokenizer Comparator to see how your prompts get split. If a critical word gets fragmented, consider rephrasing.

  2. Count tokens, not words: Pricing, context limits, and quality all depend on token count. “500 words” means nothing to the model — “700 tokens” does.

  3. Watch for space sensitivity: "Hello" and " Hello" are different inputs. Be consistent in prompt construction.

  4. Choose models with appropriate tokenizers: For non-English content, models with multilingual tokenizers (Llama 3, multilingual BERT) will be more efficient than GPT-2-family tokenizers.

  5. Know the special tokens: Every model has them. GPT-2 has <|endoftext|> (ID 50256). Llama has <s>, </s>. Don’t accidentally include them in prompts.
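For point 5, a minimal guard is cheap to add to any prompt pipeline. The token strings below are the ones named in this post (GPT-2 and Llama); extend the list for whatever model you target:

```python
# Minimal guard against accidentally shipping special tokens in a prompt.
# Token strings from this post (GPT-2, Llama 1/2); extend per model.
SPECIAL_TOKENS = ["<|endoftext|>", "<s>", "</s>"]

def contains_special(prompt: str) -> list[str]:
    """Return any special-token strings found verbatim in the prompt."""
    return [t for t in SPECIAL_TOKENS if t in prompt]

print(contains_special("Tell me a story <|endoftext|> about cats"))
# ['<|endoftext|>']
```

Whether you strip the token, escape it, or reject the input is an application decision — but silently passing it through means the model sees a document boundary you didn't intend.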

Next: The Transformer in 90 Seconds (Then the Other 900)