Tokenization
Tokenization converts raw text into discrete units — tokens — that a language model processes. The choice of tokenizer determines vocabulary size, affects model efficiency, and shapes what patterns are easy or hard for the model to learn.
Subword Tokenization
Modern language models use subword schemes rather than word- or character-level splits, balancing vocabulary size against sequence length:
- Byte-Pair Encoding (BPE) — iteratively merges the most frequent adjacent symbol pair into a new vocabulary entry
- WordPiece — similar to BPE, but selects the merge that maximizes the likelihood of the training data under the current vocabulary
- SentencePiece — operates on raw Unicode text without language-specific pre-tokenization, enabling language-agnostic tokenization
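The BPE merge loop can be sketched in a few lines. This is a toy illustration on a made-up corpus, not any production tokenizer: the whitespace word-splitting and the tie-breaking between equally frequent pairs are simplifications.

```python
from collections import Counter

def bpe_train(corpus, num_merges):
    """Learn BPE merges: repeatedly fuse the most frequent adjacent symbol pair."""
    # Represent each word as a tuple of symbols, starting from single characters.
    words = Counter(tuple(w) for w in corpus.split())
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        # Rewrite every word with the chosen pair fused into one symbol.
        rewritten = Counter()
        for word, freq in words.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            rewritten[tuple(out)] += freq
        words = rewritten
    return merges

merges = bpe_train("low low low lower lowest newer newest", 4)
# First merges fuse the frequent "lo" and then "low" sequences.
```

Each learned merge becomes a vocabulary entry; applying the merges in order to new text reproduces the same segmentation.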
Vocabulary and Token Counts
Typical vocabularies run 32k–100k tokens. A rough rule of thumb for English:
1 token ≈ 4 characters ≈ 0.75 words
This relationship matters for context window constraints and for cost estimation when using API-based models.
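The rule of thumb translates directly into a back-of-envelope estimator. The 4-characters-per-token ratio is only a heuristic for typical English prose; real counts vary by tokenizer and text.

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate for English text, assuming ~4 characters per token."""
    return max(1, round(len(text) / 4))

estimate_tokens("Tokenization converts raw text into discrete units.")
```

For billing or context-window budgeting against a specific model, count with that model's actual tokenizer instead of this heuristic.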
Impact on Model Behavior
Tokenization affects capabilities in non-obvious ways:
- Arithmetic is hard partly because numbers tokenize inconsistently ("1234" may be one token or several depending on the vocabulary)
- Rare words and names from underrepresented languages split into many tokens, inflating effective sequence length and degrading performance
- Code tokenizers use whitespace-sensitive rules to preserve indentation semantics
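The number-splitting point can be demonstrated with a greedy longest-match segmenter over two hypothetical vocabularies (a simplification of how real subword tokenizers segment text):

```python
def greedy_tokenize(text, vocab):
    """Segment text by greedily taking the longest prefix found in the vocabulary,
    falling back to single characters."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])  # unknown character: emit it on its own
            i += 1
    return tokens

# Two made-up vocabularies segment the same number differently.
greedy_tokenize("1234", {"12", "34", "1234"})  # -> ["1234"]
greedy_tokenize("1234", {"123", "4"})          # -> ["123", "4"]
```

Because the model sees `["1234"]` and `["123", "4"]` as unrelated symbol sequences, digit-level regularities like carrying are harder to learn.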
Relationship to Architecture
Tokenization is fixed before [[pretraining-and-scaling|pretraining]] and does not change during the model's lifetime. The [[transformer-architecture]] then operates on token embedding vectors rather than raw text.
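The hand-off from tokenizer to model is just a table lookup: each token ID indexes a row of a learned embedding matrix. A minimal sketch, with made-up sizes and random weights standing in for learned parameters:

```python
import numpy as np

# Hypothetical sizes for illustration: a 32k vocabulary, 8-dimensional embeddings.
vocab_size, d_model = 32_000, 8
rng = np.random.default_rng(0)
embedding = rng.normal(size=(vocab_size, d_model))  # learned during pretraining

token_ids = [17, 4096, 250]   # produced by the (fixed) tokenizer
x = embedding[token_ids]      # shape (3, d_model): the transformer's actual input
```

From this point on, the architecture never sees text, only these vectors, which is why a mismatched tokenizer makes a checkpoint's embeddings meaningless.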