Tokenization

Tokenization converts raw text into discrete units — tokens — that a language model processes. The choice of tokenizer determines vocabulary size, affects model efficiency, and shapes what patterns are easy or hard for the model to learn.

Subword Tokenization

Modern language models use subword schemes rather than word- or character-level splits, balancing vocabulary size against sequence length:

  • Byte-Pair Encoding (BPE) — iteratively merges the most frequent adjacent pairs
  • WordPiece — similar to BPE, but selects the merge that most increases the likelihood of the training data under the tokenizer's language model
  • SentencePiece — treats input as a raw Unicode stream (whitespace included), enabling language-agnostic tokenisation without pre-splitting on spaces
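The BPE merge loop described above can be sketched in a few lines. This is a toy trainer, not any particular library's implementation: words are represented as tuples of symbols, and each iteration merges the most frequent adjacent pair across the corpus.

```python
from collections import Counter

def train_bpe(word_freqs, num_merges):
    """Toy BPE trainer. word_freqs maps a word to its corpus frequency;
    each word starts as a sequence of single characters."""
    vocab = {tuple(w): f for w, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        # Rewrite every word, replacing the pair with its merged symbol
        new_vocab = {}
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] = freq
        vocab = new_vocab
    return merges
```

On a corpus like {"low": 5, "lower": 2, "lowest": 2}, the first merges learned are ("l", "o") and then ("lo", "w"), illustrating how frequent substrings become single tokens.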

Vocabulary and Token Counts

Typical vocabularies run 32k–100k tokens. A rough rule of thumb for English:

1 token ≈ 4 characters ≈ 0.75 words

This relationship matters for context window constraints and for cost estimation when using API-based models.
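The rule of thumb translates directly into a back-of-envelope estimator. This is a heuristic sketch only; real token counts vary by tokenizer and language, and the pricing parameter here is an assumed input, not any provider's actual rate.

```python
def estimate_tokens(text: str) -> int:
    """Rough rule of thumb for English: ~4 characters per token."""
    return max(1, round(len(text) / 4))

def estimate_cost(text: str, usd_per_1k_tokens: float) -> float:
    """Back-of-envelope API cost; usd_per_1k_tokens is a hypothetical rate."""
    return estimate_tokens(text) / 1000 * usd_per_1k_tokens
```

For anything cost- or context-sensitive, count tokens with the actual tokenizer rather than the heuristic.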

Impact on Model Behaviour

Tokenization affects capabilities in non-obvious ways:

  • Arithmetic is hard partly because numbers tokenise inconsistently ("1234" may be one or several tokens depending on the vocabulary)
  • Rare words and names from underrepresented languages split into many tokens, increasing effective sequence length and degrading performance
  • Code tokenisers use whitespace-sensitive rules to preserve indentation semantics
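The number-splitting point can be made concrete with a greedy BPE segmenter applied under two hypothetical merge lists (these vocabularies are invented for illustration, not taken from any real model):

```python
def segment(word, merges):
    """Greedy BPE segmentation: apply learned merges in training order."""
    symbols = list(word)
    for a, b in merges:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols

# The same string tokenises differently under different vocabularies:
segment("1234", [("1", "2"), ("12", "3")])  # ['123', '4']
segment("1234", [("3", "4")])               # ['1', '2', '34']
```

A model trained on one segmentation sees "1234" as two tokens of unequal digit counts, which is one reason digit-level arithmetic patterns are hard to learn.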

Relationship to Architecture

Tokenization is fixed before [[pretraining-and-scaling|pretraining]] and does not change during the model's lifetime. The [[transformer-architecture]] then operates on token embedding vectors rather than raw text.
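The handoff between the fixed tokenizer and the model is just a table lookup: each token id indexes a row of a learned embedding matrix. A minimal sketch with toy sizes (the vocabulary and dimensions here are illustrative, not from any real model):

```python
# Toy embedding table: token id -> vector of dimension d_model
d_model = 4
vocab = ["<unk>", "Token", "ization", " matters"]
embedding = {i: [float(i)] * d_model for i in range(len(vocab))}

token_ids = [1, 2, 3]                  # output of the frozen tokenizer
x = [embedding[t] for t in token_ids]  # what the transformer actually consumes
```

In a real model the rows are learned parameters updated during pretraining; only the mapping from text to ids stays fixed.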