Pretraining and Scaling Laws

Large language models acquire their capabilities primarily through pretraining: self-supervised learning on large corpora of text. The [[transformer-architecture]] processes text as sequences of discrete units produced by [[tokenization]]. Understanding pretraining is essential to understanding what LLMs know, where they fail, and how fine-tuning and alignment work.

The Pretraining Objective

The standard objective is next-token prediction (causal language modelling):

$$ \mathcal{L}(\theta) = -\sum_{t=1}^{T} \log P_\theta(x_t \mid x_1, \ldots, x_{t-1}) $$

The model is trained to assign high probability to each token given its left context. This deceptively simple objective, applied at scale, produces representations that encode syntax, semantics, factual knowledge, and reasoning patterns.
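The objective above can be sketched numerically. The snippet below is a toy illustration, not a real model: `toy_model` is a hypothetical stand-in that returns a fixed probability, where a real network would condition on the left context.

```python
import math

def toy_model(context, token):
    # Hypothetical stand-in: uniform over a 4-token vocabulary.
    # A real model would return P(token | context).
    return 0.25

def causal_lm_loss(tokens, model):
    """Sum of negative log-probabilities of each token given its left context."""
    loss = 0.0
    for t in range(len(tokens)):
        p = model(tokens[:t], tokens[t])
        loss -= math.log(p)
    return loss

tokens = [3, 1, 2, 0]
print(causal_lm_loss(tokens, toy_model))  # 4 * log(4) ≈ 5.545
```

The loop mirrors the sum over $t$ in the equation: each term penalises the model for assigning low probability to the observed next token.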

Data Curation

The quality and composition of pretraining data are among the most consequential decisions in LLM development:

  • Scale: state-of-the-art models train on trillions of tokens from web crawls, books, code, and curated sources
  • Quality filtering: heuristic filters remove low-quality, duplicated, and harmful content
  • Domain mixing: the ratio of code, scientific text, and general web content affects downstream capabilities
  • Deduplication: near-duplicate removal improves generalisation and reduces memorisation
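As a minimal sketch of the near-duplicate removal mentioned above, the snippet below compares documents by Jaccard similarity over character shingles. The threshold and shingle size are illustrative choices; production pipelines use MinHash/LSH rather than this O(n²) pairwise scan.

```python
def shingles(text, k=5):
    """Set of character k-grams used for near-duplicate comparison."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity between the shingle sets of two documents."""
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

def dedup(docs, threshold=0.8):
    """Keep each document only if it is not a near-duplicate of an
    already-kept one. Quadratic; real pipelines scale with MinHash/LSH."""
    kept = []
    for d in docs:
        if all(jaccard(d, k) < threshold for k in kept):
            kept.append(d)
    return kept

docs = [
    "the quick brown fox jumps",
    "the quick brown fox jumped",   # near-duplicate of the first
    "completely different text here",
]
print(len(dedup(docs)))  # 2
```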

Scaling Laws

Kaplan et al. (2020) showed that loss follows a power law in model size $N$, dataset size $D$, and compute $C$:

$$ L \propto N^{-\alpha} \qquad L \propto D^{-\beta} \qquad L \propto C^{-\gamma} $$

Chinchilla (Hoffmann et al., 2022) refined this: for a fixed compute budget, models should be trained with roughly 20 tokens per parameter. Earlier models (GPT-3, Gopher) were significantly undertrained relative to their size. The compute-optimal frontier satisfies:

$$ N^* \propto C^{0.5} \qquad D^* \propto C^{0.5} $$
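Combining the ~20 tokens-per-parameter rule with the standard approximation $C \approx 6ND$ gives a closed-form allocation; the function below is a back-of-the-envelope sketch under those two assumptions, not a reproduction of the paper's fitted scaling law.

```python
def chinchilla_optimal(compute_flops, tokens_per_param=20.0):
    """Compute-optimal parameter count N and token count D for a FLOP
    budget C, assuming C ≈ 6*N*D and D = tokens_per_param * N.
    Then C = 6 * tokens_per_param * N**2, so N = sqrt(C / (6 * r))."""
    n = (compute_flops / (6.0 * tokens_per_param)) ** 0.5
    d = tokens_per_param * n
    return n, d

# Roughly Chinchilla's budget (~5.76e23 FLOPs) recovers its published
# configuration: ~70B parameters trained on ~1.4T tokens.
n, d = chinchilla_optimal(5.76e23)
print(f"N ≈ {n / 1e9:.0f}B params, D ≈ {d / 1e12:.1f}T tokens")
```

Note that both $N^*$ and $D^*$ scale as $C^{0.5}$ here, consistent with the exponents above: doubling compute should roughly multiply both model size and data size by $\sqrt{2}$.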

Emergent Capabilities

Some capabilities are absent in smaller models and appear only above a scale threshold:

  • Multi-step arithmetic and logical reasoning
  • Few-shot in-context learning from prompt examples
  • Code generation and debugging
  • Language translation without explicit supervision

The interpretation of emergence is contested: some researchers argue these are genuine phase transitions; others argue they reflect the choice of metric and would appear smooth under a finer-grained measure.

Infrastructure Considerations

Practical pretraining involves:

  • Mixed-precision training (BF16 or FP16 with FP32 master weights)
  • Gradient checkpointing to reduce activation memory
  • Tensor, pipeline, and data parallelism across thousands of accelerators
  • Cosine decay learning rate schedules with linear warmup
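The warmup-plus-cosine schedule from the last bullet can be sketched as a pure function of the step count. All hyperparameter values below are illustrative defaults, not taken from any particular model.

```python
import math

def lr_schedule(step, max_lr=3e-4, min_lr=3e-5,
                warmup_steps=2000, total_steps=100_000):
    """Linear warmup to max_lr, then cosine decay down to min_lr.
    Hyperparameters are illustrative, not from any specific run."""
    if step < warmup_steps:
        # Linear warmup: ramp from ~0 up to max_lr.
        return max_lr * (step + 1) / warmup_steps
    # Cosine decay over the remaining steps, clamped at min_lr.
    progress = min((step - warmup_steps) / (total_steps - warmup_steps), 1.0)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))

print(lr_schedule(1999))     # end of warmup: max_lr
print(lr_schedule(100_000))  # fully decayed: min_lr
```

The warmup phase avoids large, noisy updates while optimiser statistics are still uninitialised; the cosine tail anneals the step size smoothly toward the floor.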

The Chinchilla result significantly changed how the field allocates compute budgets, favouring training smaller models on more data over training the largest possible model.