The Transformer Architecture

The transformer is a neural architecture introduced in "Attention Is All You Need" (Vaswani et al., 2017) that has become the dominant model family across NLP, computer vision, and multimodal learning. Unlike earlier sequence models, the transformer relies entirely on attention mechanisms, eliminating recurrence and convolution.

Architecture Overview

The original transformer is an encoder-decoder model:

  • The encoder maps an input sequence to a sequence of continuous representations
  • The decoder autoregressively generates an output sequence, attending to both its own prior outputs and the encoder representations

Modern language models typically use only the decoder component, trained to predict the next token given all previous tokens (causal language modelling).
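Causal language modelling is enforced with an attention mask that lets each position see only itself and earlier positions. A minimal sketch in NumPy (the function name and toy size are illustrative, not from any particular library):

```python
import numpy as np

def causal_mask(n):
    # True where attention is allowed: position i may attend to j <= i.
    # Lower-triangular structure is what makes generation autoregressive.
    return np.tril(np.ones((n, n), dtype=bool))

mask = causal_mask(4)
# Row 0 sees only position 0; row 3 sees positions 0..3.
```

In practice the mask is applied by setting disallowed attention scores to a large negative value before the softmax, so they receive (near-)zero weight.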

Token Embeddings and Positional Encoding

Tokens are mapped to dense vectors via a learned embedding matrix $E \in \mathbb{R}^{|V| \times d}$ where $|V|$ is the vocabulary size and $d$ is the model dimension.
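The embedding lookup is just row indexing into $E$. A toy sketch, assuming a vocabulary of 100 tokens and $d = 8$ (random values stand in for learned weights):

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 100, 8                  # toy vocabulary size and model dimension
E = rng.normal(size=(V, d))    # learned during training; random here

token_ids = np.array([3, 41, 7])
x = E[token_ids]               # lookup = selecting rows of E
# x has shape (3, 8): one d-dimensional vector per input token
```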

Since attention is permutation-invariant, positional information must be injected explicitly. The original paper uses fixed sinusoidal encodings:

$$ PE_{(pos,\,2i)} = \sin\left(\frac{pos}{10000^{2i/d}}\right) \qquad PE_{(pos,\,2i+1)} = \cos\left(\frac{pos}{10000^{2i/d}}\right) $$
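The sinusoidal encodings above can be sketched directly from the formulas (function name and sizes are illustrative):

```python
import numpy as np

def sinusoidal_pe(n, d):
    # pos: column vector of positions; i: even dimension indices 0, 2, ..., d-2.
    pos = np.arange(n)[:, None]          # (n, 1)
    i = np.arange(0, d, 2)[None, :]      # (1, d/2)
    angles = pos / (10000 ** (i / d))    # (n, d/2), broadcast over pos and i
    pe = np.zeros((n, d))
    pe[:, 0::2] = np.sin(angles)         # even dimensions get sine
    pe[:, 1::2] = np.cos(angles)         # odd dimensions get cosine
    return pe

pe = sinusoidal_pe(64, 16)
```

At position 0 every sine dimension is 0 and every cosine dimension is 1; each dimension pair oscillates at a different wavelength, giving every position a distinct pattern.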

Modern models typically learn positional embeddings or use relative encodings such as RoPE or ALiBi.

The Transformer Block

Each layer consists of two sublayers, each wrapped in a residual connection and combined with layer normalisation (the original paper normalises after each sublayer, "post-LN"; many modern models normalise before it, "pre-LN", which tends to train more stably):

  1. Multi-head self-attention — allows each position to attend to all others
  2. Position-wise feed-forward network — applies a shared two-layer MLP to each position

$$ \text{FFN}(x) = \max(0,\, xW_1 + b_1)\,W_2 + b_2 $$

Residual connections are critical: they allow gradients to flow directly through the depth of the network and enable stable training of very deep models.
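Putting the pieces together, a single block can be sketched in NumPy. This is a simplified sketch under stated assumptions: single-head attention (multi-head splits $d$ into several slices), post-LN as in the original paper, no mask, and random weights standing in for learned parameters:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # subtract max for stability
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    # Normalise each position's vector to zero mean, unit variance.
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def self_attention(x, Wq, Wk, Wv):
    # Single head for brevity; scores are scaled by sqrt(d_k).
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def ffn(x, W1, b1, W2, b2):
    # Position-wise: the same two-layer MLP applied at every position.
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

def transformer_block(x, p):
    # Post-LN residual wiring: LayerNorm(x + Sublayer(x)).
    x = layer_norm(x + self_attention(x, p["Wq"], p["Wk"], p["Wv"]))
    x = layer_norm(x + ffn(x, p["W1"], p["b1"], p["W2"], p["b2"]))
    return x

rng = np.random.default_rng(0)
n, d, d_ff = 5, 16, 64
shapes = {"Wq": (d, d), "Wk": (d, d), "Wv": (d, d),
          "W1": (d, d_ff), "b1": (d_ff,), "W2": (d_ff, d), "b2": (d,)}
p = {k: rng.normal(scale=0.1, size=s) for k, s in shapes.items()}
y = transformer_block(rng.normal(size=(n, d)), p)
```

Stacking many such blocks, with the residual stream carrying information straight through, is the whole depth story of the architecture.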

Computational Complexity

Self-attention costs $O(n^2 d)$ time in sequence length $n$, and the naive implementation materialises an $O(n^2)$ attention matrix. The feed-forward sublayer costs $O(n d^2)$. For typical model sizes, the FFN dominates compute at short sequences; attention dominates at long ones.
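The crossover is easy to see with rough per-layer operation counts. A sketch assuming the common choice $d_{ff} = 4d$ and ignoring constant factors:

```python
def attn_cost(n, d):
    # Score matrix: n^2 inner products of length d (constants omitted).
    return n * n * d

def ffn_cost(n, d):
    # Two matmuls per position through a hidden width of 4d (constants omitted).
    return n * d * (4 * d)

d = 1024
# Under these counts the two terms balance at n = 4d; the FFN dominates
# below that, attention above it.
assert ffn_cost(512, d) > attn_cost(512, d)      # short sequence: FFN wins
assert attn_cost(16384, d) > ffn_cost(16384, d)  # long sequence: attention wins
```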

Variants

  • Encoder-only (BERT): bidirectional attention; suited to classification and extraction
  • Decoder-only (GPT, Claude): causal attention; suited to generation
  • Encoder-decoder (T5, BART): the full architecture; suited to sequence-to-sequence tasks
  • Mixture of Experts (MoE): replaces the dense FFN with sparse routing across many expert FFNs
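The MoE idea can be sketched with top-1 routing: a small router scores each token, and only the winning expert's FFN runs for that token. This is a toy sketch under stated assumptions (top-1 rather than the top-k routing and load balancing used in practice, ReLU experts without biases, random weights):

```python
import numpy as np

def moe_ffn(x, router_W, experts):
    # Route each token to the expert with the highest router score.
    scores = x @ router_W            # (n, num_experts)
    choice = scores.argmax(-1)       # (n,) index of the chosen expert
    out = np.empty_like(x)
    for e, (W1, W2) in enumerate(experts):
        sel = choice == e
        if sel.any():
            # Only the selected tokens pay this expert's compute.
            out[sel] = np.maximum(0, x[sel] @ W1) @ W2
    return out, choice

rng = np.random.default_rng(0)
n, d, d_ff, n_experts = 8, 16, 32, 4
experts = [(rng.normal(size=(d, d_ff)), rng.normal(size=(d_ff, d)))
           for _ in range(n_experts)]
y, choice = moe_ffn(rng.normal(size=(n, d)), rng.normal(size=(d, n_experts)), experts)
```

The point of the sparsity: total parameters grow with the number of experts, but per-token compute stays close to that of a single dense FFN.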