Entropy and Information

Shannon entropy quantifies the average uncertainty in a random variable. It is the foundational measure of information theory and appears throughout machine learning in loss functions, compression bounds, and model evaluation.

Definition

For a discrete random variable $X$ with probability mass function $p$:

$$ H(X) = -\sum_{x \in \mathcal{X}} p(x)\log_2 p(x) $$

Entropy is measured in bits when the logarithm is base 2, and in nats when base $e$. By convention, $0 \log 0 = 0$.
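The definition translates directly into code. The sketch below (plain Python, function name `entropy` is my own) applies the $0 \log 0 = 0$ convention by skipping zero-probability outcomes:

```python
import math

def entropy(probs, base=2.0):
    """Shannon entropy of a discrete distribution given as a list of probabilities.

    Skipping p == 0 implements the convention 0 * log 0 = 0.
    Base 2 gives bits; pass base=math.e for nats.
    """
    return sum(-p * math.log(p, base) for p in probs if p > 0)

print(entropy([0.5, 0.5]))                  # fair coin: 1.0 bit
print(entropy([1.0, 0.0]))                  # certain outcome: 0.0 bits
print(entropy([0.25, 0.25, 0.25, 0.25]))    # uniform over 4 outcomes: 2.0 bits
```

Note the uniform distribution maximises entropy for a fixed number of outcomes, which is why the fair coin attains the full 1 bit.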

Cross-Entropy and KL Divergence

Cross-entropy between a true distribution $p$ and a model distribution $q$ is:

$$ H(p, q) = -\sum_x p(x)\log q(x) $$

The gap between cross-entropy and entropy is the KL divergence: $D_{\mathrm{KL}}(p \| q) = H(p,q) - H(p) \geq 0$. Since $H(p)$ does not depend on the model $q$, minimising cross-entropy loss over $q$ is equivalent to minimising the KL divergence from the true distribution.
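The identity $D_{\mathrm{KL}}(p \| q) = H(p,q) - H(p)$ can be checked numerically. A small sketch in plain Python (the two distributions are illustrative; both quantities are computed in nats):

```python
import math

def cross_entropy(p, q):
    """H(p, q) in nats; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(-pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """D_KL(p || q) in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.7, 0.2, 0.1]   # "true" distribution (made up for illustration)
q = [0.5, 0.3, 0.2]   # model distribution

h_p  = cross_entropy(p, p)   # H(p, p) = H(p)
h_pq = cross_entropy(p, q)
kl   = kl_divergence(p, q)

# The gap between cross-entropy and entropy is exactly the KL divergence,
# and it is non-negative (Gibbs' inequality).
assert abs(kl - (h_pq - h_p)) < 1e-12
assert kl >= 0
```

Note that cross-entropy of $p$ with itself recovers the entropy, which is why the KL divergence vanishes exactly when $q = p$.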

Entropy of Common Distributions

| Distribution | Parameters           | Entropy (nats)                      |
|--------------|----------------------|-------------------------------------|
| Bernoulli    | $p$                  | $-p\ln p - (1-p)\ln(1-p)$           |
| Categorical  | $k$ classes, uniform | $\ln k$                             |
| Gaussian     | $\mu, \sigma^2$      | $\tfrac{1}{2}\ln(2\pi e\sigma^2)$   |
| Exponential  | $\lambda$            | $1 - \ln\lambda$                    |
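These closed forms can be sanity-checked against direct computation. The sketch below (plain Python; all parameter values are arbitrary choices) verifies the Bernoulli and uniform-categorical rows by summing $-p\ln p$ directly, and the Gaussian row by a crude Riemann sum over the density:

```python
import math

def discrete_entropy_nats(probs):
    """-sum p ln p over nonzero outcomes, in nats."""
    return sum(-p * math.log(p) for p in probs if p > 0)

# Bernoulli(p): closed form -p ln p - (1-p) ln(1-p)
p = 0.3
bern_closed = -p * math.log(p) - (1 - p) * math.log(1 - p)
assert abs(discrete_entropy_nats([p, 1 - p]) - bern_closed) < 1e-12

# Uniform categorical over k classes: closed form ln k
k = 8
assert abs(discrete_entropy_nats([1 / k] * k) - math.log(k)) < 1e-12

# Gaussian(mu, sigma^2): differential entropy 0.5 ln(2 pi e sigma^2),
# checked by numerically integrating -f(x) ln f(x) over a wide grid.
mu, sigma = 0.0, 1.5
def f(x):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

dx = 1e-3
grid = (mu - 10 + i * dx for i in range(int(20 / dx)))
gauss_numeric = sum(-f(x) * math.log(f(x)) * dx for x in grid)
gauss_closed = 0.5 * math.log(2 * math.pi * math.e * sigma ** 2)
assert abs(gauss_numeric - gauss_closed) < 1e-3
```

The Gaussian check illustrates that differential entropy is the continuous analogue of the discrete sum, with the integral replacing the summation.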