# Entropy and Information
Shannon entropy quantifies the average uncertainty in a random variable. It is the foundational measure of information theory and appears throughout machine learning in loss functions, compression bounds, and model evaluation.
## Definition
For a discrete random variable $X$ with probability mass function $p$:
$$ H(X) = -\sum_{x \in \mathcal{X}} p(x)\log_2 p(x) $$
Entropy is measured in bits when the logarithm is base 2, and in nats when base $e$. By convention, $0 \log 0 = 0$.
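The definition above can be sketched directly in code. This is a minimal illustration, not a library implementation; the function name `entropy` and the list-of-probabilities input format are choices made here for clarity.

```python
import math

def entropy(pmf, base=2):
    # Shannon entropy of a discrete distribution given as a list of
    # probabilities. By the convention 0 log 0 = 0, zero entries are skipped.
    return -sum(p * math.log(p, base) for p in pmf if p > 0)

# A fair coin carries exactly 1 bit of uncertainty.
print(entropy([0.5, 0.5]))       # 1.0
# A deterministic outcome carries none.
print(entropy([1.0, 0.0]))       # 0.0
# A uniform 4-way choice carries 2 bits.
print(entropy([0.25] * 4))       # 2.0
```

Passing `base=math.e` would return the result in nats instead of bits.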
## Cross-Entropy and KL Divergence
Cross-entropy between a true distribution $p$ and a model distribution $q$ is:
$$ H(p, q) = -\sum_x p(x)\log q(x) $$
The gap between cross-entropy and entropy is the KL divergence: $D_{\mathrm{KL}}(p \| q) = H(p,q) - H(p) \geq 0$, with equality if and only if $p = q$. Since $H(p)$ does not depend on the model, minimising cross-entropy loss over $q$ is equivalent to minimising the KL divergence from the true distribution.
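The decomposition $H(p,q) = H(p) + D_{\mathrm{KL}}(p \| q)$ can be checked numerically. A minimal sketch (the distributions `p` and `q` are illustrative numbers, and everything is in nats for simplicity):

```python
import math

def cross_entropy(p, q):
    # H(p, q) = -sum_x p(x) log q(x), in nats.
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def entropy(p):
    # H(p) is the cross-entropy of p with itself.
    return cross_entropy(p, p)

def kl_divergence(p, q):
    # D_KL(p || q) = H(p, q) - H(p) >= 0, with equality iff p == q.
    return cross_entropy(p, q) - entropy(p)

p = [0.7, 0.2, 0.1]   # "true" distribution (illustrative)
q = [0.6, 0.3, 0.1]   # model distribution (illustrative)

print(kl_divergence(p, q))  # small positive number
print(kl_divergence(p, p))  # zero up to rounding
```

The same quantity computed directly as $\sum_x p(x)\ln\frac{p(x)}{q(x)}$ agrees with the gap, confirming the identity.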
## Entropy of Common Distributions
| Distribution | Parameters | Entropy (nats) |
|---|---|---|
| Bernoulli | $p$ | $-p\ln p - (1-p)\ln(1-p)$ |
| Categorical | $k$ classes, uniform | $\ln k$ |
| Gaussian | $\mu, \sigma^2$ | $\tfrac{1}{2}\ln(2\pi e\sigma^2)$ |
| Exponential | $\lambda$ | $1 - \ln\lambda$ |
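The discrete rows of the table can be verified against the definition directly (the Gaussian and exponential entries are *differential* entropies of continuous distributions and are not covered by this discrete check). A small sketch, with the parameter values `p = 0.3` and `k = 8` chosen arbitrarily for illustration:

```python
import math

def entropy_nats(pmf):
    # Discrete Shannon entropy in nats; zero entries skipped by convention.
    return -sum(p * math.log(p) for p in pmf if p > 0)

# Bernoulli row: H = -p ln p - (1 - p) ln(1 - p).
p = 0.3
bernoulli = entropy_nats([p, 1 - p])
closed_form = -p * math.log(p) - (1 - p) * math.log(1 - p)
print(bernoulli, closed_form)

# Uniform categorical row: H = ln k.
k = 8
print(entropy_nats([1 / k] * k), math.log(k))
```

Note that differential entropy, unlike discrete entropy, can be negative: the Gaussian entry $\tfrac{1}{2}\ln(2\pi e\sigma^2)$ drops below zero once $\sigma^2 < \frac{1}{2\pi e}$.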