Softmax Function

The softmax function maps a vector of real numbers to a probability distribution. It is ubiquitous in machine learning as the final activation of classifiers and as the normalisation step within the [[attention-mechanism|attention mechanism]].

Definition

For a vector $\mathbf{z} \in \mathbb{R}^K$, softmax is defined element-wise as:

$$ \sigma(\mathbf{z})_k = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}} $$

The output satisfies $\sigma(\mathbf{z})_k > 0$ for all $k$ and $\sum_k \sigma(\mathbf{z})_k = 1$, making it a valid probability distribution.
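As a quick numerical check, the definition can be evaluated directly for a small vector (a sketch using NumPy; the values shown follow from the formula above):

```python
import numpy as np

z = np.array([1.0, 2.0, 3.0])
p = np.exp(z) / np.exp(z).sum()   # direct softmax, fine for small inputs
# p is approximately [0.090, 0.245, 0.665]: every entry positive,
# ordered like z, and summing to 1
print(p, p.sum())
```

Note that softmax preserves the ordering of the inputs: the largest logit always receives the largest probability.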

Temperature Scaling

A temperature parameter $\tau > 0$ controls the sharpness of the distribution:

$$ \sigma(\mathbf{z} / \tau)_k = \frac{e^{z_k/\tau}}{\sum_{j} e^{z_j/\tau}} $$

As $\tau \to 0^+$ the distribution concentrates all mass on the argmax; as $\tau \to \infty$ it approaches the uniform distribution over the $K$ classes. Temperature scaling is used during language model sampling to control output diversity: low temperatures make generation more deterministic, high temperatures more varied.
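The effect of temperature can be seen directly (a minimal sketch; the helper name `softmax_t` and the example logits are illustrative):

```python
import numpy as np

def softmax_t(z, tau):
    """Softmax with temperature tau > 0 (illustrative helper)."""
    z = np.asarray(z, dtype=float) / tau
    z = z - z.max()                    # shift for numerical stability
    return np.exp(z) / np.exp(z).sum()

logits = [2.0, 1.0, 0.1]
p_sharp = softmax_t(logits, tau=0.1)   # nearly all mass on the largest logit
p_flat = softmax_t(logits, tau=100.0)  # close to uniform [1/3, 1/3, 1/3]
print(p_sharp)
print(p_flat)
```

Dividing the logits by a small $\tau$ magnifies the gaps between them before exponentiation, which is why the distribution sharpens.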

Numerical Stability

Computing softmax directly can overflow for large inputs. The standard fix subtracts the maximum before exponentiating; the result is mathematically identical because the common factor $e^{-\max(\mathbf{z})}$ cancels between numerator and denominator:

$$ \sigma(\mathbf{z})_k = \frac{e^{z_k - \max(\mathbf{z})}}{\sum_j e^{z_j - \max(\mathbf{z})}} $$

import numpy as np

def softmax(z, temperature=1.0):
    z = np.asarray(z, dtype=float) / temperature
    z -= z.max()              # numerical stability
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()
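For instance, logits in the thousands would overflow `np.exp` directly, but the max-subtraction trick keeps every exponent at or below zero (the function is repeated here so the snippet runs standalone; the example logits are illustrative):

```python
import numpy as np

def softmax(z, temperature=1.0):
    z = np.asarray(z, dtype=float) / temperature
    z -= z.max()              # numerical stability
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

z = np.array([1000.0, 1001.0, 1002.0])
# np.exp(z) on its own would overflow to inf; after subtracting max(z)
# the exponents are [-2, -1, 0], so the result is finite
p = softmax(z)
print(p)  # identical to softmax([0.0, 1.0, 2.0]), by shift invariance
```

The same shift invariance also explains why softmax depends only on the differences between logits, not their absolute values.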