# Gradient Descent
Gradient descent is a first-order iterative optimisation algorithm for finding a local minimum of a differentiable function. It is the core update rule behind training nearly all deep learning models.
## Update Rule
Given a loss $\mathcal{L}(\theta)$ and learning rate $\eta$, parameters are updated as:
$$ \theta_{t+1} = \theta_t - \eta\,\nabla_\theta\,\mathcal{L}(\theta_t) $$
The gradient $\nabla_\theta\,\mathcal{L}$ points in the direction of steepest ascent, so subtracting it moves toward lower loss.
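The update rule can be sketched on a toy one-dimensional loss. The quadratic $\mathcal{L}(\theta) = (\theta - 3)^2$ and all names below are illustrative assumptions, not part of the article:

```python
def grad(theta):
    # Gradient of the toy loss L(theta) = (theta - 3)^2
    return 2.0 * (theta - 3.0)

theta = 0.0   # initial parameter
eta = 0.1     # learning rate

for _ in range(100):
    # theta_{t+1} = theta_t - eta * grad(theta_t)
    theta = theta - eta * grad(theta)

# theta approaches the minimiser 3.0
```

Each step shrinks the distance to the minimiser by a constant factor $(1 - 2\eta)$, which is why the iterates converge geometrically on this loss.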
## Variants
| Variant | Batch size | Characteristic |
|---|---|---|
| Batch GD | Full dataset | Stable, slow per step |
| Stochastic GD | 1 sample | Noisy, fast; noise can help escape shallow local minima |
| Mini-batch GD | $m$ samples | Practical balance; standard in practice |
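One epoch of mini-batch gradient descent can be sketched as follows; the function name, `grad_fn` callback, and default batch size are illustrative assumptions:

```python
import random

def minibatch_epoch(theta, data, grad_fn, lr=0.01, batch_size=4):
    """Shuffle the data, then update theta once per mini-batch.

    grad_fn(theta, x) returns the per-sample gradient for sample x.
    """
    data = data[:]          # copy so the caller's list is untouched
    random.shuffle(data)
    for i in range(0, len(data), batch_size):
        batch = data[i:i + batch_size]
        # Average the per-sample gradients over the mini-batch
        g = sum(grad_fn(theta, x) for x in batch) / len(batch)
        theta = theta - lr * g
    return theta
```

Setting `batch_size=1` recovers stochastic GD and `batch_size=len(data)` recovers batch GD, so the mini-batch loop subsumes both rows of the table.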
## With Momentum
Momentum accumulates a velocity vector from past gradients, damping oscillations:
$$ v_{t+1} = \beta v_t + (1-\beta)\,\nabla_\theta\,\mathcal{L}(\theta_t) \qquad \theta_{t+1} = \theta_t - \eta\,v_{t+1} $$
## Implementation
```python
def sgd(params, grads, lr=0.01, momentum=0.9, velocity=None):
    """One step of SGD with momentum over parallel lists of parameters
    and gradients; returns the updated parameters and velocities."""
    if velocity is None:
        velocity = [0.0] * len(params)
    new_params, new_velocity = [], []
    for p, g, v in zip(params, grads, velocity):
        # v_{t+1} = beta * v_t + (1 - beta) * g, as in the update above
        v_new = momentum * v + (1 - momentum) * g
        new_params.append(p - lr * v_new)
        new_velocity.append(v_new)
    return new_params, new_velocity
```
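As a self-contained check of the momentum recurrence, the loop below applies it directly to the same kind of toy quadratic used earlier; `grad`, `eta`, and `beta` are illustrative names, not from the article:

```python
def grad(theta):
    # Gradient of the toy loss L(theta) = (theta - 3)^2
    return 2.0 * (theta - 3.0)

theta, v = 0.0, 0.0
eta, beta = 0.1, 0.9   # learning rate and momentum coefficient

for _ in range(300):
    v = beta * v + (1 - beta) * grad(theta)   # velocity accumulation
    theta = theta - eta * v                   # parameter step

# theta spirals in toward the minimiser 3.0
```

On this loss the iterates overshoot and oscillate slightly before settling, which is the characteristic momentum behaviour the velocity term trades for faster progress along consistent gradient directions.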